Data connectors

Data connectors manage the libraries and configurations that are required for hosts to connect to various data sources. A data connector contains the type, the URI, the authentication method, and all of the required libraries to access the data source.

Data connectors simplify data source administration and configuration, and separate credential management and data usage from applications. You can configure multiple data connectors for a Spark instance group and switch between data sources on demand from the cluster management console. When data connectors are configured and deployed with your Spark instance group, you can use them to connect to data sources when you create notebook services and submit Spark batch applications.
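As a mental model only (not the IBM Spectrum Conductor API), a data connector can be pictured as a record that bundles the fields described above: the type, the Access URI, the authentication method, and the required libraries. A minimal Python sketch, with hypothetical names throughout:

```python
from dataclasses import dataclass, field

@dataclass
class DataConnector:
    """Illustrative model of a data connector definition.
    The class and field names are hypothetical, not product API."""
    name: str
    type: str                      # e.g. "HDFS", "IBM Cloud Object Storage"
    access_uri: str                # e.g. "hdfs://namenode.example.com:8020"
    auth_method: str = "standard"  # e.g. "standard" or "kerberos"
    libraries: list = field(default_factory=list)

# A hypothetical HDFS connector definition:
hdfs = DataConnector(
    name="myhdfs",
    type="HDFS",
    access_uri="hdfs://namenode.example.com:8020",
    libraries=["hadoop-client.jar"],
)
print(hdfs.access_uri)
```

Switching data sources from the cluster management console then amounts to pointing the Spark instance group at a different connector definition of this shape.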

Five built-in data connector types are provided with IBM® Spectrum Conductor. The following table describes each data connector type and its configuration requirements.

Note: The following notes and limitations apply when data connectors are configured for a Spark instance group:
  • By design, data connectors for the standard authentication HDFS file system cannot be used in the same application as a Kerberos-secured HDFS file system, because the two authentication configurations conflict.
  • All data is shared among users in a Spark instance group when using data connectors with notebooks.
  • If a Spark instance group is configured to use a data connector for the fs.defaultFS parameter, using a Hive context in the Zeppelin notebook is not supported.

For more information about adding data connectors to a Spark instance group, see Adding data connectors.

Table 1. Data connector configuration requirements
Data connector type Configuration requirements
IBM Cloud Object Storage
  • S3 Endpoint URL: The endpoint URL used for the S3 connection.
  • Bucket: The S3 connection bucket name.
  • Set the fs.s3d.data_connector_name.access.key and fs.s3d.data_connector_name.secret.key properties within your application code, where data_connector_name is the name of your data connector. See Adding data connectors.
IBM Spectrum Scale (HDFS Transparency)
  • Access URI: The Access URI must be based on the NameNode address, which is used as the fs.defaultFS parameter in the Hadoop configuration file. If a port is not specified, the default port 8020 is used.
HDFS
  • Access URI: The Access URI must be based on the NameNode address, which is used as the fs.defaultFS parameter in the Hadoop configuration file. If a port is not specified, the default port 8020 is used.
Kerberos secured HDFS
  • Access URI: The Access URI must be based on the NameNode address, which is used as the fs.defaultFS parameter in the Hadoop configuration file. If a port is not specified, the default port 8020 is used.
  • Data Access Principal: The Kerberos principal that is used to access HDFS on behalf of the Spark instance group.
  • Keytab File Path: The path to the keytab file to access HDFS with Kerberos enabled.
  • Principal Pattern: The pattern used to determine if the principal is valid.
Kerberos TGT secured HDFS
  • Access URI: The Access URI must be based on the NameNode address, which is used as the fs.defaultFS parameter in the Hadoop configuration file. If a port is not specified, the default port 8020 is used.
  • Principal Pattern: The pattern used to determine if the principal is valid.
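All of the HDFS-based connector types in the table apply the same default-port rule: if the Access URI omits a port, port 8020 is assumed. That rule can be sketched as follows; the helper function is illustrative, not part of the product:

```python
from urllib.parse import urlparse

DEFAULT_HDFS_PORT = 8020  # assumed when the Access URI specifies no port

def normalize_access_uri(uri: str) -> str:
    """Append the default port to an hdfs:// Access URI that lacks one.
    Illustrative helper, not an IBM Spectrum Conductor function."""
    parsed = urlparse(uri)
    if parsed.port is None:
        return f"{parsed.scheme}://{parsed.hostname}:{DEFAULT_HDFS_PORT}"
    return uri

print(normalize_access_uri("hdfs://namenode.example.com"))
# hdfs://namenode.example.com:8020
```

A URI that already carries an explicit port, such as hdfs://namenode.example.com:9000, is left unchanged.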