IBM Support

Can I connect from Spark & Python runtime to Cloud Object Storage by using AWS S3 protocol?

Question & Answer


Question

Can I connect from Spark & Python runtime to Cloud Object Storage by using AWS S3 protocol?

Answer

You can connect from the Spark & Python runtime to Cloud Object Storage by using the S3 protocol with these steps.
1. Download the JAR files that are needed to connect to Cloud Object Storage by using the S3 protocol from the Maven Repository, and move them to the CPD cluster. Change the versions if necessary.
  • aws-java-sdk-1.12.393.jar
  • aws-java-sdk-core-1.12.393.jar
  • aws-java-sdk-s3-1.12.393.jar
  • hadoop-annotations-2.8.5.jar
  • hadoop-auth-2.8.5.jar
  • hadoop-aws-2.8.5.jar
  • hadoop-client-2.8.5.jar
  • hadoop-common-2.8.5.jar
  • hadoop-hdfs-2.8.5.jar
  • htrace-core4-4.0.1-incubating.jar
  • httpclient-4.5.jar
2. Locate an ibm-nginx pod.
# ibm_nginx_pod=$(oc get pods | grep ibm-nginx | head -1 | cut -f1 -d\ )
# echo $ibm_nginx_pod
3. Create a directory called dbdrivers inside the pod in the "/user-home/_global_" directory, if it does not already exist.
# oc exec ${ibm_nginx_pod} -- mkdir -p "/user-home/_global_/dbdrivers"
4. Copy the JAR files into the dbdrivers directory.
# oc cp <jar-name.jar> ${ibm_nginx_pod}:/user-home/_global_/dbdrivers/
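Steps 1 to 4 repeat the same download and copy for each of the eleven JAR files, so a small script can generate the URLs and commands. The following is a minimal sketch, not an official procedure. It assumes the standard Maven Central layout under repo1.maven.org, and the group IDs are the usual coordinates for these artifacts rather than values stated in this document; the function names and the pod placeholder are hypothetical. Check each URL before you rely on it.

```python
# Sketch: derive download URLs and `oc cp` commands for the JAR files in step 1.
# Assumption: standard Maven Central layout (group path/artifact/version/file).
MAVEN_BASE = "https://repo1.maven.org/maven2"

# (group id, artifact id, version) for each JAR listed in step 1.
# The group ids are assumptions; the article lists only the file names.
JARS = [
    ("com.amazonaws", "aws-java-sdk", "1.12.393"),
    ("com.amazonaws", "aws-java-sdk-core", "1.12.393"),
    ("com.amazonaws", "aws-java-sdk-s3", "1.12.393"),
    ("org.apache.hadoop", "hadoop-annotations", "2.8.5"),
    ("org.apache.hadoop", "hadoop-auth", "2.8.5"),
    ("org.apache.hadoop", "hadoop-aws", "2.8.5"),
    ("org.apache.hadoop", "hadoop-client", "2.8.5"),
    ("org.apache.hadoop", "hadoop-common", "2.8.5"),
    ("org.apache.hadoop", "hadoop-hdfs", "2.8.5"),
    ("org.apache.htrace", "htrace-core4", "4.0.1-incubating"),
    ("org.apache.httpcomponents", "httpclient", "4.5"),
]


def jar_name(artifact, version):
    """Return the file name as it appears in step 1."""
    return f"{artifact}-{version}.jar"


def maven_url(group, artifact, version):
    """Build the download URL; dots in the group id become path separators."""
    group_path = group.replace(".", "/")
    return f"{MAVEN_BASE}/{group_path}/{artifact}/{version}/{jar_name(artifact, version)}"


def oc_cp_command(jar, pod, target_dir="/user-home/_global_/dbdrivers/"):
    """Return the step 4 copy command as an argument list."""
    return ["oc", "cp", jar, f"{pod}:{target_dir}"]


if __name__ == "__main__":
    pod = "<ibm-nginx-pod>"  # placeholder: use the pod name found in step 2
    for group, artifact, version in JARS:
        print(maven_url(group, artifact, version))
        print(" ".join(oc_cp_command(jar_name(artifact, version), pod)))
```

The script only prints the URLs and commands, so you can review them before you download or copy anything.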
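5. Run the code to connect to Cloud Object Storage. The following is sample code. It uses the spark session and sc context objects, which are expected to be defined already in the runtime.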
data = [{
 'dt': '2022-01-01',
 'id': 1
}, {
 'dt': '2022-01-01',
 'id': 2
}, {
 'dt': '2022-01-01',
 'id': 3
}, {
 'dt': '2022-02-01',
 'id': 1
}
]

from pyspark.sql.types import StructField, StringType, StructType, IntegerType

schema = StructType([
    StructField("dt",StringType(),True),
    StructField("id", IntegerType(), True)
])

df = spark.createDataFrame(data=data, schema=schema)

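# Get the Hadoop configuration of the running Spark context.
hconf = sc._jsc.hadoopConfiguration()

# Set the Cloud Object Storage endpoint, access key, and secret key for the S3A connector.
hconf.set("fs.s3a.endpoint", "<ENDPOINT>")
hconf.set("fs.s3a.access.key", "<ACCESS KEY>")
hconf.set("fs.s3a.secret.key", "<SECRET KEY>")

# Write the DataFrame as CSV to the bucket. Spark creates output.csv as a directory of part files.
df.write.option("header", True).mode("overwrite").csv("s3a://cos-test-cpd/output.csv")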
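The sample code places the endpoint and keys directly in the notebook. One possible variation, sketched below, reads them from environment variables and applies the same three S3A properties through a helper. The variable names COS_ENDPOINT, COS_ACCESS_KEY, and COS_SECRET_KEY and the function names are hypothetical and are not defined by Cloud Pak for Data; the helper only calls the set() method that the sample code already uses.

```python
import os


def s3a_settings(endpoint, access_key, secret_key):
    """Return the three S3A properties from the sample code as a dict."""
    return {
        "fs.s3a.endpoint": endpoint,
        "fs.s3a.access.key": access_key,
        "fs.s3a.secret.key": secret_key,
    }


def settings_from_env(env=os.environ):
    """Read the connection values from environment variables.

    COS_ENDPOINT, COS_ACCESS_KEY, and COS_SECRET_KEY are hypothetical names
    chosen for this sketch; a missing variable raises KeyError.
    """
    return s3a_settings(
        env["COS_ENDPOINT"], env["COS_ACCESS_KEY"], env["COS_SECRET_KEY"]
    )


def apply_settings(hconf, settings):
    """Call hconf.set(key, value) for each property.

    hconf is any object with a set() method, such as the object returned by
    sc._jsc.hadoopConfiguration() in the sample code.
    """
    for key, value in settings.items():
        hconf.set(key, value)


# In a runtime where sc and spark already exist, usage would look like this:
# apply_settings(sc._jsc.hadoopConfiguration(), settings_from_env())
# spark.read.option("header", True).csv("s3a://cos-test-cpd/output.csv").show()
```

Reading the written path back with spark.read, as in the last commented line, is a quick way to confirm that both the write and the connection settings work.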


Document Information

Modified date:
06 March 2023

UID

ibm16960471