Question & Answer
Question
Can I connect from a Spark and Python runtime to Cloud Object Storage by using the AWS S3 protocol?
Answer
You can connect from a Spark and Python runtime to Cloud Object Storage by using the S3 protocol. Complete the following steps.
1. Download the following JAR files, which are needed to connect to Cloud Object Storage by using the S3 protocol, from the Maven Repository and move them to the CPD cluster. Change the versions if necessary.
- aws-java-sdk-1.12.393.jar
- aws-java-sdk-core-1.12.393.jar
- aws-java-sdk-s3-1.12.393.jar
- hadoop-annotations-2.8.5.jar
- hadoop-auth-2.8.5.jar
- hadoop-aws-2.8.5.jar
- hadoop-client-2.8.5.jar
- hadoop-common-2.8.5.jar
- hadoop-hdfs-2.8.5.jar
- htrace-core4-4.0.1-incubating.jar
- httpclient-4.5.jar
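The JAR files listed above can also be located programmatically. The following is a minimal sketch that builds a download URL for each file, assuming the files are fetched from Maven Central (https://repo1.maven.org/maven2) and that the group IDs shown in the code are the right ones for these artifacts. Neither the base URL nor the group IDs appear in the original steps, so confirm them on the Maven Repository site before you rely on them.

```python
# Assumption: Maven Central's standard repository layout is used.
MAVEN_CENTRAL = "https://repo1.maven.org/maven2"

# (group ID, artifact ID, version) for each JAR in the list.
# The group IDs are assumptions; the artifact names and versions come from the list.
JARS = [
    ("com.amazonaws", "aws-java-sdk", "1.12.393"),
    ("com.amazonaws", "aws-java-sdk-core", "1.12.393"),
    ("com.amazonaws", "aws-java-sdk-s3", "1.12.393"),
    ("org.apache.hadoop", "hadoop-annotations", "2.8.5"),
    ("org.apache.hadoop", "hadoop-auth", "2.8.5"),
    ("org.apache.hadoop", "hadoop-aws", "2.8.5"),
    ("org.apache.hadoop", "hadoop-client", "2.8.5"),
    ("org.apache.hadoop", "hadoop-common", "2.8.5"),
    ("org.apache.hadoop", "hadoop-hdfs", "2.8.5"),
    ("org.apache.htrace", "htrace-core4", "4.0.1-incubating"),
    ("org.apache.httpcomponents", "httpclient", "4.5"),
]


def jar_name(artifact_id, version):
    """Return the file name of a JAR, for example hadoop-aws-2.8.5.jar."""
    return f"{artifact_id}-{version}.jar"


def jar_url(group_id, artifact_id, version):
    """Build the download URL: dots in the group ID become path separators."""
    group_path = group_id.replace(".", "/")
    return f"{MAVEN_CENTRAL}/{group_path}/{artifact_id}/{version}/{jar_name(artifact_id, version)}"


if __name__ == "__main__":
    # Print one URL per JAR; pass these to the download tool of your choice.
    for group_id, artifact_id, version in JARS:
        print(jar_url(group_id, artifact_id, version))
```

If you change a version in step 1, change the matching entry in JARS so that the generated file names stay in sync.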
2. Locate an ibm-nginx pod.
# ibm_nginx_pod=$(oc get pods | grep ibm-nginx | head -1 | cut -f1 -d\ )
# echo $ibm_nginx_pod
3. Create a directory named dbdrivers in the "/user-home/_global_" directory inside the pod, if it does not already exist.
# oc exec ${ibm_nginx_pod} -- mkdir -p "/user-home/_global_/dbdrivers"
4. Copy the JAR files into the dbdrivers directory.
# oc cp <jar-name.jar> ${ibm_nginx_pod}:/user-home/_global_/dbdrivers/
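The copy command in step 4 handles one file at a time. The following is a minimal sketch that repeats it for every JAR file in a local directory, assuming the oc CLI is on the PATH and already logged in to the cluster. The function names and the local directory argument are hypothetical; only the oc cp command and the dbdrivers path come from the steps above.

```python
import subprocess
from pathlib import Path

# Target directory inside the ibm-nginx pod, as created in step 3.
DBDRIVERS_DIR = "/user-home/_global_/dbdrivers"


def copy_commands(pod, jar_paths):
    """Return one 'oc cp' argument list per JAR file, without running anything."""
    return [["oc", "cp", str(jar), f"{pod}:{DBDRIVERS_DIR}/"] for jar in jar_paths]


def copy_jars(pod, jar_dir):
    """Copy every *.jar file found in jar_dir (hypothetical local folder) into the pod."""
    jars = sorted(Path(jar_dir).glob("*.jar"))
    for command in copy_commands(pod, jars):
        # check=True stops at the first failed copy instead of continuing silently.
        subprocess.run(command, check=True)
```

For example, copy_jars("<ibm-nginx-pod-name>", "./jars") copies everything in ./jars, where the pod name is the value that step 2 prints.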
5. Run the code to connect to Cloud Object Storage. The following is sample code.
from pyspark.sql.types import StructField, StringType, StructType, IntegerType

# Sample rows to write to Cloud Object Storage.
data = [
    {'dt': '2022-01-01', 'id': 1},
    {'dt': '2022-01-01', 'id': 2},
    {'dt': '2022-01-01', 'id': 3},
    {'dt': '2022-02-01', 'id': 1},
]

schema = StructType([
    StructField("dt", StringType(), True),
    StructField("id", IntegerType(), True),
])
# spark and sc are assumed to be predefined by the runtime.
df = spark.createDataFrame(data=data, schema=schema)

# Point the S3A connector at the Cloud Object Storage endpoint and credentials.
hconf = sc._jsc.hadoopConfiguration()
hconf.set("fs.s3a.endpoint", "<ENDPOINT>")
hconf.set("fs.s3a.access.key", "<ACCESS KEY>")
hconf.set("fs.s3a.secret.key", "<SECRET KEY>")

# Spark writes a directory named output.csv that contains the CSV part files.
df.write.option("header", True).mode("overwrite").csv("s3a://cos-test-cpd/output.csv")
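To check that the write succeeded, the data can be read back from the same path. The following is a minimal sketch, assuming the same endpoint and credentials as in the sample code and a Spark session passed in by the caller. The helper names s3a_settings and read_back are hypothetical and are not part of any library.

```python
def s3a_settings(endpoint, access_key, secret_key):
    """Return the three Hadoop properties that the sample code sets."""
    return {
        "fs.s3a.endpoint": endpoint,
        "fs.s3a.access.key": access_key,
        "fs.s3a.secret.key": secret_key,
    }


def read_back(spark, path, settings):
    """Apply the S3A settings and read the CSV data at path into a DataFrame."""
    # spark.sparkContext is the same context that the sample code calls sc.
    hconf = spark.sparkContext._jsc.hadoopConfiguration()
    for key, value in settings.items():
        hconf.set(key, value)
    # header=True matches the option used when the data was written.
    return spark.read.option("header", True).csv(path)
```

In a notebook, read_back(spark, "s3a://cos-test-cpd/output.csv", s3a_settings("<ENDPOINT>", "<ACCESS KEY>", "<SECRET KEY>")).show() displays the rows. Because no schema is supplied on read, both columns come back as strings.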
Document Information
Modified date:
06 March 2023
UID
ibm16960471