October 27, 2021 By Henrik Loeser 4 min read

How to use IBM Cloud Code Engine with cron-scheduled jobs to build your data lake.

Recently, I began a new side project. It involves mobility data and its analysis and visualization — consider it a data science project. After identifying the right data providers and open data APIs, the next step is to build the data lake. This requires regularly downloading data from various sources and uploading it to a Cloud Object Storage/S3 bucket.

Traditionally, I would have to set up a virtual machine to run the scheduled data scraping jobs. Thanks to serverless compute offerings like IBM Cloud Code Engine, I can cut costs and environmental impact. My scripts are still run based on a cron-controlled schedule, but I only use compute resources for a few seconds per hour. Thus, I pay only a fraction of the earlier costs. All data is uploaded to Cloud Object Storage (COS). From there, it can be easily accessed by data science projects and notebooks hosted in IBM Watson Studio or queried by the SQL Query service (see diagram below).

In this post, I provide an overview of the project and its components. Thereafter, I discuss technical details like the Dockerfile used to containerize my scripting:

My script runs in Code Engine as a job, based on a cron schedule.

Combining Code Engine, cron and IBM Cloud Object Storage

Instead of utilizing a virtual machine, the scripts for data scraping/retrieving data from open data APIs are deployed to IBM Cloud Code Engine. Code Engine is a fully managed, serverless platform for containerized workloads. That means my scripts need to run within containers. Code Engine distinguishes between applications and jobs. Applications (apps) serve HTTP requests, whereas jobs run one time and then exit — kind of a batch job. This means that a Code Engine job is a good fit for retrieving data and uploading it to storage.

To run the job regularly, I can configure a periodic timer (cron) as an event producer that triggers the job run. The job is the scripting, which contacts APIs or websites to retrieve data, possibly postprocesses it and then uploads the data to a Cloud Object Storage bucket (see diagram above). There, the data can later be accessed by SQL Query or by scripts in a notebook of an IBM Watson Studio analytics project. Accessing the data through other means (e.g., from other apps or scripts inside or outside IBM Cloud) is possible, too.
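As a minimal sketch of how this looks with the Code Engine CLI, the following command attaches a periodic timer to a job; the subscription name, the job name scraper-job and the two-hour schedule are placeholders for illustration:

# create a cron subscription that triggers the job every two hours
ibmcloud ce subscription cron create --name scraper-schedule \
  --destination scraper-job --destination-type job \
  --schedule '0 */2 * * *'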

Technical details

Independent of where the scraper script is run and for which source site, the structure is always the same (see script below). We need to determine the name of the data file, which ideally includes the current date and time. Thereafter, we can retrieve the data and store the result in a file. The data retrieval might require parameters and an API key. I usually compress data with gzip before storing it. There are different ways of uploading a file to COS. An easy approach is to utilize the Cloud Object Storage plugin for the IBM Cloud CLI and its object-put command. It requires you to be logged in to IBM Cloud (using an API key) with a region and a resource group set:

#!/bin/bash
set -ex
# use current date and time for file name
DATE=`date "+%Y%m%d_%H%M"`

# retrieve the data and store it to a file
curl -s -X GET "https://data-platform.example.com/v1/someAPI?${MY_PARAMETERS}" -H "x-api-key: ${MY_DATA_API_KEY}" > ${DATE}.json

# compress the file
gzip ${DATE}.json

# IBM Cloud login (the CLI reads the API key from IBMCLOUD_API_KEY) and data upload to COS
IBMCLOUD_API_KEY=${IBMCLOUD_APIKEY} ibmcloud login -g default -r us-south
ibmcloud cos object-put --bucket scooter --key ${DATE}.json.gz --body ${DATE}.json.gz

Script to fetch data from an API and upload it to Cloud Object Storage for the data lake.

In order to run the above script in a container, we need the IBM Cloud CLI environment. Thus, our Dockerfile is mainly composed of a chain of commands to update the base operating system and then to install the IBM Cloud CLI and the COS plugin. Thereafter, it copies over the above script, which is also run by default:

# Small base image
FROM alpine
# Upgrade the OS, install some common tools and then
# the IBM Cloud CLI and Cloud Object Storage plugin
RUN apk update && apk upgrade && apk add bash curl jq git ncurses && \
    curl -fsSL https://clis.cloud.ibm.com/install/linux | bash && \
    ln -s /usr/local/bin/ibmcloud /usr/local/bin/ic && \
    ibmcloud plugin install cloud-object-storage

# Copy in the data retrieval script, set the working directory
# for the downloaded files and run the script by default
COPY script.sh /script.sh
WORKDIR /app
ENTRYPOINT [ "/script.sh" ]

Dockerfile.

With those definitions in place, everything is ready to build the container image and push it to the IBM Cloud Container Registry. Then, it is straightforward to create a job based on it and schedule job runs. For the job, I created environment variables to pass in the IBM Cloud API key and the necessary parameters and key for the data retrieval.
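As a rough sketch of those steps, assuming a Container Registry namespace my-namespace, a Code Engine project my-project and a job named scraper-job (all placeholder names), the commands could look like this:

# build the container image and push it to IBM Cloud Container Registry
ibmcloud cr login
docker build -t us.icr.io/my-namespace/data-scraper:latest .
docker push us.icr.io/my-namespace/data-scraper:latest

# create the job in a Code Engine project, passing credentials and
# data source settings as environment variables
# (a registry access secret may be required for a private icr.io namespace)
ibmcloud ce project select --name my-project
ibmcloud ce job create --name scraper-job \
  --image us.icr.io/my-namespace/data-scraper:latest \
  --env IBMCLOUD_APIKEY=<api-key> \
  --env MY_DATA_API_KEY=<data-api-key> \
  --env MY_PARAMETERS=<query-parameters>

# optionally, submit a single job run to test the setup
ibmcloud ce jobrun submit --job scraper-job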

Once everything was in place, I used the IBM Cloud CLI with the COS plugin to check that the automatic, serverless data retrieval was working as expected. The files use a UTC timestamp and were uploaded every two hours, as configured in my test case:
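Such a check can be as simple as listing the bucket contents with the COS plugin (scooter is the bucket name used in the script above):

# list the uploaded objects; the names include the UTC timestamp
ibmcloud cos objects --bucket scooter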

Listing uploaded data in a Cloud Object Storage bucket.

Conclusions

Using a serverless container platform like IBM Cloud Code Engine, it is possible to easily set up data scraping and retrieval jobs for building up a data lake. By replacing “always-on” virtual machines with compute that only runs when needed, I avoid unnecessary resource consumption and costs. In my experience, it is also easier to set up and reduces maintenance and security-related work.

If you are interested in learning more about Code Engine, I recommend its tutorials and blog posts.

If you have feedback, suggestions, or questions about this post, please reach out to me on Twitter (@data_henrik) or LinkedIn.
