October 27, 2021 By Henrik Loeser 4 min read

How to use IBM Cloud Code Engine with cron-scheduled jobs to build your data lake.

Recently, I began a new side project. It involves mobility data and its analysis and visualization; consider it a data science project. After identifying the right data providers and open data APIs, the next step is to build the data lake. This requires regularly downloading data from various sources and uploading it to a Cloud Object Storage/S3 bucket.

Traditionally, I would have to set up a virtual machine to run the scheduled data scraping jobs. Thanks to serverless compute offerings like IBM Cloud Code Engine, I can cut costs and environmental impact. My scripts are still run based on a cron-controlled schedule, but I only use compute resources for a few seconds per hour. Thus, I pay only a fraction of the earlier costs. All data is uploaded to Cloud Object Storage (COS). From there, it can be easily accessed by data science projects and notebooks hosted in IBM Watson Studio or queried by the SQL Query service (see diagram below).

In the following post, I provide an overview of the project and its components. Thereafter, I discuss technical details like the Dockerfile to containerize my scripting:

My script runs in Code Engine as a job, based on a cron schedule.

Combining Code Engine, cron and IBM Cloud Object Storage

Instead of utilizing a virtual machine, the scripts for data scraping/retrieving data from open data APIs are deployed to IBM Cloud Code Engine. Code Engine is a fully managed, serverless platform for containerized workloads. That means my scripts need to run within containers. Code Engine distinguishes between applications and jobs. Applications (apps) serve HTTP requests, whereas jobs run one time and then exit — kind of a batch job. This means that a Code Engine job is a good fit for retrieving data and uploading it to storage.

To run the job regularly, I can configure a periodic timer (cron) as event producer that triggers the job run. The job is the scripting, which contacts APIs or websites to retrieve data, possibly post-processes it and then uploads the data to a Cloud Object Storage bucket (see diagram above). There, the data can later be accessed by SQL Query or by scripts in a notebook of an IBM Watson Studio analytics project. Accessing the data through other means (e.g., from other apps or scripts inside or outside IBM Cloud) is possible, too.
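
As a sketch of how such a periodic timer can be attached to a job with the Code Engine CLI, the following command creates a cron subscription; the job name data-collector and the two-hour schedule are placeholders for this example, not values taken from the project:

# trigger the job "data-collector" at the top of every second hour
ibmcloud ce sub cron create --name every-two-hours \
  --destination data-collector --destination-type job \
  --schedule "0 */2 * * *"

The schedule uses standard crontab syntax, so other intervals only require a different schedule expression.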

Technical details

Independent of where the scraper script runs and which source site it targets, the structure is always the same (see script below). We need to determine the name of the data file, which ideally includes the current date and time. Thereafter, we can retrieve the data and store the result in a file. The data retrieval might require parameters and an API key. I usually compress the data with gzip before storing it. There are different ways of uploading a file to COS. An easy approach is to utilize the Cloud Object Storage plugin for the IBM Cloud CLI and its object-put command. It requires you to be logged in to IBM Cloud (using an API key) with a region and a resource group set:

#!/bin/bash
set -ex
# use current date and time for file name
DATE=`date "+%Y%m%d_%H%M"`

# retrieve the data and store it to a file
curl -s -X GET "https://data-platform.example.com/v1/someAPI?${MY_PARAMETERS}" -H "x-api-key: ${MY_DATA_API_KEY}" > ${DATE}.json

# compress the file
gzip ${DATE}.json

# IBM Cloud login and data upload to COS
IBMCLOUD_API_KEY=${IBMCLOUD_APIKEY} ibmcloud login -g default -r us-south
ibmcloud cos object-put --bucket scooter --key ${DATE}.json.gz --body ${DATE}.json.gz

Script to fetch data from API and upload to Cloud Object Storage as a data lake.

In order to run the above script in a container, we need the IBM Cloud CLI environment. Thus, our Dockerfile is mainly composed of a chain of commands to update the base operating system and then to install the IBM Cloud CLI and the COS plugin. Thereafter, it copies over the above script, which is also run by default:

# Small base image
FROM alpine
# Upgrade the OS, install some common tools and then
# the IBM Cloud CLI and Cloud Object Storage plugin
RUN apk update && apk upgrade && apk add bash curl jq git ncurses && \
    curl -fsSL https://clis.cloud.ibm.com/install/linux | bash && \
    ln -s /usr/local/bin/ibmcloud /usr/local/bin/ic && \
    ibmcloud plugin install cloud-object-storage

COPY script.sh /script.sh
WORKDIR /app
ENTRYPOINT [ "/script.sh" ]

Dockerfile.

With those definitions in place, everything is ready to build the container image and push it to the IBM Cloud Container Registry. Then, it is straightforward to create a job based on it and schedule job runs. For the job, I created environment variables to pass in the IBM Cloud API key and the necessary parameters and key for the data retrieval.
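
As a rough sketch, and assuming the image lives in an IBM Cloud Container Registry namespace, the steps could look like the commands below; the namespace, image and job names are placeholders, and the environment variable values have to be filled in. The variable names themselves match those used in the script above:

# build the container image and push it to IBM Cloud Container Registry
ibmcloud cr login
docker build -t us.icr.io/my-namespace/data-collector:latest .
docker push us.icr.io/my-namespace/data-collector:latest

# create the job and pass in the API key and data retrieval parameters
# (a private registry namespace may also require a registry access secret,
#  see "ibmcloud ce registry create")
ibmcloud ce job create --name data-collector \
  --image us.icr.io/my-namespace/data-collector:latest \
  --env IBMCLOUD_APIKEY=... \
  --env MY_DATA_API_KEY=... \
  --env MY_PARAMETERS=...

# submit a single job run to test it before relying on the cron schedule
ibmcloud ce jobrun submit --job data-collector

Sketch of building the image and creating the Code Engine job (names and values are placeholders).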

Once everything was in place, I used the IBM Cloud CLI with the COS plugin to check that the automatic, serverless data retrieval was working as expected. The files use a UTC timestamp and were uploaded every two hours, as configured in my test case:

Listing uploaded data in a Cloud Object Storage bucket.
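
A check along these lines is possible with the COS plugin; the bucket name scooter is taken from the script above, and depending on the plugin version the subcommand is list-objects or objects:

# list the uploaded files to verify the scheduled runs
ibmcloud cos list-objects --bucket scooter

With the two-hour schedule, each object key should carry a timestamp roughly two hours after the previous one.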

Conclusions

Using a serverless container platform like IBM Cloud Code Engine, it is possible to easily set up data scraping and retrieval jobs for building up a data lake. By avoiding "always-on" virtual machines and using compute power only when needed, I cut unnecessary resource consumption and costs. In my experience, this approach is also easier to set up and reduces maintenance and security-related work.

If you are interested in learning more about Code Engine, I recommend its tutorials and blog posts.

If you have feedback, suggestions, or questions about this post, please reach out to me on Twitter (@data_henrik) or LinkedIn.

