October 23, 2020 | By Torsten Steinbach | 2 min read

A critical factor in smart business decisions is learning from the behavior of your applications and users.

The information that fuels such learning is available in the logs generated by your solution stack. Too often, however, these logs are still quite hard to consume with analytics frameworks and algorithms.

In IBM Cloud, we have established a pattern of using serverless SQL jobs to process and analyze log data that has been archived to cloud object storage. IBM Log Analysis with LogDNA is the standard logging service of IBM Cloud, and it supports archiving to object storage out of the box. Similarly, the auditing service, Cloud Activity Tracker, is built on top of LogDNA and lets you archive and analyze audit records in the same way.
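For illustration, such a serverless SQL job can be submitted from the command line. The following is a minimal sketch, assuming the SQL Query CLI plugin is installed and an instance is targeted; the region, bucket names, and object path are illustrative assumptions:

    # Query an archived log file in Cloud Object Storage with the
    # serverless SQL service and write the result back to a bucket.
    ibmcloud sql query \
      "SELECT * \
       FROM cos://us-geo/my-logdna-archive/2020/10/23/archive.json.gz STORED AS JSON \
       LIMIT 10 \
       INTO cos://us-geo/my-sql-results/ STORED AS CSV"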

A frequent hurdle that many users face in this process is that log archive files are sometimes very large, which makes processing and analytics with the SQL service inefficient and slow. This is further amplified by the fact that log archives are often stored with gzip compression. The problem here is that gzip is a non-splittable compression codec: a gzip stream can only be decompressed sequentially from the beginning, so the Spark-based, scale-out serverless SQL service in IBM Cloud cannot read a large log archive file in parallel and forfeits much of its performance advantage.

A new solution: IBM Cloud Code Engine

To overcome this hurdle, you can use another brand-new serverless runtime feature in IBM Cloud: IBM Cloud Code Engine, which was recently launched and is currently available to everyone in open beta. It provides a very flexible way to run any code in a serverless fashion, either directly from source or from Docker images that you have prepared.

We have just published a Docker image that splits and recompresses large log archives into a splittable compression format. You can use it as is and run it as a serverless batch job in IBM Cloud Code Engine. The image source, along with a detailed description of how to deploy and run it with Code Engine, can be found here.
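Conceptually, the job performs a split-and-recompress pass over the archive: decompress once, cut the stream into chunks, and compress each chunk as its own object so that each one can be read independently. A rough local sketch of that idea with standard shell tools (the file name and chunk size are illustrative, and the published image's actual behavior may differ):

    # Decompress the monolithic archive, split it into chunks of
    # one million lines each, and recompress every chunk separately.
    gunzip -c large-archive.json.gz | split -l 1000000 - part-
    gzip part-*
    # Each resulting part-*.gz is a small, independent gzip object
    # that a scale-out engine can process in parallel.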

Step-by-step instructions

The basic steps are quite straightforward:

  1. Create a Code Engine project.
  2. Create a batch job definition referencing the docker image.
  3. Set object storage bucket, object name, and credentials for both — the large input log archive and the split output.
  4. Submit the job, as shown in the CLI sketch after this list.
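Expressed with the IBM Cloud Code Engine CLI, the steps might look as follows. This is a hedged sketch: the project name, job name, image reference, and environment variable names are illustrative assumptions rather than the image's documented interface:

    # 1. Create a Code Engine project.
    ibmcloud ce project create --name log-splitter

    # 2. Create a batch job definition referencing the Docker image
    #    (the image reference below is a placeholder).
    ibmcloud ce job create --name split-logs \
      --image us.icr.io/example/log-archive-splitter:latest

    # 3 + 4. Pass bucket, object, and credential settings as environment
    #    variables (names are assumptions) and submit the job.
    ibmcloud ce jobrun submit --job split-logs \
      --env INPUT_BUCKET=my-log-archive-bucket \
      --env INPUT_OBJECT=logs/2020-10-23.json.gz \
      --env OUTPUT_BUCKET=my-split-bucket \
      --env COS_API_KEY=<your-api-key>

Once the job run completes, the split objects in the output bucket can be queried in parallel with the SQL service, as shown earlier.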

Because the split job runs serverless directly inside the cloud, it is close to the data in object storage and can use private endpoints to read and write. This way, the entire process (read, decompress, split, compress, write) completes in close to a minute for a 1 GB compressed log archive.

This efficient, serverless splitting of log archives paves the way for a fully serverless log processing and analytics pipeline using SQL. The entire pipeline is illustrated below:

[Figure: the entire serverless log processing pipeline]

