Object storage data (pre-)processing, hyperparameter optimization, searching and processing logs, heavy computational tasks (e.g., Monte Carlo simulations or genome analytics), downloading large volumes of data, web scraping, model scoring, etc. are just a few examples of scenarios where lots of CPU, memory and/or network-intensive work needs to be done.

A common approach to handling this programmatically is to run a “for” loop and kick off asynchronous processing within that loop. In Python, this is typically done with multiprocessing.Pool or concurrent.futures, where the map operation is called with the function to be executed as one parameter and the list of (many) objects to be processed as the other. The remainder of this doc focuses on the former for the sake of simplicity, but there is an equivalent approach for concurrent.futures.

Below is a conceptual example of how this works. The map operation receives two parameters:

import multiprocessing

def convert_image(image_url):
    # <logic downloading the image and converting it somehow>
    ...

with multiprocessing.Pool() as pool:
    # map() applies convert_image to every element and blocks until all results are ready
    results = pool.map(convert_image, [image1, image2, image3])

print(results[0])
print(results[1])
…

The beauty of this approach lies in its simplicity: in order to use it, a developer only has to pass in the operation to be executed n times and the n objects they’d like to have processed. However, with the original version of these libraries, the restriction is that they can only take advantage of the CPU cores, memory, network bandwidth, etc. available to the (virtual) machine the Python process is running on.

Wouldn’t it be nice if for each call of Pool.map with n objects as a parameter, n containers got spun up behind the scenes (or a smaller number, in case the elements are chunked)? Each of the containers would handle their part of the work, and they would vanish automatically once the work was completed. This would also demonstrate nicely how the often-discussed relevance of cloud to developers can be realized. The developer only has to write code; when executing the code, hundreds or thousands of CPU cores are (de-)provisioned transparently behind the scenes:

[Diagram: Lithops + IBM Cloud Code Engine]

This foundational approach is implemented in the open source Lithops project as a client-side library.

The integration of Lithops with IBM Cloud Code Engine, our next-generation serverless offering, provides unprecedented flexibility. Among other things, Code Engine allows you to allocate a large number of parallel containers behind the scenes. Each of them can be provisioned within seconds, with a maximum of 32 GB of memory, 8 CPU cores and a maximum execution time of 2 hours (and these are just the defaults every user gets out of the box; we can raise them further on a per-user basis).

The code changes required for adopting Lithops in any existing Python program are minimal. In the ideal case, it’s literally a matter of following a so-called “drop-in library” approach: changing an import statement from multiprocessing.Pool to lithops.multiprocessing.Pool.
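
For illustration, here is what the earlier conceptual example could look like after that change. This is a minimal sketch; the image placeholders are carried over from the example above:

# the only change vs. the standard library version is the import
from lithops.multiprocessing import Pool

def convert_image(image_url):
    # <logic downloading the image and converting it somehow>
    ...

with Pool() as pool:
    # each element is now processed in its own container in the cloud
    results = pool.map(convert_image, [image1, image2, image3])

print(results[0])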

The initial configuration instructions are also minimal; you can find them documented here.
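
To give a flavor, the configuration can also be passed programmatically. The sketch below shows roughly what that might look like for the Code Engine backend; the keys and placeholder values here are illustrative assumptions, so treat the linked instructions as the authoritative reference:

import lithops

# illustrative sketch only; consult the Lithops documentation for the exact schema
config = {
    'lithops': {'backend': 'code_engine', 'storage': 'ibm_cos'},
    'ibm': {'iam_api_key': '<IAM_API_KEY>'},
    'code_engine': {'region': '<REGION>', 'namespace': '<PROJECT_NAMESPACE>'},
    'ibm_cos': {'storage_bucket': '<BUCKET>', 'region': '<REGION>'},
}

fexec = lithops.FunctionExecutor(config=config)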

Beyond that, Lithops allows for pure parallelization across a conceptually unlimited pool of resources and the application of a reduce/aggregation step at the end. That can be accomplished by simply passing in the reduce function as a third parameter, for example, map_reduce(business_logic, <list_of_data_elements>, reduce_operation).
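
Here is a minimal sketch of how this might look with the Lithops FunctionExecutor API; the word-count functions and the toy data are made up for illustration:

import lithops

def count_words(text):
    # map function: runs once per data element, each in its own container
    return len(text.split())

def total(results):
    # reduce function: aggregates the partial results into a single value
    return sum(results)

fexec = lithops.FunctionExecutor()
fexec.map_reduce(count_words, ['some text', 'some more text here'], total)
print(fexec.get_result())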

The price doesn’t change, either: 10 cores for 100 seconds cost the same as 100 cores for 10 seconds; in both cases, you pay for 1,000 core-seconds. Obviously, depending on the nature of the problem, there is a certain percentage of overhead for distributing the work, which needs to be taken into account.

Another advantage is that instead of allocating capacity based on the most expensive operation in your Python program, each individual map operation in a longer program dynamically allocates exactly the capacity it needs. This allows for a significant performance boost in combination with significant cost savings (see diagram below).

Obviously, this approach can be leveraged in interactive scenarios (data science and others), where a data scientist wants to run some heavy processing operation while waiting in front of their screen for it to finish, whether using pure Python with an editor, a Jupyter notebook (e.g., in Watson Studio or elsewhere), etc. It’s also applicable to continuously running backend applications written in Python (or, basically, any other piece of Python code that needs to do some heavy lifting).

As indicated in the diagrams above, from a developer perspective, this looks like a single program running on a single computer. In reality, a very large amount of distributed capacity is being (de-)allocated dynamically. All of this takes us one step further towards our vision of a serverless supercomputer, where we treat the cloud as a single computer in itself vs. a collection of VMs, containers, web servers, app servers and database servers. Watch this space.

Get started

Try it out for yourself using IBM Cloud Code Engine and following the instructions captured here.

We would like to express our special thanks to URV, in particular Josep Sampe and Pedro Garcia-Lopez, who have been absolutely instrumental in developing Lithops and who contributed the multiprocessing API. We very much enjoy and appreciate the collaboration.
