Object storage data (pre-)processing, hyperparameter optimization, searching and processing logs, heavy computational tasks (e.g., Monte Carlo simulations or genome analytics), downloading large volumes of data, web scraping, model scoring and the like are just a few examples of scenarios where a lot of CPU-, memory- and/or network-intensive work needs to be done.

A common way to handle this programmatically is to run a “for” loop and kick off asynchronous processing within that loop. In Python, this is typically done with multiprocessing.Pool or concurrent.futures, where the map operation is called with the function to be executed as one parameter and the list of (many) objects to be processed as another. The remainder of this post focuses on the former for the sake of simplicity, but an equivalent approach for concurrent.futures is sketched below as well.

Below is a conceptual example of how this works. The map operation receives two parameters:

from multiprocessing import Pool

def convert_image(image_url):
    # logic downloading the image and converting it somehow
    ...

# image1, image2, image3 are placeholder inputs
with Pool() as pool:
    results = pool.map(convert_image, [image1, image2, image3])

# Pool.map returns the results directly as a list
print(results[0])
print(results[1])
…
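
As mentioned above, there is an equivalent approach for concurrent.futures. A minimal sketch, reusing the placeholder convert_image function and image list from above, could look like this:

from concurrent.futures import ProcessPoolExecutor

# assuming the convert_image function and image list from above
with ProcessPoolExecutor() as executor:
    # executor.map returns an iterator, so we materialize it into a list
    results = list(executor.map(convert_image, [image1, image2, image3]))

print(results[0])
print(results[1])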

The beauty of this approach is that it is entirely serverless: to use it, a developer only has to pass in the operation to be executed n times and the n objects they’d like to have processed. With the original versions of these libraries, however, the restriction is that they can only take advantage of the CPU cores, memory, network bandwidth, etc. available on the (virtual) machine the Python process is running on.

Wouldn’t it be nice if, for each call of Pool.map with n objects as a parameter, n containers got spun up behind the scenes (or a smaller number, in case the elements are chunked)? Each container would handle its part of the work and vanish automatically once the work was completed. This would also nicely demonstrate how the often-discussed value of the cloud for developers can be realized: the developer only has to write code, and when the code executes, hundreds or thousands of CPU cores are (de-)provisioned transparently behind the scenes:

Lithops + IBM Cloud Code Engine

This foundational approach is implemented in the open source Lithops project as a client-side library.

The integration of Lithops with IBM Cloud Code Engine, our next-generation serverless offering, provides unprecedented flexibility. Among other things, Code Engine allows you to allocate a large number of parallel containers behind the scenes. Each of them can be provisioned within seconds, with a maximum of 32 GB of memory, 8 CPU cores and a maximum execution time of 2 hours (and these are just the defaults every user gets out of the box; we can raise them further on a per-user basis).

The code changes required to adopt Lithops in an existing Python program are minimal. In the ideal case, it is literally a “drop-in library” approach: changing an import statement from multiprocessing.Pool to lithops.multiprocessing.Pool, as sketched below.
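
A minimal sketch of that change, applied to the conceptual example from above:

# before: from multiprocessing import Pool
from lithops.multiprocessing import Pool  # Lithops drop-in replacement

def convert_image(image_url):
    # same business logic as before
    ...

# each invocation can now run in its own container in the cloud
pool = Pool()
results = pool.map(convert_image, [image1, image2, image3])

print(results[0])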

The initial configuration instructions are also minimal; you can find them documented here.
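
For orientation only, here is a rough sketch of passing a configuration programmatically, selecting Code Engine as the compute backend and IBM Cloud Object Storage as the storage backend. The exact keys your account needs (region, project, bucket details, etc.) are covered in the linked instructions; treat the values below as illustrative placeholders:

import lithops

# illustrative configuration: Code Engine for compute, IBM COS for storage
config = {
    'lithops': {'backend': 'code_engine', 'storage': 'ibm_cos'},
    'ibm': {'iam_api_key': '<YOUR_IAM_API_KEY>'},
}

fexec = lithops.FunctionExecutor(config=config)

Alternatively, the same settings can live in a Lithops configuration file.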

Beyond that, Lithops allows for pure parallelization across a conceptually unlimited pool of resources, plus the application of a reduce/aggregation step at the end. That can be accomplished by simply passing in the reduce function as a third parameter, for example: map_reduce(business_logic, <list_of_data_elements>, reduce_operation).
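
As a sketch, using the Lithops FunctionExecutor API with placeholder map and reduce functions (business_logic and reduce_operation here are hypothetical examples):

import lithops

def business_logic(x):
    # placeholder map function: process a single element
    return x * x

def reduce_operation(results):
    # placeholder reduce function: aggregate all map results
    return sum(results)

fexec = lithops.FunctionExecutor()
fexec.map_reduce(business_logic, range(1000), reduce_operation)
print(fexec.get_result())  # sum of squares 0..999 -> 332833500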

The price doesn’t change: 10 cores for 100 seconds and 100 cores for 10 seconds both amount to the same 1,000 core-seconds. Obviously, depending on the nature of the problem, there is a certain percentage of overhead for distributing the work, which needs to be taken into account.

Another advantage: instead of allocating capacity up front based on the most expensive operation in your Python program, each individual map operation in a longer program dynamically allocates exactly the capacity it needs. This allows for a significant performance boost in combination with significant cost savings (see diagram below).

Obviously, this approach can be leveraged in interactive (data science and other) scenarios, where a data scientist wants to run some heavy processing operation while waiting in front of their screen for it to finish, whether using pure Python in an editor, a Jupyter notebook (e.g., in Watson Studio or elsewhere), etc. It’s also applicable to continuously running backend applications written in Python or, basically, any other piece of Python code that needs to do some heavy lifting.

As indicated in the diagrams above, from a developer perspective, this looks like a single program running on a single computer. In reality, a very large amount of distributed capacity is being (de-)allocated dynamically. All of this takes us one step further towards our vision of a serverless supercomputer, where we treat the cloud as a single computer in itself vs. a collection of VMs, containers, web servers, app servers and database servers. Watch this space.

Get started

Try it out for yourself using IBM Cloud Code Engine and following the instructions captured here.

We would like to express our special thanks to URV, in particular Josep Sampe and Pedro Garcia-Lopez, who have been absolutely instrumental in developing Lithops and contributed the multiprocessing API. We very much enjoy and appreciate the collaboration.
