February 18, 2022 By Torsten Steinbach
Michael Behrendt
8 min read

A demonstration of how to leverage the Ray compute framework to perform typical daily operational tasks at cloud-scale on large volumes of data on object storage.

The cloud plays an increasingly essential role in mainstream daily IT operations, including routine procedures for working with data. Most IT users are familiar with data stored on a local disk or network-mounted drives, and they use shells with interactive command line tools like cp to copy data from one drive or folder to another. A benefit of using a cloud is that you can store volumes of data that are orders of magnitude larger and tap into orders of magnitude more processing power than your local machine alone provides. But many users struggle to make effective use of this because these capabilities are not offered as conveniently as a shell terminal and simple commands like cp.

This motivated us to craft a small tooling framework that abstracts cloud data storage and typical standard operations on it in a shell command experience that looks and feels similar to current single machine terminals. In the backend, we use the highly versatile scale-out programming and runtime framework Ray to transparently and radically scale out the data processing behind the scenes on IBM Cloud’s infrastructure.

As the first use case, we have implemented copying large volumes of data from one “cloud volume” (a bucket in a cloud object storage service) to another “cloud volume.” To the user, this is made available as a simple command line tool to call on their client machine.

Setup and usage are straightforward. The user just performs the following sequence of operations, each represented as a single command line invocation:

  1. Generate a Ray cluster configuration for IBM Cloud (only needs to be done once, 1 minute)
  2. Execute the Ray command:
    1. Deploy and launch the Ray cluster in IBM Cloud (2 minutes)
    2. Run the command (for instance, the copy command in our initial use case here)
    3. Tear down the Ray cluster in IBM Cloud (10 seconds)

It is indeed as simple as calling four CLI commands in a row. Of course, you can run multiple Ray commands (step 2.2) on the same Ray cluster and keep it running (steps 2.1 and 2.3) across multiple command invocations. The following section provides an overview of the architecture. In the section after that, you can see a detailed walk-through of the sequence of commands.

Scale-out architecture

The central element is the IBM Cloud Virtual Private Cloud (VPC) infrastructure that provides the virtual machines (i.e., Virtual Server Instances (VSIs)) to run the Ray cluster with the scale-out command in an on-demand fashion:

On the client machine, you run the ray CLI tool to provision and start the VSIs with the Ray cluster and stop and decommission it when done. Before doing that, there is another client tool — lithopscloud — that needs to be run once. This generates the desired VPC and VSI resource configuration to be used by the ray CLI tool when provisioning the Ray cluster. To connect to the IBM Cloud Gen2 VPC infrastructure, the ray CLI uses the gen2-connector plugin.

The ray CLI launches a Ray cluster to execute a Ray command, which accesses and processes the data in your cloud storage in a scale-out manner. Our Ray command implementation employs a map-reduce pattern to first generate a series of parallelizable tasks (for instance, by listing all objects in a bucket and creating one task for each). After that, the Ray command executes these tasks in the Ray cluster in parallel according to the amount of resources that you configured when you used the lithopscloud tool beforehand.
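
To make this pattern concrete, the following is a minimal sketch of such a scale-out copy in Python with Ray. It is an illustration only: the endpoints, bucket names and credentials are placeholders, the boto3 S3-compatible client is an assumption, and the actual implementation in the Ray-Commands repository may differ in its details.

import ray
import boto3  # assumption: an S3-compatible client; the repository may use a different library

# placeholder connection details; in the real tool these come from prompts or environment variables
SRC_ENDPOINT, SRC_BUCKET = "https://<source-cos-endpoint>", "<source-bucket>"
DST_ENDPOINT, DST_BUCKET = "https://<target-cos-endpoint>", "<target-bucket>"
SRC_KEYS = ("<source-access-key>", "<source-secret-key>")
DST_KEYS = ("<target-access-key>", "<target-secret-key>")

def cos_client(endpoint, keys):
    # S3-compatible client for a cloud object storage bucket, using HMAC credentials
    return boto3.client("s3", endpoint_url=endpoint,
                        aws_access_key_id=keys[0], aws_secret_access_key=keys[1])

@ray.remote
def copy_object(key):
    # one parallelizable task: read an object from the source bucket and write it to the target bucket
    src = cos_client(SRC_ENDPOINT, SRC_KEYS)
    dst = cos_client(DST_ENDPOINT, DST_KEYS)
    body = src.get_object(Bucket=SRC_BUCKET, Key=key)["Body"].read()
    dst.put_object(Bucket=DST_BUCKET, Key=key, Body=body)
    return len(body)

ray.init(address="auto")  # attach to the running Ray cluster

# "map" phase: list all objects in the source bucket and create one Ray task per object
src = cos_client(SRC_ENDPOINT, SRC_KEYS)
pages = src.get_paginator("list_objects_v2").paginate(Bucket=SRC_BUCKET)
keys = [obj["Key"] for page in pages for obj in page.get("Contents", [])]
futures = [copy_object.remote(key) for key in keys]

# "reduce" phase: wait for all tasks to finish and sum up the copied bytes
total_bytes = sum(ray.get(futures))
print(f"copied {len(keys)} objects, {total_bytes} bytes in total")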

Client setup

Open a terminal and clone the Ray-Commands git repo:

git clone https://github.com/IBM-Cloud/Ray-Commands.git

Now enter the cloned directory:

cd Ray-Commands

Create a client environment and install all required tools by running the following:

source ./setup_env.sh

This installs the previously mentioned ray and lithopscloud CLI tools along with the gen2-connector plugin to interact with the IBM Cloud Gen2 VPC infrastructure.

Generate a Ray cluster configuration for IBM Cloud

First, let's generate a Ray cluster configuration for VPC in IBM Cloud. You only need to do this once; you can reuse the configuration for many Ray cluster deployments. Launch lithopscloud in your terminal and select Ray Gen2 as the compute backend (it stands for Ray on IBM Cloud Gen2 VPC infrastructure). Now paste an IBM Cloud API key (if needed, create a new one for your user in the IBM Cloud console). Select the IBM Cloud region. Select an existing VPC in your account or choose Create new VPC. Then select the availability zone where you want to create the cluster nodes:

You can use the defaults for the resource group, security group and new SSH keys unless you have specific requirements. There are multiple options for the base image; we tested with Ubuntu 20 minimal:

Now specify the minimum and maximum number of worker nodes and the node type according to the expected resource needs of your job. More cores in total result in more parallelism.

In our example, we use a static setup of four worker nodes (minimum and maximum) and pick the node type with 8 cores and 32 GB of memory. In total, our Ray cluster will then consist of five such nodes: four worker nodes plus a Ray head node of the same size. Ray will also use the head node to run Ray processing tasks:

lithopscloud now generates the configuration in a YAML file and prints out its location, as can be seen in the above screenshot.

Mark and copy the full path to your clipboard and run the following:

./get_cluster_config.sh <paste path from clipboard> 

This stores a file named cluster.yaml in the current directory that includes all libraries and configuration that we need to run our Ray commands.

Deploy and launch the Ray cluster in IBM Cloud

You can now bring up the Ray cluster by running the following:

ray up cluster.yaml

Type “y” when asked whether you want to launch a new Ray cluster:

The command runs for about two minutes:

Take note of the command output about your new Ray cluster and how you can submit workloads, attach to the cluster or SSH into the head node. In the following, though, we will use the prepared Ray Command CLI script from this repository to submit the workload.

You may also want to check out the cloud console for the VPC instance that was created or selected when you ran lithopscloud earlier. There you can see the Ray cluster VMs that were launched with the ray up command:

Submit the ray_cp scale-out command to the Ray cluster

The ray_cp.sh script represents the scale-out implementation of a copy command for large amounts of object storage data using the Ray cluster that we just brought up. By default, it will interactively prompt you for all required parameters about your source location and target location on object storage. 

To streamline its usage across multiple invocations and to allow you to embed it into automated scripts, you can provide some or all of the required parameters as environment variables. In the example screenshot below, you can see that we set the HMAC credentials (access key and secret key) and HTTPS endpoints for the input bucket and for the output bucket as environment variables. We did not set the input and output bucket names and prefix paths as variables; therefore, the tool prompts us for them interactively:

After the input is gathered, the Ray command first lists all objects in the source object storage bucket and prepares a set of copy tasks for them. In our example in the screenshot below, it found 8,988 objects to copy. The copy command implementation batches multiple objects to copy into a single task. The batch size is currently set to four, so we get 2,247 tasks. The command then submits these Ray tasks to the Ray cluster, as you can see in the screenshot:
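
As a rough sketch of that batching step (illustrative only, not the repository's exact code), grouping the listed object keys into batches of four turns the 8,988 objects into 2,247 copy tasks:

BATCH_SIZE = 4  # current batch size in the copy command

# continuing the sketch from the architecture section: group the listed keys into fixed-size batches
batches = [keys[i:i + BATCH_SIZE] for i in range(0, len(keys), BATCH_SIZE)]

# 8,988 objects with a batch size of 4 yields 2,247 batches, i.e., 2,247 Ray tasks
print(len(batches))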

As soon as the first tasks complete, the tool starts to report progress on both the accumulated data volume copied so far and how many of the overall tasks have finished:
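
With Ray, this kind of incremental progress reporting can be implemented with ray.wait, which hands back task references as they complete. Here is a sketch, continuing the earlier example where futures holds the submitted copy tasks and each task returns the number of bytes it copied:

remaining = futures
finished, copied_bytes = 0, 0

while remaining:
    # block until at least one more task has completed
    done, remaining = ray.wait(remaining, num_returns=1)
    copied_bytes += sum(ray.get(done))
    finished += len(done)
    print(f"{finished}/{len(futures)} tasks finished, {copied_bytes / 1e9:.2f} GB copied so far")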

When all tasks have finished, the copy command prints a short elapsed time and throughput summary:

In our example, we used a Ray cluster with four workers and eight vCPUs per worker. Since a simple object copy is primarily network bound, we can safely overcommit the vCPUs by a decent factor. The current copy command implementation overcommits by a factor of five. With the four workers plus one Ray head node, eight vCPUs per node and the vCPU overcommitting, the Ray scheduler can run (4 + 1) * 8 * 5 = 200 copy tasks in parallel. With this scheduling setup, we could copy the 2 TB of data in 8,988 objects across cloud regions, from the IBM Cloud Dallas region in the US to the IBM Cloud Frankfurt region in the EU, in about 20 minutes.
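
In Ray, such overcommitting can be expressed by requesting a fractional number of CPUs per task, so the scheduler packs correspondingly more tasks onto each node. The following sketch shows the factor-of-five overcommit described above; the repository may implement it differently:

import ray

@ray.remote(num_cpus=0.2)  # request 1/5 of a vCPU per task, i.e., a 5x overcommit
def copy_batch(batch):
    # network-bound copy work for one batch of objects (see the earlier sketch)
    ...

# (4 workers + 1 head node) * 8 vCPUs * 5 tasks per vCPU = 200 copy tasks in parallel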

A larger Ray cluster would have allowed for even more parallelism and a shorter elapsed time for the 2 TB copy. In addition to the Ray cluster size, and depending on the size and number of the objects to copy, you can tune the batch size and vCPU overcommitting to achieve optimal performance.

The copy command also implements a simple retry logic for tasks to account for the fact that in large scale-out processing, the odds are that one of the many tasks will fail for some transient reason.
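
One simple way to express such retries (a sketch only; the actual Ray-Commands code may look different) is to wrap the per-task copy work in a small retry loop:

import time

def with_retries(do_copy, attempts=3, delay_seconds=5):
    # retry a copy task a few times before giving up, to ride out transient errors
    for attempt in range(1, attempts + 1):
        try:
            return do_copy()
        except Exception:
            if attempt == attempts:
                raise
            time.sleep(delay_seconds)

Ray itself also offers task-level retry options (for example, the max_retries argument of @ray.remote), which can complement such application-level retries.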

Tear down the Ray cluster

When you're done with your Ray commands, you should deprovision the Ray cluster again to avoid accumulating costs for idling resources in your account. This takes just a few seconds and can be done using ray down cluster.yaml:

You can bring up a fresh Ray cluster quickly (about two minutes) at any time by running the ray up command again.

Conclusion

In this article, we have demonstrated a simple way to leverage the Ray compute framework to perform typical daily operational tasks at cloud scale on large volumes of data in object storage. The Ray-Commands repository that was used in this demonstration includes an implementation of a copy command. However, it is entirely open source, and it can and should be extended with further commands by applying the pattern of the existing copy command implementation.

Learn more about IBM Cloud Virtual Private Cloud (VPC).

