Introduction to IBM Open Data Analytics for z/OS

This topic provides a brief introduction to the product components and terminology in IBM® Open Data Analytics for z/OS® (IzODA).

Product components

IBM Open Data Analytics for z/OS consists of the following components:
z/OS IzODA Spark (FMID HSPK120)
z/OS IzODA Spark (z/OS Spark) is built on Apache Spark, a high-performance, general execution engine for large-scale data processing. One of its key features is the capability to perform in-memory computing. Unlike traditional large data processing technologies, Spark allows caching of intermediate results in memory rather than writing them to disk, thereby dramatically improving the performance of iterative processing. A brief caching sketch follows this component list.
z/OS IzODA Mainframe Data Service (FMID HMDS120)
z/OS IzODA Mainframe Data Service (Data Service or MDS) provides integration facilities for both IBM Z data sources and other off-platform data sources. The Data Service provides your Apache Spark application with optimized, virtualized, and parallelized access to a wide variety of data.
z/OS IzODA Anaconda (FMID HANA110)
z/OS IzODA Anaconda includes Python and Anaconda Python packages for data science, which provide data scientists with a comprehensive solution for integrating computation with the data.
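
The following minimal PySpark sketch illustrates the in-memory caching that z/OS Spark provides. It assumes a working z/OS Spark installation with PySpark available; the application name and data are illustrative only.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("CachingSketch").getOrCreate()

# An intermediate result that later computations will reuse.
accounts = spark.range(0, 1000000).withColumnRenamed("id", "account_id")

# cache() keeps the data in executor memory after the first action,
# so repeated passes avoid recomputation and disk I/O.
accounts.cache()

print(accounts.count())                               # first action: computes and caches
print(accounts.filter("account_id % 2 = 0").count())  # later pass is served from memory

spark.stop()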

Figure 1 illustrates the components in a typical Spark cluster.

Figure 1. Example of components in a typical Spark cluster
Components in a typical Spark cluster include z/OS IzODA Spark, z/OS IzODA Mainframe Data Service, Apache Toree, Jupyter Notebook clients on Linux on z Systems, and data sources such as VSAM, Db2, and HDFS. The z/OS IzODA Spark component contains the driver, master, worker, and executors.

The components are described in Terminology.

z/OS IzODA Livy

z/OS IzODA Livy is built on Apache Livy. It is a REST service, used in conjunction with Spark, that enables users to submit Spark jobs without having the Spark client installed. This enables developers to harness the data analytics power of Spark from within a web or mobile application. Jobs are submitted to the Livy server through REST API calls that contain information about the Spark application to be run. You can also start an interactive Scala or Python session with Livy.
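
The following sketch shows one way to submit a batch application through the Livy REST API from Python. The host name, port, and application path are placeholder assumptions, and the requests package (included in the Anaconda distribution) is assumed to be available.

import json
import time
import requests

livy_url = "http://zos-host:8998"   # placeholder Livy server address
headers = {"Content-Type": "application/json"}

# Describe the Spark application to run; Livy submits it on your behalf,
# so no local Spark client is required.
payload = {"file": "/u/user/apps/wordcount.py",   # hypothetical application path
           "name": "wordcount-via-livy"}

batch = requests.post(livy_url + "/batches", headers=headers,
                      data=json.dumps(payload)).json()
print("Submitted batch", batch["id"], "state:", batch["state"])

# Poll the batch until it leaves the starting/running states.
while batch["state"] in ("starting", "running"):
    time.sleep(5)
    batch = requests.get("{}/batches/{}".format(livy_url, batch["id"]),
                         headers=headers).json()
print("Final state:", batch["state"])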

Terminology

The following terms and abbreviations appear throughout this documentation:
Master
The Spark daemon that allocates resources across applications.
Worker
The Spark daemon that monitors and reports resource availability and, when directed by the master, spawns executors. The worker also monitors the liveness and resource consumption of the executors.
Executor
A process that the worker creates for an application. The executors perform the actual computation and data processing for an application. Each application has its own executors.
Driver program
The process that runs the main function of the application and creates the SparkContext; a minimal driver sketch appears after this list.
SparkContext
Coordinates all executors in the cluster and sends tasks for the executors to run.
Apache Toree
Open source software that provides the foundation for interactive applications to connect to and use z/OS Spark.
Jupyter Notebook
An open source web application that provides an interactive application development environment for data scientists.
Deploy mode
Distinguishes where the driver process runs. In cluster deploy mode, the framework starts the driver inside the cluster. In client deploy mode, the submitter starts the driver from outside the cluster. If you use Jupyter Notebook and Apache Toree to interact with Spark, you are likely using client deploy mode. The default is client deploy mode.
Local mode
A non-distributed, single-JVM deployment mode in which all of the Spark execution components—driver, master, worker, and executors—run in the same JVM.
Cluster mode
Not to be confused with cluster deploy mode. In cluster mode, unlike local mode, each Spark execution component—driver, master, worker, and executors—runs in a separate JVM. An application can be submitted to a Spark cluster in either cluster deploy mode or client deploy mode.
Cluster manager
The software that manages resources for the Spark cluster. Apache Spark supports Standalone, Mesos, and YARN. Only the Standalone cluster manager is available for Open Data Analytics for z/OS.
Task
A unit of work that is sent to one executor.
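
The following minimal driver-program sketch ties several of these terms together. The master URL and partition count are placeholder assumptions; substitute the values for your installation.

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("TerminologySketch")
        # "spark://zos-host:7077" targets a standalone cluster (cluster mode);
        # "local[*]" instead would run every component in one JVM (local mode).
        .setMaster("spark://zos-host:7077"))

# The driver program creates the SparkContext, which obtains executors
# from the master and sends them tasks to run.
sc = SparkContext(conf=conf)

rdd = sc.parallelize(range(100), numSlices=4)   # 4 partitions -> 4 tasks per stage
print(rdd.sum())

sc.stop()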