Using Spark in RStudio
Although the RStudio IDE cannot be started in a Spark with R environment runtime, you can use Spark in your R scripts and Shiny apps by accessing Spark kernels programmatically.
RStudio uses the sparklyr package to connect to Spark from R. The sparklyr package includes a dplyr interface to Spark data frames as well as an R interface to Spark’s distributed machine learning pipelines.
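For example, after a connection sc has been established (the connection commands are shown later in this topic), Spark data frames can be queried with standard dplyr verbs and models can be trained through sparklyr's machine learning functions. This is a minimal sketch that uses only standard sparklyr and dplyr APIs and the built-in mtcars data set:
library(sparklyr)
library(dplyr)
# copy a local data frame to Spark; sc is assumed to be an open Spark connection
mtcars_tbl <- copy_to(sc, mtcars, overwrite = TRUE)
# query the Spark data frame with dplyr verbs; the computation runs in Spark
mtcars_tbl %>%
  group_by(cyl) %>%
  summarise(avg_mpg = mean(mpg, na.rm = TRUE)) %>%
  arrange(cyl)
# train a regression model with Spark's distributed machine learning library
model <- ml_linear_regression(mtcars_tbl, mpg ~ wt + cyl)
summary(model)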
There are two methods of connecting to Spark from RStudio:
- By connecting to a Spark kernel that runs locally in the RStudio container in IBM Watson Studio
- By connecting to a remote Spark kernel that runs outside of IBM Watson Studio in an Analytics Engine powered by Apache Spark service instance
RStudio includes sample code snippets that show you how to connect to a Spark kernel in your applications for both methods.
To use Spark in RStudio after you have launched the IDE:
- Locate the ibm_sparkaas_demos directory under your home directory and open it. The directory contains the following R scripts:
  - A readme with details on the included R sample scripts
  - spark_kernel_basic_local.R includes sample code of how to connect to a local Spark kernel
  - spark_kernel_basic_remote.R includes sample code of how to connect to a remote Spark kernel
  - The files sparkaas_flights.R and sparkaas_mtcars.R are two examples of how to use Spark in a small sample application
- Use the sample code snippets in your R scripts or applications to help you get started using Spark.
Creating custom Spark environments for RStudio
To connect to Spark from RStudio by using the sparklyr R package, you must create either a Scala 2.12 with Spark 3.0 environment or a Scala 2.11 with Spark 2.4 environment.
To create a custom environment definition for Scala 2.12 with Spark 3.0 or Scala 2.11 with Spark 2.4 in which to launch RStudio, see Creating environment definitions.
After you launch RStudio in the custom Spark environment that you created, you can use the following commands to list the Spark kernel details and connect to Spark from RStudio:
# load the Spark R packages
library(ibmwsrspark)
library(sparklyr)

# load the available Spark kernels
kernels <- load_spark_kernels()

# display the Spark kernels
display_spark_kernels()

# get the Spark kernel configuration
conf <- get_spark_config(kernels[1])

# set Spark configuration properties
conf$spark.driver.maxResultSize <- "1G"

# connect to the Spark kernel
sc <- spark_connect(config = conf)
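After the connection succeeds, sc can be used like any other sparklyr connection. For example, a quick way to verify it (a minimal sketch using standard sparklyr functions):
# check the Spark version of the connected kernel
spark_version(sc)
# copy a small data set to Spark and read back the first rows
iris_tbl <- sdf_copy_to(sc, iris, overwrite = TRUE)
head(iris_tbl)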
Then to disconnect from Spark, use:
# disconnect
spark_disconnect(sc)
Examples of these commands are provided in the readme under /home/wsuser/ibm_sparkaas_demos.
Parent topic: RStudio