Refine data sets
Before you analyze local data sets, you can refine the data by cleansing and shaping it.
- Cleanse the data
- Fix or remove data that is incorrect, incomplete, improperly formatted, or duplicated.
- Shape the data
- Filter, sort, combine, or remove columns, and perform operations.
You can refine the data in real time to build a customized data flow, and then save the data flow as either a separate JSON file or an R script.
Use the Data Refinery to perform the following tasks:
Set up data
Before you create a data flow, ensure your CSV data meets the following requirements:
- No empty column names
- Data sets also must not contain a comma (,) at the end of a line, because a trailing comma creates a column with no name, as shown in the sketch after this list.
- No duplicate columns
- Although you can still refine data, you won't be able to run a data shaping job on the columns that are duplicated.
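The following minimal R sketch shows why a trailing comma violates the empty-column-name requirement. The file content here is hypothetical; base R's read.csv() fills the empty header with a placeholder name, which Data Refinery treats as an empty column name:

```R
# Each line ends in a comma, which implies a third, unnamed column.
csv_text <- "name,age,
Alice,30,
Bob,25,"

df <- read.csv(text = csv_text)
names(df)  # "name" "age" "X" -- R invents "X" for the empty header field
```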
For remote data sets, the following data source types are supported: HDFS, HDFS-CDH (.CSV file only), DB2, DB2Z, INFORMIX, NETEZZA, ORACLE, DASHDB, BIGSQL.
- For JDBC data source types, you cannot run batch shaping or save an R script.
- Remote shaping is not supported for CDH spark2.
Before you can refine data against data sets that contain the livyspark (Spark 1) definition, you must set up the remote Spark 1 environment to reference additional libraries that allow Spark to read the CSV format:
- Download the two JAR files that are required for Spark CSV reading:

```
!wget http://repo1.maven.org/maven2/com/databricks/spark-csv_2.10/1.5.0/spark-csv_2.10-1.5.0.jar
!wget http://central.maven.org/maven2/org/apache/commons/commons-csv/1.2/commons-csv-1.2.jar
```

- Ensure that the JAR files are available on all nodes that have the Spark client installed. Determine a location to copy the files to; for example, copy commons-csv-1.2.jar and spark-csv_2.10-1.5.0.jar to /usr/hdp/current/spark-client/lib.
- In the spark-defaults.conf file for either Ambari (custom spark-defaults.conf) or Cloudera Manager, include the following code:

```
spark.driver.extraClassPath /usr/hdp/current/spark-client/lib/commons-csv-1.2.jar:/usr/hdp/current/spark-client/lib/spark-csv_2.10-1.5.0.jar
spark.executor.extraClassPath /usr/hdp/current/spark-client/lib/commons-csv-1.2.jar:/usr/hdp/current/spark-client/lib/spark-csv_2.10-1.5.0.jar
```

- Restart Spark to ensure that all new configurations are pushed to the nodes.
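After Spark restarts, you can check that the spark-csv package is visible on the classpath. A minimal sketch, assuming SparkR on a Spark 1.x client and a sample CSV file at a hypothetical HDFS path:

```R
library(SparkR)

# Initialize a Spark 1.x context and SQL context.
sc <- sparkR.init(master = "yarn-client")
sqlContext <- sparkRSQL.init(sc)

# Read a CSV file through the com.databricks.spark.csv data source;
# this fails with a class-not-found error if the JARs are missing.
df <- read.df(sqlContext, "hdfs:///tmp/sample.csv",
              source = "com.databricks.spark.csv", header = "true")
head(df)
```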
Create a data flow
To build a new data flow that refines your data, complete the following steps:
- In your project assets page, click Data sets and then click a local CSV file, remote CSV file, or remote table. The Data Refinery opens automatically if the data set is supported. The Data Refinery will not launch if the data set is not supported, for example, if the data is a JSON file. Also, an empty column header can cause the data load to fail.
- View your data on the Data tab. If your data set is large, you will see
only a subset of the rows even though the operations will run on the entire data set.
- Use the tool to refine the data:
  - Click the icon in a column and select an operation to cleanse, shape, or enrich the column. Alternatively, click Operation to browse operation categories or search for a specific operation. You can also manually enter R code in the command line and let autocomplete assist you with the syntax.
- Review and work with the operations.
  Each operation that you add to the data flow appears as a step in the Steps pane. You can click each step to see a snapshot of how the operation will refine the data. You can also undo, redo, edit, or delete steps. To insert an operation between two existing steps, click the step before that position and then apply the new operation.
- Save the data flow:
  - To save the data flow as a separate JSON file in the Data Flows section of your assets page, click the icon and then click Save data flow.
  - To save the data flow as an R script in the Scripts section of your assets page, click the icon and then click Save R Script.
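A saved R script captures the flow's steps as a sequence of operations. The exact generated code depends on the product version; the following is only a hypothetical sketch of the general shape, assuming dplyr-style steps and illustrative column names:

```R
library(dplyr)

# Illustrative input data; in a real script this is the loaded data set.
df <- data.frame(cust_name = c("Acme", "Globex"),
                 revenue   = c(1200, NA),
                 stringsAsFactors = FALSE)

# Each flow step becomes one operation in the pipeline.
refined <- df %>%
  rename(customer = cust_name) %>%        # Rename step
  filter(!is.na(revenue)) %>%             # Remove empty rows step
  mutate(revenue_k = revenue / 1000) %>%  # Calculate step
  arrange(desc(revenue_k))                # Sort descending step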
Frequently used operations include the following; an R sketch after this list illustrates several of them:
- Calculate
- Perform a calculation with another column or with a specified value. The operators are:
- Addition
- Subtraction
- Multiplication
- Division
- Modulus
- Exponentiation
- Is equal to
- Is not equal to
- Is less than
- Is less than or equal to
- Convert column type
- Convert the type of the column to a different data type. You can create a new column to hold the
result of this operation or you can overwrite the existing column values.
A column's data type determines the operations you can perform. Changing the data type can affect which operations are relevant for that column.
- Filter
- Filter rows by the selected columns. Keep rows with the selected column values; filter out all
other rows.
The filter operators for numeric, string, and Boolean (logical) columns are:
| Operator | Numeric | String | Boolean |
| --- | --- | --- | --- |
| Is between two numbers | ✓ | | |
| Is empty | | ✓ | ✓ |
| Is not empty | | ✓ | ✓ |
| Is equal to | ✓ | ✓ | |
| Is not equal to | ✓ | ✓ | |
| Is less than | ✓ | | |
| Is less than or equal to | ✓ | | |
| Is greater than | ✓ | | |
| Is greater than or equal to | ✓ | | |
| Is in | ✓ | ✓ | |
| Is not in | ✓ | ✓ | |
| Is null | ✓ | ✓ | ✓ |
| Is not null | ✓ | ✓ | ✓ |
- Math
- You can apply math operations only to numeric columns. You can create a new column to hold the
result of an operation or you can overwrite the existing column.
- Absolute value
- Get the absolute value of a number. Example: The absolute value of both 4 and -4 is 4.
- Arc cosine
- Get the arc cosine of an angle.
- Ceiling
- Get the nearest integer of greater value, also known as the ceiling of the number. Examples: The ceiling of 2.31 is 3. The ceiling of -2.31 is -2.
- Exponentiation
- Get a number raised to the power of the column value.
- Floor
- Get the nearest integer of lesser value, also known as the floor of the number. Example: The floor of 2.31 is 2. The floor of -2.31 is -3.
- Round
- Get the whole number nearest to the column value. If the column value is a whole number, return it.
- Square root
- Get the square root of the column value.
- Remove
- Remove the selected column.
- Rename
- Rename the selected column.
- Sort ascending
- Sort rows by the selected column in ascending order.
- Sort descending
- Sort rows by the selected column in descending order.
- Substitute
- Obscure sensitive information from view by substituting a random string of characters for the actual data in the selected column.
- Text
- You can apply text operations only to string columns. You can create a new column to hold the
result of an operation or you can overwrite the existing column.
- Collapse spaces
- Collapse multiple, consecutive spaces in the text to a single space.
- Concatenate string
- Attach a specified string to the text. You can prepend the string to the text, append the string to the text, or both.
- Lower case
- Convert the text to lower case.
- Number of characters
- Return the number of characters in the text.
- Pad characters
- Pad the text with the specified string. Specify whether to pad the text on the left, right, or both the left and right.
- Title case
- Convert the text to title case.
- Trim quotes
- Remove single or double quotation marks from the text.
- Trim spaces
- Remove leading, trailing, and extra spaces from the text.
- Upper case
- Convert the text to upper case.
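The following minimal R sketch shows dplyr-style equivalents of several of these operations, similar to the kind of R code you can enter in the command line. The data frame and column names are illustrative, not part of the product:

```R
library(dplyr)

df <- data.frame(name  = c("  alice ", "BOB", "carol"),
                 price = c("9.99", "24.50", "5.00"),
                 qty   = c(2, 1, 3),
                 stringsAsFactors = FALSE)

refined <- df %>%
  mutate(price = as.numeric(price)) %>%     # Convert column type
  mutate(total = price * qty) %>%           # Calculate: multiplication
  mutate(total = round(total)) %>%          # Math: round
  mutate(name = toupper(trimws(name))) %>%  # Text: trim spaces, upper case
  filter(total >= 10 & total <= 30) %>%     # Filter: is between two numbers
  arrange(desc(total))                      # Sort descending
```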
Cleanse operations:
- Convert column value to missing
- Convert all column values to missing values or convert only the specified column value to a missing value.
- Remove duplicates
- Remove rows with duplicate column values.
- Remove empty rows
- Remove rows that have a blank or missing value for the selected column.
- Replace missing values
- Replace missing values in the column with a specified value or with the value from a specified column in the same row.
- Replace substring
- Replace the specified substring with the specified text.
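As an illustration, the cleanse operations roughly correspond to the following R sketch; the data frame and column names are hypothetical:

```R
library(dplyr)

df <- data.frame(id    = c(1, 2, 2, 3),
                 email = c("a@x.com", NA, NA, ""),
                 note  = c("call ASAP", "ok", "ok", "call ASAP"),
                 stringsAsFactors = FALSE)

cleansed <- df %>%
  distinct(id, .keep_all = TRUE) %>%                         # Remove duplicates
  mutate(email = ifelse(is.na(email), "unknown", email)) %>% # Replace missing values
  filter(email != "") %>%                                    # Remove empty rows
  mutate(note = gsub("ASAP", "today", note, fixed = TRUE))   # Replace substring
```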
Organize operations:
- Concatenate
- Concatenate the values of two or more columns.
- Join
- Not supported.
- Split column
- Split the column by non-alphanumeric characters, position, pattern, or text.
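For example, a Concatenate step and a Split column step might look like the following R sketch, where tidyr's separate() stands in for the split and the column names are illustrative:

```R
library(dplyr)
library(tidyr)

df <- data.frame(first = c("Ada", "Alan"),
                 last  = c("Lovelace", "Turing"),
                 dob   = c("1815-12-10", "1912-06-23"),
                 stringsAsFactors = FALSE)

organized <- df %>%
  mutate(full_name = paste(first, last)) %>%                   # Concatenate two columns
  separate(dob, into = c("year", "month", "day"), sep = "-")   # Split by text
```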
Profile data
To profile data, click the Profile tab to view frequency and summary statistics for each of your columns. Hover over a bar to see the data values.
- Frequency
- The number of times that a value, or a value in a specified range, occurs. Each frequency
distribution (bar) shows the count of unique values in a column.
Review the frequency distribution to find anomalies in your data. If you want to cleanse your data of those anomalies, simply remove the values.
- Statistics
- A collection of quantitative data. The statistics for each column show the minimum, maximum, mean, and number of unique values in that column. Depending on a column's data type, the statistics for each column will vary slightly. For example, statistics for a column of data type integer have minimum, maximum, and mean values while statistics for a column of data type string have minimum length, maximum length, and mean length values.
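For reference, the statistics shown for a numeric column correspond to values you could compute yourself in R; the column name here is illustrative:

```R
# Equivalent statistics for a numeric column named `score`.
df <- data.frame(score = c(55, 88, 72, 88))

c(min    = min(df$score),
  max    = max(df$score),
  mean   = mean(df$score),
  unique = length(unique(df$score)))
# For a string column, the analogous statistics apply to nchar(column) lengths.
```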
Visualize data
Click the Visualizations tab to choose columns and view data visualizations that provide insights and uncover patterns, trends, and correlations in your data.
Run the flow as a job
Click the Job icon to run a Data Refinery job on a local data set or a remote HDFS data set.
For registered Hadoop clusters, you can run a job only on data sets that contain the
livyspark2 definition.
A livyspark job fails the first time that you run it. You must edit the job settings to update the livyspark2 parameter with the livyspark URL so that the job can run against the Spark instance. If a livyspark2 job fails because of missing packages, complete the same Spark 1 steps from Set up data on the Spark 2 clients.