Preparing to use HDFS

Processing jobs can store data in a Hadoop Distributed File System (HDFS) data lake.

About this task

Restriction: The Developer Edition does not support HDFS data storage.

HDFS is required if you enable ingestion of raw events on the processing jobs, typically for later reuse of your business data.

You can install IBM® Business Automation Insights without HDFS storage and enable it later, as described in Configuring HDFS long-term storage.

 New in 18.0.2  You can enable HDFS at initial configuration and disable it later, as described in Advanced updates.

Supported HDFS versions
IBM Business Automation Insights supports HDFS 2.7.x, 2.8.x, and 2.9.x.
Storage bucket
IBM Business Automation Insights requires a dedicated storage bucket for processing jobs to store data in HDFS.
Permissions
Processing jobs access HDFS as a user named bai. However, when Kerberos is enabled with HDFS, processing jobs access HDFS as the Kerberos principal. Therefore, depending on your use case, make sure that the following prerequisites are met (a verification sketch follows this list).
  • A bai user or Kerberos user exists on your HDFS system.
  • A /user/bai or /user/<kerberos_user_name> directory exists.
  • The user has write access to that directory.
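
If you want to verify these prerequisites programmatically, the following minimal Java sketch uses the standard Hadoop FileSystem API. The NameNode URI, Kerberos principal, and keytab path are placeholders, not values that this documentation prescribes; adapt them to your cluster.

  import java.net.URI;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.fs.permission.FsAction;
  import org.apache.hadoop.security.UserGroupInformation;

  public class CheckHdfsPrerequisites {
      public static void main(String[] args) throws Exception {
          Configuration conf = new Configuration();
          // Assumption: replace with the address of your NameNode.
          URI hdfsUri = URI.create("hdfs://namenode.example.com:8020");

          // If Kerberos is enabled, log in as the principal that the processing
          // jobs use. The principal and keytab path below are placeholders.
          // UserGroupInformation.setConfiguration(conf);
          // UserGroupInformation.loginUserFromKeytab(
          //         "bai@EXAMPLE.COM", "/etc/security/keytabs/bai.keytab");

          try (FileSystem fs = FileSystem.get(hdfsUri, conf, "bai")) {
              Path userDir = new Path("/user/bai");
              if (!fs.exists(userDir)) {
                  System.out.println("Missing directory: " + userDir);
                  return;
              }
              // Throws AccessControlException if bai cannot write to the directory.
              fs.access(userDir, FsAction.WRITE);
              System.out.println("User bai has write access to " + userDir);
          }
      }
  }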

Procedure

  1. Create a user identifier, for example bai.

    IBM Business Automation Insights uses this user identifier to write events to HDFS.

  2. Create a directory.

    The directory path must be /user/ followed by the user identifier, for example /user/bai. A combined sketch for steps 2 and 3 follows this procedure.

  3. Give the bai user write permission to that directory.
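
The following Java sketch illustrates steps 2 and 3 with the Hadoop FileSystem API. It is a minimal example under stated assumptions, not the only way to perform these steps: the NameNode URI and the HDFS superuser name (hdfs) are placeholders that you must adapt to your cluster.

  import java.net.URI;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.fs.permission.FsAction;
  import org.apache.hadoop.fs.permission.FsPermission;

  public class CreateBaiUserDirectory {
      public static void main(String[] args) throws Exception {
          Configuration conf = new Configuration();
          // Assumption: replace with the address of your NameNode.
          URI hdfsUri = URI.create("hdfs://namenode.example.com:8020");

          // Changing ownership requires HDFS superuser privileges, so this
          // sketch connects as the superuser (often named "hdfs").
          try (FileSystem fs = FileSystem.get(hdfsUri, conf, "hdfs")) {
              Path userDir = new Path("/user/bai");
              fs.mkdirs(userDir);                  // Step 2: create /user/bai
              fs.setOwner(userDir, "bai", "bai");  // Step 3: make bai the owner
              // Grant the owner full access (rwxr-x---).
              fs.setPermission(userDir,
                      new FsPermission(FsAction.ALL, FsAction.READ_EXECUTE, FsAction.NONE));
          }
      }
  }

Making bai the owner of the directory gives it write access through the owner permission bits. Equivalently, an administrator can run the hdfs dfs -mkdir, -chown, and -chmod commands to achieve the same result.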