Hadoop support for IBM Spectrum Scale

The Apache Hadoop framework features open source software that enables distributed processing of large data sets across clusters of commodity servers. It is designed to scale up from a single server to thousands of machines, with a high degree of fault tolerance.

The Apache Hadoop framework uses simple programming models to distribute the processing of large data sets across clusters of computers. Hadoop targets data-intensive computational tasks with data volumes that range from hundreds of terabytes (TB) to tens of petabytes (PB). This computation model differs from the model that is used in traditional high-performance computing (HPC) environments.

Hadoop consists of many open source modules. One of the primary modules is the Hadoop Distributed File System (HDFS), a distributed file system that runs on commodity hardware. However, HDFS lacks the enterprise-class capabilities that are necessary for reliability, data management, and data governance. IBM Spectrum Scale™, a scale-out distributed file system, offers an enterprise-class alternative to HDFS.
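
Because Hadoop applications access storage through the Hadoop FileSystem abstraction rather than through HDFS directly, the file system layer can be replaced without changing application code. The following minimal sketch (the input path is a placeholder, and a standard Hadoop client classpath is assumed) reads a file through that abstraction; whether the bytes come from HDFS or from IBM Spectrum Scale is decided by the cluster configuration, not by the program.

  import java.io.BufferedReader;
  import java.io.InputStreamReader;
  import java.nio.charset.StandardCharsets;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class ReadThroughFileSystemApi {
      public static void main(String[] args) throws Exception {
          // Configuration picks up core-site.xml from the classpath; the
          // fs.defaultFS setting there selects the file system implementation
          // (HDFS, the IBM Spectrum Scale connector, or another one).
          Configuration conf = new Configuration();

          try (FileSystem fs = FileSystem.get(conf);
               BufferedReader reader = new BufferedReader(
                   new InputStreamReader(fs.open(new Path("/user/demo/sample.txt")),
                                         StandardCharsets.UTF_8))) {
              // /user/demo/sample.txt is a hypothetical path used only for illustration.
              String line;
              while ((line = reader.readLine()) != null) {
                  System.out.println(line);
              }
          }
      }
  }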

Hadoop collaboration with IBM Spectrum Scale

IBM Spectrum Scale integrates with Hadoop applications through the Hadoop connector, so that you can use the following IBM Spectrum Scale enterprise-level functions with Hadoop:
  • POSIX-compliant APIs or the command line (see the sketch after this list)
  • FIPS- and NIST-compliant data encryption
  • Disaster recovery
  • Simplified data workflow across applications
  • Snapshot support for point-in-time data captures
  • Simplified capacity management by using IBM Spectrum Scale for all storage needs
  • Policy-based information lifecycle management capabilities to manage PBs of data
  • Infrastructure to manage multi-tenant Hadoop clusters based on service-level agreements (SLAs)
  • Simplified administration and automated recovery
  • Multiple clusters
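
To illustrate the first item in the list, the following sketch reads output that a Hadoop job wrote to an IBM Spectrum Scale file system by using only standard Java file APIs over the POSIX interface; the mount point /gpfs/fs1 and the file name are placeholders, and no Hadoop client libraries are involved.

  import java.io.IOException;
  import java.nio.charset.StandardCharsets;
  import java.nio.file.Files;
  import java.nio.file.Path;
  import java.nio.file.Paths;

  public class PosixRead {
      public static void main(String[] args) throws IOException {
          // Placeholder location: an IBM Spectrum Scale file system mounted at /gpfs/fs1.
          Path results = Paths.get("/gpfs/fs1/user/demo/part-r-00000");

          // Ordinary POSIX-style file access; no Hadoop connector is required
          // because the same data is visible through the file system mount.
          for (String line : Files.readAllLines(results, StandardCharsets.UTF_8)) {
              System.out.println(line);
          }
      }
  }

The same data also remains accessible from the command line with standard tools, which is what the first list item refers to.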