What is Hadoop Distributed File System (HDFS)?

14 June 2024

What is HDFS?

Hadoop Distributed File System (HDFS) is a distributed file system that manages large data sets and runs on commodity hardware. HDFS is the most popular data storage system for Hadoop and can scale a single Apache Hadoop cluster to hundreds or even thousands of nodes. Because it manages big data efficiently and with high throughput, HDFS can serve as the storage layer of a data pipeline and is well suited to supporting complex data analytics.

HDFS is an open source framework and one of the major components of Apache Hadoop, the others being MapReduce and YARN. HDFS should not be confused with Apache HBase, a column-oriented, non-relational database management system that sits on top of HDFS and can better support real-time data needs with its in-memory processing engine.

Benefits of HDFS

Fault tolerance and fast recovery from hardware failures

Because one HDFS instance might consist of thousands of servers, the failure of at least one server is always a possibility. HDFS is built to detect faults and recover from them quickly and automatically. Data replication, with multiple copies across many nodes, helps protect against data loss, and HDFS keeps at least one copy on a different rack from all other copies. Storing data across the nodes of a large cluster in this way increases reliability. In addition, HDFS can take storage snapshots to save point-in-time (PIT) information.
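
For example, once an administrator has enabled snapshots on a directory, a client can capture its point-in-time state through the Java FileSystem API. The following is a minimal sketch; the directory path and snapshot name are hypothetical.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class SnapshotExample {
        public static void main(String[] args) throws Exception {
            // Connect to the cluster described by core-site.xml/hdfs-site.xml on the classpath.
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // Hypothetical directory; an administrator must first run:
            //   hdfs dfsadmin -allowSnapshot /data/reports
            Path dir = new Path("/data/reports");

            // Capture point-in-time state; the snapshot appears under /data/reports/.snapshot/.
            Path snapshot = fs.createSnapshot(dir, "pit-2024-06-14");
            System.out.println("Snapshot created at: " + snapshot);

            fs.close();
        }
    }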

Access to streaming data

HDFS is intended more for batch processing than for interactive use, so the design emphasizes high data throughput rates, which accommodate streaming access to data sets.
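
The Java FileSystem API reflects this streaming orientation: a client opens a file and reads it sequentially as a stream. A minimal sketch, with a hypothetical file path:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class StreamingRead {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // Hypothetical file path, used here only for illustration.
            try (FSDataInputStream in = fs.open(new Path("/data/logs/events.log"))) {
                // Stream the whole file to stdout in 4 KB chunks; HDFS is optimized
                // for exactly this kind of sequential, high-throughput scan.
                IOUtils.copyBytes(in, System.out, 4096, false);
            }
            fs.close();
        }
    }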

Accommodation of large data sets

HDFS accommodates applications that use data sets typically ranging from gigabytes to terabytes in size. HDFS provides high aggregate data bandwidth, scales to hundreds of nodes in a single cluster and can help drive high-performance computing (HPC) systems. Data lakes are often stored on HDFS. Data warehouses have also been built on HDFS, though less often now, due to the perceived complexity of operating it.

Cost-efficiency

Because data is stored virtually across inexpensive commodity hardware, the costs of storing file system metadata and file system namespace data can be reduced.

Portability

To facilitate adoption, HDFS is designed to be portable across multiple hardware platforms and to be compatible with various underlying operating systems, including Linux, macOS and Windows. In addition, Hadoop data lakes can hold unstructured, semistructured and structured data, for maximum flexibility. While Hadoop is coded in Java, other languages, including C++, Perl, Python and Ruby, enable its use in data science.

Processing speed

HDFS uses a cluster architecture to help deliver high throughput. To reduce network traffic, the Hadoop file system stores data in the DataNodes where computations take place, rather than moving the data to another location for computation.
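
A client or scheduler can ask the NameNode where each block of a file physically resides and then run computation on those same nodes. A minimal sketch using the Java FileSystem API, with a hypothetical file path:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockLocations {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // Hypothetical file path.
            FileStatus status = fs.getFileStatus(new Path("/data/logs/events.log"));

            // Ask the NameNode which DataNodes hold each block; a scheduler can use
            // this to send computation to the nodes that already store the data.
            BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
            for (BlockLocation block : blocks) {
                System.out.printf("offset=%d length=%d hosts=%s%n",
                        block.getOffset(), block.getLength(),
                        String.join(",", block.getHosts()));
            }
            fs.close();
        }
    }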

Scalability

With both horizontal and vertical scalability features, HDFS can be quickly adjusted to match an organization’s data needs. A cluster might include hundreds or thousands of nodes.

HDFS architecture and how it works

HDFS has a director/worker architecture.

  • An HDFS cluster includes one NameNode, which is the director server. The NameNode tracks the status of all files, the file permissions and the location of every block. The NameNode software manages the file system namespace, which in turn tracks and controls client access to the files, and performs operations such as opening and closing files and renaming directories and files.

    The file system namespace also divides files into blocks and maps the blocks to the DataNodes, the worker portion of the system. With only a single NameNode per cluster, the architecture simplifies data management and the storage of HDFS metadata. In addition, keeping user data from flowing through the NameNode builds in greater security.

  • Most often there is one DataNode per node in a cluster, managing the data storage within that node. The DataNode software handles block creation, deletion and replication, plus read and write requests. Each DataNode stores HDFS data separately in its local file system, with each block as a separate file. DataNodes are the worker nodes (Hadoop daemons, processes that run in the background) and can run on commodity hardware if an organization wants to economize.

Both the NameNode and the DataNode are software designed to run on a wide variety of operating systems (OS), most often the GNU/Linux OS. HDFS is built in Java, which means that any machine supporting Java can run the NameNode or DataNode software.

Deployments often have a single dedicated machine that runs the NameNode software. Each of the other machines in the cluster then runs a single instance of the DataNode software. Running more than one DataNode on a single machine is possible but rarely done.

When data is brought into HDFS, it is broken into blocks and distributed to different nodes in a cluster. With the data stored across multiple DataNodes, the blocks can be replicated to other nodes to enable parallel processing. The distributed file system (DFS) includes commands to quickly access, retrieve, move and view data. With replicas of data blocks across multiple DataNodes, one copy can be removed without risking corruption of the other copies. The default HDFS block size of 128 MB (Hadoop 2.x) can seem large, but the large block size minimizes seek times and reduces the amount of metadata needed.
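
Both the replication factor and the block size can be overridden per file at creation time through the Java FileSystem API. A minimal sketch, with a hypothetical path and illustrative values:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class CreateWithBlockSize {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // Hypothetical path; these values override the cluster defaults for this file only.
            Path file = new Path("/data/bulk/big-file.bin");
            short replication = 3;                 // three copies of each block
            long blockSize = 256L * 1024 * 1024;   // 256 MB blocks instead of the 128 MB default
            int bufferSize = 4096;

            try (FSDataOutputStream out =
                     fs.create(file, true, bufferSize, replication, blockSize)) {
                out.writeUTF("example payload");
            }
            fs.close();
        }
    }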

To minimize risk and speed processing, when a DataNode stops sending heartbeats to the NameNode, that DataNode is removed from the cluster and operations continue without it. If that DataNode later becomes operational again, it can rejoin the cluster.

HDFS provides flexible access to files through various interfaces: a native Java API ships with HDFS, a C-language wrapper is available for the Java API, and an HTTP browser can be used to browse the files of an HDFS instance.
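
As one example of the HTTP access path, the WebHDFS REST API exposes file system operations as plain HTTP calls. The sketch below lists a directory over HTTP; the NameNode host name and directory path are hypothetical, and 9870 is the default WebHDFS port in Hadoop 3.x.

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class WebHdfsList {
        public static void main(String[] args) throws Exception {
            // Hypothetical NameNode host; WebHDFS URLs take the form
            // http://<namenode>:<port>/webhdfs/v1/<path>?op=<OPERATION>
            URI uri = URI.create(
                "http://namenode.example.com:9870/webhdfs/v1/data/logs?op=LISTSTATUS");

            HttpClient client = HttpClient.newHttpClient();
            HttpRequest request = HttpRequest.newBuilder(uri).GET().build();

            // The NameNode answers with a JSON document describing the directory's contents.
            HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println(response.body());
        }
    }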

The file system namespace

HDFS is organized with a traditional file hierarchy where the user can create directories that contain multiple files. The hierarchy of the file system namespace is similar to traditional file systems, where the user creates and removes files, moves them between directories and can rename files.
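
These namespace operations are metadata requests served by the NameNode. A minimal sketch using the Java FileSystem API, with hypothetical paths:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class NamespaceOps {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // Create a directory, then rename and delete files. All three are
            // metadata operations handled by the NameNode; paths are hypothetical.
            fs.mkdirs(new Path("/projects/reports/2024"));
            fs.rename(new Path("/projects/reports/draft.csv"),
                      new Path("/projects/reports/2024/final.csv"));
            fs.delete(new Path("/projects/reports/old.csv"), false); // false = not recursive

            fs.close();
        }
    }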

The file system namespace is maintained by the NameNode, which keeps a record of any changes to it. The NameNode also records the total number of replicas to be saved for any file, known as the replication factor for that file. The replication factor can be set when the file is created and modified later as needed.
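
A minimal sketch of both approaches: setting a client-side default that applies to newly created files, then raising the replication factor of an existing (hypothetical) file afterward.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class SetReplication {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Default replication factor for files this client creates.
            conf.setInt("dfs.replication", 3);
            FileSystem fs = FileSystem.get(conf);

            // Raise the replication factor of an existing (hypothetical) file to 5;
            // the NameNode schedules the extra copies in the background.
            fs.setReplication(new Path("/data/reports/critical.csv"), (short) 5);

            fs.close();
        }
    }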

Data replication

To provide reliable storage, HDFS stores large files in multiple locations in a large cluster, with each file held as a sequence of blocks. Every block in a file is the same size, except for the final block, which fills as data is added.

For added protection, HDFS files are write-once, with only one writer at any time. To help ensure that all data is replicated as instructed, the NameNode receives a heartbeat (a periodic status report) and a blockreport (the block ID, generation stamp and length of every block replica) from each DataNode attached to the cluster. Receiving a heartbeat indicates that the DataNode is working correctly.

The NameNode selects the rack ID for each DataNode by using a process called Hadoop Rack Awareness to help prevent the loss of data if an entire rack fails. This also enables the use of bandwidth from multiple racks when reading data.
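
The rack placement of replicas is visible to clients: each block location carries a topology path that encodes the rack resolved by the cluster's rack-awareness configuration. A short sketch, with a hypothetical file path:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class RackAwareness {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // Hypothetical file; each replica's topology path encodes its rack,
            // for example /rack1/datanode07.
            FileStatus status = fs.getFileStatus(new Path("/data/logs/events.log"));
            for (BlockLocation block :
                    fs.getFileBlockLocations(status, 0, status.getLen())) {
                System.out.println(String.join(", ", block.getTopologyPaths()));
            }
            fs.close();
        }
    }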

HDFS example and use cases

Consider a file that includes phone numbers for an entire country. The numbers for people with a surname starting with A might be stored on server 1, B on server 2 and so on. With Hadoop, pieces of this telephone directory would be stored across a single cluster, and to reconstruct the entire phonebook, an application would need the blocks from every server in the cluster.

To help ensure high availability if and when a server fails, HDFS replicates these smaller pieces onto two more servers by default. (This redundancy can be increased or decreased on a per-file basis or for a whole environment. For example, a development Hadoop cluster typically doesn’t need any data redundancy.)

This redundancy also enables the Hadoop cluster to break up work into smaller chunks and run those jobs on all the servers in the cluster for better scalability. Finally, an organization gains the benefit of data locality, which is critical when working with large data sets.

HDFS can also enable artificial intelligence (AI) and machine learning (ML) by scaling effectively: first to store data in the large quantities required to train ML models, and then to access those enormous data sets.

Any organization that captures, stores and uses large datasets—up to petabytes—might consider using HDFS. A few industry-based use cases show how HDFS might be implemented.

  • Energy: When using phasor measurement units (PMUs) to monitor the performance of smart grids in their transmission networks, electric companies can amass enormous volumes of data, with thousands of records per second. HDFS might be the cost-effective, highly available file system they need.

  • Healthcare: The volume of medical records grows every day. Medical equipment and patient sensor data can be efficiently collected and stored for more responsive treatment and research.

  • Marketing: Data from customer relationship management (CRM) systems, point-of-sale (PoS) systems, campaign responses and social media—much of it unstructured—needs to be collected for analysis and to guide future marketing efforts. HDFS clusters might provide a budget-saving solution for storing and analyzing the large amounts of data generated.

  • Oil and gas: An HDFS cluster can help to unify all of the data that arrives in various formats to make it available for big data analytics. This can include everything from 3D earth models to videos, customer purchases or equipment sensor data.

  • Retail: To better understand their customers, retailers can assemble and analyze data from multiple sources, including sales records, customer service interactions and social media, spanning both unstructured and structured data, to develop new engagement strategies.

  • Telecommunications: HDFS can help telecom businesses build robust network paths, conduct predictive maintenance, indicate promising network expansion options and analyze customer behavior.

The history of HDFS

The origin of Hadoop, according to cofounders Mike Cafarella and Doug Cutting, was the Google File System paper published in 2003. A second paper followed, "MapReduce: Simplified Data Processing on Large Clusters." Development began on an early search engine named Apache Nutch, and the work then moved with Doug Cutting to Yahoo in 2006.

Hadoop was named after a toy elephant belonging to Cutting’s son. (Hence the logo.) The initial Hadoop code was largely based on Nutch—but overcame its scalability limitations—and contained both the early versions of HDFS and MapReduce.

The suite of programs in the Hadoop Ecosystem continues to grow. In addition to HDFS, the ecosystem includes: HBase (a NoSQL database), Mahout and Spark MLlib (machine learning algorithm libraries), MapReduce (programming-based data processing), Oozie (a job scheduler), Pig and Hive (query-based data processing services), Solr and Lucene (searching and indexing), Spark (in-memory data processing), YARN (Yet Another Resource Negotiator) and Zookeeper (cluster coordination).

The open source software within the Hadoop Ecosystem is now managed by the Apache Software Foundation, a worldwide community of software developers and contributors.
