Hadoop is a Big Data computing platform for handling large datasets. It has two core components: HDFS and MapReduce. In this tutorial, we will focus on HDFS (Hadoop Distributed File System), the storage component of Hadoop. HDFS is a distributed file system that is optimized for high throughput and replicates data blocks to handle failures.
HDFS stands for Hadoop Distributed File System. It is a file system inspired by the Google File System and designed for running MapReduce jobs: jobs read their input from HDFS in large chunks, process it, and write their output back in large chunks. HDFS is written in Java. It provides scalable and reliable data storage and is designed to run on large clusters of commodity servers.
An HDFS cluster has a single NameNode, which acts as the master node, and multiple DataNodes. The NameNode manages the file system namespace and regulates access to files, while each DataNode provides block storage and retrieval services. The NameNode keeps the file system namespace and the file-to-block mapping (Blockmap) in memory and records every change to this metadata, so that failures and system crashes can be handled gracefully.
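As a rough illustration of this split, the sketch below uses Hadoop's Java FileSystem API: the client asks the NameNode for file metadata (names, sizes, replication factors), while the actual block contents would be read from DataNodes. The NameNode address and the /user/data path are placeholders for the example.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListHdfsDirectory {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder NameNode address; normally set in core-site.xml.
        conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020");
        FileSystem fs = FileSystem.get(conf);

        // Metadata such as file names, sizes and replication factors
        // comes from the NameNode, not from the DataNodes.
        for (FileStatus status : fs.listStatus(new Path("/user/data"))) {
            System.out.printf("%s  %d bytes  replication=%d%n",
                    status.getPath(), status.getLen(), status.getReplication());
        }
        fs.close();
    }
}
```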
In Hadoop 1.x, the JobTracker runs on the master node and a TaskTracker runs on every slave node (the master node can also run a TaskTracker). Each slave node therefore combines processing (TaskTracker) with storage (DataNode).
The JobTracker keeps track of every resource in the Hadoop cluster, schedules tasks, and assigns them to the TaskTracker nodes, regularly receiving progress updates in return. Each TaskTracker receives execution requests from the JobTracker, tracks each task (mapper or reducer), and reports its workload back to the JobTracker. The TaskTracker's main responsibility is to manage the processing resources on its slave node.
In HDFS, each file is broken into equal-sized blocks, and blocks are replicated to provide fault tolerance. The NameNode continuously receives a heartbeat (status update) and a block report from each DataNode in the cluster.
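The sketch below asks the NameNode where the blocks of a file are stored and on which DataNodes each block's replicas live. It assumes a file already exists at the hypothetical path /user/data/sample.txt and that the client is configured to reach the cluster.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlockLocations {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/user/data/sample.txt"); // hypothetical file
        FileStatus status = fs.getFileStatus(file);

        // Each BlockLocation describes one block and the DataNodes holding its replicas.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.printf("offset=%d length=%d hosts=%s%n",
                    block.getOffset(), block.getLength(),
                    String.join(",", block.getHosts()));
        }
        fs.close();
    }
}
```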
A DataNode failure means the failure of a slave node. Each DataNode periodically sends a heartbeat signal to the NameNode, which helps track the DataNode's working status. When the NameNode stops receiving heartbeat signals from a DataNode, it assumes that the DataNode is dead and re-replicates the blocks of the dead DataNode onto other DataNodes.
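As a hedged sketch of how this heartbeat-driven view can be inspected from a client (similar information to what the hdfs dfsadmin -report command prints), the code below asks the NameNode for its current list of DataNodes. It assumes the client is already configured to reach the cluster.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.protocol.DatanodeInfo;

public class ListDataNodes {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        if (fs instanceof DistributedFileSystem) {
            DistributedFileSystem dfs = (DistributedFileSystem) fs;
            // The NameNode's view of the cluster, built from DataNode heartbeats.
            for (DatanodeInfo node : dfs.getDataNodeStats()) {
                System.out.println(node.getHostName()
                        + "  last heartbeat (ms since epoch): " + node.getLastUpdate());
            }
        }
        fs.close();
    }
}
```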
The NameNode holds the metadata for the whole Hadoop cluster, which means it keeps track of every DataNode and maintains its records. The NameNode is therefore a single point of failure (SPOF): if it crashes due to a system or hard drive failure, the metadata for the whole cluster is lost.
Why is the NameNode a single point of failure? What is bad or difficult about having a complete duplicate of the NameNode running as well?
We can mitigate this SPOF by running two NameNodes, where one acts as the primary NameNode and the other as the Secondary NameNode. Recovery relies on regular checkpoints taken on the Secondary NameNode. The Secondary NameNode is not a complete backup of the NameNode, but it keeps a copy of the primary NameNode's metadata and periodically performs a checkpoint process.
The main problem with the Secondary NameNode is that it does not provide automatic failover. In case of a NameNode failure, the Hadoop administrator has to recover the data manually from the Secondary NameNode.
Hadoop 2.x offers high availability compared to Hadoop 1.x. It introduces the concept of a Standby NameNode, which addresses the Single Point of Failure (SPOF) problem by providing automatic failover when the active NameNode fails.
Here, Hadoop runs two NameNodes alongside one another, so that if one NameNode fails the cluster can quickly switch to the other. In this standby NameNode setup, each DataNode sends its heartbeats and block reports to both NameNodes, and the NameNodes share their edit log through a shared directory on a Network File System.
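To make this concrete, the hedged sketch below shows how an HDFS client can be pointed at an HA nameservice instead of a single NameNode host. These properties are normally set in hdfs-site.xml rather than in code; the nameservice name mycluster and the host names are hypothetical placeholders.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class HaClientConfigSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Clients address the logical nameservice, not a single NameNode host.
        conf.set("fs.defaultFS", "hdfs://mycluster");
        conf.set("dfs.nameservices", "mycluster");
        conf.set("dfs.ha.namenodes.mycluster", "nn1,nn2");
        conf.set("dfs.namenode.rpc-address.mycluster.nn1", "namenode1.example.com:8020");
        conf.set("dfs.namenode.rpc-address.mycluster.nn2", "namenode2.example.com:8020");
        // Proxy provider that retries against the other NameNode on failover.
        conf.set("dfs.client.failover.proxy.provider.mycluster",
                 "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider");

        FileSystem fs = FileSystem.get(conf);
        System.out.println("Connected to: " + fs.getUri());
        fs.close();
    }
}
```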
Hadoop 2.x keeps 3 replicas of each block by default to handle any kind of failure. This is a good strategy for data locality, but keeping multiple copies adds extra storage overhead and slows down the overall throughput. Hadoop 3.x reduces this 200% storage overhead by using erasure coding, which is cost-effective and saves I/O.
First, the data is split into blocks in HDFS and passed to the erasure encoder, which outputs several parity blocks. A combination of data and parity blocks is known as an encoding group, and in case of a failure, erasure decoding uses the surviving blocks of the group to reconstruct the original data.
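To make the storage saving concrete: with three-way replication, 6 data blocks occupy 18 blocks of raw storage (200% overhead), while an RS(6,3) erasure coding policy, such as the one Hadoop 3.x ships with, stores the same 6 data blocks plus only 3 parity blocks, a 50% overhead. The toy sketch below uses a single XOR parity block, the simplest possible erasure code, to show how one lost block can be rebuilt from the survivors; it only illustrates the idea and is not Hadoop's actual Reed-Solomon implementation.

```java
import java.util.Arrays;

public class XorParityDemo {
    public static void main(String[] args) {
        // Three toy "data blocks" of four bytes each.
        byte[][] data = {
            {1, 2, 3, 4},
            {5, 6, 7, 8},
            {9, 10, 11, 12}
        };

        // Compute the parity block: byte-wise XOR of all data blocks.
        byte[] parity = new byte[4];
        for (byte[] block : data) {
            for (int i = 0; i < parity.length; i++) {
                parity[i] ^= block[i];
            }
        }

        // Simulate losing block 1, then reconstruct it by XOR-ing the
        // parity block with the surviving data blocks.
        byte[] rebuilt = parity.clone();
        for (int b = 0; b < data.length; b++) {
            if (b == 1) continue; // block 1 is "lost"
            for (int i = 0; i < rebuilt.length; i++) {
                rebuilt[i] ^= data[b][i];
            }
        }
        System.out.println(Arrays.toString(rebuilt)); // prints [5, 6, 7, 8]
    }
}
```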
Congratulations, you have made it to the end of this tutorial!
In this tutorial, you have covered a lot of details about HDFS: what HDFS is, its features, how Hadoop works, the NameNode and DataNode, the JobTracker and TaskTracker, data replication, handling DataNode and NameNode failures, and the differences between Hadoop 2.x and Hadoop 3.x. We also discussed how HDFS compares with NTFS and RDBMS.
Hopefully, you can now apply this HDFS knowledge in your big data and data science career. Thanks for reading this tutorial!