Objective
This blog describes the Hadoop HDFS High Availability feature. First we will discuss what exactly high availability is, how Hadoop achieves high availability, and why HDFS needs a high availability feature. We will also walk through an example of the Hadoop high availability feature in this Big Data Hadoop tutorial.
What is Hadoop HDFS High Availability?
Hadoop HDFS is a distributed file system. HDFS distributes data among the nodes of the Hadoop cluster by creating replicas of each file. The Hadoop framework stores these replicas on other machines in the cluster. So when an HDFS client wants to access its data, it can read that data from any of several machines in the cluster, and the data is typically served from the closest node. Under unfavorable conditions, such as the failure of a node, the client can still access the data from another node. This feature of Hadoop is called High Availability.
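The replication idea above can be sketched with a toy model. This is a simplified illustration, not the real HDFS block placement policy (which is rack-aware and runs inside the NameNode); all names here are hypothetical.

```python
def place_replicas(block, nodes, replication_factor=3):
    """Toy block placement: store a block on the first
    `replication_factor` live nodes. Real HDFS placement is
    rack-aware; this sketch is not."""
    targets = [n for n in nodes if n["alive"]][:replication_factor]
    for node in targets:
        node["blocks"].add(block)
    return [n["name"] for n in targets]

# Four DataNodes, all alive, no blocks stored yet.
nodes = [{"name": f"dn{i}", "alive": True, "blocks": set()} for i in range(4)]
print(place_replicas("blk_001", nodes))  # ['dn0', 'dn1', 'dn2']
```

With a replication factor of 3 (the HDFS default), the loss of any single node still leaves two copies of the block readable.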
How is High Availability achieved in Hadoop?
An HDFS cluster contains a number of DataNodes. At a definite interval (every three seconds by default), each DataNode sends a heartbeat message to the NameNode. If the NameNode stops receiving heartbeat messages from a DataNode, it assumes that node is dead. It then checks which data was stored on that node and instructs other DataNodes to create new replicas of that data. Therefore the data is always available.
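The heartbeat-and-re-replication logic described above can be sketched as follows. This is a minimal simulation, assuming a simple timestamp map per node; the constants mirror common HDFS defaults (`dfs.heartbeat.interval` of 3 seconds, and roughly 10.5 minutes before a silent node is declared dead), but the function and variable names are illustrative only.

```python
HEARTBEAT_INTERVAL = 3    # seconds, mirrors the dfs.heartbeat.interval default
DEAD_NODE_TIMEOUT = 630   # ~10.5 minutes: silent nodes past this are declared dead

def find_dead_nodes(last_heartbeat, now, timeout=DEAD_NODE_TIMEOUT):
    """Return nodes whose last heartbeat is older than `timeout`."""
    return [node for node, ts in last_heartbeat.items() if now - ts > timeout]

def blocks_to_rereplicate(dead_nodes, block_locations):
    """List every block stored on a dead node, so the NameNode can
    ask the surviving DataNodes to create fresh replicas of it."""
    return sorted({blk for node in dead_nodes
                   for blk in block_locations.get(node, [])})

# dn1 last reported at t=0, dn2 at t=900; at t=1000 only dn1 is dead.
dead = find_dead_nodes({"dn1": 0, "dn2": 900}, now=1000)
print(dead)                                            # ['dn1']
print(blocks_to_rereplicate(dead, {"dn1": ["blk_1", "blk_2"]}))  # ['blk_1', 'blk_2']
```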
When a client requests data access in HDFS, the NameNode first looks up the DataNodes on which that data is quickly available, and then grants the client access to it. Clients do not have to search all the DataNodes for the data themselves: the HDFS NameNode makes the data easy to reach by giving the client the address of a DataNode from which it can read directly.
Example of Hadoop High Availability
Hadoop HDFS provides high availability of data. When a client asks the NameNode for data access, the NameNode looks up all the nodes on which that data is available and directs the client to the node from which the data can be served most quickly. If, while doing so, the NameNode finds that some node is dead, it redirects the client, without the user's knowledge, to another node that holds the same data. The data is made available to the user without any interruption, so even when a node fails, the data remains highly available.
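The failover behavior in this example can be sketched as a toy client read path. This is an assumption-laden simplification (the real HDFS client receives an ordered replica list from the NameNode and handles failures with block-level retries); all names below are hypothetical.

```python
def read_block(block, replica_nodes, node_state):
    """Toy read path: try replicas in the order the NameNode returned
    them, silently skipping dead nodes, so the caller never notices
    a single-node failure."""
    for node in replica_nodes:
        if node_state.get(node) == "alive":
            return f"data({block})@{node}"
    raise IOError(f"no live replica for {block}")

# dn0 has failed, so the read transparently falls through to dn1.
state = {"dn0": "dead", "dn1": "alive", "dn2": "alive"}
print(read_block("blk_001", ["dn0", "dn1", "dn2"], state))  # data(blk_001)@dn1
```

Only when every replica of a block is unreachable does the client see an error, which is why a replication factor of 3 makes unavailability rare in practice.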
What were the issues in legacy systems?
- Data became unavailable when a machine crashed.
- HDFS clients had to wait a long time to access their data; often users had to wait until the site came back up.
- Limited functionality and features.
- Because data was unavailable, many major projects at organizations were delayed for long periods, putting companies in critical situations.