What is Hadoop?

Hadoop is an open-source software framework used for storing and processing Big Data in a distributed manner on large clusters of commodity hardware. Hadoop is licensed under the Apache License 2.0.

Hadoop was developed based on the paper Google published on its MapReduce system, and it applies concepts of functional programming. Hadoop is written in Java and ranks among the top-level Apache projects. It was developed by Doug Cutting and Michael J. Cafarella.
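The functional-programming idea behind MapReduce can be seen in a toy word count, the canonical MapReduce example. This sketch uses plain Java streams rather than the Hadoop API, so it runs without a cluster: the "map" step emits words, the grouping mirrors the shuffle, and the "reduce" step sums counts per word.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.stream.Collectors;

// Toy word count illustrating the MapReduce phases in plain Java:
// map (split into words), shuffle (group by key), reduce (count per key).
public class WordCount {
    public static Map<String, Long> count(String text) {
        return Arrays.stream(text.toLowerCase().split("\\s+")) // map: emit one word per token
                     .collect(Collectors.groupingBy(           // shuffle: group identical words
                             w -> w,
                             Collectors.counting()));          // reduce: sum occurrences per word
    }

    public static void main(String[] args) {
        System.out.println(count("big data big cluster"));
    }
}
```

In real Hadoop MapReduce the same two phases run as Mapper and Reducer classes distributed across the cluster, but the logic is the same shape.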







It all started with two people, Mike Cafarella and Doug Cutting, who were in the process of building a search engine system that could index 1 billion pages. After their research, they estimated that such a system would cost around half a million dollars in hardware, with a monthly running cost of $30,000, which is quite expensive. They also soon realized that their architecture would not be capable of scaling to the billions of pages on the web.


Features -

Reliability - Data is replicated across a distributed cluster, so the failure of a single node does not cause data loss
Economical - It can work with low-cost commodity disks, as it follows a write-once, read-many access model
Flexible - No schema validation, so it can store structured, semi-structured and unstructured data
Scalable - Processing happens very near the data, which helps process big data in an optimized time frame


Core Components

While setting up a Hadoop cluster, you have the option of choosing a lot of services as part of your Hadoop platform, but there are two services which are always mandatory for setting up Hadoop. One is HDFS (storage) and the other is YARN (processing). HDFS stands for Hadoop Distributed File System, which is the scalable storage unit of Hadoop, whereas YARN is used to process the data stored in HDFS in a distributed and parallel fashion.
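To make the storage side concrete, here is a back-of-the-envelope sketch of how HDFS lays a file out: it splits the file into fixed-size blocks (128 MB by default in Hadoop 2.x and later) and replicates each block (3 copies by default) across DataNodes. The numbers below are the stock defaults, not values read from any cluster.

```java
// Back-of-the-envelope HDFS storage arithmetic using the stock defaults:
// 128 MB block size and a replication factor of 3.
public class HdfsBlocks {
    static final long BLOCK_SIZE_MB = 128;
    static final int REPLICATION = 3;

    // Number of blocks needed for a file of the given size (rounded up).
    public static long blocksFor(long fileSizeMb) {
        return (fileSizeMb + BLOCK_SIZE_MB - 1) / BLOCK_SIZE_MB;
    }

    // Raw cluster storage consumed once every block is replicated.
    public static long rawStorageMb(long fileSizeMb) {
        return fileSizeMb * REPLICATION;
    }

    public static void main(String[] args) {
        long file = 500; // a 500 MB file
        System.out.println(file + " MB -> " + blocksFor(file) + " blocks, "
                + rawStorageMb(file) + " MB raw storage");
    }
}
```

So a 500 MB file occupies 4 blocks and, after replication, about 1500 MB of raw cluster storage.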

HDFS

Let us go ahead with HDFS first. The main components of HDFS are the NameNode and the DataNode. Let us talk about the roles of these two components in detail.

NameNode

  • It is the master daemon that maintains and manages the DataNodes (slave nodes)
  • It records the metadata of all the blocks stored in the cluster, e.g. location of blocks stored, size of the files, permissions, hierarchy, etc.
  • It records each and every change that takes place to the file system metadata
  • If a file is deleted in HDFS, the NameNode will immediately record this in the EditLog
  • It regularly receives a Heartbeat and a block report from all the DataNodes in the cluster to ensure that the DataNodes are alive
  • It keeps a record of all the blocks in the HDFS and DataNode in which they are stored

DataNode

  • It is the slave daemon which runs on each slave machine
  • The actual data is stored on DataNodes
  • It is responsible for serving read and write requests from the clients
  • It is also responsible for creating blocks, deleting blocks and replicating the same based on the decisions taken by the NameNode
  • It sends heartbeats to the NameNode periodically to report the overall health of HDFS; by default, this interval is set to 3 seconds
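The heartbeat mechanism described above can be modeled in a few lines. This is a toy sketch, not the real NameNode implementation: timestamps are passed in explicitly so it stays deterministic, and the timeout is made up for illustration (a real NameNode waits much longer, on the order of 10 minutes, before declaring a DataNode dead).

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of NameNode liveness tracking: each DataNode reports
// heartbeats, and the NameNode considers a node alive only if its
// last heartbeat arrived within the timeout window.
public class HeartbeatTracker {
    private final Map<String, Long> lastHeartbeat = new HashMap<>();
    private final long timeoutMillis;

    public HeartbeatTracker(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    // Called when a DataNode's heartbeat arrives.
    public void heartbeat(String dataNode, long nowMillis) {
        lastHeartbeat.put(dataNode, nowMillis);
    }

    // A node is alive if it has reported within the timeout window.
    public boolean isAlive(String dataNode, long nowMillis) {
        Long last = lastHeartbeat.get(dataNode);
        return last != null && nowMillis - last <= timeoutMillis;
    }

    public static void main(String[] args) {
        HeartbeatTracker nameNode = new HeartbeatTracker(10_000);
        nameNode.heartbeat("dn1", 0);
        System.out.println(nameNode.isAlive("dn1", 3_000));  // within timeout
        System.out.println(nameNode.isAlive("dn1", 60_000)); // heartbeats stopped
    }
}
```

When a node stops heartbeating, the real NameNode re-replicates that node's blocks onto healthy DataNodes to restore the replication factor.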

So, this was all about HDFS in a nutshell. Now, let us move ahead to the second fundamental unit of Hadoop, i.e. YARN.

YARN

YARN comprises two major components: the ResourceManager and the NodeManager.

ResourceManager 

  • It is a cluster-level (one for each cluster) component and runs on the master machine
  • It manages resources and schedules applications running on top of YARN
  • It has two components: the Scheduler & the ApplicationsManager
  • The Scheduler is responsible for allocating resources to the various running applications
  • The ApplicationsManager is responsible for accepting job submissions and negotiating the first container for executing the application
  • It keeps track of the heartbeats from the NodeManagers

NodeManager

  • It is a node-level component (one on each node) and runs on each slave machine
  • It is responsible for managing containers and monitoring resource utilization in each container
  • It also keeps track of node health and log management
  • It continuously communicates with ResourceManager to remain up-to-date
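To illustrate the division of labor above, here is a toy first-fit scheduler: a ResourceManager-side component walks the free memory each NodeManager has reported and places a container on the first node that can hold it. Real YARN schedulers (Capacity Scheduler, Fair Scheduler) are far richer, and the node names and sizes here are made up for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Toy first-fit container scheduler: places each container request on
// the first node with enough free memory, deducting the allocation.
public class ToyScheduler {
    private final Map<String, Integer> freeMemoryMb = new LinkedHashMap<>();

    // A NodeManager registers its capacity with the ResourceManager.
    public void addNode(String node, int memoryMb) {
        freeMemoryMb.put(node, memoryMb);
    }

    // Returns the node the container was placed on, or null if none fits.
    public String allocate(int containerMb) {
        for (Map.Entry<String, Integer> e : freeMemoryMb.entrySet()) {
            if (e.getValue() >= containerMb) {
                e.setValue(e.getValue() - containerMb); // deduct from that node
                return e.getKey();
            }
        }
        return null; // request must wait until resources free up
    }

    public static void main(String[] args) {
        ToyScheduler rm = new ToyScheduler();
        rm.addNode("node1", 4096);
        rm.addNode("node2", 8192);
        System.out.println(rm.allocate(6144)); // node1 is too small -> node2
    }
}
```

The key idea this captures is that the ResourceManager only does the bookkeeping and placement decisions; actually launching and monitoring the container on the chosen machine is the NodeManager's job.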


Hadoop Ecosystem

So far, you will have figured out that Hadoop is neither a programming language nor a service; it is a platform or framework which solves Big Data problems. You can consider it a suite which encompasses a number of services for ingesting, storing and analyzing huge data sets, along with tools for configuration management.



  1. HDFS -> Hadoop Distributed File System
  2. YARN -> Yet Another Resource Negotiator
  3. MapReduce -> Data Processing using Programming
  4. Spark -> In-Memory Data Processing
  5. Pig, Hive -> Data Processing Services using SQL-like Queries
  6. HBase -> NoSQL Database
  7. Mahout, Spark MLlib -> Machine Learning
  8. Apache Drill -> SQL on Hadoop
  9. ZooKeeper -> Cluster Coordination
  10. Oozie -> Job Scheduling
  11. Flume, Sqoop -> Data Ingestion Services
  12. Solr & Lucene -> Searching & Indexing
  13. Ambari -> Provisioning, Monitoring and Maintaining the Cluster


