What is Hadoop?

Hadoop is an open-source software framework used for storing and processing Big Data in a distributed manner on large clusters of commodity hardware. Hadoop is licensed under the Apache License 2.0.

Hadoop was developed based on the paper Google published on its MapReduce system, and it applies concepts of functional programming. Hadoop is written in Java and ranks among the top-level Apache projects. It was developed by Doug Cutting and Michael J. Cafarella.
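The functional-programming idea behind MapReduce can be seen in a toy word count, the canonical MapReduce example. This sketch uses plain Java streams rather than the Hadoop API, so it runs without a cluster: the "map" step emits words, the grouping mirrors the shuffle, and the "reduce" step sums counts per word.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.stream.Collectors;

// Toy word count illustrating the MapReduce phases in plain Java:
// map (split into words), shuffle (group by key), reduce (count per key).
public class WordCount {
    public static Map<String, Long> count(String text) {
        return Arrays.stream(text.toLowerCase().split("\\s+")) // map: emit one word per token
                     .collect(Collectors.groupingBy(           // shuffle: group identical words
                             w -> w,
                             Collectors.counting()));          // reduce: sum occurrences per word
    }

    public static void main(String[] args) {
        System.out.println(count("big data big cluster"));
    }
}
```

In real Hadoop MapReduce the same two phases run as Mapper and Reducer classes distributed across the cluster, but the logic is the same shape.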







It all started with two people, Mike Cafarella and Doug Cutting, who were in the process of building a search engine system that could index 1 billion pages. After their research, they estimated that such a system would cost around half a million dollars in hardware, with a monthly running cost of $30,000, which is quite expensive. They also soon realized that their architecture would not be capable of scaling to the billions of pages on the web.


Features -

Reliability - Data is replicated across a distributed cluster, so the failure of a single node does not cause data loss
Economical - It can work with low-cost commodity disks, as it follows a write-once, read-many access model
Flexible - No schema validation, so it can store structured, semi-structured and unstructured data
Scalable - Processing happens very near the data, which helps process big data in an optimized time frame


Core Components

While setting up a Hadoop cluster, you have the option of choosing a lot of services as part of your Hadoop platform, but there are two services which are always mandatory for setting up Hadoop. One is HDFS (storage) and the other is YARN (processing). HDFS stands for Hadoop Distributed File System, which is the scalable storage unit of Hadoop, whereas YARN is used to process the data stored in HDFS in a distributed and parallel fashion.
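To make the storage side concrete, here is a back-of-the-envelope sketch of how HDFS lays a file out: it splits the file into fixed-size blocks (128 MB by default in Hadoop 2.x and later) and replicates each block (3 copies by default) across DataNodes. The numbers below are the stock defaults, not values read from any cluster.

```java
// Back-of-the-envelope HDFS storage arithmetic using the stock defaults:
// 128 MB block size and a replication factor of 3.
public class HdfsBlocks {
    static final long BLOCK_SIZE_MB = 128;
    static final int REPLICATION = 3;

    // Number of blocks needed for a file of the given size (rounded up).
    public static long blocksFor(long fileSizeMb) {
        return (fileSizeMb + BLOCK_SIZE_MB - 1) / BLOCK_SIZE_MB;
    }

    // Raw cluster storage consumed once every block is replicated.
    public static long rawStorageMb(long fileSizeMb) {
        return fileSizeMb * REPLICATION;
    }

    public static void main(String[] args) {
        long file = 500; // a 500 MB file
        System.out.println(file + " MB -> " + blocksFor(file) + " blocks, "
                + rawStorageMb(file) + " MB raw storage");
    }
}
```

So a 500 MB file occupies 4 blocks and, after replication, about 1500 MB of raw cluster storage.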

HDFS

Let us go ahead with HDFS first. The main components of HDFS are the NameNode and the DataNode. Let us talk about the roles of these two components in detail.

NameNode

  • It is the master daemon that maintains and manages the DataNodes (slave nodes)
  • It records the metadata of all the blocks stored in the cluster, e.g. location of blocks stored, size of the files, permissions, hierarchy, etc.
  • It records each and every change that takes place to the file system metadata
  • If a file is deleted in HDFS, the NameNode will immediately record this in the EditLog
  • It regularly receives a Heartbeat and a block report from all the DataNodes in the cluster to ensure that the DataNodes are alive
  • It keeps a record of all the blocks in the HDFS and DataNode in which they are stored

DataNode

  • It is the slave daemon which runs on each slave machine
  • The actual data is stored on DataNodes
  • It is responsible for serving read and write requests from the clients
  • It is also responsible for creating blocks, deleting blocks and replicating the same based on the decisions taken by the NameNode
  • It sends heartbeats to the NameNode periodically to report the overall health of HDFS; by default, this interval is set to 3 seconds
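The heartbeat mechanism described above can be modeled in a few lines. This is a toy sketch, not the real NameNode implementation: timestamps are passed in explicitly so it stays deterministic, and the timeout is made up for illustration (a real NameNode waits much longer, on the order of 10 minutes, before declaring a DataNode dead).

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of NameNode liveness tracking: each DataNode reports
// heartbeats, and the NameNode considers a node alive only if its
// last heartbeat arrived within the timeout window.
public class HeartbeatTracker {
    private final Map<String, Long> lastHeartbeat = new HashMap<>();
    private final long timeoutMillis;

    public HeartbeatTracker(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    // Called when a DataNode's heartbeat arrives.
    public void heartbeat(String dataNode, long nowMillis) {
        lastHeartbeat.put(dataNode, nowMillis);
    }

    // A node is alive if it has reported within the timeout window.
    public boolean isAlive(String dataNode, long nowMillis) {
        Long last = lastHeartbeat.get(dataNode);
        return last != null && nowMillis - last <= timeoutMillis;
    }

    public static void main(String[] args) {
        HeartbeatTracker nameNode = new HeartbeatTracker(10_000);
        nameNode.heartbeat("dn1", 0);
        System.out.println(nameNode.isAlive("dn1", 3_000));  // within timeout
        System.out.println(nameNode.isAlive("dn1", 60_000)); // heartbeats stopped
    }
}
```

When a node stops heartbeating, the real NameNode re-replicates that node's blocks onto healthy DataNodes to restore the replication factor.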

So, this was all about HDFS in a nutshell. Now, let us move ahead to the second fundamental unit of Hadoop, i.e. YARN.

YARN

YARN comprises two major components: the ResourceManager and the NodeManager.

ResourceManager 

  • It is a cluster-level (one for each cluster) component and runs on the master machine
  • It manages resources and schedules applications running on top of YARN
  • It has two components: the Scheduler & the ApplicationsManager
  • The Scheduler is responsible for allocating resources to the various running applications
  • The ApplicationsManager is responsible for accepting job submissions and negotiating the first container for executing the application
  • It keeps track of the heartbeats from the NodeManagers

NodeManager

  • It is a node-level component (one on each node) and runs on each slave machine
  • It is responsible for managing containers and monitoring resource utilization in each container
  • It also keeps track of node health and log management
  • It continuously communicates with ResourceManager to remain up-to-date
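To illustrate the division of labor above, here is a toy first-fit scheduler: a ResourceManager-side component walks the free memory each NodeManager has reported and places a container on the first node that can hold it. Real YARN schedulers (Capacity Scheduler, Fair Scheduler) are far richer, and the node names and sizes here are made up for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Toy first-fit container scheduler: places each container request on
// the first node with enough free memory, deducting the allocation.
public class ToyScheduler {
    private final Map<String, Integer> freeMemoryMb = new LinkedHashMap<>();

    // A NodeManager registers its capacity with the ResourceManager.
    public void addNode(String node, int memoryMb) {
        freeMemoryMb.put(node, memoryMb);
    }

    // Returns the node the container was placed on, or null if none fits.
    public String allocate(int containerMb) {
        for (Map.Entry<String, Integer> e : freeMemoryMb.entrySet()) {
            if (e.getValue() >= containerMb) {
                e.setValue(e.getValue() - containerMb); // deduct from that node
                return e.getKey();
            }
        }
        return null; // request must wait until resources free up
    }

    public static void main(String[] args) {
        ToyScheduler rm = new ToyScheduler();
        rm.addNode("node1", 4096);
        rm.addNode("node2", 8192);
        System.out.println(rm.allocate(6144)); // node1 is too small -> node2
    }
}
```

The key idea this captures is that the ResourceManager only does the bookkeeping and placement decisions; actually launching and monitoring the container on the chosen machine is the NodeManager's job.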


Hadoop Ecosystem

So far, you will have figured out that Hadoop is neither a programming language nor a service; it is a platform or framework which solves Big Data problems. You can consider it a suite which encompasses a number of services for ingesting, storing and analyzing huge data sets, along with tools for configuration management.



  1. HDFS -> Hadoop Distributed File System
  2. YARN -> Yet Another Resource Negotiator
  3. MapReduce -> Data Processing using Programming
  4. Spark -> In-Memory Data Processing
  5. Pig, Hive -> Data Processing Services using SQL-like Queries
  6. HBase -> NoSQL Database
  7. Mahout, Spark MLlib -> Machine Learning
  8. Apache Drill -> SQL on Hadoop
  9. ZooKeeper -> Cluster Coordination
  10. Oozie -> Job Scheduling
  11. Flume, Sqoop -> Data Ingestion Services
  12. Solr & Lucene -> Searching & Indexing
  13. Ambari -> Provisioning, Monitoring and Maintaining the Cluster


