HADOOP Ecosystem



Big data is going big these days (pun intended). Organisations are generating petabytes of data every day, so the obvious questions are: where is this data stored? How is it used for analysis? What tools are used for the analysis?

Apache developed HADOOP, an open-source framework dedicated to storing, retrieving, and analysing huge volumes of data. The HADOOP ecosystem can be used for anything, from simple operations like storing data to complex applications like applying ML algorithms to the stored data and performing real-time data analysis.


The HADOOP ecosystem has the following components:

  1. Hadoop Distributed File System (HDFS)

  2. Yet Another Resource Negotiator (YARN)

  3. MapReduce

  4. Spark

  5. PIG, HIVE

  6. HBase

  7. Mahout

  8. Solr, Lucene

  9. Zookeeper


Let's look at each component one by one:


  • Hadoop Distributed File System (HDFS)

HDFS is the primary storage component of the ecosystem and stores all the data collected. It has two components: the Name node and the Data nodes.

All the data is distributed across the Data nodes, whereas the Name node holds only the metadata (information about the data stored, such as file names and the locations of their blocks).
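The split between data and metadata can be sketched in a few lines of plain Python. This is a toy model, not real HDFS: the block size, node names, and round-robin placement below are illustrative assumptions (real HDFS uses 128 MB blocks and replica-aware placement).

```python
# Toy illustration of HDFS: files are split into fixed-size blocks spread
# over Data nodes, while the Name node records only where each block lives.

BLOCK_SIZE = 8  # bytes, tiny for illustration (real HDFS defaults to 128 MB)
DATA_NODES = ["node1", "node2", "node3"]  # made-up node names

def store_file(name, data, namenode, datanodes):
    """Split data into blocks, spread them over Data nodes,
    and record each block's location in the Name node's metadata."""
    blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
    namenode[name] = []
    for i, block in enumerate(blocks):
        node = DATA_NODES[i % len(DATA_NODES)]     # naive round-robin placement
        datanodes.setdefault(node, {})[(name, i)] = block  # actual bytes
        namenode[name].append((i, node))           # metadata only, no bytes
    return namenode[name]

namenode, datanodes = {}, {}
locations = store_file("log.txt", b"hello hdfs, this is a test", namenode, datanodes)
```

Reading the file back means asking the Name node for `locations`, then fetching each block from the Data node that holds it.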

  • YARN

YARN (Yet Another Resource Negotiator), as the name suggests, is responsible for managing and scheduling all the resources across the cluster. It has three main components: the Resource Manager, the Application Master and the Node Manager.
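The core idea, a Resource Manager granting containers on Node Managers that have capacity, can be sketched as follows. The node names, memory sizes, and first-fit policy are made-up assumptions for illustration; real YARN schedulers (capacity, fair) are far more sophisticated.

```python
# Toy sketch of YARN-style allocation: applications ask the Resource
# Manager for containers, and it reserves memory on a Node Manager.

class NodeManager:
    def __init__(self, name, memory_mb):
        self.name = name
        self.free_mb = memory_mb  # memory still available on this node

class ResourceManager:
    def __init__(self, nodes):
        self.nodes = nodes

    def allocate(self, app_id, memory_mb):
        """Find a node with enough free memory and reserve a container (first fit)."""
        for node in self.nodes:
            if node.free_mb >= memory_mb:
                node.free_mb -= memory_mb
                return (app_id, node.name, memory_mb)
        return None  # no capacity; a real scheduler would queue the request

rm = ResourceManager([NodeManager("nm1", 2048), NodeManager("nm2", 1024)])
c1 = rm.allocate("app-1", 1536)  # fits on nm1
c2 = rm.allocate("app-2", 1024)  # nm1 has only 512 MB left, so nm2 is used
```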

  • MapReduce

It is a programming model for processing data spread across multiple nodes: the Map phase transforms each input record into intermediate key-value pairs on each node, and the Reduce phase combines the intermediate results from every node to generate the final output.
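The classic word-count example shows the map, shuffle, and reduce steps in plain Python. Real MapReduce runs each phase on many nodes in parallel; this single-process sketch only illustrates the data flow.

```python
from collections import defaultdict
from itertools import chain

# Word count in the MapReduce style: map records to key-value pairs,
# shuffle (group) by key, then reduce each group to a final value.

def map_phase(line):
    # Emit (word, 1) for every word in the input record.
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    # Group all intermediate values by key, as the framework
    # does between the map and reduce phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Combine all intermediate values for one key into the final result.
    return (key, sum(values))

lines = ["big data is big", "hadoop handles big data"]
pairs = chain.from_iterable(map_phase(line) for line in lines)
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
```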

  • Spark

It is used for faster data processing and for applying ML algorithms to data. It performs in-memory processing, which is faster because the main memory (RAM) of each node holds intermediate data instead of writing it to disk between steps. It is highly scalable and fault tolerant.
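Two of Spark's key ideas, lazy transformations and keeping a computed dataset in memory with `cache()`, can be mimicked in miniature. The class below loosely imitates the RDD API for illustration only; it is not Spark's real behaviour (for instance, real Spark caches on the first action, not eagerly).

```python
# Toy sketch of Spark-style lazy, in-memory processing: transformations
# build a plan, cache() keeps a result in RAM, collect() materialises it.

class ToyRDD:
    def __init__(self, compute):
        self._compute = compute   # function that produces the data when asked
        self._cached = None

    def map(self, fn):
        # Lazy: nothing runs yet, we just describe the next step.
        return ToyRDD(lambda: [fn(x) for x in self._materialize()])

    def filter(self, pred):
        return ToyRDD(lambda: [x for x in self._materialize() if pred(x)])

    def cache(self):
        # Keep the computed result in memory so later uses don't recompute.
        self._cached = self._compute()
        return self

    def _materialize(self):
        return self._cached if self._cached is not None else self._compute()

    def collect(self):
        return self._materialize()

nums = ToyRDD(lambda: list(range(10)))
evens = nums.filter(lambda x: x % 2 == 0).cache()  # held in RAM once computed
squares = evens.map(lambda x: x * x).collect()
```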

  • HIVE

It is a data-warehouse system built on top of Hadoop that uses an SQL-like language, HQL (Hive Query Language), for storage and retrieval of information. It is highly scalable and lets you query data in HDFS through familiar relational tables, although queries run as batch jobs rather than in real time.

  • Mahout 

It allows us to apply machine learning algorithms to our data. It provides various classification, clustering and recommendation algorithms, which can be invoked as needed.
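To make "clustering" concrete, here is a from-scratch one-dimensional k-means, the kind of algorithm Mahout runs at scale. This is only the idea; Mahout's actual API and distributed execution look nothing like this sketch, and the points and starting centers are made up.

```python
# Minimal 1-D k-means: alternate between assigning points to their
# nearest center and moving each center to the mean of its points.

def kmeans_1d(points, centers, iters=10):
    for _ in range(iters):
        clusters = {c: [] for c in centers}
        for p in points:
            nearest = min(centers, key=lambda c: abs(c - p))
            clusters[nearest].append(p)
        # Move each center to the mean of its assigned points.
        centers = [sum(ps) / len(ps) if ps else c for c, ps in clusters.items()]
    return sorted(centers)

points = [1.0, 1.2, 0.8, 9.8, 10.0, 10.2]   # two obvious groups
centers = kmeans_1d(points, centers=[0.0, 5.0])
```

The two centers converge to roughly 1.0 and 10.0, one per group.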

  • HBase

It is a NoSQL, column-oriented database that runs on top of HDFS. It provides functionality similar to Google's Bigtable and can therefore work on large, sparse datasets effectively.
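The Bigtable-style data model behind HBase is essentially a sparse, versioned map: row key, then column family, then column qualifier, then timestamped values. The sketch below models that shape; the table, family, and qualifier names are made up, and real HBase is distributed and persistent.

```python
# Toy sketch of the HBase/Bigtable data model:
# row key -> column family -> qualifier -> list of (timestamp, value).

class ToyHTable:
    def __init__(self):
        self.rows = {}

    def put(self, row, family, qualifier, value, ts):
        # Cells are versioned: a new put adds another timestamped value.
        cell = self.rows.setdefault(row, {}).setdefault(family, {})
        cell.setdefault(qualifier, []).append((ts, value))

    def get(self, row, family, qualifier):
        """Return the newest version of a cell, as HBase does by default."""
        versions = self.rows[row][family][qualifier]
        return max(versions)[1]  # highest timestamp wins

t = ToyHTable()
t.put("user#42", "info", "name", "Ada", ts=1)
t.put("user#42", "info", "name", "Ada L.", ts=2)  # newer version, same cell
```

Because rows only store the families and qualifiers they actually use, the table stays compact even when most "columns" are empty, which is what makes this model suit large sparse data.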

  • Zookeeper

It is used for synchronisation and coordination among the components of a Hadoop cluster; it also supports inter-process communication and maintains shared configuration and naming information.
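ZooKeeper's coordination model is a small hierarchical namespace of "znodes" that processes read and write to agree on shared state, for example which node is the current leader. The sketch below imitates only that namespace; the paths and data are illustrative assumptions, not a real ZooKeeper deployment.

```python
# Toy sketch of ZooKeeper's znode tree: a path-addressed store that
# distributed components use as a single source of coordination truth.

class ToyZooKeeper:
    def __init__(self):
        self.znodes = {"/": None}

    def create(self, path, data):
        # A znode can only be created under an existing parent, as in ZooKeeper.
        parent = path.rsplit("/", 1)[0] or "/"
        if parent not in self.znodes:
            raise KeyError(f"parent znode {parent!r} does not exist")
        self.znodes[path] = data

    def get(self, path):
        return self.znodes[path]

    def children(self, path):
        # List direct children of a znode, sorted by name.
        prefix = path.rstrip("/") + "/"
        return sorted(p[len(prefix):] for p in self.znodes
                      if p.startswith(prefix) and "/" not in p[len(prefix):])

zk = ToyZooKeeper()
zk.create("/election", None)
zk.create("/election/leader", "namenode-1")  # components read this to agree
```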

  • Solr, Lucene

Solr and Lucene are used to locate the required information amongst everything stored: Lucene is a Java library for indexing and searching text, and Solr is a search platform built on top of it.
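The core structure Lucene uses to find documents without scanning all of them is an inverted index: a map from each term to the documents containing it. A minimal sketch, with made-up documents (real Lucene adds tokenisation, scoring, and on-disk index segments):

```python
from collections import defaultdict

# Minimal inverted index: term -> set of ids of documents containing it.

def build_index(docs):
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, *terms):
    """Return ids of documents containing ALL of the given terms."""
    sets = [index.get(t.lower(), set()) for t in terms]
    return set.intersection(*sets) if sets else set()

docs = {1: "Hadoop stores big data",
        2: "Spark processes big data fast",
        3: "Hive queries data"}
index = build_index(docs)
```

A query for "big data" only touches the two term entries, not every document, which is why this structure scales to huge collections.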

  • April 13, 2021
  • Yash Burad