Big data is going big these days (pun intended). Organisations are generating petabytes of data every second, so the obvious questions are: where is this data stored? How is it used for analysis? What tools are used to analyse it?
Apache developed HADOOP, an open source framework dedicated to storing, retrieving and analysing huge volumes of data. The HADOOP ecosystem can be used for anything, from simple operations like storing data to complex applications like applying ML algorithms to the stored data and performing real-time data analysis.
The HADOOP ecosystem has the following components:
Hadoop Distributed File System (HDFS)
Yet Another Resource Negotiator (YARN)
MapReduce
Spark
Hive
Mahout
HBase
Zookeeper
Solr and Lucene
Let's see each component one by one-
Hadoop Distributed File System-
HDFS is the primary component of the ecosystem and it stores all the data collected. It has two components: the Data node and the Name node.
All the data is distributed across the Data nodes, whereas the Name node contains all the metadata (information about the data stored), such as which blocks of which file live on which node.
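To make the split between data and metadata concrete, here is a toy sketch in plain Python of a name node that only tracks block locations while data nodes hold the actual bytes. The class names, block size and round-robin placement are illustrative assumptions, not the real HDFS implementation.

```python
# Toy sketch (not real HDFS): the NameNode keeps only metadata,
# the DataNodes keep the actual blocks.
BLOCK_SIZE = 4  # bytes per block for the demo; real HDFS defaults to 128 MB


class DataNode:
    def __init__(self):
        self.blocks = {}

    def store(self, block):
        block_id = len(self.blocks)
        self.blocks[block_id] = block
        return block_id

    def fetch(self, block_id):
        return self.blocks[block_id]


class NameNode:
    def __init__(self, data_nodes):
        self.data_nodes = data_nodes
        self.metadata = {}  # file name -> list of (data node, block id)

    def write(self, filename, data):
        blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
        locations = []
        for i, block in enumerate(blocks):
            node = self.data_nodes[i % len(self.data_nodes)]  # round-robin placement
            locations.append((node, node.store(block)))
        self.metadata[filename] = locations

    def read(self, filename):
        # The name node never stored the bytes; it only knows where they are
        return b"".join(node.fetch(bid) for node, bid in self.metadata[filename])


nn = NameNode([DataNode(), DataNode(), DataNode()])
nn.write("log.txt", b"hello hadoop!")
print(nn.read("log.txt"))  # b'hello hadoop!'
```

Note that losing the metadata makes the blocks unreadable even though they still exist, which is why the Name node is so critical in a real cluster.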
Yet Another Resource Negotiator (YARN)-
As the name suggests, YARN is responsible for managing and scheduling all the resources across the cluster. It has three main components: the resource manager, the application manager and the node managers.
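The core idea of resource negotiation can be sketched in a few lines: applications ask a central manager for containers, and the manager hands them out from nodes that still have spare capacity. This is a deliberately simplified first-fit sketch, not the YARN API; real YARN schedulers (capacity, fair) are far richer.

```python
# Toy sketch (not the YARN API): a resource manager hands out
# containers from nodes that still have free memory.
class NodeManager:
    def __init__(self, name, memory_mb):
        self.name = name
        self.free_mb = memory_mb


class ResourceManager:
    def __init__(self, nodes):
        self.nodes = nodes

    def allocate(self, needed_mb):
        # First-fit for illustration only
        for node in self.nodes:
            if node.free_mb >= needed_mb:
                node.free_mb -= needed_mb
                return node.name
        return None  # no capacity; in a real cluster the request waits


rm = ResourceManager([NodeManager("node-1", 1024), NodeManager("node-2", 2048)])
print(rm.allocate(512))   # node-1
print(rm.allocate(1024))  # node-2 (node-1 has only 512 MB left)
```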
MapReduce-
It helps process data stored across multiple nodes in a manageable way: each node works on its own share of the data in parallel, and the results produced at each node are then combined to generate the final output.
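The classic illustration of this pattern is a word count. The sketch below runs the map, shuffle and reduce phases in plain Python (no Hadoop involved) just to show how per-chunk work is combined into one result; the function names are mine, not Hadoop's.

```python
# Word count via the map -> shuffle -> reduce pattern, in plain Python.
from collections import defaultdict


def map_phase(chunk):
    # Emit (word, 1) pairs, as a mapper would
    return [(word, 1) for word in chunk.split()]


def shuffle(mapped):
    # Group values by key; the framework does this between map and reduce
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    # Combine the grouped values into the final counts
    return {word: sum(counts) for word, counts in groups.items()}


chunks = ["big data big tools", "big clusters"]  # imagine one chunk per node
mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
print(reduce_phase(shuffle(mapped)))
# {'big': 3, 'data': 1, 'tools': 1, 'clusters': 1}
```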
Spark-
It is used for faster processing and for applying ML algorithms to data. It uses in-memory processing, which is faster because the main memory (RAM) of each node is used for data processing instead of writing intermediate results back to disk. It is highly scalable and fault tolerant.
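The speed-up from in-memory processing comes largely from computing an intermediate dataset once and reusing it from RAM. The sketch below shows just that concept in plain Python; in real PySpark the equivalent is calling `cache()` on an RDD or DataFrame.

```python
# Toy sketch of the in-memory idea: compute once, then serve from RAM
# instead of recomputing (or re-reading from disk) on every access.
class InMemoryDataset:
    def __init__(self, compute):
        self.compute = compute  # function that produces the data
        self._cached = None

    def collect(self):
        if self._cached is None:       # first access: run the computation
            self._cached = self.compute()
        return self._cached            # later accesses: served from memory


calls = []
ds = InMemoryDataset(lambda: calls.append("run") or [x * x for x in range(5)])
ds.collect()
ds.collect()
print(len(calls))    # 1 -> the expensive computation ran only once
print(ds.collect())  # [0, 1, 4, 9, 16]
```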
Hive-
It is a data warehousing tool built on top of Hadoop which uses an SQL-like language, HQL (HiveQL), for storage and retrieval of information. It is highly scalable and lets us query large data sets through familiar relational-style tables, though queries typically run as batch jobs rather than in real time.
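HQL reads much like standard SQL, which is the main reason Hive is approachable. Since running Hive needs a cluster, the snippet below uses Python's built-in sqlite3 as a stand-in to show the shape of such a query; the table and sample data are made up for illustration.

```python
# A Hive query such as
#   SELECT level, COUNT(*) FROM logs GROUP BY level;
# has the same shape as this standard SQL, run here against an
# in-memory sqlite3 database as a stand-in for a Hive table.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE logs (level TEXT, message TEXT)")
conn.executemany(
    "INSERT INTO logs VALUES (?, ?)",
    [("ERROR", "disk full"), ("INFO", "started"), ("ERROR", "timeout")],
)
rows = conn.execute(
    "SELECT level, COUNT(*) FROM logs GROUP BY level ORDER BY level"
).fetchall()
print(rows)  # [('ERROR', 2), ('INFO', 1)]
```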
Mahout-
It allows us to apply machine learning algorithms to our task. It consists of various classification, clustering and other algorithms which can be invoked as per the need.
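To give a feel for what "invoking a clustering algorithm" means, here is a minimal k-means sketch in plain Python (1-D points, k=2). A real library provides this ready-made and distributed; this toy version exists only to show the assignment and update steps.

```python
# Minimal k-means clustering sketch: alternate between assigning each
# point to its nearest centroid and moving centroids to cluster means.
def kmeans(points, centroids, iterations=10):
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: each centroid moves to the mean of its cluster
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids


print(kmeans([1.0, 2.0, 1.5, 10.0, 11.0, 10.5], centroids=[0.0, 5.0]))
# [1.5, 10.5] -> the two centroids settle on the two groups of points
```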
HBase-
It is a NoSQL database which supports all the functions of a traditional database. It provides functionality similar to Google's Bigtable and hence can work on large data sets effectively.
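The Bigtable-style data model is simple to sketch: each row key maps to cells addressed by a column family and qualifier. The toy store below shows only that model; real clients (e.g. the Python `happybase` library over Thrift) add versioning, scans and persistence.

```python
# Toy sketch of the Bigtable-style data model:
#   row key -> {"family:qualifier": value}
from collections import defaultdict


class ToyBigTable:
    def __init__(self):
        self.rows = defaultdict(dict)

    def put(self, row_key, column, value):
        self.rows[row_key][column] = value

    def get(self, row_key, column=None):
        row = self.rows[row_key]
        return row if column is None else row.get(column)


t = ToyBigTable()
t.put("user#42", "info:name", "Ada")
t.put("user#42", "info:city", "London")
print(t.get("user#42", "info:name"))  # Ada
print(t.get("user#42"))  # {'info:name': 'Ada', 'info:city': 'London'}
```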
Zookeeper-
It is used for synchronisation and coordination among the components of Hadoop; it also performs inter-process communication and configuration maintenance.
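At its core this coordination service exposes a hierarchical namespace of small nodes ("znodes") that components read and write to share state. The sketch below models just that tree in plain Python; real clients (e.g. the Python `kazoo` library) add watches, ephemeral nodes and strong ordering guarantees on top.

```python
# Toy sketch of a hierarchical znode namespace: path -> small data blob.
class ZNodeStore:
    def __init__(self):
        self.nodes = {"/": b""}

    def create(self, path, data=b""):
        # A node can only be created under an existing parent
        parent = path.rsplit("/", 1)[0] or "/"
        if parent not in self.nodes:
            raise KeyError(f"parent {parent} does not exist")
        self.nodes[path] = data

    def get(self, path):
        return self.nodes[path]


zk = ZNodeStore()
zk.create("/locks")
zk.create("/locks/job-1", b"worker-7")  # e.g. "worker-7 holds this lock"
print(zk.get("/locks/job-1"))  # b'worker-7'
```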
Solr and Lucene-
Solr and Lucene are used to locate the required information amongst all the other information stored. Lucene provides the core indexing and search library, while Solr builds a search platform on top of it.
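The data structure that makes this fast lookup possible is the inverted index: a map from each term to the documents containing it. A minimal plain-Python version, with made-up sample documents, looks like this:

```python
# Minimal inverted index: term -> set of document ids containing it.
from collections import defaultdict

docs = {
    1: "hadoop stores big data",
    2: "spark processes data in memory",
    3: "hive queries big tables",
}

index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        index[term].add(doc_id)

print(sorted(index["big"]))   # [1, 3]
print(sorted(index["data"]))  # [1, 2]
```

Looking up a term is now a single dictionary access instead of a scan over every document, which is what makes search engines scale to large collections.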