Have you ever wondered how Google runs queries over its mountains of data?
How Facebook is able to quickly deal with such large quantities of information?
Well today, we’re going into the depths of data management, into what is called Big Data. Now, whether or not you have heard of Big Data and related terms like Hadoop or MapReduce, you can be sure they will be a regular part of life in the coming months and years.
This is because 90% of the world’s data was generated in just the last 2 years.
Yes, you read that right:
most of the world’s data was generated in just the last 2 years, and this trend is going to continue. All this new data is coming from smartphones, social networks, trading platforms, machines, and other sources.
However, in the early 2000s, companies like Google were running into a wall. Their vast quantities of data were simply too large to pump through a single database, and no check they could write was big enough to buy a machine that could process it all.
To address this, engineers at Google developed an algorithm that broke large computations down into smaller chunks and mapped them out to many computers. They called this algorithm MapReduce.
This algorithm was later used to develop an open source project called Hadoop, which processes data in parallel rather than in serial. So why do I call it the wild west of data management?
Well, even though the MapReduce algorithm was released 8 years ago, it still relies heavily on Java coding to be implemented successfully. However, the market is rapidly evolving, and tools are becoming available to help businesses adopt it.
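The MapReduce idea described above can be sketched in a few lines of plain Python. This is a conceptual simulation only, not real Hadoop code, and the function names are illustrative: map tasks emit key/value pairs, a shuffle step groups them by key, and reduce tasks aggregate each group.

```python
from collections import defaultdict

def map_phase(documents):
    """Map step: emit a (word, 1) pair for every word in every document."""
    for doc in documents:
        for word in doc.split():
            yield (word.lower(), 1)

def shuffle_phase(pairs):
    """Shuffle step: group all values by key, as Hadoop does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce step: sum the counts for each word."""
    return {word: sum(counts) for word, counts in grouped.items()}

docs = ["big data is big", "data is everywhere"]
counts = reduce_phase(shuffle_phase(map_phase(docs)))
print(counts)  # {'big': 2, 'data': 2, 'is': 2, 'everywhere': 1}
```

In real Hadoop, each phase runs on a different machine and the shuffle moves data over the network; the logic, though, is exactly this word-count pattern.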
So should your business be getting into Hadoop?
Hadoop will become part of our day-to-day information architecture, and data architect training is an easy way to deepen your knowledge in this area. We will start seeing Hadoop playing a central role in statistical analysis, ETL processing, and business intelligence.
- Hadoop’s architecture is divided into two parts: the Hadoop Distributed File System (HDFS) and MapReduce.
- Let’s start with the NameNode. It is the centerpiece of the HDFS file system: it keeps the directory tree of all files in the file system and tracks where across the cluster each file’s data is kept, but it does not store the file data itself.
- The main data we give to Hadoop is divided across several nodes. DataNodes are the nodes that actually store the data; a functional file system has more than one DataNode. Because the input data is huge, Hadoop splits it across the DataNodes, which are also known as slave nodes.
- How the data is divided depends on factors such as its size and type.
- HDFS (the Hadoop Distributed File System) is designed to store very large data sets reliably and to stream those data sets at high bandwidth to user applications.
- Hadoop provides a distributed file system and a framework for the analysis and transformation of very large data sets using the MapReduce paradigm.
- One of the main properties of HDFS is scale: an HDFS instance may consist of thousands of server machines, each storing part of the file system’s data. Another is replication, the main advantage of Hadoop: replicating data makes it reliable.
- If you store data on a single system and that system crashes, all the data is gone forever. So what does Hadoop do?
- It divides the data, makes copies of each piece, and places the copies on different machines, so that if one system crashes, the same data still exists on other machines.
- The data can then be recovered from those other machines; this per-block replication is the main advantage.
- The main data is divided among the DataNodes, and the DataNodes must always communicate with the NameNode to locate the data before processing it.
- Now, coming to the MapReduce engine: the MapReduce engine is the other main part of Hadoop.
- Hadoop MapReduce is a software framework for easily writing applications that process vast amounts of data in parallel on large clusters.
- A MapReduce job usually splits the input data set into independent chunks, which are processed by the map tasks in a completely parallel manner; the framework sorts the outputs of the maps, which then become the input to the reduce tasks.
- Typically both the input and the output of the job are stored in a file system.
- The main role of the MapReduce engine is to take the large data given to Hadoop and divide it into a certain number of parts based on the data’s size and type.
- The MapReduce engine has two parts: a master node and slave nodes. The master node is also called the JobTracker, and a slave node is called a TaskTracker.
- When Hadoop first receives data, it goes to the JobTracker in the MapReduce engine, which receives the user’s data; the JobTracker is the master-node component that divides the data into different parts.
- The JobTracker runs on the master node, typically alongside the NameNode (though they are separate components: the NameNode manages HDFS, the JobTracker manages MapReduce jobs).
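The JobTracker/TaskTracker split described in the last few bullets can be sketched as a toy simulation. This is plain Python, not the real Hadoop API, and the class and method names here are hypothetical: the master splits the input and hands one map task per split to a slave, round-robin.

```python
class TaskTracker:
    """Slave node: runs the map tasks handed to it by the JobTracker."""
    def __init__(self, name):
        self.name = name
        self.completed = []  # splits this tracker has processed

    def run_map_task(self, split, map_fn):
        result = [map_fn(record) for record in split]
        self.completed.append(split)
        return result

class JobTracker:
    """Master node: splits the input and schedules one map task per split."""
    def __init__(self, trackers):
        self.trackers = trackers

    def run_job(self, records, map_fn, split_size=2):
        # Divide the input into independent splits (chunks)
        splits = [records[i:i + split_size] for i in range(0, len(records), split_size)]
        outputs = []
        for i, split in enumerate(splits):
            tracker = self.trackers[i % len(self.trackers)]  # round-robin scheduling
            outputs.extend(tracker.run_map_task(split, map_fn))
        return outputs

jt = JobTracker([TaskTracker("tt1"), TaskTracker("tt2")])
out = jt.run_job([1, 2, 3, 4, 5], map_fn=lambda x: x * x)
print(out)  # [1, 4, 9, 16, 25]
```

Real TaskTrackers also send heartbeats to the JobTracker so failed tasks can be rescheduled; that fault tolerance is omitted here for brevity.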
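The block splitting and replication that the HDFS bullets describe can be sketched the same way. Again, this is a simplified simulation with illustrative names: real HDFS uses 128 MB blocks by default and rack-aware replica placement, neither of which is modeled here.

```python
def split_into_blocks(data: bytes, block_size: int):
    """Split a file into fixed-size blocks, as HDFS does with incoming files."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_replicas(blocks, datanodes, replication=3):
    """Assign each block to `replication` distinct DataNodes (round-robin sketch)."""
    placement = {}
    for i, _ in enumerate(blocks):
        placement[i] = [datanodes[(i + r) % len(datanodes)] for r in range(replication)]
    return placement

# A 1000-byte "file" with a 300-byte block size yields 4 blocks (300+300+300+100)
blocks = split_into_blocks(b"x" * 1000, block_size=300)
placement = place_replicas(blocks, ["node1", "node2", "node3", "node4"])
print(placement[0])  # ['node1', 'node2', 'node3']
```

With every block living on three distinct machines, losing any single node leaves at least two copies of each block intact, which is exactly the crash scenario the bullets above describe.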