Hadoop for Toddlers — Part 1: Why Hadoop?

Minh-Luan H. Phan
5 min read · Feb 21, 2021


Assume that your boss asks you to find a solution for storing 1,024 GB of customer order data for analysis purposes.

Question 1: How should the data be stored to achieve acceptable read performance in the future?

  • Solution 1: Store the data on a single computer

This seems like a simple approach. However, it poses some potential problems.

Assume that your server uses an HDD to store the data, with a read speed of approximately 100 MB/s.

1,024 GB * 1024 = 1,048,576 MB

Time to read the whole dataset:

1,048,576 MB / 100 MB/s = 10,485.76 seconds ≈ 174.76 minutes

Whenever we need to analyze a major portion of this dataset, it takes nearly 3 hours just to read it.

  • Solution 2: Store the data across 10 computers

By splitting the dataset equally into 10 pieces, each piece (about 102.4 GB) is stored on a separate computer. To improve read performance, we can access the data in parallel, so the time to read the whole dataset drops to the time to read a single piece.

A piece of data is about 102.4 GB:

102.4 GB * 1024 = 104,857.6 MB

Time to read the whole dataset (all pieces read in parallel):

104,857.6 MB / 100 MB/s = 1,048.576 seconds ≈ 17.48 minutes
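The same back-of-the-envelope calculation, written out as a tiny Java sketch (the 100 MB/s transfer rate and the 10-node split are simply the assumptions used in this example):

```java
public class ReadTimeEstimate {
    public static void main(String[] args) {
        double datasetMb = 1024 * 1024;   // 1,024 GB expressed in MB
        double transferMbPerSec = 100;    // assumed HDD transfer rate

        // One machine reads the whole dataset sequentially.
        double singleNodeSec = datasetMb / transferMbPerSec;

        // Ten machines each read one tenth of the dataset in parallel.
        int nodes = 10;
        double parallelSec = (datasetMb / nodes) / transferMbPerSec;

        System.out.printf("1 node  : %.2f s (%.2f min)%n", singleNodeSec, singleNodeSec / 60);
        System.out.printf("%d nodes: %.2f s (%.2f min)%n", nodes, parallelSec, parallelSec / 60);
    }
}
```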

However, after reading all of the data, several new problems arise.

Firstly, hardware failures are inevitable. To tackle this problem, we need to replicate our data across a number of computers. This is where the Hadoop Distributed File System (HDFS) comes to the rescue.
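As a small taste of what that looks like, here is a minimal sketch using the standard org.apache.hadoop.fs.FileSystem API; the NameNode address, the file path, and the replication factor of 3 are made-up values for illustration:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical NameNode address for this sketch.
        conf.set("fs.defaultFS", "hdfs://namenode:9000");

        FileSystem fs = FileSystem.get(conf);

        // Ask HDFS to keep 3 copies of the orders file on different nodes,
        // so losing one machine does not lose the data.
        fs.setReplication(new Path("/data/orders.csv"), (short) 3);

        fs.close();
    }
}
```

In practice the replication factor is usually set cluster-wide through the dfs.replication property rather than per file.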

Secondly, most analysis tasks require combining data: after traversing and filtering the whole dataset, we need to combine the results from each piece. Handling this process correctly is extremely challenging for many reasons. These problems can be greatly reduced with the support of MapReduce, a component of Hadoop.

In a nutshell, Hadoop provides HDFS for reliable shared storage and MapReduce for analysis.
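To make the MapReduce half concrete, here is the canonical WordCount example from the Apache Hadoop MapReduce tutorial, which counts how often each word appears across the input files. For the order-analysis scenario above you would write your own map and reduce logic, but the overall shape of the job stays the same:

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map: for every word in the input line, emit (word, 1).
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reduce: sum the counts emitted for each word across all pieces of the data.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // combine locally before shuffling
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

You package this as a JAR and submit it with the hadoop jar command, pointing it at input and output directories on HDFS; the framework takes care of splitting the input, running mappers close to the data, and feeding their grouped output to the reducers.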

Question 2: Why can’t we use a database with lots of disks to do large-scale batch analysis?

  • 1st reason: Time

To answer this question, we need to take into account the seek time and transfer rate of a disk drive.

What is seek time?

Seek time is the time taken for a hard disk controller to locate a specific piece of stored data.

When anything is read from or written to a disk drive, the read/write head needs to move to the right position first. This physical positioning of the read/write head is called seeking.

What is transfer rate?

The data transfer rate tells you how quickly data can travel from one place to another, for example from a hard drive to a USB flash drive.

There is a long-standing trend in disk drives: seek time is improving more slowly than transfer rate.

Back to your boss’s requirement: it is crystal clear that the given dataset will be accessed in a write-once, read-many fashion, which means we have to find a solution that supports batch data processing and takes full advantage of the transfer rate instead of the seek time.

An RDBMS, whose data access pattern is dominated by seek time, works best when reading and updating a small proportion of the dataset. With a large proportion, or even the whole dataset, it turns out to be far less efficient.
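A rough illustration of why, as a sketch: only the 100 MB/s transfer rate comes from the example above; the 10 ms seek time, 100-byte record size, and 1% update fraction are assumptions chosen purely to make the comparison concrete.

```java
public class SeekVsTransfer {
    public static void main(String[] args) {
        double datasetBytes = 1024L * 1024 * 1024 * 1024; // the 1,024 GB dataset
        double recordBytes = 100;                         // assumed record size
        double seekSec = 0.010;                           // assumed 10 ms per seek
        double transferBytesPerSec = 100.0 * 1024 * 1024; // 100 MB/s, as above

        // Update 1% of the records by seeking to each one individually.
        double recordsToUpdate = (datasetBytes / recordBytes) * 0.01;
        double seekDominatedSec = recordsToUpdate * seekSec;

        // Or stream through the entire dataset sequentially and rewrite it.
        double streamingSec = datasetBytes / transferBytesPerSec;

        System.out.printf("Seek-based updates  : %.0f s (~%.1f days)%n",
                seekDominatedSec, seekDominatedSec / 86400);
        System.out.printf("Full sequential pass: %.0f s (~%.1f hours)%n",
                streamingSec, streamingSec / 3600);
    }
}
```

Even for a modest fraction of the data, the seek-dominated approach loses badly, which is why streaming through the whole dataset at the transfer rate is the better fit for this kind of analysis.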

MapReduce tackles these shortcomings of an RDBMS, which makes it a good option for analyzing large amounts of data.

What is MapReduce?

According to https://www.hadoop.apache.org:

“Hadoop MapReduce is a software framework for easily writing applications which process vast amounts of data (multi-terabyte data-sets) in-parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.”

  • 2nd reason: An RDBMS only handles structured data

Moreover, while an RDBMS can only handle structured data, MapReduce works well on unstructured and semi-structured data because the schema is determined at processing time. The input keys and values for MapReduce are not necessarily intrinsic properties of the data; they are chosen by the developer who analyzes it.
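For example, a mapper might pull a customer ID and an order amount out of loosely formatted text lines, deciding only at processing time what the key and the value are. This is just a sketch: the customer=...;amount=... line format and the field names are invented for illustration.

```java
import java.io.IOException;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// The developer decides at processing time that the key is the customer ID
// and the value is the order amount; no schema is imposed on the raw file.
public class OrderAmountMapper
        extends Mapper<LongWritable, Text, Text, DoubleWritable> {

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Hypothetical semi-structured line: "customer=42;amount=19.99;note=gift"
        String customer = null;
        Double amount = null;
        for (String field : line.toString().split(";")) {
            String[] kv = field.split("=", 2);
            if (kv.length != 2) continue;              // tolerate malformed fields
            if (kv[0].equals("customer")) customer = kv[1];
            if (kv[0].equals("amount"))   amount = Double.valueOf(kv[1]);
        }
        if (customer != null && amount != null) {
            context.write(new Text(customer), new DoubleWritable(amount));
        }
    }
}
```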

  • 3rd reason: Network congestion

In Hadoop, there is a concept called data locality.

Data locality means moving the computation close to where the data actually resides, instead of moving large volumes of data to the computation. This minimizes network congestion and increases the overall throughput of the system.

Broadly speaking, to boost performance, data is processed locally. In an RDBMS, data is stored in a relational (normalized) fashion to retain integrity and remove redundancy. However, this layout is not well suited to a distributed system like Hadoop, because it makes reading a record a nonlocal operation.

MapReduce is recognized as a linearly scalable programming model.

What does linearly scalable mean?
Back to the given dataset: if you double the size of the dataset, the job will run twice as slowly. To compensate, you can double the number of nodes in the cluster, which doubles the computing power, and the job runs as fast as it did before.

Traditionally, to improve performance, people upgrade the server, for example by adding more RAM. With Hadoop, you just add more nodes to your cluster. The goal of Hadoop is for everything to work well on commodity servers.

Question 3: In the past, people applied grid computing to large-scale data processing. What is the difference between grid computing and Hadoop?

Grid computing is suitable for compute-heavy jobs, but when it needs to access a large volume of data, this turns into a disadvantage. Transferring a large amount of data over the network causes bottlenecks and consumes precious resources such as network bandwidth. Moreover, while waiting for the data, the compute nodes sit idle.

MapReduce, in contrast, tries to distribute the job to the nodes that hold the data rather than shipping the data to the job. That is why MapReduce performs exceptionally well: data is processed locally.

Moreover, handling data flow in grid computing is truly painful, because it requires developers to understand low-level programming. MapReduce frees developers from all of those low-level details; they only need to decide on the input keys and values for the map and reduce functions.

Additionally, the hardest aspect of distributed computing may well be managing partial failures, when you do not know whether a remote task has failed or succeeded. MapReduce has mechanisms to handle this: it automatically checks node health and decides which nodes will rerun the failed tasks. This is possible in particular because tasks in MapReduce are independent of each other.
