2. What is Big Data?
Big data is a term that describes the large volume of data – both structured
and unstructured – that inundates a business on a day-to-day basis. Big
data can be analyzed for insights that lead to better decisions and strategic
business moves.
3. Where does it come from?
Social Media
Business Transactions
Smart Phones
Vehicles
Satellites
Log Files
Smart Devices
Sensors
5. 3 V’s of Big Data
Velocity : The speed at which data is generated and processed.
Variety : The range of different data sources and types.
Volume : The amount of data generated and stored.
6. Importance of Big Data
The importance of big data doesn’t revolve around how much data
you have, but what you do with it. You can take data from any
source and analyze it to find answers that enable 1) cost reductions,
2) time reductions, 3) new product development and optimized
offerings, and 4) smart decision making. When you combine big data
with high-powered analytics, you can accomplish business-related
tasks such as:
Determining root causes of failures, issues and defects in near-real
time.
Generating coupons at the point of sale based on the customer’s
buying habits.
Recalculating entire risk portfolios in minutes.
Detecting fraudulent behavior before it affects your organization.
7. Applications of Big Data
A 360 degree view of a customer.
Internet of Things
Healthcare
Information Security
E-Commerce
Data warehouse optimization
9. Emergence of Hadoop
Hadoop grew out of Nutch, an open-source search engine and the
brainchild of Doug Cutting and Mike Cafarella. Nutch aimed to return web
search results faster by distributing data and calculations across different
computers so multiple tasks could be accomplished simultaneously.
In 2006, Cutting joined Yahoo. The Nutch project was divided – the web
crawler portion remained as Nutch and the distributed computing and
processing portion became Hadoop.
In 2008, Yahoo released Hadoop as an open-source project. Today,
Hadoop’s framework and ecosystem of technologies are managed and
maintained by the non-profit Apache Software Foundation (ASF), a global
community of software developers and contributors.
10. Importance
Ability to store and process huge amounts of any kind of data,
quickly. With data volumes and varieties constantly increasing, especially
from social media and the Internet of Things (IoT), that's a key
consideration.
Computing power. Hadoop's distributed computing model processes big
data fast. The more computing nodes you use, the more processing power
you have.
Fault tolerance. Data and application processing are protected against
hardware failure. If a node goes down, jobs are automatically redirected to
other nodes to make sure the distributed computing does not fail. Multiple
copies of all data are stored automatically.
11. Importance(contd.)
Flexibility. Unlike traditional relational databases, you don’t have to
preprocess data before storing it. You can store as much data as you want
and decide how to use it later. That includes unstructured data like text,
images and videos.
Low cost. The open-source framework is free and uses commodity
hardware to store large quantities of data.
Scalability. You can easily grow your system to handle more data simply
by adding nodes. Little administration is required.
13. MapReduce
MapReduce is a programming model and an associated implementation
for processing and generating large data sets with a parallel, distributed
algorithm on a cluster.
The term MapReduce actually refers to two separate and distinct tasks that
Hadoop programs perform. The first is the map job, which takes a set of
data and converts it into another set of data, where individual elements are
broken down into tuples (key/value pairs). The reduce job takes the output
from a map as input and combines those data tuples into a smaller set of
tuples. As the sequence of the name MapReduce implies, the reduce job is
always performed after the map job.
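Real Hadoop jobs are written against Hadoop's Java API, but the map/shuffle/reduce pipeline described above can be illustrated with a minimal in-memory sketch in plain Python. The function names here (`map_reduce`, `mapper`, `reducer`) are made up for illustration and are not part of any Hadoop API.

```python
from itertools import groupby
from operator import itemgetter

def map_reduce(inputs, mapper, reducer):
    """In-memory sketch of the MapReduce model (not Hadoop's actual API).

    mapper:  one input record -> iterable of (key, value) tuples
    reducer: (key, list of values) -> (key, combined value)
    """
    # Map job: break every input record into key/value tuples.
    pairs = [kv for record in inputs for kv in mapper(record)]
    # Shuffle step: group the tuples by key (Hadoop does this
    # between the map and reduce phases).
    pairs.sort(key=itemgetter(0))
    # Reduce job: combine each key's values into a smaller set of tuples.
    return {key: reducer(key, [v for _, v in group])[1]
            for key, group in groupby(pairs, key=itemgetter(0))}

# Example: the classic word count, with one map output tuple per word.
counts = map_reduce(
    ["big data", "big hadoop"],
    mapper=lambda line: [(word, 1) for word in line.split()],
    reducer=lambda key, values: (key, sum(values)),
)
print(counts)  # {'big': 2, 'data': 1, 'hadoop': 1}
```

Note how the reduce job always runs after the map job, exactly as the name MapReduce implies: the reducer only ever sees values that the mapper has already emitted and the shuffle step has already grouped.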
14. MapReduce : Example
Let’s look at a simple example. Assume you have three files, and each file
contains two columns (a key and a value in Hadoop terms) that represent
a city and the corresponding temperature recorded in that city for the
various measurement days. Of course we’ve made this example very
simple so it’s easy to follow. You can imagine that a real application won’t
be quite so simple, as it’s likely to contain millions or even billions of rows,
and they might not be neatly formatted rows at all; in fact, no matter how
big or small the amount of data you need to analyze, the key principles
we’re covering here remain the same. Either way, in this example, city is
the key and temperature is the value.
15. MapReduce : Example (contd.)
File 1
Key Value
Toronto 20
Whitby 25
New York 22
Rome 32
Toronto 14
Rome 33
New York 18
File 2
Key Value
Toronto 18
Whitby 22
New York 25
Rome 35
Toronto 22
Rome 38
New York 21
File 3
Key Value
Toronto 22
Whitby 26
New York 24
Rome 36
Toronto 12
Rome 35
New York 19
Out of all the data we have collected, we want to find the maximum
temperature for each city across all of the data files (note that each file
might have the same city represented multiple times).
16. MapReduce : Example (contd.)
After the map phase, each file yields the data shown below. This is called
the mapped data.
File 1 (mapped)
Key Value
Toronto 20
Whitby 25
New York 22
Rome 33
File 2 (mapped)
Key Value
Whitby 22
New York 25
Toronto 22
Rome 38
File 3 (mapped)
Key Value
Toronto 22
Whitby 26
New York 24
Rome 36
17. MapReduce : Example (contd.)
After mapping, the reduce phase is performed: the mapped data from the
three files is compared key by key to find the highest temperature for
each city.
The final result is as follows:
Key Value
Toronto 22
Whitby 26
New York 25
Rome 38
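The whole worked example can be sketched in plain Python under the same assumption as before: each "file" is modeled as an in-memory list of (city, temperature) records, and the function names (`map_max`, `reduce_max`) are invented for illustration rather than taken from Hadoop.

```python
# Each "file" from the example as a list of (key, value) records.
file1 = [("Toronto", 20), ("Whitby", 25), ("New York", 22), ("Rome", 32),
         ("Toronto", 14), ("Rome", 33), ("New York", 18)]
file2 = [("Toronto", 18), ("Whitby", 22), ("New York", 25), ("Rome", 35),
         ("Toronto", 22), ("Rome", 38), ("New York", 21)]
file3 = [("Toronto", 22), ("Whitby", 26), ("New York", 24), ("Rome", 36),
         ("Toronto", 12), ("Rome", 35), ("New York", 19)]

def map_max(records):
    # Map job: condense one file into its per-city maximum
    # (the "mapped data" shown on slide 16).
    mapped = {}
    for city, temp in records:
        mapped[city] = max(mapped.get(city, temp), temp)
    return mapped

def reduce_max(mapped_outputs):
    # Reduce job: merge the per-file maxima, keeping the highest
    # value seen for each key across all files.
    final = {}
    for mapped in mapped_outputs:
        for city, temp in mapped.items():
            final[city] = max(final.get(city, temp), temp)
    return final

result = reduce_max([map_max(f) for f in [file1, file2, file3]])
print(result)
# {'Toronto': 22, 'Whitby': 26, 'New York': 25, 'Rome': 38}
```

Because each mapper only ever reads its own file, the three map jobs could run on three different machines at once; the reducer then needs only the small per-file summaries, not the raw rows, which is what makes the model scale.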
18. Conclusion
Big data is changing the way people within organizations work together. It
is creating a culture in which business and IT leaders must join forces to
realize value from all data. Insights from big data can enable all employees
to make better decisions—deepening customer engagement, optimizing
operations, preventing threats and fraud, and capitalizing on new sources
of revenue. But escalating demand for insights requires a fundamentally
new approach to architecture, tools and practices.
Competitive Advantage
Better decision making
Value of data