2. What is Big Data?
Big data is a term that describes the large volume of data – both structured
and unstructured – that inundates a business on a day-to-day basis. Big
data can be analyzed for insights that lead to better decisions and strategic
business moves.
3. Where does it come from?
Social Media
Business Transactions
Smart Phones
Vehicles
Satellites
Log Files
Smart Devices
Sensors
5. 3 V’s of Big Data
Velocity : The speed at which data is generated and processed.
Variety : The range of different data sources and types.
Volume : The amount of data generated and stored.
6. Importance of Big Data
The importance of big data doesn’t revolve around how much data
you have, but what you do with it. You can take data from any
source and analyze it to find answers that enable 1) cost reductions,
2) time reductions, 3) new product development and optimized
offerings, and 4) smart decision making. When you combine big data
with high-powered analytics, you can accomplish business-related
tasks such as:
Determining root causes of failures, issues and defects in near-real
time.
Generating coupons at the point of sale based on the customer’s
buying habits.
Recalculating entire risk portfolios in minutes.
Detecting fraudulent behavior before it affects your organization.
7. Applications of Big Data
A 360 degree view of a customer.
Internet of Things
Healthcare
Information Security
E-Commerce
Data warehouse optimization
9. Emergence of Hadoop
Hadoop grew out of Nutch, an open-source search engine and the
brainchild of Doug Cutting and Mike Cafarella. Nutch aimed to return web
search results faster by distributing data and calculations across different
computers so multiple tasks could be accomplished simultaneously.
In 2006, Cutting joined Yahoo. The Nutch project was divided – the web
crawler portion remained as Nutch and the distributed computing and
processing portion became Hadoop.
In 2008, Yahoo released Hadoop as an open-source project. Today,
Hadoop’s framework and ecosystem of technologies are managed and
maintained by the non-profit Apache Software Foundation (ASF), a global
community of software developers and contributors.
10. Importance
Ability to store and process huge amounts of any kind of data,
quickly. With data volumes and varieties constantly increasing, especially
from social media and the Internet of Things (IoT), that's a key
consideration.
Computing power. Hadoop's distributed computing model processes big
data fast. The more computing nodes you use, the more processing power
you have.
Fault tolerance. Data and application processing are protected against
hardware failure. If a node goes down, jobs are automatically redirected to
other nodes to make sure the distributed computing does not fail. Multiple
copies of all data are stored automatically.
11. Importance(contd.)
Flexibility. Unlike traditional relational databases, you don’t have to
preprocess data before storing it. You can store as much data as you want
and decide how to use it later. That includes unstructured data like text,
images and videos.
Low cost. The open-source framework is free and uses commodity
hardware to store large quantities of data.
Scalability. You can easily grow your system to handle more data simply
by adding nodes. Little administration is required.
13. MapReduce
MapReduce is a programming model and an associated implementation
for processing and generating large data sets with a parallel, distributed
algorithm on a cluster.
The term MapReduce actually refers to two separate and distinct tasks that
Hadoop programs perform. The first is the map job, which takes a set of
data and converts it into another set of data, where individual elements are
broken down into tuples (key/value pairs). The reduce job takes the output
from a map as input and combines those data tuples into a smaller set of
tuples. As the sequence of the name MapReduce implies, the reduce job is
always performed after the map job.
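Real Hadoop jobs are written against Hadoop's Java API, but the map/shuffle/reduce pipeline described above can be illustrated with a minimal in-memory sketch in plain Python. The function names here (`map_reduce`, `mapper`, `reducer`) are made up for illustration and are not part of any Hadoop API.

```python
from itertools import groupby
from operator import itemgetter

def map_reduce(inputs, mapper, reducer):
    """In-memory sketch of the MapReduce model (not Hadoop's actual API).

    mapper:  one input record -> iterable of (key, value) tuples
    reducer: (key, list of values) -> (key, combined value)
    """
    # Map job: break every input record into key/value tuples.
    pairs = [kv for record in inputs for kv in mapper(record)]
    # Shuffle step: group the tuples by key (Hadoop does this
    # between the map and reduce phases).
    pairs.sort(key=itemgetter(0))
    # Reduce job: combine each key's values into a smaller set of tuples.
    return {key: reducer(key, [v for _, v in group])[1]
            for key, group in groupby(pairs, key=itemgetter(0))}

# Example: the classic word count, with one map output tuple per word.
counts = map_reduce(
    ["big data", "big hadoop"],
    mapper=lambda line: [(word, 1) for word in line.split()],
    reducer=lambda key, values: (key, sum(values)),
)
print(counts)  # {'big': 2, 'data': 1, 'hadoop': 1}
```

Note how the reduce job always runs after the map job, exactly as the name MapReduce implies: the reducer only ever sees values that the mapper has already emitted and the shuffle step has already grouped.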
14. MapReduce : Example
Let’s look at a simple example. Assume you have three files, and each file
contains two columns (a key and a value in Hadoop terms) that represent
a city and the corresponding temperature recorded in that city for the
various measurement days. Of course we’ve made this example very
simple so it’s easy to follow. You can imagine that a real application won’t
be quite so simple, as it’s likely to contain millions or even billions of rows,
and they might not be neatly formatted rows at all; in fact, no matter how
big or small the amount of data you need to analyze, the key principles
we’re covering here remain the same. Either way, in this example, city is
the key and temperature is the value.
15. MapReduce : Example (contd.)
File 1
Key Value
Toronto 20
Whitby 25
New York 22
Rome 32
Toronto 14
Rome 33
New York 18
File 2
Key Value
Toronto 18
Whitby 22
New York 25
Rome 35
Toronto 22
Rome 38
New York 21
File 3
Key Value
Toronto 22
Whitby 26
New York 24
Rome 36
Toronto 12
Rome 35
New York 19
Out of all the data we have collected, we want to find the maximum
temperature for each city across all of the data files (note that each file
might have the same city represented multiple times).
16. MapReduce : Example (contd.)
After the map phase, each file yields the data shown below. This is called
the mapped data.
File 1 (mapped)
Key Value
Toronto 20
Whitby 25
New York 22
Rome 33
File 2 (mapped)
Key Value
Whitby 22
New York 25
Toronto 22
Rome 38
File 3 (mapped)
Key Value
Toronto 22
Whitby 26
New York 24
Rome 36
17. MapReduce : Example (contd.)
After mapping, the reduce phase is performed: the mapped data from the
three files is compared key by key to find the highest temperature for
each city.
The final result is as follows:
Key Value
Toronto 22
Whitby 26
New York 25
Rome 38
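The whole worked example can be sketched in plain Python under the same assumption as before: each "file" is modeled as an in-memory list of (city, temperature) records, and the function names (`map_max`, `reduce_max`) are invented for illustration rather than taken from Hadoop.

```python
# Each "file" from the example as a list of (key, value) records.
file1 = [("Toronto", 20), ("Whitby", 25), ("New York", 22), ("Rome", 32),
         ("Toronto", 14), ("Rome", 33), ("New York", 18)]
file2 = [("Toronto", 18), ("Whitby", 22), ("New York", 25), ("Rome", 35),
         ("Toronto", 22), ("Rome", 38), ("New York", 21)]
file3 = [("Toronto", 22), ("Whitby", 26), ("New York", 24), ("Rome", 36),
         ("Toronto", 12), ("Rome", 35), ("New York", 19)]

def map_max(records):
    # Map job: condense one file into its per-city maximum
    # (the "mapped data" shown on slide 16).
    mapped = {}
    for city, temp in records:
        mapped[city] = max(mapped.get(city, temp), temp)
    return mapped

def reduce_max(mapped_outputs):
    # Reduce job: merge the per-file maxima, keeping the highest
    # value seen for each key across all files.
    final = {}
    for mapped in mapped_outputs:
        for city, temp in mapped.items():
            final[city] = max(final.get(city, temp), temp)
    return final

result = reduce_max([map_max(f) for f in [file1, file2, file3]])
print(result)
# {'Toronto': 22, 'Whitby': 26, 'New York': 25, 'Rome': 38}
```

Because each mapper only ever reads its own file, the three map jobs could run on three different machines at once; the reducer then needs only the small per-file summaries, not the raw rows, which is what makes the model scale.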
18. Conclusion
Big data is changing the way people within organizations work together. It
is creating a culture in which business and IT leaders must join forces to
realize value from all data. Insights from big data can enable all employees
to make better decisions—deepening customer engagement, optimizing
operations, preventing threats and fraud, and capitalizing on new sources
of revenue. But escalating demand for insights requires a fundamentally
new approach to architecture, tools and practices.
Competitive Advantage
Better decision making
Value of data