What is Big Data?
 Big data is a term that describes the large volume of data – both structured
and unstructured – that inundates a business on a day-to-day basis. Big
data can be analyzed for insights that lead to better decisions and strategic
business moves.
Where does it come from?
 Social Media
 Business Transactions
 Smartphones
 Vehicles
 Satellites
 Log Files
 Smart Devices
 Sensors
3 V’s of Big Data
 Velocity : The rate at which data is generated and changed.
 Variety : The number of different data sources and types.
 Volume : The sheer amount of data that is generated and stored.
Importance of Big Data
The importance of big data doesn’t revolve around how much data
you have, but what you do with it. You can take data from any
source and analyze it to find answers that enable 1) cost reductions,
2) time reductions, 3) new product development and optimized
offerings, and 4) smart decision making. When you combine big data
with high-powered analytics, you can accomplish business-related
tasks such as:
 Determining root causes of failures, issues and defects in near-real
time.
 Generating coupons at the point of sale based on the customer’s
buying habits.
 Recalculating entire risk portfolios in minutes.
 Detecting fraudulent behavior before it affects your organization.
Applications of Big Data
 A 360-degree view of a customer.
 Internet of Things
 Healthcare
 Information Security
 E-Commerce
 Data warehouse optimization
Emergence of Hadoop
 Hadoop grew out of Nutch, an open-source search engine and the brainchild of
Doug Cutting and Mike Cafarella. It aimed to return web search results faster by
distributing data and calculations across different computers so multiple
tasks could be accomplished simultaneously.
 In 2006, Cutting joined Yahoo. The Nutch project was divided – the web
crawler portion remained as Nutch and the distributed computing and
processing portion became Hadoop.
 In 2008, Yahoo released Hadoop as an open-source project. Today,
Hadoop’s framework and ecosystem of technologies are managed and
maintained by the non-profit Apache Software Foundation (ASF), a global
community of software developers and contributors.
Importance
 Ability to store and process huge amounts of any kind of data,
quickly. With data volumes and varieties constantly increasing, especially
from social media and the Internet of Things (IoT), that's a key
consideration.
 Computing power. Hadoop's distributed computing model processes big
data fast. The more computing nodes you use, the more processing power
you have.
 Fault tolerance. Data and application processing are protected against
hardware failure. If a node goes down, jobs are automatically redirected to
other nodes to make sure the distributed computing does not fail. Multiple
copies of all data are stored automatically.
Importance (contd.)
 Flexibility. Unlike traditional relational databases, you don’t have to
preprocess data before storing it. You can store as much data as you want
and decide how to use it later. That includes unstructured data like text,
images and videos.
 Low cost. The open-source framework is free and uses commodity
hardware to store large quantities of data.
 Scalability. You can easily grow your system to handle more data simply
by adding nodes. Little administration is required.
Hadoop Distributed File System (HDFS)
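In brief, HDFS splits each file into fixed-size blocks (128 MB by default), stores several replicas of every block on different DataNodes (three by default), and has a NameNode track which node holds which block. A toy Python sketch of that placement idea (the block size, node names, and function are illustrative stand-ins, not a real HDFS API):

```python
import itertools

BLOCK_SIZE = 4      # toy block size; real HDFS defaults to 128 MB
REPLICATION = 3     # HDFS's default replication factor

def place_blocks(data: bytes, nodes: list[str]) -> dict[int, list[str]]:
    """Split data into blocks and assign each block to REPLICATION nodes,
    round-robin -- a very simplified stand-in for the NameNode's bookkeeping."""
    blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
    ring = itertools.cycle(nodes)
    return {block_id: [next(ring) for _ in range(REPLICATION)]
            for block_id in range(len(blocks))}

# 16 bytes at a 4-byte block size -> 4 blocks, each replicated on 3 of 4 nodes
layout = place_blocks(b"hello hdfs world", ["node1", "node2", "node3", "node4"])
```

Losing any one node still leaves two replicas of every block, which is the fault-tolerance property described above.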
MapReduce
 MapReduce is a programming model and an associated implementation
for processing and generating large data sets with a parallel, distributed
algorithm on a cluster.
 The term MapReduce actually refers to two separate and distinct tasks that
Hadoop programs perform. The first is the map job, which takes a set of
data and converts it into another set of data, where individual elements are
broken down into tuples (key/value pairs). The reduce job takes the output
from a map as input and combines those data tuples into a smaller set of
tuples. As the sequence of the name MapReduce implies, the reduce job is
always performed after the map job.
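The two tasks can be sketched in plain Python using the canonical word-count illustration (this example and its function names are ours, not from the slides, and a toy in-memory version rather than real Hadoop code):

```python
from collections import defaultdict

def map_task(document: str) -> list[tuple[str, int]]:
    """Map: convert the input into (key, value) tuples -- one (word, 1) per word."""
    return [(word, 1) for word in document.split()]

def reduce_task(mapped: list[tuple[str, int]]) -> dict[str, int]:
    """Reduce: combine the tuples into a smaller set, summing values per key."""
    totals: dict[str, int] = defaultdict(int)
    for key, value in mapped:
        totals[key] += value
    return dict(totals)

docs = ["big data big insight", "big cluster"]
mapped = [pair for d in docs for pair in map_task(d)]  # map runs per input split
counts = reduce_task(mapped)                           # reduce runs after all maps
# counts == {'big': 3, 'data': 1, 'insight': 1, 'cluster': 1}
```

In a real cluster each map task runs on a different node against its own slice of the data, and the framework shuffles tuples with the same key to the same reducer.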
MapReduce : Example
 Let’s look at a simple example. Assume you have three files, and each file
contains two columns (a key and a value in Hadoop terms) that represent
a city and the corresponding temperature recorded in that city for the
various measurement days. Of course we’ve made this example very
simple so it’s easy to follow. You can imagine that a real application won’t
be quite so simple, as it’s likely to contain millions or even billions of rows,
and they might not be neatly formatted rows at all; in fact, no matter how
big or small the amount of data you need to analyze, the key principles
we’re covering here remain the same. Either way, in this example, city is
the key and temperature is the value.
MapReduce : Example (contd.)
File 1:
Key      Value
Toronto  20
Whitby   25
New York 22
Rome     32
Toronto  14
Rome     33
New York 18

File 2:
Key      Value
Toronto  18
Whitby   22
New York 25
Rome     35
Toronto  22
Rome     38
New York 21

File 3:
Key      Value
Toronto  22
Whitby   26
New York 24
Rome     36
Toronto  12
Rome     35
New York 19
 Out of all the data we have collected, we want to find the maximum
temperature for each city across all of the data files (note that each file
might have the same city represented multiple times).
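The map job for this task can be sketched as a plain Python function run independently on each file, keeping only the highest reading seen so far for each city (a toy in-memory stand-in, not actual Hadoop code; the function name is ours):

```python
def map_max(records: list[tuple[str, int]]) -> dict[str, int]:
    """Map task for one file: emit the per-city maximum from that file alone."""
    local_max: dict[str, int] = {}
    for city, temp in records:
        if city not in local_max or temp > local_max[city]:
            local_max[city] = temp
    return local_max

# File 1 from the tables above; each city can appear more than once
file1 = [("Toronto", 20), ("Whitby", 25), ("New York", 22), ("Rome", 32),
         ("Toronto", 14), ("Rome", 33), ("New York", 18)]
mapped1 = map_max(file1)
# mapped1 == {'Toronto': 20, 'Whitby': 25, 'New York': 22, 'Rome': 33}
```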
MapReduce : Example (contd.)
 After the map phase, each file yields the data shown below; this is called the
mapped data.
Mapped data from File 1:
Key      Value
Toronto  20
Whitby   25
New York 22
Rome     33

Mapped data from File 2:
Key      Value
Whitby   22
New York 25
Toronto  22
Rome     38

Mapped data from File 3:
Key      Value
Toronto  22
Whitby   26
New York 24
Rome     36
MapReduce : Example (contd.)
 After mapping, the reduce phase runs: the values for each key are compared
across the three files' mapped data to find the highest temperature.
 The final result is as follows:
Key      Value
Toronto  22
Whitby   26
New York 25
Rome     38
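Continuing the same toy sketch, the reduce job takes the three mapped outputs and keeps the overall maximum per key (the function name is ours; the result is computed from the mapped tables above rather than hard-coded):

```python
def reduce_max(mapped_outputs: list[dict[str, int]]) -> dict[str, int]:
    """Reduce task: compare values for the same key across all mapped outputs."""
    result: dict[str, int] = {}
    for mapped in mapped_outputs:
        for city, temp in mapped.items():
            if city not in result or temp > result[city]:
                result[city] = temp
    return result

mapped = [
    {"Toronto": 20, "Whitby": 25, "New York": 22, "Rome": 33},  # file 1
    {"Whitby": 22, "New York": 25, "Toronto": 22, "Rome": 38},  # file 2
    {"Toronto": 22, "Whitby": 26, "New York": 24, "Rome": 36},  # file 3
]
final = reduce_max(mapped)
# final == {'Toronto': 22, 'Whitby': 26, 'New York': 25, 'Rome': 38}
```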
Conclusion
 Big data is changing the way people within organizations work together. It
is creating a culture in which business and IT leaders must join forces to
realize value from all data. Insights from big data can enable all employees
to make better decisions—deepening customer engagement, optimizing
operations, preventing threats and fraud, and capitalizing on new sources
of revenue. But escalating demand for insights requires a fundamentally
new approach to architecture, tools and practices.
 Competitive Advantage
 Better decision making
 Value of data