By: Priyanka Tuteja 
(2k14-mtech(cse)-mrce-012)
Introduction
Outline 
1. What is Big Data 
2. Big Data generators 
3. Why Big Data 
4. Characteristics of Big Data 
5. Big Data – a worldwide problem 
6. Solution for Big Data 
7. Hadoop 
– HDFS 
– MapReduce 
8. How Big Data Impacts IT 
9. Future of Big Data
What is big data? 
Big data is a collection of large and complex data sets 
that are difficult to process using on-hand database 
management tools or traditional data processing 
applications. 
In simpler terms, 
Big Data is a term for the large volumes of data that 
organizations store and process.
Huge amount of data 
• From the beginning of recorded time until 2003, we 
created 5 billion gigabytes (5 exabytes) of data. 
• In 2011, the same amount was created every two days. 
• In 2013, the same amount was created every 10 minutes.
Types of Data Generators 
This data comes from everywhere: 
• sensors used to gather climate information, 
• posts to social media sites, 
• digital pictures, 
• online shopping, 
• hospitality data, 
• airlines, 
• purchase transaction records, and many more… 
This data is “big data.”
Comparison: 1990s vs. 2014 
Hard disk: 1 GB - 20 GB → 1 TB 
RAM: 64-128 MB → 4-16 GB 
Read speed: 10 KB/s → 100 MB/s
Why Big Data? 
• The growth of Big Data is driven by: 
– increasing storage capacities 
– increasing processing power 
– the availability of data (of many different types) 
• Every day we create 2.5 quintillion bytes of data; 
90% of the data in the world today has been created 
in the last two years alone.
Big Data stores 
• Choose the correct data stores based on your 
data characteristics. 
• Data-center staff maintain these servers, which 
can be IBM, EMC, or similar hardware. 
• Whenever you want to process data, the traditional flow is: 
– fetch the data, 
– bring it to your local machine, 
– then process it.
Three Characteristics of Big Data: The 3 Vs 
• Volume – data quantity 
• Velocity – data speed 
• Variety – data types
1st Characteristic of Big Data: Volume 
• Refers to the vast amount of data generated every second. 
• The size of available data has been growing at an increasing rate. 
• Today, Facebook ingests 500 terabytes of new data every day. 
• Smartphones, with the data they create and consume, and sensors 
embedded in everyday objects will soon result in billions of new, 
constantly updated data feeds containing environmental, location, 
and other information, including video.
2nd Characteristic of Big Data: Velocity 
• Refers to the speed at which new data is generated. 
• Also the speed at which data moves around. 
• Clickstreams and ad impressions capture user behavior at 
millions of events per second. 
• Machine-to-machine processes exchange data between 
billions of devices. 
• Online gaming systems support millions of concurrent users, 
each producing multiple inputs per second.
3rd Characteristic of Big Data: Variety 
• Refers to the different types of data we now use. 
• In the past we focused only on structured data that 
fitted neatly into tables and relational databases. 
• Nowadays, 80% of data is unstructured (text, images, 
video, voice) or semi-structured (log files). 
• Big Data analysis includes all of these types of data.
Big Data! A Worldwide Problem? 
• It is becoming very difficult for companies to 
store, retrieve, and process the ever-increasing 
data. 
• The problem lies in the use of traditional 
systems to store enormous data. 
• These systems were a success a few years 
ago, but with the increasing amount and complexity 
of data they are fast becoming obsolete.
Contd.. 
• When the data is small, processing is fast. 
• As soon as the data grows, processing can no 
longer keep up. 
• Thus, as data grows, processing capacity must 
scale to match. 
• Hadoop was therefore introduced as a 
solution.
Solution for Big Data! 
The good news is Hadoop: 
• a panacea for all those companies working with 
BIG DATA in a variety of applications; 
• it has become integral to storing, handling, 
evaluating, and retrieving hundreds of terabytes 
or even petabytes of data.
Apache Hadoop! 
• Hadoop was developed by Doug Cutting and 
Michael J. Cafarella. 
• Hadoop is an open-source software 
framework. 
• It supports data-intensive distributed 
applications. 
• Hadoop is licensed under the Apache v2 
license, and is therefore known as Apache Hadoop.
Core concepts of Hadoop 
• HDFS (Hadoop Distributed File System) 
– A technique for storing huge amounts of data. 
• MapReduce 
– A technique for processing the data that we 
store in HDFS.
HDFS 
• A file system specially designed for storing huge 
data sets on a cluster of commodity hardware with a 
streaming access pattern. 
– Cluster: a set of machines working together. 
– Commodity hardware: cheap, off-the-shelf hardware. 
– Streaming access pattern: write once, read any number of 
times, but do not change the contents of a file once 
you have stored it in HDFS.
Contd.. 
• HDFS (Hadoop Distributed File System) 
splits files into large blocks (64 MB or 
128 MB by default) and distributes the blocks 
among the nodes in the cluster. 
• To process the data, 
Hadoop MapReduce ships code to the nodes 
that have the required data, and the nodes 
then process the data in parallel (see the HDFS read sketch below).
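As an illustration, here is a minimal Java sketch (not from the original deck) of reading a file back from HDFS through the FileSystem API. The path /user/demo/sample.txt is hypothetical, and the cluster address is assumed to come from a core-site.xml on the classpath:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsRead {
        public static void main(String[] args) throws Exception {
            // fs.defaultFS (the NameNode address) is read from core-site.xml.
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            Path file = new Path("/user/demo/sample.txt"); // hypothetical path
            // Streaming access: open the file once and read it start to finish.
            try (FSDataInputStream in = fs.open(file);
                 BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    System.out.println(line);
                }
            }
        }
    }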
MapReduce 
• A technique for processing the data that we store in 
HDFS. 
• Hadoop runs MapReduce jobs in the form of key-value pairs. 
• The Mapper and Reducer also work with key-value pairs, as 
the sketch below shows.
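As a concrete example, here is the canonical word-count Mapper in Java (a standard illustration, not specific to this deck). With the default TextInputFormat, the record reader hands the mapper (byte offset, line) pairs, and the mapper emits a (word, 1) pair for every word:

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Input key = byte offset of the line, input value = the line itself.
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE); // emit (word, 1)
            }
        }
    }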
Contd.. 
• The record reader is an interface between an input split and a mapper. 
• There is one record reader for every input split and mapper. 
• The record reader is taken care of by the Hadoop framework 
itself by default. 
• In the Mapper code we write the logic; on the basis of that logic it 
emits key-value pairs. 
• The record reader converts records into key-value pairs according 
to one of three input formats (a driver configuring them is sketched below): 
– TextInputFormat (the default) 
– KeyValueTextInputFormat 
– SequenceFileInputFormat
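A driver class ties the job together and is where the input format is chosen. This sketch assumes the WordCountMapper above and the WordCountReducer shown after the shuffle/sort slide, with input and output paths taken from the command line:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCountDriver.class);
            job.setMapperClass(WordCountMapper.class);
            job.setReducerClass(WordCountReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            // TextInputFormat is the default; KeyValueTextInputFormat or
            // SequenceFileInputFormat could be set here instead.
            job.setInputFormatClass(TextInputFormat.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }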
• Shuffling: 
– A phase over the intermediate data that groups all 
key-value pairs with the same key into one collection, e.g.: 
• (how, [1, 1, 1, 1, 1]) 
• (is, [1, 1, 1, 1, 1]) 
• Sorting: 
– Another phase over the intermediate data that sorts 
all key-value pairs by key. 
A Reducer that consumes this grouped output is sketched below.
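Completing the word-count sketch, the Reducer below is called once per key with its shuffled and grouped values, e.g. ("how", [1, 1, 1, 1, 1]), and sums them:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            // After shuffle/sort this is called once per key with all its values.
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            result.set(sum);
            context.write(key, result); // emit (word, total count)
        }
    }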
How Big Data Impacts IT 
• Big data is a disruptive force, presenting 
opportunities as well as challenges to IT organizations. 
• By 2015 there will be 4.4 million IT jobs in Big Data, 1.9 million 
of them in the US alone. 
• India will require a minimum of 100,000 (1 lakh) data 
scientists in the next couple of years, in addition 
to data analysts and data managers, to support 
the Big Data space.
Future of Big Data 
• Some $15 billion has been spent on software firms 
specializing solely in data management and analytics. 
• This industry on its own is worth more than $100 
billion and is growing at almost 10% a year, which is 
roughly twice as fast as the software business as a 
whole. 
• In February 2012, the open-source analyst firm 
Wikibon released the first market forecast for Big 
Data, listing $5.1B in revenue for 2012, growing to 
$53.4B in 2017. 
• The McKinsey Global Institute estimates that data 
volume is growing 40% per year.