An introduction to
Big Data processing
using Hadoop
A.Sedighi
hexican.com
Big Data, Definition
No single standard definition…
"Big Data" is data whose scale, diversity, and complexity require new
architecture, techniques, algorithms, and analytics to manage it and
extract value and hidden knowledge from it…
Information is powerful…
but it is how we use it that will
define us
Data Explosion
relational
text
audio
video
images
Big Data Era
- creates over 30 billion pieces of content per day
- stores 30 petabytes of data
- produces over 90 million tweets per day
Log Files
- Log files contain data.
- Each banking transaction should be logged at different levels.
How much log data does a banking solution generate per day?
Big Data: 3 V's
Big Data: 3 V's
volume
velocity
variety
Some make it 3 V's
What is driving the Big Data industry?
- Optimizations and predictive analytics
- Complex statistical analysis
- All types of data, and many sources
- Very large datasets
- More real-time processing
- Ad-hoc querying and reporting
- Data mining techniques
- Structured data, typical sources
- Small to mid-size datasets
Big Data Challenges
Big Data Challenges
Sorting 10 TB of data (an O(N log N) job):
- 1 node takes about 2.5 days
- 100 nodes take about 35 minutes
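A rough back-of-the-envelope check of these figures, assuming near-linear scaling across nodes (an idealization, since redistributing data between nodes adds overhead):

  2.5 days ≈ 60 hours ≈ 3,600 minutes
  3,600 minutes / 100 nodes ≈ 36 minutes, close to the quoted 35 minutes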
Big Data Challenges
Problem: "fat" servers imply high cost.
Solution: use cheap commodity nodes instead.
Problem: a large number of cheap nodes implies frequent failures.
Solution: leverage automatic fault tolerance.
Big Data Challenges
We need a new data-parallel programming model
for clusters of commodity machines.
What Technology Do We Have for Big Data?
MapReduce
MapReduce
Published in 2004 by Google
Popularized by the Apache Hadoop project.
Used by Yahoo!, Facebook, Twitter, Amazon, LinkedIn, and many other enterprises.
Word Count Example
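The word count slide itself is a diagram in the original deck; as a stand-in, here is a minimal sketch of the classic Hadoop word count job written against the standard org.apache.hadoop.mapreduce API (the input and output paths come from the command line and are placeholders):

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: emit (word, 1) for every token in the input split.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: sum the counts collected for each word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // local pre-aggregation on the map side
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory (placeholder)
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory, must not exist yet
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Packaged as a jar, it would typically be launched with something like hadoop jar wordcount.jar WordCount <input dir> <output dir>, where both paths live in HDFS.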
MapReduce philosophy
- hide complexity
- make it scalable
- make it cheap
MapReduce was popularized by the Apache Hadoop project
Hadoop Overview
Open-source implementation of Google MapReduce and the Google File System (GFS)
First release in 2008 by Yahoo!
Wide adoption by Facebook, Twitter, Amazon, etc.
Everything Started by Searching
Hadoop was created by Doug Cutting, the creator of Apache Lucene, the widely used text search library. Hadoop has its origins in Apache Nutch, an open source web search engine, itself a part of the Lucene project.
Hadoop Sub-Projects - 1
Hadoop Sub-Projects - 2
Hadoop Distributed File System (HDFS) - 1
HDFS is a filesystem designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware.
- "Very large" in this context means files that are hundreds of megabytes, gigabytes, or terabytes in size. There are Hadoop clusters running today that store petabytes of data.
Hadoop Distributed File System (HDFS) - 2
HDFS is a filesystem designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware.
- HDFS is built around the idea that the most efficient data processing pattern is a write-once, read-many-times pattern. The time to read the whole dataset is more important than the latency in reading the first record.
Hadoop Distributed File System (HDFS) - 3
HDFS is a filesystem designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware.
- HDFS is designed to carry on working without a noticeable interruption to the user in the face of hardware failure.
Where doesn't HDFS work well?
● Low-latency data access
● Lots of small files
● Multiple writers, arbitrary file modifications
MapReduce and HDFS
HDFS Concepts - Blocks
64 MB, 128 MB, or 256 MB block size.
If the seek time is around 10ms, and the transfer rate is 100 MB/s,
then to make the seek time 1% of the transfer time, we need to
make the block size around 100 MB.
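Spelling out the arithmetic behind this rule of thumb, using the figures quoted above:

  seek time = 10 ms, and we want it to be 1% of the transfer time
  transfer time = 10 ms / 0.01 = 1 s
  block size ≈ transfer rate × transfer time = 100 MB/s × 1 s = 100 MB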
Anatomy of a File Read
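The read path is shown as a diagram in the deck: the client asks the namenode for block locations and then streams the blocks directly from the datanodes. Purely as a client-side illustration, here is a minimal sketch that reads a file out of HDFS through Hadoop's FileSystem API (the hdfs:// URI passed on the command line is a placeholder):

import java.io.InputStream;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsCat {
  public static void main(String[] args) throws Exception {
    String uri = args[0];  // e.g. an hdfs://... file URI (placeholder)
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create(uri), conf);  // the client contacts the namenode first
    InputStream in = null;
    try {
      // open() returns a stream that pulls the file's blocks from the datanodes
      in = fs.open(new Path(uri));
      IOUtils.copyBytes(in, System.out, 4096, false);
    } finally {
      IOUtils.closeStream(in);
    }
  }
}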
Anatomy of a File Write
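Correspondingly, a write asks the namenode to allocate blocks and then pipelines the data through a chain of datanodes. Here is a minimal client-side sketch that copies a local file into HDFS via the same FileSystem API (both paths are placeholders):

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsPut {
  public static void main(String[] args) throws Exception {
    String localSrc = args[0];  // local file to copy (placeholder)
    String dst = args[1];       // destination hdfs://... URI (placeholder)

    InputStream in = new BufferedInputStream(new FileInputStream(localSrc));
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create(dst), conf);
    // create() asks the namenode for new blocks; the data is then pipelined to the datanodes
    OutputStream out = fs.create(new Path(dst));
    try {
      IOUtils.copyBytes(in, out, 4096, false);
    } finally {
      IOUtils.closeStream(out);  // closing flushes the last packet and completes the file
      IOUtils.closeStream(in);
    }
  }
}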
Replica Placement
Machine Learning - 1
Mahout's goal is to build scalable machine learning libraries. Core algorithms for clustering, classification, and batch-based collaborative filtering are implemented on top of Apache Hadoop using the map/reduce paradigm.
Machine Learning - 2
Mahout can be used as a recommender engine on top of Hadoop clusters.
Using Hadoop for:
● ads and recommendations
● online travel
● processing mobile data
● energy savings and discovery
● infrastructure management
● image processing
● fraud detection
● IT security
● health care
