SOFTWARE DEVELOPMENT DONE RIGHT
       Netherlands | USA | India | UK | France
What is Big
Data?

  Generally refers to data that cannot be
processed efficiently by traditional systems,
mainly because of its size.


    Twitter/Facebook example
    
      Facebook – 500 TB of data daily
    
      Twitter – 250 million tweets daily


 90% of data has been generated in the last 2-3
years.
Big Data
Sources

    Sources -
    • Social networking sites like Twitter, Facebook, etc.
    • Smart phones
    • Trading platforms
    • Machines
    • Log Files


    This data is used for different purposes like
     • Product Trends
     • Market Analysis
What is
Hadoop ?

  Apache Hadoop is a framework for running
applications on large clusters built of commodity
hardware.

  Transparently provides applications both
reliability and data motion.

  Implements a computational paradigm named
Map/Reduce, in which an application is divided
into small fragments of work.

  Provides a distributed file system (HDFS)

  Moves code close to the data.

  Hadoop opened the gates for processing Big
Data
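
The Map/Reduce paradigm named on this slide can be illustrated with a minimal, single-process sketch. This is not Hadoop's API: the function names `map_phase`, `shuffle`, `reduce_phase`, and `word_count` are hypothetical and chosen only for illustration. A real Hadoop job would run many map and reduce fragments in parallel across the cluster, with the framework doing the grouping step.

```python
from collections import defaultdict


def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, standing in for the framework's shuffle step."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: combine all values seen for one key into a total."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over a list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data", "Big cluster"]))
```

Each call to `map_phase` depends on a single line only, which is what allows the work to be split into small, independent fragments.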
Hadoop's
History

    Hadoop is based on work done by Google

    
        GFS – HDFS

    
      Google MapReduce – Hadoop MapReduce

    
        BigTable – HBase
Hadoop
Features

    Partial Failure Support


    Data Recoverability


    Component Recovery


    Consistency


    Scalability
Hadoop
Components

    Core Components
    • HDFS – Hadoop Distributed File System
    • Map Reduce



    Projects in Hadoop Ecosystem
    • Pig, Hive, HBase, Flume, Oozie, Sqoop, etc.
HDFS
Map/Reduce
Case
Study

  Product - Data Quality and cleansing product
solutions.


    Before Hadoop
     
       Two node DB cluster
     
       Multi-threaded Java application for de-duplication
     
       1 million records took 10 hrs. to process


    After Hadoop
     
        8 GB RAM, 4 cores, 4 machines in cluster.
     
       1 million records took 30 min to process
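
The before/after figures in this case study can be turned into a speedup and a throughput number with some simple arithmetic. The sketch below uses only the values stated on the slide; the helper names `speedup` and `throughput` are hypothetical.

```python
RECORDS = 1_000_000   # records processed in both runs (from the slide)
BEFORE_MIN = 10 * 60  # 10 hours before Hadoop, in minutes
AFTER_MIN = 30        # 30 minutes after Hadoop


def speedup(before_minutes, after_minutes):
    """How many times faster the second run is than the first."""
    return before_minutes / after_minutes


def throughput(records, minutes):
    """Average number of records processed per minute."""
    return records / minutes


if __name__ == "__main__":
    print(f"Speedup: {speedup(BEFORE_MIN, AFTER_MIN):.0f}x")
    print(f"Before: {throughput(RECORDS, BEFORE_MIN):.0f} records/min")
    print(f"After:  {throughput(RECORDS, AFTER_MIN):.0f} records/min")
```

With these inputs, the processing time drops by a factor of 20.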
Hadoop In
Use

    Any application which has
     
       > 10TB data
     
       Needs fast and cheap processing

    Log Analysis

    Recommendation Engine

    Feed Analysis

    Data Mining

    Statistical Analysis

    ETL Processing

    Business Intelligence
Cloudera
 
   Cloudera is “The commercial Hadoop
 company”.

 
   Founded by leading experts on Hadoop
 from Facebook, Google, Oracle and Yahoo.

 
   Provides consulting and training services
 for Hadoop users.

 
   Staff includes committers to virtually all
 Hadoop projects.
Resources
 
     Books
      
        Hadoop: The Definitive Guide (by Tom White)
      
        HBase: The Definitive Guide (by Lars George)
      
        MapReduce Design Patterns (by Donald Miner)

 
     Web
      
          http://hadoop.apache.org/
      
          http://hbase.apache.org/
      
          http://research.google.com/archive/bigtable.html
      
          http://research.google.com/archive/mapreduce-osdi04.pdf
Contact us @

                 Xebia India

Website
www.xebia.com
www.xebia.in
www.xebia.fr

Thought Leadership
http://blog.xebia.com
http://podcast.xebia.com
