Leveraging Open Source Big Data Stack
                                              Prasanth M Sasidharan




                 Copyright © 2011 Flytxt B.V. All rights reserved   1/16/2012
What is data?
       Data is information in raw or unorganized form, such as letters,
        numbers, or symbols.


What is Big Data?
        Big Data refers to datasets so large that they are difficult to store,
         manage, and analyze.

        Every day, we create 2.5 quintillion bytes of data, so much that 90% of the
         data in the world today has been created in the last two years alone.




Data Explosion!
Global Data Trends




Big Data & Distributed Computing
    Multiple servers each work on one part of the job, and each performs the same task on its part.
    Key challenges:
       • Work distribution and orchestration
       • Error recovery
       • Scalability and management
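
    The idea can be sketched on a single machine. The following is a minimal illustration, not how Hadoop itself is implemented: worker threads stand in for servers, word counting stands in for the shared task, and every function name here is an assumption made for the example.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(items, n_chunks):
    """Work distribution: cut the input into roughly equal parts."""
    size = max(1, -(-len(items) // n_chunks))  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]


def count_words(lines):
    """The same task that every worker runs on its own part of the job."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def run_with_retry(task, chunk, attempts=2):
    """Error recovery: rerun a failed chunk a limited number of times."""
    for attempt in range(attempts):
        try:
            return task(chunk)
        except Exception:
            if attempt == attempts - 1:
                raise


def distributed_word_count(lines, n_workers=4):
    """Orchestration: hand out chunks, then merge the partial results."""
    chunks = split_into_chunks(lines, n_workers)
    total = Counter()
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # Threads stand in for servers here (an assumption of this sketch).
        for partial in pool.map(lambda c: run_with_retry(count_words, c), chunks):
            total.update(partial)
    return total


if __name__ == "__main__":
    sample = ["big data", "Big stack", "data data"]
    print(distributed_word_count(sample, n_workers=2))
```

    Scalability in this sketch is only the n_workers parameter; on a real cluster it also involves moving data between machines, which the sketch leaves out.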




FOSS in Aadhaar
    Aadhaar is a 12-digit unique number which the Unique Identification Authority
    of India (UIDAI) will issue to all residents of India.

    The number will be stored in a centralized database and linked to the basic
    demographic and biometric information (photograph, ten fingerprints and iris)
    of each individual.

    It is unique and robust enough to eliminate the large number of duplicate and
    fake identities in government and private databases.




Let's Meet a Stack!




 Application Layer




 Infrastructure Layer




Infrastructure for Big Data Analysis
     What is virtualization?
      Virtualization allows multiple operating system instances to
      run concurrently on a single computer; it is a means of decoupling
      the hardware from any single operating system.




What is a Hypervisor?
   ◦ Also called a virtual machine manager (VMM), a hypervisor is one of many hardware
     virtualization techniques that allow multiple operating systems, termed guests, to
     run concurrently on a host computer

   ◦ Originally developed by IBM in the 1960s on System/360 mainframes and offered commercially in the 1970s as VM/370




   Xen® hypervisor




Advantages of FOSS

   Flexibility and Freedom



   Reliability


   Auditability


   Fast Deployment



   Cost




Cost for Reproducing YouTube

   Capital expenditures ($M)

   System                                        Hardware   Software   Total
   Oracle Exadata                                $147.4     $442.0     $589.4
   Alternative: open source, commodity hardware  $104.2     $0.0       $104.2

   Annual expenses, excluding hardware support ($M)

   System                                        Staff      Support    Total
   Oracle Exadata                                $1.6       $97.4      $99.0
   Alternative: open source, commodity hardware  $2.2       $12.9      $15.1
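
   To see how the gap grows over time, the two totals can be combined into a cumulative cost. This is a rough sketch only: the figures are copied from the table, while the multi-year horizon, the dictionary layout and the function names are assumptions made for the example. Hardware support is left out because the table excludes it.

```python
# Figures in $M, copied from the cost table above.
# The layout and the names below are assumptions made for this sketch.
COSTS = {
    "Oracle Exadata": {"capex": 589.4, "annual": 99.0},
    "Open source, commodity hardware": {"capex": 104.2, "annual": 15.1},
}


def cumulative_cost(capex, annual, years):
    """Capital expenditure plus `years` of annual expenses, in $M."""
    return capex + annual * years


def compare(years):
    """Cumulative cost of each option after `years`, rounded to $0.1M."""
    return {
        name: round(cumulative_cost(c["capex"], c["annual"], years), 1)
        for name, c in COSTS.items()
    }


if __name__ == "__main__":
    for years in (1, 3, 5):
        totals = compare(years)
        ratio = totals["Oracle Exadata"] / totals["Open source, commodity hardware"]
        print(years, totals, f"ratio {ratio:.1f}x")
```

   The open source option carries the higher staff cost in the table, so the ratio depends mostly on the software and support columns.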




Get Involved!
  Find out about Apache projects (http://projects.apache.org/)
  Join the mailing lists
  Pick up a bug
  Suggest ideas or fixes
  Check out the latest code or download a release
  Change the source files to incorporate your change or addition
  Provide appropriate source code documentation and follow the project's
   coding conventions
  Check whether the software still compiles and runs correctly
  Run any unit or regression tests the software may have
  Send the patch for review and committing


Notable Users of Hadoop
(Source: http://en.wikipedia.org/wiki/Hadoop)

    •    Adobe
    •    Amazon
    •    AOL
    •    eBay
    •    Facebook
    •    Fox Interactive Media
    •    IBM
    •    Last.fm
    •    LinkedIn
    •    Meebo
    •    The New York Times
    •    Rackspace
    •    StumbleUpon
    •    Twitter
    •    Yahoo

References
        • Hadoop: The Definitive Guide-MapReduce for the Cloud

        • HBase: The Definitive Guide

        • Hive Wiki (http://wiki.apache.org/hadoop/Hive)

        • Pig Wiki (http://wiki.apache.org/pig/)



Open Source Initiatives @ FlyTXT
    Customization Specific to our business lines

    Mahout Enhancements for additional Machine Learning Algorithms

    Hive Customization

    Oozie Enhancements

    Hadoop Enhancements

    We won the IEEE cloud computing challenge




THANK YOU




Extra Slides




Major Contributors to Hadoop




Quantity of Global Data (exabytes)

   2005     130
   2012     2,720
   2015*    7,910
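
   The growth these three data points imply can be worked out directly. The sketch below is only an illustration: the compound annual growth rate is one way among several to summarize the trend, the function names are assumptions, and the conversion uses the decimal convention of 1 exabyte = 1 billion gigabytes.

```python
# Data points from the chart above, in exabytes.
GLOBAL_DATA_EB = {2005: 130, 2012: 2720, 2015: 7910}


def exabytes_to_gigabytes(exabytes):
    """Decimal convention: 1 exabyte = 1 billion (10**9) gigabytes."""
    return exabytes * 10**9


def annual_growth_rate(start_value, end_value, years):
    """Compound annual growth rate between two data points."""
    return (end_value / start_value) ** (1 / years) - 1


if __name__ == "__main__":
    years = sorted(GLOBAL_DATA_EB)
    for first, second in zip(years, years[1:]):
        rate = annual_growth_rate(
            GLOBAL_DATA_EB[first], GLOBAL_DATA_EB[second], second - first
        )
        print(f"{first} to {second}: about {rate:.0%} per year")
    print(f"2005 volume: {exabytes_to_gigabytes(GLOBAL_DATA_EB[2005]):,} GB")
```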




Numbers Behind the News!


    Twitter produces over 230 million tweets per day


    Wal-Mart logs one million transactions per hour


    Facebook users share over 30 billion pieces of content every month,
     including web links, news, blogs and photos


    India's mobile subscriber base stands at 873.61 million


    India has a population of 1.21 billion
Let's Meet the Big Data Stack
 •   Oozie: open-source workflow and coordination service for
     managing data processing jobs on Apache Hadoop™.
     Developed at Yahoo!

 •   HBase: column-oriented database modeled on Google's
     Bigtable. Holds extremely large data sets (petabytes)

 •   Hive: SQL-based data warehousing application with features for
     analyzing very large data sets. Developed at Facebook

 •   ZooKeeper: distributed coordination service providing
     leader election, service discovery, and distributed locking /
     mutual exclusion

 •   Pig: platform for analyzing large data sets, built around a
     high-level language for expressing data analysis steps

 •   Ganglia: scalable distributed monitoring system for
     high-performance computing systems such as clusters and grids

 •   Apache Mahout: free implementations of distributed or
     otherwise scalable machine learning algorithms on
     the Hadoop platform



Editor's Notes

  • #20 An exabyte is 1 billion gigabytes. At 7,910 exabytes, the digital universe holds 3 times more bits of information than there are stars in the physical universe.
  • #21 Indian telecom added 7.9 million new subscribers in September. The Indian population figure can be related to the Aadhaar project.
  • #22 A mahout is a person who drives an elephant. Example algorithm: catching a taxi from the airport.