Transforming Mobile Marketing & Advertising™




                        Harnessing Hadoop for Big Data
                        Analytics

                                                                   Jobin Wilson
                                                                   jobin.wilson@flytxt.com




                                                                                             Confidential
               Copyright © 2010 Flytxt B.V. All rights reserved.
Who am I?

   • Architect @ Flytxt (Big Data Analytics & Automation)

   • Passionate about data, distributed computing, machine learning

   • Previously

        • Virtualization & Cloud Lifecycle Management (BMC)

               • Designed and Implemented Cloud Life Cycle Management Interface for BMC

        • Large Scale Data Centre Automation (AOL)

               • Implemented Centralized Data Center Management Framework for AOL

        • Workflow Systems & Automation (Accenture)

               • Implemented Service Management Suite for various customers




Session Agenda!

• Data – What's the big deal?

• What is Hadoop (& what it is not)

• Map-Reduce Model & HDFS

• Hadoop Ecosystem & Tools

• Let's get started!

• Q&A




Five computers & a 640k ;-)


       "I think there is a world market for about five computers"

                          Attributed to
                          Thomas Watson, Chairman of the Board of IBM, 1943.


       "640k ought to be enough for anybody"

                          Attributed to
                          Bill Gates in 1981.


       [Figure: Moore’s Law]




Data Explosion!




Do I also know what you might do next summer?


                                        •     Does your travel company know you visited Goa &
                                              Cochin twice in the last two years?

                                        •     Collaborative Filtering




                                        •     Lots of Data + Statistics = WOW!!!

                                        •     BTW, don’t worry about the equation
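
The collaborative filtering idea on this slide can be sketched in a few lines. This is a minimal, user-based sketch under stated assumptions: the traveller names, destinations, and visit counts are invented for illustration, and cosine similarity is just one common choice of similarity measure, not necessarily what any real travel company uses.

```python
from math import sqrt

# Hypothetical visit counts: traveller -> {destination: number of visits}
visits = {
    "asha":  {"Goa": 2, "Cochin": 2, "Munnar": 1},
    "ravi":  {"Goa": 2, "Cochin": 1},
    "meera": {"Shimla": 3, "Manali": 2},
}

def cosine(a, b):
    """Cosine similarity between two sparse count vectors (dicts)."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def recommend(user, data):
    """Rank places the user has not visited, weighted by how similar
    the travellers who did visit them are to this user."""
    scores = {}
    for other, their in data.items():
        if other == user:
            continue
        sim = cosine(data[user], their)
        for place, count in their.items():
            if place not in data[user]:
                scores[place] = scores.get(place, 0.0) + sim * count
    # Keep only places backed by at least one similar traveller
    return sorted((p for p in scores if scores[p] > 0), key=lambda p: -scores[p])

print(recommend("ravi", visits))
```

With this toy data, the only traveller similar to "ravi" is "asha", so the one suggestion is the place she visited and he did not. The more travellers and visits there are, the more useful the similarity weights become, which is the "lots of data + statistics" point.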




Don't throw away data just because it doesn't 'fit'


 •   Relational tuples, log files, semi-structured textual data (e.g., e-mail),
     pictures, videos

 •   User generated data & System generated data

 •   Applications need more than structured data

 •   My application is not “Dumb” any more!!

 •   “I keep saying that the sexy job in the next 10 years will be
      statisticians, and I’m not kidding.” - Hal Varian (Google’s chief economist)




Let's get to business!!

What is Apache Hadoop?

•   Apache Hadoop is an open-source system to
    reliably store and process extremely large data sets
    across many commodity computers.

•   Originally developed to support the Nutch search engine
    project.

•   Designed to scale close to linearly with data size or analysis complexity

•   Scale-out, shared-nothing architecture

•   Inspired by Google's MapReduce and Google File
    System (GFS) papers




Basics of Hadoop


 •   Two Core Components – HDFS & Map-Reduce

 •   Machines are unreliable; failure is expected

 •   Separates distributed, fault-tolerant computing code from application
     logic.

 •   No need to worry about the identity of a machine

 •   Lets you interact with a cluster, not a bunch of machines.

 •   Analysis workloads span multiple machines

 •   Runs as a cloud (cluster) & possibly on a cloud (EC2)




Lead Actors


•   NameNode – Bookkeeping (metadata) server for HDFS

•   Secondary NameNode – Assistant to the NameNode (takes periodic checkpoints of its metadata; not a hot standby)

•   JobTracker – Job scheduler

•   TaskTracker – Task execution

•   DataNode – Block storage




HDFS Write Model
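
The diagram for this slide did not survive extraction, so here is a rough sketch of the write path it most likely illustrated: the client splits a file into blocks, asks the NameNode where each block should go, and streams every block through a pipeline of DataNodes. The 64 MB block size, the replication factor of 3, the `plan_write` helper, and the round-robin placement are all assumptions made for illustration; real HDFS placement is rack-aware and decided by the NameNode.

```python
BLOCK_SIZE = 64 * 1024 * 1024   # assumed block size (64 MB)
REPLICATION = 3                 # assumed replication factor

def plan_write(file_size, datanodes, block_size=BLOCK_SIZE, replication=REPLICATION):
    """Return a list of (block_index, block_length, pipeline) tuples.

    Hypothetical stand-in for the NameNode's decision. Placement simply
    rotates through the DataNodes; it ignores racks and free space.
    """
    if replication > len(datanodes):
        raise ValueError("not enough DataNodes for the requested replication")
    plan = []
    offset = 0
    index = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        # The client sends the block to the first node, which forwards it
        # to the second, which forwards it to the third.
        pipeline = [datanodes[(index + r) % len(datanodes)] for r in range(replication)]
        plan.append((index, length, pipeline))
        offset += length
        index += 1
    return plan

for block in plan_write(150 * 1024 * 1024, ["dn1", "dn2", "dn3", "dn4"]):
    print(block)
```

A 150 MB file becomes two full blocks and one partial block, each stored on three different DataNodes, so losing any single machine loses no data.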




Map-Reduce Model
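
The model itself fits in a few lines of plain Python. The sketch below is an in-memory illustration of the idea, not the Hadoop API: `map_fn`, `reduce_fn`, and `run_mapreduce` are hypothetical names, and word count is used only because it is the customary first example.

```python
from collections import defaultdict

def map_fn(line):
    """Map: emit a (word, 1) pair for every word in one input record."""
    for word in line.lower().split():
        yield word, 1

def reduce_fn(word, counts):
    """Reduce: combine all values seen for one key."""
    return word, sum(counts)

def run_mapreduce(records, mapper, reducer):
    """Run map, group values by key (the shuffle), then reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return dict(reducer(key, values) for key, values in sorted(groups.items()))

lines = ["big data is big", "data is data"]
print(run_mapreduce(lines, map_fn, reduce_fn))
# → {'big': 2, 'data': 3, 'is': 2}
```

The application author writes only the two small functions; grouping by key and moving data between machines is the framework's job.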




Map-Reduce Execution Flow
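
The flow diagram is missing from this extraction; the stages it most likely showed are input splits, map tasks, partitioning by key, a sort on each reducer, and reduce tasks writing one output each. The sketch below simulates those stages in a single process. All names (`partition`, `map_task`, `run_job`) are hypothetical, and a CRC32-based partitioner is an assumption chosen only so the example is repeatable; it is not Hadoop's own hash function.

```python
import zlib
from collections import defaultdict
from itertools import groupby

def partition(key, num_reducers):
    """Pick the reducer for a key; every occurrence of a key lands on the same one."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def map_task(split):
    """One map task processes one input split and emits (word, 1) pairs."""
    for line in split:
        for word in line.split():
            yield word, 1

def run_job(splits, num_reducers=2):
    # Map phase: one task per split, output bucketed by target reducer
    buckets = defaultdict(list)
    for split in splits:
        for key, value in map_task(split):
            buckets[partition(key, num_reducers)].append((key, value))
    # Reduce phase: each reducer sorts its bucket, then sums values per key
    outputs = {}
    for r in range(num_reducers):
        pairs = sorted(buckets[r])
        outputs[r] = {
            key: sum(value for _, value in group)
            for key, group in groupby(pairs, key=lambda kv: kv[0])
        }
    return outputs

splits = [["big data is big"], ["data is data"]]
for reducer_id, result in run_job(splits).items():
    print(reducer_id, result)
```

Each reducer's dictionary plays the role of one output file. Together they hold the full answer, and no key appears in more than one of them.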




Hadoop Ecosystem
•   Oozie – Open-source workflow/coordination
    service to manage data processing jobs for Apache
    Hadoop™. Developed at Yahoo!

•   HBase – Column-oriented database modeled on
    Google’s Bigtable. Holds extremely large data sets
    (petabytes)

•   Hive – Data warehousing application with a SQL-like
    query language for analyzing very large data sets.
    Developed at Facebook

•   ZooKeeper – Distributed coordination service
    providing leader election, service
    discovery, distributed locking / mutual exclusion

•   Pig – Platform for analyzing large data sets that
    consists of a high-level language for expressing
    data analysis steps

•   Ganglia – A scalable distributed monitoring system
    for high-performance computing systems such as
    clusters and grids
Hadoop is not a “Holy Grail”

•   Not a substitute for a database

•   MapReduce is not always the best fit for a problem

•   HDFS is not a substitute for a
    high-availability SAN-hosted file system

•   HDFS is not a POSIX file system

•   Not a place to learn Java programming

•   Not a place to learn Unix/Linux system administration

•   Not a place to learn basics of networking




Notable Users of Hadoop
(Source: http://en.wikipedia.org/wiki/Hadoop)



     • A9.com                               • Meebo
     • AOL                                  • Metaweb
     • eHarmony                             • The New York Times
     • eBay                                 • Rackspace
     • Facebook                             • StumbleUpon
     • Fox Interactive Media                • Twitter
     • IBM                                  • Yahoo
     • Last.fm                              • Amazon
     • LinkedIn




Q&A




                                                    www.flytxt.com
THANK YOU
      Contact us: dev2dev@flytxt.com / jobin.wilson@flytxt.com





Harnessing hadoop for big data analytics v0.1
