The Hadoop Ecosystem


                       J Singh, DataThinks.org

                                   March 12, 2012
The Hadoop Ecosystem
• Introduction
   – What Hadoop is, and what it’s not
   – Origins and History
   – Hello Hadoop
• The Hadoop Bestiary
• The Hadoop Providers
• Hosted Hadoop Frameworks




© J Singh, 2011                          2
What Hadoop is, and what it’s not
• A Framework for Map Reduce

• A Top-level Apache Project

• Hadoop is
   – A Framework, not a “solution”
        • Think Linux or J2EE
   – Scalable
   – Great for pipelining massive amounts of data to achieve the end result
   – Sometimes the only option

• Hadoop is not
   – A painless replacement for SQL
   – Uniformly fast or efficient
   – Great for ad hoc analysis


You are ready for Hadoop when…
• You no longer get enthused by the prospect of more data
   – Rate of data accumulation is increasing
   – The idea of moving data from hither to yon is positively scary
   – A hit man threatens to delete your data in the middle of the night
        • And you want to pay him to do it


• Seriously, you are ready for Hadoop when analysis is the bottleneck
   – Could be because of data size
   – Could be because of the complexity of the data
   – Could be because of the level of analysis required
   – Could be because the analysis requirements are fluid




MapReduce Conceptual Underpinnings
• Based on Functional Programming model
   – From Lisp
        • (map square '(1 2 3 4)) → (1 4 9 16)
        • (reduce plus '(1 4 9 16)) → 30
   – From APL
        • +/ N → 10, where N ← 1 2 3 4


• Easy to distribute (based on each element of the vector)

• New for Map/Reduce: Nice failure/retry semantics
   – Hundreds and thousands of low-end servers are running at the
     same time
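The Lisp and APL snippets above map directly onto Python's built-ins; a minimal illustration of the functional model (not Hadoop code):

```python
from functools import reduce

# Map: apply a function to each element independently (trivially parallelizable).
squares = list(map(lambda x: x * x, [1, 2, 3, 4]))

# Reduce: fold the mapped results into a single value.
total = reduce(lambda a, b: a + b, squares)

print(squares, total)  # [1, 4, 9, 16] 30
```

Because each map call is independent, the elements can be squared on different machines; only the reduce step needs to see the combined results.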



MapReduce Flow

                   Word Count Example




   Lines       MapOut      Result
   foo bar     foo 1       foo 3
   quux foo    bar 1       labs 1
   foo labs    quux 1      quux 2
   quux        foo 1       bar 1
               foo 1
               labs 1
               quux 1
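The flow above can be simulated in a few lines of Python: an in-process sketch of the map, shuffle/sort, and reduce phases (illustrative only, not the Hadoop API):

```python
from collections import defaultdict

def mapper(line):
    # Map phase: emit (word, 1) for every word on the line.
    for word in line.split():
        yield word, 1

def reducer(word, counts):
    # Reduce phase: sum all partial counts for one word.
    return word, sum(counts)

lines = ["foo bar", "quux foo", "foo labs", "quux"]

# Shuffle/sort phase: group the map output by key.
grouped = defaultdict(list)
for line in lines:
    for word, count in mapper(line):
        grouped[word].append(count)

result = dict(reducer(w, c) for w, c in grouped.items())
print(result)  # {'foo': 3, 'bar': 1, 'quux': 2, 'labs': 1}
```

Hadoop runs the same three phases, but with mappers and reducers spread across many machines and the grouping done by a distributed sort.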



Hello Hadoop
• Word Count
   – Example with Unstructured Data
   – Load 5 books from Gutenberg.org
     into /tmp/gutenberg
   – Load them into HDFS
   – Run Hadoop
        • Results are put into HDFS
   – Copy results into file system

   – What could be simpler?

   – DIY instructions for Amazon EC2
     available on DataThinks.org blog




The Hadoop Ecosystem
• Introduction
• The Hadoop Bestiary
   –   Core: Hadoop Map Reduce and Hadoop Distributed File System
   –   Data Access: HBase, Pig, Hive
   –   Algorithms: Mahout
   –   Data Import: Flume, Sqoop and Nutch
• The Hadoop Providers
• Hosted Hadoop Frameworks




The Core: Hadoop and HDFS
• Hadoop
   – One master, n slaves
   – Master
        • Schedules mappers & reducers
        • Connects pipeline stages
        • Handles failure semantics

• Hadoop Distributed File System
   – Robust data storage across machines, insulating against failure
   – Keeps n copies of each file
        • Configurable number of copies
        • Distributes copies across racks and locations




Hadoop Bestiary (p1a): HBase, Pig
• Database Primitives
   – HBase
        • Wide column data structure built on HDFS

• Processing
   – Pig
        • A high(-ish) level data-flow language and execution framework for parallel computation
        • Accesses HDFS and HBase
        • Batch as well as interactive
        • Integrates UDFs written in Java, Python, JavaScript
        • Compiles to map & reduce functions – not 100% efficiently




In Pig (Latin)

   Users    = load 'users' as (name, age);
   Filtered = filter Users by age >= 18 and age <= 25;
   Pages    = load 'pages' as (user, url);
   Joined   = join Filtered by name, Pages by user;
   Grouped  = group Joined by url;
   Summed   = foreach Grouped generate group,
                      COUNT(Joined) as clicks;
   Sorted   = order Summed by clicks desc;
   Top5     = limit Sorted 5;

   store Top5 into 'top5sites';
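To make the data flow concrete, here is the same query sketched in plain Python over small, made-up in-memory stand-ins for the 'users' and 'pages' inputs (the records below are hypothetical):

```python
from collections import Counter

# Hypothetical stand-ins for the 'users' and 'pages' inputs.
users = [("alice", 20), ("bob", 30), ("carol", 22)]
pages = [("alice", "a.com"), ("carol", "a.com"), ("alice", "b.com"), ("bob", "a.com")]

# filter Users by age >= 18 and age <= 25
filtered = {name for name, age in users if 18 <= age <= 25}

# join Filtered by name, Pages by user; group Joined by url; count clicks per url
clicks = Counter(url for user, url in pages if user in filtered)

# order Summed by clicks desc; limit Sorted 5
top5 = clicks.most_common(5)
print(top5)  # [('a.com', 2), ('b.com', 1)]
```

Pig generates the equivalent filter, join, group, and sort as a chain of MapReduce jobs, which is what the next slide shows.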


                  Example from http://wiki.apache.org/pig-data/attachments/PigTalksPapers/attachments/ApacheConEurope09.ppt
Pig Translation into Map Reduce

   Logical plan                     Pig script
   Load Users      Load Pages       Users   = load …
   Filter by age                    Fltrd   = filter …
                                    Pages   = load …
   Job 1: Join on name              Joined  = join …
          Group on url              Grouped = group …
                                    Summed  = … count()…
   Job 2: Count clicks              Sorted  = order …
          Order by clicks           Top5    = limit …

   Job 3: Take top 5


Hadoop Bestiary (p1b): HBase, Hive
• Database Primitives
   – HBase
        • Wide column data structure built on HDFS

• Processing
   – Hive
        • Data Warehouse Infrastructure
        • QL, a subset of SQL that supports primitives supportable by Map Reduce
        • Support for custom mappers and reducers for more sophisticated analysis
        • Compiles to map & reduce functions – not 100% efficiently

            Hive Example
        CREATE TABLE page_view(viewTime INT, userid BIGINT,
                         page_url STRING, referrer_url STRING,
                         ip STRING COMMENT 'IP Address of the User')
        :: ::
        STORED AS SEQUENCEFILE;

Hadoop Bestiary (p2): Mahout
• Algorithms
   – Mahout
        • Scalable machine learning and data mining
        • Runs on top of Hadoop
        • Written in Java
        • In active development
            – Algorithms being added

• Examples
   – Clustering Algorithms
        • Canopy Clustering
        • K-Means Clustering
        • …
   – Recommenders / Collaborative Filtering Algorithms
   – Other
        • Regression Algorithms
        • Neural Networks
        • Hidden Markov Models
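As a rough sketch of what the K-Means step does (a minimal one-dimensional version in Python; Mahout's implementation is distributed, multi-dimensional, and far more sophisticated):

```python
def kmeans_1d(points, centers, iterations=10):
    """Toy 1-D k-means: alternate assignment and update steps."""
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest center.
        clusters = {c: [] for c in centers}
        for p in points:
            nearest = min(centers, key=lambda c: abs(p - c))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster.
        centers = [sum(ps) / len(ps) if ps else c for c, ps in clusters.items()]
    return sorted(centers)

points = [1.0, 1.2, 0.8, 9.0, 9.2, 8.8]
print(kmeans_1d(points, [0.0, 5.0]))  # converges near [1.0, 9.0]
```

On Hadoop, each iteration becomes a MapReduce pass: mappers assign points to the nearest center, reducers recompute the cluster means.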




Hadoop Bestiary (p3): Data Import
• Data Import Mechanisms
   – Sqoop: Structured Data
        • Import from RDBMS to HDFS
        • Export too
   – Flume: Streams
        • Import streams
            – Text Files
            – System Logs
   – Nutch
        • Import from Web
        • Note: Nutch + Hadoop = Lucene




Hadoop Bestiary (p4): Complete Picture




The Hadoop Ecosystem
• Introduction
• The Hadoop Bestiary
• The Hadoop Providers
   – Apache
   – Cloudera
   – Options when your data lives in a Database
• Hosted Hadoop Frameworks




Apache Distribution
• The Definitive Repository
   – The hub for Code, Documentation, Tutorials

   – Many contributors, for example
        • Pig was a Yahoo! Contribution
        • Hive came from Facebook
        • Sqoop came from Cloudera


• Bare metal install option:
   – Download to your machine(s) from Apache
   – Install and Operate
        • Modify to fit your business better




Cloudera
• Cloudera : Hadoop :: Red Hat : Linux

• Cloudera’s Distribution Including Apache Hadoop (CDH)
   – A packaged set of Hadoop modules that work together
   – Now at CDH3
   – Largest contributor of code to Apache Hadoop


• $76M in Venture funding so far




When the data lives in a Database…

• Objective: keeping Analytics and Data as close as possible


• Options for RDBMS:
   – Sqoop data to/from HDFS
        • Need to move the data
   – In-database analytics
        • Available for Teradata, Greenplum, etc.
        • If you have the need
            – And the $$$

• Options for NoSQL Databases
   – Sqoop-like connectors
        • Need to move the data
        • Can utilize all parts of Hadoop
   – Built-in Map Reduce available for most NoSQL databases
        • Knows about and tuned to the storage mechanism
        • But typically only offers map and reduce
            – No Pig, Hive, …



The Hadoop Ecosystem
• Introduction
• The Hadoop Bestiary
• The Hadoop Providers
• Hadoop Platforms as a Service
   –   Amazon Elastic MapReduce
   –   Hadoop in Windows Azure
   –   Google App Engine
   –   Other
        • Infochimps
        • IBM SmartCloud




Amazon Elastic Map Reduce (EMR)
• Hosted Map Reduce
   – CLI on your laptop
        • Control over size of cluster
        • Automatic spin-up/down instances


   – Map & Reduce programs on S3
        • Pig, Hive or
        • Custom in Java, Ruby, Python,
          Perl, PHP, R, C++, Cascading


   – Data In/Out on S3 or
   – Data In/Out on DynamoDB


• Keep in mind:
   – Hadoop on EC2 is also an option

Hadoop in Windows Azure
• Basic Level
   – Hive Add-in for Excel
   – Hive ODBC Driver


• Hadoop-based Distribution for Windows Server and Azure
   – Strategic Partnership with Hortonworks
   – Windows-based CLI on your laptop


• Broadest Level
   – JavaScript framework for Hadoop
   – Hadoop connectors for SQL Server and Parallel Data Warehouse




Google App Engine MapReduce
• Map Reduce as a Service
   – Distinct from Google’s internal Map Reduce
   – Part of Google App Engine


• Works with Google Datastore
   – A Wide Column Store


• A “purely programmatic” environment
   – Write Map and Reduce functions in Python / Java




Map Reduce Use at Google




Takeaways
• There are many flavors of
  Hadoop.
   – The important part is
     Functional Programming and
     Map Reduce

   – Don’t let the proliferation of
     choices stump you.

   – Experiment with it!




Thank you
• J Singh
   – President, Early Stage IT
        • Technology Services and Strategy for Startups


• DataThinks.org is a new service of Early Stage IT
   – “Big Data” analytics solutions






Editor's Notes

  • #4 Sources: “Top 5 Reasons Not to Use Hadoop for Analytics”, “The Dark Side of Hadoop”, “Hadoop Don’t’s: What not to do to harvest Hadoop’s full potential”
  • #8 Get started with Hadoop
  • #11 http://pig.apache.org/docs/r0.9.2/index.html, Apache Hadoop, Cascading
  • #14 http://pig.apache.org/docs/r0.9.2/index.html
  • #16 Flume Users Guide, Thrift Paper
  • #17 Missing components: Cascading