Hadoop and Vertica
The Data Analytics Platform at Twitter
               Bill Graham - @billgraham
     Data Systems Engineer, Analytics Infrastructure
              Hadoop Summit, June 2012
About that pony giveaway...




Outline
  • Architecture
  • Data flow
  • Job coordination
  • Resource management
  • Vertica integration
  • Gotchas
  • Future work




We count things

  • 140 characters
  • 140M active users
  • 400M tweets per day
  • 80-100 TB ingested daily (uncompressed)
  • Tens of thousands of Hadoop jobs daily




Heterogeneous stack
  • Many job execution applications
    • Crane - Java ETL
    • Oink - Pig scheduler
    • Rasvelg - SQL aggregations
    • Scalding - Cascading via Scala
    • PyCascading - Cascading via Python
    • Indexing jobs
  • Our users
    • Analytics, Revenue, Growth, Search, Recommendations, etc.
    • PMs, Sales!


Data flow: Analytics

  [Diagram, built up over several animation steps. Components shown:]
  • Sources
    • Production Hosts (log events, application data)
    • Third Party Imports
    • Distributed Crawler
  • Log path
    • Scribe Aggregators
    • HDFS on the Staging Hadoop Cluster
    • Log Mover
  • Application data
    • MySQL/Gizzard (social graph, tweets, user profiles)
  • Stores
    • Main Hadoop DW, HBase, Vertica, MySQL
  • Job tools labeling the connections between stores
    • Crane, Oink, Rasvelg, HCatalog
  • Consumers
    • Analytics Web Tools
    • Analysts, Engineers, PMs, Sales
Chaotic? Actually, no.




System concepts


  • Loose coupling
  • Job coordination as a service
  • Resource management as a service
  • Idempotence




Loose coupling


  • Multiple job frameworks
  • Right tool for the job
  • Common dependency management




Job coordination

  • Shared batch table for job state
    • batch table: (id, description, state, start_time, end_time,
      job_start_time, job_end_time)
  • Access via client libraries
  • Jobs & data are time-based
  • 3 types of preconditions
    1. other job success (e.g., predecessor job complete)
    2. existence of data (e.g., HDFS input exists)
    3. user-defined (e.g., MySQL slave lag)
  • Failed jobs get retried (usually)

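The three precondition types above can be sketched in a few lines of Python. This is a minimal illustration, not the client library from the talk: the in-memory list standing in for the batch table, the state name "SUCCESS", the reading of start_time/end_time as the data window a job covers, the HDFS path, and the lag threshold are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Callable, Iterable, List, Optional, Set


@dataclass
class BatchRow:
    # Column names follow the batch table on the slide; their meaning is assumed.
    id: int
    description: str           # job name (assumed)
    state: str                 # "SUCCESS" is an assumed state name
    start_time: datetime       # start of the data window (assumed)
    end_time: datetime         # end of the data window (assumed)
    job_start_time: datetime   # wall-clock start of the run (assumed)
    job_end_time: Optional[datetime] = None


Precondition = Callable[[], bool]


def job_succeeded(table: List[BatchRow], description: str,
                  start: datetime, end: datetime) -> Precondition:
    """Type 1: a predecessor job completed for the same time window."""
    def check() -> bool:
        return any(
            row.description == description
            and row.state == "SUCCESS"
            and row.start_time == start
            and row.end_time == end
            for row in table
        )
    return check


def data_exists(existing_paths: Set[str], path: str) -> Precondition:
    """Type 2: the input data is present (a set stands in for HDFS here)."""
    return lambda: path in existing_paths


def ready_to_run(preconditions: Iterable[Precondition]) -> bool:
    """A job may start only when every precondition holds."""
    return all(check() for check in preconditions)


hour_start = datetime(2012, 6, 14, 0)
hour_end = datetime(2012, 6, 14, 1)
table: List[BatchRow] = []
paths = {"/logs/web_events/2012/06/14/00"}   # hypothetical path
replica_lag_seconds = 5                      # stand-in for a real lag probe

preconditions = [
    job_succeeded(table, "log_mover", hour_start, hour_end),
    data_exists(paths, "/logs/web_events/2012/06/14/00"),
    lambda: replica_lag_seconds < 60,        # Type 3: user-defined check
]

print(ready_to_run(preconditions))  # predecessor has not reported success yet
table.append(BatchRow(1, "log_mover", "SUCCESS", hour_start, hour_end,
                      datetime(2012, 6, 14, 1, 5), datetime(2012, 6, 14, 1, 9)))
print(ready_to_run(preconditions))  # all three preconditions now hold
```

Because every check reads shared state rather than calling another job directly, a job written in any framework could evaluate the same preconditions, which fits the loose coupling described earlier.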
Resource management

  • Analytics Resource Manager - ARM!
  • Library above Zookeeper
  • Throttles jobs and workers
    • Only 1 job of this name may run at once
    • Only N jobs may be run by this app at once
    • Only M mappers may write to Vertica at once
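
The throttling rules above amount to named counters with a ceiling. The sketch below shows that idea with an in-process lock instead of ZooKeeper, so it only illustrates the semantics, not how ARM is built; the class name, method names, and resource names are hypothetical.

```python
import threading


class LocalResourceManager:
    """In-process stand-in for a ZooKeeper-backed throttle (illustration only)."""

    def __init__(self, limits):
        self._limits = dict(limits)                  # resource name -> max holders
        self._in_use = {name: 0 for name in limits}  # resource name -> current holders
        self._lock = threading.Lock()

    def try_acquire(self, name):
        """Return True and take a slot if one is free, else return False."""
        with self._lock:
            if self._in_use[name] >= self._limits[name]:
                return False
            self._in_use[name] += 1
            return True

    def release(self, name):
        """Give a slot back; releasing an unheld resource is a no-op."""
        with self._lock:
            if self._in_use[name] > 0:
                self._in_use[name] -= 1


arm = LocalResourceManager({
    "job:active_users": 1,   # only 1 job of this name at once
    "vertica_writers": 2,    # only M mappers writing at once (M = 2 here)
})

print(arm.try_acquire("vertica_writers"))  # first writer gets a slot
print(arm.try_acquire("vertica_writers"))  # second writer gets a slot
print(arm.try_acquire("vertica_writers"))  # third writer is denied
arm.release("vertica_writers")
print(arm.try_acquire("vertica_writers"))  # a slot is free again
```

A non-blocking try_acquire is used here so that a denied job can simply stay idle and ask again later.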




Job DAG & state transition

  “Local View”
  • Is it time for me to run yet?
  • Are my dependencies satisfied?
  • Any resource constraints?

  [State diagram, built up over several animation steps:]
  • Idle
    • denied: remain Idle
    • granted: move to Execution
  • Execution
    • Execution complete? no: keep executing
    • Execution complete? yes: move to Completion
  • Completion
    • Insert entry into batch table

  batch table:
  (id, description, state, start_time, end_time,
   job_start_time, job_end_time)

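The state diagram above can be read as a small transition function. The sketch below is one way to encode it, assuming the three "local view" questions are passed in as booleans; the enum and parameter names are made up for the example.

```python
from enum import Enum


class State(Enum):
    IDLE = "idle"
    EXECUTION = "execution"
    COMPLETION = "completion"


def next_state(state, *, time_to_run=False, dependencies_satisfied=False,
               resources_granted=False, execution_complete=False):
    """Advance one step using only the job's own local view."""
    if state is State.IDLE:
        # All three local checks must pass; otherwise the job stays idle.
        if time_to_run and dependencies_satisfied and resources_granted:
            return State.EXECUTION
        return State.IDLE
    if state is State.EXECUTION:
        # Loop on "Execution complete?" until the answer is yes.
        return State.COMPLETION if execution_complete else State.EXECUTION
    # On Completion, the job would insert its entry into the batch table.
    return State.COMPLETION


state = State.IDLE
state = next_state(state, time_to_run=True, dependencies_satisfied=True)
print(state)  # resources were denied, so the job is still idle
state = next_state(state, time_to_run=True, dependencies_satisfied=True,
                   resources_granted=True)
print(state)  # granted, so the job is executing
state = next_state(state, execution_complete=True)
print(state)  # finished
```

No job needs to know the whole DAG in this model: the graph emerges from each job's preconditions.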
Example: active users

  [Diagram, built up over several animation steps. Stages shown:]
  • Production Hosts
    • Scribe: web_events, sms_events
  • Log mover (via staging cluster) to Main Hadoop DW
  • Oink/Pig on Main Hadoop DW
    • Cleanse, Filter, Transform, Geo lookup, Union, Distinct
  • Oink: user_sessions (Main Hadoop DW to Vertica)
  • Crane: user_profiles (MySQL/Gizzard to Vertica)
  • Rasvelg on Vertica
    • Join, Group, Count
    • Aggregations: active_by_geo, active_by_device, active_by_client, ...
  • Crane: active_by_* (Vertica to MySQL)
  • MySQL backs the Analytics Dashboards

  Job DAG nodes: Log mover, Oink, Oink, Crane, Rasvelg, Crane, ...

Vertica or Hadoop?
  • Vertica
    • Loads hundreds of thousands of rows per second
    • Aggregates hundreds of millions of rows in seconds
    • Used for low latency queries and aggregations
    • Keep a sliding window of data
  • Hadoop
    • Excels when data size is massive
    • Flexible and powerful
    • Great with nested data structures and unstructured data
    • Used for complex functions and ML



Vertica import options
  • Direct import via Crane
    • Load into dest table, single thread
  • Atomic import via Crane/Rasvelg
    • Crane loads to temp table, single thread
    • Rasvelg moves to dest table
  • Parallel import via Oink/Pig
    • Pig job via VerticaStorer
    • ARM throttles active DB connections

  [Diagram: MySQL/Gizzard, Main Hadoop DW, and Vertica, with Crane, Oink, and Rasvelg labeling the import paths]

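The atomic import option (load to a temp table, then move to the destination) can be sketched as follows. SQLite from the Python standard library stands in for Vertica purely so the example runs anywhere; the function name, the staging-table naming scheme, and the sample table are assumptions, and table names are interpolated directly only because this is an illustration with trusted input.

```python
import sqlite3


def atomic_load(conn, dest, rows):
    """Load rows into a staging table, then publish them in one transaction."""
    if not rows:
        return
    staging = f"{dest}_staging"  # hypothetical naming scheme
    placeholders = ", ".join("?" * len(rows[0]))

    # Step 1 (Crane's role on the slide): fill a staging table.
    # Readers of the destination table see nothing yet.
    conn.execute(f"DROP TABLE IF EXISTS {staging}")
    conn.execute(f"CREATE TABLE {staging} AS SELECT * FROM {dest} WHERE 0")
    with conn:
        conn.executemany(f"INSERT INTO {staging} VALUES ({placeholders})", rows)

    # Step 2 (Rasvelg's role on the slide): move everything at once.
    # The connection context manager commits on success and rolls back on error.
    with conn:
        conn.execute(f"INSERT INTO {dest} SELECT * FROM {staging}")

    conn.execute(f"DROP TABLE {staging}")


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user_sessions (user_id INTEGER, day TEXT)")
atomic_load(conn, "user_sessions", [(1, "2012-06-14"), (2, "2012-06-14")])
print(conn.execute("SELECT COUNT(*) FROM user_sessions").fetchone()[0])
```

If the staging load fails partway, the destination table is untouched, which is the property the direct single-step import lacks.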
Vertica imports - pros/cons
  • Crane & Rasvelg
    • Good for smaller datasets, DB to DB transfers
    • Single threaded
    • Easy on Vertica
    • Hadoop not required
  • Pig
    • Great for larger datasets
    • More complex, not atomic
    • DDoS potential

VerticaStorer
  • PigStorage implementation
  • From Vertica’s Hadoop connector suite
  • Out of the box
    • Easy to get Hello World working
    • Well documented
    • Pig/Vertica data bindings work well
    • Fast!
    • Transaction-aware tasks
    • No bugs found
    • Open source?



Pig VerticaStorage
  • Our enhancements
    • Connection credential management
    • Truncate before load option
    • Throttle concurrent writers via ZK
  • Future features
    • Counters for rows inserted/rejected
    • Name-based tuple-column bindings
    • Atomic load via temp table




         SET mapred.map.tasks.speculative.execution false;

         user_sessions = LOAD '/processed/user_sessions/2012/06/14';

         STORE user_sessions INTO '{db_schema.user_sessions}' USING
               com.twitter.twadoop.pig.store.VerticaStorage(
               'config/db.yml', 'db_name', 'arm_resource_name');
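
The "throttle concurrent writers" enhancement can be illustrated with a small local sketch. The deck says the real throttle is coordinated through ZooKeeper (ZK) across Hadoop tasks; the version below is only an assumption-laden stand-in that uses an in-process semaphore, and the names `WriterThrottle`, `run_writers`, and the limit of 3 are hypothetical.

```python
import threading
import time


class WriterThrottle:
    """Caps how many writers hold a DB connection at once (local stand-in for a ZK-based limit)."""

    def __init__(self, max_writers):
        self._slots = threading.BoundedSemaphore(max_writers)
        self._lock = threading.Lock()
        self._active = 0
        self.peak = 0  # highest number of simultaneous writers observed

    def __enter__(self):
        self._slots.acquire()  # block until a writer slot is free
        with self._lock:
            self._active += 1
            self.peak = max(self.peak, self._active)
        return self

    def __exit__(self, exc_type, exc, tb):
        with self._lock:
            self._active -= 1
        self._slots.release()
        return False  # never swallow exceptions from the writer


def run_writers(num_tasks, max_writers):
    """Start num_tasks writer threads and return the peak concurrency seen."""
    throttle = WriterThrottle(max_writers)

    def write_task():
        with throttle:
            time.sleep(0.01)  # stands in for a batch insert into the database

    threads = [threading.Thread(target=write_task) for _ in range(num_tasks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return throttle.peak


peak_writers = run_writers(num_tasks=10, max_writers=3)
print("peak concurrent writers:", peak_writers)
```

Ten tasks are launched, but no more than three ever write at the same time, which is the property that keeps a large Pig job from overwhelming the database.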
Gotcha #1


  • MR data load is not atomic
    • Avoid partial reads
    • Option 1: load to temp table, then insert direct
    • Option 2: add job dependency concept
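
Option 1 can be sketched as follows. This is a minimal illustration, not the Crane/Rasvelg implementation: SQLite stands in for Vertica, and the function name `load_atomically`, the `_staging` suffix, and the two-column table are assumptions made for the example.

```python
import sqlite3


def load_atomically(conn, dest_table, rows):
    """Fill a staging table first, then publish to dest_table in one transaction."""
    staging = dest_table + "_staging"  # hypothetical naming convention

    # Step 1: slow, possibly multi-writer load into a table readers never query.
    conn.execute(f"DROP TABLE IF EXISTS {staging}")  # clear leftovers from a failed run
    conn.execute(f"CREATE TABLE {staging} AS SELECT * FROM {dest_table} WHERE 0")
    conn.executemany(f"INSERT INTO {staging} VALUES (?, ?)", rows)
    conn.commit()

    # Step 2: a single INSERT ... SELECT, committed or rolled back as a unit,
    # so readers of dest_table see either none of the new rows or all of them.
    with conn:
        conn.execute(f"INSERT INTO {dest_table} SELECT * FROM {staging}")

    conn.execute(f"DROP TABLE {staging}")
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user_sessions (user_id INTEGER, sessions INTEGER)")
load_atomically(conn, "user_sessions", [(1, 3), (2, 5)])

row_count = conn.execute("SELECT COUNT(*) FROM user_sessions").fetchone()[0]
leftover_staging = conn.execute(
    "SELECT name FROM sqlite_master WHERE name = 'user_sessions_staging'"
).fetchall()
print(row_count, leftover_staging)
```

If step 1 fails partway through, the destination table is untouched and the job can simply be rerun.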




Gotcha #2



  • Speculative execution is not always your friend
    • Launch more tasks than needed, just in case
    • For non-idempotent jobs, extra tasks == BAD
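
A toy simulation shows why extra attempts hurt a non-idempotent loader. It is a sketch under simplifying assumptions: every attempt is assumed to write all of its rows, and the task ids, rows, and function name `run_job` are made up for illustration.

```python
def run_job(task_inputs, attempts_per_task):
    """Run each task the given number of times against two kinds of sink.

    task_inputs: dict of task id -> list of (key, value) rows.
    attempts_per_task: dict of task id -> number of attempts that write (default 1).
    """
    appended = []  # append-only sink: every attempt adds rows again
    upserted = {}  # keyed sink: repeating a write leaves the same state
    for task_id, rows in task_inputs.items():
        for _ in range(attempts_per_task.get(task_id, 1)):
            appended.extend(rows)
            for key, value in rows:
                upserted[key] = value
    return appended, upserted


task_inputs = {"m_0": [("u1", 3)], "m_1": [("u2", 5)]}
# Speculative execution launches a second attempt of the slow task m_1.
appended, upserted = run_job(task_inputs, {"m_1": 2})
print(len(appended), len(upserted))
```

The append-only sink ends up with a duplicated row for `u2`, while the keyed sink does not. For an append-style database load, turning speculative execution off, as the Pig script on the VerticaStorage slide does, avoids the duplicate attempt in the first place.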




Gotcha #3


  • isIdempotent() must be a first-class concept
    • Loader jobs will fail
    • Failure after first task success == not good
    • Can’t automate retry without cleanup
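
One way to make idempotency a first-class property of a job is sketched below. The `Job` class, its fields, and `run_with_retry` are hypothetical and are not the API of Crane, Oink, or Rasvelg; the point is only that the retry logic consults the flag and refuses to retry a non-idempotent job unless it knows how to clean up.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Job:
    name: str
    run: Callable[[], None]
    is_idempotent: bool
    cleanup: Optional[Callable[[], None]] = None  # undoes partial writes


def run_with_retry(job, max_attempts=3):
    """Return the number of attempts used; re-raise when a retry would be unsafe."""
    for attempt in range(1, max_attempts + 1):
        try:
            job.run()
            return attempt
        except Exception:
            if attempt == max_attempts:
                raise
            if not job.is_idempotent:
                if job.cleanup is None:
                    raise  # partial data may exist and cannot be removed safely
                job.cleanup()


table = []
calls = {"n": 0}


def flaky_load():
    calls["n"] += 1
    table.append("row-from-task-0")  # the first task succeeds and writes its row
    if calls["n"] == 1:
        raise RuntimeError("second task failed")
    table.append("row-from-task-1")


loader = Job("load_user_sessions", flaky_load, is_idempotent=False, cleanup=table.clear)
attempts_used = run_with_retry(loader)
print(attempts_used, table)
```

Without the cleanup step, the retry would leave the first task's row in the table twice.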




Gotcha #4

  • Vendor code only gets you so far
    • Nice-to-haves == have to write them yourself
    • Favor the decorator pattern
    • Pig’s StoreFuncWrapper can help
    • Vendor open sourcing is ideal
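
The decorator approach can be sketched in a few lines. This is an illustration of the pattern only: `VendorStorer` is a stand-in for closed vendor code, and the `prepare`/`put_next` method names are assumptions, not the Pig or Vertica API. The forwarding base class plays the role that the slide attributes to Pig's StoreFuncWrapper.

```python
class VendorStorer:
    """Stand-in for a vendor-supplied storer that cannot be modified."""

    def __init__(self, table):
        self.table = table

    def prepare(self):
        pass

    def put_next(self, row):
        self.table.append(row)


class StorerWrapper:
    """Forwards every call to the wrapped storer; subclasses override selectively."""

    def __init__(self, inner):
        self.inner = inner

    def prepare(self):
        self.inner.prepare()

    def put_next(self, row):
        self.inner.put_next(row)


class TruncatingStorer(StorerWrapper):
    """Adds a truncate-before-load option."""

    def __init__(self, inner, table):
        super().__init__(inner)
        self.table = table

    def prepare(self):
        self.table.clear()  # drop stale rows before loading
        super().prepare()


class CountingStorer(StorerWrapper):
    """Adds a counter for rows written."""

    def __init__(self, inner):
        super().__init__(inner)
        self.rows_written = 0

    def put_next(self, row):
        super().put_next(row)
        self.rows_written += 1


table = ["stale row"]
storer = CountingStorer(TruncatingStorer(VendorStorer(table), table))
storer.prepare()
for row in [("u1", 3), ("u2", 5)]:
    storer.put_next(row)
print(table, storer.rows_written)
```

Each extra feature lives in its own wrapper, so the vendor class stays untouched and can be upgraded independently.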




Future work
  • More VerticaStorer features
  • Multiple Vertica clusters
  • Atomic DB loads with Pig/Oink
  • Better DAG visibility
  • Better job history visibility
  • MR job optimizations via historic stats
  • HCatalog data registry
  • Job push events


Acknowledgements




Questions?

 Bill Graham - @billgraham





Hadoop Summit 2012 - Hadoop and Vertica: The Data Analytics Platform at Twitter

  • 1. Hadoop and Vertica: The Data Analytics Platform at Twitter
      Bill Graham - @billgraham
      Data Systems Engineer, Analytics Infrastructure
      Hadoop Summit, June 2012
  • 2. About that pony giveaway...
  • 3. Outline
      • Architecture
      • Data flow
      • Job coordination
      • Resource management
      • Vertica integration
      • Gotchas
      • Future work
  • 4. We count things
      • 140 characters
      • 140M active users
      • 400M tweets per day
      • 80-100 TB ingested daily (uncompressed)
      • 10s of Ks daily Hadoop jobs
  • 5. Heterogeneous stack
      • Many job execution applications
          • Crane - Java ETL
          • Oink - Pig scheduler
          • Rasvelg - SQL aggregations
          • Scalding - Cascading via Scala
          • PyCascading - Cascading via Python
          • Indexing jobs
      • Our users
          • Analytics, Revenue, Growth, Search, Recommendations, etc.
          • PMs, Sales!
  • 6-12. Data flow: Analytics (one diagram, built up across slides 6-12)
      • Slide 6, base components: Production Hosts (log events, application data), Scribe Aggregators, Third Party Imports, HDFS, Staging Hadoop Cluster, MySQL/Gizzard (social graph, tweets, user profiles), Main Hadoop DW, HBase, Vertica, MySQL, Analytics Web Tools
      • Slide 7 adds: Distributed Crawler, Log Mover
      • Slide 8 adds: Crane (shown at several points in the flow)
      • Slide 9 adds: Oink (shown at several points in the flow)
      • Slide 10 adds: Rasvelg
      • Slide 11 adds: the users (Analysts, Engineers, PMs, Sales)
      • Slide 12 adds: HCatalog
  • 14. System concepts
      • Loose coupling
      • Job coordination as a service
      • Resource management as a service
      • Idempotence
  • 15. Loose coupling
      • Multiple job frameworks
      • Right tool for the job
      • Common dependency management
  • 16-20. Job coordination (built up across slides 16-20)
      • Shared batch table for job state
          • batch table: (id, description, state, start_time, end_time, job_start_time, job_end_time)
      • Access via client libraries
      • Jobs & data are time-based
      • 3 types of preconditions
          1. Other job success (e.g., predecessor job complete)
          2. Existence of data (e.g., HDFS input exists)
          3. User-defined (e.g., MySQL slave lag)
      • Failed jobs get retried (usually)
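The three precondition types on the job coordination slides can be sketched as a single readiness check. This is a minimal illustration, not Twitter's client library: the `Job` class, the `"SUCCEEDED"` state string, and the in-memory dict and set standing in for the batch table and HDFS are all assumptions.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Callable, List


@dataclass
class Job:
    """Hypothetical job description; field names are illustrative only."""
    name: str
    predecessors: List[str] = field(default_factory=list)   # type 1: other job success
    input_paths: List[str] = field(default_factory=list)    # type 2: existence of data
    custom_checks: List[Callable[[], bool]] = field(default_factory=list)  # type 3: user-defined


def ready_to_run(job, period_start, batch_table, existing_paths):
    """Return True only if every precondition holds for the given time period.

    batch_table maps (job name, period start) -> state string (assumed layout).
    existing_paths is a set of paths standing in for an HDFS existence check.
    """
    jobs_ok = all(
        batch_table.get((name, period_start)) == "SUCCEEDED"
        for name in job.predecessors
    )
    data_ok = all(path in existing_paths for path in job.input_paths)
    custom_ok = all(check() for check in job.custom_checks)
    return jobs_ok and data_ok and custom_ok
```

Keying the batch table by job name and period reflects the "jobs & data are time-based" bullet: a predecessor's success for one day says nothing about the next.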
  • 21-23. Resource management
      • Analytics Resource Manager - ARM!
      • Library above Zookeeper
      • Throttles jobs and workers
          • Only 1 job of this name may run at once
          • Only N jobs may be run by this app at once
          • Only M mappers may write to Vertica at once
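The "only N at once" limits described for ARM amount to a named counting limit. The sketch below keeps the counters in process memory purely for illustration; the real ARM is a library above Zookeeper, and the class name, method names, and resource names here are hypothetical.

```python
import threading
from contextlib import contextmanager


class ResourceLimiter:
    """In-process stand-in for a ZooKeeper-backed limiter (hypothetical API)."""

    def __init__(self, limits):
        self._limits = dict(limits)                 # resource name -> max concurrent holders
        self._held = {name: 0 for name in self._limits}
        self._lock = threading.Lock()

    def try_acquire(self, resource):
        """Take one slot if available; return False when the limit is reached."""
        with self._lock:
            if self._held[resource] >= self._limits[resource]:
                return False                        # denied: caller waits and retries later
            self._held[resource] += 1
            return True

    def release(self, resource):
        """Give back one slot."""
        with self._lock:
            self._held[resource] -= 1

    @contextmanager
    def hold(self, resource):
        """Hold a slot for the duration of a block; raise if none is free."""
        if not self.try_acquire(resource):
            raise RuntimeError(f"resource limit reached: {resource}")
        try:
            yield
        finally:
            self.release(resource)
```

A limit of 1 on a job name gives the "only 1 job of this name" rule, and a limit of M on a shared writer resource gives the Vertica mapper throttle.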
  • 24-26. Job DAG & state transition (built up across slides 24-26)
      • "Local View"
          • Is it time for me to run yet?
          • Are my dependencies satisfied?
          • Any resource constraints?
      • State diagram labels: Idle; granted / denied; Insert entry into batch table; Execution; Execution complete? (yes / no); Completion
      • batch table: (id, description, state, start_time, end_time, job_start_time, job_end_time)
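One possible reading of the "local view" state diagram is a three-state machine in which a job leaves Idle only when all three questions are answered yes. The transitions below are inferred from the diagram labels, so treat them as an assumption; the names `State` and `next_state` are illustrative.

```python
from enum import Enum


class State(Enum):
    IDLE = "idle"
    EXECUTION = "execution"
    COMPLETION = "completion"


def next_state(state, due=False, deps_ok=False, granted=False, finished=False):
    """Return the next state for one job, using only that job's local view."""
    if state is State.IDLE:
        # Time to run, dependencies satisfied, and resources granted are all required.
        return State.EXECUTION if (due and deps_ok and granted) else State.IDLE
    if state is State.EXECUTION:
        # Stay in Execution until the work reports completion.
        return State.COMPLETION if finished else State.EXECUTION
    return State.COMPLETION
```

In a fuller version, each state change would also write a row to the batch table, which is where other jobs look when checking their own dependencies.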
  • 27-33. Example: active users (diagram and job DAG, built up across slides 27-33)
      • Systems shown: Production Hosts, Main Hadoop DW, MySQL/Gizzard, Vertica, MySQL, Analytics Dashboards
      • Log mover (via staging cluster): Scribe logs web_events and sms_events
      • Oink/Pig: Cleanse, Filter, Transform, Geo lookup, Union, Distinct
      • Oink: user_sessions
      • Crane: user_profiles
      • Rasvelg: Join, Group, Count
          • Aggregations: active_by_geo, active_by_device, active_by_client, ...
      • Crane: active_by_*
      • Job DAG nodes: Log mover, Log mover, Oink, Oink, Crane, Rasvelg, Crane, ...
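The active users pipeline can be written down as a dependency map and ordered with a topological sort. The job names and edges below are an assumed reading of the diagram, not taken from Twitter's configuration; only the tool names (Log mover, Oink, Crane, Rasvelg) and dataset names come from the slides.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# job -> set of jobs it depends on (hypothetical names and edges)
deps = {
    "log_mover_web_events": set(),
    "log_mover_sms_events": set(),
    "oink_cleanse": {"log_mover_web_events", "log_mover_sms_events"},
    "oink_user_sessions": {"oink_cleanse"},
    "crane_user_profiles": set(),
    "rasvelg_active_by": {"oink_user_sessions", "crane_user_profiles"},
    "crane_export_active_by": {"rasvelg_active_by"},
}

# A valid run order: every job appears after all of its dependencies.
order = list(TopologicalSorter(deps).static_order())
```

In the system described in the talk there is no central scheduler walking such a graph; each job checks its own preconditions, and the global order emerges from those local checks.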
  • 34. Vertica or Hadoop?
      • Vertica
          • Loads 100s of Ks rows/second
          • Aggregate 100s of Ms rows in seconds
          • Used for low latency queries and aggregations
          • Keep a sliding window of data
      • Hadoop
          • Excels when data size is massive
          • Flexible and powerful
          • Great with nested data structures and unstructured data
          • Used for complex functions and ML
  • 35. Vertica import options
      • Direct import via Crane
          • Load into dest table, single thread
      • Atomic import via Crane/Rasvelg
          • Crane loads to temp table, single thread
          • Rasvelg moves to dest table
      • Parallel import via Oink/Pig
          • Pig job via VerticaStorer
          • ARM throttles active DB connections
      • Diagram: MySQL/Gizzard, Main Hadoop DW, and Vertica, with Crane, Rasvelg, and Oink as the import paths
  • 36. Vertica imports - pros/cons
      • Crane & Rasvelg
          • Good for smaller datasets, DB to DB transfers
          • Single threaded
          • Easy on Vertica
          • Hadoop not required
      • Pig
          • Great for larger datasets
          • More complex, not atomic
          • DDoS potential
      • Diagram: same as slide 35
  • 37. VerticaStorer
      • PigStorage implementation
      • From Vertica's Hadoop connector suite
      • Out of the box
      • Easy to get Hello World working
      • Well documented
      • Pig/Vertica data bindings work well
      • Fast!
      • Transaction-aware tasks
      • No bugs found
      • Open source?
  • 38-39. Pig VerticaStorage
      • Our enhancements
          • Connection credential management
          • Truncate before load option
          • Throttle concurrent writers via ZK
      • Future features
          • Counters for rows inserted/rejected
          • Name-based tuple-column bindings
          • Atomic load via temp table
      • Example (slide 39):

            SET mapred.map.tasks.speculative.execution false;
            user_sessions = LOAD '/processed/user_sessions/2012/06/14';
            STORE user_sessions INTO '{db_schema.user_sessions}' USING
                com.twitter.twadoop.pig.store.VerticaStorage(
                    'config/db.yml', 'db_name', 'arm_resource_name');
  • 40. Gotcha #1
      • MR data load is not atomic
      • Avoid partial reads
          • Option 1: load to temp table, then insert direct
          • Option 2: add job dependency concept
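Option 1 in Gotcha #1 (load to a temp table, then insert into the destination) can be sketched with any transactional database. The example below uses SQLite from the Python standard library purely as a stand-in for Vertica; the table names, columns, and the `atomic_load` helper are assumptions made for illustration.

```python
import sqlite3


def atomic_load(conn, dest, rows):
    """Stage rows in a temp table, then publish them to dest in one transaction."""
    staging = f"{dest}_staging"
    # Slow, possibly multi-step loading happens where readers never look.
    conn.execute(f"DROP TABLE IF EXISTS {staging}")
    conn.execute(f"CREATE TABLE {staging} (user_id INTEGER, geo TEXT)")
    conn.executemany(f"INSERT INTO {staging} VALUES (?, ?)", rows)
    conn.commit()
    # Publishing is a single transaction: readers of dest see all rows or none.
    with conn:
        conn.execute(f"INSERT INTO {dest} SELECT * FROM {staging}")
        conn.execute(f"DROP TABLE {staging}")
```

If the staging load fails halfway, the destination table is untouched and the job can simply be rerun, which also helps with the retry problem raised in Gotcha #3.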
  • 41. Gotcha #2
      • Speculative execution is not always your friend
      • Launch more tasks than needed, just in case
      • For non-idempotent jobs, extra tasks == BAD
  • 42. Gotcha #3
      • isIdempotant() must be a first-class concept
      • Loader jobs will fail
      • Failure after first task success == not good
      • Can't automate retry without cleanup
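The point of Gotcha #3, that a retry is only safe after cleanup when the job is not idempotent, can be shown with a small retry wrapper. The function and parameter names are hypothetical, and the cleanup callable stands in for something like truncating a partially loaded table.

```python
def run_with_retry(load, cleanup, is_idempotent, max_attempts=3):
    """Run a loader, retrying on failure.

    Non-idempotent loaders leave partial output behind, so cleanup()
    is called before each retry to avoid loading the same rows twice.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return load()
        except Exception:
            if attempt == max_attempts:
                raise                      # out of attempts: surface the failure
            if not is_idempotent:
                cleanup()                  # undo partial writes before retrying
```

Without the idempotence flag the scheduler cannot know whether a blind retry is safe, which is why the slide asks for it to be a first-class concept.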
  • 43. Gotcha #4
      • Vendor code only gets you so far
      • Nice to haves == have to write
      • Favor the decorator pattern
      • Pig's StoreFuncWrapper can help
      • Vendor open sourcing is ideal
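The decorator pattern recommended in Gotcha #4 wraps vendor code without modifying it. The sketch below is language-neutral in spirit and written in Python; `VendorStorer`, `CountingStorer`, and `put_next` are hypothetical names, not the Vertica connector's or Pig's actual API. The added row counter mirrors the "counters for rows inserted" item listed as a future feature.

```python
class VendorStorer:
    """Stand-in for a vendor-supplied storer that cannot be modified."""

    def __init__(self):
        self.rows = []

    def put_next(self, row):
        self.rows.append(row)


class CountingStorer:
    """Decorator: exposes the same method, adds a counter, delegates the write."""

    def __init__(self, inner):
        self._inner = inner
        self.rows_written = 0

    def put_next(self, row):
        self._inner.put_next(row)      # the vendor code still does the real work
        self.rows_written += 1         # the added behavior lives in the wrapper
```

Because the wrapper only depends on the storer's interface, vendor upgrades can be dropped in without re-applying local patches.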
  • 44. Future work
      • More VerticaStorer features
      • Multiple Vertica clusters
      • Atomic DB loads with Pig/Oink
      • Better DAG visibility
      • Better job history visibility
      • MR job optimizations via historic stats
      • HCatalog data registry
      • Job push events
  • 46. Questions?
      Bill Graham - @billgraham

Editor's Notes

  5. Point out differences more: which ones move from where
  6-11. Describe colo
  13. Point out the develop-your-own-tools pattern more; opt in to common services like screech-owl
  15-18. Expand on the time-based aspect more (jobs and data)
  21-22. Point out that the batch table is updated for all state changes
  23-28. Talk about when we use Vertica and when we use Hadoop
  29. Writes are fast because they bypass the Vertica write buffer (copy direct)
  38. 2 Vertica clusters: one for just queries