Hadoop Summit 2012 - Hadoop and Vertica: The Data Analytics Platform at Twitter

Notes
  • Point out differences more: which ones move from where.
  • Describe colo.
  • Point out the develop-your-own-tools pattern more; opt in to common services like screech-owl.
  • Expand on the time-based aspect more (jobs and data).
  • Point out that the batch table is updated for all state changes.
  • Talk about when we use Vertica and when we use Hadoop.
  • Writes are fast because they bypass the Vertica write buffer (copy direct).
  • 2 Vertica clusters: one for just queries.

    1. Hadoop and Vertica: The Data Analytics Platform at Twitter
       Bill Graham - @billgraham
       Data Systems Engineer, Analytics Infrastructure
       Hadoop Summit, June 2012
    2. About that pony giveaway...
    3. Outline
       • Architecture
       • Data flow
       • Job coordination
       • Resource management
       • Vertica integration
       • Gotchas
       • Future work
    4. We count things
       • 140 characters
       • 140M active users
       • 400M tweets per day
       • 80-100 TB ingested daily (uncompressed)
       • 10s of Ks daily Hadoop jobs
    5. Heterogeneous stack
       • Many job execution applications
         • Crane - Java ETL
         • Oink - Pig scheduler
         • Rasvelg - SQL aggregations
         • Scalding - Cascading via Scala
         • PyCascading - Cascading via Python
         • Indexing jobs
       • Our users
         • Analytics, Revenue, Growth, Search, Recommendations, etc.
         • PMs, Sales!
    6-12. Data flow: Analytics
       [Diagram labels]
       • Data: Production Hosts, Log events, Application Data, Third Party Imports, Social graph, Tweets, User profiles
       • Systems: Scribe Aggregators, HDFS, MySQL/Gizzard, Staging Hadoop Cluster, Main Hadoop DW, HCatalog, HBase, Vertica, MySQL, Analytics Web Tools
       • Job tools: Distributed Crawler, Log Mover, Crane, Oink, Rasvelg
       • Users: Analysts, Engineers, PMs, Sales
    13. Chaotic? Actually, no.
    14. System concepts
       • Loose coupling
       • Job coordination as a service
       • Resource management as a service
       • Idempotence
    15. Loose coupling
       • Multiple job frameworks
       • Right tool for the job
       • Common dependency management
    16-20. Job coordination
       • Shared batch table for job state
         batch table: (id, description, state, start_time, end_time, job_start_time, job_end_time)
       • Access via client libraries
       • Jobs & data are time-based
       • 3 types of preconditions
         1. other job success (i.e., predecessor job complete)
         2. existence of data (e.g., HDFS input exists)
         3. user-defined (e.g., MySQL slave lag)
       • Failed jobs get retried (usually)
       [Diagram labels beside the preconditions: Job, Data, ?]
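The batch table and the three precondition types can be sketched roughly as follows. This is a minimal illustration, not Twitter's client library: SQLite stands in for the shared database, the column types and state values are assumptions, and `predecessor_succeeded`, `data_exists`, and `preconditions_met` are hypothetical helper names.

```python
import sqlite3

# Columns are the ones listed on the slide; types and meanings are assumptions.
SCHEMA = """
CREATE TABLE batch (
    id             INTEGER PRIMARY KEY,
    description    TEXT,   -- assumed: job name
    state          TEXT,   -- assumed values such as 'RUNNING', 'SUCCESS', 'FAILED'
    start_time     TEXT,   -- assumed: start of the data window
    end_time       TEXT,   -- assumed: end of the data window
    job_start_time TEXT,   -- assumed: when the run began
    job_end_time   TEXT    -- assumed: when the run finished
)
"""


def predecessor_succeeded(conn, job_name, window_start, window_end):
    """Precondition 1: another job succeeded for a window covering ours."""
    row = conn.execute(
        "SELECT 1 FROM batch WHERE description = ? AND state = 'SUCCESS' "
        "AND start_time <= ? AND end_time >= ?",
        (job_name, window_start, window_end),
    ).fetchone()
    return row is not None


def data_exists(known_paths, path):
    """Precondition 2: input data exists (a set stands in for an HDFS check)."""
    return path in known_paths


def preconditions_met(checks):
    """A job may run only when every precondition callable returns True."""
    return all(check() for check in checks)
```

A user-defined precondition (type 3) is then just another callable in the list, for example `lambda: slave_lag_seconds < 60`, where the threshold is made up for illustration.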
    21-23. Resource management
       • Analytics Resource Manager - ARM!
       • Library above Zookeeper
       • Throttles jobs and workers
         • Only 1 job of this name may run at once
         • Only N jobs may be run by this app at once
         • Only M mappers may write to Vertica at once
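The throttling rules above amount to counted leases on named resources. The sketch below shows that idea only; it is an in-memory stand-in guarded by a thread lock, whereas the slide says ARM is a library above ZooKeeper. The class name, method names, and resource names are all hypothetical.

```python
import threading
from contextlib import contextmanager


class ResourceManager:
    """In-memory stand-in for ARM; the real library coordinates via ZooKeeper."""

    def __init__(self, limits):
        self._limits = dict(limits)            # resource name -> max concurrent holders
        self._in_use = {name: 0 for name in limits}
        self._lock = threading.Lock()

    def try_acquire(self, resource):
        """Return True and take a slot if one is free, else return False."""
        with self._lock:
            if self._in_use[resource] >= self._limits[resource]:
                return False
            self._in_use[resource] += 1
            return True

    def release(self, resource):
        """Give a slot back; releasing with nothing held is a no-op."""
        with self._lock:
            if self._in_use[resource] > 0:
                self._in_use[resource] -= 1

    @contextmanager
    def lease(self, resource):
        """Yield whether the lease was granted; release on exit if it was."""
        granted = self.try_acquire(resource)
        try:
            yield granted
        finally:
            if granted:
                self.release(resource)
```

With limits such as `{"job:active_users": 1, "vertica_writers": 4}`, the first entry models "only 1 job of this name" and the second "only M mappers may write to Vertica".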
    24-26. Job DAG & state transition
       “Local View”
       • Is it time for me to run yet?
       • Are my dependencies satisfied?
       • Any resource constraints?
       [State diagram labels: Idle, granted, denied, Insert entry into batch table, Execution, Execution Complete? (yes/no), Completion]
       batch table: (id, description, state, start_time, end_time, job_start_time, job_end_time)
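One way to read the state diagram is as a small state machine in which each job decides its next step from its own local view. The sketch below is an interpretation of the diagram labels, not the actual implementation; the `Job` class, the `step` function, and the recorded state strings are hypothetical, and a list stands in for the batch table.

```python
from dataclasses import dataclass, field
from enum import Enum


class State(Enum):
    IDLE = "idle"
    EXECUTION = "execution"
    COMPLETION = "completion"


@dataclass
class Job:
    name: str
    state: State = State.IDLE
    batch_entries: list = field(default_factory=list)  # stand-in for the batch table


def step(job, *, time_to_run, deps_satisfied, resource_granted, execution_complete):
    """Advance the job by at most one transition using only its local view."""
    if job.state is State.IDLE:
        # All three local questions must be answered "yes" before starting.
        if time_to_run and deps_satisfied and resource_granted:
            job.batch_entries.append((job.name, "RUNNING"))
            job.state = State.EXECUTION
    elif job.state is State.EXECUTION:
        if execution_complete:
            job.batch_entries.append((job.name, "COMPLETE"))
            job.state = State.COMPLETION
    return job.state
```

A denied resource request simply leaves the job idle, so the same check can be repeated on the next scheduling pass.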
    27-33. Example: active users
       [Diagram labels]
       • Job DAG: Log mover, Log mover, Oink, Oink, Crane, Rasvelg, Crane, ...
       • Production Hosts: Scribe, web_events, sms_events
       • Log mover (via staging cluster)
       • Main Hadoop DW, Oink/Pig: Cleanse, Filter, Transform, Geo lookup, Union, Distinct
       • Oink: user_sessions
       • MySQL/Gizzard, Crane: user_profiles
       • Vertica, Rasvelg: Join, Group, Count
         Aggregations:
         - active_by_geo
         - active_by_device
         - active_by_client
         ...
       • Crane: active_by_*
       • Analytics MySQL, Dashboards
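The active-users pipeline can be written down as a dependency graph and ordered with a topological sort. The edges below are one reading of the flattened diagram, and the job names are invented for illustration; only the tools (Log mover, Oink, Crane, Rasvelg) and dataset names come from the slide.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each job maps to the set of jobs it waits on (edges are an assumption).
ACTIVE_USERS_DAG = {
    "log_mover_web_events": set(),
    "log_mover_sms_events": set(),
    "oink_build_user_sessions": {"log_mover_web_events", "log_mover_sms_events"},
    "oink_load_user_sessions": {"oink_build_user_sessions"},
    "crane_import_user_profiles": set(),
    "rasvelg_active_by_aggregations": {
        "oink_load_user_sessions",
        "crane_import_user_profiles",
    },
    "crane_export_active_by": {"rasvelg_active_by_aggregations"},
}


def run_order(dag):
    """Return one valid execution order; raises CycleError if the graph loops."""
    return list(TopologicalSorter(dag).static_order())
```

In the system described in the talk, no central scheduler computes this order; each job checks its own preconditions. The sort is only a way to see which orders those local checks permit.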
    34. Vertica or Hadoop?
       • Vertica
         • Loads 100s of Ks rows/second
         • Aggregate 100s of Ms rows in seconds
         • Used for low latency queries and aggregations
         • Keep a sliding window of data
       • Hadoop
         • Excels when data size is massive
         • Flexible and powerful
         • Great with nested data structures and unstructured data
         • Used for complex functions and ML
    35. Vertica import options
       • Direct import via Crane
         • Load into dest table, single thread
       • Atomic import via Crane/Rasvelg
         • Crane loads to temp table, single thread
         • Rasvelg moves to dest table
       • Parallel import via Oink/Pig
         • Pig job via VerticaStorer
         • ARM throttles active DB connections
       [Diagram labels: MySQL/Gizzard, Main Hadoop DW, Vertica, Crane, Rasvelg, Oink]
    36. Vertica imports - pros/cons
       • Crane & Rasvelg
         • Good for smaller datasets, DB to DB transfers
         • Single threaded
         • Easy on Vertica
         • Hadoop not required
       • Pig
         • Great for larger datasets
         • More complex, not atomic
         • DDoS potential
       [Diagram labels: MySQL/Gizzard, Main Hadoop DW, Vertica, Crane, Rasvelg, Oink]
    37. VerticaStorer
       • PigStorage implementation
       • From Vertica’s Hadoop connector suite
       • Out of the box
         • Easy to get Hello World working
         • Well documented
         • Pig/Vertica data bindings work well
         • Fast!
         • Transaction-aware tasks
         • No bugs found
       • Open source?
    38-39. Pig VerticaStorage
       • Our enhancements
         • Connection credential management
         • Truncate before load option
         • Throttle concurrent writers via ZK
       • Future features
         • Counters for rows inserted/rejected
         • Name-based tuple-column bindings
         • Atomic load via temp table

           -- Speculative map tasks would launch duplicate writers (see Gotcha #2).
           SET mapred.map.tasks.speculative.execution false;
           user_sessions = LOAD '/processed/user_sessions/2012/06/14';
           -- Arguments: config file, database name, ARM resource name.
           STORE user_sessions INTO '{db_schema.user_sessions}'
               USING com.twitter.twadoop.pig.store.VerticaStorage(
                   'config/db.yml', 'db_name', 'arm_resource_name');
    40. Gotcha #1
       • MR data load is not atomic
         • Avoid partial reads
         • Option 1: load to temp table, then insert direct
         • Option 2: add job dependency concept
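Option 1 can be sketched as a two-phase load: stage the rows first, then publish them to the destination in a single transaction so readers never see a partial load. This is an illustration under stated assumptions: SQLite stands in for Vertica, and the table names, columns, and the `atomic_load` function are hypothetical.

```python
import sqlite3


def create_tables(conn):
    """Destination table plus a staging table with the same shape."""
    conn.execute("CREATE TABLE user_sessions (user_id INTEGER NOT NULL, geo TEXT NOT NULL)")
    conn.execute("CREATE TABLE staging (user_id INTEGER NOT NULL, geo TEXT NOT NULL)")
    conn.commit()


def atomic_load(conn, rows):
    """Stage rows, then move them to the destination in one transaction."""
    try:
        # Phase 1: fill the staging table; a failure here never touches
        # the destination table.
        conn.execute("DELETE FROM staging")
        conn.executemany("INSERT INTO staging VALUES (?, ?)", rows)
        conn.commit()
    except sqlite3.Error:
        conn.rollback()
        raise
    # Phase 2: publish. The connection context manager commits on success
    # and rolls back on error, so the destination changes all at once.
    with conn:
        conn.execute("INSERT INTO user_sessions SELECT user_id, geo FROM staging")
        conn.execute("DELETE FROM staging")
```

The trade-off is an extra copy of the data, which is why the slide lists a job dependency concept as the alternative.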
    41. Gotcha #2
       • Speculative execution is not always your friend
         • Launch more tasks than needed, just in case
         • For non-idempotent jobs, extra tasks == BAD
    42. Gotcha #3
       • isIdempotent() must be a first-class concept
         • Loader jobs will fail
         • Failure after first task success == not good
         • Can’t automate retry without cleanup
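A rough sketch of what treating idempotence as a first-class concept could look like: a job declares whether it is safe to re-run, and the retry loop consults that flag. The `LoaderJob` class and `run_with_retry` function are hypothetical names, and the rule that a job counts as idempotent when it has a cleanup step is an assumption made for this example.

```python
class LoaderJob:
    """A load job that knows whether it is safe to re-run (illustrative only)."""

    def __init__(self, run, cleanup=None):
        self._run = run
        self._cleanup = cleanup

    def is_idempotent(self):
        # Assumption: with a cleanup step, every attempt starts from a
        # known state, so re-running cannot duplicate rows.
        return self._cleanup is not None

    def attempt(self):
        if self._cleanup is not None:
            self._cleanup()
        self._run()


def run_with_retry(job, max_attempts=3):
    """Retry automatically only when the job declares itself idempotent."""
    for attempt in range(1, max_attempts + 1):
        try:
            job.attempt()
            return attempt  # number of attempts it took
        except RuntimeError:
            if not job.is_idempotent() or attempt == max_attempts:
                raise
```

A job without a cleanup step fails on the first error and leaves its partial output in place for a person to inspect, which matches the slide's point that retries cannot be automated without cleanup.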
    43. Gotcha #4
       • Vendor code only gets you so far
         • Nice to haves == have to write
         • Favor the decorator pattern
         • Pig’s StoreFuncWrapper can help
         • Vendor open sourcing is ideal
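The decorator pattern mentioned above wraps the vendor class instead of modifying it. The sketch below shows the shape of the idea in Python rather than Pig's Java API: `VendorStorer` is a made-up stand-in for vendor code, and `CountingStorer` adds the inserted/rejected counters listed among the future features.

```python
class VendorStorer:
    """Stand-in for a vendor-supplied storer we cannot modify."""

    def __init__(self, expected_columns):
        self.expected_columns = expected_columns
        self.rows = []

    def put_next(self, row):
        if len(row) != self.expected_columns:
            raise ValueError("wrong number of columns")
        self.rows.append(row)


class CountingStorer:
    """Decorator: same interface as the wrapped storer, plus row counters."""

    def __init__(self, wrapped):
        self._wrapped = wrapped
        self.rows_inserted = 0
        self.rows_rejected = 0

    def put_next(self, row):
        try:
            self._wrapped.put_next(row)
        except ValueError:
            self.rows_rejected += 1
        else:
            self.rows_inserted += 1

    def __getattr__(self, name):
        # Anything not overridden here is delegated to the wrapped storer.
        return getattr(self._wrapped, name)
```

Because the wrapper only depends on the vendor's public interface, a vendor upgrade replaces the inner class without touching the added behaviour.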
    44. Future work
       • More VerticaStorer features
       • Multiple Vertica clusters
       • Atomic DB loads with Pig/Oink
       • Better DAG visibility
       • Better job history visibility
       • MR job optimizations via historic stats
       • HCatalog data registry
       • Job push events
    45. Acknowledgements
    46. Questions?
       Bill Graham - @billgraham
