Hadoop and Vertica: Data Analytics Platform at Twitter

Published in: Technology, Business

  1. 1. Hadoop and Vertica: The Data Analytics Platform at Twitter
     Bill Graham - @billgraham
     Data Systems Engineer, Analytics Infrastructure
     Hadoop Summit, June 2012
  2. 2. About that pony giveaway...
  3. 3. Outline
     • Architecture
     • Data flow
     • Job coordination
     • Resource management
     • Vertica integration
     • Gotchas
     • Future work
  4. 4. We count things
     • 140 characters
     • 140M active users
     • 340M tweets per day
     • 80-100 TB ingested daily (uncompressed)
     • Tens of thousands of Hadoop jobs daily
  5. 5. Heterogeneous stack
     • Many job execution applications
       - Crane: Java ETL
       - Oink: Pig scheduler
       - Rasvelg: SQL aggregations
       - Scalding: Cascading via Scala
       - PyCascading: Cascading via Python
       - Indexing jobs
     • Our users
       - Analytics, Revenue, Growth, Search, Recommendations, etc.
       - PMs, Sales!
  6. 6-12. Data flow: Analytics (diagram, built up across seven slides)
     • Sources: production hosts (log data and application events, collected by Scribe aggregators), third-party imports, MySQL/Gizzard (social graph, tweets, user profiles)
     • Storage and clusters: HDFS, staging Hadoop cluster, main Hadoop DW
     • Components added build by build: Distributed Crawler and Log Mover, Crane, Oink, Rasvelg, HCatalog
     • Destinations: HBase, Vertica, MySQL, analytics web tools
     • Users: analysts, engineers, PMs, sales
  13. 13. Chaotic? Actually, no.
  14. 14. System concepts
      • Loose coupling
      • Job coordination as a service
      • Resource management as a service
      • Idempotence
  15. 15. Loose coupling
      • Multiple job frameworks
      • Right tool for the job
      • Common dependency management
  16. 16-20. Job coordination
      • Shared batch table for job state
        - batch table: (id, description, state, start_time, end_time, job_start_time, job_end_time)
      • Access via client libraries
      • Jobs & data are time-based
      • 3 types of preconditions
        1. Other job success (e.g., predecessor job complete)
        2. Existence of data (e.g., HDFS input exists)
        3. User-defined (e.g., MySQL slave lag)
      • Failed jobs get retried (usually)
  21. 21-23. Resource management
      • Analytics Resource Manager - ARM!
      • Library above Zookeeper
      • Throttles jobs and workers
        - Only 1 job of this name may run at once
        - Only N jobs may be run by this app at once
        - Only M mappers may write to Vertica at once
  24. 24-26. Job DAG & state transition
      • “Local View”
        - Is it time for me to run yet?
        - Are my dependencies satisfied?
        - Any resource constraints?
      • State diagram labels: Idle; granted / denied; Insert entry into batch table; Execution; Complete? (yes / no); Completion
      • batch table: (id, description, state, start_time, end_time, job_start_time, job_end_time)
  27. 27-33. Example: active users (diagram, built up across seven slides)
      • Components: production hosts, main Hadoop DW, MySQL/Gizzard, Vertica, analytics MySQL, dashboards
      • Job DAG: Log mover, Oink, Oink, Crane, Rasvelg, Crane, ...
      • Log mover (via staging cluster): Scribe logs web_events and sms_events
      • Oink/Pig: cleanse, filter, transform, geo lookup, union, distinct
      • Oink: user_sessions
      • Crane: user_profiles
      • Rasvelg: join, group, count
        - Aggregations: active_by_geo, active_by_device, active_by_client, ...
      • Crane: active_by_*
  34. 34. Vertica or Hadoop?
      • Vertica
        - Loads hundreds of thousands of rows per second
        - Aggregates hundreds of millions of rows in seconds
        - Used for low-latency queries and aggregations
        - Keeps a sliding window of data
      • Hadoop
        - Excels when data size is massive
        - Flexible and powerful
        - Great with nested data structures and unstructured data
        - Used for complex functions and ML
  35. 35. Vertica import options
      • Direct import via Crane
        - Load into dest table, single thread, atomic
      • Atomic import via Crane/Rasvelg
        - Crane loads to temp table, single thread
        - Rasvelg moves to dest table
      • Parallel import via Oink/Pig
        - Pig job via VerticaStorer
        - ARM throttles active DB connections
      • Diagram labels: MySQL/Gizzard, Crane, Rasvelg, Oink, Main Hadoop DW, Vertica
  36. 36. Vertica imports - pros/cons
      • Crane & Rasvelg
        - Good for smaller datasets, DB to DB transfers
        - Single threaded
        - Easy on Vertica
        - Hadoop not required
      • Pig
        - Great for larger datasets
        - More complex, not atomic
        - DDoS potential
      • Diagram labels: MySQL/Gizzard, Crane, Rasvelg, Oink, Main Hadoop DW, Vertica
  37. 37. VerticaStorer
      • PigStorage implementation
      • From Vertica’s Hadoop connector suite
      • Out of the box
        - Easy to get Hello World working
        - Well documented
        - Pig/Vertica data bindings work well
        - Fast!
        - Transaction-aware tasks
        - No bugs found
      • Open source?
  38. 38-39. Pig VerticaStorage
      • Our enhancements
        - Connection credential management
        - Truncate before load option
        - Throttle concurrent writers via ZK
      • Future features
        - Counters for rows inserted/rejected
        - Name-based tuple-column bindings
        - Atomic load via temp table
      • Example Pig script:
          SET mapred.map.tasks.speculative.execution false;
          user_sessions = LOAD '/processed/user_sessions/2012/06/14';
          STORE user_sessions INTO '{db_schema.user_sessions}'
              USING com.twitter.twadoop.pig.store.VerticaStorage(
                  'config/db.yml', 'db_name', 'arm_resource_name');
  40. 40. Gotcha #1
      • MR data load is not atomic
      • Avoid partial reads
        - Option 1: load to temp table, then insert direct
        - Option 2: add job dependency concept
  41. 41. Gotcha #2
      • Speculative execution is not always your friend
        - Launches more tasks than needed, just in case
      • For non-idempotent jobs, extra tasks == BAD
  42. 42. Gotcha #3
      • isIdempotent() must be a first-class concept
      • Loader jobs will fail
      • Failure after first task success == not good
      • Can’t automate retry without cleanup
  43. 43. Gotcha #4
      • Vendor code only gets you so far
      • Nice-to-haves == have to write
      • Favor the decorator pattern
      • Pig’s StoreFuncWrapper can help
      • Vendor open sourcing is ideal
  44. 44. Future work
      • More VerticaStorer features
      • Multiple Vertica clusters
      • Atomic DB loads with Pig/Oink
      • Better DAG visibility
      • Better job history visibility
      • MR job optimizations via historic stats
      • HCatalog data registry
      • Job push events
  45. 45. Acknowledgements
  46. 46. Questions? Bill Graham - @billgraham
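
The "Job coordination" and "Job DAG & state transition" slides describe each job checking, from its own local view, whether it is time to run and whether its preconditions hold, with state recorded in a shared batch table. The Python sketch below illustrates that idea only; it is not Twitter's implementation. The dictionary standing in for the batch table, the state names, and every function and class name (`job_succeeded`, `data_exists`, `Job`, `try_run`) are assumptions made for illustration, and the resource-constraint check is left out.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Callable, Dict, List, Set, Tuple

# Hypothetical in-memory stand-in for the shared batch table:
# (job name, start of time window) -> state.
BatchTable = Dict[Tuple[str, datetime], str]
Precondition = Callable[[BatchTable, datetime], bool]


def job_succeeded(job_name: str) -> Precondition:
    """Precondition type 1: a predecessor job completed the same window."""
    def check(table: BatchTable, window: datetime) -> bool:
        return table.get((job_name, window)) == "SUCCEEDED"
    return check


def data_exists(paths: Set[str], template: str) -> Precondition:
    """Precondition type 2: the input path exists (a set stands in for HDFS)."""
    def check(table: BatchTable, window: datetime) -> bool:
        return window.strftime(template) in paths
    return check


@dataclass
class Job:
    name: str
    window_length: timedelta
    preconditions: List[Precondition] = field(default_factory=list)

    def try_run(self, table: BatchTable, window: datetime, now: datetime,
                work: Callable[[datetime], None]) -> str:
        key = (self.name, window)
        if table.get(key) == "SUCCEEDED":
            return "SUCCEEDED"  # already done, so a repeat call is a no-op
        if now < window + self.window_length:
            return "WAITING"  # the time window has not closed yet
        if not all(check(table, window) for check in self.preconditions):
            return "WAITING"  # dependencies are not satisfied yet
        table[key] = "RUNNING"
        try:
            work(window)
        except Exception:
            table[key] = "FAILED"  # a later call retries this window
        else:
            table[key] = "SUCCEEDED"
        return table[key]
```

A user-defined precondition (type 3) is any callable with the same `(table, window)` signature, for example one that returns `False` while replication lag is too high. Because `try_run` consults the table first, a scheduler can call it repeatedly for the same window without re-running finished work.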

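Gotcha #1 and Gotcha #3 suggest loading into a temp table first so that readers never see a partial load and a failed loader can be retried without manual cleanup. The sketch below shows that pattern in Python with SQLite standing in for Vertica, which is an assumption made purely so the example runs anywhere. The table names, the `day` column, and the functions `open_demo_db`, `load_atomically`, and `read_day` are hypothetical. Replacing the day's rows inside the final transaction is also an assumption: it is one way to make a retry idempotent, not something the slides specify.

```python
import sqlite3
from typing import Iterable, List


def open_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with a destination and a staging table."""
    # isolation_level=None selects autocommit mode, so the only
    # multi-statement transaction is the explicit BEGIN/COMMIT below.
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.execute("CREATE TABLE user_sessions (day TEXT, user_id INTEGER)")
    conn.execute("CREATE TABLE user_sessions_staging (day TEXT, user_id INTEGER)")
    return conn


def load_atomically(conn: sqlite3.Connection, day: str,
                    user_ids: Iterable[int]) -> None:
    """Load one day's rows so readers see either the old or the new data."""
    # Phase 1: the slow bulk load goes to staging. If it fails part-way,
    # the destination table is untouched and the next attempt starts clean.
    conn.execute("DELETE FROM user_sessions_staging")
    for user_id in user_ids:
        conn.execute(
            "INSERT INTO user_sessions_staging VALUES (?, ?)", (day, user_id)
        )

    # Phase 2: swap the day's rows in a single transaction.
    conn.execute("BEGIN")
    try:
        conn.execute("DELETE FROM user_sessions WHERE day = ?", (day,))
        conn.execute(
            "INSERT INTO user_sessions "
            "SELECT day, user_id FROM user_sessions_staging"
        )
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")
        raise


def read_day(conn: sqlite3.Connection, day: str) -> List[int]:
    """Return the user ids stored for a day, in ascending order."""
    rows = conn.execute(
        "SELECT user_id FROM user_sessions WHERE day = ? ORDER BY user_id",
        (day,),
    )
    return [row[0] for row in rows]
```

Running `load_atomically` twice for the same day leaves the destination in the same state as running it once, which is the property a retrying scheduler needs from a loader job.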