
Hadoop Strata Talk - Uber, your hadoop has arrived

http://conferences.oreilly.com/strata/hadoop-big-data-ca/public/schedule/detail/47039



  1. DATA: “Uber, Your Hadoop Has Arrived” Vinoth Chandar
  2. Uber’s Mission: “Transportation as reliable as running water, everywhere, for everyone” 400+ cities, 69 countries, and growing...
  3. Agenda: Bringing Hadoop to Uber / Hadoop Ecosystem / Challenges Ahead
  4. Data @ Uber: Impact
     1. City Ops: data users operating a massive transportation system
     2. Analysts & Execs: marketing spend, forecasting
     3. Engineers & Data Scientists: fraud detection & discovery
     4. Critical business operations: incentive payments, background checks
     5. Fending off Uber’s legal/regulatory challenges: “You have to produce this data in X hours”
  5. Data @ Uber, circa 2014 (architecture diagram): Kafka7 logs, Schemaless databases, and RDBMS tables are uploaded to Amazon S3/EMR and ETL’d by Wall-e into Vertica, which serves applications (incentive payments, machine learning, safety, background checks) and ad-hoc SQL for City Ops/DOPS and data scientists.
  6. Pain Point #1: Data Reliability
     - Free-form Python/Node objects -> heavily nested JSON
     - Word-of-mouth schema communication
     - (Diagram: many producers - lots of engineers & services, lots of City Ops - feeding the data team’s pipeline, with consumers and $$$$ downstream)
  7. Pain Point #2: System Scalability (H1 2014, H2 2014 & beyond)
     - Kafka7: heavy topics, no HA
     - Wall-e: Celery workers unable to keep up with Kafka/Schemaless
     - Vertica queries: more & more raw data piling on
  8. Pain Point #3: Fragile Ingestion Model
     - Multiple fetches from the same sources
     - Painful backfills, since projections & transformations live inside the pipelines
     - (Diagram: mezzanine feeding trips_table1/2/3 in the warehouse through separate pipelines, vs. a shared data pool)
  9. Pain Point #4: No Multi-DC Support
     - No unified view of the data; more complexity pushed onto consumers
     - Wasteful use of WAN traffic
     - (Diagram: DC1 and DC2 feeding a global warehouse)
  10. Hadoop Data Lake: Pain, Pain, Go Away!
     - (Pain 1) Schematize all data, old & new: Heatpipe / Schema Service / Paricon
     - (Pain 2) All infrastructure shall scale horizontally: Kafka8 & Hadoop; Streamific/Sqoop deliver data to HDFS; Lizzie feeds Vertica, Komondor feeds Hive
     - (Pain 3) Store raw data in its nested glory in Hadoop: JSON -> Avro records -> Parquet! (see the sketch below)
     - (Pain 4) Global view of all data: unified tables! Yay!
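     To make (Pain 3) concrete, here is a minimal, hypothetical Scala sketch of schematizing a free-form event as an Avro record. The TripEvent schema, field names, and output file are invented for illustration; the real schemas come from Heatpipe and the schema service, and Komondor later rewrites the Avro as Parquet.

        import java.io.File
        import org.apache.avro.Schema
        import org.apache.avro.file.DataFileWriter
        import org.apache.avro.generic.{GenericData, GenericDatumWriter, GenericRecord}

        object SchematizeSketch {
          // Hypothetical Avro schema standing in for one managed by the schema service.
          val schemaJson: String =
            """{"type": "record", "name": "TripEvent", "fields": [
              |  {"name": "trip_uuid",  "type": "string"},
              |  {"name": "city_id",    "type": "int"},
              |  {"name": "fare_total", "type": "double"}
              |]}""".stripMargin

          def main(args: Array[String]): Unit = {
            val schema = new Schema.Parser().parse(schemaJson)

            // Free-form producer payload -> schema-checked Avro record.
            // Unknown field names are rejected here; types are checked when the record is serialized.
            val record: GenericRecord = new GenericData.Record(schema)
            record.put("trip_uuid", "hypothetical-trip-1")
            record.put("city_id", 12)
            record.put("fare_total", 23.5)

            val writer = new DataFileWriter[GenericRecord](new GenericDatumWriter[GenericRecord](schema))
            writer.create(schema, new File("trip_events.avro"))
            writer.append(record)
            writer.close()
          }
        }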
  11. Uber’s Schema Talk: tomorrow, 2:40 PM
  12. Hadoop Ecosystem: Overview (architecture diagram). Sources: Kafka8 logs, Schemaless databases, SOA tables. Streamific and Sqoop land JSON/Avro data; Komondor (Spark) writes it to Hive as Parquet flat tables; Hadoop ETL builds modeled tables in Hive; Lizzie ETL (Spark) feeds Vertica for ad-hoc SQL via a web UI; Janus fronts query access. Consumers include fraud (Hive), machine learning (Spark), safety apps (Spark), and backfill pipelines (Spark), with results flowing back to Hive, Kafka, and NoSQL.
  13. Hadoop Ecosystem: Data Ingestion (diagram). Streamific (streaming, duh..) pulls Kafka logs and DB redo logs from DC1 and DC2 into row-based storage (HBase/SequenceFiles); Komondor (batch) converts that into columnar Parquet on HDFS.
  14. Hadoop Ecosystem: Streamific
     - Long-running service: backfills/catch-up don’t hurt the sources
     - Low-latency delivery into row-oriented storage: HBase / HDFS append**
     - Deployed and monitored the ‘uber’ way: can run in DCs without YARN, etc.
     - Core (HA, checkpointing, ordering, throttling, at-least-once guarantees) + pluggable in/out streams
     - Akka (peak 900 MB/sec), Helix (300K partitions); streams span Kafka, Schemaless, S3, HBase, and HDFS
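     Streamific itself is built on Akka and Helix; purely as an illustration of the at-least-once/checkpointing contract behind “pluggable in/out streams”, here is a minimal Scala sketch over the plain Kafka consumer API. The OutStream trait, topic, group id, and broker address are hypothetical.

        import java.time.Duration
        import java.util.{Collections, Properties}
        import scala.collection.mutable.ArrayBuffer
        import org.apache.kafka.clients.consumer.KafkaConsumer

        // Hypothetical stand-in for Streamific's pluggable output streams (HBase, HDFS append, ...).
        trait OutStream {
          def write(records: Seq[String]): Unit // should be idempotent, since batches can be re-delivered
        }

        object AtLeastOnceRelay {
          def run(topic: String, sink: OutStream): Unit = {
            val props = new Properties()
            props.put("bootstrap.servers", "localhost:9092")
            props.put("group.id", "streamific-sketch")
            props.put("enable.auto.commit", "false") // commit offsets only after the sink accepts the batch
            props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
            props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

            val consumer = new KafkaConsumer[String, String](props)
            consumer.subscribe(Collections.singletonList(topic))
            while (true) {
              val batch = consumer.poll(Duration.ofSeconds(1))
              val buf = ArrayBuffer.empty[String]
              val it = batch.iterator()
              while (it.hasNext) buf += it.next().value()
              if (buf.nonEmpty) {
                sink.write(buf.toSeq)  // if this throws, offsets stay uncommitted and the batch is re-read
                consumer.commitSync()  // the checkpoint: at-least-once delivery into the sink
              }
            }
          }
        }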
  15. Hadoop Ecosystem: Komondor
     - The YARN/Spark muscle: Parquet writing is expensive; 1->N mapping from raw data to Parquet/Hive tables
     - Controls data quality: schema enforcement, cleaning JSON, Hive partitioning
     - File stitching: keeps the NN happy & queries performant
     - Let’s “micro batch”? HDFS iNotify stability issues
     - Inputs: Kafka logs (HBase), DB changelogs (HDFS), full snapshots (HDFS); outputs: snapshot tables and incremental tables, e.g. trips (partitioned by request date), user (partitioned by join date), Kafka events (partitioned by event publish date), transaction history (partitioned by charge date) (sketch below)
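     As a rough illustration of what such a Spark batch does (schema enforcement, JSON cleaning, Hive-style partitioning, and stitching output into fewer, larger files), here is a minimal sketch; the HDFS paths, schema, and coalesce factor are hypothetical.

        import org.apache.spark.sql.SparkSession
        import org.apache.spark.sql.functions.{col, to_date}
        import org.apache.spark.sql.types.{StringType, StructField, StructType, TimestampType}

        object KomondorStyleBatch {
          def main(args: Array[String]): Unit = {
            val spark = SparkSession.builder()
              .appName("json-to-partitioned-parquet")
              .enableHiveSupport()
              .getOrCreate()

            // Hypothetical schema; the real one would come from the schema service.
            val schema = StructType(Seq(
              StructField("event_uuid", StringType),
              StructField("event_type", StringType),
              StructField("published_at", TimestampType),
              StructField("payload", StringType)
            ))

            val raw = spark.read
              .schema(schema)                   // schema enforcement: only declared fields survive
              .option("mode", "DROPMALFORMED")  // cleaning: rows that do not parse as JSON are dropped
              .json("hdfs:///raw/kafka_events/2016/01/03")

            raw.withColumn("datestr", to_date(col("published_at"))) // partition by event publish date
              .coalesce(32)                     // "file stitching": fewer, larger files keep the NN happy
              .write
              .mode("append")
              .partitionBy("datestr")
              .parquet("hdfs:///warehouse/kafka_events")
          }
        }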
  16. Hadoop Ecosystem: Consuming Data
     1. Ad-hoc SQL
        a. Gateway service => Janus: keeps bad queries out, chooses YARN queues
        b. Hits HiveServer2/Tez
     2. Data apps
        a. Spark/SparkSQL via HiveContext
        b. Support for saving results to Hive/Kafka
        c. Monitoring/debugging the ‘uber’ way
     3. Lightweight apps
        a. Python apps hitting the gateway
        b. Fetch small results via WebHDFS
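     For the “data apps” path, a minimal sketch of a Spark job querying Hive and writing results back might look like this. The slide’s HiveContext is Spark 1.x; later Spark versions fold it into SparkSession with enableHiveSupport(). Table and column names are hypothetical.

        import org.apache.spark.sql.SparkSession

        object TripsByCityApp {
          def main(args: Array[String]): Unit = {
            val spark = SparkSession.builder()
              .appName("trips-by-city")
              .enableHiveSupport() // SQL access to the Hive metastore, as HiveContext provided
              .getOrCreate()

            val tripsByCity = spark.sql(
              """SELECT city_id, COUNT(*) AS completed_trips
                |FROM fact_trip
                |WHERE datestr = '2016-01-03'
                |GROUP BY city_id""".stripMargin)

            // Save the result back to Hive so dashboards and downstream jobs can pick it up.
            tripsByCity.write.mode("overwrite").saveAsTable("tmp.trips_by_city_20160103")
          }
        }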
  17. Hadoop Ecosystem: Feeding Data-marts
     - Vertica: a SparkSQL/Oozie ETL framework produces flattened tables (high volume, simple projections / row-level transforms); HiveQL produces well-modelled tables (+ complex joins) and also lands tables into Hive
     - Real-time dashboarding: batch layer for a lambda architecture, with MemSQL/Riak as the real-time stores
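     A minimal sketch of the “flattened table” step, assuming a hypothetical nested raw.trips table: simple projections and row-level transforms only, which is the kind of high-volume work the slide assigns to the SparkSQL framework (heavier joins would be expressed in HiveQL).

        import org.apache.spark.sql.SparkSession
        import org.apache.spark.sql.functions.col

        object FlattenTrips {
          def main(args: Array[String]): Unit = {
            val spark = SparkSession.builder()
              .appName("flatten-trips")
              .enableHiveSupport()
              .getOrCreate()

            // Pull nested fields up to top-level columns so Vertica/BI tools can load them directly.
            val flat = spark.table("raw.trips").select(
              col("trip_uuid"),
              col("rider.uuid").as("rider_uuid"),
              col("driver.uuid").as("driver_uuid"),
              col("fare.total").as("fare_total"),
              col("request_at"),
              col("datestr"))

            flat.write.mode("overwrite").partitionBy("datestr").saveAsTable("flat.trips")
          }
        }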
  18. Hadoop Ecosystem: Some Numbers
     - HDFS: 4 PB total in 1 HDFS cluster, growing at ~3 TB/day
     - YARN: 6,300 vcores total; ~60K daily jobs; ~78K compute hours daily (3,250 compute-days)
  19. Hadoop Ecosystem: 2015 Wins
     1. Hadoop is the source of truth for analytics data (100% of all analytics)
     2. Powered critical business operations (partner incentive payments)
     3. Unlocked data (data in Hadoop >> data in Vertica)
     We (almost) caught up!
  20. Hadoop Ecosystem: 2016 Challenges
     1. Interactive SQL at scale: put the power of data in our City Ops’ hands
     2. All-Active: keep data apps working during failovers
     3. Fresher data in Hadoop: trips land in Hive in 6 hrs, but in 1 hr in Vertica
     4. Incremental computation: most jobs run daily off raw tables; intra-hour jobs to build modeled tables
  21. Hadoop Ecosystem: 2016 Challenges (agenda recap; up next: #1 Interactive SQL at Scale)
  22. #1 - Interactive SQL at Scale: Motivation. Vertica is fast but can’t cheaply scale; Hive is powerful and scales reliably, but is slowww….
  23. #1 - Interactive SQL at Scale: Presto
     - Fast (-er than SparkSQL, -errr than Hive-on-Tez)
     - Deployed at scale elsewhere (FB/Netflix)
     - Lacks UDF interop (whereas Hive ⇔ Spark UDF interop is great!)
     - Out-of-the-box geo support: ESRI / Magellan
     - Other challenges: heavy joins in 100K+ existing queries; Vertica degrades more gracefully; colocation with Hadoop; network isolation
  24. #1 - Interactive SQL at Scale: Spark Notebooks
     1. Great for data scientists: iterative prototyping/exploration
     2. Zeppelin/JupyterHub on HDFS, run off Mesos clusters
     3. And of course, the Spark shell, for pipeline troubleshooting (example below)
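     A typical spark-shell troubleshooting session might look like the lines below (table and column names are hypothetical; spark-shell already provides the spark session and its implicits).

        // Inspect what the pipeline actually landed for one day of data.
        val trips = spark.table("raw.trips").where("datestr = '2016-01-03'")
        trips.printSchema()                  // did the ingested schema come through as expected?
        println(trips.count())               // sanity-check row counts against the source system
        trips.groupBy("city_id").count()
          .orderBy($"count".desc)
          .show(20)                          // spot obviously skewed or missing cities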
  25. #1 - Interactive SQL at Scale: Plan
     1. Get Presto up & running: work off “modelled tables” out of Hive, the equivalent of Vertica usage today
     2. Presto on raw nested data: raw data in Hive (will be) available at low latency, i.e. Uber’s scalable near-real-time warehouse
     3. Support Spark notebook use cases: similar QoS issues hitting HDFS from Mesos
  26. Hadoop Ecosystem: 2016 Challenges (agenda recap; up next: #2 All-Active)
  27. #2 - All-Active: Motivation (diagram comparing deployment options). Today, data from all DCs is replicated into a single global data lake: low availability and a SPOF, data copied in and out under a high SLA, and an assumption of unbounded WAN links, in exchange for less operational overhead.
  28. #2 - All-Active: Plan** (diagram). Same HA as online services: you fail over, with data readily available. Maintain N Hadoop lakes? Data is replicated to peer data centers and into global data lakes (solid lines).
  29. #2 - All-Active: Challenges
     1. Cross-DC replicator design: file-level vs record-level replication
     2. Policy management: which data is worth replicating, and which fields are PII?
     3. Reducing storage footprint: 9 copies!! (2 local lakes + 1 global lake = 3 logical copies, times 3x HDFS replication); federation across the WAN?
     4. Capacity management for failover: degraded mode or hot standby?
  30. Hadoop Ecosystem: 2016 Challenges (agenda recap; up next: #3 Fresher Data in Hadoop)
  31. #3 - Fresher Data in Hadoop: Motivation
     1. Uber’s business is inherently ‘real-time’: Uber’s City Ops need fresh data to ‘debug’ Uber
     2. All the data is in Hadoop anyway: reduce mindless data movement
     3. Leverage the power of existing SQL engines: standard SQL interface & mature join support
  32. #3 - Fresher Data: Trips on Hadoop today (pipeline diagram). Schemaless cells from dc1/dc2 stream via Streamific into HBase (tripid => row, ~1 hr); a 6 hr snapshot job then builds the raw trips table in Hive (~6 hrs end-to-end). Vertica, by contrast, receives new/updated trip rows incrementally (10 min file uploads, ~1 hr, tunable) into trips (flat table) and fact_trip (modeled table). The snapshot is inefficient & slow: (a) the snapshot job needs 100s of mappers, (b) it reads TBs and writes TBs, (c) but only X GB of data actually changes per day!!!!
  33. #3 - Fresher Data: Modelled Tables in Hadoop (pipeline diagram). Building fact_trip (modelled table) on top of the 6 hr Hive snapshot pushes Hadoop latency to ~7-8+ hrs, versus the 1-2 hr snapshot path into Vertica. Latency & inefficiency worsen further: (a) Spark/Presto on modelled tables goes from 1-2 hrs to 7-8 hrs!! (b) resource usage shoots up.
  34. #3 - Fresher Data: Let’s incrementally update? (pipeline diagram). Apply the new/updated trip rows from HBase to the Hive trips tables as 30 min incremental updates (10 min file uploads, < 1 hr end-to-end). So problem solved, right? (a) Same pattern as the Vertica load, (b) saves a bunch of resources, (c) and shrinks down latency.
  35. #3 - Fresher Data: HDFS/Hive updates are tedious (same pipeline diagram). So problem solved, right? Except the 30 min incremental load now has to update rows already written into the Hive tables on HDFS.
  36. #3 - Fresher Data: Trip Updates Problem (partition diagram). The raw trips table in Hive uses day-level partitions (e.g. 2010-2014, 2015/(01-05), 2015/(06-11), 2015/12/(01-31), 2016/01/02, 2016/01/03). The last 1 hr of new/updated trips adds new data to the latest partition but also scatters updates into older partitions, leaving most of the data unaffected.
  37. #3 - Fresher Data: HDFS/Hive updates are tedious (diagram revisited). So problem solved, right? Yes, except for the Hive update step. Good news: solve this & everything becomes…
  38. #3 - Fresher Data: Solving Updates
     1. Simple folder/partition re-writing: the most commonly used approach in Hadoop land
     2. File-level updates: similar to (1), but at the file level
     3. Record-level updates: feels like a k-v store on Parquet (and thus more complex); similar to Kudu / Hive transactions
     (A rough sketch of approach (1) follows.)
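     A minimal sketch of approach (1), folder/partition re-writing, assuming hypothetical trips paths, a trip_uuid key, and an updated_at column: merge the day’s existing rows with the incoming updates, keep the latest version of each trip, and rewrite the whole partition.

        import org.apache.spark.sql.SparkSession
        import org.apache.spark.sql.expressions.Window
        import org.apache.spark.sql.functions.{col, row_number}

        object PartitionRewriteUpsert {
          def main(args: Array[String]): Unit = {
            val spark = SparkSession.builder()
              .appName("trips-partition-upsert")
              .enableHiveSupport()
              .getOrCreate()

            // The existing day partition and the last batch of new/updated trip rows (both hypothetical).
            val existing = spark.read.parquet("hdfs:///warehouse/trips/datestr=2016-01-03")
            val updates  = spark.table("staging.trip_updates").where("datestr = '2016-01-03'").drop("datestr")

            // Keep only the newest version of each trip across stored rows and incoming updates.
            val latestPerTrip = Window.partitionBy("trip_uuid").orderBy(col("updated_at").desc)
            val merged = existing.unionByName(updates)
              .withColumn("rn", row_number().over(latestPerTrip))
              .where(col("rn") === 1)
              .drop("rn")

            // Rewrite the partition to a staging location, then swap it in (e.g. an HDFS rename).
            merged.write.mode("overwrite").parquet("hdfs:///warehouse/trips_staged/datestr=2016-01-03")
          }
        }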
  39. #3 - Fresher Data: Plan
     ● Pick the file-level update approach
       ○ Establish all the machinery (custom InputFormat, Spark/Presto connectors)
       ○ Get latency down to 15 mins - 1 hour
     ● Move to the record-level update approach, if needed
       ○ Study numbers from production
       ○ The switch will be transparent to consumers
     ● In summary: unlocks interactive SQL on the raw “nested” table at low latency
  40. Hadoop Ecosystem: 2016 Challenges (agenda recap; up next: #4 Incremental Computation)
  41. #4 - Incremental Computation: Recurring Jobs
     ● State of the art: consume complete/immutable partitions - determine when a partition/time window is complete, then trigger the workflows waiting on it
     ● As partition sizes shrink, more data keeps arriving for old time buckets - with 1-min/10-min partitions, you keep getting new data for old time buckets
     ● Fundamental tradeoff: more latency => more completeness
  42. #4 - Incremental Computation: Use Cases
     - Different apps => different needs, but Hadoop has only one mode, “completeness”
     - Apps must be able to choose, e.g. job.trigger.atCompleteness(90) vs job.schedule.atInterval(10 mins) - a hypothetical API, sketched below
     - Closest thing out there: Google Cloud Dataflow
     - (Chart: use cases plotted by completeness vs. latency needs - incentive payments, fraud detection, backfill pipelines, business dashboards/ETL, data science, safety apps - with latency targets ranging from < 1 hr to days)
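     The job.trigger / job.schedule calls on the slide are aspirational; as a purely hypothetical Scala sketch of what such an API could look like, the types below let a job declare whether it fires on estimated completeness of an input window or on a fixed interval.

        import scala.concurrent.duration._

        // Hypothetical trigger model: choose completeness or latency per job.
        sealed trait Trigger
        case class AtCompleteness(percent: Int) extends Trigger       // wait until ~percent of the window's data has arrived
        case class AtInterval(every: FiniteDuration) extends Trigger  // fire on a wall-clock schedule, tolerate late data

        final case class Job(name: String, trigger: Trigger, run: () => Unit)

        object TriggerExamples {
          // Payments care about completeness; dashboards care about latency.
          val incentivePayments = Job("incentive_payments", AtCompleteness(99), () => println("compute payouts"))
          val bizDashboard      = Job("business_dashboard", AtInterval(10.minutes), () => println("refresh dashboard"))
        }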
  43. #4 - Incremental Computation: Tailing Tables (sketch below)
     ● Adds a new style of consuming data: obtain the new records loaded into a table since the last run, across partitions, and consume them in 15-30 min batches
     ● Favours latency: new data is provided quickly, and the consumer logic is responsible for reconciling it with previous results
     ● Needs a special marker to denote the consumption point - commitTime: for each record, the time at which it was last updated
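     A minimal Spark sketch of tailing a table via such a commitTime marker, with hypothetical table, column, and checkpoint values: each 15-30 min run pulls only records written since the previous run and reconciles them with its prior output.

        import org.apache.spark.sql.SparkSession
        import org.apache.spark.sql.functions.{col, max}

        object TailTripsTable {
          def main(args: Array[String]): Unit = {
            val spark = SparkSession.builder()
              .appName("tail-trips")
              .enableHiveSupport()
              .getOrCreate()

            // Consumption point saved by the previous run (normally read from a checkpoint store).
            val lastCommitTime = "2016-01-03 14:30:00"

            // New records only, across all partitions, using the record-level commitTime marker.
            val newRecords = spark.table("raw.trips").where(col("commit_time") > lastCommitTime)

            // Consumer logic reconciles the delta with previous results (here: just a count per city).
            newRecords.groupBy("city_id").count().show()

            // Advance the consumption point for the next batch.
            val next = Option(newRecords.agg(max("commit_time")).head().getString(0)).getOrElse(lastCommitTime)
            println(s"next checkpoint: $next")
          }
        }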
  44. #4 - Incremental Computation: Plan
     ● Add record-level metadata to enable tailing
       ○ Extra book-keeping to map commitTime to Hive partitions/files, to avoid disastrous full scans (see the sketch below)
       ○ Can be combined with granular Hive partitions if needed (15 min Hive partitions => ~200K partitions for the trip table)
     ● Open items:
       ○ Late-arrival handling: tracking when a time window becomes complete, and a design to (re)trigger workflows
       ○ Incrementally recomputing aggregates
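     The “extra book-keeping” bullet might look roughly like the sketch below: a hypothetical meta.trip_commits table, written at ingest time, records which Hive partitions each commit touched, so a tailing reader scans only those partitions instead of the whole table.

        import org.apache.spark.sql.SparkSession
        import org.apache.spark.sql.functions.col

        object PrunedTailRead {
          def main(args: Array[String]): Unit = {
            val spark = SparkSession.builder()
              .appName("pruned-tail-read")
              .enableHiveSupport()
              .getOrCreate()

            val lastCommitTime = "2016-01-03 14:30:00"

            // Hypothetical bookkeeping table: one row per (commit_time, datestr) touched by ingestion.
            val touchedPartitions = spark.table("meta.trip_commits")
              .where(col("commit_time") > lastCommitTime)
              .select("datestr").distinct()
              .collect().map(_.getString(0))

            // Restrict the scan to affected partitions, avoiding a full-table scan on every tail.
            val newRecords = spark.table("raw.trips")
              .where(col("datestr").isin(touchedPartitions: _*) && col("commit_time") > lastCommitTime)

            println(s"new records since $lastCommitTime: ${newRecords.count()}")
          }
        }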
  45. Hadoop Ecosystem: 2016 Challenges (agenda recap; up next: Summary)
  46. Summary
     - Today: a living, breathing data ecosystem; catch(-ing) up to the state of the art
     - Tomorrow: push the edges based on Uber’s needs (near-real-time warehouse, incremental compute, all-active) and make every decision, human or machine, data-driven
  47. Thank you!
