Reliable and Scalable Data Ingestion at Airbnb

  1. Reliable and Scalable Data Ingestion at Airbnb
     KRISHNA PUTTASWAMY & JASON ZHANG
  2. Best travel experiences powered by data products.
     Inform decision making based on data and the insights derived from it.
  3. Events Lead to Insights
     • ML applications: fraud detection, search ranking, etc.
     • User activity: growth, matching, etc.
     • Experimentation, monitoring, etc.
     (diagram: events flow from production into the data warehouse, where they yield insights)
  4. Challenges 1.5 Years Ago
     • JSON events without schemas
     • Over 800 event types
     • Events easily broken during schema evolution or code changes
     • Lack of monitoring
     These led to:
     • Too many data outages and data-loss incidents
     • Lack of trust in the data systems
  5. Data Quality Failure (1.5 years ago)
     The CEO dashboard and the bookings dashboards were regularly broken.
  6. Data Quality Failure (1.5 years ago)
     ERF (Airbnb's experiment reporting framework) was unstable and the experimentation culture was weak.
     "Hi team, this is partly a PSA to let you know ERF dashboard data hasn't been up to date/accurate for several weeks now. Do not rely on the ERF dashboard for information about your experiment."
  7. Events Data Ingestion Must Be Reliable
  8. Reliability Guarantees Targeted
     • Timeliness: data lands on time and predictably
     • Completeness: all data should land in the warehouse
     • Data quality: anomalous behavior is identified
  9. Distributed Systems
     (diagram: Ruby, Java, JavaScript, and mobile clients emit events through Kafka clients and a REST proxy into Kafka; Camus and EZSplit move them to HDFS, feeding data pipelines and data products. Failure modes annotated along the way: invalid data, stuck processes, buffer overflows, node failures, host network issues, broker errors)
 10. Rapid Growth in Events Data
     • More users, activity, bookings, etc.
     • Need lightweight techniques that do not themselves become bottlenecks
     (chart: event volume growing steadily from January 2014 through September 2015)
 11. No Ground Truth
     • How many events were actually emitted?
     • How many should have been emitted?
     • What should the correct data contain?
     • How do we catch subtle anomalies in the data?
 12. Phases of Rebuilding Data Ingestion
     Component-level audit, E2E audit, schema enforcement, anomaly detection, realtime ingestion
 13. Phase 1: Audit Each Component
 14. Guarding Against Component Failures
     Instrumentation, monitoring, and alerting on each component:
     • Process health
     • Count of input/output events
     • Week-over-week comparison (a minimal sketch of such a check follows)
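     As an illustration, a week-over-week check on a component's hourly event counts
     might look like this minimal Python sketch; the 25% threshold, the input shape,
     and the function name are assumptions, not Airbnb's actual tooling:

     from datetime import timedelta

     def check_week_over_week(counts_by_hour, now, threshold=0.25):
         """Alert when this hour's event count deviates from the same hour
         last week by more than `threshold` (as a fraction of last week)."""
         current = counts_by_hour.get(now)       # now: datetime truncated to the hour
         last_week = counts_by_hour.get(now - timedelta(days=7))
         if current is None or not last_week:
             return "missing-data"               # cannot compare; surface separately
         change = abs(current - last_week) / last_week
         return "alert" if change > threshold else "ok"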
 15. (diagram: the same pipeline as slide 9, annotated with the failure modes the component-level audits guard against: stuck processes, buffer overflows, node failures, host network issues, broker errors, pipeline bugs)
 16. Phase 2: Audit the E2E System
 17. E2E System Auditing
     Hardening each component is not sufficient. We need end-to-end, out-of-band checks on the full pipeline to:
     • Account for new failure modes
     • Quantify aggregate event loss
     • Narrow down the source of loss
 18. Canary Service
     • A standalone service that sends events at a known rate
     • Compare the events that land in the warehouse against that rate and alert on loss
     • Simple, reliable, and accurate (see the sketch below)
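     A minimal sketch of the warehouse-side canary check. Because the emit rate is
     known exactly, any shortfall in the landed count is attributable to loss somewhere
     in the pipeline. The rate, threshold, and names below are illustrative assumptions,
     not the actual service:

     CANARY_RATE_PER_MIN = 60           # known, fixed emit rate (assumption)
     LOSS_ALERT_THRESHOLD = 0.0001      # alert above 0.01% loss

     def check_canary(warehouse_count, window_minutes):
         """Compare canary events that landed in the warehouse against the
         number we know were emitted, and flag loss above the threshold."""
         expected = CANARY_RATE_PER_MIN * window_minutes
         loss = max(expected - warehouse_count, 0) / expected
         if loss > LOSS_ALERT_THRESHOLD:
             return f"ALERT: canary loss {loss:.4%} over {window_minutes} min"
         return "ok"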
 19. DB as Proxy for Ground Truth
     • Compare DB mutations with the corresponding events emitted
     • The DB serves as ground truth for events with a 1:1 mapping to mutations (illustrated below)
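     For instance, every new booking row should correspond to exactly one booking
     event, so per-day counts from a DB snapshot can be checked against the warehouse.
     A sketch under that assumption; the input shapes are hypothetical:

     def audit_against_db(db_mutation_counts, warehouse_event_counts):
         """For event types with a 1:1 mapping to DB mutations, report
         per-day discrepancies. Both arguments map a date to a count."""
         report = {}
         for day, db_count in db_mutation_counts.items():
             landed = warehouse_event_counts.get(day, 0)
             missing = db_count - landed
             if missing != 0:
                 report[day] = missing   # positive: lost events; negative: duplicates
         return report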
 20. Audit Pipeline Overview
     • Need to quantify event loss and ensure the SLA is not violated
     • Attach a header to each event when it enters the pipeline (REST proxy, Java, and Ruby clients)
     • The header contains host, process, sequence number, and UUID
     • Group sequence numbers by (host, process) in the warehouse to quantify event loss and attribute it to hosts (see the sketch below)
     • Extend to multi-hop sequences to attribute loss to internal components of the pipeline
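     Because each (host, process) assigns sequence numbers densely, the loss for a
     stream is the gap between the observed sequence range and the number of distinct
     UUIDs that landed. A minimal warehouse-side sketch; the input shape and names
     are assumptions:

     from collections import defaultdict

     def quantify_loss(audit_headers):
         """audit_headers: iterable of (host, process, sequence, uuid) tuples
         read back from the warehouse."""
         streams = defaultdict(lambda: {"lo": None, "hi": None, "uuids": set()})
         for host, process, seq, uuid in audit_headers:
             s = streams[(host, process)]
             s["lo"] = seq if s["lo"] is None else min(s["lo"], seq)
             s["hi"] = seq if s["hi"] is None else max(s["hi"], seq)
             s["uuids"].add(uuid)                 # dedup retried deliveries
         loss = {}
         for key, s in streams.items():
             expected = s["hi"] - s["lo"] + 1     # dense sequence range
             lost = expected - len(s["uuids"])
             if lost > 0:
                 loss[key] = lost                 # attribute loss to this host/process
         return loss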
 21. Event Schema for Audit Metadata
 22. (diagram: the same pipeline, now with sequence-numbered audit headers flowing from the site-facing services through Kafka to HDFS, plus the canary service and database snapshots as out-of-band checks)
 23. Phase 3: Schema Enforcement
 24. Challenges 1.5 Years Ago (recap)
     • JSON events without schemas
     • Events easily broken during schema evolution or code changes
     • Over 800 event types
     • Lack of monitoring
     These led to:
     • Too many data outages and data-loss incidents
     • Lack of trust in the data systems
 25. Data Incidents (chart)
 26. Schema Enforcement
     • Schema tech stack: Thrift
     • Libraries for sending Thrift objects from different clients: Java, Ruby, JS, and mobile
     • Who should define schemas: data scientists or product engineers?
     • Development workflow: schema evolution, and bridging producer and consumer schemas
     • Self-serve
 27. Thrift Schema Repository
     Why Thrift?
     • Easy syntax
     • Good performance in Ruby
     • Ubiquitous
     Advantages of a schema repo?
     • A great catalyst for communication, documentation, etc.
     • It ships JARs and gems
     • Will developers hate you for this? No.
 28. Schema Evolution
     • A standard field in the event schema
     • Managed explicitly
     • Uses semantic versioning: 1.0.0 = MODEL.REVISION.ADDITION
       - MODEL: a change that breaks backward compatibility. Example: changing the type of a field.
       - REVISION: a change that is backward compatible but not forward compatible. Example: adding a new field to a union type.
       - ADDITION: a change that is both backward and forward compatible. Example: adding a new optional field.
 29. Example of a Thrift Event (because the event is your API)
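     The example itself is an image in the deck. As a stand-in, here is a hypothetical
     Thrift IDL definition in the spirit of the talk; every name and field below is
     invented for illustration and is not Airbnb's actual schema:

     // Hypothetical event schema, for illustration only.
     namespace java com.example.air.events

     struct SearchEvent {
       1: required string schema_version,  // "1.0.0" = MODEL.REVISION.ADDITION
       2: required string uuid,            // unique event id, used for dedup
       3: required i64 timestamp_ms,       // event time
       4: required string user_id,
       5: required string query,
       6: optional string locale           // an ADDITION: optional fields evolve safely
     }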
 30. Example (image)
 31. Example Schema Mapping in the Warehouse (image)
 32. Phase 4: Anomaly Detection
 33. A Bad Date Picker
     • On 9/22/2015, we launched a new datepicker experiment on P1
     • Half of the users received the new_datepicker treatment; the other half were the control
     • It was shut off by 9/29/2015, and metrics recovered
 34. Diagnosis
     • We noticed a 14% drop in "searches with dates" after about 7 days
     • The scope of the impact was unclear; we only knew that a subset of locales was affected
     • Root-cause analysis depended heavily on vigilance and a bit of guesswork and luck
     • Drilling down by country revealed an interesting pattern
 35. Diagnosis
     • Drilling down into source = P1, we see a stronger pattern
     • Something qualitatively worse is happening in IT, GB, and CA
     • "Affected locales: en-GB, it, en-AU, en-CA, en-NZ, da, zh-tw, ms-my and probably some more"
     • How did we know to try P1? How did we know which countries to slice by?
 36. Curiosity
     • Let's automate this process!
     • It's hard to know which dimension combinations matter... so try as many of them as we reasonably can, in an intelligent way
     • Drill down into dimension combinations that are specific enough to be informative, yet still contribute meaningfully to the top-level aggregate
 37. Method
     • Retrieve time-series data from a source (GROUP BY time, dimension)
     • Detect any anomalies in each dimension value's time series
     • Explore across the dimension space to compare values against each other
     • Prune the set of dimension values using the anomaly and exploration results
     • Drill down into the remaining dimensions for the pruned values (a sketch follows)
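     A minimal sketch of that loop, with a crude z-score detector standing in for
     whatever model is actually used; fetch_series, the thresholds, and the pruning
     rule are all assumptions:

     import statistics

     def is_anomalous(series, z_threshold=3.0):
         """Flag the latest point if it sits more than z_threshold standard
         deviations from the historical mean (stand-in detector)."""
         history, latest = series[:-1], series[-1]
         if len(history) < 2:
             return False
         mean, stdev = statistics.mean(history), statistics.stdev(history)
         return stdev > 0 and abs(latest - mean) / stdev > z_threshold

     def drill_down(fetch_series, dimensions, min_share=0.05):
         """fetch_series(filters, dim) -> {value: (share, series)} is a
         hypothetical helper running the GROUP BY time, dimension query.
         Recurse into dimension values that look anomalous and still
         contribute at least min_share of the top-level aggregate."""
         findings, frontier = [], [({}, dimensions)]
         while frontier:
             filters, remaining = frontier.pop()
             for dim in remaining:
                 for value, (share, series) in fetch_series(filters, dim).items():
                     if share >= min_share and is_anomalous(series):
                         hit = {**filters, dim: value}
                         findings.append(hit)      # e.g. {"source": "P1", "country": "IT"}
                         rest = [d for d in remaining if d != dim]
                         if rest:
                             frontier.append((hit, rest))
         return findings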
 38. Phase 5: Realtime Ingestion
 39. Streaming Ingestion Pipeline, End to End (diagram)
 40. HBase Row Key
     • Event key = event_type.event_name.event_uuid. Ex: air_event.canaryevent.016230ae-a3d8-434e
     • Shard id = hash(event key) % shard_num
     • Shard key = region_start_keys[shard_id]. Ex: 0000000
     • Row key = shard_key.event_key. Ex: 0000000.air_events.canaryevent.016230-a3db-434e
     (construction sketched below)
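     A sketch of that construction in Python; the hash function and shard count are
     assumptions (any stable hash works, as long as the salt spreads writes evenly
     across the pre-split regions):

     import hashlib

     SHARD_NUM = 8                                # assumed number of pre-split regions
     REGION_START_KEYS = [f"{i:07d}" for i in range(SHARD_NUM)]   # "0000000", ...

     def make_row_key(event_type, event_name, event_uuid):
         """Build a salted HBase row key so writes spread across regions
         instead of hot-spotting a single one."""
         event_key = f"{event_type}.{event_name}.{event_uuid}"
         digest = hashlib.md5(event_key.encode()).hexdigest()
         shard_id = int(digest, 16) % SHARD_NUM   # stable hash of the event key
         return f"{REGION_START_KEYS[shard_id]}.{event_key}"

     # e.g. make_row_key("air_event", "canaryevent", "016230ae-a3d8-434e")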
 41. Dedup and Repartition (sketched below)
     (diagram: Spark executors 1..N each write to the HBase regions 1..K that their partitions map to)
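     The idea in a PySpark sketch: dedup on the event key (which embeds the UUID),
     then hash-partition with the same salt as the row key so each Spark partition
     writes to exactly one pre-split HBase region. The RDD shape and helper names
     are assumptions:

     import hashlib

     def shard_id(event_key, shard_num):
         """Same salt computation as the HBase row key (slide 40)."""
         return int(hashlib.md5(event_key.encode()).hexdigest(), 16) % shard_num

     def dedup_and_repartition(events, shard_num):
         """events: RDD of (event_key, event_bytes) pairs."""
         deduped = events.reduceByKey(lambda a, b: a)    # keep one copy per event key
         return deduped.partitionBy(shard_num, lambda k: shard_id(k, shard_num))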
 42. Hive HBase Connector
     CREATE EXTERNAL TABLE `search_event_table` (
       `rowkey` string COMMENT 'from deserializer',
       `event_bytes` binary COMMENT 'from deserializer')
     ROW FORMAT SERDE 'org.apache.hadoop.hive.hbase.airbnb.HBaseSerDe'
     STORED BY 'org.apache.hadoop.hive.hbase.airbnb.HBaseStorageHandler'
     WITH SERDEPROPERTIES (
       'hbase.timerange.hourly.boundary'='true',        -- for the current hour
       'hbase.columns.mapping'=':key,b:event_bytes',
       'hbase.key.pushdown'='jitney_event.search_event',
       'hbase.timestamp.min'='…',                       -- arbitrary time range start
       'hbase.timestamp.max'='…')                       -- arbitrary time range end
 43. Conclusions
     • We ingest over 5B events with less than 100 events/day of loss
     • We can alert on data loss in real time (loss > 0.01%)
     • We can quantify which machine or service led to how much loss
     • We can identify even subtle anomalies in the data
