Interactive Analytics in Human Time
S u p r e e t h R a o , S u n i l G u p t a ⎪ J u n e 4 , 2 0 1 4
2 0 1 4 H a d o o p ...
Interactive – How we see it?
2 Yahoo Confidential & Proprietary
60B events,
3.5TB of compressed data
Response 400ms
Serve ...
Agenda:
Motivation
Approach
Problem Deepdive- Instant Overlap
Summary
Questions
Motivation:
Lots of data
Analytics
Data restatement - batch and real time
Human time
Lots of data
~30B advertising events/day
~10s of TB of compressed data/day
Minutes to Year Grain
Multi-quarter data retent...
Analytics
Reporting Metrics
Attribution
Multi-level hierarchical computation
Bidding/Targeting optimization
Non-additive c...
Data Restatement
Real time
Batch
Producer Consumer
quick path, lower
amount of checks or
reconciliation,
typically no look...
Human Time
<1s ( 99 percentile)
Default time grain ( < 300 ms)
Instant overlap ( < 60s)
Data ingested, insights available ...
Lots of data
Analytics
Data restatement - batch and real time
Human time
Approach:
Data Ingestion or Collection
Transformations
Data Ingestion
Persistence
Runtime Compute
Caching
API
Optional Middleware
Bu...
Transformations/Aggs
Data Ingestion
Persistence
Runtime Compute
Caching
API
Optional Middleware
Business API
UI
Data Colle...
Transformations/Aggs
Data Ingestion
Persistence
Runtime Compute
Caching
API
Optional Middleware
Business API
UI
Data Colle...
How we do what we do? Components of Advertising Data Warehouse
Druid
JDBC/ODBC
Data Warehouse-Persistence
Hive
Metrics
Sto...
Human time, How?
Druid for interactive queries
Storm-Druid for quick ingestion and
index
Specialized computation and
pr...
Problem Deepdive: Instant Overlap
Users
Car
commuters
Soccer
Fans
Vegans
Users
Car
Commuters
Soccer
Fans
Vegans
Overlap
Non-additive
› Require access to raw (user level data) to compute
non-additive
• Billions of events a day
• TBs o...
Re-stating motivation
Given two sets having identifiers, how
can we do exact overlaps in close to
real time?
( < 1 min).
O...
Existing Approaches
● Use exact compute paradigms
o Do joins for intersections which will lead to
exact results
 Hive, PI...
Using Feature Sequences – 1/4
Feature sequence encoding
o Encode the sequence
 {Ram} - { car commuter, soccer fan,...}
 ...
Using Feature Sequences – 2/4
Eliminate the user on encoded bitmaps
 {car commuter, soccer fan, vegan...}- count -c1- #
...
Using Feature Sequences – 3/4
● Store row qualifications into a bitmap
o Car commuter- Row1, Row3
 1010000000
o Vegan - R...
Using Feature Sequences – 4/4
 Data Structures
› {feature_sequence}->count
› Feature->row qualification bitmaps
 AND is ...
Comparison with existing algorithm
● 1-n – Bulk Overlap on grid
o 19 hours on grid
o Few-n calls for a re-process
o 1-1 ( ...
Summary
● Yahoo’s Advertising Data Warehouse
o Peta Byte Scale
o Normalized view across many systems
o Analytics and optim...
Thank You
@supreeth_
@_skgupta
We are hiring!
Stop by Kiosk P9
or reach out to us at
bigdata@yahoo-inc.com.
Data Ingestion or Collection
Transformations
Data Ingestion
Persistence
Runtime Compute
Caching
API
Optional Middleware
Bu...
Dimension Flexibility
Many dimensions
Adding new dimensions
Time zones
Time grain
Normalized view across systems
PaidSearch
Display
Native
Programmatic buying and
selling
Ad-targeting
Hardware Configs
●High-memory boxes
●SSD preferred
●Savings due to better
compression
Upcoming SlideShare
Loading in...5
×

Interactive Analytics in Human Time

2,861

Published on

Published in: Technology, Business
0 Comments
9 Likes
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total Views
2,861
On Slideshare
0
From Embeds
0
Number of Embeds
7
Actions
Shares
0
Downloads
0
Comments
0
Likes
9
Embeds 0
No embeds

No notes for slide
  • Logical view of a typical data driven application architecture
    Sox compliance
  • -quick cache store for querying all metrics in a single fetch, to support one-page load UI architecture
    - Hive for scheduled job and for adhoc long range generic queries which are not supported on the interactive interface

  • Bitmap indexed, columnar, can operate on compressed bitmaps, distributed
    -
  • Interactive Analytics in Human Time

    1. 1. Interactive Analytics in Human Time S u p r e e t h R a o , S u n i l G u p t a ⎪ J u n e 4 , 2 0 1 4 2 0 1 4 H a d o o p S u m m i t , S a n J o s e , C a l i f o r n i a
    2. 2. Interactive – How we see it? 2 Yahoo Confidential & Proprietary 60B events, 3.5TB of compressed data Response 400ms Serve an ad and get insights < 2s
    3. 3. Agenda:
    4. 4. Motivation Approach Problem Deepdive- Instant Overlap Summary Questions
    5. 5. Motivation:
    6. 6. Lots of data Analytics Data restatement - batch and real time Human time
    7. 7. Lots of data ~30B advertising events/day ~10s of TB of compressed data/day Minutes to Year Grain Multi-quarter data retention Data Aging
    8. 8. Analytics Reporting Metrics Attribution Multi-level hierarchical computation Bidding/Targeting optimization Non-additive computation
    9. 9. Data Restatement Real time Batch Producer Consumer quick path, lower amount of checks or reconciliation, typically no lookups high latency path, checks and reconciliations, can have lookbacks and lookups
    10. 10. Human Time <1s ( 99 percentile) Default time grain ( < 300 ms) Instant overlap ( < 60s) Data ingested, insights available ( < 2s)
    11. 11. Lots of data Analytics Data restatement - batch and real time Human time
    12. 12. Approach:
    13. 13. Data Ingestion or Collection Transformations Data Ingestion Persistence Runtime Compute Caching API Optional Middleware Business API UI Data Collection Data Pipelines Data Warehouse/ Analytics and Optimizations Reporting Application/UI Logical View - Scope Transformations/Aggs Data Ingestion Persistence Runtime Compute Caching API Optional Middleware Business API UI Data Collection Impacts Out of scope
    14. 14. Transformations/Aggs Data Ingestion Persistence Runtime Compute Caching API Optional Middleware Business API UI Data Collection Batch processing DAG, Real-time topology, SOX, Traffic protection, Late processing, Retention, Completeness Monitoring, PII cleansing/masking Compatible with HDFS, Performance (Indexed, Columnar, Compression, Serialization, Flexibility, Concurrency, Grain of data stored) Distributed/Stand-alone, Caching objects vs caching results Access to data with group by, order by etc..; SQL or SQL like Translate JSON to SQL(optional) Logical View - Characteristics Impacts Out of scope
    15. 15. Transformations/Aggs Data Ingestion Persistence Runtime Compute Caching API Optional Middleware Business API UI Data Collection Hadoop MR/PIG /Oozie(Lotus)/Storm(Trident) Druid, Shark, Hive, Oracle RAC, Mysql, Hbase, Impala memcached_y, Redis JSON-REST API ; JDBC; ODBC Data Ingestion Persistence Runtime Compute Caching API Optional Middleware Business API UI Data Collection Logical View - Choices Impacts Out of scope
    16. 16. How we do what we do? Components of Advertising Data Warehouse Druid JDBC/ODBC Data Warehouse-Persistence Hive Metrics Store JSON-API Persistence and run-time compute Computation and Ingestion Quick cache ( using a database for now) Upstream: API layer, MSTR, Adhoc access, Identity Service, Ad-Serving manifests Data Producers; Serving, Scoring, Booking, 3rd/1st Party Data Real time and batch compute engine (Hadoop/Storm ) Data filtering/transformations: Transformations, format conversions Custom Algorithms : computing recursive uniques, indexing
    17. 17. Human time, How? Druid for interactive queries Storm-Druid for quick ingestion and index Specialized computation and processing for quicker response › Sketches › Feature sequence based overlaps › Custom indexing
    18. 18. Problem Deepdive: Instant Overlap
    19. 19. Users Car commuters Soccer Fans Vegans
    20. 20. Users Car Commuters Soccer Fans Vegans
    21. 21. Overlap Non-additive › Require access to raw (user level data) to compute non-additive • Billions of events a day • TBs of data a day  1-1 vs 1-n vs few-n › Between car commuter and vegan what is the overlap › For Car commuter which are the top overlap groups › For Vegan, Car commuters what are the top overlap groups
    22. 22. Re-stating motivation Given two sets having identifiers, how can we do exact overlaps in close to real time? ( < 1 min). Overlap is like a AND operation or a set
    23. 23. Existing Approaches ● Use exact compute paradigms o Do joins for intersections which will lead to exact results  Hive, PIG, MR can all support efficient joins  Exact but not real time ● Use sketches o Approximate algorithms  HLL, KMV, accuracy vs size, performance  Approx, needs high perf tuning  close to real time but not exact
    24. 24. Using Feature Sequences – 1/4 Feature sequence encoding o Encode the sequence  {Ram} - { car commuter, soccer fan,...}  {Tom} - { soccer fan, vegan...}  {Sam} - { car commuter, soccer fan, vegan...}  ….
    25. 25. Using Feature Sequences – 2/4 Eliminate the user on encoded bitmaps  {car commuter, soccer fan, vegan...}- count -c1- #  {soccer fan, vegan...} - count - c2 - #  {car commuter, vegan...} - count - c2 - # Counts become additive now
    26. 26. Using Feature Sequences – 3/4 ● Store row qualifications into a bitmap o Car commuter- Row1, Row3  1010000000 o Vegan - Row1, Row2, Row3  1110000000 ● Load the bitmap into Druid using a custom indexer o in-memory or memory mapped
    27. 27. Using Feature Sequences – 4/4  Data Structures › {feature_sequence}->count › Feature->row qualification bitmaps  AND is now an “AND” on bitmaps › supported within Druid › Very efficient  Works alongside topN and groupBys
    28. 28. Comparison with existing algorithm ● 1-n – Bulk Overlap on grid o 19 hours on grid o Few-n calls for a re-process o 1-1 ( <1s) ● Instant Overlap o < 60s ( pre-processing 3-4 hours) o Supports “exact” AND o Flexible ( few-n, 1-n) o 1-1 ( < 1s)
    29. 29. Summary ● Yahoo’s Advertising Data Warehouse o Peta Byte Scale o Normalized view across many systems o Analytics and optimizations with specialized algorithms o Data restatement - batch and realtime o Human time
    30. 30. Thank You @supreeth_ @_skgupta We are hiring! Stop by Kiosk P9 or reach out to us at bigdata@yahoo-inc.com.
    31. 31. Data Ingestion or Collection Transformations Data Ingestion Persistence Runtime Compute Caching API Optional Middleware Business API UI Data Collection Data Pipelines Data Warehouse/ Analytics and Optimizations Reporting Application/UI Logical View - Scope Transformations/Aggs Data Ingestion Persistence Runtime Compute Caching API Optional Middleware Business API UI Data Collection Impacts Out of scope
    32. 32. Dimension Flexibility Many dimensions Adding new dimensions Time zones Time grain
    33. 33. Normalized view across systems PaidSearch Display Native Programmatic buying and selling Ad-targeting
    34. 34. Hardware Configs ●High-memory boxes ●SSD preferred ●Savings due to better compression

    ×