June 2014 HUG: Interactive analytics over hadoop
Published in: Technology, Business

  • Logical view of a typical data-driven application architecture
    SOX compliance
  • - Quick cache store for querying all metrics in a single fetch, to support a one-page-load UI architecture
    - Hive for scheduled jobs and for ad hoc, long-range generic queries that are not supported on the interactive interface

  • Bitmap indexed, columnar, can operate on compressed bitmaps, distributed
  • Transcript

    • 1. Interactive Analytics in Human Time. Supreeth Rao, Sunil Gupta | June 18, 2014 | 45th Bay Area HUG, Sunnyvale, California
    • 2. Interactive: how do we see it? 60B events, 3.5 TB of compressed data: response in 400 ms. Serve an ad and get insights: < 2 s.
    • 3. Agenda:
    • 4. Motivation; Approach; Problem deep dive: Instant Overlap; Summary; Questions
    • 5. Motivation:
    • 6. Lots of data; Analytics; Data restatement (batch and real time); Human time
    • 7. Lots of data: ~30B advertising events/day; ~10s of TB of compressed data/day; minute-to-year grain; multi-quarter data retention; data aging
    • 8. Analytics: Reporting Metrics; attribution; multi-level hierarchical computation; bidding/targeting optimization; non-additive computation
    • 9. Data Restatement, between producer and consumer. Real time: quick path, fewer checks or reconciliations, typically no lookups. Batch: high-latency path, checks and reconciliations, can have lookbacks and lookups.
    • 10. Human Time: < 1 s (99th percentile); default time grain (< 300 ms); instant overlap (< 60 s); data ingested, insights available (< 2 s)
    • 11. Lots of data; Analytics; Data restatement (batch and real time); Human time
    • 12. Approach:
    • 13. Logical View - Scope. Layers: Transformations/Aggs, Data Ingestion, Persistence, Runtime Compute, Caching, API, Optional Middleware, Business API, UI, Data Collection. Groupings: Data Ingestion or Collection; Transformations; Data Pipelines; Data Warehouse/Analytics and Optimizations; Reporting Application/UI. Legend: Impacts; Out of scope.
    • 14. Logical View - Characteristics. Layers: Transformations/Aggs, Data Ingestion, Persistence, Runtime Compute, Caching, API, Optional Middleware, Business API, UI, Data Collection. Characteristics: batch processing DAG, real-time topology, SOX, traffic protection, late processing, retention, completeness monitoring, PII cleansing/masking; compatible with HDFS, performance (indexed, columnar, compression, serialization, flexibility, concurrency, grain of data stored); distributed/stand-alone, caching objects vs. caching results; access to data with group by, order by, etc., SQL or SQL-like; translate JSON to SQL (optional). Legend: Impacts; Out of scope.
    • 15. Logical View - Choices. Layers: Transformations/Aggs, Data Ingestion, Persistence, Runtime Compute, Caching, API, Optional Middleware, Business API, UI, Data Collection. Choices: Hadoop MR/Pig/Oozie (Lotus)/Storm (Trident); Druid, Shark, Hive, Oracle RAC, MySQL, HBase, Impala; memcached_y, Redis; JSON-REST API, JDBC, ODBC. Legend: Impacts; Out of scope.
    • 16. How do we do what we do? Components of the Advertising Data Warehouse: Druid; JDBC/ODBC; Data Warehouse persistence; Hive; Metrics Store; JSON API; persistence and run-time compute; computation and ingestion; quick cache (using a database for now). Upstream: API layer, MSTR, ad hoc access, Identity Service, ad-serving manifests. Data producers: serving, scoring, booking, 3rd/1st-party data. Real-time and batch compute engine (Hadoop/Storm). Data filtering/transformations: transformations, format conversions. Custom algorithms: computing recursive uniques, indexing.
    • 17. Human time, how? Druid for interactive queries; Storm-Druid for quick ingestion and indexing; specialized computation and processing for quicker response: sketches, feature-sequence-based overlaps, custom indexing
    • 18. Problem Deepdive: Instant Overlap
    • 19. Users Car commuters Soccer Fans Vegans
    • 20. Users Car Commuters Soccer Fans Vegans
    • 21. Overlap is non-additive: computing it requires access to raw (user-level) data, which means billions of events a day and TBs of data a day. 1-1 vs. 1-n vs. few-n: Between car commuter and vegan, what is the overlap? For car commuter, which are the top overlap groups? For vegan and car commuter together, what are the top overlap groups?
    • 22. Restating the motivation: given two sets of identifiers, how can we compute exact overlaps in close to real time (< 1 min)? Overlap is like an AND operation or a set intersection.
    • 23. Existing approaches. (1) Use exact compute paradigms: do joins for intersections, which lead to exact results; Hive, Pig, and MR can all support efficient joins; exact but not real time. (2) Use sketches: approximate algorithms (HLL, KMV) that trade accuracy against size and performance; approximate and in need of heavy performance tuning; close to real time but not exact.
    • 24. Using Feature Sequences (1/4). Feature sequence encoding: encode each user's features as a sequence, e.g. {Ram} - {car commuter, soccer fan, ...}; {Tom} - {soccer fan, vegan, ...}; {Sam} - {car commuter, soccer fan, vegan, ...}; ...
    • 25. Using Feature Sequences (2/4). Eliminate the user from the encoding and keep one count per distinct sequence: {car commuter, soccer fan, vegan, ...} - count c1; {soccer fan, vegan, ...} - count c2; {car commuter, vegan, ...} - count c3. Counts become additive now.
    • 26. Using Feature Sequences (3/4). Store row qualifications in a bitmap: car commuter - Row1, Row3: 1010000000; vegan - Row1, Row2, Row3: 1110000000. Load the bitmap into Druid using a custom indexer (in-memory or memory-mapped).
    • 27. Using Feature Sequences (4/4). Data structures: {feature_sequence} -> count; feature -> row-qualification bitmaps. AND is now an “AND” on bitmaps: supported within Druid and very efficient. Works alongside topN and groupBy queries.
    • 28. Comparison with the existing algorithm. Bulk Overlap on grid (1-n): 19 hours on grid; few-n calls for a re-process; 1-1 (< 1 s). Instant Overlap: < 60 s (pre-processing 3-4 hours); supports “exact” AND; flexible (few-n, 1-n); 1-1 (< 1 s).
    • 29. Summary: Yahoo’s Advertising Data Warehouse. Petabyte scale; normalized view across many systems; analytics and optimizations with specialized algorithms; data restatement (batch and real time); human time.
    • 30. Thank you. @supreeth_, @_skgupta. We are hiring! Reach out to us at bigdata@yahoo-inc.com.
    • 31. Logical View - Scope. Layers: Transformations/Aggs, Data Ingestion, Persistence, Runtime Compute, Caching, API, Optional Middleware, Business API, UI, Data Collection. Groupings: Data Ingestion or Collection; Transformations; Data Pipelines; Data Warehouse/Analytics and Optimizations; Reporting Application/UI. Legend: Impacts; Out of scope.
    • 32. Dimension flexibility: many dimensions; adding new dimensions; time zones; time grain
    • 33. Normalized view across systems: Paid Search; Display; Native; programmatic buying and selling; ad targeting
    • 34. Hardware configs: high-memory boxes; SSD preferred; savings due to better compression
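
The feature-sequence technique from slides 24 to 27 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the presenters' implementation: plain Python integers stand in for Druid's compressed bitmaps, the custom indexer and Druid itself are left out, row i is stored in bit i (least significant first, unlike the left-to-right strings on slide 26), and the names `build_index`, `overlap`, `top_overlaps` and the extra user "Ann" are hypothetical.

```python
from collections import Counter

# Hypothetical user -> feature data, modeled on the Ram/Tom/Sam slide example.
users = {
    "Ram": {"car commuter", "soccer fan"},
    "Tom": {"soccer fan", "vegan"},
    "Sam": {"car commuter", "soccer fan", "vegan"},
    "Ann": {"soccer fan", "vegan"},  # assumed extra user, shares Tom's sequence
}


def build_index(users):
    """Collapse users into feature-sequence rows plus per-feature row bitmaps."""
    # Drop the user id: keep one count per distinct feature sequence.
    seq_counts = Counter(frozenset(features) for features in users.values())
    # Sort rows so that row numbering is deterministic.
    rows = sorted(seq_counts.items(), key=lambda kv: sorted(kv[0]))
    counts = [count for _, count in rows]
    # One bitmap per feature: bit i is set when row i contains that feature.
    bitmaps = {}
    for i, (sequence, _) in enumerate(rows):
        for feature in sequence:
            bitmaps[feature] = bitmaps.get(feature, 0) | (1 << i)
    return counts, bitmaps


def overlap(counts, bitmaps, *features):
    """Exact number of users having all features: AND bitmaps, sum row counts."""
    mask = (1 << len(counts)) - 1  # start with every row qualifying
    for feature in features:
        mask &= bitmaps.get(feature, 0)
    return sum(count for i, count in enumerate(counts) if (mask >> i) & 1)


def top_overlaps(counts, bitmaps, feature, n=3):
    """1-n query: the n features with the largest overlap with `feature`."""
    ranked = sorted(
        ((overlap(counts, bitmaps, feature, other), other)
         for other in bitmaps if other != feature),
        reverse=True,
    )
    return [(other, size) for size, other in ranked[:n]]


if __name__ == "__main__":
    counts, bitmaps = build_index(users)
    print(overlap(counts, bitmaps, "car commuter", "vegan"))
    print(top_overlaps(counts, bitmaps, "vegan"))
```

Because each row carries a user count, the overlap is a sum over the rows that survive the bitmap AND, so the result stays exact without touching user-level data at query time.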