• Save
Interactive Analytics in Human Time
Upcoming SlideShare
Loading in...5
×
 

Interactive Analytics in Human Time

on

  • 708 views

 

Statistics

Views

Total Views
708
Views on SlideShare
451
Embed Views
257

Actions

Likes
3
Downloads
0
Comments
0

2 Embeds 257

http://www.scoop.it 256
http://www.google.com 1

Accessibility

Categories

Upload Details

Uploaded via as Microsoft PowerPoint

Usage Rights

© All Rights Reserved

Report content

Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
  • Full Name Full Name Comment goes here.
    Are you sure you want to
    Your message goes here
    Processing…
Post Comment
Edit your comment
  • Logical view of a typical data driven application architecture <br /> Sox compliance
  • -quick cache store for querying all metrics in a single fetch, to support one-page load UI architecture <br /> - Hive for scheduled job and for adhoc long range generic queries which are not supported on the interactive interface <br /> <br />
  • Bitmap indexed, columnar, can operate on compressed bitmaps, distributed <br /> -

Interactive Analytics in Human Time Interactive Analytics in Human Time Presentation Transcript

  • Interactive Analytics in Human Time S u p r e e t h R a o , S u n i l G u p t a ⎪ J u n e 4 , 2 0 1 4 2 0 1 4 H a d o o p S u m m i t , S a n J o s e , C a l i f o r n i a
  • Interactive – How we see it? 2 Yahoo Confidential & Proprietary 60B events, 3.5TB of compressed data Response 400ms Serve an ad and get insights < 2s
  • Agenda:
  • Motivation Approach Problem Deepdive- Instant Overlap Summary Questions
  • Motivation:
  • Lots of data Analytics Data restatement - batch and real time Human time
  • Lots of data ~30B advertising events/day ~10s of TB of compressed data/day Minutes to Year Grain Multi-quarter data retention Data Aging
  • Analytics Reporting Metrics Attribution Multi-level hierarchical computation Bidding/Targeting optimization Non-additive computation
  • Data Restatement Real time Batch Producer Consumer quick path, lower amount of checks or reconciliation, typically no lookups high latency path, checks and reconciliations, can have lookbacks and lookups
  • Human Time <1s ( 99 percentile) Default time grain ( < 300 ms) Instant overlap ( < 60s) Data ingested, insights available ( < 2s)
  • Lots of data Analytics Data restatement - batch and real time Human time
  • Approach:
  • Data Ingestion or Collection Transformations Data Ingestion Persistence Runtime Compute Caching API Optional Middleware Business API UI Data Collection Data Pipelines Data Warehouse/ Analytics and Optimizations Reporting Application/UI Logical View - Scope Transformations/Aggs Data Ingestion Persistence Runtime Compute Caching API Optional Middleware Business API UI Data Collection Impacts Out of scope
  • Transformations/Aggs Data Ingestion Persistence Runtime Compute Caching API Optional Middleware Business API UI Data Collection Batch processing DAG, Real-time topology, SOX, Traffic protection, Late processing, Retention, Completeness Monitoring, PII cleansing/masking Compatible with HDFS, Performance (Indexed, Columnar, Compression, Serialization, Flexibility, Concurrency, Grain of data stored) Distributed/Stand-alone, Caching objects vs caching results Access to data with group by, order by etc..; SQL or SQL like Translate JSON to SQL(optional) Logical View - Characteristics Impacts Out of scope
  • Transformations/Aggs Data Ingestion Persistence Runtime Compute Caching API Optional Middleware Business API UI Data Collection Hadoop MR/PIG /Oozie(Lotus)/Storm(Trident) Druid, Shark, Hive, Oracle RAC, Mysql, Hbase, Impala memcached_y, Redis JSON-REST API ; JDBC; ODBC Data Ingestion Persistence Runtime Compute Caching API Optional Middleware Business API UI Data Collection Logical View - Choices Impacts Out of scope
  • How we do what we do? Components of Advertising Data Warehouse Druid JDBC/ODBC Data Warehouse-Persistence Hive Metrics Store JSON-API Persistence and run-time compute Computation and Ingestion Quick cache ( using a database for now) Upstream: API layer, MSTR, Adhoc access, Identity Service, Ad-Serving manifests Data Producers; Serving, Scoring, Booking, 3rd/1st Party Data Real time and batch compute engine (Hadoop/Storm ) Data filtering/transformations: Transformations, format conversions Custom Algorithms : computing recursive uniques, indexing
  • Human time, How? Druid for interactive queries Storm-Druid for quick ingestion and index Specialized computation and processing for quicker response › Sketches › Feature sequence based overlaps › Custom indexing
  • Problem Deepdive: Instant Overlap
  • Users Car commuters Soccer Fans Vegans
  • Users Car Commuters Soccer Fans Vegans
  • Overlap Non-additive › Require access to raw (user level data) to compute non-additive • Billions of events a day • TBs of data a day  1-1 vs 1-n vs few-n › Between car commuter and vegan what is the overlap › For Car commuter which are the top overlap groups › For Vegan, Car commuters what are the top overlap groups
  • Re-stating motivation Given two sets having identifiers, how can we do exact overlaps in close to real time? ( < 1 min). Overlap is like a AND operation or a set
  • Existing Approaches ● Use exact compute paradigms o Do joins for intersections which will lead to exact results  Hive, PIG, MR can all support efficient joins  Exact but not real time ● Use sketches o Approximate algorithms  HLL, KMV, accuracy vs size, performance  Approx, needs high perf tuning  close to real time but not exact
  • Using Feature Sequences – 1/4 Feature sequence encoding o Encode the sequence  {Ram} - { car commuter, soccer fan,...}  {Tom} - { soccer fan, vegan...}  {Sam} - { car commuter, soccer fan, vegan...}  ….
  • Using Feature Sequences – 2/4 Eliminate the user on encoded bitmaps  {car commuter, soccer fan, vegan...}- count -c1- #  {soccer fan, vegan...} - count - c2 - #  {car commuter, vegan...} - count - c2 - # Counts become additive now
  • Using Feature Sequences – 3/4 ● Store row qualifications into a bitmap o Car commuter- Row1, Row3  1010000000 o Vegan - Row1, Row2, Row3  1110000000 ● Load the bitmap into Druid using a custom indexer o in-memory or memory mapped
  • Using Feature Sequences – 4/4  Data Structures › {feature_sequence}->count › Feature->row qualification bitmaps  AND is now an “AND” on bitmaps › supported within Druid › Very efficient  Works alongside topN and groupBys
  • Comparison with existing algorithm ● 1-n – Bulk Overlap on grid o 19 hours on grid o Few-n calls for a re-process o 1-1 ( <1s) ● Instant Overlap o < 60s ( pre-processing 3-4 hours) o Supports “exact” AND o Flexible ( few-n, 1-n) o 1-1 ( < 1s)
  • Summary ● Yahoo’s Advertising Data Warehouse o Peta Byte Scale o Normalized view across many systems o Analytics and optimizations with specialized algorithms o Data restatement - batch and realtime o Human time
  • Thank You @supreeth_ @_skgupta We are hiring! Stop by Kiosk P9 or reach out to us at bigdata@yahoo-inc.com.
  • Data Ingestion or Collection Transformations Data Ingestion Persistence Runtime Compute Caching API Optional Middleware Business API UI Data Collection Data Pipelines Data Warehouse/ Analytics and Optimizations Reporting Application/UI Logical View - Scope Transformations/Aggs Data Ingestion Persistence Runtime Compute Caching API Optional Middleware Business API UI Data Collection Impacts Out of scope
  • Dimension Flexibility Many dimensions Adding new dimensions Time zones Time grain
  • Normalized view across systems PaidSearch Display Native Programmatic buying and selling Ad-targeting
  • Hardware Configs ●High-memory boxes ●SSD preferred ●Savings due to better compression