Real-time “OLAP” for Big Data (+ use cases) - bigdata.ro 2013

Published in: Technology

  1. Real-time “OLAP” for Big Data (+ use cases)
     Cosmin Lehene | Adobe
     #bigdataro - 30 January 2013
     © 2012 Adobe Systems Incorporated. All Rights Reserved. Adobe Confidential.
  2. What we needed … and built
     • OLAP Semantics
     • Low Latency Ingestion
     • High Throughput
     • Real-time Query API
  3. “Physical” Building Blocks
  4. Logical Building Blocks
     • Dimensions, Metrics
     • Aggregations
     • Roll-up, drill-down, slicing and dicing, sorting
  5. OLAP 101 – Queries example
     Date        Country  City     OS       Browser  Sale
     2012-05-21  USA      NY       Windows  FF        0.0
     2012-05-21  USA      NY       Windows  FF       10.0
     2012-05-22  USA      SF       OSX      Chrome   25.0
     2012-05-22  Canada   Ontario  Linux    Chrome    0.0
     2012-05-23  USA      Chicago  OSX      Safari   15.0
     Summary per column:
     • Date: 5 visits, 3 days
     • Country: 2 countries (USA: 4, Canada: 1)
     • City: 4 cities (NY: 2, SF: 1)
     • OS: 3 OS-es (Win: 2, OSX: 2)
     • Browser: 3 browsers (FF: 2, Chrome: 2)
     • Sale: 50.0 total, 3 sales
  6. OLAP 101 – Queries example
     • Rolling up to country level:
         SELECT COUNT(visits), SUM(sales)
         GROUP BY country
         Country  visits  sales
         USA      4       $50
         Canada   1       0
     • “Slice” by browser:
         SELECT COUNT(visits), SUM(sales)
         GROUP BY country
         HAVING browser = “FF”
         Country  visits  sales
         USA      2       $10
         Canada   0       0
     • Top browsers by sales:
         SELECT SUM(sales), COUNT(visits)
         GROUP BY browser
         ORDER BY sales
         Browser  sales  visits
         Chrome   $25    2
         Safari   $15    1
         FF       $10    2
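The three queries on this slide can be reproduced against the five sample rows of the "OLAP 101" table. The sketch below is only an illustration and not SaasBase code: it uses Python's built-in `sqlite3` module, and the table name `visits`, the column names, and the function names are assumptions. The slide's queries are shorthand pseudo-SQL; a plain `WHERE browser = 'FF'` filter would drop the Canada row entirely, so the slice is written here as a conditional aggregate to keep Canada with zeros, as the slide shows.

```python
import sqlite3

# The five sample visits from the "OLAP 101" slide.
ROWS = [
    ("2012-05-21", "USA", "NY", "Windows", "FF", 0.0),
    ("2012-05-21", "USA", "NY", "Windows", "FF", 10.0),
    ("2012-05-22", "USA", "SF", "OSX", "Chrome", 25.0),
    ("2012-05-22", "Canada", "Ontario", "Linux", "Chrome", 0.0),
    ("2012-05-23", "USA", "Chicago", "OSX", "Safari", 15.0),
]


def load_visits():
    """Create an in-memory table holding the sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE visits "
        "(date TEXT, country TEXT, city TEXT, os TEXT, browser TEXT, sale REAL)"
    )
    conn.executemany("INSERT INTO visits VALUES (?, ?, ?, ?, ?, ?)", ROWS)
    return conn


def rollup_by_country(conn):
    """Roll up to country level: visits and sales per country."""
    sql = (
        "SELECT country, COUNT(*) AS n_visits, SUM(sale) AS sales "
        "FROM visits GROUP BY country ORDER BY n_visits DESC"
    )
    return conn.execute(sql).fetchall()


def slice_by_browser(conn, browser):
    """Slice by one browser, keeping countries that have no matching visits."""
    sql = (
        "SELECT country, "
        "SUM(CASE WHEN browser = ? THEN 1 ELSE 0 END) AS n_visits, "
        "SUM(CASE WHEN browser = ? THEN sale ELSE 0.0 END) AS sales "
        "FROM visits GROUP BY country ORDER BY n_visits DESC"
    )
    return conn.execute(sql, (browser, browser)).fetchall()


def top_browsers_by_sales(conn):
    """Rank browsers by total sales, highest first."""
    sql = (
        "SELECT browser, SUM(sale) AS sales, COUNT(*) AS n_visits "
        "FROM visits GROUP BY browser ORDER BY sales DESC"
    )
    return conn.execute(sql).fetchall()
```

Calling `rollup_by_country(load_visits())` yields one tuple per country in the same order as the slide's first result table.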
  7. OLAP – Runtime Aggregation vs. Pre-aggregation
     • Aggregate at runtime
       • Most flexible
       • Fast – scatter gather
       • Space efficient
       • But
         • I/O, CPU intensive
         • slow for larger data
         • low throughput
     • Pre-aggregate
       • Fast
       • Efficient – O(1)
       • High throughput
       • But
         • More effort to process (latency)
         • Combinatorial explosion (space)
         • No flexibility
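The trade-off on this slide can be made concrete with a toy example. The sketch below is an assumption-laden illustration, not the SaasBase implementation: `aggregate_at_runtime` scans every fact for each query, while `pre_aggregate` materializes one cell for every combination of dimension values so that `lookup` is a single dictionary access. All function names and the in-memory dictionary layout are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

DIMENSIONS = ("country", "os", "browser")

# Sample facts taken from the "OLAP 101" table (date and city omitted).
FACTS = [
    {"country": "USA", "os": "Windows", "browser": "FF", "sale": 0.0},
    {"country": "USA", "os": "Windows", "browser": "FF", "sale": 10.0},
    {"country": "USA", "os": "OSX", "browser": "Chrome", "sale": 25.0},
    {"country": "Canada", "os": "Linux", "browser": "Chrome", "sale": 0.0},
    {"country": "USA", "os": "OSX", "browser": "Safari", "sale": 15.0},
]


def aggregate_at_runtime(facts, filters):
    """Scan all facts per query: flexible, but the cost grows with the data."""
    visits, sales = 0, 0.0
    for fact in facts:
        if all(fact[dim] == value for dim, value in filters.items()):
            visits += 1
            sales += fact["sale"]
    return visits, sales


def pre_aggregate(facts, dimensions=DIMENSIONS):
    """Materialize a cell for every subset of dimensions (2**n group-bys)."""
    cube = defaultdict(lambda: [0, 0.0])
    for fact in facts:
        for size in range(len(dimensions) + 1):
            for dims in combinations(dimensions, size):
                key = tuple((dim, fact[dim]) for dim in dims)
                cube[key][0] += 1
                cube[key][1] += fact["sale"]
    return cube


def lookup(cube, filters, dimensions=DIMENSIONS):
    """Answer a query with one dictionary access instead of a scan."""
    key = tuple((dim, filters[dim]) for dim in dimensions if dim in filters)
    visits, sales = cube.get(key, (0, 0.0))
    return visits, sales
```

Both paths return the same answer for a given filter, but the cube holds more cells than there are input facts even for this tiny sample, which is the space cost the slide calls combinatorial explosion.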
  8. SaasBase Map
  9. SaasBase Domain Model Mapping
  10. SaasBase - Domain Model Mapping
  11. SaasBase - Ingestion, Processing, Indexing, Querying
  12. SaasBase - Ingestion, Processing, Indexing, Querying
  13. Ingestion
  14. Ingestion (ETL) throughput vs. latency
     • Historical data (large batches)
       • Optimize for throughput
     • Increments (latest data, smaller)
       • Optimize for latency
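One common way to expose the throughput/latency trade-off as a tunable is a buffer that flushes either when it is full or when its oldest record has waited too long. The sketch below illustrates that general idea only; the slides do not describe how SaasBase batches data, and the class `MicroBatcher`, its parameters, and the `sink` callback are all hypothetical.

```python
import time


class MicroBatcher:
    """Buffer records and flush on size (throughput) or age (latency)."""

    def __init__(self, sink, max_items, max_age_seconds, clock=time.monotonic):
        self._sink = sink            # callable that receives each flushed batch
        self._max_items = max_items
        self._max_age = max_age_seconds
        self._clock = clock          # injectable so tests can control time
        self._buffer = []
        self._oldest = None          # arrival time of the oldest buffered record

    def add(self, record):
        """Buffer one record, flushing if a limit has been reached."""
        if not self._buffer:
            self._oldest = self._clock()
        self._buffer.append(record)
        self.poll()

    def poll(self):
        """Flush if the batch is full or its oldest record is too old."""
        if not self._buffer:
            return
        full = len(self._buffer) >= self._max_items
        stale = self._clock() - self._oldest >= self._max_age
        if full or stale:
            self.flush()

    def flush(self):
        """Hand the buffered records to the sink as one batch."""
        if self._buffer:
            batch, self._buffer = self._buffer, []
            self._sink(batch)
```

Under these assumptions, a historical load would use a large `max_items` and a generous `max_age_seconds`, while incremental ingestion would use small values for both so that fresh data reaches the sink quickly.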
  15. Processing
  16. Processing
     • Processing involves reading the input (files, tables, events), pre-aggregating it (reducing cardinality) and generating cubes that can be queried in real time
     • “Super Processor” code running in Storm, Map-Reduce, HBase
  17. Processing for OLAP semantics
     • GROUP BY (process, query)
     • COUNT, SUM, AVG, etc. (process, query)
     • SORT (process, query)
     • HAVING (mostly query, can define pre-process constraints)
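The split between process-time and query-time work can be sketched as follows. This is a minimal illustration under stated assumptions, not the "Super Processor" itself: `preprocess` groups raw events into cube cells that keep only additive partials (count and sum), and `query` rolls those cells up further, derives AVG from the partials, applies an optional HAVING predicate, and sorts. The function names, the tuple-keyed dictionary, and the choice of cube dimensions are hypothetical.

```python
from collections import defaultdict

CUBE_DIMS = ("date", "country", "browser")

# Raw events from the "OLAP 101" table (city and OS omitted).
EVENTS = [
    {"date": "2012-05-21", "country": "USA", "browser": "FF", "sale": 0.0},
    {"date": "2012-05-21", "country": "USA", "browser": "FF", "sale": 10.0},
    {"date": "2012-05-22", "country": "USA", "browser": "Chrome", "sale": 25.0},
    {"date": "2012-05-22", "country": "Canada", "browser": "Chrome", "sale": 0.0},
    {"date": "2012-05-23", "country": "USA", "browser": "Safari", "sale": 15.0},
]


def preprocess(events, dims=CUBE_DIMS):
    """Process time: GROUP BY dims, storing (count, sum) per cube cell."""
    cells = defaultdict(lambda: [0, 0.0])
    for event in events:
        key = tuple(event[dim] for dim in dims)
        cells[key][0] += 1
        cells[key][1] += event["sale"]
    return {key: tuple(value) for key, value in cells.items()}


def query(cube, cube_dims, group_by, having=None):
    """Query time: roll up cells, derive AVG, apply HAVING, sort by sum."""
    index = cube_dims.index(group_by)
    rolled = defaultdict(lambda: [0, 0.0])
    for key, (count, total) in cube.items():
        rolled[key[index]][0] += count
        rolled[key[index]][1] += total
    rows = [
        # AVG is not additive, so it is derived from the additive partials.
        {"key": key, "count": count, "sum": total, "avg": total / count}
        for key, (count, total) in rolled.items()
    ]
    if having is not None:
        rows = [row for row in rows if having(row)]
    return sorted(rows, key=lambda row: row["sum"], reverse=True)
```

For example, `query(preprocess(EVENTS), CUBE_DIMS, "browser")` reads four pre-aggregated cells instead of five raw events and returns browsers ranked by sales, matching the "Top browsers by sales" result shown earlier in the deck.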
  18. SaasBase vs. SQL Views Comparison
  19. Query Engine
     • Always reads indexed, compact data
     • Query parsing
     • Scan strategy
       • Single vs. multiple scans
       • Start/stop rows (prefixes, index positions, etc.)
       • Index selection (volatile indexes with incremental processing)
     • Deserialization
     • Post-aggregation, sorting, fuzzy-sorting etc.
     • Paging
     • Custom dimension/metric class loading
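The "start/stop rows" and "paging" items can be illustrated with a range scan over lexicographically sorted keys, which is how a sorted key-value store such as HBase lets a prefix query touch only the matching rows. The sketch below uses a plain sorted Python list as a stand-in for the store; the row key layout (`dimension|value|date`), the sample values, and all function names are assumptions for illustration and do not describe the actual SaasBase index format.

```python
import bisect

# Hypothetical index rows, sorted by key: key -> (visits, sales).
INDEX = sorted([
    (b"browser|Chrome|2012-05-22", (2, 25.0)),
    (b"country|Canada|2012-05-22", (1, 0.0)),
    (b"country|USA|2012-05-21", (2, 10.0)),
    (b"country|USA|2012-05-22", (1, 25.0)),
    (b"country|USA|2012-05-23", (1, 15.0)),
])


def prefix_stop(prefix):
    """Return the smallest key that sorts after every key with this prefix."""
    trimmed = prefix.rstrip(b"\xff")
    if not trimmed:
        return None  # no upper bound: scan to the end of the table
    return trimmed[:-1] + bytes([trimmed[-1] + 1])


def scan(rows, start, stop=None, limit=None):
    """Return (key, value) pairs with start <= key < stop, in key order."""
    keys = [key for key, _ in rows]
    lo = bisect.bisect_left(keys, start)
    hi = len(keys) if stop is None else bisect.bisect_left(keys, stop)
    page = rows[lo:hi]
    return page if limit is None else page[:limit]


def scan_prefix_page(rows, prefix, page_size, after=None):
    """Fetch one page of a prefix scan.

    `after` is the last key of the previous page; appending a zero byte
    gives the smallest key strictly greater than it.
    """
    start = prefix if after is None else after + b"\x00"
    return scan(rows, start, prefix_stop(prefix), page_size)
```

A client would call `scan_prefix_page(INDEX, b"country|USA|", 2)` for the first page and pass the last returned key as `after` to fetch the next one.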
  20. Adobe Business Catalyst
     • Online business presence: e-commerce, marketing, web analytics etc.
     • Use case: Web Analytics (visitors, channels, content, e-commerce, campaigns, etc.)
  21. BC - Workflow
  22. Adobe Business Catalyst - Stats
     • 3 active datacenters
     • Raw data ~6TB (from ~1TB 18 months ago)
     • Visits table: ~1TB each (compressed)
     • OLAP cubes (stats): 49GB – 64GB (compressed)
     • ~30 minutes latency (from actual pageview/sale to chart in UI)
     • 10s – 100s of milliseconds latency for queries
     • ~3000/s max concurrent OLAP queries (actual traffic is much lower)
  23. Adobe Pass for TV Everywhere
     • Authentication & Authorization
     • Single sign-on to Programmer content (e.g. Turner, NBC, Hulu, MTV, etc.) with Cable operator credentials (e.g. Comcast, Dish, etc.)
  24. Adobe Pass – Use Case
     • Analytics use case: Operational metrics (users, devices, latencies, etc.)
     • Real-time ingestion in HBase
     • High Frequency Map Reduce jobs (every 2 minutes)
  25. Adobe Pass - Stats (London Olympics 2012)
     • 67M streams ~ 5.3M hours
     • 1.5M concurrent streams
     • > 7M unique users
     • 1 Technical & Engineering Emmy Award ;)
  26. Adobe Primetime – Real-time Video Analytics
     • Unified video platform (acquisition, transcoding, broadcast, ads, analytics)
  27. Adobe Primetime – Use Case
     • Use Cases:
       • Audience metrics – minutes latency ok
       • Ads metrics – seconds to minutes ok
       • Streaming QoS metrics – seconds must
     • Requirements:
       • Massive throughput (millions of streams, multiple heartbeats every 10 seconds)
       • Low latency (end-to-end)
  29. Conclusions
     • OLAP semantics on a simple data model
     • Data as first class citizen
     • Domain Specific “Language” for Dimensions, Metrics, Aggregations
     • Framework for vertical analytics systems
     • Tunable performance, resource allocation
  30. Thank you!
     Cosmin Lehene @clehene
     http://hstack.org
  31. Related
     http://www.hbasecon.com/sessions/low-latency-olap-with-hbase/
     http://www.slideshare.net/clehene/low-latency-olap-with-hbase-hbasecon-2012