Low Latency “OLAP” with HBase - HBaseCon 2012

  • How many HBase users?
  • Data as first class citizen
  • Check contrast on projector
  • Just like speed vs. space in general CS/algorithms. Queries always hit indexes.
  • Dimensions – read/transform/serialize/deserialize data attributes. Metrics – read/transform/aggregate/serialize. Constraints: ingestion filtering. Report: instrument dimension groups + metrics with aggregations, sorting.
  • QUERY ENGINE -> INDEX (always realtime)
  • Initial import/process and NEW reports (not covered) on historical data
  • 18K regions, upgrade to 0.92
  • Diagram: HARD TO DIGEST (TOO MUCH INFO, TOO CONDENSED)
  • Process = aggregate, generate indexes (natural). Query = uses indexes, can do extra aggregation.
  • LEFT: report definition, NOT a QUERY. LIKE A VIEW - CREATED - THEN QUERIED
  • Inconsistent
  • Rowkey = dimensions group -> metrics (right)
  • GO BACK to EXPLAIN
  • >100K/sec/thread. REALTIME
  • Data analysts work with familiar concepts
  • Low Latency “OLAP” with HBase - HBaseCon 2012

    1. Low Latency “OLAP” with HBase
        Cosmin Lehene | Adobe
        © 2012 Adobe Systems Incorporated. All Rights Reserved. Adobe Confidential.
    2. What we needed … and built
        • OLAP Semantics
        • Low Latency Ingestion
        • High Throughput
        • Real-time Query API
        • Not hardcoded to web analytics or x-, y-, z- analytics, but extensible
    3. Building Blocks
        • Dimensions, Metrics
        • Aggregations
        • Roll-up, drill-down, slicing and dicing, sorting
    4. OLAP 101 – Queries example

        Date        Country  City     OS       Browser  Sale
        2012-05-21  USA      NY       Windows  FF        0.0
        2012-05-21  USA      NY       Windows  FF       10.0
        2012-05-22  USA      SF       OSX      Chrome   25.0
        2012-05-22  Canada   Ontario  Linux    Chrome    0.0
        2012-05-23  USA      Chicago  OSX      Safari   15.0

        Summary:
        • 5 visits, 3 days
        • 2 countries (USA: 4, Canada: 1)
        • 4 cities (NY: 2, SF: 1)
        • 3 OS-es (Win: 2, OSX: 2)
        • 3 browsers (FF: 2, Chrome: 2)
        • 50.0 in sales, 3 sales
    5. OLAP 101 – Queries example
        • Rolling up to country level:
            SELECT COUNT(visits), SUM(sales)
            GROUP BY country

            Country  visits  sales
            USA      4       $50
            Canada   1       0
        • “Slicing” by browser:
            SELECT COUNT(visits), SUM(sales)
            GROUP BY country
            HAVING browser = “FF”

            Country  visits  sales
            USA      2       $10
            Canada   0       0
        • Top browsers by sales:
            SELECT SUM(sales), COUNT(visits)
            GROUP BY browser
            ORDER BY sales

            Browser  sales  visits
            Chrome   $25    2
            Safari   $15    1
            FF       $10    2
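The three queries on this slide can be reproduced over the five sample visits with a few lines of Python. This is only an illustrative sketch: the `aggregate` helper and the tuple layout are assumptions made here, not part of SaasBase.

```python
from collections import defaultdict

# The five sample visits from the slide: (date, country, city, os, browser, sale)
FIELDS = ("date", "country", "city", "os", "browser", "sale")
VISITS = [
    ("2012-05-21", "USA", "NY", "Windows", "FF", 0.0),
    ("2012-05-21", "USA", "NY", "Windows", "FF", 10.0),
    ("2012-05-22", "USA", "SF", "OSX", "Chrome", 25.0),
    ("2012-05-22", "Canada", "Ontario", "Linux", "Chrome", 0.0),
    ("2012-05-23", "USA", "Chicago", "OSX", "Safari", 15.0),
]

def aggregate(rows, group_by, where=None):
    """Return {group value: (visit count, sales sum)} for one dimension."""
    idx = FIELDS.index(group_by)
    out = defaultdict(lambda: [0, 0.0])
    for row in rows:
        if where and not where(dict(zip(FIELDS, row))):
            continue  # row filtered out by the slice predicate
        out[row[idx]][0] += 1
        out[row[idx]][1] += row[-1]
    return {key: tuple(value) for key, value in out.items()}

# Roll-up to country level
rollup = aggregate(VISITS, "country")
# Slice: only visits made with the FF browser
ff_slice = aggregate(VISITS, "country", where=lambda r: r["browser"] == "FF")
# Top browsers by sales, highest first
top_browsers = sorted(aggregate(VISITS, "browser").items(),
                      key=lambda kv: kv[1][1], reverse=True)
```

Unlike the slide's table, the slice result simply omits Canada rather than listing it with zeros.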
    6. OLAP – Runtime Aggregation vs. Pre-aggregation
        • Aggregate at runtime
            • Most flexible
            • Fast – scatter gather
            • Space efficient
            • But
                • I/O, CPU intensive
                • Slow for larger data
                • Low throughput
        • Pre-aggregate
            • Fast
            • Efficient – O(1)
            • High throughput
            • But
                • More effort to process (latency)
                • Combinatorial explosion (space)
                • No flexibility
    7. Pre-aggregation
        • Data needs to be summarized
            • Can’t visualize 1B data points (no, not even with Retina display)
            • Difficult to comprehend correlations among more than 3 dimensions
        • Not all dimension groups are relevant
            • Index on a needed basis (view selection problem)
        • Runtime aggregation == TeraSort for every query?
        • Pre-aggregate to reduce cardinality
    8. SaasBase
        • We tune both
            • pre-aggregation level vs. runtime post-aggregation
            • (ingestion speed + space) vs. (query speed)
        • Think materialized views from RDBMS
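A rough way to picture this tuning knob: the finer the pre-aggregation level, the more cells are stored, and the more work is left for the query to post-aggregate. The sketch below is an assumption-laden toy, not SaasBase code; `pre_aggregate`, `query` and the sample events are hypothetical.

```python
from collections import Counter

# Hypothetical raw events: (date, country, city), one tuple per visit.
EVENTS = [
    ("2012-05-21", "USA", "NY"),
    ("2012-05-21", "USA", "NY"),
    ("2012-05-21", "USA", "SF"),
    ("2012-05-22", "Canada", "Ontario"),
    ("2012-05-23", "USA", "Chicago"),
]

def pre_aggregate(events, level):
    """Processing time: count visits per key truncated to `level` dimensions."""
    return Counter(event[:level] for event in events)

def query(cube, prefix):
    """Query time: post-aggregate every stored cell whose key starts with `prefix`."""
    return sum(n for key, n in cube.items() if key[:len(prefix)] == prefix)

fine = pre_aggregate(EVENTS, level=3)    # (date, country, city): more cells, more flexible
coarse = pre_aggregate(EVENTS, level=2)  # (date, country): fewer cells, less to scan

# Both levels answer the country-level question; only `fine` can also answer per city.
usa_day_fine = query(fine, ("2012-05-21", "USA"))      # sums 2 cells
usa_day_coarse = query(coarse, ("2012-05-21", "USA"))  # reads 1 cell
ny_day = query(fine, ("2012-05-21", "USA", "NY"))
```

Here the coarse level stores 3 cells against 4 for the fine one; on real data with many cities per country the gap is much larger.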
    9. SaasBase Domain Model Mapping
    10. SaasBase - Domain Model Mapping
    11. SaasBase - Ingestion, Processing, Indexing, Querying
    12. SaasBase - Ingestion, Processing, Indexing, Querying
    13. Ingestion
    14. Ingestion throughput vs. latency
        • Historical data (large batches)
            • Optimize for throughput
        • Increments (latest data, smaller)
            • Optimize for latency
    15. Large, granular input strategies
        • Slow listing in HDFS
            • Archive processed files
        • Filtering input
            • FileDateFilter (log name patterns: log-YYYY-MM-dd-HH.log)
            • TableInputFormat start/stop row
            • File Index in HBase (track processed/new files)
        • Map tasks overhead - stitching input splits
            • 400K files => 400K map tasks => overhead, slow reduce copy
            • CombineFileInputFormat – 2GB-splits => 500 splits for 1TB
            • FixedMappersTableInputFormat (e.g. 5-region splits)
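Two of the ideas on this slide, filtering input files by the date embedded in their name and stitching many small files into a few large splits, can be sketched in plain Python. The functions below are hypothetical stand-ins written for illustration; they are not the `FileDateFilter` or `CombineFileInputFormat` implementations, and the greedy packing strategy is an assumption.

```python
import re
from datetime import datetime

# File name pattern from the slide: log-YYYY-MM-dd-HH.log
LOG_NAME = re.compile(r"^log-(\d{4}-\d{2}-\d{2}-\d{2})\.log$")

def filter_by_date(names, start, end):
    """Keep file names whose embedded hour falls in [start, end)."""
    kept = []
    for name in names:
        match = LOG_NAME.match(name)
        if not match:
            continue  # not a log file we know how to date
        stamp = datetime.strptime(match.group(1), "%Y-%m-%d-%H")
        if start <= stamp < end:
            kept.append(name)
    return kept

def combine_into_splits(files, max_split_bytes):
    """Greedily pack (name, size) pairs into splits of at most max_split_bytes."""
    splits, current, current_size = [], [], 0
    for name, size in files:
        if current and current_size + size > max_split_bytes:
            splits.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        splits.append(current)
    return splits

names = ["log-2012-05-21-23.log", "log-2012-05-22-00.log",
         "log-2012-05-23-00.log", "README"]
new_files = filter_by_date(names, datetime(2012, 5, 22), datetime(2012, 5, 23))

# Eight 512-byte files packed into 2048-byte splits: one map task per split, not per file.
splits = combine_into_splits([(f"f{i}", 512) for i in range(8)], 2048)
```

With the slide's numbers, 1 TB of input at 2 GB per split comes to 512 splits, which the slide rounds to 500, instead of 400K map tasks.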
    16. Ingestion – Bulk Import
        • HFileOutputFormat (HFOF)
            • 100s of times faster than the HBase API
            • No need to recover from failed jobs
            • No unnecessary load on machines
        • No shuffle - global reduce order required!
            • e.g. first reduce key needs to be in the first region, last one in the last region
            • Watch for uneven partitions
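The global-order requirement means each reducer must receive exactly the key range of one region. A minimal sketch of that mapping, assuming a table pre-split on the hypothetical start keys below (the function name and sample keys are invented here):

```python
from bisect import bisect_right
from collections import Counter

def region_index(row_key: bytes, region_start_keys) -> int:
    """Index of the region (and therefore the reducer) that owns row_key.

    region_start_keys must be sorted and begin with b"" (start of the first region).
    """
    return bisect_right(region_start_keys, row_key) - 1

# Hypothetical table pre-split into three regions by date prefix.
REGION_STARTS = [b"", b"2011-01-01", b"2012-01-01"]

keys = [
    b"2010-12-04/USA",
    b"2011-06-30/USA",
    b"2012-05-22/USA",
    b"2012-05-22/Canada",
]

# Rows per reducer: a quick way to spot uneven partitions before running the job.
load = Counter(region_index(key, REGION_STARTS) for key in keys)
```

Because the partition index grows with the key, concatenating the reducers' sorted outputs yields a globally sorted result, which is what lets each output file fall inside a single region.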
    17. HFOF – FileSizeDatePartitioner
        • 1 partition (reduce) / day for initial import
        • Uneven reduce (partitions) due to data growth over time
            • Reduce k: 2010-12-04 = 500MB
            • Reduce n: 2012-05-22 = 5GB => slow and will result in a 5GB region
        • Balance reduce buckets based on input file sizes and the reduce key
            • Generate sub-partitions based on predefined size (e.g. 1GB)
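The balancing idea can be sketched as a planning step: give each day as many reducers as its input size requires at the target size. This is a guess at the shape of the approach, not the `FileSizeDatePartitioner` source; `plan_partitions` is a hypothetical helper.

```python
import math

MB = 1024 ** 2
GB = 1024 ** 3

def plan_partitions(bytes_per_day, target_bytes):
    """Map each day to a contiguous range of reducer indices, in date order."""
    plan, next_reducer = {}, 0
    for day in sorted(bytes_per_day):
        # At least one reducer per day, more when the day exceeds the target size.
        count = max(1, math.ceil(bytes_per_day[day] / target_bytes))
        plan[day] = range(next_reducer, next_reducer + count)
        next_reducer += count
    return plan

# Input sizes from the slide's example, with a 1GB target per reducer.
plan = plan_partitions({"2010-12-04": 500 * MB, "2012-05-22": 5 * GB}, target_bytes=GB)
```

Reducer ranges are handed out in date order so that the global key order needed for bulk import is preserved; how keys are divided among the sub-partitions of a single day must also be order-preserving, which this sketch leaves out.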
    18. Processing
    19. Processing
        • Processing involves reading the input (files, tables, events), pre-aggregating it (reducing cardinality) and generating tables that can be queried in real-time
        • 1 year: 1B events => 100B data points indexed
        • Query => scan 365 data points (e.g. daily page views)
        • Processing could be either MR or real-time (e.g. Storm)
    20. Processing for OLAP semantics
        • GROUP BY (process, query)
        • COUNT, SUM, AVG, etc. (process, query)
        • SORT (process, query)
        • HAVING (mostly query, can define pre-process constraints)
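For an aggregation to work at both stages (process, then query), its stored form has to be mergeable. COUNT and SUM merge by addition; AVG does not, so one common approach is to store the count and the sum and divide only at the end. The sketch below illustrates that general point and is not taken from SaasBase; the daily partials are made-up values.

```python
def merge(a, b):
    """Combine two partial aggregates stored as (count, total)."""
    return (a[0] + b[0], a[1] + b[1])

def average(agg):
    """Derive AVG from a (count, total) pair."""
    count, total = agg
    return total / count if count else 0.0

# Hypothetical per-day partials written at processing time: (visits, sales).
daily = {
    "2012-05-21": (2, 10.0),
    "2012-05-22": (2, 25.0),
    "2012-05-23": (1, 15.0),
}

# Query time: post-aggregate the stored partials, then divide once.
total = (0, 0.0)
for partial in daily.values():
    total = merge(total, partial)
avg_correct = average(total)

# Averaging the stored daily averages gives a different, wrong answer.
avg_of_avgs = sum(average(p) for p in daily.values()) / len(daily)
```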
    21. SaasBase vs. SQL Views Comparison
    22. reports.json entities definition
    23. Processing Performance
        • read, map, partition, combine, copy, sort, reduce, write
        • Read:
            • Scan.setCaching() (I/O ~ buffer)
            • Scan.setBatch() (avoid timeouts for abnormal input, e.g. 1M hits/visit)
            • Even region distribution across cluster (distributes CPU, I/O)
        • Map:
            • No unnecessary transformations: Bytes.toString(bytes) + Bytes.toBytes(string) (CPU)
            • Avoid GC: new X() (CPU, Memory)
            • Avoid system calls (context switching)
            • Stripping unnecessary data (I/O)
    24. Processing Performance
        • Hot (in memory) vs. Cold (on disk, on network) data
            • Minimize I/O from disk/network
        • Single shot MR job: SuperProcessor
            • Emit all groups from one map() call
        • Incremental processing
            • Data format YYYY-MM-DD prefixed rowkey (HH:mm for more granularity)
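The "emit all groups from one map() call" idea can be sketched as a single pass over the input that produces one date-prefixed row key per report. The report names, the key layout with "/" separators, and the helper functions are all assumptions made for this illustration.

```python
from collections import defaultdict

# Hypothetical report definitions: each report groups by a tuple of dimensions.
REPORTS = {
    "by_country": ("country",),
    "by_country_city": ("country", "city"),
    "by_browser": ("browser",),
}

def map_event(event):
    """Emit one ((report, rowkey), metrics) pair per report from a single event."""
    for report, dims in REPORTS.items():
        # Date-prefixed row key, so incremental runs only touch recent rows.
        rowkey = "/".join([event["date"]] + [event[d] for d in dims])
        yield (report, rowkey), (1, event["sale"])

def reduce_all(events):
    """Sum (visits, sales) per emitted key, as a combiner or reducer would."""
    totals = defaultdict(lambda: (0, 0.0))
    for event in events:
        for key, (visits, sale) in map_event(event):
            seen_visits, seen_sales = totals[key]
            totals[key] = (seen_visits + visits, seen_sales + sale)
    return dict(totals)

events = [
    {"date": "2012-05-21", "country": "USA", "city": "NY", "browser": "FF", "sale": 0.0},
    {"date": "2012-05-21", "country": "USA", "city": "NY", "browser": "FF", "sale": 10.0},
    {"date": "2012-05-22", "country": "USA", "city": "SF", "browser": "Chrome", "sale": 25.0},
]
totals = reduce_all(events)
```

Each input record is read once and feeds every report, which is the point of the single-shot job: the cold data is not re-read once per report.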
    25. Indexing
    26. HBase natural order: hierarchical representation
    27. Indexing - Why
        • Example: top 10 cities
            • ~50K [country, city] combinations per day
            • Top 10 cities for 1 year => 365 (days) × 50K ≈ 18M data points scanned
            • If you add gender => ~36M
            • If you add Device, OS, Browser …
        • Might compress well, but think about the environment
            • How much energy would you spend for just top 10 cities?
        * Image from: http://my.neutralexistence.com/images/Green-Earth.jpg
    28. Indexing with HBase
        • “10” < “2”
        • GROUP BY year, month, country, city ORDER BY visits DESC LIMIT 10
        • Lexicographic sorting
            2012/05/USA/0000000000/
            2012/05/USA/4294961296/San Francisco = 1000 visits*
            2012/05/USA/4294961396/New York = 900 visits*
            . . .
            2012/05/USA/9999999999/
        • scan “t” startrow => “2012/05/USA/”, limit => 10
        * Padding numbers for lexicographic sorting: 1000 -> Long.MAX_VALUE – 1000 = 4294961296
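The trick on this slide is to store an inverted, fixed-width count inside the row key so that a plain forward scan returns the largest values first. The sketch below assumes a 10-digit field with 9999999999 as the ceiling, chosen to match the width of the slide's boundary keys; it does not reproduce the slide's exact numeric values, and the city counts other than 1000 and 900 are made up.

```python
from bisect import bisect_left

WIDTH = 10                   # assumed fixed width of the numeric field, in digits
MAX_VALUE = 10 ** WIDTH - 1  # assumed ceiling: 9999999999

def inverted(visits):
    """Zero-padded (MAX_VALUE - visits): larger counts become smaller strings."""
    return str(MAX_VALUE - visits).zfill(WIDTH)

def index_key(year, month, country, visits, city):
    """Row key laid out as year/month/country/inverted-visits/city."""
    return f"{year}/{month:02d}/{country}/{inverted(visits)}/{city}"

def scan(sorted_keys, start_row, limit):
    """Mimic a scan: seek to start_row, return up to `limit` keys with that prefix."""
    i = bisect_left(sorted_keys, start_row)
    out = []
    while i < len(sorted_keys) and len(out) < limit and sorted_keys[i].startswith(start_row):
        out.append(sorted_keys[i])
        i += 1
    return out

visits = {"New York": 900, "San Francisco": 1000, "Chicago": 25, "Boston": 7}
# A sorted list stands in for the table's lexicographic row order.
table = sorted(index_key(2012, 5, "USA", n, city) for city, n in visits.items())

top2 = [key.rsplit("/", 1)[1] for key in scan(table, "2012/05/USA/", limit=2)]
```

Without the fixed width, string comparison would place "10" before "2", which is exactly the problem the padding avoids.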
    29. Query Engine
        • Always reads indexed, compact data
        • Query parsing
        • Scan strategy
            • Single vs. multiple scans
            • Start/stop rows (prefixes, index positions, etc.)
            • Index selection (volatile indexes with incremental processing)
        • Deserialization
        • Post-aggregation, sorting, fuzzy-sorting etc.
        • Paging
        • Custom dimension/metric class loading
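A small sketch of two steps from this list: deriving start/stop rows from a key prefix, then post-aggregating and sorting the scanned rows. The in-memory `ROWS` list, the "date/country" key layout and the function names are hypothetical; only the prefix-to-stop-row technique (increment the last byte) is standard practice.

```python
from bisect import bisect_left

def prefix_stop_row(prefix: bytes) -> bytes:
    """Smallest key greater than every key starting with prefix (b"" means open-ended)."""
    trimmed = prefix.rstrip(b"\xff")  # 0xff bytes cannot be incremented
    if not trimmed:
        return b""
    return trimmed[:-1] + bytes([trimmed[-1] + 1])

def scan(rows, start: bytes, stop: bytes):
    """rows: (key, value) pairs sorted by key; returns those with start <= key < stop."""
    keys = [key for key, _ in rows]
    lo = bisect_left(keys, start)
    hi = bisect_left(keys, stop) if stop else len(rows)
    return rows[lo:hi]

# Hypothetical daily index: rowkey "YYYY-MM-DD/country" -> visits.
ROWS = sorted([
    (b"2012-05-21/Canada", 3),
    (b"2012-05-21/USA", 40),
    (b"2012-05-22/USA", 35),
    (b"2012-06-01/USA", 50),
])

def visits_by_country(month_prefix: bytes):
    """Scan one month of daily rows, post-aggregate per country, sort descending."""
    totals = {}
    for key, visits in scan(ROWS, month_prefix, prefix_stop_row(month_prefix)):
        country = key.split(b"/")[1].decode()
        totals[country] = totals.get(country, 0) + visits
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)

may = visits_by_country(b"2012-05")
```

Paging could then be layered on top by slicing the sorted result.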
    30. Conclusions
        • OLAP semantics on a simple data model
            • Data as first class citizen
            • Domain Specific “Language” for Dimensions, Metrics, Aggregations
        • Tunable performance, resource allocation
        • Framework for vertical analytics systems
    31. Thank you!
        Cosmin Lehene @clehene
        http://hstack.org
        Credits: Andrei Dragomir, Adrian Muraru, Andrei Dulvac, Raluca Podiuc, Tudor Scurtu, Bogdan Dragu, Bogdan Drutu
    33. OLAP 101 - Rollup

        Country  Visits  Sale
        USA      4       $50
        Canada   1       $0

        • Rollup: SELECT COUNT(visits), SUM(sales) GROUP BY country
    34. OLAP 101 - Slicing

        Date        Country  City     OS       Browser  Sale
        2012-03-02  USA      NY       Windows  FF        0.0
        2012-03-02  USA      NY       Windows  FF       10.0
        2012-03-03  USA      SF       OSX      Chrome   25.0
        2012-03-03  Canada   Ontario  Linux    Chrome    0.0
        2012-03-04  USA      Chicago  OSX      Safari   15.0

        Summary:
        • 5 visits, 3 days
        • 2 countries (USA: 4, Canada: 1)
        • 4 cities (NY: 2, SF: 1)
        • 3 OS-es (Win: 2, OSX: 2)
        • 3 browsers (FF: 2, Chrome: 2)
        • 50.0 in sales, 3 sales

        • Filter or Segment or Slice (WHERE or HAVING)
    35. OLAP 101 – Sorting, TOP n

        Browser  Sale
        Chrome   $25
        Safari   $15
        Firefox  $10

        • SELECT SUM(sales) as total GROUP BY browser ORDER BY total