Low Latency “OLAP” with HBase - HBaseCon 2012
     Cosmin Lehene | Adobe




© 2012 Adobe Systems Incorporated. All Rights Reserved. Adobe Confidential.
What we needed … and built


      OLAP Semantics
      Low Latency Ingestion
      High Throughput
      Real-time Query API




      Not hardcoded to web analytics or x-, y-, z-analytics, but extensible
Building Blocks


      Dimensions, Metrics
      Aggregations
      Roll-up, drill-down, slicing and dicing, sorting




OLAP 101 – Queries example




      Date         Country   City      OS        Browser   Sale
      2012-05-21   USA       NY        Windows   FF         0.0
      2012-05-21   USA       NY        Windows   FF        10.0
      2012-05-22   USA       SF        OSX       Chrome    25.0
      2012-05-22   Canada    Ontario   Linux     Chrome     0.0
      2012-05-23   USA       Chicago   OSX       Safari    15.0

      Totals: 5 visits over 3 days; 3 sales, 50.0 in total
            2 countries (USA: 4, Canada: 1)
            4 cities (NY: 2, SF: 1)
            3 OS-es (Win: 2, OSX: 2)
            3 browsers (FF: 2, Chrome: 2)


OLAP 101 – Queries example

      Rolling up to country level:
  SELECT COUNT(visits), SUM(sales)
  GROUP BY country

            Country    Visits   Sales
            USA        4        $50
            Canada     1        $0

      “Slicing” by browser:
  SELECT COUNT(visits), SUM(sales)
  WHERE browser = 'FF'
  GROUP BY country

            Country    Visits   Sales
            USA        2        $10
            Canada     0        $0

      Top browsers by sales:
  SELECT SUM(sales), COUNT(visits)
  GROUP BY browser
  ORDER BY sales DESC

            Browser    Sales    Visits
            Chrome     $25      2
            Safari     $15      1
            FF         $10      2
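The three queries can be reproduced over the five sample visits with a few lines of plain Python. This is only an illustrative sketch: the `group_by` helper and the tuple layout are assumptions made for the example, not part of SaasBase.

```python
from collections import defaultdict

# The five sample visits: (date, country, city, os, browser, sale)
VISITS = [
    ("2012-05-21", "USA", "NY", "Windows", "FF", 0.0),
    ("2012-05-21", "USA", "NY", "Windows", "FF", 10.0),
    ("2012-05-22", "USA", "SF", "OSX", "Chrome", 25.0),
    ("2012-05-22", "Canada", "Ontario", "Linux", "Chrome", 0.0),
    ("2012-05-23", "USA", "Chicago", "OSX", "Safari", 15.0),
]
COLUMNS = ("date", "country", "city", "os", "browser", "sale")


def group_by(rows, dimension, where=None):
    """Return {dimension value: (visit count, total sales)}."""
    idx = COLUMNS.index(dimension)
    totals = defaultdict(lambda: [0, 0.0])
    for row in rows:
        if where is not None and not where(row):
            continue  # row filtered out by the slice predicate
        totals[row[idx]][0] += 1
        totals[row[idx]][1] += row[-1]
    return {key: tuple(value) for key, value in totals.items()}


# Roll-up to country level
rollup = group_by(VISITS, "country")
# Slice: only visits made with Firefox
sliced = group_by(VISITS, "country", where=lambda row: row[4] == "FF")
# Top browsers by sales, highest first
top_browsers = sorted(group_by(VISITS, "browser").items(),
                      key=lambda item: item[1][1], reverse=True)
```

Note that in this sketch a country with no matching visits (Canada in the Firefox slice) is simply absent from the result rather than reported with zeros.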

OLAP – Runtime Aggregation vs. Pre-aggregation


      Aggregate at runtime
            Most flexible
            Fast – scatter gather
            Space efficient
      But
            I/O, CPU intensive
            Slow for larger data
            Low throughput

      Pre-aggregate
            Fast
            Efficient – O(1)
            High throughput
      But
            More effort to process (latency)
            Combinatorial explosion (space)
            No flexibility




Pre-aggregation

      Data needs to be summarized
            Can’t visualize 1B data points (no, not even with Retina display)
            Difficult to comprehend correlations among more than 3 dimensions


      Not all dimension groups are relevant
            Index on an as-needed basis (view selection problem)


      Runtime aggregation == TeraSort for every query?
            Pre-aggregate to reduce cardinality
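To make the view selection problem concrete, the sketch below counts every possible GROUP BY over a handful of dimensions and the worst-case number of rows each one could produce. The cardinalities are hypothetical values chosen for illustration, and the helper names are assumptions of this example.

```python
from itertools import combinations
from math import prod

# Hypothetical per-dimension cardinalities, for illustration only.
CARDINALITY = {"date": 365, "country": 200, "city": 50_000, "os": 10, "browser": 20}


def dimension_groups(dimensions):
    """All non-empty dimension subsets, i.e. every possible GROUP BY."""
    names = sorted(dimensions)
    return [combo
            for size in range(1, len(names) + 1)
            for combo in combinations(names, size)]


def max_cells(group):
    """Worst-case number of aggregated rows for one GROUP BY."""
    return prod(CARDINALITY[name] for name in group)


groups = dimension_groups(CARDINALITY)   # 2^5 - 1 = 31 candidate views
# Materialize only the views that reports actually need:
selected = [("date",), ("country", "date")]
budget = sum(max_cells(group) for group in selected)
```

With five dimensions there are already 31 candidate views, and each added dimension doubles that number, which is why only the relevant groups are indexed.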




SaasBase

      We tune both:
            Pre-aggregation level (ingestion speed + space)
            vs.
            Runtime post-aggregation (query speed)


      Think materialized views from RDBMS




SaasBase Domain Model Mapping




SaasBase - Domain Model Mapping




SaasBase - Ingestion, Processing, Indexing, Querying




SaasBase - Ingestion, Processing, Indexing, Querying




Ingestion




Ingestion throughput vs. latency


      Historical data (large batches)
            Optimize for throughput
      Increments (latest data, smaller)
            Optimize for latency




Large, granular input strategies

      Slow listing in HDFS
            Archive processed files


      Filtering input
            FileDateFilter (log name patterns: log-YYYY-MM-dd-HH.log)
            TableInputFormat start/stop row
            File Index in HBase (track processed/new files)


      Map tasks overhead - stitching input splits
            400K files => 400K map tasks => overhead, slow reduce copy
            CombineFileInputFormat – 2GB-splits => 500 splits for 1TB
            FixedMappersTableInputFormat (e.g. 5-region splits)
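The two input strategies above, filtering files by the date in their name and stitching many small files into fewer large splits, can be sketched in plain Python. This is not the Hadoop `CombineFileInputFormat` (which also takes data locality into account) nor the actual FileDateFilter; the function names and the greedy packing are assumptions made for illustration.

```python
import re
from datetime import datetime

MB = 1024 ** 2
GB = 1024 ** 3
# Matches the log name pattern log-YYYY-MM-dd-HH.log
LOG_NAME = re.compile(r"^log-(\d{4}-\d{2}-\d{2}-\d{2})\.log$")


def file_date_filter(names, start, end):
    """Keep log files whose name encodes an hour in [start, end)."""
    kept = []
    for name in names:
        match = LOG_NAME.match(name)
        if not match:
            continue  # not a log file we know how to date
        stamp = datetime.strptime(match.group(1), "%Y-%m-%d-%H")
        if start <= stamp < end:
            kept.append(name)
    return kept


def combine_splits(file_sizes, max_split_bytes):
    """Greedily pack (name, size) pairs into splits of at most max_split_bytes."""
    splits, current, current_size = [], [], 0
    for name, size in file_sizes:
        if current and current_size + size > max_split_bytes:
            splits.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        splits.append(current)
    return splits


names = ["log-2012-05-21-23.log", "log-2012-05-22-00.log",
         "log-2012-05-22-01.log", "readme.txt"]
new_files = file_date_filter(names, datetime(2012, 5, 22), datetime(2012, 5, 23))
# Eight 512 MB files packed into 2 GB splits
splits = combine_splits([(f"part-{i}", 512 * MB) for i in range(8)], 2 * GB)
```

At 2 GB per split, 1 TB of input comes to roughly 500 splits (512 with binary units), instead of one map task per file.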
Ingestion – Bulk Import

      HFileOutputFormat (HFOF)
            100s X faster than HBase API
            No need to recover from failed jobs
            No unnecessary load on machines




      * No shuffle - global reduce order required!
            e.g. first reduce key needs to be in the first region, last one in the last region
            Watch for uneven partitions


HFOF – FileSizeDatePartitioner

      1 partition (reduce) per day for initial import
      Uneven reduce (partitions) due to data growth over time
            Reduce k: 2010-12-04 = 500MB
            Reduce n: 2012-05-22 = 5GB => slow and will result in a 5GB region




      Balance reduce buckets based on input file sizes and the reduce key
      Generate sub-partitions based on predefined size (e.g. 1GB)
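The balancing idea can be sketched as follows: each day gets a contiguous range of reducers sized by its input, so large days are split into roughly 1 GB sub-partitions while reducers stay in key order. This is a minimal illustration under assumed names (`plan_reducers`, `bytes_per_day`), not the actual FileSizeDatePartitioner.

```python
import math

MB = 1024 ** 2
GB = 1024 ** 3


def plan_reducers(bytes_per_day, target_bytes=GB):
    """Assign each day a contiguous range of reducers sized by its input."""
    plan, next_reducer = {}, 0
    for day in sorted(bytes_per_day):  # keep reducers in global key order
        count = max(1, math.ceil(bytes_per_day[day] / target_bytes))
        plan[day] = range(next_reducer, next_reducer + count)
        next_reducer += count
    return plan


plan = plan_reducers({"2012-05-22": 5 * GB, "2010-12-04": 500 * MB})
```

In this example the 500 MB day keeps a single reducer and the 5 GB day is spread over five. Within a day, records would still need to be routed to sub-partitions by rowkey range, not by hash, to preserve the global order that bulk import requires.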

Processing




Processing



      Processing involves reading the input (files, tables, events), pre-aggregating it
       (reducing cardinality) and generating tables that can be queried in real time
            1 year: 1B events => 100B data points indexed
            Query => scan 365 data points (e.g. daily page views)
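A toy version of this reduction in cardinality: raw events are collapsed into one counter per day, and a query then reads only those daily points. The event shape (ISO timestamp, page) and the function names are assumptions made for this sketch.

```python
from collections import Counter


def pre_aggregate(events):
    """Collapse raw (timestamp, page) events into one counter per day."""
    daily = Counter()
    for timestamp, _page in events:
        daily[timestamp[:10]] += 1  # "YYYY-MM-DD" prefix of an ISO timestamp
    return daily


def query_page_views(daily, start_day, end_day):
    """Read only the pre-aggregated points in [start_day, end_day]."""
    return sorted((day, views) for day, views in daily.items()
                  if start_day <= day <= end_day)


events = [
    ("2012-05-21T10:00:00", "/home"),
    ("2012-05-21T10:05:00", "/cart"),
    ("2012-05-22T09:30:00", "/home"),
]
daily = pre_aggregate(events)
report = query_page_views(daily, "2012-05-21", "2012-05-22")
```

However many events arrive in a day, the query touches one point per day, which is how a year of daily page views becomes a 365-point scan.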




      Processing could be either MR or real-time (e.g. Storm)




Processing for OLAP semantics

            GROUP BY (process, query)
            COUNT, SUM, AVG, etc. (process, query)
            SORT (process, query)
            HAVING (mostly query, can define pre-process constraints)




SaasBase vs. SQL Views Comparison




reports.json entities definition




Processing Performance

      read, map, partition, combine, copy, sort, reduce, write


      Read:
            Scan.setCaching() (I/O ~ buffer)
            Scan.setBatch() (avoid timeouts for abnormal input, e.g. 1M hits/visit)
            Even region distribution across cluster (distributes CPU, I/O)
      Map:
            No unnecessary transformations: Bytes.toString(bytes) + Bytes.toBytes(string) (CPU)
            Avoid GC : new X() (CPU, Memory)
            Avoid system calls (context switching)
            Stripping unnecessary data (I/O)


Processing Performance

      Hot (in memory) vs. Cold (on disk, on network) data
            Minimize I/O from disk/network


      Single shot MR job: SuperProcessor
            Emit all groups from one map() call
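One way to read "emit all groups from one map() call" is that every input record is read once and fans out into one key per configured group. The sketch below illustrates that reading in plain Python; the `GROUPS` list, the key layout and both function names are hypothetical, and the real SuperProcessor may work differently.

```python
from collections import defaultdict

# Hypothetical report definitions: each tuple is one GROUP BY to maintain.
GROUPS = [("country",), ("country", "city"), ("browser",)]


def map_visit(visit):
    """Emit one ((group, key), (visits, sales)) pair per configured group."""
    for group in GROUPS:
        key = "/".join([visit["date"]] + [visit[name] for name in group])
        yield (group, key), (1, visit["sale"])


def reduce_all(pairs):
    """Sum the emitted metrics per (group, key)."""
    totals = defaultdict(lambda: (0, 0.0))
    for key, (visits, sales) in pairs:
        seen_visits, seen_sales = totals[key]
        totals[key] = (seen_visits + visits, seen_sales + sales)
    return dict(totals)


visits = [
    {"date": "2012-05-21", "country": "USA", "city": "NY", "browser": "FF", "sale": 0.0},
    {"date": "2012-05-21", "country": "USA", "city": "NY", "browser": "FF", "sale": 10.0},
]
pairs = [pair for visit in visits for pair in map_visit(visit)]
totals = reduce_all(pairs)
```

The input is scanned once regardless of how many groups are defined, which trades extra map output for fewer passes over cold data.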


      Incremental processing
            Date-prefixed rowkey: YYYY-MM-DD (add HH:mm for more granularity)
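With a date-prefixed rowkey, an incremental run only needs to touch the key range of the new days. The sketch below builds such keys and the start/stop rows for a date range; the `/` separator and the helper names are assumptions made for illustration.

```python
from datetime import date, timedelta


def row_key(day, *dimensions):
    """Build a rowkey with a YYYY-MM-DD prefix followed by dimension values."""
    return "/".join([day.isoformat(), *dimensions])


def scan_range(first_day, last_day):
    """Start row (inclusive) and stop row (exclusive) covering first_day..last_day."""
    return first_day.isoformat(), (last_day + timedelta(days=1)).isoformat()


keys = sorted([
    row_key(date(2012, 5, 20), "USA", "NY"),
    row_key(date(2012, 5, 21), "USA", "NY"),
    row_key(date(2012, 5, 22), "USA", "SF"),
    row_key(date(2012, 5, 23), "USA", "Chicago"),
])
start, stop = scan_range(date(2012, 5, 21), date(2012, 5, 22))
# Emulate a range scan over lexicographically sorted keys
increment = [key for key in keys if start <= key < stop]
```

Because ISO dates sort lexicographically in chronological order, the stop row is simply the day after the last day of interest.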




Indexing




HBase natural order: hierarchical representation




Indexing - Why

      Example: top 10 cities
            ~50K [country, city] combinations per day
            Top 10 cities for 1 year =>
            365 (days) X 50K ~= 18M data points scanned
            If you add gender => ~36M
            If you add Device, OS, Browser …


      Might compress well, but think about the environment
      How much energy would you spend for just top 10 cities?



                                                                              * Image from: http://my.neutralexistence.com/images/Green-Earth.jpg


Indexing with HBase “10” < “2”

  GROUP BY year, month, country, city ORDER BY visits DESC LIMIT 10

      Lexicographic sorting

  2012/05/USA/0000000000/
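  2012/05/USA/4294961296/San Francisco     = 1000 visits*
  2012/05/USA/4294961396/New York          = 900 visits*
  . . .
  2012/05/USA/9999999999/

      scan 't', {STARTROW => '2012/05/USA/', LIMIT => 10}

      * Padding numbers for lexicographic sorting: store (MAX - visits), zero-padded to a
        fixed width, so that higher counts sort first. The 10-digit values shown are
        illustrative; Long.MAX_VALUE - 1000 is actually 9223372036854774807.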
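The inverted, zero-padded count can be tried out with ordinary string sorting. In this sketch the ceiling is assumed to be the largest 10-digit number rather than Long.MAX_VALUE, so that keys match the 10-digit width used above; `index_key` and `top_n` are hypothetical helpers that emulate a prefix scan with a limit.

```python
WIDTH = 10
MAX_COUNT = 10 ** WIDTH - 1  # assumed ceiling: largest 10-digit number


def index_key(prefix, visits, city):
    """Build a key whose count component sorts descending as a string."""
    inverted = MAX_COUNT - visits
    return f"{prefix}/{inverted:0{WIDTH}d}/{city}"


def top_n(keys, prefix, n):
    """Emulate a prefix scan with a limit over lexicographically sorted keys."""
    hits = [key for key in sorted(keys) if key.startswith(prefix)]
    return [key.rsplit("/", 1)[1] for key in hits[:n]]


# Plain decimal strings sort in the wrong order: "10" comes before "2".
naive_order = sorted(["2", "10"])

counts = {"San Francisco": 1000, "New York": 900, "Chicago": 10, "Los Angeles": 2}
keys = [index_key("2012/05/USA", visits, city) for city, visits in counts.items()]
keys.append(index_key("2012/05/Canada", 5000, "Toronto"))
top_cities = top_n(keys, "2012/05/USA/", 3)
```

Reading the top 10 then costs a scan of 10 rows from the start of the prefix, instead of reading and sorting every [country, city] combination at query time.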


Query Engine

      Always reads indexed, compact data
      Query parsing
      Scan strategy
            Single vs. multiple scans
            Start/stop rows (prefixes, index positions, etc.)
            Index selection (volatile indexes with incremental processing)
      Deserialization
      Post-aggregation, sorting, fuzzy-sorting etc.
      Paging
      Custom dimension/metric class loading




Conclusions

      OLAP semantics on a simple data model
            Data as first class citizen
            Domain Specific “Language” for Dimensions, Metrics, Aggregations
      Tunable performance, resource allocation
      Framework for vertical analytics systems




Thank you!
                                                               Cosmin Lehene @clehene

                                                               http://hstack.org
                                                                               Credits:
                                                                              Andrei Dragomir
                                                                              Adrian Muraru
                                                                               Andrei Dulvac
                                                                              Raluca Podiuc
                                                                               Tudor Scurtu
                                                                              Bogdan Dragu
                                                                               Bogdan Drutu

OLAP 101 - Rollup

      Country   Visits   Sales
      USA       4        $50
      Canada    1        $0




      Rollup: SELECT COUNT(visits), SUM(sales) GROUP BY country




OLAP 101 - Slicing

      Date         Country   City      OS        Browser   Sale
      2012-03-02   USA       NY        Windows   FF         0.0
      2012-03-02   USA       NY        Windows   FF        10.0
      2012-03-03   USA       SF        OSX       Chrome    25.0
      2012-03-03   Canada    Ontario   Linux     Chrome     0.0
      2012-03-04   USA       Chicago   OSX       Safari    15.0

      Totals: 5 visits over 3 days; 3 sales, 50.0 in total
            2 countries (USA: 4, Canada: 1)
            4 cities (NY: 2, SF: 1)
            3 OS-es (Win: 2, OSX: 2)
            3 browsers (FF: 2, Chrome: 2)

      Filter or Segment or Slice (WHERE or HAVING)




OLAP 101 – Sorting, TOP n

      Browser   Sale
      Chrome    $25
      Safari    $15
      Firefox   $10




      SELECT SUM(sales) AS total GROUP BY browser ORDER BY total DESC




Swan(sea) Song – personal research during my six years at Swansea ... and bey...Alan Dix
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...gurkirankumar98700
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 

Recently uploaded (20)

The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 

Low Latency “OLAP” with HBase - HBaseCon 2012

  • 1. Low Latency “OLAP” with HBase Cosmin Lehene | Adobe © 2012 Adobe Systems Incorporated. All Rights Reserved. Adobe Confidential.
  • 2. What we needed … and built
      - OLAP Semantics
      - Low Latency Ingestion
      - High Throughput
      - Real-time Query API
      - Not hardcoded to web analytics or x-, y-, z-analytics, but extensible
  • 3. Building Blocks
      - Dimensions, Metrics
      - Aggregations
      - Roll-up, drill-down, slicing and dicing, sorting
  • 4. OLAP 101 – Queries example
      Date        Country  City     OS       Browser  Sale
      2012-05-21  USA      NY       Windows  FF       0.0
      2012-05-21  USA      NY       Windows  FF       10.0
      2012-05-22  USA      SF       OSX      Chrome   25.0
      2012-05-22  Canada   Ontario  Linux    Chrome   0.0
      2012-05-23  USA      Chicago  OSX      Safari   15.0
      Summary:
      - 5 visits, 3 days
      - 2 countries (USA: 4, Canada: 1)
      - 4 cities (NY: 2, SF: 1)
      - 3 OS-es (Win: 2, OSX: 2)
      - 3 browsers (FF: 2, Chrome: 2)
      - Sale: 50.0 total, 3 sales
  • 5. OLAP 101 – Queries example
      - Rolling up to country level:
          SELECT COUNT(visits), SUM(sales)
          GROUP BY country

          Country  visits  sales
          USA      4       $50
          Canada   1       0
      - “Slicing” by browser:
          SELECT COUNT(visits), SUM(sales)
          GROUP BY country
          HAVING browser = “FF”

          Country  visits  sales
          USA      2       $10
          Canada   0       0
      - Top browsers by sales:
          SELECT SUM(sales), COUNT(visits)
          GROUP BY browser
          ORDER BY sales DESC

          Browser  sales  visits
          Chrome   $25    2
          Safari   $15    1
          FF       $10    2
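The three queries on this slide can be reproduced over the five sample visits from slide 4 in a few lines of Python. This is only an illustrative sketch: the `VISITS` list, the `group_by` helper and the column order are assumptions made for the example, not part of SaasBase.

```python
from collections import defaultdict

# The five sample visits from slide 4: (date, country, city, os, browser, sale).
VISITS = [
    ("2012-05-21", "USA", "NY", "Windows", "FF", 0.0),
    ("2012-05-21", "USA", "NY", "Windows", "FF", 10.0),
    ("2012-05-22", "USA", "SF", "OSX", "Chrome", 25.0),
    ("2012-05-22", "Canada", "Ontario", "Linux", "Chrome", 0.0),
    ("2012-05-23", "USA", "Chicago", "OSX", "Safari", 15.0),
]
COLUMNS = ("date", "country", "city", "os", "browser", "sale")


def group_by(rows, dimension, where=None):
    """Return {dimension value: [visits, sales]} for rows passing `where`."""
    idx = COLUMNS.index(dimension)
    totals = defaultdict(lambda: [0, 0.0])
    for row in rows:
        if where is not None and not where(row):
            continue
        totals[row[idx]][0] += 1        # COUNT(visits)
        totals[row[idx]][1] += row[-1]  # SUM(sales)
    return dict(totals)


# Roll-up to country level.
rollup = group_by(VISITS, "country")
# Slice: only Firefox visits, still grouped by country.
ff_slice = group_by(VISITS, "country", where=lambda r: r[4] == "FF")
# Top browsers by sales, highest first.
top_browsers = sorted(group_by(VISITS, "browser").items(),
                      key=lambda kv: kv[1][1], reverse=True)
```

Unlike the slide, the sliced result in this sketch simply omits Canada rather than listing it with zero visits.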
  • 6. OLAP – Runtime Aggregation vs. Pre-aggregation
      - Aggregate at runtime
          - Most flexible
          - Fast – scatter gather
          - Space efficient
          - But
              - I/O, CPU intensive
              - slow for larger data
              - low throughput
      - Pre-aggregate
          - Fast
          - Efficient – O(1)
          - High throughput
          - But
              - More effort to process (latency)
              - Combinatorial explosion (space)
              - No flexibility
  • 7. Pre-aggregation
      - Data needs to be summarized
      - Can’t visualize 1B data points (no, not even with Retina display)
      - Difficult to comprehend correlations among more than 3 dimensions
      - Not all dimension groups are relevant
      - Index on an as-needed basis (view selection problem)
      - Runtime aggregation == TeraSort for every query?
      - Pre-aggregate to reduce cardinality
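A minimal sketch of the trade-off described on slides 6 and 7: every non-empty combination of n dimensions gives 2^n − 1 possible groups (the combinatorial explosion), so only a selected subset is pre-aggregated, and a query then becomes a single lookup. The dimension names, `SELECTED_GROUPS` and the in-memory dictionary are assumptions for illustration; the actual system stores its aggregates in HBase.

```python
from collections import defaultdict
from itertools import combinations

DIMENSIONS = ("date", "country", "city", "os", "browser")

# Hypothetical selection of dimension groups worth indexing ("view selection").
SELECTED_GROUPS = [("country",), ("date", "country"), ("browser",)]


def all_groups(dimensions):
    """Every non-empty dimension combination: 2**n - 1 of them."""
    return [c for r in range(1, len(dimensions) + 1)
            for c in combinations(dimensions, r)]


def pre_aggregate(events, groups):
    """One pass over events; returns {(group, values): [visits, sales]}."""
    table = defaultdict(lambda: [0, 0.0])
    for event in events:
        for group in groups:
            key = (group, tuple(event[d] for d in group))
            table[key][0] += 1
            table[key][1] += event["sale"]
    return dict(table)


events = [
    {"date": "2012-05-21", "country": "USA", "city": "NY",
     "os": "Windows", "browser": "FF", "sale": 10.0},
    {"date": "2012-05-22", "country": "USA", "city": "SF",
     "os": "OSX", "browser": "Chrome", "sale": 25.0},
    {"date": "2012-05-22", "country": "Canada", "city": "Ontario",
     "os": "Linux", "browser": "Chrome", "sale": 0.0},
]
table = pre_aggregate(events, SELECTED_GROUPS)
# A query is now a single lookup instead of a scan over raw events.
usa = table[(("country",), ("USA",))]
```

Groups that were not selected, such as `("city",)`, cannot be answered from `table` at all, which is the "no flexibility" cost listed on slide 6.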
  • 8. SaasBase
      - We tune both
          - pre-aggregation level vs. runtime post-aggregation
          - (ingestion speed + space) vs. (query speed)
      - Think materialized views from RDBMS
  • 9. SaasBase Domain Model Mapping
  • 10. SaasBase - Domain Model Mapping
  • 11. SaasBase - Ingestion, Processing, Indexing, Querying
  • 12. SaasBase - Ingestion, Processing, Indexing, Querying
  • 13. Ingestion
  • 14. Ingestion throughput vs. latency
      - Historical data (large batches)
          - Optimize for throughput
      - Increments (latest data, smaller)
          - Optimize for latency
  • 15. Large, granular input strategies
      - Slow listing in HDFS
          - Archive processed files
      - Filtering input
          - FileDateFilter (log name patterns: log-YYYY-MM-dd-HH.log)
          - TableInputFormat start/stop row
          - File Index in HBase (track processed/new files)
      - Map tasks overhead - stitching input splits
          - 400K files => 400K map tasks => overhead, slow reduce copy
          - CombineFileInputFormat – 2GB-splits => 500 splits for 1TB
          - FixedMappersTableInputFormat (e.g. 5-region splits)
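The two input strategies on this slide, filtering files by the date in their name and stitching many small files into fewer splits, can be sketched as follows. Both functions are hypothetical stand-ins written for this example: `filter_by_date` is not the FileDateFilter class from the talk, and `combine_splits` ignores the data-locality grouping that Hadoop's CombineFileInputFormat also performs.

```python
import re
from datetime import datetime

# Log name pattern from the slide: log-YYYY-MM-dd-HH.log
LOG_NAME = re.compile(r"^log-(\d{4}-\d{2}-\d{2}-\d{2})\.log$")


def filter_by_date(names, start, end):
    """Keep log files whose hour stamp falls in [start, end)."""
    kept = []
    for name in names:
        match = LOG_NAME.match(name)
        if not match:
            continue  # not a log file we know how to date
        stamp = datetime.strptime(match.group(1), "%Y-%m-%d-%H")
        if start <= stamp < end:
            kept.append(name)
    return kept


def combine_splits(sizes, max_split_bytes):
    """Greedily pack consecutive (name, size) files into bounded splits."""
    splits, current, current_size = [], [], 0
    for name, size in sizes:
        if current and current_size + size > max_split_bytes:
            splits.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        splits.append(current)
    return splits


names = ["log-2012-05-21-23.log", "log-2012-05-22-00.log",
         "log-2012-05-22-01.log", "README"]
new_files = filter_by_date(names, datetime(2012, 5, 22), datetime(2012, 5, 23))
splits = combine_splits([("a", 600), ("b", 600), ("c", 300), ("d", 900)], 1000)
```

With one map task per split instead of one per file, the number of tasks depends on the total input size rather than on the file count.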
  • 16. Ingestion – Bulk Import
      - HFileOutputFormat (HFOF)
          - 100s X faster than HBase API
          - No need to recover from failed jobs
          - No unnecessary load on machines
      - * No shuffle - global reduce order required!
          - e.g. first reduce key needs to be in the first region, last one in the last region
          - Watch for uneven partitions
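The global ordering requirement means each reducer must receive exactly the key range of one region. A minimal sketch of such a range partitioner, assuming a hypothetical list of region start keys (real start keys would come from the target table):

```python
from bisect import bisect_right

# Hypothetical region start keys; the first region starts at the empty key.
REGION_START_KEYS = [b"", b"2011-01-01", b"2012-01-01"]


def partition(row_key, start_keys=REGION_START_KEYS):
    """Index of the region (and reducer) whose key range contains row_key."""
    return bisect_right(start_keys, row_key) - 1


keys = [b"2012-05-22/USA", b"2010-12-04/USA", b"2011-06-30/Canada"]
assignment = {k: partition(k) for k in keys}
```

Because start keys are inclusive, a key equal to `b"2011-01-01"` lands in the second region (index 1), and reducer 0 only ever sees keys that sort before every key of reducer 1.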
  • 17. HFOF – FileSizeDatePartitioner
      - 1 partition (reduce) / day for initial import
      - Uneven reduce (partitions) due to data growth over time
          - Reduce k: 2010-12-04 = 500MB
          - Reduce n: 2012-05-22 = 5GB => slow and will result in a 5GB region
      - Balance reduce buckets based on input file sizes and the reduce key
      - Generate sub-partitions based on predefined size (e.g. 1GB)
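The sizing step of this idea can be sketched with the two example days from the slide: each day gets as many reduce buckets as its input size requires at roughly 1GB per bucket. The function below is a hypothetical illustration and not the FileSizeDatePartitioner itself; in particular it only sizes the buckets and does not show how rows within a day are routed to a sub-partition by reduce key.

```python
import math

GB = 1024 ** 3
MB = 1024 ** 2


def sub_partitions(bytes_per_day, target_bytes=GB):
    """Map each day to a list of reduce-bucket ids, about target_bytes each."""
    buckets, next_id = {}, 0
    for day in sorted(bytes_per_day):  # keep bucket ids in global key order
        count = max(1, math.ceil(bytes_per_day[day] / target_bytes))
        buckets[day] = list(range(next_id, next_id + count))
        next_id += count
    return buckets


buckets = sub_partitions({"2010-12-04": 500 * MB, "2012-05-22": 5 * GB})
```

Here the 500MB day keeps a single bucket while the 5GB day is spread over five, so no single reducer (or resulting region) ends up ten times larger than the others.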
  • 18. Processing
  • 19. Processing
      - Processing involves reading the input (files, tables, events), pre-aggregating it (reducing cardinality) and generating tables that can be queried in real-time
      - 1 year: 1B events => 100B data points indexed
      - Query => scan 365 data points (e.g. daily page views)
      - Processing could be either MR or real-time (e.g. Storm)
  • 20. Processing for OLAP semantics
      - GROUP BY (process, query)
      - COUNT, SUM, AVG, etc. (process, query)
      - SORT (process, query)
      - HAVING (mostly query, can define pre-process constraints)
  • 21. SaasBase vs. SQL Views Comparison
  • 22. reports.json entities definition
  • 23. Processing Performance
      - read, map, partition, combine, copy, sort, reduce, write
      - Read:
          - Scan.setCaching() (I/O ~ buffer)
          - Scan.setBatch() (avoid timeouts for abnormal input, e.g. 1M hits/visit)
          - Even region distribution across cluster (distributes CPU, I/O)
      - Map:
          - No unnecessary transformations: Bytes.toString(bytes) + Bytes.toBytes(string) (CPU)
          - Avoid GC: new X() (CPU, Memory)
          - Avoid system calls (context switching)
          - Stripping unnecessary data (I/O)
  • 24. Processing Performance
      - Hot (in memory) vs. Cold (on disk, on network) data
          - Minimize I/O from disk/network
      - Single shot MR job: SuperProcessor
          - Emit all groups from one map() call
      - Incremental processing
          - Data format YYYY-MM-DD prefixed rowkey (HH:mm for more granularity)
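A date-prefixed row key makes incremental processing a range read: since HBase keeps rows sorted by key, only the rows for the new days need to be touched. The sketch below imitates that with a sorted Python list; the row keys and the `scan` helper are assumptions for illustration, not HBase API calls.

```python
from bisect import bisect_left

# Sorted row keys as a stand-in for an HBase table (HBase keeps rows sorted).
ROWS = sorted([
    "2012-05-20/pageviews", "2012-05-21/pageviews",
    "2012-05-22/pageviews", "2012-05-23/pageviews",
])


def scan(rows, start_row, stop_row):
    """Return keys in [start_row, stop_row), like a start/stop-row scan."""
    return rows[bisect_left(rows, start_row):bisect_left(rows, stop_row)]


# Incremental run: only the rows for 2012-05-22 and later are re-read.
new_rows = scan(ROWS, "2012-05-22", "2012-05-24")
```

Adding an `HH:mm` component after the date, as the slide suggests, narrows the same kind of scan to hours or minutes.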
  • 25. Indexing
  • 26. HBase natural order: hierarchical representation
  • 27. Indexing - Why
      - Example: top 10 cities
      - ~50K [country, city] combinations per day
      - Top 10 cities for 1 year =>
          - 365 (days) X 50K ~= 15M data points scanned
          - If you add gender => 30M
          - If you add Device, OS, Browser …
      - Might compress well, but think about the environment
      - How much energy would you spend for just top 10 cities?
      * Image from: http://my.neutralexistence.com/images/Green-Earth.jpg
  • 28. Indexing with HBase
      “10” < “2”
      GROUP BY year, month, country, city ORDER BY visits DESC LIMIT 10
      - Lexicographic sorting
          2012/05/USA/0000000000/
          2012/05/USA/4294961296/San Francisco = 1000 visits*
          2012/05/USA/4294961396/New York = 900 visits*
          ...
          2012/05/USA/9999999999/
      - scan “t” startrow => “2012/05/USA/”, limit => 10
      * Padding numbers for lexicographic sorting: 1000 -> Long.MAX_VALUE – 1000 = 4294961296
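The key layout on this slide can be sketched as follows: the metric is subtracted from a maximum and zero-padded to a fixed width, so that plain lexicographic order puts the largest counts first and a prefix scan with a limit returns the top N. The 10-digit `MAX_VALUE` is an assumption chosen to match the width of the keys on the slide (it is not Java's `Long.MAX_VALUE`, so the inverted values differ from the slide's), and the sorted list stands in for an HBase table.

```python
WIDTH = 10
MAX_VALUE = 10 ** WIDTH - 1  # assumption: 10-digit cap, not Java's Long.MAX_VALUE


def index_key(year, month, country, visits, city):
    """Row key that sorts by visits descending within year/month/country."""
    inverted = str(MAX_VALUE - visits).zfill(WIDTH)
    return f"{year}/{month:02d}/{country}/{inverted}/{city}"


def scan_prefix(rows, prefix, limit):
    """First `limit` keys starting with prefix, in lexicographic order."""
    return [r for r in sorted(rows) if r.startswith(prefix)][:limit]


table = [
    index_key(2012, 5, "USA", 900, "New York"),
    index_key(2012, 5, "USA", 1000, "San Francisco"),
    index_key(2012, 5, "USA", 15, "Chicago"),
    index_key(2012, 5, "Canada", 2000, "Ontario"),
]
top2 = [key.rsplit("/", 1)[1] for key in scan_prefix(table, "2012/05/USA/", 2)]
```

Without the fixed width the ordering breaks, because as strings `"10" < "2"`; padding makes string order agree with numeric order.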
  • 29. Query Engine
      - Always reads indexed, compact data
      - Query parsing
      - Scan strategy
          - Single vs. multiple scans
          - Start/stop rows (prefixes, index positions, etc.)
          - Index selection (volatile indexes with incremental processing)
      - Deserialization
      - Post-aggregation, sorting, fuzzy-sorting etc.
      - Paging
      - Custom dimension/metric class loading
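Two of the steps listed here lend themselves to a short sketch: turning a key prefix into a start/stop row pair, and paging over sorted results. Both helpers are hypothetical and written for this example; they show a common technique (incrementing the last byte of the prefix) rather than the query engine's actual code.

```python
def prefix_stop_row(prefix: bytes) -> bytes:
    """Smallest key greater than every key that starts with prefix.

    Returns b"" (meaning "scan to the end") if no such key exists.
    """
    trimmed = prefix.rstrip(b"\xff")  # trailing 0xFF bytes cannot be incremented
    if not trimmed:
        return b""
    return trimmed[:-1] + bytes([trimmed[-1] + 1])


def page(results, page_number, page_size):
    """Return one page (1-based) of already sorted results."""
    start = (page_number - 1) * page_size
    return results[start:start + page_size]


start_row = b"2012/05/USA/"
stop_row = prefix_stop_row(start_row)
second_page = page(["a", "b", "c", "d", "e"], 2, 2)
```

For the prefix `2012/05/USA/` the stop row is `2012/05/USA0`, because `0` is the byte that follows `/`.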
  • 30. Conclusions
      - OLAP semantics on a simple data model
      - Data as first class citizen
      - Domain Specific “Language” for Dimensions, Metrics, Aggregations
      - Tunable performance, resource allocation
      - Framework for vertical analytics systems
  • 31. Thank you!
      Cosmin Lehene @clehene
      http://hstack.org
      Credits: Andrei Dragomir, Adrian Muraru, Andrei Dulvac, Raluca Podiuc, Tudor Scurtu, Bogdan Dragu, Bogdan Drutu
  • 33. OLAP 101 - Rollup
      Country  Visits  Sale
      USA      4       $50
      Canada   1       $0
      - Rollup: SELECT COUNT(visits), SUM(sales) GROUP BY country
  • 34. OLAP 101 - Slicing
      Date        Country  City     OS       Browser  Sale
      2012-03-02  USA      NY       Windows  FF       0.0
      2012-03-02  USA      NY       Windows  FF       10.0
      2012-03-03  USA      SF       OSX      Chrome   25.0
      2012-03-03  Canada   Ontario  Linux    Chrome   0.0
      2012-03-04  USA      Chicago  OSX      Safari   15.0
      Summary:
      - 5 visits, 3 days
      - 2 countries (USA: 4, Canada: 1)
      - 4 cities (NY: 2, SF: 1)
      - 3 OS-es (Win: 2, OSX: 2)
      - 3 browsers (FF: 2, Chrome: 2)
      - Sale: 50.0 total, 3 sales
      - Filter or Segment or Slice (WHERE or HAVING)
  • 35. OLAP 101 – Sorting, TOP n
      Browser  Sale
      Chrome   $25
      Safari   $15
      Firefox  $10
      - SELECT SUM(sales) as total GROUP BY browser ORDER BY total DESC

Editor's Notes

  1. How many HBase users?
  2. Data as first class citizen
  3. Check contrast on projector
  4. Just like speed vs. space in general CS/algo. Queries always hit indexes.
  5. Dimensions – read/transform/serialize/deserialize data attributes. Metrics – read/transform/aggregate/serialize. Constraints: ingestion filtering. Report: instrument dimension groups + metrics with aggregations, sorting.
  6. QUERY ENGINE -> INDEX (always realtime)
  7. Initial import/process and NEW reports (not covered) on historical data
  8. 18K regions, upgrade to 0.92
  9. Diagram: HARD TO DIGEST (TOO MUCH INFO, TOO CONDENSED)
  10. Process = aggregate, generate indexes (natural). Query = uses indexes, can do extra aggregation.
  11. LEFT: report definition, NOT a QUERY. LIKE A VIEW - CREATED - THEN QUERIED
  12. Inconsistent
  13. Rowkey = dimensions group -> metrics (right)
  14. GO BACK to EXPLAIN
  15. >100K/sec/thread. REALTIME
  16. Data analysts work with familiar concepts