Low Latency “OLAP” with HBase
     Cosmin Lehene | Adobe




© 2012 Adobe Systems Incorporated. All Rights Reserved. Adobe Confidential.
What we needed … and built


      OLAP Semantics
      Low Latency Ingestion
      High Throughput
      Real-time Query API




      Not hardcoded to web analytics or x-, y-, z-analytics, but extensible
Building Blocks


      Dimensions, Metrics
      Aggregations
      Roll-up, drill-down, slicing and dicing, sorting




OLAP 101 – Queries example




  Date         Country   City      OS        Browser   Sale
  2012-05-21   USA       NY        Windows   FF         0.0
  2012-05-21   USA       NY        Windows   FF        10.0
  2012-05-22   USA       SF        OSX       Chrome    25.0
  2012-05-22   Canada    Ontario   Linux     Chrome     0.0
  2012-05-23   USA       Chicago   OSX       Safari    15.0

  Totals:
      5 visits, 3 days
      2 countries (USA: 4, Canada: 1)
      4 cities (NY: 2, SF: 1)
      3 OS-es (Win: 2, OSX: 2)
      3 browsers (FF: 2, Chrome: 2)
      Sales: 50.0 (3 sales)


OLAP 101 – Queries example

      Rolling up to country level:
  SELECT COUNT(visits), SUM(sales)
  GROUP BY country

      Country   visits   sales
      USA       4        $50
      Canada    1        $0

      “Slicing” by browser:
  SELECT COUNT(visits), SUM(sales)
  WHERE browser = 'FF'
  GROUP BY country

      Country   visits   sales
      USA       2        $10
      Canada    0        $0

      Top browsers by sales:
  SELECT SUM(sales), COUNT(visits)
  GROUP BY browser
  ORDER BY sales DESC

      Browser   sales   visits
      Chrome    $25     2
      Safari    $15     1
      FF        $10     2
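The three queries on this slide can be reproduced over the five sample visits with a small in-memory sketch. The helper name `aggregate` and the tuple layout are assumptions made for illustration; they are not part of SaasBase.

```python
from collections import defaultdict

# The five sample visits from the "Queries example" table.
COLUMNS = ("date", "country", "city", "os", "browser", "sale")
VISITS = [
    ("2012-05-21", "USA", "NY", "Windows", "FF", 0.0),
    ("2012-05-21", "USA", "NY", "Windows", "FF", 10.0),
    ("2012-05-22", "USA", "SF", "OSX", "Chrome", 25.0),
    ("2012-05-22", "Canada", "Ontario", "Linux", "Chrome", 0.0),
    ("2012-05-23", "USA", "Chicago", "OSX", "Safari", 15.0),
]


def aggregate(rows, group_by, where=None):
    """Return {group value: (visit count, total sales)} for the rows kept by `where`."""
    idx = COLUMNS.index(group_by)
    sale_idx = COLUMNS.index("sale")
    totals = defaultdict(lambda: [0, 0.0])
    for row in rows:
        if where is not None and not where(dict(zip(COLUMNS, row))):
            continue
        totals[row[idx]][0] += 1
        totals[row[idx]][1] += row[sale_idx]
    return {key: (visits, sales) for key, (visits, sales) in totals.items()}


# Roll-up to country level.
rollup = aggregate(VISITS, "country")
# Slice: keep only Firefox visits, then group by country.
ff_slice = aggregate(VISITS, "country", where=lambda r: r["browser"] == "FF")
# Top browsers by sales, highest first.
by_browser = aggregate(VISITS, "browser")
top_browsers = sorted(by_browser.items(), key=lambda kv: kv[1][1], reverse=True)
```

Note that filtering before grouping simply drops Canada from the slice, whereas the slide shows it as a row of zeros.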

OLAP – Runtime Aggregation vs. Pre-aggregation


      Aggregate at runtime
            Most flexible
            Fast – scatter gather
            Space efficient
      But
            I/O, CPU intensive
            slow for larger data
            low throughput

      Pre-aggregate
            Fast
            Efficient – O(1)
            High throughput
      But
            More effort to process (latency)
            Combinatorial explosion (space)
            No flexibility
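The "combinatorial explosion (space)" cost of pre-aggregation can be made concrete by counting how many dimension groups exist. The sketch below assumes, for illustration only, that every non-empty subset of dimensions is a candidate pre-aggregated view; it ignores how many distinct values each group holds.

```python
from itertools import combinations


def dimension_groups(dimensions):
    """All non-empty subsets of the dimensions that could be pre-aggregated."""
    groups = []
    for size in range(1, len(dimensions) + 1):
        groups.extend(combinations(dimensions, size))
    return groups


dims = ["date", "country", "city", "os", "browser"]
groups = dimension_groups(dims)
# n dimensions give 2**n - 1 candidate groups, so each added dimension
# roughly doubles the number of views.
more_groups = dimension_groups(dims + ["gender"])
```

With the five dimensions of the sample table there are 31 candidate groups, and adding one more dimension brings that to 63, which is why only the relevant groups are worth materializing.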




Pre-aggregation

      Data needs to be summarized
            Can’t visualize 1B data points (no, not even with Retina display)
            Difficult to comprehend correlations among more than 3 dimensions


      Not all dimension groups are relevant
            Index on an as-needed basis (view selection problem)


      Runtime aggregation == TeraSort for every query?
            Pre-aggregate to reduce cardinality




SaasBase

      We tune both
            pre-aggregation level (ingestion speed + space)
            vs.
            runtime post-aggregation (query speed)


      Think materialized views from RDBMS




SaasBase Domain Model Mapping




SaasBase - Domain Model Mapping




SaasBase - Ingestion, Processing, Indexing, Querying




Ingestion




Ingestion throughput vs. latency


      Historical data (large batches)
            Optimize for throughput
      Increments (latest data, smaller)
            Optimize for latency




Large, granular input strategies

      Slow listing in HDFS
            Archive processed files


      Filtering input
            FileDateFilter (log name patterns: log-YYYY-MM-dd-HH.log)
            TableInputFormat start/stop row
            File Index in HBase (track processed/new files)


      Map tasks overhead - stitching input splits
            400K files => 400K map tasks => overhead, slow reduce copy
            CombineFileInputFormat – 2GB-splits => 500 splits for 1TB
            FixedMappersTableInputFormat (e.g. 5-region splits)
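Two of the strategies above, filtering input by log name pattern and stitching many small files into larger splits, can be sketched without Hadoop. The function names and the greedy packing are assumptions for illustration; the real FileDateFilter and CombineFileInputFormat are not reproduced here.

```python
import re
from datetime import datetime

# Log name pattern from the slide: log-YYYY-MM-dd-HH.log
LOG_NAME = re.compile(r"^log-(\d{4}-\d{2}-\d{2}-\d{2})\.log$")


def in_date_range(filename, start, end):
    """True if the hour encoded in the file name lies in [start, end)."""
    match = LOG_NAME.match(filename)
    if match is None:
        return False
    stamp = datetime.strptime(match.group(1), "%Y-%m-%d-%H")
    return start <= stamp < end


def combine_into_splits(files, max_split_bytes):
    """Greedily pack (name, size) pairs into splits of at most max_split_bytes.

    A single file larger than the limit ends up in a split of its own.
    """
    splits, current, current_size = [], [], 0
    for name, size in files:
        if current and current_size + size > max_split_bytes:
            splits.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        splits.append(current)
    return splits


GB = 1024 ** 3
# 1 TB of 1 GB files packed into 2 GB splits.
one_tb = [("part-%04d" % i, GB) for i in range(1024)]
splits = combine_into_splits(one_tb, 2 * GB)
```

Packing 1 TB into 2 GB splits yields 512 splits, in line with the roughly 500 quoted on the slide.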
Ingestion – Bulk Import

      HFileOutputFormat (HFOF)
            100s X faster than HBase API
            No need to recover from failed jobs
            No unnecessary load on machines




  * No shuffle - global reduce order required!
            e.g. first reduce key needs to be in the first region, last one in the last region
            Watch for uneven partitions
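The global reduce order requirement means each reduce partition has to line up with one region's key range. A minimal sketch of that mapping follows, assuming a hypothetical list of region start keys; it is not the partitioner used by HFileOutputFormat itself.

```python
from bisect import bisect_right
from collections import Counter


def region_for_key(row_key, region_start_keys):
    """Index of the region (and reduce partition) that owns row_key.

    region_start_keys holds the sorted start key of every region except the
    first one, whose start key is empty.
    """
    return bisect_right(region_start_keys, row_key)


def partition_sizes(row_keys, region_start_keys):
    """Count keys per partition, a quick check for uneven partitions."""
    return Counter(region_for_key(key, region_start_keys) for key in row_keys)


# Hypothetical region boundaries for a date-prefixed table.
starts = ["2011-01-01", "2012-01-01"]
sizes = partition_sizes(
    ["2010-12-04/a", "2012-05-22/a", "2012-05-22/b"], starts
)
```

Because partitions follow key ranges rather than hashes, the smallest keys land in the first region and the largest in the last, and the per-partition counts expose skew before the job runs.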


HFOF – FileSizeDatePartitioner

      1 partition(reduce) / day for initial import
      Uneven reduce (partitions) due to data growth over time
            Reduce k: 2010-12-04 = 500MB
            Reduce n: 2012-05-22 = 5GB => slow and will result in a 5GB region




      Balance reduce buckets based on input file sizes and the reduce key
      Generate sub-partitions based on predefined size (e.g. 1GB)
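The balancing idea can be sketched as a planning step that turns per-day input sizes into a number of sub-partitions per day. The function names and the 1 GB target are illustrative assumptions; how keys inside one day are assigned to its sub-partitions must still preserve key order and is not shown.

```python
import math

MB = 1024 ** 2
GB = 1024 ** 3


def sub_partitions(bytes_per_day, target_bytes):
    """Number of reduce partitions each day should get, at least one per day."""
    return {
        day: max(1, math.ceil(size / target_bytes))
        for day, size in bytes_per_day.items()
    }


def partition_offsets(plan):
    """First global partition index for each day, walking the days in key order."""
    offsets, next_index = {}, 0
    for day in sorted(plan):
        offsets[day] = next_index
        next_index += plan[day]
    return offsets, next_index


# The two reduce inputs from the slide.
input_sizes = {"2010-12-04": 500 * MB, "2012-05-22": 5 * GB}
plan = sub_partitions(input_sizes, 1 * GB)
offsets, total_partitions = partition_offsets(plan)
```

Here the 500 MB day keeps a single partition while the 5 GB day is spread over five, so no reducer writes much more than the target size.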

Processing




Processing



      Processing involves reading the input (files, tables, events),
       pre-aggregating it (reducing cardinality) and generating tables
       that can be queried in real time
            1 year: 1B events => 100B data points indexed
            Query => scan 365 data points (e.g. daily page views)
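As a toy illustration of reducing cardinality, the sketch below collapses raw events into one counter per day, so that a date-range query reads one data point per day instead of every event. The event layout and function names are assumptions, not the SaasBase data model.

```python
from collections import Counter


def daily_page_views(events):
    """Collapse (ISO timestamp, page) events into one page-view count per day."""
    return Counter(timestamp[:10] for timestamp, _page in events)


def query(table, start_day, end_day):
    """Sorted (day, views) pairs with start_day <= day < end_day."""
    return [(day, table[day]) for day in sorted(table) if start_day <= day < end_day]


events = [
    ("2012-05-21T09:15:00", "/home"),
    ("2012-05-21T09:16:30", "/pricing"),
    ("2012-05-22T11:00:00", "/home"),
]
table = daily_page_views(events)
result = query(table, "2012-05-21", "2012-05-22")
```

A full-year query against such a table touches at most 365 entries regardless of how many events were ingested.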




      Processing could be either MR or real-time (e.g. Storm)




Processing for OLAP semantics

            GROUP BY (process, query)
            COUNT, SUM, AVG, etc. (process, query)
            SORT (process, query)
            HAVING (mostly query, can define pre-process constraints)




SaasBase vs. SQL Views Comparison




reports.json entities definition




Processing Performance

      read, map, partition, combine, copy, sort, reduce, write


      Read:
            Scan.setCaching() (I/O ~ buffer)
            Scan.setBatch() (avoid timeouts for abnormal input, e.g. 1M hits/visit)
            Even region distribution across cluster (distributes CPU, I/O)
      Map:
            No unnecessary transformations: Bytes.toString(bytes) + Bytes.toBytes(string) (CPU)
            Avoid GC : new X() (CPU, Memory)
            Avoid system calls (context switching)
            Stripping unnecessary data (I/O)


Processing Performance

      Hot (in memory) vs. Cold (on disk, on network) data
            Minimize I/O from disk/network


      Single shot MR job: SuperProcessor
            Emit all groups from one map() call


      Incremental processing
            Date-prefixed rowkey: YYYY-MM-DD (add HH:mm for more granularity)
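The single-shot idea, reading each record once and emitting a key for every dimension group from the same map() call, can be sketched in plain Python. The GROUPS list, the key layout and the function names are hypothetical; the actual SuperProcessor is not shown in the slides.

```python
from collections import Counter

# Hypothetical dimension groups to pre-aggregate in one pass.
GROUPS = [("country",), ("browser",), ("country", "city")]


def map_record(record):
    """Yield one (key, (visits, sales)) pair per dimension group for one record.

    Keys start with the date so that new days append to the end of each index.
    """
    for group in GROUPS:
        key = (record["date"], group, tuple(record[dim] for dim in group))
        yield key, (1, record["sale"])


def reduce_all(pairs):
    """Sum visits and sales per key, standing in for the reduce phase."""
    visits, sales = Counter(), Counter()
    for key, (visit_count, sale) in pairs:
        visits[key] += visit_count
        sales[key] += sale
    return visits, sales


records = [
    {"date": "2012-05-21", "country": "USA", "city": "NY", "browser": "FF", "sale": 10.0},
    {"date": "2012-05-21", "country": "USA", "city": "NY", "browser": "FF", "sale": 0.0},
]
pairs = [pair for record in records for pair in map_record(record)]
visits, sales = reduce_all(pairs)
```

Each input record is read once but contributes to all three groups, which is the point of avoiding one job per group.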




Indexing




HBase natural order: hierarchical representation




Indexing - Why

      Example: top 10 cities
            ~50K [country, city] combinations per day
            Top 10 cities for 1 year =>
            365 (days) X 50K ~= 18M data points scanned
            If you add gender => ~36M
            If you add Device, OS, Browser …


      Might compress well, but think about the environment
      How much energy would you spend for just top 10 cities?



                                                                              * Image from: http://my.neutralexistence.com/images/Green-Earth.jpg


Indexing with HBase “10” < “2”

  GROUP BY year, month, country, city ORDER BY visits DESC LIMIT 10

      Lexicographic sorting

  2012/05/USA/0000000000/
  2012/05/USA/4294961296/San Francisco                                                        = 1000 visits*
  2012/05/USA/4294961396/New York                                                             = 900 visits*
  . . .
  2012/05/USA/9999999999/

      scan 't', {STARTROW => '2012/05/USA/', LIMIT => 10}

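  * Inverting and padding numbers for lexicographic sorting:
    store MAX – visits at a fixed width so that larger counts sort first,
    e.g. 1000 -> 4294961296 and 900 -> 4294961396 in the keys shown here
    (the slide names Long.MAX_VALUE as MAX)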
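The key trick can be sketched as follows: subtract the count from a fixed maximum and zero-pad the result, so that plain string ordering returns the largest counts first. The 10-digit width and the MAX constant below are assumptions chosen to match the key width on the slide, so the resulting numbers differ from the ones shown there.

```python
WIDTH = 10
MAX = 10 ** WIDTH - 1  # 9999999999, an assumed upper bound for any count


def index_key(year, month, country, visits, city):
    """Row key whose lexicographic order equals descending order of visits."""
    inverted = MAX - visits
    return f"{year:04d}/{month:02d}/{country}/{inverted:0{WIDTH}d}/{city}"


cities = [("New York", 900), ("San Francisco", 1000), ("Chicago", 10)]
keys = sorted(index_key(2012, 5, "USA", visits, city) for city, visits in cities)

# Why padding matters: as strings, "10" sorts before "2".
unpadded_is_wrong = "10" < "2"
```

Reading the first rows after the prefix "2012/05/USA/" then returns San Francisco, New York and Chicago in that order, which is the ORDER BY visits DESC LIMIT n behaviour without any sort at query time.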


Query Engine

      Always reads indexed, compact data
      Query parsing
      Scan strategy
            Single vs. multiple scans
            Start/stop rows (prefixes, index positions, etc.)
            Index selection (volatile indexes with incremental processing)
      Deserialization
      Post-aggregation, sorting, fuzzy-sorting etc.
      Paging
      Custom dimension/metric class loading
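For the start/stop row part of the scan strategy, a common way to turn a key prefix into a bounded scan is to use the prefix as the start row and the prefix with its last byte incremented as the exclusive stop row. The sketch below simulates this over a sorted list of keys; the function names are assumptions and no HBase client is involved.

```python
from bisect import bisect_left


def stop_row_for_prefix(prefix: bytes) -> bytes:
    """Smallest key greater than every key that starts with prefix.

    Returns b"" when no such key exists (prefix is empty or all 0xFF bytes),
    which is treated below as "scan to the end".
    """
    trimmed = prefix.rstrip(b"\xff")
    if not trimmed:
        return b""
    return trimmed[:-1] + bytes([trimmed[-1] + 1])


def prefix_scan(sorted_keys, prefix, limit):
    """Return up to `limit` keys that start with prefix, in key order."""
    start = bisect_left(sorted_keys, prefix)
    stop_row = stop_row_for_prefix(prefix)
    end = bisect_left(sorted_keys, stop_row) if stop_row else len(sorted_keys)
    return sorted_keys[start:end][:limit]


keys = sorted([
    b"2012/04/USA/1",
    b"2012/05/USA/1",
    b"2012/05/USA/2",
    b"2012/05/USB/1",
])
page = prefix_scan(keys, b"2012/05/USA/", 10)
```

Paging can reuse the same bounds by restarting the scan just after the last key of the previous page.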




Conclusions

      OLAP semantics on a simple data model
            Data as first class citizen
            Domain Specific “Language” for Dimensions, Metrics, Aggregations
      Tunable performance, resource allocation
      Framework for vertical analytics systems




Thank you!
      Cosmin Lehene @clehene
      http://hstack.org

      Credits:
            Andrei Dragomir
            Adrian Muraru
            Andrei Dulvac
            Raluca Podiuc
            Tudor Scurtu
            Bogdan Dragu
            Bogdan Drutu

OLAP 101 - Rollup

      Country   Visits   Sale
      USA       4        $50
      Canada    1        $0




      Rollup: SELECT COUNT(visits), SUM(sales) GROUP BY country




OLAP 101 - Slicing

  Date         Country   City      OS        Browser   Sale
  2012-03-02   USA       NY        Windows   FF         0.0
  2012-03-02   USA       NY        Windows   FF        10.0
  2012-03-03   USA       SF        OSX       Chrome    25.0
  2012-03-03   Canada    Ontario   Linux     Chrome     0.0
  2012-03-04   USA       Chicago   OSX       Safari    15.0

  Totals:
      5 visits, 3 days
      2 countries (USA: 4, Canada: 1)
      4 cities (NY: 2, SF: 1)
      3 OS-es (Win: 2, OSX: 2)
      3 browsers (FF: 2, Chrome: 2)
      Sales: 50.0 (3 sales)

      Filter or Segment or Slice (WHERE or HAVING)




OLAP 101 – Sorting, TOP n

  Browser   Sale
  Chrome    $25
  Safari    $15
  Firefox   $10




      SELECT SUM(sales) AS total GROUP BY browser ORDER BY total DESC




Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityIES VE
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Strongerpanagenda
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsRavi Sanghani
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterMydbops
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI AgeCprime
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPathCommunity
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesKari Kakkonen
 
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesAssure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesThousandEyes
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersRaghuram Pandurangan
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality AssuranceInflectra
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 

Recently uploaded (20)

Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
Manual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditManual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance Audit
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a reality
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and Insights
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL Router
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI Age
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to Hero
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examples
 
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesAssure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 

Low Latency “OLAP” with HBase - HBaseCon 2012

  • 1. Low Latency “OLAP” with HBase Cosmin Lehene | Adobe © 2012 Adobe Systems Incorporated. All Rights Reserved. Adobe Confidential.
  • 2. What we needed … and built: OLAP semantics; low-latency ingestion; high throughput; real-time query API. Not hardcoded to web analytics or x-, y-, z-analytics, but extensible.
  • 3. Building Blocks: dimensions, metrics; aggregations; roll-up, drill-down, slicing and dicing, sorting.
  • 4. OLAP 101 – Queries example
      Date        Country  City     OS       Browser  Sale
      2012-05-21  USA      NY       Windows  FF       0.0
      2012-05-21  USA      NY       Windows  FF       10.0
      2012-05-22  USA      SF       OSX      Chrome   25.0
      2012-05-22  Canada   Ontario  Linux    Chrome   0.0
      2012-05-23  USA      Chicago  OSX      Safari   15.0
      Totals: 5 visits over 3 days; 2 countries (USA: 4, Canada: 1); 4 cities (NY: 2, SF: 1); 3 OS-es (Win: 2, OSX: 2); 3 browsers (FF: 2, Chrome: 2); 50.0 in 3 sales.
  • 5. OLAP 101 – Queries example
      Rolling up to country level: SELECT COUNT(visits), SUM(sales) GROUP BY country
        Country  visits  sales
        USA      4       $50
        Canada   1       0
      “Slicing” by browser: SELECT COUNT(visits), SUM(sales) GROUP BY country HAVING browser = “FF”
        Country  visits  sales
        USA      2       $10
        Canada   0       0
      Top browsers by sales: SELECT SUM(sales), COUNT(visits) GROUP BY browser ORDER BY sales
        Browser  sales  visits
        Chrome   $25    2
        Safari   $15    1
        FF       $10    2
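
The three queries on slide 5 can be reproduced over the five sample visits with a few lines of Python. This is only an illustrative sketch: the `aggregate` helper and its `where` parameter are assumptions made for this example, not part of SaasBase, and the filter is applied before grouping (as a SQL WHERE would be).

```python
from collections import defaultdict

# The five visits from the "OLAP 101" table: (date, country, city, os, browser, sale).
VISITS = [
    ("2012-05-21", "USA", "NY", "Windows", "FF", 0.0),
    ("2012-05-21", "USA", "NY", "Windows", "FF", 10.0),
    ("2012-05-22", "USA", "SF", "OSX", "Chrome", 25.0),
    ("2012-05-22", "Canada", "Ontario", "Linux", "Chrome", 0.0),
    ("2012-05-23", "USA", "Chicago", "OSX", "Safari", 15.0),
]
COLUMNS = ("date", "country", "city", "os", "browser", "sale")


def aggregate(rows, group_by, where=None):
    """Return {group value: (visit count, total sales)} for one dimension.

    Hypothetical helper: `where` is an optional row predicate applied
    before grouping.
    """
    idx = COLUMNS.index(group_by)
    totals = defaultdict(lambda: [0, 0.0])
    for row in rows:
        if where is not None and not where(row):
            continue
        bucket = totals[row[idx]]
        bucket[0] += 1       # COUNT(visits)
        bucket[1] += row[5]  # SUM(sales)
    return {key: (count, sales) for key, (count, sales) in totals.items()}


# Roll-up to country level.
rollup = aggregate(VISITS, "country")
# Slice: only Firefox visits, still grouped by country.
ff_slice = aggregate(VISITS, "country", where=lambda r: r[4] == "FF")
# Top browsers by sales, highest first.
top_browsers = sorted(
    aggregate(VISITS, "browser").items(),
    key=lambda item: item[1][1],
    reverse=True,
)
```

Unlike the slide, this sketch omits groups with no matching rows, so Canada does not appear in `ff_slice` at all rather than showing zeros.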
  • 6. OLAP – Runtime Aggregation vs. Pre-aggregation
      Aggregate at runtime: most flexible; fast (scatter-gather); space efficient. But: I/O and CPU intensive; slow for larger data; low throughput.
      Pre-aggregate: fast; efficient, O(1); high throughput. But: more effort to process (latency); combinatorial explosion (space); no flexibility.
  • 7. Pre-aggregation. Data needs to be summarized: can’t visualize 1B data points (no, not even with a Retina display); difficult to comprehend correlations among more than 3 dimensions. Not all dimension groups are relevant: index on a needed basis (view selection problem). Runtime aggregation == TeraSort for every query? Pre-aggregate to reduce cardinality.
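
The "combinatorial explosion" and "index on a needed basis" points from slides 6 and 7 can be made concrete by counting dimension groups. The sketch below uses the five dimensions of the sample table; the `needed` list is a hypothetical selection invented for illustration, not the set of reports the talk describes.

```python
from itertools import combinations

DIMENSIONS = ("date", "country", "city", "os", "browser")


def dimension_groups(dimensions):
    """Yield every non-empty subset of dimensions that could be pre-aggregated."""
    for size in range(1, len(dimensions) + 1):
        yield from combinations(dimensions, size)


# Every group a fully pre-aggregated system would have to maintain: 2^n - 1.
all_groups = list(dimension_groups(DIMENSIONS))

# Hypothetical view selection: only the groups that some report actually needs.
needed = [
    ("date",),
    ("date", "country"),
    ("date", "country", "city"),
    ("date", "browser"),
]
```

With n dimensions the number of candidate groups grows as 2^n - 1, which is why choosing a small set of views matters more as dimensions are added.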
  • 8. SaasBase. We tune both: pre-aggregation level vs. runtime post-aggregation; (ingestion speed + space) vs. (query speed). Think materialized views from an RDBMS.
  • 9. SaasBase Domain Model Mapping
  • 10. SaasBase - Domain Model Mapping
  • 11. SaasBase - Ingestion, Processing, Indexing, Querying
  • 12. SaasBase - Ingestion, Processing, Indexing, Querying
  • 13. Ingestion
  • 14. Ingestion throughput vs. latency. Historical data (large batches): optimize for throughput. Increments (latest data, smaller): optimize for latency.
  • 15. Large, granular input strategies. Slow listing in HDFS: archive processed files. Filtering input: FileDateFilter (log name patterns: log-YYYY-MM-dd-HH.log); TableInputFormat start/stop row; file index in HBase (track processed/new files). Map task overhead, stitching input splits: 400K files => 400K map tasks => overhead, slow reduce copy; CombineFileInputFormat with 2GB splits => 500 splits for 1TB; FixedMappersTableInputFormat (e.g. 5-region splits).
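
The idea behind filtering input by log name (slide 15) can be sketched as follows. This is not the FileDateFilter from the talk, whose implementation is not shown; it is a minimal Python stand-in that assumes file names end in `log-YYYY-MM-dd-HH.log` and selects those inside a half-open time window. The function name and the sample paths are hypothetical.

```python
import re
from datetime import datetime

# Matches the hour stamp embedded in names such as log-2012-05-22-00.log.
LOG_NAME = re.compile(r"log-(\d{4}-\d{2}-\d{2}-\d{2})\.log$")


def files_in_window(paths, start, end):
    """Keep paths whose embedded hour falls in [start, end)."""
    selected = []
    for path in paths:
        match = LOG_NAME.search(path)
        if match is None:
            continue  # not a log file following the naming pattern
        stamp = datetime.strptime(match.group(1), "%Y-%m-%d-%H")
        if start <= stamp < end:
            selected.append(path)
    return selected


# Hypothetical listing, used only to exercise the filter.
paths = [
    "/logs/log-2012-05-21-23.log",
    "/logs/log-2012-05-22-00.log",
    "/logs/log-2012-05-23-00.log",
    "/logs/README",
]
one_day = files_in_window(paths, datetime(2012, 5, 22), datetime(2012, 5, 23))
```

Filtering on the name alone avoids opening files that an incremental run has no reason to read.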
  • 16. Ingestion – Bulk Import. HFileOutputFormat (HFOF): 100s X faster than HBase API; no need to recover from failed jobs; no unnecessary load on machines. * No shuffle - global reduce order required! e.g. first reduce key needs to be in the first region, last one in the last region. Watch for uneven partitions.
  • 17. HFOF – FileSizeDatePartitioner. 1 partition (reduce) per day for initial import. Uneven reduce (partitions) due to data growth over time: reduce k: 2010-12-04 = 500MB; reduce n: 2012-05-22 = 5GB => slow and will result in a 5GB region. Balance reduce buckets based on input file sizes and the reduce key. Generate sub-partitions based on a predefined size (e.g. 1GB).
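
The sub-partitioning idea on slide 17 can be sketched as a planning step: given the input size per day, give each day as many reduce partitions as it needs to stay near a target size, numbering them in date order so the global key order that bulk import requires is preserved. This is an assumption-laden sketch, not the FileSizeDatePartitioner itself; `plan_partitions` and its parameters are hypothetical, and how keys are split within a day is left out.

```python
import math

MB = 1024 ** 2
GB = 1024 ** 3


def plan_partitions(bytes_per_day, target_bytes=GB):
    """Map each day to a contiguous range of reduce partition numbers.

    Days are visited in sorted order so partition numbers follow the
    date-prefixed row-key order.
    """
    plan = {}
    next_partition = 0
    for day in sorted(bytes_per_day):
        # At least one partition per day, more when the day exceeds the target.
        count = max(1, math.ceil(bytes_per_day[day] / target_bytes))
        plan[day] = range(next_partition, next_partition + count)
        next_partition += count
    return plan


# The two days quoted on the slide.
plan = plan_partitions({"2010-12-04": 500 * MB, "2012-05-22": 5 * GB})
```

With a 1GB target, the 500MB day keeps a single reducer while the 5GB day is spread over five, instead of producing one slow reducer and one 5GB region.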
  • 18. Processing
  • 19. Processing. Processing involves reading the input (files, tables, events), pre-aggregating it (reducing cardinality) and generating tables that can be queried in real time. 1 year: 1B events => 100B data points indexed. Query => scan 365 data points (e.g. daily page views). Processing could be either MR or real-time (e.g. Storm).
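
Slide 19's point, that a query scans a few hundred pre-aggregated points instead of the raw events, can be illustrated with a toy in-memory version. The sketch assumes events carry ISO timestamps and that the pre-aggregated table is keyed by a `YYYY-MM-DD` string; `pre_aggregate`, `scan`, and the sample events are hypothetical names and data, and the sorted list merely stands in for a table ordered by row key.

```python
from bisect import bisect_left
from collections import Counter


def pre_aggregate(events):
    """Collapse raw (ISO timestamp, page) events into one page-view count per day."""
    daily = Counter()
    for timestamp, _page in events:
        daily[timestamp[:10]] += 1  # "YYYY-MM-DD" prefix of the timestamp
    return sorted(daily.items())  # ordered by day, like rows ordered by row key


def scan(table, start_row, stop_row):
    """Return rows with start_row <= key < stop_row, like a start/stop row scan."""
    keys = [key for key, _ in table]
    return table[bisect_left(keys, start_row):bisect_left(keys, stop_row)]


# Hypothetical raw events.
events = [
    ("2012-05-21T10:00:00", "/a"),
    ("2012-05-21T11:00:00", "/b"),
    ("2012-05-22T09:30:00", "/a"),
    ("2012-05-23T08:00:00", "/c"),
]
table = pre_aggregate(events)
two_days = scan(table, "2012-05-21", "2012-05-23")
```

However many events arrive in a day, the query side still reads one row per day, so a year of daily page views is a scan of 365 rows.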
  • 20. Processing for OLAP semantics: GROUP BY (process, query); COUNT, SUM, AVG, etc. (process, query); SORT (process, query); HAVING (mostly query, can define pre-process constraints).
  • 21. SaasBase vs. SQL Views Comparison
  • 22. reports.json entities definition
  • 23. Processing Performance: read, map, partition, combine, copy, sort, reduce, write. Read: Scan.setCaching() (I/O ~ buffer); Scan.setBatch() (avoid timeouts for abnormal input, e.g. 1M hits/visit); even region distribution across the cluster (distributes CPU, I/O). Map: no unnecessary transformations such as Bytes.toString(bytes) + Bytes.toBytes(string) (CPU); avoid GC: new X() (CPU, memory); avoid system calls (context switching); strip unnecessary data (I/O).
  • 24. Processing Performance. Hot (in memory) vs. cold (on disk, on network) data: minimize I/O from disk/network. Single-shot MR job: SuperProcessor; emit all groups from one map() call. Incremental processing: data format is a YYYY-MM-DD prefixed rowkey (HH:mm for more granularity).
  • 25. Indexing
  • 26. HBase natural order: hierarchical representation
  • 27. Indexing - Why. Example: top 10 cities. ~50K [country, city] combinations per day. Top 10 cities for 1 year => 365 (days) X 50K ~= 15M data points scanned. If you add gender => 30M. If you add Device, OS, Browser … Might compress well, but think about the environment: how much energy would you spend for just top 10 cities? * Image from: http://my.neutralexistence.com/images/Green-Earth.jpg
  • 28. Indexing with HBase
      Lexicographic sorting: “10” < “2”
      GROUP BY year, month, country, city ORDER BY visits DESC LIMIT 10
        2012/05/USA/0000000000/
        2012/05/USA/4294961296/San Francisco = 1000 visits*
        2012/05/USA/4294961396/New York = 900 visits*
        . . .
        2012/05/USA/9999999999/
      scan “t” startrow => “2012/05/USA/”, limit => 10
      * Padding numbers for lexicographic sorting: 1000 -> Long.MAX_VALUE – 1000 = 4294961296
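
The key layout on slide 28 can be sketched in Python to show why inverting and zero-padding the metric turns "ORDER BY visits DESC LIMIT n" into a plain prefix scan. The figures on the slide appear to be illustrative, so this sketch makes its own assumption: a 10-digit field with a ceiling of 9999999999 rather than Long.MAX_VALUE. `index_key` and `top_cities` are hypothetical helpers, and the Chicago and Ontario counts are made up for the example.

```python
WIDTH = 10                 # digits in the padded sort field, as in the slide's keys
CEILING = 10 ** WIDTH - 1  # 9999999999; assumed ceiling, not Long.MAX_VALUE


def index_key(year, month, country, city, visits):
    """Build a row key whose string order is descending visit count."""
    inverted = CEILING - visits  # larger counts give smaller numbers
    return f"{year:04d}/{month:02d}/{country}/{inverted:0{WIDTH}d}/{city}"


def top_cities(keys, prefix, limit):
    """Emulate scan(startrow=prefix, limit=n) over lexicographically sorted keys."""
    matches = [key for key in sorted(keys) if key.startswith(prefix)]
    return [key.rsplit("/", 1)[1] for key in matches[:limit]]


keys = [
    index_key(2012, 5, "USA", "New York", 900),
    index_key(2012, 5, "USA", "San Francisco", 1000),
    index_key(2012, 5, "USA", "Chicago", 75),      # hypothetical count
    index_key(2012, 5, "Canada", "Ontario", 5000), # hypothetical count
]
top_two = top_cities(keys, "2012/05/USA/", 2)
```

Zero-padding matters because keys compare as strings: without it "10" sorts before "2", and the first rows of the scan would no longer be the largest counts.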
  • 29. Query Engine: always reads indexed, compact data; query parsing; scan strategy: single vs. multiple scans, start/stop rows (prefixes, index positions, etc.), index selection (volatile indexes with incremental processing); deserialization; post-aggregation, sorting, fuzzy-sorting, etc.; paging; custom dimension/metric class loading.
  • 30. Conclusions: OLAP semantics on a simple data model; data as first-class citizen; domain-specific “language” for dimensions, metrics, aggregations; tunable performance, resource allocation; framework for vertical analytics systems.
  • 31. Thank you! Cosmin Lehene @clehene http://hstack.org Credits: Andrei Dragomir, Adrian Muraru, Andrei Dulvac, Raluca Podiuc, Tudor Scurtu, Bogdan Dragu, Bogdan Drutu
  • 33. OLAP 101 - Rollup
      Country  Visits  Sale
      USA      4       $50
      Canada   1       $0
      Rollup: SELECT COUNT(visits), SUM(sales) GROUP BY country
  • 34. OLAP 101 - Slicing
      Date        Country  City     OS       Browser  Sale
      2012-03-02  USA      NY       Windows  FF       0.0
      2012-03-02  USA      NY       Windows  FF       10.0
      2012-03-03  USA      S        OSX      Chrome   25.0
      2012-03-03  Canada   Ontario  Linux    Chrome   0.0
      2012-03-04  USA      Chicago  OSX      Safari   15.0
      Totals: 5 visits over 3 days; 2 countries (USA: 4, Canada: 1); 4 cities (NY: 2, SF: 1); 3 OS-es (Win: 2, OSX: 2); 3 browsers (FF: 2, Chrome: 2); 50.0 in 3 sales.
      Filter or Segment or Slice (WHERE or HAVING)
  • 35. OLAP 101 – Sorting, TOP n
      Table columns: Date, Country, City, OS, Browser, Sale. Rows shown (Browser, Sale):
        Chrome   $25
        Safari   $15
        Firefox  $10
      SELECT SUM(sales) as total GROUP BY browser ORDER BY total

Editor's Notes

  1. How many HBase users?
  2. Data as first class citizen
  3. Check contrast on projector
  4. Just like speed vs. space in general CS/algo. Queries always hit indexes.
  5. Dimensions – read/transform/serialize/deserialize data attributes. Metrics – read/transform/aggregate/serialize. Constraints: ingestion filtering. Report: instrument dimension groups + metrics with aggregations, sorting.
  6. QUERY ENGINE -> INDEX (always realtime)
  7. Initial import/process and NEW reports (not covered) on historical data
  8. 18K regions, upgrade to 0.92
  9. Diagram: HARD TO DIGEST (TOO MUCH INFO, TOO CONDENSED)
  10. Process = aggregate, generate indexes (natural). Query = uses indexes, can do extra aggregation.
  11. LEFT: report definition, NOT a QUERY. LIKE A VIEW - CREATED - THEN QUERIED
  12. Inconsistent
  13. Rowkey = dimensions group -> metrics (right)
  14. GO BACK to EXPLAIN
  15. >100K/sec/thread. REALTIME
  16. Data analysts work with familiar concepts