Big Data with HBase and Hadoop at Adobe
Cosmin Lehene
Programatica, November 2010




Copyright 2009 Adobe Systems Incorporated. All rights reserved. Adobe confidential.
Who am I


Cosmin Lehene

Adobe Services and Infrastructure Team = SaaS services
HBase and Hadoop contributor


clehene@adobe.com
@clehene


http://hstack.org
Why I am here today


§     Riding the elephant since 2008


§     Analytics, BI, Machine Learning
§     Images, Videos, Flash, Web, etc.




Opaque Data (logs, archives)


§     Web traffic
§     Business events
§     User interactions
§     Infrastructure data
          §  Database logs, web server logs, etc.

§     Etc.



Image: http://commons.wikimedia.org/wiki/File:AWI-core-archive_hg.jpg

Image: http://www.google.com/images?q=data+visualization
Can I


§     JOIN everything?
§     Increase user engagement?
§     Increase conversion rate?


§     Make $$$? :)
§     Fast and cheap?


Understand data and extract meaning
Real-time access to meaningful data




Agenda




noSQL 101
Scaling RDBMS


§     Scale up
          §  More memory

          §  More CPU

          §  Faster disks, SAN, etc.




§     Problems
          §  Expensive

          §  There’s a limit

Scaling RDBMS


§     Scale horizontally
          §  Replication (reads)

          §  Sharding/ Horizontal Partitioning (writes)

                  §    Server 1: a-m, Server 2: m-z
          §  Denormalization
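The "Server 1: a-m, Server 2: m-z" split above is range partitioning on the first letter of the key. A minimal sketch, assuming keys starting with "m" belong to the second server (the slide leaves the boundary ambiguous) and using made-up server names:

```python
import bisect

# Hypothetical layout: each shard owns keys from its lower bound (inclusive)
# up to the next shard's lower bound (exclusive).
SHARD_LOWER_BOUNDS = ["a", "m"]        # must stay sorted
SHARD_NAMES = ["server1", "server2"]   # server1: a-l, server2: m-z


def shard_for(key: str) -> str:
    """Return the server that owns `key` under range partitioning."""
    first = key[:1].lower()
    idx = bisect.bisect_right(SHARD_LOWER_BOUNDS, first) - 1
    # Keys sorting before "a" (digits, empty string) fall back to the first shard.
    return SHARD_NAMES[max(idx, 0)]


for user in ["alice", "mike", "zoe"]:
    print(user, "->", shard_for(user))
```

Writes are spread across servers, but every client (or a routing layer) has to know the boundaries, which is part of what makes later repartitioning painful.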




Replication




Sharding




Sharding




Sharding & Replication




Scaling RDBMS problems


§     Hard to repartition/reshard
          §  Pre-allocate shards: 2, 3, 100

§     Query each shard
§     High operational costs
§     Eventual consistency
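To see why resharding is hard, here is a rough sketch that assumes keys are placed with `hash(key) % number_of_shards` (one common scheme, not necessarily the one the talk had in mind). The key names are made up.

```python
import hashlib


def shard_of(key: str, num_shards: int) -> int:
    """Place a key on a shard by hashing it and taking the remainder."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


def moved_fraction(keys, old_shards: int, new_shards: int) -> float:
    """Fraction of keys that land on a different shard after resharding."""
    moved = sum(1 for k in keys if shard_of(k, old_shards) != shard_of(k, new_shards))
    return moved / len(keys)


KEYS = [f"user{i}" for i in range(10_000)]  # invented keys
FRACTION = moved_fraction(KEYS, 2, 3)
print(f"{FRACTION:.0%} of keys change shard when going from 2 to 3 shards")
```

With a uniform hash, roughly two thirds of the keys move when a third shard is added, which is why shards are often pre-allocated well beyond current needs.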




Enter noSQL – the beginning


§     Google: BigTable
§     Amazon: Dynamo
§     Memcached




Data Models


§     Key-value
§     Columnar/Tabular
§     Document oriented
§     Graph




Architectures


§     Distributed hash tables
§     Consistent Hashing
§     Gossip
§     Vector clocks
§     Locality groups
§     Partitioning, replication
§     etc.
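Of the techniques listed above, consistent hashing is easy to show in a few lines. This is a minimal sketch with invented node names and an arbitrary choice of 100 virtual nodes per server; real systems differ in the details.

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Nodes and keys share one hash space; a key belongs to the next node point."""

    def __init__(self, nodes, vnodes=100):
        # Each node gets `vnodes` points on the ring to even out the load.
        self._ring = sorted(
            (_hash(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._points = [point for point, _node in self._ring]

    def node_for(self, key: str) -> str:
        # First point clockwise from the key's hash, wrapping around at the end.
        idx = bisect.bisect(self._points, _hash(key)) % len(self._ring)
        return self._ring[idx][1]


keys = [f"user{i}" for i in range(10_000)]  # invented keys
before = HashRing(["n1", "n2", "n3"])
after = HashRing(["n1", "n2", "n3", "n4"])
moved = sum(1 for k in keys if before.node_for(k) != after.node_for(k))
print(f"{moved / len(keys):.0%} of keys moved after adding n4")
```

Unlike modulo-based sharding, adding a node only moves the keys that the new node takes over (about a quarter of them here); every other key stays where it was.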
Properties


§     Scalability
§     Failover
§     Durability
§     Consistency
§     Availability
§     Partition Tolerance
§     Etc.
Cartesian Product




What do all these have in common?

§     Different data models
§     Different architectures
§     Different properties

noSQL
Hadoop




http://hadoop.apache.org

§     HDFS (distributed fs)
§     Map-reduce (distributed processing)
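The map-reduce model can be imitated on a single machine to show its shape. The sketch below is not Hadoop code: it runs the map, shuffle and reduce steps in one process, over made-up log lines, to count hits per URL.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit (url, 1) for every request line of the form 'user url'."""
    _user, url = line.split()
    yield url, 1


def reduce_phase(url, counts):
    """Reduce: sum all the counts emitted for one URL."""
    return url, sum(counts)


def run_job(lines):
    # Shuffle: group intermediate values by key, as the framework would.
    groups = defaultdict(list)
    for line in lines:
        for key, value in map_phase(line):
            groups[key].append(value)
    return dict(reduce_phase(key, values) for key, values in sorted(groups.items()))


# Invented web traffic log lines, for illustration only.
LOG_LINES = ["u1 /home", "u2 /home", "u1 /pricing", "u3 /home"]
HITS = run_job(LOG_LINES)
print(HITS)
```

In a real cluster the map calls run next to the HDFS blocks that hold the input, and the shuffle moves the grouped values across the network to the reducers.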



Adobe Media Player

    Increase video
    consumption




AMP

 §     Recommendations
 §     Related content
 §     Related users




Video logs

 §     X watched movie A (comedy)
 §     Y watched movie B (drama)
 §     Z watched movie C (thriller)
 §     Z watched movie A (comedy)
 §     X watched movie D (technology)
 §     Y watched movie C (thriller)


Which users are alike?

 §     Compare every 2 users?
 §     5M vectors
 §     120 dimensions
 §     Distance is not enough – needed groups
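A small sketch of the idea, using the six sample events from the video log slide. Each user becomes a vector of per-genre view counts; cosine similarity is one possible distance (an assumption, the talk does not name the measure it used).

```python
import math
from collections import Counter, defaultdict

# The sample events from the video log slide: (user, movie, genre).
LOGS = [
    ("X", "A", "comedy"), ("Y", "B", "drama"), ("Z", "C", "thriller"),
    ("Z", "A", "comedy"), ("X", "D", "technology"), ("Y", "C", "thriller"),
]


def genre_vectors(logs):
    """One sparse vector of genre counts per user."""
    vectors = defaultdict(Counter)
    for user, _movie, genre in logs:
        vectors[user][genre] += 1
    return vectors


def cosine_similarity(a, b):
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def pair_count(users: int) -> int:
    """Number of distinct user pairs a brute-force comparison would need."""
    return users * (users - 1) // 2


VECTORS = genre_vectors(LOGS)
print(cosine_similarity(VECTORS["X"], VECTORS["Z"]))  # share one of two genres
print(cosine_similarity(VECTORS["X"], VECTORS["Y"]))  # no genre in common
print(pair_count(5_000_000))                          # pairs among 5M users
```

Comparing every two of 5M users means about 1.25 × 10^13 pairs, and pairwise distances still do not produce groups, which is what motivates clustering instead.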




How?




Cluster projections

 §     1 month
 §     6 GB
 §     700k users
 §     114 genres
 §     7 nodes
 §     5 hours
 §     27 clusters
Game Constellations

 §     Processing Shockwave logs
Lessons learned


 Need:
           §  Fine grain access

           §  Incremental updates

           §  Deal with changes in the original dataset

           §  Real-time data serving




http://hbase.apache.org

 §     Sparse, distributed, persistent multidimensional
        sorted map
 §     Column oriented store
 §     Autosharding
 §     Data locality

Data Model

  table: row: family: column: value: version

  domain.com/x.swf
                 swf:
                          swf:size = 1876 bytes | 1876 bytes
                          swf:fps = 30
                          swf:avm = 3

                 html:
                          embed = dynamic

                 status:
                          last_crawl = 2010/11/26 | last_crawl = 2010/11/25

  domain.com/y.swf
  domain.com/z.swf
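One way to read the example above is as a nested map in which every cell keeps several timestamped versions. The sketch below models that in plain Python; it is not the HBase client API, and the integer timestamps are invented.

```python
# Toy copy of the example row: row key -> "family:qualifier" -> versions.
# Each version is (timestamp, value), newest first. The integer timestamps
# are invented for this sketch; the values come from the example above.
TABLE = {
    "domain.com/x.swf": {
        "swf:size": [(2, "1876 bytes"), (1, "1876 bytes")],
        "swf:fps": [(1, "30")],
        "swf:avm": [(1, "3")],
        "html:embed": [(1, "dynamic")],
        "status:last_crawl": [(2, "2010/11/26"), (1, "2010/11/25")],
    },
    # Sparse: rows only store the cells they actually have.
    "domain.com/y.swf": {},
    "domain.com/z.swf": {},
}


def read_cell(table, row, column, max_versions=1):
    """Return up to `max_versions` values of one cell, newest first."""
    versions = table.get(row, {}).get(column, [])
    return [value for _timestamp, value in versions[:max_versions]]


print(read_cell(TABLE, "domain.com/x.swf", "status:last_crawl"))
print(read_cell(TABLE, "domain.com/x.swf", "status:last_crawl", max_versions=2))
print(read_cell(TABLE, "domain.com/y.swf", "swf:fps"))
```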





API


§     Get
§     Put
§     Delete
§     Scan
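The four operations can be illustrated with a tiny in-memory stand-in. `TinyTable` is a hypothetical class written for this sketch, not the HBase client; it only mirrors the idea that rows are kept sorted by key, so a scan is a walk over a key range (start inclusive, stop exclusive).

```python
import bisect


class TinyTable:
    """In-memory toy with Get / Put / Delete / Scan over sorted row keys."""

    def __init__(self):
        self._rows = {}   # row key -> {column: value}
        self._keys = []   # row keys, kept sorted for range scans

    def put(self, row, column, value):
        if row not in self._rows:
            bisect.insort(self._keys, row)
            self._rows[row] = {}
        self._rows[row][column] = value

    def get(self, row):
        return dict(self._rows.get(row, {}))

    def delete(self, row):
        if row in self._rows:
            del self._rows[row]
            self._keys.pop(bisect.bisect_left(self._keys, row))

    def scan(self, start_row="", stop_row=None):
        idx = bisect.bisect_left(self._keys, start_row)
        while idx < len(self._keys):
            row = self._keys[idx]
            if stop_row is not None and row >= stop_row:
                break
            yield row, dict(self._rows[row])
            idx += 1


table = TinyTable()
for name in ["z", "x", "y"]:
    table.put(f"domain.com/{name}.swf", "swf:fps", "30")
table.put("other.org/a.swf", "swf:fps", "24")

print(table.get("domain.com/x.swf"))
print([row for row, _cells in table.scan("domain.com/x", "domain.com/z")])
table.delete("domain.com/x.swf")
print([row for row, _cells in table.scan("domain.com/")])
```

Because related rows sort next to each other, choosing the row key is the main schema decision: it determines which questions a single scan can answer.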




Flash

    How is Flash used




How is Flash used in the “wild”?

 §     AVM popularity
 §     Frame rates
 §     Video formats
 §     SWF size
 §     Flex data structures
 §     …


How




How




max 1000
The hard way

 §     Hadoop
 §     Nutch
 §     HBase




Workflow

 §     Crawl:
           §    Nutch (seed: Alexa top-1m.csv)
           §    Detect Flash embed, JavaScript
 §     Browse:
           §    Hadoop + FF + FP (chromeless)
           §    Dump stack traces, memory, swf bytes, etc.
 §     Process:
           §    Parse stack traces, rank, etc.
 §     Export:
           §    HBase: swf table
           §    MD5, swf bytecode, memory, load time, etc.





Benefits

 §     Security fixes
 §     Optimization
 §     Prioritize based on real usage
 §     Testing




SaasBase – HBase++ as a service

 §     Data storage (HBase + HDFS)
           §  Domains, tables,

           §  API: create, put, get, scan




 §     Analytics (HBase + Hadoop + query engine)
           §  Reports, dimensions, metrics

           §  API: query



photoshop.com

    Image analytics




photoshop.com




 §     1B assets (images, videos, other)
           §  120M with EXIF metadata

 §     1.5 petabytes
 §     Home grown distributed storage




Intelligence

 §     Targeting users:
           §    Professionals or Amateurs?
           §    Where are pictures taken?



 §     Targeting partners:
           §    Popular cameras



 §     Tracking campaigns
           §    New accounts
Stats

 §     7 Machines (16 cores, 24 x 10K RPM SATA, 32GB
        RAM, 1Gbps)


 §     Map 700M records
 §     2hrs, 41mins
 §     Map output: 1.9B records (~80GB)
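The figures above imply some rough rates. The arithmetic below assumes 1 GB = 10^9 bytes and an even spread of work across the 7 machines; both are simplifications.

```python
# Figures from the stats above.
records_in = 700_000_000
elapsed_seconds = 2 * 3600 + 41 * 60      # 2 hrs 41 mins = 9660 s
machines = 7
records_out = 1_900_000_000
output_bytes = 80 * 10**9                 # ~80 GB, assuming 1 GB = 10^9 bytes

records_per_second = records_in / elapsed_seconds        # ~72,464 overall
per_machine_per_second = records_per_second / machines   # ~10,352 per machine
fan_out = records_out / records_in                       # ~2.71 outputs per input
bytes_per_output_record = output_bytes / records_out     # ~42 bytes

print(round(records_per_second), round(per_machine_per_second))
print(round(fan_out, 2), round(bytes_per_output_record, 1))
```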



Lessons

 §     SUM, COUNT, AVG, MIN, MAX, GROUP BY,
        HAVING, etc.
 §     Rollup, drilldown, segmentation
 -----------------------------------------------------------


 It’s all about Dimensions & Metrics
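A minimal sketch of what "dimensions and metrics" means in practice: records are grouped by the chosen dimensions, and COUNT, SUM and AVG are computed over one metric. The records and field names are invented for illustration.

```python
from collections import defaultdict

# Invented image records: "camera" and "country" act as dimensions,
# "megabytes" as the metric.
RECORDS = [
    {"camera": "camera_a", "country": "RO", "megabytes": 4},
    {"camera": "camera_a", "country": "US", "megabytes": 6},
    {"camera": "camera_b", "country": "RO", "megabytes": 5},
]


def aggregate(records, dimensions, metric):
    """GROUP BY `dimensions`, computing COUNT, SUM and AVG of `metric`."""
    cells = defaultdict(lambda: {"count": 0, "sum": 0})
    for record in records:
        key = tuple(record[d] for d in dimensions)
        cells[key]["count"] += 1
        cells[key]["sum"] += record[metric]
    return {
        key: dict(cell, avg=cell["sum"] / cell["count"])
        for key, cell in cells.items()
    }


# Drilldown: more dimensions. Rollup: fewer dimensions over the same records.
BY_CAMERA_COUNTRY = aggregate(RECORDS, ["camera", "country"], "megabytes")
BY_CAMERA = aggregate(RECORDS, ["camera"], "megabytes")
print(BY_CAMERA)
```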



Recap



 §     Hadoop + Mahout + PIG (User clusters)
 §     HBase + Hadoop + Nutch + MySQL (Flash analytics)
 §     HBase + Hadoop (EXIF Explorer, image analytics)




Business Catalyst

    Analytics




BC




 §     End to end platform for online businesses
 §     E-commerce, Blogging, CRM, email marketing
 §     Analytics: web traffic, affiliates, sales, etc.




Successtrophe

 §     Analytics is troublesome
           §  SQL database was slow for analytics

 §     Over 50 different reports
 §     Over 100,000 websites
 §     Billions of page views




Requirements

 §     Fast incremental processing
 §     Custom reporting
 §     Filtering, segmentation, rollups, drilldowns
 §     Variable time ranges


 §  Fast


Solution

 §     Continuous processing (every 10 minutes)
 §     Reports definition: dimensions, metrics
 §     Real-time queries: directly from HBase




Workflow

 §     Import logs -> HBase
 §     Incrementally process/index last 24 hours
 §     Serve from HBase
           §  Index scans

           §  Runtime aggregation
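Serving a report from an index scan plus runtime aggregation could look roughly like this. The row key layout `site|report|yyyymmdd` is an assumption made for the sketch (the talk does not show the actual key design), and a sorted Python list stands in for the HBase table.

```python
import bisect

# Pre-aggregated daily cells, sorted by row key as they would be in the store.
# Key layout (hypothetical): site|report|yyyymmdd. Values are invented.
INDEX = sorted([
    ("site42|pageviews|20101101", 120),
    ("site42|pageviews|20101102", 80),
    ("site42|pageviews|20101103", 100),
    ("site42|visits|20101101", 30),
    ("site7|pageviews|20101102", 5),
])
KEYS = [key for key, _value in INDEX]


def query(site, report, first_day, last_day):
    """Scan the key range for [first_day, last_day] and sum at query time."""
    lo = bisect.bisect_left(KEYS, f"{site}|{report}|{first_day}")
    hi = bisect.bisect_right(KEYS, f"{site}|{report}|{last_day}")
    return sum(value for _key, value in INDEX[lo:hi])


print(query("site42", "pageviews", "20101101", "20101102"))
print(query("site42", "pageviews", "20101101", "20101103"))
```

Because the dates sort lexicographically inside the key, any time range maps to one contiguous scan, and only the final sum happens at query time.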




Stats

 §     1 datacenter, 10 months of data = 1 hour, 24 minutes of processing
 §     > 3 Billion report items generated




Lessons

 §     UNIQUE is harder
           §  E.g.: unique visitors, visitor loyalty

 §     Space vs. time
 §     Sorting magic
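Why UNIQUE is harder than SUM or COUNT: daily unique counts cannot simply be added up, because the same visitor can show up on several days. A small illustration with invented visitor ids:

```python
# Invented visitor ids per day.
DAILY_VISITORS = {
    "2010-11-01": {"a", "b", "c"},
    "2010-11-02": {"b", "c", "d"},
}

# Adding up the daily unique counts double-counts returning visitors.
sum_of_daily_uniques = sum(len(v) for v in DAILY_VISITORS.values())

# The correct figure needs the ids themselves (or a compact summary of them),
# not just the per-day counts.
true_uniques = len(set().union(*DAILY_VISITORS.values()))

print(sum_of_daily_uniques, true_uniques)
```

This is the space vs. time trade-off in miniature: keeping the visitor ids around costs storage, while recomputing uniques from raw logs for every time range costs query time.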




Not just web analytics


 X Analytics


 §     Feed in any file format (w3c, apache, tsv, etc.)
 §     Tag the dimensions and metrics
 §     Process (incremental)
 §     Query in real-time


Nothing but the hstack

 §     structured data storage: HBase
 §     file storage: HDFS
 §     data processing: Hadoop




Conclusions

 §     Keep data
 §     Understand data
 §     Explore data
 §     Extract meaning




http://hstack.org
http://hbase.apache.org
http://hadoop.apache.org
http://mahout.apache.org
http://nutch.apache.org
                                                                                          ®	





Copyright 2009 Adobe Systems Incorporated. All rights reserved. Adobe con dential.   72
                                                                                          7

More Related Content

What's hot

Polyglot persistence @ netflix (CDE Meetup)
Polyglot persistence @ netflix (CDE Meetup) Polyglot persistence @ netflix (CDE Meetup)
Polyglot persistence @ netflix (CDE Meetup)
Roopa Tangirala
 
Part 2: Apache Kudu: Extending the Capabilities of Operational and Analytic D...
Part 2: Apache Kudu: Extending the Capabilities of Operational and Analytic D...Part 2: Apache Kudu: Extending the Capabilities of Operational and Analytic D...
Part 2: Apache Kudu: Extending the Capabilities of Operational and Analytic D...
Cloudera, Inc.
 
Introduction to Redis
Introduction to RedisIntroduction to Redis
Introduction to RedisDvir Volk
 
Introduction to the Disruptor
Introduction to the DisruptorIntroduction to the Disruptor
Introduction to the Disruptor
Trisha Gee
 
Tools for Metaspace
Tools for MetaspaceTools for Metaspace
Tools for Metaspace
Takahiro YAMADA
 
Tech Talk: RocksDB Slides by Dhruba Borthakur & Haobo Xu of Facebook
Tech Talk: RocksDB Slides by Dhruba Borthakur & Haobo Xu of FacebookTech Talk: RocksDB Slides by Dhruba Borthakur & Haobo Xu of Facebook
Tech Talk: RocksDB Slides by Dhruba Borthakur & Haobo Xu of Facebook
The Hive
 
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudAmazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Noritaka Sekiyama
 
LLAP: Sub-Second Analytical Queries in Hive
LLAP: Sub-Second Analytical Queries in HiveLLAP: Sub-Second Analytical Queries in Hive
LLAP: Sub-Second Analytical Queries in Hive
DataWorks Summit/Hadoop Summit
 
Best Practice of Compression/Decompression Codes in Apache Spark with Sophia...
 Best Practice of Compression/Decompression Codes in Apache Spark with Sophia... Best Practice of Compression/Decompression Codes in Apache Spark with Sophia...
Best Practice of Compression/Decompression Codes in Apache Spark with Sophia...
Databricks
 
How Uber scaled its Real Time Infrastructure to Trillion events per day
How Uber scaled its Real Time Infrastructure to Trillion events per dayHow Uber scaled its Real Time Infrastructure to Trillion events per day
How Uber scaled its Real Time Infrastructure to Trillion events per day
DataWorks Summit
 
Protecting your data at rest with Apache Kafka by Confluent and Vormetric
Protecting your data at rest with Apache Kafka by Confluent and VormetricProtecting your data at rest with Apache Kafka by Confluent and Vormetric
Protecting your data at rest with Apache Kafka by Confluent and Vormetric
confluent
 
HDFS Selective Wire Encryption
HDFS Selective Wire EncryptionHDFS Selective Wire Encryption
HDFS Selective Wire Encryption
Konstantin V. Shvachko
 
YARN Ready: Integrating to YARN with Tez
YARN Ready: Integrating to YARN with Tez YARN Ready: Integrating to YARN with Tez
YARN Ready: Integrating to YARN with Tez
Hortonworks
 
RocksDB compaction
RocksDB compactionRocksDB compaction
RocksDB compaction
MIJIN AN
 
Fine tuning large LMs
Fine tuning large LMsFine tuning large LMs
Fine tuning large LMs
SylvainGugger
 
Evening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in FlinkEvening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in Flink
Flink Forward
 
Outrageous Performance: RageDB's Experience with the Seastar Framework
Outrageous Performance: RageDB's Experience with the Seastar FrameworkOutrageous Performance: RageDB's Experience with the Seastar Framework
Outrageous Performance: RageDB's Experience with the Seastar Framework
ScyllaDB
 
Faster, better, stronger: The new InnoDB
Faster, better, stronger: The new InnoDBFaster, better, stronger: The new InnoDB
Faster, better, stronger: The new InnoDB
MariaDB plc
 
Thrift vs Protocol Buffers vs Avro - Biased Comparison
Thrift vs Protocol Buffers vs Avro - Biased ComparisonThrift vs Protocol Buffers vs Avro - Biased Comparison
Thrift vs Protocol Buffers vs Avro - Biased Comparison
Igor Anishchenko
 
Scalability, Availability & Stability Patterns
Scalability, Availability & Stability PatternsScalability, Availability & Stability Patterns
Scalability, Availability & Stability Patterns
Jonas Bonér
 

HBase and Hadoop at Adobe

  • 1. Big Data with HBase and Hadoop at Adobe. Cosmin Lehene. Programatica, November 2010. Copyright 2009 Adobe Systems Incorporated. All rights reserved. Adobe confidential.
  • 2. Who am I: Cosmin Lehene. Adobe Services and Infrastructure Team = SaaS services. HBase and Hadoop contributor. clehene@adobe.com, @clehene, http://hstack.org
  • 3. Why I am here today § Riding the elephant since 2008 § Analytics, BI, Machine Learning § Images, Videos, Flash, Web, etc.
  • 4. Opaque Data (logs, archives) § Web traffic § Business events § User interactions § Infrastructure data § Database logs, web server logs, etc. § Etc.
  • 5. http://commons.wikimedia.org/wiki/File:AWI-core-archive_hg.jpg
  • 6. http://www.google.com/images?q=data+visualization
  • 7. Can I § JOIN everything? § Increase user engagement? § Increase conversion rate? § Make $$$? :) § Fast and cheap?
  • 8. Understand data and extract meaning. Real-time access to meaningful data.
  • 9. Agenda
  • 10. noSQL 101
  • 11. Scaling RDBMS § Scale up § More memory § More CPU § Faster disks, SAN, etc. § Problems § Expensive § There’s a limit
  • 12. Scaling RDBMS § Scale horizontally § Replication (reads) § Sharding / Horizontal Partitioning (writes) § Server 1: a-m, Server 2: m-z § Denormalization
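The "Server 1: a-m, Server 2: m-z" split can be sketched as a small range-partitioning router. This is only an illustration: the server names are hypothetical, and because the slide lists "m" on both servers, the sketch assumes keys starting with "m" stay on server 1.

```python
import bisect

# Hypothetical shard map mirroring the slide: keys starting with a-m go to
# server 1, keys starting with n-z go to server 2 (assumed boundary).
SPLIT_POINTS = ["n"]            # first key prefix owned by the next shard
SERVERS = ["server-1", "server-2"]


def shard_for(key: str) -> str:
    """Return the server that owns `key` under range partitioning."""
    index = bisect.bisect_right(SPLIT_POINTS, key[:1].lower())
    return SERVERS[index]


if __name__ == "__main__":
    for user in ["alice", "mallory", "nina", "zed"]:
        print(user, "->", shard_for(user))
```

Adding a third server means choosing a new split point and moving every key on the wrong side of it, which is one reason resharding is operationally expensive.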
  • 13. Replication
  • 14. Sharding
  • 15. Sharding
  • 16. Sharding & Replication
  • 17. Scaling RDBMS problems § Hard to repartition/reshard § Pre allocate shards 2, 3, 100 § Query each shard § High operational costs § Eventual consistency
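One way to see why repartitioning is hard: with the common `hash(key) % shard_count` placement (an assumption here, the slides do not say how shards were assigned), changing the shard count relocates most keys. Going from 2 to 3 shards keeps a key in place only when its hash is 0 or 1 modulo 6, so roughly two thirds of the keys move. The sketch below measures that on hypothetical user ids.

```python
import hashlib


def shard_of(key: str, shard_count: int) -> int:
    """Modulo placement: stable hash of the key, modulo the number of shards."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % shard_count


def moved_fraction(keys, old_count: int, new_count: int) -> float:
    """Fraction of keys whose shard changes when the shard count changes."""
    moved = sum(1 for k in keys if shard_of(k, old_count) != shard_of(k, new_count))
    return moved / len(keys)


if __name__ == "__main__":
    keys = [f"user-{i}" for i in range(10_000)]   # hypothetical ids
    print(f"2 -> 3 shards: {moved_fraction(keys, 2, 3):.0%} of keys move")
```

Pre-allocating many more shards than servers, as the slide suggests, avoids changing the modulus later at the cost of managing many small shards.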
  • 18. Enter noSQL - the beginning § Google: BigTable § Amazon: Dynamo § Memcached
  • 19. Data Models § Key-value § Columnar/Tabular § Document oriented § Graph
  • 20. Architectures § Distributed hash tables § Consistent Hashing § Gossip § Vector clocks § Locality groups § Partitioning, replication § etc.
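Of the techniques listed, consistent hashing is the easiest to show in a few lines. The sketch below is a toy ring with virtual nodes; the class name, node names, and the use of MD5 are all assumptions made for illustration, not details from the talk.

```python
import bisect
import hashlib


def _point(value: str) -> int:
    """Map a string to a position on the ring using a stable hash."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Toy consistent-hash ring with virtual nodes (hypothetical helper)."""

    def __init__(self, nodes=(), vnodes: int = 64):
        self.vnodes = vnodes
        self._points = []   # sorted ring positions
        self._owner = {}    # ring position -> node name
        for node in nodes:
            self.add(node)

    def add(self, node: str) -> None:
        """Place `vnodes` positions for the node on the ring."""
        for i in range(self.vnodes):
            p = _point(f"{node}#{i}")
            self._owner[p] = node
            bisect.insort(self._points, p)

    def node_for(self, key: str) -> str:
        """Owner is the first ring position clockwise from the key, wrapping around."""
        i = bisect.bisect_left(self._points, _point(key)) % len(self._points)
        return self._owner[self._points[i]]


if __name__ == "__main__":
    keys = [f"user-{i}" for i in range(10_000)]
    ring = HashRing(["node-a", "node-b"])
    before = {k: ring.node_for(k) for k in keys}
    ring.add("node-c")
    moved = [k for k in keys if ring.node_for(k) != before[k]]
    print(f"{len(moved) / len(keys):.0%} of keys moved after adding node-c")
```

Unlike modulo placement, adding a node here only reassigns keys to the new node; keys never shuffle between the existing nodes.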
  • 21. Properties § Scalability § Failover § Durability § Consistency § Availability § Partition Tolerance § Etc.
  • 22. Cartesian Product
  • 23. What do all these have in common? (noSQL) § Different data models § Different architectures § Different properties
  • 24. Hadoop http://hadoop.apache.org § HDFS (distributed fs) § Map-reduce (distributed processing)
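The map-reduce model can be sketched in a single process to show its three stages: map, shuffle (group by key), and reduce. This is not the Hadoop API; the function names and the `<user> <url>` log format are assumptions for illustration.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the mapper to every input record and collect (key, value) pairs."""
    for record in records:
        yield from mapper(record)


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Apply the reducer to each key and its list of values."""
    return {key: reducer(key, values) for key, values in groups.items()}


# Hypothetical log format: "<user> <url>", one request per line.
def count_mapper(line):
    user, _url = line.split()
    yield user, 1


def sum_reducer(_key, values):
    return sum(values)


def run_job(lines):
    """Count requests per user with a map, shuffle, reduce pipeline."""
    return reduce_phase(shuffle(map_phase(lines, count_mapper)), sum_reducer)
```

In a real cluster the map and reduce calls run on many machines and the shuffle moves data over the network, but the contract between the stages is the same.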
  • 25. Adobe Media Player: Increase video consumption
  • 26. AMP § Recommendations § Related content § Related users
  • 27. Video logs § X watched movie A (comedy) § Y watched movie B (drama) § Z watched movie C (thriller) § Z watched movie A (comedy) § X watched movie D (technology) § Y watched movie C (thriller)
  • 28. Which users are alike? § Compare every 2 users? § 5M vectors § 120 dimensions § Distance is not enough - needed groups
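A minimal sketch of the idea, using the six events from the "Video logs" slide: each user becomes a sparse vector of genre counts, and two users are compared by the angle between their vectors. Cosine similarity is an assumption here; the slides only mention "distance". The pair count shows why comparing every two users does not scale: 5M users give about 1.25 × 10^13 pairs.

```python
import math
from collections import Counter

# (user, movie, genre) events taken from the "Video logs" slide.
LOGS = [
    ("X", "A", "comedy"),
    ("Y", "B", "drama"),
    ("Z", "C", "thriller"),
    ("Z", "A", "comedy"),
    ("X", "D", "technology"),
    ("Y", "C", "thriller"),
]


def genre_vectors(logs):
    """Build one sparse genre-count vector per user."""
    vectors = {}
    for user, _movie, genre in logs:
        vectors.setdefault(user, Counter())[genre] += 1
    return vectors


def cosine(a, b):
    """Cosine similarity of two sparse vectors stored as Counters."""
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def pair_count(users: int) -> int:
    """Number of distinct user pairs an all-pairs comparison would visit."""
    return users * (users - 1) // 2


if __name__ == "__main__":
    v = genre_vectors(LOGS)
    print("X~Z", cosine(v["X"], v["Z"]), "X~Y", cosine(v["X"], v["Y"]))
    print("pairs for 5M users:", pair_count(5_000_000))
```

X and Z share one genre (comedy) and so come out as partially similar, while X and Y share none.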
  • 29. How?
  • 30. Cluster projections § 1 month § 6GB § 700k Users § 114 genres § 7 nodes § 5 hours § 27 clusters
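The slides do not say which clustering algorithm produced the 27 clusters. As an illustration of grouping user vectors rather than comparing them pairwise, here is a plain k-means sketch; the algorithm choice, the deterministic initialisation (first `k` points), and the sample points are all assumptions.

```python
def nearest(point, centers):
    """Index of the center closest to `point` (squared Euclidean distance)."""
    distances = [sum((p - c) ** 2 for p, c in zip(point, center)) for center in centers]
    return distances.index(min(distances))


def kmeans(points, k, iterations=20):
    """Plain k-means; the first k points serve as the initial centers."""
    centers = [list(p) for p in points[:k]]
    for _ in range(iterations):
        assignment = [nearest(p, centers) for p in points]
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # keep the old center if a cluster ends up empty
                centers[j] = [sum(dim) / len(members) for dim in zip(*members)]
    # Final assignment against the converged centers.
    assignment = [nearest(p, centers) for p in points]
    return centers, assignment


if __name__ == "__main__":
    # Hypothetical 2-dimensional points; the real vectors had 114 genres.
    points = [(0, 0), (10, 10), (0, 1), (10, 11), (1, 0), (11, 10)]
    centers, assignment = kmeans(points, k=2)
    print(assignment)
```

Each iteration costs one pass over the users per cluster center instead of one comparison per pair of users, which is what makes this kind of job practical to distribute.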
  • 31. Game Constellations § Processing Shockwave logs
  • 32. Lessons learned. Need: § Fine grain access § Incremental updates § Deal with changes in the original dataset § Real-time data serving
  • 34. http://hbase.apache.org § Sparse, distributed, persistent multidimensional sorted map § Column oriented store § Autosharding § Data locality
  • 35. Data Model. table: row: family: column: value: version. Row domain.com/x.swf § family swf: swf:size = 1876 bytes | 1876 bytes, swf:fps = 30, swf:avm = 3 § family html: embed = dynamic § family status: last_crawl = 2010/11/26 | last_crawl = 2010/11/25. Other rows: domain.com/y.swf, domain.com/z.swf
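The "multidimensional sorted map" can be pictured as nested dictionaries: row key, then column family, then column, then a list of versions. The sketch below encodes the slide's example row that way. Reading the `|` on the slide as separating two versions of a cell is an assumption, and the version numbers are hypothetical placeholders.

```python
# table -> row key -> column family -> qualifier -> list of (version, value),
# newest version first. Version numbers are hypothetical placeholders; the
# slide only shows that a cell can hold more than one value.
swf_table = {
    "domain.com/x.swf": {
        "swf": {
            "size": [(2, "1876 bytes"), (1, "1876 bytes")],
            "fps": [(1, "30")],
            "avm": [(1, "3")],
        },
        "html": {"embed": [(1, "dynamic")]},
        "status": {"last_crawl": [(2, "2010/11/26"), (1, "2010/11/25")]},
    },
    "domain.com/y.swf": {},
    "domain.com/z.swf": {},
}


def latest(table, row, family, qualifier):
    """Return the newest value of a cell, or None if the cell is absent."""
    versions = table.get(row, {}).get(family, {}).get(qualifier, [])
    return versions[0][1] if versions else None


def rows_in_order(table):
    """Row keys in sorted order, which is how a sorted map exposes them."""
    return sorted(table)
```

The map is sparse in the sense that absent cells simply have no entry: the `y.swf` and `z.swf` rows above carry no columns at all.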
  • 36. API § Get § Put § Delete § Scan
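The four operations can be mimicked with a small in-memory table that keeps its row keys sorted. This is a stand-in for illustration, not the HBase client API: the class and method signatures are hypothetical, and versions are left out.

```python
import bisect


class ToyTable:
    """In-memory stand-in for a sorted table; not the HBase client API."""

    def __init__(self):
        self._keys = []   # row keys kept sorted so scans are range reads
        self._rows = {}   # row key -> {"family:qualifier": value}

    def put(self, row, column, value):
        """Write one cell, creating the row if needed."""
        if row not in self._rows:
            bisect.insort(self._keys, row)
            self._rows[row] = {}
        self._rows[row][column] = value

    def get(self, row):
        """Return a copy of the row's cells, or an empty dict."""
        return dict(self._rows.get(row, {}))

    def delete(self, row):
        """Remove the whole row if it exists."""
        if row in self._rows:
            del self._rows[row]
            self._keys.remove(row)

    def scan(self, start_row="", stop_row=None):
        """Yield (row, cells) for start_row <= row < stop_row in key order."""
        i = bisect.bisect_left(self._keys, start_row)
        while i < len(self._keys) and (stop_row is None or self._keys[i] < stop_row):
            key = self._keys[i]
            yield key, dict(self._rows[key])
            i += 1
```

Because rows are sorted by key, a scan from `"domain.com/"` up to `"domain.com0"` returns every row for that domain in one contiguous read, which is why row key design matters so much with this API.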
  • 37. Flash: How is flash used
  • 38. How is flash used in the “wild”? § AVM popularity § Frame rates § Video formats § SWF size § Flex data structures § …
  • 39. How
  • 40. How max 1000
  • 41. The hard way § Hadoop § Nutch § HBase
  • 42. Workflow § Crawl: § Nutch (seed: top-1m.csv Alexa) § Detect flash embed, javascript § Browse: § Hadoop + FF + FP (chromeless) § Dump stack traces, memory, swf bytes, etc. § Process: § Parse stack traces, rank, etc. § Export: § HBase: swf table § Md5, swf bytecode, memory, load time, etc.
  • 46. Benefits § Security fixes § Optimization § Prioritize based on real usage § Testing
  • 47. SaasBase - HBase++ as a service § Data storage (HBase + HDFS) § Domains, tables § API: create, put, get, scan § Analytics (HBase + Hadoop + query engine) § Reports, dimensions, metrics § API: query
  • 48. photoshop.com: Image analytics
  • 50. photoshop.com § 1B assets (images, videos, other) § 120M with EXIF metadata § 1.5 petabytes § Home grown distributed storage
  • 51. Intelligence § Targeting users: § Professionals or Amateurs? § Where are pictures taken? § Targeting partners: § Popular cameras § Tracking campaigns § New accounts
  • 56. Stats § 7 Machines (16 cores, 24 x 10K RPM SATA, 32GB RAM, 1Gbps) § Map 700M records § 2hrs, 41mins § Map output: 1.9B records (~80GB)
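The figures on this slide imply a few derived rates. The arithmetic below assumes the 2 hrs 41 mins covers the whole pass over the 700M input records and that "~80GB" means decimal gigabytes; under those assumptions the cluster processed roughly 72 thousand input records per second, about 10 thousand per machine.

```python
# Figures from the slide.
MACHINES = 7
INPUT_RECORDS = 700_000_000
MAP_OUTPUT_RECORDS = 1_900_000_000
MAP_OUTPUT_BYTES = 80 * 10**9            # "~80GB", assuming decimal gigabytes
ELAPSED_SECONDS = (2 * 60 + 41) * 60     # 2 hrs 41 mins

# Derived rates (assumes the elapsed time covers the full job).
records_per_second = INPUT_RECORDS / ELAPSED_SECONDS
records_per_second_per_machine = records_per_second / MACHINES
outputs_per_input = MAP_OUTPUT_RECORDS / INPUT_RECORDS
bytes_per_output_record = MAP_OUTPUT_BYTES / MAP_OUTPUT_RECORDS

print(f"{records_per_second:,.0f} input records/s across the cluster")
print(f"{records_per_second_per_machine:,.0f} input records/s per machine")
print(f"{outputs_per_input:.2f} map output records per input record")
print(f"{bytes_per_output_record:.0f} bytes per map output record")
```

The map stage emits close to three records per input record, which is consistent with each asset contributing to several report dimensions.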
  • 57. Lessons § SUM, COUNT, AVG, MIN, MAX, GROUP BY, HAVING, etc. § Rollup, drilldown, segmentation. It’s all about Dimensions & Metrics
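"Dimensions & Metrics" can be read as: group records by a tuple of dimension values, then compute aggregates of a metric per group. The sketch below does that in memory. The records, field names, and values are hypothetical and only loosely modelled on EXIF-style data.

```python
from collections import defaultdict


def aggregate(records, dimensions, metric):
    """GROUP BY `dimensions`, computing COUNT, SUM, MIN, MAX and AVG of `metric`."""
    groups = defaultdict(list)
    for record in records:
        groups[tuple(record[d] for d in dimensions)].append(record[metric])
    return {
        key: {
            "count": len(values),
            "sum": sum(values),
            "min": min(values),
            "max": max(values),
            "avg": sum(values) / len(values),
        }
        for key, values in groups.items()
    }


# Hypothetical EXIF-style records; field names and values are illustrative only.
PHOTOS = [
    {"camera": "cam-a", "country": "RO", "megapixels": 10},
    {"camera": "cam-a", "country": "US", "megapixels": 12},
    {"camera": "cam-b", "country": "RO", "megapixels": 8},
]

if __name__ == "__main__":
    print(aggregate(PHOTOS, ["camera"], "megapixels"))             # rollup
    print(aggregate(PHOTOS, ["camera", "country"], "megapixels"))  # drilldown
```

A rollup uses fewer dimensions and a drilldown uses more; segmentation amounts to filtering the records before calling `aggregate`.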
  • 58. Recap § Hadoop + Mahout + PIG (User clusters) § HBase + Hadoop + Nutch + MySQL (Flash analytics) § HBase + Hadoop (EXIF Explorer, image analytics)
  • 59. Business Catalyst Analytics
  • 60. BC § End to end platform for online businesses § E-commerce, Blogging, CRM, email marketing § Analytics: web traffic, affiliates, sales, etc.
  • 63. Successtrophe § Analytics is troublesome § SQL database was slow for analytics § Over 50 different reports § Over 100,000 websites § Billions of page views
  • 64. Requirements § Fast incremental processing § Custom reporting § Filtering, segmentation, rollups, drilldowns § Variable time ranges § Fast
  • 65. Solution § Continuous processing (every 10 minutes) § Reports definition: dimensions, metrics § Real-time queries: directly from HBase
  • 66. Workflow § Import Logs -> HBase § Incrementally process/index last 24 hours § Serve from HBase § Index scans § Runtime aggregation
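One possible reading of this workflow: each processing run folds new log events into pre-aggregated counters keyed so that a time range becomes a contiguous key range, and queries sum those counters at request time. The sketch below follows that reading. The key layout (`site|hour`), the hourly granularity, and the function names are assumptions, not details given in the slides.

```python
from collections import Counter


def row_key(site: str, hour: str) -> str:
    """Hypothetical index key: site id, then an hour stamp that sorts chronologically."""
    return f"{site}|{hour}"


def process_increment(index: Counter, log_batch) -> None:
    """Fold a new batch of (site, hour) page-view events into the hourly index."""
    for site, hour in log_batch:
        index[row_key(site, hour)] += 1


def page_views(index: Counter, site: str, start_hour: str, stop_hour: str) -> int:
    """Runtime aggregation: sum hourly counters for start_hour <= hour < stop_hour."""
    start, stop = row_key(site, start_hour), row_key(site, stop_hour)
    return sum(count for key, count in sorted(index.items()) if start <= key < stop)


if __name__ == "__main__":
    index = Counter()
    process_increment(index, [("shop", "2010-11-26T09"), ("shop", "2010-11-26T10")])
    process_increment(index, [("shop", "2010-11-26T10"), ("blog", "2010-11-26T10")])
    print(page_views(index, "shop", "2010-11-26T00", "2010-11-27T00"))
```

In a sorted store the range filter in `page_views` would be a scan between two row keys rather than a pass over every counter, which is what keeps variable time ranges fast.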
  • 67. Stats § 1 datacenter, 10 months = 1 hour, 24 minutes § > 3 Billion report items generated
  • 68. Lessons § UNIQUE is harder § E.g.: Unique visitors, Visitor loyalty § Space vs. time § Sorting magic
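A small example of why UNIQUE is harder than SUM or COUNT: page views from two days can be added, but unique visitors cannot, because the same visitor may appear on both days. The visitor ids and counts below are hypothetical.

```python
# Hypothetical visitor ids seen on two consecutive days.
DAY_1 = {"u1", "u2", "u3"}
DAY_2 = {"u2", "u3", "u4"}

page_views = {"day1": 5, "day2": 7}           # additive metric
daily_uniques = {"day1": len(DAY_1), "day2": len(DAY_2)}

total_page_views = sum(page_views.values())   # correct: counts add up
naive_uniques = sum(daily_uniques.values())   # wrong: u2 and u3 counted twice
true_uniques = len(DAY_1 | DAY_2)             # needs the ids, not just counts

print(total_page_views, naive_uniques, true_uniques)
```

This is one way to read the "space vs. time" bullet: either keep the visitor ids per period (more storage) or rescan the raw events for each query (more time).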
  • 69. Not just web analytics: X Analytics § Feed in any file format (w3c, apache, tsv, etc.) § Tag the dimensions and metrics § Process (incremental) § Query in real-time
  • 70. Nothing but the hstack § structured data storage: HBase § file storage: HDFS § data processing: Hadoop
  • 71. Conclusions § Keep data § Understand data § Explore data § Extract meaning
  • 72. http://hstack.org http://hbase.apache.org http://hadoop.apache.org http://mahout.apache.org http://nutch.apache.org