Big Data with
    HBase and
    Hadoop at Adobe
    Cosmin Lehene
    Programatica, November, 2010




Copyright 2009 Adobe Systems Incorporated. All rights reserved. Adobe confidential.
Who am I


Cosmin Lehene

Adobe Services and Infrastructure Team = SaaS services
HBase and Hadoop contributor


clehene@adobe.com
@clehene


http://hstack.org
Why I am here today


§     Riding the elephant since 2008


§     Analytics, BI, Machine Learning
§     Images, Videos, Flash, Web, etc.




Opaque Data (logs, archives)


§     Web traffic
§     Business events
§     User interactions
§     Infrastructure data
          §  Database logs, web server logs, etc.

§     Etc.



http://commons.wikimedia.org/wiki/File:AWI-core-archive_hg.jpg





http://www.google.com/images?q=data+visualization
Can I


§     JOIN everything?
§     Increase user engagement?
§     Increase conversion rate?


§     Make $$$? :)
§     Fast and cheap?


Understand data and extract meaning
Real-time access to meaningful data




Agenda




noSQL 101
Scaling RDBMS


§     Scale up
          §  More memory

          §  More CPU

          §  Faster disks, SAN, etc.




§     Problems
          §  Expensive

          §  There’s a limit

Scaling RDBMS


§     Scale horizontally
          §  Replication (reads)

          §  Sharding / horizontal partitioning (writes)

                  §    Server 1: a-m, Server 2: m-z
          §  Denormalization
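
The "Server 1: a-m, Server 2: m-z" split can be sketched as a range lookup. This is only an illustration, not how any particular RDBMS shards: the server names are hypothetical, keys are assumed to start with an ASCII letter, and because the slide leaves the owner of "m" ambiguous, the sketch assigns it to the second server.

```python
import bisect

# Hypothetical shard map. Each entry is the exclusive upper bound of a shard:
# server1 owns first letters in [a, m), server2 owns [m, z].
UPPER_BOUNDS = ["m", "{"]  # "{" is the ASCII character right after "z"
SERVERS = ["server1", "server2"]


def shard_for(key: str) -> str:
    """Return the server that owns `key`, based on its first letter."""
    first = key[0].lower()
    index = bisect.bisect_right(UPPER_BOUNDS, first)
    return SERVERS[index]


print(shard_for("alice"))    # server1
print(shard_for("mallory"))  # server2
print(shard_for("zoe"))      # server2
```

Changing the boundaries later means moving rows between servers, which is why resharding a live system is painful.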




Replication




Sharding




Sharding




Sharding & Replication




Scaling RDBMS problems


§     Hard to repartition/reshard
          §  Pre-allocate shards: 2, 3, 100

§     Query each shard
§     High operational costs
§     Eventual consistency




Enter noSQL – the beginning


§     Google: BigTable
§     Amazon: Dynamo
§     Memcached




Data Models


§     Key-value
§     Columnar/Tabular
§     Document oriented
§     Graph




Architectures


§     Distributed hash tables
§     Consistent Hashing
§     Gossip
§     Vector clocks
§     Locality groups
§     Partitioning, replication
§     etc.
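
As one illustration of the techniques listed above, here is a minimal consistent-hashing sketch. The class name, node names, and the choice of MD5 with 64 virtual nodes per server are assumptions made for the example, not taken from any specific system.

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    """Map a string to a point on the ring (MD5 is used only for spreading)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Toy consistent-hash ring with virtual nodes (hypothetical helper)."""

    def __init__(self, nodes, vnodes=64):
        # Every node is placed on the ring at `vnodes` pseudo-random points.
        self._ring = sorted(
            (_hash(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def node_for(self, key: str) -> str:
        # A key belongs to the first ring point clockwise from its hash.
        index = bisect.bisect(self._points, _hash(key)) % len(self._ring)
        return self._ring[index][1]


before = HashRing(["a", "b", "c"])
after = HashRing(["a", "b", "c", "d"])
keys = [f"key-{i}" for i in range(1000)]
moved = [k for k in keys if before.node_for(k) != after.node_for(k)]
# Every key that changed owner moved to the new node "d".
print(all(after.node_for(k) == "d" for k in moved))  # True
```

Unlike the fixed a-m / m-z split, adding a node here only relocates the keys that the new node takes over.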
Properties


§     Scalability
§     Failover
§     Durability
§     Consistency
§     Availability
§     Partition Tolerance
§     Etc.
Cartesian Product




What do all these have in common?

§     Different data models
§     Different architectures
§     Different properties

noSQL
Hadoop




                              http://hadoop.apache.org

§     HDFS (distributed fs)
§     Map-reduce (distributed processing)
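
The map-reduce half can be sketched in a few lines of single-process Python. This is not the Hadoop API: the function names and the sample log lines are made up for illustration, and a real job would run the map and reduce phases in parallel across nodes over data stored in HDFS.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the mapper to each record; a mapper emits (key, value) pairs."""
    for record in records:
        yield from mapper(record)


def shuffle(pairs):
    """Group all emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def run_job(records, mapper, reducer):
    """Map, shuffle, then reduce every group to a single result."""
    groups = shuffle(map_phase(records, mapper))
    return {key: reducer(key, values) for key, values in groups.items()}


# Hypothetical web server log lines: count requests per status code.
logs = ["GET /a 200", "GET /b 404", "GET /a 200"]
counts = run_job(
    logs,
    mapper=lambda line: [(line.split()[-1], 1)],
    reducer=lambda key, values: sum(values),
)
print(counts)  # {'200': 2, '404': 1}
```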



Adobe Media Player

    Increase video
    consumption




AMP

 §     Recommendations
 §     Related content
 §     Related users




Video logs

 §     X watched movie A (comedy)
 §     Y watched movie B (drama)
 §     Z watched movie C (thriller)
 §     Z watched movie A (comedy)
 §     X watched movie D (technology)
 §     Y watched movie C (thriller)


Which users are alike?

 §     Compare every 2 users?
 §     5M vectors
 §     120 dimensions
 §     Distance is not enough – needed groups
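
A small sketch of why comparing every two users does not scale, using the watch events from the video log slide. The per-genre count vectors and cosine similarity are assumptions for illustration; the deck does not say which distance was used, and the sketch stops short of the grouping (clustering) step.

```python
from collections import Counter
from math import sqrt

# Watch events from the video log slide, as (user, genre) pairs.
WATCHES = [
    ("X", "comedy"), ("Y", "drama"), ("Z", "thriller"),
    ("Z", "comedy"), ("X", "technology"), ("Y", "thriller"),
]


def genre_vectors(watches):
    """Build one sparse per-genre count vector per user."""
    vectors = {}
    for user, genre in watches:
        vectors.setdefault(user, Counter())[genre] += 1
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def pair_count(n):
    """Number of distinct user pairs among n users."""
    return n * (n - 1) // 2


vectors = genre_vectors(WATCHES)
print(round(cosine(vectors["X"], vectors["Z"]), 3))  # 0.5 (both watched comedy)
print(cosine(vectors["X"], vectors["Y"]))            # 0.0 (no genre in common)
print(pair_count(5_000_000))                         # 12499997500000
```

With 5M users that is roughly 1.25 x 10^13 comparisons, which is the motivation for grouping users into clusters rather than comparing them pairwise.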




How?




Cluster projections

 §     1 month
 §     6GB
 §     700k users
 §     114 genres
 §     7 nodes
 §     5 hours
 §     27 clusters
Game Constellations

 §     Processing Shockwave logs




Lessons learned


 Need:
           §  Fine-grained access

           §  Incremental updates

           §  Deal with changes in the original dataset

           §  Real-time data serving




http://hbase.apache.org

 §     Sparse, distributed, persistent multidimensional sorted map
 §     Column-oriented store
 §     Autosharding
 §     Data locality

Data Model

  table: row: family: column: value: version

  domain.com/x.swf
                 swf:
                          swf:size = 1876 bytes | 1876 bytes
                          swf:fps = 30
                          swf:avm = 3

                 html:
                          embed = dynamic

                 status:
                          last_crawl = 2010/11/26 | last_crawl = 2010/11/25

  domain.com/y.swf
  domain.com/z.swf
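
The layout above can be read as a nested map: row, then "family:column", then version, then value. The sketch below models only that shape in plain Python. `ToyTable` and its methods are hypothetical, not the HBase client API, and integer timestamps stand in for real version stamps.

```python
from collections import defaultdict


class ToyTable:
    """In-memory stand-in for one table: row -> "family:column" -> {ts: value}."""

    def __init__(self):
        self._rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row, column, value, ts):
        # Writing the same cell with a new timestamp adds a version.
        self._rows[row][column][ts] = value

    def get(self, row, column, max_versions=1):
        # Return the newest versions first, like the "a | b" pairs on the slide.
        versions = self._rows.get(row, {}).get(column, {})
        newest_first = sorted(versions, reverse=True)
        return [versions[ts] for ts in newest_first[:max_versions]]


table = ToyTable()
table.put("domain.com/x.swf", "swf:fps", 30, ts=1)
table.put("domain.com/x.swf", "status:last_crawl", "2010/11/25", ts=1)
table.put("domain.com/x.swf", "status:last_crawl", "2010/11/26", ts=2)

print(table.get("domain.com/x.swf", "status:last_crawl"))
# ['2010/11/26']
print(table.get("domain.com/x.swf", "status:last_crawl", max_versions=2))
# ['2010/11/26', '2010/11/25']
```

Rows with no value for a column simply have no entry, which is the "sparse" part of the description.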





API


§     Get
§     Put
§     Delete
§     Scan
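
The four operations make sense together because row keys are kept sorted, so a Scan is a cheap range read. The sketch below is a toy in-memory store, not the HBase client: `SortedStore` and the sample row keys are assumptions for illustration.

```python
import bisect


class SortedStore:
    """Toy key-value store that keeps row keys in sorted order."""

    def __init__(self):
        self._keys = []  # sorted row keys
        self._data = {}  # row key -> value

    def put(self, row, value):
        if row not in self._data:
            bisect.insort(self._keys, row)
        self._data[row] = value

    def get(self, row):
        return self._data.get(row)

    def delete(self, row):
        if row in self._data:
            del self._data[row]
            self._keys.remove(row)

    def scan(self, start, stop):
        """Yield (row, value) for start <= row < stop, in key order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, stop)
        for row in self._keys[lo:hi]:
            yield row, self._data[row]


store = SortedStore()
for url in ["domain.com/z.swf", "other.org/a.swf", "domain.com/x.swf", "domain.com/y.swf"]:
    store.put(url, {"swf:avm": 3})

# "0" sorts right after "/", so this range covers every key under "domain.com/".
print([row for row, _ in store.scan("domain.com/", "domain.com0")])
# ['domain.com/x.swf', 'domain.com/y.swf', 'domain.com/z.swf']
```

Choosing row keys so that related rows sort next to each other is what turns a query into a single contiguous scan.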




Flash

    How is Flash used




How is Flash used in the “wild”?

 §     AVM popularity
 §     Frame rates
 §     Video formats
 §     SWF size
 §     Flex data structures
 §     …


How




How




                                                                                          max 1000


The hard way

 §     Hadoop
 §     Nutch
 §     HBase




Workflow

 §     Crawl:
           §    Nutch (seed: Alexa top-1m.csv)
           §    Detect Flash embed, JavaScript
 §     Browse:
           §    Hadoop + FF + FP (chromeless)
           §    Dump stack traces, memory, SWF bytes, etc.
 §     Process:
           §    Parse stack traces, rank, etc.
 §     Export:
           §    HBase: swf table
           §    MD5, SWF bytecode, memory, load time, etc.





Benefits

 §     Security fixes
 §     Optimization
 §     Prioritize based on real usage
 §     Testing




SaasBase – HBase++ as a service

 §     Data storage (HBase + HDFS)
           §  Domains, tables

           §  API: create, put, get, scan




 §     Analytics (HBase + Hadoop + query engine)
           §  Reports, dimensions, metrics

           §  API: query



photoshop.com

    Image analytics




photoshop.com




 §     1B assets (images, videos, other)
           §  120M with EXIF metadata

 §     1.5 petabytes
 §     Home-grown distributed storage




Intelligence

 §     Targeting users:
           §    Professionals or Amateurs?
           §    Where are pictures taken?



 §     Targeting partners:
           §    Popular cameras



 §     Tracking campaigns
           §    New accounts
Stats

 §     7 machines (16 cores, 24 x 10K RPM SATA, 32GB RAM, 1Gbps)


 §     Map 700M records
 §     2 hrs, 41 mins
 §     Map output: 1.9B records (~80GB)



Lessons

 §     SUM, COUNT, AVG, MIN, MAX, GROUP BY, HAVING, etc.
 §     Rollup, drilldown, segmentation
 -----------------------------------------------------------


 It’s all about Dimensions & Metrics
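
A rough sketch of what "dimensions and metrics" means in code: group records by the chosen dimensions, then keep running aggregates of a metric. The `rollup` helper, the field names, and the sample photo records are hypothetical and only illustrate the idea.

```python
from collections import defaultdict

# Hypothetical image metadata records (not real photoshop.com data).
PHOTOS = [
    {"camera": "CamA", "country": "RO", "size_kb": 100},
    {"camera": "CamA", "country": "US", "size_kb": 300},
    {"camera": "CamB", "country": "RO", "size_kb": 200},
]


def rollup(records, dimensions, metric):
    """GROUP BY `dimensions`, computing COUNT, SUM, MIN, MAX, AVG of `metric`."""
    groups = defaultdict(lambda: {"count": 0, "sum": 0, "min": None, "max": None})
    for record in records:
        key = tuple(record[d] for d in dimensions)
        value = record[metric]
        agg = groups[key]
        agg["count"] += 1
        agg["sum"] += value
        agg["min"] = value if agg["min"] is None else min(agg["min"], value)
        agg["max"] = value if agg["max"] is None else max(agg["max"], value)
    return {k: {**a, "avg": a["sum"] / a["count"]} for k, a in groups.items()}


# Rollup: one dimension. Drilldown: add a second dimension.
by_camera = rollup(PHOTOS, ["camera"], "size_kb")
by_camera_country = rollup(PHOTOS, ["camera", "country"], "size_kb")
print(by_camera[("CamA",)]["avg"])  # 200.0
print(len(by_camera_country))       # 3
```

Each running aggregate can be updated one record at a time, which is what makes these operations a natural fit for a map-reduce job.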



Recap



 §     Hadoop + Mahout + Pig (user clusters)
 §     HBase + Hadoop + Nutch + MySQL (Flash analytics)
 §     HBase + Hadoop (EXIF Explorer, image analytics)




Business Catalyst

    Analytics




BC




 §     End to end platform for online businesses
 §     E-commerce, Blogging, CRM, email marketing
 §     Analytics: web traffic, affiliates, sales, etc.




Successtrophe

 §     Analytics is troublesome
           §  SQL database was slow for analytics

 §     Over 50 different reports
 §     Over 100,000 websites
 §     Billions of page views




Requirements

 §     Fast incremental processing
 §     Custom reporting
 §     Filtering, segmentation, rollups, drilldowns
 §     Variable time ranges


 §  Fast


Solution

 §     Continuous processing (every 10 minutes)
 §     Report definitions: dimensions, metrics
 §     Real-time queries: directly from HBase




Workflow

 §     Import logs -> HBase
 §     Incrementally process/index last 24 hours
 §     Serve from HBase
           §  Index scans

           §  Runtime aggregation
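
One way to picture "index scans" plus "runtime aggregation": precompute hourly counts under row keys that sort by site, metric, and time, then answer a query by scanning a key range and summing. The key layout and the numbers below are assumptions for illustration, not the actual Business Catalyst schema.

```python
import bisect

# Hypothetical precomputed index rows: "<site>|<metric>|<YYYYMMDDHH>" -> count.
INDEX = {
    "site1|pageviews|2010112600": 10,
    "site1|pageviews|2010112601": 7,
    "site1|pageviews|2010112700": 5,
    "site2|pageviews|2010112600": 99,
}
SORTED_KEYS = sorted(INDEX)  # stands in for the store's sorted row order


def query(site, metric, start_hour, end_hour):
    """Sum counts for hours in [start_hour, end_hour) with one range scan."""
    lo = bisect.bisect_left(SORTED_KEYS, f"{site}|{metric}|{start_hour}")
    hi = bisect.bisect_left(SORTED_KEYS, f"{site}|{metric}|{end_hour}")
    return sum(INDEX[key] for key in SORTED_KEYS[lo:hi])


print(query("site1", "pageviews", "2010112600", "2010112700"))  # 17
print(query("site1", "pageviews", "2010112600", "2010112800"))  # 22
```

Because the time bucket is the last part of the key, any time range for one site and metric is a contiguous slice, so variable time ranges cost one scan plus a sum.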




Stats

 §     10 months of data from 1 datacenter processed in 1 hour, 24 minutes
 §     > 3 Billion report items generated




Lessons

 §     UNIQUE is harder
           §  E.g.: unique visitors, visitor loyalty

 §     Space vs. time
 §     Sorting magic
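
A tiny example of why UNIQUE is harder than SUM: page views from separate days can be added, but unique visitors cannot, because the same visitor may appear on several days. The data and function names are made up for illustration.

```python
# Hypothetical per-day data: visitor ids seen and page views served.
DAILY_VISITORS = {
    "2010-11-25": {"u1", "u2"},
    "2010-11-26": {"u2", "u3"},
}
PAGE_VIEWS = {"2010-11-25": 5, "2010-11-26": 4}


def total_page_views(days):
    """Additive metric: daily values can simply be summed."""
    return sum(PAGE_VIEWS[day] for day in days)


def naive_unique_visitors(days):
    """Wrong: summing daily uniques double-counts returning visitors."""
    return sum(len(DAILY_VISITORS[day]) for day in days)


def exact_unique_visitors(days):
    """Right, but it needs the visitor ids themselves, not just daily counts."""
    return len(set().union(*(DAILY_VISITORS[day] for day in days)))


days = ["2010-11-25", "2010-11-26"]
print(total_page_views(days))       # 9
print(naive_unique_visitors(days))  # 4
print(exact_unique_visitors(days))  # 3
```

Keeping the id sets around for every bucket is the space side of the space vs. time trade-off.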




Not just web analytics


 X Analytics


 §     Feed in any file format (W3C, Apache, TSV, etc.)
 §     Tag the dimensions and metrics
 §     Process (incremental)
 §     Query in real-time


Nothing but the hstack

 §     Structured data storage: HBase
 §     File storage: HDFS
 §     Data processing: Hadoop




Conclusions

 §     Keep data
 §     Understand data
 §     Explore data
 §     Extract meaning




http://hstack.org
http://hbase.apache.org
http://hadoop.apache.org
http://mahout.apache.org
http://nutch.apache.org





Copyright 2009 Adobe Systems Incorporated. All rights reserved. Adobe con dential.   72
                                                                                          7

More Related Content

What's hot

Best Practices for Running PostgreSQL on AWS - DAT314 - re:Invent 2017
Best Practices for Running PostgreSQL on AWS - DAT314 - re:Invent 2017Best Practices for Running PostgreSQL on AWS - DAT314 - re:Invent 2017
Best Practices for Running PostgreSQL on AWS - DAT314 - re:Invent 2017Amazon Web Services
 
(BDT318) How Netflix Handles Up To 8 Million Events Per Second
(BDT318) How Netflix Handles Up To 8 Million Events Per Second(BDT318) How Netflix Handles Up To 8 Million Events Per Second
(BDT318) How Netflix Handles Up To 8 Million Events Per SecondAmazon Web Services
 
Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...
Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...
Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...Edureka!
 
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...Spark Summit
 
Spark (Structured) Streaming vs. Kafka Streams
Spark (Structured) Streaming vs. Kafka StreamsSpark (Structured) Streaming vs. Kafka Streams
Spark (Structured) Streaming vs. Kafka StreamsGuido Schmutz
 
What is in a Lucene index?
What is in a Lucene index?What is in a Lucene index?
What is in a Lucene index?lucenerevolution
 
Parquet Strata/Hadoop World, New York 2013
Parquet Strata/Hadoop World, New York 2013Parquet Strata/Hadoop World, New York 2013
Parquet Strata/Hadoop World, New York 2013Julien Le Dem
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsDataWorks Summit
 
Enabling Vectorized Engine in Apache Spark
Enabling Vectorized Engine in Apache SparkEnabling Vectorized Engine in Apache Spark
Enabling Vectorized Engine in Apache SparkKazuaki Ishizaki
 
SQL Performance Improvements At a Glance in Apache Spark 3.0
SQL Performance Improvements At a Glance in Apache Spark 3.0SQL Performance Improvements At a Glance in Apache Spark 3.0
SQL Performance Improvements At a Glance in Apache Spark 3.0Kazuaki Ishizaki
 
Top 5 Mistakes to Avoid When Writing Apache Spark Applications
Top 5 Mistakes to Avoid When Writing Apache Spark ApplicationsTop 5 Mistakes to Avoid When Writing Apache Spark Applications
Top 5 Mistakes to Avoid When Writing Apache Spark ApplicationsCloudera, Inc.
 
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...Databricks
 
Extreme Replication - Performance Tuning Oracle GoldenGate
Extreme Replication - Performance Tuning Oracle GoldenGateExtreme Replication - Performance Tuning Oracle GoldenGate
Extreme Replication - Performance Tuning Oracle GoldenGateBobby Curtis
 
Apache Sparkにおけるメモリ - アプリケーションを落とさないメモリ設計手法 -
Apache Sparkにおけるメモリ - アプリケーションを落とさないメモリ設計手法 -Apache Sparkにおけるメモリ - アプリケーションを落とさないメモリ設計手法 -
Apache Sparkにおけるメモリ - アプリケーションを落とさないメモリ設計手法 -Yoshiyasu SAEKI
 
Improving Data Locality for Spark Jobs on Kubernetes Using Alluxio
Improving Data Locality for Spark Jobs on Kubernetes Using AlluxioImproving Data Locality for Spark Jobs on Kubernetes Using Alluxio
Improving Data Locality for Spark Jobs on Kubernetes Using AlluxioAlluxio, Inc.
 
Introduction to Spark with Python
Introduction to Spark with PythonIntroduction to Spark with Python
Introduction to Spark with PythonGokhan Atil
 
대용량 데이터레이크 마이그레이션 사례 공유 [카카오게임즈 - 레벨 200] - 조은희, 팀장, 카카오게임즈 ::: Games on AWS ...
대용량 데이터레이크 마이그레이션 사례 공유 [카카오게임즈 - 레벨 200] - 조은희, 팀장, 카카오게임즈 ::: Games on AWS ...대용량 데이터레이크 마이그레이션 사례 공유 [카카오게임즈 - 레벨 200] - 조은희, 팀장, 카카오게임즈 ::: Games on AWS ...
대용량 데이터레이크 마이그레이션 사례 공유 [카카오게임즈 - 레벨 200] - 조은희, 팀장, 카카오게임즈 ::: Games on AWS ...Amazon Web Services Korea
 
Dr. Elephant for Monitoring and Tuning Apache Spark Jobs on Hadoop with Carl ...
Dr. Elephant for Monitoring and Tuning Apache Spark Jobs on Hadoop with Carl ...Dr. Elephant for Monitoring and Tuning Apache Spark Jobs on Hadoop with Carl ...
Dr. Elephant for Monitoring and Tuning Apache Spark Jobs on Hadoop with Carl ...Databricks
 
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudAmazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudNoritaka Sekiyama
 
GCP Data Engineer cheatsheet
GCP Data Engineer cheatsheetGCP Data Engineer cheatsheet
GCP Data Engineer cheatsheetGuang Xu
 

HBase and Hadoop at Adobe

  • 1. Big Data with HBase and Hadoop at Adobe
      • Cosmin Lehene
      • Programatica, November 2010
      • Copyright 2009 Adobe Systems Incorporated. All rights reserved. Adobe confidential.
  • 2. Who am I
      • Cosmin Lehene
      • Adobe Services and Infrastructure Team = SaaS services
      • HBase and Hadoop contributor
      • clehene@adobe.com
      • @clehene
      • http://hstack.org
  • 3. Why I am here today
      • Riding the elephant since 2008
      • Analytics, BI, Machine Learning
      • Images, Videos, Flash, Web, etc.
  • 4. Opaque Data (logs, archives)
      • Web traffic
      • Business events
      • User interactions
      • Infrastructure data
          • Database logs, web server logs, etc.
      • Etc.
  • 5. http://commons.wikimedia.org/wiki/File:AWI-core-archive_hg.jpg
  • 6. http://www.google.com/images?q=data+visualization
  • 7. Can I
      • JOIN everything?
      • Increase user engagement?
      • Increase conversion rate?
      • Make $$$? :)
      • Fast and cheap?
  • 8. Understand data and extract meaning
      • Real-time access to meaningful data
  • 9. Agenda
  • 10. noSQL 101
  • 11. Scaling RDBMS
      • Scale up
          • More memory
          • More CPU
          • Faster disks, SAN, etc.
      • Problems
          • Expensive
          • There’s a limit
  • 12. Scaling RDBMS
      • Scale horizontally
          • Replication (reads)
          • Sharding / horizontal partitioning (writes)
              • Server 1: a-m, Server 2: m-z
          • Denormalization
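The a-m / m-z split on slide 12 amounts to range-based routing on the key. Below is a minimal Python sketch of that idea. It assumes (the slide does not say) that keys starting with "m" stay on server 1 and that routing looks only at the first letter; `SHARD_UPPER_BOUNDS`, `SHARD_NAMES`, and `shard_for` are illustrative names, not part of any system described in the talk.

```python
import bisect

# Assumed layout, modeled on the slide: first letters up to "m" live on
# server 1, the remaining letters up to "z" on server 2.
SHARD_UPPER_BOUNDS = ["m", "z"]   # inclusive upper bound per shard
SHARD_NAMES = ["server-1", "server-2"]


def shard_for(key: str) -> str:
    """Return the shard that owns `key`, routing on its first letter."""
    if not key:
        raise ValueError("empty key")
    first = key[0].lower()
    # Index of the first shard whose upper bound is >= the first letter.
    idx = bisect.bisect_left(SHARD_UPPER_BOUNDS, first)
    if idx == len(SHARD_NAMES):
        raise ValueError(f"no shard covers key {key!r}")
    return SHARD_NAMES[idx]


print(shard_for("alice"))    # server-1
print(shard_for("mallory"))  # server-1
print(shard_for("zoe"))      # server-2
```

Adding a third server under this scheme means changing the bounds and moving every row whose owner changed, which is the resharding cost the "Scaling RDBMS problems" slide lists.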
  • 13. Replication
  • 14. Sharding
  • 15. Sharding
  • 16. Sharding & Replication
  • 17. Scaling RDBMS problems
      • Hard to repartition/reshard
      • Pre-allocate shards 2, 3, 100
      • Query each shard
      • High operational costs
      • Eventual consistency
  • 18. Enter noSQL – the beginning
      • Google: BigTable
      • Amazon: Dynamo
      • Memcached
  • 19. Data Models
      • Key-value
      • Columnar/Tabular
      • Document oriented
      • Graph
  • 20. Architectures
      • Distributed hash tables
      • Consistent Hashing
      • Gossip
      • Vector clocks
      • Locality groups
      • Partitioning, replication
      • etc.
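Of the techniques named on slide 20, consistent hashing is the one most directly aimed at the resharding pain described earlier. The sketch below is a toy ring with virtual nodes, written for illustration only: the `HashRing` class, the node names, the choice of SHA-256, and the 64 virtual nodes per server are all assumptions, not details from the talk or from any particular store.

```python
import bisect
import hashlib


class HashRing:
    """Toy consistent-hash ring with virtual nodes."""

    def __init__(self, nodes, vnodes=64):
        if not nodes:
            raise ValueError("at least one node is required")
        # Each node is placed on the ring `vnodes` times to even out the load.
        points = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self._hashes = [h for h, _ in points]
        self._owners = [node for _, node in points]

    @staticmethod
    def _hash(value: str) -> int:
        return int(hashlib.sha256(value.encode("utf-8")).hexdigest(), 16)

    def node_for(self, key: str) -> str:
        """Walk clockwise from the key's hash to the next node point."""
        idx = bisect.bisect_right(self._hashes, self._hash(key)) % len(self._hashes)
        return self._owners[idx]


keys = [f"user-{i}" for i in range(1000)]
before = HashRing(["node-a", "node-b", "node-c"])
after = HashRing(["node-a", "node-b", "node-c", "node-d"])
moved = [k for k in keys if before.node_for(k) != after.node_for(k)]
# Every key that moved now belongs to the new node; nothing shuffles
# between the old nodes.
assert all(after.node_for(k) == "node-d" for k in moved)
```

Compared with a fixed range split, only the keys claimed by the new node have to move when the cluster grows.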
  • 21. Properties
      • Scalability
      • Failover
      • Durability
      • Consistency
      • Availability
      • Partition Tolerance
      • Etc.
  • 22. Cartesian Product
  • 23. What do all these have in common
      • Different data models
      • Different architectures
      • Different properties
      • noSQL
  • 24. Hadoop
      • http://hadoop.apache.org
      • HDFS (distributed file system)
      • Map-reduce (distributed processing)
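The map-reduce model named on slide 24 splits a job into a map step that emits key/value pairs, a shuffle that groups values by key, and a reduce step that runs once per key. The single-process Python sketch below only illustrates that flow; word counting is an example chosen here, not one from the talk, and real Hadoop distributes each phase across machines reading from HDFS.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse the values of one key into a single result."""
    return key, sum(values)


def run_job(lines):
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, vs) for k, vs in shuffle(pairs).items())


print(run_job(["hbase hadoop", "hadoop hdfs hadoop"]))
# {'hbase': 1, 'hadoop': 3, 'hdfs': 1}
```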
  • 25. Adobe Media Player
      • Increase video consumption
  • 26. AMP
      • Recommendations
      • Related content
      • Related users
  • 27. Video logs
      • X watched movie A (comedy)
      • Y watched movie B (drama)
      • Z watched movie C (thriller)
      • Z watched movie A (comedy)
      • X watched movie D (technology)
      • Y watched movie C (thriller)
  • 28. Which users are alike?
      • Compare every 2 users?
      • 5M vectors
      • 120 dimensions
      • Distance is not enough – needed groups
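Slides 27 and 28 describe turning view logs into one vector per user and then asking which users are alike. The Python sketch below builds genre-count vectors from the six log lines on slide 27 and compares two users. Cosine similarity is an assumption made here for illustration; the slides only say "distance", and they add that pairwise distance alone was not enough, so this shows the input to the grouping step rather than the grouping itself.

```python
import math
from collections import Counter

# The six log lines from the "Video logs" slide as (user, genre) pairs.
LOGS = [
    ("X", "comedy"), ("Y", "drama"), ("Z", "thriller"),
    ("Z", "comedy"), ("X", "technology"), ("Y", "thriller"),
]

GENRES = sorted({genre for _, genre in LOGS})


def user_vectors(logs):
    """Count views per genre and lay them out as one vector per user."""
    counts = {}
    for user, genre in logs:
        counts.setdefault(user, Counter())[genre] += 1
    return {user: [c[g] for g in GENRES] for user, c in counts.items()}


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


vectors = user_vectors(LOGS)
print(vectors["X"])                        # [1, 0, 1, 0]
print(cosine(vectors["X"], vectors["Y"]))  # 0.0 (no genre in common)

# Comparing every pair of 5M users, as the slide asks:
n_users = 5_000_000
print(n_users * (n_users - 1) // 2)        # 12499997500000 pairs
```

The pair count makes the slide's question concrete: an all-pairs comparison over 5M users is on the order of 10^13 distance computations.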
  • 29. How?
  • 30. Cluster projections
      • 1 month
      • 6GB
      • 700k users
      • 114 genres
      • 7 nodes
      • 5 hours
      • 27 clusters
  • 31. Game Constellations
      • Processing Shockwave logs
  • 32. Lessons learned
      • Need:
          • Fine grain access
          • Incremental updates
          • Deal with changes in the original dataset
          • Real-time data serving
  • 33.
  • 34. http://hbase.apache.org
      • Sparse, distributed, persistent multidimensional sorted map
      • Column oriented store
      • Autosharding
      • Data locality
  • 35. Data Model
      • table: row: family: column: value: version
      • domain.com/x.swf
          • swf: swf:size = 1876 bytes | 1876 bytes; swf:fps = 30; swf:avm = 3
          • html: embed = dynamic
          • status: last_crawl = 2010/11/26 | last_crawl = 2010/11/25
      • domain.com/y.swf
      • domain.com/z.swf
  • 36. API
      • Get
      • Put
      • Delete
      • Scan
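The data model on slide 35 (row, then family:column, then versioned value) and the four operations on slide 36 can be mimicked with an in-memory sorted map. The Python sketch below is a toy stand-in for illustration only: `ToyTable` and its method signatures are hypothetical and are not the HBase client API, the integer versions stand in for timestamps, and the values for the y and z rows are made up.

```python
import bisect


class ToyTable:
    """In-memory stand-in: row -> "family:column" -> [(version, value), ...]."""

    def __init__(self):
        self._rows = {}
        self._sorted_keys = []   # row keys kept in sorted order for scans

    def put(self, row, column, value, version):
        if row not in self._rows:
            bisect.insort(self._sorted_keys, row)
            self._rows[row] = {}
        cells = self._rows[row].setdefault(column, [])
        cells.append((version, value))
        cells.sort(key=lambda cell: cell[0], reverse=True)  # newest first

    def get(self, row, column):
        """Return the newest value of one cell, or None if absent."""
        cells = self._rows.get(row, {}).get(column)
        return cells[0][1] if cells else None

    def delete(self, row):
        if row in self._rows:
            del self._rows[row]
            self._sorted_keys.remove(row)

    def scan(self, start_row, stop_row):
        """Yield rows with start_row <= key < stop_row, in key order."""
        lo = bisect.bisect_left(self._sorted_keys, start_row)
        hi = bisect.bisect_left(self._sorted_keys, stop_row)
        for row in self._sorted_keys[lo:hi]:
            latest = {col: cells[0][1] for col, cells in self._rows[row].items()}
            yield row, latest


table = ToyTable()
table.put("domain.com/x.swf", "swf:fps", 30, version=1)
table.put("domain.com/x.swf", "status:last_crawl", "2010/11/25", version=1)
table.put("domain.com/x.swf", "status:last_crawl", "2010/11/26", version=2)
table.put("domain.com/y.swf", "swf:fps", 24, version=1)   # made-up value
table.put("domain.com/z.swf", "swf:fps", 12, version=1)   # made-up value

print(table.get("domain.com/x.swf", "status:last_crawl"))   # 2010/11/26
print([row for row, _ in table.scan("domain.com/x", "domain.com/z")])
# ['domain.com/x.swf', 'domain.com/y.swf']
```

Because rows are kept sorted by key, a scan over a key range is a contiguous read, which is why row-key design matters so much in this kind of store.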
  • 37. Flash
      • How is Flash used
  • 38. How is Flash used in the “wild”?
      • AVM popularity
      • Frame rates
      • Video formats
      • SWF size
      • Flex data structures
      • …
  • 39. How
  • 40. How
      • max 1000
  • 41. The hard way
      • Hadoop
      • Nutch
      • HBase
  • 42. Workflow
      • Crawl:
          • Nutch (seed: top-1m.csv Alexa)
          • Detect Flash embed, javascript
      • Browse:
          • Hadoop + FF + FP (chromeless)
          • Dump stack traces, memory, swf bytes, etc.
      • Process:
          • Parse stack traces, rank, etc.
      • Export:
          • HBase: swf table
          • Md5, swf bytecode, memory, load time, etc.
  • 43.
  • 44.
  • 45.
  • 46. Benefits
      • Security fixes
      • Optimization
      • Prioritize based on real usage
      • Testing
  • 47. SaasBase – HBase++ as a service
      • Data storage (HBase + HDFS)
          • Domains, tables
          • API: create, put, get, scan
      • Analytics (HBase + Hadoop + query engine)
          • Reports, dimensions, metrics
          • API: query
  • 48. photoshop.com
      • Image analytics
  • 49.
  • 50. photoshop.com
      • 1B assets (images, videos, other)
      • 120M with EXIF metadata
      • 1.5 petabytes
      • Home-grown distributed storage
  • 51. Intelligence
      • Targeting users:
      • Professionals or Amateurs?
      • Where are pictures taken?
      • Targeting partners:
      • Popular cameras
      • Tracking campaigns
      • New accounts
  • 52.
  • 53.
  • 54.
  • 55.
  • 56. Stats
      • 7 machines (16 cores, 24 x 10K RPM SATA, 32GB RAM, 1Gbps)
      • Map 700M records
      • 2 hrs, 41 mins
      • Map output: 1.9B records (~80GB)
  • 57. Lessons
      • SUM, COUNT, AVG, MIN, MAX, GROUP BY, HAVING, etc.
      • Rollup, drilldown, segmentation
      • It’s all about Dimensions & Metrics
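The "dimensions and metrics" lesson on slide 57 can be read as: pick the fields to group by (dimensions), pick the number to aggregate (metric), and the usual SQL aggregates follow. The Python sketch below shows a rollup and a drilldown over the same records. The records, field names, and the `aggregate` helper are invented for illustration and are not taken from the EXIF data set in the talk.

```python
from collections import defaultdict

# Hypothetical image records: two dimensions (camera, country), one metric.
RECORDS = [
    {"camera": "camera-1", "country": "RO", "size_kb": 100},
    {"camera": "camera-1", "country": "US", "size_kb": 300},
    {"camera": "camera-2", "country": "RO", "size_kb": 200},
    {"camera": "camera-1", "country": "RO", "size_kb": 500},
]


def aggregate(records, dimensions, metric):
    """GROUP BY `dimensions`, then compute the common aggregates of `metric`."""
    groups = defaultdict(list)
    for rec in records:
        groups[tuple(rec[d] for d in dimensions)].append(rec[metric])
    return {
        key: {
            "count": len(vals),
            "sum": sum(vals),
            "avg": sum(vals) / len(vals),
            "min": min(vals),
            "max": max(vals),
        }
        for key, vals in groups.items()
    }


rollup = aggregate(RECORDS, ["camera"], "size_kb")
drilldown = aggregate(RECORDS, ["camera", "country"], "size_kb")
print(rollup[("camera-1",)]["sum"])          # 900
print(drilldown[("camera-1", "RO")]["sum"])  # 600
```

Rolling up drops a dimension from the grouping key and drilling down adds one; the aggregation code itself does not change.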
  • 58. Recap
      • Hadoop + Mahout + Pig (User clusters)
      • HBase + Hadoop + Nutch + MySQL (Flash analytics)
      • HBase + Hadoop (EXIF Explorer, image analytics)
  • 59. Business Catalyst
      • Analytics
  • 60. BC
      • End to end platform for online businesses
      • E-commerce, Blogging, CRM, email marketing
      • Analytics: web traffic, affiliates, sales, etc.
  • 61.
  • 62.
  • 63. Successtrophe
      • Analytics is troublesome
      • SQL database was slow for analytics
      • Over 50 different reports
      • Over 100,000 websites
      • Billions of page views
  • 64. Requirements
      • Fast incremental processing
      • Custom reporting
      • Filtering, segmentation, rollups, drilldowns
      • Variable time ranges
      • Fast
  • 65. Solution
      • Continuous processing (every 10 minutes)
      • Report definitions: dimensions, metrics
      • Real-time queries: directly from HBase
  • 66. Workflow
      • Import logs -> HBase
      • Incrementally process/index last 24 hours
      • Serve from HBase
      • Index scans
      • Runtime aggregation
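The serving path on slide 66 (index scans plus runtime aggregation) can be pictured as pre-aggregated rows stored under sortable keys, with a query reading one contiguous key range and summing it on the fly. The Python sketch below assumes a `<site>|<YYYYMMDDHH>` row-key layout and hourly page-view buckets. Both are guesses made for illustration; the talk does not describe the actual key design.

```python
import bisect

# Hypothetical pre-aggregated index: sorted (row_key, page_views) pairs,
# with row keys of the assumed form "<site>|<YYYYMMDDHH>".
INDEX = sorted([
    ("site-a|2010112600", 10),
    ("site-a|2010112601", 15),
    ("site-a|2010112602", 5),
    ("site-b|2010112600", 7),
])
KEYS = [key for key, _ in INDEX]


def page_views(site, start_hour, stop_hour):
    """Scan the rows of `site` in [start_hour, stop_hour) and sum them."""
    lo = bisect.bisect_left(KEYS, f"{site}|{start_hour}")
    hi = bisect.bisect_left(KEYS, f"{site}|{stop_hour}")
    return sum(views for _, views in INDEX[lo:hi])


print(page_views("site-a", "2010112600", "2010112602"))  # 25
print(page_views("site-a", "2010112600", "2010112603"))  # 30
print(page_views("site-b", "2010112600", "2010112601"))  # 7
```

Keeping the buckets small and aggregating at query time is one way to support the variable time ranges listed under Requirements without precomputing every possible range.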
  • 67. Stats
      • 1 datacenter, 10 months = 1 hour, 24 minutes
      • > 3 billion report items generated
  • 68. Lessons
      • UNIQUE is harder
          • E.g.: unique visitors, visitor loyalty
      • Space vs. time
      • Sorting magic
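A short worked example shows why slide 68 calls UNIQUE harder than the other aggregates: sums from separate time buckets can simply be added, but unique counts cannot, because the same visitor may appear in several buckets. The visitor IDs and dates in this Python sketch are made up for illustration.

```python
# Hypothetical visitor IDs seen on two consecutive days.
DAILY_VISITORS = {
    "2010-11-25": {"u1", "u2", "u3"},
    "2010-11-26": {"u2", "u3", "u4"},
}

# Adding the per-day unique counts double-counts returning visitors.
sum_of_daily_uniques = sum(len(ids) for ids in DAILY_VISITORS.values())

# The correct answer needs the IDs themselves, not just the counts.
true_uniques = len(set().union(*DAILY_VISITORS.values()))

print(sum_of_daily_uniques)  # 6
print(true_uniques)          # 4
```

This is the space vs. time trade-off in miniature: either store the ID sets (space) so ranges can be merged later, or recompute uniques from raw data for every requested range (time).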
  • 69. Not just web analytics
      • X Analytics
      • Feed in any file format (w3c, apache, tsv, etc.)
      • Tag the dimensions and metrics
      • Process (incremental)
      • Query in real-time
  • 70. Nothing but the hstack
      • structured data storage: HBase
      • file storage: HDFS
      • data processing: Hadoop
  • 71. Conclusions
      • Keep data
      • Understand data
      • Explore data
      • Extract meaning
  • 72. Links
      • http://hstack.org
      • http://hbase.apache.org
      • http://hadoop.apache.org
      • http://mahout.apache.org
      • http://nutch.apache.org