HBase and Hadoop at Adobe

Big Data with HBase and Hadoop at Adobe
Cosmin Lehene
Programatica, November, 2010

Copyright 2009 Adobe Systems Incorporated. All rights reserved. Adobe confidential.

Who am I

Cosmin Lehene

Adobe Services and Infrastructure Team = SaaS services
HBase and Hadoop contributor

clehene@adobe.com
@clehene

http://hstack.org

Why I am here today

§  Riding the elephant since 2008

§  Analytics, BI, Machine Learning
§  Images, Videos, Flash, Web, etc.


Opaque Data (logs, archives)

§  Web traffic
§  Business events
§  User interactions
§  Infrastructure data
§  Database logs, web server logs, etc.

§  Etc.


http://commons.wikimedia.org/wiki/File:AWI-core-archive_hg.jpg


http://www.google.com/images?q=data+visualization

Can I

§  JOIN everything?
§  Increase user engagement?
§  Increase conversion rate?

§  Make $$$? :)
§  Fast and cheap?


Understand data and extract meaning
Real-time access to meaningful data


Agenda


noSQL 101

Scaling RDBMS

§  Scale up
§  More memory

§  More CPU

§  Faster disks, SAN, etc.

§  Problems
§  Expensive

§  There’s a limit


Scaling RDBMS

§  Scale horizontally
§  Replication (reads)

§  Sharding/ Horizontal Partitioning (writes)

§  Server 1: a-m, Server 2: m-z
§  Denormalization
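
The "Server 1: a-m, Server 2: m-z" split above can be sketched as range-based routing. This is only an illustration: the server names, the split points, and the choice to send keys starting with "m" to the second server are assumptions, not details from the talk.

```python
import bisect

# Hypothetical shard map. Each split point is the first key a server owns,
# and ranges are half-open: server1 owns [a, m), server2 owns [m, ...).
SPLIT_POINTS = ["a", "m"]
SERVERS = ["server1", "server2"]


def shard_for(key: str) -> str:
    """Return the server that owns `key` under range partitioning."""
    index = bisect.bisect_right(SPLIT_POINTS, key.lower()) - 1
    # Keys that sort before the first split point fall back to the first server.
    return SERVERS[max(index, 0)]


for name in ("alice", "mike", "Zoe"):
    print(name, "->", shard_for(name))
```

Routing a single key is cheap; a query that is not keyed still has to visit every shard, and moving a split point means physically moving rows.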


Replication


Sharding


Sharding & Replication


Scaling RDBMS problems

§  Hard to repartition/reshard
§  Pre-allocate shards: 2, 3, 100

§  Query each shard
§  High operational costs
§  Eventual consistency


Enter noSQL – the beginning

§  Google: BigTable
§  Amazon: Dynamo
§  Memcached


Data Models

§  Key-value
§  Columnar/Tabular
§  Document oriented
§  Graph


Architectures

§  Distributed hash tables
§  Consistent Hashing
§  Gossip
§  Vector clocks
§  Locality groups
§  Partitioning, replication
§  etc.
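
Of the techniques listed above, consistent hashing is the easiest to show in a few lines. The sketch below is a toy ring, not any particular store's implementation; the class name `HashRing`, the `vnodes` parameter, the node names, and the use of MD5 to place points are all assumptions made for illustration.

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    """Map a string to a position on the ring (MD5 is used only to spread keys)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class HashRing:
    def __init__(self, nodes, vnodes=100):
        self._vnodes = vnodes
        self._points = []  # sorted list of (ring position, node name)
        for node in nodes:
            self.add(node)

    def add(self, node):
        # Each node is placed at several positions to even out the load.
        for i in range(self._vnodes):
            bisect.insort(self._points, (_hash(f"{node}#{i}"), node))

    def remove(self, node):
        self._points = [point for point in self._points if point[1] != node]

    def node_for(self, key):
        # First point clockwise from the key's position, wrapping around.
        # Assumes the ring holds at least one node.
        index = bisect.bisect_left(self._points, (_hash(key), ""))
        return self._points[index % len(self._points)][1]


ring = HashRing(["node-a", "node-b", "node-c"])
print(ring.node_for("user42"))
```

When a node leaves the ring, only the keys it owned move to other nodes; keys owned by the remaining nodes stay where they were, which is the property that makes resharding cheaper than with a plain `hash(key) % n` scheme.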

Properties

§  Scalability
§  Failover
§  Durability
§  Consistency
§  Availability
§  Partition Tolerance
§  Etc.

Cartesian Product


What do all these have in common

§  Different data models
§  Different architectures
§  Different properties

noSQL


Hadoop

http://hadoop.apache.org

§  HDFS (distributed fs)
§  Map-reduce (distributed processing)
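
The map-reduce model can be sketched in a single process: map each record to key/value pairs, group the pairs by key (the shuffle), then reduce each group. This is a toy illustration of the programming model, not the Hadoop API; the helper names and the sample log lines are hypothetical.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the mapper to every record and stream out (key, value) pairs."""
    for record in records:
        yield from mapper(record)


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Collapse each group of values into a single result per key."""
    return {key: reducer(key, values) for key, values in groups.items()}


def run_job(records, mapper, reducer):
    return reduce_phase(shuffle(map_phase(records, mapper)), reducer)


def tokenize(line):
    # Mapper: emit (token, 1) for every whitespace-separated token.
    for token in line.split():
        yield token.lower(), 1


def total(_key, values):
    # Reducer: add up the counts for one token.
    return sum(values)


# Hypothetical log lines, used only to exercise the job.
counts = run_job(["GET /index.html", "GET /video.flv", "POST /login"], tokenize, total)
print(counts)
```

On a real cluster the records live in HDFS blocks, mappers run next to the data, and the shuffle moves pairs across the network; the shape of the mapper and reducer stays the same.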


Adobe Media Player

Increase video consumption


AMP

§  Recommendations
§  Related content
§  Related users


Video logs

§  X watched movie A (comedy)
§  Y watched movie B (drama)
§  Z watched movie C (thriller)
§  Z watched movie A (comedy)
§  X watched movie D (technology)
§  Y watched movie C (thriller)


Which users are alike?

§  Compare every 2 users?
§  5M vectors
§  120 dimensions
§  Distance is not enough – needed groups
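
To make "which users are alike" concrete, the sketch below turns the six log lines from the "Video logs" slide into per-user genre vectors and compares two users with cosine similarity. Cosine similarity is an assumption chosen for illustration; the talk does not say which distance was used, and the real vectors had about 120 dimensions rather than four.

```python
import math
from collections import Counter, defaultdict

# (user, genre) pairs taken from the "Video logs" slide.
LOGS = [
    ("X", "comedy"), ("Y", "drama"), ("Z", "thriller"),
    ("Z", "comedy"), ("X", "technology"), ("Y", "thriller"),
]


def genre_vectors(logs):
    """Build one sparse vector per user: genre -> number of views."""
    vectors = defaultdict(Counter)
    for user, genre in logs:
        vectors[user][genre] += 1
    return vectors


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse genre vectors (0.0 if either is empty)."""
    dot = sum(a[genre] * b[genre] for genre in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


vectors = genre_vectors(LOGS)
print(cosine_similarity(vectors["X"], vectors["Z"]))  # share one genre (comedy)
print(cosine_similarity(vectors["X"], vectors["Y"]))  # share no genre
```

Comparing every pair this way grows quadratically with the number of users, which is one reason pairwise distance alone does not scale to millions of vectors and grouping users into clusters becomes necessary.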


How?


Cluster projections

§  1 month

§  6GB

§  700k Users

§  114 genres

§  7 nodes

§  5 hours

§  27 clusters

Game Constellations

§  Processing Shockwave logs



Lessons learned

Need:
§  Fine grain access

§  Incremental updates

§  Deal with changes in the original dataset

§  Real-time data serving


http://hbase.apache.org

§  Sparse, distributed, persistent multidimensional sorted map
§  Column oriented store
§  Autosharding
§  Data locality


Data Model

table: row: family: column: value: version

domain.com/x.swf
    swf:
        swf:size = 1876 bytes | 1876 bytes
        swf:fps = 30
        swf:avm = 3
    html:
        embed = dynamic
    status:
        last_crawl = 2010/11/26 | last_crawl = 2010/11/25

domain.com/y.swf
domain.com/z.swf
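
The table, row, family, column, version nesting above can be modelled as nested maps. The sketch below is a plain in-memory illustration of that shape, not HBase code; the function names and the integer timestamps are assumptions.

```python
from collections import defaultdict


def new_table():
    # row key -> column family -> column qualifier -> {timestamp: value}
    return defaultdict(lambda: defaultdict(lambda: defaultdict(dict)))


def put(table, row, family, qualifier, value, timestamp):
    """Store one version of a cell; older versions are kept alongside it."""
    table[row][family][qualifier][timestamp] = value


def get(table, row, family, qualifier, max_versions=1):
    """Return up to `max_versions` values of a cell, newest first."""
    versions = table.get(row, {}).get(family, {}).get(qualifier, {})
    newest_first = sorted(versions.items(), reverse=True)
    return [value for _timestamp, value in newest_first[:max_versions]]


swf = new_table()
put(swf, "domain.com/x.swf", "status", "last_crawl", "2010/11/25", timestamp=1)
put(swf, "domain.com/x.swf", "status", "last_crawl", "2010/11/26", timestamp=2)
put(swf, "domain.com/x.swf", "swf", "fps", 30, timestamp=2)

print(get(swf, "domain.com/x.swf", "status", "last_crawl"))
print(get(swf, "domain.com/x.swf", "status", "last_crawl", max_versions=2))
```

Rows that never receive a value for a column simply have no entry for it, which is what "sparse" means in this model.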


API

§  Get
§  Put
§  Delete
§  Scan
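
The four operations above can be sketched against a toy table that keeps its row keys sorted. `ToyTable` and its method signatures are hypothetical and are not the HBase client API; the point is only that a Scan is a walk over a contiguous range of sorted row keys, here start-inclusive and stop-exclusive.

```python
import bisect


class ToyTable:
    """In-memory stand-in for a table whose rows are kept sorted by key."""

    def __init__(self):
        self._keys = []  # sorted row keys
        self._rows = {}  # row key -> {"family:qualifier": value}

    def put(self, row, columns):
        if row not in self._rows:
            bisect.insort(self._keys, row)
            self._rows[row] = {}
        self._rows[row].update(columns)

    def get(self, row):
        return dict(self._rows.get(row, {}))

    def delete(self, row):
        if row in self._rows:
            del self._rows[row]
            self._keys.pop(bisect.bisect_left(self._keys, row))

    def scan(self, start_row="", stop_row=None):
        """Yield (row, columns) in key order from start_row up to, not including, stop_row."""
        index = bisect.bisect_left(self._keys, start_row)
        while index < len(self._keys):
            row = self._keys[index]
            if stop_row is not None and row >= stop_row:
                break
            yield row, dict(self._rows[row])
            index += 1


table = ToyTable()
for name in ("domain.com/z.swf", "domain.com/x.swf", "other.org/a.swf"):
    table.put(name, {"swf:fps": 30})
table.delete("domain.com/z.swf")

# "0" is the character right after "/", so this range covers every key
# that starts with "domain.com/".
rows = [row for row, _columns in table.scan("domain.com/", "domain.com0")]
print(rows)
```

Because rows are sorted, choosing the row key decides which queries become cheap range scans.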



Flash

How is Flash used


How is Flash used in the “wild”?

§  AVM popularity
§  Frame rates
§  Video formats
§  SWF size
§  Flex data structures
§  …


How


How

max 1000


The hard way

§  Hadoop
§  Nutch
§  HBase


Workflow

§  Crawl:
§  Nutch (seed: top-1m.csv Alexa)
§  Detect Flash embed, JavaScript
§  Browse:
§  Hadoop + FF + FP (chromeless)
§  Dump stack traces, memory, swf bytes, etc.
§  Process:
§  Parse stack traces, rank, etc.
§  Export:
§  HBase: swf table
§  MD5, SWF bytecode, memory, load time, etc.


Benefits

§  Security fixes
§  Optimization
§  Prioritize based on real usage
§  Testing


SaasBase – HBase++ as a service

§  Data storage (HBase + HDFS)
§  Domains, tables

§  API: create, put, get, scan

§  Analytics (HBase + Hadoop + query engine)
§  Reports, dimensions, metrics

§  API: query


photoshop.com

Image analytics



photoshop.com

§  1B assets (images, videos, other)
§  120M with EXIF metadata

§  1.5 petabytes
§  Home grown distributed storage


Intelligence

§  Targeting users:
§  Professionals or Amateurs?
§  Where are pictures taken?

§  Targeting partners:
§  Popular cameras

§  Tracking campaigns
§  New accounts

Stats

§  7 machines (16 cores, 24 x 10K RPM SATA, 32GB RAM, 1Gbps)

§  Map 700M records
§  2hrs, 41mins
§  Map output: 1.9B records (~80GB)


Lessons

§  SUM, COUNT, AVG, MIN, MAX, GROUP BY, HAVING, etc.
§  Rollup, drilldown, segmentation
-----------------------------------------------------------

It’s all about Dimensions & Metrics
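
A minimal sketch of the "dimensions and metrics" idea: group records by one or more dimensions and compute the usual aggregates over a metric. The record fields (`camera`, `country`, `iso`) and their values are hypothetical stand-ins for EXIF metadata, and `aggregate` is an illustrative helper, not part of any system described in the talk.

```python
from collections import defaultdict


def aggregate(records, dimensions, metric):
    """GROUP BY `dimensions`, computing COUNT, SUM, MIN, MAX and AVG of `metric`."""
    groups = defaultdict(list)
    for record in records:
        key = tuple(record[name] for name in dimensions)
        groups[key].append(record[metric])
    return {
        key: {
            "count": len(values),
            "sum": sum(values),
            "min": min(values),
            "max": max(values),
            "avg": sum(values) / len(values),
        }
        for key, values in groups.items()
    }


# Hypothetical EXIF-like records.
PHOTOS = [
    {"camera": "A", "country": "RO", "iso": 100},
    {"camera": "A", "country": "RO", "iso": 400},
    {"camera": "B", "country": "US", "iso": 200},
    {"camera": "A", "country": "US", "iso": 800},
]

by_camera = aggregate(PHOTOS, ["camera"], "iso")                      # rollup
by_camera_country = aggregate(PHOTOS, ["camera", "country"], "iso")   # drilldown
print(by_camera[("A",)])
print(by_camera_country[("A", "RO")])
```

A rollup uses fewer dimensions in the key and a drilldown uses more; the aggregation code itself does not change.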


Recap

§  Hadoop + Mahout + PIG (User clusters)
§  HBase + Hadoop + Nutch + MySQL (Flash analytics)
§  HBase + Hadoop (EXIF Explorer, image analytics)


Business Catalyst

Analytics


BC

§  End to end platform for online businesses
§  E-commerce, Blogging, CRM, email marketing
§  Analytics: web traffic, affiliates, sales, etc.


Successtrophe

§  Analytics is troublesome
§  SQL database was slow for analytics

§  Over 50 different reports
§  Over 100,000 websites
§  Billions of page views



Requirements

§  Fast incremental processing
§  Custom reporting
§  Filtering, segmentation, rollups, drilldowns
§  Variable time ranges

§  Fast


Solution

§  Continuous processing (every 10 minutes)
§  Report definitions: dimensions, metrics
§  Real-time queries: directly from HBase


Workflow

§  Import logs -> HBase
§  Incrementally process/index last 24 hours
§  Serve from HBase
§  Index scans

§  Runtime aggregation
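
One way to picture "index scans" plus "runtime aggregation": keep pre-aggregated hourly buckets under sortable row keys, then answer a query for any time range by scanning the matching key range and summing. The key layout `site|dimension|YYYYMMDDHH`, the class `ReportIndex`, and the site names are assumptions for illustration; the talk does not describe the actual schema.

```python
import bisect


class ReportIndex:
    """Sorted (row key -> counter) store standing in for an index table."""

    def __init__(self):
        self._keys = []    # sorted row keys
        self._values = {}  # row key -> counter

    def increment(self, site, dimension, hour, amount=1):
        # Hypothetical key layout: sorts by site, then dimension, then time.
        # The hour must be fixed width (YYYYMMDDHH) so string order matches time order.
        key = f"{site}|{dimension}|{hour}"
        if key not in self._values:
            bisect.insort(self._keys, key)
            self._values[key] = 0
        self._values[key] += amount

    def total(self, site, dimension, start_hour, end_hour):
        """Runtime aggregation: sum the hourly buckets in [start_hour, end_hour)."""
        low = bisect.bisect_left(self._keys, f"{site}|{dimension}|{start_hour}")
        high = bisect.bisect_left(self._keys, f"{site}|{dimension}|{end_hour}")
        return sum(self._values[key] for key in self._keys[low:high])


index = ReportIndex()
for hour in ("2010112600", "2010112601", "2010112601", "2010112700"):
    index.increment("shop.example", "pageviews", hour)
index.increment("blog.example", "pageviews", "2010112601")

day_total = index.total("shop.example", "pageviews", "2010112600", "2010112700")
print(day_total)
```

Incremental processing then only has to update the buckets touched by new log lines, while variable time ranges are handled at query time by choosing the scan bounds.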


Stats

§  1 datacenter, 10 months of data = 1 hour, 24 minutes of processing
§  > 3 Billion report items generated


Lessons

§  UNIQUE is harder
§  E.g.: unique visitors, visitor loyalty

§  Space vs. time
§  Sorting magic
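
Why UNIQUE is harder than SUM or COUNT: page views from two days can be added, but unique visitors cannot, because the same visitor may appear on both days. The visitor IDs below are made up, and the set-based approach is just one simple way to show the space cost, not the method used in the talk.

```python
# Hypothetical visitor IDs seen on each day.
DAILY_VISITORS = {
    "2010-11-25": {"u1", "u2", "u3"},
    "2010-11-26": {"u2", "u3", "u4"},
}


def uniques_naive(daily_visitors):
    """Wrong for ranges: adds per-day unique counts, double counting repeat visitors."""
    return sum(len(visitors) for visitors in daily_visitors.values())


def uniques_exact(daily_visitors):
    """Correct, but needs the visitor IDs themselves, not just per-day counts."""
    return len(set().union(*daily_visitors.values()))


print(uniques_naive(DAILY_VISITORS))
print(uniques_exact(DAILY_VISITORS))
```

Keeping the IDs around costs space, and recomputing them per query costs time, which is the space vs. time trade-off named above.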


Not just web analytics

X Analytics

§  Feed in any file format (w3c, apache, tsv, etc.)
§  Tag the dimensions and metrics
§  Process (incremental)
§  Query in real-time


Nothing but the hstack

§  structured data storage: HBase
§  file storage: HDFS
§  data processing: Hadoop


Conclusions

§  Keep data
§  Understand data
§  Explore data
§  Extract meaning


http://hstack.org
http://hbase.apache.org
http://hadoop.apache.org
http://mahout.apache.org
http://nutch.apache.org
