HBase and Hadoop at Adobe
 

HBase and Hadoop at Adobe - presented at Programatica

Usage Rights

© All Rights Reserved

    HBase and Hadoop at Adobe: Presentation Transcript

    • Big Data with HBase and Hadoop at Adobe
      Cosmin Lehene
      Programatica, November 2010
      Copyright 2009 Adobe Systems Incorporated. All rights reserved. Adobe confidential.
    • Who am I
      • Cosmin Lehene
      • Adobe Services and Infrastructure Team = SaaS services
      • HBase and Hadoop contributor
      • clehene@adobe.com
      • @clehene
      • http://hstack.org
    • Why I am here today
      • Riding the elephant since 2008
      • Analytics, BI, Machine Learning
      • Images, Videos, Flash, Web, etc.
    • Opaque Data (logs, archives)
      • Web traffic
      • Business events
      • User interactions
      • Infrastructure data
        • Database logs, web server logs, etc.
      • Etc.
    • http://commons.wikimedia.org/wiki/File:AWI-core-archive_hg.jpg
    • http://www.google.com/images?q=data+visualization
    • Can I
      • JOIN everything?
      • Increase user engagement?
      • Increase conversion rate?
      • Make $$$? :)
      • Fast and cheap?
    • Understand data and extract meaning
      Real-time access to meaningful data
    • Agenda
    • noSQL 101
    • Scaling RDBMS
      • Scale up
        • More memory
        • More CPU
        • Faster disks, SAN, etc.
      • Problems
        • Expensive
        • There's a limit
    • Scaling RDBMS
      • Scale horizontally
        • Replication (reads)
        • Sharding / Horizontal Partitioning (writes)
          • Server 1: a-m, Server 2: m-z
        • Denormalization
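The "Server 1: a-m, Server 2: m-z" scheme can be sketched as a lookup over sorted split points. This is a minimal illustration, not anything from the talk: the names `SPLIT_POINTS`, `SHARDS` and `route` are hypothetical, and the sketch assumes keys starting with "m" belong to the second server.

```python
import bisect

# Hypothetical layout: keys that sort before "m" go to server1, the rest to server2.
SPLIT_POINTS = ["m"]
SHARDS = ["server1", "server2"]


def route(key: str) -> str:
    """Return the shard that owns `key` under range partitioning on its first letter."""
    index = bisect.bisect_right(SPLIT_POINTS, key[:1].lower())
    return SHARDS[index]


if __name__ == "__main__":
    for name in ["alice", "mallory", "zed"]:
        print(name, "->", route(name))
```

Adding a server means adding a split point, which changes the owner of every key in the affected range, so existing rows have to be moved.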
    • Replication
    • Sharding
    • Sharding
    • Sharding & Replication
    • Scaling RDBMS problems
      • Hard to repartition/reshard
        • Pre-allocate shards: 2, 3, 100
      • Query each shard
      • High operational costs
      • Eventual consistency
    • Enter noSQL – the beginning
      • Google: BigTable
      • Amazon: Dynamo
      • Memcached
    • Data Models
      • Key-value
      • Columnar/Tabular
      • Document oriented
      • Graph
    • Architectures
      • Distributed hash tables
      • Consistent Hashing
      • Gossip
      • Vector clocks
      • Locality groups
      • Partitioning, replication
      • etc.
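Of the techniques listed, consistent hashing is the easiest to show in a few lines. The sketch below is an assumption-laden illustration: `HashRing`, the `replicas` count and the use of MD5 are choices made here for the example, not details from the slides.

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    """Map a string to a point on the ring (MD5 used only for a stable, even spread)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Consistent-hash ring with virtual nodes (hypothetical helper class)."""

    def __init__(self, nodes, replicas=100):
        # Each node is placed on the ring `replicas` times to even out the load.
        self._ring = sorted(
            (_hash(f"{node}#{i}"), node) for node in nodes for i in range(replicas)
        )
        self._points = [point for point, _ in self._ring]

    def node_for(self, key: str) -> str:
        """Return the first node clockwise from the key's position on the ring."""
        index = bisect.bisect(self._points, _hash(key)) % len(self._ring)
        return self._ring[index][1]


if __name__ == "__main__":
    ring = HashRing(["a", "b", "c"])
    print(ring.node_for("user42"))
```

Unlike fixed range shards, adding a node to the ring only reassigns the keys that now land on the new node; every other key keeps its owner.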
    • Properties
      • Scalability
      • Failover
      • Durability
      • Consistency
      • Availability
      • Partition Tolerance
      • Etc.
    • Cartesian Product
    • What do all these have in common
      • Different data models
      • Different architectures
      • Different properties
      • noSQL
    • Hadoop http://hadoop.apache.org
      • HDFS (distributed fs)
      • Map-reduce (distributed processing)
    • Adobe Media Player: Increase video consumption
    • AMP
      • Recommendations
      • Related content
      • Related users
    • Video logs
      • X watched movie A (comedy)
      • Y watched movie B (drama)
      • Z watched movie C (thriller)
      • Z watched movie A (comedy)
      • X watched movie D (technology)
      • Y watched movie C (thriller)
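One way to turn these log lines into per-user genre profiles is a map step that emits `((user, genre), 1)` and a reduce step that sums the counts. The sketch below simulates that in a single process; it does not use the Hadoop API, and the function names are hypothetical.

```python
from collections import defaultdict

# The six log lines from the slide as (user, movie, genre) tuples.
LOGS = [
    ("X", "A", "comedy"),
    ("Y", "B", "drama"),
    ("Z", "C", "thriller"),
    ("Z", "A", "comedy"),
    ("X", "D", "technology"),
    ("Y", "C", "thriller"),
]


def map_phase(record):
    """Emit one ((user, genre), 1) pair per log line."""
    user, _movie, genre = record
    yield (user, genre), 1


def reduce_phase(key, values):
    """Sum the counts collected for one (user, genre) key."""
    return key, sum(values)


def run(records):
    """Simulate map, shuffle and reduce, returning {user: {genre: count}}."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_phase(record):
            groups[key].append(value)  # shuffle: group values by key
    profiles = defaultdict(dict)
    for key in sorted(groups):
        (user, genre), count = reduce_phase(key, groups[key])
        profiles[user][genre] = count
    return dict(profiles)


if __name__ == "__main__":
    print(run(LOGS))
```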
    • Which users are alike?
      • Compare every 2 users?
      • 5M vectors
      • 120 dimensions
      • Distance is not enough – needed groups
    • How?
    • Cluster projections
      • 1 month
      • 6GB
      • 700k Users
      • 114 genres
      • 7 nodes
      • 5 hours
      • 27 clusters
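The slides do not name the clustering algorithm. As an illustration of grouping users by genre vectors, here is a small k-means sketch; k-means itself, the two-dimensional sample vectors and the starting centroids are all assumptions made for the example.

```python
import math


def distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean(points):
    """Component-wise mean of a non-empty list of vectors."""
    return tuple(sum(dim) / len(points) for dim in zip(*points))


def kmeans(points, centroids, iterations=10):
    """Assign each point to its nearest centroid, then recompute the centroids.

    Returns the final centroids and the clusters from the last assignment step.
    """
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(
                range(len(centroids)), key=lambda i: distance(p, centroids[i])
            )
            clusters[nearest].append(p)
        # An empty cluster keeps its previous centroid.
        centroids = [mean(c) if c else centroids[i] for i, c in enumerate(clusters)]
    return centroids, clusters


if __name__ == "__main__":
    # Hypothetical (comedy, thriller) view counts for four users.
    users = [(5, 0), (4, 1), (0, 5), (1, 4)]
    print(kmeans(users, centroids=[(5, 0), (0, 5)]))
```

Comparing each user with a handful of centroids avoids comparing every pair of users, which is what makes grouping feasible at millions of vectors.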
    • Game Constellations
      • Processing Shockwave logs
    • Lessons learned
      Need:
      • Fine grain access
      • Incremental updates
      • Deal with changes in the original dataset
      • Real-time data serving
    • HBase http://hbase.apache.org
      • Sparse, distributed, persistent multidimensional sorted map
      • Column oriented store
      • Autosharding
      • Data locality
    • Data Model
      table: row: family: column: value: version
      • domain.com/x.swf
        • swf:
          • swf:size = 1876 bytes | 1876 bytes
          • swf:fps = 30
          • swf:avm = 3
        • html:
          • embed = dynamic
        • status:
          • last_crawl = 2010/11/26 | last_crawl = 2010/11/25
      • domain.com/y.swf
      • domain.com/z.swf
    • API
      • Get
      • Put
      • Delete
      • Scan
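The data model and the four operations can be mimicked with a sorted, nested map. The class below is a toy in-memory stand-in written for illustration only: `ToyTable` and its method signatures are hypothetical and are not the HBase client API.

```python
import bisect


class ToyTable:
    """Toy table: row key -> "family:qualifier" -> {timestamp: value}."""

    def __init__(self):
        self._rows = {}  # row key -> {column -> {timestamp -> value}}
        self._keys = []  # row keys kept sorted so scans are range reads

    def put(self, row, column, value, timestamp):
        """Store one versioned cell."""
        if row not in self._rows:
            bisect.insort(self._keys, row)
            self._rows[row] = {}
        self._rows[row].setdefault(column, {})[timestamp] = value

    def get(self, row, column):
        """Return the newest version of a cell, or None if it is absent."""
        versions = self._rows.get(row, {}).get(column)
        if not versions:
            return None
        return versions[max(versions)]

    def delete(self, row):
        """Remove a whole row."""
        if row in self._rows:
            del self._rows[row]
            self._keys.remove(row)

    def scan(self, start, stop):
        """Yield (row, newest cells) for start <= row < stop, in key order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, stop)
        for row in self._keys[lo:hi]:
            cells = self._rows[row]
            yield row, {col: v[max(v)] for col, v in cells.items()}


if __name__ == "__main__":
    table = ToyTable()
    table.put("domain.com/x.swf", "status:last_crawl", "2010/11/25", 1)
    table.put("domain.com/x.swf", "status:last_crawl", "2010/11/26", 2)
    table.put("domain.com/y.swf", "swf:fps", 30, 1)
    print(table.get("domain.com/x.swf", "status:last_crawl"))
    print(list(table.scan("domain.com/", "domain.com0")))
```

Because rows are kept in key order, everything under one domain sits in a contiguous range and can be read with a single scan.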
    • Flash: How is Flash used
    • How is Flash used in the "wild"?
      • AVM popularity
      • Frame rates
      • Video formats
      • SWF size
      • Flex data structures
      • …
    • How
    • How max 1000
    • The hard way
      • Hadoop
      • Nutch
      • HBase
    • Workflow
      • Crawl:
        • Nutch (seed: top-1m.csv Alexa)
        • Detect Flash embed, javascript
      • Browse:
        • Hadoop + FF + FP (chromeless)
        • Dump stack traces, memory, swf bytes, etc.
      • Process:
        • Parse stack traces, rank, etc.
      • Export:
        • HBase: swf table
        • Md5, swf bytecode, memory, load time, etc.
    • Benefits
      • Security fixes
      • Optimization
      • Prioritize based on real usage
      • Testing
    • SaasBase – HBase++ as a service
      • Data storage (HBase + HDFS)
        • Domains, tables
        • API: create, put, get, scan
      • Analytics (HBase + Hadoop + query engine)
        • Reports, dimensions, metrics
        • API: query
    • photoshop.com: Image analytics
    • photoshop.com
      • 1B assets (images, videos, other)
      • 120M with EXIF metadata
      • 1.5 petabytes
      • Home-grown distributed storage
    • Intelligence
      • Targeting users:
        • Professionals or Amateurs?
        • Where are pictures taken?
      • Targeting partners:
        • Popular cameras
        • Tracking campaigns
        • New accounts
    • Stats
      • 7 Machines (16 cores, 24 x 10K RPM SATA, 32GB RAM, 1Gbps)
      • Map 700M records
      • 2hrs, 41mins
      • Map output: 1.9B records (~80GB)
    • Lessons
      • SUM, COUNT, AVG, MIN, MAX, GROUP BY, HAVING, etc.
      • Rollup, drilldown, segmentation
      It's all about Dimensions & Metrics
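"Dimensions & Metrics" can be read as: group records by the dimension values, then fold a metric within each group. The sketch below is one possible reading; the `aggregate` helper and the sample EXIF-like records are hypothetical.

```python
from collections import defaultdict


def aggregate(records, dimensions, metric):
    """GROUP BY `dimensions` and return COUNT, SUM, MIN, MAX, AVG of `metric`."""
    groups = defaultdict(list)
    for record in records:
        key = tuple(record[d] for d in dimensions)
        groups[key].append(record[metric])
    return {
        key: {
            "count": len(values),
            "sum": sum(values),
            "min": min(values),
            "max": max(values),
            "avg": sum(values) / len(values),
        }
        for key, values in groups.items()
    }


if __name__ == "__main__":
    # Hypothetical sample records; field names are illustrative only.
    photos = [
        {"camera": "A", "country": "RO", "iso": 100},
        {"camera": "A", "country": "RO", "iso": 300},
        {"camera": "B", "country": "US", "iso": 200},
    ]
    print(aggregate(photos, ["camera"], "iso"))             # drilldown by camera
    print(aggregate(photos, ["camera", "country"], "iso"))  # finer segmentation
    print(aggregate(photos, [], "iso"))                     # rollup over everything
```

Rollup and drilldown then amount to calling the same function with fewer or more dimensions.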
    • Recap
      • Hadoop + Mahout + PIG (User clusters)
      • HBase + Hadoop + Nutch + MySQL (Flash analytics)
      • HBase + Hadoop (EXIF Explorer, image analytics)
    • Business Catalyst: Analytics
    • BC
      • End to end platform for online businesses
      • E-commerce, Blogging, CRM, email marketing
      • Analytics: web traffic, affiliates, sales, etc.
    • Successtrophe
      • Analytics is troublesome
      • SQL database was slow for analytics
      • Over 50 different reports
      • Over 100,000 websites
      • Billions of page views
    • Requirements
      • Fast incremental processing
      • Custom reporting
      • Filtering, segmentation, rollups, drilldowns
      • Variable time ranges
      • Fast
    • Solution
      • Continuous processing (every 10 minutes)
      • Reports definition: dimensions, metrics
      • Real-time queries: directly from HBase
    • Workflow
      • Import Logs -> HBase
      • Incrementally process/index last 24 hours
      • Serve from HBase
        • Index scans
        • Runtime aggregation
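Serving reports through index scans plus runtime aggregation might look like the sketch below: counters are stored under sorted row keys that end in a time bucket, and a query sums the keys in a range. The key layout (`site|report|YYYYMMDDHH`) and the `ReportIndex` class are assumptions for illustration; the slides do not describe the actual schema.

```python
import bisect


def row_key(site: str, report: str, hour: str) -> str:
    """Hypothetical key layout: site, report name and an hour bucket (YYYYMMDDHH)."""
    return f"{site}|{report}|{hour}"


class ReportIndex:
    """In-memory stand-in for a sorted table of pre-computed hourly counters."""

    def __init__(self):
        self._keys = []    # sorted row keys
        self._values = {}  # row key -> counter

    def increment(self, site, report, hour, amount=1):
        """Incremental processing step: add new events to one hourly bucket."""
        key = row_key(site, report, hour)
        if key not in self._values:
            bisect.insort(self._keys, key)
            self._values[key] = 0
        self._values[key] += amount

    def query(self, site, report, start_hour, end_hour):
        """Range-scan the buckets in [start_hour, end_hour) and sum them at query time."""
        lo = bisect.bisect_left(self._keys, row_key(site, report, start_hour))
        hi = bisect.bisect_left(self._keys, row_key(site, report, end_hour))
        return sum(self._values[k] for k in self._keys[lo:hi])


if __name__ == "__main__":
    index = ReportIndex()
    index.increment("shop.example", "pageviews", "2010112609", 5)
    index.increment("shop.example", "pageviews", "2010112610", 7)
    print(index.query("shop.example", "pageviews", "2010112609", "2010112611"))
```

Putting the time bucket last in the key keeps each report's history contiguous, so a variable time range becomes one scan.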
    • Stats
      • 1 datacenter, 10 months = 1 hour, 24 minutes
      • > 3 Billion report items generated
    • Lessons
      • UNIQUE is harder
        • E.g.: Unique visitors, Visitor loyalty
      • Space vs. time
      • Sorting magic
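A short example of why UNIQUE is harder than SUM or COUNT: hourly counts can be added together, but hourly unique-visitor counts cannot, because the same visitor may appear in several hours. The data and function names below are hypothetical.

```python
# Hypothetical visitor ids seen in two hourly buckets.
HOURLY_VISITORS = {
    "2010112609": {"u1", "u2"},
    "2010112610": {"u2", "u3"},
}


def naive_unique(hourly_sets):
    """Adding per-hour unique counts double-counts returning visitors."""
    return sum(len(visitors) for visitors in hourly_sets.values())


def exact_unique(hourly_sets):
    """An exact answer needs the visitor ids themselves, merged across hours."""
    return len(set().union(*hourly_sets.values()))


if __name__ == "__main__":
    print(naive_unique(HOURLY_VISITORS), exact_unique(HOURLY_VISITORS))
```

This is the space vs. time trade-off: either keep the ids per bucket (space) or rescan the raw events for each query (time).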
    • Not just web analytics: X Analytics
      • Feed in any file format (w3c, apache, tsv, etc.)
      • Tag the dimensions and metrics
      • Process (incremental)
      • Query in real-time
    • Nothing but the hstack
      • structured data storage: HBase
      • file storage: HDFS
      • data processing: Hadoop
    • Conclusions
      • Keep data
      • Understand data
      • Explore data
      • Extract meaning
    • Links
      • http://hstack.org
      • http://hbase.apache.org
      • http://hadoop.apache.org
      • http://mahout.apache.org
      • http://nutch.apache.org