HBase and Hadoop at Adobe

HBase and Hadoop at Adobe - presented at Programatica

Published in: Technology

Transcript

  • 1. Big Data with HBase and Hadoop at Adobe. Cosmin Lehene. Programatica, November 2010. Copyright 2009 Adobe Systems Incorporated. All rights reserved. Adobe confidential.
  • 2. Who am I: Cosmin Lehene; Adobe Services and Infrastructure Team = SaaS services; HBase and Hadoop contributor; clehene@adobe.com; @clehene; http://hstack.org
  • 3. Why I am here today: riding the elephant since 2008; analytics, BI, machine learning; images, videos, Flash, web, etc.
  • 4. Opaque data (logs, archives): web traffic; business events; user interactions; infrastructure data (database logs, web server logs, etc.); etc.
  • 5. http://commons.wikimedia.org/wiki/File:AWI-core-archive_hg.jpg
  • 6. http://www.google.com/images?q=data+visualization
  • 7. Can I: JOIN everything? Increase user engagement? Increase conversion rate? Make $$$? :) Fast and cheap?
  • 8. Understand data and extract meaning. Real-time access to meaningful data.
  • 9. Agenda
  • 10. noSQL 101
  • 11. Scaling RDBMS. Scale up: more memory; more CPU; faster disks, SAN, etc. Problems: expensive; there's a limit.
  • 12. Scaling RDBMS. Scale horizontally: replication (reads); sharding / horizontal partitioning (writes), e.g. Server 1: a-m, Server 2: m-z; denormalization.
  • 13. Replication
  • 14. Sharding
  • 15. Sharding
  • 16. Sharding & Replication
  • 17. Scaling RDBMS problems: hard to repartition/reshard (pre-allocate shards: 2, 3, 100); query each shard; high operational costs; eventual consistency.
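The sharding slides (12 to 17) can be illustrated with a small sketch. It assumes each shard is just an in-memory dict and uses the slide's split at "m"; the class name `RangeShardedStore` and its methods are hypothetical and do not come from the talk.

```python
import bisect


class RangeShardedStore:
    """Toy range-sharded key/value store; each shard is a plain dict."""

    def __init__(self, split_points):
        # split_points=["m"] gives two shards: keys < "m" and keys >= "m".
        self.split_points = sorted(split_points)
        self.shards = [dict() for _ in range(len(self.split_points) + 1)]

    def shard_for(self, key):
        # bisect_right counts the split points <= key, which is exactly
        # the index of the shard that owns the key.
        return bisect.bisect_right(self.split_points, key)

    def put(self, key, value):
        self.shards[self.shard_for(key)][key] = value

    def get(self, key):
        return self.shards[self.shard_for(key)].get(key)

    def query_all(self, predicate):
        # A query that is not keyed on the shard key has to visit every shard.
        hits = []
        for shard in self.shards:
            hits.extend(v for v in shard.values() if predicate(v))
        return hits

    def reshard(self, new_split_points):
        # Changing the split points re-routes every stored row.
        rows = [item for shard in self.shards for item in shard.items()]
        self.split_points = sorted(new_split_points)
        self.shards = [dict() for _ in range(len(self.split_points) + 1)]
        for key, value in rows:
            self.put(key, value)
        return len(rows)  # number of rows that had to be rewritten
```

Lookups by key touch one shard, while `query_all` and `reshard` touch all of them, which is roughly the operational cost slide 17 points at.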
  • 18. Enter noSQL: the beginning. Google: BigTable; Amazon: Dynamo; Memcached.
  • 19. Data Models: key-value; columnar/tabular; document oriented; graph.
  • 20. Architectures: distributed hash tables; consistent hashing; gossip; vector clocks; locality groups; partitioning, replication; etc.
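Slide 20 only names consistent hashing. As a rough illustration, a minimal ring with virtual nodes might look like the sketch below. The `HashRing` class, the `replicas` parameter and the use of MD5 for placement are assumptions made for this example, not details from the talk.

```python
import bisect
import hashlib


def _point(label):
    # Stable 64-bit position on the ring (MD5 is used only to spread keys).
    return int(hashlib.md5(label.encode("utf-8")).hexdigest()[:16], 16)


class HashRing:
    """Consistent-hash ring with virtual nodes (hypothetical helper)."""

    def __init__(self, nodes=(), replicas=64):
        self.replicas = replicas
        self._points = []  # sorted ring positions
        self._owner = {}   # ring position -> node name
        for node in nodes:
            self.add_node(node)

    def add_node(self, node):
        # Each node is placed on the ring `replicas` times to even out load.
        for i in range(self.replicas):
            p = _point(f"{node}#{i}")
            self._owner[p] = node
            bisect.insort(self._points, p)

    def node_for(self, key):
        if not self._points:
            raise LookupError("ring is empty")
        # The first ring position clockwise from the key owns it (wrapping).
        i = bisect.bisect_right(self._points, _point(key))
        return self._owner[self._points[i % len(self._points)]]
```

The property that matters is that adding a node only takes keys away from existing nodes and hands them to the new one; no key moves between two old nodes.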
  • 21. Properties: scalability; failover; durability; consistency; availability; partition tolerance; etc.
  • 22. Cartesian Product
  • 23. What do all these have in common? noSQL: different data models; different architectures; different properties.
  • 24. Hadoop (http://hadoop.apache.org): HDFS (distributed fs); Map-reduce (distributed processing).
  • 25. Adobe Media Player: increase video consumption
  • 26. AMP: recommendations; related content; related users.
  • 27. Video logs: X watched movie A (comedy); Y watched movie B (drama); Z watched movie C (thriller); Z watched movie A (comedy); X watched movie D (technology); Y watched movie C (thriller).
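To make the map-reduce idea from slide 24 concrete, here is a sketch that runs the six log records from slide 27 through map, shuffle and reduce steps in plain Python. It is not Hadoop's API; the function names and the per-user genre vector layout are assumptions for illustration.

```python
from collections import defaultdict

# The six (user, movie, genre) records listed on slide 27.
LOGS = [
    ("X", "A", "comedy"),
    ("Y", "B", "drama"),
    ("Z", "C", "thriller"),
    ("Z", "A", "comedy"),
    ("X", "D", "technology"),
    ("Y", "C", "thriller"),
]


def map_phase(record):
    # Emit one ((user, genre), 1) pair per log record.
    user, _movie, genre = record
    yield (user, genre), 1


def shuffle(pairs):
    # Group all emitted values by key, as the framework would between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    # Sum the counts for one (user, genre) key.
    return key, sum(values)


def run_job(records):
    mapped = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(k, vs) for k, vs in shuffle(mapped).items())


def genre_vectors(counts, genres):
    # Turn the counts into one fixed-length vector per user.
    vectors = defaultdict(lambda: [0] * len(genres))
    for (user, genre), n in counts.items():
        vectors[user][genres.index(genre)] = n
    return dict(vectors)
```

Calling `genre_vectors(run_job(LOGS), ["comedy", "drama", "thriller", "technology"])` yields one vector per user, the kind of input a "which users are alike?" comparison needs.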
  • 28. Which users are alike? Compare every 2 users? 5M vectors; 120 dimensions; distance is not enough, groups are needed.
  • 29. How?
  • 30. Cluster projections: 1 month; 6 GB; 700k users; 114 genres; 7 nodes; 5 hours; 27 clusters.
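The slides do not say which clustering algorithm produced the 27 clusters. Purely as an assumption, the sketch below uses a small single-machine k-means to show what grouping users by their genre vectors means; the function names and the naive "first k points" initialization are choices made for this example.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    n = len(points)
    return [sum(column) / n for column in zip(*points)]


def kmeans(points, k, iterations=20):
    """Plain k-means; returns (cluster index per point, centroids)."""
    centroids = [list(p) for p in points[:k]]  # naive deterministic init
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster empties
                centroids[c] = mean(members)
    return assignment, centroids
```

With one vector per user, users that share a cluster index can be treated as "alike" without comparing every pair of users directly.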
  • 31. Game Constellations: processing Shockwave logs.
  • 32. Lessons learned. Need: fine-grained access; incremental updates; deal with changes in the original dataset; real-time data serving.
  • 33. (no text on slide)
  • 34. http://hbase.apache.org: sparse, distributed, persistent multidimensional sorted map; column-oriented store; autosharding; data locality.
  • 35. Data Model: table: row: family: column: value: version. Row domain.com/x.swf: family swf: (swf:size = 1876 bytes | 1876 bytes; swf:fps = 30; swf:avm = 3); family html: (embed = dynamic); family status: (last_crawl = 2010/11/26 | last_crawl = 2010/11/25). Further rows: domain.com/y.swf; domain.com/z.swf.
  • 36. API: Get; Put; Delete; Scan.
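The data model on slide 35 and the four operations on slide 36 can be mimicked with a toy in-memory table. This is a sketch of the idea only, not the HBase client API: `ToyTable`, its method signatures and the explicit `ts` argument are all hypothetical.

```python
import bisect


class ToyTable:
    """In-memory stand-in for one table: a sorted map of row keys to cells."""

    def __init__(self, max_versions=3):
        self.max_versions = max_versions
        self._row_keys = []  # kept sorted, so range scans are cheap
        self._rows = {}      # row -> {(family, qualifier): [(ts, value), ...]}

    def put(self, row, family, qualifier, value, ts):
        if row not in self._rows:
            bisect.insort(self._row_keys, row)
            self._rows[row] = {}
        versions = self._rows[row].setdefault((family, qualifier), [])
        versions.append((ts, value))
        versions.sort(key=lambda v: v[0], reverse=True)  # newest first
        del versions[self.max_versions:]                 # drop old versions

    def get(self, row, family, qualifier, versions=1):
        cells = self._rows.get(row, {}).get((family, qualifier), [])
        return [value for _ts, value in cells[:versions]]

    def delete(self, row):
        if self._rows.pop(row, None) is not None:
            self._row_keys.remove(row)

    def scan(self, start_row="", stop_row=None):
        # Rows come back in key order; stop_row is exclusive.
        i = bisect.bisect_left(self._row_keys, start_row)
        for row in self._row_keys[i:]:
            if stop_row is not None and row >= stop_row:
                break
            latest = {col: cells[0][1] for col, cells in self._rows[row].items()}
            yield row, latest
```

Writing `last_crawl` twice for `domain.com/x.swf` with increasing timestamps and then calling `get(..., versions=2)` returns both values, newest first, which matches the two `last_crawl` entries shown on slide 35.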
  • 37. Flash: How is Flash used
  • 38. How is Flash used in the “wild”? AVM popularity; frame rates; video formats; SWF size; Flex data structures; …
  • 39. How
  • 40. How max 1000
  • 41. The hard way: Hadoop; Nutch; HBase.
  • 42. Workflow. Crawl: Nutch (seed: top-1m.csv Alexa); detect Flash embed, JavaScript. Browse: Hadoop + FF + FP (chromeless); dump stack traces, memory, swf bytes, etc. Process: parse stack traces, rank, etc. Export: HBase: swf table; md5, swf bytecode, memory, load time, etc.
  • 43. (no text on slide)
  • 44. (no text on slide)
  • 45. (no text on slide)
  • 46. Benefits: security fixes; optimization; prioritize based on real usage; testing.
  • 47. SaasBase: HBase++ as a service. Data storage (HBase + HDFS): domains, tables; API: create, put, get, scan. Analytics (HBase + Hadoop + query engine): reports, dimensions, metrics; API: query.
  • 48. photoshop.com: image analytics
  • 49. (no text on slide)
  • 50. photoshop.com: 1B assets (images, videos, other); 120M with EXIF metadata; 1.5 petabytes; home-grown distributed storage.
  • 51. Intelligence. Targeting users: professionals or amateurs? Where are pictures taken? Targeting partners: popular cameras; tracking campaigns; new accounts.
  • 52. (no text on slide)
  • 53. (no text on slide)
  • 54. (no text on slide)
  • 55. (no text on slide)
  • 56. Stats: 7 machines (16 cores, 24 x 10K RPM SATA, 32 GB RAM, 1 Gbps); map 700M records; 2 hrs 41 mins; map output: 1.9B records (~80 GB).
  • 57. Lessons: SUM, COUNT, AVG, MIN, MAX, GROUP BY, HAVING, etc.; rollup, drilldown, segmentation. It's all about Dimensions & Metrics.
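Slide 57's point that everything reduces to dimensions and metrics can be sketched as a single grouping function. The `aggregate` helper, the field names (`camera`, `country`, `size_kb`) and the sample rows are made up for illustration; they are not data from the talk.

```python
from collections import defaultdict


def aggregate(records, dimensions, metric):
    """GROUP BY `dimensions`, computing COUNT/SUM/MIN/MAX/AVG of `metric`."""
    groups = defaultdict(list)
    for rec in records:
        key = tuple(rec[d] for d in dimensions)
        groups[key].append(rec[metric])
    report = {}
    for key, values in groups.items():
        report[key] = {
            "count": len(values),
            "sum": sum(values),
            "min": min(values),
            "max": max(values),
            "avg": sum(values) / len(values),
        }
    return report


# Made-up sample rows; the field names are illustrative only.
PHOTOS = [
    {"camera": "cam-a", "country": "RO", "size_kb": 100},
    {"camera": "cam-a", "country": "US", "size_kb": 300},
    {"camera": "cam-b", "country": "RO", "size_kb": 200},
    {"camera": "cam-a", "country": "RO", "size_kb": 500},
]

# Drilldown groups by more dimensions; rollup groups by fewer.
drilldown = aggregate(PHOTOS, ["camera", "country"], "size_kb")
rollup = aggregate(PHOTOS, ["camera"], "size_kb")
```

Rollup and drilldown are the same operation with a shorter or longer list of dimensions, which is why a report definition can be reduced to "which dimensions, which metrics".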
  • 58. Recap: Hadoop + Mahout + PIG (user clusters); HBase + Hadoop + Nutch + MySQL (Flash analytics); HBase + Hadoop (EXIF Explorer, image analytics).
  • 59. Business Catalyst: analytics
  • 60. BC: end-to-end platform for online businesses; e-commerce, blogging, CRM, email marketing; analytics: web traffic, affiliates, sales, etc.
  • 61. (no text on slide)
  • 62. (no text on slide)
  • 63. Successtrophe: analytics is troublesome; SQL database was slow for analytics; over 50 different reports; over 100,000 websites; billions of page views.
  • 64. Requirements: fast incremental processing; custom reporting; filtering, segmentation, rollups, drilldowns; variable time ranges; fast.
  • 65. Solution: continuous processing (every 10 minutes); reports definition: dimensions, metrics; real-time queries: directly from HBase.
  • 66. Workflow: import logs -> HBase; incrementally process/index last 24 hours; serve from HBase (index scans, runtime aggregation).
  • 67. Stats: 1 datacenter, 10 months = 1 hour, 24 minutes; > 3 billion report items generated.
  • 68. Lessons: UNIQUE is harder (e.g. unique visitors, visitor loyalty); space vs. time; sorting magic.
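Why "UNIQUE is harder" (slide 68) shows up as soon as pre-aggregated buckets are combined: page views add up across days, unique visitors do not. The sketch below uses made-up events and hypothetical names to show the difference; it is not the approach described in the talk.

```python
from collections import defaultdict

# Made-up (day, visitor_id) page-view events, for illustration only.
EVENTS = [
    ("mon", "u1"), ("mon", "u2"), ("mon", "u1"),
    ("tue", "u1"), ("tue", "u3"),
    ("wed", "u2"),
]


def daily_rollups(events):
    views = defaultdict(int)     # additive metric: one integer per day
    visitors = defaultdict(set)  # unique metric: one set of ids per day
    for day, visitor in events:
        views[day] += 1
        visitors[day].add(visitor)
    return views, visitors


views, visitors = daily_rollups(EVENTS)

# Page views can be summed across any time range.
total_views = sum(views.values())

# Summing daily unique counts double-counts returning visitors.
naive_uniques = sum(len(ids) for ids in visitors.values())

# The correct figure needs the ids themselves (space) or a rescan (time).
true_uniques = len(set().union(*visitors.values()))
```

Here the naive sum of daily uniques overstates the real number of distinct visitors, so a variable time range forces a choice between storing the visitor ids per bucket and recomputing from raw data, the space vs. time trade-off the slide mentions.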
  • 69. Not just web analytics: X Analytics. Feed in any file format (w3c, apache, tsv, etc.); tag the dimensions and metrics; process (incremental); query in real-time.
  • 70. Nothing but the hstack: structured data storage: HBase; file storage: HDFS; data processing: Hadoop.
  • 71. Conclusions: keep data; understand data; explore data; extract meaning.
  • 72. http://hstack.org http://hbase.apache.org http://hadoop.apache.org http://mahout.apache.org http://nutch.apache.org
