Faster Cheaper Better-Replacing Oracle with Hadoop & Solr

Published in: Technology

  1. Faster, cheaper, better: Replacing Oracle with Hadoop and Solr
     - Ken Krugler, Scale Unlimited
     - Copyright (c) 2012 Scale Unlimited. All Rights Reserved. Monday, June 11, 12
  2. Obligatory Background
     - Ken Krugler - direct from Nevada City, California
     - Krugle startup (2005-2008) used Nutch, Hadoop, Solr
     - Now running Scale Unlimited: big data + search consulting + training
  3. The 50,000ft View
     - We helped our client kick the RDBMS habit
       - It’s an analytics web site for display advertising
       - Got rid of DBs handling queries for their web site
     - Now uses Hadoop + Solr to...
       - cut costs
       - add features
       - improve performance
       - increase scalability
  4. What’s an Analytics Web Site?
     - Let the user ask questions about data
  5. Including Sexy Dashboards
     - All driven by slices of the data
  6. Behind the web site curtain
     - Each view or constraint change triggers queries
       - “sum ad impact for all advertisers on all networks, sort by sum, limit 10”
       - “sum ad impact by ad type for advertiser ‘oracle.com’”
     - For millions of records, you have to choose...
       - Fast, accurate, inexpensive - pick any two
  7. Combinatorial Explosion
     - Too many possibilities to pre-calculate everything
       - more than 10^5 publishers
       - more than 10^6 advertisers
       - 30 ad networks, 3 day ranges, etc.
     - So many trillions of possible combinations
     - Caching of DB query results isn’t very useful
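The "trillions" figure on this slide follows directly from multiplying the listed dimensions. A quick sketch of that arithmetic, using the slide's lower bounds (the variable names are illustrative only):

```python
# Lower bounds taken from the slide; the slide says "more than" for the first two.
publishers = 10**5
advertisers = 10**6
ad_networks = 30
day_ranges = 3

# Every (publisher, advertiser, network, day range) tuple is a distinct slice
# that a user could in principle ask about.
combinations = publishers * advertisers * ad_networks * day_ranges

print(f"at least {combinations:,} possible slices")
```

That is at least nine trillion slices before any other dimension (the "etc." on the slide) is counted, which is why caching individual DB query results gets so few hits.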
  8. Trouble in UI Land
     - UI refresh took 10-30 seconds
     - Well outside of target range of “about a second or so”
       - 0.1 second: instantaneous
       - 1.0 second: I’m still in the flow
       - 10 seconds: I’m bored
  9. Trouble in the back office
     - Beefy hardware for multiple DBs was expensive
       - AWS monthly cost approaching 5 figures
       - And the data sets needed to grow significantly
     - Constant schema changes meant painful data reloading
       - Extract, load, transform (inside of DB)
       - Re-indexing of DB fields
  10. A New Approach
      - Do analytics off-line using Hadoop
        - Pre-generate as much as possible
      - Use Solr as a NoSQL database
        - And leverage search, faceting
  11. Obligatory Architectural Slide
      - Two search servers
      - 8 shards per index
        - Optimize response time
      - Additional indexes
        - autocompletion, etc.
      - 200M total documents
  12. What Solr Gives Us
      - Fast, memory-efficient queries
        - Count the number of documents that match a query
        - Sort results by fields
        - And search - “Find all Flash ads with the word ‘diet’”
      - Fast faceting
        - Count # of results from query that have different values for a field
        - “How many different image ad sizes (w/counts) are used by google?”
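The faceting question on this slide could be expressed as a Solr select request roughly as follows. This is a sketch only: `q`, `rows`, `facet`, `facet.field`, `facet.mincount` and `wt` are standard Solr request parameters, but the base URL and the field names `advertiser`, `ad_type` and `image_size` are assumptions, since the talk does not show the real schema beyond `network` and `publisher`.

```python
from urllib.parse import urlencode

def facet_query_url(base_url, query, facet_field):
    """Build a Solr select URL that returns only facet counts for one field."""
    params = {
        "q": query,                 # which documents to count
        "rows": 0,                  # no documents needed, only the counts
        "facet": "true",
        "facet.field": facet_field, # one count per distinct value of this field
        "facet.mincount": 1,        # skip values with zero matches
        "wt": "json",
    }
    return f"{base_url}?{urlencode(params)}"

# Hypothetical host and field names, for illustration only.
url = facet_query_url(
    "http://localhost:8983/solr/select",
    "advertiser:google AND ad_type:image",
    "image_size",
)
print(url)
```

Setting `rows` to 0 keeps the response small: the UI only needs the per-size counts, not the matching ads themselves.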
  13. How to Connect the Dots
      - We have web crawl data - ads, advertisers, publishers, networks
        - http://www.michiguide.com/some-page.html text google DIRECTV® For Businesses Save $13/mo ww.directv.com/business
      - We have target Solr schemas with the fields defined
        - <field name="network" type="string" indexed="true" stored="false" required="true" />
        - <field name="publisher" type="string" indexed="true" stored="false" required="true" />
      - How do we get from A to B?
        - Data Sources -> f(data)??? -> Index
  14. Hadoop ETL
      - Implement appropriate Extract, Transform, Load
        - Extract is just parsing text files that are stored in Amazon’s S3
        - Load is building the Solr index and deploying it to the search servers
      - What about that pesky “Transform” part?
  15. Simplicity Itself
      - 25 Hadoop Jobs
      - Developed with Cascading
      - Daily run is $25
  16. Workflow Essentials
      - “Do analytics offline” means anything that involves aggregation
      - Solr is fine for first/last/count
      - Pre-calculate anything that does math on each record
      - Essentially index is pre-calculated answers to 200M questions
        - “what is trendline for ad impact of this advertiser on that publisher?”
        - “which ads use 300x250 images?”
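As a rough illustration of "pre-calculate anything that does math on each record", the sketch below sums ad impact per (advertiser, publisher) pair offline and emits one document per pair, so that the search server only has to look the answer up. The record layout and the field names `advertiser`, `publisher`, `impact` and `impact_sum` are assumptions; the production workflow was a set of Cascading jobs on Hadoop, not a single in-memory function.

```python
from collections import defaultdict

def precalculate(records):
    """Aggregate raw records into one ready-to-index document per key."""
    totals = defaultdict(float)
    for rec in records:
        totals[(rec["advertiser"], rec["publisher"])] += rec["impact"]
    # Each output document is the pre-computed answer to one question.
    return [
        {"advertiser": adv, "publisher": pub, "impact_sum": total}
        for (adv, pub), total in sorted(totals.items())
    ]

# Hypothetical sample records.
raw = [
    {"advertiser": "oracle.com", "publisher": "example.org", "impact": 1.5},
    {"advertiser": "oracle.com", "publisher": "example.org", "impact": 2.5},
    {"advertiser": "oracle.com", "publisher": "example.net", "impact": 1.0},
]
docs = precalculate(raw)
print(docs)
```

At query time nothing is summed: a lookup on `advertiser` and `publisher` returns the stored `impact_sum`.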
  17. Combinatorial Explosion
      - Limit questions that can be asked
        - E.g. no arbitrary date ranges
        - Requires tricky “biggest bang for buck” decisions
      - Collapse entries that are “all” and only one other
        - Leverage Solr multi-value field support
        - network:all and network:doubleclick are one entry
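One way to read the "collapse" bullet: when an advertiser appears on only one network, the "all networks" record and the single-network record carry the same numbers, so a single document with a multi-valued `network` field can answer both queries. A minimal sketch under that assumption (the function name and the `impact_sum` field are hypothetical):

```python
def collapse_network_docs(per_network_totals):
    """Build index documents for one advertiser from {network: impact}.

    If there is only one network, the "all" record would duplicate it,
    so emit one document whose multi-valued field matches both values.
    """
    if len(per_network_totals) == 1:
        (network, impact), = per_network_totals.items()
        return [{"network": ["all", network], "impact_sum": impact}]

    docs = [
        {"network": [name], "impact_sum": impact}
        for name, impact in sorted(per_network_totals.items())
    ]
    docs.append({"network": ["all"], "impact_sum": sum(per_network_totals.values())})
    return docs

single = collapse_network_docs({"doubleclick": 7})
several = collapse_network_docs({"doubleclick": 7, "google": 3})
print(single)
print(several)
```

With a long tail of advertisers seen on a single network, this removes close to half of their documents without changing any query result.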
  18. Reduce Duplicated Data
      - De-normalized schema means multiple records with similar data
        - “ad X on network Y”, “ad X on network Z”
        - We couldn’t use Solr’s “join” support (not in 3.6, issues with shards)
      - Non-indexed duplicated data goes into “special” records
        - e.g. the records that have “all” for a field value
  19. Defer Workflow Optimizations
      - Frequently tempted to get tricky
        - But helicopter stunts lead to pain and suffering
      - Often complex ETL means running multiple jobs in parallel
        - So job timing/prioritization is more important
  20. Analyzing Workflows
      - Sadly, hand analysis is currently required
      - Key is no dead time map/reduce slots
      - New solutions
        - Ambrose
        - Driven
  21. Useful Optimizations
      - “Cache” results - HDFS storage is cheap
        - Daily processing
        - Daily state + delta from today
      - Throw away data ASAP - avoid data baggage
        - Analytics data sets often have many, many fields
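One plausible reading of "daily state + delta from today" is that yesterday's aggregated results are kept on HDFS and each daily run only folds in the new day's data instead of reprocessing the full history. A sketch of that merge step, with plain dictionaries standing in for the stored state (names and key layout are assumptions):

```python
def apply_delta(state, delta):
    """Return a new state with today's per-key totals added to the cached totals."""
    merged = dict(state)  # leave the cached state untouched
    for key, value in delta.items():
        merged[key] = merged.get(key, 0) + value
    return merged

# Hypothetical cached totals and today's increment, keyed by (advertiser, network).
yesterday = {("oracle.com", "doubleclick"): 40, ("oracle.com", "google"): 12}
today = {("oracle.com", "google"): 3, ("example.com", "google"): 5}

new_state = apply_delta(yesterday, today)
print(new_state)
```

The daily job then scales with the size of one day's crawl rather than with the total history.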
  22. Map-side Reduction
      - Reduce the amount of data being sent from map to reduce
        - Often is bottleneck for jobs, due to network overhead
        - Examples include aggregation, group-level filtering
      - Hadoop has “combiners”, which are post-map reducers
        - Do incremental reduce on map side before sending to reducers
      - Cascading has “AggregateBy”, which are in-map reducers
        - Keeps some number of results in memory using LRU queue
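The in-map aggregation idea can be sketched in a few lines: keep a bounded number of partial sums in memory, and emit the least recently used one whenever the cache overflows. This illustrates the general technique described on the slide, not Cascading's actual AggregateBy implementation; the function name and `capacity` parameter are made up for the example.

```python
from collections import OrderedDict

def map_side_sum(pairs, capacity):
    """Partially sum (key, value) pairs on the map side with an LRU-bounded cache.

    Yields (key, partial_sum) records. A key may be emitted more than once
    if it is evicted and then seen again, so the reducer must still sum.
    """
    cache = OrderedDict()
    for key, value in pairs:
        if key in cache:
            cache[key] += value
            cache.move_to_end(key)  # mark as most recently used
        else:
            cache[key] = value
            if len(cache) > capacity:
                # Evict the least recently used partial sum and send it on.
                yield cache.popitem(last=False)
    # Flush whatever is still cached when the map input ends.
    yield from cache.items()

mapped = [("a", 1), ("b", 1), ("a", 1), ("c", 1), ("a", 1)]
emitted = list(map_side_sum(mapped, capacity=2))
print(emitted)
```

Here five map outputs shrink to three records crossing the network, and summing the emitted values per key still gives the same totals as summing the original pairs.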
  23. Avoid Heuristics in Hadoop
      - What’s easy to describe (and implement) in a function...
      - is often painful and slow in map-reduce
      - Conditional/branching logic is common example
        - If this join result matches X, use it; otherwise join with Y and do Z
  24. The Net-Net
      - If you have a web site that provides analytics
      - And it’s currently using an RDBMS like Oracle
      - You should be able to make it faster, cheaper, better (and scalable)
      - Using Hadoop & Solr
  25. Questions?
      - Feel free to contact me
        - http://www.scaleunlimited.com/contact/
      - Check out Lucid’s “Big Data & Solr” class
        - http://www.lucidimagination.com/services/training/
      - Check out Cascading
        - http://www.cascading.org/
