Faster, Cheaper, Better - Replacing Oracle with Hadoop & Solr

Our client helps advertisers target publishers/networks and improve ad results by analyzing millions of web pages every day. They have been able to cut monthly costs by more than 50%, improve response time by 4x, and quickly add new features by switching from a traditional DB-centric approach to one based on Hadoop & Solr. This analysis is handled by a complex Hadoop-based workflow, where the end result is a set of unique, highly optimized Solr indexes. The data processing platform provided by Hadoop also enables scalable machine learning using Mahout. This presentation covers some of the unique challenges in switching the web site from relying on slow, expensive real-time analytics using database queries to fast, affordable batch analytics and search using Hadoop and Solr.

Published in: Technology

Faster, Cheaper, Better - Replacing Oracle with Hadoop & Solr

  1. Faster, cheaper, better: Replacing Oracle with Hadoop and Solr. Ken Krugler, Scale Unlimited. Copyright (c) 2012 Scale Unlimited. All Rights Reserved. Monday, June 11, 2012.
  2. Obligatory Background: Ken Krugler, direct from Nevada City, California. Krugle startup (2005-2008) used Nutch, Hadoop, Solr. Now running Scale Unlimited: big data + search consulting + training.
  3. The 50,000ft View: We helped our client kick the RDBMS habit. It’s an analytics web site for display advertising. Got rid of DBs handling queries for their web site. Now uses Hadoop + Solr to cut costs, add features, improve performance, and increase scalability.
  4. What’s an Analytics Web Site? Let the user ask questions about data.
  5. Including Sexy Dashboards: All driven by slices of the data.
  6. Behind the web site curtain: Each view or constraint change triggers queries, e.g. “sum ad impact for all advertisers on all networks, sort by sum, limit 10” or “sum ad impact by ad type for advertiser ‘oracle.com’”. For millions of records, you have to choose: fast, accurate, inexpensive - pick any two.
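To make the two quoted queries concrete, here is a minimal in-memory sketch of what each one computes. The record layout, field names, and sample values are assumptions for illustration only; the slides do not show the client's actual data or SQL.

```python
from collections import defaultdict

# Hypothetical records: (advertiser, network, ad_type, impact).
# All names and numbers here are made up for illustration.
RECORDS = [
    ("oracle.com", "doubleclick", "flash", 5.0),
    ("oracle.com", "adsense", "image", 3.0),
    ("directv.com", "doubleclick", "text", 4.0),
    ("directv.com", "adsense", "text", 6.0),
    ("example.com", "doubleclick", "image", 1.0),
]


def top_advertisers(records, limit=10):
    """Sum ad impact per advertiser across all networks, sort by sum, keep `limit`."""
    totals = defaultdict(float)
    for advertiser, _network, _ad_type, impact in records:
        totals[advertiser] += impact
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return ranked[:limit]


def impact_by_ad_type(records, advertiser):
    """Sum ad impact by ad type for a single advertiser."""
    totals = defaultdict(float)
    for adv, _network, ad_type, impact in records:
        if adv == advertiser:
            totals[ad_type] += impact
    return dict(totals)
```

Both functions scan every record, which is the crux of the slide's point: at millions of records, doing this scan on every UI change is either slow or expensive.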
  7. Combinatorial Explosion: Too many possibilities to pre-calculate everything: more than 10^5 publishers, more than 10^6 advertisers, 30 ad networks, 3 day ranges, etc. So many trillions of possible combinations. Caching of DB query results isn’t very useful.
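The "trillions" figure can be checked with a quick calculation. This sketch treats the slide's "more than" figures as exact, so the result is only a lower bound, and it ignores the unspecified "etc." dimensions.

```python
# Figures taken from the slide; "more than" values are used as-is,
# so the product is a lower bound on the real number of combinations.
publishers = 10**5
advertisers = 10**6
ad_networks = 30
day_ranges = 3

combinations = publishers * advertisers * ad_networks * day_ranges
print(f"{combinations:,}")  # 9,000,000,000,000 (9 trillion)
```

Since any single user only ever touches a handful of these combinations, a cache of past DB query results rarely gets a hit.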
  9. Trouble in UI Land: UI refresh took 10-30 seconds, well outside of target range of “about a second or so”. 0.1 second: instantaneous. 1.0 second: I’m still in the flow. 10 seconds: I’m bored.
  10. Trouble in the back office: Beefy hardware for multiple DBs was expensive: AWS monthly cost approaching 5 figures, and the data sets needed to grow significantly. Constant schema changes meant painful data reloading: extract, load, transform (inside of DB), and re-indexing of DB fields.
  11. A New Approach: Do analytics off-line using Hadoop and pre-generate as much as possible. Use Solr as a NoSQL database and leverage search and faceting.
  12. Obligatory Architectural Slide: Two search servers. 8 shards per index. Optimize response time. Additional indexes: autocompletion, etc. 200M total documents.
  13. What Solr Gives Us: Fast, memory-efficient queries: count the number of documents that match a query; sort results by fields; and search, e.g. “Find all Flash ads with the word ‘diet’”. Fast faceting: count the number of results from a query that have different values for a field, e.g. “How many different image ad sizes (w/counts) are used by google?”
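As a rough sketch, the two example questions on this slide could map to Solr requests like the ones built below. `q`, `rows`, `sort`, `wt`, `facet`, `facet.field`, and `facet.mincount` are standard Solr request parameters. The endpoint URL and the field names `ad_type`, `text`, `advertiser`, and `image_size` are assumptions, since the slides only show the `network` and `publisher` fields.

```python
from urllib.parse import urlencode

# Assumed endpoint; the real search servers are not named in the slides.
SOLR_SELECT = "http://localhost:8983/solr/select"


def search_url(query, rows=10, sort=None):
    """Build a plain search request, optionally sorted by a field."""
    params = [("q", query), ("rows", rows), ("wt", "json")]
    if sort:
        params.append(("sort", sort))
    return SOLR_SELECT + "?" + urlencode(params)


def facet_url(query, facet_field):
    """Build a facet-only request: no documents returned, just per-value counts."""
    params = [
        ("q", query),
        ("rows", 0),              # only the facet counts are needed
        ("facet", "true"),
        ("facet.field", facet_field),
        ("facet.mincount", 1),    # skip values with no matching documents
        ("wt", "json"),
    ]
    return SOLR_SELECT + "?" + urlencode(params)


# "Find all Flash ads with the word 'diet'" (hypothetical field names)
flash_diet = search_url("ad_type:flash AND text:diet")

# "How many different image ad sizes (w/counts) are used by google?"
google_sizes = facet_url("advertiser:google", "image_size")
```

The facet request sets `rows=0` because the answer is in the facet counts rather than in the matching documents themselves.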
  14. How to Connect the Dots: We have web crawl data (ads, advertisers, publishers, networks), e.g. http://www.michiguide.com/some-page.html, text, google, “DIRECTV® For Businesses Save $13/mo ww.directv.com/business”. We have target Solr schemas with the fields defined, e.g. <field name="network" type="string" indexed="true" stored="false" required="true" /> and <field name="publisher" type="string" indexed="true" stored="false" required="true" />. How do we get from A to B? Data Sources -> f(data)??? -> Index.
  15. Hadoop ETL: Implement appropriate Extract, Transform, Load. Extract is just parsing text files that are stored in Amazon’s S3. Load is building the Solr index and deploying it to the search servers. What about that pesky “Transform” part?
  16. Simplicity Itself: 25 Hadoop jobs, developed with Cascading. Daily run is $25.
  17. Workflow Essentials: “Do analytics offline” means anything that involves aggregation. Solr is fine for first/last/count; pre-calculate anything that does math on each record. Essentially, the index is pre-calculated answers to 200M questions, e.g. “what is trendline for ad impact of this advertiser on that publisher?” or “which ads use 300x250 images?”
  18. Combinatorial Explosion: Limit questions that can be asked, e.g. no arbitrary date ranges. Requires tricky “biggest bang for buck” decisions. Collapse entries that are “all” and only one other: leverage Solr multi-value field support, so network:all and network:doubleclick are one entry.
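One way to read the "collapse" trick: when an advertiser appears on only one network, the "all networks" answer and the per-network answer are identical, so a single document with a multi-valued `network` field can serve both queries. The sketch below illustrates that reading. The document layout and the `impact` field are assumptions, not the client's actual schema.

```python
from collections import defaultdict


def build_docs(rows):
    """Turn (advertiser, network, impact) rows into index documents.

    Advertisers seen on a single network get one document whose multi-valued
    `network` field holds both "all" and the network name; others get one
    "all" document plus one document per network.
    """
    per_advertiser = defaultdict(dict)
    for advertiser, network, impact in rows:
        networks = per_advertiser[advertiser]
        networks[network] = networks.get(network, 0.0) + impact

    docs = []
    for advertiser, networks in sorted(per_advertiser.items()):
        if len(networks) == 1:
            ((network, impact),) = networks.items()
            docs.append({"advertiser": advertiser,
                         "network": ["all", network],
                         "impact": impact})
        else:
            docs.append({"advertiser": advertiser,
                         "network": ["all"],
                         "impact": sum(networks.values())})
            for network, impact in sorted(networks.items()):
                docs.append({"advertiser": advertiser,
                             "network": [network],
                             "impact": impact})
    return docs


# Hypothetical input: a.com is only on doubleclick, b.com is on two networks.
docs = build_docs([
    ("a.com", "doubleclick", 2.0),
    ("b.com", "doubleclick", 1.0),
    ("b.com", "adsense", 3.0),
])
```

Here the naive layout would need five documents (an "all" plus a per-network document for every pair), while the collapsed layout needs four. The saving grows with the share of advertisers that appear on a single network.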
  19. Reduce Duplicated Data: De-normalized schema means multiple records with similar data, e.g. “ad X on network Y”, “ad X on network Z”. We couldn’t use Solr’s “join” support (not in 3.6, issues with shards). Non-indexed duplicated data goes into “special” records, e.g. the records that have “all” for a field value.
  20. Defer Workflow Optimizations: Frequently tempted to get tricky, but helicopter stunts lead to pain and suffering. Often complex ETL means running multiple jobs in parallel, so job timing/prioritization is more important.
  21. Analyzing Workflows: Sadly, hand analysis is currently required. Key is no dead time in map/reduce slots. New solutions: Ambrose, Driven.
  22. Useful Optimizations: “Cache” results - HDFS storage is cheap. Daily processing: daily state + delta from today. Throw away data ASAP - avoid data baggage; analytics data sets often have many, many fields.
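The "daily state + delta" idea can be sketched as follows: rather than reprocessing all history each day, keep yesterday's aggregated state (cached on HDFS in the talk's setup) and fold in only today's new data. The key layout and field names below are hypothetical, and in practice this would run as a Hadoop job rather than in-memory Python.

```python
def project(record, keep=("ad", "network", "count")):
    """Throw away unneeded fields as early as possible (field names are hypothetical)."""
    return {name: record[name] for name in keep}


def merge_state(previous_state, todays_delta):
    """Combine the cached state from prior days with today's counts."""
    merged = dict(previous_state)
    for key, count in todays_delta.items():
        merged[key] = merged.get(key, 0) + count
    return merged


# Hypothetical example: keys are (ad, network) pairs, values are sighting counts.
yesterday = {("ad1", "doubleclick"): 5}
today = {("ad1", "doubleclick"): 2, ("ad2", "adsense"): 1}
state = merge_state(yesterday, today)
```

The merged result is written out as the new cached state, so each daily run only has to read one day of raw input.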
  23. Map-side Reduction: Reduce the amount of data being sent from map to reduce; this is often the bottleneck for jobs, due to network overhead. Examples include aggregation and group-level filtering. Hadoop has “combiners”, which are post-map reducers: do an incremental reduce on the map side before sending to reducers. Cascading has “AggregateBy”, which are in-map reducers: keeps some number of results in memory using an LRU queue.
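The in-map reduction idea can be illustrated with a small Python sketch: partial sums are held in a bounded LRU cache, and the least recently used entry is emitted downstream whenever the cache overflows. This is only an illustration of the technique, not Cascading's actual AggregateBy implementation. The class name `PartialSum` and the `emit` callback are hypothetical.

```python
from collections import OrderedDict


class PartialSum:
    """Map-side partial aggregation with a bounded LRU cache of running sums."""

    def __init__(self, capacity, emit):
        self.capacity = capacity      # max number of keys held in memory
        self.emit = emit              # callback that sends (key, partial_sum) to the reducers
        self.cache = OrderedDict()    # ordered from least to most recently used

    def add(self, key, value):
        if key in self.cache:
            self.cache[key] += value
            self.cache.move_to_end(key)   # mark as most recently used
        else:
            self.cache[key] = value
            if len(self.cache) > self.capacity:
                old_key, old_sum = self.cache.popitem(last=False)
                self.emit(old_key, old_sum)

    def flush(self):
        """Emit everything still cached; call once at the end of the map task."""
        while self.cache:
            self.emit(*self.cache.popitem(last=False))


# Usage: five map outputs collapse into three records sent over the network.
emitted = []
aggregator = PartialSum(capacity=2, emit=lambda k, v: emitted.append((k, v)))
for key in ["a", "b", "a", "c", "a"]:
    aggregator.add(key, 1)
aggregator.flush()
```

Because an evicted key can show up again later, the same key may be emitted more than once, so the reducer still has to sum the partial results. The benefit is only that fewer records cross the network.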
  24. Avoid Heuristics in Hadoop: What’s easy to describe (and implement) in a function is often painful and slow in map-reduce. Conditional/branching logic is a common example: “If this join result matches X, use it; otherwise join with Y and do Z.”
  29. The Net-Net: If you have a web site that provides analytics, and it’s currently using an RDBMS like Oracle, you should be able to make it faster, cheaper, better (and scalable) using Hadoop & Solr.
  30. Questions? Feel free to contact me (http://www.scaleunlimited.com/contact/). Check out Lucid’s “Big Data & Solr” class (http://www.lucidimagination.com/services/training/). Check out Cascading (http://www.cascading.org/).
