Faster Cheaper Better-Replacing Oracle with Hadoop & Solr

1

Faster, cheaper, better
Replacing Oracle with
Hadoop and Solr

Ken Krugler
Scale Unlimited

Copyright (c) 2012 Scale Unlimited. All Rights Reserved.

Monday, June 11, 12

2

Obligatory Background

Ken Krugler - direct from Nevada City, California
Krugle startup (2005-2008) used Nutch, Hadoop, Solr
Now running Scale Unlimited
big data + search
consulting + training


Monday, June 11, 12

3

The 50,000ft View

We helped our client kick the RDBMS habit
It’s an analytics web site for display advertising
Got rid of DBs handling queries for their web site
Now uses Hadoop + Solr to...
cut costs
add features
improve performance
increase scalability


Monday, June 11, 12

4

What’s an Analytics Web Site?

Let the user ask questions about data


Monday, June 11, 12

5

Including Sexy Dashboards

All driven by slices of the data


Monday, June 11, 12

6

Behind the web site curtain

Each view or constraint change triggers queries
“sum ad impact for all advertisers on all networks, sort by sum, limit 10”
“sum ad impact by ad type for advertiser ‘oracle.com’”

For millions of records, you have to chose...
Fast, accurate, inexpensive - pick any two


Monday, June 11, 12

7

Combinatorial Explosion

Too many possibilities to pre-calculate everything
more than 10^5 publishers
more than 10^6 advertisers
30 ad networks, 3 day ranges, etc

So many trillions of possible combinations
Caching of DB query results isn’t very useful


Monday, June 11, 12

8

Trouble in UI Land

UI refresh took 10-30 seconds
Well outside of target range of “about a second or so”


Monday, June 11, 12

8

Trouble in UI Land

UI refresh took 10-30 seconds
Well outside of target range of “about a second or so”

0.1 second: instantaneous
1.0 second: I’m still in the ﬂow
10 seconds: I’m bored


Monday, June 11, 12

9

Trouble in the back office

Beefy hardware for multiple DBs was expensive
AWS monthly cost approaching 5 figures
And the data sets needed to grow significantly

Constant schema changes meant painful data reloading
Extract, load, transform (inside of DB)
Re-indexing of DB fields


Monday, June 11, 12

10

A New Approach

Do analytics off-line using Hadoop
Pre-generate as much as possible
Use Solr as a NoSQL database
And leverage search, faceting

+ =

Monday, June 11, 12

11

Obligatory Architectural Slide

Two search servers
8 shards per index
Optimize response time
Additional indexes
autocompletion, etc.
200M total documents


Monday, June 11, 12

12

What Solr Gives Us

Fast, memory-efficient queries
Count the number of documents that match a query
Sort results by fields
And search - “Find all Flash ads with the word ‘diet’”

Fast faceting
Count # of results from query that have different values for a field
“How many different image ad sizes (w/counts) are used by google?”


Monday, June 11, 12

13

How to Connect the Dots
We have web crawl data - ads, advertisers, publishers, networks
http://www.michiguide.com/some-page.html text google
DIRECTV® For Businesses Save $13/mo ww.directv.com/business

We have target Solr schemas with the ﬁelds deﬁned

<field name="network" type="string" indexed="true" stored="false" required="true" />
<field name="publisher" type="string" indexed="true" stored="false" required="true" />

How do we get from A to B?

Data
f(data)??? Index
Sources


Monday, June 11, 12

14

Hadoop ETL

Implement appropriate Extract, Transform, Load
Extract is just parsing text ﬁles that are stored in Amazon’s S3
Load is building the Solr index and deploying it to the search servers
What about that pesky “Transform” part?


Monday, June 11, 12

15

Simplicity Itself

25 Hadoop Jobs
Developed with Cascading
Daily run is $25


Monday, June 11, 12

16

Workflow Essentials

“Do analytics offline” means anything that involves aggregation
Solr is fine for first/last/count
Pre-calculate anything that does math on each record
Essentially index is pre-calculated answers to 200M questions
“what is trendline for ad impact of this advertiser on that publisher?”
“which ads use 300x250 images?”


Monday, June 11, 12

17

Combinatorial Explosion

Limit questions that can be asked
E.g. no arbitrary date ranges
Requires tricky “biggest bang for buck” decisions

Collapse entries that are “all” and only one other
Leverage Solr multi-value ﬁeld support
network:all and network:doubleclick are one entry


Monday, June 11, 12

18

Reduce Duplicated Data

De-normalized schema means multiple records with similar data
“ad X on network Y”, “ad X on network Z”
We couldn’t use Solr’s “join” support (not in 3.6, issues with shards)

Non-indexed duplicated data goes into “special” records
e.g. the records that have “all” for a ﬁeld value


Monday, June 11, 12

19

Defer Workﬂow Optimizations

Frequently tempted to get tricky
But helicopter stunts lead to pain and suffering

Often complex ETL means running multiple jobs in parallel
So job timing/prioritization is more important


Monday, June 11, 12

20

Analyzing Workﬂows

Sadly, hand analysis is
currently required

Key is no dead time
map/reduce slots

New solutions
Ambrose
Driven


Monday, June 11, 12

21

Useful Optimizations

“Cache” results - HDFS storage is cheap
Daily processing
Daily state + delta from today

Throw away data ASAP - avoid data baggage
Analytics data sets often have many, many ﬁelds


Monday, June 11, 12

22

Map-side Reduction
Reduce the amount of data being sent from map to reduce
Often is bottleneck for jobs, due to network overhead
Examples include aggregation, group-level ﬁltering

Hadoop has “combiners”, which are post-map reducers
Do incremental reduce on map side before sending to reducers

Cascading has “AggregateBy”, which are in-map reducers
Keeps some number of results in memory using LRU queue


Monday, June 11, 12

23

Avoid Heuristics in Hadoop

What’s easy to describe (and implement) in a function...
is often painful and slow in map-reduce

Conditional/branching logic is common example
If this join result matches X, use it; otherwise join with Y and do Z


Monday, June 11, 12

24

The Net-Net


Monday, June 11, 12

24

The Net-Net

If you have a web site that provides analytics


Monday, June 11, 12

24

The Net-Net

And it’s currently using a RDBMS like Oracle


Monday, June 11, 12

24

The Net-Net

You should be able to make it faster, cheaper, better (and scalable)


Monday, June 11, 12

24

The Net-Net

You should be able to make it faster, cheaper, better (and scalable)
Using Hadoop & Solr


Monday, June 11, 12

25

Questions?

Feel free to contact me
http://www.scaleunlimited.com/contact/

Check out Lucid’s “Big Data & Solr” class
http://www.lucidimagination.com/services/training/

Check out Cascading
http://www.cascading.org/


Monday, June 11, 12

Faster Cheaper Better-Replacing Oracle with Hadoop & Solr

Recommended

Recommended

More Related Content

Similar to Faster Cheaper Better-Replacing Oracle with Hadoop & Solr

Similar to Faster Cheaper Better-Replacing Oracle with Hadoop & Solr (20)

More from DataWorks Summit

More from DataWorks Summit (20)

Recently uploaded

Recently uploaded (20)

Faster Cheaper Better-Replacing Oracle with Hadoop & Solr