2. Scale
Large data sizes:
~3B records per day (~2.5 TB of uncompressed JSON).
~100 primary and ~300 derived dimensions.
~50 measures.
Analysis horizon: years.
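The stated volumes imply a rough per-record size. A back-of-the-envelope check (the per-record figure is derived here, not stated on the slide; decimal terabytes assumed):

```python
# Numbers from the slide: ~3B records/day, ~2.5 TB/day of uncompressed JSON.
records_per_day = 3_000_000_000
bytes_per_day = 2.5e12            # decimal TB assumed

# Derived (not stated in the deck): average uncompressed record size.
avg_record_size = bytes_per_day / records_per_day
print(f"~{avg_record_size:.0f} bytes per uncompressed JSON record")  # ~833 bytes
```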
3. Scope
Highly dynamic data and analytic needs:
Frequent addition of new dimensions.
Very dynamic query patterns.
Both canned and ad-hoc reports.
Multiple phase-shifted large data streams.
Different kinds of consumers: sales, analysts, execs, machines.
4. The Beginnings: Perl to MR (Hadoop)
Logs summarized using Perl at low volumes (order of hundreds of thousands).
Perl could not handle increased volumes (millions). (Q2, 2010)
MR jobs to aggregate logs and populate the DB (3-machine cluster).
DB views increased; creating MR jobs was time consuming, error prone, and hard. (Q3, 2010)
5. Solving for Pipeline - Pig
MR: a new job per need; known by few.
Pig: well suited for medium-complexity pipeline jobs.
Data gets aggregated using Pig and pushed to the DB for analytics.
6. Analytics gets complex
Business evolved; complex analytics needed.
DB suffers 'limited angle view' problems: proliferation of materialized views.
Hive: not mature (early 2011); too resource-hungry on small/medium clusters; lots of flux; not optimal; difficult to fix things and add features.
Back to Pig: a team of engineers writing ad-hoc Pig scripts for the business. Performance only as good as the person writing the query; very low productivity.
7. Realization
Frequently, 'tools' don't work as intended: too much customization and constant tuning.
Difficult to absorb the dynamics of the data.
Too generic, and not optimal for our data models and cluster size.
Parts of the required stack are difficult to integrate and maintain.
Pig is not suited for analytics by business users; too much technical knowledge needed.
8. Yoda
Developed an in-house system to satisfy ad-hoc analytics.
Complete stack (ETL, query processor, query builder, visualization) on top of Hadoop, for processing logs & analytics. (Q1, 2011)
SQL-like operations: Select, Sum, Avg, Min, Max, Count, Distinct, Decode, Expressions, GroupBy, Where, Having, UDF, UDAF, etc.
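The SQL-like operations above can be sketched in miniature. This is an illustrative Python model of Select/Sum/GroupBy/Where/Having over log records, not Yoda's actual API; the field names are hypothetical:

```python
from collections import defaultdict

def run_query(rows, group_by, measure, where=None, having=None):
    """SELECT group_by, SUM(measure) FROM rows WHERE ... GROUP BY ... HAVING ..."""
    groups = defaultdict(float)
    for row in rows:
        if where and not where(row):     # WHERE: filter individual records
            continue
        groups[row[group_by]] += row[measure]   # GROUP BY + SUM aggregate
    # HAVING: filter on the aggregated values, not the raw records
    return {k: v for k, v in groups.items() if having is None or having(v)}

logs = [
    {"country": "US", "clicks": 10},
    {"country": "US", "clicks": 5},
    {"country": "IN", "clicks": 2},
]
print(run_query(logs, "country", "clicks", having=lambda total: total > 3))
# {'US': 15.0}
```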
9. Yoda cont.
Heavily optimized storage and queries for the data model.
All the fact data streams and metadata in a coherent, seamless view.
Platform: UI as well as API (to embed the functionality in other apps).
10. Life of a Query
UI: Validate; convert to protobuf; transmit JSON; collect data; format and output CSV; update status; notify user.
Optimizations: Select metadata -> query metadata; fact promotions: select cube; create joins; estimate cost; GroupBy/Where priority; select optimal grain; determine split size.
Driver: Filter push-down; optimize query via reorganization; generate MR spec.
Mapper: Filter at record reconstruction; apply fact filters; perform join; dim filter; Select/Group; partial aggregation.
Reducer: Do aggregate; apply formula; Having; Top N.
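The driver -> mapper -> reducer flow above can be simulated end to end. A minimal sketch, assuming hypothetical record fields and a simplified "MR spec"; none of these names come from Yoda itself:

```python
from collections import defaultdict

def driver(query):
    # Driver: push the filter down into the map phase and emit an "MR spec".
    return {"filter": query["where"], "key": query["group_by"],
            "measure": query["measure"], "having": query["having"],
            "top_n": query["top_n"]}

def mapper(spec, records):
    # Mapper: apply pushed-down filters, then partial aggregation per shard.
    partial = defaultdict(int)
    for r in records:
        if spec["filter"](r):
            partial[r[spec["key"]]] += r[spec["measure"]]
    return partial

def reducer(spec, partials):
    # Reducer: merge partial aggregates, apply HAVING, then Top-N.
    totals = defaultdict(int)
    for p in partials:
        for k, v in p.items():
            totals[k] += v
    kept = {k: v for k, v in totals.items() if spec["having"](v)}
    return sorted(kept.items(), key=lambda kv: -kv[1])[:spec["top_n"]]

spec = driver({"where": lambda r: r["imps"] > 0, "group_by": "site",
               "measure": "imps", "having": lambda v: v >= 2, "top_n": 2})
shard1 = mapper(spec, [{"site": "a", "imps": 1}, {"site": "b", "imps": 3}])
shard2 = mapper(spec, [{"site": "a", "imps": 2}, {"site": "c", "imps": 0}])
print(reducer(spec, [shard1, shard2]))  # [('a', 3), ('b', 3)]
```

The key property, matching the slide, is that partial aggregation happens per mapper shard, so only small per-group totals cross the shuffle boundary.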
11. What worked
Efficiency in modeling and joins:
Solid data modeling. Wasteful to perform joins on the fly; single-stage MR to both group and join.
Map-side metadata joins: efficient horizontal, vertical & filtered data load.
Pre-join metadata once.
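A map-side (broadcast) join works because the pre-joined metadata is small enough to load once per mapper, so fact records are joined without a shuffle. A sketch with illustrative, hypothetical schema names:

```python
# Small dimension table, pre-joined once and broadcast to every mapper.
DIM = {1: "US", 2: "IN"}

def map_side_join(fact_records, dim=DIM):
    """Join each fact record against the in-memory dimension table."""
    for rec in fact_records:
        country = dim.get(rec["geo_id"])
        if country is None:        # filtered load: drop unjoinable rows early
            continue
        yield {**rec, "country": country}

facts = [{"geo_id": 1, "imps": 10}, {"geo_id": 9, "imps": 4}]
print(list(map_side_join(facts)))
# [{'geo_id': 1, 'imps': 10, 'country': 'US'}]
```

Dropping unjoinable rows inside the mapper is what makes the "filtered data load" cheap: bad rows never reach the shuffle or the reducer.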
12. What worked cont.
Simplicity: transparent cube and aggregate selection (no From or Join clause).
Ability to absorb data dynamics.
Intuitive query builder.
Analytics, not 'just' query.
Support for 'scheduled' ad-hoc queries.
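Transparent cube selection (no From or Join clause) can be modeled as: pick the cheapest pre-aggregated cube whose dimensions cover the query's dimensions. A minimal sketch; the cube names, dimension sets, and row counts are hypothetical:

```python
# name -> (dimensions available in the cube, approximate row count)
CUBES = {
    "daily_geo":      ({"day", "country"}, 1e6),
    "daily_geo_site": ({"day", "country", "site"}, 1e8),
    "raw":            ({"day", "country", "site", "user"}, 3e9),
}

def select_cube(query_dims):
    """Return the smallest cube that can answer a query over query_dims."""
    candidates = [(rows, name) for name, (dims, rows) in CUBES.items()
                  if query_dims <= dims]          # cube must cover the query
    return min(candidates)[1]                     # cheapest covering cube

print(select_cube({"day", "country"}))   # daily_geo
print(select_cube({"day", "site"}))      # daily_geo_site
```

The user never names a table; the planner answers from the cheapest cube that covers the requested dimensions, which is what makes the From/Join clause unnecessary.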