Yoda fifth elephant

3. Scope. Highly dynamic data and analytic needs. Frequent addition of newer dimensions. Very dynamic query patterns. Both canned and ad-hoc reports. Multiple phase-shifted large data streams. Different kinds of consumers: sales, analysts, execs, machines.

4. The Beginnings: Perl to MR (Hadoop). Logs summarized using Perl. Low volumes (order of a hundred thousand). Perl could not handle increased volumes (millions). (Q2, 2010) MR jobs to aggregate logs and populate the DB (3-machine cluster). DB views increased; creating MR jobs was time-consuming, error-prone and hard. (Q3, 2010)

5. Solving for Pipeline - Pig. MR: new job per need; known by few. Pig: well suited for medium-complexity pipeline jobs. Data gets aggregated using Pig and pushed to the DB for analytics.

6. Analytics gets complex. Business evolved; complex analytics needed. DB suffers from ‘limited angle view’ problems. Proliferation of materialized views. Hive: not mature (early 2011), too resource-hungry on small/medium clusters, a lot of flux, not optimal, difficult to fix things and add features. Back to Pig: team of engineers writing ad-hoc Pig scripts for business; performance only as good as the person writing the query; very low productivity.

7. Realization. Frequently ‘tools’ don’t work as intended. Too many customizations and constant tuning. Difficult to absorb the dynamics of the data. Too generic and not optimal for our data models and cluster size. Parts of the required stack are difficult to integrate and maintain. Pig not suited for analytics by business: too much technical knowledge needed.

8. Yoda. In-house system developed to satisfy ad-hoc analytics. Complete stack (ETL, Query Processor, Query Builder, Visualization) on top of Hadoop, for processing logs and analytics. (Q1, 2011) SQL-like operations such as Select, Sum, Avg, Min, Max, Count, Distinct, Decode, Expressions, GroupBy, Where, Having, UDF, UDAF etc.
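To make the operation list concrete, here is a minimal in-memory sketch of Decode, Where, GroupBy, a few aggregates and Having. It is not Yoda's implementation (the slides do not show it): `decode`, `AGGREGATES`, `run_query` and the sample columns are all hypothetical names chosen for illustration, and `decode` is assumed to follow the familiar SQL `DECODE(value, search, result, ..., default)` shape.

```python
from collections import defaultdict

def decode(value, *pairs_and_default):
    """DECODE-style lookup: decode(v, s1, r1, s2, r2, ..., default)."""
    args = list(pairs_and_default)
    # An odd number of trailing arguments means the last one is the default.
    default = args.pop() if len(args) % 2 == 1 else None
    for search, result in zip(args[0::2], args[1::2]):
        if value == search:
            return result
    return default

# Hypothetical aggregate registry; a UDAF would simply be one more entry.
AGGREGATES = {
    "sum": sum,
    "avg": lambda xs: sum(xs) / len(xs),
    "min": min,
    "max": max,
    "count": len,
    "distinct": lambda xs: len(set(xs)),
}

def run_query(rows, group_by, measures, where=None, having=None):
    """measures maps an output name to (aggregate name, input column)."""
    groups = defaultdict(list)
    for row in rows:
        if where is None or where(row):          # Where: row-level filter
            groups[tuple(row[c] for c in group_by)].append(row)
    out = []
    for key, members in groups.items():
        result = dict(zip(group_by, key))
        for name, (agg, col) in measures.items():
            result[name] = AGGREGATES[agg]([m[col] for m in members])
        if having is None or having(result):     # Having: group-level filter
            out.append(result)
    return out

rows = [
    {"country": "IN", "os": 1, "clicks": 10},
    {"country": "IN", "os": 2, "clicks": 5},
    {"country": "US", "os": 1, "clicks": 7},
    {"country": "US", "os": 1, "clicks": 0},
]
for r in rows:
    r["os_name"] = decode(r["os"], 1, "android", 2, "ios", "other")

result = run_query(
    rows,
    group_by=["country"],
    measures={"total_clicks": ("sum", "clicks"), "rows": ("count", "clicks")},
    where=lambda r: r["clicks"] > 0,
    having=lambda g: g["total_clicks"] >= 10,
)
print(result)
```

Note that the caller names only dimensions and measures; nothing in the query says which table to read.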

9. Yoda cont. Storage and queries heavily optimized for the data model. All the fact data streams and metadata in a coherent, seamless view. Platform: UI as well as API (to embed the functionality in other apps).

10. Life of a Query. [Flow diagram; layout lost in extraction. Recoverable stages: UI, Optimizations, Driver, Mapper, Reducer. Recoverable step labels, not necessarily in diagram order: validate; convert to query protobuf; transmit JSON; select fact; select cube; select priority; metadata -> fact promotions; create joins; estimate cost; determine grain; select optimal split size; generate MR spec; filter push down; optimize query via fact filters; Where; apply formula; perform join; dim filter; select/group; GroupBy; partial aggregation; record reconstruction; data reorganization; collect data; do aggregate; Having; Top N; format and output CSV; update status; notify user.]
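A few of the steps named on this slide (filter push down, partial aggregation, Having, Top N) can be sketched in plain Python. This is only an illustration of the general map/reduce pattern under assumed names: `mapper`, `reducer`, and the `site`, `clicks` and `country` fields are hypothetical, and the real system ran these steps as Hadoop MR tasks rather than as local functions.

```python
from collections import Counter
import heapq

def mapper(split, where):
    """Filter one input split and pre-aggregate it locally."""
    partial = Counter()
    for rec in split:
        if where(rec):                       # filter applied on the map side
            partial[rec["site"]] += rec["clicks"]
    return partial                           # one small partial result per split

def reducer(partials, having, top_n):
    """Merge partial aggregates, then apply Having and Top N."""
    total = Counter()
    for p in partials:
        total.update(p)                      # Counter.update adds the counts
    kept = [(site, clicks) for site, clicks in total.items() if having(clicks)]
    return heapq.nlargest(top_n, kept, key=lambda kv: kv[1])

splits = [
    [{"site": "a", "clicks": 3, "country": "IN"},
     {"site": "b", "clicks": 1, "country": "IN"}],
    [{"site": "a", "clicks": 4, "country": "IN"},
     {"site": "c", "clicks": 9, "country": "US"}],
    [{"site": "b", "clicks": 5, "country": "IN"},
     {"site": "d", "clicks": 1, "country": "IN"}],
]

partials = [mapper(s, where=lambda r: r["country"] == "IN") for s in splits]
top = reducer(partials, having=lambda clicks: clicks >= 2, top_n=2)
print(top)
```

Aggregating inside each mapper means the reducer receives one small dictionary per split instead of every matching record.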

11. What worked. Efficiency in modeling and joins: solid data modeling. Wasteful to perform joins on the fly. Single-stage MR to both group and join. Map-side metadata joins: efficient horizontal, vertical and filtered data load. Pre-join metadata once.
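The idea of pre-joining metadata once and then joining and grouping in a single stage can be sketched as follows. This is an assumption-laden illustration, not the deck's code: the `SITES` and `PUBLISHERS` tables, the `site_id` and `requests` columns and all function names are hypothetical, and the in-memory dictionary stands in for whatever mechanism ships metadata to each mapper.

```python
from collections import defaultdict

# Hypothetical small dimension (metadata) tables.
SITES = {
    101: {"publisher_id": 1, "site_name": "news-app"},
    102: {"publisher_id": 1, "site_name": "game-app"},
    103: {"publisher_id": 2, "site_name": "chat-app"},
}
PUBLISHERS = {
    1: {"publisher_name": "Acme"},
    2: {"publisher_name": "Beta"},
}

def prejoin_metadata(sites, publishers):
    """Join the dimension tables once, ahead of query time."""
    return {
        site_id: {**site, **publishers[site["publisher_id"]]}
        for site_id, site in sites.items()
    }

def map_join(fact_rows, metadata, columns, dim_filter=None):
    """Map phase: in-memory lookup per fact row, emit (group key, measure)."""
    for row in fact_rows:
        dims = metadata.get(row["site_id"])
        if dims is None:
            continue                          # inner join: drop unknown sites
        if dim_filter is not None and not dim_filter(dims):
            continue                          # filter on dimension attributes
        # Only the requested columns are carried forward.
        yield tuple(dims[c] for c in columns), row["requests"]

def reduce_sum(pairs):
    """Reduce phase: group by key and sum the measure."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

facts = [
    {"site_id": 101, "requests": 10},
    {"site_id": 102, "requests": 5},
    {"site_id": 101, "requests": 2},
    {"site_id": 103, "requests": 7},
    {"site_id": 999, "requests": 4},          # no metadata, dropped by the join
]

metadata = prejoin_metadata(SITES, PUBLISHERS)
by_publisher = reduce_sum(map_join(facts, metadata, ["publisher_name"]))
print(by_publisher)
```

Because the join happens inside the mapper, the shuffle carries only group keys and measures, and no separate join job is needed before the grouping.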

12. What worked cont. Simplicity: transparent cube and aggregate selection (no From or Join clause). Ability to absorb data dynamics. Intuitive query builder. Analytics, not ‘just’ query. Support for ‘scheduled’ ad-hoc queries.
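One plausible way to make cube selection transparent is to pick, among the cubes that contain every requested dimension, the one that is cheapest to scan. The sketch below assumes exactly that; the cube names, dimension sets and row counts are invented for illustration, and using row count as the cost is an assumption rather than something the slides state.

```python
# Hypothetical catalogue of pre-aggregated cubes.
CUBES = [
    {"name": "hourly_full", "dims": {"hour", "country", "site", "os"}, "rows": 500_000_000},
    {"name": "daily_country_os", "dims": {"day", "country", "os"}, "rows": 2_000_000},
    {"name": "daily_country", "dims": {"day", "country"}, "rows": 50_000},
]

def select_cube(query_dims, cubes=CUBES):
    """Return the smallest cube that covers every dimension in the query."""
    wanted = set(query_dims)
    candidates = [c for c in cubes if wanted <= c["dims"]]
    if not candidates:
        raise LookupError(f"no cube covers dimensions {sorted(wanted)}")
    # Assumed cost model: fewer rows means a cheaper scan.
    return min(candidates, key=lambda c: c["rows"])

print(select_cube(["day", "country"])["name"])
print(select_cube(["country", "os"])["name"])
print(select_cube(["site"])["name"])
```

The user never writes a From clause: the dimensions in the query alone determine which cube is read.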

13. Demo + QA. gaurav@inmobi.com
