A bit about MediaMath: we’re an ad tech company, and we write software to buy digital media. So say you go to a site and see a banner ad at the top: there was an auction to decide who would get that space and how much they would pay. We build systems to ingest, analyze, and make decisions on those bid opportunities, and we use machine learning to optimize our bidding.
This story starts in 2013, with a description of our data warehouse at the time.
Three Netezza servers would store and process all of our logs into standard reports
The Netezza servers held separate copies of the source data for standard reports
Push reports to Oracle data marts
All the lines here, the glue of this system, are SQL, executed by shell scripts
Netezza servers would store 7-13 days of logs before purging
The architecture diagram is pretty true to life, so you may have noticed that there’s no dedicated QA environment
We have a dev server or two, but with the amount of data we deal with, it’s costly to keep an up-to-date QA environment, which leads to a mismatch with production.
Similarly, we had no environment for ad-hoc analytics. Simply selecting fields – no aggregations, nothing fancy – would cause reporting delays
And so with these in mind, the question of scaling was a frightening one. Updating workflows and creating new ones was frustrating; we couldn’t just keep copying our logs from server to server (we needed to scale vertically as well); and adding more shell and SQL would only lead to more problems.
This is the organizational data flow.
The data warehouse team held most of the data engineers at MediaMath. They would push reports to where the reporting team could lay an API over them, but the reporting team was mostly DBAs and API developers, and only after reporting did the rest of the company get its first crack at the data. We had unofficial links to bypass reporting, but those were very tightly controlled
The “productized” version of our log-level data was custom FTP transfers.
Would compete for resources with production workflows
The FTP server would run out of space, usually after hours, and you’d get into the office the next day to a client who was upset that you had deleted their data.
All of this led to a heavy reliance on canned reports, served via our API. Some of these reports were updated three times a day, some were updated once a day. Canned reports are great, but with the aforementioned developer difficulties, we just couldn’t keep pace.
Log-level data is the lifeblood of our reporting. But for the longest time our logs, the greatest source of insights, were also one of the hardest things to get at.
So this was the state of affairs around 2013, and these are the issues that led to this process of “data liberation”. The name accurately describes our goal: we wanted to break down the silos that existed within our company (and outside our data warehouse). In short, we needed to remove infrastructure as a limiting factor in data sharing, both internally and externally, and this drove our transition from data warehouse to data platform
Need to leave behind our monolithic, big-box data warehouse
No more single-machine processing, much more fault-tolerant
Standardize access to data and make it easier for folks of all backgrounds to get real value
Those were the super-high-level goals, and we saw that central to all of them would be decoupling storage and computation
Need to make sure extracting data doesn’t interfere with processing data.
We did this along two axes: technically and organizationally.
Technically: we decided to move our data warehouse to the cloud, and in the process move to more of a platform
A little later on I’ll discuss an organizational decoupling of storage and computation
If we need 40 nodes for 2 hours, we can get that.
Spot instances: leftover inventory that you bid on
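To make the elasticity concrete, here’s a minimal sketch of what a spot request could look like through the AWS SDK for Java (v1), called from Scala; the AMI, instance type, and bid price are all hypothetical placeholders, not our actual configuration:

```scala
import com.amazonaws.services.ec2.AmazonEC2ClientBuilder
import com.amazonaws.services.ec2.model.{LaunchSpecification, RequestSpotInstancesRequest}
import scala.collection.JavaConverters._

object SpotRequestSketch extends App {
  val ec2 = AmazonEC2ClientBuilder.defaultClient()

  // Hypothetical launch spec: 40 worker nodes for a short-lived job.
  val spec = new LaunchSpecification()
    .withImageId("ami-12345678")   // placeholder AMI
    .withInstanceType("r3.xlarge") // placeholder instance type

  val request = new RequestSpotInstancesRequest()
    .withSpotPrice("0.10") // max bid in USD/hour for the leftover inventory
    .withInstanceCount(40)
    .withLaunchSpecification(spec)

  // Returns request IDs; the instances come up if the bid clears.
  val result = ec2.requestSpotInstances(request)
  result.getSpotInstanceRequests.asScala.foreach(r => println(r.getSpotInstanceRequestId))
}
```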
Redshift is marketed as Amazon’s “data warehouse” solution, but we saw it as a more suitable replacement for our Oracle data marts, since it allowed us to de-aggregate some of our reports (i.e., allow custom date ranges instead of pre-aggregating by “yesterday”, “last 7 days”, etc.)
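To make “de-aggregate” concrete, here’s a minimal sketch, with a hypothetical cluster endpoint and schema, of the kind of arbitrary-date-range query this enables (plain JDBC from Scala; assumes a Redshift-compatible JDBC driver on the classpath):

```scala
import java.sql.DriverManager

object CustomRangeQuery extends App {
  // Hypothetical endpoint, credentials, and schema.
  val conn = DriverManager.getConnection(
    "jdbc:redshift://example.abc123.us-east-1.redshift.amazonaws.com:5439/reports",
    "reporting_user", "secret")

  // The same rollup a canned report would pre-compute, but over
  // whatever date range the caller asks for.
  val stmt = conn.prepareStatement(
    """SELECT campaign_id, SUM(spend) AS spend
      |FROM impressions
      |WHERE log_date BETWEEN ? AND ?
      |GROUP BY campaign_id""".stripMargin)
  stmt.setDate(1, java.sql.Date.valueOf("2015-03-01"))
  stmt.setDate(2, java.sql.Date.valueOf("2015-03-19"))

  val rs = stmt.executeQuery()
  while (rs.next()) println(s"${rs.getLong("campaign_id")}\t${rs.getBigDecimal("spend")}")
  conn.close()
}
```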
Direct S3 access is our solution to the data access problem, so I’ll zoom in on that a little more.
Data is generated by various teams within MediaMath; we enrich the logs and store them partitioned by organization. Identity and Access Management (IAM) is the Amazon service we use for access control, and from there clients can safely read their data (and only their data) and process it however they like. This, essentially, is our replacement for the FTP transfers we used to set up.
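Here’s a sketch of what that looks like from a client’s side, with a hypothetical bucket and prefix layout; the point is that a client’s IAM credentials are scoped to their organization’s prefix, so listing their own data works and anything else is denied:

```scala
import com.amazonaws.services.s3.AmazonS3ClientBuilder
import scala.collection.JavaConverters._

object ClientExportReader extends App {
  // Credentials come from the environment; an IAM policy on this
  // client's user limits s3:ListBucket / s3:GetObject to the
  // "org=acme/" prefix (bucket and layout here are hypothetical).
  val s3 = AmazonS3ClientBuilder.defaultClient()

  val listing = s3.listObjectsV2("mm-log-exports", "org=acme/impressions/2015/03/19/")
  listing.getObjectSummaries.asScala.foreach { obj =>
    println(s"${obj.getKey}\t${obj.getSize} bytes")
  }
  // The same call against another organization's prefix would fail
  // with AccessDenied.
}
```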
So that’s data access, and the same setup sets the stage for the developer experience. First: we get to say “yes” more
Much like two clients can run processes side by side, we can run our QA jobs side by side
With the maturation of the Hadoop ecosystem, there seems to be a new “big data analytics framework” every couple of months, so we don’t force developers to be dogmatic about a single system
Part of lowering the barrier to entry was making it easier to bring in users from more backgrounds.
Again, select what you want from a dropdown and then hit “launch”
Altogether this lets our platform serve as the foundation for data-driven applications, or act as “big data for dummies”
To be clear, this is not Qubole’s official “greatest hits” compilation, but rather what we use at MediaMath
So that’s where Data Liberation led us, but in reality we bridge the two systems
Here’s a look at how our old and new architectures sit side by side, with load balancing done at the service layer to point to either AWS or our own data center. AWS lets us open up access along the way. Sproxy updates a DynamoDB table with filenames and upload times; similarly, we keep a table in Netezza with filenames and batch numbers
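As an illustration of what that bookkeeping buys us, here’s a sketch of a consistency check between the two sides; the table and column names are hypothetical, and DynamoDB scan pagination is ignored for brevity:

```scala
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder
import com.amazonaws.services.dynamodbv2.model.ScanRequest
import java.sql.DriverManager
import scala.collection.JavaConverters._

object BridgeConsistencyCheck extends App {
  // Filenames Sproxy recorded on the AWS side.
  val dynamo = AmazonDynamoDBClientBuilder.defaultClient()
  val awsFiles = dynamo.scan(new ScanRequest().withTableName("uploaded_files"))
    .getItems.asScala.map(_.get("filename").getS).toSet

  // Filenames recorded on the Netezza side.
  val conn = DriverManager.getConnection("jdbc:netezza://dw-host:5480/logs", "user", "secret")
  val rs = conn.createStatement().executeQuery("SELECT filename FROM loaded_files")
  val dcFiles = scala.collection.mutable.Set.empty[String]
  while (rs.next()) dcFiles += rs.getString(1)
  conn.close()

  println(s"on AWS but not in the data center: ${(awsFiles -- dcFiles).size}")
  println(s"in the data center but not on AWS: ${(dcFiles.toSet -- awsFiles).size}")
}
```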
Not without new challenges
The effort to keep old and new systems consistent meant that we could migrate in pieces, not just our code but our people too. We could take time to properly learn new things.
Migrate from SQL to Scala
Migrate from RDBMS to Hadoop
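As a taste of that shift, here’s a minimal Spark sketch, with hypothetical paths and schema, of the kind of daily rollup that used to live in a SQL script plus shell glue:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum

object DailyRollup extends App {
  val spark = SparkSession.builder().appName("daily-rollup").getOrCreate()

  // Raw logs on S3, partitioned by organization (paths hypothetical).
  val impressions = spark.read.parquet("s3a://mm-log-exports/impressions/")

  // The same aggregation the old SQL performed, now a distributed job.
  impressions
    .filter("log_date = '2015-03-19'")
    .groupBy("advertiser_id", "campaign_id")
    .agg(sum("spend").as("spend"))
    .write.parquet("s3a://mm-reports/daily/2015-03-19/")

  spark.stop()
}
```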
So that’s where we are today. I discussed the goals of data liberation and how we solved (or tried to solve) for them, and now I’m going to discuss the challenges and questions we face moving forward. I’ll start by talking about life after liberation.
This is where our organizational decoupling of storage and computation happened
Decoupling storage (data platform) from processing (anywhere)
The cloud isn’t what’s important in itself; what was really important was decoupling storage and computation
S3 is a great touchpoint to help break down the walled garden of AWS and help bridge the gap between on-premises hardware and the cloud
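For example, with the s3a connector on the classpath, the same Hadoop FileSystem API a job would use against on-premises HDFS also works against the platform’s bucket, so the compute can live in either place (bucket and path hypothetical):

```scala
import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object CrossEnvListing extends App {
  val conf = new Configuration()
  // Swap the URI for hdfs://... and the same code runs against the
  // data center's cluster instead of S3.
  val fs = FileSystem.get(new URI("s3a://mm-log-exports/"), conf)
  fs.listStatus(new Path("s3a://mm-log-exports/impressions/")).foreach(st => println(st.getPath))
}
```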