©2016 MediaMath Inc. 1
02.16.2016
Rory Sawyer – Software Engineer, Data Platform
Moving Past Infrastructure Limitations

Massive Volume of Data
• 180 billion impression opportunities a day
• 3+ million peak qps
• 3+ TB of data per day (compressed)
• Logs represent financial transactions
  - Every record counts!
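
As a rough sanity check on these figures, the sketch below compares the average request rate implied by the daily volume with the quoted peak. It assumes each impression opportunity corresponds to one query and that the daily total covers a full 24 hours; the slide states neither, so treat the result as illustrative only.

```python
# Figures quoted on the slide.
OPPORTUNITIES_PER_DAY = 180_000_000_000
PEAK_QPS = 3_000_000  # "3+ million", so this is a lower bound

SECONDS_PER_DAY = 24 * 60 * 60


def average_qps(events_per_day: int) -> float:
    """Average events per second, assuming an even spread over 24 hours."""
    return events_per_day / SECONDS_PER_DAY


avg = average_qps(OPPORTUNITIES_PER_DAY)
peak_to_avg = PEAK_QPS / avg
print(f"average ~{avg:,.0f} qps, peak/average >= {peak_to_avg:.2f}")
```

Under those assumptions the average works out to roughly 2 million per second, so the quoted peak is about 1.4 times the average.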

MediaMath’s Data Platform
• Centralized location for data at MM
  - Collect data from across the company
  - Standardize access for internal and external clients
• End result of data warehouse transformation

Once Upon A Time
The Old Days
Etc.

Architecture – 2013

Data Warehousing at MM circa 2013
• No proper QA/testing environment
• Production workflows and ad-hoc analytics ran side by side
• Scaling became an issue
  - Developing, testing, and deploying changes to workflows was frustrating
  - Copying data to more monolithic systems
  - More shell, more problems

Data Access circa 2013 – Users and Consumers
• Tools: SQL, shell
• Consumers: Data analysts, data engineers

Data Access circa 2013
• Logs: Custom FTP transfers
  - Merely extracting data could cause production problems
  - FTP could run out of space
• Heavy reliance on canned reports
  - Served via reporting API
  - Updated at most three times a day, usually just once a day
• Hard to keep pace with growing demands
[Slide labels: Internal Clients; External Clients]

Data Liberation

Moving Past Infrastructure
• Resource flexibility
• Fully own our conceptual problems
  - Can’t just get a bigger box or a higher support license
• Lower barrier to entry
  - Decouple storage and computation

Move to the Cloud
• Simple Storage Service (S3):
  - Primary data store; source of truth
  - Append-only. Update = delete + append
• Elastic MapReduce (EMR):
  - Transient Hadoop clusters
  - Spot instances – save money
• Redshift:
  - Columnar storage for efficient querying
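
The "append-only, update = delete + append" rule for the primary store can be sketched with a small in-memory stand-in. This is not MediaMath's implementation and makes no S3 calls; the class name, method names, and object keys are all hypothetical and only illustrate the rule that objects are never modified in place.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class AppendOnlyStore:
    """Hypothetical in-memory stand-in for an object store with immutable objects."""

    objects: Dict[str, bytes] = field(default_factory=dict)

    def append(self, key: str, data: bytes) -> None:
        # Writing to an existing key is rejected: objects are never edited in place.
        if key in self.objects:
            raise KeyError(f"{key} already exists; objects are immutable")
        self.objects[key] = data

    def delete(self, key: str) -> None:
        del self.objects[key]

    def update(self, old_key: str, new_key: str, data: bytes) -> None:
        # "Update = delete + append": drop the old object, then write a new one.
        self.delete(old_key)
        self.append(new_key, data)


store = AppendOnlyStore()
store.append("logs/2016-02-16/part-0000", b"v1")
store.update("logs/2016-02-16/part-0000", "logs/2016-02-16/part-0000-r1", b"v2")
print(sorted(store.objects))
```

After the update only the replacement object remains, so readers never see a partially modified file.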

Data Platform – Today

Data Access – Today

Developer Experience
• Get to say “yes” more
  - Rapid development/testing/deployment removes inertia
• Clearly distinct, perfectly synced QA environment
  - Run multiple versions of workflows simultaneously on same source data
• More control over components
• Localized impact of processing
  - Each team uses their own compute environment

We don’t worry about this like we used to

Improved User Experience
• Augmented standard reporting with easily accessible data warehouse
  - AWS + Qubole provides value to all skill levels
• Transparently handle different data sources
  - Bridge storage types and AWS accounts
• Choose your preferred query method
  - Spark, MapReduce, Flink, or BI tool
• All barriers removed

Productize it, cap’n
• Log-level data API
  - Direct log access on S3
• Interactive Query
  - Scalable, user-friendly data processing with Qubole

Hive

SmartQuery

Clusters

Qubole’s Greatest Hits

Hybrid Life

New and Old

Managing a Hybrid Warehouse
• Upfront effort to keep old and new consistent
  - After that, could migrate in pieces
• Keeping datasets in sync
  - Store metadata about datasets and processes
  - Keep record of what data was processed by which batches
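
One way to picture the batch bookkeeping is a small ledger that records which input files each batch processed in each warehouse. The slides do not describe the actual metadata store, so the class, the system labels ("old", "new"), and the file names below are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Set


class BatchLedger:
    """Hypothetical record of which input files each batch processed, per system."""

    def __init__(self) -> None:
        # system name -> batch id -> set of processed file names
        self._processed: Dict[str, Dict[str, Set[str]]] = defaultdict(
            lambda: defaultdict(set)
        )

    def record(self, system: str, batch_id: str, files: Set[str]) -> None:
        self._processed[system][batch_id] |= set(files)

    def files_seen(self, system: str) -> Set[str]:
        seen: Set[str] = set()
        for files in self._processed[system].values():
            seen |= files
        return seen

    def missing_from(self, system: str, reference: str) -> Set[str]:
        """Files the reference system has processed that `system` has not."""
        return self.files_seen(reference) - self.files_seen(system)


ledger = BatchLedger()
ledger.record("old", "batch-001", {"a.log", "b.log"})
ledger.record("new", "batch-A", {"a.log"})
print(ledger.missing_from("new", reference="old"))
```

Comparing the two systems this way shows which inputs the new warehouse still has to process before the datasets are in sync.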

Ch-ch-ch-challenges
• Spot instances: bid too low, jobs never start
  - Build processes around selecting best/cheapest zones
• Maintaining two systems at once
  - Consistency, monitoring, updates…
• Migrating mindset
  - New set of questions to answer
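
The zone-selection idea can be sketched as choosing the cheapest availability zone whose current spot price is within the bid. The function name, zone names, and prices are made up for illustration; a real process would fetch current prices from the cloud provider and would likely weigh more than price alone.

```python
from typing import Dict, Optional


def pick_zone(spot_prices: Dict[str, float], max_bid: float) -> Optional[str]:
    """Return the cheapest zone priced at or below max_bid, or None if there is none."""
    affordable = {
        zone: price for zone, price in spot_prices.items() if price <= max_bid
    }
    if not affordable:
        # Bid too low everywhere: the request would wait and the job would not start.
        return None
    return min(affordable, key=affordable.get)


# Hypothetical prices in dollars per instance-hour.
prices = {"us-east-1a": 0.12, "us-east-1b": 0.09, "us-east-1c": 0.30}
print(pick_zone(prices, max_bid=0.10))
print(pick_zone(prices, max_bid=0.05))
```

Returning `None` makes the "bid too low" case explicit, so a caller can raise the bid or fall back to on-demand instances instead of waiting indefinitely.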

What we’ve learned

Life after Liberation
• Decentralize all the things
  - Single-machine -> distributed computing
  - Single data team -> data engineers on all the teams
• Engineers on every team
  - Data Science – Spark (Scala)
  - Analytics – Spark/Hive (with Redshift connector)
  - Product – Hive
  - Engineering – Spark/Hive/MapReduce
  - Business analysts – SmartQuery

Data Access circa 2013 – Users and Consumers
• Tools: SQL, shell
• Consumers: Data analysts, data engineers

Data Access Today – Users and Consumers
• Tools: Hadoop (Scalding, Hive), Spark, RDBMS
• Consumers: Engineers, product managers, business analysts, etc.

The Cost of Decentralization
• Different producers and consumers have different priorities
  - File format, end-to-end latency, correctness, etc.
• Adding a platform layer could add friction

Not Abandoning Managed Infrastructure
or: There and Back Again
• Managed hardware is still important
  - On-premises Hadoop cluster
  - Clients ETL into managed hardware
• Experience with Data Liberation broke down “walled garden” feel of AWS

Some sort of “last slide” title
• Moving the data warehouse to the cloud has proven itself
  - Quick development allows us to keep pace
  - Ease of use helps teams and clients fine-tune their own reporting
• Re-thinking the tools and skills needed for data warehousing
• Avoid tech debt by evolving our software and ideas before committing to hardware
• Move away from trickle-down data

THANK YOU!
Rory Sawyer
Software Engineer
Data Platform
Rsawyer@mediamath.com

MediaMath - Big Data Warehousing Meetup - 2/16/2016

  • 1. ©2016 MediaMath Inc. 1 02.16.2016 Rory Sawyer – Software Engineer, Data Platform Moving Past Infrastructure Limitations
  • 3. ©2016 MediaMath Inc. 3 Massive Volume of Data  180 billion impression opportunities a day  3+ million peak qps  3+ TB of data per day (compressed)  Logs represent financial transactions Every record counts!
  • 4. ©2016 MediaMath Inc. 4 MediaMath’s Data Platform  Centralized location for data at MM Collect data from across the company Standardize access for internal and external clients  End-result of data warehouse transformation
  • 5. ©2016 MediaMath Inc. 5 Once Upon A Time The Old Days Etc..
  • 6. Architecture – 2013
  • 7. Data Warehousing at MM circa 2013
      - No proper QA/testing environment
      - Production workflows and ad-hoc analytics ran side-by-side
      - Scaling becomes an issue
          - Developing/testing/deploying changes to workflows frustrating
          - Copying data to more monolithic systems
          - More shell, more problems
  • 8. Data Access circa 2013 – Users and Consumers
      - Tools: SQL, shell
      - Consumers: Data analysts, data engineers
  • 9. Data Access circa 2013
      - Logs: Custom FTP transfers
          - Merely extracting data could cause production problems
          - FTP could run out of space
      - Heavy reliance on canned reports
          - Served via reporting API
          - Updated at most three times a day, usually just once a day
      - Hard to keep pace with growing demands
      Internal Clients; External Clients
  • 10. Data Liberation
  • 11. Moving Past Infrastructure
      - Resource flexibility
      - Fully own our conceptual problems
          - Can’t just get a bigger box or a higher support license
      - Lower barrier to entry
      Decouple storage and computation
  • 12. Move to the Cloud
      - Simple Storage Service (S3):
          - Primary data store; source of truth
          - Append-only. Update = delete + append
      - Elastic Map Reduce (EMR):
          - Transient hadoop clusters
          - Spot instances – save money
      - Redshift:
          - Columnar storage for efficient querying
  • 13. Data Platform – Today
  • 14. Data Access – Today
  • 15. Developer Experience
      - Get to say “yes” more
          - Rapid development/testing/deployment removes inertia
      - Clearly distinct, perfectly synced QA environment
          - Run multiple versions of workflows simultaneously on same source data
      - More control over components
      - Localized impact of processing
          - Each team uses their own compute environment
  • 16. We don’t worry about this like we used to
  • 17. Improved User Experience
      - Augmented standard reporting with easily-accessible data warehouse
          - AWS + Qubole provides value to all skill levels
      - Transparently handle different data sources
          - Bridge storage types and AWS accounts
      - Choose your preferred query method
          - Spark, MapReduce, Flink, or BI tool
      - All barriers removed
  • 18. Productize it, cap’n
      - Log level data API
          - Direct log access on S3
      - Interactive Query
          - Scalable, user-friendly data processing with Qubole
  • 20. SmartQuery
  • 21. Clusters
  • 22. Qubole’s Greatest Hits
  • 23. Hybrid Life
  • 24. New and Old
  • 25. Managing a Hybrid Warehouse
      - Upfront effort to keep old and new consistent
          - After that, could migrate in pieces
      - Keeping datasets in sync
          - Store metadata about datasets and processes
          - Keep record of what data was processed by which batches
  • 26. Ch-ch-ch-challenges
      - Spot instances: bid too low, jobs never start
          - Build processes around selecting best/cheapest zones
      - Maintaining two systems at once
          - Consistency, monitoring, updates…
      - Migrating mindset
          - New set of questions to answer
  • 27. What we’ve learned
  • 28. Life after Liberation
      - Decentralize all the things
          - Single-machine -> distributed computing
          - Single data team -> data engineers on all the teams
      - Engineers on every team
          - Data Science – Spark (Scala)
          - Analytics – Spark/Hive (with Redshift connector)
          - Product – Hive
          - Engineering – Spark/Hive/MapReduce
          - Business analysts – SmartQuery
  • 29. Data Access circa 2013 – Users and Consumers
      - Tools: SQL, shell
      - Consumers: Data analysts, data engineers
  • 30. Data Access Today – Users and Consumers
      - Tools: Hadoop (Scalding, Hive), Spark, RDBMS
      - Consumers: Engineers, product managers, business analysts, etc.
  • 31. The Cost of Decentralization
      - Different producers and consumers have different priorities
          - File format, end-to-end latency, correctness, etc…
      - Adding a platform layer could add friction
  • 32. Not Abandoning Managed Infrastructure or: There and Back Again
      - Managed hardware is still important
          - On-premises Hadoop cluster
          - Clients ETL into managed hardware
      - Experience with Data Liberation broke down “walled garden” feel of AWS
  • 33. Some sort of “last slide” title
      - Moving DW to cloud has proven itself
          - Quick development allows us to keep pace
          - Ease of use helps teams and clients fine tune their own reporting
      - Re-thinking the tools and skills needed for data warehousing
      - Avoid tech debt by evolving our software and ideas before committing to hardware
      - Move away from trickle-down data
  • 34. THANK YOU! Rory Sawyer, Software Engineer, Data Platform, Rsawyer@mediamath.com
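
Slide 26 mentions building processes around selecting the best/cheapest zones so that spot bids are not set too low for jobs to start. The talk does not show how this was done; the following is only a minimal sketch of the idea, assuming current spot prices per zone are already available as a plain dictionary. The function name `pick_zone`, the zone names, and the prices are hypothetical and not taken from the talk.

```python
from typing import Dict, Optional


def pick_zone(spot_prices: Dict[str, float], max_bid: float) -> Optional[str]:
    """Return the cheapest zone whose current spot price is at or below max_bid.

    Returns None when every zone is priced above the bid, which is the
    "bid too low, jobs never start" case from the slide.
    """
    # Keep only zones where our bid would currently be accepted.
    affordable = {zone: price for zone, price in spot_prices.items() if price <= max_bid}
    if not affordable:
        return None
    # Among affordable zones, choose the one with the lowest price.
    return min(affordable, key=affordable.get)


if __name__ == "__main__":
    # Hypothetical prices in dollars per instance-hour.
    prices = {"zone-a": 0.42, "zone-b": 0.31, "zone-c": 0.55}
    print(pick_zone(prices, max_bid=0.40))
    print(pick_zone(prices, max_bid=0.10))
```

A caller that receives `None` could raise the bid, wait, or fall back to on-demand instances; which of those is appropriate depends on the job and is not covered by the talk.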

Editor's Notes

  1. A bit about MediaMath: we’re an Ad Tech company and we write software to buy digital media. Say you go to a site and see a banner ad at the top: there was an auction to decide who was going to get that space and how much they would pay. We build systems to ingest, analyze, and decision on those bid opportunities and use machine learning to optimize our bidding.
  2. This story starts in 2013, with a description of our data warehouse at the time.
  3. Three Netezza servers would store and process all of our logs into standard reports. The Netezza servers held separate copies of the source data for standard reports. We would push reports to Oracle data marts. All the lines here, the glue of this system, are SQL executed by shell scripts. The Netezza servers would store 7-13 days of logs before purging.
  4. The architecture diagram is pretty true to life, so you may have noticed that there’s no dedicated QA environment. We had a dev server or two, but with the amount of data we deal with it’s costly to keep an up-to-date QA environment, leading to a mismatch with production. Similarly, we had no environment for ad-hoc analytics. Simply selecting fields (no aggregations, nothing fancy) would cause reporting delays. With these in mind, the question of scaling was a frightful one. Updating workflows and creating new ones was frustrating, we couldn’t just keep copying our logs from server to server (we needed to scale vertically as well), and adding more shell and SQL would only lead to more problems.
  5. This is the organizational data flow. The data warehouse team held most of the data engineers at MediaMath. They would push reports to where the reporting team could lay an API over them, but the reporting team was mostly DBAs and API developers, and only after reporting was the rest of the company able to get its first crack at the data. We had unofficial links to bypass reporting, but those were very tightly controlled.
  6. The “productized” version of our log-level data was custom FTP transfers. These would compete for resources with production workflows. The FTP server would run out of space, usually after hours, and then you’d get into the office the next day to a client who’s upset that you deleted their data. All of this led to a heavy reliance on canned reports, served via our API. Some of these reports were updated three times a day, some were updated once a day. Canned reports are great, but with the aforementioned developer difficulties, we just couldn’t keep pace. Log-level data is the lifeblood of our reporting. But for the longest time our logs, the greatest source of insights, were also one of the hardest things to get at.
  7. So this was the state of affairs around 2013, and these are the issues that led to this process of “data liberation”. The name very accurately describes our goal, in that we wanted to break down the silos that existed within our company (and outside our data warehouse). In short, we needed to remove infrastructure as a limiting factor in data sharing, both internally and externally, and this led to our transition from data warehouse to data platform.
  8. We needed to leave behind our monolithic, big-box data warehouse: no more single-machine processing, and much more fault tolerance. We also needed to standardize access to data and make it easier for folks of all backgrounds to get real value. These were the super high-level goals, and we saw that central to these goals would be decoupling storage and computation. We needed to make sure extracting data doesn’t interfere with processing data. We did this along two axes: technically and organizationally. Technically, we decided to move our data warehouse to the cloud, and in the process move to more of a platform. A little later on I’ll discuss an organizational decoupling of storage and computation.
  9. If we need 40 nodes for 2 hours, we can get that. Spot instances are leftover inventory that you bid on. Redshift is marketed as Amazon’s “data warehouse” solution, but we saw it as a more suitable replacement for our Oracle data mart, since it allowed us to de-aggregate some of our reports (i.e., allow custom date ranges instead of reports pre-aggregated by “yesterday”, “last 7 days”, etc.).
  10. The direct S3 access is our solution to our data access problem, so I’ll zoom in on that a little more.
  11. Data is generated by various teams within MediaMath, and we enrich the logs and store them partitioned by organization. Identity and Access Management (IAM) is the Amazon service we use for access control, and from there clients can safely read their data (and only their data) and process it however they like. This, essentially, is our replacement for the FTP transfers we used to set up.
  12. So that’s data access, and this setup sets the stage for the developer experience. For developer experience, first: we get to say “yes” more. Much like two clients can run processes side by side, we can run our QA jobs side by side. With the maturation of the Hadoop ecosystem, there seems to be a new “big data analytics framework” every couple of months, so we don’t force developers to be too dogmatic about a single system.
  13. Part of lowering the barrier to entry was making it easier to bring in users from more backgrounds.
  14. Again, select what you want from a dropdown and then hit “launch”
  15. Altogether this lets our platform serve as the foundation for data-driven applications, or act as “big data for dummies”. To be clear, this is not Qubole’s official “greatest hits” compilation, but rather what we use at MediaMath.
  16. So that’s where Data Liberation led us, but in reality we bridge the two systems.
  17. A look at how our old and new architectures sit side by side, with load balancing done at the service layer to point to either AWS or our own data center. AWS enables us to allow access along the way. Sproxy will update a DynamoDB table with filenames and upload times. Similarly, we keep a table in Netezza for filenames and batch numbers.
  18. Not without new challenges - The effort to keep old and new systems consistent meant that we could migrate in pieces, not just our code but our people too. Could take time to properly learn new things.
  19. Migrate from SQL to Scala. Migrate from RDBMS to Hadoop.
  20. So that’s where we are today. I discussed the goals of data liberation and how we solved (or tried to solve) for these, and now I’m going to discuss what challenges and questions we face moving forward. I’ll start by talking about life after liberation.
  21. This is where our organizational decoupling of storage and computation happened: decoupling storage (data platform) from processing (anywhere).
  22. The cloud isn’t necessarily important; what was really important was decoupling storage and computation. S3 is a great touchpoint to help break down the walled garden of AWS and help bridge the gap between on-premises hardware and the cloud.
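
Note 11 describes logs stored on S3 partitioned by organization, with IAM ensuring that each client can read its own data and only its own data. The talk does not show the actual policies, so the following is only a sketch of what a per-organization read-only policy document could look like. The bucket name, the `organization_id=<id>/` key layout, and the helper `org_read_policy` are assumptions made for illustration; the policy elements themselves (`Version`, `Statement`, `s3:ListBucket`, `s3:GetObject`, the `s3:prefix` condition key) are standard IAM.

```python
import json


def org_read_policy(bucket: str, org_id: str) -> dict:
    """Build a read-only IAM policy document scoped to one organization's prefix.

    Assumes (hypothetically) that objects are laid out as
    s3://<bucket>/organization_id=<org_id>/...
    """
    prefix = f"organization_id={org_id}/"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Allow listing the bucket, but only under this organization's prefix.
                "Sid": "ListOwnPrefix",
                "Effect": "Allow",
                "Action": "s3:ListBucket",
                "Resource": f"arn:aws:s3:::{bucket}",
                "Condition": {"StringLike": {"s3:prefix": [f"{prefix}*"]}},
            },
            {
                # Allow reading objects, but only under this organization's prefix.
                "Sid": "ReadOwnObjects",
                "Effect": "Allow",
                "Action": "s3:GetObject",
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}*",
            },
        ],
    }


if __name__ == "__main__":
    # Hypothetical bucket and organization id.
    print(json.dumps(org_read_policy("example-log-bucket", "1234"), indent=2))
```

Because the policy grants no write or delete actions, it is consistent with the append-only, source-of-truth role that slide 12 gives S3; how MediaMath actually attaches such policies to client identities is not described in the talk.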