HADOOP AT TAPAD
March 14, 2013
A Case Study
Mike Moss, VP Engineering
@michaelmoss
What is Tapad?
• Tapad is the first digital advertising solution for real-time mobile audience buying and multi-screen targeting.
• Marketers use Tapad to obtain a unified view of their customers across smartphones, tablets, computers and smart TVs, enabling more relevant and device-specific messaging.
• Tapad bridges devices together to create the Device Graph, which enables Cross Platform Targeting and Analytics.
Device Graph Targeting Capabilities
• Retargeting
- Retarget PC visitors on mobile or tablet
• Location Targeting
- Geo-Fencing
- Airport Targeting
• Audience Targeting
- Economic (Income, Net Worth, Discretionary Income, Home Value, Charitable Contributions, Invested Assets)
- Demographic (Age, Genders Present, Presence of Children, Ethnicity)
• Platform Targeting
- Platform (PC Web, Mobile Web, In-App, Connected TV)
- Device (Android, Android Tablet, BlackBerry, Computer, Feature phones, iPad, iPhone, Palm, Symbian, Windows Phone)
- Carrier (AT&T Wireless, MetroPCS, Sprint, T-Mobile, TracFone, Verizon Wireless, etc.)
Data at Tapad
• MySQL
- “CRUD” – Tapad UI and Campaign Manager
• Redis
- Counters – Revenue, Bid Requests, Impressions
• Aerospike
- Device Graph
• Vertica
- Impressions, Clicks, Aggregations – Reporting, ad-hoc queries
Use Case: Predict Available Monthly Impressions for New Campaigns
• How can we predict how many monthly impressions a new advertiser can buy on our platform?
[Diagram: devices D1, D2 and D3 and the advertiser home page. Steps: 1 – Pixel for D1; 2 – Device Graph Propagation; 3 – Bid Request for D2]
(MonthlyUniques_NewAdvertiser / MonthlyUniques_SimilarAdvertiser) * MonthlyBidRequests_SimilarAdvertiser
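
The ratio above can be sketched in a few lines of Python. The function name and the sample figures are hypothetical and only illustrate the arithmetic; they are not Tapad numbers. The sketch also assumes that each bid request counts as one opportunity to buy an impression.

```python
def estimate_monthly_impressions(
    new_advertiser_uniques: int,
    similar_advertiser_uniques: int,
    similar_advertiser_bid_requests: int,
) -> float:
    """Scale a similar advertiser's monthly bid requests by the ratio of monthly uniques."""
    if similar_advertiser_uniques <= 0:
        raise ValueError("similar advertiser must have at least one monthly unique")
    ratio = new_advertiser_uniques / similar_advertiser_uniques
    return ratio * similar_advertiser_bid_requests


# Hypothetical inputs, for illustration only.
estimate = estimate_monthly_impressions(
    new_advertiser_uniques=200_000,
    similar_advertiser_uniques=500_000,
    similar_advertiser_bid_requests=30_000_000,
)
print(f"{estimate:,.0f}")  # prints "12,000,000"
```

With these made-up inputs the new advertiser has 40% of the similar advertiser's uniques, so the estimate is 40% of its bid requests.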
Bid Requests
• At peak, we get over 150K bid requests/sec
• High volume, “low value” data
• Complex data type (bid_sample_avro.json)
• Not sure of all the ways we would query it
• At a sampling rate of 1/1000, we are capturing 200 MB/hour
• …in other words: perfect for Hadoop
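
A back-of-the-envelope sketch of what those figures imply for storage. The sampling rate, hourly volume and peak rate come from the slide; the 30-day month and decimal units (1 GB = 1000 MB) are assumptions made here for the arithmetic.

```python
SAMPLE_ONE_IN = 1000             # from the slide: 1 in 1000 bid requests is kept
SAMPLED_MB_PER_HOUR = 200        # from the slide
PEAK_REQUESTS_PER_SEC = 150_000  # from the slide (a peak, not an average)

HOURS_PER_DAY = 24
DAYS_PER_MONTH = 30              # assumption: a 30-day month
MB_PER_GB = 1000                 # assumption: decimal units
GB_PER_TB = 1000

sampled_gb_per_day = SAMPLED_MB_PER_HOUR * HOURS_PER_DAY / MB_PER_GB
sampled_gb_per_month = sampled_gb_per_day * DAYS_PER_MONTH
# What the same stream would weigh if every request were kept.
unsampled_tb_per_day = sampled_gb_per_day * SAMPLE_ONE_IN / GB_PER_TB
# Sampled records written per second at peak load.
peak_sampled_per_sec = PEAK_REQUESTS_PER_SEC // SAMPLE_ONE_IN

print(f"sampled:   {sampled_gb_per_day:.1f} GB/day, {sampled_gb_per_month:.0f} GB/month")
print(f"unsampled: {unsampled_tb_per_day:.1f} TB/day")
print(f"peak sampled rate: {peak_sampled_per_sec} records/sec")
```

Under these assumptions the sample grows by about 4.8 GB per day (roughly 144 GB per month), while the unsampled stream would be on the order of 4.8 TB per day.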
Hadoop Ecosystem
• Hadoop Ecosystem – Heavily fragmented, lots of choices!
• Trends
- “Distro Wars” – Cloudera vs. Hortonworks vs. MapR
- Real-time, interactive ad-hoc querying, aka “Faster Hive”
  - Apache Drill, Cloudera Impala, Stinger Initiative (YARN, Tez, ORCFile)
  - Many influenced by Google's Dremel paper
  - All are similar: they seek to reduce M/R's expensive start-up time, avoid shuffle/sort disk serialization where possible, and eliminate unnecessary M/R pipelines
- New languages/frameworks
  - Many more choices than just Pig and Cascading
  - Scalding, Scoobi, Spark, Crunch/Scrunch
  - Many influenced by Google's FlumeJava paper; they seek to avoid the awkwardness of the UDF programming model and experiment with richer typed data models (not just tuples)
Tapad Hadoop POC
• Some SQL, some code
• POC
- Hive
  - Familiar SQL syntax
  - Easy to get started
  - Hue/Beeswax makes SQL on Hadoop easy for non-programmers
- Impala (Cloudera)
  - Most developed of the pack (as of Feb 2013)
- Scalding (Twitter)
  - “A Scala API for Cascading”
  - Algebird
- Cloudera CDH4
• On our Radar
- Hortonworks – Stinger
- Scoobi
• Also tried
- Shark/Spark
Serialization
• Serialization Considerations:
- Parsing efficiency
- Schema evolution
- Compactness
- Complex type support
- Hadoop ecosystem support
• CSV
• JSON
• Avro – Like Protocol Buffers/Thrift, but better:
- Dynamic typing – No code generation required
- Untagged data – Because the schema is stored with the data, records carry no per-field tags, which makes the serialized form smaller
- No manually assigned field IDs – With both the old and new schemas present, fields are matched by name, so schema migrations are a breeze
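
The idea behind matching fields by name instead of by ID can be sketched in plain Python. This is a simplified illustration of Avro-style schema resolution, not the Avro library's API: the `resolve` helper, the field names and the sample record are all hypothetical.

```python
def resolve(record: dict, reader_fields: list) -> dict:
    """Project a record written with an older schema onto the reader's fields.

    Fields are matched by name. A field missing from the record takes the
    reader's default; fields the reader does not declare are dropped.
    """
    resolved = {}
    for field in reader_fields:
        name = field["name"]
        if name in record:
            resolved[name] = record[name]
        elif "default" in field:
            resolved[name] = field["default"]
        else:
            raise ValueError(f"no value or default for field {name!r}")
    return resolved


# Hypothetical reader schema (v2): adds "carrier" with a default, drops "exchange".
reader_fields_v2 = [
    {"name": "device_id", "type": "string"},
    {"name": "carrier", "type": ["null", "string"], "default": None},
]

# Hypothetical record written with the older schema (v1).
old_record = {"device_id": "d1", "exchange": "x"}

new_view = resolve(old_record, reader_fields_v2)
print(new_view)  # prints "{'device_id': 'd1', 'carrier': None}"
```

Because nothing depends on field positions or numeric tags, adding a field with a default or dropping an unused one does not require rewriting old data.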
Compression
• Compression Considerations:
- Splittability
- Speed vs. Compression
- Hadoop ecosystem support
• gzip
• LZO
• Snappy
- “…aims for very high speeds and reasonable compression”
- Integrates seamlessly with Avro
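
The speed vs. compression trade-off can be observed with a small sketch. Snappy and LZO are not in the Python standard library, so this uses `zlib` (the DEFLATE algorithm behind gzip) at different levels as a stand-in; the payload is a hypothetical, highly repetitive record, and timings will vary by machine.

```python
import time
import zlib

# Hypothetical payload: one JSON-like bid record repeated many times.
payload = b'{"device":"iPhone","carrier":"Verizon Wireless","platform":"In-App"}\n' * 20_000


def measure(level: int):
    """Compress the payload at the given zlib level; return (compressed size, seconds)."""
    start = time.perf_counter()
    compressed = zlib.compress(payload, level)
    elapsed = time.perf_counter() - start
    assert zlib.decompress(compressed) == payload  # round-trip check
    return len(compressed), elapsed


for level in (1, 6, 9):  # 1 = fastest, 9 = smallest output
    size, seconds = measure(level)
    print(f"level={level} ratio={len(payload) / size:.1f}x time={seconds * 1000:.1f} ms")
```

Lower levels favor speed and higher levels favor size, which is the same axis along which Snappy and gzip differ.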
Hive Demo
-- Avro-backed table; columns are derived from the Avro schema
CREATE TABLE bids
ROW FORMAT
SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS
INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
TBLPROPERTIES ('avro.schema.literal'='<JSON SCHEMA HERE>');

-- Load a local Avro container file into the table
LOAD DATA LOCAL INPATH 'bids.avro' INTO TABLE `bids`;
Impala Demo
Scalding
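// Body of a Scalding Job (fields API). The import for UnpackedAvroSource
// depends on the Avro source library in use and is omitted here.
import java.util.ArrayList
import scala.collection.JavaConverters._
import cascading.tuple.Tuple

UnpackedAvroSource(args("input"), schema = None)
  .read
  // Emit one 'audienceId for each audience record attached to a request's device
  .flatMapTo('request -> 'audienceId) { record: Tuple =>
    val request: Tuple = record.getObject(0).asInstanceOf[Tuple]
    // Positional lookups; Option guards against a null device or audience list
    val device: Option[Tuple] = Option(request.getObject(6).asInstanceOf[Tuple])
    val audienceRecords: Option[ArrayList[Tuple]] = device.flatMap { record =>
      Option(record.getObject(7).asInstanceOf[ArrayList[Tuple]])
    }
    audienceRecords.toSeq.flatMap { records =>
      records.asScala.map(_.getString(0))
    }
  }
  // Count occurrences per audience ID, then sort all rows by count
  .groupBy('audienceId) { _.size('count) }
  .groupAll { _.sortBy('count) }
  .debug
  .write(Tsv(args("output")))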
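
For readers unfamiliar with Scalding, the same flatten, count and sort logic can be sketched over in-memory records in plain Python. The job above reads fields by position, so the dictionary keys used here (`device`, `audiences`, `id`) are hypothetical names chosen for readability.

```python
from collections import Counter


def audience_ids(request: dict):
    """Yield every audience ID on a request, tolerating a missing device or list."""
    device = request.get("device") or {}
    for audience in device.get("audiences") or []:
        yield audience["id"]


def count_audiences(requests):
    """Return (audience_id, count) pairs sorted by ascending count."""
    counts = Counter()
    for request in requests:
        counts.update(audience_ids(request))
    return sorted(counts.items(), key=lambda pair: pair[1])


# Hypothetical sample: two requests carry audiences, two carry none.
sample = [
    {"device": {"audiences": [{"id": "a"}, {"id": "b"}]}},
    {"device": None},
    {"device": {"audiences": None}},
    {"device": {"audiences": [{"id": "a"}]}},
]
print(count_audiences(sample))  # prints "[('b', 1), ('a', 2)]"
```

The `or {}` and `or []` fallbacks play the role of the `Option` wrappers in the Scala version.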
Hardware
• 1 Master Node – 1U
- 2 x Intel Xeon E5-2620 6-Core 2GHz
- 64GB DDR-1600 RAM
- LSI 9240-8i 8-Port RAID Card
- 2 x 1TB Seagate Constellation.2 SAS
• 3 Data Nodes – 2U 12 HD Bays
- 2 x Intel Xeon E5-2620 6-Core 2GHz
- 64GB DDR-1600 RAM
- LSI 9207-8i 8-Port RAID Card
- OS Drive: 100GB Intel DC 3700
- Data Drives: 12 x 3TB Seagate Constellation CS SATA
References
Cloudera vs. Hortonworks:
- http://wikibon.org/wiki/v/The_Hadoop_Wars:_Cloudera_and_Hortonworks%E2%80%99_Death_Match_for_Mindshare
Dremel:
- http://research.google.com/pubs/pub36632.html
- http://www.quora.com/How-will-Googles-Dremel-change-future-Hadoop-releases
FlumeJava:
- http://faculty.neu.edu.cn/cc/zhangyf/cloud-bigdata/papers/big%20data%20programming/FlumeJava-pldi-2010.pdf
Hadoop Ecosystem (Mar 2013):
- http://gigaom.com/2013/03/05/the-hadoop-ecosystem-the-welcome-elephant-in-the-room-infographic/
Hardware:
- http://hortonworks.com/blog/why-not-raid-0-its-about-time-and-snowflakes/
- http://hortonworks.com/blog/best-practices-for-selecting-apache-hadoop-hardware/
Impala:
- https://ccp.cloudera.com/display/IMPALA10BETADOC/Impala+Frequently+Asked+Questions
Spark/Shark:
- http://www.cs.berkeley.edu/~matei/talks/2012/hadoop_summit_spark.pdf
Stinger:
- http://hortonworks.com/blog/100x-faster-hive/
SQL on Hadoop:
- http://gigaom.com/2013/02/21/sql-is-whats-next-for-hadoop-heres-whos-doing-it/
Tuples vs. Complex Types:
- http://www.quora.com/Apache-Hadoop/What-are-the-differences-between-Crunch-and-Cascading
Thank You
• Questions?
• Tapad is hiring!
- Data Scientists, Platform/Data/Frontend Engineers
- http://www.tapad.com/careers/
- michael.moss@tapad.com

Hadoop at Tapad

  • 1. HADOOP AT TAPAD March 14, 2013 A Case Study Mike Moss, VP Engineering @michaelmoss
  • 2. What is Tapad? 2  Tapad is the first digital advertising solution for real-time mobile audience buying and multi- screen targeting.  Marketers use Tapad to obtain a unified view of their customers across smartphones, tablets, computers and smart TVs, enabling more relevant and device-specific messaging.  Tapad bridges devices together to create the Device Graph which enables Cross Platform Targeting and Analytics
  • 3. Device Graph Targeting Capabilities
      - Retargeting
          - Retarget PC visitors on mobile or tablet
      - Location Targeting
          - Geo-Fencing
          - Airport Targeting
      - Audience Targeting
          - Economic (Income, Net Worth, Discretionary Income, Home Value, Charitable Contributions, Invested Assets)
          - Demographic (Age, Genders Present, Presence of Children, Ethnicity)
      - Platform Targeting
          - Platform (PC Web, Mobile Web, In-App, Connected TV)
          - Device (Android, Android Tablet, Blackberry, Computer, Feature phones, iPad, iPhone, Palm, Symbian, Windows Phone)
          - Carrier (AT&T Wireless, MetroPCS, Sprint, T-Mobile, TracFone, Verizon Wireless, etc.)
  • 4. Data at Tapad
      - MySQL
          - “CRUD”: Tapad UI and Campaign Manager
      - Redis
          - Counters: Revenue, Bid Requests, Impressions
      - Aerospike
          - Device Graph
      - Vertica
          - Impressions, Clicks, Aggregations: reporting, ad-hoc queries
  • 5. Use Case: Predict Available Monthly Impressions for New Campaigns
      - How can we predict how many monthly impressions a new advertiser can buy on our platform?
      - Diagram labels: D1, D2, D3, Advertiser Home Page
          1. Pixel for D1
          2. Device Graph Propagation
          3. Bid Request for D2
      - Estimate: (MonthlyUniques_NewAdvertiser / MonthlyUniques_SimilarAdvertiser) * MonthlyBidRequests_SimilarAdvertiser
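The estimate on this slide reads as a ratio of monthly uniques, scaled by a similar advertiser's monthly bid requests. A minimal sketch under that reading follows; the function name and all input numbers are hypothetical and are not Tapad data.

```python
def estimate_monthly_impressions(new_uniques, similar_uniques, similar_bid_requests):
    """Scale a similar advertiser's monthly bid requests by the ratio of uniques."""
    if similar_uniques <= 0:
        raise ValueError("similar_uniques must be positive")
    return new_uniques / similar_uniques * similar_bid_requests


# Hypothetical inputs chosen only to make the arithmetic easy to follow.
estimate = estimate_monthly_impressions(
    new_uniques=50_000,              # monthly uniques seen for the new advertiser
    similar_uniques=200_000,         # monthly uniques for a comparable advertiser
    similar_bid_requests=8_000_000,  # monthly bid requests for that advertiser
)
print(f"Estimated monthly bid requests available: {estimate:,.0f}")
```

Bid requests are an upper bound on what can be bought, so a real forecast would presumably also apply a win rate; the slide does not say how.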
  • 6. Bid Requests
      - At peak, we get over 150K bid requests/sec
      - High Volume/“Low Value” data
      - Complex data type (bid_sample_avro.json)
      - Not sure of all the ways we would query it
      - At a sampling rate of 1/1000, we are capturing 200MB/Hour
      - …in other words: Perfect for Hadoop
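To put the sampling figures in perspective, the arithmetic below scales the sampled volume back up to the full stream. The three input numbers come from the slide; decimal units (1 GB = 1000 MB) and a constant hourly rate over the whole day are simplifying assumptions.

```python
SAMPLE_ONE_IN = 1000           # sampling rate of 1/1000 (from the slide)
SAMPLED_MB_PER_HOUR = 200      # captured volume at that rate (from the slide)
PEAK_REQUESTS_PER_SEC = 150_000

# Volume if every bid request were stored instead of 1 in 1000.
full_mb_per_hour = SAMPLED_MB_PER_HOUR * SAMPLE_ONE_IN
full_gb_per_hour = full_mb_per_hour / 1000
full_tb_per_day = full_mb_per_hour * 24 / 1_000_000

# Sampled records written per second at peak load.
sampled_requests_per_sec = PEAK_REQUESTS_PER_SEC // SAMPLE_ONE_IN

print(f"Unsampled: {full_gb_per_hour:.0f} GB/hour, {full_tb_per_day:.1f} TB/day")
print(f"Sampled records at peak: {sampled_requests_per_sec}/sec")
```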
  • 7. Hadoop Ecosystem
      - Hadoop Ecosystem: heavily fragmented, lots of choices!
      - Trends
          - “Distro Wars”: Cloudera vs Hortonworks vs MapR
          - Real-time, interactive ad-hoc querying, aka “Faster Hive”
              - Apache Drill, Cloudera Impala, Stinger Initiative (YARN, Tez, ORCFile)
              - Many influenced by Google's Dremel paper
              - All are similar: they seek to improve on M/R's expensive start-up time and to avoid shuffle/sort disk serialization where possible, as well as unnecessary M/R pipelines
          - New languages/frameworks
              - Many more choices than just Pig and Cascading
              - Scalding, Scoobi, Spark, Crunch/Scrunch
              - Many influenced by Google's FlumeJava paper; they seek to avoid the awkwardness of the UDF programming model and experiment with richer typed data models (not just tuples)
  • 8. Tapad Hadoop POC
      - Some SQL, some code
      - POC
          - Hive
              - Familiar SQL syntax
              - Easy to get started
              - Hue/Beeswax makes SQL on Hadoop easy for non-programmers
          - Impala (Cloudera)
              - Most developed of the pack (as of Feb 2013)
          - Scalding (Twitter)
              - “A Scala API for Cascading”
              - Algebird
          - Cloudera CDH4
      - On our Radar
          - Hortonworks: Stinger
          - Scoobi
      - Also tried
          - Shark/Spark
  • 9. Serialization
      - Serialization Considerations:
          - Parsing efficiency
          - Schema evolution
          - Compactness
          - Complex type support
          - Hadoop ecosystem support
      - CSV
      - JSON
      - Avro: like Protocol Buffers/Thrift, but better:
          - Dynamic typing: no code gen required
          - Untagged data: since the schema is included with the data, smaller serialization size
          - No manually-assigned field IDs: schema migrations are a breeze with the presence of old and new schemas
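The schema-migration point can be illustrated with a toy version of Avro's reader/writer schema resolution: fields are matched by name, writer-only fields are dropped, and reader-only fields take their declared default. The sketch below works on already-decoded dictionaries and does not use an Avro library; the record and field names are hypothetical, not taken from Tapad's bid schema.

```python
# Old (writer) schema: what the stored data was written with.
writer_schema = {
    "type": "record", "name": "Device",
    "fields": [
        {"name": "device_id", "type": "string"},
        {"name": "carrier", "type": "string"},
    ],
}

# New (reader) schema: drops "carrier", adds "platform" with a default.
reader_schema = {
    "type": "record", "name": "Device",
    "fields": [
        {"name": "device_id", "type": "string"},
        {"name": "platform", "type": "string", "default": "unknown"},
    ],
}


def resolve(record, writer, reader):
    """Project a record written with `writer` onto the `reader` schema."""
    written = {field["name"] for field in writer["fields"]}
    resolved = {}
    for field in reader["fields"]:
        name = field["name"]
        if name in written:
            resolved[name] = record[name]        # matched by name, no field IDs
        elif "default" in field:
            resolved[name] = field["default"]    # new field falls back to default
        else:
            raise ValueError(f"no value or default for field {name!r}")
    return resolved


old_record = {"device_id": "d1", "carrier": "Sprint"}
new_view = resolve(old_record, writer_schema, reader_schema)
print(new_view)
```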
  • 10. Compression
      - Compression Considerations:
          - Splittability
          - Speed vs. Compression
          - Hadoop ecosystem support
      - gzip
      - lzo
      - Snappy
          - “…aims for very high speeds and reasonable compression”
          - Integrates seamlessly with Avro
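The speed-versus-compression trade-off can be seen on a small scale with Python's standard `zlib` module. This is a stand-in only: Snappy and LZO are not in the standard library, so the sketch compares DEFLATE levels instead, and the payload is a made-up, highly repetitive JSON line.

```python
import time
import zlib

# Hypothetical payload: one JSON-like line repeated many times.
payload = b'{"device": "iPhone", "carrier": "Sprint", "platform": "In-App"}\n' * 20_000

results = {}
for level in (1, 6, 9):  # 1 = fastest, 9 = smallest output
    start = time.perf_counter()
    compressed = zlib.compress(payload, level)
    elapsed_ms = (time.perf_counter() - start) * 1000
    results[level] = compressed
    ratio = len(payload) / len(compressed)
    print(f"level {level}: {len(compressed):>7} bytes, ratio {ratio:6.1f}x, {elapsed_ms:.2f} ms")
```

Timings vary by machine, and real bid data will compress far less well than this repeated line.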
  • 11. Hive Demo
      -- Define a table backed by Avro container files; columns come from the Avro schema.
      CREATE TABLE bids
      ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
      STORED AS
        INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
        OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
      TBLPROPERTIES ('avro.schema.literal'='<JSON SCHEMA HERE>');

      -- Load a local Avro file into the table.
      LOAD DATA LOCAL INPATH 'bids.avro' INTO TABLE `bids`;
  • 13. Scalding
      // Count how often each audience ID appears across the sampled bid requests.
      UnpackedAvroSource(args("input"), schema = None)
        .read
        .flatMapTo('request -> 'audienceId) { record: Tuple =>
          val request: Tuple = record.getObject(0).asInstanceOf[Tuple]
          // Field 6 of the request is read as the device record; it may be null.
          val device: Option[Tuple] = Option(request.getObject(6).asInstanceOf[Tuple])
          // Field 7 of the device is read as the list of audience records; it may be null.
          val audienceRecords: Option[ArrayList[Tuple]] = device.flatMap { record =>
            Option(record.getObject(7).asInstanceOf[ArrayList[Tuple]])
          }
          // Emit one audience ID (field 0) per audience record.
          audienceRecords.toSeq.flatMap { records =>
            records.asScala.map(_.getString(0))
          }
        }
        .groupBy('audienceId) { _.size('count) }
        .groupAll { _.sortBy('count) }
        .debug
        .write(Tsv(args("output")))
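For readers without a Hadoop cluster, the same logic as the Scalding job (flatten each bid request to its audience IDs, count per ID, sort by count) can be sketched in plain Python. The dictionary keys `device`, `audiences` and `id` are hypothetical names; the Scalding code addresses these fields by position instead.

```python
from collections import Counter


def audience_ids(bid_request):
    """Return the audience IDs of one bid request, tolerating missing parts."""
    device = bid_request.get("device") or {}
    audiences = device.get("audiences") or []
    return [audience["id"] for audience in audiences]


# Made-up sample records, including a missing device and a missing audience list.
bid_requests = [
    {"device": {"audiences": [{"id": "a1"}, {"id": "a2"}]}},
    {"device": {"audiences": [{"id": "a1"}]}},
    {"device": None},
    {"device": {"audiences": None}},
]

# Equivalent of flatMapTo + groupBy(...) { _.size }.
counts = Counter(aid for request in bid_requests for aid in audience_ids(request))

# Equivalent of groupAll { _.sortBy('count) }: ascending by count.
ranked = sorted(counts.items(), key=lambda item: item[1])

for audience_id, count in ranked:
    print(f"{audience_id}\t{count}")
```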
  • 14. Hardware
      - 1 Master Node: 1U
          - 2 x Intel Xeon E5-2620 6-Core 2GHz
          - 64GB DDR-1600 RAM
          - LSI 9240-8i 8-Port RAID Card
          - 2 x 1TB Seagate Constellation.2 SAS
      - 3 Data Nodes: 2U, 12 HD Bays
          - 2 x Intel Xeon E5-2620 6-Core 2GHz
          - 64GB DDR-1600 RAM
          - LSI 9207-8i 8-Port RAID Card
          - OS Drive: 100GB Intel DC 3700
          - Data Drives: 12 x 3TB Seagate Constellation CS SATA
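A rough capacity check for the data nodes listed above: the node, drive and size figures are from the slide, while the replication factor of 3 is an assumption (it is the HDFS default; the deck does not state the cluster's setting). Space reserved for intermediate job output and filesystem overhead is ignored.

```python
DATA_NODES = 3            # from the slide
DRIVES_PER_NODE = 12      # from the slide
TB_PER_DRIVE = 3          # from the slide
REPLICATION_FACTOR = 3    # assumption: HDFS default, not stated in the deck

raw_tb = DATA_NODES * DRIVES_PER_NODE * TB_PER_DRIVE
usable_tb = raw_tb / REPLICATION_FACTOR

print(f"Raw capacity: {raw_tb} TB")
print(f"Usable at {REPLICATION_FACTOR}x replication: {usable_tb:.0f} TB")
```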
  • 15. References
      - Cloudera vs. Hortonworks:
          http://wikibon.org/wiki/v/The_Hadoop_Wars:_Cloudera_and_Hortonworks%E2%80%99_Death_Match_for_Mindshare
      - Dremel:
          http://research.google.com/pubs/pub36632.html
          http://www.quora.com/How-will-Googles-Dremel-change-future-Hadoop-releases
      - FlumeJava:
          http://faculty.neu.edu.cn/cc/zhangyf/cloud-bigdata/papers/big%20data%20programming/FlumeJava-pldi-2010.pdf
      - Hadoop Ecosystem (Mar 2013):
          http://gigaom.com/2013/03/05/the-hadoop-ecosystem-the-welcome-elephant-in-the-room-infographic/
      - Hardware:
          http://hortonworks.com/blog/why-not-raid-0-its-about-time-and-snowflakes/
          http://hortonworks.com/blog/best-practices-for-selecting-apache-hadoop-hardware/
      - Impala:
          https://ccp.cloudera.com/display/IMPALA10BETADOC/Impala+Frequently+Asked+Questions
      - Spark/Shark:
          http://www.cs.berkeley.edu/~matei/talks/2012/hadoop_summit_spark.pdf
      - Stinger:
          http://hortonworks.com/blog/100x-faster-hive/
      - SQL on Hadoop:
          http://gigaom.com/2013/02/21/sql-is-whats-next-for-hadoop-heres-whos-doing-it/
      - Tuples vs. Complex Types:
          http://www.quora.com/Apache-Hadoop/What-are-the-differences-between-Crunch-and-Cascading
  • 16. Thank You
      - Questions?
      - Tapad is hiring!
          - Data Scientists, Platform/Data/Frontend Engineers
          - http://www.tapad.com/careers/
          - michael.moss@tapad.com