Hadoop at Tapad


Published in: Technology

Transcript of "Hadoop at Tapad"

  1. 1. HADOOP AT TAPAD: A Case Study. Mike Moss, VP Engineering, @michaelmoss. March 14, 2013
  2. 2. What is Tapad? Tapad is the first digital advertising solution for real-time mobile audience buying and multi-screen targeting. Marketers use Tapad to obtain a unified view of their customers across smartphones, tablets, computers and smart TVs, enabling more relevant and device-specific messaging. Tapad bridges devices together to create the Device Graph, which enables Cross Platform Targeting and Analytics.
  3. 3. Device Graph Targeting Capabilities
     Retargeting
       - Retarget PC visitors on mobile or tablet
     Location Targeting
       - Geo-Fencing
       - Airport Targeting
     Audience Targeting
       - Economic (Income, Net Worth, Discretionary Income, Home Value, Charitable Contributions, Invested Assets)
       - Demographic (Age, Genders Present, Presence of Children, Ethnicity)
     Platform Targeting
       - Platform (PC Web, Mobile Web, In-App, Connected TV)
       - Device (Android, Android Tablet, Blackberry, Computer, Feature phones, iPad, iPhone, Palm, Symbian, Windows Phone)
       - Carrier (AT&T Wireless, MetroPCS, Sprint, T-Mobile, TracFone, Verizon Wireless, etc.)
  4. 4. Data at Tapad
     • MySQL
       • “CRUD” – Tapad UI and Campaign Manager
     • Redis
       • Counters – Revenue, Bid Requests, Impressions
     • Aerospike
       • Device Graph
     • Vertica
       • Impressions, Clicks, Aggregations – Reporting, ad-hoc queries
  5. 5. Use Case: Predict Available Monthly Impressions for New Campaigns
     Diagram (devices D1, D2, D3):
       1 – Pixel for D1 (Advertiser Home Page)
       2 – Device Graph Propagation
       3 – Bid Request for D2
     How can we predict how many monthly impressions a new advertiser can buy on our platform?
     $$\text{MonthlyUniques}_{\text{NewAdvertiser}} \times \frac{\text{MonthlyBidRequests}_{\text{SimilarAdvertiser}}}{\text{MonthlyUniques}_{\text{SimilarAdvertiser}}}$$
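The estimate on this slide can be sketched as a small function. This is a minimal illustration, assuming the formula yields the monthly bid requests available to the new advertiser (an upper bound on the impressions it could buy); the function name, parameter names, and input numbers are hypothetical and do not come from the talk.

```python
def estimate_monthly_bid_requests(new_uniques, similar_bid_requests, similar_uniques):
    """Scale a similar advertiser's monthly bid requests by the ratio of unique devices."""
    if similar_uniques <= 0:
        raise ValueError("similar advertiser needs at least one monthly unique")
    # Average number of bid requests seen per unique device for the similar advertiser.
    requests_per_unique = similar_bid_requests / similar_uniques
    return new_uniques * requests_per_unique


# Hypothetical inputs, for illustration only.
estimate = estimate_monthly_bid_requests(
    new_uniques=200_000,
    similar_bid_requests=50_000_000,
    similar_uniques=1_000_000,
)
print(estimate)
```

The quality of the estimate depends on how "similar" the reference advertiser really is, which the slide leaves open.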
  6. 6. Bid Requests
     - At peak, we get over 150K bid requests/sec
     - High Volume/“Low Value” data
     - Complex data type (bid_sample_avro.json)
     - Not sure of all the ways we would query it
     - At a sampling rate of 1/1000, we are capturing 200MB/Hour
     …in other words: Perfect for Hadoop
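The volumes quoted on this slide imply the following back-of-the-envelope figures. This is only arithmetic on the slide's numbers; the 30-day month and the assumption that 200MB/hour is a steady rate are mine, not the speaker's.

```python
SAMPLE_RATE_DENOMINATOR = 1000   # sampling rate of 1/1000, from the slide
PEAK_REQUESTS_PER_SEC = 150_000  # "over 150K bid requests/sec" at peak
SAMPLED_MB_PER_HOUR = 200        # captured volume after sampling

# Sampled requests written per second at peak.
sampled_per_sec_at_peak = PEAK_REQUESTS_PER_SEC // SAMPLE_RATE_DENOMINATOR

# Sampled volume per day and per 30-day month (assumes a steady hourly rate).
sampled_mb_per_day = SAMPLED_MB_PER_HOUR * 24
sampled_mb_per_30_days = sampled_mb_per_day * 30

# Volume per hour if every request were kept instead of 1 in 1000.
unsampled_mb_per_hour = SAMPLED_MB_PER_HOUR * SAMPLE_RATE_DENOMINATOR

print(sampled_per_sec_at_peak, sampled_mb_per_day, sampled_mb_per_30_days, unsampled_mb_per_hour)
```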
  7. 7. Hadoop Ecosystem
     Heavily fragmented, lots of choices!
     Trends
       - “Distro Wars” – Cloudera vs Hortonworks vs MapR
       - Real-time, interactive ad-hoc querying – aka “Faster Hive”
         - Apache Drill, Cloudera Impala, Stinger Initiative (YARN, Tez, ORCFile)
         - Many influenced by the Google Dremel paper
         - All are similar and seek to improve on M/R's expensive start-up time, and to avoid shuffle/sort disk serialization and unnecessary M/R pipelines where possible
       - New languages/frameworks
         - Many more choices than just Pig and Cascading
         - Scalding, Scoobi, Spark, Crunch/Scrunch
         - Many influenced by the Google FlumeJava paper; they seek to avoid the awkwardness of the UDF programming model and experiment with richer typed data models (not just tuples)
  8. 8. Tapad Hadoop POC
     Some SQL, some code
     POC
       - Hive
         - Familiar SQL syntax
         - Easy to get started
         - Hue/Beeswax makes SQL on Hadoop easy for non-programmers
       - Impala (Cloudera)
         - Most developed of the pack (as of Feb 2013)
       - Scalding (Twitter)
         - “A Scala API for Cascading”
         - Algebird
       - Cloudera CDH4
     On our Radar
       - Hortonworks – Stinger
       - Scoobi
     Also tried
       - Shark/Spark
  9. 9. Serialization
     Serialization Considerations:
       - Parsing efficiency
       - Schema evolution
       - Compactness
       - Complex type support
       - Hadoop ecosystem support
     CSV
     JSON
     Avro – Like Protocol Buffers/Thrift, but better:
       - Dynamic typing – No code gen required
       - Untagged data – Since the schema is included with the data, the serialized size is smaller
       - No manually-assigned field IDs – Schema migrations are a breeze when both the old and new schemas are present
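The "untagged data" point can be illustrated without Avro itself. The sketch below is not Avro's binary format; it only contrasts records that repeat their field names (tagged) with records stored as bare values next to a schema that is written once. The field names and values are hypothetical.

```python
import json

# Hypothetical bid-request-like records; field names are illustrative only.
schema = ["device_id", "platform", "carrier"]
records = [
    {"device_id": "d1", "platform": "iPhone", "carrier": "Sprint"},
    {"device_id": "d2", "platform": "iPad", "carrier": "T-Mobile"},
    {"device_id": "d3", "platform": "Android", "carrier": "MetroPCS"},
]

# Tagged: every record carries its own field names.
tagged = json.dumps(records)

# Untagged: the schema is written once, each record is only values in schema order.
untagged = json.dumps(
    {"schema": schema, "rows": [[record[field] for field in schema] for record in records]}
)

# A reader that has the schema can rebuild the original records.
decoded = json.loads(untagged)
rebuilt = [dict(zip(decoded["schema"], row)) for row in decoded["rows"]]

print(len(tagged), len(untagged), rebuilt == records)
```

The saving grows with the number of records, because the cost of the schema is paid once per file rather than once per record.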
  10. 10. Compression
      Compression Considerations:
        - Splittability
        - Speed vs. Compression
        - Hadoop ecosystem support
      gzip
      lzo
      Snappy
        - “…aims for very high speeds and reasonable compression”
        - Integrates seamlessly with Avro
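The speed-versus-compression trade-off can be observed with Python's standard `zlib` module, which implements DEFLATE, the algorithm behind gzip. Snappy and LZO are not in the standard library, so this sketch only varies the zlib compression level; the payload is made up and the timings depend on the machine.

```python
import time
import zlib


def measure(payload: bytes, level: int):
    """Return (compressed_size, seconds) for one zlib compression pass."""
    start = time.perf_counter()
    compressed = zlib.compress(payload, level)
    seconds = time.perf_counter() - start
    # Round-trip check: decompression must restore the input exactly.
    assert zlib.decompress(compressed) == payload
    return len(compressed), seconds


# Made-up, repetitive JSON-like payload standing in for sampled bid requests.
payload = b'{"device": "iPhone", "carrier": "Sprint", "platform": "In-App"}\n' * 20_000

results = {level: measure(payload, level) for level in (1, 6, 9)}
for level, (size, seconds) in results.items():
    print(f"level={level} compressed_bytes={size} seconds={seconds:.4f}")
```

Lower levels favor speed and higher levels favor size, which is the same axis on which Snappy positions itself toward speed.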
  11. 11. Hive Demo
      -- Table backed by Avro container files; the columns come from the Avro schema.
      CREATE TABLE bids
      ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
      STORED AS
        INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
        OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
      TBLPROPERTIES ('avro.schema.literal' = '<JSON SCHEMA HERE>');

      -- Load a local Avro file into the table.
      LOAD DATA LOCAL INPATH 'bids.avro' INTO TABLE `bids`;
  12. 12. Impala Demo
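  13. 13. Scalding
      // Runs inside a Scalding Job; UnpackedAvroSource comes from the scalding-avro module.
      import cascading.tuple.Tuple
      import java.util.ArrayList
      import scala.collection.JavaConverters._

      UnpackedAvroSource(args("input"), schema = None)
        .read
        // Emit one audienceId per audience record found on the request's device.
        .flatMapTo('request -> 'audienceId) { record: Tuple =>
          val request: Tuple = record.getObject(0).asInstanceOf[Tuple]
          // Fields are read by position; Option(...) guards against a null device.
          val device: Option[Tuple] = Option(request.getObject(6).asInstanceOf[Tuple])
          val audienceRecords: Option[ArrayList[Tuple]] = device.flatMap { record =>
            Option(record.getObject(7).asInstanceOf[ArrayList[Tuple]])
          }
          audienceRecords.toSeq.flatMap { records => records.asScala.map(_.getString(0)) }
        }
        // Count occurrences per audienceId, then sort all rows by that count.
        .groupBy('audienceId) { _.size('count) }
        .groupAll { _.sortBy('count) }
        .debug
        .write(Tsv(args("output")))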
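For readers who do not use Scala, the logic of the Scalding job above can be sketched in plain Python on in-memory data. This is not how the job runs on Hadoop; the dictionary keys (`device`, `audiences`, `id`) are hypothetical stand-ins for the positional fields (0, 6, 7) used on the slide, and ascending sort order is assumed.

```python
from collections import Counter


def audience_ids(record):
    """Yield audience IDs from one bid request; a missing device or list yields nothing."""
    device = record.get("device") or {}
    for audience in device.get("audiences") or []:
        yield audience["id"]


def count_audiences(records):
    """Count requests per audience ID and return (id, count) pairs sorted by count."""
    counts = Counter(aid for record in records for aid in audience_ids(record))
    # Ascending by count, then by ID so ties come out in a stable order.
    return sorted(counts.items(), key=lambda pair: (pair[1], pair[0]))


# Made-up sample input, including requests with no device information.
records = [
    {"device": {"audiences": [{"id": "a1"}, {"id": "a2"}]}},
    {"device": {"audiences": [{"id": "a1"}]}},
    {"device": None},
    {},
]
result = count_audiences(records)
print(result)
```

The `or {}` and `or []` fallbacks play the role of the `Option(...)` wrappers in the Scala version: null devices and null audience lists contribute no rows.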
  14. 14. Hardware
      1 Master Node – 1U
        - 2 x Intel Xeon E5-2620 6-Core 2GHz
        - 64GB DDR-1600 RAM
        - LSI 9240-8i 8-Port RAID Card
        - 2 x 1TB Seagate Constellation.2 SAS
      3 Data Nodes – 2U, 12 HD Bays
        - 2 x Intel Xeon E5-2620 6-Core 2GHz
        - 64GB DDR-1600 RAM
        - LSI 9207-8i 8-Port RAID Card
        - OS Drive: 100GB Intel DC 3700
        - Data Drives: 12 x 3TB Seagate Constellation CS SATA
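The data-node specification above implies a raw storage figure that is easy to work out. The replication factor of 3 is an assumption (it is the HDFS default for `dfs.replication`; the talk does not state the cluster's setting), and the calculation ignores filesystem overhead, temporary space, and compression.

```python
DATA_NODES = 3
DRIVES_PER_NODE = 12
TB_PER_DRIVE = 3
REPLICATION_FACTOR = 3  # assumption: HDFS default, not stated in the talk

# Raw disk capacity across all data drives.
raw_tb = DATA_NODES * DRIVES_PER_NODE * TB_PER_DRIVE

# Approximate capacity for unique data once every block is stored REPLICATION_FACTOR times.
usable_tb = raw_tb // REPLICATION_FACTOR

print(raw_tb, usable_tb)
```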
  15. 15. References
      - Cloudera vs. Hortonworks: http://wikibon.org/wiki/v/The_Hadoop_Wars:_Cloudera_and_Hortonworks%E2%80%99_Death_Match_for_Mindshare
      - Dremel: http://research.google.com/pubs/pub36632.html
        http://www.quora.com/How-will-Googles-Dremel-change-future-Hadoop-releases
      - FlumeJava: http://faculty.neu.edu.cn/cc/zhangyf/cloud-bigdata/papers/big%20data%20programming/FlumeJava-pldi-2010.pdf
      - Hadoop Ecosystem (Mar 2013): http://gigaom.com/2013/03/05/the-hadoop-ecosystem-the-welcome-elephant-in-the-room-infographic/
      - Hardware: http://hortonworks.com/blog/why-not-raid-0-its-about-time-and-snowflakes/
        http://hortonworks.com/blog/best-practices-for-selecting-apache-hadoop-hardware/
      - Impala: https://ccp.cloudera.com/display/IMPALA10BETADOC/Impala+Frequently+Asked+Questions
      - Spark/Shark: http://www.cs.berkeley.edu/~matei/talks/2012/hadoop_summit_spark.pdf
      - Stinger: http://hortonworks.com/blog/100x-faster-hive/
      - SQL on Hadoop: http://gigaom.com/2013/02/21/sql-is-whats-next-for-hadoop-heres-whos-doing-it/
      - Tuples vs. Complex Types: http://www.quora.com/Apache-Hadoop/What-are-the-differences-between-Crunch-and-Cascading
  16. 16. Thank You
      Questions?
      Tapad is hiring!
        - Data Scientists, Platform/Data/Frontend Engineers
        - http://www.tapad.com/careers/
        - michael.moss@tapad.com