Python in an Evolving Enterprise System (PyData SV 2013)

Video can be found here: https://vimeo.com/63253563

  1. [Logo slide: ASCII-art image, not reproduced]
  2. PYTHON IN AN EVOLVING ENTERPRISE SYSTEM
     EVALUATING INTEGRATION SOLUTIONS WITH HADOOP
     DAVE HIMROD, STEVE KANNAN, ANGELICA PANDO
  3. Building today’s most powerful, open, and customizable advertising technology platform.
  4. Ad is served in <100 milliseconds
     [Diagram labels: bid request for a 300x250 ad; Advertiser 1, Advertiser 2, Advertiser 3; bid responses of $2.50, $3.25, and $4.10; winning auction; AppNexus optimization]
  5. Evolution of AppNexus
     People: from 20 to 350 to 430
     Ad requests: from 100M to 39B to 45B
     5000+ servers
     MySQL, Hadoop/HBase, Aerospike, Netezza, Vertica
     38+ TB of data every day
     99.99% uptime
  6. Evolution of AppNexus
     Engineering HQ in NYC
     Eng offices in Portland & SF
  7. Data-Driven Decisioning (D3)
     [Diagram labels: three bidders; Bidders; Data Pipeline; D3 Processing]
  8. Python at AppNexus
     Python enables us to scale our team and rapidly iterate and prototype technologies.
  9. Hadoop at AppNexus
     Hadoop enables us to do aggregations for reporting and other data pipeline jobs.
     1 PB cluster
     862 nodes across several clusters
     40B log records daily
     5.6B log records/hour at peak
  10. Data modeling today
      [Diagram labels: Big data (TBs/hour); Medium data (GBs/hour); tasks writing logs; Cache; Vertica; Hadoop; Σ; Data Services; Data Driven Decisioning]
  11. To enable the next generation of data modeling, we need to leverage our Hadoop cluster.
  12. What are we trying to do
      Access the data on Hadoop
      Continue to use Python to model
      → No consensus on the best solution
      So we conducted our own research to evaluate integration options
  13. The budget problem
      We have thousands of bidders buying billions of ads per hour in real-time auctions.
      We need to create a model that can manipulate how our bidders spend their budgets and purchase ads.
  14. Data modeling today
      [Diagram labels: Big data (TBs/hour); Medium data (GBs/hour); tasks writing logs; Cache; Vertica; Hadoop; Σ; Data Services; Data Driven Decisioning, shown twice on this slide]
  15. Test problem: Budget aggregation
      SCENARIO:
      Each auction creates a row in a log.
      timestamp, auction_id, object_type, object_id, method, value
      We need to aggregate and model to update bidders.
  16. Method: Budget aggregation
      STEP 1: De-duplicate records where
      KEY: object_type, object_id, method, auction_id
      STEP 2: Aggregate value where
      KEY: object_type, object_id, method
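
The two steps on slide 16 can be sketched in plain Python, independent of Hadoop. This is a minimal illustration, not the speakers' implementation: the tab delimiter, the choice to keep the first record of each duplicate group, summation as the aggregate, and the names `parse_row` and `budget_aggregate` are all assumptions made for the example. Only the column order and the two keys come from slides 15 and 16.

```python
from collections import defaultdict

# Column order from slide 15; the tab delimiter is an assumption.
FIELDS = ("timestamp", "auction_id", "object_type", "object_id", "method", "value")


def parse_row(line, sep="\t"):
    """Split one log line into a dict keyed by the slide-15 column names."""
    row = dict(zip(FIELDS, line.rstrip("\n").split(sep)))
    row["value"] = float(row["value"])
    return row


def budget_aggregate(lines):
    """De-duplicate on the 4-field key, then sum `value` per 3-field key."""
    seen = set()                 # 4-field keys already counted (step 1)
    totals = defaultdict(float)  # running sum per 3-field key (step 2)
    for line in lines:
        row = parse_row(line)
        agg_key = (row["object_type"], row["object_id"], row["method"])
        dedup_key = agg_key + (row["auction_id"],)
        if dedup_key in seen:
            continue  # assumption: later duplicates are dropped, first one wins
        seen.add(dedup_key)
        totals[agg_key] += row["value"]
    return dict(totals)
```

Holding every de-duplication key in memory only works for small samples; at the 300 GB scale described in the talk, the same logic has to be distributed, which is what the evaluated frameworks provide.
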
  17. HARDWARE
      • 300 GB of log data
      • 5 nodes running Scientific Linux 6.3 (Carbon)
        • Intel Xeon CPU @ 2.13 GHz, 4 cores
        • 2 TB disk
      • CDH4
      • 45 map, 35 reduce tasks at a time
  18. Research: Potential solutions
      1. Native Java
      2. Streaming (no framework)
      3. mrjob
      4. Happy / Jython / PyCascading
      5. Pig + Jython UDF
      6. Pydoop: prohibitive installation
      7. Disco: evaluating Hadoop
      8. Hadoopy / dumbo: similar to mrjob
      9. Hipy: effectively ORM for Hive
  19. Research: Criteria
      1. Usability
      2. Performance
      3. Versatility / Flexibility
  20. Research: Native Java
      Benchmark for comparison, using new Hadoop Java API
      [Code shown on slide: BudgetAgg.java Mapper class; BudgetAgg.java Reducer class]
  21. Research: Native Java
      USABILITY:
      › Not straightforward for analysts to implement, launch, or tweak
      PERFORMANCE:
      › Fastest implementation
      › Can further enhance by overriding comparators for grouping and sorting
  22. Research: Native Java
      VERSATILITY / FLEXIBILITY:
      › Ability to customize pretty much everything
      › Custom Partitioner, Comparator, Grouping Comparator in our implementation
      › Can use complex objects as keys or values
  23. Research: Streaming
      Supplies an executable to Hadoop that reads from stdin and writes to stdout
      [Code shown on slide: mapper.py; reducer.py]
  24. Research: Streaming
      USABILITY:
      › Key/value detection has to be done by the user
      › Still, straightforward for relatively simple jobs
      hadoop jar /usr/lib/hadoop-0.23.0-mr1-cdh4b1/contrib/streaming/hadoop-*streaming*.jar \
        -D stream.num.map.output.key.fields=4 \
        -D num.key.fields.for.partition=3 \
        -D mapred.reduce.tasks=35 \
        -file mapper.py \
        -mapper mapper.py \
        -file reducer.py \
        -reducer reducer_nongroup.py \
        -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner \
        -input /logs/log_budget/v002/2013/03/06/19/ \
        -output bidder_logs/streaming_output
  25. Research: Streaming
      PERFORMANCE:
      › ~50% slower than Java
      VERSATILITY / FLEXIBILITY:
      › Inputs in reducer are iterated line-by-line
      › Straightforward to get de-duplication and agg to work in a single step
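
The Streaming slides mention mapper.py and reducer.py but the transcript does not include their contents. The following is a hedged sketch of how such a pair could perform de-duplication and aggregation in a single pass, consistent with the flags on slide 24 (four key fields for sorting, the first three for partitioning). The function names `map_lines`, `reduce_lines`, and `run`, the tab-delimited input, the skipping of malformed rows, and summation as the aggregate are assumptions, not the speakers' code.

```python
import sys


def map_lines(lines, sep="\t"):
    """Mapper: move the four key fields to the front of each row."""
    for line in lines:
        parts = line.rstrip("\n").split(sep)
        if len(parts) != 6:
            continue  # assumption: malformed rows are silently skipped
        _timestamp, auction_id, object_type, object_id, method, value = parts
        yield sep.join((object_type, object_id, method, auction_id, value))


def reduce_lines(lines, sep="\t"):
    """Reducer: input arrives sorted on all four key fields, so duplicates
    are adjacent and each 3-field group is contiguous."""
    current_key = None
    last_auction = None
    total = 0.0
    for line in lines:
        object_type, object_id, method, auction_id, value = line.rstrip("\n").split(sep)
        key = (object_type, object_id, method)
        if key != current_key:
            if current_key is not None:
                yield sep.join(current_key + (str(total),))
            current_key, last_auction, total = key, None, 0.0
        if auction_id == last_auction:
            continue  # same 4-field key as the previous line: a duplicate
        last_auction = auction_id
        total += float(value)
    if current_key is not None:
        yield sep.join(current_key + (str(total),))


def run(stage, stdin=None, stdout=None):
    """Wire one stage to stdin/stdout, as a Streaming executable would."""
    stdin = sys.stdin if stdin is None else stdin
    stdout = sys.stdout if stdout is None else stdout
    func = map_lines if stage == "map" else reduce_lines
    for out in func(stdin):
        stdout.write(out + "\n")
```

Outside Hadoop, the shuffle can be approximated for a quick check with `reduce_lines(sorted(map_lines(rows)))`. Because the reducer only compares each line with the previous one, it needs constant memory per group, which is what makes the single-step approach on slide 25 practical.
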
  26. Research: mrjob
      Open-source Python framework that wraps Hadoop Streaming
      USABILITY:
      › “Simplified Java”
      › Great docs, actively developed
      python budget_agg.py -r hadoop \
        --hadoop-bin /usr/bin/hadoop \
        --jobconf stream.num.map.output.key.fields=4 \
        --jobconf num.key.fields.for.partition=3 \
        --jobconf mapred.reduce.tasks=35 \
        --partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner \
        -o hdfs:///user/apando/budget_logs/mrjob_output \
        hdfs:///logs/log_budget/v002/2013/03/06/19/
  27. Research: mrjob
      PERFORMANCE:
      › Not much slower than Streaming if only using RawValueProtocol
  28. Research: mrjob
      PERFORMANCE:
      › Involving objects or multiple steps slows it down a lot
      VERSATILITY / FLEXIBILITY:
      › Can define Input / Internal / Output protocols
  29. Research: Happy / Jython
      HAPPY:
      › Full access to Java MapReduce API
      › Happy project is deprecated
      › Depends on Hadoop 0.17
      JYTHON:
      › Doesn’t work easily out of the box
      › Relies on deprecated Jython compiler in Jython 2.2
      › Limited to Jython implementation of Python
      › NumPy/SciPy and pandas unavailable
  30. Research: PyCascading
      Python wrapper around Cascading framework for data processing workflow.
      Uses Jython as high-level language for defining workflows.
  31. Research: PyCascading
      USABILITY:
      › Relatively new project
      › Cascading API is simple and intuitive
      › Job Planner abstracts details of MapReduce
      PERFORMANCE:
      › Abstraction makes performance tuning challenging
      › Does not support Combiner operation
      › Dev time was fast, runtime was slow
  32. Research: PyCascading
      VERSATILITY / FLEXIBILITY:
      › Allows Jython UDFs
      › Rich set of built-in functions: GroupBy, Join, Merge
  33. Research: Pig
      Provides a high-level language for data analysis which is compiled into a sequence of MapReduce operations.
      USABILITY:
  34. Research: Pig
      USABILITY:
      › Powerful debugging and optimization tools (e.g. explain, illustrate)
      › Automatically optimizes MapReduce operations:
        › Applies Combiner operations where applicable
        › Reorders and conflates data flow for efficiency
  35. Research: Pig
      PERFORMANCE:
      › Pig compiler produces performant code
      › Complex operations might require manual optimization
      › Budget Aggregation required the implementation of a User Defined Function in Jython to eliminate an unnecessary MapReduce step
  36. Research: Pig
      VERSATILITY / FLEXIBILITY: USING PIG + JYTHON UDF
      › Pig Latin is expressive and can capture most use cases
      › Define custom data operations in Jython called UDFs
      › UDFs can implement custom loaders, partitioners, and other advanced features
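
Slide 35 says the budget aggregation needed a Jython UDF to avoid an extra MapReduce step, but the transcript does not show it. The following is a guess at what such a UDF might look like, not the speakers' code. It assumes the Pig script groups rows by (object_type, object_id, method) and passes each group's bag of (auction_id, value) tuples to the function, which de-duplicates and sums in one call. The name `dedupe_and_sum` and the schema string are hypothetical. Pig makes the `outputSchema` decorator available when it registers a Jython script; the fallback definition below exists only so the sketch also runs in a plain Python interpreter.

```python
# Pig provides `outputSchema` to registered Jython scripts. This fallback is an
# addition for running the sketch outside Pig; it just records the schema string.
try:
    outputSchema
except NameError:
    def outputSchema(schema):
        def decorator(func):
            func.output_schema = schema
            return func
        return decorator


@outputSchema("total:double")
def dedupe_and_sum(bag):
    """Sum `value` over a bag of (auction_id, value) tuples, counting each
    auction_id once. Pig hands a bag to Jython as a list of tuples."""
    seen = set()
    total = 0.0
    for auction_id, value in bag:
        if auction_id in seen:
            continue  # assumption: first record per auction_id wins
        seen.add(auction_id)
        total += float(value)
    return total
```

Doing both steps inside one function applied to each group is one way the de-duplication and the aggregation could share a single GROUP, instead of requiring one MapReduce job each.
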
  37. Research: Summary
      Running Time / Lines of Code for Implementations
      [Bar chart: Lines of Code and Running Time (minutes) for Pig, PyCascading, MRJob, Streaming, and Java; axis from 0 to 300; bar values not recoverable]
  38. Research: Recommendations
      • Pig and PyCascading enable complex pipelines to be expressed simply
      • Pig is more mature and the most viable option for ad-hoc analysis
  39. QUESTIONS
      pydata@appnexus.com