Python in an Evolving Enterprise System (PyData SV 2013)
Video can be found here: https://vimeo.com/63253563

    Presentation Transcript

    • PYTHON IN AN EVOLVING ENTERPRISE SYSTEM
      EVALUATING INTEGRATION SOLUTIONS WITH HADOOP
      DAVE HIMROD, STEVE KANNAN, ANGELICA PANDO
    • Building today’s most powerful, open, and customizable advertising technology platform.
    • Ad is served in <100 milliseconds
      [Diagram: a BID REQUEST for a 300x250 AD goes to ADVERTISER 1, ADVERTISER 2, and ADVERTISER 3; the RESPONSE bids are $2.50, $3.25, and $4.10; APPNEXUS OPTIMIZATION; WINNING AUCTION BID]
    • Evolution of AppNexus
      ›  PEOPLE: from 20 to 350 to 430
      ›  AD REQUESTS: from 100M to 39B to 45B
      ›  5000+ SERVERS
      ›  MYSQL, HADOOP/HBASE, AEROSPIKE, NETEZZA, VERTICA
      ›  38+ TB OF DATA EVERY DAY
      ›  99.99% UPTIME
    • Evolution of AppNexus
      [Map labels: ENG OFFICES; ENGINEERING IN PORTLAND; HQ IN NYC & SF]
    • Data-Driven Decisioning (D3)
      [Diagram: BIDDERS (Bidder, Bidder, Bidder); DATA PIPELINE; D3 PROCESSING]
    • Python at AppNexus
      Python enables us to scale our team and rapidly iterate and prototype technologies.
    • Hadoop at AppNexus
      Hadoop enables us to do aggregations for reporting and other data pipeline jobs
      ›  1PB CLUSTER
      ›  862 NODES ACROSS SEVERAL CLUSTERS
      ›  40 BILLION LOG RECORDS DAILY
      ›  5.6 BILLION LOG RECORDS/HOUR AT PEAK
    • Data modeling today
      [Diagram labels: BIG DATA: TBS/HOUR; MEDIUM DATA: GBS/HOUR; four Tasks, each writing logs; CACHE; VERTICA; HADOOP; Σ; DATA SERVICES; DATA DRIVEN DECISIONING]
    • To enable the next generation of data modeling, we need to leverage our Hadoop cluster
    • What are we trying to do
      Access the data on Hadoop
      Continue to use Python to model
      → No consensus on the best solution
      So we conducted our own research to evaluate integration options
    • The budget problem
      We have thousands of bidders buying billions of ads per hour in real-time auctions.
      We need to create a model that can manipulate how our bidders spend their budgets and purchase ads.
    • Data modeling today
      [Diagram labels: BIG DATA: TBS/HOUR; MEDIUM DATA: GBS/HOUR; four Tasks, each writing logs; CACHE; VERTICA; HADOOP; Σ; DATA SERVICES; DATA DRIVEN DECISIONING (shown twice)]
    • Test problem: Budget aggregation
      SCENARIO:
      Each auction creates a row in a log.
        timestamp, auction_id, object_type, object_id, method, value
      We need to aggregate and model to update bidders.
    • Method: Budget aggregation
      STEP 1: De-duplicate records where
        KEY: object_type, object_id, method, auction_id
      STEP 2: Aggregate value where
        KEY: object_type, object_id, method
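The two steps above can be sketched in plain Python, outside Hadoop, as a reference for what each implementation has to compute. This is a minimal in-memory sketch under stated assumptions: the function name `aggregate_budget` is hypothetical, the sample rows are made up, the aggregate is taken to be a sum, and the first occurrence of a duplicate is kept. The slides do not say which duplicate survives or which aggregate is used.

```python
from collections import defaultdict


def aggregate_budget(rows):
    """De-duplicate, then sum `value` per (object_type, object_id, method).

    rows: iterable of
        (timestamp, auction_id, object_type, object_id, method, value)
    """
    seen = set()
    totals = defaultdict(float)
    for timestamp, auction_id, object_type, object_id, method, value in rows:
        # STEP 1: drop records that repeat the four-field de-duplication key.
        dedup_key = (object_type, object_id, method, auction_id)
        if dedup_key in seen:
            continue  # assumption: the first occurrence wins
        seen.add(dedup_key)
        # STEP 2: aggregate on the three-field key (assumption: a sum).
        totals[(object_type, object_id, method)] += float(value)
    return dict(totals)


# Made-up sample rows; the field values are illustrative only.
rows = [
    (1362596400, "a1", "campaign", 7, "spend", 2.5),
    (1362596401, "a1", "campaign", 7, "spend", 2.5),  # duplicate of the row above
    (1362596402, "a2", "campaign", 7, "spend", 1.0),
    (1362596403, "a3", "advertiser", 3, "spend", 4.0),
]
print(aggregate_budget(rows))
```

Holding every de-duplication key in a set only works when the data fits in memory, which is the limitation the MapReduce versions are meant to remove.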
    • HARDWARE
      •  300 GB of log data
      •  5 nodes running Scientific Linux 6.3 (Carbon)
         •  Intel Xeon CPU @ 2.13 GHz, 4 cores
         •  2 TB Disk
      •  CDH4
      •  45 map, 35 reduce tasks at a time
    • Research: Potential solutions
      1. Native Java
      2. Streaming: no framework
      3. mrjob
      4. Happy / Jython / PyCascading
      5. Pig + Jython UDF
      6. Pydoop: prohibitive installation
      7. Disco: evaluating Hadoop
      8. Hadoopy / dumbo: similar to mrjob
      9. Hipy: effectively ORM for Hive
    • Research: Criteria
      1. Usability
      2. Performance
      3. Versatility / Flexibility
    • Research: Native Java
      Benchmark for comparison, using new Hadoop Java API
      [Code listing: BudgetAgg.java Mapper class]
      [Code listing: BudgetAgg.java Reducer class]
    • Research: Native Java
      USABILITY:
        ›  Not straightforward for analysts to implement, launch, or tweak
      PERFORMANCE:
        ›  Fastest implementation.
        ›  Can further enhance by overriding comparators for grouping and sorting
    • Research: Native Java
      VERSATILITY / FLEXIBILITY:
        ›  Ability to customize pretty much everything
        ›  Custom Partitioner, Comparator, Grouping Comparator in our implementation
        ›  Can use complex objects as keys or values
    • Research: Streaming
      Supplies an executable to Hadoop that reads from stdin and writes to stdout
      [Code listings: mapper.py, reducer.py]
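The mapper listing did not survive in the transcript. As a rough illustration of the stdin-to-stdout contract, here is a minimal mapper sketch, not the presenters' code. It assumes comma-separated input rows in the field order given for the test problem and emits tab-separated records with the aggregation key first. The names `map_lines` and `run` are hypothetical and the sample rows are made up.

```python
import io


def map_lines(lines, sep=","):
    """Yield tab-separated records whose first four fields form the sort key."""
    for line in lines:
        fields = [field.strip() for field in line.rstrip("\n").split(sep)]
        if len(fields) != 6:
            continue  # assumption: silently skip malformed rows
        timestamp, auction_id, object_type, object_id, method, value = fields
        # Aggregation key first, then auction_id, so duplicates sort together.
        yield "\t".join((object_type, object_id, method, auction_id, value))


def run(stream_in, stream_out):
    """Streaming entry point; a real mapper.py would call run(sys.stdin, sys.stdout)."""
    for record in map_lines(stream_in):
        stream_out.write(record + "\n")


# Local demonstration with made-up rows in place of stdin.
demo_in = io.StringIO(
    "1362596400,a1,campaign,7,spend,2.5\n"
    "1362596402,a2,campaign,7,spend,1.0\n"
)
demo_out = io.StringIO()
run(demo_in, demo_out)
print(demo_out.getvalue(), end="")
```

Keeping the parsing in a generator separate from the I/O wrapper makes the mapper testable locally without a cluster.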
    • Research: Streaming
      USABILITY:
        ›  Key/value detection has to be done by the user
        ›  Still, straightforward for relatively simple jobs
      hadoop jar /usr/lib/hadoop-0.23.0-mr1-cdh4b1/contrib/streaming/hadoop-*streaming*.jar \
        -D stream.num.map.output.key.fields=4 \
        -D num.key.fields.for.partition=3 \
        -D mapred.reduce.tasks=35 \
        -file mapper.py -mapper mapper.py \
        -file reducer.py -reducer reducer_nongroup.py \
        -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner \
        -input /logs/log_budget/v002/2013/03/06/19/ \
        -output bidder_logs/streaming_output
    • Research: Streaming
      PERFORMANCE:
        ›  ~50% slower than Java
      VERSATILITY / FLEXIBILITY:
        ›  Inputs in reducer are iterated line-by-line
        ›  Straightforward to get de-duplication and agg to work in a single step
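A single-pass reducer along these lines could look like the sketch below. It assumes the reducer receives tab-separated lines sorted on the four key fields (object_type, object_id, method, auction_id) and partitioned on the first three, which is what the streaming options shown earlier request. Duplicates are then adjacent, so de-duplication only has to remember the previous auction_id. The function name `reduce_lines`, the sample lines, and the choice of a sum are illustrative assumptions, not the presenters' reducer.

```python
from itertools import groupby


def reduce_lines(lines):
    """De-duplicate and sum in one pass over sorted, tab-separated lines.

    Each input line: object_type, object_id, method, auction_id, value.
    Each output line: object_type, object_id, method, total.
    """
    parsed = (line.rstrip("\n").split("\t") for line in lines)
    # Consecutive lines sharing the first three fields form one aggregation group.
    for agg_key, group in groupby(parsed, key=lambda fields: tuple(fields[:3])):
        total = 0.0
        previous_auction = object()  # sentinel that never equals a real id
        for fields in group:
            auction_id = fields[3]
            if auction_id == previous_auction:
                continue  # adjacent duplicate: same four-field key as the last line
            previous_auction = auction_id
            total += float(fields[4])
        yield "\t".join(agg_key + (str(total),))


# Made-up, already-sorted input standing in for what Hadoop would deliver.
sorted_lines = [
    "advertiser\t3\tspend\ta3\t4.0\n",
    "campaign\t7\tspend\ta1\t2.5\n",
    "campaign\t7\tspend\ta1\t2.5\n",
    "campaign\t7\tspend\ta2\t1.0\n",
]
for out in reduce_lines(sorted_lines):
    print(out)
```

Because the sort order does the grouping, the reducer keeps only one running total and one previous id in memory, regardless of input size.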
    • Research: mrjob
      Open-source Python framework that wraps Hadoop Streaming
      USABILITY:
        ›  “Simplified Java”
        ›  Great docs, actively developed
      python budget_agg.py -r hadoop --hadoop-bin /usr/bin/hadoop \
        --jobconf stream.num.map.output.key.fields=4 \
        --jobconf num.key.fields.for.partition=3 \
        --jobconf mapred.reduce.tasks=35 \
        --partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner \
        -o hdfs:///user/apando/budget_logs/mrjob_output \
        hdfs:///logs/log_budget/v002/2013/03/06/19/
    • Research: mrjob
      PERFORMANCE:
        ›  Not much slower than Streaming if only using RawValueProtocol
    • Research: mrjob
      PERFORMANCE:
        ›  Involving objects or multiple steps slows it down a lot
      VERSATILITY / FLEXIBILITY:
        ›  Can define Input / Internal / Output protocols
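One way to picture the protocol cost is to compare what happens to a single record under a raw protocol and under a JSON-style protocol. The sketch below is a stand-in written with the standard `json` module, not mrjob code. The helper names are hypothetical, and it assumes a JSON protocol that writes the key and value as two JSON documents separated by a tab.

```python
import json


def raw_value_step(line):
    """Raw protocol: the line is handed over as-is, with no parsing."""
    return line


def json_step(key, value):
    """JSON-style protocol: encode on write, split and decode on read."""
    encoded = json.dumps(key) + "\t" + json.dumps(value)
    raw_key, raw_value = encoded.split("\t", 1)
    return json.loads(raw_key), json.loads(raw_value)


key = ("campaign", 7, "spend")  # made-up aggregation key
print(raw_value_step("campaign\t7\tspend\t2.5"))
print(json_step(key, 2.5))  # the tuple key comes back as a list
```

Every record that crosses a step boundary pays the encode and decode cost once per step, which is consistent with the slowdown noted for jobs that pass objects or chain multiple steps.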
    • Research: Happy / Jython
      HAPPY:
        ›  Full access to Java MapReduce API
        ›  Happy project is deprecated
        ›  Depends on Hadoop 0.17
      JYTHON:
        ›  Doesn’t work easily out of the box
        ›  Relies on deprecated Jython compiler in Jython 2.2
        ›  Limited to Jython implementation of Python
        ›  NumPy/SciPy and Pandas unavailable
    • Research: PyCascading
      Python wrapper around Cascading framework for data processing workflow.
      Uses Jython as high level language for defining workflows.
    • Research: PyCascading
      USABILITY:
        ›  Relatively new project
        ›  Cascading API is simple and intuitive
        ›  Job Planner abstracts details of MapReduce
      PERFORMANCE:
        ›  Abstraction makes performance tuning challenging
        ›  Does not support Combiner operation
        ›  Dev time was fast, runtime was slow
    • Research: PyCascading
      VERSATILITY / FLEXIBILITY:
        ›  Allows Jython UDFs
        ›  Rich set of built-in functions: GroupBy, Join, Merge
    • Research: Pig
      Provides a high-level language for data analysis which is compiled into a sequence of MapReduce operations.
    • Research: Pig
      USABILITY:
        ›  Powerful debugging and optimization tools (e.g. explain, illustrate)
        ›  Automatically optimizes MapReduce operations:
           ›  Applies Combiner operations where applicable
           ›  Reorders and conflates data flow for efficiency
    • Research: Pig
      PERFORMANCE:
        ›  Pig compiler produces performant code
        ›  Complex operations might require manual optimization
        ›  Budget Aggregation requires implementing a User Defined Function in Jython to eliminate an unnecessary MapReduce step
    • Research: Pig
      VERSATILITY / FLEXIBILITY: USING PIG + JYTHON UDF
        ›  PigLatin is expressive and can capture most use cases
        ›  Define custom data operations in Jython called UDFs
        ›  UDFs can implement custom loaders, partitioners, and other advanced features
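The slides do not show the UDF itself. As a hedged illustration of the idea, the sketch below folds de-duplication and summation into one function applied to a grouped bag, so no separate de-duplication pass is needed. It assumes the bag arrives as a list of (auction_id, value) tuples. The name `dedup_sum` and the schema string are hypothetical. The local `outputSchema` is a stand-in for the decorator Pig makes available to Jython UDF scripts, defined here only so the sketch runs outside Pig.

```python
def outputSchema(schema):
    """Stand-in decorator for running outside Pig; it only records the schema."""
    def decorator(func):
        func.outputSchema = schema
        return func
    return decorator


@outputSchema("total:double")
def dedup_sum(bag):
    """Sum `value` over a bag of (auction_id, value) tuples, counting each id once."""
    seen = set()
    total = 0.0
    for auction_id, value in bag:
        if auction_id in seen:
            continue  # assumption: the first occurrence wins
        seen.add(auction_id)
        total += float(value)
    return total


# Made-up bag for one (object_type, object_id, method) group.
print(dedup_sum([("a1", 2.5), ("a1", 2.5), ("a2", 1.0)]))
```

In a Pig script, a function like this would be registered from a Jython file and applied after a GROUP on the three-field aggregation key. That wiring is omitted here because the original script is not in the transcript.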
    • Research: Summary
      Running Time / Lines of Code for Implementations
      [Bar chart: Running Time (minutes) and Lines of Code for Pig, PyCascading, MRJob, Streaming, and Java; axis from 0 to 300]
    • Research: Recommendations
      •  Pig and PyCascading enable complex pipelines to be expressed simply
      •  Pig is more mature and the most viable option for ad-hoc analysis
    • QUESTIONS
      pydata@appnexus.com