Eat whatever you can with PyBabe
Presentation Transcript

  • PyBabe: eat whatever data you wanna eat. Dataiku™
  • Project: the goal was to integrate game logs at a large social-gaming
    actor, IsCool Entertainment (Euronext: ALWEK), 70 people, €10M in
    revenues. Around 30 GB of raw logs per day for 7 games (web, mobile);
    that is about 10 TB per year. At the end, some Hadoop'ing + analytics
    SQL, but in the middle, lots of data integration. Any kind of logs and
    data: partial database extracts, Apache/Nginx logs, tracking logs (web
    analytics, etc.), application logs, REST APIs (currency exchange, geo
    data, Facebook APIs, ...).
  • As a reminder: what do most data scientists do? On LinkedIn & Twitter:
    "Data Science", "Recommendation", "Clustering algorithms", "Big Data",
    "Machine Learning", "Hidden Markov Model", "Predictive Analytics",
    "Logistic Regression". In real life: 80% of the time is spent getting
    the data right, 19% analytics, 1% Twitter & LinkedIn.
  • Goal: a project based on an ETL solution had previously failed. Need
    for agility: to manage any data, and to be quick. The answer is...
    PYTHON!
  • Step 1: open your favorite editor, write a .py file. Scripts for data
    parsing, filling up the database, enrichment, cleanup, etc. Around
    2000 lines of code, 5 man-days of work. Good, but hard to maintain in
    the long run, and not fun. (I switched from emacs to SublimeText2 in
    the meantime; that was cool.)
  • Step 2: abstract and generalize: PyBabe, a micro-ETL in Python. It can
    read and write FTP, HTTP, SQL, the filesystem, Amazon S3, e-mail, ZIP,
    GZIP, MongoDB, Excel, etc. Basic file filters and transformations
    (filters, regular expressions, date parsing, geoip, transpose, sort,
    group, ...). Uses yield and named tuples. Open source. And the old
    project? It became 200 lines of specific code.
  • Sample PyBabe script (1): fetch log files from S3 and load them into a
    database.

        babe = Babe()
        # Fetch multiple CSV files from S3, gzipped, cached locally
        babe = babe.pull(url="s3://myapp/mydir/2012-07-07_*.csv.gz", cache=True)
        # Take the IP from the "ip" field and resolve the country via geoip
        babe = babe.geoip_country_code(field="ip", country_code="country",
                                       ignore_error=True)
        # Parse the user agent and store the browser name
        babe = babe.user_agent(field="user_agent", browser="browser")
        # Keep only the relevant fields
        babe = babe.filterFields(fields=["user_id", "date", "country", "user_agent"])
        # Store the result in a database
        babe.push_sql(database="mydb", table="mytable", username="...")
  • Sample PyBabe script (2): large-file sort and join.

        babe = Babe()
        # Fetch a large CSV file
        babe = babe.pull(filename="mybigfile.csv")
        # Perform a disk-based sort, batching 100k lines in memory
        babe = babe.sortDiskBased(field="uid", nsize=100000)
        # Group by uid and sum revenue per user
        babe = babe.groupBy(field="uid",
                            reducer=lambda x, y: (x.uid, x.amount + y.amount))
        # Join this stream on "uid" with the result of a SQL pull
        babe = babe.join(Babe().pull_sql(database="mydb", table="user_info"),
                         "uid", "uid")
        # Store the result in an Excel file
        babe.push(filename="reports.xlsx")
  • Sample PyBabe script (3): mail a report.

        babe = Babe()
        # Pull the result of a SQL query
        babe = babe.pull(database="mydb", name="First Query", query="SELECT ...")
        # Pull the result of a second SQL query
        babe = babe.pull(database="mydb", name="Second Query", query="SELECT ...")
        # Send the overall (concatenated) stream as an e-mail, with the content
        # attached as Excel and some sample data in the body
        babe = babe.sendmail(subject="Your Report", data_in_body=True,
                             data_in_body_row_limit=10, attach_formats="xlsx")
  • Some design choices. Use collections.namedtuple and generators; a nice
    and easy programming style:

        def filter(stream, f):
            for data in stream:
                if isinstance(data, StreamMeta):
                    yield data
                elif f(data):
                    yield data

    IO streaming whenever possible: an HTTP-downloaded file begins to be
    processed as it starts downloading. Use bulk loaders (SQL) or an
    external program when faster than the Python implementation (e.g.
    gzip).
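The generator style above is easy to exercise end-to-end. Here is a minimal, self-contained sketch: the StreamMeta, StreamHeader and StreamFooter classes below are simplified stand-ins written for this example, not pybabe's real implementations.

```python
from collections import namedtuple

# Simplified stand-ins for PyBabe's stream markers (illustration only).
class StreamMeta(object):
    pass

class StreamHeader(StreamMeta):
    def __init__(self, fields):
        self.row_type = namedtuple("Row", fields)
    def makeRow(self, *values):
        return self.row_type(*values)

class StreamFooter(StreamMeta):
    pass

def filter_rows(stream, f):
    # Metadata passes through untouched; rows are kept only if f(row) holds.
    for data in stream:
        if isinstance(data, StreamMeta):
            yield data
        elif f(data):
            yield data

header = StreamHeader(["name", "amount"])
stream = [header,
          header.makeRow("Florian", 12),
          header.makeRow("John", 3),
          StreamFooter()]
result = list(filter_rows(stream, lambda r: r.amount > 5))
print(result[1])  # Row(name='Florian', amount=12)
```

Because the filter is itself a generator, nothing is materialized until the stream is consumed, which is what makes the IO-streaming behavior described above possible.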
  • PyBabe data model. A Babe works on a generator that contains a
    sequence of partitions; a partition is composed of a header
    (StreamHeader), rows, and a footer (StreamFooter):

        def sample_pull():
            header = StreamHeader(name="visits",
                                  partition={'day': '2012-09-14'},
                                  fields=["name", "day"])
            yield header
            yield header.makeRow('Florian', '2012-09-14')
            yield header.makeRow('John', '2012-09-14')
            yield StreamFooter()
            yield header.replace(partition={'day': '2012-09-15'})
            yield header.makeRow('Phil', '2012-09-15')
            yield StreamFooter()
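To make the model concrete, here is a self-contained sketch of how a consumer might walk such a stream and bucket rows per partition. The StreamHeader and StreamFooter classes below are minimal stand-ins written for this example, not pybabe's own.

```python
from collections import namedtuple

class StreamFooter(object):
    pass

class StreamHeader(object):
    # Minimal stand-in: a header names the partition and builds typed rows.
    def __init__(self, name, partition, fields):
        self.name, self.partition = name, partition
        self.row_type = namedtuple(name, fields)
    def makeRow(self, *values):
        return self.row_type(*values)
    def replace(self, partition):
        return StreamHeader(self.name, partition, self.row_type._fields)

def sample_pull():
    header = StreamHeader("visits", {'day': '2012-09-14'}, ["name", "day"])
    yield header
    yield header.makeRow('Florian', '2012-09-14')
    yield header.makeRow('John', '2012-09-14')
    yield StreamFooter()
    header = header.replace({'day': '2012-09-15'})
    yield header
    yield header.makeRow('Phil', '2012-09-15')
    yield StreamFooter()

# Walk the stream: headers open a partition, footers close it,
# everything else is a row of the current partition.
partitions = {}
current = None
for item in sample_pull():
    if isinstance(item, StreamHeader):
        current = item.partition['day']
        partitions[current] = []
    elif isinstance(item, StreamFooter):
        current = None
    else:
        partitions[current].append(item.name)

print(partitions)  # {'2012-09-14': ['Florian', 'John'], '2012-09-15': ['Phil']}
```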
  • Some thoughts and associated projects. strptime and performance:
    parsing a date with time.strptime or datetime.strptime takes about 30
    microseconds, vs. 3 microseconds for regexp matching! "Tarpys": a
    date-parsing library, with date guessing. Charset management
    (pyencoding_cleaner): sniff the ISO or UTF-8 charset over a fragment;
    optionally try to fix bad encoding (î, í, ü). Python 2.x's csv
    module is OK, but: no Unicode support, and separator sniffing is buggy
    on edge cases.
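The strptime-versus-regexp gap is easy to check with a small benchmark. This is only a sketch; the absolute numbers depend on the machine and Python version, and the point is the relative difference, not the exact figures.

```python
import re
import time
from datetime import datetime

DATE_RE = re.compile(r"(\d{4})-(\d{2})-(\d{2})")

def parse_regex(s):
    # Parse an ISO date by matching the three numeric groups directly.
    m = DATE_RE.match(s)
    return datetime(int(m.group(1)), int(m.group(2)), int(m.group(3)))

def bench(f, s, n=10000):
    # Average microseconds per call over n iterations.
    start = time.time()
    for _ in range(n):
        f(s)
    return (time.time() - start) / n * 1e6

s = "2012-09-14"
# Both parsers must agree before comparing speed.
assert parse_regex(s) == datetime.strptime(s, "%Y-%m-%d")
print("strptime: %.1f us/call" % bench(lambda x: datetime.strptime(x, "%Y-%m-%d"), s))
print("regexp:   %.1f us/call" % bench(parse_regex, s))
```

The regexp version wins because strptime re-interprets its format string on every call and goes through locale-aware machinery that a fixed-format regexp skips entirely.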
  • Future. Separate the GitHub project into core and plugins. Rewrite a
    CSV module in C? A configurable error system: should an error row fail
    the whole stream, fail the whole babe, send a warning, or be skipped?
    Pandas/NumPy integration. A homepage, docs, etc.
  • Questions?

        babe = Babe().pull("questions.csv")
        babe = babe.filter(smart=True)
        babe = babe.mapTo(oracle)
        babe.push("answers.csv")

    Florian Douetteau, @fdouetteau, CEO of Dataiku. Dataiku: our goal is
    to leverage and provide the best of open-source technologies to help
    people build their own data science platform.