Eat whatever you can with PyBabe

    1. PyBabe: Eat whatever data you wanna eat. Dataiku™
    2. Project. The goal was to integrate game logs at a large social-gaming
       actor: IsCool Entertainment (Euronext: ALWEK), 70 people, €10M in
       revenues. Around 30 GB of raw logs per day across 7 games (web and
       mobile); that's about 10 TB per year. At the end, some Hadoop'ing plus
       analytics SQL, but in the middle, lots of data integration: any kind
       of logs and data, partial database extracts, Apache/Nginx logs,
       tracking logs (web-analytics stuff, etc.), application logs, REST APIs
       (currency exchange, geo data, Facebook APIs, ...).
    3. As a reminder: what do most data scientists actually do? On LinkedIn
       and Twitter, it's all "Data Science", "Recommendation", "Clustering
       algorithms", "Big Data", "Machine Learning", "Hidden Markov Model",
       "Predictive Analytics", "Logistic Regression". In real life: 80% of
       the time is spent getting the data right, 19% on analytics, and 1% on
       Twitter & LinkedIn.
    4. Goal. A project based on an ETL solution had previously failed. Need
       for agility: to manage any data, and to be quick. The answer was...
       PYTHON!!!
    5. Step 1: open your favorite editor, write a .py file. A script for data
       parsing, filling up the database, enrichment, cleanup, etc. Around
       2,000 lines of code, 5 man-days of work! Good, but hard to maintain in
       the long run, and not fun. (I switched from Emacs to Sublime Text 2 in
       the meantime; that was cool.)
    6. Step 2: abstract and generalize: PyBabe, a micro-ETL in Python. It can
       read and write FTP, HTTP, SQL, the filesystem, Amazon S3, e-mail, ZIP,
       GZIP, MongoDB, Excel, etc. Basic file filters and transformations
       (filters, regular expressions, date parsing, geoip, transpose, sort,
       group, ...). It uses yield and named tuples (a sketch of that idea
       follows below). Open source: https://github.com/fdouetteau/PyBabe.
       And the old project? It became 200 lines of specific code.
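       The "yield and named tuples" idea can be sketched in a few lines of
       plain Python. This is a minimal illustration, not PyBabe's actual API;
       the file and field names are hypothetical. Each step is a generator
       that consumes and yields namedtuple rows, so the whole pipeline
       streams lazily:

           import csv
           from collections import namedtuple

           def read_csv(path):
               # Build the row type from the CSV header, then stream rows.
               with open(path) as f:
                   reader = csv.reader(f)
                   Row = namedtuple('Row', next(reader))
                   for line in reader:
                       yield Row(*line)

           def keep(stream, predicate):
               # A filter step: pass through only rows matching the predicate.
               for row in stream:
                   if predicate(row):
                       yield row

           # Hypothetical usage: stream French visits out of a visits file.
           # for row in keep(read_csv('visits.csv'), lambda r: r.country == 'FR'):
           #     print(row)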
    7. Sample PyBabe script (1): fetch log files from S3 and integrate them
       into a database.

           babe = Babe()
           ## Fetch multiple CSV files from S3 and cache them locally
           babe = babe.pull(url="s3://myapp/mydir/2012-07-07_*.csv.gz", cache=True)
           ## Take the IP from the "ip" field, resolve the country via geoip
           babe = babe.geoip_country_code(field="ip", country_code="country", ignore_error=True)
           ## Parse the user agent and store the browser name
           babe = babe.user_agent(field="user_agent", browser="browser")
           ## Keep only the relevant fields
           babe = babe.filterFields(fields=["user_id", "date", "country", "user_agent"])
           ## Store the result in a database
           babe.push_sql(database="mydb", table="mytable", username="...")
    8. Sample PyBabe script (2): large-file sort and join. (A sketch of the
       disk-based sort technique follows below.)

           babe = Babe()
           ## Fetch a large CSV file
           babe = babe.pull(filename="mybigfile.csv")
           ## Perform a disk-based sort, batching 100k lines in memory
           babe = babe.sortDiskBased(field="uid", nsize=100000)
           ## Group by uid and sum the revenue per user
           babe = babe.groupBy(field="uid", reducer=lambda x, y: (x.uid, x.amount + y.amount))
           ## Join this stream on "uid" with a table pulled from SQL
           babe = babe.join(Babe().pull_sql(database="mydb", table="user_info"), "uid", "uid")
           ## Store the result in an Excel file
           babe.push(filename="reports.xlsx")
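       sortDiskBased is an external sort: sort fixed-size batches in memory,
       spill each sorted run to a temporary file, then merge the runs. A
       minimal sketch with the standard library, assuming rows are picklable
       and the key is comparable (not PyBabe's actual implementation):

           import heapq
           import pickle
           import tempfile
           from itertools import islice

           def sort_disk_based(stream, key, nsize=100000):
               stream = iter(stream)  # consume the input incrementally
               runs = []
               while True:
                   # Sort one batch of up to nsize rows in memory.
                   batch = sorted(islice(stream, nsize), key=key)
                   if not batch:
                       break
                   f = tempfile.TemporaryFile()
                   for row in batch:
                       pickle.dump(row, f)  # spill the sorted run to disk
                   f.seek(0)
                   runs.append(f)

               def read_run(f):
                   while True:
                       try:
                           yield pickle.load(f)
                       except EOFError:
                           return

               # Lazy k-way merge of all the sorted runs.
               return heapq.merge(*(read_run(f) for f in runs), key=key)

           # e.g. sorted_rows = sort_disk_based(rows, key=lambda r: r.uid)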
    9. Sample PyBabe script (3): mail a report.

           babe = Babe()
           ## Pull the result of a SQL query
           babe = babe.pull(database="mydb", name="First Query", query="SELECT ....")
           ## Pull the result of a second SQL query
           babe = babe.pull(database="mydb", name="Second Query", query="SELECT ....")
           ## Send the overall (concatenated) stream as an e-mail, with the
           ## content attached as Excel and some sample data in the body
           babe = babe.sendmail(subject="Your Report", recipients="fd@me.com",
                                data_in_body=True, data_in_body_row_limit=10,
                                attach_formats="xlsx")
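       Mailing a report like this needs nothing beyond the (Python 3)
       standard library. A rough sketch of what such a sendmail sink can do;
       the host, addresses, and file name are hypothetical, and this is not
       PyBabe's code:

           import smtplib
           from email.message import EmailMessage

           def send_report(subject, sender, recipient, body, attachment_path):
               msg = EmailMessage()
               msg['Subject'] = subject
               msg['From'] = sender
               msg['To'] = recipient
               msg.set_content(body)  # e.g. the first few sample rows, as text
               with open(attachment_path, 'rb') as f:
                   # Attach the Excel file with its proper MIME type.
                   msg.add_attachment(
                       f.read(), maintype='application',
                       subtype='vnd.openxmlformats-officedocument.spreadsheetml.sheet',
                       filename='report.xlsx')
               with smtplib.SMTP('localhost') as smtp:  # hypothetical SMTP relay
                   smtp.send_message(msg)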
   10. Some design choices. Use collections.namedtuple. Use generators: a
       nice and easy programming style.

           def filter(stream, f):
               for data in stream:
                   if isinstance(data, StreamMeta):
                       yield data
                   elif f(data):
                       yield data

       IO streaming whenever possible: an HTTP-downloaded file begins to be
       processed as it starts downloading. Use bulk loaders (SQL) or an
       external program when faster than the Python implementation (e.g.
       gzip; see the sketch below).
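       "Use an external program when faster" can look like the following:
       delegate decompression to the gzip binary and stream its stdout, so
       Python only pays for line iteration. A sketch, assuming a gzip
       executable on the PATH:

           import subprocess

           def gunzip_lines(path):
               # Let the external gzip binary decompress; stream its stdout.
               proc = subprocess.Popen(['gzip', '-dc', path],
                                       stdout=subprocess.PIPE)
               try:
                   for line in proc.stdout:
                       yield line
               finally:
                   proc.stdout.close()
                   proc.wait()

           # for line in gunzip_lines('mybigfile.csv.gz'):
           #     ...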
   11. PyBabe data model. A Babe works on a generator that contains a
       sequence of partitions. A partition is composed of a header
       (StreamHeader), rows, and a footer (StreamFooter).

           def sample_pull():
               header = StreamHeader(name="visits",
                                     partition={'day': '2012-09-14'},
                                     fields=["name", "day"])
               yield header
               yield header.makeRow('Florian', '2012-09-14')
               yield header.makeRow('John', '2012-09-14')
               yield StreamFooter()
               yield header.replace(partition={'day': '2012-09-15'})
               yield header.makeRow('Phil', '2012-09-15')
               yield StreamFooter()
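       A consumer of such a stream just dispatches on the element type.
       Something like the following (a sketch against the model above,
       assuming header.replace() returns a new StreamHeader):

           def count_rows_per_partition(stream):
               # Headers open a partition, footers close it,
               # everything in between is a row.
               counts = {}
               current = None
               for item in stream:
                   if isinstance(item, StreamHeader):
                       current = item.partition['day']
                       counts[current] = 0
                   elif isinstance(item, StreamFooter):
                       current = None
                   else:
                       counts[current] += 1
               return counts

           # count_rows_per_partition(sample_pull())
           # -> {'2012-09-14': 2, '2012-09-15': 1}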
   12. Some thoughts and associated projects. strptime and performance:
       parsing a date with time.strptime or datetime.strptime costs about 30
       microseconds, vs. about 3 microseconds for regexp matching (easy to
       check; see below). "Tarpys": a date-parsing library with date
       guessing. Charset management (pyencoding_cleaner): sniff the ISO or
       UTF-8 charset over a fragment, optionally try to fix bad encoding
       (î, í, ü). The Python 2.x csv module is OK, but: no Unicode
       support, and separator sniffing is buggy on edge cases.
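       The strptime-vs-regexp gap is easy to reproduce with timeit. A quick
       check, not the original benchmark; absolute numbers vary by machine:

           import re
           import timeit

           date_re = re.compile(r'(\d{4})-(\d{2})-(\d{2})')

           def parse_regexp(s):
               # Match the date with a precompiled regexp and split it out.
               y, m, d = date_re.match(s).groups()
               return int(y), int(m), int(d)

           n = 100000
           t_strptime = timeit.timeit(
               "datetime.strptime('2012-09-14', '%Y-%m-%d')",
               setup="from datetime import datetime", number=n)
           t_regexp = timeit.timeit(
               "parse_regexp('2012-09-14')",
               setup="from __main__ import parse_regexp", number=n)
           print('strptime: %.2f us/call' % (t_strptime / n * 1e6))
           print('regexp:   %.2f us/call' % (t_regexp / n * 1e6))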
   13. Future. Separate the GitHub project into core and plugins. Rewrite the
       CSV module in C? A configurable error system: should an error row fail
       the whole stream, fail the whole babe, send a warning, or be skipped?
       Pandas/NumPy integration. A homepage, docs, etc.
   14. Ask questions?

           babe = Babe().pull("questions.csv")
           babe = babe.filter(smart=True)
           babe = babe.mapTo(oracle)
           babe.push("answers.csv")

       Florian Douetteau, @fdouetteau, CEO of Dataiku. Dataiku's goal:
       leverage and provide the best of open-source technologies to help
       people build their own data science platform.
