
Building a data processing pipeline in Python


Most data is poorly formatted; we want to use Python, SQLAlchemy, Celery, and Requests to build a pipeline to fix this data.

Published in: Data & Analytics

  1. Building a data processing pipeline in Python. Joe Cabrera (https://github.com/greedo, @greedoshotlast, jcabrera@eminorlabs.com). PyGotham, 2015.
  2. Outline: 1 The problem; 2 Data ingestion; 3 Data parsing; 4 Data cleansing; 5 Scaling out.
  3. Poorly formatted data (example slides 3-5).
  6. Poorly formatted data is largely dispersed across the web.
  7. No standard data processing library: Pandas, Bubbles.
  8. Data processing.
  9. Requests and Futures: Requests makes it easy to send the required parameters; concurrent.futures allows for the asynchronous execution of download requests.
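The slide pairs Requests with concurrent.futures for asynchronous downloads. A minimal sketch of that combination, assuming placeholder URLs and an injectable fetch function (the names `fetch` and `download_all` are illustrative, not from the talk):

```python
from concurrent.futures import ThreadPoolExecutor

import requests


def fetch(url, params=None, timeout=10):
    """Download one URL; Requests encodes the query-string parameters for us."""
    response = requests.get(url, params=params, timeout=timeout)
    response.raise_for_status()
    return response.text


def download_all(urls, fetch=fetch, max_workers=8):
    """Run downloads concurrently in a thread pool; results keep input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch, urls))
```

Making `fetch` injectable keeps `download_all` testable without touching the network.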
  10. Parsers: Python tokenize, BeautifulSoup.
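The standard-library tokenize module mentioned here splits source text into typed tokens. A small sketch of how it might be used (the helper name is illustrative):

```python
import io
import tokenize


def name_tokens(source):
    """Return the NAME tokens (identifiers) found in a piece of source text."""
    tokens = tokenize.generate_tokens(io.StringIO(source).readline)
    return [tok.string for tok in tokens if tok.type == tokenize.NAME]
```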
  11. Why BeautifulSoup: more forgiving than the standard XML or HTML libraries; supports regex.
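BeautifulSoup's regex support means a compiled pattern can stand in for an exact tag or attribute value, which helps with inconsistent markup. A sketch with made-up HTML (the class names are invented for illustration):

```python
import re

from bs4 import BeautifulSoup

# Inconsistent markup: the same kind of element uses two class spellings.
html = """
<div class="price-usd">19.99</div>
<div class="price_eur">18.50</div>
<div class="note">n/a</div>
"""

soup = BeautifulSoup(html, "html.parser")

# find_all accepts a compiled regex, so both class spellings match.
prices = [div.get_text(strip=True)
          for div in soup.find_all("div", class_=re.compile(r"^price"))]
```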
  12. Celery job scheduling: each download job is a task; each parse job is a task; each cleanse job is a task.
  13. Re-insert cleansed data: clean up data after raw ingest; separate stores for raw and clean data.
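The separate raw and clean stores can be illustrated with SQLAlchemy Core and two throwaway in-memory SQLite databases; the table and column names below are invented for the example:

```python
from sqlalchemy import (Column, Integer, MetaData, String, Table,
                        create_engine, insert, select)

# Two engines stand in for the separate raw and clean stores.
raw_engine = create_engine("sqlite://")
clean_engine = create_engine("sqlite://")

metadata = MetaData()
raw_records = Table("raw_records", metadata,
                    Column("id", Integer, primary_key=True),
                    Column("title", String))
clean_records = Table("clean_records", metadata,
                      Column("id", Integer, primary_key=True),
                      Column("title", String))
metadata.create_all(raw_engine)
metadata.create_all(clean_engine)

# Ingest messy data as-is into the raw store.
with raw_engine.begin() as conn:
    conn.execute(insert(raw_records).values(title="  PyGotham  2015 "))

# Cleanse, then re-insert into the clean store.
with raw_engine.connect() as conn:
    rows = conn.execute(select(raw_records.c.title)).all()
cleaned = [" ".join(title.split()) for (title,) in rows]
with clean_engine.begin() as conn:
    for title in cleaned:
        conn.execute(insert(clean_records).values(title=title))
```

Keeping the raw store untouched means a buggy cleanse step can always be re-run from the original data.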
  14. Distributed task queue: distribute data processing jobs to many machines; distribute jobs on a given machine across many CPUs.
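Spreading jobs across the CPUs of one machine can be sketched with the standard library's process pool (Celery's worker concurrency plays the analogous role in production; the `cleanse` body here is a placeholder):

```python
from concurrent.futures import ProcessPoolExecutor


def cleanse(record):
    """Placeholder cleanse job: collapse runs of whitespace in one record."""
    return " ".join(record.split())


def cleanse_all(records, max_workers=None):
    """Fan records out across CPU cores; results keep input order."""
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(cleanse, records))


if __name__ == "__main__":
    print(cleanse_all(["  a  b ", "c   d"]))
```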
  15. SQLAlchemy basic sharding API: each database has a shard id; we query for data based on which shard contains the data.
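SQLAlchemy ships a horizontal-shard extension that wraps this idea at the ORM level; the routing itself can be illustrated with plain engines keyed by shard id. Two in-memory SQLite shards and a hypothetical `users` table stand in for real databases:

```python
from sqlalchemy import (Column, Integer, MetaData, String, Table,
                        create_engine, insert, select)

metadata = MetaData()
users = Table("users", metadata,
              Column("id", Integer, primary_key=True),
              Column("name", String))

# Each database has a shard id.
shards = {0: create_engine("sqlite://"), 1: create_engine("sqlite://")}
for engine in shards.values():
    metadata.create_all(engine)


def shard_for(user_id):
    """Route a key to the shard that contains its data."""
    return shards[user_id % len(shards)]


def save_user(user_id, name):
    with shard_for(user_id).begin() as conn:
        conn.execute(insert(users).values(id=user_id, name=name))


def load_user(user_id):
    with shard_for(user_id).connect() as conn:
        row = conn.execute(
            select(users.c.name).where(users.c.id == user_id)).first()
        return row[0] if row else None
```

A modulus on the key is the simplest routing rule; real deployments often use a lookup table or consistent hashing so shards can be added later.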
  16. Questions? Thanks! https://github.com/greedo @greedoshotlast jcabrera@eminorlabs.com
