Building a data processing pipeline in Python

Joe Cabrera, Software Engineer at Hearst
The problem
Data ingestion
Data parsing
Data cleansing
Scaling out
Building a data processing pipeline in Python
Joe Cabrera
https://github.com/greedo
@greedoshotlast
jcabrera@eminorlabs.com
PyGotham, 2015
Joe Cabrera Building a data processing pipeline in Python
Outline
1 The problem
2 Data ingestion
3 Data parsing
4 Data cleansing
5 Scaling out
Poorly formatted data
Largely dispersed across the web
No standard data processing library
Pandas
Bubbles
Data processing
Requests and Futures
Requests makes it easy to send the required parameters
Concurrent Futures allows for the asynchronous execution of download requests
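
A minimal sketch of how the two pieces might fit together, assuming each download job is a (URL, parameters) pair. The names `fetch_with_requests`, `download_all`, and `max_workers=8` are illustrative and not from the talk; the fetch function is passed in so it can be swapped out.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fetch_with_requests(url, params):
    """Download one page; Requests encodes `params` into the query string."""
    import requests  # third-party; imported here so the sketch loads without it

    response = requests.get(url, params=params, timeout=10)
    response.raise_for_status()
    return response.text


def download_all(jobs, fetch=fetch_with_requests, max_workers=8):
    """Run every (url, params) job on a thread pool.

    Returns a dict mapping each URL to its body, or to the exception raised
    while fetching it, so one failed download does not stop the others.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit all jobs up front; each submit() returns a Future immediately.
        future_to_url = {
            pool.submit(fetch, url, params): url for url, params in jobs
        }
        # Collect results in completion order, not submission order.
        for future in as_completed(future_to_url):
            url = future_to_url[future]
            try:
                results[url] = future.result()
            except Exception as exc:  # keep the error alongside the successes
                results[url] = exc
    return results
```

Threads suit this stage because the work is mostly waiting on the network; the pool size is a guess that would need tuning against the sites being downloaded.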
Parsers
Python tokenize
BeautifulSoup
Why BeautifulSoup
More forgiving than standard XML or HTML libraries
Supports regex
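
Both points can be illustrated with a small sketch. The sample markup, the `/item/<number>` link pattern, and the function name `extract_item_links` are made up for illustration; the parser choice (`html.parser`) is also an assumption.

```python
import re

from bs4 import BeautifulSoup

# Deliberately sloppy markup: unclosed <li> tags and a missing </a>.
SAMPLE_HTML = (
    '<ul><li><a href="/item/1">First</a>'
    '<li><a href="/item/2">Second</a>'
    '<li><a href="/about">About</ul>'
)

# Hypothetical pattern: only links that point at numbered items.
ITEM_LINK = re.compile(r"^/item/\d+$")


def extract_item_links(html):
    """Return (href, link text) pairs for links whose href matches ITEM_LINK."""
    soup = BeautifulSoup(html, "html.parser")
    # A compiled regex can be used as an attribute filter in find_all().
    return [
        (a["href"], a.get_text(strip=True))
        for a in soup.find_all("a", href=ITEM_LINK)
    ]


print(extract_item_links(SAMPLE_HTML))
# [('/item/1', 'First'), ('/item/2', 'Second')]
```

The broken markup still yields a usable tree, and the regex filter drops the `/about` link without any extra string handling.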
Celery job scheduling
Each download job is a task
Each parse job is a task
Each cleanse job is a task
Re-insert cleansed data
Cleanup data after raw ingest
Separate stores for raw and clean data
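
A rough sketch of the raw/clean split, using two SQLite databases as stand-ins for whatever stores the pipeline actually used. The `records` schema and the cleansing rules are invented for illustration.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open a store with the (hypothetical) records table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS records "
        "(id INTEGER PRIMARY KEY, name TEXT, price TEXT)"
    )
    return conn


def cleanse(name, price):
    """Example rules only: tidy whitespace and case, strip currency symbols."""
    clean_name = " ".join(name.split()).title()
    clean_price = price.strip().lstrip("$").replace(",", "")
    return clean_name, clean_price


def reinsert_cleansed(raw_store, clean_store):
    """Read every raw row, cleanse it, and write it to the clean store.

    The raw store is never modified, so cleansing can be re-run later
    with different rules. Returns the number of rows processed.
    """
    rows = raw_store.execute("SELECT id, name, price FROM records").fetchall()
    with clean_store:  # commit all inserts together, roll back on error
        for record_id, name, price in rows:
            clean_store.execute(
                "INSERT OR REPLACE INTO records (id, name, price) "
                "VALUES (?, ?, ?)",
                (record_id, *cleanse(name, price)),
            )
    return len(rows)
```

Because rows keep the same id in both stores, a cleansed record can always be traced back to the raw input it came from.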
Distributed task queue
Distribute data processing jobs to many machines
Distribute jobs on a given machine across many CPUs
SQLAlchemy basic sharding API
Each database has a shard id
We query for data based on which shard contains the data
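
The routing idea can be sketched without SQLAlchemy. The example below is a plain `sqlite3` stand-in, not SQLAlchemy's sharding API: the shard ids, the `records` table, and the CRC32-based key-to-shard rule are all assumptions made for illustration.

```python
import sqlite3
import zlib

# Hypothetical shard ids; each one names a separate database.
SHARD_IDS = ["shard_0", "shard_1", "shard_2"]


def open_shards():
    """Open one in-memory database per shard id."""
    shards = {}
    for shard_id in SHARD_IDS:
        conn = sqlite3.connect(":memory:")
        conn.execute("CREATE TABLE records (key TEXT PRIMARY KEY, value TEXT)")
        shards[shard_id] = conn
    return shards


def shard_for(key):
    """Pick the shard id that owns a key.

    crc32 is stable across runs, unlike the built-in hash() for strings.
    """
    return SHARD_IDS[zlib.crc32(key.encode("utf-8")) % len(SHARD_IDS)]


def put(shards, key, value):
    """Write a record to the single shard that owns its key."""
    conn = shards[shard_for(key)]
    with conn:
        conn.execute("INSERT OR REPLACE INTO records VALUES (?, ?)", (key, value))


def get(shards, key):
    """Query only the shard that contains the key; None if it is absent."""
    row = shards[shard_for(key)].execute(
        "SELECT value FROM records WHERE key = ?", (key,)
    ).fetchone()
    return row[0] if row else None
```

Reads and writes use the same `shard_for` rule, which is what lets a lookup go straight to one database instead of querying all of them.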
Questions
Thanks!
https://github.com/greedo
@greedoshotlast
jcabrera@eminorlabs.com