A Beginner's Guide to Building Data Pipelines with Luigi

Growth Intelligence
Where should I focus my outbound sales and marketing efforts to yield the highest possible ROI?
UK Limited Companies + Customer CRM Data → Predictive Model
With big data, comes big responsibility
Script Soup: hard to maintain, extend, and… look at.
if __name__ == '__main__':
    today = datetime.now().isoformat()[:10]  # <- custom date handling
    arg_parser = argparse.ArgumentParser(prog='COMPANIES HOUSE PARSER', description='Process arguments')
    arg_parser.add_argument('--files', nargs='?', dest='file_names', help='CSV files to read (supports globstar wildcards)', required=True)
    arg_parser.add_argument('--batch', nargs='?', dest='batch_size', type=int, help='Number of rows to save to DB at once')
    arg_parser.add_argument('--date', nargs='?', dest='table_date', help='Date that these data were released')
    arg_parser.add_argument('--log_level', dest='log_level', help='Log level to screen', default='INFO')
    args = arg_parser.parse_args()
The Old Way
Define a command line interface for every task?
log = GLoggingFactory().getLoggerFromPath('/var/log/companies-house-load-csv.log-{}'.format(today))
log.setLogLevel(args.log_level, 'screen')  # <- custom logging
table_date = parse_date(args.table_date, datetime.now())
log.info('Starting Companies House data loader...')
ch_loader = CompaniesHouseLoader(logger=log, col_mapping=col_mapping, table_date=table_date)
ch_loader.go(args.file_names)  # <- what to do if this fails?
log.info('Loader complete. Starting Companies House updater')
ch_updater = CompaniesHouseUpdater(logger=log, table_date=table_date, company_status_params=company_status_params)  # <- need to clean up if this fails
ch_updater.go()
The Old Way
Long, processor-intensive tasks stacked together
● Open-sourced & maintained by the Spotify data team
● Created by Erik Bernhardsson and Elias Freider; maintained by Arash Rouhani
● Abstracts batch processing jobs
● Makes it easy to write modular code and create dependencies between tasks
Luigi to the rescue!
● Task templating
● Dependency graphs
● Resumption of data flows after intermediate failure
● Command line integration
● Error emails
Luigi
Luigi 101: Counting the number of companies in the UK
Flow: companies.csv → [input()] Count companies → [output()] count.txt
class CompanyCount(luigi.Task):
    def output(self):
        return luigi.LocalTarget("count.csv")

    def run(self):
        count = count_unique_entries("companies.csv")
        with self.output().open("w") as out_file:
            out_file.write(count)
Company Count Job in Luigi code
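The count_unique_entries helper isn't shown on the slides; here is a minimal sketch of what it might look like, assuming the CSV has a CompanyNumber column (the helper body and the column name are assumptions, not part of the deck):

import csv

def count_unique_entries(source, column="CompanyNumber"):
    # Hypothetical helper: `source` may be a plain path (as on this slide) or a
    # Luigi Target (later slides pass self.input()), hence the branch below.
    in_file = source.open("r") if hasattr(source, "open") else open(source, newline="")
    with in_file:
        reader = csv.DictReader(in_file)
        unique = {row[column] for row in reader}
    return str(len(unique))  # the task writes this string straight to its output target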
Luigi 101: Keeping our count up to date
Flow: Companies Data Server → Companies Download → [output() / input()] companies.csv → Count companies → [output()] count.txt; Count companies declares requires() on Companies Download.
class CompanyCount(luigi.Task):
    def requires(self):
        return CompanyDownload()

    def output(self):
        return luigi.LocalTarget("count.csv")

    def run(self):
        count = count_unique_entries(self.input())
        with self.output().open("w") as out_file:
            out_file.write(count)
Company count with download dependency
requires() says CompanyDownload must complete before CompanyCount runs, and self.input() is the output of that required task.
Download task
class CompanyDownload(luigi.Task):
    def output(self):
        return luigi.LocalTarget("companies.csv")

    def run(self):
        data = get_company_download()
        with self.output().open('w') as out_file:
            out_file.write(data)
run() downloads the data and writes it to the output Target; that local output is picked up by the previous task.
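get_company_download is also left undefined on the slide; a sketch of one possible implementation using requests (the URL is a placeholder, not the real Companies House endpoint):

import requests

COMPANIES_URL = "http://example.com/companies.csv"  # placeholder URL, assumed

def get_company_download(url=COMPANIES_URL):
    # Hypothetical helper: fetch the CSV and return its contents as text so
    # run() can write it to the LocalTarget.
    response = requests.get(url)
    response.raise_for_status()  # fail loudly so Luigi marks the task as failed
    return response.text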
$ python company_flow.py CompanyCount --local-scheduler
DEBUG: Checking if CompanyCount() is complete
DEBUG: Checking if CompanyDownload() is complete
INFO: Scheduled CompanyCount() (PENDING)
INFO: Scheduled CompanyDownload() (PENDING)
INFO: Done scheduling tasks
INFO: Running Worker with 1 processes
DEBUG: Asking scheduler for work...
DEBUG: Pending tasks: 2
INFO: [pid 10076] Worker Worker(...) running CompanyDownload()
INFO: [pid 10076] Worker Worker(...) done CompanyDownload()
DEBUG: 1 running tasks, waiting for next task to finish
DEBUG: Asking scheduler for work...
DEBUG: Pending tasks: 1
INFO: [pid 10076] Worker Worker(...) running CompanyCount()
INFO: [pid 10076] Worker Worker(...) done CompanyCount()
DEBUG: 1 running tasks, waiting for next task to finish
DEBUG: Asking scheduler for work...
INFO: Done
Time-dependent tasks: change in companies
Flow: Companies Count Task (Date 1), (Date 2), (Date 3) → [input()] Companies Delta → [output()] company_count_delta.txt
class AnnualCompanyCountDelta(luigi.Task):
    year = luigi.Parameter()

    def requires(self):
        tasks = []
        for month in range(1, 13):
            tasks.append(CompanyCount(dt.datetime.strptime(
                "{}-{}-01".format(self.year, month), "%Y-%m-%d")))
        return tasks

    # not shown: output(), run()
Parameterising Luigi tasks
The year parameter is defined on the class, and requires() uses it to generate the dependencies.
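The slide leaves out output() and run() for AnnualCompanyCountDelta; one hedged way to fill them in, assuming the delta file lists each month alongside its change from the previous month (the file name and format are assumptions):

    def output(self):
        return luigi.LocalTarget("company_count_delta_{}.txt".format(self.year))

    def run(self):
        counts = []
        for target in self.input():  # one monthly count target per requirement
            with target.open("r") as in_file:
                counts.append(int(in_file.read().strip()))
        with self.output().open("w") as out_file:
            for month, (prev, curr) in enumerate(zip(counts, counts[1:]), start=2):
                out_file.write("{}-{:02d},{}\n".format(self.year, month, curr - prev))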
class CompanyCount(luigi.Task):
    date = luigi.DateParameter(default=datetime.date.today())

    def requires(self):
        return CompanyDownload(self.date)

    def output(self):
        return luigi.LocalTarget("count.csv")

    def run(self):
        count = count_unique_entries(self.input())
        with self.output().open("w") as out_file:
            out_file.write(count)
Adding the date dependency to Company Count
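One detail worth flagging: output() above still points at a fixed count.csv, so every date would be satisfied by the same file. A small sketch of how the target might be date-stamped instead (the file-name pattern is an assumption):

    def output(self):
        # one file per date, so each month's CompanyCount has its own target
        return luigi.LocalTarget("count_{}.csv".format(self.date.isoformat()))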
The central scheduler
$ luigid &  # start central scheduler in background
$ python company_flow.py AnnualCompanyCountDelta --year 2014
The scheduler's web interface runs at localhost:8082 by default.
Persisting our data
Flow: Companies Data Server → Companies Download(Date) → [output()] companies.csv → [input()] Count companies(Date) → [output()] count.txt; Companies ToMySQL(Date) also requires(Date) the download and writes its output() to the SQL database.
class CompaniesToMySQL(luigi.contrib.sqla.CopyToTable):
    date = luigi.DateParameter()
    columns = [(["name", String(100)], {}), ...]
    connection_string = "mysql://localhost/test"  # or something
    table = "companies"  # name of the table to store data

    def requires(self):
        return CompanyDownload(self.date)

    def rows(self):
        for row in self.get_unique_rows():  # uses self.input()
            yield row
Persisting our data
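get_unique_rows() isn't shown on the slide either; a sketch of what it might do, assuming the downloaded file is a plain CSV and rows are de-duplicated wholesale (the method name comes from the slide, the body is an assumption):

    def get_unique_rows(self):
        # self.input() is the target produced by CompanyDownload(self.date);
        # assumes `csv` is imported at the top of the flow module
        seen = set()
        with self.input().open("r") as in_file:
            for row in csv.reader(in_file):
                key = tuple(row)
                if key not in seen:
                    seen.add(key)
                    yield row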
My pipes broke
# ./client.cfg
[core]
error-email: dylan@growthintel.com, stuart@growthintel.com
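The speaker notes also mention that settings can be given per task in the same config file; a hedged sketch of what that might look like, using a task-name section to supply a parameter default (the section and value below are illustrative, not from the deck):

# ./client.cfg
[CompanyCount]
date: 2014-01-01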
Things we missed out
There are lots of other task types that we haven't mentioned:
● Hadoop
● Spark
● ssh
● Elasticsearch
● Hive
● Pig
● etc.
Check out the luigi.contrib package
class CompanyCount(luigi.contrib.hadoop.JobTask):
    chunks = luigi.Parameter()

    def requires(self):
        return [CompanyDownload(chunk) for chunk in self.chunks]

    def output(self):
        return luigi.contrib.hdfs.HdfsTarget("companies_count.tsv")

    def mapper(self, line):
        yield "count", 1

    def reducer(self, key, values):
        yield key, sum(values)
Counting the companies using Hadoop
The input is split into chunks, the output is an HDFS target, and mapper() and reducer() methods replace run().
● Doesn't provide a way to trigger flows
● Doesn't support distributed execution
Luigi Limitations
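Because Luigi doesn't trigger flows itself, the speaker notes suggest leaning on something like cron; a minimal sketch of a crontab entry (the path and schedule are hypothetical):

# run the yearly delta flow every night at 2am
0 2 * * * cd /opt/pipelines && python company_flow.py AnnualCompanyCountDelta --year 2014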
Onwards
● The docs: http://luigi.readthedocs.org/
● The mailing list: https://groups.google.com/forum/#!forum/luigi-user/
● The source: https://github.com/spotify/luigi
● The maintainers are really helpful, responsive, and open to any and all PRs!
Stuart Coleman
@stubacca81 / stuart@growthintel.com
Dylan Barth
@dylan_barth / dylan@growthintel.com
Thanks!
We’re hiring Python data scientists & engineers!
http://www.growthintel.com/careers/

Editor's Notes

  1. How many of you currently manage data pipelines in your day to day? And how many of you use some sort of framework to manage them?
  2. We work for a startup called Growth Intelligence. We use predictive modeling to help generate high quality leads for our customers, helping them answer: where should I focus my outbound sales and marketing efforts to yield the highest possible ROI? We track all the companies in the UK using a variety of data sources; we look at sales data for our customers (positive and negative examples from their CRM); and we use that to build a predictive model to predict which leads will convert for them.
  3. So we work with a fair amount of data, from a lot of different sources. We have data pipelines for: taking in new data to keep our data set current; doing analytics on existing data and doing model building; and processing or transforming our existing data, e.g. indexing a subset of it into Elasticsearch. And as you all know, the more data you deal with, the messier things can get, really fast, and the more of a burden it becomes to maintain.
  4. In the past, we used to deal with each data pipeline on an ad hoc, individual basis. For a while, this worked fine. As our data set grew, we realized we were quickly creating script soup: entire repositories with directories of scripts that had a lot of boilerplate and clearly needed some abstraction and cleanup.
  5. We had stuff like this: bespoke command line interfaces for each pipeline. Fine once or twice, but when you have a lot of pipelines, it becomes unwieldy.
  6. We also had processor-intensive, longish running tasks stacked up against one another in the script. If one failed, how could we re-run without having to repeat a lot of the work that had already been done? Other challenges included simply keeping things modular and well-structured, especially when different data pipelines may be created at different times by different devs. We also had varying levels of reporting across our tasks, so some had great logging and reporting, others didn't. And the list goes on... Now that you understand a bit of the challenge, Stuart is going to take over for a bit to help you understand how we approached this problem.
  7. Stuart: Fortunately, we aren't the only ones to have this problem. In fact, everyone here probably has at some point! A few years ago, the data team at Spotify open-sourced a Python library called Luigi, written by Erik Bernhardsson and Elias Freider and currently maintained by Arash Rouhani. It has a really active and responsive community that is open to pull requests; we integrated it within a week. Luigi basically provides a framework for structuring any batch processing job or data pipeline. It helps to abstract away some of the question marks that we just talked about and will stop you going crazy maintaining your pipes.
  8. Luigi has some awesome abstractions: It provides a Task class, which is a template for a single unit of work and outputs a Target. It's easy to define dependencies between Tasks. Luigi generates a dependency graph at runtime, so you don't have to worry about running scripts in a particular order; it will figure out the best order. Because each task is a unit of work, if something breaks, you can restart in the middle instead of the beginning. You get a graceful restart. This is really useful if you have a task which depends on a long running task which runs infrequently and a short running task which runs every day, say. There is no need to rerun your long running task; Luigi in that sense is idempotent. It comes with intuitive command line integration so that you can pass parameters into your tasks without writing boilerplate. It can notify you when tasks fall over so you don't have to waste time babysitting. And the list goes on! We'll be the first to say that we aren't experts in Luigi just yet. But we have slowly started converting some of our data pipelines to use it, and have also been adding our pipelines exclusively in Luigi. Now we'll go through a couple of simple examples that demonstrate the power of the framework but are simple enough to follow in a 25-minute talk. We'll also mention a few of its limitations we've discovered so far, and then open up for questions and discussion.
  9. Let's imagine that we have miraculously been given a csv file that contains data about the limited companies here in the UK. It just has simple metadata like the company's registration number, the company name, incorporation date, and sector. For the purposes of this example, let's imagine we simply want to count the number of unique companies currently operating in the UK and write that count to a file on disk. So the workflow might look something like the above -- read the file, count the unique company names, and write it to a text file.
  10. Here we've defined our task "CompanyCount", and it inherits from the vanilla Luigi task. It has a couple of methods: output and run. Run simply contains the business logic of the task and is executed when the task runs. You can put whatever processing logic you want in here. Output returns a Luigi Target -- valid output can be a lot of things: a location on disk, on a remote server, or a location in a database. In this case, we're simply writing to disk. When we run this from the command line, Luigi executes the code in the run method and finishes by writing the count to the output target. Great! But we obviously can't do much with this data. We want to make sure we have the latest count, so instead of using our local, outdated file, we're going to go and get the latest data from a UK government server.
  11. We can break this flow into two units of work: a download task and a processing task. The Task class has another method, requires, which makes it simple to define dependencies between Tasks. In this case, we simply say that the CompanyCount task requires the DownloadCompaniesData task. Let’s see how that changes our CompanyCount task:
  12. Here we've made two changes to our CompanyCount task: we've added a "requires" method, specifying that CompanyDownload is required before CompanyCount can complete successfully, and we've replaced the name of the file with the self.input() method, which returns the Target object that the Task requires; in this case, the LocalTarget returned by CompanyDownload. Now we need to define our CompanyDownload task.
  13. CompanyDownload is a simple task that goes up and gets the company data and downloads it to our directory. The output method returns a target object pointing to a file location on disk. The run method simply downloads the file and writes it to the output location. Note that this output target becomes the input for any task that requires this one (in our case, this is the CompanyCount task). Now, let’s try running this from the command line
  14. To run our company count task from beginning to end, we simply call python company_flow.py CompanyCount. That tells Luigi which task we want to run. Also worth noting that we told Luigi to use the local-scheduler. This tells Luigi not to use the central-scheduler, which is a daemon that comes bundled with Luigi and handles scheduling tasks. We'll talk about what that's good for in a bit, but for now, we just use the local-scheduler. When we run this from the command line, Luigi builds up a dependency graph and sees that before it can run CompanyCount, it needs to run CompanyDownload. It establishes this by calling the exists() method on required tasks, which simply checks to see if the Target returned by the output method already exists. If it does, that task is marked as DONE, otherwise it's included in the task queue. So first Luigi runs CompanyDownload, and then if it executes successfully, it runs CompanyCount and generates a new count for us. So that was the MVP Luigi task - but from these simple building blocks it is possible to build up complicated examples quite quickly.
  15. Dylan: Having a count of all the companies in the UK is pretty cool, but what if we wanted to get way more awesome and visualize how the overall number of companies in the UK has changed over the past year? We can do that pretty easily with our current code, mostly thanks to the fact that our tasks only do one job each. We just need to add a third task that calls out to our previous tasks, getting the company data for each month in a given year, and then outputs something useful -- a csv, or a histogram of the counts returned.
  16. A couple of interesting things are happening here. First, we are passing in a year as a parameter. Luigi intelligently accepts defined parameters as command line args, so no boilerplate needed! Second, the task uses the year to dynamically generate its requirements, e.g. for each month in this year, run CompanyCount for that month. This triggers a download for that date's data if we don't have it already. In order for this to work, we'll have to add a date parameter to our previous tasks. Note that we didn't include our output or run method here; this could be a histogram or a csv, whatever you want to do.
  17. And here you can see how we've added a date parameter to the CompanyCount task. We'll also need to add this to the CompanyDownload task (not shown). Now we can trigger quite a few subtasks with just one task, which can be hard to keep track of. Luckily, the Luigi central-scheduler comes with a basic visualizer. Remember the first time we ran our task sequence from the CLI with --local-scheduler? Let's try it again, but this time let's start a Luigi central scheduler.
  18. To start a scheduler, we just run luigid from the command line in the background. Next we run our task, leaving off the --local-scheduler option this time. This tells Luigi to use the central scheduler. Note: useful in production because it ensures that two instances of the same task never run simultaneously. Also: awesome in development because you can visualize task dependencies. While this is running, we can visit localhost:8082 and check out the dependency graph that Luigi has created. Our simple command spawned 24 subtasks, a download and company count task for each month of 2014. The colors represent task status, so all of our previous tasks have run and the delta task is still in progress. If a task fails, it's marked as red. Now that you've seen the central-scheduler, let's talk a bit about how Luigi nicely integrates with other tools like MySQL.
  19. In reality, we’re going to want to store the companies data in mySQL so we can use it for modeling and ad hoc querying. We define a new task, CompaniesToMysql it simply takes in a date param writes companies to table for that month In this way, we can leverage the download task that we created previously and run this task completely separately from our analytics tasks. Let’s look at how we can represent this in code
  20. You'll notice that this looks very different from the tasks you've seen before. This is because our task isn't inheriting directly from the vanilla Luigi task; we are using the contrib.sqla module. The SQLA CopyToTable task provides powerful abstractions on top of a base Luigi task when working with SQLAlchemy. It assumes the output of the task will be a SQLAlchemy table, which you can control by specifying the connection string, table, and columns (if the table doesn't already exist). Instead of a run method, we override the rows method, which returns a generator of row tuples. This simplifies things, because you can do all the processing you want in the rows method and let the task deal with batch inserts. When we run this, Luigi first checks to make sure that CompanyDownload has been run for the date we specified, and then it runs the copy-to-table task, inserting the records.
  21. One last thing to think about for now, what happens when something falls over? Luigi handles errors intelligently. Because tasks are individual units of work, if something breaks, you don’t have to re-run everything. You can simply restart the task, and the dependencies that finished will be skipped. We also want to get a notification with a stack trace. we can just add a configuration file specifying the emails to send the stack trace to and here’s an example of an email from Luigi in a case where you try to divide by zero (lolz) It’s worth noting that in addition to system wide luigi settings, you can also specify settings on a per task basis in the config.
  22. Stuart: There is a ton of extensibility and integration with other services that Luigi provides abstractions for, and we've listed them out here. Definitely check out the contrib docs for more info. Here's a quick example of how you might use the hadoop module.
  23. We point our files to HDFS. Rather than implementing a run() method, we can have mapper() and reducer() methods.
  24. Although Luigi is pretty awesome, there are some limitations worth pointing out: Luigi does not come with built-in triggering, and you still need to rely on something like crontab to trigger workflows periodically. Luigi does not support distribution of execution. When you have workers running thousands of jobs daily, this starts to matter, because the worker nodes get overloaded. That's probably OK for a lot of you; the idea was that the API was more important than the architecture. If you are interested in the architecture side, you may want to check out Airflow, a library the Airbnb team has just open-sourced.
  25. Definitely check out the docs, join the mailing list (it’s pretty active), and check out the repo. There’s active churn on the issues and the maintainers are super responsive. Docs could use more examples probably, that could be your first contribution!