A Beginner’s Guide to Building Data Pipelines with Luigi
Where should I focus my outbound sales and marketing efforts to yield the highest possible ROI?
UK Limited Companies
Customer CRM Data
Predictive Model
With big data, comes big responsibility
Hard to maintain, extend, and… look at.
Script Soup
omg moar codez
Code
More Codes
if __name__ == '__main__':
    today = datetime.now().isoformat()[:10]  # <- Custom Date handling
    arg_parser = argparse.ArgumentParser(prog='COMPANIES HOUSE PARSER', description='Process arguments')
    arg_parser.add_argument('--files', nargs='?', dest='file_names', help='CSV files to read (supports globstar wildcards)', required=True)
    arg_parser.add_argument('--batch', nargs='?', dest='batch_size', type=int, help='Number of rows to save to DB at once')
    arg_parser.add_argument('--date', nargs='?', dest='table_date', help='Date that these data were released')
    arg_parser.add_argument('--log_level', dest='log_level', help='Log level to screen', default='INFO')
    args = arg_parser.parse_args()
The Old Way
Define a command line interface for every task?
log = GLoggingFactory().getLoggerFromPath('/var/log/companies-house-load-csv.log-{}'.format(today))
log.setLogLevel(args.log_level, 'screen')  # <- Custom logging
table_date = parse_date(args.table_date, datetime.now())
log.info('Starting Companies House data loader...')
ch_loader = CompaniesHouseLoader(logger=log, col_mapping=col_mapping, table_date=table_date)
ch_loader.go(args.file_names)  # <- What to do if this fails?
log.info('Loader complete. Starting Companies House updater')
ch_updater = CompaniesHouseUpdater(logger=log, table_date=table_date,
                                   company_status_params=company_status_params)  # <- Need to clean up if this fails
ch_updater.go()
The Old Way
Long, processor-intensive tasks stacked together
● Open-sourced & maintained by the Spotify data team
● Written by Erik Bernhardsson and Elias Freider; maintained by Arash Rouhani
● Abstracts batch processing jobs
● Makes it easy to write modular code and create dependencies between tasks
Luigi to the rescue!
● Task templating
● Dependency graphs
● Resumption of data flows after intermediate failure
● Command line integration
● Error emails
Luigi
Luigi 101 - Counting the number of companies in the UK
companies.csv → input() → Count companies → output() → count.txt
class CompanyCount(luigi.Task):
    def output(self):
        return luigi.LocalTarget("count.csv")

    def run(self):
        count = count_unique_entries("companies.csv")
        with self.output().open("w") as out_file:
            out_file.write(count)
Company Count Job in Luigi code
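The slides never show count_unique_entries, so here is a minimal sketch of what such a helper might look like. The function name comes from the slide; the CSV column name, the use of csv.DictReader, and returning the count as a string (so the out_file.write(count) call above works unchanged) are all assumptions. It accepts either a plain file path or a Luigi Target, since later slides call it with self.input().

import csv

def count_unique_entries(source, column="CompanyName"):
    # Hypothetical helper, not from the original talk. `source` may be a path
    # string or a Luigi Target (Targets expose their own open() method).
    opener = source.open("r") if hasattr(source, "open") else open(source, "r")
    with opener as in_file:
        reader = csv.DictReader(in_file)
        unique_names = {row[column] for row in reader}
    # Return a string so the Task can write it straight to its output Target.
    return str(len(unique_names))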
Luigi 101 - Keeping our count up to date
Companies Data Server → Companies Download → output() → companies.csv → input() → Count companies → output() → count.txt
(Count companies requires() Companies Download)
class CompanyCount(luigi.Task):
    def requires(self):
        return CompanyDownload()

    def output(self):
        return luigi.LocalTarget("count.csv")

    def run(self):
        count = count_unique_entries(self.input())
        with self.output().open("w") as out_file:
            out_file.write(count)
Company count with download dependency
the output of the required task
this task must complete before CompanyCount runs
Download task
class CompanyDownload(luigi.Task):
    def output(self):
        return luigi.LocalTarget("companies.csv")

    def run(self):
        data = get_company_download()
        with self.output().open('w') as out_file:
            out_file.write(data)
local output to be picked up by the task that requires this one (CompanyCount)
download the data and write it to the output Target
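get_company_download() is also left out of the slides. A minimal sketch, assuming the data is fetched over HTTP with the requests library; the URL is a placeholder, not the real Companies House endpoint.

import requests

def get_company_download(url="https://example.com/companies.csv"):
    # Hypothetical helper, not from the original talk; the URL is a placeholder.
    response = requests.get(url)
    response.raise_for_status()  # fail loudly so Luigi marks the task as failed
    return response.text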
$ python company_flow.py CompanyCount --local-scheduler
DEBUG: Checking if CompanyCount() is complete
DEBUG: Checking if CompanyDownload() is complete
INFO: Scheduled CompanyCount() (PENDING)
INFO: Scheduled CompanyDownload() (PENDING)
INFO: Done scheduling tasks
INFO: Running Worker with 1 processes
DEBUG: Asking scheduler for work...
DEBUG: Pending tasks: 2
INFO: [pid 10076] Worker Worker(...) running CompanyDownload()
INFO: [pid 10076] Worker Worker(...) done CompanyDownload()
DEBUG: 1 running tasks, waiting for next task to finish
DEBUG: Asking scheduler for work...
DEBUG: Pending tasks: 1
INFO: [pid 10076] Worker Worker(...) running CompanyCount()
INFO: [pid 10076] Worker Worker(...) done CompanyCount()
DEBUG: 1 running tasks, waiting for next task to finish
DEBUG: Asking scheduler for work...
INFO: Done
Time-dependent tasks - change in companies
Companies Count Task (Date 1), Companies Count Task (Date 2), Companies Count Task (Date 3) → input() → Companies Delta → output() → company_count_delta.txt
class AnnualCompanyCountDelta(luigi.Task):
    year = luigi.Parameter()

    def requires(self):
        tasks = []
        for month in range(1, 13):
            tasks.append(CompanyCount(dt.datetime.strptime(
                "{}-{}-01".format(self.year, month), "%Y-%m-%d"))
            )
        return tasks

    # not shown: output(), run()
Parameterising Luigi tasks
define parameter
generate dependencies
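The slide skips output() and run() for AnnualCompanyCountDelta. One possible implementation, purely as a sketch: read each monthly count via self.input() and write one line per month to company_count_delta.txt. The output format is an assumption, and note that for this to work in practice CompanyCount.output() would also need to include the date in its filename, otherwise every month writes to the same count.csv.

    # Sketch of the omitted methods (these would live inside AnnualCompanyCountDelta;
    # they are illustrative, not the original code).
    def output(self):
        return luigi.LocalTarget("company_count_delta.txt")

    def run(self):
        with self.output().open("w") as out_file:
            # self.input() is the list of Targets produced by the CompanyCount tasks.
            for month, count_target in zip(range(1, 13), self.input()):
                with count_target.open("r") as in_file:
                    count = in_file.read().strip()
                out_file.write("{}-{:02d},{}\n".format(self.year, month, count))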
class CompanyCount(luigi.Task):
    date = luigi.DateParameter(default=datetime.date.today())

    def requires(self):
        return CompanyDownload(self.date)

    def output(self):
        return luigi.LocalTarget("count.csv")

    def run(self):
        count = count_unique_entries(self.input())
        with self.output().open("w") as out_file:
            out_file.write(count)
Adding the date dependency to Company Count
added date dependency to company count
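The speaker notes mention that CompanyDownload needs the same date parameter, but the slide doesn't show it. A minimal sketch of what that could look like; putting the date into the output filename (so each month gets its own Target) and passing the date to the download helper are assumptions.

class CompanyDownload(luigi.Task):
    date = luigi.DateParameter()

    def output(self):
        # One file per month, so downloads for different dates don't collide.
        return luigi.LocalTarget("companies-{}.csv".format(self.date.isoformat()))

    def run(self):
        data = get_company_download()  # in practice this would fetch data for self.date
        with self.output().open("w") as out_file:
            out_file.write(data)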
The central scheduler
$ luigid & # start central scheduler in background
$ python company_flow.py AnnualCompanyCountDelta --year 2014
by default, the web UI is at localhost:8082
Persisting our data
Companies Data Server → Companies Download(Date) → output() → companies.csv
companies.csv → Count companies(Date) [requires(Date)] → output() → count.txt
companies.csv → Companies ToMySQL(Date) [requires(Date)] → output() → SQL Database
class CompaniesToMySQL(luigi.contrib.sqla.CopyToTable):
    date = luigi.DateParameter()
    columns = [(["name", String(100)], {}), ...]
    connection_string = "mysql://localhost/test"  # or something
    table = "companies"  # name of the table to store data

    def requires(self):
        return CompanyDownload(self.date)

    def rows(self):
        for row in self.get_unique_rows():  # uses self.input()
            yield row
Persisting our data
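get_unique_rows() is not shown either. A rough sketch of what it might do, assuming the downloaded CSV has a company name column and that the tuples it yields match the single name column declared above; the column name and the dedup key are assumptions.

    # Hypothetical helper for CompaniesToMySQL (a sketch, not the original code).
    def get_unique_rows(self):
        import csv  # assumed available; shown inline to keep the sketch self-contained
        seen = set()
        with self.input().open("r") as in_file:
            for row in csv.DictReader(in_file):
                name = row["CompanyName"]  # column name is an assumption
                if name not in seen:
                    seen.add(name)
                    yield (name,)  # one value per declared column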
My pipes broke
# ./client.cfg
[core]
error-email: dylan@growthintel.com, stuart@growthintel.com
Things we missed out
There are lots of task types available that we haven’t mentioned:
● Hadoop
● Spark
● ssh
● Elasticsearch
● Hive
● Pig
● etc.
Check out the luigi.contrib package
class CompanyCount(luigi.contrib.hadoop.JobTask):
    chunks = luigi.Parameter()

    def requires(self):
        return [CompanyDownload(chunk) for chunk in self.chunks]

    def output(self):
        return luigi.contrib.hdfs.HdfsTarget("companies_count.tsv")

    def mapper(self, line):
        yield "count", 1

    def reducer(self, key, values):
        yield key, sum(values)
Counting the companies using Hadoop
split input into chunks
HDFS target
map and reduce methods instead of run()
● Doesn’t provide a way to trigger flows
● Doesn’t support distributed execution
Luigi Limitations
Onwards
● The docs: http://luigi.readthedocs.org/
● The mailing list: https://groups.google.com/forum/#!forum/luigi-user/
● The source: https://github.com/spotify/luigi
● The maintainers are really helpful, responsive, and open to any and all PRs!
Stuart Coleman
@stubacca81 / stuart@growthintel.com
Dylan Barth
@dylan_barth / dylan@growthintel.com
Thanks!
We’re hiring Python data scientists & engineers!
http://www.growthintel.com/careers/
Editor's Notes
  1. How many of you currently manage data pipelines in your day-to-day work? And how many of you use some sort of framework to manage them?
  2. We work for a startup called Growth Intelligence. We use predictive modeling to help generate high-quality leads for our customers, helping them answer: where should I focus my outbound sales and marketing efforts to yield the highest possible ROI? We track all the companies in the UK using a variety of data sources; we look at sales data for our customers (positive and negative examples from their CRM) and use that to build a predictive model of which leads will convert for them.
  3. So we work with a fair amount of data, from a lot of different sources. We have data pipelines for: taking in new data to keep our data set current; doing analytics on existing data and building models; and processing or transforming our existing data, e.g. indexing a subset of it into Elasticsearch. And as you all know, the more data you deal with, the messier things get, really fast, and the more of a burden they become to maintain.
  4. In the past, we used to deal with each data pipeline on an ad hoc, individual basis. For a while, this worked fine. As our data set grew, we realized we were quickly creating script soup: entire repositories with directories of scripts that had a lot of boilerplate and clearly needed some abstraction and cleanup.
  5. We had stuff like this: bespoke command-line interfaces for each pipeline. Fine once or twice, but when you have a lot of pipelines, it becomes unwieldy.
  6. We also had processor-intensive, longish-running tasks stacked up against one another in the script. If one failed, how could we re-run without having to repeat a lot of the work that had already been done? Some other challenges included simply keeping things modular and well-structured, especially when different data pipelines may be created at different times by different devs. We also had varying levels of reporting across our tasks, so some had great logging and reporting, others didn’t. And the list goes on... And so now that you understand a bit of the challenge, Stuart is going to take over for a bit to help you understand how we approached this problem.
  7. Stuart: Fortunately, we aren’t the only ones to have this problem. In fact, everyone here probably has at some point! A few years ago, the data team at Spotify open-sourced a Python library called Luigi, written by Erik Bernhardsson and Elias Freider and currently maintained by Arash Rouhani. It has a really active and responsive community that is open to pull requests; we integrated it within a week. Luigi basically provides a framework for structuring any batch processing job or data pipeline. It helps to abstract away some of the question marks that we just talked about and will stop you going crazy maintaining your pipes.
  8. Luigi has some awesome abstractions. It provides a Task class, which is a template for a single unit of work and outputs a Target. It’s easy to define dependencies between Tasks. Luigi generates a dependency graph at runtime, so you don’t have to worry about running scripts in a particular order; it will figure out the best order. Because each task is a unit of work, if something breaks, you can restart in the middle instead of the beginning: you get a graceful restart. This is really useful if you have a task which depends on both a long-running task that runs infrequently and a short-running task that runs every day, say; there is no need to rerun your long-running task. Luigi in that sense is idempotent. It comes with intuitive command line integration so that you can pass parameters into your tasks without writing boilerplate. It can notify you when tasks fall over so you don’t have to waste time babysitting. And the list goes on! We’ll be the first to say that we aren’t experts in Luigi just yet, but we have slowly started converting some of our data pipelines to use it, and have also been writing our new pipelines exclusively in Luigi. Now we’ll go through a couple of simple examples that demonstrate the power of the framework but are simple enough to follow in a 25-minute talk. We’ll also mention a few of its limitations we’ve discovered so far, and then open up for questions and discussion.
  9. Let’s imagine that we have miraculously been given a CSV file that contains data about the limited companies here in the UK. It just has simple metadata like the company’s registration number, the company name, incorporation date, and sector. For the purposes of this example, let’s imagine we simply want to count the number of unique companies currently operating in the UK and write that count to a file on disk. So the workflow might look something like the above: read the file, count the unique company names, and write the count to a text file.
  10. Here we’ve defined our task, CompanyCount, and it inherits from the vanilla Luigi Task. It has a couple of methods, output and run. Run simply contains the business logic of the task and is executed when the task runs; you can put whatever processing logic you want in here. Output returns a Luigi Target; valid output can be a lot of things: a location on disk, on a remote server, or a location in a database. In this case, we’re simply writing to disk. When we run this from the command line, Luigi executes the code in the run method and finishes by writing the count to the output target. Great! But we obviously can’t do much with this data. We want to make sure we have the latest count, so instead of using our local, outdated file, we’re going to go and get the latest data from a UK government server.
  11. We can break this flow into two units of work: a download task and a processing task. The Task class has another method, requires, which makes it simple to define dependencies between Tasks. In this case, we simply say that the CompanyCount task requires the CompanyDownload task. Let’s see how that changes our CompanyCount task:
  12. Here we’ve made two changes to our CompanyCount task: we added a requires method, specifying that CompanyDownload must complete before CompanyCount can run successfully, and we’ve replaced the name of the file with the self.input() method, which returns the Target object produced by the required Task; in this case, the LocalTarget returned by CompanyDownload. Now we need to define our CompanyDownload task.
  13. CompanyDownload is a simple task that goes and gets the company data and downloads it to our directory. The output method returns a target object pointing to a file location on disk. The run method simply downloads the file and writes it to the output location. Note that this output target becomes the input for any task that requires this one (in our case, the CompanyCount task). Now, let’s try running this from the command line.
  14. To run our company count task from beginning to end, we simply call python company_flow.py CompanyCount. That tells Luigi which task we want to run. It’s also worth noting that we told Luigi to use the local scheduler. This tells Luigi not to use the central scheduler, which is a daemon that comes bundled with Luigi and handles scheduling tasks. We’ll talk about what that’s good for in a bit, but for now, we just use the local scheduler. When we run this from the command line, Luigi builds up a dependency graph and sees that before it can run CompanyCount, it needs to run CompanyDownload. It establishes this by checking whether the Target returned by each required task’s output method already exists. If it does, that task is marked as DONE; otherwise it’s included in the task queue. So first Luigi runs CompanyDownload, and then, if it executes successfully, it runs CompanyCount and generates a new count for us. So that was the MVP Luigi task, but from these simple building blocks it is possible to build up complicated examples quite quickly.
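Roughly, the check described above boils down to something like the following (a simplified sketch of the idea, not Luigi's actual source): a task counts as complete when every Target returned by its output() already exists.

    # Simplified illustration of the default completeness check (a sketch only).
    def complete(self):
        outputs = luigi.task.flatten(self.output())
        return all(output.exists() for output in outputs)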
  15. Dylan: Having a count of all the companies in the UK is pretty cool, but what if we wanted to get way more awesome and visualize how the overall number of companies in the UK has changed over the past year? We can do that pretty easily with our current code, mostly thanks to the fact that our tasks only do one job each. We just need to add a third task that calls out to our previous tasks, getting the company data for each month in a given year, and then outputs something useful: a CSV, or a histogram of the counts returned.
  16. A couple of interesting things are happening here. First, we are passing in a year as a parameter. Luigi intelligently accepts defined parameters as command-line args, so no boilerplate needed! Second, the task uses the year to dynamically generate its requirements, e.g. for each month in this year, run CompanyCount for that month. This triggers a download for that date’s data if we don’t have it already. In order for this to work, we’ll have to add a date parameter to our previous tasks. Note that we didn’t include our output or run method here; this could produce a histogram or a CSV, whatever you want to do.
  17. And here you can see how we’ve added a date parameter to the CompanyCount task. We’ll also need to add this to the CompanyDownload task (not shown). Now we can trigger quite a few subtasks with just one task, which can be hard to keep track of. Luckily, the Luigi central scheduler comes with a basic visualizer. Remember the first time we ran our task sequence from the CLI with --local-scheduler? Let’s try it again, but this time let’s start a Luigi central scheduler.
  18. To start a scheduler, we just run luigid from the command line in the background. Next we run our task, leaving off the --local-scheduler option this time; this tells Luigi to use the central scheduler. Note: this is useful in production because it ensures that two instances of the same task never run simultaneously. It’s also awesome in development because you can visualize task dependencies. While this is running, we can visit localhost:8082 and check out the dependency graph that Luigi has created. Our simple command spawned 24 subtasks: a download and a company count task for each month of 2014. The colors represent task status, so here all of our previous tasks have run and the delta task is still in progress. If a task fails, it’s marked as red. Now that you’ve seen the central scheduler, let’s talk a bit about how Luigi nicely integrates with other tools like MySQL.
  19. In reality, we’re going to want to store the companies data in MySQL so we can use it for modeling and ad hoc querying. We define a new task, CompaniesToMySQL: it simply takes in a date param and writes the companies to a table for that month. In this way, we can leverage the download task that we created previously and run this task completely separately from our analytics tasks. Let’s look at how we can represent this in code.
  20. You’ll notice that this looks very different from the tasks you’ve seen before. This is because our task isn’t inheriting directly from the vanilla Luigi Task; we are using the contrib.sqla module. The SQLA CopyToTable task provides powerful abstractions on top of a base Luigi task when working with SQLAlchemy. It assumes the output of the task will be a SQLAlchemy table, which you can control by specifying the connection string, table, and columns (if the table doesn’t already exist). Instead of a run method, we override the rows method, which returns a generator of row tuples. This simplifies things, because you can do all the processing you want in the rows method and let the task deal with batch inserts. When we run this, Luigi first checks to make sure that CompanyDownload has been run for the date we specified, and then it runs the copy-to-table task, inserting the records.
  21. One last thing to think about for now: what happens when something falls over? Luigi handles errors intelligently. Because tasks are individual units of work, if something breaks, you don’t have to re-run everything. You can simply restart the task, and the dependencies that finished will be skipped. We also want to get a notification with a stack trace: we can just add a configuration file specifying the emails to send the stack trace to, and here’s an example of an email from Luigi in a case where you try to divide by zero (lolz). It’s worth noting that in addition to system-wide Luigi settings, you can also specify settings on a per-task basis in the config.
  22. Stuart: There is a ton of extensibility and integration with other services that Luigi provides abstractions for, and we’ve listed them out here. Definitely check out the contrib docs for more info. Here’s a quick example of how you might use the hadoop module.
  23. We point our files at HDFS. Rather than implementing a run() method, we can have mapper() and reducer() methods.
  24. Although Luigi is pretty awesome, there are some limitations worth pointing out. Luigi does not come with built-in triggering, and you still need to rely on something like crontab to trigger workflows periodically. Luigi does not support distributed execution; when you have workers running thousands of jobs daily, this starts to matter, because the worker nodes get overloaded. That’s probably OK for a lot of you; the idea was that the API was more important than the architecture. If you are interested in the architecture, you may want to check out Airflow, a library the Airbnb team has just open-sourced.
  25. Definitely check out the docs, join the mailing list (it’s pretty active), and check out the repo. There’s active churn on the issues and the maintainers are super responsive. The docs could probably use more examples; that could be your first contribution!