Build Data Pipelines with Luigi

A Beginner’s Guide to Building Data Pipelines with Luigi

Where should I focus my outbound sales and marketing efforts to yield the highest possible ROI?
[Diagram labels: UK Limited Companies; Customer CRM Data; Predictive Model]

With big data comes big responsibility

Hard to maintain, extend, and… look at.
Script Soup
[Diagram labels: omg moar codez; Code; More Codes]

import argparse
from datetime import datetime

if __name__ == '__main__':
    today = datetime.now().isoformat()[:10]  # <- custom date handling
    arg_parser = argparse.ArgumentParser(prog='COMPANIES HOUSE PARSER',
                                         description='Process arguments')
    arg_parser.add_argument('--files', nargs='?', dest='file_names', required=True,
                            help='CSV files to read (supports globstar wildcards)')
    arg_parser.add_argument('--batch', nargs='?', dest='batch_size', type=int,
                            help='Number of rows to save to DB at once')
    arg_parser.add_argument('--date', nargs='?', dest='table_date',
                            help='Date that these data were released')
    arg_parser.add_argument('--log_level', dest='log_level', default='INFO',
                            help='Log level to screen')
    args = arg_parser.parse_args()

The Old Way
Define a command line interface for every task?

    log = GLoggingFactory().getLoggerFromPath(
        '/var/log/companies-house-load-csv.log-{}'.format(today))
    log.setLogLevel(args.log_level, 'screen')  # <- custom logging
    table_date = parse_date(args.table_date, datetime.now())
    log.info('Starting Companies House data loader...')
    ch_loader = CompaniesHouseLoader(logger=log, col_mapping=col_mapping,
                                     table_date=table_date)
    ch_loader.go(args.file_names)  # <- what to do if this fails?
    log.info('Loader complete. Starting Companies House updater')
    ch_updater = CompaniesHouseUpdater(logger=log, table_date=table_date,
                                       company_status_params=company_status_params)
    ch_updater.go()  # <- need to clean up if this fails

The Old Way
Long, processor-intensive tasks stacked together

● Open-sourced & maintained by the Spotify data team
● Created by Erik Bernhardsson and Elias Freider; maintained by Arash Rouhani
● Abstracts batch processing jobs
● Makes it easy to write modular code and create dependencies between tasks
Luigi to the rescue!

● Task templating
● Dependency graphs
● Resumption of data flows after intermediate failure
● Command line integration
● Error emails
Luigi
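
The "dependency graphs" and "resumption after intermediate failure" bullets can be illustrated with a toy model. This is a minimal sketch, not Luigi's implementation: `ToyTask` and `build` are hypothetical names, and the only idea borrowed from Luigi is that a task counts as complete when its output already exists.

```python
import os
import tempfile


class ToyTask:
    """Toy stand-in for a pipeline task (hypothetical, not a Luigi class)."""

    def __init__(self, path, deps=()):
        self.path = path        # file this task produces
        self.deps = list(deps)  # tasks that must finish first
        self.runs = 0           # how many times run() was executed

    def complete(self):
        # A task counts as done when its output file already exists.
        return os.path.exists(self.path)

    def run(self):
        self.runs += 1
        with open(self.path, "w") as out_file:
            out_file.write("done\n")


def build(task):
    """Run dependencies first, skipping anything already complete."""
    if task.complete():
        return
    for dep in task.deps:
        build(dep)
    task.run()


workdir = tempfile.mkdtemp()
download = ToyTask(os.path.join(workdir, "companies.csv"))
count = ToyTask(os.path.join(workdir, "count.txt"), deps=[download])

build(count)            # first pass: runs download, then count
os.remove(count.path)   # simulate losing only the final output
build(count)            # second pass: download is skipped, count reruns
```

Because completeness is read from the outputs rather than from the runner's memory, restarting after a failure only repeats the steps whose outputs are missing.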

Luigi 101: Counting the number of companies in the UK
[Diagram: companies.csv → input() → Count companies → output() → count.txt]

import luigi

class CompanyCount(luigi.Task):
    def output(self):
        return luigi.LocalTarget("count.txt")

    def run(self):
        count = count_unique_entries("companies.csv")
        with self.output().open("w") as out_file:
            out_file.write(str(count))  # write() needs a string

Company Count Job in Luigi code
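
The slides do not show `count_unique_entries`. One possible sketch follows, assuming the CSV has a header row and that the first column identifies a company; the column choice and the sample data are assumptions made for illustration.

```python
import csv
import os
import tempfile


def count_unique_entries(path, column=0):
    """Count distinct values in one column of a CSV file."""
    seen = set()
    with open(path, newline="") as in_file:
        reader = csv.reader(in_file)
        next(reader, None)  # assumption: the first row is a header
        for row in reader:
            if row:  # ignore blank lines
                seen.add(row[column])
    return len(seen)


# Small demonstration on a throwaway file with one duplicated company.
sample_path = os.path.join(tempfile.mkdtemp(), "companies.csv")
with open(sample_path, "w", newline="") as sample_file:
    sample_file.write(
        "company_number,name\n001,Acme Ltd\n002,Widget Ltd\n001,Acme Ltd\n"
    )
sample_count = count_unique_entries(sample_path)
```

The helper returns an integer, so the task has to convert it to a string before writing it to the output target.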

Luigi 101: Keeping our count up to date
[Diagram: Companies Data Server → Companies Download → output() → companies.csv → input() → Count companies → output() → count.txt; Count companies requires() Companies Download]

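class CompanyCount(luigi.Task):
    def requires(self):
        # This task must complete before CompanyCount runs.
        return CompanyDownload()

    def output(self):
        ...  # body not shown on the slide

    def run(self):
        # self.input() is the output of the required task (a Target).
        count = count_unique_entries(self.input())
        ...  # rest not shown on the slide

Company count with download dependency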

Download task
class CompanyDownload(luigi.Task):
    def output(self):
        # Local output to be picked up by the previous task (CompanyCount).
        return luigi.LocalTarget("companies.csv")

    def run(self):
        # Download the data and write it to the output Target.
        data = get_company_download()
        with self.output().open('w') as out_file:
            out_file.write(data)
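
Since a task is treated as complete once its output exists, a half-written `companies.csv` left behind by a crashed download would be a problem. To my knowledge Luigi's local file targets guard against this by writing to a temporary file and renaming it on close. The general technique can be sketched as below; `write_atomically` is a hypothetical helper, not a Luigi function.

```python
import os
import tempfile


def write_atomically(path, data):
    """Write data to a temp file, then rename it over the final path."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temp file in the same directory so the rename stays
    # on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as tmp_file:
            tmp_file.write(data)
        os.replace(tmp_path, path)  # the final name appears only now
    except BaseException:
        # On any failure, remove the partial file and leave no output.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


target_dir = tempfile.mkdtemp()
target_path = os.path.join(target_dir, "companies.csv")
write_atomically(target_path, "name\nAcme Ltd\n")
```

If the write fails partway through, the final path never appears, so a completeness check based on the output's existence stays truthful.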

$ python company_flow.py CompanyCount --local-scheduler
DEBUG: Checking if CompanyCount() is complete
DEBUG: Checking if CompanyDownload() is complete
INFO: Scheduled CompanyCount() (PENDING)
INFO: Scheduled CompanyDownload() (PENDING)
INFO: Done scheduling tasks
INFO: Running Worker with 1 processes
DEBUG: Asking scheduler for work...
DEBUG: Pending tasks: 2
INFO: [pid 10076] Worker Worker(...) running CompanyDownload()
INFO: [pid 10076] Worker Worker(...) done CompanyDownload()
DEBUG: 1 running tasks, waiting for next task to finish
DEBUG: Pending tasks: 1
INFO: [pid 10076] Worker Worker(...) running CompanyCount()
INFO: [pid 10076] Worker Worker(...) done CompanyCount()
DEBUG: 1 running tasks, waiting for next task to finish
INFO: Done

Time-dependent tasks: change in companies
[Diagram: Companies Count Task(Date 1) → input() (×3) → Companies Delta → output() → company_count_delta.txt]

import datetime as dt

class AnnualCompanyCountDelta(luigi.Task):
    year = luigi.Parameter()  # define parameter

    def requires(self):
        # Generate dependencies: one CompanyCount per month of the year.
        tasks = []
        for month in range(1, 13):
            tasks.append(CompanyCount(dt.datetime.strptime(
                "{}-{}-01".format(self.year, month), "%Y-%m-%d").date()))
        return tasks

    # not shown: output(), run()

Parameterising Luigi tasks
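
The date arithmetic in `requires()` can be checked on its own. A minimal sketch, assuming the year arrives as a string (as it would from the command line); `first_of_each_month` is a hypothetical helper name, and it builds the dates directly instead of parsing formatted strings.

```python
import datetime as dt


def first_of_each_month(year):
    """Return the first day of every month in the given year."""
    return [dt.date(int(year), month, 1) for month in range(1, 13)]


dates = first_of_each_month("2014")
```

Each of the twelve dates would become the parameter of one dependency, so the delta task only runs once all twelve monthly counts exist.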

class CompanyCount(luigi.Task):
    # Added date dependency to company count.
    date = luigi.DateParameter(default=datetime.date.today())

    def requires(self):
        return CompanyDownload(self.date)

    def output(self):
        ...  # body not shown on the slide

    def run(self):
        count = count_unique_entries(self.input())
        ...  # rest not shown on the slide

Adding the date dependency to Company Count

The central scheduler
$ luigid & # start central scheduler in background
$ python company_flow.py AnnualCompanyCountDelta --year 2014
By default, tasks connect to the central scheduler at localhost:8082.

Persisting our data
[Diagram: Companies Data Server → Companies Download(Date) → output() → companies.csv → Count companies(Date) → output() → count.txt; Companies ToMySQL(Date) requires(Date) the download, and its output() is the SQL Database]

from luigi.contrib import sqla
from sqlalchemy import String

class CompaniesToMySQL(sqla.CopyToTable):
    date = luigi.DateParameter()
    columns = [(["name", String(100)], {}), ...]
    connection_string = "mysql://localhost/test"  # or something
    table = "companies"  # name of the table to store data

    def requires(self):
        return CompanyDownload(self.date)

    def rows(self):
        for row in self.get_unique_rows():  # uses self.input()
            yield row

Persisting our data
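
A database load has no output file to check, so completion has to be recorded some other way. The sketch below shows one common approach, a marker row written in the same transaction as the data. It uses SQLite in place of MySQL to stay within the standard library, and the `load_markers` table, the `update_id` format, and `copy_rows_once` are all hypothetical names, not Luigi's.

```python
import sqlite3


def copy_rows_once(conn, update_id, rows):
    """Insert rows and a marker together; skip if the marker exists."""
    conn.execute("CREATE TABLE IF NOT EXISTS companies (name TEXT)")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS load_markers (update_id TEXT PRIMARY KEY)"
    )
    done = conn.execute(
        "SELECT 1 FROM load_markers WHERE update_id = ?", (update_id,)
    ).fetchone()
    if done:
        return False  # this load already happened
    with conn:  # commit both inserts together, or roll both back
        conn.executemany("INSERT INTO companies (name) VALUES (?)", rows)
        conn.execute(
            "INSERT INTO load_markers (update_id) VALUES (?)", (update_id,)
        )
    return True


conn = sqlite3.connect(":memory:")
sample_rows = [("Acme Ltd",), ("Widget Ltd",)]
first = copy_rows_once(conn, "companies-2014-01-01", sample_rows)
second = copy_rows_once(conn, "companies-2014-01-01", sample_rows)
row_count = conn.execute("SELECT COUNT(*) FROM companies").fetchone()[0]
```

Running the same load twice leaves the table unchanged, which is what makes it safe to rerun a whole flow after a failure.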

My pipes broke
# ./client.cfg
[core]
error-email: dylan@growthintel.com, stuart@growthintel.com

Things we missed out
There are many other task types that we haven’t mentioned:
● Hadoop
● Spark
● ssh
● Elasticsearch
● Hive
● Pig
● etc.
Check out the luigi.contrib package

class CompanyCount(luigi.contrib.hadoop.JobTask):
    chunks = luigi.Parameter()

    def requires(self):
        # Split input in chunks.
        return [CompanyDownload(chunk) for chunk in self.chunks]

    def output(self):
        # HDFS target.
        return luigi.contrib.hdfs.HdfsTarget("companies_count.tsv")

    # map and reduce methods instead of run()
    def mapper(self, line):
        yield "count", 1

    def reducer(self, key, values):
        yield key, sum(values)

Counting the companies using Hadoop
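
To see what the mapper and reducer compute without a Hadoop cluster, the same two functions can be driven by a small local loop. This is a simplified simulation of the map, group, and reduce steps; `run_local` is a hypothetical helper and the input lines are made up.

```python
from collections import defaultdict


def mapper(line):
    # Emit the same key for every input line.
    yield "count", 1


def reducer(key, values):
    # Sum all the values emitted for one key.
    yield key, sum(values)


def run_local(lines):
    """Map every line, group values by key, then reduce each group."""
    grouped = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            grouped[key].append(value)
    results = []
    for key in sorted(grouped):
        results.extend(reducer(key, grouped[key]))
    return results


result = run_local(["Acme Ltd", "Widget Ltd", "Gizmo Ltd"])
```

Every line maps to the same key, so the reducer sees one group and its sum is the number of input lines.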

● Doesn’t provide a way to trigger flows
● Doesn’t support distributed execution
Luigi Limitations

Onwards
● The docs: http://luigi.readthedocs.org/
● The mailing list: https://groups.google.com/forum/#!forum/luigi-user/
● The source: https://github.com/spotify/luigi
● The maintainers are really helpful, responsive, and open to any and all PRs!

Stuart Coleman
@stubacca81 / stuart@growthintel.com
Dylan Barth
@dylan_barth / dylan@growthintel.com
Thanks!
We’re hiring Python data scientists & engineers!
http://www.growthintel.com/careers/
