In this talk, we provide an introduction to Python Luigi via real-life case studies, showing how you can break a large, multi-step data processing task into a graph of smaller sub-tasks that are aware of the state of their interdependencies.
Growth Intelligence tracks the performance and activity of all the companies in the UK economy using their data ‘footprint’. This involves tracking numerous unstructured data points from multiple sources in a variety of formats and transforming them into a standardised feature set we can use for building predictive models for our clients.
In the past, this data was collected in a somewhat haphazard fashion, combining manual effort, ad hoc scripting and processing that was difficult to maintain. To streamline the data flows, we’re using an open-source Python framework from Spotify called Luigi. Luigi was created for managing task dependencies, monitoring the progress of the data pipeline and providing frameworks for common batch processing tasks.
2. Where should I focus my outbound sales and marketing efforts to yield the highest possible ROI?
[Diagram: UK Limited Companies data and Customer CRM Data feed into a Predictive Model]
4. Hard to maintain, extend, and… look at.
Script Soup
[Diagram: piles of scripts labelled "Code", "More Codes" and "omg moar codez"]
5. if __name__ == '__main__':
    today = datetime.now().isoformat()[:10]  # <- Custom date handling
    arg_parser = argparse.ArgumentParser(
        prog='COMPANIES HOUSE PARSER', description='Process arguments')
    arg_parser.add_argument(
        '--files', nargs='?', dest='file_names',
        help='CSV files to read (supports globstar wildcards)', required=True)
    arg_parser.add_argument(
        '--batch', nargs='?', dest='batch_size', type=int,
        help='Number of rows to save to DB at once')
    arg_parser.add_argument(
        '--date', nargs='?', dest='table_date',
        help='Date that these data were released')
    arg_parser.add_argument(
        '--log_level', dest='log_level', help='Log level to screen',
        default='INFO')
    args = arg_parser.parse_args()
The Old Way
Define a command line interface for every task?
6. log = GLoggingFactory().getLoggerFromPath(
    '/var/log/companies-house-load-csv.log-{}'.format(today))
log.setLogLevel(args.log_level, 'screen')  # <- Custom logging
table_date = parse_date(args.table_date, datetime.now())
log.info('Starting Companies House data loader...')
ch_loader = CompaniesHouseLoader(logger=log, col_mapping=col_mapping, table_date=table_date)
ch_loader.go(args.file_names)  # <- What to do if this fails?
log.info('Loader complete. Starting Companies House updater')
ch_updater = CompaniesHouseUpdater(
    logger=log, table_date=table_date,
    company_status_params=company_status_params)  # <- Need to clean up if this fails
ch_updater.go()
The Old Way
Long, processor intensive tasks stacked together
7. ● Open-sourced & maintained by the Spotify data team
● Erik Bernhardsson and Elias Freider. Maintained by Arash Rouhani.
● Abstracts batch processing jobs
● Makes it easy to write modular code and create dependencies between tasks.
Luigi to the rescue!
8. ● Task templating
● Dependency graphs
● Resumption of data flows after intermediate failure
● Command line integration
● Error emails
Luigi
9. Luigi 101- Counting the number of companies in the UK
[Diagram: companies.csv -> input() -> Count companies -> output() -> count.txt]
10. class CompanyCount(luigi.Task):
    def output(self):
        return luigi.LocalTarget("count.csv")
    def run(self):
        count = count_unique_entries("companies.csv")
        with self.output().open("w") as out_file:
            out_file.write(str(count))  # write() expects a string
Company Count Job in Luigi code
11. Luigi 101- Keeping our count up to date
[Diagram: Companies Data Server -> Companies Download -> output() -> companies.csv -> input() -> Count companies -> output() -> count.txt; Count companies requires() Companies Download]
12. class CompanyCount(luigi.Task):
    def requires(self):
        return CompanyDownload()  # <- this task must complete before CompanyCount runs
    def output(self):
        return luigi.LocalTarget("count.csv")
    def run(self):
        count = count_unique_entries(self.input())  # <- the output of the required task
        with self.output().open("w") as out_file:
            out_file.write(str(count))  # write() expects a string
Company count with download dependency
13. Download task
class CompanyDownload(luigi.Task):
    def output(self):
        return luigi.LocalTarget("companies.csv")  # <- local output to be picked up by previous task
    def run(self):
        data = get_company_download()  # <- download the data and write it to the output Target
        with self.output().open('w') as out_file:
            out_file.write(data)
14. $ python company_flow.py CompanyCount --local-scheduler
DEBUG: Checking if CompanyCount() is complete
DEBUG: Checking if CompanyDownload() is complete
INFO: Scheduled CompanyCount() (PENDING)
INFO: Scheduled CompanyDownload() (PENDING)
INFO: Done scheduling tasks
INFO: Running Worker with 1 processes
DEBUG: Asking scheduler for work...
DEBUG: Pending tasks: 2
INFO: [pid 10076] Worker Worker(...) running CompanyDownload()
INFO: [pid 10076] Worker Worker(...) done CompanyDownload()
DEBUG: 1 running tasks, waiting for next task to finish
DEBUG: Asking scheduler for work...
DEBUG: Pending tasks: 1
INFO: [pid 10076] Worker Worker(...) running CompanyCount()
INFO: [pid 10076] Worker Worker(...) done CompanyCount()
DEBUG: 1 running tasks, waiting for next task to finish
DEBUG: Asking scheduler for work...
INFO: Done
16. class AnnualCompanyCountDelta(luigi.Task):
    year = luigi.Parameter()  # <- define parameter
    def requires(self):  # <- generate dependencies
        tasks = []
        for month in range(1, 13):
            tasks.append(CompanyCount(dt.datetime.strptime(
                "{}-{}-01".format(self.year, month), "%Y-%m-%d")))
        return tasks
    # not shown: output(), run()
Parameterising Luigi tasks
17. class CompanyCount(luigi.Task):
    date = luigi.DateParameter(default=datetime.date.today())  # <- added date dependency to company count
    def requires(self):
        return CompanyDownload(self.date)
    def output(self):
        return luigi.LocalTarget("count.csv")
    def run(self):
        count = count_unique_entries(self.input())
        with self.output().open("w") as out_file:
            out_file.write(str(count))  # write() expects a string
Adding the date dependency to Company Count
18. The central scheduler
$ luigid & # start central scheduler in background (by default, localhost:8082)
$ python company_flow.py AnnualCompanyCountDelta --year 2014
20. class CompaniesToMySQL(luigi.contrib.sqla.CopyToTable):
    date = luigi.DateParameter()
    columns = [(["name", String(100)], {}), ...]
    connection_string = "mysql://localhost/test"  # or something
    table = "companies"  # name of the table to store data
    def requires(self):
        return CompanyDownload(self.date)
    def rows(self):
        for row in self.get_unique_rows():  # uses self.input()
            yield row
Persisting our data
21. My pipes broke
# ./client.cfg
[core]
error-email: dylan@growthintel.com, stuart@growthintel.com
22. Things we missed out
There are lots of task types that we haven’t mentioned
● Hadoop
● Spark
● ssh
● Elasticsearch
● Hive
● Pig
● etc.
Check out the luigi.contrib package
23. class CompanyCount(luigi.contrib.hadoop.JobTask):
    chunks = luigi.Parameter()
    def requires(self):
        return [CompanyDownload(chunk) for chunk in self.chunks]  # <- split input in chunks
    def output(self):
        return luigi.contrib.hdfs.HdfsTarget("companies_count.tsv")  # <- HDFS target
    def mapper(self, line):  # <- map and reduce methods instead of run()
        yield "count", 1
    def reducer(self, key, values):
        yield key, sum(values)
Counting the companies using Hadoop
24. ● Doesn’t provide a way to trigger flows
● Doesn’t support distributed execution
Luigi Limitations
25. Onwards
● The docs: http://luigi.readthedocs.org/
● The mailing list: https://groups.google.com/forum/#!forum/luigi-user/
● The source: https://github.com/spotify/luigi
● The maintainers are really helpful, responsive, and open to any and all PRs!
26. Stuart Coleman
@stubacca81 / stuart@growthintel.com
Dylan Barth
@dylan_barth / dylan@growthintel.com
Thanks!
We’re hiring Python data scientists & engineers!
http://www.growthintel.com/careers/
Editor's Notes
How many of you currently manage data pipelines in your day to day?
And how many of you use some sort of framework to manage them?
We work for a startup called Growth Intelligence
We use predictive modeling to help generate high quality leads for our customers.
help customers answer:
where should I focus my outbound sales and marketing efforts to yield the highest possible ROI?
How
we track all the companies in the UK using a variety of data sources
we look at sales data for our customers (positive and negative examples from their CRM)
we use that to build a predictive model to predict which leads will convert for our customers
So we work with a fair amount of data, from a lot of different sources.
We have data pipelines for:
taking in new data to keep our data set current
doing analytics on existing data and doing model building
processing or transforming our existing data, e.g. indexing a subset of it into elasticsearch
And as you all know, the more data you deal with, the messier things get and the more of a burden it becomes to maintain.
In the past, we used to deal with each data pipeline on an ad hoc, individual basis.
For a while, this worked fine.
As our data set grew, we realized we were quickly creating script soup
entire repositories with directories of scripts that had a lot of boilerplate
clearly needing some abstraction and cleanup.
We had stuff like this:
bespoke command line interfaces for each pipeline
fine once or twice, but when you have a lot of pipelines, it becomes unwieldy
We also had processor intensive, longish running tasks stacked up against one another in the script.
If one failed, how could we re-run without having to repeat a lot of the work that had already been done?
Some other challenges included simply keeping things modular and well-structured,
especially when different data pipelines are created at different times by different devs.
We also had varying levels of reporting across our tasks, so some had great logging and reporting, others didn’t.
And the list goes on...
And so now that you understand a bit of the challenge, Stuart is going to take over for a bit to help you understand how we approached this problem.
Stuart:
Fortunately, we aren’t the only ones to have this problem. In fact, everyone here probably has at some point!
A few years ago, the data team at Spotify open-sourced a python library called Luigi.
Erik Bernhardsson and Elias Freider
Currently maintained by Arash Rouhani
Really active and responsive community, open to pull requests, which tend to get integrated within a week
Luigi basically provides a framework for structuring any batch processing job or data pipeline.
It helps to abstract away some of the question marks that we just talked about and will stop you going crazy maintaining your pipes
Luigi has some awesome abstractions:
It provides a Task class, which is a template for a single unit of work and outputs a Target
It’s easy to define dependencies between Tasks
Luigi generates a dependency graph at runtime, so you don’t have to worry about running scripts in a particular order. It will figure out the best order.
Because each task is a unit of work, if something breaks, you can restart in the middle instead of at the beginning, so you get a graceful restart. This is really useful if you have a task that depends on both a long-running task that runs infrequently and a short-running task that runs, say, every day: there is no need to rerun the long-running task. In that sense, Luigi is idempotent.
It comes with intuitive command line integration so that you can pass parameters into your tasks without writing boilerplate
It can notify you when tasks fall over so you don’t have to waste time babysitting
And the list goes on!
We’ll be the first to say that we aren’t experts in Luigi just yet.
But we have slowly started converting some of our data pipelines to use it, and have also been writing our new pipelines exclusively in Luigi.
Now we’ll go through a couple of examples that demonstrate the power of the framework but are simple enough to follow in a 25-minute talk. We’ll also mention a few of its limitations we’ve discovered so far, and then open up for questions and discussion.
Let’s imagine that we have miraculously been given a csv file that contains data about the limited companies here in the UK.
It just has simple metadata like the company’s registration number, the company name, incorporation date, and sector.
For the purposes of this example, let’s imagine we simply want to count the number of unique companies currently operating in the UK and write that count to a file on disk.
So the workflow might look something like the above -- read the file, count the unique company names, and write it to a text file
Here we’ve defined our task “CompanyCount” and it inherits from the vanilla Luigi task
It has a couple of methods: output and run:
Run simply contains the business logic of the task and is executed when the task runs. You can put whatever processing logic you want in here.
Output returns a Luigi Target -- valid output can be a lot of things: a location on disk, on a remote server, or location in a database. In this case, we’re simply writing to disk.
When we run this from the command line, luigi executes the code in the run method and finishes by writing the count to the output target.
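The slide calls a helper, count_unique_entries, that is never shown. A minimal sketch of what it might look like, assuming (hypothetically) that companies.csv has a header row with a "name" column; the real file layout is not given in the talk:

```python
import csv


def count_unique_entries(path, column="name"):
    """Count distinct non-empty values in one column of a CSV file.

    The column name "name" is an assumption; the talk does not show
    the real layout of companies.csv.
    """
    seen = set()
    with open(path, newline="") as csv_file:
        for row in csv.DictReader(csv_file):
            value = (row.get(column) or "").strip()
            if value:
                seen.add(value)
    return len(seen)
```

On later slides the helper is handed self.input(), which is a Target rather than a file name; with this sketch you would pass self.input().path instead.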
Great! But we obviously can’t do much with this data.
We want to make sure we have the latest count
instead of using our local, outdated file, we’re going to go and get the latest data from a UK government server.
We can break this flow into two units of work: a download task and a processing task.
The Task class has another method, requires, which makes it simple to define dependencies between Tasks.
In this case, we simply say that the CompanyCount task requires the CompanyDownload task.
Let’s see how that changes our CompanyCount task:
Here we’ve made two changes to our CompanyCount task:
we’ve added a “requires” method, specifying that CompanyDownload is required before CompanyCount can complete successfully.
we’ve replaced the name of the file with the self.input() method, which returns the Target object that the Task requires.
In this case, the LocalTarget returned by CompanyDownload.
Now we need to define our CompanyDownload task.
CompanyDownload is a simple task that goes out, gets the company data and downloads it to our working directory.
The output method returns a target object pointing to a file location on disk.
The run method simply downloads the file and writes it to the output location.
Note that this output target becomes the input for any task that requires this one (in our case, this is the CompanyCount task).
Now, let’s try running this from the command line
To run our company count task from beginning to end, we simply call python company_flow.py CompanyCount --local-scheduler
That tells Luigi which task we want to run.
Also worth noting that we told luigi to use the local-scheduler.
This tells luigi to not use the central-scheduler, which is a daemon that comes bundled with luigi and handles scheduling tasks.
We’ll talk about what that’s good for in a bit, but for now, we just use the local-scheduler
When we run this from the command line, luigi builds up a dependency graph and sees that before it can run CompanyCount, it needs to run CompanyDownload.
It establishes this by calling the complete() method on each task, which by default simply checks whether the Target returned by the output method already exists (via the Target’s exists() method).
If it does, that task is marked as DONE; otherwise it’s included in the task queue. So first Luigi runs CompanyDownload, and then if it executes successfully, it runs CompanyCount and generates a new count for us.
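The completeness check described here can be modelled in a few lines of plain Python. This is only a sketch of the idea, not Luigi's scheduler: ToyTask and build are hypothetical names, and each task is assumed to have exactly one output file.

```python
import os


class ToyTask:
    """Hypothetical stand-in for a Luigi task with one output file."""

    def __init__(self, name, output_path, action, requires=()):
        self.name = name
        self.output_path = output_path
        self.action = action  # callable that writes output_path
        self.requires = list(requires)

    def complete(self):
        # A task counts as done when its output already exists.
        return os.path.exists(self.output_path)


def build(task, ran=None):
    """Run `task` after its dependencies, skipping anything already complete."""
    ran = [] if ran is None else ran
    if task.complete():
        return ran
    for dependency in task.requires:
        build(dependency, ran)
    task.action(task)
    ran.append(task.name)
    return ran
```

Running build on the count task twice in a row does nothing the second time, and deleting only the count output re-runs the count without repeating the download. That is the resume-after-failure behaviour the talk describes.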
So that was the MVP Luigi task - but from these simple building blocks it is possible to build up complicated examples quite quickly.
Dylan:
Having a count of all the companies in the UK is pretty cool, but what if we wanted to get way more awesome and visualize how the overall number of companies in the UK has changed over the past year?
We can do that pretty easily with our current code, mostly thanks to the fact that our tasks only do one job each.
We just need to add a third task
calls out to our previous tasks, getting the company data for each month in a given year
then outputs something useful -- a csv, or a histogram of the counts returned.
A couple of interesting things are happening here.
First, we are passing in a year as a parameter.
Luigi intelligently accepts defined parameters as command line args, so no boilerplate needed!
Second, the task uses the year to dynamically generate its requirements.
e.g. for each month in this year, run CompanyCount for that month
This triggers a download for that date’s data if we don’t have it already.
In order for this to work, we’ll have to add a date parameter to our previous tasks
Note that we didn’t include our output or run method here
this could be a histogram or a csv, whatever you want to do.
And here you can see how we’ve added a date parameter to the CompanyCount task.
We’ll also need to add this to the CompanyDownload task (not shown)
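One detail the slides gloss over: once CompanyCount takes a date, each date needs its own output file. Otherwise the first month's count.csv would make every other month look complete. A sketch of the date handling in plain Python; month_starts and count_path are hypothetical helpers, and the file naming scheme is an assumption:

```python
import datetime


def month_starts(year):
    """Return the first day of each month in `year` as date objects."""
    # int() because a parameter passed on the command line arrives as a string.
    return [datetime.date(int(year), month, 1) for month in range(1, 13)]


def count_path(date):
    """Build a per-date output file name, e.g. for a LocalTarget."""
    return "count-{}.csv".format(date.isoformat())
```

Inside the task, output() could then return luigi.LocalTarget(count_path(self.date)), and the download task would need a similar per-date file name.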
Now, we can trigger quite a few subtasks with just one task
This can be hard to keep track of
Luckily, the luigi central-scheduler comes with a basic visualizer.
Remember the first time we ran our task sequence from the CLI with --local-scheduler?
Let’s try it again, but this time let’s start a luigi central scheduler.
To start a scheduler, we just run luigid from the command line in the background
Next we run our task, leaving off the --local-scheduler option this time
This tells luigi to use the central scheduler
Note: useful in production because it ensures that two instances of the same task never run simultaneously
Also: awesome in development because you can visualize task dependencies.
While this is running, we can visit localhost:8082
check out the dependency graph that luigi has created.
Our simple command spawned 24 subtasks, a download and company count task for each month of 2014.
The colors represent task status
so all of our previous tasks have run
the delta task is still in progress.
If a task fails, it’s marked as red.
Now that you’ve seen the central scheduler, let’s talk a bit about how Luigi integrates nicely with other tools like MySQL.
In reality, we’re going to want to store the companies data in MySQL so we can use it for modeling and ad hoc querying.
We define a new task, CompaniesToMySQL
it simply takes in a date param
writes companies to table for that month
In this way, we can leverage the download task that we created previously and run this task completely separately from our analytics tasks.
Let’s look at how we can represent this in code
You’ll notice that this looks very different from tasks you’ve seen before
This is because our task isn’t inheriting directly from the vanilla luigi task
We are using the contrib.sqla module
The SQLA CopyToTable task provides powerful abstractions on top of a base luigi task when working with SQLAlchemy
It assumes the output of the task will be a SQLAlchemy table
you can control this by specifying the connection string, table, and columns (if the table doesn’t already exist)
Instead of a run method, we override the rows method, which returns a generator of row tuples.
This simplifies things, because you can do all the processing you want in the rows method and let the task deal with batch inserts.
When we run this, Luigi first checks to make sure that CompanyDownload has been run for the date we specified, and then it runs the copy-to-table task, inserting the records.
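The rows() method on the slide delegates to get_unique_rows, which is not shown. A sketch of such a generator in plain Python, assuming (hypothetically) that the file has a header row and that rows are deduplicated on their first column; unique_rows is an illustrative name:

```python
import csv


def unique_rows(path):
    """Yield each data row as a tuple, skipping repeats of the first column."""
    seen = set()
    with open(path, newline="") as csv_file:
        reader = csv.reader(csv_file)
        next(reader, None)  # skip the header row (assumed to be present)
        for row in reader:
            if not row or row[0] in seen:
                continue
            seen.add(row[0])
            yield tuple(row)
```

Because it is a generator, rows are produced one at a time, which suits a task that inserts them in batches.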
One last thing to think about for now, what happens when something falls over?
Luigi handles errors intelligently.
Because tasks are individual units of work, if something breaks, you don’t have to re-run everything.
You can simply restart the task, and the dependencies that finished will be skipped.
We also want to get a notification with a stack trace.
we can just add a configuration file specifying the emails to send the stack trace to
and here’s an example of an email from Luigi in a case where you try to divide by zero (lolz)
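The kind of report that ends up in such an email can be approximated with the standard library. This sketch only shows how a stack trace for a failing step might be captured; it is not Luigi's notification code, and run_and_report is a hypothetical name:

```python
import traceback


def run_and_report(step):
    """Run `step`; return None on success or a stack-trace string on failure."""
    try:
        step()
    except Exception:
        # format_exc() renders the exception currently being handled.
        return traceback.format_exc()
    return None
```

Calling run_and_report(lambda: 1 / 0) returns a traceback string that names ZeroDivisionError, which is the sort of text you would want in the notification.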
It’s worth noting that in addition to system wide luigi settings, you can also specify settings on a per task basis in the config.
Stuart
There is a ton of extensibility and integration with other services that Luigi provides abstractions for, and we’ve listed them out here
Definitely check out the contrib docs for more info.
Here’s a quick example of how you might use the hadoop module
We point our files to HDFS
Rather than implementing a run() method, we can have a mapper() and reducer() method
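To see what the mapper and reducer on the slide compute, the map, group, and reduce steps can be simulated locally. This is a plain-Python illustration, not how Luigi or Hadoop actually run the job; run_map_reduce is a hypothetical helper:

```python
from collections import defaultdict


def mapper(line):
    # Emit one ("count", 1) pair per input line, as on the slide.
    yield "count", 1


def reducer(key, values):
    # Sum all the 1s emitted for a key.
    yield key, sum(values)


def run_map_reduce(lines):
    """Group mapper output by key, then reduce each group."""
    grouped = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            grouped[key].append(value)
    results = []
    for key in sorted(grouped):
        results.extend(reducer(key, grouped[key]))
    return results
```

With one input line per company, the single "count" key ends up holding the total number of lines.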
Although Luigi is pretty awesome, there are some limitations worth pointing out:
Luigi does not come with built-in triggering, and you still need to rely on something like crontab to trigger workflows periodically.
Luigi does not support distribution of execution.
When you have workers running thousands of jobs daily, this starts to matter, because the worker nodes get overloaded.
Probably ok for a lot of you
The idea was that the API was more important than the architecture. If you are interested in architecture, you may want to check out Airflow, a library that the Airbnb team has just open-sourced.
Definitely check out the docs, join the mailing list (it’s pretty active), and check out the repo.
There’s active churn on the issues and the maintainers are super responsive.
The docs could probably use more examples; that could be your first contribution!