A Beginner's Guide to Building Data Pipelines with Luigi

In this talk, we provide an introduction to Python Luigi via real-life case studies, showing you how you can break a large, multi-step data processing task into a graph of smaller sub-tasks that are aware of the state of their interdependencies.

Growth Intelligence tracks the performance and activity of all the companies in the UK economy using their data ‘footprint’. This involves tracking numerous unstructured data points from multiple sources in a variety of formats and transforming them into a standardised feature set we can use for building predictive models for our clients.

In the past, this data was collected in a somewhat haphazard fashion: a combination of manual effort, ad hoc scripting and processing that was difficult to maintain. In order to streamline the data flows, we’re using an open-source Python framework from Spotify called Luigi. Luigi was created for managing task dependencies, monitoring the progress of the data pipeline and providing frameworks for common batch processing tasks.

Published in: Technology
  • How many of you currently manage data pipelines in your day to day?
    And how many of you use some sort of framework to manage them?
  • We work for a startup called Growth Intelligence
    We use predictive modeling to help generate high quality leads for our customers.
    help customers answer:
    where should I focus my outbound sales and marketing efforts to yield the highest possible ROI?
    How
    we track all the companies in the UK using a variety of data sources
    we look at sales data for our customers (positive and negative examples from their CRM)
    we use that to build a predictive model to predict which leads will convert for our customers
  • So we work with a fair amount of data, from a lot of different sources.
    We have data pipelines for:
    taking in new data to keep our data set current
    doing analytics on existing data and doing model building
    processing or transforming our existing data, e.g. indexing a subset of it into elasticsearch
    And as you all know, the more data you deal with, the messier things can get, really fast, and the more of a burden it becomes to maintain.
  • In the past, we used to deal with each data pipeline on an ad hoc, individual basis.
    For a while, this worked fine.
    As our data set grew, we realized we were quickly creating script soup
    entire repositories with directories of scripts that had a lot of boilerplate
    clearly needing some abstraction and cleanup.
  • We had stuff like this:
    bespoke command line interfaces for each pipeline
    fine once or twice, but when you have a lot of pipelines, it becomes unwieldy
  • We also had processor intensive, longish running tasks stacked up against one another in the script.
    If one failed, how could we re-run without having to repeat a lot of the work that had already been done?
    Some other challenges included simply keeping things modular and well-structured,
    especially when different data pipelines may be created at different times by different devs.
    We also had varying levels of reporting across our tasks, so some had great logging and reporting, others didn’t.
    And the list goes on...
    And so now that you understand a bit of the challenge, Stuart is going to take over for a bit to help you understand how we approached this problem.
  • Stuart:
    Fortunately, we aren’t the only ones to have this problem. In fact, everyone here probably has at some point!
    A few years ago, the data team at Spotify open-sourced a python library called Luigi.
    Erik Bernhardsson and Elias Freider
    Currently maintained by Arash Rouhani
    Really active and responsive community. Open to pull requests. Integrated within a week
    Luigi basically provides a framework for structuring any batch processing job or data pipeline.
    It helps to abstract away some of the question marks that we just talked about and will stop you going crazy maintaining your pipes
  • Luigi has some awesome abstractions:
    It provides a Task class, which is a template for a single unit of work and outputs a Target
    It’s easy to define dependencies between Tasks
    Luigi generates a dependency graph at runtime, so you don’t have to worry about running scripts in a particular order. It works out a valid order for you.
    Because each task is a unit of work, if something breaks, you can restart in the middle instead of at the beginning. You get a graceful restart. This is really useful if you have a task that depends on both a long-running task that runs infrequently and a short-running task that runs, say, every day: there is no need to rerun the long-running task. In that sense, Luigi is idempotent.
    It comes with intuitive command line integration so that you can pass parameters into your tasks without writing boilerplate
    It can notify you when tasks fall over so you don’t have to waste time babysitting
    And the list goes on!
    We’ll be the first to say that we aren’t experts in Luigi just yet.
    But we have slowly started converting some of our data pipelines to use it, and have also been adding our pipelines exclusively in luigi.
    Now we’ll go through a couple of simple examples that demonstrate the power of the framework but are simple enough to follow in a 25-minute talk. We’ll also mention a few of its limitations that we’ve discovered so far, and then open up for questions and discussion.
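As a rough illustration of the dependency-graph idea mentioned in these notes, here is a minimal sketch of how an execution order can be derived from declared dependencies. It is not Luigi's actual scheduler: `execution_order` and the `graph` mapping are hypothetical names, and the task names are borrowed from the examples later in the talk.

```python
def execution_order(requires, target):
    """Return tasks in an order where each task follows its dependencies.

    `requires` maps a task name to the list of task names it depends on.
    """
    order = []
    visiting = set()

    def visit(task):
        if task in order:
            return  # already placed in the order
        if task in visiting:
            raise ValueError("circular dependency at {}".format(task))
        visiting.add(task)
        for dep in requires.get(task, []):
            visit(dep)  # dependencies are placed before the task itself
        visiting.discard(task)
        order.append(task)

    visit(target)
    return order


# Hypothetical graph: the count needs the download to have happened first.
graph = {
    "CompanyCount": ["CompanyDownload"],
    "CompanyDownload": [],
}
print(execution_order(graph, "CompanyCount"))
```

Asking for `CompanyCount` places `CompanyDownload` first, which is the behaviour the notes describe: you name the task you want and the order falls out of the declared dependencies.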
  • Let’s imagine that we have miraculously been given a csv file that contains data about the limited companies here in the UK.
    It just has simple metadata like the company’s registration number, the company name, incorporation date, and sector.
    For the purposes of this example, let’s imagine we simply want to count the number of unique companies currently operating in the UK and write that count to a file on disk.
    So the workflow might look something like the above -- read the file, count the unique company names, and write it to a text file
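The slides call a helper named `count_unique_entries` without showing its body. The following is one possible sketch of such a helper, assuming the CSV has a header row with a `company_name` column; that column name and the sample rows are made up for illustration.

```python
import csv
import io


def count_unique_entries(csv_file, column="company_name"):
    """Count distinct, non-empty values in one column of a CSV file object."""
    reader = csv.DictReader(csv_file)
    names = set()
    for row in reader:
        name = (row.get(column) or "").strip()
        if name:
            names.add(name)
    return len(names)


# Made-up sample rows; the real file's column names may differ.
sample = io.StringIO(
    "company_number,company_name,incorporation_date,sector\n"
    "001,Acme Ltd,2010-01-01,Retail\n"
    "002,Widgets Ltd,2012-05-17,Manufacturing\n"
    "003,Acme Ltd,2014-09-30,Retail\n"
)
print(count_unique_entries(sample))
```

This version takes an open file object rather than a path, so it works equally well with a file opened from disk or from a Luigi Target.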
  • Here we’ve defined our task “CompanyCount” and it inherits from the vanilla Luigi task
    It has a couple of methods: output and run:
    Run simply contains the business logic of the task and is executed when the task runs. You can put whatever processing logic you want in here.
    Output returns a Luigi Target -- valid output can be a lot of things: a location on disk, on a remote server, or location in a database. In this case, we’re simply writing to disk.
    When we run this from the command line, luigi executes the code in the run method and finishes by writing the count to the output target.
    Great! But we obviously can’t do much with this data.
    We want to make sure we have the latest count
    instead of using our local, outdated file, we’re going to go and get the latest data from a UK government server.
  • We can break this flow into two units of work: a download task and a processing task.
    The Task class has another method, requires, which makes it simple to define dependencies between Tasks.
    In this case, we simply say that the CompanyCount task requires the CompanyDownload task.
    Let’s see how that changes our CompanyCount task:
  • Here we’ve made two changes to our CompanyCount task:
    a “requires” method, specifying that CompanyDownload is required before CompanyCount can complete successfully.
    we’ve replaced the name of the file with the self.input() method, which returns the Target object that the Task requires.
    In this case, the LocalTarget returned by CompanyDownload.
    Now we need to define our CompanyDownload task.
  • CompanyDownload is a simple task that goes out, gets the company data and downloads it to our directory.
    The output method returns a target object pointing to a file location on disk.
    The run method simply downloads the file and writes it to the output location.
    Note that this output target becomes the input for any task that requires this one (in our case, this is the CompanyCount task).
    Now, let’s try running this from the command line
  • To run our company count task from beginning to end, we simply call python company_flow.py CompanyCount
    That tells Luigi which task we want to run.
    Also worth noting that we told luigi to use the local-scheduler.
    This tells luigi to not use the central-scheduler, which is a daemon that comes bundled with luigi and handles scheduling tasks.
    We’ll talk about what that’s good for in a bit, but for now, we just use the local-scheduler
    When we run this from the command line, luigi builds up a dependency graph and sees that before it can run CompanyCount, it needs to run CompanyDownload.
    It establishes this by calling the complete() method on each required task, which by default simply checks whether the Target returned by the output method already exists (via the Target’s exists() method).
    If it does, that task is marked as DONE, otherwise it’s included in the task queue. So first Luigi runs CompanyDownload, and then if it executes successfully, it runs CompanyCount and generates a new count for us.
    So that was the MVP Luigi task - but from these simple building blocks it is possible to build up complicated examples quite quickly.
  • Dylan:
    Having a count of all the companies in the UK is pretty cool, but what if we wanted to get way more awesome and visualize how the overall number of companies in the UK has changed over the past year?
    We can do that pretty easily with our current code, mostly thanks to the fact that our tasks only do one job each.
    We just need to add a third task
    calls out to our previous tasks, getting the company data for each month in a given year
    then outputs something useful -- a csv, or a histogram of the counts returned.
  • A couple of interesting things are happening here.
    First, we are passing in a year as a parameter.
    Luigi intelligently accepts defined parameters as command line args, so no boilerplate needed!
    Second, the task uses the year to dynamically generate its requirements.
    e.g. for each month in this year, run CompanyCount for that month
    This triggers a download for that date’s data if we don’t have it already.
    In order for this to work, we’ll have to add a date parameter to our previous tasks
    Note that we didn’t include our output or run method here
    this could be a histogram or a csv, whatever you want to do.
  • And here you can see how we’ve added a date parameter to the CompanyCount task.
    We’ll also need to add this to the CompanyDownload task (not shown)
    Now, we can trigger quite a few subtasks with just one task
    This can be hard to keep track of
    Luckily, the luigi central-scheduler comes with a basic visualizer.
    Remember the first time we ran our task sequence from the CLI with --local-scheduler?
    Let’s try it again, but this time let’s start a luigi central scheduler.

  • To start a scheduler, we just run luigid from the command line in the background
    Next we run our task, leaving off the --local-scheduler option this time
    This tells luigi to use the central scheduler
    Note: useful in production because it ensures that two instances of the same task never run simultaneously
    Also: awesome in development because you can visualize tasks dependencies.
    While this is running, we can visit localhost:8082
    check out the dependency graph that luigi has created.
    Our simple command spawned 24 subtasks: a download and a company count task for each month of 2014.
    The colors represent task status
    so all of our previous tasks have run
    the delta task is still in progress.
    If a task fails, it’s marked as red.
    Now that you’ve seen the central-scheduler, let’s talk a bit about how Luigi nicely integrates with other tools like MySQL.
  • In reality, we’re going to want to store the companies data in MySQL so we can use it for modeling and ad hoc querying.
    We define a new task, CompaniesToMysql
    it simply takes in a date param
    writes companies to table for that month
    In this way, we can leverage the download task that we created previously and run this task completely separately from our analytics tasks.
    Let’s look at how we can represent this in code
  • You’ll notice that this looks very different from tasks you’ve seen before
    This is because our task isn’t inheriting directly from the vanilla luigi task
    We are using the contrib.sqla module
    The SQLA CopyToTable task provides powerful abstractions on top of a base luigi task when working with SQLAlchemy
    It assumes the output of the task will be a SQLAlchemy table
    you can control by specifying the connection string, table, and columns (if the table doesn’t already exist)
    Instead of a run method, we override the rows method, which returns a generator of row tuples.
    This simplifies things, because you can do all the processing you want in the rows method and let the task deal with batch inserts.
    When we run this, Luigi first checks to make sure that CompanyDownload has been run for the date we specified, and then it runs the copy-to-table task, inserting the records.
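The "yield rows, let something else batch the inserts" pattern can be sketched without Luigi or MySQL. The example below uses Python's built-in `sqlite3` with an in-memory database instead; `copy_to_table`, the table layout, and the sample rows are all hypothetical, and this is not how Luigi's CopyToTable is implemented internally.

```python
import itertools
import sqlite3


def rows():
    """Yield (name, sector) tuples; a stand-in for parsing the download."""
    yield ("Acme Ltd", "Retail")
    yield ("Widgets Ltd", "Manufacturing")
    yield ("Gadgets Ltd", "Technology")


def copy_to_table(connection, row_iter, batch_size=2):
    """Insert rows in batches so the whole dataset never sits in memory."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS companies (name TEXT, sector TEXT)"
    )
    row_iter = iter(row_iter)  # accept any iterable, consume it once
    inserted = 0
    while True:
        batch = list(itertools.islice(row_iter, batch_size))
        if not batch:
            break
        connection.executemany("INSERT INTO companies VALUES (?, ?)", batch)
        inserted += len(batch)
    connection.commit()
    return inserted


conn = sqlite3.connect(":memory:")
print(copy_to_table(conn, rows()))
```

The generator only has to know how to produce rows; batch size, table creation and commits stay in one place.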
  • One last thing to think about for now, what happens when something falls over?
    Luigi handles errors intelligently.
    Because tasks are individual units of work, if something breaks, you don’t have to re-run everything.
    You can simply restart the task, and the dependencies that finished will be skipped.
    We also want to get a notification with a stack trace.
    we can just add a configuration file specifying the emails to send the stack trace to
    and here’s an example of an email from Luigi in a case where you try to divide by zero (lolz)
    It’s worth noting that in addition to system wide luigi settings, you can also specify settings on a per task basis in the config.
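The restart behaviour can be imitated in plain Python to show why it saves work. In this sketch (not Luigi code), each step writes an output file and is skipped if that file already exists; `run_step`, `flaky_count` and the simulated failure are invented for the example.

```python
import os
import tempfile

WORKDIR = tempfile.mkdtemp()
executed = []            # names of steps whose work was actually performed
attempts = {"count": 0}  # how many times the flaky step has been tried


def run_step(name, work):
    """Run `work` unless this step's output file already exists."""
    path = os.path.join(WORKDIR, name + ".txt")
    if os.path.exists(path):
        return  # finished on an earlier attempt, so skip it
    result = work()
    executed.append(name)
    with open(path, "w") as out_file:
        out_file.write(result)


def flaky_count():
    """Fail on the first call to imitate a crash, succeed afterwards."""
    attempts["count"] += 1
    if attempts["count"] == 1:
        raise ZeroDivisionError("simulated failure")
    return "2"


def pipeline():
    run_step("download", lambda: "Acme Ltd\nWidgets Ltd\n")
    run_step("count", flaky_count)


try:
    pipeline()  # download succeeds, count fails
except ZeroDivisionError:
    pipeline()  # download is skipped, only count is retried

print(executed)
```

The download is performed once even though the pipeline was started twice, which is the saving the notes describe for long-running upstream tasks.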
  • Stuart
    There is a ton of extensibility and integration with other services that Luigi provides abstractions for, and we’ve listed them out here
    Definitely check out the contrib docs for more info.
    Here’s a quick example of how you might use the hadoop module
  • We point our files to HDFS
    Rather than implementing a run() method, we can have a mapper() and reducer() method
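To see what the mapper() and reducer() pair computes, here is a tiny in-memory imitation of the map, group and reduce phases. It runs without Hadoop or Luigi; `run_job` and the sample lines are hypothetical, while the mapper and reducer bodies mirror the ones on the Hadoop slide.

```python
from collections import defaultdict


def mapper(line):
    # Emit one ("count", 1) pair per input line, as on the Hadoop slide.
    yield "count", 1


def reducer(key, values):
    yield key, sum(values)


def run_job(lines):
    """Tiny in-memory imitation of the map, group, and reduce phases."""
    grouped = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            grouped[key].append(value)
    results = {}
    for key, values in grouped.items():
        for out_key, out_value in reducer(key, values):
            results[out_key] = out_value
    return results


print(run_job(["Acme Ltd", "Widgets Ltd", "Gadgets Ltd"]))
```

Note that this counts input lines, not unique company names; deduplication would need the company name as the mapper's key.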
  • Although Luigi is pretty awesome, there are some limitations worth pointing out:
    Luigi does not come with built-in triggering, and you still need to rely on something like crontab to trigger workflows periodically.
    Luigi does not support distribution of execution.
    When you have workers running thousands of jobs daily, this starts to matter, because the worker nodes get overloaded.
    Probably ok for a lot of you
    The idea was that the API was more important than the architecture. If you are interested in the architecture side, you may want to check out Airflow, a library that the Airbnb team has just open-sourced
  • Definitely check out the docs, join the mailing list (it’s pretty active), and check out the repo.
    There’s active churn on the issues and the maintainers are super responsive.
    Docs could use more examples probably, that could be your first contribution!
  • A Beginner's Guide to Building Data Pipelines with Luigi

    1. A Beginner’s Guide to Building Data Pipelines with Luigi
    2. Where should I focus my outbound sales and marketing efforts to yield the highest possible ROI? [Diagram labels: UK Limited Companies, Customer CRM Data, Predictive Model]
    3. With big data, comes big responsibility
    4. Script Soup: hard to maintain, extend, and… look at. [Image labels: Code, More Codes, omg moar codez]
    5. The Old Way: define a command line interface for every task?

           if __name__ == '__main__':
               today = datetime.now().isoformat()[:10]  # <- custom date handling
               arg_parser = argparse.ArgumentParser(prog='COMPANIES HOUSE PARSER',
                                                    description='Process arguments')
               arg_parser.add_argument('--files', nargs='?', dest='file_names',
                                       help='CSV files to read (supports globstar wildcards)',
                                       required=True)
               arg_parser.add_argument('--batch', nargs='?', dest='batch_size', type=int,
                                       help='Number of rows to save to DB at once')
               arg_parser.add_argument('--date', nargs='?', dest='table_date',
                                       help='Date that these data were released')
               arg_parser.add_argument('--log_level', dest='log_level',
                                       help='Log level to screen', default='INFO')
               args = arg_parser.parse_args()

    6. The Old Way: long, processor intensive tasks stacked together

           log = GLoggingFactory().getLoggerFromPath(
               '/var/log/companies-house-load-csv.log-{}'.format(today))
           log.setLogLevel(args.log_level, 'screen')  # <- custom logging
           table_date = parse_date(args.table_date, datetime.now())
           log.info('Starting Companies House data loader...')
           ch_loader = CompaniesHouseLoader(logger=log, col_mapping=col_mapping,
                                            table_date=table_date)
           ch_loader.go(args.file_names)  # <- what to do if this fails?
           log.info('Loader complete. Starting Companies House updater')
           ch_updater = CompaniesHouseUpdater(logger=log, table_date=table_date,
                                              company_status_params=company_status_params)
           # <- need to clean up if this fails
           ch_updater.go()

    7. Luigi to the rescue!
       ● Open-sourced & maintained by Spotify data team
       ● Erik Bernhardsson and Elias Freider. Maintained by Arash Rouhani.
       ● Abstracts batch processing jobs
       ● Makes it easy to write modular code and create dependencies between tasks.
    8. Luigi
       ● Task templating
       ● Dependency graphs
       ● Resumption of data flows after intermediate failure
       ● Command line integration
       ● Error emails
    9. Luigi 101- Counting the number of companies in the UK [Diagram: companies.csv -> input() -> Count companies -> output() -> count.txt]
    10. Company Count Job in Luigi code

           class CompanyCount(luigi.Task):

               def output(self):
                   return luigi.LocalTarget("count.csv")

               def run(self):
                   count = count_unique_entries("companies.csv")
                   with self.output().open("w") as out_file:
                       out_file.write(str(count))  # write() needs a string

    11. Luigi 101- Keeping our count up to date [Diagram: Companies Data Server -> Companies Download -> output() -> companies.csv -> input() -> Count companies -> output() -> count.txt; Count companies requires() Companies Download]
    12. Company count with download dependency

           class CompanyCount(luigi.Task):

               def requires(self):
                   # this task must complete before CompanyCount runs
                   return CompanyDownload()

               def output(self):
                   return luigi.LocalTarget("count.csv")

               def run(self):
                   # self.input() is the output of the required task
                   count = count_unique_entries(self.input())
                   with self.output().open("w") as out_file:
                       out_file.write(str(count))  # write() needs a string

    13. Download task

           class CompanyDownload(luigi.Task):

               def output(self):
                   # local output to be picked up by previous task
                   return luigi.LocalTarget("companies.csv")

               def run(self):
                   # download the data and write it to the output Target
                   data = get_company_download()
                   with self.output().open('w') as out_file:
                       out_file.write(data)

    14. Running the task from the command line

           $ python company_flow.py CompanyCount --local-scheduler
           DEBUG: Checking if CompanyCount() is complete
           DEBUG: Checking if CompanyDownload() is complete
           INFO: Scheduled CompanyCount() (PENDING)
           INFO: Scheduled CompanyDownload() (PENDING)
           INFO: Done scheduling tasks
           INFO: Running Worker with 1 processes
           DEBUG: Asking scheduler for work...
           DEBUG: Pending tasks: 2
           INFO: [pid 10076] Worker Worker(...) running CompanyDownload()
           INFO: [pid 10076] Worker Worker(...) done CompanyDownload()
           DEBUG: 1 running tasks, waiting for next task to finish
           DEBUG: Asking scheduler for work...
           DEBUG: Pending tasks: 1
           INFO: [pid 10076] Worker Worker(...) running CompanyCount()
           INFO: [pid 10076] Worker Worker(...) done CompanyCount()
           DEBUG: 1 running tasks, waiting for next task to finish
           DEBUG: Asking scheduler for work...
           INFO: Done

    15. Time dependent tasks - change in companies [Diagram: Companies Count Task(Date 1), Companies Count Task(Date 2), Companies Count Task(Date 3) -> input() -> Companies Delta -> output() -> company_count_delta.txt]
    16. Parameterising Luigi tasks

           class AnnualCompanyCountDelta(luigi.Task):
               year = luigi.Parameter()  # define parameter

               def requires(self):
                   # generate dependencies: one CompanyCount per month
                   tasks = []
                   for month in range(1, 13):
                       tasks.append(CompanyCount(dt.datetime.strptime(
                           "{}-{}-01".format(self.year, month), "%Y-%m-%d").date()))
                   return tasks

               # not shown: output(), run()

    17. Adding the date dependency to Company Count

           class CompanyCount(luigi.Task):
               # added date dependency to company count
               date = luigi.DateParameter(default=datetime.date.today())

               def requires(self):
                   return CompanyDownload(self.date)

               def output(self):
                   return luigi.LocalTarget("count.csv")

               def run(self):
                   count = count_unique_entries(self.input())
                   with self.output().open("w") as out_file:
                       out_file.write(str(count))  # write() needs a string

    18. The central scheduler

           $ luigid &  # start central scheduler in background
           $ python company_flow.py AnnualCompanyCountDelta --year 2014

        By default, the scheduler UI is at localhost:8082
    19. Persisting our data [Diagram: Companies Data Server -> Companies Download(Date) -> output() -> companies.csv; Count companies(Date) requires(Date) the download and outputs count.txt; Companies ToMySQL(Date) requires(Date) the download and outputs to a SQL Database]
    20. Persisting our data

           class CompaniesToMySQL(luigi.contrib.sqla.CopyToTable):
               date = luigi.DateParameter()
               columns = [(["name", String(100)], {}), ...]
               connection_string = "mysql://localhost/test"  # or something
               table = "companies"  # name of the table to store data

               def requires(self):
                   return CompanyDownload(self.date)

               def rows(self):
                   for row in self.get_unique_rows():  # uses self.input()
                       yield row

    21. My pipes broke

           # ./client.cfg
           [core]
           error-email: dylan@growthintel.com, stuart@growthintel.com

    22. Things we missed out
        There are lots of task types which can be used which we haven’t mentioned
        ● Hadoop
        ● Spark
        ● ssh
        ● Elasticsearch
        ● Hive
        ● Pig
        ● etc.
        Check out the luigi.contrib package
    23. Counting the companies using Hadoop

           class CompanyCount(luigi.contrib.hadoop.JobTask):
               chunks = luigi.Parameter()

               def requires(self):
                   # split input in chunks
                   return [CompanyDownload(chunk) for chunk in self.chunks]

               def output(self):
                   # HDFS target
                   return luigi.contrib.hdfs.HdfsTarget("companies_count.tsv")

               # map and reduce methods instead of run()
               def mapper(self, line):
                   yield "count", 1

               def reducer(self, key, values):
                   yield key, sum(values)

    24. Luigi Limitations
        ● Doesn’t provide a way to trigger flows
        ● Doesn’t support distributed execution
    25. Onwards
        ● The docs: http://luigi.readthedocs.org/
        ● The mailing list: https://groups.google.com/forum/#!forum/luigi-user/
        ● The source: https://github.com/spotify/luigi
        ● The maintainers are really helpful, responsive, and open to any and all PRs!
    26. Thanks!
        Stuart Coleman @stubacca81 / stuart@growthintel.com
        Dylan Barth @dylan_barth / dylan@growthintel.com
        We’re hiring Python data scientists & engineers! http://www.growthintel.com/careers/