Managing data workflows with Luigi

A talk about data workflow tools at Metrics Monday Helsinki.

Both Custobar (https://custobar.com) and ŌURA (https://ouraring.com) are hiring talented developers. Contact me if you are interested in joining either company.


  1. Managing data workflows with Luigi, by Teemu Kurppa (www.teemukurppa.net). Metrics Monday at Custobar, Helsinki, 30.5.2016
  2. I’m an advisor at your host, Custobar: customer analytics and marketing tool for retailers
  3. I work at ŌURA (www.ouraring.com), the world's first wellness ring. Head of Software: Cloud & Mobile. teemu@ouraring.com
  4. Introducing Data Workflows
  5. gunzip -c /var/log/syslog.3.gz | grep -e UFW
  6. Complex data workflow
      Let’s analyse if the weather affects sleep quality:
      • Get sleep data of all study participants
      • Get location data of all study participants
      • Fetch weather data for each day and location
      • Fetch historical weather data for each location
      • Calculate the difference from average weather for each data point
      • Do a statistical analysis over users and days, comparing weather data and sleep quality data
      A lot can go wrong at each step. Rerunning takes time.
  7. Case Custobar: ETL (Extract - Transform - Load)
  8. Case Custobar: ETL
      • Fetch custom sales.csv from SFTP
      • Transform custom sales.csv to standard sales.json
      • Validate and throw away invalid fields
      • Load valid sales data to database
  9. Case Custobar: ETL
      • Extract: fetch custom sales.csv from SFTP
      • Transform: transform custom sales.csv to standard sales.json; validate and throw away invalid fields
      • Load: load valid sales data to database
  10. Case Custobar: ETL
      • Fetch custom sales.csv from SFTP
      • Transform custom sales.csv to standard sales.json
      • Validate and throw away invalid fields
      • Load valid sales data to database
      Do this for millions of rows of initial data, and continue doing it every day, for products, customers and sales.
  11. Luigi by Spotify
  12. Data workflow tools
      • Pinball by Pinterest
      • Luigi by Spotify
      • Airflow by Airbnb
  13. Luigi Concepts
  14. Luigi Concepts (diagram)
      • Tasks: Get Changed Customers; Export Changed Customers to FTP
      • Targets: sql: Customers table; file://data/customers.csv; sftp://data/customers.csv
      • Dependencies
  15. Luigi Concepts (same diagram, annotated with methods): input(), output(), requires()
  16. Luigi Concepts (same diagram, annotated with parameters): each task takes company: Parameter and date: DateParameter
  17. Concepts: Target
  18. Target
      A target is simply something that exists or doesn’t exist.
      For example:
      • a file in a local file system
      • a file in a remote file system
      • a file in an Amazon S3 bucket
      • a database row in a SQL database
  19. Target

          import luigi
          from pymongo import MongoClient

          class MongoTarget(luigi.Target):
              def __init__(self, database, collection, predicate):
                  self.client = MongoClient()
                  self.database = database
                  self.collection = collection
                  self.predicate = predicate

              def exists(self):
                  # The target exists if at least one document matches the predicate
                  db = self.client[self.database]
                  one = db[self.collection].find_one(self.predicate)
                  return one is not None

  20. Target
      Lots of ready-made targets in Luigi:
      • local file
      • HDFS file
      • S3 key/value target
      • SSH remote target
      • SFTP remote target
      • SQL table row target
      • Amazon Redshift table row target
      • ElasticSearch target
  21. Concepts: Task
  22. Task: basic structure

          class TransformDailySalesCSVtoJSON(luigi.Task):
              def requires(self):
                  ...  # elided on the slide

              def run(self):
                  ...  # elided on the slide

              def output(self):
                  ...  # elided on the slide

  23. Task: parameters

          class TransformDailySalesCSVtoJSON(luigi.Task):
              date = luigi.DateParameter()

              def requires(self):
                  ...  # elided on the slide

              def run(self):
                  ...  # elided on the slide

              def output(self):
                  ...  # elided on the slide

  24. Task: requires

          class TransformDailySalesCSVtoJSON(luigi.Task):
              date = luigi.DateParameter()

              def requires(self):
                  return ImportDailyCSVFromSFTP(self.date)

              def run(self):
                  ...  # elided on the slide

              def output(self):
                  ...  # elided on the slide

  25. Task: output

          class TransformDailySalesCSVtoJSON(luigi.Task):
              date = luigi.DateParameter()

              def requires(self):
                  ...  # elided on the slide

              def run(self):
                  ...  # elided on the slide

              def output(self):
                  path = "/d/sales_%s.json" % self.date.strftime("%Y%m%d")
                  return luigi.LocalTarget(path)

  26. Task: run

          class TransformDailySalesCSVtoJSON(luigi.Task):
              date = luigi.DateParameter()

              def requires(self):
                  ...  # elided on the slide

              def run(self):
                  # Note: Luigi's input() and output() take care of atomicity
                  with self.input().open("r") as infile:
                      data = transform_csv_to_dict(infile)
                  with self.output().open("w") as outfile:
                      json.dump(data, outfile)

              def output(self):
                  ...  # elided on the slide

  27. Task

          import json
          import luigi

          # ImportDailyCSVFromSFTP and transform_csv_to_dict are defined elsewhere
          class TransformDailySalesCSVtoJSON(luigi.Task):
              date = luigi.DateParameter()

              def requires(self):
                  return ImportDailyCSVFromSFTP(self.date)

              def run(self):
                  with self.input().open("r") as infile:
                      data = transform_csv_to_dict(infile)
                  with self.output().open("w") as outfile:
                      json.dump(data, outfile)

              def output(self):
                  path = "/d/sales_%s.json" % self.date.strftime("%Y%m%d")
                  return luigi.LocalTarget(path)

  28. Tasks
      Lots of ready-made tasks in Luigi:
      • dump data to SQL table
      • copy to Redshift table
      • run Hadoop job
      • query SalesForce
      • load ElasticSearch index
      • …
  29. Dependency patterns
  30. Multiple dependencies

          class TransformAllSales(luigi.Task):
              def requires(self):
                  # Return all 1000 requirements as one list
                  return [ImportInitialSaleFile(index=i) for i in range(1000)]

              def run(self):
                  ...  # elided on the slide

              def output(self):
                  ...  # elided on the slide

  31. Dynamic dependencies

          import glob

          class LoadDailyAPIData(luigi.Task):
              date = luigi.DateParameter()

              def run(self):
                  # Dependencies discovered at run time are yielded from run()
                  for filepath in glob.glob("/d/api_data/*.json"):
                      yield TransformDailyAPIData(filepath)

  32. Wrapper task

          class LoadAllDailyData(luigi.WrapperTask):
              date = luigi.DateParameter()

              def requires(self):
                  # A wrapper task only groups other tasks, so it declares them as requirements
                  yield LoadDailyProducts(self.date)
                  yield LoadDailyCustomers(self.date)
                  yield LoadDailySales(self.date)

  33. Why use data workflow tools?
  34. Why use data workflow tools?
      1. Resume the data workflow after a failure
      2. Parametrize and rerun tasks every day
      3. Organise code with shared patterns
  35. Thanks! Questions?
      Custobar is hiring! Approach Juha, Tatu or me to learn more.
      Follow @teemu on Twitter to stay in touch.
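
The first benefit on slide 34, resuming after a failure, follows from the task and target model in slides 14 to 18: a task counts as done when its output target exists, so a rerun skips finished work. The sketch below illustrates that idea in plain Python with the standard library only. It is not Luigi's implementation; `FileTarget`, `Task`, `build`, `Extract`, `Transform` and `demo` are hypothetical names invented for this illustration, and the CSV content is made up.

```python
import json
import os
import tempfile


class FileTarget:
    """Toy stand-in for a file target: it either exists or it doesn't."""

    def __init__(self, path):
        self.path = path

    def exists(self):
        return os.path.exists(self.path)


class Task:
    """Toy task: declares requirements, one output target, and a run step."""

    def requires(self):
        return []

    def output(self):
        raise NotImplementedError

    def run(self):
        raise NotImplementedError

    def complete(self):
        # A task is complete exactly when its output target exists.
        return self.output().exists()


def build(task, log):
    """Run requirements first, then the task; skip anything already complete."""
    if task.complete():
        return
    for dep in task.requires():
        build(dep, log)
    task.run()
    log.append(type(task).__name__)


class Extract(Task):
    def __init__(self, workdir):
        self.workdir = workdir

    def output(self):
        return FileTarget(os.path.join(self.workdir, "sales.csv"))

    def run(self):
        with open(self.output().path, "w") as f:
            f.write("id,amount\n1,9.90\n")  # made-up sample data


class Transform(Task):
    def __init__(self, workdir, fail=False):
        self.workdir = workdir
        self.fail = fail  # lets the demo simulate a crash

    def requires(self):
        return [Extract(self.workdir)]

    def output(self):
        return FileTarget(os.path.join(self.workdir, "sales.json"))

    def run(self):
        if self.fail:
            raise RuntimeError("simulated failure")
        with open(self.requires()[0].output().path) as infile:
            header, *rows = [line.strip().split(",") for line in infile if line.strip()]
        data = [dict(zip(header, row)) for row in rows]
        # Write to a temporary file and rename, so the target never exists half-written.
        tmp = self.output().path + ".tmp"
        with open(tmp, "w") as outfile:
            json.dump(data, outfile)
        os.replace(tmp, self.output().path)


def demo():
    """Return the task names run in a failing first attempt and in the rerun."""
    log = []
    with tempfile.TemporaryDirectory() as workdir:
        try:
            build(Transform(workdir, fail=True), log)
        except RuntimeError:
            pass
        first = list(log)
        build(Transform(workdir), log)
        second = log[len(first):]
    return first, second


if __name__ == "__main__":
    print(demo())
```

In the rerun only the transform step executes, because the extract step's output file already exists. The temporary-file rename in `Transform.run` stands in for the atomicity that slide 26 attributes to Luigi's `input()` and `output()`.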
