A talk about data workflow tools at Metrics Monday Helsinki.
Both Custobar (https://custobar.com) and ŌURA (https://ouraring.com) are hiring talented developers. Contact me if you are interested in joining either company.
Complex data workflow
Let’s analyse if the weather affects sleep quality:
• Get sleep data of all study participants
• Get location data of all study participants
• Fetch weather data for each day and location
• Fetch historical weather data for each location
• Calculate the difference from the average weather for each data point
• Do a statistical analysis over users and days, comparing weather data and sleep quality data
A lot can go wrong at each step, and rerunning everything takes time.
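One way to make reruns cheap is to checkpoint each step to a file and skip any step whose output already exists. A minimal pure-Python sketch of that idea (the file names, helper and data here are made up for illustration; doing this robustly is exactly the problem workflow tools solve):

```python
import json
import os
import tempfile

workdir = tempfile.mkdtemp()  # stand-in for a real data directory

def step(path, compute):
    """Run `compute` only if its checkpoint file is missing; return the result either way."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    result = compute()
    with open(path, "w") as f:
        json.dump(result, f)
    return result

# Hypothetical first steps of the study pipeline; data and names are invented.
sleep = step(os.path.join(workdir, "sleep.json"),
             lambda: {"user1": [7.5, 6.0]})
locations = step(os.path.join(workdir, "locations.json"),
                 lambda: {"user1": "Helsinki"})
# ... the weather fetches, averaging and statistics would follow the same pattern
```

On a rerun after a failure in a late step, the early steps find their checkpoint files and return immediately.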
Case Custobar: ETL
• Fetch custom sales.csv from SFTP (Extract)
• Transform custom sales.csv to standard sales.json (Transform)
• Validate and throw away invalid fields (Transform)
• Load valid sales data to database (Load)
Case Custobar: ETL
Do this for millions of rows of initial data, and continue doing it every day, for
• products
• customers
• sales
Target
Target is simply something that exists or doesn’t exist
For example
• a file in a local file system
• a file in a remote file system
• a file in an Amazon S3 bucket
• a database row in a SQL database
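In code, that existence check is the whole interface. A toy stand-in for the local-file case (the name FileTarget is made up here; in Luigi itself, luigi.LocalTarget fills this role):

```python
import os

class FileTarget:
    """Toy target: the target exists when the file does."""
    def __init__(self, path):
        self.path = path

    def exists(self):
        return os.path.exists(self.path)
```

A workflow engine can then skip any task whose output target already exists.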
Target
import luigi
from pymongo import MongoClient

class MongoTarget(luigi.Target):
    def __init__(self, database, collection, predicate):
        self.client = MongoClient()
        self.database = database
        self.collection = collection
        self.predicate = predicate

    def exists(self):
        # The target "exists" if any document matches the predicate
        db = self.client[self.database]
        one = db[self.collection].find_one(self.predicate)
        return one is not None
Task: run
class TransformDailySalesCSVtoJSON(luigi.Task):
    date = luigi.DateParameter()

    def requires(self): # …

    def run(self):
        # Note: Luigi's input() and output() take care of atomicity
        with self.input().open('r') as infile:
            data = transform_csv_to_dict(infile)
        with self.output().open('w') as outfile:
            json.dump(data, outfile)

    def output(self): # …
Task
class TransformDailySalesCSVtoJSON(luigi.Task):
    date = luigi.DateParameter()

    def requires(self):
        return ImportDailyCSVFromSFTP(self.date)

    def run(self):
        with self.input().open('r') as infile:
            data = transform_csv_to_dict(infile)
        with self.output().open('w') as outfile:
            json.dump(data, outfile)

    def output(self):
        path = "/d/sales_%s.json" % (self.date.strftime('%Y%m%d'))
        return luigi.LocalTarget(path)
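The scheduling model behind this is simple: before a task runs, run its requirements, and skip any task whose output already exists. A toy pure-Python version of that rule (an illustration of the idea only, not Luigi's actual scheduler; with Luigi you would call luigi.build() instead):

```python
class Task:
    """Toy task: subclasses define requires(), complete() and run()."""
    def requires(self):
        return []

    def complete(self):
        return False

    def run(self):
        pass

def execute(task, log):
    """Depth-first: satisfy requirements first, then run the task if needed."""
    for req in task.requires():
        execute(req, log)
    if not task.complete():
        task.run()
        log.append(type(task).__name__)

# Hypothetical two-task chain mirroring the slides above
class ImportCSV(Task):
    def complete(self):
        return True  # pretend yesterday's import already ran

class TransformCSVtoJSON(Task):
    def requires(self):
        return [ImportCSV()]

log = []
execute(TransformCSVtoJSON(), log)
# only the transform runs; the completed import is skipped
```

This is also why a failed pipeline can be rerun cheaply: finished tasks report complete() and are never executed again.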
Tasks
Lots of ready-made tasks in Luigi:
• dump data to a SQL table
• copy to a Redshift table
• run a Hadoop job
• query Salesforce
• load an Elasticsearch index
• …