Building data flows with Celery and SQLAlchemy
PyCon Australia 2013
Roger Barnes
@mindsocket
roger@mindsocket.com.au
http://slideshare.net/mindsocket
Coming up
● Data warehousing
– AKA data integration
● Processing data flows
– SQLAlchemy
– Celery
● Tying it all together
About me
● 15 years doing all things software
● 11 years at a Business Intelligence vendor
● Currently contracting
– This talk based on a real reporting system
Why Data Warehousing?
But we need reports that are
● Timely
● Unambiguous
● Accurate
● Complete
● … and don't impact production systems
What is a Data Warehouse?
"… central repository of data which is created by integrating data from one or more disparate sources" – Wikipedia
Extract, Transform, Load
[ETL process diagram. Source: www.imc.com]
Python can help!
● Rapid prototyping
● Code re-use
● Existing libraries
● Decouple
– data flow management
– data processing
– business logic
Existing solutions
● Not a lot available in the Python space
● People roll their own
● Bubbles (Brewery 2)
– Framework for Python 3
– "Focus on the process, not the data technology"
Ways to move data around
● Flat files
● NoSQL data stores
● RDBMS
SQLAlchemy is...
Python SQL toolkit
&
Object Relational Mapper
About SQLAlchemy
● Full featured
● Mature, robust, documented, maintained
● Flexible
Enterprise!
DB support
● SQLite
● PostgreSQL
● MySQL
● Oracle
● MS-SQL
● Firebird
● Sybase
● ...
Python support
CPython 2.5+
CPython 3+
Jython 2.5+
PyPy 1.5+
Structure
SQLAlchemy Core
● Abstraction over Python's DBAPI
● SQL language via generative Python expressions
SQLAlchemy Core
● Good for DB performance
– bulk operations
– complex queries
– fine-tuning
– connection/tx management
Create a table
from sqlalchemy import *

engine = create_engine('sqlite:///:memory:')
metadata = MetaData()

vehicles_table = Table('vehicles', metadata,
    Column('model', String),
    Column('registration', String),
    Column('odometer', Integer),
    Column('last_service', Date),
)
vehicles_table.create(bind=engine)
Insert data
values = [
    {'model': 'Ford Festiva',
     'registration': 'HAX00R',
     'odometer': 3141},
    {'model': 'Lotus Elise',
     'registration': 'DELEG8',
     'odometer': 31415},
]
rows = engine.execute(
    vehicles_table.insert(),
    values).rowcount
Query data
query = select(
    [vehicles_table]
).where(
    vehicles_table.c.odometer < 100
)
results = engine.execute(query)
for row in results:
    print(row)
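A quick aside on the connection/tx management bullet from the Core slides: engine.begin() hands you a connection wrapped in a transaction that commits on success and rolls back on error. A minimal sketch, reusing the engine and table above:

with engine.begin() as conn:
    # everything in this block runs in one transaction;
    # it commits on exit, or rolls back if an exception escapes
    conn.execute(vehicles_table.insert(),
                 {'model': 'Mini Cooper',
                  'registration': 'M1N1',
                  'odometer': 90210})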
Encapsulating a unit of work
Example Processor Types
● Extract
– Extract from CSV
– Extract from DB table
– Scrape web page
● Transform
– Copy table from extract layer
– Derive column
– Join tables
Abstract Processor
class BaseProcessor(object):
    def dispatch(self):
        return self._run()

    def _run(self):
        return self.run()

    def run(self):
        raise NotImplementedError
Abstract Database Processor
import contextlib

class DatabaseProcessor(BaseProcessor):
    db_class = None
    engine = None
    metadata = None

    @contextlib.contextmanager
    def _with_session(self):
        with self.db_class().get_engine() as engine:
            self.engine = engine
            self.metadata = MetaData(bind=engine)
            yield

    def _run(self):
        with self._with_session():
            return self.run()
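DatabaseProcessor assumes db_class supplies get_engine() as a context manager yielding a SQLAlchemy engine. The slides never show one; a hypothetical minimal version might be:

class WarehouseDB(object):
    # Hypothetical db_class; the talk's real implementation isn't shown
    url = 'sqlite:///warehouse.db'

    @contextlib.contextmanager
    def get_engine(self):
        engine = create_engine(self.url)
        try:
            yield engine
        finally:
            engine.dispose()  # release pooled connections when done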
CSV Extract Mixin
import csv

class CSVExtractMixin(object):
    input_file = None

    def _run(self):
        # uses the context manager defined by DatabaseProcessor
        with self._with_session():
            self.reader = csv.DictReader(
                open(self.input_file)  # input_file is a path
            )
            return self.run()
A Concrete Extract
class SalesHistoryExtract(CSVExtractMixin,
                          DatabaseProcessor):
    target_table_name = 'SalesHistoryExtract'
    input_file = SALES_FILENAME

    def run(self):
        target_table = Table(self.target_table_name,
                             self.metadata)
        # header row, already parsed by DictReader
        for column in self.reader.fieldnames:
            if column:
                target_table.append_column(
                    Column(column, ...))  # column type elided on the slide
        target_table.create()
        insert = target_table.insert()
        new_record_count = self.engine.execute(
            insert, list(self.reader)).rowcount
        return new_record_count
An Abstract Derive Transform
class AbstractDeriveTransform(DatabaseProcessor):
    table_name = None
    key_columns = None
    select_columns = None
    target_columns = None

    def process_row(self, row):
        raise NotImplementedError

    ...
    # Profit!
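The elided body would select the key and input columns, push each row through process_row(), and write the derived values back. A rough sketch of one possible run(), not the talk's actual code, assuming the target columns already exist on the table:

def run(self):
    # reflect the existing transform table
    table = Table(self.table_name, self.metadata, autoload=True)
    query = select([table.c[name]
                    for name in self.key_columns + self.select_columns])
    for row in self.engine.execute(query):
        derived = self.process_row(row)
        # write the derived values back, keyed on key_columns
        self.engine.execute(
            table.update().where(
                and_(*[table.c[k] == row[k] for k in self.key_columns])
            ).values(**derived))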
A Concrete Transform
from business_logic import derive_foo

class DeriveFooTransform(AbstractDeriveTransform):
    table_name = 'SalesTransform'
    key_columns = ['txn_id']
    select_columns = ['location', 'username']
    target_columns = [Column('foo', FOO_TYPE)]

    def process_row(self, row):
        foo = derive_foo(row.location, row.username)
        return {'foo': foo}
Introducing Celery
Distributed Task Queue
A Processor Task
import celery

class AbstractProcessorTask(celery.Task):
    abstract = True
    processor_class = None

    def run(self, *args, **kwargs):
        processor = self.processor_class(
            *args, **kwargs)
        return processor.dispatch()

class DeriveFooTask(AbstractProcessorTask):
    processor_class = DeriveFooTransform

DeriveFooTask().apply_async()  # Run it!
Canvas: Designing Workflows
● Combines a series of tasks
● Groups run in parallel
● Chains run in series
● Can be combined in different ways
>>> new_user_workflow = (create_user.s() | group(
...     import_contacts.s(),
...     send_welcome_email.s()))
>>> new_user_workflow.delay(username='artv',
...                         first='Art',
...                         last='Vandelay',
...                         email='art@vandelay.com')
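Note the sample flow below uses .si() rather than .s(): an immutable signature ignores the result of the preceding task in a chain, which suits independent ETL steps. With a hypothetical add task:

>>> add.s(2)      # partial signature: a chained parent's result is prepended to the args
>>> add.si(2, 2)  # immutable signature: any parent result is ignored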
Sample Data Processing Flow
[Flow diagram]
Extract sales → Copy sales to transform → Normalise currency
Extract customers → Copy customers to transform
Extract products → Copy products to transform
(all of the above) → Join tables
Join tables → Aggregate sales by customer, Aggregate sales by region, Customer data exception report
Sample Data Processing Flow
extract_flow = group((
    ExtractSalesTask().si(),
    ExtractCustTask().si(),
    ExtractProductTask().si()))

transform_flow = group((
    CopySalesTask().si() | NormaliseCurrencyTask().si(),
    CopyCustTask().si(),
    CopyProductTask().si())) | JoinTask().si()

load_flow = group((
    QualityTask().si(),
    AggregateTask().si('cust_id'),
    AggregateTask().si('region_id')))

all_flow = extract_flow | transform_flow | load_flow
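Launching the whole flow is then a single call (a sketch; get() simply blocks until the final group finishes):

result = all_flow.apply_async()  # schedule the entire flow on the workers
result.get()                     # optionally wait for the load tasks to complete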
Monitoring – celery events
Monitoring – celery flower
Turning it up to 11
● A requires/depends structure
● Incremental data loads
● Parameterised flows
● Tracking flow history
● Hooking into other libraries
– NLTK
– SciPy/NumPy
– ...
Summary
● Intro to data warehousing
● Process data with SQLAlchemy
● Task dependencies with Celery canvas
Resources
● SQLAlchemy core: http://bit.ly/10FdYZo
● Celery Canvas: http://bit.ly/MOjazT
● http://databrewery.org
– Bubbles: http://bit.ly/14hNsV0
– Pipeline: http://bit.ly/15RXvWa
● http://schoolofdata.org
Thank You!
Questions?
http://slideshare.net/mindsocket
