Building a data preparation pipeline with Pandas and AWS Lambda
Video: https://youtu.be/pc0Xn0uAm34?t=9m15s
What Will You Learn?
• What is data preparation and why is it required?
• How to prepare data with pandas
• How to set up a pipeline with AWS Lambda
About Me
• Based in Tokyo
• Using Python with data for 6 years
• Freelance data products developer and consultant (data visualization, machine learning)
• Formerly at Orange Labs and Locarise (connected sensor data processing and visualization)
• Current side project: denryoku.io, an API for electric grid power demand and capacity prediction
So you've got data, now what?
• Showing it to an audience:
  • a report from a survey?
  • a news article with charts?
  • a sales dashboard?
But a lot of available data is messy
• incomplete or missing data
• mis-formatted, mis-typed data
• wrong / corrupted values
There are plenty of reasons for it to be messy
• non-availability
• no appropriate means of collection
• lack of validation
• human errors
And this can have very bad consequences
• a crash in your report generator
• incomplete reports
• reports reaching wrong conclusions
• ultimately, if your data is really bad, you cannot trust any conclusion drawn from it
It is not just about quality (ETL)
• Enriching the data
• Aggregating
• Classification (ML)
• Predictions (ML)
[diagram: input 1 + input 2 → clean → aggregate, classify, … → output → visualize]
Example: data journalism & interactive visualization
• Often manually gathered data in spreadsheets
• Data cleaning required
• Data aggregation / preprocessing required
• Data may be updated on a weekly basis
If it is a product, it needs to deal with data updates
[diagram: current data + new data → preparation script → visualisation-ready data → visualisation]
• Who is going to run the script?
• It needs to be automated (the pipeline)
What does it apply to?
[chart: solutions by data quality (low → high) and data update frequency (once, monthly, daily, real-time)]
• ad hoc data analysis → Jupyter notebook prototype
• data journalism, interactive reports, email reports → automated preparation pipeline (batch) (our focus)
• dashboards, data products → micro-batch or real-time processing pipeline
Common operations
• Date parsing
• Deciding on a strategy for null or non-parseable values
• Enforcing value ranges
• Sanitising strings
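Taken together, these four operations might look like the following pandas sketch (the frame and its values are invented for illustration):

```python
import pandas as pd

# Illustrative raw data exhibiting all four problems
df = pd.DataFrame({
    "date": ["2016-05-01", "2016-05-02", "not a date", "2016-05-04"],
    "age": [25, None, 240, 31],
    "name": ["  alice ", "BOB", "carol", "Dan"],
})

# Date parsing: unparseable values become NaT instead of raising
df["date"] = pd.to_datetime(df["date"], errors="coerce")

# Null strategy: fill missing ages with the median
df["age"] = df["age"].fillna(df["age"].median())

# Enforce value ranges: keep only plausible ages
df = df[df["age"].between(0, 120)].copy()

# Sanitise strings: trim whitespace, normalise case
df["name"] = df["name"].str.strip().str.title()
```

Other strategies (dropping nulls, clipping instead of filtering) are equally valid; the point is to decide explicitly.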
Existing tools
• Trifacta Wrangler, Talend Data Preparation, OpenRefine (formerly Google Refine)
• great tools to check data quality and define transformations
So why custom solutions with Python and Pandas?
• With Python, you can do anything!
• It is not that difficult
• Pandas is a versatile tool for manipulating DataFrames
• Transformations are easy to specify
• You are not limited to Pandas: the whole Python ecosystem is available, e.g. scikit-learn
Example from a Jupyter notebook
• Load a simple file with a list of names and ages of different persons
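A minimal stand-in for that notebook step, using an inline CSV instead of a file on disk (the data is made up):

```python
import io
import pandas as pd

# Inline stand-in for the simple names/ages CSV used in the talk
csv_text = """name,age
alice,24
bob,
carol,120
"""

# 'age' parses as float64 because of bob's missing value
df = pd.read_csv(io.StringIO(csv_text))
```

In a notebook you would pass a file path to `pd.read_csv` and inspect the result with `df.head()` and `df.dtypes`.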
Example: statistics on groups (names)
• Is there a relationship between name length and median age?
• Chain operations
• Plot the length of name vs. median age for each name
[chart annotations: warning, outlier]
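The chained computation described above could be sketched like this (sample data invented for illustration):

```python
import pandas as pd

# Hypothetical names/ages frame standing in for the talk's dataset
df = pd.DataFrame({
    "name": ["al", "al", "bob", "cecilia", "cecilia"],
    "age": [20, 30, 40, 50, 70],
})

# Chain operations: median age per name, then name length
stats = (
    df.groupby("name")["age"]
      .median()
      .reset_index()
      .assign(name_length=lambda d: d["name"].str.len())
)
# stats can then be plotted, e.g. stats.plot.scatter(x="name_length", y="age")
```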
[chart: something is wrong: null values, label issues]
Let's fix this
• Deal with missing values with `dropna` or `fillna`
• Clean names
• Reject outliers
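A sketch of those three fixes chained together (sample data invented; thresholds are illustrative):

```python
import numpy as np
import pandas as pd

# Messy input: stray whitespace, mixed case, a missing name,
# a missing age, and an impossible age
df = pd.DataFrame({
    "name": [" alice", "BOB", "carol", None, "dan"],
    "age": [24.0, np.nan, 30.0, 61.0, 9999.0],
})

cleaned = (
    df.dropna(subset=["name"])                               # drop rows with no name
      .assign(
          age=lambda d: d["age"].fillna(d["age"].median()),  # fill missing ages
          name=lambda d: d["name"].str.strip().str.lower(),  # clean names
      )
      .query("0 <= age <= 120")                              # reject outliers
)
```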
Close the loop to improve the data entry/acquisition
• Many errors can be avoided during data collection:
  • form / column validation
  • drop-down selections for categories
• Report rejected rows to improve the collection process
[diagram: data → preparation script → list of issues → improve forms, …]
Testing your preparation
• Unit tests
• Tests for anticipated edge cases (defensive programming)
• Property-based testing (http://hypothesis.works/)
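For example, unit tests for anticipated edge cases might look like this; `prepare` is a hypothetical preparation step, not the talk's actual code:

```python
import pandas as pd

def prepare(df):
    """Hypothetical preparation step: drop rows without a name,
    reject implausible ages."""
    out = df.dropna(subset=["name"]).copy()
    return out[out["age"].between(0, 120)]

# Unit tests for anticipated edge cases (defensive programming)
def test_empty_frame():
    empty = pd.DataFrame({"name": pd.Series(dtype=str),
                          "age": pd.Series(dtype=float)})
    assert prepare(empty).empty

def test_outliers_rejected():
    df = pd.DataFrame({"name": ["a", "b"], "age": [30, 999]})
    assert prepare(df)["age"].tolist() == [30]

test_empty_frame()
test_outliers_rejected()
```

With Hypothesis, the same `prepare` function could instead be checked against generated frames, asserting that the output never contains nulls or out-of-range ages.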
More references for data cleaning
• Data cleaning with Pandas: https://www.youtube.com/watch?v=_eQ_8U5kruQ
• Data cleanup with Python: http://kjamistan.com/automating-your-data-cleanup-with-python/
• Modern Pandas: Tidy Data: https://tomaugspurger.github.io/modern-5-tidy.html
Some challenges
• Don't let users run scripts by hand
• Automating is part of a quality process
• Keeping things simple…
• …and cheap
What is AWS Lambda: a serverless solution
• A serverless offering by AWS
• No lifecycle to manage, no shared state => resilient
• Auto-scaling
• Pay for actual running time: low cost
• No server or infrastructure management: reduced dev / devops cost
[diagram: events → lambda function → output]
Creating a function
• Just a Python function
Creating a function: options
Creating an "architecture" with triggers
Batch processing at regular intervals
• cron scheduling
• let your function fetch some data and process it at regular intervals
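A scheduled function of that shape could be sketched as below; `fetch_data`, `prepare` and `store` are placeholder names standing in for the real I/O, not actual APIs:

```python
# A cron-style trigger (CloudWatch Events) invokes lambda_handler on a
# schedule; the handler fetches data, prepares it, and stores the result.

def fetch_data():
    # placeholder: e.g. download a CSV from S3 or a Google spreadsheet
    return [{"name": " alice ", "age": 24}]

def prepare(rows):
    # placeholder cleaning step: strip whitespace from names
    return [dict(r, name=r["name"].strip()) for r in rows]

def store(rows):
    # placeholder: e.g. upload the prepared file back to S3
    return len(rows)

def lambda_handler(event, context):
    rows = prepare(fetch_data())
    return {"rows_processed": store(rows)}
```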
An API / webhook
• On API call
• Can be triggered from a Google spreadsheet
Setting up AWS Lambda for Pandas
Pandas and its dependencies need to be compiled for Amazon Linux x86_64:

1. Launch an EC2 instance and connect to it
2. Install pandas in a virtualenv
3. Zip the installed libraries

shell:

# install compilation environment
sudo yum -y update
sudo yum -y upgrade
sudo yum groupinstall "Development Tools"
sudo yum install blas blas-devel lapack lapack-devel Cython --enablerepo=epel

# create and activate virtual env
virtualenv pdenv
source pdenv/bin/activate

# install pandas
pip install pandas

# zip the environment content
cd ~/pdenv/lib/python2.7/site-packages/
zip -r ~/pdenv.zip . --exclude *.pyc
cd ~/pdenv/lib64/python2.7/site-packages/
zip -r ~/pdenv.zip . --exclude *.pyc

# add the supporting libraries
cd ~/
mkdir -p libs
cp /usr/lib64/liblapack.so.3 /usr/lib64/libblas.so.3 \
   /usr/lib64/libgfortran.so.3 /usr/lib64/libquadmath.so.0 libs/
zip -r ~/pdenv.zip libs
Using pandas from a lambda function
• The lambda process needs access to those binaries
• Set up environment variables
• Call a subprocess
• And pickle the function input
• AWS will call `lambda_function.lambda_handler`

python: lambda_function.py

import os, subprocess
import cPickle as pickle

LIBS = os.path.join(os.getcwd(), 'local', 'lib')

def handler(filename):
    def handle(event, context):
        # pass the event to the subprocess through a pickle file
        pickle.dump(event, open('/tmp/event.p', 'wb'))
        # expose the bundled native libraries to the subprocess
        env = os.environ.copy()
        env.update(LD_LIBRARY_PATH=LIBS)
        proc = subprocess.Popen(
            ('python', filename),
            env=env,
            stdout=subprocess.PIPE)
        proc.wait()
        return proc.stdout.read()
    return handle

lambda_handler = handler('my_function.py')
The actual function
• Get the input data from a Google spreadsheet, a CSV file on S3, an FTP…
• Clean it
• Copy it somewhere

python: my_function.py

import pandas as pd
import pickle
import requests
from StringIO import StringIO

def run():
    # get the lambda call arguments
    event = pickle.load(open('/tmp/event.p', 'rb'))
    # load some data from a google spreadsheet
    r = requests.get('https://docs.google.com/spreadsheets'
                     + '/d/{sheet_id}/export?format=csv&gid={page_id}')
    data = r.content.decode('utf-8')
    df = pd.read_csv(StringIO(data))
    # Do something
    # save as a file
    file_ = StringIO()
    df.to_csv(file_, encoding='utf-8')
    # copy the result somewhere

if __name__ == '__main__':
    run()
Upload and test
• Add your lambda function code to the environment zip
• Upload your function
Caveat 1: Python 2.7
• Officially, only Python 2.7 is supported
• But Python 3 is available and can be called as a subprocess
• Details here: http://www.cloudtrek.com.au/blog/running-python-3-on-aws-lambda/
Caveat 2: max process memory (1.5 GB) and execution time
• Need to split the dataset if too large
• Looping over it in a single lambda call:
  • may exceed the timeout
• Mapping to multiple lambda calls:
  • need to merge the dataset at the end
• Lambda functions should be simple; chain them if required
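The split-and-merge idea can be sketched with pandas' `chunksize` option, which reads a CSV in pieces (inline data used for illustration; in a real pipeline each piece might be a separate lambda invocation):

```python
import io
import pandas as pd

# Inline CSV standing in for a dataset too large to load at once
csv_text = "value\n" + "\n".join(str(i) for i in range(10))

# Read and process the file in chunks, keeping only partial results
partial_sums = []
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4):
    partial_sums.append(chunk["value"].sum())

# Merge the partial results at the end
result = sum(partial_sums)
```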
Takeaways
• Know your data and your target
• Pandas can solve many issues
• Defensive programming and closing the loop
• AWS Lambda is a powerful and flexible tool for time- and resource-constrained teams