PyconJP
2016-09-22
Fabian Dubois
Building a data preparation pipeline with Pandas and AWS Lambda
What Will You Learn?
▸ What is data preparation and why it is required.
▸ How to prepare data with pandas.
▸ How to set up a pipeline with AWS Lambda
About Me
▸ Based in Tokyo
▸ Using Python with data for 6 years
▸ Freelance data products developer and consultant (data visualization, machine learning)
▸ Formerly at Orange Labs and Locarise (connected sensors data processing and visualization)
▸ Current side project: denryoku.io, an API for electric grid power demand and capacity prediction
Why Data Preparation?
So you have got data, now what?
▸ Showing it to an audience:
▸ a report from a survey?
▸ a news article with charts?
▸ a sales dashboard?
But a lot of available data is messy
▸ incomplete or missing data
▸ mis-formatted, mis-typed data
▸ wrong / corrupted values
It has every reason to be messy
▸ data not available
▸ no appropriate means of collection
▸ lack of validation
▸ human errors
And this can have very bad consequences
▸ Crash in your report generator
▸ incomplete reports
▸ report reaches wrong conclusions
▸ Ultimately, if your data is really bad, you cannot trust any conclusion from it
It is not just about quality (ETL)
▸ Enriching the data
▸ Aggregating
▸ Classification (ML)
▸ Predictions (ML)
[Diagram: input 1 and input 2 are each cleaned, then aggregated, classified, … into an output that is visualized]
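The enrich and aggregate steps listed above can be sketched with pandas as follows. The two inputs (a sales table and a store lookup table), their column names and their values are made up for illustration, and a recent pandas is assumed.

```python
import pandas as pd

# Hypothetical input 1: individual sales.
sales = pd.DataFrame({
    "store_id": [1, 1, 2, 3],
    "amount": [100.0, 50.0, 70.0, 30.0],
})
# Hypothetical input 2: a lookup table describing each store.
stores = pd.DataFrame({
    "store_id": [1, 2, 3],
    "region": ["Kanto", "Kansai", "Kanto"],
})

# Enrich: attach the region to every sale (left join keeps all sales).
enriched = sales.merge(stores, on="store_id", how="left")

# Aggregate: total amount per region, a shape that is easy to chart.
per_region = enriched.groupby("region")["amount"].sum()
print(per_region)
```

A left join is used so that a sale with an unknown store is kept (with a missing region) rather than silently dropped.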
Example: data journalism & interactive visualization
▸ Often manually gathered data in spreadsheets
▸ Data cleaning required
▸ Data aggregation/preprocessing required
▸ Data may be updated on a weekly basis
If it is a product, it needs to deal with data updates
[Diagram: current data and new data → preparation script → visualisation-ready data → visualisation]
▸ Who is going to run the script?
Needs to be automated (the pipeline)
What does it apply to?
[Figure: applications and solutions placed by data update frequency (once, monthly, daily, real-time) and data quality (low to high); one region is marked "our focus"]
▸ Applications: ad hoc data analysis; data journalism, interactive reports, email reports; dashboards, data products
▸ Solutions: Jupyter notebook prototype; automated preparation pipeline (batch); micro-batch or real-time processing pipeline
How to prepare data?
common operations
▸ Date parsing
▸ Deciding on a strategy for null or non-parseable values
▸ Enforcing value ranges
▸ Sanitising strings
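As a rough illustration, the four operations above might look like this in pandas. The column names, the sample values and the 0 to 120 age range are assumptions chosen for the example, and dropping rows without a usable age is only one possible null strategy.

```python
import pandas as pd

# Hypothetical raw input with the usual problems.
raw = pd.DataFrame({
    "name": ["  alice ", "BOB", "carol", "dave"],
    "joined": ["2016-09-22", "not a date", "2016-01-05", None],
    "age": ["34", "abc", "-5", "41"],
})

# Sanitise strings: trim whitespace and normalise the case.
name = raw["name"].str.strip().str.title()

# Date parsing: values that cannot be parsed become NaT instead of raising.
joined = pd.to_datetime(raw["joined"], errors="coerce")

# Non-parseable numbers become NaN.
age = pd.to_numeric(raw["age"], errors="coerce")

# Enforce a value range: anything outside 0..120 is treated as missing.
age = age.where(age.between(0, 120))

clean = pd.DataFrame({"name": name, "joined": joined, "age": age})

# Null strategy (one option): a row is useless without an age, but a
# missing date is tolerated.
clean = clean.dropna(subset=["age"])
print(clean)
```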
Existing tools
▸ Trifacta Wrangler, Talend Dataprep, Google Open Refine
▸ great tools to check data quality and define transformations
So why custom solutions with Python and Pandas?
▸ With Python, you can do anything!
▸ It is not that difficult
▸ Pandas is a versatile tool that manipulates DataFrames
▸ Easy to specify transformations
▸ Not limited to Pandas: the whole Python ecosystem is available, like scikit-learn
Example from a Jupyter notebook
▸ load a simple file with a list of names and ages of different persons
Example: statistics on groups (names)
▸ Is there a relationship between name length and median age?
▸ Chain operations
▸ Plot the length of name vs age for each name
[Notebook screenshot with callouts: "Warning", "Outlier"]
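The notebook itself is not reproduced in this transcript. A minimal sketch of the chained group statistics, using a made-up inline CSV with `name` and `age` columns, could look like this:

```python
import io

import pandas as pd

# Hypothetical stand-in for the file of names and ages.
csv_text = u"""name,age
Taro,30
Taro,40
Hanako,25
Hanako,35
Ken,50
"""
df = pd.read_csv(io.StringIO(csv_text))

# Chain operations: median age per name, then the length of each name.
stats = (
    df.groupby("name")["age"]
      .median()
      .rename("median_age")
      .reset_index()
      .assign(name_length=lambda d: d["name"].str.len())
      .sort_values("name_length")
)
print(stats)

# Plotting needs matplotlib, so it is left as a comment here:
# stats.plot.scatter(x="name_length", y="median_age")
```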
Something is wrong
▸ null values
▸ label issues
Let’s fix this
▸ deal with missing values with `dropna` or `fillna`
▸ clean names
▸ reject outliers
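One way these three fixes could be combined is sketched below. The sample data, the 0 to 120 bounds for a plausible age and the choice to fill missing ages with the median are assumptions, not the notebook's actual code.

```python
import pandas as pd

# Hypothetical messy input: a padded lowercase name, a missing name,
# a missing age and an impossible age.
df = pd.DataFrame({
    "name": ["Taro", " taro", "Hanako", None, "Ken", "Ken"],
    "age": [30.0, 40.0, None, 22.0, 50.0, 500.0],
})

fixed = (
    df.dropna(subset=["name"])  # a row without a name cannot be grouped
      .assign(name=lambda d: d["name"].str.strip().str.title())  # clean names
)

# Reject outliers first so they do not distort the fill value;
# missing ages are kept for now.
in_range = fixed["age"].between(0, 120) | fixed["age"].isna()
fixed = fixed[in_range].copy()

# Fill the remaining missing ages with the median of the valid ones.
fixed["age"] = fixed["age"].fillna(fixed["age"].median())
print(fixed)
```

Rejecting outliers before calling `fillna` matters here: with the value 500 still present, the median used for filling would be 45 instead of 40.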
Close the loop to improve the data entry/acquisition
▸ Many errors can be avoided during data collection:
▸ form / column validation
▸ drop down selections for categories
▸ Report rejected rows to improve collection process
[Diagram: data → preparation script → list of issues → improve forms…]
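Reporting rejected rows can be as simple as returning them next to the clean data, each with a reason. The sketch below assumes a hypothetical helper `split_valid_rejected` and the same `name` and `age` columns as before; the rules and messages are illustrative.

```python
import pandas as pd


def split_valid_rejected(df):
    """Return (valid rows, rejected rows with an extra 'reason' column)."""
    reasons = pd.Series("", index=df.index)
    reasons.loc[df["name"].isna()] = "missing name"
    bad_age = df["age"].isna() | ~df["age"].between(0, 120)
    # Only record the age problem when no earlier reason was found.
    reasons.loc[bad_age & (reasons == "")] = "age missing or out of range"

    is_rejected = reasons != ""
    rejected = df[is_rejected].assign(reason=reasons[is_rejected])
    valid = df[~is_rejected]
    return valid, rejected


df = pd.DataFrame({"name": ["Taro", None, "Ken"], "age": [30, 25, 500]})
valid, rejected = split_valid_rejected(df)
print(rejected)  # this is the list of issues to send back to data collection
```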
Testing your preparation
▸ Unit tests
▸ Test for anticipated edge cases (defensive programming)
▸ Property based testing (http://hypothesis.works/)
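A small sketch of both kinds of test for a hypothetical `clean_name` step follows. To stay within the standard library, the second test is a hand-rolled randomized check standing in for a property-based test; a library such as Hypothesis would generate and shrink the inputs for you.

```python
import random
import string
import unittest


def clean_name(value):
    """Hypothetical cleaning step: trim, normalise case, None for empty input."""
    if value is None:
        return None
    value = value.strip()
    return value.title() if value else None


class CleanNameTest(unittest.TestCase):
    def test_edge_cases(self):
        # Anticipated edge cases: missing, blank and padded input.
        self.assertIsNone(clean_name(None))
        self.assertIsNone(clean_name("   "))
        self.assertEqual(clean_name("  taro "), "Taro")

    def test_cleaning_is_idempotent(self):
        # Property: cleaning an already clean value changes nothing.
        rng = random.Random(0)  # fixed seed keeps the test repeatable
        alphabet = string.ascii_letters + "  \t"
        for _ in range(200):
            length = rng.randint(0, 12)
            raw = "".join(rng.choice(alphabet) for _ in range(length))
            once = clean_name(raw)
            self.assertEqual(clean_name(once), once)
```

Saved in a test module, this runs with `python -m unittest`.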
More references for data cleaning
▸ Data cleaning with Pandas: https://www.youtube.com/watch?v=_eQ_8U5kruQ
▸ Data cleanup with Python: http://kjamistan.com/automating-your-data-cleanup-with-python/
▸ Modern Pandas: Tidy Data: https://tomaugspurger.github.io/modern-5-tidy.html
Setting up a pipeline with AWS Lambda
Some challenges
▸ Don’t let users run scripts
▸ Automating is part of a quality process
▸ Keeping things simple…
▸ and cheap
What is AWS Lambda: a serverless solution
▸ Serverless offering by AWS
▸ No lifecycle to manage or shared state => resilient
▸ Auto-scaling
▸ Pay for actual running time: low cost
▸ No server or infrastructure management: reduced dev / devops cost
[Diagram: events → lambda function → output]
Creating a function
just a python function
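The slide's screenshot is not part of this transcript. As a minimal sketch, a Lambda entry point is a function taking the trigger payload (`event`) and a runtime information object (`context`); the `name` key and the greeting are made up for the example.

```python
def lambda_handler(event, context):
    """Entry point: AWS passes the trigger payload and a runtime context."""
    name = event.get("name", "world")
    return {"message": "Hello, {}!".format(name)}


# Local check: the context is not used here, so None is enough.
print(lambda_handler({"name": "PyCon JP"}, None))
```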
Creating a function: options
Creating an “architecture” with triggers
Batch processing at regular intervals
▸ cron scheduling
▸ let your function get some data and process it at regular intervals
An API / webhook
▸ on API call
▸ Can be triggered from a Google spreadsheet
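A single handler can serve both the scheduled trigger and the API trigger by looking at the shape of the event. The sketch below assumes a scheduled event carrying `"source": "aws.events"` and an API Gateway proxy-style event carrying `httpMethod` and a JSON `body`; `prepare` and the `sheet` parameter are hypothetical placeholders for the real preparation step.

```python
import json


def prepare(params):
    """Hypothetical placeholder for the actual data preparation."""
    return {"prepared": True, "params": params}


def lambda_handler(event, context):
    if event.get("source") == "aws.events":
        # Scheduled run: nothing to parse, just process the latest data.
        return prepare({"trigger": "schedule", "time": event.get("time")})
    if "httpMethod" in event:
        # API call / webhook: parameters arrive as a JSON string in the body.
        body = json.loads(event.get("body") or "{}")
        result = prepare({"trigger": "api", "sheet": body.get("sheet")})
        return {"statusCode": 200, "body": json.dumps(result)}
    raise ValueError("unsupported trigger")
```

Keeping the preparation itself in `prepare` means the same code path is exercised whichever trigger fires.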
Setting up AWS Lambda for Pandas
Pandas and its dependencies need to be compiled for Amazon Linux x86_64:
1. Launch an EC2 instance and connect to it
2. Install pandas in a virtualenv
3. Zip the installed libraries
# install compilation environment
sudo yum -y update
sudo yum -y upgrade
sudo yum groupinstall "Development Tools"
sudo yum install blas blas-devel lapack \
    lapack-devel Cython --enablerepo=epel
# create and activate virtual env
virtualenv pdenv
source pdenv/bin/activate
# install pandas
pip install pandas
# zip the environment content (quote the pattern so the shell does not expand it)
cd ~/pdenv/lib/python2.7/site-packages/
zip -r ~/pdenv.zip . --exclude "*.pyc"
cd ~/pdenv/lib64/python2.7/site-packages/
zip -r ~/pdenv.zip . --exclude "*.pyc"
# add the supporting libraries
cd ~/
mkdir -p libs
cp /usr/lib64/liblapack.so.3 \
    /usr/lib64/libblas.so.3 \
    /usr/lib64/libgfortran.so.3 \
    /usr/lib64/libquadmath.so.0 \
    libs/
zip -r ~/pdenv.zip libs
shell
Using pandas from a lambda function
▸ The lambda process needs to access those binaries
▸ Set up env variables
▸ Call a subprocess
▸ And pickle the function input
▸ AWS will call `lambda_function.lambda_handler`
import os, subprocess
import cPickle as pickle  # Python 2.7

# directory of the shared libraries, zipped under libs/ in the setup step
LIBS = os.path.join(os.getcwd(), 'libs')

def handler(filename):
    def handle(event, context):
        # hand the event over to the subprocess through a pickle file
        with open('/tmp/event.p', 'wb') as f:
            pickle.dump(event, f)
        env = os.environ.copy()
        env.update(LD_LIBRARY_PATH=LIBS)
        proc = subprocess.Popen(
            ('python', filename),
            env=env,
            stdout=subprocess.PIPE)
        # communicate() drains stdout while waiting, so a full pipe cannot block
        out, _ = proc.communicate()
        return out
    return handle

lambda_handler = handler('my_function.py')
python: lambda_function.py
The actual function
▸ Get the input data from a Google spreadsheet, a CSV file on S3, an FTP server
▸ Clean it
▸ Copy it somewhere
import pandas as pd
import pickle
import requests
from StringIO import StringIO  # Python 2.7

def run():
    # get the lambda call arguments
    with open('/tmp/event.p', 'rb') as f:
        event = pickle.load(f)
    # load some data from a google spreadsheet
    # ({sheet_id} and {page_id} are placeholders to fill in)
    r = requests.get('https://docs.google.com/spreadsheets'
                     + '/d/{sheet_id}/export?format=csv&gid={page_id}')
    data = r.content.decode('utf-8')
    df = pd.read_csv(StringIO(data))
    # Do something
    # save as file
    file_ = StringIO()
    df.to_csv(file_, encoding='utf-8')
    # copy the result somewhere

if __name__ == '__main__':
    run()
python: my_function.py
upload and test
▸ add your lambda function code to the environment zip.
▸ upload your function
caveat 1: python 2.7
▸ officially, only Python 2.7 is supported
▸ But Python 3 is available and can be called as a subprocess
▸ details here: http://www.cloudtrek.com.au/blog/running-python-3-on-aws-lambda/
caveat 2: max process memory (1.5GB) and execution time
▸ need to split the dataset if too large
▸ loop over it in your lambda call:
▸ may exceed the timeout
▸ map to multiple lambda calls
▸ need to merge the dataset at the end
▸ Lambda functions should be simple, chain if required
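The split, process and merge pattern can be sketched with pandas' chunked CSV reader. This version loops inside a single call; mapping to several lambda calls would run the hypothetical `prepare_chunk` in separate invocations and merge in a final step. The data and the chunk size are illustrative, and a recent pandas is assumed.

```python
import io

import pandas as pd


def prepare_chunk(chunk):
    """Hypothetical per-chunk preparation: drop rows without an age."""
    return chunk.dropna(subset=["age"])


def prepare_in_chunks(csv_file, rows_per_chunk):
    # With chunksize set, read_csv yields DataFrames of at most
    # rows_per_chunk rows, so only one chunk is parsed at a time.
    reader = pd.read_csv(csv_file, chunksize=rows_per_chunk)
    parts = [prepare_chunk(chunk) for chunk in reader]
    # Merge the prepared pieces back into one dataset.
    return pd.concat(parts, ignore_index=True)


csv_text = u"name,age\nTaro,30\nHanako,\nKen,50\nYuki,28\nAoi,33\n"
result = prepare_in_chunks(io.StringIO(csv_text), rows_per_chunk=2)
print(result)
```

This only works as written when each row can be prepared independently; a step that needs global statistics (a median over the whole dataset, for example) needs a separate pass before the chunks are processed.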
Takeaways
▸ Know your data and your target
▸ Pandas can solve many issues
▸ Defensive programming and closing the loop
▸ AWS Lambda is a powerful and flexible tool for time and resource constrained teams
Thanks
Questions?
@fabian_dubois
fabian@datamaplab.com
check denryoku.io

 
TROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERROR
TROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERRORTROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERROR
TROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERROR
 
Enterprise Resource Planning System in Telangana
Enterprise Resource Planning System in TelanganaEnterprise Resource Planning System in Telangana
Enterprise Resource Planning System in Telangana
 
Globus Compute Introduction - GlobusWorld 2024
Globus Compute Introduction - GlobusWorld 2024Globus Compute Introduction - GlobusWorld 2024
Globus Compute Introduction - GlobusWorld 2024
 
Lecture 1 Introduction to games development
Lecture 1 Introduction to games developmentLecture 1 Introduction to games development
Lecture 1 Introduction to games development
 
Dominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdf
Dominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdfDominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdf
Dominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdf
 
Developing Distributed High-performance Computing Capabilities of an Open Sci...
Developing Distributed High-performance Computing Capabilities of an Open Sci...Developing Distributed High-performance Computing Capabilities of an Open Sci...
Developing Distributed High-performance Computing Capabilities of an Open Sci...
 
Prosigns: Transforming Business with Tailored Technology Solutions
Prosigns: Transforming Business with Tailored Technology SolutionsProsigns: Transforming Business with Tailored Technology Solutions
Prosigns: Transforming Business with Tailored Technology Solutions
 
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...
 
Beyond Event Sourcing - Embracing CRUD for Wix Platform - Java.IL
Beyond Event Sourcing - Embracing CRUD for Wix Platform - Java.ILBeyond Event Sourcing - Embracing CRUD for Wix Platform - Java.IL
Beyond Event Sourcing - Embracing CRUD for Wix Platform - Java.IL
 
Webinar: Salesforce Document Management 2.0 - Smarter, Faster, Better
Webinar: Salesforce Document Management 2.0 - Smarter, Faster, BetterWebinar: Salesforce Document Management 2.0 - Smarter, Faster, Better
Webinar: Salesforce Document Management 2.0 - Smarter, Faster, Better
 
Into the Box 2024 - Keynote Day 2 Slides.pdf
Into the Box 2024 - Keynote Day 2 Slides.pdfInto the Box 2024 - Keynote Day 2 Slides.pdf
Into the Box 2024 - Keynote Day 2 Slides.pdf
 
A Sighting of filterA in Typelevel Rite of Passage
A Sighting of filterA in Typelevel Rite of PassageA Sighting of filterA in Typelevel Rite of Passage
A Sighting of filterA in Typelevel Rite of Passage
 

PyconJP: Building a data preparation pipeline with Pandas and AWS Lambda

  • 1. PyconJP 2016-09-22 Fabian Dubois Building a data preparation pipeline with Pandas and AWS Lambda
  • 2. What Will You Learn? ▸ What data preparation is and why it is required. ▸ How to prepare data with pandas. ▸ How to set up a pipeline with AWS Lambda.
  • 3. About Me ▸ Based in Tokyo ▸ Using Python with data for 6 years ▸ Freelance Data Products Developer and Consultant (data visualization, machine learning) ▸ Formerly at Orange Labs and Locarise (connected sensors data processing and visualization) ▸ Current side project: denryoku.io, an API for electric grid power demand and capacity prediction.
  • 5. So you have got data, now what? ▸ Showing it to an audience: ▸ a report from a survey? ▸ a news article with charts? ▸ a sales dashboard?
  • 6. But a lot of available data is messy ▸ incomplete or missing data ▸ mis-formatted, mis-typed data ▸ wrong / corrupted values
  • 7. It has all the reasons to be messy ▸ non-availability ▸ no appropriate means of collection ▸ lack of validation ▸ human errors
  • 8. And this can have very bad consequences ▸ Crash in your report generator ▸ Incomplete reports ▸ Reports that reach wrong conclusions ▸ Ultimately, if your data is really bad, you cannot trust any conclusion from it
  • 9. It is not just about quality (ETL) ▸ Enriching the data ▸ Aggregating ▸ Classification (ML) ▸ Predictions (ML) [Diagram: input 1 and input 2 are each cleaned, then aggregated, classified, … into an output that is visualized]
  • 10. Example: data journalism & interactive visualization ▸ Data is often gathered manually in spreadsheets ▸ Data cleaning required ▸ Data aggregation / preprocessing required ▸ Data may be updated on a weekly basis
  • 11. If it is a product, it needs to deal with data updates ▸ Who is going to run the script? ▸ Needs to be automated (the pipeline) [Diagram: current data and new data → preparation script → visualisation-ready data → visualisation]
  • 12. What does it apply to? [Chart: axes are data update frequency (once, monthly, daily, real-time) and data quality (low to high). Applications: ad hoc data analysis; data journalism, interactive reports, email reports; dashboards, data products. Solutions: Jupyter notebook (prototype); automated preparation pipeline (batch), our focus; micro-batch or real-time processing pipeline]
  • 14. Common operations ▸ Date parsing ▸ Deciding on a strategy for null or non-parseable values ▸ Enforcing value ranges ▸ Sanitising strings
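The four operations above can be sketched with pandas as follows. This is only an illustration: the column names (`name`, `age`, `date`), the 0 to 120 age range, and the choice to drop unparseable rows are assumptions, not taken from the talk, and the code targets Python 3.

```python
import pandas as pd


def prepare(df):
    """Apply the four common operations to a raw frame (hypothetical columns)."""
    out = df.copy()
    # Sanitise strings: trim whitespace and normalise capitalisation
    out["name"] = out["name"].str.strip().str.title()
    # Date parsing: unparseable dates become NaT instead of raising
    out["date"] = pd.to_datetime(out["date"], errors="coerce")
    # Null / non-parseable strategy: coerce bad numbers to NaN, then drop those rows
    out["age"] = pd.to_numeric(out["age"], errors="coerce")
    out = out.dropna(subset=["date", "age"])
    # Enforce value ranges: keep only plausible ages
    out = out[out["age"].between(0, 120)]
    return out.reset_index(drop=True)


raw = pd.DataFrame({
    "name": [" alice ", "BOB", "carol", "dave"],
    "age": ["34", "n/a", "250", "41"],
    "date": ["2016-09-22", "2016-09-21", "2016-09-20", "not a date"],
})
clean = prepare(raw)
```

Dropping rows is only one possible strategy; filling with a default or flagging the row for review may suit other datasets better.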
  • 15. Existing tools ▸ Trifacta Wrangler, Talend Dataprep, Google Open Refine ▸ great tools to check data quality and define transformations
  • 16. So why custom solutions with Python and Pandas? ▸ With Python, you can do anything! ▸ It is not that difficult ▸ Pandas is a versatile tool for manipulating DataFrames ▸ Easy to specify transformations ▸ Not limited to Pandas: the whole Python ecosystem is available, such as scikit-learn
  • 17. Example from a Jupyter notebook ▸ Load a simple file with a list of names and ages of different persons
  • 18. Example: statistics on groups (names) ▸ Is there a relationship between name length and median age? ▸ Chain operations ▸ Plot the length of name vs age for each name [Plot annotation: Warning, outlier]
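The notebook cells themselves are not in the transcript, so the following is a guess at what such a chain might look like. The tiny in-memory frame stands in for the loaded file, and the column names `name` and `age` are assumptions.

```python
import pandas as pd

# Stand-in for the file of names and ages loaded in the notebook
people = pd.DataFrame({
    "name": ["Ann", "Ann", "Robert", "Robert", "Li"],
    "age": [30, 40, 50, 70, 22],
})

# One chained expression: group by name, take the median age,
# then add the name length as a new column
stats = (
    people
    .groupby("name", as_index=False)["age"]
    .median()
    .rename(columns={"age": "median_age"})
    .assign(name_length=lambda d: d["name"].str.len())
    .sort_values("name_length")
    .reset_index(drop=True)
)
```

With matplotlib installed, `stats.plot.scatter(x="name_length", y="median_age")` would draw the name length against the median age.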
  • 19. Something is wrong ▸ null values ▸ label issues
  • 20. Let’s fix this ▸ deal with missing values with `dropna` or `fillna` ▸ clean names ▸ reject outliers
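A minimal sketch of these three fixes, assuming the same hypothetical `name` and `age` columns. Filling missing ages with the median and the cut-off of 120 are illustrative choices, not the talk's.

```python
import pandas as pd


def fix(df, max_age=120):
    """Clean names, handle missing values, and reject outlier ages."""
    out = df.copy()
    # Clean names: trim whitespace and unify capitalisation so labels group together
    out["name"] = out["name"].str.strip().str.capitalize()
    # Missing values: drop rows without a name, fill missing ages with the median
    out = out.dropna(subset=["name"])
    out = out.assign(age=out["age"].fillna(out["age"].median()))
    # Reject outliers: keep only ages inside the accepted range
    return out[out["age"].between(0, max_age)].reset_index(drop=True)


raw = pd.DataFrame({
    "name": [" ann", "Ann ", None, "bob", "Bob"],
    "age": [30, None, 25, 40, 400],
})
fixed = fix(raw)
```

Whether to use `dropna` or `fillna` depends on the column: a row without a name cannot be grouped, whereas a missing age can sometimes be imputed.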
  • 21. Close the loop to improve the data entry/acquisition ▸ Many errors can be avoided during data collection: ▸ form / column validation ▸ drop-down selections for categories ▸ Report rejected rows to improve the collection process [Diagram: data → preparation script → list of issues → improve forms…]
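One way to produce such a list of issues is to keep the rejected rows together with a reason instead of silently dropping them. The sketch below assumes hypothetical `name` and `age` columns; the reason strings are made up for illustration.

```python
import pandas as pd


def split_valid_rejected(df, max_age=120):
    """Return (valid_rows, issues); issues has one row per rejected input row."""
    reasons = pd.Series("", index=df.index)
    reasons[df["name"].isna()] = "missing name"
    bad_age = df["age"].isna() | ~df["age"].between(0, max_age)
    # Only record the age problem if the row has no earlier reason
    reasons[bad_age & (reasons == "")] = "age missing or out of range"
    rejected = reasons != ""
    issues = df[rejected].assign(reason=reasons[rejected])
    return df[~rejected], issues


raw = pd.DataFrame({"name": ["Ann", None, "Bob"], "age": [30, 25, 400]})
valid, issues = split_valid_rejected(raw)
```

The `issues` frame can then be written to a file or emailed to whoever maintains the data entry form.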
  • 22. Testing your preparation ▸ Unit tests ▸ Test for anticipated edge cases (defensive programming) ▸ Property-based testing (http://hypothesis.works/)
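As an illustration of unit tests for anticipated edge cases, here is a small cleaning function with plain assertion-style tests that a runner such as pytest could collect. The function `clean_ages` and its 0 to 120 range are hypothetical; property-based testing with Hypothesis is not shown.

```python
import pandas as pd


def clean_ages(ages, max_age=120):
    """Convert to numbers; unparseable or out-of-range values become NaN."""
    numeric = pd.to_numeric(ages, errors="coerce")
    return numeric.where(numeric.between(0, max_age))


def test_valid_age_is_kept():
    assert clean_ages(pd.Series(["34"])).iloc[0] == 34


def test_unparseable_and_missing_become_nan():
    result = clean_ages(pd.Series(["abc", None]))
    assert result.isna().all()


def test_out_of_range_is_rejected():
    result = clean_ages(pd.Series(["-1", "250"]))
    assert result.isna().all()
```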
  • 23. More references for data cleaning ▸ Data cleaning with Pandas: https://www.youtube.com/watch?v=_eQ_8U5kruQ ▸ Data cleanup with Python: http://kjamistan.com/automating-your-data-cleanup-with-python/ ▸ Modern Pandas: Tidy Data: https://tomaugspurger.github.io/modern-5-tidy.html
  • 24. Setting up a pipeline with AWS Lambda.
  • 25. Some challenges ▸ Don’t let users run scripts ▸ Automating is part of a quality process ▸ Keeping things simple… ▸ and cheap
  • 26. What is AWS Lambda: a serverless solution ▸ Serverless offering by AWS ▸ No lifecycle to manage or shared state => resilient ▸ Auto-scaling ▸ Pay for actual running time: low cost ▸ No server or infrastructure management: reduced dev / devops cost [Diagram: events → lambda function → output]
  • 27. Creating a function: just a Python function
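The slide's screenshot is not in the transcript. A minimal handler might look like the sketch below: the `(event, context)` signature is what Lambda passes to a Python handler, while the `rows` key and the returned dictionary are invented for illustration.

```python
def lambda_handler(event, context):
    """Entry point that AWS Lambda calls with the trigger's event payload."""
    # `event` is the decoded payload; "rows" is a hypothetical key
    rows = event.get("rows", [])
    return {"row_count": len(rows)}
```

Because it is an ordinary function, it can be called locally, for example `lambda_handler({"rows": [1, 2, 3]}, None)`, before uploading anything.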
  • 28. Creating a function: options
  • 29. Creating an “architecture” with triggers
  • 30. Batch processing at regular intervals ▸ cron scheduling ▸ Let your function get some data and process it at regular intervals
  • 31. An API / webhook ▸ On API call ▸ Can be triggered from a Google spreadsheet
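A webhook-style handler could be sketched as follows. It assumes the function sits behind API Gateway with Lambda proxy integration, where the request body arrives as a JSON string under `body` and the response is a dictionary with `statusCode`, `headers`, and `body`. The talk does not say which integration was used, and the `rows` payload key is hypothetical.

```python
import json


def lambda_handler(event, context):
    """Handle an HTTP call forwarded by API Gateway (proxy integration assumed)."""
    # The request body is a JSON string; treat a missing body as an empty object
    payload = json.loads(event.get("body") or "{}")
    rows = payload.get("rows", [])  # hypothetical payload key
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps({"received": len(rows)}),
    }
```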
  • 32. Setting up AWS Lambda for Pandas
    Pandas and its dependencies need to be compiled for Amazon Linux x86_64:
    1. Launch an EC2 instance and connect to it
    2. Install pandas in a virtualenv
    3. Zip the installed libraries
    shell:
        # install compilation environment
        sudo yum -y update
        sudo yum -y upgrade
        sudo yum groupinstall "Development Tools"
        sudo yum install blas blas-devel lapack lapack-devel Cython --enablerepo=epel
        # create and activate virtual env
        virtualenv pdenv
        source pdenv/bin/activate
        # install pandas
        pip install pandas
        # zip the environment content (pattern quoted so the shell does not expand it)
        cd ~/pdenv/lib/python2.7/site-packages/
        zip -r ~/pdenv.zip . --exclude "*.pyc"
        cd ~/pdenv/lib64/python2.7/site-packages/
        zip -r ~/pdenv.zip . --exclude "*.pyc"
        # add the supporting libraries
        cd ~/
        mkdir -p libs
        cp /usr/lib64/liblapack.so.3 /usr/lib64/libblas.so.3 /usr/lib64/libgfortran.so.3 /usr/lib64/libquadmath.so.0 libs/
        zip -r ~/pdenv.zip libs
  • 33. Using pandas from a lambda function
    ▸ The lambda process needs to access those binaries
    ▸ Set up env variables
    ▸ Call a subprocess
    ▸ And pickle the function input
    ▸ AWS will call `lambda_function.lambda_handler`
    python: lambda_function.py
        import os, sys, subprocess, json
        import cPickle as pickle

        # directory holding the bundled shared libraries; it must match
        # their location inside the uploaded zip
        LIBS = os.path.join(os.getcwd(), 'local', 'lib')

        def handler(filename):
            def handle(event, context):
                # pass the event to the subprocess through a pickle file
                with open("/tmp/event.p", "wb") as f:
                    pickle.dump(event, f)
                env = os.environ.copy()
                env.update(LD_LIBRARY_PATH=LIBS)
                proc = subprocess.Popen(
                    ('python', filename),
                    env=env,
                    stdout=subprocess.PIPE)
                # communicate() reads stdout while waiting, so a large
                # output cannot block the child process
                out, _ = proc.communicate()
                return out
            return handle

        lambda_handler = handler('my_function.py')
  • 34. The actual function
    ▸ Get the input data from a Google spreadsheet, a CSV file on S3, an FTP server
    ▸ Clean it
    ▸ Copy it somewhere
    python: my_function.py
        import pandas as pd
        import pickle
        import requests
        from StringIO import StringIO

        def run():
            # get the lambda call arguments
            with open("/tmp/event.p", "rb") as f:
                event = pickle.load(f)
            # load some data from a google spreadsheet
            # ({sheet_id} and {page_id} are placeholders to fill in)
            r = requests.get('https://docs.google.com/spreadsheets'
                             + '/d/{sheet_id}/export?format=csv&gid={page_id}')
            data = r.content.decode('utf-8')
            df = pd.read_csv(StringIO(data))
            # Do something
            # save as file
            file_ = StringIO()
            df.to_csv(file_, encoding='utf-8')
            # copy the result somewhere

        if __name__ == '__main__':
            run()
  • 35. upload and test ▸ add your lambda function code to the environment zip. ▸ upload your function
  • 36. Caveat 1: Python 2.7 ▸ Officially, only Python 2.7 is supported ▸ But Python 3 is available and can be called as a subprocess ▸ Details here: http://www.cloudtrek.com.au/blog/running-python-3-on-aws-lambda/
  • 37. Caveat 2: max process memory (1.5GB) and execution time ▸ Need to split the dataset if it is too large ▸ Loop over the pieces in your lambda call: ▸ may exceed the timeout ▸ Map to multiple lambda calls: ▸ need to merge the dataset at the end ▸ Lambda functions should be simple; chain them if required
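The split, map, and merge idea can be sketched locally with pandas. Here `process_chunk` stands in for whatever one Lambda invocation would do and runs in-process; the actual invocation and the storage of partial results are left out, and all names are hypothetical.

```python
import pandas as pd


def split_frame(df, chunk_size):
    """Cut a frame into consecutive pieces of at most chunk_size rows."""
    return [df.iloc[i:i + chunk_size] for i in range(0, len(df), chunk_size)]


def process_chunk(chunk):
    # Stand-in for the row-wise work done by one Lambda call
    return chunk.assign(age_in_months=chunk["age"] * 12)


def merge_results(parts):
    """Concatenate the processed pieces back into one frame."""
    return pd.concat(parts, ignore_index=True)


people = pd.DataFrame({"name": list("abcde"), "age": [1, 2, 3, 4, 5]})
chunks = split_frame(people, chunk_size=2)
merged = merge_results([process_chunk(c) for c in chunks])
```

This simple merge only works for row-wise transformations; statistics such as a median over the whole dataset need a dedicated merge step.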
  • 39. Takeaways ▸ Know your data and your target ▸ Pandas can solve many issues ▸ Defensive programming and closing the loop ▸ AWS Lambda is a powerful and flexible tool for time and resource constrained teams