DATA SCIENCE MEETS
SOFTWARE DEVELOPMENT
Alexis Seigneurin - Ippon Technologies
Who I am
• Software engineer for 15 years
• Consultant at IpponTech in Paris, France
• Favorite subjects: Spark, Cassandra, Ansible, Docker
• @aseigneurin
• 200 software engineers in France and the US
• In the US: offices in DC, NYC and Richmond, Virginia
• Digital, Big Data and Cloud applications
• Java & Agile expertise
• Open-source projects: JHipster, Tatami, etc.
• @ipponusa
The project
• Data Innovation Lab of a large insurance company
• Data → Business value
• Team of 30 Data Scientists + Software Developers
Data Scientists
Who they are
&
How they work
Skill set of a Data Scientist
• Strong in:
• Science (maths / statistics)
• Machine Learning
• Analyzing data
• Good / average in:
• Programming
• Not good in:
• Software engineering
Programming languages
• Mostly Python, incl. frameworks:
• NumPy
• Pandas
• Scikit-Learn
• SQL
• R
Development environments
• IPython Notebook
• Dataiku
Machine Learning
• Algorithms:
• Logistic Regression
• Decision trees
• Random forests
• Implementations:
• Dataiku
• Scikit-Learn
• Vowpal Wabbit
Programmers
Who they are
&
How they work
http://xkcd.com/378/
Skill set of a Developer
• Strong in:
• Software engineering
• Programming
• Good / average in:
• Science (maths / statistics)
• Analyzing data
• Not good in:
• Machine Learning
How Developers work
• Programming languages
• Java
• Scala
• Development environment
• Eclipse
• IntelliJ IDEA
• Toolbox
• Maven
• …
A typical Data Science project
In the Lab
Workflow
1. Data Cleansing
2. Feature Engineering
3. Train a Machine Learning model
1. Split the dataset: training/validation/test datasets
2. Train the model
4. Apply the model on new data
Data Cleansing
• Convert strings to numbers/booleans/…
• Parse dates
• Handle missing values
• Handle data in an incorrect format
• …
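A minimal Pandas sketch of these cleansing steps (the raw values below are made up for illustration):

    import pandas as pd

    # Hypothetical raw data, as it might arrive in a CSV extract
    df = pd.DataFrame({
        "premium":    ["120.5", "n/a", "98.0"],
        "active":     ["yes", "no", "yes"],
        "birth_date": ["12/05/1980", "", "03/01/1975"],
    })

    # Convert strings to numbers / booleans
    df["premium"] = pd.to_numeric(df["premium"], errors="coerce")
    df["active"] = df["active"].map({"yes": True, "no": False})

    # Parse dates (invalid values become NaT instead of raising)
    df["birth_date"] = pd.to_datetime(df["birth_date"], errors="coerce", dayfirst=True)

    # Handle missing values and incorrectly formatted rows
    df["premium"] = df["premium"].fillna(df["premium"].median())
    df = df.dropna(subset=["birth_date"])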
Feature Engineering
• Transform data into numerical features
• E.g.:
• A birth date → age
• Dates of phone calls → Number of calls
• Text → Vector of words
• 2 names → Levenshtein distance
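A short Pandas/scikit-learn sketch of these transformations (the data and column names are hypothetical; the last step assumes the python-Levenshtein package is available):

    import pandas as pd
    import Levenshtein  # python-Levenshtein package, assumed available
    from sklearn.feature_extraction.text import CountVectorizer

    # Hypothetical customer and phone-call data
    df = pd.DataFrame({
        "customer_id": [1, 2],
        "birth_date": pd.to_datetime(["1980-05-12", "1975-01-03"]),
        "comment": ["great service", "please call me back"],
        "name_1": ["Dupont", "Martin"],
        "name_2": ["Dupond", "Martyn"],
    })
    calls = pd.DataFrame({"customer_id": [1, 1, 2]})

    # A birth date → age (approximate, in years)
    df["age"] = (pd.Timestamp.now() - df["birth_date"]).dt.days // 365

    # Dates of phone calls → number of calls per customer
    nb_calls = calls.groupby("customer_id").size().rename("nb_calls").reset_index()
    df = df.merge(nb_calls, on="customer_id", how="left")

    # Text → vector of words (bag of words)
    word_vectors = CountVectorizer().fit_transform(df["comment"])

    # 2 names → Levenshtein distance
    df["name_distance"] = df.apply(
        lambda r: Levenshtein.distance(r["name_1"], r["name_2"]), axis=1)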
Machine Learning
• Train a model
• Test an algorithm with different params
• Cross validation (Grid Search)
• Compare different algorithms, e.g.:
• Logistic regression
• Gradient boosting trees
• Random forest
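A minimal scikit-learn sketch of this step, using the current model_selection API (the synthetic dataset stands in for the engineered features and target):

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split, GridSearchCV
    from sklearn.linear_model import LogisticRegression
    from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier

    # Synthetic data standing in for the engineered features and target
    X, y = make_classification(n_samples=500, n_features=10, random_state=0)

    # Split the dataset
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)

    # Cross-validated grid search over hyper-parameters
    grid = GridSearchCV(RandomForestClassifier(),
                        param_grid={"n_estimators": [100, 300],
                                    "max_depth": [5, 10, None]},
                        cv=5, scoring="roc_auc")
    grid.fit(X_train, y_train)

    # Compare different algorithms with the same protocol
    for model in [LogisticRegression(max_iter=1000),
                  GradientBoostingClassifier(),
                  grid.best_estimator_]:
        model.fit(X_train, y_train)
        print(type(model).__name__, model.score(X_test, y_test))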
Machine Learning
• Evaluate the accuracy of the model
• Root Mean Square Error (RMSE)
• ROC curve
• …
• Examine predictions
• False positives, false negatives…
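A minimal sketch of the evaluation step, assuming a fitted classifier `model` and the held-out X_test / y_test from the previous sketch:

    import numpy as np
    from sklearn.metrics import confusion_matrix, mean_squared_error, roc_auc_score

    y_pred = model.predict(X_test)
    y_proba = model.predict_proba(X_test)[:, 1]

    # RMSE (mostly relevant for regression problems)
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))

    # Area under the ROC curve
    auc = roc_auc_score(y_test, y_proba)

    # Examine predictions: false positives / false negatives
    tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
    print("RMSE=%.3f  AUC=%.3f  FP=%d  FN=%d" % (rmse, auc, fp, fn))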
Industrialization
Cookbook
Disclaimer
• Context of this project:
• Not So Big Data (but Smart Data)
• No real-time workflows (yet?)
Distribute the processing
R E C I P E # 1
Distribute the processing
• Data Scientists work with data samples
• No constraint on processing time
• Processing on the Data Scientist’s workstation (IPython Notebook) or on a single server (Dataiku)
Distribute the processing
• In production:
• H/W resources are constrained
• Large data sets to process
• Spark:
• Included in CDH
• DataFrames (Spark 1.3+) ≃ Pandas DataFrames
• Fast!
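A minimal PySpark sketch of the DataFrame analogy (written against the Spark 2.x SparkSession API; in Spark 1.3-1.6 the entry point was SQLContext; the data below is made up):

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("scoring").getOrCreate()

    # In production the data would be read from HDFS, e.g. spark.read.parquet(...)
    df = spark.createDataFrame(
        [("Alice", "IDF", 120.5), ("Bob", "PACA", 98.0)],
        ["name", "region", "premium"])

    # Same logic as a Pandas groupby/filter, but executed on the cluster
    result = (df.filter(F.col("premium") > 0)
                .groupBy("region")
                .agg(F.avg("premium").alias("avg_premium")))
    result.show()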
Use a centralized data store
R E C I P E # 2
Use a centralized data store
• Data Scientists store data on their workstations
• Limited storage
• Data not shared within the team
• Data privacy not enforced
• Subject to data losses
Use a centralized data store
• Store data on HDFS:
• Hive tables (SQL)
• Parquet files
• Security: Kerberos + permissions
• Redundant + potentially unlimited storage
• Easy access from Spark and Dataiku
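A short sketch of what reading and writing the shared store looks like from Spark (paths and table names are illustrative):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.enableHiveSupport().getOrCreate()

    # Parquet files on HDFS
    contracts = spark.read.parquet("hdfs:///data/lab/contracts.parquet")

    # Hive tables (SQL)
    customers = spark.sql("SELECT * FROM lab.customers")

    # Write results back so the whole team can reuse them
    contracts.write.mode("overwrite").parquet("hdfs:///data/lab/contracts_clean.parquet")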
Rationalize the use of programming languages
R E C I P E # 3
Programming languages
• Data Scientists write code on their workstations
• This code may not run in the datacenter
• Language variety → Hard to share knowledge
Programming languages
• Use widely spread languages
• Spark in Python/Scala
• Support for R is too young
• Provide assistance to ease the adoption!
Use an IDE
R E C I P E # 4
Use an IDE
• Notebooks:
• Powerful for exploratory work
• Weak for code editing and code structuring
• Inadequate for code versioning
Use an IDE
• IntelliJ IDEA / PyCharm
• Code compilation
• Refactoring
• Execution of unit tests
• Support for Git
Source Control
R E C I P E # 5
Source Control
• Data Scientists work on their workstations
• Code is not shared
• Code may be lost
• Intermediate versions are not preserved
• Lack of code review
Source Control
• Git + GitHub / GitLab
• Versioning
• Easy to go back to a version running in production
• Easy sharing (+permissions)
• Code review
Packaging the code
R E C I P E # 6
Packaging the code
• Source code has dependencies
• Dependencies in production ≠ at dev time
• Assemble the code + its dependencies
Packaging the code
• Freeze the dependencies:
• Scala → Maven
• Python → Setuptools
• Packaging:
• Scala → Jar (Maven Shade plugin)
• Python → Egg (Setuptools)
• Compatible with spark-submit
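A minimal setup.py sketch for the Python side (package name and pinned versions are illustrative):

    from setuptools import setup, find_packages

    setup(
        name="datalab-pipeline",          # hypothetical package name
        version="1.0.0",
        packages=find_packages(),
        # Freeze the dependencies used at dev time
        install_requires=[
            "numpy==1.9.2",
            "pandas==0.16.2",
            "scikit-learn==0.16.1",
        ],
    )

    # Build the egg, then ship it with: spark-submit --py-files dist/*.egg ...
    #   python setup.py bdist_egg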
Secure the build process
R E C I P E # 7
Secure the build process
• Data Scientists may commit code… without running tests first!
• Quality may decrease over time
• Packages built by hand on a workstation are not reproducible
Secure the build process
• Jenkins
• Unit test report
• Code coverage report
• Packaging: Jar / Egg
• Dashboard
• Notifications (Slack + email)
Automate the process
R E C I P E # 8
Automate the process
• Data is loaded manually in HDFS:
• CSV files, sometimes compressed
• Often received by email
• Often samples
Automate the process
• No human intervention should be required
• All steps should be code / tools
• E.g. automate file transfers, unzipping…
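A minimal Python sketch of such an automated step (the host, paths and file names are made up):

    import subprocess
    import zipfile

    local_zip = "/tmp/extract.zip"
    local_dir = "/tmp/extract"

    # Fetch the file instead of receiving it by email (illustrative scp source)
    subprocess.run(["scp", "partner-host:/exports/extract.zip", local_zip], check=True)

    # Unzip
    with zipfile.ZipFile(local_zip) as z:
        z.extractall(local_dir)

    # Push to HDFS where the Spark jobs can read it
    subprocess.run(["hdfs", "dfs", "-put", "-f", local_dir, "/data/lab/incoming/"],
                   check=True)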
Adapt to living data
R E C I P E # 9
Adapt to living data
• Data Scientists work with:
• Frozen data
• Samples
• Risks with data received on a regular basis:
• Incorrect format (dates, numbers…)
• Corrupt data (incl. encoding changes)
• Missing values
Adapt to living data
• Data Checking & Cleansing
• Preliminary steps before processing the data
• Decide what to do with invalid data
• Thetis
• Internal tool
• Performs most checking & cleansing operations
Provide a library of transformations
R E C I P E # 1 0
Library of transformations
• Dataiku « shakers »:
• Parse dates
• Split a URL (protocol, host, path, …)
• Transform a post code into a city / department name
• …
• Cannot be used outside Dataiku
Library of transformations
• All transformations should be code
• Reuse transformations between projects
• Provide a library
• Transformation = DataFrame → DataFrame
• Unit tests
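A sketch of what one such reusable transformation could look like (the function and column names are hypothetical):

    from pyspark.sql import DataFrame
    from pyspark.sql import functions as F

    def add_age(df: DataFrame, birth_col: str = "birth_date") -> DataFrame:
        """Hypothetical transformation: derive an 'age' column from a date column."""
        return df.withColumn(
            "age", F.floor(F.datediff(F.current_date(), F.col(birth_col)) / 365))

    # Transformations compose naturally: each one takes and returns a DataFrame
    # cleaned = add_age(parse_dates(raw_df))   # parse_dates: another such function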
Unit test the data pipeline
R E C I P E # 1 1
Unit test the data pipeline
• Independent data processing steps
• Data pipeline not often tested from beginning to end
• Data pipeline easily broken
Unit test the data pipeline
• Unit test each data transformation stage
• Scala: ScalaTest
• Python: unittest
• Use mock data
• Compare DataFrames:
• No library (yet?)
• Compare lists of lists
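A minimal unittest sketch along these lines (the transformation under test is hypothetical; DataFrames are compared as lists of lists):

    import unittest
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    def parse_premium(df):
        # Hypothetical transformation under test: cast a string column to double
        return df.withColumn("premium", F.col("premium").cast("double"))

    class ParsePremiumTest(unittest.TestCase):
        @classmethod
        def setUpClass(cls):
            cls.spark = SparkSession.builder.master("local[1]").getOrCreate()

        def test_invalid_values_become_null(self):
            df = self.spark.createDataFrame(
                [("a", "10.5"), ("b", "n/a")], ["id", "premium"])
            result = parse_premium(df)
            # No DataFrame comparison library: compare as lists of lists
            self.assertEqual([list(r) for r in result.collect()],
                             [["a", 10.5], ["b", None]])

    if __name__ == "__main__":
        unittest.main()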
Assemble the workflow
R E C I P E # 1 2
Assemble the workflow
• Separate transformation processes:
• Transformations applied to some data
• Results are frozen and used in other processes
• Jobs are launched manually
• No built-in scheduler in Spark
Assemble the workflow
• Oozie:
• Spark
• Map-Reduce
• Shell
• …
• Scheduling
• Alerts
• Logs
Summary
&
Conclusion
Summary
• Keys:
• Use industrialization-ready tools
• Pair Programming: Data Scientist + Developer
• Success criteria:
• Lower time to market
• Higher processing speed
• More robust processes
Thank you!
@aseigneurin - @ipponusa
