DATA SCIENCE MEETS
SOFTWARE DEVELOPMENT
Alexis Seigneurin - Ippon Technologies
Who I am
• Software engineer for 15 years
• Consultant at IpponTech in Paris, France
• Favorite subjects: Spark, Cassandra, Ansible, Docker
• @aseigneurin
• 200 software engineers in France and the US
• In the US: offices in DC, NYC and Richmond, Virginia
• Digital, Big Data and Cloud applications
• Java & Agile expertise
• Open-source projects: JHipster, Tatami, etc.
• @ipponusa
The project
• Data Innovation Lab of a large insurance company
• Data → Business value
• Team of 30 Data Scientists + Software Developers
Data Scientists
Who they are
&
How they work
Skill set of a Data Scientist
• Strong in:
• Science (maths / statistics)
• Machine Learning
• Analyzing data
• Good / average in:
• Programming
• Not good in:
• Software engineering
Programming languages
• Mostly Python, incl. frameworks:
• NumPy
• Pandas
• Scikit-Learn
• SQL
• R
Development environments
• IPython Notebook
• Dataiku
Machine Learning
• Algorithms:
• Logistic Regression
• Decision trees
• Random forests
• Implementations:
• Dataiku
• Scikit-Learn
• Vowpal Wabbit
Programmers
Who they are
&
How they work
http://xkcd.com/378/
Skill set of a Developer
• Strong in:
• Software engineering
• Programming
• Good / average in:
• Science (maths / statistics)
• Analyzing data
• Not good in:
• Machine Learning
How Developers work
• Programming languages
• Java
• Scala
• Development environment
• Eclipse
• IntelliJ IDEA
• Toolbox
• Maven
• …
A typical Data Science project
In the Lab
Workflow
1. Data Cleansing
2. Feature Engineering
3. Train a Machine Learning model
1. Split the dataset: training/validation/test datasets
2. Train the model
4. Apply the model on new data
Data Cleansing
• Convert strings to numbers/booleans/…
• Parse dates
• Handle missing values
• Handle data in an incorrect format
• …
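A minimal Pandas sketch of these cleansing steps (the raw values below are made up for illustration):

    import pandas as pd

    # Hypothetical raw data, as it might arrive in a CSV extract
    df = pd.DataFrame({
        "premium":    ["120.5", "n/a", "98.0"],
        "active":     ["yes", "no", "yes"],
        "birth_date": ["12/05/1980", "", "03/01/1975"],
    })

    # Convert strings to numbers / booleans
    df["premium"] = pd.to_numeric(df["premium"], errors="coerce")
    df["active"] = df["active"].map({"yes": True, "no": False})

    # Parse dates (invalid values become NaT instead of raising)
    df["birth_date"] = pd.to_datetime(df["birth_date"], errors="coerce", dayfirst=True)

    # Handle missing values and incorrectly formatted rows
    df["premium"] = df["premium"].fillna(df["premium"].median())
    df = df.dropna(subset=["birth_date"])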
Feature Engineering
• Transform data into numerical features
• E.g.:
• A birth date → age
• Dates of phone calls → Number of calls
• Text → Vector of words
• 2 names → Levenshtein distance
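A short Pandas/scikit-learn sketch of these transformations (the data and column names are hypothetical; the last step assumes the python-Levenshtein package is available):

    import pandas as pd
    import Levenshtein  # python-Levenshtein package, assumed available
    from sklearn.feature_extraction.text import CountVectorizer

    # Hypothetical customer and phone-call data
    df = pd.DataFrame({
        "customer_id": [1, 2],
        "birth_date": pd.to_datetime(["1980-05-12", "1975-01-03"]),
        "comment": ["great service", "please call me back"],
        "name_1": ["Dupont", "Martin"],
        "name_2": ["Dupond", "Martyn"],
    })
    calls = pd.DataFrame({"customer_id": [1, 1, 2]})

    # A birth date → age (approximate, in years)
    df["age"] = (pd.Timestamp.now() - df["birth_date"]).dt.days // 365

    # Dates of phone calls → number of calls per customer
    nb_calls = calls.groupby("customer_id").size().rename("nb_calls").reset_index()
    df = df.merge(nb_calls, on="customer_id", how="left")

    # Text → vector of words (bag of words)
    word_vectors = CountVectorizer().fit_transform(df["comment"])

    # 2 names → Levenshtein distance
    df["name_distance"] = df.apply(
        lambda r: Levenshtein.distance(r["name_1"], r["name_2"]), axis=1)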
Machine Learning
• Train a model
• Test an algorithm with different params
• Cross validation (Grid Search)
• Compare different algorithms, e.g.:
• Logistic regression
• Gradient boosting trees
• Random forest
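A minimal scikit-learn sketch of this step, using the current model_selection API (the synthetic dataset stands in for the engineered features and target):

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split, GridSearchCV
    from sklearn.linear_model import LogisticRegression
    from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier

    # Synthetic data standing in for the engineered features and target
    X, y = make_classification(n_samples=500, n_features=10, random_state=0)

    # Split the dataset
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)

    # Cross-validated grid search over hyper-parameters
    grid = GridSearchCV(RandomForestClassifier(),
                        param_grid={"n_estimators": [100, 300],
                                    "max_depth": [5, 10, None]},
                        cv=5, scoring="roc_auc")
    grid.fit(X_train, y_train)

    # Compare different algorithms with the same protocol
    for model in [LogisticRegression(max_iter=1000),
                  GradientBoostingClassifier(),
                  grid.best_estimator_]:
        model.fit(X_train, y_train)
        print(type(model).__name__, model.score(X_test, y_test))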
Machine Learning
• Evaluate the accuracy of the model
• Root Mean Square Error (RMSE)
• ROC curve
• …
• Examine predictions
• False positives, false negatives…
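A minimal sketch of the evaluation step, assuming a fitted classifier `model` and the held-out X_test / y_test from the previous sketch:

    import numpy as np
    from sklearn.metrics import confusion_matrix, mean_squared_error, roc_auc_score

    y_pred = model.predict(X_test)
    y_proba = model.predict_proba(X_test)[:, 1]

    # RMSE (mostly relevant for regression problems)
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))

    # Area under the ROC curve
    auc = roc_auc_score(y_test, y_proba)

    # Examine predictions: false positives / false negatives
    tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
    print("RMSE=%.3f  AUC=%.3f  FP=%d  FN=%d" % (rmse, auc, fp, fn))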
Industrialization
Cookbook
Disclaimer
• Context of this project:
• Not So Big Data (but Smart Data)
• No real-time workflows (yet?)
Distribute the processing
R E C I P E # 1
Distribute the processing
• Data Scientists work with data samples
• No constraint on processing time
• Processing on the Data Scientist’s workstation (IPython Notebook) or on a single server (Dataiku)
Distribute the processing
• In production:
• H/W resources are constrained
• Large data sets to process
• Spark:
• Included in CDH
• DataFrames (Spark 1.3+) ≃ Pandas DataFrames
• Fast!
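A minimal PySpark sketch of the DataFrame analogy (written against the Spark 2.x SparkSession API; in Spark 1.3-1.6 the entry point was SQLContext; the data below is made up):

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("scoring").getOrCreate()

    # In production the data would be read from HDFS, e.g. spark.read.parquet(...)
    df = spark.createDataFrame(
        [("Alice", "IDF", 120.5), ("Bob", "PACA", 98.0)],
        ["name", "region", "premium"])

    # Same logic as a Pandas groupby/filter, but executed on the cluster
    result = (df.filter(F.col("premium") > 0)
                .groupBy("region")
                .agg(F.avg("premium").alias("avg_premium")))
    result.show()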
Use a centralized data store
R E C I P E # 2
Use a centralized data store
• Data Scientists store data on their workstations
• Limited storage
• Data not shared within the team
• Data privacy not enforced
• Subject to data losses
Use a centralized data store
• Store data on HDFS:
• Hive tables (SQL)
• Parquet files
• Security: Kerberos + permissions
• Redundant + potentially unlimited storage
• Easy access from Spark and Dataiku
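A short sketch of what reading and writing the shared store looks like from Spark (paths and table names are illustrative):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.enableHiveSupport().getOrCreate()

    # Parquet files on HDFS
    contracts = spark.read.parquet("hdfs:///data/lab/contracts.parquet")

    # Hive tables (SQL)
    customers = spark.sql("SELECT * FROM lab.customers")

    # Write results back so the whole team can reuse them
    contracts.write.mode("overwrite").parquet("hdfs:///data/lab/contracts_clean.parquet")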
Rationalize the use of programming languages
R E C I P E # 3
Programming languages
• Data Scientists write code on their workstations
• This code may not run in the datacenter
• Language variety → Hard to share knowledge
Programming languages
• Use widely spread languages
• Spark in Python/Scala
• Support for R is too young
• Provide assistance to ease the adoption!
Use an IDE
R E C I P E # 4
Use an IDE
• Notebooks:
• Powerful for exploratory work
• Weak for code editing and code structuring
• Inadequate for code versioning
Use an IDE
• IntelliJ IDEA / PyCharm
• Code compilation
• Refactoring
• Execution of unit tests
• Support for Git
Source Control
R E C I P E # 5
Source Control
• Data Scientists work on their workstations
• Code is not shared
• Code may be lost
• Intermediate versions are not preserved
• Lack of code review
Source Control
• Git + GitHub / GitLab
• Versioning
• Easy to go back to a version running in production
• Easy sharing (+permissions)
• Code review
Packaging the code
R E C I P E # 6
Packaging the code
• Source code has dependencies
• Dependencies in production ≠ at dev time
• Assemble the code + its dependencies
Packaging the code
• Freeze the dependencies:
• Scala → Maven
• Python → Setuptools
• Packaging:
• Scala → Jar (Maven Shade plugin)
• Python → Egg (Setuptools)
• Compatible with spark-submit
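A minimal setup.py sketch for the Python side (package name and pinned versions are illustrative):

    from setuptools import setup, find_packages

    setup(
        name="datalab-pipeline",          # hypothetical package name
        version="1.0.0",
        packages=find_packages(),
        # Freeze the dependencies used at dev time
        install_requires=[
            "numpy==1.9.2",
            "pandas==0.16.2",
            "scikit-learn==0.16.1",
        ],
    )

    # Build the egg, then ship it with: spark-submit --py-files dist/*.egg ...
    #   python setup.py bdist_egg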
Secure the build process
R E C I P E # 7
Secure the build process
• Data Scientists may commit code… without running tests first!
• Quality may decrease over time
• Packages built by hand on a workstation are not reproducible
Secure the build process
• Jenkins
• Unit test report
• Code coverage report
• Packaging: Jar / Egg
• Dashboard
• Notifications (Slack + email)
Automate the process
R E C I P E # 8
Automate the process
• Data is loaded manually in HDFS:
• CSV files, sometimes compressed
• Often received by email
• Often samples
Automate the process
• No human intervention should be required
• All steps should be code / tools
• E.g. automate file transfers, unzipping…
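A minimal Python sketch of such an automated step (the host, paths and file names are made up):

    import subprocess
    import zipfile

    local_zip = "/tmp/extract.zip"
    local_dir = "/tmp/extract"

    # Fetch the file instead of receiving it by email (illustrative scp source)
    subprocess.run(["scp", "partner-host:/exports/extract.zip", local_zip], check=True)

    # Unzip
    with zipfile.ZipFile(local_zip) as z:
        z.extractall(local_dir)

    # Push to HDFS where the Spark jobs can read it
    subprocess.run(["hdfs", "dfs", "-put", "-f", local_dir, "/data/lab/incoming/"],
                   check=True)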
Adapt to living data
R E C I P E # 9
Adapt to living data
• Data Scientists work with:
• Frozen data
• Samples
• Risks with data received on a regular basis:
• Incorrect format (dates, numbers…)
• Corrupt data (incl. encoding changes)
• Missing values
Adapt to living data
• Data Checking & Cleansing
• Preliminary steps before processing the data
• Decide what to do with invalid data
• Thetis
• Internal tool
• Performs most checking & cleansing operations
Provide a library of transformations
R E C I P E # 1 0
Library of transformations
• Dataiku « shakers »:
• Parse dates
• Split a URL (protocol, host, path, …)
• Transform a post code into a city / department name
• …
• Cannot be used outside Dataiku
Library of transformations
• All transformations should be code
• Reuse transformations between projects
• Provide a library
• Transformation = DataFrame → DataFrame
• Unit tests
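A sketch of what one such reusable transformation could look like (the function and column names are hypothetical):

    from pyspark.sql import DataFrame
    from pyspark.sql import functions as F

    def add_age(df: DataFrame, birth_col: str = "birth_date") -> DataFrame:
        """Hypothetical transformation: derive an 'age' column from a date column."""
        return df.withColumn(
            "age", F.floor(F.datediff(F.current_date(), F.col(birth_col)) / 365))

    # Transformations compose naturally: each one takes and returns a DataFrame
    # cleaned = add_age(parse_dates(raw_df))   # parse_dates: another such function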
Unit test the data pipeline
R E C I P E # 1 1
Unit test the data pipeline
• Independent data processing steps
• Data pipeline not often tested from beginning to end
• Data pipeline easily broken
Unit test the data pipeline
• Unit test each data transformation stage
• Scala: ScalaTest
• Python: unittest
• Use mock data
• Compare DataFrames:
• No library (yet?)
• Compare lists of lists
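A minimal unittest sketch along these lines (the transformation under test is hypothetical; DataFrames are compared as lists of lists):

    import unittest
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    def parse_premium(df):
        # Hypothetical transformation under test: cast a string column to double
        return df.withColumn("premium", F.col("premium").cast("double"))

    class ParsePremiumTest(unittest.TestCase):
        @classmethod
        def setUpClass(cls):
            cls.spark = SparkSession.builder.master("local[1]").getOrCreate()

        def test_invalid_values_become_null(self):
            df = self.spark.createDataFrame(
                [("a", "10.5"), ("b", "n/a")], ["id", "premium"])
            result = parse_premium(df)
            # No DataFrame comparison library: compare as lists of lists
            self.assertEqual([list(r) for r in result.collect()],
                             [["a", 10.5], ["b", None]])

    if __name__ == "__main__":
        unittest.main()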
Assemble the workflow
R E C I P E # 1 2
Assemble the workflow
• Separate transformation processes:
• Transformations applied to some data
• Results are frozen and used in other processes
• Jobs are launched manually
• No built-in scheduler in Spark
Assemble the workflow
• Oozie:
• Spark
• Map-Reduce
• Shell
• …
• Scheduling
• Alerts
• Logs
Summary
&
Conclusion
Summary
• Keys:
• Use industrialization-ready tools
• Pair Programming: Data Scientist + Developer
• Success criteria:
• Lower time to market
• Higher processing speed
• More robust processes
Thank you!
@aseigneurin - @ipponusa
