© 2019 Ellen Friedman
Doing Data:
Data Prep for Machine Learning
Ellen Friedman, PhD
18 June 2019
Berlin Buzzwords #bbuzz
Contact Information
Ellen Friedman, PhD
Committer Apache Drill & Apache Mahout projects
O’Reilly author
Email ellenf@apache.org
Twitter @Ellen_Friedman Today: #bbuzz
Bloom County cartoon by Berkeley Breathed
https://www.berkeleybreathed.com/
Sometimes, the most powerful solutions are very basic.
That doesn’t necessarily make them easy.
What makes machine learning
work?
The data
Preparation of training data is a big part of the effort.
It is important to keep track of how you prepared it and where the data is archived.
Find the ML code
[Figure: a small box labeled "ML" surrounded by larger boxes labeled Configuration, Data Collection, Feature Extraction, Data Verification, Analysis Tools, Process Management Tools, Machine Resource Management, Serving Infrastructure, and Monitoring]
Figure based on “Hidden Technical Debt in Machine Learning Systems” by Sculley et al. (Google, Inc.) https://papers.nips.cc/paper/5656-hidden-technical-debt-in-machine-learning-systems.pdf
Only a small part of an ML system is the learning code.
The rest is a vast infrastructure of data collection and processing.
Things That Matter in Data Preparation
I. What’s in your data? (really?)
II. How do you know what features to build?
III. How do you know what you did?
Retroactive Value in Data
Labs in Canada froze blood samples for years in case the samples might contain valuable information
✓  They did.
•  Modern genetic techniques revealed key disease data
✓  Correlated with outcomes for the donor patients
•  The data was preserved before the analysis was even begun.
Things That Matter in Data Preparation
I. What’s in your data? (really?)
II. How do you know what features to build?
III. How do you know what you did?
Thanks to these data scientists for their stories
Joe Blue
Director Global Data Science,
MapR
Ted Dunning
Chief Technical Officer,
MapR
A loyal fan of Berlin Buzzwords
Ted Dunning, Berlin 2018
Machine Learning in the Real World vs Kaggle
https://visibleearth.nasa.gov/view.php?id=56229
I. What’s in Your Data (Really)?
Verify!
•  Examine data
•  Ask questions (domain knowledge matters)
•  Make sure what you say is what they hear
Explore!
•  Find out what you’ve got
•  Sometimes data exploration gives you the solution
•  Inspect the data visually and draw pictures
•  Example tool: Apache Drill
Verify!
Example:
Fraud detection model trained with data from column named “fraud”
Oops.
It was the fraud analyst ID, not a flag for known fraud
Clear communication is essential.
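A quick sanity check on the column might catch this kind of mix-up before training. The sketch below is only an illustration, not the speaker's method: it assumes a fraud flag would take the values 0 or 1, and the function names and sample values are hypothetical.

```python
from collections import Counter


def looks_like_binary_flag(values, allowed=frozenset({0, 1})):
    """Return True if every non-missing value is one of the expected flag values."""
    present = [v for v in values if v is not None]
    return bool(present) and all(v in allowed for v in present)


def summarize_column(values, top=5):
    """Return the most common values so a person can eyeball what the column holds."""
    return Counter(values).most_common(top)


# Hypothetical sample: the "fraud" column actually holds analyst IDs.
fraud_column = [1047, 1047, 2210, None, 3381, 2210, 1047]

print(looks_like_binary_flag(fraud_column))  # the column fails the flag check
print(summarize_column(fraud_column))        # repeated large integers look like IDs
```

A check like this does not replace asking the data owner what the column means; it only prompts the question earlier.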
Fun example: Can you spot the pattern, tea vs chai?
[Map legend: tea, chai]
https://wals.info/chapter/138
Explore!
Example:
Big European service provider had complaints of poor response time.
But average response time in the reports was always fine…?!
Hard problem! Expect to use sophisticated ML to find the problem.
1st step: explore data using Apache Drill.
Immediately discover dropped data. Easy solution: ML not needed.
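The talk does not say how the dropped data showed up in the exploration. One simple way such a gap could surface is by counting records per time bucket and looking for buckets that are unusually thin. The sketch below uses plain Python rather than Apache Drill; the function name, bucket size, and threshold are hypothetical choices.

```python
from collections import Counter
from statistics import median


def find_sparse_buckets(timestamps, bucket_seconds=60, min_fraction=0.5):
    """Return start times (in seconds) of buckets with far fewer records than the median bucket."""
    counts = Counter(int(ts // bucket_seconds) for ts in timestamps)
    first, last = min(counts), max(counts)
    # Include empty buckets between the first and last observed bucket.
    per_bucket = [counts.get(b, 0) for b in range(first, last + 1)]
    typical = median(per_bucket)
    return [
        (first + i) * bucket_seconds
        for i, n in enumerate(per_bucket)
        if n < min_fraction * typical
    ]


# Synthetic timestamps: ten records per minute, except minute 2, which has one.
timestamps = [m * 60 + s for m in (0, 1, 3, 4) for s in range(10)] + [125]
print(find_sparse_buckets(timestamps))
```

If slow requests are the ones being dropped, an average computed over the surviving records can look fine even while users see poor response times.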
II. How do you know what features to build?
Features are built, not just chosen
There’s no “right” answer: trial and error (success) to find winners
What makes good features?
Performance, feasibility, interpretability
Think through behaviors

Build Features for Fraud Detection
What would you do if you were
a fraudster?
Behaviors That Point to Fraud
Fraudster has stolen a debit card but doesn’t know the PIN
Tries to use it as a credit card with a signature: easier to fake
Leaves clues you can discover: make a feature from this change
It’s the Changed Behavior That Matters
Remember:
In this example, the model doesn’t analyze whether or not the signature is a fake.
The feature is the change in behavior: using a signature instead of a PIN.
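As a rough illustration, a changed-behavior feature could be as simple as a flag that fires when a card is authorized in a way never seen in its history. The function name, the string labels for authorization methods, and the minimum-history threshold below are all hypothetical.

```python
def auth_method_changed(history, current_method, min_history=3):
    """Return 1 if the current authorization method never appears in this card's history."""
    if len(history) < min_history:
        return 0  # too little history to call it a change
    return int(current_method not in set(history))


# A card that has always used a PIN is suddenly used with a signature.
print(auth_method_changed(["pin", "pin", "pin", "pin"], "signature"))
```

Note that the feature says nothing about the signature itself; it only encodes the switch.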
More Behaviors That Point to Fraud
•  Domain expert says “Fraudsters often do a probe transaction at a
gas station just before making their big fraud transaction(s).”
•  How do you build a feature to detect probe behaviors?
•  Risk tables can be constructed:
–  to find a probe event or
–  to find the main fraud event
We Build a “Risk Table”
When they say “gas station”, we think “merchant type”
When they say “just before”, we think of several possible time periods
Take many transactions grouped by consumer, ordered by time
•  For each fraud, count the merchant types in the preceding window of time
•  For lots of non-frauds, count the merchant types in the preceding window
A risk table has the (log of the) ratio of the fraud counts to the non-fraud counts for each merchant type, for each window size
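The steps above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the speakers' implementation: the dictionary keys (`consumer`, `time`, `merchant_type`, `is_fraud`) are hypothetical, times are in seconds, and add-one smoothing is added here only to avoid taking the log of zero.

```python
import math
from collections import Counter, defaultdict


def build_risk_table(transactions, window_sizes):
    """Return {(merchant_type, window): log ratio of fraud to non-fraud counts}."""
    by_consumer = defaultdict(list)
    for t in transactions:
        by_consumer[t["consumer"]].append(t)

    fraud = {w: Counter() for w in window_sizes}
    clean = {w: Counter() for w in window_sizes}
    for history in by_consumer.values():
        history.sort(key=lambda t: t["time"])
        for i, event in enumerate(history):
            target = fraud if event["is_fraud"] else clean
            for w in window_sizes:
                # Count merchant types seen in the window preceding this event.
                for prior in history[:i]:
                    if event["time"] - prior["time"] <= w:
                        target[w][prior["merchant_type"]] += 1

    table = {}
    for w in window_sizes:
        for m in set(fraud[w]) | set(clean[w]):
            # Add-one smoothing (an assumption of this sketch) avoids log(0).
            table[(m, w)] = math.log((fraud[w][m] + 1) / (clean[w][m] + 1))
    return table


# Tiny synthetic example, for illustration only.
transactions = [
    {"consumer": "A", "time": 0, "merchant_type": "gas", "is_fraud": False},
    {"consumer": "A", "time": 600, "merchant_type": "electronics", "is_fraud": True},
    {"consumer": "B", "time": 0, "merchant_type": "grocery", "is_fraud": False},
    {"consumer": "B", "time": 500, "merchant_type": "grocery", "is_fraud": False},
]
risk = build_risk_table(transactions, window_sizes=[60, 3600])
print(risk)
```

In this toy data, a gas station purchase precedes the fraud within the one-hour window, so its entry is positive, while the grocery entry is negative.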
Building a Risk Table
A bigger positive value for ratio = more risk
[This is just synthetic data to make a point]
Look at recent events for
a particular card
Data Augmentation
Augment data: add external information
Example:
•  You have merchant ID
•  Look up store location
•  Give model location as a feature
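A minimal sketch of this kind of augmentation follows. The lookup table, merchant IDs, and field names are hypothetical; in practice the location data would come from an external source.

```python
# Hypothetical lookup table: merchant ID -> (city, country)
MERCHANT_LOCATIONS = {"M001": ("Berlin", "DE"), "M002": ("Helsinki", "FI")}


def add_location(transaction, locations=MERCHANT_LOCATIONS):
    """Return a copy of the transaction with location fields added (None if unknown)."""
    city, country = locations.get(transaction["merchant_id"], (None, None))
    return {**transaction, "merchant_city": city, "merchant_country": country}


print(add_location({"merchant_id": "M001", "amount": 12.0}))
```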
Data Transformation
Simple data transformation can be powerful
Domain knowledge helps you know what to do
Example:
•  Data is an amount in €
•  Take the log of the value, because the percentage change gives a more meaningful feature
10 € → 12 € is a very different change than 100 € → 102 €
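The arithmetic behind this can be checked directly: a difference of logs equals the log of the ratio, so the same 2 € step looks very different at the two scales. The function name below is hypothetical.

```python
import math


def log_change(before, after):
    """Difference of logs, which equals the log of the ratio after / before."""
    return math.log(after) - math.log(before)


small_base = log_change(10, 12)    # a 20% increase
large_base = log_change(100, 102)  # a 2% increase

print(small_base, large_base)
```

Both changes are 2 € in absolute terms, but on the log scale the first is roughly nine times larger than the second.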
Velocity as a Feature
Commonly used
How many ways can you describe velocity?
Domain knowledge helps you know what to do
Examples:
•  Geo-distance / time
•  # events / time
•  € / time
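The three velocity examples above might be computed as in the sketch below. It assumes hypothetical field names (`lat`, `lon`, `time_h` in hours) and uses the haversine formula for geo-distance, which is one common choice rather than anything the talk specifies.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two points given in degrees."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = p2 - p1
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def velocity_features(prev, curr, window_amounts, window_hours):
    """prev/curr are events with lat, lon, time_h; window_amounts are EUR amounts in the trailing window."""
    elapsed = curr["time_h"] - prev["time_h"]
    distance = haversine_km(prev["lat"], prev["lon"], curr["lat"], curr["lon"])
    return {
        # Zero elapsed time is reported as infinite speed in this sketch.
        "km_per_hour": distance / elapsed if elapsed > 0 else float("inf"),
        "events_per_hour": len(window_amounts) / window_hours,
        "eur_per_hour": sum(window_amounts) / window_hours,
    }


# A card used in Berlin and then in London one hour later.
berlin = {"lat": 52.52, "lon": 13.405, "time_h": 0.0}
london = {"lat": 51.5074, "lon": -0.1278, "time_h": 1.0}
features = velocity_features(berlin, london, window_amounts=[10, 20, 30], window_hours=2)
print(features)
```

Which of these velocities is useful depends on the domain; the geographic one, for example, only makes sense for card-present transactions.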
III. How do you know what you did?
Code:
Document the reasoning behind code
Version control for code
Data:
Document how training data was prepared
•  Which features? Why?
•  How were they built?
Version control for the training data
Remember:
The training data helps build the trained model. That’s why the specific data
you use makes a difference to the resulting model.
What is the role of data in building an ML model?
This blog post has a good basic explanation of what a machine learning model
really is and how data plays a role:
“Computer Science vs Data Science” by Ted Dunning
https://mapr.com/blog/data-science-vs-computer-science/
Data makes the model



Different training data, different model

Based on Bloom County cartoon by Berkeley Breathed
https://www.berkeleybreathed.com/
Is it OK if only one person can compile my production source code?
The same applies to the data used to build a machine learning model
Notes to Your Future Self
Machine Learning Is an Iterative Process
•  Held-out training data is used for evaluation
•  Usually have much more data in training than in production
•  Don’t fall for the myth of the unitary model: lots of models, lots of trials
Notebooks are Excellent
Notes to self: A good way to remember what you’ve done
A good way to communicate to others as well
Share code, share explanation of how features were built.
https://jupyter.org/ https://zeppelin.apache.org/
Code Versioning Via Git
Easy Data Version Control with Snapshots
Snapshot 1
Snapshots based on MapR volumes
True point-in-time version of data
Less expensive than copying
Fully distributed across cluster
Valohai Uses Snapshots for Their ML Pipeline Service
Eero Laaksonen & Juha Kilii from Finnish company Valohai demonstrate how they do version
control using snapshots in this webinar with Ian Downard (MapR):
https://mapr.com/webinars/a-guide-to-version-control-for-machine-learning/
This Is What You Track
•  For Training Data:
   •  Pathname of raw data snapshot
   •  Git reference for data preparation process (feature extraction)
•  For Code (Delivered Model):
   •  The model
   •  Pathname of training data snapshot
   •  Git reference for learning script
      o  Includes random number seed
      o  Includes knob settings for learning process
      o  The learning code
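One way to keep these items together is a small manifest written next to each delivered model. The sketch below is only an illustration: the class names, field names, paths, and Git references are hypothetical, and it assumes the random seed and knob settings live in the committed learning script, as the list suggests.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class TrainingDataRecord:
    raw_snapshot_path: str  # pathname of raw data snapshot
    prep_git_ref: str       # Git reference for the feature extraction code


@dataclass(frozen=True)
class ModelRecord:
    model_path: str              # the delivered model
    training_snapshot_path: str  # pathname of training data snapshot
    learning_git_ref: str        # commit holding seed, knob settings, learning code


def to_manifest(data, model):
    """Serialize both records as one JSON document."""
    return json.dumps(
        {"training_data": asdict(data), "model": asdict(model)},
        indent=2,
        sort_keys=True,
    )


# Hypothetical paths and commit references, for illustration only.
manifest = to_manifest(
    TrainingDataRecord("/snapshots/raw/example-1", "abc1234"),
    ModelRecord("/models/fraud/example-1.bin", "/snapshots/training/example-1", "def5678"),
)
print(manifest)
```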
Data Unit Testing
Does what you’re doing now match what you did before?
Test that: build a way to see if there are changes that matter.
•  Test outputs: Maybe what changed doesn’t matter
•  Test inputs: Another good approach (see Google paper)
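A test on inputs can be as small as comparing a few summary statistics of the new data against a stored baseline. The sketch below is a simplified illustration, not the approach from the Google paper; the function names, the chosen statistics, and the tolerances are hypothetical.

```python
from statistics import mean


def profile(rows, column):
    """Summarize one column: fraction of missing values and mean of the rest."""
    values = [r.get(column) for r in rows]
    present = [v for v in values if v is not None]
    return {
        "missing_fraction": 1 - len(present) / len(values),
        "mean": mean(present),
    }


def input_drift(baseline, current, mean_tolerance=0.1, missing_tolerance=0.05):
    """Return the names of statistics that moved more than the allowed tolerance."""
    problems = []
    if abs(current["missing_fraction"] - baseline["missing_fraction"]) > missing_tolerance:
        problems.append("missing_fraction")
    scale = abs(baseline["mean"]) or 1.0
    if abs(current["mean"] - baseline["mean"]) / scale > mean_tolerance:
        problems.append("mean")
    return problems


# Synthetic data: the second batch has a missing value and amounts ten times larger.
before = profile([{"amount": a} for a in (10, 12, 11, 9)], "amount")
same = profile([{"amount": a} for a in (10, 11, 12, 9)], "amount")
changed = profile([{"amount": a} for a in (100, 120, None, 90)], "amount")

print(input_drift(before, same))
print(input_drift(before, changed))
```

Whether a flagged change actually matters is the question from the slide; a check like this only tells you where to look.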
Data Validation Article from Google Research
https://www.sysml.cc/doc/2019/167.pdf
Free eBooks courtesy of MapR Technologies
https://mapr.com/ebook/machine-learning-logistics/
https://mapr.com/ebook/ai-and-analytics-in-production/
https://mapr.com/
Please support women in tech – help build
girls’ dreams of what they can accomplish
© Ellen Friedman 2015 #womenintech #datawomen
Thank you!

Data Preparation & Version Control for Machine Learning Berlin Buzzwords 2019

  • 1. © 2019 Ellen Friedman 11 Doing Data: Data Prep for Machine Learning Ellen Friedman, PhD 18 June 2019 Berlin Buzzwords #bbuzz
  • 2. Contact Information: Ellen Friedman, PhD; Committer, Apache Drill & Apache Mahout projects; O’Reilly author; Email ellenf@apache.org; Twitter @Ellen_Friedman; Today: #bbuzz
  • 3. Bloom County cartoon by Berkeley Breathed https://www.berkeleybreathed.com/
  • 4. Sometimes, the most powerful solutions are very basic. That doesn’t necessarily make them easy. Bloom County cartoon by Berkeley Breathed https://www.berkeleybreathed.com/
  • 5. Sometimes, the most powerful solutions are very basic. That doesn’t necessarily make them easy. Bloom County cartoon by Berkeley Breathed https://www.berkeleybreathed.com/
  • 6.
  • 7. What makes machine learning work?
  • 8. The data
  • 9. Preparation of training data is a big part of the effort. It is important to keep track of how you did it and where the data is archived.
  • 10. Find the ML code. Figure based on “Hidden Technical Debt in Machine Learning Systems” by Sculley et al. (Google, Inc) https://papers.nips.cc/paper/5656-hidden-technical-debt-in-machine-learning-systems.pdf [Figure labels: ML, Configuration, Data Collection, Feature Extraction, Data Verification, Analysis Tools, Process Management Tools, Machine Resource Management, Serving Infrastructure, Monitoring] Only a small part of an ML system is the learning code. The rest is a vast infrastructure of data collection and processing.
  • 11. Things That Matter in Data Preparation: I. What’s in your data? (really?) II. How do you know what features to build? III. How do you know what you did?
  • 12. Retroactive Value in Data: Labs in Canada froze blood samples for years in case the samples might contain valuable information ✓ They did. • Modern genetic techniques revealed key disease data ✓ Correlated with outcomes for the donor patients • The data was preserved before the analysis was even begun.
  • 13. Retroactive Value in Data: Labs in Canada froze blood samples for years in case the samples might contain valuable information ✓ They did. • Modern genetic techniques revealed key disease data ✓ Correlated with outcomes for the donor patients • The data was preserved before the analysis was even begun.
  • 14. Retroactive Value in Data: Labs in Canada froze blood samples for years in case the samples might contain valuable information ✓ They did. Modern genetic techniques revealed key disease data ✓ Correlated with outcomes for the donor patients • The data was preserved before the analysis was even begun.
  • 15. Retroactive Value in Data: Labs in Canada froze blood samples for years in case the samples might contain valuable information ✓ They did. Modern genetic techniques revealed key disease data ✓ Correlated with outcomes for the donor patients. The data was preserved before the analysis was even begun.
  • 16. Retroactive Value in Data: Labs in Canada froze blood samples for years in case the samples might contain valuable information ✓ They did. Modern genetic techniques revealed key disease data ✓ Correlated with outcomes for the donor patients. The data was preserved before the analysis was even begun. [Callout: frozen]
  • 17. Things That Matter in Data Preparation: I. What’s in your data? (really?) II. How do you know what features to build? III. How do you know what you did?
  • 18. Thanks to these data scientists for their stories: Joe Blue, Director Global Data Science, MapR; Ted Dunning, Chief Technical Officer, MapR
  • 19. A loyal fan of Berlin Buzzwords: Ted Dunning, Berlin 2018
  • 20. Machine Learning in the Real World vs Kaggle https://visibleearth.nasa.gov/view.php?id=56229
  • 21. I. What’s in Your Data (Really)? Verify! • Examine data • Ask questions (domain knowledge matters) • Make sure what you say is what they hear. Explore! • Find out what you’ve got • Sometimes data exploration gives you the solution • Inspect visually and draw pictures • Example tool: Apache Drill
  • 22. Verify! Example: Fraud detection model trained with data from a column named “fraud”. Oops. It was the fraud analyst ID, not a ranking for likely risk of fraud events.
  • 23. Verify! Example: Fraud detection model trained with data from a column named “fraud”. Oops. It was the fraud analyst ID, not a flag for known fraud.
  • 24. Verify! Example: Fraud detection model trained with data from a column named “fraud”. Oops. It was the fraud analyst ID, not a flag for known fraud. Clear communication is essential.
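
A quick profile of a column can surface this kind of mix-up before any training happens. The following is a minimal sketch, not the method used in the talk: the helper names `profile_column` and `looks_like_binary_flag` and the synthetic rows are assumptions made for illustration.

```python
from collections import Counter

def profile_column(rows, column):
    """Summarize a column: row count, distinct values, and the most common ones."""
    values = [row[column] for row in rows]
    counts = Counter(values)
    return {
        "rows": len(values),
        "distinct": len(counts),
        "top": counts.most_common(3),
    }

def looks_like_binary_flag(profile):
    """A known-fraud flag should have at most two distinct values."""
    return profile["distinct"] <= 2

# Synthetic rows (hypothetical): here "fraud" holds an analyst ID, not a flag.
rows = [
    {"txn": 1, "fraud": "A17"},
    {"txn": 2, "fraud": "B02"},
    {"txn": 3, "fraud": "A17"},
    {"txn": 4, "fraud": "C33"},
]

profile = profile_column(rows, "fraud")
print(profile)
print(looks_like_binary_flag(profile))
```

A check like this does not replace asking the data owner what the column means; it only flags that the name and the contents disagree.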
  • 25. Fun example: Can you spot the pattern, tea vs chai? [Map legend: tea, chai] https://wals.info/chapter/138
  • 26. Explore! Example: A big European service provider had complaints of poor response time. But average response time in the reports was always fine…?! Hard problem! Expect to use sophisticated ML to find the problem. 1st step: explore data using Apache Drill. Immediately discover dropped data. Easy solution: ML not needed.
  • 27. Explore! Example: A big European service provider had complaints of poor response time. But average response time in the reports was always fine…?! Hard problem! Expect to use sophisticated ML to find the problem. 1st step: explore data using Apache Drill. Immediately discover dropped data. Easy solution: ML not needed.
  • 28. Explore! Example: A big European service provider had complaints of poor response time. But average response time in the reports was always fine…?! Hard problem! Expect to use sophisticated ML to find the problem. 1st step: explore data using Apache Drill. Immediately discover dropped data. Easy solution: ML not needed.
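
The talk used Apache Drill for this exploration; the Python sketch below only illustrates the idea that an average computed over surviving records can look healthy while whole intervals are missing. The record layout, the expected rate, and the function names are assumptions, not details from the talk.

```python
from collections import Counter
from statistics import mean

def records_per_minute(records):
    """Count how many records were logged in each minute."""
    return Counter(minute for minute, _ in records)

def minutes_with_dropped_data(records, expected_per_minute):
    """Return minutes in the observed range with fewer records than expected."""
    counts = records_per_minute(records)
    first, last = min(counts), max(counts)
    return [m for m in range(first, last + 1)
            if counts.get(m, 0) < expected_per_minute]

# Synthetic (minute, response_ms) records; minute 2 has no records at all.
records = [(0, 110), (0, 120), (1, 115), (1, 105), (3, 118), (3, 112)]

average_ms = mean(ms for _, ms in records)
gaps = minutes_with_dropped_data(records, expected_per_minute=2)
print(average_ms)  # looks fine, because it only covers records that survived
print(gaps)        # the missing interval shows up in the counts instead
```

Counting records per interval is a cheap first look, and it needs no model at all.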
  • 29. II. How do you know what features to build? Features are built, not just chosen. There’s no “right” answer: trial and error (success) to find winners. What makes good features? Performance, feasibility, interpretability.
  • 30. Think through behaviors
  • 31. Build Features for Fraud Detection
  • 32. What would you do if you were a fraudster?
  • 33. Behaviors That Point to Fraud: Fraudster has stolen a debit card, but doesn’t know the PIN. Tries to use it as a credit card with a signature: easier to fake.
  • 34. Behaviors That Point to Fraud: Fraudster has stolen a debit card, but doesn’t know the PIN. Tries to use it as a credit card with a signature: easier to fake. Leaves clues you can discover: make a feature from this change.
  • 35. It’s the Changed Behavior That Matters. Remember: In this example, the model doesn’t analyze whether or not the signature is a fake. The feature is the change in behavior: using a signature instead of a PIN.
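
One possible way to turn this change in behavior into a feature is to compare the current authorization method with the card's usual one. This is a sketch under assumptions: the method labels `"pin"` and `"signature"` and the function `auth_change_feature` are hypothetical, and a production feature would likely use a time-limited history.

```python
from collections import Counter

def auth_change_feature(history, current_method):
    """Return 1 if the current authorization method differs from the
    card's most common past method, else 0."""
    if not history:
        return 0  # no baseline to compare against
    usual_method, _ = Counter(history).most_common(1)[0]
    return int(current_method != usual_method)

# Synthetic history for one card: the owner normally uses a PIN.
history = ["pin", "pin", "pin", "signature", "pin"]

changed = auth_change_feature(history, "signature")
unchanged = auth_change_feature(history, "pin")
print(changed, unchanged)
```

Note that the feature says nothing about whether the signature is genuine; it only records that the behavior changed.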
  • 36. More Behaviors That Point to Fraud • Domain expert says “Fraudsters often do a probe transaction at a gas station just before making their big fraud transaction(s).” • How do you build a feature to detect probe behaviors? • Risk tables can be constructed: – to find a probe event or – to find the main fraud event
  • 37. We Build a “Risk Table”. When they say “gas station”, we think “merchant type”. When they say “just before”, we think of several possible time periods. Take many transactions grouped by consumer, ordered by time • For each fraud, count the merchant types in the preceding window of time • For lots of non-frauds, count the merchant types in the preceding window. A risk table has the (log of the) ratio of the fraud counts to the non-fraud counts for each merchant type, for each window size.
  • 38. Building a Risk Table: a bigger positive value for the (log) ratio = more risk [This is just synthetic data to make a point]
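
The risk-table recipe above can be sketched in a few lines. This is an illustration under assumptions rather than the speaker's implementation: the field names, the window in seconds, and the add-one smoothing (used here only to avoid dividing by zero) are choices of this sketch, and the data is synthetic.

```python
import math
from collections import Counter, defaultdict

def preceding_types(txns, index, window):
    """Merchant types seen on the same card within `window` seconds before txns[index]."""
    target = txns[index]
    return [
        t["merchant_type"]
        for t in txns[:index]
        if t["card"] == target["card"] and 0 < target["time"] - t["time"] <= window
    ]

def build_risk_table(txns, windows, smoothing=1.0):
    """Map (window, merchant_type) -> log(fraud count / non-fraud count), smoothed."""
    txns = sorted(txns, key=lambda t: (t["card"], t["time"]))
    fraud = defaultdict(Counter)
    normal = defaultdict(Counter)
    for i, txn in enumerate(txns):
        bucket = fraud if txn["is_fraud"] else normal
        for window in windows:
            bucket[window].update(preceding_types(txns, i, window))
    table = {}
    for window in windows:
        for mtype in set(fraud[window]) | set(normal[window]):
            ratio = (fraud[window][mtype] + smoothing) / (normal[window][mtype] + smoothing)
            table[(window, mtype)] = math.log(ratio)
    return table

# Synthetic transactions: two frauds are preceded by a gas-station purchase.
txns = [
    {"card": "c1", "time": 0, "merchant_type": "gas", "is_fraud": False},
    {"card": "c1", "time": 600, "merchant_type": "electronics", "is_fraud": True},
    {"card": "c2", "time": 0, "merchant_type": "grocery", "is_fraud": False},
    {"card": "c2", "time": 900, "merchant_type": "grocery", "is_fraud": False},
    {"card": "c3", "time": 0, "merchant_type": "gas", "is_fraud": False},
    {"card": "c3", "time": 300, "merchant_type": "jewelry", "is_fraud": True},
]

table = build_risk_table(txns, windows=[3600])
for key, value in sorted(table.items()):
    print(key, round(value, 3))
```

In this toy data the gas entry comes out positive and the grocery entry negative, matching the reading that a bigger positive value means more risk. With real data the fraud and non-fraud samples usually differ greatly in size, so the counts may need normalizing before the ratio is taken.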
  • 39. Look at recent events for a particular card
  • 40.
  • 41.
  • 42. Data Augmentation. Augment data: add external information. Example: • You have merchant ID • Look up store location • Give model location as a feature
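
The merchant-ID example amounts to a lookup against an external table. A minimal sketch follows; the lookup table `MERCHANT_LOCATIONS`, its IDs and coordinates, and the field names are all hypothetical.

```python
# Hypothetical external reference data: merchant ID -> store location.
MERCHANT_LOCATIONS = {
    "m-001": {"city": "Berlin", "lat": 52.52, "lon": 13.405},
    "m-002": {"city": "Helsinki", "lat": 60.17, "lon": 24.94},
}

def augment_with_location(txn, locations):
    """Return a copy of the transaction with location fields added
    (None when the merchant is not in the lookup table)."""
    location = locations.get(txn["merchant_id"], {})
    return {
        **txn,
        "merchant_city": location.get("city"),
        "merchant_lat": location.get("lat"),
        "merchant_lon": location.get("lon"),
    }

txn = {"card": "c1", "merchant_id": "m-001", "amount_eur": 42.0}
augmented = augment_with_location(txn, MERCHANT_LOCATIONS)
unknown = augment_with_location(
    {"card": "c1", "merchant_id": "m-999", "amount_eur": 5.0}, MERCHANT_LOCATIONS
)
print(augmented)
print(unknown)
```

Returning a copy keeps the raw record unchanged, which makes it easier to say later exactly how the training data was derived.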
  • 43. Data Transformation. Simple data transformation can be powerful. Domain knowledge helps you know what to do. Example: • Data is an amount in € • Take the log of the value, because the relative (%) change gives a more meaningful feature: 10 € → 12 € is a very different change than 100 € → 102 €
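
A short worked example of why the log helps: both changes on the slide are 2 € in absolute terms, but a difference of logs equals the log of the ratio, so it tracks the relative change. The function name `log_change` is an assumption of this sketch.

```python
import math

def log_change(before, after):
    """Difference of logs, which equals log(after / before)."""
    return math.log(after) - math.log(before)

small_base = log_change(10, 12)    # a 20% increase
large_base = log_change(100, 102)  # a 2% increase

print(12 - 10, 102 - 100)          # identical absolute changes
print(round(small_base, 3), round(large_base, 3))
```

On the log scale the first change is several times larger than the second, even though the raw differences are identical.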
  • 44. Velocity as a Feature. Commonly used. How many ways can you describe velocity? Domain knowledge helps you know what to do. Examples: • Geo-distance / time • # events / time • € / time
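
The three velocity examples can be computed from a card's recent transactions. The sketch below assumes a time-ordered list with strictly increasing timestamps in hours and hypothetical field names; the haversine formula is a standard great-circle distance, swapped in here as one reasonable meaning of "geo-distance".

```python
import math

EARTH_RADIUS_KM = 6371.0

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometers."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def velocity_features(txns):
    """Velocity features from a time-ordered list of one card's recent
    transactions (at least two, with strictly increasing time_h)."""
    prev, curr = txns[-2], txns[-1]
    hours_since_prev = curr["time_h"] - prev["time_h"]
    window_hours = curr["time_h"] - txns[0]["time_h"]
    distance = haversine_km(prev["lat"], prev["lon"], curr["lat"], curr["lon"])
    return {
        "km_per_hour": distance / hours_since_prev,
        "events_per_hour": len(txns) / window_hours,
        "eur_per_hour": sum(t["amount_eur"] for t in txns) / window_hours,
    }

# Synthetic example: the same card is used in Berlin, then in Paris an hour later.
txns = [
    {"time_h": 0.0, "lat": 52.52, "lon": 13.405, "amount_eur": 20.0},
    {"time_h": 1.0, "lat": 48.8566, "lon": 2.3522, "amount_eur": 500.0},
]

features = velocity_features(txns)
print(features)
```

A card that appears to travel several hundred kilometers in an hour is the kind of signal the geo-distance velocity is meant to expose.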
  • 45. III. How do you know what you did? Code: Document the reasoning behind code. Version control for code. Data: Document how training data was prepared • Which features? Why? • How were they built? Version control for the training data.
  • 46. Remember: The training data helps build the trained model. That’s why the specific data you use makes a difference to the resulting model.
  • 47. What is the role of data in building an ML model? This blog post has a good basic explanation of what a machine learning model really is and how data plays a role: “Computer Science vs Data Science” by Ted Dunning https://mapr.com/blog/data-science-vs-computer-science/
  • 48. Data makes the model. Different training data, different model.
  • 49. Data makes the model. Different training data, different model.
  • 50. Based on Bloom County cartoon by Berkeley Breathed https://www.berkeleybreathed.com/ Is it OK if only one person can compile my production source code?
  • 51. The same applies for data used to build a machine learning model
  • 52. Notes to Your Future Self
  • 53. Machine Learning Is an Iterative Process • Held-out training data is used for evaluation • Usually have much more data in training than in production • Don’t fall for the myth of the unitary model: lots of models, lots of trials
  • 54. Notebooks are Excellent Notes to Self: A good way to remember what you’ve done. A good way to communicate to others as well. Share code, share explanation of how features were built. https://jupyter.org/ https://zeppelin.apache.org/
  • 55. Code Versioning Via Git
  • 56. Code Versioning Via Git
  • 57. Easy Data Version Control with Snapshots [Figure label: Snapshot 1] Snapshots based on MapR volumes • True point-in-time version of data • Less expensive than copying • Fully distributed across cluster
  • 58. Valohai Uses Snapshots for Their ML Pipeline Service: Eero Laaksonen & Juha Kiili from Finnish company Valohai demonstrate how they do version control using snapshots in this webinar with Ian Downard (MapR): https://mapr.com/webinars/a-guide-to-version-control-for-machine-learning/
  • 59. This is What You Track • For Training Data: – Pathname of raw data snapshot – Git reference for data preparation process (feature extraction) • For Code (Delivered Model): – The model – Pathname of training data snapshot – Git reference for learning script, which includes the random number seed, the knob settings for the learning process, and the learning code
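
The tracking list above maps naturally onto a small, serializable record stored next to each delivered model. The sketch below is one possible shape, not a format from the talk: every path, Git reference, and field name is hypothetical. On the slide the seed and knob settings live inside the Git-referenced learning script; copying them into the record as well is a convenience choice of this sketch.

```python
import json
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class TrainingDataRecord:
    raw_snapshot_path: str  # pathname of the raw data snapshot
    prep_git_ref: str       # Git reference for the feature-extraction code

@dataclass(frozen=True)
class ModelRecord:
    model_path: str              # the delivered model artifact
    training_snapshot_path: str  # pathname of the training data snapshot
    learning_git_ref: str        # Git reference for the learning script
    random_seed: int             # seed used by the learning script
    knob_settings: dict          # settings for the learning process

# All values below are made up for illustration.
data_record = TrainingDataRecord(
    raw_snapshot_path="/data/raw/.snapshot/2019-06-01",
    prep_git_ref="a1b2c3d",
)
model_record = ModelRecord(
    model_path="/models/fraud/v3/model.bin",
    training_snapshot_path="/data/training/.snapshot/2019-06-02",
    learning_git_ref="d4e5f6a",
    random_seed=42,
    knob_settings={"learning_rate": 0.1, "max_depth": 6},
)

manifest = json.dumps(
    {"training_data": asdict(data_record), "model": asdict(model_record)},
    indent=2,
    sort_keys=True,
)
print(manifest)
```

Writing the manifest as plain JSON means it can itself be committed to Git alongside the code it refers to.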
  • 60. Data Unit Testing: Does what you’re doing now match what you did before? Test that: build a way to see if there are changes that matter. • Test outputs: Maybe what changed doesn’t matter • Test inputs: Another good approach (see Google paper)
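
A simple way to test inputs is to compare summary statistics of a feature against a stored baseline and flag changes beyond a tolerance. This is a minimal sketch and not the approach of the Google paper: the statistics chosen, the 10% tolerance, and the function names are assumptions.

```python
from statistics import mean

def summarize(values):
    """Summary statistics used as a fingerprint of one numeric feature."""
    return {
        "count": len(values),
        "mean": mean(values),
        "min": min(values),
        "max": max(values),
    }

def drifted(baseline, current, rel_tol=0.10):
    """Names of statistics that moved more than rel_tol relative to baseline."""
    changed = []
    for name, old in baseline.items():
        new = current[name]
        scale = abs(old) if old != 0 else 1.0
        if abs(new - old) / scale > rel_tol:
            changed.append(name)
    return changed

# Synthetic feature values from three data preparation runs.
baseline = summarize([10.0, 12.0, 11.0, 13.0])
similar = summarize([10.5, 12.0, 11.0, 12.5])
shifted = summarize([100.0, 120.0, 110.0, 130.0])  # e.g. a silent unit change

print(drifted(baseline, similar))
print(drifted(baseline, shifted))
```

The tolerance is where the "changes that matter" judgment lives: small fluctuations pass, while a shift like a unit change is reported.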
  • 61. Data Validation Article from Google Research https://www.sysml.cc/doc/2019/167.pdf
  • 62. Free eBooks courtesy of MapR Technologies https://mapr.com/ : https://mapr.com/ebook/machine-learning-logistics/ and https://mapr.com/ebook/ai-and-analytics-in-production/
  • 63. Please support women in tech: help build girls’ dreams of what they can accomplish. © Ellen Friedman 2015 #womenintech #datawomen
  • 64. Thank you!
  • 65. Contact Information: Ellen Friedman, PhD; Committer, Apache Drill & Apache Mahout projects; O’Reilly author; Email ellenf@apache.org; Twitter @Ellen_Friedman; Today: #bbuzz