A Day in the Life of a Functional Data Scientist
Richard Minerich, Director of R&D at Bayard Rock (@Rickasaurus)
Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/1tnEpjY.

Richard Minerich explains how ideas and tools from functional programming can save time and prevent subtle mistakes in data science, and how he incorporates them into his everyday workflow. Filmed at qconnewyork.com.

Richard Minerich works tirelessly at Bayard Rock to apply cutting-edge research to anti-money laundering and fraud while using typed functional programming whenever possible. As an F# MVP he has been running events, speaking, and writing for over five years.

A Day in the Life of a Functional Data Scientist

  1. A day in the life of a functional data scientist. Richard Minerich, Director of R&D at Bayard Rock, @Rickasaurus
  2. Watch the video with slide synchronization on InfoQ.com! http://www.infoq.com/presentations/functional-data-scientist
     InfoQ.com: News & Community Site
     • 750,000 unique visitors/month
     • Published in 4 languages (English, Chinese, Japanese and Brazilian Portuguese)
     • Post content from our QCon conferences
     • News 15-20 / week
     • Articles 3-4 / week
     • Presentations (videos) 12-15 / week
     • Interviews 2-3 / week
     • Books 1 / month
  3. Presented at QCon New York, www.qconnewyork.com
     Purpose of QCon
     - to empower software development by facilitating the spread of knowledge and innovation
     Strategy
     - practitioner-driven conference designed for YOU: influencers of change and innovation in your teams
     - speakers and topics driving the evolution and innovation
     - connecting and catalyzing the influencers and innovators
     Highlights
     - attended by more than 12,000 delegates since 2007
     - held in 9 cities worldwide
  4. Projecting onto a 2D Plane
  5. The Pairwise Entity Resolution Process
     Blocking
     • Two Datasets (Customer Data and Sanctions)
     • Pairs of Somehow Similar Records
     Scoring
     • Pairs of Records
     • Probability of Representing Same Entity
     Review
     • Records, Probability, Similarity Features
     • True/False Labels (Mostly by Hand)
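
The three stages on slide 5 can be sketched in F# as below. This is a minimal illustration and not Bayard Rock's implementation: the `Person` record, the first-letter blocking key, and the equal-weight score are all assumptions made for the example.

```fsharp
// Hypothetical record type; field names are illustrative, not from the talk.
type Person = { Id: int; Name: string; BirthYear: int option }

// Blocking: only compare records that share a cheap key
// (assumed here: the upper-cased first letter of the name).
let blockingKey (p: Person) =
    if p.Name.Length = 0 then '?' else System.Char.ToUpperInvariant p.Name.[0]

let block (customers: Person list) (sanctions: Person list) =
    let byKey = sanctions |> List.groupBy blockingKey |> Map.ofList
    customers
    |> List.collect (fun c ->
        match Map.tryFind (blockingKey c) byKey with
        | Some candidates -> candidates |> List.map (fun s -> (c, s))
        | None -> [])

// Scoring: a toy stand-in for "probability of representing the same entity".
let score (c: Person, s: Person) =
    let nameMatch =
        if c.Name.ToUpperInvariant() = s.Name.ToUpperInvariant() then 1.0 else 0.0
    let yearMatch =
        match c.BirthYear, s.BirthYear with
        | Some a, Some b when a = b -> 1.0
        | _ -> 0.0
    0.5 * nameMatch + 0.5 * yearMatch

// Review: pairs at or above the threshold are handed to a human for labeling.
let toReview threshold pairs =
    pairs
    |> List.map (fun pair -> pair, score pair)
    |> List.filter (fun (_, p) -> p >= threshold)
```

Blocking keeps the number of scored pairs far below the full cross product of the two datasets, which is why it comes first.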
  6. Blocking
  7. Scoring: Risk vs Probability (The Ideal). Axes: Likely to Launder Money; Probably the Same Person
  8. The Reality (Dominated by Garbage). Chart labels: Tiny Bump; Upper Threshold; counts 937, 161 and 161,358
  9. Let’s dig into a single point: Jimmy Cournoyer, El: 95/ SI:16
  10. Citation Network (Safe View)
  11. Relationship Network (Safe View)
  12. Network diagram labels: British Columbia; Rizzuto Crime Family; Jimmy “Cosmo” “Superman” Cournoyer; Quebec; New York/NYC; Bonanno Crime Family; John “Big Man” Venizelos; Reinvested in Cocaine; California; Flow of Drugs; Hells Angels; El Chapo Sinaloa Cartel
  13. Jorge Hank Rhon: Family & Friends; $100s Millions; Citibank, CH; Brother Murdered
  14. Bar chart of % Time Spent (axis from 0 to 0.6): Munging Data; Redoing Work / Investigating Problems; Fun Algorithms
  15. Disgustingly Bad but Fairly Large Datasets
      ▪ Both Wide (many fields) and Tall (many records)
      ▪ From different systems (different encodings)
      ▪ Missing data
      ▪ Poorly merged data
      ▪ Extra data
      ▪ Non-unique IDs
      Every client is awful in a completely different way.
      Example record:
      NAME     LARRY O BRIAN
      STATE    CANADA
      CITY     121 Buffalo Drive, Montreal, Quebec H3G 1Z2
      ADDRESS  NULL
      ZIP      00000
      DOB      10/24/80; 1/1/1979
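
The example record on slide 15 mixes placeholder values (`NULL`, `00000`) with a multi-valued date field. A minimal F# sketch of normalizing such fields follows; the function names `clean` and `parseDobs`, the placeholder list, and the accepted date formats are assumptions for illustration and are not taken from SAM.

```fsharp
open System
open System.Globalization

// Treat common placeholder strings as missing values.
// Assumption: "NULL" and "00000" never carry real information.
let clean (raw: string) =
    match (if isNull raw then "" else raw.Trim()) with
    | "" | "NULL" | "00000" -> None
    | value -> Some value

// A DOB field may hold several candidate dates separated by ';'.
// Unparseable parts are dropped rather than guessed at.
let parseDobs (raw: string) =
    match clean raw with
    | None -> []
    | Some value ->
        value.Split(';')
        |> Array.toList
        |> List.choose (fun part ->
            match DateTime.TryParseExact(part.Trim(), [| "M/d/yy"; "M/d/yyyy" |],
                                         CultureInfo.InvariantCulture,
                                         DateTimeStyles.None) with
            | true, d -> Some d
            | false, _ -> None)
```

Returning `option` and list values makes the missing and ambiguous cases explicit in the types, so downstream code has to handle them.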
  16. SAM – Building for Bad Data
      ▪ Lazy Pure Functional Core
      ▪ Programmable Data Cleaning
      ▪ Programmable ETL
      ▪ Ad-Hoc Behaviors
      All with an F# Core and Barb for scripting.
      Architecture diagram labels: UI (C#) & Analysis (C#); Glue (F# and Barb); Data & Config In; Data Out; Algorithms (F#)
  17. Other Kinds of Problems (sometimes even my fault)
      ▪ Extra / Missing Data (e.g. incorrect subset or incorrect joins)
      ▪ Wrong version of data (e.g. bad sync in SQL)
      ▪ Bad configuration of dependencies
      The data lives in a locked-down environment, so feedback cycles are slow.
      Lesson: Be Paranoid
  18. F# Tools From Bayard Rock, http://github.com/BayardRock
      Tokens                   Classification
      Pegasus Airlines         ORGANIZATION
      Istanbul                 LOCATION
      Sochi                    LOCATION
      Russia                   LOCATION
      Turkey                   LOCATION
      Transportation Ministry  ORGANIZATION
  19. FSharpWebIntellisense https://github.com/BayardRock/FSharpWebIntellisense
  20. iFSharp Notebook https://github.com/BayardRock/IfSharp
  21. Barb, a simple .net record query language
      Name.Contains "John" and (Age > 20 or Weight > 200)
      https://github.com/Rickasaurus/Barb
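
For comparison, the Barb expression on slide 21 reads roughly like the plain F# predicate below. The `Person` record is hypothetical and the sketch does not use Barb's API; it only shows the logic the query expresses.

```fsharp
// Hypothetical record with the three fields the query refers to.
type Person = { Name: string; Age: int; Weight: int }

// Same condition as the Barb query, written as an ordinary F# function.
let matches (p: Person) =
    p.Name.Contains("John") && (p.Age > 20 || p.Weight > 200)

// Filtering a list of records with the predicate.
let selectMatches (people: Person list) = List.filter matches people
```

The difference is that a Barb expression is a string evaluated at run time, so it can live in configuration, while the F# version is compiled and type-checked.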
  22. MITIE Dot Net (a wrapper for MIT’s MITIE), https://github.com/BayardRock/MITIE-Dot-Net
      A Pegasus Airlines plane landed at an Istanbul airport Friday after a passenger "said that there was a bomb on board" and wanted the plane to land in Sochi, Russia, the site of the Winter Olympics, said officials with Turkey's Transportation Ministry.
      Tokens                   Classification
      Pegasus Airlines         ORGANIZATION
      Istanbul                 LOCATION
      Sochi                    LOCATION
      Russia                   LOCATION
      Turkey                   LOCATION
      Transportation Ministry  ORGANIZATION
  23. Other F# Community Tools (Not by Us)
      ▪ Data Type Providers (SQL, OData, CSV, etc.)
      ▪ Language Type Providers (R, Matlab, Python soon)
      ▪ Deedle (like pandas but for F#)
      ▪ F# Charting
  24. The Magic of Type Providers
      type Netflix = ODataService<"http://odata.netflix.com">
      let netflix = Netflix.GetDataContext()
      let avatarTitles =
          query { for t in netflix.Titles do
                  where (t.Name.Contains "Avatar")
                  sortBy t.Name
                  take 100 }
  25. How it works! Diagram labels: Compiler; Type Provider; Types; Erased Types; The World!; Type Providers!; Libraries For Free!
  26. Deedle (Like Python’s pandas but for F#)
      ▪ Designed with Data Type Providers in Mind
      ▪ Interops with the R Type Provider
  27. But what about algorithmic code?
  28. Ranking vs Regression
      ▪ In Regression you’re trying to guess a number, only distance matters
        ▪ May do a very bad job at ordering
      ▪ In Ranking you’re trying to figure out some order, only order matters
        ▪ May do a very bad job at providing a meaningful number
      Example: You’re a doctor with 20 spots open and 100 patients who want to see you today. Which method would be the best for selecting 20?
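
The distinction on slide 28 can be made concrete with two toy predictors, sketched below in F#. The numbers are invented for illustration: model A is close in value but swaps two items, while model B is far off in value but orders everything correctly.

```fsharp
// Mean squared error: the regression view, where only distance matters.
let mse (truth: float list) (pred: float list) =
    List.map2 (fun t p -> (t - p) ** 2.0) truth pred |> List.average

// Number of item pairs ordered the same way in truth and prediction:
// the ranking view, where only order matters.
let concordantPairs (truth: float list) (pred: float list) =
    let items = List.zip truth pred
    [ for i in 0 .. items.Length - 1 do
        for j in i + 1 .. items.Length - 1 do
            let (t1, p1), (t2, p2) = items.[i], items.[j]
            if sign (t1 - t2) = sign (p1 - p2) then yield 1 ]
    |> List.length

let truth = [ 1.0; 2.0; 3.0 ]
let modelA = [ 1.4; 1.3; 3.0 ]     // small errors, but items 1 and 2 are swapped
let modelB = [ 11.0; 12.0; 13.0 ]  // large errors, but perfect order
```

Model A wins on `mse` and model B wins on `concordantPairs`, so which model is better depends on whether the task needs a meaningful number or only a correct order.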
  29. Regression: y = Xβ + ε
      y is labels
      X is features
      β is weights
      ε is errors
  30. “OLS” Regression via Gradient Descent in F#
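
The code from slide 30 did not survive in this transcript. As a stand-in, here is a minimal sketch of least-squares regression fitted by batch gradient descent in F#; it is not the speaker's code, and the names `dot`, `step` and `fit`, the learning rate, and the fixed iteration count are assumptions.

```fsharp
// Dot product of two equally sized vectors.
let dot (a: float[]) (b: float[]) =
    Array.fold2 (fun acc x y -> acc + x * y) 0.0 a b

// One batch gradient step on mean squared error:
//   w <- w - lr * (2/n) * X^T (Xw - y)
let step (lr: float) (xs: float[][]) (ys: float[]) (w: float[]) =
    let n = float xs.Length
    let residuals = Array.map2 (fun x y -> dot x w - y) xs ys
    w |> Array.mapi (fun j wj ->
        let grad =
            (Array.fold2 (fun acc (x: float[]) r -> acc + x.[j] * r) 0.0 xs residuals)
            * 2.0 / n
        wj - lr * grad)

// Start from zero weights and repeat the step a fixed number of times.
let fit (lr: float) (iterations: int) (xs: float[][]) (ys: float[]) =
    let w0 = Array.zeroCreate<float> xs.[0].Length
    List.fold (fun w _ -> step lr xs ys w) w0 [ 1 .. iterations ]
```

For example, with a constant first feature acting as the intercept, fitting points on the line y = 1 + 2x should recover weights close to 1 and 2. The learning rate has to be small enough for the data at hand, otherwise the iteration diverges.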
  31. Simple Ranking? You Can Use Regression.
      ▪ The features are the difference in would-be regression features
      ▪ The value to predict is the difference in rank
      Select 2 labeled samples randomly => (x1,y1) (x2,y2)
      x = x1 – x2
      y = y1 – y2
                     Sample 1   Sample 2   Result
      Names?             1          1         0
      Addresses?         1          0         1
      DOB?               0          1        -1
      Same Person?       0          0         0
  32. Simple Ranking in F#
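
The code from slide 32 is also missing here. The pair construction described on slide 31 can be sketched as follows; this is an illustration under assumptions and not the speaker's code, and `difference` and `samplePairs` are hypothetical names.

```fsharp
// Turn two labeled samples into one training example for ranking:
// features are x1 - x2, the target is y1 - y2.
let difference (x1: float[], y1: float) (x2: float[], y2: float) =
    Array.map2 (-) x1 x2, y1 - y2

// Draw `count` random pairs from the labeled data and take their differences.
// Assumption: sampling with replacement is acceptable for the sketch.
let samplePairs (rng: System.Random) (count: int) (data: (float[] * float)[]) =
    Array.init count (fun _ ->
        let a = data.[rng.Next(data.Length)]
        let b = data.[rng.Next(data.Length)]
        difference a b)
```

The resulting difference examples can be fed to any regression fitter, so the learned weights score pairs by relative order and not by absolute value.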
  33. Combined Ranking and Regression – D. Sculley
      You can improve your regression with ranking, and your ranking with regression. The best of both worlds!
  34. Combined Ranking and Regression – D. Sculley @ Google, Inc
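
The slides do not reproduce the formula. As a sketch, the combined objective is usually stated as a weighted sum of a regression loss and a pairwise ranking loss; the symbols below are assumptions chosen to match slide 29 where possible (here $w$ plays the role of the weights β).

```latex
\min_{w} \; \alpha \, L(w, D) \; + \; (1 - \alpha) \, L(w, P) \; + \; \frac{\lambda}{2} \lVert w \rVert_2^2
```

Here $D$ is the labeled data set, $P$ is a set of pairs drawn from $D$ (as in the difference construction on slide 31), $\alpha \in [0, 1]$ trades regression against ranking, and $\lambda$ controls regularization. Setting $\alpha = 1$ gives plain regression and $\alpha = 0$ gives plain pairwise ranking.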
  35. Thank You!
      Check out the NYC F# User Group: http://www.meetup.com/nyc-fsharp
      Find out more about F#: http://fsharp.org
      Contact me on twitter: @Rickasaurus
  36. Watch the video with slide synchronization on InfoQ.com! http://www.infoq.com/presentations/functional-data-scientist
