A Day in the Life of a Functional Data Scientist

Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/1tnEpjY.

Richard Minerich explains how ideas and tools from functional programming can save time and prevent subtle mistakes in data science, and how he incorporates them into his everyday workflow. Filmed at qconnewyork.com.

Richard Minerich works at Bayard Rock, applying cutting-edge research to anti-money laundering and fraud detection while using typed functional programming whenever possible. As an F# MVP he's been running events, speaking, and writing for over five years.

Published in: Technology

Transcript

  • 1. A day in the life of a functional data scientist Richard Minerich, Director of R&D at Bayard Rock @Rickasaurus
  • 2. Watch the video with slide synchronization on InfoQ.com! http://www.infoq.com/presentations/functional-data-scientist InfoQ.com: News & Community Site • 750,000 unique visitors/month • Published in 4 languages (English, Chinese, Japanese and Brazilian Portuguese) • Post content from our QCon conferences • News 15-20 / week • Articles 3-4 / week • Presentations (videos) 12-15 / week • Interviews 2-3 / week • Books 1 / month
  • 3. Presented at QCon New York www.qconnewyork.com Purpose of QCon - to empower software development by facilitating the spread of knowledge and innovation Strategy - practitioner-driven conference designed for YOU: influencers of change and innovation in your teams - speakers and topics driving the evolution and innovation - connecting and catalyzing the influencers and innovators Highlights - attended by more than 12,000 delegates since 2007 - held in 9 cities worldwide
  • 4. Projecting onto a 2D Plane
  • 5. The Pairwise Entity Resolution Process. Blocking: Two Datasets (Customer Data and Sanctions) → Pairs of Somehow Similar Records. Scoring: Pairs of Records → Probability of Representing Same Entity. Review: Records, Probability, Similarity Features → True/False Labels (Mostly by Hand)
  • 6. Blocking
  • 7. Scoring: Risk vs Probability (The Ideal) [chart axes: Likely to Launder Money; Probably the Same Person]
  • 8. The Reality (Dominated by Garbage) [chart annotations: Tiny Bump; Upper Threshold; values 937, 161, and 161,358]
  • 9. Let’s dig into a single point Jimmy Cournoyer El: 95/ SI:16
  • 10. Citation Network (Safe View)
  • 11. Relationship Network (Safe View)
  • 12. British Columbia Rizzuto Crime Family Jimmy “Cosmo” “Superman” Cournoyer Quebec New York/NYC Bonanno Crime Family John “Big Man” Venizelos Reinvested in Cocaine California Flow of Drugs Hells Angels El Chapo Sinaloa Cartel
  • 13. Jorge Hank Rhon Family & Friends $100s Millions Citibank, CH Brother Murdered
  • 14. % Time Spent [bar chart, scale 0 to 0.6; categories: Munging Data; Redoing Work / Investigating Problems; Fun Algorithms]
  • 15. Disgustingly Bad but Fairly Large Datasets ▪ Both Wide (many fields) and Tall (many records) ▪ From different systems (different encodings) ▪ Missing data ▪ Poorly merged data ▪ Extra data ▪ Non-unique IDs Every client is awful in a completely different way. Example record: NAME = LARRY O BRIAN | STATE = CANADA | CITY = 121 Buffalo Drive, Montreal, Quebec H3G 1Z2 | ADDRESS = NULL | ZIP = 00000 | DOB = 10/24/80; 1/1/1979
  • 16. SAM – Building for Bad Data ▪ Lazy Pure Functional Core ▪ Programmable Data Cleaning ▪ Programmable ETL ▪ Ad-Hoc Behaviors All with an F# Core and Barb for scripting. [Architecture diagram: UI (C#) & Analysis (C#); Glue (F# and Barb); Algorithms (F#); Data & Config In; Data Out]
  • 17. Other Kinds of Problems (sometimes even my fault) ▪ Extra / Missing Data (e.g. incorrect subset or incorrect joins) ▪ Wrong version of data (e.g. bad sync in SQL) ▪ Bad configuration of dependencies The data lives in a locked down environment and so feedback cycles are slow. Lesson: Be Paranoid
  • 18. F# Tools From Bayard Rock http://github.com/BayardRock [Table, Tokens: Classification] Pegasus Airlines: ORGANIZATION | Istanbul: LOCATION | Sochi: LOCATION | Russia: LOCATION | Turkey: LOCATION | Transportation Ministry: ORGANIZATION
  • 19. FSharpWebIntellisense https://github.com/BayardRock/FSharpWebIntellisense
  • 20. iFSharp Notebook https://github.com/BayardRock/IfSharp
  • 21. Barb, a simple .NET record query language Name.Contains "John" and (Age > 20 or Weight > 200) https://github.com/Rickasaurus/Barb
  • 22. MITIE Dot Net (a wrapper for MIT’s MITIE) A Pegasus Airlines plane landed at an Istanbul airport Friday after a passenger "said that there was a bomb on board" and wanted the plane to land in Sochi, Russia, the site of the Winter Olympics, said officials with Turkey's Transportation Ministry. https://github.com/BayardRock/MITIE-Dot-Net [Table, Tokens: Classification] Pegasus Airlines: ORGANIZATION | Istanbul: LOCATION | Sochi: LOCATION | Russia: LOCATION | Turkey: LOCATION | Transportation Ministry: ORGANIZATION
  • 23. Other F# Community Tools (Not by Us) ▪ Data Type Providers (SQL, OData, CSV, etc..) ▪ Language Type Providers (R, Matlab, Python soon) ▪ Deedle (like Pandas but for F#) ▪ F# Charting
  • 24. The Magic of Type Providers type Netflix = ODataService<"http://odata.netflix.com"> let netflix = Netflix.GetDataContext() let avatarTitles = query { for t in netflix.Titles do where (t.Name.Contains "Avatar") sortBy t.Name take 100 }
  • 25. How it works! Compiler Type Provider Types Erased Types The World! Type Providers! Libraries For Free!
  • 26. Deedle (Like Python’s pandas but for F#) ▪ Designed with Data Type Providers in Mind ▪ Interops with the R Type Provider
  • 27. But what about algorithmic code?
  • 28. Ranking vs Regression ▪ In Regression you’re trying to guess a number; only distance matters ▪ May do a very bad job at ordering ▪ In Ranking you’re trying to figure out some order; only order matters ▪ May do a very bad job at providing a meaningful number Example: You’re a doctor with 20 spots open and 100 patients who want to see you today. Which method would be best for selecting 20?
  • 29. Regression y = Xβ + ε (y is labels, X is features, β is weights, ε is errors)
  • 30. “OLS” Regression via Gradient Descent in F#
  • 31. Simple Ranking? You Can Use Regression. ▪ The features are the difference in would-be regression features ▪ The value to predict is the difference in rank Select 2 labeled samples randomly => (x1,y1) (x2,y2) x = x1 – x2 y = y1 – y2 Example (Sample 1, Sample 2, Result): Names? 1, 1, 0 | Addresses? 1, 0, 1 | DOB? 0, 1, -1 | Same Person? 0, 0, 0
  • 32. Simple Ranking in F#
  • 33. Combined Ranking and Regression – D. Sculley You can improve your regression with ranking, and your ranking with regression. The best of both worlds!
  • 34. Combined Ranking and Regression – D. Sculley @ Google, Inc
  • 35. Thank You! Check out the NYC F# User Group: http://www.meetup.com/nyc-fsharp Find out more about F#: http://fsharp.org Contact me on Twitter: @Rickasaurus
  • 36. Watch the video with slide synchronization on InfoQ.com! http://www.infoq.com/presentations/functional-data-scientist
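
Slide 21 shows a Barb query over records. For comparison, here is roughly what that query expresses as an ordinary F# predicate. This is only a sketch and does not use Barb itself: the `Person` record, its field types, and the sample data are assumptions made for illustration.

```fsharp
// Plain F# equivalent of the slide-21 query, for comparison only.
// This does not call Barb; the Person record and its fields are assumed.
type Person = { Name: string; Age: int; Weight: int }

/// True when the record satisfies:
/// Name.Contains "John" and (Age > 20 or Weight > 200)
let matches (p: Person) =
    p.Name.Contains("John") && (p.Age > 20 || p.Weight > 200)

// Hypothetical sample records.
let people =
    [ { Name = "John Smith"; Age = 35; Weight = 180 }
      { Name = "Johnny Doe"; Age = 18; Weight = 150 }
      { Name = "Jane Roe"; Age = 40; Weight = 210 } ]

// Keep only the records that satisfy the predicate.
let hits = people |> List.filter matches
```

The point of a query language like Barb, as the slide frames it, is that such a predicate can be supplied as text at run time instead of being compiled in.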
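
Slides 29 and 30 refer to least-squares regression fitted by gradient descent, but the code on the slide did not survive in this transcript. The following is a minimal sketch of the general technique under the model y = Xβ + ε, not the speaker's implementation. The function names, learning rate, step count, and sample data are all assumptions.

```fsharp
// Minimal sketch: batch gradient descent for least squares, y ≈ X·β.
// All names (predict, gradientStep, fit) are illustrative, not from the talk.

/// Dot product of the weight vector with one feature row.
let predict (beta: float[]) (row: float[]) =
    Array.fold2 (fun acc b x -> acc + b * x) 0.0 beta row

/// One descent step on the mean squared error.
let gradientStep (rate: float) (xs: float[][]) (ys: float[]) (beta: float[]) =
    let n = float xs.Length
    // Residuals: prediction minus label for every sample.
    let residuals = Array.map2 (fun row y -> predict beta row - y) xs ys
    // Gradient component j is (2/n) * sum_i residual_i * x_ij.
    beta
    |> Array.mapi (fun j b ->
        let g = Array.fold2 (fun acc r (row: float[]) -> acc + r * row.[j]) 0.0 residuals xs
        b - rate * (2.0 / n) * g)

/// Run a fixed number of steps starting from all-zero weights.
let fit (rate: float) (steps: int) (xs: float[][]) (ys: float[]) =
    let rec loop i beta =
        if i = 0 then beta else loop (i - 1) (gradientStep rate xs ys beta)
    loop steps (Array.zeroCreate (xs.[0].Length))

// Usage: points on the line y = 1 + 2x, with a constant 1.0 column for the intercept.
let xs = [| [| 1.0; 0.0 |]; [| 1.0; 1.0 |]; [| 1.0; 2.0 |]; [| 1.0; 3.0 |] |]
let ys = [| 1.0; 3.0; 5.0; 7.0 |]
let beta = fit 0.05 5000 xs ys
printfn "%A" beta
```

On this noise-free data the weights approach an intercept of 1 and a slope of 2. A fixed step count keeps the sketch short; a practical version would stop on a convergence test and scale the features first.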
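
Slides 31 and 32 describe ranking by regression on pairwise differences (x = x1 – x2, y = y1 – y2), again without surviving code. Below is a hedged sketch of just the transform step. The `Sample` record and both function names are hypothetical, and sampling with replacement is an assumption, since the slide only says the two samples are selected randomly.

```fsharp
// Sketch of the pairwise transform from slide 31: turn labeled samples (x, y)
// into difference samples (x1 - x2, y1 - y2) that a regression learner can fit.
// Names (Sample, pairDifference, samplePairs) are illustrative.

type Sample = { Features: float[]; Label: float }

/// Difference of two labeled samples: x = x1 - x2, y = y1 - y2.
let pairDifference (a: Sample) (b: Sample) =
    { Features = Array.map2 (-) a.Features b.Features
      Label = a.Label - b.Label }

/// Draw `count` random pairs (with replacement) and return their differences.
let samplePairs (rng: System.Random) (count: int) (data: Sample[]) =
    Array.init count (fun _ ->
        let a = data.[rng.Next(data.Length)]
        let b = data.[rng.Next(data.Length)]
        pairDifference a b)

// The worked example from slide 31 (Names?, Addresses?, DOB? -> Same Person?).
let s1 = { Features = [| 1.0; 1.0; 0.0 |]; Label = 0.0 }
let s2 = { Features = [| 1.0; 0.0; 1.0 |]; Label = 0.0 }
let d = pairDifference s1 s2
// d.Features = [| 0.0; 1.0; -1.0 |] and d.Label = 0.0, matching the Result column.
```

The difference samples can then be handed to any regression fitter. The learned weights score individual records, and sorting by that score gives the ranking.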
