Big Algorithms Made Easy with Microsoft's F#

Joel Pobar's slides from his presentation at TechEd Australia 2009

  1. 1. Joel Pobar Languages Geek DEV450 http://callvirt.net/blog/post/Why-F-(TechEd-09-DEV450).aspx
  2. 2. Agenda What is it? F# Intro Algorithms: Search Fuzzy Matching Classification (SVM) Recommendations Q&A
  3. 3. All This in 1 hour? This is an awareness session! Lots of content, very broad, very fast You’ll get all demos, pointers, and slide deck to take offline and digest Two takeaways: F# is a great language for data Smart algorithms aren’t hard – use them, explore more!
  4. 4. F# is ...a functional, object-oriented, imperative and explorative programming language for .NET what is Functional Programming?
  5. 5. What is Functional Programming? Wikipedia: “A programming paradigm that treats computation as the evaluation of mathematical functions and avoids state and mutable data” -> Emphasizes functions -> Emphasizes shapes of data, rather than impl. -> Modeled on lambda calculus -> Reduced emphasis on imperative -> Safely raises level of abstraction
  6. 6. Motivation for Functional Simplicity in life is good: cheaper, easier, faster, better. We typically achieve simplicity in software in two ways: By raising the level of abstraction (and OO was one design to raise abstraction) By increasing modularity Better composition and modularity == reuse Increasing signal to noise is another good strategy: Communicate more in less time with more clarity
  7. 7. Functional Programming Safer, while still being useful Useful C#, C++, … F# V.Next# Haskell Not Useful Unsafe Safe
  8. 8. Motivation for Functional Data driven world More and more data: need higher order algorithms and techniques to derive value from data Scalability is king Economies of software scale are changing: the web requires tools + frameworks + languages that scale to millions The Multi-core (r)evolution! Need more adaptive languages + compilers to scale Language features matter!
  9. 9. What is F# for? F# is a General Purpose Language Can be used for a broad range of programming tasks Superset of imperative and dynamic features Great for learning FP concepts Some particularly important domains: Financial modelling Data mining Scientific analysis Academic
  10. 10. Let
      Let binds values to identifiers.
      Type inference: the static typing of C# with the succinctness of a scripting language.
        let helloWorld = "Hello, World"
        printfn "%s" helloWorld   // the slide used print_any, which is no longer in the core library
        let myNum = 12
        let myAddFunction x y =
            let sum = x + y
            sum
  11. 11. Tuples
      Simple, very useful data structure
        let site1 = ("msdn.com", 10)
        let site2 = ("abc.net.au", 12)
        let site3 = ("news.com.au", 22)
        let allSites = (site1, site2, site3)
        let fst (a, b) = a
        let snd (a, b) = b
  12. 12. List, Arrays, Seq, and Options
      Lists and Arrays are first class citizens
      Options provide a some-or-nothing capability
        let list1 = ["Joel"; "Luke"]
        let array = [|2; 3; 5|]
        let myseq = seq [0; 1; 2]
        let option1 = Some "Joel"
        let option2 = None
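  13. 13. Records
      Simple concrete type definition
        type Person =
            { Name: string
              DateOfBirth: System.DateTime }
        // DateOfBirth is a DateTime, so it needs a DateTime value rather than the string "13/04/81"
        let n = { Name = "Joel"; DateOfBirth = System.DateTime(1981, 4, 13) }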
  14. 14. Immutability Data is immutable by default Values may not be changed
  15. 15. Discriminated Unions
      Great for representing the structure of data
        type Make = string
        type Model = string
        type Transport =
            | Car of Make * Model
            | Bicycle
        // Both of these identifiers are of type Transport
        let me = Car ("Holden", "Barina")
        let you = Bicycle
  16. 16. Functions
      Functions: like delegates, but unified and simple
      Deep type inference
        (fun x -> x + 1)
        let myFunc x = x + 1
        // F# Interactive reports: val myFunc : int -> int
        let rec factorial n = if n > 1 then n * factorial (n - 1) else 1
        let data = [5; 3; 4; 4; 5]
        // List.sortWith takes the comparison function; List.sort takes no comparer
        List.sortWith (fun x y -> x - y) data
  17. 17. Pattern Matching
      Helps tease apart data and data structures
      Works best with Unions and Records
        let (fst, _) = ("first", "second")
        System.Console.WriteLine(fst)
        let switchOnType (a: obj) =
            match a with
            | :? System.Int32 -> printfn "int!"
            | :? Transport -> printfn "Transport"
            | _ -> printfn "Everything Else!"
  18. 18. F# Interactive
  19. 19. Search Given a search term and a large document corpus, rank and return a list of the most relevant results…
  20. 20. Blog Crawler
  21. 21. Search Words Stemming? Tokenise Markup Title/Author/Date Links? A sign of strength? Let’s explore something simple
  22. 22. Search Simplify: For easy machine/language manipulation … and most importantly, easy computation Vectors: nature's own quality data structure Convenient machine representation (lists/arrays) Lots of existing vector math algorithms [Figure: a sample blog post ("After a loving incubation period, moonlight 2.0 has been released.", followed by "source code" and "FireFox binaries" links) reduced to a vector of term counts for terms such as incubation, moonlight, binaries, firefox, loving, linux and after]
  23. 23. Vector space: Term Count [Figure: term-count vectors for Document1 (Linux post) and Document2 (Animal post) over the terms the, incubation, crazy, moonlight, firefox, linux, dog and penguin]
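The term-count idea on this slide can be sketched in a few lines of F#. This is a minimal illustration, not the code from the talk: the `tokenise` and `termCounts` names, the separator characters and the sample sentence are all assumptions made for the example.

```fsharp
// Split text into lower-cased words (a deliberately naive tokeniser; assumption for illustration).
let tokenise (text: string) =
    text.ToLowerInvariant().Split([| ' '; ','; '.'; '\n' |], System.StringSplitOptions.RemoveEmptyEntries)

// Turn a document into a term-count vector: one count per vocabulary term, in vocabulary order.
let termCounts (vocabulary: string[]) (text: string) =
    let counts = tokenise text |> Array.countBy id |> Map.ofArray
    vocabulary |> Array.map (fun term -> defaultArg (Map.tryFind term counts) 0)

// Hypothetical vocabulary and document.
let vector = termCounts [| "the"; "dog"; "cat" |] "The dog chased the dog."
```

Every document is projected onto the same vocabulary, so the resulting arrays can be compared position by position with ordinary vector math.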
  24. 24. Term Count Issues
      [Figure: term-count vectors for the two documents over the terms incubation, moonlight, penguin, firefox, crazy, linux, dog and the]
      Linux post vector: 9 1 1 6 4 6 0 2
      Animal post vector: 2 0 2 0 0 0 1 5
      Query ‘the dog penguin’:
        Linux: 9+0+2 = 11
        Animal: 2+1+5 = 8
      ‘the’ is overweight
      Enter TF-IDF: Term Frequency Inverse Document Frequency. A weight to evaluate how important a word is to a corpus, i.e. if ‘the’ occurs in 98% of all documents, we shouldn’t weight it very highly in the total query
  25. 25. TF-IDF Normalise the term count against the doc: tf = termCount / docWordCount Measure importance of term idf = log ( |D| / termInDocumentCount) where |D| is the total documents in the corpus tfidf = tf * idf A high weight is reached by high term frequency, and a low document frequency
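The three formulas on this slide translate almost directly into F#. The sketch below assumes documents are already tokenised into non-empty word arrays; the `corpus` contents and the function names are made up for illustration, and `log` is the natural logarithm (the slide does not say which base it uses).

```fsharp
// Hypothetical corpus: each document is an array of already-tokenised words.
let corpus : string[][] =
    [| [| "the"; "linux"; "penguin" |]
       [| "the"; "dog"; "penguin"; "dog" |]
       [| "the"; "firefox"; "moonlight" |] |]

// tf = termCount / docWordCount (assumes the document is not empty)
let tf (term: string) (doc: string[]) =
    let termCount = doc |> Array.filter (fun w -> w = term) |> Array.length
    float termCount / float doc.Length

// idf = log (|D| / termInDocumentCount); 0.0 when no document contains the term
let idf (term: string) (docs: string[][]) =
    let containing = docs |> Array.filter (Array.contains term) |> Array.length
    if containing = 0 then 0.0
    else log (float docs.Length / float containing)

// tfidf = tf * idf
let tfidf (term: string) (doc: string[]) (docs: string[][]) =
    tf term doc * idf term docs
```

With this corpus, `tfidf "the" corpus.[0] corpus` is 0.0 because ‘the’ appears in every document, while `tfidf "dog" corpus.[1] corpus` is positive because ‘dog’ is frequent in one document and absent from the others.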
  26. 26. Search in under 10 minutes
  27. 27. Fuzzy Matching String similarity algorithms: SoundEx; Metaphone Jaro Winkler Distance; Cosine similarity; Sellers; Euclidean distance; … We’ll look at the Levenshtein Distance algorithm Defined as: The minimum number of edit operations that transform string1 into string2
  28. 28. Fuzzy Matching Edit costs: In-place copy - cost 0 Delete a character in string1 - cost 1 Insert a character in string2 - cost 1 Substitute a character for another - cost 1 Transform ‘kitten’ into ‘sitting’ kitten -> sitten (cost 1 - replace k with s) sitten -> sittin (cost 1 - replace e with i) sittin -> sitting (cost 1 - add g) Levenshtein distance: 3
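The edit costs above can be turned into a short dynamic-programming routine. This is a sketch of the standard two-row formulation, which runs in time proportional to the product of the two string lengths; it is not necessarily the implementation shown in the talk, and the `levenshtein` name is an assumption.

```fsharp
let levenshtein (s1: string) (s2: string) =
    // prev.[j] is the distance between the first (i - 1) characters of s1
    // and the first j characters of s2; it starts as the cost of j inserts.
    let mutable prev = Array.init (s2.Length + 1) id
    for i in 1 .. s1.Length do
        let curr = Array.zeroCreate<int> (s2.Length + 1)
        curr.[0] <- i                                   // delete i characters from s1
        for j in 1 .. s2.Length do
            let substCost = if s1.[i - 1] = s2.[j - 1] then 0 else 1
            curr.[j] <-
                min (min (prev.[j] + 1)                 // delete from s1
                         (curr.[j - 1] + 1))            // insert into s1
                    (prev.[j - 1] + substCost)          // copy (cost 0) or substitute (cost 1)
        prev <- curr
    prev.[s2.Length]

let distance = levenshtein "kitten" "sitting"           // 3, matching the worked example
```

Working on integer arrays rather than building intermediate strings keeps allocations low, which matters when many pairs of words are compared.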
  29. 29. Fuzzy Matching Estimated string similarity computation costs: Hard on the GC (lots of temporary strings created and thrown away); use arrays if possible. Levenshtein can be computed in O(kl) time, where ‘l’ is the length of the shortest string and ‘k’ is the maximum distance. Parallelisable: split the set of words to compare across n cores. Can do approximately 10,000 compares per second on a standard single core laptop.
  30. 30. Did You Mean?
  31. 31. Classification Support Vector Machines (SVM) Supervised learning for binary classification Training Inputs: ‘in’ and ‘out’ vectors. SVM will then find a separating ‘hyperplane’ in an n-dimensional space Training costs, but classification is cheap Can retrain on the fly in some cases
  32. 32. Classification
  33. 33. SVM Issues Classification on 2 dimensions is easy, but most input is multi-dimensional Some ‘tricks’ are needed to transform the input data
  34. 34. SVM Classifier Demo
  35. 35. F# Recommendation Engine Netflix Prize - $1 million USD Must beat Netflix prediction algorithm by 10% 480k users 100 million ratings 18,000 movies Great example of deriving value out of large datasets Earns Netflix loads and loads of $$$!
  36. 36. Netflix Data Format MovieId CustomerId Rating Clerks 444444 5 Clerks 2093393 4 Clerks 999 5 Clerks 8668478 1 Dogma 2432114 3 Dogma 444444 5 Dogma 999 5 ... ... ...
  37. 37. Nearest Neighbour MovieId CustomerId Rating Clerks 444444 5 Clerks 2093393 4 Clerks 999 5 Clerks 8668478 1 Dogma 2432114 3 Dogma 444444 5 Dogma 999 5 ... ... ...
  38. 38. Nearest Neighbour Find the best movies my neighbours agree on CustomerId 302 4418 3 56 732 444444 5 4 5 2 999 5 5 1 111211 3 5 3 66666 5 5 1212121 5 4 5656565 1 454545 5 5
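One way to read the nearest-neighbour idea is sketched below: measure how far apart two customers are on the movies they have both rated, pick the closest customer, and suggest that customer's movies that I have not rated yet. The data, the type alias and all function names are hypothetical, and the Netflix demo from the talk may work differently.

```fsharp
// Hypothetical ratings: customerId -> (movieId -> rating). Values are made up for illustration.
type Ratings = Map<int, Map<int, float>>

let ratings : Ratings =
    Map.ofList
        [ (1, Map.ofList [ (302, 5.0); (4418, 4.0); (3, 5.0) ])
          (2, Map.ofList [ (302, 5.0); (4418, 5.0); (56, 4.0) ])
          (3, Map.ofList [ (302, 1.0); (4418, 1.0); (732, 5.0) ]) ]

// Euclidean distance over the movies both customers rated; None if they share no movies.
let distance (a: Map<int, float>) (b: Map<int, float>) =
    let squaredDiffs =
        a
        |> Map.toList
        |> List.choose (fun (movie, ra) ->
            Map.tryFind movie b |> Option.map (fun rb -> (ra - rb) * (ra - rb)))
    if List.isEmpty squaredDiffs then None else Some (sqrt (List.sum squaredDiffs))

// The customer closest to `me`, if any customer shares at least one movie.
let nearestNeighbour (me: int) (all: Ratings) =
    let mine = all.[me]
    all
    |> Map.toList
    |> List.filter (fun (cid, _) -> cid <> me)
    |> List.choose (fun (cid, theirs) -> distance mine theirs |> Option.map (fun d -> (cid, d)))
    |> List.sortBy snd
    |> List.tryHead
    |> Option.map fst

// Movies the nearest neighbour rated that `me` has not, best rated first.
let recommend (me: int) (all: Ratings) =
    match nearestNeighbour me all with
    | None -> []
    | Some neighbour ->
        let mine = all.[me]
        all.[neighbour]
        |> Map.toList
        |> List.filter (fun (movie, _) -> not (Map.containsKey movie mine))
        |> List.sortByDescending snd
        |> List.map fst
```

In this made-up data set customer 2 is the closest to customer 1, so `recommend 1 ratings` returns the one movie customer 2 rated that customer 1 has not.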
  39. 39. Netflix Demo
  40. 40. Vector Math Made Easy A (x1,y1) B (x2,y2) C (x0,y0) If we want to calculate the distance between A and B, we call on Euclidean Distance. We can represent the points in the same way using Vectors: Magnitude and Direction. This Vector representation allows us to work in ‘n’ dimensions, yet still achieve Euclidean Distance/Angle calculations.
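The distance and angle calculations mentioned on this slide generalise to any number of dimensions once points are stored as arrays. The following is a small sketch under that assumption; the function names are illustrative, and both arrays must have the same length (`Array.map2` raises an exception otherwise).

```fsharp
// Euclidean distance between two n-dimensional points.
let euclideanDistance (a: float[]) (b: float[]) =
    Array.map2 (fun x y -> (x - y) * (x - y)) a b |> Array.sum |> sqrt

// Length (magnitude) of a vector.
let magnitude (v: float[]) =
    v |> Array.sumBy (fun x -> x * x) |> sqrt

// Cosine of the angle between two vectors: 1.0 for the same direction, 0.0 for a right angle.
// Assumes neither vector is all zeros.
let cosineSimilarity (a: float[]) (b: float[]) =
    let dot = Array.map2 (fun x y -> x * y) a b |> Array.sum
    dot / (magnitude a * magnitude b)

let d = euclideanDistance [| 0.0; 0.0 |] [| 3.0; 4.0 |]   // 5.0 (the 3-4-5 triangle)
```

The same functions work unchanged on term-count or rating vectors with thousands of dimensions, which is what makes the vector representation convenient for search and recommendations.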
  41. 41. http://callvirt.net/blog/post/Why-F-(TechEd-09-DEV450).aspx
  42. 42. © 2009 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.
