1.
Joel Pobar
Languages Geek
DEV450
http://callvirt.net/blog/post/Why-F-(TechEd-09-DEV450).aspx
2.
Agenda
What is it?
F# Intro
Algorithms:
Search
Fuzzy Matching
Classification (SVM)
Recommendations
Q&A
3.
All This in 1 hour?
This is an awareness session!
Lots of content, very broad, very fast
You’ll get all demos, pointers, and slide deck to take
offline and digest
Two takeaways:
F# is a great language for data
Smart algorithms aren’t hard – use them, explore
more!
4.
F# is
...a functional, object-oriented, imperative and
explorative programming language for .NET
What is Functional Programming?
5.
What is Functional Programming?
Wikipedia: “A programming paradigm that
treats computation as the evaluation of
mathematical functions and avoids state and
mutable data”
-> Emphasizes functions
-> Emphasizes shapes of data, rather than impl.
-> Modeled on lambda calculus
-> Reduced emphasis on imperative
-> Safely raises level of abstraction
6.
Motivation for Functional
Simplicity in life is good: cheaper, easier, faster,
better.
We typically achieve simplicity in software in two
ways:
By raising the level of abstraction (and OO was one
design to raise abstraction)
Increasing modularity
Better composition and modularity == reuse
Increasing signal to noise is another good strategy:
Communicate more in less time with more clarity
7.
Functional Programming
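Safer, while still being useful
[Chart: languages plotted from Not Useful to Useful against Unsafe to Safe. C#, C++, … are useful but unsafe; Haskell is safe but not useful; F# and V.Next move toward useful and safe.]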
8.
Motivation for Functional
Data driven world
More and more data: need higher order algorithms
and techniques to derive value from data
Scalability is king
Economies of software scale are changing: the web
requires tools + frameworks + languages that scale
to millions
The Multi-core (r)evolution!
Need more adaptive languages + compilers to scale
Language features matter!
9.
What is F# for?
F# is a General Purpose Language
Can be used for a broad range of programming
tasks
Superset of imperative and dynamic features
Great for learning FP concepts
Some particularly important domains:
Financial modelling
Data mining
Scientific analysis
Academic
10.
Let
Let binds values to identifiers
Type inference: the static typing of C# with the succinctness of a scripting language
let helloWorld = "Hello, World"
printfn "%A" helloWorld
let myNum = 12
let myAddFunction x y =
    let sum = x + y
    sum
11.
Tuples
Simple, very useful data structure
let site1 = ("msdn.com", 10)
let site2 = ("abc.net.au", 12)
let site3 = ("news.com.au", 22)
let allSites = (site1, site2, site3)
let fst (a, b) = a
let snd (a, b) = b
12.
List, Arrays, Seq, and Options
Lists and Arrays are first class citizens
Options provide a some-or-nothing capability
let list1 = ["Joel"; "Luke"]
let array = [|2; 3; 5;|]
let myseq = seq [0; 1; 2; ]
let option1 = Some("Joel")
let option2 = None
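A minimal sketch of the some-or-nothing idea in use. The `sites` list reuses the tuple values from the Tuples slide; `tryFindCount` and `describe` are hypothetical helpers added for illustration only.

```fsharp
let sites = [ ("msdn.com", 10); ("abc.net.au", 12); ("news.com.au", 22) ]

// Hypothetical helper: Some count when the site is known, None otherwise.
let tryFindCount (name: string) =
    sites
    |> List.tryFind (fun (site, _) -> site = name)
    |> Option.map snd

// The caller has to handle both cases, so a missing value cannot be ignored.
let describe name =
    match tryFindCount name with
    | Some count -> sprintf "%s -> %d" name count
    | None -> sprintf "%s -> not found" name

printfn "%s" (describe "msdn.com")
printfn "%s" (describe "example.org")
```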
13.
Records
Simple concrete type definition
type Person =
    { Name: string;
      DateOfBirth: System.DateTime; }
let n = { Name = "Joel";
          DateOfBirth = System.DateTime(1981, 4, 13); }
14.
Immutability
Data is immutable by default
Values may not be changed
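A small sketch of what immutable-by-default looks like in practice. The `Point` record and the value names are illustrative assumptions, not part of the original demos.

```fsharp
// Values bound with let are immutable by default.
let total = 10
// total <- 11        // would not compile: total is not mutable

// "Changing" a value means building a new one from the old one.
let newTotal = total + 1

// Records are copied with changes rather than updated in place.
type Point = { X: int; Y: int }
let p1 = { X = 1; Y = 2 }
let p2 = { p1 with Y = 5 }   // p1 is untouched

// Mutation is still available, but it has to be requested explicitly.
let mutable counter = 0
counter <- counter + 1
```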
15.
Discriminated Unions
Great for representing the structure of data
type Make = string
type Model = string
type Transport =
    | Car of Make * Model
    | Bicycle
let me = Car ("Holden", "Barina")
let you = Bicycle
Both of these identifiers are of type “Transport”
16.
Functions
Functions: like delegates, but unified and simple
Deep type inference
(fun x -> x + 1)
let myFunc x = x + 1
// val myFunc : int -> int
let rec factorial n =
    if n > 1 then n * factorial (n - 1)
    else 1
let data = [5; 3; 4; 4; 5]
List.sortWith (fun x y -> x - y) data
17.
Pattern Matching
Helps tease apart data and data structures
Works best with Unions and Records
open System
let (fst, _) = ("first", "second")
Console.WriteLine(fst)
let switchOnType (a: obj) =
    match a with
    | :? Int32 -> printfn "int!"
    | :? Transport -> printfn "Transport"
    | _ -> printfn "Everything Else!"
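Since pattern matching works best with unions, here is a short sketch that matches on the Transport union from the Discriminated Unions slide. The types are repeated so the snippet stands alone; `describe` is a hypothetical helper.

```fsharp
type Make = string
type Model = string
type Transport =
    | Car of Make * Model
    | Bicycle

// Each case is matched by shape, and the data inside Car is bound to names.
let describe transport =
    match transport with
    | Car (make, model) -> sprintf "Car: %s %s" make model
    | Bicycle -> "Bicycle"

printfn "%s" (describe (Car ("Holden", "Barina")))
printfn "%s" (describe Bicycle)
```

If a new case is later added to Transport, the compiler warns about every match that does not handle it.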
21.
Search
Words
Stemming?
Tokenise
Markup
Title/Author/Date
Links?
A sign of strength?
Let’s explore
something simple
22.
Search
Simplify:
For easy machine/language manipulation
… and most importantly, easy computation
Vectors: nature’s own quality data structure
Convenient machine representation (lists/arrays)
Lots of existing vector math algorithms
Sample post: “After a loving incubation period, moonlight 2.0 has been released. <a href="whatever">source code</a><br><a href="something else">FireFox binaries</a> …”
[Figure: the post is reduced to terms (after, loving, incubation, moonlight, binaries, firefox, linux, …) and then to a vector of term counts, e.g. 2 1 1 6 4 6 2]
23.
Vector space:
Term Count
Term         Document1 (Linux post)   Document2 (Animal post)
the          9                        2
incubation   1                        0
crazy        1                        2
moonlight    6                        0
firefox      4                        0
linux        6                        0
dog          0                        1
penguin      2                        5
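A rough sketch of turning text into a term-count vector. The tokeniser here is an assumption: it only lowercases and splits on whitespace and basic punctuation, with no stemming or markup handling. `tokenise`, `termCounts`, and the sample sentence are illustrative names and data.

```fsharp
open System

let separators = [| ' '; '.'; ','; '!'; '?'; '\n'; '\t' |]

// Split text into lowercase word tokens.
let tokenise (text: string) =
    text.ToLowerInvariant().Split(separators, StringSplitOptions.RemoveEmptyEntries)

// Count how often each vocabulary term occurs; the vocabulary fixes the vector order.
let termCounts (vocabulary: string list) (text: string) =
    let counts = tokenise text |> Seq.countBy id |> Map.ofSeq
    vocabulary
    |> List.map (fun term -> defaultArg (Map.tryFind term counts) 0)

let vocabulary = [ "the"; "dog"; "penguin" ]
let vector = termCounts vocabulary "The penguin saw the dog. The dog ran."
printfn "%A" vector
```

Every document mapped through the same vocabulary yields a vector of the same length, which is what makes the later vector math possible.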
24.
Term Count Issues
Query: ‘the dog penguin’
Terms: the, incubation, crazy, moonlight, firefox, linux, dog, penguin
Linux: [9 1 1 6 4 6 0 2], score 9+0+2 = 11
Animal: [2 0 2 0 0 0 1 5], score 2+1+5 = 8
‘the’ is overweight
Enter TF-IDF: Term Frequency Inverse Document
Frequency
A weight to evaluate how important a word is to a document in a corpus
e.g. if ‘the’ occurs in 98% of all documents, we shouldn’t weight it very highly in the total query
25.
TF-IDF
Normalise the term count against the doc:
tf = termCount / docWordCount
Measure importance of term
idf = log ( |D| / termInDocumentCount)
where |D| is the total number of documents in the corpus and termInDocumentCount is the number of documents containing the term
tfidf = tf * idf
A high weight is reached by high term frequency,
and a low document frequency
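The formulas above can be sketched directly in F#. This assumes each document is already tokenised into an array of words, uses the natural logarithm, and treats a term found in no document as weight 0; the tiny corpus is made up for illustration.

```fsharp
// tf = termCount / docWordCount
let tf (term: string) (doc: string[]) =
    let termCount = doc |> Array.filter ((=) term) |> Array.length
    float termCount / float doc.Length

// idf = log (|D| / number of documents containing the term)
let idf (term: string) (corpus: string[] list) =
    let docsWithTerm =
        corpus |> List.filter (Array.contains term) |> List.length
    // Assumption: a term that appears in no document gets weight 0.
    if docsWithTerm = 0 then 0.0
    else log (float corpus.Length / float docsWithTerm)

let tfidf term doc corpus = tf term doc * idf term corpus

// Illustrative corpus of three tiny documents.
let corpus =
    [ [| "the"; "linux"; "penguin" |]
      [| "the"; "dog"; "the"; "penguin" |]
      [| "the"; "moonlight" |] ]

let linuxDoc = List.head corpus
printfn "the:   %f" (tfidf "the" linuxDoc corpus)
printfn "linux: %f" (tfidf "linux" linuxDoc corpus)
```

Because ‘the’ appears in every document of this corpus, its idf is log 1 = 0 and it drops out of the score, while ‘linux’ keeps a positive weight.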
27.
Fuzzy Matching
String similarity algorithms:
SoundEx; Metaphone
Jaro Winkler Distance; Cosine similarity; Sellers;
Euclidean distance; …
We’ll look at Levenshtein Distance algorithm
Defined as: the minimum number of edit operations that transform string1 into string2
28.
Fuzzy Matching
Edit costs:
In-place copy – cost 0
Delete a character in string1 – cost 1
Insert a character in string2 – cost 1
Substitute a character for another – cost 1
Transform ‘kitten’ into ‘sitting’
kitten -> sitten (cost 1 – replace k with s)
sitten -> sittin (cost 1 - replace e with i)
sittin -> sitting (cost 1 – add g)
Levenshtein distance: 3
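The edit costs above map onto the classic dynamic-programming table. This is a minimal sketch of that textbook approach, not necessarily the version used in the session demos.

```fsharp
// Levenshtein distance using a full (m+1) x (n+1) table.
let levenshtein (s: string) (t: string) =
    let m, n = s.Length, t.Length
    // d.[i, j] = distance between the first i chars of s and the first j chars of t
    let d : int[,] = Array2D.zeroCreate (m + 1) (n + 1)
    for i in 0 .. m do d.[i, 0] <- i   // delete i characters
    for j in 0 .. n do d.[0, j] <- j   // insert j characters
    for i in 1 .. m do
        for j in 1 .. n do
            // In-place copy costs 0, substitution costs 1.
            let cost = if s.[i - 1] = t.[j - 1] then 0 else 1
            let delete = d.[i - 1, j] + 1
            let insert = d.[i, j - 1] + 1
            let copyOrSubstitute = d.[i - 1, j - 1] + cost
            d.[i, j] <- min delete (min insert copyOrSubstitute)
    d.[m, n]

printfn "%d" (levenshtein "kitten" "sitting")
```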
29.
Fuzzy Matching
Estimated string similarity computation costs:
Hard on the GC (lots of temporary strings created and thrown away); use arrays if possible.
Levenshtein can be computed in O(kl) time, where ‘l’ is the length of the shorter string, and ‘k’ is the maximum distance.
Parallelisable – split the set of words to compare
across n cores.
Can do approximately 10,000 compares per second
on a standard single core laptop.
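A sketch of the two suggestions above: work on int arrays instead of temporary strings, and spread the comparisons across cores. It uses a plain two-row Levenshtein (not the bounded O(kl) variant) and `Array.Parallel.map` from the F# core library; `distance`, `closeMatches`, and the word list are illustrative.

```fsharp
// Two-row Levenshtein: only two int arrays are allocated per comparison.
let distance (s: string) (t: string) =
    let mutable previous : int[] = Array.init (t.Length + 1) id
    let mutable current : int[] = Array.zeroCreate (t.Length + 1)
    for i in 1 .. s.Length do
        current.[0] <- i
        for j in 1 .. t.Length do
            let cost = if s.[i - 1] = t.[j - 1] then 0 else 1
            let best = min (current.[j - 1] + 1) (previous.[j - 1] + cost)
            current.[j] <- min (previous.[j] + 1) best
        // Swap rows so the finished row becomes "previous".
        let finished = current
        current <- previous
        previous <- finished
    previous.[t.Length]

// Hypothetical helper: compare the query against every word in parallel
// and keep the words within maxDistance edits.
let closeMatches (maxDistance: int) (query: string) (words: string[]) =
    words
    |> Array.Parallel.map (fun word -> word, distance query word)
    |> Array.filter (fun (_, d) -> d <= maxDistance)

let matches = closeMatches 2 "kitten" [| "sitten"; "sitting"; "mitten"; "dog" |]
printfn "%A" matches
```

Each comparison is independent of the others, which is why the work splits cleanly across cores.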
31.
Classification
Support Vector Machines (SVM)
Supervised learning for binary classification
Training Inputs: ‘in’ and ‘out’ vectors.
SVM will then find a separating ‘hyperplane’ in an
n-dimensional space
Training costs, but classification is cheap
Can retrain on the fly in some cases
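To show why classification is cheap once training is done, here is a sketch of the decision step only, assuming a linear SVM. Training is not shown, and the weights and bias are made-up values standing in for a trained model; `LinearModel`, `dot`, and `classify` are hypothetical names.

```fsharp
// A trained linear SVM reduces to a weight vector and a bias.
type LinearModel = { Weights: float[]; Bias: float }

let dot (a: float[]) (b: float[]) =
    Array.fold2 (fun acc x y -> acc + x * y) 0.0 a b

// Classification is one dot product: which side of the hyperplane
// w.x + b = 0 does the input fall on?
let classify (model: LinearModel) (input: float[]) =
    if dot model.Weights input + model.Bias >= 0.0 then "in" else "out"

// Made-up model: the hyperplane is the line x = y in two dimensions.
let model = { Weights = [| 1.0; -1.0 |]; Bias = 0.0 }
printfn "%s" (classify model [| 3.0; 1.0 |])
printfn "%s" (classify model [| 1.0; 3.0 |])
```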
35.
F# Recommendation Engine
Netflix Prize - $1 million USD
Must beat Netflix prediction algorithm by 10%
480k users
100 million ratings
18,000 movies
Great example of deriving value out of large
datasets
Earns Netflix loads and loads of $$$!
40.
Vector Math Made Easy
A (x1,y1)
B (x2,y2)
C (x0,y0)
If we want to calculate the distance between A and B, we call on
Euclidean Distance
We can represent the points in the same way using Vectors:
Magnitude and Direction.
Having this Vector representation allows us to work in ‘n’ dimensions, yet still achieve Euclidean Distance/Angle calculations.
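A minimal sketch of those calculations over float arrays of any length. The function names are illustrative; cosine similarity is included as one common way to compare the angle between two vectors.

```fsharp
let dot (a: float[]) (b: float[]) =
    Array.fold2 (fun acc x y -> acc + x * y) 0.0 a b

// Length of a vector.
let magnitude (v: float[]) = sqrt (dot v v)

// Straight-line distance between two points in n dimensions.
let euclideanDistance (a: float[]) (b: float[]) =
    Array.map2 (fun x y -> x - y) a b |> magnitude

// Cosine of the angle between two vectors: 1 = same direction, 0 = orthogonal.
let cosineSimilarity (a: float[]) (b: float[]) =
    dot a b / (magnitude a * magnitude b)

printfn "%f" (euclideanDistance [| 0.0; 0.0 |] [| 3.0; 4.0 |])
printfn "%f" (cosineSimilarity [| 1.0; 0.0 |] [| 0.0; 1.0 |])
```

The same functions work unchanged on the term-count vectors from the Search slides, where each dimension is a term.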