Algorithmic Data Science
=
Theory + Practice
Matteo Riondato – Labs, Two Sigma Investments
@teorionda – http://matteo.rionda.to
IEEE MIT URTC – November 5, 2016
1 / 24
Matteo Riondato
Ph.D. in CS
Working at
Labs, Two Sigma Investments (Research Scientist);
CS Dept., Brown U. (Visiting Asst. Prof.);
Doing research on algorithmic data science;
Tweeting @teorionda;
Reading matteo@twosigma.com;
“Living” at http://matteo.rionda.to.
2 / 24
Conjecture
Let X be a scientific discipline. Then
21st-century X = datascience(X) + ε.
Partial evidence: “Computational X” exists for many X.
3 / 24
data science : 21st century = statistics : 20th century
4 / 24
data science for 21st-century society



questions
data
5 / 24
data science
6 / 24
data science =



1/4 data representation and management
1/4 mathematical and statistical modeling
1/4 computational thinking and algorithms
1/4 domain expertise
Shake well, and strain into a cocktail glass.
7 / 24
domain expertise modeling
management
algorithms
8 / 24
algorithmic data science:
=
algorithms for/with:



approximation guarantees
data streams
Spark/MapReduce
sampling
statistical testing
graph analysis
...
9 / 24
algorithmic data science
=
theory
10 / 24
algorithmic data science
≈
theory + practice
10 / 24
algorithmic data science
=
(theory × practice)^(theory × practice)
10 / 24
Example
11 / 24
Scientific question: Find relevant webpages on the web, influential participants in
an email chain, key proteins in a network, . . .
Data representation: represent the data as a graph G = (V , E).
[Figure: example graph G with nodes a, b, c, d, e, f, g, h]
Modeling question: What are the important nodes in a graph G = (V , E)?
We need f : V → R+ to express the importance of a node.
The higher f(x) is, the more important x ∈ V is.
12 / 24
Domain Knowledge / Modeling: Assume that
1) every node wants to communicate with every node; and
2) communication progresses along Shortest Paths (SPs).
Then, the higher the no. of SPs that a node v belongs to, the more important v is.
Definition
For each node x ∈ V, the betweenness b(x) of x is:
\[ b(x) = \frac{1}{n(n-1)} \sum_{u \neq x \neq v \in V} \frac{\sigma_{uv}(x)}{\sigma_{uv}} \in [0, 1] \]
• σ_uv: number of SPs from u to v, u, v ∈ V;
• σ_uv(x): number of SPs from u to v that go through x.
I.e., b(x) is the weighted fraction of SPs that go through x, among all SPs in G.
13 / 24
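The definition can be evaluated directly on a small graph. The sketch below is only an illustration under stated assumptions: the graph is unweighted and given as an adjacency dictionary, shortest paths are counted with BFS, and the helper names `bfs_counts` and `betweenness_by_definition` are hypothetical. It uses the fact that x lies on a shortest path from u to v exactly when d(u, x) + d(x, v) = d(u, v), in which case σ_uv(x) = σ_ux · σ_xv.

```python
from collections import deque


def bfs_counts(adj, s):
    """Return (dist, sigma): BFS distances from s and the number of SPs from s."""
    dist = {s: 0}
    sigma = {s: 1}
    queue = deque([s])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:            # first time w is reached
                dist[w] = dist[u] + 1
                sigma[w] = 0
                queue.append(w)
            if dist[w] == dist[u] + 1:   # u precedes w on some SP from s
                sigma[w] += sigma[u]
    return dist, sigma


def betweenness_by_definition(adj):
    """Sum sigma_uv(x) / sigma_uv over ordered pairs (u, v), then normalize."""
    nodes = list(adj)
    n = len(nodes)
    info = {s: bfs_counts(adj, s) for s in nodes}
    b = {x: 0.0 for x in nodes}
    for u in nodes:
        dist_u, sigma_u = info[u]
        for v in nodes:
            if v == u or v not in dist_u:
                continue
            for x in nodes:
                if x == u or x == v or x not in dist_u:
                    continue
                dist_x, sigma_x = info[x]
                # x is on an SP from u to v iff the distances add up.
                if v in dist_x and dist_u[x] + dist_x[v] == dist_u[v]:
                    b[x] += sigma_u[x] * sigma_x[v] / sigma_u[v]
    return {x: b[x] / (n * (n - 1)) for x in nodes}


# Toy path graph a - b - c (not the graph from the slides).
path = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
print(betweenness_by_definition(path))
```

This brute-force version is cubic in the number of nodes and is only meant to make the definition concrete.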
[Figure: the example graph G on nodes a to h]
Node x   a   b       c       d       e       f       g       h
b(x)     0   0.250   0.125   0.036   0.054   0.080   0.268   0
14 / 24
Algorithmic question: How to compute all b(x)?
Brandes’ Algorithm
Intuition: For each vertex s ∈ V :
1) Build the SP DAG from s via Dijkstra/BFS;
2) Traverse the SP DAG from the most distant node towards s, in reverse order of
distance. During the walk, appropriately increment b(v) of each non-leaf node v
traversed.
[Figure: SP DAG from source s = 1 over nodes 1 to 9 (update to b(v) not shown)]
Time complexity: O(nm + n^2 log n)
n Dijkstra’s, plus n backward walks,
taking at most n each
Too much even with just 10^4 nodes.
15 / 24
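The two steps of the intuition can be sketched in code. This is a minimal sketch, assuming an unweighted graph stored as an adjacency dictionary (so BFS replaces Dijkstra) and the 1/(n(n − 1)) normalization from the definition of b(x); the function name `brandes_betweenness` is hypothetical.

```python
from collections import deque


def brandes_betweenness(adj):
    """Exact betweenness of every node of an unweighted graph."""
    nodes = list(adj)
    n = len(nodes)
    b = {v: 0.0 for v in nodes}
    for s in nodes:
        # 1) Build the SP DAG from s via BFS: distances, SP counts, predecessors.
        dist = {s: 0}
        sigma = {v: 0 for v in nodes}
        sigma[s] = 1
        preds = {v: [] for v in nodes}
        order = []                       # nodes in non-decreasing distance from s
        queue = deque([s])
        while queue:
            u = queue.popleft()
            order.append(u)
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
                if dist[w] == dist[u] + 1:
                    sigma[w] += sigma[u]
                    preds[w].append(u)
        # 2) Walk the DAG from the most distant node back towards s,
        #    pushing each node's accumulated dependency to its predecessors.
        delta = {v: 0.0 for v in nodes}
        for w in reversed(order):
            for u in preds[w]:
                delta[u] += sigma[u] / sigma[w] * (1.0 + delta[w])
            if w != s:
                b[w] += delta[w]
    return {v: b[v] / (n * (n - 1)) for v in nodes}


# Toy 4-cycle a - b - c - d - a (not the graph from the slides).
cycle = {"a": ["b", "d"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c", "a"]}
print(brandes_betweenness(cycle))
```

Each source costs one BFS plus one backward pass, so the loop over all n sources is what makes the exact computation expensive on large graphs.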
Modeling / Domain knowledge:
High-quality approximations of all BCs are sufficient.
Let ε ∈ (0, 1) and δ ∈ (0, 1) be user-specified parameters.
An (ε, δ)-approximation is a set {\tilde{b}(x), x ∈ V} of n values s.t.
\[ \Pr\left(\exists x \in V \text{ s.t. } |\tilde{b}(x) - b(x)| > \varepsilon\right) \le \delta \]
i.e., with prob. ≥ 1 − δ, for all x ∈ V, \tilde{b}(x) is within ε of b(x):
a uniform probabilistic guarantee over all the estimations.
16 / 24
Algorithmic question:
How to obtain an (ε, δ)-approximation quickly?
Answer:
Sampling
Instead of computing all the SPs from each node x ∈ V , compute them only from
some randomly chosen nodes (samples).
Theory question:
How many samples do we need to obtain an (ε, δ)-approximation?
The more the better, but really, how many?
17 / 24
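A sampling estimator along these lines might look as follows. This is a sketch under assumptions, not necessarily the estimator used in the work behind the talk: it samples source nodes uniformly at random with replacement (following the wording above), works on unweighted graphs given as adjacency dictionaries, and uses hypothetical names `source_dependencies`, `estimate_from_sources`, and `sample_betweenness`. Each sampled source s contributes δ_s(x) / (n − 1), a value in [0, 1] whose expectation over a uniform s is b(x).

```python
import random
from collections import deque


def source_dependencies(adj, s):
    """For every x, the sum over targets t of sigma_st(x) / sigma_st."""
    dist = {s: 0}
    sigma = {v: 0 for v in adj}
    sigma[s] = 1
    preds = {v: [] for v in adj}
    order = []
    queue = deque([s])
    while queue:
        u = queue.popleft()
        order.append(u)
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
            if dist[w] == dist[u] + 1:
                sigma[w] += sigma[u]
                preds[w].append(u)
    delta = {v: 0.0 for v in adj}
    for w in reversed(order):            # farthest nodes first
        for u in preds[w]:
            delta[u] += sigma[u] / sigma[w] * (1.0 + delta[w])
    delta[s] = 0.0                       # the source itself is excluded
    return delta


def estimate_from_sources(adj, sources):
    """Average the normalized dependencies over the given sources."""
    n = len(adj)
    est = {v: 0.0 for v in adj}
    for s in sources:
        delta = source_dependencies(adj, s)
        for v in adj:
            est[v] += delta[v] / (n - 1)
    return {v: est[v] / len(sources) for v in adj}


def sample_betweenness(adj, r, seed=None):
    """Estimate all b(x) from r sources drawn uniformly with replacement."""
    rng = random.Random(seed)
    nodes = list(adj)
    sources = [rng.choice(nodes) for _ in range(r)]
    return estimate_from_sources(adj, sources)


# Toy star graph (not the graph from the slides).
star = {"c": ["1", "2", "3"], "1": ["c"], "2": ["c"], "3": ["c"]}
print(sample_betweenness(star, r=50, seed=0))
```

Using every node exactly once as a source recovers the exact values, which is a convenient sanity check for the estimator.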
How many samples do we need to obtain an (ε, δ)-approximation?
Theory: Hoeffding Bound + Union Bound
Need
\[ O\left(\frac{1}{\varepsilon^2}\left(\log |V| + \log\frac{1}{\delta}\right)\right) \]
samples
Comments
Practice:
Fewer samples than the above are sufficient for (ε, δ)-approx.
Theory:
Dependency on |V | and not on edge structure seems wrong.
18 / 24
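To see where this bound comes from, assume each estimate is the average of r independent values in [0, 1] with mean b(x). Hoeffding's inequality gives Pr(|\tilde{b}(x) − b(x)| > ε) ≤ 2 exp(−2rε²) for a single node, and the union bound over the |V| nodes gives a total failure probability of at most 2|V| exp(−2rε²). Requiring this to be at most δ yields r ≥ ln(2|V|/δ) / (2ε²). The helper below, with the hypothetical name `hoeffding_union_sample_size`, just evaluates that expression.

```python
import math


def hoeffding_union_sample_size(n, eps, delta):
    """Smallest integer r with 2 * n * exp(-2 * r * eps**2) <= delta."""
    if not (0.0 < eps < 1.0 and 0.0 < delta < 1.0 and n >= 1):
        raise ValueError("need n >= 1 and eps, delta in (0, 1)")
    return math.ceil(math.log(2.0 * n / delta) / (2.0 * eps ** 2))


# The sample size grows only logarithmically with the number of nodes,
# but it depends on nothing else about the graph.
for n in (10**3, 10**6, 10**9):
    print(n, hoeffding_union_sample_size(n, eps=0.05, delta=0.1))
```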
How many samples do we need to obtain an (ε, δ)-approximation?
Theory: Vapnik-Chervonenkis (VC) Dimension
Developed to evaluate supervised learning classifiers.
We twisted it to work in a non-supervised graph mining problem.
“The most practical theory ever” – Me, right now
Need
\[ O\left(\frac{1}{\varepsilon^2}\left(\log \mathrm{diam}(G) + \log\frac{1}{\delta}\right)\right) \]
samples
Decreased sample size exponentially on small-world networks.
Comments
Practice: Great improvement but still too many samples.
Theory: Graphs with the same diameter are not equally “hard”.
19 / 24
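A rough numerical comparison shows why replacing log |V| with log diam(G) matters. The slide gives only the asymptotic form, so everything concrete below is an assumption made for illustration: the constant `c = 0.5`, the use of floor(log2(diam)) + 1 for the logarithmic term, and the function names. It is not the exact bound from the underlying work.

```python
import math


def hoeffding_union_sample_size(n, eps, delta):
    """Hoeffding + union bound: depends on the number of nodes n."""
    return math.ceil(math.log(2.0 * n / delta) / (2.0 * eps ** 2))


def vc_style_sample_size(diam, eps, delta, c=0.5):
    """Illustrative (c / eps^2) * (log diam + log 1/delta); c is an assumed constant."""
    if diam < 1:
        raise ValueError("diameter must be at least 1")
    log_term = math.floor(math.log2(diam)) + 1
    return math.ceil(c / eps ** 2 * (log_term + math.log(1.0 / delta)))


# On a small-world network the diameter grows roughly like log n,
# so the second quantity barely moves as n grows.
for n, diam in ((10**4, 10), (10**6, 15), (10**8, 20)):
    print(n, diam,
          hoeffding_union_sample_size(n, 0.1, 0.1),
          vc_style_sample_size(diam, 0.1, 0.1))
```

The point of the comparison is qualitative: the second sample size is independent of |V| and depends on the graph only through its diameter.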
How many samples do we need to obtain an (ε, δ)-approximation?
Theory: Progressive sampling + Rademacher Averages
Let’s start sampling, use the sample to decide when to stop.
Stop when η_i ≤ ε, where
\[ \eta_i = 2 \min_{t \in \mathbb{R}^+} \frac{1}{t} \ln \sum_{(r,C) \in T} e^{t^2 r^2 / (2 S_i^2)} + 3 \sqrt{\frac{(i+1)\ln(2/\delta)}{2 S_i}} \]
Comments
Practice: Getting closer to the empirical bound
Theory: Proving stuff is getting complicated (isn’t that good?)
20 / 24
Theory + Practice:
Get rid of “theoretical elegance” while maintaining correctness.
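Let
\[ g_S(x, y) = 2 \exp\left(-2 x^2 \left(y - 2R_F(S)\right)^2\right) + \exp\left(-\left((1 - x)y + 2xR_F(S)\right) \phi\left(\frac{2R_F(S)}{(1 - x)y + 2xR_F(S)} - 1\right)\right). \]
Then compute
\[ \min_{x, \xi} \xi \quad \text{s.t.} \quad g_S(x, \xi) \le \eta, \quad \xi \in (2R_F(S), 1], \quad x \in (0, 1) \]
and check if ξ < ε.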
21 / 24
To be a data scientist, you need to get your hands dirty in data.
To be an algorithmic data scientist,
you need to get your hands dirty in



data
theory
22 / 24
Other examples



pattern mining
(Rademacher Averages)
selectivity of database queries
(VC-dimension)
triangle counting from data streams
(non-i.i.d. sampling)
graph summarization
(Szemerédi Regularity)
23 / 24
1) Embrace data science
2) Combine theory and practice
Thank you!
EML: matteo@twosigma.com TWTR: @teorionda
WWW: http://matteo.rionda.to
24 / 24
This document is being distributed for informational and educational purposes only and is not an offer to sell or the solicitation of an offer to buy
any securities or other instruments. The information contained herein is not intended to provide, and should not be relied upon for investment
advice. The views expressed herein are not necessarily the views of Two Sigma Investments, LP or any of its affiliates (collectively, “Two Sigma”).
Such views reflect significant assumptions and subjective of the author(s) of the document and are subject to change without notice. The
document may employ data derived from third-party sources. No representation is made as to the accuracy of such information and the use of
such information in no way implies an endorsement of the source of such information or its validity.
The copyrights and/or trademarks in some of the images, logos or other material used herein may be owned by entities other than Two Sigma. If
so, such copyrights and/or trademarks are most likely owned by the entity that created the material and are used purely for identification and
comment as fair use under international copyright and/or trademark laws. Use of such image, copyright or trademark does not imply any
association with such organization (or endorsement of such organization) by Two Sigma, nor vice versa.

More Related Content

What's hot

Graph Algorithms
Graph AlgorithmsGraph Algorithms
Graph Algorithms
Ashwin Shiv
 
Exhaustive Combinatorial Enumeration
Exhaustive Combinatorial EnumerationExhaustive Combinatorial Enumeration
Exhaustive Combinatorial Enumeration
Mathieu Dutour Sikiric
 
5.1 greedy
5.1 greedy5.1 greedy
5.1 greedy
Krish_ver2
 
Towards a stable definition of Algorithmic Randomness
Towards a stable definition of Algorithmic RandomnessTowards a stable definition of Algorithmic Randomness
Towards a stable definition of Algorithmic Randomness
Hector Zenil
 
Mit15 082 jf10_lec01
Mit15 082 jf10_lec01Mit15 082 jf10_lec01
Mit15 082 jf10_lec01Saad Liaqat
 
Divide and Conquer - Part II - Quickselect and Closest Pair of Points
Divide and Conquer - Part II - Quickselect and Closest Pair of PointsDivide and Conquer - Part II - Quickselect and Closest Pair of Points
Divide and Conquer - Part II - Quickselect and Closest Pair of Points
Amrinder Arora
 
Asymptotic analysis
Asymptotic analysisAsymptotic analysis
Asymptotic analysis
Nisha Soms
 
Cs6402 design and analysis of algorithms may june 2016 answer key
Cs6402 design and analysis of algorithms may june 2016 answer keyCs6402 design and analysis of algorithms may june 2016 answer key
Cs6402 design and analysis of algorithms may june 2016 answer key
appasami
 
Fractal dimension versus Computational Complexity
Fractal dimension versus Computational ComplexityFractal dimension versus Computational Complexity
Fractal dimension versus Computational Complexity
Hector Zenil
 
Fractal Dimension of Space-time Diagrams and the Runtime Complexity of Small ...
Fractal Dimension of Space-time Diagrams and the Runtime Complexity of Small ...Fractal Dimension of Space-time Diagrams and the Runtime Complexity of Small ...
Fractal Dimension of Space-time Diagrams and the Runtime Complexity of Small ...
Hector Zenil
 
Assignment 2 daa
Assignment 2 daaAssignment 2 daa
Assignment 2 daa
gaurav201196
 
Graph Spectra through Network Complexity Measures: Information Content of Eig...
Graph Spectra through Network Complexity Measures: Information Content of Eig...Graph Spectra through Network Complexity Measures: Information Content of Eig...
Graph Spectra through Network Complexity Measures: Information Content of Eig...
Hector Zenil
 
Core–periphery detection in networks with nonlinear Perron eigenvectors
Core–periphery detection in networks with nonlinear Perron eigenvectorsCore–periphery detection in networks with nonlinear Perron eigenvectors
Core–periphery detection in networks with nonlinear Perron eigenvectors
Francesco Tudisco
 
Lecture warshall floyd
Lecture warshall floydLecture warshall floyd
Lecture warshall floyd
Divya Ks
 
Graph Traversal Algorithms - Breadth First Search
Graph Traversal Algorithms - Breadth First SearchGraph Traversal Algorithms - Breadth First Search
Graph Traversal Algorithms - Breadth First Search
Amrinder Arora
 
Information Content of Complex Networks
Information Content of Complex NetworksInformation Content of Complex Networks
Information Content of Complex Networks
Hector Zenil
 
elliptic-curves-modern
elliptic-curves-modernelliptic-curves-modern
elliptic-curves-modernEric Seifert
 
Optimal L-shaped matrix reordering, aka graph's core-periphery
Optimal L-shaped matrix reordering, aka graph's core-peripheryOptimal L-shaped matrix reordering, aka graph's core-periphery
Optimal L-shaped matrix reordering, aka graph's core-periphery
Francesco Tudisco
 

What's hot (20)

Graph Algorithms
Graph AlgorithmsGraph Algorithms
Graph Algorithms
 
Exhaustive Combinatorial Enumeration
Exhaustive Combinatorial EnumerationExhaustive Combinatorial Enumeration
Exhaustive Combinatorial Enumeration
 
5.1 greedy
5.1 greedy5.1 greedy
5.1 greedy
 
Towards a stable definition of Algorithmic Randomness
Towards a stable definition of Algorithmic RandomnessTowards a stable definition of Algorithmic Randomness
Towards a stable definition of Algorithmic Randomness
 
Lec 2-2
Lec 2-2Lec 2-2
Lec 2-2
 
Mit15 082 jf10_lec01
Mit15 082 jf10_lec01Mit15 082 jf10_lec01
Mit15 082 jf10_lec01
 
Divide and Conquer - Part II - Quickselect and Closest Pair of Points
Divide and Conquer - Part II - Quickselect and Closest Pair of PointsDivide and Conquer - Part II - Quickselect and Closest Pair of Points
Divide and Conquer - Part II - Quickselect and Closest Pair of Points
 
Asymptotic analysis
Asymptotic analysisAsymptotic analysis
Asymptotic analysis
 
Cs6402 design and analysis of algorithms may june 2016 answer key
Cs6402 design and analysis of algorithms may june 2016 answer keyCs6402 design and analysis of algorithms may june 2016 answer key
Cs6402 design and analysis of algorithms may june 2016 answer key
 
Fractal dimension versus Computational Complexity
Fractal dimension versus Computational ComplexityFractal dimension versus Computational Complexity
Fractal dimension versus Computational Complexity
 
Fractal Dimension of Space-time Diagrams and the Runtime Complexity of Small ...
Fractal Dimension of Space-time Diagrams and the Runtime Complexity of Small ...Fractal Dimension of Space-time Diagrams and the Runtime Complexity of Small ...
Fractal Dimension of Space-time Diagrams and the Runtime Complexity of Small ...
 
Assignment 2 daa
Assignment 2 daaAssignment 2 daa
Assignment 2 daa
 
Graph Spectra through Network Complexity Measures: Information Content of Eig...
Graph Spectra through Network Complexity Measures: Information Content of Eig...Graph Spectra through Network Complexity Measures: Information Content of Eig...
Graph Spectra through Network Complexity Measures: Information Content of Eig...
 
Core–periphery detection in networks with nonlinear Perron eigenvectors
Core–periphery detection in networks with nonlinear Perron eigenvectorsCore–periphery detection in networks with nonlinear Perron eigenvectors
Core–periphery detection in networks with nonlinear Perron eigenvectors
 
Lecture warshall floyd
Lecture warshall floydLecture warshall floyd
Lecture warshall floyd
 
Lecture26
Lecture26Lecture26
Lecture26
 
Graph Traversal Algorithms - Breadth First Search
Graph Traversal Algorithms - Breadth First SearchGraph Traversal Algorithms - Breadth First Search
Graph Traversal Algorithms - Breadth First Search
 
Information Content of Complex Networks
Information Content of Complex NetworksInformation Content of Complex Networks
Information Content of Complex Networks
 
elliptic-curves-modern
elliptic-curves-modernelliptic-curves-modern
elliptic-curves-modern
 
Optimal L-shaped matrix reordering, aka graph's core-periphery
Optimal L-shaped matrix reordering, aka graph's core-peripheryOptimal L-shaped matrix reordering, aka graph's core-periphery
Optimal L-shaped matrix reordering, aka graph's core-periphery
 

Similar to Algorithmic Data Science = Theory + Practice

Martin Takac - “Solving Large-Scale Machine Learning Problems in a Distribute...
Martin Takac - “Solving Large-Scale Machine Learning Problems in a Distribute...Martin Takac - “Solving Large-Scale Machine Learning Problems in a Distribute...
Martin Takac - “Solving Large-Scale Machine Learning Problems in a Distribute...
diannepatricia
 
Interactive High-Dimensional Visualization of Social Graphs
Interactive High-Dimensional Visualization of Social GraphsInteractive High-Dimensional Visualization of Social Graphs
Interactive High-Dimensional Visualization of Social Graphs
Tokyo Tech (Tokyo Institute of Technology)
 
From RNN to neural networks for cyclic undirected graphs
From RNN to neural networks for cyclic undirected graphsFrom RNN to neural networks for cyclic undirected graphs
From RNN to neural networks for cyclic undirected graphs
tuxette
 
04 greedyalgorithmsii 2x2
04 greedyalgorithmsii 2x204 greedyalgorithmsii 2x2
04 greedyalgorithmsii 2x2
MuradAmn
 
CS 354 More Graphics Pipeline
CS 354 More Graphics PipelineCS 354 More Graphics Pipeline
CS 354 More Graphics Pipeline
Mark Kilgard
 
Triggering patterns of topology changes in dynamic attributed graphs
Triggering patterns of topology changes in dynamic attributed graphsTriggering patterns of topology changes in dynamic attributed graphs
Triggering patterns of topology changes in dynamic attributed graphs
INSA Lyon - L'Institut National des Sciences Appliquées de Lyon
 
Parallel Optimization in Machine Learning
Parallel Optimization in Machine LearningParallel Optimization in Machine Learning
Parallel Optimization in Machine Learning
Fabian Pedregosa
 
lecture 17
lecture 17lecture 17
lecture 17sajinsc
 
pptx - Psuedo Random Generator for Halfspaces
pptx - Psuedo Random Generator for Halfspacespptx - Psuedo Random Generator for Halfspaces
pptx - Psuedo Random Generator for Halfspacesbutest
 
pptx - Psuedo Random Generator for Halfspaces
pptx - Psuedo Random Generator for Halfspacespptx - Psuedo Random Generator for Halfspaces
pptx - Psuedo Random Generator for Halfspacesbutest
 
Greedy Algorithms with examples' b-18298
Greedy Algorithms with examples'  b-18298Greedy Algorithms with examples'  b-18298
Greedy Algorithms with examples' b-18298
LGS, GBHS&IC, University Of South-Asia, TARA-Technologies
 
P5 - Routing Protocols
P5 - Routing ProtocolsP5 - Routing Protocols
P5 - Routing Protocols
Kurniawan Dwi Irianto
 
Attention is all you need (UPC Reading Group 2018, by Santi Pascual)
Attention is all you need (UPC Reading Group 2018, by Santi Pascual)Attention is all you need (UPC Reading Group 2018, by Santi Pascual)
Attention is all you need (UPC Reading Group 2018, by Santi Pascual)
Universitat Politècnica de Catalunya
 
Stratified sampling and resampling for approximate Bayesian computation
Stratified sampling and resampling for approximate Bayesian computationStratified sampling and resampling for approximate Bayesian computation
Stratified sampling and resampling for approximate Bayesian computation
Umberto Picchini
 
Dynamic Itemset Counting
Dynamic Itemset CountingDynamic Itemset Counting
Dynamic Itemset Counting
Tarat Diloksawatdikul
 
Dynamic Itemset Counting
Dynamic Itemset CountingDynamic Itemset Counting
Dynamic Itemset Counting
Tarat Diloksawatdikul
 
20101017 program analysis_for_security_livshits_lecture02_compilers
20101017 program analysis_for_security_livshits_lecture02_compilers20101017 program analysis_for_security_livshits_lecture02_compilers
20101017 program analysis_for_security_livshits_lecture02_compilersComputer Science Club
 

Similar to Algorithmic Data Science = Theory + Practice (20)

Martin Takac - “Solving Large-Scale Machine Learning Problems in a Distribute...
Martin Takac - “Solving Large-Scale Machine Learning Problems in a Distribute...Martin Takac - “Solving Large-Scale Machine Learning Problems in a Distribute...
Martin Takac - “Solving Large-Scale Machine Learning Problems in a Distribute...
 
Interactive High-Dimensional Visualization of Social Graphs
Interactive High-Dimensional Visualization of Social GraphsInteractive High-Dimensional Visualization of Social Graphs
Interactive High-Dimensional Visualization of Social Graphs
 
From RNN to neural networks for cyclic undirected graphs
From RNN to neural networks for cyclic undirected graphsFrom RNN to neural networks for cyclic undirected graphs
From RNN to neural networks for cyclic undirected graphs
 
04 greedyalgorithmsii 2x2
04 greedyalgorithmsii 2x204 greedyalgorithmsii 2x2
04 greedyalgorithmsii 2x2
 
CS 354 More Graphics Pipeline
CS 354 More Graphics PipelineCS 354 More Graphics Pipeline
CS 354 More Graphics Pipeline
 
Triggering patterns of topology changes in dynamic attributed graphs
Triggering patterns of topology changes in dynamic attributed graphsTriggering patterns of topology changes in dynamic attributed graphs
Triggering patterns of topology changes in dynamic attributed graphs
 
Parallel Optimization in Machine Learning
Parallel Optimization in Machine LearningParallel Optimization in Machine Learning
Parallel Optimization in Machine Learning
 
lecture 17
lecture 17lecture 17
lecture 17
 
pptx - Psuedo Random Generator for Halfspaces
pptx - Psuedo Random Generator for Halfspacespptx - Psuedo Random Generator for Halfspaces
pptx - Psuedo Random Generator for Halfspaces
 
pptx - Psuedo Random Generator for Halfspaces
pptx - Psuedo Random Generator for Halfspacespptx - Psuedo Random Generator for Halfspaces
pptx - Psuedo Random Generator for Halfspaces
 
Lecture set 5
Lecture set 5Lecture set 5
Lecture set 5
 
Greedy Algorithms with examples' b-18298
Greedy Algorithms with examples'  b-18298Greedy Algorithms with examples'  b-18298
Greedy Algorithms with examples' b-18298
 
P5 - Routing Protocols
P5 - Routing ProtocolsP5 - Routing Protocols
P5 - Routing Protocols
 
Attention is all you need (UPC Reading Group 2018, by Santi Pascual)
Attention is all you need (UPC Reading Group 2018, by Santi Pascual)Attention is all you need (UPC Reading Group 2018, by Santi Pascual)
Attention is all you need (UPC Reading Group 2018, by Santi Pascual)
 
Approx
ApproxApprox
Approx
 
Cgm Lab Manual
Cgm Lab ManualCgm Lab Manual
Cgm Lab Manual
 
Stratified sampling and resampling for approximate Bayesian computation
Stratified sampling and resampling for approximate Bayesian computationStratified sampling and resampling for approximate Bayesian computation
Stratified sampling and resampling for approximate Bayesian computation
 
Dynamic Itemset Counting
Dynamic Itemset CountingDynamic Itemset Counting
Dynamic Itemset Counting
 
Dynamic Itemset Counting
Dynamic Itemset CountingDynamic Itemset Counting
Dynamic Itemset Counting
 
20101017 program analysis_for_security_livshits_lecture02_compilers
20101017 program analysis_for_security_livshits_lecture02_compilers20101017 program analysis_for_security_livshits_lecture02_compilers
20101017 program analysis_for_security_livshits_lecture02_compilers
 

More from Two Sigma

The State of Open Data on School Bullying
The State of Open Data on School BullyingThe State of Open Data on School Bullying
The State of Open Data on School Bullying
Two Sigma
 
Halite @ Google Cloud Next 2018
Halite @ Google Cloud Next 2018Halite @ Google Cloud Next 2018
Halite @ Google Cloud Next 2018
Two Sigma
 
Future of Pandas - Jeff Reback
Future of Pandas - Jeff RebackFuture of Pandas - Jeff Reback
Future of Pandas - Jeff Reback
Two Sigma
 
BeakerX - Tiezheng Li
BeakerX - Tiezheng LiBeakerX - Tiezheng Li
BeakerX - Tiezheng Li
Two Sigma
 
Engineering with Open Source - Hyonjee Joo
Engineering with Open Source - Hyonjee JooEngineering with Open Source - Hyonjee Joo
Engineering with Open Source - Hyonjee Joo
Two Sigma
 
Bringing Linux back to the Server BIOS with LinuxBoot - Trammel Hudson
Bringing Linux back to the Server BIOS with LinuxBoot - Trammel HudsonBringing Linux back to the Server BIOS with LinuxBoot - Trammel Hudson
Bringing Linux back to the Server BIOS with LinuxBoot - Trammel Hudson
Two Sigma
 
Waiter: An Open-Source Distributed Auto-Scaler
Waiter: An Open-Source Distributed Auto-ScalerWaiter: An Open-Source Distributed Auto-Scaler
Waiter: An Open-Source Distributed Auto-Scaler
Two Sigma
 
Responsive and Scalable Real-time Data Analytics for SHPE 2017 - Cecilia Ye
Responsive and Scalable Real-time Data Analytics for SHPE 2017 - Cecilia YeResponsive and Scalable Real-time Data Analytics for SHPE 2017 - Cecilia Ye
Responsive and Scalable Real-time Data Analytics for SHPE 2017 - Cecilia Ye
Two Sigma
 
Archival Storage at Two Sigma - Josh Leners
Archival Storage at Two Sigma - Josh LenersArchival Storage at Two Sigma - Josh Leners
Archival Storage at Two Sigma - Josh Leners
Two Sigma
 
Smooth Storage - A distributed storage system for managing structured time se...
Smooth Storage - A distributed storage system for managing structured time se...Smooth Storage - A distributed storage system for managing structured time se...
Smooth Storage - A distributed storage system for managing structured time se...
Two Sigma
 
The Language of Compression - Leif Walsh
The Language of Compression - Leif WalshThe Language of Compression - Leif Walsh
The Language of Compression - Leif Walsh
Two Sigma
 
Identifying Emergent Behaviors in Complex Systems - Jane Adams
Identifying Emergent Behaviors in Complex Systems - Jane AdamsIdentifying Emergent Behaviors in Complex Systems - Jane Adams
Identifying Emergent Behaviors in Complex Systems - Jane Adams
Two Sigma
 
HUOHUA: A Distributed Time Series Analysis Framework For Spark
HUOHUA: A Distributed Time Series Analysis Framework For SparkHUOHUA: A Distributed Time Series Analysis Framework For Spark
HUOHUA: A Distributed Time Series Analysis Framework For Spark
Two Sigma
 
Improving Python and Spark Performance and Interoperability with Apache Arrow
Improving Python and Spark Performance and Interoperability with Apache ArrowImproving Python and Spark Performance and Interoperability with Apache Arrow
Improving Python and Spark Performance and Interoperability with Apache Arrow
Two Sigma
 
Exploring the Urban – Rural Incarceration Divide: Drivers of Local Jail Incar...
Exploring the Urban – Rural Incarceration Divide: Drivers of Local Jail Incar...Exploring the Urban – Rural Incarceration Divide: Drivers of Local Jail Incar...
Exploring the Urban – Rural Incarceration Divide: Drivers of Local Jail Incar...
Two Sigma
 
Rademacher Averages: Theory and Practice
Rademacher Averages: Theory and PracticeRademacher Averages: Theory and Practice
Rademacher Averages: Theory and Practice
Two Sigma
 
Credit-Implied Volatility
Credit-Implied VolatilityCredit-Implied Volatility
Credit-Implied Volatility
Two Sigma
 
Principles of REST API Design
Principles of REST API DesignPrinciples of REST API Design
Principles of REST API Design
Two Sigma
 

More from Two Sigma (18)

The State of Open Data on School Bullying
The State of Open Data on School BullyingThe State of Open Data on School Bullying
The State of Open Data on School Bullying
 
Halite @ Google Cloud Next 2018
Halite @ Google Cloud Next 2018Halite @ Google Cloud Next 2018
Halite @ Google Cloud Next 2018
 
Future of Pandas - Jeff Reback
Future of Pandas - Jeff RebackFuture of Pandas - Jeff Reback
Future of Pandas - Jeff Reback
 
BeakerX - Tiezheng Li



Algorithmic Data Science = Theory + Practice

  • 1. Algorithmic Data Science = Theory + Practice Matteo Riondato – Labs, Two Sigma Investments @teorionda – http://matteo.rionda.to IEEE MIT URTC – November 5, 2016 1 / 24
  • 2. Matteo Riondato, Ph.D. in CS. Working at Labs, Two Sigma Investments (Research Scientist) and CS Dept., Brown U. (Visiting Asst. Prof.); doing research on algorithmic data science; tweeting @teorionda; reading matteo@twosigma.com; “living” at http://matteo.rionda.to.
  • 3. Conjecture: Let X be a scientific discipline. Then 21st-century X = data science(X) + ε. Partial evidence: “Computational X” exists for many X.
  • 4. data science : 21st century = statistics : 20th century
  • 5. data science for 21st-century society: {questions, data}
  • 8. data science = 1/4 data representation and management + 1/4 mathematical and statistical modeling + 1/4 computational thinking and algorithms + 1/4 domain expertise. Shake well, and strain into a cocktail glass.
  • 12. algorithmic data science = algorithms for/with: approximation guarantees, data streams, Spark/MapReduce, sampling, statistical testing, graph analysis, . . .
  • 15. algorithmic data science = (theory × practice)^(theory × practice)
  • 17. Scientific question: Find relevant webpages on the web, influential participants in an email chain, key proteins in a network, . . . Data representation: represent the data as a graph G = (V, E). [Figure: example graph on nodes a through h.] Modeling question: What are the important nodes in a graph G = (V, E)? We need f : V → R+ to express the importance of a node: the higher f(x), the more important x ∈ V is.
  • 18. Domain Knowledge / Modeling: Assume that 1) every node wants to communicate with every node; and 2) communication progresses along Shortest Paths (SPs). Then, the more SPs a node v belongs to, the more important v is.
     Definition: For each node x ∈ V, the betweenness b(x) of x is
     $$b(x) = \frac{1}{n(n-1)} \sum_{u \neq x \neq v \in V} \frac{\sigma_{uv}(x)}{\sigma_{uv}} \in [0, 1],$$
     where n = |V|, $\sigma_{uv}$ is the number of SPs from u to v (u, v ∈ V), and $\sigma_{uv}(x)$ is the number of SPs from u to v that go through x.
     I.e., b(x) is the weighted fraction of SPs that go through x, among all SPs in G.
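The definition can be evaluated directly on a small graph. The sketch below is only an illustration, not the method used in the talk: it assumes an unweighted, undirected graph with at least two nodes, stored as an adjacency dictionary, and it uses the fact that x lies on a shortest u-v path exactly when dist(u, x) + dist(x, v) = dist(u, v). The function names and the 4-node path graph are hypothetical; the edges of the 8-node graph on the slides are not given in the transcript.

```python
from collections import deque


def bfs_counts(adj, source):
    """Return (dist, sigma): BFS distances and shortest-path counts from source."""
    dist = {source: 0}
    sigma = {source: 1}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:            # first time w is reached
                dist[w] = dist[u] + 1
                sigma[w] = 0
                queue.append(w)
            if dist[w] == dist[u] + 1:   # u precedes w on a shortest path
                sigma[w] += sigma[u]
    return dist, sigma


def betweenness_by_definition(adj):
    """b(x) = 1/(n(n-1)) * sum over ordered pairs (u, v), u != x != v, of sigma_uv(x)/sigma_uv."""
    nodes = list(adj)
    n = len(nodes)
    info = {u: bfs_counts(adj, u) for u in nodes}
    b = {x: 0.0 for x in nodes}
    for u in nodes:
        dist_u, sigma_u = info[u]
        for v in nodes:
            if v == u or v not in dist_u:
                continue                 # skip trivial or disconnected pairs
            for x in nodes:
                if x in (u, v) or x not in dist_u:
                    continue
                dist_x, sigma_x = info[x]
                # x is on a shortest u-v path iff the distances add up
                if v in dist_x and dist_u[x] + dist_x[v] == dist_u[v]:
                    b[x] += sigma_u[x] * sigma_x[v] / sigma_u[v]
    return {x: b[x] / (n * (n - 1)) for x in nodes}


# Hypothetical 4-node path graph a - b - c - d (not the graph from the slides).
path = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
scores = betweenness_by_definition(path)
```

On the path graph, the two inner nodes each sit on 4 of the 12 ordered pairs' shortest paths, so they score 1/3, while the endpoints score 0. This brute-force version takes cubic time in n and is only meant to make the definition concrete.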
  • 19. [Figure: the example graph on nodes a through h.] Betweenness of each node:
     Node x | a | b     | c     | d     | e     | f     | g     | h
     b(x)   | 0 | 0.250 | 0.125 | 0.036 | 0.054 | 0.080 | 0.268 | 0
  • 29. Algorithmic question: How to compute all b(x)? Brandes’ Algorithm. Intuition: For each vertex s ∈ V: 1) Build the SP DAG from s via Dijkstra/BFS; 2) Traverse the SP DAG from the most distant node towards s, in reverse order of distance. During the walk, appropriately increment b(v) of each non-leaf node v traversed. [Figure: SP DAG from source s; update to b(v) not shown.] Time complexity: $O(nm + n^2 \log n)$, i.e., n runs of Dijkstra’s, plus n backward walks, taking at most n each. Too much even with just $10^4$ nodes.
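The two steps of the intuition above can be sketched as follows. This is a minimal version for unweighted graphs only, so BFS stands in for Dijkstra; the adjacency-dictionary input, the function name, and the diamond test graph are assumptions made for illustration. The result is normalized by n(n − 1) to match the definition of b(x).

```python
from collections import deque


def brandes_betweenness(adj):
    """Normalized betweenness of every node in an unweighted graph (adjacency dict)."""
    n = len(adj)
    b = {v: 0.0 for v in adj}
    for s in adj:
        # Step 1: BFS from s builds the shortest-path DAG (predecessor lists).
        dist = {s: 0}
        sigma = {v: 0 for v in adj}
        sigma[s] = 1
        preds = {v: [] for v in adj}
        order = []                       # nodes in non-decreasing distance from s
        queue = deque([s])
        while queue:
            u = queue.popleft()
            order.append(u)
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
                if dist[w] == dist[u] + 1:
                    sigma[w] += sigma[u]
                    preds[w].append(u)
        # Step 2: walk the DAG from the farthest node back towards s,
        # pushing each node's accumulated dependency to its predecessors.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for u in preds[w]:
                delta[u] += sigma[u] / sigma[w] * (1.0 + delta[w])
            if w != s:
                b[w] += delta[w]
    return {v: b[v] / (n * (n - 1)) for v in adj}


# Hypothetical diamond graph: a - b - d and a - c - d.
diamond = {"a": ["b", "c"], "b": ["a", "d"], "c": ["a", "d"], "d": ["b", "c"]}
diamond_scores = brandes_betweenness(diamond)
```

In the diamond, every node carries half of the shortest paths between its two neighbors, in both directions, so each node scores 1/12. For weighted graphs the BFS would have to be replaced by Dijkstra's algorithm, which is where the log factor in the running time comes from.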
  • 31. Modeling / Domain knowledge: High-quality approximations of all BCs (betweenness centralities) are sufficient. Let ε ∈ (0, 1) and δ ∈ (0, 1) be user-specified parameters. An (ε, δ)-approximation is a set $\{\tilde{b}(x), x \in V\}$ of n values s.t.
     $$\Pr\left(\exists x \in V \text{ s.t. } |\tilde{b}(x) - b(x)| > \varepsilon\right) \le \delta,$$
     i.e., with prob. ≥ 1 − δ, for all x ∈ V, $\tilde{b}(x)$ is within ε of b(x): a uniform probabilistic guarantee over all the estimations.
  • 32. Algorithmic question: How to obtain an (ε, δ)-approximation quickly? Answer: Sampling. Instead of computing all the SPs from each node x ∈ V, compute them only from some randomly chosen nodes (samples). Theory question: How many samples do we need to obtain an (ε, δ)-approximation? The more the better, but really, how many?
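One way to read the sampling idea is as a source-sampling estimator: draw k source nodes uniformly at random with replacement, run the single-source computation from each, and average. The sketch below assumes that reading; the talk does not spell out the exact sampling scheme, so the function names, the division by n − 1 (which keeps each sample's contribution in [0, 1]), and the star graph are all illustrative assumptions. Only unweighted graphs are handled.

```python
import random
from collections import deque


def source_dependencies(adj, s):
    """delta_s(x) = sum over targets v of sigma_sv(x) / sigma_sv, for each x (0 for x = s)."""
    dist = {s: 0}
    sigma = {v: 0 for v in adj}
    sigma[s] = 1
    preds = {v: [] for v in adj}
    order = []
    queue = deque([s])
    while queue:
        u = queue.popleft()
        order.append(u)
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
            if dist[w] == dist[u] + 1:
                sigma[w] += sigma[u]
                preds[w].append(u)
    delta = {v: 0.0 for v in adj}
    for w in reversed(order):            # farthest nodes first
        for u in preds[w]:
            delta[u] += sigma[u] / sigma[w] * (1.0 + delta[w])
    delta[s] = 0.0                       # the source is an endpoint, never an inner node
    return delta


def sampled_betweenness(adj, k, rng):
    """Estimate b(x) for all x from k uniformly sampled sources (with replacement)."""
    nodes = list(adj)
    n = len(nodes)
    estimate = {v: 0.0 for v in nodes}
    for _ in range(k):
        s = rng.choice(nodes)
        for v, d in source_dependencies(adj, s).items():
            estimate[v] += d / (n - 1)   # each sample contributes a value in [0, 1]
    return {v: estimate[v] / k for v in nodes}


# Hypothetical star graph: center "c" with four leaves.
star = {"c": ["l1", "l2", "l3", "l4"], "l1": ["c"], "l2": ["c"], "l3": ["c"], "l4": ["c"]}
approx = sampled_betweenness(star, k=50, rng=random.Random(0))
```

Since b(x) is the average of delta_s(x) / (n − 1) over all sources s, the estimate is unbiased. In the star graph the leaves always get 0, and the center's estimate fluctuates around its exact value of 0.6 depending on how many sampled sources happen to be leaves.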
  • 35. How many samples do we need to obtain an (ε, δ)-approximation? Theory: Hoeffding Bound + Union Bound. Need $O\left(\frac{1}{\varepsilon^2}\left(\log |V| + \log \frac{1}{\delta}\right)\right)$ samples. Comments. Practice: Fewer samples than the above are sufficient for an (ε, δ)-approximation. Theory: Dependency on |V| and not on edge structure seems wrong.
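To see where a bound of this shape comes from, assume each sample contributes an independent value in [0, 1] to the estimate of every node. The two-sided Hoeffding inequality then bounds the failure probability for one node by 2·exp(−2kε²), and a union bound over all |V| nodes gives 2·|V|·exp(−2kε²) ≤ δ, i.e., k ≥ ln(2|V|/δ) / (2ε²). The slide only states the O(·) form, so the explicit constants below are those of this standard derivation and the function name is hypothetical.

```python
import math


def hoeffding_union_sample_size(n_nodes, eps, delta):
    """Smallest integer k with 2 * n_nodes * exp(-2 * k * eps**2) <= delta."""
    return math.ceil(math.log(2 * n_nodes / delta) / (2 * eps ** 2))


# Example: a graph with 1000 nodes, eps = 0.1, delta = 0.1.
k_needed = hoeffding_union_sample_size(1000, 0.1, 0.1)
```

The result grows only logarithmically in |V| and 1/δ but quadratically in 1/ε, so ε dominates the cost. Nothing in the formula looks at the edges, which is the theoretical objection raised on the slide.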
  • 38. How many samples do we need to obtain an (ε, δ)-approximation? Theory: Vapnik-Chervonenkis (VC) Dimension. Developed to evaluate supervised learning classifiers. We twisted it to work in a non-supervised graph mining problem. “The most practical theory ever” – Me, right now. Need $O\left(\frac{1}{\varepsilon^2}\left(\log \mathrm{diam}(G) + \log \frac{1}{\delta}\right)\right)$ samples. Decreased sample size exponentially on small-world networks. Comments. Practice: Great improvement but still too many samples. Theory: Graphs with the same diameter are not equally “hard”.
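A quick numeric comparison shows why replacing log |V| with log diam(G) matters. The sketch below evaluates only the expressions inside the two O(·) bounds, with the hidden constants taken as 1, which is an assumption made purely for illustration; the graph size and diameter are made-up values in the spirit of a small-world network.

```python
import math


def sample_size_terms(n_nodes, diameter, eps, delta):
    """Expressions inside the two O(.) bounds, with the hidden constant set to 1."""
    tail = math.log(1 / delta)
    union_bound_term = (math.log(n_nodes) + tail) / eps ** 2
    vc_term = (math.log(diameter) + tail) / eps ** 2
    return union_bound_term, vc_term


# Hypothetical small-world graph: a million nodes, diameter 20.
union_term, vc_term = sample_size_terms(10**6, 20, eps=0.05, delta=0.1)
```

With these made-up numbers the diameter-based expression is roughly a third of the union-bound one. The gap widens as the graph grows while its diameter stays small.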
  • 41. How many samples do we need to obtain an (ε, δ)-approximation? Theory: Progressive sampling + Rademacher Averages. Let’s start sampling, and use the sample to decide when to stop. Stop when $\eta_i \le \varepsilon$, where
     $$\eta_i = 2 \min_{t \in \mathbb{R}^+} \frac{1}{t} \ln \sum_{(r, C) \in T} e^{t^2 r^2 / (2 S_i^2)} + 3 \sqrt{\frac{(i+1) \ln(2/\delta)}{2 S_i}}.$$
     Comments. Practice: Getting closer to the empirical bound. Theory: Proving stuff is getting complicated (isn’t that good?)
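The control flow of progressive sampling can be sketched independently of the specific bound. In the sketch below, `draw_sample` and `deviation_bound` are hypothetical callables supplied by the caller, and the geometric growth schedule and size cap are assumptions; in the talk, the role of `deviation_bound` is played by the quantity η_i computed from the sample. The toy bound in the usage lines is not a valid statistical guarantee and is there only to exercise the loop.

```python
import math
import random


def progressive_sampling(draw_sample, deviation_bound, eps,
                         initial_size, growth=2.0, max_size=10**6):
    """Grow the sample until deviation_bound(sample, i) <= eps or max_size is reached."""
    sample = []
    target = initial_size
    i = 0                                # index of the current iteration
    while True:
        while len(sample) < target:      # top up the sample to the current target
            sample.append(draw_sample())
        eta = deviation_bound(sample, i)
        if eta <= eps or len(sample) >= max_size:
            return sample, eta
        target = min(max_size, math.ceil(growth * target))
        i += 1


# Toy usage: the "bound" shrinks like 1 / sqrt(sample size).
rng = random.Random(0)
final_sample, final_eta = progressive_sampling(
    draw_sample=rng.random,
    deviation_bound=lambda sample, i: 1.0 / math.sqrt(len(sample)),
    eps=0.1,
    initial_size=10,
)
```

With the toy bound, the loop visits sample sizes 10, 20, 40, 80, and 160 and stops at 160, the first size at which 1/√size drops to 0.1 or below. The point of the design is that the stopping size adapts to the data instead of being fixed in advance by a worst-case formula.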
  • 43. Theory + Practice: Get rid of “theoretical elegance” while maintaining correctness. Let
     $$g_S(x, y) = 2 \exp\left(-2 x^2 \left(y - 2 R_F(S)\right)^2\right) + \exp\left(-\left((1 - x) y + 2 x R_F(S)\right) \phi\left(\frac{2 R_F(S)}{(1 - x) y + 2 x R_F(S)} - 1\right)\right).$$
     Then compute
     $$\min_{x, \xi} \xi \quad \text{s.t.} \quad g_S(x, \xi) \le \eta, \quad \xi \in (2 R_F(S), 1], \quad x \in (0, 1),$$
     and check if ξ < ε.
  • 44. To be a data scientist, you need to get your hands dirty in data. To be an algorithmic data scientist, you need to get your hands dirty in both data and theory.
  • 45. Other examples: pattern mining (Rademacher Averages); selectivity of database queries (VC-dimension); triangle counting from data streams (non-i.i.d. sampling); graph summarization (Szemerédi Regularity).
  • 47. 1) Embrace data science; 2) Combine theory and practice. Thank you! EML: matteo@twosigma.com TWTR: @teorionda WWW: http://matteo.rionda.to
  • 48. This document is being distributed for informational and educational purposes only and is not an offer to sell or the solicitation of an offer to buy any securities or other instruments. The information contained herein is not intended to provide, and should not be relied upon for investment advice. The views expressed herein are not necessarily the views of Two Sigma Investments, LP or any of its affiliates (collectively, “Two Sigma”). Such views reflect significant assumptions and subjective of the author(s) of the document and are subject to change without notice. The document may employ data derived from third-party sources. No representation is made as to the accuracy of such information and the use of such information in no way implies an endorsement of the source of such information or its validity. The copyrights and/or trademarks in some of the images, logos or other material used herein may be owned by entities other than Two Sigma. If so, such copyrights and/or trademarks are most likely owned by the entity that created the material and are used purely for identification and comment as fair use under international copyright and/or trademark laws. Use of such image, copyright or trademark does not imply any association with such organization (or endorsement of such organization) by Two Sigma, nor vice versa.