"Quantum Clustering - Physics Inspired Clustering Algorithm", Sigalit Bechler, Researcher at Similar Web
Watch more from Data Natives Berlin 2016 here: http://bit.ly/2fE1sEo
Visit the conference website to learn more: www.datanatives.io
About the Author:
Sigalit Bechler is a data science researcher with a diverse academic background: a B.Sc. in electrical engineering and a B.Sc. in physics (cum laude) from Tel Aviv University's prestigious program for parallel B.Sc. degrees in Physics and in Electrical Engineering, and an M.Sc. in condensed matter (cum laude). She has started her Ph.D. in bioinformatics. Prior to her M.Sc. she served as a captain in a technology unit of the IDF. She is passionate about science and about solving complex big data problems that require out-of-the-box thinking, and she likes to dive deep into the details. She always takes a positive, proactive approach and puts an emphasis on understanding the big picture as well.
4. What We Do
60M WEBSITES DAILY
FOR EVERY WEBSITE:
• TRAFFIC ESTIMATION
• TRAFFIC SOURCES
• AUDIENCE
• INDUSTRY
• CONTENT
We Provide Digital Insights to the Entire World
2M MOBILE APPS DAILY
FOR EVERY MOBILE APP:
• RATING
• ENGAGEMENT
• APP STORE DATA
• CATEGORY
• KEYWORDS
5. What We Do
60M WEBSITES DAILY
FOR EVERY WEBSITE:
• TRAFFIC METRICS
• TRAFFIC SOURCES
• AUDIENCE
• INDUSTRY
• CONTENT
2M MOBILE APPS DAILY
FOR EVERY MOBILE APP:
• RATING
• ENGAGEMENT
• APP STORE
• CATEGORY
• KEYWORDS
INGEST: INTERNATIONAL PANEL, CRAWLING, ISP DATA, LEARNING SET
• 90K events/sec
• 4TB/day compressed
BATCH & ON DEMAND PROCESSING:
• 100TB i/o a day
• > 150 machines just in the processing cluster
• Statistical & machine learning algorithms
We Provide Digital Insights to the Entire World
6. Quantum clustering
Prof. David Horn and Dr. Assaf Gottlieb
Phys. Rev. Lett. 88 (2002) 018702
December 1, 2014
Business Proprietary & Confidential
7. Clustering - general overview
• Unsupervised learning problem - dealing with unlabeled data
• Goal: group together elements that are similar to each other in some sense.
• We usually have an idea or a desire of what this “sense” should be
• Might discover new patterns
[Figure: two example tables with columns label, feature1, feature2, feature3, feature4]
8. Clustering - general overview
• The user identity is unknown
• Leaving it in for the example
[Figure: the same two tables (label, feature1, feature2, feature3, feature4), with the label entries shown as "?"]
9. Clustering - general overview
• Grouping by gender
[Figure: two example tables with columns label, feature1, feature2, feature3, feature4]
10. Clustering - general overview
• Grouping by fields of interest
[Figure: two example tables with columns label, feature1, feature2, feature3, feature4]
11. Quantum Clustering - Motivation
• Relatively easy clustering task
• Still need to set the number of clusters manually.
• Very complex clustering task.
• Unbiased analysis of X-Ray absorption data
12. Quantum Clustering - Example
Analyzing Big Data with Dynamic Quantum Clustering
M. Weinstein, F. Meirer, A. Hume, Ph. Sciau, G. Shaked, R. Hofstetter, E. Persi, A. Mehta, D. Horn
http://arxiv.org/abs/1310.2700
13. Why is it important?
• Information era - big data
• Massive collection of data
• Strong presence of outliers
• Unknown structures
• Non-trivial patterns
[Diagram: Quantum Clustering, Distributed computation technologies]
14. Quantum clustering - the potential trick
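1. Turn the data points into Gaussians centered on the data points:
$\varphi(x) = \sum_{i=1}^{n} e^{-\frac{|x - x_i|^2}{2\sigma^2}}$
2. Plug $\varphi(x)$ into the Schrödinger equation and find $V(x)$.
Define the solution for $V$ as the potential transform:
$V(x) = \frac{\sigma^2}{2\varphi(x)} \nabla^2 \varphi(x)$
(The two expressions below drop an additive constant, which does not change the location of the minima.)
• Single point → Gaussian → $V(x) = \frac{1}{2\sigma^2}(x - x_i)^2$
• Multi-points: $V(x) = \frac{1}{2\sigma^2 \varphi(x)} \sum_{i} (x - x_i)^2 e^{-\frac{(x - x_i)^2}{2\sigma^2}} = \frac{1}{2\sigma^2 \sum_{i=1}^{n} e^{-\frac{|x - x_i|^2}{2\sigma^2}}} \sum_{i} (x - x_i)^2 e^{-\frac{(x - x_i)^2}{2\sigma^2}}$
3. Move each data point toward the minima of $V(x)$ by gradient descent on the potential surface.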
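The three steps on this slide can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the speaker's implementation: the function names `potential_and_grad` and `quantum_cluster`, the fixed step size, and the iteration count are all hypothetical choices, and the additive constant of the potential is omitted as on the slide.

```python
import numpy as np

def potential_and_grad(x, data, sigma):
    """Return V(x) and its gradient for the multi-point formula.

    V(x) = sum_i |x - x_i|^2 w_i / (2 sigma^2 sum_i w_i),
    with w_i = exp(-|x - x_i|^2 / (2 sigma^2)); additive constant omitted.
    """
    diff = x - data                          # (n, d): x - x_i for every data point
    sq = np.sum(diff ** 2, axis=1)           # (n,): squared distances
    # Subtracting sq.min() rescales all weights by one factor, which cancels
    # in the ratio below and avoids underflow far away from the data.
    w = np.exp(-(sq - sq.min()) / (2 * sigma ** 2))
    D = np.sum(w)                            # denominator (proportional to phi)
    N = np.sum(sq * w)                       # numerator
    V = N / (2 * sigma ** 2 * D)
    # Quotient rule, using grad w_i = -w_i (x - x_i) / sigma^2.
    gN = np.sum((2 * w - sq * w / sigma ** 2)[:, None] * diff, axis=0)
    gD = -np.sum(w[:, None] * diff, axis=0) / sigma ** 2
    gV = (gN * D - N * gD) / (2 * sigma ** 2 * D ** 2)
    return V, gV

def quantum_cluster(data, sigma, step=0.1, n_iter=200):
    """Move a copy of every data point downhill on V by plain gradient descent.

    The potential is always built from the original, fixed data set.
    """
    points = np.asarray(data, dtype=float).copy()
    for _ in range(n_iter):
        for k in range(len(points)):
            _, g = potential_and_grad(points[k], data, sigma)
            points[k] -= step * g
    return points
```

Calling `quantum_cluster(data, sigma=1.0)` on an `(n, d)` array returns the final positions of the points; points that end up at the same minimum belong to the same cluster. The width `sigma` remains a free parameter that has to be chosen for the data at hand.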
15. Quantum clustering – reasoning
• Why does it make sense?
• $\frac{\sigma^2}{2}\nabla^2$: models the divergence effects from the cluster center.
• $V(x)$: the effects that bind points from the same cluster together.
• We may say that we are looking for the minima of $V(x)$, since this is where the divergence effects are minimal (slow changes: small numerator; high density: large denominator):
$V(x) = \frac{\sigma^2}{2\varphi(x)} \nabla^2 \varphi(x)$
• SVD may be performed prior to the clustering: $X = U S V^T$, then perform QC on $U$ or $V$.
• This addresses the fact that each feature has a different dimension type and scale.
• It enables dimension reduction to the components with the highest variance.
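The SVD preprocessing step can be sketched as follows. The helper name `svd_reduce` and the choice of `k` are assumptions made for illustration; whether to center or rescale `X` beforehand is also left open here, since the slide does not say.

```python
import numpy as np

def svd_reduce(X, k):
    """Thin SVD X = U S V^T; keep the k leading columns of U.

    NumPy returns singular values in descending order, so the first k
    columns of U correspond to the k largest singular values.
    """
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :k], S[:k]
```

Each row of the returned matrix is the reduced representation of one sample, and quantum clustering would then be run on those rows instead of on the raw features.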
16. The Crabs Example (from Ripley’s textbook), 4 classes, 50 samples each, d=5
[Figures: "The data" and "3D Plot of the potential"]
A topographic map of the probability distribution for the crab data set with σ=1/2 using principal components 2 and 3. There exists only one maximum.
A topographic map of the potential for the crab data set with σ=1/2 using principal components 2 and 3. The four minima are denoted by crossed circles. The contours are set at values V=cE for c=0.2,…,1.
17. Quantum clustering - summary
• Built-in capability to handle outliers (the divergence part): no need for additional parameters or processes, and no effect on the number of significant clusters.
• A cluster may be a line or another shape, not necessarily a point in the feature space.
• The clusters are not defined by geometric or probability considerations alone.
• No need to pre-define the number of clusters.
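One possible way to see why the number of clusters does not have to be set in advance: once every point has been moved by gradient descent to its resting position, points that ended up at the same place can simply be grouped, and the number of groups is the number of clusters. The sketch below assumes the final positions are already available; the function name `assign_clusters` and the tolerance `tol` are hypothetical and not part of the talk.

```python
import numpy as np

def assign_clusters(final_points, tol=1e-2):
    """Greedy grouping: points closer than `tol` to a known center share its label."""
    centers = []
    labels = np.empty(len(final_points), dtype=int)
    for k, p in enumerate(final_points):
        for c, center in enumerate(centers):
            if np.linalg.norm(p - center) < tol:
                labels[k] = c
                break
        else:
            # No existing center is close enough: this point starts a new cluster.
            centers.append(p)
            labels[k] = len(centers) - 1
    return labels, np.array(centers)
```

The number of clusters is then `len(centers)`, an output of the procedure rather than an input.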
18. Quantum clustering
• An approximate quantum clustering variant exists that improves time complexity.
• Sensitive to small variations in the data density, unlike approaches based on geometric considerations alone.
• Possible distributed calculation:
• Since all we need is to calculate V and 𝛻V for every data point, the calculation for each point can run separately on a different machine.
• Performed exceptionally well in exposing hidden patterns of data structures from a wide range of fields: finance, on-line marketing, experimental physics, speech recognition, biological data.
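The observation that V and 𝛻V can be evaluated independently for every data point suggests a simple split of the work. The sketch below is only an illustration under assumptions: a thread pool stands in for separate machines, and the names `grad_potential`, `descend_chunk`, and `descend_parallel`, the step size, and the iteration count are hypothetical. Each worker receives the full, fixed data set plus its own chunk of points to move.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def grad_potential(points, data, sigma):
    """Gradient of V at every row of `points`; `data` is the fixed data set."""
    diff = points[:, None, :] - data[None, :, :]           # (m, n, d)
    sq = np.sum(diff ** 2, axis=2)                         # (m, n)
    # Per-row shift keeps the weights from underflowing; it cancels in the ratio.
    w = np.exp(-(sq - sq.min(axis=1, keepdims=True)) / (2 * sigma ** 2))
    D = w.sum(axis=1, keepdims=True)                       # (m, 1)
    N = (sq * w).sum(axis=1, keepdims=True)                # (m, 1)
    gN = np.einsum("mn,mnd->md", 2 * w - sq * w / sigma ** 2, diff)
    gD = -np.einsum("mn,mnd->md", w, diff) / sigma ** 2
    return (gN * D - N * gD) / (2 * sigma ** 2 * D ** 2)

def descend_chunk(chunk, data, sigma, step=0.1, n_iter=200):
    """Gradient descent for one chunk of points; needs no other chunk."""
    pts = np.asarray(chunk, dtype=float).copy()
    for _ in range(n_iter):
        pts -= step * grad_potential(pts, data, sigma)
    return pts

def descend_parallel(data, sigma, n_workers=4, **kwargs):
    """Split the points into chunks and process each chunk in its own worker."""
    chunks = np.array_split(data, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        results = list(pool.map(
            lambda c: descend_chunk(c, data, sigma, **kwargs), chunks))
    return np.vstack(results)
```

Because no chunk reads another chunk's moving points, the same split would carry over to separate processes or machines, at the cost of shipping the fixed data set to each worker.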
19. Quantum clustering
• Physics may provide an interesting perspective on questions that at first glance have no connection to physics.
• It has been done in scale-space theory
• Simulated annealing
• In bioinformatics, for extracting protein structure
• And many more
• Next steps: implement in a distributed manner, examine this algorithm on web data, improve time complexity, explore approximate QC, theoretical research.