This document presents methods for modeling networks with multifractal network generators (MFNG), a recursive model that assigns nodes to categories at multiple levels to generate graphs. It covers techniques for estimating MFNG parameters from real networks using the method of moments, describes the challenges of sampling from MFNG efficiently, and shows that MFNG can match properties of Twitter and citation networks.
k-Means is a rather simple but well-known algorithm for grouping objects, i.e., clustering. As with other such methods, all objects need to be represented as a set of numerical features. In addition, the user has to specify the number of groups (referred to as k) to identify. Each object can be thought of as a feature vector in an n-dimensional space, n being the number of features used to describe the objects to cluster. The algorithm then randomly chooses k points in that vector space; these points serve as the initial centers of the clusters. Afterwards, all objects are assigned to the center they are closest to. Usually the distance measure is chosen by the user and determined by the learning task. After that, for each cluster a new center is computed by averaging the feature vectors of all objects assigned to it. The process of assigning objects and recomputing centers is repeated until it converges, and the algorithm can be proven to converge after a finite number of iterations. Several tweaks concerning the distance measure, the initial center choice, and the computation of new average centers have been explored, as well as the estimation of the number of clusters k, yet the main principle always remains the same. In this project we discuss the K-means clustering algorithm, its implementation, and its application to the problem of unsupervised learning.
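As a concrete companion to the description above, here is a minimal NumPy sketch of the standard iteration (Lloyd's algorithm). The function name and tolerance are illustrative; initializing centers from the data points rather than arbitrary points in the space is one of the common tweaks mentioned above.

```python
import numpy as np

def k_means(X, k, max_iter=100, tol=1e-6, seed=0):
    """Cluster the rows of X (n_objects x n_features) into k groups."""
    rng = np.random.default_rng(seed)
    # Initial centers: k randomly chosen data points.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Assign each object to the closest center (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each center as the mean of the objects assigned to it.
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)])
        if np.linalg.norm(new_centers - centers) < tol:  # centers stopped moving
            break
        centers = new_centers
    return labels, centers
```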
https://telecombcn-dl.github.io/2017-dlai/
Deep learning technologies are at the core of the current revolution in artificial intelligence for multimedia data analysis. The convergence of large-scale annotated datasets and affordable GPU hardware has allowed the training of neural networks for data analysis tasks which were previously addressed with hand-crafted features. Architectures such as convolutional neural networks, recurrent neural networks, or Q-nets for reinforcement learning have shaped a brand new scenario in signal processing. This course will cover the basic principles of deep learning from both algorithmic and computational perspectives.
K Means Clustering Algorithm | K Means Clustering Example | Machine Learning … (Simplilearn)
This K-Means clustering algorithm presentation will take you through an introduction to machine learning, types of clustering algorithms, k-means clustering, how K-Means clustering works, and finally explains K-Means clustering through a real-life use case. This Machine Learning algorithm tutorial video is ideal for beginners who want to learn how K-Means clustering works.
Below topics are covered in this K-Means Clustering Algorithm presentation:
1. Types of Machine Learning
2. What is K-Means Clustering?
3. Applications of K-Means Clustering
4. Common distance measure
5. How does K-Means Clustering work?
6. K-Means Clustering Algorithm
7. Demo: k-Means Clustering
8. Use case: Color compression
- - - - - - - -
About Simplilearn Machine Learning course:
A form of artificial intelligence, Machine Learning is revolutionizing the world of computing as well as all people’s digital interactions. Machine Learning powers such innovative automated technologies as recommendation engines, facial recognition, fraud protection and even self-driving cars. This Machine Learning course prepares engineers, data scientists and other professionals with the knowledge and hands-on skills required for certification and job competency in Machine Learning.
- - - - - - -
Why learn Machine Learning?
Machine Learning is taking over the world, and with that comes a growing need among companies for professionals who know the ins and outs of Machine Learning.
The Machine Learning market size is expected to grow from USD 1.03 Billion in 2016 to USD 8.81 Billion by 2022, at a Compound Annual Growth Rate (CAGR) of 44.1% during the forecast period.
- - - - - -
What skills will you learn from this Machine Learning course?
By the end of this Machine Learning course, you will be able to:
1. Master the concepts of supervised, unsupervised, and reinforcement learning and modeling.
2. Gain practical mastery over principles, algorithms, and applications of Machine Learning through a hands-on approach that includes working on 28 projects and one capstone project.
3. Acquire thorough knowledge of the mathematical and heuristic aspects of Machine Learning.
4. Understand the concepts and operation of support vector machines, kernel SVM, naive Bayes, decision tree classifiers, random forest classifiers, logistic regression, K-nearest neighbors, K-means clustering and more.
5. Be able to model a wide variety of robust Machine Learning algorithms, including deep learning, clustering, and recommendation systems.
- - - - - - -
Manifold Blurring Mean Shift algorithms for manifold denoising, report, 2012 (Florent Renucci)
(General) To retrieve a clean dataset by deleting outliers.
(Computer Vision) The recovery of a digital image that has been contaminated by additive white Gaussian noise.
Clustering is the method of identifying groups of similar data in a data set: entities in each group are more similar to the other entities of that group than to entities of other groups.
Anomaly detection using deep one-class classifier (Hongbae Kim)
- Introduces various approaches to anomaly detection
- Shows how Support Vector Data Description (SVDD) can simplify cluster shapes to make cluster modeling easier, and how to handle ambiguous points near the cluster boundary
We consider the problem of finding anomalies in high-dimensional data using popular PCA-based anomaly scores. The naive algorithms for computing these scores explicitly compute the PCA of the covariance matrix, which uses space quadratic in the dimensionality of the data. We give the first streaming algorithms that use space that is linear or sublinear in the dimension. We prove general results showing that any sketch of a matrix that satisfies a certain operator norm guarantee can be used to approximate these scores. We instantiate these results with powerful matrix sketching techniques such as Frequent Directions and random projections to derive efficient and practical algorithms for these problems, which we validate over real-world data sets. Our main technical contribution is to prove matrix perturbation inequalities for operators arising in the computation of these measures.
- arXiv: https://arxiv.org/abs/1804.03065
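To make "PCA-based anomaly scores" concrete, here is a hedged NumPy sketch of one common such score, the squared residual outside the top-k principal subspace. This is the naive quadratic-space baseline the abstract refers to, not the paper's streaming algorithm; the function name and rank parameter are illustrative.

```python
import numpy as np

def pca_anomaly_scores(X, k):
    """Squared distance of each row of X to the top-k principal subspace."""
    Xc = X - X.mean(axis=0)
    cov = Xc.T @ Xc / len(Xc)        # explicit d x d covariance: O(d^2) space
    _, vecs = np.linalg.eigh(cov)    # eigenvectors in ascending eigenvalue order
    V = vecs[:, -k:]                 # top-k principal directions
    residual = Xc - (Xc @ V) @ V.T   # component outside the subspace
    return np.einsum('ij,ij->i', residual, residual)
```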
K-Nearest Neighbor is one of the most commonly used classifiers based on lazy learning, and one of the most commonly used methods in recommendation systems and document similarity measures. It typically uses Euclidean distance to measure the similarity between two data points.
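A minimal sketch of such a classifier, assuming a majority vote over the k nearest training points (the vote rule and parameter names are illustrative):

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=5):
    """Classify x by majority vote among its k nearest training points."""
    d = np.linalg.norm(X_train - x, axis=1)   # Euclidean distances to x
    nearest = np.argsort(d)[:k]               # indices of the k closest points
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[counts.argmax()]
```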
Empowering the Data Analytics Ecosystem: A Laser Focus on Value
The data analytics ecosystem thrives when every component functions at its peak, unlocking the true potential of data. Here's a laser focus on key areas for an empowered ecosystem:
1. Democratize Access, Not Data:
Granular Access Controls: Provide users with self-service tools tailored to their specific needs, preventing data overload and misuse.
Data Catalogs: Implement robust data catalogs for easy discovery and understanding of available data sources.
2. Foster Collaboration with Clear Roles:
Data Mesh Architecture: Break down data silos by creating a distributed data ownership model with clear ownership and responsibilities.
Collaborative Workspaces: Utilize interactive platforms where data scientists, analysts, and domain experts can work seamlessly together.
3. Leverage Advanced Analytics Strategically:
AI-powered Automation: Automate repetitive tasks like data cleaning and feature engineering, freeing up data talent for higher-level analysis.
Right-Tool Selection: Strategically choose the most effective advanced analytics techniques (e.g., AI, ML) based on specific business problems.
4. Prioritize Data Quality with Automation:
Automated Data Validation: Implement automated data quality checks to identify and rectify errors at the source, minimizing downstream issues.
Data Lineage Tracking: Track the flow of data throughout the ecosystem, ensuring transparency and facilitating root cause analysis for errors.
5. Cultivate a Data-Driven Mindset:
Metrics-Driven Performance Management: Align KPIs and performance metrics with data-driven insights to ensure actionable decision making.
Data Storytelling Workshops: Equip stakeholders with the skills to translate complex data findings into compelling narratives that drive action.
Benefits of a Precise Ecosystem:
Sharpened Focus: Precise access and clear roles ensure everyone works with the most relevant data, maximizing efficiency.
Actionable Insights: Strategic analytics and automated quality checks lead to more reliable and actionable data insights.
Continuous Improvement: Data-driven performance management fosters a culture of learning and continuous improvement.
Sustainable Growth: Empowered by data, organizations can make informed decisions to drive sustainable growth and innovation.
By focusing on these precise actions, organizations can create an empowered data analytics ecosystem that delivers real value by driving data-driven decisions and maximizing the return on their data investment.
Techniques to optimize the PageRank algorithm usually fall into two categories: reducing the work per iteration, and reducing the number of iterations. These goals are often at odds with one another. Skipping computation on vertices which have already converged can save iteration time. Skipping in-identical vertices (those with the same in-links) reduces duplicate computations and thus iteration time. Road networks often have chains which can be short-circuited before PageRank computation to improve performance, since the final ranks of chain nodes are easy to calculate; this can reduce both the iteration time and the number of iterations. If a graph has no dangling nodes, the PageRank of each strongly connected component can be computed in topological order, which can reduce the iteration time and the number of iterations, and also enables multi-iteration concurrency in the PageRank computation. The combination of all of the above methods is the STICD algorithm [sticd]. For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
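As a sketch of the first optimization above (skipping converged vertices), here is a hedged pull-based power iteration. The graph representation, the thresholds, and the absence of dangling-node handling are simplifying assumptions.

```python
import numpy as np

def pagerank_skip_converged(in_links, out_deg, d=0.85, eps=1e-10, max_iter=100):
    """PageRank that stops recomputing vertices whose rank has settled.

    in_links : list where in_links[v] holds the in-neighbors of v
    out_deg  : out-degree of each vertex (assumed nonzero: no dangling nodes)
    """
    n = len(in_links)
    rank = np.full(n, 1.0 / n)
    converged = np.zeros(n, dtype=bool)
    for _ in range(max_iter):
        prev = rank.copy()
        for v in range(n):
            if converged[v]:
                continue  # skip work for already-converged vertices
            rank[v] = (1.0 - d) / n + d * sum(prev[u] / out_deg[u]
                                              for u in in_links[v])
        converged |= np.abs(rank - prev) < eps
        if converged.all():
            break
    return rank
```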
3. Fast matrix multiplication: bridging theory and practice (with Grey Ballard)
[Figure: effective GFLOPS per core (24 cores) vs. dimension N for N x 2800 x N multiplication, comparing MKL with the fast algorithms <4,2,4>, <4,3,3>, <3,2,3>, <4,2,3>, STRASSEN, BINI, SCHONHAGE. arXiv: 1409.2908, 2014. Tags: sequential, shared memory, distributed memory, high-performance.]
4. Setting
• We want a simple, scalable method to model networks and generate random (undirected) graphs
• Looking for graph generators that can mimic real-world graph structure: degree distribution, large number of triangles, etc.
Why?
• Useful as null models
• Helps in understanding graphs
• Generate test problems
5. Lots of work in this area
• Erdős-Rényi model [1959]
• Watts-Strogatz model [1998]
• Chung-Lu [2000]
• Random Typing Graphs [Akoglu+2009]
• Stochastic Kronecker Graphs [Leskovec+2010]
• Mixed Kronecker Product Graph [Moreno+2010]
• Multifractal Network Generators [Palla+2011]
• Block two-level Erdős-Rényi [Seshadhri+2012]
• Transitive Chung-Lu [Pfeiffer+2012]
6. Multifractal Network Generators (MFNG)
• Simple, recursive model [Palla+ 2011]
Parameters:
• Symmetric probability matrix
• Set of lengths that sum to 1
• Number of recursive levels
7. MFNG – two levels of recursion
1. Sample three nodes uniformly from [0, 1]
2. Find “category” at first level
3. Expand interval back to [0, 1]
4. Find “category” at second level
[Figure: nodes placed on the subdivided unit interval; the resulting category pairs include (1, 2), (2, 1), (2, 2). A code sketch of the process follows.]
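The process on this slide translates directly into code. Below is a minimal sketch of the naive sampler (names are illustrative), using the equivalent view that each node independently draws one category per recursive level:

```python
import numpy as np

def sample_mfng(P, lengths, r, n, rng=None):
    """Naive MFNG sample: P is the c x c symmetric probability matrix,
    lengths the c interval lengths summing to 1, r the number of levels."""
    rng = np.random.default_rng(rng)
    c = len(lengths)
    # Each node draws an independent category at every recursive level
    # (equivalent to a Uniform[0, 1] point expanded level by level).
    cats = rng.choice(c, size=(n, r), p=lengths)
    edges = []
    for i in range(n):
        for j in range(i + 1, n):
            # The edge probability multiplies the P entries across levels.
            p_edge = np.prod(P[cats[i], cats[j]])
            if rng.random() < p_edge:
                edges.append((i, j))
    return edges
```

This is the O(n²) coin-flipping procedure that slide 31 later identifies as the scalability bottleneck.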
8. MFNG – two levels of recursion
[Figure: the same construction at the second level; category pairs such as (1, 2), (2, 1), (2, 2).]
10. MFNG – general parameters
• Number of categories: c
• c x c symmetric probability matrix P
• Length-c vector of lengths
• Number of recursive levels, r
Example: c = r = 3
11. Scalability issues
• The expanded probability matrix grows exponentially with the number of recursive levels
• This makes it difficult to do inference
• We only have c + c(c – 1) / 2 parameters with c categories
12. Recursive decomposition
• Lemma [Benson+14]: Take r i.i.d. graph samples from the MFNG process with one level of recursion. The distribution of the intersection of the graphs is identical to the same MFNG process with r recursive levels.
• Proof: This follows from the fact that the categories of a node at each recursive level are independent.
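The lemma gives an alternative sampler; a small sketch on top of the sample_mfng helper from the earlier sketch (slide 32 explains why this does not actually help):

```python
import numpy as np

def sample_mfng_by_intersection(P, lengths, r, n, rng=None):
    """Intersect r i.i.d. one-level MFNG samples; by the lemma above,
    the result is distributed as a single r-level MFNG sample."""
    rng = np.random.default_rng(rng)
    edge_sets = [set(sample_mfng(P, lengths, 1, n, rng.integers(2**32)))
                 for _ in range(r)]
    return set.intersection(*edge_sets)
```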
13. Subgraph probabilities
• Theorem [Benson+14]: The probability that a given subset of edges exists between any subset of nodes in an MFNG process with r recursive levels is the rth power of the probability that the same subset of edges exists in the same MFNG with one recursive level.
[Figure: Prob(triangle) expanded as a sum over category assignments: the probability of all three edges existing given the categories, times the probability of the nodes having categories i, j, k.]
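In the slide's notation, the one-level triangle probability spelled out in the figure can be written as follows (a reconstruction from the theorem, with i, j, k ranging over the categories at one level):

\[
\Pr(\triangle) \;=\; \Bigg( \sum_{i,j,k} \underbrace{\ell_i\, \ell_j\, \ell_k}_{\substack{\text{nodes have}\\ \text{categories } i, j, k}} \; \underbrace{p_{ij}\, p_{jk}\, p_{ik}}_{\substack{\text{all three}\\ \text{edges exist}}} \Bigg)^{\! r}
\]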
14. We can easily compute subgraph moments
Before the Uniform [0, 1] positions are drawn (i.e., before categories are determined), any three nodes are equally likely to form a triangle.
15. We can easily compute subgraph first moments
Expected number of edges, wedges, 3-stars, etc., and of triangles, 4-cliques, etc.
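Combining the theorem with the observation on slide 14 gives closed-form first moments; for example (a reconstruction in the deck's notation):

\[
\mathbb{E}[\#\text{edges}] = \binom{n}{2} \Big( \sum_{i,j} \ell_i \ell_j\, p_{ij} \Big)^{r},
\qquad
\mathbb{E}[\#\text{triangles}] = \binom{n}{3} \Big( \sum_{i,j,k} \ell_i \ell_j \ell_k\, p_{ij}\, p_{jk}\, p_{ik} \Big)^{r}.
\]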
16. Can also get second moments
[Derivation sketch from the slide: write the number of edges as a sum of edge indicators Xij; squaring gives pairs of indicators, where indicators of disjoint edges are independent and pairs sharing a node correspond to wedges (i, j, k).]
17. Open triangles
• We can only directly compute subgraph probabilities where the edges exist
[Figure: Prob(open triangle) is obtained by subtraction: Prob(wedge) – Prob(triangle).]
18. Method of Moments
• Count the number of wedges, 3-stars, triangles, 4-cliques, etc. in the network of interest (fi)
• Try to find parameters such that the expected values, E[Fi], match the empirical counts, fi
• Fast computation with our theory
19. Method of Moments
• Computing fi can be expensive (4-cliques); can use estimators [Seshadhri+13]
• Not convex, but fast; use many random restarts (see the sketch below)
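A hedged end-to-end sketch of the fit described on slides 18-19: the moment formulas follow the subgraph-probability theorem, while the sigmoid/softmax reparameterization of the constraints, the Nelder-Mead optimizer, and the restart count are illustrative choices (the Editor's Notes at the end give the exact constrained formulation).

```python
import itertools
import numpy as np
from scipy.optimize import minimize

def expected_moments(P, lengths, r, n):
    """E[#edges], E[#wedges], E[#triangles] via the slide-13 theorem."""
    c = len(lengths)
    duos = list(itertools.product(range(c), repeat=2))
    trios = list(itertools.product(range(c), repeat=3))
    edge = sum(lengths[i] * lengths[j] * P[i, j] for i, j in duos) ** r
    wedge = sum(lengths[i] * lengths[j] * lengths[k] * P[i, j] * P[j, k]
                for i, j, k in trios) ** r
    tri = sum(lengths[i] * lengths[j] * lengths[k]
              * P[i, j] * P[j, k] * P[i, k] for i, j, k in trios) ** r
    n2, n3 = n * (n - 1) / 2, n * (n - 1) * (n - 2) / 6
    return np.array([n2 * edge, 3 * n3 * wedge, n3 * tri])

def _unpack(x, c):
    """Map unconstrained variables to valid parameters (sigmoid / softmax)."""
    m = c * (c + 1) // 2
    P = np.zeros((c, c))
    P[np.triu_indices(c)] = 1.0 / (1.0 + np.exp(-x[:m]))  # entries in (0, 1)
    P = P + P.T - np.diag(np.diag(P))                     # p_ij = p_ji
    w = np.exp(x[m:])
    return P, w / w.sum()                                 # lengths sum to 1

def fit_mfng(f_emp, c, r, n, restarts=50, seed=0):
    """Minimize the relative moment error; non-convex, so random restarts."""
    rng = np.random.default_rng(seed)
    def obj(x):
        P, lengths = _unpack(x, c)
        return np.sum(np.abs(f_emp - expected_moments(P, lengths, r, n)) / f_emp)
    dim = c * (c + 1) // 2 + c
    best = min((minimize(obj, rng.standard_normal(dim), method="Nelder-Mead")
                for _ in range(restarts)), key=lambda res: res.fun)
    return _unpack(best.x, c)
```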
20. Graph property proxies
• Power law degree distribution: edges, wedges, 3-stars, etc.
• Clustering: cliques
21. We can recover the model
[Figure: Original MFNG → single sample → Method of Moments → recovered MFNG; original and recovered parameters are compared (see the table in the Editor's Notes).]
22. Results on Twitter network
[Table: Twitter network results (230M, 13.1M), compared against SKG method of moments [Gleich & Owen 2012] and KronFit [Leskovec+2010].]
28. Oscillating degree distributions
Interesting oscillations appear in the degree distribution, also observed in Stochastic Kronecker Graphs. They are less noticeable when using three categories.
29. Oscillating degree distributions
At each recursive level, add noise to the probability matrix [Seshadhri+2013]: more noise leads to a flatter degree distribution. (All with two categories.)
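A hedged sketch of the noising step (the exact noise model in [Seshadhri+2013] may differ; symmetric uniform perturbations clipped to [0, 1] are an illustrative choice):

```python
import numpy as np

def noisy_level_matrices(P, r, scale=0.05, rng=None):
    """One independently perturbed copy of P per recursive level."""
    rng = np.random.default_rng(rng)
    mats = []
    for _ in range(r):
        E = rng.uniform(-scale, scale, size=P.shape)
        E = (E + E.T) / 2                       # keep the perturbation symmetric
        mats.append(np.clip(P + E, 0.0, 1.0))   # keep entries valid probabilities
    return mats
```

Using a different matrix at each level of the edge-probability product plays the role of "more noise" above: a larger scale flattens the oscillations in the degree distribution.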
31. MFNG – naïve sampling is expensive
[Figure: nodes in the subdivided unit square with category pairs such as (1, 2), (2, 1), (2, 2).]
How do we avoid O(n²) coin flips?
32. Recursive decomposition does not help
• First idea: use the recursive decomposition. Generate r graphs and take the intersection.
• With one level of recursion and two categories, there are four Erdős-Rényi components to consider. We can sample each one quickly.
• When p22 isn't close to 0 or close to 1, we have to store a dense graph.
33. “Ball dropping” does not quite work
• Second idea: adapt the “ball dropping” scheme from SKG [Leskovec+2010]
• Sample proportional to: … at each level
• Might get category pairs: (1, 1), (2, 2)
• Choose a random pair from (blue, green, orange)
• Problem: just as likely to have 3 nodes fall into the box where the yellow node is
34. Adapt for the number of nodes per box
• Sample proportional to: … at each level
• Might get category pairs: (1, 1), (2, 2)
• 3 actual nodes: 3 possible edges
• Expected: 5(5 – 1)(1/4)^2 / 2 = 0.625
• Sample e ~ Poisson(3 / (0.625 λ))
• Add e edges to the box
• Larger λ needs more samples, but gives less dependency between edges.
35. Fast sampling heuristic: overview
1. Compute the expected quantities with the methods presented earlier
2. Sample proportional to: … at each level
3. Sample e ~ Poisson(actual / (expected λ)) and add e edges to the box
Repeat steps 2 and 3 until no edges are left (a sketch follows below).
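A hedged sketch of the full heuristic. The box bookkeeping is simplified, the total edge budget is drawn as a Poisson around the expected edge count, and the λ knob follows slide 34's reading that larger λ means fewer edges per ball drop (more draws, less dependency between edges); none of this is guaranteed to match the authors' exact implementation.

```python
import itertools
from collections import defaultdict
import numpy as np

def fast_sample_mfng(P, lengths, r, n, lam=1.0, rng=None):
    rng = np.random.default_rng(rng)
    c = len(lengths)
    cats = rng.choice(c, size=(n, r), p=lengths)
    # Index nodes by full category sequence; a "box" is a pair of sequences.
    by_seq = defaultdict(list)
    for v in range(n):
        by_seq[tuple(cats[v])].append(v)
    # Step 1: expected edge count, from the moment formulas presented earlier.
    duos = list(itertools.product(range(c), repeat=2))
    edge_p = sum(lengths[i] * lengths[j] * P[i, j] for i, j in duos) ** r
    remaining = rng.poisson(n * (n - 1) / 2 * edge_p)
    # One-level "ball dropping" distribution over category pairs.
    w = np.array([lengths[i] * lengths[j] * P[i, j] for i, j in duos])
    w /= w.sum()
    edges = set()
    while remaining > 0:
        # Step 2: drop a ball at each of the r levels to choose a box.
        seq = [duos[rng.choice(len(duos), p=w)] for _ in range(r)]
        left = by_seq.get(tuple(s[0] for s in seq), [])
        right = by_seq.get(tuple(s[1] for s in seq), [])
        cand = [(u, v) for u in left for v in right if u < v]
        if not cand:
            continue  # the ball landed in an empty box; redrop
        # Step 3: Poisson-correct for actual vs. expected box occupancy.
        expected = n * (n - 1) / 2 * np.prod(
            [lengths[s[0]] * lengths[s[1]] for s in seq])
        e = min(rng.poisson(len(cand) / (expected * lam)), len(cand), remaining)
        for idx in rng.choice(len(cand), size=e, replace=False):
            edges.add(cand[idx])
        remaining -= e
    return edges
```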
36. MFNG: what we have learned
• Method of Moments works well to estimate MFNG parameters for real networks
• Able to match the degree distribution, even though we don't explicitly fit for it
• Per-degree clustering coefficient is still a challenge
• We can sample quickly, but we would like an exact fast sampling algorithm
37. Twitter: 3 categories
[Figure: visualization of the three-category fit for the Twitter network; color scale from 0 to 1.]
LEARNING MULTIFRACTAL STRUCTURE IN LARGE NETWORKS
KDD 2014
Austin Benson (arbenson@stanford.edu), Carlos Riquelme, Sven Schmit
Institute for Computational and Mathematical Engineering, Stanford University
Purdue Machine Learning Seminar, Sept. 11 2014
Editor's Notes
\min_{\mathcal{K},\, H \ge 0} \|A - A(:, \mathcal{K})\, H\|
\begin{aligned}
& \underset{P,\, \ell,\, r}{\text{minimize}}
& & \sum_{i} \frac{|f_i - \mathbb{E}[F_i]|}{f_i} \\
& \text{subject to}
& & 0 \le p_{ij} = p_{ji} \le 1, & 1 \le i \le j \le c \\
& & & 0 \le \ell_{i} \le 1, & 1 \le i \le c \\
& & & \sum_{i=1}^{c} \ell_i = 1
\end{aligned}
\begin{table}[tb]
\centering
\begin{tabular}{l c c c c c c c c}
& $|V|$ & $c$ & $r$ & $\ell_1$ & $\ell_2$ & $p_{11}$ & $p_{12}$ & $p_{22}$ \\
Original & 6,000 & 2 & 10 & 0.25 & 0.75 & 0.59 & 0.43 & 0.78 \\
Recovered & 6,000 & 2 & 9 & 0.2728 & 0.7272 & 0.5431 & 0.4101 & 0.7593 \\
\end{tabular}
\label{tab:mfng_recovery}
\end{table}