The document discusses various graph clustering algorithms. It begins by introducing basic concepts in graph clustering such as partitioning a graph into clusters to minimize edge cuts between clusters or maximize intra-cluster connectivity. It then proceeds to explain several popular clustering algorithms, including hierarchical clustering methods like single-linkage and complete-linkage, spectral clustering which uses the eigenvectors of the graph Laplacian, and density-based clustering like DBSCAN. The document provides details on the mathematical formulations and applications of these different graph clustering approaches. It also includes several visual examples of cluster assignments on sample graph datasets.
A copy of my slides from the SILO Seminar at UW Madison on our recent developments for the NEO-K-Means methods including new optimization routines and results.
Big data matrix factorizations and Overlapping community detection in graphs (David Gleich)
In a talk at the Chinese Academy of Sciences Institute for Automation, I discuss some of the MapReduce and community detection methods I've worked on.
Anti-differentiating approximation algorithms: A case study with min-cuts, sp... (David Gleich)
This talk covers the idea of anti-differentiating approximation algorithms, which is an idea to explain the success of widely used heuristic procedures. Formally, this involves finding an optimization problem solved exactly by an approximation algorithm or heuristic.
Performance Analysis of CRT for Image Encryption (ijcisjournal)
With the fast advancement of information technology, securing image data transmitted or stored over
the internet has become very difficult. An effective way to hide the details is encryption, so that only
authorized persons holding the keys can decrypt the image. Because of the inherent features of digital
images, such as high data capacity, large redundancy, and strong similarity among neighboring pixels, conventional
encryption algorithms such as AES, DES, 3DES, and Blowfish are not suitable for real-time image
encryption. This paper presents the performance of CRT for image encryption for secure storage and
transmission of images over the internet.
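The paper's own scheme is not reproduced here; as a hedged illustration, assuming CRT refers to the Chinese Remainder Theorem, a minimal sketch of how a pixel value could be split into residues under pairwise-coprime moduli and reconstructed exactly (the moduli and pixel value below are illustrative choices, not from the paper):

```python
from math import prod

def to_residues(pixel, moduli):
    """Split a pixel value into its residues modulo pairwise-coprime moduli."""
    return [pixel % m for m in moduli]

def crt_reconstruct(residues, moduli):
    """Recover the original value from its residues via the Chinese Remainder Theorem."""
    M = prod(moduli)
    x = 0
    for r, m in zip(residues, moduli):
        Mi = M // m
        x += r * Mi * pow(Mi, -1, m)  # modular inverse of Mi mod m (Python 3.8+)
    return x % M

moduli = [7, 11, 13]   # pairwise coprime; product 1001 > 255, so 8-bit pixels are covered
pixel = 200
res = to_residues(pixel, moduli)
assert crt_reconstruct(res, moduli) == pixel
```

An actual image cipher would additionally key and permute the residues; this sketch only shows the reversible CRT split that such schemes build on.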
Spectral clustering with motifs and higher-order structures (David Gleich)
I presented these slides at the #strathna meeting in Glasgow in June 2017. They are an updated and enhanced version of the earlier talks on the subject.
An Algebraic Method to Check the Singularity-Free Paths for Parallel Robots (Dr. Ranjan Jha)
ASME 2015 International Design Engineering Technical Conferences / Computers and Information in Engineering Conference
Venue: Boston, Massachusetts, USA
Abstract: Trajectory planning is a critical step when programming parallel manipulators in a robotic cell. The main problem arises when a singular configuration exists between two poses of the end-effector while discretizing the path with a classical approach. This paper presents an algebraic method to check the feasibility of any given trajectory in the workspace. The solutions of the polynomial equations associated with the trajectories are projected into the joint space using Gröbner-based elimination methods, and the remaining equations are expressed in a parametric form where the variables are functions of time t, unlike in any numerical or discretization method.
These formal computations allow one to write the Jacobian of the manipulator as a function of time and to check whether its determinant can vanish between two poses. Another benefit of this approach is the ability to use the largest workspace, with a more complex shape than a cube, cylinder, or sphere. For the Orthoglide, a three-degree-of-freedom parallel robot, three different trajectories are used to illustrate this method.
Ph.D. Thesis : Ranjan JHA : Contributions to the Performance Analysis of Para... (Dr. Ranjan Jha)
Ph.D. Thesis Defense
Title: Contributions to the Performance Analysis of Parallel Robots
Venue: IRCCyN, Ecole Centrale de Nantes, France
Abstract: This doctoral thesis focuses on the different aspects associated with the efficient planning of desired tasks for parallel robots. These aspects fall into four parts: workspace and joint space analysis, uniqueness domains, trajectory planning, and accuracy analysis. The workspace and joint space analysis differentiates the regions with different numbers of inverse and direct kinematic solutions, respectively, using a cylindrical algebraic decomposition algorithm. The influence of design parameters and joint limits on the workspace boundaries of parallel robots is reported. Gröbner-based elimination methods are used to compute the parallel and serial singularities of the manipulator under study. A descriptive analysis of a family of delta-like robots is presented using algebraic tools to estimate the complexity of representing the singularities in the workspace and the joint space. The generalized notions of aspects and uniqueness domains are defined for parallel robots with several operation modes. The characteristic surfaces are also computed to define the uniqueness domains in the workspace. An algebraic method is proposed to check the feasibility of any given trajectory in the workspace, addressing the well-known problem that arises when a singular configuration exists between two poses of the end-effector while discretizing the path with a classical approach. A framework for the control loop of a parallel robot with several actuation modes is presented, which uses only the inverse geometric model. The accuracy analysis focuses on estimating the errors in the pose of the end-effector due to joint errors produced by the PID control loop. The proposed error model, based on the static and dynamic properties of the Orthoglide, helps estimate the error in the Cartesian workspace.
The idea of metric dimension in graph theory was introduced by P. J. Slater in [2]. It has found
applications in optimization, navigation, network theory, image processing, pattern recognition, etc.
Several other authors have studied the metric dimension of various standard graphs. In this paper we
introduce a real-valued function called a generalized metric, d_G : X × X × X → R+, where
X = {r(v|W) = (d(v,v_1), d(v,v_2), ..., d(v,v_k)) : v ∈ V(G)}, and use it to study the metric dimension of graphs. It
is proved that the metric dimension of any connected finite simple graph remains constant if d_G
pendant edges are added to the non-basis vertices.
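For context, the metric dimension itself can be brute-forced on small graphs: find the smallest vertex set W such that every vertex has a distinct vector of distances to W. This sketch (my own toy example, not from the paper) checks resolving sets by exhaustive search:

```python
from itertools import combinations

def metric_dimension(vertices, dist):
    """Brute-force the metric dimension of a small graph.

    dist[u][v] is the shortest-path distance between u and v.
    A set W resolves the graph when every vertex has a distinct
    vector of distances to the vertices of W."""
    for size in range(1, len(vertices) + 1):
        for W in combinations(vertices, size):
            codes = {tuple(dist[v][w] for w in W) for v in vertices}
            if len(codes) == len(vertices):  # all distance vectors distinct
                return size
    return len(vertices)

# Path graph P4 (0-1-2-3): the metric dimension of any path is 1
verts = [0, 1, 2, 3]
d = {u: {v: abs(u - v) for v in verts} for u in verts}
print(metric_dimension(verts, d))  # → 1
```

The search is exponential in the number of vertices, so this is only practical for very small graphs.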
Introduction to CNN with Application to Object Recognition (Artifacia)
This is the presentation from our second AI Meet held on Dec 10, 2016.
You can join Artifacia AI Meet Bangalore Group: https://www.meetup.com/Artifacia-AI-Meet/
Using Local Spectral Methods to Robustify Graph-Based Learning (David Gleich)
This is my KDD2015 talk on robustness in semi-supervised learning. The paper is already on Michael Mahoney's website: http://www.stat.berkeley.edu/~mmahoney/pubs/robustifying-kdd15.pdf See the KDD paper for all the details, which this talk is a bit light on.
Gaps between the theory and practice of large-scale matrix-based network comp... (David Gleich)
I discuss some runtimes for the personalized PageRank vector and how it relates to open questions in how we should tackle these network based measures via matrix computations.
IJERA (International Journal of Engineering Research and Applications) is an international online, ... peer-reviewed journal. For more details or to submit your article, please visit www.ijera.com
RasterFrames: Enabling Global-Scale Geospatial Machine Learning (Astraea, Inc.)
RasterFrames™, a proposed LocationTech project, brings the power of Spark SQL and Spark ML to the analysis of global-scale geospatial-temporal raster data. Employing the rich geospatial primitives of LocationTech GeoTrellis and GeoMesa, RasterFrames provides scientists, data scientists and software developers with a unified data and compute model for building image processing pipelines for ETL, data-product creation, statistical analysis, supervised & unsupervised machine learning, and deep learning. Data scientists particularly benefit from the DataFrame-centric entrypoint into big data geospatial analytics.
This talk will introduce RasterFrames, explaining the need it fulfills, the capabilities it provides, and context for determining if RasterFrames is right for the problems you're trying to solve.
By Simeon Fitch
Kazushi Okamoto: Families of Triangular Norm Based Kernel Function and Its Application to Kernel k-means, Joint 8th International Conference on Soft Computing and Intelligent Systems and 17th International Symposium on Advanced Intelligent Systems (SCIS-ISIS2016), 2016.08.25
Hierarchical matrix techniques for maximum likelihood covariance estimation (Alexander Litvinenko)
1. We apply hierarchical matrix techniques (HLIB, hlibpro) to approximate huge covariance matrices. We are able to work with 250K-350K non-regular grid nodes.
2. We maximize a non-linear, non-convex Gaussian log-likelihood function to identify hyper-parameters of covariance.
Background Estimation Using Principal Component Analysis Based on Limited Mem... (IJECEIAES)
Given a video of M frames of size h × w, the background components of the video are the matrix elements that are relatively constant over the M frames. In the PCA (principal component analysis) method these elements are referred to as "principal components". In video processing, background subtraction means the removal of the background component from the video, and the PCA method is used to obtain that component. The method transforms the 3-dimensional video (h × w × M) into a 2-dimensional one (N × M), where N is a linear array of size h × w. The principal components are the dominant eigenvectors, which form the basis of an eigenspace. Limited-memory block Krylov subspace optimization is then proposed to improve the performance of the computation. The background estimate is obtained by projecting each input image (the first frame of each image sequence) onto the space spanned by the principal components. The procedure was run on a standard dataset, the SBI (Scene Background Initialization) dataset, consisting of 8 videos with resolutions in the range [146×150, 352×240] and total frame counts in [258, 500]. Performance is reported with 8 metrics, notably (averaged over the 8 videos) the percentage of error pixels (0.24%), the percentage of clustered error pixels (0.21%), the multiscale structural similarity index (0.88 out of a maximum of 1), and the running time (61.68 seconds).
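A minimal sketch of the plain PCA idea described above (not the paper's limited-memory block Krylov variant): flatten the frames into columns, take the dominant left singular vectors as the principal components, and project a frame onto that subspace to estimate the background. The toy data and shapes are illustrative assumptions:

```python
import numpy as np

def estimate_background(frames, n_components=1):
    """frames: array of shape (M, h, w). Returns a background estimate of shape (h, w)."""
    M, h, w = frames.shape
    X = frames.reshape(M, h * w).T                 # (N, M): each column is a flattened frame
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    basis = U[:, :n_components]                    # dominant eigenvectors of the frame covariance
    proj = basis @ (basis.T @ X[:, 0])             # project the first frame onto the eigenspace
    return proj.reshape(h, w)

# Toy video: a constant background plus a small moving foreground blob
rng = np.random.default_rng(0)
bg = rng.random((4, 4))
frames = np.stack([bg.copy() for _ in range(10)])
for t in range(10):
    frames[t, t % 4, 0] += 0.5                     # moving foreground
est = estimate_background(frames)
print(np.abs(est - bg).mean())                     # small reconstruction error
```

Because the foreground keeps moving while the background stays put, the top singular vector is dominated by the background, which is the intuition the paper builds on.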
Challenging Web-Scale Graph Analytics with Apache Spark with Xiangrui Meng (Databricks)
Graph analytics has a wide range of applications, from information propagation and network flow optimization to fraud and anomaly detection. The rise of social networks and the Internet of Things has given us complex web-scale graphs with billions of vertices and edges. However, in order to extract the hidden gems within those graphs, you need tools to analyze the graphs easily and efficiently.
At Spark Summit 2016, Databricks introduced GraphFrames, which implemented graph queries and pattern matching on top of Spark SQL to simplify graph analytics. In this talk, you'll learn about work that has made graph algorithms in GraphFrames faster and more scalable. For example, new implementations like connected components have received algorithm improvements based on recent research, as well as performance improvements from Spark DataFrames. Discover lessons learned from scaling the implementation from millions to billions of nodes; compare its performance with other popular graph libraries; and hear about real-world applications.
Neuro-symbolic is not enough, we need neuro-*semantic* (Frank van Harmelen)
Neuro-symbolic (NeSy) AI is on the rise. However, simply machine learning on just any symbolic structure is not sufficient to really harvest the gains of NeSy. These will only be gained when the symbolic structures have an actual semantics. I give an operational definition of semantics as “predictable inference”.
All of this is illustrated with link prediction over knowledge graphs, but the argument is general.
"Impact of front-end architecture on development cost", Viktor TurskyiFwdays
I have heard many times that architecture is not important for the front-end. Also, many times I have seen how developers implement features on the front-end just following the standard rules for a framework and think that this is enough to successfully launch the project, and then the project fails. How to prevent this and what approach to choose? I have launched dozens of complex projects and during the talk we will analyze which approaches have worked for me and which have not.
DevOps and Testing slides at DASA Connect (Kari Kakkonen)
Slides by Rik Marselis and me from the DASA Connect conference on 30 May 2024. We discuss what testing is, then what agile testing is, and finally what testing in DevOps looks like. We closed with a lovely workshop in which participants explored different ways to think about quality and testing in the different parts of the DevOps infinity loop.
JMeter webinar - integration with InfluxDB and Grafana (RTTS)
Watch this recorded webinar about real-time monitoring of application performance. See how to integrate Apache JMeter, the open-source leader in performance testing, with InfluxDB, the open-source time-series database, and Grafana, the open-source analytics and visualization application.
In this webinar, we will review the benefits of leveraging InfluxDB and Grafana when executing load tests and demonstrate how these tools are used to visualize performance metrics.
Length: 30 minutes
Session Overview
-------------------------------------------
During this webinar, we will cover the following topics while demonstrating the integrations of JMeter, InfluxDB and Grafana:
- What out-of-the-box solutions are available for real-time monitoring JMeter tests?
- What are the benefits of integrating InfluxDB and Grafana into the load testing stack?
- Which features are provided by Grafana?
- Demonstration of InfluxDB and Grafana using a practice web application
To view the webinar recording, go to:
https://www.rttsweb.com/jmeter-integration-webinar
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti... (Jeffrey Haguewood)
Sidekick Solutions uses Bonterra Impact Management (fka Social Solutions Apricot) and automation solutions to integrate data for business workflows.
We believe integration and automation are essential to user experience and the promise of efficient work through technology. Automation is the critical ingredient to realizing that full vision. We develop integration products and services for Bonterra Case Management software to support the deployment of automations for a variety of use cases.
This video focuses on the notifications, alerts, and approval requests using Slack for Bonterra Impact Management. The solutions covered in this webinar can also be deployed for Microsoft Teams.
Interested in deploying notification automations for Bonterra Impact Management? Contact us at sales@sidekicksolutionsllc.com to discuss next steps.
GraphRAG is All You need? LLM & Knowledge Graph (Guy Korland)
Guy Korland, CEO and Co-founder of FalkorDB, will review two articles on the integration of language models with knowledge graphs.
1. Unifying Large Language Models and Knowledge Graphs: A Roadmap.
https://arxiv.org/abs/2306.08302
2. Microsoft Research's GraphRAG paper and a review paper on various uses of knowledge graphs:
https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
Let's dive deeper into the world of ODC! Ricardo Alves (OutSystems) will join us to tell all about the new Data Fabric. After that, Sezen de Bruijn (OutSystems) will get into the details on how to best design a sturdy architecture within ODC.
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo... (James Anderson)
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. Constant focus on speed to release software to market, along with the traditional slow and manual security checks has caused gaps in continuous security as an important piece in the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their applications supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with a passion for making things work and a knack for helping others understand how things work. He has around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms, and is known for his dynamic presentations on CI/CD and application security integrated into the software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
PHP Frameworks: I want to break free (IPC Berlin 2024) (Ralf Eggert)
In this presentation, we examine the challenges and limitations of relying too heavily on PHP frameworks in web development. We discuss the history of PHP and its frameworks to understand how this dependence has evolved. The focus will be on providing concrete tips and strategies to reduce reliance on these frameworks, based on real-world examples and practical considerations. The goal is to equip developers with the skills and knowledge to create more flexible and future-proof web applications. We'll explore the importance of maintaining autonomy in a rapidly changing tech landscape and how to make informed decisions in PHP development.
This talk is aimed at encouraging a more independent approach to using PHP frameworks, moving towards a more flexible and future-proof approach to PHP development.
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -... (DanBrown980551)
Do you want to learn how to model and simulate an electrical network from scratch in under an hour?
Then welcome to this PowSyBl workshop, hosted by Rte, the French Transmission System Operator (TSO)!
During the webinar, you will discover the PowSyBl ecosystem as well as handle and study an electrical network through an interactive Python notebook.
PowSyBl is an open source project hosted by LF Energy, which offers a comprehensive set of features for electrical grid modelling and simulation. Among other advanced features, PowSyBl provides:
- A fully editable and extendable library for grid component modelling;
- Visualization tools to display your network;
- Grid simulation tools, such as power flows, security analyses (with or without remedial actions) and sensitivity analyses;
The framework is mostly written in Java, with a Python binding so that Python developers can access PowSyBl functionalities as well.
What you will learn during the webinar:
- For beginners: discover PowSyBl's functionalities through a quick general presentation and the notebook, without needing any expert coding skills;
- For advanced developers: master the skills to efficiently apply PowSyBl functionalities to your real-world scenarios.
8. [1] Tan, P.-N., Steinbach, M., & Kumar, V. (2006). Introduction to Data Mining. Addison-Wesley.
[2] Jain, A. K. (2010). Data clustering: 50 years beyond K-means. Pattern Recognition Letters, 31(8), 651-666.
9. [3] Schaeffer, S. E. (2007). Graph clustering. Computer Science Review, 1(1), 27-64.

The minimum-cut objective sums the weight of the edges that cross cluster boundaries:

    min ∑_{i=1}^{k} ∑_{j∈Ci, l∉Ci} w_jl

where k is the number of clusters.

[4] Boutin, F., & Hascoet, M. (2004). Cluster validity indices for graph partitioning. In Proceedings of the Eighth International Conference on Information Visualisation (IV 2004) (pp. 376-381). IEEE.
[5] Tibshirani, R., Walther, G., & Hastie, T. (2001). Estimating the number of clusters in a data set via the gap statistic. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 63(2), 411-423.
[6] Patkar, S. B., & Narayanan, H. (2003). An efficient practical heuristic for good ratio-cut partitioning. In Proceedings of the 16th International Conference on VLSI Design (pp. 64-69). IEEE.
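The cut objective can be computed directly from a weighted adjacency matrix and a cluster assignment; as an illustrative sketch (toy weights of my own, not from the slides):

```python
import numpy as np

def cut_weight(W, labels):
    """Sum of edge weights w_jl with j inside a cluster and l outside, over all clusters."""
    total = 0.0
    for c in np.unique(labels):
        inside = labels == c
        total += W[np.ix_(inside, ~inside)].sum()
    return total

# Two obvious clusters {0,1} and {2,3} joined by one light edge
W = np.array([[0, 5, 0, 0],
              [5, 0, 1, 0],
              [0, 1, 0, 5],
              [0, 0, 5, 0]], dtype=float)
print(cut_weight(W, np.array([0, 0, 1, 1])))  # → 2.0 (the bridge edge is counted from both sides)
```

Note that for an undirected graph each crossing edge contributes twice, once per endpoint's cluster, matching the double sum above.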
10. [7] Sibson, R. (1973). SLINK: an optimally efficient algorithm for the single-link cluster method. The Computer Journal, 16(1), 30-34.
[8] Defays, D. (1977). An efficient algorithm for a complete link method. The Computer Journal, 20(4), 364-366.

The graph Laplacian L is defined as

    L = D − A

where A = [a_ij], i, j = 1, 2, ..., n is the adjacency matrix and D is the diagonal degree matrix with entries d_1, d_2, ..., d_n given by

    d_i = ∑_{j=1}^{n} a_ij

[9] Von Luxburg, U. (2007). A tutorial on spectral clustering. Statistics and Computing, 17(4), 395-416.
[10] Ng, A. Y., Jordan, M. I., & Weiss, Y. (2002). On spectral clustering: Analysis and an algorithm. In Advances in Neural Information Processing Systems (pp. 849-856).
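A minimal sketch of these definitions in action (toy graph of my own, not from the slides): build L = D − A and split the graph with the eigenvector of the second-smallest eigenvalue, in the spirit of [9]:

```python
import numpy as np

# Adjacency matrix of two triangles (vertices 0-2 and 3-5) joined by one bridge edge
A = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    A[i, j] = A[j, i] = 1

D = np.diag(A.sum(axis=1))          # diagonal degree matrix, d_i = sum_j a_ij
L = D - A                           # unnormalized graph Laplacian

eigvals, eigvecs = np.linalg.eigh(L)
fiedler = eigvecs[:, 1]             # eigenvector of the second-smallest eigenvalue
labels = (fiedler > 0).astype(int)  # sign split recovers the two triangles
print(labels)
```

With k > 2 clusters one would instead run k-means on the rows of the first k eigenvectors, as in [10].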
11. Modularity compares the intra-cluster edge weight with what would be expected at random:

    Q = (1 / 2m) ∑_{i=1}^{k} ∑_{j∈Ci, l∈Ci} (A_jl − d_j d_l / 2m)

[11] Newman, M. E., & Girvan, M. (2004). Finding and evaluating community structure in networks. Physical Review E, 69(2), 026113.
[12] Clauset, A., Newman, M. E., & Moore, C. (2004). Finding community structure in very large networks. Physical Review E, 70(6), 066111.
[13] Kehagias, A. (2012). Bad communities with high modularity. arXiv preprint arXiv:1209.2678.
[14] Daszykowski, M., Walczak, B., & Massart, D. L. (2001). Looking for natural patterns in data: Part 1. Density-based approach. Chemometrics and Intelligent Laboratory Systems, 56(2), 83-92.
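A direct implementation of the Q formula above on a toy graph (my own example, not from the slides):

```python
import numpy as np

def modularity(A, labels):
    """Newman-Girvan modularity Q for an undirected graph with adjacency matrix A."""
    d = A.sum(axis=1)                 # degrees d_j
    m = A.sum() / 2.0                 # number of edges (each counted twice in A)
    Q = 0.0
    for c in np.unique(labels):
        idx = labels == c
        Q += (A[np.ix_(idx, idx)] - np.outer(d[idx], d[idx]) / (2 * m)).sum()
    return Q / (2 * m)

# Two triangles joined by one edge: the natural split scores higher than a mixed one
A = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    A[i, j] = A[j, i] = 1
good = modularity(A, np.array([0, 0, 0, 1, 1, 1]))
bad = modularity(A, np.array([0, 1, 0, 1, 0, 1]))
print(good, bad)
```

Here good evaluates to 5/14 ≈ 0.357 while the mixed assignment is negative, reflecting that Q rewards clusters denser than chance ([13] discusses where this intuition breaks down).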
13. The self-tuning affinity with local scaling:

    w_ij = exp( −d(x_i, x_j)² / (d_i^k d_j^k) )   if x_j ∈ x_i^k and x_i ∈ x_j^k
    w_ij = 0                                       otherwise

where x_i^k is the k-nearest-neighbor set of point i and d_i^k is the distance between point i and its k-th nearest neighbor.

[15] Zelnik-Manor, L., & Perona, P. (2004). Self-tuning spectral clustering. In Advances in Neural Information Processing Systems (pp. 1601-1608).
[16] Ertoz, L., Steinbach, M., & Kumar, V. (2002). A new shared nearest neighbor clustering algorithm and its applications. In Workshop on Clustering High Dimensional Data and its Applications at the 2nd SIAM International Conference on Data Mining (pp. 105-115).
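A small sketch of this locally scaled affinity in the spirit of [15] (the toy points are my own; the mutual-kNN condition is taken from the formula above):

```python
import numpy as np

def self_tuning_affinity(X, k=2):
    """w_ij = exp(-d(x_i,x_j)^2 / (d_i^k d_j^k)) for mutual k-nearest neighbors, else 0."""
    n = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)   # pairwise distances
    knn = [set(np.argsort(D[i])[1:k + 1]) for i in range(n)]    # k-NN sets (self excluded)
    sigma = np.array([np.sort(D[i])[k] for i in range(n)])      # d_i^k: distance to k-th neighbor
    W = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i != j and j in knn[i] and i in knn[j]:           # mutual-kNN condition
                W[i, j] = np.exp(-D[i, j] ** 2 / (sigma[i] * sigma[j]))
    return W

X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
W = self_tuning_affinity(X, k=2)
print(W[0, 1] > 0, W[0, 3] == 0)   # nearby points connected, far clusters not
```

The per-point scales sigma let tight and loose clusters coexist without a single global bandwidth parameter.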
14. The density of point i combines its edges to its k nearest neighbors and the edges among those neighbors:

    d_i = ∑_{j∈x_i^k} w_ij + ∑_{j,k∈x_i^k} w_jk

where x_i^k is the k-nearest-neighbor set of point i; a threshold α on this density is then applied.
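Given an affinity matrix and the neighbor sets, the density above can be sketched as follows (toy inputs of my own; I read the second sum as running over unordered pairs of neighbors):

```python
from itertools import combinations

import numpy as np

def snn_density(W, knn):
    """d_i = sum of w_ij over neighbors j, plus w_jk over pairs of neighbors j, k."""
    n = W.shape[0]
    dens = np.zeros(n)
    for i in range(n):
        nbrs = sorted(knn[i])
        dens[i] = sum(W[i, j] for j in nbrs)                       # edges to neighbors
        dens[i] += sum(W[j, k] for j, k in combinations(nbrs, 2))  # edges among neighbors
    return dens

# Toy symmetric affinity: points 0-2 form a dense core, point 3 is an outlier
W = np.array([[0, .9, .8, 0],
              [.9, 0, .7, 0],
              [.8, .7, 0, .1],
              [0, 0, .1, 0]])
knn = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}, 3: {2, 0}}
print(snn_density(W, knn))   # core points score high, the outlier scores low
```

Thresholding this density (the slides' α) would then separate core points from noise before growing clusters.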
32. Item 1
Item 2
Item 3
Item 4
Matrix Factorization for Collaborative Prediction
User 1
6
9
3
?
3
0
User 2
4
?
2
0
2
0
User 3
0
0
2
3
0
1
User 4
0
?
4
?
0
2
Item Factor Matrix
|
u
2
0
3
0
1
2
0
3
User Factor Matrix
• Collaborative prediction
Filling missing entries of the user-item rating matrix
• Matrix factorization
Predicting an unknown rating by
product of user factor vector and item factor vector
3
Regularized Matrix Factorization
• Minimize the regularized squared error loss.

  Alternating Least Squares (ALS)
  - Time complexity: O(2|Ω|K² + (I+J)K³)
  - Parallelization: easy
  - Tuning parameter: λ (regularization)
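A minimal dense-matrix sketch of one ALS sweep (a generic textbook version, not the code behind these results); each row update solves a K×K regularized least-squares system, which is where the (I+J)K³ term comes from:

```python
import numpy as np

def als_sweep(X, mask, U, V, lam):
    """One ALS sweep for min sum_{(i,j) in Omega} (X_ij - u_i . v_j)^2
    + lam * (||U||^2 + ||V||^2). mask[i, j] is True for observed entries."""
    K = U.shape[1]
    for i in range(U.shape[0]):                  # user rows, items fixed
        Vi = V[mask[i]]                          # factors of items rated by user i
        U[i] = np.linalg.solve(Vi.T @ Vi + lam * np.eye(K), Vi.T @ X[i, mask[i]])
    for j in range(V.shape[0]):                  # item rows, users fixed
        Uj = U[mask[:, j]]
        V[j] = np.linalg.solve(Uj.T @ Uj + lam * np.eye(K), Uj.T @ X[mask[:, j], j])
    return U, V

rng = np.random.default_rng(0)
I, J, K = 20, 15, 3
X = rng.normal(size=(I, K)) @ rng.normal(size=(K, J))   # exactly rank-K data
mask = rng.random((I, J)) < 0.7                          # ~70% observed
U, V = rng.normal(size=(I, K)), rng.normal(size=(J, K))
for _ in range(20):
    als_sweep(X, mask, U, V, lam=0.1)
err = np.sqrt(np.mean((X - U @ V.T)[mask] ** 2))         # training RMSE
```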
- 32 -
4
33. Regularized Matrix Factorization
• Minimize the regularized squared error loss.

  Stochastic Gradient Descent (SGD)
  - Time complexity: O(2|Ω|K)
  - Parallelization: possible, but not easy
  - Tuning parameters: λ (regularization) and the learning rate
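A generic sketch of one SGD epoch over the observed ratings (learning rate, regularization, and initialization here are illustrative choices, not the tuned values from the experiments):

```python
import numpy as np

def sgd_epoch(ratings, U, V, lam, lr):
    """One SGD epoch. For each observed (i, j, x): compute the error
    e = x - u_i . v_j and take a gradient step on the regularized loss."""
    for i, j, x in ratings:
        e = x - U[i] @ V[j]
        ui = U[i].copy()                      # use the pre-update u_i for v_j's step
        U[i] += lr * (e * V[j] - lam * U[i])
        V[j] += lr * (e * ui - lam * V[j])

rng = np.random.default_rng(1)
I, J, K = 30, 20, 3
true = rng.normal(size=(I, K)) @ rng.normal(size=(K, J))
ratings = [(i, j, true[i, j]) for i in range(I) for j in range(J)
           if rng.random() < 0.5]
U, V = 0.1 * rng.normal(size=(I, K)), 0.1 * rng.normal(size=(J, K))
for _ in range(100):
    sgd_epoch(ratings, U, V, lam=0.01, lr=0.02)
rmse = np.sqrt(np.mean([(x - U[i] @ V[j]) ** 2 for i, j, x in ratings]))
```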
5
Problem of parameter tuning
• λ too small: overfitting
• λ too large: underfitting
- 33 -
6
34. Problem of parameter tuning
• The optimal value of the regularization parameter differs depending on the dataset and the rank K.
  Regularization parameter chosen by cross-validation on various datasets and ranks K (Kim & Choi, IEEE SPL 2013).
7
Problem of parameter tuning
• SGD requires tuning of the regularization parameter, the learning rate, and even the number of epochs.

             0.005        0.007        0.010        0.015        0.020
  0.005   0.9061 / 13  0.9079 / 15  0.9117 / 19  0.9168 / 28  0.9168 / 44
  0.007   0.9056 / 10  0.9074 / 11  0.9112 / 13  0.9168 / 19  0.9169 / 31
  0.010   0.9064 /  7  0.9077 /  8  0.9113 / 10  0.9174 / 13  0.9186 / 21
  0.015   0.9099 /  5  0.9011 /  6  0.9152 /  6  0.9257 /  7  0.9390 /  7
  0.020   0.9166 /  4  0.9175 /  4  0.9217 /  4  0.9314 /  4  0.9431 /  3

  Netflix probe10 RMSE / optimal number of epochs of BRISMF for various learning-rate and regularization values (K = 40). (Takács et al., JMLR 2009)
- 34 -
8
35. Bayesian Matrix Factorization
• Prior: P(U), P(V)
• Likelihood: P(X | U, V)
• Posterior: P(U, V | X)
• Approximate the posterior by
  - MCMC (Salakhutdinov & Mnih, ICML 2008)
  - Variational methods (Lim & Teh, KDD Cup 2007)
• MCMC on Netflix:
  - No parameter tuning, no overfitting, high accuracy
  - Huge computational cost: O(2|Ω|K² + (I+J)K³)
9
Scalable Variational Bayesian Matrix Factorization
• No parameter tuning
• Linear space complexity: O(2(I+J)K)
• Linear time complexity: O(6|Ω|K)
• Easily parallelized on multi-core systems
• Optimizes an element-wise factorized variational distribution with a coordinate descent method.
- 35 -
10
36. Variational Bayesian Matrix Factorization
• Likelihood is given by
• Gaussian priors on factor matrices U and V:
• Approximate the posterior by a variational distribution, by maximizing the variational lower bound, or equivalently minimizing the KL-divergence.
11
VBMF-BCD (Lim & Teh, KDD Cup 2007)
• Matrix-wise factorized variational distribution

  VBMF-BCD
  - Space complexity: O((I+J)(K+K²))
  - Time complexity: O(2|Ω|K² + (I+J)K³)
  - Parallelization: easy
- 36 -
12
37. Scalable VBMF: linear space complexity
• Element-wise factorized variational distribution. Memory for K = 100:

  Dataset       I           J          O((I+J)(K+K²))   O(2(I+J)K)
  Netflix       480,189     17,770     4.4 GB           0.8 GB
  Yahoo-music   1,000,990   624,961    131 GB           2.6 GB
13
Scalable VBMF: quadratic time complexity
Updating rules for q(uki)
Updating all variational parameters
- 37 -
14
38. Scalable VBMF: linear time complexity
• Let R_ij denote the residual on the (i, j)-th observation: R_ij = X_ij − Σ_k u_ki v_kj.
• With R_ij, the updating rule can be rewritten in terms of the residual.
15
Scalable VBMF: linear time complexity
• When the variational parameter of u_ki is changed, the residuals R_ij can be easily updated, in time proportional to the number of observations in row i.
- 38 -
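A toy illustration (using a plain squared-error factor model rather than the exact variational update rule) of why maintaining residuals makes each coordinate update cheap:

```python
import numpy as np

# Toy setup: X ≈ U^T V with rank K; R holds the residuals R_ij.
rng = np.random.default_rng(0)
I, J, K = 4, 5, 2
U, V = rng.normal(size=(K, I)), rng.normal(size=(K, J))
X = rng.normal(size=(I, J))
R = X - U.T @ V                      # R_ij = X_ij - sum_k u_ki v_kj

# Changing a single factor element u_ki only shifts residuals in row i:
k, i = 1, 2
old, new = U[k, i], 0.7
R[i, :] -= (new - old) * V[k, :]     # O(|Omega_i|) incremental update
U[k, i] = new

assert np.allclose(R, X - U.T @ V)   # matches recomputing from scratch
```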
16
39. Scalable VBMF: parallelization
• Each column of variational parameters can be updated independently of the updates of the other columns.
• Parallelization can therefore be done in a column-by-column manner.
• Easy implementation with the OpenMP library on multi-core systems.
17
Related work (Pilászy et al., RecSys 2010)
• A similar idea is used to reduce the cubic time complexity of ALS to linear.
• RMF → Scalable VBMF: with small extra effort, a more accurate model is obtainable without tuning the regularization parameter.
- 39 -
18
40. Related Work (Raiko et al., ECML 2007)
• Considers an element-wise factorized variational distribution.
• Updates U and V by a scaled gradient descent method.
• Requires tuning of the learning rate.
• Learning speed is slower than our algorithm.
19
Numerical Experiments
• Compare VBMF-CD, VBMF-BCD (Lim & Teh, KDD Cup 2007), and VBMF-GD (Raiko et al., ECML 2007).
• Experimental environment
  - Quad-core Intel® Core™ i7-3820 @ 3.6 GHz, 64 GB memory
  - Implemented in Matlab 2011a, with the main computational modules implemented in C++ as mex files
  - Parallelized with the OpenMP library
• Datasets

  Dataset        # of users   # of items   # of ratings
  MovieLens10M   69,878       10,677       10,000,054
  Netflix        480,189      17,770       100,480,507
  Yahoo-music    1,000,990    624,961      262,810,275
- 40 -
20
41. Numerical Experiments: K = 20
RMSE versus computation time on a quad-core system for each dataset:
(a) MovieLens10M, (b) Netflix, (c) Yahoo-music

             MovieLens10M   Netflix   Yahoo-music
  VBMF-CD       0.8589      0.9065     22.3425
  VBMF-BCD      0.8671      0.9070     22.3671
  VBMF-GD       0.8591      0.9167     22.5883
21
Numerical Experiments: Netflix, K = 50

  Time per iteration:
  VBMF-BCD   66 min.
  VBMF-CD    77 sec.
  VBMF-GD    29 sec.

  Iterations / time to reach a given RMSE:
            VBMF-BCD         VBMF-CD
  RMSE      Iter.   Time     Iter.   Time
  0.9005    19      21 h     63      74 m
  0.9004    21      23 h     70      82 m
  0.9003    22      24 h     84      98 m
  0.9002    25      28 h     108     2 h
  0.9001    27      31 h     680     13 h
  0.9000    30      33 h     —       —
- 41 -
22
42. Conclusion
• We presented a scalable learning algorithm for VBMF, VBMF-CD.
• VBMF-CD optimizes element-wise factorized variational distributions with a coordinate descent method.
• The space and time complexity of VBMF-CD are linear.
• VBMF-CD can be easily parallelized.
• Experimental results confirmed the desirable behavior of VBMF-CD, such as scalability, fast learning, and prediction accuracy.
23
- 42 -
43. A hybrid genetic algorithm for accelerating feature selection and
parameter optimization of support vector machine
2013. 11. 29.

Introduction
• Support Vector Machine (SVM)
  - One of the most popular state-of-the-art classification algorithms.
  - Efficiently finds non-linear solutions by exploiting kernel functions.
  - Training time complexity is O(N³).
• "Very important" issues in training an SVM
  - Feature selection
    • SVM is a distance-based algorithm (kernel matrix computation) and doesn't include any feature selection mechanism.
    • Irrelevant features degrade the model performance.
  - Parameter optimization
    • Model tradeoff parameter C, kernel parameter σ (for the RBF kernel).
    • SVM is very sensitive to the parameter settings.
  - For SVM, feature selection and parameter optimization should be performed simultaneously.
- 43 -
2
44. Introduction
• Genetic algorithm (GA)
  - A stochastic algorithm that mimics natural evolution.
  - Easy, but very effective!
  - Cycle: Population → Selection → Parents → Genetic operation (Crossover, Mutation) → Offspring → Replacement → Population
• GA-based feature selection and parameter selection for SVM [1-4]
  - GA effectively finds near-optimal feature subsets and parameters.
  - But slow (though MUCH better than a grid-search mechanism).
3
Introduction
• If the SVM has to be re-trained periodically, fast feature selection and parameter optimization is required.
• This study aims to avoid producing bad offspring in the "Genetic Operation" step of GA.
• This study proposes a chromosome filtering method, based on a Decision Tree (DT), for faster convergence of GA for feature selection and parameter optimization of SVM.
- 44 -
4
45. The proposed method
• Flowchart
  Initialization → Population → Evaluate fitness → [Termination condition?]
    - yes: output optimized parameters and feature subset
    - no: Do genetic operations → Chromosome Filtering → Population Replacement → Evaluate fitness → …
5
The proposed method
• Chromosome design
  - Parameters: binary representation.
    Each bit of C selects a power of ten from {10⁻², 10⁻¹, …, 10³}; e.g. the bits 1 0 0 1 0 0 decode to C = 1 × 10⁻² + 1 × 10¹.
    σ is encoded the same way over {2⁻⁵, …, 2⁵}.
  - Feature subset: binary representation.
    Genotype 1 0 0 1 0 … 1 0 over f₁ f₂ f₃ f₄ f₅ … f_{p−1} f_p decodes to the phenotype {f₁, f₄, …, f_{p−1}}.
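A sketch of the decoding step; the exact bit width and the scale set are assumptions inferred from the example above:

```python
def decode(chrom, n_feats, scales=(1e-2, 1e-1, 1e0, 1e1, 1e2, 1e3)):
    """Decode a binary chromosome into (C, selected feature indices).
    Parameter bits select powers of ten that are summed; the trailing
    feature bits select the feature subset (genotype -> phenotype)."""
    n = len(scales)
    C = sum(s for bit, s in zip(chrom[:n], scales) if bit)
    features = [i + 1 for i, bit in enumerate(chrom[-n_feats:]) if bit]
    return C, features

# Bits 1 0 0 1 0 0 give C = 1e-2 + 1e1 = 10.01, as in the slide's example;
# feature bits 1 0 0 1 0 give the subset {f1, f4}.
C, feats = decode([1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0], n_feats=5)
```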
- 45 -
6
46. The proposed method
• Fitness evaluation
  - Decode the chromosome to obtain C, σ, and a feature subset (genotype → phenotype).
  - Train an SVM on the dataset given the selected C, σ, and feature subset.
  - Fitness value: cross-validation accuracy.
7
The proposed method
• Genetic operation
  - Parent selection
    • Roulette-wheel scheme, i.e. fitness-proportional selection (FPS).
    • Probability of the i-th chromosome c_i in the population being selected = f(i) / Σ_j f(j), where f(i) is the fitness of c_i.
  - Crossover: N-point crossover
    • Choose N random crossover points and split along those points.
  - Mutation: bit-flipping mutation
    • Bitwise bit-flipping with fixed probability.
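The three operators can be sketched generically as follows (all names and probabilities are illustrative, not taken from the paper):

```python
import random

def roulette_select(population, fitness):
    """Fitness-proportional selection: P(c_i) = f(i) / sum_j f(j)."""
    total = sum(fitness)
    r = random.uniform(0, total)
    acc = 0.0
    for chrom, f in zip(population, fitness):
        acc += f
        if acc >= r:
            return chrom
    return population[-1]

def n_point_crossover(a, b, n=2):
    """Split both parents at n random points and alternate the segments."""
    points = sorted(random.sample(range(1, len(a)), n))
    child, src, prev = [], 0, 0
    for p in points + [len(a)]:
        child += (a if src == 0 else b)[prev:p]
        src ^= 1
        prev = p
    return child

def bit_flip(chrom, pm=0.05):
    """Flip each gene independently with probability pm."""
    return [1 - g if random.random() < pm else g for g in chrom]
```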
- 46 -
8
47. The proposed method
• Chromosome Filtering
– For each generation, chromosomes and their fitness are stored in the
knowledgebase. A DT is trained periodically based on the knowledgebase.
Using the DT, the offspring chromosomes that are likely to have bad fitness are
removed before the fitness evaluation step.
– Assumption
• Some features and parameter settings improve (or degrade) the model
performance.
• DT can find these rules.
9
The proposed method
• Chromosome Filtering (continued)
  - Why a DT?
    • Effectively deals with categorical features.
    • Finds non-linear relationships.
    • Uses a few relevant features in the classification procedure.
  - DT training
    • Each c_i (the i-th chromosome) in the knowledgebase is labeled by fitness:
      - highest M fitness values → GOOD (likely to yield a good fitness value)
      - next highest M fitness values → NORMAL
      - remaining → BAD (likely to yield a bad fitness value)

      Knowledgebase (sorted by fitness):
        c_1, c_2, c_3, …, c_M                 → GOOD
        c_{M+1}, c_{M+2}, c_{M+3}, …, c_{2M}  → NORMAL
        c_{2M+1}, c_{2M+2}, c_{2M+3}, …       → BAD

    • Input feature: chromosome (in phenotype)
    • Output feature: label {GOOD, NORMAL, BAD}
- 47 -
10
48. The proposed method
• Chromosome Filtering (continued)
  - Filtering
    • The DT gives rules that assess a chromosome before fitness evaluation: is the chromosome GOOD, NORMAL, or BAD?
    • Each class has a different survival probability, e.g. GOOD: 1.0, NORMAL: 0.5, BAD: 0.2.
    • The DT is periodically updated, so the criterion for a good chromosome changes through the generations.
11
The proposed method
• Chromosome Filtering (continued)
  - DT example: internal nodes test parameter ranges (e.g. C > 100, σ > 1, σ > 0.25) and feature inclusion (e.g. "contains F1?", "contains F3?"); leaves carry the labels GOOD, NORMAL, and BAD.
- 48 -
12
49. The proposed method
• Population Replacement: steady-state model
  (chosen to verify the effectiveness of the proposed method in the initial period of GA)
  - Only one chromosome in the population is updated per generation.
  - Replacement scheme [5, 6]: the offspring replaces one of its parents or the lowest-fitness chromosome in the population.
    • If the offspring is superior to both parents, it replaces the more similar parent.
    • If it is in between the two parents, it replaces the inferior parent.
    • Otherwise, the most inferior chromosome in the population is replaced.
13
Experiments
• Experimental Design
  - 10 datasets from the UCI repository; all datasets were normalized to [-1, 1].
  - 5 independent runs; a fixed random seed set was used for fairness.
  - In SVM training, 10-fold cross-validation was used.
  - Parameter settings
    • GA parameters
      - population size N_pop = 30
      - crossover probability p_c = 0.9
      - mutation probability p_m = 0.05
      - max iterations = 300
      - p_good = 1; p_normal = 0.5; p_bad = 0.2
    • DT parameters
      - CART
      - Labeling: good = 10, normal = 10, bad = remaining
      - Training starting point: 30th generation / period = 10
- 49 -
14
51. Concluding Remarks
We presented a chromosome filtering method for GA-based feature selection
and parameter optimization of SVM.
The proposed method employed a DT as a chromosome filter to remove the
offspring chromosomes that are likely to have bad fitness before the fitness
evaluation step of GA.
On most datasets, the proposed method showed faster improvement of fitness
than standard GA.
17
Acknowledgements
This work was supported by the National Research Foundation of Korea (NRF)
grant funded by the Korea government (MSIP) (No. 2011-0030814), and the
Brain Korea 21 Program for Leading Universities & Students. This work was
also supported by the Engineering Research Institute of SNU.
- 51 -
18
52. References
1. Frohlich, H., Chapelle, O., & Scholkopf, B. (2003, November). Feature selection for support vector machines by means of genetic algorithm. In Tools with Artificial Intelligence, 2003. Proceedings. 15th IEEE International Conference on (pp. 142-148). IEEE.
2. Huang, C. L., & Wang, C. J. (2006). A GA-based feature selection and parameters optimization for support vector machines. Expert Systems with Applications, 31(2), 231-240.
3. Min, S. H., Lee, J., & Han, I. (2006). Hybrid genetic algorithms and support vector machines for bankruptcy prediction. Expert Systems with Applications, 31(3), 652-660.
4. Zhao, M., Fu, C., Ji, L., Tang, K., & Zhou, M. (2011). Feature selection and parameter optimization for support vector machines: A new approach based on genetic algorithm with feature chromosomes. Expert Systems with Applications, 38(5), 5197-5204.
5. Bui, T. N., & Moon, B. R. (1996). Genetic algorithm and graph partitioning. Computers, IEEE Transactions on, 45(7), 841-855.
6. Oh, I. S., Lee, J. S., & Moon, B. R. (2004). Hybrid genetic algorithms for feature selection. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 26(11), 1424-1437.
19
End of Document
- 52 -
20
107. Document Indexing by Ensemble Model
Yanshan Wang and In-Chan Choi
Korea University
System Optimization Lab
yansh.wang@gmail.com
November 25, 2013
Yanshan Wang and In-Chan Choi (KU)
Indexing by EnM
November 25, 2013
1 / 18
Overview
1
The Basics
Information Retrieval and Document Indexing
Topic Modelling
Indexing by Latent Dirichlet Allocation
2
Indexing by Ensemble Model
Introduction to Ensemble Model
Algorithms
Experimental Results
3
Conclusions and Discussion
- 107 -
108. The problem in Information Retrieval
As more information (Big
Data) becomes available, it is
more difficult to access what
users are looking for.
We need new tools to help us
understand and search among
vast amounts of information.
Source: www.betaversion.org/ stefano/linotype/news/26/
Document Indexing is Important
Users get the information they want through indexed (ranked) documents (or items). The higher a document's position, the more valuable it is to users.
- 108 -
109. Problems in Conventional Methods: Word Representation
The majority of rule-based and statistical Natural Language Processing (NLP) models regard words as atomic symbols.
In Vector Space Models (VSM), a word is represented by a single 1 and a lot of zeros. For example:
  motel = [0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0]
  hotel = [0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0]
Its problem: motel AND hotel = 0.
The conceptual meaning of words is ignored.
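The orthogonality problem can be seen directly (indices 14 and 10 are taken from the example vectors above):

```python
import numpy as np

# One-hot vectors ("one 1 and a lot of zeros") for two related words.
V = 20
motel, hotel = np.zeros(V), np.zeros(V)
motel[14], hotel[10] = 1, 1

# Any two distinct one-hot vectors are orthogonal, so the VSM sees
# no similarity at all between "motel" and "hotel":
print(motel @ hotel)  # → 0.0
```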
Topic Modeling
Latent Dirichlet Allocation (LDA)
[Blei et al. (2003)].
Uncover the hidden topics that
generate the collection.
Words and Documents can be
represented according to those
topics.
Use the representation to organize,
index and search the text.
- 109 -
An example word representation in the topic space:
  apple = (0.325, 0.792, 0.214, 0.107, 0.109, 0.612, 0.314, 0.245)ᵀ
110. LDA [Blei et al. (2003)]
Generative process:
  Choose the number of words N ∼ Poisson(ξ).
  Choose θ ∼ Dirichlet(α).
  For n = 1, 2, ..., N:
    Choose a topic z_n ∼ Multinomial(θ);
    Choose a word w_n ∼ Multinomial(w_n | z_n, β), a multinomial distribution conditioned on the topic z_n.

Joint distribution: p(θ, z, d | α, β) = p(θ | α) ∏_{n=1}^{N} p(z_n | θ) p(w_n | z_n, β)
Indexing by LDA (LDI) [Choi and Lee (2010)]
With adequate assumptions, the probability of a word w_j embodying the concept z^k is

  W_j^k = p(z^k = 1 | w_j = 1) = β_jk / Σ_{h=1}^{K} β_jh

The document (or query) representation can be defined within the topic space as

  D_i^k (Q_i^k) = ( Σ_{j=1}^{V} W_j^k n_ij ) / N_{d_i},

where n_ij denotes the number of occurrences of word w_j in document d_i and N_{d_i} denotes the number of words in document d_i, i.e. N_{d_i} = Σ_{j=1}^{V} n_ij.

Similarity between document and query:

  ρ(D, Q) = (D / ‖D‖) · (Q / ‖Q‖)
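A sketch of the LDI computation (function names are illustrative; in practice β would come from a fitted LDA model rather than random numbers):

```python
import numpy as np

def ldi_representation(beta, counts):
    """beta: K x V topic-word matrix; counts: D x V word-count matrix.
    W[j, k] = beta[k, j] / sum_h beta[h, j];
    D_i^k = (sum_j W[j, k] * n_ij) / N_di."""
    W = beta.T / beta.T.sum(axis=1, keepdims=True)   # V x K, rows sum to 1
    return counts @ W / counts.sum(axis=1, keepdims=True)

def cos_sim(a, b):
    """rho(D, Q) = (D/||D||) . (Q/||Q||)"""
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

rng = np.random.default_rng(0)
beta = rng.random((4, 9))             # K = 4 topics, V = 9 words
counts = rng.integers(1, 5, (3, 9))   # 3 documents
D = ldi_representation(beta, counts)  # each row is a topic-space vector
```

Because each row of W sums to 1, every document vector D_i is itself a distribution over the K topics.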
- 110 -
111. Indexing by Ensemble Model (EnM) [Wang et al. (2013)]
Motivation: there exist optimal weights over the constituent models.

Table: A toy example. The values in the table represent similarities of documents with respect to a given query. The scores of Ensemble 1 and 2 are defined by 0.5*Model 1 + 0.5*Model 2 and 0.7*Model 1 + 0.3*Model 2, respectively. The relevant document list is assumed to be {2, 3}.

               Model 1   Model 2   Ensemble 1   Ensemble 2
  Document 1    0.35      0.2       0.55         0.305
  Document 2    0.4       0.1       0.5          0.31
  Document 3    0.25      0.7       0.95         0.385
  (M)AP         0.72      0.72      0.72         0.89
AP and MAP
Average Precision (AP) and Mean Average Precision (MAP)

Notation:
  |Q|               the number of queries in the query set;
  |D_i|             the number of documents in the relevant document set w.r.t. the i-th query;
  d_ij ∈ D_i        the j-th document in D_i;
  φ_i^k             the relevance scores returned by the k-th model w.r.t. the i-th query;
  R(d_ij, φ_i^k)    the indexing position of the j-th document for the i-th query returned by the k-th model;
  H = Σ_k α_k φ^k   the ensemble model, a linear combination of the constituent models, where α_k ≥ 0.

Definition:
  E(H, Q) = (1/|Q|) Σ_{i=1}^{|Q|} AP(H, D_i),   AP(H, D_i) = (1/|D_i|) Σ_{j=1}^{|D_i|} j / R(d_ij, H).
112. Formulation
Formulation of the Optimization Problem
Since 0 ≤ AP ≤ 1, we can define the empirical loss as follows:

  min Σ_{i=1}^{|Q|} (1 − AP(H, D_i)),  or equivalently
  min Σ_{i=1}^{|Q|} ( 1 − (1/|D_i|) Σ_{j=1}^{|D_i|} j / R(d_ij, H) ).

Our goal is to uncover the optimal weights α that minimize the empirical loss.

Difficulty: the position function R(d_ij, H) is nonconvex, nondifferentiable, and noncontinuous w.r.t. the α's.
Boosting Scheme
1. Select model:
     ĵ = arg max_j Σ_{i=1}^{|Q|} D_i AP(φ_i^j);
2. Update the weight:
     α̂_j^t = α̂_j^{t−1} + δ̂_j^t,  where  δ_j^t = (1/2) log [ Σ_{i=1}^{|Q|} D_i (1 + AP(φ_i^j)) / Σ_{i=1}^{|Q|} D_i (1 − AP(φ_i^j)) ];
3. Update the distribution on queries:
     D_i = exp(−AP(H_i)) / Z,  where Z is a normalizer.
- 112 -
113. Coordinate Descent
Since the objective is nonconvex, not every coordinate update will reduce the loss.
1. Select model:
     ĵ = arg max_j E(Q, φ^j);
2. Update the weight:
     α_j = (1/2) log [ (1 + AP(φ_i^j)) / (1 − AP(φ_i^j)) ];
3. If E^t ≤ E^{t−1}, delete this coordinate.
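The weight update is the same log-odds-style formula used in the boosting scheme; a one-line sketch shows its behavior:

```python
import math

def coordinate_weight(ap):
    """alpha_j = (1/2) * log((1 + AP) / (1 - AP)): models with higher AP
    receive larger weights, and AP = 0 gives weight 0."""
    return 0.5 * math.log((1 + ap) / (1 - ap))

# Monotonically increasing in AP, and unbounded as AP approaches 1.
weights = [coordinate_weight(ap) for ap in (0.0, 0.5, 0.72, 0.9)]
```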
Parallel Coordinate Descent
The coordinate descent algorithm can be parallelized across cores:

  1: parfor p = 1, 2, ..., K_φ do
  2:   Update the weights using α_p = (1/2) log [ (1 + AP(φ_i^p)) / (1 − AP(φ_i^p)) ];
  3: end parfor
  4: return Ensemble model H.
- 113 -
114. Experimental Results on EnM
Data: MED corpus¹.
  1033 documents from the National Library of Medicine.
  30 queries.
Results.

Table: MAP of various methods for the MED corpus.

  Method     MAP      improvement (%)
  TFIDF      0.4605   —
  LSI        0.5026   9.1
  pLSI       0.5334   15.8
  LDI        0.5738   24.6
  EnM.B      0.6420   39.4
  EnM.CD     0.6461   40.3
  EnM.PCD    0.6414   39.3

Figure: Precision-Recall curves for the various methods (TFIDF, LSA, pLSI, LDI, EnM).

1: ftp://ftp.cs.cornell.edu/pub/smart.
Conclusions and Discussion
Conclusion
  An ensemble model (EnM) is proposed, and three algorithms are introduced for solving the optimization problem.
  The EnM outperformed all the basis models across the entire recall regime.
Discussion
  The algorithms cannot guarantee convergence to the global optimum due to the nonconvexity of the objective.
  The parallel coordinate descent algorithm cannot guarantee an optimum, even a local optimum, due to the coupling between variables.
Future Works
  Approximate the objective with convex functions.
  Use stochastic gradient descent for stochastic sequences and large-scale data sets.
- 114 -
115. References
Yanshan Wang and In-Chan Choi (2013). Indexing by ensemble model. Working Paper. arXiv preprint arXiv:1309.3421.
David M. Blei, Andrew Y. Ng, and Michael I. Jordan (2003). Latent Dirichlet allocation. The Journal of Machine Learning Research, 3, 993-1022.
In-Chan Choi and Jae-Sung Lee (2010). Document indexing by latent Dirichlet allocation. DMIN, 409-414.
Y. Freund and R. E. Schapire (1995). A decision-theoretic generalization of on-line learning and an application to boosting. Computational Learning Theory, Springer, 23-37.
My Homepage: http://optlab.korea.ac.kr/~sam/
The End
- 115 -
137. • Suarez, Estrella, et al. Matrix-assisted laser desorption/ionization-mass spectrometry of cuticular lipid profiles can differentiate sex, age, and mating status of Anopheles gambiae mosquitoes. Analytica Chimica Acta 706.1 (2011): 157-163.
- 137 -
13
14
138. Suarez, Estrella, et al. Matrix-assisted laser desorption/ionization-mass spectrometry of cuticular lipid profiles can differentiate sex, age, and mating status of Anopheles gambiae mosquitoes. Analytica Chimica Acta 706.1 (2011): 157-163.
15
Li, Lihua, et al. Data mining techniques for cancer detection using serum proteomic profiling. Artificial intelligence in medicine 32.2
(2004): 71-83.
- 138 -
16
141. (1)
Tibshirani, Robert, et al. Sparsity and smoothness via the fused lasso. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 67.1 (2005): 91-108.
21
Liu, Jun, Lei Yuan, and Jieping Ye. An efficient algorithm for a class of fused lasso problems. Proceedings of the 16th ACM
SIGKDD international conference on Knowledge discovery and data mining. ACM, 2010.
- 141 -
22
142. (2)
Liu, Jun, Lei Yuan, and Jieping Ye. An efficient algorithm for a class of fused lasso problems. Proceedings of the 16th ACM
SIGKDD international conference on Knowledge discovery and data mining. ACM, 2010.
23
Results
- 142 -
24
143. Performance Comparison
  Average misclassification rate / average number of selected features.
[Figure A: absolute values of the fused lasso coefficients versus m/z. Figure B: intensities of the fused-lasso-selected features, projected onto the first two principal components (legend: Others, MFemale7).]
- 143 -
26
427. Fields: Country, City, Latitude, Longitude, Year, DataType, DataType2, DataType3, Institution, Purpose, Scope, Time Lag, Count, Ratio, Collection, Application
- 210 -
443.
11
• Valid Voting Ratio_i = (N_cy + N_cn + N_py + N_pn) / N_all,
  where N_all is the total number of conservative/progressive parties and N_all ≥ N_cy + N_cn + N_py + N_pn.
• Yes-No Diversity_i = − Σ_{k∈{y,n}} P_k log₂ P_k,
  where P_y = (N_cy + N_py) / (N_cy + N_cn + N_py + N_pn) and P_n = (N_cn + N_pn) / (N_cy + N_cn + N_py + N_pn).
• Political Orientation Diversity_i = − Σ_{i∈{c,p}, j∈{y,n}} P_ij log₄ P_ij,
  where P_cy = N_cy / (N_cy + N_cn + N_py + N_pn), P_cn = N_cn / (N_cy + N_cn + N_py + N_pn), ...
- 226 -
12
445. • Evaluation measures, from a 2×2 confusion table with counts yy, yn, ny, nn (true label first, predicted label second):

  Recall_y = yy / (yy + yn),  Precision_y = yy / (yy + ny),  F1_y = 2 · Recall_y · Precision_y / (Recall_y + Precision_y)
  Recall_n = nn / (nn + ny),  Precision_n = nn / (nn + yn),  F1_n = 2 · Recall_n · Precision_n / (Recall_n + Precision_n)
  F1_yn = (F1_y + F1_n) / 2
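A sketch of these measures from the four confusion counts (the simple average for F1_yn is an assumption):

```python
def f1_scores(yy, yn, ny, nn):
    """Per-class recall, precision, and F1 from the 2x2 confusion counts
    (first index = true class, second index = predicted class)."""
    recall_y, precision_y = yy / (yy + yn), yy / (yy + ny)
    recall_n, precision_n = nn / (nn + ny), nn / (nn + yn)
    f1_y = 2 * recall_y * precision_y / (recall_y + precision_y)
    f1_n = 2 * recall_n * precision_n / (recall_n + precision_n)
    return f1_y, f1_n, (f1_y + f1_n) / 2

# A perfect classifier scores 1.0 on both classes.
scores = f1_scores(5, 0, 0, 5)
```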
483. Modified LDA with Bibliography Information
Korea BI Data Mining Society, 2013 Fall Conference
System Optimization Lab.
Korea University
Young Min Jun
1
Contents
1. LDA
1.1 Topic Model
1.2 LDA
2. Modified LDA with
Bibliography Information
2.1 Limitation of LDA
2.2 Introduction
2.3 Preliminary
2.4 Model
2.5 Expected Impacts
- 265 -
2
484. 1.1 Topic Model
"Topic modeling provides a suite of algorithms to discover hidden thematic structure in large collections of texts. The results of topic modeling algorithms can be used to summarize, visualize, explore, and theorize about a corpus." (DM Blei, 2012)

Example
• What are the "topics" in the New York Times?
• How do the "topics" on Twitter change?
• How similar are these articles?

Research on Topic Models
• LSA: based on dimensionality reduction (SVD decomposition)
• pLSA: mixture decomposition
• LDA: the most frequently studied model
3
1.2 LDA
"LDA is a generative probabilistic model for collections of discrete data such as text corpora. It is a three-level hierarchical Bayesian model, in which each item of a collection is modeled as a finite mixture over an underlying set of topics." (DM Blei, 2003)

Generative Process / Graphical Model / Geometric Interpretation

Example
• Three topics for three words.
• LDA places a smooth distribution on the topics.
- 266 -
4
485. 2.1 Limitation of LDA
LDA is an effective tool for discovering topic structure, but there is room for further research to improve it. Within that space, this research focuses on three aspects: individual, reference, and explanation.

Individual
• LDA is a generative model for a corpus, so it provides information about the whole set of documents.
• In this study, modified LDA gives more information about an individual document: it provides a vector of information about a document and its distribution.

Reference
• LDA does not consider referring to the reference literature in its generative process.
• Modified LDA provides the bibliography of a document.

Explanation
• LDA often gives results which are hard to understand.
• Modified LDA is expected to provide more explainable results.
5
2.2 Introduction
LDA is motivated by the process of writing a document. Similarly, Modified LDA is motivated by writing a document in a library.

Generic Generative Process — in more detail:
• The place in the library encodes the probabilities of which reference is selected.
• References in the same category have similar topics and words.
- 267 -
6
486. 2.3 Preliminary
In this research, we use the language of text collections and introduce terms such as "parent corpus", "category", and "document distribution".

Parent Corpus
• A set of documents used for reference; the parent corpus consists of parent documents.
• A parent document influences the topics and words of the new document.
• Each parent document has its own place.

Category
• A category is a cluster of the parent corpus.
• Parent documents in the same category share the same topic and word priors.

Document Distribution
• The document distribution is the probability distribution over the selection of parent documents.
• Each parent document in a category has a probability of selection.
7
2.3 Preliminary
The document distribution represents the information of the new document.

Document Distribution
• A probabilistic as well as a deterministic representation of a document.
• Probabilistic: a distribution over the parent documents giving the probability of each being used in generating the document.
• Deterministic: the list of parent documents with high selection probability.

Mixture of Gaussian Distributions
• The number of mixture components gives the number of categories.
- 268 -
8
487. 2.3 Preliminary
This slide contains the assumptions of Modified LDA with Bibliography Information.

Parent Corpus
• The parent corpus is assumed to have its own α and β.
• Each parent document is placed at a point in the document distribution.

Document Distribution
• The probability of a parent document follows a mixture of Gaussian distributions.
• The number of mixture components is known (can be relaxed).

LDA
• Bag-of-words assumption.
9
2.4 Model
This slide contains the notation and terminology, and the generative process, of Modified LDA with Bibliography Information.

Notation and Terminology / Generative Process
- 269 -
10
488. 2.4 Model
This slide contains the graphical model and the probability of a document.

Graphical Model / Probability of Document
11
2.4 Model
Estimation
- 270 -
12
489. 2.5 Expected Impacts
This research focuses on three aspects: individual, reference, and explanation.

Individual
• Bibliography as a probabilistic representation of a document.
• Verifying plagiarism by comparing document distributions.

Reference
• Verifying that a document is well classified.
• Representing the information of the important references.

Explanation
• Providing a variety of views for analyzing text data.
13
2.5 Expected Impacts
Drawbacks

Dependency on LDA
• This model depends on LDA for quantities such as the perplexity and the complexity.

Computational Complexity
• This research yields a total number of operations roughly on the order of O(N⁴k²).

Assumption
• It is assumed that the number of mixture components in the document distribution is known (can be relaxed).
- 271 -
14
490. References
[1] Jeff A. Bilmes et al. A gentle tutorial of the EM algorithm and its application to parameter estimation for Gaussian mixture and hidden Markov models. International Computer Science Institute, 4(510):126, 1998.
[2] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent Dirichlet allocation. The Journal of Machine Learning Research, 3:993–1022, 2003.
[3] DM Blei. Topic modeling and digital humanities. Journal of Digital Humanities, 2(1):8–11, 2012.
[4] Nikos Vlassis and Aristidis Likas. A greedy EM algorithm for Gaussian mixture learning. Neural Processing Letters, 15(1):77–87, 2002.
15
- 272 -