Paper Study: Melding the Data-Decisions Pipeline by ChenYiHuang5
Melding the Data-Decisions Pipeline: Decision-Focused Learning for Combinatorial Optimization, from AAAI 2019.
I derive the equations myself and, applying the same derivation procedure, arrive at the same results as the two cited CMU papers [Donti et al. 2017; Amos et al. 2017].
Learning a Nonlinear Embedding by Preserving Class Neighbourhood Structure (final) by WooSung Choi
Salakhutdinov, Ruslan, and Geoffrey E. Hinton. "Learning a nonlinear embedding by preserving class neighbourhood structure." International Conference on Artificial Intelligence and Statistics. 2007.
Analysis of Feature Selection Algorithms (Branch & Bound and Beam Search) by Parinda Rajapaksha
The Branch & Bound and Beam Search algorithms are illustrated in the context of feature selection. The presentation is structured as follows:
- Motivation
- Introduction
- Analysis
- Algorithm
- Pseudo Code
- Illustration of examples
- Applications
- Observations and Recommendations
- Comparison between two algorithms
- References
Introduction to machine learning terminology.
Applications within High Energy Physics and outside HEP.
* Basic problems: classification and regression.
* Nearest neighbours approach and spatial indices
* Overfitting (intro)
* Curse of dimensionality
* ROC curve, ROC AUC
* Bayes optimal classifier
* Density estimation: KDE and histograms
* Parametric density estimation
* Mixtures for density estimation and EM algorithm
* Generative approach vs discriminative approach
* Linear decision rule, intro to logistic regression
* Linear regression
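As an illustration of one of the topics above, ROC AUC can be computed directly from its rank interpretation: the probability that a randomly chosen positive example outscores a randomly chosen negative one. A minimal sketch (function name and data are this sketch's own, not from the lecture):

```python
def roc_auc(labels, scores):
    """ROC AUC via its rank statistic: the probability that a random
    positive example receives a higher score than a random negative one
    (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# A perfect ranker separates the classes completely.
print(roc_auc([1, 1, 0, 0], [0.9, 0.8, 0.3, 0.1]))  # 1.0
```

This is equivalent to the area under the ROC curve obtained by sweeping the decision threshold, which is why 0.5 corresponds to random scoring.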
Branch and Bound Feature Selection for Hyperspectral Image Classification by Sathishkumar Samiappan
Feature selection (FS) is a classical combinatorial problem in pattern recognition and data mining, of major importance in classification and regression scenarios. In this paper, a hybrid approach that combines branch-and-bound (BB) search with Bhattacharyya-distance-based feature selection is presented for classifying hyperspectral data using Support Vector Machine (SVM) classifiers. The performance of this hybrid approach is compared to another hybrid approach that uses genetic algorithm (GA) based feature selection in place of BB, and to baseline SVMs with no feature reduction. Experimental results using hyperspectral data show that, under small sample size situations, the BB approach performs better than both the GA approach and the SVM with no feature selection.
Overview of the state-of-the-art Time Series Clustering based on literature study; distance metrics, prototypes, time-series preprocessing, and clustering algorithms
Overview of Optimization Algorithms in Deep Learning by Khang Pham
An overview of function optimization in general and in deep learning. The slides cover algorithms ranging from basic batch and stochastic gradient descent to state-of-the-art methods such as Momentum, Adagrad, RMSprop, and Adam.
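As a concrete reference for the last of those methods, here is a minimal sketch of a single Adam update, following the standard bias-corrected formulation; the function name and the scalar state-passing style are this sketch's own assumptions, not from the slides:

```python
import math

def adam_step(p, g, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter p with gradient g.
    m, v are the running first/second moment estimates; t is the
    1-based step count used for bias correction."""
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    m_hat = m / (1 - b1 ** t)      # bias-corrected first moment
    v_hat = v / (1 - b2 ** t)      # bias-corrected second moment
    p = p - lr * m_hat / (math.sqrt(v_hat) + eps)
    return p, m, v

# Minimizing f(x) = x**2 (gradient 2x) drives x toward 0.
x, m, v = 5.0, 0.0, 0.0
for t in range(1, 1001):
    x, m, v = adam_step(x, 2 * x, m, v, t, lr=0.01)
```

The division by the square root of the second moment is what gives Adam its per-parameter adaptive step size, the trait it shares with Adagrad and RMSprop.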
Talk on Optimization for Deep Learning, which gives an overview of gradient descent optimization algorithms and highlights some current research directions.
What is boosting
Boosting algorithm
Building models using GBM
Main algorithm parameters
Fine-tuning models
Hyperparameters in GBM
Validating GBM models
This is the fourth slide deck for the machine learning workshop at Hulu. Machine learning methods are summarized at the beginning of the deck, and boosting trees are then introduced. Boosting trees are recommended when the number of features is not too large (<1000).
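To make the outline above concrete, here is a minimal from-scratch sketch of gradient boosting for squared-error regression, using depth-1 trees (stumps) as weak learners; the function names and toy usage are this sketch's own, not from the workshop:

```python
def fit_stump(xs, residuals):
    """Best threshold split on a 1-D feature minimizing the squared
    error of the residuals; returns (threshold, left_mean, right_mean)."""
    best = (float("inf"), xs[0], 0.0, 0.0)
    for t in sorted(set(xs))[:-1]:  # the largest value leaves the right side empty
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if err < best[0]:
            best = (err, t, lm, rm)
    return best[1], best[2], best[3]

def gbm_fit(xs, ys, n_trees=100, lr=0.2):
    """Each round fits a stump to the current residuals and adds a
    shrunken copy of it to the ensemble (gradient boosting with
    squared-error loss)."""
    f0 = sum(ys) / len(ys)
    pred = [f0] * len(ys)
    trees = []
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, pred)]
        t, lm, rm = fit_stump(xs, residuals)
        trees.append((t, lm, rm))
        pred = [p + lr * (lm if x <= t else rm) for x, p in zip(xs, pred)]
    return f0, lr, trees

def gbm_predict(model, x):
    f0, lr, trees = model
    return f0 + sum(lr * (lm if x <= t else rm) for t, lm, rm in trees)
```

The learning rate `lr` is the shrinkage hyperparameter: smaller values need more trees but usually generalize better, which is the trade-off the "hyperparameters in GBM" topic refers to.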
Accelerating Random Forests in Scikit-Learn by Gilles Louppe
Random Forests are without contest one of the most robust, accurate and versatile tools for solving machine learning tasks. Implementing this algorithm properly and efficiently remains, however, a challenging task involving issues that are easily overlooked if not considered with care. In this talk, we present the Random Forests implementation developed within the Scikit-Learn machine learning library. In particular, we describe the iterative team efforts that led us to gradually improve our codebase and eventually make Scikit-Learn's Random Forests one of the most efficient implementations in the scientific ecosystem, across all libraries and programming languages. Algorithmic and technical optimizations that have made this possible include:
- An efficient formulation of the decision tree algorithm, tailored for Random Forests;
- Cythonization of the tree induction algorithm;
- CPU cache optimizations, through low-level organization of data into contiguous memory blocks;
- Efficient multi-threading through GIL-free routines;
- A dedicated sorting procedure, taking into account the properties of data;
- Shared pre-computations whenever critical.
Overall, we believe that lessons learned from this case study extend to a broad range of scientific applications and may be of interest to anybody doing data analysis in Python.
Support Vector Machine Techniques for Nonlinear Equalization by Shamman Noor Shoudha
Equalization techniques have long been used to counteract the effects of communication channels and non-linearities. Traditional nonlinear equalization techniques, however, are fraught with challenges. As such, research has been ongoing into defining the equalization problem as a classification problem, so that machine learning techniques can be applied. Following that approach, support vector machine techniques provide an efficient way to define boundaries for classifying non-linear symbol constellations in communication systems. Using BPSK modulation as a baseline with a two-tap channel filter model, this research validates the application of support vector machine techniques to correctly define symbol boundaries. The performance of support vector machines is directly related to the SNR and the extent of the non-linearities. The bit-error-rate performance shows that this approach is viable, providing results comparable to traditional methods as well as neural networks. As a further addition to the results in the reference paper, this research shows that SVMs do not generalize well to channel conditions and SNRs different from those of the training dataset. A filter-bank SVM approach is shown to improve BER performance in these varying conditions.
We consider the problem of finding anomalies in high-dimensional data using popular PCA-based anomaly scores. The naive algorithms for computing these scores explicitly compute the PCA of the covariance matrix, which uses space quadratic in the dimensionality of the data. We give the first streaming algorithms that use space that is linear or sublinear in the dimension. We prove general results showing that any sketch of a matrix that satisfies a certain operator norm guarantee can be used to approximate these scores. We instantiate these results with powerful matrix sketching techniques such as Frequent Directions and random projections to derive efficient and practical algorithms for these problems, which we validate over real-world data sets. Our main technical contribution is to prove matrix perturbation inequalities for operators arising in the computation of these measures.
-Proceedings: https://arxiv.org/abs/1804.03065
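As a toy illustration of the kind of score the abstract refers to, the sketch below computes a rank-1 PCA anomaly score (the squared residual left after projecting out the top principal component) without ever forming the d x d covariance matrix; the function names are this sketch's own, and a real streaming version would replace the exact power iteration with a matrix sketch such as Frequent Directions:

```python
def top_pc(X, iters=200):
    """Top principal component of the rows of X (a list of lists),
    by power iteration on the mean-centered data; the d x d covariance
    matrix is never formed explicitly."""
    n = len(X)
    mu = [sum(col) / n for col in zip(*X)]
    Xc = [[x - m for x, m in zip(row, mu)] for row in X]
    v = [1.0] * len(mu)
    for _ in range(iters):
        Xv = [sum(a * b for a, b in zip(row, v)) for row in Xc]            # X_c v
        w = [sum(Xc[i][j] * Xv[i] for i in range(n)) for j in range(len(mu))]  # X_c^T (X_c v)
        norm = sum(wi * wi for wi in w) ** 0.5
        v = [wi / norm for wi in w]
    return mu, v

def anomaly_score(x, mu, v):
    """Squared norm of the residual of x after removing its component
    along the top principal component: large for points off the subspace."""
    xc = [a - m for a, m in zip(x, mu)]
    proj = sum(a * b for a, b in zip(xc, v))
    return sum(c * c for c in xc) - proj * proj
```

Points lying on the dominant subspace score near zero; points far off it score high, which is the anomaly signal the streaming algorithms in the paper approximate in small space.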
A lecture organized by the Barmaja group @parmg_sa
https://www.meetup.com/parmg_sa/events/238339639/
Delivered in Riyadh, at the Badir incubator, on 20 Jumada al-Akhirah 1438 AH (18 March 2017).
This report is based on my final report for the course CommE 5051: Mathematical Principles of Machine Learning, National Taiwan University, spring 2018. In this report, some theoretical principles of domain adaptation established in the literature are briefly presented.
Quality Estimation for Machine Translation Using the Joint Method of Evaluati... by Lifeng (Aaron) Han
This is a short presentation for the poster of the WMT13 shared task. The paper introduces our participation in the WMT13 shared tasks on Quality Estimation for machine translation without using reference translations. We submitted results for Task 1.1 (sentence-level quality estimation), Task 1.2 (system selection), and Task 2 (word-level quality estimation). In Task 1.1, we used an enhanced version of the BLEU metric, without reference translations, to evaluate translation quality. In Task 1.2, we utilized a probabilistic model, Naïve Bayes (NB), as a classification algorithm with features borrowed from traditional evaluation metrics. In Task 2, to take contextual information into account, we employed a discriminative undirected probabilistic graphical model, the Conditional Random Field (CRF), in addition to the NB algorithm. Training experiments on past WMT corpora showed that the designed methods yielded promising results, especially the statistical CRF and NB models. The official results show that our CRF model achieved the highest F-score, 0.8297, in the binary classification of Task 2.
This paper considers the problem of learning probability distributions through the Bellman dynamics in distributional reinforcement learning. Previous work learns a finite number of statistics of each return distribution with a neural network, but this constrains the functional form of the return distribution, limiting expressiveness, and makes maintaining the predefined statistics difficult. To remove these restrictions, this paper proposes learning deterministic (pseudo-random) samples of the return distribution using Maximum Mean Discrepancy (MMD), a hypothesis-testing technique. By implicitly matching all moments between the return distribution and the Bellman target, the method guarantees convergence of the distributional Bellman operator, and a finite-sample analysis of the distribution approximation is presented. Experimental results show that the proposed method outperforms distributional RL baselines and achieves the best performance on Atari games among agents that do not use a distributed setup.
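For intuition, the quantity at the heart of that method, the squared MMD between two sample sets under a Gaussian kernel, can be estimated as below; this is a generic scalar sketch with names of my own, not the paper's agent:

```python
import math

def mmd2(xs, ys, sigma=1.0):
    """Biased estimator of the squared Maximum Mean Discrepancy between
    two scalar sample sets under a Gaussian (RBF) kernel. Matching the
    kernel mean embeddings implicitly matches all moments, which is the
    property the paper exploits for distributional RL targets."""
    k = lambda a, b: math.exp(-((a - b) ** 2) / (2 * sigma ** 2))
    kxx = sum(k(a, b) for a in xs for b in xs) / len(xs) ** 2
    kyy = sum(k(a, b) for a in ys for b in ys) / len(ys) ** 2
    kxy = sum(k(a, b) for a in xs for b in ys) / (len(xs) * len(ys))
    return kxx + kyy - 2 * kxy
```

Identical samples give an MMD of zero, and the estimate grows as the two sample sets drift apart, so it can serve as a training loss between predicted return samples and Bellman-target samples.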
Social media has many positive impacts on our culture. It has increased connections between people and created an environment in which you can share your opinions, pictures, and much more.
This presentation covers the introduction, history, and internal management systems of operating systems, and how process scheduling and file management work in Windows.
Pushing the limits of ePRTC: 100ns holdover for 100 days by Adtran
At WSTS 2024, Alon Stern explored the topic of parametric holdover and explained how recent research findings can be implemented in real-world PNT networks to achieve 100 nanoseconds of accuracy for up to 100 days.
In his public lecture, Christian Timmerer provides insights into the fascinating history of video streaming, starting from its humble beginnings before YouTube to the groundbreaking technologies that now dominate platforms like Netflix and ORF ON. Timmerer also presents provocative contributions of his own that have significantly influenced the industry. He concludes by looking at future challenges and invites the audience to join in a discussion.
Removing Uninteresting Bytes in Software Fuzzing by Aftab Hussain
Imagine a world where software fuzzing, the process of mutating bytes in test seeds to uncover hidden and erroneous program behaviors, becomes faster and more effective. A lot depends on the initial seeds, which can significantly dictate the trajectory of a fuzzing campaign, particularly in terms of how long it takes to uncover interesting behaviour in your code. We introduce DIAR, a technique designed to speed up fuzzing campaigns by pinpointing and eliminating uninteresting bytes in the seeds. Picture this: instead of wasting valuable resources on meaningless mutations in large, bloated seeds, DIAR removes the unnecessary bytes, streamlining the entire process.
In this work, we equipped AFL, a popular fuzzer, with DIAR and examined two critical Linux libraries: Libxml's xmllint, a tool for parsing XML documents, and Binutils' readelf, an essential debugging and security-analysis command-line tool used to display detailed information about ELF (Executable and Linkable Format) files. Our preliminary results show that AFL+DIAR not only discovers new paths more quickly but also achieves higher overall coverage. This work thus showcases how starting with lean and optimized seeds can lead to faster, more comprehensive fuzzing campaigns, and DIAR helps you find such seeds.
- These are the slides of a talk given at the IEEE International Conference on Software Testing, Verification and Validation Workshops, ICSTW 2022.
UiPath Test Automation using UiPath Test Suite series, part 5 by DianaGray10
Welcome to part 5 of the UiPath Test Automation using UiPath Test Suite series. In this session, we will cover CI/CD with DevOps.
Topics covered:
CI/CD within UiPath
End-to-end overview of a CI/CD pipeline with Azure DevOps
Speaker:
Lyndsey Byblow, Test Suite Sales Engineer @ UiPath, Inc.
Enhancing Adoption of Open Source Libraries: A Case Study on Albumentations.AI by Vladimir Iglovikov, Ph.D.
Presented by Vladimir Iglovikov:
- https://www.linkedin.com/in/iglovikov/
- https://x.com/viglovikov
- https://www.instagram.com/ternaus/
This presentation delves into the journey of Albumentations.ai, a highly successful open-source library for data augmentation.
Created out of a necessity for superior performance in Kaggle competitions, Albumentations has grown to become a widely used tool among data scientists and machine learning practitioners.
This case study covers various aspects, including:
People: The contributors and community that have supported Albumentations.
Metrics: The success indicators such as downloads, daily active users, GitHub stars, and financial contributions.
Challenges: The hurdles in monetizing open-source projects and measuring user engagement.
Development Practices: Best practices for creating, maintaining, and scaling open-source libraries, including code hygiene, CI/CD, and fast iteration.
Community Building: Strategies for making adoption easy, iterating quickly, and fostering a vibrant, engaged community.
Marketing: Both online and offline marketing tactics, focusing on real, impactful interactions and collaborations.
Mental Health: Maintaining balance and not feeling pressured by user demands.
Key insights include the importance of automation, making the adoption process seamless, and leveraging offline interactions for marketing. The presentation also emphasizes the need for continuous small improvements and building a friendly, inclusive community that contributes to the project's growth.
Vladimir Iglovikov brings his extensive experience as a Kaggle Grandmaster, ex-Staff ML Engineer at Lyft, sharing valuable lessons and practical advice for anyone looking to enhance the adoption of their open-source projects.
Explore more about Albumentations and join the community at:
GitHub: https://github.com/albumentations-team/albumentations
Website: https://albumentations.ai/
LinkedIn: https://www.linkedin.com/company/100504475
Twitter: https://x.com/albumentations
The Art of the Pitch: WordPress Relationships and Sales by Laura Byrne
Clients don’t know what they don’t know. What web solutions are right for them? How does WordPress come into the picture? How do you make sure you understand scope and timeline? What do you do if something changes?
All these questions and more will be explored as we talk about matching clients’ needs with what your agency offers without pulling teeth or pulling your hair out. Practical tips, and strategies for successful relationship building that leads to closing the deal.
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor... by SOFTTECHHUB
The choice of an operating system plays a pivotal role in shaping our computing experience. For decades, Microsoft's Windows has dominated the market, offering a familiar and widely adopted platform for personal and professional use. However, as technological advancements continue to push the boundaries of innovation, alternative operating systems have emerged, challenging the status quo and offering users a fresh perspective on computing.
One such alternative that has garnered significant attention and acclaim is Nitrux Linux 3.5.0, a sleek, powerful, and user-friendly Linux distribution that promises to redefine the way we interact with our devices. With its focus on performance, security, and customization, Nitrux Linux presents a compelling case for those seeking to break free from the constraints of proprietary software and embrace the freedom and flexibility of open-source computing.
Securing your Kubernetes cluster: a step-by-step guide to success! by KatiaHIMEUR1
Today, after several years of existence, an extremely active community and an ultra-dynamic ecosystem, Kubernetes has established itself as the de facto standard in container orchestration. Thanks to a wide range of managed services, it has never been so easy to set up a ready-to-use Kubernetes cluster.
However, this ease of use means that the subject of security in Kubernetes is often left for later, or even neglected. This exposes companies to significant risks.
In this talk, I'll show you step-by-step how to secure your Kubernetes cluster for greater peace of mind and reliability.
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs by Alex Pruden
This paper presents Reef, a system for generating publicly verifiable succinct non-interactive zero-knowledge proofs that a committed document matches or does not match a regular expression. We describe applications such as proving the strength of passwords, the provenance of email despite redactions, the validity of oblivious DNS queries, and the existence of mutations in DNA. Reef supports the Perl Compatible Regular Expression syntax, including wildcards, alternation, ranges, capture groups, Kleene star, negations, and lookarounds. Reef introduces a new type of automata, Skipping Alternating Finite Automata (SAFA), that skips irrelevant parts of a document when producing proofs without undermining soundness, and instantiates SAFA with a lookup argument. Our experimental evaluation confirms that Reef can generate proofs for documents with 32M characters; the proofs are small and cheap to verify (under a second).
Paper: https://eprint.iacr.org/2023/1886
Threats to mobile devices are more prevalent and increasing in scope and complexity. Users of mobile devices want to take full advantage of the features available on those devices, but many features provide convenience and capability while sacrificing security. This best-practices guide outlines steps users can take to better protect personal devices and information.
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo... by James Anderson
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. Constant focus on speed to release software to market, along with the traditional slow and manual security checks has caused gaps in continuous security as an important piece in the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their applications supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with PASSION for technology and making things work along with a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations in CI/CD and application security integrated in software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
Maruthi Prithivirajan, Head of ASEAN & IN Solution Architecture, Neo4j
Get an inside look at the latest Neo4j innovations that enable relationship-driven intelligence at scale. Learn more about the newest cloud integrations and product enhancements that make Neo4j an essential choice for developers building apps with interconnected data and generative AI.
Epistemic Interaction - tuning interfaces to provide information for AI support by Alan Dix
Paper presented at SYNERGY workshop at AVI 2024, Genoa, Italy. 3rd June 2024
https://alandix.com/academic/papers/synergy2024-epistemic/
As machine learning integrates deeper into human-computer interactions, the concept of epistemic interaction emerges, aiming to refine these interactions to enhance system adaptability. This approach encourages minor, intentional adjustments in user behaviour to enrich the data available for system learning. This paper introduces epistemic interaction within the context of human-system communication, illustrating how deliberate interaction design can improve system understanding and adaptation. Through concrete examples, we demonstrate the potential of epistemic interaction to significantly advance human-computer interaction by leveraging intuitive human communication strategies to inform system design and functionality, offering a novel pathway for enriching user-system engagements.
4. SUPPORT VECTOR MACHINES
• The SVM is a large-margin classifier that searches for the hyperplane that maximizes the margin between the positive samples and the negative samples.
5. SUPPORT VECTOR MACHINES
• Measures of the capacity of a learning machine: VC dimension, fat-shattering dimension.
• The capacity of a learning machine is related to the margin on the training data:
  - As the margin goes up, the VC dimension may go down, and thus the upper bound on the test error goes down. (Vapnik 79)
6. SUPPORT VECTOR MACHINES
• SVMs’ theoretical guarantees are much weaker than their actual performance: the margin-based upper bounds on the test error are too loose.
• This motivates the SVM-based voting algorithm.
7. SVM BASED VOTING
• Previous work (Dijkstra 02): use an SVM for parse reranking directly.
  - Positive samples: the parse with the highest f-score for each sentence.
• First try:
  - Tree kernel: compute the dot product on the space of all subtrees. (Collins 02)
  - Linear kernel: rich features. (Collins 00)
8. SVM BASED VOTING ALGORITHM
• Use pairwise parses as samples.
• Let x_ij be the j-th candidate parse for the i-th sentence in the training data.
• Let x_i1 be the parse with the highest f-score among all parses for the i-th sentence.
• Positive samples: (x_i1, x_ij), j > 1
• Negative samples: (x_ij, x_i1), j > 1
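The construction on this slide can be sketched in generic Python (names are this sketch's own; `parses` holds the candidates for one sentence as (parse, f_score) pairs):

```python
def pairwise_samples(parses):
    """Build the pairwise training samples: pair the best-f-score parse
    x_i1 of a sentence with every other candidate x_ij, labelling
    (x_i1, x_ij) positive and (x_ij, x_i1) negative.
    parses: list of (parse, f_score) tuples for one sentence."""
    best = max(parses, key=lambda pf: pf[1])
    rivals = [pf for pf in parses if pf is not best]
    pos = [((best[0], p), +1) for p, _ in rivals]  # (x_i1, x_ij) -> +1
    neg = [((p, best[0]), -1) for p, _ in rivals]  # (x_ij, x_i1) -> -1
    return pos + neg
```

Each labelled pair encodes an ordering preference rather than an absolute judgment, which is what the pairwise formulation on the later slides relies on.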
9. PREFERENCE KERNELS
• Let (t1, t2), (v1, v2) be two pairs of parses.
• K: the base kernel, linear or tree kernel.
• The preference kernel is defined as:
  PK((t1, t2), (v1, v2)) = K(t1, v1) - K(t1, v2) - K(t2, v1) + K(t2, v2)
• A pair (t1, t2) represents the difference between a good parse and a bad one; the preference kernel computes the similarity between the two differences.
10. SVM BASED VOTING
• Decision function f of the SVM, for each pair of parses:
  f(x1, x2) = score(x1) - score(x2)
  score(x) = Σ_{i=1}^{N_s} a_i y_i (K(s_i1, x) - K(s_i2, x))
• (s_i1, s_i2) is the i-th support vector.
• N_s is the total number of support vectors.
• y_i is the class of (s_i1, s_i2) and can be {-1, 1}.
• a_i is the Lagrange multiplier solved by the SVM.
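The decision rule on this slide can be sketched as follows (generic Python of my own; K is the base kernel and each support vector is itself a pair of parses):

```python
def score(x, support_vectors, alphas, ys, K):
    """score(x) = sum_i a_i * y_i * (K(s_i1, x) - K(s_i2, x)),
    where (s_i1, s_i2) is the i-th support vector."""
    return sum(a * y * (K(s1, x) - K(s2, x))
               for (s1, s2), a, y in zip(support_vectors, alphas, ys))

def decide(x1, x2, support_vectors, alphas, ys, K):
    """f(x1, x2) = score(x1) - score(x2); positive means x1 is preferred."""
    return (score(x1, support_vectors, alphas, ys, K)
            - score(x2, support_vectors, alphas, ys, K))
```

Because f depends on the two candidates only through their individual scores, the voting step can rank an n-best list by score alone instead of comparing all pairs.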
11. THEORETICAL ISSUES
• Justifying the Preference Kernel
• Justifying Pairwise Samples
• Margin Based Bound for the SVM Based Voting Algorithm
13. JUSTIFYING THE PAIRWISE SAMPLES
• The SVM using single parses as samples searches for a decision function score constrained by:
  - score(x_i1) > 0
  - score(x_ij) < 0, j > 1
  which is too strong.
• Pairwise:
  - score(x_i1) > score(x_ij), j > 1
14. MARGIN BASED BOUND FOR SVM BASED VOTING
• Loss function of voting:
  l_vote(x, f) = 1 if f(x*) < f(x), 0 otherwise
• Loss function of classification:
  l_class(x1, x2, g_f) = 1 if x2 = x* and g_f(x1, x2) = 1,
                         1 if x1 = x* and g_f(x1, x2) = -1,
                         0 otherwise
  (x* denotes the best parse of the sentence)
• The expected voting loss is equal to the expected classification loss. (Herbrich 2000)
15. EXPERIMENTS – WSJ TREEBANK
• N-best parsing results (Collins 02)
• SVM-light (Joachims 98)
• Two kernels K used in the preference kernel:
  - Linear kernel
  - Tree kernel (very slow)
16. EXPERIMENTS – LINEAR KERNEL
• The training data are cut into slices. Slice i contains the two pairwise samples ((p_k1, p_ki), 1) and ((p_ki, p_k1), -1) for each sentence k.
• 22 SVMs trained on 22 slices of the training data.
• 2 days to train one SVM on a Pentium III 1.13 GHz.
17. RESULTS
Experimental results on section 23

≤40 words (2245 sentences)
Model         LR     LP     CBs   0 CBs  2 CBs  f-score
Collins 99    88.5%  88.7%  0.92  66.7%  87.1%  88.6%
Charniak 00   90.1%  90.1%  0.74  70.1%  89.6%  90.1%
Collins 00    90.1%  90.4%  0.75  70.7%  89.6%  90.3%
SVM - linear  89.9%  90.3%  0.73  71.7%  89.4%  90.1%

≤100 words (2416 sentences)
Model         LR     LP     CBs   0 CBs  2 CBs  f-score
Collins 99    88.1%  88.3%  1.06  64.0%  85.1%  88.2%
Charniak 00   89.6%  89.5%  0.88  67.6%  87.7%  89.6%
Collins 00    89.6%  89.5%  0.87  68.3%  87.7%  89.8%
SVM - linear  89.4%  89.8%  0.89  69.2%  87.6%  89.6%
18. CONCLUSIONS
• Using an SVM approach:
  - achieves state-of-the-art results;
  - the SVM with a linear kernel is superior to the tree kernel in both speed and accuracy.