Dr. Sobia Baig gave a lecture on probability and random variables. She discussed how mathematical models, including deterministic and probability models, can be used as tools in systems analysis and design. She provided examples of how probability models are applied in communication systems, signal processing, resource sharing, and other applications to account for uncertainty and randomness.
The VoiceMOS Challenge 2022 aimed to encourage research in automatic prediction of mean opinion scores (MOS) for speech quality. It featured two tracks evaluating systems' ability to predict MOS ratings from a large existing dataset or a separate listening test. 21 teams participated in the main track and 15 in the out-of-domain track. Several teams outperformed the best baseline, which fine-tuned a self-supervised model, though the top-performing approaches generally involved ensembling or multi-task learning. While unseen systems were predictable, unseen listeners and speakers remained a difficulty, especially when generalizing to a new listening test. The challenge highlighted progress in MOS prediction but also the need for metrics reflecting both ranking and absolute accuracy.
2010 PACLIC - pay attention to categories, by WarNik Chow
This document summarizes a research paper on a proposed method called Metadata Projection Matrix (MPM) for sentence modeling that allows controlling attention to certain syntactic categories. The method uses a projection matrix to incorporate syntactic category information when calculating attention weights. Experimental results on several datasets show MPM outperforms baselines on tasks where attention to specific categories is important, like detecting terms or irony, but is weaker on more context-dependent tasks. The method is best suited to applications where syntactic structure significantly informs predictions.
This document summarizes an experiment comparing different character-level embedding approaches for Korean sentence classification tasks. Dense character-level embeddings using pre-trained fastText vectors outperformed sparse one-hot encodings. Character-level embeddings preserved local semantics around character boundaries better than Jamo-level encodings, which performed best with self-attention. While Jamo-level features may be useful for syntax-semantic tasks, character-level approaches had better performance and computation efficiency. These findings provide insights for character-rich languages beyond Korean.
Michael Manukyan and Hrayr Harutyunyan gave a talk on sentence representations in the context of deep learning at the Armenian NLP Meetup. They also reviewed a recent paper on machine comprehension (Wang & Jiang, 2016).
Practical Machine Learning - Part 1 contains:
- Basic notions of ML (what tasks are there, what is a model, how to measure performance)
- A couple of examples of problems and solutions (taken from previous work)
- A brief presentation of open-source software used for ML (R, scikit-learn, Weka)
Twitter has recently attracted much attention as a hot research topic in the domain of sentiment analysis. Training sentiment classifiers from tweet data often faces the data sparsity problem, partly due to the large variety of short and irregular forms introduced into tweets because of the 140-character limit. In this work we propose using two different sets of features to alleviate the data sparseness problem. One is the semantic feature set, where we extract semantically hidden concepts from tweets and then incorporate them into classifier training through interpolation. The other is the sentiment-topic feature set, where we extract latent topics and the associated topic sentiment from tweets, then augment the original feature space with these sentiment-topics. Experimental results on the Stanford Twitter Sentiment Dataset show that both feature sets outperform the baseline model using unigrams only. Moreover, using semantic features rivals the previously reported best result, and using sentiment-topic features achieves 86.3% sentiment classification accuracy, which outperforms existing approaches.
SemEval - Aspect Based Sentiment Analysis, by Aditya Joshi
SemEval is an ongoing series of evaluations of computational semantic analysis systems that evolved from word sense evaluation. SemEval 2014 included several tasks, including aspect based sentiment analysis (Task 4) which had four subtasks: (1) aspect term extraction, (2) aspect term polarity classification, (3) aspect category detection, and (4) aspect category polarity classification. The top performing system for this task used a semi-Markov tagger for aspect term extraction and SVMs trained on lexical, syntactic, and semantic features for the other subtasks.
Meta-evaluation of machine translation evaluation methods, by Lifeng (Aaron) Han
Cite: Lifeng Han. 2021. Meta-evaluation of machine translation evaluation methods. In Metrics2021 Tutorial Track/type: Workshop on Informetric and Scientometric Research (SIG-MET), ASIS&T. October 23–24.
An ongoing project on Natural Language Processing (using Python and the NLTK toolkit), which focuses on extracting the sentiment from a question and its title on www.stackoverflow.com and determining the polarity. Based on these findings, it is verified whether the rules and guidelines imposed by the SO community on users are strictly followed or not.
Classifying Non-Referential It for Question Answer Pairs, by Jinho Choi
This paper introduces a new corpus, QA-It, for the classification of non-referential it. Our dataset is unique in the sense that it is annotated on question-answer pairs collected from multiple genres, useful for developing advanced QA systems. Our annotation scheme makes clear distinctions between 4 types of it, providing guidelines for many erroneous cases. Several statistical models are built for the classification of it, showing encouraging results. To the best of our knowledge, this is the first time that such a corpus has been created for question answering.
Mlintro 120730222641-phpapp01-210624192524, by Scott Domes
- Prof. Lior Rokach introduces machine learning and gives an overview of his background and research interests.
- Machine learning aims to develop systems that can learn from experience to improve their performance on some task. A key aspect is that the system improves its performance based on experience.
- Examples of machine learning applications include spam filtering, data mining, handwriting recognition, and more. Popular algorithms discussed include decision trees, neural networks, nearest neighbors, and ensemble methods.
- Challenges in machine learning like overfitting, dimensionality, and finding meaningful patterns are also covered.
Presentation of Domain Specific Question Answering System Using N-gram Approach, by Tasnim Ara Islam
Designed an application for a domain-specific question answering system. Built a solution for finding answers to factoid questions using an N-gram mining approach. Calculated the percentage of related answers for the specific question. Built this application on the Java platform.
This document summarizes a presentation about a sentiment analysis system developed for a large Korean telecommunications company. The system was designed to analyze customer feedback from call centers. It classified feedback into categories, identified trends over time, and detected complaints. The system used Korean linguistic analysis and sentiment classification. It showed the benefits of combining machine learning and rules-based approaches. However, challenges remained around data quality, lexicon development, and meeting customer expectations. Future work focused on improving the sentiment dictionary and developing a platform for ongoing natural language processing services.
CUHK intern PPT. Machine Translation Evaluation: Methods and Tools, by Lifeng (Aaron) Han
Abstract of Aaron Han’s Presentation
The main topic of this presentation will be the “evaluation of machine translation”. With the rapid development of machine translation (MT), MT evaluation has become increasingly important for telling whether systems are actually making progress. Traditional human judgment is very time-consuming and expensive. On the other hand, the existing automatic MT evaluation metrics have several weaknesses:
– they perform well on certain language pairs but poorly on others, which we call the language-bias problem;
– they consider either no linguistic information (leading to low correlation with human judgments) or too many linguistic features (making them difficult to replicate), which we call the extremism problem;
– they are designed around incomplete factors (e.g. precision only).
To address the existing problems, he has developed several automatic evaluation metrics:
– Design tunable parameters to address the language-bias problem;
– Use concise linguistic features for the linguistic extremism problem;
– Design augmented factors.
Experiments on ACL-WMT corpora show that the proposed metrics yield higher correlation with human judgments. The proposed metrics have been published at top international conferences, e.g. COLING and MT SUMMIT. Evaluation work is closely related to similarity measurement, so this work can be further developed in other areas such as information retrieval, question answering, and search.
A brief introduction to some of his other research will also be given, covering Chinese named entity recognition, word segmentation, and multilingual treebanks, published in the Springer LNCS and LNAI series. Suggestions and comments are much appreciated, and opportunities for further collaboration would be very welcome.
Lexicon-based approaches to Twitter sentiment analysis are gaining much popularity due to their simplicity, domain independence, and relatively good performance. These approaches rely on sentiment lexicons, where a collection of words is marked with fixed sentiment polarities. However, a word's sentiment orientation (positive, neutral, negative) and/or sentiment strength can change depending on context and targeted entities. In this paper we present SentiCircle, a novel lexicon-based approach that takes into account the contextual and conceptual semantics of words when calculating their sentiment orientation and strength in Twitter. We evaluate our approach on three Twitter datasets using three different sentiment lexicons. Results show that our approach significantly outperforms two lexicon baselines. Results are competitive but inconclusive when comparing to the state-of-the-art SentiStrength, and vary from one dataset to another. SentiCircle outperforms SentiStrength in accuracy on average, but falls marginally behind in F-measure.
This document presents an approach for detecting service-oriented architecture (SOA) antipatterns. It discusses problems in existing literature, such as a lack of specifications for SOA antipatterns and no approaches for detecting them. The proposed solution involves specifying SOA antipatterns using rule cards, generating detection algorithms, detecting suspicious services, and validating the results. A domain-specific language and framework called Service Oriented Detection for Antipatterns (SODA) are developed to implement this approach. The goal is to provide a solution for detecting SOA antipatterns in service-based systems and validating the impacts.
This document provides an overview of a course on trends and research applications in natural language processing (NLP). It begins with introducing the goals of the course, which are to understand interesting NLP tasks and novel projects through a research-oriented webinar. The document then covers various NLP topics like question answering, machine translation, sentiment analysis, natural language generation applications, and challenges in NLP like grounded language and embodied language. It also provides tips for aspiring NLP researchers.
This document provides an overview of natural language processing (NLP) research trends presented at ACL 2020, including shifting away from large labeled datasets towards unsupervised and data augmentation techniques. It discusses the resurgence of retrieval models combined with language models, the focus on explainable NLP models, and reflections on current achievements and limitations in the field. Key papers on BERT and XLNet are summarized, outlining their main ideas and achievements in advancing the state-of-the-art on various NLP tasks.
The document provides information about an upcoming bootcamp on natural language processing (NLP) being conducted by Anuj Gupta. It discusses Anuj Gupta's background and experience in machine learning and NLP. The objective of the bootcamp is to provide a deep dive into state-of-the-art text representation techniques in NLP and help participants apply these techniques to solve their own NLP problems. The bootcamp will be very hands-on and cover topics like word vectors, sentence/paragraph vectors, and character vectors over two days through interactive Jupyter notebooks.
NLP Bootcamp 2018: Representation Learning of text for NLP, by Anuj Gupta
The document provides an outline for a workshop on representation learning of text for natural language processing (NLP). The workshop will be divided into 4 modules covering both foundational techniques like one-hot encoding and bag-of-words as well as state-of-the-art methods like word, sentence, and character vectors. The objective is for participants to gain a deeper understanding of the key ideas, math, and code behind text representation techniques in order to apply them to solve NLP problems and achieve higher accuracies and understanding.
This document provides an overview of machine learning applications in natural language processing and text classification. It discusses common machine learning tasks like part-of-speech tagging, named entity extraction, and text classification. Popular machine learning algorithms for classification are described, including k-nearest neighbors, Rocchio classification, support vector machines, bagging, and boosting. The document argues that machine learning can be used to solve complex real-world problems and that text processing is one area with many potential applications of these techniques.
Ayush Kumar, Sarah Kohail, Amit Kumar, Asif Ekbal, Chris Biemann
IIT Patna, India
TU Darmstadt, Germany
Presented by: Alexander Panchenko, TU Darmstadt, Germany
This document summarizes a research paper that proposes an unsupervised approach to adapt existing sentiment lexicons to the context and language used on Twitter. It captures the contextual semantics of words based on their surrounding context in tweets. This is used to update the prior sentiment orientation and strength of words in an existing Twitter sentiment lexicon called Thelwall-Lexicon. Experiments show the adapted lexicons improve sentiment classification performance on two Twitter datasets compared to the original lexicon.
Apply Chinese radicals into neural machine translation: deeper than character..., by Lifeng (Aaron) Han
The document proposes incorporating Chinese radicals into neural machine translation models. It discusses related work incorporating word and character level information into neural MT. The proposed model combines radical-level MT with an attention-based neural model, representing input text with word, character, and radical combinations. Experiments show the character+radical and word+radical models outperform baselines on standard MT evaluation metrics using a Chinese-English dataset. Future work includes improving model optimization and testing on additional data.
This document provides an overview of representation learning techniques for natural language processing (NLP). It begins with introductions to the speakers and objectives of the workshop, which is to provide a deep dive into state-of-the-art text representation techniques. The workshop is divided into four modules: word vectors, sentence/paragraph/document vectors, and character vectors. The document provides background on why text representation is important for NLP, and discusses older techniques like one-hot encoding, bag-of-words, n-grams, and TF-IDF. It also introduces newer distributed representation techniques like word2vec's skip-gram and CBOW models, GloVe, and the use of neural networks for language modeling.
The document discusses collaborative filtering approaches for recommender systems. It covers user-based and item-based nearest neighbor collaborative filtering methods. It describes how similarity between users or items is measured using approaches like Pearson correlation and cosine similarity. It also discusses challenges like data sparsity and different algorithmic improvements and model-based approaches like matrix factorization using singular value decomposition.
This document outlines the syllabus and objectives for a course on probability and random processes for electrical engineering. The syllabus covers topics like probability models, random variables, multiple random variables, sums of random variables, random processes, analysis of random signals, Markov chains, and related mathematical concepts. The objectives are to describe and analyze various probabilistic concepts and random signals, and to design filters and estimators for random systems and processes.
This document summarizes key concepts from Dr. Sobia Baig's lecture on probability and random variables. It discusses conditional probability, Bayes' theorem, and independent events. Examples are provided to illustrate how to calculate conditional probabilities, apply Bayes' rule, and determine if events are independent. The document also examines sequential experiments and how to determine probabilities when subexperiments are independent.
This document summarizes key concepts from a lecture on probability and random variables, including defining random experiments and sample spaces, specifying discrete and continuous sample spaces, describing events, and outlining the axioms of probability for assigning probabilities to events.
Statistical Significance Testing in Information Retrieval: An Empirical Analy..., by Julián Urbano
Statistical significance testing is widely accepted as a means to assess how well a difference in effectiveness reflects an actual difference between systems, as opposed to random noise because of the selection of topics. According to recent surveys on SIGIR, CIKM, ECIR and TOIS papers, the t-test is the most popular choice among IR researchers. However, previous work has suggested computer intensive tests like the bootstrap or the permutation test, based mainly on theoretical arguments. On empirical grounds, others have suggested non-parametric alternatives such as the Wilcoxon test. Indeed, the question of which tests we should use has accompanied IR and related fields for decades now. Previous theoretical studies on this matter were limited in that we know that test assumptions are not met in IR experiments, and empirical studies were limited in that we do not have the necessary control over the null hypotheses to compute actual Type I and Type II error rates under realistic conditions. Therefore, not only is it unclear which test to use, but also how much trust we should put in them. In contrast to past studies, in this paper we employ a recent simulation methodology from TREC data to go around these limitations. Our study comprises over 500 million p-values computed for a range of tests, systems, effectiveness measures, topic set sizes and effect sizes, and for both the 2-tail and 1-tail cases. Having such a large supply of IR evaluation data with full knowledge of the null hypotheses, we are finally in a position to evaluate how well statistical significance tests really behave with IR data, and make sound recommendations for practitioners.
This document summarizes key points from Dr. Sobia Baig's lecture on probability and random variables. The lecture covers the geometric probability law, sequences of dependent experiments, random number generators, and simulating random experiments. It discusses concepts like Bernoulli trials, trellis diagrams, generating uniformly distributed random numbers on [0,1] for computer simulations, and approaches to pseudo-random number generation.
This document summarizes key points from Dr. Sobia Baig's lecture on probability and random variables. The lecture covers the geometric probability law, sequences of dependent experiments, random number generators, and simulating random experiments. It discusses concepts like Bernoulli trials, finite precision in computers, and generating random numbers for simulations. The lecture also provides an example of finding the probability of a sequence from a two-urn experiment using a trellis diagram.
Strong Heredity Models in High Dimensional Data, by sahirbhatnagar
The document presents a model called ECLUST for identifying predictor variables associated with a phenotype that depend on an environmental factor using high-dimensional data. ECLUST uses a 3 phase approach: 1) calculating gene similarity matrices separately for different environments, 2) clustering genes to reduce dimensionality, 3) performing penalized regression on cluster representations to identify important predictors and environment-specific interactions. Simulation results show ECLUST can accurately select important variables and outperforms other methods in variable selection and predictive performance. The method is implemented in an open source R package.
(I Can't Get No) Saturation: A Simulation and Guidelines for Minimum Sample S..., by Gemma Derrick
This document presents the results of a simulation study and guidelines for determining minimum sample sizes in qualitative research. The study explores the number of sampling steps required to reach theoretical saturation across different scenarios by simulating populations and varying the number of codes and probability of observing codes. The main findings are that the probability of observing a code is more important than the number of codes, and that purposive sampling typically requires less than 50 sampling steps with around 20 steps being common. Guidelines are provided for researchers to identify the applicable scenario and choose an appropriate sampling strategy based on estimating key factors. The guidelines aim to provide a theoretical basis for sample sizes while accounting for the assumptions and iterative nature of qualitative research.
Slides for the 2016/2017 edition of the Data Mining and Text Mining Course at the Politecnico di Milano. The course is also part of the joint program with the University of Illinois at Chicago.
This document discusses a process calculus for modeling spatially-explicit ecological models. It begins with an introduction to spatially-explicit ecological models and motivations. Related work discusses existing approaches for modeling chemical reactions and ecological systems, including stochastic simulation, P systems, and process calculi. The document then describes the existing PALPS process calculus for population systems and its operational semantics and modeling capabilities. Finally, it outlines several research questions around extending the PALPS calculus with continuous time and dynamic parameters, and comparing it to other modeling approaches through translation and identifying other advantages beyond model checking.
- Completely randomized designs are used to study the effect of one primary factor without accounting for other nuisance variables. Randomized block designs account for nuisance variables by blocking on them.
- Key aspects of completely randomized designs include random assignment of treatments, defining the design with three numbers (number of factors, levels, and replications), and estimating treatment effects using ANOVA.
- Randomized block designs control for nuisance factors by assigning treatments randomly within blocks that are homogeneous for the nuisance factor. This reduces experimental error compared to complete randomization.
Dr. Roitman discusses the use of Artificial Intelligence to solve complex and insoluble problems. Artificial intelligence approach is in the root of I Know First predictive algorithm.
NETW601_Lecture01_2016, transmission and switching course, by Abduljawad Taher
This document appears to be the first lecture of a course on transmission and switching. It provides information about the course instructor, teaching assistants, textbook, and overall course structure. The course aims to build understanding of transmission and switching concepts, relate them to real networks, and enhance modeling skills. Key topics covered include digital transmission fundamentals, multiplexing, telephone networks, and packet switching networks. Student assessment includes quizzes, assignments, a midterm, and a final exam.
This document provides a summary of three topics:
1) An introduction to Locally Linear Embedding (LLE), an unsupervised nonlinear dimensionality reduction technique. It describes the objective, idea, and algorithm of LLE.
2) Explaining variational approximations, describing the idea, algorithm, and examples of both density transform and tangent transform approaches. Variational approximations provide fast deterministic alternatives to Monte Carlo methods.
3) A question and answer section.
communication engineering II
for electronics
Frequency analysis of discrete-time signals is usually and most conveniently performed on a digital signal processor, which may be a general-purpose digital computer or specially designed digital hardware. To perform frequency analysis on a discrete-time signal x[n], we convert the time-domain sequence to an equivalent frequency-domain representation.
The document describes an ontology called Exposé that was created for machine learning experimentation. The ontology aims to formally represent key aspects of machine learning experiments such as algorithm specifications, implementations, applications, experimental contexts, evaluation functions, and structured data. Exposé builds on and extends existing ontologies for data mining and machine learning experimentation by incorporating classes and relationships to represent additional important concepts.
QoMEX2014 - Analysing the Quality of Experience of Multisensory Media from Me..., by Jacob Donley
This presentation was given at QoMEX 2014, the 6th International Workshop on Quality of Multimedia Experience.
Abstract:
This paper investigates the Quality of Experience (QoE) of multisensory media by analysing biosignals collected by electroencephalography (EEG) and eye gaze sensors and comparing with subjective ratings. Also investigated is the impact on QoE of various levels of synchronicity between the sensory effect and target video scene. Results confirm findings from previous research that show sensory effects added to videos increases the QoE rating. While there was no statistical difference observed for the QoE ratings for different levels of sensory effect synchronicity, an analysis of raw EEG data showed 25% more activity in the temporal lobe during asynchronous effects and 20-25% more activity in the occipital lobe during synchronous effects. The eye gaze data showed more deviation for a video with synchronous effects and the EEG showed correlating occipital lobe activity for this instance. These differences in physiological responses indicate sensory effect synchronicity may affect QoE despite subjective ratings appearing similar.
Presented at the Global Pharma R&D Informatics Congress. To find out more, visit:
www.global-engage.com
Text mining extracts complex information from text (entities, events and epistemic knowledge). It can be used to support pathway construction and the design of experiments by extracting evidence from literature. In this presentation, Sophia Ananiadou, Director of the National Centre for Text Mining, discusses bridging the gap between knowledge and text in cancer biology.
2. Contents
• Probability and Random Variables: brief introduction/motivation
• Mathematical Models as Tools in Analysis and Design
– Deterministic Models
– Probability Models
• Examples
3. A Typical Communication System
• Probabilistic methods in making decisions about the transmitted/received message
4. Digital Communication with Probability of Error
5. Probability/Uncertainty Element in Systems
• Wireless communication networks provide voice and data communications to mobile users in severe interference environments.
• The vast majority of media signals (voice, audio, images, and video) are processed digitally.
• The systems the designers build are unprecedented in scale, and the chaotic environments in which they must operate are untrodden territory.
• So there is “Uncertainty”.
6. Probability Models
• Probability models are one of the tools that enable the designer to make sense out of the chaos and to successfully build systems that are efficient, reliable, and cost-effective.
7. MATHEMATICAL MODELS AS TOOLS IN ANALYSIS AND DESIGN
• Experiments: a costly way of testing a design or solving a problem.
• Model: an approximate representation of a physical situation.
• Useful Model: able to explain all relevant aspects of a given phenomenon.
• Mathematical Models: if the observed phenomenon has measurable properties, then a mathematical model consisting of a set of assumptions about the system is employed.
• The conditions under which an experiment is performed and a model is assumed are very critical. Change the assumptions and a “good” model can become a great failure.
9. Computer Simulations & Deterministic Models
• Computer Simulation Models: they mimic or simulate the dynamics of a system.
• Deterministic Models: lab and textbook cases; the conditions determine the outcome.
• 1. Circuit Theory
• 2. Ohm’s Law
• 3. Kirchhoff’s Laws
• 4. Transforms: FFT, Laplace transforms
• 5. Convolution: input/output behavior of systems with well-defined coefficients
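To make the contrast concrete, a one-line deterministic example (a standard Ohm's-law calculation, not taken from the slides): fixing V = 10 V and R = 5 Ω in I = V/R always gives I = 2 A; repeating the experiment under the same conditions yields exactly the same outcome, unlike the probabilistic models introduced next.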
10. Probability Models
• Probabilistic (Stochastic, Random) Models: involve phenomena that exhibit unpredictable variation and randomness.
• Ex: an urn with three balls, marked 0, 1, 2.
– Outcome: a number from the set {0, 1, 2}
– Sample Space: the set of all possible outcomes of an experiment, S = {0, 1, 2}
11. Statistical Regularity
• Relative Frequency (see the definition written out below)
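For reference, the relative-frequency definition behind statistical regularity, in standard notation (N_A(n) denotes the number of times event A occurs in n trials of the experiment):
f_A(n) = N_A(n) / n,    and statistical regularity means    lim_{n→∞} f_A(n) = P(A)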
14. Properties of Relative Frequency
• Suppose a random experiment has K possible outcomes: S = {1, 2, …, K}. Then in “n” trials we have the relations sketched in the simulation below.
• Ex: Consider the 3-ball urn experiment with A: even = {0, 2}.
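A minimal simulation sketch of the 3-ball urn experiment (an assumption for illustration: each ball is equally likely, probability 1/3). It checks the two properties referenced above: the relative frequencies of the individual outcomes sum to 1, and f_A(n) = f_0(n) + f_2(n) for the event A = even = {0, 2}.

import random

def urn_relative_frequencies(n_trials, seed=0):
    # Simulate n_trials draws from an urn with balls marked 0, 1, 2.
    rng = random.Random(seed)
    counts = {0: 0, 1: 0, 2: 0}
    for _ in range(n_trials):
        outcome = rng.choice([0, 1, 2])   # each ball equally likely (assumption)
        counts[outcome] += 1
    # Relative frequency of each outcome: f_k(n) = N_k(n) / n
    return {k: c / n_trials for k, c in counts.items()}

freqs = urn_relative_frequencies(100_000)
print(freqs)                       # each close to 1/3
print(sum(freqs.values()))         # relative frequencies sum to 1
print(freqs[0] + freqs[2])         # f_A(n) for A = even = {0, 2}, close to 2/3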
15. • Disjoint (mutually exclusive) events: if A or B can occur, but not both, then the relative frequencies add (written out below).
• The relative frequency of the union of two disjoint events is the sum of their individual relative frequencies.
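Written out in the relative-frequency notation used above (the slide's own formula is not present in the extracted text; this is the standard statement of the property):
If A ∩ B = ∅, then f_{A∪B}(n) = f_A(n) + f_B(n).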
16. Axioms of Probability
• Kolmogorov’s axioms to form a Theory of Probability. Assumptions:
1. A random experiment has been defined and the sample space S has been identified.
2. A class of subsets of S has been specified.
3. Each event A has been assigned a number P(A) such that:
– 1. 0 ≤ P(A) ≤ 1
– 2. P(S) = 1
– 3. If A and B are mutually exclusive events, then P(A or B) = P(A) + P(B)
• Kolmogorov’s axioms are sufficient to build a consistent Theory of Probability.
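A one-line application to the urn example (assuming, purely for illustration, that the three balls are equally likely, so P({0}) = P({1}) = P({2}) = 1/3): the events {0} and {2} are mutually exclusive, so by the third axiom
P(even) = P({0} ∪ {2}) = P({0}) + P({2}) = 1/3 + 1/3 = 2/3.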
17. Example
• Example: Packet Voice Communication System Efficiency
• Due to silences, voice communication is very inefficient on dedicated lines. It is observed that only about 1/3 of the time actual speech goes through. How can this rate be increased using probabilistic approaches?
• Solution: error vs. rate trade-off in digital information (BCS) transmission/storage
19. Packet Voice Communication System Efficiency (3)
• A is the outcome of a random experiment, determining which packets contain active speech
• M < 48 packets are transmitted every 10 ms
• If A ≤ M, then all active packets are transmitted
• If A > M, then A - M active packets are discarded randomly
20. Packet Voice Communication System Efficiency (4)
• E[A] = 48 × 1/3
• E[A] = 16
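A minimal Monte Carlo sketch of this example. Assumptions beyond what the slides state: the 48 speech sources are independent, each packet is active with probability 1/3, and M = 24 is picked as an illustrative transmission limit (the slide only says M < 48).

import random

N_SOURCES = 48    # packets generated per 10 ms period (slide: E[A] = 48 * 1/3)
P_ACTIVE = 1 / 3  # probability that a packet contains active speech
M = 24            # packets transmitted per period; illustrative choice with M < 48

def simulate(periods, seed=0):
    rng = random.Random(seed)
    total_active = 0
    total_discarded = 0
    for _ in range(periods):
        # A = number of active packets this period (independent sources assumed)
        a = sum(rng.random() < P_ACTIVE for _ in range(N_SOURCES))
        total_active += a
        total_discarded += max(a - M, 0)   # active packets discarded when A > M
    print("estimated E[A]    :", total_active / periods)          # close to 16
    print("fraction discarded:", total_discarded / total_active)

simulate(50_000)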
22. Example: Signal Enhancement Using Filters
• Given a signal x(t) corrupted with noise, with some Signal-to-Noise Ratio (SNR) value: if you filter this noisy signal with a properly designed adaptive filter to suppress the noise, you obtain an enhanced signal (smoothed by the filter).
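A toy illustration of the idea, not the adaptive filter on the slide: a fixed moving-average smoother applied to a synthetic noisy sinusoid, with the sample rate, noise level, and window length all chosen arbitrarily.

import numpy as np

rng = np.random.default_rng(0)
fs = 1000                                   # sample rate in Hz (arbitrary choice)
t = np.arange(0, 1, 1 / fs)
clean = np.sin(2 * np.pi * 5 * t)           # 5 Hz "signal of interest"
noisy = clean + 0.5 * rng.standard_normal(t.size)

# Fixed moving-average filter; window length picked by hand, not adaptively
window = np.ones(25) / 25
enhanced = np.convolve(noisy, window, mode="same")

def snr_db(reference, estimate):
    # SNR of an estimate relative to the clean reference, in dB
    noise = estimate - reference
    return 10 * np.log10(np.sum(reference**2) / np.sum(noise**2))

print("SNR before filtering: %.1f dB" % snr_db(clean, noisy))
print("SNR after filtering : %.1f dB" % snr_db(clean, enhanced))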
24. Resource Sharing
• Example: multi-user systems with queues for resource sharing
26. System Reliability: Cascade vs. Parallel Systems
• Issues: the need for a clock vs. the system delay or throughput rate.
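For reference, the standard reliability expressions behind the cascade-vs-parallel comparison (not spelled out in the extracted slide text; they assume n components that fail independently, with component i working with probability p_i):
R_cascade = p_1 p_2 ... p_n          (all components must work)
R_parallel = 1 - (1 - p_1)(1 - p_2)...(1 - p_n)          (at least one component must work)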
28. Reading Assignment
• Text Book, Chapter 1, pages 17-34
29. Summary
• Mathematical Models
– Relate system parameters and variables
– Allow system designers to predict system performance by using equations
– Experimentation may be too costly
– Experimentation may not be feasible
• Computer simulation models predict system performance
• Deterministic models give the output of an experiment with an exact outcome
• Probability models determine the probabilities of the possible outcomes
• Probabilities and averages for a random experiment can be found experimentally by computing relative frequencies and sample averages over a large number of trials of the experiment