OUTDATED Text Mining 5/5: Information Extraction
1. Text Mining 5
Information Extraction
Madrid Summer School 2014
Advanced Statistics and Data Mining
Florian Leitner
florian.leitner@upm.es
for real, now!
2. Florian Leitner <florian.leitner@upm.es> MSS/ASDM: Text Mining
Incentive and Applications
➡ Statistical analyses of text sequences
!
• Text Segmentation
‣ word & sentence boundaries (see Text Mining #3)
• Part-of-Speech (PoS) Tagging & Chunking
‣ noun, verb, adjective, … & noun/verb/preposition/… phrases
• Named Entity Recognition (NER)
‣ organizations, persons, places, genes, chemicals, …
• Information Extraction
‣ locations/times, constants, formulas, entity linking, entity-relationships, …
144
3. Florian Leitner <florian.leitner@upm.es> MSS/ASDM: Text Mining
Probabilistic Graphical
Models (PGM)
• Graphs of hidden (blank) and observed (shaded) variables
(vertices/nodes).
• The edges depict dependencies, and if directed, show the
causal relationship between nodes.
145
[Diagram: a hidden node H, an observed (shaded) node O, and a chain of nodes S1 → S2 → S3.]
directed ➡ Bayesian Network (BN)
undirected ➡ Markov Random Field (MRF)
mixed ➡ Mixture Models
Koller & Friedman. Probabilistic Graphical Models. 2009
4. Florian Leitner <florian.leitner@upm.es> MSS/ASDM: Text Mining
Hidden vs. Observed State
and Statistical Parameters
Rao. A Kalman Filter Model of the Visual Cortex. Neural Computation 1997
146
6. Florian Leitner <florian.leitner@upm.es> MSS/ASDM: Text Mining
Markov Random Field
148
P(X = x⃗) = ∏_{cl ∈ x⃗} φ_cl(cl) ÷ Σ_{x⃗ ∈ X} ∏_{cl ∈ x⃗} φ_cl(cl)
(numerator: the product of the factors, a.k.a. clique potentials; denominator: the normalizing constant, a.k.a. partition function Z)
cl … [maximal] clique; a subset of nodes in the graph where every pair of nodes is connected

Factor graph over the binary variables A, B, C, D (a ring A-B-C-D-A) with its factor table:

θ(A, B)        θ(B, C)        θ(C, D)        θ(D, A)
a0 b0   30     b0 c0  100     c0 d0    1     d0 a0  100
a0 b1    5     b0 c1    1     c0 d1  100     d0 a1    1
a1 b0    1     b1 c0    1     c1 d0  100     d1 a0    1
a1 b1   10     b1 c1  100     c1 d1    1     d1 a1  100

P(a1, b1, c0, d1) = (10 · 1 · 100 · 100) ÷ 7,201,840 = 0.014
History class: Ising developed a linear
field to model (binary) atomic spin states
(Ising, 1924); the 2-dim. model problem
was solved by Onsager in 1944.
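A minimal Python sketch of this factor-graph example (the potentials are the slide's values; the 0/1 assignments follow the reconstruction above), verifying the partition function and the 0.014:

```python
from itertools import product

# Clique potentials theta(A,B), theta(B,C), theta(C,D), theta(D,A) from the slide.
theta_ab = {(0, 0): 30, (0, 1): 5, (1, 0): 1, (1, 1): 10}
theta_bc = {(0, 0): 100, (0, 1): 1, (1, 0): 1, (1, 1): 100}
theta_cd = {(0, 0): 1, (0, 1): 100, (1, 0): 100, (1, 1): 1}
theta_da = {(0, 0): 100, (0, 1): 1, (1, 0): 1, (1, 1): 100}

def unnormalized(a, b, c, d):
    """Product of all clique potentials for one joint assignment."""
    return theta_ab[a, b] * theta_bc[b, c] * theta_cd[c, d] * theta_da[d, a]

# Partition function Z: sum over all 2^4 assignments.
Z = sum(unnormalized(a, b, c, d) for a, b, c, d in product((0, 1), repeat=4))
print(Z)                             # 7201840
print(unnormalized(1, 1, 0, 1) / Z)  # ~0.0139, the 0.014 from the slide
```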
7. Florian Leitner <florian.leitner@upm.es> MSS/ASDM: Text Mining
A First Look At Probabilistic
Graphical Models
• Latent Dirichlet Allocation: LDA
‣ Blei, Ng, and Jordan. Journal of Machine Learning Research 2003
‣ For assigning “topics” to “documents”
149
i.e., Text Classification; last lesson (4)
8. Florian Leitner <florian.leitner@upm.es> MSS/ASDM: Text Mining
Latent Dirichlet Allocation
(LDA 1/3)
• Intuition for LDA
• From: Edwin Chen. Introduction to LDA. 2011
‣ “Document Collection”
• I like to eat broccoli and bananas.
• I ate a banana and spinach smoothie for breakfast.
• Chinchillas and kittens are cute.
• My sister adopted a kitten yesterday.
• Look at this cute hamster munching on a piece of broccoli.
!
!
150
➡ Topic A
➡ Topic B
➡ Topic 0.6A + 0.4B
Topic A: 30% broccoli, 15% bananas, 10% breakfast, 10% munching, …
!
Topic B: 20% chinchillas, 20% kittens, 20% cute, 15% hamster, …
9. Florian Leitner <florian.leitner@upm.es> MSS/ASDM: Text Mining
The Dirichlet Process
• A Dirichlet is a [possibly continuous] distribution over [discrete/multinomial] distributions (probability masses).
!
!
!
• The Dirichlet Process samples multiple independent,
discrete distributions θi with repetition from θ (“statistical
clustering”).
151
D(θ; α) = Γ(Σᵢ αᵢ) ÷ ∏ᵢ Γ(αᵢ) · ∏ᵢ θᵢ^(αᵢ - 1),   where Σᵢ θᵢ = 1 (θ is a PMF)

[Plate diagram: α → Xₙ, plate of size N]
N
a Dirichlet Process is like drawing from an
(infinite) ”bag of dice“ (with finite faces)
Gamma function -> a
“continuous” factorial [!]
1. Draw a new distribution X from D(θ, α)
2. With probability α ÷ (α + n - 1) draw a new X
With probability n ÷ (α + n - 1), (re-)sample an Xi from X
Dirichlet prior: ∀i: αᵢ > 0
10. Florian Leitner <florian.leitner@upm.es> MSS/ASDM: Text Mining
The Dirichlet Prior α
152
Frigyik et al. Introduction to the Dirichlet
Distribution and Related Processes. 2010
⤳ equal, =1 ➡ uniform distribution
⤳ equal, <1 ➡ marginal distrib. (”choose few“)
⤳ equal, >1 ➡ symmetric, mono-modal distrib.
⤳ not equal, >1 ➡ non-symmetric distribution
Documents and Topic distributions (N=3)
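A small sketch of how the prior α shapes the sampled topic distributions for N=3, using numpy (an assumption; the slides do not prescribe a library):

```python
import numpy as np

rng = np.random.default_rng(0)
for alpha in ([1.0, 1.0, 1.0],     # equal, =1  -> uniform over the simplex
              [0.1, 0.1, 0.1],     # equal, <1  -> sparse ("choose few")
              [10.0, 10.0, 10.0],  # equal, >1  -> symmetric, mono-modal
              [5.0, 1.0, 0.5]):    # not equal  -> non-symmetric
    samples = rng.dirichlet(alpha, size=3)  # three topic distributions (PMFs over 3 topics)
    print(alpha, samples.round(2))
```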
11. Florian Leitner <florian.leitner@upm.es> MSS/ASDM: Text Mining
Latent Dirichlet Allocation
(LDA 2/3)
• α - per-document Dirichlet prior
• θ_d - topic distribution of document d
• z_d,n - word-topic assignments
• w_d,n - observed words
• β_k - word distribution of topic k
• η - per-topic Dirichlet prior
153

[Plate diagram: α → θ_d → z_d,n → w_d,n ← β_k ← η; N words per document, D documents, K topics]

Joint probability:
P(β, Θ, Z, W) = ∏_{k=1..K} P(β_k | η) · ∏_{d=1..D} P(θ_d | α) · ∏_{n=1..N} P(z_d,n | θ_d) P(w_d,n | β_1:K, z_d,n)
(the factors: P(Topics) · P(Document-T.) · P(Word-T. | Document-T.) · P(Word | Topics, Word-T.))

Posterior: P(Topic | Word), i.e. P(β, Θ, Z | W) = P(β, Θ, Z, W) ÷ P(W)

Which Topics is a Word assigned to?
term-score_k,n = β̂_k,n · log( β̂_k,n ÷ (∏_{j=1..K} β̂_j,n)^(1/K) )
(dampens the topic-specific score of terms assigned to many topics)
12. Florian Leitner <florian.leitner@upm.es> MSS/ASDM: Text Mining
Latent Dirichlet Allocation
(LDA 3/3)
• LDA inference in a nutshell
‣ Calculate the probability that Topic t generated Word w.
‣ Initialization: Choose K and randomly assign one out of the K Topics to each of
the N Words in each of the D Documents.
• The same word can have different Topics at different positions in the Document.
‣ Then, for each Topic:
And for each Word in each Document:
1. Compute P(Word-Topic | Document): the proportion of [Words assigned to] Topic t in Document d
2. Compute P(Word | Topics, Word-Topic): the probability a Word w is assigned a Topic t (using the general
distribution of Topics and the Document-specific distribution of [Word-] Topics)
• Note that a Word can be assigned a different Topic each time it appears in a Document.
3. Given the prior probabilities of a Document’s Topics and that of Topics in general, reassign
P(Topic | Word) = P(Word-Topic | Document) * P(Word | Topics, Word-Topic)
‣ Repeat until P(Topic | Word) stabilizes (e.g., MCMC Gibbs sampling, Course 04)
154
MCMC in Python: PyMC or PySTAN
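As an illustration of the whole pipeline, a toy LDA fit on the five "documents" from slide 8; gensim is an assumption here, the slides themselves only name PyMC/PySTAN for the MCMC part:

```python
from gensim import corpora, models

docs = [
    "i like to eat broccoli and bananas".split(),
    "i ate a banana and spinach smoothie for breakfast".split(),
    "chinchillas and kittens are cute".split(),
    "my sister adopted a kitten yesterday".split(),
    "look at this cute hamster munching on a piece of broccoli".split(),
]
dictionary = corpora.Dictionary(docs)
bows = [dictionary.doc2bow(doc) for doc in docs]

# K=2 topics; passes bumped up because the corpus is tiny.
lda = models.LdaModel(bows, id2word=dictionary, num_topics=2, passes=200, random_state=1)
print(lda.print_topics())                # word distributions (beta_k) per topic
print(lda.get_document_topics(bows[0]))  # theta_d: topic distribution of document 0
```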
13. Florian Leitner <florian.leitner@upm.es> MSS/ASDM: Text Mining
Retrospective
We have seen how to…
• Design generative models of natural language.
• Segment, tokenize, and compare text and/or sentences.
• [Multi-]label whole documents or chunks of text.
!
Some open questions…
• How to assign labels to individual tokens in a stream?
‣ without using a dictionary/gazetteer
• How to detect semantic relationships between tokens?
‣ dependency & phrase structure parsing not in this class !
155
14. Florian Leitner <florian.leitner@upm.es> MSS/ASDM: Text Mining
Probabilistic Models for
Sequential Data
• a.k.a. Temporal or Dynamic Bayesian Networks
‣ static process w/ constant model ➡ temporal process w/ dynamic model
‣ model structure and parameters are [still] constant
‣ the topology within a [constant] “time slice” is depicted
• Markov Chain (MC; Markov. 1906)
• Hidden Markov Model (HMM; Baum et al. 1970)
• MaxEnt Markov Model (MEMM; McCallum et al. 2000)
• [Markov] Conditional Random Field (CRF; Lafferty et al. 2001)
‣ Naming: all four models make the Markov assumption (Text Mining 2)
156
generative (MC, HMM) → discriminative (MEMM, CRF)
15. Florian Leitner <florian.leitner@upm.es> MSS/ASDM: Text Mining
From A Markov Chain to a
Hidden Markov Model
157
[Diagrams: a Markov chain time-slice (W' → W, i.e. P(W | W')) next to an HMM time-slice (hidden states: S' → S with P(S | S'); observed state: W with P(W | S)); the HMM "unrolled" for 3 words: S0 → S1 → S2 → S3, emitting W1, W2, W3. W depends on S, and S in turn only depends on S'.]
16. Florian Leitner <florian.leitner@upm.es> MSS/ASDM: Text Mining
A Language-Based Intuition
for Hidden Markov Models
• A Markov Chain: P(W) = ∏ P(w | w’)
‣ assumes the observed words are in and of themselves the cause of the
observed sequence.
• A Hidden Markov Model P(S, W) = ∏ P(s | s’) P(w | s)
‣ assumes the observed words are emitted by a hidden (not observable)
sequence, for example the chain of part-of-speech-states.
158
S DT NN VB NN .
The dog ran home !
again, this is the “unrolled” model that does not depict the conditional dependencies
17. Florian Leitner <florian.leitner@upm.es> MSS/ASDM: Text Mining
The two Matrices of a
Hidden Markov Model
159
Transition Matrix P(s | s'):
          DT     NN     VB    …
   DT    0.03   0.7    0
   NN    0      0      0.5
   VB    0      0.5    0.2
   …

Observation Matrix P(w | s):   (very sparse, as W is large ➡ Smoothing!)
          word₁   word₂   word₃  …
   DT    0.3     0       0
   NN    0.0001  0.002   0
   VB    0       0       0.001
   …

[Diagram: S' → S → W time-slice]

a.k.a. "CPDs": Conditional Probability Distributions (as measured from annotated PoS corpora)
underflow danger ➡ use "log Ps"!
18. Florian Leitner <florian.leitner@upm.es> MSS/ASDM: Text Mining
Three Problems Solved by
HMMs
Evaluation: Given a HMM, infer the P of the observed
sequence (because a HMM is a generative model).
Solution: Forward Algorithm
Decoding: Given a HMM and an observed sequence, predict
the hidden states that lead to this observation.
Solution: Viterbi Algorithm
Training: Given only the graphical model and an observation
sequence, learn the best [smoothed] parameters.
Solution: Baum-Welch Algorithm
160
all three algorithms are implemented using dynamic programming
in Bioinformatics:
Likelihood of a particular DNA
element
in Statistical NLP:
PoS annotation
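A compact, log-space Viterbi sketch for the decoding problem; the transition/emission numbers are invented for illustration, in the spirit of slide 17:

```python
import math

# Toy HMM over PoS states; probabilities are illustrative only.
states = ["DT", "NN", "VB"]
start = {"DT": 0.6, "NN": 0.3, "VB": 0.1}
trans = {"DT": {"DT": 0.05, "NN": 0.85, "VB": 0.10},
         "NN": {"DT": 0.05, "NN": 0.35, "VB": 0.60},
         "VB": {"DT": 0.40, "NN": 0.50, "VB": 0.10}}
emit = {"DT": {"the": 0.7}, "NN": {"dog": 0.4, "home": 0.3}, "VB": {"ran": 0.5}}

def logp(p):                       # work in log space to avoid underflow
    return math.log(p) if p > 0 else float("-inf")

def viterbi(words):
    """Return the most probable hidden state sequence for the observed words."""
    V = [{s: (logp(start[s]) + logp(emit[s].get(words[0], 1e-6)), [s]) for s in states}]
    for w in words[1:]:
        V.append({s: max(((V[-1][p][0] + logp(trans[p][s]) + logp(emit[s].get(w, 1e-6)),
                           V[-1][p][1] + [s]) for p in states), key=lambda t: t[0])
                  for s in states})
    return max(V[-1].values(), key=lambda t: t[0])[1]

print(viterbi(["the", "dog", "ran", "home"]))  # ['DT', 'NN', 'VB', 'NN'] with these numbers
```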
19. Florian Leitner <florian.leitner@upm.es> MSS/ASDM: Text Mining
Three Limitations of HMMs
Markov assumption: The next state only depends on the
current state.
Example issue: trigrams
Output assumption: The output (observed value) is
independent of all previous outputs (given the current state).
Example issue: word morphology
Stationary assumption: Transition probabilities are
independent of the actual time when they take place.
Example issue: position in sentence
161
(label bias* problem!)
(long-range dependencies!)
(inflection, declension!)
MEMM / CRF / "unsolved"
(CRF)
20. Florian Leitner <florian.leitner@upm.es> MSS/ASDM: Text Mining
From Generative to
Discriminative Markov Models
162
[Diagrams: HMM as a directed time-slice S' → S, S → W; MEMM as a directed time-slice in which S' and W feed S; linear-chain CRF as an undirected clique over S', S, W.]
Hidden Markov Model (first-order version): generative model of P(S', S, W); lower bias is beneficial for small training sets.
Maximum Entropy Markov Model (first-order version): discriminative model of P(S | S', W).
Conditional Random Field (linear-chain version): this "clique" makes CRFs expensive to compute.
NB boldface W: all words!
21. Florian Leitner <florian.leitner@upm.es> MSS/ASDM: Text Mining
Maximum Entropy
(MaxEnt 3/3)
‣ In summary, MaxEnt is about selecting the "maximal" model p* (select some model that maximizes the conditional entropy…):

   p* = argmax_{p ∈ P}  - Σ_{x ∈ X} p(x) Σ_{y ∈ Y} p(y|x) log₂ p(y|x)

‣ That obeys the following conditional equality constraint (…using a conditional model that matches the (observed) joint probabilities):

   Σ_{x ∈ X} P(x) Σ_{y ∈ Y} P(y|x) f(x, y) = Σ_{x ∈ X, y ∈ Y} P(x, y) f(x, y)

‣ Using, e.g., Lagrange multipliers, one can establish the optimal λ weights of the model that maximize the entropy of this probability ("Exponential Model"):

   p*(y|x) = exp(Σᵢ λᵢ fᵢ(x, y)) ÷ Σ_{y' ∈ Y} exp(Σᵢ λᵢ fᵢ(x, y'))

163
REMINDER
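MaxEnt with feature functions fᵢ(x, y) is the familiar multinomial logistic regression ("exponential model"); a tiny sketch with scikit-learn (an assumption, not part of the slides):

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

# Toy PoS-style classification: each x is a feature dict f(x, .), y is the label.
X_dicts = [{"word=the": 1}, {"word=dog": 1, "suffix=og": 1},
           {"word=ran": 1, "suffix=an": 1}, {"word=home": 1, "suffix=me": 1}]
y = ["DT", "NN", "VB", "NN"]

vec = DictVectorizer()
X = vec.fit_transform(X_dicts)

# Multinomial logistic regression == MaxEnt / exponential model.
clf = LogisticRegression(max_iter=1000)
clf.fit(X, y)
print(clf.predict_proba(vec.transform([{"word=dog": 1, "suffix=og": 1}])))  # p*(y|x)
```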
22. Florian Leitner <florian.leitner@upm.es> MSS/ASDM: Text Mining
Maximum Entropy Markov
Models (MEMM)
Simplify training of P(s | s’, w)
by splitting the model into |S| separate
transition functions Ps’(s | w) for each s’
164
[Diagram: MEMM time-slice, S' and W feed S; the model is P(S | S', W).]

P(s | s', w) = P_s'(s | w) = exp(Σᵢ λᵢ fᵢ(w, s)) ÷ Σ_{s* ∈ S} exp(Σᵢ λᵢ fᵢ(w, s*))
MaxEnt used {x, y} which in the sequence model now are {w, s}
23. Florian Leitner <florian.leitner@upm.es> MSS/ASDM: Text Mining
The Label Bias Problem of
Directional Markov Models
165
The robot wheels are round.
The robot wheels Fred around.
The robot wheels were broken.
DT NN VB NN RB
DT NN NN VB JJ
DT NN ?? — —
yet unseen!
Wallach. Efficient Training of CRFs. MSc 2002
because of their
directionality constraints,
MEMMs & HMMs suffer
from the label bias problem
S’ S
W
MEMM
HMM
SS’
W
24. Florian Leitner <florian.leitner@upm.es> MSS/ASDM: Text Mining
Markov Random Field
166
P(X = x⃗) = ∏_{cl ∈ x⃗} φ_cl(cl) ÷ Σ_{x⃗ ∈ X} ∏_{cl ∈ x⃗} φ_cl(cl)
(numerator: the product of the factors, a.k.a. clique potentials; denominator: the normalizing constant, a.k.a. partition function Z)
cl … [maximal] clique; a subset of nodes in the graph where every pair of nodes is connected

Factor graph over the binary variables A, B, C, D (a ring A-B-C-D-A) with its factor table:

θ(A, B)        θ(B, C)        θ(C, D)        θ(D, A)
a0 b0   30     b0 c0  100     c0 d0    1     d0 a0  100
a0 b1    5     b0 c1    1     c0 d1  100     d0 a1    1
a1 b0    1     b1 c0    1     c1 d0  100     d1 a0    1
a1 b1   10     b1 c1  100     c1 d1    1     d1 a1  100

P(a1, b1, c0, d1) = (10 · 1 · 100 · 100) ÷ 7,201,840 = 0.014
REMINDER
25. Florian Leitner <florian.leitner@upm.es> MSS/ASDM: Text Mining
Conditional Random Field
167
[Diagram: linear-chain CRF, an undirected chain S'-S conditioned on the entire observed sequence W = w1, …, wn.]

Models a per-state exponential function of joint probability over the entire observed sequence W.

MRF (reminder), with the [max.] cliques cl:
P(X = x⃗) = ∏_{cl ∈ x⃗} φ_cl(cl) ÷ Σ_{x⃗ ∈ X} ∏_{cl ∈ x⃗} φ_cl(cl)

CRF: the MRF Y-Y' conditioned on the observation; label bias "solved" by conditioning on the (possibly) entire observed sequence (feature-function dependent):
P(Y = y⃗ | X = x⃗) = ∏_{y ∈ y⃗} φ_cl(y', y, x⃗) ÷ Σ_{y⃗ ∈ Y} ∏_{y ∈ y⃗} φ_cl(y', y, x⃗)

Per-state exponential form (note W upper-case/bold, not w lower-case; s' went missing from the notation):
P(s | W) = exp(Σᵢ λᵢ fᵢ(W, s', s)) ÷ Σ_{s* ∈ S} exp(Σᵢ λᵢ fᵢ(W, s'*, s*))
≈ " P(s, s', W) ÷ P(W) "
26. Florian Leitner <florian.leitner@upm.es> MSS/ASDM: Text Mining
Parameter Estimation and
L2-Regularization of CRFs
• For training on {Y⁽ⁿ⁾, X⁽ⁿ⁾}, n = 1…N sequence pairs, with K features
• Parameter estimation using the conditional log-likelihood:

   λ* = argmax_{λ ∈ Λ} Σ_{n=1..N} log P(Y⁽ⁿ⁾ | X⁽ⁿ⁾; λ)

• Substitute log P(Y⁽ⁿ⁾ | X⁽ⁿ⁾) with the log exponential model:

   ℓ(λ) = Σ_{n=1..N} Σ_{y ∈ Y⁽ⁿ⁾} Σ_{i=1..K} λᵢ fᵢ(y', y, X⁽ⁿ⁾) - Σ_{n=1..N} log Z(X⁽ⁿ⁾)

• Add a penalty for parameters with a too-high L2-norm (σ² is a free parameter):

   ℓ(λ) = Σ_{n=1..N} Σ_{y ∈ Y⁽ⁿ⁾} Σ_{i=1..K} λᵢ fᵢ(y', y, X⁽ⁿ⁾) - Σ_{n=1..N} log Z(X⁽ⁿ⁾) - Σ_{i=1..K} λᵢ² ÷ (2σ²)

168
(regularization reduces the effects of overfitting)
27. Florian Leitner <florian.leitner@upm.es> MSS/ASDM: Text Mining
Model Summary:
HMM, MEMM, CRF
• A HMM
‣ generative model
‣ efficient to learn and deploy
‣ trains with little data
‣ generalizes well (low bias)
• A MEMM
‣ better labeling performance
‣ modeling of features
‣ label bias problem
!
• CRF
‣ conditioned on entire observation
‣ complex features over full input
‣ training time scales exponentially
• O(N·T·M²·G)
• N: # of sequence pairs;
T: E[sequence length];
M: # of (hidden) states;
G: # of gradient computations for parameter
estimation
169
a PoS model w/45 states and 1 M
words can take a week to train…
Sutton & McCallum. An Introduction to CRFs for Relational Learning. 2006
29. Florian Leitner <florian.leitner@upm.es> MSS/ASDM: Text Mining
The Parts of Speech
• The Parts of Speech:
‣ noun: NN, verb: VB, adjective: JJ, adverb: RB, preposition: IN,
personal pronoun: PRP, …
‣ e.g. the full Penn Treebank PoS tagset contains 48 tags:
‣ 34 grammatical tags (i.e., “real” parts-of-speech) for words
‣ one for cardinal numbers (“CD”; i.e., a series of digits)
‣ 13 for [mathematical] “SYM” and currency “$” symbols, various types of
punctuation, as well as for opening/closing parenthesis and quotes
171
I ate the pizza with green peppers .
PRP VB DT NN IN JJ NN .
“PoS tags”
REMINDER
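For instance, tagging the example sentence with NLTK (an assumption; any Penn Treebank tagger would do) yields the finer-grained VBD/NNS variants of the tags above:

```python
import nltk

# one-time downloads: nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")
tokens = nltk.word_tokenize("I ate the pizza with green peppers.")
print(nltk.pos_tag(tokens))
# approximately: [('I', 'PRP'), ('ate', 'VBD'), ('the', 'DT'), ('pizza', 'NN'),
#                 ('with', 'IN'), ('green', 'JJ'), ('peppers', 'NNS'), ('.', '.')]
```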
30. Florian Leitner <florian.leitner@upm.es> MSS/ASDM: Text Mining
The Parts of Speech
• Corpora for the [supervised] training of PoS taggers
‣ Brown Corpus (AE from ~1961)
‣ British National Corpus: BNC (20th century British)
‣ American National Corpus: ANC (AE from the 90s)
‣ Lancaster Corpus of Mandarin Chinese: LCMC (Books in Mandarin)
‣ The GENIA corpus (Biomedical abstracts from PubMed)
‣ NEGRA (German Newswire from the 90s)
‣ Spanish and Arabic corpora should be (commercially…) available… ???
172
I ate the pizza with green peppers .
PRP VB DT NN IN JJ NN .
31. Florian Leitner <florian.leitner@upm.es> MSS/ASDM: Text Mining
Noun/Verb Phrase Chunking
and BIO-Labels
173
The brown fox quickly jumps over the lazy dog .
a pangram (hint: check the letters)
DT JJ NN RB VBZ IN DT JJ NN .
B-N I-N I-N B-V I-V O B-N I-N I-N O
Performance (2nd-order CRF) ~ 94%
Main Problem: embedded & chained NPs (N of N and N)

Chunking is "more robust to the highly diverse corpus of text on the Web" and [exponentially] faster than parsing.
Banko et al. Open Information Extraction from the Web. IJCAI 2007 (a paper with over 700 citations)
Wermter et al. Recognizing noun phrases in biomedical text. SMBM 2005 (error sources)
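A small sketch of decoding such BIO labels back into chunk spans; the helper name is made up, and the token/label lists mirror the pangram example above:

```python
def bio_to_chunks(tokens, labels):
    """Collect (chunk_type, phrase) spans from BIO labels such as B-N/I-N/O."""
    chunks, current, ctype = [], [], None
    for tok, lab in zip(tokens, labels):
        if lab.startswith("B-"):
            if current:
                chunks.append((ctype, " ".join(current)))
            current, ctype = [tok], lab[2:]
        elif lab.startswith("I-") and current:
            current.append(tok)
        else:                       # "O" (or a stray I- without a preceding B-)
            if current:
                chunks.append((ctype, " ".join(current)))
            current, ctype = [], None
    if current:
        chunks.append((ctype, " ".join(current)))
    return chunks

tokens = "The brown fox quickly jumps over the lazy dog .".split()
labels = ["B-N", "I-N", "I-N", "B-V", "I-V", "O", "B-N", "I-N", "I-N", "O"]
print(bio_to_chunks(tokens, labels))
# [('N', 'The brown fox'), ('V', 'quickly jumps'), ('N', 'the lazy dog')]
```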
32. Florian Leitner <florian.leitner@upm.es> MSS/ASDM: Text Mining
Word Sense Disambiguation
(WSD)
• Basic Example: hard [JJ]
‣ physically hard (a hard stone)
‣ difficult [task] (a hard task)
‣ strong/severe (a hard wind)
‣ dispassionate [personality] (a hard
bargainer)
• Entity Disambig.: bank [NN]
‣ finance (“bank account”)
‣ terrain (“river bank”)
‣ aeronautics (“went into bank”)
‣ grouping (“a bank of …”)
!
• SensEval
‣ http://www.senseval.org/
‣ SensEval/SemEval Challenges
‣ provides corpora where every word is
tagged with its sense
• WordNet
‣ http://wordnet.princeton.edu/
‣ a labeled graph of word senses
• Applications
‣ Named Entity Recognition
‣ Machine Translation
‣ Language Understanding
174
Note the PoS-tag “dependency”:
otherwise, the two examples
would have even more senses!
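A quick illustration with NLTK's WordNet interface and its simplified Lesk baseline (both assumptions; the slides only point to SensEval and WordNet as resources):

```python
from nltk.corpus import wordnet as wn    # one-time: nltk.download("wordnet")
from nltk.wsd import lesk

# The senses of "bank" as a noun, cf. the entity disambiguation example.
for synset in wn.synsets("bank", pos=wn.NOUN)[:4]:
    print(synset.name(), "-", synset.definition())

# Simplified Lesk: pick the sense whose gloss overlaps most with the context.
context = "I opened a bank account to deposit my salary".split()
print(lesk(context, "bank", "n"))        # e.g. a finance-related Synset (results vary)
```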
33. Florian Leitner <florian.leitner@upm.es> MSS/ASDM: Text Mining
Word Representations:
Unsupervised WSD
• Idea 1: Words with similar meaning have similar environments.
• Use a word vector to count a word’s surrounding words.
• Similar words now will have similar word vectors.
‣ see Text Mining 4, Cosine Similarity
‣ visualization: Principal Component Analysis
!
• Idea 2: Words with similar meaning have similar environments.
• Use the surrounding of unseen words to “smoothen” language models
(i.e., the correlation between word wi and its context cj).
‣ see Text Mining 4: TF-IDF weighting, Cosine similarity and point-wise MI
‣ Levy & Goldberg. Linguistic Regularities in Sparse and Explicit Word Representations.
CoNLL 2014
175
Levy & Goldberg took
two words on each side
to “beat” a neural
network model with a
four word window!
Turian et al. Word representations. ACL 2010
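A bare-bones sketch of Idea 1: count each word's surrounding words in a small window and compare the count vectors with cosine similarity (the corpus and window size are invented):

```python
from collections import Counter, defaultdict
import math

corpus = ["the dog chased the cat", "the cat chased the mouse",
          "a dog ate the food", "a cat ate the fish"]
window = 2

# word -> Counter of context words within +/- `window` positions
vectors = defaultdict(Counter)
for sentence in corpus:
    tokens = sentence.split()
    for i, w in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                vectors[w][tokens[j]] += 1

def cosine(u, v):
    dot = sum(u[k] * v[k] for k in u)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm

print(cosine(vectors["dog"], vectors["cat"]))    # similar environments -> higher similarity
print(cosine(vectors["dog"], vectors["food"]))   # less similar environments
```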
34. Florian Leitner <florian.leitner@upm.es> MSS/ASDM: Text Mining
Named Entity Recognition
(NER)
176
Date
Location
Money
Organization
Percentage
Person
➡ Conditional Random Field
➡ Ensemble Methods; +SVM, HMM, MEMM, … ➡ pyensemble
NB these are corpus-based approaches (supervised)
CoNLL03: http://www.cnts.ua.ac.be/conll2003/ner/
!
How much training
data do I need?
“corpora list”
Image Source: v@s3k [a GATE session; http://vas3k.ru/blog/354/]
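A hedged sketch of a CRF-based NER tagger with word-shape features, using sklearn-crfsuite (the library, feature set, and toy corpus are assumptions, not from the slides); its c2 option plays the role of the L2 penalty from slide 26:

```python
import sklearn_crfsuite

def shape(word):
    """Crude word shape: 'Atlanta' -> 'Xxxxxxx', 'BBDO' -> 'XXXX', '2014' -> 'dddd'."""
    return "".join("X" if c.isupper() else "x" if c.islower() else "d" if c.isdigit() else c
                   for c in word)

def features(sent, i):
    w = sent[i]
    return {"word.lower": w.lower(), "shape": shape(w), "is_title": w.istitle(),
            "prev": sent[i - 1].lower() if i > 0 else "<S>",
            "next": sent[i + 1].lower() if i < len(sent) - 1 else "</S>"}

# Tiny toy corpus with BIO labels; a real model needs an annotated corpus (e.g. CoNLL-03).
train = [(["BBDO", "South", "operates", "in", "Atlanta", "."],
          ["B-ORG", "I-ORG", "O", "O", "B-LOC", "O"]),
         (["Georgia-Pacific", "is", "based", "in", "Atlanta", "."],
          ["B-ORG", "O", "O", "O", "B-LOC", "O"])]

X = [[features(s, i) for i in range(len(s))] for s, _ in train]
y = [labels for _, labels in train]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=100)
crf.fit(X, y)
test = ["Hertz", "left", "Atlanta", "."]
print(crf.predict_single([features(test, i) for i in range(len(test))]))  # toy model, results vary
```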
35. Florian Leitner <florian.leitner@upm.es> MSS/ASDM: Text Mining
The Next Step: Relationship
Extraction
Given this piece of text:
The fourth Wells account moving to another agency is the
packaged paper-products division of Georgia-Pacific Corp., which
arrived at Wells only last fall. Like Hertz and the History Channel,
it is also leaving for an Omnicom-owned agency, the BBDO South
unit of BBDO Worldwide. BBDO South in Atlanta, which handles
corporate advertising for Georgia-Pacific, will assume additional
duties for brands like Angel Soft toilet tissue and Sparkle paper
towels, said Ken Haldin, a spokesman for Georgia-Pacific in
Atlanta.
Which organizations operate in Atlanta? (BBDO S., G-P)
177
36. Florian Leitner <florian.leitner@upm.es> MSS/ASDM: Text Mining
Four Approaches to
Relationship Extraction
• Co-mention window
‣ e.g.: if ORG and LOC entity within
same sentence and no more than x
tokens in between, treat the pair as a
hit.
‣ Low precision, high recall; trivial,
many false positives.
• Dependency Parsing
‣ if the shortest path connecting the two entities contains certain nodes (e.g., an "in/IN" preposition), extract the pair.
‣ Balanced precision and recall; computationally EXPENSIVE.
• Pattern Extraction
‣ e.g.: <ORG>+ <IN> <LOC>+
‣ High precision, low recall; cumbersome, but very common
‣ pattern learning can help
• Machine Learning
‣ features for sentences with entities (token-distance, #tokens between the entities, tokens before/after them, etc.) and some classifier (e.g., SVM, MaxEnt, ANN, …)
‣ very variable mileages, but much more fun than anything else in the speaker's opinion…
178
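A sketch of the first (co-mention window) approach, assuming NER output is already available as (text, type, token offset) triples; the threshold and offsets are illustrative. As noted above, this is high recall but low precision:

```python
def co_mentions(entities, max_distance=10):
    """Pair every ORG with every LOC mentioned at most `max_distance` tokens apart."""
    pairs = []
    orgs = [e for e in entities if e[1] == "ORG"]
    locs = [e for e in entities if e[1] == "LOC"]
    for org_text, _, org_pos in orgs:
        for loc_text, _, loc_pos in locs:
            if abs(org_pos - loc_pos) <= max_distance:
                pairs.append((org_text, loc_text))
    return pairs

# (text, type, token offset), loosely modeled on the Wells/Georgia-Pacific example
entities = [("BBDO South", "ORG", 0), ("Atlanta", "LOC", 3),
            ("Georgia-Pacific", "ORG", 10), ("Ken Haldin", "PER", 28)]
print(co_mentions(entities))   # [('BBDO South', 'Atlanta'), ('Georgia-Pacific', 'Atlanta')]
```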
37. Florian Leitner <florian.leitner@upm.es> MSS/ASDM: Text Mining
Advanced Evaluation Metrics:
AUC ROC and PR
179
Image Source: WikiMedia Commons, kku (“kakau”, eddie)
TPR / Recall = TP ÷ (TP + FN)
FPR = FP ÷ (FP + TN)
Precision = TP ÷ (TP + FP)
• Davis & Goadrich. The Relationship Between PR and ROC
Curves. ICML 2006
• Landgrebe et al. Precision-recall operating characteristic
(P-ROC) curves in imprecise environments. Pattern
Recognition 2006
• Hanczar et al. Small-Sample Precision of ROC-Related
Estimates. Bioinformatics 2010
“an algorithm which optimizes the area under
the ROC curve is not guaranteed to optimize
the area under the PR curve”
Davis & Goadrich, 2006
Area Under the Curve
Receiver-Operator Characteristic
Precision-Recall
i.e., ROC is affected by the same class-imbalance issues as accuracy (Lesson #4)!
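To compute both areas in practice (scikit-learn assumed; the labels and scores are made up):

```python
from sklearn.metrics import roc_auc_score, average_precision_score

# Imbalanced toy data: 3 positives among 12 examples, with classifier scores.
y_true  = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0]
y_score = [0.9, 0.8, 0.7, 0.65, 0.6, 0.5, 0.4, 0.35, 0.3, 0.2, 0.1, 0.05]

print("AUC ROC:", roc_auc_score(y_true, y_score))
print("AUC PR (average precision):", average_precision_score(y_true, y_score))
# On imbalanced data the two areas can rank classifiers differently (Davis & Goadrich 2006).
```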
38. Florian Leitner <florian.leitner@upm.es> MSS/ASDM: Text Mining
Practical: Implement a
Shallow Parser and NER
Implement a Python class and command-line tool that will allow
you to do shallow parsing and NER on “standard” English text.
180
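One possible skeleton for the practical, assuming NLTK (the slides leave the toolkit open); the chunk grammar is a toy placeholder:

```python
import sys
import nltk   # one-time: nltk.download() for punkt, the PoS tagger, and the NE chunker models

class ShallowParser:
    """Shallow (noun-phrase) parsing and NER over 'standard' English text."""
    GRAMMAR = r"NP: {<DT|PRP\$|JJ|NN.*>+}"      # toy noun-phrase chunk grammar

    def __init__(self):
        self.chunker = nltk.RegexpParser(self.GRAMMAR)

    def parse(self, text):
        for sent in nltk.sent_tokenize(text):
            tagged = nltk.pos_tag(nltk.word_tokenize(sent))
            yield self.chunker.parse(tagged), nltk.ne_chunk(tagged)

if __name__ == "__main__":
    text = " ".join(sys.argv[1:]) or "Ken Haldin is a spokesman for Georgia-Pacific in Atlanta."
    for np_chunks, entities in ShallowParser().parse(text):
        print(np_chunks)   # NP-chunked parse tree
        print(entities)    # tree with PERSON/ORGANIZATION/GPE/... nodes
```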
39. Richard Feynman, 1974
“The first principle is that you must not fool yourself
- and you are the easiest person to fool.”