2. What is Text Mining?
• There are many examples of text-based documents (all in
‘electronic’ format…)
– e-mails, corporate Web pages, customer surveys, résumés, medical records,
DNA sequences, technical papers, incident reports, news stories and
more…
• Not enough time or patience to read them all
– Can we extract the most vital kernels of information?
• So, we wish to find a way to gain knowledge (in summarised form)
from all that text, without reading or examining it fully first…!
– Some (e.g. DNA sequences) are hard to comprehend anyway!
3. What is Text Mining?
• Traditional data mining uses ‘structured data’ (an n x p
matrix)
• ‘Free-form text’ is referred to as ‘unstructured data’
– successful categorisation of such data can be a difficult and
time-consuming task…
• Often, we can combine free-form text and structured data to
derive valuable, actionable information… (e.g. as in
typical surveys) – ‘semi-structured’ data
4. Text Mining: Examples
• Text mining is an exercise to gain knowledge from stores
of language text.
• Text:
– Web pages
– Medical records
– Customer surveys
– Email filtering (spam)
– DNA sequences
– Incident reports
– Drug interaction reports
– News stories (e.g. predict stock movement)
5. What is Text Mining
• Data examples
– Web pages
– Customer surveys
Customer  Age  Sex  Tenure    Comments                                    Outcome
123       24   M    12 years  Incorrect charges on bill; customer angry   Y
243       26   F    1 month   Inquiry about charges to India              N
346       54   M    3 years   Question about charges on bill              N
7. Of Mice and Men: Concordance
Concordance is an alphabetized list of the most frequently occurring words in a book,
excluding common words such as "of" and "it." The font size of a word is proportional to the
number of times it occurs in the book.
11. Text Mining
• Typically falls into one of two categories
– Analysis of text: I have a bunch of text I am
interested in, tell me something about it
• E.g. sentiment analysis, “buzz” searches
– Retrieval: There is a large corpus of text documents,
and I want the one closest to a specified query
• E.g. web search, library catalogs, legal and medical
precedent studies
12. Text Mining: Analysis
• Which words occur most often?
• Which words are most surprising?
• Which words help define the document?
• What are the interesting text phrases?
13. Text Mining: Retrieval
• Find k objects in the corpus of documents which
are most similar to my query.
• Can be viewed as “interactive” data mining -
query not specified a priori.
• Main problems of text retrieval:
– What does “similar” mean?
– How do I know if I have the right documents?
– How can I incorporate user feedback?
14. Text Retrieval: Challenges
• Calculating similarity is not obvious - what is the distance between
two sentences or queries?
• Evaluating retrieval is hard: what is the “right” answer? (no
ground truth)
• Users can query things you have not seen before, e.g. misspelled,
foreign, or new terms.
• The goal (score function) is different from classification/regression:
we are not looking to model all of the data, just to get the best
results for a given user.
• Words can hide semantic content
– Synonymy: A keyword T does not appear anywhere in the document, even
though the document is closely related to T, e.g., data mining
– Polysemy: The same keyword may mean different things in different
contexts, e.g., mining
15. Basic Measures for Text Retrieval
• Precision: the percentage of retrieved documents that are in
fact relevant to the query (i.e., “correct” responses)
• Recall: the percentage of documents that are relevant to the
query and were, in fact, retrieved
recall = |{Relevant} ∩ {Retrieved}| / |{Relevant}|
precision = |{Relevant} ∩ {Retrieved}| / |{Retrieved}|
16. Precision vs. Recall
• We’ve been here before!
– Precision = TP/(TP+FP)
– Recall = TP/(TP+FN)
– Trade off:
• If algorithm is ‘picky’: precision high, recall low
• If algorithm is ‘relaxed’: precision low, recall high
– BUT: recall often hard if not impossible to calculate
                         Truth: Relevant   Truth: Not Relevant
Algorithm: Relevant      TP                FP
Algorithm: Not Relevant  FN                TN
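As a minimal sketch (ours, not from the deck), these two measures in R; the counts below are made up for illustration:

precision <- function(tp, fp) tp / (tp + fp)   # fraction of retrieved that are relevant
recall    <- function(tp, fn) tp / (tp + fn)   # fraction of relevant that were retrieved
precision(130, 50)   # 130 relevant retrieved, 50 irrelevant retrieved -> 0.72
recall(130, 20)      # 20 relevant documents were missed -> 0.87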
17. Precision Recall Curves
• If we have a labelled training set, we can calculate recall.
• For any given number of returned documents, we can plot a point
for precision vs. recall. (similar to thresholds in ROC curves)
• Different retrieval algorithms might have very different curves -
hard to tell which is “best”
18. Term / document matrix
• Most common form of representation in text
mining is the term - document matrix
– Term: typically a single word, but could be a word
phrase like “data mining”
– Document: a generic term meaning a collection of
text to be retrieved
– Can be large - terms are often 50k or larger,
documents can be in the billions (www).
– Can be binary, or use counts
19. Term document matrix
• Each document now is just a vector of terms, sometimes
boolean
Example: 10 documents, 6 terms

      Database  SQL  Index  Regression  Likelihood  linear
D1    24        21   9      0           0           3
D2    32        10   5      0           3           0
D3    12        16   5      0           0           0
D4    6         7    2      0           0           0
D5    43        31   20     0           3           0
D6    2         0    0      18          7           6
D7    0         0    1      32          12          0
D8    3         0    0      22          4           4
D9    1         0    0      34          27          25
D10   6         0    0      17          4           23
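As a hedged base-R sketch (ours), one way to build such a matrix from a toy two-document corpus; the names docs, vocab and counts are our own:

docs <- c(D1 = "database sql index database sql",
          D2 = "regression likelihood linear regression")
tokens <- strsplit(tolower(docs), "[^a-z]+")         # crude tokenization
vocab  <- sort(unique(unlist(tokens)))
counts <- t(sapply(tokens, function(t) table(factor(t, levels = vocab))))
counts   # rows = documents, columns = terms, entries = counts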
20. Term document matrix
• We have lost all semantic content
• Be careful constructing your term list!
– Not all words are created equal!
– Words that are the same should be treated the same!
• Stop Words
• Stemming
21. Stop words
• Many of the most frequently used words in English are worthless in
retrieval and text mining – these words are called stop words.
– the, of, and, to, ….
– Typically about 400 to 500 such words
– For an application, an additional domain specific stop words list may be
constructed
• Why do we need to remove stop words?
– To reduce the indexing (or data) file size
• stop words account for 20-30% of total word counts
– To improve efficiency
• stop words are not useful for searching or text mining
• stop words always have a large number of hits
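A tiny illustrative sketch (ours); a fuller list could come from, e.g., the tm package's stopwords():

stop_words <- c("the", "of", "and", "to", "a", "in", "is", "it")
tokens <- c("the", "mining", "of", "text", "is", "useful")
tokens[!tokens %in% stop_words]   # "mining" "text" "useful"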
22. Stemming
• Techniques used to find the root/stem of a word:
– E.g.,
– user, users, used, using → stem: use
– engineering, engineered, engineer → stem: engineer
Usefulness
• improving effectiveness of retrieval and text mining
– matching similar words
• reducing indexing size
– combining words with the same root may reduce indexing size by as
much as 40-50%.
23. Basic stemming methods
• remove ending
– if a word ends with a consonant other than s,
followed by an s, then delete s.
– if a word ends in es, drop the s.
– if a word ends in ing, delete the ing unless the remaining word consists only of
one letter or of th.
– If a word ends with ed, preceded by a consonant, delete the ed unless this
leaves only a single letter.
– …
• transform words
– if a word ends with “ies” but not “eies” or “aies”, then replace “ies” with “y”.
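These rules can be sketched directly in R (a toy version of ours; production stemmers such as the Porter stemmer, available via SnowballC::wordStem, handle many more cases):

crude_stem <- function(w) {
  if (grepl("ies$", w) && !grepl("(eies|aies)$", w))
    return(sub("ies$", "y", w))                 # "ies" -> "y"
  if (grepl("[^aeious]s$", w) || grepl("es$", w))
    return(sub("s$", "", w))                    # drop a trailing s
  if (grepl("ing$", w)) {                       # drop "ing" unless too little remains
    rest <- sub("ing$", "", w)
    if (nchar(rest) > 1 && rest != "th") return(rest)
  }
  if (grepl("[^aeiou]ed$", w)) {                # drop "ed" after a consonant
    rest <- sub("ed$", "", w)
    if (nchar(rest) > 1) return(rest)
  }
  w
}
sapply(c("queries", "users", "engineered"), crude_stem)  # query, user, engineer

Note that under these literal rules "using" stems to "us", not "use"; real stemmers are more careful.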
24. Feature Selection
• Performance of text classification algorithms can be optimized by
selecting only a subset of the discriminative terms
– Even after stemming and stopword removal.
• Greedy search
– Start from full set and delete one at a time
– Find the least important variable
• Can use Gini index for this if a classification problem
• Often performance does not degrade even with orders of
magnitude reductions
– Chakrabarti, Chapter 5: patent data: 9,600 patents in communication,
electricity and electronics.
– Only 140 out of 20,000 terms needed for classification!
25. Distances in TD matrices
• Given a term-document matrix representation, we can now define
distances between documents (or terms!)
• Elements of the matrix can be 0/1 or term frequencies (sometimes
normalized)
• Can use Euclidean or cosine distance
• Cosine distance is the cosine of the angle between the two vectors:
cos(d1, d2) = (d1 · d2) / (||d1|| ||d2||)
• Not intuitive, but has been proven to work well
• If docs are the same, cos = 1; if they have nothing in common, cos = 0
26. • We can calculate cosine and Euclidean distance
for this matrix
• What would you want the distances to look like?
      Database  SQL  Index  Regression  Likelihood  linear
D1    24        21   9      0           0           3
D2    32        10   5      0           3           0
D3    12        16   5      0           0           0
D4    6         7    2      0           0           0
D5    43        31   20     0           3           0
D6    2         0    0      18          7           6
D7    0         0    1      32          12          0
D8    3         0    0      22          4           4
D9    1         0    0      34          27          25
D10   6         0    0      17          4           23
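A sketch (ours) of both distances, with the matrix above typed in directly:

tdm <- rbind(D1 = c(24,21, 9, 0, 0, 3), D2  = c(32,10, 5, 0, 3, 0),
             D3 = c(12,16, 5, 0, 0, 0), D4  = c( 6, 7, 2, 0, 0, 0),
             D5 = c(43,31,20, 0, 3, 0), D6  = c( 2, 0, 0,18, 7, 6),
             D7 = c( 0, 0, 1,32,12, 0), D8  = c( 3, 0, 0,22, 4, 4),
             D9 = c( 1, 0, 0,34,27,25), D10 = c( 6, 0, 0,17, 4,23))
colnames(tdm) <- c("Database","SQL","Index","Regression","Likelihood","linear")
euclid <- as.matrix(dist(tdm))                  # Euclidean distances
norms  <- sqrt(rowSums(tdm^2))
cosine <- (tdm %*% t(tdm)) / (norms %o% norms)  # cosine similarities
# image(cosine) plots the pairwise structure, as on the next slide

We would want D1-D5 (the database documents) to be close to one another and far from D6-D10 (the regression documents), and the block structure of the cosine matrix shows exactly that.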
27. Document distance
• Pairwise distances between documents
• Image plots of cosine distance, Euclidean, and
scaled Euclidean
R function: ‘image’
28. Weighting in TD space
• Not all phrases are of equal importance
– E.g. “David” is less important than “Beckham”
– If a term occurs frequently in many documents it has less discriminatory
power
– One way to correct for this is inverse document frequency (IDF):
IDF_j = log(N / N_j)
– where N_j = # of docs containing the term, N = total # of docs
– Term importance = Term Frequency (TF) x IDF
– A term is “important” if it has a high TF and/or a high IDF.
– TF x IDF is a common measure of term importance
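Continuing the sketch (our code, reusing the tdm typed in earlier), with IDF_j = log(N / N_j):

N  <- nrow(tdm)              # total number of documents
Nj <- colSums(tdm > 0)       # number of documents containing each term
tfidf <- sweep(tdm, 2, log(N / Nj), `*`)   # weight each term column by its IDF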
30. Queries
• A query is a representation of the user’s information needs
– Normally a list of words.
• Once we have a TD matrix, queries can be represented as a vector in
the same space
– “Database Index” = (1,0,1,0,0,0)
• Query can be a simple question in natural language
• Calculate cosine distance between query and the TF x IDF version of
the TD matrix
• Returns a ranked vector of documents
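A sketch (ours) of ranking the documents against that query with cosine similarity on the TF x IDF matrix from above:

q <- c(1, 0, 1, 0, 0, 0)                    # "Database Index"
scores <- (tfidf %*% q) /
          (sqrt(rowSums(tfidf^2)) * sqrt(sum(q^2)))
sort(drop(scores), decreasing = TRUE)       # ranked vector of documents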
31. Latent Semantic Indexing
• Criticism: queries can be posed in many ways, but still
mean the same
– Data mining and knowledge discovery
– Car and automobile
– Beet and beetroot
• Semantically, these are the same, and documents with
either term are relevant.
• Using synonym lists or thesauri is a possible solution, but messy
and difficult.
• Latent Semantic Indexing (LSI): tries to extract hidden
semantic structure in the documents
• Search what I meant, not what I said!
32. LSI
• Approximate the T-dimensional term space using
principal components calculated from the TD matrix
• The first k PC directions provide the best set of k
orthogonal basis vectors - these explain the most
variance in the data.
– The data is reduced to an N x k matrix, without much loss of
information
• Each “direction” is a linear combination of the input
terms, and defines a clustering of “topics” in the data.
• What does this mean for our toy example?
34. LSI
• Typically done using Singular Value Decomposition
(SVD) to find principal components:
X = U S V'
– X: the TD matrix (10 x 6), term weighting by document
– V: new orthogonal basis for the data (PC directions)
– S: diagonal matrix of singular values
For our example: S = (77.4, 69.5, 22.9, 13.5, 12.1, 4.8)
Fraction of the variance explained (PC1 & PC2)
= (77.4² + 69.5²) / Σj sj² ≈ 92.5%
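In base R this is a few lines (our sketch; the slide's exact weighting is not fully specified, so the numbers need not reproduce S above):

s <- svd(tfidf)                             # tfidf = U diag(d) V'
s$d                                         # singular values, cf. S
sum(s$d[1:2]^2) / sum(s$d^2)                # variance explained by PC1 & PC2
docs_2d <- s$u[, 1:2] %*% diag(s$d[1:2])    # documents in 2-D LSI space
plot(docs_2d)                               # cf. the plots on the next slides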
35. LSI
Top 2 PC make new
pseudo-terms to define
documents…
Also, can look at first two Principal components:
(0.74,0.49, 0.27,0.28,0.18,0.19) -> emphasizes first two terms
(-0.28,-0.24,-0.12,0.74,0.37,0.31) -> separates the two clusters
Note how distance from the origin reflects the number of terms,
and the angle (at the origin) shows similarity as well
36. LSI
• Here we show the same
plot, but with two new
documents, one with the
term “SQL” 50 times,
another with the term
“Databases” 50 times.
• Even though they have no
phrases in common, they
are close in LSI space
37. Textual analysis
• Once we have the data into a nice matrix
representation (TD, TDxIDF, or LSI), we can
throw the data mining toolbox at it:
– Classification of documents
• If we have training data for classes
– Clustering of documents
• unsupervised
38. Automatic document classification
• Motivation
– Automatic classification for the tremendous number of on-line text documents (Web
pages, e-mails, etc.)
– Customer comments: Requests for info, complaints, inquiries
• A classification problem
– Training set: Human experts generate a training data set
– Classification: The computer system discovers the classification rules
– Application: The discovered rules can be applied to classify new/unknown documents
• Techniques
– Linear/logistic regression, naïve Bayes
– Trees not so good here due to massive dimension, few interactions
39. Naïve Bayes Classifier for Text
• Naïve Bayes classifier = conditional independence model
– Also called “multivariate Bernoulli”
– Assumes conditional independence assumption given the class:
p(x | ck) = ∏j p(xj | ck)
– Note that we model each term xj as a discrete random variable
In other words, the probability that a bunch of words comes from a given class
equals the product of the individual probabilities of those words.
40. Multinomial Classifier for Text
p(x | ck) ∝ p(Nx | ck) ∏j=1..n p(j | ck)^xj
• Multinomial Classification model
– Assumes that the data are generated by a p-sided die (multinomial model)
– where Nx = number of terms (total count) in document x
– xj = number of times term j occurs in the document
– ck = class = k
– Based on training data, each class has its own multinomial probability across all
words.
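A minimal sketch (ours) of this classifier with add-one (Laplace) smoothing; it drops the p(Nx | ck) factor, which the “∝” above allows, and all names are our own:

nb_train <- function(tdm, y) {              # y = class label per document (row)
  classes <- unique(y)
  logp <- sapply(classes, function(k) {     # per-class log word probabilities
    counts <- colSums(tdm[y == k, , drop = FALSE]) + 1
    log(counts / sum(counts))
  })
  list(logp = logp, logprior = log(table(y)[classes] / length(y)))
}
nb_score <- function(model, x)              # x = term-count vector of a new doc
  model$logprior + drop(t(model$logp) %*% x)

The predicted class for a new document x is the one with the largest score, e.g. which.max(nb_score(model, x)).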
41. Naïve Bayes vs. Multinomial
• Many extensions and adaptations of both
• Text mining classification models usually a version of one of these
• Example: Web pages
– Classify webpages from CS departments into:
• student, faculty, course, project
– Train on ~5,000 hand-labeled web pages from Cornell, Washington,
U.Texas, Wisconsin
– Crawl and classify a new site (CMU):

           Student  Faculty  Person  Project  Course  Department
Extracted  180      66       246     99       28      1
Correct    130      28       194     72       25      1
Accuracy   72%      42%      79%     73%      89%     100%
44. Document Clustering
• Can also do clustering, or unsupervised learning of docs.
• Automatically group related documents based on their
content.
• Require no training sets or predetermined taxonomies.
• Major steps
– Preprocessing
• Remove stop words, stem, feature extraction, lexical analysis, …
– Hierarchical clustering
• Compute similarities applying clustering algorithms, …
– Slicing
• Fan-out controls; flatten the tree to the desired number of levels.
• Like all clustering examples, success is relative
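A sketch (ours) of the hierarchical step, reusing the cosine similarities computed earlier:

hc <- hclust(as.dist(1 - cosine), method = "average")  # cluster on cosine distance
plot(hc)             # dendrogram of the documents
cutree(hc, k = 2)    # "slice": flatten the tree to two clusters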
45. Document Clustering
• To Cluster:
– Can use LSI
– Another model: Latent Dirichlet Allocation (LDA)
– LDA is a generative probabilistic model of a corpus. Documents are
represented as random mixtures over latent topics, where a topic is
characterized by a distribution over words.
• LDA:
– Three concepts: words, topics, and documents
– Documents are a collection of words and have a probability
distribution over topics
– Topics have a probability distribution over words
– Fully Bayesian Model
46. LDA
• Assume the data was generated by a generative process:
• θ is a document - made up of topics drawn from a probability distribution
• z is a topic - made up of words drawn from a probability distribution
• w is a word, the only real observable (N = number of words in all documents)
• Then the LDA equations are specified in a fully Bayesian model:
– α = prior on the per-document topic distributions
47. Which can be solved via advanced
computational techniques -
see Blei et al., 2003
48. LDA output
• The result can be an often-useful classification of documents into topics, and a
distribution of each topic across words:
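As a hedged sketch, LDA can be fit in R with the topicmodels package (assuming it and tm are installed; the conversion step may vary across package versions):

library(tm); library(topicmodels)
dtm <- as.DocumentTermMatrix(tdm, weighting = weightTf)  # LDA needs raw counts
fit <- LDA(dtm, k = 2, control = list(seed = 1))         # k = number of topics
terms(fit, 5)             # top 5 words per topic
posterior(fit)$topics     # per-document distribution over topics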
49. Another Look at LDA
• Model: Topics made up of words used to generate documents
50. Another Look at LDA
• Reality: Documents observed, infer topics
51. Case Study: TV Listings
• Use text to make recommendations for TV shows
52. Data Issues
• 10013|In Harm's Way|In Harm's Way|A tough Naval officer faces the enemy
while fighting in the South Pacific during World War II.|A tough Naval
officer faces the enemy while fighting in the South Pacific during World
War II.|en-US| Movie,NR Rating|Movies:Drama|||165|1965|USA||||||STARS-3||
NR|John Wayne, Kirk Douglas, Patricia Neal, Tom Tryon, Paula Prentiss,
Burgess Meredith|Otto Preminger||||Otto Preminger|
Parsed Program Guide entries – 2 weeks, ~66,000 programs, 19,000 words
• Collapse on series (syndicated shows are still a problem)
• Stopwords/stemming, duplication, paid programming, length
normalization
54. Results
• We fit LDA
– Results in a full distribution of words, topics and documents
– Topics are unveiled which are a collection of words
55. Results
• For user modelling, consider the collection of shows a single user
watches as a ‘document’ – then look to see what topics (and hence,
words) make up that document
59. Text Mining - Other Topics
• Part of Speech Tagging
– Assign grammatical tags to words (verb, noun, etc.)
– Helps in understanding documents; uses Hidden Markov Models
• Named Entity Classification
– Classification task: can we automatically detect proper nouns and tag them?
– “Mr. Jones” is a person; “Madison” is a town.
– Helps with disambiguation: e.g. “spears”
60. Text Mining - Other Topics
• Sentiment Analysis
– Automatically determine tone in text: positive, negative or neutral
– Typically uses collections of good and bad words
– “While the traditional media is slowly starting to take John McCain’s straight talking
image with increasingly large grains of salt, his base isn’t quite ready to give up on their
favorite son. Jonathan Alter’s bizarre defense of McCain after he was caught telling an
outright lie, perfectly captures that reluctance[.]”
– Often fit using Naïve Bayes
• There are sentiment word lists out there:
– See http://neuro.imm.dtu.dk/wiki/Text_sentiment_analysis
61. Text Mining - Other Topics
• Summarizing text: Word Clouds
– Take text as input, find the most
interesting words, and display them
graphically
– Blogs do this
– Wordle.net
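A sketch (ours) with the wordcloud package (assuming it is installed), reusing the term counts from the example matrix:

library(wordcloud)
freqs <- colSums(tdm)                          # overall term frequencies
wordcloud(names(freqs), freqs, min.freq = 1)   # font size ~ frequency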