This document summarizes a research paper that classified CNN news articles using term frequency-inverse document frequency (TF-IDF) metrics. It discusses the TF-IDF family of metrics used to weight word frequencies and describes the preprocessing steps and classification algorithms. The dataset contained 3,000 news articles from 12 CNN categories, split into training and test sets. Keywords were extracted and weighted using various TF-IDF variants, and articles were classified by comparing their weighted words against the training-set categories.
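The paper's exact TF-IDF variant and matching procedure are not spelled out in this summary, so the following is only a minimal R sketch of the general idea: weight the words of training articles with TF-IDF, build one centroid vector per category, and assign a new article to the category with the highest cosine similarity. The toy articles, categories, and test text are invented for illustration.

```r
# Minimal sketch (not the paper's exact method): classify an article by
# cosine similarity between its TF-IDF vector and per-category centroids.
train <- data.frame(
  text = c("stocks markets economy trade", "election senate vote policy",
           "match goal league player",      "markets shares profit bank"),
  category = c("business", "politics", "sports", "business"),
  stringsAsFactors = FALSE)
test_text <- "bank profit shares rise"

tokenize <- function(x) strsplit(tolower(x), "\\s+")[[1]]
vocab <- unique(unlist(lapply(c(train$text, test_text), tokenize)))

tf <- function(x) {                          # raw term frequency over the vocabulary
  as.numeric(table(factor(tokenize(x), levels = vocab)))
}
tf_train <- t(sapply(train$text, tf))        # documents x terms
idf <- log(nrow(tf_train) / pmax(colSums(tf_train > 0), 1))
tfidf_train <- sweep(tf_train, 2, idf, FUN = "*")
tfidf_test  <- tf(test_text) * idf

cosine <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)) + 1e-12)
centroids <- rowsum(tfidf_train, train$category) / as.vector(table(train$category))
scores <- apply(centroids, 1, cosine, b = tfidf_test)
print(sort(scores, decreasing = TRUE))       # highest-scoring category wins
```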
Text analytics in Python and R with examples from Tobacco Control (Ben Healey)
This document discusses text analytics techniques for summarizing and analyzing unstructured text documents, with examples from analyzing documents related to tobacco control. It covers data cleaning and standardization steps like removing punctuation, stopwords, stemming, and deduplication. It also discusses frequency analysis using document-term matrices, topic modeling using LDA, and unsupervised and supervised classification techniques. The document provides examples analyzing posts from new users versus highly active users on an online forum, identifying topics and comparing topic distributions between different user groups.
The document provides an introduction to text mining in R using the tm package. It discusses how to import text data from various sources into a corpus, transform and preprocess text within a corpus using mappings, and manage metadata for documents and corpora. Specific transformations demonstrated include converting documents to plain text, removing whitespace, converting to lowercase, removing stopwords, and stemming. The document also discusses filtering documents based on metadata values or text content.
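As a minimal R sketch of the tm transformations listed above (lowercasing, punctuation and stopword removal, whitespace stripping, stemming), assuming a small in-memory corpus rather than the document's actual data:

```r
library(tm)
library(SnowballC)   # provides the stemmer used by stemDocument()

docs <- c("Text mining turns raw documents into data.",
          "The tm package transforms and cleans a corpus.")
corpus <- VCorpus(VectorSource(docs))

corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, stemDocument)

inspect(corpus[[1]])   # view the first cleaned document
```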
Interactive Knowledge Discovery over Web of Data (Mehwish Alam)
This document describes research on classifying and exploring data from the Web of Data. It discusses building a classification structure over RDF data by classifying triples based on RDF Schema and creating views through SPARQL queries. This structure can then be used for data completion and interactive knowledge discovery through data analysis and visualization. Formal concept analysis and pattern structures are introduced as techniques for dealing with complex data types from the Web of Data like graphs and linked data. Range minimum queries are also proposed as a way to compute the lowest common ancestor for structured attribute sets in the pattern structures.
This document provides an overview of text mining techniques and processes for analyzing Twitter data with R. It discusses concepts like term-document matrices, text cleaning, frequent term analysis, word clouds, clustering, topic modeling, sentiment analysis and social network analysis. It then provides a step-by-step example of applying these techniques to Twitter data from an R Twitter account, including retrieving tweets, text preprocessing, building term-document matrices, and various analyses.
The document discusses various data sources for linguistic analysis, including corpora, dictionaries, social media, and linked open data. It provides details on accessing data from Facebook and Twitter using APIs and R packages. It also covers preprocessing text data through tokenization, lemmatization, stemming and creating term-document matrices. Sentiment analysis on data from sources like Experience Project is demonstrated through exploring word-category correlations.
Framester: A Wide Coverage Linguistic Linked Data Hub (Mehwish Alam)
Framester is a linguistic linked data hub that aims to improve coverage of FrameNet by extending mappings between FrameNet and other resources like WordNet and BabelNet. Framester represents over 40 million triples linking linguistic and factual resources and aligning frames, roles, and types to foundational ontologies. It provides a word frame disambiguation service and was evaluated on annotated corpora, showing improved performance over previous approaches.
This document discusses text mining in R. It introduces important text mining concepts like tokenization, tagging, and stemming. It outlines popular R packages for text mining like tm, SnowballC, qdap, and dplyr. The document explains how to create a corpus from text files, explore and transform a corpus, create a document term matrix, and analyze term frequencies. Visualization techniques like word clouds and heatmaps are also summarized.
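A short R sketch of the document-term-matrix, term-frequency, and word-cloud steps described above, assuming `corpus` is a cleaned tm corpus (for example, one prepared with the transformations just described):

```r
library(tm)
library(wordcloud)

# assumes `corpus` is a cleaned tm corpus
dtm  <- DocumentTermMatrix(corpus)
freq <- sort(colSums(as.matrix(dtm)), decreasing = TRUE)

findFreqTerms(dtm, lowfreq = 2)              # terms appearing at least twice
head(freq, 10)                               # ten most frequent terms
wordcloud(names(freq), freq, min.freq = 1)   # simple word cloud
```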
Information access over linked data requires determining the subgraph(s), in linked data's underlying graph, that correspond to the required information need. Usually, an information access framework can retrieve richer information by checking a large number of possible subgraphs. However, checking a large number of possible subgraphs increases information access complexity, which makes information access frameworks less effective. Many contemporary linked data information access frameworks reduce this complexity by introducing different heuristics, but they suffer in retrieving richer information; other frameworks simply ignore the complexity. A practically usable framework, however, should retrieve richer information with lower complexity. In linked data information access, we hypothesize that pre-processed statistics of the linked data can be used to efficiently check a large number of possible subgraphs. This will help to retrieve comparatively richer information with lower data access complexity. A preliminary evaluation of our proposed hypothesis shows promising performance.
Navigating and Exploring RDF Data using Formal Concept Analysis (Mehwish Alam)
In this study we propose a new approach based on Pattern Structures, an extension of Formal Concept Analysis, to provide exploration over Linked Data through concept lattices. It takes RDF triples and RDF Schema selected according to user requirements and provides a single navigation space built from several RDF resources. This navigation space supports interactive exploration over RDF data and allows the user to visualize only the part of the data that is of interest to her.
The session focused on data mining using the R language, in which I analyzed a large volume of text files to extract meaningful insights using concepts like DocumentTermMatrix and WordCloud.
Applications of Word Vectors in Text Retrieval and Classification (shakimov)
Applications of word vectors (word2vec, BERT, etc.) to problems such as text retrieval and the classification of textual documents for tasks such as sentiment analysis and spam detection.
This document discusses biological databases and bioinformatics. It begins by listing various related fields including biology, computer science, bioinformatics, statistics, and machine learning. It then describes different types of searches that can be performed in biological databases, including annotation searches, homology searches, pattern searches, and predictions. Finally, it mentions that databases can be used for comparisons, such as gene families and phylogenetic trees.
Text mining and social network analysis of twitter data part 1 (Johan Blomme)
Twitter is one of the most popular social networks through which millions of users share information and express views and opinions. The rapid growth of internet data is a driver for mining the huge amount of unstructured data that is generated to uncover insights from it.
In the first part of this paper we explore different text mining tools. We collect tweets containing the “#MachineLearning” hashtag, prepare the data and run a series of diagnostics to mine the text that is contained in tweets. We also examine topic modeling, which makes it possible to estimate the similarity between documents in a larger corpus.
The NPOESS program uses Unified Modeling Language (UML) to describe the format of the HDF5 files produced. For each unique type of data product, the HDF5 storage organization and the means to retrieve the data are the same. This provides a consistent data retrieval interface for manual and automated users of the data; without it, custom development and cumbersome maintenance would be required. The data formats are described using UML to provide a profile of HDF5 files.
This report discusses three submissions based on the Duet architecture to the Deep Learning track at TREC 2019. For the document retrieval task, we adapt the Duet model to ingest a "multiple field" view of documents—we refer to the new architecture as Duet with Multiple Fields (DuetMF). A second submission combines the DuetMF model with other neural and traditional relevance estimators in a learning-to-rank framework and achieves improved performance over the DuetMF baseline. For the passage retrieval task, we submit a single run based on an ensemble of eight Duet models.
Learning Multilingual Semantic Parsers for Question Answering over Linked Dat... (shakimov)
The document summarizes a PhD dissertation defense talk on learning multilingual semantic parsers for question answering over linked data. It discusses comparing neural and probabilistic graphical model architectures for semantic parsing to map natural language to formal meaning representations. The talk outlines introducing dependency parse tree-based approaches, evaluating different model architectures, and addressing challenges in building multilingual question answering systems over structured knowledge bases.
DCU Search Runs at MediaEval 2014 Search and Hyperlinking (multimediaeval)
We described Dublin City University (DCU)'s participation in the Search sub-task of the Search and Hyperlinking Task at MediaEval 2014. Exploratory experiments were carried out to investigate the utility of prosodic prominence features in the task of retrieving relevant video segments from a collection of BBC videos. Normalised acoustic correlates of loudness, pitch, and duration were incorporated in a standard TF-IDF weighting scheme to increase weights for terms that were prominent in speech. Prosodic models outperformed a text-based TF-IDF baseline on the training set but failed to surpass the baseline on the test set.
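DCU's actual prosodic weighting scheme is not reproduced here; the R sketch below only illustrates the general idea of boosting a term's frequency by a normalised prominence factor before applying IDF. All numbers are invented.

```r
# Illustrative only: scale term frequency by an acoustic prominence factor
# before applying IDF. The prominence and IDF values below are made up; the
# actual DCU weighting scheme is not reproduced here.
terms      <- c("election", "results", "bbc", "tonight")
tf         <- c(3, 1, 2, 1)                 # term counts in one speech segment
prominence <- c(1.4, 1.0, 0.8, 1.2)         # normalised loudness/pitch/duration
idf        <- c(2.1, 1.3, 0.5, 1.7)         # precomputed from the collection

plain_tfidf    <- tf * idf
prosodic_tfidf <- (tf * prominence) * idf   # prominent terms get boosted
rbind(plain_tfidf, prosodic_tfidf)
```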
ParlBench: a SPARQL-benchmark for electronic publishing applications (Tatiana Tarasova)
Slides from the workshop on Benchmarking RDF Systems co-located with the Extended Semantic Web Conference 2013. The presentation is about ongoing work on building a benchmark for electronic publishing applications. The benchmark provides a real-world data set, the Dutch parliamentary proceedings, and a set of analytical SPARQL queries built on top of it. The queries are grouped into micro-benchmarks according to their analytical aims, which allows a better analysis of RDF store behavior with respect to the particular SPARQL feature used in a micro-benchmark/query.
Preliminary results of running the benchmark on the Virtuoso native RDF store are presented, as well as references to the on-line material including the data sets, queries and the scripts that were used to obtain the results.
Image Similarity Detection at Scale Using LSH and Tensorflow with Andrey Gusev (Databricks)
Learning over images and understanding the quality of content play an important role at Pinterest. This talk will present a Spark based system responsible for detecting near (and far) duplicate images. The system is used to improve the accuracy of recommendations and search results across a number of production surfaces at Pinterest.
At the core of the pipeline is a Spark implementation of batch LSH (locality sensitive hashing) search capable of comparing billions of items on a daily basis. This implementation replaced an older (MR/Solr/OpenCV) system, increasing throughput by 13x and decreasing runtime by 8x. A generalized Spark Batch LSH is now used outside of the image similarity context by a number of consumers. Inverted index compression using variable byte encoding, dictionary encoding, and primitives packing are some examples of what allows this implementation to scale. The second part of this talk will detail training and integration of a Tensorflow neural net with Spark, used in the candidate selection step of the system. By directly leveraging vectorization in a Spark context we can reduce the latency of the predictions and increase the throughput.
Overall, this talk will cover a scalable Spark image processing and prediction pipeline.
A high-level introduction to text mining analytics, covering the building blocks and most commonly used techniques of text mining, along with useful references/links for background literature and R code to get you started.
... or how to query an RDF graph with 28 billion triples on a standard laptop
These slides correspond to my talk at the Stanford Center for Biomedical Informatics, on 25th April 2018
This document discusses cross-language information retrieval (CLIR). It defines CLIR as retrieving information written in a language different from the user's query language. It describes approaches to CLIR such as dictionary-based query translation and pseudo-relevance feedback. Dictionary-based query translation uses bilingual dictionaries but requires disambiguation due to ambiguity. Pseudo-relevance feedback assumes top documents are relevant and selects terms from them to expand the query. The document also discusses using parallel corpora to estimate cross-lingual relevance models and evaluate CLIR using conferences like TREC and CLEF.
The document describes cross-language information retrieval (CLIR) and summarizes an English-Chinese information retrieval system called ECIRS. ECIRS allows users to input queries in English and retrieves relevant Chinese documents through translation. It includes dictionaries, document indexes, and a Chinese search engine. Screenshots show the user interface where a user can enter an English keyword, view its Chinese translation, and see search results in Chinese.
This document discusses cross-lingual information retrieval. It presents approaches for translating queries from other languages to the document language, including using online machine translation systems and developing a statistical machine translation system. It describes experiments on reranking translations to select the one most effective for retrieval and on adapting the reranking model to new languages. Results show the reranking approach improves over baselines and online translation systems. The document also explores document translation and query expansion techniques.
- Dynamic programming is used to find the optimal alignment between two protein sequences by recursively computing sub-alignments and storing them in a lookup table.
- The example shows calculating the alignment score between a zinc-finger core sequence and a viral sequence fragment by filling a table and tracking the cumulative scores.
- Filling the table from left to right and top to bottom allows reconstructing the highest scoring alignment between the two sequences.
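As a minimal sketch of the dynamic-programming idea in the bullets above, the following R function fills a global-alignment score table (Needleman-Wunsch style) with simple match/mismatch/gap scores; the sequences and scores are illustrative, not the zinc-finger example or scoring matrix from the slides.

```r
# Minimal global-alignment score table with toy scores:
# +1 match, -1 mismatch, -1 gap.
align_score <- function(a, b, match = 1, mismatch = -1, gap = -1) {
  a <- strsplit(a, "")[[1]]
  b <- strsplit(b, "")[[1]]
  S <- matrix(0, nrow = length(a) + 1, ncol = length(b) + 1)
  S[, 1] <- gap * (0:length(a))                # first column: gaps only
  S[1, ] <- gap * (0:length(b))                # first row: gaps only
  for (i in seq_along(a)) {
    for (j in seq_along(b)) {
      s <- if (a[i] == b[j]) match else mismatch
      S[i + 1, j + 1] <- max(S[i, j] + s,      # diagonal: align a[i] with b[j]
                             S[i, j + 1] + gap,  # up: gap in b
                             S[i + 1, j] + gap)  # left: gap in a
    }
  }
  S[length(a) + 1, length(b) + 1]              # bottom-right cell = optimal score
}
align_score("HEAGAWGHEE", "PAWHEAE")
```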
Text as Data: processing the Hebrew Bible (Dirk Roorda)
The merits of stand-off markup (LAF) versus inline markup (TEI) for processing text as data. Ideas applied to work with the Hebrew Bible, resulting in tools for researchers and end-users.
Text Analysis: Latent Topics and Annotated Documents (Nelson Auner)
This document describes a cluster model for combining latent topics with document attributes in text analysis. It introduces topic models and describes how metadata can be incorporated. The model restricts each document to one topic to allow collapsing observations. An algorithm is provided and applied to congressional speech and restaurant review data. Results show the model can recover topics similarly to topic models, while also capturing variation explained by metadata like political affiliation or review rating.
This document discusses how to calculate the mechanical advantage of a gear system. It provides an example gear system with an input gear of 18 teeth, an idler gear of 6 teeth, and an output gear of 24 teeth. It explains that the mechanical advantage is determined by the ratio of the number of teeth on the output gear to the input gear. For systems with more than two gears, the mechanical advantages of each two-gear combination are multiplied to find the overall mechanical advantage. In this example, the mechanical advantage of the overall system is calculated to be 4/3.
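A quick worked check of the arithmetic described above, as a tiny R snippet: the idler's teeth cancel out, leaving the ratio of output teeth to input teeth.

```r
# Worked check of the gear-train example: 18-tooth input, 6-tooth idler,
# 24-tooth output. The idler's contribution cancels out.
stage1  <- 6 / 18            # idler teeth / input teeth
stage2  <- 24 / 6            # output teeth / idler teeth
overall <- stage1 * stage2   # = 24/18 = 4/3
overall                      # 1.3333...
```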
The European Union has proposed a new package of sanctions against Russia that includes an oil embargo. The embargo would ban imports of Russian oil into the EU and would also prohibit European vessels from transporting Russian oil to other countries. However, Hungary firmly opposes the oil embargo, which could delay approval of the EU sanctions package.
TTG Int. LTD, established in 2001, is a leading global provider of OSS software solutions and services to telecommunications companies. It has experienced outstanding growth thanks to its leadership in developing tools for 2G, 3G, 4G, and fixed networks. TTG's OSS tools are designed by telecommunications engineers and provide functionality such as performance management, fault management, and inventory management.
The document covers data definition languages (DDL) and data manipulation languages (DML). A DDL lets users define data structures and procedures, while a DML lets them query and modify data. Examples of DDL commands are CREATE TABLE, SHOW TABLES, and DROP TABLE. The most popular DML is SQL, which supports operations such as SELECT, INSERT, DELETE, and UPDATE. DMLs can be procedural or non-procedural.
The European Union has agreed on an oil embargo against Russia in response to the invasion of Ukraine. The embargo is part of a sixth sanctions package and will ban most imports of Russian oil into the EU by the end of this year. Some member states still depend heavily on Russian oil and have been granted an exemption, but all Russian oil is expected to be banned by the end of 2023.
Impact
We will gain a better understanding of the critical soil conditions and microbial factors that uncouple or couple nitrification from the other NH4+ consuming sinks. This will enable us to refine nitrogen models and field based management strategies that prevent excessive and/or untimely losses of soil and fertilizer N. This will reduce economic losses to farmers and reduce the potential for off-site damage to environmental quality.
Partite Methacrylate and Methacrylic Adhesives by Parson Adhesive, specially formulated for same- and cross-material surface adhesion, including metal to metal, plastic to metal, and metal to plastic.
Can social media listening actually save lives? Can it at least increase the safety of festival attendees? We addressed these questions together with the University of Siegen in the research project "Basisbausteine zur Sicherheit bei Großveranstaltungen (BaSiGo)" (basic building blocks for safety at major events). An event situation centre with the ability to detect and forecast critical situations: that is the vision we are trying to realize. We present the research results from Wacken Open Air, Chiemsee Summer, and the Annakirmes. Alongside georeferenced listening combined with classic keyword-based listening and channel monitoring, highly developed techniques are used that make it possible to predict how crises will unfold. In combination with semantic text, image, and content analyses, clusters are prepared for predictive analysis. This combination of technology and a highly organized team forms the situation centre of the future, with the ability to detect and prevent critical situations.
These slides give examples of running user experience service design processes. The cases are from the 2015 UXSD workshop hosted by drhhtang. Here you can find how to present the results of interviews and a WAAD.
BlindNavi, a mobile navigation app specially designed for the visually impair... (Anne Chen)
http://blindnavi.webflow.io
BlindNavi is a mobile navigation service that provides the best daily traveling experience for visually impaired users. With BlindNavi, you can search for phone numbers and addresses, navigate directions while traveling, and share everyday experiences with friends.
In the past, visually impaired users had to go through a tedious process to work out their transportation options. Moreover, when they are actually walking on the street, the navigation messages provided by current services such as Google or Apple Maps are not detailed enough for blind people.
The overall design of BlindNavi incorporates a blind-user-friendly interface that operates through VoiceOver, the iOS screen reader that verbally describes every object and movement on the screen to its user.
Applying the Orientation and Mobility (O&M) training concept to the navigation process, BlindNavi provides meaningful micro-location information through beacon technology to create a better walking experience for visually impaired people and to enhance their awareness of their living environment.
These slides explain the use of morphological trees and some applications, for students of the design methods course lectured by drhhtang at NTUST in 2016.
The document discusses rehabilitation game platforms and technology aids for elderly people living with dementia. Some key points discussed include:
- Over 130,000 elderly people in Taiwan live with dementia as the population ages. More than 150 daycare centers have been established.
- Technology aids are becoming more common, including touchscreen games and virtual reality interactions. Several examples of technology aids are provided, such as interactive table systems, online brain training games, and wearable tracking devices.
- Integrated care services aim to coordinate health and social services based on individual needs, particularly for those with complex long-term conditions. Neighborhood care models are also discussed.
"Do more than just reporting" is the guiding principle with which many BI managers are starting 2016. Advanced data analysis and prediction in particular hold great potential for getting more value out of existing investments in BI systems and data assets. The study "Advanced & Predictive Analytics - Schlüssel zur zukünftigen Wettbewerbsfähigkeit" (Key to Future Competitiveness) by the Business Application Research Center (BARC) has now examined the current state of affairs in detail.
ICFHR 2014 Competition on Handwritten KeyWord Spotting (H-KWS 2014) (Konstantinos Zagoris)
H-KWS 2014 is the Handwritten Keyword Spotting Competition organized in conjunction with the ICFHR 2014 conference. The main objective of the competition is to record current advances in keyword spotting algorithms using established performance evaluation measures frequently encountered in the information retrieval literature. The competition comprises two distinct tracks, namely, a segmentation-based and a segmentation-free track. Five (5) distinct research groups have participated in the competition with three (3) methods for the segmentation-based track and four (4) methods for the segmentation-free track. The benchmarking datasets that were used in the contest contain both historical and modern documents from multiple writers. In this paper, the contest details are reported, including the evaluation measures and the performance of the submitted methods, along with a short description of each method.
Panelists: Yoshiyasu Yamakawa (Intel), JP Barraza (Systran), Konstantin Dranch (Memsource), David Koot (TAUS)
The focus of this session will be on predictions and risk management. What kind of things can you predict, and how can you manage risks by analyzing your translation data or monitoring your productivity and quality? Tracking translation data in different cycles of the translation process (translation, post-editing, review, proof-reading) offers tremendous value when it comes to predicting future trends or making informed choices. What type of data can be valuable and what kind of predictions can we make using this data? How can we make more efficient use of already available data? How can we use this type of data to improve machine translation, automatic QA, error-recognition, sampling or quality estimation? How can academia and industry work together towards a common goal?
Post-editese: an Exacerbated Translationese (presentation at MT Summit 2019) (Antonio Toral)
The document summarizes an analysis of post-edited translations compared to human translations in terms of translationese principles. The analysis uses several datasets containing human translations, machine translations, and post-edits. Experiments measure lexical variety, lexical density, length ratio, and perplexity on parts-of-speech sequences to analyze differences between human translations, machine translations, and post-edits. The results generally show that post-edits exhibit characteristics that are between those of human translations and machine translations, indicating that post-editing does not fully remove the "footprint" of machine translation.
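The paper's exact formulas are not given in this summary; the R sketch below shows one rough way to compute two of the measures mentioned, taking lexical variety as the type/token ratio and approximating lexical density by removing stopwords instead of using part-of-speech tags.

```r
library(tm)   # only for the English stopword list

# Rough sketch of two of the measures mentioned above. Lexical variety is
# taken as the type/token ratio; lexical density approximates content words
# by removing stopwords (the paper's POS-based definitions may differ).
measures <- function(text) {
  tokens  <- strsplit(tolower(gsub("[[:punct:]]", "", text)), "\\s+")[[1]]
  content <- tokens[!tokens %in% stopwords("english")]
  c(lexical_variety = length(unique(tokens)) / length(tokens),
    lexical_density = length(content) / length(tokens))
}
measures("the cat sat on the mat and the cat slept")
```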
Search engines (e.g. Google.com, Yahoo.com, and Bing.com) have become the dominant model of online search. Large and small e-commerce sites provide built-in search capability so that visitors can examine the products they offer. While most large businesses are able to hire the necessary skills to build advanced search engines, small online businesses still lack the ability to evaluate the results of their search engines, which means losing the opportunity to compete with larger businesses. The purpose of this paper is to build an open-source model that can measure the relevance of search results for online businesses, as well as the accuracy of their underlying algorithms. We used data from a Kaggle.com competition in order to demonstrate our model running on real data.
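The paper's open-source model is not reproduced here; as one standard way to quantify the relevance of a ranked result list when graded judgments are available, the R sketch below computes NDCG over hypothetical relevance grades.

```r
# Illustrative only: score a ranked result list with NDCG given graded
# relevance judgments (0-3), one common way to quantify search relevance.
dcg  <- function(rel) sum((2^rel - 1) / log2(seq_along(rel) + 1))
ndcg <- function(rel) dcg(rel) / dcg(sort(rel, decreasing = TRUE))

observed <- c(3, 2, 0, 1, 2)   # hypothetical judgments in ranked order
ndcg(observed)                 # 1.0 would mean a perfectly ordered list
```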
This document provides an overview and introduction to the concepts taught in a data structures and algorithms course. It discusses the goals of reinforcing that every data structure has costs and benefits, learning commonly used data structures, and understanding how to analyze the efficiency of algorithms. Key topics covered include abstract data types, common data structures, algorithm analysis techniques like best/worst/average cases and asymptotic notation, and examples of analyzing the time complexity of various algorithms. The document emphasizes that problems can have multiple potential algorithms and that problems should be carefully defined in terms of inputs, outputs, and resource constraints.
Michigan Information Retrieval Enthusiasts Group Meetup - August 19, 2010 (ivan provalov)
Two presentations from the Michigan Information Retrieval Enthusiasts Group Meetup on August 19 by the Cengage Learning search platform development team.
Scaling Performance Tuning With Lucene by John Nader discusses primary performance hot spots related to scaling to a multi-million document collection. This includes the team's experiences with memory consumption, GC tuning, query expansion, and filter performance. Discusses both the tools used to identify issues and the techniques used to address them.
Relevance Tuning Using TREC Dataset by Rohit Laungani and Ivan Provalov describes the TREC dataset used by the team to improve the relevance of the Lucene-based search platform. Goes over an IBM paper and describes the approaches tried: Lexical Affinities, Stemming, Pivot Length Normalization, Sweet Spot Similarity, Term Frequency Average Normalization. Talks about Pseudo Relevance Feedback.
Semantics2018 Zhang, Petrak, Maynard: Adapted TextRank for Term Extraction: A G... (Johann Petrak)
Slides for the talk about the paper:
Ziqi Zhang, Johann Petrak and Diana Maynard, 2018: Adapted TextRank for Term Extraction: A Generic Method of Improving Automatic Term Extraction Algorithms. Semantics-2018, Vienna, Austria
Lecture 7 - Text Statistics and Document Parsing (Sean Golliher)
This document discusses various techniques for text processing and indexing documents for information retrieval systems. It covers topics like tokenization, stemming, stopwords, n-grams to identify phrases, and weighting important document elements like headers, anchor text, and metadata. The document also discusses using links between documents for link analysis and utilizing anchor text for retrieval.
The document discusses best practices for translating DITA (Darwin Information Typing Architecture) content and using the xml:tm (XML-based text memory) standard. It recommends using a CMS, avoiding translatable attributes, keeping topics granular, and using xml:tm to maintain author and translation memories for reuse and alignment between source and translated content. xml:tm allows embedding of contextual information to link source and translated text segments for more accurate matching.
Big Data Spain 2017 - Deriving Actionable Insights from High Volume Media St... (Apache OpenNLP)
Media analysts have to deal with analyzing high volumes of real-time news feeds and social media streams, which is often a tedious process because they need to write search profiles for entities. Python tools like NLTK do not scale to large production data sets and cannot be plugged into distributed, scalable frameworks like Apache Flink. Apache Flink, being a streaming-first engine, is ideally suited for ingesting multiple streams of news feeds, social media, blogs, etc. and for doing streaming analytics on the various feeds. Natural Language Processing tools like Apache OpenNLP can be plugged into Flink streaming pipelines to perform common NLP tasks like Named Entity Recognition (NER), chunking, and text classification. In this talk, we'll be building a real-time media analyzer which does Named Entity Recognition (NER) on the individual incoming streams, calculates the co-occurrences of the named entities and aggregates them across multiple streams, indexes the results into a search engine, and makes the results queryable for actionable insights. We'll also be showing how to handle multilingual documents when calculating co-occurrences. NLP practitioners will come away from this talk with a better understanding of how the various Apache OpenNLP components can help in processing large streams of data feeds and can easily be plugged into a highly scalable and distributed framework like Apache Flink.
Duke is an open source tool for deduplicating and linking records across different data sources without common identifiers. It indexes data using Lucene and performs searches to find potential matches. Duke was used in a real-world project linking data from Mondial and DBpedia, where it correctly linked 94.9% of records while avoiding wrong links. Duke is flexible, scalable, and incremental, making it suitable for ongoing use at Hafslund to integrate customer records from multiple systems and remove duplicates. Future work may include improving comparators, adding a web service interface, and exploring parallelism.
R is a free software environment for statistical analysis and graphics. This document discusses using R for text mining, including preprocessing text data through transformations like stemming, stopword removal, and part-of-speech tagging. It also demonstrates building term document matrices and classifying text with k-nearest neighbors (KNN) algorithms. Specifically, it shows classifying speeches from Obama and Romney with over 90% accuracy using KNN classification in R.
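A minimal R sketch of the document-term-matrix plus KNN step described above, using toy documents and labels in place of the Obama/Romney speeches:

```r
library(tm)
library(class)   # knn()

# Sketch of the DTM + KNN step; toy documents and labels stand in for
# the Obama/Romney speeches.
docs   <- c("economy jobs taxes growth", "healthcare reform families",
            "economy trade jobs",        "families healthcare education")
labels <- factor(c("speaker_a", "speaker_b", "speaker_a", "speaker_b"))

train_dtm <- DocumentTermMatrix(VCorpus(VectorSource(docs)))
test_dtm  <- DocumentTermMatrix(VCorpus(VectorSource("jobs economy growth")),
                                control = list(dictionary = Terms(train_dtm)))

train_m <- as.matrix(train_dtm)
test_m  <- as.matrix(test_dtm)[, colnames(train_m), drop = FALSE]  # align columns
knn(train = train_m, test = test_m, cl = labels, k = 1)            # predicted label
```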
The document summarizes a presentation on setting up a domain-specific language model for natural language processing (NLP) tasks at DATEV eG. It discusses adapting a pretrained BERT model to DATEV's domain by fine-tuning it on DATEV corpus data. A proof-of-concept evaluation on two classification tasks showed improved results over baselines, demonstrating that incorporating domain data enhances NLP models for a company's specific needs. Key insights included the importance of domain knowledge and corpus composition for different subdomains.
The document discusses building a machine learning model for resume classification using natural language processing techniques. It explores the dataset of resumes and profiles, performs text preprocessing, feature engineering, and builds various classification models to accurately classify resumes. The best performing model is random forest classification, which achieves 100% accuracy on the test data with no errors, overfitting, or misclassifications.
The document discusses a generic programming toolkit called PADS/ML that can be used to parse, analyze, and transform semi-structured or "ad hoc" data from various domains. It describes how PADS/ML uses generated type representations and typecase analysis to write functions that can operate on any data format described by a PADS/ML type. Case studies of PADX and Harmony are presented, which use PADS/ML to build tools for querying and synchronizing different data formats.
Cassandra is a structured storage system designed for large amounts of data across commodity servers. It provides high availability with eventual consistency and scales incrementally without centralized administration. Data is partitioned across nodes and replicated for fault tolerance. Writes are applied locally and propagated asynchronously, prioritizing availability over consistency. It uses a gossip protocol for membership and failure detection.
The document discusses OAXAL (Open Architecture for XML Authoring and Localization) and xml:tm (XML-based Text Memory) as standards for authoring and localizing XML documents. It provides an overview of related standards like DITA, ITS, TMX, SRX, GMX, and XLIFF. It then describes how xml:tm uses these standards and creates an XML namespace to embed contextual information, link source and target texts, and automate workflows for translation and localization of XML documents.
The document discusses different techniques for weighting terms in the vector space model for information retrieval, including:
- Sublinear tf scaling using the logarithm of term frequency
- Tf-idf weighting
- Maximum tf normalization to mitigate higher weights for longer documents
It also discusses evaluating information retrieval systems using test collections with queries, relevant documents, and metrics like precision and recall. Standard test collections include Cranfield, TREC, and CLEF.
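A small R sketch of the weights in the list above, using invented counts: sublinear tf scaling wf = 1 + log(tf) for tf > 0, combined with idf = log(N/df).

```r
# Sublinear tf scaling and tf-idf, as listed above. Counts are illustrative.
tf <- c(apple = 7, banana = 1, cherry = 0)    # term counts in one document
df <- c(apple = 50, banana = 5, cherry = 20)  # document frequencies
N  <- 100                                     # documents in the collection

wf    <- ifelse(tf > 0, 1 + log(tf), 0)       # sublinear tf scaling
tfidf <- wf * log(N / df)                     # tf-idf weight
round(tfidf, 3)
```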
Similar to Classification of CNN.com Articles using a TF*IDF Metric
Preserving virtual worlds educational events using social media v2 (Marie Vans)
This document discusses preserving educational experiences from virtual world courses. It describes courses run at San Jose State University in virtual worlds representing historical time periods. Large amounts of student data are generated across social networks and virtual worlds, but are at risk without preservation. Options discussed for archiving this data include using YouTube, blogs, Pinterest and cloud storage services like Amazon Glacier, Google Cloud Storage, and Preservica. These services provide affordable, redundant storage of the video, images and other materials generated in virtual world courses to preserve them for future education.
Creating an Award-Winning Educational Machinima (Marie Vans)
This document describes Marie Vans' award-winning educational machinima about Catherine de' Medici and the St. Bartholomew's Day Massacre. It provides details on the research, characters, setting, story, and challenges in creating the 23-minute machinima. Tools used included Captivate for recording, Premier Pro for editing, Audacity for voices, Jamendo and CCMixer for music, and Second Life for the virtual world environment. Planning and overcoming technical difficulties were important lessons learned.
Preserving virtual worlds educational events using social media v2 (Marie Vans)
This document discusses preserving educational experiences from virtual world courses. It describes courses run at San Jose State University in virtual worlds like Second Life reflecting historical time periods. Large amounts of student data are generated across social networks but are at risk without preservation. Options discussed include using the school blog and YouTube to archive videos. Amazon Glacier and Google Cloud storage are affordable archiving solutions that could help systematically preserve the scattered student works and documentation.
Archive enabling tagging using progressive barcodes (Marie Vans)
This document discusses using progressive barcodes for archiving applications. Progressive barcodes encode data in both black-and-white and color channels, allowing two channels of information. They are suitable for archiving where one channel, like a document ID, remains static, while the color channel can dynamically track document changes. The document proposes an inference model where tags on individual archive items can be inferred from tags on containing folders and computers, reducing the number of required tags. Progressive barcodes combined with inference could track documents and versions without opening each container.
Creating displays of virtual objects and events (Marie Vans)
This document discusses ways to document educational experiences in virtual worlds through machinima and object displays. It notes that vast amounts of data are generated during educational activities in virtual worlds, including student builds and revisions, project data, educational events, and avatar interactions. While it may not be feasible to preserve all this data, it is possible to document essential elements through machinima, photography, and displays of virtual objects to allow for recreation of experiences or development of similar courses. The document outlines challenges to documentation and provides examples of how builds, events, and displays can be captured and presented.
This document proposes the use of progressive barcodes, which encode changing information over time through the addition of colors to barcode tiles. It describes how progressive barcodes can be used in workflows to track items as they move through different stages. The document also discusses how progressive barcodes allow for inference applications by linking items to their containers. Finally, it suggests progressive barcodes could incorporate customer rewards programs and other applications at point-of-sale by encoding data in the color tiles beyond the static information read by retailers.
The document proposes and experiments with progressive barcodes, which can change over time to represent information at different stages of a workflow. A progressive barcode starts with white tiles and then colors are incrementally added to tiles to encode additional data as a document moves through a workflow. Experiments show that progressive barcodes can still be read with saturation up to 75% and certain color combinations, allowing the barcodes to change over time while still being scanned. The barcodes have potential applications for tracking documents and routing physical items through different stages of a process.
This document summarizes a study on the impact of scrambling techniques on the entropy of barcodes. The study tested barcodes with and without error correcting codes (ECC) using four scrambling methods and three entropy measures. Results showed that scrambling increased the entropy and randomness of barcodes that originally contained ECC, making it harder to detect the presence of ECC. However, the difference in entropy between scrambled ECC and non-ECC barcodes was small and not statistically significant. The study concluded that while entropy analysis can detect the presence of structure in barcodes, the methods tested were not effective at distinguishing scrambled ECC barcodes from purely random barcodes.
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You... (Aggregage)
This webinar will explore cutting-edge, less familiar but powerful experimentation methodologies which address well-known limitations of standard A/B Testing. Designed for data and product leaders, this session aims to inspire the embrace of innovative approaches and provide insights into the frontiers of experimentation!
Analysis insight about a Flyball dog competition team's performance (roli9797)
Insights from my analysis of a Flyball dog competition team's performance last year. Find more: https://github.com/rolandnagy-ds/flyball_race_analysis/tree/main
End-to-end pipeline agility - Berlin Buzzwords 2024 (Lars Albertsson)
We describe how we achieve high change agility in data engineering by eliminating the fear of breaking downstream data pipelines through end-to-end pipeline testing, and by using schema metaprogramming to safely eliminate boilerplate involved in changes that affect whole pipelines.
A quick poll on agility in changing pipelines from end to end indicated a huge span in capabilities. For the question "How long time does it take for all downstream pipelines to be adapted to an upstream change," the median response was 6 months, but some respondents could do it in less than a day. When quantitative data engineering differences between the best and worst are measured, the span is often 100x-1000x, sometimes even more.
A long time ago, we suffered at Spotify from fear of changing pipelines due to not knowing what the impact might be downstream. We made plans for a technical solution to test pipelines end-to-end to mitigate that fear, but the effort failed for cultural reasons. We eventually solved this challenge, but in a different context. In this presentation we will describe how we test full pipelines effectively by manipulating workflow orchestration, which enables us to make changes in pipelines without fear of breaking downstream.
Making schema changes that affect many jobs also involves a lot of toil and boilerplate. Using schema-on-read mitigates some of it, but has drawbacks since it makes it more difficult to detect errors early. We will describe how we have rejected this tradeoff by applying schema metaprogramming, eliminating boilerplate but keeping the protection of static typing, thereby further improving agility to quickly modify data pipelines without fear.
Learn SQL from basic queries to Advance queriesmanishkhaire30
Dive into the world of data analysis with our comprehensive guide on mastering SQL! This presentation offers a practical approach to learning SQL, focusing on real-world applications and hands-on practice. Whether you're a beginner or looking to sharpen your skills, this guide provides the tools you need to extract, analyze, and interpret data effectively.
Key Highlights:
Foundations of SQL: Understand the basics of SQL, including data retrieval, filtering, and aggregation.
Advanced Queries: Learn to craft complex queries to uncover deep insights from your data.
Data Trends and Patterns: Discover how to identify and interpret trends and patterns in your datasets.
Practical Examples: Follow step-by-step examples to apply SQL techniques in real-world scenarios.
Actionable Insights: Gain the skills to derive actionable insights that drive informed decision-making.
Join us on this journey to enhance your data analysis capabilities and unlock the full potential of SQL. Perfect for data enthusiasts, analysts, and anyone eager to harness the power of data!
#DataAnalysis #SQL #LearningSQL #DataInsights #DataScience #Analytics
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...sameer shah
"Join us for STATATHON, a dynamic 2-day event dedicated to exploring statistical knowledge and its real-world applications. From theory to practice, participants engage in intensive learning sessions, workshops, and challenges, fostering a deeper understanding of statistical methodologies and their significance in various fields."
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...Social Samosa
The Modern Marketing Reckoner (MMR) is a comprehensive resource packed with POVs from 60+ industry leaders on how AI is transforming the 4 key pillars of marketing – product, place, price and promotions.
Global Situational Awareness of A.I. and where its headedvikram sood
You can see the future first in San Francisco.
Over the past year, the talk of the town has shifted from $10 billion compute clusters to $100 billion clusters to trillion-dollar clusters. Every six months another zero is added to the boardroom plans. Behind the scenes, there’s a fierce scramble to secure every power contract still available for the rest of the decade, every voltage transformer that can possibly be procured. American big business is gearing up to pour trillions of dollars into a long-unseen mobilization of American industrial might. By the end of the decade, American electricity production will have grown tens of percent; from the shale fields of Pennsylvania to the solar farms of Nevada, hundreds of millions of GPUs will hum.
The AGI race has begun. We are building machines that can think and reason. By 2025/26, these machines will outpace college graduates. By the end of the decade, they will be smarter than you or I; we will have superintelligence, in the true sense of the word. Along the way, national security forces not seen in half a century will be un-leashed, and before long, The Project will be on. If we’re lucky, we’ll be in an all-out race with the CCP; if we’re unlucky, an all-out war.
Everyone is now talking about AI, but few have the faintest glimmer of what is about to hit them. Nvidia analysts still think 2024 might be close to the peak. Mainstream pundits are stuck on the wilful blindness of “it’s just predicting the next word”. They see only hype and business-as-usual; at most they entertain another internet-scale technological change.
Before long, the world will wake up. But right now, there are perhaps a few hundred people, most of them in San Francisco and the AI labs, that have situational awareness. Through whatever peculiar forces of fate, I have found myself amongst them. A few years ago, these people were derided as crazy—but they trusted the trendlines, which allowed them to correctly predict the AI advances of the past few years. Whether these people are also right about the next few years remains to be seen. But these are very smart people—the smartest people I have ever met—and they are the ones building this technology. Perhaps they will be an odd footnote in history, or perhaps they will go down in history like Szilard and Oppenheimer and Teller. If they are seeing the future even close to correctly, we are in for a wild ride.
Let me tell you what we see.
Predictably Improve Your B2B Tech Company's Performance by Leveraging DataKiwi Creative
Harness the power of AI-backed reports, benchmarking and data analysis to predict trends and detect anomalies in your marketing efforts.
Peter Caputa, CEO at Databox, reveals how you can discover the strategies and tools to increase your growth rate (and margins!).
From metrics to track to data habits to pick up, enhance your reporting for powerful insights to improve your B2B tech company's marketing.
- - -
This is the webinar recording from the June 2024 HubSpot User Group (HUG) for B2B Technology USA.
Watch the video recording at https://youtu.be/5vjwGfPN9lw
Sign up for future HUG events at https://events.hubspot.com/b2b-technology-usa/
Predictably Improve Your B2B Tech Company's Performance by Leveraging Data
Classification of CNN.com Articles using a TF*IDF Metric
1. Classification of CNN.com Articles using a TF*IDF Metric
Marie Vans and Steven Simske, HP Labs
Fort Collins, Colorado
April 20, 2016
2. Agenda
• TF*IDF Family of Metrics
• Word Frequencies
• Data Set & Preprocessing
• Algorithms for word frequencies and classification
• An example
• Results
• Future Directions
• Conclusions
6. TF*IDF – Family – Inverse Document Frequency

IDF # | IDF Name | IDF Equation
13 | NormPowerOfSums | $\left( \dfrac{\sum_{j=1}^{N-1} k_j}{w_{i,n}} \right)^{Power}$
14 | NormPowersOfSums | $\dfrac{\left( \sum_{j=1}^{N-1} k_j \right)^{DocPower}}{\left( w_{i,n} \right)^{WordPower}}$
where
i = current word
j = current document
k_j = total words in document j
n = total words in documents other than the current document
N = total number of documents in the corpus
w_{i,j} = number of occurrences of word i in document j
w_{i,n} = occurrences of word i in documents other than the current one
n_i = number of documents in which word i occurs
LogRatio = ratio of the log for an individual word to the log of the document length
MinLogRatio = user-settable minimum for LogRatio
WordPower and DocPower = adjustable values
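Only two members of the IDF family appear on this slide. For orientation only, here is a brief Python sketch of the classic TF*IDF formulation; it is background material and not necessarily the exact "simple TF*IDF metric" the paper uses for benchmarking.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Classic TF*IDF, one of the simplest members of the family:
    tf(i, j) = count of word i in doc j / total words in doc j
    idf(i)   = log(N / n_i), with N docs and n_i docs containing word i."""
    N = len(docs)
    df = Counter()                      # document frequency n_i per word
    for doc in docs:
        df.update(set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        scores.append({w: (c / total) * math.log(N / df[w]) for w, c in counts.items()})
    return scores

# Example: a word scores highly only where it is frequent in a document
# relative to how many documents contain it.
print(tf_idf([["nuclear", "energy"], ["energy", "cost"], ["dog", "show"]]))
```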
9. CNN Data Set

Class Name    Total Files    Training Set Files    Test Set Files
Business 161 81 80
Health 290 145 145
Justice 224 112 112
Living 98 49 49
Opinion 192 96 96
Politics 195 98 97
Showbiz 241 121 120
Sport 148 74 74
Tech 132 66 66
Travel 171 86 85
US 160 80 80
World 988 494 494
• 12 classes
• 3,000 total files
• Each class split into two sets: a training set and a test set
• File classes ground-truthed by CNN
Rafael Dueire Lins, Steven J. Simske, Luciano de Souza Cabral, Gabriel de Silva, Rinaldo Lima, Rafael F. Mello, and Luciano Favaro. A multi-tool scheme for summarizing textual documents. In Proceedings of the 11th IADIS International Conference WWW/INTERNET 2012, pages 1–8, July 2012.
10. CNN Data Set

Class Name    Training Set Unique Words    Test Set Unique Words    Total Words Processed
Business 8278 7851 16129
Health 12246 12036 24282
Justice 9133 9032 18165
Living 7936 7030 14966
Opinion 11382 10886 22268
Politics 9268 9039 18307
Showbiz 8997 9949 18946
Sport 7445 7191 14636
Tech 7971 7548 15519
Travel 14931 12612 27543
US 8488 8707 17195
World 22936 23441 46377
• 12 classes
• Total words: 254,333
• Training set: 129,011
• Test set: 125,322
11. Preprocessing
• Remove “stop words”
• Remove punctuation (hyphenation excepted)
• No lemmatization
• SharpNLP – open-source natural language processing (https://sharpnlp.codeplex.com/)
• sentence splitter
• tokenizer
• part-of-speech tagger
• chunker
• parser
• name finder
• coreference tool
• interface to the WordNet lexical database
• File parsed, with each word tagged with its part of speech (a minimal preprocessing sketch follows below)
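A minimal preprocessing sketch in Python, reflecting the steps above (stop-word removal, punctuation removal with hyphenation excepted, no lemmatization). This is not the authors' SharpNLP pipeline; the stop-word list and tokenization regex are assumptions.

```python
import re

# Assumed, abbreviated stop-word list; the paper's actual list is not given.
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it", "for", "on"}

def preprocess(text):
    """Lowercase, strip punctuation (hyphens excepted), and drop stop words.
    No lemmatization is applied, mirroring the slide."""
    cleaned = re.sub(r"[^\w\s-]", " ", text.lower())  # keep word chars, spaces, hyphens
    tokens = cleaned.split()
    return [t for t in tokens if t not in STOP_WORDS]

print(preprocess("Germany's nuclear gamble: a switch-over to renewable energy."))
# ['germany', 's', 'nuclear', 'gamble', 'switch-over', 'renewable', 'energy']
```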
12. Program Classes (Not CNN Classes)
• Word Class
• m_Spelling
• m_Count (frequency of the word in a file)
• m_Weight (assigned by the different TF*IDF measures)
• m_HasHyphen (hyphenated words count as a single word)
• m_PennTags (part-of-speech tags)
• m_Tags (number of tags associated with the word)
• TermFrequencies Class
• m_TermName
• m_TermFreq (int)
• Classify Class
• m_businessWords;
• m_healthWords;
• m_justiceWords;
• m_livingWords;
• m_opinionWords;
• m_politicsWords;
• m_showbizWords;
• m_sportWords;
• m_techWords;
• m_travelWords;
• m_usWords;
• m_worldWords;
• m_confusionMatrx
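The member names above suggest the original implementation is in C#; as a rough, hypothetical analogue (field names simply mirror the slide and are not the authors' code), the same structures might look like this in Python:

```python
from dataclasses import dataclass, field

@dataclass
class Word:
    spelling: str
    count: int = 0            # frequency of the word in a file
    weight: float = 0.0       # assigned by one of the TF*IDF measures
    has_hyphen: bool = False  # hyphenated words count as a single word
    penn_tags: list = field(default_factory=list)  # part-of-speech tags

    @property
    def tag_count(self) -> int:
        # number of tags associatedated with the word
        return len(self.penn_tags)

@dataclass
class TermFrequency:
    term_name: str
    term_freq: int = 0

@dataclass
class Classifier:
    # one word table per CNN class, e.g. {"business": {"nuclear": Word(...)}}
    class_words: dict = field(default_factory=dict)
    confusion_matrix: dict = field(default_factory=dict)
```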
13. Algorithm
A. Using the Training Set files in each class (i.e., do this 12 times):
1.0 For each file in the set: create a word object for every unique word in the file
2.0 Count the total number of occurrences of each unique word for the entire set of documents
3.0 Calculate the weight of each word: total occurrences of word_i in all files / total occurrences of all words in all files

$T_{word_i} = \dfrac{\sum_{j=1}^{nfiles} f(w_{i,j})}{\sum_{i=1}^{nwords} \sum_{j=1}^{nfiles} f(w_{i,j})}$
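A minimal sketch of step A in Python, assuming each training file has already been reduced to a token list by the preprocessing step; the function and variable names are illustrative, not the authors'.

```python
from collections import Counter

def class_word_weights(files_tokens):
    """Weight of each word for one training class:
    total occurrences of word_i in all files / total occurrences of all words in all files."""
    counts = Counter()
    for tokens in files_tokens:
        counts.update(tokens)
    total = sum(counts.values())
    return {word: c / total for word, c in counts.items()}

# Two tiny hypothetical "business" training files:
weights = class_word_weights([["nuclear", "energy", "germany"],
                              ["energy", "cost", "energy"]])
print(weights["energy"])   # 3 occurrences out of 6 words -> 0.5
```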
14. Algorithm
B. Using the Testing Set files in a specific class (e.g., Business):
1.0 For each file in the set: create a word object for every unique word in the file
2.0 Count the total number of occurrences of each unique word for the entire set of documents
3.0 Calculate the weight of each word:
T_{word_i,j} = total occurrences of word_i in the file / total occurrences of all words in the file
T_{word_i} = total occurrences of word_i in all files / total occurrences of all words in all files

$T_{word_{i,j}} = \dfrac{f(w_{i,j})}{\sum_{i=1}^{nwords} f(w_{i,j})} \qquad T_{word_i} = \dfrac{\sum_{j=1}^{nfiles} f(w_{i,j})}{\sum_{i=1}^{nwords} \sum_{j=1}^{nfiles} f(w_{i,j})}$
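Step B adds a per-file weight alongside the class-level weight computed as in step A. A small sketch under the same assumptions:

```python
from collections import Counter

def file_word_weights(tokens):
    """Weight of each word within a single test file:
    occurrences of word_i in the file / total occurrences of all words in the file."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: c / total for word, c in counts.items()}

# For a 357-word test file in which "germany" appears 9 times, this gives
# germany ~ 9/357 ~ 0.0252, matching the test-set example frequencies later in the deck.
```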
15. Algorithm
D. Classify each word_i in one test file by comparing it to the same word in all training classes:
e.g., Test_{word_i} = word_i in the Business test class
C_business = Test_{word_i} × Business_Train_{word_i}
C_health = Test_{word_i} × Health_Train_{word_i}
C_justice = Test_{word_i} × Justice_Train_{word_i}
⋮
C_world = Test_{word_i} × World_Train_{word_i}
Class = Max(C_business, C_health, …, C_world)
16. Algorithm
C. Classify each word_i in the entire test class by comparing it to the same word in all training classes:
e.g., Test_{word_i} = word_i in the Business test class
C_business = Test_{word_i} × Business_Train_{word_i}
C_health = Test_{word_i} × Health_Train_{word_i}
C_justice = Test_{word_i} × Justice_Train_{word_i}
⋮
C_world = Test_{word_i} × World_Train_{word_i}
Class = Max(C_business, C_health, …, C_world)
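A hedged sketch of the classification step: each word's test weight is multiplied by its training weight in every class, the products are accumulated, and the class with the maximum score wins. The deck only sketches the aggregation and normalization (see the Results - General slide), so the simple summation below is an assumption.

```python
def classify(test_weights, train_weights_by_class):
    """Score each class by accumulating test_weight(word) * train_weight(word)
    over the words of a test file (or a whole test class), then take the max."""
    scores = {}
    for cls, train_weights in train_weights_by_class.items():
        scores[cls] = sum(w * train_weights.get(word, 0.0)
                          for word, w in test_weights.items())
    best = max(scores, key=scores.get)
    return best, scores

# Tiny illustration with made-up weights for two classes:
test_w = {"nuclear": 0.02, "energy": 0.01}
train = {"business": {"nuclear": 0.001, "energy": 0.0009},
         "showbiz":  {"nuclear": 0.00001, "energy": 0.0001}}
print(classify(test_w, train))   # -> ('business', {...})
```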
17. CNN Data Set – Example Article – Business Class
After: Could Germany's nuclear gamble backfire?
As Germany's switchover from nuclear power to renewable energy gathers pace, concerns are mounting over
the cost to the country's prosperity and its already squeezed consumers.
Politicians in Europe's largest economy want renewable power to contribute 35% of the country's electricity
consumption by 2020 and 80% by 2050 as part of its clean energy drive.
The country's 'energiewende' -- translated as energy transformation -- is part of the government's plan to move
away from nuclear power and fossil fuels to renewable energy sources, following Japan's disaster
in 2011.
Michael Limburg, vice-president of the European Institute for Climate and Energy, told CNN that the
government's energy targets are 'completely unfeasible.'
'Of course, it's possible to erect tens of thousands of windmills but only at an extreme cost and waste of natural
space,' he said.
'And still it would not be able to deliver electricity when it is needed.'
The government is investing heavily in onshore and offshore wind farms and solar technology in an effort to
reduce 40% of greenhouse gas emissions by 2020.
Last year Chancellor Angela Merkel, who this week won her third term as Germany's leader, proposed to
construct offshore wind farms in the North Sea, a plan that would cost 200 billion euros ($270 billion), according
to the DIW economic institute in Berlin.
As part of the energy drive, Merkel also pledged to permanently shut down the country's 17 nuclear reactors,
which fuel 18% of the country's power needs.
Under Germany's Atomic Energy Act, the last nuclear power plant will be disconnected by 2022.
18. CNN Data Set – Example Frequencies – Training Set

Single file:
fukushima  m_TermFreq = 6
germany    m_TermFreq = 1
nuclear    m_TermFreq = 12

All files in the class:
fukushima  m_TermFreq = 9
germany    m_TermFreq = 26
nuclear    m_TermFreq = 33

% occurrence in file:
fukushima  0.0102739726027397
germany    0.00156739811912226
nuclear    0.0205479452054795

% occurrence in class:
fukushima  0.000307188203972967
germany    0.000887432589255239
nuclear    0.00112635674790088
19. CNN Data Set – Example Frequencies – Test Set

Single file:
fukushima  m_TermFreq = 2
germany    m_TermFreq = 9
nuclear    m_TermFreq = 5

All files in the class:
fukushima  m_TermFreq = 2
germany    m_TermFreq = 21
nuclear    m_TermFreq = 6

% occurrence in file:
fukushima  0.0056022408963585
germany    0.0252100840336134
nuclear    0.0140056022408964

% occurrence in class:
fukushima  0.000073773515308004
germany    0.000774621910734047
nuclear    0.000221320545924013
20. Classify All Words in a Single Business Test File
Business 0.00069364
Health 0.00030063
Justice 0.00025000
Living 0.00026707
Opinion 0.00033446
Politics 0.00034694
Showbiz 0.00025372
Sport 0.00029984
Tech 0.00033337
Travel 0.00023201
US 0.00031539
World 0.00040208
MAX class value: Business (0.00069364) – the file is correctly classified
21. Classify All Words in All Business Test Files
Business 0.00059513
Health 0.00035854
Justice 0.00027830
Living 0.00038269
Opinion 0.00039295
Politics 0.00036828
Showbiz 0.00029162
Sport 0.00036698
Tech 0.00040147
Travel 0.00032406
US 0.00032592
World 0.00037747
MAX class value: Business (0.00059513) – the test class as a whole is correctly classified
22. Confusion Matrix
• Each column contains samples of classifier output
• Each row contains samples in the true class
• Each row sums to 1.0
• Diagonal entries show the percent classified correctly
• Mean of the diagonal = 89%
• Off-diagonal entries show the types of errors that occur
• A is misclassified as B – 3%
• A is misclassified as C – 3%
Normalized confusion matrix (rows = true class of the samples / input, columns = classifier output / computed prediction):

      A     B     C
A    0.94  0.03  0.03
B    0.08  0.85  0.07
C    0.08  0.04  0.88
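A short sketch of how the normalized matrix, its diagonal mean, and its column sums can be computed; the raw counts below are chosen to reproduce the illustrative A/B/C example above.

```python
import numpy as np

# Raw counts: rows = true class, columns = predicted class (illustrative values).
counts = np.array([[94,  3,  3],
                   [ 8, 85,  7],
                   [ 8,  4, 88]], dtype=float)

confusion = counts / counts.sum(axis=1, keepdims=True)  # each row sums to 1.0

mean_diag = confusion.diagonal().mean()   # ~0.89 for this example
column_sums = confusion.sum(axis=0)       # > 1.0 marks attractor classes, < 1.0 repulsor
print(confusion)
print(mean_diag, column_sums)
```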
23. Results - Classification
True class (rows) vs. predicted class (columns):

          business health  justice living  opinion politics showbiz sport   tech    travel  us      world
business  0.75     0       0       0.0875  0       0        0       0.025   0.1125  0.025   0       0
health    0        0.7724  0.0207  0.1793  0       0.0138   0.0069  0       0.0069  0       0       0
justice   0        0       0.9018  0.0179  0       0.0446   0.0089  0       0       0       0.0268  0
living    0.0204   0.0408  0       0.8163  0       0.0204   0       0.0612  0.0408  0       0       0
opinion   0.2083   0.0729  0.0208  0.2708  0.0313  0.1667   0       0.0417  0.0417  0       0.0104  0.1354
politics  0.0103   0.0103  0.0515  0.0412  0       0.8557   0       0       0       0       0       0.0309
showbiz   0.0083   0.0083  0.1583  0.1417  0       0.0083   0.6417  0.025   0       0       0.0083  0
sport     0.027    0.0135  0.0405  0.0541  0       0.027    0       0.8108  0       0.027   0       0
tech      0.0303   0.0303  0.0152  0.2121  0       0        0.0152  0       0.6818  0.0152  0       0
travel    0.1412   0.0118  0.0235  0.1412  0       0.0824   0       0.0353  0.0588  0.4353  0.0471  0.0235
us        0.025    0.05    0.3125  0.175   0       0.1125   0.025   0.0625  0.0375  0.0125  0.1875  0
world     0.0769   0.0142  0.1316  0.0789  0.002   0.1255   0.0061  0.0142  0.0202  0.002   0.0142  0.5142
Note that the diagonal entries are the correct classifications.
The rows sum to 1.0, since each row represents the actual class from which the documents were taken.
The column sums average 1.0 across classes, with some variance depending on whether the class in the column is an attractor class (sum > 1.0) or a repulsor class (sum < 1.0).
24. Example of Incorrectly Classified File

Results for a file from the Opinion test class:
Business  0.00033924
Health    0.00025056
Justice   0.00027728
Living    0.00027807
Opinion   0.00041936  ← 3rd MAX class value
Politics  0.00046704  ← MAX class value
Showbiz   0.00023136
Sport     0.00028422
Tech      0.00025991
Travel    0.00021793
US        0.00032973
World     0.00043251  ← 2nd MAX class value

It takes 3 tries to get it right.
25. Classification Attempts to Success

Measures the average number of attempts until the correct class is chosen:

(1 × P1 + 2 × P2 + 3 × P3 + … + 12 × P12) / nfiles

where
P1 = number of files correctly classified on the first try
P2 = number of files correctly classified on the second try
.
.
.
P12 = number of files correctly classified on the last (12th) try
nfiles = number of files in the testing class

Example: worst class – Opinion
P1 = 3   → 3 × 1 = 3
P2 = 35  → 35 × 2 = 70
P3 = 31  → 31 × 3 = 93
P4 = 14  → 14 × 4 = 56
P5 = 7   → 7 × 5 = 35
P6 = 3   → 3 × 6 = 18
P7 = 2   → 2 × 7 = 14
P8 = 1   → 1 × 8 = 8
Σ = 297; 297 / 96 = 3.09
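A sketch of this metric in code, reproducing the Opinion example above:

```python
def attempts_to_success(counts_by_attempt, n_files):
    """Average number of attempts until the correct class is chosen.
    counts_by_attempt[k] = number of files whose true class ranked (k+1)-th,
    i.e. files classified correctly on attempt k+1."""
    total = sum((k + 1) * c for k, c in enumerate(counts_by_attempt))
    return total / n_files

# Worst class (Opinion): 3 files correct on the 1st try, 35 on the 2nd, 31 on the 3rd, ...
opinion = [3, 35, 31, 14, 7, 3, 2, 1]
print(attempts_to_success(opinion, 96))   # 297 / 96 = 3.09...
```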
26. Results - Classification Attempts to Success
• Measures the average number of attempts until the correct class is chosen
• Ideal is 1.0 – We get it right on the first try
• Best Class – Justice
• Correctly classified: 0.9018
• Mean classification attempts: 1.19
• Delta from ideal = 0.19
• Worst Class – Opinion
• Correctly classified: 0.0313
• Mean classification attempts: 3.09
• Delta from ideal = 2.09
• The best class is 11 times closer to the ideal than the worst class (delta of 0.19 vs. 2.09)
• All other classes between best and worst
27. Results - General
• Confusion matrix shows good classification results
• Average classification rate for all classes = 0.61655883
• Classification errors:
• Attractor classes (column sum > 1.0): Business, Health, Justice, Living, Politics, Sport, Tech
• Repulsor classes (column sum < 1.0): Opinion, Showbiz, Travel, U.S., World
• Normalization:
• By total occurrences of all words in the file, for classification of a single file
• By total occurrences of all words in the class, for classification of multiple files

Confusion-matrix column sums by class:
business 1.2978 | health 1.0245 | justice 1.6765 | living 2.216 | opinion 0.0333 | politics 1.4569 | showbiz 0.7037 | sport 1.0757 | tech 1.0003 | travel 0.517 | us 0.2943 | world 0.704
29. Future Directions
• Automatic summarization based on word frequencies in sentences
• Data from Brazil also contained Gold Standard sentences for summarization
• Each file contains sentences pulled out of the full article by at least 3 students
• Gold Standard sentences for each file act as ground truth for automatic summarization
• New York Times Annotated Corpus (https://catalog.ldc.upenn.edu/LDC2008T19)
• Articles written and published by the New York Times between January 1, 1987 and June 19, 2007
• Metadata provided by the New York Times Newsroom, the New York Times Indexing Service and the online production staff at nytimes.com:
• Over 1.8 million articles
• Over 650,000 article summaries written by library scientists
• Over 1,500,000 articles manually tagged by library scientists with tags drawn from a normalized indexing vocabulary of people, organizations, locations and topic descriptors
• Over 275,000 algorithmically-tagged articles that have been hand verified by the online production staff at nytimes.com
30. Conclusions
• A family of TF*IDF metrics for summarization and classification
• A simple TF*IDF metric
• A classification scheme that works well on a set of 3,000 CNN articles separated into 12 classes
• Classification attempts to success is a measure that tells us how hard a class is to classify
• Attractor and repulsor classes may help in identifying imbalances in the data
• The simple TF*IDF metric can be used for benchmarking the rest of the 112 TF*IDF metrics