R is a free software environment for statistical analysis and graphics. This document discusses using R for text mining, including preprocessing text data through transformations such as stemming, stopword removal, and part-of-speech tagging. It also demonstrates building term-document matrices and classifying text with the k-nearest neighbours (KNN) algorithm, achieving over 90% accuracy when classifying speeches by Obama and Romney.
A short tutorial on R, aimed at beginners who want to do data mining, especially text data mining.
Related code and data can be found at the following link: http://textanalytics.in/wm/R%20tutorial%20(DATA2014).zip
1. Text Mining Infrastructure in R
Presented By
Ashraf Uddin
(http://ashrafsau.blogspot.in/)
South Asian University, New Delhi, India.
29 January 2014
2. What is R?
A free software environment for statistical computing and graphics.
Open source and package-based; modelled on the S language developed at Bell Labs
Many statistical functions are already built in
Contributed packages extend the functionality to cutting-edge research
Implementation languages: C and Fortran
3. What is R?
R is the result of a collaborative effort with contributions from all over the world
R was initially written by Robert Gentleman and Ross Ihaka—also known as "R & R"—of the Statistics Department of the University of Auckland
R was inspired by the S environment
R can be extended (easily) via packages
More about R
4. What R does and does not
R is not a database, but it connects to DBMSs
The language interpreter can be very slow, but R allows you to call your own C/C++ code
No professional / commercial support
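As a minimal sketch of the DBMS point (an assumption beyond these slides: the DBI and RSQLite packages are installed, neither being part of base R), R can query a relational database directly:

library(DBI)
con <- dbConnect(RSQLite::SQLite(), ":memory:")   # open an in-memory SQLite database
dbWriteTable(con, "speeches",                     # toy table, purely for illustration
             data.frame(speaker = c("obama", "romney"), words = c(2100, 1800)))
dbGetQuery(con, "SELECT speaker FROM speeches WHERE words > 2000")
dbDisconnect(con)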
5. Data Types in R
numeric (integer, double), complex
character
logical
data frame
factor
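A minimal sketch illustrating each of these types at the R prompt (toy values chosen for this tutorial):

x <- 42.5                                  # numeric (double)
n <- 7L                                    # integer
z <- 1 + 2i                                # complex
s <- "text mining"                         # character
flag <- TRUE                               # logical
f <- factor(c("low", "high", "low"))       # factor with two levels
df <- data.frame(id = 1:3, term = c("text", "mining", "r"))  # data frame
str(df)                                    # inspect the structure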
8. Text Mining: Basics
Text is an unstructured collection of words
Documents are the basic units, consisting of sequences of tokens or terms
Terms are words or roots of words, semantic units or phrases, which are the atoms of indexing
Repositories (databases) and corpora are collections of documents
A corpus is a conceptual entity, similar to a database, for holding and managing text documents
Text mining involves computations to gain interesting information from text
9. Text Mining: Practical Applications
Spam filtering
Business intelligence and marketing applications: predictive analytics
Sentiment analysis
Text information retrieval and indexing
Creating suggestions and recommendations (as on Amazon)
Monitoring public opinion (for example in blogs or review sites)
Customer service, email support
Automatic labelling of documents in business libraries
Fraud detection by investigating notifications of claims
Fighting cyberbullying or cybercrime in IM and IRC chat
And many more
11. Text Mining Packages in R
Corpora gsubfn kernlab KoNLP
koRpus `lda lsa maxent
movMF openNLP qdap RcmdrPlugin.temis
RKEA RTextTools Rweka skmeans
Snowball SnowballC tau textcat
Textir tm tm.plugin.dc tm.plugin.factiva
tm.plugin.mail topicmodels wordcloud
Wordnet zipfR
12. Text Mining Packages in R
plyr: tools for splitting, applying and combining data
class: various functions for classification
tm: a framework for text mining applications
corpora: statistics and data sets for corpus frequency data
Snowball: stemmers
RWeka: interface to Weka, a collection of ML algorithms for data mining tasks
wordnet: interface to WordNet using the Jawbone Java API to WordNet
wordcloud: generates word clouds
textir: a suite of tools for text and sentiment mining
tau: text analysis utilities
topicmodels: an interface to the C code for Latent Dirichlet Allocation (LDA) models and Correlated Topic Models (CTM)
zipfR: statistical models for word frequency distributions
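A minimal sketch of installing and loading the core packages used later in these slides (assuming a reachable CRAN mirror; some of the packages listed above may since have been archived):

install.packages(c("tm", "plyr", "class", "SnowballC", "wordcloud"))
library(tm)     # text mining framework
library(plyr)   # rbind.fill() and other data tools
library(class)  # knn()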
13. Conceptual process in Text Mining
Organize and structure the texts (into a repository)
Convert them into a convenient representation (preprocessing)
Transform the texts into structured formats, e.g. a term-document matrix (TDM), as sketched below
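A minimal sketch of the three steps on a toy two-document corpus, using tm's older interface as in the rest of these slides (an assumption: newer tm versions require wrapping base functions such as tolower in content_transformer()):

library(tm)
docs <- c("Text mining with R.", "R makes text mining easy.")  # toy texts
corp <- Corpus(VectorSource(docs))        # 1. organize texts into a corpus
corp <- tm_map(corp, tolower)             # 2. preprocess ...
corp <- tm_map(corp, removePunctuation)
tdm <- TermDocumentMatrix(corp)           # 3. structured format: terms x documents
inspect(tdm)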
14. The framework
Documents come in different file formats and reside in different locations
Standardized interfaces are needed to access the documents (sources)
Metadata gives valuable insights into the document structure;
the framework must facilitate metadata usage
To work efficiently with the documents, the framework must provide
tools and algorithms to perform common tasks (transformations)
and to extract patterns of interest (filtering)
15. Text document collections: Corpus
Constructor:
Corpus(object = ..., readerControl = list(reader = object@DefaultReader, language = "en_US", load = FALSE))
Example:
> txt <- system.file("texts", "txt", package = "tm")
> (ovid <- Corpus(DirSource(txt), readerControl = list(reader = readPlain, language = "la", load = TRUE)))
A corpus with 5 text documents
16. Corpus: Meta Data
> meta(ovid[[1]])
Available meta data pairs are:
  Author       :
  DateTimeStamp: 2013-11-19 18:54:04
  Description  :
  Heading      :
  ID           : ovid_1.txt
  Language     : la
  Origin       :
> ID(ovid[[1]])
[1] "ovid_1.txt"
17. Corpus: Document’s text
>ovid[[1]]
Si quis in hoc artem populo non novit amandi, hoc
legat et lecto carmine doctus amet. arte citae
veloque rates remoque moventur, arte leves currus:
arte regendus amor. curribus Automedon lentisque erat
aptus habenis, Tiphys in Haemonia puppe magister
erat: me Venus artificem tenero praefecit Amori;
Tiphys et Automedon dicar Amoris ego. ille quidem
ferus est et qui mihi saepe repugnet: sed puer est,
aetas mollis et apta regi. Phillyrides puerum cithara
perfecit Achillem, atque animos placida contudit arte
feros. qui totiens socios, totiens exterruit hostes,
creditur annosum pertimuisse senem.
18. Corpus: Meta Data
> c(ovid[1:2], ovid[3:4])
A corpus with 4 text documents
> length(ovid)
[1] 5
> summary(ovid)
A corpus with 5 text documents
The metadata consists of 2 tag-value pairs and a data frame
Available tags are: create_date creator
Available variables in the data frame are: MetaID
20. Corpus: Transformations and Filters
> getTransformations()
[1] "as.PlainTextDocument" "removeNumbers" "removePunctuation" "removeWords"
[5] "stemDocument" "stripWhitespace"
> tm_map(ovid, FUN = tolower)
A corpus with 5 text documents
> getFilters()
[1] "searchFullText" "sFilter" "tm_intersect"
> tm_filter(ovid, FUN = searchFullText, "Venus", doclevel = TRUE)
A corpus with 1 text document
21. Text Preprocessing: import
> txt <- system.file("texts", "acq", package = "tm")
> (acq <- Corpus(DirSource(txt), readerControl = list(reader = readPlain, language = "en_US", load = TRUE)))
A corpus with 50 text documents
> txt <- system.file("texts", "crude", package = "tm")
> (crude <- Corpus(DirSource(txt), readerControl = list(reader = readPlain, language = "en_US", load = TRUE)))
A corpus with 20 text documents
resulting in 50 articles on topic acq and 20 articles on topic crude
22. Preprocessing: stemming
Morphological variants of a word share a common stem. Similar terms derived from a common stem:
engineer, engineered, engineering
use, user, users, used, using
Stemming in information retrieval means grouping words with a common stem together.
For example, a search on "reads" also finds "read", "reading", and "readable".
Stemming consists of removing suffixes and conflating the resulting morphemes. Occasionally, prefixes are also removed.
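A minimal sketch reproducing this grouping on the example above, assuming the SnowballC package (not used elsewhere in these slides) and its Porter stemmer:

library(SnowballC)
wordStem(c("engineer", "engineered", "engineering"), language = "porter")
# all three variants should collapse to the shared stem "engin"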
23. Preprocessing: stemming
Reduce terms to their "roots":
automate(s), automatic, automation are all reduced to automat.
Original: for example compressed and compression are both accepted as equivalent to compress.
Stemmed: for exampl compress and compress ar both accept as equival to compress
24. Preprocessing: stemming
Typical rules in stemming:
sses → ss
ies → i
ational → ate
tional → tion
Rules sensitive to the weight (measure m) of the word, e.g.
(m > 1) EMENT → (null):
replacement → replac
cement → cement (unchanged: the stem "c" has m ≤ 1, so the rule does not fire)
25. Preprocessing: stemming
help recall for some queries but harm precision on others
Fine distinctions may be lost through stemming.
26. Preprocessing: stemming
> acq[[10]]
Gulf Applied Technologies Inc said it sold its subsidiaries engaged in pipeline and terminal operations for 12.2 mln dlrs. The company said the sale is subject to certain post closing adjustments, which it did not explain. Reuter
> stemDocument(acq[[10]])
Gulf Appli Technolog Inc said it sold it subsidiari engag in pipelin and terminal oper for 12.2 mln dlrs. The compani said the sale is subject to certain post clos adjustments, which it did not explain. Reuter
> tm_map(acq, stemDocument)
A corpus with 50 text documents
27. Preprocessing: Whitespace elimination & lower case conversion
> stripWhitespace(acq[[10]])
Gulf Applied Technologies Inc said it sold its subsidiaries engaged in pipeline and terminal operations for 12.2 mln dlrs. The company said the sale is subject to certain post closing adjustments, which it did not explain. Reuter
> tolower(acq[[10]])
gulf applied technologies inc said it sold its subsidiaries engaged in pipeline and terminal operations for 12.2 mln dlrs. the company said the sale is subject to certain post closing adjustments, which it did not explain. reuter
28. Preprocessing: Stopword removal
Very common words, such as of, and, the, are rarely of use in information retrieval.
A long stop list saves space in indexes, speeds processing, and eliminates many false hits.
However, common words are sometimes significant in information retrieval, which is an argument for a short stop list.
(Consider the query "To be or not to be?")
29. Preprocessing: Stopword removal
Include the most common words in the English language (perhaps 50 to 250 words).
Do not include words that might be important for retrieval (among the 200 most frequently occurring words in general English literature are time, war, home, life, water, and world).
In addition, include words that are very common in context (e.g., computer, information, system in a set of computing documents).
30. Preprocessing: Stopword removal
about above accordingacross actually adj after
afterwards again against all almost alone
along already also although always among
amongst an another any anyhow anyone
anything anywhere are aren't around
at be became because become becomes becoming
been before beforehand begin beginning behind
being below beside besides between beyond
billion both but by can can't
cannot caption co could couldn't
did didn't do does doesn't don't down
during each eg eight eighty
either else elsewhere end ending enough
etc even ever every everyone everything
31. Preprocessing: Stopword removal
How many words should be in the stop list?
• A long list lowers recall
Which words should be in the list?
• Some common words may have retrieval importance: war, home, life, water, world
• In certain domains, some words are very common: computer, program, source, machine, language
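Instead of a hand-built list like the one used on the next slide, tm also ships a ready-made English stop list. A minimal sketch of inspecting it (exact contents depend on the tm version):

library(tm)
length(stopwords("english"))    # size of the built-in list
head(stopwords("english"), 8)   # e.g. "i" "me" "my" "myself" "we" "our" ...
# it can be passed straight to removeWords, e.g. tm_map(acq, removeWords, stopwords("english"))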
32. Preprocessing: Stopword removal
> mystopwords <- c("and", "for", "in", "is", "it", "not", "the", "to")
> removeWords(acq[[10]], mystopwords)
Gulf Applied Technologies Inc said sold its subsidiaries engaged pipeline terminal operations 12.2 mln dlrs. The company said sale subject certain post closing adjustments, which did explain. Reuter
> tm_map(acq, removeWords, mystopwords)
A corpus with 50 text documents
37. Classification using KNN
K-Nearest Neighbor algorithm:
The most basic instance-based method
Data are represented in a vector space
Supervised learning
Given a target function f : R^n → V, where V is the finite set {v1, ..., vn},
k-NN returns the most common value of f among the k training examples nearest to the query point xq.
39. KNN Training algorithm
For each training example <x,f(x)> add the example to the list
Classification algorithm
Given a query instance xq to be classified
Let x1,..,xk k instances which are nearest to xq
Where 𝛿(a,b)=1 if a=b, else 𝛿(a,b)= 0 (Kronecker function)
40. Classification using KNN: Example
Two classes: red and blue; the green point is unknown
With k = 3, the classification is red
With k = 4, the classification is blue
41. How to determine a good value for k?
Determined experimentally (see the sketch below):
• Start with k = 1 and use a test set to estimate the error rate of the classifier
• Repeat with k = k + 2
• Choose the value of k for which the error rate is minimum
Note: k should be an odd number to avoid ties
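A minimal sketch of this search using class::knn on R's built-in iris data as a stand-in (the speech term-document matrix built in the following slides could be substituted):

library(class)
set.seed(1)
train.idx <- sample(nrow(iris), ceiling(nrow(iris) * 0.7))
train  <- iris[train.idx, 1:4]
test   <- iris[-train.idx, 1:4]
cl     <- iris$Species[train.idx]
actual <- iris$Species[-train.idx]
for (k in seq(1, 15, by = 2)) {   # odd values of k only, to avoid ties
  pred <- knn(train, test, cl, k = k)
  cat("k =", k, " error rate =", mean(pred != actual), "\n")
}
# choose the k with the lowest error rate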
42. KNN for speech classification
Dataset:
Size: 40 instances
Barack Obama: 20 speeches
Mitt Romney: 20 speeches
Training set: 70% (28 instances)
Test set: 30% (12 instances)
Accuracy: more than 90% on average
43. Speech Classification Implementation in R
# initialize the R environment
libs <- c("tm", "plyr", "class")
lapply(libs, require, character.only = TRUE)
# set parameters / source directory
dir.names <- c("obama", "romney")
path <- "E:/Ashraf/speeches"
# clean text / preprocessing
cleanCorpus <- function(corpus) {
  corpus.tmp <- tm_map(corpus, removePunctuation)
  corpus.tmp <- tm_map(corpus.tmp, stripWhitespace)
  corpus.tmp <- tm_map(corpus.tmp, tolower)
  corpus.tmp <- tm_map(corpus.tmp, removeWords, stopwords("english"))
  return(corpus.tmp)
}
44. Speech Classification Implementation in R
# build a term-document matrix per candidate
generateTDM <- function(dir.name, dir.path) {
  s.dir <- sprintf("%s/%s", dir.path, dir.name)
  s.cor <- Corpus(DirSource(directory = s.dir, encoding = "ANSI"))
  s.cor.cl <- cleanCorpus(s.cor)
  s.tdm <- TermDocumentMatrix(s.cor.cl)
  s.tdm <- removeSparseTerms(s.tdm, 0.7)  # keep only terms appearing in at least 30% of documents
  result <- list(name = dir.name, tdm = s.tdm)
}
tdm <- lapply(dir.names, generateTDM, dir.path = path)
45. Speech Classification Implementation in R
# attach the candidate name to each row of the TDM
bindCandidateToTDM <- function(tdm) {
  s.mat <- t(data.matrix(tdm[["tdm"]]))
  s.df <- as.data.frame(s.mat, stringsAsFactors = FALSE)
  s.df <- cbind(s.df, rep(tdm[["name"]], nrow(s.df)))
  colnames(s.df)[ncol(s.df)] <- "targetcandidate"
  return(s.df)
}
candTDM <- lapply(tdm, bindCandidateToTDM)
46. Speech Classification Implementation in R
# stack the TDMs together (for both Obama and Romney)
tdm.stack <- do.call(rbind.fill, candTDM)
tdm.stack[is.na(tdm.stack)] <- 0
# hold-out: split into training and test sets
train.idx <- sample(nrow(tdm.stack), ceiling(nrow(tdm.stack) * 0.7))
test.idx <- (1:nrow(tdm.stack))[-train.idx]
47. Speech Classification Implementation in R
# model with KNN
tdm.cand <- tdm.stack[, "targetcandidate"]
tdm.stack.nl <- tdm.stack[, !colnames(tdm.stack) %in% "targetcandidate"]
knn.pred <- knn(tdm.stack.nl[train.idx, ], tdm.stack.nl[test.idx, ], tdm.cand[train.idx])
# accuracy of the prediction
conf.mat <- table("Predictions" = knn.pred, Actual = tdm.cand[test.idx])
(accuracy <- (sum(diag(conf.mat)) / length(test.idx)) * 100)
# show results
show(conf.mat)
show(accuracy)
49. References
1. Ingo Feinerer, Kurt Hornik, and David Meyer, "Text Mining Infrastructure in R", Journal of Statistical Software, Vol. 25, Issue 5, March 2008.
2. http://mittromneycentral.com/speeches/
3. http://obamaspeeches.com/
4. http://cran.r-project.org/