R is a widely used programming language and software environment for statistical analysis and graphics. It was first created in 1993 at the University of Auckland as an implementation of the S language. R compiles and runs on Windows, Mac, and UNIX systems and includes packages for text mining, such as tm and SnowballC, which are used to preprocess text through steps like lowering case, removing stopwords and numbers, and stemming words. It also supports creating word clouds and sentiment analysis visualizations.
2. What is R?
R is one of the world's most widely used statistical programming languages.
R is a programming language and software environment for:
• Statistical analysis
• Graphical representation and reporting
R provides a suite of operators for calculations on arrays, lists,
vectors and matrices.
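As a small illustration of these operators (all values below are made up for the example), arithmetic in R works element-wise on vectors and matrices, while %*% gives the matrix product:

```r
# Element-wise arithmetic on vectors
x <- c(1, 2, 3)
y <- c(10, 20, 30)
vec_sum  <- x + y    # 11 22 33
vec_prod <- x * y    # 10 40 90

# Matrices: element-wise product vs. matrix product
m    <- matrix(1:4, nrow = 2)  # filled column by column: (1, 2) then (3, 4)
elem <- m * m                  # squares each entry
mat  <- m %*% m                # matrix multiplication

# Lists can hold vectors of different lengths; sapply applies a function to each
lst   <- list(a = 1:3, b = c(2.5, 3.5))
means <- sapply(lst, mean)     # a = 2, b = 3
```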
3. History
R is a programming language that began as an
implementation of the S language. R was first
designed by Ross Ihaka and Robert Gentleman
at the University of Auckland in 1993.
The latest stable release at the time of writing came out on
October 31st, 2014 (four months earlier), from the R Development
Core Team, under the GNU General Public License.
4. Introduction
R is a programming language and software environment for statistical computing
and graphics.
The R language is widely used among statisticians for developing statistical
software and for data analysis.
It compiles and runs on a wide variety of UNIX platforms, Windows and Mac OS.
R can be downloaded and installed from the CRAN website. CRAN stands for
Comprehensive R Archive Network.
5. R - Data Types
Primitive (or atomic) data types in R are:
• Numeric (integer, double)
• Complex
• Character
• Logical
• Raw
Functions are objects in R as well, but they are not an atomic type.
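A quick way to inspect these types, as a minimal sketch with made-up values: typeof() reports how a value is stored, and is.atomic() tells whether it is an atomic vector. A function is an object in R, but is.atomic() returns FALSE for it.

```r
# Storage type of some example values
t_int  <- typeof(1L)       # "integer"
t_dbl  <- typeof(1.5)      # "double"
t_cplx <- typeof(2 + 3i)   # "complex"
t_chr  <- typeof("text")   # "character"
t_lgl  <- typeof(TRUE)     # "logical"

# Atomic vectors vs. functions
vec_is_atomic <- is.atomic(c(1, 2, 3))  # TRUE
fun_is_atomic <- is.atomic(mean)        # FALSE: a function is not atomic
fun_is_fun    <- is.function(mean)      # TRUE
```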
6. Text Mining with R
R is an open source language and environment for statistical computing and
graphics. Packages like tm, SnowballC, ggplot2 and wordcloud are available for it
and are used to carry out the earlier-mentioned steps in text processing. The first
prerequisite is that R and RStudio are installed on your machine.
7. Packages Used in Text Mining
• RSQLite: SQLite interface for R
• tm: framework for text mining applications
• SnowballC: text stemming library
• wordcloud: word cloud visualizations
• syuzhet: text sentiment analysis
9. Reading SQLite data in R
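# Read the comments from a CSV file and build a corpus from the comments column
comm <- read.csv("comments.csv", header = TRUE)
docs <- Corpus(VectorSource(comm$comments))
# Get all the emails sent by Hillary
# (emailHillary is assumed to be a data frame with an EmailBody column,
# already read from the SQLite database)
emailRaw <- paste(emailHillary$EmailBody, collapse = " // ")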
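The slide does not show how emailHillary is read from SQLite. A minimal sketch of one way to do it with the DBI and RSQLite packages follows. The in-memory database, the table name Emails and the Sender column are assumptions made up for this example, not the real schema.

```r
library(DBI)

# In-memory database standing in for the real SQLite file (assumption)
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Hypothetical table and column names, for illustration only
emails <- data.frame(
  Sender    = c("Hillary Clinton", "Someone Else", "Hillary Clinton"),
  EmailBody = c("First message", "Other message", "Second message"),
  stringsAsFactors = FALSE
)
dbWriteTable(con, "Emails", emails)

# Query only the rows sent by one person
emailHillary <- dbGetQuery(
  con, "SELECT EmailBody FROM Emails WHERE Sender = 'Hillary Clinton'"
)
dbDisconnect(con)

# Collapse all message bodies into one string, as on the slide
emailRaw <- paste(emailHillary$EmailBody, collapse = " // ")
```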
10. Cleaning Text in R
install.packages("tm")
install.packages("NLP")
library("tm")  # load the text mining package
docs <- Corpus(VectorSource(emailRaw))  # a corpus is a collection of text documents
11. Processing text in R
docs <- tm_map(docs, content_transformer(tolower))  # converts all words to lower case
docs <- tm_map(docs, removeNumbers)  # removes numbers
docs <- tm_map(docs, removeWords, stopwords("english"))  # removes stop words like the, is, of
docs <- tm_map(docs, removePunctuation)  # removes punctuation
docs <- tm_map(docs, stripWhitespace)  # removes extra white space
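To see what these steps do, here is a minimal sketch that runs the same tm transformations on a single made-up sentence. It uses VCorpus rather than Corpus so the cleaned text is easy to pull back out; that choice, and the sample sentence, are assumptions for this example.

```r
library(tm)

# One-document corpus built from a made-up sentence
docs <- VCorpus(VectorSource("The 3 Quick, brown foxes!!   are running"))

docs <- tm_map(docs, content_transformer(tolower))       # lower case
docs <- tm_map(docs, removeNumbers)                      # drop digits
docs <- tm_map(docs, removeWords, stopwords("english"))  # drop stop words
docs <- tm_map(docs, removePunctuation)                  # drop punctuation
docs <- tm_map(docs, stripWhitespace)                    # collapse repeated spaces

# Extract the cleaned text of the first (and only) document
cleaned <- as.character(docs[[1]])
print(cleaned)
```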
12. SnowballC to Stem Text
#Text stemming (reduces words to their root form)
library("SnowballC")
docs <- tm_map(docs, stemDocument)
# Remove additional stopwords
docs <- tm_map(docs, removeWords, c("clintonemailcom", "stategov", "hrod"))
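Stemming can also be tried on plain words, outside a corpus. A minimal sketch, assuming the SnowballC package is installed and using made-up sample words: wordStem() reduces inflected forms to a common stem.

```r
library(SnowballC)

# Different forms of the same word
words <- c("connected", "connecting", "connection", "connections")

# Default stemmer is the Porter algorithm
stems <- wordStem(words)
print(stems)
```

All four forms collapse to one stem, which is why stemming shrinks the vocabulary before counting word frequencies.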
13. Term-Document Matrix and Word Frequencies
dtm <- TermDocumentMatrix(docs)
m <- as.matrix(dtm)
v <- sort(rowSums(m),decreasing=TRUE)
d <- data.frame(word = names(v),freq=v)
head(d, 10)
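The same steps on a tiny made-up corpus show what the frequency table holds. This is a sketch, assuming the tm package is installed; the two sample documents are invented for the example.

```r
library(tm)

# Two short made-up documents
docs <- VCorpus(VectorSource(c("apple banana apple", "banana cherry")))

# Rows are terms, columns are documents
dtm <- TermDocumentMatrix(docs)
m   <- as.matrix(dtm)

# Total count of each term across all documents, most frequent first
v <- sort(rowSums(m), decreasing = TRUE)
d <- data.frame(word = names(v), freq = v)
print(d)
```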
• Old programming
• No multithreading
• Data is loaded directly into memory, which limits functionality for larger datasets
• Sandbox…subsample data
• Microsoft working on multicore R, H2O