Statistical text mining using R
Tom Liptrot
The Christie Hospital
Motivation
Example 1: Dickens to matrix
Example 2: Electronic patient records
Dickens to Matrix: a bag of words
IT WAS the best of times, it was the worst of times,
it was the age of wisdom, it was the age of
foolishness, it was the epoch of belief, it was the
epoch of incredulity, it was the season of Light, it
was the season of Darkness, it was the spring of
hope, it was the winter of despair, we had
everything before us, we had nothing before us, we
were all going direct to Heaven, we were all going
direct the other way- in short, the period was so far
like the present period, that some of its noisiest
authorities insisted on its being received, for good or
for evil, in the superlative degree of comparison
only.
Dickens to Matrix: a matrix
Documents
Words
# Example matrix syntax: the same 4 x 2 matrix three ways
A <- matrix(c(1, rep(0, 6), 2), nrow = 4)                 # dense
library(slam)                                             # sparse triplet (tm's default)
S <- simple_triplet_matrix(i = c(1, 4), j = c(1, 2), v = c(1, 2))
library(Matrix)                                           # sparse column-compressed
M <- sparseMatrix(i = c(1, 4), j = c(1, 2), x = c(1, 2))
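As a quick sanity check, the dense and sparse forms above should hold exactly the same values; a minimal sketch, assuming the Matrix package is installed:

```r
library(Matrix)

# The same 4 x 2 matrix, dense and sparse
A <- matrix(c(1, rep(0, 6), 2), nrow = 4)
M <- sparseMatrix(i = c(1, 4), j = c(1, 2), x = c(1, 2))

# Converting the sparse form back to dense recovers A
stopifnot(all(A == as.matrix(M)))
```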
a_11 a_12 ... a_1n
a_21 a_22 ... a_2n
 ...  ...  ...  ...
a_m1 a_m2 ... a_mn
Dickens to Matrix: tm package
library(tm) #load the tm package
corpus_1 <- Corpus(VectorSource(txt)) # creates a 'corpus' from a character vector
corpus_1 <- tm_map(corpus_1, content_transformer(tolower))
corpus_1 <- tm_map(corpus_1, removeWords, stopwords("english"))
corpus_1 <- tm_map(corpus_1, removePunctuation)
corpus_1 <- tm_map(corpus_1, stemDocument)
corpus_1 <- tm_map(corpus_1, stripWhitespace)
it was the best of times, it was the worst of times, it was the age of
wisdom, it was the age of foolishness, it was the epoch of belief, it was
the epoch of incredulity, it was the season of light, it was the season of
darkness, it was the spring of hope, it was the winter of despair, we
had everything before us, we had nothing before us, we were all going
direct to heaven, we were all going direct the other way- in short, the
period was so far like the present period, that some of its noisiest
authorities insisted on its being received, for good or for evil, in the
superlative degree of comparison only.
Dickens to Matrix: lower case
library(tm)
corpus_1 <- Corpus(VectorSource(txt))
corpus_1 <- tm_map(corpus_1, content_transformer(tolower))
corpus_1 <- tm_map(corpus_1, removeWords, stopwords("english"))
corpus_1 <- tm_map(corpus_1, removePunctuation)
corpus_1 <- tm_map(corpus_1, stemDocument)
corpus_1 <- tm_map(corpus_1, stripWhitespace)
it was the best of times, it was the worst of times, it was the age of
wisdom, it was the age of foolishness, it was the epoch of belief, it
was the epoch of incredulity, it was the season of light, it was the
season of darkness, it was the spring of hope, it was the winter of
despair, we had everything before us, we had nothing before us, we
were all going direct to heaven, we were all going direct the other
way- in short, the period was so far like the present period, that
some of its noisiest authorities insisted on its being received, for
good or for evil, in the superlative degree of comparison only.
Dickens to Matrix: stopwords
library(tm)
corpus_1 <- Corpus(VectorSource(txt))
corpus_1 <- tm_map(corpus_1, content_transformer(tolower))
corpus_1 <- tm_map(corpus_1, removeWords, stopwords("english"))
corpus_1 <- tm_map(corpus_1, removePunctuation)
corpus_1 <- tm_map(corpus_1, stemDocument)
corpus_1 <- tm_map(corpus_1, stripWhitespace)
best times, worst times, age wisdom, age foolishness, epoch
belief, epoch incredulity, season light, season darkness,
spring hope, winter despair, everything us, nothing us, going
direct heaven, going direct way- short, period far like present
period, noisiest authorities insisted received, good evil,
superlative degree comparison .
Dickens to Matrix: punctuation
library(tm)
corpus_1 <- Corpus(VectorSource(txt))
corpus_1 <- tm_map(corpus_1, content_transformer(tolower))
corpus_1 <- tm_map(corpus_1, removeWords, stopwords("english"))
corpus_1 <- tm_map(corpus_1, removePunctuation)
corpus_1 <- tm_map(corpus_1, stemDocument)
corpus_1 <- tm_map(corpus_1, stripWhitespace)
best times worst times age wisdom age foolishness epoch
belief epoch incredulity season light season darkness spring
hope winter despair everything us nothing us going direct
heaven going direct way short period far like present period
noisiest authorities insisted received good evil superlative degree
comparison
Dickens to Matrix: stemming
library(tm)
corpus_1 <- Corpus(VectorSource(txt))
corpus_1 <- tm_map(corpus_1, content_transformer(tolower))
corpus_1 <- tm_map(corpus_1, removeWords, stopwords("english"))
corpus_1 <- tm_map(corpus_1, removePunctuation)
corpus_1 <- tm_map(corpus_1, stemDocument)
corpus_1 <- tm_map(corpus_1, stripWhitespace)
best time worst time age wisdom age foolish epoch
belief epoch incredul season light season dark spring hope
winter despair everyth us noth us go direct heaven go direct
way short period far like present period noisiest author insist
receiv good evil superl degre comparison
Dickens to Matrix: cleanup
library(tm)
corpus_1 <- Corpus(VectorSource(txt))
corpus_1 <- tm_map(corpus_1, content_transformer(tolower))
corpus_1 <- tm_map(corpus_1, removeWords, stopwords("english"))
corpus_1 <- tm_map(corpus_1, removePunctuation)
corpus_1 <- tm_map(corpus_1, stemDocument)
corpus_1 <- tm_map(corpus_1, stripWhitespace)
best time worst time age wisdom age foolish epoch belief epoch
incredul season light season dark spring hope winter despair everyth
us noth us go direct heaven go direct way short period far like present
period noisiest author insist receiv good evil superl degre comparison
Dickens to Matrix: Term Document Matrix
tdm <- TermDocumentMatrix(corpus_1)
<<TermDocumentMatrix (terms: 35, documents: 1)>>
Non-/sparse entries: 35/0
Sparsity : 0%
Maximal term length: 10
Weighting : term frequency (tf)
class(tdm)
[1] "TermDocumentMatrix" "simple_triplet_matrix"
dim(tdm)
[1] 35 1
age 2 epoch 2 insist 1 short 1
author 1 everyth 1 light 1 spring 1
belief 1 evil 1 like 1 superl 1
best 1 far 1 noisiest 1 time 2
comparison 1 foolish 1 noth 1 way 1
dark 1 good 1 period 2 winter 1
degre 1 heaven 1 present 1 wisdom 1
despair 1 hope 1 receiv 1 worst 1
direct 2 incredul 1 season 2
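The counts above can be pulled straight out of the object; a small sketch, assuming the tm package is installed (a one-line stand-in `txt` replaces the full passage from the earlier slides):

```r
library(tm)

txt <- "it was the best of times, it was the worst of times"  # stand-in text
corpus_1 <- Corpus(VectorSource(txt))
tdm <- TermDocumentMatrix(corpus_1)

findFreqTerms(tdm, 2)  # terms occurring at least twice
as.matrix(tdm)         # dense view: one row per term, one column per document
```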
Dickens to Matrix: Ngrams
library(RWeka)
four_gram_tokeniser <- function(x) {
RWeka::NGramTokenizer(x, RWeka::Weka_control(min = 1, max = 4))
}
tdm_4gram <- TermDocumentMatrix(corpus_1,
control = list(tokenize = four_gram_tokeniser))
dim(tdm_4gram)
[1] 163 1
age 2 author insist receiv good 1 dark 1
age foolish 1 belief 1 dark spring 1
age foolish epoch 1 belief epoch 1 dark spring hope 1
age foolish epoch belief 1 belief epoch incredul 1 dark spring hope winter 1
age wisdom 1 belief epoch incredul season 1 degre 1
age wisdom age 1 best 1 degre comparison 1
age wisdom age foolish 1 best time 1 despair 1
author 1 best time worst 1 despair everyth 1
author insist 1 best time worst time 1 despair everyth us 1
author insist receiv 1 comparison 1 despair everyth us noth 1
Electronic patient records: Gathering structured medical data
Doctor enters structured data directly
Electronic patient records: Gathering structured medical data
Trained staff extract structured data from typed notes
Doctor enters structured data directly
Electronic patient records: example text
Diagnosis: Oesophagus lower third squamous cell carcinoma, T3 N2 M0
History: X year old lady who presented with progressive dysphagia since
X and was known at X Hospital. She underwent an endoscopy which
found a tumour which was biopsied and is a squamous cell carcinoma.
A staging CT scan picked up a left upper lobe nodule. She then went on
to have an EUS at X this was performed by Dr X and showed an early
T3 tumour at 35-40cm of 4 small 4-6mm para-oesophageal nodes,
between 35-40cm. There was a further 7.8mm node in the AP window at
27cm, the carina was measured at 28cm and aortic arch at 24cm, the
conclusion T3 N2 M0. A subsequent PET CT scan was arranged-see
below. She can manage a soft diet such as Weetabix, soft toast, mashed
potato and gets occasional food stuck. Has lost half a stone in weight
and is supplementing with 3 Fresubin supplements per day.
Performance score is 1.
Electronic patient records: targets
Diagnosis: Oesophagus lower third squamous cell carcinoma, T3 N2 M0
History: X year old lady who presented with progressive dysphagia since X and was known at X
Hospital. She underwent an endoscopy which found a tumour which was biopsied and is a
squamous cell carcinoma. A staging CT scan picked up a left upper lobe nodule. She then went
on to have an EUS at X this was performed by Dr X and showed an early T3 tumour at 35-40cm of
4 small 4-6mm para-oesophageal nodes, between 35-40cm. There was a further 7.8mm node in
the AP window at 27cm, the carina was measured at 28cm and aortic arch at 24cm, the conclusion
T3 N2 M0. A subsequent PET CT scan was arranged-see below. She can manage a soft diet such
as Weetabix, soft toast, mashed potato and gets occasional food stuck. Has lost half a stone in
weight and is supplementing with 3 Fresubin supplements per day. Performance score is 1.
Electronic patient records: steps
1. Identify patients where we have both structured data and notes (c.20k)
2. Extract notes and structured data from SQL database
3. Make term document matrix (as shown previously) (60m x 20k)
4. Split data into training and development set
5. Train classification model using training set
6. Assess performance and tune model using development set
7. Evaluate system performance on independent dataset
8. Use system to extract structured data where we have none
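Step 4 above can be sketched in base R; the sizes, the toy count matrix and the labels here are made up stand-ins for the real c.20k patients:

```r
set.seed(1)
n <- 1000                                    # stand-in for c.20k patients
tdm <- matrix(rpois(n * 50, 0.1), nrow = n)  # toy document-term counts
site <- sample(c("Oesophagus", "Other"), n, replace = TRUE)

dev_idx <- sample(n, size = round(0.2 * n))  # hold out 20% for development
x_train <- tdm[-dev_idx, ]; y_train <- site[-dev_idx]
x_dev   <- tdm[dev_idx, ];  y_dev   <- site[dev_idx]

# Every patient lands in exactly one of the two sets
stopifnot(nrow(x_train) + nrow(x_dev) == n)
```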
Electronic patient records: predicting disease site using the elastic net
# Fits an elastic net model, classifying into oesophagus or not,
# selecting lambda through cross-validation
library(glmnet)
dim(tdm) # 22,843 documents, 677,017 Ngrams
# Note: tdm must be either a matrix or a sparseMatrix, NOT a
# simple_triplet_matrix
mod_oeso <- cv.glmnet(x = tdm,
                      y = disease_site == 'Oesophagus',
                      family = "binomial")
\hat{\beta} = \underset{\beta}{\operatorname{argmin}} \; \lVert y - X\beta \rVert^2 + \lambda_2 \lVert \beta \rVert^2 + \lambda_1 \lVert \beta \rVert_1
OLS + RIDGE + LASSO
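Once a cv.glmnet fit exists, new documents are scored with predict; a sketch on simulated data, assuming glmnet is installed (the matrix, outcome and sizes here are made up, not the slides' real data):

```r
library(glmnet)

set.seed(1)
x <- matrix(rnorm(200 * 10), nrow = 200)  # toy "term count" matrix
y <- rbinom(200, 1, plogis(x[, 1]))       # binary outcome driven by column 1
mod <- cv.glmnet(x = x, y = y, family = "binomial")

# Predicted probabilities at the cross-validated 1-SE lambda
p <- predict(mod, newx = x, type = "response", s = "lambda.1se")
stopifnot(all(p >= 0 & p <= 1))
```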
Electronic patient records: The Elastic Net
# Plots non-zero coefficients from the elastic net model
coefs <- coef(mod_oeso, s = mod_oeso$lambda.1se)[, 1]
coefs <- coefs[coefs != 0]
coefs <- coefs[order(abs(coefs), decreasing = TRUE)]
barplot(coefs[-1], horiz = TRUE, col = 2) # [-1] drops the intercept
P(site = 'Oesophagus') = 0.03
Electronic patient records: classification performance: primary disease site
Training set = 20,000
Test set = 4,000 patients
80% of patients can be classified with 95% accuracy (the remaining 20% can be done by human abstractors)
Next step is a full formal evaluation on an independent dataset
Working in combination with a rules-based approach from Manchester University
AUC = 90%
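The 80%-at-95%-accuracy figure reflects a triage rule: auto-classify only the confident predictions and pass the rest to human abstractors. A toy sketch of how coverage and accuracy trade off (the probabilities, labels and thresholds below are made up for illustration):

```r
p     <- c(0.02, 0.97, 0.55, 0.88, 0.40, 0.99)  # toy model probabilities
truth <- c(0, 1, 0, 1, 0, 1)                    # toy true labels

confident <- p < 0.1 | p > 0.9                  # auto-classify only these
coverage  <- mean(confident)                    # share handled automatically
accuracy  <- mean((p[confident] > 0.5) == (truth[confident] == 1))
```

Tightening the thresholds raises accuracy on the auto-classified subset at the cost of sending more cases to the human abstractors.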
Electronic patient records: Possible extensions
• Classification (hierarchical)
• Cluster analysis (KNN)
• Time
• Survival
• Drug toxicity
• Quality of life
Thanks
Tom.liptrot@christie.nhs.uk
Books example
library(RCurl)  # getURL
library(XML)    # htmlParse, xpathSApply, xmlGetAttr
library(plyr)   # alply, llply, aaply

get_links <- function(address, link_prefix = '', link_suffix = ''){
page <- getURL(address)
# Parse the HTML
tree <- htmlParse(page)
## Get all link elements
links <- xpathSApply(tree, path = "//*/a",
fun = xmlGetAttr, name = "href")
## Convert to vector
links <- unlist(links)
## Add prefix and suffix
paste0(link_prefix, links, link_suffix)
}
links_authors <- get_links("http://textfiles.com/etext/AUTHORS/",
link_prefix = 'http://textfiles.com/etext/AUTHORS/',
link_suffix = '/')
links_text <- alply(links_authors, 1, function(.x){
get_links(.x, link_prefix = .x, link_suffix = '')
})
books <- llply(links_text, function(.x){
aaply(.x, 1, getURL)
})
Principal components analysis
## Code to get the first n principal components from a large sparse
## term document matrix of class dgCMatrix
library(irlba)
n <- 5          # number of components to calculate
m <- nrow(tdm)  # 110703 terms in tdm matrix
xt.x <- crossprod(tdm)
x.means <- colMeans(tdm)
xt.x <- (xt.x - m * tcrossprod(x.means)) / (m - 1)
svd <- irlba(xt.x, nu = 0, nv = n, tol = 1e-10)
[PCA plot: documents on PC2 vs PC3 (log scales), coloured by author:
ARISTOTLE, BURROUGHS, DICKENS, KANT, PLATO, SHAKESPEARE]

plot(svd$v[, c(2, 3)] + 1,  # +1 shifts values so the log axes are defined
     col = books_df$author,
     log = 'xy',
     xlab = 'PC2',
     ylab = 'PC3')
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
 
04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptx
 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdf
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFx
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysis
 
Carero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxCarero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptx
 
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdf
 

R user group presentation

  • 1. Statistical text mining using R. Tom Liptrot, The Christie Hospital.
  • 4. Motivation. Example 1: Dickens to matrix. Example 2: Electronic patient records.
  • 5. Dickens to Matrix: a bag of words IT WAS the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, it was the epoch of belief, it was the epoch of incredulity, it was the season of Light, it was the season of Darkness, it was the spring of hope, it was the winter of despair, we had everything before us, we had nothing before us, we were all going direct to Heaven, we were all going direct the other way- in short, the period was so far like the present period, that some of its noisiest authorities insisted on its being received, for good or for evil, in the superlative degree of comparison only.
  • 6. Dickens to Matrix: a matrix (rows = words, columns = documents)
# Example matrix syntax
A = matrix(c(1, rep(0, 6), 2), nrow = 4)
library(slam)
S = simple_triplet_matrix(c(1, 4), c(1, 2), c(1, 2))
library(Matrix)
M = sparseMatrix(i = c(1, 4), j = c(1, 2), x = c(1, 2))
\[ \begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{pmatrix} \]
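All three objects on this slide store the same 4 x 2 matrix, just with different storage schemes; a quick sketch of checking that equivalence (the dense and Matrix forms can be compared directly):

```r
library(Matrix)
library(slam)

# dense 4 x 2 matrix: 1 in row 1 column 1, 2 in row 4 column 2
A <- matrix(c(1, rep(0, 6), 2), nrow = 4)

# the same values stored sparsely, as (row, column, value) triplets
S <- simple_triplet_matrix(c(1, 4), c(1, 2), c(1, 2))
M <- sparseMatrix(i = c(1, 4), j = c(1, 2), x = c(1, 2))

# same values, different storage: only the non-zeros are kept in S and M
all(A == as.matrix(M)) # TRUE
all(A == as.matrix(S)) # TRUE
```

For a term document matrix with millions of cells and mostly zeros, the sparse forms are what make the later modelling steps feasible in memory.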
  • 7. Dickens to Matrix: preprocessing with the tm package (slides 7-12 apply one step at a time; the code, repeated on each slide, is shown once here, followed by the output of each stage)
library(tm) # load the tm package
corpus_1 <- Corpus(VectorSource(txt)) # creates a 'corpus' from a vector
corpus_1 <- tm_map(corpus_1, content_transformer(tolower)) # lower-case
corpus_1 <- tm_map(corpus_1, removeWords, stopwords("english")) # remove stop words
corpus_1 <- tm_map(corpus_1, removePunctuation) # remove punctuation
corpus_1 <- tm_map(corpus_1, stemDocument) # stem each word
corpus_1 <- tm_map(corpus_1, stripWhitespace) # collapse whitespace
After tolower: it was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, it was the epoch of belief, it was the epoch of incredulity, it was the season of light, it was the season of darkness, it was the spring of hope, it was the winter of despair, we had everything before us, we had nothing before us, we were all going direct to heaven, we were all going direct the other way- in short, the period was so far like the present period, that some of its noisiest authorities insisted on its being received, for good or for evil, in the superlative degree of comparison only.
After removeWords (stop words): best times, worst times, age wisdom, age foolishness, epoch belief, epoch incredulity, season light, season darkness, spring hope, winter despair, everything us, nothing us, going direct heaven, going direct way- short, period far like present period, noisiest authorities insisted received, good evil, superlative degree comparison.
After removePunctuation: best times worst times age wisdom age foolishness epoch belief epoch incredulity season light season darkness spring hope winter despair everything us nothing us going direct heaven going direct way short period far like present period noisiest authorities insisted received good evil superlative degree comparison
After stemDocument and stripWhitespace: best time worst time age wisdom age foolish epoch belief epoch incredul season light season dark spring hope winter despair everyth us noth us go direct heaven go direct way short period far like present period noisiest author insist receiv good evil superl degre comparison
  • 13. Dickens to Matrix: Term Document Matrix
tdm <- TermDocumentMatrix(corpus_1)
<<TermDocumentMatrix (terms: 35, documents: 1)>>
Non-/sparse entries: 35/0
Sparsity: 0%
Maximal term length: 10
Weighting: term frequency (tf)
class(tdm)
[1] "TermDocumentMatrix" "simple_triplet_matrix"
dim(tdm)
[1] 35 1
Term frequencies: age 2, direct 2, epoch 2, period 2, season 2, time 2; the remaining 29 terms (author, belief, best, comparison, dark, degre, despair, everyth, evil, far, foolish, good, heaven, hope, incredul, insist, light, like, noisiest, noth, present, receiv, short, spring, superl, way, winter, wisdom, worst) each appear once.
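Once the term document matrix exists, tm's own helpers can query it directly; a minimal sketch, assuming the `tdm` built above is in scope:

```r
library(tm)

# terms that occur at least twice in the corpus
# (here: age, direct, epoch, period, season, time)
findFreqTerms(tdm, lowfreq = 2)

# a dense view of the counts; fine for this tiny example,
# but avoid materialising large sparse matrices like this
head(as.matrix(tdm))
```

`findFreqTerms` is part of the tm package; for large corpora, `removeSparseTerms` is the usual companion for trimming rarely occurring terms before modelling.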
  • 15. Dickens to Matrix: Ngrams
library(RWeka)
four_gram_tokeniser <- function(x) {
  RWeka::NGramTokenizer(x, RWeka::Weka_control(min = 1, max = 4))
}
tdm_4gram <- TermDocumentMatrix(corpus_1, control = list(tokenize = four_gram_tokeniser))
dim(tdm_4gram)
[1] 163 1
A sample of the resulting terms and counts: age 2; age foolish 1; age foolish epoch 1; age foolish epoch belief 1; age wisdom 1; age wisdom age 1; age wisdom age foolish 1; author 1; author insist 1; author insist receiv 1; author insist receiv good 1; belief 1; belief epoch 1; belief epoch incredul 1; belief epoch incredul season 1; best 1; best time 1; best time worst 1; best time worst time 1; comparison 1; dark 1; dark spring 1; dark spring hope 1; dark spring hope winter 1; degre 1; degre comparison 1; despair 1; despair everyth 1; despair everyth us 1; despair everyth us noth 1; ...
  • 16. Electronic patient records: Gathering structured medical data Doctor enters structured data directly
  • 17. Electronic patient records: Gathering structured medical data Trained staff extract structured data from typed notes Doctor enters structured data directly
  • 18. Electronic patient records: example text Diagnosis: Oesophagus lower third squamous cell carcinoma, T3 N2 M0 History: X year old lady who presented with progressive dysphagia since X and was known at X Hospital. She underwent an endoscopy which found a tumour which was biopsied and is a squamous cell carcinoma. A staging CT scan picked up a left upper lobe nodule. She then went on to have an EUS at X this was performed by Dr X and showed an early T3 tumour at 35-40cm of 4 small 4-6mm para-oesophageal nodes, between 35-40cm. There was a further 7.8mm node in the AP window at 27cm, the carina was measured at 28cm and aortic arch at 24cm, the conclusion T3 N2 M0. A subsequent PET CT scan was arranged-see below. She can manage a soft diet such as Weetabix, soft toast, mashed potato and gets occasional food stuck. Has lost half a stone in weight and is supplementing with 3 Fresubin supplements per day. Performance score is 1.
  • 19. Electronic patient records: targets (the same example text as slide 18, with the structured-data targets to be extracted highlighted)
  • 20. Electronic patient records: steps
1. Identify patients where we have both structured data and notes (c. 20k)
2. Extract notes and structured data from the SQL database
3. Make a term document matrix, as shown previously (60m x 20k)
4. Split the data into training and development sets
5. Train a classification model using the training set
6. Assess performance and tune the model using the development set
7. Evaluate system performance on an independent dataset
8. Use the system to extract structured data where we have none
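Step 4 above is the usual random split on document indices; a minimal sketch under assumed names (`n_docs`, `tdm`, `disease_site` stand in for the objects on the surrounding slides):

```r
# split document indices 80/20 into training and development sets
set.seed(42) # for a reproducible split
n_docs <- 22843 # number of documents, per the dim(tdm) shown later
train_idx <- sample(n_docs, size = round(0.8 * n_docs))
dev_idx <- setdiff(seq_len(n_docs), train_idx)

# the term document matrix and labels would then be subset as
# tdm[train_idx, ] and disease_site[train_idx] for training,
# with the dev_idx rows held back for tuning
```

A stratified split (sampling within each disease site) would be a reasonable refinement if some sites are rare.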
  • 21. Electronic patient records: predicting disease site using the elastic net
# fits an elastic net model classifying into oesophagus or not,
# selecting lambda through cross-validation
library(glmnet)
dim(tdm) # 22,843 documents, 677,017 ngrams
# note: tdm must be either a matrix or a sparseMatrix, NOT a simple_triplet_matrix
mod_oeso <- cv.glmnet(x = tdm, y = disease_site == 'Oesophagus', family = "binomial")
\[ \hat{\beta} = \underset{\beta}{\operatorname{argmin}} \; \|y - X\beta\|^2 + \lambda_2 \|\beta\|^2 + \lambda_1 \|\beta\|_1 \]
(OLS + ridge + lasso)
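Once fitted, a `cv.glmnet` object can score new documents with `predict`; a sketch, where `tdm_new` is a hypothetical sparse matrix of new notes with the same ngram columns as the training matrix:

```r
library(glmnet)

# probability that each new document is an oesophagus case,
# using the lambda chosen by the one-standard-error rule
p_oeso <- predict(mod_oeso, newx = tdm_new,
                  s = "lambda.1se", type = "response")

# p_oeso[i] is P(site = 'Oesophagus') for document i
```

`type = "response"` returns probabilities on the 0-1 scale; `type = "class"` would return hard labels instead.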
  • 22. Electronic patient records: the elastic net
# plots non-zero coefficients from the elastic net model
coefs <- coef(mod_oeso, s = mod_oeso$lambda.1se)[, 1]
coefs <- coefs[coefs != 0]
coefs <- coefs[order(abs(coefs), decreasing = TRUE)]
barplot(coefs[-1], horiz = TRUE, col = 2)
P(site = 'Oesophagus') = 0.03
  • 23. Electronic patient records: classification performance, primary disease site. Training set = 20,000 patients; test set = 4,000 patients. AUC = 90%. 80% of patients can be classified with 95% accuracy (the remaining 20% can be handled by human abstractors). The next step is a full formal evaluation on an independent dataset, working in combination with a rules-based approach from Manchester University.
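The quoted AUC could be computed from the development-set predictions with, for example, the pROC package; a sketch under assumed names (`y_dev` is the true oesophagus indicator and `p_dev` the predicted probabilities from the previous slides, neither shown in the deck):

```r
library(pROC)

# ROC curve and area under it for the development set
roc_obj <- roc(response = y_dev, predictor = as.vector(p_dev))
auc(roc_obj)

# the "80% classified with 95% accuracy" figure corresponds to
# thresholding p_dev and abstaining on the uncertain middle band
```

`roc` and `auc` are the core pROC entry points; `coords(roc_obj, "best")` would give the threshold maximising sensitivity plus specificity.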
  • 24. Electronic patient records: Possible extensions • Classification (hierarchical) • Cluster analysis (KNN) • Time • Survival • Drug toxicity • Quality of life
  • 26. Books example
# scrape plain-text books from textfiles.com
library(RCurl) # getURL
library(XML) # htmlParse, xpathSApply, xmlGetAttr
library(plyr) # alply, llply, aaply
get_links <- function(address, link_prefix = '', link_suffix = ''){
  page <- getURL(address)
  # convert to an R tree
  tree <- htmlParse(page)
  ## get all link elements
  links <- xpathSApply(tree, path = "//*/a", fun = xmlGetAttr, name = "href")
  ## convert to vector
  links <- unlist(links)
  ## add prefix and suffix
  paste0(link_prefix, links, link_suffix)
}
links_authors <- get_links("http://textfiles.com/etext/AUTHORS/",
                           link_prefix = 'http://textfiles.com/etext/AUTHORS/',
                           link_suffix = '/')
links_text <- alply(links_authors, 1, function(.x){
  get_links(.x, link_prefix = .x, link_suffix = '')
})
books <- llply(links_text, function(.x){
  aaply(.x, 1, getURL)
})
  • 28. Principal components analysis
## Code to get the first n principal components
## from a large sparse term document matrix of class dgCMatrix
library(irlba)
n <- 5 # number of components to calculate
m <- nrow(tdm) # 110703 terms in tdm matrix
xt.x <- crossprod(tdm)
x.means <- colMeans(tdm)
xt.x <- (xt.x - m * tcrossprod(x.means)) / (m - 1)
svd <- irlba(xt.x, nu = 0, nv = n, tol = 1e-10)
  • 29. PCA plot [scatter of documents on PC2 vs PC3, log scale, points coloured by author: ARISTOTLE, BURROUGHS, DICKENS, KANT, PLATO, SHAKESPEARE]
plot(svd$v[, c(2, 3)] + 1, col = books_df$author, log = 'xy', xlab = 'PC2', ylab = 'PC3')

Editor's Notes

  1. Aims: make people want to try this themselves; get feedback or suggestions from those who have done this already.