I wrote this brief lecture to show the reader how simple it is to use R and its packages (such as 'twitteR') for powerful data mining exercises and analyses, as in this text mining example.
Brief Lecture on Text Mining and Social Network Analysis with R, by Deolu Adeleye
Text Mining, Social Network Analysis
Deolu Adeleye
Text Mining
Just as we can mine raw materials from ores, we can also intelligently ‘mine’ textual data from groups of data.
Once again, R proves to be a very powerful tool, with packages such as twitteR proving quite useful, as we’ll soon
demonstrate.
As a demonstration, we’ll be examining mining textual information from the popular social network Twitter. We’ll
be examining tweets from the Twitter handle ‘@55wordsorless’ (though you could use any handle of your choice when
running the code).
Do note that these demonstrations will require an active internet connection (at least in the beginning to authenticate),
and will be using the following R packages:
• twitteR
• tm
• wordcloud
• SnowballC
• RWeka
• igraph
The first step is to create a Twitter application for yourself. Go to https://twitter.com/apps/new and log in. After
filling in the basic info, go to the “Settings” tab and select “Read, Write and Access direct messages”. Make sure to
click on the save button after doing this. In the “Details” tab, take note of the following:
• your consumer key
• your consumer secret
• your access token
• your access secret
Once you have these four values, simply pass them to the setup_twitter_oauth function in the form
setup_twitter_oauth("API key", "API secret", "Access token", "Access secret"). Here’s ours with the corresponding
values inserted:
#load the twitteR package
library(twitteR)
#authenticate
setup_twitter_oauth(our_key,
our_secret,
our_token,
our_access_secret)
## [1] "Using direct authentication"
You only need to authenticate once per R session.
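The variables our_key, our_secret, our_token and our_access_secret are not defined anywhere in this handout. One way to supply them without typing secrets into a script is to read them from environment variables. The sketch below is only a suggestion: the environment variable names and the helper read_twitter_credentials are made up for illustration and are not part of twitteR.

```r
# Hypothetical helper: reads the four credentials from environment variables.
# The variable names below are assumptions; set them yourself, for example in
# your ~/.Renviron file, so the secrets never appear in shared code.
read_twitter_credentials <- function() {
  vars <- c(key           = "TWITTER_CONSUMER_KEY",
            secret        = "TWITTER_CONSUMER_SECRET",
            token         = "TWITTER_ACCESS_TOKEN",
            access_secret = "TWITTER_ACCESS_SECRET")
  creds <- Sys.getenv(unname(vars), unset = NA)
  names(creds) <- names(vars)
  if (anyNA(creds)) {
    stop("Missing environment variable(s): ",
         paste(vars[is.na(creds)], collapse = ", "))
  }
  as.list(creds)
}

# Usage (requires the twitteR package and valid credentials):
# creds <- read_twitter_credentials()
# setup_twitter_oauth(creds$key, creds$secret, creds$token, creds$access_secret)
```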
So, we’ve authenticated. Next, let’s search for a particular hashtag, say ‘#water’, wherever it was used
recently on Twitter.
#retrieve last 50 tweets where hashtag '#water' is used, for example
watertag<-searchTwitter('#water', n=50)
head(watertag,3)
## [[1]]
## [1] "FrozenMOVlE: #vsco #afterlight #winter #wisconsin #water #lake #michigan #frozen #milwaukee #city http:/
##
## [[2]]
## [1] "FrozenMOVlE: Taking over my brothers place <ed><U+00A0><U+00BD><ed><U+00B8><U+0082><ed><U+00A0><U+00BD
##
## [[3]]
## [1] "vikashprasad21: RT @WaterNetwork1: #DSRSD #Certified For #Water #Quality #Testing http://t.co/bkF2mz47q
Next, let’s get info from the particular user ‘@55wordsorless’:
#retrieve the last 100 tweets from the specified timeline
tweets <- userTimeline('55wordsorless', n=100)
head(tweets,3)
## [[1]]
## [1] "55WordsOrLess: @_missjem_ Someone already has...though how you could borrow Dr. Who's TARDIS is your next
##
## [[2]]
## [1] "55WordsOrLess: Join the conversation!! http://t.co/xqDAtWbzVK :D :D"
##
## [[3]]
## [1] "55WordsOrLess: @BeautifulFeet_ Did you read the 'Mischievous' Thoughts as well? :D"
For our purposes, we’ll convert these into a data.frame object:
watertag_df <- twListToDF(watertag)
tweets_df <- twListToDF(tweets)
head(watertag_df,3)
##
## 1 #vsco #afterlight #winter #wisconsin #water #lake #michiga
## 2 Taking over my brothers place <ed><U+00A0><U+00BD><ed><U+00B8><U+0082><ed><U+00A0><U+00BD><ed><U+00B8><U+
## 3 RT @WaterNetwork1: #DSRSD #Certified For #Wat
## favorited favoriteCount replyToSN created truncated
## 1 FALSE 0 <NA> 2015-01-02 21:06:40 FALSE
## 2 FALSE 0 <NA> 2015-01-02 21:06:21 FALSE
## 3 FALSE 0 <NA> 2015-01-02 21:05:22 FALSE
## replyToSID id replyToUID
## 1 <NA> 551122427166875648 <NA>
## 2 <NA> 551122347814825984 <NA>
## 3 <NA> 551122099348054016 <NA>
## statusSource
## 1 <a href="http://ifttt.com" rel="nofollow">IFTTT</a>
## 2 <a href="http://ifttt.com" rel="nofollow">IFTTT</a>
## 3 <a href="http://spinabell.com" rel="nofollow">spinabell</a>
## screenName retweetCount isRetweet retweeted longitude latitude
## 1 FrozenMOVlE 0 FALSE FALSE <NA> <NA>
## 2 FrozenMOVlE 0 FALSE FALSE <NA> <NA>
## 3 vikashprasad21 1 TRUE FALSE <NA> <NA>
head(tweets_df,3)
## text
## 1 @_missjem_ Someone already has...though how you could borrow Dr. Who's TARDIS is your next difficulty... :|
## 2 Join the conversation!! http://t.co/xqDAtWbzVK :D :D
## 3 @BeautifulFeet_ Did you read the 'Mischievous' Thoughts as well? :D
## favorited favoriteCount replyToSN created truncated
## 1 FALSE 0 _MissJem_ 2014-11-28 18:03:43 FALSE
## 2 FALSE 0 <NA> 2014-10-28 14:19:27 FALSE
## 3 FALSE 0 BeautifulFeet_ 2014-10-09 19:48:46 FALSE
## replyToSID id replyToUID
## 1 538279840193839108 538392811075559424 434366153
## 2 <NA> 527102349727531008 <NA>
## 3 <NA> 520299852824334337 92370873
## statusSource
## 1 <a href="https://mobile.twitter.com" rel="nofollow">Mobile Web (M2)</a>
## 2 <a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>
## 3 <a href="http://twitter.com/download/android" rel="nofollow">Twitter for Android</a>
## screenName retweetCount isRetweet retweeted longitude latitude
## 1 55WordsOrLess 0 FALSE FALSE NA NA
## 2 55WordsOrLess 0 FALSE FALSE NA NA
## 3 55WordsOrLess 0 FALSE FALSE NA NA
After that, we’ll convert to a corpus (which is just a collection of text documents) using the tm package:
library(tm)
#build a corpus, and specify the source to be character vectors
watertag_corpus <- Corpus(VectorSource(watertag_df$text))
tweets_corpus <- Corpus(VectorSource(tweets_df$text))
The corpus allows us to perform certain manipulations with functions in the tm package. You should run ?Corpus
to see other possible sources of textual data you can harness.
Let’s proceed by first ‘cleaning’ our data:
#make a copy, just in case we might need the original later
watertag_1 <- watertag_corpus
tweets_1 <- tweets_corpus
# remove punctuation
watertag_corpus <- tm_map(watertag_corpus, removePunctuation)
tweets_corpus <- tm_map(tweets_corpus, removePunctuation)
# remove numbers
watertag_corpus <- tm_map(watertag_corpus, removeNumbers)
tweets_corpus <- tm_map(tweets_corpus, removeNumbers)
# convert to lower case
watertag_corpus <- tm_map(watertag_corpus, content_transformer(tolower))
tweets_corpus <- tm_map(tweets_corpus, content_transformer(tolower))
# remove whitespace
watertag_corpus <- tm_map(watertag_corpus, stripWhitespace)
tweets_corpus <- tm_map(tweets_corpus, stripWhitespace)
# remove stopwords such as 'you', 'me', etc.
watertag_corpus <- tm_map(watertag_corpus, removeWords, stopwords("english"))
tweets_corpus <- tm_map(tweets_corpus, removeWords, stopwords("english"))
# remove URLs
# We'll create a function to look for 'http' in our text, and then delete the links
removeURL <- content_transformer(function(x) gsub("http[[:alnum:]]*", "", x))
watertag_corpus <- tm_map(watertag_corpus, removeURL)
tweets_corpus <- tm_map(tweets_corpus, removeURL)
#inspect our results
inspect(head(watertag_corpus,3))
## <<VCorpus (documents: 3, metadata (corpus/indexed): 0/0)>>
##
## [[1]]
## <<PlainTextDocument (metadata: 7)>>
## vsco afterlight winter wisconsin water lake michigan frozen milwaukee city
##
## [[2]]
## <<PlainTextDocument (metadata: 7)>>
## taking brothers place <U+653C><U+3E64><U+613C><U+3E30><U+623C><U+3E64><U+653C><U+3E64><U+623C><U+3E38><U
##
## [[3]]
## <<PlainTextDocument (metadata: 7)>>
## rt waternetwork dsrsd certified water quality testing
inspect(head(tweets_corpus,3))
## <<VCorpus (documents: 3, metadata (corpus/indexed): 0/0)>>
##
## [[1]]
## <<PlainTextDocument (metadata: 7)>>
## missjem someone already hasthough borrow dr whos tardis next difficulty
##
## [[2]]
## <<PlainTextDocument (metadata: 7)>>
## join conversation d d
##
## [[3]]
## <<PlainTextDocument (metadata: 7)>>
## beautifulfeet read mischievous thoughts well d
Other transformations possible with tm_map can be obtained by running getTransformations().
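To see what each cleaning step does to a single tweet, here is a rough base-R imitation of the pipeline above. It is only a sketch: the function clean_text and its tiny stop-word list are made up for illustration (tm's stopwords("english") is much longer), and whitespace is collapsed last rather than in the middle, so results may differ slightly from tm on other inputs.

```r
# Base-R stand-ins for the tm transformations used above (illustrative only)
clean_text <- function(x, stop_words = c("the", "is", "you", "me", "a")) {
  x <- gsub("[[:punct:]]+", "", x)       # roughly removePunctuation
  x <- gsub("[[:digit:]]+", "", x)       # roughly removeNumbers
  x <- tolower(x)                        # same as content_transformer(tolower)
  x <- gsub("http[[:alnum:]]*", "", x)   # same pattern as removeURL above
  # drop whole stop words only, roughly removeWords
  pattern <- paste0("\\b(", paste(stop_words, collapse = "|"), ")\\b")
  x <- gsub(pattern, "", x, perl = TRUE)
  x <- gsub("\\s+", " ", x)              # roughly stripWhitespace
  trimws(x)
}

cleaned <- clean_text("Join the conversation!! http://t.co/xqDAtWbzVK :D :D")
print(cleaned)
```

Note why the URL pattern works even though it has no slashes or dots in it: by the time it runs, the punctuation has already been removed, so the link has collapsed into one alphanumeric run beginning with 'http'.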
In many applications, words need to be stemmed to retrieve their radicals, so that various forms derived from a stem
would be taken as the same when counting word frequency. Stemming uses an algorithm that removes common word
endings for English words, such as “es”, “ed” and “’s”. For instance, words “update”, “updated” and “updating”
would all be stemmed to “updat”. It’s not mandatory (and sometimes it may be counter-productive), but it does
pay to understand what it does, so we’ll demonstrate:
# create a copy we'll stem
watertag_stemmed <- watertag_corpus
tweets_stemmed <- tweets_corpus
# stem words
library(SnowballC)
watertag_stemmed <- tm_map(watertag_stemmed, stemDocument)
tweets_stemmed <- tm_map(tweets_stemmed, stemDocument)
# inspect our stemmed results
inspect(head(watertag_stemmed,3))
## <<VCorpus (documents: 3, metadata (corpus/indexed): 0/0)>>
##
## [[1]]
## <<PlainTextDocument (metadata: 7)>>
## vsco afterlight winter wisconsin water lake michigan frozen milwauke citi
##
## [[2]]
## <<PlainTextDocument (metadata: 7)>>
## take brother place <U+653C><U+3E64><U+613C><U+3E30><U+623C><U+3E64><U+653C><U+3E64><U+623C><U+3E38><U+38
##
## [[3]]
## <<PlainTextDocument (metadata: 7)>>
## rt waternetwork dsrsd certifi water qualiti test
inspect(head(tweets_stemmed,3))
## <<VCorpus (documents: 3, metadata (corpus/indexed): 0/0)>>
##
## [[1]]
## <<PlainTextDocument (metadata: 7)>>
## missjem someon alreadi hasthough borrow dr whos tardi next difficulti
##
## [[2]]
## <<PlainTextDocument (metadata: 7)>>
## join convers d d
##
## [[3]]
## <<PlainTextDocument (metadata: 7)>>
## beautifulfeet read mischiev thought well d
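To make the idea of stemming concrete, here is a deliberately naive suffix stripper in base R. It is NOT the algorithm that stemDocument uses (that one comes from SnowballC and has many more rules); naive_stem is a made-up function that only shows how different forms of a word can collapse to one stem.

```r
# A toy stemmer: strips one common English ending from the end of each word.
# Illustrative only; real stemmers handle far more cases.
naive_stem <- function(words) {
  sub("(ing|ed|es|e|s)$", "", words)
}

stems <- naive_stem(c("update", "updated", "updates", "updating"))
print(stems)
```

All four forms end up as the same stem, so they would be counted as one term when we compute word frequencies.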
A term-document matrix represents the relationship between terms and documents, where each row stands for a
term and each column for a document, and an entry is the number of occurrences of the term in the document.
In contrast, a document-term matrix is simply the transpose of the term-document matrix, with documents as
rows, and columns as terms.
So which should you use? Either of your choice!
#creating term-document matrices
watertag_tdm<-TermDocumentMatrix(watertag_corpus)
tweets_tdm<-TermDocumentMatrix(tweets_corpus)
#creating document-term matrices
watertag_dtm<-DocumentTermMatrix(watertag_corpus)
tweets_dtm<-DocumentTermMatrix(tweets_corpus)
# just to compare the two:
watertag_tdm
## <<TermDocumentMatrix (terms: 311, documents: 50)>>
## Non-/sparse entries: 492/15058
## Sparsity : 97%
## Maximal term length: 61
## Weighting : term frequency (tf)
watertag_dtm
## <<DocumentTermMatrix (documents: 50, terms: 311)>>
## Non-/sparse entries: 492/15058
## Sparsity : 97%
## Maximal term length: 61
## Weighting : term frequency (tf)
As seen above, apart from being transposes of each other, they’re practically the same.
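If the two layouts are hard to picture, the following sketch builds both by hand in base R from three tiny made-up documents. It does not use tm at all; the object names are ours, and the only purpose is to show what sits in the rows and columns.

```r
# Three toy "documents" (made up for illustration)
docs <- c(d1 = "water ice water", d2 = "ice sun", d3 = "water sun fun")

# split each document into its words
tokens <- strsplit(docs, " ", fixed = TRUE)
terms  <- sort(unique(unlist(tokens)))

# term-document matrix: one row per term, one column per document
tdm <- sapply(tokens, function(tk) table(factor(tk, levels = terms)))

# document-term matrix: simply the transpose
dtm <- t(tdm)

print(tdm)
print(dtm)
```

Each entry is the number of times the term occurs in the document, so for instance the entry for 'water' in d1 is 2.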
With our matrix, we can perform quite a number of operations. For example, if we wanted to know which words occur
at least a certain number of times:
#find terms which occur 5 times or more
findFreqTerms(watertag_dtm, 5)
## [1] "amp" "ice" "sun" "water"
#how about 10 times or more?
findFreqTerms(tweets_tdm, lowfreq=10)
## [1] "man" "now"
It is important to note that the results are ordered alphabetically, not by frequency of occurrence.
If we want them ordered by frequency, we first obtain the counts as a vector by converting to a matrix and using the
rowSums function if we’re using a tdm, or colSums if a dtm:
#remember you can use either dtm or tdm - we're using both interchangeably just to demonstrate
watertag_freq <- colSums(as.matrix(watertag_dtm))
tweets_freq <- rowSums(as.matrix(tweets_tdm))
…and then we sort it in descending order, so that the most frequent terms come first:
#display head of most frequent terms
head(sort(watertag_freq,decreasing=TRUE))
## water ice amp sun frozen fun
## 57 7 5 5 4 4
head(sort(tweets_freq,decreasing=TRUE))
## man now just said home never
## 10 10 7 7 6 6
We could even see the frequency of frequencies, to know how many times some terms appear:
head(table(watertag_freq),15)
## watertag_freq
## 1 2 3 4 5 7 57
## 200 85 16 6 2 1 1
head(table(tweets_freq),15)
## tweets_freq
## 1 2 3 4 5 6 7 10
## 569 85 24 21 4 4 2 2
This tells us that from our search, 200 terms occur just once; and from our tweets, 569 terms occur just once, and
so forth…
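The same three steps (total counts, sorting, frequency of frequencies) can be traced by hand on a small made-up term-document matrix. This sketch uses base R only, and the numbers in it are invented for illustration.

```r
# A toy term-document matrix: rows are terms, columns are documents
tdm <- matrix(c(2, 0, 1,
                1, 1, 0,
                0, 1, 1,
                0, 0, 1),
              nrow = 4, byrow = TRUE,
              dimnames = list(c("water", "ice", "sun", "fun"),
                              c("d1", "d2", "d3")))

# total occurrences of each term (rowSums because terms are the rows)
term_freq <- sort(rowSums(tdm), decreasing = TRUE)
print(term_freq)

# terms occurring at least twice, similar in spirit to findFreqTerms(x, 2)
frequent <- names(term_freq[term_freq >= 2])
print(frequent)

# frequency of frequencies: how many terms occur once, twice, ...
freq_of_freq <- table(term_freq)
print(freq_of_freq)
```

Here one term occurs once, two terms occur twice and one term occurs three times, which is exactly the kind of summary the tables above give for the real data.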
We could also retrieve associations between words: if two words always appear together in the same documents, their
correlation is 1.0, while a correlation near 0.0 means they show no tendency to appear together. Only words at or
above the limit you specify are reported.
So, let’s say we wanted to see words that have at least a 0.5 correlation with the word ‘time’ in our search results:
findAssocs(watertag_dtm, "time", corlimit=0.5)
## $time
## numeric(0)
Note that a result of numeric(0) indicates that no correlating words were found: either the word you searched for
didn’t occur, or no other word reached the level of correlation you specified.
How about the words ‘trend’ and ‘food’ from our timeline, this time with a 0.4 correlation?
findAssocs(tweets_tdm, c("trend","food"), corlimit=0.4)
## $trend
## numeric(0)
##
## $food
## diner garbage” protested rat siryou tastes cook
## 1.00 1.00 1.00 1.00 1.00 1.00 0.70
## money paid yet stunned good like
## 0.70 0.70 0.70 0.57 0.49 0.49
What if we wanted to graphically represent our results? We could, and it only requires a few lines of code.
For example: let’s make a barplot of all the terms that occur at least 5 times in our text source(s). (5 is rather
small, but serves this particular example well.)
#using ggplot2 package
library(ggplot2)
#from our search on Twitter
#x holds the terms and y the counts; coord_flip() then draws the terms on the vertical axis
qplot(names(watertag_freq[watertag_freq>=5]), watertag_freq[watertag_freq>=5], geom="bar",
stat="identity", xlab="Terms", ylab="Frequency", main="Search Results") + coord_flip()
Figure 1: Words Occurring At Least 5 Times
#from our timeline
qplot(names(tweets_freq[tweets_freq>=5]), tweets_freq[tweets_freq>=5], geom="bar",
stat="identity", xlab="Terms", ylab="Frequency", main="@55wordsorless Timeline") + coord_flip()
Figure 2: Words Occurring At Least 5 Times
Wordclouds are also a very cool graphical representation of textual information. Here, the more frequently a word
occurs, the bolder and larger it is displayed, with the reverse being true.
By default the most frequent words have a font scale of 4 and the least have a scale of 0.5, but even that can be
changed, as we’ll demonstrate!
tweets_freq<-sort(tweets_freq,decreasing=TRUE)
watertag_freq<-sort(watertag_freq,decreasing=TRUE)
#wordcloud package allows us to produce wordclouds
library(wordcloud)
#each time wordcloud is run, it randomly produces a layout.
#Though it doesn't really matter, you can set the seed to keep the layout the same
set.seed(77)
#'min.freq' specifies the minimum frequency of the words to be plotted
wordcloud(names(tweets_freq), tweets_freq, min.freq=3)
Figure 3: Wordcloud Using min.freq
#max.words specifies the maximum number of words it should plot
#scale changes font scale
wordcloud(names(tweets_freq), scale=c(5, .1), tweets_freq, max.words=100)
## Warning in wordcloud(names(tweets_freq), scale = c(5, 0.1), tweets_freq, :
## now could not be fit on page. It will not be plotted.
#just adding some colour!
set.seed(79)
wordcloud(names(watertag_freq), watertag_freq, min.freq=2,
random.color=TRUE,colors=rainbow(7))
Figure 5: Wordcloud With Colour!
Run ?wordcloud for even more options you can specify.
Word clusters can also be generated.
Hierarchical
Let’s first use removeSparseTerms to drop sparse terms, i.e. words that appear in only a few tweets and so tell us little. The
value of sparse is a number between 0 and 1 giving the maximum sparsity allowed: a term is kept only if the share of documents in which it does not occur is below that value.
#we're using 0.95 because our text source has only a few terms, and not many recurring words
tweets_sparsed <- removeSparseTerms(tweets_tdm, sparse=0.95)
distMatrix <- dist(scale(tweets_sparsed))
#the "ward" method has been renamed to "ward.D" (there is also a newer "ward.D2")
fit <- hclust(distMatrix, method="ward.D")
plot(fit)
# cut tree into 10 clusters
rect.hclust(fit, k=10)
(groups <- cutree(fit, k=10))
##         ’ll         bed       check        died  discovered        good
##           1           2           3           1           3           4
##        home        just        kill        know        last        like
##           5           6           7           8           7           4
##        love        made         man       marry mischievous       never
##           1           8           1           1           3           2
##         new         now         one        said      smiled       steel
##           4           9           4           1           4           3
##       still       swore        take      things        took        went
##           6           2           1           3          10           5
##        wife
##           2
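To make the sparse argument of removeSparseTerms more concrete, here is a small base R sketch of the rule as we understand it from the tm documentation. The matrix toy_tdm, its counts and the threshold sparse_limit are all made up for illustration:

```r
# toy term-document matrix (made-up counts): 3 terms, 10 documents
toy_tdm <- rbind(home  = c(1, 1, 0, 2, 1, 0, 1, 1, 0, 1),
                 wife  = c(0, 0, 1, 0, 0, 0, 0, 1, 0, 0),
                 steel = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 3))
# sparsity of a term = share of documents in which it does not occur
term_sparsity <- rowMeans(toy_tdm == 0)
term_sparsity
# keep only the terms whose sparsity is below the threshold
sparse_limit <- 0.85
kept <- toy_tdm[term_sparsity < sparse_limit, , drop = FALSE]
rownames(kept)
```

Here "steel" is missing from 9 of the 10 documents, so it is dropped, while "home" and "wife" survive. A higher threshold keeps more terms; a lower one keeps only the words that turn up in many documents.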
K-means
We can also use k-means clustering in our analysis. However, for this you MUST use a document-term matrix, because kmeans clusters the rows of its input and here we want to group the tweets (documents).
#using DOCUMENT-TERM matrix
(tweets_sparsed <- removeSparseTerms(tweets_dtm, sparse=0.95))
## <<DocumentTermMatrix (documents: 79, terms: 31)>>
## Non-/sparse entries: 153/2296
## Sparsity : 94%
## Maximal term length: 11
## Weighting : term frequency (tf)
#setting our value of k
k <- 4
kmeansResult <- kmeans(tweets_sparsed, k)
# cluster centers
round(kmeansResult$centers, digits=3)
##     ’ll bed check  died discovered  good  home  just  kill  know  last
## 1 0.034   0     0 0.051          0 0.051 0.051 0.102 0.068 0.068 0.051
## 2 0.000   0     1 0.000          1 0.000 0.000 0.000 0.000 0.000 0.000
## 3 0.167   0     0 0.083          0 0.083 0.000 0.083 0.000 0.083 0.083
## 4 0.000   1     0 0.000          0 0.000 0.750 0.000 0.000 0.000 0.000
##    like  love  made man marry mischievous never   new   now   one  said
## 1 0.051 0.051 0.068 0.0 0.000       0.017 0.051 0.068 0.119 0.085 0.000
## 2 0.000 0.000 0.000 1.0 0.000       1.000 0.000 0.000 0.000 0.000 0.000
## 3 0.083 0.083 0.083 0.5 0.333       0.000 0.000 0.000 0.250 0.000 0.583
## 4 0.000 0.000 0.000 0.0 0.000       0.000 0.750 0.000 0.000 0.000 0.000
##   smiled steel still swore  take things  took  went  wife
## 1  0.102     0 0.068 0.000 0.017      0 0.085 0.051 0.017
## 2  0.000     1 0.000 0.000 0.000      1 0.000 0.000 0.000
## 3  0.000     0 0.000 0.083 0.250      0 0.083 0.000 0.083
## 4  0.000     0 0.000 0.750 0.000      0 0.000 0.250 0.500
To make things easier to read, let’s just print the top three words in every cluster:
for (i in 1:k)
{
cat(paste("cluster ", i, ": ", sep=""))
#sort the centre of cluster i so its highest-weighted terms come first
s <- sort(kmeansResult$centers[i,], decreasing=TRUE)
#"\n" ends the line after each cluster's words
cat(names(s)[1:3], "\n")
# if you want to print the tweets of every cluster, run the next line
# print(tweets[which(kmeansResult$cluster==i)])
}
## cluster 1: now just smiled
## cluster 2: check discovered man
## cluster 3: said man marry
## cluster 4: bed home never
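To see why the documents have to be the rows, here is a small self-contained sketch of the same steps on a made-up document-term matrix. The names toy_dtm and toy_km, the counts, and the use of nstart are our own choices for illustration. Note that kmeans starts from random centres, so setting a seed keeps the result repeatable:

```r
set.seed(42)  # k-means starts from random centres; fix the seed for repeatability
# toy document-term matrix (made-up counts): 6 documents (rows), 3 terms (columns)
toy_dtm <- rbind(c(2, 1, 0), c(3, 1, 0), c(2, 2, 0),
                 c(0, 0, 3), c(0, 1, 2), c(0, 0, 2))
dimnames(toy_dtm) <- list(Docs = paste0("d", 1:6),
                          Terms = c("bed", "home", "steel"))
# nstart = 10 tries several random starts and keeps the best one
toy_km <- kmeans(toy_dtm, centers = 2, nstart = 10)
toy_km$cluster   # one cluster label per document (row)
# top term of each cluster = the column with the largest centre value
apply(toy_km$centers, 1, function(center) names(which.max(center)))
```

Each row of toy_km$centers is the average document of one cluster, which is why sorting a row, as we did above, gives the words that best characterise that cluster.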
Social Network Analysis
First, we want to produce a term-term matrix, which is basically just a network of terms based on their co-occurrence
in tweets. It is the matrix product of the term-document matrix and the document-term matrix, i.e. its transpose. (We compute the matrix
product with the %*% operator.)
#matrix product;
#using sparsed tweets because original tdm in our example had too many sparse terms
#transposing with 't' operator
tweets_sparsed <- removeSparseTerms(tweets_tdm, sparse=0.95)
termTerm <- as.matrix(tweets_sparsed) %*% as.matrix(t(tweets_sparsed))
#inspect few rows and columns
termTerm[1:10,1:10]
##             Terms
## Terms        ’ll bed check died discovered good home just kill know
##   ’ll          4   0     0    0          0    0    0    0    1    0
##   bed          0   4     0    0          0    0    3    0    0    0
##   check        0   0     4    0          4    0    0    0    0    0
##   died         0   0     0    4          0    0    0    1    0    1
##   discovered   0   0     4    0          4    0    0    0    0    0
##   good         0   0     0    0          0    4    0    1    0    0
##   home         0   3     0    0          0    0    8    0    0    0
##   just         0   0     0    1          0    1    0    7    0    0
##   kill         1   0     0    0          0    0    0    0    4    0
##   know         0   0     0    1          0    0    0    0    0    5
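If it is not obvious why this product counts co-occurrences, a tiny base R sketch may help. The matrix toy_tdm and its 0/1 counts are made up for illustration:

```r
# toy term-document matrix (made-up 0/1 counts): 3 terms, 4 tweets
toy_tdm <- rbind(bed   = c(1, 1, 0, 0),
                 home  = c(1, 1, 1, 0),
                 check = c(0, 0, 0, 1))
colnames(toy_tdm) <- paste0("tweet", 1:4)
# term-term matrix: (terms x tweets) %*% (tweets x terms)
toy_termTerm <- toy_tdm %*% t(toy_tdm)
toy_termTerm
```

With 0/1 counts like these, each diagonal entry is the number of tweets containing that term, and each off-diagonal entry is the number of tweets containing both terms: "bed" and "home" share two tweets, while "check" never appears alongside either.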
After this, we can use the igraph package to graphically represent this network of terms in a visually appealing way:
library(igraph)
# build a graph from the above matrix
g <- graph_from_adjacency_matrix(termTerm, weighted=TRUE, mode="undirected")
# remove loops
g <- simplify(g)
# set labels and degrees of vertices
V(g)$label <- V(g)$name
V(g)$degree <- degree(g)
# setting seed to make the layout reproducible
set.seed(1001)
#call to plot network
layout1 <- layout_with_fr(g)
plot(g, layout=layout1)
What if we wanted a different layout?
plot(g, layout=layout_with_kk)
What if we wanted an interactive network plot? Easy!
tkplot(g, layout=layout1)
In fact, in our interactive graphs, we can just change the layouts immediately by selecting different options in the
Layout tab.
But the above just produces a graph with a lot of connections. What if we wanted to see straightaway which terms were
more important, and which connections were stronger? We can do that by specifying options with the following code:
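#scale each label with the degree of its vertex 'V', so better-connected terms are printed larger
V(g)$label.cex <- 2.2 * V(g)$degree / max(V(g)$degree)+ .2
#label colour
V(g)$label.color <- rgb(0, 0, .2, .8)
#no frame around the vertices
V(g)$frame.color <- NA
#rescale the log of the edge weights so that the strongest connection gets a value of 1
egam <- (log(E(g)$weight)+.4) / max(log(E(g)$weight)+.4)
#access edges 'E': stronger connections are drawn more opaque and wider
E(g)$color <- rgb(.5, .5, 0, egam)
E(g)$width <- egam
# plot the graph in layout1
plot(g, layout=layout1)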
…and straightaway we can see which words are more ‘weighted’, and even point out one or two clusters…
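To see what the egam line does to the edge weights, here is the same arithmetic on three made-up weights (toy_weights stands in for E(g)$weight and is invented for illustration):

```r
# made-up co-occurrence weights, standing in for E(g)$weight
toy_weights <- c(1, 3, 8)
# same transformation as egam: take logs, shift by 0.4, divide by the maximum
toy_egam <- (log(toy_weights) + 0.4) / max(log(toy_weights) + 0.4)
round(toy_egam, 2)
```

The strongest edge always ends up at exactly 1. Our reading of the 0.4 shift is that it keeps an edge of weight 1 (whose log is 0) faintly visible instead of fully transparent, and that taking logs stops a few very heavy edges from drowning out the rest.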
How about making this new graph interactive too? As before, just use tkplot:
tkplot(g, layout=layout1)
As usual, there is a plethora of options and settings at your disposal! Just run ?igraph::layout to see them! (We’re
specifying the package because you might have another layout function from another package.)