Natural Language Processing in R (rNLP)
 

The introductory slides of a workshop given to the doctoral school at the Institute of Business Informatics of the Goethe University Frankfurt. The tutorials are available on http://crunch.kmi.open.ac.uk/w/index.php/Tutorials

    Presentation Transcript

    • Natural Language Processing in R (rNLP). Fridolin Wild, The Open University, UK. Tutorial to the Doctoral School at the Institute of Business Informatics of the Goethe University Frankfurt.
    • Structure of this tutorial:
      – An introduction to R and cRunch
      – Language basics in R
      – Basic I/O in R
      – Social Network Analysis
      – Latent Semantic Analysis
      – Twitter
      – Sentiment
      – (Advanced I/O in R: MySQL, SparQL)
    • Introduction
    • cRunch is an infrastructure for computationally-intense learning analytics, supporting researchers in investigating the big data generated in the co-construction of knowledge… and beyond…
    • Architecture (Thiele & Lehner, 2011): Living Reports, data shop, cron jobs, R web services
    • Reports
    • Living reports:
      – reports with embedded scripts and data
      – knitr and Sweave
      – render to html, PDF, …
      – visualisations: ggplot2, trellis, graphics; jpg, png, eps, pdf, e.g. png(file="n.png", plot(network(m)))
      – fill-in-the-blanks: "Drop out quote went down to <<echo=FALSE>>= doquote["OU","2011"] @"

      Sweave example (Friedrich Leisch, Sweave Example 1):

      ```latex
      \documentclass[a4paper]{article}
      \title{Sweave Example 1}
      \author{Friedrich Leisch}
      \begin{document}
      \maketitle

      In this example we embed parts of the examples from the
      \texttt{kruskal.test} help page into a \LaTeX{} document:
      <<>>=
      data(airquality)
      library(ctest)
      kruskal.test(Ozone ~ Month, data = airquality)
      @
      which shows that the location parameter of the Ozone
      distribution varies significantly from month to month. Finally we
      include a boxplot of the data:
      \begin{center}
      <<fig=TRUE,echo=FALSE>>=
      boxplot(Ozone ~ Month, data = airquality)
      @
      \end{center}
      \end{document}
      ```
    • Example PDF report
    • Example html5 report:

      Example Report
      =============
      This is an example of embedded scripts and data.

      ```{r}
      a = "hello world"
      print(a)
      ```

      And here is an example of how to embed a chart.

      ```{r fig.width=7, fig.height=6}
      plot(5:20)
      ```
    • Shiny Widgets (1):
      – widgets: use-case-sized encapsulations of mini apps
      – HTML5
      – two files: ui.R, server.R
      – still missing: manifest files (info.plist, config.xml)
    • Shiny Widgets (2). From http://www.rstudio.com/shiny/
    • Web services: harmonization & data warehousing
    • Example R web service: print("hello world")
    • More complex R web service:

      ```r
      setContentType("image/png")
      a = c(1, 3, 5, 12, 13, 15)
      image_file = tempfile()
      png(file = image_file)
      plot(a, main = "The magic image", ylab = "", xlab = "",
           col = c("darkred", "darkblue", "darkgreen"))
      dev.off()
      sendBin(readBin(image_file, "raw", n = file.info(image_file)$size))
      unlink(image_file)
      ```
    • R web services:
      – uses the Apache mod_R.so
      – see http://Rapache.net
      – common server functions: GET and POST variables, setContentType(), sendBin(), …
    • A word on memory management (see p. 70 of the Dietl diploma thesis):
      – use package bigmemory (for shared memory across threads)
      – use package Rserve (for shared read-only access across threads)
      – swap out memory objects with save() and load()
      – the latter is typically sufficient (hard disks are fast!)
    • Data management abstraction layer for mod_R.so: configure a handler in httpd.conf, specify a directory match, and load specific data management routines at start-up:

      REvalOnStartup "source('/dbal.R');"
    • Harvesting: data acquisition
    • Job scheduling: crontab entries for R web services, e.g. to harvest feeds or to store results in a local DB
    • Data shop: sharing
    • Data shop and the community. You have a 'public/' folder :)
      – 'public/data': save() any .rda file and it will be indexed within the hour
      – 'public/services': use this to execute your scripts; indexed within the hour
      – 'public/gallery': use this to store your public visualisations
      – code sharing: any .R script in your 'public/' folder is source-readable via the web
    • Not covered: the useful pointer
    • More NLP packages:

      install.packages("NaturalLanguageProcessing")
      library("NaturalLanguageProcessing")
    • studio: exploratory programming
    • Social Network Analysis. Fridolin Wild, The Open University, UK
    • The Idea
    • The basic concept:
      – precursors date back to the 1920s, the math to Euler's 'Seven Bridges of Koenigsberg'
      – social networks are: actors (people, groups, media, tags, …) and ties (interactions, relationships, …)
      – actors and ties form a graph
      – the graph has measurable structural properties: betweenness, degree of centrality, density, cohesion, structural patterns
    • Forum messages (an R data frame extract; parent_id "N" marks a thread-starting message):

          message_id forum_id parent_id author
      130    2853483  2853445         N   2043
      131    1440740   785876         N   1669
      132    2515257  2515256         N   5814
      133    4704949  4699874         N   5810
      134    2597170  2558273         N   2054
      135    2316951  2230821         N   5095
      136    3407573  3407568         N     36
      137    2277393  2277387         N    359
      138    3394136  3382201         N   1050
      139    4603931  4167338         N    453
      140    6234819  6189254   6231352   5400
      141     806699   785877    804668   2177
      142    4430290  3371246   3380313     48
      143    3395686  3391024   3391129     35
      144    6270213  6024351   6265378   5780
      145    2496015  2491522   2491536   2774
      146    4707562  4699873   4707502   5810
      147    2574199  2440094   2443801   5801
      148    4501993  4424215   4491650   5232

      (A second extract lists the messages of a single forum, forum_id 31117, together with their reply chains, e.g. message 734569, parent N, author 2491.)
    • Incidence matrix: msg_id = incident, authors appear in incidents
    • Derive the adjacency matrix: t(im) %*% im
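The step from raw message data to an author-by-author adjacency matrix can be sketched in a few lines of base R. The forum ids and author names below are invented for illustration, not taken from the slides:

```r
# Toy message data: which author posted in which forum
msgs <- data.frame(
  forum_id = c(31117, 31117, 31117, 31118, 31118),
  author   = c("a1", "a2", "a1", "a2", "a3")
)

# Incidence matrix: one row per forum (the "incident"), one column per author
im <- table(msgs$forum_id, msgs$author)
im <- (im > 0) * 1   # binarise: author participated yes/no

# Adjacency matrix: authors co-occurring in the same forum
am <- t(im) %*% im
diag(am) <- 0        # remove self-ties
am
```

Here a1 and a2 co-occur in forum 31117, a2 and a3 in 31118, and a1 and a3 never post together, so am["a1","a3"] is 0.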
    • Visualization: Sociogramme
    • Degree
    • Betweenness
    • Network density: total edges = 29; possible edges = 18 × (18 − 1)/2 = 153; density = 29/153 ≈ 0.19
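The density figures on this slide can be checked directly in R:

```r
n     <- 18                    # number of actors
edges <- 29                    # observed edges

possible <- n * (n - 1) / 2    # undirected graph without self-loops
density  <- edges / possible
round(density, 2)              # 0.19
```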
    • kmeans Cluster (k=3)
    • Analysis: mix, match, optimise
    • Tutorials:
      – starter: sna-simple.Rmd
      – real: sna-blog.Rmd
      – advanced: sna-forum.Rmd
    • Latent Semantic Analysis. Fridolin Wild, The Open University, UK
    • Latent Semantic Analysis (Landauer, 2007):
      – "Humans learn word meanings and how to combine them into passage meaning through experience with ~paragraph unitized verbal environments."
      – "They don't remember all the separate words of a passage; they remember its overall gist or meaning."
      – "LSA learns by 'reading' ~paragraph unitized texts that represent the environment."
      – "It doesn't remember all the separate words of a text; it remembers its overall gist or meaning."
    • Word choice is over-rated (Landauer, 2007):
      – an educated adult understands ~100,000 word forms
      – an average sentence contains 20 tokens
      – thus there are 100,000^20 possible combinations of words in a sentence, a maximum of log2(100,000^20) ≈ 332 bits in word choice alone
      – 20! ≈ 2.4 × 10^18 possible orders of 20 words, a maximum of ≈ 61 bits from the order of the words
      – 332/(61 + 332) ≈ 84% word choice
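The back-of-the-envelope arithmetic on this slide can be reproduced in base R:

```r
vocab  <- 1e5          # ~100,000 word forms
tokens <- 20           # average sentence length

choice_bits <- tokens * log2(vocab)         # bits from word choice
order_bits  <- lfactorial(tokens) / log(2)  # bits from word order, log2(20!)

round(choice_bits)                          # ~332
round(order_bits)                           # ~61
round(choice_bits / (choice_bits + order_bits), 2)  # ~0.84
```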
    • LSA (2):
      – assumption: texts have a semantic structure
      – however, this structure is obscured by word usage (noise, synonymy, polysemy, …)
      – proposed LSA solution: map the doc-term matrix using conceptual indices derived statistically (truncated SVD), and make similarity comparisons using angles
    • Input (e.g., documents): { M } = a term-by-document TEXTMATRIX; term = feature, vocabulary = ordered set of features. Only the red terms appear in more than one document, so strip the rest. (Deerwester, Dumais, Furnas, Landauer, and Harshman (1990): Indexing by Latent Semantic Analysis. Journal of the American Society for Information Science, 41(6):391-407)
    • Singular Value Decomposition: M = TSD'
    • Truncated SVD: the latent-semantic space
    • Reconstructed, reduced matrix (example document m4: "Graph minors: A survey")
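Truncation and reconstruction can be sketched with base R's svd(); the toy term-by-document matrix below is invented for illustration, not the Deerwester example:

```r
# Toy term-document matrix (terms x docs, term frequencies)
M <- matrix(c(1, 0, 1, 0,
              0, 1, 1, 0,
              1, 1, 0, 1), nrow = 3, byrow = TRUE)

s <- svd(M)                               # M = U diag(d) V'
k <- 2                                    # keep only the first k singular values
Mk <- s$u[, 1:k] %*% diag(s$d[1:k]) %*% t(s$v[, 1:k])

# Mk is the rank-k ("latent semantic") approximation of M
round(Mk, 2)
```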
    • Similarity in a latent-semantic space: the angle between a query vector and each target vector (figure: query, target 1, target 2, angles 1 and 2 in an x/y dimension plane)
    • doc2doc similarities:
      – unreduced = pure vector-space model: based on M = TSD', Pearson correlation over document vectors
      – reduced: based on M2 = T S2 D', Pearson correlation over document vectors
    • Ex-post updating: folding-in
      – SVD factor stability: SVD calculates factors over a given text base; different texts, different factors
      – challenge: avoid unwanted factor changes (e.g., bad essays)
      – solution: folding-in of essays instead of recalculating
      – SVD is computationally expensive
    • Folding-in in detail (Berry et al., 1995), with Mk = Tk Sk Dk':
      (1) convert the original document vector v into "Dk" format: d = Sk^-1 Tk' v
      (2) convert the "Dk"-format vector back into "Mk" format: m = Tk Sk d
    • LSA process & driving parameters: 4 × 12 × 7 × 2 × 3 = 2016 combinations
    • Pre-processing:
      – stemming: Porter stemmer (snowball.tartarus.org); 'move', 'moving', 'moves' => 'move'; in German even more important (more flections)
      – stop-word elimination: 373 stop words in German
      – stemming plus stop-word elimination
      – unprocessed ('raw') terms
    • Term weighting schemes: weight_ij = lw(tf_ij) · gw(term_i)
      – Local weights (LW): none ('raw' tf); binary term frequency; logarithmised term frequency (log)
      – Global weights (GW):
        – none ('raw' tf)
        – normalisation: norm_i = 1 / sqrt(sum_j tf_ij^2)
        – inverse document frequency (IDF): idf_i = 1 + log2(numdocs / docfreq_i)
        – 1 + entropy: ent_i = 1 + sum_j (p_ij · log p_ij) / log(numdocs), where p_ij = tf_ij / sum_j tf_ij
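The logarithmised-tf times IDF combination from this scheme can be sketched directly (the lsa package ships it as lw_logtf() and gw_idf(); the toy matrix below is invented for illustration):

```r
# Toy term-document matrix (terms x docs)
tm <- matrix(c(2, 0, 1,
               1, 1, 0), nrow = 2, byrow = TRUE,
             dimnames = list(c("term1", "term2"), c("d1", "d2", "d3")))

loc     <- log(tm + 1)                  # local weight: logarithmised tf
docfreq <- rowSums(tm > 0)              # in how many docs each term occurs
idf     <- 1 + log2(ncol(tm) / docfreq) # global weight: IDF as on the slide

# weight_ij = lw(tf_ij) * gw(term_i); row-wise recycling of idf
weighted <- loc * idf
round(weighted, 2)
```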
    • SVD dimensionality: many different proposals (see package); 80% of the variance is a good estimator
    • Proximity measures: Pearson correlation, cosine correlation, Spearman's rho (pics: http://davidmlane.com/hyperstat/A62891.html)
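The three measures can be compared on a pair of toy document vectors (values invented for illustration):

```r
d1 <- c(1, 2, 0, 3)
d2 <- c(0, 2, 1, 2)

# Cosine similarity: angle between the two vectors
cosine_sim  <- sum(d1 * d2) / (sqrt(sum(d1^2)) * sqrt(sum(d2^2)))

pearson_sim <- cor(d1, d2)                        # Pearson correlation
spearman    <- cor(d1, d2, method = "spearman")   # Spearman's rho

round(c(cosine = cosine_sim, pearson = pearson_sim, spearman = spearman), 2)
```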
    • Pair-wise dis/similarity: convergence expected for 'eu', 'österreich'; divergence expected for 'jahr', 'wien'
    • The package:
      – available via CRAN, e.g.: http://cran.r-project.org/web/packages/lsa/index.html
      – higher-level abstraction to ease use
      – core methods: textmatrix() / query(), lsa(), fold_in(), as.textmatrix()
      – support methods for term weighting, dimensionality calculation, correlation measurement, …
    • Core workflow:

      tm = textmatrix("dir/")
      tm = lw_logtf(tm) * gw_idf(tm)
      space = lsa(tm, dims = dimcalc_share())
      tm3 = fold_in(tm, space)
      as.textmatrix(tm)
    • Pre-Processing Chain
    • Tutorials:
      – starter: lsa-indexing.Rmd
      – real: lsa-essayscoring.Rmd
      – advanced: lsa-sparse.Rmd
    • Additional tutorials. Fridolin Wild, The Open University, UK
    • Tutorials:
      – advanced I/O: twitter.Rmd
      – advanced I/O: sparql.Rmd
      – advanced NLP: twitter-sentiment.Rmd
      – evaluation: interrater-agreement.Rmd