1© Cloudera, Inc. All rights reserved.
LSA-ing Wikipedia with Spark
Sandy Ryza | Senior Data Scientist
2© Cloudera, Inc. All rights reserved.
Me
• Data scientist at Cloudera
• Recently lead Cloudera’s Apache Spark
development
• Author of Advanced Analytics with Spark
3© Cloudera, Inc. All rights reserved.
LSA-ing Wikipedia with Spark
Sandy Ryza | Senior Data Scientist
4© Cloudera, Inc. All rights reserved.
Latent Semantic Analysis
• Fancy name for applying a matrix decomposition (SVD) to text data
5© Cloudera, Inc. All rights reserved.
6© Cloudera, Inc. All rights reserved.
7© Cloudera, Inc. All rights reserved.
Parse
Raw Data
Clean
Term-
Document
Matrix
SVD
Interpret
Results
8© Cloudera, Inc. All rights reserved.
Parse
Raw Data
Clean
Term-
Document
Matrix
SVD
Interpret
Results
9© Cloudera, Inc. All rights reserved.
Wikipedia Content Data Set
• http://dumps.wikimedia.org/enwiki/latest/
• XML-formatted
• 46 GB uncompressed
10© Cloudera, Inc. All rights reserved.
<page>
<title>Anarchism</title>
<ns>0</ns>
<id>12</id>
<revision>
<id>584215651</id>
<parentid>584213644</parentid>
<timestamp>2013-12-02T15:14:01Z</timestamp>
<contributor>
<username>AnomieBOT</username>
<id>7611264</id>
</contributor>
<comment>Rescuing orphaned refs (&quot;autogenerated1&quot; from rev
584155010; &quot;bbc&quot; from rev 584155010)</comment>
<text xml:space="preserve">{{Redirect|Anarchist|the fictional character|
Anarchist (comics)}}
{{Redirect|Anarchists}}
{{pp-move-indef}}
{{Anarchism sidebar}}
'''Anarchism''' is a [[political philosophy]] that advocates [[stateless society|
stateless societies]] often defined as [[self-governance|self-governed]] voluntary
institutions,&lt;ref&gt;&quot;ANARCHISM, a social philosophy that rejects
authoritarian government and maintains that voluntary institutions are best suited
to express man's natural social tendencies.&quot; George Woodcock.
&quot;Anarchism&quot; at The Encyclopedia of Philosophy&lt;/ref&gt;&lt;ref&gt;
&quot;In a society developed on these lines, the voluntary associations which
already now begin to cover all the fields of human activity would take a still
greater extension so as to substitute
...
11© Cloudera, Inc. All rights reserved.
Parse
Raw Data
Clean
Term-
Document
Matrix
SVD
Interpret
Results
12© Cloudera, Inc. All rights reserved.
import org.apache.mahout.text.wikipedia.XmlInputFormat
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io._
val path = "hdfs:///user/ds/wikidump.xml"
val conf = new Configuration()
conf.set(XmlInputFormat.START_TAG_KEY, "<page>")
conf.set(XmlInputFormat.END_TAG_KEY, "</page>")
val kvs = sc.newAPIHadoopFile(path,
classOf[XmlInputFormat],
classOf[LongWritable],
classOf[Text],
conf)
val rawXmls = kvs.map(p => p._2.toString)
13© Cloudera, Inc. All rights reserved.
Parse
Raw Data
Clean
Term-
Document
Matrix
SVD
Interpret
Results
14© Cloudera, Inc. All rights reserved.
Lemmatization
“the boy’s cars are different colors”
“the boy car be different color”
15© Cloudera, Inc. All rights reserved.
CoreNLP
def createNLPPipeline(): StanfordCoreNLP = {
val props = new Properties()
props.put("annotators", "tokenize, ssplit, pos, lemma")
new StanfordCoreNLP(props)
}
16© Cloudera, Inc. All rights reserved.
Stop Words
“the boy car be different color”
“boy car different color”
17© Cloudera, Inc. All rights reserved.
Parse
Raw Data
Clean
Term-
Document
Matrix
SVD
Interpret
Results
18© Cloudera, Inc. All rights reserved.
Tail Monkey Algorithm Scala
Document 1 1.5 1.8
Document 2 2.0 4.3
Document 3 1.4 6.7
Document 4 1.6
Document 5 1.2
Term-Document Matrix
19© Cloudera, Inc. All rights reserved.
tf-idf
• (Term Frequency) * (Inverse Document Frequency)
• tf(document, word) = # times word appears in document
• idf(word) = 1 / (# documents that contain word)
20© Cloudera, Inc. All rights reserved.
val rowVectors: RDD[Vector] = ...
21© Cloudera, Inc. All rights reserved.
Parse
Raw Data
Clean
Term-
Document
Matrix
SVD
Interpret
Results
22© Cloudera, Inc. All rights reserved.
Singular Value Decomposition
• Factors matrix into the product of three matrices: U, S, and V
• m = # documents
• n = # terms
• U is m x n
• S is n x n
• V is n x n
23© Cloudera, Inc. All rights reserved.
Low Rank Approximation
• Account for synonymy by condensing related terms.
• Account for polysemy by placing less weight on terms that have multiple meanings.
• Throw out noise.
SVD can find the rank-k approximation that has the lowest Frobenius distance from the
original matrix.
24© Cloudera, Inc. All rights reserved.
Singular Value Decomposition
• Factors matrix into the product of three matrices: U, S, and V
• m = # documents
• n = # terms
• k = # concepts
• U is m x n
• S is k x k
• V is k x n
25© Cloudera, Inc. All rights reserved.
Docs:
Terms:U S V
26© Cloudera, Inc. All rights reserved.
Docs:
Terms:U S V
27© Cloudera, Inc. All rights reserved.
rowVectors.cache()
val mat = new RowMatrix(rowVectors)
val k = 1000
val svd = mat.computeSVD(k, computeU=true)
28© Cloudera, Inc. All rights reserved.
Parse
Raw Data
Clean
Term-
Document
Matrix
SVD
Interpret
Results
29© Cloudera, Inc. All rights reserved.
What are the top “concepts”?
I.e. what dimensions in term-space and document-space explain most of the variance of
the data?
30© Cloudera, Inc. All rights reserved.
Docs:
Terms:U S V
31© Cloudera, Inc. All rights reserved.
U S V
32© Cloudera, Inc. All rights reserved.
U S V
33© Cloudera, Inc. All rights reserved.
def topTermsInConcept(concept: Int, numTerms: Int)
: Seq[(String, Double)] = {
val v = svd.V.toBreezeMatrix
val termWeights = v(::, k).toArray.zipWithIndex
val sorted = termWeights.sortBy(-_._1)
sorted.take(numTerms)
}
34© Cloudera, Inc. All rights reserved.
def topDocsInConcept(concept: Int, numDocs: Int)
: Seq[Seq[(String, Double)]] = {
val u = svd.U
val docWeights =
u.rows.map(_.toArray(concept)).zipWithUniqueId()
docWeights.top(numDocs)
}
35© Cloudera, Inc. All rights reserved.
Concept 1
Terms: department, commune, communes, insee, france, see, also,
southwestern, oise, marne, moselle, manche, eure, aisne, isère
Docs: Communes in France, Saint-Mard, Meurthe-et-Moselle,
Saint-Firmin, Meurthe-et-Moselle, Saint-Clément, Meurthe-et-Moselle,
Saint-Sardos, Lot-et-Garonne, Saint-Urcisse, Lot-et-Garonne, Saint-Sernin,
Lot-et-Garonne, Saint-Robert, Lot-et-Garonne, Saint-Léon, Lot-et-Garonne,
Saint-Astier, Lot-et-Garonne
36© Cloudera, Inc. All rights reserved.
Concept 2
Terms: genus, species, moth, family, lepidoptera, beetle, bulbophyllum,
snail, database, natural, find, geometridae, reference, museum, noctuidae
Docs: Chelonia (genus), Palea (genus), Argiope (genus), Sphingini,
Cribrilinidae, Tahla (genus), Gigartinales, Parapodia (genus),
Alpina (moth), Arycanda (moth)
37© Cloudera, Inc. All rights reserved.
Querying
• Given a set of terms, find the closest documents in the latent space
38© Cloudera, Inc. All rights reserved.
Reconstructed Matrix
(U * S * V)
Doc
Term
39© Cloudera, Inc. All rights reserved.
def topTermsForTerm(
normalizedVS : BDenseMatrix[Double],
termId: Int): Seq[(Double, Int)] = {
val rowVec = new BDenseVector[Double](row(normalizedVS, termId).toArray)
val termScores = (normalizedVS * rowVec).toArray.zipWithIndex
termScores.sortBy(- _._1).take(10)
}
val VS = multiplyByDiagonalMatrix(svd.V, svd.s)
val normalizedVS = rowsNormalized(VS)
topTermsForTerm(normalizedVS, id, termIds)
40© Cloudera, Inc. All rights reserved.
printRelevantTerms("radiohead")
radiohead 0.9999999999999993
lyrically 0.8837403315233519
catchy 0.8780717902060333
riff 0.861326571452104
lyricsthe 0.8460798060853993
lyric 0.8434937575368959
upbeat 0.8410212279939793
Term Similarity
41© Cloudera, Inc. All rights reserved.
printRelevantTerms("algorithm")
algorithm 1.000000000000002
heuristic 0.8773199836391916
compute 0.8561015487853708
constraint 0.8370707630657652
optimization 0.8331940333186296
complexity 0.823738607119692
algorithmic 0.8227315888559854
Term Similarity
42© Cloudera, Inc. All rights reserved.
(algorithm,1.000000000000002), (heuristic,0.8773199836391916),
(compute,0.8561015487853708), (constraint,0.8370707630657652),
(optimization,0.8331940333186296), (complexity,0.823738607119692),
(algorithmic,0.8227315888559854), (iterative,0.822364922633442),
(recursive,0.8176921180556759), (minimization,0.8160188481409465)
43© Cloudera, Inc. All rights reserved.
def topDocsForTerm( US: RowMatrix, V: Matrix, termId: Int)
: Seq[(Double, Long)] = {
val rowArr = row(V, termId).toArray
val rowVec = Matrices.dense(termRowArr.length, 1, termRowArr)
val docScores = US.multiply(termRowVec)
val allDocWeights = docScores.rows.map( _.toArray(0)).
zipWithUniqueId()
allDocWeights.top( 10)
}
44© Cloudera, Inc. All rights reserved.
printRelevantDocs("fir")
Silver tree 0.006292909647173194
See the forest for the trees 0.004785047583508223
Eucalyptus tree 0.004592837783089319
Sequoia tree 0.004497446632469554
Willow tree 0.004429936059594164
Coniferous tree 0.004381572286629475
Tulip Tree 0.004374705020233878
Document Similarity
45© Cloudera, Inc. All rights reserved.
• https://github.com/sryza/aas/tree/master/ch06-lsa
• https://spark.apache.org/docs/latest/mllib-dimensionality-reduction.html
•
More detail?
46© Cloudera, Inc. All rights reserved.
Thank you
@sandysifting

Latent Semantic Analysis of Wikipedia with Spark

  • 1.
    1© Cloudera, Inc.All rights reserved. LSA-ing Wikipedia with Spark Sandy Ryza | Senior Data Scientist
  • 2.
    2© Cloudera, Inc.All rights reserved. Me • Data scientist at Cloudera • Recently lead Cloudera’s Apache Spark development • Author of Advanced Analytics with Spark
  • 3.
    3© Cloudera, Inc.All rights reserved. LSA-ing Wikipedia with Spark Sandy Ryza | Senior Data Scientist
  • 4.
    4© Cloudera, Inc.All rights reserved. Latent Semantic Analysis • Fancy name for applying a matrix decomposition (SVD) to text data
  • 5.
    5© Cloudera, Inc.All rights reserved.
  • 6.
    6© Cloudera, Inc.All rights reserved.
  • 7.
    7© Cloudera, Inc.All rights reserved. Parse Raw Data Clean Term- Document Matrix SVD Interpret Results
  • 8.
    8© Cloudera, Inc.All rights reserved. Parse Raw Data Clean Term- Document Matrix SVD Interpret Results
  • 9.
    9© Cloudera, Inc.All rights reserved. Wikipedia Content Data Set • http://dumps.wikimedia.org/enwiki/latest/ • XML-formatted • 46 GB uncompressed
  • 10.
    10© Cloudera, Inc.All rights reserved. <page> <title>Anarchism</title> <ns>0</ns> <id>12</id> <revision> <id>584215651</id> <parentid>584213644</parentid> <timestamp>2013-12-02T15:14:01Z</timestamp> <contributor> <username>AnomieBOT</username> <id>7611264</id> </contributor> <comment>Rescuing orphaned refs (&quot;autogenerated1&quot; from rev 584155010; &quot;bbc&quot; from rev 584155010)</comment> <text xml:space="preserve">{{Redirect|Anarchist|the fictional character| Anarchist (comics)}} {{Redirect|Anarchists}} {{pp-move-indef}} {{Anarchism sidebar}} '''Anarchism''' is a [[political philosophy]] that advocates [[stateless society| stateless societies]] often defined as [[self-governance|self-governed]] voluntary institutions,&lt;ref&gt;&quot;ANARCHISM, a social philosophy that rejects authoritarian government and maintains that voluntary institutions are best suited to express man's natural social tendencies.&quot; George Woodcock. &quot;Anarchism&quot; at The Encyclopedia of Philosophy&lt;/ref&gt;&lt;ref&gt; &quot;In a society developed on these lines, the voluntary associations which already now begin to cover all the fields of human activity would take a still greater extension so as to substitute ...
  • 11.
    11© Cloudera, Inc.All rights reserved. Parse Raw Data Clean Term- Document Matrix SVD Interpret Results
  • 12.
    12© Cloudera, Inc.All rights reserved. import org.apache.mahout.text.wikipedia.XmlInputFormat import org.apache.hadoop.conf.Configuration import org.apache.hadoop.io._ val path = "hdfs:///user/ds/wikidump.xml" val conf = new Configuration() conf.set(XmlInputFormat.START_TAG_KEY, "<page>") conf.set(XmlInputFormat.END_TAG_KEY, "</page>") val kvs = sc.newAPIHadoopFile(path, classOf[XmlInputFormat], classOf[LongWritable], classOf[Text], conf) val rawXmls = kvs.map(p => p._2.toString)
  • 13.
    13© Cloudera, Inc.All rights reserved. Parse Raw Data Clean Term- Document Matrix SVD Interpret Results
  • 14.
    14© Cloudera, Inc.All rights reserved. Lemmatization “the boy’s cars are different colors” “the boy car be different color”
  • 15.
    15© Cloudera, Inc.All rights reserved. CoreNLP def createNLPPipeline(): StanfordCoreNLP = { val props = new Properties() props.put("annotators", "tokenize, ssplit, pos, lemma") new StanfordCoreNLP(props) }
  • 16.
    16© Cloudera, Inc.All rights reserved. Stop Words “the boy car be different color” “boy car different color”
  • 17.
    17© Cloudera, Inc.All rights reserved. Parse Raw Data Clean Term- Document Matrix SVD Interpret Results
  • 18.
    18© Cloudera, Inc.All rights reserved. Tail Monkey Algorithm Scala Document 1 1.5 1.8 Document 2 2.0 4.3 Document 3 1.4 6.7 Document 4 1.6 Document 5 1.2 Term-Document Matrix
  • 19.
    19© Cloudera, Inc.All rights reserved. tf-idf • (Term Frequency) * (Inverse Document Frequency) • tf(document, word) = # times word appears in document • idf(word) = 1 / (# documents that contain word)
  • 20.
    20© Cloudera, Inc.All rights reserved. val rowVectors: RDD[Vector] = ...
  • 21.
    21© Cloudera, Inc.All rights reserved. Parse Raw Data Clean Term- Document Matrix SVD Interpret Results
  • 22.
    22© Cloudera, Inc.All rights reserved. Singular Value Decomposition • Factors matrix into the product of three matrices: U, S, and V • m = # documents • n = # terms • U is m x n • S is n x n • V is n x n
  • 23.
    23© Cloudera, Inc.All rights reserved. Low Rank Approximation • Account for synonymy by condensing related terms. • Account for polysemy by placing less weight on terms that have multiple meanings. • Throw out noise. SVD can find the rank-k approximation that has the lowest Frobenius distance from the original matrix.
  • 24.
    24© Cloudera, Inc.All rights reserved. Singular Value Decomposition • Factors matrix into the product of three matrices: U, S, and V • m = # documents • n = # terms • k = # concepts • U is m x n • S is k x k • V is k x n
  • 25.
    25© Cloudera, Inc.All rights reserved. Docs: Terms:U S V
  • 26.
    26© Cloudera, Inc.All rights reserved. Docs: Terms:U S V
  • 27.
    27© Cloudera, Inc.All rights reserved. rowVectors.cache() val mat = new RowMatrix(rowVectors) val k = 1000 val svd = mat.computeSVD(k, computeU=true)
  • 28.
    28© Cloudera, Inc.All rights reserved. Parse Raw Data Clean Term- Document Matrix SVD Interpret Results
  • 29.
    29© Cloudera, Inc.All rights reserved. What are the top “concepts”? I.e. what dimensions in term-space and document-space explain most of the variance of the data?
  • 30.
    30© Cloudera, Inc.All rights reserved. Docs: Terms:U S V
  • 31.
    31© Cloudera, Inc.All rights reserved. U S V
  • 32.
    32© Cloudera, Inc.All rights reserved. U S V
  • 33.
    33© Cloudera, Inc.All rights reserved. def topTermsInConcept(concept: Int, numTerms: Int) : Seq[(String, Double)] = { val v = svd.V.toBreezeMatrix val termWeights = v(::, k).toArray.zipWithIndex val sorted = termWeights.sortBy(-_._1) sorted.take(numTerms) }
  • 34.
    34© Cloudera, Inc.All rights reserved. def topDocsInConcept(concept: Int, numDocs: Int) : Seq[Seq[(String, Double)]] = { val u = svd.U val docWeights = u.rows.map(_.toArray(concept)).zipWithUniqueId() docWeights.top(numDocs) }
  • 35.
    35© Cloudera, Inc.All rights reserved. Concept 1 Terms: department, commune, communes, insee, france, see, also, southwestern, oise, marne, moselle, manche, eure, aisne, isère Docs: Communes in France, Saint-Mard, Meurthe-et-Moselle, Saint-Firmin, Meurthe-et-Moselle, Saint-Clément, Meurthe-et-Moselle, Saint-Sardos, Lot-et-Garonne, Saint-Urcisse, Lot-et-Garonne, Saint-Sernin, Lot-et-Garonne, Saint-Robert, Lot-et-Garonne, Saint-Léon, Lot-et-Garonne, Saint-Astier, Lot-et-Garonne
  • 36.
    36© Cloudera, Inc.All rights reserved. Concept 2 Terms: genus, species, moth, family, lepidoptera, beetle, bulbophyllum, snail, database, natural, find, geometridae, reference, museum, noctuidae Docs: Chelonia (genus), Palea (genus), Argiope (genus), Sphingini, Cribrilinidae, Tahla (genus), Gigartinales, Parapodia (genus), Alpina (moth), Arycanda (moth)
  • 37.
    37© Cloudera, Inc.All rights reserved. Querying • Given a set of terms, find the closest documents in the latent space
  • 38.
    38© Cloudera, Inc.All rights reserved. Reconstructed Matrix (U * S * V) Doc Term
  • 39.
    39© Cloudera, Inc.All rights reserved. def topTermsForTerm( normalizedVS : BDenseMatrix[Double], termId: Int): Seq[(Double, Int)] = { val rowVec = new BDenseVector[Double](row(normalizedVS, termId).toArray) val termScores = (normalizedVS * rowVec).toArray.zipWithIndex termScores.sortBy(- _._1).take(10) } val VS = multiplyByDiagonalMatrix(svd.V, svd.s) val normalizedVS = rowsNormalized(VS) topTermsForTerm(normalizedVS, id, termIds)
  • 40.
    40© Cloudera, Inc.All rights reserved. printRelevantTerms("radiohead") radiohead 0.9999999999999993 lyrically 0.8837403315233519 catchy 0.8780717902060333 riff 0.861326571452104 lyricsthe 0.8460798060853993 lyric 0.8434937575368959 upbeat 0.8410212279939793 Term Similarity
  • 41.
    41© Cloudera, Inc.All rights reserved. printRelevantTerms("algorithm") algorithm 1.000000000000002 heuristic 0.8773199836391916 compute 0.8561015487853708 constraint 0.8370707630657652 optimization 0.8331940333186296 complexity 0.823738607119692 algorithmic 0.8227315888559854 Term Similarity
  • 42.
    42© Cloudera, Inc.All rights reserved. (algorithm,1.000000000000002), (heuristic,0.8773199836391916), (compute,0.8561015487853708), (constraint,0.8370707630657652), (optimization,0.8331940333186296), (complexity,0.823738607119692), (algorithmic,0.8227315888559854), (iterative,0.822364922633442), (recursive,0.8176921180556759), (minimization,0.8160188481409465)
  • 43.
    43© Cloudera, Inc.All rights reserved. def topDocsForTerm( US: RowMatrix, V: Matrix, termId: Int) : Seq[(Double, Long)] = { val rowArr = row(V, termId).toArray val rowVec = Matrices.dense(termRowArr.length, 1, termRowArr) val docScores = US.multiply(termRowVec) val allDocWeights = docScores.rows.map( _.toArray(0)). zipWithUniqueId() allDocWeights.top( 10) }
  • 44.
    44© Cloudera, Inc.All rights reserved. printRelevantDocs("fir") Silver tree 0.006292909647173194 See the forest for the trees 0.004785047583508223 Eucalyptus tree 0.004592837783089319 Sequoia tree 0.004497446632469554 Willow tree 0.004429936059594164 Coniferous tree 0.004381572286629475 Tulip Tree 0.004374705020233878 Document Similarity
  • 45.
    45© Cloudera, Inc.All rights reserved. • https://github.com/sryza/aas/tree/master/ch06-lsa • https://spark.apache.org/docs/latest/mllib-dimensionality-reduction.html • More detail?
  • 46.
    46© Cloudera, Inc.All rights reserved. Thank you @sandysifting