Uploaded by Sandy Ryza

Latent Semantic Analysis of Wikipedia with Spark

This document describes how to perform latent semantic analysis (LSA) on Wikipedia data using Apache Spark. It discusses parsing Wikipedia XML data, creating a term-document matrix, applying singular value decomposition to reduce the matrix's rank, and interpreting the results to find concepts and related documents. Key steps include parsing Wikipedia pages into terms and documents, cleaning the data through lemmatization and removing stop words, creating a tf-idf weighted term-document matrix, applying SVD to get U, S, and V matrices, and using these to find terms and documents most strongly related to given queries.

1© Cloudera, Inc. All rights reserved.
LSA-ing Wikipedia with Spark
Sandy Ryza | Senior Data Scientist

Me
• Data scientist at Cloudera
• Recently led Cloudera’s Apache Spark development
• Author of Advanced Analytics with Spark

Latent Semantic Analysis
• Fancy name for applying a matrix decomposition (SVD) to text data

Pipeline: Parse Raw Data -> Clean -> Term-Document Matrix -> SVD -> Interpret Results

Wikipedia Content Data Set
• http://dumps.wikimedia.org/enwiki/latest/
• XML-formatted
• 46 GB uncompressed

<page>
<title>Anarchism</title>
<ns>0</ns>
<id>12</id>
<revision>
<id>584215651</id>
<parentid>584213644</parentid>
<timestamp>2013-12-02T15:14:01Z</timestamp>
<contributor>
<username>AnomieBOT</username>
<id>7611264</id>
</contributor>
<comment>Rescuing orphaned refs ("autogenerated1" from rev
584155010; "bbc" from rev 584155010)</comment>
<text xml:space="preserve">{{Redirect|Anarchist|the fictional character|
Anarchist (comics)}}
{{Redirect|Anarchists}}
{{pp-move-indef}}
{{Anarchism sidebar}}
'''Anarchism''' is a [[political philosophy]] that advocates [[stateless society|
stateless societies]] often defined as [[self-governance|self-governed]] voluntary
institutions,<ref>"ANARCHISM, a social philosophy that rejects
authoritarian government and maintains that voluntary institutions are best suited
to express man's natural social tendencies." George Woodcock.
"Anarchism" at The Encyclopedia of Philosophy</ref><ref>
"In a society developed on these lines, the voluntary associations which
already now begin to cover all the fields of human activity would take a still
greater extension so as to substitute
...

import org.apache.mahout.text.wikipedia.XmlInputFormat
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io._

// Treat everything between <page> and </page> as one record
val path = "hdfs:///user/ds/wikidump.xml"
val conf = new Configuration()
conf.set(XmlInputFormat.START_TAG_KEY, "<page>")
conf.set(XmlInputFormat.END_TAG_KEY, "</page>")
// sc is the SparkContext; each value holds the XML of one page
val kvs = sc.newAPIHadoopFile(path,
  classOf[XmlInputFormat],
  classOf[LongWritable],
  classOf[Text],
  conf)
// Keep only the XML string of each page
val rawXmls = kvs.map(p => p._2.toString)

Lemmatization
“the boy’s cars are different colors”
“the boy car be different color”

CoreNLP
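import java.util.Properties
import edu.stanford.nlp.pipeline.StanfordCoreNLP

def createNLPPipeline(): StanfordCoreNLP = {
  val props = new Properties()
  // Tokenize, split sentences, tag parts of speech, then lemmatize
  props.put("annotators", "tokenize, ssplit, pos, lemma")
  new StanfordCoreNLP(props)
}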

Stop Words
“the boy car be different color”
“boy car different color”
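
The stop-word step above can be sketched in plain Scala. This is only an illustration: the object name `StopWordsSketch` and the tiny stop-word set are assumptions made for the example, not the list used in the talk.

```scala
object StopWordsSketch {
  // A tiny illustrative stop-word set (an assumption for this example);
  // a real run would load a much longer list.
  val stopWords: Set[String] = Set("the", "be", "a", "an", "of", "and")

  // Keep only the lemmas that are not stop words, preserving their order.
  def removeStopWords(lemmas: Seq[String]): Seq[String] =
    lemmas.filterNot(stopWords.contains)
}
```

Calling `StopWordsSketch.removeStopWords` on the lemmas of “the boy car be different color” keeps boy, car, different, and color.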

Term-Document Matrix
Terms: Tail, Monkey, Algorithm, Scala
Document 1 1.5 1.8
Document 2 2.0 4.3
Document 3 1.4 6.7
Document 4 1.6
Document 5 1.2

tf-idf
• (Term Frequency) * (Inverse Document Frequency)
• tf(document, word) = # times word appears in document
• idf(word) = 1 / (# documents that contain word)
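
As a rough sketch, the weighting defined above can be computed on a handful of in-memory documents with plain Scala collections. The object and method names are hypothetical, and the code follows the slide's simplified idf; many implementations use a log-scaled variant such as log(N / document frequency) instead.

```scala
object TfIdfSketch {
  // tf(document, word): number of times the word appears in the document.
  def termFrequencies(doc: Seq[String]): Map[String, Int] =
    doc.groupBy(identity).map { case (term, occurrences) => (term, occurrences.size) }

  // Number of documents that contain each word.
  def documentFrequencies(docs: Seq[Seq[String]]): Map[String, Int] =
    docs.flatMap(_.distinct).groupBy(identity).map { case (term, ds) => (term, ds.size) }

  // tf-idf per document, with idf(word) = 1 / (# documents that contain word).
  def tfIdf(docs: Seq[Seq[String]]): Seq[Map[String, Double]] = {
    val df = documentFrequencies(docs)
    docs.map { doc =>
      termFrequencies(doc).map { case (term, tf) => (term, tf.toDouble / df(term)) }
    }
  }
}
```

Each resulting map corresponds to one sparse row of the term-document matrix: terms that are absent from a document simply have no entry.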

val rowVectors: RDD[Vector] = ...

Singular Value Decomposition
• Factors matrix into the product of three matrices: U, S, and V
• m = # documents
• n = # terms
• U is m x n
• S is n x n
• V is n x n

Low Rank Approximation
• Account for synonymy by condensing related terms.
• Account for polysemy by placing less weight on terms that have multiple meanings.
• Throw out noise.
SVD can find the rank-k approximation that has the lowest Frobenius distance from the
original matrix.
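
The Frobenius-distance statement can be written out as follows. The notation is introduced here for illustration and is not on the slide: M is the original matrix, M_k is the matrix obtained by keeping only the k largest singular values, and σ_i are the singular values in descending order.

```latex
M_k = \underset{\operatorname{rank}(B) \le k}{\arg\min}\; \lVert M - B \rVert_F,
\qquad
\lVert M - M_k \rVert_F^2 = \sum_{i > k} \sigma_i^2
```

The second identity shows what is thrown away: the error of the approximation is exactly the energy in the discarded singular values.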

Singular Value Decomposition
• Factors matrix into the product of three matrices: U, S, and V
• m = # documents
• n = # terms
• k = # concepts
• U is m x k
• S is k x k
• V is n x k (its transpose, the third factor in the product, is k x n)
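
For reference, the shapes in the rank-k factorization line up as below, with M standing for the m x n term-document matrix (the symbol M is an assumption; the slide does not name the matrix):

```latex
\underbrace{M}_{m \times n}
\;\approx\;
\underbrace{U}_{m \times k}\;
\underbrace{S}_{k \times k}\;
\underbrace{V^{T}}_{k \times n}
```

Each row of U describes a document in terms of the k concepts, and each row of V describes a term in terms of the same k concepts.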

[Figure: the document-term matrix, with Docs and Terms axes, factored into U, S, and V]

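import org.apache.spark.mllib.linalg.distributed.RowMatrix

rowVectors.cache()  // computeSVD makes multiple passes over the data
val mat = new RowMatrix(rowVectors)
val k = 1000  // number of concepts to keep
val svd = mat.computeSVD(k, computeU=true)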

What are the top “concepts”?
I.e. what dimensions in term-space and document-space explain most of the variance of
the data?

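def topTermsInConcept(concept: Int, numTerms: Int): Seq[(String, Double)] = {
  val v = svd.V  // local n x k matrix, stored column-major
  val offset = concept * v.numRows
  // Column `concept` of V: one weight per term
  val termWeights = v.toArray.slice(offset, offset + v.numRows).zipWithIndex
  val sorted = termWeights.sortBy(-_._1)
  // termIds is assumed to map a term's index to the term string
  sorted.take(numTerms).map { case (weight, id) => (termIds(id), weight) }.toSeq
}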

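def topDocsInConcept(concept: Int, numDocs: Int): Seq[(Double, Long)] = {
  val u = svd.U  // RowMatrix with one row per document
  // Weight of each document in the given concept, paired with a unique row id
  val docWeights = u.rows.map(_.toArray(concept)).zipWithUniqueId()
  // Highest weights first, as (weight, document id)
  docWeights.top(numDocs).toSeq
}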

Concept 1
Terms: department, commune, communes, insee, france, see, also,
southwestern, oise, marne, moselle, manche, eure, aisne, isère
Docs: Communes in France, Saint-Mard, Meurthe-et-Moselle,
Saint-Firmin, Meurthe-et-Moselle, Saint-Clément, Meurthe-et-Moselle,
Saint-Sardos, Lot-et-Garonne, Saint-Urcisse, Lot-et-Garonne, Saint-Sernin,
Lot-et-Garonne, Saint-Robert, Lot-et-Garonne, Saint-Léon, Lot-et-Garonne,
Saint-Astier, Lot-et-Garonne

Concept 2
Terms: genus, species, moth, family, lepidoptera, beetle, bulbophyllum,
snail, database, natural, find, geometridae, reference, museum, noctuidae
Docs: Chelonia (genus), Palea (genus), Argiope (genus), Sphingini,
Cribrilinidae, Tahla (genus), Gigartinales, Parapodia (genus),
Alpina (moth), Arycanda (moth)

Querying
• Given a set of terms, find the closest documents in the latent space

Reconstructed Matrix
(U * S * V)
[Figure: matrix with Doc and Term axes]

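import breeze.linalg.{DenseMatrix => BDenseMatrix, DenseVector => BDenseVector}

// row, multiplyByDiagonalMatrix, and rowsNormalized are helper functions not shown on this slide
def topTermsForTerm(
    normalizedVS: BDenseMatrix[Double],
    termId: Int): Seq[(Double, Int)] = {
  // Row of VS for the query term
  val rowVec = new BDenseVector[Double](row(normalizedVS, termId).toArray)
  // Rows are normalized, so each dot product is a cosine similarity
  val termScores = (normalizedVS * rowVec).toArray.zipWithIndex
  // Ten highest-scoring terms as (score, term index)
  termScores.sortBy(- _._1).take(10)
}

val VS = multiplyByDiagonalMatrix(svd.V, svd.s)
val normalizedVS = rowsNormalized(VS)
topTermsForTerm(normalizedVS, id)  // id: index of the query term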
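
The scoring idea can be tried without Spark or Breeze. The sketch below is a plain-Scala stand-in, not the code from the talk: the object name is hypothetical, and `rows` plays the role of the rows of V * S, one per term. Rows are scaled to unit length so that a dot product is a cosine similarity.

```scala
object TermSimilaritySketch {
  // Scale a vector to unit length (left unchanged if it is all zeros).
  def normalize(v: Array[Double]): Array[Double] = {
    val norm = math.sqrt(v.map(x => x * x).sum)
    if (norm == 0.0) v else v.map(_ / norm)
  }

  // Dot product of two equal-length vectors.
  def dot(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (x, y) => x * y }.sum

  // Rank all rows by cosine similarity to row `termId`, highest first.
  def topTermsForTerm(rows: Array[Array[Double]], termId: Int, n: Int): Seq[(Double, Int)] = {
    val normalized = rows.map(normalize)
    val query = normalized(termId)
    normalized.map(dot(_, query)).zipWithIndex.sortBy(-_._1).take(n).toSeq
  }
}
```

A term's similarity to itself is 1 up to floating-point rounding, which is why a query term scores values such as 0.9999999999999993 rather than exactly 1.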

printRelevantTerms("radiohead")
radiohead 0.9999999999999993
lyrically 0.8837403315233519
catchy 0.8780717902060333
riff 0.861326571452104
lyricsthe 0.8460798060853993
lyric 0.8434937575368959
upbeat 0.8410212279939793
Term Similarity

printRelevantTerms("algorithm")
algorithm 1.000000000000002
heuristic 0.8773199836391916
compute 0.8561015487853708
constraint 0.8370707630657652
optimization 0.8331940333186296
complexity 0.823738607119692
algorithmic 0.8227315888559854
Term Similarity

(algorithm,1.000000000000002), (heuristic,0.8773199836391916),
(compute,0.8561015487853708), (constraint,0.8370707630657652),
(optimization,0.8331940333186296), (complexity,0.823738607119692),
(algorithmic,0.8227315888559854), (iterative,0.822364922633442),
(recursive,0.8176921180556759), (minimization,0.8160188481409465)

def topDocsForTerm(US: RowMatrix, V: Matrix, termId: Int): Seq[(Double, Long)] = {
  // Row of V for the query term, as a k x 1 column matrix
  val termRowArr = row(V, termId).toArray
  val termRowVec = Matrices.dense(termRowArr.length, 1, termRowArr)
  // Score of every document against the term
  val docScores = US.multiply(termRowVec)
  val allDocWeights = docScores.rows.map(_.toArray(0)).zipWithUniqueId()
  // Ten highest-scoring documents as (score, document id)
  allDocWeights.top(10)
}

printRelevantDocs("fir")
Silver tree 0.006292909647173194
See the forest for the trees 0.004785047583508223
Eucalyptus tree 0.004592837783089319
Sequoia tree 0.004497446632469554
Willow tree 0.004429936059594164
Coniferous tree 0.004381572286629475
Tulip Tree 0.004374705020233878
Document Similarity

More detail?
• https://github.com/sryza/aas/tree/master/ch06-lsa
• https://spark.apache.org/docs/latest/mllib-dimensionality-reduction.html

Thank you
@sandysifting
