Text Mining, Word Embeddings,
& Wikipedia
Muhammad Atif Qureshi
12/01/17 2
Contents
● Introduction
● Text Mining
– Similar words
– Word ambiguity
● Word Embedding
– Related Research
– Toy Example
● Wikipedia
– Structure
– Phrase Chunking
– Case studies
12/01/17 3
Problem
● Motivation
– Human beings find great comfort in expressing their viewpoints in writing
because writing preserves thoughts for a longer period of time than oral
communication.
– Textual data is a very popular means of communication over the World Wide Web
in the form of data on online news websites, social networks, emails, governmental
websites, etc.
● Observation
Text may contain the following complexities
– Lack of contextual and background information
– Ambiguity due to more than one possible interpretation of the meaning of text
– Focus and assertions on multiple topics
12/01/17 4
Text Mining
● Motivation
With so much textual data around us especially on
the World Wide Web, there is a motivation to
understand the meaning of the data
● Definition
It is the process by which textual data is analyzed in
order to derive high-quality information on the basis
of patterns in the text
12/01/17 5
Similar Words
● Can similar words be grouped together as one?
– Simple techniques
● Lemmatization (maps a word to its dictionary form, e.g., plurals
to singulars; accurate but low coverage)
● Stemming (maps a word to a root form by stripping affixes;
inaccurate but high coverage)
– Complex technique
● A word is known by the company it keeps → Word
Embeddings
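The trade-off between the two simple techniques can be illustrated with a minimal sketch. The lemma table and the suffix list below are small hand-made assumptions for illustration only, not a real lemmatizer or stemmer.

```python
# Toy contrast between a lookup-based lemmatizer and a rule-based stemmer.
# LEMMA_TABLE and SUFFIXES are illustrative assumptions, not real resources.
LEMMA_TABLE = {"mice": "mouse", "phones": "phone", "studies": "study"}
SUFFIXES = ("ies", "es", "s", "ing", "ed")


def lemmatize(word: str) -> str:
    """Return the dictionary form if the word is known, else the word itself."""
    word = word.lower()
    return LEMMA_TABLE.get(word, word)


def stem(word: str) -> str:
    """Strip the first matching suffix, whether or not the result is a real word."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


for w in ["phones", "studies", "mice", "walking"]:
    print(w, "-> lemma:", lemmatize(w), "| stem:", stem(w))
```

The lookup is accurate for the words it knows but leaves "walking" untouched (low coverage), while the suffix rules handle any word but can produce non-words such as "phon" and "stud" (inaccurate, high coverage).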
12/01/17 6
Word Ambiguity
● Is Apple a company or a fruit?
– “Apple tastes better than blackberry”
– “Apple phones are better than blackberry”
● Context is important
– Tastes → Fruit
– Phones → Apple Inc.
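A minimal sketch of how context words can decide the sense of "Apple". The cue-word sets are hand-picked assumptions for this example; a real system would learn such associations from data.

```python
# Cue words per sense are hand-picked assumptions for illustration.
SENSE_CUES = {
    "fruit": {"tastes", "sweet", "juice", "eat"},
    "Apple Inc.": {"phones", "iphone", "laptop", "software"},
}


def guess_sense(sentence: str) -> str:
    """Pick the sense whose cue words overlap most with the sentence."""
    tokens = set(sentence.lower().replace(".", "").split())
    scores = {sense: len(tokens & cues) for sense, cues in SENSE_CUES.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"


print(guess_sense("Apple tastes better than blackberry"))
print(guess_sense("Apple phones are better than blackberry"))
```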
12/01/17 7
Word Embedding
● Definition
– It is a technique in NLP that quantifies a concept
(word or phrase) as a vector of real numbers.
● Simple application scenario
– How similar are two words?
– Similarity(vector(good), vector(best))
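One common choice for the similarity function is cosine similarity. The sketch below assumes that choice, and the three-dimensional vectors are made-up values for illustration, not trained embeddings.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


# Made-up vectors; real embeddings typically have hundreds of dimensions.
good = [0.9, 0.1, 0.3]
best = [0.8, 0.2, 0.4]
bad = [-0.7, 0.2, 0.1]

print(cosine_similarity(good, best))
print(cosine_similarity(good, bad))
```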
12/01/17 8
Related Research
● Word embeddings
– Word2Vec
● It is a predictive model that uses a two-layer neural network
– FastText
● It is an extension of word2vec by Facebook
– GloVe
● It is a count-based model that performs dimensionality reduction on the
co-occurrence matrix
● Wikipedia-based relatedness
– Semantic Relatedness Framework
● It uses the Wikipedia sub-category hierarchy to measure relatedness
12/01/17 9
Toy Example → Word
Embeddings
● Build a co-occurrence matrix
● Apply cosine similarity
● Find vectors
● Further concepts
– Dimensionality reduction
– Window size
– Filter words
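The toy example can be sketched in a few lines of Python. The three-sentence corpus, the stopword set, and the window size are assumptions chosen for illustration. Filtering happens before windowing here, which is one possible design and not necessarily the one used in the talk. Dimensionality reduction is left out.

```python
from collections import Counter, defaultdict
import math

# Illustrative corpus and settings (assumptions, not data from the talk).
corpus = [
    "the king rules the kingdom",
    "the queen rules the kingdom",
    "the dog chases the ball",
]
STOPWORDS = {"the"}  # the "filter words" step
WINDOW = 2           # how many neighbours on each side count as context


def cooccurrence_vectors(sentences, window, stopwords):
    """Map each word to a sparse vector of counts of its context words."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = [t for t in sentence.lower().split() if t not in stopwords]
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(
        sum(x * x for x in v.values())
    )
    return dot / norm if norm else 0.0


vectors = cooccurrence_vectors(corpus, WINDOW, STOPWORDS)
print("king vs queen:", cosine(vectors["king"], vectors["queen"]))
print("king vs dog:  ", cosine(vectors["king"], vectors["dog"]))
```

In this corpus "king" and "queen" share the same context words ("rules", "kingdom"), so their vectors point in the same direction, while "king" and "dog" share none.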
12/01/17 10
Word Analogies
● Man is to Woman, King is to ____ ?
● London is to England, Islamabad is to
____ ?
● Using vectors, we can say
– King – Man + Woman → Queen
– Islamabad – London + England → Pakistan
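The vector arithmetic can be sketched with hand-made vectors. The five dimensions below (royalty, gender, is-city, is-country, South Asia) and all values are assumptions constructed so that the analogies work; trained embeddings only approximate this behaviour.

```python
import math

# Hand-made vectors: [royalty, gender, is_city, is_country, south_asia].
# These values are illustrative assumptions, not trained embeddings.
VECTORS = {
    "man": (0, 1, 0, 0, 0),
    "woman": (0, -1, 0, 0, 0),
    "king": (1, 1, 0, 0, 0),
    "queen": (1, -1, 0, 0, 0),
    "london": (0, 0, 1, 0, 0),
    "england": (0, 0, 0, 1, 0),
    "islamabad": (0, 0, 1, 0, 1),
    "pakistan": (0, 0, 0, 1, 1),
}


def analogy(a, b, c, vectors):
    """Solve 'a is to b as c is to ?' via vec(c) - vec(a) + vec(b)."""
    target = [vc - va + vb for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in {a, b, c})
    # Nearest remaining word by Euclidean distance to the target point.
    return min(candidates, key=lambda w: math.dist(vectors[w], target))


print(analogy("man", "woman", "king", VECTORS))
print(analogy("london", "england", "islamabad", VECTORS))
```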
12/01/17 11
Why Wikipedia for Text
Mining?
● One of the largest encyclopedias
● Free to use
● Collaboratively and actively updated
12/01/17 12
Wikipedia
● Each article has a title that identifies a concept.
● Each article contains content that defines a particular concept textually.
● Each article is mentioned inside different categories
– E.g., article ‘Espresso’ is mentioned inside ‘Coffee drinks’, ‘Italian cuisine’,
etc.
● Each Wikipedia category generally has parent and child categories.
– E.g., ‘Italian cuisine’ has parent categories ‘Italian culture’, ‘Cuisine by
nationality’, etc.
– E.g., ‘Italian cuisine’ has child categories ‘Italian desserts’, ‘Pizza’, etc.
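The structure described above can be held in plain dictionaries. This is a minimal sketch using only the ‘Espresso’ and ‘Italian cuisine’ examples from the slide; the dictionary layout and the function name are assumptions, not the format of any Wikipedia dump or API.

```python
# Article -> categories and category -> parents/children, using the slide's examples.
ARTICLE_CATEGORIES = {"Espresso": ["Coffee drinks", "Italian cuisine"]}
CATEGORY_PARENTS = {"Italian cuisine": ["Italian culture", "Cuisine by nationality"]}
CATEGORY_CHILDREN = {"Italian cuisine": ["Italian desserts", "Pizza"]}


def categories_above(article):
    """Categories reachable from an article by following parent edges upward."""
    seen = []
    frontier = list(ARTICLE_CATEGORIES.get(article, []))
    while frontier:
        category = frontier.pop(0)
        if category in seen:
            continue
        seen.append(category)
        frontier.extend(CATEGORY_PARENTS.get(category, []))
    return seen


print(categories_above("Espresso"))
print(CATEGORY_CHILDREN["Italian cuisine"])
```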
12/01/17 13
Wikipedia Graph Structure
[Figure: Wikipedia category graph structure along with Wikipedia articles. Nodes are categories (C1, C2, C3, C4, C5, C6, C7, C9, C10) and articles (A1, A2, A3, A4). Edge types: category edge, article belonging to category, article link.]
12/01/17 14
Example of Wikipedia Category Structure
[Figure: truncated Wikipedia category graph. Nodes: academic_disciplines, science, interdisciplinary_fields, scientific_disciplines, behavioural_sciences, society, social_sciences, science_studies, information_technology, information, sociology, information_science.]
12/01/17 15
Phrase Chunking using Wikipedia
● Input: I prefer Samsung S5 over HTC, Apple, Nokia because it is economical and good.
● Conversion into lowercase: i prefer samsung s5 over htc, apple, nokia because it is economical and good.
● Phrase chunking using phrase boundaries: i prefer samsung s5 over htc apple nokia because it is economical and good
● Longest phrase that matches a Wikipedia article title or redirect (and is not a stopword)
– Extracted phrases: prefer, samsung s5, htc, apple, nokia, economical, good
– Removed stopwords: i, over, because, it, is, and
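The chunking steps on this slide (lowercasing, splitting on phrase boundaries, longest match against Wikipedia titles or redirects, stopword removal) can be sketched as follows. The title set and stopword list are small hand-made stand-ins for the real Wikipedia title/redirect index, and the greedy left-to-right matching and the `max_len` limit are assumptions about the details.

```python
import re

# Hypothetical stand-ins for the Wikipedia title/redirect index and a stopword list.
WIKI_TITLES = {
    "samsung", "samsung s5", "htc", "apple", "nokia",
    "prefer", "economical", "good", "it", "is",
}
STOPWORDS = {"i", "over", "because", "it", "is", "and"}


def chunk_phrases(text, titles, stopwords, max_len=3):
    """Greedy longest-match chunking; returns (extracted phrases, removed stopwords)."""
    extracted, removed = [], []
    # Punctuation marks act as phrase boundaries: no phrase may span them.
    for segment in re.split(r"[,.;:!?]", text.lower()):
        tokens = segment.split()
        i = 0
        while i < len(tokens):
            # Try the longest candidate phrase first, then shorter ones.
            for n in range(min(max_len, len(tokens) - i), 0, -1):
                phrase = " ".join(tokens[i : i + n])
                if phrase in titles and phrase not in stopwords:
                    extracted.append(phrase)
                    i += n
                    break
            else:
                # Nothing matched at this position: record stopwords, skip the token.
                if tokens[i] in stopwords:
                    removed.append(tokens[i])
                i += 1
    return extracted, removed


sentence = "I prefer Samsung S5 over HTC, Apple, Nokia because it is economical and good."
phrases, dropped = chunk_phrases(sentence, WIKI_TITLES, STOPWORDS)
print(phrases)
print(dropped)
```

Because longer candidates are tried first, "samsung s5" is extracted as one phrase even though "samsung" alone is also in the title set. "it" and "is" are titles too, but they are rejected because they are stopwords.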
12/01/17 16
Word Embedding using
Wikipedia
● We can find more complex relationships
due to
– Article-Category Graph structure
– Multi-lingual relations
– Infobox, birth, age, etc
12/01/17 17
Wikipedia Based Semantic Relatedness Framework
[Figure: framework diagram. Components: Wikipedia Documents, Phrase Chunking, Relatedness Calculator. Data labels: Stream of Text, Wikipedia Article Title or Redirect, Candidate Phrases, Wikipedia Category-Article Structure, Relatedness Scores. Applications: Online Reputation Management Tasks, Perspective Aware Search Engine.]
12/01/17 18
Perspective Aware Approach to
Search
● Problem: The result set from a search engine
(Google, Bing, Yahoo) for any user's query may have
an inherent perspective given issues with the search
engine or issues with the underlying collection.
● PAS is a system that allows users to specify a perspective
at query time together with their query.
● The system allows the users to quickly surmise the
presence of the perspective in the returned set.
12/01/17 19
Perspective Aware Approach to
Search
● Perspective is modelled by making use of
Wikipedia articles-categories graph
structure
– Perspective: activism
– The system fetches Wikipedia articles that define activism by
traversing the category graph structure
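One way to collect the articles that define a perspective is a breadth-first walk down the category graph from the perspective's category. The sketch below is an assumption about how such a lookup could work, not the PAS implementation; the category names, article names, and depth limit are hypothetical.

```python
from collections import deque

# Hypothetical slice of a category graph (names are illustrative only).
SUBCATEGORIES = {
    "activism": ["environmentalism", "civil rights movements"],
    "environmentalism": ["climate activism"],
}
CATEGORY_ARTICLES = {
    "activism": ["Activism", "Protest"],
    "environmentalism": ["Greenpeace"],
    "civil rights movements": ["Boycott"],
    "climate activism": ["School Strike for Climate"],
}


def articles_for_perspective(category, max_depth):
    """Collect articles in a category and its subcategories up to max_depth."""
    articles, seen = set(), {category}
    queue = deque([(category, 0)])
    while queue:
        current, depth = queue.popleft()
        articles.update(CATEGORY_ARTICLES.get(current, []))
        if depth < max_depth:
            for child in SUBCATEGORIES.get(current, []):
                if child not in seen:  # guard against cycles in the graph
                    seen.add(child)
                    queue.append((child, depth + 1))
    return articles


print(sorted(articles_for_perspective("activism", max_depth=1)))
```

The depth limit matters because subcategories tend to drift away from the original topic as the walk goes deeper.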
12/01/17 20
Perspective Aware Approach to
Search
12/01/17 21
Keyword Extraction via
Identification of Domain-Specific
Keywords
[Figure: pipeline diagram with the stages: Title of Web Pages; Wikipedia Articles & Redirects; Intersected Phrases (identifies readable phrases); Community Detection Algorithm over the Wikipedia Category Graph (by exploiting Wikipedia article-category structure); Domain-Specific Phrases; Domain-Specific Single Terms; merging both; Domain-Specific Keywords.]
● Problem: Given a collection of document titles from different school
websites, we extract domain-specific keywords that represent the domain
of the entire website.
● Example: “Information Retrieval”, “Science”
12/01/17 22
Innovation in Automotive
Red → Probability 1.0
Green → Probability 0.5
White → Probability 0.0
Size represents how often a category is mentioned in the dataset
12/01/17 23
Python Snippet for the Usage of
the WikiMadeEasy API
● wiki_client = Wiki_client_service()
● print(wiki_client.process(['isTitle', 'business', 0]))
● print(wiki_client.process(['isPerson', 'albert einstein', 0]))
● print(wiki_client.process(['mentionInCategories', 'data mining', 0]))
● print(wiki_client.process(['containsArticles', 'business', 0]))
● print(wiki_client.process(['matchesCategories', 'pakistan', 0]))
● print(wiki_client.process(['matchesArticles', 'computer science', 0]))
● print(wiki_client.process(['getWikiOutlinks', 'pagerank', 0]))
● print(wiki_client.process(['getWikiInlinks', 'google', 0]))
● print(wiki_client.process(['getExtendedAbstract', 'pakistan', 0]))
● print(wiki_client.process(['getSubCategory', 'science', 0]))
● print(wiki_client.process(['getSuperCategory', 'science', 0]))
● graph_dict = wiki_client.process(['getSubtoSuperCategoryGraph', ['information_science', 'sociology'], 2])
12/01/17 24
Question

Text mining, word embeddings, & wikipedia

  • 1. Text Mining, Word Embeddings, & Wikipedia Muhammad Atif Qureshi
  • 2. 12/01/17 2 Contents ● Introduction ● Text Mining – Similar words – Word ambiguity ● Word Embedding – Related Research – Toy Example ● Wikipedia – Structure – Phrase Chunking – Case studies
  • 3. 12/01/17 3 Problem ● Motivation – Human beings have found a great comfort in expressing their viewpoint in writing because of its ability to preserve thoughts for a longer period of time than oral communication. – Textual data is a very popular means of communication over the World Wide Web in the form of data on online news websites, social networks, emails, governmental websites, etc. ● Observation Text may contain the following complexities – Lack of contextual and background information – Ambiguity due to more than one possible interpretation of the meaning of text – Focus and assertions on multiple topics
  • 4. 12/01/17 4 Text Mining ● Motivation With so much textual data around us especially on the World Wide Web, there is a motivation to understand the meaning of the data ● Definition It is the process by which textual data is analyzed in order to derive high quality information on the basis of patterns
  • 5. 12/01/17 5 Similar Words ● Can similar words be group together as one? – Simple techniques ● Lemmatization (mapping plural to singulars, accurate but low coverage) ● Stemming (map word to a root word, inaccurate but high coverage) – Complex technique ● A word is known by the company it keeps → Word Embeddings
  • 6. 12/01/17 6 Word Ambiguity ● Is Apple a company or a fruit? – “Apple tastes better than blackberry” – “Apple phones are better than blackberry” ● Context is important – Tastes → Fruit – Phones → Apple Inc.
  • 7. 12/01/17 7 Word Embedding ● Definition – It is a technique in NLP that quantifies a concept (word or phrase) as a vector of real numbers. ● Simple application scenario – How similar are two words? – Similarity(vector(good), vector(best))
  • 8. 12/01/17 8 Related Research ● Word embeddings – Word2Vec ● It is a predictive model which uses two layer neural networks – FastText ● It is an extension to word2vec by Facebook – GloVe ● It is a count based model which performs dimensionality reduction on the co- occurrence matrix ● Wikipedia based Relatedness – Semantic Relatedness Framework ● It uses Wikipedia sub-category hierarchy to measure relatedness
  • 9. 12/01/17 9 Toy Example → Word Embeddings ● Train co-occurence matrix ● Apply cosine similarity ● Find vectors ● Further concepts – Dimestionality Reduction – Window size – Filter words
  • 10. 12/01/17 10 Word Analogies ● Man is to Woman, King is to ____ ? ● London is to England, Islamabad is to ____ ? ● Using vectors, we can say – King – Man + Woman → Queen – Islamabad – London + England → Pakistan
  • 11. 12/01/17 11 Why Wikipedia for Text Mining? ● One of the largest encyclopedias ● Free to use ● Collaboratively and actively updated
  • 12. 12/01/17 12 Wikipedia ● Each article has a title that identifies a concept. ● Each article contains content that defines a particular concept textually. ● Each article is mentioned inside different categories – E.g., article ‘Espresso’ is mentioned inside ‘Coffee drinks’, ‘Italian cuisine’, etc. ● Each Wikipedia category generally contains parent and children categories. – E.g., ‘Italian cuisine’ has parent categories ‘Italian culture’, ‘Cuisine by nationality’, etc. – E.g., ‘Italian cuisine’ has children categories ‘Italian desserts’, ‘Pizza’, etc.
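One possible in-memory representation of this structure is a pair of dictionaries. The entries are the slide's own examples; the variable and function names are hypothetical.

```python
# Article -> categories it is mentioned in (examples from the slide).
ARTICLE_CATEGORIES = {
    "Espresso": {"Coffee drinks", "Italian cuisine"},
}

# Category -> parent categories (examples from the slide).
CATEGORY_PARENTS = {
    "Italian cuisine": {"Italian culture", "Cuisine by nationality"},
    "Italian desserts": {"Italian cuisine"},
    "Pizza": {"Italian cuisine"},
}


def children_of(category):
    """Children are derived by inverting the parent mapping."""
    return {c for c, parents in CATEGORY_PARENTS.items() if category in parents}


print(sorted(ARTICLE_CATEGORIES["Espresso"]))
print(sorted(CATEGORY_PARENTS["Italian cuisine"]))
print(sorted(children_of("Italian cuisine")))
```

Storing only the parent links and deriving children on demand keeps the two directions consistent; a larger graph would normally precompute the inverse mapping instead.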
  • 13. 12/01/17 13 Wikipedia Graph Structure ● [Figure: Wikipedia category graph structure along with Wikipedia articles. Category nodes: C1, C2, C3, C4, C5, C6, C7, C9, C10. Article nodes: A1, A2, A3, A4. Legend: Category; Article; Category Edge; Article Belonging to Category; Article Link]
  • 14. 12/01/17 14 Example of Wikipedia Category Structure ● [Figure: truncated Wikipedia category graph. Nodes: academic_disciplines, science, interdisciplinary_fields, scientific_disciplines, behavioural_sciences, society, social_sciences, science_studies, information_technology, information, sociology, information_science]
  • 15. 12/01/17 15 Phrase Chunking using Wikipedia ● Input: “I prefer Samsung S5 over HTC, Apple, Nokia because it is economical and good.” ● Conversion into lowercase: “i prefer samsung s5 over htc, apple, nokia because it is economical and good.” ● Phrase chunking using phrase boundaries: i prefer samsung s5 over htc apple nokia because it is economical and good ● Longest phrase that matches with Wikipedia Article Title or Redirect (which is not a stopword) – Extracted phrases: samsung s5, prefer, htc, apple, nokia, economical – Removed stopwords: over, i, because, it, and, good, is
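The chunking steps on this slide can be sketched as a greedy longest-match over a set of titles. The title set and stopword list below are small hypothetical stand-ins for Wikipedia article titles and redirects, treating punctuation as the phrase boundary is an assumption, and `max_len` is an illustrative cap on phrase length.

```python
import re

# Hypothetical stand-in for Wikipedia article titles and redirects.
TITLES = {"samsung s5", "samsung", "prefer", "htc", "apple", "nokia", "economical", "good"}
# Hypothetical stopword list, matching the words the slide shows as removed.
STOPWORDS = {"i", "over", "because", "it", "is", "and", "good"}


def chunk_phrases(text, titles=TITLES, stopwords=STOPWORDS, max_len=4):
    """Lowercase, split on punctuation, then greedily take the longest title match."""
    phrases = []
    for segment in re.split(r"[,.;:!?]+", text.lower()):
        tokens = segment.split()
        i = 0
        while i < len(tokens):
            # Try the longest candidate first, shrinking down to a single token.
            for n in range(min(max_len, len(tokens) - i), 0, -1):
                candidate = " ".join(tokens[i:i + n])
                if candidate in titles and candidate not in stopwords:
                    phrases.append(candidate)
                    i += n
                    break
            else:
                i += 1  # No title starts here; skip this token.
    return phrases


sentence = "I prefer Samsung S5 over HTC, Apple, Nokia because it is economical and good."
print(chunk_phrases(sentence))
```

Trying the longest candidate first is what makes "samsung s5" win over the shorter title "samsung", and the stopword check is what drops "good" even though it is a valid title.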
  • 16. 12/01/17 16 Word Embedding using Wikipedia ● We can find more complex relationships due to – Article-Category Graph structure – Multi-lingual relations – Infobox data (birth, age, etc.)
  • 17. 12/01/17 17 Wikipedia Based Semantic Relatedness Framework ● [Figure: framework diagram. Labels: Wikipedia Documents; Stream of Text; Phrase Chunking; Wikipedia Article Title or Redirect; Candidate Phrases; Relatedness Calculator; Wikipedia Category-Article Structure; Relatedness Scores; Online Reputation Management Tasks; Perspective Aware Search Engine]
  • 18. 12/01/17 18 Perspective Aware Approach to Search ● Problem: The result set from a search engine (Google, Bing, Yahoo) for any user's query may have an inherent perspective given issues with the search engine or issues with the underlying collection. ● PAS is a system that allows users to specify a perspective at query time together with their query. ● The system allows users to quickly surmise the presence of the perspective in the returned set.
  • 19. 12/01/17 19 Perspective Aware Approach to Search ● Perspective is modelled by making use of the Wikipedia article-category graph structure – Perspective: activism – Articles defining activism are fetched from Wikipedia by looking into the category graph structure
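The slides do not say how the category graph is traversed. One plausible sketch is a depth-limited breadth-first search from the perspective's category that collects the articles found along the way. The graph data, the depth limit, and all names below are hypothetical.

```python
from collections import deque

# Hypothetical toy graph: category -> sub-categories (note the deliberate cycle).
SUBCATEGORIES = {
    "activism": ["activism by issue"],
    "activism by issue": ["environmentalism"],
    "environmentalism": ["activism"],
}
# Hypothetical toy data: category -> articles belonging to it.
CATEGORY_ARTICLES = {
    "activism": ["Protest"],
    "activism by issue": ["Boycott"],
    "environmentalism": ["Recycling"],
}


def articles_for_perspective(category, max_depth=2):
    """Collect articles from `category` and its sub-categories up to `max_depth` levels down."""
    seen = {category}
    queue = deque([(category, 0)])
    articles = set()
    while queue:
        current, depth = queue.popleft()
        articles.update(CATEGORY_ARTICLES.get(current, ()))
        if depth < max_depth:
            for child in SUBCATEGORIES.get(current, ()):
                if child not in seen:  # Guard against cycles in the category graph.
                    seen.add(child)
                    queue.append((child, depth + 1))
    return articles


print(sorted(articles_for_perspective("activism", max_depth=1)))
```

The depth limit matters because moving further down a category hierarchy tends to drift away from the original perspective, and the `seen` set is needed because category graphs can contain cycles.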
  • 20. 12/01/17 20 Perspective Aware Approach to Search
  • 21. 12/01/17 21 Keyword Extraction via Identification of Domain-Specific Keywords by Exploiting Wikipedia Article-Category Structure ● Problem: Given a collection of document titles from different school websites, we extract domain-specific keywords for the entire website that represent the domain. ● Example: “Information Retrieval”, “Science” ● [Figure: pipeline diagram. Labels: Title of Web Pages; Wikipedia Articles & Redirects; Intersected Phrases (identifies readable phrases); Community Detection Algorithm; Wikipedia Category Graph; Domain-Specific Phrases; Domain-Specific Single Terms; Merging both; Domain-Specific Keywords]
  • 22. 12/01/17 22 Innovation in Automotive ● [Figure legend] Red → Probability 1.0; Green → Probability 0.5; White → Probability 0.0 ● Size represents how much a category is mentioned inside the dataset
  • 23. 12/01/17 23 Python Snippet for the Usage of the WikiMadeEasy API ● wiki_client = Wiki_client_service() ● print(wiki_client.process(['isTitle', 'business', 0])) ● print(wiki_client.process(['isPerson', 'albert einstein', 0])) ● print(wiki_client.process(['mentionInCategories', 'data mining', 0])) ● print(wiki_client.process(['containsArticles', 'business', 0])) ● print(wiki_client.process(['matchesCategories', 'pakistan', 0])) ● print(wiki_client.process(['matchesArticles', 'computer science', 0])) ● print(wiki_client.process(['getWikiOutlinks', 'pagerank', 0])) ● print(wiki_client.process(['getWikiInlinks', 'google', 0])) ● print(wiki_client.process(['getExtendedAbstract', 'pakistan', 0])) ● print(wiki_client.process(['getSubCategory', 'science', 0])) ● print(wiki_client.process(['getSuperCategory', 'science', 0])) ● graph_dict = wiki_client.process(['getSubtoSuperCategoryGraph', ['information_science', 'sociology'], 2])