Web Intelligence
Text Mining, and web-related
Applications
WEB-SOM
A self-organizing-map (SOM) algorithm applied to over 1M newsgroup posts.
See http://websom.hut.fi/websom/milliondemo/html/root.html and play around with it.
Finding similar literature
Two different www documents X and Y might be closely related.
If they are, then:
• a user interested in X will also probably be interested in Y
• If X is highly ranked in a search, Y should also be made prominently
available to the searcher
• If a user is specifically trying to find documents similar to X, then
Y is one of them.
But, the problem is:
• X might turn up in a search, but not Y. There are no links between
X and Y; they may be in widely separated components of the www
graph.
Another way of looking at it
Suppose you do a search on the keyword pasta
• Google may retrieve 1,000,000 documents
• How can you (or, hopefully, an automated system) usefully
organise these documents?
• If the documents were automatically clustered, so that similar
documents were put together in the same cluster,
then we would be able to impose useful organisation.
• E.g. one cluster might be documents about the history of pasta,
another cluster may be mainly recipes, etc…
So, it will be very useful if we have some way of working out
similarity between documents – then we can cluster them.
Applications/Motivations for
document similarity
• Recommendations
– Many search engines and other sites try to help you manage your
bookmarks/favourites; as part of this they offer recommendations,
i.e. “if you like that, you might also like these …”
– On Amazon, or any general product sales site, this can be based on
distances between (e.g.) 200-word summaries or the table of contents
(ToC) of a book, or text that describes a product in a catalogue
• Research (scientific, scholarly, for lit review, for market
research)
• Mapping for Browsing purposes – a 2D visualisation of the
web, or a subset, where each page is a (clickable) point,
and distance between them is related to document
similarity
But a document is a “bag of words”
– to work out distances, we need
numbers
How did I get these vectors from these
two `documents’?
<h1> Compilers: lecture 1 </h1>
<p> This lecture will introduce the
concept of lexical analysis, in which
the source code is scanned to reveal
the basic tokens it contains. For this,
we will need the concept of
regular expressions (r.e.s).</p>
<h1> Compilers</h1>
<p> The Guardian uses several
compilers for its daily cryptic
crosswords. One of the most
frequently used is Araucaria,
and one of the most difficult
is Bunthorne.</p>
(35, 2, 0)    (26, 2, 2)
What about these two vectors?
<h1> Compilers: lecture 1 </h1>
<p> This lecture will introduce the
concept of lexical analysis, in which
the source code is scanned to reveal
the basic tokens it contains. For this,
we will need the concept of
regular expressions (r.e.s).</p>
<h1> Compilers</h1>
<p> The Guardian uses several
compilers for its daily cryptic
crosswords. One of the most
frequently used is Araucaria,
and one of the most difficult
is Bunthorne.</p>
(0, 0, 0, 1, 1, 1)    (1, 1, 1, 0, 0, 0)
An unfair question, but I got that by using the following word
vector:
(Crossword, Cryptic, Difficult, Expression, Lexical, Token)
If a document contains the word `crossword’, it gets a 1 in
position 1 of the vector, otherwise 0. If it contains `lexical’, it gets
a 1 in position 5, otherwise 0, and so on.
How similar would the vectors be for two pages about crossword
compilers?
The key to measuring document similarity is turning documents
into vectors based on specific words and their frequencies.
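As a rough check, the two binary vectors above can be reproduced with a few lines of Python. This is only a sketch: matching by word prefix (so that `crosswords' counts as `crossword') and the Euclidean distance at the end are assumptions made for illustration, not something the slides specify.

```python
import math
import re

# Word vector from the slide. Prefix matching is an assumption, used so that
# "crosswords" counts as "crossword" and "tokens" counts as "token".
WORD_VECTOR = ("crossword", "cryptic", "difficult", "expression", "lexical", "token")


def binary_vector(text, terms=WORD_VECTOR):
    """Return 1 for each term that some word of the text starts with, else 0."""
    words = re.findall(r"[a-z]+", text.lower())
    return [int(any(w.startswith(t) for w in words)) for t in terms]


def euclidean(u, v):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


lecture = ("Compilers: lecture 1. This lecture will introduce the concept of "
           "lexical analysis, in which the source code is scanned to reveal "
           "the basic tokens it contains. For this, we will need the concept "
           "of regular expressions (r.e.s).")
crossword = ("Compilers. The Guardian uses several compilers for its daily "
             "cryptic crosswords. One of the most frequently used is "
             "Araucaria, and one of the most difficult is Bunthorne.")

v1 = binary_vector(lecture)    # [0, 0, 0, 1, 1, 1]
v2 = binary_vector(crossword)  # [1, 1, 1, 0, 0, 0]
print(v1, v2, euclidean(v1, v2))
```

The two pages share no word-vector terms, so they end up as far apart as two binary vectors with three 1s each can be.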
Turning a document into a vector
We start with a template for the vector, which needs a master list of
terms. A term can be a word, or a number, or anything that appears
frequently in documents.
There are almost 200,000 words in English – it would take much too
long to process document vectors of that length.
Commonly, vectors are made from a small number (50–1000) of
the most frequently occurring words.
However, the master list usually does not include words from a stoplist,
which contains words such as the, and, there, which, etc … why?
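One possible way to build such a master list is sketched below in Python. The function name `build_master_list`, the tiny stoplist and the three sample documents are all made up for illustration; a real stoplist would be far longer.

```python
import re
from collections import Counter

# A tiny illustrative stoplist (an assumption); real stoplists are much longer.
STOPLIST = {"the", "and", "there", "which", "a", "of", "in", "is", "to"}


def build_master_list(documents, size, stoplist=STOPLIST):
    """Return the `size` most frequent non-stoplist words across the documents."""
    counts = Counter()
    for text in documents:
        words = re.findall(r"[a-z]+", text.lower())
        counts.update(w for w in words if w not in stoplist)
    return [word for word, _ in counts.most_common(size)]


docs = [
    "The cat and the dog chased the cat.",
    "There is a cat in the garden and a dog.",
    "The cat sleeps.",
]
master = build_master_list(docs, size=2)
print(master)  # ['cat', 'dog']
```

Without the stoplist, `the' would take the top position even though it says nothing about what any of the documents is about.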
The TFIDF Encoding
(Term Frequency x Inverse Document Frequency)
A term is a word, or some other frequently occurring item
Given some term i, and a document j, the term count $n_{ij}$
is the number of times that term i occurs in document j
Given a collection of T terms and a set D of documents, the
term frequency, $tf_{ij}$, is:
$$tf_{ij} = \frac{n_{ij}}{\sum_{k=1}^{T} n_{kj}}$$
… considering only the terms of interest, this is the
proportion of document j that is made up from term i.
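A minimal Python sketch of the term frequency just described: the count of term i in document j, divided by the total count of all master-list terms in that document. The function name and the sample terms are hypothetical, and the document is assumed to be already split into terms.

```python
from collections import Counter


def term_frequencies(doc_terms, master_list):
    """tf per master-list term: its count over the total count of all master-list terms."""
    counts = Counter(t for t in doc_terms if t in master_list)
    total = sum(counts.values())
    if total == 0:
        # No master-list term occurs in the document at all.
        return [0.0] * len(master_list)
    return [counts[t] / total for t in master_list]


master_list = ["banana", "cat", "dog"]
doc_terms = ["banana", "cat", "banana", "hot", "banana", "dog"]
tf = term_frequencies(doc_terms, master_list)
print(tf)  # banana 3/5, cat 1/5, dog 1/5 ("hot" is not a term of interest)
```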
Term frequency $tf_{ij}$ is a measure of the importance of term i in
document j
Inverse document frequency (which we see next) is a measure of
the general importance of the term.
I.e. high term frequency for “apple” means that apple is an
important word in a specific document.
But high document frequency (low inverse document frequency)
for “apple”, given a particular set of documents, means that
apple is not all that important overall, since it is in most or all of the
documents.
Inverse document frequency of term i is:
$$idf_i = \log \frac{|D|}{|\{d_j \in D : t_i \in d_j\}|}$$
where $t_i$ denotes term i and $d_j$ a document in D.
Log of: … the number of documents in the master collection,
divided by the number of those documents that contain the term.
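In Python this might look as follows. It is a sketch under assumptions: each document is represented as a set of terms, the natural logarithm is used (the slides do not fix a base, and a different base only rescales the values), and the four-document collection is invented.

```python
import math


def inverse_document_frequency(term, documents):
    """log(|D| / number of documents containing the term); documents are sets of terms."""
    containing = sum(1 for doc in documents if term in doc)
    if containing == 0:
        raise ValueError(f"term {term!r} occurs in no document")
    return math.log(len(documents) / containing)


documents = [{"apple", "pie"}, {"apple", "tree"}, {"apple", "cancer"}, {"apple"}]
idf_apple = inverse_document_frequency("apple", documents)    # log(4/4) = 0
idf_cancer = inverse_document_frequency("cancer", documents)  # log(4/1)
print(idf_apple, idf_cancer)
```

A term that appears in every document gets an inverse document frequency of zero, which matches the “apple” remark above.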
TFIDF encoding of a document
So, given:
- a background collection of documents
(e.g. 100,000 random web pages,
all the articles we can find about cancer,
100 student essays submitted as coursework,
…)
- a specific ordered list (possibly large) of terms
We can encode any document as a vector of TFIDF numbers,
where the ith entry in the vector for document j is:
$$tf_{ij} \times idf_i$$
Turning a document into a vector
Suppose our Master List is:
(banana, cat, dog, fish, read)
Suppose document 1 contains only:
“Bananas are grown in hot countries, and cats like bananas.”
And suppose the background frequencies of these words in a large
random collection of documents is (0.2, 0.1, 0.05, 0.05, 0.2)
The document 1 vector entry for word w is:
$$\mathrm{freqindoc}(w) \times \log_2\left(\frac{1}{\mathrm{freq\_in\_bg}(w)}\right)$$
This is just a rephrasing of TFIDF, where:
freqindoc(w) is the frequency of w in document 1,
and freq_in_bg(w) is the `background’ frequency in our
reference set of documents
Turning a document into a vector
Master list: (banana, cat, dog, fish, read)
Background frequencies: (0.2, 0.1, 0.05, 0.05, 0.2)
Document 1:
“Bananas are grown in hot countries, and cats like bananas.”
Frequencies are proportions. The background frequency of banana is
0.2, meaning that 20% of documents in general contain `banana’, or
bananas, etc. (note that read includes reads, reading, reader, etc…)
The frequency of banana in document 1 is also 0.2 – why?
The TFIDF encoding of this document is:
0.464, 0.332, 0, 0, 0
Suppose another document has
exactly the same vector – will it
be the same document?
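The encoding above can be checked with a short Python sketch. Two things in it are assumptions: prefix matching stands in for stemming (so `bananas' counts as `banana'), and the frequency in the document is taken over all of its words.

```python
import math
import re

MASTER_LIST = ("banana", "cat", "dog", "fish", "read")
BACKGROUND = (0.2, 0.1, 0.05, 0.05, 0.2)


def encode(text, master_list=MASTER_LIST, background=BACKGROUND):
    """Return freqindoc(w) * log2(1 / freq_in_bg(w)) for each master-list word w."""
    words = re.findall(r"[a-z]+", text.lower())
    if not words:
        return [0.0] * len(master_list)
    vector = []
    for term, bg in zip(master_list, background):
        # Crude stand-in for stemming: any word starting with the term counts.
        count = sum(1 for w in words if w.startswith(term))
        freq_in_doc = count / len(words)
        vector.append(freq_in_doc * math.log2(1 / bg))
    return vector


doc1 = "Bananas are grown in hot countries, and cats like bananas."
vec = [round(x, 3) for x in encode(doc1)]
print(vec)  # [0.464, 0.332, 0.0, 0.0, 0.0]
```

The document has 10 words, two of which are forms of `banana' and one of which is a form of `cat', giving 0.2 × log2(5) ≈ 0.464 and 0.1 × log2(10) ≈ 0.332.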
Vector representation of
documents underpins:
Many areas of automated document analysis
Such as: automated classification of documents
Clustering and organising document collections
Building maps of the web, and of different web
communities
Understanding the interactions between different
scientific communities, which in turn will lead to
helping with automated WWW-based scientific
discovery.
What can you say about the TFIDF value for the
word “and”?
What about the word “cancer”?
What is the TFIDF value of cancer, where the
background collection of documents is a collection
of abstracts from a cancer journal?
Stoplists and Stemming
• Stoplists – we mentioned these already; this is a
list of words that we should ignore when
processing documents, since they give no useful
information about content. Examples of such
words?
• Stemming – this is the process of treating a set of
words like “fights, fighting, fighter, …” as all
instances of the same term – in this case the stem
is “fight”. Why is this useful?
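Both ideas can be sketched in a few lines of Python. The suffix-stripping rule below is a deliberately naive illustration, not a real stemming algorithm such as Porter's, and the stoplist, suffix list and function names are assumptions made for the example.

```python
import re

# Illustrative stoplist and suffix list (both assumptions).
STOPLIST = {"the", "and", "there", "which", "a", "of", "in", "is"}
SUFFIXES = ("ing", "er", "s")  # checked in this order


def naive_stem(word):
    """Strip the first matching suffix, keeping a stem of at least 3 letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def to_terms(text):
    """Lowercase, drop stoplist words, and stem what is left."""
    words = re.findall(r"[a-z]+", text.lower())
    return [naive_stem(w) for w in words if w not in STOPLIST]


terms = to_terms("The fighter fights and there is fighting")
print(terms)  # ['fight', 'fight', 'fight']
```

After stemming, the three word forms contribute to a single term count instead of being spread over three separate vector positions.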
Examinable Reading
The Sinka/Corne paper on my teaching site;
I want you to be able to talk clearly about the
findings (e.g. how the quality of clustering
was affected by whether or not stemming
was used)

More Related Content

Similar to text

20090921 Art Databanken Agosti Final
20090921 Art Databanken Agosti Final20090921 Art Databanken Agosti Final
20090921 Art Databanken Agosti Finalagosti
 
Books and Webs: Pulling the Down Rows
Books and Webs: Pulling the Down RowsBooks and Webs: Pulling the Down Rows
Books and Webs: Pulling the Down RowsPeter Brantley
 
A Text Mining Research Based on LDA Topic Modelling
A Text Mining Research Based on LDA Topic ModellingA Text Mining Research Based on LDA Topic Modelling
A Text Mining Research Based on LDA Topic Modellingcsandit
 
A TEXT MINING RESEARCH BASED ON LDA TOPIC MODELLING
A TEXT MINING RESEARCH BASED ON LDA TOPIC MODELLINGA TEXT MINING RESEARCH BASED ON LDA TOPIC MODELLING
A TEXT MINING RESEARCH BASED ON LDA TOPIC MODELLINGcscpconf
 
Keyphrase Extraction using Neighborhood Knowledge
Keyphrase Extraction using Neighborhood KnowledgeKeyphrase Extraction using Neighborhood Knowledge
Keyphrase Extraction using Neighborhood KnowledgeIJMTST Journal
 
THGenius, rdf and open linked data for thesaurus management
THGenius, rdf and open linked data for thesaurus managementTHGenius, rdf and open linked data for thesaurus management
THGenius, rdf and open linked data for thesaurus management@CULT Srl
 
chapter 1-Overview of Information Retrieval.ppt
chapter 1-Overview of Information Retrieval.pptchapter 1-Overview of Information Retrieval.ppt
chapter 1-Overview of Information Retrieval.pptSamuelKetema1
 
Copy of 10text (2)
Copy of 10text (2)Copy of 10text (2)
Copy of 10text (2)Uma Se
 
Chapter 10 Data Mining Techniques
 Chapter 10 Data Mining Techniques Chapter 10 Data Mining Techniques
Chapter 10 Data Mining TechniquesHouw Liong The
 
Enriching the semantic web tutorial session 1
Enriching the semantic web tutorial session 1Enriching the semantic web tutorial session 1
Enriching the semantic web tutorial session 1Tobias Wunner
 
IRJET - Document Comparison based on TF-IDF Metric
IRJET - Document Comparison based on TF-IDF MetricIRJET - Document Comparison based on TF-IDF Metric
IRJET - Document Comparison based on TF-IDF MetricIRJET Journal
 
Text mining introduction-1
Text mining   introduction-1Text mining   introduction-1
Text mining introduction-1Sumit Sony
 
Web classification of Digital Libraries using GATE Machine Learning  
Web classification of Digital Libraries using GATE Machine Learning  	Web classification of Digital Libraries using GATE Machine Learning  
Web classification of Digital Libraries using GATE Machine Learning   sstose
 
Data Science - Part XI - Text Analytics
Data Science - Part XI - Text AnalyticsData Science - Part XI - Text Analytics
Data Science - Part XI - Text AnalyticsDerek Kane
 
Tdm information retrieval
Tdm information retrievalTdm information retrieval
Tdm information retrievalKU Leuven
 

Similar to text (20)

20090921 Art Databanken Agosti Final
20090921 Art Databanken Agosti Final20090921 Art Databanken Agosti Final
20090921 Art Databanken Agosti Final
 
Books and Webs: Pulling the Down Rows
Books and Webs: Pulling the Down RowsBooks and Webs: Pulling the Down Rows
Books and Webs: Pulling the Down Rows
 
A Text Mining Research Based on LDA Topic Modelling
A Text Mining Research Based on LDA Topic ModellingA Text Mining Research Based on LDA Topic Modelling
A Text Mining Research Based on LDA Topic Modelling
 
A TEXT MINING RESEARCH BASED ON LDA TOPIC MODELLING
A TEXT MINING RESEARCH BASED ON LDA TOPIC MODELLINGA TEXT MINING RESEARCH BASED ON LDA TOPIC MODELLING
A TEXT MINING RESEARCH BASED ON LDA TOPIC MODELLING
 
Keyphrase Extraction using Neighborhood Knowledge
Keyphrase Extraction using Neighborhood KnowledgeKeyphrase Extraction using Neighborhood Knowledge
Keyphrase Extraction using Neighborhood Knowledge
 
Ir 02
Ir   02Ir   02
Ir 02
 
Document similarity
Document similarityDocument similarity
Document similarity
 
THGenius, rdf and open linked data for thesaurus management
THGenius, rdf and open linked data for thesaurus managementTHGenius, rdf and open linked data for thesaurus management
THGenius, rdf and open linked data for thesaurus management
 
chapter 1-Overview of Information Retrieval.ppt
chapter 1-Overview of Information Retrieval.pptchapter 1-Overview of Information Retrieval.ppt
chapter 1-Overview of Information Retrieval.ppt
 
Copy of 10text (2)
Copy of 10text (2)Copy of 10text (2)
Copy of 10text (2)
 
Web and text
Web and textWeb and text
Web and text
 
Chapter 10 Data Mining Techniques
 Chapter 10 Data Mining Techniques Chapter 10 Data Mining Techniques
Chapter 10 Data Mining Techniques
 
Enriching the semantic web tutorial session 1
Enriching the semantic web tutorial session 1Enriching the semantic web tutorial session 1
Enriching the semantic web tutorial session 1
 
IR.pptx
IR.pptxIR.pptx
IR.pptx
 
Web of Science
Web of ScienceWeb of Science
Web of Science
 
IRJET - Document Comparison based on TF-IDF Metric
IRJET - Document Comparison based on TF-IDF MetricIRJET - Document Comparison based on TF-IDF Metric
IRJET - Document Comparison based on TF-IDF Metric
 
Text mining introduction-1
Text mining   introduction-1Text mining   introduction-1
Text mining introduction-1
 
Web classification of Digital Libraries using GATE Machine Learning  
Web classification of Digital Libraries using GATE Machine Learning  	Web classification of Digital Libraries using GATE Machine Learning  
Web classification of Digital Libraries using GATE Machine Learning  
 
Data Science - Part XI - Text Analytics
Data Science - Part XI - Text AnalyticsData Science - Part XI - Text Analytics
Data Science - Part XI - Text Analytics
 
Tdm information retrieval
Tdm information retrievalTdm information retrieval
Tdm information retrieval
 

More from nyomans1

PPT-UEU-Keamanan-Informasi-Pertemuan-5.ppt
PPT-UEU-Keamanan-Informasi-Pertemuan-5.pptPPT-UEU-Keamanan-Informasi-Pertemuan-5.ppt
PPT-UEU-Keamanan-Informasi-Pertemuan-5.pptnyomans1
 
Template Pertemuan 1 All MK - Copy.pptx
Template Pertemuan 1 All MK - Copy.pptxTemplate Pertemuan 1 All MK - Copy.pptx
Template Pertemuan 1 All MK - Copy.pptxnyomans1
 
Clustering_hirarki (tanpa narasi) (1).pptx
Clustering_hirarki (tanpa narasi) (1).pptxClustering_hirarki (tanpa narasi) (1).pptx
Clustering_hirarki (tanpa narasi) (1).pptxnyomans1
 
slide 7_olap_example.ppt
slide 7_olap_example.pptslide 7_olap_example.ppt
slide 7_olap_example.pptnyomans1
 
PPT-UEU-Keamanan-Informasi-Pertemuan-5.ppt
PPT-UEU-Keamanan-Informasi-Pertemuan-5.pptPPT-UEU-Keamanan-Informasi-Pertemuan-5.ppt
PPT-UEU-Keamanan-Informasi-Pertemuan-5.pptnyomans1
 
Security Requirement.pptx
Security Requirement.pptxSecurity Requirement.pptx
Security Requirement.pptxnyomans1
 
Minggu_1_Matriks_dan_Operasinya.pptx
Minggu_1_Matriks_dan_Operasinya.pptxMinggu_1_Matriks_dan_Operasinya.pptx
Minggu_1_Matriks_dan_Operasinya.pptxnyomans1
 
Matriks suplemen.ppt
Matriks suplemen.pptMatriks suplemen.ppt
Matriks suplemen.pptnyomans1
 
fdokumen.com_muh1g3-matriks-dan-ruang-vektor-3-312017-muh1g3-matriks-dan-ruan...
fdokumen.com_muh1g3-matriks-dan-ruang-vektor-3-312017-muh1g3-matriks-dan-ruan...fdokumen.com_muh1g3-matriks-dan-ruang-vektor-3-312017-muh1g3-matriks-dan-ruan...
fdokumen.com_muh1g3-matriks-dan-ruang-vektor-3-312017-muh1g3-matriks-dan-ruan...nyomans1
 
10-Image-Enhancement-Bagian3-2021.pptx
10-Image-Enhancement-Bagian3-2021.pptx10-Image-Enhancement-Bagian3-2021.pptx
10-Image-Enhancement-Bagian3-2021.pptxnyomans1
 
08-Image-Enhancement-Bagian1.pptx
08-Image-Enhancement-Bagian1.pptx08-Image-Enhancement-Bagian1.pptx
08-Image-Enhancement-Bagian1.pptxnyomans1
 
03-Pembentukan-Citra-dan-Digitalisasi-Citra.pptx
03-Pembentukan-Citra-dan-Digitalisasi-Citra.pptx03-Pembentukan-Citra-dan-Digitalisasi-Citra.pptx
03-Pembentukan-Citra-dan-Digitalisasi-Citra.pptxnyomans1
 
04-Format-citra-dan-struktur-data-citra-2021.pptx
04-Format-citra-dan-struktur-data-citra-2021.pptx04-Format-citra-dan-struktur-data-citra-2021.pptx
04-Format-citra-dan-struktur-data-citra-2021.pptxnyomans1
 
02-Pengantar-Pengolahan-Citra-Bag2-2021.pptx
02-Pengantar-Pengolahan-Citra-Bag2-2021.pptx02-Pengantar-Pengolahan-Citra-Bag2-2021.pptx
02-Pengantar-Pengolahan-Citra-Bag2-2021.pptxnyomans1
 
03spatialfiltering-130424050639-phpapp02.pptx
03spatialfiltering-130424050639-phpapp02.pptx03spatialfiltering-130424050639-phpapp02.pptx
03spatialfiltering-130424050639-phpapp02.pptxnyomans1
 
Q-Step_WS_02102019_Practical_introduction_to_Python.pptx
Q-Step_WS_02102019_Practical_introduction_to_Python.pptxQ-Step_WS_02102019_Practical_introduction_to_Python.pptx
Q-Step_WS_02102019_Practical_introduction_to_Python.pptxnyomans1
 
BAB 2_TIPE DATA, VARIABEL, DAN OPERATOR (1) (1).pptx
BAB 2_TIPE DATA, VARIABEL, DAN OPERATOR (1) (1).pptxBAB 2_TIPE DATA, VARIABEL, DAN OPERATOR (1) (1).pptx
BAB 2_TIPE DATA, VARIABEL, DAN OPERATOR (1) (1).pptxnyomans1
 
Support-Vector-Machines_EJ_v5.06.pptx
Support-Vector-Machines_EJ_v5.06.pptxSupport-Vector-Machines_EJ_v5.06.pptx
Support-Vector-Machines_EJ_v5.06.pptxnyomans1
 
06-Image-Histogram-2021.pptx
06-Image-Histogram-2021.pptx06-Image-Histogram-2021.pptx
06-Image-Histogram-2021.pptxnyomans1
 
05-Operasi-dasar-pengolahan-citra-2021 (1).pptx
05-Operasi-dasar-pengolahan-citra-2021 (1).pptx05-Operasi-dasar-pengolahan-citra-2021 (1).pptx
05-Operasi-dasar-pengolahan-citra-2021 (1).pptxnyomans1
 

More from nyomans1 (20)

PPT-UEU-Keamanan-Informasi-Pertemuan-5.ppt
PPT-UEU-Keamanan-Informasi-Pertemuan-5.pptPPT-UEU-Keamanan-Informasi-Pertemuan-5.ppt
PPT-UEU-Keamanan-Informasi-Pertemuan-5.ppt
 
Template Pertemuan 1 All MK - Copy.pptx
Template Pertemuan 1 All MK - Copy.pptxTemplate Pertemuan 1 All MK - Copy.pptx
Template Pertemuan 1 All MK - Copy.pptx
 
Clustering_hirarki (tanpa narasi) (1).pptx
Clustering_hirarki (tanpa narasi) (1).pptxClustering_hirarki (tanpa narasi) (1).pptx
Clustering_hirarki (tanpa narasi) (1).pptx
 
slide 7_olap_example.ppt
slide 7_olap_example.pptslide 7_olap_example.ppt
slide 7_olap_example.ppt
 
PPT-UEU-Keamanan-Informasi-Pertemuan-5.ppt
PPT-UEU-Keamanan-Informasi-Pertemuan-5.pptPPT-UEU-Keamanan-Informasi-Pertemuan-5.ppt
PPT-UEU-Keamanan-Informasi-Pertemuan-5.ppt
 
Security Requirement.pptx
Security Requirement.pptxSecurity Requirement.pptx
Security Requirement.pptx
 
Minggu_1_Matriks_dan_Operasinya.pptx
Minggu_1_Matriks_dan_Operasinya.pptxMinggu_1_Matriks_dan_Operasinya.pptx
Minggu_1_Matriks_dan_Operasinya.pptx
 
Matriks suplemen.ppt
Matriks suplemen.pptMatriks suplemen.ppt
Matriks suplemen.ppt
 
fdokumen.com_muh1g3-matriks-dan-ruang-vektor-3-312017-muh1g3-matriks-dan-ruan...
fdokumen.com_muh1g3-matriks-dan-ruang-vektor-3-312017-muh1g3-matriks-dan-ruan...fdokumen.com_muh1g3-matriks-dan-ruang-vektor-3-312017-muh1g3-matriks-dan-ruan...
fdokumen.com_muh1g3-matriks-dan-ruang-vektor-3-312017-muh1g3-matriks-dan-ruan...
 
10-Image-Enhancement-Bagian3-2021.pptx
10-Image-Enhancement-Bagian3-2021.pptx10-Image-Enhancement-Bagian3-2021.pptx
10-Image-Enhancement-Bagian3-2021.pptx
 
08-Image-Enhancement-Bagian1.pptx
08-Image-Enhancement-Bagian1.pptx08-Image-Enhancement-Bagian1.pptx
08-Image-Enhancement-Bagian1.pptx
 
03-Pembentukan-Citra-dan-Digitalisasi-Citra.pptx
03-Pembentukan-Citra-dan-Digitalisasi-Citra.pptx03-Pembentukan-Citra-dan-Digitalisasi-Citra.pptx
03-Pembentukan-Citra-dan-Digitalisasi-Citra.pptx
 
04-Format-citra-dan-struktur-data-citra-2021.pptx
04-Format-citra-dan-struktur-data-citra-2021.pptx04-Format-citra-dan-struktur-data-citra-2021.pptx
04-Format-citra-dan-struktur-data-citra-2021.pptx
 
02-Pengantar-Pengolahan-Citra-Bag2-2021.pptx
02-Pengantar-Pengolahan-Citra-Bag2-2021.pptx02-Pengantar-Pengolahan-Citra-Bag2-2021.pptx
02-Pengantar-Pengolahan-Citra-Bag2-2021.pptx
 
03spatialfiltering-130424050639-phpapp02.pptx
03spatialfiltering-130424050639-phpapp02.pptx03spatialfiltering-130424050639-phpapp02.pptx
03spatialfiltering-130424050639-phpapp02.pptx
 
Q-Step_WS_02102019_Practical_introduction_to_Python.pptx
Q-Step_WS_02102019_Practical_introduction_to_Python.pptxQ-Step_WS_02102019_Practical_introduction_to_Python.pptx
Q-Step_WS_02102019_Practical_introduction_to_Python.pptx
 
BAB 2_TIPE DATA, VARIABEL, DAN OPERATOR (1) (1).pptx
BAB 2_TIPE DATA, VARIABEL, DAN OPERATOR (1) (1).pptxBAB 2_TIPE DATA, VARIABEL, DAN OPERATOR (1) (1).pptx
BAB 2_TIPE DATA, VARIABEL, DAN OPERATOR (1) (1).pptx
 
Support-Vector-Machines_EJ_v5.06.pptx
Support-Vector-Machines_EJ_v5.06.pptxSupport-Vector-Machines_EJ_v5.06.pptx
Support-Vector-Machines_EJ_v5.06.pptx
 
06-Image-Histogram-2021.pptx
06-Image-Histogram-2021.pptx06-Image-Histogram-2021.pptx
06-Image-Histogram-2021.pptx
 
05-Operasi-dasar-pengolahan-citra-2021 (1).pptx
05-Operasi-dasar-pengolahan-citra-2021 (1).pptx05-Operasi-dasar-pengolahan-citra-2021 (1).pptx
05-Operasi-dasar-pengolahan-citra-2021 (1).pptx
 

Recently uploaded

Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz1
 
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一ffjhghh
 
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Callshivangimorya083
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysismanisha194592
 
VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxolyaivanovalion
 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfLars Albertsson
 
Ravak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxRavak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxolyaivanovalion
 
B2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxB2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxStephen266013
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfMarinCaroMartnezBerg
 
Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxfirstjob4
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxolyaivanovalion
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxolyaivanovalion
 
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service BhilaiLow Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service BhilaiSuhani Kapoor
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusTimothy Spann
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130Suhani Kapoor
 

Recently uploaded (20)

VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signals
 
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一
 
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysis
 
VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptx
 
Sampling (random) method and Non random.ppt
Sampling (random) method and Non random.pptSampling (random) method and Non random.ppt
Sampling (random) method and Non random.ppt
 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdf
 
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
 
Ravak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxRavak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptx
 
B2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxB2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docx
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
 
Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptx
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptx
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptx
 
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service BhilaiLow Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and Milvus
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
 

text

  • 1. Web Intelligence Text Mining, and web-related Applications ’
  • 2. WEB-SOM A self-organizing-map (SOM) algorithm applied to over 1M newsgroup posts. See http://websom.hut.fi/we bsom/milliondemo/html /root.html and play around with it.
  • 3. Finding similar literature Two different www documents X and Y might be closely related. If they are, then: • a user interested in X will also probably be interested in Y • If X is highly ranked in a search, Y should also be made prominently available to the searcher • If a user is specifically trying to find documents similar to X, then Y is one of them. But, the problem is: • X might turn up in a search, but not Y. There are no links between X and Y, they may be in very separated components of the www graph.
  • 4. Another way of looking at it Suppose you do a search on the keyword pasta • Google may retrieve 1,000,000 documents • How can you (or, hopefully, an automated system) usefully organise these documents? • If the documents were automatically clustered, so that similar groups of documents were put together in the same cluster, then we would be able to impose useful organisation. • E.g. one cluster might be documents about the history of pasta, another cluster may be mainly recipes, etc… So, it will be very useful if we have some way of working out similarity between documents – then we can cluster them.
  • 5. Applications/Motivations for document similarity • Recommendations – Many search engines and other sites try to help you manage your bookmarks/favourites; as part of this they offer recommendations, i.e. “if you like that, you might also like these …” – On amazon, or any general product sales site, this can be based on distances between (e.g.) 200 word summaries or ToC of a book, or text that describes a product in a catalogue • Research (scientific, scholarly, for lit review, for market research) • Mapping for Browsing purposes – a 2D visualisation of the web, or a subset, where each page is a (clickable) point, and distance between them is related to document similarity
  • 6. But a document is a “bag of words” – to work out distances, we need numbers
  • 7. How did I get these vectors from these two `documents’? <h1> Compilers: lecture 1 </h1> <p> This lecture will introduce the concept of lexical analysis, in which the source code is scanned to reveal the basic tokens it contains. For this, we will need the concept of regular expressions (r.e.s).</p> <h1> Compilers</h1> <p> The Guardian uses several compilers for its daily cryptic crosswords. One of the most frequently used is Araucaria, and one of the most difficult is Bunthorne.</p> 35, 2, 0 26, 2, 2
  • 8. What about these two vectors? <h1> Compilers: lecture 1 </h1> <p> This lecture will introduce the concept of lexical analysis, in which the source code is scanned to reveal the basic tokens it contains. For this, we will need the concept of regular expressions (r.e.s).</p> <h1> Compilers</h1> <p> The Guardian uses several compilers for its daily cryptic crosswords. One of the most frequently used is Araucaria, and one of the most difficult is Bunthorne.</p> 0, 0, 0, 1, 1, 1 1, 1, 1, 0, 0, 0
  • 9. An unfair question, but I got that by using the following word vector: (Crossword, Cryptic, Difficult, Expression, Lexical, Token) If a document contains the word `crossword’, it gets a 1 in position 1 of the vector, otherwise 0. If it contains `lexical’, it gets a 1 in position 5, otherwise 0, and so on. How similar would be the vectors for two pages about crossword compilers? The key to measuring document similarity is turning documents into vectors based on specific words and their frequencies.
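A minimal Python sketch of the 0/1 word vector described on this slide, applied to the text of the two example documents with the HTML tags removed. The function name `presence_vector` and the prefix matching (a crude stand-in for stemming, so that "crosswords" counts as "crossword") are assumptions made for illustration, not part of the original slides.

```python
import re

# Master word list from the slide, in vector order.
MASTER_WORDS = ["crossword", "cryptic", "difficult", "expression", "lexical", "token"]


def presence_vector(text, words=MASTER_WORDS):
    """Return a 0/1 vector: 1 if some token in the text starts with the master word."""
    tokens = re.findall(r"[a-z]+", text.lower())
    # Prefix matching is an assumption standing in for real stemming.
    return [int(any(tok.startswith(w) for tok in tokens)) for w in words]


doc1 = ("Compilers: lecture 1. This lecture will introduce the concept of lexical "
        "analysis, in which the source code is scanned to reveal the basic tokens "
        "it contains. For this, we will need the concept of regular expressions (r.e.s).")
doc2 = ("Compilers. The Guardian uses several compilers for its daily cryptic "
        "crosswords. One of the most frequently used is Araucaria, and one of the "
        "most difficult is Bunthorne.")

print(presence_vector(doc1))  # [0, 0, 0, 1, 1, 1]
print(presence_vector(doc2))  # [1, 1, 1, 0, 0, 0]
```

The two vectors share no non-zero positions, which reflects that the documents have nothing in common beyond the word "compilers".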
  • 10. Turning a document into a vector We start with a template for the vector, which needs a master list of terms. A term can be a word, or a number, or anything that appears frequently in documents. There are almost 200,000 words in English – it would take much too long to process document vectors of that length. Commonly, vectors are made from a small number (50 to 1000) of the most frequently occurring words. However, the master list usually does not include words from a stoplist, which contains words such as the, and, there, which, etc … why?
  • 11. The TFIDF Encoding (Term Frequency x Inverse Document Frequency) A term is a word, or some other frequently occurring item. Given some term i, and a document j, the term count $n_{ij}$ is the number of times that term i occurs in document j. Given a collection of T terms and a set D of documents, the term frequency is: $tf_{ij} = \frac{n_{ij}}{\sum_{k=1}^{T} n_{kj}}$ … considering only the terms of interest, this is the proportion of document j that is made up from term i.
  • 12. Term frequency $tf_{ij}$ is a measure of the importance of term i in document j. Inverse document frequency (which we see next) is a measure of the general importance of the term. I.e. high term frequency for “apple” means that apple is an important word in a specific document. But high document frequency (low inverse document frequency) for “apple”, given a particular set of documents, means that apple is not all that important overall, since it is in all of the documents.
  • 13. Inverse document frequency of term i is: $idf_i = \log \frac{|D|}{|\{d_j \in D : t_i \in d_j\}|}$, where $t_i$ denotes term i. That is, the log of the number of documents in the master collection, divided by the number of those documents that contain the term.
  • 14. TFIDF encoding of a document So, given: - a background collection of documents (e.g. 100,000 random web pages, all the articles we can find about cancer, 100 student essays submitted as coursework …) - a specific ordered list (possibly large) of terms We can encode any document as a vector of TFIDF numbers, where the ith entry in the vector for document j is: $tf_{ij} \times idf_i$
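The term frequency, inverse document frequency and TFIDF definitions from the last few slides can be sketched in Python as follows. This is an illustration under stated assumptions, not the lecturer's implementation: the function names, the toy collection, the natural-log base (a different base only rescales the values) and the choice of 0.0 for a term that appears in no document are all made up for the example.

```python
import math
from collections import Counter


def term_frequencies(tokens, terms):
    """tf_ij = n_ij / sum_k n_kj, counting only the terms of interest."""
    counts = Counter(t for t in tokens if t in terms)
    total = sum(counts.values())
    return [counts[t] / total if total else 0.0 for t in terms]


def inverse_document_frequencies(collection, terms):
    """idf_i = log(|D| / number of documents in D that contain term i)."""
    n_docs = len(collection)
    idfs = []
    for t in terms:
        containing = sum(1 for doc in collection if t in doc)
        # idf is undefined for a term found in no document; 0.0 is an arbitrary choice.
        idfs.append(math.log(n_docs / containing) if containing else 0.0)
    return idfs


def tfidf_vector(tokens, collection, terms):
    """The ith entry is tf_ij * idf_i."""
    tfs = term_frequencies(tokens, terms)
    idfs = inverse_document_frequencies(collection, terms)
    return [tf * idf for tf, idf in zip(tfs, idfs)]


# Hypothetical toy collection: each document is a list of lower-case tokens.
collection = [
    ["apple", "pie", "recipe"],
    ["apple", "tree", "orchard"],
    ["apple", "cancer", "study"],
    ["cancer", "treatment", "study"],
]
terms = ["apple", "cancer", "recipe"]
print(tfidf_vector(collection[0], collection, terms))
```

In this toy collection "apple" appears in three of the four documents, so its idf is low and it contributes less to the first document's vector than "recipe", which appears in only one.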
  • 15. Turning a document into a vector Suppose our Master List is: (banana, cat, dog, fish, read) Suppose document 1 contains only: “Bananas are grown in hot countries, and cats like bananas.” And suppose the background frequencies of these words in a large random collection of documents are (0.2, 0.1, 0.05, 0.05, 0.2). The document 1 vector entry for word w is: $\mathrm{freqindoc}(w) \times \log_2(1 / \mathrm{freq\_in\_bg}(w))$ This is just a rephrasing of TFIDF, where: freqindoc(w) is the frequency of w in document 1, and freq_in_bg(w) is the `background’ frequency in our reference set of documents
  • 16. Turning a document into a vector Master list: (banana, cat, dog, fish, read) Background frequencies: (0.2, 0.1, 0.05, 0.05, 0.2) Document 1: “Bananas are grown in hot countries, and cats like bananas.” Frequencies are proportions. The background frequency of banana is 0.2, meaning that 20% of documents in general contain `banana’, or bananas, etc. (note that read includes reads, reading, reader, etc…) The frequency of banana in document 1 is also 0.2 – why? The TFIDF encoding of this document is: 0.464, 0.332, 0, 0, 0 Suppose another document has exactly the same vector – will it be the same document?
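The worked example on this slide can be checked with a short Python sketch. It assumes that the frequency in the document is taken over all ten words of the sentence (stop words included), which is what reproduces the slide's figures, and it uses prefix matching as a crude stand-in for stemming; the function name `encode` is hypothetical.

```python
import math
import re

MASTER = ["banana", "cat", "dog", "fish", "read"]
BACKGROUND = [0.2, 0.1, 0.05, 0.05, 0.2]


def encode(text, master=MASTER, background=BACKGROUND):
    """Entry for word w: freqindoc(w) * log2(1 / freq_in_bg(w))."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return [0.0] * len(master)
    vector = []
    for term, bg in zip(master, background):
        # Prefix match stands in for stemming ("bananas" -> "banana"); this is
        # an assumption and would also match unrelated words such as "category".
        count = sum(1 for tok in tokens if tok.startswith(term))
        freq_in_doc = count / len(tokens)
        vector.append(freq_in_doc * math.log2(1 / bg))
    return vector


doc1 = "Bananas are grown in hot countries, and cats like bananas."
print([round(x, 3) for x in encode(doc1)])  # [0.464, 0.332, 0.0, 0.0, 0.0]
```

The sentence has ten words, two of which are forms of "banana", giving a document frequency of 0.2 and an entry of 0.2 × log2(5) ≈ 0.464.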
  • 17. Vector representation of documents underpins many areas of automated document analysis, such as: automated classification of documents; clustering and organising document collections; building maps of the web, and of different web communities; understanding the interactions between different scientific communities, which in turn will lead to helping with automated WWW-based scientific discovery.
  • 18. What can you say about the TFIDF value for the word “and”? What about the word “cancer”? What is the TFIDF value of cancer, where the background collection of documents is a collection of abstracts from a cancer journal?
  • 19. Stoplists and Stemming • Stoplists – we mentioned these already; this is a list of words that we should ignore when processing documents, since they give no useful information about content. Examples of such words? • Stemming – this is the process of treating a set of words like “fights, fighting, fighter, …” as all instances of the same term – in this case the stem is “fight”. Why is this useful?
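A rough Python sketch of both preprocessing steps, assuming a tiny hand-made stoplist and a toy suffix-stripping stemmer. Both are hypothetical and far cruder than an established stemmer such as Porter's; the point is only to show fights, fighting and fighter collapsing to one term.

```python
import re

# A tiny illustrative stoplist (assumption); real stoplists hold hundreds of words.
STOPLIST = {"the", "and", "there", "which", "a", "of", "in", "is"}

# Suffixes stripped by this toy stemmer, tried in this order (hypothetical).
SUFFIXES = ("ing", "er", "s")


def crude_stem(word):
    """Strip the first matching suffix, keeping at least three letters of stem."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Lower-case, tokenise, drop stoplist words, then stem what remains."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOPLIST]


print(preprocess("The fighter fights, and the fighting is fierce."))
# ['fight', 'fight', 'fight', 'fierce']
```

After this step the three variants count towards a single term, so a term count sees "fight" three times rather than three different words once each.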
  • 20. Examinable Reading The Sinka/Corne paper on my teaching site; I want you to be able to talk clearly about the findings (e.g. how the quality of clustering was affected by whether or not stemming was used)