SlideShare a Scribd company logo
1 of 7
Download to read offline
MIT OpenCourseWare
http://ocw.mit.edu




6.006 Introduction to Algorithms
Spring 2008



For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.
Lecture 1	                Introduction and Document Distance           6.006 Spring 2008



      Lecture 1: Introduction and the Document

                   Distance Problem


Course Overview
   •	 Efficient procedures for solving problems on large inputs (Ex: entire works of Shake­
      speare, human genome, U.S. Highway map)

   •	 Scalability

   •	 Classic data structures and elementary algorithms (CLRS text)


   •	 Real implementations in Python ⇔ Fun problem sets!


   • β version of the class - feedback is welcome!

Pre-requisites
   •	 Familiarity with Python and Discrete Mathematics

Contents
The course is divided into 7 modules - each of which has a motivating problem and problem
set (except for the last module). Modules and motivating problems are as described below:

  1. Linked Data Structures: Document Distance (DD)

  2. Hashing: DD, Genome Comparison

  3. Sorting: Gas Simulation

  4. Search: Rubik’s Cube 2 × 2 × 2

  5. Shortest Paths: Caltech → MIT

  6. Dynamic Programming: Stock Market
              √
  7. Numerics: 2

Document Distance Problem
Motivation

Given two documents, how similar are they?

   •	 Identical - easy?

   •	 Modified or related (Ex: DNA, Plagiarism, Authorship)

                                             1
Lecture 1	                Introduction and Document Distance                 6.006 Spring 2008



   •	 Did Francis Bacon write Shakespeare’s plays?

To answer the above, we need to define practical metrics. Metrics are defined in terms of
word frequencies.

Definitions

  1.	 Word : Sequence of alphanumeric characters. For example, the phrase “6.006 is fun”
      has 4 words.

  2.	 Word Frequencies: Word frequency D(w) of a given word w is the number of times
      it occurs in a document D.
     For example, the words and word frequencies for the above phrase are as below:

                            Count : 1 0   1   1    0    1
                            W ord : 6 the is 006 easy f un

     In practice, while counting, it is easy to choose some canonical ordering of words.

  3.	 Distance Metric: The document distance metric is the inner product of the vectors D1
      and D2 containing the word frequencies for all words in the 2 documents. Equivalently,
      this is the projection of vectors D1 onto D2 or vice versa. Mathematically this is
      expressed as:

                                             �
                                 D1 · D2 =         D1 (w) · D2 (w)	                        (1)
                                             w

  4.	 Angle Metric: The angle between the vectors D1 and D2 gives an indication of overlap
      between the 2 documents. Mathematically this angle is expressed as:
                                                   �                     �
                                                          D1 · D2
                            θ(D1 , D2 ) = arccos
                                                       � D1 � ∗ � D2 �


                                          0 ≤ θ ≤ π/2

     An angle metric of 0 means the two documents are identical whereas an angle metric
     of π/2 implies that there are no common words.

  5.	 Number of Words in Document: The magnitude of the vector D which contains word
      frequencies of all words in the document. Mathematically this is expressed as:

                                                         √
                                   N (D) =� D �=             D · D	                        (2)

So let’s apply the ideas to a few Python programs and try to flesh out more.

                                             2
Lecture 1                    Introduction and Document Distance         6.006 Spring 2008



Document Distance in Practice
Computing Document Distance: docdist1.py

The python code and results relevant to this section are available here. This program com­
putes the distance between 2 documents by performing the following steps:

   • Read file

   • Make word list [“the”,“year”,. . . ]

   • Count frequencies [[“the”,4012],[“year”,55],. . . ]

   • Sort into order [[“a”,3120],[“after”,17],. . . ]

   • Compute θ

Ideally, we would like to run this program to compute document distances between writings
of the following authors:

   • Jules Verne - document size 25k

   • Bobsey Twins - document size 268k

   • Lewis and Clark - document size 1M

   • Shakespeare - document size 5.5M

   • Churchill - document size 10M

Experiment: Comparing the Bobsey and Lewis documents with docdist1.py gives θ = 0.574.

However, it takes approximately 3 minutes to compute this document distance, and probably

gets slower as the inputs get large.

What is wrong with the efficiency of this program?

Is it a Python vs. C issue? Is it a choice of algorithm issue - θ(n2 ) versus θ(n)?


Profiling: docdist2.py

In order to figure out why our initial program is so slow, we now “instrument” the program
so that Python will tell us where the running time is going. This can be done simply using
the profile module in Python. The profile module indicates how much time is spent in each
routine.
(See this link for details on profile).
The profile module is imported into docdist1.py and the end of the docdist1.py file is
modified. The modified docdist1.py file is renamed as docdist2.py
Detailed results of document comparisons are available here .



                                                 3
Lecture 1	                 Introduction and Document Distance              6.006 Spring 2008



More on the different columns in the output displayed on that webpage:

   •	 tottime per call(column3) is tottime(column2)/ncalls(column1)

   •	 cumtime(column4)includes subroutine calls

   •	 cumtime per call(column5) is cumtime(column4)/ncalls(column1)

The profiling of the Bobsey vs. Lewis document comparison is as follows:

   • Total: 195 secs

   • Get words from line list: 107 secs

   •	 Count-frequency: 44 secs

   •	 Get words from string: 13 secs


   • Insertion sort: 12 secs


So the get words from line list operation is the culprit. The code for this particular section
is:

    word_list = [ ]
    for line in L:

        words_in_line = get_words_from_string(line)

        word_list = word_list + words_in_line

    return word_list

The bulk of the computation time is to implement

word_list = word_list + words_in_line

There isn’t anything else that takes up much computation time.

List Concatenation: docdist3.py

The problem in docdist1.py as illustrated by docdist2.py is that concatenating two lists
takes time proportional to the sum of the lengths of the two lists, since each list is copied
into the output list!
L = L1 + L2 takes time proportional to | L1 | + | L2 |	 If we had n lines (each with one
                                                       .
word), computation time would be proportional to 1 + 2 + 3 + . . . + n = n(n+1) = θ(n2 )
                                                                             2

Solution:

    word_list.extend (words_in_line)

    [word_list.append(word)] for each word in words_in_line


Ensures L1 .extend(L2 ) time proportional to | L2 |

                                              4
Lecture 1                  Introduction and Document Distance              6.006 Spring 2008



Take Home Lesson: Python has powerful primitives (like concatenation of lists) built in.
To write efficient algorithms, we need to understand their costs. See Python Cost Model
for details. PS1 also has an exercise on figuring out cost of a set of operations.

Incorporate this solution into docdist1.py - rename as docdist3.py. Implementation details
and results are available here. This modification helps run the Bobsey vs. Lewis example
in 85 secs (as opposed to the original 195 secs).

We can improve further by looking for other quadratic running times hidden in our routines.
The next offender (in terms of overall computation time) is the count frequency routine,
which computes the frequency of each word, given the word list.

Analysing Count Frequency

def count_frequency(word_list):
    """
    Return a list giving pairs of form: (word,frequency)
    """
    L = []
    for new_word in word_list:
        for entry in L:
            if new_word == entry[0]:
                entry[1] = entry[1] + 1
                break
        else:
            L.append([new_word,1])
    return L

If document has n words and d distinct words, θ(nd). If all words distinct, θ(n2 ). This shows
that the count frequency routine searches linearly down the list of word/frequency pairs to
find the given word. Thus it has quadratic running time! Turns out the count frequency
routine takes more than 1/2 of the running time in docdist3.py. Can we improve?

Dictionaries: docdist4.py

The solution to improve the Count Frequency routine lies in hashing, which gives constant
running time routines to store and retrieve key/value pairs from a table. In Python, a hash
table is called a dictionary. Documentation on dictionaries can be found here.

Hash table is defined a mapping from a domain(finite collection of immutable things) to a
range(anything). For example, D[‘ab’] = 2, D[‘the’] = 3.




                                              5
Lecture 1                  Introduction and Document Distance            6.006 Spring 2008



Modify docdist3.py to docdist4.py using dictionaries to give constant time lookup. Modified
count frequency routine is as follows:

def count_frequency(word_list):
    """
    Return a list giving pairs of form: (word,frequency)
    """
    D = {}
    for new_word in word_list:
        if D.has_key(new_word):

            D[new_word] = D[new_word]+1

        else:

            D[new_word] = 1

    return D.items()


Details of implementation and results are here. Running time is now θ(n). We have suc­
cessfully replaced one of our quadratic time routines with a linear-time one, so the running
time will scale better for larger inputs. For the Bobsey vs. Lewis example, running time
improves from 85 secs to 42 secs.

What’s left? The two largest contributors to running time are now:

   • Get words from string routine (13 secs) — version 5 of docdist fixes this with translate

   • Insertion sort routine (11 secs) — version 6 of docdist fixes this with merge-sort

More on that next time . . .




                                             6

More Related Content

What's hot

A first look at tf idf-pdx data science meetup
A first look at tf idf-pdx data science meetupA first look at tf idf-pdx data science meetup
A first look at tf idf-pdx data science meetupDan Sullivan, Ph.D.
 
TopicModels_BleiPaper_Summary.pptx
TopicModels_BleiPaper_Summary.pptxTopicModels_BleiPaper_Summary.pptx
TopicModels_BleiPaper_Summary.pptxKalpit Desai
 
Context-dependent Token-wise Variational Autoencoder for Topic Modeling
Context-dependent Token-wise Variational Autoencoder for Topic ModelingContext-dependent Token-wise Variational Autoencoder for Topic Modeling
Context-dependent Token-wise Variational Autoencoder for Topic ModelingTomonari Masada
 
Topic Models - LDA and Correlated Topic Models
Topic Models - LDA and Correlated Topic ModelsTopic Models - LDA and Correlated Topic Models
Topic Models - LDA and Correlated Topic ModelsClaudia Wagner
 
Topic model an introduction
Topic model an introductionTopic model an introduction
Topic model an introductionYueshen Xu
 
KDD 2014 Presentation (Best Research Paper Award): Alias Topic Modelling (Red...
KDD 2014 Presentation (Best Research Paper Award): Alias Topic Modelling (Red...KDD 2014 Presentation (Best Research Paper Award): Alias Topic Modelling (Red...
KDD 2014 Presentation (Best Research Paper Award): Alias Topic Modelling (Red...Aaron Li
 
Language Technology Enhanced Learning
Language Technology Enhanced LearningLanguage Technology Enhanced Learning
Language Technology Enhanced Learningtelss09
 
Latent dirichletallocation presentation
Latent dirichletallocation presentationLatent dirichletallocation presentation
Latent dirichletallocation presentationSoojung Hong
 
NLP Project: Paragraph Topic Classification
NLP Project: Paragraph Topic ClassificationNLP Project: Paragraph Topic Classification
NLP Project: Paragraph Topic ClassificationEugene Nho
 
Understanding Natural Languange with Corpora-based Generation of Dependency G...
Understanding Natural Languange with Corpora-based Generation of Dependency G...Understanding Natural Languange with Corpora-based Generation of Dependency G...
Understanding Natural Languange with Corpora-based Generation of Dependency G...Edmond Lepedus
 
Latent Dirichlet Allocation
Latent Dirichlet AllocationLatent Dirichlet Allocation
Latent Dirichlet AllocationMarco Righini
 
Text classification using Text kernels
Text classification using Text kernelsText classification using Text kernels
Text classification using Text kernelsDev Nath
 
Diversified Social Media Retrieval for News Stories
Diversified Social Media Retrieval for News StoriesDiversified Social Media Retrieval for News Stories
Diversified Social Media Retrieval for News StoriesBryan Gummibearehausen
 
Text classification-php-v4
Text classification-php-v4Text classification-php-v4
Text classification-php-v4Glenn De Backer
 
Topic model, LDA and all that
Topic model, LDA and all thatTopic model, LDA and all that
Topic model, LDA and all thatZhibo Xiao
 
mailfilter.ppt
mailfilter.pptmailfilter.ppt
mailfilter.pptbutest
 
Research Summary: Hidden Topic Markov Models, Gruber
Research Summary: Hidden Topic Markov Models, GruberResearch Summary: Hidden Topic Markov Models, Gruber
Research Summary: Hidden Topic Markov Models, GruberAlex Klibisz
 
Extractive Document Summarization - An Unsupervised Approach
Extractive Document Summarization - An Unsupervised ApproachExtractive Document Summarization - An Unsupervised Approach
Extractive Document Summarization - An Unsupervised ApproachFindwise
 
Image Similarity Detection at Scale Using LSH and Tensorflow with Andrey Gusev
Image Similarity Detection at Scale Using LSH and Tensorflow with Andrey GusevImage Similarity Detection at Scale Using LSH and Tensorflow with Andrey Gusev
Image Similarity Detection at Scale Using LSH and Tensorflow with Andrey GusevDatabricks
 
FINAL PROJECT REPORT
FINAL PROJECT REPORTFINAL PROJECT REPORT
FINAL PROJECT REPORTDhrumil Shah
 

What's hot (20)

A first look at tf idf-pdx data science meetup
A first look at tf idf-pdx data science meetupA first look at tf idf-pdx data science meetup
A first look at tf idf-pdx data science meetup
 
TopicModels_BleiPaper_Summary.pptx
TopicModels_BleiPaper_Summary.pptxTopicModels_BleiPaper_Summary.pptx
TopicModels_BleiPaper_Summary.pptx
 
Context-dependent Token-wise Variational Autoencoder for Topic Modeling
Context-dependent Token-wise Variational Autoencoder for Topic ModelingContext-dependent Token-wise Variational Autoencoder for Topic Modeling
Context-dependent Token-wise Variational Autoencoder for Topic Modeling
 
Topic Models - LDA and Correlated Topic Models
Topic Models - LDA and Correlated Topic ModelsTopic Models - LDA and Correlated Topic Models
Topic Models - LDA and Correlated Topic Models
 
Topic model an introduction
Topic model an introductionTopic model an introduction
Topic model an introduction
 
KDD 2014 Presentation (Best Research Paper Award): Alias Topic Modelling (Red...
KDD 2014 Presentation (Best Research Paper Award): Alias Topic Modelling (Red...KDD 2014 Presentation (Best Research Paper Award): Alias Topic Modelling (Red...
KDD 2014 Presentation (Best Research Paper Award): Alias Topic Modelling (Red...
 
Language Technology Enhanced Learning
Language Technology Enhanced LearningLanguage Technology Enhanced Learning
Language Technology Enhanced Learning
 
Latent dirichletallocation presentation
Latent dirichletallocation presentationLatent dirichletallocation presentation
Latent dirichletallocation presentation
 
NLP Project: Paragraph Topic Classification
NLP Project: Paragraph Topic ClassificationNLP Project: Paragraph Topic Classification
NLP Project: Paragraph Topic Classification
 
Understanding Natural Languange with Corpora-based Generation of Dependency G...
Understanding Natural Languange with Corpora-based Generation of Dependency G...Understanding Natural Languange with Corpora-based Generation of Dependency G...
Understanding Natural Languange with Corpora-based Generation of Dependency G...
 
Latent Dirichlet Allocation
Latent Dirichlet AllocationLatent Dirichlet Allocation
Latent Dirichlet Allocation
 
Text classification using Text kernels
Text classification using Text kernelsText classification using Text kernels
Text classification using Text kernels
 
Diversified Social Media Retrieval for News Stories
Diversified Social Media Retrieval for News StoriesDiversified Social Media Retrieval for News Stories
Diversified Social Media Retrieval for News Stories
 
Text classification-php-v4
Text classification-php-v4Text classification-php-v4
Text classification-php-v4
 
Topic model, LDA and all that
Topic model, LDA and all thatTopic model, LDA and all that
Topic model, LDA and all that
 
mailfilter.ppt
mailfilter.pptmailfilter.ppt
mailfilter.ppt
 
Research Summary: Hidden Topic Markov Models, Gruber
Research Summary: Hidden Topic Markov Models, GruberResearch Summary: Hidden Topic Markov Models, Gruber
Research Summary: Hidden Topic Markov Models, Gruber
 
Extractive Document Summarization - An Unsupervised Approach
Extractive Document Summarization - An Unsupervised ApproachExtractive Document Summarization - An Unsupervised Approach
Extractive Document Summarization - An Unsupervised Approach
 
Image Similarity Detection at Scale Using LSH and Tensorflow with Andrey Gusev
Image Similarity Detection at Scale Using LSH and Tensorflow with Andrey GusevImage Similarity Detection at Scale Using LSH and Tensorflow with Andrey Gusev
Image Similarity Detection at Scale Using LSH and Tensorflow with Andrey Gusev
 
FINAL PROJECT REPORT
FINAL PROJECT REPORTFINAL PROJECT REPORT
FINAL PROJECT REPORT
 

Viewers also liked

Get it right the first time through cheap and easy DIY usability testing
Get it right the first time through cheap and easy DIY usability testingGet it right the first time through cheap and easy DIY usability testing
Get it right the first time through cheap and easy DIY usability testingDavid Minton
 
Menu renard
Menu renardMenu renard
Menu renardurenarda
 
Website Redesign in Drupal: are you planning to succeed or succeeding to fail...
Website Redesign in Drupal: are you planning to succeed or succeeding to fail...Website Redesign in Drupal: are you planning to succeed or succeeding to fail...
Website Redesign in Drupal: are you planning to succeed or succeeding to fail...David Minton
 
Creative tourism into
Creative tourism intoCreative tourism into
Creative tourism intoGreg Richards
 

Viewers also liked (11)

Get it right the first time through cheap and easy DIY usability testing
Get it right the first time through cheap and easy DIY usability testingGet it right the first time through cheap and easy DIY usability testing
Get it right the first time through cheap and easy DIY usability testing
 
AWOD Orientation
AWOD OrientationAWOD Orientation
AWOD Orientation
 
Menu renard
Menu renardMenu renard
Menu renard
 
Kehoomaa
KehoomaaKehoomaa
Kehoomaa
 
Website Redesign in Drupal: are you planning to succeed or succeeding to fail...
Website Redesign in Drupal: are you planning to succeed or succeeding to fail...Website Redesign in Drupal: are you planning to succeed or succeeding to fail...
Website Redesign in Drupal: are you planning to succeed or succeeding to fail...
 
Solang
SolangSolang
Solang
 
Following the light
Following the lightFollowing the light
Following the light
 
Solang
SolangSolang
Solang
 
Test Presentation
Test PresentationTest Presentation
Test Presentation
 
Solang
SolangSolang
Solang
 
Creative tourism into
Creative tourism intoCreative tourism into
Creative tourism into
 

Similar to Lec1

A Text Mining Research Based on LDA Topic Modelling
A Text Mining Research Based on LDA Topic ModellingA Text Mining Research Based on LDA Topic Modelling
A Text Mining Research Based on LDA Topic Modellingcsandit
 
A TEXT MINING RESEARCH BASED ON LDA TOPIC MODELLING
A TEXT MINING RESEARCH BASED ON LDA TOPIC MODELLINGA TEXT MINING RESEARCH BASED ON LDA TOPIC MODELLING
A TEXT MINING RESEARCH BASED ON LDA TOPIC MODELLINGcscpconf
 
Probabilistic Retrieval Models - Sean Golliher Lecture 8 MSU CSCI 494
Probabilistic Retrieval Models - Sean Golliher Lecture 8 MSU CSCI 494Probabilistic Retrieval Models - Sean Golliher Lecture 8 MSU CSCI 494
Probabilistic Retrieval Models - Sean Golliher Lecture 8 MSU CSCI 494Sean Golliher
 
A Document Exploring System on LDA Topic Model for Wikipedia Articles
A Document Exploring System on LDA Topic Model for Wikipedia ArticlesA Document Exploring System on LDA Topic Model for Wikipedia Articles
A Document Exploring System on LDA Topic Model for Wikipedia Articlesijma
 
vectorSpaceModelPeterBurden.ppt
vectorSpaceModelPeterBurden.pptvectorSpaceModelPeterBurden.ppt
vectorSpaceModelPeterBurden.pptpepe3059
 
Chapter 4 IR Models.pdf
Chapter 4 IR Models.pdfChapter 4 IR Models.pdf
Chapter 4 IR Models.pdfHabtamu100
 
A survey on parallel corpora alignment
A survey on parallel corpora alignment A survey on parallel corpora alignment
A survey on parallel corpora alignment andrefsantos
 
Dedalo, looking for Cluster Explanations in a labyrinth of Linked Data
Dedalo, looking for Cluster Explanations in a labyrinth of Linked DataDedalo, looking for Cluster Explanations in a labyrinth of Linked Data
Dedalo, looking for Cluster Explanations in a labyrinth of Linked DataVrije Universiteit Amsterdam
 
Topic Modeling for Information Retrieval and Word Sense Disambiguation tasks
Topic Modeling for Information Retrieval and Word Sense Disambiguation tasksTopic Modeling for Information Retrieval and Word Sense Disambiguation tasks
Topic Modeling for Information Retrieval and Word Sense Disambiguation tasksLeonardo Di Donato
 
Mini-batch Variational Inference for Time-Aware Topic Modeling
Mini-batch Variational Inference for Time-Aware Topic ModelingMini-batch Variational Inference for Time-Aware Topic Modeling
Mini-batch Variational Inference for Time-Aware Topic ModelingTomonari Masada
 
EDBT 12 - Top-k interesting phrase mining in ad-hoc collections using sequenc...
EDBT 12 - Top-k interesting phrase mining in ad-hoc collections using sequenc...EDBT 12 - Top-k interesting phrase mining in ad-hoc collections using sequenc...
EDBT 12 - Top-k interesting phrase mining in ad-hoc collections using sequenc...Chuancong Gao
 

Similar to Lec1 (20)

A-Study_TopicModeling
A-Study_TopicModelingA-Study_TopicModeling
A-Study_TopicModeling
 
Ir models
Ir modelsIr models
Ir models
 
A Text Mining Research Based on LDA Topic Modelling
A Text Mining Research Based on LDA Topic ModellingA Text Mining Research Based on LDA Topic Modelling
A Text Mining Research Based on LDA Topic Modelling
 
A TEXT MINING RESEARCH BASED ON LDA TOPIC MODELLING
A TEXT MINING RESEARCH BASED ON LDA TOPIC MODELLINGA TEXT MINING RESEARCH BASED ON LDA TOPIC MODELLING
A TEXT MINING RESEARCH BASED ON LDA TOPIC MODELLING
 
Intro.ppt
Intro.pptIntro.ppt
Intro.ppt
 
Probabilistic Retrieval Models - Sean Golliher Lecture 8 MSU CSCI 494
Probabilistic Retrieval Models - Sean Golliher Lecture 8 MSU CSCI 494Probabilistic Retrieval Models - Sean Golliher Lecture 8 MSU CSCI 494
Probabilistic Retrieval Models - Sean Golliher Lecture 8 MSU CSCI 494
 
A Document Exploring System on LDA Topic Model for Wikipedia Articles
A Document Exploring System on LDA Topic Model for Wikipedia ArticlesA Document Exploring System on LDA Topic Model for Wikipedia Articles
A Document Exploring System on LDA Topic Model for Wikipedia Articles
 
Document similarity
Document similarityDocument similarity
Document similarity
 
vectorSpaceModelPeterBurden.ppt
vectorSpaceModelPeterBurden.pptvectorSpaceModelPeterBurden.ppt
vectorSpaceModelPeterBurden.ppt
 
Chapter 4 IR Models.pdf
Chapter 4 IR Models.pdfChapter 4 IR Models.pdf
Chapter 4 IR Models.pdf
 
Topic modelling
Topic modellingTopic modelling
Topic modelling
 
Canini09a
Canini09aCanini09a
Canini09a
 
A survey on parallel corpora alignment
A survey on parallel corpora alignment A survey on parallel corpora alignment
A survey on parallel corpora alignment
 
UNIT 3 IRT.docx
UNIT 3 IRT.docxUNIT 3 IRT.docx
UNIT 3 IRT.docx
 
Web search engines
Web search enginesWeb search engines
Web search engines
 
Dedalo, looking for Cluster Explanations in a labyrinth of Linked Data
Dedalo, looking for Cluster Explanations in a labyrinth of Linked DataDedalo, looking for Cluster Explanations in a labyrinth of Linked Data
Dedalo, looking for Cluster Explanations in a labyrinth of Linked Data
 
Topic Modeling for Information Retrieval and Word Sense Disambiguation tasks
Topic Modeling for Information Retrieval and Word Sense Disambiguation tasksTopic Modeling for Information Retrieval and Word Sense Disambiguation tasks
Topic Modeling for Information Retrieval and Word Sense Disambiguation tasks
 
Mini-batch Variational Inference for Time-Aware Topic Modeling
Mini-batch Variational Inference for Time-Aware Topic ModelingMini-batch Variational Inference for Time-Aware Topic Modeling
Mini-batch Variational Inference for Time-Aware Topic Modeling
 
EDBT 12 - Top-k interesting phrase mining in ad-hoc collections using sequenc...
EDBT 12 - Top-k interesting phrase mining in ad-hoc collections using sequenc...EDBT 12 - Top-k interesting phrase mining in ad-hoc collections using sequenc...
EDBT 12 - Top-k interesting phrase mining in ad-hoc collections using sequenc...
 
Some Information Retrieval Models and Our Experiments for TREC KBA
Some Information Retrieval Models and Our Experiments for TREC KBASome Information Retrieval Models and Our Experiments for TREC KBA
Some Information Retrieval Models and Our Experiments for TREC KBA
 

More from Prafulla Kiran

More from Prafulla Kiran (6)

Festival of Lights
Festival of LightsFestival of Lights
Festival of Lights
 
About Stacks
About StacksAbout Stacks
About Stacks
 
Jammu delhi-redbus
Jammu delhi-redbusJammu delhi-redbus
Jammu delhi-redbus
 
Jammu delhi-redbus
Jammu delhi-redbusJammu delhi-redbus
Jammu delhi-redbus
 
Jammu delhi-irctc
Jammu delhi-irctcJammu delhi-irctc
Jammu delhi-irctc
 
Ppapp
PpappPpapp
Ppapp
 

Recently uploaded

08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEarley Information Science
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?Igalia
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?Antenna Manufacturer Coco
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 

Recently uploaded (20)

08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 

Lec1

  • 1. MIT OpenCourseWare http://ocw.mit.edu 6.006 Introduction to Algorithms Spring 2008 For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.
  • 2. Lecture 1 Introduction and Document Distance 6.006 Spring 2008 Lecture 1: Introduction and the Document Distance Problem Course Overview • Efficient procedures for solving problems on large inputs (Ex: entire works of Shake­ speare, human genome, U.S. Highway map) • Scalability • Classic data structures and elementary algorithms (CLRS text) • Real implementations in Python ⇔ Fun problem sets! • β version of the class - feedback is welcome! Pre-requisites • Familiarity with Python and Discrete Mathematics Contents The course is divided into 7 modules - each of which has a motivating problem and problem set (except for the last module). Modules and motivating problems are as described below: 1. Linked Data Structures: Document Distance (DD) 2. Hashing: DD, Genome Comparison 3. Sorting: Gas Simulation 4. Search: Rubik’s Cube 2 × 2 × 2 5. Shortest Paths: Caltech → MIT 6. Dynamic Programming: Stock Market √ 7. Numerics: 2 Document Distance Problem Motivation Given two documents, how similar are they? • Identical - easy? • Modified or related (Ex: DNA, Plagiarism, Authorship) 1
  • 3. Lecture 1 Introduction and Document Distance 6.006 Spring 2008 • Did Francis Bacon write Shakespeare’s plays? To answer the above, we need to define practical metrics. Metrics are defined in terms of word frequencies. Definitions 1. Word : Sequence of alphanumeric characters. For example, the phrase “6.006 is fun” has 4 words. 2. Word Frequencies: Word frequency D(w) of a given word w is the number of times it occurs in a document D. For example, the words and word frequencies for the above phrase are as below: Count : 1 0 1 1 0 1 W ord : 6 the is 006 easy f un In practice, while counting, it is easy to choose some canonical ordering of words. 3. Distance Metric: The document distance metric is the inner product of the vectors D1 and D2 containing the word frequencies for all words in the 2 documents. Equivalently, this is the projection of vectors D1 onto D2 or vice versa. Mathematically this is expressed as: � D1 · D2 = D1 (w) · D2 (w) (1) w 4. Angle Metric: The angle between the vectors D1 and D2 gives an indication of overlap between the 2 documents. Mathematically this angle is expressed as: � � D1 · D2 θ(D1 , D2 ) = arccos � D1 � ∗ � D2 � 0 ≤ θ ≤ π/2 An angle metric of 0 means the two documents are identical whereas an angle metric of π/2 implies that there are no common words. 5. Number of Words in Document: The magnitude of the vector D which contains word frequencies of all words in the document. Mathematically this is expressed as: √ N (D) =� D �= D · D (2) So let’s apply the ideas to a few Python programs and try to flesh out more. 2
  • 4. Lecture 1 Introduction and Document Distance 6.006 Spring 2008 Document Distance in Practice Computing Document Distance: docdist1.py The python code and results relevant to this section are available here. This program com­ putes the distance between 2 documents by performing the following steps: • Read file • Make word list [“the”,“year”,. . . ] • Count frequencies [[“the”,4012],[“year”,55],. . . ] • Sort into order [[“a”,3120],[“after”,17],. . . ] • Compute θ Ideally, we would like to run this program to compute document distances between writings of the following authors: • Jules Verne - document size 25k • Bobsey Twins - document size 268k • Lewis and Clark - document size 1M • Shakespeare - document size 5.5M • Churchill - document size 10M Experiment: Comparing the Bobsey and Lewis documents with docdist1.py gives θ = 0.574. However, it takes approximately 3 minutes to compute this document distance, and probably gets slower as the inputs get large. What is wrong with the efficiency of this program? Is it a Python vs. C issue? Is it a choice of algorithm issue - θ(n2 ) versus θ(n)? Profiling: docdist2.py In order to figure out why our initial program is so slow, we now “instrument” the program so that Python will tell us where the running time is going. This can be done simply using the profile module in Python. The profile module indicates how much time is spent in each routine. (See this link for details on profile). The profile module is imported into docdist1.py and the end of the docdist1.py file is modified. The modified docdist1.py file is renamed as docdist2.py Detailed results of document comparisons are available here . 3
  • 5. Lecture 1 Introduction and Document Distance 6.006 Spring 2008 More on the different columns in the output displayed on that webpage: • tottime per call(column3) is tottime(column2)/ncalls(column1) • cumtime(column4)includes subroutine calls • cumtime per call(column5) is cumtime(column4)/ncalls(column1) The profiling of the Bobsey vs. Lewis document comparison is as follows: • Total: 195 secs • Get words from line list: 107 secs • Count-frequency: 44 secs • Get words from string: 13 secs • Insertion sort: 12 secs So the get words from line list operation is the culprit. The code for this particular section is: word_list = [ ] for line in L: words_in_line = get_words_from_string(line) word_list = word_list + words_in_line return word_list The bulk of the computation time is to implement word_list = word_list + words_in_line There isn’t anything else that takes up much computation time. List Concatenation: docdist3.py The problem in docdist1.py as illustrated by docdist2.py is that concatenating two lists takes time proportional to the sum of the lengths of the two lists, since each list is copied into the output list! L = L1 + L2 takes time proportional to | L1 | + | L2 | If we had n lines (each with one . word), computation time would be proportional to 1 + 2 + 3 + . . . + n = n(n+1) = θ(n2 ) 2 Solution: word_list.extend (words_in_line) [word_list.append(word)] for each word in words_in_line Ensures L1 .extend(L2 ) time proportional to | L2 | 4
  • 6. Lecture 1 Introduction and Document Distance 6.006 Spring 2008 Take Home Lesson: Python has powerful primitives (like concatenation of lists) built in. To write efficient algorithms, we need to understand their costs. See Python Cost Model for details. PS1 also has an exercise on figuring out cost of a set of operations. Incorporate this solution into docdist1.py - rename as docdist3.py. Implementation details and results are available here. This modification helps run the Bobsey vs. Lewis example in 85 secs (as opposed to the original 195 secs). We can improve further by looking for other quadratic running times hidden in our routines. The next offender (in terms of overall computation time) is the count frequency routine, which computes the frequency of each word, given the word list. Analysing Count Frequency def count_frequency(word_list): """ Return a list giving pairs of form: (word,frequency) """ L = [] for new_word in word_list: for entry in L: if new_word == entry[0]: entry[1] = entry[1] + 1 break else: L.append([new_word,1]) return L If document has n words and d distinct words, θ(nd). If all words distinct, θ(n2 ). This shows that the count frequency routine searches linearly down the list of word/frequency pairs to find the given word. Thus it has quadratic running time! Turns out the count frequency routine takes more than 1/2 of the running time in docdist3.py. Can we improve? Dictionaries: docdist4.py The solution to improve the Count Frequency routine lies in hashing, which gives constant running time routines to store and retrieve key/value pairs from a table. In Python, a hash table is called a dictionary. Documentation on dictionaries can be found here. Hash table is defined a mapping from a domain(finite collection of immutable things) to a range(anything). For example, D[‘ab’] = 2, D[‘the’] = 3. 5
  • 7. Lecture 1 Introduction and Document Distance 6.006 Spring 2008 Modify docdist3.py to docdist4.py using dictionaries to give constant time lookup. Modified count frequency routine is as follows: def count_frequency(word_list): """ Return a list giving pairs of form: (word,frequency) """ D = {} for new_word in word_list: if D.has_key(new_word): D[new_word] = D[new_word]+1 else: D[new_word] = 1 return D.items() Details of implementation and results are here. Running time is now θ(n). We have suc­ cessfully replaced one of our quadratic time routines with a linear-time one, so the running time will scale better for larger inputs. For the Bobsey vs. Lewis example, running time improves from 85 secs to 42 secs. What’s left? The two largest contributors to running time are now: • Get words from string routine (13 secs) — version 5 of docdist fixes this with translate • Insertion sort routine (11 secs) — version 6 of docdist fixes this with merge-sort More on that next time . . . 6