    IR3.ppt Presentation Transcript

    • Search Engine Technology (3) Prof. Dragomir R. Radev [email_address]
    • SET Fall 2009 … 5. Evaluation of IR systems Reference collections TREC …
    • Relevance
      • Difficult to change: fuzzy, inconsistent
      • Methods: exhaustive, sampling, pooling, search-based
    • Contingency table
                         relevant      not relevant
        retrieved        w = tp        y = fp            n2 = w + y
        not retrieved    x = fn        z = tn
                         n1 = w + x                      N
    • Precision and Recall
      • Recall: R = w / (w + x)
      • Precision: P = w / (w + y)
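The two ratios can be sketched directly from the contingency-table cells. This is a minimal illustration; the function name and the example counts are made up for this sketch, not taken from the slides.

```python
def precision_recall(w, x, y):
    """Compute (precision, recall) from contingency-table counts.

    w: relevant and retrieved (tp)
    x: relevant but not retrieved (fn)
    y: retrieved but not relevant (fp)
    """
    precision = w / (w + y) if (w + y) else 0.0
    recall = w / (w + x) if (w + x) else 0.0
    return precision, recall


# Hypothetical counts: 30 relevant retrieved, 20 relevant missed, 10 false hits.
p, r = precision_recall(w=30, x=20, y=10)
print(p, r)  # 0.75 0.6
```

Note that z (the true negatives) does not appear in either ratio.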
    • Exercise
      • Go to Google (www.google.com) and search for documents on Tolkien's "Lord of the Rings". Try different ways of phrasing the query, e.g., Tolkien, "JRR Tolkien", +"JRR Tolkien" +"Lord of the Rings", etc.
      • For each query, compute the precision (P) based on the first 10 documents returned by the search engine.
      • Note: before starting the exercise, have a clear idea of what a relevant document for your query should look like.
      • Try different information needs. Later, try different queries.
    • [From Salton’s book]
    • Interpolated average precision (e.g., 11pt) Interpolation – what is precision at recall=0.5?
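One common way to answer the interpolation question is to define the interpolated precision at recall level r as the highest precision observed at any recall greater than or equal to r. The sketch below assumes that definition; the function name and the ranked list are hypothetical.

```python
def interpolated_precision_11pt(ranked_relevance, total_relevant):
    """Return interpolated precision at recall 0.0, 0.1, ..., 1.0.

    ranked_relevance: list of booleans, one per retrieved document, in rank order.
    total_relevant: number of relevant documents in the collection for this query.
    """
    # (recall, precision) measured at each rank where a relevant document appears.
    points = []
    hits = 0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
            points.append((hits / total_relevant, hits / rank))

    curve = []
    for i in range(11):
        level = i / 10
        # Small tolerance guards against floating-point noise at recall = 1.0.
        candidates = [p for r, p in points if r >= level - 1e-12]
        curve.append(max(candidates) if candidates else 0.0)
    return curve


# Hypothetical ranking: relevant documents at ranks 1, 3, and 6; 3 relevant overall.
ranking = [True, False, True, False, False, True, False, False, False, False]
curve = interpolated_precision_11pt(ranking, total_relevant=3)
print(round(curve[5], 3))         # precision at recall = 0.5 -> 0.667
print(round(sum(curve) / 11, 3))  # 11-point average -> 0.727
```

In this example no measured point has recall exactly 0.5, so the value is taken from the next measured point (recall 2/3, precision 2/3).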
    • Issues
      • Why not use accuracy A=(w+z)/N?
      • Average precision
      • Average P at given “document cutoff values”
      • Report when P=R
      • F measure: F = (β² + 1)PR / (β²P + R)
      • F1 measure: F1 = 2/(1/R+1/P) : harmonic mean of P and R
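A short sketch of why accuracy is a poor fit for retrieval, and of the F measure with β = 1 reducing to the harmonic mean. The collection sizes are invented for illustration.

```python
def accuracy(w, x, y, z):
    """A = (w + z) / N, with N = w + x + y + z."""
    return (w + z) / (w + x + y + z)


def f_measure(p, r, beta=1.0):
    """F = (beta^2 + 1) * P * R / (beta^2 * P + R); returns 0 when P = R = 0."""
    if p == 0 and r == 0:
        return 0.0
    b2 = beta * beta
    return (b2 + 1) * p * r / (b2 * p + r)


# Hypothetical collection: 1000 documents, 10 relevant, and a system that
# retrieves nothing at all. Accuracy looks excellent although recall is 0.
print(accuracy(w=0, x=10, y=0, z=990))  # 0.99

# With beta = 1 the F measure is the harmonic mean of P and R.
print(round(f_measure(0.75, 0.6), 3))   # 0.667
```

Because almost every document is non-relevant for a typical query, z dominates N and accuracy rewards retrieving nothing.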
    • Kappa
      • N: number of items (index i)
      • n: number of categories (index j)
      • k: number of annotators
    • Kappa example
                 J1+    J1-    TOTAL
        J2+      300     10      310
        J2-       20     70       90
        TOTAL    320     80      400
    • Kappa (cont’d)
      • P(A) = 370/400 = 0.925
      • P (-) = (10+20+70+70)/800 = 0.2125
      • P (+) = (10+20+300+300)/800 = 0.7875
      • P (E) = 0.2125 * 0.2125 + 0.7875 * 0.7875 = 0.665
      • K = (0.925-0.665)/(1-0.665) = 0.776
      • Kappa higher than 0.67 is tentatively acceptable; higher than 0.8 is good
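The arithmetic on this slide can be reproduced with a few lines. This sketch covers only the two-judge, two-category case with pooled marginals, as in the example; the function and parameter names are chosen for illustration.

```python
def kappa_two_judges(both_pos, only_j1_pos, only_j2_pos, both_neg):
    """Kappa for two judges and two categories, using pooled marginals."""
    n = both_pos + only_j1_pos + only_j2_pos + both_neg
    p_agree = (both_pos + both_neg) / n
    # Pooled marginals: each item contributes two judgments, hence 2 * n.
    p_pos = (2 * both_pos + only_j1_pos + only_j2_pos) / (2 * n)
    p_neg = (2 * both_neg + only_j1_pos + only_j2_pos) / (2 * n)
    p_chance = p_pos ** 2 + p_neg ** 2
    return (p_agree - p_chance) / (1 - p_chance)


# Counts from the Kappa example: 300 (+,+), 20 (J1+, J2-), 10 (J1-, J2+), 70 (-,-).
k = kappa_two_judges(both_pos=300, only_j1_pos=20, only_j2_pos=10, both_neg=70)
print(round(k, 3))  # 0.776
```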
    • Sample TREC query
      <top>
      <num> Number: 305
      <title> Most Dangerous Vehicles
      <desc> Description:
      Which are the most crashworthy, and least crashworthy, passenger vehicles?
      <narr> Narrative:
      A relevant document will contain information on the crashworthiness of a given vehicle or vehicles that can be used to draw a comparison with other vehicles. The document will have to describe/compare vehicles, not drivers. For instance, it should be expected that vehicles preferred by 16-25 year-olds would be involved in more crashes, because that age group is involved in more crashes. I would view number of fatalities per 100 crashes to be more revealing of a vehicle's crashworthiness than the number of crashes per 100,000 miles, for example.
      </top>

      LA031689-0177 FT922-1008 LA090190-0126 LA101190-0218 LA082690-0158
      LA112590-0109 FT944-136 LA020590-0119 FT944-5300 LA052190-0048
      LA051689-0139 FT944-9371 LA032390-0172 LA042790-0172 LA021790-0136
      LA092289-0167 LA111189-0013 LA120189-0179 LA020490-0021 LA122989-0063
      LA091389-0119 LA072189-0048 FT944-15615 LA091589-0101 LA021289-0208
    • <DOCNO> LA031689-0177 </DOCNO>
      <DOCID> 31701 </DOCID>
      <DATE><P>March 16, 1989, Thursday, Home Edition </P></DATE>
      <SECTION><P>Business; Part 4; Page 1; Column 5; Financial Desk </P></SECTION>
      <LENGTH><P>586 words </P></LENGTH>
      <HEADLINE><P>AGENCY TO LAUNCH STUDY OF FORD BRONCO II AFTER HIGH RATE OF ROLL-OVER ACCIDENTS </P></HEADLINE>
      <BYLINE><P>By LINDA WILLIAMS, Times Staff Writer </P></BYLINE>
      <TEXT>
      <P>The federal government's highway safety watchdog said Wednesday that the Ford Bronco II appears to be involved in more fatal roll-over accidents than other vehicles in its class and that it will seek to determine if the vehicle itself contributes to the accidents. </P>
      <P>The decision to do an engineering analysis of the Ford Motor Co. utility-sport vehicle grew out of a federal accident study of the Suzuki Samurai, said Tim Hurd, a spokesman for the National Highway Traffic Safety Administration. NHTSA looked at Samurai accidents after Consumer Reports magazine charged that the vehicle had basic design flaws. </P>
      <P>Several Fatalities </P>
      <P>However, the accident study showed that the "Ford Bronco II appears to have a higher number of single-vehicle, first event roll-overs, particularly those involving fatalities," Hurd said. The engineering analysis of the Bronco, the second of three levels of investigation conducted by NHTSA, will cover the 1984-1989 Bronco II models, the agency said. </P>
      <P>According to a Fatal Accident Reporting System study included in the September report on the Samurai, 43 Bronco II single-vehicle roll-overs caused fatalities, or 19 of every 100,000 vehicles. There were eight Samurai fatal roll-overs, or 6 per 100,000; 13 involving the Chevrolet S10 Blazers or GMC Jimmy, or 6 per 100,000, and six fatal Jeep Cherokee roll-overs, for 2.5 per 100,000. After the accident report, NHTSA declined to investigate the Samurai. </P>
      ...
      </TEXT>
      <GRAPHIC><P> Photo, The Ford Bronco II "appears to have a higher number of single-vehicle, first event roll-overs," a federal official said. </P></GRAPHIC>
      <SUBJECT> <P>TRAFFIC ACCIDENTS; FORD MOTOR CORP; NATIONAL HIGHWAY TRAFFIC SAFETY ADMINISTRATION; VEHICLE INSPECTIONS; RECREATIONAL VEHICLES; SUZUKI MOTOR CO; AUTOMOBILE SAFETY </P> </SUBJECT>
      </DOC>
    • TREC (cont’d)
      • http://trec.nist.gov/tracks.html
      • http://trec.nist.gov/presentations/presentations.html
    • Most used reference collections
      • Generic retrieval: OHSUMED, CRANFIELD, CACM
      • Text classification: Reuters, 20newsgroups
      • Question answering: TREC-QA
      • Web: DOTGOV, wt100g
      • Blogs: Buzzmetrics datasets
      • TREC ad hoc collections, 2-6 GB
      • TREC Web collections, 2-100GB
    • Comparing two systems
      • Comparing A and B
      • One query?
      • Average performance?
      • Need: A to consistently outperform B
      [this slide: courtesy James Allan]
    • The sign test
      • Example 1:
        • A > B (12 times)
        • A = B (25 times)
        • A < B (3 times)
        • p ≈ 0.035 (two-sided; ties are ignored; significant at the 5% level)
      • Example 2:
        • A > B (18 times)
        • A < B (9 times)
        • p ≈ 0.122 (two-sided; not significant at the 5% level)
        • http://www.fon.hum.uva.nl/Service/Statistics/Sign_Test.html
      [this slide: courtesy James Allan]
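The p-values in the two examples are consistent with a two-sided exact binomial test on the non-tied queries, which can be sketched as follows. The function name is illustrative, and dropping ties is the usual convention assumed here.

```python
from math import comb


def sign_test_p(wins_a, wins_b):
    """Two-sided exact sign test; tied queries are dropped before calling."""
    n = wins_a + wins_b
    k = min(wins_a, wins_b)
    # Probability of k or fewer wins for the weaker system under a fair coin.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


print(round(sign_test_p(12, 3), 4))  # Example 1 -> 0.0352
print(round(sign_test_p(18, 9), 4))  # Example 2 -> 0.1221
```

Example 2 has a larger absolute margin (9 queries) than Example 1, but the split is less lopsided, so the evidence is weaker.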
    • Other tests
      • Student t-test: takes into account the actual performances, not just which system is better
        • http://www.fon.hum.uva.nl/Service/Statistics/Student_t_Test.html
        • http://www.socialresearchmethods.net/kb/stat_t.php
      • Wilcoxon Matched-Pairs Signed-Ranks Test
        • http://www.fon.hum.uva.nl/Service/Statistics/Signed_Rank_Test.html
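To illustrate how the t-test uses the size of each per-query difference rather than only its sign, here is a minimal sketch of the paired t statistic using the standard library. The per-query scores are invented, and the sketch stops at the statistic; turning it into a p-value requires the t distribution with n - 1 degrees of freedom, which is not computed here.

```python
import math
import statistics


def paired_t_statistic(scores_a, scores_b):
    """t = mean(d) / (stdev(d) / sqrt(n)) for per-query differences d = a - b."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    return statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))


# Hypothetical average-precision scores for systems A and B on five queries.
system_a = [0.52, 0.61, 0.70, 0.45, 0.58]
system_b = [0.48, 0.40, 0.66, 0.44, 0.50]
print(paired_t_statistic(system_a, system_b))
```

A sign test on the same data would see only "A wins 5 times", discarding the fact that one win is by 0.21 and another by 0.01.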
    • SET Fall 2009 … 6. Automated indexing/labeling Compression …
    • Indexing methods
      • Manual: e.g., Library of Congress subject headings, MeSH
      • Automatic: e.g., TF*IDF based
    • LOC subject headings http://www.loc.gov/catdir/cpso/lcco/lcco.html A -- GENERAL WORKS B -- PHILOSOPHY. PSYCHOLOGY. RELIGION C -- AUXILIARY SCIENCES OF HISTORY D -- HISTORY (GENERAL) AND HISTORY OF EUROPE E -- HISTORY: AMERICA F -- HISTORY: AMERICA G -- GEOGRAPHY. ANTHROPOLOGY. RECREATION H -- SOCIAL SCIENCES J -- POLITICAL SCIENCE K -- LAW L -- EDUCATION M -- MUSIC AND BOOKS ON MUSIC N -- FINE ARTS P -- LANGUAGE AND LITERATURE Q -- SCIENCE R -- MEDICINE S -- AGRICULTURE T -- TECHNOLOGY U -- MILITARY SCIENCE V -- NAVAL SCIENCE Z -- BIBLIOGRAPHY. LIBRARY SCIENCE. INFORMATION RESOURCES (GENERAL)
    • Medicine CLASS R - MEDICINE Subclass R R5-920 Medicine (General) R5-130.5 General works R131-687 History of medicine. Medical expeditions R690-697 Medicine as a profession. Physicians R702-703 Medicine and the humanities. Medicine and disease in relation to history, literature, etc. R711-713.97 Directories R722-722.32 Missionary medicine. Medical missionaries R723-726 Medical philosophy. Medical ethics R726.5-726.8 Medicine and disease in relation to psychology. Terminal care. Dying R727-727.5 Medical personnel and the public. Physician and the public R728-733 Practice of medicine. Medical practice economics R735-854 Medical education. Medical schools. Research R855-855.5 Medical technology R856-857 Biomedical engineering. Electronics. Instrumentation R858-859.7 Computer applications to medicine. Medical informatics R864 Medical records R895-920 Medical physics. Medical radiology. Nuclear medicine
    • Automatic methods
      • TF*IDF: pick terms with the highest TF*IDF scores
      • Centroid-based: pick terms that appear in the centroid with high scores
      • The maximal marginal relevance principle (MMR)
      • Related to summarization, snippet generation
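A minimal sketch of the first method, picking index terms by TF*IDF. It assumes one particular weighting (raw term frequency times log(N/df)); the slides do not specify a variant, and the toy documents and function name are made up.

```python
import math
from collections import Counter


def top_tfidf_terms(docs, doc_index, k=3):
    """Return the k terms of docs[doc_index] with the highest TF*IDF scores.

    docs: list of tokenized documents (lists of strings).
    Assumed weighting: tf * log(N / df).
    """
    n_docs = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # count each term once per document
    tf = Counter(docs[doc_index])
    scores = {term: tf[term] * math.log(n_docs / df[term]) for term in tf}
    # Sort by descending score, breaking ties alphabetically.
    return sorted(scores, key=lambda term: (-scores[term], term))[:k]


docs = [
    "the bronco rollover study the bronco".split(),
    "the samurai study".split(),
    "the jeep report".split(),
]
print(top_tfidf_terms(docs, 0, k=2))  # ['bronco', 'rollover']
```

The word "the" is the most frequent term in the first document but occurs in every document, so its IDF, and hence its score, is zero.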
    • Compression
      • Methods
        • Fixed length codes
        • Huffman coding
        • Ziv-Lempel codes
    • Fixed length codes
      • Binary representations
        • ASCII
        • Representational power (2^k symbols where k is the number of bits)
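Read in the other direction, the 2^k relationship gives the number of bits a fixed-length code needs for a given alphabet size. A small sketch follows; the function name is illustrative.

```python
def bits_needed(num_symbols):
    """Smallest k such that 2**k >= num_symbols (at least 1 bit)."""
    return max(1, (num_symbols - 1).bit_length())


print(bits_needed(26))   # letters only -> 5
print(bits_needed(128))  # 7-bit ASCII -> 7
```

Using integer bit_length avoids floating-point rounding that can occur with logarithms at exact powers of two.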
    • Variable length codes
      • Alphabet:
      • A .-      N -.      0 -----
      • B -...    O ---     1 .----
      • C -.-.    P .--.    2 ..---
      • D -..     Q --.-    3 ...--
      • E .       R .-.     4 ....-
      • F ..-.    S ...     5 .....
      • G --.     T -       6 -....
      • H ....    U ..-     7 --...
      • I ..      V ...-    8 ---..
      • J .---    W .--     9 ----.
      • K -.-     X -..-
      • L .-..    Y -.--
      • M --      Z --..
      • Demo:
        • http://www.scphillips.com/morse/
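A small sketch of Morse as a variable-length code, using only a handful of letters from the table. The function name is illustrative. The explicit space between letters matters: Morse is not a prefix code (E "." is a prefix of A ".-"), so letters cannot be told apart without a separator.

```python
# Subset of the Morse table; frequent letters such as E and T get the shortest codes.
MORSE = {
    "E": ".",
    "T": "-",
    "A": ".-",
    "N": "-.",
    "S": "...",
    "O": "---",
    "Q": "--.-",
}


def morse_encode(word):
    """Encode a word letter by letter, separating letters with a space."""
    return " ".join(MORSE[ch] for ch in word.upper())


print(morse_encode("sos"))   # ... --- ...
print(morse_encode("note"))  # -. --- - .
```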
    • Most frequent letters in English
      • Most frequent letters:
        • E T A O I N S H R D L U
      • Demo:
        • http://www.amstat.org/publications/jse/secure/v7n2/count-char.cfm
      • Also: bigrams:
        • TH HE IN ER AN RE ND AT ON NT
    • Huffman coding
      • Developed by David Huffman (1952)
      • Average of 5 bits per character (37.5% compression relative to 8-bit fixed-length codes)
      • Based on frequency distributions of symbols
      • Algorithm: iteratively build a tree of symbols starting with the two least frequent symbols
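The tree-building step can be sketched with a priority queue: repeatedly remove the two least frequent nodes and merge them. The symbol frequencies below are invented, and which child gets 0 or 1 is an arbitrary choice in this sketch.

```python
import heapq
from itertools import count


def huffman_code(freqs):
    """Build a Huffman code from a {symbol: frequency} dict of string symbols."""
    tiebreak = count()  # unique counter so the heap never compares tree nodes
    heap = [(f, next(tiebreak), sym) for sym, f in freqs.items()]
    heapq.heapify(heap)
    if len(heap) == 1:
        return {heap[0][2]: "0"}
    while len(heap) > 1:
        # Merge the two least frequent nodes into one internal node.
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(tiebreak), (left, right)))

    codes = {}

    def walk(node, prefix):
        if isinstance(node, tuple):  # internal node: (left, right)
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:  # leaf: a symbol
            codes[node] = prefix

    walk(heap[0][2], "")
    return codes


# Hypothetical frequencies (per 100 characters).
freqs = {"a": 45, "b": 13, "c": 12, "d": 16, "e": 9, "f": 5}
codes = huffman_code(freqs)
for sym in sorted(codes):
    print(sym, codes[sym])
```

The most frequent symbol ends up closest to the root and receives the shortest code; no code word is a prefix of another, so the output can be decoded without separators.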
    • [Figure: Huffman tree with branches labeled 0 and 1 and leaves c, b, d, f, g, i, j, h, e, a]
    • Exercise
      • Consider the bit string: 01101101111000100110001110100111000110101101011101
      • Use the Huffman code from the example to decode it.
      • Try inserting, deleting, and switching some bits at random locations and try decoding.
    • Extensions
      • Word-based
      • Domain/genre dependent models
    • Ziv-Lempel coding
      • Two types - one is known as LZ77 (used in GZIP)
      • Code: set of triples <a,b,c>
      • a: how far back in the decoded text to look for the upcoming text segment
      • b: how many characters to copy
      • c: new character to add to complete segment
      • <0,0,p> p
      • <0,0,e> pe
      • <0,0,t> pet
      • <2,1,r> peter
      • <0,0,_> peter_
      • <6,1,i> peter_pi
      • <8,2,r> peter_piper
      • <6,3,c> peter_piper_pic
      • <0,0,k> peter_piper_pick
      • <7,1,d> peter_piper_picked
      • <7,1,a> peter_piper_picked_a
      • <9,2,e> peter_piper_picked_a_pe
      • <9,2,_> peter_piper_picked_a_peck_
      • <0,0,o> peter_piper_picked_a_peck_o
      • <0,0,f> peter_piper_picked_a_peck_of
      • <17,5,l> peter_piper_picked_a_peck_of_pickl
      • <12,1,d> peter_piper_picked_a_peck_of_pickled
      • <16,3,p> peter_piper_picked_a_peck_of_pickled_pep
      • <3,2,r> peter_piper_picked_a_peck_of_pickled_pepper
      • <0,0,s> peter_piper_picked_a_peck_of_pickled_peppers
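The triple format can be checked with a short decoder. This sketch follows the <a,b,c> semantics given on the slide (look back a characters, copy b characters, then append c); the function name is illustrative, and real LZ77 implementations such as the one in GZIP differ in many details.

```python
def lz77_decode(triples):
    """Decode a list of (back, length, char) triples into a string."""
    text = []
    for back, length, ch in triples:
        start = len(text) - back
        # Copy one character at a time so the copy may overlap its own output.
        for i in range(length):
            text.append(text[start + i])
        text.append(ch)
    return "".join(text)


triples = [
    (0, 0, "p"), (0, 0, "e"), (0, 0, "t"), (2, 1, "r"), (0, 0, "_"),
    (6, 1, "i"), (8, 2, "r"), (6, 3, "c"), (0, 0, "k"), (7, 1, "d"),
    (7, 1, "a"), (9, 2, "e"), (9, 2, "_"), (0, 0, "o"), (0, 0, "f"),
    (17, 5, "l"), (12, 1, "d"), (16, 3, "p"), (3, 2, "r"), (0, 0, "s"),
]
print(lz77_decode(triples))  # peter_piper_picked_a_peck_of_pickled_peppers
```

The 44-character phrase is represented by 20 triples; repeated segments such as "_pick" cost a single triple.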
    • Links on text compression
      • Data compression:
        • http://www.data-compression.info/
      • Calgary corpus:
        • http://links.uwaterloo.ca/calgary.corpus.html
      • Huffman coding:
        • http://www.compressconsult.com/huffman/
        • http://en.wikipedia.org/wiki/Huffman_coding
      • LZ
        • http://en.wikipedia.org/wiki/LZ77
    • 100 alternative search engines
      • http://rss.slashdot.org/~r/Slashdot/slashdot/~3/83468703/article.pl
    • Readings
      • 2: MRS9
      • 3: MRS13, MRS14
      • 4: MRS15, MRS16