Cross-Language Information Retrieval

Cross-language information retrieval (CLIR) is a technique to locate documents written in one natural language by queries expressed in another language. This project investigates the feasibility of CLIR based on domain-specific bilingual corpus databases.


  1. Cross-Language Information Retrieval. University of Arizona. Sumin Byeon
  2. Overview
     [Diagram: the Korean query "안드로이드 이메일 암호화" ("Android email encryption") is run through a matching algorithm backed by a bilingual corpus database, producing the English query "Android email encryption", which is submitted to Google Search to return results in English.]
  3. Background
     • Corpus - a collection of written text: a single word, multiple words, or even phrases and sentences
     • Comparable corpus - a collection of texts from a pair of languages referring to the same domain [1]; a (source text, target text) pair
     • N-gram - an n-character or n-word slice of a longer string [2]. We refer to n-character slices as n-grams, and we use 4-grams (four-grams, or quad-grams)
     • Source language - the language of the original phrases
     • Target language - the language into which CLIR translates the original phrases
     [1]: Picchi, Eugenio, and Carol Peters. "Cross-Language Information Retrieval: A System for Comparable Corpus Querying." Vol. 2. Springer US, 1998.
     [2]: Cavnar, William B., and John M. Trenkle. "N-Gram-Based Text Categorization." 1994.
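The n-gram definition above can be illustrated with a short sketch (the function name is my own):

```python
def ngrams(text, n=4):
    """Return all n-character slices (quad-grams by default) of a string."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

print(ngrams("Java"))    # → ['Java'] (a 4-character string yields one quad-gram)
print(ngrams("global"))  # → ['glob', 'loba', 'obal']
```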
  4. Motivation
     • Users want to acquire information even when it is not sufficiently available in their native language
     • Surveys have shown that people tend to have higher foreign-language proficiency in reading than in writing
     • CLIR may bridge the gap between the desire for information and its unavailability, or under-availability, in the users' native language
  5. Goals
     • Allow users to query for domain-specific (i.e., computer science and software engineering) information in their native language
     • Present relevant search results in the target language: the language in which the largest amount of information is available
  6. Components
     • Domain-specific bilingual corpus extraction from multiple sources
     • Corpus indexing
     • Querying and string matching
  7. Corpus Extraction
  8. Corpus Indexing
     • Each comparable corpus pair is indexed as a sequence of (offset, fingerprint) pairs: (S, T) -> (i1, h1), (i2, h2), …, (in, hn)
     • Quad-grams (k = 4)
     • Fingerprint overlapping is acceptable, although it is not the most space-efficient approach
     • Example pairs: Java / 자바, global variable / 전역 변수, example / 예제; e.g. "Java" yields 0: Java (20451), "global variable" yields 3: bal_ (14870) and 8: aria (14269), "example" yields 1: xamp (20451)
     [Figure: quad-gram frequency histogram]
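The indexing step above can be sketched as follows. Note that the slides use the project's own fingerprint function; CRC32 here is a stand-in assumption, so the hash values will differ from the slide's (e.g. 14870 for "bal_"):

```python
import zlib

def quadgram_index(source, n=4):
    """Index a phrase as (offset, quad-gram, fingerprint) triples.

    The fingerprint is a hash of the quad-gram; CRC32 is used here
    only for illustration."""
    return [(i, source[i:i + n], zlib.crc32(source[i:i + n].encode("utf-8")))
            for i in range(len(source) - n + 1)]

# The comparable corpus pair ("global variable", "전역 변수") from the slide:
entries = quadgram_index("global variable")
print(entries[3][:2])  # → (3, 'bal ') — the "3: bal_" entry on the slide
print(entries[8][:2])  # → (8, 'aria') — the "8: aria" entry on the slide
```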
  9. Querying & Matching
     • The query "Java global variable example" is decomposed into quad-gram fingerprints (0: Java, 1: ava_, …, 8: bal_, 13: aria, …, 22: xamp) and matched against the indexed corpus entries for Java / 자바, global variable / 전역 변수, and example / 예제
  10. Multiple Candidates
     • Longest match first
     • Confidence: how many times does this comparable corpus pair appear in a set of documents?
     • The outcome of matching depends on the domain of the documents stored in the database
     • Example: "global variable" matches 전역 변수 as a whole, while "global" alone could map to 세계적인 ("worldwide") and "variable" to 변수 ("variable") or 가변적인 ("changeable")
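The longest-match-first disambiguation described on this slide can be sketched as below; the lexicon contents and confidence counts are illustrative, not the project's actual data:

```python
def translate_longest_first(tokens, lexicon):
    """Greedily translate, preferring the longest source-phrase match.

    lexicon maps a source phrase to a list of (translation, confidence)
    candidates; confidence counts how often the comparable corpus pair
    appears in the document set."""
    out, i = [], 0
    while i < len(tokens):
        # Try the longest window starting at position i first.
        for j in range(len(tokens), i, -1):
            phrase = " ".join(tokens[i:j])
            if phrase in lexicon:
                # Among candidate translations, pick the highest confidence.
                best = max(lexicon[phrase], key=lambda c: c[1])
                out.append(best[0])
                i = j
                break
        else:
            out.append(tokens[i])  # no match: pass the token through
            i += 1
    return " ".join(out)

# Illustrative lexicon for the slide's example:
lexicon = {
    "global variable": [("전역 변수", 5)],
    "global": [("세계적인", 2)],
    "variable": [("변수", 4), ("가변적인", 1)],
}
print(translate_longest_first("global variable".split(), lexicon))  # → 전역 변수
```

The two-word match wins over translating "global" and "variable" separately, which is exactly the ambiguity the slide illustrates.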
  11. Indexing and Querying Recap
     • Query: 자바 전역 변수 예제
     • Candidate translations: 자바 → Java; 전역 → transfer; 전역 → all parts (of); 전역 변수 → global variable; 변수 → variable; 예제 → example
     • Result: "Java global variable example"
  12. Relationship with Content Addressability
     • The translated terms of the query 자바 전역 변수 예제 (Java, global variable, example) can be located directly within a target-language document
     [Figure: a sample English document (lorem ipsum filler) with the terms "Java", "global variable", and "example" embedded in it]
  13. Evaluation
     • Matching
       • Did it translate all the search terms into the target language properly?
       • Did it preserve domain-specific information?
     • Searching
       • Hit ratio: # of relevant web pages / # of results on the first page
       • Total number of search results
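The hit-ratio metric above is a direct division; using the figures from a later slide (6 relevant results out of 10 on the first page for the source-language query 재귀 열거 집합):

```python
def hit_ratio(relevant_on_first_page, results_on_first_page):
    """Hit ratio = # of relevant web pages / # of results on the first page."""
    return relevant_on_first_page / results_on_first_page

print(hit_ratio(6, 10))  # → 0.6
```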
  14. Evaluation
     • 재귀 열거 집합 → "recursively enumerable sets": (3/3, 1/1)
     • 배낭 문제 시간 복잡도 → "배낭 issue the time complexity": (3/4, 1/2)
     • 가상화를 통한 데이터센터 에너지 효율 극대화 → "through virtualization datacenter energy efficiency maximization": (7/7, 4/4)
  15. Evaluation (hit ratio, total results)
     • Query in source language, 재귀 열거 집합: (6/10, 15,300)
     • Query in target language, "recursively enumerable sets": (10/10, 105,000)
     • Google Translate result, "Set of recursive enumeration": (10/10, 1,990,000)
  16. Evaluation (hit ratio, total results)
     • Query in source language, 배낭 문제 시간 복잡도: (10/10, 31,200)
     • Query in target language, "배낭 issue time complexity": (2/6, 2,270)
     • Google Translate result, "Knapsack problem, the time complexity": (10/10, 206,000)
  17. Evaluation (hit ratio, total results)
     • Query in source language, 가상화를 통한 데이터센터 에너지 효율 극대화: (5/10, 36,100)
     • Query in target language, "through virtualization datacenter energy efficiency maximization": (8/10, 264,000)
     • Google Translate result, "Maximize energy efficiency through data center virtualization": (10/10, 284,000)
  18. Conclusion & Future Work
     • Preliminary results look satisfactory
     • Machine-translation-based CLIR appears to be more useful in many cases
     • The evaluation factors may not reflect the actual quality of the system
     • The evaluation process is labor-intensive; an automated evaluation is needed
     • Fuzzy matching based on lexical information (e.g., call, calls)
     • Fuzzy matching based on semantic information (e.g., maximize, maximizing, maximization, maximum)
