Towards a Web Search Service for Minority Language Communities

Talk at OpenRoad 2006 (17 January 2006, Melbourne)

Published in: Business, Technology

Transcript

  • 1. Towards a Web Search Service for Minority Language Communities
      • Baden Hughes, Department of Computer Science and Software Engineering, The University of Melbourne
      • badenh@csse.unimelb.edu.au
      • 17 January 2006 Hughes @ OpenRoad 2006
  • 2. Diversity in Australia
      • Well recognised cultural and linguistic diversity of Australia’s population
      • SIL Ethnologue: 311 languages (14th edition, 2000); 318 languages (15th edition, 2005)
      • Australia in top 10 countries for linguistic diversity ( = languages in a country / languages globally )
      • ABS: 364 languages (2005)
      • Considerable number of low density languages used within immigrant communities
  • 3. Inefficiency of Web Search
      • General web search is a low precision activity in the best case scenario
      • Google: 8 billion web pages
      • Web search for materials in lesser-used languages is even lower precision than the general case
      • Web search for minority (“low density”) languages is even lower precision again
      • Mining the ‘long tail’ of the web is a specialist domain of research
  • 4. Harvesting vs Enabling
      • Previous work in linguistically-oriented data mining of web content aimed to create derivative works: corpora, dictionaries
      • None of this work addresses the low precision issues for generalized web search
      • Our work is aimed at increasing the likelihood that end users searching the web for resources in minority languages will find useful results
      • Developing use-case specific tools for web search and leveraging existing broad coverage web search tools
  • 5. Open Language Archives Community (OLAC)
      • OLAC is a consortium of linguistic data archives: http://www.language-archives.org/
      • 34 archives, 28K+ objects in catalogue
      • OLAC metadata is based on Dublin Core, with extensions for specifically linguistically-oriented properties, e.g. language, data type, subject language, linguistic subject
      • OLAC is an Open Archives Initiative (OAI) subcommunity
      • Uses standard OAI Protocol for Metadata Harvesting to promote data access and integration
  • 6. In vs About
      • OLAC metadata crucially distinguishes between the language a resource is in (‘language’) and the language a resource is about (‘subject language’)
      • Such differentiation allows for additional precision in classifying, indexing and searching for low density language resources
      • ‘In-ness’ is more interesting than ‘About-ness’
  • 7. Service Architecture
      • Building on previous work in developing robust strategies for identifying web resources for lesser used languages, the LangGator service architecture provides:
      • Language-centric web resource identification and acquisition
      • Language-centric resource description
      • Language-aware end-user resource discovery
  • 8. Crawler Internals
      • Crawl seeded by language name variants (Ethnologue), place and country names and variants (Getty TGN), lexical items (Rosetta)
      • Programmatic queries against Google, Yahoo, A9, DogPile
      • Essentially guided metasearch
      • Resulting URIs merged and sorted using rank aggregation techniques
      • Highly ranked documents from metasearch used for focused crawling around URI
      • TF/IDF for low frequency items in found documents
  • 9. Crawler Status
      • Running intermittently since July 2004 on high bandwidth research infrastructure
      • >1.6 million web resources have been identified in over 3000 languages
      • Some exposed via standard OLAC search
      • Majority exposed to standard search engines via DP9 gateway
      • Full circle exploitation of web search
      • Evaluation of precision improvement is ongoing
      • More details in the paper (or Hughes 2005 paper)
  • 10. Metadata Descriptions
      • Describing resources separately from their realization is required since the web based language-centric resources are not held locally
      • Metadata creation is an effort intensive process
      • Automatic description generation is well studied in the general digital libraries community (e.g. Paynter 2005)
      • Some metadata elements are well supported by existing automatic metadata creation tools
      • We focus particularly on language vs subject language metadata creation since it is of primary importance
  • 11. Metadata Descriptions Status
      • We use a combination of machine learning approaches to compare and classify a given resource against human curated gold standard data for known languages
      • Primary data points: encoding, word n-grams, character n-grams
      • Secondary data points: geographical referent co-location, lexical item occurrence, URI
      • Around 40% of the >1.6 million URIs found by the crawler have been described so far, using a probability of 0.8 or higher as the threshold for acceptable language identification
      • Computationally bound at present, but re-engineering is under way
  • 12. Search Facilities
      • Currently search delivered via OLAC Search Engine (http://www.language-archives.org/tools/search/)
      • Features: web search style interface, UTF-8 support, no restrictions on string, operators, inline syntax
      • Fuzzy string matching for geographical entities and language names
      • ‘Click minimization’ strategy for empty search: pre-composed derivative queries
      • Exploits Ethnologue and Getty ontologies
      • Exploits linguistic knowledge (e.g. language families)
  • 13. Search Facilities
      • Localization-oriented interface
      • XML core with XSL
      • Entirely user preference driven with a default
      • Post-query encoding/language change
      • Currently code auditing for upgrading interface strings to XLIFF Portable Objects
      • Interest for localization into French, Spanish, Bahasa Indonesia, Vietnamese, Thai
      • More search architecture detail in Kamat and Hughes (2005)
  • 14. Language Search: Dinka
  • 15. Country Search: Togo
  • 16. Future Work
      • Increased frequency of web crawling
      • More efficient and reliable language identification
      • End user documentation and accessibility
      • API documentation for third party data consumers and documentation for service/interface customization
      • Map based search GUI; better geographical context-aware search
      • Linguistic or geographical proximity based language matching
      • Basic Language Resource Kits (BLARK)
      • Integration with MyLanguage
  • 17. Conclusion
      • Language-centric broad coverage web search is a strongly motivated user function
      • Major search providers do not focus on precision improvement per se, but their results can be incrementally improved through covert means
      • A multilingual web and multilingual web users can be supported effectively, even down to low densities
      • Interested in leveraging our existing research and service development in other ways
  • 18. Acknowledgements Research supported by the Australian Research Council under the funding program for Special Research Initiatives (E-Research) Grant SR0567353 “An Intelligent Search Infrastructure for Language Resources on the Web”.
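
Slide 8 says the URIs returned by the metasearch are merged and sorted using rank aggregation techniques, without naming a specific technique. As a minimal sketch, assuming a simple Borda count (not necessarily what the crawler uses), merging ranked lists from several engines could look like this. The function name `borda_merge` and the idea of passing one list per engine are illustrative assumptions.

```python
from collections import defaultdict


def borda_merge(rankings):
    """Merge ranked URI lists with a Borda count (an assumed technique)."""
    # Every URI returned by any engine takes part in the merged ranking.
    universe = {uri for ranking in rankings for uri in ranking}
    n = len(universe)
    scores = defaultdict(int)
    for ranking in rankings:
        # Position 0 earns n points, position 1 earns n - 1, and so on.
        # A URI that an engine did not return earns nothing from it.
        for position, uri in enumerate(ranking):
            scores[uri] += n - position
    # Highest score first; ties are broken alphabetically for determinism.
    return sorted(universe, key=lambda uri: (-scores[uri], uri))
```

Each inner list is one engine's results in rank order. A URI that several engines rank highly rises to the top, which would suit the slide's next step of using highly ranked documents as starting points for focused crawling.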

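Slide 11 lists character n-grams as a primary data point and a probability of 0.8 or higher as the acceptance threshold, but does not name the classifier. The following is a rough sketch, assuming a naive Bayes model over character trigrams with add-one smoothing and a uniform prior. The class name, the training texts a caller would supply, and the choice of trigrams are all assumptions made for illustration.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Return the character n-grams of text, padded with one space each side."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NgramLanguageIdentifier:
    """Naive Bayes over character n-grams (an assumed model, not the paper's)."""

    def __init__(self, n=3):
        self.n = n
        self.counts = {}    # language -> Counter of n-grams seen in training
        self.vocab = set()  # all n-grams seen in any language

    def train(self, language, text):
        grams = char_ngrams(text, self.n)
        self.counts.setdefault(language, Counter()).update(grams)
        self.vocab.update(grams)

    def posteriors(self, text):
        grams = char_ngrams(text, self.n)
        v = len(self.vocab) + 1  # one extra slot for unseen n-grams
        log_scores = {}
        for language, counter in self.counts.items():
            total = sum(counter.values())
            # Add-one smoothed log-likelihood of the text under this language.
            log_scores[language] = sum(
                math.log((counter[g] + 1) / (total + v)) for g in grams
            )
        # Normalise in log space (uniform prior) to get posterior probabilities.
        top = max(log_scores.values())
        weights = {lang: math.exp(s - top) for lang, s in log_scores.items()}
        z = sum(weights.values())
        return {lang: w / z for lang, w in weights.items()}

    def identify(self, text, threshold=0.8):
        """Return (language, probability), or (None, probability) below threshold."""
        post = self.posteriors(text)
        best = max(post, key=post.get)
        prob = post[best]
        return (best if prob >= threshold else None, prob)
```

A resource whose best posterior falls below the threshold is left undescribed rather than labelled, which mirrors the slide's use of 0.8 as the cut-off for acceptable identification.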
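
Slide 12 mentions fuzzy string matching for geographical entities and language names. The slides do not describe the matching method, so the sketch below simply assumes the similarity ratio from Python's standard `difflib` module. The short `LANGUAGE_NAMES` list, the function name and the `cutoff` value are hypothetical stand-ins; a real service would presumably draw names from the Ethnologue and Getty data mentioned on the slide.

```python
import difflib

# Hypothetical sample of names; not taken from the actual service.
LANGUAGE_NAMES = ["Dinka", "Ewe", "Kabiye", "Vietnamese", "Thai", "Indonesian"]


def suggest_languages(query, names=LANGUAGE_NAMES, cutoff=0.7, limit=3):
    """Return known names that approximately match a possibly misspelt query."""
    # Compare case-insensitively, then map back to the canonical spelling.
    lowered = {name.lower(): name for name in names}
    hits = difflib.get_close_matches(
        query.lower(), list(lowered), n=limit, cutoff=cutoff
    )
    return [lowered[hit] for hit in hits]
```

With this sample list, a misspelt query such as "Dinca" would still resolve to "Dinka", and a query with no close match returns an empty list.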