Towards a Web Search Service for Minority Language Communities
Baden Hughes

Talk at OpenRoad 2006 (17 January 2006, Melbourne)


  1. Towards a Web Search Service for Minority Language Communities
       Baden Hughes
       Department of Computer Science and Software Engineering
       The University of Melbourne
       badenh@csse.unimelb.edu.au
       17 January 2006   Hughes @ OpenRoad 2006
  2. Diversity in Australia
       Well recognised cultural and linguistic diversity of Australia’s population
             SIL Ethnologue: 311 languages (14th edition, 2000); 318 languages (15th edition, 2005)
             Australia in top 10 countries for linguistic diversity (= languages in a country / languages globally)
             ABS: 364 languages (2005)
       Considerable number of low density languages used within immigrant communities
  3. Inefficiency of Web Search
       General web search is a low precision activity in the best case scenario
             Google: 8 billion web pages
       Web search for materials in lesser-used languages is even lower precision than the general case
       Web search for minority (“low density”) languages is even lower precision again
       Mining the ‘long tail’ of the web is a specialist domain of research
  4. Harvesting vs Enabling
       Previous work in linguistically-oriented data mining of web content to create derivative works: corpora, dictionaries
       None of these address the low precision issues for generalized web search
       Our work is aimed at increasing the likelihood that end users searching for resources in minority languages on the web will find useful results from searching
       Developing use-case specific tools for web search and leveraging existing broad coverage web search tools
  5. Open Language Archives Community (OLAC)
       OLAC is a consortium of linguistic data archives
             http://www.language-archives.org/
             34 archives, 28K+ objects in catalogue
       OLAC metadata is based on Dublin Core, with extensions for specifically linguistically-oriented properties
             e.g. language, data type, subject language, linguistic subject
       OLAC is an Open Archives Initiative (OAI) subcommunity
             Uses standard OAI Protocol for Metadata Harvesting to promote data access and integration
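The OAI Protocol for Metadata Harvesting mentioned on the OLAC slide is driven by plain HTTP requests with a `verb` argument. The following is a minimal sketch of how a harvester might build a `ListRecords` request URL. The base URL `https://archive.example.org/oai` and the function name are hypothetical, and `oai_dc` is used as the default prefix only because every OAI-PMH repository must support it; the slides do not say which requests the OLAC tools actually issue.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix="oai_dc", resumption_token=None):
    """Build an OAI-PMH ListRecords request URL (illustrative helper)."""
    if resumption_token is not None:
        # The protocol treats resumptionToken as an exclusive argument:
        # it is sent alone with the verb to fetch the next batch.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urlencode(params)


# Hypothetical repository endpoint, used only for illustration.
first_page = list_records_url("https://archive.example.org/oai")
next_page = list_records_url("https://archive.example.org/oai", resumption_token="abc123")
print(first_page)
print(next_page)
```

A harvester would fetch the first URL, parse the XML response, and keep requesting with the returned resumption token until the repository stops issuing one.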
  6. In vs About
       OLAC Metadata crucially distinguishes between
             The language a resource is in (‘language’)
             The language a resource is about (‘subject language’)
       Such differentiation allows for additional precision in classifying, indexing and searching for low density language resources
       ‘In-ness’ is more interesting than ‘About-ness’
  7. Service Architecture
       Building on previous work in developing robust strategies for identifying web resources for lesser used languages on the web, the LangGator service architecture provides
             Language-centric web resource identification and acquisition
             Language-centric resource description
             Language-aware end-user resource discovery
  8. Crawler Internals
       Crawl seeded by language name variants (Ethnologue), place and country names and variants (Getty TGN), lexical items (Rosetta)
       Programmatic queries against Google, Yahoo, A9, DogPile
       Essentially guided metasearch
       Resulting URIs merged and sorted using rank aggregation techniques
       Highly ranked documents from metasearch used for focused crawling around URI
       TF/IDF for low frequency items in found documents
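The merge step on the Crawler Internals slide can be illustrated with a small rank aggregation sketch. The slides do not name the aggregation technique used, so Borda count is an assumption chosen here for its simplicity; the function name and the example URIs are hypothetical.

```python
from collections import defaultdict


def borda_merge(rankings):
    """Merge several ranked URI lists into one list using Borda count.

    Assumption: Borda count stands in for the unspecified
    'rank aggregation techniques' mentioned in the slides.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        n = len(ranking)
        for position, uri in enumerate(ranking):
            # The top item of a list of length n earns n points, the last earns 1.
            scores[uri] += n - position
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores, key=lambda uri: (-scores[uri], uri))


# Hypothetical result lists from three search engines for the same query.
engine_a = ["http://a.example/dinka", "http://b.example/lexicon", "http://c.example/news"]
engine_b = ["http://b.example/lexicon", "http://a.example/dinka"]
engine_c = ["http://b.example/lexicon", "http://d.example/bible", "http://a.example/dinka"]

merged = borda_merge([engine_a, engine_b, engine_c])
print(merged)
```

URIs returned near the top by several engines rise in the merged list, which is the property a focused crawler needs when choosing which documents to expand around.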
  9. Crawler Status
       Running intermittently since July 2004 on high bandwidth research infrastructure
       >1.6 million web resources have been identified in over 3000 languages
       Some exposed via standard OLAC search
       Majority exposed to standard search engines via DP9 gateway
       Full circle exploitation of web search
       Evaluation of precision improvement is ongoing
       More details in the paper (or Hughes 2005 paper)
  10. Metadata Descriptions
       Describing resources separately from their realization is required since the web based language-centric resources are not held locally
       Metadata creation is an effort intensive process
       Automatic description generation is well studied in the general digital libraries community (e.g. Paynter 2005)
       Some metadata elements are well supported by existing automatic metadata creation tools
       We focus particularly on language vs subject language metadata creation since it is of primary importance
  11. Metadata Descriptions Status
       We use a combination of machine learning approaches to compare and classify a given resource against human curated gold standard data for known languages
       Primary data points: encoding, word n-grams, character n-grams
       Secondary data points: geographical referent colocation, lexical item occurrence, URI
       Currently described around 40% of the >1.6 million URIs found by crawler at probability of 0.8 or higher as threshold for acceptable language identification
       Computationally bound at present, but re-engineering
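The thresholded language identification described on the Metadata Descriptions Status slide can be sketched with character n-grams alone. This is a toy illustration, not the authors' classifier: it assumes character trigrams, add-one smoothing, a uniform prior over languages, and two tiny hand-written training samples. Only the 0.8 acceptance threshold comes from the slides.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Return the character n-grams of a lowercased, space-padded text."""
    padded = " " + text.lower() + " "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def train(samples, n=3):
    """Build per-language n-gram counts from {language: sample text}."""
    return {lang: Counter(char_ngrams(text, n)) for lang, text in samples.items()}


def classify(text, models, n=3, threshold=0.8):
    """Return (language, probability); language is None below the threshold."""
    grams = char_ngrams(text, n)
    vocab_size = len(set().union(*models.values())) + 1  # +1 slot for unseen n-grams
    log_scores = {}
    for lang, counts in models.items():
        total = sum(counts.values())
        # Add-one smoothed log-likelihood of the text under this language.
        log_scores[lang] = sum(
            math.log((counts[g] + 1) / (total + vocab_size)) for g in grams
        )
    # Convert log-likelihoods to posteriors under a uniform prior.
    top = max(log_scores.values())
    weights = {lang: math.exp(score - top) for lang, score in log_scores.items()}
    best = max(weights, key=weights.get)
    probability = weights[best] / sum(weights.values())
    return (best, probability) if probability >= threshold else (None, probability)


# Toy training samples; a real system would use curated gold standard data.
models = train({
    "english": "the quick brown fox jumps over the lazy dog and the people speak the language",
    "indonesian": "bahasa indonesia adalah bahasa resmi yang digunakan oleh orang di seluruh negara",
})

print(classify("the people and the language", models))
print(classify("", models))  # no evidence, so nothing clears the threshold
```

Returning `None` below the threshold mirrors the idea that a resource is only described when the identification is confident enough; everything else is left undescribed rather than mislabelled.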
  12. Search Facilities
       Currently search delivered via OLAC Search Engine (http://www.language-archives.org/tools/search/)
       Features
             Web search style interface, UTF-8 support, no restrictions on string, operators, inline syntax
             Fuzzy string matching for geographical entities and language names
             ‘Click minimization’ strategy for empty search: pre-composed derivative queries
             Exploits Ethnologue and Getty ontologies
             Exploits linguistic knowledge (e.g. language families)
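Fuzzy matching of language names, as listed among the search features, can be approximated with Python's standard `difflib` module. This is only a sketch: the slides do not say which matching algorithm the OLAC search engine uses, and the name list, function name and cutoff value below are illustrative assumptions.

```python
import difflib

# A small illustrative name list; a real service would draw on Ethnologue.
LANGUAGE_NAMES = ["Dinka", "Dyula", "Ewe", "Kabiyé", "Tetum", "Vietnamese"]


def suggest_language(query, names=LANGUAGE_NAMES, cutoff=0.6, limit=3):
    """Return known language names that closely match a possibly misspelt query."""
    # Compare case-insensitively, then map back to the canonical spelling.
    lowered = {name.lower(): name for name in names}
    hits = difflib.get_close_matches(query.lower(), list(lowered), n=limit, cutoff=cutoff)
    return [lowered[hit] for hit in hits]


suggestions = suggest_language("Dinca")
print(suggestions)
```

When a query returns nothing, suggestions like these could feed the pre-composed derivative queries of a click minimization strategy.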
  13. Search Facilities
       Localization-oriented interface
             XML core with XSL
             Entirely user preference driven with a default
             Post-query encoding/language change
             Currently auditing code to upgrade interface strings to XLIFF Portable Objects
             Interest in localization into French, Spanish, Bahasa Indonesia, Vietnamese, Thai
       More search architecture detail in Kamat and Hughes (2005)
  14. Language Search: Dinka
  15. Country Search: Togo
  16. Future Work
       Increased frequency of web crawling
       More efficient and reliable language identification
       End user documentation and accessibility
       API documentation for third party data consumers and documentation for service/interface customization
       Map based search GUI; better geographical context-aware search
       Language matching based on linguistic or geographical proximity
       Basic Language Resource Kits (BLARK)
       Integration with MyLanguage
  17. Conclusion
       Language-centric broad coverage web search is a strongly motivated user function
       Major search providers do not focus on precision improvement per se, but their results can be incrementally improved through covert means
       A multilingual web and multilingual web users can be supported effectively, even down to low densities
       Interested in leveraging our existing research and service development in other ways
  18. Acknowledgements
       Research supported by the Australian Research Council under the funding program for Special Research Initiatives (E-Research) Grant SR0567353 “An Intelligent Search Infrastructure for Language Resources on the Web”.
