Introduction to wwwjdic project


An introductory talk for the Hacker News Kansai meetup on the Ruby rewrite of Jim Breen's wwwjdic.

Published in: Technology, Self Improvement

  • My name is Mark Burns. I'm a Ruby developer, I speak Japanese, and I'm on holiday from England.
  • I'm here today to talk about Jim Breen's Japanese dictionary, wwwjdic, and in particular an open source rewrite of this online dictionary. As you may have guessed, it was originally written, and is mostly maintained, by Jim Breen, who is a retired professor (and current PhD student) at Monash University in Melbourne, Australia. It's freely available. I'm not 100% sure about the license, and I'm no internet or international lawyer, but it's a flexible license that allows free and commercial use, with a 'please do the right thing' understanding that you donate some money if it benefits you.
  • The start of the rewrite is available here: [URL]. I'll also show the SlideShare URL at the end of the talk so you can make a note of it and see all the various links. In the past I've spoken to Jim about making improvements to the web interface of the dictionary. I feel it could be better presented and more user-friendly and intuitive.
  • For example, a typical lookup is this kind of interaction: visit wwwjdic.com; get redirected to a long URL with a particular query param for the word-search page; fill in a form; and do a POST request to a URL with a specific query string parameter and a specifically encoded body. The results are currently available as HTML that looks like this:
  • So it's great if you like information and know where to look. You have links to everything you might need to do, and more. And it's this 'and more' that I think is the issue with a lot of information presentation. To be honest, it's not great for beginners, as little thought is given to the hierarchy of importance of information (which I'll come back to). Now, there's nothing wrong with this at all; it's just that it suits its specific audience in particular, and by that I mean technically minded learners of Japanese. I can only guess, but I also imagine it is more commonly known amongst native English speakers than native Japanese speakers.
  • I thought it would be nicer to make it more accessible in general. So my aims in creating this project are: a JSON API; a cleaner UI/UX; autocomplete and other nice UI touches; maintainability.
  • Propose an API where you can GET a simply defined (easy to remember) URL.
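As a rough illustration of the difference between the two request styles, here is a minimal Ruby sketch. It assumes the search word is 卵 (egg), that the existing form body carries the word as percent-encoded EUC-JP bytes (which is consistent with the `dsrchkey=%CD%F1&dicsel=1` body on the slide), and that the proposed API takes the UTF-8 word directly in the path. The helper names `percent_encode_bytes`, `legacy_post_body` and `proposed_api_path` are made up for this example and are not taken from the project.

```ruby
# Percent-encode every byte of a string, e.g. "\xCD\xF1" -> "%CD%F1".
def percent_encode_bytes(string)
  string.bytes.map { |byte| format("%%%02X", byte) }.join
end

# Hypothetical helper: a form body in the style of the current interaction.
# Assumption: the search key is sent as EUC-JP bytes.
def legacy_post_body(word)
  "dsrchkey=#{percent_encode_bytes(word.encode('EUC-JP'))}&dicsel=1"
end

# Hypothetical helper: the path for the proposed JSON API,
# with the UTF-8 word percent-encoded for use in a URL.
def proposed_api_path(word)
  "/#{percent_encode_bytes(word.encode('UTF-8'))}.json"
end

puts legacy_post_body("卵")
puts proposed_api_path("卵")
```

A browser or HTTP client would normally display the second path as `/卵.json`; the percent-encoded form is only what travels over the wire.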
  • And some nicer design for the HTML output. Now, I'm not a front-end designer by any means, but I can appreciate the philosophy of clean design.
  • A first attempt was made using the Rails flavour of the ActiveRecord pattern against an SQL backend. (It's easy to get up and running, but it squeezes the concepts of domain model and persistence together.) But a dictionary is much more read-heavy than write-heavy, and the model of languages doesn't fit as well in a relational database. The existing data is a few flat text files, so I wanted a decent compromise for maintainability, and it would be nice not to completely throw away all the performance of the existing system's custom C code reading from flat text files.
  • Autocomplete was done with a trie index. The whole code and concept was pretty much taken from a blog post by antirez (the author of Redis). It scales quite nicely, as the number of entries is of the order of 150,000. Time: O(log(N)). Space: N*(Ma+1), where Ma is the average length of a word (5.6), which comes to =~ 51MB.
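To make the space formula concrete, here is a small Ruby sketch of how such an index could be built. It is an assumption about the general shape of the technique, not the project's actual code: every prefix of each word is stored, plus the whole word with a trailing `*` to mark a complete entry, so a word of length L contributes at most L+1 entries. A sorted Ruby array stands in for whatever ordered store is really used, and `build_index` and `estimated_entries` are hypothetical names.

```ruby
# Build the sorted list of index entries for a set of words: every prefix of
# each word, plus the whole word with a trailing "*" marking a complete entry.
def build_index(words)
  entries = words.flat_map do |word|
    prefixes = (1..word.length).map { |length| word[0, length] }
    prefixes << "#{word}*"
  end
  entries.uniq.sort
end

# Rough number of index entries from the formula N * (Ma + 1).
def estimated_entries(word_count, average_length)
  (word_count * (average_length + 1)).round
end

index = build_index(%w[egalitarian egg])
puts index.inspect
puts estimated_entries(150_000, 5.6)
```

With N of about 150,000 and Ma of 5.6, the formula gives roughly 990,000 entries, so the quoted 51MB would work out at something like 50 bytes per entry.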
  • OK, some details: not too specific, but hopefully detailed enough to keep everyone happy. This is the result of doing a lookup on an index generated for autocompletion. E.g. the user searched for ‘egg’, and the list shows all the following matches in the autocomplete list.
  • Here’s the lookup
  • After entering ‘eg’, this is the value of `matches`. We iterate over each entry, and if the entry doesn’t match, we break out; otherwise we append the match to our list of matches.
  • Here we have an example where the user has entered “walr” and the break clause is hit, as the value “walt” does not match “walr”.
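The lookup described in the last few notes can be sketched in Ruby as follows. This is a minimal stand-in under stated assumptions rather than the project's code: the index entries are copied from the "walr" slide, a sorted array with a binary search replaces the real ordered store (which is where the O(log(N)) step comes from), and the method name `complete` and its `limit` parameter are invented for the example.

```ruby
# Sorted index entries copied from the slide; a trailing "*" marks a whole word.
INDEX = ["walr", "walru", "walrus", "walrus*", "walruse", "walruses",
         "walruses*", "walt", "waltz", "waltz ", "waltz (", "waltz (c"].freeze

# Return up to `limit` complete words that start with `prefix`.
def complete(index, prefix, limit = 10)
  # Binary search for the first entry that sorts at or after the prefix.
  start = index.bsearch_index { |entry| entry >= prefix }
  return [] if start.nil?

  matches = []
  index.drop(start).each do |entry|
    # The break clause: stop at the first entry that no longer matches.
    break unless entry.start_with?(prefix)

    matches << entry.chomp("*") if entry.end_with?("*")
    break if matches.size >= limit
  end
  matches
end

puts complete(INDEX, "walr").inspect
```

For the prefix "walr" the loop collects "walrus" and "walruses" and then stops at "walt", which is the break case shown on the slide.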
  • In my work for Shutl, a UK startup aimed at solving the online delivery problem, we use graph databases to help us match up carrier/vehicle availability and pricing with customer requirements and retail store opening hours. I think it could be interesting to start structuring the data in a graph format. Words can at least be linked to the entries listed in their definitions. There can be a more semantically rich level of relationships represented, though.
  • I think that mapping words to a graph is a more natural way to express the relationship between two languages. Firstly, you don't always have isomorphic (one-to-one) relationships between any two words in either language. すごい can mean either great or terrible in English. It can mean something like wonderful or fantastic, as well as dreadful. I often struggle with words that are their own antonyms. This was particularly relevant to me because on the day of the large Tohoku earthquake I was on a shinkansen heading into Tokyo. After being on the train for six hours, I needed to get a beer and find some people to chat to, to find out what had happened. I'd understood that there had been an earthquake, but it was my first experience of one and I hadn't yet grasped its magnitude, in both the literal and metaphorical senses of the term. So I found a guy who wanted to practice his English, and he explained to me that "This is a great day for Japan. Very great." Understanding something along the lines of wonderful or fantastic, I had to ask him, "Why? Is it a national holiday? Maybe the emperor's birthday?" Of course, it occurred to me when I translated his sentence into Japanese in my head, choosing すごい for great, that he must have meant the terrible or dreadful sense of the word. So clearly there is a need for a richer, more expressive data model that can capture these nuances and senses, and not just provide a one-to-one lookup service.
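As a very small sketch of what a graph-shaped model might look like, the Ruby below stores words as nodes with undirected links between them. Everything here is illustrative: `WordGraph`, `link` and `neighbours` are hypothetical names, the edges are hand-picked from the すごい example, and a real graph database would carry typed, richer relationships than a plain adjacency set.

```ruby
require "set"

# Undirected graph of words, stored as adjacency sets.
class WordGraph
  def initialize
    @edges = Hash.new { |hash, key| hash[key] = Set.new }
  end

  # Link two words in both directions; returns self so calls can be chained.
  def link(first, second)
    @edges[first] << second
    @edges[second] << first
    self
  end

  # Words directly linked to `word`, in insertion order (empty if unknown).
  def neighbours(word)
    @edges.fetch(word, Set.new).to_a
  end
end

graph = WordGraph.new
graph.link("すごい", "great").link("すごい", "terrible")
graph.link("great", "wonderful").link("terrible", "dreadful")

puts graph.neighbours("すごい").inspect
```

A single Japanese node fanning out to both "great" and "terrible" is exactly the non-one-to-one relationship that a flat lookup table struggles to express.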
  • Due to Jim's relationship with Monash University, he has access to Google's data-set of Japanese n-grams. An n-gram example (a 5-gram sample): 安心リフォームへの近道 is stored as 安心 リフォーム へ の 近道 [TAB]29, i.e. 安心 + リフォーム + へ + の + 近道. Other entries: 安心 [TAB]41322178 and 安 心 [TAB]3274. So this sequence of words occurred 29 times during the data collection. By utilising this data we can look at making search have more relevance. One of the problems with the existing flat file structure is that there is no meta-data helping with understanding how recent or relevant a particular result is. Some of the terms may be legal or scientific terms, or pre-1945. The data can be useful for spotting common collocations too.
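To show how counts like these could feed into relevance, here is a short Ruby sketch that parses lines in the format shown above. It assumes each line is space-separated tokens, then a tab, then an integer count; `parse_ngram` is a hypothetical helper and the three sample lines are the ones quoted in the notes.

```ruby
# Parse one line of the assumed "tokens<TAB>count" format.
def parse_ngram(line)
  tokens, count = line.chomp.split("\t", 2)
  { tokens: tokens.split(" "), count: Integer(count) }
end

lines = [
  "安心 リフォーム へ の 近道\t29",
  "安心\t41322178",
  "安 心\t3274"
]

ngrams = lines.map { |line| parse_ngram(line) }

# Sort by descending frequency, e.g. to rank candidate search results.
ranked = ngrams.sort_by { |ngram| -ngram[:count] }
ranked.each { |ngram| puts "#{ngram[:tokens].join(' + ')}: #{ngram[:count]}" }
```

Ordering results by count is only one possible use; the same parsed structure could also be scanned for frequent neighbouring tokens when looking for collocations.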

    1. About me: マーク・バーンズ (Mark Burns). 日本語ができる (I can speak Japanese). Ruby developer. On holiday from England. I love Ruby and startups.
    2. Introduction: Jim Breen’s (Monash University) Japanese-English online dictionary. Data freely available. Accepts user contributions.
    3. wwwjdic (rewrite)
    4. Current interaction: GET 301 -> POST BODY: dsrchkey=%CD%F1&dicsel=1
    5. Response
    6. Aims: JSON API. Cleaner UI. Nice features, e.g. autocomplete. Easily extensible open source codebase.
    7. JSON API: GET http://localhost:4000/卵.json
    8. Simpler UI (Example): GET http://localhost:4000/卵
    9. Autocomplete
    10. Trie index: Autocomplete
    11. Trie index: Time: O(log(N)), N =~ 150,000. Space: N*(Ma+1) =~ 51MB
    12. TRIE
    13.
    14. ["eg", "ega", "egal", "egali", "egalit", "egalita", "egalitar", "egalitari", "egalitaria", "egalitarian", "egalitarian*", "egg", "egg ", "egg (", "egg (e"]
    15. ["egg dish", "egg dishe", "egg dishes", "egg dishes*", "egg l", "egg la", "egg lai", "egg laid", "egg laid ", "egg laid i", "egg laid in", "egg laid in ", "egg laid in w", "egg laid in wi", "egg laid in win"]
    16. ["egg laid in wint", "egg laid in winte", "egg laid in winter", "egg laid in winter*", "egg m", "egg me", "egg mem", "egg memb", "egg membr", "egg membra", "egg membran", "egg membrane", "egg membrane*", "egg s", "egg sa"]
    17. "walr" "walt" "walrus" ["walr", "walru", "walrus", "walrus*", "walruse", "walruses", "walruses*", "walt", "waltz", "waltz ", "waltz (", "waltz (c", "waltz (co", "waltz (com", "waltz (comp"]
    18. & graphs
    19. Isomorphism?
    20. N-grams: 安心 リフォーム へ の 近道 [TAB]29 (Anshin reform he no chikamichi). 安心 + リフォーム + へ + の + 近道. 安心 [TAB]41,322,178
    21. Present/State of play: Data import to Redis. Indexed word lookup. Autocomplete. Begun work on text glossing.
    22. Noticeably missing: Not yet released to production. No test/staging server. However, should be easy enough to run locally.
    23. Future: Wordnet plus graph db => mapping of languages. Analysis of kanji. User experience/Design/Polish. N-grams. Other ideas/collaboration?
    24. Questions?