Introduction to wwwjdic project

An introductory talk for the Hacker News Kansai meetup on the Ruby rewrite of Jim Breen's wwwjdic

Usage Rights

CC Attribution-ShareAlike License

  • My name is Mark Burns. I'm a Ruby developer, I speak Japanese, and I'm on holiday from England.
  • I'm here to talk today about Jim Breen's Japanese dictionary, wwwjdic, and in particular an open source rewrite of this online dictionary. As you may have guessed, it was originally written, and is still mostly maintained, by Jim Breen, who is a retired professor (and current PhD student) at Monash University in Melbourne, Australia. It's freely available. I'm not 100% sure about the license, as I'm no internet or international lawyer, but it's a flexible license that allows free and commercial use, with a 'please do the right thing and donate some money if it benefits you' kind of deal.
  • So the start of the rewrite is available here: [URL]. I'll also show the SlideShare URL at the end of the talk so you can make a note of it and see all the various links. In the past I've spoken to Jim about making improvements to the web interface of the dictionary. I feel it could be better presented and more user-friendly and intuitive.
  • For example, a typical lookup is this kind of interaction: visit wwwjdic.com; get redirected to a long URL with a particular query param for the word-search page; fill in a form, which does a POST request to a URL with a specific query string parameter and a specifically encoded body. The results are currently returned as HTML (shown on the Response slide).
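
To make the "specifically encoded body" a little more concrete: the body on the Current interaction slide is dsrchkey=%CD%F1&dicsel=1, and %CD%F1 is what you get when 卵 (egg) is converted to EUC-JP and percent-escaped. The talk never names the encoding, so EUC-JP is an inference here, and the helper names are made up for this sketch.

```ruby
# Percent-escape every byte of the word after converting it to EUC-JP.
# Assumption: the legacy form expects EUC-JP; the talk does not say so.
def legacy_search_key(word)
  word.encode("EUC-JP").bytes.map { |byte| format("%%%02X", byte) }.join
end

# Rebuild the POST body shown on the slide (dicsel=1 is copied from it as-is).
def legacy_body(word)
  "dsrchkey=#{legacy_search_key(word)}&dicsel=1"
end

puts legacy_body("卵")
```

Having to know this before you can even ask for a word is part of what a simpler API would hide.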
  • So it's great if you like information and know where to look. You have links to everything you might need to do, and more. And it's this 'and more' that I think is the issue with a lot of information presentation. To be honest, it's not great for beginners, as there is little thought given to the hierarchy of importance of information (which I'll come back to). Now, there's nothing wrong with this at all; it's just that it suits its specific audience, and by that I mean technically minded learners of Japanese. I can only guess, but I imagine it is also better known amongst native English speakers than native Japanese speakers.
  • I thought it would be nicer to make it more accessible in general. So my aims in creating this project are: a JSON API; a cleaner UI/UX; autocomplete and other nice UI touches; maintainability.
  • Propose an API where you can GET a simply defined (easy to remember) URL: GET http://wwwjdic.com/egg.json
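
A minimal sketch of what the body behind such a URL might look like. The talk only proposes the URL, so the field names ("query", "results", "kanji", "reading") and the in-memory ENTRIES hash are purely hypothetical placeholders, not the project's actual format.

```ruby
require "json"

# Hypothetical stand-in for the dictionary data.
ENTRIES = {
  "egg" => [{ "kanji" => "卵", "reading" => "たまご" }]
}.freeze

# Build the JSON body for GET /<word>.json; unknown words give an empty list.
def lookup_json(word, entries = ENTRIES)
  JSON.generate("query" => word, "results" => entries.fetch(word, []))
end

puts lookup_json("egg")
```

The point is only that the caller needs the word and nothing else: no form, no query-string flags, no special encoding.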
  • And some nicer design for the HTML output. Now, I'm not a front-end designer by any means, but I can appreciate the philosophy of clean design.
  • A first attempt was made using the Rails flavour of the ActiveRecord pattern against an SQL backend. (It's easy to get up and running, but it squeezes the concepts of domain model and persistence together.) But a dictionary is much more read-heavy than write-heavy, and the model of languages doesn't fit as well in a relational database. The existing data is a few flat text files, so I wanted a decent compromise for maintainability, and it would be nice not to completely throw away all the performance of the existing system's custom C code reading from flat text files.
  • Autocomplete was done with a trie index. The whole code and concept was pretty much taken from the blog post by antirez (the author of redis): http://oldblog.antirez.com/post/autocomplete-with-redis.html It scales quite nicely, as there are on the order of 150,000 entries. Time: O(log(N)). Space: N*(Ma+1), where Ma is the average length of a word (5.6), which comes to about 51MB.
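
As I read the space formula, each word of length m contributes its m prefixes plus one copy of the whole word ending in "*", which is where N*(Ma+1) comes from. A small sketch with a plain sorted array standing in for the redis sorted set (the method names and the four sample words are made up for the example):

```ruby
# All prefixes of a word, plus the word itself with a "*" terminator.
def index_entries(word)
  (1..word.length).map { |len| word[0, len] } + ["#{word}*"]
end

# Sorted, de-duplicated entries: a stand-in for the redis sorted set.
def build_index(words)
  words.flat_map { |word| index_entries(word) }.uniq.sort
end

words = %w[egg eggs walrus waltz]
upper_bound = words.sum { |word| word.length + 1 } # N * (Ma + 1)
index = build_index(words)

puts "N*(Ma+1) = #{upper_bound}"
puts "distinct entries = #{index.size}"
puts index.first(5).inspect
```

Shared prefixes are stored only once, so the formula is an upper bound on the number of entries rather than an exact count.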
  • OK, some details. Not too specific, but hopefully detailed enough to keep everyone happy. This is the result of doing a lookup on an index generated for autocompletion. E.g. the user searched for ‘egg’, and the list shows all the following matches in the autocomplete list.
  • Here’s the lookup
  • After entering ‘eg’ this is the value of `matches`. We iterate over each entry, and if the entry doesn’t start with what the user typed, we break out; otherwise we append it to our list of matches.
  • Here we have an example where the user has entered “walr” and the break clause is hit, as the value “walt” does not match “walr”
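
The lookup described above can be sketched in plain Ruby. This is not the project's code: a sorted array and a binary search stand in for the redis sorted set and its rank lookup, and `build_index` and `complete` are names invented for the example.

```ruby
# Sorted prefix index: every prefix of each word, plus "word*" as a terminator.
def build_index(words)
  words.flat_map { |w| (1..w.length).map { |n| w[0, n] } + ["#{w}*"] }.uniq.sort
end

# Return up to `limit` complete words that start with `prefix`.
def complete(index, prefix, limit: 10)
  # Find where the prefix sits (or would sit) in the sorted index.
  start = index.bsearch_index { |entry| entry >= prefix }
  return [] if start.nil?

  results = []
  index.drop(start).each do |entry|
    # The break clause: "walt" ends a scan for "walr".
    break unless entry.start_with?(prefix)

    # Only entries ending in "*" are whole words; the rest are bare prefixes.
    results << entry.chomp("*") if entry.end_with?("*")
    break if results.size >= limit
  end
  results
end

index = build_index(%w[egalitarian egg walrus walruses waltz])
puts complete(index, "walr").inspect
puts complete(index, "eg").inspect
```

Because the entries are sorted, everything that matches a prefix sits in one contiguous run, so the scan can stop at the first entry that doesn't match.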
  • In my work for shutl, a UK startup aimed at solving the online delivery problem, we use graph databases to help us match up carrier/vehicle availability and pricing with customer requirements and retail store opening hours. I think it could be interesting to start structuring the data in a graph format. Words can at least be linked to the entries listed in their definitions. There can be a semantically richer level of relationships represented, though.
  • I think that mapping words to a graph is a more natural way to express the relationship between two languages. Firstly, you don't always have isomorphic (one-to-one) relationships between any two words in either language. すごい can mean either great or terrible in English. It can mean something like wonderful or fantastic, as well as dreadful. I often struggle with words that are their own antonyms. This was particularly relevant to me on the day of the large Tohoku earthquake, when I was on a shinkansen heading into Tokyo. After being on the train for six hours, I needed to get a beer and find some people to chat to, to find out what had happened. I'd understood that there was an earthquake, but it was my first experience of one and I hadn't yet grasped the magnitude of it, in both the literal and metaphorical senses of the term. So I found a guy who wanted to practice his English, and he explained to me that "This is a great day for Japan". "Very great." Understanding this as something along the lines of wonderful/fantastic, I had to ask him, "Why? Is it a national holiday? Maybe the emperor's birthday?" Of course, it occurred to me when I translated his sentence into Japanese in my head, choosing すごい for great, that he must have meant the terrible/dreadful sense of the word. So clearly there is a need for a richer, more expressive data model that can capture these nuances and senses, and not just provide a one-to-one lookup service.
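
A toy illustration of the one-to-many point, using a hash of sets as a stand-in for a real graph database. The `WordGraph` class is invented for this sketch; the links are the senses of すごい mentioned above.

```ruby
require "set"

# Minimal undirected graph: each word maps to the set of words linked to it.
class WordGraph
  def initialize
    @edges = Hash.new { |hash, word| hash[word] = Set.new }
  end

  # Record a translation link in both directions.
  def link(word_a, word_b)
    @edges[word_a] << word_b
    @edges[word_b] << word_a
  end

  # Words directly linked to `word` (empty if the word is unknown).
  def neighbours(word)
    @edges.fetch(word, Set.new).to_a
  end
end

graph = WordGraph.new
%w[great terrible wonderful dreadful].each { |english| graph.link("すごい", english) }

puts graph.neighbours("すごい").inspect
puts graph.neighbours("great").inspect
```

A flat one-to-one lookup table can't hold four neighbours for one word; a graph holds them naturally, and its edges could later carry labels such as the sense or register.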
  • Due to Jim's relationship with Monash University, he has access to Google's data set of Japanese n-grams. An n-gram entry looks like this (a 5-gram sample): 安心リフォームへの近道 is split into 安心 + リフォーム + へ + の + 近道 and stored as "安心 リフォーム へ の 近道 [TAB]29". So this sequence of words occurred 29 times during the data collection. Other sample entries are "安心 [TAB]41322178" and "安 心 [TAB]3274". By utilising this data we can look at making search more relevant. One of the problems with the existing flat file structure is that there is no metadata to help with understanding how recent or relevant a particular result is. Some of the terms may be legal or scientific, or pre-1945. It can be useful for spotting common collocations too.
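
A small sketch of reading entries in the shape quoted above: space-separated tokens, a TAB, then the count. The method names are made up, and the three lines are the samples from the talk rather than a real data file.

```ruby
# Parse one line into [tokens, count], e.g. [["安心"], 41322178].
def parse_ngram(line)
  tokens, count = line.chomp.split("\t", 2)
  [tokens.split(" "), Integer(count)]
end

# Build a lookup from token sequence to occurrence count.
def ngram_counts(lines)
  lines.map { |line| parse_ngram(line) }.to_h
end

sample = [
  "安心 リフォーム へ の 近道\t29",
  "安心\t41322178",
  "安 心\t3274"
]

counts = ngram_counts(sample)
puts counts[["安心"]]
puts counts[%w[安 心]]
```

With counts like these, search results could be ordered by how often a word or phrase actually occurs instead of by file order.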

Introduction to wwwjdic project: Presentation Transcript

  • 1 About me マーク・バーンズ about.me/mark.burns 日本語ができる Ruby developer On holiday from England I love ruby and startups
  • 2 Introduction Jim Breen’s (Monash University) Japanese-English online dictionary wwwjdic.com Data freely available accepts user-contributions
  • 3 wwwjdic (rewrite) https://github.com/markburns/wwwjdic
  • 4 Current interaction GET http://wwwjdic.com 301 -> http://www.edrdg.org/cgi-bin/wwwjdic/wwjdic?1C POST http://www.csse.monash.edu.au/~jwb/cgi-bin/wwwjdic.cgi?1E BODY: dsrchkey=%CD%F1&dicsel=1
  • 5 Response
  • 6 Aims JSON API Cleaner UI Nice features: e.g. autocomplete Easily extensible open source codebase
  • 7 JSON API GET http://localhost:4000/卵.json
  • 8 Simpler UI (Example) GET http://localhost:4000/卵
  • 9 Autocomplete
  • 10 Trie index http://oldblog.antirez.com/post/autocomplete-with-redis.html Autocomplete
  • 11 Trie index Time: O(log(N)) N=~150,000. Space: N*(Ma+1) =~ 51MB
  • 12 TRIE
  • 13 https://github.com/markburns/wwwjdic/blob/master/app/data_access/auto_
  • 14 https://github.com/markburns/wwwjdic/blob/master/app/data_access/auto_ ["eg", "ega", "egal", "egali", "egalit", "egalita", "egalitar", "egalitari", "egalitaria", "egalitarian", "egalitarian*", "egg", "egg ", "egg (", "egg (e"]
  • 15 https://github.com/markburns/wwwjdic/blob/master/app/data_access/auto_ ["eg", "ega", "egal", "egali", "egalit", "egalita", "egalitar", "egalitari", "egalitaria", "egalitarian", "egalitarian*", "egg", "egg ", "egg (", "egg (e"] ["egg dish", "egg dishe", "egg dishes", "egg dishes*", "egg l", "egg la", "egg lai", "egg laid", "egg laid ", "egg laid i", "egg laid in", "egg laid in ", "egg laid in w", "egg laid in wi", "egg laid in win"]
  • 16 ["egg laid in wint", "egg laid in winte", "egg laid in winter", "egg laid in winter*", "egg m", "egg me", "egg mem", "egg memb", "egg membr", "egg membra", "egg membran", "egg membrane", "egg membrane*", "egg s", "egg sa"] ["eg", "ega", "egal", "egali", "egalit", "egalita", "egalitar", "egalitari", "egalitaria", "egalitarian", "egalitarian*", "egg", "egg ", "egg (", "egg (e"] ["egg dish", "egg dishe", "egg dishes", "egg dishes*", "egg l", "egg la", "egg lai", "egg laid", "egg laid ", "egg laid i", "egg laid in", "egg laid in ", "egg laid in w", "egg laid in wi", "egg laid in win"] https://github.com/markburns/wwwjdic/blob/master/app/data_access/auto_
  • 17 "walr" "walt" "walrus" ["walr", "walru", "walrus", "walrus*", "walruse", "walruses", "walruses*", "walt", "waltz", "waltz ", "waltz (", "waltz (c", "waltz (co", "waltz (com", "waltz (comp"] https://github.com/markburns/wwwjdic/blob/master/app/data_access/auto_
  • 18 shutl.com & graphs
  • 19 Isomorphism?
  • 20 N-grams 安心 リフォーム へ の 近道 [TAB]29 (Anshin reform he no chikamichi) 安心 + リフォーム + へ + の + 近道 安心 [TAB]41,322,178
  • 21 Present/State of Play Data import to redis Indexed word lookup Autocomplete Begun work on text glossing
  • 22 Noticeably Missing Not yet released to production No test/staging server However, should be easy enough to run locally
  • 23 Future Wordnet plus graph db => mapping of languages Analysis of kanji User experience/Design/Polish N-grams Other ideas/collaboration?
  • 24 https://github.com/markburns/wwwjdic http://www.slideshare.net/_mark_burns/slides-24568551 about.me/mark.burns Questions?