Introduction to wwwjdic project

Mark Burns

An introductory talk for the Hacker News Kansai meetup on the Ruby rewrite of Jim Breen's wwwjdic

1
About me
マーク・バーンズ
about.me/mark.burns
日本語ができる (Japanese-speaking) Ruby developer
On holiday from England
I love Ruby and startups

2
Introduction
Jim Breen’s (Monash University)
Japanese-English online dictionary
wwwjdic.com
Data freely available
Accepts user contributions

3
wwwjdic
(rewrite)
https://github.com/markburns/wwwjdic

4
Current interaction
GET http://wwwjdic.com
301 -> http://www.edrdg.org/cgi-bin/wwwjdic/wwjdic?1C
POST http://www.csse.monash.edu.au/~jwb/cgi-bin/wwwjdic.cgi?1E
BODY: dsrchkey=%CD%F1&dicsel=1
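
The `%CD%F1` in the body above looks like the search word 卵 (egg, the example word used later in the deck) encoded as EUC-JP and then percent-escaped. Below is a minimal Ruby sketch of that encoding step. It assumes EUC-JP really is what the legacy form expects; the helper name `legacy_search_body` is hypothetical, and no request is sent.

```ruby
# Hypothetical helper: builds the form body the legacy CGI appears to expect.
# Assumption: the search key is sent as EUC-JP bytes, each one percent-escaped.
# (Escaping every byte is valid but verbose for plain ASCII input.)
def legacy_search_body(word, dictionary: 1)
  escaped = word.encode("EUC-JP").bytes.map { |byte| format("%%%02X", byte) }.join
  "dsrchkey=#{escaped}&dicsel=#{dictionary}"
end

puts legacy_search_body("卵")
```

Having to know the encoding, the query string flag and the field names is the kind of friction the rewrite's JSON API aims to remove.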

6
Aims
JSON API
Cleaner UI
Nice features: e.g. autocomplete
Easily extensible open source codebase

7
JSON API
GET http://localhost:4000/卵.json

8
Simpler UI
(Example)
GET http://localhost:4000/卵

10
Trie index
http://oldblog.antirez.com/post/autocomplete-with-redis.html
Autocomplete

11
Trie index
Time: O(log(N)) N=~150,000.
Space: N*(Ma+1)
=~ 51MB

13
https://github.com/markburns/wwwjdic/blob/master/app/data_access/auto_

14
https://github.com/markburns/wwwjdic/blob/master/app/data_access/auto_
["eg", "ega", "egal", "egali", "egalit",
"egalita", "egalitar", "egalitari", "egalitaria",
"egalitarian", "egalitarian*", "egg", "egg ",
"egg (", "egg (e"]

15
https://github.com/markburns/wwwjdic/blob/master/app/data_access/auto_
["egg dish", "egg dishe", "egg dishes",
"egg dishes*", "egg l", "egg la", "egg lai",
"egg laid", "egg laid ", "egg laid i",
"egg laid in", "egg laid in ", "egg laid in w",
"egg laid in wi", "egg laid in win"]

16
["egg laid in wint", "egg laid in winte",
"egg laid in winter", "egg laid in winter*", "egg m",
"egg me", "egg mem", "egg memb",
"egg membr", "egg membra", "egg membran",
"egg membrane", "egg membrane*", "egg s",
"egg sa"]
https://github.com/markburns/wwwjdic/blob/master/app/data_access/auto_

17
"walr""walt"
"walrus"
["walr", "walru", "walrus", "walrus*",
"walruse", "walruses", "walruses*",
"walt", "waltz", "waltz ", "waltz (",
"waltz (c", "waltz (co", "waltz (com",
"waltz (comp"]
https://github.com/markburns/wwwjdic/blob/master/app/data_access/auto_

20
N-grams
安心リフォームへの近道 [TAB]29
(Anshin reform he no chikamichi)
安心 + リフォーム + へ + の + 近道
安心 [TAB]41,322,178

21
Present/State of Play
Data import to redis
Indexed word lookup
Autocomplete
Begun work on text glossing

22
Noticeably Missing
Not yet released to production
No test/staging server
However, should be easy enough to run locally

23
Future
Wordnet plus graph db => mapping of languages
Analysis of kanji
User experience/Design/Polish
N-grams
Other ideas/collaboration?

24
https://github.com/markburns/wwwjdic
http://www.slideshare.net/_mark_burns/slides-24568551
about.me/mark.burns
Questions?

Introduction to wwwjdic project

5. Response

9. Autocomplete

12. TRIE

18. shutl.com & graphs

19. Isomorphism?

Editor's Notes

My name is Mark Burns. I'm a Ruby developer, I speak Japanese, and I'm on holiday from England.
I'm here today to talk about Jim Breen's Japanese dictionary, wwwjdic, and in particular an open source rewrite of this online dictionary. As you may have guessed, it was originally written, and is mostly maintained, by Jim Breen, who is a retired professor (and current PhD student) at Monash University in Melbourne, Australia. The data is freely available. I'm not 100% sure about the license, as I'm no internet or international lawyer, but it's a flexible license that allows free and commercial use, with a 'please do the right thing and donate some money if it benefits you' kind of deal.
So the start of the rewrite is available here: [URL]. I'll also show the SlideShare URL at the end of the talk so you can make a note of it and see all the various links. In the past I've spoken to Jim about making improvements to the web interface of the dictionary. I feel it could be better presented and more user-friendly and intuitive.
For example, a typical lookup would be this kind of interaction: you visit wwwjdic.com, get redirected to a long URL with a particular query parameter for the word-search page, then fill in a form that does a POST request to a URL with a specific query string parameter and a specifically encoded body. The results are currently returned as HTML that looks like this:
So it's great if you like information and know where to look. You have links to everything you might need to do, and more. It's this 'and more' that I think is the issue with a lot of information presentation. To be honest, it's not great for beginners, as it gives little thought to the hierarchy of importance of information (which I'll come back to). Now, there's nothing wrong with this at all; it's just that it suits its specific audience, by which I mean technically minded learners of Japanese. I can only guess, but I imagine it is also better known among native English speakers than among native Japanese speakers.
I thought it would be nicer to make it more accessible in general, so my aims in creating this project are:
* Provide a JSON API
* A cleaner UI/UX
* Autocomplete and other nice UI touches
* Maintainability
Propose an API where you can GET a simply defined, easy-to-remember URL: GET http://wwwjdic.com/egg.json
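
As a rough illustration of how small the client side of such an API could be, here is a sketch that only builds the lookup URL. The helper name `lookup_url` is hypothetical; the base URL and the `.json` suffix follow the slides, and the shape of the JSON response is not specified in the talk, so it is not shown.

```ruby
require "erb"

# Hypothetical client-side helper: one word in, one easy-to-remember URL out.
# Non-ASCII words are percent-encoded as UTF-8 in the path segment.
def lookup_url(word, base: "http://localhost:4000")
  "#{base}/#{ERB::Util.url_encode(word)}.json"
end

puts lookup_url("egg")
puts lookup_url("卵")
```

Compared with the legacy flow, there is no redirect, no form and no encoding to know in advance.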
And some nicer design for the HTML output. Now, I'm not a front-end designer by any means, but I can appreciate the philosophy of clean design.
A first attempt was made using the Rails flavour of the ActiveRecord pattern against an SQL backend. (It's easy to get up and running, but it squeezes the concepts of domain model and persistence together.) But a dictionary is much more read-heavy than write-heavy, and the model of languages doesn't fit as well in a relational database. The existing data is a few flat text files, so I wanted a decent compromise for maintainability, and it would be nice not to completely throw away the performance of the existing system's custom C code reading from flat text files.
Autocomplete was done with a trie index. The whole code and concept were pretty much taken from a blog post by antirez (the author of Redis): http://oldblog.antirez.com/post/autocomplete-with-redis.html It scales quite nicely, as the number of entries is of the order of 150,000. Time: O(log(N)). Space: N*(Ma+1), where Ma is the average length of a word (5.6), which comes to about 51MB.
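
To see where the N*(Ma+1) figure comes from, here is a small sketch of the indexing step in the style of the antirez post: every prefix of a word is stored, plus the full word with a trailing `*` marker. The method names are hypothetical, and the numbers plugged in are the ones quoted in these notes.

```ruby
TERMINATOR = "*"

# A word of length L yields L prefixes plus one terminated entry: L + 1 in total.
def index_entries(word)
  prefixes = (1..word.length).map { |len| word[0, len] }
  prefixes << word + TERMINATOR
end

# N words of average length Ma therefore need roughly N * (Ma + 1) entries.
def estimated_entries(word_count, average_length)
  (word_count * (average_length + 1)).round
end

p index_entries("egg")
puts estimated_entries(150_000, 5.6)
```

With the figures above that is just under a million index entries; dividing the quoted 51MB by that count suggests something on the order of 50 bytes per entry, which is an inference from the notes rather than a measurement.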
OK, some details. Not too specific, but hopefully detailed enough to keep everyone happy. This is the result of doing a lookup on an index generated for autocompletion. E.g. the user searched for ‘egg’, and the list shows all the following matches in the autocomplete list.
Here’s the lookup.
After entering ‘eg’, this is the value of `matches`. We iterate over each match: if an entry doesn’t match the prefix, we break out; otherwise we append it to our list of matches.
Here we have an example where the user has entered “walr” and the break clause is hit, as the value “walt” does not match “walr”.
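
The lookup described above can be sketched without Redis. This is not the project's code (the link on the slides is truncated), just an in-memory stand-in under stated assumptions: a sorted Ruby array plays the role of the Redis sorted set, `bsearch_index` plays the role of ZRANK, and slicing in batches of 15 mirrors the 15-entry chunks shown on the slides. The class name `AutocompleteIndex` is hypothetical.

```ruby
# In-memory stand-in for the sorted-set index; not the project's implementation.
class AutocompleteIndex
  TERMINATOR = "*"
  BATCH_SIZE = 15 # assumption: matches the chunk size visible on the slides

  def initialize(words)
    entries = words.flat_map do |word|
      prefixes = (1..word.length).map { |len| word[0, len] }
      prefixes << word + TERMINATOR
    end
    @entries = entries.uniq.sort
  end

  # Returns complete words starting with +prefix+, in index order.
  def complete(prefix, limit: 10)
    results = []
    start = @entries.bsearch_index { |entry| entry >= prefix }
    return results if start.nil?

    @entries[start..-1].each_slice(BATCH_SIZE) do |batch|
      batch.each do |entry|
        # The break clause: the first entry that no longer shares the prefix ends the scan.
        return results unless entry.start_with?(prefix)
        results << entry.chomp(TERMINATOR) if entry.end_with?(TERMINATOR)
        return results if results.size >= limit
      end
    end
    results
  end
end

index = AutocompleteIndex.new(["egg", "egalitarian", "egg dishes", "walrus", "walruses", "waltz"])
p index.complete("walr")
p index.complete("eg")
```

For "walr" the scan stops as soon as it reaches "walt", which is the case shown on the walrus slide. Only entries ending in `*` are returned, since the bare prefixes exist purely to give the scan a starting point.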
In my work for shutl, a UK startup aimed at solving the online delivery problem, we use graph databases to help us match up carrier/vehicle availability and pricing with customer requirements and retail store opening hours. I think it could be interesting to start structuring the data in a graph format. Words can at least be linked to the entries listed in their definitions. There can be a more semantically rich level of relationships represented, though.
I think that mapping words to a graph is a more natural way to express the relationship between two languages. Firstly, you don't always have isomorphic (one-to-one) relationships between any two words in either language. すごい can mean either great or terrible in English. It can mean something like wonderful or fantastic, as well as dreadful. I often struggle with words that are their own antonyms. This was particularly relevant to me because, on the day of the large Tohoku earthquake, I was on a shinkansen heading into Tokyo. After being on the train for six hours, I needed to get a beer and find some people to chat to, to find out what had happened. I'd understood that there was an earthquake, but it was my first experience of one and I hadn't yet grasped the magnitude of it in both the literal and metaphorical senses of the term. So I found a guy who wanted to practice his English, and he explained to me that "This is a great day for Japan. Very great." Understanding this as something along the lines of wonderful or fantastic, I had to ask him, "Why? Is it a national holiday? Maybe the emperor's birthday?" Of course, it occurred to me when I translated his sentence into Japanese in my head, choosing すごい for great, that he must have meant the terrible/dreadful sense of the word. So clearly there is a need for a richer, more expressive data model that can capture these nuances and senses, and not just provide a one-to-one lookup service.
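
One way to picture such a model, as a toy sketch only: words are nodes, and each translation edge carries the sense it applies to. The class `WordGraph`, its methods and the sense labels are all hypothetical; a real version would presumably live in a graph database rather than a Ruby hash.

```ruby
# Toy in-memory graph: nodes are words, edges are translations tagged with a sense.
class WordGraph
  Edge = Struct.new(:target, :sense)

  def initialize
    @edges = Hash.new { |hash, word| hash[word] = [] }
  end

  # Links two words in both directions under a given sense.
  def link(word, translation, sense:)
    @edges[word] << Edge.new(translation, sense)
    @edges[translation] << Edge.new(word, sense)
    self
  end

  def neighbours(word)
    @edges.fetch(word, []).map(&:target).uniq
  end

  def senses(word, translation)
    @edges.fetch(word, []).select { |edge| edge.target == translation }.map(&:sense)
  end
end

graph = WordGraph.new
graph.link("すごい", "great", sense: :positive)
     .link("すごい", "terrible", sense: :negative)

p graph.neighbours("すごい")
p graph.senses("すごい", "terrible")
```

A one-to-one lookup table would have to pick one of the two translations; here both survive, each with the sense that separates "a wonderful day" from "a dreadful day".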
Due to Jim's relationship with Monash University, he has access to Google's data set of Japanese n-grams. A sample 5-gram from that data:
安心リフォームへの近道 [TAB]29
安心 + リフォーム + へ + の + 近道
安心 [TAB]41322178
安心 [TAB]3274
So this sequence of words occurred 29 times during the data collection. By utilising this data we can look at making search have more relevance. One of the problems with the existing flat file structure is that there is no metadata to help with understanding how recent or relevant a particular result is. Some of the terms may be legal or scientific terms, or pre-1945. The data can be useful for spotting common collocations too.
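
A sketch of how such lines might be read and used for ordering results. The on-disk layout is an assumption here (tokens separated by spaces, then a tab, then the count), and `parse_ngram` and `most_frequent_first` are hypothetical names; only counts quoted in these notes are used.

```ruby
NGram = Struct.new(:tokens, :frequency)

# Assumed line layout: "token token ...<TAB>count", count possibly with commas.
def parse_ngram(line)
  text, count = line.chomp.split("\t", 2)
  NGram.new(text.split(" "), count.delete(",").to_i)
end

# More frequent n-grams first, as a crude relevance signal.
def most_frequent_first(ngrams)
  ngrams.sort_by { |ngram| -ngram.frequency }
end

lines = ["安心 リフォーム へ の 近道\t29", "安心\t41,322,178"]
most_frequent_first(lines.map { |line| parse_ngram(line) }).each do |ngram|
  puts "#{ngram.tokens.join(' + ')}: #{ngram.frequency}"
end
```

Frequency alone says nothing about register or era, so it would complement rather than replace any metadata added to the dictionary entries.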
