PgREST: Node.js in the Database
OSDC.tw 2013 talk on plv8x, g0v, 3du, and https://moedict.tw/


  1. PgREST: Node.js in the Database (Audrey Tang)
  2. Code only, no stories
  3. PgREST is…
     • JSON document store
     • Running inside PostgreSQL
     • Working with existing relational data
     • Capable of loading Node.js modules
     • Compatible with MongoLab’s REST API
     • = LiveScript + PLV8 + plv8x + OneJS
  4. JSON
     {
       "title": "萌",
       "bopomofo": "ㄇㄥˊ",
       "pinyin": "méng",
       "definitions": [
         { "type": "名", "def": "草木初生的芽。" },
         { "type": "名", "def": "事物發生的開端或徵兆。" },
         { "type": "名", "def": "人民。" }
       ]
     }
  5. PostgreSQL
     CREATE TABLE moe ( "entry" JSON );
     INSERT INTO moe VALUES ($${
       "title": "萌",
       "bopomofo": "ㄇㄥˊ",
       "pinyin": "méng",
       "definitions": [
         { "type": "名", "def": "草木初生的芽。" },
         { "type": "名", "def": "事物發生的開端或徵兆。" },
         { "type": "名", "def": "人民。" }
       ]
     } $$);
     INSERT INTO moe VALUES ('這不是 ㄓㄟ ㄙㄣˇ'); -- type error (the string reads "this is not JSON")
  6. PLV8
     CREATE EXTENSION plv8;
     CREATE FUNCTION get_json_key(obj JSON, key TEXT) RETURNS JSON AS $$
       return JSON.stringify( obj[key] );
     $$ LANGUAGE plv8;
     SELECT get_json_key(entry, 'bopomofo') FROM moe;
     -- "ㄇㄥˊ"
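The body of `get_json_key` on slide 6 is ordinary JavaScript, so its logic can be tried outside the database. The following is a minimal sketch, assuming the `obj` argument arrives as an already-parsed object; the name `getJsonKey` and the sample `entry` are stand-ins for illustration, not part of PLV8.

```javascript
// Plain-JavaScript stand-in for the body of get_json_key(obj JSON, key TEXT).
// Assumption: `obj` is an already-parsed JavaScript object.
function getJsonKey(obj, key) {
  // JSON.stringify turns the selected value back into JSON text.
  return JSON.stringify(obj[key]);
}

// Sample entry copied from the JSON slide (definitions omitted for brevity).
const entry = { title: "萌", bopomofo: "ㄇㄥˊ", pinyin: "méng" };

console.log(getJsonKey(entry, "bopomofo")); // JSON text, including the quotes
```

Because the result is JSON text, a string value comes back wrapped in double quotes, which matches the `-- "ㄇㄥˊ"` comment on the slide.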
  7. plv8x: Operators
     SELECT entry |> 'this.bopomofo' FROM moe;
     -- "ㄇㄥˊ"
     SELECT entry ~> '@bopomofo' FROM moe;
     -- "ㄇㄥˊ"
     SELECT '@bopomofo' <~ entry FROM moe;
     -- "ㄇㄥˊ"
     SELECT ~> 'new Date';
     -- "2013-04-17T12:31:57.523Z"
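One way to read the operators on slide 7 is "evaluate this expression with `this` bound to the JSON value". The sketch below imitates that idea in plain JavaScript. It is not how plv8x is implemented: `evalWith` and `desugarAt` are hypothetical helpers, and the `@name` rewrite only covers the simple LiveScript shorthand for `this.name` shown on the slide.

```javascript
// Hypothetical helper: evaluate a JavaScript expression with `this`
// bound to `context`, in the spirit of `entry |> 'this.bopomofo'`.
function evalWith(context, expression) {
  const fn = new Function("return (" + expression + ");");
  return fn.call(context);
}

// Hypothetical helper: rewrite LiveScript's `@name` shorthand to `this.name`.
// Only this one simple form is handled; real LiveScript needs a compiler.
function desugarAt(expression) {
  return expression.replace(/@([A-Za-z_$][\w$]*)/g, "this.$1");
}

const sample = { title: "萌", bopomofo: "ㄇㄥˊ" };

console.log(evalWith(sample, "this.bopomofo"));
console.log(evalWith(sample, desugarAt("@bopomofo")));
```

Both calls read the same field, which mirrors the slide's point that the `|>` and `~>` forms give the same result for this query.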
  8. plv8x: Command Line
     npm i -g plv8x
     export PLV8XCONN=dbname
     plv8x -r script.ls # .js works too
     plv8x -E 'plv8.execute("SELECT entry FROM moe").0.entry.definitions'
     # [ { type: '名', def: '草木初生的芽。' },
     #   { type: '名', def: '事物發生的開端或徵兆。' },
     #   { type: '名', def: '人民。' } ]
  9. plv8x: Modules
     npm i -g uax11
     plv8x -i uax11
     plv8x -E 'require "uax11" .toFullwidth "méng"'
     # meng
     SELECT entry ~> 'require "uax11" .toFullwidth @pinyin' FROM moe;
     -- "meng"
  10. plv8x: Functions
     plv8x -f 'text fullwidth(text)=uax11:toFullwidth'
     plv8x -f 'text PINYIN(json)=:&0.pinyin.toUpperCase!'
     SELECT fullwidth('Ingy döt Net');
     -- Ingy dot Net
     SELECT fullwidth( PINYIN(entry) ) FROM moe;
     -- MENG
  11. Summary
     • V8: JavaScript engine
     • PLV8: Stored procedures in JavaScript
     • plv8x: Package manager for PLV8
         • Turns NPM modules into SQL functions
         • JSON expressions with ~> and <~
     • Code reuse for browser + server + database!
  12. Cutting out the Middleware
     • Serve JSON API from SQL
     • Shared models & validation code
     • Put Business Logic into DB
     • Perfect fit for Medium Data™
  13. @clkao++
  14. 3du.tw
  15. The Revised MoE Dictionary (1994)
  16. The Good
     • 160,000+ entries
     • Official, high quality sources
     • Rich etymology and historical usage
     • Full text search with regular expressions
     • Still frequently updated!
  17. The Bad
     • Results are not bookmarkable
     • Requires N clicks to get to a definition
     • Rare characters become low-res bitmaps
     • Difficult to use on mobile devices
     • “Optimized for IE 5.0 and Netscape 4.7+”!?
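  18. The Sad
     “The Committee warmly welcomes links to the Mandarin Dictionary (國語辭典). At present, however, we only permit hyperlinks to the dictionary's home page; we have not licensed any other form of use. If you have further questions or suggestions, please write to us.”
     (National Languages Committee, Ministry of Education, “On Licensing” 〈有關授權〉)
  19. …and the Very Crazy
     • A web page that requires no login automatically logs you out!
  20. Yeh’s Ping, 2013.1.26.
     “So I am answering the call of the g0v.tw event by building 3du.tw: releasing Traditional Chinese data such as characters, words, idioms, definitions, and example sentences through an open text API, with indexing and search added, so that any person or company that wants to build on it can use it.”
     (Yeh Ping 葉平, “Returning the Written Word to the People” 〈還文於民〉)
  21. Hackpad for 3du.tw
  22. The g0v hackers' mass site-scraping incident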
  23. g0v hackath1n, 2013.1.27.
     • Scrape 2741 idioms as HTML (@TonyQ, @MnO2)
     • Scrape 3000 characters as raw HTML (@au)
     • Design JSON schema from samples (@pingooo)
     • Design SQL schema from samples (@albb0920)
     • Parse HTML into JSON & SQLite (@kcwu)
     • …and for those 24x24 bitmaps…
  24. ←🀝 Big-5 →🀎 UTF-8
  25. Crowd-OCR for 1000+ glyphs
  26. Finished in 24 hours!
     Thanks to: Favonia, Jun-Yuan Yan, Yao Wei, Yaoting Huang, Poka, Caasi Huang, Daniel Liang, Grey Lee, Irvin Chen, Gugod, Schee…
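  27. Rough consensus, running code
  28. Applications
     • XUL Desktop App (@racklin)
     • OS X Dictionary (@yllan)
     • Windows 8 App (@wenpei)
     • iOS Client (@tomjpsun, @jamessa, @pct)
     • iOS Offline App (@zonble)
  29. Integrations
     • Rails API server (@albb0920)
     • AngularJS Client+Server (@viirya)
     • Chrome Extension (@tonytonyjan)
     • Sublime Text plugin (@zonble)
     • WinRT Component (@eriksk)
  30. Fair Use
     “For non-profit educational purposes, under Article 50 of the Copyright Act: ‘Works publicly released in the name of a central or local government agency or a public juristic person may, within a reasonable scope, be reproduced, publicly broadcast, or publicly transmitted.’ The compilation copyright (if any) in the format conversion and rearrangement done here is released by @kcwu under CC0.”
  31. CC0: Public Domain
     “Apart from the data files mentioned above, for all other files in this directory the author, Audrey Tang (唐鳳), waives, to the extent permitted by law, all rights in the work under copyright law, including all related and neighboring rights, and declares the work dedicated to the public domain.”
  32. moedict.tw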
  33. 5 Stars of Open Data
     1. Open License
     2. Structured Data
     3. Non-Proprietary Format
     4. Each Item has a URI
     5. Linking between Items
  34. URI Endpoints
     • https://moedict.tw/#文字
     • 3 APIs (for non-Unicode characters):
         • /raw/文字.json   {[8ff0]}
         • /uni/文字.json   ⿰亻壯
         • /pua/文字.json   U+F8FF0
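The three endpoints on slide 34 differ only in how a character missing from Unicode is written. The sketch below shows one possible reading, inferred from the single example on the slide: the raw form `{[8ff0]}` and the Private-Use code point U+F8FF0 differ by a fixed offset of 0xF0000. That offset rule, the helper names, and the assumption that the paths are relative to https://moedict.tw are all guesses for illustration.

```javascript
// Assumption inferred from the slide's one example:
// "{[8ff0]}" corresponds to U+F8FF0, i.e. 0xF0000 + 0x8ff0.
function rawToPua(text) {
  return text.replace(/\{\[([0-9a-f]{4})\]\}/g, (_, hex) =>
    String.fromCodePoint(0xF0000 + parseInt(hex, 16)));
}

// Hypothetical URL builder; `kind` is "raw", "uni", or "pua".
function apiUrl(kind, word) {
  return "https://moedict.tw/" + kind + "/" + encodeURIComponent(word) + ".json";
}

const pua = rawToPua("{[8ff0]}");
console.log(pua.codePointAt(0).toString(16)); // hexadecimal code point
console.log(apiUrl("uni", "文字"));
```

A client that has a Private-Use web font loaded could request the `pua` form directly, while one without it might prefer the `uni` form with its ideographic description sequence.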
  35. Web Fonts for Private-Use Area
     • Initially based on Hán Nôm font (@YaoWei)
         • Subset everything outside Big5 range
         • Hand-drawn PUA chars like ⿰亻壯
     • Later on, switched to Hanazono Mincho (花園明朝) font
         • 75,619 + 8,236 glyphs
         • From the International Research Institute for Zen Buddhism, Hanazono University (花園大学国際禅学研究所)
  36. Technology always comes from Buddha-nature
  37. Live Demo
  38. Reaching the Fifth Star
     1. Open License
     2. Structured Data
     3. Non-Proprietary Format
     4. Each Item has a URI
     5. Linking between Items
  39. Chinese Segmentation
     • Therearenowhitespacesbetweenwords
     • Lots of heuristic algorithms
     • Naive solution: Longest-token match
         • Requires a large dictionary
         • …wait, we just got one here
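The "longest-token match" idea from slide 39 can be sketched in a few lines: at each position, try the longest dictionary word that fits, and fall back to a single character. This is a minimal illustration, not the moedict.tw code; the function name, the tiny word list, and the `maxLen` parameter are assumptions, and the loop works on UTF-16 units, so it only suits characters in the Basic Multilingual Plane.

```javascript
// Greedy longest-match segmentation over a Set of known words.
function segment(text, dictionary, maxLen) {
  const tokens = [];
  let i = 0;
  while (i < text.length) {
    let matched = text[i]; // fallback: emit a single character
    for (let len = Math.min(maxLen, text.length - i); len > 1; len--) {
      const candidate = text.slice(i, i + len);
      if (dictionary.has(candidate)) {
        matched = candidate; // longest hit wins, stop looking
        break;
      }
    }
    tokens.push(matched);
    i += matched.length;
  }
  return tokens;
}

// Tiny stand-in dictionary; the real one has 160,000+ entries.
const words = new Set(["文字", "符號", "記錄", "語言"]);
console.log(segment("記錄語言的符號", words, 4));
// [ '記錄', '語言', '的', '符號' ]
```

The greedy choice is what makes this "naive": it never backtracks, so a long match early on can hide a better split later in the sentence.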
  40. In-browser Implementation
     {"4":"一(丁不識|不小心|不扭眾|不拗眾|世之雄|世英名|丘一壑|丘之貉|串驪珠|之為甚|之謂甚|乾二淨|了心願|了百了|了百當|事無成|五一十|人之交|介不取|仍舊貫|代宗匠|代宗臣|代巨擘|代楷模|代風流|代鼎臣|以當十|以貫之|來一往|來二去|依舊式|個勁兒|個子兒|個樣兒|倡三歎|倡百和|偏之見|傅眾咻|償宿願|元大武|元復始|兵一卒|刀一割|刀兩斷|刀兩段|分一毫|切從簡|切現成|切眾生|刻千金|力承當|勇之夫|勞久逸|勞永逸|匡天下|去不返|反常態|口價兒|口兩匙|口咬定|口咬死|古腦兒|名半職|吐為快|吹一唱|呼再諾|呼百應|呼百諾|命嗚呼|哄而上|哄而散|哄而起|哄而集|唱一和|唱三歎|唱百和|喫一添|
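The data on slide 40 looks like a word list packed into regular-expression alternations: every word under the key "4" is four characters long and shares its first character with its neighbours. Assuming the key is the word length, the sketch below compiles such an index into sticky regexes and tries the longest bucket first. The `packed` object is a made-up miniature (its "2" bucket does not come from the slide), and `compile` and `matchAt` are hypothetical names.

```javascript
// Made-up miniature index in the slide's shape.
// Assumption: key = word length, value = regex alternation source.
const packed = {
  "4": "一(乾二淨|刀兩斷|勞永逸)",
  "2": "文(字|書)|語言",
};

// One sticky ("y") regex per length, longest words first.
function compile(index) {
  return Object.keys(index)
    .map(Number)
    .sort((a, b) => b - a)
    .map((len) => new RegExp("(?:" + index[len] + ")", "y"));
}

// Return the longest indexed word starting exactly at `position`, or null.
function matchAt(regexes, text, position) {
  for (const re of regexes) {
    re.lastIndex = position; // sticky match must begin here
    const m = re.exec(text);
    if (m) return m[0];
  }
  return null;
}

const regexes = compile(packed);
console.log(matchAt(regexes, "一乾二淨的文字", 0));
console.log(matchAt(regexes, "一乾二淨的文字", 5));
```

Packing words by shared prefix keeps the download small, at the cost of asking the browser's regex engine to handle very large alternations.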
  41. Worked well, but…
     • Freezes IE8, crashes IE7
         • Broken on Android 2.x, too
     • So let’s pre-segment on server
         • Needs a tool to move JS into DB
         • …wait, we just got one here
  42. /a/文字.json
     {"h":[{"b":"ㄨㄣˊ ㄗˋ","d":[{"f":"`人類~`用來~`表示~`觀念~、`記錄~`語言~`的~`符號~。","s":"`筆墨~,`翰墨~"},{"f":"`文書~。","q":["`五代史~`平話~.`梁~`史~.`卷~`上~:「`您~`去~`攻破~`宋~`州~,`為我~`奪取~`張~`節使~`歸~`娘~。`才~`得~,`便~`發文~`字~`來~`報~`我~。」","`警世通言~.`卷~`十~`三~.`三~`現身~`包龍圖~`斷~`冤~:「`有~`甚事~`煩惱~?`想~`是~`縣~`裡~`有~`甚~`文字~`不了~。」"]}],"p":"wén zì"}],"t":"`文~`字~"}
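In the pre-segmented JSON on slide 42, each token appears to be wrapped between a backtick and a tilde. Reading those as "start of word" and "end of word" markers is an inference from the sample, and the helpers below are hypothetical. Under that reading, a client can turn the markers into links (the "linking between items" star) or strip them for plain display.

```javascript
// Assumption: "`word~" marks one segmented, linkable token.
function linkify(marked) {
  return marked.replace(/`([^`~]+)~/g, (_, word) =>
    '<a href="#' + word + '">' + word + "</a>");
}

// Remove the markers and keep the plain sentence.
function stripMarkers(marked) {
  return marked.replace(/[`~]/g, "");
}

const definition = "`人類~`用來~`表示~`觀念~、";
console.log(stripMarkers(definition));
console.log(linkify("`文書~。"));
```

The `#word` link target follows the `https://moedict.tw/#文字` pattern from the URI Endpoints slide; production code would also need to escape the word for HTML.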
  43. Live Demo, part II
  44. Materialized View: 160k .json files (@obra++)
  45. Let’s PhoneGap it!
     • Freezes XCode, crashes Eclipse
     • Solution: Pack into 1024 .txt files
         • Take the first character, mod 1024
         • Related words share the same bucket
     • Great success!
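The bucketing rule on slide 45 ("take the first character, mod 1024") can be sketched as follows. Treating "the first character" as its Unicode code point is an assumption, and the `.txt` naming is a hypothetical choice for illustration.

```javascript
// Bucket number: code point of the first character, modulo 1024.
function bucketOf(word) {
  return word.codePointAt(0) % 1024;
}

// Hypothetical file name for the bucket holding `word`.
function bucketFile(word) {
  return bucketOf(word) + ".txt";
}

console.log(bucketOf("文字"), bucketOf("文書")); // same first character, same bucket
console.log(bucketFile("文字"));
```

Because the bucket depends only on the first character, every word starting with 文 lands in one file, which is what lets related words share a bucket.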
  46. Google Play & App Store
  47. User-Driven Development
     • Wildcard and part-of-word searching (@esor)
     • Two-column layout for tablets (@hlb)
     • Toggle between Pinyin and Bopomofo (@matic)
     • Volume key on Android resizes fonts (@ivan)
     • Top Request: Taiwanese Bân-lâm-gi
  48. Personal Motivation
     • My main caretakers were my grandparents
         • Grandma from Lo̍k-káng, Taiwan
         • Grandpa from Sì-chuān, China
     • Raised bilingually as a pre-schooler
         • But only Mandarin had a writing system
         • Editing her memoir brought back memories
  49. Taiwan Bân-lâm-gi Common Dictionary (MoE, 2011)
  50. Good Parts
     • Unified Romanization system (TL)
     • Standardized Ideographic characters (RHC)
     • Full text search with Mandarin, TL & RHC
     • MP3 pronunciations of all entries
     • Licensed under CC-BY-ND 3.0
  51. Not-so-good Parts
     • Entries are in non-bookmarkable <iframe>s
     • No equivalent Mandarin field for entries
     • Still uses bitmaps for Ext-B+ fonts
     • Easy to scrape but hard to parse
         • …as discovered by @happyman_eric
  52. g0v hackath2n, 2013.3.23.
  53. Crowd-OCR for 154 glyphs, 2013.3.25.
  54. Finished over lunch!
     Thanks to: @happyman, @Irvin, @hit1205, @MissleTW, @YuerLee, @YuanChao, @clkao, @MGDesigner, @gontera…
  55. Database received, 2013.3.27.
     • 詞目總檔.xls 詞目總檔.屬性對照.xls
     • 釋義.xls 釋義.詞性對照.xls
     • 又音.xls 又音.屬性對照.xls
     • 近義詞對應.xls 反義詞對應.xls
     • 詞彙方言差.xls 語音方言差.xls
     • 例句.xls
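  56. …What about that extra request?
     “Hello: the data import looks largely correct so far. However, the ‘Mandarin search’ (華語檢索) on the twblg website can find the Taiwanese dictionary's entry 「離離」 by searching for 「一乾二淨」, and that mapping table does not seem to be in the Excel files?”
  57. Well…
     “Translation between languages cannot always be done word for word. Users without a deep enough understanding may be misled into thinking that word X in language A equals word Y in language B (and such a presentation would be taken by the public as ‘what the Ministry of Education's dictionary says’).”
  58. However…
     “So we keep the Mandarin-equivalent field hidden inside the system. A non-governmental dictionary editor would not carry the same burden. I really cannot provide it from my side, and I very much hope you can find a way to solve this.”
  59. …it’s all good.
     “Understood, and thank you for the reminder and your help. So far, the Mandarin entries extracted from the website within the Big5 range amount to 26,274 mappings. In our application we will note that this part is not covered by the Ministry of Education's CC-BY-ND license.”
  60. Data Cleanup, 2013.3.30.
     • Convert all .xls to .csv with LibreOffice 4
         • 3 stars: Non-Proprietary Format
     • Replace PUA characters with mapped Unicode
         • Add x-造字.csv and x-華語對照表.csv
     • Time to put PgREST to work!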
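The "replace PUA characters with mapped Unicode" step on slide 60 amounts to a table lookup per code point. The sketch below assumes such a table is available as a `Map`; the single entry is borrowed from the Mandarin dictionary example on slide 34 purely for illustration, and `puaMap` and `replacePua` are hypothetical names rather than the project's actual cleanup code.

```javascript
// Hypothetical mapping table: PUA code point -> Unicode replacement.
const puaMap = new Map([[0xF8FF0, "⿰亻壯"]]);

function replacePua(text, map) {
  let out = "";
  for (const ch of text) { // for...of walks the string by code point
    const cp = ch.codePointAt(0);
    out += map.has(cp) ? map.get(cp) : ch;
  }
  return out;
}

const dirty = "a" + String.fromCodePoint(0xF8FF0) + "b";
console.log(replacePua(dirty, puaMap));
```

Iterating by code point matters here because Private-Use characters in plane 15 occupy two UTF-16 units, which an index-based loop would split.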
  61. PgREST: MongoLab API Server
     • GET /collections/table_or_view
         • q=&c=true&f=&fo=true&s=&sk=&l=
           curl $LY/collections/bills?q={"proposal.0":"吳育昇"}
           curl $MOE/collections/entries?q={"部首":"一"}&c=1
     • PUT /collections/table_or_view
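The query string on slide 61 follows MongoLab's REST conventions, where `q` carries a JSON query document and `c` asks for a count. When building such URLs from code, the JSON needs percent-encoding. The helper below is a hypothetical sketch, not part of PgREST; the base URL is the local address used on the Import/Export slide.

```javascript
// Hypothetical helper: build a /collections URL with encoded parameters.
function collectionUrl(base, collection, params) {
  const query = Object.entries(params)
    .map(([key, value]) => {
      // Objects and booleans are sent as JSON text; strings pass through.
      const text = typeof value === "string" ? value : JSON.stringify(value);
      return encodeURIComponent(key) + "=" + encodeURIComponent(text);
    })
    .join("&");
  return base + "/collections/" + encodeURIComponent(collection) +
    (query ? "?" + query : "");
}

const url = collectionUrl("http://127.0.0.1:3000", "entries", {
  q: { "部首": "一" }, // query document, as on the slide
  c: true,             // request a count
});
console.log(url);
```

Encoding the parameters also sidesteps shell quoting problems with `{`, `"`, and `&` when the URL is later passed to a command-line client.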
  62. PgREST: Import/Export
     pgrest dbname
     export MOE=http://127.0.0.1:3000
     curl -i -X PUT -H "Content-Type: text/csv" --data-binary @uni/詞目總檔.csv $MOE/collections/entries
     curl $MOE/collections/entries
     # [{"主編號":"1","屬性":"1","詞目":"一","音讀":"tsi̍t",
     #   "文白俗替":"替","部首":"一","部首序":"001-00-01","方言差":""}]
  63. PgREST: 3du.tw JSON in 48 lines
     https://github.com/g0v/moedict-data-twblg/blob/master/gen.ls
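  64. Live Demo, part III
  65. Lessons Learned
     • Open Data is a beginning, not an end
     • Keep conversations with all participants
         • Turn detractors into collaborators
         • Keep a kind heart
         • Assume the best intentions
  66. Be kind at heart; the benevolent are invincible
  67. Geeks are invincible
  68. When is Transparency Useful?
     “Change happens only when people come together around a common goal; technologists can hardly accomplish it alone. Success can be measured by how many people's lives were improved because of you, not merely by how many people visited the website you built.”
     (Aaron Swartz, «Open Government», rendered here from the Chinese text on the slide)
  69. Launching a site takes a moment; open source lasts a lifetime
  70. Thank you!
  71. Thank you!
     “A new turning point and glittering stars
      now fill the unobstructed sky.
      They are the pictographs of five thousand years,
      they are the watching eyes of those to come.”
     (Bei Dao 北島, “The Answer” 〈回答〉)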