
Topic Modelling and APIs

Topic modelling using LDA, with the APIs exposed over REST as resources and mills.

The particular use case is the Farsi language.

Published in: Technology

  1. Topic Modelling in Farsi and APIs. Ali Kheyrollahi (@aliostad), Barcelona 2015
  2. Machine Learning and APIs
  3. Topic Modelling
  4. Topic Modelling
  5. Topic Modelling: Search
  6. Topic Modelling*: Document-to-document similarity
  7. Farsi
  8. Farsi: 23rd most popular spoken language, above Italian and Polish
  9. Farsi: 14th most popular Internet language, above Korean and Swedish
  10. Acquisition and pre- My Book: کتاب‌من (ketaab-e man)
  11. What did you just say? کتاب‌من: ک ت اب ‌م ن. ک: کم، یکی، تک، باک
  12. How weird can it get? Unicode codez
  13. And there’s more: کتاب‌من vs. کتاب من. Zero-width non-joiner (0x200C). HTML => کتابمن
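The Unicode issues on slides 11 to 13 can be handled with a normalization pass before tokenizing. The following is a minimal sketch, not the author's pipeline: the function name `normalize_farsi`, the Arabic-to-Persian letter mapping, and the choice to treat the zero-width non-joiner as a word boundary are all assumptions made for illustration.

```python
import unicodedata

ZWNJ = "\u200c"  # zero-width non-joiner, as named on slide 13

# Assumed mapping: fold Arabic kaf/yeh onto their Persian counterparts.
ARABIC_TO_PERSIAN = str.maketrans({
    "\u0643": "\u06a9",  # ARABIC LETTER KAF -> ARABIC LETTER KEHEH
    "\u064a": "\u06cc",  # ARABIC LETTER YEH -> ARABIC LETTER FARSI YEH
})


def normalize_farsi(text: str) -> str:
    """Normalize Farsi text so visually identical words compare equal."""
    # NFKC folds contextual presentation forms back to their base letters.
    text = unicodedata.normalize("NFKC", text)
    text = text.translate(ARABIC_TO_PERSIAN)
    # Assumption: a ZWNJ separates two words, so treat it like a space.
    text = text.replace(ZWNJ, " ")
    # Collapse runs of whitespace.
    return " ".join(text.split())


# "ketaab" + ZWNJ + "man" and "ketaab man" normalize to the same string.
with_zwnj = "\u06a9\u062a\u0627\u0628\u200c\u0645\u0646"
with_space = "\u06a9\u062a\u0627\u0628 \u0645\u0646"
print(normalize_farsi(with_zwnj) == normalize_farsi(with_space))
```

Once the ZWNJ has been dropped entirely, as in the HTML case on slide 13, the word boundary cannot be recovered by this kind of normalization, so it would need to run on the raw text before any HTML processing.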
  14. Topic Modelling and LDA
  15. Latent Dirichlet Allocation (LDA)
      ✤ Mainly a “clustering” algorithm
      ✤ Defines topics as latent variables within the documents
      ✤ Implementations are available in most programming languages
      ✤ Python => Gensim and Java => Mallet
  16. Topic Modelling concepts in LDA
      ✤ Document: “Bag of words” vs. “Markov chain”
      ✤ Word: merely an id (“library” => 123, “librarian” => 789)
      ✤ Dictionary: set of all words
      ✤ Corpus: set of all documents
      ✤ Topic: distribution over words (LDA)
  17. Using Latent Dirichlet Allocation
      ✤ Document as a vector of topic weights: {0: 0.01, 12: 0.19, 42: 0.23}
      ✤ Cosine similarity for document similarity
      ✤ Document similarity works really well
      ✤ Not great in some domains [to fix => Hierarchical]
      ✤ Boosting
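Slide 17 represents each document as a sparse vector of topic weights and compares documents by cosine similarity. A minimal sketch of that comparison follows; the function name and the second document's weights are hypothetical, and only the first vector comes from the slide.

```python
import math
from typing import Dict


def cosine_similarity(a: Dict[int, float], b: Dict[int, float]) -> float:
    """Cosine similarity of two sparse {topic_id: weight} vectors."""
    # Topics missing from a vector have an implicit weight of zero.
    dot = sum(weight * b.get(topic, 0.0) for topic, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_a * norm_b)


doc_a = {0: 0.01, 12: 0.19, 42: 0.23}  # weights shown on slide 17
doc_b = {12: 0.30, 42: 0.10, 77: 0.05}  # hypothetical second document
print(round(cosine_similarity(doc_a, doc_b), 3))
```

Keeping the vectors as dictionaries means only topics with non-zero weight are stored, which suits LDA output where most topics carry no weight for a given document.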
  18. Topic Model and APIs
  19. Resources: Dictionary. Create: languages/{lang}/dictionaries/{id}
      POST languages/farsi/dictionaries HTTP/1.1
      Host: example.com

      200 OK
      Location: languages/farsi/dictionaries/123
  20. Resources: Dictionary. Add document words: languages/{lang}/dictionaries/{id}
      PUT languages/farsi/dictionaries/123 HTTP/1.1
      Content-Type: application/json

      { "documents": [ {"fullText": "یک فوق‌تخصص قلب و عروق گفت"}, … ] }
  21. Resources: Corpus. Create: languages/{lang}/corpora/{id}
      POST languages/farsi/corpora HTTP/1.1
      Host: example.com

      200 OK
      Location: languages/farsi/corpora/123
  22. Resources: Corpus. Add documents: languages/{lang}/corpora/{id}
      PUT languages/farsi/corpora/123 HTTP/1.1
      Content-Type: application/json

      { "documents": [ {"fullText": "یک فوق‌تخصص قلب و عروق گفت"}, … ] }
  23. Resources: TopicModel. Create (request): languages/{lang}/topicmodels
      POST languages/farsi/topicmodels?passes=6&alpha=auto HTTP/1.1
      Host: example.com

      { "dictionaryId": 123, "corpusId": 456 }
  24. Resources*: TopicModel. Create (response): languages/{lang}/topicmodels
      202 Accepted
      Location: languages/farsi/topicmodels/789
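The topic model creation request on slide 23 could be assembled on the client roughly as follows. This is a sketch under assumptions: the `http://` scheme, the `Content-Type` header on this request, and the helper name `create_topic_model_request` are not in the slides. The request is only built here, not sent.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

BASE_URL = "http://example.com"  # host from the slides; the scheme is assumed


def create_topic_model_request(lang, dictionary_id, corpus_id,
                               passes=6, alpha="auto"):
    """Build (without sending) the POST that asks for a new topic model."""
    # Training parameters travel in the query string, as on slide 23.
    query = urlencode({"passes": passes, "alpha": alpha})
    # The body points at the dictionary and corpus resources created earlier.
    body = json.dumps({"dictionaryId": dictionary_id, "corpusId": corpus_id})
    return Request(
        f"{BASE_URL}/languages/{lang}/topicmodels?{query}",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},  # assumed header
        method="POST",
    )


req = create_topic_model_request("farsi", dictionary_id=123, corpus_id=456)
print(req.get_method(), req.full_url)
```

Passing the request to `urllib.request.urlopen` would send it. Since slide 24 shows a `202 Accepted` response, training presumably runs asynchronously and the client would follow the `Location` header to reach the model once it is ready. The slides do not show what that resource returns while training is in progress, so polling is left out of the sketch.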
  25. State of current ML APIs
  26. State of current ML APIs: HATEOAS, Hypermedia, REST APIs, CACHE, Markov Chain, Graph Theory, Deep Learning, Bayesian
  27. Server Authority: State
  28. Server Authority: Algorithm
  29. Is this really a resource? Converts
  30. Mills
  31. Mills
      ✤ A single piece of work/specialty (& verb)
      ✤ Encapsulate an “algorithm”
      ✤ Do not own data (though they own their config): raw data in, processed result out
      ✤ All calls are safe and idempotent
  32. Topic Model Mills: classifier. classifier (request): languages/{lang}/topicmodels/{id}/classifier
      POST languages/farsi/topicmodels/789/classifier HTTP/1.1
      Host: example.com

      { "fullText": "یک فوق‌تخصص قلب و عروق گفت", "refinement": "hierarchical" }
  33. Topic Model Mills: classify. classifier (response): languages/{lang}/topicmodels/{id}/classify
      200 OK
      Content-Type: application/json

      { "15": 0.03, "123": 0.2, "390": 0.09, … }
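The classifier mill returns a JSON object mapping topic ids to weights. A client might rank those topics as in the sketch below; the helper name `top_topics` is hypothetical, and the sample body uses only the three values visible on slide 33, with the elided remainder left out.

```python
import json
from typing import List, Tuple


def top_topics(response_body: str, n: int = 3) -> List[Tuple[int, float]]:
    """Return the n highest-weighted (topic_id, weight) pairs."""
    # JSON object keys are always strings, so convert topic ids back to ints.
    weights = {int(topic_id): float(weight)
               for topic_id, weight in json.loads(response_body).items()}
    # Sort by weight, heaviest topic first.
    ranked = sorted(weights.items(), key=lambda item: item[1], reverse=True)
    return ranked[:n]


sample = '{"15": 0.03, "123": 0.2, "390": 0.09}'
print(top_topics(sample, n=2))  # [(123, 0.2), (390, 0.09)]
```

Because a mill owns no data and its calls are safe and idempotent, sending the same text again should yield the same weights, so a client could cache or retry the call without side effects.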
  34. Thank you! @aliostad, aliostad [at] gmail [dot] com
  35. Acknowledgements
      ✤ Windmill picture: https://www.flickr.com/photos/capnkroaker/2473951927/
      ✤ Algorithm picture: https://www.flickr.com/photos/peterrosbjerg/4257452000/
      ✤ Water Tanks picture: https://www.flickr.com/photos/psilver/2280385292/
      ✤ IT Web Jobs: http://www.itjobswatch.co.uk/jobs/uk/machine%20learning.do
      ✤ Question mark picture: https://www.flickr.com/photos/129627585@N07/15684220620/
      ✤ Timber picture: https://www.flickr.com/photos/simonbleasdale/2797031694/
      ✤ Timber: https://commons.wikimedia.org/wiki/Category:Timber#/media/File:Oregon_BLM_Forestry_10_(6871708937).jpg
  36. References
      ✤ Gensim: https://radimrehurek.com/gensim/
      ✤ LDA Paper: http://machinelearning.wustl.edu/mlpapers/paper_files/BleiNJ03.pdf
      ✤ Client-Server Domain Separation: http://byterot.blogspot.com.es/2012/11/client-server-domain-separation-csds-rest.html
      ✤ Mill proposal: Is coming!
