
Data science at DBC in 29 slides


About DBC's data science activities. Talk given at Data Science Day on 14 January 2016 by Christian Boesgaard.

Published in: Technology


  1. 1. Data Science at DBC in 29 slides. Christian Boesgaard, DBC, Team XP
  2. 2. ● A few words about DBC ● A very short story about data science at DBC ● Some examples of what we do
  3. 3. DBC provides solutions to support the goal of the libraries: “to encourage enlightenment, education and cultural activities”. (From “lov om biblioteksvirksomhed”)
  4. 4. National Bibliography: registration of books, music, AV materials, Internet documents, articles and reviews in newspapers and magazines (metadata production, 50+ persons)
  5. 5. DanBib: the union catalogue of the Danish libraries and the infrastructure for interlibrary loans. (metadata + usage)
  6. 6. bibliotek.dk: access to all Danish publications and to the holdings of the Danish libraries. (metadata usage … and user data production)
  7. 7. What Data? ● registration metadata ● some full text docs ● front covers ● loan data ● search data (And much more)
  8. 8. How it began (I have been at DBC 10+ years and have a background in distributed systems, applied cryptography, and philosophy)
  9. 9. Stanford CS221
  10. 10. So... AI is not that magical … And it works. We should really use this!
  11. 11. Automatic metadata assignment for articles. Training set: 136K articles with subject metadata; 22K subject terms (95% used 169 times or fewer)
  12. 12. “København Zoo beskyldes for at udbrede kreationisme” +- creationisme -+ 9 darwinisme ++ 12 evolutionsteori +- formidling -+ 6 intelligent design +- kristendom +- livets oprindelse -+ 6 religion ++ 8 skabelsen +- skilte +- zoologiske haver
  13. 13. “Copenhagen Zoo is accused of advancing creationism” +- creationism -+ 9 darwinism ++ 12 evolution theory +- dissemination -+ 6 intelligent design +- christianity +- origin of life -+ 6 religion ++ 8 creation +- signs +- zoo
  14. 14. Approach: 1. bag-of-words + liblinear 2. bag-of-words + k-nearest neighbors 3. paragraph vectors + k-NN. Works pretty well for assisted indexing and is now an integrated part of the system used for
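The second approach on slide 14 (bag-of-words + k-nearest neighbors) can be illustrated with a minimal sketch. This is not DBC's implementation: the whitespace tokenizer, the cosine similarity, the similarity-weighted voting, the function names and the toy training set are all assumptions made for illustration, and only the Python standard library is used.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Lowercase the text, split on whitespace, and count the tokens."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse token-count vectors."""
    dot = sum(count * b[token] for token, count in a.items() if token in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def suggest_subjects(text, training_set, k=2, top_n=3):
    """Suggest subject terms by similarity-weighted voting among the k nearest articles."""
    query = bag_of_words(text)
    scored = sorted(
        ((cosine(query, bag_of_words(doc)), subjects) for doc, subjects in training_set),
        key=lambda pair: pair[0],
        reverse=True,
    )
    votes = Counter()
    for similarity, subjects in scored[:k]:
        for subject in subjects:
            votes[subject] += similarity
    return [subject for subject, score in votes.most_common(top_n) if score > 0]


# Hypothetical toy training set: (article text, assigned subject terms).
TRAINING_SET = [
    ("zoo signs explain creation and evolution", ["zoo", "evolution theory"]),
    ("church debate about creation and religion", ["religion", "creation"]),
    ("election results and new government", ["politics", "elections"]),
]

suggestions = suggest_subjects("zoo accused of advancing creation story", TRAINING_SET)
print(suggestions)
```

In an assisted-indexing setting, a ranked list like this would be shown to a human indexer as candidates rather than applied automatically.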
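  15. 15. Metadata to Metadata. Sometimes, simple is good: demokrati (democracy) [930]: politiske_forhold (political conditions) 897, politik (politics) 341, historie (history) 243, islam 234, valg (elections) 155, ytringsfrihed (freedom of speech) 129, menneskerettigheder (human rights) 117, oprør (uprising) 94, udenrigspolitik (foreign policy) 93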
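Slide 15 reads like a co-occurrence list: for one subject term, the other terms that most often appear on the same records. That reading is an assumption, since the slide does not say how the numbers were produced. Under that assumption, a minimal sketch with hypothetical records looks like this:

```python
from collections import Counter, defaultdict
from itertools import permutations


def cooccurrence_counts(records):
    """Count, for every subject term, how often each other term shares a record with it."""
    counts = defaultdict(Counter)
    for subjects in records:
        # set() guards against a term being listed twice on the same record.
        for term, other in permutations(set(subjects), 2):
            counts[term][other] += 1
    return counts


# Hypothetical records; each is the list of subject terms assigned to one article.
RECORDS = [
    ["demokrati", "politik", "valg"],
    ["demokrati", "politik", "ytringsfrihed"],
    ["demokrati", "historie"],
    ["politik", "valg"],
]

related = cooccurrence_counts(RECORDS)
print(related["demokrati"].most_common(3))
```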
  16. 16. XP
  17. 17. Recommendations: ● content-based (metadata) ● collaborative (item-item, loans). Foucaults Pendul - Umberto Eco; Dronning Loanas mystiske flamme - Umberto Eco; Baudolino - Umberto Eco; Rosens navn - Umberto Eco; Kirkegården i Prag - Umberto Eco; Judasbrevet - Eric Frattini; Skaberens kort - Emilio Calderón
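The collaborative (item-item, loans) variant on slide 17 can be sketched as follows. The slide does not name a similarity measure, so cosine similarity over the sets of borrowers is an assumption here, as are the function names and the loan data (the titles are borrowed from the slide, the users and loans are made up).

```python
import math
from collections import defaultdict


def item_similarities(loans):
    """Cosine similarity between items, based on the sets of users who borrowed them."""
    borrowers = defaultdict(set)
    for user, item in loans:
        borrowers[item].add(user)
    sims = defaultdict(dict)
    items = list(borrowers)
    for i, a in enumerate(items):
        for b in items[i + 1:]:
            shared = len(borrowers[a] & borrowers[b])
            if shared:
                score = shared / math.sqrt(len(borrowers[a]) * len(borrowers[b]))
                sims[a][b] = score
                sims[b][a] = score
    return sims


def recommend(item, sims, top_n=3):
    """Return the items most similar to `item`, best first."""
    ranked = sorted(sims.get(item, {}).items(), key=lambda kv: kv[1], reverse=True)
    return [other for other, _ in ranked[:top_n]]


# Hypothetical (user, title) loan records.
LOANS = [
    ("u1", "Foucaults Pendul"), ("u1", "Rosens navn"), ("u1", "Baudolino"),
    ("u2", "Foucaults Pendul"), ("u2", "Rosens navn"),
    ("u3", "Rosens navn"), ("u3", "Judasbrevet"),
    ("u4", "Baudolino"),
]

recommendations = recommend("Foucaults Pendul", item_similarities(LOANS))
print(recommendations)
```

The all-pairs loop is quadratic in the number of items, which is fine for a sketch but would need an inverted index or a sparse matrix product on a real loan history.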
  18. 18. Ranking ● popularity ● personalized (loans/likes/...) For search results (or recommendations)
  19. 19. Suggestions ● popularity (loans) ● subjects, creator, etc. E.g. for completion
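Popularity-based suggestions for completion (slide 19) could look roughly like the sketch below: candidates that match the typed prefix are ordered by loan count. The prefix matching, the tie-breaking rule and the loan data are assumptions for illustration, not a description of DBC's service.

```python
from collections import Counter


def complete(prefix, popularity, limit=3):
    """Return up to `limit` titles starting with `prefix`, most-loaned first."""
    prefix = prefix.lower()
    matches = [
        (title, loans) for title, loans in popularity.items()
        if title.lower().startswith(prefix)
    ]
    # Sort by loan count (descending), then alphabetically to break ties.
    matches.sort(key=lambda match: (-match[1], match[0]))
    return [title for title, _ in matches[:limit]]


# Hypothetical loan log: one entry per loan.
LOAN_LOG = ["Rosens navn"] * 3 + ["Rasmus Seebach"] * 2 + ["Baudolino"]
popularity = Counter(LOAN_LOG)

completions = complete("r", popularity)
print(completions)
```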
  20. 20. From Lady Gaga to James Joyce
  21. 21. “Enlightenment” ... Not guaranteed. But we can recommend “towards” a curated collection (based on item-item similarity or P(loan(y)|loan(x)))
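The quantity P(loan(y)|loan(x)) on slide 21 can be estimated from loan records as the share of users who borrowed x that also borrowed y. The sketch below uses that simple frequency estimate and then ranks a curated collection by it. The estimator, the function names and the data are assumptions; the slide does not say how the probability is computed.

```python
from collections import defaultdict


def conditional_loan_probability(loans, x, y):
    """Estimate P(loan(y) | loan(x)) as |users with both| / |users with x|."""
    items_by_user = defaultdict(set)
    for user, item in loans:
        items_by_user[user].add(item)
    borrowed_x = [items for items in items_by_user.values() if x in items]
    if not borrowed_x:
        return 0.0
    return sum(1 for items in borrowed_x if y in items) / len(borrowed_x)


def recommend_towards(x, curated, loans):
    """Rank curated items by estimated P(loan(item) | loan(x)), dropping zero scores."""
    scored = [(conditional_loan_probability(loans, x, item), item) for item in curated]
    return [item for score, item in sorted(scored, reverse=True) if score > 0]


# Hypothetical (user, title) loan records and a hypothetical curated collection.
LOANS = [
    ("u1", "Born this way"), ("u1", "Fasandræberne"),
    ("u2", "Born this way"), ("u2", "Ulysses"),
    ("u3", "Born this way"), ("u3", "Fasandræberne"),
    ("u4", "Fasandræberne"),
]
CURATED = ["Ulysses", "Fasandræberne", "Min kamp"]

p = conditional_loan_probability(LOANS, "Born this way", "Ulysses")
towards = recommend_towards("Born this way", CURATED, LOANS)
print(p, towards)
```

Note that the estimate is asymmetric: in the toy data, P(loan(Born this way)|loan(Ulysses)) is 1.0 while the reverse is about 0.33.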
  22. 22. Similarity Paths: Born this way - Lady Gaga (music) → Rasmus Seebach - Rasmus Seebach (music) → In these waters - Mads Langer (music) → De urørlige (movie) → Fasandræberne - Jussi Adler-Olsen (book)
  23. 23. Similarity Paths: Fasandræberne - Jussi Adler-Olsen → Det syvende barn - Erik Valeur → Profeterne i Evighedsfjorden - Kim Leine → Min kamp - Karl Ove Knausgård → På sporet af den tabte tid - Marcel Proust → Fædre og sønner - Ivan Turgenev → Portræt af kunstneren [...] - James Joyce → Ulysses - James Joyce
  24. 24. Similarity Paths (for the kids...): Sheik Yerbouti - Frank Zappa → Aladdin Sane - David Bowie → The red shoes - Kate Bush → MDNA world tour - Madonna → Lotus - Christina Aguilera → Born this way - Lady Gaga
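The slides do not say how the similarity paths are found. One plausible reading is a shortest-path search in a graph where each item links to its most similar items; the sketch below uses breadth-first search for that. The search method, the function name and the neighbor lists are assumptions (the titles come from slide 22, the links are made up).

```python
from collections import deque


def similarity_path(start, goal, neighbors):
    """Breadth-first search for a shortest chain of similar items from start to goal."""
    previous = {start: None}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        if current == goal:
            # Walk the predecessor links back to the start, then reverse.
            path = []
            while current is not None:
                path.append(current)
                current = previous[current]
            return path[::-1]
        for candidate in neighbors.get(current, []):
            if candidate not in previous:
                previous[candidate] = current
                queue.append(candidate)
    return None  # The goal cannot be reached from the start.


# Hypothetical nearest-neighbor lists: item -> its most similar items.
NEIGHBORS = {
    "Born this way": ["Lotus", "Rasmus Seebach"],
    "Lotus": ["Born this way"],
    "Rasmus Seebach": ["In these waters"],
    "In these waters": ["De urørlige"],
    "De urørlige": ["Fasandræberne"],
}

path = similarity_path("Born this way", "Fasandræberne", NEIGHBORS)
print(" -> ".join(path))
```

Because the neighbor lists are directed, a path from Lady Gaga to Jussi Adler-Olsen does not imply that the reverse path exists.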
  25. 25. The End Christian Boesgaard Team XP cbo@dbc.dk
  26. 26. What we use (for “data science”) Python: SciPy stack, scikit-learn, gensim, Tornado. Kafka, MongoDB, Solr. (Java, R)