Decoder Ring


Presentation of my project Decoder Ring at the Games, Learning & Society Conference 2010.

  • Why scraping data is difficult but possible
    - Many sites use different terminology and structure for what are essentially similar data types (post vs. discussion vs. thread; user vs. account)
    - Unpredictable markup on websites -- often BAD markup
    - Picture of malformed HTML
    - Creating a generic scraper tool would be sloppy, inaccurate, and error-prone
    - Fortunately, writing site-specific scrapers is a pretty straightforward process
    - Roughly 4 hours per scraper, getting to be less as I gain more experience
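
As a rough illustration of the site-specific approach described in these notes, here is a minimal sketch of a scraper for one forum layout, using Python's standard-library `html.parser`, which keeps going when markup is malformed. The class names `post`, `author`, and `body`, the `ForumPostScraper` class, and the sample HTML are all hypothetical; this is not Decoder Ring's actual scraper code.

```python
from html.parser import HTMLParser


class ForumPostScraper(HTMLParser):
    """Collects posts from one hypothetical forum layout (assumed markup)."""

    def __init__(self):
        super().__init__()
        self.posts = []      # one dict per post found so far
        self._field = None   # name of the field currently being read

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class") or ""
        if tag == "div" and css_class == "post":
            # A new post starts, even if the previous one was never closed.
            self.posts.append({"author": "", "body": ""})
            self._field = None
        elif css_class in ("author", "body") and self.posts:
            self._field = css_class

    def handle_endtag(self, tag):
        # Any closing tag ends the field being collected.
        self._field = None

    def handle_data(self, data):
        if self._field:
            self.posts[-1][self._field] += data


def scrape(html):
    """Return a list of {'author', 'body'} dicts with whitespace trimmed."""
    parser = ForumPostScraper()
    parser.feed(html)
    parser.close()
    return [{k: v.strip() for k, v in p.items()} for p in parser.posts]


# Sample with BAD markup: the first post's <p> and <div> are never closed.
SAMPLE = """
<div class="post"><span class="author">alice</span>
<p class="body">First post, never closed
<div class="post"><span class="author">bob</span><p class="body">Reply</p></div>
"""

posts = scrape(SAMPLE)
for post in posts:
    print(post["author"], "|", post["body"])
```

Because the rules are tied to one site's markup, a scraper like this stays short, which fits the few-hours-per-scraper estimate; the trade-off is that each new site needs its own set of rules.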

  • Decoder Ring

    1. Decoder Ring • Jeff Beeman • @doogiemac • GLS Conference 2010
    2. Background • Fall 2009 semester • Seminars w/ Jim & Betty • Wanted to do some sort of emulation of work I had been reading (Gee, Hayes, Steinkuehler, Duncan, etc.) • Seemed to me the process for doing it was painful
    3. Traditional process • Find content → Copy into Word docs → Take notes / highlight phrases → Manually transfer data to Excel → Come up w/ equations & charts • (At least how I see it)
    4. Traditional process • Find content → Copy into Word docs → Take notes / highlight phrases → Manually transfer data to Excel → Come up w/ equations & charts • Wasting time... and it’s BORING
    5. I’m lazy • I want to: use technology to solve repetitive, boring problems for me; write something once, use it many times; take advantage of work others have already done; work with a lot of data
    6. Better process • Find content → Create importer → Import content → Analyze content • Get someone else to do this
    7. Initial requirements • Abstracted, flexible, powerful data model • Sustainable, low-cost framework • Web based to facilitate collaboration • Facilitate importing and browsing large data sets • Automated reporting
    8. Overview
    9. Data model • Collection: Name, Description • Taxonomy: Name • Term: Name, Description • Post: Title, Body, Author, Post date, Parent post (optional) • User: Username, Avatar, Creation date, Attributes (rank, sex, etc.) • External identifier • All data normalized into Collections, Posts, Users, Taxonomies
    10. Database-backed • Reports can be generated on the fly
    11. Database-backed • Data can be queried and searched
    12. Collaborative • Multiple projects, multiple contributors
    13. Open source
    14. Getting the content • Collections, Posts, Users • Seems to be by far the most difficult part of doing this work.
    15. Again, I’m lazy • I have a tool that has a normalized, predictable data model. • I can “scrape” websites or other data sets and put them into the data model.
    16. Write once... • Scrapers / importers
    17. Reduced to as little work as possible • Given a common file format, data is quick and easy to import into Decoder Ring • Bad news: Scrapers need to be written for every site • Good news: They’re very quick to write (average 4-8 hours each)
    18. Analysis & Reporting • Content navigation
    19. Analysis & Reporting • Content editing
    20. Analysis & Reporting
    21. Analysis & Reporting
    22. This is great, but... • It’s making things faster, but what does it do that’s new? • Collaboration, networking of researchers • Immediate reporting provides insight where it may not otherwise be seen • Still some difficulties: • How do you effectively communicate how to use / apply a taxonomy?
    23. Demo
    24. Todo • Per-collection taxonomy visibility • Per-collection access control • Cross-collection reports • Search-based reports (e.g., taxonomy term activity for all posts with the word "tutorial") • More accurate and faster search (Solr), e.g., all posts with "violence" near the words "games OR video games OR entertainment" • More robust hosting infrastructure (more users, collections)
    25. Long-term todo • DR could "learn" over time about taxonomies and language, e.g., what words commonly appear in phrases tagged "scientific learning"? • Comparisons with external data, e.g., thread activity corresponding to product release announcements (Starcraft II thread) • Web-based content import: once a parser is written, the ability to queue up imports via the DR website
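
To make the data model (slide 9) and the search-based report idea (slide 24) more concrete, here is a minimal sketch in Python. It is an assumption-laden illustration, not Decoder Ring's actual schema or code: the `terms` list on `Post`, the `term_activity` function, and the sample data are all hypothetical, and plain substring matching stands in for a real search engine such as Solr.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class User:
    username: str
    attributes: dict = field(default_factory=dict)  # e.g. rank, sex


@dataclass
class Term:
    name: str
    description: str = ""


@dataclass
class Post:
    title: str
    body: str
    author: User
    parent: Optional["Post"] = None            # parent post is optional
    terms: list = field(default_factory=list)  # assumed: terms tagged on this post


def term_activity(posts, word):
    """Count taxonomy terms across posts whose body contains `word`."""
    counts = Counter()
    needle = word.lower()
    for post in posts:
        if needle in post.body.lower():  # naive case-insensitive substring match
            counts.update(term.name for term in post.terms)
    return counts


# Hypothetical sample data.
alice = User("alice", {"rank": "veteran"})
learning = Term("scientific learning")
humor = Term("humor")

guide = Post("Guide", "A tutorial on talent builds", alice, terms=[learning])
reply = Post("Re: Guide", "Great tutorial, thanks!", alice,
             parent=guide, terms=[learning, humor])
other = Post("Off topic", "Nothing to see here", alice, terms=[humor])

activity = term_activity([guide, reply, other], "tutorial")
for name, count in activity.most_common():
    print(name, count)
```

Keeping every imported site in the same few record types is what lets one report function run unchanged across collections, whichever scraper produced the data.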