DMDS Winter Workshop 2 Slides

Slides from the 2nd winter workshop on digital scholarship and programming at the Sherman Centre for Digital Scholarship at McMaster University.

Published in: Education

  1. 1. Winter 2015: Session #2 Programming on the Whiteboard February 19, 2015 (Paige Morgan)
  2. 2. Last week... • The work of creating usable data • Forms that this data might take: • markup language • Spreadsheets (MySQL & relational DBs) • Graph databases (RDF/Linked Open Data)
  3. 3. This week: • Caveat Curator (challenges of working with data) • Programming on the Whiteboard, i.e., conceptualizing the specific steps that you need to take to accomplish your goals
  4. 4. Goals/Takeaways • A better understanding of the workflow for dealing with data • Greater ability to talk about what you’re trying to do
  5. 5. Why this focus on data? • Understanding your data, and your intended actions, is a key skill for developing any digital project (big or small). • You may have one big project – but your data may support several small/intermediary projects.
  6. 6. Caveat Curator
  7. 7. Programming languages (and digital apps) are like human languages in that they both have phrases, patterns, and rules.
  8. 8. Programming languages are unlike human languages in that they aren’t for communicating with people.
  9. 9. They are also unlike human languages in that every programming utterance does something, i.e., causes an action to occur.
  10. 10. You can get used to patterns – even unfamiliar ones.
  11. 11. The shift is in getting used to thinking in terms of every single action.
  12. 12. Today’s subject matter includes actions that you’ll need to think about before you work with...
  13. 13. Image: Josh Lee, @wtrsld, via Twitter, January 2014.
  14. 14. Even when you’re just experimenting, you need to prep your data.
  15. 15. You may know your dataset in detail already, from your research -- but your computer is concerned with different levels of detail.
  16. 16. Becoming aware of those levels of detail is not only helpful for your project ideas...
  17. 17. ...it’s also a useful skill for working with programming languages (where a stray /> or ; can break your program/website).
  18. 18. Data only works if your computer can read it.
  19. 19. But my data is just text! (Doesn’t that make things easy?)
  20. 20. (Remember, your computer is fairly stupid).
  21. 21. Formatted text is often full of text your computer can’t parse correctly.
  22. 22. The┘re┘sÜlt ís that yoÜr te┘xt might come┘ oÜt looking like┘this whe┘n yoÜ ope┘n it in a programming e┘nvironme┘nt.
  23. 23. So you need to convert it to plain text (without any of the fancy details encoded in MS Word fonts).
  24. 24. (This is key if you work with newspapers, older printed texts, or archival material.)
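One way to picture the conversion to plain text is the minimal Python sketch below. It is an illustration only, not the workshop's own method: the `REPLACEMENTS` table and the `to_plain_text` function are hypothetical names, and the set of characters handled is an assumption about what typically comes out of a word processor.

```python
import unicodedata

# Hypothetical mapping of common word-processor characters to plain ASCII.
REPLACEMENTS = {
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
    "\u2013": "-",   # en dash
    "\u2014": "-",   # em dash
}


def to_plain_text(text: str) -> str:
    """Return text with typographic characters replaced by plain equivalents."""
    # NFKC folds compatibility forms (such as the "fi" ligature or a
    # non-breaking space) into their ordinary counterparts.
    text = unicodedata.normalize("NFKC", text)
    for fancy, plain in REPLACEMENTS.items():
        text = text.replace(fancy, plain)
    return text


print(to_plain_text("\u201cIt\u2019s \ufb01ne\u201d"))
```

Accented letters that belong to the text (for example in names) are left alone; only the typographic extras are flattened.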
  25. 25. Maybe you want to work with sailing data and ports of call:
  26. 26. The ship you’re interested in leaves the Ivory Coast for St. Helena...
  27. 27. But when you create your map, you get this:
  28. 28. The latitude/longitude coordinate is the significant datum.
  29. 29. The city name is just the human-readable component.
  30. 30. Each datum needs to be unique.
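A small Python sketch can show why the coordinate, and not the name, has to be the unique datum. The `Place` type and the coordinates are illustrative assumptions (approximate values, not taken from the slides): two places share the name "St. Helena", so a lookup by name silently keeps only one of them.

```python
from typing import NamedTuple


class Place(NamedTuple):
    name: str
    lat: float
    lon: float


# Approximate, illustrative coordinates (assumptions, not from the slides).
places = [
    Place("St. Helena", -15.97, -5.71),    # island in the South Atlantic
    Place("St. Helena", 38.51, -122.47),   # town in California
]

# Keyed by name, the second entry overwrites the first.
by_name = {p.name: p for p in places}

# Keyed by coordinates, each place stays distinct.
by_coords = {(p.lat, p.lon): p for p in places}

print(len(by_name), len(by_coords))
```

The human-readable name can still be stored alongside the coordinates for display; it is just not used as the key.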
  31. 31. Figuring out what sort of unique configuration will work best involves at least some experimentation.
  32. 32. To experiment effectively, you’ll want to keep careful records.
  33. 33. If you develop categories of information, you’ll want to keep a record of what each category means, and what its limits are.
  34. 34. Cleaning and structuring your data is a foundational issue, and the work involved changes depending on the format in which your data is available.
  35. 35. What if your data is crowdsourced?
  36. 36. You can require a particular format for submissions
  37. 37. You can even put programmatic limits on the formats available for submission
  38. 38. But in the end, you’re probably still going to need to scrub and/or format.
  39. 39. This is true even for data from supposedly reputable sources, like government or media organizations.
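As a rough picture of what "scrubbing" can mean in practice, the Python sketch below normalizes free-text submissions so that variants of the same value collapse into one. The `scrub` function and the sample values are made up for illustration; real data usually needs rules specific to the dataset.

```python
import re


def scrub(value: str) -> str:
    """Trim, collapse runs of whitespace, and lowercase a submitted value."""
    return re.sub(r"\s+", " ", value).strip().casefold()


# Invented submissions: five entries, but only two distinct places.
submissions = ["Toronto", " toronto ", "TORONTO", "New  York", "new york"]
cleaned = sorted({scrub(s) for s in submissions})

print(cleaned)
```

Keeping a note of each rule applied (here: trimming, whitespace collapsing, case folding) makes the cleaning repeatable when new submissions arrive.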
  40. 40. Example: Doctor Who Villains dataset http://tinyurl.com/doctorwhovillains
  41. 41. This step is no fun!
  42. 42. But it’s absolutely necessary.
  43. 43. If you are thinking about your data, and the tasks that you need to accomplish, then it’s easier to determine what sort of language or platform your project needs.
  44. 44. There are countless tutorials, online courses, etc., for almost any programming language or platform. (You can also ask for a Sherman Centre consultation to figure out what you need to learn.)
  45. 45. Learning how to work with any tool can be a slow process, especially at first.
  46. 46. However, knowing what tasks you’re working towards makes it easier to understand the purpose of the introductory lessons.
  47. 47. It’s also easier to think about how the first rules you learn for any language or platform might affect your goals.
  48. 48. Pseudocode • Used by programmers to break down a complex task into manageable steps • Easily adaptable for use by non-programmers
  49. 49. Pseudocode Example (Visible Prices) • Computer has a file that contains prices from different texts. • Computer must know that each price amount is connected with an object, and with a bibliographical record. • Users can input a price amount, and computer will retrieve all objects that match the price, and display them to the user, along with bibliographical information. • (More complex): Computer is able to retrieve prices linked with certain categories (clothing, food, etc.)
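The pseudocode above might translate into something like the following Python sketch. It is not the Visible Prices implementation: the `PriceRecord` fields, the function names, and every record in `RECORDS` are hypothetical and exist only to show how each pseudocode step becomes a concrete action.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PriceRecord:
    amount_pence: int   # price stored as a whole number of pence
    item: str           # the object the price is attached to
    category: str       # e.g. "food" or "clothing"
    source: str         # bibliographical record for the text


# Made-up records for illustration only.
RECORDS = [
    PriceRecord(6, "loaf of bread", "food", "Hypothetical Novel A (1801)"),
    PriceRecord(6, "pair of gloves", "clothing", "Hypothetical Novel B (1815)"),
    PriceRecord(240, "winter coat", "clothing", "Hypothetical Novel A (1801)"),
]


def find_by_price(records, amount_pence):
    """Return every record whose price matches the amount the user entered."""
    return [r for r in records if r.amount_pence == amount_pence]


def find_by_category(records, category):
    """Return every record in a category (the 'more complex' step)."""
    return [r for r in records if r.category == category]


# Display matching objects along with their bibliographical information.
for match in find_by_price(RECORDS, 6):
    print(f"{match.item} ({match.category}), from {match.source}")
```

Each price carries its object and its source in the same record, which is the "connection" the second pseudocode bullet asks for.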
  50. 50. (And now for a couple of examples of projects in process…)
  51. 51. Social Work and Social Change (Tina Wilson)
  52. 52. Social work + social change • Recent history of academic social work in Canada; 1960s onward • Interested in the ways in which academic social work has attempted to advance justice-oriented social change projects, and how political, cultural, and theoretical shifts have influenced this type of disciplinary imagination and work • Related to disciplinary boundaries and methods and orthodoxies, and the social role of universities
  53. 53. MARC Record, front end
  54. 54. MARC Record, back end
      <?xml version="1.0"?>
      <record xmlns="http://www.loc.gov/MARC21/slim" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.loc.gov/MARC21/slim http://www.loc.gov/standards/marcxml/schema/MARC21slim.xsd">
        <leader>00000cam a000001</leader>
        <controlfield tag="001">468966</controlfield>
        <controlfield tag="008">710913s1968vaub000 0 eng c</controlfield>
        <datafield tag="010" ind1=" " ind2=" ">
          <subfield code="a">a68007753 </subfield>
        </datafield>
        <datafield tag="040" ind1=" " ind2=" ">
          <subfield code="a">Virginia. Univ. Libr.</subfield>
          <subfield code="b">eng</subfield>
          <subfield code="c">DLC</subfield>
          <subfield code="d">OCLCQ</subfield>
          <subfield code="d">CLU</subfield>
          <subfield code="d">OCLCO</subfield>
          <subfield code="d">OCLCF</subfield>
          <subfield code="d">OCLCQ</subfield>
        </datafield>
        <datafield tag="043" ind1=" " ind2=" ">
          <subfield code="a">n-us-va</subfield>
        </datafield>
        <datafield tag="050" ind1="0" ind2="4">
          <subfield code="a">HV98.V8</subfield>
          <subfield code="b">C46</subfield>
        </datafield>
        <datafield tag="082" ind1=" " ind2=" ">
          <subfield code="a">361/.9/755</subfield>
        </datafield>
        <datafield tag="100" ind1="1" ind2=" ">
          <subfield code="a">Cepuran, Joseph.</subfield>
        </datafield>
        <datafield tag="245" ind1="1" ind2="0">
          <subfield code="a">Public assistance and child welfare:</subfield>
          <subfield code="b">the Virginia pattern, 1646 to 1964.</subfield>
        </datafield>
        <datafield tag="260" ind1=" " ind2=" ">
          <subfield code="a">[Charlottesville]</subfield>
          <subfield code="b">Institute of Government, University of Virginia,</subfield>
          <subfield code="c">1968.</subfield>
        </datafield>
        <datafield tag="300" ind1=" " ind2=" ">
          <subfield code="a">vii, 120 pages</subfield>
          <subfield code="c">28 cm</subfield>
        </datafield>
        <datafield tag="336" ind1=" " ind2=" ">
          <subfield code="a">text</subfield>
          <subfield code="b">txt</subfield>
          <subfield code="2">rdacontent</subfield>
        </datafield>
        <datafield tag="337" ind1=" " ind2=" ">
          <subfield code="a">unmediated</subfield>
          <subfield code="b">n</subfield>
          <subfield code="2">rdamedia</subfield>
        </datafield>
        <datafield tag="338" ind1=" " ind2=" ">
          <subfield code="a">volume</subfield>
          <subfield code="b">nc</subfield>
          <subfield code="2">rdacarrier</subfield>
        </datafield>
        <datafield tag="504" ind1=" " ind2=" ">
          <subfield code="a">Includes bibliographical references.</subfield>
        </datafield>
        <datafield tag="650" ind1=" " ind2="0">
          <subfield code="a">Public welfare</subfield>
          <subfield code="z">Virginia.</subfield>
        </datafield>
        <datafield tag="650" ind1=" " ind2="0">
          <subfield code="a">Child welfare</subfield>
          <subfield code="x">Government policy</subfield>
          <subfield code="z">Virginia.</subfield>
        </datafield>
        <datafield tag="650" ind1=" " ind2="7">
          <subfield code="a">Child welfare</subfield>
          <subfield code="x">Government policy.</subfield>
          <subfield code="2">fast</subfield>
          <subfield code="0">(OCoLC)fst00854729</subfield>
        </datafield>
        <datafield tag="650" ind1=" " ind2="7">
          <subfield code="a">Public welfare.</subfield>
          <subfield code="2">fast</subfield>
          <subfield code="0">(OCoLC)fst01083250</subfield>
        </datafield>
        <datafield tag="651" ind1=" " ind2="7">
          <subfield code="a">Virginia.</subfield>
          <subfield code="2">fast</subfield>
          <subfield code="0">(OCoLC)fst01204597</subfield>
        </datafield>
        <datafield tag="710" ind1="2" ind2=" ">
          <subfield code="a">University of Virginia.</subfield>
          <subfield code="b">Institute of Government.</subfield>
        </datafield>
      </record>
      MARC 21 Format
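To give a sense of how a record like this can be read by a program, here is a minimal Python sketch using the standard library's `xml.etree.ElementTree`. The `SAMPLE` string is an abbreviated copy of the record on the slide, and the `subfields` helper is a hypothetical name introduced for illustration; a real project might prefer a dedicated MARC library.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Namespace used by MARCXML elements in the record.
NS = {"marc": "http://www.loc.gov/MARC21/slim"}

# Abbreviated version of the slide's record (title and two subject fields).
SAMPLE = """<?xml version="1.0"?>
<record xmlns="http://www.loc.gov/MARC21/slim">
  <datafield tag="245" ind1="1" ind2="0">
    <subfield code="a">Public assistance and child welfare:</subfield>
    <subfield code="b">the Virginia pattern, 1646 to 1964.</subfield>
  </datafield>
  <datafield tag="650" ind1=" " ind2="0">
    <subfield code="a">Public welfare</subfield>
    <subfield code="z">Virginia.</subfield>
  </datafield>
  <datafield tag="650" ind1=" " ind2="0">
    <subfield code="a">Child welfare</subfield>
    <subfield code="x">Government policy</subfield>
    <subfield code="z">Virginia.</subfield>
  </datafield>
</record>"""


def subfields(record, tag, code):
    """Return the text of every subfield `code` inside datafields with `tag`."""
    path = f"marc:datafield[@tag='{tag}']/marc:subfield[@code='{code}']"
    return [el.text.strip() for el in record.findall(path, NS) if el.text]


record = ET.fromstring(SAMPLE)

# Field 245 holds the title; field 650 holds topical subject headings.
title = " ".join(subfields(record, "245", "a") + subfields(record, "245", "b"))
subjects = Counter(subfields(record, "650", "a"))

print(title)
print(subjects)
```

Run over many records, a counter like `subjects` is one way to start tallying how often particular subject headings appear.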
  55. 55. Things to count • Social problems: child abuse, unemployment, inequality • Concepts: mental hygiene, non-voluntary clients, culture of poverty, consciousness raising, privilege • Sub-populations: immigrants, unwed mothers, the oppressed • Institutions: work houses, shelters, detention, the non-profit industrial complex • Interventions: motivational interviewing, case management, urban planning, life skills education, community organizing • Types of social work: case work, radical social work, community development, clinical social work • SW books in Canadian Libraries
  56. 56. Project: People, Persons and Individuals: Is the DSM Dehumanizing? (Mackenzie Salt)
  57. 57. Objective To analyze the diagnostic chapters of five volumes of the DSM to determine whether the referring expressions used therein are dehumanizing and if so, determine if the usages have changed over time.
  58. 58. Problems Traditional discourse analysis is done by hand and can be very time consuming. Volumes of the DSM range from 494 pages to 991 pages.
  59. 59. Solutions • Digital Corpus Analysis – Computer Software (R) • Faster • More Efficient • Can handle large amounts of data at once • Data has to be prepared before it is ready to be used for digital analysis.
  60. 60. Preparation of Data • Physical data must be converted to digital medium • Steps – Permission – Scanning to PDF – OCR PDF – Convert OCRed PDF to Plain Text – Clean Plain Text
  61. 61. Permission • Digital e-book copies of DSM are not available for any of the versions • American Psychiatric Association holds copyright and is VERY protective
  62. 62. Scanning DSMs • Physical copies of the DSMs need to be scanned into a digital format (in this case PDF) • OCR PDFs: PDFs need to be converted to a text format that a computer can read, edit, and work with
  63. 63. Clean Plain Text Files • Once you have OCRed plain text files, you need to make sure they are accurate – Computers are only as good as their input • If the data input is messy, the analysis will be messy • Made files consisting of only the chapters for analysis • Checked for and fixed any remaining OCR/Scanning errors
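Some of the cleanup described here can be automated. The Python sketch below is an assumption about what typical OCR cleanup looks like, not the steps this project actually ran (the project used R); `clean_ocr` is a hypothetical function and the sample text is invented.

```python
import re


def clean_ocr(text: str) -> str:
    """Apply a few common fixes to OCRed plain text."""
    # Rejoin words hyphenated across a line break: "diag-\nnosis" -> "diagnosis".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Drop lines that contain nothing but a page number.
    text = re.sub(r"(?m)^[ \t]*\d+[ \t]*$\n?", "", text)
    # Collapse runs of spaces and tabs into a single space.
    text = re.sub(r"[ \t]+", " ", text)
    return text.strip()


raw = "The diag-\nnosis is made\n  123  \nby a  clinician."
print(clean_ocr(raw))
```

Rules like the hyphen fix can also join genuinely hyphenated compounds that happen to break at a line end, so a manual check of the output is still needed.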
  64. 64. Now To The Project • Come up with a list of referring expressions based on a visual scan through the DSMs • Use R to narrow down the list to only the most frequent – Narrows 10K+ unique words to a handful • Use R to pull out all sentences with the terms in question – Narrows down ~19K sentences to 655 for individual
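The two narrowing steps on this slide were done in R; the sketch below shows the same idea in Python purely for illustration. The function names and the three-sentence `corpus` are invented, and the sentence splitter is deliberately naive (it splits after `.`, `!`, or `?`).

```python
import re
from collections import Counter


def word_counts(text: str) -> Counter:
    """Count how often each lowercase word occurs."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def sentences_with(text: str, term: str) -> list:
    """Return the sentences that contain `term` as a whole word."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    pattern = re.compile(rf"\b{re.escape(term)}\b", re.IGNORECASE)
    return [s for s in sentences if pattern.search(s)]


# Invented text, not taken from the DSM.
corpus = (
    "The individual reports distress. Patients may improve. "
    "An individual with this pattern is rare."
)

print(word_counts(corpus).most_common(3))
print(sentences_with(corpus, "individual"))
```

Counting first and extracting second mirrors the slide: the frequency list decides which terms are worth pulling sentences for.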
  65. 65. Benefits of Digital Analysis • This project still used some manual analysis • Using digital technologies and corpora sped things up considerably • Made it easier to break down large corpus to manageable parts • Now have a corpus on which to do other projects in the future: prep-work already done
  66. 66. Working with Digital Corpora with R • Pros – Free and Cross-Platform – Powerful, Efficient, Fast – Capable of working with VERY large datasets – Subsequent projects can be much faster as code can be saved and built on or recycled • Cons – Code-based command-line style interface (?) – GIGO – Depending on project, input data may need substantial preparation
  67. 67. Summary • Overall, 75% of the time spent doing this project was prepping the data • Project took only 3-4 months to do, part-time • Corpus analyzed totalled 1.08 million words from the DSM-III through the DSM-5 • Future projects based on this corpus will be much faster to do as well • Digital technologies made this project feasible • Project was much faster than if done by hand
  68. 68. It is likely that your data will have a longer life span than any specific project you create.
  69. 69. In many instances, it may be useful to focus on data curation as much as on any single project.
  70. 70. Key DS Values • Adaptive • Sustainable/resource-aware • Collaborative • Social
  71. 71. Key skill • Thinking flexibly about your data (and potential project) • Are there portions of your dataset that could be extracted for use in a particular tool? • How can you adjust your data in order to show it to people (and be more able to talk/write/present about your research interests)?
  72. 72. And now, it’s your turn...
  73. 73. For this activity, I recommend that you pair up, or form small groups to work together.
  74. 74. Group Activity • What do you need to do with your data? (share, aggregate, combine…?) • What units might that data exist in? • What categories do you need to create? • What connections need to exist between the units and the categories?
  75. 75. Next steps • What’s the smallest version of your dataset possible? (useful for testing out tools) • Possible tools to examine (as ways of presenting your data) • Omeka (http://www.omeka.net) • Scalar (http://scalar.usc.edu) • Simile (http://www.simile-widgets.org) • Google Fusion Tables (https://support.google.com/fusiontables/answer/2571232)
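For the "smallest version of your dataset" question, one simple approach is to keep the header and the first few rows of a spreadsheet export. The Python sketch below assumes the data is available as CSV; `smallest_sample` and the column names are hypothetical.

```python
import csv
import io
from itertools import islice


def smallest_sample(csv_text: str, n: int) -> str:
    """Return the header plus the first n data rows of a CSV string."""
    reader = csv.reader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(next(reader))         # copy the header row
    writer.writerows(islice(reader, n))   # copy at most n data rows
    return out.getvalue()


# Invented data for illustration.
full = "ship,port,lat,lon\nA,Port 1,1.0,2.0\nB,Port 2,3.0,4.0\nC,Port 3,5.0,6.0\n"
sample = smallest_sample(full, 2)

print(sample)
```

A sample this small is quick to load into a new tool, which makes it easier to see whether the tool suits the data before committing the full dataset.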
  76. 76. SCDS support for data wrangling • Consultations http://www.tinyurl.com/scds-consult • Colloquium slots (opportunities to talk through your project plans for a supportive audience) • Graduate fellowships (workspace and greater access to SCDS staff expertise)
  77. 77. Spring Workshops! • Project Ideation and Development; Choosing Tools for Every Part of Your Project • April 9th and 16th, 2015 (pre-registration available soon)