
Studying archives of online behavior


Slides from guest presentation at Aron Lindberg's Computational-Qualitative Field Research seminar: Needed readings at



  1. Studying Archives of Online Behavior Computational Qualitative Research Seminar James Howison University of Texas at Austin Link to slides on twitter @jameshowison Readings at hSsNnnM9Pa?dl=0
  2. Readings • The presentation and discussion will draw on: – Howison, J., & Crowston, K. (2014). Collaboration through open superposition: A theory of the open source way. MIS Quarterly, 38(1), 29–50. – Howison, J., Wiggins, A., & Crowston, K. (2011). Validity Issues in the Use of Social Network Analysis with Digital Trace Data. Journal of the Association for Information Systems, 12(12), Article 2. – Geiger, R. S., & Ribes, D. (2011). Trace Ethnography: Following Coordination through Documentary Practices. In Proceedings of the 44th Hawaii International Conference on System Sciences (HICSS 2011) (pp. 1–10). Waikoloa, HI. – Annabi, H., Crowston, K., & Heckman, R. (2008). Depicting What Really Matters: Using Episodes to Study Latent Phenomenon. In Proceedings of the International Conference on Information Systems (ICIS). – The methodological appendix for the Howison and Crowston Superposition article.
  3. To the Archives! The evidence is here, somewhere. CC Credit: photos/hamadryades/
  4. Opportunities of online archive studies • Quantity • Granularity • Accessibility – Much is openly available – Or the organization can provide bulk access – (compare to ethnography and getting individual cooperation) • Emic'ness
  5. Emic'ness? Emic: in their words (from the inside) Etic: in your words (from the outside) Naturalistic: the archives are primary to the users and the activity themselves: "documentary traces are the primary mechanism in which users themselves know their distributed communities and act within them." (Geiger and Ribes, 2011)
  6. Yet, many challenges We are using the system (and the system that archived and presents the traces) as a data collection method. But the systems were not built for research. So we need to ask, for any research question: How well do the archives represent the activity, as it happened?
  7. Individual Exercise (6 mins) 1. Pick a system that renders online archives of something you are interested in. – Can be your project for this course or something you choose right now. – Slight preference for an archive showing traces from more than one person 2. Go and find a specific archive page and read it. 3. Write a sentence or two about what is happening there.
  8. Quick Group discussion (4 mins) • Let's hear from a few participants about their choices.
  9. Individual exercise II (6 mins) • How might archives diverge from experience? 1. How did the system record activity at the time? 2. How did the conversion to archives occur? 3. How is your experience of reading the archives different from the experience of the participants in the activity that was archived?
  10. Discussion in groups • Group discuss questions (go question by question, not person by person) – How recorded? (each person speak) – How converted? … – How is reading experience different?
  11. Most surprising? • One person from each group reports back the aspect that was most surprising.
  12. Archival transformation • Deletions – Some data is periodically purged from databases; after all, they are running a website, not a research database. • Overlaps – When database dumps are pulled periodically • Re-calculations – Historical depictions on a site (e.g., counts of messages, members, or other data such as downloads) might be later creations or re-calculations – Can you rely on participants having seen those figures at the time?
  13. Database schemas are not research ontologies • Databases (or websites) often use words that are very exciting for research – "Friends", "Followers", "Assignment", "Member" • But their meaning may have very, very little to do with the sociological/theoretical concept – At best they are a hint that something interesting is happening, but often are interpreted literally! • Examples from Sourceforge – use of "assigned to" field on close. – "member list" does not show who is active (no one was ever removed!)
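The overlap problem above can be handled by deduplicating periodic dumps on a stable key. A minimal sketch, assuming records carry a unique `id` field (field names and data here are illustrative, not from any real schema):

```python
# Merging two overlapping database dumps by record id (illustrative data).
dump_march = [
    {"id": 101, "body": "first post"},
    {"id": 102, "body": "a reply"},
]
dump_april = [
    {"id": 102, "body": "a reply"},      # overlaps with the March dump
    {"id": 103, "body": "later post"},
]

merged = {}
for record in dump_march + dump_april:
    merged[record["id"]] = record        # later dumps win on conflict

print(sorted(merged))   # [101, 102, 103]
```

Note that "later dumps win" silently masks re-calculations: if a record changed between dumps, only the latest version survives, which is itself an archival transformation worth documenting.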
  14. Non-archived activity [Diagram: development activity split across a private/public divide — private work (coding, compiling, debugging, local testing, variables and stacks, errors and warnings, application logs, stack dumps, local binaries, testing builds from CVS) leaves no public trace, while only CVS check-ins and check-outs, commit logs, announcement and discussion emails, bug reports and bug discussion, release notes, and public binary/source releases reach the archives]
  15. Reasoning with missing/complete data • Trouble both ways • Assuming that the data are complete (rather than a system-selected sample) • Can miss important activities or whole archives that need to be integrated. • Oddly enough, issues can also emerge when data are complete – See discussion in the JAIS paper on validity in SNA.
  16. Hidden readership • Archives almost never tell you who read what, and when they read it. – Might be key to interpretation (or might be irrelevant) – Definitely crucial to any argument about information flow (and almost all interpretations of SNA measures are about information flow). • You may be able to impute readership from responses, but it's a weak signal.
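Imputing readership from responses, as the last bullet suggests, amounts to treating each reply as evidence that the replier read the parent message. A minimal sketch with made-up message and author names:

```python
# If B replied to a message from A, assume B read that message;
# silence tells us nothing. (reply, replied_to, replier) triples are illustrative.
replies = [
    ("msg2", "msg1", "bea"),
    ("msg3", "msg1", "carl"),
    ("msg4", "msg2", "ann"),
]

imputed_readers = {}
for reply, parent, replier in replies:
    imputed_readers.setdefault(parent, set()).add(replier)

print(sorted(imputed_readers["msg1"]))   # ['bea', 'carl']
```

This gives only a lower bound on the audience, which is exactly why the slide calls it a weak signal: lurkers who read but never reply are invisible.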
  17. Activity traces scattered through archives • Participants experience a flow of activities across different systems – Linked by the time and order in which they occur • But they are archived by different systems – If you just read the mailing list you miss so much – And yet so many studies *want* their archive to be the only one (so much easier to analyze).
  18. [Diagram: traces of a single task scattered across systems — release notes, developer email, bug tracker, RFE tracker, user forum, and CVS — each archive holding its own task-relevant documents and outcomes, linked by search and assignment]
  19. Pacing of activities • Participant observation in an open source project highlighted the role of pacing. – Rapid replies indicated interest and importance but also availability – Very long gaps (sometimes years) indicated deferral and return. • In other work I was reading archives and found pacing hard to appreciate; it was very salient in participant observation but hidden in studies relying on trace data alone.
  20. An episode
  21. How to represent pacing? Time stamps
  22. Representing pacing • Calculate gaps?
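Calculating gaps is mechanically simple: difference each timestamp with its predecessor. A minimal sketch with hypothetical timestamps from an archived thread:

```python
from datetime import datetime

# Hypothetical message timestamps from one archived thread
timestamps = [
    datetime(2005, 3, 1, 9, 15),
    datetime(2005, 3, 1, 9, 42),   # rapid reply: minutes
    datetime(2005, 3, 4, 16, 5),   # days of silence
    datetime(2006, 1, 10, 11, 0),  # deferral and return: months later
]

# Gap between each message and the one before it
gaps = [later - earlier for earlier, later in zip(timestamps, timestamps[1:])]
for gap in gaps:
    print(gap)
```

As the next slide notes, a column of such durations is still easy to skim past; the numbers alone don't convey how different a 27-minute reply feels from a ten-month silence.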
  23. Reading gaps doesn't help; they are easy to ignore. Can we make them harder to ignore?
  24. Visualize events
  25. What is to be done? • Sufficient engagement with the system and community to adequately interpret the traces. • Use a system and see how your data is archived. • When you think a phenomenon/construct can be operationalized computationally, at least show some narrative examples from the dataset. • Complement archives with interviews and/or surveys – Archives make great prompts for interviews – Lakhani and Wolf (2003) surveyed immediately after a post. • Gaskin et al. (2014) "Zooming in and out of sociomaterial routines" MISQ.
  26. An ontology for trace data studies • Document – Archived content. E.g., an e-mail message, tracker comment, release note, pull request, log entry. – Provides evidence for events and actions. – One document may provide evidence for multiple events and actions. • Event – An event causes documents to be archived. Sending an email, releasing a version. • Action – The contextualized meaning of an event, e.g., contributing code, showing leadership (can be at quite different conceptual levels in different studies.) • Participant – An actor (typically a person, but could be a machine or bot) • Identifier – A string associated with a participant. – Many identifiers could refer to one participant (e.g., email and username) – but many participants may act through one identifier (e.g., "admin account")
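One way to make this ontology concrete is to encode it as a handful of linked record types. A minimal sketch; the class and field names simply mirror the slide's terms and are not from any published schema:

```python
from dataclasses import dataclass, field

@dataclass
class Participant:
    name: str                      # an actor: person, machine, or bot

@dataclass
class Identifier:
    value: str                     # e.g. an email address or username
    participant: Participant       # many identifiers may map to one participant

@dataclass
class Event:
    kind: str                      # e.g. "SendEmail", "ReleaseVersion"
    identifier: Identifier         # who acted, as recorded in the archive

@dataclass
class Document:
    source: str                    # e.g. a URL into the archive
    events: list = field(default_factory=list)  # one document, many events

@dataclass
class Action:
    meaning: str                   # contextualized interpretation of an event
    event: Event

# Illustrative usage (names and URL are made up):
james = Participant("James")
email = Identifier("james@example.org", james)
send = Event("SendEmail", email)
msg = Document("https://example.org/archive/msg1", [send])
act = Action("asking for help", send)
```

Keeping Identifier distinct from Participant is what lets the same person's email address and username resolve to one actor, and one shared "admin account" resolve to many.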
  27. Episodes • A unit of analysis, facilitating comparison and summary (e.g., counting) – Compare to content analysis or NLP that counts mentions of concepts, database queries that count documents, surveys that measure attitudes. – The detail provided by trace data renders episodes more accessible, allowing research to be more granular, closer to the work. • Ideally emic (meaningful to and recognizable by participants)
  28. Ok, but how to store this? • Moving from documents and events to actions and outcomes is interpretative work – I do the qualitative work first, then hope to make it computable (e.g., through machine learning) • It is akin to content analysis but with a much more complicated ontology – Content analysis (classic or grounded theory) assigns codes to documents – Software like Atlas.ti has trouble handling coding of structured data (dates, linked documents like threads, multiple identifiers for a single participant).
  29. I use RDF • Resource Description Framework – Triples: James hasEmail – URLs work natively (making viewing original archives easy) • Retains original data structure – e.g., Document in thread by Identifier – Allows ad-hoc addition of structure (schemaless) – Allows inheritance (e.g., MailingListEvent a CommunicationEvent) • Allows you to overlay higher-level structure – e.g., Action(s) in (ordered) Episode by Participant – And then apply codes to Actions (storing when, who, why) • Querying via SPARQL, validation via RDF rules (aka SPIN)
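The triple structure underlying this approach can be illustrated without any RDF tooling. A toy sketch, assuming plain tuples stand in for a real graph store (an actual study would use an RDF library such as rdflib and query with SPARQL; the `ex:` names below are invented):

```python
# Triples as (subject, predicate, object); note how original structure
# and overlaid higher-level structure live in the same graph.
triples = {
    ("ex:msg1", "rdf:type", "ex:MailingListEvent"),
    ("ex:MailingListEvent", "rdfs:subClassOf", "ex:CommunicationEvent"),
    ("ex:msg1", "ex:sentBy", "ex:jamesEmail"),
    ("ex:jamesEmail", "ex:identifies", "ex:james"),
    # Overlay: an interpreted Action, evidenced by the document, in an Episode
    ("ex:action1", "ex:evidencedBy", "ex:msg1"),
    ("ex:action1", "ex:inEpisode", "ex:episode1"),
}

def objects(subject, predicate):
    """All objects for a subject/predicate pair (a toy stand-in for SPARQL)."""
    return {o for s, p, o in triples if s == subject and p == predicate}

print(objects("ex:msg1", "ex:sentBy"))   # {'ex:jamesEmail'}
```

Because new triples can be added at any time without a schema change, the qualitative overlay (actions, episodes, codes) never disturbs the original archival structure beneath it.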
  30. An episode
  31. Showing an example