Secondary data analysis with digital trace data

Published in: Technology, Education
Transcript of "Secondary data analysis with digital trace data"

  1. Secondary data analysis with digital trace data: Examples from FLOSS research
     Andrea Wiggins, 13 July 2011
  2. Secondary Data Analysis
     • Uses existing data produced or collected by someone else, usually for a different purpose
       • Databases
       • Repositories
       • Surveys
       • Emails
       • Social networks
  3. Digital Trace Data
     • Records of activity (trace data) undertaken through an online information system (thus digital)
     • Increasingly common in studies of online phenomena
       • Large volumes of available data
       • Can be complete: a census, not a sample
       • May be more reliably recorded than other data
  4. Characteristics
     1. Found data (not produced for research)
     2. Event-based data (not summary data)
     3. Events occur over time, so it is longitudinal data
  5. Requirements
     • Understand the original data source
       • How it was collected, potential problems
       • Limitations of the sample
       • What the data describe
     • Match with appropriate analysis methods and measures
       • New types of data may require new measures
       • Theoretical coherence is very important
  6. Advantages
     • Data may be “complete”
       • Usually no response bias (exception: cookies)
       • May cover long periods of time and large groups
       • Multiple different data types, but mostly textual
     • Data are often easy to acquire
       • APIs or scraping web pages (with caution)
       • Databases, archives, or repositories of research data
     • But remember: you usually get what you pay for!
  7. Disadvantages
     • Often difficult to know limitations of data
       • Data may be poorly documented
       • Original creator may not be available for comment
     • Volume of data can be overwhelming
       • Sampling strategies needed, e.g., temporal, random
       • Substantial time required for data preparation: 90% of effort
       • Exceptions are everywhere and will break analyses, but can only be discovered through trial and error
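
The temporal and random sampling strategies named on the Disadvantages slide could look roughly like the following. This is a minimal sketch, not the presenter's procedure: the event records, the field names `id` and `date`, and the fixed seed are all assumptions made for illustration.

```python
import random
from datetime import date

def temporal_sample(events, start, end):
    """Keep events whose date falls in the half-open interval [start, end)."""
    return [e for e in events if start <= e["date"] < end]

def random_sample(events, k, seed=0):
    """Draw up to k events without replacement; the fixed seed makes the draw repeatable."""
    rng = random.Random(seed)
    return rng.sample(events, min(k, len(events)))

# Hypothetical trace events: two per calendar month of 2010.
events = [{"id": i, "date": date(2010, 1 + i % 12, 1)} for i in range(24)]

first_quarter = temporal_sample(events, date(2010, 1, 1), date(2010, 4, 1))
subset = random_sample(events, 5)
```

Seeding the generator is a deliberate choice here: a sample drawn from found data is much easier to document and re-check if the same draw can be reproduced later.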
  8. Example: Email Networks
     • Data source: email listservs for FLOSS projects
     • Analysis approach: create social networks
       • Within discussion threads, individuals are nodes, and links are reply-to messages
       • Some conceptual issues for interpretation, choice of measures
     • Technical challenges
       • Temporal aggregation
       • Identity resolution
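
The network construction on the Email Networks slide (individuals as nodes, reply-to messages as links, with identity resolution as a preprocessing step) can be sketched as below. Everything concrete here is an assumption for illustration: the message fields `id`, `from`, and `in_reply_to`, the `ALIASES` table, and the example addresses. The slides do not say how identities were actually resolved.

```python
from collections import Counter

# Hypothetical alias table: maps each known address to one canonical person.
ALIASES = {
    "alice@example.org": "alice",
    "alice@users.example.net": "alice",
    "bob@example.org": "bob",
}

def resolve(address):
    """Return a canonical identity; unknown addresses fall back to the lowercased address."""
    key = address.strip().lower()
    return ALIASES.get(key, key)

def reply_edges(messages):
    """Count directed reply links (replier -> author of the message replied to)."""
    sender_of = {m["id"]: resolve(m["from"]) for m in messages}
    edges = Counter()
    for m in messages:
        parent = m.get("in_reply_to")
        if parent in sender_of:
            src, dst = resolve(m["from"]), sender_of[parent]
            if src != dst:  # ignore people replying to themselves
                edges[(src, dst)] += 1
    return edges

# Hypothetical thread of four messages.
messages = [
    {"id": "m1", "from": "alice@example.org", "in_reply_to": None},
    {"id": "m2", "from": "Bob@example.org", "in_reply_to": "m1"},
    {"id": "m3", "from": "alice@users.example.net", "in_reply_to": "m2"},
    {"id": "m4", "from": "alice@example.org", "in_reply_to": "m3"},
]
edges = reply_edges(messages)
```

Without the alias table, the two addresses used by the same person would appear as two separate nodes, which is the kind of distortion identity resolution is meant to prevent.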
  9. Temporal Aggregation (figures from Howison et al., 2006)
  10. Network Workflow
  11. Network Results
      • Different levels of correlation between venues, suggesting different types of interactions
      • User venues more decentralized than developer venues, reflecting greater number of participants
      • Overall trend toward decentralization could be result of different influences
      • Observed anomalous patterns in trackers for both projects: periodic centralization spikes
      Cleaning up before shutting down
      • A single user makes batch bug closings (up to 279!)
        – Fire’s (feature request) tracker housekeeping appears to be preparation for project closure
        – Gaim’s tracker housekeeping was more regular and repeated
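
To make the idea of a "centralization spike" concrete, the sketch below scores each time window of interaction events with Freeman's degree centralization, where 0 means evenly spread ties and 1 means a star around one actor. The slides do not state which centralization measure or window length was used, so both are assumptions here, as are the `(day, actor, target)` event tuples.

```python
from collections import defaultdict

def degree_centralization(edges):
    """Freeman degree centralization of an undirected simple graph (0 = even, 1 = star)."""
    neighbors = defaultdict(set)
    for a, b in edges:
        if a != b:
            neighbors[a].add(b)
            neighbors[b].add(a)
    n = len(neighbors)
    if n < 3:
        return 0.0  # the measure is undefined for fewer than three nodes
    degrees = [len(v) for v in neighbors.values()]
    top = max(degrees)
    return sum(top - d for d in degrees) / ((n - 1) * (n - 2))

def centralization_by_window(events, window_days):
    """Group (day, actor, target) events into fixed windows and score each window."""
    buckets = defaultdict(list)
    for day, actor, target in events:
        buckets[day // window_days].append((actor, target))
    return {w: degree_centralization(e) for w, e in sorted(buckets.items())}

# Hypothetical events: an evenly connected first window, then one user touching everyone.
events = [
    (0, "a", "b"), (5, "b", "c"), (12, "c", "a"),
    (31, "u", "a"), (33, "u", "b"), (40, "u", "c"), (41, "u", "d"),
]
scores = centralization_by_window(events, window_days=30)
```

In this toy data the second window is a pure star, so its score jumps to the maximum. A single user closing many items in a batch would produce a similar jump.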
  12. Example: Classification
      • Replication of success-tragedy classification
        • Classification criteria originally drawn from interviews with community members
        • Data extracted from repositories
      • Technical challenges
        • Merging data from two repositories
        • Processing large volume of data in multiple steps
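
Merging data from two repositories, the first technical challenge on the Classification slide, usually comes down to joining records on a shared key and tracking what fails to match. The sketch below assumes the key is a project name and that each repository yields a mapping from name to record. The function `merge_by_project`, the field names, and the sample projects are hypothetical.

```python
def merge_by_project(primary, secondary):
    """Join two {project_name: record} mappings on a normalized project name."""
    def norm(name):
        return name.strip().lower()

    left = {norm(k): v for k, v in primary.items()}
    right = {norm(k): v for k, v in secondary.items()}
    # Projects present in both sources get their fields combined.
    merged = {k: {**left[k], **right[k]} for k in left.keys() & right.keys()}
    # Projects present in only one source are reported, not silently dropped.
    unmatched = sorted(left.keys() ^ right.keys())
    return merged, unmatched

# Hypothetical extracts from two repositories.
repo_a = {"Alpha": {"downloads": 120}, "beta ": {"downloads": 0}}
repo_b = {"alpha": {"releases": 3}, "gamma": {"releases": 1}}

merged, unmatched = merge_by_project(repo_a, repo_b)
```

Returning the unmatched names alongside the merged records keeps the loss of projects during the merge visible, which matters when comparing totals against an earlier study.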
  13. Variables
      • Inputs: project names and 5 threshold values for classification tests, e.g. number of downloads
      • Project statistics retrieved from repositories
        • Founding date
        • Data collection date
        • Dates for all releases
        • Number of downloads
        • URL
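
The project statistics listed on the Variables slide map naturally onto a small record type, from which threshold tests can be computed. The sketch below is illustrative only: the slide names number of downloads as one example threshold but does not give the five values or the classification rules, so `meets_download_threshold`, the derived measures, and the demo project are assumptions rather than the study's actual tests.

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class ProjectStats:
    """One project's statistics, using the fields listed on the slide."""
    name: str
    founding_date: date
    collection_date: date
    release_dates: list = field(default_factory=list)
    downloads: int = 0
    url: str = ""

    def age_days(self):
        """Days between founding and data collection."""
        return (self.collection_date - self.founding_date).days

    def days_since_last_release(self):
        """Days from the latest release to data collection, or None if never released."""
        if not self.release_dates:
            return None
        return (self.collection_date - max(self.release_dates)).days

def meets_download_threshold(project, min_downloads):
    """Hypothetical threshold test: has the project reached min_downloads?"""
    return project.downloads >= min_downloads

# Hypothetical project record.
demo = ProjectStats(
    name="demo-project",
    founding_date=date(2005, 1, 1),
    collection_date=date(2006, 1, 1),
    release_dates=[date(2005, 3, 1), date(2005, 12, 2)],
    downloads=25,
    url="https://example.org/demo-project",
)
```

Keeping the thresholds as function parameters rather than constants mirrors the slide's framing of them as inputs, so a replication can be rerun with different values.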
  14. Classification workflow
  15. Classification Results
      Class             Original          Our results       Difference
      unclassifiable      3 186             3 296           +110
      II                 13 342 (12%)      16 252 (14%)     +2 910 (+2%)
      IG                 10 711 (10%)      12 991 (11%)     +2 280 (+1%)
      TI                 37 320 (35%)      36 507 (31%)     -813 (-4%)
      TG                 30 592 (28%)      32 642 (28%)     +2 050 (0%)
      SG                 15 782 (15%)      16 045 (14%)     +263 (-1%)
      other               8 422                 0
      Total             119 355           117 733
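
The count columns of the Classification Results table can be cross-checked with a few lines of arithmetic. The sketch below uses only the counts shown on the slide and recomputes the per-class differences and the two totals. It does not recompute the percentages, because the slide does not say which denominator they use.

```python
# Counts copied from the Classification Results slide.
original = {
    "unclassifiable": 3186, "II": 13342, "IG": 10711,
    "TI": 37320, "TG": 30592, "SG": 15782, "other": 8422,
}
replication = {
    "unclassifiable": 3296, "II": 16252, "IG": 12991,
    "TI": 36507, "TG": 32642, "SG": 16045, "other": 0,
}

# The slide reports no difference for "other", so it is left out here.
differences = {
    cls: replication[cls] - original[cls]
    for cls in original
    if cls != "other"
}

original_total = sum(original.values())
replication_total = sum(replication.values())

for cls, delta in differences.items():
    print(f"{cls:15s} {delta:+d}")
print("totals:", original_total, replication_total)
```

The recomputed differences and totals agree with the figures on the slide.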
  16. Thanks!
      • Questions?