Secondary data analysis with digital trace data
Published in: Technology, Education

Transcript

  • 1. Secondary data analysis with digital trace data: Examples from FLOSS research. Andrea Wiggins, 13 July 2011
  • 2. Secondary Data Analysis
      • Uses existing data produced or collected by someone else, usually for a different purpose
          • Databases
          • Repositories
          • Surveys
          • Emails
          • Social networks
  • 3. Digital Trace Data
      • Records of activity (trace data) undertaken through an online information system (thus digital)
      • Increasingly common in studies of online phenomena
          • Large volumes of available data
          • Can be complete: a census, not a sample
          • May be more reliably recorded than other data
  • 4. Characteristics
      1. Found data (not produced for research)
      2. Event-based data (not summary data)
      3. Events occur over time, so it is longitudinal data
  • 5. Requirements
      • Understand the original data source
          • How it was collected, potential problems
          • Limitations of the sample
          • What the data describe
      • Match with appropriate analysis methods and measures
          • New types of data may require new measures
          • Theoretical coherence is very important
  • 6. Advantages
      • Data may be “complete”
          • Usually no response bias (exception: cookies)
          • May cover long periods of time and large groups
          • Multiple different data types, but mostly textual
      • Data are often easy to acquire
          • APIs or scraping web pages (with caution)
          • Databases, archives, or repositories of research data
      • But remember: you usually get what you pay for!
  • 7. Disadvantages
      • Often difficult to know limitations of data
          • Data may be poorly documented
          • Original creator may not be available for comment
      • Volume of data can be overwhelming
          • Sampling strategies needed, e.g., temporal, random
          • Substantial time required for data preparation: 90% of effort
          • Exceptions are everywhere and will break analyses, but can only be discovered through trial and error
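The two sampling strategies named on the Disadvantages slide (temporal and random) can be sketched as follows. This is a minimal illustration and not the presenter's code: the event records, the `timestamp` field, and both function names are assumptions made for the example.

```python
import random
from datetime import datetime


def random_sample(events, k, seed=0):
    """Draw a simple random sample of up to k events, without replacement."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return rng.sample(events, min(k, len(events)))


def temporal_sample(events, start, end):
    """Keep only events whose timestamp falls in the half-open range [start, end)."""
    return [e for e in events if start <= e["timestamp"] < end]


# Hypothetical trace data: 24 events, two per calendar month of 2010.
events = [{"id": i, "timestamp": datetime(2010, 1 + i % 12, 1)} for i in range(24)]

subset = random_sample(events, 5)
first_quarter = temporal_sample(events, datetime(2010, 1, 1), datetime(2010, 4, 1))
```

A temporal sample keeps the event sequence inside the chosen period intact, which matters for longitudinal analysis; a random sample does not.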
  • 8. Example: Email Networks
      • Data source: email listservs for FLOSS projects
      • Analysis approach: create social networks
          • Within discussion threads, individuals are nodes, and links are reply-to messages
          • Some conceptual issues for interpretation, choice of measures
      • Technical challenges
          • Temporal aggregation
          • Identity resolution
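The network construction described on the Email Networks slide (individuals as nodes, reply-to messages as links) could look roughly like the sketch below. The message dictionaries and their `id`, `from`, and `in_reply_to` keys are hypothetical. The identity resolution shown here, lowercasing the bare address, is only the simplest possible rule and is an assumption, not the method used in the study.

```python
from collections import Counter
from email.utils import parseaddr


def resolve_identity(from_header):
    """Naive identity resolution: reduce a From header to a lowercased address."""
    _, address = parseaddr(from_header)
    return address.lower()


def reply_network(messages):
    """Count directed reply links (replier -> original sender) between individuals."""
    by_id = {m["id"]: m for m in messages}
    edges = Counter()
    for m in messages:
        parent = by_id.get(m.get("in_reply_to"))
        if parent is None:
            continue  # thread starter, or the parent message is missing from the data
        src = resolve_identity(m["from"])
        dst = resolve_identity(parent["from"])
        if src != dst:  # ignore self-replies
            edges[(src, dst)] += 1
    return edges


# Hypothetical thread with the same two people writing under varying headers.
messages = [
    {"id": "m1", "from": "Ann <ann@example.org>", "in_reply_to": None},
    {"id": "m2", "from": "Bob <bob@example.org>", "in_reply_to": "m1"},
    {"id": "m3", "from": "ANN@example.org", "in_reply_to": "m2"},
    {"id": "m4", "from": "bob@example.org", "in_reply_to": "m1"},
]
edges = reply_network(messages)
```

Real lists need more than this: one person often posts from several addresses, so a rule based on the address alone will split some individuals into multiple nodes.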
  • 9. Temporal Aggregation (figures from Howison et al., 2006)
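The figures for this slide are not part of the transcript. As a rough illustration of what temporal aggregation involves, the sketch below groups timestamped reply events into consecutive fixed-width windows, so that one network can be built per window. The event tuples, the function name, and the 30-day width are all assumptions; the windowing scheme in Howison et al. may differ.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def aggregate_by_window(events, start, window_days):
    """Group (timestamp, sender, recipient) events into consecutive fixed-width windows."""
    width = timedelta(days=window_days)
    windows = defaultdict(list)
    for timestamp, sender, recipient in events:
        if timestamp < start:
            continue  # outside the study period
        index = (timestamp - start) // width  # window number, starting at 0
        windows[index].append((sender, recipient))
    return dict(windows)


# Hypothetical reply events.
events = [
    (datetime(2005, 1, 3), "a", "b"),
    (datetime(2005, 1, 20), "b", "a"),
    (datetime(2005, 2, 15), "c", "a"),
]
windows = aggregate_by_window(events, datetime(2005, 1, 1), window_days=30)
```

The window width is an analytic choice: short windows give sparse networks, while long windows can hide changes over time.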
  • 10. Network Workflow
  • 11. Network Results
      • Different levels of correlation between venues, suggesting different types of interactions
      • User venues more decentralized than developer venues, reflecting greater number of participants
      • Overall trend toward decentralization could be result of different influences
      • Observed anomalous patterns in trackers for both projects: periodic centralization spikes
      Cleaning up before shutting down
      • A single user makes batch bug closings (up to 279!)
          – Fire’s (feature request) tracker housekeeping appears to be preparation for project closure
          – Gaim’s tracker housekeeping was more regular and repeated
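The slides do not say which centralization measure was used. One common choice is Freeman's degree centralization, sketched here for an undirected network as an assumption. It is 1.0 for a star, where one node is linked to everyone, and 0.0 when all nodes have the same degree. A batch of bug closings by a single user would push a tracker network toward the star pattern.

```python
def degree_centralization(edges):
    """Freeman degree centralization of an undirected simple graph given as edge pairs.

    Only nodes that appear in at least one edge are counted.
    """
    neighbors = {}
    for a, b in edges:
        if a == b:
            continue  # ignore self-loops
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    n = len(neighbors)
    if n < 3:
        return 0.0  # the measure is undefined for fewer than three nodes
    degrees = [len(linked) for linked in neighbors.values()]
    top = max(degrees)
    # Normalize by the largest possible sum of differences, reached by a star graph.
    return sum(top - d for d in degrees) / ((n - 1) * (n - 2))


star = [("hub", "a"), ("hub", "b"), ("hub", "c")]
triangle = [("a", "b"), ("b", "c"), ("a", "c")]
star_score = degree_centralization(star)
triangle_score = degree_centralization(triangle)
```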
  • 12. Example: Classification
      • Replication of success-tragedy classification
          • Classification criteria originally drawn from interviews with community members
          • Data extracted from repositories
      • Technical challenges
          • Merging data from two repositories
          • Processing large volume of data in multiple steps
  • 13. Variables
      • Inputs: project names and 5 threshold values for classification tests, e.g. number of downloads
      • Project statistics retrieved from repositories
          • Founding date
          • Data collection date
          • Dates for all releases
          • Number of downloads
          • URL
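Merging data from two repositories, listed as a technical challenge in the classification example, can be illustrated with a join on project name. The field names below follow the variables on the slide, but the record layout, the sample values, and the rule that the first repository wins on conflicts are assumptions made for this sketch.

```python
def merge_projects(primary, secondary):
    """Merge per-project records from two sources, keyed by project name.

    Values from `primary` take precedence; missing (None) values never
    overwrite a value that the other source provides.
    """
    merged = {}
    for name in sorted(set(primary) | set(secondary)):
        record = {}
        for source in (secondary.get(name, {}), primary.get(name, {})):
            for field, value in source.items():
                if value is not None:
                    record[field] = value
        merged[name] = record
    return merged


# Hypothetical records from two repositories.
repo_a = {
    "alpha": {"founding_date": "2003-05-01", "downloads": None},
    "beta": {"founding_date": "2004-11-12", "downloads": 120},
}
repo_b = {
    "alpha": {"downloads": 4500, "url": "http://example.org/alpha"},
    "gamma": {"founding_date": "2006-02-20", "downloads": 0},
}
projects = merge_projects(repo_a, repo_b)
```

Projects that appear in only one source are kept with whatever fields that source has, so later steps need to handle incomplete records.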
  • 14. Classification workflow
  • 15. Classification Results
      Class            Original         Our results      Difference
      unclassifiable     3 186            3 296          +110
      II                13 342 (12%)     16 252 (14%)    +2 910 (+2%)
      IG                10 711 (10%)     12 991 (11%)    +2 280 (+1%)
      TI                37 320 (35%)     36 507 (31%)    -813 (-4%)
      TG                30 592 (28%)     32 642 (28%)    +2 050 (0%)
      SG                15 782 (15%)     16 045 (14%)    +263 (-1%)
      other              8 422                0
      Total            119 355          117 733
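The counts in the results table can be checked with a few lines of arithmetic. The sketch below recomputes the per-class differences and the column totals from the figures on the slide. The percentages are left out because the slide does not state which denominator they use.

```python
# (original count, replication count) per class, copied from the results table.
counts = {
    "unclassifiable": (3186, 3296),
    "II": (13342, 16252),
    "IG": (10711, 12991),
    "TI": (37320, 36507),
    "TG": (30592, 32642),
    "SG": (15782, 16045),
    "other": (8422, 0),
}

differences = {name: ours - original for name, (original, ours) in counts.items()}
total_original = sum(original for original, _ in counts.values())
total_ours = sum(ours for _, ours in counts.values())

for name, diff in differences.items():
    print(f"{name:15s} {diff:+d}")
print("totals:", total_original, total_ours)
```

The recomputed totals match the Total row of the table, and the differences match the Difference column for every class where the slide reports one.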
  • 16. Thanks!
      • Questions?