Secondary data analysis with digital trace data

Presentation Transcript

  • Secondary data analysis with digital trace data: Examples from FLOSS research
    Andrea Wiggins, 13 July 2011
  • Secondary Data Analysis
    • Uses existing data produced or collected by someone else, usually for a different purpose
      • Databases
      • Repositories
      • Surveys
      • Emails
      • Social networks
  • Digital Trace Data
    • Records of activity (trace data) undertaken through an online information system (thus digital)
    • Increasingly common in studies of online phenomena
      • Large volumes of available data
      • Can be complete: a census, not a sample
      • May be more reliably recorded than other data
  • Characteristics
    1. Found data (not produced for research)
    2. Event-based data (not summary data)
    3. Events occur over time, so the data are longitudinal
  • Requirements
    • Understand the original data source
      • How it was collected, potential problems
      • Limitations of the sample
      • What the data describe
    • Match with appropriate analysis methods and measures
      • New types of data may require new measures
      • Theoretical coherence is very important
  • Advantages
    • Data may be “complete”
      • Usually no response bias (exception: cookies)
      • May cover long periods of time and large groups
      • Multiple different data types, but mostly textual
    • Data are often easy to acquire
      • APIs or scraping web pages (with caution)
      • Databases, archives, or repositories of research data
    • But remember: you usually get what you pay for!
  • Disadvantages
    • Often difficult to know limitations of data
      • Data may be poorly documented
      • Original creator may not be available for comment
    • Volume of data can be overwhelming
      • Sampling strategies needed, e.g., temporal, random
      • Substantial time required for data preparation: 90% of effort
      • Exceptions are everywhere and will break analyses, but can only be discovered through trial and error
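The two sampling strategies named on this slide (temporal and random) can be illustrated with a minimal sketch. The event records, the function names `temporal_sample` and `random_sample`, and the fixed seed are all assumptions made for illustration; they are not taken from the presentation.

```python
import random
from datetime import datetime

# Hypothetical event records (timestamp, actor). Real trace data would be
# loaded from a repository dump or an API instead.
events = [(datetime(2005, month, day), f"user{day % 3}")
          for month in (1, 2, 3) for day in range(1, 11)]

def temporal_sample(events, start, end):
    """Temporal sampling: keep events with start <= timestamp < end."""
    return [e for e in events if start <= e[0] < end]

def random_sample(events, k, seed=0):
    """Random sampling: draw k events without replacement, reproducibly."""
    rng = random.Random(seed)  # a fixed seed keeps the sample repeatable
    return rng.sample(events, min(k, len(events)))

february = temporal_sample(events, datetime(2005, 2, 1), datetime(2005, 3, 1))
subset = random_sample(events, 5, seed=42)
```

Seeding the generator is a deliberate choice here: with found data, being able to redraw exactly the same sample makes it easier to trace an analysis failure back to a specific record.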
  • Example: Email Networks
    • Data source: email listservs for FLOSS projects
    • Analysis approach: create social networks
      • Within discussion threads, individuals are nodes, and links are reply-to messages
      • Some conceptual issues for interpretation, choice of measures
    • Technical challenges
      • Temporal aggregation
      • Identity resolution
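As a rough sketch of the approach on this slide, the following builds one reply network per calendar month, with senders as nodes and reply-to messages as weighted links. The message tuples, the monthly window, and the function name `reply_networks_by_month` are assumptions for illustration. Identity resolution (one person using several addresses) is not handled here and would need to happen before this step.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical messages: (message_id, sender, in_reply_to, sent_at).
messages = [
    ("m1", "alice", None, datetime(2005, 1, 3)),
    ("m2", "bob", "m1", datetime(2005, 1, 4)),
    ("m3", "carol", "m1", datetime(2005, 1, 20)),
    ("m4", "alice", "m2", datetime(2005, 2, 2)),
    ("m5", "bob", "m4", datetime(2005, 2, 10)),
]

def reply_networks_by_month(messages):
    """Return {(year, month): {(replier, replied_to): message_count}}."""
    sender_of = {mid: sender for mid, sender, _, _ in messages}
    networks = defaultdict(lambda: defaultdict(int))
    for mid, sender, parent, sent_at in messages:
        if parent is None or parent not in sender_of:
            continue  # thread starters and replies to unseen messages add no link
        window = (sent_at.year, sent_at.month)  # temporal aggregation step
        networks[window][(sender, sender_of[parent])] += 1
    return {w: dict(edges) for w, edges in networks.items()}

networks = reply_networks_by_month(messages)
```

The window size is the key design decision: a reply is assigned to the month in which it was sent, so a wider or narrower window changes which links appear together in one network.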
  • Temporal Aggregation (figures from Howison et al., 2006)
  • Network Workflow (figure)
  • Network Results
    • Different levels of correlation between venues, suggesting different types of interactions
    • User venues more decentralized than developer venues, reflecting greater number of participants
    • Overall trend toward decentralization could be result of different influences
    • Observed anomalous patterns in trackers for both projects: periodic centralization spikes
    • Cleaning up before shutting down: a single user makes batch bug closings (up to 279!)
      – Fire’s (feature request) tracker housekeeping appears to be preparation for project closure
      – Gaim’s tracker housekeeping was more regular and repeated
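The slides do not say which centralization measure was used. As an assumption, the sketch below uses Freeman's degree centralization for an undirected network, which is 1.0 for a star (one actor connected to everyone else) and 0.0 when all actors have the same degree. It shows why a single user closing many bugs in one batch would appear as a centralization spike. The function name and example edge lists are hypothetical.

```python
def degree_centralization(edges):
    """Freeman degree centralization of an undirected, unweighted network.

    Computed as sum(max_degree - degree_i) / ((n - 1) * (n - 2)).
    """
    neighbors = {}
    for a, b in edges:
        if a == b:
            continue  # ignore self-links
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    n = len(neighbors)
    if n < 3:
        return 0.0  # the measure is undefined for fewer than three nodes
    degrees = [len(v) for v in neighbors.values()]
    max_degree = max(degrees)
    return sum(max_degree - d for d in degrees) / ((n - 1) * (n - 2))

# One actor linked to everyone else, as in a batch of closings by one user.
star = [("hub", "a"), ("hub", "b"), ("hub", "c")]
# Every actor has the same number of links.
ring = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "a")]

star_score = degree_centralization(star)
ring_score = degree_centralization(ring)
```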
  • Example: Classification
    • Replication of success-tragedy classification
      • Classification criteria originally drawn from interviews with community members
      • Data extracted from repositories
    • Technical challenges
      • Merging data from two repositories
      • Processing large volume of data in multiple steps
  • Variables
    • Inputs: project names and 5 threshold values for classification tests, e.g. number of downloads
    • Project statistics retrieved from repositories
      • Founding date
      • Data collection date
      • Dates for all releases
      • Number of downloads
      • URL
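To show how the listed variables could feed threshold-based tests, here is a minimal sketch. The slides do not give the five thresholds or the classification rules, so the three thresholds, their values, and the test names below are purely hypothetical. The sketch reports test outcomes and does not attempt to reproduce the actual success-tragedy classes.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Project:
    """Project statistics as listed on the slide."""
    name: str
    founded: date
    collected: date   # data collection date
    releases: list    # dates of all releases
    downloads: int
    url: str

# Hypothetical threshold values; the real study used five, not shown in the slides.
THRESHOLDS = {"min_releases": 3, "min_downloads": 10, "max_days_since_release": 365}

def threshold_tests(project, thresholds=THRESHOLDS):
    """Compare derived measures against thresholds; return one flag per test."""
    last_release = max(project.releases) if project.releases else None
    days_idle = (project.collected - last_release).days if last_release else None
    return {
        "enough_releases": len(project.releases) >= thresholds["min_releases"],
        "enough_downloads": project.downloads >= thresholds["min_downloads"],
        "recently_released": days_idle is not None
        and days_idle <= thresholds["max_days_since_release"],
    }

demo = Project("demo", date(2004, 1, 1), date(2006, 6, 1),
               [date(2004, 6, 1), date(2005, 1, 1), date(2006, 1, 1)],
               500, "http://example.org/demo")
empty = Project("empty", date(2004, 1, 1), date(2006, 6, 1), [], 0,
                "http://example.org/empty")

demo_flags = threshold_tests(demo)
empty_flags = threshold_tests(empty)
```

Keeping the thresholds as explicit inputs mirrors the slide's description, where threshold values are supplied alongside project names so the tests can be rerun with different cut-offs.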
  • Classification Workflow (figure)
  • Classification Results
    Class            Original        Our results     Difference
    unclassifiable     3 186           3 296           +110
    II                13 342 (12%)    16 252 (14%)   +2 910 (+2%)
    IG                10 711 (10%)    12 991 (11%)   +2 280 (+1%)
    TI                37 320 (35%)    36 507 (31%)     -813 (-4%)
    TG                30 592 (28%)    32 642 (28%)   +2 050 (0%)
    SG                15 782 (15%)    16 045 (14%)     +263 (-1%)
    other              8 422               0
    Total            119 355         117 733
  • Thanks! Questions?