Collaborative Data Analysis with Taverna Workflows

Andrea Wiggins, Kevin Crowston & James Howison, 16 October 2009

FLOSS Phenomenon
- Free/Libre Open Source Software
- Distributed collaboration to develop software
- Typical social research topics
  - Coordination and collaboration
  - Growth and evolution (social and code)
  - Code quality
  - Motivation, leadership, success
  - Culture and community
  - Intellectual property and copyright

eScience Proof-of-Concept
- Project to replicate published FLOSS research using eScience approaches
  - Use existing shared data sets
  - Develop workflows collaboratively
  - Build library of reusable components
- Selected several papers to replicate based on:
  - Data availability
  - Suitability of analytical approach

Research Replication
- English & Schweik, 2007: classifies projects based on metrics for success and stage of project growth
- Howison et al., 2006: examines dynamics of social networks of project communications over time
- Conklin, 2004: examines distribution of project sizes for evidence of preferential attachment theory of growth in networks

FLOSS Research Data
- Data sources include interviews, surveys, and ethnographic fieldwork
- Digital “trace” data
  - Archival, secondary, by-product of work
  - Easy to get, but hard to use
- Repositories
  - Hosting “forges” like SourceForge, FreshMeat, RubyForge, etc.
- RoRs: Repositories of Repositories
  - Data sources for research

RoRs: FLOSSmole
- Public access to 300+ GB data
- 300K+ projects from 8 repositories
- Politely scraped, then parsed
- Flat files & SQL datamarts
- Released monthly via SF & GC
- 5 TB allotment on TeraGrid @ SDSC
  - Allows direct database access without compromising our humble server

RoRs: SRDA
- SourceForge Research Data Archive
- Gated researcher-only access to a 300 GB+ SQL db of monthly dumps from SourceForge
- Original obtuse structure, regular table deprecation, some limited documentation

Analysis Tool Requirements
- Scalability
  - Move analysis from small n’s to big(ger) n’s
- Data meshing
  - Reproducing research required analysis of data drawn from multiple RoRs
- Collaborative analysis design
  - Needed to tap into diverse skills from different contributors
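The data meshing requirement can be illustrated with a small Python sketch. It is not the Taverna workflow from the talk; it only shows the general idea of joining records from two RoRs on a shared key. The field names (`project_name`, `downloads`, `release_count`) and the sample rows are assumptions made up for illustration.

```python
def mesh_projects(flossmole_rows, srda_rows, key="project_name"):
    """Inner-join two lists of dict records on a shared key.

    The key name and record layout are hypothetical; real RoR
    schemas differ and usually need a mapping step first.
    """
    # Index one source by key so each lookup is a dictionary access.
    srda_by_key = {row[key]: row for row in srda_rows}
    merged = []
    for row in flossmole_rows:
        match = srda_by_key.get(row[key])
        if match is not None:
            combined = dict(match)   # copy so the inputs are not mutated
            combined.update(row)     # first source wins on shared fields
            merged.append(combined)
    return merged


# Made-up records standing in for rows drawn from two repositories.
flossmole = [
    {"project_name": "alpha", "downloads": 120},
    {"project_name": "beta", "downloads": 7},
]
srda = [
    {"project_name": "alpha", "release_count": 4},
    {"project_name": "gamma", "release_count": 1},
]
meshed = mesh_projects(flossmole, srda)
```

Only projects present in both sources survive the join, which is one reason replications that draw on multiple RoRs can end up with a smaller n than either source alone.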

Integrating Diverse Skills
- Collaborator 1: Data Wrangler
  - Expert with data sources and handling
  - Multiple coding languages and great technical skills
- Collaborator 2: Analyst
  - Competent with R, but no other coding skills
  - Good at debugging
- Collaborator 3: PI
  - Helps find solutions when all else fails

Taverna
- Scientific workflow tool
- Free. Open. We like that.
- Responsive support from myGrid team, lively user community
- Additional collaboration support via myExperiment
- Combined features and flexibility met our needs

Work Process
- Evaluate paper’s data, methods, and findings
- Develop abstract workflow together, focusing on functionality
- Split the work(flow) between data and analysis, specifying names and forms of inputs and outputs at the boundary
- Independent individual development and testing, using dummy inputs
- Integration of partial workflows
- Test (debug, test…) and run
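The split between data and analysis work can be sketched in Python as an agreed boundary contract plus dummy inputs. This is an illustration of the idea only, not the team's Taverna implementation; the field names, types, and the `mean_downloads` analysis step are all hypothetical.

```python
# Hypothetical contract agreed at the data/analysis boundary:
# field name -> expected Python type.
BOUNDARY_FIELDS = {"project_name": str, "downloads": int}


def validate_boundary(rows):
    """Check that every record matches the agreed names and forms."""
    for i, row in enumerate(rows):
        for field, expected in BOUNDARY_FIELDS.items():
            if field not in row:
                raise ValueError(f"row {i} is missing field {field!r}")
            if not isinstance(row[field], expected):
                raise TypeError(f"row {i} field {field!r} is not {expected.__name__}")
    return rows


def dummy_inputs(n=3):
    """Stand-in for the data half of the workflow during development."""
    return [{"project_name": f"dummy{i}", "downloads": i * 10} for i in range(n)]


def mean_downloads(rows):
    """Analysis half: depends only on the boundary contract."""
    rows = validate_boundary(rows)
    return sum(r["downloads"] for r in rows) / len(rows)


result = mean_downloads(dummy_inputs())
```

Because the analysis step checks only the contract, the dummy generator can later be swapped for the real data component without touching the analysis code.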

“Identifying success and tragedy of FLOSS Commons”
- Replication of English & Schweik, 2007
  - Classification of project success by stage of growth for 110K projects
  - Requires data from 2 repositories, FLOSSmole & SRDA
- Extension
  - Parameterized all thresholds
  - Tested additional criteria
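A minimal Python sketch of what parameterizing all thresholds can look like. The rules and default values below are placeholders invented for illustration; they are not the criteria from English & Schweik (2007) and not the workflow used in the replication.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    # Illustrative defaults only, NOT the published criteria.
    min_releases: int = 3
    min_downloads: int = 10
    max_idle_days: int = 365


def classify(releases, downloads, days_since_last_activity, t=Thresholds()):
    """Return (stage, outcome) for one project under thresholds t."""
    stage = "growth" if releases >= 1 else "initiation"
    if stage == "initiation":
        # No release yet: a long idle period counts as tragedy.
        idle = days_since_last_activity > t.max_idle_days
        outcome = "tragedy" if idle else "indeterminate"
    elif releases >= t.min_releases and downloads >= t.min_downloads:
        outcome = "success"
    elif days_since_last_activity > t.max_idle_days:
        outcome = "tragedy"
    else:
        outcome = "indeterminate"
    return stage, outcome


# Re-running with a stricter threshold requires no code change.
default_result = classify(5, 100, 10)
strict_result = classify(5, 100, 10, Thresholds(min_releases=6))
```

Keeping every cut-off in one object makes it straightforward to rerun the same classification under alternative criteria.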

Key Strategies
- Consciously worked to maintain transparency and modularity
  - Copious code comments
  - Metadata makes workflows self-explanatory
- Designed components for reuse
  - Particularly data handling and “shims”
- Assigned the right work to the right people
  - Created interdependencies, but reduced the initial learning curve

Important Details
- Used SVN for version management
  - myExperiment can now manage this much more easily
- Set up server caching on query results to speed up testing
- Created an OWL ontology to map between RoRs, which significantly improved performance
  - Nontrivial effort!
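The talk does not say how the server-side caching was configured. As a rough illustration of why caching query results speeds up repeated test runs, here is an in-process Python sketch using the standard `sqlite3` module; the `CachedQuery` class, the table, and its columns are hypothetical.

```python
import sqlite3


class CachedQuery:
    """Memoize query results keyed on the SQL text and parameters."""

    def __init__(self, conn):
        self.conn = conn
        self.cache = {}
        self.misses = 0  # number of times the database was actually queried

    def fetch(self, sql, params=()):
        key = (sql, tuple(params))
        if key not in self.cache:
            self.misses += 1
            self.cache[key] = self.conn.execute(sql, params).fetchall()
        return self.cache[key]


# Tiny in-memory database standing in for a large research archive.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE projects (name TEXT, downloads INTEGER)")
conn.executemany("INSERT INTO projects VALUES (?, ?)",
                 [("alpha", 120), ("beta", 7)])

cached = CachedQuery(conn)
sql = "SELECT name FROM projects WHERE downloads > ?"
first = cached.fetch(sql, (10,))
second = cached.fetch(sql, (10,))  # served from the cache, no second query
```

A cache like this is only safe while the underlying data is static, which fits a workflow tested against periodic dumps but not a live database.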

Extending Analyses
- Replications implemented some of authors’ suggestions for future work
- Also implemented our own variations
- Easy to add and modify these after the initial replication development
- Ran analyses on larger data sets than original studies

Workflow Re-use
- Specifically designed workflows and components for re-use
  - Components for sampling and analysis had no constants, only parameters
- Effortful development for data handling paid off
  - “Plug-and-play” components used in every subsequent workflow
- Shifts the challenge from data to research, where it should be!
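As a hedged illustration of "no constants, only parameters", a sampling component might look like the Python sketch below. The function name and signature are assumptions; the original components were Taverna workflow components, not this code.

```python
import random


def sample_projects(projects, sample_size, seed):
    """Draw a reproducible random sample.

    Sample size and seed are inputs rather than hard-coded values,
    so the same component can be reused in any workflow.
    """
    rng = random.Random(seed)          # private generator, no global state
    pool = list(projects)
    k = min(sample_size, len(pool))    # never request more than is available
    return rng.sample(pool, k)


# Hypothetical project names for demonstration.
names = [f"project{i}" for i in range(100)]
sample_a = sample_projects(names, 10, seed=42)
sample_b = sample_projects(names, 10, seed=42)
```

Passing the seed explicitly means two collaborators running the workflow obtain the same sample, which matters when a replication is compared against published results.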

Challenges with Using Workflows
- Software usability - continually improving
  - Bugs wreaked havoc at times
- Data handling
  - Continually more challenging than expected
- No existing web services, nor appropriate examples to emulate
  - All bioscience, no social science

Barriers to Uptake
- Little science issues
  - Many paradigms: lack of agreement in research focus, theory, methods
  - Lack of incentives to collaborate
- Bimodal distribution of requisite skills
  - “I can’t possibly do that! I can’t code!”
  - “Why bother? I can code my own. You should too; just use Python.”
- Students are more willing to experiment with tools and new approaches

Example: Recent Research
- Estimating user base and potential user interest based on common release-and-download patterns
- Downloads a proxy for project success, a common dependent variable

“Normal” Download-Release Patterns
- BibDesk: users update fairly quickly after releases

Taverna’s Download-Release Patterns
- External effects!
- [Chart annotations: 1.3.2-RC1; +2 presentations; 1.5.0; ?; ?]

Interpretation
- Taverna is not a “normal” open source project
- Speaking tours, tutorials, articles, and other events influence downloads

Questions?
- More: Poster on eScience for FLOSS
- floss.syr.edu
- www.myexperiment.org/groups/64
