Collaborative Data Analysis with Taverna Workflows
Andrea Wiggins, Kevin Crowston & James Howison
16 October 2009

FLOSS Phenomenon
  Free/Libre Open Source Software
    Distributed collaboration to develop software
  Typical social research topics
    Coordination and collaboration
    Growth and evolution (social and code)
    Code quality
    Motivation, leadership, success
    Culture and community
    Intellectual property and copyright

eScience Proof-of-Concept
  Project to replicate published FLOSS research using eScience approaches
    Use existing shared data sets
    Develop workflows collaboratively
    Build library of reusable components
  Selected several papers to replicate based on:
    Data availability
    Suitability of analytical approach

Research Replication
  Description | Study
  Classifies projects based on metrics for success and stage of project growth | English & Schweik, 2007
  Examines dynamics of social networks of project communications over time | Howison et al., 2006
  Examines distribution of project sizes for evidence of preferential attachment theory of growth in networks | Conklin, 2004

FLOSS Research Data
  Data sources include interviews, surveys, and ethnographic fieldwork
  Digital “trace” data
    Archival, secondary, by-product of work
    Easy to get, but hard to use
  Repositories
    Hosting “forges” like SourceForge, FreshMeat, RubyForge, etc.
  RoRs: Repositories of Repositories
    Data sources for research

RoRs: FLOSSmole
  Public access to 300+ GB data
    300K+ projects from 8 repositories
    Politely scraped, then parsed
    Flat files & SQL datamarts
    Released monthly via SF & GC
  5 TB allotment on TeraGrid @ SDSC
    Allows direct database access without compromising our humble server

RoRs: SRDA
  SourceForge Research Data Archive
  Gated researcher-only access to a 300 GB+ SQL db of monthly dumps from SourceForge
  Original obtuse structure, regular table deprecation, some limited documentation

Analysis Tool Requirements
  Scalability
    Move analysis from small n’s to big(ger) n’s
  Data meshing
    Reproducing research required analysis of data drawn from multiple RoRs
  Collaborative analysis design
    Needed to tap into diverse skills from different contributors

Integrating Diverse Skills
  Collaborator 1: Data Wrangler
    Expert with data sources and handling
    Multiple coding languages and great technical skills
  Collaborator 2: Analyst
    Competent with R, but no other coding skills
    Good at debugging
  Collaborator 3: PI
    Helps find solutions when all else fails

Taverna
  Scientific workflow tool
  Free. Open. We like that.
  Responsive support from myGrid team, lively user community
  Additional collaboration support via myExperiment
  Combined features and flexibility met our needs

Work Process
  Evaluate paper’s data, methods, and findings
  Develop abstract workflow together, focusing on functionality
  Split the work(flow) between data and analysis, specifying names and forms of inputs and outputs at the boundary
  Independent individual development and testing, using dummy inputs
  Integration of partial workflows
  Test (debug, test…) and run

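The split between data and analysis can be illustrated outside Taverna. The following is a minimal Python sketch, not the authors' workflow: the `ProjectRecord` fields, the function names, and the dummy values are all assumptions made for illustration. It shows the analysis half being developed and tested against hand-made dummy inputs that follow the agreed boundary form.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List


@dataclass(frozen=True)
class ProjectRecord:
    """Hypothetical boundary record: the names and forms both halves agree on."""
    project_id: str
    downloads: int
    releases: int


def dummy_data_stage() -> List[ProjectRecord]:
    """Stand-in for the data half: hand-made records in the agreed form."""
    return [
        ProjectRecord("alpha", downloads=120, releases=3),
        ProjectRecord("beta", downloads=0, releases=0),
        ProjectRecord("gamma", downloads=45, releases=1),
    ]


def analysis_stage(records: List[ProjectRecord]) -> Dict[str, float]:
    """Analysis half: depends only on the boundary form, not on the data source."""
    released = [r for r in records if r.releases > 0]
    return {
        "n_projects": len(records),
        "n_with_release": len(released),
        "mean_downloads_released": mean(r.downloads for r in released) if released else 0.0,
    }


# The analyst can test against dummy inputs before the real data stage exists.
summary = analysis_stage(dummy_data_stage())
print(summary)
```

Once the real data stage returns the same record form, it can replace `dummy_data_stage` without changes to the analysis code.
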
“Identifying success and tragedy of FLOSS Commons”
  Replication of English & Schweik, 2007
    Classification of project success by stage of growth for 110K projects
    Requires data from 2 repositories, FLOSSmole & SRDA
  Extension
    Parameterized all thresholds
    Tested additional criterion tests

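What "parameterized all thresholds" might look like can be sketched in Python. This is an illustration only: the stage and outcome labels, the fields, and every threshold value below are assumptions, not the criteria published by English & Schweik or the ones used in the replication.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Thresholds:
    """All values are illustrative assumptions, not the published criteria."""
    min_releases_for_growth: int = 1
    min_releases_for_success: int = 3
    min_downloads_for_success: int = 11
    max_inactive_days: int = 365


def classify(releases: int, downloads: int, days_since_activity: int,
             t: Thresholds = Thresholds()) -> Tuple[str, str]:
    """Return a (stage, outcome) pair; every cut-off comes from `t`."""
    stage = "growth" if releases >= t.min_releases_for_growth else "initiation"
    if days_since_activity > t.max_inactive_days:
        outcome = "tragedy"
    elif (stage == "growth"
          and releases >= t.min_releases_for_success
          and downloads >= t.min_downloads_for_success):
        outcome = "success"
    else:
        outcome = "indeterminate"
    return stage, outcome


# Re-running with a different threshold needs no change to the classifier itself.
default_result = classify(releases=5, downloads=500, days_since_activity=30)
strict_result = classify(releases=5, downloads=500, days_since_activity=30,
                         t=Thresholds(min_downloads_for_success=1000))
print(default_result, strict_result)
```

Keeping the cut-offs in one parameter object is what makes it cheap to test alternative criteria after the initial replication.
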
Classification Workflow
Key Strategies
  Consciously worked to maintain transparency and modularity
    Copious code comments
    Metadata makes workflows self-explanatory
  Designed components for reuse
    Particularly data handling and “shims”
  Assigned the right work to the right people
    Created interdependencies, but reduced the initial learning curve

Important Details
  Used SVN for version management
    myExperiment can now manage this much more easily
  Set up server caching on query results to speed up testing
  Created an OWL ontology to map between RoRs, which significantly improved performance
    Nontrivial effort!

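The slides do not say how the query caching was set up on the server. As a rough stand-in for the idea, the Python sketch below caches query results in local files keyed by a hash of the query text; the function names, the JSON file layout, and the fake query runner are all assumptions for illustration.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def cached_query(sql, run_query, cache_dir):
    """Return cached rows for `sql` if present; otherwise run the query and store the result."""
    key = hashlib.sha256(sql.encode("utf-8")).hexdigest()
    path = Path(cache_dir) / f"{key}.json"
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    rows = run_query(sql)
    path.write_text(json.dumps(rows), encoding="utf-8")
    return rows


# Hypothetical query runner standing in for a slow database call.
calls = []


def fake_run_query(sql):
    calls.append(sql)
    return [["alpha", 120], ["beta", 0]]


cache_dir = tempfile.mkdtemp()
query = "SELECT name, downloads FROM projects"
first = cached_query(query, fake_run_query, cache_dir)
second = cached_query(query, fake_run_query, cache_dir)  # served from the cache
print(len(calls))
```

During workflow testing the same queries are issued repeatedly, so only the first run pays the database cost.
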
Extending Analyses
  Replications implemented some of authors’ suggestions for future work
  Also implemented our own variations
  Easy to add and modify these after the initial replication development
  Ran analyses on larger data sets than original studies

Workflow Re-use
  Specifically designed workflows and components for re-use
    Components for sampling and analysis had no constants, only parameters
  Effortful development for data handling paid off
    “Plug-and-play” components used in every subsequent workflow
  Shifts the challenge from data to research, where it should be!

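A component with "no constants, only parameters" might look like the following Python sketch of a sampling step. The function name and its arguments are hypothetical; the actual components were built as Taverna workflow elements.

```python
import random
from typing import Iterable, List


def sample_projects(project_ids: Iterable[str], sample_size: int, seed: int) -> List[str]:
    """Draw a reproducible random sample; size and seed are inputs, never hard-coded."""
    ids = sorted(project_ids)  # fixed order so the same seed gives the same sample
    if sample_size > len(ids):
        raise ValueError("sample_size exceeds the number of available projects")
    rng = random.Random(seed)
    return rng.sample(ids, sample_size)


# The same component serves different studies by changing only the inputs.
population = ["delta", "alpha", "charlie", "bravo"]
run_a = sample_projects(population, sample_size=2, seed=7)
run_b = sample_projects(population, sample_size=2, seed=7)
print(run_a == run_b)
```

Because nothing study-specific is baked in, the component can be dropped into a later workflow unchanged.
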
Challenges with Using Workflows
  Software usability - continually improving
    Bugs wreaked havoc at times
  Data handling
    Continually more challenging than expected
  No existing web services, nor appropriate examples to emulate
    All bioscience, no social science

Barriers to Uptake
  Little science issues
    Many paradigms: lack of agreement in research focus, theory, methods
    Lack of incentives to collaborate
  Bimodal distribution of requisite skills
    “I can’t possibly do that! I can’t code!”
    “Why bother? I can code my own. You should too; just use Python.”
  Students are more willing to experiment with tools and new approaches

Example: Recent Research
  Estimating user base and potential user interest based on common release-and-download patterns
  Downloads a proxy for project success, a common dependent variable

“Normal” Download-Release Patterns
  BibDesk
  Users update fairly quickly after releases

Taverna’s Download-Release Patterns
  External effects!
  [Chart annotations: 1.3.2-RC1 +2 presentations; 1.5.0; ?; ?]

Interpretation
  Taverna is not a “normal” open source project
  Speaking tours, tutorials, articles, and other events influence downloads

Questions?
  More:
    Poster on eScience for FLOSS
    floss.syr.edu
    www.myexperiment.org/groups/64
