Collaborative Data Analysis with Taverna Workflows


Microsoft eScience Workshop 2009 presentation on collaborative development of Taverna workflows for data analysis in FLOSS research.


  1. Collaborative Data Analysis with Taverna Workflows
     Andrea Wiggins, Kevin Crowston & James Howison, 16 October 2009
  2. FLOSS Phenomenon
     - Free/Libre Open Source Software
       - Distributed collaboration to develop software
     - Typical social research topics
       - Coordination and collaboration
       - Growth and evolution (social and code)
       - Code quality
       - Motivation, leadership, success
       - Culture and community
       - Intellectual property and copyright
  3. eScience Proof-of-Concept
     - Project to replicate published FLOSS research using eScience approaches
       - Use existing shared data sets
       - Develop workflows collaboratively
       - Build a library of reusable components
     - Selected several papers to replicate based on:
       - Data availability
       - Suitability of analytical approach
  4. Research Replication
     - Conklin, 2004: examines the distribution of project sizes for evidence
       of the preferential attachment theory of growth in networks
     - Howison et al., 2006: examines the dynamics of social networks of
       project communications over time
     - English & Schweik, 2007: classifies projects based on metrics for
       success and stage of project growth
  5. FLOSS Research Data
     - Data sources include interviews, surveys, and ethnographic fieldwork
     - Digital “trace” data
       - Archival, secondary, a by-product of work
       - Easy to get, but hard to use
     - Repositories
       - Hosting “forges” like SourceForge, FreshMeat, RubyForge, etc.
     - RoRs: Repositories of Repositories
       - Data sources for research
  6. RoRs: FLOSSmole
     - Public access to 300+ GB of data
       - 300K+ projects from 8 repositories
       - Politely scraped, then parsed
       - Flat files & SQL datamarts
       - Released monthly via SF & GC
     - 5 TB allotment on TeraGrid @ SDSC
       - Allows direct database access without compromising our humble server
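Because FLOSSmole publishes its data as flat files and SQL datamarts, working with it is ordinary database querying. The following is a minimal sketch of that style of access using an in-memory SQLite database; the table and column names here are purely illustrative, not the actual FLOSSmole schema.

```python
import sqlite3

# Hypothetical miniature of a FLOSSmole-style datamart. Table and column
# names are illustrative stand-ins, not the real FLOSSmole schema.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE projects (
    project_name TEXT, repository TEXT)""")
conn.executemany(
    "INSERT INTO projects VALUES (?, ?)",
    [("bibdesk", "sourceforge"),
     ("taverna", "sourceforge"),
     ("rake", "rubyforge")])

# Count projects per hosting repository, as one might when profiling a dump.
rows = conn.execute("""
    SELECT repository, COUNT(*) AS n
    FROM projects
    GROUP BY repository
    ORDER BY n DESC""").fetchall()
print(rows)  # [('sourceforge', 2), ('rubyforge', 1)]
```

The same query would run unchanged against a MySQL datamart; SQLite is used here only to keep the sketch self-contained.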
  7. RoRs: SRDA
     - SourceForge Research Data Archive
     - Gated, researcher-only access to a 300+ GB SQL database of monthly
       dumps from SourceForge
       - Obtuse original structure, regular table deprecation, and only
         limited documentation
  8. Analysis Tool Requirements
     - Scalability
       - Move analysis from small n’s to big(ger) n’s
     - Data meshing
       - Reproducing research required analysis of data drawn from multiple RoRs
     - Collaborative analysis design
       - Needed to tap into diverse skills from different contributors
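The “data meshing” requirement amounts to joining records about the same projects drawn from different RoRs. A minimal sketch of that join, with entirely illustrative field names (these are not the real FLOSSmole or SRDA attributes):

```python
# Sketch of data meshing: inner-joining per-project records from two
# repositories of repositories on project name. All fields are hypothetical.
flossmole = {  # e.g., scraped project metadata
    "taverna": {"license": "LGPL", "devs": 12},
    "bibdesk": {"license": "BSD", "devs": 5},
}
srda = {  # e.g., statistics from a SourceForge dump
    "taverna": {"downloads": 50000},
    "gimp-fork": {"downloads": 300},
}

def mesh(a, b):
    """Inner-join two project -> attributes mappings on project name."""
    return {name: {**a[name], **b[name]} for name in a.keys() & b.keys()}

merged = mesh(flossmole, srda)
print(merged)  # {'taverna': {'license': 'LGPL', 'devs': 12, 'downloads': 50000}}
```

In practice this step is where name mismatches between repositories surface, which is one reason the deck later calls data handling “continually more challenging than expected.”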
  9. Integrating Diverse Skills
     - Collaborator 1: Data Wrangler
       - Expert with data sources and handling
       - Multiple coding languages and strong technical skills
     - Collaborator 2: Analyst
       - Competent with R, but no other coding skills
       - Good at debugging
     - Collaborator 3: PI
       - Helps find solutions when all else fails
  10. Taverna
      - Scientific workflow tool
        - Free. Open. We like that.
        - Responsive support from the myGrid team, lively user community
      - Additional collaboration support via myExperiment
        - Combined features and flexibility met our needs
  11. Work Process
      - Evaluate the paper’s data, methods, and findings
      - Develop an abstract workflow together, focusing on functionality
      - Split the work(flow) between data and analysis, specifying the names
        and forms of inputs and outputs at the boundary
      - Independent individual development and testing, using dummy inputs
      - Integrate the partial workflows
      - Test (debug, test…) and run
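The split-and-integrate steps above can be sketched as two functions that agree only on the shape of the data crossing the boundary. This is a toy illustration of the pattern, not part of any actual Taverna workflow; all names and values are made up.

```python
# Sketch of the split-workflow pattern: the data half and the analysis half
# agree only on the boundary contract (here, a list of (project, score)
# pairs), so each can be developed and tested independently.

def data_half():
    """Data wrangler's side: fetch and shape raw records."""
    # In the real workflow this would query a RoR; here we return fixtures.
    return [("taverna", 0.9), ("bibdesk", 0.4)]

def analysis_half(pairs):
    """Analyst's side: developed and tested against dummy inputs."""
    return {name: ("success" if score >= 0.5 else "tragedy")
            for name, score in pairs}

# Each half is testable alone with dummy inputs that match the contract...
assert analysis_half([("dummy", 1.0)]) == {"dummy": "success"}
# ...and integration is then just composition:
print(analysis_half(data_half()))  # {'taverna': 'success', 'bibdesk': 'tragedy'}
```

The dummy-input assertion mirrors step 4 of the process: each collaborator can verify their half before integration.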
  12. “Identifying Success and Tragedy of FLOSS Commons”
      - Replication of English & Schweik, 2007
        - Classification of project success by stage of growth for 110K projects
        - Requires data from 2 repositories: FLOSSmole & SRDA
      - Extension
        - Parameterized all thresholds
        - Tested additional classification criteria
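“Parameterized all thresholds” means no cutoff is buried as a constant in the workflow; every one is an input that can be varied between runs. A toy illustration of the idea (the actual English & Schweik criteria and values are not reproduced here):

```python
# Illustrative threshold parameterization: a toy success/tragedy classifier
# whose cutoffs are all keyword arguments. The criteria and default values
# are invented for this sketch, not taken from English & Schweik (2007).

def classify(releases, months_since_release,
             min_releases=3, abandon_after_months=12):
    """Classify a project using only parameterized thresholds."""
    if months_since_release > abandon_after_months:
        return "abandoned"
    return "success" if releases >= min_releases else "early-stage"

print(classify(5, 2))                   # 'success'
print(classify(5, 2, min_releases=10))  # 'early-stage' (stricter threshold)
print(classify(5, 24))                  # 'abandoned'
```

Exposing the thresholds this way is what made the extensions on later slides cheap: re-running the classification under different criteria is a parameter change, not a code change.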
  13. Classification Workflow
  14. Key Strategies
      - Consciously worked to maintain transparency and modularity
        - Copious code comments
        - Metadata makes workflows self-explanatory
      - Designed components for reuse
        - Particularly data handling and “shims”
      - Assigned the right work to the right people
        - Created interdependencies, but reduced the initial learning curve
  15. Important Details
      - Used SVN for version management
        - myExperiment can now manage this much more easily
      - Set up server-side caching of query results to speed up testing
      - Created an OWL ontology to map between RoRs, which significantly
        improved performance
        - Nontrivial effort!
  16. Extending Analyses
      - Replications implemented some of the authors’ suggestions for future work
        - Also implemented our own variations
        - Easy to add and modify these after the initial replication development
      - Ran analyses on larger data sets than the original studies
  17. Workflow Re-use
      - Specifically designed workflows and components for re-use
      - Components for sampling and analysis had no constants, only parameters
      - Effortful development for data handling paid off
        - “Plug-and-play” components used in every subsequent workflow
        - Shifts the challenge from data to research, where it should be!
  18. Challenges with Using Workflows
      - Software usability: continually improving
        - Bugs wreaked havoc at times
      - Data handling
        - Continually more challenging than expected
      - No existing web services, and no appropriate examples to emulate
        - All bioscience, no social science
  19. Barriers to Uptake
      - “Little science” issues
        - Many paradigms: lack of agreement on research focus, theory, and methods
      - Lack of incentives to collaborate
      - Bimodal distribution of requisite skills
        - “I can’t possibly do that! I can’t code!”
        - “Why bother? I can code my own. You should too; just use Python.”
      - Students are more willing to experiment with tools and new approaches
  20. Example: Recent Research
      - Estimating user base and potential user interest based on common
        release-and-download patterns
        - Downloads are a proxy for project success, a common dependent variable
  21. “Normal” Download-Release Patterns
      - BibDesk
        - Users update fairly quickly after releases
  22. Taverna’s Download-Release Patterns
      - External effects!
      [Chart annotations from the slide: 1.3.2-RC1, “+2 presentations”, 1.5.0,
       and spikes marked “?”]
  23. Interpretation
      - Taverna is not a “normal” open source project
        - Speaking tours, tutorials, articles, and other events influence downloads
  24. Questions?
      - More:
        - Poster on eScience for FLOSS