Collaborative Data Analysis with Taverna Workflows

    Collaborative Data Analysis with Taverna Workflows - Presentation Transcript

    1. Collaborative Data Analysis with Taverna Workflows
      • Andrea Wiggins, Kevin Crowston & James Howison
      • 16 October 2009
    2. FLOSS Phenomenon
      • Free/Libre Open Source Software
        • Distributed collaboration to develop software
      • Typical social research topics
        • Coordination and collaboration
        • Growth and evolution (social and code)
        • Code quality
        • Motivation, leadership, success
        • Culture and community
        • Intellectual property and copyright
    3. eScience Proof-of-Concept
      • Project to replicate published FLOSS research using eScience approaches
        • Use existing shared data sets
        • Develop workflows collaboratively
        • Build library of reusable components
      • Selected several papers to replicate based on:
        • Data availability
        • Suitability of analytical approach
    4. Research Replication
      • Study: Description
        • English & Schweik, 2007: Classifies projects based on metrics for success and stage of project growth
        • Howison et al., 2006: Examines dynamics of social networks of project communications over time
        • Conklin, 2004: Examines distribution of project sizes for evidence of preferential attachment theory of growth in networks
    5. FLOSS Research Data
      • Data sources include interviews, surveys, and ethnographic fieldwork
      • Digital “trace” data
        • Archival, secondary, by-product of work
        • Easy to get, but hard to use
      • Repositories
        • Hosting “forges” like SourceForge, FreshMeat, RubyForge, etc.
      • RoRs: Repositories of Repositories
        • Data sources for research
    6. RoRs: FLOSSmole
      • Public access to 300+ GB data
        • 300K+ projects from 8 repositories
        • Politely scraped, then parsed
        • Flat files & SQL datamarts
        • Released monthly via SF & GC
      • 5 TB allotment on TeraGrid @ SDSC
        • Allows direct database access without compromising our humble server
    7. RoRs: SRDA
      • SourceForge Research Data Archive
      • Gated researcher-only access to a 300 GB+ SQL db of monthly dumps from SourceForge
        • Original obtuse structure, regular table deprecation, some limited documentation
    8. Analysis Tool Requirements
      • Scalability
        • Move analysis from small n’s to big(ger) n’s
      • Data meshing
        • Reproducing research required analysis of data drawn from multiple RoRs
      • Collaborative analysis design
        • Needed to tap into diverse skills from different contributors
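The "data meshing" requirement above can be illustrated with a minimal sketch that joins records from two sources on a shared key. The field names (`project_name`, `developers`, `downloads`) and the sample rows are hypothetical; the slides do not describe the actual schemas of the RoRs.

```python
def mesh_projects(source_a, source_b, key="project_name"):
    """Inner-join two lists of dict records on a shared key.

    Rows that appear in only one source are dropped. Where both
    sources carry the same field, the value from source_a wins.
    """
    index_b = {row[key]: row for row in source_b}
    merged = []
    for row in source_a:
        match = index_b.get(row[key])
        if match is not None:
            combined = dict(match)
            combined.update(row)
            merged.append(combined)
    return merged


# Hypothetical rows standing in for extracts from two different RoRs.
rows_from_first_ror = [
    {"project_name": "alpha", "developers": 3},
    {"project_name": "beta", "developers": 1},
]
rows_from_second_ror = [
    {"project_name": "alpha", "downloads": 1200},
    {"project_name": "gamma", "downloads": 5},
]

meshed = mesh_projects(rows_from_first_ror, rows_from_second_ror)
```

Only "alpha" appears in both hypothetical sources, so it is the only project in `meshed`. Real project identifiers rarely match this cleanly across repositories.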
    9. Integrating Diverse Skills
      • Collaborator 1: Data Wrangler
        • Expert with data sources and handling
        • Multiple coding languages and great technical skills
      • Collaborator 2: Analyst
        • Competent with R, but no other coding skills
        • Good at debugging
      • Collaborator 3: PI
        • Helps find solutions when all else fails
    10. Taverna
      • Scientific workflow tool
        • Free. Open. We like that.
        • Responsive support from myGrid team, lively user community
      • Additional collaboration support via myExperiment
        • Combined features and flexibility met our needs
    11. Work Process
      • Evaluate paper’s data, methods, and findings
      • Develop abstract workflow together, focusing on functionality
      • Split the work(flow) between data and analysis, specifying names and forms of inputs and outputs at the boundary
      • Independent individual development and testing, using dummy inputs
      • Integration of partial workflows
      • Test (debug, test…) and run
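The boundary agreement and dummy-input testing in the work process above might look roughly like the following. This is a sketch under assumptions: the field names, types, and the `mean_downloads` analysis step are invented for illustration and are not the team's actual interface.

```python
# Hypothetical contract agreed at the data/analysis boundary:
# field name -> expected Python type.
REQUIRED_FIELDS = {"project_name": str, "downloads": int, "releases": int}


def validate_boundary(rows, required=REQUIRED_FIELDS):
    """Raise ValueError if any row breaks the agreed boundary format."""
    for i, row in enumerate(rows):
        for field, expected_type in required.items():
            if field not in row:
                raise ValueError(f"row {i}: missing field {field!r}")
            if not isinstance(row[field], expected_type):
                raise ValueError(
                    f"row {i}: {field!r} should be {expected_type.__name__}"
                )
    return rows


def dummy_inputs():
    """Stand-in for the data half, so the analysis half can be tested alone."""
    return [
        {"project_name": "dummy-a", "downloads": 10, "releases": 2},
        {"project_name": "dummy-b", "downloads": 0, "releases": 0},
    ]


def mean_downloads(rows):
    """Illustrative analysis step that relies only on the contract."""
    rows = validate_boundary(rows)
    return sum(r["downloads"] for r in rows) / len(rows)


result = mean_downloads(dummy_inputs())
```

Because both halves check against the same contract, swapping the dummy inputs for real data at integration time should not require changes to the analysis code.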
    12. “Identifying success and tragedy of FLOSS Commons”
      • Replication of English & Schweik, 2007
        • Classification of project success by stage of growth for 110K projects
        • Requires data from 2 repositories, FLOSSmole & SRDA
      • Extension
        • Parameterized all thresholds
        • Tested additional criterion tests
    13. Classification Workflow
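The workflow diagram for this slide did not survive in the transcript. As a rough illustration of what "parameterized all thresholds" could mean, here is a sketch of a classifier whose cut-offs are all passed in rather than hard-coded. The threshold names, default values, category labels, and rules are assumptions made for this example; they are not the criteria from English & Schweik, 2007, nor the authors' workflow.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    # All values are hypothetical placeholders, not published criteria.
    min_releases_for_growth: int = 1
    min_downloads_for_success: int = 100
    max_idle_days: int = 365


def classify(project, t=Thresholds()):
    """Return (stage, outcome) for one project record.

    `project` is a dict with hypothetical keys 'releases',
    'downloads', and 'days_since_activity'.
    """
    if project["releases"] >= t.min_releases_for_growth:
        stage = "growth"
    else:
        stage = "initiation"

    if project["days_since_activity"] > t.max_idle_days:
        outcome = "abandoned"
    elif stage == "growth" and project["downloads"] >= t.min_downloads_for_success:
        outcome = "successful"
    else:
        outcome = "indeterminate"
    return stage, outcome


example = {"releases": 3, "downloads": 500, "days_since_activity": 10}
default_label = classify(example)
stricter_label = classify(example, Thresholds(min_downloads_for_success=1000))
```

Rerunning the same records with a different `Thresholds` instance is what makes it cheap to test alternative criteria after the initial replication.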
    14. Key Strategies
      • Consciously worked to maintain transparency and modularity
        • Copious code comments
        • Metadata makes workflows self-explanatory
      • Designed components for reuse
        • Particularly data handling and “shims”
      • Assigned the right work to the right people
        • Created interdependencies, but reduced the initial learning curve
    15. Important Details
      • Used SVN for version management
        • myExperiment can now manage this much more easily
      • Set up server caching on query results to speed up testing
      • Created an OWL ontology to map between RoRs, which significantly improved performance
        • Nontrivial effort!
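The slides do not say how the server-side cache was implemented. The sketch below shows the general idea with a simpler, in-process substitute: results are stored by query text and parameters, so repeated test runs do not hit the database again. The table, column names, and `CachedQuery` class are hypothetical.

```python
import sqlite3


class CachedQuery:
    """Memoize query results keyed by (SQL text, parameters)."""

    def __init__(self, connection):
        self.connection = connection
        self.cache = {}
        self.misses = 0  # counts how often the database was actually queried

    def run(self, sql, params=()):
        key = (sql, tuple(params))
        if key not in self.cache:
            self.misses += 1
            self.cache[key] = self.connection.execute(sql, params).fetchall()
        return self.cache[key]


# In-memory database with made-up rows, standing in for a remote datamart.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE projects (name TEXT, downloads INTEGER)")
conn.executemany(
    "INSERT INTO projects VALUES (?, ?)", [("alpha", 1200), ("beta", 5)]
)

cached = CachedQuery(conn)
query = "SELECT name FROM projects WHERE downloads > ?"
first = cached.run(query, (100,))
second = cached.run(query, (100,))  # served from the cache
```

A cache like this is only safe while the underlying data is static, which fits periodic dumps but would need invalidation for live data.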
    16. Extending Analyses
      • Replications implemented some of authors’ suggestions for future work
        • Also implemented our own variations
        • Easy to add and modify these after the initial replication development
      • Ran analyses on larger data sets than original studies
    17. Workflow Re-use
      • Specifically designed workflows and components for re-use
      • Components for sampling and analysis had no constants, only parameters
      • Effortful development for data handling paid off
        • “Plug-and-play” components used in every subsequent workflow
        • Shifts the challenge from data to research, where it should be!
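A component with "no constants, only parameters" could be as small as the following sampling sketch, in which the population, the sample size, and the random seed all arrive as inputs. The function name and arguments are illustrative assumptions, not the authors' component.

```python
import random


def sample_projects(projects, sample_size, seed):
    """Draw a reproducible simple random sample without replacement.

    Nothing is hard-coded: population, size, and seed are all inputs,
    so the same component can serve any later workflow.
    """
    rng = random.Random(seed)  # local generator; leaves global state untouched
    return rng.sample(projects, sample_size)


population = [f"project-{i}" for i in range(20)]  # made-up identifiers
sample_a = sample_projects(population, sample_size=3, seed=42)
sample_b = sample_projects(population, sample_size=3, seed=42)
```

Passing the seed explicitly means two collaborators running the workflow with the same inputs get the same sample.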
    18. Challenges with Using Workflows
      • Software usability - continually improving
        • Bugs wreaked havoc at times
      • Data handling
        • Continually more challenging than expected
      • No existing web services, nor appropriate examples to emulate
        • All bioscience, no social science
    19. Barriers to Uptake
      • Little science issues
        • Many paradigms: lack of agreement in research focus, theory, methods
      • Lack of incentives to collaborate
      • Bimodal distribution of requisite skills
        • “I can’t possibly do that! I can’t code!”
        • “Why bother? I can code my own. You should too; just use Python.”
      • Students are more willing to experiment with tools and new approaches
      Example: Recent Research
      • Estimating user base and potential user interest based on common release-and-download patterns
        • Downloads are a proxy for project success, a common dependent variable
    20. “Normal” Download-Release Patterns
      • BibDesk
        • Users update fairly quickly after releases
      Taverna’s Download-Release Patterns
      • External effects!
      • [Figure annotations: 1.3.2-RC1, +2 presentations, 1.5.0, ?, ?]
    21. Interpretation
      • Taverna is not a “normal” open source project
        • Speaking tours, tutorials, articles, and other events influence downloads
    22. Questions?
      • More:
        • Poster on eScience for FLOSS
        • floss.syr.edu
        • www.myexperiment.org/groups/64

    Microsoft eScience Workshop 2009 presentation on co…


    © All Rights Reserved
