Little eScience
Presentation for the myGrid team at the University of Manchester, putting the practice of eScience into the context of little science

Transcript

  • 1. Little eScience Andrea Wiggins June 18, 2009
  • 2. Overview • Background • Exposition: Sociology of Science • Broad generalizations about science • Example: FLOSS Research • Little science context for eScience research • Expectations: What next? http://www.flickr.com/photos/pmtorrone/304696349/
  • 3. My Background • BA: Maths with economics • Nonprofit & IT industry work • Adult literacy, nonprofit management support, professional theatre • Web analytics • MSI: Human-computer interaction, complex systems & network science • PhD: Information science & technology
  • 4. Science • Systematic investigation for the production of knowledge • Scientific method emphasizes reproducibility • Not all phenomena are reproducible... • Many categories • Experimental, applied, social, etc. • Categories are not mutually exclusive http://www.flickr.com/photos/radiorover/419414206/
  • 5. Paradigms & Revolutions • Kuhn - Laws, theories, applications & instrumentation that create coherent traditions of scientific research • Paradigms help us direct our research, but limit our view of the world • New technologies can lead to scientific revolutions by revealing anomalies http://www.flickr.com/photos/weichbrodt/644302381/
  • 6. Normal Science • Kuhn - “normal science” is research based on broadly accepted scientific paradigms • Shared paradigms are based on rules and standards for scientific practice • Key requirement: agreement on focus and conduct of research • Ǝ(Grand Challenges)|Discipline http://www.flickr.com/photos/themadlolscientist/2421152973/
  • 7. Big Science • de Solla Price - “Big Science” is... • Inherently paradigmatic • Always normal science • Produces detailed insights into the minutiae of phenomena studied in the paradigm http://www.flickr.com/photos/31333486@N00/1883498062/
  • 8. Pre-paradigmatic Science • Paradigms require agreement on... • Epistemology • Ontology • Methodology • Most social sciences are pre-paradigmatic • Primarily exploratory research • Very little replication http://www.flickr.com/photos/askpang/327577395/
  • 9. Little Science • de Solla Price - “Little Science” is a romanticized precursor to Big Science, featuring lone, long-haired geniuses misunderstood by society, etc. • If it’s not Big Science, it’s Little Science • Pre-paradigmatic and fraught with ambiguity • Often fundamentally exploratory • Epistemological/theoretical/methodological divergence among researchers http://www.flickr.com/photos/mrjoax/2548045246/
  • 10. Social Science • Social science is real science: the goal is systematic knowledge production • Focuses on the study of the social life of human groups and individuals • IMHO, fundamentally more difficult than “hard” sciences due to infinite complexity of social phenomena • Replicability is a major challenge with respect to scientific method • Not all social science can or should aspire to replicability http://www.flickr.com/photos/smiteme/2379629501/
  • 11. Normalizing Science • Becoming a normal science requires community and convergence • Ǝ(community) != Ǝ(agreement) • Establishing grand challenges and methods are primary tasks of normalizing • Resistance to change is pervasive http://www.flickr.com/photos/9036026@N08/2949211479/
  • 12. Scientific Collaboration • Collaboration requires common focus, if not also epistemology and ontology • Challenging enough in normal sciences • Harder in pre-paradigmatic research • Economics: systemic disincentives to collaborate, versus potential benefits and ideals of science http://www.flickr.com/photos/richardsummers/542738965/
  • 13. Big Science Collaboration • LHC, CERN, etc. • Thousands of collaborators • Complex but coordinated, at least somewhat centralized • Requires shared goals and resources, plus (lots of) communication • Only happens in normal sciences http://www.flickr.com/photos/8767020@N08/531355152/
  • 14. Little Science Collaboration • A professor & a grad student, give or take • Localized goals and resources • -> localized research practices • Small research teams • Fundamentally difficult to achieve the consensus that allows larger groups • Restricts the ability to obtain funding and undertake ambitious projects http://www.flickr.com/photos/lamazone/2735939345/
  • 15. Scientific Collaboration Requirements • Shared goals • Establishes focus of research • Shared research resources • Both social and artifactual • Social aspects include training and community socialization we can has share? http://www.flickr.com/photos/ryanr/142455033/
  • 16. Historical Research Artifacts • Letters, Books, Journals, Lectures • Also technologies: methods, instrumentation • Sharing? • Recordkeeping is not always a researcher’s main priority • Without records, there’s not much to share except the research outputs http://www.flickr.com/photos/smailtronic/1535870363/
  • 17. Today’s Research Artifacts • Large scale datasets, scripts, software, workflows, papers, images, video, audio, annotations, ephemera, web sites... • “Research objects” - bundling all the pieces together • Hybrids of boundary objects and touchstones • Technologies -> scientific revolution! • Open science http://www.flickr.com/photos/smiteme/2379630899/
  • 18. Example: FLOSS Research • Phenomenological & interdisciplinary • Software engineering, Information Systems, Anthropology, Sociology, CSCW, etc... • Ethos • (Idealistic) combination of open source values and scientific values http://www.flickr.com/photos/themadlolscientist/2542236565/
  • 19. FLOSS Phenomenon • Free/Libre Open Source Software • “Free as in speech, free as in beer”: liberty versus cost • Distributed collaboration to develop software • Volunteers and sponsored developers • Community-based model of development http://www.flickr.com/photos/prawnwarp/541526661/
  • 20. Typical FLOSS Research Topics • Coordination and collaboration • Growth and evolution (social and code) • Code quality • Business models and firm involvement • Motivation, leadership, success • Culture and community • Intellectual property and copyright http://www.flickr.com/photos/eean/519258881/
  • 21. What we study @ SU • Social aspects of FLOSS • What practices make some distributed work teams more effective than others? • How are these practices developed? • What are the dynamics through which self-organizing distributed teams develop and work?
  • 22. Sharing FLOSS Research Artifacts • Community: Small but growing, maybe around 400 researchers worldwide, with lively face-to-face interaction but relatively low listserv activity • Data: Lots of it, and readily available, though often difficult to use for several reasons • Analyses and tools: Not quite as easy to get, but there if you can find them • Papers: Repositories are as yet underdeveloped, but efforts are underway http://www.flickr.com/photos/12698507@N08/2762563631/
  • 23. FLOSS Research Community • Handful of small research groups, mostly in UK & Europe • Most often found in Software Engineering departments • International conferences targeted to academics, developers, or both • OSS, ICSE, FOSDEM, etc. • IFIP WG 2.13 http://www.flickr.com/photos/steevithak/2883218362/
  • 24. FLOSS Research Data • Data sources include interviews, surveys, and ethnographic fieldwork • Digital “trace” data: archival, secondary, a by-product of work; easy to collect but hard to use • Repositories • Hosting “forges” like SourceForge, FreshMeat, RubyForge, etc. • RoRs: Repositories of Repositories • Data sources for research
  • 25. We Built It... • Motivations • Stop hammering forge servers and getting entire campus IP ranges blocked... • Stop reinventing the wheel! • Adoption • Shared data sources seeing increasing use • Next step is harder: sharing tools and workflows http://www.flickr.com/photos/circulating/997909242/
  • 26. RoRs: FLOSSmole • Multiple PIs @ Syracuse, Elon, & Carnegie Mellon • One grad student @ SU (me), a couple of undergrads @ Elon • Public access to 300+ GB of data on 300K+ projects from 8 repositories • Flat files & SQL datamarts (a query sketch follows below) • Released via SF & GC • 5 TB allotment on TeraGrid @ SDSC
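The flat-file releases are the lowest-barrier way into FLOSSmole. A minimal sketch of using one, assuming a tab-separated project dump; the filename and the "forge" column are hypothetical, so check the actual datamart documentation for the real schemas:

    # Count projects per forge in a hypothetical FLOSSmole flat-file dump.
    import csv
    from collections import Counter

    projects_per_forge = Counter()
    with open("flossmole_projects.tsv", newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f, delimiter="\t"):
            projects_per_forge[row["forge"]] += 1  # hypothetical column name

    for forge, count in projects_per_forge.most_common():
        print(f"{forge}: {count} projects")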
  • 27. RoRs: FLOSSmetrics • Produced by LibreSoft with academic and corporate partners • Public access to data for 2800+ projects • Analyzed & raw data from CVS, email, trackers • Tools for: • calculating code metrics • parsing trackers • parsing email lists
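FLOSSmetrics ships its own tooling for these tasks; purely as a toy illustration of what the simplest code metric involves (and emphatically not their implementation), a physical-SLOC counter might look like this:

    # Toy physical-SLOC counter: non-blank lines that are not comment-only.
    import os

    def count_sloc(root, exts=(".py", ".c", ".java")):
        total = 0
        for dirpath, _, filenames in os.walk(root):
            for name in filenames:
                if name.endswith(exts):
                    with open(os.path.join(dirpath, name), errors="ignore") as f:
                        for line in f:
                            stripped = line.strip()
                            if stripped and not stripped.startswith(("#", "//")):
                                total += 1
        return total

    print(count_sloc("some-project/"))  # hypothetical local checkout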
  • 28. RoRs: SRDA • SourceForge Research Data Archive • One PI @ the University of Notre Dame • One massive 300 GB+ SQL db of monthly dumps from SourceForge • Obtuse original structure, regular table deprecation, some documentation • Gated access: researchers only, a condition of data release from SF
  • 29. RoRs: Emerging Sources • Ultimate Debian Database (UDD) • 300 MB compressed Postgres DB, produced by Debian community • Planning to add to FLOSSmole
  • 30. FLOSS Research Analyses • When available... • Bespoke scripts (example below) • Taverna workflows
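To make "bespoke scripts" concrete, here is the kind of hypothetical one-off that FLOSS researchers write constantly: tallying mailing-list posts per month from an mbox archive (the filename is invented):

    # Monthly activity from a mailing-list archive, using only the stdlib.
    import mailbox
    from collections import Counter
    from email.utils import parsedate_tz

    posts_per_month = Counter()
    for msg in mailbox.mbox("dev-list.mbox"):  # hypothetical archive file
        parsed = parsedate_tz(msg.get("Date", ""))
        if parsed:
            posts_per_month[(parsed[0], parsed[1])] += 1  # (year, month)

    for (year, month), n in sorted(posts_per_month.items()):
        print(f"{year}-{month:02d}: {n} posts")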
  • 31. FLOSS Research Papers • First, there was opensource.mit.edu • They no longer maintain it, and gave us the data • Work-in-progress working papers repository at FLOSSpapers.org • Essential viability problem is that repositories require long-term stewardship... • ...which requires long-term commitments of funding and personnel, not just volunteers
  • 32. FLOSS Research Collaboration • Multiple partners involved in producing FLOSSmole & FLOSSmetrics • Federated data sources by choice, starting to develop ontologies • As yet, a Little Science domain • Cross-institutional collaboration poses many challenges • Usual difficulties magnified by general lack of resources, both financial and human
  • 33. Latest Initiatives • Resource-oriented • Expanding resources: data, research artifacts, and pedagogical materials • DOIs: 10.4118/* • Semantic data interoperability • Community-oriented • FLOSShub.org
  • 34. Evangelizing eScience • Made presentations at OSS conferences: well received, but hard to make converts for several reasons • Tried to get other research group members to use Taverna: learning overhead is too high for most • Submitted a paper on eScience to an IS conference: rejected because reviewers were unable to adequately evaluate eScience as a topic, as it’s too unfamiliar • Currently just doing our work this way, as an exemplar http://www.flickr.com/photos/naezmi/2418745377/
  • 35. Barriers to Uptake • Lack of agreement in research focus, theory, methods; researcher isolation • Bimodal distribution of requisite skills • “I can’t possibly do that! I can’t code!” • “Why bother? I can code my own. You should too; just use Python.” “Overheard” on Twitter: Friend #1: i HATE that openoffice automatically took over my "open with..." defaults. Friend #2: @Friend #1 <opensourcedeveloper> If you don't like it, then why don't you submit code to change the behavior!? </opensourcedeveloper> http://www.flickr.com/photos/noner/1739876378/
  • 36. What I had to learn to get this far • Taverna • A little bit of OWL, RDF, & SPARQL • A lot more Unix terminal & XML • I would not have taken this on if I had known what was in store, but once I got started, I was hooked • Relational DB management & SQL • More R, plus packages and dependency management • Java & Eclipse - just enough to write my own Beanshells • SVN & SSH http://www.flickr.com/photos/sashala/292868436/
  • 37. Sociotechnical Engineering • Tools are part of the solution, thanks to brilliant CS and SE people • Social elements are the true barrier • Awareness of methods and benefits • Incentive systems • Resistance to change (paradigms again) • Proof of concept is difficult http://www.flickr.com/photos/pinprick/3117108495/
  • 38. Using Taverna for Little eScience • Implementing analysis is usually easy • Data handling is almost always hard • All data are in SQL databases with consistent IDs • Lots of data manipulation is required (see the sketch below) • Avoiding web services as much as possible • Infrastructure and resources are limited • Benefit is truly questionable: AFAIK, I am 50% of the user base...
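A minimal sketch of that data-handling step: joining two tables on a shared project ID. The table and column names are hypothetical, and sqlite3 stands in only to keep the example self-contained; the same SQL would run against the actual datamarts.

    # Join projects to their download records on a consistent project ID.
    import sqlite3

    conn = sqlite3.connect("floss.db")  # hypothetical local copy
    rows = conn.execute("""
        SELECT p.project_id, p.name, SUM(d.downloads) AS total_downloads
        FROM projects p
        JOIN downloads d ON d.project_id = p.project_id
        GROUP BY p.project_id, p.name
        ORDER BY total_downloads DESC
    """).fetchall()

    for project_id, name, total in rows[:10]:
        print(project_id, name, total)
    conn.close()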
  • 39. Example: Our Recent Research • Estimating user base and potential user interest in FLOSS projects • Based on common release-and-download patterns • Proxy for project success, a common dependent variable • [figure: downloads across versions 0.5, 0.6, and 0.7; the area under the post-release curve is active users updating, while baseline growth reflects potential user experimentation and active user base growth (good publicity?)] • A toy sketch of the estimate follows the figure slides below
  • 40. “Normal” Download-Release Patterns: BibDesk [figure: downloads, user_base, and baseline measures per month, Oct 2005 to Apr 2007]
  • 41. Taverna’s Download-Release Patterns: External effects! [figure: download spikes around releases 1.3.2-RC1 (+2 presentations) and 1.5.0, plus two unexplained spikes marked “?”]
  • 42. Taverna’s Estimated Baseline & User Base [figure: 14 day baseline & drop-off]
  • 43. Taverna’s Estimated Baseline & User Base [figure: 7 day baseline & drop-off]
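A hedged toy sketch of the estimation idea as I read it from the slides, not the published method: treat the level that daily downloads settle to after a release as the baseline (new-user interest), and the area above that baseline during the post-release spike as a proxy for active users updating. The window choice (14 vs. 7 days, as in the two figures above) shifts both estimates.

    # Toy estimator: baseline level vs. update-spike area after a release.
    def estimate_user_base(daily_downloads, baseline_window=14):
        # daily_downloads: downloads per day, starting on release day.
        # Baseline: mean daily downloads after the update spike drops off.
        tail = daily_downloads[baseline_window:]
        baseline = sum(tail) / len(tail) if tail else 0.0
        # Area above baseline during the spike ~ existing users updating.
        spike = sum(max(d - baseline, 0.0)
                    for d in daily_downloads[:baseline_window])
        return {"baseline": baseline, "active_user_base": spike}

    # Invented data: a release-day spike decaying to a steady baseline.
    series = [900, 700, 450, 300, 220, 180, 150, 130, 115, 108,
              104, 102, 101, 100] + [100] * 14
    print(estimate_user_base(series, baseline_window=14))
    print(estimate_user_base(series, baseline_window=7))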
  • 44. Interpretation • Taverna is not a “normal” open source project • Speaking tours, tutorials, articles, and other events influence downloads • What this demonstrates... • Care is needed with quantitative measures • Not all open source projects are the same • Taverna users are just as reactive as any http://www.flickr.com/photos/pagedooley/2121472112/
  • 45. Where next? • Adoption is a long-term agenda, as changing social practices doesn’t happen overnight • For FLOSS research and our disciplinary communities • We will keep doing our work this way, and hope to draw in others “Won’t you come out and play?” http://www.flickr.com/photos/atiq/2658884520/
  • 46. Thanks! • Credits where they are due • Kevin Crowston, my advisor • James Howison, my collaborator • Everett Wiggins, my husband