Little eScience
Andrea Wiggins
June 18, 2009
Overview

• Background


• Exposition: Sociology of Science


  • Broad generalizations about science


• Example: FLOSS R...
My Background

• BA: Maths with economics


• Nonprofit & IT industry work


  • Adult literacy, nonprofit management suppor...
Science

• Systematic investigation for the production of knowledge


  • Scientific method emphasizes reproducibility


  ...
Paradigms & Revolutions

• Kuhn - Laws, theories, applications & instrumentation that create
  coherent traditions of scie...
Normal Science

• Kuhn - “normal science” is research based on broadly accepted scientific
  paradigms


• Shared paradigms...
Big Science

• de Solla Price - “Big Science” is...


   • Inherently paradigmatic


   • Always normal science


• Produc...
Pre-paradigmatic Science

• Paradigms require agreement on...


  • Epistemology


  • Ontology


  • Methodology


• Most...
Little Science

• de Solla Price - “Little Science” is a
  romanticized precursor to Big Science,
  featuring lone, long-h...
Social Science

• Social science is real science: the goal is systematic knowledge production


• Focuses on the study of ...
Normalizing Science

• Becoming a normal science requires community and convergence


  • Ǝ(community) != Ǝ(agreement)


•...
Scientific Collaboration

• Collaboration requires common focus, if not also epistemology and ontology


• Challenging enou...
Big Science Collaboration

• LHC, CERN, etc.


  • Thousands of collaborators


  • Complex but coordinated,
    at least ...
Little Science Collaboration

• A Professor & a grad student, give or take


   • Localized goals and resources


      • ...
Scientific Collaboration Requirements

• Shared goals


  • Establishes focus of research


• Shared research resources


 ...
Historical Research Artifacts

• Letters, Books, Journals, Lectures


• Also technologies: methods, instrumentation


• Sh...
Today’s Research Artifacts

• Large scale datasets, scripts, software, workflows, papers, images, video,
  audio, annotatio...
Example: FLOSS Research

• Phenomenological & interdisciplinary


  • Software engineering,
    Information Systems,
    A...
FLOSS Phenomenon

• Free/Libre Open Source Software
 “Free as in speech, free as in beer” - liberty versus cost



  • Dis...
Typical FLOSS Research Topics

• Coordination and collaboration


• Growth and evolution (social and code)


• Code qualit...
What we study @ SU

• Social aspects of FLOSS


  • What practices make some distributed work teams more effective than
  ...
Sharing FLOSS Research Artifacts

• Community: Small but growing, maybe around 400 researchers worldwide,
  with lively fa...
FLOSS Research Community

• Handful of small research groups, mostly in UK & Europe


   • Most often found in Software En...
FLOSS Research Data

• Data sources include interviews, surveys, and ethnographic fieldwork


• Digital “trace” data: archi...
We Built It...

• Motivations


  • Stop hammering forge servers, getting entire campus IPs blocked...


  • Stop reinvent...
RoRs: FLOSSmole

• Multiple PIs @ Syracuse, Elon, & Carnegie Mellon
  One grad student @ SU (me), a couple of undergrads @...
RoRs: FLOSSmetrics

• Produced by LibreSoft with academic and corporate partners


• Public access to data for 2800+ proje...
RoRs: SRDA

• SourceForge Research Data Archive


  • One PI @ Notre Dame University


  • One massive 300 GB+ SQL db of m...
RoRs: Emerging Sources

• Ultimate Debian Database (UDD)


  • 300 MB compressed Postgres DB,
    produced by Debian commu...
FLOSS Research Analyses

• When available...


   • Bespoke Scripts


   • Taverna workflows
FLOSS Research Papers

• First, there was opensource.mit.edu


   • They no longer maintain it, and gave us the data


• W...
FLOSS Research Collaboration

• Multiple partners involved in producing FLOSSmole & FLOSSmetrics


• Federated data source...
Latest Initiatives

• Resource-oriented


  • Expanding resources: data, research artifacts, and pedagogical materials


 ...
Evangelizing eScience

• Made presentations at OSS conferences: well received, but hard to make
  converts for several rea...
Barriers to Uptake

• Lack of agreement in research focus, theory, methods; researcher isolation


• Bimodal distribution ...
What I had to learn to get this far

• Taverna                           • A little bit of OWL, RDF, & SPARQL


• A lot mo...
Sociotechnical Engineering

• Tools are part of the solution, thanks to brilliant CS and SE people


• Social elements are...
Using Taverna for Little eScience

• Implementing analysis is usually easy


• Data handling is almost always hard


   • ...
Example: Our Recent Research

• Estimating user base and potential user interest in FLOSS projects


   • Based on common ...
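“Normal” Download-Release Patterns: BibDesk
[Figure: downloads, user_base, and baseline measures over time]
Taverna’s Download-Release Patterns
[Figure: downloads around releases 1.3.2-RC1 and 1.5.0, with "+2 presentations" marked]
Taverna’s Estimated Baseline & User Base
[Figure: 14 day baseline & drop-off]
Taverna’s Estimated Baseline & User Base
[Figure: 7 day baseline & drop-off]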
Interpretation

• Taverna is not a “normal” open source project


  • Speaking tours, tutorials, articles, and other event...
Where next?

• Adoption is a long-term agenda, as changing social practices doesn’t happen
  overnight


• For FLOSS resea...
Thanks!

• Credits where they are due


  • Kevin Crowston, my advisor




  • James Howison, my collaborator




  • Ever...

Little eScience


Presentation for the myGrid team at the University of Manchester, putting the practice of eScience into the context of little science

Published in: Technology, News & Politics

Transcript of "Little eScience"

  1. 1. Little eScience Andrea Wiggins June 18, 2009
  2. 2. Overview • Background • Exposition: Sociology of Science • Broad generalizations about science • Example: FLOSS Research • Little science context for eScience research • Expectations: What next? http://www.flickr.com/photos/pmtorrone/304696349/
  3. 3. My Background • BA: Maths with economics • Nonprofit & IT industry work • Adult literacy, nonprofit management support, professional theatre • Web analytics • MSI: Human-computer interaction, complex systems & network science • PhD: Information science & technology
  4. 4. Science • Systematic investigation for the production of knowledge • Scientific method emphasizes reproducibility • Not all phenomena are reproducible... • Many categories • Experimental, applied, social, etc. • Categories are not mutually exclusive http://www.flickr.com/photos/radiorover/419414206/
  5. 5. Paradigms & Revolutions • Kuhn - Laws, theories, applications & instrumentation that create coherent traditions of scientific research • Paradigms help us direct our research, but limit our view of the world • New technologies can lead to scientific revolutions by revealing anomalies http://www.flickr.com/photos/weichbrodt/644302381/
  6. 6. Normal Science • Kuhn - “normal science” is research based on broadly accepted scientific paradigms • Shared paradigms are based on rules and standards for scientific practice • Key requirement: agreement on focus and conduct of research • Ǝ(Grand Challenges)|Discipline http://www.flickr.com/photos/themadlolscientist/2421152973/
  7. 7. Big Science • de Solla Price - “Big Science” is... • Inherently paradigmatic • Always normal science • Produces detailed insights into the minutiae of phenomena studied in the paradigm http://www.flickr.com/photos/31333486@N00/1883498062/
  8. 8. Pre-paradigmatic Science • Paradigms require agreement on... • Epistemology • Ontology • Methodology • Most social sciences are pre-paradigmatic • Primarily exploratory research • Very little replication http://www.flickr.com/photos/askpang/327577395/
  9. 9. Little Science • de Solla Price - “Little Science” is a romanticized precursor to Big Science, featuring lone, long-haired geniuses misunderstood by society, etc. • If it’s not Big Science, it’s Little Science • Pre-paradigmatic and fraught with ambiguity • Often fundamentally exploratory • Epistemological/theoretical/methodological divergence among researchers http://www.flickr.com/photos/mrjoax/2548045246/
  10. 10. Social Science • Social science is real science: the goal is systematic knowledge production • Focuses on the study of the social life of human groups and individuals • IMHO, fundamentally more difficult than “hard” sciences due to infinite complexity of social phenomena • Replicability is a major challenge with respect to scientific method • Not all social science can or should aspire to replicability http://www.flickr.com/photos/smiteme/2379629501/
  11. 11. Normalizing Science • Becoming a normal science requires community and convergence • Ǝ(community) != Ǝ(agreement) • Establishing grand challenges and methods are primary tasks of normalizing • Resistance to change is pervasive http://www.flickr.com/photos/9036026@N08/2949211479/
  12. 12. Scientific Collaboration • Collaboration requires common focus, if not also epistemology and ontology • Challenging enough in normal sciences • Harder in pre-paradigmatic research • Economics: systemic disincentives to collaborate, versus potential benefits and ideals of science http://www.flickr.com/photos/richardsummers/542738965/
  13. 13. Big Science Collaboration • LHC, CERN, etc. • Thousands of collaborators • Complex but coordinated, at least somewhat centralized • Requires shared goals and resources, plus (lots of) communication • Only happens in normal sciences http://www.flickr.com/photos/8767020@N08/531355152/
  14. 14. Little Science Collaboration • A Professor & a grad student, give or take • Localized goals and resources • -> localized research practices • Small research teams • Fundamentally difficult to achieve consensus that allows larger groups • Restricts the ability to obtain funding and undertake ambitious projects http://www.flickr.com/photos/lamazone/2735939345/
  15. 15. Scientific Collaboration Requirements • Shared goals • Establishes focus of research • Shared research resources • Both social and artifactual • Social aspects include training and community socialization we can has share? http://www.flickr.com/photos/ryanr/142455033/
  16. 16. Historical Research Artifacts • Letters, Books, Journals, Lectures • Also technologies: methods, instrumentation • Sharing? • Recordkeeping is not always a researcher’s main priority • Without records, there’s not much to share except the research outputs http://www.flickr.com/photos/smailtronic/1535870363/
  17. 17. Today’s Research Artifacts • Large scale datasets, scripts, software, workflows, papers, images, video, audio, annotations, ephemera, web sites... • “Research objects” - bundling all the pieces together • Hybrids of boundary objects and touchstones • Technologies -> scientific revolution! • Open science http://www.flickr.com/photos/smiteme/2379630899/
  18. 18. Example: FLOSS Research • Phenomenological & interdisciplinary • Software engineering, Information Systems, Anthropology, Sociology, CSCW, etc... • Ethos • (Idealistic) combination of open source values and scientific values http://www.flickr.com/photos/themadlolscientist/2542236565/
  19. 19. FLOSS Phenomenon • Free/Libre Open Source Software “Free as in speech, free as in beer” - liberty versus cost • Distributed collaboration to develop software • Volunteers and sponsored developers • Community-based model of development http://www.flickr.com/photos/prawnwarp/541526661/
  20. 20. Typical FLOSS Research Topics • Coordination and collaboration • Growth and evolution (social and code) • Code quality • Business models and firm involvement • Motivation, leadership, success • Culture and community • Intellectual property and copyright http://www.flickr.com/photos/eean/519258881/
  21. 21. What we study @ SU • Social aspects of FLOSS • What practices make some distributed work teams more effective than others? • How are these practices developed? • What are the dynamics through which self-organizing distributed teams develop and work?
  22. 22. Sharing FLOSS Research Artifacts • Community: Small but growing, maybe around 400 researchers worldwide, with lively face-to-face interaction but relatively low listserv activity • Data: Lots of it, and readily available, though often difficult to use for several reasons • Analyses and tools: Not quite as easy to get, but there if you can find them • Papers: Repositories are as yet underdeveloped, but efforts are underway http://www.flickr.com/photos/12698507@N08/2762563631/
  23. 23. FLOSS Research Community • Handful of small research groups, mostly in UK & Europe • Most often found in Software Engineering departments • International conferences targeted to academics, developers, or both • OSS, ICSE, FOSDEM, etc. • IFIP WG 2.13 http://www.flickr.com/photos/steevithak/2883218362/
  24. 24. FLOSS Research Data • Data sources include interviews, surveys, and ethnographic fieldwork • Digital “trace” data: archival, secondary, by-product of work, easy but hard • Repositories • Hosting “forges” like SourceForge, FreshMeat, RubyForge, etc. • RoRs: Repositories of Repositories • Data sources for research
  25. 25. We Built It... • Motivations • Stop hammering forge servers, getting entire campus IPs blocked... • Stop reinventing the wheel! • Adoption • Shared data sources seeing increasing use • Next step is harder: sharing tools and workflows http://www.flickr.com/photos/circulating/997909242/
  26. 26. RoRs: FLOSSmole • Multiple PIs @ Syracuse, Elon, & Carnegie Mellon; one grad student @ SU (me), a couple of undergrads @ Elon • Public access to 300+ GB data on • 300K+ projects from 8 repositories • Flat files & SQL datamarts • Released via SF & GC • 5 TB allotment on TeraGrid @ SDSC
  27. 27. RoRs: FLOSSmetrics • Produced by LibreSoft with academic and corporate partners • Public access to data for 2800+ projects • Analyzed & raw data from CVS, email, trackers • Tools for: • calculating code metrics • parsing trackers • parsing email lists
  28. 28. RoRs: SRDA • SourceForge Research Data Archive • One PI @ Notre Dame University • One massive 300 GB+ SQL db of monthly dumps from SourceForge • Original obtuse structure, regular table deprecation, some documentation • Gated access: researchers only, condition of data release from SF
  29. 29. RoRs: Emerging Sources • Ultimate Debian Database (UDD) • 300 MB compressed Postgres DB, produced by Debian community • Planning to add to FLOSSmole
  30. 30. FLOSS Research Analyses • When available... • Bespoke Scripts • Taverna workflows
  31. 31. FLOSS Research Papers • First, there was opensource.mit.edu • They no longer maintain it, and gave us the data • Work-in-progress working papers repository at FLOSSpapers.org • Essential viability problem is that repositories require long-term stewardship... • ...which requires long-term commitments of funding and personnel, not just volunteers
  32. 32. FLOSS Research Collaboration • Multiple partners involved in producing FLOSSmole & FLOSSmetrics • Federated data sources by choice, starting to develop ontologies • As yet, a Little Science domain • Cross-institutional collaboration poses many challenges • Usual difficulties magnified by general lack of resources, both financial and human
  33. 33. Latest Initiatives • Resource-oriented • Expanding resources: data, research artifacts, and pedagogical materials • DOIs: 10.4118/* • Semantic data interoperability • Community-oriented • FLOSShub.org
  34. 34. Evangelizing eScience • Made presentations at OSS conferences: well received, but hard to make converts for several reasons • Tried to get other research group members to use Taverna: learning overhead is too high for most • Submitted a paper on eScience to an IS conference: rejected because reviewers were unable to adequately evaluate eScience as a topic, as it’s too unfamiliar • Currently just doing our work this way, as an exemplar http://www.flickr.com/photos/naezmi/2418745377/
  35. 35. Barriers to Uptake • Lack of agreement in research focus, theory, methods; researcher isolation • Bimodal distribution of requisite skills • “I can’t possibly do that! I can’t code!” • “Why bother? I can code my own. You should too; just use Python.” “Overheard” on Twitter: Friend #1: i HATE that openoffice automatically took over my "open with..." defaults. Friend #2: @Friend #1 <opensourcedeveloper> If you don't like it, then why don't you submit code to change the behavior!? </opensourcedeveloper> http://www.flickr.com/photos/noner/1739876378/
  36. 36. What I had to learn to get this far • Taverna • A little bit of OWL, RDF, & SPARQL • A lot more Unix terminal & XML • I would not have taken this on if I had known what was in store, but once I got started, I was hooked • Relational DB management & SQL • More R, plus packages and dependency management • Java & Eclipse - just enough to write my own Beanshells • SVN & SSH http://www.flickr.com/photos/sashala/292868436/
  37. 37. Sociotechnical Engineering • Tools are part of the solution, thanks to brilliant CS and SE people • Social elements are the true barrier • Awareness of methods and benefits • Incentive systems • Resistance to change (paradigms again) • Proof of concept is difficult http://www.flickr.com/photos/pinprick/3117108495/
  38. 38. Using Taverna for Little eScience • Implementing analysis is usually easy • Data handling is almost always hard • All data are in SQL databases, with consistent IDs • Lots of data manipulation is required • Avoiding web services as much as possible • Infrastructure and resources are limited • Benefit is truly questionable: AFAIK, I am 50% of the user base...
  39. 39. Example: Our Recent Research • Estimating user base and potential user interest in FLOSS projects • Based on common release-and-download patterns • Proxy for project success, a common dependent variable [Figure: schematic of downloads across Version 0.5, Version 0.6, and Version 0.7. Annotations: area under curve is active users updating; potential user experimentation; active user base growth; growth (good publicity?)]
  40. 40. “Normal” Download-Release Patterns: BibDesk [Figure: downloads, user_base, and baseline measures, Oct 2005 to Apr 2007]
  41. 41. Taverna’s Download-Release Patterns: External effects! [Figure: downloads around releases 1.3.2-RC1 and 1.5.0, with "+2 presentations" marked and two unexplained peaks marked "?"]
  42. 42. Taverna’s Estimated Baseline & User Base [Figure: 14 day baseline & drop-off]
  43. 43. Taverna’s Estimated Baseline & User Base [Figure: 7 day baseline & drop-off]
  44. 44. Interpretation • Taverna is not a “normal” open source project • Speaking tours, tutorials, articles, and other events influence downloads • What this demonstrates... • Care is needed with quantitative measures • Not all open source projects are the same • Taverna users are just as reactive as any http://www.flickr.com/photos/pagedooley/2121472112/
  45. 45. Where next? • Adoption is a long-term agenda, as changing social practices doesn’t happen overnight • For FLOSS research and our disciplinary communities • We will keep doing our work this way, and hope to draw in others “Won’t you come out and play?” http://www.flickr.com/photos/atiq/2658884520/
  46. 46. Thanks! • Credits where they are due • Kevin Crowston, my advisor • James Howison, my collaborator • Everett Wiggins, my husband
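
The release-and-download estimate outlined in slides 39 to 43 can be illustrated with a minimal sketch. The slides do not give the exact calculation, so everything here is an assumption: the function name, the use of the median of the days before a release as the "baseline", and the sum of downloads above that baseline in the days after the release as the "user base" (the area under the post-release spike). The `window` parameter stands in for the 14 day and 7 day settings named on the slides.

```python
from statistics import median


def estimate_baseline_and_user_base(daily_downloads, release_days, window=14):
    """Hypothetical sketch, not the authors' actual method.

    daily_downloads: list of download counts, one entry per day.
    release_days: indices into daily_downloads on which a release happened.
    window: number of days used for both the baseline and the drop-off period.
    """
    results = []
    for day in release_days:
        before = daily_downloads[max(0, day - window):day]
        after = daily_downloads[day:day + window]
        if not before or not after:
            # Not enough data on one side of the release to estimate anything.
            continue
        # Assumed baseline: typical daily downloads just before the release,
        # read here as interest from potential (new) users.
        baseline = median(before)
        # Assumed user base: downloads above the baseline after the release,
        # read here as existing users fetching the update.
        user_base = sum(max(0, d - baseline) for d in after)
        results.append(
            {"release_day": day, "baseline": baseline, "user_base": user_base}
        )
    return results


if __name__ == "__main__":
    # Synthetic data: a flat 10 downloads/day with a spike after a release on day 14.
    downloads = [10] * 14 + [60, 40, 20] + [10] * 11
    for row in estimate_baseline_and_user_base(downloads, [14], window=14):
        print(row)
```

On this synthetic series the sketch yields a baseline of 10 downloads per day and a user base estimate of 90. As the Taverna slides suggest, external events such as presentations would inflate the baseline window in this kind of calculation, so the choice of window length matters.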