Data, Code, and Research at Scale


Published in: Technology, News & Politics

  • Figure from Joël de Rosnay’s 1979 book “The Macroscope”
  • This is happening elsewhere, across other fields. Consider the impact of Google Books as a macroscope.
  • 34,260 real-life couples: “I met someone on OkCupid”, give username, hundreds per day
    - Would you consider sleeping with someone on the first date :: Do you like the taste of beer?
    - Long-term compatibility :: Do you like horror movies?; Have you ever traveled around another country alone?; Wouldn't it be fun to chuck it all and go live on a sailboat?
  • The Foundation makes grants to support original research and broad-based education related to science, technology, and economic performance, and to improve the quality of American life.
    One thing to know about Sloan: the Foundation likes data. A lot.
  • The Sloan Digital Sky Survey (SDSS) is a major multi-filter imaging and spectroscopic redshift survey using a dedicated 2.5-m wide-angle optical telescope at Apache Point Observatory in New Mexico, United States. The survey began in 2000 and has mapped over 35% of the sky.
  • Census of Marine Life: “global network of researchers in more than 80 nations engaged in a 10-year scientific initiative to assess and explain the diversity, distribution, and abundance of life in the oceans.”
  • Indoor Environment: in fact, virtually every science or social science program we have now involves a data infrastructure.
  • Data deluge
  • What to throw away?
  • Code
  • Data’s great, but to work with it at scale, you need code.
    (The coffee grinder analogy isn’t quite right, but be glad that you didn’t get a meat grinder instead.)
  • The n-gram viewer is a big black box. We have no idea what’s happening inside.
  • They do offer links to the data itself.
  • Look at the arrows, which mask some important transformations.
  • A lot of my scholarly work was on “mediators”, the people between producers and consumers. Oriented in this direction. Handwork vs. work “at scale”.
  • NPR piece on data science
  • John Rauser from Amazon at Strata NYC 2011
  • John Rauser from Amazon at Strata NYC 2011
    “Telling stories with Data”
  • John Rauser from Amazon at Strata NYC 2011
  • NPR piece on data science
  • Beyond data cleanup, production of new knowledge. Communication between participants (channel Lintott).
  • Two main modes of knowledge production: scientific method founded on empirical falsifiability, and hermeneutic approaches that characterize much of the humanities and some social science.
  • Get a big pile of stuff, look for patterns, and iteratively home in.
    Any economist will start shouting “correlation, not causation”.
  • Nod to Dan Atkins for mentioning it yesterday: data mining
  • Steve Ramsay on browsing a library: “Here, I don’t know what I’m looking for, really. I just have a bundle of ‘interests’ and proclivities. I’m not really trying to find ‘a path through culture.’ I’m really just screwing around.”
  • Working at scale with data: sense of play, fiddling with knobs. Exploration, visualization.
  • Standing on the shoulders of giants
  • Citation takes for granted a wide system of institutions, as well as platforms and genres. Cite a book, and you can trust a broad system of libraries as well as the consistency of individual manifestations of the same work.
  • Chain of evidence
  • Data and code are both important
  • Let’s imagine you publish an article. There are many possible points of failure along the chain moving upstream. Sociologists of science describe the process of contestation as a sequential opening of black boxes...
  • Dan Cohen talked about learning to live with imperfection: software is never perfect, it’s just shipped.
  • Social features (GitHub)
  • Not everyone gets commit access; bug tracking is a form of decentralized review
  • Forking
  • Workshop hosted by Victoria Stodden and others that convened projects that leverage technology in the interest of reproducible research
  • Some problems: looked at through another lens, this is essentially a culture of surveillance where everything is visible at all times.
  • Also, limited resources mean that the perfect capture of everything isn’t feasible, or useful downstream.
  • Lots to decide on, and hopefully to address affirmatively rather than simply allowing technology to determine.
  • Step back and look not at individual research projects, but the overall system. We’re seeing a lot of changes...
  • Dan Cohen on PressForward yesterday (12/2/11): “if you don’t like our choices, you can check our work”
  • Opportunities to innovate in humanities, given 1) low stakes in publishing industry, 2) close linkages with libraries, and 3) vibrant community discussion.
  • Data, Code, and Research at Scale

    1. Data, Code, and Research at Scale. Josh Greenberg, The Alfred P. Sloan Foundation, @epistemographer
    2. Disclaimer: These statements do not necessarily reflect the thoughts of the Alfred P. Sloan Foundation or my colleagues; they are mine alone.
    3. Research at Scale
    6. Macroscope
    8. “My aim here is to inspire computer scientists to implement software frameworks that empower domain scientists to assemble their own continuously evolving macroscopes, adding and upgrading existing (and removing obsolete) plug-ins to arrive at a set that is truly relevant for their work” Katy Börner, “Plug and Play Macroscopes”
    9. …,+technology&year_start=1800&year_end=2000&corpus=0&smoothing=3
    11. Data
    12. Big Data
    13. SDSS
    14. Census of Marine Life
    17. Code
    19. …,+technology&year_start=1800&year_end=2000&corpus=0&smoothing=3
    20. …,+technology&year_start=1800&year_end=2000&corpus=0&smoothing=3
    22. Who does the work?
    23. Data Science
    24. Data Science: Engineering, Applied Math. John Rauser @
    25. Applied Math, Writing, Engineering. John Rauser @
    26. Applied Math, Writing, Engineering. John Rauser @
    27. Data Science (#alt-ac?)
    28. All hands on deck
    29. Galaxy Zoo
    32. Galaxy Zoo
    33. Epistemology
    34. Epistemology of Big Data? (Flip Kromer)
    35. Screwmeneutics?
    37. Trust
    38. …,_Rosenwald_4,_Bl._5r.jpg
    39. Reproducibility
    40. empirical falsifiability : methods :: hermeneutic inquiry : provenance
    41. Citation
    42. Our means of dissemination are out of sync with the methods of scholarly production
    44. A thought experiment:
    45. A thought experiment: What if we wrote scholarship like code?
    46. Version Control
    47. Tagged release
    48. Bug Tracking
    49. The very technology that enables research at scale potentially enables new modes of dissemination
    52. Del Rigor en la Ciencia (On Rigor in Science), Jorge Luis Borges: “In that Empire, the Art of Cartography attained such Perfection that the Map of a single Province occupied an entire City, and the Map of the Empire, an entire Province. In time, these Excessive Maps no longer satisfied, and the Colleges of Cartographers raised a Map of the Empire that was the Size of the Empire and coincided with it point for point. Less Devoted to the Study of Cartography, the Following Generations understood that this vast Map was Useless, and not without Impiety they abandoned it to the Inclemencies of the Sun and the Winters. In the Deserts of the West there endure tattered Ruins of the Map, inhabited by Animals and Beggars; in all the Land there is no other relic of the Geographic Disciplines.” Suárez Miranda: Viajes de varones prudentes, libro cuarto, cap. XLV, Lérida, 1658. via
    53. Discuss...
    54. One more thing...
    55. Research at Scale
    56. Disaggregation of scholarly materials
    57. Flourishing of new channels / genres
    58. Humanities : blogs :: Social Sciences : SSRN (preprint) :: Sciences : PLoS ONE (rapid publication)
    59. Addition of data and code to pile
    60. New macroscopic methods of discovery, assessing impact
    61. Why (digital) humanities?