How to follow actors through their traces. Exploiting digital traceability


  • 27/08/12
  • On the one hand, the social sciences could use quantitative methods (surveys and statistics) to collect data on large populations, but the data they collected would necessarily be relatively poor and superficial. On the other hand, they could use qualitative methods (interviews, focus groups, observations) to collect rich and detailed data, but they were then forced to limit their investigation to small populations.
  • The social sciences could observe many things from far away (quantitative methods = wide angle) or take a close look at a few things (qualitative methods = telephoto). Never could they maintain the span and the focus of their observation at the same time, nor change their focal length continuously.
  • Up until now, the social sciences could not use natural experiments either, because this type of experiment requires detailed knowledge of a large number of subjects (Snow, for instance, had the complete map of the water distribution system of London, which allowed him to know which water company was serving each specific household). Unfortunately, these two conditions (detailed knowledge and a large population) are seldom met together in the social sciences. Since their foundation, the social sciences have always had to deal with a sort of methodological strabismus.
  • To use another metaphor, this is what I call the ‘Gulliver sociology’.
  • In the previous unit we learnt how difficult it is to study controversies. In this unit, we will discover that, luckily, there is at least one thing that can help us in this otherwise impossible mission: the one thing that can make the task of controversy mapping less hopeless.
  • Hop-o'-My-Thumb
  • But this situation started to change as soon as social scientists stopped considering media (and electronic media in particular) just as an object of study…
  • … and started considering them also as a possible source of data. Digital media have, in fact, a very interesting feature: all the interactions that they mediate become easily traceable and are often actually traced. Though these traces are not collected for the sake of social science (but for surveillance, marketing or technical optimization), they can nonetheless be exploited by social scientists, giving the social sciences, for the first time in their history, access to plenty of data.
  • These data concern huge populations: about one third of the world population has access to the Internet and about half of it owns a mobile phone. Digital media are spreading like an immense sheet of carbon paper, tracing social phenomena to an extent that has never been possible before. As a proof of concept, the image in the slide shows how Paul Butler generated a very detailed map of the world simply by mapping friendship connections on Facebook.
  • At the same time, these data are also as rich as the data collected with qualitative methods. As a proof of concept, see the documentary on the life of America Online user 711391. Drawing on an accidental leak of AOL data, the documentary reports the complete three-month search history of this user. The sequence of her queries (and nothing else) allows disturbingly intimate access to the life of this “religious middle-aged and somewhat obese middle-aged lady from Houston Texas who is looking for a way to rejuvenate her sex life” (as we come to discover).
  • Most importantly, thanks to digital traceability it is now possible to collect data that are rich and concern large populations at the same time, as convincingly demonstrated by the famous Google study on the detection of flu epidemics.
  • In this study Google engineers identified the 45 search queries that best matched the flu curves released by the U.S. Centers for Disease Control (CDC). Then they combined the curves of these 45 queries and built an indicator that has an incredible mean correlation of 0.97 with CDC data.
  • The advantage is that, whereas the CDC needs about two weeks to collect and release the data on US flu epidemics, Google can calculate its indicator every day.
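
The selection step described in the notes above (rank queries by how well their curve matches a reference curve, keep the best ones, combine them) can be sketched as follows. This is only a minimal illustration and not Google's actual method: the query names and weekly values are invented, the function names are hypothetical, and the plain average used to combine the curves is an assumption, since the notes only say that the curves were "combined".

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("series must have the same length (at least 2)")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat series carries no signal
    return cov / sqrt(var_x * var_y)


def build_indicator(query_series, reference, k):
    """Keep the k queries that best correlate with the reference curve
    and average their curves into a single indicator (assumed combination)."""
    ranked = sorted(
        query_series,
        key=lambda q: pearson(query_series[q], reference),
        reverse=True,
    )
    selected = ranked[:k]
    indicator = [
        sum(query_series[q][t] for q in selected) / len(selected)
        for t in range(len(reference))
    ]
    return selected, indicator


# Toy weekly data, invented purely for illustration.
reference = [1.0, 2.0, 4.0, 7.0, 5.0, 2.0]          # stand-in for an official flu curve
query_series = {
    "flu symptoms": [10, 21, 39, 72, 48, 19],
    "fever remedy": [5, 9, 22, 33, 27, 11],
    "football scores": [50, 48, 52, 49, 51, 50],     # unrelated query
}

selected, indicator = build_indicator(query_series, reference, k=2)
print(selected)
print(round(pearson(indicator, reference), 3))
```

Note that most of the work is discarding queries (here, the unrelated one), which anticipates the point made later about data mining as removal rather than accumulation.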
  • (Google also made the same type of research possible to anyone and on any subject through Google Insight for Search and Google Correlate)
  • From the point of view of social science, the change is dramatic. For the first time, it is possible to start imagining methods having both a large scope and a detailed focus, thereby overcoming the limitations of both quantitative and qualitative methods. The image in the slide is a good proof of concept. In this map of the US blogosphere in 2006, created by Ben Fry, it is possible to zoom out to see the big picture and observe large-scale patterns (such as the fact that the less visible websites link to the more visible ones, but not the other way around: the so-called preferential attachment), but also to zoom in and observe each individual connection. A new generation of quali-quantitative methods therefore becomes possible …
  • This is a map of the digital tools and methods that we use at the médialab of Sciences Po. In this course (and in particular in the second semester) you will learn to use most of them.
  • … and it becomes possible to move from the sociology of Gulliver to the sociology of Alice (as you know, in her trip to Wonderland Alice can change her size at will by drinking a magic potion or eating a magic cake).
  • The first challenge consists in taking the data mining metaphor seriously. Anyone who has ever visited a gold mine knows that what is striking about this type of landscape is the feeling of absence that dominates it. Where a mountain is supposed to be, there is a huge hole instead. Describing mining as the act of collecting gold and other precious materials is mistaking the aim for the practice. 0.1% of mining is about collecting precious substances; 99.9% of it is about removing tons and tons of rock, sand and earth. Gold is the product of such absence, what is left when everything else is gone. The same is true for information mining: it is not about collecting as much data as possible; it is about getting rid of most of it. This is important, because the current ‘data deluge’ ideology, obsessed as it is with the question of collecting, storing and exploiting data, forgets that the careful selection of data is the most important part of any scientific protocol.
  • An example will make our argument clear. The so-called Internet map is, to our knowledge, the largest publicly available map of the Web. As you can see, very little knowledge can be extracted from this map. All that we can see is that the Web is polarized by language (the color of the nodes) and that some nodes are (far) more connected than others (the size of the nodes). None of this is a surprise.
  • Beautiful and breathtaking as they may be, maps of this kind are useless for research purposes. This is not data mining, this is compulsive hoarding: a syndrome that is growing more and more serious among the data deluge fans.
  • A good map of the Web is always limited in its ambition: it tries to represent a limited portion of the Web, and the better this portion is delimited, the better the map. The example shows an interesting map of the French political blogosphere, produced by Linkfluence (a research partner of the médialab).
  • Because the websites were selected carefully, it is possible to use this map as a research tool and discover, for example, that the extreme left and the extreme right occupy two very different positions in French online politics: the first being small, spread out and central; the second being massive, clustered and peripheral.
  • 0.1% of Web crawling is about collecting relevant websites; 99.9% of it is about removing irrelevant ones. That is why the most important button in all the crawling tools that we develop at the médialab (in the slide you see the old Navicrawler and the soon-to-be-released Hyphe) is the one that excludes a website from the corpus. Providing us with tools for filtering, delimiting and sieving data is the first contribution that we would like to have from CHI experts.
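
The "exclude from corpus" operation described above can be sketched in a few lines. This is only a minimal illustration under stated assumptions, not how Navicrawler or Hyphe are implemented: the function names are hypothetical, the URLs are placeholders, and a "website" is crudely approximated by its host name.

```python
from urllib.parse import urlparse


def site_of(url):
    """Host name of a URL, lower-cased and without a leading 'www.'
    (a crude stand-in for the notion of 'website')."""
    host = urlparse(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def filter_corpus(crawled_urls, excluded_sites):
    """Split crawled URLs into those kept in the corpus and those removed
    because their site is on the exclusion list."""
    excluded = {s.lower() for s in excluded_sites}
    kept, removed = [], []
    for url in crawled_urls:
        (removed if site_of(url) in excluded else kept).append(url)
    return kept, removed


# Placeholder URLs, invented for illustration.
crawled = [
    "http://www.example.org/blog/post-1",
    "http://example.net/shop/item?id=3",
    "http://EXAMPLE.org/about",
    "http://example.com/forum/thread-9",
]
kept, removed = filter_corpus(crawled, excluded_sites=["example.net", "example.com"])
print(kept)
print(removed)
```

Keeping the removed URLs alongside the kept ones is a deliberate choice in this sketch: the exclusion list documents how the corpus was delimited, which is part of the research protocol.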
  • The first skill is ‘searching’, that is to say using a search engine. This is, by far, the most common way of finding information on the Web. All of you have already used search engines millions of times. And yet, it is important (and not only for the sake of controversy mapping) for you to understand the very specific movement of search engine querying. Contrary to what you may think, this movement should not aim at expansion (finding more information), but at reduction. The problem with search engines is not that they return too little information, but that they return too much (and most of it is not relevant). Improving one’s queries is therefore an effort to find more and more specific words capable of reducing the information returned by the search engine.
  • In fact, the movement just described needs to be made more precise. The aim of the search, of course, is not to reduce the quantity of information found, but to reduce the irrelevant information and increase the relevant information. This movement of concentration (or distillation) requires identifying a number of ‘specific keywords’ clearly focused on the subject of the research.
  • These subject-specific keywords can include proper names, names of institutions, toponyms, scientific/technical terminology, scientific references and in general all words or expressions that are not polysemous or vague.
  • And here is some further advice on how to improve your queries.
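
The movement of reduction described in these notes can be made concrete by assembling a query from subject-specific keywords and a few operators (exact phrase, OR, exclusion). The sketch below is only illustrative: the helper name is hypothetical, the excluded word is an invented example, and operator syntax varies between search engines, so the output should be checked against the engine actually used.

```python
def build_query(exact_phrases=(), any_of=(), exclude=()):
    """Assemble a search query string from subject-specific keywords.

    exact_phrases: phrases wrapped in double quotes (exact match)
    any_of:        alternative terms joined with OR
    exclude:       terms prefixed with '-' to filter out irrelevant results
    """
    def quote_if_needed(term):
        # Multi-word terms only make sense as exact phrases.
        return f'"{term}"' if " " in term else term

    parts = [f'"{phrase}"' for phrase in exact_phrases]
    if any_of:
        parts.append(" OR ".join(quote_if_needed(t) for t in any_of))
    parts.extend(f"-{quote_if_needed(t)}" for t in exclude)
    return " ".join(parts)


# Keywords taken from the Bisphenol example in the slides;
# the excluded word 'recipe' is an invented illustration.
query = build_query(
    exact_phrases=["David Melzer"],
    any_of=["bisphenol", "BPA"],
    exclude=["recipe"],
)
print(query)
```

Each added keyword or exclusion narrows the result set, which is the point of the exercise: fewer results, a higher share of them relevant.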
  • In order to understand the revolution brought by digital traceability to controversy mapping and, more generally, to social science, we have to go back to a famous study conducted by the British epidemiologist John Snow in the middle of the 19th century. John Snow was trying to understand the mechanisms of diffusion of cholera (one of the main causes of death in the UK). At the time, the dominant theory was that cholera was caused by pollution or a noxious form of “bad air”. Snow, however, criticized this theory and claimed instead that cholera germs were transported by infected water. Snow first tried to prove his theory by showing that one particularly severe cholera outbreak in London was centered around a particular water pump located in Broad Street. But how could he prove that this particular observation could be generalized to all cholera epidemics?
  • Snow, of course, could not prove his theory by direct experiments on human beings, and yet experimental evidence was exactly what he needed to convince the scientific community. Trying to solve this conundrum, Snow came up with the idea of the ‘natural experiment’. First of all, he observed that the mortality rate in different households was strongly correlated with the company that provided them with water. In particular, in the houses supplied by the Southwark Company the mortality was almost six times higher than in the houses supplied by the Lambeth Company.
  • But this proof was not sufficient, as other differences between the households could have explained the difference. Snow, however, had at his disposal a detailed map of the London water system and observed that the distribution networks of Southwark and Lambeth intermingled in central London. Since in these districts the households supplied by the two water companies were side by side, Snow could reasonably assume that all other conditions were equal. In other words, it was as if the London population had been divided randomly into an experimental group and a control group: a perfect experimental setting, except that Snow had not prepared it himself but just found it in ‘nature’.
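
The comparison at the heart of the natural experiment described above is a ratio between the mortality rates of two groups that differ only in their water supplier. The arithmetic can be sketched as follows. The counts below are hypothetical and are NOT Snow's figures: they were chosen only so that the ratio comes out close to the "almost six times" mentioned in the notes, and the function names are illustrative.

```python
def mortality_rate(deaths, households, per=10_000):
    """Deaths per `per` households."""
    if households <= 0:
        raise ValueError("households must be positive")
    return deaths / households * per


def rate_ratio(exposed, control):
    """Ratio between the mortality rates of two (deaths, households) groups."""
    return mortality_rate(*exposed) / mortality_rate(*control)


# Hypothetical counts, invented for illustration (not historical data).
supplier_a = (58, 1_000)   # (deaths, households) served by the first company
supplier_b = (10, 1_000)   # (deaths, households) served by the second company

ratio = rate_ratio(supplier_a, supplier_b)
print(f"Mortality is {ratio:.1f} times higher in the first group")
```

The ratio is only interpretable as an effect of the water supply because the two groups are otherwise comparable, which is exactly what the intermingled distribution networks guaranteed.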
  • One of the main differences between the natural sciences and the social sciences is that the latter cannot reproduce the phenomena that they study in the controlled setting of the laboratory. The social sciences cannot rely on controlled experiments to investigate collective dynamics (and this is why the comics in the slide are funny). But can the social sciences at least employ natural experiments?

    1. 1. How to follow actors through their traces Exploiting digital traceability Tommaso Venturini
    2. 2. The quali/quantitative divide rich data, small populations large populations, poor data
    3. 3. The problems with either method Traditional quantitative methods: • data collection: collecting standardized discourses risks hiding heterogeneity • data treatment: statistical comparison risks hiding divergences Traditional qualitative methods: • data collection: risk of not being representative (beyond small controversies) • data treatment: problem of weighting different discourses
    4. 4. The problem with both methods rich data, small populations large populations, poor data Wide angle VS. telephoto
    5. 5. Follow the White Rabbit why controversy mapping (and digital methods) will change everything you know about sociology Tommaso Venturini The methodological strabismus of social sciences… Photo credit – tarout_sun via Flickr - ©
    6. 6. … is reified in social theory The collective self is not a simple epiphenomenon of its morphologic base, precisely as the individual self is not a simple efflorescence of the nervous system. For the collective self to appear, a sui generis synthesis of individual self has to be produced. This synthesis creates a world of feelings, ideas, images that, once come to life, follow their own laws. Emile Durkheim, 1912 Les formes élémentaires de la vie religieuse
    7. 7. Emergence The emergent is unlike its components insofar as these are incommensurable, and it cannot be reduced to their sum or their difference (p. 412) George Henry Lewes, 1875 Problems of Life and Mind
    8. 8. Cats and mice Jack Cohen, 2000 The Collapse of Chaos: Discovering Simplicity in a Complex World
    9. 9. The amazing dictyostelium discoideum Evelyn Fox Keller “morphogenesis”
    10. 10. God save the ant Queen Theraulaz, G. & Bonabeau, E. (1999) A brief history of stigmergy Artificial Life, 5, 97–116
    11. 11. The Bootstrapping of life
    12. 12. The bootstrapping of intelligence
    13. 13. The bootstrapping of society Thomas Hobbes, 1651 The Leviathan
    14. 14. Gulliver sociology Gulliver's Travels Jonathan Swift, 1726
    15. 15. Diving in magma T. Venturini (2010) Public Understanding of Science 19(3)
    16. 16. The Tarde vs Durkheim controversy Gabriel Tarde vs Emile Durkheim
    17. 17. Against emergence It is surprising to see the men of sciences, so ready to repeat that nothing is ever created from nothing, admitting implicitly (as if it was self-evident) that the connections among different beings can become beings themselves (p. 67) Tarde, 1893 Monadologie et sociologie
    18. 18. Against emergence Let us suppose for a moment that one of our human States, composed not of a few thousand but of a few quadrillion or quintillion men, hermetically sealed and individually inaccessible (a sort of China infinitely more populous still and more closed), were known to us only through the data of its statisticians, whose figures, bearing on very large numbers, recurred with extreme regularity. When a political or social revolution, revealed to us by a sudden swelling or collapse of some of these figures, occurred in this State, however certain we might be that it was a fact caused by individual ideas and passions, we would avoid losing ourselves in superfluous conjectures about the nature of these causes, the only true ones but impenetrable, and the wisest course would seem to us to explain the abnormal figures as best we could through ingenious comparisons with skilfully handled normal figures. We would thus arrive at least at clear results and symbolic truths. It would nevertheless be important to remind ourselves from time to time of the purely symbolic character of these truths. Tarde, 1893 Monadologie et sociologie
    19. 19. How to overcome the quali-quantitative divide?
    20. 20. Inscriptions as traces Callon, M., Law, J., & Rip, A. (1986) Mapping the Dynamics of Science & Technology
    21. 21. Inscriptions as traces Callon, M., Law, J., & Rip, A. (1986) Mapping the Dynamics of Science & Technology
    22. 22. And then the web arrived… <a href=""> click here </a>
    23. 23. And then the web arrived… and Google with it Brin, S., & Page, L. (1998). The Anatomy of a Large-Scale Hypertextual Web Search Engine. Computer Networks and ISDN Systems, 30(1-7), 107–117
    24. 24. Digital traceability Latour, B. (2007). Beware your Imagination Leave Digital Traces. Times Higher Literary Supplement. Owen Gingerich, the great historian of astronomy, spent a life-time retrieving all the annotations of all the copies of Copernicus’s first edition. He could thus give a precise meaning to the rather empty notion of “Copernican revolution” and could show which parts of the book everyone had read and misinterpreted. Nowadays, any scientist can do the same for each portion of each article he or she has published so long as the local library has bought a good package of digital data banks. But what is more extraordinary is that any journalist can do so as well for the latest Madonna video or the dirtiest rumour about Prince Harry’s love affairs. In other words, the former distinction between the circulation of facts and the dissemination of opinions has been erased in such a way that they are both graduating to the same type of visibility — not a small advantage if we wish to disentangle the mixture of facts and opinions that has become our usual diet of information
    25. 25. Digital traceability Once you can get information as bores, bytes, modem, sockets, cables and so on, you have actually a more material way of looking at what happens in Society. Virtual Society thus, is not a thing of the future, it’s the materialisation, the traceability of society. It renders visible because of the obsessive necessity of materialising information into cables, into data. Latour, B. 1998 “Thought Experiments in Social Science: from the Social Contract to Virtual Society”
    26. 26. From digital traceability … Bruno Latour (1998), argued that the Web is mainly of importance to social science insofar as it makes possible new types of descriptions of social life. According to Latour, the social integration of the Web constitutes an event for social science because the social link becomes traceable in this medium. Thus, social relations are established in a tangible form as a material network connection. We take Latour’s claim of the tangibility of the social as a point of departure in our search (p. 342). Rogers, R., and Marres, N. 2002 “French scandals on the Web, and on the streets: A small experiment in stretching the limits of reported reality.” Asian Journal of Social Science 66: 339-353.
    27. 27. … to digital methods The Internet is employed as a site of research for far more than just online culture. The issue no longer is how much of society and culture is online, but rather how to diagnose cultural change and societal conditions with the Internet. The conceptual point of departure for the research program is the recognition that the Internet is not only an object of study, but also a source. Rogers, R. 2009 The End of the Virtual: Digital Methods. Amsterdam University Press.
    28. 28. Quali-quantitative methods Top 50 US blogs Ben Fry, 2006
    29. 29. Datascapes exploration Linkscape Linkscape© by Linkfluence©
    30. 30. médialab tools
    31. 31. Alice sociology Alice's Adventures in Wonderland Lewis Carroll, 1865
    32. 32. Building on faults T. Venturini (2012) Public Understanding of Science 21(7)
    33. 33. Beware! 1. More data means more noise 2. Digital data is not your data
    34. 34. Beware: more data means more noise!
    35. 35. Taking “data mining” seriously Yanacocha Gold Mine, Cajamarca, Peru
    36. 36. A (pseudo-)exhaustive map of the Web
    37. 37. Compulsive hoarding
    38. 38. A good map of the Web
    39. 39. A good map of the Web
    40. 40. How to search/query Bisphenol Bisphenol heart diseases controversy Bisphenol Melzer controversy
    41. 41. How to search/query Bisphenol Bisphenol heart diseases controversy Bisphenol Melzer controversy BPA Polycarbonate endocrine disruptor heart disease David Melzer Food and Drug Administration coronary artery disease Monica Lind Jeremy Pearson Steven Hentges Polycarbonate Global Group
    42. 42. Subject-specific keywords • Proper names • Name of institutions • Toponyms • Scientific/technical terminology • Scientific references • …
    43. 43. Improving your query • Exploit linguistic differences • Go advanced (use search fields) • Limit the time span • Use search operators • “exact” / -exclude / ~synonyms / * / OR / AND
    44. 44. Beware: digital data is not your data!
    45. 45. Whose data is this? • Proliferation of new devices, genres and formats for the documentation of social life… explosion of digital technologies that enable people to report and comment upon social life. • Routine generation of data about social life as part of social life. ‘Social media’ platforms… embed the process of social data generation in everyday practices. • Development of online platforms and tools for the analysis of digital social data. These days, most online platforms come with ‘analytics’ attached: a set of tools and services facilitating the analysis of the data generated by said platforms. Marres, N. (2011). Re-distributing Methods: Interventions in Digital Social Research.
    46. 46. Redistribution of research methods • Methods as usual (ex. Andrew Abbott, ) The techniques used by digital platforms have been long used in social sciences. • Big methods (ex. Newman et al, 2007) Digital traceability increases the quantity of social data thereby demanding use of mathematical techniques of analysis. • Virtual methods (ex. Christine Hine, 2000, 2005) Digital media transform the quality of social practices and demand therefore increased efforts of observations and interpretation. • Digital methods (ex. Richard Rogers, 2009) Digital platforms have their own methods that need to be understood and re-purposed for social research. • Re-mediation of methods (ex. Noortje Marres, 2011) The techniques used by digital platforms have been long used in social sciences, but are radically transformed by the new context of their use. Marres, N. (2011). Re-distributing Methods: Interventions in Digital Social Research. More redistribution Less
    47. 47. Natural experiments Snow, J. (1855). On the Mode of Communication of Cholera
    48. 48. Natural experiments Snow, J. (1855). On the Mode of Communication of Cholera
    49. 49. Natural experiments Snow, J. (1855). On the Mode of Communication of Cholera