Figures of the Many - Quantitative Concepts for Qualitative Thinking
 


Slides for a talk given at the Big Data Symposium at Parnassos in Utrecht on April 25 2013.

Usage Rights

CC Attribution License

  • An almost classic kind of reasoning about "more". Image: http://www.prweb.com/releases/information/digital/prweb509640.htm
  • Every one of us possesses a large number of objects, many of them computers.
  • People do a lot of different things on Twitter, Facebook, etc. – and just because you and your immediate vicinity seem to have coherent practices, this does not mean others do.
  • Anatomy of a tweet. https://twitter.com/ICIJorg/status/321585235491962880 https://api.twitter.com/1/statuses/show/321585235491962880.json
  • Very large scale systems on the one side, but highly concentrated data repositories on the other. The promise of data analysis is, of course, to use that data to make sense of all the complexity.
  • C. Wright Mills vs. Paul Lazarsfeld. Many people argue that we no longer need that grant because we already have the data.
  • Reasoning then guides practice. Description => decision-making.
  • "Why does the astronaut step into the space shuttle?" It does not seem like a sensible idea. What reasons are given so that we do not think of astronauts as suicidal?
  • Cost-benefit analysis! How to price a life? (today: expected future earnings)
  • http://www.youtube.com/watch?v=zFl6p4D59AA http://www.videohippy.com/video/11216/Little-Britain-Computer-says-No Example: opening a bank account at ABN-Amro (credit rating) http://www.creditchecker.nl/ Questions: Should I give that person money? How much? At what interest rate?
  • Questions: I am the government, what should I do? Where should I invest? How does the economy work? Adolphe d'Angeville: Essai sur la Statistique de la Population Française, 1836 - Full document: http://www.europeana.eu/portal/record/03486/DE44EEC02EA9F56E94AD9D3BD077AB298A92514E.html
  • Making decisions: in particular on the interpersonal level!
  • Allows for all kinds of folding, combinations, etc. – Math is not homogeneous, but sprawling! Different forms of reasoning, different modes of aggregation. These are already analytical frameworks, different ways of formalizing.
  • http://www.facebook.com/ElShaheeed (Created by Wael Ghonim, considered to be a central place for the sparking of the Egyptian Revolution) http://apps.facebook.com/netvizz/ (tool used for extraction)
  • Simply plotting events is an analytical gesture. (=> pattern)
  • Changing scales, analytical gesture, "tame" large numbers and heighten visibility
  • Adding variables => allow for comparisons
  • Count per interval (here: day).
  • Different visuals, change counting interval, very different effect
  • But if we look at the number of posts published on the page, this is a very different picture! So we want to compare!
  • Find outliers and interesting moments not only in terms of values, but relationships between values.
  • Looking at "central tendencies" in data. When does it make sense? Here it does, because there is no power law.
  • What do the averages characterize here? Not much – there is no "typical" post.
  • In statistics, regression analysis is a technique for estimating the relationships among variables (correlation). A probabilistic relationship: height and weight are correlated. If you are very tall, there is a good chance that you also weigh more; this is a statistical, not a deterministic, relationship. Erosion of determinism in the 19th century. Title: Recherches sur la population, les naissances, les décès, les prisons, les dépôts de mendicité, etc., dans le royaume des Pays-Bas, par M. A. Quételet,… 1827 http://gallica.bnf.fr/ark:/12148/bpt6k81568v.r=.langEN
  • Positive correlation, but it's not 1:1
  • And now to graph theory.
  • Forsythe and Katz, 1946 – "adjacency matrix", Moreno, 1934
  • Visualization is, again, one type of analysis. Which properties of the network are "made salient" by an algorithm? http://thepoliticsofsystems.net/2010/10/one-network-and-four-algorithms/ Models behind: spring simulation, simulated annealing (http://wiki.cns.iu.edu/pages/viewpage.action?pageId=1704113)
  • So, what can we do? Logistics are important, because they determine who can do what kind of research, requirements for groups, etc. 1% is easy to handle for modern hardware; but for how long?
  • A platform that hosts many different practices, from interpersonal communication to mass-media-like outlets such as Lady Gaga's account, which has 36M followers. But means or medians are still reference points!
  • We can of course produce descriptive statistics! Baselining allows us to make "drawing the line" more informed. Does not evacuate bias – there is no "view from nowhere" – but maybe more conscious.
  • Extend word lists (what am I missing?), account for refraction.
  • Compare
  • Larger roles of hashtags, not all are issue markers!
  • "All in all, this process resulted in the specification of nine centrality measures based on three conceptual foundations. Three are based on the degrees of points and are indexes of communication activity. Three are based on the betweenness of points and are indexes of potential for control of communication. And three are based on closeness and are indexes either of independence or efficiency." (Freeman 1979) What concepts are they based on?
  • Network metrics are highly dependent on individual variables.
  • There is no need to analyze and visualize a graph as a network. Characterize hashtags in relation to a whole (their role beyond my sample), better understand our fishing pole and the weight it carries. Tbt: throwback thursday
  • How do we interpret this: understand the platform, understand the context of the phenomenon, understand the algorithm, etc.
  • How do we interpret something like this?
  • Quantitative forms allow us to fill this with "content".

Presentation Transcript

  • Figures of the Many: Quantitative Concepts for Qualitative Thinking. Bernhard Rieder, Universiteit van Amsterdam, Mediastudies Department
  • Context
    Terms like "big data", "computational social science", "digital humanities", "digital methods", etc. are receiving a lot of attention.
    They point to a set of practices for knowledge production: data analysis, visualization, modeling, etc.
    Instead of a totalizing search for a "logic" of data analysis, we could inquire into the vocabulary of analytical gestures that constitute the practice of data analysis.
    A twofold approach to methods:
    ☉ Engagement, development, application => digital methods
    ☉ Conceptual, historical, and political analysis and critique => software studies
  • This presentation
    How do we talk about data? How do we analyze them? What is our frame of thought? How do we go further in terms of imagination, expressivity?
    ☉ 1 / Confronting "the many"
    ☉ 2 / Two kinds of mathematics
    ☉ Objects and their properties => Statistics
    ☉ Objects and their relations => Graph theory
    Engage the theory of knowledge (epistemology) mobilized in data analysis, but through the actual techniques and not generalizing concepts.
  • What styles of reasoning?
    Hacking (1991) builds the concept of "style of reasoning" on A. C. Crombie’s (1994) "styles of scientific thinking":
    ☉ postulation and deduction
    ☉ experiment and empirical research
    ☉ reasoning by analogy
    ☉ ordering by comparison and taxonomy
    ☉ statistical analysis of regularities and probabilities
    ☉ genetic development
    What kind of reasoning are we mobilizing in data analysis?
    Is the history of styles of reasoning simply intellectual progress, or adaptation to a changing world, or co-constitutive of that world?
    What is our world like?
  • "It is hard to believe that we still have to absorb the same types of actors, the same number of entities, the same profiles of beings, and the same modes of existence into the same types of collectives as Comte, Durkheim, Weber, or Parson [sic], especially after science and technology have massively multiplied the participants to be cooked in the melting pot." (Latour 2005, 260)
  • 1 / The many
    The proliferation of actors and facilitation of transversal connectivity have led to large and complex forms of socio-technical grouping and structuring.
    Forms of organization take the shape of (multi-sided) markets based around technological platforms that facilitate transactions.
    Social media use simple but flexible grammars of connectivity (combination of point to point and list forms), exchange, and aggregation that accommodate various practices and levels of scale.
    The diversity of practices, contents, geographies, topologies, intensities, motivations, etc. makes it hard to generalize and theorize dynamics of use.
  • Platforms like Twitter boost opportunities for connectivity between various types of actors.
  • At the same time, they produce detailed data traces that are highly centralized and searchable.
  • Quality / quantity
    "One of my favorite fantasies is a dialogue between Mills and Lazarsfeld in which the former reads to the latter the first sentence of The Sociological Imagination: Nowadays men often feel that their private lives are a series of traps. Lazarsfeld immediately replies: How many men, which men, how long have they felt this way, which aspects of their private lives bother them, do their public lives bother them, when do they feel free rather than trapped, what kinds of traps do they experience, etc., etc., etc. If Mills succumbed, the two of them would have to apply to the National Institute of Mental Health for a million-dollar grant to check out and elaborate that first sentence. They would need a staff of hundreds, and when finished they would have written Americans View Their Mental Health rather than The Sociological Imagination, provided that they finished at all, and provided that either of them cared enough at the end to bother writing anything." (Maurice Stein, cit. in Gitlin 1978)
    Theory vs. empiricism, macro vs. micro, qualitative vs. quantitative, inductive vs. deductive, associative vs. formalistic, etc.
    The promise of data analysis tools, applied to exhaustive (and cheap) data, is to bridge the gap, to allow zooming, "quali-quanti" (Latour 2010).
  • Define: data
    “facts and statistics collected together for reference or analysis. See also datum.
    - Computing: the quantities, characters, or symbols on which operations are performed by a computer, being stored and transmitted in the form of electrical signals and recorded on magnetic, optical, or mechanical recording media.
    - Philosophy: things known or assumed as facts, making the basis of reasoning or calculation.” (Oxford American Dictionary)
    Reasoning (OAD): "think rationally", "use one's mind", "calculate", "make sense of", "come to the conclusion", "judge", "persuade", etc.
    Reasoning as "giving reasons" – what counts as a good reason? What counts as a good argument? As a proof? What is "good" knowledge?
    Reasoning as a series of techniques, e.g. science, engineering, etc.
  • Why does the astronaut step into the space shuttle?
  • A short history of reasoning the "more"
    Commercial Capitalism (13th +): calculating for trade, arithmetic, sharing risk and profit in long-distance commerce
    Rise of the Nation State (17th +): "art of the state", mercantilism, scientific revolution
    Industrialization (19th +): urbanization, scientific management, large bureaucracies
    ☉ Fibonacci, "Liber Abaci", Calculating with Arabic numerals (Pisa, 1202)
    ☉ Unknown, "Arte dell'Abbaco", Practical arithmetic (Venice, 1478)
    ☉ Pacioli, "Summa de arithmetica, geometria, proportioni et proportionalità", Double-entry bookkeeping (Venice, 1494)
    ☉ William Petty & John Graunt, Political Arithmetick (17th century)
    ☉ Hermann Conring & Gottfried Achenwall, Statistik (17th & 18th century)
    ☉ Adolphe Quetelet, Statistical regularities and the "average man" (19th century)
    ☉ Francis Galton & Karl Pearson, Public health and eugenics (late 19th century)
  • Liber Abaci, Fibonacci, 1202. Calculation for accounting, money-changing, insurance, lending, measurement, etc.
  • "Having proved that there die about 3,506 persons at Paris unnecessarily, to the damage of France, we come next to compute the value of the said damage, and of the remedy thereof, as follows, viz., the value of the said 3,506 at 60 livres sterling per head, being about the value of Algier slaves (which is less than the intrinsic value of people at Paris), the whole loss of the subjects of France in that hospital seems to be 60 times 3,506 livres sterling per annum, viz., 210,360 livres sterling, equivalent to about 2,524,320 French livres." (Petty 1655)
  • The Assurance of Lives, Charles Babbage, 1826. First life tables were assembled in the 17th century by John Graunt. Babbage builds a machine to produce tables faster.
  • Essai sur la statistique de la population française, Adolphe d'Angeville, 1836. Population census, tax register, house numbers, etc. Modern statistics, large bureaucracies, quantitative social sciences, etc.
  • Some conclusions for part 1
    Over the last centuries, scientific thinking has become the dominant way of producing knowledge and making decisions in most societies.
    Scientific thinking implies various styles of reasoning, different ways of "giving reasons", different analytical gestures, etc.
    Styles are intrinsically connected to our "lifeworld" (Husserl 1936).
    Two diagnoses:
    ☉ Our lifeworld is changing in significant ways => "the many"
    ☉ We need new ways of making sense of it => data analysis
    What is the style of data analysis? Its epistemology? One or many?
    What are its techniques, its analytical gestures?
  • 2 / Two kinds of mathematics
    Can there be data analysis without math? No.
    Does this imply epistemological commitments? Yes.
    But there are choices, e.g. between:
    ☉ Confirmatory data analysis => deductive
    ☉ Exploratory data analysis (Tukey 1962) => inductive
    There is a fast-growing variety of analytical gestures focusing on large numbers of formalized and classed objects.
  • 2 / Two kinds of mathematics
                            Statistics                  Graph theory
    Observed:               objects and properties      objects and relations
    Inferred:               relations                   structure
    Data representation:    the table                   the matrix
    Visual representation:  quantity charts             network diagrams
    Grouping:               class (similar properties)  clique (dense relations)
  • Facebook Page "ElShaheeed", June 2010 – June 2011 (Poell / Rieder, forthcoming). 7K posts, 700K users, 3.6M comments, 10M likes (tool: netvizz), work in progress!
  • Statistics
    New media platforms funnel practices into reduced and largely formal "grammars of action" (Agre 1989); data is therefore very clean, very complete, and very detailed.
    Can be imported with great ease into standard packages that come with many analytical gestures built in (R, Excel, SPSS, Rapidminer, etc.).
    Tools are easy, concepts are hard.
  • Facebook Page "ElShaheeed", June 2010 – June 2011: comment timescatter
  • Facebook Page "ElShaheeed", June 2010 – June 2011: comment timescatter, log10 y scale
  • Facebook Page "ElShaheeed", June 2010 – June 2011: comment timescatter, log10 y scale, likes on
  • Facebook Page "ElShaheeed", June 2010 – June 2011: comment timeline, per day
  • Facebook Page "ElShaheeed", June 2010 – June 2011: comment timeline, per month
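
The per-day and per-month timelines come down to counting events per interval. A minimal sketch in Python, assuming comment timestamps are available as ISO 8601 strings; the sample values and the helper name `count_per_interval` are made up for illustration and are not taken from the ElShaheeed data:

```python
from collections import Counter
from datetime import datetime

def count_per_interval(timestamps, fmt):
    """Count events per interval; `fmt` is a strftime pattern naming the interval."""
    return Counter(datetime.fromisoformat(ts).strftime(fmt) for ts in timestamps)

# Hypothetical comment timestamps (illustrative only).
comments = [
    "2011-01-25T10:15:00",
    "2011-01-25T18:40:00",
    "2011-01-28T09:05:00",
    "2011-02-11T21:30:00",
]

per_day = count_per_interval(comments, "%Y-%m-%d")   # one bucket per day
per_month = count_per_interval(comments, "%Y-%m")    # one bucket per month
```

The same events produce different pictures depending on the counting interval, which is why the choice of interval is itself an analytical gesture.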
  • Facebook Page "ElShaheeed", June 2010 – June 2011: page posts by type, per month
  • Facebook Page "ElShaheeed", June 2010 – June 2011: comparison timeline: comments, posts, comments per post
  • Facebook Page "ElShaheeed", June 2010 – June 2011: histogram of comment lengths in characters
  • Facebook Page "ElShaheeed", June 2010 – June 2011: histogram of like count
  • Calculating relationships between variables: Quetelet 1827, Galton 1885, Pearson 1901. "Erosion of determinism" (Hacking 1991)
  • Facebook Page "ElShaheeed", June 2010 – June 2011: scatterplot comments / likes, with standard error
  • Facebook Page "ElShaheeed", June 2010 – June 2011: scatterplot comments / likes, per post type
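
One common way to express the relationship shown in a comments / likes scatterplot as a single number is the Pearson correlation coefficient. The sketch below is an illustration only: the per-post counts are hypothetical, and the slides do not state which measure was actually computed.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical per-post counts (illustrative only).
comments = [10, 25, 40, 80, 120]
likes = [30, 45, 150, 160, 400]

r = pearson_r(comments, likes)  # positive, but below 1
```

A value between 0 and 1 describes a statistical tendency, not a deterministic rule: posts with more comments tend to have more likes, but not in a fixed ratio.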
  • 2 / Two kinds of mathematics
                            Statistics                  Graph theory
    Observed:               objects and properties      objects and relations
    Inferred:               relations                   structure
    Data representation:    the table                   the matrix
    Visual representation:  quantity charts             network diagrams
    Grouping:               class (similar properties)  clique (dense relations)
  • 3 / The mathematics of structure
    Graph theory has a long prehistory; social network analysis starts in the 1930s with Jacob Moreno's work.
    Graph theory is "a mathematical model for any system involving a binary relation" (Harary 1969); it makes relational structure calculable.
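
To make "calculable" concrete: a binary relation over a set of objects can be stored as an adjacency matrix, after which simple measures such as degree become row and column sums. A minimal sketch with a made-up three-node directed graph; the names are illustrative only:

```python
def adjacency_matrix(nodes, edges):
    """Build a 0/1 adjacency matrix for a directed binary relation."""
    index = {node: i for i, node in enumerate(nodes)}
    matrix = [[0] * len(nodes) for _ in nodes]
    for source, target in edges:
        matrix[index[source]][index[target]] = 1
    return matrix

# Hypothetical relation: (x, y) means "x points to y".
nodes = ["a", "b", "c"]
edges = [("a", "b"), ("b", "c"), ("c", "a"), ("a", "c")]

matrix = adjacency_matrix(nodes, edges)
out_degree = [sum(row) for row in matrix]        # row sums
in_degree = [sum(col) for col in zip(*matrix)]   # column sums
```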
  • Three different force-based layouts of my FB profile: OpenOrd, ForceAtlas, Fruchterman-Reingold
  • Non force-based layouts: circle diagram, parallel bubble lines, arc diagram
  • Network statistics (panels: betweenness centrality, degree)
    Relational elements of graphs can be represented as tables (nodes have properties) and analyzed through statistics.
    Network statistics bridge the gap between individual units and the structural forms they are embedded in.
    This is currently an extremely prolific field of research.
  • Twitter 1% sample, 24 hours: 4.3M tweets, 3.4M users, 2M accounts mentioned, 227K unique hashtags
  • Helpful: baseline sampling
    Twitter's API proposes a random 1% statuses/sample endpoint that does not require privileged access.
    Provides datasets for researching certain types of questions and allows us to "contextualize" (baseline) other collections.
    We (Gerlitz / Rieder 2013) explored 24 hours of the 1% sample and captured 4,376,230 tweets, sent from 3,370,796 accounts, at an average rate of 50.65 tweets per second, leading to about 1.3GB of uncompressed and unindexed MySQL tables.
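
The reported rate follows directly from the tweet count and the 24-hour window. A quick check in Python; the tweets-per-account ratio is an extra derived figure that does not appear on the slide:

```python
TWEETS = 4_376_230
ACCOUNTS = 3_370_796
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds in the 24-hour window

tweets_per_second = TWEETS / SECONDS_PER_DAY   # rounds to 50.65, as reported
tweets_per_account = TWEETS / ACCOUNTS         # roughly 1.3 tweets per account
```

At roughly 1.3 tweets per account, most accounts appear in the sample only once.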
  • A baseline provides reference points
    Beware of averages in non-normal distributions! But the 1% sample is sufficiently large to allow representative exploration of subsamples.
    We can qualify structures and individual elements with the help of statistics and graph theory.
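
Why averages need care in skewed distributions can be seen with a handful of numbers. The follower counts below are hypothetical (one very large account among small ones) and serve only to contrast mean and median:

```python
from statistics import mean, median

# Hypothetical follower counts: most accounts are small, one is very large.
followers = [12, 40, 85, 130, 210, 350, 36_000_000]

avg = mean(followers)    # pulled far upward by the single large account
mid = median(followers)  # stays at the middle value, 130
```

Here the mean exceeds five million while the median is 130, so neither describes a "typical" account; both still work as reference points when comparing subsamples against a baseline.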
  • Twitter 1% sample, co-hashtag analysis. 227,029 unique hashtags, 1627 displayed (freq >= 50). Size: frequency. Color: modularity
  • Twitter 1% sample, co-hashtag analysis. 227,029 unique hashtags, 1627 displayed (freq >= 50). Size: frequency. Color: user diversity
  • Twitter 1% sample, co-hashtag analysis. 227,029 unique hashtags, 1627 displayed (freq >= 50). Size: frequency. Color: degree
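
A co-hashtag graph of this kind can be sketched as follows: every hashtag is a node, and two hashtags are connected when they appear in the same tweet. The code makes several assumptions that the slides do not confirm: hashtags are lowercased, frequency counts a hashtag once per tweet, and the input is a hypothetical list of hashtag lists.

```python
from collections import Counter
from itertools import combinations

def co_hashtag_graph(tweets):
    """Return hashtag frequencies and weighted co-occurrence edges."""
    frequency = Counter()
    edges = Counter()
    for hashtags in tweets:
        unique = sorted(set(tag.lower() for tag in hashtags))
        frequency.update(unique)                # one count per tweet
        edges.update(combinations(unique, 2))   # every pair in the same tweet
    return frequency, edges

def degree(edges):
    """Number of distinct neighbours per hashtag."""
    neighbours = {}
    for a, b in edges:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    return {tag: len(others) for tag, others in neighbours.items()}

# Hypothetical hashtag lists, one per tweet (illustrative only).
tweets = [
    ["tbt", "love"],
    ["tbt", "love", "music"],
    ["news"],
    ["music", "TBT"],
]

frequency, edges = co_hashtag_graph(tweets)
degrees = degree(edges)
```

Frequency and degree are separate properties: a hashtag can be frequent yet never co-occur with others ("news" above has no edges), which is one reason to map several variables onto size and color.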
  • Nine measures of centrality (Freeman 1979)
  • Label  PR α=0.85  PR α=0.7  PR α=0.55  PR α=0.4  In-Degree  Out-Degree  Degree
    n34    0.0944     0.0743    0.0584     0.0460    4          1           5
    n1     0.0867     0.0617    0.0450     0.0345    1          2           3
    n17    0.0668     0.0521    0.0423     0.0355    2          1           3
    n39    0.0663     0.0541    0.0453     0.0388    5          1           6
    n22    0.0619     0.0506    0.0441     0.0393    5          1           6
    n27    0.0591     0.0451    0.0371     0.0318    1          0           1
    n38    0.0522     0.0561    0.0542     0.0486    6          0           6
    n11    0.0492     0.0372    0.0306     0.0274    3          1           4
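
The table shows how PageRank (PR) scores shift with the damping factor α. A minimal power-iteration sketch illustrates the effect on a made-up four-node graph. This is not the network behind the table, and the handling of dangling nodes (spreading their rank uniformly) is one common convention, assumed here because the slides do not say which tool or variant produced the values.

```python
def pagerank(nodes, edges, alpha=0.85, iterations=100):
    """Power-iteration PageRank; dangling nodes spread their rank uniformly."""
    n = len(nodes)
    out_links = {node: [] for node in nodes}
    for source, target in edges:
        out_links[source].append(target)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        # Teleportation share plus the redistributed rank of dangling nodes.
        new_rank = {node: (1.0 - alpha) / n + alpha * dangling / n for node in nodes}
        for source in nodes:
            targets = out_links[source]
            for target in targets:
                new_rank[target] += alpha * rank[source] / len(targets)
        rank = new_rank
    return rank

# Hypothetical toy graph (not the network behind the table).
nodes = ["a", "b", "c", "d"]
edges = [("a", "b"), ("b", "c"), ("c", "a"), ("d", "c")]

results = {alpha: pagerank(nodes, edges, alpha) for alpha in (0.85, 0.7, 0.55, 0.4)}
```

Lower values of α give more weight to the uniform teleportation term, so scores move closer together. The metric depends on a parameter choice as well as on the network.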
  • Twitter 1% sample, co-hashtag analysis: degree vs. wordFrequency
  • Twitter 1% sample, co-hashtag analysis: degree vs. userDiversity
  • Facebook Page "ElShaheeed": 700K nodes, 11M connections. Color: type
  • Facebook Page "ElShaheeed": 700K nodes, 11M connections. Color: outdegree
  • Conclusions
    There is a lot of excitement about data analysis, but our understanding of styles and analytical gestures is still very poor.
    We need interrogation and critiques of methodology that are developed from engagement and historical/conceptual investigation.
    We need analytical gestures that are more closely tied to concepts from the humanities and social sciences; exploration rather than confirmation.
    Visualization and simpler tools are very interesting but require technical and conceptual literacy to deliver more than illustrations.
    This is probably not a fad.
  • "Incite, induce, deviate, make easy or difficult, enlarge or limit, render more or less probable… These are the categories of power." (Deleuze 1986, 77)
  • Thank You
    rieder@uva.nl
    https://www.digitalmethods.net
    http://thepoliticsofsystems.net
    "Far better an approximate answer to the right question, which is often vague, than an exact answer to the wrong question, which can always be made precise. Data analysis must progress by approximate answers, at best, since its knowledge of what the problem really is will at best be approximate." (Tukey 1962)