Big Social Data: The Social Turn in Big Data

  • Richard Heimann. Twitter: @rheimann. Email: Heimann.richard@gmail.com
  • “Data is the new Oil” was a term coined by Clive Humby and embraced by the World Economic Forum in 2011, which treated data as an economic asset like oil. Every day we create 2.5 quintillion bytes of data; so much that 90% of the data in the world today has been created in the last two years alone (IBM). This data comes from everywhere: hard sensors used to gather climate information, social media sites, the open web, pictures, videos, transaction records, and cell phone GPS signals, to name a few. This data is big and, by every account, growing at an increasing rate. These facts, however, give no hint as to where the largest growth is or, comparatively speaking, where the greatest reward lies for researchers. It is difficult to believe that data is increasing uniformly. There is no argument that Big Data has spurred technological innovation, which has lowered the cost of processing data and consequently has had a social impact. Businesses especially are using Big Data to answer questions that five years ago could not be asked. A recent study (Brynjolfsson, 2011) of 179 large publicly traded firms found that those adopting big-data-driven decision making have output and productivity five to six percent higher than would be expected given their other investments and information technology usage. Big Data advancements include better-targeted web ads from the likes of Google (Google Ads) and Facebook (Facebook Exchange), as well as better recommendation systems from Netflix and Amazon. The use of data by these and similarly savvy data-driven companies has seemingly had a uniformly positive impact on operations, which is supporting evidence for a big-data-driven approach. Despite the remarkable growth of data, the description of Big Data still seems rather empty.
The description of the distribution of the 90% outgrowth of Big Data would more accurately define what Big Data is and, more importantly, how it differs from traditional science data. If, for example, the data explosion is normally distributed, then Big Data may have less of an impact on the social sciences than suspected, because the tails would lie closer to the average than they would under a power law distribution. If that is the case, Big Data is not special, or at least not as special as suspected, and is less like oil than an average economic asset.
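The stakes of this distributional question can be sketched numerically. The toy simulation below (synthetic draws, not the actual data-growth figures; all parameter choices are arbitrary illustrations) compares how much of the total mass the top 1% of observations holds under a light-tailed normal distribution versus a heavy-tailed Pareto (power-law) distribution:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100_000

def top_share(x, frac=0.01):
    """Share of the total sum contributed by the largest `frac` of observations."""
    x = np.sort(x)[::-1]
    k = int(len(x) * frac)
    return x[:k].sum() / x.sum()

# Light-tailed: absolute normal draws centered at 10.
normal = np.abs(rng.normal(10.0, 1.0, n))
# Heavy-tailed: Pareto (power-law) draws with shape a = 1.5.
pareto = rng.pareto(1.5, n) + 1.0

# Under the normal, the top 1% holds barely more than 1% of the mass;
# under the power law it holds a dramatically larger share.
print(f"top 1% share, normal:    {top_share(normal):.3f}")
print(f"top 1% share, power law: {top_share(pareto):.3f}")
```

If the 90% outgrowth behaves like the first case, data is indeed "an average economic asset"; if like the second, the head of the distribution dominates.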
  • The Long Tail of Science Data. A distribution is said to have a long tail if a larger share of the population rests within its tail than would under a normal distribution (Wikipedia). A signature quality of a power law is the long tail and the large number of occurrences far from the "head," or central part, of the distribution. The Long Tail, written by Chris Anderson (2004), encourages the entertainment industry to “forget squeezing millions from a few megahits at the top of the charts.” He quite successfully foretells that the “future of entertainment is in the millions of niche markets at the shallow end of the bitstream.” This principle explains why companies like Netflix beat out the likes of Blockbuster, and how Amazon (NASDAQ:AMZN) has been so successful, with its stock price increasing from about $40.00 in late October 2004 (when Anderson published his article in Wired) to nearly $234.00 in late 2012. The long tail of science data follows the power law distribution (Figure 1) as well. The National Science Foundation's grants, in dollar amounts, follow the power law, empirical support for the long tail of science data (Figure 2). The tail has many heterogeneous datasets. These data are small, individually curated, and often unmaintained beyond their initially designed use case. As a result, the data are discontiguous from other research efforts and discontinuous over space and time. Richard Hamming's prominent words in his talks titled “You and Your Research” encourage researchers to ask, “What are the important problems in my field?” Understanding the long tail of data may suggest where computational spatial social scientists will have the greatest impact and contribute the greatest social good. The head of the distribution is where Big Data resides and perhaps where the greatest impact on human understanding and human welfare lies. The head contains data that are large and homogeneous.
The volume produces coincident datasets in time and space, unintentionally producing binding research across social science disciplines, and even between the natural and social sciences. This coincident nature makes them ideal for cross-correlation and multivariate analysis. Open Innovation initiatives hold particular promise for shared innovation and risk in research initiatives within business and government. “Open innovation is a paradigm that assumes that firms can and should use external ideas as well as internal ideas, and internal and external paths to market, as the firms look to advance their technology” (Chesbrough, 2003). Designing binding research across social science disciplines and between the natural and social sciences will require exploiting the head of the long tail and a shared interest in socially critical problems. Intelligence data, too, is often collected for a group or focal area and is almost always limited in scope and almost never preserves any semblance of external validity. These data are often collected for small projects and are then forgotten and not maintained. The poor curation of these data leads to their inevitable misplacement: dark data is data that is suspected to exist, or ought to exist, but is difficult or impossible to find. The problem of dark data is real and prevalent in the tail. The utter lack of central management of data in the tail invariably leads these data to be forgotten. The long tail is an intractably large management problem and an analytical one as well. The central curation of data in the head ensures maintenance, unlike data in the tail.
  • Honest Signals. Not all data in the head offer uniform utility for social science. The proliferation of Big Data has brought with it significant increases in social and spatial data. Location-based services are the most significant technological advancement for geography since Geographic Information Systems (GIS) and may ultimately prove to be more significant. Surprisingly, this advancement is largely a byproduct of communication platforms that only happen to be location aware. This data exhaust is what Alex Pentland of MIT calls honest signals. Pentland speaks of Big Data with a tuned social consciousness. Of this data exhaust he says, “those breadcrumbs tell the story of your life. It tells what you've chosen to do. That's very different than what you put on Facebook. What you put on Facebook is what you would like to tell people, edited according to the standards of the day. Who you actually are is determined by where you spend time, and which things you buy.” Big data is increasingly about real social and spatial behavior. Hard sensors producing satellite images have yielded novel ways of calibrating populations and local economic activity (Measuring Economic Growth from Outer Space: Henderson et al., 2008), but they often tell us less about a population than soft human geosensors do. Further marginalizing the potential utility of hard sensors is their finite allocation. The military especially has dedicated almost the entire allocation of hard sensors to Central Command, the geographic command governing the wars in Iraq and Afghanistan, leaving much of the remaining observation space unobserved. Human geosensors offer light in hard-sensor-poor environments. Big Social Data will almost certainly spawn new science within the social sciences and will likely challenge political theory and inform policy at levels unimaginable just five years ago.
Advancements in cloud computing make it no longer necessary to disregard data or to limit analysis by time or space. Big Social Data will be central to the most important governance questions of our time.
  • Economists studying macroeconomics and growth generally focus on Gross Domestic Product (GDP) as the dependent variable in their analyses. The conceptual problems in defining GDP, much less using it as a measure of welfare, are the stuff of introductory economics courses. Just as serious, however, is the problem that GDP is terribly measured. For example, in the United States, the standard deviation of the gap between the advance estimate of real quarterly GDP growth (which is available one month after the end of the quarter) and the latest revised estimate (recalculated every July for three years, and every five years thereafter) is 1.0% for 1983-2002. This is a substantial portion of the standard deviation of measured growth, 2.4% over the same period (BEA 2006, 2008). The measurement problems in GDP are far more serious in developing countries, for several reasons. Compared to developed countries, a much smaller fraction of economic activity in developing countries is conducted within the formal sector, the degree of economic integration among regions is low, and the government statistical infrastructure is often quite weak.
  • The 20th century was seminal for the natural sciences, with discoveries such as the polio vaccine (Salk, 1952), penicillin (Fleming, 1945), the double helix structure of DNA (Watson & Crick, 1953), and the first complete DNA sequence of an organism (Sanger et al., 1977), all of which advanced human understanding and human welfare. The advent of the internet promises to do for the computational spatial social scientist in the 21st century what other measurement tools did for the natural scientist of the 20th century: further advancing human understanding and human welfare. Big Social Data, moreover, promises marked improvement in policy and governance decisions, affecting the lives of everyone. The recent social turn in Big Data makes a great effort to dispel a number of enduring misconceptions about Big Data. This paper acknowledges the contributions of several socially conscious data scientists and hopes to advance those efforts with insights, ultimately highlighting the differences between the demands on data and analysis in private industry and the demands on data and analysis in security, governance, and policy. Big Social Data is a largely untapped potential that will allow decision makers to track progress, better understand and improve the social conditions of local populations, and understand where existing policies require adjustment. One ought to wonder what US Government engagements would look like if Big Social Data could improve decision making or intelligence analysis by a mere five to six percent, as industry has done for output and productivity.
  • The typical state-centric analysis that seeks to determine how states can or do impose stability must also develop a sensory capability to better detect the precursors to political change: a social radar with a level of granularity, understanding, and confidence that enables policy leaders to make informed decisions that maximize national influence left of boom (Flynn, 2012). This is a (quick) analysis of the 2012 Failed States Index (DV) and Demograp_2011, HumanFli_2011, PovertyE_2011, HumanRig_2011, and Security_2011 (IVs). It is, in effect, a space-time prediction: those variables contributing to the failed state index in 2011 predicting the failed state index in 2012. This is a quick analysis to demonstrate a point; in a pure cross section it would suffer from severe model misspecification due to endogeneity, but here we are using variables from two cross-sectional panels. Both internal and external validity can be threatened, depending on the exact setup and question. It can be difficult to establish a form of causality for a given area between two points in time, or for two different areas at a given point in time. Even once this is established, such that internal validity is ensured, the difficulty is to transpose the insight(s) to either a different point in time or a different area, which relates to external validity. Spatial scientists and geographers can pursue this goal by ensuring external validity and seeking out regionally flexible small theory from big data. Does Fotheringham offer credible evidence that (1) global hypotheses can be tested against local models to examine where spatial and social processes are not satisfied, and (2) local hypotheses can be tested against spatial structural change?
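A minimal sketch of the lagged cross-panel regression described above, using synthetic stand-ins for the five 2011 indicator columns named in the text; the data and coefficients below are fabricated for illustration, not the actual Failed States Index values:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 177  # roughly the number of states scored by the index

# Synthetic stand-ins for the five 2011 indicators (IVs).
X_2011 = rng.normal(size=(n, 5))
beta_true = np.array([1.2, 0.8, 1.5, 0.5, 2.0])  # invented "effects"
# Synthetic 2012 index (DV), driven by the lagged 2011 indicators plus noise.
fsi_2012 = X_2011 @ beta_true + rng.normal(scale=1.0, size=n)

# OLS: regress the 2012 index on the lagged 2011 indicators.
X = np.column_stack([np.ones(n), X_2011])          # prepend an intercept
coef, *_ = np.linalg.lstsq(X, fsi_2012, rcond=None)
print("intercept:", round(coef[0], 3))
print("slopes:   ", np.round(coef[1:], 3))         # close to beta_true
```

Using the lagged panel as the regressor is what sidesteps the worst simultaneity of a pure cross section, though, as the text notes, neither internal nor external validity is guaranteed by the design alone.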
  • David Kilcullen (2009) explains that today's conflicts are a complex hybrid of contrasting trends that counterinsurgencies continue to conflate, blurring the distinction between local and global struggles and thereby enormously complicating the challenges faced. Kilcullen steps through local and global struggles and outlines the importance of commensurate policy. This process can be characterized roughly as useful spatial models whose statistically significant global variables exhibit strong regional variation to inform local policy, and whose statistically significant global variables exhibit little regional variation to inform region-wide policy. See Figure 2. Unfortunately, computational COIN relies almost exclusively on time-series data, thus collating all actors into one fighting force operating on an assumed homogeneous population base. The operational analytical workflow consists of merely plotting temporal patterns and describing discrete change from time {Xt-1} to time {Xt}. The nature of insurgency, like most phenomena, is change over time and space. Kalyvas examines how strategies vary temporally and spatially, focusing on the spatial variation of control on the part of the counterinsurgency and the spatial variation of violence. Kalyvas concludes that violence is nonrandom and nonstationary; insurgency, he concludes, does not resemble a Hobbesian world. In other words, violence does exhibit spatial structure, and commensurate spatially explicit theory should follow a blended idiographic and nomothetic methodology.
  • Big Data, Small Theory is a blended methodological approach between two competing aspirations, held by natural and social scientists respectively. The desire for nomothetic laws on the part of natural scientists has created a sort of physics envy in the social sciences. The assumption made in order to achieve a universal law is stationarity: a single equilibrium, or singular process. Humans often behave in social groups and are consequently influenced by context; there are, however, exceptions to the rule. As Ernest Rutherford famously said, “the only law in social science is some do, some don’t.”
  • Smith takes the view that a law must be true entirely, and that a single counterexample is sufficient to refute it, while Harvey argues that a law need not be deterministic. Expectations about laws clearly vary across the sciences. It is in principle possible for a human individual to violate any deterministic law about individual behavior, which would appear to deny any possibility of such laws in the social sciences, a theme discussed by Barnes.
  • The End of Theory, Dispelled. Chris Anderson wrote in 2008 that the data deluge would make theory and the scientific method obsolete. Geographers, not to mention social scientists generally, are forever concerned with theory. Big Social Data, thanks to the open web, creates a revived sense of experimentation and consequently new science and new theory. George Box is known to have said that the only way to understand a complex system is to shock that system and observe its reaction. Big social data allows the observation of such shocks, whether environmental, economic, or political, for many more subjects than previously imagined. The days of post hoc rationalization of social systems by intelligence analysts or researchers are waning. The goal of each ought to be to discover truths, since all truths are easy to understand once they are discovered. Big Social Data's goal is to learn about social systems at a speed commensurate with decision making and at a spatial support commensurate with policy development and assessment. Stylized facts built from big data and constructed around small theory are one framework within which to develop theory. It is a blended approach between the nomothetic laws of the natural sciences and the idiographic details of soft geography. These small theories are not small in significance but locally calibrated to the populations they observe. These theories and stylized facts are based on empirical observation and are expected to be generally true, sufficiently general to be a useful norm, and timely. Deviations from these facts ought to be interesting, including structural stability and instability over time and space. Ultimately, these stylized facts attempt to deal with geographic process rather than form: to understand social process in context.

Transcript

  • 1. DATA is the new OIL… Richard Heimann © 2011
  • 2. Long Tail of data science… Head: Big Data. Long Tail: Intelligence Reporting, Science Data, Dark Data. Head: Big Data, i.e., large continuous datasets coincident over time and space, ideal for multivariate analysis. The tail {power law distribution} is good for business but suboptimal for governance. Data in the tail is often unmaintained beyond its initially designed use case and individually curated. As a result, the data is discontiguous from other research efforts and discontinuous over space and time. Dark data is suspected to exist or ought to exist but is difficult or impossible to find. The problem of dark data is real and prevalent in the tail. The long tail is an intractably large management problem. Richard Heimann © 2011
  • 3. Long Tail of data science…
    Head                  | Tail
    Homogenous            | Heterogeneous
    Centralized curation  | Individual curation
    Maintained            | Unmaintained
    Continuous over S & T | Discontinuous over S & T
    Visibly accessible    | DARK data
    High velocity         | Slow or NO velocity
    High volume           | Low volume
    Richard Heimann © 2011
  • 4. Long Tail of NSF data… Power law:
                      80%          | 20%
    Number of grants  7,478        | 1,869
    Dollar amount     $938,548,595 | $1,199,088,125
    Total grants (NSF07): 9,347 (count), $2,137,636,716 (amount)
    Richard Heimann © 2011
  • 5. Honest Signals
  • 6. Honest Signals – Spatial Randomness. Deviation from spatial randomness suggests underlying social processes. “Every observable effect has a cause” (Thales). Perhaps the most profound honest signal is a rejection of randomness. [Slide figure: random O&D variable vs. total O&D variable Xt, counts per 500-meter cell]
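The slide's rejection-of-randomness idea can be sketched with a quadrat-based index of dispersion: bin points into grid cells (standing in for the 500-meter cells; the point patterns here are synthetic, not the slide's O&D data) and compare the variance-to-mean ratio of the counts. Under complete spatial randomness the ratio is near 1; an underlying social process that clusters activity pushes it well above 1:

```python
import numpy as np

rng = np.random.default_rng(1)

def dispersion_index(points, cells=10):
    """Variance-to-mean ratio of quadrat counts on a cells x cells grid
    over the unit square. Near 1 under complete spatial randomness;
    much larger than 1 for clustered patterns."""
    counts, _, _ = np.histogram2d(points[:, 0], points[:, 1],
                                  bins=cells, range=[[0, 1], [0, 1]])
    return counts.var() / counts.mean()

# Completely random pattern on the unit square.
random_pts = rng.uniform(size=(2000, 2))

# Clustered pattern: points scattered tightly around five centres.
centres = rng.uniform(size=(5, 2))
clustered_pts = np.clip(
    centres[rng.integers(0, 5, size=2000)]
    + rng.normal(scale=0.02, size=(2000, 2)),
    0.0, 1.0)

print(f"random pattern:    {dispersion_index(random_pts):.2f}")
print(f"clustered pattern: {dispersion_index(clustered_pts):.2f}")
```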
  • 7. Social Radar…http://paa2009.princeton.edu/papers/91094
  • 8. Social Radar… …what LTG Michael Flynn calls a warning system to inform policy makers of potential crises ‘left of boom’. It yields understanding of the human landscape, advancing beyond standard terrain features. Human geosensors represent social radar, particularly in areas unmarked by conflict and in sensor-poor environments. Richard Heimann © 2011
  • 9. What is Big Social Data? Anyon (1982) argues that social science should be empirically grounded, theoretically explanatory, and socially critical. Big Social Data is all those things, with an emphasis on socially critical. The Social Turn of Big Data is upon us… Richard Heimann © 2011
  • 10. What is Big Social Data? • Big Data + tuned social consciousness. • Velocity, volume, variety, veracity. • Inward & outward asymptotics. • Social + spatial. Richard Heimann © 2011
  • 11. Big Data, Small Theory
    Statistically significant global variables that exhibit strong regional variation inform nuanced local decisions.
    Statistically significant global variables that exhibit strong regional variation inform different local decisions.
    Statistically significant global variables that exhibit little regional variation inform region-wide policy.
    Richard Heimann © 2011
  • 12. Big Data, Small Theory: Spatial Simpson's Paradox. Global standards will always compete with local social phenomena. Global models average regionally variant phenomena; local models account for regional variation. Richard Heimann © 2011
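The spatial Simpson's paradox named on this slide can be reproduced with a toy example: three hypothetical regions in which the local relationship between x and y is negative, while a single pooled (global) model, averaging over the regional means, estimates a positive slope:

```python
import numpy as np

rng = np.random.default_rng(3)

local_slopes, xs, ys = [], [], []
for region_mean in (0.0, 5.0, 10.0):
    # Within each region, y falls as x rises (local slope near -1)...
    x = region_mean + rng.normal(size=200)
    y = 2.0 * region_mean - (x - region_mean) + rng.normal(scale=0.3, size=200)
    xs.append(x)
    ys.append(y)
    local_slopes.append(np.polyfit(x, y, 1)[0])

# ...but regions with larger mean x also have larger mean y, so the
# pooled global regression reverses the sign of the relationship.
global_slope = np.polyfit(np.concatenate(xs), np.concatenate(ys), 1)[0]
print("local slopes:", np.round(local_slopes, 2))  # each negative
print("global slope:", round(global_slope, 2))     # positive
```

This is exactly why the slide warns that global models average away regionally variant phenomena: a local model recovers the true negative process, while the global model reports the opposite sign.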
  • 13. Big Data, Small Theory Richard Heimann © 2011
  • 14. Big Data, Small Theory
    Stationarity | Extreme heterogeneity
    Lacking internal validity | Lacking external validity
    Single equilibrium: a singular process over space and across the study area. | Multiple equilibria: one process for every observation over space.
    Richard Heimann © 2011
  • 15. Big Social Data? Observed to be generally true. Sufficient generality to be useful as a norm. Deviations from the law should be interesting. Understanding of social process in context. …the nomothetic & idiographic debate is solved!! Richard Heimann © 2011
  • 16. Big Data, Small Theory …building better analytics. Learning more about our problems. Constructing locally variant, regionally flexible small theory. Improve policy and decisions! Richard Heimann © 2011
  • 17. Big Data, Small Theory. A study of 179 large companies found that those adopting "data-driven decision-making" achieved productivity gains that were 5 to 6 percent higher than other factors could explain. • What if we could improve policy or intelligence analysis by 5 or 6 percent? • What if we could improve decision support by at least 5 percent? • What if we could improve productivity by at least 5 percent? Richard Heimann © 2011
  • 18. The End of Theory Richard Heimann © 2011
  • 19. Richard Heimann © 2011