Big Social Data: The Spatial Turn in Big Data (Video available soon on YouTube)


Big Social Data: The Spatial Turn in Big Data
By Richard Heimann & Abe Usher
University of Maryland, Baltimore County

Webinar Description:

Increased access to spatial data and improved spatial analytical methods present considerable potential for social scientific research. This webinar focuses on substantive social science research perspectives while demonstrating the rewards of applying geographic information systems (GIS), Big Data, and spatial analytics to researchers' own work.


As the hype of Web 2.0 has worn off and collaborative use of the Internet has become a societal norm, we are witnessing an unprecedented explosion in the creation and analysis of geospatial data. Just as major governments are reducing their investments in location intelligence, individuals and non-governmental organizations are fueling a bonfire of innovation in the world of GIS data.


Traditional spatial analyses grew up in an era of sparse data and very weak computational power. Today, both of those circumstances are reversed, and many of the old solutions are no longer suitable to answer today's questions.


"Big Social Data: The Spatial Turn in Big Data" reflects this change and combines two things which, until recently, engaged quite different groups of researchers and practitioners. Together, they require particular techniques and a sophisticated understanding of the special problems associated with spatial social data. Geographic Data Mining, or Geographic Knowledge Discovery, is not new, but is developing and changing rapidly as both more, and different, data becomes available, and people see new applications. The days of ‘Big Data’ require fresh thinking.


The webinar will highlight connections between spatial concepts and data availability. Newly emerging social media data will be promoted over traditional social science data, better reflecting some of the more recent developments in Big Data, most notably the socially critical exploration of such data.

Usage Rights

CC Attribution-NonCommercial-ShareAlike License

  • So, over the next hour Abe and I are going to cover some elements of the Big Data movement, most notably access to social media data, in our case Flickr and Twitter, and some general pathways for analysis. We expect to deliver content over the next 45-50 minutes and leave 10 minutes or so for questions. I would add that about half of the time will be spent on two vignettes designed specifically for this webinar. Abe will explore pattern-of-life analysis and I will explore the spatial patterning of tweet composition from the 2012 presidential elections; though I will admit it has little to do with anything political. We will cover major trends within the Big Data movement, complementary trends in location-based data and services, and foundational definitions. This is a nice backdrop and draws a contrast to traditional social scientific research, which I will attempt to explain in the Long Tail of Big Social Data. I will also briefly cover some laws of the spatial sciences, and Abe will discuss important Big Data concepts. Then we will cover our vignettes and quickly wrap up with "So, whats?" and direction to additional resources. To that point, I am excited to share with you that all of the data and material, as well as code, will be available following this webinar. So, without further delay I pass things off to Abe Usher.
  • In 2004, Google purchased a startup called Keyhole, whose technology allowed Internet users to explore satellite images of the Earth. This product quickly became the application Google Earth, one of the most downloaded computer applications of all time, with more than 500 million downloads. To understand how Big Data technology and social media relate to geography, we must first understand four major trends that are shaping how people create and interact with data. The first major trend is an explosion in location-based information, which started in mid-2004. Google's decision to make Google Earth free to everyone on the Internet was a pivotal moment in history. This application opened people's minds to the possibility of exploring the world from the comfort of home. It also presented a standard data format, named Keyhole Markup Language, or KML for short. Google Earth enables users to annotate the globe with their own observations: for example, marking the location of a home or school with a map push-pin, tracing a road with a simple line tool, or outlining a plot of land with a polygon tool. By 2007, there were more than 300 million KML files describing the elements of geography freely floating around the Internet. Google Earth significantly lowered the bar for exploring satellite images of the world and for sharing geospatial facts with others. We'll examine this activity further when we discuss volunteered geographic information later in our talk.
  • The second major trend shaping how people create and interact with information is the proliferation of mobile computing technology. In 1999, Research In Motion released the BlackBerry, an innovative mobile device that allowed people to make phone calls and send and receive email on the same device. In January 2007, Steve Jobs unveiled the Apple iPhone. This was a major improvement over the BlackBerry: a combination phone, iPod music player, and mobile computer with the ability to install apps from a central app store. In September 2008, Google released a competing mobile operating system named Android. In April 2010, Apple released a tablet computer called the iPad. In February 2013, Google released an augmented-reality head-mounted display computing device called Google Glass. The evolution of mobile computing continues to accelerate. The International Telecommunication Union estimates that the number of mobile phones and computing devices will exceed the population of the world in 2014. That's more than 7 billion mobile devices.
  • The third major trend shaping how people create and interact with information is the use of social networking websites and the creation of social media content. The number of Internet users in the world is currently approximately 2.2 billion. People with access to the Internet are increasingly interested in using it as a platform for interacting with others and maintaining relationships through social networking sites such as Facebook, LinkedIn, and Twitter. Why is this interesting? The amount of raw content generated by these sites daily is staggering. For example, every 20 minutes on Facebook more than 10.2 million comments are posted to the site and two million friend requests are accepted. That's more than 700 million comments and 144 million friend connections per day! Although most of this is informal, unstructured content, it is a rich set of observations that is constantly being generated. Never before in the history of humankind have scientists and researchers had so much potential data to work with.
  • The fourth major trend shaping how people create and interact with information is what I call the gamification of geo. With the three preceding trends in place, there is an emerging behavior in which Internet users share location information as part of games and transparency-oriented interaction. Photo-sharing sites Flickr and Panoramio encourage Internet users to share geolocated photos with textual descriptions. Twitter allows users to associate GPS coordinates with their tweets. The logical extreme of this type of activity is Foursquare, where users are given point- and merit-badge-based incentives to share their own location information and observations about the places they visit.
  • The impact of these trends is a system of continuous, global geo-located observations, shared across the Internet.
  • Volunteered geographic information, or VGI for short, is the harnessing of tools to create, assemble, and disseminate geographic data provided voluntarily by individuals. Sites that contribute to this phenomenon include OpenStreetMap, Wikimapia, Google Map Maker, Flickr, Panoramio, Twitter, Instagram, and Locr.com, just to name a few. VGI is interesting because it is a way that billions of Internet users are informally collaborating to create an aggregate understanding of the world around them. In a way that eclipses the capability of any single corporation or nation state, the loosely coupled community of geospatially savvy Internet users is creating the greatest foundational database of geospatial content ever produced.
  • Social media, as defined by Andreas Kaplan, is "a group of Internet-based applications that build on the ideological and technological foundations of Web 2.0, and that allow the creation and exchange of user-generated content." http://goo.gl/oSrIS Less formally, you could say that social media is interactive content on the Internet that is fundamentally generated and managed by the user community. This includes major web destinations such as Facebook, Twitter, and Wikipedia, as well as homegrown wikis, blogs, forums, and online bulletin boards. Social media is interesting from a research perspective because it generates such a high volume of artifacts for analysis.
  • There are many definitions of big data, each with interesting nuances. A highly practical definition is that put forth by the IT consultancy Gartner Group, stating: Big data “is high-volume, -velocity and -variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making”
  • In an effort to explain big data I have decided to compare things to what we have historically thought of as social science research. This is more of a thought experiment than anything rigorous and empirical, though I do have some empirical evidence to support the notion. The distribution in this slide is what is known as a power law distribution, where observations are farther from the mean than they would otherwise be under a normal distribution. A signature quality of a power law is the long tail: the large number of occurrences far from the "head," or central part, of the distribution. The tail is what I think of as traditional social science data. These data are often collected for small projects and are often forgotten and not maintained. The poor curation of these data leads to their inevitable misplacement; notionally, dark data is data that is suspected to exist, or ought to exist, but is difficult or impossible to find. The problem of dark data is real and prevalent in the tail. The utter lack of central management of data in the tail invariably leads these data to be forgotten. The long tail is an intractably large management problem, and an analytical one as well. The central curation of data in the head ensures maintenance, unlike data in the tail. The head of the distribution is where Big Data resides and perhaps where the greatest impact is to human understanding and the advancement of human welfare. The head contains data that are large and homogeneous. The volume produces coincident datasets in time and space, unintentionally producing binding research across social science disciplines, even binding research between the natural and social sciences: what Edward Wilson called consilience in his 1998 book of the same name on the unity of knowledge.
The datasets in the head are limited in number but have broad utility and appeal to many, many users, whereas resources in the tail, despite numbering in the thousands, appeal to only a few scientists, analysts, and decision makers. The coincident nature of data in the head makes them ideal for cross-correlation and multivariate analysis. Open Innovation initiatives hold certain promise for shared innovation and risk for research initiatives, developing what is in effect a shared virtual laboratory where we all can work on the same data.
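The head-versus-tail contrast can be sketched numerically. This is a purely illustrative simulation (not the NSF data discussed next): a heavy-tailed power-law (Pareto) sample concentrates far more of its total mass in a few extreme observations than a light-tailed exponential sample does.

```python
# Illustrative sketch of the long tail: compare a heavy-tailed power-law
# (Pareto) sample with a light-tailed exponential sample. Sample size and
# parameters are arbitrary choices for demonstration only.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

heavy = rng.pareto(a=1.5, size=n)           # power law: heavy tail
light = rng.exponential(scale=1.0, size=n)  # light tail for contrast

def top_share(x, frac=0.01):
    """Share of the total sum contributed by the largest frac of observations."""
    k = int(len(x) * frac)
    return np.sort(x)[-k:].sum() / x.sum()

# The top 1% of a power-law sample holds a far larger share of the total
# than the top 1% of an exponential sample: the "head" dominates.
print(top_share(heavy), top_share(light))
```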
  • So, this could be empirical support for the notion communicated in the last slide. The National Science Foundation's grants, in dollar amounts, have been shown to follow a power law. Data in the tail are small, often unmaintained beyond their initially designed use case, and individually curated. As a result, the data are discontinuous from other research efforts and discontinuous over space and time.
  • Moving on... Some important laws of the spatial sciences are as follows. Laws are important for a number of reasons. First, laws allow any subsequent data analysis to be constructed from first principles and provide the basis for predicting performance and making analytical design choices. Not to mention that they are an asset of a strong and robust discipline. These laws specifically aid in pattern discovery and recognition. The first is commonly and glowingly known as the First Law of Geography and states that "All things are related, but nearby things are more related than distant things." This is spatial dependency, and it is often measured with spatial autocorrelation, the relationship that a variable has with itself over space. The notion of "near" is what requires operationalizing, with spatial weights. The second law is what Michael Goodchild called the "Second Law of Geography," or spatial heterogeneity. It is non-constant variance over space and is effectively a breakdown of the First Law. Variance in this case can technically be thought of in standardized terms (mean 0, SD 1). It suggests non-stationarity, or multiple processes operating within our study area. A flavor of spatial heterogeneity is the spatial Simpson's paradox, where global models compete with local models. The final law isn't one of the spatial sciences but may be thought of as one of Big Social Data: tackling socially critical problems with Big Data. Examples of socially critical Big Data include, but are certainly not limited to, Google Flu Trends, The Billion Prices Project at MIT, the Global Pulse at the United Nations, as well as personal efforts on behalf of socially conscious data scientists.
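Spatial dependency can be made concrete with a small sketch of Moran's I, the usual global measure of spatial autocorrelation. The grids and rook-adjacency weights below are made up for illustration: a smooth gradient (nearby cells similar) yields a positive I, while a checkerboard (nearby cells dissimilar) yields a negative I.

```python
# Sketch of global Moran's I on a toy 4x4 grid with rook (edge-sharing)
# binary neighbor weights. The data are invented purely for illustration.
import numpy as np

def morans_i(grid):
    n_rows, n_cols = grid.shape
    z = grid - grid.mean()          # deviations from the global mean
    num, s0 = 0.0, 0.0
    for i in range(n_rows):
        for j in range(n_cols):
            for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                ni, nj = i + di, j + dj
                if 0 <= ni < n_rows and 0 <= nj < n_cols:
                    num += z[i, j] * z[ni, nj]   # w_ij = 1 for rook neighbors
                    s0 += 1.0
    n = grid.size
    return (n / s0) * num / (z ** 2).sum()

gradient = np.add.outer(np.arange(4), np.arange(4)).astype(float)  # smooth trend
checker = np.indices((4, 4)).sum(axis=0) % 2 * 2.0 - 1.0           # alternating +/-1

print(morans_i(gradient))  # positive: nearby cells are similar
print(morans_i(checker))   # negative: nearby cells are dissimilar
```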
  • Here is an example of the spatial Simpson's paradox, where crime in the north and crime in the south, when collapsed over space, produce a "best fit" that is representative of neither. If this were a policy issue, the subsequent policy would be no good for anyone. David Kilcullen (2009) explains that today's conflicts are a complex hybrid of contrasting trends that counterinsurgencies continue to conflate, blurring the distinction between local and global struggles and thereby enormously complicating the challenges faced. Kilcullen steps through local and global struggles and outlines the importance of commensurate policy. This process can be characterized roughly as follows: useful spatial models whose statistically significant global variables exhibit strong regional variation can inform local policy, while statistically significant global variables that exhibit little regional variation can inform region-wide policy. Now, Abe will discuss complementary mechanisms for approaching Big Data problems.
  • There is an infinite variety of ways of approaching Big Data analysis. Three methods that are well documented by Google are aggregation, association, and correlation. Aggregation relates to quantitative methods for creating descriptive statistics. A simple example of this is the creation of counts, statistical means and medians, and standard deviations for an observed data set. We'll explain practical applications of this in both of our vignettes. Association relates to methods of identifying relationships of one data element to another. A very simple spatial example could be comparing geolocated tweets about coffee to the locations of Starbucks and other coffee franchises. In many cases, a concentration of geolocated tweets is associated with the physical location of a coffee shop. Correlation is a special case of association: it is the process of quantifying a correspondence between two comparable entities. Rather than merely stating that there is a loose relationship between geolocated coffee tweets and coffee shops, with correlation we would actually attempt to create a numeric model that could be used for predicting the presence of a coffee shop based on the number of geolocated coffee tweets. Correlation is a special case because it can enable the creation of predictive models.
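A minimal sketch of all three ideas, using hypothetical per-area counts of coffee tweets and coffee shops (the numbers are made up for illustration):

```python
# Hypothetical per-area counts: geolocated coffee tweets and coffee shops.
import numpy as np

tweets = np.array([8, 2, 11, 1, 5])  # coffee tweet counts per area
shops = np.array([3, 1, 4, 0, 2])    # coffee shop counts per area

# Aggregation: descriptive statistics over the observed data.
summary = {"count": tweets.size, "mean": tweets.mean(),
           "median": float(np.median(tweets)), "std": tweets.std()}

# Association: do high-tweet areas tend to also be high-shop areas?
both_high = np.mean((tweets > tweets.mean()) == (shops > shops.mean()))

# Correlation: quantify the correspondence in a single coefficient,
# which could then feed a simple predictive model.
r = np.corrcoef(tweets, shops)[0, 1]
print(summary, both_high, r)
```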
  • Before we dive into our specific examples of data analysis, I'd like to introduce you to a metaphor for dealing with Big Data. I call it the Kitchen Model of Big Data analysis. A kitchen is a great metaphor for understanding value creation. Raw materials go in, and they are transformed into more valuable (and hopefully delicious) outputs. In the physical world, a number of factors contribute to the output of a kitchen. The most important include: the skill level of the chef; the quality, quantity, and variety of ingredients; the utensils that are available for work; and recipes that are known to work well. In the quantitative data sciences, the chefs are the people, the ingredients are the data sets you have at your disposal, the utensils are the technical tools you choose to use, and the recipes are the repeatable methodology that you create for addressing a particular analytic question. Any time that you frame an analytic question, it is a useful exercise to consider whether you have the right people, the right data, the right tools, and the right methodology. In good faith with the community of geospatial practitioners, we are happy to share our data, tools, and methodology with you in the form of two vignettes.
  • At the end of the presentation, we have links to the source code we used to process data in our vignettes. This code is in the form of scripts that can be run without requiring any commercially licensed software. For aspiring data scientists and geospatial researchers, we recommend four no-cost tools: the Python programming language, the R programming language, Quantum GIS, and Google Earth.
  • Vignette one should bring some levity to Big Social Data, but it is all the same driven by a social aspect and ultimately analyzes data that could serve as a proxy for other, more substantive variables. My vignette analyzes Twitter data using the Flesch-Kincaid index, which you may all be familiar with as a consequence of using MS Word, which has for some time provided a readability index for documents. In February 2013, the Guardian used the Flesch-Kincaid index to track the reading level of every State of the Union address and noted how the linguistic standard of the presidential address has declined: "The state of our union is ... dumber: How the linguistic standard of the presidential address has declined. Using the Flesch-Kincaid readability test the Guardian has tracked the reading level of every State of the Union." http://www.guardian.co.uk/world/interactive/2013/feb/12/state-of-the-union-reading-level
  • For my analysis I used the Flesch Reading Ease index, which weights average sentence length and average syllables per word and subtracts both from a constant. The output generally ranges from 0 to 100. To provide examples: Reader's Digest has a readability index of about 65, Time magazine scores about 52, an average 6th-grade student (age 11) produces written assignments with a readability score of 60-70, and the Harvard Law Review has a general readability score in the low 30s. The highest (easiest) readability score possible is around 120, meaning every sentence consists of only two one-syllable words. The score has no theoretical lower bound; it is possible to make the score as low as you want by arbitrarily including words with many syllables. On Twitter this happens in practice: I discovered tweets where "LOL" was repeated up to the 140-character limit, which drives the resulting index well below 0. These values were clipped by using a threshold of 0-100. This sentence, for example, taken as a reading passage unto itself, has a readability score of about 33. The sentence "The Australian platypus is seemingly a hybrid of a mammal and reptilian creature" scores 24.4, as it has 26 syllables and 13 words. One particularly long sentence about sharks in chapter 64 of Moby-Dick has a readability score of -146.77.
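The index can be sketched in a few lines. The published Flesch Reading Ease formula is 206.835 − 1.015 × (words/sentences) − 84.6 × (syllables/words); the syllable counter below is a crude vowel-group heuristic, so its scores will differ somewhat from MS Word's implementation.

```python
import re

def count_syllables(word):
    # Crude heuristic: one syllable per run of consecutive vowels.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def reading_ease(text):
    """Flesch Reading Ease: higher = easier to read."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (206.835
            - 1.015 * (len(words) / len(sentences))
            - 84.6 * (syllables / len(words)))

easy = reading_ease("The cat sat. The dog ran.")
hard = reading_ease("The Australian platypus is seemingly a hybrid "
                    "of a mammal and reptilian creature.")
print(easy, hard)  # short monosyllabic sentences score far higher
```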
  • Again, the index is inversely related to sophistication: a high score is easier to read or, put differently, written with less sophistication. This is an example of a low score, a tweet written with high sophistication. It is parsimonious and, on average, more syllable-dense than other tweets. The data have been mean-centered, so keep that in mind. The tweet is as follows: "this gas situation is absolutely ridiculous." It is written at an 11th-grade level and has a mean-centered value well below zero. The location of the tweet is Mahwah, NJ, about 20 miles outside of NYC.
  • This is an example of a high score, a tweet written with low sophistication; nearly every word is monosyllabic. The tweet is as follows: "down here in beach bout to shut this down wit & feeling the vibes." It is written at a 4th-grade level and has a mean-centered value well above zero. The location of the tweet is Myrtle Beach, SC.
  • Explain thresholding: alpha = 0.05.
  • This table shows centrality and spread. By mean-centering the data, that is, subtracting the global mean from each region's mean, we can quickly identify deviation from the global mean. The Mid-Atlantic, Mountain, New England, and Pacific regions are all below the global mean, whereas East North Central, East South Central, Southeast, West North Central, and West South Central are all above it. You can also quickly see that the Pacific and West South Central regions deviate most in their respective directions from the global mean.
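Mean centering as used in the table can be sketched directly; the regional observations below are hypothetical stand-ins for the per-tweet indices.

```python
from statistics import mean

# Hypothetical readability observations per census region.
scores_by_region = {
    "New England": [40.0, 45.0, 42.0],
    "Pacific": [38.0, 41.0],
    "Southeast": [60.0, 65.0, 62.0, 64.0],
}

# Global mean over every observation, then subtract it from each
# region's mean: the sign shows the direction of deviation at a glance.
all_scores = [s for v in scores_by_region.values() for s in v]
global_mean = mean(all_scores)
centered = {r: mean(v) - global_mean for r, v in scores_by_region.items()}
print(centered)
```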
  • Another way of exploring the data is box plots by region with underlying scatter plots. Adding jitter allows us to get a sense of the distribution and, cognitively speaking, green is easier to interpret than other colors. An important point about the last two visualization methods is that they are both locationally invariant beyond the coded region variable. As Michael Goodchild said, a fundamental property of spatial analysis is the lack of locational invariance; in other words, results change when locations change. If the values within each region were shuffled, neither of these two techniques would change.
  • This is merely a map of the post-processed data, that is, after thresholding. You will notice that even with just 48,000 observations, pattern recognition is difficult, due in part to coincident points in space, which is perhaps support for quantitative methods of pattern recognition and discovery.
  • Using LISA we can analyze both the first and second laws of geography. Here you can see certain spatial clusters representative of spatial dependency. Notice that Tobler's First Law can be seen in the HH cluster in the middle of the country: high indices surrounded by other high indices. Also noticeable are the low indices surrounded by other low indices in the north, centered around Montana, and on the coasts, namely the NYC metropolitan area and the San Jose/SF area. The regional variation indicated by different spatial regimes is the second law of geography at work; it is the non-stationarity of writing ability in the US. There are also numerous more localized relationships not clear from this map. However, in addition to the smooth quality of the analysis, indicated by high values surrounded by high values and low values surrounded by low values, there are also some interesting rough qualities characteristic of spatial outliers: high values surrounded by low values and low values surrounded by high values. For example, Columbus, OH, Ithaca, NY, and Gassaway, WV are all low values surrounded by high values, meaning they write at a more sophisticated level than their neighbors and meet statistical significance. By performing a spatial inner join between major cities, in this case cities with more than 300,000 people, and the LISA classifications, we can identify large cities and their sophistication in crafting tweets. The following are the only cities that meet that criterion. El Paso, Oklahoma City, Omaha, Detroit, and Memphis all have statistically significant HH values. NYC and San Jose are low values surrounded by low values. Sacramento is a low value surrounded by otherwise high values, and Wichita, Kansas City, Tulsa, and Nashville are all high Flesch-Kincaid indices surrounded by low Flesch-Kincaid indices. REMEMBER: these indices are inversely related to writing ability; high values mean low writing ability and, vice versa, low values mean high writing ability.
So, you might conclude, among other things, that NYC and San Jose are filled with nerds! Sorry, DC. The LISA categories are statistically significant with a pseudo p-value < 0.05. Pseudo p-values are a computational approach to inference, and they prove to be a nice data reduction technique. Our original dataset of 3-digit zip codes is reduced from 862 observations to just 259, or about 30% of the original dataset, where all other observations are not statistically significant in the patterning of the Kincaid index. This analysis could certainly benefit from more data, and in fact I am currently analyzing the same index with nearly one million tweets after data processing.
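A stripped-down sketch of the LISA idea: local Moran values and HH/LL/HL/LH labels from the signs of each area's deviation and its neighbors' average deviation. The areas below sit on a hypothetical line, each neighboring the adjacent positions; a real LISA run would additionally attach pseudo p-values via permutation, which is omitted here.

```python
import numpy as np

# Hypothetical index values for areas arranged on a line; each area's
# neighbors are the adjacent positions (row-standardized weights).
x = np.array([9.0, 8.0, 9.0, 1.0, 2.0, 1.0, 9.0])
z = x - x.mean()

labels, local_i = [], []
for i in range(len(x)):
    nbrs = [j for j in (i - 1, i + 1) if 0 <= j < len(x)]
    lag = z[nbrs].mean()            # average deviation of the neighbors
    local_i.append(z[i] * lag)      # local Moran statistic (unscaled)
    labels.append(("H" if z[i] > 0 else "L") + ("H" if lag > 0 else "L"))

print(labels)  # HH/LL = clusters, HL/LH = spatial outliers
```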
  • I thought it would be interesting to see the intersection of the computed index with some of the more prestigious, or at least expensive, zip codes in the US. With the exception of the exclusive 902 zip code, our mean-centered values fall below the global mean, meaning higher writing levels. It might suggest that the 902 zip code is not as smart as a fifth grader. :)
  • In typical text mining algorithms, a document is represented as a vector whose dimension is the number of distinct keywords, which can be very large. Not long after "The Cat in the Hat" was published at 225 words, Bennett Cerf challenged Seuss to see if he could write a book using even FEWER words. Seuss was able to deliver and win the bet: "Green Eggs and Ham" uses exactly 50 words! Even our rather small Twitter dataset has a large N-dimensional space of 12,603 unique words. The Flesch-Kincaid index is an effective computational effort to add structure to this unstructured data. Also notice that [romney, obama, election, vote, and hope] all appeared within the top fifty words, as determined by overall count. And now Abe will discuss the second vignette on pattern of life using geo social media data.
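The dimensionality point can be sketched with a toy bag-of-words model: each document becomes a vector of word counts whose dimension equals the vocabulary size. The documents here are made up for illustration.

```python
import re

docs = [
    "green eggs and ham",
    "I do not like green eggs and ham",
    "I do not like them Sam I am",
]

token_lists = [re.findall(r"[a-z']+", d.lower()) for d in docs]
vocab = sorted({w for toks in token_lists for w in toks})  # one dimension per word

# Each document as a count vector over the shared vocabulary.
vectors = [[toks.count(w) for w in vocab] for toks in token_lists]
print(len(vocab), vectors)
```

Even these three tiny documents span an 11-dimensional space; at Twitter scale the same construction yields the 12,603 dimensions mentioned above.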
  • Our next vignette will explore how we can use geolocated social media data to understand spatial patterns.
  • For social media content that has explicit geolocations, it is straightforward to plot it on a map. Dropping point markers is useful for a very coarse analysis of what is happening in an area. In this map display, we're examining one day's worth of geolocated tweets in and around the Washington DC area. There is an interesting pattern to where the observations are. In general, you can see that the volume of content is much denser in Northwest DC than, say, Burke, Virginia. However, just looking at these markers, it is difficult to make sense of what this might mean, and it is impossible to effectively contrast a view like this with a view depicting another day. Using one of the three data science concepts from earlier, we need to apply a form of spatial aggregation to better understand what we're looking at.
  • To put this into context, here’s what our analytic approach looks like. Rich and I are taking ingredients in the form of geolocated tweets, applying some aggregation algorithms with python code, and generating new visualizations and quantitative understanding of the space. If you want to repeat this data exploration, you can get the code from Github.
  • One way that we could aggregate the data to understand activity is to aggregate things by political boundaries. For many purposes, this is exactly what we might want to do. In the preceding example, using political boundaries is an effective way of being able to compare named places one to another. However, if we are interested in computing a kernel density or “heatmap” of activity, political boundaries are problematic. All of the 50 states within the US are different sizes. The same goes for counties. And in the case of zipcodes and census tracts, the areas actually change over time. All of these factors make it difficult to apply quantitative statistics. This general class of problem is referred to as the Modifiable Areal Unit Problem.
  • Instead we’ll use the geohash algorithm. Invented in 2008 by software engineer Gustavo Niemeyer, geohash is a data encoding that combines longitude and latitude into a single variable. A computed geohash references a rectangle-shaped box located on the earth. If you are interested in the implementation details, geohash is based on the same encoding concepts as the classical quadtree data structure in computer science. It is called a ‘z-curve’ representation because points with a similar geohash prefix tend to be spatially near (but not always). For our purposes, geohash is very useful for two-dimensional binning: putting geolocated content into boxes for quantification.
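To make the encoding concrete, here is a minimal geohash encoder in Python. It is a sketch of Niemeyer's scheme (standard base32 alphabet, alternating longitude/latitude bits), not the code from the Github repo referenced in the webinar; in practice a library such as python-geohash does this for you.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash's base32 alphabet

def geohash_encode(lat, lon, precision=7):
    """Encode a (lat, lon) pair as a geohash string.

    Bits alternate between longitude and latitude, halving the
    bounding box each step; every 5 bits become one base32 char.
    Longer hashes name smaller boxes, and a hash's prefix names
    the enclosing box (the quadtree/z-curve property).
    """
    lat_range, lon_range = [-90.0, 90.0], [-180.0, 180.0]
    chars = []
    even = True  # even bit positions encode longitude
    while len(chars) < precision:
        ch = 0
        for _ in range(5):
            rng, val = (lon_range, lon) if even else (lat_range, lat)
            mid = (rng[0] + rng[1]) / 2
            if val > mid:
                ch = (ch << 1) | 1
                rng[0] = mid
            else:
                ch = ch << 1
                rng[1] = mid
            even = not even
        chars.append(BASE32[ch])
    return "".join(chars)
```

For example, downtown Washington DC falls in the `dqc` box, and truncating a hash always yields the hash of the containing box.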
  • In this notional example, we’ve taken geolocated tweets with language related to coffee, and binned them into boxes. The numbers that you see in each box refer to the number of coffee mentions during some unit of time, like a day. With this simple, primitive binning mechanism, it is possible to make more advanced visualizations to depict spatial patterns.
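The binning step itself can be sketched in a few lines. To stay self-contained, this version keys each cell by a rounded lat/lon grid rather than a geohash string; the 0.01-degree cell size is an illustrative assumption, not the webinar's actual box size.

```python
from collections import Counter

def bin_points(points, cell_deg=0.01):
    """Bin (lat, lon) observations into grid cells.

    A stand-in for geohash truncation: each key is the lat/lon
    floored to a cell_deg-sized box (roughly 1 km of latitude
    for 0.01 degrees). Counting per key gives the per-box totals
    shown on the notional coffee map.
    """
    counts = Counter()
    for lat, lon in points:
        key = (int(lat // cell_deg), int(lon // cell_deg))
        counts[key] += 1
    return counts
```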
  • For example, we could analyze our data and determine that when there are eight or more geolocated references to coffee within a geohash box on a given day, 95% of the time there is a coffee shop co-located in the same box. To create a visual element around this, we can create a thematic map – based on a simple count of coffee references to visually depict the likelihood of a coffee shop being present.
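A minimal sketch of that count-to-color rule for the thematic map. The 8-mention threshold and the 95% figure come from the example above; the intermediate bands are assumptions added for illustration.

```python
def coffee_shop_likelihood(count):
    """Map a per-box coffee-mention count to a thematic map color.

    count >= 8 corresponds to the ~95%-likelihood case described
    in the talk; the lower bands are illustrative assumptions.
    """
    if count >= 8:
        return "red"      # strong evidence a coffee shop is in the box
    if count >= 4:
        return "orange"   # moderate evidence (assumed band)
    if count >= 1:
        return "yellow"   # weak evidence (assumed band)
    return "none"
```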
  • We can further simplify this information by merely displaying informal probabilities as colors on the map. Two dimensions are useful, but there is nothing preventing us from using additional dimensions to simplify the visual understanding of what the data are trying to tell us. Imagine for example adding another dimension – a height for each box to depict the count of some factor that we are observing. So, how does this work in practice?
  • This next image is a Google Earth KML representation of geolocated social media content in and around Washington DC. Each vertically extruded polygon depicts the amount of activity based on color (where green is low activity, red is high activity) and height where low boxes are low activity and tall boxes are high activity zones. You might say that it is intuitively obvious to the casual observer – the tall red columns are where there is a lot of activity taking place.
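Generating one such extruded box in KML can be sketched as follows. The function name and the fixed color/height inputs are hypothetical; what the sketch shows is the real KML mechanics: colors are aabbggrr hex strings, and `<extrude>1</extrude>` with `relativeToGround` is what produces the vertical columns in Google Earth.

```python
def box_to_kml(name, south, west, north, east, height, color):
    """Emit one vertically extruded KML polygon for a geohash box.

    color is an aabbggrr hex string (KML's byte order), e.g.
    'ff0000ff' for opaque red; height is meters above ground and
    encodes the activity count visually.
    """
    ring = " ".join(f"{lon},{lat},{height}"
                    for lat, lon in [(south, west), (south, east),
                                     (north, east), (north, west),
                                     (south, west)])  # closed ring, lon,lat,alt
    return (
        f"<Placemark><name>{name}</name>"
        f"<Style><PolyStyle><color>{color}</color></PolyStyle></Style>"
        "<Polygon><extrude>1</extrude>"
        "<altitudeMode>relativeToGround</altitudeMode>"
        "<outerBoundaryIs><LinearRing><coordinates>"
        f"{ring}"
        "</coordinates></LinearRing></outerBoundaryIs>"
        "</Polygon></Placemark>"
    )
```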
  • An alternate view from above shows a clear cluster of activity occurring near NW and central DC. There is also a fair amount of content on the periphery of DC, generally following a line around the beltway. Hopefully these are not tweets from drivers! Now if we stopped our data analysis at this point we’d be in for trouble, because we’d be falling prey to one of the classic blunders of geography.
  • This blunder is well depicted by an XKCD cartoon – many heatmaps are basically just population maps. If we take a barefoot empiricist approach to naively counting observations, we could mistakenly find a correlation between unrelated groups like people that subscribe to Martha Stewart Living and people that like UMBC webinars. There are many advanced techniques to avoid this trap. One of the simplest is to filter your data based on some other attribute, such as time.
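Another simple hedge against the "heatmaps are population maps" trap, sketched in Python: express each cell as a rate against total activity rather than a raw count, so dense areas stop dominating. The `min_total` cutoff is an assumed stability threshold, not a value from the webinar.

```python
def topic_rate(topic_counts, total_counts, min_total=20):
    """Per-cell rate of topic mentions relative to all activity.

    Dividing by overall tweet volume (a simple stand-in for
    population) normalizes away sheer density; cells with fewer
    than min_total observations are dropped as too unstable to
    rate reliably.
    """
    rates = {}
    for cell, total in total_counts.items():
        if total >= min_total:
            rates[cell] = topic_counts.get(cell, 0) / total
    return rates
```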
  • In this next example, you can see geolocated social media activity between 10pm and midnight. The observations are much more sparse.
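Isolating such a time slice is a one-line filter once timestamps are parsed. A sketch, assuming each observation carries a parsed datetime under a hypothetical "time" key:

```python
from datetime import datetime

def night_tweets(tweets, start_hour=22):
    """Keep only observations between start_hour and midnight,
    mirroring the 10pm-to-midnight slice shown on the map."""
    return [t for t in tweets if t["time"].hour >= start_hour]
```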
  • As you examine the clusters of activity, two of the hot spots to the extreme west and extreme east are evening activities at area schools. The tallest hotspot in the center is tourist activity near the National Mall in downtown DC.
  • Because I love caffeinated beverages, I decided to look at real data and apply aggregation, association, and correlation to tweets discussing coffee and the word Starbucks. I examined approximately 30,000 geolocated tweets in the DC area, and wrote code to answer three questions: Where is the most commentary about coffee and Starbucks? Is commentary about coffee and Starbucks associated with the location of Starbucks stores? What is the numeric relationship between geo-located coffee commentary and actual stores?
  • Using the geohash algorithm as a mechanism for counting things, I found 81 spatial regions with textual references to the words coffee and/or Starbucks. 8 of the 81 regions are geohash boxes that include references to both coffee and Starbucks within a narrow window of time. 7 of 8 (88%) of these boxes accurately classify a region as containing a Starbucks, just by using simple text analysis alone. This is a very exciting finding – what it tells us as geographers and data scientists is that we can use aggregated geolocated observations as a form of automatic crowdsourcing to learn facts about our environment. We can also use these techniques to build highly accurate, predictive models.
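That 7-of-8 figure is a simple precision calculation over the co-mention boxes. Here is a sketch with toy data; the region and store sets below are hypothetical stand-ins, not the DC sample.

```python
def association_precision(regions, truth):
    """Of the boxes whose text mentions both 'coffee' and
    'starbucks', what fraction actually contain a Starbucks?

    regions: cell -> set of terms observed in that cell
    truth:   set of cells known to contain a store
    """
    candidates = [c for c, terms in regions.items()
                  if {"coffee", "starbucks"} <= terms]  # both terms present
    if not candidates:
        return 0.0
    hits = sum(1 for c in candidates if c in truth)
    return hits / len(candidates)
```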
  • So, hopefully we have shown through some examples that geospatial context does unlock insights into our data, and that location teaches us more about what we are analyzing. By explicitly accounting for space we adhere to statistical assumptions and avoid misspecification of our models. But I think the most important element is that Big Data means that the faucet, so to speak, is always running, which enables unique opportunities for experimentation.
  • There are a number of seminal works already in this space. In an attempt to be fair I have chosen two related works that provide unique insight. Eugene Wigner wrote "The Unreasonable Effectiveness of Mathematics in the Natural Sciences” in 1960. Since I cannot improve on Wigner’s presentation for the natural sciences, I hope to offer some reflection for the social sciences. There is only one thing more unreasonable than the unreasonable effectiveness of mathematics in physics, and that is the unreasonable ineffectiveness of mathematics in social science. Peter Norvig et al. wrote a paper in 2009 titled “The Unreasonable Effectiveness of Data.” Its opening draws a direct comparison to Wigner’s work, saying that sciences that involve human beings rather than elementary particles have proven more resistant to elegant mathematics. Social scientists have suffered from a so-called physics envy over their inability to neatly model human behavior. Norvig continues: "We should stop acting as if our goal is to author extremely elegant theories, and instead embrace complexity and make use of the best ally we have: the unreasonable effectiveness of data." The effectiveness of data directly feeds its ultimate utility. Wigner provides examples of this unreasonable effectiveness of mathematics and notes how Galileo’s experiment is true everywhere on the Earth, was always true, and will always be true. It is valid no matter whether it rains or not, whether the experiment is carried out in the Middle East or Northeast DC, and no matter whether the experimenter is a man or a woman, rich or poor, Muslim or Catholic. This invariance property of physics is well recognized; without invariance principles, physics would not be possible. But, as Ernest Rutherford pointed out, the only law of the social sciences is “some do, some don’t.” So, does one counterexample defeat a law? Social phenomena are not invariant over time or space.
While serial and spatial autocorrelation exist, so do temporal and spatial heterogeneity and, ultimately, uncontrolled variance. Exploiting the complexity of data in the head of the power law holds promise for the social sciences. Integration of these data is key. I think we have shown some examples today of how to integrate, visualize, and analyze these data and ultimately exploit their complexity.
  • As promised, here are links to further material. The links are paths to the data and code used in this webinar. Abe and I can also be reached on Twitter - though I won’t tell you who writes with a lower Kincaid index! :) I can also be reached at my UMBC email should you have any questions.
  • So, in conclusion I would like to thank you on behalf of myself and Abe; we both really enjoyed putting this material together. I would like to echo a couple of key points discussed today. Abe covered key elements of the analysis of geo-social media data: aggregation, association, and correlation, with examples of each and their ultimate utility. I provided some key spatial laws to help govern analysis and pattern discovery and recognition. The chosen method, Moran’s I LISA, used spatial weights files to construct a conceptualization of space, as well as groupings by 3-digit ZIP codes, both of which effectively operationalized the notion of “near” in Tobler’s first law. The areal units and the weights file construct the interaction, or association, as termed in Abe’s slides; they effectively aggregate the data into neighborhoods and ultimately allow a measure of correlation, which in today’s example was autocorrelation, or the relationship a variable has with itself over space. In other words, both vignettes followed these methodological pathways. If you are interested in learning more I would suggest exploring some of the previously promoted material. Alternately you can explore the MPS Program in GIS at UMBC Shady Grove, where nontraditional datasets are explored. Thank you!

Presentation Transcript

  • 1 Big Social Data: The Spatial Turn in Big Data. Rich Heimann, UMBC Adjunct Faculty; Abe Usher, HumanGeo Group. May 9, 2013
  • 2 Agenda: Major Trends; Foundational Definitions [Abe]. Long Tail of Big Social Data [Rich]. Laws of the Spatial Sciences [Rich]; Big Data, Small Theory [Rich]. Important Big Data Concepts [Abe]; The Kitchen Model [Abe]. Vignettes [Rich & Abe]. So, what? Additional Resources.
  • 3 Major Trends: Location Explosion 2004-present
  • 4 Major Trends: Location Explosion 2004-present; Proliferation of mobile computing (7 billion devices in 2014)
  • 5 Major Trends: Location Explosion 2004-present; Proliferation of mobile computing; Social networking (>700 million comments daily; >144 million connections daily)
  • 6 Major Trends: Location Explosion 2004-present; Proliferation of mobile computing; Social networking; Gamification of geo
  • 7 Major Trends: Location Explosion 2004-present; Proliferation of mobile computing; Social networking; Gamification of geo. Impact: continuous, global geo-located observations, shared across the Internet.
  • 8 Definitions: Volunteered Geographic Information* (VGI): “harnessing of tools to create, assemble, and disseminate geographic data provided voluntarily by individuals” (* http://en.wikipedia.org/wiki/Volunteered_geographic_information)
  • 9 Definitions: Social Media: "a group of Internet-based applications that build on the ideological and technological foundations of Web 2.0, and that allow the creation and exchange of user-generated content” (* http://goo.gl/oSrIS)
  • 10 Definitions: Big Data: “high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making” (* http://goo.gl/DFFbr)
  • 11 Long Tail of Big Social Data
Head: Big Data; nontraditional social science data. Large, continuous datasets coincident over time and space. Ideal for multivariate analysis.
Tail {power law distribution}: Data in the tail is often unmaintained beyond its initially designed use case and individually curated. As a result, the data is discontiguous from other research efforts and discontinuous over space and time.
Dark data is suspected to exist or ought to exist but is difficult or impossible to find. The problem of dark data is real and prevalent in the tail. The long tail is an intractably large management problem.
  • 12 Long Tail of NSF Data
Power law: 80% vs. 20%
Number of Grants: 7,478 (80%) | 1,869 (20%)
Dollar Amount: $938,548,595 (80%) | $1,199,088,125 (20%)
Total Grants (NSF07): 9,347 (Count); $2,137,636,716 (Amount)
  • 13 Laws of Spatial Science
Tobler’s First Law of Geography (TFLG) [Tobler, 1970]: “All things are related, but nearby things are more related than distant things.”
Spatial Heterogeneity: the “second law of geography” [Goodchild, 2003].
Spatial Simpson’s Paradox: a global model will always compete, and may be inconsistent, with local models.
Anyon (1982): social science should be empirically grounded, theoretically explanatory and socially critical.
http://www.bigdatarepublic.com/author.asp?section_id=2948
  • 14 Big Data; Small Theory
Spatial Simpson’s Paradox: global standards will always compete with local social phenomena.
Global models average regionally variant phenomena. Local models account for regional variation.
(Figure: violence in the north vs. violence in the south under global and local models.)
  • 15 Important Big Data Concepts: Aggregation, Association, Correlation
  • 16 Important Big Data Concepts
Aggregation: quantitative methods for creating descriptive statistics
Association: methods of identifying relationships of one data element to another
Correlation: the process of quantifying a correspondence between two comparable entities
  • 17 Two Vignettes
1. Spatial patterning of tweet composition from the presidential election of 2012.
2. Pattern of life analysis of a major US city.
  • 18 Kitchen Model: Chef, Ingredients, Utensils, Recipes
  • 19 Kitchen Model: Chef, Ingredients, Utensils, Recipes
  • 20 Practice: Recommended Tools: Python, R, Quantum GIS, Google Earth
  • 21 Vignette 1: The Flesch-Kincaid Reading Algorithm
  • 22 The Flesch-Kincaid Reading Algorithm
RE = 206.835 – (1.015 x ASL) – (84.6 x ASW)
RE = Readability Ease; ASL = Average Sentence Length (i.e., the number of words divided by the number of sentences); ASW = Average number of Syllables per Word (i.e., the number of syllables divided by the number of words).
The output, RE, is a number generally ranging from 0 to 100. The higher the number, the easier the text is to read.
• Scores between 90.0 and 100.0 are considered easily understandable by an average 5th grader.
• Scores between 60.0 and 70.0 are considered easily understood by 8th and 9th graders.
• Scores between 0.0 and 30.0 are considered easily understood by college graduates.
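The Reading Ease formula translates directly to Python. The syllable counter below is a naive vowel-group heuristic, an assumption that is good enough for a demo but not the exact tokenization used in the webinar's pipeline.

```python
import re

def syllables(word):
    """Naive syllable estimate: count runs of vowels.

    An assumption for illustration; real syllabification
    (e.g. 'anywhere' = 3) needs a dictionary or better rules.
    """
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease(text):
    """RE = 206.835 - 1.015*ASL - 84.6*ASW, per the slide."""
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[a-zA-Z']+", text)
    asl = len(words) / sentences                         # words per sentence
    asw = sum(syllables(w) for w in words) / len(words)  # syllables per word
    return 206.835 - 1.015 * asl - 84.6 * asw
```

Short, monosyllabic sentences score high (easy); long, polysyllabic ones score low, which is the gradient the maps in this vignette visualize.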
  • 23 The Flesch-Kincaid Reading Algorithm
Clean Text: “this gas situation is absolutely ridiculous.”
Language: english
Latitude: 41.0862
Longitude: -74.1520
USERID: “ ”
Kincaid: 14.3
Flesch: 3.3
Flesch-Kincaid (Mean Centered): -76.273849
Leesbaarheid Score: 56
Leesbaarheid Grade: 11
  • 24 The Flesch-Kincaid Reading Algorithm
Clean Text: “down here in beach bout to shut this down wit & feeling the vibe s.”
Language: english
Latitude: 33.68709
Longitude: -78.88915
USERID: “ ”
Kincaid: 3.5
Flesch: 100
Flesch-Kincaid (Mean Centered): 20.42615
Leesbaarheid Score: 22.9
Leesbaarheid Grade: 4
  • 25 By the numbers...
Time Span: 2012-10-23 to 2012-11-06 (1 temporal bin, 2 weeks)
Spatial Area: data clipped to US
Original Sample: 110,737 obs; 418,085 words & 1,446,494 characters without stop words (519,974 & 2,326,500 with stop words)
Data processing: removal of hashtags, @{users}, URLs; thresholding and mean centering
Pruned Sample: 47,690 observations
Method: Local Indicator of Spatial Autocorrelation (Moran’s I) with LISA classifications of High-High (HH), Low-Low (LL), High-Low (HL), Low-High (LH)
Spatial Weights: knn40
Data Reduction: pseudo p-values 0.05, 0.01, 0.001
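The webinar computes Local Moran's I (LISA) with knn40 weights using standard GIS tooling. As a self-contained illustration of the underlying statistic, here is global Moran's I in pure Python; the dictionary weights format is an assumption made for the sketch.

```python
def morans_i(values, weights):
    """Global Moran's I: spatial autocorrelation of `values` under
    symmetric weights given as {(i, j): w_ij}.

    I = (n / S0) * sum_ij w_ij*(x_i - xbar)*(x_j - xbar) / sum_i (x_i - xbar)^2
    Positive I: similar values cluster; negative I: checkerboard
    alternation; near zero: spatial randomness.
    """
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    num = sum(w * dev[i] * dev[j] for (i, j), w in weights.items())
    den = sum(d * d for d in dev)
    s0 = sum(weights.values())
    return (n / s0) * (num / den)
```

The local (LISA) variant decomposes this sum into one term per observation, which is what yields the HH/LL/HL/LH classifications on the slide.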
  • 26 The Flesch-Kincaid Reading Algorithm
Region | mean | SD | 0% | 25% | 50% | 75% | 100% | n
East North Central | 0.6193 | 16.514 | -76.274 | -5.77 | 4.93 | 11.92 | 20.426 | 7579
East South Central | 0.6314 | 16.576 | -74.673 | -5.27 | 4.93 | 12.23 | 20.426 | 3028
Mid-Atlantic | -0.1988 | 16.590 | -76.273 | -6.47 | 3.73 | 11.43 | 20.426 | 6278
Mountain | -0.1212 | 16.586 | -73.174 | -7.00 | 4.32 | 11.43 | 20.426 | 2452
New England | -0.1837 | 16.864 | -73.174 | -7.00 | 4.32 | 11.43 | 20.426 | 2392
Pacific | -0.8560 | 17.276 | -78.274 | -7.78 | 3.72 | 11.43 | 20.426 | 5390
Southeast | 0.1469 | 16.730 | -79.373 | -5.78 | 4.32 | 11.43 | 20.426 | 10022
West North Central | 0.6010 | 16.385 | -78.274 | -5.78 | 5.22 | 12.23 | 20.426 | 2781
West South Central | 0.8323 | 16.386 | -79.273 | -4.77 | 5.33 | 12.12 | 20.426 | 5572
  • 27 The Flesch-Kincaid Reading Algorithm
library(ggplot2)
ggplot(Twitter, aes(x=regiontxt, y=flecMC)) +
  geom_point(colour="lightblue", alpha=0.1, position="jitter") +
  geom_boxplot(outlier.size=1, alpha=0.1) +
  labs(x="Region", y="Flesch Kincaid Index")
boxplot(flecMC~regiontxt, ylab="flecMC", xlab="regiontxt", data=Twitter)
https://gist.github.com/rheimann/5525909
  • 29 The Flesch-Kincaid Reading Algorithm: Raw Data (n = 47,690). https://github.com/rheimann
  • 30 The Flesch-Kincaid Reading Algorithm
pseudo p-value < 0.05; data: n = 862 (3-digit ZIP codes)
High, High [n=77] = El Paso, Oklahoma City, Omaha, Detroit, Memphis
Low, Low [n=74] = NYC & San Jose #nerds
Low, High [n=53] = Sacramento
High, Low [n=55] = Wichita, Kansas City, Tulsa, Nashville
(Map labels: Gassaway WV; Watertown NY; Ithaca NY; Columbus OH; Fresno CA)
https://github.com/rheimann
  • 31 The Flesch Reading Ease Algorithm
Rank | ZIP code, City, State | Median Home Price ($) | Flesch-Kincaid Index (Mean Centered) | Leesbaarheid School Index
100 ZIP prefix: Flesch-Kincaid -3.2266; Leesbaarheid 5.44
  6 | 10014, New York, NY | 4,116,506
  8 | 10021, New York, NY | 3,980,829
  1 | 10065, New York, NY | 6,534,430
  10 | 10075, New York, NY | 3,885,409
076 ZIP prefix: Flesch-Kincaid -3.761; Leesbaarheid 5.5
  2 | 07620, Alpine, NJ | 5,745,038
119 ZIP prefix: Flesch-Kincaid -0.0538; Leesbaarheid 5.24
  5 | 11962, Sagaponack, NY | 4,180,385
940 ZIP prefix: Flesch-Kincaid -3.596; Leesbaarheid 5.87
  3 | 94027, Atherton, CA | 4,897,864
  5 | 94010, Hillsborough, CA | 4,127,250
  7 | 94022, Los Altos Hills, CA | 4,016,050
  • 32 Green Eggs and Ham: N-Dimensional Problems
Green Eggs and Ham by Dr. Seuss averages 5.7 words per sentence and 1.02 syllables per word, with a grade level of −1.3. (Most of the 50 words used are monosyllabic; "anywhere", which occurs 8 times, is the only exception.) The 50-dimensional space is small.
Even in this fairly small Twitter sample, and after lots of data processing to remove words with count 1 and words of fewer than three characters, the space is N = 12,603 dimensional. Data processing includes removing stop words and stemming.
110,737 obs; 418,085 words & 1,446,494 characters without stop words (519,974 & 2,326,500 with stop words).
Top 50 words include: [romney, obama, election, vote, hope]
  • 33 Vignette 2: Spatial Patterns of Activity
  • 34 Spatial Patterns of Activity: Geolocated Social Media
New forms of aggregation unlock new insights in your data.
Useful for coarse pattern analysis. Looks interesting. Difficult to analyze directly.
  • 35 Spatial Patterns of Activity: Applying the Kitchen Sink
Rich & Abe | Geolocated Social Media | Python, Geohash Algorithm | Code on Github
  • 36 Spatial Patterns of Activity: Let’s use Political Boundaries
States, Counties, and Census tracts. All different sizes. Sometimes change. This is a problem: MAUP. http://goo.gl/wQLTW
  • 37 Spatial Patterns of Activity: Let’s NOT use Political Boundaries
States, Counties, and Census tracts. All different sizes. Sometimes change. This is a problem: MAUP. http://goo.gl/wQLTW
  • 38 Spatial Patterns of Activity: Geohash
Invented in 2008 by Gustavo Niemeyer. Similar to quadtree; breaks the world into rectangles. Based on a z-curve algorithm. Useful for 2-d binning.
  • 39 Spatial Patterns of Activity: Geohash Math
Notional example: occurrence of geolocated tweets related to coffee (figure: grid of per-box counts).
  • 40 Spatial Patterns of Activity: Geohash Math (same notional grid)
  • 41 Spatial Patterns of Activity: Geohash Math
  • 42 Spatial Patterns of Activity: 3-d Google Earth (activity near Washington DC)
  • 43 Spatial Patterns of Activity: 3-d Google Earth (activity near Washington DC)
  • 44 Spatial Patterns of Activity: Avoid the Classic Blunders (http://xkcd.com/1138/)
  • 45 Spatial Patterns of Activity: Isolating a Time Series (night activity near Washington DC)
  • 46 Spatial Patterns of Activity: Isolating a Time Series (map labels: School Event, Tourists, School Event)
  • Spatial Patterns of Activity: A Caffeinated Example
Aggregation: Where is the most commentary about coffee and Starbucks?
Association: Is commentary about coffee and Starbucks associated with the location of Starbucks stores? (Yes)
Correlation: What is the numeric relationship between geo-located coffee commentary and actual stores?
  • Where is Starbucks?
81 spatial regions identified with textual references to the words ‘coffee’ and/or ‘Starbucks.’
8 of the 81 regions are boxes that include references to both ‘coffee’ and ‘Starbucks’ within a narrow window of time.
7 of 8 (88%) accurately classify a region as containing a Starbucks by using simple text analysis alone.
(Map annotations: .09, .52, .88)
  • 49 So, What?
• Putting data in geospatial context unlocks insight.
• Location teaches us more about what we are analyzing.
• Adhere to statistical assumptions and avoid misspecification in our models.
• The “Big Data” aspects of social media mean that the faucet is always running, enabling experimentation.
  • 50 Academic Works; Embracing Complexity
Eugene Wigner (Nobel Laureate in Physics): “The Unreasonable Effectiveness of Mathematics in the Natural Sciences” (1960)
Peter Norvig, Director of Research at Google Inc.: “The Unreasonable Effectiveness of Data” (2009)
  • 51 Additional resources; Code and stuff...
Rich Heimann
Code and Data: https://github.com/rheimann
Slides: http://www.slideshare.net/rheimann04
Twitter: @rheimann
UMBC: rheimann@umbc.edu
Company: Data Tactics Corporation: http://goo.gl/8QWty
Abe Usher
Code and Data: https://github.com/abeusher
Twitter: @abeusher
Company: HumanGeo Group: http://goo.gl/uDbZP
  • 52 Thank you!! http://www.umbc.edu/shadygrove/gis/gis.php
  • 53 Recommended resources: Books
  • 54 Recommended resources: Data
Foundational data:
1. Geonames.org: http://www.geonames.org/
2. GADM.org: http://gadm.org/
Streaming data:
1. Twitter API: https://dev.twitter.com/ (via Datasift: http://datasift.com/)
2. GNIP: http://gnip.com/