This article was downloaded by: [Ball State University]
On: 23 April 2013, At: 11:24
Publisher: Taylor & Francis
Informa Ltd Registered in England and Wales Registered Number: 1072954 Registered office: Mortimer House,
37-41 Mortimer Street, London W1T 3JH, UK
Cartography and Geographic Information Science
Publication details, including instructions for authors and subscription information:
http://www.tandfonline.com/loi/tcag20
Spatial, temporal, and socioeconomic patterns in the
use of Twitter and Flickr
Linna Li (a), Michael F. Goodchild (a) & Bo Xu (b)
(a) Department of Geography, Center for Spatial Studies, University of California, Santa
Barbara, CA, USA
(b) Department of Geography and Environmental Studies, California State University, San
Bernardino, CA, USA
Version of record first published: 19 Apr 2013.
To cite this article: Linna Li, Michael F. Goodchild & Bo Xu (2013): Spatial, temporal, and socioeconomic patterns in the use
of Twitter and Flickr, Cartography and Geographic Information Science, 40:2, 61-77
To link to this article: http://dx.doi.org/10.1080/15230406.2013.777139
Sakaki, Okazaki, and Matsuo 2010), the automatic detection
of local events (Lee and Sumiya 2010), and the prediction of
election results based on the sentiments expressed in
tweets (Tumasjan et al. 2010). Furthermore, check-ins
collected from location-sharing services were used to
study human mobility patterns (Cheng et al. 2011).
Although data collected from social media, such as
Twitter, have been increasingly used to study geographic
landscapes and human behaviors (Li and Goodchild,
2013), it is difficult to estimate the representativeness of
such data. Despite the various studies, thus far there is no
research on the socio-demographic characteristics of users.
Such research would be of great value, since georeferenced
data from Twitter and Flickr reflect the characteristics
of places as well as those of local residents.
However, research has been done on socio-demo-
graphic characteristics of Internet users using surveys in
many countries. For example, Soule, Shell, and Kleen
(2003) found that gender is not a significant variable in
explaining heavy Internet usage, but education is, based
on the data from the Tenth Graphic, Visualizations, and
Usability Center (GVU) Survey conducted on the Web. A
study in the Philippines showed that younger, more affluent,
and well-educated people in places with better infrastructure
are more capable of using Information and
Communications Technology (ICT; Alampay 2006).
Different Internet usage patterns of people from different
socio-economic groups were identified in central
Queensland (Taylor et al. 2003). As demonstrated in
these studies, the characteristics of Internet users are cru-
cial for understanding a range of relevant phenomena,
such as Internet addiction, social opportunities through
the access to ICT, and behavioral patterns in using such
technologies. All of these studies collected data primarily
through questionnaires; because conducting surveys is
time-consuming and labor-intensive, they could rely on only a
small number of participants. In our study, we use geographic
location as a link to associate social media usage and
characteristics of local residents based on the data auto-
matically collected using social media APIs and the aux-
iliary census data.
This study provides an exploratory analysis of a subset
of Twitter and Flickr users, those who provide locational
information for tweets and photos, in terms of their demo-
graphic and socioeconomic properties at the county level
in California. Georeferenced tweets and photos indicate
the presence of their creators at that location. There are
three major reasons why people are present at a particular
location: it is their place of residence, their place of work,
or a tourist attraction. In this article, we select
georeferenced tweets and photos contributed by local residents
to explore the demographic and socioeconomic character-
istics of these users. A user is considered a local resident
in a county only when the time interval between two
tweets or photos produced in that specific county by the
user is longer than 10 days.
The remainder of the article is structured as follows.
The section “Twitter and Flickr data collection and pre-
processing” describes the collection and pre-processing of
georeferenced data from Twitter and Flickr. The section
“The spatial distribution of georeferenced tweets and
photos” presents the spatial distributions of georeferenced
tweets and photos over the contiguous United States,
followed by a discussion of the temporal patterns of geor-
eferenced tweets and photos in the section “Temporal
patterns of tweets and photos.” We propose two descrip-
tive models in the section “Descriptive models of tweet
and photo densities in California” to illustrate the relation-
ships between the tweet and photo densities and the char-
acteristics of people in different counties of California.
The article concludes with a discussion of implications
and future research directions.
Twitter and Flickr data collection and pre-processing
Tweets and photo metadata were collected using Twitter
and Flickr’s public APIs and stored in a MySQL database.
We collected data from 21 January to 7 March 2011; these
dates were chosen to avoid major events that might cause
unusual patterns. In total, there are 19,758,954 records for
Twitter and 4,263,227 records for Flickr within the bound-
ing box of the contiguous United States. Location asso-
ciated with each tweet is in a variety of forms with
different precision levels. It may be automatically captured
by built-in Global Positioning System (GPS) receivers in
mobile devices like smart phones, calculated according to
the relative position of the user’s equipment in a cellular
network, or manually selected by a user from a set of
place names provided by Twitter. In the first case, location
is in the form of latitude and longitude, while in other
cases location is usually recorded as a neighborhood, a
city, or even a country. Other than coordinates, Twitter
takes the estimated location of a user’s device or an
Internet Protocol (IP) address of a computer and reverse
geocodes it to a few possible places provided to the user
for selection. The positional accuracy varies from one
method to another. For location recorded by GPS, accuracy is
usually on the order of several meters. For location
determined by triangulation in a cellular network, accu-
racy ranges from 30 to 3000 m, depending on the spatial
distribution of cells (Zandbergen 2009). For IP address,
the positional accuracy of georeference depends on the
method used to convert IP addresses to geographic coor-
dinates, usually at the level of ZIP code, city, state, or even
country. For example, Maxmind’s free GeoLite City data-
base claims that the spatial accuracy of georeference is
“over 99.5% on a country level and 78% on a city level
for the U.S. within a 40 kilometer radius.” Finally, the
accuracy of a place name depends on the spatial extent of
the place. Information about the tweets in the database
contains tweet ID, tweet text, time, location, and user ID.
In Flickr, photos were either georeferenced by built-in
GPS in cameras or manually georeferenced by a user who
identified photo location on a map. The location could
either be the place where a photo was taken or be the
location of an object in the photo. Automatic recording by
a GPS receiver is always the former case, while manually
georeferenced photos could be either way. One typical
error in photo location occurs when a user assigns a
group of photos taken at several places to the same
location. Photo metadata contain information about photo
ID, photo title, description, tags, upload time, time when a
photo was taken, location, and owner ID.
For both tweets and photos, the locations are resolved
to five decimal places of latitude and longitude (approxi-
mately 1 m), but we should expect that the accuracy of
location is dependent on the accuracy of GPS in mobile
devices (which could be several meters) or the map scale
when a user specifies a photo location. Because the objec-
tive of this article is to study the spatial and temporal
patterns of tweets and photos, only data that have point
locational information with relatively high precision are
used, and those that are not georeferenced are excluded. It
is estimated that the percentage of georeferenced tweets is
less than 1% and that of geotagged photos is around 3.33% (see Note 1).
However, the total numbers of tweets and photos are
very large, so we can still obtain great volumes of geor-
eferenced data. In addition, we must be aware that these
data were contributed by users who are willing to share
their locations and not by everyone who uses the two
services. Therefore, the data are a subset of the entire
datasets of Twitter and Flickr given spatial and temporal
constraints, and the users are a subset of the entire user
groups. Like other data created by volunteers, there is bias
in terms of contributions made by different users, because
most contributions come from a very small percentage of
the total number of contributors. For instance, “In most
online communities, 90% of users are lurkers who never
contribute, 9% of users contribute a little, and 1% of users
account for almost all the action” (Nielsen 2006). Haklay
(2010) showed that most of the data for England were
contributed by only a few users and the difference of road
data coverage between wealthy areas and poor areas is
about 8% in OpenStreetMap (OSM). Contribution bias is
also present in our datasets. The 300 heaviest contributors
among local Twitter and Flickr users who share geographic
footprints are represented in Figure 1a and 1b, which show
the long tail effect: a large share of tweets and photos
is created by the first few hundred contributors.
When examining the relationships between georefer-
enced data densities and socioeconomic characteristics of
residents in California, we verify that the data were
produced by local users. First, we chose county as the
data aggregation level, because a person is more likely to
live in one census tract and work in another. Therefore, it
is difficult to tell whether a location is a user’s home or
work place at a finer spatial scale. By contrast, people are
more likely to live and work in the same county.
According to the 2000 Census Bureau county-to-county
commuting data for California, the percentage of resi-
dents who commute within the same county is as high as
83%. Second, we estimated the time a user stays in a
county from the time interval between two
tweets or photos produced in that county by the same user.
A user is regarded as local only when this time interval is
greater than 10 days, and only data created by such users are
retained for further analysis.
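The residency rule above can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation: the record layout (user, county, timestamp), the function name local_records, and the sample values are all hypothetical, and the rule is read here as "the first and last records of a user in a county are more than 10 days apart."

```python
from datetime import datetime, timedelta

def local_records(records, min_span=timedelta(days=10)):
    """Keep records of (user, county) pairs whose first and last
    records in that county are more than min_span apart."""
    first_last = {}
    for user, county, when in records:
        key = (user, county)
        lo, hi = first_last.get(key, (when, when))
        first_last[key] = (min(lo, when), max(hi, when))
    local = {key for key, (lo, hi) in first_last.items() if hi - lo > min_span}
    return [r for r in records if (r[0], r[1]) in local]

# Hypothetical records: (user ID, county, time of tweet or photo).
records = [
    ("u1", "Santa Barbara", datetime(2011, 1, 22, 9, 0)),
    ("u1", "Santa Barbara", datetime(2011, 2, 20, 18, 30)),
    ("u2", "Santa Barbara", datetime(2011, 2, 1, 12, 0)),
    ("u2", "Santa Barbara", datetime(2011, 2, 3, 12, 0)),
    ("u1", "Los Angeles", datetime(2011, 2, 5, 8, 0)),
]
kept = local_records(records)
```

In this made-up sample only the two Santa Barbara records of user u1 are retained; user u2 spans two days and the single Los Angeles record spans none.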
Correlations between tweet and photo densities and
contributors’ properties were calculated at the county
level. Ideally, socioeconomic characteristics of users
would be determined at the individual level, but that
type of data is not available for obvious reasons, so loca-
tions were used to link the data densities and the residents.
This type of correlation based on group data rather than
individual data is called ecological correlation (Robinson,
1950). Ecological correlations between tweet and photo
densities and the socioeconomic characteristics of people
suggest that certain people with specific characteristics are
more involved in the generation of georeferenced tweets
and photos. However, it would be fallacious to infer
individual behaviors from data aggregated to geographic
areas (Openshaw 1984; Piantadosi, Byar, and Green 1988;
King 1997). For example, correlation between the number
of tweets from a place and the number of Native
Americans present in that place does not imply that
Native Americans are more likely to tweet. This study is
a first step toward an understanding of the relationships
between georeferenced tweets and photos and population;
the results suggest that it would be valuable to further
investigate these relationships.
The spatial distribution of georeferenced tweets and
photos
We plotted the locations of georeferenced tweets on a
map. As demonstrated in Figure 2, tweet locations roughly
describe the administrative boundary of the United States
and major roads at a very good resolution, which is similar
to the representation of Flickr photos in other research
(Crandall et al. 2009). Figure 3 shows georeferenced
tweets in part of Los Angeles. At this scale, the blocks
and local roads are delineated by tweet locations. For
instance, tweet locations are well aligned with the location
and shape of freeways, such as Interstate 405, as well as
some local roads. High density along major roads might
indicate people tweeting from vehicles, and perhaps from
locations adjacent to major roads such as hotels and gas
stations as well.
Flickr photos have similar spatial patterns to tweet
locations. However, the number of photos is substantially
smaller than that of tweets during the same time period. It
takes more effort to take and upload photos than it does to
generate tweets. Despite a smaller number of photos than
tweets, some places are associated with more photos.
Intensive tweets are usually generated at places with
high population density, such as big metropolitan areas;
however, many photos are also taken at places with low
population density, such as Yosemite National Park.
Figure 1. (a) The number of georeferenced tweets generated by the top 300 contributors (highest: left; lowest: right). (b) The number of
georeferenced photos generated by the top 300 contributors (highest: left; lowest: right). [Axes: ranked user (top 300 users generating most tweets or photos) against the number of georeferenced tweets or photos.]
Figure 2. Georeferenced tweets within the bounding box of the contiguous United States.
Figure 3. A close-up of georeferenced tweets in part of Los Angeles.
To estimate the number of tweet and photo occur-
rences per unit area, we performed a kernel density ana-
lysis of the national data using tweet and photo locations.
Kernel density is a way of estimating the intensity of
points by creating a smooth surface using a bivariate
probability density function (Bailey and Gatrell 1995).
The kernel estimator is defined as
$$ f(x) = \frac{1}{nh} \sum_{i=1}^{n} K\left(\frac{x - x_i}{h}\right) \tag{1} $$
where n is the total number of points, h is the bandwidth
that determines the amount of smoothing, K is the kernel
function, x is the location of estimation, and x_i is a known
point location. The kernel function K could have different
forms, such as a Gaussian distribution, negative
exponential, or a simple binary function (it is constant
within the bandwidth and zero otherwise). The quadratic
function we used in the analysis is given below
(Silverman 1986):
$$ K(c) = \begin{cases} \dfrac{3}{\pi}\left(1 - c^{T}c\right)^{2} & \text{if } c^{T}c \le 1 \\ 0 & \text{otherwise} \end{cases} \tag{2} $$
There are two parameters in kernel density estimation:
kernel bandwidth and cell size. The kernel bandwidth was 100 km
and the cell size was 1 km, given the size of the region.
The kernel bandwidth of 100 km is a compromise between
a map that is too smooth to interpret and one that is too
noisy to interpret. The cell size of 1 km was used to show
fine detail. As shown in Figures 4 and 5, both tweets and
photos tend to cluster in major cities with high population
density. For example, Seattle, Portland, San Francisco, and
Los Angeles on the west coast and Boston, New York
City, Baltimore, and Washington DC on the east coast are
clusters of both tweets and photos. We can identify almost
all major cities with significant economic, political, and
social influence in the United States from these two maps.
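As a rough illustration of the kernel density estimate described above, the sketch below evaluates the kernel of Equation (2) at grid locations. It is not the software used in the study. The point coordinates and the 50 km grid spacing are made up, and the code divides by n times h squared, the usual normalization for two-dimensional points, which is an assumption here because Equation (1) as printed divides by nh.

```python
import math

def quartic_kernel(dx, dy):
    """Equation (2): (3/pi) * (1 - c^T c)^2 inside the unit disc, 0 outside."""
    r2 = dx * dx + dy * dy
    return (3.0 / math.pi) * (1.0 - r2) ** 2 if r2 <= 1.0 else 0.0

def kernel_density(x, y, points, h):
    """Density at (x, y); points and bandwidth h share the same units."""
    total = sum(quartic_kernel((x - px) / h, (y - py) / h) for px, py in points)
    # Assumed 2-D normalization: 1 / (n * h^2).
    return total / (len(points) * h * h)

# Hypothetical point locations in kilometers: a small cluster and one outlier.
points = [(0.0, 0.0), (10.0, 0.0), (0.0, 20.0), (300.0, 300.0)]
bandwidth = 100.0

# Evaluate the surface on a coarse 50 km grid (the study used 1 km cells).
grid = [[kernel_density(cx, cy, points, bandwidth) for cx in range(0, 400, 50)]
        for cy in range(0, 400, 50)]
```

A larger bandwidth spreads each point over a wider disc and gives a smoother surface, while a smaller cell size only changes how finely that surface is sampled.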
Although there are consistent patterns of tweets and
photos occurring at cities with high population density,
there are some differences, too. We calculated the normal-
ized density difference as follows:
$$ D_d = \frac{D_p}{\max(D_p)} - \frac{D_t}{\max(D_t)} \tag{3} $$
where D_d measures the relative difference between tweet
density and photo density, D_p and D_t are photo density and
tweet density at a location, respectively, and max(D_p) and
max(D_t) are the maximum photo and tweet density within
the study area. To account for the difference in total
volume between the two sources, we normalized each density
value by the maximum density in its source, so that density
in both sources ranges between 0 and 1. This allows us to
compare the relative density of photos and tweets at each
location. As shown in Figure 6, some
locations stand out in the map of density difference as
places with high photo density, such as Lake Tahoe and
Yosemite National Park in California, Charleston in South
Carolina, and Orlando in Florida – which are popular
tourist attractions. The normalized photo density for
these places is substantially higher than the normalized
tweet density. On the other hand, Atlanta in Georgia,
Cincinnati and Columbus in Ohio, and Detroit in
Michigan have significantly higher normalized tweet density.
Furthermore, there are many tweets in the city of
Denver but a considerable number of photos in the
Rockies west of Denver.
Figure 4. Tweet density within the bounding box of the contiguous United States.
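The normalized density difference of Equation (3) is simple to compute once both density surfaces are sampled at the same locations. The sketch below is illustrative only: the location names and density values are invented, and the function name is hypothetical.

```python
def normalized_difference(photo_density, tweet_density):
    """Equation (3): photo density over its maximum minus
    tweet density over its maximum, per location."""
    max_p = max(photo_density.values())
    max_t = max(tweet_density.values())
    return {loc: photo_density[loc] / max_p - tweet_density[loc] / max_t
            for loc in photo_density}

# Made-up densities at three locations.
photo = {"park": 8.0, "city": 10.0, "suburb": 1.0}
tweet = {"park": 2.0, "city": 100.0, "suburb": 30.0}
difference = normalized_difference(photo, tweet)
```

Because each term lies between 0 and 1, the difference lies between -1 and 1: positive values mark places that are relatively photo-heavy (the park in this sample) and negative values mark places that are relatively tweet-heavy (the suburb).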
At a finer scale, we generated a tweet density surface in
Los Angeles using a kernel bandwidth of 10 km and a cell size of
100 m. As shown in Figure 7a, downtown Los Angeles and
Beverly Hills have the highest tweet density, which gradually
decreases in the surrounding areas. The photo density
surface in Los Angeles is shown in Figure 7b, with
three major clusters in downtown Los Angeles, Pasadena,
and Santa Monica. In these two figures, density estimation
does not stop at the coast and the values are not zero in the
ocean; however, a spatial constraint clearly could be applied
in the density calculation.
Figure 5. Flickr photo density within the bounding box of the contiguous United States.
Figure 6. Normalized density difference between Flickr photos and tweets.
Figure 7. (a) Tweet density in Los Angeles County. (b) Flickr photo density in Los Angeles County.
Temporal patterns of tweets and photos
The density of tweets varies from place to place and also
through time. The hourly number of georeferenced tweets
in Los Angeles within a week is shown in Figure 8. The
highest rates of tweeting occurred between 8:00 in the
morning and midnight. There are generally two tweet
peaks: one around 13:00–14:00 in the afternoon and the
other around 20:00–21:00 in the evening. The lowest rate
of tweeting is around 4:00–5:00 in the morning when most
people are sleeping. This trend is relatively consistent in
each day of the week and represents the activity pattern of
georeferenced tweets. A comparison of temporal patterns
of tweets and photos is shown in Figure 9a and 9b. In
contrast to the temporal pattern of tweets, Flickr users are
substantially more active during weekends and the rate of
photo-taking is highest during the afternoon hours.
However, temporal uncertainty should be considered
when interpreting the results. The time when a photo
was taken is provided by a camera, but not all photogra-
phers consistently keep the right time setting.
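An hourly profile of the kind summarized above can be tabulated directly from record timestamps. The following sketch is an assumption-laden illustration, not the procedure used in the study: the timestamps are invented and are assumed to be already expressed in local time.

```python
from collections import Counter
from datetime import datetime

def hourly_profile(timestamps):
    """Count records per (weekday, hour); weekday 0 is Monday."""
    return Counter((t.weekday(), t.hour) for t in timestamps)

# Hypothetical local-time stamps of georeferenced tweets.
stamps = [
    datetime(2011, 2, 7, 13, 5),    # Monday, early afternoon
    datetime(2011, 2, 7, 13, 40),
    datetime(2011, 2, 7, 20, 15),   # Monday evening
    datetime(2011, 2, 12, 15, 0),   # Saturday afternoon
]
profile = hourly_profile(stamps)
peak = max(profile, key=profile.get)  # busiest (weekday, hour) slot
```

If camera clocks are set incorrectly, as cautioned above for photos, the same tabulation would shift counts into the wrong hour or day.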
Descriptive models of tweet and photo densities in
California
In this section, we infer the characteristics of georefer-
enced tweet and photo users by studying the relationships
between tweet and photo densities and the socioeconomic
characteristics of people in different counties of California.
The hypothesis is that areas with high tweet or photo
density tend to have residents with specific characteristics
in terms of age, race, educational attainment,
type of occupation, and household income. The tweet
dataset contains 602,371 tweets in California that were
georeferenced by GPS, created by 44,097 users. Because
the study uses socioeconomic data of local residents only,
the raw data were preprocessed to exclude data that were
likely to be generated by tourists. As mentioned above, a
user is regarded as a local resident if he or she stays in a
county for a relatively long period of time (i.e., 10 days),
which is verified by the time interval between two tweets
or photos generated by the same user. As a result, there are
432,475 georeferenced tweets generated by 18,315 local
users, which represent about 71.80% of all georeferenced
tweets.
Data on distributions of age, race, educational attain-
ment, occupation, and household income were obtained
from the American Community Survey (ACS) 2006–
2010. These data made up the set of explanatory variables.
To create spatially intensive variables, all variables were
normalized by the total number of people in each county.
For instance, tweet density was calculated as the number
of tweets divided by the total population in a county. Hence, the
tweet density in the model is different from the tweet
density represented as a kernel density surface in the
section “The spatial distribution of georeferenced tweets
and photos”: It is the number of tweets per person in a
county, rather than the number of tweets per land area
unit. The explanatory variables consist of the percentage
of people who fall into each of the categories (e.g., there
are 23 age groups, ranging from “under 5 years” to “85
years and over,” so there are 23 variables for the percentage
of people in all age groups and they add up to 1).
Figure 8. The average number of tweets per hour in Los Angeles County.
Figure 9. (a) Time chart for georeferenced tweets. (b) Time chart for georeferenced photos.
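The per-person normalization described above amounts to two divisions, sketched here with invented numbers. The county names, counts, and function names are hypothetical, and only three of the 23 age groups are shown for brevity.

```python
def per_capita(counts, population):
    """Divide each county's count by its total population."""
    return {county: counts[county] / population[county] for county in counts}

def group_shares(group_counts):
    """Convert counts per category into proportions that sum to 1."""
    total = sum(group_counts.values())
    return {group: n / total for group, n in group_counts.items()}

# Made-up tweet counts and populations for two counties.
tweets = {"County A": 1200, "County B": 300}
population = {"County A": 400000, "County B": 50000}
tweet_density = per_capita(tweets, population)

# Made-up counts for a shortened list of age groups in one county.
age_counts = {"under 5 years": 3000, "5 to 9 years": 3500, "85 years and over": 1500}
age_share = group_shares(age_counts)
```

In this sample County B has fewer tweets than County A but twice the tweets per person, which is the quantity the descriptive models use.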
Because each of these types of data has many categories,
the number of explanatory variables is large
compared to the number of observations, and some explanatory
variables are correlated with each other. Multiple
linear regression is therefore not appropriate, because it requires
the absence of multicollinearity. Partial least squares
regression (PLSR), on the other hand, is a method parti-
cularly useful for describing the correlation between a
dependent variable and a set of strongly collinear inde-
pendent variables. It aims to reduce the set of variables to
a smaller number of uncorrelated components that char-
acterize most of the covariance between the dependent
variable and independent variables. PLSR was introduced
by Wold (1966) in the social sciences, and was later
widely adopted in chemometrics (Wold, Sjöström, and
Eriksson 2001). PLSR is related to principal component
regression (PCR): Both extract components from original
independent variables for regression modeling; however,
they differ in several ways. The major difference is that
principal components in PCR are solely determined by the
variance of independent variables, while those in PLSR
are determined by the covariance between dependent and
independent variables (Garthwaite 1994). Therefore, the
methods for constructing components in PCR and PLSR
are different, and the latter has the capability to capture
most of the information in independent variables that
explains the dependent variable by avoiding the problem
in PCR of discarding important principal components with
a low variance (Jolliffe 1982).
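To make the construction of PLSR components concrete, the sketch below implements a generic NIPALS-style algorithm for a single dependent variable. It is a textbook-style illustration under stated assumptions, not the software or exact procedure used in the study; the function names and the small data set (with two nearly collinear predictors) are made up.

```python
def center(values):
    """Subtract the mean from a list of numbers."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]

def pls1(X, y, n_components):
    """NIPALS-style PLS for a single response.

    X is a list of rows, y a list of responses. Returns the score vectors
    and the fraction of variance in y explained by each component.
    """
    n, p = len(X), len(X[0])
    columns = [center([row[j] for row in X]) for j in range(p)]
    X = [[columns[j][i] for j in range(p)] for i in range(n)]
    y = center(y)
    total_ss_y = sum(v * v for v in y)
    scores, explained = [], []
    for _ in range(n_components):
        # Weight vector: direction in X with maximal covariance with y.
        w = [sum(X[i][j] * y[i] for i in range(n)) for j in range(p)]
        norm = sum(v * v for v in w) ** 0.5
        w = [v / norm for v in w]
        # Component scores, X loadings, and regression of y on the scores.
        t = [sum(X[i][j] * w[j] for j in range(p)) for i in range(n)]
        tt = sum(v * v for v in t)
        loadings = [sum(X[i][j] * t[i] for i in range(n)) / tt for j in range(p)]
        q = sum(y[i] * t[i] for i in range(n)) / tt
        # Deflate X and y so the next component is uncorrelated with this one.
        X = [[X[i][j] - t[i] * loadings[j] for j in range(p)] for i in range(n)]
        y = [y[i] - q * t[i] for i in range(n)]
        scores.append(t)
        explained.append(q * q * tt / total_ss_y)
    return scores, explained

# Made-up data: the second column is nearly twice the first (collinear).
X = [[1, 2.1, 5], [2, 3.9, 3], [3, 6.2, 8], [4, 7.8, 1], [5, 10.1, 7], [6, 12.0, 4]]
y = [1.2, 1.9, 3.2, 3.8, 5.1, 6.1]
scores, explained = pls1(X, y, 2)
```

The weight vector is built from the covariance between the predictors and the response, which is the point of contrast with PCR drawn above; the collinear columns cause no difficulty because no matrix inversion is involved.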
Fifty-eight explanatory variables in the model can be
grouped into five categories: age, race, educational attainment,
income, and occupation. Applying PLSR to
the data yielded five components that explain most of
the variance in tweet density (70.81%) and in the original
58 independent variables (82.89%). Table 1 lists the
percentages of variance in the dependent and independent
variables explained by each component, and Table 2
gives a sample loading matrix for the five components
obtained from the original variables (see Appendix 1 for
the entire loading matrix for PLS components in the tweet
density model). The loading measures the importance of
each variable in accounting for the variance of a compo-
nent. A high loading value means that a specific variable
accounts for much variance in a component. Table 3 gives
a brief description of the meaning of the five components
based on the loading values. The first component accounts
for 37.94% of the variation in tweet density and 28.59% of
the variation in the independent variables. It has high
positive loadings on the occupation variable of management,
business, science, and arts, the education variables of
bachelor’s degree and graduate or professional degree,
and the household income variables of $200,000 or more
and $150,000 to $199,999. We may broadly call it a
well-educated people component. The second component
explains 12% of the variation in tweet density and
9.87% of the variance in the independent variables. It
has high positive loadings on low levels of education
(i.e., less than 9th grade and 9th to 12th grade, no diploma)
and occupations in transportation and material moving.
This is a component for less-educated people. The third
component represents people of other races and accounts for
13.81% of the variance in the dependent variable but only
3.52% of the variance in the independent variables. The last
two components both have low explanatory power for tweet
density and are not considered important in the model.
Interestingly, there is no obvious difference between males
and females in the behavior of generating georeferenced
tweets, so sex was not included in the final model. In simple
correlations, tweet density is also highly correlated with the
percentage of people between the ages of 25 and 44 years, but
age is correlated with income in this dataset, so age variables
do not show up as highly loaded predictors on the
components.
Table 1. The percentage of variances explained by components in the Twitter model (expressed as proportions).
                                             Component
                                             1         2         3         4         5
Explained variance in independent variables  0.2859    0.0987    0.0352    0.3199    0.0893
Explained variance in dependent variable     0.3794    0.1200    0.1381    0.0159    0.0548
Note: Independent variables are percentages of people falling into different subcategories of age, race, educational attainment, occupation, and household
income, respectively, obtained from ACS (2006–2010), and the dependent variable is tweet density.
Table 2. Sample loading matrix for PLS components in the Twitter model.
                                                              Component
Explanatory variables                                         1         2         3         4         5
Bachelor’s degree                                             0.401932  −0.0772   0.042405  0.024567  0.009941
Graduate or professional degree                               0.29582   −0.03287  0.007762  0.025411  0.000464
$150,000 to $199,999                                          0.174187  0.013374  −0.04549  −0.02102  −0.0099
$200,000 or more                                              0.245107  0.042515  −0.02049  −0.00515  −0.02389
Management, business, science, and arts occupations           0.49972   −0.17736  0.009559  0.054147  −0.02251
Service occupations                                           −0.14328  −0.05453  0.092773  0.047814  −0.03232
Sales and office occupations                                  0.021245  0.022759  0.069738  −0.03638  −0.02886
Natural resources, construction, and maintenance occupations  −0.25131  0.09103   −0.11676  −0.03511  0.029064
Production, transportation, and material moving occupations   −0.12638  0.118096  −0.05531  −0.03048  0.054624
The scores on each component may be mapped, as
demonstrated in Figure 10 for the first component. There
are five shades of color classified by natural breaks from
the darkest for the highest positive scores (the maximum:
0.36) to the lightest for the negative scores (the minimum:
–0.16). The San Francisco Bay area is described by high
positive scores, shown as the darkest area in the map. The
first component characterizes the percentage of people
with high education and salary, and associates this combi-
nation of characteristics with a high rate of tweeting. Take
San Francisco and Santa Clara Counties as an example.
These are places where many people work in high-tech
jobs with an advanced degree and where tweet density is
high. In contrast, northern and central California are
dominated by negative scores, suggesting that the percentage
of well-educated people and tweet density are low in
these areas.
Figure 10. First component scores for tweet density: linear combinations of the independent variables. [Map legend, component scores in five classes: −0.159129 to −0.122556; −0.122555 to −0.074462; −0.074461 to −0.013799; −0.013798 to 0.184898; 0.184899 to 0.355314. Scale bar: 0 to 400 km.]
Table 3. Description of the PLS components in the Twitter model.
Component  Description
1          Well-educated people
2          Less-educated people
3          Other race people
4          White people
5          Asian people
The same procedure was applied to photo density. A
total of 752,176 georeferenced photos created by 19,594
users in California were collected from Flickr. Similarly,
only photos contributed by local residents were retained
for further analysis, resulting in 440,026 georeferenced
photos created by 7216 local users. Five components
constructed by PLSR capture 47.34% of the variation in
photo density and 81.49% of the variance in the original
independent variables (see Table 4). The entire loading
matrix for PLS components in the photo density model is
provided in Appendix 2. The explanatory power of this
model is not as high as the tweet density model for several
reasons. Although the total number of photos is about the
same as that of tweets, the number of unique photo con-
tributors (7216) is smaller than that of tweet creators
(18,315); therefore, photos were contributed by a much
smaller number of users compared to the tweet dataset. In
addition, the time at which a photo was taken may be
uncertain in Flickr metadata, leading to classification
errors when the time interval was used to infer whether a user
is a local resident or a tourist. The first component
explains 10.97% of the variance in the dependent variable
and 33.16% in the independent variables. The second
component captures only 7.33% of variation in the depen-
dent variable and 21.35% in the independent variables.
This contrast demonstrates the use of the covariance
between dependent and independent variables to construct
components in PLSR, rather than the use of only variance
of independent variables in PCA. The explanations of the
five components are listed in Table 5. The first component
is highly loaded on occupations of management, business,
science, and arts, bachelor’s degree, and graduate or pro-
fessional degree, and generally describes the percentage of
well-educated white people. The second component is
positively highly loaded on Asian people with bachelor’s
degree in the occupation of management, business,
science, and arts and is interpreted as well-educated
Asian people. The third component accounts for 9.52%
of variance in the dependent variable and 17.29% of
variance in the independent variables. It has positive high
loadings on white people, high school graduate, General
Educational Development (GED), or alternative, and
service occupations, which represents moderately edu-
cated white people. The last two components explain
10.87% and 8.67% of the dependent variable, but their
explanation powers are very low for independent vari-
ables, so they are not regarded as significant in the
model. Similar to the model of tweet density, gender
does not seem to make a difference in the interpretation
of photo density.
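The difference between how PLSR and PCA build a component can be illustrated with a minimal sketch. It assumes a single dependent variable and centred data, in which case the first PLS weight vector is proportional to X^T y, while the first PCA direction is the leading eigenvector of X^T X. The toy data and the function names (first_pls_weight, first_pca_direction) are hypothetical and are not taken from the study.

```python
import math

# Toy, already-centred data (hypothetical, not from the study).
# Column 0 has large variance but is unrelated to y;
# column 1 has small variance and is identical to y.
X = [[-4.0, -1.0],
     [4.0, -1.0],
     [-4.0, 1.0],
     [4.0, 1.0]]
y = [-1.0, -1.0, 1.0, 1.0]


def normalise(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]


def first_pls_weight(X, y):
    """First PLS weight for one response: the unit vector along X^T y,
    i.e. the direction whose scores have maximal covariance with y."""
    p = len(X[0])
    return normalise([sum(row[j] * yi for row, yi in zip(X, y)) for j in range(p)])


def first_pca_direction(X, iterations=50):
    """Leading eigenvector of X^T X by power iteration; y is never used."""
    p = len(X[0])
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(p)] for i in range(p)]
    v = normalise([1.0] * p)
    for _ in range(iterations):
        v = normalise([sum(xtx[i][j] * v[j] for j in range(p)) for i in range(p)])
    return v


pls_w = first_pls_weight(X, y)
pca_v = first_pca_direction(X)
print("First PLS weight:   ", pls_w)
print("First PCA direction:", pca_v)
```

On this toy data the PLS weight points entirely along the second variable, the one related to y, whereas the PCA direction points along the first variable, which merely has the largest variance.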
A straightforward interpretation of the models is that they describe the relationship between tweet and photo densities and the demographic and socioeconomic characteristics of people in these places. Because the raw data were preprocessed to retain only tweets and photos generated by local residents, socioeconomic properties of the people who contribute these data, such as race, education, occupation, and income, may be inferred from this relationship. The distinction based on time intervals of tweets and photos indicates that 71.80% of georeferenced tweets are generated by local people, while only 58.80% of georeferenced photos are uploaded by local residents. Therefore, the tweet density model may be more accurate for inferring the properties of Twitter users from their spatial footprints. Although at a coarse scale, these two models provide a big picture of the properties of local people who contribute georeferenced tweets and photos, and offer an exploratory analysis of the representativeness of a subset of Twitter and Flickr users.
Table 4. Percentage of variances explained by components in the Flickr model.

                                              Component
                                              1       2       3       4       5
Explained variance in independent variables   0.3316  0.2135  0.1729  0.0640  0.0359
Explained variance in dependent variable      0.1097  0.0733  0.0952  0.1087  0.0867

Note: Independent variables are percentages of people falling into different subcategories of age, race, educational attainment, occupation, and household income, respectively, obtained from ACS (2006–2010), and the dependent variable is photo density.

Table 5. Description of the PLS components in the Flickr model.

Component  Description
1          Well-educated white people
2          Well-educated Asian people
3          Moderately educated white people
4          Less-educated people
5          Other race people

Conclusion
The growing popularity of social networking and social media services has attracted researchers from various disciplines, and this new form of geographic data has been used in a variety of applications. However, many questions must still be answered in order to use these data more appropriately. For example, who uses these services? Why do people use them? How can we take advantage of this new source of information that may be potentially used for any possible topic but with uncertainty and bias? Understanding the spatial and temporal distribution of georeferenced data would provide insight into these questions. This article visualizes the spatial and temporal patterns of georeferenced tweets and Flickr photos collected within the contiguous United States. The tweets collected within only a few weeks delineate the administrative boundaries of the United States and the major roads at a very good resolution, especially in areas with high population density. Flickr photos have similar spatial patterns, although the total number of photos taken during the same period of time is substantially smaller than that of tweets. However, some places have considerably higher normalized photo density than tweet density – a characteristic of tourist attractions, such as Yosemite National Park. The temporal patterns of tweets are relatively consistent each day of the week, with two major peaks around 13:00–14:00 and 20:00–21:00 hours, but there are substantially more photos taken over weekends.
Two descriptive models using PLSR were constructed to explain the variation of tweet and photo densities from place to place in California, using demographic and socioeconomic variables of people in each county. According to the first model, tweet density is highly dependent on the percentage of well-educated people with an advanced degree and a good salary who work in the areas of management, business, science, and arts. The second model suggests that high photo density is correlated with a high percentage of white and Asian people with an advanced degree in the areas of management, business, science, and arts. This study would be informative to sociologists who study the behaviors of social media users, geographers who are interested in the spatial and temporal distribution of social media users, marketing agencies who intend to understand the influence of social media, as well as other scientists who use social media data in their research.
This research provides an exploratory analysis of the characteristics of the contributors of georeferenced data, so that users of these data can be aware of how well such specific groups of people represent the total population. Two major sources of bias may be reduced in future work: the bias caused by people’s movement and the bias due to ecological correlation. Finally, further research from the perspectives of psychology and sociology is required to explain why people with particular social and demographic properties are more involved in creating georeferenced tweets and photos.
Note
1. http://code.flickr.com/blog/2009/02/04/100000000-geo-
tagged-photos-plus/
Acknowledgments
The research was supported by the U.S. National Science Foundation, award 0849910, and by the U.S. Army Research Office, award W911NF-09-1-0302.
References
Alampay, E. 2006. “Analysing Socio-Demographic Differences
in the Access and Use of ICTs in the Philippines Using the
Capability Approach.” The Electronic Journal of
Information Systems in Developing Countries 27 (5): 1–39.
Ames, M., and M. Naaman. 2007. “Why We Tag: Motivations
for Annotation in Mobile and Online Media.” In
Proceedings of the SIGCHI Conference on Human Factors
in Computing Systems, April 28–May 3, San Jose, CA,
971–980. New York: ACM.
Antoniou, B., J. Morley, and M. Haklay. 2010. “Web 2.0
Geotagged Photos: Assessing the Spatial Dimension of the
Phenomenon.” Geomatica 64 (1): 99–110.
Bailey, T. C., and A. C. Gatrell. 1995. Interactive Spatial Data
Analysis. London: Longman.
Bollen, J., H. Mao, and X. Zeng. 2011. “Twitter Mood Predicts
the Stock Market.” Journal of Computational Science 2 (1):
1–8.
Cheng, Z., J. Caverlee, K. Lee, and D. Z. Sui. 2011. “Exploring
Millions of Footprints in Location Sharing Services.” In
Proceedings of the Fifth International AAAI Conference on
Weblogs and Social Media (ICWSM), July 2011, Barcelona,
81–88. Palo Alto, CA: AAAI press.
Crandall, D. J., L. Backstrom, D. Cosley, S. Suri, D.
Huttenlocher, and J. Kleinberg. 2010. “Inferring Social Ties
from Geographic Coincidences.” Proceedings of the
National Academy of Sciences 107 (52): 22436–22441.
Crandall, D. J., L. Backstrom, D. Huttenlocher,and J. Kleinberg.
2009. “Mapping the World’s Photos.” In Proceedings of the
18th International Conference on World wide web, April 20–
24, Madrid. New York: ACM.
Garthwaite, P. H. 1994. “An Interpretation of Partial Least
Squares.” Journal of the American Statistical Association
89: 122–127.
Goodchild, M. F., and J. A. Glennon. 2010. “Crowdsourcing
Geographic Information for Disaster Response: A Research
Frontier.” International Journal of Digital Earth 3 (3):
231–241.
Haklay, M. 2010. “How Good Is Volunteered Geographical
Information? A Comparative Study of OpenStreetMap and
Ordnance Survey Datasets.” Environment and Planning B,
Planning Design 37 (4): 682–703.
Hollenstein, L., and R. Purves. 2010. “Exploring Place Through
User-Generated Content: Using Flickr to Describe City
Cores.” Journal of Spatial Information Science 1 (1):
21–48.
Huberman, B., D. Romero, and F. Wu. 2008. Social Networks
that Matter: Twitter under the Microscope. Accessed March
6, 2013. http://ssrn.com/abstract=1313405.
Java, A., X. Song, T. Finin, and B. Tseng. 2007. “Why We
Twitter: Understanding Microblogging Usage and
Communities.” In Proceedings of the 9th WebKDD and 1st
SNA-KDD Workshop on Web Mining and Social Network
Analysis, August 12, San Jose, CA, 56–65. New York:
ACM.
Jolliffe, I. T. 1982. “A Note on the Use of Principal Components
in Regression.” Applied Statistics 31: 300–303.
74 L. Li et al.
Downloadedby[BallStateUniversity]at11:2423April2013
16. King, G. 1997. A Solution to the Ecological Inference Problem:
Reconstructing Individual Behavior from Aggregate Data.
Princeton, NJ: Princeton University Press.
Lee, R., and K. Sumiya. 2010. “Measuring Geographical
Regularities of Crowd Behaviors for Twitter-Based Geo-
Social Event Detection.” In Proceedings of the 2nd
ACMSIGSPATIAL International Workshop on Location
Based Social Networks (LBSN2010), 1–10. New York:
ACM.
Lerman, K., and R. Ghosh. 2010. “Information Contagion: An
Empirical Study of the Spread of News on Digg and Twitter
Social Networks.” In Proceedings of 4th International
Conference on Weblogs and Social Media (ICWSM),
Washington, DC, May 23–26, Menlo Park, CA: AAAI Press.
Li, L., and M. F. Goodchild. 2012. “Constructing Places from
Spatial Footprints.” In Proceedings of the 1st ACM
SIGSPATIAL International Workshop on Crowdsourced
and Volunteered Geographic Information, edited by M. F.
Goodchild, D. Pfoser, and D. Sui, November 6, Redondo
Beach, CA. New York: ACM.
Li, L., and M. F. Goodchild. 2013. “Spatio-Temporal Footprints
in Social Networks.” Encyclopedia of Social Networks and
Mining, edited by R. S. Alhajj, and J. G. Rokne, Springer.
Nielsen, J. 2006. “Participation Inequality: Encouraging More
Users to Contribute.” Jakob Nielsen’s Alertbox 9: 2006.
Openshaw, S. 1984. “Ecological Fallacies and the Analysis of
Areal Census Data.” Environment and Planning A 16:
17–31.
Piantadosi, S., D. P. Byar, and S. B. Green. 1988. “The
Ecological Fallacy.” American Journal of Epidemiology
127: 893–904.
Purves, R., A. Edwardes, and J. Wood. 2011. “Describing Place
through User Generated Content.” First Monday 16: 9–5.
Robinson, W. S. 1950. “Ecological Correlations and the
Behavior of Individuals.” American Sociological Review 15
(3): 351–357.
Sakaki, T., M. Okazaki, and Y. Matsuo. 2010. “Earthquake
Shakes Twitter Users: Real-Time Event Detection by Social
Sensors.” In Proceedings of the 19th International
Conference on World wide web, April 2010, Raleigh, NC,
851–860. New York: ACM.
Silverman, B. W. 1986. Density Estimation for Statistics and
Data Analysis. London: Chapman and Hall.
Soule, L. C., L. W. Shell, and B. A. Kleen. 2003. “Exploring
Internet Addiction: Demographic Characteristics and
Stereotypes of Heavy Internet Users.” Journal of Computer
Information Systems 44 (1): 64–73.
Taylor, W. J., G. X. Zhu, J. Dekkers, and S. Marshall. 2003.
“Socio-Economic Factors Affecting Home Internet Usage
Patterns in Central Queensland.” Informing Science 6:
233–246.
Tumasjan, A., T. O. Sprenger, P. G. Sandner, and I. M. Welpe.
2010. “Predicting Elections with Twitter: What 140
Characters Reveal about Political Sentiment.” Fourth
International AAAI Conference on Weblogs and Social
Media, May 23–26, Washington, DC.
Wold, H. 1966. “Estimation of Principal Components and
Related Models by Iterative Least Squares.” In Multivariate
Analysis, edited by P. R. Krishnaiaah, 391–420. New York:
Academic Press.
Wold, S., M. Sjöström, and L. Eriksson. 2001. “PLS-Regression:
A Basic Tool of Chemometrics.” Chemometrics and
Intelligent Laboratory Systems 58: 109–130.
Zandbergen, P. A. 2009. “Accuracy of iPhone Locations: A
Comparison of Assisted GPS, WiFI and Cellular
Positioning.” Transactions in GIS 13 (s1): 5–25.
Cartography and Geographic Information Science 75
Downloadedby[BallStateUniversity]at11:2423April2013
Appendix 1. Loading matrix for PLS components in the tweet density model.
Component
Explanatory variables 1 2 3 4 5
Under 5 years −0.01474 0.077405 −0.04545 −0.01982 −0.01143
5–9 years −0.02025 0.057787 −0.04379 −0.01927 −0.01
10–14 years −0.03172 0.058438 −0.02302 −0.02861 −0.01334
15–17 years −0.02153 0.022426 −0.02407 −0.01397 −0.00691
18 and 19 years 0.00196 0.016601 −0.01363 −0.00271 −0.00794
20 years 0.000168 0.006627 −0.00405 0.002784 −0.00437
21 years 0.005339 0.013494 −0.00895 −0.00116 −0.00942
22–24 years 0.008545 0.042466 0.018536 0.001079 −0.01888
25–29 years 0.022469 0.091848 0.036858 0.005309 0.002426
30–34 years 0.026894 0.080128 0.02624 0.005096 −0.00324
35–39 years 0.0304 0.042774 0.018096 0.00483 0.00986
40–44 years 0.039194 0.004091 −0.02617 −0.00161 −0.00537
45–49 years 0.017316 −0.03679 −0.00857 0.002174 0.001503
50–54 years 0.00373 −0.0826 −0.00635 0.011897 0.008826
55–59 years −0.01004 −0.10516 0.011584 0.019274 0.027798
60 and 61 years −0.00336 −0.03628 0.013421 0.002809 0.002329
62–64 years −0.01243 −0.05618 0.011172 0.010629 0.009505
65 and 66 years −0.00818 −0.03081 0.005977 0.003103 0.008006
67–69 years −0.01376 −0.04628 0.009472 0.004551 0.003888
70–74 years −0.01464 −0.04668 0.014862 0.005115 0.006863
75–79 years −0.00534 −0.03874 0.011525 0.009119 0.010758
80–84 years −0.00167 −0.01497 0.012574 −0.0014 −0.00086
85 years and over 0.001647 −0.01959 0.013749 0.00079 7.76E-06
White alone −0.04727 0.041288 −0.10895 0.855233 −0.33653
Black or African American alone 0.019714 0.001025 0.043088 −0.15585 0.061714
American Indian and Alaska Native alone 0.039311 −0.03672 −0.049 0.040688 0.018066
Asian alone 0.02731 0.032401 −0.03394 −0.3589 0.364497
Native Hawaiian and Other Pacific Islander alone 0.003158 −1.33E-05 0.000767 −0.0057 0.00913
Some other race alone −0.04648 −0.03246 0.129032 −0.3555 −0.13824
Two or more races: 0.004259 −0.00552 0.019004 −0.01998 0.021366
Less than 9th grade −0.08597 0.331632 −0.10665 −0.02253 −0.00902
9th–12th grade, no diploma −0.16101 0.136008 0.007883 0.003644 −0.01523
High school graduate, GED, or alternative −0.302 −0.11136 0.021482 0.009998 0.058959
Some college, no degree −0.13348 −0.18998 0.02426 −0.03015 −0.04115
Associate’s degree −0.01529 −0.05624 0.002854 −0.01094 −0.00396
Bachelor’s degree 0.401932 −0.0772 0.042405 0.024567 0.009941
Graduate or professional degree 0.29582 −0.03287 0.007762 0.025411 0.000464
Less than $10,000 −0.05965 0.007578 0.041237 0.033263 0.014456
$10,000–$14,999 −0.11884 −0.0168 0.083771 0.034953 0.007335
$15,000–$19,999 −0.0885 −0.00257 0.045553 0.019674 −0.00656
$20,000–$24,999 −0.08984 0.020739 0.051756 0.005561 −0.03398
$25,000–$29,999 −0.06982 −0.00308 −0.00224 0.006647 0.015886
$30,000–$34,999 −0.05892 −0.01624 0.004458 −0.00272 −0.01196
$35,000–$39,999 −0.06398 −0.00129 0.025394 −0.00427 0.006125
$40,000–$44,999 −0.03699 −0.02119 0.001722 −0.00392 0.000198
$45,000–$49,999 −0.03612 −0.01511 −0.01185 0.011192 0.017964
$50,000–$59,999 −0.04508 0.019009 −0.0025 −0.01275 −0.00141
$60,000–$74,999 −0.01536 −0.03518 −0.03475 −0.0045 0.019347
$75,000–$99,999 0.047511 −0.00607 −0.04302 −0.01023 0.023652
$100,000–$124,999 0.112712 0.001461 −0.05712 −0.02623 −0.01345
$125,000–$149,999 0.10356 0.012866 −0.03643 −0.02051 −0.00383
$150,000–$199,999 0.174187 0.013374 −0.04549 −0.02102 −0.0099
$200,000 or more 0.245107 0.042515 −0.02049 −0.00515 −0.02389
Management, business, science, and arts occupations: 0.49972 −0.17736 0.009559 0.054147 −0.02251
Service occupations: −0.14328 −0.05453 0.092773 0.047814 −0.03232
Sales and office occupations: 0.021245 0.022759 0.069738 −0.03638 −0.02886
Natural resources, construction, and maintenance occupations: −0.25131 0.09103 −0.11676 −0.03511 0.029064
Production, transportation, and material moving occupations: −0.12638 0.118096 −0.05531 −0.03048 0.054624
Appendix 2. Loading matrix for PLS components in the photo density model.
Component
Explanatory variables 1 2 3 4 5
Under 5 years −0.03985 −0.06968 −0.04157 0.035035 0.009508
5–9 years −0.03863 −0.06033 −0.03379 0.018745 0.005705
10–14 years −0.04948 −0.04034 −0.00729 0.030802 −0.01749
15–17 years −0.02739 −0.0298 −0.0105 0.001409 0.003521
18 and 19 years −0.00368 −0.02068 −0.01886 −0.00211 −0.01066
20 years 0.001142 −0.00986 −0.00504 −0.00208 −0.00873
21 years 0.001123 −0.01489 −0.01485 0.000999 −0.00771
22–24 years 0.001153 −0.01669 −0.00644 0.024757 −0.02822
25–29 years −0.00304 −0.02076 −0.00858 0.078787 −0.03802
30–34 years 0.00358 −0.02021 −0.01408 0.070264 −0.02725
35–39 years 0.010636 0.000369 −0.01345 0.047709 −0.01351
40–44 years 0.022184 −0.00871 −0.04678 0.000559 0.014569
45–49 years 0.020005 0.019069 −0.00946 −0.02387 0.013253
50–54 years 0.027076 0.038446 0.014259 −0.05507 0.028873
55–59 years 0.024689 0.058246 0.040607 −0.06575 0.022882
60 and 61 years 0.008444 0.027431 0.020304 −0.01986 −0.00012
62–64 years 0.011037 0.032313 0.032863 −0.03226 0.014433
65 and 66 years 0.002372 0.017243 0.015113 −0.01997 0.009119
67–69 years 0.004825 0.025543 0.026478 −0.02984 0.011136
70–74 years 0.004335 0.029439 0.030894 −0.02749 0.012094
75–79 years 0.009549 0.023947 0.020369 −0.02343 0.006073
80–84 years 0.002783 0.018798 0.015552 −0.00183 −2.78E-05
85 years and over 0.007138 0.02111 0.01425 −0.00551 5.71E-04
White alone 0.64126 −0.54537 0.384937 −0.04463 −0.05466
Black or African American alone −0.11371 0.098182 −0.08834 −0.0114 −0.04436
American Indian and Alaska Native alone 0.051597 −0.02843 −0.05185 −0.03109 0.121559
Asian alone −0.33726 0.269738 −0.17645 0.099653 −0.19034
Native Hawaiian and Other Pacific Islander alone −0.00434 7.60E-03 −0.00337 0.003228 −0.00463
Some other race alone −0.22297 0.165412 −0.06804 −0.02706 0.172535
Two or more races: −0.01458 0.032869 0.003109 0.011299 −0.0001
Less than 9th grade −0.14529 −0.25007 −0.05677 0.209129 −0.00659
9th–12th grade, no diploma −0.12839 −0.14642 0.06076 0.02307 −0.03606
High school graduate, GED, or alternative −0.17274 −0.03592 0.212701 −0.14922 0.089943
Some college, no degree −0.05122 0.058598 0.100471 −0.2111 −0.0332
Associate’s degree −0.00306 0.032506 0.01773 −0.05174 −0.03535
Bachelor’s degree 0.288224 0.213345 −0.18656 0.095178 0.006815
Graduate or professional degree 0.212482 0.127951 −0.14833 0.084679 0.014442
Less than $10,000 −0.02021 −0.02446 0.057107 −0.01381 0.001242
$10,000–$14,999 −0.04735 −0.01043 0.123375 −0.03851 −0.00763
$15,000–$19,999 −0.0397 −0.02443 0.07639 −0.03061 −0.00025
$20,000–$24,999 −0.04602 −0.02101 0.090227 0.001759 −0.01519
$25,000–$29,999 −0.04342 −0.0292 0.038342 −0.02533 0.014401
$30,000–$34,999 −0.0333 −0.0215 0.028601 −0.04265 0.013107
$35,000–$39,999 −0.0432 −0.00371 0.053541 −0.00873 0.003684
$40,000–$44,999 −0.02043 0.00364 0.030614 −0.01959 0.008193
$45,000–$49,999 −0.01826 −0.02243 0.008614 −0.03616 0.011423
$50,000–$59,999 −0.04024 −0.01621 0.026354 0.008339 −0.00548
$60,000–$74,999 −0.01251 −0.00814 −0.01824 −0.0494 0.013102
$75,000–$99,999 0.017281 0.015004 −0.0481 0.006423 −0.03166
$100,000–$124,999 0.056167 0.027812 −0.09461 0.03855 0.021724
$125,000–$149,999 0.048939 0.026345 −0.0844 0.03963 −0.00837
$150,000–$199,999 0.094066 0.044338 −0.13391 0.058537 0.004909
$200,000 or more 0.148192 0.064382 −0.15391 0.111557 −0.02321
Management, business, science, and arts occupations: 0.393605 0.217233 −0.29509 −0.0328 0.032189
Service occupations: −0.03144 0.015353 0.190332 −0.04594 −0.02918
Sales and office occupations: −0.00541 0.056445 0.020075 0.03938 −0.06822
Natural resources, construction, and maintenance occupations: −0.21088 −0.16301 0.090635 0.009872 0.037683
Production, transportation, and material moving occupations: −0.14587 −0.12602 −0.00595 0.029492 0.027525
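The component descriptions in Table 5 can be related to this matrix by looking for the largest positive loadings in each column. The following is a minimal sketch of that reading, using only seven rows copied from the table above; the function name top_positive and the choice of three variables per component are assumptions for illustration, not part of the original analysis.

```python
# Seven rows copied from the Appendix 2 loading matrix (photo density model).
# Each variable maps to its loadings on components 1 to 5.
loadings = {
    "White alone": [0.64126, -0.54537, 0.384937, -0.04463, -0.05466],
    "Asian alone": [-0.33726, 0.269738, -0.17645, 0.099653, -0.19034],
    "High school graduate, GED, or alternative": [-0.17274, -0.03592, 0.212701, -0.14922, 0.089943],
    "Bachelor's degree": [0.288224, 0.213345, -0.18656, 0.095178, 0.006815],
    "Graduate or professional degree": [0.212482, 0.127951, -0.14833, 0.084679, 0.014442],
    "Management, business, science, and arts occupations": [0.393605, 0.217233, -0.29509, -0.0328, 0.032189],
    "Service occupations": [-0.03144, 0.015353, 0.190332, -0.04594, -0.02918],
}


def top_positive(component, k=3):
    """Return up to k variables with the largest positive loading
    on the given component (1-based), among the rows listed above."""
    idx = component - 1
    ranked = sorted(loadings.items(), key=lambda item: item[1][idx], reverse=True)
    return [name for name, values in ranked[:k] if values[idx] > 0]


for component in (1, 2, 3):
    print(component, top_positive(component))
```

Within this subset, the variables ranked highest for components 1 to 3 line up with the "well-educated white", "well-educated Asian", and "moderately educated white" descriptions given in Table 5.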