Twitter Dataset Analysis and Geocoding
James Nelson
October 20, 2015
1 Introduction
1.1 Project overview
This project was performed as part of the required curriculum for the class Management,
Access, and Use of Big and Complex Data (FA15-BL-INFO-I590-34717), Data Science Online
Master’s Program, Indiana University School of Informatics and Computing (IU-SoIC) (1). The
aim of the project was to validate user-defined location data in a Twitter dataset of 10,000
tweets (2) using MongoDB and the Google Maps Geocoding API (3). Subsequently, a subset of
the validated data was visualized by plotting the validated tweet origination locations using
Google Maps and associated Google Maps APIs (4). The dataset was a subset of a public
Twitter dataset of 3 million user profiles collected during May 2011, created by Li et al. at the
University of Illinois (2). To complete the project, access to MongoDB on the IU-SoIC server
was provided, along with a software bundle containing the scripts and code needed to reformat,
import, query, update, and visualize the dataset.
1.2 Learning objectives
A. Learn through hands-on experience how to handle data and take it through typical big data
processing steps: storage, cleansing, querying, and visualization using a Linux command
line interface.
B. Set up VirtualBox and download a prepared virtual machine onto a local computer.
C. Build a software bundle that has a set of tools, in the form of scripts, on the virtual machine.
D. Import, query and modify Twitter data in the NoSQL MongoDB database environment.
E. Validate, geocode and visualize tweet origination data using Google Maps APIs.
2 Methods
2.1 System overview
This project utilized a virtual machine (VM) environment from the IU-SoIC server containing the
Ubuntu operating system (Linux), MongoDB and all necessary software. The latest version of
the VirtualBox platform was installed locally to host the VM image (5). To initialize the VM image
from the IU-SoIC server, the file I590FALL2015.ova was downloaded from the IU Box
account and imported into the local VirtualBox console.
2.2 Software tools
To build the project Java code package, a tarball (I590-TwitterDataSet.tar.gz) was
downloaded from the class website and then extracted within the project base directory
./I590-TwitterDataSet. Shown below is the directory tree structure for the project.
I590-TwitterDataSet
├── bin (contains scripts (executables); generated after code deployment)
├── build (build directory, generated at code compile time)
│   ├── classes (.class files generated by the Java compiler)
│   │   ├── google
│   │   ├── mongodb
│   │   └── util
│   └── lib (contains the core jar file for the scripts in bin)
├── config (contains a configuration file: config.properties)
├── data (empty directory; put your data here)
├── input (contains a query criteria file, query.json, needed for finding
│   and updating the documents in MongoDB)
├── lib (third-party dependency library jars)
├── log (empty directory; put your log files here)
├── src (source code)
│   ├── google
│   ├── mongodb
│   └── util
└── templates (template files and the deploy script. The deploy script generates
    platform-dependent scripts and outputs them to bin during code deployment)

Before building and deploying the code, the project base and executable file directories were
manually set in the configuration file build.properties (see the project.base.dir and
java.home entries below).

# $Id: build.properties
# @author: Yuan Luo
# Configuration properties for building I590-TwitterProjectCode
project.base.dir=/home/mongodb/Projects/I590-TwitterProjectCode
java.home=/usr/bin

The project code was then compiled and deployed using the command ant. The tree
structure and description of the source code is given below.

src
├── google
│   └── GeoCodingClient.java (returns geocoding results from Google)
├── mongodb
│   ├── Config.java (extracts parameters from the configuration file)
│   └── MongoDBOperations.java (selects documents that satisfy a given query
│       criteria and updates them by adding the geocode)
└── util (utility classes)
    ├── Base64.java (encoder/decoder)
    ├── PropertyReader.java (helper class for reading .properties files)
    └── UrlSigner.java (OAuth 2.0 helper class)
2.3 Data reformatting and importation into MongoDB
The Twitter dataset file users_10000.txt needed to be reformatted from ISO-8859-1 to the
UTF-8 format that MongoDB accepts. This was accomplished by running the following script,
which creates the reformatted dataset revised_users.txt:
$ ./bin/reformat.sh users_10000.txt revised_users.txt
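The contents of reformat.sh are not shown in the report; assuming it performs a
straightforward character set conversion, an equivalent single command would be:

$ iconv -f ISO-8859-1 -t UTF-8 users_10000.txt > revised_users.txt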
Next, the header line below was added to the top of the revised_users.txt file, with the
field names separated by tabs, to create a tab-separated (“tsv”) file (one way to do this is
sketched after the field list):
user_id user_name friend_count follower_count status_count
favorite_count account_age user_location
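The report does not record the exact command used to prepend the header; one portable way,
assuming the eight tab-separated field names above, is:

$ printf 'user_id\tuser_name\tfriend_count\tfollower_count\tstatus_count\tfavorite_count\taccount_age\tuser_location\n' > header.tsv
$ cat header.tsv revised_users.txt > tmp.tsv && mv tmp.tsv revised_users.txt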
The reformatted tab-separated dataset file revised_users.txt was then imported into
MongoDB using the script import_mangodb.sh. The MongoDB <db name> and
<collection name> are twitterdb and users, respectively. The <import file type>
is tsv. The command is:
$ ./bin/import_mangodb.sh twitterdb users tsv revised_users.txt
2.4 Data validation and geocoding
To perform the geocoding of the Twitter dataset (2) using the Google Geocoding API (3), the
user-defined string in the “user_location” dataset field needed to be verified. This was
done with the QueryAndUpdate.sh script, which invokes the Java code in
GeoCodingClient.java. Briefly, this code queries and updates each document
that has a valid “user_location” recognized by the Google Geocoding API by performing the
following functions:
1) reformats the location string by removing whitespace,
2) inserts the Geocoding URL: https://maps.googleapis.com/maps/api/geocode/json
3) inserts the extracted geocode as a new “geocode” field containing
"formatted_address" and "location" (the latitude and longitude), and
4) reports the geocoding status as "OK".
Documents without a valid “user_location” receive only "geocode": null, without any
reformatting or geocoding. The following Linux shell command requires a
<configuration file> and a JSON <query criteria file>, as well as the
database and collection to be used.
$ ./bin/QueryAndUpdate.sh ./config/config.properties twitterdb users
./input/query.json ./log/query.log
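The report does not reproduce an updated document, but based on the fields described above, a
successfully geocoded profile should look roughly like the following sketch (the user values
and the exact subdocument layout are hypothetical):

> db.users.findOne({"user_id": 12345})
{
    "user_id": 12345,
    "user_name": "example_user",
    "user_location": "Indianapolis,IN",
    "geocode": {
        "formatted_address": "Indianapolis, IN, USA",
        "location": { "lat": 39.768403, "lng": -86.158068 }
    }
}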
For simplicity, Google API access was obtained as an “Anonymous user” circumventing other
detailed authentication options (6). However, since geocoding queries are limited to 2,500 per
day for Anonymous users, four days were needed to perform geocoding of all 10,000 user
profiles in the Twitter dataset.
2.5 Manual updating
In the first geocoding run, a “500 error” message was returned. The solution to this error
was found on the class discussion board (7; Zong Peng 10/3/15): the word "break" was
replaced with "continue" at line 133 of the Java code file MongoDBOperations.java. The
project code was then recompiled and deployed using the command ant.
To track the progress of the geocoding, the following command was run in a new Linux
shell (7; Micheal Haley 10/3/15):
$ tail -f ./Projects/I590-TwitterProjectCode/log/twitterdblog.txt
Following the final run on the 4th day, the following message was returned:
In this run:
Total:1268 record(s) found.
Total:0 record(s) processed.
Total:0 record(s) updated.
To process the remaining records, "break" was restored at line 133 of
MongoDBOperations.java. Subsequently, the following two documents were manually
updated to remove the “user_location” string (8, 9):

{"user_id" : 117246212, ..., "user_location" : "The DMV (and no, not that DMV)"}
{"user_id" : 122836991, ..., "user_location" : "Inyomailboxbiatch...huh
3 Results and Discussion
3.1 Querying MongoDB
To determine the success of the geocoding, the following four queries were used within the
mongo shell (7; Lawson/Eicher 10/2/2015; 10-12):
(Queries A through D appeared as mongo shell screenshots in the original report.)
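Given the geocode conventions from Section 2.4, the four queries were presumably of the
following form (the exact query syntax is an assumption; the counts are those reported below):

A) > db.users.count()
   10000
B) > db.users.count({"geocode": {$exists: true, $ne: null}})
   6346
C) > db.users.count({"geocode": null})
   3654
D) > db.users.count({"user_location": ""})
   1392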
Thus, all 10,000 Twitter user profiles in the dataset were processed by this approach (A).
A total of 6,346 profiles were successfully geocoded (B), while 3,654 did not contain a valid
location recognized by the Google Geocoding API (C). Of these 3,654, a total of 1,392 profiles
did not have any user-defined value in the "user_location" field (D). Presumably the
remaining 2,262 profiles contained nonsensical values in the "user_location" field.
3.2 Strategies to improve geocoding and query performance
A portion of the remaining 2,262 profiles contained GPS coordinates written by either an
iPhone client (33 profiles) or the ÜberTwitter client (505 profiles) into the
“user_location” field (12, 13). To determine the exact number of these profiles, the
following query was performed (11, 14):
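The query screenshot is likewise missing; counting on the client-specific prefixes
("iPhone:" and "ÜT:", per reference 12) would take roughly this form (the exact prefixes
matched in the original are an assumption):

> db.users.count({"user_location": {$regex: /^iPhone:/}})
33
> db.users.count({"user_location": {$regex: /^ÜT:/}})
505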
It would be possible to manually curate these profiles so that they would be recognized by the
Google Geocoding API, but an algorithm could certainly be written to perform this function
(15).
There are several ways to improve the performance of the query. The most obvious is to
complete the Google authentication process, increasing the number of queries allowed per day
and thereby reducing the overall run time (6). Another way to increase query efficiency is to
create an index. A single-field text index on “user_location” would let queries skip
user profiles with missing values in this field, avoiding a full
collection scan (16-18). Documents with missing values in the field “user_location”
represent nearly 14% (1,392/10,000) of the “users” collection, as discussed above.
The command to create this index is:
> db.users.createIndex({"user_location":"text"})
Here the keyword text creates an index over the string content of the field (16). To search
the text index of the “users” collection, the $text and $search operators are used as
follows (19):
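The screenshot of this query is missing; based on the description that follows, it appears to
have searched the index for an empty quoted phrase, roughly (the exact search string is an
assumption):

> db.users.find({$text: {$search: '""'}}).count()
0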
The query returns no results, showing that the index contains no profiles lacking a text
string value in the “user_location” field (16-18).
3.3 Visualization
To visualize a subset of the reformatted geocoded profiles that reportedly originated in
Indiana, the following command was used (20-22):
$ mongoexport -d twitterdb -c users \
    -q '{"geocode": {$exists: true, $ne: null}, "geocode.formatted_address": {$regex: "USA"}, "geocode.formatted_address": {$regex: "IN"}}' \
    --csv --fields geocode.formatted_address,user_name -o twitterout
The next step is to reformat the output file into the visualization format using the command (20):
$ awk '{ printf("[ %s ],\n", $0); }' twitterout
To create a visualization HTML file of the Indiana tweets (68 total), the screen output list
was inserted into the sample HTML code on the webpage given in reference (23). The Google
Maps API then renders the data (3); the original report showed a screenshot of the resulting
map.
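The exact HTML file is not included in the report; a minimal sketch modeled on the Google
Charts map sample (23), with one illustrative data row where the awk output would be pasted,
might look like:

<html>
  <head>
    <script type="text/javascript" src="https://www.gstatic.com/charts/loader.js"></script>
    <script type="text/javascript">
      // Load the Google Charts "map" package, which plots addresses with Google Maps.
      google.charts.load('current', {'packages': ['map'], 'mapsApiKey': 'YOUR_API_KEY'});
      google.charts.setOnLoadCallback(drawMap);
      function drawMap() {
        var data = google.visualization.arrayToDataTable([
          ['Address', 'User'],
          // Rows produced by the awk step are pasted here; this one is illustrative.
          [ 'Indianapolis, IN 46204, USA', 'example_user' ],
        ]);
        var map = new google.visualization.Map(document.getElementById('map_div'));
        map.draw(data, {showTooltip: true, showInfoWindow: true});
      }
    </script>
  </head>
  <body>
    <div id="map_div" style="width: 800px; height: 600px"></div>
  </body>
</html>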
A copy of the validated data was dumped from the database for submission using the
mongodump tool (24):
$ mongodump -d twitterdb -c users
4 References
1) http://datamanagementcourse.soic.indiana.edu/
2) Rui Li, Shengjie Wang, Hongbo Deng, Rui Wang, Kevin Chen-Chuan Chang: Towards
social user profiling: unified and discriminative influence model for inferring home
locations. KDD 2012:1023-1031
3) https://developers.google.com/maps/documentation/geocoding/intro
4) https://developers.google.com/maps/
5) https://www.virtualbox.org/wiki/Downloads
6) https://developers.google.com/api-client-library/javascript/features/authentication
7) https://iu.instructure.com/courses/1491590/discussion_topics/6311828
8) http://docs.mongodb.org/manual/tutorial/modify-documents/
9) http://docs.mongodb.org/manual/faq/mongo/
10) http://jacobnibu.info/articles/Modeling%20Twitter%20Dataset.pdf
11) http://docs.mongodb.org/manual/tutorial/query-documents/
12) https://www.quora.com/What-is-%C3%9CT-19-137603-72-813111-in-Twitter
13) http://ubersocial.com/
14) http://docs.mongodb.org/manual/reference/operator/query/regex/
15) http://journals.uic.edu/ojs/index.php/fm/article/view/4366/3654
16) http://docs.mongodb.org/manual/core/index-text/
17) https://docs.mongodb.org/manual/core/crud-introduction/
18) http://docs.mongodb.org/manual/reference/operator/query/text/#op._S_text
19) http://docs.mongodb.org/manual/reference/operator/query/text/
20) README.txt file
21) https://docs.mongodb.org/manual/reference/program/mongoimport/
22) http://stackoverflow.com/questions/31514688/how-to-use-mongoimport-for-specific-fileds-from-tsv-file/31528255#31528255
23) https://developers.google.com/chart/interactive/docs/gallery/map#fullhtml
24) http://docs.mongodb.org/manual/reference/program/mongodump/