Twitter Dataset Analysis and Geocoding
James Nelson
October 20, 2015
1 Introduction
1.1 Project overview
This project was performed as part of the required curriculum for the class Management,
Access, and Use of Big and Complex Data (FA15-BL-INFO-I590-34717), Data Science Online
Master’s Program, Indiana University School of Informatics and Computing (IU-SoIC) (1). The
aim of the project was to validate user-defined location data in a Twitter dataset of 10,000
tweets (2) using MongoDB and the Google Maps Geocoding API (3). Subsequently, a subset of
the validated data was visualized by plotting the validated tweet origination locations using
Google Maps and associated Google Maps APIs (4). The dataset was a subset of a public
Twitter dataset of 3 million user profiles collected during May 2011, created by Li et al. at the
University of Illinois (2). To complete the project, access to MongoDB on the IU-SoIC server
was provided, along with a software bundle containing the scripts and code needed to reformat,
import, query, update, and visualize the dataset.
1.2 Learning objectives
A. Learn through hands-on experience how to handle data and take it through typical big data
processing steps: storage, cleansing, querying, and visualization using a Linux command
line interface.
B. Set up VirtualBox and download a prepared virtual machine onto a local computer.
C. Build a software bundle that has a set of tools, in the form of scripts, on the virtual machine.
D. Import, query and modify Twitter data in the NoSQL MongoDB database environment.
E. Validate, geocode and visualize tweet origination data using Google Maps APIs.
2 Methods
2.1 System overview
This project utilized a virtual machine (VM) environment from the IU-SoIC server containing the
Ubuntu operating system (Linux), MongoDB and all necessary software. The latest version of
the VirtualBox platform was installed locally to host the VM image (5). To initialize the VM image
from the IU-SoIC server, the file I590FALL2015.ova was downloaded from the IU Box
account and imported into the local VirtualBox console.
2.2 Software tools
To build the project Java code package, a tarball (I590-TwitterDataSet.tar.gz) was
downloaded from the class website and then extracted within the project base directory
./I590-TwitterDataSet. Shown below is the directory tree structure for the project.
I590-TwitterDataSet
├── bin (contains scripts (executables); generated after code deployment)
├── build (build directory, generated at code compile time)
│   ├── classes (.class files generated by the Java compiler)
│   │   ├── google
│   │   ├── mongodb
│   │   └── util
│   └── lib (contains the core jar file for the scripts in bin)
├── config (contains a configuration file: config.properties)
├── data (empty directory; put your data here)
├── input (contains a query criteria file, query.json, needed for finding
│   and updating the documents in MongoDB)
├── lib (third-party dependency library jars)
├── log (empty directory; put your log files here)
├── src (source code)
│   ├── google
│   ├── mongodb
│   └── util
└── templates (template files and the deploy script. The deploy script generates
    platform-dependent scripts and outputs them to bin during code deployment)

Before building and deploying the code, the project base and executable file directories were
manually set in the configuration file build.properties (see the project.base.dir and
java.home entries below).

# $Id: build.properties
# @author: Yuan Luo
# Configuration properties for building I590-TwitterProjectCode
project.base.dir=/home/mongodb/Projects/I590-TwitterProjectCode
java.home=/usr/bin

The project code was then compiled and deployed using the command ant. The tree
structure and description of the source code is given below.

src
├── google
│   └── GeoCodingClient.java (returns geocoding results from Google)
├── mongodb
│   ├── Config.java (extracts parameters from the configuration file)
│   └── MongoDBOperations.java (selects documents that satisfy a given query
│       criteria and updates them by adding the geocode)
└── util (utility classes)
    ├── Base64.java (encoder/decoder)
    ├── PropertyReader.java (helper class for reading .properties files)
    └── UrlSigner.java (OAuth 2.0 helper class)
2.3 Data reformatting and importation into MongoDB
The Twitter dataset file users_10000.txt needed to be reformatted from ISO-8859-1 to the
UTF-8 format that MongoDB accepts. This was accomplished by running the following script,
which creates the reformatted dataset revised_users.txt:
$ ./bin/reformat.sh users_10000.txt revised_users.txt
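The contents of reformat.sh are not shown in the report; assuming it performs a
straightforward character set conversion, an equivalent single command would be:

$ iconv -f ISO-8859-1 -t UTF-8 users_10000.txt > revised_users.txt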
Next, the header line below was added to the top of the revised_users.txt file, with the
field names separated by tabs, to create a tab-separated (“tsv”) file (one way to do this is
sketched after the field list):
user_id user_name friend_count follower_count status_count
favorite_count account_age user_location
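The report does not record the exact command used to prepend the header; one portable way,
assuming the eight tab-separated field names above, is:

$ printf 'user_id\tuser_name\tfriend_count\tfollower_count\tstatus_count\tfavorite_count\taccount_age\tuser_location\n' > header.tsv
$ cat header.tsv revised_users.txt > tmp.tsv && mv tmp.tsv revised_users.txt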
The reformatted tab-separated dataset file revised_users.txt was then imported into
MongoDB using the script import_mangodb.sh. The MongoDB <db name> and
<collection name> are twitterdb and users, respectively. The <import file type>
is tsv. The command is:
$ ./bin/import_mangodb.sh twitterdb users tsv revised_users.txt
2.4 Data validation and geocoding
To perform the geocoding of the Twitter dataset (2) using the Google Geocoding API (3), the
user-defined string in the “user_location” dataset field needed to be verified. This was
done with the QueryAndUpdate.sh script, which invokes the Java code in
GeoCodingClient.java. Briefly, this code queries and updates each document
that has a valid “user_location” recognized by the Google Geocoding API by performing the
following functions:
1) reformats the location string by removing whitespace,
2) inserts the Geocoding URL: https://maps.googleapis.com/maps/api/geocode/json
3) inserts the extracted geocode as a new “geocode” field containing
"formatted_address" and "location" (the latitude and longitude), and
4) reports the geocoding status as "OK".
Documents without a valid “user_location” receive only "geocode": null, without any
reformatting or geocoding. The following Linux shell command requires a
<configuration file> and a JSON <query criteria file>, as well as the
database and collection to be used.
$ ./bin/QueryAndUpdate.sh ./config/config.properties twitterdb users
./input/query.json ./log/query.log
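The report does not reproduce an updated document, but based on the fields described above, a
successfully geocoded profile should look roughly like the following sketch (the user values
and the exact subdocument layout are hypothetical):

> db.users.findOne({"user_id": 12345})
{
    "user_id": 12345,
    "user_name": "example_user",
    "user_location": "Indianapolis,IN",
    "geocode": {
        "formatted_address": "Indianapolis, IN, USA",
        "location": { "lat": 39.768403, "lng": -86.158068 }
    }
}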
For simplicity, Google API access was obtained as an “Anonymous user” circumventing other
detailed authentication options (6). However, since geocoding queries are limited to 2,500 per
day for Anonymous users, four days were needed to perform geocoding of all 10,000 user
profiles in the Twitter dataset.
2.5 Manual updating
In the first geocoding run, a “500 error” message was returned. The solution to this error
was found on the class discussion board (7; Zong Peng 10/3/15): the word "break" was
replaced with "continue" at line 133 of the Java code file MongoDBOperations.java. The
project code was then recompiled and deployed using the command ant.
To track the progress of the geocoding, the following command was run in a new Linux
shell (7; Micheal Haley 10/3/15):
$ tail -f ./Projects/I590-TwitterProjectCode/log/twitterdblog.txt
Following the final run on the 4th day, the following message was returned:
In this run:
Total:1268 record(s) found.
Total:0 record(s) processed.
Total:0 record(s) updated.
To process the remaining records, "break" was restored at line 133 of
MongoDBOperations.java. Subsequently, the following two documents were manually
updated to remove the “user_location” string (8, 9):

{"user_id" : 117246212, ..., "user_location" : "The DMV (and no, not that DMV)"}
{"user_id" : 122836991, ..., "user_location" : "Inyomailboxbiatch...huh
3 Results and Discussion
3.1 Querying MongoDB
To determine the success of the geocoding, the following four queries were used within the
mongo shell (7; Lawson/Eicher 10/2/2015; 10-12):
(Queries A through D appeared as mongo shell screenshots in the original report.)
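Given the geocode conventions from Section 2.4, the four queries were presumably of the
following form (the exact query syntax is an assumption; the counts are those reported below):

A) > db.users.count()
   10000
B) > db.users.count({"geocode": {$exists: true, $ne: null}})
   6346
C) > db.users.count({"geocode": null})
   3654
D) > db.users.count({"user_location": ""})
   1392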
Thus, all 10,000 Twitter user profiles in the dataset were processed by this approach (A).
A total of 6,346 profiles were successfully geocoded (B), while 3,654 did not contain a valid
location recognized by the Google Geocoding API (C). Of these 3,654, a total of 1,392 profiles
did not have any user-defined value in the "user_location" field (D). Presumably the
remaining 2,262 profiles contained nonsensical values in the "user_location" field.
3.2 Strategies to improve geocoding and query performance
A portion of the remaining 2,262 profiles contained GPS coordinates written by either an
iPhone client (33 profiles) or the ÜberTwitter client (505 profiles) into the
“user_location” field (12, 13). To determine the exact number of these profiles, the
following query was performed (11, 14):
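The query screenshot is likewise missing; counting on the client-specific prefixes
("iPhone:" and "ÜT:", per reference 12) would take roughly this form (the exact prefixes
matched in the original are an assumption):

> db.users.count({"user_location": {$regex: /^iPhone:/}})
33
> db.users.count({"user_location": {$regex: /^ÜT:/}})
505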
It would be possible to manually curate these profiles so that they would be recognized by the
Google Geocoding API, but an algorithm could certainly be written to perform this function
(15).
There are several ways to improve the performance of the query. The most obvious is to
complete the Google authentication process, increasing the number of queries allowed per day
and thereby reducing the overall run time (6). Another way to increase query efficiency is to
create an index. A single-field text index on “user_location” would let queries skip
user profiles with missing values in this field, avoiding a full
collection scan (16-18). Documents with missing values in the field “user_location”
represent nearly 14% (1,392/10,000) of the “users” collection, as discussed above.
The command to create this index is:
> db.users.createIndex({"user_location":"text"})
Here the keyword text creates an index over the string content of the field (16). To search
the text index of the “users” collection, the $text and $search operators are used as
follows (19):
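The screenshot of this query is missing; based on the description that follows, it appears to
have searched the index for an empty quoted phrase, roughly (the exact search string is an
assumption):

> db.users.find({$text: {$search: '""'}}).count()
0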
The query returns no results, showing that the index contains no profiles lacking a text
string value in the “user_location” field (16-18).
3.3 Visualization
To visualize a subset of the reformatted geocoded profiles that reportedly originated in
Indiana, the following command was used (20-22):
$ mongoexport -d twitterdb -c users \
    -q '{"geocode": {$exists: true, $ne: null}, "geocode.formatted_address": {$regex: "USA"}, "geocode.formatted_address": {$regex: "IN"}}' \
    --csv --fields geocode.formatted_address,user_name -o twitterout
The next step is to reformat the output file into the visualization format using the command (20):
$ awk '{ printf("[ %s ],\n", $0); }' twitterout
To create a visualization HTML file of the Indiana tweets (68 total), the screen output list
was inserted into the sample HTML code on the webpage given in reference (23). The Google
Maps API then renders the data (3); the original report showed a screenshot of the resulting
map.
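The exact HTML file is not included in the report; a minimal sketch modeled on the Google
Charts map sample (23), with one illustrative data row where the awk output would be pasted,
might look like:

<html>
  <head>
    <script type="text/javascript" src="https://www.gstatic.com/charts/loader.js"></script>
    <script type="text/javascript">
      // Load the Google Charts "map" package, which plots addresses with Google Maps.
      google.charts.load('current', {'packages': ['map'], 'mapsApiKey': 'YOUR_API_KEY'});
      google.charts.setOnLoadCallback(drawMap);
      function drawMap() {
        var data = google.visualization.arrayToDataTable([
          ['Address', 'User'],
          // Rows produced by the awk step are pasted here; this one is illustrative.
          [ 'Indianapolis, IN 46204, USA', 'example_user' ],
        ]);
        var map = new google.visualization.Map(document.getElementById('map_div'));
        map.draw(data, {showTooltip: true, showInfoWindow: true});
      }
    </script>
  </head>
  <body>
    <div id="map_div" style="width: 800px; height: 600px"></div>
  </body>
</html>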
A copy of the validated data was dumped from the database for submission using the
mongodump tool (24):
$ mongodump -d twitterdb -c users
4 References
1) http://datamanagementcourse.soic.indiana.edu/
2) Rui Li, Shengjie Wang, Hongbo Deng, Rui Wang, Kevin Chen-Chuan Chang: Towards
social user profiling: unified and discriminative influence model for inferring home
locations. KDD 2012:1023-1031
3) https://developers.google.com/maps/documentation/geocoding/intro
4) https://developers.google.com/maps/
5) https://www.virtualbox.org/wiki/Downloads
6) https://developers.google.com/api-client-library/javascript/features/authentication
7) https://iu.instructure.com/courses/1491590/discussion_topics/6311828
8) http://docs.mongodb.org/manual/tutorial/modify-documents/
9) http://docs.mongodb.org/manual/faq/mongo/
10) http://jacobnibu.info/articles/Modeling%20Twitter%20Dataset.pdf
11) http://docs.mongodb.org/manual/tutorial/query-documents/
12) https://www.quora.com/What-is-%C3%9CT-19-137603-72-813111-in-Twitter
13) http://ubersocial.com/
14) http://docs.mongodb.org/manual/reference/operator/query/regex/
15) http://journals.uic.edu/ojs/index.php/fm/article/view/4366/3654
16) http://docs.mongodb.org/manual/core/index-text/
17) https://docs.mongodb.org/manual/core/crud-introduction/
18) http://docs.mongodb.org/manual/reference/operator/query/text/#op._S_text
19) http://docs.mongodb.org/manual/reference/operator/query/text/
20) README.txt file
21) https://docs.mongodb.org/manual/reference/program/mongoimport/
22) http://stackoverflow.com/questions/31514688/how-to-use-mongoimport-for-specific-fileds-from-tsv-file/31528255#31528255
23) https://developers.google.com/chart/interactive/docs/gallery/map#fullhtml
24) http://docs.mongodb.org/manual/reference/program/mongodump/