CSE5656 Complex Networks - Location Correlation in Human Mobility, Implementation

Complex Networks Class Project
Location Correlation in Human Mobility
Marcello Tomasini
Bio-Complex Lab
Department of Computer Sciences
Florida Tech

Twitter Miner Implementation
The application that mines Twitter is written in Python and uses the
following libraries:
• twitter (Python Twitter Tools): data is collected through the Twitter
streaming API and appended to a local buffer
• pymongo (MongoDB): data is stored on the BioComplex Lab MongoDB
instance
• logging: the Python logging facility keeps track of code exceptions
and of non-standard Twitter messages in the stream (warning, limit,
disconnect), mostly for debugging. In most cases an exception does not
stop the program; the code tries to recover instead, to avoid manual
intervention
• collections: collections.deque provides thread-safe, high-performance
local buffering, which reduces network I/O and the load on the BioComplex
Lab MongoDB server
• threading: data is pushed to the BioComplex Lab MongoDB instance by a
separate thread. The thread pops a fixed number of elements from the
deque and attempts the insert. If the insert fails, the elements are put
back in the deque, so no tweets are lost. The Python GIL is not an issue
here because the thread is I/O-bound
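
The buffering and flushing scheme described above can be sketched as follows. This is a minimal illustration, not the project's actual code: the class name BufferedWriter, the batch size, and the insert_many callable (a stand-in for something like a pymongo collection's insert_many) are assumptions made for the example.

```python
import collections
import logging
import threading

log = logging.getLogger("miner")


class BufferedWriter:
    """Buffer items in a deque and flush them in fixed-size batches."""

    def __init__(self, insert_many, batch_size=500):
        # deque.append and deque.popleft are thread-safe in CPython
        self.buffer = collections.deque()
        # Hypothetical sink: any callable that takes a list of documents
        self.insert_many = insert_many
        self.batch_size = batch_size
        self.stop_event = threading.Event()

    def add(self, item):
        """Called by the stream reader for every incoming item."""
        self.buffer.append(item)

    def flush_once(self):
        """Pop up to batch_size items and insert them; restore them on failure."""
        batch = []
        while len(batch) < self.batch_size:
            try:
                batch.append(self.buffer.popleft())
            except IndexError:  # buffer drained
                break
        if not batch:
            return 0
        try:
            self.insert_many(batch)
        except Exception:
            log.exception("insert failed, restoring %d items", len(batch))
            # extendleft reverses its input, so reverse first to keep the order
            self.buffer.extendleft(reversed(batch))
            return 0
        return len(batch)

    def run(self, interval=1.0):
        """Flush loop meant to run in a separate thread."""
        while not self.stop_event.is_set():
            if self.flush_once() == 0:
                self.stop_event.wait(interval)
```

The flush loop could be started with threading.Thread(target=writer.run, daemon=True).start(), while the stream reader calls writer.add(tweet) for each message.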
The code runs on an Amazon EC2 t2.micro instance for maximum reliability
(99.95% SLA).
Performance: the code easily handles a ~8 Mbps Twitter stream (the worldwide
stream of geotagged tweets), corresponding to ~2000 tweets/s.

Network Builder Implementation
The application that builds the network is written in Python and uses the
following libraries:
• pymongo (MongoDB): filters tweets with a bounding box (needed because of a
Twitter bug) and retrieves data from the BioComplex Lab MongoDB instance.
Query projections reduce the amount of data transferred over the network
• scikit-learn: provides functions to compute k-means clustering of
coordinate points. The clusters represent locations
• numpy: provides fast array and matrix data structures
• matplotlib.pyplot: plots graphs
• igraph: creates and exports the network structure
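
As an illustration of the bounding-box filter and query projection, the sketch below builds the two documents that could be passed to pymongo's collection.find(filter, projection). The field names are assumptions: they follow the layout of a Twitter API tweet object (a GeoJSON point stored as [longitude, latitude] under coordinates.coordinates) and may not match how the documents are actually stored. The function name bounding_box_query is hypothetical.

```python
def bounding_box_query(lon_min, lat_min, lon_max, lat_max):
    """Build a MongoDB filter and projection for tweets inside a bounding box."""
    # Assumed layout: coordinates.coordinates == [longitude, latitude]
    filter_doc = {
        "coordinates.coordinates.0": {"$gte": lon_min, "$lte": lon_max},
        "coordinates.coordinates.1": {"$gte": lat_min, "$lte": lat_max},
    }
    # Return only the fields needed downstream, to cut network transfer
    projection = {
        "_id": 0,
        "user.id": 1,
        "created_at": 1,
        "coordinates.coordinates": 1,
    }
    return filter_doc, projection


# Usage with a pymongo collection (not executed here):
#   filter_doc, projection = bounding_box_query(-81.0, 27.0, -80.0, 29.0)
#   cursor = collection.find(filter_doc, projection)
```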
Clustering needs a distance metric. Coordinates do not lie in a Euclidean
space but on a sphere, so to compute the great-circle distance [1] between
two points we could use the haversine formula [2].
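
For reference, a minimal sketch of the haversine formula is shown here. The function name haversine_km and the mean Earth radius of 6371 km are choices made for the example, not values taken from the project.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, assumed for this example


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    # Clamp to 1.0 to guard against rounding error for near-antipodal points
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))
```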
However, most implementations build a distance matrix when supplied with a
non-standard metric, which requires O(n^2) space. Given the size of the
dataset, that is impractical, so we use the Mercator projection [3] to project
the coordinates into a Euclidean space and then apply the standard k-means
algorithm.
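
The projection step can be sketched as below, using the spherical Mercator equations. The function names mercator and project_points and the unit radius are assumptions for illustration; the project's own code may differ.

```python
import math


def mercator(lat_deg, lon_deg, radius=1.0):
    """Project latitude/longitude in degrees onto the Mercator plane.

    Valid for latitudes strictly between -90 and 90 degrees.
    """
    x = radius * math.radians(lon_deg)
    y = radius * math.log(math.tan(math.pi / 4 + math.radians(lat_deg) / 2))
    return x, y


def project_points(coords):
    """Project an iterable of (lat, lon) pairs; returns a list of [x, y] rows."""
    return [list(mercator(lat, lon)) for lat, lon in coords]
```

The resulting rows could then be passed to a standard k-means implementation, for example sklearn.cluster.KMeans(n_clusters=k).fit(rows), where k is chosen by the user. Because Mercator stretches distances with increasing latitude, Euclidean distances in the projected plane only approximate great-circle distances.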
[1] http://en.wikipedia.org/wiki/Great-circle_distance
[2] http://en.wikipedia.org/wiki/Haversine_formula
[3] http://en.wikipedia.org/wiki/Mercator_projection
