Gao cong geospatial social media data management and context-aware recommendation

Geospatial Social Media Data Management
and Context-aware Recommendation
Gao Cong (丛高)
Nanyang Technological University

Geo-Positioning Technologies
• Increasingly sophisticated technologies enable the
accurate geo-positioning of mobile users
 GPS-based technologies
 Russian GLONASS, Chinese Beidou, EU’s Galileo
 WPS: positioning based on Wi-Fi
 Cellular positioning
 New technologies are underway (e.g., indoor positioning)
• Both users and contents are associated with accurate
locations

Geospatial and textual Object
• A geo-textual object o has:
 A Geographical Location o.λ
 E.g., “50 Nanyang Ave. Singapore 639798”, or “latitude 1.2° N, longitude 103.4° E”
 A textual description o.ψ
 E.g., “Canteen B”

Geospatial and textual data
• User generated content from social media is being
associated with geo-locations. For example,
 Points of interest (POIs) associated with text in websites,
such as Google Maps, Yelp, etc.
 geo-tagged micro-blogs (e.g., Twitter),
 photos with both tags and geo-locations in social photo
sharing websites (e.g., Flickr),
 check-in information on places in location-based social
networks (e.g., FourSquare, Facebook places).
• Integration of geo-location into keyword querying is
important
 53% of mobile searches on Bing have local intent
 20%+ of Google web queries are related to locations
Static
Dynamic

Outline
• Querying static geo-textual data
 Basic query: Retrieve a list of objects, each satisfying user’s need
 Boolean Range Query (BRQ)
 Boolean kNN Query (BkQ) (TKDE’12)
 Top-k kNN Query (TkQ) (VLDB’09, VLDBJ’12, VLDB’12)
 Other types of queries (ICDE’12, SIGMOD’11, TODS’13)
 Beyond single object granularity: Retrieve a set of objects that
together satisfy the user’s need
• Publish/subscribe query on geo-textual data stream
• Personalized query: context-aware POI recommendation
• Summary

Boolean Range Query
• A query region
• A set of keywords
[Figure: map of example objects with their text descriptions, including “OChre Italian Restaurant: pizza, white wine, cherry tomatoes”, “Student club, Gym, badminton, snooker”, “Adidas, Nike sports, New Balance sports shoes”, “Roadlink: bikes with various brands”, “Far east restaurant: spring rolls, dumplings”, “Somerset mall: …”, “Adidas sports accessories retail…”, “Pizza hut” and “Adidas retails”. Query keyword: pizza]
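
A Boolean range query can be read as a plain filter over the objects. The sketch below is only an illustration in Python, assuming a rectangular query region and objects stored as (name, location, keyword set) tuples; the coordinates are invented, and a real system would use an index rather than a linear scan.

```python
# Hypothetical toy data: (name, (x, y), keyword set). Coordinates are made up.
OBJECTS = [
    ("Pizza hut", (1.0, 1.0), {"pizza"}),
    ("OChre Italian Restaurant", (2.0, 3.0),
     {"pizza", "white", "wine", "cherry", "tomatoes"}),
    ("Adidas retails", (8.0, 8.0), {"adidas", "retails"}),
]


def boolean_range_query(objects, region, keywords):
    """Return the names of objects inside the rectangle that contain ALL keywords."""
    (xmin, ymin), (xmax, ymax) = region
    required = set(keywords)
    return [
        name
        for name, (x, y), text in objects
        if xmin <= x <= xmax and ymin <= y <= ymax and required <= text
    ]


# Objects in the square [0, 5] x [0, 5] whose text contains "pizza".
hits = boolean_range_query(OBJECTS, ((0.0, 0.0), (5.0, 5.0)), ["pizza"])
```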

Boolean kNN Query
• A query location
• Ranking Criteria: Spatial Proximity
[Figure: the same map of example objects. Query keywords: Adidas, sports; k = 2]
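
As a rough illustration, a Boolean kNN query filters on the keywords first and then ranks the survivors by distance only. The Python sketch below assumes Euclidean distance and invented coordinates; the function name and the data are hypothetical.

```python
import heapq
import math


def boolean_knn_query(objects, location, keywords, k):
    """Return the k nearest objects whose text contains every query keyword."""
    required = set(keywords)
    matches = [
        (math.dist(location, loc), name)
        for name, loc, text in objects
        if required <= text
    ]
    return [name for _, name in heapq.nsmallest(k, matches)]


# Hypothetical toy data: (name, (x, y), keyword set).
shops = [
    ("Adidas retails", (1.0, 1.0), {"adidas", "retails"}),
    ("Adidas sports accessories", (2.0, 2.0), {"adidas", "sports", "accessories"}),
    ("Adidas, Nike sports", (5.0, 5.0), {"adidas", "nike", "sports"}),
    ("New Balance sports shoes", (0.5, 0.5), {"new", "balance", "sports", "shoes"}),
]
nearest = boolean_knn_query(shops, (0.0, 0.0), ["adidas", "sports"], k=2)
```

Objects that miss a keyword are never returned, however close they are; this is what separates the Boolean kNN query from the ranked variant.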

Top-k kNN Query (TkQ)
• A query location
• Ranking Criteria: Combination of Spatial Proximity and
Text Relevancy
[Figure: the same map of example objects. Query keywords: Adidas, sports; k = 2]
Gao Cong, Christian S. Jensen, Dingming Wu: Efficient Retrieval of the Top-k Most
Relevant Spatial Web Objects. PVLDB 2(1): 337-348 (2009)
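
The slide does not give the ranking function, so the sketch below assumes one common form: a weighted sum of a normalized spatial proximity and a simple text relevance (the fraction of query keywords the object contains). The weight `alpha`, the normalizer `max_dist` and both scoring choices are assumptions made for illustration, not the scoring of the cited paper.

```python
import heapq
import math


def topk_query(objects, location, keywords, k, alpha=0.5, max_dist=10.0):
    """Rank ALL objects by alpha * proximity + (1 - alpha) * text relevance."""
    query = set(keywords)
    scored = []
    for name, loc, text in objects:
        proximity = max(0.0, 1.0 - math.dist(location, loc) / max_dist)
        relevance = len(query & text) / len(query)  # assumed relevance measure
        scored.append((alpha * proximity + (1 - alpha) * relevance, name))
    return [name for _, name in heapq.nlargest(k, scored)]


# Hypothetical toy data: (name, (x, y), keyword set).
places = [
    ("A", (1.0, 0.0), {"adidas"}),
    ("B", (5.0, 0.0), {"adidas", "sports"}),
    ("C", (9.0, 0.0), {"pizza"}),
]
top2 = topk_query(places, (0.0, 0.0), ["adidas", "sports"], k=2)
```

Unlike the Boolean queries, a partial keyword match can still be returned: object A is closer but matches only one keyword, so under these weights it ranks behind B.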

How to process these queries efficiently?
• Indexes: many proposals
• Spatial Indexing Scheme
 R-tree based indices
 Grid based indices
 Space Filling Curve (SFC) based indices
• Textual Indexing Scheme
 Inverted File based indices
 Signature file (Bitmap) based indices
• Combination Scheme
 Spatial-first
 Text-first
 Tightly combined (hybrid index)

Other types of spatial-keyword queries
• Approximate String Search in Spatial Databases
 Bin Yao, Feifei Li, M. Hadjieleftheriou, K. Hou. ICDE 2010
• Continuously moving spatial keyword queries
 Wu, Dingming, Man Lung Yiu, Christian S. Jensen, Gao Cong. ICDE11
• Reverse spatial and textual k nearest neighbour search
 Lu, Jiaheng, Ying Lu, Gao Cong. SIGMOD11, TODS’14
• Spatial-textual similarity join
 Ju Fan, Guoliang Li, Lizhu Zhou, Shanshan Chen, Jun Hu. VLDB12
 Panagiotis Bouros, Shen Ge and Nikos Mamoulis. VLDB12
• Top-k spatial keyword queries on road networks
 João B. Rocha-Junior and Kjetil Nørvåg. EDBT12
• Spatial Keyword Query Processing: An Experimental Evaluation
 Lisi Chen, Gao Cong, Christian S. Jensen, Dingming Wu: PVLDB, 2013
• Diversified Spatial Keyword Search On Road Networks.
 Chengyuan Zhang, Ying Zhang, Wenjie Zhang, Xuemin Lin, Muhammad Aamir
Cheema, Xiaoyang Wang EDBT 2014
• ……
• All treating geo-textual objects independently!

Outline
 Retrieve a set of objects that together satisfy the user need
(SIGMOD’11, TODS’15)
 Retrieve a region of interest for user exploration (VLDB’14)
 mCK-query (SIGMOD’15)
 Route planning query (VLDB’12)
• Summary

Problem Statement of m-CK problem
• Geo-textual object o
 Location o.λ
 Textual description o.ψ
• m-closest keywords (m-CK) problem [Zhang et al., ICDE 2009, ICDE 2010]
 A query q consists of m query keywords
 Find a group of objects T covering all the m query keywords: $q \subseteq \bigcup_{o \in T} o.\psi$
 Objects should be close to each other
 Minimize the diameter of a group
 Diameter of a group: the maximum Euclidean distance between any pair of objects
$\mathrm{Diam}(T) = \max_{o_i, o_j \in T} \mathrm{Dist}(o_i, o_j)$
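
To make the definition concrete, the following Python sketch computes the diameter of a group and solves the m-CK problem by exhaustive search. It is a brute-force illustration for tiny inputs only, with invented data; it relies on the fact that some optimal group has at most m objects, one per query keyword.

```python
import math
from itertools import combinations


def diam(group):
    """Maximum Euclidean distance between any pair of objects in the group."""
    return max((math.dist(a[0], b[0]) for a, b in combinations(group, 2)),
               default=0.0)


def covers(group, query):
    """True if the union of the objects' keywords contains every query keyword."""
    return set(query) <= set().union(*(kws for _, kws in group))


def mck_brute_force(objects, query):
    """Exhaustively find a covering group with the smallest diameter."""
    query = set(query)
    best = None
    for size in range(1, len(query) + 1):
        for group in combinations(objects, size):
            if covers(group, query) and (best is None or diam(group) < diam(best)):
                best = group
    return best


# Hypothetical toy data: ((x, y), keyword set).
pois = [
    ((0.0, 0.0), {"carpark"}),
    ((1.0, 0.0), {"shop"}),
    ((0.0, 1.0), {"hotel"}),
    ((50.0, 50.0), {"shop", "hotel"}),
]
best_group = mck_brute_force(pois, {"carpark", "shop", "hotel"})
```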

Applications
• Explore an area fulfilling user’s personalized needs
 Issue an m-CK query {sushi, cinema, spa}

Applications
• Detecting geographic locations of web resources
 Web resource can be documents, photos, etc.
 These resources are usually associated with some tags describing the
content.
 They may be posted without geographic location.
 We can issue an m-CK query using these tags as keywords.
 The center of the m-CK result can be used to geo-tag this resource
approximately.

Contributions Overview
1. We proved the m-CK problem is NP-hard
2. Greedy Keyword Group (GKG)
 Approximation algorithm with ratio 2
 Time complexity $O(m \, |O_{t_{inf}}| \, d)$
3. Smallest Keywords Enclosing Circle (SKEC) based algorithms
 Naïve algorithm SKEC: complexity $O(|O'| \, n^3)$; approximation algorithm with ratio $2/\sqrt{3}$ (≈ 1.1547)
 Approximation algorithms SKECa and SKECa+ for the SKEC problem: they return the same results, with ratio $2/\sqrt{3} + \epsilon$; worst-case time complexity $O(|O'| \log\frac{1}{\epsilon} \, n \log n)$
4. Algorithm EXACT for solving the m-CK query
 Based on SKECa+

Keyword-aware Optimal Route Query
• Identifying a preferable route is an important problem
 Real world applications already offer tools for trip planning or route
searching.
 RouteRank: http://www.routerank.com
 Google Maps: http://maps.google.com
 Existing research work: e.g., TPQ (SSTD 05), OSR(VLDB J. 08).
• An example route search query:
 Finding the most popular route to and from my hotel such that it
passes by shopping mall, restaurant, and pub, and the time
spent on the road is within 4 hours.
 None of the existing applications or research work can answer such a
query
Xin Cao, Lisi Chen, Gao Cong, Xiaokui Xiao. Keyword-aware Optimal Route
Search. PVLDB: 1136-1147 (2012)

Keyword-aware Optimal Route Query
• Q = (vs, vt, ψ, Δ, f)
 vs, vt: the start and end locations (hotel)
 ψ : a set of keywords (shopping mall, restaurant, and pub)
 should be covered in the return route
 Δ : the budget limit (within 4 hours)
 Hard constraint
 f : the function calculating the score of a route (popularity)
 To be optimized
• The problem is proved to be NP-hard
 Reduced from the weighted constrained shortest path problem
(Has no keyword constraint)
 Also related to the generalized traveling salesman problem (Has
no budget limit)
• We develop approximation algorithms with performance
guarantees for the problem.

Outline
 Boolean range subscription queries (SIGMOD’13)
 Top-k subscription queries (ICDE’15)
 Diversity-aware Top-k subscription queries (SIGMOD’15)
• Summary

Publish/subscribe query
• Users may issue subscription queries, which
continuously find tweets/objects satisfying conditions on
stream data.
• Example: Find the tweets containing bicycle AND sell
from now until 1 July 2013.

Publish/Subscribe System
[Figure: a publisher sends geo-textual objects to the publish/subscribe system, which matches them against the registered queries (subscribers)]
• A geo-textual object is o = (ψ, l, t)
 o.ψ: text information
 o.l: location
 o.t: timestamp

Boolean Range Subscription Query
• Example: a query for tweets containing protest AND sell, with their distance to Times Square smaller than 15 mi
[Figure: map around Times Square with tweets such as “…running shoes…”, “…motor…sell”, “…protest…sell…”, “…bike…sell…” and “bike…exercise…”; matching tweets are marked “Result”]
Lisi Chen, Gao Cong, Xin Cao. An Efficient Query Indexing Mechanism for
Filtering Geo-Textual Data. In ACM SIGMOD, 2013

Boolean Range Subscription Query
• Boolean Range Continuous (BRC) Query
q = (ψ , r , tc , te )
 ψ : a set of keywords connected by AND or OR semantics
(bike AND sell, Mocha OR Espresso)
 r : the query region (within 5 miles from Times Square)
 tc, te : the creation and expiration time (from now until July 1st )
• Research problem: continuously answering a large number of incoming BRC queries in real time over a stream of geo-textual objects
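
The matching semantics of a BRC query can be sketched as below. This is a naive Python illustration, assuming a circular query region and a linear scan over all registered queries; the class and function names are hypothetical, and the cited work replaces the scan with a query index.

```python
import math
from dataclasses import dataclass


@dataclass
class BRCQuery:
    keywords: frozenset   # query keywords
    mode: str             # "AND" or "OR" semantics
    center: tuple         # centre of the (assumed circular) query region
    radius: float
    t_create: float       # creation time
    t_expire: float       # expiration time


def matches(q, words, loc, t):
    """True if an object with these words, location and timestamp satisfies q."""
    if not (q.t_create <= t <= q.t_expire):
        return False
    if math.dist(q.center, loc) > q.radius:
        return False
    words = set(words)
    if q.mode == "AND":
        return q.keywords <= words
    return bool(q.keywords & words)


def publish(queries, words, loc, t):
    """Return the indexes of all registered queries that the new object matches."""
    return [i for i, q in enumerate(queries) if matches(q, words, loc, t)]


subscriptions = [
    BRCQuery(frozenset({"bike", "sell"}), "AND", (0.0, 0.0), 5.0, 0.0, 100.0),
    BRCQuery(frozenset({"mocha", "espresso"}), "OR", (0.0, 0.0), 5.0, 0.0, 100.0),
]
```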

Applications
• Annotation of Points-of-Interest (POIs)
 A POI service provider (e.g.,Yelp) may want to annotate each POI
with its up-to-date relevant tweets in terms of both text relevance
and spatial proximity.
[Figure: a POI that maintains its top-3 most relevant geo-tagged tweets in real time]

Applications
• Location-Aware Subscription Query
 Users on Twitter want to be updated with tweets near their home
on a topic (e.g., food poisoning vomiting).
 Users would prefer to be updated with a few most relevant tweets
in terms of distance, text relevance, and recency, rather than being
overwhelmed by a large number of tweets.

Temporal Spatial-Keyword Subscription (TaSK) Query
A set of keywords: espresso, mocha
Location: Times Square
k - the number of results: 10
Objective: Maintain up-to-date top-k most relevant results
for each TaSK query over a stream of geo-textual objects.
How to measure
‘relevance’?

Problem Statement
• Ranking Criterion:
 Stsk : Temporal spatial-keyword score, a combination of distance
proximity (spatial), text relevance (keyword), and object freshness
(temporal).
 Ssk : Spatial-Keyword Score
 Sdist : Score of spatial proximity
 Srel : Score of text relevance
 DΔt : Exponential Decaying Factor
Lisi Chen, Gao Cong, Xin Cao, Kian-Lee Tan Temporal Spatial-Keyword
Top-k Publish/Subscribe. Proceedings of the 30th ICDE, 2015
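
The slide names the components of the score but not how they are combined. The Python sketch below assumes one plausible combination: the spatial-keyword score is a weighted sum of the proximity and relevance scores, and it is multiplied by an exponential decay in the object's age. The weight `alpha`, the decay base and the exact formula are assumptions for illustration only.

```python
def tsk_score(s_dist, s_rel, delta_t, alpha=0.5, decay_base=2.0):
    """Assumed form: S_tsk = S_sk * D^(-delta_t), with
    S_sk = alpha * S_dist + (1 - alpha) * S_rel."""
    s_sk = alpha * s_dist + (1 - alpha) * s_rel
    return s_sk * decay_base ** (-delta_t)


fresh = tsk_score(s_dist=0.8, s_rel=0.4, delta_t=0.0)
older = tsk_score(s_dist=0.8, s_rel=0.4, delta_t=1.0)
```

Under this form an object's score only decreases as it ages, so a newer object with the same spatial-keyword score always ranks higher.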

Outline
 Time aware POI recommendation (SIGIR’13, CIKM’14, SIGIR’15)
 Group recommendation (KDD’14)
 Modeling user behavior from geo-textual data for recommendation
and Prediction ( Who, Where, When, and What ) (KDD’13,
TOIS’15)
 Sentiment-aspect aware POI recommendation (ICDE’15)
 Next POI prediction (IJCAI’15)
• Summary

Background and Motivation
• With GPS-enabled mobile devices, social media
associated with spatial information
 Microblogging: Twitter, Weibo
 Location based social networks: Foursquare, Jiepang
• Geo-annotated user-generated content (UGC) often has:
 posting user ID
 location (point-of-interest, POI)
 timestamp
 text

Point-of-interest Recommendation
• A great quantity of geo-annotated UGC has been
accumulated
 Twitter: 1-2 million tweets per hour, 2.7% of which are geo-
annotated [1]
 Foursquare: 6 billion check-ins [2]
• The spatial, temporal and semantic information enables a
number of applications
• Point-of-interest Recommendation: to recommend points-
of-interest (POIs) that a user is interested in but has not
visited
 To users: discovering new places, knowing their cities better
 To merchants: launching advertisements, attracting more
customers
[1] http://irevolution.net/2013/06/09/mapping-global-twitter-heartbeat/
[2] https://foursquare.com/about

Problems: POI recommendations
1. POI recommendation:
given a user u, recommend
POIs that he/she may be
interested in but has not
visited yet.
2. Context Aware POI
recommendation: given a
user u, a context (e.g., time),
recommend POIs that he/she
may be interested in the
context.

Context-aware POI recommendation
 For example, Mary wants to find a restaurant to have pizza with
her friend Bob at 7:00 PM on Friday
 Time: 7:00 PM, Friday
 Companion: Bob
 Requirement: having pizza
 Exploiting the different aspects to improve the accuracy of POI
recommendation

Challenges
 Data sparsity
The density of check-in matrix or tensor is often less than 0.05%,
which is extremely small compared to 1.2% for Netflix data.
 Check-ins are implicit feedback data
Different from conventional rating data, the check-ins offer only
positive examples that a user likes.
 How to explore contextual information?
We need to incorporate contextual information, e.g., coordinates,
time stamps of check-ins.

Time-aware POI Recommendation
• Geographical Influence
 Nearby places
• Temporal Influence
 User mobility varies with time
 office @ morning, pubs @ night
 Both geographical and temporal influences are important for
POI recommendation
• Time-aware POI recommendation:
to recommend POIs for a user to visit at a specified time
 Splitting a day into 24 slots based on hour

Our Approaches
• Approach 1: Extending user-based Collaborative Filtering
(CF)
 Computing user similarity, in particular the historical data at the
target time
 The challenge is to solve the data sparsity problem
• Approach 2: Extending graph based approach
 It can effectively capture the interaction between different types of
entities.
• Approach 3: A new approach based on matrix/tensor
factorization + learning to rank
Q. Yuan, G. Cong, Z. Ma, A. Sun, N. M. Thalmann: Time-aware point-of-interest
recommendation. SIGIR 2013
Q. Yuan, G. Cong, A. Sun: Graph-based point-of-interest recommendation with
geographical and temporal influences. CIKM 2014
Xutao Li, Gao Cong, Xiaoli Li, Tuan-Anh Nguyen Pham: Rank-GeoFM: A Ranking
based Geographical Factorization Method for Point of Interest Recommendation.
SIGIR 2015

Experimental Setup
• Two real-world datasets
• Split visited POIs of a user into three parts:
• |training set| : |tuning set| : |testing set| = 6:1:3
• Metrics
 Precision@N, Recall@N, MAP@N, nDCG@N, N=5,10,20
                    Foursquare               Gowalla
Region              Singapore                California & Nevada
Time                Aug. 2010 - Jul. 2011    Feb. 2009 - Oct. 2010
#user               2,321                    10,162
#POI                5,596                    24,250
#check-in           194,108                  456,988
Density (24 bins)   2.65 × 10^-4             4.10 × 10^-5

Experimental results (1)
POI recommendations
1. Rank-GeoFM outperforms state-of-the-art methods, e.g., GeoMF and GTBNM, by 30%
2. Incorporating geographical influence into Rank-GeoFM leads to a significant improvement.
3. The performance of BPR-MF is also promising because it is a ranking based method and more
suitable for handling sparse and implicit feedback data.

Experimental results (3)
Time-aware POI recommendation
1. Rank-GeoFM outperforms state-of-the-art methods by 20%.

Group POI Recommendation
 People often participate in activities together with others
 Having picnics with friends
 Having dinner with colleagues
• Group POI recommendation: recommending a list of POIs for a
group of users
 Facilitating groups making decisions
 Helping web services improve user engagement
• Challenges
 Conventional recommender systems are designed for individuals
 Difficult to make a trade-off among different members’ preferences
 Many groups are ad hoc
Yuan et al. COM: a generative model for group recommendation. KDD 2014

COnsensus Model (COM)
• A group event g consists of a set of users ug and a POI ig
• Intuitions:
 Each group is relevant to several topics with different matching
degrees
 e.g., a picnic group is more relevant to hiking and dining topics than to
the body-building topic
 The topics of the group attract users to join the group

• Intuitions:
 Each group member selects a POI either based on the topic, or
traveling distance
 e.g., when selecting a POI for picnic, a user may consider either the
matching degree of a POI to the topic “hiking”, or the travel distance to
a POI

• Intuitions:
 Different users make different trade-offs between the two factors
 Tossing a coin c from user-specific Bernoulli distribution λu
 Head: topic, tail: traveling distance
 e.g., if a user does not mind traveling, then the topic “hiking” has a
more significant influence to her selection. Thus, her toss result is
more likely to be “head”

• Intuitions:
 A user may behave differently when selecting as a group member
and as an individual. In a group, a user tends to match her
preference to the topics of the group
 If head, selecting item based on the group topic attracted her
 e.g., a movie fan will select a hill instead of a cinema for the picnic
group

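• For each topic z_k, k = 1,…,K
 Draw multinomial user distribution $\varphi^{ZU}_k \sim \mathrm{Dir}(\beta)$
 Draw multinomial item distribution $\varphi^{ZI}_k \sim \mathrm{Dir}(\eta)$
• For each user u_v, v = 1,…,|U|
 Draw multinomial item distribution $\varphi^{UI}_v \sim \mathrm{Dir}(\rho)$
 Draw Bernoulli distribution $\lambda_v \sim \mathrm{Beta}(\gamma)$
• For each group g
 Draw topic distribution $\theta_g \sim \mathrm{Dir}(\alpha)$
 For each group member
 Draw topic $z \sim \mathrm{Mult}(\theta_g)$
 Draw user $u \sim \mathrm{Mult}(\varphi^{ZU}_z)$
 Toss a coin $c \sim \mathrm{Bernoulli}(\lambda_u)$
 If c = 0
– Draw item $i \sim \mathrm{Mult}(\varphi^{UI}_u)$
 Else
– Draw item $i \sim \mathrm{Mult}(\varphi^{ZI}_z)$
• We use Gibbs sampling to estimate the parameters
[Figure: plate diagram of COM with variables θ, z, u, c and i, plates |G|, |g|, K and U, and hyperparameters α, β, γ, η, ρ]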
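
The generative story of COM (topic, then user, then coin, then item) can be simulated with the standard library alone. The sketch below is a forward sampler for one group, not the Gibbs sampler used for inference; it assumes symmetric Dirichlet priors and reads Beta(γ) as a symmetric Beta(γ, γ), and all sizes and hyperparameter values are invented.

```python
import random


def dirichlet(alpha, n, rng):
    """Draw from a symmetric Dirichlet(alpha) of dimension n via Gamma draws."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(n)]
    total = sum(draws)
    return [d / total for d in draws]


def generate_group(n_members, K, n_users, n_items, rng,
                   alpha=1.0, beta=1.0, eta=1.0, rho=1.0, gamma=1.0):
    """Sample one group event as a list of (topic, user, coin, item) tuples."""
    phi_zu = [dirichlet(beta, n_users, rng) for _ in range(K)]       # topic -> users
    phi_zi = [dirichlet(eta, n_items, rng) for _ in range(K)]        # topic -> items
    phi_ui = [dirichlet(rho, n_items, rng) for _ in range(n_users)]  # user -> items
    lam = [rng.betavariate(gamma, gamma) for _ in range(n_users)]    # coin bias
    theta = dirichlet(alpha, K, rng)                                 # group topics
    members = []
    for _ in range(n_members):
        z = rng.choices(range(K), weights=theta)[0]
        u = rng.choices(range(n_users), weights=phi_zu[z])[0]
        c = 1 if rng.random() < lam[u] else 0
        # c = 0: item from the user's own distribution; c = 1: from the topic's.
        item_dist = phi_ui[u] if c == 0 else phi_zi[z]
        i = rng.choices(range(n_items), weights=item_dist)[0]
        members.append((z, u, c, i))
    return members


sample = generate_group(n_members=5, K=3, n_users=10, n_items=20,
                        rng=random.Random(0))
```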

Recommendation
• 2 steps
 Estimating the topic proportion θt of the given group members 𝒖𝑡 by
Gibbs sampling
 Ranking candidate POIs i based on the equation:
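$P(i \mid \mathbf{u}_t, \theta_t) = \sum_{z \in Z} \theta_{t,z} \cdot \prod_{u \in \mathbf{u}_t} \varphi^{ZU}_{z,u} \left( \lambda_u \cdot \varphi^{ZI}_{z,i} + (1 - \lambda_u) \cdot \varphi^{UI}_{u,i} \right)$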
• Revising the prior 𝜌 𝑢,𝑖 to incorporate
distance information
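
Reading the ranking equation as a sum over topics of the topic weight times a product over the group members, the score of a candidate POI can be computed directly. The Python sketch below assumes the parameters have already been estimated and are passed in as nested lists; the function name and the toy numbers are hypothetical.

```python
def score_item(i, members, theta_t, phi_zu, phi_zi, phi_ui, lam):
    """Sum over topics z of theta[z] * prod over members u of
    phi_zu[z][u] * (lam[u] * phi_zi[z][i] + (1 - lam[u]) * phi_ui[u][i])."""
    total = 0.0
    for z, theta_z in enumerate(theta_t):
        term = theta_z
        for u in members:
            term *= phi_zu[z][u] * (
                lam[u] * phi_zi[z][i] + (1 - lam[u]) * phi_ui[u][i]
            )
        total += term
    return total


# Toy parameters: one topic, one user, two candidate items.
theta_t = [1.0]
phi_zu = [[1.0]]
phi_zi = [[0.2, 0.8]]
phi_ui = [[0.5, 0.5]]
lam = [0.5]
ranking = sorted(
    range(2),
    key=lambda i: score_item(i, [0], theta_t, phi_zu, phi_zi, phi_ui, lam),
    reverse=True,
)
```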

Experimental Setup
• Datasets
 Jiepang: group check-in records of a location-based social network
 Plancast: event records of an event-based social network
• |training set| : |testing set| = 8:2
• Evaluation metrics
 Recall@N, nDCG, N = 5, 10, 20

Dataset        Plancast   Jiepang
#users         41,705     28,88
#groups        13,885     23,621
#items         8,016      9,746
#members       23.30      4.68
#group item    1.00       1.01

Experimental Results
• Recall@N
• nDCG for different #topics
• COM achieves superior accuracy
[Figures: Recall@N versus N, and nDCG versus the number of topics K, on Plancast and Jiepang]

Method   Description
CF-RD    Relevance & disagreement, PVLDB ’09
SIG      Social Influence-based Group, SIGIR ’12
PIT      Personal Impact Topic Model, CIKM ’12
COMP     Proposed model without content info.
COM      Proposed model with content info.

Requirement-aware POI Recommendation
• Users may have specific requirements before submitting
the recommendation queries
 “delicious pizza” @ 7:00 PM
• Requirements directly reveal users’ interests
• Challenges:
 We need to model users (who), POIs (where), time (when) and
requirements (what)
 None of previous studies can handle the four factors
• A tweet d is modeled as a five-tuple {ud, ld, wd, td, sd}
 u: user, l={id, coordinate}: POI ID & geographical coordinates
 w: words, t={hh:mm:ss}: time in a day, s: workday/weekend

Overview: Region and time
• Intuitions:
 An individual u’s mobility centers on different personal geographical regions r (e.g., home region, work region, shopping region, etc.)
 The region r where a user u stays is influenced by the day s
 e.g., weekday: work region; weekend: shopping region
 Draw a region $r \sim \mathrm{Multinomial}(\psi_{u,s})$
 User u’s temporal patterns are determined by region r and day s
 e.g., visiting the shopping region on weekday evenings and weekend afternoons
 Draw a time $t \sim \mathrm{Gaussian}(\nu_{r,s}, \lambda_{r,s}^{-1})$
[Figure: graphical model with nodes u, s, t and r, inside plates |D_u| and |U|]

Overview: Topic and POI
• Intuitions:
 User u’s topic interests are influenced by u’s topic preferences and by the region r
 e.g., u: “reading” and “shopping”. u@Times Square: “shopping”
 Draw a topic $z \sim \mathrm{Multinomial}(\theta_{u,r})$
 User u chooses a POI l based on either topic z or region r
 A nearby POI within r that meets the topic requirement z (e.g., meal)
 Different users make different trade-offs between z and r
 Draw a switch $c^L \sim \mathrm{Bernoulli}(\xi^L_u)$
 If $c^L = 0$, draw a POI $l \sim \mathrm{Gaussian}(\boldsymbol{\mu}_{r,s}, \boldsymbol{\Lambda}_{r,s}^{-1})$
 If $c^L = 1$, draw a POI $l \sim \mathrm{Multinomial}(\varphi^{ZL}_z)$
[Figure: graphical model extended with nodes z and l and the switch c^L]

Overview: Word
• Intuitions:
 User u chooses a set of words w based on either topic z or region r
 Different users make different trade-offs between z and r
 e.g., user u is shopping at home region: “grocery”, “family”
 Draw a switch $c^W \sim \mathrm{Bernoulli}(\xi^W_u)$
 If $c^W = 0$, draw each word $w \sim \mathrm{Multinomial}(\varphi^{RW}_r)$
 If $c^W = 1$, draw each word $w \sim \mathrm{Multinomial}(\varphi^{ZW}_z)$
[Figure: graphical model extended with the word node w, the switch c^W and the plate |W|]

#regions, #topics?
• # regions of each user is unknown
 students: campus region; white collar: home & work regions
• We employ Chinese Restaurant Process (CRP) to draw regions
and automatically learn #regions for each user
 customers: POIs in tweets
 table: regions
• # topics is unknown
 previous studies empirically tune it
• We employ Hierarchical Dirichlet Process (HDP) that can
automatically learn #topics
 A global distribution τ, which is drawn from a stick-breaking process
 The topic distribution $\theta_{u,r}$ is drawn from the global distribution τ
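
The way a Chinese Restaurant Process lets the number of regions grow with the data can be sketched in a few lines. In the analogy on this slide, customers are POIs in tweets and tables are regions; the Python sketch below shows only the prior (a customer joins an existing table with probability proportional to its size, or opens a new one with probability proportional to the concentration parameter) and ignores the location likelihood that the full model would add. The function name and parameter values are hypothetical.

```python
import random


def crp_assign(n_customers, concentration, rng):
    """Seat customers one by one; return (table index per customer, table sizes)."""
    tables = []       # tables[k] = number of customers seated at table k
    assignment = []
    for _ in range(n_customers):
        # Existing tables weighted by size, plus one slot for a new table.
        weights = tables + [concentration]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(tables):
            tables.append(1)   # open a new table (a new region)
        else:
            tables[k] += 1
        assignment.append(k)
    return assignment, tables


seating, sizes = crp_assign(n_customers=50, concentration=1.0,
                            rng=random.Random(0))
```

The number of occupied tables is not fixed in advance, which is why the number of regions per user does not need to be tuned by hand.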

Graphical Model
[Figure: plate diagram of the full model. Regions are drawn with a CRP and topics with a stick-breaking (STB) process; the remaining parameters have Dirichlet, Beta, Normal-Wishart and Normal-Gamma priors]

Applications
 Given any of the aspects user, location, time and words, our model can predict the others
 Requirement-aware POI recommendation: $P(l \mid u, s, t, \mathbf{w})$
 Activity prediction: $P(\mathbf{w} \mid u, s, t)$
 User prediction: $P(u \mid s, t, l)$
 POI prediction for user: $P(l \mid u, s, t)$
 Tweets recommendation: $P(\mathbf{w} \mid u, s, t, l)$
* u: user, l: venue, w: words, s: day, t: time

Scenarios of user recommendation

Effectiveness
• Three models to compare
 PMM (Stanford University, KDD 2011)
 W4 (KDD 2013)
 EW4 (TOIS 2015)
• Datasets: microblogs posted in USA
 171,768 microblogs in USA, 4,122 users, 35,989 POIs
• Metric: accuracy (top-1 precision) of predicting users for a
place
Model   Acc
PMM     0.4021
W4      0.5863
EW4     0.7679

Future Work
• Effectiveness of queries on geo-textual data
• Publish/subscribe for geo-textual data is a relatively new topic
 What factors should be considered in ranking
 How to present results
 Distributed solution
• POI Recommendation
 Explainable Recommendation Results
 Exploiting other kinds of contextual information
 Weather, traffic pattern, etc.
 Efficiency, Cold start, Sparsity

Acknowledgement to my collaborators

Efficient Algorithms for Answering the
m-Closest Keywords Query

Outline
• Problem Statement
• Applications
• Algorithms
• Experimental Results
• Conclusions


Greedy Keyword Group
1. Given a query $\{t_{q1}, t_{q2}, \dots, t_{qm}\}$, find the most infrequent keyword $t_{inf}$
2. For an object o containing $t_{inf}$, find an object p which
a) contains an uncovered keyword ($t \in q \setminus o.\psi$)
b) is the nearest such object to o
3. Repeat step 2 until all query keywords are covered
4. Select the group with the smallest diameter
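
The four steps above can be sketched in Python as follows. This is a minimal, unindexed illustration with invented data: nearest objects are found by scanning all objects, whereas an efficient implementation would use a spatial index for that step. The function names are hypothetical.

```python
import math
from itertools import combinations


def diameter(group):
    """Maximum pairwise Euclidean distance within a group of (loc, keywords)."""
    return max((math.dist(a[0], b[0]) for a, b in combinations(group, 2)),
               default=0.0)


def greedy_keyword_group(objects, query):
    query = set(query)
    # Step 1: the most infrequent query keyword (ties broken alphabetically).
    t_inf = min(sorted(query),
                key=lambda t: sum(1 for _, kws in objects if t in kws))
    best = None
    for seed in objects:
        if t_inf not in seed[1]:
            continue
        group, covered = [seed], seed[1] & query
        # Steps 2-3: keep adding the object nearest to the seed that
        # contributes at least one still-uncovered keyword.
        while covered != query:
            candidates = [o for o in objects if (o[1] & query) - covered]
            if not candidates:
                break
            p = min(candidates, key=lambda o: math.dist(seed[0], o[0]))
            group.append(p)
            covered |= p[1] & query
        # Step 4: keep the covering group with the smallest diameter.
        if covered == query and (best is None or diameter(group) < diameter(best)):
            best = group
    return best


# Hypothetical toy data: ((x, y), keyword set); carpark is the rarest keyword.
pois = [
    ((0.0, 0.0), {"carpark"}),
    ((1.0, 0.0), {"shop"}),
    ((0.0, 1.0), {"hotel"}),
    ((50.0, 50.0), {"shop", "hotel"}),
]
result = greedy_keyword_group(pois, {"carpark", "shop", "hotel"})
```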

Greedy Keyword Group
• Example
 For a query containing the keywords {carpark, shop, hotel}
 Suppose carpark is the most infrequent keyword

Smallest Keywords Enclosing
Circle
• Observation:
 The optimal solution can be enclosed by a circle.
 Minimum Objects Enclosing Circle (MOEC): the smallest
circle enclosing given objects
 If we can find this circle first, it will help find the
optimal group.
• Problem
 It remains challenging to find such a circle
How about finding the smallest circle
enclosing all query keywords?

Smallest Keywords Enclosing
Circle
• Smallest Keywords Enclosing Circle (SKEC)
 Smallest circle enclosing all query keywords
• Example:
 Query {carpark, shop, hotel, restaurant}
[Figure: the SKEC for the example query]
Is the group of objects enclosed by SKEC the optimal result?

Why is SKEC not the optimal result?
• SKEC is different from the MOEC of the optimal group
• Example
 Query {carpark, shop, hotel}
[Figure: the SKEC compared with the enclosing circle (MOEC) of the optimal group]
• However, the diameter of the group enclosed by SKEC can be bounded by a factor of $2/\sqrt{3}$
• Theorem: SKEC has an approximation ratio of $2/\sqrt{3}$
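
One way to see where the factor 2/√3 comes from is Jung's theorem, which states that any planar point set of diameter d fits in a circle of radius d/√3. The derivation below is a sketch under that assumption; the slides do not show the proof, and the symbols T*, d*, D_SKEC and T are introduced here only for illustration.

```latex
% T^*      : an optimal m-CK group, with d^* = \mathrm{Diam}(T^*)
% D_{SKEC} : diameter of the smallest keywords enclosing circle
% T        : the group of objects enclosed by SKEC
\begin{align*}
D_{SKEC} &\le \frac{2}{\sqrt{3}}\, d^*
  && \text{($T^*$ fits in a circle of radius $d^*/\sqrt{3}$, which encloses all query keywords)} \\
\mathrm{Diam}(T) &\le D_{SKEC}
  && \text{(all objects of $T$ lie inside SKEC)} \\
\Rightarrow\quad \mathrm{Diam}(T) &\le \frac{2}{\sqrt{3}}\, d^* \approx 1.1547\, d^*
\end{align*}
```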

How to find SKEC
• Naïve Solution:
 Enumerate objects as the boundary of the circle
 Time consuming: $O(|O'| \, n^3)$
• Finding SKEC
1. The size of the circle.
 Suppose the circle diameter is known as D.
2. The position of the circle.
• Observation
 At least two objects should be on the boundary.
• Solution:
1. Choose an object o fixed on the boundary.
2. Rotate the circle around o, if all keywords can be covered in some
position we find SKEC.

How to find SKEC
• Rotate the circle around an object with a given diameter D
• Can a valid group be found with diameter D?
 Yes: try a diameter smaller than D
 No: no solution will be found with a smaller diameter
• Monotonicity → Binary Search
 Binary search on the circle diameter
 Stop when the search range is less than a given parameter ε
 Binary search complexity: $O(\log\frac{1}{\epsilon})$
• Smallest Keywords Enclosing Circle approximation (SKECa)
 Find SKEC with error ε
 Find an m-CK solution with approximation ratio $2/\sqrt{3} + \epsilon$
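
The binary search on the diameter can be sketched as below. The feasibility test here is a brute-force stand-in for the rotating-circle sweep: it tries circles of the given diameter that pass through two objects, plus circles centred on a single object, which is enough to decide feasibility but much slower than the sweep described on the slides. All names and the toy data are hypothetical.

```python
import math


def centers_through(p, q, r):
    """Centres of the circles of radius r that pass through both p and q."""
    d = math.dist(p, q)
    if d == 0 or d > 2 * r:
        return []
    mx, my = (p[0] + q[0]) / 2, (p[1] + q[1]) / 2
    h = math.sqrt(max(r * r - (d / 2) ** 2, 0.0))
    ux, uy = (q[1] - p[1]) / d, (p[0] - q[0]) / d  # unit normal to segment pq
    return [(mx + h * ux, my + h * uy), (mx - h * ux, my - h * uy)]


def feasible(objects, query, D, tol=1e-9):
    """Is there a circle of diameter D whose objects cover all query keywords?"""
    r = D / 2
    centers = [loc for loc, _ in objects]
    for i, (p, _) in enumerate(objects):
        for q, _ in objects[i + 1:]:
            centers += centers_through(p, q, r)
    for c in centers:
        covered = set()
        for loc, kws in objects:
            if math.dist(c, loc) <= r + tol:
                covered |= kws
        if query <= covered:
            return True
    return False


def skec_diameter(objects, query, eps=1e-3):
    """Binary search the SKEC diameter to within eps; None if q is not coverable."""
    query = set(query)
    locs = [loc for loc, _ in objects]
    # A circle centred on any object with this diameter encloses every object.
    hi = 2 * max((math.dist(a, b) for a in locs for b in locs), default=0.0)
    if not feasible(objects, query, hi):
        return None
    lo = 0.0
    while hi - lo > eps:
        mid = (lo + hi) / 2
        if feasible(objects, query, mid):
            hi = mid   # a valid group exists: try a smaller diameter
        else:
            lo = mid   # no valid group: smaller diameters cannot work either
    return hi


pois = [
    ((0.0, 0.0), {"carpark"}),
    ((2.0, 0.0), {"shop"}),
    ((1.0, 0.0), {"hotel"}),
    ((100.0, 100.0), {"shop", "hotel"}),
]
d_skec = skec_diameter(pois, {"carpark", "shop", "hotel"})
```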

Exact algorithm for m-CK problem
• SKEC can answer m-CK with a factor of $2/\sqrt{3}$ (≈ 1.15)
 Problem: optimal solution may be missed by the sweeping
circle
• Solution:
1. Enlarge the circle by a factor of $2/\sqrt{3}$.
 Lemma: optimal solution must be covered by the circle
2. Sweep the circle as we do for finding SKEC.
3. Do exhaustive search in each valid circle.
 Work in a reduced search space
 Pruning strategies


Approximation Algorithms
• Baseline:
 Adapted Spatial Group Keyword approximation (ASGKa)
SIGMOD 2013
 Query as part of result
 Enumerate all objects containing the most infrequent
keyword as query
• Our Methods:
2. Smallest Keywords Enclosing Circle approximation
(SKECa+)

Exact Algorithms
• Baselines:
1. Virtual bR*-tree (VirbR), ICDE 2010
 Exhaustive search
2. Adapted Spatial Group Keyword (ASGK), SIGMOD 2013
 Query point as part of result
 Enumerate all objects containing the most infrequent
keyword as query
• Our Method:
 Exact algorithm for m-CK problem (EXACT)

• Datasets
 POIs crawled using the Google Places API
 Geo-tagged tweets within the USA
• Experiments
 Vary number of query keywords
 Vary optimal group diameter bound
 Vary query keywords frequency
 Scalability

Dataset            Number of objects   Unique words   Total words
New York (NY)      485,059             116,546        1,143,013
Los Angeles (LA)   724,952             161,489        1,833,486
Twitter (TW)       1,000,100           487,552        5,170,495

• Vary number of query keywords

• Vary optimal group diameter bound
 Success rate: the fraction of queries answered within the 1-minute timeout threshold

Conclusions
• We proved the m-CK problem is NP-hard.
• We proposed a 2-approximation greedy approach.
• We proposed an algorithm that uses an enclosing circle to approximately find m-CK results with approximation ratio $2/\sqrt{3}$ (≈ 1.15).
• We improved the complexity of this algorithm, with a tight approximation ratio of $2/\sqrt{3} + \epsilon$.
• Based on the idea of Keywords Enclosing Circle,
we designed an exact algorithm.
• Experiments showed the efficiency of all the
proposed algorithms.
