Processing Large Complex Data

Social Data and Multimedia Analytics for News and Events Applications lecture given at 2015 IEEE SPS Italy Chapter Summer School on Signal Processing (S3P)

Published in: Data & Analytics

  1. Processing Large Complex Data: Social Data and Multimedia Analytics for News and Events Applications. Dr. Yiannis Kompatsiaris, ikom@iti.gr, Head, Multimedia, Knowledge and Social Media Analytics Lab, CERTH-ITI. 2015 IEEE SPS Italy Chapter Summer School on Signal Processing (S3P)
  2. S3P 2015, Garda Lake, Italy. Processing Large Complex Data. Overview
     •  Introduction
        –  Motivation – Challenges
     •  Example Use Cases
     •  Research Approaches
        –  Large-Scale Visual Search
        –  Graphs - Community Detection - Clustering
        –  Social Event Detection
        –  Verification
     •  Demos – Applications
        –  MM News Demo
        –  ClustTour
        –  ThessFest
     •  Evaluation - Benchmarking
     •  Conclusions
  3. Introduction: Motivation, Example Applications, Conceptual Architecture, Challenges
  4. Pope Francis, Pope Benedict. 2007: iPhone release. 2008: Android release. 2010: iPad release. http://petapixel.com/2013/03/14/a-starry-sea-of-cameras-at-the-unveiling-of-pope-francis/
  5. http://www.puzzlemarketer.com/digital-social-brands-in-60-seconds/ (Apr. 2012)
  6. Rise of the networks
  7. Social Networks as Graphs. Social web as a graph: nodes = Twitter users; edges = retweets on the #jan25 hashtag; announcement of Mubarak’s resignation. http://gephi.org/2011/the-egyptian-revolution-on-twitter/
  8. Social Networks as Graphs
     “Social networks have emergent properties. Emergent properties are new attributes of a whole that arise from the interaction and interconnection of the parts”
     •  Emotions, health and sexual relationships do not depend just on our connections (e.g. the number of them) but on our position and structure in the social graph
        –  Central – Hub
        –  Outlier
        –  Transitivity (connections between friends)
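The position measures named on this slide (hub, outlier, transitivity) can be illustrated with a minimal sketch. The toy friendship graph and the names `degrees` and `local_clustering` are assumptions made for illustration and do not come from the lecture. Degree counts a node's connections, and the local clustering coefficient is one common way to quantify transitivity: the fraction of a node's friend pairs who are also friends with each other.

```python
from itertools import combinations


def degrees(adj):
    """Number of connections per node."""
    return {node: len(neigh) for node, neigh in adj.items()}


def local_clustering(adj, node):
    """Fraction of a node's neighbour pairs that are themselves connected."""
    neigh = adj[node]
    if len(neigh) < 2:
        return 0.0
    pairs = list(combinations(sorted(neigh), 2))
    closed = sum(1 for a, b in pairs if b in adj[a])
    return closed / len(pairs)


# Hypothetical undirected friendship graph, stored as adjacency sets.
edges = [("ann", "bob"), ("ann", "cat"), ("ann", "dan"),
         ("bob", "cat"), ("dan", "eve")]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

deg = degrees(adj)
hub = max(deg, key=deg.get)      # most connected node
outlier = min(deg, key=deg.get)  # least connected node
```

In this toy graph `ann` has the most connections but a low clustering coefficient, while `bob` has fewer connections that are all mutually linked, which is the distinction between number of connections and position that the slide draws.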
  9. Social Networks as Real-Life Sensors
     •  Social networks are a data source with an extremely dynamic nature that reflects events and the evolution of community focus (users’ interests)
     •  The high penetration of smartphones and mobile devices provides real-time and location-based user feedback
     •  Transform individually rare but collectively frequent media into meaningful topics, events, points of interest, emotional states and social connections
     •  Present them efficiently for a variety of applications (news, marketing, science, health, entertainment)
  10. Social Media aspects: Caption, Time, User Profile, Favs, Comms, Tags
  11. Examples - Science
      Xin Jin, Andrew Gallagher, Liangliang Cao, Jiebo Luo, and Jiawei Han. The wisdom of social multimedia: using Flickr for prediction and forecast, International Conference on Multimedia (MM '10). ACM.
      “…if you're more than 100 km away from the epicenter [of an earthquake] you can read about the quake on Twitter before it hits you…”
      Many Twitter examples at: What can Twitter tell us about the real world? Twitter and the Real World, CIKM'13 Tutorial, https://sites.google.com/site/twitterandtherealworld/home
  12. Examples - Science
  13. Examples - Science. Be careful of correlation diagrams
  14. Example – News (Boston bombing)
      “Following the Boston Marathon bombings, one quarter of Americans reportedly looked to Facebook, Twitter and other social networking sites for information, according to The Pew Research Center. When the Boston Police Department posted its final “CAPTURED!!!” tweet of the manhunt, more than 140,000 people retweeted it.”
      “Authorities have recognized that one of the first places people go in events like this is to social media, to see what the crowd is saying about what to do next”
      “I have been following my friend's Facebook [account] who is near the scene and she is updating everyone before it even gets to the news”
  15. Example – Crisis – Humanitarian (Syria)
      Syria Tracker offers a crisis mapping system that uses crowdsourced text, photo and video reports and data mining techniques, forming a live map of the Syrian conflict since March 2011
      …stream of content-filtered media from news, social media (Twitter and Facebook) and official sources
  16. Events - Festivals. http://www.eventmanagerblog.com/uploads/2012/12/event-technology-infographic.jpg
  17. Many other examples: smellymaps. Smell-related words in geo-located social media. http://researchswinger.org/smellymaps/
  18. Conceptual Architecture (diagram)
      Stages: CRAWLING, INDEXING, MINING, RANKING, VISUALIZATION
      Component labels: API Wrapper, Website Wrapper, Scheduler, Visual Indexing, Near-duplicates, Text Indexing, Media Fetcher, SNA, Sentiment - Influence, Trends - Topics, Model Building, Concepts, Relevance, Diversity, Popularity, Veracity, Crawling Specs, Sources, Interaction, Responsiveness, Aggregation, Aesthetics
  19. Challenges – Content (Mining)
      •  Multi-modality: e.g. image + tags, video, audio
      •  Rich social context: spatio-temporal, social connections, relations and social graph
      •  Specific messages: short, conversations, errors, no context
      •  Inconsistent quality: noise, spam, fake, propaganda
      •  Huge volume: massively produced and disseminated
      •  Multi-source: may be generated by different applications and user communities
      •  Dynamic: fast updates, real-time
  20. Policy – Licensing – Legal challenges
      •  Fragmented access to data
         –  Separate wrappers/APIs for each source (Twitter, Facebook, etc.)
         –  Different data collection/crawling policies
      •  Limitations imposed by API providers (“Walled Gardens”)
         –  Full access to data impossible or extremely expensive (e.g. see data licensing plans for GNIP and DataSift)
         –  Non-transparent data access practices (e.g. access is provided to an organization/person if they have a contact in Twitter)
      •  Constant change of model and ToS of social APIs
         –  No backwards compatibility, additional development costs
      •  Ephemeral nature of content
         –  Social search results often lead to removed content → inconsistent and unreliable referencing
      •  User Privacy & Purpose of use
         –  Fuzzy regulatory framework regarding mining user-contributed data
  21. Example Use Cases: Events and News
  22. SocialSensor Project Objective
      SocialSensor quickly surfaces trusted and relevant material from social media – with context.
      DySCO: behaviour, location, time, content, usage, social context
      Diagram labels: Massive social media and unstructured web; Social media mining; Aggregation & indexing; News - Infotainment; Personalised access; Ad-hoc P2P networks
  23. “It has changed the way we do news” (MSN) “Social media is the key place for emerging stories – internationally, nationally, locally” (BBC) “Social media is transforming the way we do journalism” (New York Times) Source: picture alliance / dpa
  24. “It’s really hard to find the nuggets of useful stuff in an ocean of content” (BBC) “Things that aren’t relevant crowd out the content you are looking for” (MSN) “The filters aren’t configurable enough” (CNN) Source: Getty Images
  25. Verification was simpler in the past... Source: Frank Grätz
  26. News Use Case Requirements
      Quickly surface trusted and relevant material from social media – with context.
      •  “quickly”: in real time
      •  “surfaces”: automatically discovers, clusters and searches
      •  “trusted”: automatic support in verification process
      •  “relevant”: to the specific event
      •  “material”: any material (text, image, audio, video = multimedia), aggregated with other sources (e.g. web)
      •  “social media”: across all relevant social media platforms
      •  “with context”: location, time, sentiment, influence
  27. Infotainment
      •  Events with large numbers of visitors
      •  Thessaloniki International Film Festival
         –  80,000 viewers / 100,000 visitors in 10 days
         –  150 films, 350 screenings
      •  Discovery and presentation of relevant aggregated social media
         –  Trending Topics
         –  Sentiment
         –  Tweet – film matching
         –  Visualization (Social Walls)
  28. Conceptual Architecture and Main components
      Diagram labels: SEMANTIC MIDDLEWARE; Public Data; SEARCH & RECOMMENDATION; USER MODELLING & PRESENTATION; INDEXING; MINING; STORAGE; DATA COLLECTION / CRAWLING
      •  Real-time dynamic topic and event clustering
      •  Trend, popularity and sentiment analysis
      •  Calculate trust/influence scores around people
      •  Personalized search, access & presentation based on social network interactions
      •  Semantic enrichment and discovery of services
  29. Research Approaches: Large-Scale Visual Search; Graphs – Clustering/Community Detection; Visual Event Summarization; Social Media Verification
  30. Scalable visual feature aggregation & indexing
      •  Problem: Example-based image search
         –  Find images that depict the same or a similar object or scene as a given query image
         –  Possibly seen from different viewpoints, with occlusions and clutter
      •  Challenge: Large-scale
         –  Searching databases with tens of millions of images
         –  Objectives to be fulfilled:
            •  Sufficient discriminative power
            •  Fast response times
            •  Efficient memory usage
  31. Large-scale visual search (diagram)
      Labels: image collection from social media/Web; image local feature extraction; feature aggregation; feature indexing; kNN visual similarity search; concept-based image annotation; image clustering; image (geo)tagging; concept-based search/filtering; duplicate detection
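As a point of reference for the kNN visual similarity search step, here is a minimal sketch of the exhaustive baseline: every indexed vector is compared with the query and the k closest are kept. The two-dimensional vectors, the image ids and the function name `knn` are hypothetical; real aggregated descriptors have far more dimensions.

```python
import heapq
import math


def knn(query, index, k):
    """Return the k (distance, image_id) pairs closest to `query` (Euclidean)."""
    scored = ((math.dist(query, vec), image_id)
              for image_id, vec in index.items())
    return heapq.nsmallest(k, scored)


# Hypothetical index: one aggregated feature vector per image.
index = {
    "img_a": [0.0, 0.0],
    "img_b": [1.0, 0.0],
    "img_c": [0.0, 3.0],
    "img_d": [5.0, 5.0],
}

neighbours = knn([0.9, 0.1], index, k=2)
neighbour_ids = [image_id for _, image_id in neighbours]
```

The cost grows linearly with the number of indexed images, which is why compact codes and approximate search matter at the scale of tens of millions of images.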
  32. 32. S3P  2015,  Garda  Lake,  Italy      Processing  Large  Complex  Data  #32   Framework   •  Implementa8on  and  evalua8on  of  the  effec8veness   of  VLAD  in  combina8on  with  SURF   •  Scalable  image  indexing   E.  Spyromitros-­‐Xioufis,  S.  Papadopoulos,  Y.  Kompatsiaris,  G.   Tsoumakas,  I.  Vlahavas,  "A  Comprehensive  Study  over  VLAD  and   Product  Quan8za8on  in  Large-­‐scale  Image  Retrieval",  IEEE   Transac8ons  on  Mul8media  16(6),  pp.  1713-­‐1728,  October  2014.   image   local   descriptor   extrac8on   descriptor   aggrega8on   dimensionality   reduc8on  set  of  local   descriptors   fixed  size   vector   encoding  &   indexing   low  dimensional     vector   SIFT  /  SURF   BOW  /  VLAD   PCA   PQ  +  ADC/IVFADC  
  33. Scalable indexing of features
      •  ADC 16x8 requires 16 bytes per image
         –  ~67M images per GB
      •  IVFADC requires 4 additional bytes per image
         –  ~53.6M images per GB
      •  The current implementation achieves only half of the above numbers because it uses short int[] instead of byte[]; this can be improved.
      •  Ideally, 1 billion images could be indexed on a server with 20GB of RAM (projection).
      •  Query time (for 1M vectors):
         –  Exhaustive search of VLAD vectors (d' = 128): 0.50 sec
         –  Product Quantization with ADC 16x8: 0.10 sec (5x faster)
         –  Product Quantization with IVFADC 16x8: 0.02 sec (25x faster)
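The capacity figures on this slide follow from simple arithmetic. The sketch below reproduces them under one assumption that the slide does not state: "GB" is read as 2**30 bytes, which is the reading that yields ~67M. The constant and function names are illustrative only.

```python
# Back-of-the-envelope check of the index sizes quoted on the slide.
# Assumption: one "GB" is 2**30 bytes (this reproduces the ~67M figure).
GIB = 2 ** 30

ADC_BYTES = 16                  # 16 sub-quantizers x 8 bits each = 16 bytes
IVFADC_BYTES = ADC_BYTES + 4    # 4 additional bytes per image


def images_per_gib(bytes_per_image: int) -> int:
    """Number of image codes that fit into one GiB of RAM."""
    return GIB // bytes_per_image


def ram_gib(num_images: int, bytes_per_image: int) -> float:
    """RAM in GiB needed to hold the codes of num_images images."""
    return num_images * bytes_per_image / GIB


adc_capacity = images_per_gib(ADC_BYTES)        # about 67 million
ivfadc_capacity = images_per_gib(IVFADC_BYTES)  # about 53.7 million
billion_ram = ram_gib(10 ** 9, IVFADC_BYTES)    # a little under 20 GiB
```

With 20 bytes per image, one billion codes occupy 20 * 10**9 bytes, which is where the 20GB projection comes from.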
  34. VLAD+SIFT vs. VLAD+SURF: Accuracy vs. dimensionality
      •  VLAD+SURF improves on VLAD+SIFT and FV+SIFT across all dimensions in both the Holidays and Oxford datasets
      Results in rows starting with * are taken from Jégou et al., 2011, hence the missing values for some entries.
      SIFT corresponds to PCA-reduced SIFT, which yielded better results than standard SIFT in Jégou et al., 2011.
  35. Clustering – Community Detection
  36. graph
      G = (V, E), with V the nodes and E the edges
      An abstract data type representing relationships or connections
  37. Some Examples
      Webpage www.x.com: href="www.y.com", href="www.z.com"
      Webpage www.y.com: href="www.x.com", href="www.a.com", href="www.b.com"
      Webpage www.z.com: href="www.a.com"
      Resulting graph nodes: x, y, z, a, b
  38. Biology example
      Nodes – Proteins
      Edges – Interactions
      Visualization plays an important role
  39. blogosphere as a graph
      nodes = blogs
      edges = hyperlinks
      Figure labels: technical - gadgets; society - politics
      http://datamining.typepad.com/gallery/blog-map-gallery.html
  40. social web as a graph
      nodes = Twitter users
      edges = retweets on #jan25 hashtag
      Figure label: announcement of Mubarak's resignation
      http://gephi.org/2011/the-egyptian-revolution-on-twitter/
  41. emerging structures
      •  graphs on the web present certain structural characteristics
      •  groups of nodes interacting with each other → dense inter-connections → functional/topical associations
      •  what can we gain by studying them?
         –  topic analysis
         –  photo clustering
         –  improved recommendation methods
         –  detect influencers
  42. Community and graphs
      Communities correspond to groups of nodes on a graph that share common properties or have a common role in the organization/operation of the system.
      S. Fortunato, C. Castellano. Community structure in graphs. arXiv:0712.2716v1, Dec 2007.
  43. Community types
      •  Pairs of nodes are more likely to be connected if they are both members of the same community, and less likely to be connected if they do not share communities.
      •  explicit
         –  the result of conscious human decision
      •  implicit
         –  emerging from the interactions & activities of users
         –  need special methods to be discovered
         –  Community detection, partition, clustering
  44. communities and graphs
      •  Often communities are defined with respect to a graph, G = (V, E), representing a set of objects (V) and their relations (E).
      •  Even if such a graph is not explicit in the raw data, it is usually possible to construct one, e.g. feature vectors → distances → thresholding → graph
      •  Given a graph, a community is defined as a set of nodes that are more densely connected to each other than to the rest of the network nodes.
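The construction "feature vectors → distances → thresholding → graph" can be sketched in a few lines. This is a minimal illustration, assuming Euclidean distance and a hand-picked threshold; the function name `threshold_graph` and the toy points are not from the slides.

```python
import math
from itertools import combinations


def threshold_graph(vectors, max_dist):
    """Link every pair of items whose Euclidean distance is at most max_dist.

    Returns an adjacency dict {node index: set of neighbour indices}.
    """
    adj = {i: set() for i in range(len(vectors))}
    for i, j in combinations(range(len(vectors)), 2):
        if math.dist(vectors[i], vectors[j]) <= max_dist:
            adj[i].add(j)
            adj[j].add(i)
    return adj


# Two well-separated groups of toy feature vectors.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0)]
graph = threshold_graph(points, max_dist=0.5)
```

In this toy case the two groups end up as two disconnected, internally dense sets of nodes, which matches the density-based definition of a community given on the slide.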
  45. communities and graphs - example
      Figure labels: inter-community edge; intra-community edge
  46. community attributes
      overlap, weighted participation, roles, hierarchy, evolution
  47. graph cuts
      •  Given nodes u and v of graph G = (V, E), a cut is a set of edges C ⊂ E such that the two nodes are unconnected on the graph G' = (V, E − C).
      •  Using s to denote a "source" node and t to denote a "terminal" node, a cut (S, T) of G = (V, E) is a partition of V into sets S and T = V − S, such that s ∈ S and t ∈ T.
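To make the (S, T) definition concrete, the following sketch lists the edges that cross a given partition. The five-edge graph and the helper name `cut_edges` are made up for illustration.

```python
def cut_edges(edges, S):
    """Return the edges with exactly one endpoint in S.

    These are the edges crossing the cut (S, T) with T = V - S.
    """
    S = set(S)
    return [(u, v) for u, v in edges if (u in S) != (v in S)]


# Toy undirected graph with source "s" and terminal "t".
edges = [("s", "a"), ("s", "b"), ("a", "b"), ("a", "t"), ("b", "t")]
crossing = cut_edges(edges, S={"s", "a"})
```

Removing the returned edges leaves no path from any node in S to any node in T, so s and t become unconnected, as required by the first definition on the slide.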
  48. Modularity maximization
      •  A graph can be split into communities in numerous ways, i.e. for each graph there are many possible community structures. In the simple case, a community structure is defined as a graph partition into a set of node sets C = {C_i}
      •  To provide a measure of the quality of a community structure, we make use of modularity.
      •  The modularity maximization method detects communities by searching over possible divisions of a network for one or more that have particularly high modularity.
      •  Modularity quantifies the extent to which a given graph partition into communities presents a systematic tendency to have more intra-community links than the same community structure would present if the links were rewired under the ER (Erdős–Rényi) graph model.
  49. graph degrees
      node degree: deg(v_i) = k_i = number of neighbors
      In directed graphs, we differentiate between in- and out-degree.
      adjacency matrix: A_ij = link between nodes i and j
         0 → no link
         1 → link
         α → link with weight equal to α
  50. Degrees & Adjacency
      Figure: undirected graph on nodes v1, v2, v3, v4, v5
      Adjacency matrix on an undirected graph: A(i, j), i, j <= n
      Degree of a vertex v (number of edges incident upon it):
      $k_v = \sum_{w} A(v, w)$
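The degree of a vertex is the corresponding row sum of the adjacency matrix. A minimal sketch follows; the edge set is invented for illustration and is not the graph drawn on the slide.

```python
# Undirected example graph on five vertices v1..v5 (list indices 0..4).
# Assumed edges: v1-v2, v1-v3, v2-v3, v3-v4, v4-v5 (illustrative only).
A = [
    [0, 1, 1, 0, 0],
    [1, 0, 1, 0, 0],
    [1, 1, 0, 1, 0],
    [0, 0, 1, 0, 1],
    [0, 0, 0, 1, 0],
]


def degree(A, v):
    """k_v = sum over w of A(v, w): number of edges incident upon v."""
    return sum(A[v])


degrees = [degree(A, v) for v in range(len(A))]
num_edges = sum(degrees) // 2  # each undirected edge is counted twice
```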
  51. modularity computation
      •  Modularity is computed as follows:
      $Q = \frac{1}{2m} \sum_{i,j} \left( A_{ij} - \frac{k_i k_j}{2m} \right) \delta(c_i, c_j)$
         –  A_ij: adjacency matrix
         –  k_i: degree of node i
         –  c_i: community of node i
         –  δ(c_i, c_j) = 1 if i, j belong to the same community
         –  m: number of edges on the graph
      Annotations: the A_ij term counts the observed number of intra-community edges; the k_i k_j / 2m term is the expected number of edges between i and j if edges are placed randomly.
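The modularity formula can be evaluated directly from an adjacency matrix and a community assignment. The sketch below is a straightforward O(n^2) transcription for small unweighted, undirected graphs; the example graph (two triangles joined by one edge) is an assumption chosen because its value is easy to verify by hand.

```python
def modularity(A, community):
    """Q = 1/(2m) * sum_ij (A_ij - k_i*k_j/(2m)) * delta(c_i, c_j)."""
    n = len(A)
    k = [sum(row) for row in A]   # node degrees
    two_m = sum(k)                # 2m: every edge contributes to two degrees
    q = 0.0
    for i in range(n):
        for j in range(n):
            if community[i] == community[j]:   # delta(c_i, c_j) = 1
                q += A[i][j] - k[i] * k[j] / two_m
    return q / two_m


# Two triangles {0,1,2} and {3,4,5} connected by the single edge 2-3.
A = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]
q_split = modularity(A, [0, 0, 0, 1, 1, 1])   # one community per triangle
q_single = modularity(A, [0, 0, 0, 0, 0, 0])  # everything in one community
```

For this graph m = 7, and splitting along the triangles gives Q = 5/14 (about 0.36), while putting every node into a single community gives Q = 0.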
  52. modularity - example
      •  In a random graph (ER model), we expect that any possible partition would lead to Q = 0.
      •  Typically, in non-random graphs modularity takes values between 0.3 and 0.7.
      Figure labels: Q = 0.60, clear community structure; Q = 0.37, fuzzy communities
  53. Modularity maximization approaches
      •  Exhaustive search over all possible divisions is usually intractable
      •  Algorithms based on approximate optimization
         –  greedy algorithms
         –  simulated annealing
         –  spectral optimization
         –  local-based optimization
      •  Trade-off between speed and accuracy
  54. other means to define communities
      •  other community-ness measures:
         –  conductance
         –  density
      •  definitions to satisfy
         –  each member should be connected to more nodes within the community than to nodes outside it
         –  each member should be connected to all other members (k-clique)
      •  result of a process
         –  if I start removing edges in a certain order, the graph will break into pieces → communities
  55. graph partition
      •  Given a graph G = (V, E), find a partition of V into k disjoint subsets, such that the number of edges in E whose endpoints belong to different subsets is minimized.
      •  Various solutions: Kernighan-Lin algorithm [Kernighan70], spectral bisection [Pothen90].
      •  Multi-level partition (METIS) [Karypis99]: Repeated application of bisection until the graph is partitioned into k parts, under constraints on the sizes of the subsets.
      •  Not a satisfactory solution, since the number of communities needs to be provided as input to the algorithm. Sometimes even the community sizes need to be provided as inputs.
      [Kernighan70] B. W. Kernighan, S. Lin. An Efficient Heuristic Procedure for Partitioning of Electrical Circuits. Bell System Technical Journal, Vol. 49, No. 2, pp. 291-307, February 1970.
      [Pothen90] A. Pothen, H. D. Simon and K.-P. Liou. Partitioning sparse matrices with eigenvectors of graphs. SIAM Journal on Matrix Analysis and Applications, 11: 430-452, 1990.
      [Karypis99] G. Karypis and V. Kumar. A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM J. Sci. Comput. 20 (1): 359–392, 1999.
  56. taxonomy
      S. Papadopoulos, Y. Kompatsiaris, A. Vakali, P. Spyridonos. "Community detection in Social Media". In Data Mining and Knowledge Discovery, Springer, 2011
  57. subgraph discovery (structure) 1
      •  k-clique
      •  N-clique
      •  k-core
      Figure labels: k=3 (triangle), k=4, k=5; N=2 (star); 0-core, 1-core, 2-core, 3-core, 4-core
  58. subgraph discovery 2
      •  (μ,ε)-core:
         –  based on the concept of structural similarity
      Figure labels: (μ,ε)-core with μ = 5, ε = 0.72; (μ,ε)-core with μ = 6, ε = 0.675; hub; outlier; percentage of common neighbors for each edge
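A small sketch of the structural similarity behind the (μ,ε)-core follows. It assumes the usual SCAN formulation, in which the neighborhood of a node includes the node itself and similarity is the shared neighborhood size divided by the geometric mean of the two neighborhood sizes; the slide itself only shows the resulting per-edge percentages. The helper names and the toy graph are illustrative.

```python
import math


def structural_similarity(adj, u, v):
    """Shared closed neighbourhood, normalised by the geometric mean of sizes."""
    nu = adj[u] | {u}
    nv = adj[v] | {v}
    return len(nu & nv) / math.sqrt(len(nu) * len(nv))


def is_core(adj, u, mu, eps):
    """u is a (mu, eps)-core if at least mu nodes of its closed
    neighbourhood (u included) have similarity >= eps with u."""
    closed = adj[u] | {u}
    similar = [v for v in closed if structural_similarity(adj, u, v) >= eps]
    return len(similar) >= mu


# Two triangles joined by the edge 2-3.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
sim_inside = structural_similarity(adj, 0, 1)   # same triangle
sim_bridge = structural_similarity(adj, 2, 3)   # the bridging edge
```

Edges inside a triangle score high, while the bridging edge scores low, so a threshold ε between the two separates the dense groups.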
  59. centrality
      •  Betweenness centrality
         –  Being in many shortest paths
      •  Closeness
         –  Being close to many nodes
      •  Eigenvector centrality
         –  End of many paths
      •  Degree centrality
         –  High degree
      https://commons.wikimedia.org/wiki/File:6_centrality_measures.png#/media/File:6_centrality_measures.png
      Carlos Castillo, Social Media Mining and Retrieval, http://www.slideshare.net/ChaToX/social-media-mining-and-retrieval
  60. divisive - use of edge centrality
      •  Find edges that stand between communities.
      •  Progressively remove more "central" edges until the graph breaks into separate communities.
      •  As the graph splitting progresses, new communities emerge that are assigned to a hierarchical structure.
      •  Edge centrality is defined similarly to node centrality:
      $bc(v) = \sum_{s \neq v \neq t \in V} \frac{\sigma_{s,t}(v)}{\sigma_{s,t}}$
         –  σ_{s,t}(v): number of shortest paths from node s to t that include node v
         –  σ_{s,t}: total number of shortest paths from s to t
      Betweenness centrality quantifies the number of times a node acts as a bridge along the shortest path between two other nodes.
      Figure: depiction of node centrality, red (min) → blue (max)
  61. Girvan - Newman algorithm
      •  The GN algorithm is one of the most important algorithms, stimulating a whole wave of community detection methods.
      •  Basic principle:
         –  Compute betweenness centrality for each edge.
         –  Remove the edge with the highest score.
         –  Re-compute all scores.
         –  Repeat from the 2nd step.
      •  Complexity: O(n^3)
      •  Many variations have been presented to improve precision by use of different betweenness measures, or to reduce complexity, e.g. by sampling or local computations.
      Girvan, M., Newman, M.E.J. "Community structure in social and biological networks". In Proceedings of the National Academy of Sciences, U.S.A. 99(12), 7821–7826, 2002
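The basic principle above can be sketched as follows for small unweighted, undirected graphs. This is an illustrative implementation rather than the authors' code: edge betweenness is computed with a Brandes-style breadth-first accumulation, and the loop stops at the first split instead of building the full hierarchy. All function names are assumptions.

```python
from collections import deque


def edge_betweenness(adj):
    """Shortest-path betweenness of every edge of an unweighted graph."""
    score = {frozenset((u, v)): 0.0 for u in adj for v in adj[u]}
    for s in adj:
        # BFS from s: count shortest paths (sigma) and record predecessors.
        dist = {s: 0}
        sigma = {v: 0 for v in adj}
        sigma[s] = 1
        preds = {v: [] for v in adj}
        order = []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Walk back from the farthest nodes, spreading path shares over edges.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                share = sigma[v] / sigma[w] * (1.0 + delta[w])
                score[frozenset((v, w))] += share
                delta[v] += share
    # Every unordered pair (s, t) was visited from both endpoints.
    return {e: b / 2.0 for e, b in score.items()}


def components(adj):
    """Connected components as a list of node sets."""
    seen, parts = set(), []
    for s in adj:
        if s in seen:
            continue
        part, stack = set(), [s]
        while stack:
            v = stack.pop()
            if v not in part:
                part.add(v)
                stack.extend(adj[v] - part)
        seen |= part
        parts.append(part)
    return parts


def girvan_newman_split(adj):
    """Remove highest-betweenness edges until a new component appears."""
    adj = {v: set(ns) for v, ns in adj.items()}  # work on a copy
    start = len(components(adj))
    while len(components(adj)) == start:
        scores = edge_betweenness(adj)           # re-computed after each removal
        if not scores:
            break
        u, v = max(scores, key=scores.get)
        adj[u].discard(v)
        adj[v].discard(u)
    return components(adj)


# Two triangles joined by the edge 2-3.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
parts = girvan_newman_split(adj)
```

On this toy graph all nine shortest paths between the two triangles pass through the edge 2-3, so it is removed first and the graph splits into the two triangles. Applying the same step recursively to each part would yield the hierarchical structure mentioned on the previous slide.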
  62. Girvan - Newman (example)
      Social network of Zachary's karate club
      Hierarchical community structure detected by the algorithm.
  63. Visual Event Summarization on Social Media using Topic Modelling and Graph-based Ranking Algorithms
  64. Large-scale real world events (1)
      •  Long-running events → consist of several sub-events, e.g. the 10 days of the Sundance Film Festival include opening and awards ceremonies, screenings etc.
      •  Many involved persons use social media → huge amount of event-related micro-blogging messages
      •  A growing number of these messages carry multimedia content
      •  The existence of an image in a micro-post can convey a much better impression of the specific moment of the ongoing event
  65. Large-scale real world events (2)
      •  #nbafinals → 2.6M tweets in one month
      •  #BaltimoreRiots, 29 April - 2 May 2015 → 1.3M tweets in 5 days
      •  E3 conference 2015, 16-18 June: >5M tweets before the conference, 2M tweets during the conference; new game releases → multimedia content
  66. Large-scale real world events (3)
      But…
      •  the huge number of messages makes it very challenging for interested users to monitor the evolution of the event
      •  many messages can be considered as spam or non-informative
      •  In the case of multimedia: internet memes, screenshots, images of low quality…
      •  Redundancy due to near-duplicate messages and images
  67. Large-scale real world events (4)
      #nbafinals
      Figure labels: Irrelevant; Duplicates with no explicit association; Non-informative
  68. Visual Event Summarization
      Assumption: an event-related collection is available.
      Visual Event Summarization is the problem of selecting a concise set of images that are highly relevant to the event and visually convey the key aspects of the event.
      Diagram: list of all event images → Event-based Visual Summarizer → set of selected representative and diverse images
  69. Existing Approaches: Text-based
      Radev et al. (2004)
      •  summary consists of messages that are closest to their tf·idf centroid
      Erkan et al. (2004), LexRank & Mihalcea et al. (2004), TextRank
      •  finding salient sentences by using the centrality of each sentence in a similarity graph
      •  adapted for multi-document summarization using each message as a sentence
      •  outperforms the naïve centroid-based approach
      Shen et al. (2013)
      •  mixture model to detect sub-events at participant level
      •  tf·idf centroid to find a summary of each sub-event
      Chakrabarti and Punera (2011)
      •  Hidden Markov Model to obtain a time-based segmentation of tweets
      •  tf·idf centroid to find a summary of each time segment
  70. Existing Approaches: Multimedia
      Bian et al. (2013)
      •  multimodal extension of LDA
      •  textual and visual features
      Lin et al. (2012)
      •  multi-graph of objects capturing visual, textual and temporal proximity
      •  time-ordered sequence of important objects via graph optimization
      McParlane et al. (2014) – state-of-the-art baseline
      •  visual features + SVM to discard irrelevant images
      •  clustering in subtopics and selection of images for each subtopic based on popularity and specificity
  71. MGraph: Framework Overview
      1.  create message multi-graph using textual, visual and temporal proximity
      2.  find underlying topics using SCAN algorithm
      3.  calculate prior scores of images based on topics and popularity (relevance)
      4.  diversify using DivRank
  72. Pre-processing / Filtering
      Text-based filtering
      •  heuristic rules for spam filtering → discard very short messages & messages with many mentions, URLs or hashtags
      •  filtering of unstructured messages using POS tagging
         Accept → (determiner? adjective* noun+ verb)+
      Visual-based filtering
      •  discard small images
      •  detect and discard memes, screenshots and images containing heavy text
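The POS-based acceptance rule can be expressed as a regular expression over a string of tag classes. The sketch below rests on several assumptions that the slide does not specify: the messages arrive already tagged with Penn Treebank tags by some external tagger, the tag-to-class mapping is as shown, and a message is accepted when the pattern occurs anywhere in it rather than having to cover the whole message.

```python
import re


def tag_class(tag):
    """Collapse a Penn Treebank tag (assumed input) into a one-letter class."""
    if tag == "DT":
        return "D"            # determiner
    if tag.startswith("JJ"):
        return "A"            # adjective
    if tag.startswith("NN"):
        return "N"            # noun
    if tag.startswith("VB"):
        return "V"            # verb
    return "x"                # anything else


# (determiner? adjective* noun+ verb)+ over the one-letter classes.
PATTERN = re.compile(r"(?:D?A*N+V)+")


def is_structured(tagged_tokens):
    """True if the tagged message contains at least one match of the pattern."""
    classes = "".join(tag_class(tag) for _, tag in tagged_tokens)
    return PATTERN.search(classes) is not None


sentence = [("The", "DT"), ("red", "JJ"), ("team", "NN"),
            ("wins", "VBZ"), ("gold", "NN")]
noise = [("lol", "UH"), ("omg", "UH")]
fragment = [("great", "JJ"), ("goal", "NN")]
```

Here `sentence` is accepted, while `noise` and `fragment` are rejected because neither contains a noun group followed by a verb.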
  73. Pre-processing / Filtering
      Figure labels: Text-based filtering (tweet length, POS tagging filtering); Visual-based filtering
  74. Multi-graph Generation (1)
      Given a set of (original) messages M = {m_1, m_2, ..., m_n} we construct a multi-graph G_M = {V, E_textual, E_visual, E_social, E_time}
      •  vertex v_i ∈ V corresponds to message m_i
      •  E_textual → undirected edges expressing the textual similarity (cosine similarity) between nodes (tf·idf vector v_m)
      •  E_visual → undirected edges that represent the visual similarity (L2 distance) between nodes with images (VLAD+SURF vectors)
      Thresholding: add an edge in E_textual or E_visual only if the textual or visual similarity between the corresponding nodes is higher than th_textual or th_visual respectively
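The textual edge set can be illustrated with a small tf·idf and cosine similarity sketch. The tokenization, the idf variant log(n/df) and the function names are assumptions made for the example; the slide only fixes the use of tf·idf vectors, cosine similarity and a threshold.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs: list of token lists -> list of sparse {term: tf * idf} dicts."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return [
        {t: tf * math.log(n / df[t]) for t, tf in Counter(doc).items()}
        for doc in docs
    ]


def cosine(a, b):
    """Cosine similarity of two sparse vectors (0.0 if either is all zeros)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def textual_edges(docs, th_textual):
    """Undirected edges (i, j) whose cosine similarity exceeds th_textual."""
    vecs = tfidf_vectors(docs)
    return [
        (i, j)
        for i in range(len(vecs))
        for j in range(i + 1, len(vecs))
        if cosine(vecs[i], vecs[j]) > th_textual
    ]


messages = [
    ["goal", "messi", "barcelona"],
    ["messi", "goal", "final"],
    ["earthquake", "chile"],
]
loose = textual_edges(messages, th_textual=0.1)
strict = textual_edges(messages, th_textual=0.5)
```

The first two toy messages share two terms and are linked under the loose threshold, while the stricter threshold leaves the graph without textual edges. The visual edge set follows the same thresholding idea on image feature vectors.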
  75. Multi-graph Generation (2)
  76. Example multi-modal sub-graph
  77. Visual deduplication
      •  Visual duplicates for which there is no explicit connection → apply the Clique Percolation Method (CPM) on sub-graph G_visual = {V, E_visual}
      •  Represent detected cliques as single messages:
         –  VLAD aggregation on SURF descriptors of all images in the clique
         –  mean value of publication time
         –  aggregated value of reposts of each message
         –  merged tf·idf vector
      •  Replace clustered messages in G_M with cliques and re-calculate the corresponding edges
  78. Visual deduplication
      Figure labels: G_M, G_visual
  79. Topic Detection
      •  Apply the Structural Clustering Algorithm for Networks (SCAN) → identify dense sub-graphs of messages in G_M
      •  Sub-graphs represent the topics that exist in the stream of messages
      •  Each topic_i contains messages {M_i} and is represented as a merged tf·idf vector V_i
      •  A substantial number of messages is kept outside of the detected clusters
         –  Hubs & outliers are most probably non-informative
         –  They may include valuable information → also considered in the summarization process as single-item clusters
  80. Message Selection Score
      Formula labels (figure): reposts; relevance × cluster size × specificity
  81. Specificity
      High specificity: rare across all topics of the event
      Low specificity: common across topics
  82. Image Ranking & Diversification
      variant of PageRank aiming at diversity
  83. Dataset and Event Description
      •  dataset of McMinn et al., with more than 500 events from different domains
      •  we used the 50 largest events in terms of tweets
      •  sports events (e.g., the Sochi Winter Olympics), political events (Ukraine crisis, Venezuelan protests), disasters, etc.
      •  364,005 tweets, on average 4,730 tweets/event
      •  296,160 remaining tweets, due to suspended accounts and deleted messages
      •  about 3.51% of these, i.e. 12,772 tweets, contain an embedded image
  84. Relevance Judgments
      Each image is shown to 3 participants (20 img - 20 part) without ranking information
      Task Description: You are presented with an image and an event title describing a trending topic in Twitter. For each image and event title, you are asked to answer the following question:
      Is this image relevant to the event?
      1.  The image is clearly not relevant to the event.
      2.  The image is probably not relevant to the event, but I am not entirely sure.
      3.  The image is somewhat relevant to the event, but I have my doubts on whether I would like to see it in a photo coverage of the event.
      4.  The image is clearly relevant to the event, and I would like to see it in a photo coverage of the event.
  85. Experimental Setting
      •  VLAD+SURF extraction
         –  64-dimensional SURF descriptors
         –  four codebooks of 128 visual words (in total 512) to quantize each descriptor
         –  aggregate SURF descriptors into a single vector of 64*512 = 32,768 dimensions using the VLAD scheme
         –  PCA to create a 1024-dimensional L2-normalized reduced vector that represents the visual content of the image
      •  Multi-graph generation
         –  k = 500 nearest neighbors
         –  visual and textual similarity thresholds were set to 0.5 and 0.6
         –  σ^2 of the temporal kernel was empirically set to 24 hours
      •  SCAN parameters were set to μ = 2 and ε = 0.65
      •  DivRank's damping factor was set to d = 0.75
  86. Evaluation metrics (1)
      Precision-oriented metrics
      •  Precision (P@N): The percentage of images among the top N that are relevant (answers 3 & 4) to the corresponding event, averaged among all events. We calculate precision for N equal to 1, 5, and 10.
      •  Success (S@N): Percentage of events where there exists at least one relevant image among the top N returned, for N = 10.
      •  Mean Reciprocal Rank (MRR): Computed as 1/r, where r is the rank of the first relevant image returned, averaged over all events.
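The three precision-oriented metrics are easy to state in code. The sketch below assumes that each event's ranked summary has already been reduced to a list of booleans (True for an image judged relevant, i.e. answers 3 or 4); how the three judgments per image are combined into one flag is not specified on the slide and is left out. Function names and the toy data are illustrative.

```python
def precision_at_n(flags, n):
    """Fraction of the top-n images that are relevant."""
    return sum(flags[:n]) / n


def success_at_n(flags, n):
    """1.0 if at least one of the top-n images is relevant, else 0.0."""
    return 1.0 if any(flags[:n]) else 0.0


def reciprocal_rank(flags):
    """1/r for the rank r of the first relevant image, 0.0 if there is none."""
    for rank, relevant in enumerate(flags, start=1):
        if relevant:
            return 1.0 / rank
    return 0.0


def mean_over_events(metric, events, *args):
    """Average a per-event metric over all events."""
    return sum(metric(flags, *args) for flags in events) / len(events)


# Two toy events with five ranked images each.
events = [
    [True, False, True, False, False],
    [False, False, True, True, True],
]
p_at_5 = mean_over_events(precision_at_n, events, 5)
s_at_5 = mean_over_events(success_at_n, events, 5)
mrr = mean_over_events(reciprocal_rank, events)
```

For the toy data the per-event precisions are 0.4 and 0.6, and the first relevant images appear at ranks 1 and 3.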
  87. Evaluation metrics (2)
      Diversity-oriented metrics
      •  α-normalized Discounted Cumulative Gain: α-nDCG@N measures the usefulness, or gain, of the returned images based on their position in the summary (N = 10).
      •  Average Visual Similarity: AVS@N measures the average visual similarity among all pairs of images in the top N selected images, averaged over all events. Lower AVS values are preferable since they imply higher diversity in terms of visual content.
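AVS@N for a single event can be sketched as the mean pairwise similarity of the top N image vectors. The slide does not say which similarity function is used; cosine similarity is assumed here purely for illustration, and the function names are hypothetical.

```python
import math
from itertools import combinations


def cosine_similarity(a, b):
    """Cosine similarity of two dense vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def avs_at_n(image_vectors, n):
    """Average similarity over all pairs among the top-n image vectors."""
    pairs = list(combinations(image_vectors[:n], 2))
    if not pairs:
        return 0.0
    return sum(cosine_similarity(a, b) for a, b in pairs) / len(pairs)


# Toy summary: two identical images and one orthogonal image.
summary = [(1.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
avs = avs_at_n(summary, 3)
```

Of the three pairs in the toy summary only the duplicate pair has similarity 1, so the score is 1/3; a summary without the duplicate would score 0, which is the preferred direction.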
  88. Baselines
      •  Random: randomly selects N images from the filtered set of images as the summary set
      •  MostPopular: picks the N most popular images in terms of reposts
      •  LexRank: uses the items graph G_M, ranks the nodes using LexRank and selects the top N nodes that contain images
      •  TopicBased: selects the N most relevant messages from the most significant topics (S_cov) (relevance, no specificity & diversity)
      •  P-TWR: ranks images in descending order using the weighting scheme described in McParlane et al. (popularity)
      •  S-TWR: groups the tweets of each event into sub-clusters and selects the highest ranked item of each cluster using the previous weighting scheme (specificity)
  89. Results (1) – Precision oriented metrics
      •  MGraph outperforms all of the competing methods
      •  Popularity-based approach performs well for P@1 but drops significantly for N = 5, 10
      •  LexRank and TopicBased approaches achieve lower but more steady results
      Figure annotation: first relevant in positions 1 - 2
  90. Results: Canada Team in #Sochi
      Figure labels: Popularity-based; S-TWR; MGraph
  91. Results (2) – Diversity oriented metrics
      •  MGraph achieves the best score for α-nDCG@10
      •  Best values of AVS achieved by S-TWR
      •  The worst results in terms of AVS are obtained using LexRank
  92. Results (3)
      Performance of MGraph across different categories
      •  Best P@10 measure is obtained for events about Science & Technology
      •  The second best P@10 is obtained for events about Arts & Entertainment
      •  Difficult to diversify
      •  The best value of AVS is achieved for events about disasters & accidents, e.g., earthquakes
  93. Results (4)
      Impact of the damping factor d on P@10, S@5, MRR and α-nDCG@10
      •  The worst results for all metrics are obtained for d = 0 (no re-ranking)
      •  The best results are achieved for 0.7 < d < 0.8
      •  slight decrease for d > 0.8
      •  more diverse → less relevant
  94. Conclusions
      •  Graph-based approach for visual summaries for real-world events
      •  Maximizes relevance and diversity
      •  Multimodal approach taking into account
         •  Textual content
         •  Visual content
         •  Social
            •  Interactions (replies)
            •  Popularity
         •  Time
      •  Introduction of user related features (e.g. influence)
  95. 95. S3P  2015,  Garda  Lake,  Italy      Processing  Large  Complex  Data   Monitoring  and  intelligence   system  for  Web  mul2media   verifica2on  
  96. 96. S3P  2015,  Garda  Lake,  Italy      Processing  Large  Complex  Data   Can  mul2media  on  the  Web  be  trusted?   #96   Real  photo   captured  April  2011  by  WSJ   but   heavily  tweeted  during  Hurricane  Sandy   (29  Oct  2012)     Tweeted  by  mul8ple  sources  &   retweeted  mul8ple  8mes     Original  online  at:           hJp://blogs.wsj.com/metropolis/2011/04/28/weather-­‐ journal-­‐clouds-­‐gathered-­‐but-­‐no-­‐tornado-­‐damage/    
  97. 97. S3P  2015,  Garda  Lake,  Italy      Processing  Large  Complex  Data   The  Problem   •  Everyone  can  easily  publish  content  on  the  Web   •  Content  can  be  easily  repurposed  and  manipulated   •  News  outlets  are  compe8ng  for  views  and  clicks  à   Pressure  for  airing  stories  very  quickly  leaves  very   liJle  room  for  verifica8on.  à  Very  oten,  even  well-­‐ reputed  news  providers  fall  for  fake  news  content.   •  Mul8ple  tools  and  services  available  for  individual   tasks  à  complex  verifica8on  process   Very  hard  and  2me  consuming  to  check  the  veracity  of   Web  mul2media   #97  
  98. 98. S3P  2015,  Garda  Lake,  Italy      Processing  Large  Complex  Data   Media  REVEALr   •  Developed  within  the  REVEAL  project:              hJp://revealproject.eu/     •  Framework  for  collec8ng,  indexing  and  browsing   mul8media  content  from  the  Web  and  social  media   •  Support  for  verifica8on:   –  Near-­‐duplicate  detec8on  against  an  indexed  collec8on   –  Clustering  of  social  media  posts  by  visual  similarity  à   compara8ve  view  of  the  same  incident   –  Aggrega8on  and  visualiza8on  of  Named  En88es  around  an   incident   #98  
  99. 99. S3P  2015,  Garda  Lake,  Italy      Processing  Large  Complex  Data   Related  Work   •  Majority  of  works  have  focused  on  problem  of  topic   detec8on  and  summariza8on:   –  TwitInfo  (Marcus  et  al.,  2011)   –  TwiJermonitor  (Mathioudakis  &  Koudas,  2010)   –  Meme  detec8on  &  predic8on  (Weng  et  al.,  2014)   •  Visual  memes  and  clustering   –  Visual  meme  tracking  (Xie  et  al.,  2011)   –  Supervised  mul8modal  clustering  (Petkos  et  al.,  2012)   •  Image  manipula8on  tracking   –  Internet  image  archaeology  (Kennedy  &  Chang,  2008)   #99  
  100. Overview of Media REVEALr
      [Layered architecture diagram, with TEXT and VISUAL processing paths]
      • Media collection
      • Media pre-processing & feature extraction
      • Media analysis, mining & indexing
      • Persistence (storage, indexing)
      • Access (API)
      • Visualization, front-end
  101. Named Entity Detection
      • The brevity and noisy nature of text in social media pose a serious challenge
      • Employed solution:
        – Pre-processing: tokenization, user mention resolution, text cleaning
        – Stanford NER + user mention resolution
        – Regular expressions to remove special characters and symbols (e.g., #, @, URLs, etc.)
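The cleaning step on this slide can be sketched as follows. This is a minimal illustration, not the Media REVEALr implementation: the regular expressions, the function name `clean_tweet`, and the `mention_names` lookup table (handle to display name) are all assumptions made for the example.

```python
import re

# Hypothetical patterns; the slide only says that '#', '@', URLs etc. are removed.
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@(\w+)")
SYMBOL_RE = re.compile(r"[^\w\s.,'-]")


def clean_tweet(text, mention_names=None):
    """Strip URLs, resolve @mentions, then drop remaining special symbols."""
    mention_names = mention_names or {}
    # Remove URLs first so that their punctuation is not processed further.
    text = URL_RE.sub(" ", text)
    # Replace @handle with a known display name, otherwise keep the bare handle.
    text = MENTION_RE.sub(
        lambda m: mention_names.get(m.group(1).lower(), m.group(1)), text
    )
    # Remove '#', ':', '!' and any other symbol outside the allowed set.
    text = SYMBOL_RE.sub(" ", text)
    # Collapse the whitespace left behind by the substitutions.
    return " ".join(text.split())


cleaned = clean_tweet(
    "RT @BarackObama: Thanks #Boston! http://t.co/abc",
    {"barackobama": "Barack Obama"},
)
```

The cleaned string can then be passed to a named entity recognizer; resolving the mention before recognition lets a handle such as `@BarackObama` be picked up as a person name.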
  102. Visual Indexing
      • Content-based image retrieval to solve the Near-Duplicate Search (NDS) problem
      • Based on local descriptors (SURF), aggregation (VLAD), dimensionality reduction (PCA), quantization (PQ) and indexing (IVFADC)
      • State-of-the-art visual similarity search
        – High precision/recall
        – Very efficient and scalable implementation (search many millions of images in a few msec, maintain full index in memory using ~1GB/10M images)
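The aggregation step (VLAD) can be illustrated with a small pure-Python sketch. It assumes that local descriptors and a codebook of centroids are already available; the function name `vlad` and the toy 2-D data are illustrative only, and the PCA, PQ and IVFADC stages named on the slide are not shown.

```python
import math


def vlad(descriptors, centroids):
    """Aggregate local descriptors into one L2-normalized VLAD vector."""
    k, d = len(centroids), len(centroids[0])
    acc = [[0.0] * d for _ in range(k)]
    for x in descriptors:
        # Assign the descriptor to its nearest centroid (squared Euclidean distance).
        j = min(
            range(k),
            key=lambda c: sum((xi - ci) ** 2 for xi, ci in zip(x, centroids[c])),
        )
        # Accumulate the residual between the descriptor and that centroid.
        for i in range(d):
            acc[j][i] += x[i] - centroids[j][i]
    # Concatenate the per-centroid residual sums and L2-normalize the result.
    flat = [v for row in acc for v in row]
    norm = math.sqrt(sum(v * v for v in flat))
    return [v / norm for v in flat] if norm > 0 else flat


toy_centroids = [[0.0, 0.0], [10.0, 10.0]]
toy_descriptors = [[1.0, 0.0], [0.0, 1.0], [10.0, 12.0]]
toy_vlad = vlad(toy_descriptors, toy_centroids)
```

As a rough check on the memory figure quoted on the slide, ~1 GB for 10M images corresponds to about 100 bytes per image, which is why the vector has to be reduced and quantized before indexing.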
  103. Improving NDS Resilience (NDS+)
      • Often, NDS performance suffers from overlay graphics and fonts
      • To address this issue, we integrate a descriptor-level classifier that tries to remove the font/graphic descriptors from the VLAD vector
  104. Example: Filtering Out Font Descriptors
      • Assuming that in most cases the classifier is correct, the resulting VLAD vector is of much higher quality compared to the one without filtering
  105. Classifier Details
      • Random Forest used as base classifier
      • Cost Sensitive meta-classifier to penalize misclassification of True Positives
      • Challenge due to Class Imbalance (overlay descriptors << useful image content descriptors)
        – Cost Sensitive meta-classifier performs over-sampling of minority class to balance the training set
      • Training set created by collecting images with overlays (e.g., memes) from the Web and manually annotating them (selecting areas with fonts/overlays)
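The effect of over-sampling the minority class can be sketched as below. This is plain random over-sampling with replacement, chosen here as an assumption for illustration; the slide does not say how the cost-sensitive meta-classifier performs the resampling internally. The function name and the labels `overlay` / `content` are hypothetical.

```python
import random


def oversample_minority(samples, labels, seed=0):
    """Duplicate random minority-class samples until all classes are equal in size."""
    rng = random.Random(seed)
    by_class = {}
    for s, y in zip(samples, labels):
        by_class.setdefault(y, []).append(s)
    # Every class is grown to the size of the largest one.
    target = max(len(items) for items in by_class.values())
    out_samples, out_labels = [], []
    for y, items in by_class.items():
        extra = [rng.choice(items) for _ in range(target - len(items))]
        for s in items + extra:
            out_samples.append(s)
            out_labels.append(y)
    return out_samples, out_labels


# Toy data: 2 overlay descriptors vs. 8 image-content descriptors.
toy_samples = list(range(10))
toy_labels = ["overlay"] * 2 + ["content"] * 8
balanced_samples, balanced_labels = oversample_minority(toy_samples, toy_labels)
```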
  106. Mining: Clustering and Aggregation
      • Visual aggregation
        – DBSCAN on the visual feature representation (PCA-reduced VLAD vectors)
        – Representative element (tweet) selected as the one with the largest number of keywords (expected to carry more information)
      • Entity aggregation
        – NER on individual items
        – Entity categorization (→ Persons, Locations, Organizations)
        – Entity ranking based on frequency of occurrence
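Both aggregation steps can be sketched in a few lines. The code below is a textbook DBSCAN over small toy vectors plus a frequency-based entity ranking; the parameter values, the toy points and the entity tuples are assumptions for illustration, and `math.dist` requires Python 3.8 or later.

```python
import math
from collections import Counter


def dbscan(points, eps, min_pts):
    """Textbook DBSCAN; returns one label per point, with -1 meaning noise."""
    def neighbors(i):
        return [j for j in range(len(points)) if math.dist(points[i], points[j]) <= eps]

    labels = [None] * len(points)  # None = not visited yet
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        nb = neighbors(i)
        if len(nb) < min_pts:
            labels[i] = -1  # provisional noise, may later become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in nb if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point: joins the cluster, not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nb_j = neighbors(j)
            if len(nb_j) >= min_pts:
                queue.extend(nb_j)  # core point: keep growing the cluster
    return labels


def rank_entities(posts):
    """posts: one list of (entity, type) tuples per item; ranks by frequency per type."""
    counts = Counter(e for post in posts for e in post)
    ranked = {}
    for (name, etype), n in counts.most_common():
        ranked.setdefault(etype, []).append((name, n))
    return ranked


toy_points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (50, 50)]
toy_labels = dbscan(toy_points, eps=1.5, min_pts=2)
toy_ranking = rank_entities([
    [("Boston", "LOCATION"), ("FBI", "ORGANIZATION")],
    [("Boston", "LOCATION")],
    [("Obama", "PERSON"), ("Boston", "LOCATION")],
])
```

DBSCAN suits this setting because the number of incidents is not known in advance and isolated images are simply left unclustered as noise.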
  107. User Interface: Collections View
  108. User Interface: Items View & Search
  109. User Interface: Clusters View
  110. User Interface: Entities View
  111. Evaluation: NER
      • Manual annotation of 400 tweets from the SNOW Data Challenge dataset (Papadopoulos et al., 2014)
      • Measure: Accuracy → an instance is considered correct when both entity and type are correctly identified
      • Three competing solutions:
        – Base Stanford NER (S-NER)
        – S-NER + Extensions/Post-processing (S-NER+)
        – Ellogon library (http://www.ellogon.org)
  112. Evaluation: NDS
      • Benchmark Datasets
        – Holidays: 1,491 images, 500 queries (Jegou et al., 2008)
        – Oxford: 5,063 images, 55 queries (Philbin et al., 2008)
        – Paris: 6,412 images, 55 queries (Philbin et al., 2008)
      • Accuracy: mean Average Precision (mAP)
      [Result tables: CLEAN DATASET, NOISY DATASET]
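For reference, mean Average Precision over a set of queries can be computed as in the sketch below. It uses the standard definition with binary relevance; dataset-specific conventions of the benchmark protocols (for example, how the query image itself or ambiguous images are treated) are not modelled, and the toy rankings are made up.

```python
def average_precision(ranked_ids, relevant_ids):
    """Average of precision@rank taken at each rank where a relevant item appears."""
    relevant = set(relevant_ids)
    hits, total = 0, 0.0
    for rank, item in enumerate(ranked_ids, start=1):
        if item in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(runs):
    """runs: list of (ranked_ids, relevant_ids) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


toy_runs = [
    (["a", "x", "b"], ["a", "b"]),  # AP = (1/1 + 2/3) / 2
    (["x", "a"], ["a"]),            # AP = 1/2
]
toy_map = mean_average_precision(toy_runs)
```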
  113. Evaluation: NDS
      • Execution Time (msec)
      • Example (INDEXED IMAGE vs. QUERY IMAGE): NDS: #27, NDS+: #1
  114. Use Cases: Real-world Datasets
      sandy, boston, malaysia, ferry
  115. NDS Use Case (boston)
  116. Clustering Use Case (boston)
      • Visual clustering enables a comparative view and analysis over time (in this case showing increasing confidence in the picture).
      • When journalists see many similar photos of the same scene, they have more confidence that it is real and not fabricated.
  117. Entity Aggregation Use Case (snow)
      LOCATIONS, PERSONS, ORGANIZATIONS
  118. Conclusion
      • Key contributions
        – Framework and web application offering valuable verification support for Web multimedia
        – High-quality individual components for NER, NDS, clustering and aggregation
      • Future Work
        – Incremental image clustering
        – Temporal views to explore the evolution of a story
        – Multimedia forensics toolbox (splice, copy-move detection)
  119. Computational Verification in Social Media
      • Create a computational verification framework to classify tweets with unreliable media content.
      • Events used for experimentation:
        – Fake images posted during the Hurricane Sandy natural disaster
        – Fake images posted during the Boston Marathon bombings
  120. Methodology
      Tweet Extraction
      • Use Topsy to collect tweets containing certain keywords
      Image Indexing
      • Create a predefined set of verified fake and real images
      • Keep the tweets with identical or near-duplicate images
      Feature Extraction
      • Extract Content and User features, and their combination, for each collected tweet
      Dataset
      • Annotate each tweet as fake or real based on the image
      • Keep only tweets written in English, Spanish or German
      Classification
      • Test using a cross-validation approach
      • Test using the two distinct datasets
      • Test using different training and testing datasets
      Content features
      • Length of the tweet
      • Number of words
      • Contains exclamation marks, and their number
      • Contains quotation marks, and their number
      • Whether the text contains an emoticon (happy or sad)
      • Number of uppercase characters
      • Number of hashtags
      • Number of mentions
      • Number of pronouns
      • Number of URLs
      • Number of sentiment words
      • Number of retweets
      User features
      • Username
      • Number of friends
      • Number of followers
      • Ratio of number of followers to number of friends
      • Number of times the user was listed
      • Whether the status of the user contains a URL
      • Whether the user is verified or not
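A subset of the content features listed on this slide can be extracted with plain string operations, as in the sketch below. The emoticon lists, the regular expressions and the feature names are assumptions for illustration; features that need external resources (pronouns, sentiment words, retweet counts) are left out.

```python
import re

# Hypothetical emoticon lists; the slide only distinguishes "happy" from "sad".
HAPPY = (":)", ":-)", ":D")
SAD = (":(", ":-(")


def content_features(text):
    """Compute simple surface features of a tweet's text."""
    return {
        "length": len(text),
        "num_words": len(text.split()),
        "num_exclamation": text.count("!"),
        "num_quotation": text.count('"'),
        "has_happy_emoticon": any(e in text for e in HAPPY),
        "has_sad_emoticon": any(e in text for e in SAD),
        "num_uppercase": sum(1 for c in text if c.isupper()),
        "num_hashtags": len(re.findall(r"#\w+", text)),
        "num_mentions": len(re.findall(r"@\w+", text)),
        "num_urls": len(re.findall(r"https?://\S+", text)),
    }


toy_features = content_features("OMG!! Shark in #NYC :( @cnn http://t.co/x")
```

The resulting dictionary can be turned into a fixed-order feature vector and passed to any of the classifiers compared on the following slides.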
  121. Results
      • Tweet Statistics

                                    Hurricane Sandy   Boston Marathon
        Tweets with URLs                     343939            112449
        Tweets with fake images               10758               281
        Tweets with real images                3540               460

      • Approaches: detection accuracy using the cross-validation approach, classified correctly (%)

        Hurricane Sandy
        Classifier       Content features   User features   Total features
        J48 tree                    81.41           67.72            80.68
        KStar                       81.28           71.16            81.38
        Random Forest               80.59           70.15            80.94

        Boston Marathon
        Classifier       Content features   User features   Total features
        J48 tree                    76.45           70.81            81.25
        KStar                       81.28           74.12            75.78
        Random Forest               78.59           76.15            79.10
  122. Results (2)
      Detection accuracy using different training and testing sets in Hurricane Sandy, classified correctly (%)

        Classifier       Content features   User features   Total features
        J48 tree                    73.79           51.06            65.06
        KStar                       75.30           62.29            53.31
        Random Forest               74.02           63.10            65.96

      Detection accuracy using Hurricane Sandy for training and Boston Marathon for testing, classified correctly (%)

        Classifier       Content features   User features   Total features
        J48 tree                    55.05           50.12            54.10
        KStar                       50.01           50.10            50.97
        Random Forest               58.75           51.03            58.78
  123. Other approaches
      • Graph-based multimodal clustering for social event detection in large collections of images
        – Automatic organization of a multimedia collection into groups of items, each of which corresponds to a distinct event
      • Unsupervised concept learning detection using social media as training data
      • Text analysis for entity matching and sentiment analysis
      • Placing images based on content features
      • Retrieving diverse images for the same entity
  124. Demos - Applications
      MM News Demo, ClustTour, ThessFest
  125. Multimedia Demo
  126. Multimedia Demo Architecture
      [Architecture diagram; host addresses are masked on the slide as 160.xx.xx.N]
      • StreamManager: Twitter, Facebook, Flickr, YouTube, RSS, Instagram
      • MongoDBWrapper
      • TextIndexer (Solr)
      • MediaFetcher, FeatureExtractor (HDFS)
      • Social Focused Crawler (HDFS)
      • FeatureIndexer (HDFS)
      • Data Mining: Visual Clust., Geo Clust., Statistics
      • Web server: API (1), API (2), API (3), API (4)
      • Other labels in the diagram: Nutch, VLAD, IVFADC
  127. MongoDB
      Document-oriented database → support of JSON
      Current stable version: 3.0.6   https://www.mongodb.org/
      Flexible data model → schemaless, useful for social media data that change over time
      Horizontal scaling via shards and replica sets
      Storage of social media items as JSON objects → millions of documents can be handled
      Number of different index types → single field, compound, multikey indexes
      Example: store Facebook posts and index them by publication time and number of likes
      Query: get the most recent posts sorted by popularity (#likes)
      Native support of map-reduce jobs → get the most shared images in a collection of tweets
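The "most shared images" job mentioned on this slide follows the usual map-reduce shape, which the sketch below emulates in memory. It does not call MongoDB; the tweet structure, the `image_urls` field and the function names are hypothetical and only show how the map and reduce steps divide the work.

```python
from collections import defaultdict


def map_tweet(tweet):
    """Map step: emit (image_url, 1) for every image shared in a tweet."""
    for url in tweet.get("image_urls", []):
        yield url, 1


def reduce_counts(key, values):
    """Reduce step: sum the counts emitted for one image URL."""
    return sum(values)


def most_shared_images(tweets, top_n=3):
    """Group the mapped pairs by key, reduce each group, return the top N."""
    grouped = defaultdict(list)
    for tweet in tweets:
        for key, value in map_tweet(tweet):
            grouped[key].append(value)
    reduced = {k: reduce_counts(k, vs) for k, vs in grouped.items()}
    # Sort by descending count, then by URL so that ties are deterministic.
    return sorted(reduced.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]


toy_tweets = [
    {"image_urls": ["a.jpg"]},
    {"image_urls": ["a.jpg", "b.jpg"]},
    {"image_urls": []},
    {"image_urls": ["a.jpg"]},
]
top_images = most_shared_images(toy_tweets)
```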
  128. Apache Solr
      Full-text search platform built on top of Apache Lucene
      Current version: 5.3.0   http://lucene.apache.org/solr/
      Indexing of social media items, e.g. tweets, FB posts, metadata of YouTube videos etc.
      Additional features
      • Faceted search and filtering → get top N per field, e.g. users
      • Spatial index & search → very useful for geo-tagged documents, e.g. tweets
      • Plugin-based architecture → language detection, NLP etc. as steps of the indexing pipeline
      Get tweets containing the name “Barack Obama” OR the phrase “us elections” with geo-location around New York
      SolrCloud → cluster of Solr instances
      Automatic load balancing and fail-over for queries
      ZooKeeper integration for cluster coordination and configuration
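The example query on this slide could be expressed as a Solr request roughly as follows. The sketch only builds the request URL; the core URL, the field names `text` and `location`, the coordinates and the 50 km radius are assumptions for illustration, while `q`, `fq`, `wt` and the `geofilt` filter are standard Solr query parameters.

```python
from urllib.parse import urlencode


def build_solr_query(base_url, lat, lon, radius_km):
    """Build a /select URL: phrase query plus a spatial filter around a point."""
    params = {
        # Either phrase may match in the (hypothetical) 'text' field.
        "q": 'text:("Barack Obama" OR "us elections")',
        # Keep only documents whose 'location' field lies within radius_km of the point.
        "fq": "{!geofilt sfield=location pt=%s,%s d=%s}" % (lat, lon, radius_km),
        "wt": "json",
    }
    return base_url + "/select?" + urlencode(params)


# Approximate coordinates of New York City; the core name 'tweets' is made up.
solr_url = build_solr_query("http://localhost:8983/solr/tweets", 40.7128, -74.006, 50)
```

Putting the spatial constraint in `fq` rather than `q` keeps it out of relevance scoring and lets Solr cache the filter separately.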
  129. Storm
      Distributed real-time computation system   https://storm.apache.org
      Topologies → processing logic
      Stream: unbounded sequence of tuples, e.g. tweets or URLs
      Spouts: sources of streams
      Bolts: processing, filtering, etc.
      Processing of URLs shared in social media → Storm pipeline
      • Expand short URLs
      • Fetch new URLs
      • Extract content, e.g. articles and images
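The URL-processing pipeline on this slide can be mimicked in a single process with Python generators, as sketched below. This is not the Storm API (Storm topologies are normally defined in Java and run distributed); it only illustrates how a spout feeds tuples through a chain of bolts. The lookup table `short_to_long` and the `fetch` callable are stand-ins for real URL expansion and HTTP fetching.

```python
def url_spout(tweets):
    """Spout: emit one (url,) tuple for every URL found in the incoming tweets."""
    for tweet in tweets:
        for url in tweet["urls"]:
            yield (url,)


def expand_bolt(stream, short_to_long):
    """Bolt: replace short URLs with their expanded form when known."""
    for (url,) in stream:
        yield (short_to_long.get(url, url),)


def new_url_bolt(stream, seen):
    """Bolt: let through only URLs that have not been processed before."""
    for (url,) in stream:
        if url not in seen:
            seen.add(url)
            yield (url,)


def extract_bolt(stream, fetch):
    """Bolt: fetch each URL and emit (url, title, images)."""
    for (url,) in stream:
        page = fetch(url)
        yield (url, page["title"], page["images"])


def run_pipeline(tweets, short_to_long, fetch):
    """Wire the spout and bolts together, like a linear topology."""
    stream = url_spout(tweets)
    stream = expand_bolt(stream, short_to_long)
    stream = new_url_bolt(stream, set())
    return extract_bolt(stream, fetch)


toy_tweets = [
    {"urls": ["http://bit.ly/a", "http://x.org/2"]},
    {"urls": ["http://bit.ly/a"]},
]
toy_short = {"http://bit.ly/a": "http://news.example/story"}
toy_fetch = lambda url: {"title": "T:" + url, "images": []}  # stub, no network access
results = list(run_pipeline(toy_tweets, toy_short, toy_fetch))
```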
  130. Redis
      Key-value cache and store
      Current stable version: 3.0   http://redis.io/
      Partitioning → distribution of data among multiple Redis instances
      Values can be strings, hashes, lists, sets, sorted sets, etc.
      Atomic operations: set, increment, push etc.
      Store crawling status of URLs, sharing information of URLs and images
      Additional feature
      • Implementation of the Publisher/Subscriber pattern
      • Communication between different components in a system for social media analytics
  131. City profile creation (ClustTour)
      [Pipeline diagram]
      • PHOTOS & METADATA (example: tags: sagrada familia, cathedral, barcelona; taken: 12 May 2009; lat: 41.4036, lon: 2.1743)
      • COMMUNITY DETECTION (VISUAL, TAG, HYBRID)
      • SPATIAL CLUSTERING + TEMPORAL ANALYSIS
      • CLASSIFICATION TO LANDMARKS/EVENTS (duration, #users / #photos; examples: [2 years, 50 users / 120 photos], [1 day, 2 users / 10 photos])
      S. Papadopoulos, C. Zigkolis, Y. Kompatsiaris, A. Vakali. “Cluster-based Landmark and Event Detection on Tagged Photo Collections”. In IEEE Multimedia Magazine 18(1), pp. 52-63, 2011
  132. City profile creation (ClustTour)
      Community detection on image similarity graphs
      Nodes: photos
      Edges: visual and tag similarity
  134. ThessFest
      • Thessaloniki International Film Festival
      • Support Twitter/comment usage within the app
      • Ratings and comments per film
      • Feedback aggregation
        – Votes
        – Tweets
      • Real-time feedback to the organisation and visitors
  135. Fête de la Musique Berlin app
      • FETEberlin in App Store and Google Play
      • More than 100K visitors
      • About 5K musicians
      • More than 5K app downloads, 25K sessions
      App features
      • Browse and filter detailed program
      • Interactive maps and routing
      • Social sharing
      • Artists’ and stages’ details
      • Social monitoring
      Main benefits for attendees
      • Visitors can browse through maps and don’t get lost, as stages are numerous
      • Event schedule is always available, including per stage
        – Very useful when the server was down and there was no access to the online schedule
  136. Topic analysis
      • Top-10 topics
      • Manual inspection of clusters:
        – 53.8% of topic titles considered informative
        – 98.5% of clusters were found to be “clean”
      • Topics in time
  137. Other Application Areas
      • Science
        – Sociology, machine learning (machine as a teacher), computer vision (annotation)
      • Tourism – Leisure – Culture
        – Off-the-beaten-path POI extraction
      • Marketing
        – Brand monitoring, personalised ads
      • Prediction
        – Politics: election results
      • News
        – Topics, trends, event detection
      • Others
        – Environment, emergency response, energy saving, etc.
  138. Reusable results
      • Starting point: http://www.socialsensor.eu/results
        – Deliverables
        – Publications
        – Datasets
        – Software
        – e-letter: http://stcsn.ieee.net/e-letter/vol-1-no-3
      • Open-source projects (Apache License v2): https://github.com/socialsensor
        – Data collection (stream-manager, storm-focused-crawler)
        – Indexing (framework-client, multimedia-indexing)
        – Mining (topic-detection, multimedia-analysis, community-evolution-analysis, social-event-detection)
  139. Benchmarking - Datasets
  140. dataset: SNOW 2014 Data Challenge
      • A set of ~1M tweets collected using a list of 5000 UK-focused “news hounds” and the keywords “Syria”, “terror”, “Ukraine”, and “bitcoin” for a period of 24 hours starting from Feb 25, 18:00.
      • Average rate: ~720 tweets/minute
      • Number of unique Twitter accounts: ~556K
      • Number of retweets: ~648K
      • Number of replies: ~135K
      • Ground truth topics: http://figshare.com/articles/SNOW_2014_Data_Challenge/1003755
  141. Overview of Challenge
      • Goal: Detection of newsworthy topics in a large and noisy set of tweets
      • Topic: a news story represented by a headline + tags + representative tweets + representative images (optional)
      • Newsworthy: a topic that ends up being covered by at least some major online news sources
      • Topics are detected per timeslot (small equally-sized time intervals)
      • We want a maximum number of topics per timeslot
  142. Challenge Activity Log
      • Challenge definition (Dec 2013)
      • Challenge toolkit and registration (Jan 20, 2014)
      • Development dataset collection (Feb 3, 2014)
      • Rehearsal dataset collection (Feb 17, 2014)
      • Test dataset collection (Feb 25, 2014)
      • Results submission (Mar 4, 2014)
      • Paper submission (Mar 9, 2014)
      • Results evaluation (Mar 5-18, 2014)
      • Workshop (Apr 7, 2014)
  143. Some statistics
      • Registered participants: 25
        – India: 4, Belgium: 3, Germany: 3, UK: 3, Greece: 3, Ireland: 2, USA: 2, France: 2, Italy: 1, Spain: 1, Russia: 1
      • Participants that signed the Challenge agreement: 19
      • Participants that submitted results: 11
      • Participants that submitted papers: 9
  144. Evaluation Protocol
      • Defined several evaluation criteria:
        – Newsworthiness → Precision/Recall, F-score
        – Readability → scale [1-5]
        – Coherence → scale [1-5]
        – Diversity → scale [1-5]
      • List of reference topics
      • Set up precise evaluation guidelines
      • Blind evaluation (i.e. evaluator not aware of which method a topic comes from) based on a Web UI
      • Participants submitted topics for 96 timeslots, but manual evaluation happened for 5 sample timeslots.
      • Result validation and analysis
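The newsworthiness scores can be computed as in the sketch below once each submitted topic has been matched to a reference topic. In the challenge the matching was done by human evaluators; here it is assumed to be already available as identifiers, and the function name and toy identifiers are hypothetical.

```python
def precision_recall_f1(detected, reference):
    """Precision, recall and F-score of detected topic ids against reference ids."""
    detected, reference = set(detected), set(reference)
    true_positives = len(detected & reference)
    precision = true_positives / len(detected) if detected else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Toy timeslot: 4 submitted topics, 3 reference topics, 2 in common.
p, r, f = precision_recall_f1(["a", "b", "c", "d"], ["a", "b", "e"])
```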
  145. social event detection
  146. a bit of background...
      • MediaEval
        – well-known benchmarking activity since 2010 (started as VideoCLEF in 2008)
        – consists of several tasks dedicated to specific challenges
      • social event detection (SED)
        – first run in 2011 (7 participants)
  147. task definition & dataset
      • 2011
        collection: 73,645 Flickr photos from five cities, May 2009
        find events related to two target categories
        > soccer matches in Barcelona and Rome
        > concerts in venues Paradiso and Parc del Forum
      • 2012
        collection: 167,332 Flickr photos from five cities, 2009-2011
        find events related to three target categories
        > technical events (e.g. exhibitions, fairs) in Germany
        > soccer events in Hamburg and Madrid
        > Indignados movement in Madrid
      • 2013
        collection 1: 437,370 Flickr photos + 1,327 YouTube videos
        collection 2: 57,165 Instagram photos
        cluster collection 1 into events (attach YouTube videos to them)
        categorize collection 2 images into eight event types or non-event
      [slide annotations: variant 1, variant 4, variant 4]
  148. sed2012: evaluation setup
      • ground truth: photos clustered around 149 events (18 technical, 79 soccer, 52 Indignados)
      • assess the following aspects:
        – accuracy of same-event classification
        – compare clustering quality between item-to-cluster and the two versions of item-to-item (batch & incremental)
        – measure contributions of different features
        – study generalization abilities of the same-event model
  149. evaluation: main caveat
      • the creation strategy of a benchmark dataset can dramatically affect how hard (or easy) the problem is
        – if events are very sparsely distributed over time, then a simple time-based clustering could be sufficient
        – if events correspond to users one-to-one, then a simple user-based look-up could yield very high accuracy
        – using the same source for training/testing makes it easy
      • need to explore new challenging settings
        – multiple sources of multimedia
        – huge amounts of non-event content
        – very dense coverage of feature space by test events
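The first caveat can be made concrete with a trivial baseline: when events are sparse in time, splitting the photo stream wherever the gap between consecutive timestamps exceeds a threshold already separates the events. The sketch below is such a baseline; the function name, the threshold and the toy timestamps (in seconds) are assumptions for illustration.

```python
def time_gap_clusters(timestamps, max_gap):
    """Group item indices into clusters split wherever the time gap exceeds max_gap."""
    order = sorted(range(len(timestamps)), key=lambda i: timestamps[i])
    clusters = []
    for i in order:
        # Compare with the most recent item of the current cluster.
        if clusters and timestamps[i] - timestamps[clusters[-1][-1]] <= max_gap:
            clusters[-1].append(i)
        else:
            clusters.append([i])
    return clusters


# Two bursts of photos roughly 5000 seconds apart.
toy_timestamps = [0, 10, 5000, 5010, 20]
toy_clusters = time_gap_clusters(toy_timestamps, max_gap=60)
```

If a benchmark can be solved this way, its scores say little about how a method behaves when many events overlap in time, which is the point the slide makes.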
  150. Conclusions
