Research Goals
• Social media image filtering
– Real-time image retrieval, processing, and storage
– Duplicate or near-d...
Automatic Image Processing Pipeline
Dat Tien Nguyen, Firoj Alam, Ferda Ofli, Muhammad Imran. Automatic Image Filtering on So...
Disaster Datasets (Twitter)
Dataset details for all four disaster events with their year and number of images
Number of l...
Relevancy Filtering
Examples of irrelevant images showing cartoons, banners, advertisements, celebrities, etc.
Performance...
Duplicate Filtering
Examples of near-duplicate images
Task: Compute similarity between a pair of images
Approach: Perce...
Before/After Image Filtering
Number of images that remain in our dataset after each image filtering operation
~ 2 %
~ 2 %
...
Infrastructure Damage Assessment
• Three-class classification
– Categories: severe, mild & little-to-none
• Distinction be...
AIDR SMS Processing
AIDR Helps Answer Thousands of Health Queries
Public Health: AIDR + UNICEF Zambia
Manual processing and routing of SMS
Counselors (experts of HIV, STIs)
SMS service ...
New Scientist Featured This Work
Media Coverage
Domain Adaptation/Transfer Learning
Ability of a system to apply knowledge and skills learned in previous domains to nove...
Domain Adaptation
Labeled source, but unlabeled target
Feature extractor
Machine learning algorithm
Feature extract...
Same Domain Learning
Training data
Machine learning model
Testing data
infer, predict
Apples, Apples, Apples
Orange...
Crisis-related Data Classification
Training data
Machine learning model
Testing data
infer, predict
Italy earthquake
...
Domain Adaptation
Model Adaptation Evaluation
• Model adaptation using single source
– Using both: in-domain and cross-domain
• Model adap...
Transfer Learning
Differences in classification tasks:
• Different classification tasks
• Different types of disasters, stak...
Summarization and Prioritization of Actionable Information
Information needs & problem:
• Different stakeholders
• Different...
Information Summarization In Real-Time
Class A, Class B, Class C, Class D
Summary, Summary, Summary, Summary
Classified ...
Resources, Datasets, and Tools
Towards Standard Baselines and Datasets
CrisisNLP.qcri.org
- Access to 52 million tweets
- Around 50k labeled tweets...
ACM Computing Survey
Processing Social Media Messages in Mass Emergency: A Survey
[Imran et al. 2015]
27 Free Data Mining Books
http://www.datasciencecentral.com/profiles/blogs/27-free-data-mining-books
Special Issues
Organizing Editors
Christian Reuter (University of Siegen)
Muhammad Imran (Qatar Computing Research Insti...
Conclusions
• Information bestows power for disaster response
– People need information as much as water, shelter, and f...
THANK YOU!
CrisisNLP.qcri.org AIDR.qcri.org
Email: mimran@hbku.edu.qa
Homepage: http://mimran.me
Twitter: @mimran15
Real-Time Processing of Social Media Content for Social Good

This tutorial was given at the Data Science workshop organized by the Higher Education Commission (HEC) Pakistan in 2017.

Published in: Science
Real-Time Processing of Social Media Content for Social Good

  1. 1. Real-Time Processing of Social Media Content for Social Good. Muhammad Imran, Research Scientist, Qatar Computing Research Institute, Hamad Bin Khalifa University, Doha, Qatar. April 20th, 2017. Data Science Workshop.
  2. 2. Outline • P1: Background of Humanitarian Computing (10%) – Sudden-onset emergencies, time-critical situations – Social Good factors – Aid and information needs • P2: The Role of Social Media for Social Good (20%) – Particular focus on micro-blogging platforms – Availability of various types of information and opportunities • P3: The Role of Artificial Intelligence for Social Good (70%) – How AI is useful in crisis response – Various AI techniques, approaches, and tools – Work of the crisis computing group at QCRI – Ongoing research – Future directions
  3. 3. Aid Needs, Information Needs, and Gaps. Disaster event (earthquake, flood). Urgent needs of affected people. Information gathering. Humanitarian organizations and local administration. Information gathering, especially in real time, is the most challenging part. Relief operations: - Food, water - Shelter - Medical assistance - Donations - Service and utilities
  4. 4. Aid Needs, Information Needs, and Gaps. Disaster event (earthquake, flood). Urgent needs of affected people. Information gathering. Humanitarian organizations and local administration. Information gathering, especially in real time, is the most challenging part. Relief operations: - Food, water - Shelter - Medical assistance - Donations - Service and utilities --Information Bestows Power-- Will access to information solve the problem?
  5. 5. Decision-Making and Response Department of Community Safety, Queensland Govt. & UNOCHA, 2011 -  Delayed decision-making -  Delayed crisis response -  High community harm -  Early decision-making -  Rapid crisis response -  Low community harm Target
  6. 6. Decision-Making and Response. Department of Community Safety, Queensland Govt. & UNOCHA, 2011 - Delayed decision-making - Delayed crisis response - High community harm - Early decision-making - Rapid crisis response - Low community harm. Target --Need Early Information-- How early do we need it?
  7. 7. The Value of Timely Information During Disasters. Based on a large-scale FEMA survey among emergency management professionals across the US. Information value. When information is too late.
  9. 9. Information Types and Needs • Reports of injured or dead people • Infrastructure damage (e.g., buildings, bridges, roads) • Urgent needs of affected people (e.g., food, water, shelter) • Donation offers and requests (e.g., money, volunteers) • Medical emergencies • Disease symptoms reports • Disease treatment reports and questions • …
  10. 10. Part 2: The Role of Social Media
  11. 11. Communications Before and After ICT and Social Media (Gerald Baron)
  12. 12. Information Availability in the Age of ICT and Social Media. Based on a large-scale FEMA survey among emergency management professionals across the US. 1990s, 2000s, 2010s. Information value. When information is too late.
  13. 13. Sandy Hurricane Twitter Data Analysis. Example tweets: @NYGovCuomo orders closing of NYC bridges. Only Staten Island bridges unaffected at this time. Bridges must close by 7pm. #Sandy #NYC. | rt @911buff: public help needed: 2 boys 2 & 4 missing nearly 24 hours after they got separated from their mom when car submerged in si. #sandy #911buff | freaking out. home alone. will just watch tv #Sandy #NYC. | 400 Volunteers are needed for areas that #Sandy destroyed.
  14. 14. Sandy Hurricane Twitter Data Analysis. Example tweets: @NYGovCuomo orders closing of NYC bridges. Only Staten Island bridges unaffected at this time. Bridges must close by 7pm. #Sandy #NYC. | rt @911buff: public help needed: 2 boys 2 & 4 missing nearly 24 hours after they got separated from their mom when car submerged in si. #sandy #911buff | freaking out. home alone. will just watch tv #Sandy #NYC. | 400 Volunteers are needed for areas that #Sandy destroyed. Labels: Personal, Informative.
  15. 15. Sandy Hurricane Twitter Data Analysis. Example tweets: @NYGovCuomo orders closing of NYC bridges. Only Staten Island bridges unaffected at this time. Bridges must close by 7pm. #Sandy #NYC. | rt @911buff: public help needed: 2 boys 2 & 4 missing nearly 24 hours after they got separated from their mom when car submerged in si. #sandy #911buff | freaking out. home alone. will just watch tv #Sandy #NYC. | 400 Volunteers are needed for areas that #Sandy destroyed. Labels: Personal, Informative, Caution and Advice, Missing people report, Donation request.
  16. 16. MERS Outbreak: Twitter Data Analysis. Middle East Respiratory Syndrome (MERS). Twitter data analysis from 2014-04-27 to 2014-07-14. Qualitative analysis categories: Reports of symptoms (reports of signs or symptoms such as fever, cough or questions); Affected people reports (reports of affected people due to the MERS disease); Death reports (reports of deaths due to the MERS disease); Disease transmission reports (reports or questions related to the transmission of the disease); Prevention questions (questions or suggestions related to the prevention of disease); Treatment questions (questions or suggestions regarding the treatment of the disease).
  17. 17. Social Media During MERS Outbreak. Example tweets: RT @abecel: Two workers at FL hospital exposed to a patient with Middle East Respiratory Syndrome are showing flu-like symptoms | Coronavirus symptoms include: fever, coughing, shortness of breath, congestion in the nose and throat, and in some cases diarrhea. MERS | #MERS is a relatively new respiratory illness, spread b/w people in close contact. Symptoms are fever, cough, & shortness of breath. | Saudi Arabia finds another 32 MERS cases as disease spreads: RIYADH (Reuters) - Saudi Arabia said on Thursday ... http://t.co/cPhm0uTRCo Labels: Signs and symptoms, Signs and symptoms, Signs and symptoms, Affected individuals.
  18. 18. Social Media During MERS Outbreak. Example tweets: First Case of Deadly Middle Eastern Virus Found in U.S.: The Centers for Disease Control has confirmed that a case of the deadly Midd... | Third Case of MERS Confirmed in the U.S.: The U.S. Centers for Disease Control and Prevention confirmed on Sat... http://t.co/Sb8PMyxVUn | No clear transmission link btwn camels and humans for MERS. 94% Egyptian camels seropositive but no human cases yet. Hmm #asm2014 | Saudi health authorities announced on Monday that the death toll from the MERS coronavirus has reached 115 since the respiratory disease ... Labels: Transmission, Death reports, Affected individuals, Affected individuals.
  19. 19. Twitter Breaks Events Faster. First report. Breaks the story 33 minutes before local TV. Hudson Plane Crash. Westgate Mall Attack.
  20. 20. Twitter Breaks Events Faster. First report on Twitter. After 1 minute. After 2 minutes. Boston Bombing.
  21. 21. Types of Information on Twitter - Twitter data from 13 recent crises - Over 100,000 tweets - Information types - Types of sources. Source: Qatar Computing Research Institute. Published in World Humanitarian Data and Trends 2014 (UN OCHA).
  22. 22. 2013 Pakistan Earthquake: September 28 at 07:34 UTC. 2010 Haiti Earthquake: January 12 at 21:53 UTC. Data and Opportunities. Social Media Platforms. Availability of immense data: around 16 thousand tweets per minute were posted during Hurricane Sandy in the US. Opportunities: - Early warning and event detection - Situational awareness - Actionable information extraction - Rapid response - Effective communications. Disease outbreaks.
  23. 23. Part 3 The Role of AI and Data Science for Social Good
  24. 24. Big Data Challenges – 4Vs (Under Time-critical Situations) • Volume: scale of data (e.g., millions of tweets after an event) • Velocity: high-velocity streams (e.g., thousands of tweets/min) • Variety: different forms/types of data (information types) • Veracity: uncertainty of data
  25. 25. Data Acquisition
  26. 26. Twitter Data Collection • REST APIs – Provide programmatic access to post a new tweet, read profile, and followers. • Streaming APIs – Receive live updates on the latest tweets matching a search query. • Ads API, MoPub, and Gnip – Twitter advertising management; MoPub is a mobile ad exchange and ad server. – Gnip provides commercial-grade access to real-time and historical Twitter data.
  27. 27. REST vs. Streaming API. REST API. Streaming API: public streams, user streams, site streams. Streaming endpoints. Sample code: https://github.com/twitterdev
  28. 28. Properties of Social Media Data • Most SM data is publicly available • Near real-time access • 1% to 3% geo-tagged • Highly informal, often brief, and non-structured • Written by different people in many languages • Contains rumors and misinformation
  29. 29. Slang and Shortened Forms • Single-word slang: pls (please), srsly (seriously) • Multi-word slang: imo (in my opinion) • Misspellings: missin (missing), ovrcme (overcome) • Phonetic substitution: 2morrow (tomorrow) • Words without spaces: prayfornepal (pray for nepal). Can you guess? “r u ok m8” ?? >> “Are you OK, mate?”
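The shortened forms on slide 29 are often handled with a lookup-based normalization pass before classification. The following is a minimal sketch, not the method used in the tutorial: the `SLANG` table and the `normalize` helper are hypothetical names, and the entries are only the examples quoted on the slide.

```python
import re

# Hypothetical lookup table built from the slide's examples; a real system
# would need a much larger, curated lexicon.
SLANG = {
    "pls": "please",
    "srsly": "seriously",
    "imo": "in my opinion",
    "missin": "missing",
    "ovrcme": "overcome",
    "2morrow": "tomorrow",
    "r": "are",
    "u": "you",
    "m8": "mate",
}

def normalize(text):
    """Lower-case the text and expand known slang tokens; other tokens pass through."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return " ".join(SLANG.get(tok, tok) for tok in tokens)

print(normalize("r u ok m8"))               # are you ok mate
print(normalize("pls help, missin child"))  # please help missing child
```

A dictionary lookup does not cover the "words without spaces" case (prayfornepal); that would need a separate word-segmentation step.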
  30. 30. Data Velocity and Volume. High velocity • 2012 Hurricane Sandy: 18,000 tweets/min • 2013 Boston bombings: 54,000 tweets/min • 2011 Japan earthquake: 66,000 tweets/min. High volume • 2012 Hurricane Sandy: 20 million tweets in 5 days. Increase in data velocity: Batch, Periodic, Near real-time, Real-time, Stream. Increase in data volume: KB, MB, GB, TB, PB. Storage options: File system, MySQL, Postgres, MongoDB, Apache Cassandra, Redis.
  31. 31. Data Processing
  32. 32. Social Media Information Processing • Natural Language Processing Methods – Information extraction (e.g. person, location, organization) – Classification and clustering – Automatic summarization – Semantic search – Machine translation • Imagery content processing – Object detection & recognition – Image retrieval and filtering – Automatic annotation
  33. 33. Supervised Classification. Steps: (1) Data collection (for example, using keywords, hashtags, etc.); (2) Human annotations on sample data; (3) Machine training; (4) Classification. Event timeline: DATA COLLECTION. Humans alone cannot process large amounts of data, so we use them only to help process a subset. We train the machine using human input to automatically process large volumes of data at high speed.
  34. 34. Data Stream Processing 1. Data items arrive online 2. Streams have infinite length and are unbounded in size 3. No control over the order in which data items arrive 4. Processed items are either discarded or archived 5. No retrieval unless stored in memory (often small in size). Examples of data streams: credit card fraud detection, sensor data classification, social media streams mining.
  35. 35. Traditional vs. Stream Processing
      Property              | Traditional System | Stream Processing System
      Number of passes      | Multiple           | Single
      Memory availability   | Unlimited          | Restricted
      Processing time       | Unlimited          | Restricted
      Results availability  | Delayed            | Real-time
      Results reliability   | Accurate           | Improvable
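To make the single-pass, restricted-memory column concrete, here is a small sketch that computes a tweets-per-window rate while reading each arrival exactly once. The function name `rolling_rate` and the 60-second window are assumptions for illustration, and the sketch assumes timestamps arrive in non-decreasing order, which a real stream does not guarantee.

```python
from collections import deque

def rolling_rate(timestamps, window=60.0):
    """Yield (timestamp, number of items seen in the last `window` seconds).

    Each item is read once; memory is proportional to the number of items
    inside the current window, not to the length of the stream.
    """
    recent = deque()  # timestamps still inside the window
    for ts in timestamps:
        recent.append(ts)
        while recent and recent[0] <= ts - window:
            recent.popleft()  # expire items that fell out of the window
        yield ts, len(recent)

arrivals = [0, 10, 20, 65, 70, 130]  # seconds; made-up example data
print([count for _, count in rolling_rate(arrivals)])  # [1, 2, 3, 3, 3, 1]
```

Because the function is a generator, results are available as each item arrives, which matches the "real-time results" row of the table.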
  36. 36. Pure Stream Processing and Issues • Relies entirely on automated algorithms • SM data streams can be imprecise, highly variable, and often unseen – Concept drift: happens due to slow changes in the concepts – Concept evolution: happens due to the presence of unknown classes. Examples shown: Aurora Stream Processing (Brown University); Flu pandemic 2009.
  37. 37. Crowdsourced Stream Processing (CSP). In cases where critical (in terms of cost, time or reliability) decision-making needs to take place in real time, based on data streams that are potentially noisy and unseen, fully automated stream processing systems do not meet the needs. Taxonomy figure: Stream processing systems (SPs), Crowdsourcing systems (Cs), Crowdsourced stream processing systems (CSPs); Human processing role, Automatic processing role; Composition; Binary classification, N-ary classification, Open-ended; Computation; Filtering; Task-generation, Task-assignment, Task-aggregation; Serial, Parallel, Complex; Hierarchical taxonomy, Faceted taxonomy; System. Ref. Imran, Muhammad, Ioanna Lykourentzou, Yannick Naudet, and Carlos Castillo. "Engineering crowdsourced stream processing systems." arXiv preprint arXiv:1310.5463.
  38. 38. http://aidr.qcri.org/ AIDR (Artificial Intelligence for Disaster Response) is a free, open, and easy-to-use platform to automatically filter and classify relevant tweets posted during humanitarian crises. Steps: 1. Collect 2. Curate 3. Classify. Grand Prize Winner of the Open Source Software World Challenge 2015.
  39. 39. Steps: (1) Data collection; (2) Human annotations; (3) Machine training; (4) Classification. ONLINE APPROACH. DATA COLLECTION. Learning-1, Learning-2, Learning-3, …, Learning-n. Human annotation - 1, Human annotation - 2, Human annotation - 3, …, Human annotation - n. CLASSIFICATION OF DATA & DECISION MAKING PROCESS. First few hours. Near real-time processing.
  40. 40. Data Classification. Apply machine learning. Apply crowdsourcing. Goal: to find relevant and actionable information in near real-time. Growing stack of data. AIDR: Machine Learning + Crowdsourcing. Filter-failure. Need human-labeled examples.
  41. 41. Real-time Classification of Social Media Data http://aidr.qcri.org/
  42. 42. AIDR Architecture. Components: Twitter streaming API, Tweets collector, Features extractor, Classifier, Task generator, Annotator, Learner, Output adapters. Connectors: P/S (Redis channel) and Q (Redis queue), with load shedding. Messages: query, tweets, 〈tweet〉, 〈tweet, features〉, 〈task〉, 〈task, label〉, 〈tweet, label, confidence〉, model parameters. Human-in-the-loop (crowdsourcing). Annotations on the diagram: - Uni-grams - Bi-grams - Information gain; Random Forest (decision trees); - Task selection - Task prioritization. Database: Postgres. Application layer: Java EE, RESTful services, Weka machine learning library. Data flow and control flow: Redis. Front-end: ExtJS (JavaScript library).
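Two ideas named on the architecture slide, uni-gram/bi-gram features and load shedding on queues, can be sketched in a few lines. This is an illustrative sketch only: AIDR itself is described as Java EE with Redis and Weka, and the names `ngram_features` and `SheddingQueue`, the whitespace tokenizer, and the drop-oldest policy are assumptions, not details taken from the slides.

```python
from collections import deque

def ngram_features(text):
    """Return unigram and bigram features for a whitespace-tokenized tweet."""
    tokens = text.lower().split()
    bigrams = [a + "_" + b for a, b in zip(tokens, tokens[1:])]
    return tokens + bigrams

class SheddingQueue:
    """Bounded FIFO queue that sheds load by dropping the oldest item when full.

    Drop-oldest is an assumed policy; dropping the newest item is equally valid.
    """
    def __init__(self, capacity):
        self.items = deque(maxlen=capacity)  # deque discards from the left when full
        self.dropped = 0

    def put(self, item):
        if len(self.items) == self.items.maxlen:
            self.dropped += 1  # count what the deque is about to discard
        self.items.append(item)

    def get(self):
        return self.items.popleft()

queue = SheddingQueue(capacity=2)
for tweet in ["bridge closed now", "need water", "power out"]:
    queue.put(ngram_features(tweet))
print(queue.dropped)  # 1
print(queue.get())    # ['need', 'water', 'need_water']
```

Shedding keeps latency bounded when the collector produces tweets faster than the feature extractor or classifier can consume them, at the cost of skipping some items.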
  43. 43. Data Collection in AIDR (Twitter). Collection details dashboard http://aidr.qcri.org/ Geographical region filter. Language filter. Collection setup.
  44. 44. Data Classification Approach: 1. Filtering 2. Classification 3. Extraction
  45. 45. 1. Filtering. Is event-related? (Yes/No) Contributes to situational awareness? (Yes/No)
  46. 46. 2. Classification. Filtered tweets are classified into: Caution & Advice, Information Sources, Damage & Casualties, Donations, Health, Shelter, Food, Water, Logistics, ...
  47. 47. http://aidr.qcri.org/ Setting up Classifiers
  48. 48. AIDR – Classifier Setting (cont.) http://aidr.qcri.org/
  49. 49. Human Annotation in AIDR. Internal Tagging Interface http://aidr.qcri.org/
  50. 50. Human Annotation Using MicroMappers. MicroMapper Interface (web clicker) http://aidr.qcri.org/ Mobile clicker
  51. 51. Tagged Items and Machine Output http://aidr.qcri.org/ Training examples. Classifiers’ output.
  52. 52. Quality, Cost, and Performance of AIDR
  53. 53. Quality vs. Cost in AIDR http://aidr.qcri.org/ Goals: Maximize quality – Minimize cost • Quality • Classification accuracy • Precision/AUC • Cost to obtain labeled data • Monetary in case of paid workers • Time in case of volunteers
  54. 54. Quality vs. Cost in AIDR http://aidr.qcri.org/ Quality vs. cost using passive learning, with/without de-duplication. Quality vs. cost using active learning, with/without de-duplication.
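Slide 54 contrasts passive and active learning, each with and without de-duplication. One common active-learning strategy is least-confidence sampling: ask humans to label the items the classifier is least sure about. The sketch below combines that with an exact-duplicate filter; it is an assumption-laden illustration (the helper name `select_for_labeling`, the normalization key, and the sample data are all hypothetical), not AIDR's actual task-selection logic.

```python
def select_for_labeling(items, budget):
    """Pick up to `budget` unique texts with the lowest classifier confidence.

    items: iterable of (text, confidence) pairs, confidence in [0, 1].
    """
    seen = set()
    unique = []
    for text, confidence in items:
        key = " ".join(text.lower().split())  # crude exact-duplicate key
        if key not in seen:
            seen.add(key)
            unique.append((text, confidence))
    unique.sort(key=lambda pair: pair[1])  # least confident first
    return [text for text, _ in unique[:budget]]

scored = [
    ("Bridge closed", 0.95),
    ("need water in Queens", 0.55),
    ("Need water in  Queens", 0.52),  # duplicate after normalization
    ("power out downtown", 0.60),
]
print(select_for_labeling(scored, budget=2))
# ['need water in Queens', 'power out downtown']
```

De-duplicating before selection avoids spending the labeling budget (money for paid workers, time for volunteers) on retweets and near-identical copies.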
  55. 55. Performance http://aidr.qcri.org/ In terms of throughput and latency. Latency of feature extractor, classifier, and the system. Throughput of feature extractor, classifier, and the system.
  56. 56. Processing Evolving Data Streams
  57. 57. Data Stream Processing 1. Data items in the stream arrive online 2. Streams have infinite length and are unbounded in size 3. No control over the order in which data items arrive 4. Processed items are either discarded or archived 5. No retrieval unless stored in memory (often small in size). Examples of data streams: credit card fraud detection, sensor data classification, social media streams mining.
  58. 58. Types of Changes in SM Streams. Types of stream drifts: Concept Drift, Feature Evolution, Concept Evolution. Class boundaries change over time. Feature subspace may change. New features appear. Feature distribution changes. Novel classes emerge. Recurrent novel classes re-appear.
  59. 59. Types of Changes in Streaming Data. Except Noise and Blip, all the presented changes are treated as concept drift and require model adaptation. Ref. Brzeziński, Dariusz. "Mining data streams with concept drift." PhD diss., Master’s thesis, Poznan University of Technology, 2010.
  60. 60. Information Variability on Social Media • Different events present different information categories • Even for recurring events, the proportions of categories change
  65. 65. Social Media Data Streams Classification. Two major issues in the supervised classification of social media streams: 1. How to keep the categories used for classification up-to-date? 2. While adding new categories, how to maintain high classification accuracy? [Figure panels: a: Split automatic/manual processing; b: Detect-verify paradigm (automatic processing, verification performed by crowd); c: Improving quality through active learning (crowd providing training data).]
  66. 66. Identification of Novel Categories. Classes: - Injured people - Infrastructure damage - Shelter needs - Donation requests - Missing or stranded people - Different health issues - Novel urgent needs like - Blankets - Medicine - Schools shut - Airport closed/open - … Pre-defined classes. Unseen classes (Miscellaneous). Keep in mind we have a new class “Miscellaneous”.
  67. 67. Expert-Machine-Crowd Setting. Constraints Outlier Detection (COD-Means): 1. Constraints formation using classified items 2. Clustering using COD-Means 3. Labeling errors identification (using outlier detection). [Diagram, steps 1 to 4: List of categories; documents stream; Supervised Learning System; Novel Categories Detector Using COD-Means; Crowdsourcing task generator; Emerging novel categories; Crowdsourcing tasks to be labeled by crowd; An expert; Crowd workers; Crowd/machine classified items (machine classified items with confidence score >= 0.90); Incoming uncategorized documents stream; Machine categorized items as (item, category, machine confidence score) triplets; Refined training set; Human labels; Labels.]
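Slide 67 states that machine-classified items with a confidence score of at least 0.90 are passed on as classified items. A hedged sketch of that gate follows; the function name `route` is hypothetical, and sending the remaining items to crowd workers is an assumption about the diagram rather than something the slide text states outright.

```python
CONFIDENCE_THRESHOLD = 0.90  # threshold quoted on the slide

def route(triplets):
    """Split (item, category, confidence) triplets by classifier confidence.

    Returns (trusted, for_crowd): trusted holds (item, category) pairs at or
    above the threshold; for_crowd holds the remaining items (assumed to be
    turned into crowdsourcing tasks).
    """
    trusted, for_crowd = [], []
    for item, category, confidence in triplets:
        if confidence >= CONFIDENCE_THRESHOLD:
            trusted.append((item, category))
        else:
            for_crowd.append(item)
    return trusted, for_crowd

stream = [
    ("roof collapsed on 5th st", "Infrastructure damage", 0.97),
    ("schools shut tomorrow", "Miscellaneous", 0.41),
    ("need blankets at shelter", "Shelter needs", 0.90),
]
trusted, for_crowd = route(stream)
print(trusted)    # the two items with confidence >= 0.90
print(for_crowd)  # ['schools shut tomorrow']
```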
  68. 68. Input and Output Category A Category B Category C Miscellaneous Z Category A’ Category B’ Category C’ Z1 Z2 Z’ INPUT OUTPUT
  69. 69. Constraints Formation 1. Items in the same category have Must-link constraints 2. Items belonging to different categories have Cannot-link constraints. Category A, Category B, Category C, Category Z. Must-link. Cannot-link. Note: Items in Z do not have any constraints.
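The two rules on slide 69 translate directly into pairwise constraints. A minimal sketch, assuming labels are given as a dictionary and that items in the miscellaneous category Z are marked with `None` (both representational choices, and the name `build_constraints`, are assumptions):

```python
from itertools import combinations

def build_constraints(labels):
    """Build must-link and cannot-link pairs from category labels.

    labels: dict mapping item id -> category, or None for items in Z,
    which carry no constraints.
    """
    must_link, cannot_link = [], []
    labeled = [(item, cat) for item, cat in labels.items() if cat is not None]
    for (a, cat_a), (b, cat_b) in combinations(labeled, 2):
        if cat_a == cat_b:
            must_link.append((a, b))    # same category
        else:
            cannot_link.append((a, b))  # different categories
    return must_link, cannot_link

labels = {"t1": "A", "t2": "A", "t3": "B", "t4": None}
must_link, cannot_link = build_constraints(labels)
print(must_link)    # [('t1', 't2')]
print(cannot_link)  # [('t1', 't3'), ('t2', 't3')]
```

The number of pairs grows quadratically with the number of labeled items, so a practical system would likely sample constraints rather than enumerate all of them.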
  70. 70. Objective Function. Standard distortion error. If a must-link (ML) constraint is violated, the cost of the violation is equal to the distance between the two centroids that contain the instances. If a cannot-link (CL) constraint is violated, the error cost is the distance between the centroid c assigned to the pair and its nearest centroid h(c).
  71. 71. Assignment and Update Rules. Rule 1: For items without any constraints (standard distortion error). Rule 2: For items with Must-link constraints; the cost of violation is the distance between their centroids. Rule 3: For items with Cannot-link constraints; the cost is the distance between centroid c and its nearest centroid. δ(x, y) is the Kronecker delta function, i.e. it is 1 if x = y and 0 if x ≠ y. Update rule: the update rule computes a modified average of all points that belong to a cluster.
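The equations themselves did not survive in the transcript. One way to write the three terms described on slides 70 and 71 is the constrained vector quantization error form below. This is a reconstruction under assumptions, not the slide's own formula: the symbols g(x) (index of the cluster assigned to x), c_j (centroid of cluster j), C_j (points assigned to cluster j), h(j) (index of the centroid nearest to c_j), and the use of squared Euclidean distances with a factor of 1/2 are all chosen here for illustration.

```latex
E \;=\; \frac{1}{2}\sum_{j=1}^{k}\Bigg[
  \underbrace{\sum_{x_i \in C_j} \lVert c_j - x_i \rVert^2}_{\text{standard distortion}}
  \;+\; \underbrace{\sum_{\substack{(x_i, x_a) \in ML \\ g(x_i) = j}}
      \bigl(1 - \delta(g(x_i), g(x_a))\bigr)\, \lVert c_j - c_{g(x_a)} \rVert^2}_{\text{must-link violations}}
  \;+\; \underbrace{\sum_{\substack{(x_i, x_a) \in CL \\ g(x_i) = j}}
      \delta(g(x_i), g(x_a))\, \lVert c_j - c_{h(j)} \rVert^2}_{\text{cannot-link violations}}
\Bigg]
```

Under this reading, a must-link pair contributes only when its two items sit in different clusters (the factor 1 − δ is 1), and a cannot-link pair contributes only when both items sit in the same cluster (δ is 1), which matches the verbal description of the two violation costs.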
  72. 72. COD-Means Algorithm. 1. Initialization (e.g., random pick of k centroids). 2. Assignment of items based on the 3 assignment rules, considering ML and CL constraints. 3. Points in each cluster are sorted based on their distance to the centroid, and the top l are removed and inserted into L.
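The loop structure can be sketched on plain k-means. This is deliberately NOT COD-Means: the constraint-aware assignment rules are replaced by nearest-centroid assignment, and only the outlier-removal step is shown. Other assumptions: the l farthest points are removed per cluster (the slide is ambiguous on this), initial centroids are passed in rather than picked at random so that the result is reproducible, and the function name `kmeans_with_outliers` is hypothetical.

```python
import math


def mean(points):
    """Component-wise average of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(coord) / n for coord in zip(*points))


def kmeans_with_outliers(points, centroids, l=1, iterations=10):
    """Plain k-means that sets aside the l farthest points of each cluster."""
    centroids = list(centroids)
    clusters, outliers = [], []
    for _ in range(iterations):
        # Assignment (simplified: nearest centroid, no ML/CL constraints).
        clusters = [[] for _ in centroids]
        for p in points:
            j = min(range(len(centroids)),
                    key=lambda c: math.dist(p, centroids[c]))
            clusters[j].append(p)
        # Outlier step: sort each cluster farthest-first, move the top l into L.
        outliers = []
        for j, members in enumerate(clusters):
            members.sort(key=lambda p: math.dist(p, centroids[j]), reverse=True)
            outliers.extend(members[:l])
            clusters[j] = members[l:]
        # Update: recompute each centroid from the points that remain.
        centroids = [mean(m) if m else centroids[j]
                     for j, m in enumerate(clusters)]
    return centroids, clusters, outliers


# Made-up 2-D data: two tight groups plus one far-away point near each.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (-8, -8),
          (9, 9), (10, 10), (10, 11), (11, 10), (11, 11), (30, 30)]
centroids, clusters, outliers = kmeans_with_outliers(
    points, centroids=[(0, 0), (10, 10)], l=1)
```

Because the far-away points are excluded before the update, they no longer pull the centroids toward them, which is the effect the slide attributes to removing the top l points.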
  73. 73. Dataset and Experiments. 1.  Are the new clusters identified by the COD-Means algorithm genuinely different and novel? 2.  What is the nature of the outliers (labeling errors) discovered by the COD-Means algorithm? Are they genuine outliers? 3.  What is the impact of outliers on the quality of clusters generated by COD-Means? 4.  Once refined clusters (without labeling errors) are used in the training process, does the overall accuracy improve? Eight disaster-related datasets from Twitter were used.
  74. 74. Cluster Novelty and Coherence: K-Means vs. COD-Means •  The proposed approach generates more cohesive and novel clusters by removing outliers •  As the value of L increases, tighter and more coherent clusters emerge
  75. 75. Data Improvements Evaluation. [Figure: precision (0 to 1) per category (Affected individuals, Caution and advice, Donations and volunteering, Infrastructure and utilities, Sympathy and support, Misc. to other categories) for eight datasets: 2012 Colorado Wildfires, 2013 Alberta Floods, 2013 Boston Bombings, 2013 Colorado Floods, 2013 Train Crash, 2013 Australia Bushfire, 2013 Queensland Floods, 2013 West Texas Explosion.] 1.  Labeling errors in non-miscellaneous categories 2.  Items incorrectly labeled as miscellaneous
  76. 76. Impact on Classification Performance
  77. 77. Social Media Image Processing: An Application of Computer Vision
  78. 78. “A picture is worth a thousand words.”
  79. 79. Research Goals •  Social media image filtering – Real-time image retrieval, processing, and storage – Duplicate or near-duplicate detection – Irrelevant image detection •  Actionable information extraction – Infrastructure damage assessment – Injured people detection
  80. 80. Automatic Image Processing Pipeline. Dat Tien Nguyen, Firoj Alam, Ferda Ofli, Muhammad Imran. Automatic Image Filtering on Social Networks Using Deep Learning and Perceptual Hashing During Crises. Accepted for publication at the 14th International Conference on Information Systems for Crisis Response And Management (ISCRAM). 2017 Albi, France.
  81. 81. Disaster Datasets (Twitter). Table 1: Dataset details for all four disaster events with their year and number of images. Table 2: Number of labeled images for each dataset and each damage category.
  82. 82. Relevancy Filtering. Task: Build a binary classifier. Approach: Transfer learning (fine-tune a pre-trained convolutional neural network, e.g., VGG16*). Figure: Examples of irrelevant images showing cartoons, banners, advertisements, celebrities, etc. Table: Performance of the relevancy filtering. * Simonyan, K. and Zisserman, A. (2014). “Very deep convolutional networks for large-scale image recognition”. In: arXiv preprint arXiv:1409.1556
  83. 83. Duplicate Filtering. Task: Compute similarity between a pair of images. Approach: Perceptual Hash* + Hamming Distance (with threshold). Figure: Examples of near-duplicate images. * Lei, Y. et al. (2011). “Robust image hash in Radon transform domain for authentication”. In: Signal Processing: Image Communication 26.6, pp. 280–288.
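The hash-then-compare idea can be illustrated with a small sketch. It uses an average hash, which is a much simpler perceptual hash than the Radon-transform hash cited on the slide, so it is a stand-in and not the method the authors used. Images are assumed to be already downscaled to a small grayscale grid. The function names and the threshold value are hypothetical.

```python
def average_hash(pixels):
    """Hash a 2-D grid of grayscale values: one bit per pixel, 1 if above the mean."""
    flat = [v for row in pixels for v in row]
    avg = sum(flat) / len(flat)
    return tuple(1 if v > avg else 0 for v in flat)


def hamming_distance(h1, h2):
    """Number of bit positions in which two equal-length hashes differ."""
    return sum(a != b for a, b in zip(h1, h2))


def is_near_duplicate(img_a, img_b, threshold=1):
    """Treat two images as near-duplicates if their hashes differ in <= threshold bits."""
    return hamming_distance(average_hash(img_a), average_hash(img_b)) <= threshold


# Made-up 2x2 "images": a brightened copy keeps the same hash, an inverted pattern does not.
original = [[10, 200], [220, 30]]
brighter = [[30, 220], [240, 50]]
different = [[200, 10], [30, 220]]
```

Because each bit only records whether a pixel is above the image's own mean, a uniform brightness shift leaves the hash unchanged, which is why `brighter` matches `original` while `different` does not.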
  84. 84. Before/After Image Filtering. Number of images that remain in our dataset after each image filtering operation (figure annotations: ~2%, ~2%, ~50%, ~58%, ~50%, ~30%).
  85. 85. Before/After Image Filtering. Number of images that remain in our dataset after each image filtering operation (figure annotations: ~2%, ~2%, ~50%, ~58%, ~50%, ~30%). Assuming that tagging an image costs $1, we could have gotten the same job done by paying $17k less, saving almost two-thirds of the budget.
  86. 86. Infrastructure Damage Assessment •  Three-class classification – Categories: severe, mild, and little-to-none •  The distinction between categories is ambiguous. •  Agreement among human annotators is low. –  in particular for the mild category •  Fine-tuning a pre-trained CNN (e.g., VGG16)
  87. 87. AIDR SMS Processing AIDR Helps Answer Thousands of Health Queries
  88. 88. Public Health: AIDR + UNICEF Zambia. Manual processing and routing of SMS. Diagram (steps 1 to 6): vulnerable people, SMS service, counselors (experts on HIV, STIs).
  89. 89. Public Health: AIDR + UNICEF Zambia. Manual processing and routing of SMS. Diagram (steps 1 to 6): vulnerable people, SMS service, counselors (experts on HIV, STIs).
  90. 90. New Scientist Featured This Work
  91. 91. Media Coverage
  92. 92. Domain Adaptation/Transfer Learning (Ongoing Work). The ability of a system to apply knowledge and skills learned in previous domains to novel domains. Our Goal: To build a system that can understand natural language.
  93. 93. Domain Adaptation: labeled source, but unlabeled target. Training: input documents (blue domain, source event data) → feature extractor → feature vectors and labels → machine learning algorithm → classifier model. Prediction: input documents (orange domain, target event data) → feature extractor → feature vectors → classifier model → machine-classified items.
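The train-on-source, predict-on-target flow in the diagram can be sketched with a toy bag-of-words scorer. This shows only the baseline pipeline, with no adaptation step, and is not the model used in the work. All names (`extract_features`, `train`, `predict`) and the example messages and labels are made up for illustration.

```python
from collections import Counter, defaultdict


def extract_features(text):
    """Feature extractor: lowercase bag-of-words counts."""
    return Counter(text.lower().split())


def train(labeled_docs):
    """Training: accumulate word counts per label from (text, label) pairs."""
    model = defaultdict(Counter)
    for text, label in labeled_docs:
        model[label].update(extract_features(text))
    return model


def predict(model, text):
    """Prediction: pick the label whose word counts overlap most with the text."""
    feats = extract_features(text)

    def score(label):
        return sum(count * model[label][word] for word, count in feats.items())

    # sorted() makes tie-breaking deterministic (alphabetical by label).
    return max(sorted(model), key=score)


# Source event (labeled) and target event (unlabeled); both are invented examples.
source = [
    ("bridge collapsed after the earthquake", "infrastructure"),
    ("roads damaged by the earthquake", "infrastructure"),
    ("please donate food and water", "donations"),
    ("volunteers needed to donate blood", "donations"),
]
target = ["flood water damaged the bridge", "donate clothes for flood victims"]

model = train(source)
predictions = [predict(model, t) for t in target]
```

The word "flood" never appears in the source data, so it contributes nothing to either score. Vocabulary mismatches of this kind between events are the gap that domain adaptation is meant to close.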
  94. 94. Same Domain Learning. Training data → (infer) → machine learning model → (predict) → testing data. Source domain: apples. Target domain: oranges. BUT: different shapes, colors, skins, tastes, etc.
  95. 95. Crisis-related Data Classification. Training data → (infer) → machine learning model → (predict) → testing data. Source and target domain events: Italy earthquake, Queensland floods, Sandy hurricane, Costa Rica earthquake, Colorado floods, Typhoon Haiyan. Different events, languages, needs, etc.
  96. 96. Domain Adaptation
  97. 97. Model Adaptation Evaluation •  Model adaptation using a single source – Using both: in-domain and cross-domain •  Model adaptation using multiple sources – In-domain – Multiple source events without the target – Multiple source events with the target •  Model adaptation in special cases – Same languages – Similar languages
  98. 98. Transfer Learning. Differences in classification tasks: •  Different classification tasks •  Different types of disasters, stakeholders, information needs. Task: •  Learn from source to classify target •  Semantic similarity between tasks •  Zero-shot learning (no training examples) •  One-shot learning (few training examples)
  99. 99. Summarization and Prioritization of Actionable Information. Information needs & problem: •  Different stakeholders •  Different goals, requirements, and info. needs. General situational awareness vs. target situational awareness: •  High-level general updates from an event •  Specific updates (infrastructure damages)
  100. 100. Information Summarization in Real-Time. Classified documents stream → Class A, Class B, Class C, Class D → one summary per class.
  101. 101. Resources, Datasets, And Tools
  102. 102. Towards Standard Baselines and Datasets. CrisisNLP.qcri.org -  Access to 52 million tweets -  Around 50k tweets labeled into humanitarian categories -  Largest word2vec embeddings trained on 52m crisis-related tweets -  Out-of-vocabulary dictionaries -  Tweets downloader
  103. 103. ACM Computing Surveys. Processing Social Media Messages in Mass Emergency: A Survey [Imran et al. 2015]
  104. 104. 27 Free Data Mining Books. http://www.datasciencecentral.com/profiles/blogs/27-free-data-mining-books
  105. 105. Special Issues. Special Issue on “Exploitation of Social Media for Emergency Relief and Preparedness”. Deadline: July 1st 2017. Organizing Editors: Christian Reuter (University of Siegen), Muhammad Imran (Qatar Computing Research Institute), Amanda Hughes (Utah State University), Starr Roxanne Hiltz (New Jersey Institute of Technology), Linda Plotnick (Jacksonville State University). Marie-Francine Moens, KU Leuven, Belgium; Gareth Jones, Dublin City University, Ireland; Muhammad Imran, Qatar Computing Research Institute; Saptarshi Ghosh, IIT Kharagpur, India; Kripabandhu Ghosh, IIT Kanpur, India; Debasis Ganguly, IBM Research Labs, Dublin, Ireland; Tanmoy Chakraborty, University of Maryland, College Park, USA
  106. 106. Conclusions •  Information bestows power for disaster response –  People need information as much as water, shelter, and food –  Disasters are unavoidable, but planning can lessen their effects •  Social media as a time-critical information source –  Early warnings, event detection, event monitoring –  Availability of information opens new opportunities •  Artificial Intelligence for Social Good –  Applied research at its best –  AI + humans-in-the-loop can enable rapid crisis response –  AI techniques useful for: •  Situational awareness •  Actionable information extraction •  Summarization
  107. 107. THANK YOU! CrisisNLP.qcri.org AIDR.qcri.org Email: mimran@hbku.edu.qa Homepage: http://mimran.me Twitter: @mimran15
