SlideShare a Scribd company logo
Information extraction from
social media: A linguistically
motivated approach
Nelleke Oostdijk1
, Ali Hürriyetoˇglu1,2
, Marco Puts2
, Piet Daas2
,
Antal van den Bosch1
1
Centre for Language Studies, Radboud University, Nijmegen
2
Centraal Bureau voor de Statistiek
Traffic information extraction
Social media data are used to obtain information
• about the flow of traffic on Dutch main roads (heavy traffic, traffic
jams, accidents, adverse weather conditions, blocked roads, etc.)
• about how social media users are coping with or reacting to a
particular situation
• in addition to data from other sources (e.g. sensors, weather
forecasts, reports filed by the authorities)
Twitter
• Freely accessible, real-time, widely used, rich source of information;
tweets come with time stamps and many also with geo-location
Extracted traffic information
1. Activity
2. Advice
3. Development
4. Direction
5. Flow
6. Intensity
7. Lane
8. Location
9. Monitoring
10. Notification
11. Observed event
12. Road condition
13. Road ID
14. Road point
15. Road section
16. Road side
17. Speed
18. Status
19. Time expression
20. Traffic
21. Traffic violation
22. Weather
Use cases
Up-to-date traffic information can improve life for all of us!
Use cases
Informed drivers will
• Be less likely to be caught up in traffic jams (e.g. they can take an
alternative route)
• Experience a safer journey (e.g. because they have been forewarned
about icy roads)
• Be able to plan their journey better (e.g. decide to leave home
earlier)
• Be encouraged to share their observations
Authorities will be able to
• Quickly respond to traffic situations
• Use information about new events
• Get a better insight into the status of various traffic events
• Understand the risks for traffic on the roads better
Use cases
Extracted information can be used in combination with
• Data from traffic sensors
• Information about the existing or expected weather conditions
• Information about ongoing or planned road works
• Information about ongoing or planned events
• Information as regards traffic route planning
Information extraction method
We process the flexible language use found on social media by a
• Rule-based, hierarchical, formal information extraction methodology
• Partial match approach for traffic domain terms, place names, and
time expressions
• Pattern-based token representation that allows the partial coverage
of tokens, while tokens may start with a hash (‘#’) or end with a
period (‘.’).
This methodology improves the flexibility of the lexical and syntactic
structures that can be covered.
Information extraction method
Location Grammar
inLt = WS() + CaselessLiteral(‘in’) + WE()
sTk = WS() + Combine(Optional(‘#’) + Word(alphas)) + WE()
prp = (inLt|bijLt|thvLt|voorLt|opLt)
loc = prp + Optional(~infra + sTk + infra) + place
Flexible Token
optLt = Optional(Literal("#"))
road_tok = oneOf(["weg","route","straat","laan"], caseless=True)
r_cntxt = SkipTo(road_tok, include=True, failOn=White())
token_roadx = Combine(optLit + Word(alphas, exact=1) + r_cntxt)
token_roadx.setParseAction(validateString)
Learning place names
• Place names are detected in slots in linguistic patterns:
– “tussen <optional infrastructure indicator> <place name P1>
en <place name 2>” (EN: between <optional infrastructure
indicator> <place name P1> and <place name P2>)
Detecting new place names
– We filter out person names with exception lists for a cleaner list
of place names.
Development data
• 85,906 Dutch language tweets were collected over a period of 4 years
(i.e. from January 1, 2011 until March 31, 2015) using the hashtag
A2 (#A2).
• We removed 6,351 tweets from users that we know are not relevant,
for instance users dedicated to tweet about “flitsers” ‘detectors of
speed violations’.
• Moreover, 25 tweets that contain “#A2.0” were removed as well.
• Finally, we excluded all retweets (25,580).
• As a consequence the final tweet set consists of 57,940 tweets.
Remaining noise in the data
Although we observed the following additional meanings of the key term
A2, we did not exclude the tweets that has this meaning.
1. some different meaning in a
foreign language
2. paper size
3. football team
4. school classes
5. a quality label for real estate
6. a car type
7. reading level
Our grammars does not take into account any meaning other than the
road ID for the key term A2.
2011 2012 2013 2014 2015
time of the tweets
0
50
100
150
200
250
300
350
400
450
tweetcount
tweet count over time
A2 tweet distribution over time
Main threads of information
• Factual: an event that happened on a road is explained in detail
mainly by the traffic authorities.
• Meta: opinions about traffic situations which are not based on or
related to a single event.
• User observations: drivers or passengers commenting on an event
• User actions involving a road but not referring to the traffic.
Information extraction examples
• 8km file op de #A2 vanuit het zuiden richting Eindhoven , maar
de spitsstrook blijft dicht . Goed bezig jongens ! #file
@vanAnaarBeter
• Lekker in de file op de #A2 , tussen knp. Deil en Nieuwegein-Zuid
16 km stilstaand verkeer ( vertraging : meer dan 30 min ) aldus de
@vid
• Strooiploeg , waar ben je ? #A2 #limburg #sneeuw
Evaluation Data
• We collected 1,448 Dutch tweets between April 1 and 4 using the
road IDs A12, A28, A27, A50, A7, and A58.
• We excluded 40 tweets based on the information we observed in the
development data, e.g., users (40 tweets), retweets (295 tweets).
• We kept only one of the near-duplicate tweets, tweets that have
above .85 cosine similarity are accepted as near-duplicates.
• The remaining set contains 728 tweets.
Manual annotation
LOC
OET
NOT
RSC
TMX
DIR
STA
LN
TRA
ADV
RPN
RSD
ACT
MON
TRV
INT
FLW
SPD
DEV
WTH
RCD
OTH
Information Category
0
50
100
150
200
250
300
350
Numberofoccurrences
Information category distribution in the annotations that were performed by one
annotator by using the FLAT (https://github.com/proycon/flat) platform.
Results on individual tweets
• The total number of manually annotated and automatically detected
information units are 2,699 and 2,400 respectively.
• 285 of the manual annotations and 93 of the automatically detected
information units were not observed on the other side of the
comparison.
• The automatic method detected exactly the same information with
the annotations in 1,245 cases.
• In 79 cases the matched tokens were the same, but the information
category did not match.
Results on individual tweets
• Annotated and automatically detected units overlapped 542 and 73
cases with the same and different information category respectively.
• In the overlapping cases 452 of the time the automatic method
detected longer phrases, which mostly cover the preceding
prepositions and articles.
• In 160 cases the manual annotations were applied to longer phrases
compare to automatic method detects.
• The precision of the correct information category detection is 51%
and 74% for the exact and overlapping token matches, respectively.
The recall is 46% and 66% in the same scope.
Discussion
• Most of the errors were caused by the confusion between direction
and road section categories.
• The automatic method mostly failed to capture locations that are not
preceded by a preposition (such locations were ignored by design).
• Some tokens that are not in the scope of any relevant information
category were mistakenly identified as temporal expressions.
• Having multiple tweets about a single traffic event will increase the
chance of detecting these smaller events.
Next steps
• Create a filter to be able to use a more general tweet stream.
• Increase the lexical coverage: plural forms of (noun) tokens, verb
tenses (third person), 010 etc. for place names.
• Include names of rivers (for inclusion in rules that make use of words
referring to parts of the roads infrastructure, the bridge across the
river X, the tunnel under the river Y), e.g.
• Identify discourse units.
• Do more online learning.
• Add different makes of cars and the types: Audi, Mercedes, Opel, etc.
• Evaluate the system by comparing amount of information captured
with a standard traffic information database.
Acknowledgements
This research was funded by the Dutch national COMMIT programme
and is supported by CBS.
#A2 tweets were retrieved from http://twiqs.nl.
Pyparsing library played a key role in development of the information
extraction method.
Tweets are stored in MongoDB.
Thanks for listening. Any questions or comments?
Please find more information on:
• http://sinfex.science.ru.nl
• https://bitbucket.org/hurrial/sinfex
Contact
Nelleke Oostdijk, n.oostdijk@let.ru.nl
Ali Hürriyetoˇglu, a.hurriyetoglu@let.ru.nl (@hurrial)
Marco Puts, m.puts@cbs.nl (@MarcoPuts)
Piet Daas, pjh.daas@cbs.nl (@pietdaas)
Antal van den Bosch, a.vandenbosch@let.ru.nl (@antalvdb)

More Related Content

Similar to Information extraction from social media: A linguistically motivated approach

Linked Open Data and Applications
Linked Open Data and Applications Linked Open Data and Applications
Linked Open Data and Applications
Victor de Boer
 
6_Big Data Sources part3-Day 3_A_text_mining.pptx
6_Big Data Sources part3-Day 3_A_text_mining.pptx6_Big Data Sources part3-Day 3_A_text_mining.pptx
6_Big Data Sources part3-Day 3_A_text_mining.pptx
ShowravDuttaAnkur
 
Understanding City Traffic Dynamics Utilizing Sensor and Textual Observations
Understanding City Traffic Dynamics Utilizing Sensor and Textual ObservationsUnderstanding City Traffic Dynamics Utilizing Sensor and Textual Observations
Understanding City Traffic Dynamics Utilizing Sensor and Textual Observations
Artificial Intelligence Institute at UofSC
 
New data sources for statistics: Experiences at Statistics Netherlands.
New data sources for statistics: Experiences at Statistics Netherlands.New data sources for statistics: Experiences at Statistics Netherlands.
New data sources for statistics: Experiences at Statistics Netherlands.
Piet J.H. Daas
 
Toward Semantic Data Stream - Technologies and Applications
Toward Semantic Data Stream - Technologies and ApplicationsToward Semantic Data Stream - Technologies and Applications
Toward Semantic Data Stream - Technologies and Applications
Raja Chiky
 
Koisk Public Intractive Software
Koisk Public Intractive Software Koisk Public Intractive Software
Koisk Public Intractive Software
Sazid Khan
 
ICANN Updates by Yu Chang Kuek
ICANN Updates by Yu Chang KuekICANN Updates by Yu Chang Kuek
ICANN Updates by Yu Chang Kuek
MyNOG
 
Rakesh-Nune-Incident-Management-for-DDOT
Rakesh-Nune-Incident-Management-for-DDOTRakesh-Nune-Incident-Management-for-DDOT
Rakesh-Nune-Incident-Management-for-DDOT
Rakesh Nune
 
Research in Intelligent Systems and Data Science at the Knowledge Media Insti...
Research in Intelligent Systems and Data Science at the Knowledge Media Insti...Research in Intelligent Systems and Data Science at the Knowledge Media Insti...
Research in Intelligent Systems and Data Science at the Knowledge Media Insti...
Enrico Motta
 
Weather events identification in social media streams: tools to detect their ...
Weather events identification in social media streams: tools to detect their ...Weather events identification in social media streams: tools to detect their ...
Weather events identification in social media streams: tools to detect their ...
Alfonso Crisci
 
A recommender system for social learning platforms
A recommender system for social learning platformsA recommender system for social learning platforms
A recommender system for social learning platforms
Soudé Fazeli
 
Collecting Twitter Data
Collecting Twitter DataCollecting Twitter Data
Collecting Twitter Data
Cornelius Puschmann
 
Global Media Monitor - Marko Grobelnik
Global Media Monitor - Marko GrobelnikGlobal Media Monitor - Marko Grobelnik
Global Media Monitor - Marko Grobelnik
Marko Grobelnik
 
Presentation ASLIB 2014_Ghoula
Presentation ASLIB 2014_GhoulaPresentation ASLIB 2014_Ghoula
Presentation ASLIB 2014_Ghoula
Nizar Ghoula
 
Intelligent Transportation Systems - History & National Perspective
Intelligent Transportation Systems - History & National PerspectiveIntelligent Transportation Systems - History & National Perspective
Intelligent Transportation Systems - History & National Perspective
American Society of Civil Engineers, Orange County Branch
 
Multimedia Mining
Multimedia Mining Multimedia Mining
Multimedia Mining
Biniam Asnake
 
Linked Statistical Data: does it actually pay off?
Linked Statistical Data: does it actually pay off?Linked Statistical Data: does it actually pay off?
Linked Statistical Data: does it actually pay off?
Oscar Corcho
 
It's a Streaming World! Reasoning upon Rapidly Changing Information (Milano, ...
It's a Streaming World! Reasoning upon Rapidly Changing Information (Milano, ...It's a Streaming World! Reasoning upon Rapidly Changing Information (Milano, ...
It's a Streaming World! Reasoning upon Rapidly Changing Information (Milano, ...
Emanuele Della Valle
 
Get Started with Data Science by Analyzing Traffic Data from California Highways
Get Started with Data Science by Analyzing Traffic Data from California HighwaysGet Started with Data Science by Analyzing Traffic Data from California Highways
Get Started with Data Science by Analyzing Traffic Data from California Highways
Aerospike, Inc.
 
UNIT-1 PPT.pptx
UNIT-1 PPT.pptxUNIT-1 PPT.pptx
UNIT-1 PPT.pptx
RAJESH PYLA
 

Similar to Information extraction from social media: A linguistically motivated approach (20)

Linked Open Data and Applications
Linked Open Data and Applications Linked Open Data and Applications
Linked Open Data and Applications
 
6_Big Data Sources part3-Day 3_A_text_mining.pptx
6_Big Data Sources part3-Day 3_A_text_mining.pptx6_Big Data Sources part3-Day 3_A_text_mining.pptx
6_Big Data Sources part3-Day 3_A_text_mining.pptx
 
Understanding City Traffic Dynamics Utilizing Sensor and Textual Observations
Understanding City Traffic Dynamics Utilizing Sensor and Textual ObservationsUnderstanding City Traffic Dynamics Utilizing Sensor and Textual Observations
Understanding City Traffic Dynamics Utilizing Sensor and Textual Observations
 
New data sources for statistics: Experiences at Statistics Netherlands.
New data sources for statistics: Experiences at Statistics Netherlands.New data sources for statistics: Experiences at Statistics Netherlands.
New data sources for statistics: Experiences at Statistics Netherlands.
 
Toward Semantic Data Stream - Technologies and Applications
Toward Semantic Data Stream - Technologies and ApplicationsToward Semantic Data Stream - Technologies and Applications
Toward Semantic Data Stream - Technologies and Applications
 
Koisk Public Intractive Software
Koisk Public Intractive Software Koisk Public Intractive Software
Koisk Public Intractive Software
 
ICANN Updates by Yu Chang Kuek
ICANN Updates by Yu Chang KuekICANN Updates by Yu Chang Kuek
ICANN Updates by Yu Chang Kuek
 
Rakesh-Nune-Incident-Management-for-DDOT
Rakesh-Nune-Incident-Management-for-DDOTRakesh-Nune-Incident-Management-for-DDOT
Rakesh-Nune-Incident-Management-for-DDOT
 
Research in Intelligent Systems and Data Science at the Knowledge Media Insti...
Research in Intelligent Systems and Data Science at the Knowledge Media Insti...Research in Intelligent Systems and Data Science at the Knowledge Media Insti...
Research in Intelligent Systems and Data Science at the Knowledge Media Insti...
 
Weather events identification in social media streams: tools to detect their ...
Weather events identification in social media streams: tools to detect their ...Weather events identification in social media streams: tools to detect their ...
Weather events identification in social media streams: tools to detect their ...
 
A recommender system for social learning platforms
A recommender system for social learning platformsA recommender system for social learning platforms
A recommender system for social learning platforms
 
Collecting Twitter Data
Collecting Twitter DataCollecting Twitter Data
Collecting Twitter Data
 
Global Media Monitor - Marko Grobelnik
Global Media Monitor - Marko GrobelnikGlobal Media Monitor - Marko Grobelnik
Global Media Monitor - Marko Grobelnik
 
Presentation ASLIB 2014_Ghoula
Presentation ASLIB 2014_GhoulaPresentation ASLIB 2014_Ghoula
Presentation ASLIB 2014_Ghoula
 
Intelligent Transportation Systems - History & National Perspective
Intelligent Transportation Systems - History & National PerspectiveIntelligent Transportation Systems - History & National Perspective
Intelligent Transportation Systems - History & National Perspective
 
Multimedia Mining
Multimedia Mining Multimedia Mining
Multimedia Mining
 
Linked Statistical Data: does it actually pay off?
Linked Statistical Data: does it actually pay off?Linked Statistical Data: does it actually pay off?
Linked Statistical Data: does it actually pay off?
 
It's a Streaming World! Reasoning upon Rapidly Changing Information (Milano, ...
It's a Streaming World! Reasoning upon Rapidly Changing Information (Milano, ...It's a Streaming World! Reasoning upon Rapidly Changing Information (Milano, ...
It's a Streaming World! Reasoning upon Rapidly Changing Information (Milano, ...
 
Get Started with Data Science by Analyzing Traffic Data from California Highways
Get Started with Data Science by Analyzing Traffic Data from California HighwaysGet Started with Data Science by Analyzing Traffic Data from California Highways
Get Started with Data Science by Analyzing Traffic Data from California Highways
 
UNIT-1 PPT.pptx
UNIT-1 PPT.pptxUNIT-1 PPT.pptx
UNIT-1 PPT.pptx
 

Recently uploaded

一比一原版巴斯大学毕业证(Bath毕业证书)学历如何办理
一比一原版巴斯大学毕业证(Bath毕业证书)学历如何办理一比一原版巴斯大学毕业证(Bath毕业证书)学历如何办理
一比一原版巴斯大学毕业证(Bath毕业证书)学历如何办理
y3i0qsdzb
 
Open Source Contributions to Postgres: The Basics POSETTE 2024
Open Source Contributions to Postgres: The Basics POSETTE 2024Open Source Contributions to Postgres: The Basics POSETTE 2024
Open Source Contributions to Postgres: The Basics POSETTE 2024
ElizabethGarrettChri
 
原版制作(unimelb毕业证书)墨尔本大学毕业证Offer一模一样
原版制作(unimelb毕业证书)墨尔本大学毕业证Offer一模一样原版制作(unimelb毕业证书)墨尔本大学毕业证Offer一模一样
原版制作(unimelb毕业证书)墨尔本大学毕业证Offer一模一样
ihavuls
 
原版一比一利兹贝克特大学毕业证(LeedsBeckett毕业证书)如何办理
原版一比一利兹贝克特大学毕业证(LeedsBeckett毕业证书)如何办理原版一比一利兹贝克特大学毕业证(LeedsBeckett毕业证书)如何办理
原版一比一利兹贝克特大学毕业证(LeedsBeckett毕业证书)如何办理
wyddcwye1
 
一比一原版(harvard毕业证书)哈佛大学毕业证如何办理
一比一原版(harvard毕业证书)哈佛大学毕业证如何办理一比一原版(harvard毕业证书)哈佛大学毕业证如何办理
一比一原版(harvard毕业证书)哈佛大学毕业证如何办理
taqyea
 
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
v7oacc3l
 
"Financial Odyssey: Navigating Past Performance Through Diverse Analytical Lens"
"Financial Odyssey: Navigating Past Performance Through Diverse Analytical Lens""Financial Odyssey: Navigating Past Performance Through Diverse Analytical Lens"
"Financial Odyssey: Navigating Past Performance Through Diverse Analytical Lens"
sameer shah
 
University of New South Wales degree offer diploma Transcript
University of New South Wales degree offer diploma TranscriptUniversity of New South Wales degree offer diploma Transcript
University of New South Wales degree offer diploma Transcript
soxrziqu
 
Challenges of Nation Building-1.pptx with more important
Challenges of Nation Building-1.pptx with more importantChallenges of Nation Building-1.pptx with more important
Challenges of Nation Building-1.pptx with more important
Sm321
 
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
sameer shah
 
原版一比一弗林德斯大学毕业证(Flinders毕业证书)如何办理
原版一比一弗林德斯大学毕业证(Flinders毕业证书)如何办理原版一比一弗林德斯大学毕业证(Flinders毕业证书)如何办理
原版一比一弗林德斯大学毕业证(Flinders毕业证书)如何办理
a9qfiubqu
 
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
Aggregage
 
一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理
一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理
一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理
nuttdpt
 
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data LakeViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
Walaa Eldin Moustafa
 
原版一比一多伦多大学毕业证(UofT毕业证书)如何办理
原版一比一多伦多大学毕业证(UofT毕业证书)如何办理原版一比一多伦多大学毕业证(UofT毕业证书)如何办理
原版一比一多伦多大学毕业证(UofT毕业证书)如何办理
mkkikqvo
 
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
bopyb
 
DATA COMMS-NETWORKS YR2 lecture 08 NAT & CLOUD.docx
DATA COMMS-NETWORKS YR2 lecture 08 NAT & CLOUD.docxDATA COMMS-NETWORKS YR2 lecture 08 NAT & CLOUD.docx
DATA COMMS-NETWORKS YR2 lecture 08 NAT & CLOUD.docx
SaffaIbrahim1
 
Predictably Improve Your B2B Tech Company's Performance by Leveraging Data
Predictably Improve Your B2B Tech Company's Performance by Leveraging DataPredictably Improve Your B2B Tech Company's Performance by Leveraging Data
Predictably Improve Your B2B Tech Company's Performance by Leveraging Data
Kiwi Creative
 
Analysis insight about a Flyball dog competition team's performance
Analysis insight about a Flyball dog competition team's performanceAnalysis insight about a Flyball dog competition team's performance
Analysis insight about a Flyball dog competition team's performance
roli9797
 
Population Growth in Bataan: The effects of population growth around rural pl...
Population Growth in Bataan: The effects of population growth around rural pl...Population Growth in Bataan: The effects of population growth around rural pl...
Population Growth in Bataan: The effects of population growth around rural pl...
Bill641377
 

Recently uploaded (20)

一比一原版巴斯大学毕业证(Bath毕业证书)学历如何办理
一比一原版巴斯大学毕业证(Bath毕业证书)学历如何办理一比一原版巴斯大学毕业证(Bath毕业证书)学历如何办理
一比一原版巴斯大学毕业证(Bath毕业证书)学历如何办理
 
Open Source Contributions to Postgres: The Basics POSETTE 2024
Open Source Contributions to Postgres: The Basics POSETTE 2024Open Source Contributions to Postgres: The Basics POSETTE 2024
Open Source Contributions to Postgres: The Basics POSETTE 2024
 
原版制作(unimelb毕业证书)墨尔本大学毕业证Offer一模一样
原版制作(unimelb毕业证书)墨尔本大学毕业证Offer一模一样原版制作(unimelb毕业证书)墨尔本大学毕业证Offer一模一样
原版制作(unimelb毕业证书)墨尔本大学毕业证Offer一模一样
 
原版一比一利兹贝克特大学毕业证(LeedsBeckett毕业证书)如何办理
原版一比一利兹贝克特大学毕业证(LeedsBeckett毕业证书)如何办理原版一比一利兹贝克特大学毕业证(LeedsBeckett毕业证书)如何办理
原版一比一利兹贝克特大学毕业证(LeedsBeckett毕业证书)如何办理
 
一比一原版(harvard毕业证书)哈佛大学毕业证如何办理
一比一原版(harvard毕业证书)哈佛大学毕业证如何办理一比一原版(harvard毕业证书)哈佛大学毕业证如何办理
一比一原版(harvard毕业证书)哈佛大学毕业证如何办理
 
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
 
"Financial Odyssey: Navigating Past Performance Through Diverse Analytical Lens"
"Financial Odyssey: Navigating Past Performance Through Diverse Analytical Lens""Financial Odyssey: Navigating Past Performance Through Diverse Analytical Lens"
"Financial Odyssey: Navigating Past Performance Through Diverse Analytical Lens"
 
University of New South Wales degree offer diploma Transcript
University of New South Wales degree offer diploma TranscriptUniversity of New South Wales degree offer diploma Transcript
University of New South Wales degree offer diploma Transcript
 
Challenges of Nation Building-1.pptx with more important
Challenges of Nation Building-1.pptx with more importantChallenges of Nation Building-1.pptx with more important
Challenges of Nation Building-1.pptx with more important
 
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
 
原版一比一弗林德斯大学毕业证(Flinders毕业证书)如何办理
原版一比一弗林德斯大学毕业证(Flinders毕业证书)如何办理原版一比一弗林德斯大学毕业证(Flinders毕业证书)如何办理
原版一比一弗林德斯大学毕业证(Flinders毕业证书)如何办理
 
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
 
一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理
一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理
一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理
 
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data LakeViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
 
原版一比一多伦多大学毕业证(UofT毕业证书)如何办理
原版一比一多伦多大学毕业证(UofT毕业证书)如何办理原版一比一多伦多大学毕业证(UofT毕业证书)如何办理
原版一比一多伦多大学毕业证(UofT毕业证书)如何办理
 
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
 
DATA COMMS-NETWORKS YR2 lecture 08 NAT & CLOUD.docx
DATA COMMS-NETWORKS YR2 lecture 08 NAT & CLOUD.docxDATA COMMS-NETWORKS YR2 lecture 08 NAT & CLOUD.docx
DATA COMMS-NETWORKS YR2 lecture 08 NAT & CLOUD.docx
 
Predictably Improve Your B2B Tech Company's Performance by Leveraging Data
Predictably Improve Your B2B Tech Company's Performance by Leveraging DataPredictably Improve Your B2B Tech Company's Performance by Leveraging Data
Predictably Improve Your B2B Tech Company's Performance by Leveraging Data
 
Analysis insight about a Flyball dog competition team's performance
Analysis insight about a Flyball dog competition team's performanceAnalysis insight about a Flyball dog competition team's performance
Analysis insight about a Flyball dog competition team's performance
 
Population Growth in Bataan: The effects of population growth around rural pl...
Population Growth in Bataan: The effects of population growth around rural pl...Population Growth in Bataan: The effects of population growth around rural pl...
Population Growth in Bataan: The effects of population growth around rural pl...
 

Information extraction from social media: A linguistically motivated approach

  • 1. Information extraction from social media: A linguistically motivated approach Nelleke Oostdijk1 , Ali Hürriyetoˇglu1,2 , Marco Puts2 , Piet Daas2 , Antal van den Bosch1 1 Centre for Language Studies, Radboud University, Nijmegen 2 Centraal Bureau voor de Statistiek
  • 2. Traffic information extraction Social media data are used to obtain information • about the flow of traffic on Dutch main roads (heavy traffic, traffic jams, accidents, adverse weather conditions, blocked roads, etc.) • about how social media users are coping with or reacting to a particular situation • in addition to data from other sources (e.g. sensors, weather forecasts, reports filed by the authorities) Twitter • Freely accessible, real-time, widely used, rich source of information; tweets come with time stamps and many also with geo-location
  • 3.
  • 4. Extracted traffic information 1. Activity 2. Advice 3. Development 4. Direction 5. Flow 6. Intensity 7. Lane 8. Location 9. Monitoring 10. Notification 11. Observed event 12. Road condition 13. Road ID 14. Road point 15. Road section 16. Road side 17. Speed 18. Status 19. Time expression 20. Traffic 21. Traffic violation 22. Weather
  • 5. Use cases Up-to-date traffic information can improve life for all of us!
  • 6. Use cases Informed drivers will • Be less likely to be caught up in traffic jams (e.g. they can take an alternative route) • Experience a safer journey (e.g. because they have been forewarned about icy roads) • Be able to plan their journey better (e.g. decide to leave home earlier) • Be encouraged to share their observations Authorities will be able to • Quickly respond to traffic situations • Use information about new events • Get a better insight into the status of various traffic events • Understand the risks for traffic on the roads better
  • 7. Use cases Extracted information can be used in combination with • Data from traffic sensors • Information about the existing or expected weather conditions • Information about ongoing or planned road works • Information about ongoing or planned events • Information as regards traffic route planning
  • 8. Information extraction method We process the flexible language use found on social media by a • Rule-based, hierarchical, formal information extraction methodology • Partial match approach for traffic domain terms, place names, and time expressions • Pattern-based token representation that allows the partial coverage of tokens, while tokens may start with a hash (‘#’) or end with a period (‘.’). This methodology improves the flexibility of the lexical and syntactic structures that can be covered.
  • 9. Information extraction method Location Grammar inLt = WS() + CaselessLiteral(‘in’) + WE() sTk = WS() + Combine(Optional(‘#’) + Word(alphas)) + WE() prp = (inLt|bijLt|thvLt|voorLt|opLt) loc = prp + Optional(~infra + sTk + infra) + place Flexible Token optLt = Optional(Literal("#")) road_tok = oneOf(["weg","route","straat","laan"], caseless=True) r_cntxt = SkipTo(road_tok, include=True, failOn=White()) token_roadx = Combine(optLit + Word(alphas, exact=1) + r_cntxt) token_roadx.setParseAction(validateString)
  • 10. Learning place names • Place names are detected in slots in linguistic patterns: – “tussen <optional infrastructure indicator> <place name P1> en <place name 2>” (EN: between <optional infrastructure indicator> <place name P1> and <place name P2>) Detecting new place names – We filter out person names with exception lists for a cleaner list of place names.
  • 11. Development data • 85,906 Dutch language tweets were collected over a period of 4 years (i.e. from January 1, 2011 until March 31, 2015) using the hashtag A2 (#A2). • We removed 6,351 tweets from users that we know are not relevant, for instance users dedicated to tweet about “flitsers” ‘detectors of speed violations’. • Moreover, 25 tweets that contain “#A2.0” were removed as well. • Finally, we excluded all retweets (25,580). • As a consequence the final tweet set consists of 57,940 tweets.
  • 12. Remaining noise in the data Although we observed the following additional meanings of the key term A2, we did not exclude the tweets that has this meaning. 1. some different meaning in a foreign language 2. paper size 3. football team 4. school classes 5. a quality label for real estate 6. a car type 7. reading level Our grammars does not take into account any meaning other than the road ID for the key term A2.
  • 13. 2011 2012 2013 2014 2015 time of the tweets 0 50 100 150 200 250 300 350 400 450 tweetcount tweet count over time A2 tweet distribution over time
  • 14. Main threads of information • Factual: an event that happened on a road is explained in detail mainly by the traffic authorities. • Meta: opinions about traffic situations which are not based on or related to a single event. • User observations: drivers or passengers commenting on an event • User actions involving a road but not referring to the traffic.
  • 15. Information extraction examples • 8km file op de #A2 vanuit het zuiden richting Eindhoven , maar de spitsstrook blijft dicht . Goed bezig jongens ! #file @vanAnaarBeter • Lekker in de file op de #A2 , tussen knp. Deil en Nieuwegein-Zuid 16 km stilstaand verkeer ( vertraging : meer dan 30 min ) aldus de @vid • Strooiploeg , waar ben je ? #A2 #limburg #sneeuw
  • 16. Evaluation Data • We collected 1,448 Dutch tweets between April 1 and 4 using the road IDs A12, A28, A27, A50, A7, and A58. • We excluded 40 tweets based on the information we observed in the development data, e.g., users (40 tweets), retweets (295 tweets). • We kept only one of the near-duplicate tweets, tweets that have above .85 cosine similarity are accepted as near-duplicates. • The remaining set contains 728 tweets.
  • 17. Manual annotation LOC OET NOT RSC TMX DIR STA LN TRA ADV RPN RSD ACT MON TRV INT FLW SPD DEV WTH RCD OTH Information Category 0 50 100 150 200 250 300 350 Numberofoccurrences Information category distribution in the annotations that were performed by one annotator by using the FLAT (https://github.com/proycon/flat) platform.
  • 18. Results on individual tweets • The total number of manually annotated and automatically detected information units are 2,699 and 2,400 respectively. • 285 of the manual annotations and 93 of the automatically detected information units were not observed on the other side of the comparison. • The automatic method detected exactly the same information with the annotations in 1,245 cases. • In 79 cases the matched tokens were the same, but the information category did not match.
  • 19. Results on individual tweets • Annotated and automatically detected units overlapped 542 and 73 cases with the same and different information category respectively. • In the overlapping cases 452 of the time the automatic method detected longer phrases, which mostly cover the preceding prepositions and articles. • In 160 cases the manual annotations were applied to longer phrases compare to automatic method detects. • The precision of the correct information category detection is 51% and 74% for the exact and overlapping token matches, respectively. The recall is 46% and 66% in the same scope.
  • 20. Discussion • Most of the errors were caused by the confusion between direction and road section categories. • The automatic method mostly failed to capture locations that are not preceded by a preposition (such locations were ignored by design). • Some tokens that are not in the scope of any relevant information category were mistakenly identified as temporal expressions. • Having multiple tweets about a single traffic event will increase the chance of detecting these smaller events.
  • 21. Next steps • Create a filter to be able to use a more general tweet stream. • Increase the lexical coverage: plural forms of (noun) tokens, verb tenses (third person), 010 etc. for place names. • Include names of rivers (for inclusion in rules that make use of words referring to parts of the roads infrastructure, the bridge across the river X, the tunnel under the river Y), e.g. • Identify discourse units. • Do more online learning. • Add different makes of cars and the types: Audi, Mercedes, Opel, etc. • Evaluate the system by comparing amount of information captured with a standard traffic information database.
  • 22. Acknowledgements This research was funded by the Dutch national COMMIT programme and is supported by CBS. #A2 tweets were retrieved from http://twiqs.nl. Pyparsing library played a key role in development of the information extraction method. Tweets are stored in MongoDB.
  • 23. Thanks for listening. Any questions or comments? Please find more information on: • http://sinfex.science.ru.nl • https://bitbucket.org/hurrial/sinfex Contact Nelleke Oostdijk, n.oostdijk@let.ru.nl Ali Hürriyetoˇglu, a.hurriyetoglu@let.ru.nl (@hurrial) Marco Puts, m.puts@cbs.nl (@MarcoPuts) Piet Daas, pjh.daas@cbs.nl (@pietdaas) Antal van den Bosch, a.vandenbosch@let.ru.nl (@antalvdb)