Maritime safety events extraction from news articles



This is a presentation for my master's thesis. The system created for the thesis extracts maritime safety events from news articles, mainly using text classification and information extraction.

Published in: Business, Technology


  1. Vrije Universiteit, MSc Information Sciences. Maritime Safety Events Extraction from News Articles. Anastasios Martidis (anastasios.martidis@student.vu.nl), July 31, 2012. Supervisors: Willem R. van Hage, Dr; Davide Ceolin, MSc
  2. Outline: Introduction; Information Spectrum; Problem Statement; Significance of Research; Research Questions; Hypotheses; Materials and Methods; Training Sets; System Overview; Test Sets; Evaluation; Results; Conclusions
  3. Introduction: “We are drowning in information, and starved for knowledge.” (John Naisbitt)
  4. Information Spectrum
     • Structured data: Automatic Identification System (AIS) (image: theoceandreamer.files.wordpress.com/2011/03/img_21861.jpg)
     • Free text: news articles (image: http://www.tideway.nl/images/NorthWestEveningMail-PortSettoRockasTurbinesGetBoostfromaRollingstone-Walney2010-kleinbestan.jpg)
  5. Problem Statement
     • News articles: descriptive and informative, but vast in number, growing and updated daily, and written as free text that is difficult to process automatically.
     • Generic Natural Language Processing tools: popular and useful, but limited in recognizing specific types of maritime safety events and ship names.
  6. Significance of the Research
     • Applications: risk assessments; improvement of vessel safety standards; port facility security assessments; recognition of problematic areas (piracy); identification of shipping companies, ships, and ship constructors with a history of maritime safety events; maritime education and training.
     • Potential stakeholders: ship owners, operators and managers; insurance companies; Coast Guard; International Maritime Organization (IMO); International Maritime Security (IMS); Private Security Companies (PSCs).
  7. Research Questions
     1. Can we automatically process a news article to determine whether it concerns a maritime safety event?
     2. Can we automatically extract a description of a maritime safety event? The description should identify the type of maritime safety event, the ships involved, the location, and the date and time.
     3. Can we recognize relations within, and the significance of, the information extracted from the text?
        - Can we recognize the dominant event, that is, the event primarily described in the news article?
        - Can we identify relations between extracted locations and specific event types described in the text?
  8. Hypotheses
     1. We can define sets of keywords that, when present in certain combinations in the text being processed, indicate that it concerns a maritime safety event.
     2. We can extract a description of the event described in the news article using rule-based text classification with sets of keywords, datasets of ship names, regular expression matching, and Named Entity Recognition.
     3. We can evaluate the information extracted from the text by:
        - identifying the dominant event by measuring the frequency of keyword indicators for each event type;
        - recognizing relations between locations and event types by examining the positions of locations and event type indicators in the text.
  9. Materials & Methods: Rule-Based Text Classification; Information Extraction; OpenCalais; NLTK; AIS; DBpedia
  10. 10. Training Set 200 news articles (retrieved from CBS news) 100 related to maritime safety (53937 tokens) 100 of general domains (47053 tokens) Word Frequency Maritime Safety Related General Domains 10
  11. Training Set Outcomes: manual identification of significant words; categorization into sets of keywords by meaning; use of keywords for text classification; mapping of keywords to maritime safety event types; use of keywords as event type indicators
  12. Text Classification. Input: document D. Lists of keywords:
      • L1: most frequent keywords
      • L2: safety related keywords
      • L3: vessel type keywords
      • L4: maritime related keywords
      • L5: naval hierarchy keywords
      • L6: part of ship keywords
      • L7: water based locations keywords
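The slides do not spell out which keyword combinations trigger a positive classification, so the following is only a minimal sketch of the idea. The three keyword sets and the rule in `is_maritime_safety` are assumptions made for illustration, not the thesis's actual lists L1 to L7 or its rule.

```python
import re

# Hypothetical keyword lists; the real contents of L1..L7 are not given in the slides.
KEYWORD_LISTS = {
    "safety": {"sank", "capsized", "collision", "rescued", "pirates", "aground"},
    "vessel_type": {"ship", "tanker", "ferry", "vessel", "trawler"},
    "water_location": {"sea", "port", "harbour", "coast", "strait"},
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def matched_lists(text):
    """Return the names of the keyword lists with at least one keyword in the text."""
    tokens = set(tokenize(text))
    return {name for name, words in KEYWORD_LISTS.items() if tokens & words}


def is_maritime_safety(text):
    """Assumed rule: a safety keyword must co-occur with a vessel type
    or water location keyword somewhere in the document."""
    hits = matched_lists(text)
    return "safety" in hits and bool(hits & {"vessel_type", "water_location"})
```

Requiring a combination rather than a single keyword is what lets a rule like this reject a general-domain sentence such as "The stock market sank after the report", which contains a safety keyword but no maritime context.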
  13. Event Type Recognition. Input: document D. Event types (ET): Piracy; Capsizing; Sinking; Drifting; Oil spill; Leakage; Fire/Explosion; Evacuation; Grounding; Collision
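Assuming event types are recognized through the keyword indicators mapped to them (as the training set outcomes suggest), a minimal sketch could look like the following. The indicator words are invented for illustration and cover only five of the ten event types.

```python
import re

# Hypothetical indicator keywords; the thesis's actual mapping is not shown in the slides.
EVENT_INDICATORS = {
    "Piracy": {"pirate", "pirates", "hijacked", "hijacking"},
    "Sinking": {"sank", "sinking", "sunk"},
    "Collision": {"collided", "collision"},
    "Fire/Explosion": {"fire", "explosion", "exploded", "blaze"},
    "Grounding": {"aground", "grounded"},
}


def recognize_event_types(text):
    """Return, in alphabetical order, every event type that has
    at least one indicator keyword in the text."""
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    return sorted(et for et, keywords in EVENT_INDICATORS.items() if tokens & keywords)
```

For example, `recognize_event_types("The tanker collided with a trawler and later sank.")` reports both Collision and Sinking, since a single article can describe several event types.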
  14. Ship Names Extraction: a dataset of ship names retrieved from AIS messages and DBpedia; comparison of the dataset entries to the text. Compromises: location names; parts of names
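A dataset lookup of this kind can be sketched as a whole-word search for each known ship name. The three-entry dataset below is hypothetical; the real one would come from AIS messages and DBpedia.

```python
import re

# Hypothetical dataset; "Genoa" stands in for a ship that shares its name with a city.
SHIP_NAMES = ["Costa Concordia", "Rena", "Genoa"]


def extract_ship_names(text, ship_names=SHIP_NAMES):
    """Return the dataset entries that occur in the text as whole words.
    Matching is case-sensitive, since ship names are proper nouns."""
    found = []
    for name in ship_names:
        pattern = r"\b" + re.escape(name) + r"\b"
        if re.search(pattern, text):
            found.append(name)
    return found
```

On "The Costa Concordia had sailed from Genoa." this returns both "Costa Concordia" and "Genoa", even though Genoa is used as a place name there. That kind of ambiguity appears to be what the "location names" compromise on the slide refers to.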
  15. Locations Extraction: OpenCalais is used for NER tasks; only locations are of interest. Four types of locations are recognized by Calais: Continent; Country; City; Province or State
  16. Date and Time Extraction: chunked sentences; pattern matching using regular expressions:
      • Numeric representation of date (e.g., 1322012, 22-07-12)
      • Months (e.g., January or Jan.)
      • Days (e.g., Monday or Mon.)
      • Day periods (e.g., morning, afternoon)
      • Time (e.g., 11:00am or 11.00 a.m.)
      Presented in a specific order for each sentence
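The pattern categories above can be sketched with Python's `re` module as follows. The exact expressions and the output order used in the thesis are not shown, so the patterns and the fixed category order here are assumptions; the numeric date pattern only covers separated forms such as 22-07-12.

```python
import re

# Assumed patterns, one per category from the slide, in a fixed output order.
PATTERNS = [
    ("date", r"\b\d{1,2}[-/.]\d{1,2}[-/.]\d{2,4}\b"),
    ("month", r"\b(?:January|February|March|April|May|June|July|August|September"
              r"|October|November|December|Jan|Feb|Mar|Apr|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\b"),
    ("day", r"\b(?:Monday|Tuesday|Wednesday|Thursday|Friday|Saturday|Sunday"
            r"|Mon|Tue|Wed|Thu|Fri|Sat|Sun)\b"),
    ("period", r"\b(?:morning|afternoon|evening|night)\b"),
    ("time", r"\b\d{1,2}[:.]\d{2}\s?(?:a\.?m\.?|p\.?m\.?)"),
]


def extract_date_time(sentence):
    """Return a dict mapping each category to the list of matches found in one sentence."""
    return {label: re.findall(pattern, sentence) for label, pattern in PATTERNS}
```

Month and day names are matched case-sensitively so that the modal verb "may" is not mistaken for the month.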
  17. Dominant Event Recognition: for each list of event type indicator keywords, sum the keyword occurrences in the text. The event type with the highest sum is predicted as the dominant event.
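The frequency rule above can be sketched in a few lines. The indicator keywords are hypothetical, and returning `None` when no indicator occurs is an assumption of this sketch rather than documented behavior of the system.

```python
import re
from collections import Counter

# Hypothetical indicator keywords for three of the event types.
EVENT_INDICATORS = {
    "Piracy": {"pirates", "hijacked"},
    "Sinking": {"sank", "sinking"},
    "Fire/Explosion": {"fire", "explosion"},
}


def dominant_event(text, indicators=EVENT_INDICATORS):
    """Sum indicator keyword occurrences per event type and return the type
    with the highest sum, or None if no indicator occurs at all."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    scores = {et: sum(counts[k] for k in kws) for et, kws in indicators.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None
```

In the case of a tie, `max` returns the first event type in dictionary order; the slides do not say how the original system resolves ties.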
  18. Location to Event Relations: the text is chunked into sentences. For every sentence containing an extracted location, if a keyword indicator of an event type also occurs in the same sentence, the location is predicted to be related to that event type.
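The sentence-level co-occurrence rule can be sketched as below. A simple regular expression stands in for the sentence chunker to keep the example self-contained, and the locations are assumed to have been extracted already; both are simplifications of this sketch.

```python
import re


def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def relate_locations_to_events(text, locations, event_indicators):
    """Return (location, event type) pairs that co-occur within one sentence."""
    relations = set()
    for sentence in split_sentences(text):
        tokens = set(re.findall(r"[a-z]+", sentence.lower()))
        for location in locations:
            if location not in sentence:
                continue
            for event_type, keywords in event_indicators.items():
                if tokens & keywords:
                    relations.add((location, event_type))
    return relations
```

Called on "The ferry sank off Zanzibar. Officials in Dar es Salaam opened an inquiry." with both place names as locations and `{"Sinking": {"sank"}}` as indicators, only Zanzibar is related to Sinking, because the second sentence contains no event indicator.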
  19. Test Set: 200 news articles (BBC, Reuters), 100 related to maritime safety and 100 from general domains (50 of them selected in an attempt to mislead the system). Each news article was manually labeled and automatically processed by the system; the results were compared to the labeled news article.
  20. Labeled News Article
  21. Results of the System
  22. Evaluation
  23. Results: Text Classification. Precision: 100%; Recall: 100%; F-measure: 100%
  24. Results: Event Type Recognition. Precision: 88%; Recall: 97%; F-measure: 92.2%
  25. Results: Ship Name Extraction. Precision: 18.5%; Recall: 45.3%; F-measure: 26.3%
  26. Results: Location Extraction. Precision: 88.5%; Recall: 74.7%; F-measure: 81%
  27. Results: Date and Time Extraction. Precision: 95.3%; Recall: 89.4%; F-measure: 92.3%
  28. Results: Dominant Event Recognition. Precision: 92%; Recall: 92%; F-measure: 92%
  29. Results: Location to Event Relations. Precision: 81%; Recall: 67.8%; F-measure: 73.8%
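The reported F-measures are consistent with the balanced F-measure, the harmonic mean of precision and recall, F = 2PR / (P + R). The sketch below shows that formula together with one plausible way of scoring extracted items against manual labels; the set-based comparison is an assumption, since the slides do not describe how matches were counted.

```python
def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def precision_recall_f(predicted, gold):
    """Score a collection of extracted items against manually labeled items."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall, f_measure(precision, recall)
```

For ship name extraction, `f_measure(18.5, 45.3)` gives roughly 26.3, matching the slide. For event type recognition the formula gives about 92.28, so the 92.2% on the slide looks truncated rather than rounded.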
  30. Conclusions: the system accomplished the extraction of maritime safety events from news articles; the overall performance of the system was satisfactory; the system can be improved and refined; ship name extraction requires a different approach
