Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Does keyword noise change over space and time? A case study of flood- and rain-related social media messages

104 views

Published on

Social media is a valuable source of information for different domains, since users share their opinion and knowledge in (near) real-time. Moreover, users usually use different words to refer to a particular event (e.g., a rain event). These words may be later employed to filter social media messages regarding new occurrences of the event and, thus, to reduce the number of unrelated messages. These words, however, may have different meanings and, thus, may not reduce the number of messages. In this work, we conduct a case study to measure which rain- or flood-related keywords are less relevant to reduce the number of unrelated messages. The results show that the keywords change over space, due to local language/culture, and time, specially in different time scales.

Published in: Education
  • Be the first to comment

  • Be the first to like this

Does keyword noise change over space and time? A case study of flood- and rain-related social media messages

  1. 1. Does keyword noise change over space and time? A case study of flood- and rain-related social media messages Sidgley Camargo de Andrade1 , L´ıvia Castro Degrossi2 Camilo Restrepo- Estrada3 , Alexandre C. B. Delbem2 , Jo˜ao Porto de Albuquerque4 December 2018 (1) Federal University of Technology - Paran´a (UTFPR), Toledo, Brazil (2) University of S˜ao Paulo (USP), S˜ao Carlos, Brazil (3) University of Antioquia, Medell´ın, Colombia (4) University of Warwick, Coventry, UK Publication available at http://mtc-m16c.sid.inpe.br/col/sid.inpe.br/mtc-m16c/2018/12.27.18.33/ doc/p11.pdf GEOINFO 2018 – Brazilian Symposium on Geoinformatics, Campina Grande, Para´ıba, Brazil
  2. 2. Big Data (Social Media) & Information Retrieval • Volume • Variety • Veracity • Velocity • Noise and rumor • Heterogeneity • Uncertainty • High processing (on-line) Retrieving relevant and meaningful data is not a straightforward task. * Paraphrasing Dr. Jo˜ao Porto de Albuquerque’s ideia/analogy (Warwick/UK) who used the image of the roman god Janus. 1
  3. 3. How to filter ueseful information? Within the context of social media analytics: • Social media users utilize a variety of terms to refer to an event that they observe. • Keyword-based filtering approach has been widely employed to reduce the “noise”, i.e., messages that contain event-related keywords, but are not related to an event indeed or are duplicated. • The noise usually occurs when the keywords have different definitions and/or meanings (e.g. “Santos” can refer to the coastal city or the soccer team). • Variations, misspellings and typos are inherent in the social media messages (text content). 2
  4. 4. Research • Question: Does keyword noise change over space and time? • Methodology: To carry out a case study supported by an exploratory content analysis to measure the signal and noise rate of the rain- and flood-related keywords on a data sample from Twitter. • Purpose: To measure which rain- or flood-related keywords are less relevant to reduce the noise. alagamento (flood), alagado (flooded), alagada (flooded), alagando (it’s flooding), alagou (flooded), alagar (to flood), chove (rain), chova (rain), chovia (had been rained), chuva (rain), chuvarada (rain), chuvosa (rain), chuvoso (rainy), chuvona (heavy rain), chuvinha (drizzle), chu- visco (drizzle), chuvendo (it’s raining), dil´uvio (heavy rain), garoa (drizzle), inunda¸c˜ao (flood), inundada (flooded), inundado (flooded), inundar (to flood), inundam (flood), inundou (flooded), temporal (storm), temporais (storms) 3
  5. 5. Exploratory content analysis • Study area: S˜ao Paulo city, Brazil. • Number of tweets retrieved: 11,848,923 million, from 7 Nov. 2016 to 28 Feb. 2017. • Geotagged tweets: 891,367 thousand (7.52%). • Keyword filtered tweets: 5,408 thousand. • On-topic/Off-topic tweets: 3,964 and 1,444 thousand, respectivelly. (Krippendorff’s alpha coefficient of 0.72 – 5 raters) 4
  6. 6. Signal and Noise To aggregate the signal and noise in different time scale across the districts: • Signalst = nr. on-topic tweets nr. filtered tweets • Noisest = nr. off-topic tweets nr. filtered tweets where s and t correspond to district and time scale, respectively. 5
  7. 7. 6
  8. 8. Signal and noise of the keywords (S´e and Barra Funda districts) The signal and noise were measured as the fraction between on-topic and off-topic tweets and all the tweets posted within the district (relative frequency) and, later, rescaled to [-1, 1]. 7
  9. 9. 8
  10. 10. Signal and noise of the keywords (Itaquera and Cidade Dutra districts) The signal and noise were measured as the fraction between on-topic and off-topic tweets and all the tweets posted within the district (relative frequency) and, later, rescaled to [-1, 1]. 9
  11. 11. Discussions & Conclusions • Keywords should be selected with caution since they are sensible to time and space. • At the first sight, all predefined keywords had potential to filter rain- and flood-related messages; however, our analysis demonstrated that some keywords are noisy and may introduce false-positive messages. • Local issues should be taken into account, such as language/culture, and, specially, atypical events. 10
  12. 12. References IBGE (2010). Censo Demogr´afico 2010. Brazilian Institute of Geography and Statistics, Rio de Janeiro. Krippendorff, K. (2004). Reliability in content analysis: Some common misconceptions and recommendations. Human Communications Research, 30(3):411–433. Rzeszewski, M. (2018). Geosocial capta in geographical research – a critical analysis. Cartography and Geographic Information Science, 45(1):18–30. Vieweg, S., Hughes, A. L., Starbird, K., and Palen, L. (2010). Microblogging during two natural hazards events: What twitter may contribute to situational awareness. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI ’10, pages 1079–1088, New York, NY, USA. ACM. 11
  13. 13. Thank you very much! Sidgley Camargo de Andrade Federal University of Technology – Paran´a – UTFPR-Toledo sidgleyandrade@utfpr.edu.br http://pessoal.utfpr.edu.br/sidgleyandrade/ http://www.agora.icmc.usp.br/ Acknowledgements/Funding 11

×