
Big Data in the Geosciences : Geoscience Aware Sentiment Analysis


A geoscience-aware text sentiment algorithm improves prediction over out-of-the-box tools such as IBM Watson, Google, Microsoft and Amazon by over 30%.
Presented early research findings today at the ‘Big Data in the Geosciences’ conference at the Geological Society of London.
Google opened proceedings with a talk on satellite imagery and the Earth Engine; subsequent talks ranged from using Twitter for early warning of earthquakes, through virtual reality and digital analogues, to applying deep learning to detect volcano deformation. Some fascinating insights.
My latest research addressed sentiment, the contextual opinion and tone, around mentions of petroleum system elements (such as source rock, migration, reservoir and trapping) in literature, company reports and presentations. The hypothesis is that stacking the somewhat independent opinion/tone in text (the averages, the outliers, the contradictions) may show geoscientists what they don't know and challenge what they think they do know.
The research question was to assess whether a geoscience-aware algorithm could improve on existing APIs/algorithms used for sentiment analysis, and how useful the resulting visualizations might be.
Using a held-back test set of 750 labelled petroleum (oil and gas) examples, the Geoscience Aware text sentiment analyZER (GAZER) algorithm achieved 90.4% accuracy for two classes (positive and negative) and 84.26% accuracy for three classes (positive, negative and neutral sentiment). This compared favourably with generic Paragraph Vector and Naïve Bayes approaches. It also compared favourably with the out-of-the-box sentiment Cloud APIs from IBM Watson, Microsoft, Amazon and Google, which averaged approximately 50%.
Agreement between the retired geoscientists labelling the data was 91.6%, indicating the algorithm appears to be approaching 'human-like' performance.
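For reference, a minimal sketch of how a pairwise agreement figure like the 91.6% above can be computed, assuming two annotators' label lists are available (the toy labels and the use of scikit-learn's cohen_kappa_score are illustrative, not the study's actual scripts):

# Minimal sketch: raw agreement and chance-corrected agreement (Cohen's kappa)
# between two annotators. The label lists below are hypothetical placeholders.
from sklearn.metrics import cohen_kappa_score

annotator_a = ["pos", "neg", "neu", "pos", "neg", "pos"]
annotator_b = ["pos", "neg", "pos", "pos", "neg", "pos"]

agreement = sum(a == b for a, b in zip(annotator_a, annotator_b)) / len(annotator_a)
kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Raw agreement: {agreement:.1%}, Cohen's kappa: {kappa:.2f}")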
This supports findings in other areas showing the need to customize sentiment analysis for specific domains and the criticality of training data specific to the task in hand. The findings also support existing literature suggesting that generative probabilistic machine learning algorithms may perform better than discriminative ones when classifying short snippets of information such as sentences and bullets in PowerPoint presentations.
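As a rough illustration of that generative-versus-discriminative comparison, here is a minimal scikit-learn sketch contrasting Multinomial Naïve Bayes with a linear SVM on short labelled snippets (the tiny example corpus is invented for illustration; the study's own training data and features are far richer):

# Minimal sketch: Naive Bayes (generative) vs linear SVM (discriminative)
# on short text snippets. The example sentences are hypothetical.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

sentences = [
    "excellent oil-prone source rock with high TOC",
    "reservoir quality is poor due to extensive cementation",
    "good porosity and permeability in the sandstone reservoir",
    "trap integrity is questionable with a breached seal",
    "migration pathways are well developed",
    "no effective seal was encountered in the well",
]
labels = ["pos", "neg", "pos", "neg", "pos", "neg"]

for name, clf in [("Naive Bayes", MultinomialNB()), ("Linear SVM", LinearSVC())]:
    pipe = make_pipeline(CountVectorizer(ngram_range=(1, 2)), clf)
    scores = cross_val_score(pipe, sentences, labels, cv=2)  # toy 2-fold split
    print(name, round(scores.mean(), 2))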
Early evidence suggested that the resulting visualizations, such as streamgraphs of the sentiment data, could be used to challenge individual biases and organizational dogma, presenting an area for further research.

Big Data in the Geosciences : Geoscience Aware Sentiment Analysis

  1. Mining Geological Sentiment from Unstructured Text: Early Findings from an Exploratory Study. Paul H. Cleverley Ph.D., Robert Gordon University, Aberdeen. The Geological Society Janet Watson Meeting, London, 27th Feb 2018. Meeting theme: "A Data Exploration: Big Impact of Data in Geoscience".
  2. Cognitive Bias. COGNITIVE BIAS (Rose 2016): premature selection of theory, personal hubris, lack of perspective, lack of imagination, laziness, excessive self-interest; GroupThink is to be avoided. "A compelling narrative fosters an illusion of inevitability" (Daniel Kahneman). "People generally see what they look for, and hear what they listen for" (Harper Lee). BIG DATA SENTIMENT ANALYSIS FROM EXTERNAL LITERATURE / COMPANY ARCHIVES: being aware of independently stacked opinion in literature may challenge our own biases. Can it challenge and surprise us?
  3. Research Questions – Sentiment Analysis. Literature review and gap: • Sentiment classifiers need customization (Van Boeyen 2014); there are issues with data sparseness and domain language (Asghar et al. 2017). • Generative 'probabilistic' machine learning (e.g. Bayes) may deliver more accurate classifications than discriminative 'geometric' machine learning (e.g. SVM) for sentences (Wang & Manning 2012). • Existing geoscience text-mining literature tends to focus on concept/entity extraction and NLP ontology population, e.g. biodiversity (Peters et al. 2015; Liu et al. 2016; Yulaeva et al. 2017). There are no known published studies investigating sentiment analysis in petroleum geoscience. Research questions: • To what extent can a "geoscience aware" sentiment classifier out-perform existing Out-Of-The-Box (OOTB) algorithms for sentiment? • To what extent are resulting visualizations useful for geoscientists? [Cleverley and Burnett (2014), Journal of Information Science]
  4. Methodology – 2,500+ sentences: 80% training, 20% test. Pipeline: a small sample of 150 public-domain PDF oil & gas reports (5 million+ words) → convert to ASCII text → identify petroleum system "mentions" (2,500+ sentences) → retired geologists label the sentences (agreeing with each other 91.6% of the time) → machine classifier predicts sentiment. The extractor and sentiment classifier were written in Python (64-bit Anaconda Spyder) and run on a quad-core 8 GB RAM laptop.
  5. Methodology (build of the previous slide) – the labelling step yields polarity-labelled geological sentences, which form the 80% training / 20% test split. (A sketch of the sentence extraction and train/test split follows below.)
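A minimal sketch of the extraction and 80/20 split described on the methodology slides, assuming the PDF reports have already been converted to plain text (the file paths, the keyword list and the use of scikit-learn's train_test_split are illustrative assumptions, not the published utility scripts):

# Minimal sketch: pull sentences mentioning petroleum system elements from
# text files converted from PDF, then split 80/20 for training and test.
# Paths and the keyword list are hypothetical placeholders.
import glob
import re
from sklearn.model_selection import train_test_split

ELEMENTS = ["source rock", "migration", "reservoir", "trap", "seal", "charge"]

sentences = []
for path in glob.glob("reports_txt/*.txt"):            # ASCII text from PDFs
    with open(path, encoding="utf-8", errors="ignore") as f:
        text = f.read()
    for sent in re.split(r"(?<=[.!?])\s+", text):      # naive sentence split
        if any(el in sent.lower() for el in ELEMENTS): # petroleum system "mention"
            sentences.append(sent.strip())

# Sentiment labels would come from the geologists' annotation step.
labels = ["unlabelled"] * len(sentences)

train_x, test_x, train_y, test_y = train_test_split(
    sentences, labels, test_size=0.20, random_state=42)  # 80% train / 20% test
print(len(train_x), "training sentences,", len(test_x), "test sentences")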
  6. Geo-sentiment AnalyZER (GAZER) Algorithm – Predict Sentiment. The approach combines: knowledge engineering (a 6,000+ balanced feature-set/lexicon built from books and modified public-domain sentiment lexicons, plus numerical rules); machine learning feature selection over the training set using Bayesian probability and skip-grams; and natural language processing (negation pivots, Part-of-Speech (POS) tagging, disambiguation).
  7. Geo-sentiment AnalyZER (GAZER) Algorithm – Predict Sentiment (continued). Same components, with an example of Part-of-Speech (POS) tagging. (A sketch of the skip-gram and negation-pivot ideas follows below.)
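A minimal sketch of two of the NLP ingredients named on these slides, skip-gram features and negation pivots, as they might feed a Bayesian-style classifier (the function names, the one-token skip distance and the negator list are illustrative assumptions, not GAZER's actual implementation):

# Minimal sketch: skip-gram features plus a simple negation pivot.
# Function names, negator list and skip distance are illustrative assumptions.
from itertools import combinations

NEGATORS = {"no", "not", "lack", "absence", "without"}

def apply_negation_pivot(tokens):
    """Prefix tokens following a negator, so 'no ... seal' differs from 'seal'."""
    out, negated = [], False
    for tok in tokens:
        if tok in NEGATORS:
            negated = True
            out.append(tok)
        else:
            out.append("NOT_" + tok if negated else tok)
    return out

def skipgram_features(tokens, max_skip=1):
    """Token pairs allowing up to max_skip intervening words."""
    return [tokens[i] + "_" + tokens[j]
            for i, j in combinations(range(len(tokens)), 2)
            if j - i - 1 <= max_skip]

tokens = apply_negation_pivot("no effective seal was encountered".split())
print(tokens)
print(skipgram_features(tokens))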
  8. Results – Accuracy Comparison, 2 Categories (POS v NEG). Bar chart of accuracy (%): Naïve Bayes Multinomial with the SentiWordNet 3.0 polarity-labelled lexicon (35.83%); Sentence Vector cosine similarity, neural network (68.79%); two Naïve Bayes Multinomial variants using the 2,500+ labelled geological sentences and the 6,000+ geoscience polarity-labelled feature-set/lexicon (78.63% and 75.52%); GAZER (90.40%). Comparing to the state of the art in the literature for generic sentiment analysis: Sentence Vector (Le & Mikolov 2014) gave 92.6% with 25,000 movie reviews as a training set. (A sketch of a sentence-vector baseline follows below.)
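A minimal sketch of a sentence-vector baseline of the kind compared on this slide, using gensim's Doc2Vec (the Paragraph Vector model of Le & Mikolov 2014) and classifying a sentence by cosine similarity to class centroids; the toy corpus and parameters are illustrative assumptions, not the study's configuration, and the gensim 4.x API is assumed:

# Minimal sketch: Doc2Vec (Paragraph Vector) baseline, classifying by cosine
# similarity to class centroids. Toy data and parameters are illustrative.
# Assumes gensim 4.x (model.dv); older versions used model.docvecs.
import numpy as np
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

train = [
    ("excellent oil-prone source rock with high toc", "pos"),
    ("good porosity and permeability in the reservoir", "pos"),
    ("trap integrity is questionable with a breached seal", "neg"),
    ("reservoir quality is poor due to cementation", "neg"),
]
docs = [TaggedDocument(text.split(), [i]) for i, (text, _) in enumerate(train)]

model = Doc2Vec(vector_size=50, min_count=1, epochs=100)
model.build_vocab(docs)
model.train(docs, total_examples=model.corpus_count, epochs=model.epochs)

def centroid(label):
    vecs = [model.dv[i] for i, (_, lab) in enumerate(train) if lab == label]
    return np.mean(vecs, axis=0)

def classify(sentence):
    v = model.infer_vector(sentence.split())
    sims = {lab: float(np.dot(v, centroid(lab)) /
                       (np.linalg.norm(v) * np.linalg.norm(centroid(lab))))
            for lab in ("pos", "neg")}
    return max(sims, key=sims.get)

print(classify("poor source rock with low toc"))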
  9. Results – Comparison v Cloud APIs: 3 Categories (POS-NEG-NEUT). Bar chart of accuracy (%): out-of-the-box Cloud APIs (Amazon, Lexalytics, Microsoft, IBM, Google) scored between roughly 26.6% and 51.3%, versus 84.26% for GAZER. ** Test data for IBM & Microsoft is a random subset (250 sentences) of the overall test set (750).
  10. Results – Geoscience Sentiment AnalyZER (GAZER) Algorithm. Confusion matrix, 3 categories (neutral cut-off < 0.54, skips = 3):

                       Predicted POS   Predicted NEUT   Predicted NEG   Total
      Actual POSITIVE        225              17                8        250
      Actual NEUTRAL          54             180               16        250
      Actual NEGATIVE         11              12              227        250
      Total                  290             209              251        750

      Overall accuracy, 3 categories (GAZER) = 84.26%. Per class: POS recall = 90.0%, precision = 77.58%, F1 = 83.33%; NEUT recall = 72.0%, precision = 80.36%, F1 = 75.99%; NEG recall = 90.8%, precision = 90.43%, F1 = 90.62%. Expected random accuracy (EXP) = 33.3%. (A sketch of how such metrics are derived follows below.)
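A minimal sketch of how per-class recall, precision, F1 and overall accuracy can be derived from a confusion matrix like the one on this slide (the matrix values are copied from the slide; the code itself is illustrative):

# Minimal sketch: per-class recall/precision/F1 and overall accuracy from a
# confusion matrix (rows = actual class, columns = predicted class).
import numpy as np

labels = ["POS", "NEUT", "NEG"]
cm = np.array([[225,  17,   8],
               [ 54, 180,  16],
               [ 11,  12, 227]])

overall = np.trace(cm) / cm.sum()            # correct predictions / all predictions
print(f"Overall accuracy: {overall:.2%}")

for i, lab in enumerate(labels):
    recall = cm[i, i] / cm[i, :].sum()       # correct / actual class size
    precision = cm[i, i] / cm[:, i].sum()    # correct / predicted class size
    f1 = 2 * precision * recall / (precision + recall)
    print(f"{lab}: recall={recall:.1%} precision={precision:.2%} F1={f1:.2%}")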
  11. Crowdsourcing explicit sentiment from public domain text. Streamgraphs (Sankey curves): "characterizing" an area geologically, automatically from text. Y-axis: frequency of positive mentions.
  12. Crowdsourcing explicit sentiment from public domain text. Beanplots: showing the polarity probability distribution (POS v NEG) across geological basins for hydrocarbon source rock. Time-series plots: showing sentiment changes for different types of source rock/areas through time (publication date of report); y-axis sentiment (positive > 0, negative < 0), x-axis year of publication (1980-2015). (A sketch of such a sentiment-through-time plot follows below.)
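A minimal sketch of one such visualization, assuming per-year counts of positive mentions per petroleum system element have already been aggregated (the counts below are invented placeholders; matplotlib's stackplot with a "wiggle" baseline gives a simple streamgraph-style view):

# Minimal sketch: streamgraph-style plot of positive-mention frequency through
# time, per petroleum system element. The counts are invented placeholders.
import matplotlib.pyplot as plt

years = list(range(1980, 2020, 5))
positive_mentions = {                      # hypothetical aggregated counts
    "source rock": [2, 4, 7, 9, 12, 15, 18, 21],
    "reservoir":   [1, 3, 5, 8, 10, 14, 16, 20],
    "seal/trap":   [0, 1, 2, 4, 6, 7, 9, 11],
}

fig, ax = plt.subplots()
ax.stackplot(years, list(positive_mentions.values()),
             labels=list(positive_mentions.keys()), baseline="wiggle")
ax.set_xlabel("Year of publication")
ax.set_ylabel("Frequency of positive mentions")
ax.legend(loc="upper left")
plt.show()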
  13. Summary and future work • Out-of-the-box commercial APIs are likely to be sub-optimal for geoscience sentiment; classifiers are highly reliant on the data they were trained on. Classifying snippets/bullets in PowerPoints may suit Bayesian rather than vector approaches. • Further work: larger training sets, deeper dependency parsing, more areas of geoscience, beyond-sentence-boundary analysis, intensity, and an empirical/subjective split. • Testing of visuals with geoscientists (paper to be published Q4 2018). Very early feedback: "…a really nice way to capture literature without reading it. It would be nice to run this method on all the AAPG publications of a few separate basins and see if the graphs reflect our basic understanding of these basins. This could become a very powerful method in understanding and visualizing the current state of knowledge." Exploration Geologist (December 2017)
  14. Published data sets: 750 labelled sentences (POS, NEG, NEUT) for benchmarking classifier performance, plus basic Python extraction utility scripts, on GitHub: https://github.com/phcleverley/Geoscience-Sentiment-Research. Cleverley, P.H. (2017). Applying Text and Data Mining to Geological Articles: Towards Cognitive Computing Assistants. Geological Society of America (GSA) Annual Meeting, Seattle, WA, USA, 22-25 October 2017. Figure captions: geological formation sentiment from 35 million words; sentiment from selected USGS petroleum reports; graph EigenCentrality in huge text associative networks. [Geological formations (middle, right): data courtesy of Society of Economic Geology (SEG) via GeoscienceWorld.] Email: p.h.cleverley@rgu.ac.uk. Blog: www.paulhcleverley.com. Thank you for listening!
