Geoscience-aware text sentiment algorithm improves prediction over out-of-the-box tools such as IBM Watson, Google, Microsoft and Amazon by over 30%.
I presented early research findings today at the ‘Big Data in the Geosciences’ conference at the Geological Society of London.
Google opened proceedings with a talk on satellite imagery and the Earth Engine. Subsequent talks ranged from using Twitter for early warnings of earthquakes, through virtual reality and digital analogues, to applying deep learning to detect volcano deformation. Some fascinating insights.
My latest research addressed sentiment, the context around mentions of petroleum system elements (such as source rock, migration, reservoir and trapping) in literature, company reports and presentations. The hypothesis is that stacking somewhat independent opinions and tone in text (the averages, the outliers, the contradictions) may show geoscientists what they don’t know and challenge what they think they do know.
The research question was whether a geoscience-aware algorithm could improve on existing APIs and algorithms used for sentiment analysis, and how useful the resulting visualizations might be.
Using a held-back test set of 750 labelled petroleum (oil and gas) examples, the Geoscience Aware text sentiment analyZER (GAZER) algorithm achieved 90.4% accuracy for two classes (positive and negative) and 84.26% accuracy for three classes (positive, negative and neutral sentiment). This compared favourably with generic Paragraph Vector and Naïve Bayes approaches. It also compared favourably with the out-of-the-box sentiment cloud APIs from IBM Watson, Microsoft, Amazon and Google, which averaged approximately 50%.
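To make the two-class and three-class figures concrete, here is a minimal sketch of how accuracy on a held-back set can be computed. The labels are invented toy data, not the 750 petroleum examples, and treating the two-class score as "accuracy over examples whose gold label is not neutral" is an assumption about the evaluation protocol, not something stated in the research.

```python
from typing import Sequence


def accuracy(gold: Sequence[str], predicted: Sequence[str]) -> float:
    """Fraction of examples where the predicted label equals the gold label."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("need at least one example")
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)


def two_class_accuracy(gold, predicted, neutral="neutral"):
    """Accuracy over examples whose gold label is not neutral.

    Assumption: the two-class score ignores neutral examples entirely.
    """
    pairs = [(g, p) for g, p in zip(gold, predicted) if g != neutral]
    return accuracy([g for g, _ in pairs], [p for _, p in pairs])


# Toy labels, invented purely for illustration.
gold = ["positive", "negative", "neutral", "positive", "negative", "neutral"]
pred = ["positive", "negative", "positive", "positive", "neutral", "neutral"]

three_class = accuracy(gold, pred)
two_class = two_class_accuracy(gold, pred)
print(f"3-class accuracy: {three_class:.3f}")
print(f"2-class accuracy: {two_class:.3f}")
```

The same function can score any classifier or cloud API on the same held-back labels, which is what makes a like-for-like comparison possible.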
Agreement between the retired geoscientists who labelled the data was 91.6%, indicating that the algorithm appears to be approaching ‘human-like’ performance.
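The agreement figure is a simple proportion of examples on which two labellers chose the same class. The sketch below shows that calculation on invented labels. It also computes Cohen's kappa, a common chance-corrected companion statistic; kappa is included here only as an illustration and was not reported in this research.

```python
from collections import Counter


def percent_agreement(labels_a, labels_b):
    """Proportion of items on which two annotators gave the same label."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    return sum(a == b for a, b in zip(labels_a, labels_b)) / len(labels_a)


def cohen_kappa(labels_a, labels_b):
    """Agreement corrected for the agreement expected by chance."""
    n = len(labels_a)
    observed = percent_agreement(labels_a, labels_b)
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both annotators pick the same label
    # if each labels independently according to their own label frequencies.
    expected = sum(count_a[lab] * count_b[lab] for lab in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical annotations from two labellers, invented for illustration.
annotator_1 = ["pos", "pos", "neg", "neg", "neu", "pos", "neg", "neu", "pos", "neg"]
annotator_2 = ["pos", "pos", "neg", "pos", "neu", "pos", "neg", "neu", "pos", "neg"]

agreement = percent_agreement(annotator_1, annotator_2)
kappa = cohen_kappa(annotator_1, annotator_2)
print(f"agreement: {agreement:.3f}, kappa: {kappa:.3f}")
```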
This supports findings in other areas showing the need to customize sentiment analysis for specific domains, and the criticality of training data specific to the work task in hand. The findings also support existing literature suggesting that generative probabilistic machine learning algorithms may perform better than discriminative ones when classifying snippets of information such as sentences and bullets in PowerPoint presentations.
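For readers unfamiliar with the term, a generative probabilistic classifier models how likely each word is under each class and then picks the class that best explains the snippet. The sketch below is a small multinomial Naïve Bayes classifier with add-one smoothing, trained on four invented sentences. It is not GAZER, and the class name, training sentences and tokenizer are all hypothetical, chosen only to show the idea on short text.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesSentiment:
    """Multinomial Naive Bayes over whitespace tokens (illustrative only)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                      # Laplace smoothing constant
        self.class_counts = Counter()           # documents seen per class
        self.word_counts = defaultdict(Counter) # token counts per class
        self.vocab = set()

    @staticmethod
    def tokenize(text):
        return text.lower().split()

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            self.class_counts[label] += 1
            for token in self.tokenize(text):
                self.word_counts[label][token] += 1
                self.vocab.add(token)
        return self

    def log_scores(self, text):
        """Log prior plus summed log likelihoods for each class."""
        total_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocab)
        scores = {}
        for label, n_docs in self.class_counts.items():
            score = math.log(n_docs / total_docs)
            total_words = sum(self.word_counts[label].values())
            for token in self.tokenize(text):
                if token not in self.vocab:
                    continue  # ignore tokens never seen in training
                numerator = self.word_counts[label][token] + self.alpha
                denominator = total_words + self.alpha * vocab_size
                score += math.log(numerator / denominator)
            scores[label] = score
        return scores

    def predict(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)


# Invented training snippets, not drawn from the research data.
train_texts = [
    "excellent reservoir quality with good porosity",
    "mature source rock and effective seal",
    "poor reservoir quality and tight sandstone",
    "seal failure and no effective trap",
]
train_labels = ["positive", "positive", "negative", "negative"]

model = NaiveBayesSentiment().fit(train_texts, train_labels)
print(model.predict("good porosity and effective seal"))
print(model.predict("poor tight sandstone"))
```

Because the model only needs word counts per class, it can be estimated from relatively few labelled snippets, which is one reason such approaches are often tried first on short text.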
Early evidence suggested that the resulting visualizations, such as streamgraphs of the sentiment data, could be used to challenge individual biases and organizational dogma, presenting an area for further research.
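A streamgraph is a stacked area chart whose layers are centred around a middle axis instead of sitting on a flat baseline. As a rough sketch of the layout step only, the code below takes per-period counts of each sentiment class and returns the lower and upper edge of each layer, using a symmetric baseline. The counts and the function name are invented for illustration. For plotting, matplotlib's `stackplot` accepts a `baseline` argument (for example `"sym"` or `"wiggle"`) that produces a similar shape.

```python
from typing import Dict, List, Tuple


def streamgraph_layers(
    series: Dict[str, List[float]]
) -> Dict[str, List[Tuple[float, float]]]:
    """Return (lower, upper) edges per layer, centred on zero at each time step."""
    names = list(series)
    if not names:
        raise ValueError("need at least one series")
    n_points = len(series[names[0]])
    if any(len(series[name]) != n_points for name in names):
        raise ValueError("all series must have the same length")

    layers = {name: [] for name in names}
    for t in range(n_points):
        total = sum(series[name][t] for name in names)
        lower = -total / 2.0  # symmetric baseline: stack is centred on zero
        for name in names:
            upper = lower + series[name][t]
            layers[name].append((lower, upper))
            lower = upper     # next layer starts where this one ends
    return layers


# Hypothetical counts of sentiment mentions per time period.
counts = {
    "negative": [2, 4, 1, 3],
    "neutral": [5, 3, 4, 2],
    "positive": [3, 1, 5, 5],
}

layers = streamgraph_layers(counts)
for name, edges in layers.items():
    print(name, edges)
```

Each layer's thickness at a given period equals that class's count, so shifts in the balance of positive and negative mentions show up as layers swelling or thinning over time.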