Applying Text and Data Mining to Geological Articles: Towards Cognitive Computing Assistants

Exploiting patterns in text to support geoscientific workflows, using similar/dissimilar word associations and sentiment analysis to challenge existing predispositions. The sentiment analysis is an ensemble of labelled examples, skip-grams and a geoscience lexicon fed into a Bayesian classifier, implemented in Python.
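
The classifier described above could look roughly like the following. This is a minimal sketch, not the author's implementation: it assumes a multinomial naive Bayes model with add-one smoothing, and the lexicon entries, training sentences and labels are invented for illustration only.

```python
import math
from collections import Counter, defaultdict

# Hypothetical mini-lexicon (assumption): geoscience terms with a polarity tag.
LEXICON = {"porous": "pos", "seal": "pos", "mature": "pos",
           "tight": "neg", "breached": "neg", "immature": "neg"}

def features(text, max_skip=1):
    """Return unigrams, skip-grams and lexicon hits for one text."""
    tokens = text.lower().split()
    feats = list(tokens)
    # Skip-grams: ordered token pairs separated by at most `max_skip` tokens.
    for i, left in enumerate(tokens):
        for j in range(i + 1, min(i + 2 + max_skip, len(tokens))):
            feats.append(f"{left}_{tokens[j]}")
    # One lexicon feature per token found in the lexicon.
    feats += [f"LEX_{LEXICON[t]}" for t in tokens if t in LEXICON]
    return feats

class NaiveBayes:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.class_docs = Counter()              # documents seen per label
        self.feat_counts = defaultdict(Counter)  # feature counts per label
        self.vocab = set()

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            self.class_docs[label] += 1
            for feat in features(text):
                self.feat_counts[label][feat] += 1
                self.vocab.add(feat)
        return self

    def predict(self, text):
        total_docs = sum(self.class_docs.values())
        scores = {}
        for label, n_docs in self.class_docs.items():
            total = sum(self.feat_counts[label].values())
            score = math.log(n_docs / total_docs)  # log prior
            for feat in features(text):
                if feat in self.vocab:  # ignore features never seen in training
                    count = self.feat_counts[label][feat]
                    score += math.log((count + 1) / (total + len(self.vocab)))
            scores[label] = score
        return max(scores, key=scores.get)

# Invented labelled examples (assumption), standing in for real training data.
TRAIN = [
    ("good porous reservoir sand", "pos"),
    ("effective regional seal present", "pos"),
    ("mature source rock proven", "pos"),
    ("tight reservoir with poor porosity", "neg"),
    ("seal breached by faulting", "neg"),
    ("immature source rock", "neg"),
]

model = NaiveBayes().fit([t for t, _ in TRAIN], [l for _, l in TRAIN])
print(model.predict("porous reservoir with effective seal"))
```

The skip-gram features are what let the model tell "effective ... seal" from "seal breached", even though the lexicon alone tags "seal" as favourable in both.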

Published in: Data & Analytics

  1. Paul H. Cleverley, Robert Gordon University, Aberdeen, Scotland, UK. GSA Annual Conference, 24th October 2017, Seattle, USA. Applying Text and Data Mining to Geological Articles: Towards Cognitive Computing Assistants
  2. Background - Typical uses: Spatializing entities/concepts and associations, e.g. ‘mentions’ of Pre-Cambrian. Extracting integer and float data from unstructured text, e.g. ppm in association with a chemical element. For example, GeoDeepDive-supported papers (Peters et al. 2015; Liu et al. 2016; Yulaeva et al. 2017): stromatolite relationship to dolomite; link between cobalt and supercontinent assembly; extracting hydrogeological data. Cleverley (2017). But what else can we do? Examples using Python...
  3. Learning by comparison: Discriminatory Search Term Word Associations. 100,000+ Society of Petroleum Engineers (SPE), American Geosciences Institute (AGI), Geological Society of London (GSL). Primary search query = submarine fan. Comparing secondary search terms: Miocene, Eocene. Cleverley, P.H., Burnett, S. (2014)
  4. Stimulating Serendipity (Discriminatory Word Associations). “...word associations highlighted new and unexpected terms... This surprising result led us to consider a new geological element which could impact our business opportunity” (Geologist, Oil & Gas Company, 2015). n=53. To what extent do current search interfaces in your organization facilitate serendipitous discovery? 42%: to a moderate/large extent. To what extent could word co-occurrence techniques facilitate serendipitous discovery? 75%: to a moderate/large extent. A Wilcoxon Signed Rank Test showed a statistically significant difference (p<0.05). CURRENT Cleverley, P.H., Burnett, S. (2014) 100,000+ Society of Petroleum Engineers (SPE), American Geosciences Institute (AGI), Geological Society of London (GSL). Some colour coding from NASA SWEET Ontology and others.
  5. Question: Which is the most similar formation to the Kimmeridge Formation? Word Vectors – very simple theory. Cleverley (2016) Digital Energy
  6. Word Vectors – very simple theory. Cleverley (2016) Digital Energy
  7. Find Similar: Similarity of entities. “I input the Zebbag Formation that I studied in Tunisia and it returned a lateral equivalent (in Libya) that I had not come across before.” Geologist, Multi-National Oil and Gas Company (June 2016). What are the analogues for xxx? Cleverley (2016) Digital Energy. Adding more sophistication… - Curation (lemmas, synonyms) - NLP, e.g. ‘post Triassic’, ‘not porous’ - Mikolov et al. (2013); Řehůřek (2014) Word2Vec: using neural networks to generate richer and more complex representations of meaning in text (text embeddings). - Using geoscience ontologies to enrich meaning and add logic for reasoning.
  8. More “related” to volcanics than limestone. More “related” to limestone than volcanics. Testing Hypotheses (word vector v word vector). 6,000+ Articles over 100 years of the Society of Economic Geologists (SEG) - (courtesy GeoScienceWorld). Cleverley (2017)
  9. R² = 0.2576: a weak correlation. Arid environments can lead to high pH (evaporation / desorption), which can lead to arsenic in groundwater. So the more arid the environment (less rainwater), the more likely arsenic may mobilize. Axes: Word Vector (Arsenic); NOAA Annual Rainfall (mm). Testing Hypotheses (word vector v existing data). Word Vector (US States). 6,000+ Articles over 100 years of the Society of Economic Geologists (SEG) - (courtesy GeoScienceWorld). National Oceanic and Atmospheric Administration (NOAA) Environmental Data. Cleverley (2017)
  10. Are all the conditions likely to be in place for …? Labelled training data + skip-grams + geoscience ‘friendly’ lexicon. Using literature to help challenge individual cognitive biases and organizational dogma. Reports from United States Geological Survey (USGS) Petroleum Assessments. Cleverley (2017)
  11. Summary – Areas for further research • Opportunities may exist to increase the propensity of ‘general purpose’ enterprise search user interfaces to facilitate serendipity. • Combining text analytics & machine learning to address a specific work task to provide actionable insights. Contact: p.h.cleverley@rgu.ac.uk, www.paulhcleverley.com
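
The "very simple theory" of word vectors in slides 5 to 8 can be illustrated with plain co-occurrence counts and cosine similarity. This is a sketch under stated assumptions, not the method used in the talk (which cites Word2Vec): the corpus is invented, and `formation_a`, `formation_b` and `formation_c` are hypothetical placeholder names rather than real formations.

```python
import math
from collections import Counter, defaultdict

def cooccurrence_vectors(sentences, window=2):
    """Represent each word by counts of the words seen within `window` tokens."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def most_similar(target, vectors, candidates):
    """Return the candidate whose context vector is closest to the target's."""
    return max(candidates, key=lambda c: cosine(vectors[target], vectors[c]))

# Invented toy corpus (assumption); a real run would use the article text.
CORPUS = [
    "formation_a organic rich marine shale source rock",
    "formation_b organic rich marine shale source rock",
    "formation_c fluvial sandstone reservoir channel",
]

vectors = cooccurrence_vectors(CORPUS)
print(most_similar("formation_a", vectors, ["formation_b", "formation_c"]))
```

The same `cosine` function supports the hypothesis tests in slide 8: compare one term's vector against the vectors for two competing terms, such as "volcanics" and "limestone", and see which scores higher.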
