Machine Learning Using Natural Language Processing to Access Geoscience Knowledge
Large quantities of geological, geophysical and geospatial data exist both in the public domain and in industry, particularly within hydrocarbon, mining and geotechnical companies. Enhanced availability and accessibility of geoscience data (enabled by improved processing power, storage and bandwidth, an increase in primary digital acquisition, and digitization of legacy analogue data) have helped to raise expectations that ‘big data’ will inevitably equate to ‘new insights’ that can readily be monetized. Deriving additional value from large data repositories generally involves significant challenges, though many of these have already been identified and analysed by Artificial Intelligence and Knowledge Management initiatives spanning many decades.
A key realization is that different types of data can require radically different analytics, and strategies that work well in one industry or with one type of data are not necessarily easily transferable to other domains. This is illustrated by contrasting the types of analysis and insights gained from large geospatial datasets (typically highly structured, consistently formatted, with comprehensive metadata) with those gained from large archives of text documents, such as scientific journals or reports on a corporate file server.
The potential value of ‘mining’ text repositories is demonstrated with new prototype software based on Natural Language Processing (NLP) to analyse and characterize geoscience documents. The underlying challenge in text mining is that most additional value within text documents is held within the semantic structure of the prose. Whereas keyword search strategies rely on the binary presence/absence of specific search terms or phrases, written text conveys meaning and relevance through the wider context of surrounding sentences and paragraphs. NLP is used to analyse the information content of text, and relate this to pre-defined ontologies that represent the semantic framework of knowledge within a specific domain.
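The contrast between keyword search and ontology-based analysis can be sketched in a few lines of Python. This is a toy illustration only, not the prototype software described above: the ontology, its terms, and the function names `keyword_match` and `tag_concepts` are all hypothetical, the geological relationships are deliberately simplified, and a plain term lookup stands in for genuine NLP.

```python
import re

# Hypothetical, heavily simplified ontology: each concept lists the surface
# terms that may express it in text, plus one broader (parent) concept.
ONTOLOGY = {
    "sandstone": {"terms": ["sandstone", "arenite"], "broader": "sedimentary rock"},
    "limestone": {"terms": ["limestone", "calcarenite"], "broader": "sedimentary rock"},
    "sedimentary rock": {"terms": ["sedimentary rock"], "broader": None},
}


def keyword_match(text, keyword):
    """Binary presence/absence test, as in a plain keyword search."""
    return keyword.lower() in text.lower()


def tag_concepts(text):
    """Return every ontology concept expressed in the text, directly or via a parent."""
    lowered = text.lower()
    found = set()
    for concept, entry in ONTOLOGY.items():
        for term in entry["terms"]:
            # Whole-word match, so "arenite" does not fire inside "calcarenite".
            if re.search(r"\b" + re.escape(term) + r"\b", lowered):
                node = concept
                while node is not None:  # walk up to the broader concepts
                    found.add(node)
                    node = ONTOLOGY[node]["broader"]
    return found


sentence = "The reservoir interval is a well-sorted quartz arenite."
print(keyword_match(sentence, "sandstone"))  # keyword search misses the document
print(sorted(tag_concepts(sentence)))        # ontology lookup still relates it to sandstone
```

In this sketch a search for "sandstone" fails because the word never appears in the sentence, whereas the ontology lookup links "arenite" to the sandstone concept and, through it, to the broader sedimentary rock concept.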