Regensburg 2019 – Scientific Programme
MM 33.1: Topical Talk
Thursday, 4 April 2019, 10:15–10:45, H43
Supervised and unsupervised learning from the large body of materials literature — •Gerbrand Ceder — University of California, Berkeley, CA, USA
The overwhelming majority of scientific knowledge is stored as unstructured text in millions of publications. I will present results showing how knowledge can be extracted from such a large corpus of text by combining Natural Language Processing (NLP) with Machine Learning (ML). NLP is necessary to turn the unstructured information in the scientific literature into structured data on which ML can operate. As an example, I will demonstrate how a vector representation of words can capture inorganic materials science concepts from 3.3 million scientific abstracts without human labelling or supervision. Remarkably, such a basic text-based methodology can be used to predict new materials, the properties of which we verify with Density Functional Theory. As a second, more complex example, I will discuss the extraction of all materials synthesis information from several million papers. Constructing synthesis recipes from papers requires extremely high precision and recall in identifying the relevant chemicals and operational procedures. I will show how this can be achieved by combining various supervised and unsupervised machine learning methods to create the largest data set of solid-state synthesis reactions.
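To make the word-vector idea concrete, here is a minimal sketch of how candidate materials could be ranked by their closeness to a property word in an embedding space. It is not the method used in the talk: the three-dimensional vectors, the tokens `material_a` to `material_c`, and the helper `rank_by_similarity` are all invented for illustration, whereas real embeddings would be learned from the abstract corpus and have far more dimensions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


# Toy embeddings, made up for this example only (assumption, not real data).
embeddings = {
    "thermoelectric": [0.9, 0.1, 0.0],
    "material_a": [0.8, 0.2, 0.1],
    "material_b": [0.1, 0.9, 0.2],
    "material_c": [0.5, 0.5, 0.0],
}


def rank_by_similarity(query, candidates, table):
    """Sort candidate tokens by cosine similarity to the query token."""
    q = table[query]
    scored = [(name, cosine(q, table[name])) for name in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


ranking = rank_by_similarity(
    "thermoelectric", ["material_a", "material_b", "material_c"], embeddings
)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

In such a scheme, highly ranked materials that have not previously been reported with the property would be the candidates passed on for verification, for instance with Density Functional Theory as described above.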