Faithfully representing chemical environments is essential for describing materials and molecules with machine learning approaches. In this paper we investigate (i) the sensitivity to perturbations and (ii) the effective dimensionality of a variety of atomic environment representations, over a range of material datasets.
Representations investigated include atom-centred symmetry functions (ACSF), Chebyshev polynomial symmetry functions (CHSF), the smooth overlap of atomic positions (SOAP), the many-body tensor representation (MBTR), and the atomic cluster expansion (ACE). In area (i), we show that none of the atomic environment representations are linearly stable under tangential perturbations, and that for CHSF there are instabilities for particular choices of perturbation, which we show can be removed with a slight redefinition of the representation. In area (ii), we find that most representations can be compressed significantly without loss of precision and, further, that selecting optimal subsets of a representation method improves the accuracy of regression models built for a given dataset.
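The compressibility result in (ii) can be illustrated with a small sketch (this is not the paper's analysis code): running PCA on a deliberately redundant synthetic "descriptor" matrix recovers far fewer independent directions than nominal components. The `effective_dimension` helper, the tolerance, and the synthetic data are all hypothetical stand-ins for the real representations and datasets.

```python
import numpy as np

def effective_dimension(X, tol=1e-6):
    """Number of principal components needed to capture a
    (1 - tol) fraction of the total descriptor variance."""
    Xc = X - X.mean(axis=0)
    # Singular values of the centred matrix give the PCA spectrum.
    s = np.linalg.svd(Xc, compute_uv=False)
    var = s**2 / np.sum(s**2)
    cum = np.cumsum(var)
    return int(np.searchsorted(cum, 1.0 - tol) + 1)

rng = np.random.default_rng(0)
# Synthetic "descriptor" matrix: 200 environments with 50 nominal
# components, but only 5 independent directions plus tiny noise --
# mimicking the redundancy reported for real representations.
latent = rng.normal(size=(200, 5))
mixing = rng.normal(size=(5, 50))
X = latent @ mixing + 1e-9 * rng.normal(size=(200, 50))

print(effective_dimension(X))  # far below the nominal 50 components
```

Real descriptor matrices decay more gradually than this toy example, which is why the paper reports precision as a function of the retained dimension rather than a single cutoff.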
The datasets used have been made available via Zenodo, as has the analysis code, which is written in Python and Julia.
Popular descriptors for machine learning potentials, such as the Behler-Parrinello atom-centred symmetry functions (ACSF) or the Smooth Overlap of Atomic Positions (SOAP), are widely used, but so far little attention has been paid to optimising how many descriptor components need to be included to give good results.
The key results of the paper are shown in Figure 9, which reports the accuracy obtained with linear (ridge regression) and non-linear (kernel ridge regression) techniques for a range of descriptors, all as a function of the number of descriptor components included. I would suggest focusing on reproducing parts of this figure using a representative subset of descriptors, e.g. ACSF, SOAP, and ACE. Scripts and analysis code are provided in the data made available with the paper.
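The shape of such an accuracy-versus-components study can be sketched as follows, using scikit-learn's `Ridge` and `KernelRidge` on a synthetic feature matrix and target. Everything here is a stand-in: the real study uses ACSF/SOAP/ACE features and reference energies, and selects component subsets more carefully than the simple leading-column truncation used below.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
# Stand-in "descriptor" matrix (300 environments x 40 components) and a
# smooth synthetic target depending only on the first 10 components.
X = rng.normal(size=(300, 40))
y = np.tanh(X[:, :10].sum(axis=1)) + 0.01 * rng.normal(size=300)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

def rmse(model, n_components):
    """Fit on the first n_components features; report test RMSE."""
    model.fit(X_tr[:, :n_components], y_tr)
    pred = model.predict(X_te[:, :n_components])
    return float(np.sqrt(np.mean((pred - y_te) ** 2)))

for n in (5, 10, 20, 40):
    lin = rmse(Ridge(alpha=1e-6), n)
    krr = rmse(KernelRidge(alpha=1e-6, kernel="rbf", gamma=0.05), n)
    print(f"{n:2d} components: ridge RMSE {lin:.3f}, KRR RMSE {krr:.3f}")
```

Plotting these RMSE curves against `n` gives the same kind of picture as Figure 9: error drops as informative components are added, then flattens once the effective dimension is reached.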