Dresden 2026 – Scientific Programme
MM: Metal and Material Physics Division
MM 22: Data-driven Materials Science: Big Data and Workflows III
MM 22.1: Talk
Wednesday, March 11, 2026, 10:15–10:30, SCH/A216
Hashing It Out: Overcoming the Duplicate Structure Filtering Bottleneck for Large Data Sets — •Julian Holland, Juan Manuel Lombardi, Chiara Panosetti, and Karsten Reuter — Fritz Haber Institute, Berlin, Germany
As computational chemistry research becomes increasingly data-rich, data processing has emerged as a new bottleneck to elucidating material properties. Duplicate detection is often an essential data processing step for active learning, global optimization, and general potential energy surface (PES) exploration algorithms, ensuring that they remain efficient and functional. Such duplication checks typically scale unfavorably with the number of structures, potentially taking longer than the data generation itself. Hashing-based methods, whose scaling is decoupled from dataset size, circumvent this but are conventionally too rigid to reliably find duplicates. In this talk, we present a democratic hashing duplicate detection algorithm that is flexible enough to detect, nearly instantly, duplicate structures whose global descriptors are arbitrarily similar but distinct. The uniqueness of a structure can be determined by an ensemble of hash functions associated with a set of randomly perturbed global descriptors. We compare the performance of our duplicate detection algorithm against conventional distance-matrix-based methods and introduce a standardized suite of duplicate detection benchmarks. Our algorithm is not only faster but often significantly more robust at detecting known duplicates.
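The abstract does not give implementation details, so the following is only an illustrative sketch of the general idea of voting over an ensemble of hashes, not the authors' algorithm. It assumes a fixed-length numeric global descriptor and swaps in a simple, well-known technique: grid hashing with one fixed random shift per hash function, followed by a majority vote. The class name `EnsembleHashFilter` and the parameters `bin_width`, `n_hashes`, and `min_votes` are hypothetical.

```python
from collections import Counter

import numpy as np


class EnsembleHashFilter:
    """Toy duplicate filter: majority vote over randomly shifted grid hashes.

    Assumption: each structure is represented by a fixed-length descriptor vector.
    """

    def __init__(self, dim, bin_width=0.1, n_hashes=15, min_votes=8, seed=0):
        rng = np.random.default_rng(seed)
        self.bin_width = bin_width
        self.min_votes = min_votes
        # One fixed random shift per hash function, so that a grid boundary
        # separating two near-identical descriptors in one hash is unlikely
        # to separate them in most of the others.
        self.shifts = rng.uniform(0.0, bin_width, size=(n_hashes, dim))
        # One table per hash function: grid cell -> index of first structure stored.
        self.tables = [dict() for _ in range(n_hashes)]
        self.n_unique = 0

    def _keys(self, descriptor):
        # Shift the descriptor once per hash function and discretize onto the grid.
        cells = np.floor((descriptor + self.shifts) / self.bin_width)
        return [tuple(row) for row in cells.astype(np.int64).tolist()]

    def check_and_add(self, descriptor):
        """Return the index of the matching stored structure, or None if it is new.

        New structures are stored as a side effect.
        """
        keys = self._keys(np.asarray(descriptor, dtype=float))
        # Each hash function casts at most one vote for a stored structure.
        votes = Counter(t[k] for t, k in zip(self.tables, keys) if k in t)
        if votes:
            index, count = votes.most_common(1)[0]
            if count >= self.min_votes:
                return index
        index = self.n_unique
        self.n_unique += 1
        for table, key in zip(self.tables, keys):
            table.setdefault(key, index)  # keep the earliest occupant of a cell
        return None


filt = EnsembleHashFilter(dim=3)
first = filt.check_and_add([0.30, 1.20, 2.50])    # new structure: None
repeat = filt.check_and_add([0.30, 1.20, 2.50])   # exact repeat: matches index 0
distant = filt.check_and_add([5.00, 1.20, 2.50])  # far away: None, stored as index 1
```

In this sketch, each check costs `n_hashes` dictionary lookups regardless of how many structures are already stored, whereas a pairwise distance check has to compare against every stored structure. Descriptors that differ by less than `bin_width` in every component share a cell in some hash functions and not in others, which is why a vote over the ensemble is used instead of a single hash.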
Keywords: Data Science; Global Descriptors; Thermodynamics; Disorder; Big Data
