T: Particle Physics Division

T 12: Computing

T 12.3: Talk

Monday, March 24, 2014, 11:30–11:45, P15

Hadoop for Parallel Root Data Analysis — •Sebastian Lehrack and Guenter Duckeck — LMU Muenchen

The Apache Hadoop software is a Java-based framework for the distributed processing of large data sets across clusters of computers, using the Hadoop Distributed File System (HDFS) for data storage and backup and MapReduce as the processing platform. Hadoop is primarily designed for large textual data sets that can be split into arbitrary chunks, and it must be adapted to the use case of processing binary data files, which cannot be split automatically. However, Hadoop offers attractive features in terms of fault tolerance, task supervision and control, multi-user operation, and job management. For this reason, we have evaluated Apache Hadoop as an alternative to PROOF for ROOT data analysis. Two approaches to distributing the analysis data are discussed: either the data is stored in HDFS and processed with MapReduce, or the data is accessed via a standard Grid storage system (a dCache Tier-2) and MapReduce is used only as the execution back-end. The measurements focus on, on the one hand, storing analysis data safely on HDFS at reasonable data rates and, on the other, processing data quickly and reliably with MapReduce. To evaluate MapReduce, realistic ROOT analyses were run and their event rates compared to PROOF. We also investigated data locality on our workstation cluster.
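The adaptation for non-splittable binary files described above can be illustrated with a custom Hadoop input format. The following sketch is not the authors' code: the names RootFileInputFormat and PathRecordReader are hypothetical, and it assumes the standard Hadoop technique of overriding isSplitable() and handing each mapper the path of one whole file, on which the mapper can then run its ROOT analysis.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Hypothetical input format: each ROOT file becomes exactly one map task.
public class RootFileInputFormat extends FileInputFormat<NullWritable, Text> {

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        // ROOT files are internally structured binary data; Hadoop must not
        // cut them at arbitrary HDFS block boundaries.
        return false;
    }

    @Override
    public RecordReader<NullWritable, Text> createRecordReader(
            InputSplit split, TaskAttemptContext context) {
        return new PathRecordReader();
    }

    // Emits a single record per (non-split) file: its path. The mapper can
    // then open that path with ROOT and run the actual event loop.
    private static class PathRecordReader
            extends RecordReader<NullWritable, Text> {
        private final Text path = new Text();
        private boolean consumed = false;

        @Override
        public void initialize(InputSplit split, TaskAttemptContext context) {
            path.set(((FileSplit) split).getPath().toString());
        }

        @Override
        public boolean nextKeyValue() {
            if (consumed) {
                return false;
            }
            consumed = true;
            return true;
        }

        @Override
        public NullWritable getCurrentKey() { return NullWritable.get(); }

        @Override
        public Text getCurrentValue() { return path; }

        @Override
        public float getProgress() { return consumed ? 1.0f : 0.0f; }

        @Override
        public void close() { }
    }
}

A job driver would enable this with job.setInputFormatClass(RootFileInputFormat.class). The same scheme fits both variants above, with the emitted paths pointing either into HDFS or, given a suitable Hadoop file-system adapter, to an external storage system such as a dCache Tier-2.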
