Utilizing HPC Systems to Serve Compute-Intensive Tasks from a Data Lake Managing Sensitive Data
Dissertation
Date of oral examination: 2024-09-04
Published: 2024-10-10
Supervisor: Prof. Dr. Ramin Yahyapour
Referee: Prof. Dr. Ramin Yahyapour
Referee: Prof. Dr. Dagmar Krefting
Files
Name: PHD_Thesis_Hendrik_Nolte_final.pdf
Size: 9.09 MB
Format: PDF
Abstract
English
The data lake was proposed by James Dixon in 2010 to prevent the information loss that occurs during the extract-transform-load (ETL) process usually employed to ingest data into a data warehouse. Instead, he proposed schema-on-read semantics, where a schema is only inferred when the data is read, while the data itself is retained in its original format. This simple yet powerful idea has since been extended substantially to support multiple data sources and different ecosystems, such as Hadoop or public cloud offerings.

Within this thesis, a data lake is designed that integrates sensitive health data and offloads compute-intensive tasks to a High-Performance Computing (HPC) system. Its general applicability is demonstrated through a use case in which Magnetic Resonance Imaging (MRI) scans are stored, cataloged, and processed. To this end, the challenges of data-intensive projects on HPC systems are analyzed. This analysis identifies the requirement of a secure communication layer between a remote data management system, i.e., the data lake, and the HPC system, alongside the general problems of data placement and lifecycle management. Following this analysis, the communication layer is implemented and provides a Representational State Transfer (REST) interface. The system can be set up entirely in user space, which allows for easy scalability and portability. In addition, it provides a fine-grained, token-based permission system. Among other things, this allows uncritical tasks, such as the execution of an already configured job, to be started automatically, while a highly critical code ingestion is only possible after out-of-band confirmation with a second factor. Its generic design enables its use outside the scope of the envisioned data lake by virtually any client system.

To implement the actual data lake, the abstraction of digital objects is used. It allows an otherwise flat namespace to be organized hierarchically and abstracts the underlying infrastructure. The data lake can offload compute-intensive tasks to an HPC system while providing transparent provenance auditing to ensure maintainability and comprehensibility and to support researchers in working reproducibly. It further follows a modular design that allows the required backend services, such as storage, metadata indexing, and compute, to be swapped out or adapted. This modularity is leveraged to guarantee full data sovereignty to the users by providing the capability to store, index, and process encrypted data.

For this, a new secure HPC service is developed that provides a secured and isolated partition on a shared HPC system. This secure partition is completely generic and fully implemented in software; it can therefore be deployed on any typical HPC system and does not require any additional procurement. Different encryption techniques are systematically benchmarked with synthetic and real use cases, achieving up to 97% of the baseline performance. The service can also be used standalone, which is demonstrated by including it in a clinical workflow.

Lastly, scalable and secure metadata indexing is provided by on-demand Elasticsearch clusters that are spawned within the context of a typical HPC job. Two different encryption techniques are discussed and benchmarked: one is based on the previous secure HPC service, while the other is platform-independent, encrypting all key-value pairs of a document individually on the client side and performing all metadata operations, such as search or range queries, on encrypted data on the server side. As a result, the encryption keys never have to leave the client system at any time.
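To make the client-side scheme more concrete, the following minimal Python sketch illustrates one way such per-field encryption could work. It is not the implementation from the thesis: the class and method names are hypothetical, it assumes a deterministic HMAC token per key-value pair for exact-match search combined with AES-GCM for recoverable values, and it does not cover the range queries mentioned above, which would require an order-revealing scheme.

```python
import base64
import hashlib
import hmac
import os

from cryptography.hazmat.primitives.ciphers.aead import AESGCM


class FieldEncryptor:
    """Illustrative sketch: encrypts each key-value pair of a metadata
    document individually. A deterministic HMAC token enables server-side
    equality search, while AES-GCM keeps the value recoverable on the client.
    """

    def __init__(self, search_key: bytes, value_key: bytes):
        self.search_key = search_key      # key for deterministic search tokens
        self.aead = AESGCM(value_key)     # key for recoverable ciphertexts

    def _token(self, key: str, value: str) -> str:
        # Identical (key, value) pairs always yield the same token, so the
        # server can match exact queries without decrypting anything.
        mac = hmac.new(self.search_key, f"{key}={value}".encode(), hashlib.sha256)
        return mac.hexdigest()

    def encrypt_doc(self, doc: dict[str, str]) -> dict[str, dict[str, str]]:
        enc = {}
        for key, value in doc.items():
            nonce = os.urandom(12)
            ct = self.aead.encrypt(nonce, value.encode(), None)
            enc[key] = {
                "token": self._token(key, value),
                "ciphertext": base64.b64encode(nonce + ct).decode(),
            }
        return enc

    def query_token(self, key: str, value: str) -> str:
        # The client sends only this token; plaintext and keys stay local.
        return self._token(key, value)
```

A client could, for example, index encrypt_doc({"modality": "MRI", "subject": "s001"}) and later search by sending only query_token("modality", "MRI"); the server matches the stored tokens without ever seeing plaintext values or keys.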
To assess the two encryption techniques, a throughput-oriented benchmark is developed and applied both to a synthetic dataset and within the MRI use case.
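Such a throughput-oriented measurement can likewise be sketched in a few lines. The snippet below is only an illustration under assumptions not stated in the abstract: it uses the official Elasticsearch Python client against an on-demand cluster assumed to listen on a local endpoint, and the index name, document fields, and the optional encrypt hook are hypothetical. It merely shows how the indexing rate of plaintext and client-side-encrypted documents could be compared.

```python
import time

from elasticsearch import Elasticsearch, helpers  # assumed client library


def synthetic_docs(n: int):
    """Generate simple synthetic metadata documents."""
    for i in range(n):
        yield {"subject": f"s{i:05d}", "modality": "MRI", "slice_count": i % 256}


def index_throughput(es: Elasticsearch, index: str, docs, encrypt=None) -> float:
    """Bulk-index the documents and return the achieved documents per second.

    If `encrypt` is given (e.g. FieldEncryptor.encrypt_doc from the sketch
    above), each document is encrypted client-side before it is indexed.
    """
    actions = (
        {"_index": index, "_source": encrypt(d) if encrypt else d} for d in docs
    )
    start = time.perf_counter()
    ok, _ = helpers.bulk(es, actions)
    return ok / (time.perf_counter() - start)


if __name__ == "__main__":
    # On-demand cluster assumed to run inside an HPC job on this node.
    es = Elasticsearch("http://localhost:9200")
    rate = index_throughput(es, "bench-plain", synthetic_docs(10_000))
    print(f"plaintext indexing: {rate:.0f} docs/s")
```

Running the same loop with the encrypt hook set to a client-side encryptor would yield the encrypted-indexing rate, which could then be compared against the plaintext baseline.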
Keywords: Data Lake; Data Management; Confidential Computing; High-Performance Computing