分类: 计算机科学 >> 计算机科学的集成理论 提交时间: 2022-11-28 合作期刊: 《数据智能(英文)》
摘要: Machine learning (ML) applications in weather and climate are gaining momentum as big data and the immense increase in High-performance computing (HPC) power are paving the way. Ensuring FAIR data and reproducible ML practices are significant challenges for Earth system researchers. Even though the FAIR principle is well known to many scientists, research communities are slow to adopt them. Canonical Workflow Framework for Research (CWFR) provides a platform to ensure the FAIRness and reproducibility of these practices without overwhelming researchers. This conceptual paper envisions a holistic CWFR approach towards ML applications in weather and climate, focusing on HPC and big data. Specifically, we discuss Fair Digital Object (FDO) and Research Object (RO) in the DeepRain project to achieve granular reproducibility. DeepRain is a project that aims to improve precipitation forecast in Germany by using ML. Our concept envisages the raster datacube to provide data harmonization and fast and scalable data access. We suggest the Juypter notebook as a single reproducible experiment. In addition, we envision JuypterHub as a scalable and distributed central platform that connects all these elements and the HPC resources to the researchers via an easy-to-use graphical interface.
分类: 计算机科学 >> 计算机科学的集成理论 提交时间: 2022-11-28 合作期刊: 《数据智能(英文)》
摘要: Data harmonization and documentation of the data processing are essential prerequisites for enabling Canonical Analysis Workflows. The recently revised Terabyte-scale air quality database system, which the Tropospheric Ozone Assessment Report (TOAR) created, contains one of the worlds largest collections of near-surface air quality measurements and considers FAIR data principles as an integral part. A special feature of our data service is the on-demand processing and product generation of several air quality metrics directly from the underlying database. In this paper, we show that the necessary data harmonization for establishing such online analysis services goes much deeper than the obvious issues of common data formats, variable names, and measurement units, and we explore how the generation of FAIR Digital Objects (FDO) in combination with automatically generated documentation may support Canonical Analysis Workflows for air quality and related data.