The NIRD Research Data Archive Preservation Plan

Introduction

The mission of the NIRD Research Data Archive (NIRD RDA) is to ensure that research data produced by Norwegian researchers remain widely discoverable, accessible and reusable for at least 10 years after the data have been deposited. Data deposited in the NIRD RDA come from different disciplines, are in a variety of formats and can be of any size. Being size, format and community agnostic poses several challenges for preservation. This document describes the preservation plan that NIRD follows to keep the data accessible during the archival period.

Scope and exclusion

This document describes the preservation plan for the archive to ensure that:

  • Authentic, reliable instances of the datasets are accessible.
  • Integrity, security and quality of the datasets are maintained.
  • Adequate management strategies for the datasets are in place during the archival period.

This plan does not consider the wider NIRD storage infrastructure that is not part of the implementation of the archive. The plan takes as its guidelines the OAIS reference model, the FAIR principles and the guidance from the Research Council of Norway.

Characterisation of the archive datasets

The NIRD RDA provides archival and publication services for datasets from all disciplines resulting from Norwegian research activities, provided the data is not sensitive and there is no requirement to store the data in another archive (for example, a domain-specific one). There is no restriction on the size of a dataset or the number of files it may contain. The archive currently holds datasets ranging from more than 100 TB in size with more than 300,000 files to a few MB with a handful of files.

Sciences such as Biology and Geoscience primarily use the netCDF format to store their data, which range from observational to analysis and simulation data. The format has been adopted by many disciplines as a de facto standard. Disciplines also use a variety of other formats, including PNG and JPEG for images and ASCII files for markdown and text. In some cases, researchers use the ZIP or TAR format to package and compress their data before archiving. The archive encourages researchers to choose open formats for their data as described in the list of open file formats.

Community watch

The NIRD RDA actively follows current or new trends appearing in the wide variety of disciplines served by the archive, by adopting the three approaches outlined below. All three rely on active communication between the archive and the communities to ensure the reuse of data is maximized.

The archive requires each dataset to be linked to a Depositor who acts as the contact person for the archived dataset. Users of the dataset who have queries about the data can contact the Depositor to resolve their query. The Depositor may find a need to update metadata for a dataset, or to replace or update an existing dataset. When the archive carries out its own activities (for example, identifying expired datasets for deletion or migrating datasets to a different class of storage), the Depositor can be contacted to understand the impact. The Depositor can also contact the archive if the data needs to be migrated to a new format.

Sigma2 also offers researchers Advanced User Support services where Depositors or stakeholders of archived datasets can work with the archive DevOps to improve the reuse of archived datasets. To date, such support has resulted in the extraction of metadata from netCDF files to populate a domain-specific catalog, and in the integration of the archive with domain-specific services and other portals.

Sigma2 regularly meets with communities and conducts yearly surveys to gather feedback on its services, including the archive. In addition, Sigma2 holds regular workshops with heavy users (those depositing several datasets per year or large volumes) to understand issues and future directions.

The NIRD RDA’s principles

  • The NIRD RDA strives to support Open Science, FAIR and the national guideline for sharing and reusing research data by ensuring each phase of the OAIS-based archive addresses the FAIR principles.
  • The NIRD RDA strives to support data from any community, of any size and with any open formats.
  • The NIRD RDA offers long term preservation of the data.
  • The NIRD RDA strives to support data driven science and data reuse.

The preservation strategies

The archive adopts the following strategies to ensure datasets are authentic, secure and accessible throughout their lifetime:

  • The Depositor of a dataset agrees to the Terms and Conditions that allow the archive to manage and distribute the datasets.
  • The Depositor is responsible for the integrity of the dataset at deposition, and for the eligibility of the dataset and its compliance with the Terms and Conditions and GDPR requirements.
  • The datasets are validated (metadata are checked and data is checked to ensure authenticity) before publication, which results in a DOI being issued.
  • The integrity of datasets is checked and maintained throughout the dataset’s lifetime.
  • All metadata and data comply with GDPR guidelines and IPR and copyright regulations of the dataset owner’s institution.
  • Deposited datasets are never deleted before the end of the retention time specified in the Depositor contract (at least 10 years), except in extraordinary circumstances. After the retention time, datasets may be deleted only for compelling technical reasons. The DOI of a deleted dataset will resolve to a tombstone record containing all the public metadata for the dataset.

Roles and Responsibilities

Archive DevOps: the team responsible for the operation and maintenance of the archive service.

NIRD RDA Administrators: verify the eligibility of the Depositor when requests for depositions arrive and support the users in the process of depositing and publishing the data.

NRIS First line: First line support for all the users of the National e-Infrastructure (including the NIRD RDA), owned by Sigma2 and operated by the NRIS federation.

NRIS Administrators (Infra team and SP team): operate the National e-Infrastructure (including the NIRD system), providing hardware and software capability for the NIRD RDA.

Depositor: deposits datasets in the archive. They are responsible for providing the metadata and data necessary to archive the dataset and are responsible for providing information and assistance to ensure the dataset remains usable over its lifetime.

NIRD RDA Technical coordinator/manager: responsible for aligning the service with the evolving needs of users and for coordinating the related DevOps activities.

NIRD RDA Product owner: responsible for the communication, strategies, benefit realization plan and alignment with the Key Performance Indicators.

Sigma2’s Board: The Board of Sigma2 has the ultimate responsibility for the archive.

Sikt: The Norwegian Agency for Shared Services in Education and Research is a public administrative body under the Ministry of Education and Research. Sigma2 is owned by Sikt. Sikt has the ultimate responsibility for the data in case Sigma2 AS is discontinued.

Sustainability plans and funding

Sigma2 ensures operation of the service for the validity period of its mandate. Sigma2, established in 2015, is mandated by the Research Council of Norway (RCN) and the four oldest universities with a horizon of 10 years. Every 5 years there is a mid-term evaluation which triggers the renewal of the next period. Therefore, the horizon in which Sigma2 operates is always between 5 and 10 years.

Even if the mandate is based on a collaboration between the RCN and the four oldest universities, the funding to store and operate the NIRD RDA is solely from RCN, as these data are considered of public value.

Preservation Plan Implementation

The archive follows the OAIS reference architecture which divides up the preservation process into functions: ingest, archive storage, data management and access. The preservation plan covers all functions. The functions are implemented as described below.

The Ingest function

The ingest function covers the deposit of the dataset and its related metadata. Potential Depositors are required to request registration with the archive. The archive manager assesses the request and approves it if the user is a member of a Norwegian institute. The archive implements an extension of the Dublin Core metadata standard as described here, and Depositors are required to agree to the terms and conditions of deposition and use. This agreement also covers the archive taking responsibility for the management and distribution of the dataset, as described in the depositor agreement.

The archive encourages researchers to deposit datasets in an open format as described in the user-guide.

Depositors are also required to select a license for the dataset. The archive currently suggests CC BY 4.0 for all datasets. Depositors can request a different license if the default license does not meet their needs. Each request is evaluated by the archive on a case-by-case basis, weighing the needs of the Depositor against the need to ensure the data are as open as possible.

The metadata supplied by the Depositor and the data form the Submission Information Package (SIP) in the OAIS model. The metadata is supplied via a form which is submitted to the archive and stored in a database. Each dataset is assigned a unique identifier that is internal to the archive. The identifier is attached to the metadata, and the data are physically stored under a folder whose root is the unique identifier. The ingest process is logged so that the archive can troubleshoot any issues that arise during ingestion.

The Depositor is notified once the data has been ingested and the status of the metadata is visible to the user via the metadata status flag.

Users can upload new datasets, or create a new version of an existing, published dataset. When the user chooses to create a new version of a dataset, the archive ensures a link between the existing dataset and the new version is maintained by using the Dublin Core ‘hasVersion’ and ‘isVersionOf’ terms. The archive creates a copy of the metadata for the new version and the Depositor is free to update the metadata as needed.
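The version linking described above can be illustrated with a minimal sketch. It assumes metadata records are plain dictionaries; the function name create_new_version and the 'identifier' key are hypothetical and are not taken from the archive's implementation. Only the two Dublin Core terms come from the text.

```python
import copy


def create_new_version(existing: dict, new_id: str) -> dict:
    """Copy the metadata of an existing dataset and link the two records.

    Hypothetical helper: the record layout is an assumption made for
    illustration, not the archive's actual schema.
    """
    # The new version starts from a copy that the Depositor may then edit.
    new_record = copy.deepcopy(existing)
    new_record["identifier"] = new_id
    # Dublin Core: the new record is a version of the existing one ...
    new_record["isVersionOf"] = existing["identifier"]
    new_record["hasVersion"] = []
    # ... and the existing record lists the new one among its versions.
    existing.setdefault("hasVersion", []).append(new_id)
    return new_record


original = {"identifier": "dataset-1", "title": "Sea ice observations"}
revised = create_new_version(original, "dataset-2")
print(revised["isVersionOf"], original["hasVersion"])
```

Keeping the link in both records means a reader who lands on either version can find the other.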

New datasets and new versions of existing datasets then go through a publication process where the archive checks the metadata has been completed and the dataset is complete (in the sense that the Depositor checks and verifies all the data and metadata intended for the archive dataset has been supplied). Once the data and metadata have been verified the dataset is published by requesting from DataCite a Digital Object Identifier (DOI). Once the DOI has been issued it is attached as metadata to the dataset and the dataset is made read-only. The publicly accessible DOI resolves to a landing page maintained and operated by the archive that contains metadata information for the dataset and a link to the dataset. If a dataset has been deleted a metadata record called a tombstone record is kept which contains metadata for the dataset and a reason for deletion.

The Archive storage function

The data is copied from the ingest area to the archive area file by file, and a hash is computed for each file. The data are stored under a folder whose top level is the dataset identifier. Metadata for each file are stored in the database table of contents. The dataset metadata supplied by the Depositor, together with the table of contents generated by the archive (which includes the fixity, file paths, sizes, last modified times and formats), form the Archival Information Package (AIP) that is used to manage the dataset in the archive. The Depositors are notified of the success or failure of the archiving. The archive makes a back-up copy of the dataset and replicates the dataset to storage on another site managed by Sigma2.
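The file-by-file copy and table of contents can be sketched as follows. This is an illustration under stated assumptions rather than the archive's code: SHA-256 is assumed as the hash (the plan does not name the algorithm), the format is guessed from the file extension, and the names archive_dataset and sha256_of are hypothetical.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def archive_dataset(ingest_dir: Path, archive_root: Path, dataset_id: str) -> list[dict]:
    """Copy a dataset file by file and return its table of contents."""
    target_root = archive_root / dataset_id  # top-level folder is the dataset identifier
    contents = []
    for source in sorted(p for p in ingest_dir.rglob("*") if p.is_file()):
        relative = source.relative_to(ingest_dir)
        target = target_root / relative
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(source, target)  # copy2 also preserves the modification time
        checksum = sha256_of(target)
        if checksum != sha256_of(source):
            raise IOError(f"copy of {relative} does not match its source")
        stat = target.stat()
        contents.append({
            "path": relative.as_posix(),
            "size": stat.st_size,
            "modified": stat.st_mtime,
            # Assumption: format derived from the extension only.
            "format": target.suffix.lstrip(".").lower() or "unknown",
            "sha256": checksum,
        })
    return contents
```

Hashing the copy and comparing it with the source catches a failed transfer before the checksum is recorded as the reference value.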

Once datasets have been published, the datasets are made read-only. The archive possesses the ability to delete datasets if they contravene copyright or if there is a valid reason for deletion. In the latter case, depending on the reason, the data are made inaccessible and marked as eligible for deletion but not actually deleted.

The Data management function

The data management function covers the metadata management. The Research Data Archive uses a PostgreSQL database on reliable, backed-up SSD storage. All metadata for the SIP, AIP and the Dissemination Information Package (DIP) are stored in the database. The database schema is arranged by dataset, where each dataset has dataset metadata and file metadata. All the information stored is necessary for the Ingest, Archive storage, Administration, Preservation planning and Access functions.
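A schema arranged by dataset, with dataset metadata and file metadata, might look like the sketch below. It uses Python's built-in sqlite3 module as a stand-in for PostgreSQL so that it runs anywhere; all table and column names are hypothetical and do not reflect the archive's actual schema.

```python
import sqlite3

# Hypothetical schema: one row per dataset, one row per file in a dataset.
SCHEMA = """
CREATE TABLE dataset (
    id        TEXT PRIMARY KEY,
    title     TEXT NOT NULL,
    depositor TEXT NOT NULL,
    doi       TEXT
);
CREATE TABLE dataset_file (
    dataset_id TEXT NOT NULL REFERENCES dataset(id),
    path       TEXT NOT NULL,
    size_bytes INTEGER NOT NULL,
    checksum   TEXT NOT NULL,
    PRIMARY KEY (dataset_id, path)
);
"""


def open_catalogue() -> sqlite3.Connection:
    """Create an in-memory catalogue with foreign keys enforced."""
    connection = sqlite3.connect(":memory:")
    connection.execute("PRAGMA foreign_keys = ON")
    connection.executescript(SCHEMA)
    return connection


def table_of_contents(connection: sqlite3.Connection, dataset_id: str) -> list[tuple]:
    """Return (path, size, checksum) for every file of one dataset."""
    rows = connection.execute(
        "SELECT path, size_bytes, checksum FROM dataset_file "
        "WHERE dataset_id = ? ORDER BY path",
        (dataset_id,),
    )
    return rows.fetchall()
```

The foreign key ensures that file metadata cannot exist without the dataset record it belongs to.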

The Access function

The access function covers the means by which users find, view and access the archived dataset. Each published dataset has a DOI that resolves to a landing page that is hosted by the archive system. The page contains all the publicly available metadata as well as the license for using the dataset, a link to the table of contents and a link to the dataset (this is the DIP). Users can anonymously download a complete dataset or choose the subset of files they wish to download.

The archive provides a web-based search function based on the widely used Apache Solr platform. Users can search for datasets of interest based on the exposed metadata (for example, terms in the title, description, subject or creator can be searched). Although the archive provides metadata that is generic enough to support a wide variety of disciplines, researchers have used the DOI in their more detailed metadata registries to enable more fine-grained search of the data.

The archive also provides an API, likewise built on the Solr platform, that offers basic search functionality. In addition, an OAI-PMH interface exists for harvesting by other metadata catalogues. The basic search returns metadata in JSON format with a link to the dataset, which can be accessed via a MinIO S3 service.
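As an illustration of how another catalogue might harvest over OAI-PMH, the sketch below builds a standard ListRecords request and reads titles from an oai_dc response. The verb, parameters and XML namespaces come from the OAI-PMH 2.0 protocol; the base URL is a placeholder, since the archive's endpoint is not given here, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Optional
from urllib.parse import urlencode

NAMESPACES = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def list_records_url(base_url: str, metadata_prefix: str = "oai_dc",
                     from_date: Optional[str] = None) -> str:
    """Build an OAI-PMH ListRecords request URL."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date:
        params["from"] = from_date  # selective harvesting by datestamp
    return f"{base_url}?{urlencode(params)}"


def titles_by_identifier(response_xml: str) -> dict:
    """Map each record identifier in a ListRecords response to its titles."""
    root = ET.fromstring(response_xml)
    result = {}
    for record in root.iterfind(".//oai:record", NAMESPACES):
        identifier = record.findtext("oai:header/oai:identifier", namespaces=NAMESPACES)
        result[identifier] = [t.text for t in record.iterfind(".//dc:title", NAMESPACES)]
    return result


# Placeholder endpoint, not the archive's real address.
print(list_records_url("https://archive.example.org/oai", from_date="2024-01-01"))
```

A harvester would fetch the URL, pass the response body to titles_by_identifier, and follow any resumptionToken the server returns to page through large result sets.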

The Administration function

The function covers the management and operation of the archive. NIRD RDA Administrators field queries from users through a ticket system and the archive is supported by an infrastructure team who ensure the archive service and storage maintain a high level of availability, security and reliability.

The Preservation planning function

The preservation planning function ensures that the data remain usable over their lifetime. This function splits into two forms: ensuring the service provides access to the datasets and ensuring the datasets remain understandable. The first function is performed by the archive, where datasets are regularly checked to ensure integrity. The archive team is included in storage planning, and the infrastructure and archive teams plan and work together to migrate the data to new infrastructure as seamlessly as possible. To date, the archive has undergone three successful migrations since its inception in 2014.

The second function relies on close collaboration with the domain experts who are aware of changes in their domain that may impact their archived data. The archive requires a Depositor to be associated with each dataset. The role of the Depositor is to provide a contact point in case users of the dataset have questions that are not covered by the metadata. The Depositor also serves as a contact point for the archive to the domain. Regular meetings are held with the regular users of the dataset to understand any changes that need to be made to ensure the data remain usable.

Sigma2 also operates an ‘advanced user support’ program which provides a mechanism for owners or users of the archive to request support to fulfill any need that ensures the data remain usable.

Datasets reaching the end of their retention period (10 years) are reappraised in collaboration with domain experts and the Depositor. If the dataset is considered to still be of value, the retention period will be extended (the period may vary and will be defined on a case-by-case evaluation). If the appraisal is not possible due to unavailability of the Depositor, the datasets are kept unless there is a compelling technical reason to delete them. 

If a dataset is a candidate for deletion after the retention time, the impending deletion will first be announced on the Research Data Archive front page and the dataset’s landing page. This announcement will be visible for a period of one year. During this grace period, anyone who has an interest in maintaining access to the dataset can renew the retention period by contacting the Archive Manager at archive.manager@nris.no. If a dataset is deleted, the DOI will resolve to a tombstone record containing all the public metadata for the dataset.
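The timing rules above can be expressed as a short date calculation. This is a sketch under assumptions: the retention period is counted from the deposit date and set to the minimum of 10 years, and the function names are hypothetical.

```python
from datetime import date

RETENTION_YEARS = 10  # minimum retention period stated in the plan
GRACE_YEARS = 1       # announcement is visible for one year before deletion


def add_years(day: date, years: int) -> date:
    """Shift a date by whole years, mapping 29 February to 28 February."""
    try:
        return day.replace(year=day.year + years)
    except ValueError:  # 29 February in a target year that is not a leap year
        return day.replace(year=day.year + years, day=28)


def retention_end(deposited: date) -> date:
    """Date on which the minimum retention period ends."""
    return add_years(deposited, RETENTION_YEARS)


def earliest_deletion(deposited: date, announced: date) -> date:
    """Earliest date a dataset may be deleted.

    Deletion requires both that the retention period has ended and that
    the announcement has been visible for the full grace period.
    """
    return max(retention_end(deposited), add_years(announced, GRACE_YEARS))


print(earliest_deletion(date(2015, 6, 1), announced=date(2027, 3, 1)))
```

Taking the later of the two dates guards against an announcement that is published before the retention period has actually ended.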

Scenarios and contingency plan

Data is not accessible.

Should data not be searchable or accessible, the FitSM incident management procedure is initiated, including communication with the end user and initiation of the recovery plan. As the archive is part of the NIRD system, the incident follows the contingency plan of the NIRD ecosystem.

Data is corrupted.

Data are checksummed at ingestion and the checksum values are regularly monitored. The primary copy of the data is regularly replicated onto two offline storage systems which are physically and logically separated. Corrupted data can be restored from the first or second replica.
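A periodic fixity check with replica lookup could be sketched as below. It assumes SHA-256 checksums recorded per relative file path and replicas that mirror the primary directory layout; neither detail is specified in this plan, and the function names are hypothetical.

```python
import hashlib
from pathlib import Path
from typing import Optional


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_intact(path: Path, expected: str) -> bool:
    """True if the file exists and matches its recorded checksum."""
    return path.is_file() and sha256_of(path) == expected


def find_restore_source(relative_path: str, expected: str,
                        replicas: list[Path]) -> Optional[Path]:
    """Return the first replica copy that matches the recorded checksum."""
    for replica_root in replicas:
        candidate = replica_root / relative_path
        if is_intact(candidate, expected):
            return candidate
    return None


def audit(primary_root: Path, recorded: dict[str, str],
          replicas: list[Path]) -> dict[str, Optional[Path]]:
    """Map each corrupted or missing file to a replica that can restore it.

    A value of None means no replica holds an intact copy.
    """
    report = {}
    for relative_path, expected in recorded.items():
        if not is_intact(primary_root / relative_path, expected):
            report[relative_path] = find_restore_source(relative_path, expected, replicas)
    return report
```

Checking each replica against the recorded checksum before restoring avoids overwriting a damaged primary copy with an equally damaged replica.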

Metadata are obsolete or no longer valid.

It is the Depositor's responsibility to ensure that the metadata are not obsolete and are fit for the purpose of being discovered by the targeted communities. If a data owner recognizes that the metadata are obsolete, the approach is to create a new version of the dataset instead of modifying the old metadata. The old dataset, with its obsolete metadata, is kept, with a new metadata record pointing to the newer version. Likewise, the new version contains a metadata record referencing the obsolete one.

Format is obsolete or no longer valid.

If the Depositor identifies a need to migrate to a new format, they can notify the NIRD RDA Administrators who will work with the Depositor to create the migration workflow that includes ensuring significant properties of the dataset are maintained. The migration would make use of the existing NRIS computing and storage infrastructure. Adequate resources and competences from both data owner and service provider will be allocated to the migration task.

The underlying storage infrastructure is going out of production.

If the underlying NIRD infrastructure is going out of production, the migration of the archived data onto the new infrastructure is part of the procurement project for the renewal of the infrastructure. Migration is done during the acceptance testing period of the new infrastructure and integrity checks are done before dismissing the old infrastructure and putting the new one in production. Final deletion of the data from the old infrastructure is done after one year from decommissioning.

Governance or funding scheme is suddenly changed.

Archive responsibility is given to the body ultimately accountable for Sigma2, which is the Sigma2 Board. Should the Sigma2 Board be dismissed, responsibility for the archived data is to be taken by Sikt.

References