Ingest data to Archive

The archive supports deposit of datasets from the project area.
 

How to ingest data to the Archive from the project area

  • You will first need to have used the web interface to complete the basic metadata form for deposit of a new dataset.
  • Once you choose the option "upload dataset from project area" and click the "Submit Dataset" button, you will be able to use the commands described on this page to archive your dataset.
  • An email will be sent to you containing the dataset identifier (the UUID).
     

 

You should then do the following:

1. Log on to a machine that can access the project area (ssh login.norstore.uio.no)
 

2. Create a manifest file containing the paths to the files that make up the dataset. Each line must be a valid set of arguments for the UNIX “find ! -type d” command, which is used by the ArchiveDataset script. For example, if we define our dataset to consist of all gzipped tar files in the NS1234K project, then our manifest file should contain the line:

/projects/NS1234K/ -name *.tar.gz

The manifest file can contain more than one line if the dataset spans more than one project, different types of files, etc.

 

3. By default, the files that make up the dataset are stored under their full path, excluding the leading '/' (e.g. projects/NS1234K/subdir1/file1.dat). You can indicate that the root part of the path should be removed by adding a “//” where the root path ends.

E.g. to remove “/projects/NS1234K” from “/projects/NS1234K/subdir1/file1.dat” you would add the following to your manifest file: “/projects/NS1234K///subdir1/file1.dat”. This can be used in combination with the regular expressions and globbing that are recognised by the find command. To remove “/projects/NS1234K” from the pattern that archives all “.tar.gz” files in the directory “/projects/NS1234K/subdir1”, specify the following: “/projects/NS1234K///subdir1 -name *.tar.gz”.
 

4. Run the command: ArchiveDataset UUID manifest-file
This creates a special file that is picked up by the archiver cron job, which copies the dataset from the project area to the archive. Depending on the size of the dataset, the copy can take a long time.
 

5. Once the copy has completed, you will receive an email with the results of the copy: how much data was copied and whether the copy was successful. At this point the dataset has been safely uploaded to the archive and you can log back onto the web interface to complete the archiving process.
 

6. To get a list of your project-area datasets that have been submitted to the queue for archival, use the command: ListArchiveDataset [UUID]
The UUID is optional.
 
7. You can cancel a request with the command: CancelArchiveDataset UUID
Only datasets that are pending or are in the process of being archived can be cancelled. It is not possible to cancel a dataset that has been archived.
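
Before running ArchiveDataset it can help to preview which files a manifest line would select. The sketch below is only an illustration: it builds a throwaway directory as a stand-in for /projects/NS1234K (the file names are hypothetical), writes a one-line manifest, and passes each line to find together with “! -type d”. It assumes this mirrors how the manifest lines are used; the ArchiveDataset script itself may behave differently, and paths containing spaces would break this simple preview.

```shell
# Stand-in for /projects/NS1234K so the sketch can run anywhere (hypothetical layout).
demo_root="$(mktemp -d)"
mkdir -p "$demo_root/subdir1"
touch "$demo_root/a.tar.gz" "$demo_root/subdir1/b.tar.gz" "$demo_root/subdir1/notes.txt"

# Write a one-line manifest. Here-documents are not glob-expanded,
# so *.tar.gz is written to the file literally.
manifest="$demo_root/manifest.txt"
cat > "$manifest" <<EOF
$demo_root/ -name *.tar.gz
EOF

# Preview each manifest line by handing it to find with "! -type d".
# Assumption: this mirrors the documented use of the line, nothing more.
matches="$(
  set -f   # keep the shell from expanding *.tar.gz before find sees it
  while IFS= read -r line; do
    find $line ! -type d   # unquoted on purpose: the line holds several arguments
  done < "$manifest" | sort
)"
printf '%s\n' "$matches"
```

On a real login node the same loop can be pointed at your own manifest file; only the two .tar.gz files are listed here, and notes.txt is skipped.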

 

NOTE that once a dataset has been archived using the ArchiveDataset script it is considered closed and it is not possible to add more files to the dataset. You will need to create a new dataset if you wish to update the dataset.
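
The effect of the “//” root marker described in step 3 can be illustrated with plain shell string operations. This is a sketch of the naming rule only, not of how ArchiveDataset is implemented; the variable names are hypothetical, and it assumes the stored name drops the slash that follows the marker, just as the default name drops the leading '/'.

```shell
# Default: the stored name is the full path without the leading '/'.
plain='/projects/NS1234K/subdir1/file1.dat'
default_name="${plain#/}"

# With the marker: the root before "//" is removed from the stored name.
marked='/projects/NS1234K///subdir1/file1.dat'
root="${marked%%///*}"          # text before the marker
stripped_name="${marked#*///}"  # text after the marker

echo "default:  $default_name"    # default:  projects/NS1234K/subdir1/file1.dat
echo "root:     $root"            # root:     /projects/NS1234K
echo "stripped: $stripped_name"   # stripped: subdir1/file1.dat
```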