
Replication

Summary

This document outlines a proposal for automatically replicating files registered in Datatrail to their preferred storage elements, according to each file's replication policy.

Motivation

Goals

The main purpose of Datatrail is to manage CHIME/FRB's raw data. One of its primary features is automatically copying data to its preferred storage elements (as set by the replication policy).

Non-Goals

This feature does not deal with data processing or with making processed data available at CANFAR for quick follow-up analysis. If immediate offsite access is required, users can use the CLI to manually trigger transfers. This service only transfers data that have been registered in Datatrail.

Proposal

User Stories

User Story 1

Two days after the callbacks happen, Datatrail registers the events. At that point, the replication policy for all the files belonging to an event will be set. We do not want users to conduct manual transfers, since the data volume is large. The automated replication system should read the replication policy of an event's files and, where the policy requires it, automatically copy the data to remote locations.

System Components

This system will contain 4 components:

  1. Replication Staging Daemon
  2. Buckets (for queuing work for replicators)
  3. Replicators
  4. State Updater Daemon (to update the state of the database after replication of a file replica)
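
To make the interaction between these components concrete, the sketch below models a bucket as a simple first-in, first-out work queue with deposit and withdraw operations. The Bucket class and its method names are illustrative assumptions for this proposal, not the actual queuing service's API.

from collections import deque
from typing import Optional


class Bucket:
    """Illustrative in-memory stand-in for a replication work bucket."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._queue: deque = deque()

    def pending(self) -> int:
        # Number of work objects waiting to be withdrawn.
        return len(self._queue)

    def deposit(self, work: dict) -> None:
        # Queue a work object for a consumer (replicator or state updater).
        self._queue.append(work)

    def withdraw(self) -> Optional[dict]:
        # Take the next work object, or None if the bucket is empty.
        return self._queue.popleft() if self._queue else None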

Replication Staging Daemon

The Replication Staging Daemon will run on the main server at the CHIME site to create work for the replicators. It will directly perform read queries against the database to identify which files are yet to be replicated. It is the starting point of the automated replication system. It will:

  1. Check how many files are pending in the replicators' buckets so that the queue does not overflow. The daemon will not create more work if a bucket already holds N files that are yet to be replicated.
  2. Perform the database query to identify the next M files to replicate.
  3. Create a work object per file for the replicators and deposit it in their bucket.
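
A minimal sketch of this loop is shown below, assuming a bucket interface like the one sketched above and a hypothetical query_pending_files helper that wraps the database query described under Design Details; the threshold values for N and M are illustrative.

import time

MAX_QUEUED_FILES = 100   # N: stop creating work past this many queued files
BATCH_SIZE = 25          # M: files staged per iteration


def staging_loop(bucket, query_pending_files, poll_interval=60):
    """Create replication work while the replicators' bucket has capacity."""
    while True:
        # 1. Back off if the bucket already holds N files awaiting replication.
        if bucket.pending() >= MAX_QUEUED_FILES:
            time.sleep(poll_interval)
            continue

        # 2. Identify the next M files to replicate.
        rows = query_pending_files(limit=BATCH_SIZE)

        # 3. Deposit one work object per file into the replicators' bucket.
        for file_id, file_path, destination in rows:
            bucket.deposit({
                "file_id": file_id,
                "file_path": file_path,
                "destination": destination,
            })

        time.sleep(poll_interval)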

Replicators

Replicators will be responsible for performing the actual copy of each file. Since each replicator transfers one file at a time, running X replicators at a site yields X parallel file transfers. The number of replicators (X) can be scaled up or down depending on the site's network bandwidth. Each replicator will:

  1. Withdraw work from its bucket.
  2. Check connectivity to CANFAR.
  3. Perform the copy.
  4. Create a work object for the State Updater Daemon containing the status of the current file transfer and information about the replicated file at the new storage element.
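
A sketch of the per-replicator loop follows. The check_canfar_connectivity and copy_file helpers are hypothetical placeholders for whatever connectivity check and transfer tool each site actually uses, and the fields handed to the state updater mirror the (still to be defined) payload in Design Details.

import time


def replicator_loop(work_bucket, updater_bucket, check_canfar_connectivity, copy_file):
    """Transfer one file at a time and report the result to the state updater."""
    while True:
        # 1. Withdraw work from this replicator's bucket.
        work = work_bucket.withdraw()
        if work is None:
            time.sleep(5)
            continue

        # 2. Requeue the work if CANFAR is currently unreachable.
        if not check_canfar_connectivity():
            work_bucket.deposit(work)
            time.sleep(60)
            continue

        # 3. Perform the copy.
        try:
            replica_path = copy_file(work["file_path"], work["destination"])
            status = "completed"
        except Exception:
            replica_path = None
            status = "failed"

        # 4. Hand the outcome to the State Updater Daemon.
        updater_bucket.deposit({
            "file_id": work["file_id"],
            "transfer_status": status,
            "destination": work["destination"],
            "replica_path": replica_path,
        })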

State Updater

The State Updater Daemon will be responsible for updating the state of the replicated file and adding the new file replica to the database. A single state updater should suffice; there is no need to run multiple. The same daemon will also handle updating the deletion state when a file replica is deleted.

It will:

  1. Withdraw work from its bucket.
  2. Update the state of the original file (i.e. status of the file transfer) in the database.
  3. Add the new file replica (i.e. the copied file) to the database.
  4. Conclude work with the appropriate status.
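
A sketch of how the state updater could apply a single work object is below; db.update_transfer_status and db.insert_file_replica are assumed helper names standing in for Datatrail's actual database layer.

def handle_update(work, db):
    """Apply one replication result to the database."""
    # 2. Update the transfer status of the original file.
    db.update_transfer_status(
        file_id=work["file_id"],
        status=work["transfer_status"],
    )

    # 3. Add the new file replica (the copied file) if the transfer succeeded.
    if work["transfer_status"] == "completed":
        db.insert_file_replica(
            file_id=work["file_id"],
            storage_element=work["destination"],
            path=work["replica_path"],
        )

    # 4. Conclude the work object with the appropriate status.
    return work["transfer_status"]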

Design Details

Database Query to Identify the Files to Replicate
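
The exact query is still to be defined. As a placeholder, something along the following lines could select files whose replication policy names a preferred storage element that does not yet hold a replica; the table and column names are assumptions about the Datatrail schema, not its actual definition.

# Illustrative only: the table and column names below are assumed.
PENDING_FILES_QUERY = """
SELECT f.id AS file_id,
       fr.file_path,
       p.storage_element AS destination
FROM file AS f
JOIN file_replica AS fr ON fr.file_id = f.id
JOIN replication_policy AS p ON p.file_id = f.id
WHERE NOT EXISTS (
    SELECT 1
    FROM file_replica AS existing
    WHERE existing.file_id = f.id
      AND existing.storage_element = p.storage_element
)
LIMIT %(limit)s
"""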


Payload of the work object for replicators

{
    "file_id": int,       # database id of the file to copy
    "file_path": str,     # complete path to the file's location on the local storage element
    "destination": str    # name of the destination storage element to copy the file to
}
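
For illustration only, a filled-in work object might look like the following; the id, path, and storage element name are hypothetical.

{
    "file_id": 123456,
    "file_path": "/data/chime/example_event/example_file.h5",   # hypothetical path
    "destination": "canfar"                                      # hypothetical storage element name
}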

Payload of the work object for the State Updater Daemon

{
}
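
The fields of this payload are still to be defined. Based on the replicator description above, it would plausibly carry at least the following; every field name here is a placeholder rather than a settled schema.

{
    "file_id": int,          # database id of the original file
    "transfer_status": str,  # outcome of the copy, e.g. "completed" or "failed"
    "destination": str,      # name of the storage element the file was copied to
    "replica_path": str      # path of the new replica at the destination
}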