Registration

Summary

This document outlines a proposal for automatically registering raw data (intensity and baseband) of new events into Datatrail for long term management.

Motivation

Goals

The main purpose of Datatrail is to manage CHIME/FRB's raw data. In order to manage this data, Datatrail needs to know about it and be able to track it. Therefore, as new callbacks happen and new data is collected by the CHIME/FRB real-time pipeline, the data for these events needs to be registered in Datatrail's database.

Non-Goals

This feature does not deal with data processing and making processed data available at CANFAR for quick follow-up analysis.

Proposal

Context

Ingredients for registering data into Datatrail

  1. File
    1. file name
    2. size in bytes
    3. md5sum
    4. file path
  2. Dataset
    1. name (the event number in this case)
    2. belongs_to (a larger dataset or a dataset of datasets, e.g. classified FRBs or RFI; both ingredient groups are sketched below)
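
For concreteness, the two ingredient groups could be pictured as small records like the following; the class and field names are illustrative, not Datatrail's actual schema.

```python
# Illustrative sketch of the registration ingredients; not Datatrail's schema.
from dataclasses import dataclass
from typing import List


@dataclass
class FileInfo:
    name: str        # file name
    size_bytes: int  # size in bytes
    md5sum: str      # checksum computed by the nightly cronjob
    path: str        # location on the archiver


@dataclass
class Dataset:
    name: str        # the event number in this case
    belongs_to: str  # larger dataset, e.g. classified FRBs or RFI
    files: List[FileInfo]
```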

Intensity data

  1. Callbacks issued in realtime.
    1. Data is available on the archiver within a few minutes. This provides the file name, size in bytes, and file path.
  2. md5sums are available the next day, once the nightly cronjob has computed them.
  3. Tsar classification of the event can take > 1 day. This is required to put the event in the larger dataset (belongs_to) with the appropriate replication and deletion policy. This ingredient has the longest lead time.

Baseband data

  1. Callbacks issued in realtime.
    1. Data is available on the archiver ~1 hour after the callback, due to the transfer to the staging node and subsequent conversion to HDF5. This provides the file name, size in bytes, and file path.
  2. md5sums are available the next day, once the nightly cronjob has computed them.
  3. Tsar classification of the event can take > 1 day. This is required to put the event in the correct dataset with the appropriate replication and deletion policy. This ingredient has the longest lead time.

Do we need the raw data transferred immediately to CANFAR (MINOC or ARC)?

No.

  1. For intensity data, it does not matter at all. We can register the data after 1-2 days (even a week), by which point we are almost guaranteed to have tsar classifications, and Datatrail will then automatically replicate the data once it sees the updated policy.
  2. For raw baseband data, it likely does not matter either, unless we need the full raw data for immediate cross-correlation (which is unlikely: confirm with the outrigger team; I think they only need single-beam baseband, not raw). As with intensity data, we can register after 1-2 days (even a week), and Datatrail will then automatically replicate the data once it sees the updated policy.

Operational mode

Since we do not need raw data immediately at CANFAR, we can wait for tsar classifications and other ingredients to be available before registering it into Datatrail.

The proposal is to run this system (the Event Registrar) in a mode where we register data that was captured 2 days ago (configurable); this should provide ample time for all events from that day to have tsar classifications. This way, we don't have to manage initial data registration separately from tsar classifications and unnecessarily increase the complexity.
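
The two knobs mentioned above might be expressed as a small configuration block; the key names here are assumptions for illustration, not an existing config format.

```python
# Hypothetical Event Registrar configuration; key names are illustrative.
REGISTRAR_CONFIG = {
    "run_interval_hours": 24,  # how often the registrar wakes up
    "days_to_wait": 2,         # register data captured this many days ago,
                               # leaving time for tsar classifications to land
}
```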

Source of truth

The system that tracks callbacks is the L4 database. However, we have seen multiple cases where the request is registered in the database but the data is not actually written out to the archivers (due to missing NFS mounts on L1 or some other system issue). This system does not care why a callback didn't happen; that's the job of the realtime pipeline. This system only needs to register what was successfully captured.

Therefore, we can treat the archivers as the source of truth. Any data that resides in the raw data paths of the events on the archivers at the time of registration will be used. We will not make separate queries to other databases for validation. This should be safe against users accidentally editing data, because the raw data paths are mounted read-only on all systems except the L1 nodes that write to them.
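
As an illustration, a registrar pass could simply enumerate whatever event directories exist on the archiver for a given date; the path layout used below is an assumption for illustration, not the actual archiver layout.

```python
# Sketch: list event numbers present on the archiver for one date.
# The layout /data/<data_type>/raw/YYYY/mm/dd/<event_number> is assumed.
from pathlib import Path
from typing import List


def events_on_archiver(data_type: str, date: str) -> List[int]:
    """Return event numbers captured on `date` (YYYY-mm-dd) for a data type."""
    year, month, day = date.split("-")
    date_dir = Path("/data") / data_type / "raw" / year / month / day
    if not date_dir.exists():
        return []
    # Each event's raw data lives in a directory named after its event number.
    return sorted(int(p.name) for p in date_dir.iterdir() if p.is_dir())
```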

Design Details

Flow of Action

This system will run persistently on the CHIME/FRB analysis cluster and perform the following set of actions periodically (once a day; again, configurable); a condensed code sketch follows the list:

  1. Query the last day that was successfully completed (see Tracking Database below) from the tracking database.
  2. Query the unregistered events from past dates from the tracking database.
  3. Compile the list of dates from the last successful day up to the date two days ago. These are the days that we are going to register.
  4. Parse through the unsuccessful event list from step 2 and try to re-register those events. It's possible that the tsar classifications are now available or the glitch was resolved. Delete the now-successful events from the tracking database.
  5. For each date,
    1. Walk through the directory for that date on the archiver and compile the list of event numbers that need to be registered.
    2. Collect ingredients as you traverse the directory structure: md5sums, file size, name, and path.
    3. For each event number,
      1. Check if this event is already registered in Datatrail. Skip the next steps if it is.
      2. Query the L4 database to identify whether it belongs to a dataset that is not sent to tsars for classification (e.g. zooniverse or Pulsar.B0329+54), and use the dataset name stored in the L4 database for this purpose.
      3. If the event should have been sent to tsars for classification, query frb-master for tsar classifications.
      4. Use the tsar classification to attach the event to the appropriate larger dataset (e.g. classified FRBs or classified RFI).
      5. Once all ingredients have been gathered, register the data into Datatrail. After all events from that date have been attempted, update the tracking database with the new value for last_completed_date.
      6. If any of the above steps fail, log the event number to a file and add an entry directly to the tracking database.
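
Putting the steps together, one pass of the registrar might look like the following condensed sketch; every helper function is a hypothetical stand-in for the queries listed under Implementation Details below.

```python
from datetime import date, timedelta

# Hypothetical stand-ins for the queries listed under Implementation Details.
def query_last_completed_date() -> date: ...
def retry_unregistered_events() -> None: ...
def events_on_archiver(day: date) -> list: ...
def is_registered_in_datatrail(event: int) -> bool: ...
def gather_ingredients(event: int, day: date) -> dict: ...
def register_in_datatrail(ingredients: dict) -> None: ...
def record_failure(event: int, day: date, reason: str) -> None: ...
def update_last_completed_date(day: date) -> None: ...


def run_once(days_to_wait: int = 2) -> None:
    """One periodic pass of the Event Registrar (steps 1-5 above)."""
    retry_unregistered_events()                                # steps 2 and 4
    target = date.today() - timedelta(days=days_to_wait)
    day = query_last_completed_date() + timedelta(days=1)      # steps 1 and 3
    while day <= target:
        for event in events_on_archiver(day):                  # step 5.1
            if is_registered_in_datatrail(event):              # step 5.3.1
                continue
            try:
                ingredients = gather_ingredients(event, day)   # steps 5.2-5.3.4
                register_in_datatrail(ingredients)             # step 5.3.5
            except Exception as exc:
                record_failure(event, day, reason=str(exc))    # step 5.3.6
        update_last_completed_date(day)  # advance the bookmark a day at a time
        day += timedelta(days=1)
```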

Tracking Database

We can track event registration in MongoDB, which is readily accessible, since this registration bookkeeping does not really fit the schema of Datatrail.

Database name: datatrail_event_registration
We can use the schema defined below to track events that could not be registered into Datatrail.

Collection name: unregistered_events
{
    "event_number": int,
    "data_type": str,              # "intensity" or "baseband"
    "date_captured": str,          # YYYY-mm-dd
    "action_picker_dataset": str,  # e.g. "realtime-pipeline"
    "reason": str
}

To track the last successful date, we can store:

Collection name: last_completed_date
{
    "last_completed_date": "YYYY-mm-dd"
}

The above payload is updated after registration has been attempted for all events from that day.
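
Assuming pymongo, both collections could be maintained roughly as follows; the connection URI and the inserted values are placeholders.

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # placeholder URI
db = client["datatrail_event_registration"]

# Record an event that failed to register (step 6 of the flow).
db.unregistered_events.insert_one({
    "event_number": 123456789,      # illustrative event number
    "data_type": "intensity",
    "date_captured": "2023-01-15",  # illustrative date
    "action_picker_dataset": "realtime-pipeline",
    "reason": "tsar classification not yet available",
})

# Advance the bookmark once all events from a day have been attempted.
db.last_completed_date.update_one(
    {}, {"$set": {"last_completed_date": "2023-01-15"}}, upsert=True
)
```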

Implementation Details

  1. Query the last successful day directly from MongoDB.
  2. Query the list of unregistered events directly from MongoDB.
  3. Delete the now-successful events directly from MongoDB.
  4. Check if an event is already registered in Datatrail by making a query to the Datatrail server (sketched after this list).
  5. Query the L4 database by making an HTTP request to the L4 server at https://frb.chimenet.ca/chimefrb.
  6. Query frb-master's MongoDB verification collection directly.
  7. Register into Datatrail via a commit to the Datatrail server.
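
As a sketch of step 4, assuming the Datatrail server exposes a REST-style lookup; the URL and the 404-means-unregistered convention are assumptions, not the documented API.

```python
import requests

DATATRAIL_SERVER = "https://datatrail.example.org"  # placeholder host


def is_registered(event_number: int, data_type: str) -> bool:
    """Ask the Datatrail server whether a dataset already exists (assumed API)."""
    url = f"{DATATRAIL_SERVER}/datasets/{data_type}/{event_number}"
    response = requests.get(url, timeout=30)
    if response.status_code == 404:  # assumed: 404 means not yet registered
        return False
    response.raise_for_status()
    return True
```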