Registration¶
Summary¶
This document outlines a proposal for automatically registering raw data (intensity and baseband) of new events into Datatrail for long-term management.
Motivation¶
Goals¶
The main purpose of Datatrail is to manage CHIME/FRB's raw data. In order to manage this data, Datatrail needs to know about it and be able to track it. Therefore, as new callbacks happen and new data is collected by the CHIME/FRB real-time pipeline, the data for these events needs to be registered into Datatrail's database.
Non-Goals¶
This feature does not deal with data processing or with making processed data available at CANFAR for quick follow-up analysis.
Proposal¶
Context¶
Ingredients for registering data into Datatrail¶
- File
  - file name
  - size in bytes
  - md5sum
  - file path
- Dataset
  - `name` (event number in this case)
  - `belongs_to` (a larger dataset or a dataset of datasets, e.g. classified FRBs or RFI)
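As a sketch, the ingredients above could be modeled as simple records; the field and class names here are illustrative, not Datatrail's actual schema or API.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FileRecord:
    """One raw-data file: the per-file ingredients listed above."""
    name: str
    size_bytes: int
    md5sum: str
    path: str


@dataclass
class Dataset:
    """One dataset: the event, its parent dataset, and its files."""
    name: str        # event number, e.g. "238405171" (hypothetical)
    belongs_to: str  # larger dataset, e.g. classified FRBs or RFI
    files: List[FileRecord]
```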
Intensity data¶
- Callbacks issued in realtime.
- Data is available on the archiver directly in a few minutes. This provides us the file name, size in bytes, and file path.
- md5sums are available the next day because of the cronjob that computes them nightly.
- Tsar classification of the event can take > 1 day. This is required to put the event in the larger dataset (`belongs_to`) for the appropriate replication and deletion policy. This ingredient has the longest lead time.
Baseband data¶
- Callbacks issued in realtime.
- Data is available on the archiver ~1 hour after the callback due to transfer to the staging node and then conversion to HDF5. This provides us the file name, size in bytes, and file path.
- md5sums are available the next day because of the cronjob that computes them nightly.
- Tsar classification of the event can take > 1 day. This is required to put the event in the correct dataset for the appropriate replication and deletion policy. This ingredient has the longest lead time.
Do we need the raw data transferred immediately to CANFAR (MINOC or ARC)?¶
No.
- For intensity data, it does not matter at all. We can register the data after 1-2 days (even a week), when we are almost guaranteed to have tsar classifications, and Datatrail will then automatically replicate the data once it sees the updated policy.
- For raw baseband data, it may not matter either, unless we need the full raw data for immediate cross-correlation (which is unlikely: confirm with the outrigger team; I think they only need single-beam baseband, not raw). We can register the data after 1-2 days (even a week), when we are almost guaranteed to have tsar classifications, and Datatrail will then automatically replicate the data once it sees the updated policy.
Operational mode¶
Since we do not need raw data immediately at CANFAR, we can wait for tsar classifications and other ingredients to be available before registering it into Datatrail.
The proposal is to run this system (Event Registrar) in a mode where we register data that was captured 2 days ago (configurable); this should provide ample time for all events from that day to have tsar classifications. This way, we don't have to manage initial data registration separately from tsar classifications and unnecessarily increase the complexity.
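Under this lag-based mode, the set of days to register on any given run is simply every day after the last completed day, up to the configured lag. A minimal sketch of that window computation (function name and signature are assumptions, not part of the proposal):

```python
from datetime import date, timedelta
from typing import List, Optional


def registration_window(last_completed: date, lag_days: int = 2,
                        today: Optional[date] = None) -> List[date]:
    """Days after `last_completed`, up to and including `lag_days` before today."""
    today = today or date.today()
    end = today - timedelta(days=lag_days)
    span = (end - last_completed).days
    return [last_completed + timedelta(days=i) for i in range(1, span + 1)]
```

For example, with a last completed day of Jan 1 and a 2-day lag, a run on Jan 6 would register Jan 2 through Jan 4.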
Source of truth¶
The system that tracks callbacks is the L4 database. However, we have seen multiple cases where the request is registered in the database but the data is not actually written out to the archivers (due to missing NFS mounts on L1 or some other system issue). This system does not care why a callback didn't happen. That's the job of the realtime pipeline. This system only needs to register what was successfully captured.
Therefore, we can treat the archivers as the source of truth. Any data that resides in the raw data paths of the events on the archivers at the time of registration will be used. We will not make separate queries to other databases for validation. This should be safe against users accidentally editing data because the raw data paths are mounted as read-only on all systems except the L1 nodes that write to them.
Design Details¶
Flow of Action¶
This system will run persistently on the CHIME/FRB analysis cluster and perform the following set of actions periodically (once a day; again, configurable):
- Query the last day that was successfully completed (see Design Details for more info) from the tracking database.
- Query unregistered events from past dates from the tracking database.
- Compile a list of dates from the last successful day to the date two days ago. These are the days that we are going to register.
- Parse through the unregistered event list from step 2 and try to re-register those events. It's possible that the tsar classifications are now available or that the glitch was resolved. Delete the now-successful events from the tracking database.
- For each date,
  - walk through the directory for that date on the archiver and compile the list of event numbers that need to be registered.
  - collect ingredients as you traverse the directory structure: md5sums, file size, name, and path.
- For each event number,
  - Check if this event is already registered in Datatrail. Skip the next steps if it is.
  - Query the L4 database to identify if the event belongs to a dataset that is not sent to tsars for classification (e.g. `zooniverse` or `Pulsar.B0329+54`), and use the dataset name stored in the L4 database for this purpose.
  - If the event should have been sent to tsars for classification, query `frb-master` for `tsar classifications`.
  - Use the `tsar classification` to attach the event to an appropriate larger dataset (e.g. `classified frbs` or `classified rfi`).
  - Once all ingredients have been gathered, register the data into Datatrail. Then update the tracking database with the new value for `last_date_completed`.
  - If any of the above steps fail, log the event number to a file and add an entry directly to the tracking database.
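The per-day failure handling in the steps above can be sketched as follows. `register_event` and `log_failure` are hypothetical stand-ins for the Datatrail registration and the tracking-database insert; the actual interfaces are not specified here.

```python
def register_day(day, events, register_event, log_failure):
    """Attempt to register every event captured on `day`.

    Returns True only if every event registered successfully, so the caller
    knows whether it is safe to advance last_date_completed past this day.
    """
    all_ok = True
    for event in events:
        try:
            register_event(event)
        except Exception as exc:
            # Failed events are recorded for a later re-registration attempt.
            log_failure(event, str(exc))
            all_ok = False
    return all_ok
```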
Tracking Database¶
We can track event registration in MongoDB, which is readily accessible, since the registration system is not really something that satisfies the schema of Datatrail.

Database name: `datatrail_event_registration`
Collection name: unregistered_events
{
    "event_number": int,
    "data_type": str,              # intensity or baseband
    "date_captured": str,          # YYYY-mm-dd
    "action_picker_dataset": str,  # e.g. realtime-pipeline
    "reason": str
}
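A small helper that builds a document matching this schema might look like the following; the function name is an assumption, and the pymongo call mentioned in the docstring is one way such a document could be inserted.

```python
def unregistered_event_doc(event_number, data_type, date_captured,
                           action_picker_dataset, reason):
    """Build a document for the unregistered_events collection.

    With pymongo, this could be inserted via, e.g.:
        client["datatrail_event_registration"]["unregistered_events"].insert_one(doc)
    """
    assert data_type in ("intensity", "baseband")
    return {
        "event_number": int(event_number),
        "data_type": data_type,
        "date_captured": date_captured,  # "YYYY-mm-dd"
        "action_picker_dataset": action_picker_dataset,
        "reason": reason,
    }
```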
To track the last successful date, we can store:
Collection name: last_completed_date
{
"last_completed_date": "YYYY-mm-dd"
}
The above payload is updated after registration has been attempted for all events from that day.
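Since this collection holds a single document, a sketch of the update is to validate the date string and replace the document wholesale (e.g. with pymongo's `replace_one({}, payload, upsert=True)`); the helper name here is hypothetical.

```python
from datetime import datetime


def last_completed_update(day: str) -> dict:
    """Validate `day` as YYYY-mm-dd and build the last_completed_date payload."""
    datetime.strptime(day, "%Y-%m-%d")  # raises ValueError on a malformed date
    return {"last_completed_date": day}
```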
Implementation Details¶
- Query the last successful day directly from MongoDB.
- Query the list of unregistered events directly from MongoDB.
- Delete the now-successful events directly from MongoDB.
- Check if an event is already registered in Datatrail by making a `query` to the `Datatrail server`.
- Query the L4 database by making an HTTP request to the L4 server at https://frb.chimenet.ca/chimefrb.
- Query `frb-master` directly via MongoDB's `verification` collection.
- Register into Datatrail via a `commit` to the `Datatrail server`.
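The L4 query step could be sketched as a plain HTTP GET. Only the base URL below comes from this document; the `/v1/events` route is a placeholder for whatever route the L4 server actually exposes.

```python
import json
import urllib.request

L4_BASE = "https://frb.chimenet.ca/chimefrb"  # base URL from this document


def l4_event_url(event_number: int, route: str = "/v1/events") -> str:
    """Build the L4 query URL for an event. The route is hypothetical."""
    return f"{L4_BASE}{route}/{event_number}"


def query_l4(event_number: int) -> dict:
    """Fetch the event's record from the L4 server and decode the JSON body."""
    with urllib.request.urlopen(l4_event_url(event_number), timeout=10) as resp:
        return json.loads(resp.read())
```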