uploads-indexer

Capture Uploads Fixity

This actor captures the fixity details for a file given a qualified URI and the internally-defined means to resolve and inspect it. In the current implementation, it accepts only agave-canonical URIs.

It interacts with the SD2E Data Catalog, a MongoDB database, using via methods defined in the datacatalog Python package.

Example Fixity Record

This is an example of a MongoDB record for a specific file.

{
    "_id" : ObjectId("5b8ac5c78f0954000118c3de"),
    "filename" : "transcriptic/201808/yeast_gates/r1bsmgdayg2yq_r1bsu7tb7bsuk/6388_0.00015_2.fcs",
    "uuid" : BinData(3, "V4m97g6vUnO6JKkwvKM2hA=="),
    "properties" : {
        "lab" : "Transcriptic",
        "checksum" : "29370e2ac6ec246637a813e312ff7e6be4fb9285",
        "modified_date" : ISODate("2018-09-10T01:50:20.527+0000"),
        "revision" : 1,
        "size" : 8905548,
        "original_filename" : "transcriptic/201808/yeast_gates/r1bsmgdayg2yq_r1bsu7tb7bsuk/6388_0.00015_2.fcs",
        "file_modified" : ISODate("2018-08-28T01:15:49.000+0000"),
        "file_created" : ISODate("2018-08-28T01:15:49.000+0000"),
        "created_date" : ISODate("2018-09-01T17:00:55.109+0000"),
        "file_type" : "text/plaintext"
    }
}

Explanation: The filename is relative to /uploads/ on the data-sd2e-community resource. Each indexed file has a unique identifier uuid which is a hash of filename.

The creation and update times for the index are recorded as (created_date and modified_date), while the apparent* creation and update dates for the physical file are stored as (file_created and file_updated). The size (in bytes) and checksum are computed from the physical file, and lab is inferred from the filename. The value for file_type is determined using the Python file_types package and is always a MIME type. Finally, revision indicates how many times the index has been refreshed.