Event Listeners in Clowder

In Clowder v1, extractors were initially intended to extract metadata automatically from incoming files based on MIME type, and the extracted metadata could then be indexed for search and other purposes. As Clowder grew, different projects began to expand the extractor framework to do much more than extract metadata.

In Clowder v2, we are improving support for these types of activities. While existing extractors will largely function the same way in Clowder - connect to a RabbitMQ message bus and process event messages sent by Clowder to the corresponding queue - the way in which extractors can be assigned and triggered has been expanded. Additionally, extractors are now just one kind of behavior within a larger framework called Event Listeners, and other kinds of listeners can take advantage of the same assignment and triggering features.

Types of Event Listeners

An event listener is a process running apart from Clowder, typically in a container or VM, that listens to a RabbitMQ feed for messages from Clowder. Each message indicates which resources the listener needs to download and includes any parameters provided by the user who triggered it. The listener also receives a temporary API key granting it the same permissions as that user.

What the event listener does with the resources, and with any results, varies from listener to listener. For example, a listener may:

  • examine files or datasets and generate metadata JSON that is attached to the resource;
  • create new files or datasets;
  • populate databases and web services;
  • trigger downstream workflows that depend on the status of other extractors in an orchestration;
  • submit jobs to external HPC resources.
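The sketch below illustrates this message-driven loop in Python, assuming a pika-based consumer. The queue name, message fields (file_id, api_key), Clowder host, and API paths are illustrative assumptions, not the actual Clowder v2 contract.

```python
import json

import pika
import requests

CLOWDER_URL = "https://clowder.example.org"  # assumption: URL of the Clowder instance
QUEUE_NAME = "my-listener"                   # assumption: one queue per listener


def on_message(channel, method, properties, body):
    """Handle one event message sent by Clowder."""
    msg = json.loads(body)
    # Temporary API key scoped to the user who triggered the submission.
    headers = {"X-API-Key": msg["api_key"]}

    # Download the resource named in the message (endpoint path is an assumption).
    file_id = msg["file_id"]
    resp = requests.get(f"{CLOWDER_URL}/api/files/{file_id}", headers=headers)
    resp.raise_for_status()

    # ... examine the bytes, run an analysis, etc. ...
    metadata = {"size_bytes": len(resp.content)}

    # Attach the result as metadata (endpoint path is an assumption).
    requests.post(f"{CLOWDER_URL}/api/files/{file_id}/metadata",
                  headers=headers, json=metadata)

    channel.basic_ack(delivery_tag=method.delivery_tag)


connection = pika.BlockingConnection(pika.ConnectionParameters(host="rabbitmq.example.org"))
channel = connection.channel()
channel.queue_declare(queue=QUEUE_NAME, durable=True)
channel.basic_consume(queue=QUEUE_NAME, on_message_callback=on_message)
channel.start_consuming()
```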

The Event Listeners dashboard provides a list of listeners that Clowder has received a heartbeat from:

  • Name
  • Version
  • Description
  • Supported parameters
  • First heartbeat
  • Most recent heartbeat
  • Number of feeds associated
  • Activate/deactivate
  • Unique identifiers seen
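As a rough illustration, the extractor_info-style document carried by a heartbeat (and from which these dashboard fields are populated) might look like the following. Only name, version, and description map directly to the fields above; the remaining keys follow the Clowder v1 extractor convention and are assumptions for v2.

```python
# Illustrative extractor_info-style registration document; field names other than
# name/version/description are assumptions based on the v1 convention.
extractor_info = {
    "name": "ncsa.image.preview",
    "version": "2.0",
    "description": "Generates preview thumbnails for incoming images",
    "parameters": {"width": 640, "height": 480},  # "supported parameters" shown on the dashboard
    "process": {"file": ["image/*"]},             # MIME types the legacy extractor handles
}
```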

When a heartbeat is received from a new listener, Clowder will attempt to match it to any existing feeds according to the extractor_info JSON.

Legacy extractors

Defining Listener Feeds

A feed is a named set of criteria to which listeners can be assigned. When new files arrive that meet the criteria, any assigned listeners are triggered.

The Feeds dashboard helps users manage this:

  • Any search query can be saved as a feed from the Search page
  • Possible criteria include:
    • Filename (regular expressions supported)
    • MIME type
    • File size or count
    • Parent collection / dataset / folder
    • User or user group
    • Metadata fields
  • Feeds without criteria will be notified of everything.

  • Feeds can be specified for Files or Datasets
  • Dataset feeds allow multiple file criteria to be listed
  • When a dataset contains files that match every criterion, the assigned listeners are notified (see the sketch after this list)
  • Toggle whether every incoming file/dataset is processed automatically or only on manual submission
  • Optional approval step when criteria are met: users get a report on login and can submit by hand
  • Choice of which parameters to include with jobs from this feed
  • Possibly a cron job to run the criteria checks on newly created resources
  • Order criteria so they can be evaluated rapidly

  • On request, users can calculate feed statistics:
    • Number of files meeting the criteria
    • Per feed, number of files that have been successfully processed
    • Per feed, number of files that have been unsuccessfully processed
    • Per feed, number of files that have not been submitted
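To make the criteria above concrete, here is a minimal sketch of how a single file criterion and a dataset feed with multiple criteria might be represented and evaluated. The field names (filename_regex, mime_type, max_bytes) are hypothetical and not Clowder's actual schema.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class FileCriterion:
    """Hypothetical representation of one file criterion in a feed."""
    filename_regex: Optional[str] = None  # regular expressions supported, per the list above
    mime_type: Optional[str] = None
    max_bytes: Optional[int] = None


def matches(criterion: FileCriterion, filename: str, mime_type: str, size: int) -> bool:
    """Return True if a file satisfies every condition set on the criterion."""
    if criterion.filename_regex and not re.search(criterion.filename_regex, filename):
        return False
    if criterion.mime_type and mime_type != criterion.mime_type:
        return False
    if criterion.max_bytes is not None and size > criterion.max_bytes:
        return False
    # A criterion with no conditions matches everything, like a feed without criteria.
    return True


# A dataset feed listing multiple file criteria: assigned listeners are notified
# only when the dataset contains at least one file matching each criterion.
dataset_criteria = [
    FileCriterion(filename_regex=r"\.tiff?$", mime_type="image/tiff"),
    FileCriterion(filename_regex=r"metadata\.json$"),
]
```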

Submission Status for Event Listeners

Once a resource is sent to a listener (via a feed, or directly via the Submit to Listener button), a submission id is generated and an entry is created in the database in a submission collection specific to each listener.

Each Submission includes:

  • id
  • resource type and id
  • listener id
  • submission origin (API, feed, GUI, etc.)
  • any parameters to be sent to the listener
  • submission status (PENDING APPROVAL, SUBMITTED, SUCCESS, FAILURE)
  • submitted, started, finished
  • most recent message?
  • events list

Each Event in the events list includes:

  • timestamp
  • flag updating submission status to SUCCESS or FAILURE (optional)
  • progress (integer 0-100 for progress bar visualization; optional)
  • optional additional message
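As a rough sketch of these two records, the Python models below mirror the fields listed above; the class and field names are illustrative assumptions, not Clowder's actual schema.

```python
from dataclasses import dataclass, field
from datetime import datetime
from enum import Enum
from typing import Optional


class SubmissionStatus(Enum):
    PENDING_APPROVAL = "PENDING APPROVAL"
    SUBMITTED = "SUBMITTED"
    SUCCESS = "SUCCESS"
    FAILURE = "FAILURE"


@dataclass
class Event:
    timestamp: datetime
    status_update: Optional[SubmissionStatus] = None  # optional flag moving the submission to SUCCESS/FAILURE
    progress: Optional[int] = None                    # 0-100, drives a progress bar
    message: Optional[str] = None                     # optional additional message


@dataclass
class Submission:
    id: str
    resource_type: str                 # "file" or "dataset"
    resource_id: str
    listener_id: str
    origin: str                        # API, feed, GUI, etc.
    parameters: dict = field(default_factory=dict)
    status: SubmissionStatus = SubmissionStatus.SUBMITTED
    submitted: Optional[datetime] = None
    started: Optional[datetime] = None
    finished: Optional[datetime] = None
    events: list = field(default_factory=list)        # list of Event records
```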

Any Submissions generated by a specific feed can be viewed on the Feeds dashboard. All submissions for a specific resource can be viewed on its page, regardless of origin.

Event Listeners and Feeds in API

The API can also be used to programmatically manage these features. Note that there is no endpoint for registering listeners (that is done via RabbitMQ heartbeats).

  • GET /api/listeners

    • List known listeners.
    • Parameters: id, name, ~status~, skip, limit
  • DELETE /api/listeners/:listener_id

    • Unsubscribe from all feeds, mark as inactive (not shown except to admins). The entry and event history are kept for record-keeping purposes.
  • POST /api/feeds

    • JSON body: {"query": "title:*.xlsx", "name":"Excel spreadsheets"}
    • Return feed_id
  • GET /api/feeds

    • List feeds.
    • Parameters: id, name, listener, skip, limit
  • DELETE /api/feeds/:feed_id

  • POST /api/feeds/:feed_id/subscribe/:listener_id

    • Associate a listener with a feed.
    • Parameters: type (file or dataset), auto (default True), backfill (default False)
  • POST /api/feeds/:feed_id/execute/:listener_id

    • Sends feed contents to the listener. By default, only contents that have not already been processed are sent, but a flag can force all feed contents (search results) to be submitted.
    • Parameters: force_all, skip, limit
    • Return count of resources sent to listener.
  • GET /api/feeds/:feed_id/counts/:listener_id

    • Calculate counts for feed contents that have been successfully and unsuccessfully processed by a given listener.
    • JSON: {"total": 123, "success": 119, "failure": 2, "submitted": 1, "not_submitted": 1, "pending_approval": 0}
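A minimal end-to-end sketch using these endpoints might look like the following; the base URL, API-key header, listener id, and the exact shape of the JSON responses are assumptions.

```python
import requests

BASE = "https://clowder.example.org/api"   # assumption: base URL of the Clowder instance
HEADERS = {"X-API-Key": "MY_KEY"}          # assumption: how the API key is passed
listener_id = "LISTENER_ID"                # taken from GET /api/listeners

# Save a search query as a feed (body format from the POST /api/feeds example above).
resp = requests.post(f"{BASE}/feeds", headers=HEADERS,
                     json={"query": "title:*.xlsx", "name": "Excel spreadsheets"})
feed_id = resp.json()  # returns feed_id (exact response shape is an assumption)

# Subscribe the listener to the feed; auto=True submits new matches automatically.
requests.post(f"{BASE}/feeds/{feed_id}/subscribe/{listener_id}",
              headers=HEADERS, params={"type": "file", "auto": True, "backfill": False})

# Send existing feed contents that have not been processed yet.
sent = requests.post(f"{BASE}/feeds/{feed_id}/execute/{listener_id}", headers=HEADERS)
print("resources sent:", sent.json())

# Check how processing is going.
counts = requests.get(f"{BASE}/feeds/{feed_id}/counts/{listener_id}", headers=HEADERS)
print(counts.json())  # e.g. {"total": 123, "success": 119, ...}
```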