Cron Datasets

Cron dataset pipelines define batches using cron expressions (scheduled batching).

Parameter name and description Mandatory Parameter value
1. Name
   Parameter value: Arbitrary string

2. Source
   Parameter value: Configured Source connector

3. Reference source
   Description: Second source to connect when reference monitors are used
   Parameter value: Configured Source connector

4. Reference sliding window
   Description: Number of batches in the reference source used when computing a metric. E.g. with an input of 5, a mean difference monitor takes the difference between the mean of the latest batch in the target source and the mean of the 5 latest batches in the reference source
   Parameter value: Integer

5. Notification rule
   Description: Note: Without a notification rule, alerts are visible in the platform UI but are not sent as notifications to a notification channel, e.g. Slack
   Parameter value: Configured Notification rules

6. Cron expression
   Description: Schedules batching based on the cron expression input
   Parameter value: 5-field cron expression

7. Data time feature
   Description:
     • Empty: The order of the records/datapoints is determined by the time the data is ingested into Validio
     • Filled in: Mechanism to batch records, and the time used to show metrics and alerts. NOTE: To correctly batch and display historical data, i.e. backfilling, a data time feature is needed
   Parameter value: Feature with a timestamp format in Source

8. Evaluation delay
   Description: Grace period for the cron trigger to await late data (see details below)
   Parameter value: Positive integer

9. Unit
   Description: Unit of the evaluation delay parameter
   Parameter value:
     • Second
     • Minute
     • Hour
     • Day
     • Week
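
To make the reference sliding window concrete, here is a minimal Python sketch of the mean difference example with a window of 5. The function name `mean_difference`, the sample values, the pooling of all records in the window into a single mean, and the sign convention (target minus reference) are assumptions made for illustration, not Validio's implementation.

```python
from statistics import mean


def mean_difference(target_batches, reference_batches, window):
    """Mean of the latest target batch minus the mean of the latest
    `window` reference batches (hypothetical helper, assumed semantics)."""
    latest_target = target_batches[-1]
    # Assumption: records of the latest `window` reference batches are pooled.
    reference_records = [
        value for batch in reference_batches[-window:] for value in batch
    ]
    return mean(latest_target) - mean(reference_records)


# Made-up sample data: each inner list is one batch, oldest first.
target = [[10.0, 12.0], [11.0, 13.0]]
reference = [
    [1.0],          # older than the window of 5, so it is ignored
    [8.0, 10.0],
    [9.0, 11.0],
    [10.0, 10.0],
    [11.0, 9.0],
    [10.0, 12.0],
]

diff = mean_difference(target, reference, window=5)
print(diff)  # 2.0
```

With these values the latest target batch has a mean of 12.0 and the pooled reference window has a mean of 10.0, so the sketch reports a difference of 2.0.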

New to cron jobs?

A cron expression lets you define batches by scheduling jobs on a calendar-time basis, e.g.:

"Every Monday at 12:00 am"

"Day 1 of every month at 12:00 am"

There are multiple references online on how to create cron expressions.
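
As a rough illustration, the two schedules above can be written as standard 5-field cron expressions (minute, hour, day of month, month, day of week). The Python sketch below only splits an expression into named fields. The helper `describe_cron` is hypothetical and does not validate the field values.

```python
CRON_FIELDS = ("minute", "hour", "day_of_month", "month", "day_of_week")


def describe_cron(expression):
    """Split a 5-field cron expression into named fields (hypothetical helper)."""
    parts = expression.split()
    if len(parts) != len(CRON_FIELDS):
        raise ValueError(f"expected 5 fields, got {len(parts)}")
    return dict(zip(CRON_FIELDS, parts))


# "Every Monday at 12:00 am": minute 0, hour 0, any day of month, any month, Monday (1)
every_monday = describe_cron("0 0 * * 1")

# "Day 1 of every month at 12:00 am": minute 0, hour 0, day 1, any month, any weekday
first_of_month = describe_cron("0 0 1 * *")

print(every_monday)
print(first_of_month)
```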

Evaluation delay

Data jobs that are expected to run consistently can sometimes be delayed for various reasons. Data that should have been part of a certain batch might then arrive in the source after the cron pipeline has triggered the batch calculation. The evaluation delay adds a grace period to wait for delayed data, so that late data is calculated in the right batch.

Example: Consider a cron dataset pipeline that is scheduled to trigger at 00:00 every day. In our case, data with a certain timestamp typically lands in our source on the same day as the date in its timestamp. We are interested in doing batch analysis on data with the same date in their timestamps.

On rare occasions, data might arrive late. To ensure that these datapoints are batched correctly, we can introduce an evaluation delay:

Example: If data with 1st of January timestamps keeps arriving on 2nd of January, the cron dataset pipeline waits up to the specified delay (at most) before triggering a batch calculation.
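
The arithmetic behind the grace period can be sketched in Python. The snippet assumes that the evaluation delay is simply added to the scheduled cron trigger time. The function `evaluation_deadline`, the unit mapping, the two-hour delay, and the year 2024 are illustrative choices, not part of Validio's API.

```python
from datetime import datetime, timedelta

# Map the documented units to timedelta keyword arguments.
UNIT_TO_KWARG = {
    "Second": "seconds",
    "Minute": "minutes",
    "Hour": "hours",
    "Day": "days",
    "Week": "weeks",
}


def evaluation_deadline(cron_trigger, delay, unit):
    """Latest time the pipeline waits for late data (hypothetical helper)."""
    if delay <= 0:
        raise ValueError("evaluation delay must be a positive integer")
    return cron_trigger + timedelta(**{UNIT_TO_KWARG[unit]: delay})


# Cron trigger at 00:00 on 2nd of January, closing the batch for 1st of January.
trigger = datetime(2024, 1, 2, 0, 0)
deadline = evaluation_deadline(trigger, delay=2, unit="Hour")
print(deadline)  # 2024-01-02 02:00:00
```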

Evaluation delay batch triggers

When an evaluation delay is specified, a batch is triggered in one of two scenarios:

  • Forced trigger: The evaluation delay (grace period) is reached (timed out). A batch calculation is triggered regardless of whether additional 'late' data might still arrive.
  • Premature trigger: If data from the next batch arrives within the evaluation delay window, the cron pipeline considers the previous batch fully ingested.
    • E.g. if we set the evaluation delay to two hours in the example above (evaluation delay timeout: 2nd of January 02:00) and 2nd of January data arrives at 01:30 on 2nd of January, a batch calculation is triggered.
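
The two scenarios can be summarised as a small decision function. This is a sketch under stated assumptions: `batch_trigger`, its arguments, and the return labels are hypothetical names chosen for illustration, and the actual pipeline logic may differ.

```python
from datetime import datetime, timedelta


def batch_trigger(now, cron_trigger, delay, next_batch_data_seen):
    """Classify the trigger state at time `now` (hypothetical helper).

    Returns "premature", "forced", or None while the pipeline is still waiting.
    """
    deadline = cron_trigger + delay
    if now < cron_trigger:
        return None  # the cron schedule has not fired yet
    if now >= deadline:
        return "forced"  # grace period timed out
    if next_batch_data_seen:
        return "premature"  # next-batch data arrived inside the window
    return None  # still inside the grace period, keep waiting


cron_trigger = datetime(2024, 1, 2, 0, 0)  # year chosen arbitrarily
delay = timedelta(hours=2)

# 2nd of January data arrives at 01:30, before the 02:00 timeout.
at_0130 = batch_trigger(datetime(2024, 1, 2, 1, 30), cron_trigger, delay, True)
# The 02:00 timeout is reached without any next-batch data.
at_0200 = batch_trigger(datetime(2024, 1, 2, 2, 0), cron_trigger, delay, False)
# At 01:00 no next-batch data has been seen, so the pipeline keeps waiting.
at_0100 = batch_trigger(datetime(2024, 1, 2, 1, 0), cron_trigger, delay, False)

print(at_0130, at_0200, at_0100)  # premature forced None
```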