Object Storage Sources

Validio supports Google Cloud Storage (GCS) and Amazon S3 as object storage sources, and reads data from CSV, Parquet, and JSON files.

General considerations

When setting up an Object storage Source, you need to specify what data to include. This is defined by:

  1. File location - Point Validio to a primary folder for data retrieval. Data from subfolders will also be included.
  2. File pattern - Optionally, use a regular expression to filter files based on their filenames.
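
As a rough illustration of how the two settings combine, the Python sketch below filters a list of object keys by folder and by filename pattern. The helper name select_files, the sample keys, and the choice to apply the pattern to the filename part with re.search are assumptions made for illustration, not Validio's actual implementation.

```python
import re

def select_files(keys, prefix, pattern=None):
    """Return keys under `prefix` (subfolders included) whose filename matches `pattern`."""
    regex = re.compile(pattern) if pattern else None
    selected = []
    for key in keys:
        # File location: keep everything below the primary folder.
        if not key.startswith(prefix):
            continue
        # File pattern (assumed to apply to the filename, not the full key).
        filename = key.rsplit("/", 1)[-1]
        if regex is None or regex.search(filename):
            selected.append(key)
    return selected

# Hypothetical object keys.
keys = [
    "exports/2024/orders_01.csv",
    "exports/2024/archive/orders_00.csv",
    "exports/2024/readme.txt",
    "other/orders_02.csv",
]
matched = select_files(keys, "exports/2024/", r"^orders_\d+\.csv$")
```

Here matched contains the two orders files under exports/2024/, including the one in the archive subfolder. Without a pattern, all three files under the folder would be selected.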

🚧

Schema consistency

Detecting new files in an Object storage

Validio detects new files based on their filename. A file is read only if its filename is lexicographically greater than that of the last file read in the previous poll.
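
The practical consequence for file naming can be sketched in a few lines of Python. The function files_to_read and the sample filenames are hypothetical, and Python's default string comparison (by code point) is assumed to stand in for whatever ordering Validio applies internally.

```python
def files_to_read(listed, last_read=None):
    """Return filenames that sort after `last_read`, in lexicographic order."""
    if last_read is None:
        return sorted(listed)
    return sorted(name for name in listed if name > last_read)

# Zero-padded (or date-stamped) names sort in the order they were written.
padded = files_to_read(["data_009.csv", "data_010.csv"], last_read="data_009.csv")

# Unpadded numbers do not: "data_10.csv" sorts before "data_9.csv",
# so under this rule it would not be picked up.
unpadded = files_to_read(["data_9.csv", "data_10.csv"], last_read="data_9.csv")
```

In this sketch padded contains data_010.csv while unpadded is empty, which is why names with a sortable prefix, such as a zero-padded counter or an ISO date, are the safer choice.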

🚧

Backfilling is limited to 250 files

When backfilling data from Object storage, Validio reads only the 250 lexicographically greatest files. Subsequent polls then read all later files, regardless of how many there are.
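
To estimate which files a backfill would cover, the rule can be approximated as below. This is a sketch only: backfill_selection and the generated filenames are hypothetical, and plain Python string ordering is assumed.

```python
BACKFILL_LIMIT = 250  # limit stated in this section

def backfill_selection(listed, limit=BACKFILL_LIMIT):
    """Return the `limit` lexicographically greatest filenames, in ascending order."""
    if limit <= 0:
        return []
    return sorted(listed)[-limit:]

# 300 hypothetical files: part-0000.parquet ... part-0299.parquet
listed = [f"part-{i:04d}.parquet" for i in range(300)]
selection = backfill_selection(listed)
```

With 300 files, selection holds the last 250 names, starting at part-0050.parquet. The 50 earliest files fall outside the backfill.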

validio_file_created_at

For Object storage sources, Validio adds a field called validio_file_created_at to the schema. This field contains the timestamp at which the file was created or last updated in the Object storage. The timestamp is in RFC3339 UTC "Zulu" format.
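
If you process this field downstream, a "Zulu" timestamp can be parsed in Python as sketched here. The sample value is made up, and the helper parse_file_created_at is hypothetical. The trailing Z is rewritten to +00:00 because datetime.fromisoformat only accepts Z directly from Python 3.11 onward.

```python
from datetime import datetime, timezone

def parse_file_created_at(value):
    """Parse an RFC3339 UTC timestamp such as '2024-03-01T12:30:00Z'."""
    if value.endswith("Z"):
        value = value[:-1] + "+00:00"
    return datetime.fromisoformat(value)

# Made-up example value for validio_file_created_at.
ts = parse_file_created_at("2024-03-01T12:30:00Z")
```

The result is a timezone-aware datetime in UTC, which can be compared safely with other aware timestamps.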

👍

validio_file_created_at is useful, for example, as a Cursor field or when creating Windows.

Cost and performance considerations

Costs associated with reading data from Object storage:

  • If the traffic crosses cloud regions, there are potential network costs between Validio and the Object storage.
  • The costs for listing objects are negligible.