Object Storage Sources
Validio supports Google Cloud Storage (GCS) and Amazon S3 for object storage and reads data from CSV, Parquet, and JSON files.
General considerations
When setting up an Object storage Source, you need to specify what data to include. This is defined by:
- File location - Point Validio to a primary folder for data retrieval. Data from subfolders will also be included.
- File pattern - Optionally, use a regular expression to filter files by filename.
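To make the file pattern concrete, here is a minimal sketch of filtering object keys by filename with a regular expression. The pattern, the sample keys, and the `matching_files` helper are hypothetical. Whether Validio matches the pattern against the whole filename or only part of it, and which regex dialect it uses, is not stated here, so the full-match behavior below is an assumption.

```python
import re

# Hypothetical pattern: daily CSV exports such as "orders_2024-01-31.csv".
FILE_PATTERN = re.compile(r"orders_\d{4}-\d{2}-\d{2}\.csv")


def matching_files(object_keys, pattern=FILE_PATTERN):
    """Return the keys whose filename (last path segment) fully matches the pattern."""
    selected = []
    for key in object_keys:
        # Subfolders are included, so compare only the final path segment.
        filename = key.rsplit("/", 1)[-1]
        if pattern.fullmatch(filename):
            selected.append(key)
    return selected


keys = [
    "exports/orders_2024-01-30.csv",
    "exports/archive/orders_2024-01-31.csv",  # subfolder, still included
    "exports/orders_2024-01-31.csv.tmp",      # rejected: trailing ".tmp"
    "exports/customers_2024-01-31.csv",       # rejected: different prefix
]
print(matching_files(keys))
```

Testing a pattern this way against a sample listing of your bucket before configuring the Source can catch patterns that are too loose or too strict.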
Schema consistency
- Note that all files included in the Source need to share the same schema. You will get notified if Validio discovers any schema inconsistencies.
- Missing fields are considered empty.
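As an illustration of the rules above, the sketch below pads missing fields and flags fields that are not part of the shared schema. The field names and both helpers are hypothetical, and representing "empty" as `None` is an assumption, not a description of Validio's internals.

```python
# Hypothetical shared schema for all files in the Source.
EXPECTED_FIELDS = ["order_id", "amount", "currency"]


def normalize(record, fields=EXPECTED_FIELDS):
    """Return the record restricted to the expected fields; missing ones become None."""
    return {field: record.get(field) for field in fields}


def unexpected_fields(record, fields=EXPECTED_FIELDS):
    """Return field names in the record that are not part of the expected schema."""
    return sorted(set(record) - set(fields))


print(normalize({"order_id": 1, "amount": 9.5}))
print(unexpected_fields({"order_id": 1, "discount": 0.1}))
```

A check like `unexpected_fields` run on a sample of files before ingestion is one way to spot schema drift early.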
Detecting new files in an Object storage
Validio detects new files based on their filename. A file will only be read if its filename is lexicographically greater than that of the last file read in the previous poll.
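The following sketch illustrates the lexicographic rule and a common naming pitfall. The `new_files` helper and the filenames are hypothetical, and Python's string comparison (by Unicode code point) is used as a stand-in for whatever comparison Validio applies; for plain ASCII filenames the two should agree.

```python
def new_files(listed_names, last_read_name):
    """Return the names that sort strictly after the last file read, in ascending order."""
    if last_read_name is None:
        # Nothing has been read yet, so every listed file is new.
        return sorted(listed_names)
    return sorted(name for name in listed_names if name > last_read_name)


# Pitfall: without zero-padding, "data_10.csv" sorts BEFORE "data_9.csv"
# because the character "1" is less than "9", so it would never be picked up.
unpadded = ["data_8.csv", "data_9.csv", "data_10.csv"]
print(new_files(unpadded, "data_9.csv"))  # prints []

# Zero-padded counters (or ISO 8601 dates) sort in the intended order.
padded = ["data_008.csv", "data_009.csv", "data_010.csv"]
print(new_files(padded, "data_009.csv"))  # prints ['data_010.csv']
```

In practice this means filenames should embed a sortable component, such as a zero-padded sequence number or a timestamp like `2024-01-31T08-15-30`, so that newer files always compare greater than older ones.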
Backfilling is limited to 250 files
When backfilling data from Object storage, Validio will only read data from the 250 lexicographically greatest files. Subsequent polls will then read all later files, regardless of how many there are.
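To estimate which files a backfill would cover, the selection can be sketched as follows. The `backfill_selection` helper and the filenames are hypothetical; only the limit of 250 comes from the text above.

```python
BACKFILL_FILE_LIMIT = 250  # limit stated in this section


def backfill_selection(listed_names, limit=BACKFILL_FILE_LIMIT):
    """Return the `limit` lexicographically greatest names, in ascending order."""
    if limit <= 0:
        return []
    return sorted(listed_names)[-limit:]


# Hypothetical bucket with 300 zero-padded files.
names = [f"events_{i:04d}.json" for i in range(300)]
selected = backfill_selection(names)
print(len(selected), selected[0], selected[-1])
# prints: 250 events_0050.json events_0299.json
```

In this example the 50 files that sort lowest (`events_0000.json` to `events_0049.json`) fall outside the backfill.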
validio_file_created_at
For Object storage Sources, Validio adds an additional field, called validio_file_created_at, to the schema. This field contains the timestamp for when the file was created, or updated, in the Object storage. The timestamp is in RFC3339 UTC "Zulu" format.
validio_file_created_at is useful, for example, as a Cursor field or when creating Windows.
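If you consume this field downstream, a value in RFC3339 UTC "Zulu" format can be parsed as sketched below. The sample record, the timestamp value, and both helpers are hypothetical, and the hourly bucketing only illustrates the idea of grouping by time; it is not how Validio implements Windows.

```python
from datetime import datetime, timezone


def parse_file_created_at(value):
    """Parse an RFC3339 'Zulu' timestamp into a timezone-aware UTC datetime."""
    # datetime.fromisoformat accepts a trailing "Z" only from Python 3.11,
    # so rewrite it as an explicit UTC offset for older versions.
    if value.endswith("Z"):
        value = value[:-1] + "+00:00"
    return datetime.fromisoformat(value).astimezone(timezone.utc)


def hourly_window_start(timestamp):
    """Truncate a datetime to the start of its hour."""
    return timestamp.replace(minute=0, second=0, microsecond=0)


# Hypothetical record with the extra field added to the schema.
record = {"order_id": 17, "validio_file_created_at": "2024-01-31T08:15:30Z"}
created = parse_file_created_at(record["validio_file_created_at"])
print(hourly_window_start(created).isoformat())  # prints 2024-01-31T08:00:00+00:00
```

Note that on Python versions before 3.11, `fromisoformat` handles fractional seconds only with exactly 3 or 6 digits, so timestamps with other precisions may need a dedicated RFC3339 parser.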
Cost and performance considerations
Costs associated with reading data from Object storage:
- If the traffic crosses cloud regions, there are potential network costs between Validio and the Object storage.
- The costs for listing objects are negligible.