Data Stream Sources

Validio supports integrating with the Data Streams that modern data teams work with today.

Consider the following when you read and validate records from a data stream:

  • In each stream, all messages must be consistent with the declared schema:
    1. Fields added after creating the connector are ignored.
    2. Missing fields are interpreted as empty fields, which affects any analytics involving those fields.
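
The two rules above can be pictured with a minimal Python sketch. It is an illustration only, not Validio's implementation: the field names, the `project_onto_schema` helper, and the use of `None` for an empty field are all assumptions made for the example.

```python
from typing import Any

# Hypothetical declared schema: the field names fixed when the source was created.
DECLARED_FIELDS = ["id", "values", "created"]


def project_onto_schema(message: dict[str, Any], declared_fields: list[str]) -> dict[str, Any]:
    """Keep only the declared fields; fill any missing field with None (empty)."""
    return {name: message.get(name) for name in declared_fields}


# "extra" is not in the declared schema, so it is dropped.
# "created" is missing from the message, so it comes through as None.
record = project_onto_schema({"id": 7, "values": "a", "extra": True}, DECLARED_FIELDS)
print(record)
```

In this sketch, a validator that counts empty values for `created` would see this record as one more empty value, which is the kind of downstream effect the second rule describes.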

Cost and Performance

Reading data from data streams has the following cost implications:

  • Each Source in Validio corresponds to one consumer of your data stream.
  • If the traffic crosses cloud regions, there are potential network costs between Validio and the data stream.

Schema Inference and Timestamps

Based on your data, Validio helps infer a schema for the source. Schema inference requires that the data stream has events when you connect the source. For example, a data stream can be empty because all events were deleted according to the retention period, and no new events have been published yet. If the data stream is empty, Validio will not have any data to infer the schema from. Schema inference works as follows:

  • Validio infers fields with the Timestamp data type in the schema. These fields can provide information about when an event or message is created or published in the data stream.
  • For Pub/Sub and Pub/Sub Lite, Validio also infers the validio_publish_time field in the schema. The validio_publish_time field contains the timestamp that Pub/Sub generates when a message is published to the stream. The timestamp is in RFC3339 UTC "Zulu" format.
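
If you consume such timestamps outside Validio, an RFC 3339 UTC "Zulu" value can be parsed as sketched below. The `parse_rfc3339_zulu` helper and the sample value are hypothetical, and the sketch assumes the fractional seconds have millisecond or microsecond precision.

```python
from datetime import datetime, timezone


def parse_rfc3339_zulu(value: str) -> datetime:
    """Parse an RFC 3339 UTC timestamp ending in 'Z' into a timezone-aware datetime."""
    if not value.endswith("Z"):
        raise ValueError(f"expected a UTC 'Zulu' timestamp, got {value!r}")
    # Older Python versions of fromisoformat() reject a trailing 'Z',
    # so replace it with the equivalent explicit UTC offset.
    return datetime.fromisoformat(value[:-1] + "+00:00")


# Made-up sample value, used only to exercise the helper.
publish_time = parse_rfc3339_zulu("2024-01-15T10:30:00.123Z")
print(publish_time.isoformat())
```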

Message Format and Schema

You must specify the message format when you configure the data stream source and, for formats that require one, the schema. The message format can be JSON, AVRO, or PROTOBUF.

  • For JSON messages, you do not need to provide a schema because Validio will automatically infer the schema.
  • For AVRO messages, you need to upload the Apache Avro schema. For more information, refer to Schema Declaration in the Apache Avro documentation.
  • For PROTOBUF messages, you need to upload the Protobuf schema. For more information, refer to the Language Guide in the Protocol Buffers documentation.
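
To give a feel for the JSON case, here is a rough sketch of what inferring a schema from sample messages could look like. It does not describe how Validio infers schemas: the `infer_type` and `infer_schema` helpers, the type names, and the first-seen-type rule are assumptions, and each message is assumed to be a JSON object at the top level.

```python
import json


def infer_type(value) -> str:
    """Map a decoded JSON value to a simple, made-up type name."""
    # bool must be checked before int because bool is a subclass of int in Python.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "integer"
    if isinstance(value, float):
        return "float"
    if isinstance(value, str):
        return "string"
    if value is None:
        return "null"
    if isinstance(value, list):
        return "array"
    return "object"


def infer_schema(raw_messages: list[str]) -> dict[str, str]:
    """Collect field names and the first type seen for each across the sample messages."""
    schema: dict[str, str] = {}
    for raw in raw_messages:
        for name, value in json.loads(raw).items():
            schema.setdefault(name, infer_type(value))
    return schema


samples = ['{"id": 1, "ok": true}', '{"id": 2, "name": "x"}']
print(infer_schema(samples))
print(infer_schema([]))  # no sample messages, so nothing can be inferred
```

The second call mirrors the empty-stream situation: with no events to sample, the result contains no fields.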

Protobuf Message Schema

The Protobuf schema must describe the format of the messages in the topic and must not import custom files. Validio does not support references (Protobuf schemas that reference other Protobuf schemas). Define any nested messages inline so that the uploaded schema is self-contained.

The following is an example of a valid Protobuf schema, which also demonstrates how to include a nested message:

syntax = "proto3";
// Well-known type import; this is not a custom file.
import "google/protobuf/timestamp.proto";

message MyMessage {
	message MyInnerNestedMessage {
		int32 val1 = 1;
		optional string val2 = 2;
	}

	message MyNestedMessage {
		repeated int32 numvec = 1;
		MyInnerNestedMessage inner_nested = 2;
	}

	int32 id = 1;
	string values = 2;
	MyNestedMessage nested = 3;
	google.protobuf.Timestamp created = 4;
}
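
Before uploading a schema, you may want to check that it does not import custom files. The following Python sketch is one way to do that under stated assumptions: the `custom_imports` helper is hypothetical, and it treats every import under `google/protobuf/` as acceptable, based only on the `timestamp.proto` import in the example above. It uses a simple regular expression rather than a Protobuf parser, so an import inside a block comment would still be reported.

```python
import re

# Matches lines such as: import "path/to/file.proto";
# including the optional "public" or "weak" modifiers.
IMPORT_RE = re.compile(r'^\s*import\s+(?:public\s+|weak\s+)?"([^"]+)"\s*;', re.MULTILINE)


def custom_imports(proto_source: str) -> list[str]:
    """Return imported paths that are not under google/protobuf/."""
    return [
        path
        for path in IMPORT_RE.findall(proto_source)
        if not path.startswith("google/protobuf/")
    ]


# Made-up schema text with one well-known import and one custom import.
schema_text = (
    'syntax = "proto3";\n'
    'import "google/protobuf/timestamp.proto";\n'
    'import "my/custom.proto";\n'
)
print(custom_imports(schema_text))
```

An empty result suggests the schema is self-contained in the sense described in this section. Any path in the result points to a message definition that needs to be moved inline.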