AWS S3

Destination Connector

Overview

The Tarsal AWS S3 destination connector writes data to S3 buckets. Each stream is written to its own directory under the bucket.

Output Schema

Each stream is written to its own dedicated directory according to the configuration. The complete dataset for a stream consists of all the output files under that directory; you can think of the directory as the equivalent of a "table" in the database world (see the listing sketch after the sync modes below).

  • Under the Full Refresh sync mode, old output files are purged before new files are created.
  • Under the Incremental - Append sync mode, new output files containing only the new data are added.
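
For example, a stream's files can be listed directly with the AWS SDK. The following is a minimal sketch (not part of the connector) using boto3, with the bucket and path taken from the examples later in this document:

import boto3

# List the output files that make up one stream's "table".
# Bucket and prefix are illustrative values from this document's examples.
s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket="testing_bucket", Prefix="data_output_path/public/users/"):
    for obj in page.get("Contents", []):
        # After a Full Refresh sync, only files from the latest sync remain;
        # after an Incremental - Append sync, files from earlier syncs are still present.
        print(obj["Key"], obj["Size"])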

CSV

By default, Tarsal's S3 CSV output has two columns: a metadata object (containing a UUID and an emission timestamp) and the data blob. With the CSV output, it is also possible to normalize (flatten) the data blob into multiple columns.

| Column | Condition | Description |
| --- | --- | --- |
| data | Always exists | The log/event data. |
| _tarsal_metadata | Always exists | Object that contains metadata assigned by Tarsal. Currently, a timestamp representing when the event was pulled from the data source and a UUID are appended to each processed record. |
| root level fields | When root level normalization (flattening) is selected | The root level fields of the data blob are expanded into their own columns. |

For example, given the following JSON object from a source:

{
  "user_id": 123,
  "name": {
    "first": "John",
    "last": "Doe"
  }
}

With no normalization, the output CSV is:

_tarsal_metadata,data
{"_emitted_at":1622135805000,_ab_id:26d7...a206},{"user_id":123,name:{"first":"John","last":"Doe"}}

With root level normalization, the output CSV is:

_tarsal_metadata,user_id,name
{"_emitted_at":1622135805000,_ab_id:26d7...a206},123,{name:{"first":"John","last":"Doe"}

JSON Lines (JSONL)

JSON Lines is a text format with one JSON object per line. Each line has the following structure:

{
  "_tarsal_metadata": "<json-metadata>"
  "data": "<json-data-from-source>"
}

For example, given the following two JSON objects from a source:

[
  {
    "user_id": 123,
    "name": {
      "first": "John",
      "last": "Doe"
    }
  },
  {
    "user_id": 456,
    "name": {
      "first": "Jane",
      "last": "Roe"
    }
  }
]

The output will be:

{ "_tarsal_metadata": {"_tarsal_ab_id": "0a61de1b-9cdd-4455-a739-93572c9a5f20", "_tarsal_emitted_at": "1631948170000"}, "data": { "user_id": 123, "name": { "first": "John", "last": "Doe" } } }
{ "_tarsal_metadata": {"_tarsal_ab_id": "0a61de1b-9cdd-4455-a739-93572c9a5f20", "_tarsal_emitted_at": "1631948170000"}, "data": { "user_id": 456, "name": { "first": "Jane", "last": "Roe" } } }

The full path of the output data is:

<bucket-name>/<bucket-path>/<source-namespace-if-exists>/<stream-name>/<upload-date>_<upload-millis>_<partition-id>.<format-extension>

For example:

testing_bucket/data_output_path/public/users/2021_01_01_1609541171643_0.csv
↑              ↑                ↑      ↑     ↑          ↑             ↑ ↑
|              |                |      |     |          |             | format extension
|              |                |      |     |          |             partition id
|              |                |      |     |          upload time in millis
|              |                |      |     upload date in YYYY_MM_DD
|              |                |      stream name
|              |                source namespace (if it exists)
|              bucket path
bucket name

Note: the stream name may contain a prefix if one is configured on the connection. The rationale behind this naming pattern is as follows:

  1. Each stream has its own directory.
  2. The data output files can be sorted by upload time.
  3. The upload time combines a human-readable date with an epoch timestamp in milliseconds, so it is both readable and unique.

Currently, each data sync creates only one file per stream; each file is identifiable by its partition ID, which is therefore always 0.
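
As an illustration (an assumption about how such a name could be derived, not the connector's actual code), a conforming file name can be built like this:

from datetime import datetime, timezone

now = datetime.now(timezone.utc)
upload_date = now.strftime("%Y_%m_%d")       # the human-readable part
upload_millis = int(now.timestamp() * 1000)  # the sortable, unique part
partition_id = 0                             # currently always 0

file_name = f"{upload_date}_{upload_millis}_{partition_id}.csv"
print(file_name)  # e.g. 2021_01_01_1609541171643_0.csv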

Prerequisites

Authentication

The following authentication options are supported by this connector:

| Authentication Method | Supported | Documentation |
| --- | --- | --- |
| Access Key ID and Access Secret | Yes | Managing access keys for IAM users |
| IAM role authentication | Yes | IAM roles |

Configuration

The following fields are used to configure the destination connector.

🚧

Data is purged after each sync

Please note that data in the configured bucket and path will be deleted after each sync. We recommend provisioning a dedicated S3 resource for this sync to prevent unexpected data loss from misconfiguration.

| Field | Required | Description | Example |
| --- | --- | --- | --- |
| S3 Endpoint | No | URL to S3. If using AWS S3, leave blank. | |
| S3 Bucket Name | Yes | Name of the bucket to sync data into. | |
| S3 Bucket Path | Yes | Subdirectory under the above bucket to sync the data into. | |
| S3 Region | Yes | AWS region. | us-east-1 |
| Format | Yes | Format-specific configuration. See below for details. | |
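
Put together, a configuration might look like the following. The key names here are assumptions for illustration; consult the connector's configuration form for the authoritative field names:

# Hypothetical configuration payload mirroring the fields in the table above.
destination_config = {
    "s3_endpoint": "",                  # leave blank when using AWS S3
    "s3_bucket_name": "testing_bucket",
    "s3_bucket_path": "data_output_path",
    "s3_region": "us-east-1",
    "format": {"format_type": "CSV"},   # format-specific options go here
}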

Authentication

Access Key ID and Access Secret

The following fields are specific to the Access Key ID and Access Secret authentication method.

| Field | Required | Description |
| --- | --- | --- |
| Access Key ID | Yes | The access key ID for authenticating to Amazon Web Services. |
| Access Secret | Yes | The access secret for authenticating to Amazon Web Services. |
| IAM Role ARN | No | Leave this field blank if using Access Key ID and Secret. |
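
With this method, static IAM user credentials are passed directly to the S3 client. A minimal boto3 sketch, with placeholder credentials:

import boto3

s3 = boto3.client(
    "s3",
    aws_access_key_id="AKIA...",   # the Access Key ID field (placeholder)
    aws_secret_access_key="...",   # the Access Secret field (placeholder)
    region_name="us-east-1",
)
s3.head_bucket(Bucket="testing_bucket")  # fails fast if the credentials lack access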

IAM role authentication

The following fields are specific to the IAM role authentication method.

| Field | Required | Description | Example |
| --- | --- | --- | --- |
| Access Key ID | No | Leave this field blank if using AWS IAM role authentication. | |
| Access Secret | No | Leave this field blank if using AWS IAM role authentication. | |
| IAM Role ARN | Yes | IAM role associated with the S3 bucket if using assume role. | arn:aws:iam::123456789:role/tarsal-role |
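
With this method, temporary credentials are obtained by assuming the configured role. A minimal boto3 sketch using the example ARN from the table above (the session name is an arbitrary illustrative value):

import boto3

sts = boto3.client("sts")
creds = sts.assume_role(
    RoleArn="arn:aws:iam::123456789:role/tarsal-role",
    RoleSessionName="tarsal-s3-destination",  # hypothetical session name
)["Credentials"]

s3 = boto3.client(
    "s3",
    aws_access_key_id=creds["AccessKeyId"],
    aws_secret_access_key=creds["SecretAccessKey"],
    aws_session_token=creds["SessionToken"],
)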