AWS S3
Destination Connector
Overview
The Tarsal AWS S3 destination connector writes data to S3 buckets. Each stream is written to its own directory under the bucket.
Output Schema
Each stream is written to its own directory according to the configuration. The complete dataset of a stream comprises all the output files under that directory. You can think of the directory as the equivalent of a "table" in the database world.
- Under Full Refresh Sync mode, old output files will be purged before new files are created.
- Under Incremental - Append Sync mode, new output files will be added that only contain the new data.
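The two modes can be sketched with an in-memory stand-in for a bucket. This is purely illustrative: the function names and the dict-as-bucket model are hypothetical, not part of the connector's API.

```python
# Illustrative sketch of the two sync modes, using a plain dict as a
# stand-in for an S3 bucket (object key -> file contents).

def write_full_refresh(bucket: dict, stream_dir: str, files: dict) -> None:
    """Purge old output files under the stream directory, then write new ones."""
    for key in [k for k in bucket if k.startswith(stream_dir + "/")]:
        del bucket[key]
    for name, data in files.items():
        bucket[f"{stream_dir}/{name}"] = data

def write_incremental_append(bucket: dict, stream_dir: str, files: dict) -> None:
    """Keep existing files and add new files containing only the new data."""
    for name, data in files.items():
        bucket[f"{stream_dir}/{name}"] = data

bucket = {"users/2021_01_01_1609541171643_0.csv": "old data"}
write_full_refresh(bucket, "users", {"2021_01_02_1609627571643_0.csv": "new data"})
print(sorted(bucket))  # only the new file remains
write_incremental_append(bucket, "users", {"2021_01_03_1609713971643_0.csv": "more data"})
print(sorted(bucket))  # both files are present
```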
CSV
By default, Tarsal's S3 output has two columns: _tarsal_metadata, which carries a UUID and an emission timestamp, and data, which carries the data blob. With the CSV output, it is also possible to normalize (flatten) the data blob into multiple columns.
Column | Condition | Description |
---|---|---|
data | Always exists | The log/event data. |
_tarsal_metadata | Always exists | Object that contains metadata assigned by Tarsal: a UUID and a timestamp representing when the event was pulled from the data source are appended to each processed record. |
root level fields | When root level normalization (flattening) is selected | The root level fields of the data blob are expanded into their own columns. |
For example, given the following JSON object from a source:

```json
{
  "user_id": 123,
  "name": {
    "first": "John",
    "last": "Doe"
  }
}
```
With no normalization, the output CSV is:

```
_tarsal_metadata,data
{"_emitted_at":1622135805000,"_ab_id":"26d7...a206"},{"user_id":123,"name":{"first":"John","last":"Doe"}}
```
With root level normalization, the output CSV is:

```
_tarsal_metadata,user_id,name
{"_emitted_at":1622135805000,"_ab_id":"26d7...a206"},123,{"first":"John","last":"Doe"}
```
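A minimal sketch of how one record maps to the two CSV layouts. The helper function is hypothetical, for illustration only; note that real CSV output also escapes quotes embedded in JSON values, which the examples above omit for readability.

```python
import csv
import io
import json

def render_csv(record: dict, metadata: dict, flatten: bool) -> str:
    """Render one source record under the two CSV layouts.
    Hypothetical helper, not the connector's actual code."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    if flatten:
        # Root level normalization: each top-level key gets its own column.
        writer.writerow(["_tarsal_metadata"] + list(record))
        writer.writerow([json.dumps(metadata)] + [
            json.dumps(v) if isinstance(v, (dict, list)) else v
            for v in record.values()
        ])
    else:
        writer.writerow(["_tarsal_metadata", "data"])
        writer.writerow([json.dumps(metadata), json.dumps(record)])
    return buf.getvalue()

record = {"user_id": 123, "name": {"first": "John", "last": "Doe"}}
metadata = {"_emitted_at": 1622135805000, "_ab_id": "26d7...a206"}
print(render_csv(record, metadata, flatten=False))
print(render_csv(record, metadata, flatten=True))
```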
JSON Lines (JSONL)
JSON Lines is a text format with one JSON object per line. Each line has the following structure:

```json
{
  "_tarsal_metadata": "<json-metadata>",
  "data": "<json-data-from-source>"
}
```
For example, the following two JSON objects from a source:

```json
[
  {
    "user_id": 123,
    "name": {
      "first": "John",
      "last": "Doe"
    }
  },
  {
    "user_id": 456,
    "name": {
      "first": "Jane",
      "last": "Roe"
    }
  }
]
```

will produce output like:

```
{ "_tarsal_metadata": {"_tarsal_ab_id": "0a61de1b-9cdd-4455-a739-93572c9a5f20", "_tarsal_emitted_at": "1631948170000"}, "data": { "user_id": 123, "name": { "first": "John", "last": "Doe" } } }
{ "_tarsal_metadata": {"_tarsal_ab_id": "0a61de1b-9cdd-4455-a739-93572c9a5f21", "_tarsal_emitted_at": "1631948170000"}, "data": { "user_id": 456, "name": { "first": "Jane", "last": "Roe" } } }
```
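The wrapping can be sketched in a few lines of Python. This is an illustrative reconstruction, not the connector's actual code; the UUID and timestamp generation here merely stand in for whatever the connector assigns.

```python
import json
import uuid

def to_jsonl(records, emitted_at_millis: str) -> str:
    """Wrap each source record with its metadata and emit one JSON
    object per line. Illustrative sketch only."""
    lines = []
    for record in records:
        wrapped = {
            "_tarsal_metadata": {
                "_tarsal_ab_id": str(uuid.uuid4()),       # one UUID per record
                "_tarsal_emitted_at": emitted_at_millis,  # pull timestamp
            },
            "data": record,
        }
        lines.append(json.dumps(wrapped))
    return "\n".join(lines)

records = [
    {"user_id": 123, "name": {"first": "John", "last": "Doe"}},
    {"user_id": 456, "name": {"first": "Jane", "last": "Roe"}},
]
print(to_jsonl(records, "1631948170000"))
```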
The full path of the output data is:

```
<bucket-name>/<bucket-path>/<source-namespace-if-exists>/<stream-name>/<upload-date>_<upload-millis>_<partition-id>.<format-extension>
```
For example:

```
testing_bucket/data_output_path/public/users/2021_01_01_1609541171643_0.csv
↑              ↑                ↑      ↑     ↑          ↑             ↑ ↑
|              |                |      |     |          |             | format extension
|              |                |      |     |          |             partition id
|              |                |      |     |          upload time in millis
|              |                |      |     upload date in YYYY_MM_DD
|              |                |      stream name
|              |                source namespace (if it exists)
|              bucket path
bucket name
```
Note: the stream name may contain a prefix if one is configured on the connection. The rationale behind this naming pattern is:
- Each stream has its own directory.
- The data output files can be sorted by upload time.
- The upload time is composed of a date time so that it is both readable and unique.
Currently, each data sync creates only one file per stream, so the partition ID that identifies it is always 0.
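The key construction can be sketched as follows. The function is a reconstruction from the documented pattern, not the connector's code; the bucket name and bucket path come from the configuration and are not part of the key it builds.

```python
from datetime import datetime, timezone

def output_key(namespace, stream, upload_millis, partition_id, ext):
    """Build the object key under the configured bucket path, following the
    documented naming pattern. Illustrative reconstruction only."""
    date = datetime.fromtimestamp(upload_millis / 1000, tz=timezone.utc)
    prefix = f"{namespace}/" if namespace else ""
    return f"{prefix}{stream}/{date:%Y_%m_%d}_{upload_millis}_{partition_id}.{ext}"

print(output_key("public", "users", 1609541171643, 0, "csv"))
# → public/users/2021_01_01_1609541171643_0.csv
```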
Prerequisites
Authentication
The following authentication options are supported by this connector:
Authentication Method | Supported | Documentation |
---|---|---|
Access Key ID and Access Secret | yes | Managing access keys for IAM users |
IAM role authentication | yes | IAM roles |
Configuration
The following fields are used to configure the destination connector.
Data is purged on each sync
Note that data in the configured bucket and path is deleted on each sync, before the new files are written. We recommend provisioning a dedicated S3 resource for this connector to prevent unexpected data loss from misconfiguration.
Field | Required | Description | Example |
---|---|---|---|
S3 Endpoint | no | URL of the S3 endpoint. Leave blank if using AWS S3. | |
S3 Bucket Name | yes | Name of the bucket to sync data into. | |
S3 Bucket Path | yes | Subdirectory under the above bucket to sync the data into. | |
S3 Region | yes | AWS Region | us-east-1 |
Format | yes | Format specific configuration. See below for details. | |
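Put together, a configuration for this destination might look like the following sketch. The property keys here are illustrative only; the actual names are defined by the connector's specification.

```json
{
  "s3_endpoint": "",
  "s3_bucket_name": "testing_bucket",
  "s3_bucket_path": "data_output_path",
  "s3_bucket_region": "us-east-1",
  "format": {
    "format_type": "CSV",
    "flattening": "Root level flattening"
  }
}
```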
Authentication
Access Key ID and Access Secret
The following fields are specific for the Access Key ID and Access Secret authentication method.
Field | Required | Description |
---|---|---|
Access key ID | yes | The access key id for authentication to Amazon Web Services |
Access Secret | yes | The access secret for authenticating to Amazon Web Services |
IAM Role ARN | no | Leave this field blank if using Access Key ID and Secret. |
IAM role authentication
The following fields are specific for the IAM role authentication method.
Field | Required | Description | Example |
---|---|---|---|
Access key ID | no | Leave this field blank if using AWS IAM role authentication. | |
Access Secret | no | Leave this field blank if using AWS IAM role authentication. | |
IAM Role ARN | no | IAM Role associated with S3 bucket if using assume role | arn:aws:iam::123456789:role/tarsal-role |