AWS S3 Source Connector

Overview

This page contains the setup guide and reference information for the Amazon S3 source connector.

Prerequisites

Authentication

The following authentication options are supported by this connector:

| Authentication Method | Supported | Documentation |
| --- | --- | --- |
| Access Key ID and Access Secret | Yes | Managing access keys for IAM users |
| IAM role authentication | Yes | IAM roles |

Configuration

The following fields are used to configure the source connector.

| Field | Required | Description | Example |
| --- | --- | --- | --- |
| Output Stream Name | Yes | The name of the stream you would like this source to output. Can contain letters, numbers, or underscores. | |
| Pattern of files to replicate | Yes | A glob-style pattern that tells the connector which files to replicate. Use the pattern `**` to pick up all files. | `myFolder/myOtherTableFiles/*.jsonl` |
| Bucket | Yes | Name of the S3 bucket where the file(s) exist. | |
| Path Prefix | No | See below. | `myFolder/thisTable/` |
| Endpoint | No | See below. | |

Path Prefix

An optional string that limits the files returned by AWS when listing objects to only those whose keys start with the specified prefix. This differs from the Path Pattern in that the prefix is applied directly to the API call made to S3, rather than being filtered within Tarsal. It is not a regular expression and does not accept pattern-style symbols such as wildcards (*). We recommend using this filter to improve the connector's performance if your bucket has many folders and files that are unrelated to the data you want to replicate, and all the relevant files will always reside under the specified prefix.

  • Together with the Path Pattern, there are multiple ways to specify the files to sync. For example, all of the following configurations are equivalent:
    • Prefix = <empty>, Pattern = path1/path2/myFolder/**/*.jsonl
    • Prefix = path1/, Pattern = path2/myFolder/**/*.jsonl
    • Prefix = path1/path2/, Pattern = myFolder/**/*.jsonl
    • Prefix = path1/path2/myFolder/, Pattern = **/*.jsonl
  • The ability to individually configure the prefix and pattern has been included to accommodate situations where you do not want to replicate the majority of the files in the bucket. If you are unsure of the best approach, you can safely leave the Path Prefix field empty and just set the Path Pattern to meet your requirements.
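To make the division of labor concrete, here is a minimal Python sketch of how a prefix narrows the server-side listing before any pattern matching happens. An in-memory list of invented key names stands in for a real bucket listing:

```python
# The prefix is a literal "starts with" filter, applied by S3 itself when
# listing objects -- it is not a glob or regular expression.
def list_with_prefix(keys, prefix):
    """Simulate the effect of S3's ListObjectsV2 Prefix parameter."""
    return [k for k in keys if k.startswith(prefix)]

# Hypothetical bucket contents, for illustration only.
keys = [
    "path1/path2/myFolder/part1.jsonl",
    "path1/path2/otherFolder/part2.jsonl",
    "unrelated/huge_file.bin",
]

# Only keys under path1/path2/ come back from the listing; the Path Pattern
# is then applied to what remains, inside the connector.
print(list_with_prefix(keys, "path1/path2/"))
```

Note that a wildcard in the prefix is taken literally: `list_with_prefix(keys, "path1/*")` returns nothing above, because no key actually begins with the characters `path1/*`.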

Endpoint

An optional parameter that enables the use of non-Amazon S3 compatible services. If you are using the default Amazon service, leave this field blank.

Authentication

Access Key ID and Access Secret

The following fields are specific to the Access Key ID and Access Secret authentication method.

| Field | Required | Description |
| --- | --- | --- |
| Access Key ID | Yes | The access key ID for authentication to Amazon Web Services. |
| Access Secret | Yes | The access secret for authenticating to Amazon Web Services. |
| IAM Role ARN | No | Leave this field blank if using Access Key ID and Secret. |

IAM role authentication

The following fields are specific to the IAM role authentication method.

| Field | Required | Description | Example |
| --- | --- | --- | --- |
| Access Key ID | Yes | Leave this field blank if using AWS IAM role authentication. | |
| Access Secret | Yes | Leave this field blank if using AWS IAM role authentication. | |
| IAM Role ARN | No | IAM role associated with the S3 bucket if using assume role. | `arn:aws:iam::123456789:role/tarsal-role` |

File Format Settings

JSONL is currently the only supported format, so there are no format-specific settings.
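For reference, JSONL (JSON Lines) stores one JSON object per line. A short Python sketch of parsing such content (the record fields are invented for illustration):

```python
import json

# Two example records in JSONL form: one JSON object per line.
raw = '{"id": 1, "event": "login"}\n{"id": 2, "event": "logout"}\n'

# Parse line by line; blank lines are skipped.
records = [json.loads(line) for line in raw.splitlines() if line.strip()]
print(records[0]["event"])  # -> login
```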

File Compressions

| Compression | Supported? |
| --- | --- |
| Gzip | Yes |
| Zip | Yes |
| Bzip2 | Yes |
| Lzma | Yes |
| Xz | Yes |
| Tar | Yes |
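As an illustration of what decompression involves, here is a sketch using Python's standard library that round-trips a gzipped JSONL payload; the extension-to-decompressor mapping is a simplified assumption, and the same idea extends to bzip2 (`bz2`) and LZMA/XZ (`lzma`):

```python
import bz2
import gzip
import json
import lzma

# Two JSONL records compressed with gzip, as they might be stored in S3.
payload = b'{"id": 1}\n{"id": 2}\n'
compressed = gzip.compress(payload)

# Pick a decompressor from the file extension (simplified illustration).
decompressors = {".gz": gzip.decompress, ".bz2": bz2.decompress, ".xz": lzma.decompress}
data = decompressors[".gz"](compressed)

# Decompressed bytes parse as ordinary JSONL.
records = [json.loads(line) for line in data.splitlines() if line.strip()]
print(records)  # -> [{'id': 1}, {'id': 2}]
```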

Path Patterns

This connector can sync multiple files by using glob-style patterns, rather than requiring a specific path for every file. This enables:

  • Referencing many files with just one pattern, e.g. ** would indicate every file in the bucket.
  • Referencing future files that don't exist yet (and therefore don't have a specific path).

You must provide a path pattern. You can also provide many patterns split with | for more complex directory layouts.
Each path pattern is a reference from the root of the bucket, so don't include the bucket name in the pattern(s).

Some example patterns:

| Pattern | Description |
| --- | --- |
| `**` | Match everything. |
| `**/*.jsonl` | Match all files with a specific extension. |
| `myFolder/**/*.jsonl` | Match all jsonl files anywhere under myFolder. |
| `*/**` | Match everything at least one folder deep. |
| `*/*/*/**` | Match everything at least three folders deep. |
| `**/file.*\|**/file` | Match every file called "file" with any extension (or no extension). |
| `x/*/y/*` | Match all files that sit in folder x -> any folder -> folder y. |
| `**/prefix*.jsonl` | Match all jsonl files with a specific prefix. |
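The connector's own matching engine isn't shown here, but the semantics above can be approximated with a short Python translation of glob patterns into regular expressions. The assumed semantics, taken from the examples in this section, are: `**/` matches zero or more whole folders, a bare `**` matches anything, `*` stops at `/`, and `|` separates alternative patterns:

```python
import re

def glob_to_regex(pattern):
    """Translate a glob-style path pattern into a compiled regex.

    Assumed semantics, per the examples above: '**/' matches zero or more
    whole folders, a bare '**' matches anything, '*' stops at '/', and
    '|' separates alternative patterns.
    """
    alternatives = []
    for alt in pattern.split("|"):
        out, i = [], 0
        while i < len(alt):
            if alt[i : i + 3] == "**/":
                out.append("(?:.*/)?")  # zero or more whole folders
                i += 3
            elif alt[i : i + 2] == "**":
                out.append(".*")        # anything, including '/'
                i += 2
            elif alt[i] == "*":
                out.append("[^/]*")     # anything within one folder
                i += 1
            else:
                out.append(re.escape(alt[i]))
                i += 1
        alternatives.append("".join(out))
    return re.compile("^(?:" + "|".join(alternatives) + ")$")

# A few of the table's rows, checked against sample keys.
assert glob_to_regex("**").match("any/depth/file.bin")
assert glob_to_regex("myFolder/**/*.jsonl").match("myFolder/x/y/a.jsonl")
assert glob_to_regex("*/**").match("one/deep")
assert not glob_to_regex("*/**").match("shallow.txt")
```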

Here is a specific example, matching the following bucket layout:

myBucket
    -> log_files
    -> some_table_files
        -> part1.jsonl
        -> part2.jsonl
    -> images
    -> more_table_files
        -> part3.jsonl
    -> extras
        -> misc
            -> another_part1.jsonl.gz

We want to pick up part1.jsonl, part2.jsonl and part3.jsonl (excluding another_part1.jsonl.gz for now). We could do this a few different ways:

  • We could pick up every jsonl file called "partX" with the single pattern **/part*.jsonl.
  • To be a bit more robust, we could use the dual pattern some_table_files/*.jsonl|more_table_files/*.jsonl to pick up the relevant files only from those exact folders.
  • We could achieve the above in a single pattern by using *table_files/*.jsonl. This could, however, cause problems in the future if new, unexpected folders start being created.
  • We can also wildcard recursively, so adding the pattern extras/**/*.jsonl.gz would pick up any gzipped jsonl files nested in folders below "extras", such as "extras/misc/another_part1.jsonl.gz".

As you can probably tell, there are many ways to achieve the same goal with path patterns. We recommend using a pattern that ensures clarity and is robust against future additions to the directory structure.