AWS S3
Source Connector
Introduction
This page contains the setup guide and reference information for the Amazon S3 source connector. All streams sent to this destination will end up in a single bucket, with a folder created for each stream within the bucket.
This is a "Low Volume" Connector and should only be used for S3 buckets with fewer than 10,000 objects.
For a high volume S3 Connector, please use the AWS S3 with SQS Connector.
Prerequisites
- Access to the S3 bucket containing the files to replicate.
AWS Setup
Auth with Assume Role
If you are syncing from a private bucket, you will need to create an IAM role for Tarsal to assume to access the bucket and its objects, and ensure that the IAM role has read and list permissions for the bucket. If you are unfamiliar with configuring AWS permissions, you can follow these steps to obtain the necessary permissions and credentials:
- Log in to your Amazon AWS account and open the IAM console.
- In the IAM dashboard, select Policies, then click Create Policy.
- Select the JSON tab, then paste the following JSON into the Policy editor (be sure to substitute in your bucket name):
{ "Version": "2012-10-17", "Statement": [ { "Effect": "Allow", "Action": [ "s3:GetObject", "s3:ListBucket" ], "Resource": [ "arn:aws:s3:::{your-bucket-name}/*", "arn:aws:s3:::{your-bucket-name}" ] } ] }
Note: If the contents of the bucket are encrypted using a custom KMS key, you will also need to give the role permission to decrypt data, using a statement similar to the one below.
{ "Version": "2012-10-17", "Statement": { "Effect": "Allow", "Action": "kms:Decrypt", "Resource": "arn:aws:kms:{region}:{account}:key/{key-id}" } }
- Give your policy a descriptive name, then click Create policy.
- In the IAM dashboard, click Roles. Create a new one by clicking Add role.
- Select the AWS account trusted entity type and allow the Tarsal IAM role to assume the role you're creating. Please ask support for the account ID and external ID to use. Once generated, the trust policy will look something like this:
{ "Version": "2012-10-17", "Statement": [ { "Effect": "Allow", "Principal": { "AWS": "<tarsal account id>" }, "Action": "sts:AssumeRole", "Condition": { "StringEquals": { "sts:ExternalId": "<unique external id>" } } } ] }
- Then you will be asked to find and check the box for the policy you created. Click Next, then Create Role.
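The two policy documents used in the steps above can also be generated programmatically, which avoids hand-editing the bucket name in two places. The following is a minimal sketch using only the Python standard library. The helper names `build_read_policy` and `build_trust_policy` are hypothetical and not part of any Tarsal or AWS tooling, and the account ID and external ID shown are placeholders that you still need to obtain from support.

```python
import json


def build_read_policy(bucket_name: str) -> dict:
    """Return the read/list permissions policy for a single bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:ListBucket"],
                "Resource": [
                    f"arn:aws:s3:::{bucket_name}/*",  # objects in the bucket
                    f"arn:aws:s3:::{bucket_name}",    # the bucket itself
                ],
            }
        ],
    }


def build_trust_policy(account_id: str, external_id: str) -> dict:
    """Return a trust policy that lets the given account assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"AWS": account_id},
                "Action": "sts:AssumeRole",
                "Condition": {"StringEquals": {"sts:ExternalId": external_id}},
            }
        ],
    }


if __name__ == "__main__":
    # Placeholder values; substitute your own bucket name and the IDs from support.
    print(json.dumps(build_read_policy("my-example-bucket"), indent=2))
    print(json.dumps(
        build_trust_policy("<tarsal account id>", "<unique external id>"), indent=2
    ))
```

The printed JSON can be pasted into the Policy editor and the role's trust relationship respectively.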
Auth with Access Credentials
If you are syncing from a private bucket, you will need to provide your AWS Access Key ID and AWS Secret Access Key to authenticate the connection, and ensure that the IAM user associated with the credentials has read and list permissions for the bucket. If you are unfamiliar with configuring AWS permissions, you can follow these steps to obtain the necessary permissions and credentials:
- Log in to your Amazon AWS account and open the IAM console.
- In the IAM dashboard, select Policies, then click Create Policy.
- Select the JSON tab, then paste the following JSON into the Policy editor (be sure to substitute in your bucket name):
{ "Version": "2012-10-17", "Statement": [ { "Effect": "Allow", "Action": [ "s3:GetObject", "s3:ListBucket" ], "Resource": [ "arn:aws:s3:::{your-bucket-name}/*", "arn:aws:s3:::{your-bucket-name}" ] } ] }
- Give your policy a descriptive name, then click Create policy.
- In the IAM dashboard, click Users. Select an existing IAM user or create a new one by clicking Add users.
- If you are using an existing IAM user, click the Add permissions dropdown menu and select Add permissions. If you are creating a new user, you will be taken to the Permissions screen after selecting a name.
- Select Attach policies directly, then find and check the box for your new policy. Click Next, then Add permissions.
- Your Secret Access Key will only be visible once upon creation. Be sure to copy and store it securely for future use.
For more information on managing your access keys, please refer to the official AWS documentation.
Tarsal Setup
- In the left navigation bar, click Sources. In the top-right corner, click Add Source.
- Find and select AWS S3 from the list of available sources.
- Enter the Output Stream Name. This will be the name of the table in the destination (can contain letters, numbers and underscores).
- Enter the Pattern of files to replicate. This is a glob-style pattern that Tarsal uses to match the specific files to replicate. If you are replicating all the files within your bucket, use ** as the pattern. For more precise pattern matching options, refer to the Path Patterns section below.
- Enter the name of the Bucket containing your files to replicate.
- If you are syncing from a private bucket, authenticate the connection in one of two ways: fill in the AWS Access Key ID and AWS Secret Access Key fields with the appropriate credentials, or provide the IAM Assume Role ARN for Tarsal to assume, optionally specifying the External ID attached to the role's trust policy.
File Format Settings
JSONL
JSONL is the only format currently supported. As such, there are no extra settings for other format types.
File Compressions
Compression | Supported? |
---|---|
Bzip2 | yes |
Gzip | yes |
Lzma | yes |
Tar | yes |
Xz | yes |
Zip | yes |
Please let us know any specific compressions you'd like to see support for next!
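To make the format and compression settings concrete, the sketch below shows what a JSONL payload looks like (one JSON value per line) and how such a file could be decompressed and parsed with the Python standard library. It is an illustration only and not the connector's implementation; the `read_jsonl` helper and the compression labels in `DECOMPRESSORS` are assumptions made for this example.

```python
import bz2
import gzip
import json
import lzma

# Hypothetical mapping from a compression label to a stdlib decompressor.
DECOMPRESSORS = {
    "none": lambda data: data,
    "gzip": gzip.decompress,
    "bzip2": bz2.decompress,
    "xz": lzma.decompress,
}


def read_jsonl(data: bytes, compression: str = "none") -> list:
    """Decompress raw bytes and parse one JSON value per non-empty line."""
    text = DECOMPRESSORS[compression](data).decode("utf-8")
    return [json.loads(line) for line in text.split("\n") if line.strip()]


if __name__ == "__main__":
    # Two records, each on its own line, as a JSONL file would store them.
    raw = b'{"id": 1, "event": "login"}\n{"id": 2, "event": "logout"}\n'
    for record in read_jsonl(gzip.compress(raw), "gzip"):
        print(record)
```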
Path Patterns
This connector can sync multiple files by using glob-style patterns, rather than requiring a specific path for every file. This enables:
- Referencing many files with just one pattern, e.g. ** would indicate every file in the bucket.
- Referencing future files that don't exist yet (and therefore don't have a specific path).
You must provide a path pattern. You can also provide multiple patterns separated by | for more complex directory layouts.
Each path pattern is a reference from the root of the bucket, so don't include the bucket name in the pattern(s).
Some example patterns:
Pattern | Result |
---|---|
** | match everything |
**/*.jsonl | match all files with specific extension |
myFolder/**/*.jsonl | match all jsonl files anywhere under myFolder |
*/** | match everything at least one folder deep |
*/*/*/** | match everything at least three folders deep |
**/file.*|**/file | match every file called "file" with any extension (or no extension) |
x/*/y/* | match all files that sit in folder x -> any folder -> folder y |
**/prefix*.jsonl | match all jsonl files with specific prefix |
Let's look at a specific example, matching the following bucket layout:
myBucket
    -> log_files
    -> some_table_files
        -> part1.jsonl
        -> part2.jsonl
    -> images
    -> more_table_files
        -> part3.jsonl
    -> extras
        -> misc
            -> another_part1.jsonl.gz
We want to pick up part1.jsonl, part2.jsonl and part3.jsonl (excluding another_part1.jsonl.gz for now). We could do this a few different ways:
- We could pick up every jsonl file called "partX" with the single pattern **/part*.jsonl.
- To be a bit more robust, we could use the dual pattern some_table_files/*.jsonl|more_table_files/*.jsonl to pick up relevant files only from those exact folders.
- We could achieve the above in a single pattern by using the pattern *table_files/*.jsonl. This could however cause problems in the future if new unexpected folders started being created.
- We can also recursively wildcard, so adding the pattern extras/**/*.jsonl.gz would pick up any gzipped jsonl files nested in folders below "extras", such as "extras/misc/another_part1.jsonl.gz".
As you can probably tell, there are many ways to achieve the same goal with path patterns. We recommend using a pattern that ensures clarity and is robust against future additions to the directory structure.
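To experiment with patterns before configuring the connector, the example above can be approximated locally. The sketch below is not the connector's matching code: it assumes that `*` stays within a single folder level, that `**/` matches zero or more folders, and that `|` separates alternative patterns. The helpers `glob_to_regex` and `matches` are hypothetical names introduced for this illustration.

```python
import re


def glob_to_regex(pattern: str) -> re.Pattern:
    """Translate one glob-style pattern into a compiled regular expression."""
    parts = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**/", i):
            parts.append(r"(?:[^/]+/)*")  # zero or more whole folders
            i += 3
        elif pattern.startswith("**", i):
            parts.append(r".*")           # anything, across folders
            i += 2
        elif pattern[i] == "*":
            parts.append(r"[^/]*")        # anything within one path segment
            i += 1
        else:
            parts.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(parts))


def matches(key: str, patterns: str) -> bool:
    """Return True if the key fully matches any of the |-separated patterns."""
    return any(glob_to_regex(p).fullmatch(key) for p in patterns.split("|"))


if __name__ == "__main__":
    # Object keys from the example layout, relative to the bucket root.
    keys = [
        "some_table_files/part1.jsonl",
        "some_table_files/part2.jsonl",
        "more_table_files/part3.jsonl",
        "extras/misc/another_part1.jsonl.gz",
    ]
    for patterns in ("**/part*.jsonl", "*table_files/*.jsonl", "extras/**/*.jsonl.gz"):
        print(patterns, "->", [k for k in keys if matches(k, patterns)])
```

Under these assumptions, the first two patterns select the three part files and the third selects only the gzipped file under extras. Actual connector behavior may differ in edge cases, so verify any pattern against a small test sync.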
S3 Provider Settings
- AWS Access Key ID: One half of the credentials for accessing a private bucket.
- AWS Secret Access Key: The other half of the credentials for accessing a private bucket.
- IAM Role ARN: If using assume role to give access to the bucket, the ARN of the role in your account that has read permissions.
- Path Prefix: An optional string that limits the files returned by AWS when listing files to only those starting with the specified prefix. This is different from the Path Pattern, as the prefix is applied directly to the API call made to S3, rather than being filtered within Tarsal. This is not a regular expression and does not accept pattern-style symbols like wildcards (*). We recommend using this filter to improve the performance of the connector if your bucket has many folders and files that are unrelated to the data you want to replicate, and all the relevant files will always reside under the specified prefix.
Together with the Path Pattern, there are multiple ways to specify the files to sync. For example, all the following configurations are equivalent:
Prefix | Pattern |
---|---|
<empty> | path1/path2/myFolder/**/* |
path1/ | path2/myFolder/**/*.jsonl |
path1/path2/ | myFolder/**/*.jsonl |
path1/path2/myFolder/ | **/*.jsonl |
The ability to individually configure the prefix and pattern has been included to accommodate situations where you do not want to replicate the majority of the files in the bucket. If you are unsure of the best approach, you can safely leave the Path Prefix field empty and just set the Path Pattern to meet your requirements.
- Endpoint: An optional parameter that enables the use of non-Amazon S3-compatible services. If you are using the default Amazon service, leave this field blank.
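The literal nature of the Path Prefix described above can be illustrated with a small local simulation. No call to S3 is made here: the sketch assumes the prefix behaves as a plain "starts with" comparison on the object key, and the `filter_by_prefix` helper and sample keys are hypothetical.

```python
def filter_by_prefix(keys, prefix=""):
    """Keep only keys that start with the literal prefix (no wildcard handling)."""
    return [key for key in keys if key.startswith(prefix)]


if __name__ == "__main__":
    keys = [
        "path1/path2/myFolder/a/data.jsonl",
        "path1/path2/otherFolder/data.jsonl",
        "unrelated/data.jsonl",
    ]
    # A literal prefix narrows the listing before any pattern is applied.
    print(filter_by_prefix(keys, "path1/path2/myFolder/"))
    # A wildcard is treated as an ordinary character, so nothing matches.
    print(filter_by_prefix(keys, "path1/*/myFolder/"))
    # An empty prefix keeps every key.
    print(filter_by_prefix(keys))
```

This is why wildcards belong in the Path Pattern rather than in the Path Prefix.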