1. Nomenclature
Bucket Root - The root name of the bucket without the s3:// prefix and without any trailing slashes (/), asterisk (/*) or prefixes (/<prefix>)
S3 Prefix - An S3 prefix that represents a location within your S3 bucket where tables are stored for that prefix
S3 Bucket Policy - The S3 Bucket Policy that we generate for you to attach to your S3 bucket that gives Permutive the permissions to list and read from the bucket
Schema - The concept of a group of tables. This maps to the S3 data lake by using a static S3 Prefix as part of the path in S3 that points to the location of multiple tables
Table - A single table within Permutive. This maps to the S3 data lake by using a single prefix under the Schema Prefix. Imports can be created from tables for matching and modelling
Data file - The data files that contain the data of the table; they exist underneath the prefixes of the Schema and the Table and may exist under further prefixes such as Hive partitions or other directories
File Format - Which format the data files are in; the only option is CSV initially
Hive Partition - An S3 prefix under the table in Hive partition format, for example date=2025-01-01 or region=EU
Data Partitioning - Option that specifies whether all Tables under a Schema are partitioned or not partitioned
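As a hypothetical example tying these terms together (all names below are placeholders), a single data file for a partitioned table might live at:
s3://my-bucket/permutive/exports/user_activity/date=2025-01-01/part-0000.csv.gz
Here my-bucket is the Bucket Root, permutive/exports/ is the S3 Prefix for the Schema, user_activity is the Table, date=2025-01-01 is a Hive Partition and part-0000.csv.gz is a gzipped CSV Data file.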
2. Setting up your bucket
On the Permutive platform a Schema represents multiple Tables and a Table represents a set of data that can be Imported for use on the platform. As there is no equivalent concept to a Schema or a Table within S3, we require a particular setup of your S3 bucket that will allow you to have multiple Tables represented within a Schema.
Once organised correctly, this will allow you to connect to an S3 Prefix within your bucket and then use multiple tables from that bucket location. This will also make it easy for you to add new tables to the same bucket location in the future, and we will be able to pick these new tables up.
2.1 S3 Directory Structure
2.1.1 Schema Directory Structure
In order to support the concept of a Schema containing multiple tables within S3, we require you to set up your schema as follows:
s3://<bucket_name>/<prefix>/<table_1>
s3://<bucket_name>/<prefix>/<table_2>
s3://<bucket_name>/<prefix>/<table_n>
This will allow multiple tables to exist under a bucket_name and prefix. When you supply the prefix to us, we will presume that every further directory under the prefix represents a table.
You can have multiple Prefixes that represent Schemas, with multiple tables under each Schema. When a Connection is created within Permutive, you supply the Prefix and that connection will only represent that Prefix/Schema with its Tables. In order to import another Prefix/Schema, you would need to create a new Connection within Permutive.
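As an illustration of this discovery behaviour, the following sketch (Python with boto3, using placeholder bucket and prefix names; not part of the Permutive product) lists the directories directly under a Schema Prefix, each of which would be treated as a Table:
import boto3

s3 = boto3.client("s3")

bucket = "my-bucket"                  # placeholder Bucket Root
schema_prefix = "permutive/exports/"  # placeholder Schema Prefix (no leading slash)

# Each directory ("common prefix") directly under the Schema Prefix is treated as a Table.
paginator = s3.get_paginator("list_objects_v2")
tables = []
for page in paginator.paginate(Bucket=bucket, Prefix=schema_prefix, Delimiter="/"):
    for common_prefix in page.get("CommonPrefixes", []):
        # e.g. "permutive/exports/user_activity/" -> "user_activity"
        tables.append(common_prefix["Prefix"][len(schema_prefix):].rstrip("/"))

print(tables)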
2.1.2 Table Directory Structure
2.1.2.1 Partitioned mode
We support Hive partitioning:
s3://<bucket_name>/<prefix>/<table_n>/<partition_name>=<value>/<data_filename_n>.csv
In this case, partition_name becomes a column within your dataset and the value is applied to every row read from files under that partition.
We also support multiple partitions:
s3://<bucket_name>/<prefix>/<table_n>/<partition_name_1>=<value>/<partition_name_2>=<value>/<partition_name_3>=<value>/<data_filename_n>.csv
We recommend always partitioning your data, as it will reduce costs by allowing us to filter on the partition up front when querying your data.
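To make the behaviour above concrete, here is a minimal sketch (placeholder key and partition names) showing how Hive-style partition_name=value segments in an object key map to column values:
# Placeholder object key for a file under two Hive partitions.
key = "permutive/exports/user_activity/date=2025-01-01/region=EU/part-0000.csv"

partitions = {}
for segment in key.split("/"):
    if "=" in segment:
        name, value = segment.split("=", 1)
        partitions[name] = value

# Every row read from this file would get these values as extra columns:
print(partitions)  # {'date': '2025-01-01', 'region': 'EU'}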
2.1.2.2 Non-partitioned mode
s3://<bucket_name>/<prefix>/<table_n>/<data_filename_n>.csv
Once we know the full prefix of the table, we will scan for all files under the table prefix regardless of how deep these files are within S3.
For example, we will include all of the following files in the table, regardless of what prefixes exist underneath the table prefix:
s3://<bucket_name>/<prefix>/<table_n>/<data_filename_1>.csv
s3://<bucket_name>/<prefix>/<table_n>/<inner_table_prefix>/<data_filename_2>.csv
s3://<bucket_name>/<prefix>/<table_n>/<inner_table_prefix_1>/<inner_table_prefix_n>/<data_filename_3>.csv
Above, we know the table prefix is s3://<bucket_name>/<prefix>/<table_n>, so we will ignore the prefixes /, /<inner_table_prefix>/ and /<inner_table_prefix_1>/<inner_table_prefix_n>/ and only see the flattened structure of <data_filename_1>.csv, <data_filename_2>.csv and <data_filename_3>.csv.
We will also ignore Hive partitioned data that is encoded in the prefixes under the table, unless you have indicated to us that all tables under the schema contain Hive partitions.
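As a sketch of this flattening behaviour (Python with boto3, placeholder names), the following lists every data file under a table prefix regardless of how deeply it is nested:
import boto3

s3 = boto3.client("s3")

bucket = "my-bucket"                               # placeholder Bucket Root
table_prefix = "permutive/exports/user_profiles/"  # placeholder Table prefix

# Without a Delimiter, list_objects_v2 returns every object under the prefix,
# however deeply it is nested; the intermediate prefixes are effectively flattened.
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=bucket, Prefix=table_prefix):
    for obj in page.get("Contents", []):
        if obj["Key"].endswith((".csv", ".csv.gz")):
            print(obj["Key"])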
2.2 S3 Bucket Permissions
In order for us to read data directly from your S3 bucket, we need you to attach an S3 Bucket Policy to the bucket that grants our AWS account the correct permissions for reading.
When creating a new Connection, we generate a policy within the Dashboard for you to use on your S3 bucket.
If you have already added the policy to the bucket but want to use a new location within the bucket, you do not need to re-add the policy.
This policy grants the following permissions to Permutive;
s3:ListBucket
s3:GetObject
Example Policy generated from the dashboard
The following is an example policy generated from the Dashboard; the correct details will be filled out once you start creating the connection.
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::<PermutiveAWSAccountId>:root"
      },
      "Action": "s3:ListBucket",
      "Resource": "arn:aws:s3:::<YourBucketName>",
      "Condition": {
        "StringEquals": {
          "aws:PrincipalArn": "arn:aws:iam::<PermutiveAWSAccountId>:role/<PermutiveCustomerSpecificRole>"
        }
      }
    },
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::<PermutiveAWSAccountId>:root"
      },
      "Action": "s3:GetObject",
      "Resource": [
        "arn:aws:s3:::<YourBucketName>",
        "arn:aws:s3:::<YourBucketName>/*"
      ],
      "Condition": {
        "StringEquals": {
          "aws:PrincipalArn": "arn:aws:iam::<PermutiveAWSAccountId>:role/<PermutiveCustomerSpecificRole>"
        }
      }
    }
  ]
}
3. Setting up a new connection
3.1 Step 1 - Select S3 from the catalog of connection types
Go to the Catalog section of Connectivity and select "Amazon S3".
3.2 Step 2 - Enter the details of your S3 bucket
Name - The name of your new Connection on the Permutive platform
AWS Bucket Region - The region that your data resides in; we will only display the supported regions
AWS Bucket Name - The S3 bucket name without any prefixes or suffixes; for example, for the full S3 path s3://<bucket_name>/* you would use just <bucket_name>
AWS Bucket Schema Prefix - The prefix within the bucket that gives a path to a location that represents multiple tables. This should not have a leading slash /; for example, <prefix_part_1>/<prefix_part_n>/
Data Partitioning - Indicate to us whether all or no tables are partitioned within the Schema prefix.
- If “All tables are partitioned in the source prefix” is selected, we will ignore any table prefix that is not partitioned
- If “No tables are partitioned in the source prefix” is selected, we will ignore all subdirectories below each Table we discover within the Schema, including Hive partitions
3.3 Step 3 - Give Permutive access to your bucket
When you enter the name of the bucket you would like to import from on the Create Connection screen, you will be given an S3 Bucket Policy to add to the S3 bucket we will read from.
Copy the policy from the import screen and update your bucket policy by going to the “Permissions” page of the S3 bucket within your AWS account.
Edit the bucket policy and add the policy generated from your Permutive dashboard to the bucket.
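If you prefer to apply the policy programmatically rather than through the console, a minimal sketch using boto3 is shown below, assuming the generated policy has been saved locally as permutive-policy.json (a placeholder filename). Note that this replaces any existing bucket policy, so merge the Permutive statements into your current policy first if the bucket already has one.
import json
import boto3

s3 = boto3.client("s3")

# Load the policy copied from the Permutive dashboard (placeholder filename).
with open("permutive-policy.json") as f:
    policy = json.load(f)

# Caution: put_bucket_policy replaces any existing bucket policy, so merge
# the Permutive statements into your current policy first if you have one.
s3.put_bucket_policy(Bucket="my-bucket", Policy=json.dumps(policy))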
Your bucket should now have the S3 Bucket Policy attached to it.
Now that the S3 Bucket Policy is in place, you should be able to create the connection to your data within S3. Once the connection has been created, you will see it in your list of connections on your Permutive account.
3.4 Step 4 - Create Import
Go to the Create Import screen under Imports.
Here you can select the source type of Amazon S3 and the connection you just created, e.g. My S3 Data Connection. The Schema Prefix will be the only selectable Schema in this type of connection. All discovered and usable tables will appear in the tables list. Continue as normal to create a new import from the data within your S3 bucket.
4. Recommendations and Limitations
- For partitioned tables we only support Hive Partitioning
- We recommend partitioning all data within S3; this gives us the opportunity to reduce costs by only querying the data that is needed, provided that the partition is used within a query
- This is most important with User Activity type data, where you can continually add data to the table
- With User Profile data, we do not expect the data to be continually appended to or to grow too much in size
- It’s preferable to keep all data that represent Tables within a Schema as either partitioned or not partitioned; when creating a new Connection within Permutive you will specify the Data Partitioning used for all Tables underneath the S3 Prefix that represents a Schema
- If there are partitioned Tables within the Schema and the Data Partitioning has been set to No tables are partitioned in the source prefix, we will add these tables as non-partitioned tables and ignore the partitions
- If there are non-partitioned Tables within the Schema and the Data Partitioning has been set to “All tables are partitioned in the source prefix”, we will ignore any non-partitioned tables within the Schema
- We do not support updating the columns of data coming from S3 so we do not expect the columns to change on an S3 Connection or Import
- For the CSV file format we support plain CSV and gzipped CSV files; we recommend gzipping your tables to save costs (see the sketch after this list)
- Use either the .csv file extension for plain CSV files or .gz for gzipped CSV files
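As an illustration of the recommended layout (all names are placeholders), the following sketch gzips a small CSV and uploads it under a Hive-partitioned table prefix:
import gzip
import boto3

s3 = boto3.client("s3")

# Placeholder contents for a small CSV data file.
rows = "user_id,event\n123,page_view\n456,click\n"

# Gzip the CSV and upload it under a Hive-partitioned key so queries
# can filter on the date partition.
s3.put_object(
    Bucket="my-bucket",
    Key="permutive/exports/user_activity/date=2025-01-01/part-0000.csv.gz",
    Body=gzip.compress(rows.encode("utf-8")),
)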