1. Nomenclature
- Bucket Root – The root name of your GCS bucket, without the gs:// prefix or any trailing slashes (/), prefixes (/<prefix>), or wildcards (/*).
- GCS Prefix – A prefix (or path) within your GCS bucket where tables are located.
- GCS IAM permissions – The IAM permissions you apply to the GCS bucket to grant Permutive read access to the data.
- Schema – A logical grouping of tables. In GCS, this is represented by a static prefix (folder) that contains multiple table subdirectories.
- Table – A single dataset (table) within Permutive. In GCS, this is a sub-prefix (subdirectory) under the Schema prefix.
- Data File – Files that store the actual data for the table. These reside within the table prefix and may be organised under further sub-prefixes (e.g., Hive-style partitions).
- File Format – Supported formats for data files; currently CSV and Parquet.
- Hive Partition – A directory structure under a table using Hive-style format, e.g., date=2025-01-01 or publishRegion=EU.
- Data Partitioning – A setting that defines whether all tables under a schema are partitioned or not.
2. Setting Up Your GCS Bucket
A Schema on the Permutive platform contains multiple Tables, and each Table is a set of data that can be imported into Permutive.
Since GCS has no built-in schema/table concepts, your bucket must be structured in a way that allows Permutive to infer tables from prefixes (folders) under a schema prefix.
This structure enables you to import multiple tables from a single GCS prefix and easily add more in the future.
2.1 GCS Directory Structure
2.1.1 Schema Directory Structure
To emulate a schema within GCS, structure your bucket like this:
gs://<bucket_name>/<prefix>/<table_1>
gs://<bucket_name>/<prefix>/<table_2>
...
gs://<bucket_name>/<prefix>/<table_n>
Each subdirectory under the schema prefix is considered a table.
To represent multiple schemas, use distinct prefixes:
gs://<bucket_name>/schema_one_prefix/
gs://<bucket_name>/schema_two_prefix/
Each connection in Permutive maps to one schema prefix. To import data from a new schema prefix, create a new connection.
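To illustrate how tables can be inferred from prefixes, here is a minimal sketch using the google-cloud-storage Python client. The bucket, prefix, and project names are placeholders, and it mirrors the discovery behaviour described above rather than Permutive's actual implementation.

```python
from google.cloud import storage

def list_tables(bucket_name: str, schema_prefix: str, project: str) -> list[str]:
    """Return the table sub-prefixes found directly under a schema prefix."""
    client = storage.Client(project=project)
    # Passing a delimiter makes GCS return "common prefixes" (folders)
    # instead of individual objects; each one corresponds to a table.
    blobs = client.list_blobs(
        bucket_name, prefix=f"{schema_prefix.rstrip('/')}/", delimiter="/"
    )
    list(blobs)  # consume the iterator so blobs.prefixes is populated
    return sorted(blobs.prefixes)

# Placeholder example:
# list_tables("my-bucket", "schema_one_prefix", "my-gcp-project")
# -> ["schema_one_prefix/table_1/", "schema_one_prefix/table_2/"]
```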
2.1.2 Table Directory Structure
2.1.2.1 Partitioned Mode
We support Hive-style partitioning:
gs://<bucket_name>/<prefix>/<table_n>/<partition_name>=<value>/<file>.csv
Multiple partition keys are also supported:
gs://<bucket_name>/<prefix>/<table_n>/date=2025-01-01/region=EU/<file>.csv
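To show how Hive-style paths encode partition values, here is a small illustrative Python sketch (not Permutive code) that extracts key/value pairs from an object name:

```python
def parse_hive_partitions(object_name: str, table_prefix: str) -> dict[str, str]:
    """Extract Hive-style partition key/value pairs from an object name."""
    relative = object_name[len(table_prefix):].strip("/")
    partitions = {}
    # Every intermediate segment of the form key=value is a partition;
    # the final segment is the data file itself and is skipped.
    for segment in relative.split("/")[:-1]:
        if "=" in segment:
            key, _, value = segment.partition("=")
            partitions[key] = value
    return partitions

# parse_hive_partitions(
#     "my_prefix/table_n/date=2025-01-01/region=EU/part-0.csv",
#     "my_prefix/table_n/",
# )  # -> {"date": "2025-01-01", "region": "EU"}
```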
2.1.2.2 Non-partitioned Mode
If your data isn’t partitioned:
gs://<bucket_name>/<prefix>/<table_n>/<file>.csv
We scan all files under the table prefix regardless of depth:
gs://<bucket_name>/<prefix>/<table_n>/<file1>.csv
gs://<bucket_name>/<prefix>/<table_n>/inner1/<file2>.csv
gs://<bucket_name>/<prefix>/<table_n>/inner1/inner2/<file3>.csv
Unless you've indicated that all tables are partitioned, we treat the entire directory tree as a flat table and ignore Hive-style partitions.
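For comparison with the partitioned sketch above, listing without a delimiter returns every object under the table prefix at any depth, which matches the flat-table behaviour. Names below are placeholders:

```python
from google.cloud import storage

def list_table_files(bucket_name: str, table_prefix: str, project: str) -> list[str]:
    """List every data file under a table prefix, at any nesting depth."""
    client = storage.Client(project=project)
    # With no delimiter, GCS flattens the whole "directory" tree into
    # one object listing, so inner1/inner2/... files are all included.
    return [blob.name for blob in client.list_blobs(bucket_name, prefix=table_prefix)]
```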
2.2 GCS Bucket Permissions
To allow Permutive to read from your GCS bucket, please assign the correct IAM roles to Permutive’s service account, connection@permutive.com. This should include:
- roles/storage.objectViewer
- roles/storage.bucketViewer
If you're using fine-grained permissions or folder-level access, apply permissions to individual prefixes as needed.
If access is already granted for a parent prefix or bucket, no further IAM changes are needed.
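As one way to apply these bindings, here is a minimal sketch using the google-cloud-storage Python client; it assumes you have permission to administer the bucket's IAM policy, and the Cloud Console or gcloud CLI works equally well.

```python
from google.cloud import storage

PERMUTIVE_MEMBER = "serviceAccount:connection@permutive.com"
ROLES = ("roles/storage.objectViewer", "roles/storage.bucketViewer")

def grant_permutive_access(bucket_name: str, project: str) -> None:
    """Add read-only IAM bindings for Permutive's service account."""
    client = storage.Client(project=project)
    bucket = client.bucket(bucket_name)
    # Request a version 3 policy so bindings can be edited as a list.
    policy = bucket.get_iam_policy(requested_policy_version=3)
    for role in ROLES:
        policy.bindings.append({"role": role, "members": {PERMUTIVE_MEMBER}})
    bucket.set_iam_policy(policy)
```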
3. Setting Up a New Connection
3.1 Step 1 – Select GCS from the catalog
Go to Connectivity → Catalog and select "Google Cloud Storage".
3.2 Step 2 – Enter GCS bucket details
- Name – A friendly name for the connection in Permutive.
- GCP Project ID – The GCP project that the bucket belongs to.
- GCS Bucket Region – The region that the bucket belongs to. Available GCS regions will be displayed based on your workspace location; please make sure your bucket is in one of these regions.
- GCS Bucket Name – The full name of the GCS bucket (no gs:// prefix).
- Schema Prefix – The prefix within the bucket representing a schema (no leading slash).
- Data Format – Depending on the format your data files are stored in, choose either:
  - CSV
  - Parquet
- Data Partitioning – Choose either:
  - All tables are partitioned
  - No tables are partitioned
  This choice affects how we interpret subdirectories.
3.3 Step 3 – Create Import
Under Imports → Create Import, select:
- Source: Google Cloud Storage
- Connection: the GCS connection you just created
- Schema: the schema matching the prefix defined during connection setup
- Table: one of the discovered tables within the selected schema
You will see the schema prefix and a list of detected tables. Proceed to create imports as usual.
4. Limitations
- Only Hive-style partitioning is supported for partitioned tables.
- Profile data can be non-partitioned if it’s not frequently updated.
- Mixed partitioning is not supported in a single schema connection.
- We do not support schema evolution (i.e., column changes) on GCS imports.
5. Recommendations
Data Partitioning
- We recommend partitioning all data, especially event/user activity tables, for better performance and cost-efficiency.
- This matters most for User Activity data, where new data is continually appended to the table.
- With User Profile data, we expect the table not to be continually appended to or to grow significantly in size.
- Ensure consistency: all tables under a schema should either all be partitioned or all be non-partitioned, matching the Data Partitioning setting.
- If there are partitioned Tables within the Schema and Data Partitioning has been set to “No tables are partitioned in the source prefix”, we will add these tables as non-partitioned tables and ignore the partitions.
- If there are non-partitioned Tables within the Schema and Data Partitioning has been set to “All tables are partitioned in the source prefix”, we will ignore any non-partitioned tables within the Schema.
Data Format
We recommend that all tables under a schema use the same file format, either CSV or Parquet.
Parquet Format
We highly recommend using the Parquet file format due to its columnar storage benefits, which significantly improve query performance and reduce storage size.
For files in Parquet format, we specifically recommend using the ZSTD compression codec to maximise storage efficiency and speed up data processing.
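For instance, a table can be written as ZSTD-compressed Parquet with pyarrow; the column and file names below are illustrative only:

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Illustrative columns; real tables will have their own schema.
table = pa.table({
    "user_id": ["u1", "u2"],
    "event_time": ["2025-01-01T00:00:00Z", "2025-01-01T00:05:00Z"],
})
# compression="zstd" applies ZSTD to every column chunk, trading a
# small CPU cost for markedly smaller files than uncompressed Parquet.
pq.write_table(table, "part-0.parquet", compression="zstd")
```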
CSV Format
We support:
- .csv (uncompressed CSV)
- .gz (gzipped CSV)
For CSV files, especially large datasets, gzipping is highly recommended to significantly reduce storage costs.
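For example, pandas can write a gzipped CSV directly; the frame and file name below are illustrative only:

```python
import pandas as pd

df = pd.DataFrame({"user_id": ["u1", "u2"], "score": [0.8, 0.3]})
# pandas infers gzip from the .csv.gz extension; compression="gzip"
# makes the choice explicit.
df.to_csv("part-0.csv.gz", index=False, compression="gzip")
```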