There are two ways to provide Snowflake data access to Aampe:
  • Unload data from Snowflake into Google Cloud Storage
  • Give the Aampe user direct access to your Snowflake database (and tables) containing event data
For both methods, customers generally create separate tables/views for Aampe that are limited to only the data Aampe needs to function.
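If you go the direct-access route, a minimal sketch of the kind of object customers typically share is shown below; the database, schema, view, column, and role names are illustrative only and should be adapted to your environment.
-- Illustrative only: expose just the event columns Aampe needs in a dedicated view
CREATE OR REPLACE VIEW ANALYTICS.AAMPE.EVENTS_FOR_AAMPE AS
SELECT user_id, event_name, event_timestamp
FROM ANALYTICS.PROD.EVENTS;
-- Grant read-only access on that view to the role used by the Aampe user
GRANT USAGE ON DATABASE ANALYTICS TO ROLE AAMPE_READER;
GRANT USAGE ON SCHEMA ANALYTICS.AAMPE TO ROLE AAMPE_READER;
GRANT SELECT ON VIEW ANALYTICS.AAMPE.EVENTS_FOR_AAMPE TO ROLE AAMPE_READER;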

Unloading Data into Google Cloud Storage from Snowflake

1. Create Snowflake Integration

Integrations are named, first-class Snowflake objects that avoid the need for passing explicit cloud provider credentials such as secret keys or access tokens; instead, integration objects reference a Cloud Storage service account. Aampe will provide you with the GCS bucket URL used as 'gcs-bucket-link' in the statements below.
CREATE STORAGE INTEGRATION gcs_int
TYPE = EXTERNAL_STAGE
STORAGE_PROVIDER = GCS
ENABLED = TRUE
STORAGE_ALLOWED_LOCATIONS = ('gcs-bucket-link');
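Note that creating a storage integration requires the ACCOUNTADMIN role or a role with the global CREATE INTEGRATION privilege. If a different role will create the external stage in Step 4, grant it usage on the integration; a minimal example (the role name is illustrative):
GRANT USAGE ON INTEGRATION gcs_int TO ROLE SYSADMIN;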

2. Obtain GCP Service Account

Snowflake creates a GCP service account as part of Step 1. Retrieve its name with:
DESC STORAGE INTEGRATION gcs_int;
In the output of the above command, the STORAGE_GCP_SERVICE_ACCOUNT property contains the GCP service account name. Example output:
+-----------------------------+---------------+--------------------------------------------------+------------------+
| property                    | property_type | property_value                                    | property_default |
+-----------------------------+---------------+--------------------------------------------------+------------------+
| ENABLED                     | Boolean       | true                                              | false            |
| STORAGE_ALLOWED_LOCATIONS   | List          | bucket-link-that-we-will-send-you                 | []               |
| STORAGE_GCP_SERVICE_ACCOUNT | String        | service-account@project.iam.gserviceaccount.com   |                  |
+-----------------------------+---------------+--------------------------------------------------+------------------+
Send this service account name to Aampe, and we will grant it the permissions it needs to add and modify data in the GCS bucket.

3. Create File Formats

To unload Snowflake data as Parquet, create a Parquet file format as follows. Other formats such as CSV or JSON can also be used. Here is an example script that creates a Parquet file format:
USE DB_NAME;
USE SCHEMA SCHEMA_NAME;
CREATE OR REPLACE FILE FORMAT parquet_unload_file_format
TYPE = PARQUET
SNAPPY_COMPRESSION = TRUE
COMMENT = 'FILE FORMAT FOR UNLOADING AS PARQUET FILES';
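For instance, if you prefer to unload as CSV instead, a comparable file format might look like the following (the compression and quoting settings are illustrative):
CREATE OR REPLACE FILE FORMAT csv_unload_file_format
TYPE = CSV
FIELD_OPTIONALLY_ENCLOSED_BY = '"'
COMPRESSION = GZIP
COMMENT = 'FILE FORMAT FOR UNLOADING AS GZIPPED CSV FILES';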
More options and file types are described in the Snowflake CREATE FILE FORMAT documentation: https://docs.snowflake.com/en/sql-reference/sql/create-file-format.html

4. Create External Stage

In Snowflake, an EXTERNAL STAGE object references data files stored in a location outside of Snowflake.
USE DB_NAME;
USE SCHEMA SCHEMA_NAME;
CREATE OR REPLACE STAGE parquet_unload_gcs_stage
URL = 'gcs-bucket-link'
STORAGE_INTEGRATION = gcs_int
FILE_FORMAT = parquet_unload_file_format
COMMENT = 'GCS Stage for the Snowflake external Parquet export';
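To confirm that the stage and integration can reach the bucket, you can list its contents; the listing may be empty if nothing has been unloaded yet:
LIST @parquet_unload_gcs_stage;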

5. Unload Data

Use the COPY INTO <location> command to unload data from a Snowflake table into the GCS bucket. To retain the column names in the output file, use the HEADER = TRUE copy option.
COPY INTO @<EXTERNAL_STAGE_NAME>/<file_name_prefix>
FROM DB_NAME.SCHEMA_NAME.TABLE_NAME
HEADER = TRUE;
For example:
COPY INTO @parquet_unload_gcs_stage/region
FROM SF01.TEST01.REGION
HEADER=TRUE;
For the minimal fields Aampe needs, refer to Data Models. The COPY command also supports parameters that limit what is unloaded (for example, a single partition) and optimize the output files: https://docs.snowflake.com/en/sql-reference/sql/copy-into-location.html
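As a sketch of those options, the following unloads a single day of events into a dated prefix and caps the size of each output file; the table, column, date, and file size shown are illustrative:
COPY INTO @parquet_unload_gcs_stage/events/2023-01-01/
FROM (SELECT * FROM SF01.TEST01.EVENTS WHERE event_date = '2023-01-01')
HEADER = TRUE
MAX_FILE_SIZE = 268435456;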

6. Automate the Unloading

We recommend automating the COPY command so that each day’s events (a single partition) are unloaded into the external stage on a schedule, using a Snowflake TASK: https://docs.snowflake.com/en/sql-reference/sql/create-task.html An example would be:
CREATE TASK copy_to_gcs
SCHEDULE = 'USING CRON 0 10 * * * Europe/Rome'
COMMENT = 'Test Schedule' AS
COPY INTO @parquet_unload_gcs_stage/region
FROM SF01.TEST01.REGION
HEADER=TRUE;
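Note that a newly created task starts out suspended; resume it so the schedule takes effect:
ALTER TASK copy_to_gcs RESUME;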
If you use Airflow for orchestration on your side, you can also use it to schedule this query at a regular interval.