
Redshift Integration

There are two recommended approaches for integrating Amazon Redshift with Aampe:
  • Unload data into Amazon S3 from Redshift
  • Provide the Aampe user with direct access to your Redshift cluster (and the tables containing event data)

Unloading Data into Amazon S3 from Redshift

1. Create IAM Role for Redshift

Create an IAM role that allows Redshift to access S3. This follows AWS security best practices by using roles instead of hardcoded credentials.
-- First, create the IAM role in AWS Console or CLI with the following trust policy:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "redshift.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
Attach the following IAM policy to the role (replace your-bucket-name with the actual bucket):
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetBucketLocation",
        "s3:GetObject",
        "s3:ListBucket",
        "s3:PutObject",
        "s3:DeleteObject"
      ],
      "Resource": [
        "arn:aws:s3:::your-bucket-name",
        "arn:aws:s3:::your-bucket-name/*"
      ]
    }
  ]
}
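If you prefer to script this step, the following boto3 sketch creates the role and attaches the policy. The role name RedshiftS3Role and the inline policy name are placeholders; replace your-bucket-name as above.

import json
import boto3

iam = boto3.client("iam")

# Trust policy letting Redshift assume the role (same document as above)
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "redshift.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}

# S3 access policy (replace your-bucket-name with the actual bucket)
s3_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": [
            "s3:GetBucketLocation", "s3:GetObject", "s3:ListBucket",
            "s3:PutObject", "s3:DeleteObject",
        ],
        "Resource": [
            "arn:aws:s3:::your-bucket-name",
            "arn:aws:s3:::your-bucket-name/*",
        ],
    }],
}

role = iam.create_role(
    RoleName="RedshiftS3Role",
    AssumeRolePolicyDocument=json.dumps(trust_policy),
)
iam.put_role_policy(
    RoleName="RedshiftS3Role",
    PolicyName="RedshiftS3AccessPolicy",  # inline policy name is arbitrary
    PolicyDocument=json.dumps(s3_policy),
)
print(role["Role"]["Arn"])  # use this ARN in the next step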

2. Attach IAM Role to Redshift Cluster

Associate the IAM role with your Redshift cluster:
-- Get the role ARN and attach it to your cluster
-- This can be done via the AWS Console or CLI
aws redshift modify-cluster-iam-roles --cluster-identifier your-cluster-name --add-iam-roles arn:aws:iam::account-id:role/RedshiftS3Role
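The same association can also be scripted with boto3; a minimal sketch using the placeholder cluster name and role ARN from above:

import boto3

redshift = boto3.client("redshift")

# Attach the IAM role created in step 1 to the cluster
redshift.modify_cluster_iam_roles(
    ClusterIdentifier="your-cluster-name",
    AddIamRoles=["arn:aws:iam::account-id:role/RedshiftS3Role"],
)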

3. Unload Data to S3

Use the UNLOAD command to export data from Redshift to S3. Redshift’s UNLOAD command is optimized for parallel processing and handles large datasets efficiently.
-- Example: Unload event data to S3 in Parquet format
UNLOAD ('SELECT * FROM events WHERE event_date >= current_date - 1')
TO 's3://your-bucket-name/events/year=2024/month=09/day=18/'
IAM_ROLE 'arn:aws:iam::account-id:role/RedshiftS3Role'
FORMAT AS PARQUET
PARTITION BY (event_date)
ALLOWOVERWRITE
PARALLEL ON;
Key UNLOAD parameters for optimization:
  • FORMAT AS PARQUET: Efficient columnar format, recommended for analytics
  • PARTITION BY: Organizes data for better query performance
  • PARALLEL ON: Enables parallel processing (default, but explicit is better)
  • ALLOWOVERWRITE: Overwrites existing files
  • MAXFILESIZE: Controls individual file size (default 6.2GB)
  • MANIFEST: Generates manifest file for data verification
Alternative formats:
-- CSV format with header
UNLOAD ('SELECT * FROM events WHERE event_date >= current_date - 1')
TO 's3://your-bucket-name/events/csv/'
IAM_ROLE 'arn:aws:iam::account-id:role/RedshiftS3Role'
FORMAT AS CSV
HEADER
PARALLEL ON;

-- JSON format
UNLOAD ('SELECT * FROM events WHERE event_date >= current_date - 1')
TO 's3://your-bucket-name/events/json/'
IAM_ROLE 'arn:aws:iam::account-id:role/RedshiftS3Role'
FORMAT AS JSON
PARALLEL ON;
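The MAXFILESIZE and MANIFEST options listed above can be combined with any of these formats. If you submit UNLOADs programmatically, the sketch below uses the Redshift Data API; the cluster, database, user, and bucket names are the same placeholders used throughout:

import time
import boto3

redshift_data = boto3.client("redshift-data")

# Submit an UNLOAD that caps file size and writes a manifest for verification
unload_sql = """
UNLOAD ('SELECT * FROM events WHERE event_date >= current_date - 1')
TO 's3://your-bucket-name/events/adhoc/'
IAM_ROLE 'arn:aws:iam::account-id:role/RedshiftS3Role'
FORMAT AS PARQUET
MAXFILESIZE 256 MB
MANIFEST
ALLOWOVERWRITE
PARALLEL ON;
"""

response = redshift_data.execute_statement(
    ClusterIdentifier="your-cluster-name",
    Database="your-database",
    DbUser="your-db-user",
    Sql=unload_sql,
)

# Poll until the statement finishes
while True:
    status = redshift_data.describe_statement(Id=response["Id"])["Status"]
    if status in ("FINISHED", "FAILED", "ABORTED"):
        print(f"UNLOAD ended with status: {status}")
        break
    time.sleep(5)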

4. Automate the Unloading Process

Option A: Using the Redshift Query Scheduler (Recommended)

Schedule the UNLOAD query to run daily. Redshift does not provide a SQL command for scheduling; instead, use the schedule option in the Redshift console query editor, or create an Amazon EventBridge rule (for example, cron(0 2 * * ? *) to run at 2 AM UTC) that submits the query through the Redshift Data API.
Option B: Using AWS Lambda + EventBridge

Create a Lambda function that executes the UNLOAD command:
import boto3
import json
from datetime import date, timedelta

def lambda_handler(event, context):
    redshift_data = boto3.client('redshift-data')

    # Build a date-partitioned S3 prefix for yesterday's data
    target_date = date.today() - timedelta(days=1)
    s3_path = (
        f"s3://your-bucket-name/events/"
        f"year={target_date.year}/month={target_date.month:02d}/day={target_date.day:02d}/"
    )

    # Execute the UNLOAD command via the Redshift Data API
    response = redshift_data.execute_statement(
        ClusterIdentifier='your-cluster-name',
        Database='your-database',
        DbUser='your-db-user',
        Sql=f"""
        UNLOAD ('SELECT * FROM events WHERE event_date = current_date - 1')
        TO '{s3_path}'
        IAM_ROLE 'arn:aws:iam::account-id:role/RedshiftS3Role'
        FORMAT AS PARQUET
        ALLOWOVERWRITE
        PARALLEL ON;
        """
    )

    return {
        'statusCode': 200,
        'body': json.dumps(f'Query submitted: {response["Id"]}')
    }
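To trigger the function on a schedule, create an EventBridge rule that targets it. A minimal boto3 sketch follows; the rule name and function ARN are placeholders:

import boto3

events = boto3.client("events")
lambda_client = boto3.client("lambda")

# Placeholder names; replace with your function's ARN and preferred rule name
rule_name = "daily-unload-events"
function_arn = "arn:aws:lambda:us-east-1:account-id:function:unload-events"

# Run at 2 AM UTC daily
rule = events.put_rule(
    Name=rule_name,
    ScheduleExpression="cron(0 2 * * ? *)",
    State="ENABLED",
)

# Point the rule at the Lambda function
events.put_targets(
    Rule=rule_name,
    Targets=[{"Id": "unload-lambda", "Arn": function_arn}],
)

# Allow EventBridge to invoke the function
lambda_client.add_permission(
    FunctionName=function_arn,
    StatementId="allow-eventbridge-daily-unload",
    Action="lambda:InvokeFunction",
    Principal="events.amazonaws.com",
    SourceArn=rule["RuleArn"],
)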

5. Monitoring and Optimization

Monitor UNLOAD performance:
-- Check recent UNLOAD query performance
SELECT query,
       datediff(seconds, starttime, endtime) AS elapsed_seconds,
       aborted,
       substring(querytxt, 1, 60) AS sql_text
FROM stl_query
WHERE querytxt ILIKE '%UNLOAD%'
ORDER BY starttime DESC
LIMIT 10;
Best practices for optimization:
  • Use appropriate distribution keys on source tables
  • Consider VACUUM and ANALYZE operations before UNLOAD
  • Monitor S3 costs and implement lifecycle policies (see the sketch after this list)
  • Use compression (GZIP, BZIP2, or ZSTD) with CSV or JSON output to reduce storage and transfer costs; Parquet output is already compressed
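For the lifecycle-policy bullet above, a minimal boto3 sketch that expires unloaded files after a hypothetical 90-day retention period (bucket name and prefix are the placeholders used earlier):

import boto3

s3 = boto3.client("s3")

# Expire unloaded event files after 90 days (retention period is a placeholder)
s3.put_bucket_lifecycle_configuration(
    Bucket="your-bucket-name",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-unloaded-events",
                "Filter": {"Prefix": "events/"},
                "Status": "Enabled",
                "Expiration": {"Days": 90},
            }
        ]
    },
)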

Provide Direct Access for the Aampe User to Your Redshift Cluster

This approach allows Aampe to directly query your Redshift cluster. Follow the principle of least privilege for security.

Setup Steps:

1. Create a Custom Database Role (if needed)

-- Create a read-only role for Aampe
CREATE ROLE aampe_read_role;

2. Create Database User

-- Create user with password authentication
CREATE USER aampe_user PASSWORD 'secure_password_here';

-- Alternative: Create user with the password disabled for IAM authentication
CREATE USER aampe_user PASSWORD DISABLE;

3. Grant Role to User

-- Grant the read role to the user
GRANT ROLE aampe_read_role TO aampe_user;

4. Grant Table-Level Permissions

-- Grant access to specific schemas and tables
-- Replace 'your_schema' and table names with your actual schema/tables

-- Grant schema usage
GRANT USAGE ON SCHEMA your_schema TO ROLE aampe_read_role;

-- Grant table select permissions
GRANT SELECT ON your_schema.events TO ROLE aampe_read_role;
GRANT SELECT ON your_schema.users TO ROLE aampe_read_role;
GRANT SELECT ON your_schema.sessions TO ROLE aampe_read_role;

-- Or grant access to all tables in the schema (current and future)
GRANT SELECT ON ALL TABLES IN SCHEMA your_schema TO ROLE aampe_read_role;

-- Ensure future tables are also readable (default privileges granted directly to the user)
ALTER DEFAULT PRIVILEGES IN SCHEMA your_schema GRANT SELECT ON TABLES TO aampe_user;

5. Configure Network Security

  • VPC Security Groups: Ensure your Redshift cluster’s security group allows inbound connections from Aampe’s IP addresses on port 5439.
  • Subnet Groups: If the cluster is in private subnets, ensure proper routing and NAT gateway configuration.
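A minimal boto3 sketch for the ingress rule, using a placeholder security group ID and a placeholder CIDR (substitute the IP ranges Aampe provides):

import boto3

ec2 = boto3.client("ec2")

# Allow inbound Redshift traffic (port 5439) from Aampe's IP range
# sg-0123456789abcdef0 and 203.0.113.0/24 are placeholders
ec2.authorize_security_group_ingress(
    GroupId="sg-0123456789abcdef0",
    IpPermissions=[
        {
            "IpProtocol": "tcp",
            "FromPort": 5439,
            "ToPort": 5439,
            "IpRanges": [
                {"CidrIp": "203.0.113.0/24", "Description": "Aampe access to Redshift"}
            ],
        }
    ],
)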

Connection Details Required

Provide Aampe with the following connection information:
  • Host/Endpoint: your-cluster-name.abc123.us-east-1.redshift.amazonaws.com
  • Port: 5439 (default Redshift port)
  • Database Name: Your database name
  • Schema Name: Schema containing event tables
  • Username: aampe_user
  • Password: The password you created
  • Table Names: List of tables containing event data
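Once the user, grants, and network access are in place, you can verify connectivity with a short script. The sketch below uses the redshift_connector package and the placeholder values listed above:

import redshift_connector

# Connect with the details listed above (all values are placeholders)
conn = redshift_connector.connect(
    host="your-cluster-name.abc123.us-east-1.redshift.amazonaws.com",
    port=5439,
    database="your-database",
    user="aampe_user",
    password="secure_password_here",
)

# Confirm the grants work by reading from an event table
cursor = conn.cursor()
cursor.execute("SELECT COUNT(*) FROM your_schema.events;")
print(cursor.fetchone())

conn.close()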

Data Models

Refer to the Data Models documentation for the expected schema and data format requirements.