Redshift Integration
There are two recommended approaches for integrating Amazon Redshift with Aampe:
- Unload data into Amazon S3 from Redshift
- Give the Aampe user direct access to your Redshift cluster and the tables containing event data
Unloading Data into Amazon S3 from Redshift
1. Create IAM Role for Redshift
Create an IAM role that allows Redshift to access S3. This follows AWS security best practices by using roles instead of hardcoded credentials.
-- First, create the IAM role in AWS Console or CLI with the following trust policy:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "redshift.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
Attach the following IAM policy to the role (replace your-bucket-name with the actual bucket):
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetBucketLocation",
        "s3:GetObject",
        "s3:ListBucket",
        "s3:PutObject",
        "s3:DeleteObject"
      ],
      "Resource": [
        "arn:aws:s3:::your-bucket-name",
        "arn:aws:s3:::your-bucket-name/*"
      ]
    }
  ]
}
2. Attach IAM Role to Redshift Cluster
Associate the IAM role with your Redshift cluster:
# Get the role ARN and attach it to your cluster
# (this can also be done from the AWS Console)
aws redshift modify-cluster-iam-roles --cluster-identifier your-cluster-name --add-iam-roles arn:aws:iam::account-id:role/RedshiftS3Role
3. Unload Data to S3
Use the UNLOAD command to export data from Redshift to S3. Redshift’s UNLOAD command is optimized for parallel processing and handles large datasets efficiently.
-- Example: Unload the last day of event data to S3 in Parquet format
-- PARTITION BY appends event_date=YYYY-MM-DD/ folders under the target prefix,
-- so the prefix itself does not need hardcoded date folders
UNLOAD ('SELECT * FROM events WHERE event_date >= current_date - 1')
TO 's3://your-bucket-name/events/'
IAM_ROLE 'arn:aws:iam::account-id:role/RedshiftS3Role'
FORMAT AS PARQUET
PARTITION BY (event_date)
ALLOWOVERWRITE
PARALLEL ON;
Key UNLOAD parameters for optimization:
- FORMAT AS PARQUET: Efficient columnar format, recommended for analytics
- PARTITION BY: Organizes data for better query performance
- PARALLEL ON: Enables parallel processing (the default, but explicit is better)
- ALLOWOVERWRITE: Overwrites existing files
- MAXFILESIZE: Controls individual file size (default 6.2 GB)
- MANIFEST: Generates a manifest file for data verification
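Assembled programmatically, the options above compose into a single statement. A minimal sketch (the `build_unload` helper and its parameter names are ours for illustration, not part of any Redshift API):

```python
def build_unload(query, s3_prefix, iam_role,
                 fmt="PARQUET", partition_by=None,
                 allow_overwrite=True, parallel=True,
                 max_file_size_mb=None, manifest=False):
    """Assemble an UNLOAD statement from the options discussed above."""
    parts = [
        f"UNLOAD ('{query}')",
        f"TO '{s3_prefix}'",
        f"IAM_ROLE '{iam_role}'",
        f"FORMAT AS {fmt}",
    ]
    if partition_by:
        parts.append(f"PARTITION BY ({', '.join(partition_by)})")
    if allow_overwrite:
        parts.append("ALLOWOVERWRITE")
    parts.append("PARALLEL ON" if parallel else "PARALLEL OFF")
    if max_file_size_mb:
        parts.append(f"MAXFILESIZE {max_file_size_mb} MB")
    if manifest:
        parts.append("MANIFEST")
    return "\n".join(parts) + ";"
```

This keeps the option set in one place if you need to unload several tables with the same settings.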
Alternative formats:
-- CSV format with header
UNLOAD ('SELECT * FROM events WHERE event_date >= current_date - 1')
TO 's3://your-bucket-name/events/csv/'
IAM_ROLE 'arn:aws:iam::account-id:role/RedshiftS3Role'
FORMAT AS CSV
HEADER
PARALLEL ON;
-- JSON format
UNLOAD ('SELECT * FROM events WHERE event_date >= current_date - 1')
TO 's3://your-bucket-name/events/json/'
IAM_ROLE 'arn:aws:iam::account-id:role/RedshiftS3Role'
FORMAT AS JSON
PARALLEL ON;
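When targeting date-partitioned prefixes like the ones in these examples, it helps to derive the `year=/month=/day=` path from a date object rather than hardcoding it. A small sketch (the helper name is ours):

```python
from datetime import date

def partition_prefix(base, d):
    """Build an S3 prefix like base/year=2024/month=09/day=18/ for date d."""
    return f"{base}/year={d.year}/month={d.month:02d}/day={d.day:02d}/"
```

The zero-padded months and days keep prefixes lexicographically sortable, which most downstream tooling expects.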
4. Automate the Unloading Process
Option A: Using Redshift Scheduler (Recommended)
Redshift has no SQL statement for scheduling. Instead, either schedule the query from the Redshift console (Query Editor v2 supports scheduled queries) or create an EventBridge rule that triggers the UNLOAD through the Redshift Data API:
# Create an EventBridge rule that fires at 2 AM UTC daily
aws events put-rule --name daily-unload-events --schedule-expression "cron(0 2 * * ? *)" --description "Daily unload of events to S3"
# Then register the target that runs the UNLOAD (e.g. a Lambda) with: aws events put-targets
Option B: Using AWS Lambda + EventBridge
Create a Lambda function that executes the UNLOAD command:
import boto3
import json
from datetime import date, timedelta

def lambda_handler(event, context):
    redshift_data = boto3.client('redshift-data')
    yesterday = date.today() - timedelta(days=1)
    # Submit the UNLOAD for yesterday's events via the Redshift Data API
    response = redshift_data.execute_statement(
        ClusterIdentifier='your-cluster-name',
        Database='your-database',
        DbUser='your-db-user',
        Sql=f"""
            UNLOAD ('SELECT * FROM events WHERE event_date = current_date - 1')
            TO 's3://your-bucket-name/events/year={yesterday.year}/month={yesterday.month:02d}/day={yesterday.day:02d}/'
            IAM_ROLE 'arn:aws:iam::account-id:role/RedshiftS3Role'
            FORMAT AS PARQUET
            ALLOWOVERWRITE
            PARALLEL ON;
        """
    )
    return {
        'statusCode': 200,
        'body': json.dumps(f'Query submitted: {response["Id"]}')
    }
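Note that `execute_statement` returns as soon as the query is submitted, so the Lambda above does not confirm that the UNLOAD finished. If you need confirmation, you can poll the Data API's `describe_statement` call; a sketch (the helper name and timeout defaults are ours, and the client is passed in so it can be stubbed):

```python
import time

def wait_for_statement(redshift_data, statement_id,
                       poll_seconds=5, timeout_seconds=300):
    """Poll the Redshift Data API until the statement finishes or times out."""
    deadline = time.time() + timeout_seconds
    while time.time() < deadline:
        desc = redshift_data.describe_statement(Id=statement_id)
        status = desc["Status"]
        if status == "FINISHED":
            return desc
        if status in ("FAILED", "ABORTED"):
            raise RuntimeError(f"Statement {statement_id} {status}: {desc.get('Error')}")
        time.sleep(poll_seconds)
    raise TimeoutError(f"Statement {statement_id} still running after {timeout_seconds}s")
```

Keep the Lambda's own timeout above `timeout_seconds` if you call this inside the handler, or split submission and polling into two functions.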
5. Monitoring and Optimization
Monitor UNLOAD performance:
-- Check UNLOAD query performance
SELECT query,
       datediff(seconds, starttime, endtime) AS elapsed_seconds,
       aborted,
       substring(querytxt, 1, 60) AS sql_text
FROM stl_query
WHERE querytxt ILIKE '%UNLOAD%'
ORDER BY starttime DESC
LIMIT 10;
Best practices for optimization:
- Use appropriate distribution keys on source tables
- Consider VACUUM and ANALYZE operations before UNLOAD
- Monitor S3 costs and implement lifecycle policies
- Use compression (GZIP, BZIP2) for cost optimization if network transfer is a concern
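For the lifecycle policies mentioned above, a minimal S3 lifecycle configuration might look like the following (the rule ID, prefix, and retention periods are placeholders to adapt to your retention requirements):

```json
{
  "Rules": [
    {
      "ID": "expire-unloaded-events",
      "Filter": { "Prefix": "events/" },
      "Status": "Enabled",
      "Transitions": [
        { "Days": 30, "StorageClass": "STANDARD_IA" }
      ],
      "Expiration": { "Days": 90 }
    }
  ]
}
```

Apply it with `aws s3api put-bucket-lifecycle-configuration` or from the S3 console.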
Provide Direct Access for Aampe User to Your Redshift Cluster
This approach allows Aampe to directly query your Redshift cluster. Follow the principle of least privilege for security.
Setup Steps:
1. Create a Custom Database Role (if needed)
-- Create a read-only role for Aampe
CREATE ROLE aampe_read_role;
2. Create Database User
-- Create user with password authentication
-- (Redshift passwords must be 8-64 characters with an upper-case letter, a lower-case letter, and a digit)
CREATE USER aampe_user PASSWORD 'SecurePassword123';
-- Alternative: disable the password and rely on IAM authentication
CREATE USER aampe_user PASSWORD DISABLE;
3. Grant Role to User
-- Grant the read role to the user
GRANT ROLE aampe_read_role TO aampe_user;
4. Grant Table-Level Permissions
-- Grant access to specific schemas and tables
-- Replace 'your_schema' and table names with your actual schema/tables
-- Grant schema usage
GRANT USAGE ON SCHEMA your_schema TO ROLE aampe_read_role;
-- Grant table select permissions
GRANT SELECT ON your_schema.events TO ROLE aampe_read_role;
GRANT SELECT ON your_schema.users TO ROLE aampe_read_role;
GRANT SELECT ON your_schema.sessions TO ROLE aampe_read_role;
-- Or grant access to all tables in the schema (current and future)
GRANT SELECT ON ALL TABLES IN SCHEMA your_schema TO ROLE aampe_read_role;
ALTER DEFAULT PRIVILEGES IN SCHEMA your_schema GRANT SELECT ON TABLES TO ROLE aampe_read_role;
VPC Security Groups: Ensure your Redshift cluster’s security group allows connections from Aampe’s IP addresses on port 5439.
Subnet Groups: If using private subnets, ensure proper routing and NAT gateway configuration.
Connection Details Required
Provide Aampe with the following connection information:
- Host/Endpoint: your-cluster-name.abc123.us-east-1.redshift.amazonaws.com
- Port: 5439 (default Redshift port)
- Database Name: Your database name
- Schema Name: Schema containing event tables
- Username: aampe_user
- Password: The password you created
- Table Names: List of tables containing event data
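As a client-side sketch of how these details fit together, they map directly onto a standard libpq-style connection string (the helper name is ours; Redshift speaks the PostgreSQL wire protocol, so psycopg2-compatible drivers accept this form):

```python
def redshift_dsn(host, database, user, password, port=5439):
    """Build a libpq-style DSN from the connection details listed above."""
    return (f"host={host} port={port} dbname={database} "
            f"user={user} password={password}")
```

The schema and table names are then referenced in queries rather than in the DSN itself.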
Data Models
Refer to the Data Models documentation for the expected schema and data format requirements.