Day 29: Athena & Redshift

Jeeva-AWSLabsJourney

3 min readNov 28, 2023

𝐀𝐭𝐡𝐞𝐧𝐚:

👉 Serverless query service to Analyse data stored in Amazon S3

👉 Performance Improvement-columnar data, compress data, Partition datasets, Use larger files

👉 Uses Data Source Connectors that run on AWS Lambda to run Federated Queries (e.g., CloudWatch Logs, DynamoDB, RDS, …)

Athena:

Serverless Query Service:

Allows you to analyze data stored in Amazon S3 without the need for infrastructure management.
Queries are executed on an ad-hoc basis, and you pay only for the queries you run.

2.Performance Improvement:

Strategies include using columnar data, compressing data, partitioning datasets, and using larger files to optimize query performance.
Columnar Data: Storing data in columnar format can improve query performance, as only the necessary columns are read.
Data Compression: Compressing data reduces storage costs and improves query performance by reducing the amount of data to be scanned.
Partition Datasets: Partitioning involves dividing large datasets into smaller, more manageable parts, improving query efficiency.
Use Larger Files: Working with larger files can enhance performance by reducing the number of files to be processed.

3.Data Source Connectors:

Utilizes Data Source Connectors running on AWS Lambda to execute Federated Queries across different data sources like CloudWatch Logs, DynamoDB, RDS, etc.

𝐑𝐞𝐝𝐬𝐡𝐢𝐟𝐭:

👉 PostgreSQL based Db with online analytical processing (analytics and data warehousing)

👉 BI tools such as Amazon Quick sight or Tableau integrate with it

👉 Query>cluster> Leader node> Compute Nodes

👉 Snapshots are point-in-time backups of a cluster, stored internally in S3

👉 3 ways of loading data Kinesis Data Firehose/ S3 using COPY/ EC2 Instance JDBC driver

Redshift:

PostgreSQL-based DB:

Redshift is based on PostgreSQL, but it’s optimized for online analytical processing (OLAP) workloads.

2.Integration with BI Tools:

Integrates seamlessly with Business Intelligence (BI) tools such as Amazon Quick sight or Tableau for data visualization and analytics.

3.Query Execution:

Queries are processed in a distributed manner on a Redshift cluster, consisting of a Leader Node and Compute Nodes.

4.Snapshots:

Supports point-in-time backups through snapshots, which are stored internally in Amazon S3.
Snapshots in Redshift are point-in-time backups of a cluster.
These snapshots are stored internally in Amazon S3, providing a reliable backup and restore mechanism.

5.Data Loading:

Provides multiple ways to load data, including Kinesis Data Firehose, loading from S3 using the COPY command, and using an EC2 instance with a JDBC driver.
Kinesis Data Firehose: Data can be loaded into Redshift from Amazon Kinesis Data Firehose, which is a fully managed service for loading streaming data.
S3 using COPY Command: Redshift supports loading data directly from Amazon S3 using the COPY command, providing a straightforward and efficient method.
EC2 Instance JDBC Driver: Data can also be loaded into Redshift using an EC2 instance with a JDBC driver, offering flexibility in data loading methods.

Scenario: Data Warehousing for E-commerce Analytics

Consider an e-commerce business with a large dataset containing customer transactions, product information, and user behavior. The company wants to perform complex analytics and generate reports for business insights.

Use Case Steps:

Data Warehouse Setup:

Deploy an Amazon Redshift cluster to serve as the centralized data warehouse.

2.Data Loading:

Load transaction data from various sources, such as online stores, into Redshift.
Utilize the flexibility of Redshift’s data loading options, including Kinesis Data Firehose, COPY from S3, and EC2 Instance with JDBC driver.

3.Integration with BI Tools:

Integrate Amazon Redshift seamlessly with BI tools like Amazon Quicksight or Tableau.

4.OLAP Analytics:

Leverage Redshift’s online analytical processing (OLAP) capabilities to run complex queries and analytics on the large dataset.
Use the cluster architecture with Leader Node and Compute Nodes for parallel processing.

5.Point-in-Time Backups:

Ensure data durability and recovery by taking point-in-time backups (snapshots) of the Redshift cluster, stored internally in S3.

6.Scalability:

Scale the Redshift cluster up or down based on the changing analytics needs of the business.

Day 29: Athena & Redshift

Athena:

Redshift:

Written by Jeeva-AWSLabsJourney

No responses yet