Day 29: Athena & Redshift
πππ‘ππ§π:
π Serverless query service to Analyse data stored in Amazon S3
π Performance Improvement-columnar data, compress data, Partition datasets, Use larger files
π Uses Data Source Connectors that run on AWS Lambda to run Federated Queries (e.g., CloudWatch Logs, DynamoDB, RDS, β¦)
Athena:
- Serverless Query Service:
- Allows you to analyze data stored in Amazon S3 without the need for infrastructure management.
- Queries are executed on an ad-hoc basis, and you pay only for the queries you run.
2.Performance Improvement:
- Strategies include using columnar data, compressing data, partitioning datasets, and using larger files to optimize query performance.
- Columnar Data: Storing data in columnar format can improve query performance, as only the necessary columns are read.
- Data Compression: Compressing data reduces storage costs and improves query performance by reducing the amount of data to be scanned.
- Partition Datasets: Partitioning involves dividing large datasets into smaller, more manageable parts, improving query efficiency.
- Use Larger Files: Working with larger files can enhance performance by reducing the number of files to be processed.
3.Data Source Connectors:
- Utilizes Data Source Connectors running on AWS Lambda to execute Federated Queries across different data sources like CloudWatch Logs, DynamoDB, RDS, etc.
ππππ¬π‘π’ππ:
π PostgreSQL based Db with online analytical processing (analytics and data warehousing)
π BI tools such as Amazon Quick sight or Tableau integrate with it
π Query>cluster> Leader node> Compute Nodes
π Snapshots are point-in-time backups of a cluster, stored internally in S3
π 3 ways of loading data Kinesis Data Firehose/ S3 using COPY/ EC2 Instance JDBC driver
Redshift:
- PostgreSQL-based DB:
- Redshift is based on PostgreSQL, but itβs optimized for online analytical processing (OLAP) workloads.
2.Integration with BI Tools:
- Integrates seamlessly with Business Intelligence (BI) tools such as Amazon Quick sight or Tableau for data visualization and analytics.
3.Query Execution:
- Queries are processed in a distributed manner on a Redshift cluster, consisting of a Leader Node and Compute Nodes.
4.Snapshots:
- Supports point-in-time backups through snapshots, which are stored internally in Amazon S3.
- Snapshots in Redshift are point-in-time backups of a cluster.
- These snapshots are stored internally in Amazon S3, providing a reliable backup and restore mechanism.
5.Data Loading:
- Provides multiple ways to load data, including Kinesis Data Firehose, loading from S3 using the COPY command, and using an EC2 instance with a JDBC driver.
- Kinesis Data Firehose: Data can be loaded into Redshift from Amazon Kinesis Data Firehose, which is a fully managed service for loading streaming data.
- S3 using COPY Command: Redshift supports loading data directly from Amazon S3 using the COPY command, providing a straightforward and efficient method.
- EC2 Instance JDBC Driver: Data can also be loaded into Redshift using an EC2 instance with a JDBC driver, offering flexibility in data loading methods.
Scenario: Data Warehousing for E-commerce Analytics
Consider an e-commerce business with a large dataset containing customer transactions, product information, and user behavior. The company wants to perform complex analytics and generate reports for business insights.
Use Case Steps:
- Data Warehouse Setup:
- Deploy an Amazon Redshift cluster to serve as the centralized data warehouse.
2.Data Loading:
- Load transaction data from various sources, such as online stores, into Redshift.
- Utilize the flexibility of Redshiftβs data loading options, including Kinesis Data Firehose, COPY from S3, and EC2 Instance with JDBC driver.
3.Integration with BI Tools:
- Integrate Amazon Redshift seamlessly with BI tools like Amazon Quicksight or Tableau.
4.OLAP Analytics:
- Leverage Redshiftβs online analytical processing (OLAP) capabilities to run complex queries and analytics on the large dataset.
- Use the cluster architecture with Leader Node and Compute Nodes for parallel processing.
5.Point-in-Time Backups:
- Ensure data durability and recovery by taking point-in-time backups (snapshots) of the Redshift cluster, stored internally in S3.
6.Scalability:
- Scale the Redshift cluster up or down based on the changing analytics needs of the business.