Day 29: Athena & Redshift

Jeeva-AWSLabsJourney
3 min readNov 28, 2023

--

π€π­π‘πžπ§πš:

πŸ‘‰ Serverless query service to Analyse data stored in Amazon S3

πŸ‘‰ Performance Improvement-columnar data, compress data, Partition datasets, Use larger files

πŸ‘‰ Uses Data Source Connectors that run on AWS Lambda to run Federated Queries (e.g., CloudWatch Logs, DynamoDB, RDS, …)

Athena:

  1. Serverless Query Service:
  • Allows you to analyze data stored in Amazon S3 without the need for infrastructure management.
  • Queries are executed on an ad-hoc basis, and you pay only for the queries you run.

2.Performance Improvement:

  • Strategies include using columnar data, compressing data, partitioning datasets, and using larger files to optimize query performance.
  • Columnar Data: Storing data in columnar format can improve query performance, as only the necessary columns are read.
  • Data Compression: Compressing data reduces storage costs and improves query performance by reducing the amount of data to be scanned.
  • Partition Datasets: Partitioning involves dividing large datasets into smaller, more manageable parts, improving query efficiency.
  • Use Larger Files: Working with larger files can enhance performance by reducing the number of files to be processed.

3.Data Source Connectors:

  • Utilizes Data Source Connectors running on AWS Lambda to execute Federated Queries across different data sources like CloudWatch Logs, DynamoDB, RDS, etc.

π‘πžππ¬π‘π’πŸπ­:

πŸ‘‰ PostgreSQL based Db with online analytical processing (analytics and data warehousing)

πŸ‘‰ BI tools such as Amazon Quick sight or Tableau integrate with it

πŸ‘‰ Query>cluster> Leader node> Compute Nodes

πŸ‘‰ Snapshots are point-in-time backups of a cluster, stored internally in S3

πŸ‘‰ 3 ways of loading data Kinesis Data Firehose/ S3 using COPY/ EC2 Instance JDBC driver

Redshift:

  1. PostgreSQL-based DB:
  • Redshift is based on PostgreSQL, but it’s optimized for online analytical processing (OLAP) workloads.

2.Integration with BI Tools:

  • Integrates seamlessly with Business Intelligence (BI) tools such as Amazon Quick sight or Tableau for data visualization and analytics.

3.Query Execution:

  • Queries are processed in a distributed manner on a Redshift cluster, consisting of a Leader Node and Compute Nodes.

4.Snapshots:

  • Supports point-in-time backups through snapshots, which are stored internally in Amazon S3.
  • Snapshots in Redshift are point-in-time backups of a cluster.
  • These snapshots are stored internally in Amazon S3, providing a reliable backup and restore mechanism.

5.Data Loading:

  • Provides multiple ways to load data, including Kinesis Data Firehose, loading from S3 using the COPY command, and using an EC2 instance with a JDBC driver.
  • Kinesis Data Firehose: Data can be loaded into Redshift from Amazon Kinesis Data Firehose, which is a fully managed service for loading streaming data.
  • S3 using COPY Command: Redshift supports loading data directly from Amazon S3 using the COPY command, providing a straightforward and efficient method.
  • EC2 Instance JDBC Driver: Data can also be loaded into Redshift using an EC2 instance with a JDBC driver, offering flexibility in data loading methods.

Scenario: Data Warehousing for E-commerce Analytics

Consider an e-commerce business with a large dataset containing customer transactions, product information, and user behavior. The company wants to perform complex analytics and generate reports for business insights.

Use Case Steps:

  1. Data Warehouse Setup:
  • Deploy an Amazon Redshift cluster to serve as the centralized data warehouse.

2.Data Loading:

  • Load transaction data from various sources, such as online stores, into Redshift.
  • Utilize the flexibility of Redshift’s data loading options, including Kinesis Data Firehose, COPY from S3, and EC2 Instance with JDBC driver.

3.Integration with BI Tools:

  • Integrate Amazon Redshift seamlessly with BI tools like Amazon Quicksight or Tableau.

4.OLAP Analytics:

  • Leverage Redshift’s online analytical processing (OLAP) capabilities to run complex queries and analytics on the large dataset.
  • Use the cluster architecture with Leader Node and Compute Nodes for parallel processing.

5.Point-in-Time Backups:

  • Ensure data durability and recovery by taking point-in-time backups (snapshots) of the Redshift cluster, stored internally in S3.

6.Scalability:

  • Scale the Redshift cluster up or down based on the changing analytics needs of the business.

--

--

Jeeva-AWSLabsJourney
Jeeva-AWSLabsJourney

Written by Jeeva-AWSLabsJourney

Exploring AWS, cloud, Linux & DevOps. Your guide to navigating the digital realm. Join me on the journey of discovery

No responses yet