AWS Certified Data Analytics – Specialty

AWS Certified Data Analytics – Specialty Fundamentals: Quiz 1 Study Guide

Understanding the AWS data analytics ecosystem is essential for anyone building modern data pipelines. Whether you're ingesting streaming events, transforming raw files, or visualizing business metrics, AWS offers purpose-built services for every layer of the data journey. This guide maps out those services, their roles, and how they connect — giving you exactly what you need to ace Quiz 1.

The AWS Data Analytics Landscape

Think of a data analytics architecture like a factory assembly line:

Raw materials arrive (data ingestion)

Materials are processed and shaped (ETL / transformation)

Finished goods are stored (data lake / data warehouse)

Quality is inspected (data quality / governance)

Reports are delivered to management (visualization)

AWS has a dedicated service for nearly every stage.

Data Ingestion: Getting Data In

Real-Time Ingestion with Amazon Kinesis

Amazon Kinesis is the go-to AWS service for ingesting and processing real-time streaming data — think clickstreams, IoT sensor readings, or application logs flowing in continuously.

Service	Best For
Kinesis Data Streams	Custom real-time processing; low latency; consumers like Lambda or custom apps
Kinesis Data Firehose	Fully managed delivery to S3, Redshift, or OpenSearch; no coding required

Analogy: Kinesis Data Streams is like a live conveyor belt you control; Kinesis Data Firehose is a delivery truck that drops packages at your destination automatically.
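
To make the "conveyor belt you control" side concrete, here is a minimal sketch of a producer writing one event to a Kinesis Data Streams stream through boto3's `put_record` call. The stream name `clickstream-demo`, the event fields, and both helper names are hypothetical. Sending needs AWS credentials and an existing stream, so that part sits in a function and is not run on import.

```python
import json


def build_record(event: dict, key_field: str = "user_id") -> dict:
    """Build the Data and PartitionKey arguments for a PutRecord call."""
    return {
        "Data": json.dumps(event).encode("utf-8"),  # payload must be bytes
        "PartitionKey": str(event[key_field]),      # same key -> same shard
    }


def send_event(event: dict, stream_name: str = "clickstream-demo") -> dict:
    """Send one event; requires boto3, credentials, and an existing stream."""
    import boto3  # imported lazily so build_record works without AWS access

    client = boto3.client("kinesis")
    return client.put_record(StreamName=stream_name, **build_record(event))


# Hypothetical clickstream event, built locally without calling AWS.
record = build_record({"user_id": 42, "page": "/pricing"})
print(record["PartitionKey"], record["Data"])
```

With Firehose, the consumer side of this picture goes away: the service itself delivers the records to the configured destination.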

Batch Ingestion

For non-real-time workloads, data is often loaded in bulk via AWS Glue jobs, direct S3 uploads, or database migration tools such as AWS Database Migration Service (AWS DMS).

Storage: Data Lakes and Data Warehouses

Amazon S3 — The Data Lake Foundation

Amazon S3 is the backbone of nearly every AWS data lake. It stores structured, semi-structured, and unstructured data at virtually unlimited scale and at low cost.

A data lake on S3 centralizes raw and processed data in one place

Data can be organized using partitioning (e.g., by year/month/day) so that query engines such as Athena read only the partitions a filter matches, which improves query performance and reduces cost

AWS Lake Formation adds governance, security, and fine-grained access control on top of S3
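
The partitioning idea can be sketched in a few lines of plain Python. The example builds Hive-style `year=/month=/day=` prefixes and then keeps only the object keys whose partition values match a filter, which is roughly what a query engine does before it reads any data. The table name `sales`, the file names, and both helper functions are hypothetical.

```python
from datetime import date


def partition_prefix(table: str, day: date) -> str:
    """Return a Hive-style prefix such as sales/year=2024/month=03/day=15/."""
    return f"{table}/year={day.year:04d}/month={day.month:02d}/day={day.day:02d}/"


def prune(keys: list[str], **filters: str) -> list[str]:
    """Keep only object keys whose partition values match every filter."""
    wanted = [f"/{name}={value}/" for name, value in filters.items()]
    return [k for k in keys if all(w in "/" + k for w in wanted)]


keys = [
    partition_prefix("sales", date(2024, 3, 15)) + "part-0000.parquet",
    partition_prefix("sales", date(2024, 4, 1)) + "part-0000.parquet",
    partition_prefix("sales", date(2023, 3, 15)) + "part-0000.parquet",
]

# Only one of the three objects has to be read for a March 2024 query.
march_2024 = prune(keys, year="2024", month="03")
print(march_2024)
```

The fewer objects that survive pruning, the less data a query has to scan.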

Amazon Redshift — The Data Warehouse

Amazon Redshift is AWS's fully managed data warehouse, optimized for analytical queries over large volumes of structured data.

Uses columnar storage — data is stored by column, not row, making aggregations (SUM, AVG) extremely fast

Supports distribution keys to control how data is spread across nodes, reducing data movement during joins

Best for structured, schema-defined data used in regular reporting
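
The effect of a distribution key can be illustrated with a toy model. In this sketch, rows from two tables are placed on slices by hashing the same key column, so rows that join on that key end up together. The hash function (`zlib.crc32`), the slice count, and the table contents are stand-ins chosen for illustration. Redshift's own hashing is internal to the service.

```python
import zlib
from collections import defaultdict

NUM_SLICES = 4  # hypothetical cluster size


def slice_for(dist_key: object) -> int:
    """Map a distribution-key value to a slice with a stable stand-in hash."""
    return zlib.crc32(str(dist_key).encode("utf-8")) % NUM_SLICES


def distribute(rows: list[dict], key: str) -> dict[int, list[dict]]:
    """Place each row on the slice chosen by its distribution-key value."""
    slices = defaultdict(list)
    for row in rows:
        slices[slice_for(row[key])].append(row)
    return dict(slices)


customers = [{"customer_id": i, "name": f"c{i}"} for i in range(1, 6)]
orders = [{"order_id": 100 + i, "customer_id": (i % 5) + 1} for i in range(10)]

customer_slices = distribute(customers, "customer_id")
order_slices = distribute(orders, "customer_id")

# Check that every order sits on the same slice as its customer row.
colocated = all(
    any(c["customer_id"] == o["customer_id"] for c in customer_slices.get(s, []))
    for s, rows in order_slices.items()
    for o in rows
)
print(colocated)
```

Because both tables are distributed on `customer_id`, a join on that column can run slice by slice without moving rows between nodes.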

Data Lake vs. Data Warehouse:

| | Data Lake (S3) | Data Warehouse (Redshift) |
|---|---|---|
| Data type | Any (structured, unstructured) | Structured |
| Schema | Schema-on-read | Schema-on-write |
| Cost | Very low | Higher |
| Query speed | Slower (without optimization) | Very fast |
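
The schema row of the comparison is the easiest one to show in code. The sketch below contrasts the two approaches on the same raw input: schema-on-write validates records at load time and rejects bad ones, while schema-on-read stores everything and applies the schema only when a query runs. The two-column schema and the function names are hypothetical.

```python
import json

SCHEMA = {"order_id": int, "amount": float}  # hypothetical two-column schema


def conforms(record: dict) -> bool:
    """True when the record has every schema column with the expected type."""
    return all(isinstance(record.get(col), typ) for col, typ in SCHEMA.items())


def load_schema_on_write(raw_lines: list[str]) -> list[dict]:
    """Warehouse style: validate while loading and reject bad rows up front."""
    rows = [json.loads(line) for line in raw_lines]
    bad = [r for r in rows if not conforms(r)]
    if bad:
        raise ValueError(f"{len(bad)} row(s) violate the schema")
    return rows


def query_schema_on_read(raw_lines: list[str]) -> float:
    """Lake style: store anything, apply the schema only when querying."""
    rows = (json.loads(line) for line in raw_lines)
    return sum(r["amount"] for r in rows if conforms(r))


raw = ['{"order_id": 1, "amount": 9.5}', '{"order_id": "x2", "note": "malformed"}']

# The lake-style query skips the malformed record instead of failing the load.
print(query_schema_on_read(raw))
```

Calling `load_schema_on_write(raw)` on the same input raises `ValueError`, which is the trade-off in miniature: stricter loading in exchange for cleaner, faster queries later.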

ETL and Transformation

AWS Glue — The ETL Engine

AWS Glue is a fully managed ETL (Extract, Transform, Load) service. Its primary purposes are:

Discovering and cataloging data via the AWS Glue Data Catalog (stores schema and metadata)

Running ETL jobs written in Python or Scala (using Apache Spark under the hood)

Crawling data sources to automatically infer schemas
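
Schema inference can be pictured with a toy example. The sketch below looks at a few sample records and picks a type name for each column. It only illustrates the idea: real Glue crawlers rely on classifiers and handle many formats, and the type names, sample records, and function names here are assumptions.

```python
def infer_type(values: list) -> str:
    """Pick the narrowest type name that fits every non-null sample value."""
    present = [v for v in values if v is not None]
    if present and all(isinstance(v, bool) for v in present):
        return "boolean"
    if present and all(isinstance(v, int) and not isinstance(v, bool) for v in present):
        return "bigint"
    if present and all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in present):
        return "double"
    return "string"


def infer_schema(records: list[dict]) -> dict[str, str]:
    """Infer a column-to-type mapping from sample records."""
    columns = sorted({col for rec in records for col in rec})
    return {col: infer_type([rec.get(col) for rec in records]) for col in columns}


samples = [
    {"order_id": 1, "amount": 19.99, "status": "completed"},
    {"order_id": 2, "amount": 5, "status": "pending", "gift": True},
]

# Columns missing from some records are still discovered.
print(infer_schema(samples))
```

The resulting mapping is the kind of metadata the Data Catalog stores so that other services can query the data without redefining its schema.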

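# Example: simple Glue ETL job snippet (runs inside an AWS Glue Spark job)
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glueContext = GlueContext(SparkContext.getOrCreate())
# Read the cataloged table as a DynamicFrame
datasource = glueContext.create_dynamic_frame.from_catalog(
    database="sales_db",
    table_name="raw_orders"
)
# Keep only the completed orders
transformed = datasource.filter(lambda x: x["status"] == "completed")
# Sink arguments (connection type, options, format) are elided here
glueContext.write_dynamic_frame.from_options(transformed, ...)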

AWS Glue DataBrew

DataBrew is a visual, no-code data preparation tool built on top of Glue. It's designed for data quality checks and cleaning, which makes it a good fit for analysts who aren't engineers.

Querying Data: Athena

Amazon Athena is a serverless, interactive query service that lets you run SQL directly against data stored in S3 — no infrastructure to manage.

-- Query partitioned Parquet data in S3
SELECT product_id, SUM(revenue) AS total_revenue
FROM sales_data
WHERE year = '2024' AND month = '03'
GROUP BY product_id;
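
The same query can also be submitted from code. This is a minimal sketch that assumes boto3's Athena client and its `start_query_execution` call. The database name `sales_db` is borrowed from the Glue example, and the results bucket `s3://example-athena-results/` and both helper names are hypothetical. Submitting needs AWS credentials, so only the request is built at import time.

```python
def build_athena_request(sql: str, database: str, output_location: str) -> dict:
    """Build keyword arguments for Athena's StartQueryExecution call."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output_location},
    }


def run_query(request: dict) -> str:
    """Submit the query and return its execution ID (needs boto3 and credentials)."""
    import boto3  # lazy import: the builder above runs without AWS access

    client = boto3.client("athena")
    return client.start_query_execution(**request)["QueryExecutionId"]


SQL = """
SELECT product_id, SUM(revenue) AS total_revenue
FROM sales_data
WHERE year = '2024' AND month = '03'
GROUP BY product_id
"""

# Build the request locally; run_query(request) would submit it.
request = build_athena_request(SQL, "sales_db", "s3://example-athena-results/")
print(request["QueryExecutionContext"])
```

Athena runs queries asynchronously, so a caller would poll the returned execution ID for status before reading the results from the output location.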

Athena works best when your data is:

Stored in columnar formats like Parquet or ORC

Partitioned by common filter fields (date, region, etc.)

Columnar Formats: Parquet and ORC

Format	Description	Best With
Parquet	Open-source columnar format; great compression	Athena, Glue, Spark
ORC	Optimized Row Columnar; strong with Hive/EMR	EMR, Hive workloads

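Both formats improve performance and reduce cost because a query engine reads only the columns a query references instead of whole rows. With Athena, which bills by the amount of data scanned, reading less data also means a smaller bill.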
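
A rough way to see the benefit is to compare how much data one aggregation touches under each layout. This sketch uses JSON strings as a stand-in for stored data and counts characters as a proxy for bytes. It does not use the Parquet or ORC encodings themselves, and the column names and row contents are made up.

```python
import json

rows = [
    {"product_id": i % 3, "revenue": 10.0 * i, "notes": "x" * 200}
    for i in range(100)
]

# Row layout: each record is stored whole, so a scan reads every field.
row_store = [json.dumps(r) for r in rows]

# Column layout: the values of one column are stored together.
column_store = {name: json.dumps([r[name] for r in rows]) for name in rows[0]}


def scanned_row_layout() -> int:
    """SUM(revenue) over row storage has to read every full record."""
    return sum(len(record) for record in row_store)


def scanned_column_layout(needed: list[str]) -> int:
    """The same query over column storage reads only the needed columns."""
    return sum(len(column_store[name]) for name in needed)


# Compare the data touched by SUM(revenue) under each layout.
print(scanned_row_layout(), scanned_column_layout(["revenue"]))
```

The wide `notes` column dominates the row layout but is never touched when only `revenue` is needed.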

Distributed Computing: Amazon EMR

Amazon EMR (Elastic MapReduce) runs Apache Hadoop and Apache Spark clusters on AWS for large-scale distributed computing and batch processing.

Hadoop: The original distributed processing framework (MapReduce)

Spark: Faster, in-memory processing engine; supports both batch processing and near-real-time stream processing

EMR is ideal when you need full control over your cluster configuration or are running complex ML pipelines
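
The MapReduce model behind Hadoop can be sketched on a single machine. The example below runs the classic word count through its three phases: map, shuffle, and reduce. On a cluster these phases run in parallel across many nodes; here they are plain functions, and the input lines are made up.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line: str) -> list[tuple[str, int]]:
    """Emit a (word, 1) pair for every word in the line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs) -> dict[str, list[int]]:
    """Group all emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def reduce_phase(grouped: dict[str, list[int]]) -> dict[str, int]:
    """Combine the values for each key into a single count."""
    return {key: sum(values) for key, values in grouped.items()}


lines = ["spark and hadoop", "Spark on EMR", "hadoop on EMR"]

# Chain the three phases together on the sample input.
mapped = chain.from_iterable(map_phase(line) for line in lines)
counts = reduce_phase(shuffle(mapped))
print(counts)
```

Spark expresses the same pipeline with fewer steps and keeps intermediate data in memory, which is where most of its speed advantage comes from.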

Metadata, Governance, and Discovery

AWS Glue Data Catalog: Central repository of metadata and schema information; acts as a data catalog for data discovery across S3, Redshift, and more

AWS Lake Formation: Manages data governance, security, and data lineage — tracking where data came from and how it was transformed

Data lineage helps organizations understand the full journey of a data asset, critical for compliance and debugging
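
Lineage is easiest to picture as a graph of datasets and their sources. This sketch is a generic illustration and does not use any AWS API: each dataset lists the datasets it was derived from, and a small traversal answers the question "what does this asset depend on?". All dataset names are hypothetical.

```python
# Each dataset maps to the datasets it was derived from (hypothetical names).
LINEAGE = {
    "raw_orders": [],
    "raw_customers": [],
    "clean_orders": ["raw_orders"],
    "orders_enriched": ["clean_orders", "raw_customers"],
    "revenue_dashboard": ["orders_enriched"],
}


def upstream(dataset: str) -> set[str]:
    """Return every dataset that feeds into `dataset`, directly or indirectly."""
    seen = set()
    stack = list(LINEAGE.get(dataset, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(LINEAGE.get(current, []))
    return seen


# List everything the dashboard depends on.
print(sorted(upstream("revenue_dashboard")))
```

When a dashboard number looks wrong, a traversal like this narrows the search to the datasets that could have introduced the problem.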

Visualization: Amazon QuickSight

Amazon QuickSight is AWS's cloud-native business intelligence tool for creating dashboards and visualizations. It connects to Redshift, Athena, S3, and other sources, enabling non-technical users to explore data through interactive charts and reports.

Key Takeaways

Kinesis (Streams for custom real-time; Firehose for managed delivery) handles real-time data ingestion, while S3 serves as the foundation for scalable data lakes.

AWS Glue is the primary ETL service — it transforms data AND catalogs schema/metadata for the entire ecosystem.

Athena lets you query S3 data with SQL serverlessly; use Parquet/ORC columnar formats and partitioning to maximize performance and minimize cost.

Redshift is purpose-built for structured data warehousing with columnar storage and distribution keys for fast analytics.

EMR (Hadoop/Spark) handles heavy distributed batch and real-time processing, while Lake Formation governs security, access, and data lineage across your data lake.