AWS Certified Data Analytics – Specialty Fundamentals: Quiz 1 Study Guide
Understanding the AWS data analytics ecosystem is essential for anyone building modern data pipelines. Whether you're ingesting streaming events, transforming raw files, or visualizing business metrics, AWS offers purpose-built services for every layer of the data journey. This guide maps out those services, their roles, and how they connect — giving you exactly what you need to ace Quiz 1.
The AWS Data Analytics Landscape
Think of a data analytics architecture like a factory assembly line: data is ingested, stored, transformed, queried or processed, and finally visualized.
AWS has a dedicated service for nearly every stage.
Data Ingestion: Getting Data In
Real-Time Ingestion with Amazon Kinesis
Amazon Kinesis is the go-to AWS service for ingesting and processing real-time streaming data — think clickstreams, IoT sensor readings, or application logs flowing in continuously.
| Service | Best For |
|---|---|
| Kinesis Data Streams | Custom real-time processing; low latency; consumers like Lambda or custom apps |
| Kinesis Data Firehose | Fully managed delivery to S3, Redshift, or OpenSearch; no coding required |
Analogy: Kinesis Data Streams is like a live conveyor belt you control; Kinesis Data Firehose is a delivery truck that drops packages at your destination automatically.
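To make the difference between the two services concrete, here is a minimal Python sketch that builds the request arguments for each one. It assumes the boto3 SDK's `put_record` calls on the `kinesis` and `firehose` clients; the stream names, the `user_id` field, and the helper names are hypothetical and chosen only for illustration.

```python
import json


def stream_record_args(event: dict, stream_name: str, key_field: str) -> dict:
    """Build keyword arguments for the boto3 Kinesis client's put_record call."""
    return {
        "StreamName": stream_name,
        "Data": json.dumps(event).encode("utf-8"),
        # Records sharing a partition key are routed to the same shard.
        "PartitionKey": str(event[key_field]),
    }


def firehose_record_args(event: dict, delivery_stream: str) -> dict:
    """Build keyword arguments for the boto3 Firehose client's put_record call."""
    # A trailing newline keeps delivered JSON objects on separate lines.
    line = json.dumps(event) + "\n"
    return {
        "DeliveryStreamName": delivery_stream,
        "Record": {"Data": line.encode("utf-8")},
    }


def send_click(event: dict) -> None:
    """Send one event to both services (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the helpers above work without it

    boto3.client("kinesis").put_record(
        **stream_record_args(event, "clickstream-demo", "user_id")
    )
    boto3.client("firehose").put_record(
        **firehose_record_args(event, "clickstream-to-s3-demo")
    )


click = {"user_id": 42, "page": "/pricing"}
print(stream_record_args(click, "clickstream-demo", "user_id")["PartitionKey"])
```

With Data Streams you choose the partition key and write your own consumers; with Firehose you only hand over the payload and the service handles delivery to the configured destination.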
Batch Ingestion
For non-real-time workloads, data is often loaded in bulk via AWS Glue jobs, direct S3 uploads, or database migration tools.
Storage: Data Lakes and Data Warehouses
Amazon S3 — The Data Lake Foundation
Amazon S3 is the backbone of nearly every AWS data lake. It stores structured, semi-structured, and unstructured data at virtually unlimited scale and at low cost.
Amazon Redshift — The Data Warehouse
Amazon Redshift is AWS's fully managed data warehouse, optimized for analytical queries over large volumes of structured data.
Data Lake vs. Data Warehouse:
| Aspect | Data Lake (S3) | Data Warehouse (Redshift) |
|---|---|---|
| Data type | Any (structured, semi-structured, unstructured) | Structured |
| Schema | Schema-on-read | Schema-on-write |
| Cost | Very low | Higher |
| Query speed | Slower (without optimization) | Very fast |
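The schema row is the distinction that is easiest to confuse. The sketch below imitates both approaches in plain Python, with no AWS calls: the warehouse-style loader rejects badly typed rows at load time, while the lake-style reader keeps raw lines untouched and applies the schema only when a query reads them. The `SCHEMA` mapping, the sample rows, and the function names are hypothetical.

```python
import json

# Hypothetical schema: column name -> Python type to coerce to.
SCHEMA = {"order_id": int, "amount": float}


def apply_schema(record: dict) -> dict:
    """Coerce a record to SCHEMA; raises KeyError/ValueError if it does not fit."""
    return {col: cast(record[col]) for col, cast in SCHEMA.items()}


def load_schema_on_write(raw_lines: list[str]) -> list[dict]:
    """Warehouse style: validate every row before it is stored."""
    return [apply_schema(json.loads(line)) for line in raw_lines]


def query_schema_on_read(raw_lines: list[str]) -> list[dict]:
    """Lake style: keep raw lines as-is and interpret them only at query time."""
    rows = []
    for line in raw_lines:
        try:
            rows.append(apply_schema(json.loads(line)))
        except (KeyError, ValueError):
            continue  # this query skips rows it cannot read; nothing was rejected at load
    return rows


raw = [
    '{"order_id": "1", "amount": "19.99"}',
    '{"order_id": "oops", "amount": "5"}',
]

print(query_schema_on_read(raw))
try:
    load_schema_on_write(raw)
except ValueError as err:
    print("load rejected:", err)
```

The trade-off mirrors the table: schema-on-read accepts anything and defers the cost to query time, while schema-on-write pays up front for predictable, fast queries.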
ETL and Transformation
AWS Glue — The ETL Engine
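AWS Glue is a fully managed ETL (Extract, Transform, Load) service. Its primary purposes are cataloging data, with crawlers that infer schemas and register tables in the Glue Data Catalog, and running serverless ETL jobs against those tables.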
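```python
# Example: Simple Glue ETL job snippet (runs inside an AWS Glue job environment)
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glueContext = GlueContext(SparkContext.getOrCreate())
# Read the raw_orders table registered in the Glue Data Catalog
datasource = glueContext.create_dynamic_frame.from_catalog(
    database="sales_db",
    table_name="raw_orders"
)
# Keep only completed orders
transformed = datasource.filter(lambda x: x["status"] == "completed")
# Write the result; connection and format arguments are omitted here
glueContext.write_dynamic_frame.from_options(transformed, ...)
```
AWS Glue DataBrew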
DataBrew is a visual, no-code data preparation tool built on top of Glue. It's designed for data quality checks and cleaning without writing code — great for analysts who aren't engineers.
Querying Data: Athena
Amazon Athena is a serverless, interactive query service that lets you run SQL directly against data stored in S3 — no infrastructure to manage.
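```sql
-- Query partitioned Parquet data in S3
SELECT product_id, SUM(revenue) AS total_revenue
FROM sales_data
WHERE year = '2024' AND month = '03'
GROUP BY product_id;
```
Athena works best when your data is partitioned, stored in a columnar format such as Parquet or ORC, and compressed, because each of these reduces the amount of data scanned per query.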
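The `year` and `month` filters in the query above only reduce scan volume if the data is laid out in matching partitions. A common convention is Hive-style `key=value` prefixes in the S3 object key. The sketch below builds such keys and imitates partition pruning locally; the `sales_data` prefix mirrors the table name in the query, while the file names and helper functions are hypothetical.

```python
from datetime import date


def partitioned_key(prefix: str, day: date, filename: str) -> str:
    """Return an S3 object key using Hive-style year=/month= partitions."""
    return f"{prefix}/year={day.year}/month={day.month:02d}/{filename}"


def keys_matching(keys: list[str], year: str, month: str) -> list[str]:
    """Imitate partition pruning: keep only keys under the requested partition."""
    wanted = f"/year={year}/month={month}/"
    return [k for k in keys if wanted in k]


keys = [
    partitioned_key("sales_data", date(2024, 3, 14), "part-0000.parquet"),
    partitioned_key("sales_data", date(2024, 4, 2), "part-0001.parquet"),
]
print(keys_matching(keys, "2024", "03"))
```

Only the March 2024 object survives the filter, which is the effect a partition filter has on the set of files a query needs to read.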
Columnar Formats: Parquet and ORC
| Format | Description | Best With |
|---|---|---|
| Parquet | Open-source columnar format; great compression | Athena, Glue, Spark |
| ORC | Optimized Row Columnar; strong with Hive/EMR | EMR, Hive workloads |
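Why columnar formats help analytical queries can be shown without any Parquet library. The sketch below stores the same three records row-wise and column-wise, then counts how many values must be touched to sum one column. It is a toy model of the layout idea only, not of the actual Parquet or ORC file structure, and all names and values are invented.

```python
rows = [
    {"product_id": "A", "region": "EU", "revenue": 10.0},
    {"product_id": "B", "region": "US", "revenue": 20.0},
    {"product_id": "A", "region": "US", "revenue": 5.0},
]

# Column-wise layout: one contiguous list per column.
columns = {name: [r[name] for r in rows] for name in rows[0]}


def sum_row_layout(records: list[dict], column: str) -> tuple[float, int]:
    """Sum one column from row storage; every field of every row is read."""
    total, values_read = 0.0, 0
    for record in records:
        values_read += len(record)  # the whole row is loaded
        total += record[column]
    return total, values_read


def sum_column_layout(cols: dict, column: str) -> tuple[float, int]:
    """Sum one column from columnar storage; only that column is read."""
    values = cols[column]
    return sum(values), len(values)


print(sum_row_layout(rows, "revenue"))        # (35.0, 9)
print(sum_column_layout(columns, "revenue"))  # (35.0, 3)
```

Both layouts give the same total, but the columnar one touches a third of the values here, and the gap widens as tables gain more columns.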
Distributed Computing: Amazon EMR
Amazon EMR (Elastic MapReduce) runs Apache Hadoop and Apache Spark clusters on AWS for large-scale distributed computing and batch processing.
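The "MapReduce" in the name refers to a processing pattern that Hadoop distributes across a cluster. A single-process sketch of its three phases (map, shuffle, reduce) is shown below, computing revenue per product. It only imitates the pattern locally; it does not use Hadoop, Spark, or any EMR API, and the sample data is invented.

```python
from collections import defaultdict

orders = [("A", 10.0), ("B", 20.0), ("A", 5.0)]


def map_phase(records):
    """Emit (key, value) pairs; on a cluster this runs in parallel per input split."""
    for product_id, revenue in records:
        yield product_id, revenue


def shuffle_phase(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Combine each key's values into one result."""
    return {key: sum(values) for key, values in grouped.items()}


totals = reduce_phase(shuffle_phase(map_phase(orders)))
print(totals)  # {'A': 15.0, 'B': 20.0}
```

On a real cluster the map and reduce functions run on many nodes at once, and the shuffle moves data between them over the network.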
Metadata, Governance, and Discovery
Visualization: Amazon QuickSight
Amazon QuickSight is AWS's cloud-native business intelligence tool for creating dashboards and visualizations. It connects to Redshift, Athena, S3, and other sources, enabling non-technical users to explore data through interactive charts and reports.