AWS Certified Data Analytics – Specialty Fundamentals: Quiz 1 Study Guide

Understanding the AWS data analytics ecosystem is essential for anyone building modern data pipelines. Whether you're ingesting streaming events, transforming raw files, or visualizing business metrics, AWS offers purpose-built services for each layer of the data journey. This guide maps out those services, their roles, and how they connect, which is the material Quiz 1 covers.


The AWS Data Analytics Landscape

Think of a data analytics architecture like a factory assembly line:

  • Raw materials arrive (data ingestion)
  • Materials are processed and shaped (ETL / transformation)
  • Finished goods are stored (data lake / data warehouse)
  • Quality is inspected (data quality / governance)
  • Reports are delivered to management (visualization)

AWS has a dedicated service for nearly every stage.


    Data Ingestion: Getting Data In

    Real-Time Ingestion with Amazon Kinesis

    Amazon Kinesis is the go-to AWS service for ingesting and processing real-time streaming data — think clickstreams, IoT sensor readings, or application logs flowing in continuously.

    | Service | Best For |
    |---|---|
    | Kinesis Data Streams | Custom real-time processing; low latency; consumers like Lambda or custom apps |
    | Kinesis Data Firehose | Fully managed delivery to S3, Redshift, or OpenSearch; no coding required |

    Analogy: Kinesis Streams is like a live conveyor belt you control; Firehose is an automated delivery truck that drops packages at your destination automatically.
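
As a rough illustration of what a producer hands to Kinesis Data Streams, the sketch below builds the arguments for a single PutRecord call. The stream name `clickstream-events`, the event fields, and the helper `build_put_record_request` are hypothetical; only `StreamName`, `Data`, and `PartitionKey` are actual request parameters. The network call itself is left as a comment so the example runs without AWS credentials.

```python
import json


def build_put_record_request(stream_name: str, event: dict, key_field: str) -> dict:
    """Build keyword arguments for a Kinesis Data Streams PutRecord call."""
    return {
        "StreamName": stream_name,
        # The record payload is sent as bytes.
        "Data": json.dumps(event).encode("utf-8"),
        # Records that share a partition key are routed to the same shard.
        "PartitionKey": str(event[key_field]),
    }


# Hypothetical clickstream event and stream name, for illustration only.
request = build_put_record_request(
    "clickstream-events",
    {"user_id": "u-123", "page": "/checkout", "ts": "2024-03-15T12:00:00Z"},
    key_field="user_id",
)
print(request["PartitionKey"])

# With AWS credentials configured, the request could be sent with boto3:
#   import boto3
#   boto3.client("kinesis").put_record(**request)
```

Choosing a partition key with many distinct values (here, the user ID) is one common way to spread records across shards.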

    Batch Ingestion

    For non-real-time workloads, data is often loaded in bulk via AWS Glue jobs, direct S3 uploads, or database migration tools.


    Storage: Data Lakes and Data Warehouses

    Amazon S3 — The Data Lake Foundation

    Amazon S3 is the backbone of nearly every AWS data lake. It stores structured, semi-structured, and unstructured data at virtually unlimited scale and at low cost.

  • A data lake on S3 centralizes raw and processed data in one place
  • Data can be organized using partitioning (e.g., by year/month/day) to dramatically improve query performance and reduce costs
  • AWS Lake Formation adds governance, security, and fine-grained access control on top of S3
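
The year/month/day partitioning mentioned above is often expressed as `key=value` folders in the object key. The following is a minimal sketch of that layout; the prefix `sales_data`, the file name, and the helper `partitioned_key` are assumptions made for illustration.

```python
from datetime import date


def partitioned_key(prefix: str, day: date, filename: str) -> str:
    """Return an object key laid out as year=/month=/day= partition folders."""
    return (
        f"{prefix}/year={day.year:04d}/month={day.month:02d}/"
        f"day={day.day:02d}/{filename}"
    )


key = partitioned_key("sales_data", date(2024, 3, 15), "orders-0001.parquet")
print(key)  # sales_data/year=2024/month=03/day=15/orders-0001.parquet
```

Zero-padding the month and day keeps the folders sorted chronologically and makes string filters such as `month = '03'` match.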

    Amazon Redshift — The Data Warehouse

    Amazon Redshift is AWS's fully managed data warehouse, optimized for analytical queries over large volumes of structured data.

  • Uses columnar storage — data is stored by column, not row, making aggregations (SUM, AVG) extremely fast
  • Supports distribution keys to control how data is spread across nodes, reducing data movement during joins
  • Best for structured, schema-defined data used in regular reporting
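
To see why a distribution key reduces data movement during joins, consider the toy model below. It assigns rows to nodes by hashing the key, so matching rows from two tables end up on the same node. This is only a sketch: `zlib.crc32` stands in for the warehouse's internal hash function, and the tables, node count, and helper names are hypothetical.

```python
import zlib
from collections import defaultdict


def node_for(key: str, node_count: int) -> int:
    """Pick a node for a key. crc32 is a stand-in for the real internal hash."""
    return zlib.crc32(key.encode("utf-8")) % node_count


def distribute(rows, dist_key, node_count):
    """Group rows by the node that their distribution key hashes to."""
    nodes = defaultdict(list)
    for row in rows:
        nodes[node_for(str(row[dist_key]), node_count)].append(row)
    return nodes


orders = [
    {"customer_id": "c1", "amount": 40},
    {"customer_id": "c2", "amount": 15},
    {"customer_id": "c1", "amount": 5},
]
customers = [
    {"customer_id": "c1", "name": "Ada"},
    {"customer_id": "c2", "name": "Lin"},
]

NODES = 4
order_nodes = distribute(orders, "customer_id", NODES)
customer_nodes = distribute(customers, "customer_id", NODES)

# Every order finds its customer row on the same node, so joining on
# customer_id needs no rows shipped between nodes.
colocated = all(
    any(c["customer_id"] == o["customer_id"] for c in customer_nodes[node])
    for node, node_orders in order_nodes.items()
    for o in node_orders
)
print(colocated)  # True
```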

    Data Lake vs. Data Warehouse:

    | Aspect | Data Lake (S3) | Data Warehouse (Redshift) |
    |---|---|---|
    | Data type | Any (structured, unstructured) | Structured |
    | Schema | Schema-on-read | Schema-on-write |
    | Cost | Very low | Higher |
    | Query speed | Slower (without optimization) | Very fast |
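
The schema row in the comparison above can be made concrete with a small sketch. Under schema-on-write, a row is checked before it is stored. Under schema-on-read, raw text is stored as-is and the schema is applied only when the data is read. The column names, the `SCHEMA` mapping, and both helper functions are illustrative assumptions and do not correspond to any AWS API.

```python
import json

# Hypothetical two-column schema used by both approaches.
SCHEMA = {"order_id": int, "amount": float}


def write_with_schema(table: list, row: dict) -> None:
    """Schema-on-write: reject rows that do not match the schema at load time."""
    for column, expected_type in SCHEMA.items():
        if not isinstance(row.get(column), expected_type):
            raise ValueError(f"column {column!r} must be {expected_type.__name__}")
    table.append(row)


def read_with_schema(raw_lines: list) -> list:
    """Schema-on-read: accept any stored text and apply the schema when reading."""
    rows = []
    for line in raw_lines:
        record = json.loads(line)
        try:
            rows.append({col: cast(record[col]) for col, cast in SCHEMA.items()})
        except (KeyError, TypeError, ValueError):
            continue  # skip records that do not fit the schema
    return rows


warehouse = []
write_with_schema(warehouse, {"order_id": 1, "amount": 19.99})
try:
    write_with_schema(warehouse, {"order_id": "two", "amount": 5.0})
except ValueError as err:
    rejected = str(err)

lake = ['{"order_id": "2", "amount": "5.50"}', '{"note": "free-form event"}']
rows = read_with_schema(lake)
print(rows)  # [{'order_id': 2, 'amount': 5.5}]
```

The lake accepted both records at write time; only the read decided which ones fit the schema.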


    ETL and Transformation

    AWS Glue — The ETL Engine

    AWS Glue is a fully managed ETL (Extract, Transform, Load) service. Its primary purposes are:

  • Discovering and cataloging data via the AWS Glue Data Catalog (stores schema and metadata)
  • Running ETL jobs written in Python or Scala (using Apache Spark under the hood)
  • Crawling data sources to automatically infer schemas
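
    # Example: Simple Glue ETL job snippet
    from awsglue.context import GlueContext
    from pyspark.context import SparkContext

    glueContext = GlueContext(SparkContext.getOrCreate())
    datasource = glueContext.create_dynamic_frame.from_catalog(
        database="sales_db",
        table_name="raw_orders"
    )
    # Keep only records for completed orders
    transformed = datasource.filter(lambda x: x["status"] == "completed")
    # Destination arguments are elided in the source
    glueContext.write_dynamic_frame.from_options(transformed, ...)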
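
The idea of crawling a source to infer its schema can be illustrated with a toy type-inference routine. This is not how a Glue crawler is implemented; it is a simplified sketch in which each column gets the narrowest type that fits all sampled values. The sample records and the type names chosen here are assumptions.

```python
def infer_type(values):
    """Return the narrowest type name that fits every sampled string value."""
    def fits(cast):
        try:
            for value in values:
                cast(value)
            return True
        except ValueError:
            return False

    if fits(int):
        return "bigint"
    if fits(float):
        return "double"
    return "string"


def infer_schema(records):
    """Collect values per column, then infer one type per column."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return {name: infer_type(values) for name, values in columns.items()}


# Hypothetical rows as they might appear in a raw CSV file (all strings).
samples = [
    {"order_id": "1001", "amount": "19.99", "status": "completed"},
    {"order_id": "1002", "amount": "5", "status": "pending"},
]
schema = infer_schema(samples)
print(schema)  # {'order_id': 'bigint', 'amount': 'double', 'status': 'string'}
```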

    AWS Glue DataBrew

    DataBrew is a visual, no-code data preparation tool built on top of Glue. It's designed for data quality checks and cleaning without writing code — great for analysts who aren't engineers.


    Querying Data: Athena

    Amazon Athena is a serverless, interactive query service that lets you run SQL directly against data stored in S3 — no infrastructure to manage.

    -- Query partitioned Parquet data in S3
    SELECT product_id, SUM(revenue) AS total_revenue
    FROM sales_data
    WHERE year = '2024' AND month = '03'
    GROUP BY product_id;

    Athena works best when your data is:

  • Stored in columnar formats like Parquet or ORC
  • Partitioned by common filter fields (date, region, etc.)
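
Partitioning helps because a filter on partition columns lets the engine skip whole folders before reading any data. The sketch below models that pruning step for the `year`/`month` filter used in the query above. The folder names and the `prune` helper are hypothetical, and a real engine resolves partitions through table metadata, not by parsing paths like this.

```python
def prune(partitions, **filters):
    """Keep only partition paths whose key=value segments match every filter."""
    selected = []
    for path in partitions:
        segments = path.strip("/").split("/")
        parts = dict(s.split("=", 1) for s in segments if "=" in s)
        if all(parts.get(key) == value for key, value in filters.items()):
            selected.append(path)
    return selected


# Hypothetical partition folders under one table location.
partitions = [
    "sales_data/year=2023/month=12/",
    "sales_data/year=2024/month=02/",
    "sales_data/year=2024/month=03/",
]
to_scan = prune(partitions, year="2024", month="03")
print(to_scan)  # ['sales_data/year=2024/month=03/']
```

Only one of the three folders is read, which is where the performance and cost savings come from.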

    Columnar Formats: Parquet and ORC

    | Format | Description | Best With |
    |---|---|---|
    | Parquet | Open-source columnar format; great compression | Athena, Glue, Spark |
    | ORC | Optimized Row Columnar; strong with Hive/EMR | EMR, Hive workloads |

    Both formats improve query performance and reduce cost because the engine scans only the columns a query needs.
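
A small in-memory comparison shows the effect. The sketch below stores the same three records row by row and column by column, then counts how many values each layout touches to compute `SUM(revenue)`. The data and function names are made up for illustration; real Parquet and ORC files add compression and encoding on top of this basic idea.

```python
rows = [
    {"product_id": "p1", "region": "eu", "revenue": 10.0},
    {"product_id": "p2", "region": "us", "revenue": 4.5},
    {"product_id": "p1", "region": "us", "revenue": 7.5},
]

# Columnar layout: one list of values per column.
columns = {name: [row[name] for row in rows] for name in rows[0]}


def sum_row_store(rows, column):
    """Sum one column from a row layout; every field of every row is read."""
    values_read = 0
    total = 0.0
    for row in rows:
        values_read += len(row)
        total += row[column]
    return total, values_read


def sum_column_store(columns, column):
    """Sum one column from a columnar layout; only that column is read."""
    values = columns[column]
    return sum(values), len(values)


print(sum_row_store(rows, "revenue"))        # (22.0, 9)
print(sum_column_store(columns, "revenue"))  # (22.0, 3)
```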


    Distributed Computing: Amazon EMR

    Amazon EMR (Elastic MapReduce) runs Apache Hadoop and Apache Spark clusters on AWS for large-scale distributed computing and batch processing.

  • Hadoop: The original distributed processing framework (MapReduce)
  • Spark: Faster, in-memory processing engine; supports both batch processing and near-real-time stream processing
  • EMR is ideal when you need full control over your cluster configuration or are running complex ML pipelines
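
The MapReduce model behind Hadoop can be sketched on a single machine in three steps: map each record to key/value pairs, shuffle the pairs so equal keys are grouped, and reduce each group to one result. The records and function names below are illustrative assumptions. On a cluster, each phase would run in parallel across many nodes.

```python
from collections import defaultdict


def map_phase(record):
    """Emit one (key, value) pair per input record."""
    yield record["product_id"], record["revenue"]


def shuffle(pairs):
    """Group all values that share a key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Collapse the values for one key into a single result."""
    return key, sum(values)


# Hypothetical input records.
records = [
    {"product_id": "p1", "revenue": 10},
    {"product_id": "p2", "revenue": 4},
    {"product_id": "p1", "revenue": 6},
]

pairs = [pair for record in records for pair in map_phase(record)]
totals = dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())
print(totals)  # {'p1': 16, 'p2': 4}
```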

    Metadata, Governance, and Discovery

  • AWS Glue Data Catalog: Central repository of metadata and schema information; makes datasets across S3, Redshift, and other stores discoverable
  • AWS Lake Formation: Manages data governance, security, and data lineage — tracking where data came from and how it was transformed
  • Data lineage helps organizations understand the full journey of a data asset, critical for compliance and debugging
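
Conceptually, lineage is a graph that links each dataset to the datasets it was derived from. The sketch below walks such a graph to list everything upstream of a dashboard. The dataset names, the `LINEAGE` mapping, and the `upstream` helper are hypothetical and do not represent any AWS API.

```python
# Hypothetical lineage: each dataset maps to the datasets it was built from.
LINEAGE = {
    "sales_dashboard": ["curated_orders"],
    "curated_orders": ["raw_orders", "raw_customers"],
    "raw_orders": [],
    "raw_customers": [],
}


def upstream(dataset, lineage):
    """Return every dataset that the given dataset depends on, directly or not."""
    seen = []
    stack = list(lineage.get(dataset, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.append(current)
            stack.extend(lineage.get(current, []))
    return sorted(seen)


print(upstream("sales_dashboard", LINEAGE))
# ['curated_orders', 'raw_customers', 'raw_orders']
```

A walk like this is what lets a team answer "which raw sources feed this report?" during an audit or when debugging a bad number.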

    Visualization: Amazon QuickSight

    Amazon QuickSight is AWS's cloud-native business intelligence tool for creating dashboards and visualizations. It connects to Redshift, Athena, S3, and other sources, enabling non-technical users to explore data through interactive charts and reports.


    Key Takeaways

  • Kinesis (Streams for custom real-time; Firehose for managed delivery) handles real-time data ingestion, while S3 serves as the foundation for scalable data lakes.
  • AWS Glue is the primary ETL service — it transforms data AND catalogs schema/metadata for the entire ecosystem.
  • Athena lets you query S3 data with SQL serverlessly; use Parquet/ORC columnar formats and partitioning to maximize performance and minimize cost.
  • Redshift is purpose-built for structured data warehousing with columnar storage and distribution keys for fast analytics.
  • EMR (Hadoop/Spark) handles heavy distributed batch and real-time processing, while Lake Formation governs security, access, and data lineage across your data lake.