AWS Certified Data Analytics – Specialty

AWS Certified Data Analytics – Specialty Intermediate — Quiz 2 — Study Guide

AWS Data Analytics Security & Governance — Intermediate Study Guide

Data is only as valuable as it is trustworthy and protected. In the AWS ecosystem, a robust data analytics platform isn't just about processing speed — it's about knowing *who* accessed *what*, *when*, and *why*. This lesson covers the security, governance, and compliance services you'll need to master for the AWS Certified Data Analytics Specialty exam, and more importantly, to build real-world data platforms that organizations can trust.


Data Governance Fundamentals

Data governance is the framework of policies, processes, and standards that ensure data is accurate, secure, and used appropriately. Think of it like a city's zoning laws — it defines who can build what, where, and under what conditions.

Key pillars of data governance in AWS:

  • Access control — who can read, write, or delete data
  • Data lineage — tracking where data came from and how it transformed
  • Auditing — recording every action taken on data
  • Compliance — meeting regulations like GDPR (General Data Protection Regulation)

    Data Lineage

    Data lineage is the "paper trail" of your data, from ingestion through transformation to consumption. Metadata from AWS Glue ETL jobs and the Glue Data Catalog can help you trace a dashboard metric back to its raw source file.
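
    To make the idea of tracing a metric back to its source concrete, here is a minimal sketch in plain Python. The lineage graph, dataset names, and the `trace_to_sources` helper are all hypothetical; this illustrates the concept and is not how AWS Glue stores or exposes lineage.

```python
# Hypothetical lineage graph: each dataset maps to the datasets it was derived from.
LINEAGE = {
    "dashboard.revenue_metric": ["warehouse.daily_revenue"],
    "warehouse.daily_revenue": ["staging.orders_clean"],
    "staging.orders_clean": ["s3://raw-bucket/orders/2024-01-01.csv"],
}


def trace_to_sources(dataset, lineage):
    """Return the raw sources (nodes with no recorded parents) behind a dataset.

    Assumes the lineage graph has no cycles.
    """
    parents = lineage.get(dataset, [])
    if not parents:
        return [dataset]
    sources = []
    for parent in parents:
        for source in trace_to_sources(parent, lineage):
            if source not in sources:
                sources.append(source)
    return sources


raw_sources = trace_to_sources("dashboard.revenue_metric", LINEAGE)
print(raw_sources)
```

    Walking the graph upstream from the dashboard metric ends at the raw file, which is the question lineage is meant to answer.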


    Core Security Services

    IAM — Identity and Access Management

    IAM is the central nervous system of AWS security. It controls *who* (users, roles, services) can do *what* (actions) on *which* resources.

    For example, this policy allows read access only to objects under the finance/ prefix of the bucket:

    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Action": ["s3:GetObject"],
          "Resource": "arn:aws:s3:::my-data-lake-bucket/finance/*"
        }
      ]
    }

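    Use IAM roles (not users) for services like Glue, Redshift, and Lambda so they work with temporary credentials, and scope each role's policy to only the actions and resources it needs, following the principle of least privilege.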
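
    As a sketch of how a least-privilege policy for a service role might be generated per data-lake prefix, the snippet below builds the policy document as a Python dictionary. The helper name `build_prefix_read_policy` is hypothetical; the policy grammar (`Version`, `Statement`, `Effect`, `Action`, `Resource`) is standard IAM JSON.

```python
import json


def build_prefix_read_policy(bucket, prefix):
    """Build an identity policy allowing s3:GetObject only under one prefix."""
    prefix = prefix.strip("/")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                # Scope object reads through the resource ARN.
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}/*",
            }
        ],
    }


policy = build_prefix_read_policy("my-data-lake-bucket", "finance/")
print(json.dumps(policy, indent=2))
```

    The printed JSON could be attached to a role assumed by a Glue job, so the job can read finance data and nothing else in the bucket.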

    AWS Organizations

    When managing multiple AWS accounts (common in enterprise data lakes), AWS Organizations lets you apply Service Control Policies (SCPs) across all accounts. SCPs act as guardrails: they grant no permissions themselves but cap what IAM policies in member accounts can allow, so even an account admin can't exceed the permissions the SCP allows.
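
    The guardrail behavior can be pictured as an intersection: an action works only if both the IAM policy and the SCP allow it. The sketch below is a deliberately simplified model (allow lists only, no explicit denies, conditions, or resource scoping), and the function names and the sample SCP are hypothetical.

```python
from fnmatch import fnmatchcase


def is_allowed(action, patterns):
    """True if the action matches any allow pattern (supports * wildcards)."""
    return any(fnmatchcase(action.lower(), p.lower()) for p in patterns)


def effective_allow(action, iam_allows, scp_allows):
    """An action is usable only if both the IAM policy and the SCP allow it."""
    return is_allowed(action, iam_allows) and is_allowed(action, scp_allows)


admin_iam = ["*"]                      # account admin: IAM allows everything
scp = ["s3:*", "glue:*", "athena:*"]   # hypothetical guardrail for an analytics account

s3_ok = effective_allow("s3:GetObject", admin_iam, scp)
redshift_ok = effective_allow("redshift:DeleteCluster", admin_iam, scp)
print(s3_ok, redshift_ok)
```

    Even with a wildcard IAM policy, the admin cannot call Redshift actions here because the SCP does not include them.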


    Auditing & Monitoring

    CloudTrail

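    AWS CloudTrail records API activity in your AWS account: who made the call, from where, and when. Management events (such as deleting a bucket) are logged by default, while high-volume data events (such as S3 object reads) must be enabled on a trail. It's your primary tool for answering: *"Who deleted that S3 bucket?"*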

    Feature  | Detail
    ---------+------------------------------------------------
    Scope    | All AWS API calls (console, CLI, SDK)
    Storage  | Logs delivered to S3 or CloudWatch Logs
    Use case | Security audits, compliance, incident response

    Exam tip: If a question asks how to audit API calls to AWS resources, the answer is CloudTrail.
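
    To show what answering "who deleted that bucket?" looks like in practice, the sketch below filters a CloudTrail-style log for one event name. The record values are made up, and `find_events` is a hypothetical helper; the field names (`eventName`, `eventTime`, `userIdentity`, `sourceIPAddress`, `requestParameters`) follow the CloudTrail record format.

```python
import json

# Hypothetical excerpt of a CloudTrail log file (all values are made up).
log_file = json.dumps({
    "Records": [
        {"eventTime": "2024-03-01T10:15:00Z", "eventSource": "s3.amazonaws.com",
         "eventName": "CreateBucket", "sourceIPAddress": "198.51.100.7",
         "userIdentity": {"arn": "arn:aws:iam::111122223333:user/alice"},
         "requestParameters": {"bucketName": "reports"}},
        {"eventTime": "2024-03-02T08:40:00Z", "eventSource": "s3.amazonaws.com",
         "eventName": "DeleteBucket", "sourceIPAddress": "203.0.113.42",
         "userIdentity": {"arn": "arn:aws:iam::111122223333:user/bob"},
         "requestParameters": {"bucketName": "reports"}},
    ]
})


def find_events(raw_log, event_name):
    """Return who/when/where summaries for every record with the given event name."""
    records = json.loads(raw_log)["Records"]
    return [
        {
            "who": r["userIdentity"]["arn"],
            "when": r["eventTime"],
            "from": r["sourceIPAddress"],
            # requestParameters can be null for some events, so guard the lookup.
            "bucket": (r.get("requestParameters") or {}).get("bucketName"),
        }
        for r in records
        if r["eventName"] == event_name
    ]


deletions = find_events(log_file, "DeleteBucket")
print(deletions)
```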

    AWS Config

    While CloudTrail records *actions*, AWS Config records *resource states*. It continuously monitors whether your resources comply with defined rules (e.g., "Are all S3 buckets encrypted?"). Think of CloudTrail as a security camera and Config as a building inspector.
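
    The "building inspector" role can be sketched as a rule evaluated against a snapshot of resource state. The bucket records and the `evaluate_encryption_rule` function below are hypothetical and only mimic the idea; real AWS Config rules receive configuration items from the service. The labels `COMPLIANT` and `NON_COMPLIANT` match the compliance values Config reports.

```python
# Hypothetical snapshot of resource states (not the AWS Config item format).
buckets = [
    {"name": "raw-zone", "encryption": "aws:kms"},
    {"name": "scratch", "encryption": None},
]


def evaluate_encryption_rule(bucket):
    """Label a bucket by whether it has any default encryption configured."""
    compliance = "COMPLIANT" if bucket["encryption"] else "NON_COMPLIANT"
    return {"resource": bucket["name"], "compliance": compliance}


results = [evaluate_encryption_rule(b) for b in buckets]
for result in results:
    print(result)
```

    Unlike the CloudTrail view of individual actions, the rule looks only at the current state of each resource, regardless of who changed it.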

    Security Hub

    AWS Security Hub aggregates findings from multiple services (GuardDuty, Macie, Config) into a single dashboard. It's your centralized security posture manager — useful when you need a bird's-eye view of compliance across accounts.


    Encryption & Key Management

    KMS — Key Management Service

    AWS KMS lets you create and control the encryption keys used to protect your data. Benefits include:
  • Centralized key management with audit trails
  • Integration with S3, Redshift, Glue, and more
  • Support for customer managed keys for fine-grained control

    Encrypting Data in S3

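    The simplest way to control encryption at rest in S3 is the bucket's default encryption setting. S3 applies SSE-S3 to new objects automatically; to use KMS keys instead, set the default to aws:kms: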

    aws s3api put-bucket-encryption \
      --bucket my-data-lake \
      --server-side-encryption-configuration '{
        "Rules": [{
          "ApplyServerSideEncryptionByDefault": {
            "SSEAlgorithm": "aws:kms"
          }
        }]
      }'

    This ensures every newly uploaded object is encrypted with a KMS key, even if the uploader forgets to specify it. Objects already in the bucket keep their existing encryption.
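
    If you want the default to point at a specific KMS key, the configuration document can be generated rather than typed by hand. This is a sketch under assumptions: `default_encryption_config` is a hypothetical helper and the key ARN is a placeholder. The field names (`KMSMasterKeyID`, `BucketKeyEnabled`) follow the S3 server-side encryption configuration structure.

```python
import json


def default_encryption_config(kms_key_arn=None, bucket_key=True):
    """Build an S3 default-encryption configuration that uses SSE-KMS."""
    default = {"SSEAlgorithm": "aws:kms"}
    if kms_key_arn:
        # Without a key ID, S3 uses the AWS managed key for S3.
        default["KMSMasterKeyID"] = kms_key_arn
    return {
        "Rules": [
            {
                "ApplyServerSideEncryptionByDefault": default,
                # S3 Bucket Keys reduce the number of requests made to KMS.
                "BucketKeyEnabled": bucket_key,
            }
        ]
    }


# Placeholder ARN for illustration only.
config = default_encryption_config(
    "arn:aws:kms:us-east-1:111122223333:key/EXAMPLE-KEY-ID"
)
print(json.dumps(config, indent=2))
```

    The printed JSON can be passed as the value of `--server-side-encryption-configuration` in the `put-bucket-encryption` command.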

    Encryption Type | Key Managed By     | Use Case
    ----------------+--------------------+----------------------------------
    SSE-S3          | AWS                | Simplest, no extra cost
    SSE-KMS         | AWS KMS (your key) | Audit trail, fine-grained control
    SSE-C           | You (customer)     | Full key control

    Data Lake & Lake Formation

    Data Lake on AWS

    A data lake is a centralized repository storing structured, semi-structured, and unstructured data at scale — typically in S3. The challenge isn't storing data; it's governing it.

    AWS Lake Formation

    Lake Formation simplifies building secure data lakes by providing:
  • Centralized access control — define permissions once, enforce everywhere
  • Column and row-level security — restrict access to specific columns (e.g., hide SSN)
  • Data catalog integration — built on the AWS Glue Data Catalog
  • Blueprint-based ingestion — automated data ingestion workflows

    Without Lake Formation, you'd need to manage S3 bucket policies, IAM policies, and Glue permissions separately. Lake Formation unifies these into one governance layer.
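
    Column-level security is easiest to picture as a grant that names the columns a principal may not see. The sketch below only builds the request parameters; `column_grant_request` is a hypothetical helper, and the role ARN, database, and table names are placeholders. The parameter shape is intended to match the Lake Formation `GrantPermissions` API as exposed by boto3, which is an assumption worth checking against the current API reference.

```python
def column_grant_request(principal_arn, database, table, excluded_columns):
    """Build SELECT-grant parameters that hide the excluded columns."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {
            "TableWithColumns": {
                "DatabaseName": database,
                "Name": table,
                # Grant every column except the ones listed here.
                "ColumnWildcard": {"ExcludedColumnNames": list(excluded_columns)},
            }
        },
        "Permissions": ["SELECT"],
    }


# Placeholder names for illustration only.
request = column_grant_request(
    "arn:aws:iam::111122223333:role/AnalystRole",
    "sales_db",
    "customers",
    ["ssn"],
)
print(request)
```

    With boto3 installed and credentials configured, a dictionary like this could be passed as `boto3.client("lakeformation").grant_permissions(**request)`.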


    PII, Data Masking & Macie

    PII (Personally Identifiable Information)

    PII is any data that can identify an individual — names, email addresses, social security numbers, etc. Regulations like GDPR require you to protect, minimize, and properly handle PII.

    Data Masking

    Data masking obscures sensitive values, either by redacting characters or by substituting realistic but fake data. For example:
  • Real: John Smith, SSN: 123-45-6789
  • Masked: J*** S****, SSN: ***-**-6789

    AWS Glue can apply masking transformations in ETL pipelines before data reaches analysts.
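
    A masking transformation of this kind can be sketched in a few lines of plain Python. The helpers `mask_name` and `mask_ssn` are hypothetical and are not Glue built-ins; in a Glue job, similar logic would run inside the ETL script.

```python
import re


def mask_name(full_name):
    """Keep the first letter of each word and replace the rest with asterisks."""
    return " ".join(w[0] + "*" * (len(w) - 1) for w in full_name.split())


def mask_ssn(ssn):
    """Mask all but the last four digits of a ###-##-#### value."""
    if not re.fullmatch(r"\d{3}-\d{2}-\d{4}", ssn):
        raise ValueError("unexpected SSN format")
    return "***-**-" + ssn[-4:]


masked_name = mask_name("John Smith")
masked_ssn = mask_ssn("123-45-6789")
print(masked_name, masked_ssn)  # J*** S**** ***-**-6789
```

    Rejecting values that do not match the expected format is a deliberate choice here: silently passing through an unrecognized value could leak unmasked data downstream.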

    Amazon Macie

    Amazon Macie uses machine learning and pattern matching to automatically discover PII and other sensitive data in S3. It scans buckets, identifies sensitive data, and generates findings (e.g., "This bucket contains 1,200 credit card numbers") that you can act on to protect that data.

    Analogy: Macie is like a smart mail sorter that flags envelopes containing sensitive documents before they're sent to the wrong department.
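
    To give a feel for what a "finding" is, here is a toy scanner that counts SSN-like and card-like patterns per object. This is emphatically not how Macie works internally (Macie uses managed data identifiers and machine learning); the patterns, sample data, and `scan_objects` helper are all hypothetical.

```python
import re

# Simplistic patterns for illustration; real detectors are far more careful.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
CARD_PATTERN = re.compile(r"\b(?:\d{4}[ -]?){3}\d{4}\b")


def scan_objects(objects):
    """Return one finding per object that contains at least one detection."""
    findings = []
    for key, text in objects.items():
        counts = {
            "SSN": len(SSN_PATTERN.findall(text)),
            "CREDIT_CARD": len(CARD_PATTERN.findall(text)),
        }
        counts = {kind: n for kind, n in counts.items() if n}
        if counts:
            findings.append({"object": key, "detections": counts})
    return findings


# Made-up object contents keyed by S3-style object keys.
sample = {
    "exports/customers.csv": "John Smith,123-45-6789\nJane Doe,987-65-4321",
    "exports/products.csv": "widget,19.99\ngadget,24.50",
}
findings = scan_objects(sample)
print(findings)
```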


    Networking: VPC for Data Security

    A VPC (Virtual Private Cloud) isolates your AWS resources in a private network. For data analytics:

  • Use VPC endpoints for S3 and Glue so traffic never leaves the AWS network
  • Place Redshift clusters in private subnets
  • Use security groups and NACLs as firewall layers
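
    One common way to enforce the VPC endpoint path is a bucket policy that denies requests arriving any other way. The sketch below builds such a policy; `vpce_only_bucket_policy` is a hypothetical helper and the endpoint ID is a placeholder. It relies on the `aws:SourceVpce` global condition key. Treat it as an assumption-laden starting point, since a blanket deny like this also blocks console access from outside the VPC.

```python
import json


def vpce_only_bucket_policy(bucket, vpce_id):
    """Build a bucket policy denying S3 access that bypasses one VPC endpoint."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyAccessOutsideVpcEndpoint",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:*",
                "Resource": [
                    f"arn:aws:s3:::{bucket}",
                    f"arn:aws:s3:::{bucket}/*",
                ],
                # Deny whenever the request did not come through this endpoint.
                "Condition": {"StringNotEquals": {"aws:SourceVpce": vpce_id}},
            }
        ],
    }


# Placeholder endpoint ID for illustration only.
bucket_policy = vpce_only_bucket_policy("my-data-lake", "vpce-1a2b3c4d")
print(json.dumps(bucket_policy, indent=2))
```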

    Redshift Security

    Amazon Redshift (the cloud data warehouse) supports:
  • Encryption at rest via KMS
  • SSL/TLS for data in transit
  • VPC isolation
  • Column-level access control (native column-level GRANT, or Lake Formation for external tables)

    Key Takeaways

  • CloudTrail is your go-to service for auditing API calls; AWS Config monitors resource compliance over time.
  • IAM manages access centrally — always use roles with least-privilege policies for AWS services.
  • Lake Formation simplifies data lake governance by centralizing access control, including column/row-level security, on top of S3 and the Glue Data Catalog.
  • KMS provides managed encryption key control with full audit trails, and enabling default S3 bucket encryption is the simplest way to encrypt data at rest.
  • Macie automatically detects PII in S3, supporting GDPR compliance and data masking workflows.