Google Cloud Professional Data Engineer Fundamentals — Study Guide

As a data engineer on Google Cloud, your job is to move, store, transform, and serve data reliably and cost-effectively. This guide covers the core GCP services and concepts you need to master — from choosing the right database to optimizing query costs — so you can make smart architectural decisions on the exam and in the real world.


Google Cloud Storage (GCS) & Storage Classes

GCS is the foundation of most data architectures on GCP. Think of it as a virtually unlimited object store in the cloud, ideal for data lakes, backups, and staging areas.

Storage Classes Comparison

Storage Class | Best For                         | Minimum Storage Duration | Relative Storage Cost
Standard      | Frequently accessed data         | None                     | Highest
Nearline      | Accessed ~once per month         | 30 days                  | Lower
Coldline      | Accessed ~once per quarter       | 90 days                  | Even lower
Archive       | Accessed less than once per year | 365 days                 | Lowest
Exam tip: If data is accessed less than once per year, Archive is your answer for cost optimization.

Lifecycle Policies

Lifecycle policies automatically transition or delete objects based on rules you define — a key tool for data retention and cost management.

{
  "rule": [
    {
      "action": { "type": "SetStorageClass", "storageClass": "COLDLINE" },
      "condition": { "age": 90 }
    },
    {
      "action": { "type": "Delete" },
      "condition": { "age": 365 }
    }
  ]
}

This policy moves objects to Coldline after 90 days, then deletes them after a year — automating your data management without manual intervention.
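
To make the rule evaluation concrete, here is a minimal Python sketch that applies the policy above to an object's age. It is a simplified model, not the Cloud Storage API: the function name `action_for_age` is hypothetical, only the `age` condition is handled, and it assumes that a Delete action takes precedence when several rules match.

```python
import json

# The lifecycle policy from above, embedded so the sketch is self-contained.
POLICY = json.loads("""
{
  "rule": [
    {"action": {"type": "SetStorageClass", "storageClass": "COLDLINE"},
     "condition": {"age": 90}},
    {"action": {"type": "Delete"},
     "condition": {"age": 365}}
  ]
}
""")


def action_for_age(policy, age_days):
    """Return the action a simplified evaluator would pick for an object of this age."""
    matching = [
        rule["action"]
        for rule in policy["rule"]
        if age_days >= rule["condition"]["age"]
    ]
    # Assumption: Delete wins over SetStorageClass when both rules match.
    for action in matching:
        if action["type"] == "Delete":
            return "Delete"
    for action in matching:
        if action["type"] == "SetStorageClass":
            return "SetStorageClass:" + action["storageClass"]
    return None  # no rule applies yet


for age in (30, 90, 400):
    print(age, action_for_age(POLICY, age))
```

An object younger than 90 days matches no rule, one between 90 and 364 days old is moved to Coldline, and one that has reached 365 days is deleted.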


BigQuery — Serverless Data Warehousing

BigQuery is GCP's fully managed, serverless data warehouse. It's designed for analytical queries over massive datasets using SQL.

Partitioning for Cost & Performance

Partitioning splits a table into segments (by ingestion time, a date or timestamp column, or an integer range), so queries that filter on the partitioning column scan only the matching partitions instead of the entire table.

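-- Create a table partitioned by ingestion date.
-- Ingestion-time partitioning cannot be combined with AS SELECT,
-- so the schema is declared explicitly (column types are assumed here)
-- and rows from raw_events are loaded in a separate step.
CREATE TABLE my_dataset.events (
  user_id STRING,
  event_type STRING
)
PARTITION BY DATE(_PARTITIONTIME);

-- Query scans only the partition for 2024-01-15
SELECT user_id, event_type
FROM my_dataset.events
WHERE DATE(_PARTITIONTIME) = '2024-01-15';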

Why it matters: Under on-demand pricing, BigQuery charges by bytes scanned. Partition pruning reduces the bytes scanned, which lowers cost and usually speeds up queries.
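
The effect on cost is simple arithmetic, sketched below in Python. Everything here is an assumption for illustration: the table layout (365 daily partitions of 10 GiB each) is hypothetical, and `price_per_tib` is a placeholder rather than an official price, so check current BigQuery pricing before relying on the dollar figures.

```python
TIB = 2 ** 40  # bytes in one tebibyte


def scan_cost(bytes_scanned, price_per_tib):
    """On-demand cost for a query, given a price per TiB scanned."""
    return bytes_scanned / TIB * price_per_tib


# Hypothetical table: 365 daily partitions of 10 GiB each.
partition_bytes = 10 * 2 ** 30
num_partitions = 365
# Placeholder price, NOT an official figure; check current BigQuery pricing.
price_per_tib = 5.0

full_scan = scan_cost(partition_bytes * num_partitions, price_per_tib)
one_day = scan_cost(partition_bytes, price_per_tib)

print(f"full table scan: ${full_scan:.2f}")
print(f"single partition: ${one_day:.4f}")
print(f"reduction factor: {full_scan / one_day:.0f}x")
```

Whatever the actual price, the ratio is what matters: a query pruned to one daily partition scans 1/365 of the bytes in this hypothetical table.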

Denormalization & Query Performance

Unlike traditional relational databases, BigQuery favors denormalized schemas: wide tables that embed related data as nested and repeated fields (STRUCT and ARRAY columns). Joins across very large tables are expensive; storing the related records inside the parent row avoids that cost.
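
As a rough illustration of what denormalizing means, the Python sketch below folds a child table into its parent as a repeated field. The table names, columns, and the `denormalize` helper are all hypothetical; in BigQuery the nested list would correspond to an ARRAY of STRUCT values.

```python
from collections import defaultdict

# Hypothetical normalized inputs: one row per order, one row per line item.
orders = [
    {"order_id": 1, "customer": "alice"},
    {"order_id": 2, "customer": "bob"},
]
order_items = [
    {"order_id": 1, "sku": "A-100", "qty": 2},
    {"order_id": 1, "sku": "B-200", "qty": 1},
    {"order_id": 2, "sku": "A-100", "qty": 5},
]


def denormalize(orders, order_items):
    """Embed each order's line items as a repeated 'items' field."""
    items_by_order = defaultdict(list)
    for item in order_items:
        items_by_order[item["order_id"]].append(
            {"sku": item["sku"], "qty": item["qty"]}
        )
    return [
        {**order, "items": items_by_order[order["order_id"]]}
        for order in orders
    ]


nested_orders = denormalize(orders, order_items)
for row in nested_orders:
    print(row)
```

Each order row now carries its own line items, so reading an order with its items no longer needs a join at query time.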

Geospatial Data

BigQuery supports a native GEOGRAPHY data type for geospatial analysis:

SELECT ST_DISTANCE(
  ST_GEOGPOINT(-122.4194, 37.7749),  -- San Francisco
  ST_GEOGPOINT(-118.2437, 34.0522)   -- Los Angeles
) AS distance_meters;
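
For intuition about what the query returns, the Python sketch below approximates the same distance with the haversine formula. This is not BigQuery's implementation: the function name `haversine_m` and the Earth radius constant are assumptions, so the result may differ slightly from ST_DISTANCE. Note that, like ST_GEOGPOINT, it takes longitude before latitude.

```python
import math

EARTH_RADIUS_M = 6_371_008.8  # assumed mean Earth radius in meters


def haversine_m(lon1, lat1, lon2, lat2):
    """Great-circle distance in meters; arguments are in degrees, longitude first."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (
        math.sin(dphi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    )
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


# San Francisco to Los Angeles, same coordinates as the query above.
distance = haversine_m(-122.4194, 37.7749, -118.2437, 34.0522)
print(f"{distance / 1000:.0f} km")
```

The result is on the order of 560 km, which is a useful sanity check that the coordinates were passed in the right order.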


Choosing the Right Database

Service   | Type                | Best For                               | Key Feature
Cloud SQL | Relational (SQL)    | Transactional apps, moderate scale     | Managed MySQL/PostgreSQL
Spanner   | Relational (SQL)    | Global apps needing strong consistency | Horizontal scale + ACID
Bigtable  | NoSQL (wide-column) | Time-series, IoT, high throughput      | Millions of reads/writes per second
Firestore | NoSQL (document)    | Flexible schema, mobile/web apps       | Schema flexibility

Bigtable — Time-Series & High Throughput

Bigtable is optimized for time-series data (metrics, IoT sensor readings) with extremely high ingestion rates and low-latency reads. It scales to petabytes with consistent single-digit millisecond latency.

Use Bigtable when you need:

  • Millions of rows written per second
  • Key-based lookups (not complex SQL queries)
  • Time-ordered data like application logs or financial ticks
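
Because Bigtable keeps rows sorted lexicographically by row key, key design determines which lookups are cheap. The Python sketch below shows one possible key layout for sensor readings, with a reversed timestamp so the newest reading for a sensor sorts first. The layout, the `MAX_TS_MS` bound, and both helper names are hypothetical choices for illustration, not a prescribed Bigtable schema.

```python
MAX_TS_MS = 10 ** 13 - 1  # assumed upper bound for millisecond timestamps


def row_key(sensor_id, ts_ms):
    """Build 'sensor_id#reversed_timestamp' so newer readings sort first."""
    reversed_ts = MAX_TS_MS - ts_ms
    return f"{sensor_id}#{reversed_ts:013d}"


def timestamp_from_key(key):
    """Recover the original millisecond timestamp from a row key."""
    return MAX_TS_MS - int(key.rsplit("#", 1)[1])


readings = [1_700_000_001_000, 1_700_000_003_000, 1_700_000_002_000]
# Sorting the keys as strings mimics lexicographic row ordering.
keys = sorted(row_key("sensor-a", ts) for ts in readings)
newest_first = [timestamp_from_key(k) for k in keys]
print(newest_first)
```

Zero-padding the reversed timestamp to a fixed width is what makes string order agree with numeric order.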

Spanner — Global Consistency

Spanner is a distributed relational database that offers both horizontal scalability and strong ACID consistency across regions. Use it when you need SQL semantics at planetary scale, for example global banking or inventory systems.

Cloud SQL vs. Spanner

  • Cloud SQL: Regional, vertically scaled, storage up to tens of TB, familiar MySQL/PostgreSQL
  • Spanner: Regional or multi-region, horizontally scalable, higher cost; use when Cloud SQL hits its limits

NoSQL & Schema Flexibility

NoSQL databases like Firestore and Bigtable don't enforce a fixed schema, making them ideal when your data structure evolves over time or varies between records.
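
The guidance in this section can be condensed into a small decision helper. The Python sketch below is only a rule of thumb built from the comparison above: the function name and its three boolean parameters are hypothetical, and real choices also depend on cost, latency targets, and existing tooling.

```python
def choose_database(relational, global_scale=False, high_throughput_key_lookups=False):
    """Suggest a GCP database from a few coarse workload traits."""
    if relational:
        # SQL workloads: Spanner once global scale or consistency is required.
        return "Spanner" if global_scale else "Cloud SQL"
    # NoSQL workloads: Bigtable for heavy key-based access, else Firestore.
    return "Bigtable" if high_throughput_key_lookups else "Firestore"


print(choose_database(relational=True))
print(choose_database(relational=True, global_scale=True))
print(choose_database(relational=False, high_throughput_key_lookups=True))
print(choose_database(relational=False))
```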


Data Processing Services

Dataflow — Streaming & Batch ETL

Dataflow is GCP's fully managed service for ETL pipelines, supporting both batch and streaming data processing using Apache Beam. It's the go-to for real-time data transformations.

Common pattern:

Pub/Sub (ingest) → Dataflow (transform) → BigQuery (store/analyze)

Use a staging table in BigQuery to land raw streaming data before transforming it into your final schema.
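
The transform step in that pattern is usually a small per-message function. The Python sketch below shows what one might look like, independent of any pipeline framework. The message fields, the staging columns, and the name `to_staging_row` are assumptions for illustration; in an Apache Beam pipeline a function like this could be applied with `beam.Map` between the Pub/Sub read and the BigQuery write.

```python
import json
from datetime import datetime, timezone


def to_staging_row(message_bytes, received_at=None):
    """Turn one raw message payload into a row for a staging table."""
    received_at = received_at or datetime.now(timezone.utc)
    event = json.loads(message_bytes.decode("utf-8"))
    return {
        "user_id": event.get("user_id"),
        "event_type": event.get("event_type"),
        # Keep the full payload so later transforms can reprocess it.
        "raw_payload": json.dumps(event, sort_keys=True),
        "received_at": received_at.isoformat(),
    }


sample = b'{"user_id": "u-42", "event_type": "click"}'
row = to_staging_row(sample, received_at=datetime(2024, 1, 15, tzinfo=timezone.utc))
print(row)
```

Keeping the raw payload alongside the parsed fields is one way to let the staging table serve as a replayable source when the final schema changes.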

Dataproc — Managed Hadoop & Spark

Dataproc runs managed Apache Hadoop and Spark clusters. Use it when you're migrating existing Hadoop workloads to GCP or need the full Spark ecosystem for large-scale batch processing.

Pub/Sub — Resilient Message Ingestion

Pub/Sub is a fully managed messaging service that decouples producers from consumers. It provides resilience by durably storing messages until subscribers acknowledge them (or the retention period expires), which is critical for real-time streaming pipelines that can't afford data loss.
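
The store-until-acknowledged idea can be shown with a toy in-memory model. The Python sketch below is not the Pub/Sub client library: the `ToySubscription` class and its methods are hypothetical, and it ignores retention limits, ack deadlines, and ordering. It only illustrates that a message stays available for redelivery until it is acknowledged.

```python
class ToySubscription:
    """In-memory stand-in for a subscription that retains unacked messages."""

    def __init__(self):
        self._unacked = {}
        self._next_id = 0

    def publish(self, data):
        """Store a message and return its id."""
        message_id = self._next_id
        self._unacked[message_id] = data
        self._next_id += 1
        return message_id

    def pull(self):
        """Return every message that has not been acknowledged yet."""
        return list(self._unacked.items())

    def ack(self, message_id):
        """Acknowledge a message so it is no longer redelivered."""
        self._unacked.pop(message_id, None)


sub = ToySubscription()
first = sub.publish("reading-1")
second = sub.publish("reading-2")

sub.ack(first)           # the consumer finished processing reading-1
redelivered = sub.pull()  # reading-2 was never acked, so it is still pending
print(redelivered)
```

If a consumer crashes before acknowledging, the message is still there on the next pull, which is why downstream processing should tolerate duplicates.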


Orchestration & Metadata

Cloud Composer — Workflow Orchestration

Cloud Composer is managed Apache Airflow on GCP. Use it to schedule and orchestrate complex data pipelines with dependencies, for example running a Dataflow job only after a GCS file arrives.

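# Example Airflow DAG with a single Dataflow task.
# Note: this operator is deprecated in newer Google provider releases; check your installed version.
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.dataflow import DataflowCreatePythonJobOperator

# The DAG id and start date are illustrative assumptions.
with DAG("etl_pipeline", start_date=datetime(2024, 1, 1)) as dag:
    run_etl = DataflowCreatePythonJobOperator(
        task_id="run_etl_pipeline",
        py_file="gs://my-bucket/etl_job.py",
    )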
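
The core of orchestration is running tasks in dependency order. The Python sketch below illustrates that idea with the standard library's `graphlib` (Python 3.9+) rather than Airflow itself. The three task names are hypothetical and mirror the example of waiting for a GCS file before running a Dataflow job.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks that must finish before it starts.
dependencies = {
    "wait_for_gcs_file": set(),
    "run_dataflow_job": {"wait_for_gcs_file"},
    "load_bigquery_table": {"run_dataflow_job"},
}

# static_order() yields tasks so that every dependency comes first.
execution_order = list(TopologicalSorter(dependencies).static_order())
print(execution_order)
```

Airflow adds scheduling, retries, and monitoring on top of this ordering, but the dependency graph is the part you declare in a DAG.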

Data Catalog — Metadata & Discoverability

Data Catalog is a fully managed metadata service that makes your data assets discoverable across GCP. It automatically catalogs BigQuery and Pub/Sub assets, can register Cloud Storage data through user-defined filesets, and lets you add business metadata and tags, which is essential for large organizations managing hundreds of datasets.


Key Takeaways

  • Match storage class to access frequency: Use Archive for data accessed less than once a year; use lifecycle policies to automate transitions and control cost.
  • Partition BigQuery tables: Partitioning by date is one of the most effective ways to reduce query costs and improve performance in BigQuery, provided queries filter on the partitioning column.
  • Pick the right database for the job: Bigtable for time-series/high throughput, Spanner for global consistency at scale, Cloud SQL for standard relational workloads, and Firestore/NoSQL when schema flexibility is needed.
  • Dataflow handles real-time ETL; Dataproc handles Hadoop/Spark: Use Dataflow for streaming pipelines (especially with Pub/Sub), and Dataproc when migrating existing Spark jobs.
  • Orchestrate with Cloud Composer, discover with Data Catalog: Complex multi-step pipelines need Airflow-based scheduling; Data Catalog ensures your data assets are findable and well-documented.