Google Cloud Certified Professional Data Engineer Fundamentals — Quiz 1 — Study Guide

Google Cloud Professional Data Engineer Fundamentals — Study Guide

As a data engineer on Google Cloud, your job is to move, store, transform, and serve data reliably and cost-effectively. This guide covers the core GCP services and concepts you need to master — from choosing the right database to optimizing query costs — so you can make smart architectural decisions on the exam and in the real world.

Google Cloud Storage (GCS) & Storage Classes

GCS is the foundation of most data architectures on GCP. Think of it as a virtually unlimited object store in the cloud (not a true file system), well suited to data lakes, backups, and staging areas.

Storage Classes Comparison

Storage Class	Best For	Minimum Storage Duration	Relative Storage Cost
Standard	Frequently accessed data	None	Highest
Nearline	Accessed ~once per month	30 days	Lower
Coldline	Accessed ~once per quarter	90 days	Even lower
Archive	Accessed less than once per year	365 days	Lowest

Exam tip: If data is accessed less than once per year, Archive is your answer for cost optimization.
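The table above can be read as a simple decision rule. Here is a minimal sketch, assuming you can estimate how often an object is read per year. The function name and the exact thresholds are illustrative, taken from the rules of thumb in the table rather than from any Google API.

```python
def suggest_storage_class(reads_per_year: float) -> str:
    """Map an estimated access frequency to a GCS storage class.

    Thresholds are illustrative and follow the table's rules of thumb.
    """
    if reads_per_year < 0:
        raise ValueError("reads_per_year must be non-negative")
    if reads_per_year < 1:
        return "ARCHIVE"    # accessed less than once per year
    if reads_per_year <= 4:
        return "COLDLINE"   # roughly once per quarter
    if reads_per_year <= 12:
        return "NEARLINE"   # roughly once per month
    return "STANDARD"       # frequently accessed


for reads in (0.5, 4, 12, 365):
    print(reads, suggest_storage_class(reads))
```

In practice you would also weigh retrieval fees and the minimum storage duration, which this sketch ignores.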

Lifecycle Policies

Lifecycle policies automatically transition or delete objects based on rules you define — a key tool for data retention and cost management.

{
  "rule": [
    {
      "action": { "type": "SetStorageClass", "storageClass": "COLDLINE" },
      "condition": { "age": 90 }
    },
    {
      "action": { "type": "Delete" },
      "condition": { "age": 365 }
    }
  ]
}

This policy moves objects to Coldline after 90 days, then deletes them after a year — automating your data management without manual intervention.
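To see how the age conditions play out, the sketch below parses the same policy and lists which rules an object of a given age has met. It is a local simulation only: the helper name matching_actions is hypothetical, and it does not call Cloud Storage.

```python
import json

# The lifecycle policy from the example above, as a JSON string.
LIFECYCLE_JSON = """
{
  "rule": [
    {"action": {"type": "SetStorageClass", "storageClass": "COLDLINE"},
     "condition": {"age": 90}},
    {"action": {"type": "Delete"},
     "condition": {"age": 365}}
  ]
}
"""


def matching_actions(policy: dict, age_days: int) -> list:
    """Return the action types whose age condition the object has met."""
    return [
        rule["action"]["type"]
        for rule in policy["rule"]
        if age_days >= rule["condition"]["age"]
    ]


policy = json.loads(LIFECYCLE_JSON)
for age in (30, 90, 400):
    print(age, matching_actions(policy, age))
```

When both rules match, as for a 400-day-old object, Cloud Storage applies Delete in preference to SetStorageClass; the sketch only reports the matches and does not model that precedence.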

BigQuery — Serverless Data Warehousing

BigQuery is GCP's fully managed, serverless data warehouse. It's designed for analytical queries over massive datasets using SQL.

Partitioning for Cost & Performance

Partitioning splits a table into segments (usually by date or integer range), so queries only scan the relevant partition instead of the entire table.

-- Create a table partitioned by ingestion date. Ingestion-time partitioning
-- cannot be combined with AS SELECT, so columns are declared (types illustrative).
CREATE TABLE my_dataset.events (user_id STRING, event_type STRING)
PARTITION BY DATE(_PARTITIONTIME);

-- Query only scans the relevant partition
SELECT user_id, event_type
FROM my_dataset.events
WHERE DATE(_PARTITIONTIME) = '2024-01-15';

Why it matters: Under on-demand pricing, BigQuery charges by bytes scanned. Filtering on the partition column lets BigQuery skip every other partition, which lowers cost and speeds up queries.
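A quick back-of-the-envelope calculation shows the size of the effect. This is a sketch under stated assumptions: the table size (one year of 10 GiB daily partitions) and the on-demand price per TiB are both made up for illustration, so check current pricing before relying on the numbers.

```python
TIB = 1024 ** 4
GIB = 1024 ** 3


def on_demand_cost(bytes_scanned: int, usd_per_tib: float) -> float:
    """Cost of one query when billing is proportional to bytes scanned."""
    return bytes_scanned / TIB * usd_per_tib


# Assumptions (not from the guide): 365 daily partitions of 10 GiB each,
# and an illustrative on-demand price per TiB.
partition_bytes = 10 * GIB
table_bytes = 365 * partition_bytes
assumed_usd_per_tib = 6.25

full_scan = on_demand_cost(table_bytes, assumed_usd_per_tib)
one_day = on_demand_cost(partition_bytes, assumed_usd_per_tib)

print(f"full table scan: ${full_scan:.2f}")
print(f"single partition: ${one_day:.4f}")
print(f"reduction factor: {full_scan / one_day:.0f}x")
```

Whatever the actual price, the ratio depends only on how many partitions the filter lets the query skip.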

Denormalization & Query Performance

Unlike traditional relational databases, BigQuery favors denormalized schemas (wide tables with nested and repeated fields). Joins across huge tables are expensive; embedding related records in the parent row avoids that cost.
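The idea can be illustrated without BigQuery at all. In the sketch below, plain Python dictionaries stand in for rows: the normalized form needs a join to total each order, while the denormalized form carries its items as a repeated nested field. The table names, columns, and values are invented for the example.

```python
# Normalized: two "tables" that must be joined at query time.
orders = [
    {"order_id": 1, "customer": "ana"},
    {"order_id": 2, "customer": "bo"},
]
order_items = [
    {"order_id": 1, "sku": "A", "qty": 2},
    {"order_id": 1, "sku": "B", "qty": 1},
    {"order_id": 2, "sku": "A", "qty": 5},
]


def denormalize(orders, order_items):
    """Embed each order's items as a repeated nested field."""
    items_by_order = {}
    for item in order_items:
        items_by_order.setdefault(item["order_id"], []).append(
            {"sku": item["sku"], "qty": item["qty"]}
        )
    return [
        {**order, "items": items_by_order.get(order["order_id"], [])}
        for order in orders
    ]


nested_orders = denormalize(orders, order_items)

# With nested rows, per-order totals need no join at read time.
total_qty = {o["order_id"]: sum(i["qty"] for i in o["items"]) for o in nested_orders}
print(total_qty)
```

The join work is done once, when the data is written, instead of on every query.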

Geospatial Data

BigQuery supports a native GEOGRAPHY data type for geospatial analysis:

SELECT ST_DISTANCE(
  ST_GEOGPOINT(-122.4194, 37.7749),  -- San Francisco
  ST_GEOGPOINT(-118.2437, 34.0522)   -- Los Angeles
) AS distance_meters;
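To sanity-check a result like this outside BigQuery, you can approximate the same distance with the haversine formula. This is a sketch assuming a perfectly spherical Earth with a mean radius of about 6,371 km, so expect it to be close to, but not necessarily identical with, what ST_DISTANCE returns. The function name is hypothetical; the argument order (longitude first) mirrors ST_GEOGPOINT.

```python
import math

EARTH_RADIUS_M = 6_371_008.8  # mean Earth radius; spherical approximation


def haversine_m(lon1: float, lat1: float, lon2: float, lat2: float) -> float:
    """Great-circle distance in meters between two lon/lat points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (
        math.sin(dphi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    )
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


# San Francisco to Los Angeles, same coordinates as the SQL example.
distance_m = haversine_m(-122.4194, 37.7749, -118.2437, 34.0522)
print(f"{distance_m / 1000:.0f} km")
```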

Choosing the Right Database

Service	Type	Best For	Key Feature
Cloud SQL	Relational (SQL)	Transactional apps, moderate scale	Managed MySQL/PostgreSQL/SQL Server
Spanner	Relational (SQL)	Global apps needing strong consistency	Horizontal scale + ACID
Bigtable	NoSQL (wide-column)	Time-series, IoT, high throughput	Millions of reads/writes per second
Firestore	NoSQL (document)	Flexible schema, mobile/web apps	Schema flexibility
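For exam practice, the table can be condensed into a tiny decision helper. This is only a sketch of the table's logic: the function and its three yes/no parameters are hypothetical simplifications, and real selection involves cost, latency, and operational factors the table does not capture.

```python
def pick_database(
    relational: bool,
    global_scale: bool = False,
    high_throughput_key_lookups: bool = False,
) -> str:
    """Pick a GCP database using the four-row comparison table."""
    if relational:
        # SQL semantics: Spanner for global scale, Cloud SQL otherwise.
        return "Spanner" if global_scale else "Cloud SQL"
    # NoSQL: Bigtable for very high throughput key lookups, else Firestore.
    return "Bigtable" if high_throughput_key_lookups else "Firestore"


print(pick_database(relational=True))
print(pick_database(relational=True, global_scale=True))
print(pick_database(relational=False, high_throughput_key_lookups=True))
print(pick_database(relational=False))
```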

Bigtable — Time-Series & High Throughput

Bigtable is optimized for time-series data (metrics, IoT sensor readings) with extremely high ingestion rates and low-latency reads. It scales to petabytes with consistent single-digit millisecond latency.

Use Bigtable when you need:

Millions of rows written per second

Key-based lookups (not complex SQL queries)

Time-ordered data like application logs or financial ticks
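Because Bigtable stores rows sorted by row key, key-based lookups on time-ordered data depend heavily on how you build the key. The sketch below shows one common layout, assumed here for illustration: an entity id followed by a zero-padded timestamp, so that all readings for one sensor sit together in time order. The helper names and sensor ids are hypothetical, and a sorted Python list stands in for the table.

```python
from datetime import datetime, timezone


def make_row_key(sensor_id: str, ts: datetime) -> bytes:
    """Hypothetical key layout: <sensor_id>#<zero-padded epoch millis>."""
    millis = int(ts.timestamp() * 1000)
    return f"{sensor_id}#{millis:013d}".encode("utf-8")


def scan_prefix(sorted_keys, prefix: bytes):
    """Stand-in for a prefix scan over lexicographically sorted row keys."""
    return [key for key in sorted_keys if key.startswith(prefix)]


readings = [
    ("sensor-b", datetime(2024, 1, 15, 12, 0, tzinfo=timezone.utc)),
    ("sensor-a", datetime(2024, 1, 15, 12, 5, tzinfo=timezone.utc)),
    ("sensor-a", datetime(2024, 1, 15, 12, 0, tzinfo=timezone.utc)),
]

# Sorting the keys mimics how rows are ordered in the table.
table = sorted(make_row_key(sensor, ts) for sensor, ts in readings)
sensor_a = scan_prefix(table, b"sensor-a#")
print(sensor_a)
```

Leading with the sensor id rather than the timestamp also spreads concurrent writes across the key space instead of concentrating them on the newest rows.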

Spanner — Global Consistency

Spanner is unique: it's a distributed relational database that offers both horizontal scalability *and* strong ACID consistency globally. Use it when you need SQL semantics at planetary scale — think global banking or inventory systems.

Cloud SQL vs. Spanner

Cloud SQL: Single-region instance that scales vertically, with a per-instance storage cap in the tens of TB; familiar MySQL/PostgreSQL

Spanner: Regional or multi-region, scales horizontally by adding nodes, higher cost; use when Cloud SQL hits its limits

NoSQL & Schema Flexibility

NoSQL databases like Firestore and Bigtable don't enforce a fixed schema, making them ideal when your data structure evolves over time or varies between records.

Data Processing Services

Dataflow — Streaming & Batch ETL

Dataflow is GCP's fully managed service for ETL pipelines, supporting both batch and streaming data processing using Apache Beam. It's the go-to for real-time data transformations.

Common pattern:

Pub/Sub (ingest) → Dataflow (transform) → BigQuery (store/analyze)

Use a staging table in BigQuery to land raw streaming data before transforming it into your final schema.
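The staging idea can be sketched in plain Python, with lists standing in for the Pub/Sub subscription and the BigQuery tables. Everything here is illustrative: the function names, the message format, and the final schema (user_id, event_type) are assumptions, and a real pipeline would express these steps as Apache Beam transforms.

```python
import json


def land_in_staging(messages, staging):
    """Append every raw payload unchanged so nothing is lost."""
    for payload in messages:
        staging.append({"raw": payload})


def transform(staging):
    """Parse staged rows into the final schema; keep rejects for review."""
    final, rejected = [], []
    for row in staging:
        try:
            event = json.loads(row["raw"])
            final.append(
                {
                    "user_id": str(event["user_id"]),
                    "event_type": event["event_type"].lower(),
                }
            )
        except (json.JSONDecodeError, KeyError, TypeError, AttributeError):
            rejected.append(row)
    return final, rejected


incoming = [
    '{"user_id": 1, "event_type": "CLICK"}',
    "not json",
    '{"user_id": 2}',
]
staging_table = []
land_in_staging(incoming, staging_table)
final_rows, rejected_rows = transform(staging_table)
print(final_rows)
print(len(rejected_rows), "rows left in staging for inspection")
```

Because the raw payloads are kept, a bug in the transform can be fixed and the staged data reprocessed without asking producers to resend anything.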

Dataproc — Managed Hadoop & Spark

Dataproc runs managed Apache Hadoop and Spark clusters. Use it when you're migrating existing Hadoop workloads to GCP or need the full Spark ecosystem for large-scale batch processing.

Pub/Sub — Resilient Message Ingestion

Pub/Sub is a fully managed messaging service that decouples producers from consumers. It provides resilience by durably storing messages until subscribers process them — critical for real-time streaming pipelines that can't afford data loss.
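The "store until processed" behavior comes from acknowledgements: a message stays available until a subscriber explicitly acks it. The toy class below sketches that contract in memory. It is not the Pub/Sub client library, and it leaves out real features such as ack deadlines, retention limits, and multiple subscriptions.

```python
class InMemorySubscription:
    """Toy model of a subscription that retains unacknowledged messages."""

    def __init__(self):
        self._next_id = 0
        self._unacked = {}  # message id -> payload

    def publish(self, payload: str) -> int:
        self._next_id += 1
        self._unacked[self._next_id] = payload
        return self._next_id

    def pull(self):
        """Return every message not yet acknowledged (redeliveries included)."""
        return list(self._unacked.items())

    def ack(self, message_id: int) -> None:
        self._unacked.pop(message_id, None)


sub = InMemorySubscription()
sub.publish("order-created")
sub.publish("order-paid")

# The consumer handles the first message, then "crashes" before the second.
first_id, _ = sub.pull()[0]
sub.ack(first_id)

# After a restart, the unacknowledged message is delivered again.
redelivered = sub.pull()
print(redelivered)
```

The flip side of this guarantee is that consumers may see a message more than once, so downstream processing should tolerate duplicates.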

Orchestration & Metadata

Cloud Composer — Workflow Orchestration

Cloud Composer is managed Apache Airflow on GCP. Use it to schedule and orchestrate complex data pipelines with dependencies — for example, running a Dataflow job only after a GCS file arrives.

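# Example Airflow DAG task
from datetime import datetime
from airflow import DAG
from airflow.providers.google.cloud.operators.dataflow import DataflowCreatePythonJobOperator

# Minimal DAG for the task to attach to (dag_id and start_date are illustrative)
dag = DAG(dag_id='etl_pipeline', start_date=datetime(2024, 1, 1))

run_etl = DataflowCreatePythonJobOperator(
    task_id='run_etl_pipeline',
    py_file='gs://my-bucket/etl_job.py',
    dag=dag
)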
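What an orchestrator adds on top of individual tasks is dependency ordering. The sketch below uses Python's standard graphlib module, not Airflow, to show how a set of "runs after" relationships resolves into an execution order. The task names are hypothetical and only echo the GCS-file-then-Dataflow example.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks it must wait for (names are hypothetical).
dependencies = {
    "wait_for_gcs_file": set(),
    "run_dataflow_etl": {"wait_for_gcs_file"},
    "load_bigquery_report": {"run_dataflow_etl"},
    "notify_team": {"load_bigquery_report"},
}

# static_order() yields tasks so that every dependency comes first.
execution_order = list(TopologicalSorter(dependencies).static_order())
print(execution_order)
```

In Airflow the same relationships are declared between operators, and the scheduler additionally handles retries, schedules, and waiting on external events.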

Data Catalog — Metadata & Discoverability

Data Catalog is a fully managed metadata service that makes your data assets discoverable across GCP. It automatically catalogs BigQuery tables and Pub/Sub topics, lets you register Cloud Storage files as fileset entries, and supports business metadata and tags, which is essential for large organizations managing hundreds of datasets.

Key Takeaways

Match storage class to access frequency: Use Archive for data accessed less than once a year; use lifecycle policies to automate transitions and control cost.

Partition BigQuery tables: Partitioning by date is one of the most impactful ways to reduce query costs and improve performance in BigQuery, provided queries filter on the partition column.

Pick the right database for the job: Bigtable for time-series/high throughput, Spanner for global consistency at scale, Cloud SQL for standard relational workloads, and Firestore/NoSQL when schema flexibility is needed.

Dataflow handles real-time ETL; Dataproc handles Hadoop/Spark: Use Dataflow for streaming pipelines (especially with Pub/Sub), and Dataproc when migrating existing Spark jobs.

Orchestrate with Cloud Composer, discover with Data Catalog: Complex multi-step pipelines need Airflow-based scheduling; Data Catalog ensures your data assets are findable and well-documented.