Google Cloud Certified Professional Data Engineer Fundamentals — Quiz 1
As a data engineer on Google Cloud, your job is to move, store, transform, and serve data reliably and cost-effectively. This guide covers the core GCP services and concepts you need to master — from choosing the right database to optimizing query costs — so you can make smart architectural decisions on the exam and in the real world.
Google Cloud Storage (GCS) & Storage Classes
GCS is the foundation of most data architectures on GCP. Think of it as a virtually unlimited object store in the cloud, ideal for data lakes, backups, and staging areas.
Storage Classes Comparison
| Storage Class | Best For | Minimum Storage Duration | Relative Storage Cost |
|---|---|---|---|
| Standard | Frequently accessed data | None | Highest |
| Nearline | Accessed ~once per month | 30 days | Lower |
| Coldline | Accessed ~once per quarter | 90 days | Even lower |
| Archive | Accessed less than once per year | 365 days | Lowest |
Exam tip: If data is accessed less than once per year, Archive is your answer for cost optimization.
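As a quick way to rehearse the table, the mapping from access frequency to storage class can be sketched as a small function. The exact cutoffs (4 and 12 accesses per year) are assumptions taken from the "once per quarter" and "once per month" rows, not official thresholds, and `pick_storage_class` is a hypothetical helper name.

```python
def pick_storage_class(accesses_per_year: float) -> str:
    """Map an expected access frequency to a GCS storage class.

    Cutoffs are illustrative assumptions based on the comparison table:
    less than yearly -> Archive, about quarterly -> Coldline,
    about monthly -> Nearline, anything more frequent -> Standard.
    """
    if accesses_per_year < 1:
        return "ARCHIVE"
    if accesses_per_year <= 4:
        return "COLDLINE"
    if accesses_per_year <= 12:
        return "NEARLINE"
    return "STANDARD"


for freq in (0.5, 4, 12, 365):
    print(freq, "->", pick_storage_class(freq))
```

Real decisions should also weigh retrieval fees and the minimum storage duration, which this sketch ignores.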
Lifecycle Policies
Lifecycle policies automatically transition or delete objects based on rules you define — a key tool for data retention and cost management.
{
  "rule": [
    {
      "action": { "type": "SetStorageClass", "storageClass": "COLDLINE" },
      "condition": { "age": 90 }
    },
    {
      "action": { "type": "Delete" },
      "condition": { "age": 365 }
    }
  ]
}
This policy moves objects to Coldline after 90 days, then deletes them after a year, automating your data management without manual intervention.
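To check your reading of the policy, the rules can be evaluated locally against a few object ages. This is only a plain-Python simulation, not a call to GCS: `outcome_for_age` is a hypothetical helper, it handles only the `age` condition, and it assumes that a matching Delete rule wins over a matching SetStorageClass rule.

```python
import json

# Same policy as above, parsed from JSON.
POLICY = json.loads("""
{"rule": [
  {"action": {"type": "SetStorageClass", "storageClass": "COLDLINE"},
   "condition": {"age": 90}},
  {"action": {"type": "Delete"}, "condition": {"age": 365}}
]}
""")


def outcome_for_age(policy: dict, age_days: int) -> str:
    """Return the state an object of the given age would end up in."""
    # A rule matches once the object is at least `age` days old.
    matching = [r["action"] for r in policy["rule"]
                if age_days >= r["condition"]["age"]]
    # Assumption: Delete takes precedence when several rules match.
    if any(a["type"] == "Delete" for a in matching):
        return "DELETED"
    for action in matching:
        if action["type"] == "SetStorageClass":
            return action["storageClass"]
    return "UNCHANGED"


for age in (10, 90, 364, 365):
    print(age, "->", outcome_for_age(POLICY, age))
```

In the real service lifecycle actions run asynchronously, so an object may linger briefly after its condition is met.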
BigQuery — Serverless Data Warehousing
BigQuery is GCP's fully managed, serverless data warehouse. It's designed for analytical queries over massive datasets using SQL.
Partitioning for Cost & Performance
Partitioning splits a table into segments (usually by date or integer range), so queries only scan the relevant partition instead of the entire table.
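-- Create an ingestion-time partitioned table (one partition per day).
-- Ingestion-time partitioning cannot be combined with AS SELECT, so the
-- columns are declared explicitly; the column types here are illustrative.
CREATE TABLE my_dataset.events (
  user_id STRING,
  event_type STRING
)
PARTITION BY DATE(_PARTITIONTIME);

-- Query scans only the partition for the requested day
SELECT user_id, event_type
FROM my_dataset.events
WHERE DATE(_PARTITIONTIME) = '2024-01-15';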
Why it matters: Under on-demand pricing, BigQuery charges by bytes scanned. A query that filters on the partition column reads only the matching partitions → lower cost, faster queries.
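The cost effect is easy to estimate with back-of-the-envelope arithmetic. The sketch below assumes data is spread evenly across daily partitions; the 10 TiB table size and the price of 5.0 USD per TiB are hypothetical placeholders, so check current BigQuery pricing before relying on the numbers.

```python
def scan_cost_usd(bytes_scanned: int, usd_per_tib: float) -> float:
    """On-demand style cost: TiB scanned times the price per TiB."""
    return bytes_scanned / 2**40 * usd_per_tib


def partitioned_scan_bytes(table_bytes: int, total_partitions: int,
                           partitions_read: int) -> int:
    """Bytes scanned when only some evenly sized partitions are read."""
    return table_bytes * partitions_read // total_partitions


TABLE_BYTES = 10 * 2**40   # hypothetical 10 TiB table
PRICE = 5.0                # hypothetical USD per TiB

full = scan_cost_usd(TABLE_BYTES, PRICE)
one_day = scan_cost_usd(partitioned_scan_bytes(TABLE_BYTES, 365, 1), PRICE)
print(f"full scan: ${full:.2f}, one daily partition: ${one_day:.4f}")
```

With a year of daily partitions, a single-day query touches roughly 1/365 of the table, and the estimated cost shrinks by the same factor.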
Denormalization & Query Performance
Unlike traditional relational databases, BigQuery favors denormalized schemas: wide tables that embed related data as nested and repeated fields (STRUCT and ARRAY columns). Joins across huge tables are expensive; embedding nested data avoids that cost.
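The difference between the two layouts can be illustrated with plain Python data structures. This is only an analogy for how nested, repeated records remove the need for a join; the `orders` and `line_items` data and both helper functions are hypothetical.

```python
# Normalized layout: two tables that must be joined on order_id.
orders = [{"order_id": 1, "customer": "alice"}]
line_items = [
    {"order_id": 1, "sku": "A-100", "qty": 2},
    {"order_id": 1, "sku": "B-200", "qty": 1},
]

# Denormalized layout: line items embedded as a repeated record.
orders_nested = [
    {"order_id": 1, "customer": "alice",
     "items": [{"sku": "A-100", "qty": 2}, {"sku": "B-200", "qty": 1}]},
]


def total_qty_joined(order_id: int) -> int:
    """Needs a pass over the second table (the join)."""
    return sum(li["qty"] for li in line_items if li["order_id"] == order_id)


def total_qty_nested(order: dict) -> int:
    """Reads only the row itself; no second table is involved."""
    return sum(item["qty"] for item in order["items"])


print(total_qty_joined(1), total_qty_nested(orders_nested[0]))
```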
Geospatial Data
BigQuery supports a native GEOGRAPHY data type for geospatial analysis:
SELECT ST_DISTANCE(
  ST_GEOGPOINT(-122.4194, 37.7749), -- San Francisco
  ST_GEOGPOINT(-118.2437, 34.0522)  -- Los Angeles
) AS distance_meters;
Choosing the Right Database
| Service | Type | Best For | Key Feature |
|---|---|---|---|
| Cloud SQL | Relational (SQL) | Transactional apps, moderate scale | Managed MySQL/PostgreSQL/SQL Server |
| Spanner | Relational (SQL) | Global apps needing strong consistency | Horizontal scale + ACID |
| Bigtable | NoSQL (wide-column) | Time-series, IoT, high throughput | Millions of reads/writes per second |
| Firestore | NoSQL (document) | Flexible schema, mobile/web apps | Schema flexibility |
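The table above can be read as a small decision tree. The sketch below encodes that reading only; `choose_database` and its three yes/no inputs are a hypothetical simplification, and real selection also depends on cost, latency targets, and existing tooling.

```python
def choose_database(relational: bool, global_scale: bool = False,
                    high_throughput: bool = False) -> str:
    """Pick a service using the coarse criteria from the comparison table."""
    if relational:
        # SQL workloads: Spanner for global scale, Cloud SQL otherwise.
        return "Spanner" if global_scale else "Cloud SQL"
    # NoSQL workloads: Bigtable for very high throughput, Firestore otherwise.
    return "Bigtable" if high_throughput else "Firestore"


print(choose_database(relational=True))                          # Cloud SQL
print(choose_database(relational=True, global_scale=True))       # Spanner
print(choose_database(relational=False, high_throughput=True))   # Bigtable
print(choose_database(relational=False))                         # Firestore
```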
Bigtable — Time-Series & High Throughput
Bigtable is optimized for time-series data (metrics, IoT sensor readings) with extremely high ingestion rates and low-latency reads. It scales to petabytes with consistent single-digit millisecond latency.
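Use Bigtable when you need very high write throughput, low-latency reads, and petabyte-scale storage for time-series or IoT data.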
Spanner — Global Consistency
Spanner is unique: it's a distributed relational database that offers both horizontal scalability *and* strong ACID consistency globally. Use it when you need SQL semantics at planetary scale — think global banking or inventory systems.
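Cloud SQL vs. Spanner
Choose Cloud SQL for transactional applications at moderate scale on a managed MySQL or PostgreSQL instance; choose Spanner when the application must scale horizontally or serve users globally while keeping strong ACID consistency.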
NoSQL & Schema Flexibility
NoSQL databases like Firestore and Bigtable don't enforce a fixed schema, making them ideal when your data structure evolves over time or varies between records.
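Schema flexibility shifts responsibility to the reader of the data. The sketch below uses plain dictionaries as stand-ins for documents in one collection; the field names and the `contact_for` helper are hypothetical.

```python
# Two "documents" in the same hypothetical collection with different fields.
users = [
    {"id": "u1", "name": "Ada", "email": "ada@example.com"},
    {"id": "u2", "name": "Lin", "phone": "+1-555-0100",
     "preferences": {"theme": "dark"}},
]


def contact_for(doc: dict) -> str:
    """Readers must tolerate missing fields, since nothing enforces them."""
    return doc.get("email") or doc.get("phone") or "unknown"


contacts = [contact_for(u) for u in users]
print(contacts)
```

Adding the `preferences` field to one record required no migration, but every consumer now has to handle records that lack it.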
Data Processing Services
Dataflow — Streaming & Batch ETL
Dataflow is GCP's fully managed service for ETL pipelines, supporting both batch and streaming data processing using Apache Beam. It's the go-to for real-time data transformations.
Common pattern:
Pub/Sub (ingest) → Dataflow (transform) → BigQuery (store/analyze)
Use a staging table in BigQuery to land raw streaming data before transforming it into your final schema.
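The staging idea can be sketched without any cloud services. The code below is a plain-Python simulation, not Apache Beam or the BigQuery API: lists stand in for the topic and tables, and the function names and message shape (`user_id`, `type`) are assumptions made for illustration.

```python
import json


def ingest(raw_messages):
    """Stand-in for a Pub/Sub subscription: yields raw payloads."""
    yield from raw_messages


def to_staging_row(payload: bytes) -> dict:
    """Land the payload untouched so it can be reprocessed later."""
    return {"raw": payload.decode("utf-8")}


def to_final_row(staging_row: dict):
    """Parse and reshape; return None for rows that fail validation."""
    try:
        event = json.loads(staging_row["raw"])
        return {"user_id": str(event["user_id"]),
                "event_type": event["type"].lower()}
    except (ValueError, KeyError, TypeError, AttributeError):
        return None


messages = [b'{"user_id": 7, "type": "CLICK"}', b"not json"]
staging = [to_staging_row(m) for m in ingest(messages)]
final = [row for row in map(to_final_row, staging) if row is not None]
print(len(staging), final)
```

Because the malformed message still lands in staging, it can be inspected or replayed after the parsing logic is fixed, which is the main reason to keep the raw layer.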
Dataproc — Managed Hadoop & Spark
Dataproc runs managed Apache Hadoop and Spark clusters. Use it when you're migrating existing Hadoop workloads to GCP or need the full Spark ecosystem for large-scale batch processing.
Pub/Sub — Resilient Message Ingestion
Pub/Sub is a fully managed messaging service that decouples producers from consumers. It provides resilience by durably storing each message until a subscriber acknowledges it or the retention period expires, which is critical for real-time streaming pipelines that can't afford data loss.
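The acknowledgement behavior behind that resilience can be shown with a toy queue. This is a simplified in-memory model, not the Pub/Sub client library: the `Subscription` class is hypothetical and ignores ack deadlines, ordering, and retention limits.

```python
from collections import deque


class Subscription:
    """Toy subscription: a message stays queued until it is acknowledged."""

    def __init__(self):
        self._pending = deque()

    def publish(self, message: str) -> None:
        self._pending.append(message)

    def pull(self):
        """Return the next unacknowledged message without removing it."""
        return self._pending[0] if self._pending else None

    def ack(self, message: str) -> None:
        """Remove a message once the subscriber has finished with it."""
        if self._pending and self._pending[0] == message:
            self._pending.popleft()


sub = Subscription()
sub.publish("reading-1")

first_attempt = sub.pull()    # subscriber crashes before acknowledging
second_attempt = sub.pull()   # the same message is delivered again
sub.ack(second_attempt)
remaining = sub.pull()
print(first_attempt, second_attempt, remaining)
```

The redelivery in the second pull is why subscribers are usually written to be idempotent: the same message may be processed more than once.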
Orchestration & Metadata
Cloud Composer — Workflow Orchestration
Cloud Composer is managed Apache Airflow on GCP. Use it to schedule and orchestrate complex data pipelines with dependencies — for example, running a Dataflow job only after a GCS file arrives.
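# Example Airflow DAG with a single Dataflow task
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.dataflow import DataflowCreatePythonJobOperator

# The DAG id and start date are illustrative placeholders.
# Note: newer Google provider releases may deprecate this operator; check your installed version.
with DAG(dag_id='etl_pipeline', start_date=datetime(2024, 1, 1), catchup=False) as dag:
    run_etl = DataflowCreatePythonJobOperator(
        task_id='run_etl_pipeline',
        py_file='gs://my-bucket/etl_job.py',
    )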
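The dependency ordering that an orchestrator resolves can be illustrated with the Python standard library alone. This sketch does not use Airflow; it only shows how a run order follows from declared dependencies. The task names other than `run_etl_pipeline` are hypothetical.

```python
from graphlib import TopologicalSorter

# Each key lists the tasks it depends on.
dependencies = {
    "wait_for_gcs_file": set(),
    "run_etl_pipeline": {"wait_for_gcs_file"},
    "load_report_table": {"run_etl_pipeline"},
}

# static_order() yields tasks so that every dependency comes first.
run_order = list(TopologicalSorter(dependencies).static_order())
print(run_order)
```

In Airflow the same relationships are declared between operators, and the scheduler additionally handles retries, schedules, and waiting on external events such as a file arriving.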
Data Catalog — Metadata & Discoverability
Data Catalog is a fully managed metadata service that makes your data assets discoverable across GCP. It automatically catalogs BigQuery tables and Pub/Sub topics, can register Cloud Storage files through filesets, and lets you add business metadata and tags, which is essential for large organizations managing hundreds of datasets.