Databricks Data Engineer Associate

Workspace & Clusters — Quiz 1 — Study Guide

Understanding how Databricks organizes your work environment is foundational to everything else you'll do on the platform. Whether you're a data engineer running pipelines, a data scientist training models, or an admin managing a team, knowing how the Workspace, clusters, and permissions fit together will save you hours of frustration and help you build secure, cost-effective solutions.

The Databricks Workspace

The Workspace is your central hub in Databricks — think of it like a shared Google Drive, but purpose-built for data and AI work. It's where you organize and access all your assets: notebooks, files, experiments, and more.

What Lives in the Workspace?

Notebooks — Interactive documents combining code (Python, SQL, Scala, R), visualizations, and markdown text

Folders — Organize your work, just like a file system

Repos — Git-integrated folders for version-controlled code

Experiments — MLflow tracking for machine learning runs

Dashboards — Visual reports built from query results

Navigating the Workspace

The left-hand sidebar is your primary navigation tool. Key sections include:

Icon	Section	Purpose
🏠	Home	Your personal workspace folder
📓	Workspace	Full file browser for all assets
📊	Data	Data Explorer for browsing tables
⚙️	Compute	Manage clusters and SQL warehouses
🔁	Workflows	Schedule jobs and DLT pipelines
🛡️	Admin Console	User and security management (admins only)

Data Explorer & Data Discovery

The Data Explorer lets you browse catalogs, databases (schemas), and tables without writing any code. It's especially useful for:

Previewing table schemas and sample data

Checking table ownership and permissions

Understanding data lineage (with Unity Catalog)

Think of it as a "table of contents" for all your data assets. Data discovery becomes much easier when your team tags tables with descriptions and owners — a best practice that pays dividends at scale.

Clusters: Your Compute Engine

A cluster is a set of virtual machines that run your code. Without a cluster, your notebooks have no engine to execute on.

Types of Clusters

All-Purpose Clusters — Started manually, used for interactive development and notebooks

Job Clusters — Created automatically for a specific job run, then terminated (more cost-efficient)

SQL Warehouses — Optimized compute specifically for SQL analytics
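To make the job cluster versus all-purpose cluster distinction concrete, here is a minimal sketch that builds two task definitions in the general shape used by the Jobs API: one with `new_cluster` (a job cluster created for the run and terminated afterwards) and one with `existing_cluster_id` (an all-purpose cluster that is already defined). The helper name `notebook_task`, the notebook paths, the cluster ID, the runtime version, and the node type are all placeholder assumptions for illustration.

```python
def notebook_task(task_key, notebook_path, *, new_cluster=None, existing_cluster_id=None):
    """Build one task dict that runs a notebook on exactly one kind of compute."""
    # A task should name either a fresh job cluster or an existing cluster, not both.
    if (new_cluster is None) == (existing_cluster_id is None):
        raise ValueError("specify exactly one of new_cluster or existing_cluster_id")
    task = {"task_key": task_key, "notebook_task": {"notebook_path": notebook_path}}
    if new_cluster is not None:
        task["new_cluster"] = new_cluster
    else:
        task["existing_cluster_id"] = existing_cluster_id
    return task


# Job cluster: described inline, created for the run, terminated when it ends.
job_cluster_task = notebook_task(
    "nightly_load",
    "/Repos/team/etl/load",  # hypothetical notebook path
    new_cluster={
        "spark_version": "13.3.x-scala2.12",  # assumed runtime version
        "node_type_id": "i3.xlarge",          # assumed node type
        "num_workers": 2,
    },
)

# All-purpose cluster: referenced by ID, keeps running after the task finishes.
interactive_task = notebook_task(
    "adhoc_check",
    "/Repos/team/etl/check",                   # hypothetical notebook path
    existing_cluster_id="0000-000000-example",  # placeholder cluster ID
)
```

For scheduled production work, the `new_cluster` form is usually the cheaper choice because nothing stays running between runs.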

Cluster Permissions

Permissions on clusters control what different users can do:

Permission	What It Allows
No Permissions	Cannot see or use the cluster
Can Attach To	Can attach a notebook and run code on the cluster
Can Restart	Can attach and restart the cluster
Can Manage	Full control: edit, delete, change permissions

Quiz tip: A user with "Can Attach To" can run notebooks on the cluster but cannot restart or reconfigure it.
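One way to remember the hierarchy is to treat the permission levels as ordered values, where each level includes everything below it. The sketch below is a toy model for study purposes only; `ClusterPermission`, `REQUIRED_LEVEL`, and `is_allowed` are made-up names, not part of any Databricks API.

```python
from enum import IntEnum


class ClusterPermission(IntEnum):
    """Permission levels from the table above, ordered weakest to strongest."""
    NO_PERMISSIONS = 0
    CAN_ATTACH_TO = 1
    CAN_RESTART = 2
    CAN_MANAGE = 3


# Minimum level needed for each action (action names are illustrative).
REQUIRED_LEVEL = {
    "attach": ClusterPermission.CAN_ATTACH_TO,
    "restart": ClusterPermission.CAN_RESTART,
    "edit": ClusterPermission.CAN_MANAGE,
    "delete": ClusterPermission.CAN_MANAGE,
    "change_permissions": ClusterPermission.CAN_MANAGE,
}


def is_allowed(granted, action):
    """Return True if the granted level is at least the level the action needs."""
    return granted >= REQUIRED_LEVEL[action]


print(is_allowed(ClusterPermission.CAN_ATTACH_TO, "attach"))   # True
print(is_allowed(ClusterPermission.CAN_ATTACH_TO, "restart"))  # False
```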

Cluster Policies

Cluster policies are admin-defined templates that restrict what users can configure when creating a cluster. They help enforce:

Maximum VM sizes (cost control)

Required tags for cost attribution

Approved Databricks Runtime versions

This is a key tool for resource management and cost control — preventing users from accidentally spinning up expensive clusters.
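As a rough sketch of what such a policy can look like, the snippet below builds a policy definition as a Python dictionary and serializes it to JSON. It covers the three controls listed above: an allowlist of node types, a fixed tag, and a pattern for runtime versions. The specific node types, tag value, and version pattern are assumptions chosen for illustration; check the policy definition reference for your cloud before using them.

```python
import json

# Each key is a cluster attribute; each value limits what users may set for it.
policy_definition = {
    # Cost control: only these (assumed) node types can be selected.
    "node_type_id": {"type": "allowlist", "values": ["i3.xlarge", "i3.2xlarge"]},
    # Cost attribution: every cluster created under the policy carries this tag.
    "custom_tags.team": {"type": "fixed", "value": "data-engineers"},
    # Approved runtimes: only versions matching this (assumed) pattern.
    "spark_version": {"type": "regex", "pattern": "13\\.3\\.x-.*"},
}

# Policies are submitted as a JSON string, so serialize the dictionary.
policy_json = json.dumps(policy_definition, indent=2)
print(policy_json)
```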

User Management & Permissions

Roles in Databricks

Workspace Admin — Broadest permissions within a single workspace; manages its users, clusters, and settings

Regular User — Can create notebooks and use clusters they have access to

Account Admin — Manages the overall Databricks account (billing, workspaces)

Groups

Instead of assigning permissions to individuals, use groups to manage access at scale. A group might be data-engineers or analysts, and you assign cluster or table permissions to the group once.

SCIM & Identity Management

SCIM (System for Cross-domain Identity Management) is a protocol that lets your company's identity provider (like Okta or Azure AD) automatically sync users and groups into Databricks. When someone joins or leaves your company, their Databricks access is updated automatically — no manual work required.
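To give a feel for what an identity provider sends during a sync, here is a minimal sketch of two generic SCIM 2.0 message bodies: one that creates a user and one that deactivates a user who has left. The schema URNs come from the SCIM 2.0 standard; the helper names and the email address are hypothetical, and the exact attributes a given Databricks endpoint accepts should be checked against its API reference.

```python
import json

SCIM_USER_SCHEMA = "urn:ietf:params:scim:schemas:core:2.0:User"
SCIM_PATCH_SCHEMA = "urn:ietf:params:scim:api:messages:2.0:PatchOp"


def create_user_body(user_name):
    """Body an identity provider might POST when someone joins."""
    return {"schemas": [SCIM_USER_SCHEMA], "userName": user_name, "active": True}


def deactivate_user_body():
    """Body an identity provider might PATCH when someone leaves."""
    return {
        "schemas": [SCIM_PATCH_SCHEMA],
        "Operations": [{"op": "replace", "path": "active", "value": False}],
    }


joiner = create_user_body("new.hire@example.com")  # hypothetical user
leaver = deactivate_user_body()
print(json.dumps(joiner, indent=2))
print(json.dumps(leaver, indent=2))
```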

Admin Console & Security

The Admin Console (found under Settings) is where workspace admins:

Add/remove users and groups

Configure single sign-on (SSO)

Review audit logs — records of who did what and when

Manage secrets and integrations

Secrets Management

Never hardcode passwords or API keys in notebooks! Use Databricks Secrets instead:

# Retrieve a secret at runtime; Databricks redacts the value if it is printed in notebook output
password = dbutils.secrets.get(scope="my-scope", key="db-password")

Secrets are stored encrypted and masked in notebook outputs, keeping credentials out of your code and logs.
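A common next step is passing the secret into a connection configuration without ever writing it down. The sketch below assumes a hypothetical helper, `jdbc_options`, that receives the secret lookup as a function, so the same code works in a notebook (pass `dbutils.secrets.get`) and in a local test (pass a stand-in). The JDBC URL and user name are placeholders, not values from this guide.

```python
def jdbc_options(get_secret, scope="my-scope"):
    """Build JDBC connection options, reading the password from a secret store."""
    return {
        "url": "jdbc:postgresql://db.example.com:5432/sales",  # placeholder URL
        "user": "etl_user",                                    # placeholder user
        "password": get_secret(scope=scope, key="db-password"),
    }


def fake_get_secret(scope, key):
    """Local stand-in for dbutils.secrets.get, used only outside Databricks."""
    return f"<secret {scope}/{key}>"


# In a Databricks notebook you would call: jdbc_options(dbutils.secrets.get)
options = jdbc_options(fake_get_secret)
print(sorted(options))
```

Injecting the lookup function keeps the credential out of the source file and makes the helper easy to test without a workspace.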

Audit Logs & Compliance

Audit logs capture events like cluster creation, login attempts, and data access. These are critical for compliance with regulations like GDPR or HIPAA. Logs are typically exported to cloud storage or a SIEM tool for analysis.
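Once logs are exported, a typical first task is filtering them for a particular kind of event. The sketch below works on a few made-up records; the field names (`serviceName`, `actionName`, `userIdentity`) follow the general shape of Databricks audit log entries as I understand it, so verify them against your own exported logs before relying on them.

```python
# Made-up sample records for illustration; these are not real log entries.
sample_events = [
    {"serviceName": "clusters", "actionName": "create",
     "userIdentity": {"email": "ana@example.com"}},
    {"serviceName": "accounts", "actionName": "login",
     "userIdentity": {"email": "ben@example.com"}},
    {"serviceName": "clusters", "actionName": "delete",
     "userIdentity": {"email": "ana@example.com"}},
]


def events_for(events, service_name, action_name):
    """Return the events that match both the service and the action."""
    return [
        e for e in events
        if e.get("serviceName") == service_name and e.get("actionName") == action_name
    ]


def users_who(events, service_name, action_name):
    """Return the sorted, de-duplicated emails behind the matching events."""
    matches = events_for(events, service_name, action_name)
    return sorted({e["userIdentity"]["email"] for e in matches})


print(users_who(sample_events, "clusters", "create"))
```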

Automation: Jobs, Pipelines & APIs

Delta Live Tables (DLT) & Pipelines

Delta Live Tables (DLT) is a framework for building reliable data pipelines declaratively. Instead of managing dependencies yourself, you define tables and Databricks handles the orchestration:

import dlt

@dlt.table
def cleaned_orders():
    return spark.read.table("raw_orders").filter("status != 'cancelled'")

DLT pipelines manage table updates and retries automatically, and they enforce the data quality rules you declare on your tables.

The Databricks REST API

Almost everything in Databricks can be automated via the REST API — creating clusters, running jobs, managing permissions. This enables CI/CD pipelines and infrastructure-as-code workflows:

# Example: List all clusters via API
curl -X GET "https://<workspace-url>/api/2.0/clusters/list" \
  -H "Authorization: Bearer <your-token>"
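The same call can be prepared from Python using only the standard library. This is a minimal sketch: it builds the request object but does not send it, and the workspace URL and token shown are placeholders. The function name `list_clusters_request` is made up for this example.

```python
import urllib.request


def list_clusters_request(workspace_url, token):
    """Build (but do not send) a GET request for the clusters list endpoint."""
    url = f"{workspace_url.rstrip('/')}/api/2.0/clusters/list"
    return urllib.request.Request(
        url,
        headers={"Authorization": f"Bearer {token}"},
        method="GET",
    )


# Placeholder values; in practice, read the token from a secret or environment variable.
req = list_clusters_request("https://example.cloud.databricks.com/", "dapi-placeholder")
print(req.get_method(), req.full_url)

# To actually send it against a real workspace:
#   with urllib.request.urlopen(req) as resp:
#       body = resp.read()
```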

Collaboration Best Practices

Use Repos for team code — enables branching, pull requests, and code reviews

Use Groups for permissions — easier to manage than individual assignments

Tag clusters with team/project names for cost attribution

Use cluster policies to prevent runaway costs

Document tables in the Data Explorer with descriptions and owners

Rotate secrets regularly and audit access logs

Key Takeaways

The Workspace is your central hub for notebooks, data, and compute — navigate it via the left sidebar, with the Admin Console reserved for admins.

Cluster permissions follow a hierarchy: "Can Attach To" → "Can Restart" → "Can Manage"; users with only "Can Attach To" can run code but not reconfigure the cluster.

Cluster policies are the primary tool for cost control and resource governance, restricting what users can configure.

SCIM automates user and group provisioning from your identity provider, while Secrets Management keeps credentials safe and out of your code.

Audit logs are your compliance paper trail — they record all significant actions in the workspace for security review and regulatory requirements.