Workspace & Clusters — Quiz 1 — Study Guide
Understanding how Databricks organizes your work environment is foundational to everything else you'll do on the platform. Whether you're a data engineer running pipelines, a data scientist training models, or an admin managing a team, knowing how the Workspace, clusters, and permissions fit together will save you hours of frustration and help you build secure, cost-effective solutions.
The Databricks Workspace
The Workspace is your central hub in Databricks — think of it like a shared Google Drive, but purpose-built for data and AI work. It's where you organize and access all your assets: notebooks, files, experiments, and more.
What Lives in the Workspace?
Navigating the Workspace
The left-hand sidebar is your primary navigation tool. Key sections include:
| Icon | Section | Purpose |
|---|---|---|
| 🏠 | Home | Your personal workspace folder |
| 📓 | Workspace | Full file browser for all assets |
| 📊 | Data | Data Explorer for browsing tables |
| ⚙️ | Compute | Manage clusters and SQL warehouses |
| 🔁 | Workflows | Schedule jobs and Delta Live Tables (DLT) pipelines |
| 🛡️ | Admin Console | User and security management (admins only) |
Data Explorer & Data Discovery
The Data Explorer lets you browse catalogs, databases (schemas), and tables without writing any code. It's especially useful for:
Think of it as a "table of contents" for all your data assets. Data discovery becomes much easier when your team tags tables with descriptions and owners — a best practice that pays dividends at scale.
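One way to make the tagging habit concrete is to generate `COMMENT ON TABLE` statements from a small helper. The sketch below only builds the SQL string. The table name and description are hypothetical, and the helper rejects quotes and backslashes instead of attempting to escape them.

```python
def comment_on_table_sql(table_name: str, description: str) -> str:
    """Build a COMMENT ON TABLE statement for a fully qualified table name."""
    # Reject characters that would need escaping inside a SQL string literal,
    # rather than guessing at escaping rules in this sketch.
    if "'" in description or "\\" in description:
        raise ValueError("description must not contain quotes or backslashes")
    return f"COMMENT ON TABLE {table_name} IS '{description}'"


statement = comment_on_table_sql(
    "main.sales.orders",  # hypothetical catalog.schema.table
    "One row per customer order, owned by the data-engineers group",
)
print(statement)
```

Inside a notebook you would typically run the result with `spark.sql(statement)`, after which the description appears next to the table in the Data Explorer.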
Clusters: Your Compute Engine
A cluster is a set of virtual machines that run your code. Without a cluster, your notebooks have no engine to execute on.
Types of Clusters
Cluster Permissions
Permissions on clusters control what different users can do:
| Permission | What It Allows |
|---|---|
| No Permissions | Cannot see or use the cluster |
| Can Attach To | Can attach a notebook and run code on the cluster |
| Can Restart | Can attach *and* restart the cluster |
| Can Manage | Full control: edit, delete, change permissions |
Quiz tip: A user with "Can Attach To" can run notebooks on the cluster but cannot restart or reconfigure it.
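The permission table is cumulative: each level includes everything below it. A minimal local model of that ordering is sketched here. The action names are illustrative labels chosen for this example, not Databricks identifiers, and the check runs entirely in Python without contacting a workspace.

```python
from enum import IntEnum


class ClusterPermission(IntEnum):
    """Permission levels from the table above, ordered least to most powerful."""
    NO_PERMISSIONS = 0
    CAN_ATTACH_TO = 1
    CAN_RESTART = 2
    CAN_MANAGE = 3


# Minimum level needed for each action (action names are illustrative).
REQUIRED_LEVEL = {
    "attach_notebook": ClusterPermission.CAN_ATTACH_TO,
    "restart": ClusterPermission.CAN_RESTART,
    "edit_config": ClusterPermission.CAN_MANAGE,
    "change_permissions": ClusterPermission.CAN_MANAGE,
}


def is_allowed(level: ClusterPermission, action: str) -> bool:
    """Return True if the given level is high enough for the action."""
    return level >= REQUIRED_LEVEL[action]


print(is_allowed(ClusterPermission.CAN_ATTACH_TO, "attach_notebook"))
print(is_allowed(ClusterPermission.CAN_ATTACH_TO, "restart"))
```

The two calls mirror the quiz tip: the first check passes and the second does not, because restarting requires at least Can Restart.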
Cluster Policies
Cluster policies are admin-defined templates that restrict what users can configure when creating a cluster. They help enforce:
This is a key tool for resource management and cost control — preventing users from accidentally spinning up expensive clusters.
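To see what a policy can look like, here is a rough sketch of a policy definition using `allowlist` and `range` rules, plus a small local checker. The node types and limits are made-up example values, and `find_violations` is a hypothetical helper written for this guide: Databricks enforces policies itself when a cluster is created, so the function only illustrates the idea.

```python
import json

# Policy definition: each key is a cluster attribute, each value is a rule.
# The node types and limits below are example values, not recommendations.
policy_definition = {
    "node_type_id": {"type": "allowlist", "values": ["i3.xlarge", "i3.2xlarge"]},
    "num_workers": {"type": "range", "maxValue": 8},
    "autotermination_minutes": {"type": "range", "minValue": 10, "maxValue": 60},
}


def find_violations(cluster_spec: dict, policy: dict) -> list:
    """Return a readable message for every rule the cluster spec breaks."""
    problems = []
    for attribute, rule in policy.items():
        if attribute not in cluster_spec:
            continue  # this sketch only checks attributes the user set
        value = cluster_spec[attribute]
        if rule["type"] == "allowlist" and value not in rule["values"]:
            problems.append(f"{attribute}={value!r} is not in the allowlist")
        elif rule["type"] == "range":
            if "minValue" in rule and value < rule["minValue"]:
                problems.append(f"{attribute}={value} is below {rule['minValue']}")
            if "maxValue" in rule and value > rule["maxValue"]:
                problems.append(f"{attribute}={value} is above {rule['maxValue']}")
    return problems


# A request for a large, expensive cluster that the policy should reject.
requested = {"node_type_id": "i3.8xlarge", "num_workers": 50, "autotermination_minutes": 30}
for problem in find_violations(requested, policy_definition):
    print(problem)

# The definition is stored as JSON when the policy is created.
print(json.dumps(policy_definition, indent=2))
```

The oversized request breaks two rules (node type and worker count), which is exactly the kind of accidental overspend a policy is meant to stop.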
User Management & Permissions
Roles in Databricks
Groups
Instead of assigning permissions to individuals, use groups to manage access at scale. A group might be data-engineers or analysts, and you assign cluster or table permissions to the group once.
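As an illustration of assigning permissions to groups rather than individuals, the sketch below builds a request body in the `access_control_list` shape used by the Permissions REST API. It only constructs the payload. The group-to-level mapping is an example, and sending it (for instance to `/api/2.0/permissions/clusters/<cluster-id>`) is left out.

```python
import json

VALID_LEVELS = {"CAN_ATTACH_TO", "CAN_RESTART", "CAN_MANAGE"}


def group_acl_payload(grants: dict) -> dict:
    """Turn {group name: permission level} into a permissions request body."""
    for group, level in grants.items():
        if level not in VALID_LEVELS:
            raise ValueError(f"unknown permission level for {group}: {level}")
    return {
        "access_control_list": [
            {"group_name": group, "permission_level": level}
            for group, level in grants.items()
        ]
    }


# Example grants: engineers may restart the cluster, analysts may only attach.
payload = group_acl_payload({"data-engineers": "CAN_RESTART", "analysts": "CAN_ATTACH_TO"})
print(json.dumps(payload, indent=2))
```

Adding a new team member then means adding them to the group, with no change to the cluster's permissions.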
SCIM & Identity Management
SCIM (System for Cross-domain Identity Management) is a protocol that lets your company's identity provider (like Okta or Microsoft Entra ID, formerly Azure AD) automatically sync users and groups into Databricks. When someone joins or leaves your company, their Databricks access is updated automatically, with no manual work required.
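Under the hood, the identity provider sends SCIM messages as JSON. The sketch below shows roughly what a "create user" body and a "deactivate user" patch look like under the SCIM 2.0 core schema. The email and display name are placeholders, and the exact attributes a given Databricks deployment accepts are an assumption you should verify before relying on this.

```python
import json


def scim_user_payload(user_name: str, display_name: str) -> dict:
    """Body an identity provider might send when a new employee joins."""
    return {
        "schemas": ["urn:ietf:params:scim:schemas:core:2.0:User"],
        "userName": user_name,
        "displayName": display_name,
        "active": True,
    }


def scim_deactivate_patch() -> dict:
    """Patch body that marks an existing user inactive when they leave."""
    return {
        "schemas": ["urn:ietf:params:scim:api:messages:2.0:PatchOp"],
        "Operations": [{"op": "replace", "path": "active", "value": False}],
    }


# Placeholder identity used purely for illustration.
print(json.dumps(scim_user_payload("ana@example.com", "Ana Example"), indent=2))
print(json.dumps(scim_deactivate_patch(), indent=2))
```

In practice you rarely write these by hand: the identity provider's SCIM connector generates them whenever a user or group changes.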
Admin Console & Security
The Admin Console (found under Settings) is where workspace admins:
Secrets Management
Never hardcode passwords or API keys in notebooks! Use Databricks Secrets instead:
```python
# Retrieve a secret safely; the value is redacted if it is printed in notebook output
password = dbutils.secrets.get(scope="my-scope", key="db-password")
```
Secrets are stored encrypted and masked in notebook outputs, keeping credentials out of your code and logs.
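A retrieved secret is usually passed straight into a connection rather than stored or printed. The sketch below builds JDBC reader options from a secret scope. The host, database, and the `db-user` key are hypothetical, and `FakeSecrets` is a stand-in so the example runs outside Databricks; in a notebook you would pass `dbutils.secrets` instead.

```python
def jdbc_options(secrets, scope: str) -> dict:
    """Build JDBC reader options, pulling credentials from a secret scope.

    `secrets` is expected to behave like `dbutils.secrets` in a notebook.
    """
    return {
        "url": "jdbc:postgresql://db.example.com:5432/sales",  # hypothetical host
        "user": secrets.get(scope=scope, key="db-user"),  # hypothetical key
        "password": secrets.get(scope=scope, key="db-password"),
    }


class FakeSecrets:
    """Stand-in for dbutils.secrets so the sketch runs outside Databricks."""

    def __init__(self, values: dict):
        self._values = values

    def get(self, scope: str, key: str) -> str:
        return self._values[(scope, key)]


fake = FakeSecrets({
    ("my-scope", "db-user"): "report_reader",
    ("my-scope", "db-password"): "not-a-real-password",
})
options = jdbc_options(fake, "my-scope")
print(sorted(options))  # print only the option names, never the values
```

Because the credentials are looked up at run time, the notebook source contains only the scope and key names.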
Audit Logs & Compliance
Audit logs capture events like cluster creation, login attempts, and data access. These are critical for compliance with regulations like GDPR or HIPAA. Logs are typically exported to cloud storage or a SIEM tool for analysis.
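Once exported, audit logs are commonly processed as JSON records, one event per line. The sketch below filters such records by service and action. The three sample lines are invented for illustration and show only a few fields (`serviceName`, `actionName`, `userIdentity`), so real records will carry more detail.

```python
import json

# Invented sample events; real audit records contain many more fields.
sample_lines = [
    '{"serviceName": "clusters", "actionName": "create", "userIdentity": {"email": "ana@example.com"}}',
    '{"serviceName": "accounts", "actionName": "login", "userIdentity": {"email": "ben@example.com"}}',
    '{"serviceName": "clusters", "actionName": "delete", "userIdentity": {"email": "ana@example.com"}}',
]


def events_for(lines, service_name: str, action_name: str):
    """Yield parsed events that match the given service and action."""
    for line in lines:
        record = json.loads(line)
        if record.get("serviceName") == service_name and record.get("actionName") == action_name:
            yield record


for event in events_for(sample_lines, "clusters", "create"):
    print(event["userIdentity"]["email"], "created a cluster")
```

The same filter pattern answers typical compliance questions, such as who created clusters or who logged in during a given period.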
Automation: Jobs, Pipelines & APIs
Delta Live Tables (DLT) & Pipelines
Delta Live Tables (DLT) is a framework for building reliable data pipelines declaratively. Instead of managing dependencies yourself, you define tables and Databricks handles the orchestration:
```python
import dlt

@dlt.table
def cleaned_orders():
    # Keep every order except cancelled ones
    return spark.read.table("raw_orders").filter("status != 'cancelled'")
```
DLT pipelines work out the dependencies between tables, apply the data quality checks (called expectations) that you declare, and retry failed updates automatically.
The Databricks REST API
Almost everything in Databricks can be automated via the REST API — creating clusters, running jobs, managing permissions. This enables CI/CD pipelines and infrastructure-as-code workflows:
```bash
# Example: list all clusters via the API
# Replace <workspace-url> and <your-token> with your own values
curl -X GET "https://<workspace-url>/api/2.0/clusters/list" \
  -H "Authorization: Bearer <your-token>"
```
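The same call can be made from Python, which is handy inside CI/CD scripts. This sketch uses only the standard library. The workspace URL and token are placeholders, and the request is built but not sent, so the example runs without network access. `list_clusters` shows how it could be sent, assuming the response carries a `clusters` list.

```python
import json
import urllib.request


def build_list_clusters_request(workspace_url: str, token: str) -> urllib.request.Request:
    """Build (but do not send) a GET request for the clusters/list endpoint."""
    url = f"{workspace_url.rstrip('/')}/api/2.0/clusters/list"
    return urllib.request.Request(
        url,
        headers={"Authorization": f"Bearer {token}"},
        method="GET",
    )


def list_clusters(workspace_url: str, token: str) -> list:
    """Send the request and return the clusters from the JSON response."""
    request = build_list_clusters_request(workspace_url, token)
    with urllib.request.urlopen(request) as response:
        return json.load(response).get("clusters", [])


# Placeholder values; substitute your own workspace URL and token.
request = build_list_clusters_request("https://example.cloud.databricks.com/", "my-token")
print(request.get_method(), request.full_url)
```

In a real pipeline the token would come from a secret store or environment variable rather than a literal string.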