Databricks Data Engineer Associate

Workspace & Clusters — Quiz 1 — Study Guide

Understanding how Databricks organizes your work environment is foundational to everything else you'll do on the platform. Whether you're a data engineer running pipelines, a data scientist training models, or an admin managing a team, knowing how the Workspace, clusters, and permissions fit together will save you hours of frustration and help you build secure, cost-effective solutions.


The Databricks Workspace

The Workspace is your central hub in Databricks — think of it like a shared Google Drive, but purpose-built for data and AI work. It's where you organize and access all your assets: notebooks, files, experiments, and more.

What Lives in the Workspace?

  • Notebooks — Interactive documents combining code (Python, SQL, Scala, R), visualizations, and markdown text
  • Folders — Organize your work, just like a file system
  • Repos — Git-integrated folders for version-controlled code
  • Experiments — MLflow tracking for machine learning runs
  • Dashboards — Visual reports built from query results

    Navigating the Workspace

    The left-hand sidebar is your primary navigation tool. Key sections include:

    Icon | Section | Purpose
    ---- | ------- | -------
    🏠 | Home | Your personal workspace folder
    📓 | Workspace | Full file browser for all assets
    📊 | Data | Data Explorer for browsing tables
    ⚙️ | Compute | Manage clusters and SQL warehouses
    🔁 | Workflows | Schedule jobs and DLT pipelines
    🛡️ | Admin Console | User and security management (admins only)

    Data Explorer & Data Discovery

    The Data Explorer lets you browse catalogs, databases (schemas), and tables without writing any code. It's especially useful for:

  • Previewing table schemas and sample data
  • Checking table ownership and permissions
  • Understanding data lineage (with Unity Catalog)

    Think of it as a "table of contents" for all your data assets. Data discovery becomes much easier when your team annotates tables with descriptions and owners, a habit that pays off as the number of tables grows.


    Clusters: Your Compute Engine

    A cluster is a set of virtual machines that run your code. Without a cluster, your notebooks have no engine to execute on.

    Types of Clusters

  • All-Purpose Clusters — Started manually, used for interactive development and notebooks
  • Job Clusters — Created automatically for a specific job run, then terminated (more cost-efficient)
  • SQL Warehouses — Optimized compute specifically for SQL analytics
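
To make the job-cluster idea concrete, here is a minimal sketch of a job definition in the general shape of a Jobs API 2.1 create request, where the `new_cluster` block asks Databricks to create a cluster just for that run. The job name, notebook path, runtime version, and node type are hypothetical placeholders (the node type is an AWS-style name), and `uses_job_cluster` is a helper invented for this example.

```python
import json

# Hypothetical job definition; every value below is a placeholder.
job_spec = {
    "name": "nightly-orders-load",
    "tasks": [
        {
            "task_key": "load_orders",
            "notebook_task": {"notebook_path": "/Repos/team/etl/load_orders"},
            # A job cluster: created for this run, terminated afterwards.
            "new_cluster": {
                "spark_version": "13.3.x-scala2.12",
                "node_type_id": "i3.xlarge",
                "num_workers": 2,
            },
        }
    ],
}


def uses_job_cluster(task: dict) -> bool:
    """True if the task creates its own cluster instead of reusing one."""
    return "new_cluster" in task and "existing_cluster_id" not in task


print(json.dumps(job_spec, indent=2))
print(all(uses_job_cluster(t) for t in job_spec["tasks"]))
```

A task that pointed at an all-purpose cluster would carry an `existing_cluster_id` instead of `new_cluster`, which is convenient for debugging but usually costs more for scheduled work.
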

    Cluster Permissions

    Permissions on clusters control what different users can do:

    Permission | What It Allows
    ---------- | --------------
    No Permissions | Cannot see or use the cluster
    Can Attach To | Can attach a notebook and run code on the cluster
    Can Restart | Can attach *and* restart the cluster
    Can Manage | Full control: edit, delete, change permissions

    Quiz tip: A user with "Can Attach To" can run notebooks on the cluster but cannot restart or reconfigure it.
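
One way to remember the hierarchy is to treat the levels as ordered, so each level includes everything below it. The sketch below is only a mental model written for this guide, not a Databricks API; the action names in `REQUIRED_LEVEL` are illustrative labels.

```python
from enum import IntEnum


class ClusterPermission(IntEnum):
    """Ordered so that a higher level includes everything below it."""
    NO_PERMISSIONS = 0
    CAN_ATTACH_TO = 1
    CAN_RESTART = 2
    CAN_MANAGE = 3


# Minimum level needed for each action (illustrative subset).
REQUIRED_LEVEL = {
    "attach_notebook": ClusterPermission.CAN_ATTACH_TO,
    "restart": ClusterPermission.CAN_RESTART,
    "edit_config": ClusterPermission.CAN_MANAGE,
    "change_permissions": ClusterPermission.CAN_MANAGE,
}


def is_allowed(level: ClusterPermission, action: str) -> bool:
    """Check whether a permission level is high enough for an action."""
    return level >= REQUIRED_LEVEL[action]


print(is_allowed(ClusterPermission.CAN_ATTACH_TO, "attach_notebook"))  # True
print(is_allowed(ClusterPermission.CAN_ATTACH_TO, "restart"))          # False
```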

    Cluster Policies

    Cluster policies are admin-defined templates that restrict what users can configure when creating a cluster. They help enforce:

  • Maximum VM sizes (cost control)
  • Required tags for cost attribution
  • Approved Databricks Runtime versions

    Policies are a key tool for resource management and cost control: they prevent users from accidentally spinning up expensive clusters.
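
A policy is written as a JSON document that maps cluster attributes to rules. The sketch below shows what such a definition might look like for the three goals listed above, using the `allowlist` and `fixed` rule types. The node types, runtime version, and tag value are hypothetical, and the `violations` function is a simplified local check written for illustration; Databricks itself enforces policies on the server side.

```python
import json

# Hypothetical policy definition; all values are placeholders.
policy_definition = {
    "node_type_id": {"type": "allowlist", "values": ["i3.xlarge", "i3.2xlarge"]},
    "spark_version": {"type": "allowlist", "values": ["13.3.x-scala2.12"]},
    "custom_tags.team": {"type": "fixed", "value": "data-engineers"},
}


def violations(cluster: dict, policy: dict) -> list:
    """Return human-readable problems for a flattened cluster request."""
    problems = []
    for path, rule in policy.items():
        value = cluster.get(path)
        if rule["type"] == "allowlist" and value not in rule["values"]:
            problems.append(f"{path}={value!r} is not in the allowlist")
        elif rule["type"] == "fixed" and value != rule["value"]:
            problems.append(f"{path} must be {rule['value']!r}")
    return problems


# A request that asks for a VM size the policy does not allow.
request = {
    "node_type_id": "i3.8xlarge",
    "spark_version": "13.3.x-scala2.12",
    "custom_tags.team": "data-engineers",
}

print(json.dumps(policy_definition, indent=2))
print(violations(request, policy_definition))
```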


    User Management & Permissions

    Roles in Databricks

  • Workspace Admin — Broadest permissions; manages users, clusters, and settings
  • Regular User — Can create notebooks and use clusters they have access to
  • Account Admin — Manages the overall Databricks account (billing, workspaces)

    Groups

    Instead of assigning permissions to individuals, use groups to manage access at scale. A group might be data-engineers or analysts, and you assign cluster or table permissions to the group once.
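
As a sketch of what "assign once to the group" can look like in automation, the snippet below builds (but does not send) a request to the Permissions API that grants a group a permission level on one cluster. The workspace URL, token, and cluster ID are placeholders, and `build_cluster_permission_request` is a helper name invented for this example.

```python
import json
import urllib.request


def build_cluster_permission_request(host, token, cluster_id, group, level):
    """Build a PATCH request granting a group a permission on a cluster."""
    body = {
        "access_control_list": [
            {"group_name": group, "permission_level": level}
        ]
    }
    return urllib.request.Request(
        url=f"{host}/api/2.0/permissions/clusters/{cluster_id}",
        data=json.dumps(body).encode("utf-8"),
        method="PATCH",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


req = build_cluster_permission_request(
    host="https://example.cloud.databricks.com",  # placeholder workspace URL
    token="<your-token>",                         # placeholder token
    cluster_id="1234-567890-abcde123",            # hypothetical cluster ID
    group="data-engineers",
    level="CAN_RESTART",
)
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
# To actually send it: urllib.request.urlopen(req)
```

Everyone in `data-engineers` then inherits the permission, and adding or removing a person becomes a group membership change rather than a cluster change.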

    SCIM & Identity Management

    SCIM (System for Cross-domain Identity Management) is a protocol that lets your company's identity provider (like Okta or Azure AD) automatically sync users and groups into Databricks. When someone joins or leaves your company, their Databricks access is updated automatically — no manual work required.
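
In practice the identity provider sends SCIM messages for you, so you rarely write them by hand. Still, it can help to see the shape of the data. The sketch below builds a user resource following the SCIM 2.0 core User schema; the names and email addresses are made up, and `scim_user` is a hypothetical helper.

```python
import json

# Schema URN defined by the SCIM 2.0 core schema.
SCIM_USER_SCHEMA = "urn:ietf:params:scim:schemas:core:2.0:User"


def scim_user(email: str, display_name: str, active: bool = True) -> dict:
    """Build a minimal SCIM 2.0 user resource."""
    return {
        "schemas": [SCIM_USER_SCHEMA],
        "userName": email,
        "displayName": display_name,
        "active": active,
    }


new_hire = scim_user("ada@example.com", "Ada Example")
leaver = scim_user("bob@example.com", "Bob Example", active=False)

print(json.dumps(new_hire, indent=2))
print(leaver["active"])
```

The `active` flag is how a provider typically signals that someone has left: the account is deactivated rather than requiring a manual cleanup.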


    Admin Console & Security

    The Admin Console (found under Settings) is where workspace admins:

  • Add/remove users and groups
  • Configure single sign-on (SSO)
  • Review audit logs — records of who did what and when
  • Manage secrets and integrations

    Secrets Management

    Never hardcode passwords or API keys in notebooks! Use Databricks Secrets instead:

    # Retrieve a secret at runtime; Databricks redacts the value if it is printed in notebook output
    password = dbutils.secrets.get(scope="my-scope", key="db-password")

    Secrets are stored encrypted and masked in notebook outputs, keeping credentials out of your code and logs.

    Audit Logs & Compliance

    Audit logs capture events like cluster creation, login attempts, and data access. These are critical for compliance with regulations like GDPR or HIPAA. Logs are typically exported to cloud storage or a SIEM tool for analysis.
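
Once logs are exported, a first analysis step is often just counting who did what. The sketch below assumes the logs arrive as one JSON object per line with `serviceName`, `actionName`, and `userIdentity.email` fields; the sample events are made up for illustration and are not real log data.

```python
import json
from collections import Counter

# Made-up sample events, for illustration only.
sample_events = [
    {"timestamp": 1700000000000, "serviceName": "accounts",
     "actionName": "login", "userIdentity": {"email": "ada@example.com"}},
    {"timestamp": 1700000060000, "serviceName": "clusters",
     "actionName": "create", "userIdentity": {"email": "ada@example.com"}},
    {"timestamp": 1700000120000, "serviceName": "accounts",
     "actionName": "login", "userIdentity": {"email": "bob@example.com"}},
]
sample_lines = [json.dumps(event) for event in sample_events]


def count_actions(lines):
    """Count events per (service, action) pair from JSON-lines input."""
    counts = Counter()
    for line in lines:
        event = json.loads(line)
        counts[(event["serviceName"], event["actionName"])] += 1
    return counts


counts = count_actions(sample_lines)
for (service, action), n in sorted(counts.items()):
    print(f"{service}.{action}: {n}")
```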


    Automation: Jobs, Pipelines & APIs

    Delta Live Tables (DLT) & Pipelines

    Delta Live Tables (DLT) is a framework for building reliable data pipelines declaratively. Instead of managing dependencies yourself, you define tables and Databricks handles the orchestration:

    import dlt

    # Declare a table; DLT works out the dependency on raw_orders
    @dlt.table
    def cleaned_orders():
        return spark.read.table("raw_orders").filter("status != 'cancelled'")

    DLT pipelines automatically manage table updates, data quality checks, and retries.

    The Databricks REST API

    Almost everything in Databricks can be automated via the REST API — creating clusters, running jobs, managing permissions. This enables CI/CD pipelines and infrastructure-as-code workflows:

    # Example: List all clusters via API
    curl -X GET https://<workspace-url>/api/2.0/clusters/list \
      -H "Authorization: Bearer <your-token>"
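
The same call can be scripted in Python, which is a common starting point for CI/CD tooling. The sketch below uses only the standard library: it builds the request without sending it, and parses a response body to pick out running clusters. The workspace URL and token are placeholders, the helper names are invented for this example, and `fake_response` is a made-up stand-in for what the endpoint might return.

```python
import json
import urllib.request


def build_list_clusters_request(host: str, token: str) -> urllib.request.Request:
    """Build a GET request for the clusters/list endpoint."""
    return urllib.request.Request(
        url=f"{host}/api/2.0/clusters/list",
        headers={"Authorization": f"Bearer {token}"},
        method="GET",
    )


def running_cluster_names(response_body: str) -> list:
    """Extract the names of clusters whose state is RUNNING."""
    payload = json.loads(response_body)
    return [
        c["cluster_name"]
        for c in payload.get("clusters", [])
        if c.get("state") == "RUNNING"
    ]


req = build_list_clusters_request(
    "https://example.cloud.databricks.com",  # placeholder workspace URL
    "<your-token>",                          # placeholder token
)
print(req.get_method(), req.full_url)

# Made-up response body, for illustration only.
fake_response = json.dumps({
    "clusters": [
        {"cluster_id": "a1", "cluster_name": "etl-dev", "state": "RUNNING"},
        {"cluster_id": "b2", "cluster_name": "adhoc", "state": "TERMINATED"},
    ]
})
print(running_cluster_names(fake_response))
# To call the real API: urllib.request.urlopen(req).read().decode("utf-8")
```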


    Collaboration Best Practices

  • Use Repos for team code — enables branching, pull requests, and code reviews
  • Use Groups for permissions — easier to manage than individual assignments
  • Tag clusters with team/project names for cost attribution
  • Use cluster policies to prevent runaway costs
  • Document tables in the Data Explorer with descriptions and owners
  • Rotate secrets regularly and audit access logs

    Key Takeaways

  • The Workspace is your central hub for notebooks, data, and compute — navigate it via the left sidebar, with the Admin Console reserved for admins.
  • Cluster permissions follow a hierarchy: "Can Attach To" → "Can Restart" → "Can Manage"; users with only "Can Attach To" can run code but not reconfigure the cluster.
  • Cluster policies are the primary tool for cost control and resource governance, restricting what users can configure.
  • SCIM automates user and group provisioning from your identity provider, while Secrets Management keeps credentials safe and out of your code.
  • Audit logs are your compliance paper trail — they record all significant actions in the workspace for security review and regulatory requirements.