Welcome to Chapter 13 of our comprehensive guide! In the previous chapters, we’ve meticulously built robust data pipelines, ingesting real-time supply chain events, performing complex analytics, and establishing a sophisticated data lakehouse architecture. We’ve focused on data transformation, reliability, and performance. Now, it’s time to address a critical aspect for any production-ready system: security and data governance.

This chapter will guide you through implementing Databricks Unity Catalog to secure your data lakehouse. Unity Catalog provides a centralized governance solution for data and AI on the Databricks Lakehouse Platform, offering fine-grained access control, auditing, and data lineage across all your data assets. By the end of this chapter, you will have a securely governed lakehouse, ensuring that only authorized users and applications can access specific data, and that all data access is auditable and compliant with organizational policies.

To follow along, you should have a Databricks workspace with Unity Catalog enabled and an understanding of the data pipelines and tables created in previous chapters. Specifically, we’ll be securing the Delta tables residing in our Bronze, Silver, and Gold layers, as well as any derived views. Our expected outcome is a fully permissioned and auditable data environment, ready for secure consumption by various teams within the organization, from data analysts to machine learning engineers.

Planning & Design

Securing a data lakehouse with Unity Catalog involves understanding its hierarchical structure and how permissions propagate.

Unity Catalog Architecture:

  1. Metastore: The top-level container for all Unity Catalog metadata. Each Databricks account can have multiple metastores, but typically one per region.
  2. Catalogs: The first layer of organization within a metastore. You can use catalogs to group data assets (schemas, tables, views) by logical units, such as departments, projects, or environments (e.g., dev, prod).
  3. Schemas (Databases): The second layer, containing tables, views, and volumes. Schemas are commonly used to organize data within a catalog, such as by data source or data layer (e.g., bronze, silver, gold).
  4. Tables, Views, Volumes: The actual data assets. Tables store data, views are logical representations of data, and volumes manage non-tabular data files.
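
Every securable in this hierarchy is addressed by a three-level name, catalog.schema.object. As a quick illustration of that addressing scheme (plain Python, nothing Databricks-specific — the helper is just a local convention), a name splitter might look like:

```python
def split_fqn(name: str) -> tuple[str, str, str]:
    """Split a three-level Unity Catalog name into (catalog, schema, object).

    Raises ValueError for anything that is not exactly three non-empty,
    dot-separated parts.
    """
    parts = name.split(".")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected catalog.schema.object, got {name!r}")
    return parts[0], parts[1], parts[2]
```

For example, `split_fqn("supply_chain_analytics.bronze.raw_events_bronze")` returns `("supply_chain_analytics", "bronze", "raw_events_bronze")`.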

Access Control Model:

Unity Catalog uses a standard ANSI SQL GRANT/REVOKE permission model. Permissions can be granted to users, groups, or service principals at the metastore, catalog, schema, table, view, or volume level. The principle of least privilege is paramount: grant only the necessary permissions.

For our supply chain project, we’ll design the following structure and access patterns:

  • Catalog: supply_chain_analytics
  • Schemas: bronze, silver, gold, tariff_analysis, logistics_monitoring, procurement_intelligence
  • Permissions:
    • Data Engineers: Full access (CREATE, MODIFY, SELECT) to bronze, silver, gold schemas.
    • Tariff Analysts: SELECT access to tariff_analysis schema, potentially specific tables in gold.
    • Logistics Managers: SELECT access to logistics_monitoring schema, specific views for dashboards.
    • Procurement Team: SELECT access to procurement_intelligence schema, potentially with Row-Level Security (RLS) or Column-Level Security (CLS).
    • Auditors: SELECT access to all schemas, and audit logs.
    • Service Principals: Specific, minimal permissions for automated jobs (DLT pipelines, Structured Streaming jobs).
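
Before writing any SQL, it can help to capture this access design as data so it can be reviewed and linted. The sketch below is purely illustrative — the group and schema names mirror the plan above, but the matrix structure and the audit helper are hypothetical conventions, not a Databricks API:

```python
# Illustrative access matrix for the plan above. Group/schema names come from
# this chapter; the dict-of-dicts layout is just a convention for review.
ACCESS_MATRIX = {
    "data_engineers": {
        "bronze": {"USE SCHEMA", "CREATE TABLE", "SELECT", "MODIFY"},
        "silver": {"USE SCHEMA", "CREATE TABLE", "SELECT", "MODIFY"},
        "gold": {"USE SCHEMA", "CREATE TABLE", "SELECT", "MODIFY"},
        "tariff_analysis": {"USE SCHEMA", "SELECT"},
        "logistics_monitoring": {"USE SCHEMA", "SELECT"},
        "procurement_intelligence": {"USE SCHEMA", "SELECT"},
    },
    "tariff_analysts": {"tariff_analysis": {"USE SCHEMA", "SELECT"}},
    "logistics_managers": {"logistics_monitoring": {"USE SCHEMA", "SELECT"}},
    "procurement_team": {"procurement_intelligence": {"USE SCHEMA"}},
}

WRITE_PRIVILEGES = {"MODIFY", "CREATE TABLE", "CREATE VIEW", "CREATE FUNCTION"}
ENGINEERING_SCHEMAS = {"bronze", "silver", "gold"}

def write_grants_outside_engineering(matrix):
    """Least-privilege check: flag any group holding write privileges
    outside the medallion (engineering) schemas."""
    violations = []
    for group, schemas in matrix.items():
        for schema, privs in schemas.items():
            if schema not in ENGINEERING_SCHEMAS and privs & WRITE_PRIVILEGES:
                violations.append((group, schema, sorted(privs & WRITE_PRIVILEGES)))
    return violations
```

Running the check against the matrix above returns an empty list, confirming that only the engineering layers carry write access.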

File Structure for Configuration (Conceptual):

While Unity Catalog is primarily managed via SQL commands and the Databricks UI, we can encapsulate common setup and permission scripts for version control and automation.

.
├── databricks_config/
│   ├── unity_catalog_setup.sql          # Initial catalog, schema creation
│   ├── unity_catalog_permissions.sql    # User/group permission grants
│   ├── dlt_pipeline_config.json         # DLT pipeline configurations for UC
│   └── structured_streaming_config.py   # Python config for SS jobs
├── notebooks/
│   ├── 01_bronze_ingestion_dlt.py
│   ├── ...
│   └── 12_anomaly_detection.py
└── README.md

Step-by-Step Implementation

We’ll assume Unity Catalog is already enabled for your Databricks workspace and a Metastore is configured. If not, please refer to the Databricks documentation for initial setup. Our focus will be on creating and securing our project’s data assets within Unity Catalog.

a) Setup/Configuration

First, we need to create our dedicated catalog and schemas.

1. Create a New Catalog for Our Project

We’ll create a catalog named supply_chain_analytics. This acts as the top-level container for all data related to this project.

Open a new Databricks Notebook or SQL editor and run the following:

-- databricks_config/unity_catalog_setup.sql
-- Create a new catalog for our supply chain project
CREATE CATALOG IF NOT EXISTS supply_chain_analytics;

-- Use the new catalog for subsequent operations
USE CATALOG supply_chain_analytics;

-- Grant initial permissions to the 'account_admins' group (or your designated admin group)
-- This ensures administrators can manage the catalog.
GRANT USE CATALOG, USE SCHEMA, CREATE SCHEMA ON CATALOG supply_chain_analytics TO `account_admins`;
GRANT CREATE VOLUME, CREATE FUNCTION, CREATE TABLE ON CATALOG supply_chain_analytics TO `account_admins`;
-- Note: CREATE EXTERNAL LOCATION is a metastore-level privilege and cannot be granted on a catalog.

-- Log the creation
SELECT 'Catalog supply_chain_analytics created and initial permissions granted.' AS log_message;

Explanation:

  • CREATE CATALOG IF NOT EXISTS supply_chain_analytics;: Creates our dedicated catalog. The IF NOT EXISTS clause makes the script idempotent.
  • USE CATALOG supply_chain_analytics;: Sets the default catalog for subsequent SQL commands within the session.
  • GRANT ... TO `account_admins`: Grants essential administrative privileges to the account_admins group (or a similar group you manage) over the newly created catalog. This is crucial for managing schemas, tables, and other assets within it.

2. Create Schemas within the Catalog

Next, we’ll create the necessary schemas (databases) within our supply_chain_analytics catalog, representing our data layers and specific analytical domains.

-- databricks_config/unity_catalog_setup.sql (append to existing file or create new)
-- Ensure we are in the correct catalog
USE CATALOG supply_chain_analytics;

-- Create schemas for different data layers and domains
CREATE SCHEMA IF NOT EXISTS bronze;
CREATE SCHEMA IF NOT EXISTS silver;
CREATE SCHEMA IF NOT EXISTS gold;
CREATE SCHEMA IF NOT EXISTS tariff_analysis;
CREATE SCHEMA IF NOT EXISTS logistics_monitoring;
CREATE SCHEMA IF NOT EXISTS procurement_intelligence;

-- Grant usage permissions to 'account_admins' for all new schemas
GRANT USE SCHEMA ON SCHEMA bronze TO `account_admins`;
GRANT USE SCHEMA ON SCHEMA silver TO `account_admins`;
GRANT USE SCHEMA ON SCHEMA gold TO `account_admins`;
GRANT USE SCHEMA ON SCHEMA tariff_analysis TO `account_admins`;
GRANT USE SCHEMA ON SCHEMA logistics_monitoring TO `account_admins`;
GRANT USE SCHEMA ON SCHEMA procurement_intelligence TO `account_admins`;

-- Log the creation
SELECT 'Schemas created within supply_chain_analytics catalog.' AS log_message;

Explanation:

  • We create schemas like bronze, silver, gold for our Medallion architecture layers, and tariff_analysis, logistics_monitoring, procurement_intelligence for specific analytical outputs.
  • GRANT USE SCHEMA: This permission allows users to reference the schema and see the objects within it that they have been granted access to. It does not by itself allow reading or writing table data; that requires explicit SELECT or MODIFY grants.

b) Core Implementation

Now that our catalog and schemas are set up, we’ll configure our pipelines to write to these Unity Catalog managed tables and apply granular permissions.

1. Update DLT Pipelines to Write to Unity Catalog

Our Delta Live Tables (DLT) pipelines currently write to traditional Delta Lake paths. We need to update them to write to Unity Catalog managed tables. This involves specifying the catalog and schema in the DLT pipeline configuration.

Modify your DLT pipeline definition (e.g., dlt_pipeline_supply_chain.py from Chapter 3/4).

# notebooks/03_bronze_ingestion_dlt.py (or relevant DLT notebook)

import dlt
from pyspark.sql.functions import *
from pyspark.sql.types import *

# Define the target Unity Catalog and Schema
# This can also be passed as a DLT pipeline configuration parameter
UC_CATALOG = "supply_chain_analytics"
UC_SCHEMA = "bronze" # For bronze layer

# Note: for Unity Catalog managed tables we do not pre-register the table with
# an explicit storage path -- Unity Catalog manages table locations itself,
# and the @dlt.table decorator below is all that is needed.

# Helper used by the @dlt.table function below
def create_raw_events_bronze_table():
    # Example: Reading from Kafka (replace with your actual Kafka source)
    # For DLT, Kafka configuration is typically done in the pipeline settings
    # or passed as a Spark config.
    # Assuming 'raw_kafka_stream' is a DLT stream created from Kafka
    return spark.readStream \
                .format("kafka") \
                .option("kafka.bootstrap.servers", "your_kafka_broker:9092") \
                .option("subscribe", "supply_chain_events") \
                .option("startingOffsets", "earliest") \
                .load() \
                .selectExpr("CAST(value AS STRING) as raw_data", "current_timestamp() as ingestion_timestamp")

@dlt.table(
    name=f"{UC_CATALOG}.{UC_SCHEMA}.raw_events_bronze", # Fully qualified name for UC table
    comment="Raw supply chain events ingested from Kafka",
    table_properties={"quality": "bronze"},
    # For DLT, checkpointLocation is typically configured at the pipeline level
    # or automatically managed for UC tables.
    # checkpointLocation=f"/dlt_checkpoints/{UC_CATALOG}/{UC_SCHEMA}/raw_events_bronze"
)
def raw_events_bronze():
    return create_raw_events_bronze_table()

# Example for a Silver table
# UC_SCHEMA_SILVER = "silver"
# @dlt.table(
#     name=f"{UC_CATALOG}.{UC_SCHEMA_SILVER}.processed_events_silver",
#     comment="Cleaned and parsed supply chain events",
#     table_properties={"quality": "silver"}
# )
# def processed_events_silver():
#     return dlt.read_stream(f"{UC_CATALOG}.{UC_SCHEMA}.raw_events_bronze") \
#               .withColumn("parsed_data", from_json(col("raw_data"), "YOUR_SCHEMA_HERE")) \
#               .select("parsed_data.*", "ingestion_timestamp")

# ... similar updates for all DLT pipelines (bronze, silver, gold)

Explanation:

  • We define UC_CATALOG and UC_SCHEMA variables.
  • The key change is in the @dlt.table decorator: name=f"{UC_CATALOG}.{UC_SCHEMA}.raw_events_bronze". This fully qualified name tells DLT to create the table as a Unity Catalog managed table within the specified catalog and schema.
  • When configuring the DLT pipeline in the Databricks UI, either leave the Target schema field empty if you specify the fully qualified name in the Python code, or set the catalog and target schema (e.g., supply_chain_analytics and bronze) directly in the UI. Databricks recommends setting the target in the DLT pipeline settings for easier management.

DLT Pipeline UI Configuration (Illustrative):

When creating/editing your DLT pipeline, in the “Pipeline settings” under “Target”, specify:

  • Catalog: supply_chain_analytics
  • Target schema: bronze (for bronze layer pipeline) or silver (for silver layer pipeline), etc.

This configuration ensures all tables created by that DLT pipeline land in the specified Unity Catalog schema.
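
The same target settings can live in the version-controlled dlt_pipeline_config.json file from our project layout instead of being clicked in by hand. The snippet below builds such a settings document; the field names ("catalog", "target", "libraries") follow the Databricks Pipelines API as I understand it at the time of writing — verify them against your workspace's API reference before relying on this:

```python
import json

def dlt_pipeline_settings(name, catalog, schema, notebook_paths):
    """Build a minimal DLT pipeline settings document targeting Unity Catalog.

    Field names ("catalog", "target", "libraries") follow the Databricks
    Pipelines API; confirm against your workspace before use.
    """
    return {
        "name": name,
        "catalog": catalog,  # Unity Catalog catalog for all pipeline tables
        "target": schema,    # schema the pipeline publishes into
        "continuous": False,
        "libraries": [{"notebook": {"path": p}} for p in notebook_paths],
    }

settings = dlt_pipeline_settings(
    "supply_chain_bronze",
    "supply_chain_analytics",
    "bronze",
    ["/Repos/project/notebooks/03_bronze_ingestion_dlt"],  # hypothetical repo path
)
print(json.dumps(settings, indent=2))
```

Keeping this file in databricks_config/ means the pipeline's governance target is reviewed in pull requests like any other code change.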

2. Update Spark Structured Streaming Jobs to Write to Unity Catalog

For our Spark Structured Streaming jobs (e.g., logistics cost monitoring), the process is similar: specify the fully qualified table name when writing to Delta Lake.

Modify your Structured Streaming job (e.g., structured_streaming_logistics.py from Chapter 8).

# notebooks/08_streaming_logistics_monitoring.py (or relevant SS notebook)

from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *
import logging

# Configure logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)

# Initialize Spark Session (assuming running on Databricks cluster)
spark = SparkSession.builder.appName("LogisticsCostMonitoring") \
    .getOrCreate()


# Define Unity Catalog details
UC_CATALOG = "supply_chain_analytics"
UC_SCHEMA = "logistics_monitoring" # Target schema for this SS job

# Kafka connection properties (replace with your actual details)
kafka_bootstrap_servers = "your_kafka_broker:9092"
kafka_topic = "logistics_events"

# Define the schema for incoming Kafka messages (adjust as per your data)
logistics_event_schema = StructType([
    StructField("event_id", StringType(), True),
    StructField("shipment_id", StringType(), True),
    StructField("timestamp", TimestampType(), True),
    StructField("cost_type", StringType(), True),
    StructField("amount", DoubleType(), True),
    StructField("currency", StringType(), True),
    StructField("tariff_code", StringType(), True),
    StructField("fuel_price", DoubleType(), True)
])

try:
    # Read streaming data from Kafka
    kafka_stream_df = spark.readStream \
        .format("kafka") \
        .option("kafka.bootstrap.servers", kafka_bootstrap_servers) \
        .option("subscribe", kafka_topic) \
        .option("startingOffsets", "latest") \
        .load() \
        .selectExpr("CAST(value AS STRING) as json_payload", "timestamp as kafka_ingestion_timestamp")

    # Parse the JSON payload and apply schema
    parsed_logistics_df = kafka_stream_df \
        .withColumn("parsed_data", from_json(col("json_payload"), logistics_event_schema)) \
        .select("parsed_data.*", "kafka_ingestion_timestamp") \
        .withColumn("processing_timestamp", current_timestamp())

    # Define the Unity Catalog table name
    uc_table_name = f"{UC_CATALOG}.{UC_SCHEMA}.logistics_costs_stream"
    checkpoint_location = f"/mnt/supply_chain_checkpoints/{UC_CATALOG}/{UC_SCHEMA}/logistics_costs_stream"

    logger.info(f"Writing streaming data to Unity Catalog table: {uc_table_name}")
    logger.info(f"Using checkpoint location: {checkpoint_location}")

    # Write the processed data to a Unity Catalog managed Delta table
    query = parsed_logistics_df.writeStream \
        .format("delta") \
        .outputMode("append") \
        .option("checkpointLocation", checkpoint_location) \
        .trigger(processingTime="30 seconds") \
        .toTable(uc_table_name) # Key change: use .toTable() with fully qualified name

    logger.info("Structured Streaming query started.")

    # Wait for the termination of the query
    query.awaitTermination()

except Exception as e:
    logger.error(f"Error in Structured Streaming job: {e}", exc_info=True)
    raise

finally:
    spark.stop()
    logger.info("Spark session stopped.")

Explanation:

  • uc_table_name = f"{UC_CATALOG}.{UC_SCHEMA}.logistics_costs_stream": Constructs the fully qualified name for the Unity Catalog table.
  • .toTable(uc_table_name): This is the crucial part. Instead of writing to a path (e.g., .start(path)), we use .toTable() with the fully qualified Unity Catalog table name. This registers the table with Unity Catalog.
  • Checkpoint Location: While the table is managed by Unity Catalog, the checkpoint location for Structured Streaming is still a path in cloud storage. It’s good practice to organize these checkpoints logically, perhaps under a dedicated mount point or external location that is also governed by Unity Catalog volumes if needed. For simplicity here, we use a DBFS path.
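
The two strings the job needs — the three-part table name and a per-table checkpoint path — follow a mechanical pattern, so a small helper can build and sanity-check both. This is a local convenience sketch, not part of any Databricks API; the default checkpoint root mirrors the /mnt/supply_chain_checkpoints path used above:

```python
def uc_stream_target(catalog, schema, table,
                     checkpoint_root="/mnt/supply_chain_checkpoints"):
    """Return (fully_qualified_table, checkpoint_path) for a streaming write.

    Validates that each identifier is non-empty and dot-free so the joined
    result is a legal three-part Unity Catalog name.
    """
    for part in (catalog, schema, table):
        if not part or "." in part:
            raise ValueError(f"invalid identifier: {part!r}")
    fqn = f"{catalog}.{schema}.{table}"
    checkpoint = f"{checkpoint_root}/{catalog}/{schema}/{table}"
    return fqn, checkpoint
```

Using it for the job above, `uc_stream_target("supply_chain_analytics", "logistics_monitoring", "logistics_costs_stream")` yields exactly the `uc_table_name` and `checkpoint_location` values hard-coded in the script, while catching typos like a stray dot in a schema name.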

3. Define and Apply Granular Permissions

Now, let’s define and apply permissions for different user groups. We’ll assume you have groups set up in Databricks (e.g., data_engineers, tariff_analysts, logistics_managers, procurement_team).

-- databricks_config/unity_catalog_permissions.sql
USE CATALOG supply_chain_analytics;

-- Every group first needs USE CATALOG on the catalog; without it,
-- schema- and table-level grants have no effect.
GRANT USE CATALOG ON CATALOG supply_chain_analytics TO `data_engineers`;
GRANT USE CATALOG ON CATALOG supply_chain_analytics TO `tariff_analysts`;
GRANT USE CATALOG ON CATALOG supply_chain_analytics TO `logistics_managers`;
GRANT USE CATALOG ON CATALOG supply_chain_analytics TO `procurement_team`;

-- --- Data Engineers Group Permissions ---
-- Data Engineers need full control over Bronze, Silver, Gold layers for pipeline development
GRANT USE SCHEMA, CREATE TABLE, CREATE VIEW, CREATE FUNCTION, SELECT, MODIFY ON SCHEMA bronze TO `data_engineers`;
GRANT USE SCHEMA, CREATE TABLE, CREATE VIEW, CREATE FUNCTION, SELECT, MODIFY ON SCHEMA silver TO `data_engineers`;
GRANT USE SCHEMA, CREATE TABLE, CREATE VIEW, CREATE FUNCTION, SELECT, MODIFY ON SCHEMA gold TO `data_engineers`;

-- They also need SELECT on analytical schemas to verify data
GRANT USE SCHEMA, SELECT ON SCHEMA tariff_analysis TO `data_engineers`;
GRANT USE SCHEMA, SELECT ON SCHEMA logistics_monitoring TO `data_engineers`;
GRANT USE SCHEMA, SELECT ON SCHEMA procurement_intelligence TO `data_engineers`;

-- --- Tariff Analysts Group Permissions ---
-- Tariff Analysts need to query data for analysis
GRANT USE SCHEMA, SELECT ON SCHEMA tariff_analysis TO `tariff_analysts`;
GRANT USE SCHEMA ON SCHEMA gold TO `tariff_analysts`;           -- required to reference objects in gold
GRANT SELECT ON TABLE gold.tariff_summary TO `tariff_analysts`; -- Example: specific table in gold

-- --- Logistics Managers Group Permissions ---
-- Logistics Managers need to view monitoring dashboards
GRANT USE SCHEMA, SELECT ON SCHEMA logistics_monitoring TO `logistics_managers`;
-- If specific views are created for dashboards, grant access to those views
GRANT SELECT ON VIEW logistics_monitoring.dashboard_view TO `logistics_managers`;

-- --- Procurement Team Group Permissions ---
-- The Procurement Team accesses pricing data only through RLS/CLS views, so grant
-- USE SCHEMA here and SELECT on the secured views themselves (created later).
-- A schema-level SELECT would let the team read the base table directly,
-- bypassing row- and column-level security.
GRANT USE SCHEMA ON SCHEMA procurement_intelligence TO `procurement_team`;

-- --- Service Principal for DLT Pipelines ---
-- Create a service principal (e.g., 'sp_dlt_pipeline') and grant it permissions
-- (This assumes 'sp_dlt_pipeline' is already created as a service principal in Databricks)
-- GRANT USE CATALOG ON CATALOG supply_chain_analytics TO `sp_dlt_pipeline`;
-- GRANT USE SCHEMA ON SCHEMA bronze TO `sp_dlt_pipeline`;
-- GRANT CREATE TABLE, SELECT, MODIFY ON SCHEMA bronze TO `sp_dlt_pipeline`;
-- GRANT USE SCHEMA ON SCHEMA silver TO `sp_dlt_pipeline`;
-- GRANT CREATE TABLE, SELECT, MODIFY ON SCHEMA silver TO `sp_dlt_pipeline`;
-- GRANT USE SCHEMA ON SCHEMA gold TO `sp_dlt_pipeline`;
-- GRANT CREATE TABLE, SELECT, MODIFY ON SCHEMA gold TO `sp_dlt_pipeline`;

-- Log the permission grants
SELECT 'Granular permissions applied to user groups and service principals.' AS log_message;

Explanation:

  • We use GRANT <permission> ON <object_type> <object_name> TO <principal>;.
  • SELECT: Allows reading data.
  • MODIFY: Allows INSERT, UPDATE, DELETE, TRUNCATE operations.
  • CREATE TABLE, CREATE VIEW, CREATE FUNCTION: Allows creating new objects.
  • Service Principals: For automated jobs like DLT pipelines or scheduled Structured Streaming, it’s best practice to run them under a Databricks Service Principal. This allows for programmatic access without tying to a specific user, and permissions can be managed independently. You would grant the necessary USE CATALOG, USE SCHEMA, CREATE TABLE, SELECT, MODIFY permissions to the service principal.
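
Because every table-level privilege depends on the USE CATALOG and USE SCHEMA prerequisites described above, generating grant scripts programmatically reduces the chance of forgetting one. A minimal sketch follows — the function is a local convention, not a Databricks API, and it only builds SQL strings rather than executing them:

```python
def grants_for_schema(catalog, schema, principal, privileges):
    """Build the GRANT statements `principal` needs on `schema`, always
    prepending the USE CATALOG / USE SCHEMA grants that Unity Catalog
    requires before object-level privileges take effect."""
    stmts = [
        f"GRANT USE CATALOG ON CATALOG {catalog} TO `{principal}`;",
        f"GRANT USE SCHEMA ON SCHEMA {catalog}.{schema} TO `{principal}`;",
    ]
    stmts += [f"GRANT {p} ON SCHEMA {catalog}.{schema} TO `{principal}`;"
              for p in privileges]
    return stmts

# Example: the minimal grants for the DLT service principal on the bronze layer
for stmt in grants_for_schema("supply_chain_analytics", "bronze", "sp_dlt_pipeline",
                              ["CREATE TABLE", "SELECT", "MODIFY"]):
    print(stmt)
```

The emitted statements can be written into databricks_config/unity_catalog_permissions.sql and applied from a notebook or a CI job, keeping grants reviewable in version control.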

4. Implement Row-Level and Column-Level Security (RLS/CLS)

For sensitive data, like procurement pricing, you might want to restrict what rows or columns a user can see based on their role or attributes. Unity Catalog offers native row filters and column masks for this, or you can implement RLS/CLS using SQL views. Here, we'll demonstrate the view-based approach for flexibility.

Let’s assume the procurement_intelligence.vendor_prices table contains sensitive pricing data. We want to show different prices based on the user’s region or mask certain columns for general users.

-- notebooks/rls_cls_example.sql
USE CATALOG supply_chain_analytics;

-- Create a dummy table for demonstration (if not already existing)
CREATE TABLE IF NOT EXISTS procurement_intelligence.vendor_prices (
    vendor_id STRING,
    product_sku STRING,
    region STRING,
    price DOUBLE,
    negotiated_discount DOUBLE,
    last_updated TIMESTAMP
);

-- Insert some sample data
INSERT INTO procurement_intelligence.vendor_prices VALUES
('V001', 'SKU100', 'EMEA', 100.50, 0.05, current_timestamp()),
('V001', 'SKU101', 'NA', 200.75, 0.10, current_timestamp()),
('V002', 'SKU100', 'EMEA', 102.00, 0.03, current_timestamp()),
('V003', 'SKU102', 'APAC', 50.25, 0.07, current_timestamp());

-- --- Row-Level Security View ---
-- Create a view that filters rows based on the current user's region (example logic)
-- In a real scenario, you'd integrate with a lookup table of user-to-region mapping.
CREATE OR REPLACE VIEW procurement_intelligence.my_region_vendor_prices AS
SELECT
    vendor_id,
    product_sku,
    region,
    price,
    negotiated_discount,
    last_updated
FROM
    procurement_intelligence.vendor_prices
WHERE
    region = (SELECT user_region FROM user_regions_lookup WHERE user_name = current_user()); -- Placeholder for user region logic

-- Grant SELECT on this view, not on the base table, to 'procurement_team'
GRANT SELECT ON VIEW procurement_intelligence.my_region_vendor_prices TO `procurement_team`;
REVOKE SELECT ON TABLE procurement_intelligence.vendor_prices FROM `procurement_team`; -- Ensure they cannot access the base table directly

-- --- Column-Level Security View ---
-- Create a view that masks the 'negotiated_discount' column for general users
CREATE OR REPLACE VIEW procurement_intelligence.public_vendor_prices AS
SELECT
    vendor_id,
    product_sku,
    region,
    price,
    '***MASKED***' AS negotiated_discount, -- Masking the sensitive column
    last_updated
FROM
    procurement_intelligence.vendor_prices;

-- Grant SELECT on this public view to a broader group, if applicable
GRANT SELECT ON VIEW procurement_intelligence.public_vendor_prices TO `some_general_user_group`;

SELECT 'RLS and CLS views created and permissions updated.' AS log_message;

Explanation:

  • RLS View (my_region_vendor_prices): This view filters the vendor_prices table to show only rows where the region matches a region associated with the current_user(). The user_regions_lookup table is a conceptual placeholder; in practice, this would involve a robust user management system or attributes passed during authentication.
  • CLS View (public_vendor_prices): This view replaces the sensitive negotiated_discount column with a masked string ('***MASKED***').
  • Permission Strategy: For RLS/CLS, grant SELECT permissions only to the specific views that enforce security, and REVOKE (or never grant) SELECT on the underlying base table for those users.
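
To make the intended behavior of the two views concrete (and testable outside Databricks), here is the same filter-and-mask logic expressed in plain Python over sample rows like those inserted above. This is a simulation of what the views enforce, not how Unity Catalog evaluates them:

```python
SAMPLE_ROWS = [
    {"vendor_id": "V001", "product_sku": "SKU100", "region": "EMEA",
     "price": 100.50, "negotiated_discount": 0.05},
    {"vendor_id": "V001", "product_sku": "SKU101", "region": "NA",
     "price": 200.75, "negotiated_discount": 0.10},
    {"vendor_id": "V002", "product_sku": "SKU100", "region": "EMEA",
     "price": 102.00, "negotiated_discount": 0.03},
]

def apply_rls_cls(rows, user_region, mask_discount=True):
    """Keep only the caller's region (row-level security) and, optionally,
    mask the negotiated_discount column (column-level security)."""
    result = []
    for row in rows:
        if row["region"] != user_region:
            continue  # row-level filter: drop other regions
        row = dict(row)  # copy so the source rows stay untouched
        if mask_discount:
            row["negotiated_discount"] = "***MASKED***"  # column-level mask
        result.append(row)
    return result

emea_view = apply_rls_cls(SAMPLE_ROWS, "EMEA")
# An EMEA user sees only the two EMEA rows, both with the discount masked.
```

Writing the logic out this way is also a cheap regression test: if the view definitions change, the expected filtering behavior is documented somewhere executable.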

c) Testing This Component

After setting up Unity Catalog and applying permissions, it’s crucial to test that access controls are working as expected.

  1. Verify Catalog and Schema Creation:

    • As an account_admin or a user with CREATE CATALOG privileges, run:
      SHOW CATALOGS;
      USE CATALOG supply_chain_analytics;
      SHOW SCHEMAS;
      
    • Confirm supply_chain_analytics catalog and all schemas (bronze, silver, etc.) are listed.
  2. Test DLT Pipeline Execution:

    • Run your DLT pipelines. They should now successfully create/update tables with fully qualified names (e.g., supply_chain_analytics.bronze.raw_events_bronze).
    • Verify the table properties and location in Unity Catalog by querying DESCRIBE EXTENDED supply_chain_analytics.bronze.raw_events_bronze;.
  3. Test Structured Streaming Job Execution:

    • Run your Structured Streaming jobs. They should write to the specified Unity Catalog tables (e.g., supply_chain_analytics.logistics_monitoring.logistics_costs_stream).
    • Query the table to ensure data is flowing: SELECT * FROM supply_chain_analytics.logistics_monitoring.logistics_costs_stream LIMIT 10;.
  4. Test Granular Permissions:

    • Login as a data_engineer user:
      • Try to SELECT from supply_chain_analytics.bronze.raw_events_bronze (should succeed).
      • Try to INSERT into supply_chain_analytics.bronze.raw_events_bronze (should succeed).
      • Try to SELECT from supply_chain_analytics.tariff_analysis.some_table (should succeed).
      • Try to SELECT from a table in a schema outside their granted access (should fail; depending on the grants involved, Unity Catalog returns a PERMISSION_DENIED or a "table not found" error so as not to reveal objects the user cannot see).
    • Login as a tariff_analyst user:
      • Try to SELECT from supply_chain_analytics.tariff_analysis.some_table (should succeed).
      • Try to MODIFY supply_chain_analytics.tariff_analysis.some_table (should fail).
      • Try to SELECT from supply_chain_analytics.bronze.raw_events_bronze (should fail).
    • Login as a procurement_team user:
      • Try to SELECT from supply_chain_analytics.procurement_intelligence.my_region_vendor_prices (should succeed, showing only their region’s data).
      • Try to SELECT from supply_chain_analytics.procurement_intelligence.vendor_prices (the base table, should fail).

Debugging Tips:

  • PERMISSION_DENIED errors: This is the most common error. Double-check your GRANT statements. Ensure the user/group/service principal has all necessary permissions: USE CATALOG, USE SCHEMA, and then SELECT/MODIFY on the specific table/view.
  • Audit logs: Unity Catalog access events are recorded in the Databricks audit logs (and, where system tables are enabled, queryable via the system.access.audit table) to show who accessed what and whether access was granted or denied. This is invaluable for debugging permission issues.
  • current_user() and is_member(): Use these SQL functions in Databricks to verify the identity of the current user and their group memberships, which is helpful when debugging RLS/CLS views.

Production Considerations

Securing your Lakehouse in production with Unity Catalog requires careful planning and ongoing management.

  • Principle of Least Privilege: Always grant the minimum necessary permissions. Avoid granting ALL PRIVILEGES unless absolutely required for administrative roles.
  • Automation of Permissions: For large organizations, manually managing GRANT statements can be cumbersome. Integrate permission management into your CI/CD pipeline using Databricks Asset Bundles, Terraform, or similar infrastructure-as-code tools to define and apply permissions.
  • Service Principals: Use Databricks Service Principals for all automated workloads (DLT pipelines, Structured Streaming jobs, Databricks Jobs). This decouples job execution from individual user accounts, improving security and auditability.
  • Audit Logging: Unity Catalog automatically logs all data access and permission changes. Regularly review these audit logs for suspicious activity and compliance. Integrate these logs with your SIEM (Security Information and Event Management) system.
  • External Locations and Storage Credentials: Securely manage the underlying cloud storage (S3 buckets, ADLS containers) permissions. Unity Catalog’s External Locations and Storage Credentials abstract this, but the IAM roles/principals backing those credentials must also adhere to least privilege.
  • Data Masking and Tokenization: For highly sensitive data, consider more robust data masking or tokenization solutions in addition to CLS views. Unity Catalog’s native row filters and column mask functions can be very powerful here.
  • Integration with Identity Providers: Ensure Unity Catalog is integrated with your corporate identity provider (Azure AD, Okta, etc.) for centralized user and group management.
  • Monitoring: Monitor Unity Catalog usage and performance. While Unity Catalog itself is highly optimized, complex RLS/CLS views can sometimes impact query performance.

Code Review Checkpoint

At this stage, you should have accomplished the following:

  • A dedicated supply_chain_analytics catalog created in Unity Catalog.
  • Schemas (bronze, silver, gold, tariff_analysis, logistics_monitoring, procurement_intelligence) created within the catalog.
  • DLT pipelines updated to write to Unity Catalog managed tables using fully qualified names (e.g., supply_chain_analytics.bronze.raw_events_bronze).
  • Spark Structured Streaming jobs updated to write to Unity Catalog managed tables using .toTable() with fully qualified names.
  • Granular permissions applied to various user groups (data_engineers, tariff_analysts, logistics_managers, procurement_team) and potentially service principals, following the principle of least privilege.
  • Examples of Row-Level Security (RLS) and Column-Level Security (CLS) implemented using SQL views for the procurement_intelligence schema.

Files Created/Modified:

  • databricks_config/unity_catalog_setup.sql
  • databricks_config/unity_catalog_permissions.sql
  • notebooks/03_bronze_ingestion_dlt.py (and other DLT pipeline notebooks)
  • notebooks/08_streaming_logistics_monitoring.py (and other Structured Streaming notebooks)
  • notebooks/rls_cls_example.sql (for RLS/CLS demonstration)

These changes integrate seamlessly with our existing data pipelines, ensuring that all new and updated data assets are now governed by Unity Catalog.

Common Issues & Solutions

  1. Issue: PERMISSION_DENIED when running DLT/Structured Streaming jobs.

    • Cause: The user or service principal running the DLT pipeline/job does not have sufficient permissions to CREATE TABLE, MODIFY, or SELECT in the target Unity Catalog schema.
    • Solution:
      • Verify the identity running the job. For DLT, it’s typically the user who created/updated the pipeline or a configured service principal. For Databricks Jobs, it’s the job owner or a service principal.
      • Ensure the identity has USE CATALOG on the catalog, USE SCHEMA on the target schema, and CREATE TABLE, SELECT, MODIFY on the target schema for the specific operations being performed.
      • If using an external location for checkpoints, ensure the identity also has permissions to write to that external location.
  2. Issue: Tables are not appearing in Unity Catalog, still showing up as legacy Hive tables or on DBFS paths.

    • Cause: The DLT pipeline or Structured Streaming job is not correctly configured to write to Unity Catalog. This often happens if the name in @dlt.table or the argument to .toTable() is not a fully qualified Unity Catalog name (catalog.schema.table_name), or if the DLT pipeline’s “Target schema” setting is incorrect.
    • Solution:
      • DLT: Double-check the DLT pipeline settings in the UI to ensure “Catalog” and “Target schema” are correctly set. If you specify the full name in code, ensure it matches.
      • Structured Streaming: Confirm toTable("catalog.schema.table_name") is used instead of .start("path/to/delta/table").
      • Ensure the cluster running the job is Unity Catalog-enabled.
  3. Issue: Row-level or column-level security views are not working as expected (e.g., users see all data or masked data is visible).

    • Cause:
      • Users still have SELECT access to the underlying base table, bypassing the view.
      • The RLS/CLS logic in the view is incorrect.
      • The current_user() function or other user-context functions are not resolving correctly.
    • Solution:
      • Revoke Base Table Access: Crucially, REVOKE SELECT ON TABLE <base_table> FROM <user_or_group>; for any user/group that should only access data through the RLS/CLS view.
      • Test View Logic: As an admin, query the view directly and verify its filtering/masking behavior.
      • Verify User Context: Use SELECT current_user(); and SELECT is_member('group_name'); to confirm the user’s identity and group memberships are correctly recognized by Databricks.

Testing & Verification

To thoroughly test and verify the Unity Catalog setup and permissions:

  1. As Databricks Account Admin:

    • Navigate to the Databricks UI -> Catalog Explorer.
    • Confirm the supply_chain_analytics catalog exists.
    • Expand the catalog and verify all schemas (bronze, silver, gold, etc.) are present.
    • Within each schema, check that the tables created by DLT and Structured Streaming jobs (e.g., raw_events_bronze, logistics_costs_stream) are listed as Unity Catalog managed tables.
    • Review the permissions granted to groups/users on catalogs, schemas, tables, and views.
  2. As a data_engineer (or similar role):

    • Log in to Databricks as a user belonging to the data_engineers group.
    • Open a new notebook and try to:
      • USE CATALOG supply_chain_analytics; (should succeed)
      • SELECT * FROM bronze.raw_events_bronze LIMIT 10; (should succeed)
      • CREATE TABLE bronze.test_table (msg STRING); followed by INSERT INTO bronze.test_table VALUES ('test'); (should succeed given the CREATE TABLE and MODIFY grants)
      • SELECT * FROM procurement_intelligence.vendor_prices LIMIT 10; (should fail with PERMISSION_DENIED)
  3. As a procurement_team member (or similar role with RLS/CLS):

    • Log in to Databricks as a user belonging to the procurement_team group.
    • Open a new notebook and try to:
      • USE CATALOG supply_chain_analytics; (should succeed)
      • SELECT * FROM procurement_intelligence.my_region_vendor_prices LIMIT 10; (should succeed, showing only data for their region as per RLS logic)
      • SELECT * FROM procurement_intelligence.vendor_prices LIMIT 10; (should fail with PERMISSION_DENIED)
      • SELECT * FROM procurement_intelligence.public_vendor_prices LIMIT 10; (should succeed, showing masked columns if applicable)

This systematic testing ensures that your data lakehouse is not only functional but also securely governed according to your organizational policies.

Summary & Next Steps

In this chapter, we have successfully implemented robust data governance and security for our real-time supply chain data lakehouse using Databricks Unity Catalog. We learned how to:

  • Structure our data assets using catalogs and schemas.
  • Configure DLT pipelines and Spark Structured Streaming jobs to write to Unity Catalog managed tables.
  • Apply fine-grained access control to different user groups and service principals using GRANT statements.
  • Implement Row-Level Security (RLS) and Column-Level Security (CLS) using SQL views to protect sensitive data.
  • Understand critical production considerations like the principle of least privilege, automation, audit logging, and integration with identity providers.

By integrating Unity Catalog, our project now boasts a centralized, compliant, and auditable data platform, ready to serve various stakeholders securely. This foundation is crucial for scaling our data operations and ensuring data integrity and confidentiality.

In the next chapter, we will shift our focus to Chapter 14: CI/CD for Data Pipelines with Databricks Asset Bundles. We’ll explore how to automate the deployment of our DLT pipelines, Structured Streaming jobs, and Unity Catalog configurations using a modern CI/CD framework, ensuring consistent and reliable deployments across environments.