Databricks-Certified-Professional-Data-Engineer Databricks Certified Data Engineer Professional Exam Questions and Answers

Questions 4

A data engineer wants to join a stream of advertisement impressions (when an ad was shown) with another stream of user clicks on advertisements to correlate when an impression led to monetizable clicks.

Databricks-Certified-Professional-Data-Engineer Question 4

Which solution would improve the performance?

A)

Databricks-Certified-Professional-Data-Engineer Question 4

B)

Databricks-Certified-Professional-Data-Engineer Question 4

C)

Databricks-Certified-Professional-Data-Engineer Question 4

D)

Databricks-Certified-Professional-Data-Engineer Question 4

Options:

A.

Option A

B.

Option B

C.

Option C

D.

Option D

Questions 5

The security team is exploring whether or not the Databricks secrets module can be leveraged for connecting to an external database.

After testing the code with all Python variables being defined with strings, they upload the password to the secrets module and configure the correct permissions for the currently active user. They then modify their code to the following (leaving all other variables unchanged).

Databricks-Certified-Professional-Data-Engineer Question 5

Which statement describes what will happen when the above code is executed?

Options:

A.

The connection to the external table will fail; the string "redacted" will be printed.

B.

An interactive input box will appear in the notebook; if the right password is provided, the connection will succeed and the encoded password will be saved to DBFS.

C.

An interactive input box will appear in the notebook; if the right password is provided, the connection will succeed and the password will be printed in plain text.

D.

The connection to the external table will succeed; the string value of password will be printed in plain text.

E.

The connection to the external table will succeed; the string "redacted" will be printed.

Questions 6

The data engineering team is configuring environments for development, testing, and production before beginning migration of a new data pipeline. The team requires extensive testing on both the code and the data resulting from code execution, and the team wants to develop and test against data that is as similar to production data as possible.

A junior data engineer suggests that production data can be mounted to the development and testing environments, allowing pre-production code to execute against production data. Because all users have admin privileges in the development environment, the junior data engineer has offered to configure permissions and mount this data for the team.

Which statement captures best practices for this situation?

Options:

A.

Because access to production data will always be verified using passthrough credentials it is safe to mount data to any Databricks development environment.

B.

All developer, testing and production code and data should exist in a single unified workspace; creating separate environments for testing and development further reduces risks.

C.

In environments where interactive code will be executed, production data should only be accessible with read permissions; creating isolated databases for each environment further reduces risks.

D.

Because Delta Lake versions all data and supports time travel, it is not possible for user error or malicious actors to permanently delete production data; as such, it is generally safe to mount production data anywhere.

Questions 7

Spill occurs as a result of executing various wide transformations. However, diagnosing spill requires one to proactively look for key indicators.

Where in the Spark UI are two of the primary indicators that a partition is spilling to disk?

Options:

A.

Stage’s detail screen and Executor’s files

B.

Stage’s detail screen and Query’s detail screen

C.

Driver’s and Executor’s log files

D.

Executor’s detail screen and Executor’s log files

Questions 8

A user new to Databricks is trying to troubleshoot long execution times for some pipeline logic they are working on. Presently, the user is executing code cell-by-cell, using display() calls to confirm code is producing the logically correct results as new transformations are added to an operation. To get a measure of average time to execute, the user is running each cell multiple times interactively.

Which of the following adjustments will get a more accurate measure of how code is likely to perform in production?

Options:

A.

Scala is the only language that can be accurately tested using interactive notebooks; because the best performance is achieved by using Scala code compiled to JARs, all PySpark and Spark SQL logic should be refactored.

B.

The only way to meaningfully troubleshoot code execution times in development notebooks is to use production-sized data and production-sized clusters with Run All execution.

C.

Production code development should only be done using an IDE; executing code against a local build of open source Spark and Delta Lake will provide the most accurate benchmarks for how code will perform in production.

D.

Calling display() forces a job to trigger, while many transformations will only add to the logical query plan; because of caching, repeated execution of the same logic does not provide meaningful results.

E.

The Jobs UI should be leveraged to occasionally run the notebook as a job and track execution time during incremental code development because Photon can only be enabled on clusters launched for scheduled jobs.

Questions 9

An organization processes customer data from web and mobile applications. Data includes names, emails, phone numbers, and location history. Data arrives both as batch files (from SFTP daily) and streaming JSON events (from Kafka in real-time).

To comply with data privacy policies, the following requirements must be met:

    Personally Identifiable Information (PII) such as email, phone number, and IP address must be masked or anonymized before storage.

    Both batch and streaming pipelines must apply consistent PII handling.

    Masking logic must be auditable and reproducible.

    The masked data must remain usable for downstream analytics.

How should the data engineer design a compliant data pipeline on Databricks that supports both batch and streaming modes, applies data masking to PII, and maintains traceability for audits?

Options:

A.

Allow PII to be stored unmasked in Bronze for lineage tracking, then apply masking logic in Gold tables used for reporting.

B.

Load batch data with notebooks and ingest streaming data with SQL Warehouses; use Unity Catalog column masks on Silver tables to redact fields after storage.

C.

Ingest both batch and streaming data using Lakeflow Declarative Pipelines, and apply masking via Unity Catalog column masks at read time to avoid modifying the data during ingestion.

D.

Use Lakeflow Declarative Pipelines for batch and streaming ingestion, define a PII masking function, and apply it during Bronze ingestion before writing to Delta Lake.
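
A masking function of the kind mentioned in the options might look like the following minimal sketch in plain Python. It assumes that a keyed SHA-256 hash (HMAC) is an acceptable masking method; the names mask_pii, mask_record, and MASKING_KEY, as well as the sample records, are hypothetical and do not come from the question. Because the hash is deterministic, the same input always yields the same token, so the logic is reproducible and masked values can still be grouped or joined downstream.

```python
import hashlib
import hmac

# Hypothetical secret key; in practice it would be read from a secret store.
MASKING_KEY = b"example-key-not-for-production"

def mask_pii(value):
    """Return a deterministic, irreversible token for a PII string."""
    if value is None:
        return None
    # Normalize so that "A@x.com" and "a@x.com " map to the same token.
    normalized = value.strip().lower().encode("utf-8")
    return hmac.new(MASKING_KEY, normalized, hashlib.sha256).hexdigest()

def mask_record(record, pii_fields=("email", "phone", "ip_address")):
    """Mask the PII fields of one record; other fields pass through unchanged."""
    return {k: mask_pii(v) if k in pii_fields else v for k, v in record.items()}

# The same function is applied to a batch row and to a streaming event.
batch_row = {"user_id": 1, "email": "Ana@example.com", "country": "PT"}
stream_event = {"user_id": 1, "email": "ana@example.com ", "country": "PT"}
masked_batch = mask_record(batch_row)
masked_stream = mask_record(stream_event)
```

Keeping the masking logic in one shared function is what lets batch and streaming inputs be handled identically; on Databricks the same idea would be expressed with Spark column expressions rather than Python dictionaries.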

Questions 10

Which statement regarding stream-static joins and static Delta tables is correct?

Options:

A.

Each microbatch of a stream-static join will use the most recent version of the static Delta table as of each microbatch.

B.

Each microbatch of a stream-static join will use the most recent version of the static Delta table as of the job's initialization.

C.

The checkpoint directory will be used to track state information for the unique keys present in the join.

D.

Stream-static joins cannot use static Delta tables because of consistency issues.

E.

The checkpoint directory will be used to track updates to the static Delta table.

Questions 11

To identify the top users consuming compute resources, a data engineering team needs to monitor usage within their Databricks workspace for better resource utilization and cost control. The team decided to use Databricks system tables, available under the System catalog in Unity Catalog, to gain detailed visibility into workspace activity.

Which SQL query should the team run from the System catalog to achieve this?

Options:

A.

SELECT sku_name,
       identity_metadata.created_by AS user_email,
       COUNT(usage_quantity) AS total_dbus
FROM system.billing.usage
GROUP BY user_email, sku_name
ORDER BY total_dbus DESC
LIMIT 10

B.

SELECT identity_metadata.run_as AS user_email,
       SUM(usage_quantity) AS total_dbus
FROM system.billing.usage
GROUP BY user_email
ORDER BY total_dbus DESC
LIMIT 10

C.

SELECT sku_name,
       identity_metadata.created_by AS user_email,
       SUM(usage_quantity * usage_unit) AS total_dbus
FROM system.billing.usage
GROUP BY user_email, sku_name
ORDER BY total_dbus DESC
LIMIT 10

D.

SELECT sku_name,
       usage_metadata.run_name AS user_email,
       SUM(usage_quantity) AS total_dbus
FROM system.billing.usage
GROUP BY user_email, sku_name
ORDER BY total_dbus DESC
LIMIT 10
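
The difference between the SUM and COUNT aggregates used in these options can be checked with a small stand-in. The sketch below uses Python's built-in sqlite3 module and a flattened, hypothetical table usage_sample(run_as, sku_name, usage_quantity); it is not the real system.billing.usage schema, which stores the identity in a nested identity_metadata field.

```python
import sqlite3

# In-memory stand-in table with hypothetical sample rows.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE usage_sample (run_as TEXT, sku_name TEXT, usage_quantity REAL)"
)
conn.executemany(
    "INSERT INTO usage_sample VALUES (?, ?, ?)",
    [
        ("ana@example.com", "JOBS", 10.0),
        ("ana@example.com", "SQL", 5.0),
        ("bo@example.com", "JOBS", 40.0),
    ],
)

# SUM totals the consumed quantity per user.
total_by_user = conn.execute(
    "SELECT run_as, SUM(usage_quantity) AS total_dbus "
    "FROM usage_sample GROUP BY run_as ORDER BY total_dbus DESC LIMIT 10"
).fetchall()

# COUNT only counts usage rows per user, regardless of their size.
rows_by_user = conn.execute(
    "SELECT run_as, COUNT(usage_quantity) AS n_rows "
    "FROM usage_sample GROUP BY run_as ORDER BY n_rows DESC LIMIT 10"
).fetchall()
```

In this sample the two rankings disagree: the user with a single large usage row leads by total quantity, while the user with two small rows leads by row count.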

Questions 12

The data engineering team has been tasked with configuring connections to an external database that does not have a supported native connector with Databricks. The external database already has data security configured by group membership. These groups map directly to user groups already created in Databricks that represent various teams within the company.

A new login credential has been created for each group in the external database. The Databricks Utilities Secrets module will be used to make these credentials available to Databricks users.

Assuming that all the credentials are configured correctly on the external database and group membership is properly configured on Databricks, which statement describes how teams can be granted the minimum necessary access to use these credentials?

Options:

A.

“Read” permissions should be set on a secret key mapped to those credentials that will be used by a given team.

B.

No additional configuration is necessary as long as all users are configured as administrators in the workspace where secrets have been added.

C.

“Read” permissions should be set on a secret scope containing only those credentials that will be used by a given team.

D.

“Manage” permission should be set on a secret scope containing only those credentials that will be used by a given team.

Questions 13

A platform engineer is creating catalogs and schemas for the development team to use.

The engineer has created an initial catalog, catalog_A, and initial schema, schema_A. The engineer has also granted USE CATALOG, USE SCHEMA, and CREATE TABLE to the development team so that the engineer can begin populating the schema with new tables.

Despite being owner of the catalog and schema, the engineer noticed that they do not have access to the underlying tables in schema_A.

What explains the engineer's lack of access to the underlying tables?

Options:

A.

The platform engineer needs to execute a REFRESH statement as the table permissions did not automatically update for owners.

B.

Users granted with USE CATALOG can modify the owner's permissions to downstream tables.

C.

The owner of the schema does not automatically have permission to tables within the schema, but can grant them to themselves at any point.

D.

Permissions explicitly given by the table creator are the only way the platform engineer could access the underlying tables in their schema.

Questions 14

Which distribution does Databricks support for installing custom Python code packages?

Options:

A.

sbt

B.

CRAN

C.

CRAM

D.

npm

E.

Wheels

F.

jars

Questions 15

Where in the Spark UI can one diagnose a performance problem induced by not leveraging predicate push-down?

Options:

A.

In the Executor's log file, by grepping for "predicate push-down"

B.

In the Stage's Detail screen, in the Completed Stages table, by noting the size of data read from the Input column

C.

In the Storage Detail screen, by noting which RDDs are not stored on disk

D.

In the Delta Lake transaction log, by noting the column statistics

E.

In the Query Detail screen, by interpreting the Physical Plan

Questions 16

A data engineer, while designing a Pandas UDF to process financial time-series data with complex calculations that require maintaining state across rows within each stock symbol group, must ensure the function is efficient and scalable. Which approach will solve the problem with minimum overhead while preserving data integrity?

Options:

A.

Use a scalar_iter Pandas UDF with iterator-based processing, implementing state management through persistent storage (Delta tables) that gets updated after each batch to maintain continuity across iterator chunks.

B.

Use a scalar Pandas UDF that processes the entire dataset at once, implementing custom partitioning logic within the UDF to group by stock symbol and maintain state using global variables shared across all executor processes.

C.

Use applyInPandas on a Spark DataFrame so that each stock symbol group is received as a pandas DataFrame, allowing processing within each group while maintaining state variables local to each group’s processing function.

D.

Use a grouped-aggregate Pandas UDF that processes each stock symbol group independently, maintaining state through intermediate aggregation results that get passed between successive UDF calls via broadcast variables.

Questions 17

A Spark job is taking longer than expected. Using the Spark UI, a data engineer notes that the Min, Median, and Max Durations for tasks in a particular stage show the minimum and median time to complete a task as roughly the same, but the max duration for a task as roughly 100 times as long as the minimum.

Which situation is causing increased duration of the overall job?

Options:

A.

Task queueing resulting from improper thread pool assignment.

B.

Spill resulting from attached volume storage being too small.

C.

Network latency due to some cluster nodes being in different regions from the source data.

D.

Skew caused by more data being assigned to a subset of Spark partitions.

E.

Credential validation errors while pulling data from an external system.
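
The symptom described in the question (minimum and median task durations close together, one maximum far above them) can be expressed as a simple ratio. The following is a minimal sketch in plain Python; the function name skew_ratio and the sample durations are hypothetical, and the Spark UI reports these summary statistics directly, so nothing here is a Spark API.

```python
from statistics import median

def skew_ratio(task_durations_s):
    """Return max/median task duration; values far above 1 suggest a straggler."""
    return max(task_durations_s) / median(task_durations_s)

# Hypothetical durations in seconds: 199 similar tasks and one straggler.
durations = [2.0] * 199 + [200.0]
ratio = skew_ratio(durations)

# A stage with evenly sized tasks has a ratio close to 1.
even_ratio = skew_ratio([3.0, 3.0, 3.0])
```

Because a stage finishes only when its slowest task finishes, a single task with a ratio like the one above determines the duration of the whole stage.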

Questions 18

A data engineer is running a groupBy aggregation on a massive user activity log grouped by user_id. A few users have millions of records, causing task skew and long runtimes.

Which technique will fix the skew in this aggregation?

Options:

A.

Use salting by adding a random prefix to skewed keys before aggregation, then aggregate again after removing the prefix.

B.

Increase the Spark driver memory and retry.

C.

Use reduceByKey instead of groupBy to avoid shuffles.

D.

Filter out the skewed users before the aggregation.
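
The salting technique named in the options can be sketched in plain Python as a two-stage count. This is an illustration of the idea only, not Spark code: the function salted_count, the constant NUM_SALTS, and the sample events are hypothetical, and for simplicity the salt is applied to every key rather than only to the skewed ones.

```python
import random
from collections import Counter

NUM_SALTS = 8  # hypothetical number of salt buckets

def salted_count(user_ids, num_salts=NUM_SALTS, seed=0):
    """Count records per user in two stages using salted keys."""
    rng = random.Random(seed)
    # Stage 1: aggregate on (salt, user_id), which spreads each hot key
    # across up to num_salts partial groups.
    partial = Counter((rng.randrange(num_salts), uid) for uid in user_ids)
    # Stage 2: drop the salt and combine the partial counts per user.
    final = Counter()
    for (_salt, uid), n in partial.items():
        final[uid] += n
    return partial, final

# One hot key (u1) and two small keys.
events = ["u1"] * 1000 + ["u2"] * 10 + ["u3"] * 5
partial_counts, totals = salted_count(events)
```

The first stage never has to process all records of the hot key in a single group, and the second stage only combines a handful of partial results per user, so the final totals match an unsalted count.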

Questions 19

A junior data engineer seeks to leverage Delta Lake's Change Data Feed functionality to create a Type 1 table representing all of the values that have ever been valid for all rows in a bronze table created with the property delta.enableChangeDataFeed = true. They plan to execute the following code as a daily job:

Databricks-Certified-Professional-Data-Engineer Question 19

Which statement describes the execution and results of running the above query multiple times?

Options:

A.

Each time the job is executed, newly updated records will be merged into the target table, overwriting previous values with the same primary keys.

B.

Each time the job is executed, the entire available history of inserted or updated records will be appended to the target table, resulting in many duplicate entries.

C.

Each time the job is executed, the target table will be overwritten using the entire history of inserted or updated records, giving the desired result.

D.

Each time the job is executed, the differences between the original and current versions are calculated; this may result in duplicate entries for some records.

E.

Each time the job is executed, only those records that have been inserted or updated since the last execution will be appended to the target table giving the desired result.

Questions 20

A data engineering team is setting up deployment automation. To deploy workspace assets remotely using the Databricks CLI command, they must configure it with proper authentication.

Which authentication approach will provide the highest level of security?

Options:

A.

Use a service principal with OAuth token federation.

B.

Use a service principal ID and its OAuth client secret.

C.

Use a service principal and its Personal Access Token.

D.

Use a shared user account and its OAuth client secret.

Questions 21

A table is registered with the following code:

Databricks-Certified-Professional-Data-Engineer Question 21

Both users and orders are Delta Lake tables. Which statement describes the results of querying recent_orders?

Options:

A.

All logic will execute at query time and return the result of joining the valid versions of the source tables at the time the query finishes.

B.

All logic will execute when the table is defined and store the result of joining tables to the DBFS; this stored data will be returned when the table is queried.

C.

Results will be computed and cached when the table is defined; these cached results will incrementally update as new records are inserted into source tables.

D.

All logic will execute at query time and return the result of joining the valid versions of the source tables at the time the query began.

E.

The versions of each source table will be stored in the table transaction log; query results will be saved to DBFS with each query.

Questions 22

The following table consists of items found in user carts within an e-commerce website.

Databricks-Certified-Professional-Data-Engineer Question 22

The following MERGE statement is used to update this table using an updates view, with schema evolution enabled on this table.

Databricks-Certified-Professional-Data-Engineer Question 22

How would the following update be handled?

Options:

A.

The update is moved to a separate "restored" column because it is missing a column expected in the target schema.

B.

The new restored field is added to the target schema, and dynamically read as NULL for existing unmatched records.

C.

The update throws an error because changes to existing columns in the target schema are not supported.

D.

The new nested field is added to the target schema, and files underlying existing records are updated to include NULL values for the new field.

Questions 23

The business intelligence team has a dashboard configured to track various summary metrics for retail stores. This includes total sales for the previous day alongside totals and averages for a variety of time periods. The fields required to populate this dashboard have the following schema:

Databricks-Certified-Professional-Data-Engineer Question 23

For demand forecasting, the Lakehouse contains a validated table of all itemized sales updated incrementally in near real-time. This table, named products_per_order, includes the following fields:

Databricks-Certified-Professional-Data-Engineer Question 23

Because reporting on long-term sales trends is less volatile, analysts using the new dashboard only require data to be refreshed once daily. Because the dashboard will be queried interactively by many users throughout a normal business day, it should return results quickly and reduce total compute associated with each materialization.

Which solution meets the expectations of the end users while controlling and limiting possible costs?

Options:

A.

Use the Delta Cache to persist the products_per_order table in memory to quickly update the dashboard with each query.

B.

Populate the dashboard by configuring a nightly batch job to save the required values, so that the dashboard can be updated quickly with each query.

C.

Use Structured Streaming to configure a live dashboard against the products_per_order table within a Databricks notebook.

D.

Define a view against the products_per_order table and define the dashboard against this view.

Questions 24

A distributed team of data analysts share computing resources on an interactive cluster with autoscaling configured. In order to better manage costs and query throughput, the workspace administrator is hoping to evaluate whether cluster upscaling is caused by many concurrent users or resource-intensive queries.

In which location can one review the timeline for cluster resizing events?

Options:

A.

Workspace audit logs

B.

Driver's log file

C.

Ganglia

D.

Cluster Event Log

E.

Executor's log file

Questions 25

A data engineer has created a transactions Delta table on Databricks that should be used by the analytics team. The analytics team wants to use the table with another tool that requires Apache Iceberg format.

What should the data engineer do?

Options:

A.

Require the analytics team to use a tool that supports Delta table.

B.

Enable UniForm on the transactions table to 'iceberg' so that the table can be read as an Iceberg table.

C.

Create an Iceberg copy of the transactions Delta table which can be used by the analytics team.

D.

Convert the transactions Delta table to Iceberg and enable UniForm so that the table can be read as a Delta table.

Questions 26

The data governance team is reviewing code used for deleting records for compliance with GDPR. The following logic has been implemented to propagate delete requests from the user_lookup table to the user_aggregates table.

Databricks-Certified-Professional-Data-Engineer Question 26

Assuming that user_id is a unique identifying key and that all users who have requested deletion have been removed from the user_lookup table, which statement describes whether successfully executing the above logic guarantees that the records to be deleted from the user_aggregates table are no longer accessible, and why?

Options:

A.

No: files containing deleted records may still be accessible with time travel until a VACUUM command is used to remove invalidated data files.

B.

Yes: Delta Lake ACID guarantees provide assurance that the DELETE command succeeded fully and permanently purged these records.

C.

No: the change data feed only tracks inserts and updates, not deleted records.

D.

No: the Delta Lake DELETE command only provides ACID guarantees when combined with the MERGE INTO command.

Questions 27

A Databricks job has been configured with 3 tasks, each of which is a Databricks notebook. Task A does not depend on other tasks. Tasks B and C run in parallel, with each having a serial dependency on task A.

If tasks A and B complete successfully but task C fails during a scheduled run, which statement describes the resulting state?

Options:

A.

All logic expressed in the notebook associated with tasks A and B will have been successfully completed; some operations in task C may have completed successfully.

B.

All logic expressed in the notebook associated with tasks A and B will have been successfully completed; any changes made in task C will be rolled back due to task failure.

C.

All logic expressed in the notebook associated with task A will have been successfully completed; tasks B and C will not commit any changes because of stage failure.

D.

Because all tasks are managed as a dependency graph, no changes will be committed to the Lakehouse until all tasks have successfully been completed.

E.

Unless all tasks complete successfully, no changes will be committed to the Lakehouse; because task C failed, all commits will be rolled back automatically.

Questions 28

A data engineer is using Auto Loader to read incoming JSON data as it arrives. They have configured Auto Loader to quarantine invalid JSON records but notice that over time, some records are being quarantined even though they are well-formed JSON.

The code snippet is:

df = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("badRecordsPath", "/tmp/somewhere/badRecordsPath")
    .schema("a int, b int")
    .load("/Volumes/catalog/schema/raw_data/"))

What is the cause of the missing data?

Options:

A.

At some point, the upstream data provider switched everything to multi-line JSON.

B.

The badRecordsPath location is accumulating many small files.

C.

The source data is valid JSON but does not conform to the defined schema in some way.

D.

The engineer forgot to set the option "cloudFiles.quarantineMode" = "rescue".
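
The distinction the question turns on, between a record that is well-formed JSON and a record that conforms to the declared schema "a int, b int", can be illustrated in plain Python. The sketch below is not Auto Loader's implementation and makes no claim about its exact rules; the function classify, the mapping EXPECTED_TYPES, and the sample lines are hypothetical.

```python
import json

EXPECTED_TYPES = {"a": int, "b": int}  # mirrors the schema "a int, b int"

def classify(line):
    """Return 'malformed', 'schema_mismatch', or 'ok' for one JSON line."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return "malformed"
    if not isinstance(record, dict):
        return "schema_mismatch"
    for name, expected in EXPECTED_TYPES.items():
        value = record.get(name)
        # bool is a subclass of int in Python, so it is rejected explicitly.
        if value is not None and (
            isinstance(value, bool) or not isinstance(value, expected)
        ):
            return "schema_mismatch"
    return "ok"

samples = ('{"a": 1, "b": 2}', '{"a": "one", "b": 2}', '{"a": 1, "b"')
results = [classify(s) for s in samples]
```

The second sample parses without error yet carries a string where an integer is declared, which is the kind of record that can be valid JSON and still fail a schema check.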

Questions 29

A workspace admin has created a new catalog called finance_data and wants to delegate permission management to a finance team lead without giving them full admin rights.

Which privilege should be granted to the finance team lead?

Options:

A.

ALL PRIVILEGES on the finance_data catalog.

B.

Make the finance team lead a metastore admin.

C.

GRANT OPTION privilege on the finance_data catalog.

D.

MANAGE privilege on the finance_data catalog.

Questions 30

Which of the following is true of Delta Lake and the Lakehouse?

Options:

A.

Because Parquet compresses data row by row, strings will only be compressed when a character is repeated multiple times.

B.

Delta Lake automatically collects statistics on the first 32 columns of each table which are leveraged in data skipping based on query filters.

C.

Views in the Lakehouse maintain a valid cache of the most recent versions of source tables at all times.

D.

Primary and foreign key constraints can be leveraged to ensure duplicate values are never entered into a dimension table.

E.

Z-order can only be applied to numeric values stored in Delta Lake tables.

Questions 31

A data engineer has configured their Databricks Asset Bundle with multiple targets in databricks.yml and deployed it to the production workspace. Now, to validate the deployment, they need to invoke a job named my_project_job specifically within the prod target context. Assuming the job is already deployed, they need to trigger its execution while ensuring the target-specific configuration is respected.

Which command will trigger the job execution?

Options:

A.

databricks execute my_project_job -e prod

B.

databricks job run my_project_job --env prod

C.

databricks run my_project_job -t prod

D.

databricks bundle run my_project_job -t prod

Questions 32

The data architect has decided that once data has been ingested from external sources into the Databricks Lakehouse, table access controls will be leveraged to manage permissions for all production tables and views.

The following logic was executed to grant privileges for interactive queries on a production database to the core engineering group.

GRANT USAGE ON DATABASE prod TO eng;

GRANT SELECT ON DATABASE prod TO eng;

Assuming these are the only privileges that have been granted to the eng group and that these users are not workspace administrators, which statement describes their privileges?

Options:

A.

Group members have full permissions on the prod database and can also assign permissions to other users or groups.

B.

Group members are able to list all tables in the prod database but are not able to see the results of any queries on those tables.

C.

Group members are able to query and modify all tables and views in the prod database, but cannot create new tables or views.

D.

Group members are able to query all tables and views in the prod database, but cannot create or edit anything in the database.

E.

Group members are able to create, query, and modify all tables and views in the prod database, but cannot define custom functions.

Questions 33

An hourly batch job is configured to ingest data files from a cloud object storage container where each batch represents all records produced by the source system in a given hour. The batch job to process these records into the Lakehouse is sufficiently delayed to ensure no late-arriving data is missed. The user_id field represents a unique key for the data, which has the following schema:

user_id BIGINT, username STRING, user_utc STRING, user_region STRING, last_login BIGINT, auto_pay BOOLEAN, last_updated BIGINT

New records are all ingested into a table named account_history which maintains a full record of all data in the same schema as the source. The next table in the system is named account_current and is implemented as a Type 1 table representing the most recent value for each unique user_id.

Assuming there are millions of user accounts and tens of thousands of records processed hourly, which implementation can be used to efficiently update the described account_current table as part of each hourly batch job?

Options:

A.

Use Auto Loader to subscribe to new files in the account_history directory; configure a Structured Streaming trigger-once job to batch update newly detected files into the account_current table.

B.

Overwrite the account_current table with each batch using the results of a query against the account_history table, grouping by user_id and filtering for the max value of last_updated.

C.

Filter records in account_history using the last_updated field and the most recent hour processed, as well as the max last_login by user_id; write a merge statement to update or insert the most recent value for each user_id.

D.

Use Delta Lake version history to get the difference between the latest version of account_history and one version prior, then write these records to account_current.

E.

Filter records in account_history using the last_updated field and the most recent hour processed, making sure to deduplicate on username; write a merge statement to update or insert the most recent value for each username.
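
The filter, deduplicate, and merge pattern that several of these options describe can be sketched in plain Python, with a dictionary keyed by user_id standing in for the account_current table. This is an illustration under stated assumptions rather than Delta Lake code: the functions latest_per_user and merge_type1 and the sample records are hypothetical, and last_updated is assumed to be the field that determines recency.

```python
def latest_per_user(batch):
    """Keep only the record with the highest last_updated for each user_id."""
    latest = {}
    for rec in batch:
        current = latest.get(rec["user_id"])
        if current is None or rec["last_updated"] > current["last_updated"]:
            latest[rec["user_id"]] = rec
    return latest

def merge_type1(account_current, batch):
    """Upsert the deduplicated batch into the current-state table (a dict here)."""
    for user_id, rec in latest_per_user(batch).items():
        account_current[user_id] = rec  # update when matched, insert otherwise
    return account_current

# Existing state: one user. Hourly batch: two changes for user 1, one new user.
account_current = {1: {"user_id": 1, "auto_pay": False, "last_updated": 100}}
hourly_batch = [
    {"user_id": 1, "auto_pay": True, "last_updated": 160},
    {"user_id": 1, "auto_pay": False, "last_updated": 130},
    {"user_id": 2, "auto_pay": True, "last_updated": 150},
]
merge_type1(account_current, hourly_batch)
```

Only the records in the hourly batch are touched, so the cost of each run scales with the tens of thousands of new records rather than with the millions of existing accounts.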

Questions 34

A Delta Lake table representing metadata about content from users has the following schema:

user_id LONG, post_text STRING, post_id STRING, longitude FLOAT, latitude FLOAT, post_time TIMESTAMP, date DATE

Based on the above schema, which column is a good candidate for partitioning the Delta Table?

Options:

A.

Date

B.

Post_id

C.

User_id

D.

Post_time

Questions 35

A data engineer is designing a Lakeflow Spark Declarative Pipeline to process streaming order data. The pipeline uses Auto Loader to ingest data and must enforce data quality by ensuring customer_id is not null and amount is greater than zero. Invalid records should be dropped. Which Lakeflow Spark Declarative Pipelines configuration implements this requirement using Python?

Options:

A.

@dlt.table
def silver_orders():
    return dlt.read_stream("bronze_orders") \
        .expect_or_drop("valid_customer", "customer_id IS NOT NULL") \
        .expect_or_drop("valid_amount", "amount > 0")

B.

@dlt.table
def silver_orders():
    return dlt.read_stream("bronze_orders") \
        .expect("valid_customer", "customer_id IS NOT NULL") \
        .expect("valid_amount", "amount > 0")

C.

@dlt.table
@dlt.expect("valid_customer", "customer_id IS NOT NULL")
@dlt.expect("valid_amount", "amount > 0")
def silver_orders():
    return dlt.read_stream("bronze_orders")

D.

@dlt.table
@dlt.expect_or_drop("valid_customer", "customer_id IS NOT NULL")
@dlt.expect_or_drop("valid_amount", "amount > 0")
def silver_orders():
    return dlt.read_stream("bronze_orders")

Questions 36

A data engineer is using Lakeflow Declarative Pipelines Expectations feature to track the data quality of their incoming sensor data. Periodically, sensors send bad readings that are out of range, and they are currently flagging those rows with a warning and writing them to the silver table along with the good data. They’ve been given a new requirement – the bad rows need to be quarantined in a separate quarantine table and no longer included in the silver table.

This is the existing code for their silver table:

@dlt.table
@dlt.expect("valid_sensor_reading", "reading < 120")
def silver_sensor_readings():
    return spark.readStream.table("bronze_sensor_readings")

What code will satisfy the requirements?

Options:

A.

@dlt.table
@dlt.expect("valid_sensor_reading", "reading < 120")
def silver_sensor_readings():
    return spark.readStream.table("bronze_sensor_readings")

@dlt.table
@dlt.expect("invalid_sensor_reading", "reading >= 120")
def quarantine_sensor_readings():
    return spark.readStream.table("bronze_sensor_readings")

B.

@dlt.table
@dlt.expect_or_drop("valid_sensor_reading", "reading < 120")
def silver_sensor_readings():
    return spark.readStream.table("bronze_sensor_readings")

@dlt.table
@dlt.expect("invalid_sensor_reading", "reading < 120")
def quarantine_sensor_readings():
    return spark.readStream.table("bronze_sensor_readings")

C.

@dlt.table
@dlt.expect_or_drop("valid_sensor_reading", "reading < 120")
def silver_sensor_readings():
    return spark.readStream.table("bronze_sensor_readings")

@dlt.table
@dlt.expect_or_drop("invalid_sensor_reading", "reading >= 120")
def quarantine_sensor_readings():
    return spark.readStream.table("bronze_sensor_readings")

D.

@dlt.table
@dlt.expect_or_drop("valid_sensor_reading", "reading < 120")
def silver_sensor_readings():
    return spark.readStream.table("bronze_sensor_readings")

@dlt.table
@dlt.expect("invalid_sensor_reading", "reading >= 120")
def quarantine_sensor_readings():
    return spark.readStream.table("bronze_sensor_readings")
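
The quarantine idea in this question amounts to routing every bronze row to exactly one of two outputs: rows that satisfy the rule go to the clean table, and rows that do not go to the quarantine table. A minimal plain-Python sketch of that routing follows; `split_by_rule` and the sample readings are hypothetical, and no pipeline library is involved.

```python
from typing import Callable, Iterable, List, Tuple

Row = dict


def split_by_rule(
    rows: Iterable[Row], is_valid: Callable[[Row], bool]
) -> Tuple[List[Row], List[Row]]:
    """Route each row to exactly one output: (clean_rows, quarantined_rows)."""
    clean: List[Row] = []
    quarantined: List[Row] = []
    for row in rows:
        (clean if is_valid(row) else quarantined).append(row)
    return clean, quarantined


bronze_sensor_readings = [
    {"sensor": "a", "reading": 72},
    {"sensor": "b", "reading": 120},
    {"sensor": "c", "reading": 250},
]

# Rule taken from the question: a reading is valid when it is below 120.
silver, quarantine = split_by_rule(
    bronze_sensor_readings, lambda row: row["reading"] < 120
)
```

Every input row lands in one and only one list, which is the property the two-table design needs: nothing is lost and nothing is duplicated.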

Questions 37

A facilities-monitoring team is building a near-real-time Power BI dashboard off the Delta table device_readings:

Columns:

    device_id (STRING, unique sensor ID)

    event_ts (TIMESTAMP, ingestion timestamp UTC)

    temperature_c (DOUBLE, temperature in °C)

Requirement:

    For each sensor, generate one row per non-overlapping 5-minute interval, offset by 2 minutes (e.g., 00:02–00:07, 00:07–00:12, …).

    Each row must include interval start, interval end, and average temperature in that slice.

    Downstream BI tools (e.g., Power BI) must use the interval timestamps to plot time-series bars.

Options:

A.

WITH buckets AS (
    SELECT device_id,
           window(event_ts, '5 minutes', '2 minutes', '5 minutes') AS win,
           temperature_c
    FROM device_readings
)
SELECT device_id,
       win.start AS bucket_start,
       win.end AS bucket_end,
       AVG(temperature_c) AS avg_temp_5m
FROM buckets
GROUP BY device_id, win
ORDER BY device_id, bucket_start;

B.

SELECT device_id,
       event_ts,
       AVG(temperature_c) OVER (
           PARTITION BY device_id
           ORDER BY event_ts
           RANGE BETWEEN INTERVAL 5 MINUTES PRECEDING AND CURRENT ROW
       ) AS avg_temp_5m
FROM device_readings
WINDOW w AS (window(event_ts, '5 minutes', '2 minutes'));

C.

SELECT device_id,
       date_trunc('minute', event_ts - INTERVAL 2 MINUTES) + INTERVAL 2 MINUTES AS bucket_start,
       date_trunc('minute', event_ts - INTERVAL 2 MINUTES) + INTERVAL 7 MINUTES AS bucket_end,
       AVG(temperature_c) AS avg_temp_5m
FROM device_readings
GROUP BY device_id, date_trunc('minute', event_ts - INTERVAL 2 MINUTES)
ORDER BY device_id, bucket_start;

D.

SELECT device_id,
       window.start AS bucket_start,
       window.end AS bucket_end,
       AVG(temperature_c) AS avg_temp_5m
FROM device_readings
GROUP BY device_id, window(event_ts, '5 minutes', '5 minutes', '2 minutes')
ORDER BY device_id, bucket_start;
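
To check an offset bucketing by hand, it can help to compute the interval boundaries outside SQL. In Spark SQL, `window` takes the time column, the window duration, an optional slide duration, and an optional start-time offset, in that order; a window is non-overlapping (tumbling) when the slide equals the duration. The sketch below reproduces that arithmetic in plain Python, assuming UTC timestamps; `tumbling_bucket` and `average_by_bucket` are hypothetical helper names.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def tumbling_bucket(ts, width=timedelta(minutes=5), offset=timedelta(minutes=2)):
    """Return (start, end) of the non-overlapping window containing ts.

    Windows are aligned to the epoch and then shifted by `offset`.
    """
    index = (ts - EPOCH - offset) // width  # floor division yields an int
    start = EPOCH + offset + index * width
    return start, start + width


def average_by_bucket(readings):
    """Average temperature per (device_id, bucket_start, bucket_end)."""
    totals = defaultdict(lambda: [0.0, 0])
    for device_id, ts, temperature_c in readings:
        start, end = tumbling_bucket(ts)
        acc = totals[(device_id, start, end)]
        acc[0] += temperature_c
        acc[1] += 1
    return {key: total / count for key, (total, count) in totals.items()}


def at(hour, minute):
    return datetime(2024, 1, 1, hour, minute, tzinfo=timezone.utc)


sample = [
    ("d1", at(0, 3), 20.0),
    ("d1", at(0, 6), 22.0),
    ("d1", at(0, 7), 30.0),
]
averages = average_by_bucket(sample)
```

A reading at 00:07 falls into the 00:07–00:12 bucket because each interval includes its start and excludes its end.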

Questions 38

A query is taking too long to run. After investigating the Spark UI, the data engineer discovered a significant amount of disk spill. The compute instance being used has a core-to-memory ratio of 1:2.

What are the two steps the data engineer should take to minimize spillage? (Choose 2 answers)

Options:

A.

Choose a compute instance with a higher core-to-memory ratio.

B.

Choose a compute instance with more disk space.

C.

Increase spark.sql.files.maxPartitionBytes.

D.

Reduce spark.sql.files.maxPartitionBytes.

E.

Choose a compute instance with more network bandwidth.

Questions 39

Which statement describes the default execution mode for Databricks Auto Loader?

Options:

A.

New files are identified by listing the input directory; new files are incrementally and idempotently loaded into the target Delta Lake table.

B.

Cloud vendor-specific queue storage and notification services are configured to track newly arriving files; new files are incrementally and idempotently loaded into the target Delta Lake table.

C.

A webhook triggers a Databricks job to run any time new data arrives in a source directory; new data is automatically merged into target tables using rules inferred from the data.

D.

New files are identified by listing the input directory; the target table is materialized by directly querying all valid files in the source directory.
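
The phrase "incrementally and idempotently" can be illustrated without Spark: list the input directory, skip every file already recorded as processed, and load only the remainder. The sketch below is a toy model of that idea using the local filesystem; it is not Auto Loader, which keeps its file-tracking state in the stream checkpoint rather than in an in-memory set. `load_new_files`, `write_file`, and the file names are hypothetical.

```python
import os
import tempfile


def load_new_files(input_dir, processed, sink):
    """List input_dir and append rows from files not yet in `processed`."""
    new_files = sorted(
        name for name in os.listdir(input_dir) if name not in processed
    )
    for name in new_files:
        with open(os.path.join(input_dir, name), encoding="utf-8") as handle:
            sink.extend(line.rstrip("\n") for line in handle)
        processed.add(name)  # remember the file so it is never loaded twice
    return new_files


def write_file(directory, name, text):
    with open(os.path.join(directory, name), "w", encoding="utf-8") as handle:
        handle.write(text)


processed_files = set()
target_rows = []

with tempfile.TemporaryDirectory() as landing_dir:
    write_file(landing_dir, "batch_001.csv", "1,10.0\n2,12.5\n")
    first_run = load_new_files(landing_dir, processed_files, target_rows)
    # Re-running with no new files must not change the target.
    second_run = load_new_files(landing_dir, processed_files, target_rows)
    write_file(landing_dir, "batch_002.csv", "3,9.0\n")
    third_run = load_new_files(landing_dir, processed_files, target_rows)
```

The second call finds nothing new, so the target holds each input row exactly once regardless of how often the loader runs.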

Questions 40

The view updates represents an incremental batch of all newly ingested data to be inserted or updated in the customers table.

The following logic is used to process these records.

Databricks-Certified-Professional-Data-Engineer Question 40

Which statement describes this implementation?

Options:

A.

The customers table is implemented as a Type 3 table; old values are maintained as a new column alongside the current value.

B.

The customers table is implemented as a Type 2 table; old values are maintained but marked as no longer current and new values are inserted.

C.

The customers table is implemented as a Type 0 table; all writes are append only with no changes to existing values.

D.

The customers table is implemented as a Type 1 table; old values are overwritten by new values and no history is maintained.

E.

The customers table is implemented as a Type 2 table; old values are overwritten and new customers are appended.

Questions 41

An upstream source writes Parquet data as hourly batches to directories named with the current date. A nightly batch job runs the following code to ingest all data from the previous day as indicated by the date variable:

Databricks-Certified-Professional-Data-Engineer Question 41

Assume that the fields customer_id and order_id serve as a composite key to uniquely identify each order.

If the upstream system is known to occasionally produce duplicate entries for a single order hours apart, which statement is correct?

Options:

A.

Each write to the orders table will only contain unique records, and only those records without duplicates in the target table will be written.

B.

Each write to the orders table will only contain unique records, but newly written records may have duplicates already present in the target table.

C.

Each write to the orders table will only contain unique records; if existing records with the same key are present in the target table, these records will be overwritten.

D.

Each write to the orders table will only contain unique records; if existing records with the same key are present in the target table, the operation will fail.

E.

Each write to the orders table will run deduplication over the union of new and existing records, ensuring no duplicate records are present.

Questions 42

A table named user_ltv is being used to create a view that will be used by data analysts on various teams. Users in the workspace are configured into groups, which are used for setting up data access using ACLs.

The user_ltv table has the following schema:

Databricks-Certified-Professional-Data-Engineer Question 42

An analyst who is not a member of the auditing group executes the following query:

Databricks-Certified-Professional-Data-Engineer Question 42

Which result will be returned by this query?

Options:

A.

All columns will be displayed normally for those records that have an age greater than 18; records not meeting this condition will be omitted.

B.

All columns will be displayed normally for those records that have an age greater than 17; records not meeting this condition will be omitted.

C.

All age values less than 18 will be returned as null values; all other columns will be returned with the values in user_ltv.

D.

All records from all columns will be displayed with the values in user_ltv.

Questions 43

A data organization has adopted Delta Sharing to securely distribute curated datasets from a Unity Catalog-enabled workspace. The data engineering team shares large Delta tables internally via Databricks-to-Databricks and externally via Open Sharing for aggregated reports. While testing, they encounter challenges related to access control, data update visibility, and shareable object types.

What is a limitation of the Delta Sharing protocol or implementation when used with Databricks-to-Databricks or Open Sharing?

Options:

A.

With Open Sharing, recipients cannot access Volumes, Models, or notebooks; only static Delta tables are supported.

B.

Delta Sharing does not support Unity Catalog–enabled tables; only legacy Hive Metastore tables are shareable.

C.

With Databricks-to-Databricks sharing, Unity Catalog recipients must re-ingest data manually using COPY INTO or REST APIs.

D.

Delta Sharing (both Databricks-to-Databricks and Open Sharing) allows recipients to modify the source data if they have select privileges.

Questions 44

The data governance team is reviewing code used for deleting records for compliance with GDPR. They note the following logic is used to delete records from the Delta Lake table named users.

Databricks-Certified-Professional-Data-Engineer Question 44

Assuming that user_id is a unique identifying key and that delete_requests contains all users that have requested deletion, which statement describes whether successfully executing the above logic guarantees that the records to be deleted are no longer accessible and why?

Options:

A.

Yes; Delta Lake ACID guarantees provide assurance that the delete command succeeded fully and permanently purged these records.

B.

No; the Delta cache may return records from previous versions of the table until the cluster is restarted.

C.

Yes; the Delta cache immediately updates to reflect the latest data files recorded to disk.

D.

No; the Delta Lake delete command only provides ACID guarantees when combined with the merge into command.

E.

No; files containing deleted records may still be accessible with time travel until a vacuum command is used to remove invalidated data files.

Questions 45

The data science team has created and logged a production model using MLflow. The following code correctly imports and applies the production model to output the predictions as a new DataFrame named preds with the schema "customer_id LONG, predictions DOUBLE, date DATE".

Databricks-Certified-Professional-Data-Engineer Question 45

The data science team would like predictions saved to a Delta Lake table with the ability to compare all predictions across time. Churn predictions will be made at most once per day.

Which code block accomplishes this task while minimizing potential compute costs?

Options:

A.

preds.write.mode("append").saveAsTable("churn_preds")

B.

preds.write.format("delta").save("/preds/churn_preds")

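C.

(code shown as an image in the original; not available)

D.

(code shown as an image in the original; not available)

E.

(code shown as an image in the original; not available)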

Questions 46

A data engineer is designing a Lakeflow Declarative Pipeline to process streaming order data. The pipeline uses Auto Loader to ingest data and must enforce data quality by ensuring customer_id is not null and amount is greater than zero. Invalid records should be dropped.

Which Lakeflow Declarative Pipelines configuration implements this requirement using Python?

Options:

A.

@dlt.table
def silver_orders():
    return (
        dlt.read_stream("bronze_orders")
        .expect_or_drop("valid_customer", "customer_id IS NOT NULL")
        .expect_or_drop("valid_amount", "amount > 0")
    )

B.

@dlt.table
@dlt.expect("valid_customer", "customer_id IS NOT NULL")
@dlt.expect("valid_amount", "amount > 0")
def silver_orders():
    return dlt.read_stream("bronze_orders")

C.

@dlt.table
def silver_orders():
    return (
        dlt.read_stream("bronze_orders")
        .expect("valid_customer", "customer_id IS NOT NULL")
        .expect("valid_amount", "amount > 0")
    )

D.

@dlt.table
@dlt.expect_or_drop("valid_customer", "customer_id IS NOT NULL")
@dlt.expect_or_drop("valid_amount", "amount > 0")
def silver_orders():
    return dlt.read_stream("bronze_orders")

Questions 47

A data engineer deploys a multi-task Databricks job that orchestrates three notebooks. One task intermittently fails with Exit Code 1 but succeeds on retry. The engineer needs to collect detailed logs for the failing attempts, including stdout/stderr and cluster lifecycle context, and share them with the platform team.

Which steps should the data engineer follow using built-in tools?

Options:

A.

Use the notebook interactive debugger to re-run the entire multi-task job, and capture step-through traces for the failing task.

B.

Download worker logs directly from the Spark UI and ignore driver logs, as worker logs contain stdout/stderr for all tasks and cluster events.

C.

Export the notebook run results to HTML; this bundle includes complete stdout, stderr, and cluster event history across all tasks.

D.

From the job run details page, export the job's logs or configure log delivery; then retrieve the compute driver logs and event logs from the compute details page to correlate stdout/stderr with cluster events.

Questions 48

A company has a task management system that tracks the most recent status of tasks. The system takes task events as input and processes events in near real-time using Lakeflow Declarative Pipelines. A new task event is ingested into the system when a task is created or the task status is changed. Lakeflow Declarative Pipelines provides a streaming table (tasks_status) for BI users to query.

The table represents the latest status of all tasks and includes 5 columns:

    task_id (unique for each task)

    task_name

    task_owner

    task_status

    task_event_time

The table enables three properties: deletion vectors, row tracking, and change data feed (CDF).

A data engineer is asked to create a new Lakeflow Declarative Pipeline to enrich the tasks_status table in near real-time by adding one additional column representing task_owner’s department, which can be looked up from a static dimension table (employee).

How should this enrichment be implemented?

Options:

A.

Create a new Lakeflow Declarative Pipeline: use the readStream() function to read tasks_status table; enrich with the employee table; store the result in a new streaming table.

B.

Create a new Lakeflow Declarative Pipeline: use readStream() function with option readChangeFeed to read tasks_status table CDF; enrich with the employee table; create a new streaming table as the result table and use apply_changes() function to process the changes from the enriched CDF.

C.

Create a new Lakeflow Declarative Pipeline: use the read() function to read tasks_status table; enrich with employee table; store the result in a materialized view.

D.

Create a new Lakeflow Declarative Pipeline: use the readStream() function with the option skipChangeCommits to read the tasks_status table; enrich with the employee table; store the result in a new streaming table.

Questions 49

A data team is automating a daily multi-task ETL pipeline in Databricks. The pipeline includes a notebook for ingesting raw data, a Python wheel task for data transformation, and a SQL query to update aggregates. They want to trigger the pipeline programmatically and see previous runs in the GUI. They need to ensure tasks are retried on failure and stakeholders are notified by email if any task fails.

Which two approaches will meet these requirements? (Choose 2 answers)

Options:

A.

Use the REST API endpoint /jobs/runs/submit to trigger each task individually as separate job runs and implement retries using custom logic in the orchestrator.

B.

Create a multi-task job using the UI, Databricks Asset Bundles (DABs), or the Jobs REST API (/jobs/create) with notebook, Python wheel, and SQL tasks. Configure task-level retries and email notifications in the job definition.

C.

Trigger the job programmatically using the Databricks Jobs REST API (/jobs/run-now), the CLI (databricks jobs run-now), or one of the Databricks SDKs.

D.

Create a single orchestrator notebook that calls each step with dbutils.notebook.run(), defining a job for that notebook and configuring retries and notifications at the notebook level.

E.

Use Databricks Asset Bundles (DABs) to deploy the workflow, then trigger individual tasks directly by referencing each task’s notebook or script path in the workspace.

Questions 50

Which statement describes the correct use of pyspark.sql.functions.broadcast?

Options:

A.

It marks a column as having low enough cardinality to properly map distinct values to available partitions, allowing a broadcast join.

B.

It marks a column as small enough to store in memory on all executors, allowing a broadcast join.

C.

It caches a copy of the indicated table on attached storage volumes for all active clusters within a Databricks workspace.

D.

It marks a DataFrame as small enough to store in memory on all executors, allowing a broadcast join.

E.

It caches a copy of the indicated table on all nodes in the cluster for use in all future queries during the cluster lifetime.
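
As background for this question, `pyspark.sql.functions.broadcast` takes a DataFrame and hints that it should be shipped whole to every executor so the join can run without shuffling the larger side. The plain-Python sketch below models that mechanism only: the small side becomes an in-memory lookup that every partition of the large side probes locally. `broadcast_hash_join` and the sample rows are hypothetical, and no Spark API is used.

```python
from typing import Dict, Iterable, List

Row = dict


def broadcast_hash_join(
    large_partitions: Iterable[List[Row]], small_rows: Iterable[Row], key: str
) -> List[Row]:
    """Inner-join each partition of the large side against a shared lookup."""
    # Build the lookup once; in a cluster this is the copy sent to each executor.
    lookup: Dict[object, List[Row]] = {}
    for row in small_rows:
        lookup.setdefault(row[key], []).append(row)

    joined: List[Row] = []
    for partition in large_partitions:  # each partition is processed in place
        for row in partition:
            for match in lookup.get(row[key], []):
                joined.append({**match, **row})
    return joined


purchases = [
    [{"user_id": 1, "amount": 5}],
    [{"user_id": 2, "amount": 7}, {"user_id": 3, "amount": 1}],
]
users = [
    {"user_id": 1, "country": "DE"},
    {"user_id": 2, "country": "US"},
]
joined_rows = broadcast_hash_join(purchases, users, "user_id")
```

The large side is never regrouped by key, which is the reason the approach only pays off when the other side fits comfortably in memory.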

Questions 51

A Databricks SQL dashboard has been configured to monitor the total number of records present in a collection of Delta Lake tables using the following query pattern:

SELECT COUNT(*) FROM table

Which of the following describes how results are generated each time the dashboard is updated?

Options:

A.

The total count of rows is calculated by scanning all data files

B.

The total count of rows will be returned from cached results unless REFRESH is run

C.

The total count of records is calculated from the Delta transaction logs

D.

The total count of records is calculated from the parquet file metadata

E.

The total count of records is calculated from the Hive metastore

Questions 52

A junior data engineer has been asked to develop a streaming data pipeline with a grouped aggregation using DataFrame df. The pipeline needs to calculate the average humidity and average temperature for each non-overlapping five-minute interval. Events are recorded once per minute per device.

Streaming DataFrame df has the following schema:

"device_id INT, event_time TIMESTAMP, temp FLOAT, humidity FLOAT"

Code block:

Databricks-Certified-Professional-Data-Engineer Question 52

Choose the response that correctly fills in the blank within the code block to complete this task.

Options:

A.

to_interval("event_time", "5 minutes").alias("time")

B.

window("event_time", "5 minutes").alias("time")

C.

"event_time"

D.

window("event_time", "10 minutes").alias("time")

E.

lag("event_time", "10 minutes").alias("time")

Questions 53

A data engineer wants to automate job monitoring and recovery in Databricks using the Jobs API. They need to list all jobs, identify a failed job, and rerun it.

Which sequence of API actions should the data engineer perform?

Options:

A.

Use the jobs/list endpoint to list jobs, check job run statuses with jobs/runs/list, and rerun a failed job using jobs/run-now.

B.

Use the jobs/get endpoint to retrieve job details, then use jobs/update to rerun failed jobs.

C.

Use the jobs/list endpoint to list jobs, then use the jobs/create endpoint to create a new job, and run the new job using jobs/run-now.

D.

Use the jobs/cancel endpoint to remove failed jobs, then recreate them with jobs/create and run the new ones.
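
A list, inspect, rerun loop over the Jobs API can be sketched as follows. The endpoint paths and response fields (`jobs`, `runs`, `state.result_state`, `run_id`) follow the Jobs API 2.1 as I understand it and should be checked against the current reference; pagination and authentication are omitted. The HTTP layer is injected as two callables so the logic runs here against fake responses; `rerun_failed_jobs`, `fake_get`, and `fake_post` are hypothetical names.

```python
from typing import Callable, List

Getter = Callable[[str, dict], dict]
Poster = Callable[[str, dict], dict]


def rerun_failed_jobs(get: Getter, post: Poster) -> List[int]:
    """Rerun every job whose most recent run failed; return the new run IDs."""
    new_run_ids: List[int] = []
    for job in get("/api/2.1/jobs/list", {}).get("jobs", []):
        job_id = job["job_id"]
        # Assumption: runs are returned newest first, so limit=1 is the latest.
        runs = get("/api/2.1/jobs/runs/list", {"job_id": job_id, "limit": 1})
        latest = runs.get("runs", [])
        if latest and latest[0].get("state", {}).get("result_state") == "FAILED":
            response = post("/api/2.1/jobs/run-now", {"job_id": job_id})
            new_run_ids.append(response["run_id"])
    return new_run_ids


# Fake transport standing in for real HTTP calls (illustration only).
FAKE_JOBS = {"jobs": [{"job_id": 11}, {"job_id": 22}]}
FAKE_RUNS = {
    11: {"runs": [{"run_id": 101, "state": {"result_state": "SUCCESS"}}]},
    22: {"runs": [{"run_id": 202, "state": {"result_state": "FAILED"}}]},
}
submitted = []


def fake_get(path: str, params: dict) -> dict:
    if path == "/api/2.1/jobs/list":
        return FAKE_JOBS
    if path == "/api/2.1/jobs/runs/list":
        return FAKE_RUNS[params["job_id"]]
    raise KeyError(path)


def fake_post(path: str, body: dict) -> dict:
    submitted.append((path, body))
    return {"run_id": 900 + len(submitted)}


rerun_ids = rerun_failed_jobs(fake_get, fake_post)
```

Against a real workspace the two callables would wrap authenticated HTTP requests; the control flow stays the same.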

Questions 54

The data science team has created and logged a production model using MLflow. The model accepts a list of column names and returns a new column of type DOUBLE.

The following code correctly imports the production model, loads the customer table containing the customer_id key column into a DataFrame, and defines the feature columns needed for the model.

Databricks-Certified-Professional-Data-Engineer Question 54

Which code block will output a DataFrame with the schema "customer_id LONG, predictions DOUBLE"?

Options:

A.

model.predict(df, columns)

B.

df.map(lambda x: model(x[columns])).select("customer_id, predictions")

C.

df.select("customer_id", model(*columns).alias("predictions"))

D.

df.apply(model, columns).select("customer_id, prediction")

Questions 55

Which configuration parameter directly affects the size of a Spark partition upon ingestion of data into Spark?

Options:

A.

spark.sql.files.maxPartitionBytes

B.

spark.sql.autoBroadcastJoinThreshold

C.

spark.sql.files.openCostInBytes

D.

spark.sql.adaptive.coalescePartitions.minPartitionNum

E.

spark.sql.adaptive.advisoryPartitionSizeInBytes
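
As a back-of-the-envelope aid, the number of input partitions for a splittable file source scales roughly with the data size divided by `spark.sql.files.maxPartitionBytes`, whose default is 128 MB. The sketch below is a deliberate simplification: Spark's actual split sizing also considers `spark.sql.files.openCostInBytes`, the available parallelism, and whether files are splittable. `estimate_read_partitions` is a hypothetical helper.

```python
import math

MIB = 1024 * 1024


def estimate_read_partitions(total_bytes: int, max_partition_bytes: int = 128 * MIB) -> int:
    """Rough estimate of read partitions for a splittable source.

    Simplification: ignores open cost, parallelism, and file boundaries.
    """
    if max_partition_bytes <= 0:
        raise ValueError("max_partition_bytes must be positive")
    return max(1, math.ceil(total_bytes / max_partition_bytes))


one_gib = 1024 * MIB
default_estimate = estimate_read_partitions(one_gib)
smaller_limit_estimate = estimate_read_partitions(one_gib, 64 * MIB)
```

Halving the limit doubles the estimated partition count, so each task handles about half as much data.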

Questions 56

A data engineer is performing a join operation to combine values from a static userLookup table with a streaming DataFrame streamingDF.

Which code block attempts to perform an invalid stream-static join?

Options:

A.

userLookup.join(streamingDF, ["userid"], how="inner")

B.

streamingDF.join(userLookup, ["user_id"], how="outer")

C.

streamingDF.join(userLookup, ["user_id"], how="left")

D.

streamingDF.join(userLookup, ["userid"], how="inner")

E.

userLookup.join(streamingDF, ["user_id"], how="right")
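
The Structured Streaming programming guide lists which join types are supported when exactly one side is a stream. In short, the outer side of the join must be the streaming side, and a full outer join is not supported. The lookup below encodes that table as I understand it for inner, left outer, right outer, full outer, and left semi joins; `is_supported_join` is a hypothetical helper and should be checked against the guide for the Spark version in use.

```python
# Join type aliases normalized to one canonical name per type.
ALIASES = {
    "inner": "inner",
    "left": "left_outer", "leftouter": "left_outer", "left_outer": "left_outer",
    "right": "right_outer", "rightouter": "right_outer", "right_outer": "right_outer",
    "outer": "full_outer", "full": "full_outer",
    "fullouter": "full_outer", "full_outer": "full_outer",
    "semi": "left_semi", "leftsemi": "left_semi", "left_semi": "left_semi",
}

# Keys are (left_is_stream, right_is_stream).
SUPPORTED = {
    (True, False): {"inner", "left_outer", "left_semi"},
    (False, True): {"inner", "right_outer"},
}


def is_supported_join(left_is_stream: bool, right_is_stream: bool, how: str) -> bool:
    """Return True if a join with exactly one streaming side is supported."""
    if left_is_stream == right_is_stream:
        raise ValueError("exactly one side must be a stream")
    try:
        canonical = ALIASES[how.lower()]
    except KeyError:
        raise ValueError(f"join type not covered by this sketch: {how}")
    return canonical in SUPPORTED[(left_is_stream, right_is_stream)]


stream_left_outer = is_supported_join(True, False, "left")
stream_full_outer = is_supported_join(True, False, "outer")
static_right_outer = is_supported_join(False, True, "right")
static_left_outer = is_supported_join(False, True, "left")
```

The rule of thumb is that every row on the preserved side must come from the stream, since the static side cannot be re-scanned for unmatched rows as new data arrives.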

Questions 57

A table in the Lakehouse named customer_churn_params is used in churn prediction by the machine learning team. The table contains information about customers derived from a number of upstream sources. Currently, the data engineering team populates this table nightly by overwriting the table with the current valid values derived from upstream data sources.

The churn prediction model used by the ML team is fairly stable in production. The team is only interested in making predictions on records that have changed in the past 24 hours.

Which approach would simplify the identification of these changed records?

Options:

A.

Apply the churn model to all rows in the customer_churn_params table, but implement logic to perform an upsert into the predictions table that ignores rows where predictions have not changed.

B.

Convert the batch job to a Structured Streaming job using the complete output mode; configure a Structured Streaming job to read from the customer_churn_params table and incrementally predict against the churn model.

C.

Calculate the difference between the previous model predictions and the current customer_churn_params on a key identifying unique customers before making new predictions; only make predictions on those customers not in the previous predictions.

D.

Modify the overwrite logic to include a field populated by calling spark.sql.functions.current_timestamp() as data are being written; use this field to identify records written on a particular date.

E.

Replace the current overwrite logic with a merge statement to modify only those records that have changed; write logic to make predictions on the changed records identified by the change data feed.

Questions 58

A healthcare analytics team is implementing a dimensional model in Delta Lake for patient care analysis. They have a date dimension table and are evaluating design options to ensure it supports a wide range of time-based analyses.

Which design approach for the date dimension will support efficient time-based querying and aggregation?

Options:

A.

Store the date as a string in the format YYYY-MM-DD for readability.

B.

Create separate dimension tables for different calendar systems (fiscal, academic, etc.).

C.

Store only the date value and calculate all time attributes dynamically in queries.

D.

Pre-calculate attributes like fiscal_period, quarter, month_name, day_of_week, and holiday.

Questions 59

While reviewing a query's execution in the Databricks Query Profiler, a data engineer observes that the Top Operators panel shows a Sort operator with high Time Spent and Memory Peak metrics. The Spark UI also reports frequent data spilling.

How should the data engineer address this issue?

Options:

A.

Switch to a broadcast join to reduce memory usage.

B.

Repartition the DataFrame to a single partition before sorting.

C.

Convert the sort operation to a filter operation.

D.

Increase the number of shuffle partitions to better distribute data.

Questions 60

A junior data engineer has been asked to develop a streaming data pipeline with a grouped aggregation using DataFrame df. The pipeline needs to calculate the average humidity and average temperature for each non-overlapping five-minute interval. Events are recorded once per minute per device.

df has the following schema: device_id INT, event_time TIMESTAMP, temp FLOAT, humidity FLOAT

Code block:

df.withWatermark("event_time", "10 minutes")
    .groupBy(
        ________,
        "device_id"
    )
    .agg(
        avg("temp").alias("avg_temp"),
        avg("humidity").alias("avg_humidity")
    )
    .writeStream
    .format("delta")
    .saveAsTable("sensor_avg")

Which line of code correctly fills in the blank within the code block to complete this task?

Options:

A.

window("event_time", "5 minutes").alias("time")

B.

to_interval("event_time", "5 minutes").alias("time")

C.

"event_time"

D.

lag("event_time", "5 minutes").alias("time")

Exam Name: Databricks Certified Data Engineer Professional Exam
Last Update: Apr 19, 2026
Questions: 195
