
Databricks-Certified-Professional-Data-Engineer Databricks Certified Data Engineer Professional Exam Questions and Answers

Questions 4

A data engineer has created a new cluster using shared access mode with default configurations. The data engineer needs to allow the development team access to view the driver logs if needed.

What are the minimal cluster permissions that allow the development team to accomplish this?

Options:

A.

CAN ATTACH TO

B.

CAN MANAGE

C.

CAN VIEW

D.

CAN RESTART

Questions 5

The following code has been migrated to a Databricks notebook from a legacy workload:

[Question 5 code image not reproduced]

The code executes successfully and produces the logically correct results; however, it takes over 20 minutes to extract and load around 1 GB of data.

Which statement is a possible explanation for this behavior?

Options:

A.

%sh triggers a cluster restart to collect and install Git. Most of the latency is related to cluster startup time.

B.

Instead of cloning, the code should use %sh pip install so that the Python code can get executed in parallel across all nodes in a cluster.

C.

%sh does not distribute file moving operations; the final line of code should be updated to use %fs instead.

D.

Python will always execute slower than Scala on Databricks. The run.py script should be refactored to Scala.

E.

%sh executes shell code on the driver node. The code does not take advantage of the worker nodes or Databricks optimized Spark.

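For context on the scenario above, here is a minimal sketch of the distinction several options hinge on: %sh commands run only on the driver node, while a Spark read/write distributes the work across executors. The source and target paths below are hypothetical.

    # Hypothetical paths; the original migrated notebook is not reproduced here.
    df = (spark.read
          .format("csv")
          .option("header", "true")
          .load("/mnt/raw/legacy_export/"))          # assumed landing location

    (df.write
       .format("delta")
       .mode("overwrite")
       .save("/mnt/lakehouse/bronze/legacy_data"))   # assumed target location
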
Questions 6

An upstream source writes Parquet data as hourly batches to directories named with the current date. A nightly batch job runs the following code to ingest all data from the previous day as indicated by the date variable:

[Question 6 ingest code image not reproduced]

Assume that the fields customer_id and order_id serve as a composite key to uniquely identify each order.

If the upstream system is known to occasionally produce duplicate entries for a single order hours apart, which statement is correct?

Options:

A.

Each write to the orders table will only contain unique records, and only those records without duplicates in the target table will be written.

B.

Each write to the orders table will only contain unique records, but newly written records may have duplicates already present in the target table.

C.

Each write to the orders table will only contain unique records; if existing records with the same key are present in the target table, these records will be overwritten.

D.

Each write to the orders table will only contain unique records; if existing records with the same key are present in the target table, the operation will fail.

E.

Each write to the orders table will run deduplication over the union of new and existing records, ensuring no duplicate records are present.

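The ingest code for this question is shown only as an image above, so the following is a hedged sketch of one idempotent pattern for a composite key: deduplicate within the daily batch, then run an insert-only MERGE so records already in the target are not inserted again. The directory layout, the date value, and the temp view name are assumptions.

    # Illustrative value; the nightly job would receive this from its scheduler.
    date = "2024-01-01"

    # Deduplicate within the batch on the composite key.
    batch = (spark.read
             .format("parquet")
             .load(f"/mnt/raw/orders/{date}/*")        # assumed directory layout
             .dropDuplicates(["customer_id", "order_id"]))

    batch.createOrReplaceTempView("orders_batch")

    # Insert-only MERGE guards against duplicates already present in the target table.
    spark.sql("""
        MERGE INTO orders AS t
        USING orders_batch AS s
        ON t.customer_id = s.customer_id AND t.order_id = s.order_id
        WHEN NOT MATCHED THEN INSERT *
    """)
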
Questions 7

A Delta Lake table in the Lakehouse named customer_churn_params is used in churn prediction by the machine learning team. The table contains information about customers derived from a number of upstream sources. Currently, the data engineering team populates this table nightly by overwriting the table with the current valid values derived from upstream data sources.

Immediately after each update succeeds, the data engineering team would like to determine the difference between the new version and the previous version of the table.

Given the current implementation, which method can be used?

Options:

A.

Parse the Delta Lake transaction log to identify all newly written data files.

B.

Execute DESCRIBE HISTORY customer_churn_params to obtain the full operation metrics for the update, including a log of all records that have been added or modified.

C.

Execute a query to calculate the difference between the new version and the previous version using Delta Lake’s built-in versioning and time travel functionality.

D.

Parse the Spark event logs to identify those rows that were updated, inserted, or deleted.

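For reference, a small sketch of how the time-travel approach in option C can be expressed: compare the current version of the table against the immediately previous one. It assumes the table has at least two versions and that the nightly overwrite is the most recent operation.

    # Find the latest version number from the table history.
    latest = spark.sql("DESCRIBE HISTORY customer_churn_params LIMIT 1").first()["version"]

    current  = spark.table("customer_churn_params")
    previous = spark.sql(f"SELECT * FROM customer_churn_params VERSION AS OF {latest - 1}")

    added   = current.exceptAll(previous)   # rows introduced by the latest overwrite
    removed = previous.exceptAll(current)   # rows that disappeared in the latest overwrite
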
Questions 8

A distributed team of data analysts share computing resources on an interactive cluster with autoscaling configured. In order to better manage costs and query throughput, the workspace administrator is hoping to evaluate whether cluster upscaling is caused by many concurrent users or resource-intensive queries.

In which location can one review the timeline for cluster resizing events?

Options:

A.

Workspace audit logs

B.

Driver's log file

C.

Ganglia

D.

Cluster Event Log

E.

Executor's log file

Questions 9

The data engineering team has been tasked with configuring connections to an external database that does not have a supported native connector with Databricks. The external database already has data security configured by group membership. These groups map directly to user groups already created in Databricks that represent various teams within the company.

A new login credential has been created for each group in the external database. The Databricks Utilities Secrets module will be used to make these credentials available to Databricks users.

Assuming that all the credentials are configured correctly on the external database and group membership is properly configured on Databricks, which statement describes how teams can be granted the minimum necessary access to use these credentials?

Options:

A.

“Read” permissions should be set on a secret key mapped to those credentials that will be used by a given team.

B.

No additional configuration is necessary as long as all users are configured as administrators in the workspace where secrets have been added.

C.

“Read” permissions should be set on a secret scope containing only those credentials that will be used by a given team.

D.

“Manage” permission should be set on a secret scope containing only those credentials that will be used by a given team.

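As background for the secret-scope options above, a hedged sketch of how a team member would consume credentials once granted read access. The scope name, key names, JDBC URL, and table are all hypothetical.

    # Retrieve the team's credentials from a secret scope (names are assumptions).
    user = dbutils.secrets.get(scope="finance-team-creds", key="username")
    pwd  = dbutils.secrets.get(scope="finance-team-creds", key="password")

    # Use them to connect to the external database over JDBC (illustrative URL and table).
    df = (spark.read
          .format("jdbc")
          .option("url", "jdbc:postgresql://external-db.example.com:5432/sales")
          .option("dbtable", "public.orders")
          .option("user", user)
          .option("password", pwd)
          .load())
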
Questions 10

The Databricks CLI is used to trigger a run of an existing job by passing the job_id parameter. The response indicating that the job run request has been submitted successfully includes a field run_id.

Which statement describes what the number alongside this field represents?

Options:

A.

The job_id is returned in this field.

B.

The job_id and the number of times the job has been run are concatenated and returned.

C.

The number of times the job definition has been run in the workspace.

D.

The globally unique ID of the newly triggered run.

Questions 11

A production cluster has 3 executor nodes and uses the same virtual machine type for the driver and executor.

When evaluating the Ganglia Metrics for this cluster, which indicator would signal a bottleneck caused by code executing on the driver?

Options:

A.

The Five-Minute Load Average remains consistent/flat

B.

Bytes Received never exceeds 80 million bytes per second

C.

Total Disk Space remains constant

D.

Network I/O never spikes

E.

Overall cluster CPU utilization is around 25%

Questions 12

Which of the following is true of Delta Lake and the Lakehouse?

Options:

A.

Because Parquet compresses data row by row, strings will only be compressed when a character is repeated multiple times.

B.

Delta Lake automatically collects statistics on the first 32 columns of each table which are leveraged in data skipping based on query filters.

C.

Views in the Lakehouse maintain a valid cache of the most recent versions of source tables at all times.

D.

Primary and foreign key constraints can be leveraged to ensure duplicate values are never entered into a dimension table.

E.

Z-order can only be applied to numeric values stored in Delta Lake tables

Questions 13

A transactions table has been liquid clustered on the columns product_id, user_id, and event_date.

Which operation lacks support for clustering on write?

Options:

A.

spark.writeStream.format('delta').mode('append')

B.

CTAS and RTAS statements

C.

INSERT INTO operations

D.

spark.write.format('delta').mode('append')

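For context on liquid clustering, a hedged sketch of how the transactions table in this question might have been declared; the amount column and the commented-out append are illustrative additions, not taken from the question.

    # Liquid clustering is declared with CLUSTER BY rather than PARTITIONED BY.
    spark.sql("""
        CREATE TABLE IF NOT EXISTS transactions (
            product_id BIGINT,
            user_id    BIGINT,
            event_date DATE,
            amount     DOUBLE
        )
        CLUSTER BY (product_id, user_id, event_date)
    """)

    # A batch append against the table; whether a given write path clusters data
    # on write is exactly what the options above ask about.
    # df.write.format("delta").mode("append").saveAsTable("transactions")
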
Questions 14

A Delta Lake table representing metadata about content posts from users has the following schema:

    user_id LONG

    post_text STRING

    post_id STRING

    longitude FLOAT

    latitude FLOAT

    post_time TIMESTAMP

    date DATE

Based on the above schema, which column is a good candidate for partitioning the Delta Table?

Options:

A.

date

B.

user_id

C.

post_id

D.

post_time

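As a reference for the partitioning discussion above, a sketch of writing the posts table partitioned on a date-typed column. The source and target table names are hypothetical; the source is assumed to match the schema in the question.

    # Assumed bronze source with the schema shown above.
    posts_df = spark.read.table("bronze_user_posts")   # hypothetical source table

    (posts_df.write
        .format("delta")
        .partitionBy("date")            # low-cardinality column: one partition per day
        .mode("overwrite")
        .saveAsTable("user_posts"))     # hypothetical target table
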
Questions 15

A junior data engineer has manually configured a series of jobs using the Databricks Jobs UI. Upon reviewing their work, the engineer realizes that they are listed as the "Owner" for each job. They attempt to transfer "Owner" privileges to the "DevOps" group, but cannot successfully accomplish this task.

Which statement explains what is preventing this privilege transfer?

Options:

A.

Databricks jobs must have exactly one owner; "Owner" privileges cannot be assigned to a group.

B.

The creator of a Databricks job will always have "Owner" privileges; this configuration cannot be changed.

C.

Other than the default "admins" group, only individual users can be granted privileges on jobs.

D.

A user can only transfer job ownership to a group if they are also a member of that group.

E.

Only workspace administrators can grant "Owner" privileges to a group.

Questions 16

A platform engineer is creating catalogs and schemas for the development team to use.

The engineer has created an initial catalog, catalog_A, and initial schema, schema_A. The engineer has also granted USE CATALOG, USE

SCHEMA, and CREATE TABLE to the development team so that the engineer can begin populating the schema with new tables.

Despite being owner of the catalog and schema, the engineer noticed that they do not have access to the underlying tables in Schema_A.

What explains the engineer's lack of access to the underlying tables?

Options:

A.

The platform engineer needs to execute a REFRESH statement as the table permissions did not automatically update for owners.

B.

Users granted with USE CATALOG can modify the owner's permissions to downstream tables.

C.

The owner of the schema does not automatically have permission to tables within the schema, but can grant them to themselves at any point.

D.

Permissions explicitly given by the table creator are the only way the Platform Engineer could access the underlying tables in their

schema.

Questions 17

The Databricks CLI is used to trigger a run of an existing job by passing the job_id parameter. The response indicating the job run request was submitted successfully includes a field run_id. Which statement describes what the number alongside this field represents?

Options:

A.

The job_id and number of times the job has been run are concatenated and returned.

B.

The globally unique ID of the newly triggered run.

C.

The job_id is returned in this field.

D.

The number of times the job definition has been run in this workspace.

Questions 18

The data governance team has instituted a requirement that all tables containing Personally Identifiable Information (PII) must be clearly annotated. This includes adding column comments, table comments, and setting the custom table property "contains_pii" = true.

The following SQL DDL statement is executed to create a new table:

[Question 18 DDL image not reproduced]

Which command allows manual confirmation that these three requirements have been met?

Options:

A.

DESCRIBE EXTENDED dev.pii_test

B.

DESCRIBE DETAIL dev.pii_test

C.

SHOW TBLPROPERTIES dev.pii_test

D.

DESCRIBE HISTORY dev.pii_test

E.

SHOW TABLES dev

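The DDL for this question is not reproduced above, so here is a hedged sketch of a table that satisfies the three requirements, followed by one of the inspection commands listed in the options. The column list is hypothetical; the table name follows the option text.

    spark.sql("""
        CREATE TABLE IF NOT EXISTS dev.pii_test (
            id   BIGINT COMMENT 'surrogate key',
            name STRING COMMENT 'PII: customer name'
        )
        COMMENT 'Contains PII'
        TBLPROPERTIES ('contains_pii' = true)
    """)

    # Shows column comments, the table comment, and table properties in one output.
    display(spark.sql("DESCRIBE EXTENDED dev.pii_test"))
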
Questions 19

The data engineering team maintains a table of aggregate statistics through batch nightly updates. This includes total sales for the previous day alongside totals and averages for a variety of time periods including the 7 previous days, year-to-date, and quarter-to-date. This table is named store_sales_summary and the schema is as follows:

[Question 19 store_sales_summary schema image not reproduced]

The table daily_store_sales contains all the information needed to update store_sales_summary. The schema for this table is:

store_id INT, sales_date DATE, total_sales FLOAT

If daily_store_sales is implemented as a Type 1 table and the total_sales column might be adjusted after manual data auditing, which approach is the safest to generate accurate reports in the store_sales_summary table?

Options:

A.

Implement the appropriate aggregate logic as a batch read against the daily_store_sales table and overwrite the store_sales_summary table with each update.

B.

Implement the appropriate aggregate logic as a batch read against the daily_store_sales table and append new rows nightly to the store_sales_summary table.

C.

Implement the appropriate aggregate logic as a batch read against the daily_store_sales table and use upsert logic to update results in the store_sales_summary table.

D.

Implement the appropriate aggregate logic as a Structured Streaming read against the daily_store_sales table and use upsert logic to update results in the store_sales_summary table.

E.

Use Structured Streaming to subscribe to the change data feed for daily_store_sales and apply changes to the aggregates in the store_sales_summary table with each update.

Questions 20

A junior developer complains that the code in their notebook isn't producing the correct results in the development environment. A shared screenshot reveals that while they're using a notebook versioned with Databricks Repos, they're using a personal branch that contains old logic. The desired branch named dev-2.3.9 is not available from the branch selection dropdown.

Which approach will allow this developer to review the current logic for this notebook?

Options:

A.

Use Repos to make a pull request; use the Databricks REST API to update the current branch to dev-2.3.9

B.

Use Repos to pull changes from the remote Git repository and select the dev-2.3.9 branch.

C.

Use Repos to checkout the dev-2.3.9 branch and auto-resolve conflicts with the current branch

D.

Merge all changes back to the main branch in the remote Git repository and clone the repo again

E.

Use Repos to merge the current branch and the dev-2.3.9 branch, then make a pull request to sync with the remote repository

Questions 21

A Structured Streaming job deployed to production has been experiencing delays during peak hours of the day. At present, during normal execution, each microbatch of data is processed in less than 3 seconds. During peak hours of the day, execution time for each microbatch becomes very inconsistent, sometimes exceeding 30 seconds. The streaming write is currently configured with a trigger interval of 10 seconds.

Holding all other variables constant and assuming records need to be processed in less than 10 seconds, which adjustment will meet the requirement?

Options:

A.

Decrease the trigger interval to 5 seconds; triggering batches more frequently allows idle executors to begin processing the next batch while longer running tasks from previous batches finish.

B.

Increase the trigger interval to 30 seconds; setting the trigger interval near the maximum execution time observed for each batch is always best practice to ensure no records are dropped.

C.

The trigger interval cannot be modified without modifying the checkpoint directory; to maintain the current stream state, increase the number of shuffle partitions to maximize parallelism.

D.

Use the trigger once option and configure a Databricks job to execute the query every 10 seconds; this ensures all backlogged records are processed with each batch.

E.

Decrease the trigger interval to 5 seconds; triggering batches more frequently may prevent records from backing up and large batches from causing spill.

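For reference, a sketch of where the trigger interval lives in a streaming write; the source table, checkpoint path, and target table are assumptions.

    # Assumed streaming source defined earlier in the pipeline.
    events = spark.readStream.table("bronze_events")       # hypothetical source

    query = (events.writeStream
             .format("delta")
             .option("checkpointLocation", "/mnt/checkpoints/silver_events")  # assumed path
             .trigger(processingTime="10 seconds")          # the interval discussed above
             .toTable("silver_events"))                     # hypothetical target
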
Questions 22

The data engineering team is migrating an enterprise system with thousands of tables and views into the Lakehouse. They plan to implement the target architecture using a series of bronze, silver, and gold tables. Bronze tables will almost exclusively be used by production data engineering workloads, while silver tables will be used to support both data engineering and machine learning workloads. Gold tables will largely serve business intelligence and reporting purposes. While personal identifying information (PII) exists in all tiers of data, pseudonymization and anonymization rules are in place for all data at the silver and gold levels.

The organization is interested in reducing security concerns while maximizing the ability to collaborate across diverse teams.

Which statement exemplifies best practices for implementing this system?

Options:

A.

Isolating tables in separate databases based on data quality tiers allows for easy permissions management through database ACLs and allows physical separation of default storage locations for managed tables.

B.

Because databases on Databricks are merely a logical construct, choices around database organization do not impact security or discoverability in the Lakehouse.

C.

Storing all production tables in a single database provides a unified view of all data assets available throughout the Lakehouse, simplifying discoverability by granting all users view privileges on this database.

D.

Working in the default Databricks database provides the greatest security when working with managed tables, as these will be created in the DBFS root.

E.

Because all tables must live in the same storage containers used for the database they're created in, organizations should be prepared to create between dozens and thousands of databases depending on their data isolation requirements.

Questions 23

When evaluating the Ganglia Metrics for a given cluster with 3 executor nodes, which indicator would signal proper utilization of the VM's resources?

Options:

A.

The Five-Minute Load Average remains consistent/flat

B.

Bytes Received never exceeds 80 million bytes per second

C.

Network I/O never spikes

D.

Total Disk Space remains constant

E.

CPU Utilization is around 75%

Questions 24

A nightly job ingests data into a Delta Lake table using the following code:

[Question 24 ingest code image not reproduced]

The next step in the pipeline requires a function that returns an object that can be used to manipulate new records that have not yet been processed to the next table in the pipeline.

Which code snippet completes this function definition?

def new_records():

Options:

A.

return spark.readStream.table("bronze")

B.

return spark.readStream.load("bronze")

C.

[Option C code image not reproduced]

D.

return spark.read.option("readChangeFeed", "true").table("bronze")

E.

[Option E code image not reproduced]

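Two of the options above are shown only as images. As a reference for the spark.readStream.table pattern named in option A, here is a hedged sketch of how it is typically wired into the next stage; the checkpoint path, trigger choice, and target table are assumptions.

    def new_records():
        # A streaming read of the Delta table; combined with a checkpoint, each run
        # only processes records appended since the last processed microbatch.
        return spark.readStream.table("bronze")

    (new_records()
        .writeStream
        .option("checkpointLocation", "/mnt/checkpoints/bronze_to_silver")  # assumed path
        .trigger(availableNow=True)     # process the current backlog, then stop
        .toTable("silver"))             # hypothetical next table
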
Questions 25

Which of the following technologies can be used to identify key areas of text when parsing Spark Driver log4j output?

Options:

A.

Regex

B.

Julia

C.

pyspark.ml.feature

D.

Scala Datasets

E.

C++

Questions 26

Given the following error traceback:

AnalysisException: cannot resolve 'heartrateheartrateheartrate' given input columns:

[spark_catalog.database.table.device_id, spark_catalog.database.table.heartrate,

spark_catalog.database.table.mrn, spark_catalog.database.table.time]

The code snippet was:

display(df.select(3*"heartrate"))

Which statement describes the error being raised?

Options:

A.

There is a type error because a DataFrame object cannot be multiplied.

B.

There is a syntax error because the heartrate column is not correctly identified as a column.

C.

There is no column in the table named heartrateheartrateheartrate.

D.

There is a type error because a column object cannot be multiplied.

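A quick illustration of the Python behavior behind this traceback: multiplying a string by an integer repeats the string, so the select looks for a column literally named heartrateheartrateheartrate. The scaling alternative below is an assumption about the original intent.

    from pyspark.sql import functions as F

    # df is the DataFrame from the question (columns: device_id, heartrate, mrn, time).
    print(3 * "heartrate")   # -> 'heartrateheartrateheartrate' (string repetition)

    # If the goal was to triple the measured values, multiply the column object instead:
    display(df.select((3 * F.col("heartrate")).alias("heartrate_x3")))
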
Questions 27

A user new to Databricks is trying to troubleshoot long execution times for some pipeline logic they are working on. Presently, the user is executing code cell-by-cell, using display() calls to confirm code is producing the logically correct results as new transformations are added to an operation. To get a measure of average time to execute, the user is running each cell multiple times interactively.

Which of the following adjustments will get a more accurate measure of how code is likely to perform in production?

Options:

A.

Scala is the only language that can be accurately tested using interactive notebooks; because the best performance is achieved by using Scala code compiled to JARs, all PySpark and Spark SQL logic should be refactored.

B.

The only way to meaningfully troubleshoot code execution times in development notebooks is to use production-sized data and production-sized clusters with Run All execution.

C.

Production code development should only be done using an IDE; executing code against a local build of open source Spark and Delta Lake will provide the most accurate benchmarks for how code will perform in production.

D.

Calling display() forces a job to trigger, while many transformations will only add to the logical query plan; because of caching, repeated execution of the same logic does not provide meaningful results.

E.

The Jobs UI should be leveraged to occasionally run the notebook as a job and track execution time during incremental code development because Photon can only be enabled on clusters launched for scheduled jobs.

Questions 28

A junior member of the data engineering team is exploring the language interoperability of Databricks notebooks. The intended outcome of the below code is to register a view of all sales that occurred in countries on the continent of Africa that appear in the geo_lookup table.

Before executing the code, running SHOW TABLES on the current database indicates the database contains only two tables: geo_lookup and sales.

[Question 28 command cells image not reproduced]

Which statement correctly describes the outcome of executing these command cells in order in an interactive notebook?

Options:

A.

Both commands will succeed. Executing show tables will show that countries_af and sales_af have been registered as views.

B.

Cmd 1 will succeed. Cmd 2 will search all accessible databases for a table or view named countries_af; if this entity exists, Cmd 2 will succeed.

C.

Cmd 1 will succeed and Cmd 2 will fail; countries_af will be a Python variable representing a PySpark DataFrame.

D.

Both commands will fail. No new variables, tables, or views will be created.

E.

Cmd 1 will succeed and Cmd 2 will fail; countries_af will be a Python variable containing a list of strings.

Questions 29

Which Python variable contains a list of directories to be searched when trying to locate required modules?

Options:

A.

importlib.resource_path

B.

sys.path

C.

os.path

D.

pypi.path

E.

pylib.source

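A short illustration of Python's module search list; the appended directory is hypothetical.

    import sys

    # sys.path holds the directories Python searches when resolving imports.
    print(sys.path)

    # Adding a directory (hypothetical Repos path) makes modules there importable.
    sys.path.append("/Workspace/Repos/shared/utils")
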
Questions 30

A data engineer wants to join a stream of advertisement impressions (when an ad was shown) with another stream of user clicks on advertisements to correlate when an impression led to monetizable clicks.

[Question 30 streaming join code image not reproduced]

Which solution would improve the performance?

A)

[Question 30 Option A code image not reproduced]

B)

[Question 30 Option B code image not reproduced]

C)

[Question 30 Option C code image not reproduced]

D)

[Question 30 Option D code image not reproduced]

Options:

A.

Option A

B.

Option B

C.

Option C

D.

Option D

Questions 31

The data governance team is reviewing code used for deleting records for compliance with GDPR. They note the following logic is used to delete records from the Delta Lake table named users.

[Question 31 delete logic image not reproduced]

Assuming that user_id is a unique identifying key and that delete_requests contains all users that have requested deletion, which statement describes whether successfully executing the above logic guarantees that the records to be deleted are no longer accessible and why?

Options:

A.

Yes; Delta Lake ACID guarantees provide assurance that the delete command succeeded fully and permanently purged these records.

B.

No; the Delta cache may return records from previous versions of the table until the cluster is restarted.

C.

Yes; the Delta cache immediately updates to reflect the latest data files recorded to disk.

D.

No; the Delta Lake delete command only provides ACID guarantees when combined with the merge into command.

E.

No; files containing deleted records may still be accessible with time travel until a vacuum command is used to remove invalidated data files.

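For context on the time-travel concern raised above, a hedged sketch of one way to express the delete, followed by a vacuum. The MERGE-based delete is an illustrative pattern, not the exact logic from the question image, and the retention shown is the default seven days.

    # Remove requested users from the current version of the table.
    spark.sql("""
        MERGE INTO users AS u
        USING delete_requests AS d
        ON u.user_id = d.user_id
        WHEN MATCHED THEN DELETE
    """)

    # Older data files remain reachable via time travel until they are vacuumed.
    spark.sql("VACUUM users RETAIN 168 HOURS")
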
Questions 32

The business intelligence team has a dashboard configured to track various summary metrics for retail stores. This includes total sales for the previous day alongside totals and averages for a variety of time periods. The fields required to populate this dashboard have the following schema:

[Question 32 dashboard schema image not reproduced]

For demand forecasting, the Lakehouse contains a validated table of all itemized sales updated incrementally in near real time. This table, named products_per_order, includes the following fields:

[Question 32 products_per_order schema image not reproduced]

Because reporting on long-term sales trends is less volatile, analysts using the new dashboard only require data to be refreshed once daily. Because the dashboard will be queried interactively by many users throughout a normal business day, it should return results quickly and reduce total compute associated with each materialization.

Which solution meets the expectations of the end users while controlling and limiting possible costs?

Options:

A.

Use the Delta Cache to persist the products_per_order table in memory to quickly update the dashboard with each query.

B.

Populate the dashboard by configuring a nightly batch job to save the required values, allowing the dashboard to be quickly updated with each query.

C.

Use Structured Streaming to configure a live dashboard against the products_per_order table within a Databricks notebook.

D.

Define a view against the products_per_order table and define the dashboard against this view.

Questions 33

A junior data engineer is working to implement logic for a Lakehouse table named silver_device_recordings. The source data contains 100 unique fields in a highly nested JSON structure.

The silver_device_recordings table will be used downstream to power several production monitoring dashboards and a production model. At present, 45 of the 100 fields are being used in at least one of these applications.

The data engineer is trying to determine the best approach for dealing with schema declaration given the highly-nested structure of the data and the numerous fields.

Which of the following accurately presents information about Delta Lake and Databricks that may impact their decision-making process?

Options:

A.

The Tungsten encoding used by Databricks is optimized for storing string data; newly-added native support for querying JSON strings means that string types are always most efficient.

B.

Because Delta Lake uses Parquet for data storage, data types can be easily evolved by just modifying file footer information in place.

C.

Human labor in writing code is the largest cost associated with data engineering workloads; as such, automating table declaration logic should be a priority in all migration workloads.

D.

Because Databricks will infer schema using types that allow all observed data to be processed, setting types manually provides greater assurance of data quality enforcement.

E.

Schema inference and evolution on Databricks ensure that inferred types will always accurately match the data types used by downstream systems.

Questions 34

Two of the most common data locations on Databricks are the DBFS root storage and external object storage mounted with dbutils.fs.mount().

Which of the following statements is correct?

Options:

A.

DBFS is a file system protocol that allows users to interact with files stored in object storage using syntax and guarantees similar to Unix file systems.

B.

By default, both the DBFS root and mounted data sources are only accessible to workspace administrators.

C.

The DBFS root is the most secure location to store data, because mounted storage volumes must have full public read and write permissions.

D.

Neither the DBFS root nor mounted storage can be accessed when using %sh in a Databricks notebook.

E.

The DBFS root stores files in ephemeral block volumes attached to the driver, while mounted directories will always persist saved data to external storage between sessions.

Questions 35

A DLT pipeline includes the following streaming tables:

raw_iot ingests raw device measurement data from a heart rate tracking device.

bpm_stats incrementally computes user statistics based on BPM measurements from raw_iot.

How can the data engineer configure this pipeline to be able to retain manually deleted or updated records in the raw_iot table while recomputing the downstream table when a pipeline update is run?

Options:

A.

Set the skipChangeCommits flag to true on bpm_stats

B.

Set the skipChangeCommits flag to true on raw_iot

C.

Set the pipelines.reset.allowed property to false on bpm_stats

D.

Set the pipelines.reset.allowed property to false on raw_iot

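As background for the properties discussed above, a hedged sketch of how a table-level property is attached in a Delta Live Tables pipeline. The source path, file format, downstream logic, and which table carries the flag are illustrative assumptions, shown only for syntax.

    import dlt

    @dlt.table(
        name="raw_iot",
        table_properties={"pipelines.reset.allowed": "false"}   # placement shown for syntax only
    )
    def raw_iot():
        return (spark.readStream
                .format("cloudFiles")
                .option("cloudFiles.format", "json")    # assumed file format
                .load("/mnt/raw/heart_rate/"))          # assumed landing path

    @dlt.table(name="bpm_stats")
    def bpm_stats():
        # Downstream aggregation elided; reads incrementally from raw_iot.
        return dlt.read_stream("raw_iot")
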
Questions 36

Which distribution does Databricks support for installing custom Python code packages?

Options:

A.

sbt

B.

CRAN

C.

CRAM

D.

npm

E.

Wheels

F.

jars

Questions 37

A junior data engineer seeks to leverage Delta Lake's Change Data Feed functionality to create a Type 1 table representing all of the values that have ever been valid for all rows in a bronze table created with the property delta.enableChangeDataFeed = true. They plan to execute the following code as a daily job:

[Question 37 daily job code image not reproduced]

Which statement describes the execution and results of running the above query multiple times?

Options:

A.

Each time the job is executed, newly updated records will be merged into the target table, overwriting previous values with the same primary keys.

B.

Each time the job is executed, the entire available history of inserted or updated records will be appended to the target table, resulting in many duplicate entries.

C.

Each time the job is executed, the target table will be overwritten using the entire history of inserted or updated records, giving the desired result.

D.

Each time the job is executed, the differences between the original and current versions are calculated; this may result in duplicate entries for some records.

E.

Each time the job is executed, only those records that have been inserted or updated since the last execution will be appended to the target table giving the desired result.

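Since the daily job itself is shown only as an image, here is a hedged sketch of a batch Change Data Feed read against the bronze table; startingVersion is illustrative, and a real job would advance this value between runs.

    changes = (spark.read
               .option("readChangeFeed", "true")
               .option("startingVersion", 0)     # illustrative; a real job tracks this value
               .table("bronze"))

    # CDF adds metadata columns describing each change event.
    display(changes.select("_change_type", "_commit_version", "_commit_timestamp"))
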
Questions 38

Which statement describes Delta Lake Auto Compaction?

Options:

A.

An asynchronous job runs after the write completes to detect if files could be further compacted; if yes, an optimize job is executed toward a default of 1 GB.

B.

Before a Jobs cluster terminates, optimize is executed on all tables modified during the most recent job.

C.

Optimized writes use logical partitions instead of directory partitions; because partition boundaries are only represented in metadata, fewer small files are written.

D.

Data is queued in a messaging bus instead of committing data directly to memory; all data is committed from the messaging bus in one batch once the job is complete.

E.

An asynchronous job runs after the write completes to detect if files could be further compacted; if yes, an optimize job is executed toward a default of 128 MB.

Questions 39

A junior data engineer has been asked to develop a streaming data pipeline with a grouped aggregation using DataFrame df. The pipeline needs to calculate the average humidity and average temperature for each non-overlapping five-minute interval. Events are recorded once per minute per device.

Streaming DataFrame df has the following schema:

"device_id INT, event_time TIMESTAMP, temp FLOAT, humidity FLOAT"

Code block:

[Question 39 code block image not reproduced]

Choose the response that correctly fills in the blank within the code block to complete this task.

Options:

A.

to_interval("event_time", "5 minutes").alias("time")

B.

window("event_time", "5 minutes").alias("time")

C.

"event_time"

D.

window("event_time", "10 minutes").alias("time")

E.

lag("event_time", "10 minutes").alias("time")

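For reference, a sketch of the five-minute grouped aggregation; the watermark duration and the additional device_id grouping key are assumptions about the hidden code block.

    from pyspark.sql import functions as F

    # df is the streaming DataFrame described in the question.
    agg_df = (df
              .withWatermark("event_time", "10 minutes")            # assumed watermark
              .groupBy(F.window("event_time", "5 minutes").alias("time"),
                       "device_id")
              .agg(F.avg("temp").alias("avg_temp"),
                   F.avg("humidity").alias("avg_humidity")))
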
Questions 40

A data architect has designed a system in which two Structured Streaming jobs will concurrently write to a single bronze Delta table. Each job is subscribing to a different topic from an Apache Kafka source, but they will write data with the same schema. To keep the directory structure simple, a data engineer has decided to nest a checkpoint directory to be shared by both streams.

The proposed directory structure is displayed below:

[Question 40 checkpoint directory structure image not reproduced]

Which statement describes whether this checkpoint directory structure is valid for the given scenario and why?

Options:

A.

No; Delta Lake manages streaming checkpoints in the transaction log.

B.

Yes; both of the streams can share a single checkpoint directory.

C.

No; only one stream can write to a Delta Lake table.

D.

Yes; Delta Lake supports infinite concurrent writers.

E.

No; each of the streams needs to have its own checkpoint directory.

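As a reference for the checkpoint discussion above, a hedged sketch of two streams writing to the same bronze Delta table, each with its own checkpoint location. The broker address, topic names, selected columns, and paths are hypothetical.

    def write_topic(topic, checkpoint):
        return (spark.readStream
                .format("kafka")
                .option("kafka.bootstrap.servers", "broker-1:9092")   # assumed broker
                .option("subscribe", topic)
                .load()
                .select("key", "value", "timestamp")                  # same schema for both streams
                .writeStream
                .option("checkpointLocation", checkpoint)             # one directory per stream
                .toTable("bronze"))

    q1 = write_topic("topic_a", "/mnt/checkpoints/bronze/topic_a")
    q2 = write_topic("topic_b", "/mnt/checkpoints/bronze/topic_b")
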
Exam Name: Databricks Certified Data Engineer Professional Exam
Last Update: Dec 5, 2025
Questions: 195
