Databricks-Certified-Associate-Developer-for-Apache-Spark-3.5 Databricks Certified Associate Developer for Apache Spark 3.5 – Python Questions and Answers

Questions 4

A Spark engineer must select an appropriate deployment mode for the Spark jobs.

What is the benefit of using cluster mode in Apache Spark™?

Options:

In cluster mode, resources are allocated from a resource manager on the cluster, enabling better performance and scalability for large jobs

In cluster mode, the driver is responsible for executing all tasks locally without distributing them across the worker nodes.

In cluster mode, the driver runs on the client machine, which can limit the application's ability to handle large datasets efficiently.

In cluster mode, the driver program runs on one of the worker nodes, allowing the application to fully utilize the distributed resources of the cluster.

Buy Now

Questions 5

Given the code fragment:

Databricks-Certified-Associate-Developer-for-Apache-Spark-3.5 Question 5

import pyspark.pandas as ps

psdf = ps.DataFrame({'col1': [1, 2], 'col2': [3, 4]})

Which method is used to convert a Pandas API on Spark DataFrame (pyspark.pandas.DataFrame) into a standard PySpark DataFrame (pyspark.sql.DataFrame)?

Options:

psdf.to_spark()

psdf.to_pyspark()

psdf.to_pandas()

psdf.to_dataframe()

Buy Now

Questions 6

An MLOps engineer is building a Pandas UDF that applies a language model that translates English strings into Spanish. The initial code is loading the model on every call to the UDF, which is hurting the performance of the data pipeline.

The initial code is:

Databricks-Certified-Associate-Developer-for-Apache-Spark-3.5 Question 6

def in_spanish_inner(df: pd.Series) -> pd.Series:

model = get_translation_model(target_lang='es')

return df.apply(model)

in_spanish = sf.pandas_udf(in_spanish_inner, StringType())

How can the MLOps engineer change this code to reduce how many times the language model is loaded?

Options:

Convert the Pandas UDF to a PySpark UDF

Convert the Pandas UDF from a Series → Series UDF to a Series → Scalar UDF

Run the in_spanish_inner() function in a mapInPandas() function call

Convert the Pandas UDF from a Series → Series UDF to an Iterator[Series] → Iterator[Series] UDF

Buy Now

Questions 7

What is the difference between df.cache() and df.persist() in Spark DataFrame?

Options:

Both cache() and persist() can be used to set the default storage level (MEMORY_AND_DISK_SER)

Both functions perform the same operation. The persist() function provides improved performance as its default storage level is DISK_ONLY.

persist() - Persists the DataFrame with the default storage level (MEMORY_AND_DISK_SER) and cache() - Can be used to set different storage levels to persist the contents of the DataFrame.

cache() - Persists the DataFrame with the default storage level (MEMORY_AND_DISK) and persist() - Can be used to set different storage levels to persist the contents of the DataFrame

Buy Now

Questions 8

37 of 55.

A data scientist is working with a Spark DataFrame called customerDF that contains customer information.

The DataFrame has a column named email with customer email addresses.

The data scientist needs to split this column into username and domain parts.

Which code snippet splits the email column into username and domain columns?

Options:

customerDF = customerDF \

.withColumn("username", split(col("email"), "@").getItem(0)) \

.withColumn("domain", split(col("email"), "@").getItem(1))

customerDF = customerDF.withColumn("username", regexp_replace(col("email"), "@", ""))

customerDF = customerDF.select("email").alias("username", "domain")

customerDF = customerDF.withColumn("domain", col("email").split("@")[1])

Buy Now

Questions 9

How can a Spark developer ensure optimal resource utilization when running Spark jobs in Local Mode for testing?

Options:

Configure the application to run in cluster mode instead of local mode.

Increase the number of local threads based on the number of CPU cores.

Use the spark.dynamicAllocation.enabled property to scale resources dynamically.

Set the spark.executor.memory property to a large value.

Buy Now

Questions 10

Given this view definition:

df.createOrReplaceTempView("users_vw")

Which approach can be used to query the users_vw view after the session is terminated?

Options:

Query the users_vw using Spark

Persist the users_vw data as a table

Recreate the users_vw and query the data using Spark

Save the users_vw definition and query using Spark

Buy Now

Questions 11

A data engineer needs to persist a file-based data source to a specific location. However, by default, Spark writes to the warehouse directory (e.g., /user/hive/warehouse). To override this, the engineer must explicitly define the file path.

Which line of code ensures the data is saved to a specific location?

Options:

users.write(path="/some/path").saveAsTable("default_table")

users.write.saveAsTable("default_table").option("path", "/some/path")

users.write.option("path", "/some/path").saveAsTable("default_table")

users.write.saveAsTable("default_table", path="/some/path")

Buy Now

Questions 12

33 of 55.

The data engineering team created a pipeline that extracts data from a transaction system.

The transaction system stores timestamps in UTC, and the data engineers must now transform the transaction_datetime field to the “America/New_York” timezone for reporting.

Which code should be used to convert the timestamp to the target timezone?

Options:

raw.withColumn("transaction_datetime", from_utc_timestamp(col("transaction_datetime"), "America/New_York"))

raw.withColumn("transaction_datetime", to_utc_timestamp(col("transaction_datetime"), "America/New_York"))

raw.withColumn("transaction_datetime", date_format(col("transaction_datetime"), "America/New_York"))

raw.withColumn("transaction_datetime", convert_timezone(col("transaction_datetime"), "America/New_York"))

Buy Now

Questions 13

A data analyst builds a Spark application to analyze finance data and performs the following operations: filter, select, groupBy, and coalesce.

Which operation results in a shuffle?

Options:

groupBy

filter

select

coalesce

Buy Now

Questions 14

A Data Analyst needs to retrieve employees with 5 or more years of tenure.

Which code snippet filters and shows the list?

Options:

employees_df.filter(employees_df.tenure >= 5).show()

employees_df.where(employees_df.tenure >= 5)

filter(employees_df.tenure >= 5)

employees_df.filter(employees_df.tenure >= 5).collect()

Buy Now

Questions 15

38 of 55.

A data engineer is working with Spark SQL and has a large JSON file stored at /data/input.json.

The file contains records with varying schemas, and the engineer wants to create an external table in Spark SQL that:

Reads directly from /data/input.json.

Infers the schema automatically.

Merges differing schemas.

Which code snippet should the engineer use?

Options:

CREATE EXTERNAL TABLE users

USING json

OPTIONS (path '/data/input.json', mergeSchema 'true');

CREATE TABLE users

USING json

OPTIONS (path '/data/input.json');

CREATE EXTERNAL TABLE users

USING json

OPTIONS (path '/data/input.json', inferSchema 'true');

CREATE EXTERNAL TABLE users

USING json

OPTIONS (path '/data/input.json', mergeAll 'true');

Buy Now

Questions 16

8 of 55.

A data scientist at a large e-commerce company needs to process and analyze 2 TB of daily customer transaction data. The company wants to implement real-time fraud detection and personalized product recommendations.

Currently, the company uses a traditional relational database system, which struggles with the increasing data volume and velocity.

Which feature of Apache Spark effectively addresses this challenge?

Options:

Ability to process small datasets efficiently

In-memory computation and parallel processing capabilities

Support for SQL queries on structured data

Built-in machine learning libraries

Buy Now

Questions 17

15 of 55.

A data engineer is working on a Streaming DataFrame (streaming_df) with the following streaming data:

name

count

timestamp

Delhi

2024-09-19T10:11

Delhi

2024-09-19T10:12

London

2024-09-19T10:15

Paris

2024-09-19T10:18

Paris

2024-09-19T10:20

Washington

2024-09-19T10:22

Which operation is supported with streaming_df?

Options:

streaming_df.count()

streaming_df.filter("count < 30")

streaming_df.select(countDistinct("name"))

streaming_df.show()

Buy Now

Questions 18

A data engineer is working with a large JSON dataset containing order information. The dataset is stored in a distributed file system and needs to be loaded into a Spark DataFrame for analysis. The data engineer wants to ensure that the schema is correctly defined and that the data is read efficiently.

Which approach should the data scientist use to efficiently load the JSON data into a Spark DataFrame with a predefined schema?

Options:

Use spark.read.json() to load the data, then use DataFrame.printSchema() to view the inferred schema, and finally use DataFrame.cast() to modify column types.

Use spark.read.json() with the inferSchema option set to true

Use spark.read.format("json").load() and then use DataFrame.withColumn() to cast each column to the desired data type.

Define a StructType schema and use spark.read.schema(predefinedSchema).json() to load the data.

Buy Now

Questions 19

36 of 55.

What is the main advantage of partitioning the data when persisting tables?

Options:

It compresses the data to save disk space.

It automatically cleans up unused partitions to optimize storage.

It ensures that data is loaded into memory all at once for faster query execution.

It optimizes by reading only the relevant subset of data from fewer partitions.

Buy Now

Questions 20

A data scientist is working on a large dataset in Apache Spark using PySpark. The data scientist has a DataFrame df with columns user_id, product_id, and purchase_amount and needs to perform some operations on this data efficiently.

Which sequence of operations results in transformations that require a shuffle followed by transformations that do not?

Options:

df.filter(df.purchase_amount > 100).groupBy("user_id").sum("purchase_amount")

df.withColumn("discount", df.purchase_amount * 0.1).select("discount")

df.withColumn("purchase_date", current_date()).where("total_purchase > 50")

df.groupBy("user_id").agg(sum("purchase_amount").alias("total_purchase")).repartition(10)

Buy Now

Questions 21

A data engineer wants to create a Streaming DataFrame that reads from a Kafka topic called feed.

Databricks-Certified-Associate-Developer-for-Apache-Spark-3.5 Question 21

Which code fragment should be inserted in line 5 to meet the requirement?

Code context:

spark \

.readStream \

.format("kafka") \

.option("kafka.bootstrap.servers", "host1:port1,host2:port2") \

.[LINE 5] \

.load()

Options:

.option("subscribe", "feed")

.option("subscribe.topic", "feed")

.option("kafka.topic", "feed")

.option("topic", "feed")

Buy Now

Questions 22

2 of 55. Which command overwrites an existing JSON file when writing a DataFrame?

Options:

df.write.json("path/to/file")

df.write.mode("append").json("path/to/file")

df.write.option("overwrite").json("path/to/file")

df.write.mode("overwrite").json("path/to/file")

Buy Now

Questions 23

A data engineer is working on the DataFrame:

Databricks-Certified-Associate-Developer-for-Apache-Spark-3.5 Question 23

(Referring to the table image: it has columns Id, Name, count, and timestamp.)

Which code fragment should the engineer use to extract the unique values in the Name column into an alphabetically ordered list?

Options:

df.select("Name").orderBy(df["Name"].asc())

df.select("Name").distinct().orderBy(df["Name"])

df.select("Name").distinct()

df.select("Name").distinct().orderBy(df["Name"].desc())

Buy Now

Questions 24

A Spark application is experiencing performance issues in client mode because the driver is resource-constrained.

How should this issue be resolved?

Options:

Add more executor instances to the cluster

Increase the driver memory on the client machine

Switch the deployment mode to cluster mode

Switch the deployment mode to local mode

Buy Now

Questions 25

A Spark application developer wants to identify which operations cause shuffling, leading to a new stage in the Spark execution plan.

Which operation results in a shuffle and a new stage?

Options:

DataFrame.groupBy().agg()

DataFrame.filter()

DataFrame.withColumn()

DataFrame.select()

Buy Now

Questions 26

Given the code:

Databricks-Certified-Associate-Developer-for-Apache-Spark-3.5 Question 26

df = spark.read.csv("large_dataset.csv")

filtered_df = df.filter(col("error_column").contains("error"))

mapped_df = filtered_df.select(split(col("timestamp"), " ").getItem(0).alias("date"), lit(1).alias("count"))

reduced_df = mapped_df.groupBy("date").sum("count")

reduced_df.count()

reduced_df.show()

At which point will Spark actually begin processing the data?

Options:

When the filter transformation is applied

When the count action is applied

When the groupBy transformation is applied

When the show action is applied

Buy Now

Questions 27

A Spark developer is building an app to monitor task performance. They need to track the maximum task processing time per worker node and consolidate it on the driver for analysis.

Which technique should be used?

Options:

Use an RDD action like reduce() to compute the maximum time

Use an accumulator to record the maximum time on the driver

Broadcast a variable to share the maximum time among workers

Configure the Spark UI to automatically collect maximum times

Buy Now

Questions 28

A data engineer needs to write a Streaming DataFrame as Parquet files.

Given the code:

Databricks-Certified-Associate-Developer-for-Apache-Spark-3.5 Question 28

Which code fragment should be inserted to meet the requirement?

Databricks-Certified-Associate-Developer-for-Apache-Spark-3.5 Question 28

Which code fragment should be inserted to meet the requirement?

Options:

.format("parquet")

.option("location", "path/to/destination/dir")

CopyEdit

.option("format", "parquet")

.option("destination", "path/to/destination/dir")

.option("format", "parquet")

.option("location", "path/to/destination/dir")

.format("parquet")

.option("path", "path/to/destination/dir")

Buy Now

Questions 29

What is the benefit of using Pandas on Spark for data transformations?

Options:

It is available only with Python, thereby reducing the learning curve.

It computes results immediately using eager execution, making it simple to use.

It runs on a single node only, utilizing the memory with memory-bound DataFrames and hence cost-efficient.

It executes queries faster using all the available cores in the cluster as well as provides Pandas’s rich set of features.

Buy Now

Questions 30

34 of 55.

A data engineer is investigating a Spark cluster that is experiencing underutilization during scheduled batch jobs.

After checking the Spark logs, they noticed that tasks are often getting killed due to timeout errors, and there are several warnings about insufficient resources in the logs.

Which action should the engineer take to resolve the underutilization issue?

Options:

Set the spark.network.timeout property to allow tasks more time to complete without being killed.

Increase the executor memory allocation in the Spark configuration.

Reduce the size of the data partitions to improve task scheduling.

Increase the number of executor instances to handle more concurrent tasks.

Buy Now

Questions 31

47 of 55.

A data engineer has written the following code to join two DataFrames df1 and df2:

df1 = spark.read.csv("sales_data.csv")

df2 = spark.read.csv("product_data.csv")

df_joined = df1.join(df2, df1.product_id == df2.product_id)

The DataFrame df1 contains ~10 GB of sales data, and df2 contains ~8 MB of product data.

Which join strategy will Spark use?

Options:

Shuffle join, as the size difference between df1 and df2 is too large for a broadcast join to work efficiently.

Shuffle join, because AQE is not enabled, and Spark uses a static query plan.

Shuffle join because no broadcast hints were provided.

Broadcast join, as df2 is smaller than the default broadcast threshold.

Buy Now

Questions 32

A Spark DataFrame df is cached using the MEMORY_AND_DISK storage level, but the DataFrame is too large to fit entirely in memory.

What is the likely behavior when Spark runs out of memory to store the DataFrame?

Options:

Spark duplicates the DataFrame in both memory and disk. If it doesn't fit in memory, the DataFrame is stored and retrieved from the disk entirely.

Spark splits the DataFrame evenly between memory and disk, ensuring balanced storage utilization.

Spark will store as much data as possible in memory and spill the rest to disk when memory is full, continuing processing with performance overhead.

Spark stores the frequently accessed rows in memory and less frequently accessed rows on disk, utilizing both resources to offer balanced performance.

Buy Now

Questions 33

25 of 55.

A Data Analyst is working on employees_df and needs to add a new column where a 10% tax is calculated on the salary.

Additionally, the DataFrame contains the column age, which is not needed.

Which code fragment adds the tax column and removes the age column?

Options:

employees_df = employees_df.withColumn("tax", col("salary") * 0.1).drop("age")

employees_df = employees_df.withColumn("tax", lit(0.1)).drop("age")

employees_df = employees_df.dropField("age").withColumn("tax", col("salary") * 0.1)

employees_df = employees_df.withColumn("tax", col("salary") + 0.1).drop("age")

Buy Now

Questions 34

An engineer notices a significant increase in the job execution time during the execution of a Spark job. After some investigation, the engineer decides to check the logs produced by the Executors.

How should the engineer retrieve the Executor logs to diagnose performance issues in the Spark application?

Options:

Locate the executor logs on the Spark master node, typically under the /tmp directory.

Use the command spark-submit with the —verbose flag to print the logs to the console.

Use the Spark UI to select the stage and view the executor logs directly from the stages tab.

Fetch the logs by running a Spark job with the spark-sql CLI tool.

Buy Now

Questions 35

32 of 55.

A developer is creating a Spark application that performs multiple DataFrame transformations and actions. The developer wants to maintain optimal performance by properly managing the SparkSession.

How should the developer handle the SparkSession throughout the application?

Options:

Use a single SparkSession instance for the entire application.

Avoid using a SparkSession and rely on SparkContext only.

Create a new SparkSession instance before each transformation.

Stop and restart the SparkSession after each action.

Buy Now

Questions 36

42 of 55.

A developer needs to write the output of a complex chain of Spark transformations to a Parquet table called events.liveLatest.

Consumers of this table query it frequently with filters on both year and month of the event_ts column (a timestamp).

The current code:

from pyspark.sql import functions as F

final = df.withColumn("event_year", F.year("event_ts")) \

.withColumn("event_month", F.month("event_ts")) \

.bucketBy(42, ["event_year", "event_month"]) \

.saveAsTable("events.liveLatest")

However, consumers report poor query performance.

Which change will enable efficient querying by year and month?

Options:

Replace .bucketBy() with .partitionBy("event_year", "event_month")

Change the bucket count (42) to a lower number

Add .sortBy() after .bucketBy()

Replace .bucketBy() with .partitionBy("event_year") only

Buy Now

Questions 37

A data engineer is working on a real-time analytics pipeline using Apache Spark Structured Streaming. The engineer wants to process incoming data and ensure that triggers control when the query is executed. The system needs to process data in micro-batches with a fixed interval of 5 seconds.

Which code snippet the data engineer could use to fulfil this requirement?

Databricks-Certified-Associate-Developer-for-Apache-Spark-3.5 Question 37

Options:

Uses trigger(continuous='5 seconds') – continuous processing mode.

Uses trigger() – default micro-batch trigger without interval.

Uses trigger(processingTime='5 seconds') – correct micro-batch trigger with interval.

Uses trigger(processingTime=5000) – invalid, as processingTime expects a string.

Buy Now

Questions 38

A data scientist has identified that some records in the user profile table contain null values in any of the fields, and such records should be removed from the dataset before processing. The schema includes fields like user_id, username, date_of_birth, created_ts, etc.

The schema of the user profile table looks like this:

Databricks-Certified-Associate-Developer-for-Apache-Spark-3.5 Question 38

Which block of Spark code can be used to achieve this requirement?

Options:

filtered_df = users_raw_df.na.drop(thresh=0)

filtered_df = users_raw_df.na.drop(how='all')

filtered_df = users_raw_df.na.drop(how='any')

filtered_df = users_raw_df.na.drop(how='all', thresh=None)

Buy Now

Questions 39

A data engineer uses a broadcast variable to share a DataFrame containing millions of rows across executors for lookup purposes. What will be the outcome?

Options:

The job may fail if the memory on each executor is not large enough to accommodate the DataFrame being broadcasted

The job may fail if the executors do not have enough CPU cores to process the broadcasted dataset

The job will hang indefinitely as Spark will struggle to distribute and serialize such a large broadcast variable to all executors

The job may fail because the driver does not have enough CPU cores to serialize the large DataFrame

Buy Now

Questions 40

A data engineer is working on a Streaming DataFrame streaming_df with the given streaming data:

Databricks-Certified-Associate-Developer-for-Apache-Spark-3.5 Question 40

Which operation is supported with streamingdf ?

Options:

streaming_df. select (countDistinct ("Name") )

streaming_df.groupby("Id") .count ()

streaming_df.orderBy("timestamp").limit(4)

streaming_df.filter (col("count") < 30).show()

Buy Now

Answer:

Explanation:

Which operation is supported with streaming_df?

A. streaming_df.select(countDistinct("Name"))

B. streaming_df.groupby("Id").count()

C. streaming_df.orderBy("timestamp").limit(4)

D. streaming_df.filter(col("count") < 30).show()

Answer: B

In Structured Streaming, only a limited subset of operations is supported due to the nature of unbounded data. Operations like sorting (orderBy) and global aggregation (countDistinct) require a full view of the dataset, which is not possible with streaming data unless specific watermarks or windows are defined.

Review of Each Option:

A. select(countDistinct("Name"))

Not allowed — Global aggregation like countDistinct() requires the full dataset and is not supported directly in streaming without watermark and windowing logic.

[Reference: Databricks Structured Streaming Guide – Unsupported Operations., B. groupby("Id").count()Supported — Streaming aggregations over a key (like groupBy("Id")) are supported. Spark maintains intermediate state for each key.Reference: Databricks Docs → Aggregations in Structured Streaming (https://docs.databricks.com/structured-streaming/aggregation.html), C. orderBy("timestamp").limit(4) Not allowed — Sorting and limiting require a full view of the stream (which is infinite), so this is unsupported in streaming DataFrames.Reference: Spark Structured Streaming – Unsupported Operations (ordering without watermark/window not allowed)., D. filter(col("count") < 30).show() Not allowed — show() is a blocking operation used for debugging batch DataFrames; it's not allowed on streaming DataFrames.Reference: Structured Streaming Programming Guide – Output operations like show() are not supported., , Reference Extract from Official Guide:, “Operations like orderBy, limit, show, and countDistinct are not supported in Structured Streaming because they require the full dataset to compute a result. Use groupBy(...).agg(...) instead for incremental aggregations.”— Databricks Structured Streaming Programming Guide]

Exam Code: Databricks-Certified-Associate-Developer-for-Apache-Spark-3.5

Exam Name: Databricks Certified Associate Developer for Apache Spark 3.5 – Python

Last Update: Oct 30, 2025

Questions: 136

PDF + Testing Engine

$134.99

Testing Engine

$99.99

PDF (Q&A)

$84.99

buy now Databricks-Certified-Associate-Developer-for-Apache-Spark-3.5 pdf

Big Halloween Sale 70% Discount Offer - Ends in 0d 00h 00m 00s - Coupon code: best70

Dumpsbuddy logo

Databricks-Certified-Associate-Developer-for-Apache-Spark-3.5 Databricks Certified Associate Developer for Apache Spark 3.5 – Python Questions and Answers

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer: