Databricks-Certified-Professional-Data-Engineer by Databricks Actual Free Exam Q&As

Question 1

A table named user_ltv is being used to create a view that will be used by data analysts on various teams.
Users in the workspace are configured into groups, which are used for setting up data access using ACLs.
The user_ltv table has the following schema:

An analyst who is not a member of the auditing group executes the following query:

Which result will be returned by this query?

A. All columns will be displayed normally for those records that have an age greater than 17; records not meeting this condition will be omitted.
B. All records from all columns will be displayed with the values in user_ltv.
C. All age values less than 18 will be returned as null values; all other columns will be returned with the values in user_ltv.
D. All columns will be displayed normally for those records that have an age greater than 18; records not meeting this condition will be omitted.

Correct Answer: D

Question 2

Assuming that the Databricks CLI has been installed and configured correctly, which Databricks CLI command can be used to upload a custom Python Wheel to object storage mounted with the DBFS for use with a production job?

A. workspace
B. libraries
C. jobs
D. configure
E. fs

Correct Answer: E
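The fs command group of the Databricks CLI is the one that copies files into DBFS. Below is a minimal sketch in Python that assembles such a copy command and, optionally, runs it. The wheel file name, the DBFS destination, and the helper names build_upload_command and upload_wheel are hypothetical and used only for illustration.

```python
import subprocess


def build_upload_command(local_wheel, dbfs_dir):
    """Return the argv list for copying a local wheel into DBFS.

    Both paths are hypothetical; adjust them to your own workspace.
    """
    filename = local_wheel.rsplit("/", 1)[-1]
    destination = dbfs_dir.rstrip("/") + "/" + filename
    return ["databricks", "fs", "cp", local_wheel, destination]


def upload_wheel(local_wheel, dbfs_dir):
    # Needs an installed and configured Databricks CLI on PATH.
    subprocess.run(build_upload_command(local_wheel, dbfs_dir), check=True)


command = build_upload_command("dist/my_pkg-0.1.0-py3-none-any.whl", "dbfs:/mnt/libs")
print(" ".join(command))
```

Only build_upload_command runs here; upload_wheel is left uncalled because it depends on a configured CLI.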

Question 3

The following code has been migrated to a Databricks notebook from a legacy workload:

The code executes successfully and provides the logically correct results; however, it takes over 20 minutes to extract and load around 1 GB of data.
Which statement is a possible explanation for this behavior?

A. Python will always execute slower than Scala on Databricks. The run.py script should be refactored to Scala.
B. %sh executes shell code on the driver node. The code does not take advantage of the worker nodes or Databricks optimized Spark.
C. %sh triggers a cluster restart to collect and install Git. Most of the latency is related to cluster startup time.
D. Instead of cloning, the code should use %sh pip install so that the Python code can get executed in parallel across all nodes in a cluster.
E. %sh does not distribute file moving operations; the final line of code should be updated to use %fs instead.

Correct Answer: B

Question 4

A data engineer is creating a data ingestion pipeline to understand where customers are taking their rented bicycles during use. The engineer noticed that over time, data being transmitted from the bicycle sensors fails to include key details like latitude and longitude. Downstream analysts need both the clean records and the quarantined records available for separate processing.
The data engineer already has this code:
import dlt
from pyspark.sql.functions import expr

rules = {
    "valid_lat": "(lat IS NOT NULL)",
    "valid_long": "(long IS NOT NULL)"
}
quarantine_rules = "NOT({0})".format(" AND ".join(rules.values()))

@dlt.view
def raw_trips_data():
    return spark.readStream.table("ride_and_go.telemetry.trips")

How should the data engineer meet the requirements to capture good and bad data?

A.
@dlt.table(name="trips_data_quarantine")
def trips_data_quarantine():
    return (
        spark.readStream.table("raw_trips_data")
        .filter(expr(quarantine_rules))
    )

B.
@dlt.table
@dlt.expect_all_or_drop(rules)
def trips_data_quarantine():
    return spark.readStream.table("raw_trips_data")

C.
@dlt.table(partition_cols=["is_quarantined"])
@dlt.expect_all(rules)
def trips_data_quarantine():
    return (
        spark.readStream.table("raw_trips_data")
        .withColumn("is_quarantined", expr(quarantine_rules))
    )

D.
@dlt.view
@dlt.expect_or_drop("lat_long_present", "(lat IS NOT NULL AND long IS NOT NULL)")
def trips_data_quarantine():
    return spark.readStream.table("ride_and_go.telemetry.trips")

Correct Answer: C
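To see what the quarantine_rules expression from the question evaluates to, the sketch below builds the same string in plain Python and then flags a few records the way an is_quarantined column would. The sample records and the is_quarantined helper are hypothetical stand-ins; no Spark or pipeline runtime is involved.

```python
rules = {
    "valid_lat": "(lat IS NOT NULL)",
    "valid_long": "(long IS NOT NULL)",
}
# Same construction as in the question: negate the conjunction of all rules.
quarantine_rules = "NOT({0})".format(" AND ".join(rules.values()))
print(quarantine_rules)


def is_quarantined(row):
    """Plain-Python stand-in for evaluating quarantine_rules on one record."""
    return not (row.get("lat") is not None and row.get("long") is not None)


# Hypothetical sample records.
trips = [
    {"trip_id": 1, "lat": 52.1, "long": 4.3},
    {"trip_id": 2, "lat": None, "long": 4.3},
    {"trip_id": 3, "lat": 52.2, "long": None},
]

# Keep every record and add a flag, so both groups stay available downstream.
flagged = [dict(t, is_quarantined=is_quarantined(t)) for t in trips]
clean = [t["trip_id"] for t in flagged if not t["is_quarantined"]]
quarantined = [t["trip_id"] for t in flagged if t["is_quarantined"]]
```

Because every record is kept and only flagged, downstream readers can filter on the flag to get either the clean or the quarantined set.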

Question 5

Which approach demonstrates a modular and testable way to use DataFrame.transform for ETL code in PySpark?

A.
def transform_data(input_df):
    # transformation logic here
    return output_df

test_input = spark.createDataFrame([(1, "a")], ["id", "value"])
assertDataFrameEqual(transform_data(test_input), expected)

B.
def upper_value(df):
    return df.withColumn("value_upper", upper(col("value")))

def filter_positive(df):
    return df.filter(df["id"] > 0)

pipeline_df = df.transform(upper_value).transform(filter_positive)

C.
class Pipeline:
    def transform(self, df):
        return df.withColumn("value_upper", upper(col("value")))

pipeline = Pipeline()
assertDataFrameEqual(pipeline.transform(test_input), expected)

D.
def upper_transform(df):
    return df.withColumn("value_upper", upper(col("value")))

actual = test_input.transform(upper_transform)
assertDataFrameEqual(actual, expected)

Correct Answer: B
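The chaining pattern can be tried without a Spark session. The sketch below uses a hypothetical MiniFrame class as a stand-in for a DataFrame; its transform method only mirrors the shape of DataFrame.transform (apply a function to the frame and return the result) and is not the PySpark implementation.

```python
class MiniFrame:
    """Tiny stand-in for a DataFrame: a list of dict rows plus transform()."""

    def __init__(self, rows):
        self.rows = list(rows)

    def transform(self, func):
        # Same shape as DataFrame.transform: apply func to self, return its result.
        return func(self)


def upper_value(frame):
    return MiniFrame(dict(r, value_upper=r["value"].upper()) for r in frame.rows)


def filter_positive(frame):
    return MiniFrame(r for r in frame.rows if r["id"] > 0)


test_input = MiniFrame([{"id": 1, "value": "a"}, {"id": -1, "value": "b"}])

# Each step is a plain function, so it can be tested on its own...
upper_only = upper_value(test_input)

# ...and the same functions chain into a pipeline through transform().
result = test_input.transform(upper_value).transform(filter_positive)
print(result.rows)
```

Keeping each step as a function that takes a frame and returns a frame is what makes the steps reusable in a chain and testable in isolation.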

Question 6

The data engineering team maintains the following code:

Assuming that this code produces logically correct results and the data in the source table has been de-duplicated and validated, which statement describes what will occur when this code is executed?

A. The silver_customer_sales table will be overwritten by aggregated values calculated from all records in the gold_customer_lifetime_sales_summary table as a batch job.
B. The gold_customer_lifetime_sales_summary table will be overwritten by aggregated values calculated from all records in the silver_customer_sales table as a batch job.
C. A batch job will update the gold_customer_lifetime_sales_summary table, replacing only those rows that have different values than the current version of the table, using customer_id as the primary key.
D. An incremental job will leverage running information in the state store to update aggregate values in the gold_customer_lifetime_sales_summary table.
E. An incremental job will detect if new rows have been written to the silver_customer_sales table; if new rows are detected, all aggregates will be recalculated and used to overwrite the gold_customer_lifetime_sales_summary table.

Correct Answer: B

Question 7

A data engineer is designing an append-only pipeline that needs to handle both batch and streaming data in Delta Lake. The team wants to ensure that the streaming component can efficiently track which data has already been processed.
Which configuration should be set to enable this?

A. overwriteSchema
B. checkpointLocation
C. partitionBy
D. mergeSchema

Correct Answer: B
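In Structured Streaming, checkpointLocation is an option set on the streaming write. The idea behind it can be illustrated with a simplified analogy in plain Python: persist how far the reader has progressed, and on the next run process only what lies beyond that point. The file name offsets.json and the function process_new_records are hypothetical, and this is not how Spark actually lays out its checkpoint directory.

```python
import json
import os
import tempfile


def process_new_records(log, checkpoint_path):
    """Return records not yet processed and advance the stored offset."""
    offset = 0
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            offset = json.load(f)["offset"]
    new_records = log[offset:]
    # Persist progress so a later run (or a restart) resumes from here.
    with open(checkpoint_path, "w") as f:
        json.dump({"offset": len(log)}, f)
    return new_records


checkpoint_dir = tempfile.mkdtemp()
checkpoint_file = os.path.join(checkpoint_dir, "offsets.json")

source_log = ["r1", "r2", "r3"]   # append-only source
first_run = process_new_records(source_log, checkpoint_file)

source_log.append("r4")           # new data arrives
second_run = process_new_records(source_log, checkpoint_file)
third_run = process_new_records(source_log, checkpoint_file)
```

The second run sees only the newly appended record and the third run sees nothing, which is the behavior an append-only stream relies on.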

Question 8

A junior data engineer is migrating a workload from a relational database system to the Databricks Lakehouse. The source system uses a star schema, leveraging foreign key constraints and multi-table inserts to validate records on write.
Which consideration will impact the decisions made by the engineer while migrating this workload?

A. All Delta Lake transactions are ACID compliant against a single table, and Databricks does not enforce foreign key constraints.
B. Databricks only allows foreign key constraints on hashed identifiers, which avoid collisions in highly parallel writes.
C. Foreign keys must reference a primary key field; multi-table inserts must leverage Delta Lake's upsert functionality.
D. Committing to multiple tables simultaneously requires taking out multiple table locks and can lead to a state of deadlock.

Correct Answer: A
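When foreign keys are not enforced on write, referential integrity has to be checked by the pipeline itself, typically with an anti-join between the fact and dimension tables. A minimal sketch of that check in plain Python follows; the table contents and the helper find_orphans are hypothetical.

```python
def find_orphans(fact_rows, dim_rows, fk, pk):
    """Return fact rows whose foreign key has no match in the dimension (anti-join)."""
    known_keys = {d[pk] for d in dim_rows}
    return [f for f in fact_rows if f[fk] not in known_keys]


# Hypothetical star-schema fragments.
dim_customer = [{"customer_id": 1}, {"customer_id": 2}]
fact_sales = [
    {"sale_id": 10, "customer_id": 1},
    {"sale_id": 11, "customer_id": 3},   # no matching customer
]

orphans = find_orphans(fact_sales, dim_customer, fk="customer_id", pk="customer_id")
print(orphans)
```

A pipeline could run such a check before or after the write and route the orphaned rows to a separate location for review.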

Question 9

A new data engineer notices that a critical field was omitted from an application that writes its Kafka source to Delta Lake. This happened even though the critical field was in the Kafka source. That field was further missing from data written to dependent, long-term storage. The retention threshold on the Kafka service is seven days. The pipeline has been in production for three months.
Which describes how Delta Lake can help to avoid data loss of this nature in the future?

A. Delta Lake automatically checks that all fields present in the source data are included in the ingestion layer.
B. The Delta log and Structured Streaming checkpoints record the full history of the Kafka producer.
C. Data can never be permanently dropped or deleted from Delta Lake, so data loss is not possible under any circumstance.
D. Ingesting all raw data and metadata from Kafka to a bronze Delta table creates a permanent, replayable history of the data state.
E. Delta Lake schema evolution can retroactively calculate the correct value for newly added fields, as long as the data was in the original source.

Correct Answer: D
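The value of a raw bronze layer shows up when a parser turns out to be incomplete. In the sketch below, plain Python lists stand in for tables: the bronze records keep the untouched message value plus metadata, the first parser omits a field, and a corrected parser recovers it by replaying bronze. All field names and records are hypothetical.

```python
import json

# Bronze layer: each message is stored untouched (raw value plus metadata).
bronze = [
    {"offset": 0, "timestamp": 1700000000,
     "value": '{"user": "a", "amount": 5, "region": "eu"}'},
    {"offset": 1, "timestamp": 1700000060,
     "value": '{"user": "b", "amount": 7, "region": "us"}'},
]


def parse_v1(record):
    """Original parser that accidentally omits the 'region' field."""
    payload = json.loads(record["value"])
    return {"user": payload["user"], "amount": payload["amount"]}


def parse_v2(record):
    """Corrected parser that keeps 'region'."""
    payload = json.loads(record["value"])
    return {"user": payload["user"], "amount": payload["amount"],
            "region": payload["region"]}


silver_v1 = [parse_v1(r) for r in bronze]
# The source topic may have expired, but bronze still holds the raw payloads.
silver_v2 = [parse_v2(r) for r in bronze]
```

Had only the parsed output been stored, the omitted field would be unrecoverable once the Kafka retention window had passed.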

Question 10

A data engineer is running a groupBy aggregation on a massive user activity log grouped by user_id. A few users have millions of records, causing task skew and long runtimes.
Which technique will fix the skew in this aggregation?

A. Filter out the skewed users before the aggregation.
B. Use salting by adding a random prefix to skewed keys before aggregation, then aggregate again after removing the prefix.
C. Use reduceByKey instead of groupBy to avoid shuffles.
D. Increase the Spark driver memory and retry.

Correct Answer: B
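Salting is a two-stage aggregation: first aggregate on the key combined with a random salt so that one hot key is spread over several groups, then drop the salt and combine the partial results. The sketch below shows the mechanics in plain Python with a count aggregation. For simplicity it salts every key rather than only the skewed ones, and the event data, the number of salts, and the function name salted_count are hypothetical.

```python
import random
from collections import Counter


def salted_count(user_ids, num_salts=4, seed=0):
    rng = random.Random(seed)
    # Stage 1: aggregate on (salt, user_id) so a hot key splits into several groups.
    partial = Counter()
    for user_id in user_ids:
        salt = rng.randrange(num_salts)
        partial[(salt, user_id)] += 1
    # Stage 2: drop the salt and combine the partial counts per user.
    final = Counter()
    for (_salt, user_id), count in partial.items():
        final[user_id] += count
    return partial, final


# One heavily skewed user and two small ones.
events = ["u1"] * 1000 + ["u2"] * 3 + ["u3"] * 2
partial, final = salted_count(events)

u1_groups = [count for (salt, user), count in partial.items() if user == "u1"]
```

The final counts match a direct aggregation, while the work for the hot key is divided among the salted groups in the first stage. The approach applies to aggregations that can be combined from partial results, such as counts and sums.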

Question 11

A data engineer is using the Lakeflow Declarative Pipelines expectations feature to track the data quality of their incoming sensor data. Periodically, sensors send bad readings that are out of range, and they are currently flagging those rows with a warning and writing them to the silver table along with the good data. They've been given a new requirement: the bad rows need to be quarantined in a separate quarantine table and no longer included in the silver table.
This is the existing code for their silver table:
@dlt.table
@dlt.expect("valid_sensor_reading", "reading < 120")
def silver_sensor_readings():
    return spark.readStream.table("bronze_sensor_readings")

What code will satisfy the requirements?

A.
@dlt.table
@dlt.expect_or_drop("valid_sensor_reading", "reading < 120")
def silver_sensor_readings():
    return spark.readStream.table("bronze_sensor_readings")

@dlt.table
@dlt.expect_or_drop("invalid_sensor_reading", "reading >= 120")
def quarantine_sensor_readings():
    return spark.readStream.table("bronze_sensor_readings")

B.
@dlt.table
@dlt.expect("valid_sensor_reading", "reading < 120")
def silver_sensor_readings():
    return spark.readStream.table("bronze_sensor_readings")

@dlt.table
@dlt.expect("invalid_sensor_reading", "reading >= 120")
def quarantine_sensor_readings():
    return spark.readStream.table("bronze_sensor_readings")

C.
@dlt.table
@dlt.expect_or_drop("valid_sensor_reading", "reading < 120")
def silver_sensor_readings():
    return spark.readStream.table("bronze_sensor_readings")

@dlt.table
@dlt.expect("invalid_sensor_reading", "reading < 120")
def quarantine_sensor_readings():
    return spark.readStream.table("bronze_sensor_readings")

D.
@dlt.table
@dlt.expect_or_drop("valid_sensor_reading", "reading < 120")
def silver_sensor_readings():
    return spark.readStream.table("bronze_sensor_readings")

@dlt.table
@dlt.expect("invalid_sensor_reading", "reading >= 120")
def quarantine_sensor_readings():
    return spark.readStream.table("bronze_sensor_readings")

Correct Answer: A
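An expectation that only warns keeps every row, while one that drops keeps only the rows satisfying its constraint. The plain-Python sketch below mimics two drop-style filters with opposite constraints, "reading < 120" and "reading >= 120", to show how they split the input between a silver set and a quarantine set. The sample readings and helper names are hypothetical, and no pipeline runtime is involved.

```python
def is_valid(row):
    # Mirrors the constraint "reading < 120".
    return row["reading"] < 120


def is_invalid(row):
    # Mirrors the inverted constraint "reading >= 120".
    return row["reading"] >= 120


# Hypothetical bronze readings.
bronze = [
    {"sensor": "s1", "reading": 80},
    {"sensor": "s2", "reading": 120},
    {"sensor": "s3", "reading": 300},
]

# Drop semantics: keep rows that satisfy the constraint, discard the rest.
silver = [r for r in bronze if is_valid(r)]
quarantine = [r for r in bronze if is_invalid(r)]

# Warn semantics, for contrast: every row is kept, violations are only counted.
warn_only = list(bronze)
violations = sum(1 for r in bronze if not is_valid(r))
```

With drop semantics on both tables, each of these rows lands in exactly one of them; with warn semantics the bad rows would still reach the silver table.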

Question 12

A data team's Structured Streaming job is configured to calculate running aggregates for item sales to update a downstream marketing dashboard. The marketing team has introduced a new field to track the number of times this promotion code is used for each item. A junior data engineer suggests updating the existing query as follows (proposed changes are in bold):

Which step must also be completed to put the proposed query into production?

A. Specify a new checkpointLocation
B. Increase the shuffle partitions to account for additional aggregates
C. Run REFRESH TABLE delta.`/item_agg`
D. Remove .option('mergeSchema', 'true') from the streaming write

Correct Answer: A

Question 13

The business reporting team requires that data for their dashboards be updated every hour. The pipeline that extracts, transforms, and loads the data for their dashboards takes 10 minutes in total to run. Assuming normal operating conditions, which configuration will meet their service-level agreement requirements with the lowest cost?

A. Schedule a job to execute the pipeline once an hour on a dedicated interactive cluster.
B. Schedule a job to execute the pipeline once an hour on a new job cluster.
C. Configure a job that executes every time new data lands in a given directory.
D. Schedule a Structured Streaming job with a trigger interval of 60 minutes.

Correct Answer: B

Question 14

To identify the top users consuming compute resources, a data engineering team needs to monitor usage within their Databricks workspace for better resource utilization and cost control. The team decided to use Databricks system tables, available under the System catalog in Unity Catalog, to gain detailed visibility into workspace activity.
Which SQL query should the team run from the System catalog to achieve this?

A.
SELECT sku_name,
       identity_metadata.created_by AS user_email,
       COUNT(usage_quantity) AS total_dbus
FROM system.billing.usage
GROUP BY user_email, sku_name
ORDER BY total_dbus DESC
LIMIT 10

B.
SELECT sku_name,
       identity_metadata.created_by AS user_email,
       SUM(usage_quantity * usage_unit) AS total_dbus
FROM system.billing.usage
GROUP BY user_email, sku_name
ORDER BY total_dbus DESC
LIMIT 10

C.
SELECT identity_metadata.run_as AS user_email,
       SUM(usage_quantity) AS total_dbus
FROM system.billing.usage
GROUP BY user_email
ORDER BY total_dbus DESC
LIMIT 10

D.
SELECT sku_name,
       usage_metadata.run_name AS user_email,
       SUM(usage_quantity) AS total_dbus
FROM system.billing.usage
GROUP BY user_email, sku_name
ORDER BY total_dbus DESC
LIMIT 10

Correct Answer: C
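The difference between summing and counting usage_quantity can be checked on a toy table. The sketch below uses Python's built-in sqlite3 module with a hypothetical flattened table in which run_as is an ordinary column; the real system.billing.usage table nests it under identity_metadata, and the rows here are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE usage (run_as TEXT, sku_name TEXT, usage_quantity REAL)")
conn.executemany(
    "INSERT INTO usage VALUES (?, ?, ?)",
    [
        ("ana@example.com", "JOBS", 10.0),
        ("ana@example.com", "SQL", 5.0),
        ("bo@example.com", "JOBS", 1.0),
        ("bo@example.com", "JOBS", 1.0),
        ("bo@example.com", "JOBS", 1.0),
    ],
)

# SUM totals the consumed quantity per user; COUNT only counts usage rows.
query = """
    SELECT run_as AS user_email,
           SUM(usage_quantity) AS total_dbus,
           COUNT(usage_quantity) AS row_count
    FROM usage
    GROUP BY run_as
    ORDER BY total_dbus DESC
    LIMIT 10
"""
top_users = conn.execute(query).fetchall()
print(top_users)
```

In this toy data the user with the most rows is not the user with the highest total, so ranking by COUNT would put the wrong user on top.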

Question 15

A user new to Databricks is trying to troubleshoot long execution times for some pipeline logic they are working on. Presently, the user is executing code cell-by-cell, using display() calls to confirm code is producing the logically correct results as new transformations are added to an operation. To get a measure of average time to execute, the user is running each cell multiple times interactively.
Which of the following adjustments will get a more accurate measure of how code is likely to perform in production?

A. The only way to meaningfully troubleshoot code execution times in development notebooks is to use production-sized data and production-sized clusters with Run All execution.
B. Production code development should only be done using an IDE; executing code against a local build of open source Spark and Delta Lake will provide the most accurate benchmarks for how code will perform in production.
C. Calling display() forces a job to trigger, while many transformations will only add to the logical query plan; because of caching, repeated execution of the same logic does not provide meaningful results.
D. Scala is the only language that can be accurately tested using interactive notebooks; because the best performance is achieved by using Scala code compiled to JARs, all PySpark and Spark SQL logic should be refactored.
E. The Jobs UI should be leveraged to occasionally run the notebook as a job and track execution time during incremental code development because Photon can only be enabled on clusters launched for scheduled jobs.

Correct Answer: C
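The two effects named in option C, lazy evaluation and caching, can be observed in plain Python as a rough analogy. In the sketch below, map plays the role of a transformation that only records work to be done, consuming it plays the role of an action such as display(), and functools.lru_cache stands in for result caching. None of this is Spark code, and the function names are hypothetical.

```python
from functools import lru_cache

work_done = []


def expensive(x):
    work_done.append(x)   # record each real computation
    return x * 2


# Building the pipeline is lazy: map() only records what to do.
pipeline = map(expensive, [1, 2, 3])
calls_after_build = len(work_done)    # nothing has run yet

# Consuming the pipeline plays the role of an action.
result = list(pipeline)
calls_after_action = len(work_done)


@lru_cache(maxsize=None)
def cached_total(n):
    return sum(expensive(i) for i in range(n))


cached_total(3)
calls_first_run = len(work_done)
cached_total(3)                       # answered from the cache, no new work
calls_second_run = len(work_done)
```

Timing the second call would measure a cache lookup rather than the computation, which is why repeated interactive runs of the same cell can understate real execution time.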
