Databricks Certified Professional Data Engineer - Databricks-Certified-Professional-Data-Engineer Exam Practice Test
A table named user_ltv is being used to create a view that will be used by data analysts on various teams.
Users in the workspace are configured into groups, which are used for setting up data access using ACLs.
The user_ltv table has the following schema:

An analyst who is not a member of the auditing group executes the following query:

Which result will be returned by this query?
Correct Answer: D
Assuming that the Databricks CLI has been installed and configured correctly, which Databricks CLI command can be used to upload a custom Python Wheel to object storage mounted with the DBFS for use with a production job?
Correct Answer: B
The following code has been migrated to a Databricks notebook from a legacy workload:

The code executes successfully and provides the logically correct results; however, it takes over 20 minutes to extract and load around 1 GB of data.
Which statement is a possible explanation for this behavior?

Correct Answer: B
A data engineer is creating a data ingestion pipeline to understand where customers are taking their rented bicycles during use. The engineer noticed that over time, data being transmitted from the bicycle sensors fails to include key details like latitude and longitude. Downstream analysts need both the clean records and the quarantined records available for separate processing.
The data engineer already has this code:
import dlt
from pyspark.sql.functions import expr

rules = {
    "valid_lat": "(lat IS NOT NULL)",
    "valid_long": "(long IS NOT NULL)"
}
quarantine_rules = "NOT({0})".format(" AND ".join(rules.values()))

@dlt.view
def raw_trips_data():
    return spark.readStream.table("ride_and_go.telemetry.trips")
How should the data engineer meet the requirements to capture good and bad data?
Correct Answer: C
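The answer options are not reproduced above, so the following is only a hedged sketch of what the given code prepares, not the text of option C. It rebuilds the same rules dictionary in plain Python to show the predicate that quarantine_rules ends up holding. A pipeline could then, as one assumption, attach the rules as expectations and use the inverted predicate to flag or route the bad rows. The helper name build_quarantine_predicate is hypothetical and is not part of the dlt API.

```python
# Plain-Python sketch: how the quarantine predicate from the question is built.
# build_quarantine_predicate is a hypothetical helper, not part of the dlt API.

rules = {
    "valid_lat": "(lat IS NOT NULL)",
    "valid_long": "(long IS NOT NULL)",
}


def build_quarantine_predicate(rule_map):
    """Return a SQL predicate that is true when at least one rule fails."""
    return "NOT({0})".format(" AND ".join(rule_map.values()))


quarantine_rules = build_quarantine_predicate(rules)
print(quarantine_rules)
```

Because IS NOT NULL always evaluates to true or false, the combined predicate is true exactly for rows where lat or long is missing, which is the set of records the analysts want quarantined.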
Which approach demonstrates a modular and testable way to use DataFrame.transform for ETL code in PySpark?
Correct Answer: B
The data engineering team maintains the following code:

Assuming that this code produces logically correct results and the data in the source table has been de-duplicated and validated, which statement describes what will occur when this code is executed?

Correct Answer: B
A data engineer is designing an append-only pipeline that needs to handle both batch and streaming data in Delta Lake. The team wants to ensure that the streaming component can efficiently track which data has already been processed.
Which configuration should be set to enable this?
Correct Answer: B
A junior data engineer is migrating a workload from a relational database system to the Databricks Lakehouse. The source system uses a star schema, leveraging foreign key constraints and multi-table inserts to validate records on write.
Which consideration will impact the decisions made by the engineer while migrating this workload?
Correct Answer: A
A new data engineer notices that a critical field was omitted from an application that writes its Kafka source to Delta Lake. This happened even though the critical field was in the Kafka source. That field was further missing from data written to dependent, long-term storage. The retention threshold on the Kafka service is seven days. The pipeline has been in production for three months.
Which describes how Delta Lake can help to avoid data loss of this nature in the future?
Correct Answer: D
A data engineer is running a groupBy aggregation on a massive user activity log grouped by user_id. A few users have millions of records, causing task skew and long runtimes.
Which technique will fix the skew in this aggregation?
Correct Answer: B
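The answer options are not shown here. One widely used technique for this kind of skew is key salting with a two-stage aggregation. The sketch below illustrates the idea in plain Python rather than Spark, so every name in it (NUM_SALTS, salted_partial_counts, merge_partials) is hypothetical. Stage one counts per (user_id, salt) pair so that a hot user's rows spread over several groups; stage two sums the partial counts per user_id.

```python
import random
from collections import Counter

NUM_SALTS = 4  # hypothetical number of salt buckets per key


def salted_partial_counts(user_ids, num_salts=NUM_SALTS, seed=0):
    """Stage 1: count rows per (user_id, salt) so hot keys split across groups."""
    rng = random.Random(seed)
    partials = Counter()
    for user_id in user_ids:
        partials[(user_id, rng.randrange(num_salts))] += 1
    return partials


def merge_partials(partials):
    """Stage 2: drop the salt and sum the partial counts per user_id."""
    totals = Counter()
    for (user_id, _salt), count in partials.items():
        totals[user_id] += count
    return totals


# One heavy user and two light users, mirroring the skew scenario.
events = ["u1"] * 1000 + ["u2"] * 3 + ["u3"] * 2
partials = salted_partial_counts(events)
totals = merge_partials(partials)
print(dict(totals))
```

The final totals match a direct count per user, but no single stage-one group has to hold all of the heavy user's rows, which is the property that relieves the long-running task in a distributed groupBy.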
A data engineer is using the Lakeflow Declarative Pipelines Expectations feature to track the data quality of their incoming sensor data. Periodically, sensors send bad readings that are out of range, and they are currently flagging those rows with a warning and writing them to the silver table along with the good data. They've been given a new requirement: the bad rows need to be quarantined in a separate quarantine table and no longer included in the silver table.
This is the existing code for their silver table:
@dlt.table
@dlt.expect("valid_sensor_reading", "reading < 120")
def silver_sensor_readings():
    return spark.readStream.table("bronze_sensor_readings")
What code will satisfy the requirements?
Correct Answer: B
A data team's Structured Streaming job is configured to calculate running aggregates for item sales to update a downstream marketing dashboard. The marketing team has introduced a new field to track the number of times this promotion code is used for each item. A junior data engineer suggests updating the existing query as follows. Note that proposed changes are in bold.

Which step must also be completed to put the proposed query into production?

Correct Answer: A
The business reporting team requires that data for their dashboards be updated every hour. The pipeline that extracts, transforms, and loads their data takes 10 minutes in total to run. Assuming normal operating conditions, which configuration will meet their service-level agreement requirements with the lowest cost?
Correct Answer: B
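The answer options are not reproduced here. As a rough worked comparison, the arithmetic below contrasts a scheduled hourly run on a cluster that terminates after each run with a cluster left running all day. It rests on two simplifying assumptions that are not in the question: cost is proportional to cluster uptime, and startup time is ignored. The function name daily_uptime_minutes is illustrative only.

```python
RUN_MINUTES = 10           # pipeline duration stated in the question
RUNS_PER_DAY = 24          # one run per hour meets the hourly requirement
MINUTES_PER_DAY = 24 * 60


def daily_uptime_minutes(scheduled):
    """Cluster uptime per day; assumes the cluster stops right after each scheduled run."""
    return RUN_MINUTES * RUNS_PER_DAY if scheduled else MINUTES_PER_DAY


scheduled_minutes = daily_uptime_minutes(scheduled=True)
always_on_minutes = daily_uptime_minutes(scheduled=False)
ratio = always_on_minutes / scheduled_minutes
print(scheduled_minutes, always_on_minutes, ratio)
```

Under these assumptions the scheduled approach keeps compute up for 240 minutes a day against 1440 for an always-on cluster, a sixfold difference, while still refreshing the dashboards every hour.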
To identify the top users consuming compute resources, a data engineering team needs to monitor usage within their Databricks workspace for better resource utilization and cost control. The team decided to use Databricks system tables, available under the System catalog in Unity Catalog, to gain detailed visibility into workspace activity.
Which SQL query should the team run from the System catalog to achieve this?
Correct Answer: C
A user new to Databricks is trying to troubleshoot long execution times for some pipeline logic they are working on. Presently, the user is executing code cell-by-cell, using display() calls to confirm code is producing the logically correct results as new transformations are added to an operation. To get a measure of average time to execute, the user is running each cell multiple times interactively.
Which of the following adjustments will get a more accurate measure of how code is likely to perform in production?
Correct Answer: C