[2022] Get Top-Rated Google Professional-Data-Engineer Exam Dumps Now
Passing Key To Getting Professional-Data-Engineer Certified Exam Engine PDF
NEW QUESTION 115
You have a Google Cloud Dataflow streaming pipeline running with a Google Cloud Pub/Sub subscription as the source. You need to make an update to the code that will make the new Cloud Dataflow pipeline incompatible with the current version. You do not want to lose any data when making this update. What should you do?
- A. Update the current pipeline and use the drain flag.
- B. Update the current pipeline and provide the transform mapping JSON object.
- C. Create a new pipeline that has the same Cloud Pub/Sub subscription and cancel the old pipeline.
- D. Create a new pipeline that has a new Cloud Pub/Sub subscription and cancel the old pipeline.
Answer: B
Explanation:
If any transform names in your pipeline have changed, you must supply a transform mapping and pass it using the --transformNameMapping option.
https://cloud.google.com/dataflow/docs/guides/updating-a-pipeline#preventing_compatibility_breaks
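As a rough illustration of the explanation above, the transform mapping is a JSON object that maps old transform names to new ones. The sketch below only builds the option string; the transform names and the `build_mapping_flag` helper are hypothetical, and the option name is the one quoted in the explanation.

```python
import json

# Hypothetical transform names, used only for illustration:
# keys are names in the running pipeline, values are names in the replacement.
transform_name_mapping = {
    "ParseMessages": "ParseTrackingMessages",
    "WriteRows": "WriteToWarehouse",
}


def build_mapping_flag(mapping: dict) -> str:
    """Serialize the mapping as JSON and attach it to the option named in the explanation."""
    return "--transformNameMapping=" + json.dumps(mapping, sort_keys=True)


flag = build_mapping_flag(transform_name_mapping)
print(flag)
```

Transforms whose names did not change would not need an entry in the mapping.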
NEW QUESTION 116
Your company is in a highly regulated industry. One of your requirements is to ensure individual users have access only to the minimum amount of information required to do their jobs. You want to enforce this requirement with Google BigQuery. Which three approaches can you take? (Choose three.)
- A. Restrict BigQuery API access to approved users.
- B. Segregate data across multiple tables or databases.
- C. Restrict access to tables by role.
- D. Use Google Stackdriver Audit Logging to determine policy violations.
- E. Disable writes to certain tables.
- F. Ensure that the data is encrypted at all times.
Answer: A,C,D
NEW QUESTION 117
Google Cloud Bigtable indexes a single value in each row. This value is called the _______.
- A. master key
- B. primary key
- C. unique key
- D. row key
Answer: D
Explanation:
Cloud Bigtable is a sparsely populated table that can scale to billions of rows and thousands of columns, allowing you to store terabytes or even petabytes of data. A single value in each row is indexed; this value is known as the row key.
Reference: https://cloud.google.com/bigtable/docs/overview
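To make the idea of a single indexed value concrete, here is a toy in-memory sketch of a sparse table that can only be looked up or scanned by row key. `SortedRowStore` is a hypothetical stand-in written for this example, not the Cloud Bigtable client library.

```python
from bisect import bisect_left


class SortedRowStore:
    """Toy table indexed only by row key (an illustration, not the Bigtable client)."""

    def __init__(self):
        self._keys = []   # row keys kept in sorted order: the only index
        self._rows = {}   # row key -> sparse dict of column values

    def put(self, row_key, columns):
        if row_key not in self._rows:
            self._keys.insert(bisect_left(self._keys, row_key), row_key)
        self._rows[row_key] = columns

    def get(self, row_key):
        return self._rows.get(row_key)

    def scan_prefix(self, prefix):
        """Return (row_key, columns) pairs whose key starts with prefix, in key order."""
        start = bisect_left(self._keys, prefix)
        result = []
        for key in self._keys[start:]:
            if not key.startswith(prefix):
                break
            result.append((key, self._rows[key]))
        return result


store = SortedRowStore()
store.put("device2#20220101", {"temp": 20.1})
store.put("device1#20220102", {"temp": 18.4, "humidity": 40})  # rows may have different columns
store.put("device1#20220101", {"temp": 18.0})
print(store.scan_prefix("device1#"))
```

Because nothing but the row key is indexed in this sketch, a query on any column value would have to scan every row.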
NEW QUESTION 118
You work for an economic consulting firm that helps companies identify economic trends as they happen.
As part of your analysis, you use Google BigQuery to correlate customer data with the average prices of the 100 most common goods sold, including bread, gasoline, milk, and others. The average prices of these goods are updated every 30 minutes. You want to make sure this data stays up to date so you can combine it with other data in BigQuery as cheaply as possible. What should you do?
- A. Load the data every 30 minutes into a new partitioned table in BigQuery.
- B. Store the data in a file in a regional Google Cloud Storage bucket. Use Cloud Dataflow to query BigQuery and combine the data programmatically with the data stored in Google Cloud Storage.
- C. Store and update the data in a regional Google Cloud Storage bucket and create a federated data source in BigQuery.
- D. Store the data in Google Cloud Datastore. Use Google Cloud Dataflow to query BigQuery and combine the data programmatically with the data stored in Cloud Datastore.
Answer: A
NEW QUESTION 119
Which of the following is NOT a valid use case to select HDD (hard disk drives) as the storage for Google Cloud Bigtable?
- A. You need to integrate with Google BigQuery.
- B. You will mostly run batch workloads with scans and writes, rather than frequently executing random reads of a small number of rows.
- C. You expect to store at least 10 TB of data.
- D. You will not use the data to back a user-facing or latency-sensitive application.
Answer: A
Explanation:
For example, if you plan to store extensive historical data for a large number of remote-sensing devices and then use the data to generate daily reports, the cost savings for HDD storage may justify the performance tradeoff. On the other hand, if you plan to use the data to display a real-time dashboard, it probably would not make sense to use HDD storage; reads would be much more frequent in this case, and reads are much slower with HDD storage.
Reference: https://cloud.google.com/bigtable/docs/choosing-ssd-hdd
NEW QUESTION 120
Flowlogistic Case Study
Company Overview
Flowlogistic is a leading logistics and supply chain provider. They help businesses throughout the world manage their resources and transport them to their final destination. The company has grown rapidly, expanding their offerings to include rail, truck, aircraft, and oceanic shipping.
Company Background
The company started as a regional trucking company, and then expanded into other logistics markets.
Because they have not updated their infrastructure, managing and tracking orders and shipments has become a bottleneck. To improve operations, Flowlogistic developed proprietary technology for tracking shipments in real time at the parcel level. However, they are unable to deploy it because their technology stack, based on Apache Kafka, cannot support the processing volume. In addition, Flowlogistic wants to further analyze their orders and shipments to determine how best to deploy their resources.
Solution Concept
Flowlogistic wants to implement two concepts using the cloud:
* Use their proprietary technology in a real-time inventory-tracking system that indicates the location of their loads
* Perform analytics on all their orders and shipment logs, which contain both structured and unstructured data, to determine how best to deploy resources, which markets to expand into. They also want to use predictive analytics to learn earlier when a shipment will be delayed.
Existing Technical Environment
Flowlogistic architecture resides in a single data center:
Databases
8 physical servers in 2 clusters
- SQL Server - user data, inventory, static data
3 physical servers
- Cassandra - metadata, tracking messages
10 Kafka servers - tracking message aggregation and batch insert
Application servers - customer front end, middleware for order/customs
60 virtual machines across 20 physical servers
- Tomcat - Java services
- Nginx - static content
- Batch servers
Storage appliances
- iSCSI for virtual machine (VM) hosts
- Fibre Channel storage area network (FC SAN) - SQL server storage
- Network-attached storage (NAS) image storage, logs, backups
Apache Hadoop/Spark servers
- Core Data Lake
- Data analysis workloads
20 miscellaneous servers
- Jenkins, monitoring, bastion hosts
Business Requirements
Build a reliable and reproducible environment with scaled parity of production.
Aggregate data in a centralized Data Lake for analysis
Use historical data to perform predictive analytics on future shipments
Accurately track every shipment worldwide using proprietary technology
Improve business agility and speed of innovation through rapid provisioning of new resources
Analyze and optimize architecture for performance in the cloud
Migrate fully to the cloud if all other requirements are met
Technical Requirements
Handle both streaming and batch data
Migrate existing Hadoop workloads
Ensure architecture is scalable and elastic to meet the changing demands of the company.
Use managed services whenever possible
Encrypt data in flight and at rest
Connect a VPN between the production data center and cloud environment
CEO Statement
We have grown so quickly that our inability to upgrade our infrastructure is really hampering further growth and efficiency. We are efficient at moving shipments around the world, but we are inefficient at moving data around.
We need to organize our information so we can more easily understand where our customers are and what they are shipping.
CTO Statement
IT has never been a priority for us, so as our data has grown, we have not invested enough in our technology. I have a good staff to manage IT, but they are so busy managing our infrastructure that I cannot get them to do the things that really matter, such as organizing our data, building the analytics, and figuring out how to implement the CFO's tracking technology.
CFO Statement
Part of our competitive advantage is that we penalize ourselves for late shipments and deliveries. Knowing where our shipments are at all times has a direct correlation to our bottom line and profitability.
Additionally, I don't want to commit capital to building out a server environment.
Flowlogistic is rolling out their real-time inventory tracking system. The tracking devices will all send package-tracking messages, which will now go to a single Google Cloud Pub/Sub topic instead of the Apache Kafka cluster. A subscriber application will then process the messages for real-time reporting and store them in Google BigQuery for historical analysis. You want to ensure the package data can be analyzed over time.
Which approach should you take?
- A. Use the automatically generated timestamp from Cloud Pub/Sub to order the data.
- B. Use the NOW() function in BigQuery to record the event's time.
- C. Attach the timestamp on each message in the Cloud Pub/Sub subscriber application as they are received.
- D. Attach the timestamp and Package ID on the outbound message from each publisher device as they are sent to Cloud Pub/Sub.
Answer: D
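A minimal sketch of what the chosen option implies, assuming each device stamps its own event time and package ID before publishing. The helper names and attribute keys (`package_id`, `event_timestamp`) are hypothetical, and no Pub/Sub client is used here; the point is only that messages can be reordered by event time even when they arrive out of order.

```python
import json
from datetime import datetime, timezone


def build_tracking_message(package_id, payload, event_time=None):
    """Return (body, attributes) with the device-side event time and package ID attached."""
    event_time = event_time or datetime.now(timezone.utc)
    body = json.dumps(payload).encode("utf-8")
    attributes = {
        "package_id": package_id,                  # hypothetical attribute name
        "event_timestamp": event_time.isoformat(),  # hypothetical attribute name
    }
    return body, attributes


def order_by_event_time(messages):
    """Sort received (body, attributes) pairs by the publisher-supplied event time."""
    return sorted(
        messages,
        key=lambda m: datetime.fromisoformat(m[1]["event_timestamp"]),
    )


# Two scans of the same package that arrive in the wrong order.
late = build_tracking_message(
    "PKG-1", {"status": "delivered"}, datetime(2022, 1, 1, 12, 30, tzinfo=timezone.utc)
)
early = build_tracking_message(
    "PKG-1", {"status": "in transit"}, datetime(2022, 1, 1, 9, 0, tzinfo=timezone.utc)
)
received = [late, early]
print([json.loads(body)["status"] for body, _ in order_by_event_time(received)])
```

A timestamp assigned on receipt would record arrival order instead, which is why the sketch keeps the time chosen by the device.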
NEW QUESTION 121
Which of the following is NOT true about Dataflow pipelines?
- A. Dataflow pipelines use a unified programming model, so can work both with streaming and batch data sources
- B. Dataflow pipelines are tied to Dataflow, and cannot be run on any other runner
- C. Dataflow pipelines can be programmed in Java
- D. Dataflow pipelines can consume data from other Google Cloud services
Answer: B
Explanation:
Dataflow pipelines can also run on alternate runtimes like Spark and Flink, as they are built using the Apache Beam SDKs.
Reference: https://cloud.google.com/dataflow/
NEW QUESTION 122
You decided to use Cloud Datastore to ingest vehicle telemetry data in real time. You want to build a storage system that will account for the long-term data growth, while keeping the costs low. You also want to create snapshots of the data periodically, so that you can make a point-in-time (PIT) recovery, or clone a copy of the data for Cloud Datastore in a different environment. You want to archive these snapshots for a long time. Which two methods can accomplish this? (Choose two.)
- A. Write an application that uses Cloud Datastore client libraries to read all the entities. Format the exported data into a JSON file. Apply compression before storing the data in Cloud Source Repositories.
- B. Use managed export, and then import to Cloud Datastore in a separate project under a unique namespace reserved for that export.
- C. Use managed export, and then import the data into a BigQuery table created just for that export, and delete temporary export files.
- D. Use managed export, and store the data in a Cloud Storage bucket using Nearline or Coldline class.
- E. Write an application that uses Cloud Datastore client libraries to read all the entities. Treat each entity as a BigQuery table row via BigQuery streaming insert. Assign an export timestamp for each export, and attach it as an extra column for each row. Make sure that the BigQuery table is partitioned using the export timestamp column.
Answer: A,C
NEW QUESTION 123
You need to store and analyze social media postings in Google BigQuery at a rate of 10,000 messages per minute in near real-time. Initially, design the application to use streaming inserts for individual postings.
Your application also performs data aggregations right after the streaming inserts. You discover that the queries after streaming inserts do not exhibit strong consistency, and reports from the queries might miss in-flight data. How can you adjust your application design?
- A. Convert the streaming insert code to batch load for individual messages.
- B. Load the original message to Google Cloud SQL, and export the table every hour to BigQuery via streaming inserts.
- C. Estimate the average latency for data availability after streaming inserts, and always run queries after waiting twice as long.
- D. Re-write the application to load accumulated data every 2 minutes.
Answer: D
NEW QUESTION 124
Which is the preferred method to use to avoid hotspotting in time series data in Bigtable?
- A. Hashing
- B. Field promotion
- C. Randomization
- D. Salting
Answer: B
Explanation:
By default, prefer field promotion. Field promotion avoids hotspotting in almost all cases, and it tends to make it easier to design a row key that facilitates queries.
Reference: https://cloud.google.com/bigtable/docs/schema-design-time-series#ensure_that_your_row_key_avoids_hotspotting
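Field promotion means moving a field out of the column data and into the row key. The sketch below, with hypothetical sensor readings and helper names, contrasts a timestamp-only key with a key that has `device_id` promoted in front of the timestamp.

```python
# Hypothetical readings, used only for illustration.
readings = [
    {"device_id": "sensor-b", "timestamp": "2022-01-01T00:00:00Z", "temp": 21.5},
    {"device_id": "sensor-a", "timestamp": "2022-01-01T00:00:00Z", "temp": 19.0},
    {"device_id": "sensor-a", "timestamp": "2022-01-01T00:15:00Z", "temp": 19.2},
]


def timestamp_only_key(reading):
    """Row key built from the timestamp alone: new writes all land at the end of the key range."""
    return reading["timestamp"]


def promoted_key(reading):
    """Row key with device_id promoted from the column data into the key."""
    return f"{reading['device_id']}#{reading['timestamp']}"


def to_row(reading):
    """Return (row_key, column_data) with the promoted fields removed from the columns."""
    columns = {k: v for k, v in reading.items() if k not in ("device_id", "timestamp")}
    return promoted_key(reading), columns


rows = sorted((to_row(r) for r in readings), key=lambda row: row[0])
for key, columns in rows:
    print(key, columns)
```

With the promoted key, rows for one device sort next to each other, so a per-device time range becomes a contiguous scan while writes from different devices fall in different parts of the key space.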
NEW QUESTION 125
Cloud Dataproc is a managed Apache Hadoop and Apache _____ service.
- A. Ignite
- B. Blaze
- C. Spark
- D. Fire
Answer: C
Explanation:
Cloud Dataproc is a managed Apache Spark and Apache Hadoop service that lets you use open source data tools for batch processing, querying, streaming, and machine learning.
Reference: https://cloud.google.com/dataproc/docs/
NEW QUESTION 126
Your company is implementing a data warehouse using BigQuery, and you have been tasked with designing the data model. You move your on-premises sales data warehouse with a star data schema to BigQuery but notice performance issues when querying the data of the past 30 days. Based on Google's recommended practices, what should you do to speed up the query without increasing storage costs?
- A. Partition the data by transaction date
- B. Denormalize the data
- C. Shard the data by customer ID
- D. Materialize the dimensional data in views
Answer: A
NEW QUESTION 127
You are building a real-time prediction engine that streams files, which may contain PII (personally identifiable information) data, into Cloud Storage and eventually into BigQuery. You want to ensure that the sensitive data is masked but still maintains referential integrity, because names and emails are often used as join keys. How should you use the Cloud Data Loss Prevention API (DLP API) to ensure that the PII data is not accessible by unauthorized individuals?
- A. Scan every table in BigQuery, and mask the data it finds that has PII.
- B. Create a pseudonym by replacing the PII data with cryptographic tokens, and store the non-tokenized data in a locked-down bucket.
- C. Create a pseudonym by replacing PII data with a cryptographic format-preserving token.
- D. Redact all PII data, and store a version of the unredacted data in a locked-down bucket.
Answer: B
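The reason pseudonymous tokens keep join keys usable is that the same input always maps to the same token. The sketch below illustrates that property with a keyed HMAC from the Python standard library; it is a stand-in for the idea only, not the DLP API, and the key, the `tok_` prefix, and the `pseudonymize` helper are all hypothetical.

```python
import hashlib
import hmac

# Hypothetical demo key; a real key would be stored and access-controlled separately.
SECRET_KEY = b"demo-key-not-for-production"


def pseudonymize(value: str, key: bytes = SECRET_KEY) -> str:
    """Return a deterministic token for value (input is trimmed and lowercased first)."""
    normalized = value.strip().lower().encode("utf-8")
    digest = hmac.new(key, normalized, hashlib.sha256).hexdigest()
    return "tok_" + digest[:16]


orders = [{"email": "ana@example.com", "total": 40}]
tickets = [{"email": "Ana@Example.com ", "issue": "late delivery"}]

# Tokenize the join key in both tables; the raw emails never reach the joined output.
masked_orders = [{**o, "email": pseudonymize(o["email"])} for o in orders]
masked_tickets = [{**t, "email": pseudonymize(t["email"])} for t in tickets]

joined = [
    (o["total"], t["issue"])
    for o in masked_orders
    for t in masked_tickets
    if o["email"] == t["email"]
]
print(joined)
```

Redaction, by contrast, would replace every email with the same placeholder and the join in this sketch would no longer distinguish customers.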
NEW QUESTION 128
You create a new report for your large team in Google Data Studio 360. The report uses Google BigQuery as its data source. It is company policy to ensure employees can view only the data associated with their region, so you create and populate a table for each region. You need to enforce the regional access policy to the data.
Which two actions should you take? (Choose two.)
- A. Adjust the settings for each dataset to allow a related region-based security group view access.
- B. Ensure all the tables are included in a global dataset.
- C. Ensure each table is included in a dataset for a region.
- D. Adjust the settings for each view to allow a related region-based security group view access.
- E. Adjust the settings for each table to allow a related region-based security group view access.
Answer: C,D
NEW QUESTION 129
Business owners at your company have given you a database of bank transactions. Each row contains the user ID, transaction type, transaction location, and transaction amount. They ask you to investigate what type of machine learning can be applied to the data. Which three machine learning applications can you use? (Choose three.)
- A. Unsupervised learning to predict the location of a transaction.
- B. Clustering to divide the transactions into N categories based on feature similarity.
- C. Supervised learning to predict the location of a transaction.
- D. Unsupervised learning to determine which transactions are most likely to be fraudulent.
- E. Supervised learning to determine which transactions are most likely to be fraudulent.
- F. Reinforcement learning to predict the location of a transaction.
Answer: B,D,F
NEW QUESTION 130
You work for an advertising company, and you've developed a Spark ML model to predict click-through rates at advertisement blocks. You've been developing everything at your on-premises data center, and now your company is migrating to Google Cloud. Your data center will be closing soon, so a rapid lift-and-shift migration is necessary. However, the data you've been using will be migrated to BigQuery.
You periodically retrain your Spark ML models, so you need to migrate existing training pipelines to Google Cloud. What should you do?
- A. Use Cloud Dataproc for training existing Spark ML models, but start reading data directly from BigQuery
- B. Rewrite your models on TensorFlow, and start using Cloud ML Engine
- C. Spin up a Spark cluster on Compute Engine, and train Spark ML models on the data exported from BigQuery
- D. Use Cloud ML Engine for training existing Spark ML models
Answer: A
NEW QUESTION 131
You are designing a data processing pipeline. The pipeline must be able to scale automatically as load increases. Messages must be processed at least once and must be ordered within windows of 1 hour. How should you design the solution?
- A. Use Apache Kafka for message ingestion and use Cloud Dataflow for streaming analysis.
- B. Use Apache Kafka for message ingestion and use Cloud Dataproc for streaming analysis.
- C. Use Cloud Pub/Sub for message ingestion and Cloud Dataflow for streaming analysis.
- D. Use Cloud Pub/Sub for message ingestion and Cloud Dataproc for streaming analysis.
Answer: D
NEW QUESTION 132
What are two methods that can be used to denormalize tables in BigQuery?
- A. 1) Split table into multiple tables; 2) Use a partitioned table
- B. 1) Join tables into one table; 2) Use nested repeated fields
- C. 1) Use a partitioned table; 2) Join tables into one table
- D. 1) Use nested repeated fields; 2) Use a partitioned table
Answer: B
Explanation:
The conventional method of denormalizing data involves simply writing a fact, along with all its dimensions, into a flat table structure. For example, if you are dealing with sales transactions, you would write each individual fact to a record, along with the accompanying dimensions such as order and customer information.
The other method for denormalizing data takes advantage of BigQuery's native support for nested and repeated structures in JSON or Avro input data. Expressing records using nested and repeated structures can provide a more natural representation of the underlying data. In the case of the sales order, the outer part of a JSON structure would contain the order and customer information, and the inner part of the structure would contain the individual line items of the order, which would be represented as nested, repeated elements.
Reference: https://cloud.google.com/solutions/bigquery-data-warehouse#denormalizing_data
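The two approaches described in the explanation can be sketched with plain Python records. The field names and the `nest_orders` helper are hypothetical; the flat rows repeat the order and customer details on every line item, while the nested form keeps one record per order with the line items as a repeated element.

```python
import json

# Flat, joined representation: one row per line item, dimensions repeated on each row.
flat_rows = [
    {"order_id": 1, "customer": "Acme", "sku": "bread", "quantity": 2},
    {"order_id": 1, "customer": "Acme", "sku": "milk", "quantity": 1},
    {"order_id": 2, "customer": "Globex", "sku": "gasoline", "quantity": 30},
]


def nest_orders(rows):
    """Group flat rows into one record per order with nested, repeated line items."""
    orders = {}
    for row in rows:
        order = orders.setdefault(
            row["order_id"],
            {"order_id": row["order_id"], "customer": row["customer"], "line_items": []},
        )
        order["line_items"].append({"sku": row["sku"], "quantity": row["quantity"]})
    return list(orders.values())


nested = nest_orders(flat_rows)
# Newline-delimited JSON, one record per order.
for record in nested:
    print(json.dumps(record))
```

In the nested form the outer part of each record carries the order and customer information and the inner list carries the line items, matching the structure the explanation describes.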
NEW QUESTION 133
You have a requirement to insert minute-resolution data from 50,000 sensors into a BigQuery table. You expect significant growth in data volume and need the data to be available within 1 minute of ingestion for real-time analysis of aggregated trends. What should you do?
- A. Use bq load to load a batch of sensor data every 60 seconds.
- B. Use a Cloud Dataflow pipeline to stream data into the BigQuery table.
- C. Use the INSERT statement to insert a batch of data every 60 seconds.
- D. Use the MERGE statement to apply updates in batch every 60 seconds.
Answer: C
NEW QUESTION 134
You used Cloud Dataprep to create a recipe on a sample of data in a BigQuery table. You want to reuse this recipe on a daily upload of data with the same schema, after the load job with variable execution time completes. What should you do?
- A. Export the recipe as a Cloud Dataprep template, and create a job in Cloud Scheduler.
- B. Export the Cloud Dataprep job as a Cloud Dataflow template, and incorporate it into a Cloud Composer job.
- C. Create an App Engine cron job to schedule the execution of the Cloud Dataprep job.
- D. Create a cron schedule in Cloud Dataprep.
Answer: B
NEW QUESTION 135
MJTelco Case Study
Company Overview
MJTelco is a startup that plans to build networks in rapidly growing, underserved markets around the world.
The company has patents for innovative optical communications hardware. Based on these patents, they can create many reliable, high-speed backbone links with inexpensive hardware.
Company Background
Founded by experienced telecom executives, MJTelco uses technologies originally developed to overcome communications challenges in space. Fundamental to their operation, they need to create a distributed data infrastructure that drives real-time analysis and incorporates machine learning to continuously optimize their topologies. Because their hardware is inexpensive, they plan to overdeploy the network allowing them to account for the impact of dynamic regional politics on location availability and cost.
Their management and operations teams are situated all around the globe, creating a many-to-many relationship between data consumers and providers in their system. After careful consideration, they decided public cloud is the perfect environment to support their needs.
Solution Concept
MJTelco is running a successful proof-of-concept (PoC) project in its labs. They have two primary needs:
* Scale and harden their PoC to support significantly more data flows generated when they ramp to more than 50,000 installations.
* Refine their machine-learning cycles to verify and improve the dynamic models they use to control topology definition.
MJTelco will also use three separate operating environments - development/test, staging, and production - to meet the needs of running experiments, deploying new features, and serving production customers.
Business Requirements
* Scale up their production environment with minimal cost, instantiating resources when and where needed in an unpredictable, distributed telecom user community.
* Ensure security of their proprietary data to protect their leading-edge machine learning and analysis.
* Provide reliable and timely access to data for analysis from distributed research workers
* Maintain isolated environments that support rapid iteration of their machine-learning models without affecting their customers.
Technical Requirements
* Ensure secure and efficient transport and storage of telemetry data
* Rapidly scale instances to support between 10,000 and 100,000 data providers with multiple flows each.
* Allow analysis and presentation against data tables tracking up to 2 years of data storing approximately 100m records/day
* Support rapid iteration of monitoring infrastructure focused on awareness of data pipeline problems both in telemetry flows and in production learning cycles.
CEO Statement
Our business model relies on our patents, analytics and dynamic machine learning. Our inexpensive hardware is organized to be highly reliable, which gives us cost advantages. We need to quickly stabilize our large distributed data pipelines to meet our reliability and capacity commitments.
CTO Statement
Our public cloud services must operate as advertised. We need resources that scale and keep our data secure. We also need environments in which our data scientists can carefully study and quickly adapt our models. Because we rely on automation to process our data, we also need our development and test environments to work as we iterate.
CFO Statement
The project is too large for us to maintain the hardware and software required for the data and analysis. Also, we cannot afford to staff an operations team to monitor so many data feeds, so we will rely on automation and infrastructure. Google Cloud's machine learning will allow our quantitative researchers to work on our high-value problems instead of problems with our data pipelines.
MJTelco needs you to create a schema in Google Bigtable that will allow for the historical analysis of the last 2 years of records. Each record that comes in is sent every 15 minutes, and contains a unique identifier of the device and a data record. The most common query is for all the data for a given device for a given day. Which schema should you use?
- A. Rowkey: data_point
  Column data: device_id, date
- B. Rowkey: date
  Column data: device_id, data_point
- C. Rowkey: device_id
  Column data: date, data_point
- D. Rowkey: date#device_id
  Column data: data_point
- E. Rowkey: date#data_point
  Column data: device_id
Answer: A