
Get Professional-Data-Engineer Products Practice Material for Professional-Data-Engineer Exam Question Preparation
Google Professional Data Engineer Exam Practice Test Questions
The Google Professional Data Engineer certification evaluates candidates' skills in designing data processing systems and ensuring solution quality. It also measures their competence in building and operationalizing data processing systems and in operationalizing ML models. Applicants must pass a single exam to get certified.
NEW QUESTION 48
As your organization expands its usage of GCP, many teams have started to create their own projects.
Projects are further multiplied to accommodate different stages of deployments and target audiences. Each project requires unique access control configurations. The central IT team needs to have access to all projects.
Furthermore, data from Cloud Storage buckets and BigQuery datasets must be shared for use in other projects in an ad hoc way. You want to simplify access control management by minimizing the number of policies.
Which two steps should you take? (Choose two.)
- A. For each Cloud Storage bucket or BigQuery dataset, decide which projects need access. Find all the active members who have access to these projects, and create a Cloud IAM policy to grant access to all these users.
- B. Use Cloud Deployment Manager to automate access provision.
- C. Introduce resource hierarchy to leverage access control policy inheritance.
- D. Create distinct groups for various teams, and specify groups in Cloud IAM policies.
- E. Only use service accounts when sharing data for Cloud Storage buckets and BigQuery datasets.
Answer: B,D
NEW QUESTION 49
You are designing an Apache Beam pipeline to enrich data from Cloud Pub/Sub with static reference data from BigQuery. The reference data is small enough to fit in memory on a single worker. The pipeline should write enriched results to BigQuery for analysis. Which job type and transforms should this pipeline use?
- A. Streaming job, PubSubIO, BigQueryIO, side-inputs
- B. Streaming job, PubSubIO, JdbcIO, side-outputs
- C. Batch job, PubSubIO, side-inputs
- D. Streaming job, PubSubIO, BigQueryIO, side-outputs
Answer: A
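A side input is an extra, read-only input that a transform can consult while it processes each element of the main input. The sketch below is a plain-Python analogue of that enrichment pattern, not the Apache Beam API. The function `enrich_stream`, the `reference` dictionary, and the field names are hypothetical.

```python
from typing import Dict, Iterable, Iterator


def enrich_stream(
    messages: Iterable[dict], reference: Dict[str, str]
) -> Iterator[dict]:
    """Join each streamed message against a small in-memory lookup table.

    `reference` plays the role of the side input: it is loaded once and is
    read-only while the (potentially unbounded) message stream flows through.
    """
    for msg in messages:
        enriched = dict(msg)  # copy so the input message is not mutated
        enriched["product_name"] = reference.get(msg["product_id"], "UNKNOWN")
        yield enriched


# Hypothetical reference data (it would come from BigQuery in the question).
reference = {"p1": "Widget", "p2": "Gadget"}
# Hypothetical messages (they would arrive from Cloud Pub/Sub in the question).
messages = [{"product_id": "p1", "qty": 2}, {"product_id": "p9", "qty": 1}]

for row in enrich_stream(messages, reference):
    print(row)
```

Because the lookup table fits in memory, every message can be enriched without a per-element query against the reference store.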
NEW QUESTION 50
Your company is running its first dynamic campaign, serving different offers by analyzing real-time data during the holiday season. The data scientists are collecting terabytes of data that rapidly grows every hour during their 30-day campaign. They are using Google Cloud Dataflow to preprocess the data and collect the feature (signals) data that is needed for the machine learning model in Google Cloud Bigtable. The team is observing suboptimal performance with reads and writes of their initial load of 10 TB of data. They want to improve this performance while minimizing cost. What should they do?
- A. Redesign the schema to use a single row key to identify values that need to be updated frequently in the cluster.
- B. Redesign the schema to use row keys based on numeric IDs that increase sequentially per user viewing the offers.
- C. The performance issue should be resolved over time as the size of the Bigtable cluster is increased.
- D. Redefine the schema by evenly distributing reads and writes across the row space of the table.
Answer: D
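Bigtable keeps rows sorted by row key, so keys that begin with a sequential value place consecutive writes next to each other. The sketch below compares a timestamp-first key with a key that leads with a well-distributed field. It is only a rough illustration: grouping by the first character stands in for key ranges, which the real service splits by load, and the key layouts, user IDs, and helper names are hypothetical.

```python
from collections import Counter


def timestamp_first_key(user_id: str, ts: int) -> str:
    # Sequential prefix: consecutive writes sort next to each other.
    return f"{ts:010d}#{user_id}"


def user_first_key(user_id: str, ts: int) -> str:
    # Field promotion: a well-distributed field leads the key.
    return f"{user_id}#{ts:010d}"


def spread(keys, prefix_len=1):
    """Count keys per leading prefix, a rough stand-in for key ranges."""
    return Counter(k[:prefix_len] for k in keys)


# Hypothetical workload: 8 users writing during 1,000 consecutive seconds.
users = ["a17", "c42", "f03", "k88", "m51", "q29", "t64", "z90"]
events = [(users[i % len(users)], 1_700_000_000 + i) for i in range(1000)]

seq = spread(timestamp_first_key(u, t) for u, t in events)
promoted = spread(user_first_key(u, t) for u, t in events)

print("timestamp-first:", dict(seq))       # every key starts with "1"
print("user-first:     ", dict(promoted))  # 125 keys under each of 8 prefixes
```

All 1,000 timestamp-first keys share one prefix, while the user-first keys divide evenly across eight prefixes, which is the kind of distribution option D asks for.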
NEW QUESTION 51
If you want to create a machine learning model that predicts the price of a particular stock based on its recent price history, what type of estimator should you use?
- A. Regressor
- B. Unsupervised learning
- C. Clustering estimator
- D. Classifier
Answer: A
Explanation:
Regression is the supervised learning task for modeling and predicting continuous, numeric variables. Examples include predicting real-estate prices, stock price movements, or student test scores.
Classification is the supervised learning task for modeling and predicting categorical variables. Examples include predicting employee churn, email spam, financial fraud, or student letter grades.
Clustering is an unsupervised learning task for finding natural groupings of observations (i.e. clusters) based on the inherent structure within your dataset. Examples include customer segmentation, grouping similar items in e-commerce, and social network analysis.
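To make the regression case concrete, here is a minimal sketch of an ordinary least squares fit written in plain Python. The prices are made-up sample values and `fit_line` is a hypothetical helper; a real stock model would need far richer features.

```python
from typing import Sequence, Tuple


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical closing prices for five consecutive days.
days = [1, 2, 3, 4, 5]
prices = [100.0, 102.0, 104.0, 106.0, 108.0]

slope, intercept = fit_line(days, prices)
predicted_day_6 = slope * 6 + intercept  # a continuous value, not a class label
print(predicted_day_6)
```

The output is a number on a continuous scale, which is what distinguishes a regressor from a classifier or a clustering estimator.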
NEW QUESTION 52
Your company handles data processing for a number of different clients. Each client prefers to use their own suite of analytics tools, with some allowing direct query access via Google BigQuery. You need to secure the data so that clients cannot see each other's data. You want to ensure appropriate access to the data. Which three steps should you take? (Choose three.)
- A. Use the appropriate identity and access management (IAM) roles for each client's users.
- B. Put each client's BigQuery dataset into a different table.
- C. Load data into a different dataset for each client.
- D. Load data into different partitions.
- E. Only allow a service account to access the datasets.
- F. Restrict a client's dataset to approved users.
Answer: A,C,F
NEW QUESTION 53
Which of the following statements about the Wide & Deep Learning model are true? (Select 2 answers.)
- A. A good use for the wide and deep model is a small-scale linear regression problem.
- B. The wide model is used for generalization, while the deep model is used for memorization.
- C. A good use for the wide and deep model is a recommender system.
- D. The wide model is used for memorization, while the deep model is used for generalization.
Answer: C,D
Explanation:
Can we teach computers to learn like humans do, by combining the power of memorization and generalization? It's not an easy question to answer, but by jointly training a wide linear model (for memorization) alongside a deep neural network (for generalization), one can combine the strengths of both to bring us one step closer. At Google, we call it Wide & Deep Learning. It's useful for generic large-scale regression and classification problems with sparse inputs (categorical features with a large number of possible feature values), such as recommender systems, search, and ranking problems.
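The joint structure described above can be sketched as a single forward pass: a linear "wide" term and a small feed-forward "deep" term are added together before one shared sigmoid. This is a toy illustration in plain Python, not the TensorFlow estimator. The layer sizes, weights, and function names are hypothetical, and training is omitted.

```python
import math
from typing import List


def relu(v: List[float]) -> List[float]:
    return [max(0.0, x) for x in v]


def dense(x: List[float], w: List[List[float]], b: List[float]) -> List[float]:
    """Fully connected layer: one output per row of `w`."""
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi for row, bi in zip(w, b)]


def wide_and_deep(wide_x, deep_x, wide_w, hidden_w, hidden_b, out_w, bias):
    # Wide part: linear model over sparse (often crossed) features.
    wide_logit = sum(w * x for w, x in zip(wide_w, wide_x))
    # Deep part: a small feed-forward network over dense features.
    hidden = relu(dense(deep_x, hidden_w, hidden_b))
    deep_logit = sum(w * h for w, h in zip(out_w, hidden))
    # Both logits are summed and share one sigmoid output.
    return 1.0 / (1.0 + math.exp(-(wide_logit + deep_logit + bias)))


# Hypothetical inputs and weights, chosen only to exercise the function.
probability = wide_and_deep(
    wide_x=[1.0, 0.0, 1.0],
    deep_x=[0.2, 0.4],
    wide_w=[0.5, -0.2, 0.3],
    hidden_w=[[1.0, -1.0], [0.5, 0.5]],
    hidden_b=[0.0, 0.0],
    out_w=[1.0, 2.0],
    bias=-1.4,
)
print(probability)
```

The wide weights attach directly to individual sparse features, which is what lets that half memorize specific feature combinations, while the hidden layer lets the deep half generalize across them.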
NEW QUESTION 54
You work for a shipping company that has distribution centers where packages move on delivery lines to route them properly. The company wants to add cameras to the delivery lines to detect and track any visual damage to the packages in transit. You need to create a way to automate the detection of damaged packages and flag them for human review in real time while the packages are in transit. Which solution should you choose?
- A. Use BigQuery machine learning to be able to train the model at scale, so you can analyze the packages in batches.
- B. Use the Cloud Vision API to detect damage, and raise an alert through Cloud Functions. Integrate the package tracking applications with this function.
- C. Use TensorFlow to create a model that is trained on your corpus of images. Create a Python notebook in Cloud Datalab that uses this model so you can analyze for damaged packages.
- D. Train an AutoML model on your corpus of images, and build an API around that model to integrate with the package tracking applications.
Answer: A
NEW QUESTION 55
You want to build a managed Hadoop system as your data lake. The data transformation process is composed of a series of Hadoop jobs executed in sequence. To accomplish the design of separating storage from compute, you decided to use the Cloud Storage connector to store all input data, output data, and intermediary data. However, you noticed that one Hadoop job runs very slowly with Cloud Dataproc, when compared with the on-premises bare-metal Hadoop environment (8-core nodes with 100-GB RAM). Analysis shows that this particular Hadoop job is disk I/O intensive. You want to resolve the issue. What should you do?
- A. Allocate an additional network interface card (NIC), and configure link aggregation in the operating system to use the combined throughput when working with Cloud Storage
- B. Allocate sufficient persistent disk space to the Hadoop cluster, and store the intermediate data of that particular Hadoop job on native HDFS
- C. Allocate more CPU cores of the virtual machine instances of the Hadoop cluster so that the networking bandwidth for each instance can scale up
- D. Allocate sufficient memory to the Hadoop cluster, so that the intermediary data of that particular Hadoop job can be held in memory
Answer: B
NEW QUESTION 56
You are the head of BI at a large enterprise company with multiple business units that each have different priorities and budgets. You use on-demand pricing for BigQuery with a quota of 2K concurrent on-demand slots per project. Users at your organization sometimes don't get slots to execute their queries, and you need to correct this. You'd like to avoid introducing new projects to your account.
What should you do?
- A. Switch to flat-rate pricing and establish a hierarchical priority model for your projects.
- B. Increase the number of concurrent slots per project on the Quotas page in the Cloud Console.
- C. Create an additional project to overcome the 2K on-demand per-project quota.
- D. Convert your batch BQ queries into interactive BQ queries.
Answer: A
Explanation:
Reference: https://cloud.google.com/blog/products/gcp/busting-12-myths-about-bigquery
NEW QUESTION 57
You are updating the code for a subscriber to a Pub/Sub feed. You are concerned that upon deployment the subscriber may erroneously acknowledge messages, leading to message loss. Your subscriber is not set up to retain acknowledged messages. What should you do to ensure that you can recover from errors after deployment?
- A. Enable dead-lettering on the Pub/Sub topic to capture messages that aren't successfully acknowledged. If an error occurs after deployment, re-deliver any messages captured by the dead-letter queue.
- B. Use Cloud Build for your deployment. If an error occurs after deployment, use a Seek operation to locate a timestamp logged by Cloud Build at the start of the deployment.
- C. Set up the Pub/Sub emulator on your local machine. Validate the behavior of your new subscriber logic before deploying it to production.
- D. Create a Pub/Sub snapshot before deploying new subscriber code. Use a Seek operation to re-deliver messages that became available after the snapshot was created.
Answer: D
Explanation:
Reference: https://cloud.google.com/pubsub/docs/replay-overview
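The snapshot-and-seek mechanism named in option D can be pictured with a toy in-memory model. This is not the Pub/Sub client library: `ToySubscription` and its methods are hypothetical, and the model keeps every message in a list for simplicity, whereas the real service relies on the snapshot to retain the needed backlog.

```python
class ToySubscription:
    """In-memory stand-in for a subscription's acknowledgment state."""

    def __init__(self):
        self.log = []         # every published payload; index = message ID
        self.unacked = set()  # IDs still awaiting acknowledgment

    def publish(self, payload):
        self.log.append(payload)
        self.unacked.add(len(self.log) - 1)

    def ack(self, message_id):
        self.unacked.discard(message_id)

    def snapshot(self):
        # Capture the unacknowledged backlog and the current log position.
        return frozenset(self.unacked), len(self.log)

    def seek(self, snap):
        # Restore the captured backlog plus everything published afterwards.
        unacked_then, log_len_then = snap
        published_since = range(log_len_then, len(self.log))
        self.unacked = set(unacked_then) | set(published_since)


sub = ToySubscription()
sub.publish("order-1")
sub.ack(0)                  # handled correctly by the old subscriber
sub.publish("order-2")      # still outstanding when the snapshot is taken
snap = sub.snapshot()       # taken just before the new code is deployed

sub.publish("order-3")
sub.ack(1)                  # buggy new subscriber acks without processing
sub.ack(2)

sub.seek(snap)              # roll the acknowledgment state back
print(sorted(sub.unacked))  # IDs 1 and 2 are deliverable again
```

Message 0 stays acknowledged because it was handled before the snapshot, while the two messages the faulty deployment acknowledged become available for redelivery.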
NEW QUESTION 58
What Dataflow concept determines when a Window's contents should be output based on certain criteria being met?
- A. OutputCriteria
- B. Sessions
- C. Triggers
- D. Windows
Answer: C
Explanation:
Triggers control when the elements for a specific key and window are output. As elements arrive, they are put into one or more windows by a Window transform and its associated WindowFn, and then passed to the associated Trigger to determine whether the window's contents should be output.
Reference: https://cloud.google.com/dataflow/java-sdk/JavaDoc/com/google/cloud/dataflow/sdk/transforms/windowing/Trigger
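The division of labor between windows and triggers can be sketched with a small simulation. This is plain Python, not the Dataflow or Beam SDK: the functions `fixed_window` and `run_with_count_trigger` are hypothetical, and the only trigger modeled is "fire once a pane holds N elements".

```python
from collections import defaultdict


def fixed_window(timestamp: int, size: int) -> int:
    """Return the start of the fixed window that contains `timestamp`."""
    return timestamp - timestamp % size


def run_with_count_trigger(events, window_size, fire_after):
    """Yield (key, window_start, pane) each time the trigger condition is met.

    events: iterable of (key, timestamp, value) tuples in arrival order.
    """
    buffers = defaultdict(list)  # (key, window_start) -> buffered values
    for key, ts, value in events:
        slot = (key, fixed_window(ts, window_size))
        buffers[slot].append(value)
        # The window decided where the element goes; the trigger decides
        # when the buffered contents are emitted.
        if len(buffers[slot]) >= fire_after:
            yield key, slot[1], buffers.pop(slot)


events = [("a", 1, 10), ("a", 3, 20), ("b", 4, 5), ("a", 12, 30), ("a", 14, 40)]
for pane in run_with_count_trigger(events, window_size=10, fire_after=2):
    print(pane)
```

Key "b" never reaches two elements, so nothing is emitted for it here. A real pipeline would typically also fire when the watermark passes the end of the window, which this sketch leaves out.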
NEW QUESTION 59
You are responsible for writing your company's ETL pipelines to run on an Apache Hadoop cluster. The pipeline will require some checkpointing and splitting pipelines. Which method should you use to write the pipelines?
- A. HiveQL using Hive
- B. Python using MapReduce
- C. PigLatin using Pig
- D. Java using MapReduce
Answer: C
Explanation:
Pig is a scripting language that can be used for checkpointing and splitting pipelines.
NEW QUESTION 60
To give a user read permission for only the first three columns of a table, which access control method would you use?
- A. Primitive role
- B. Predefined role
- C. It's not possible to give access to only the first three columns of a table.
- D. Authorized view
Answer: D
Explanation:
An authorized view allows you to share query results with particular users and groups without giving them read access to the underlying tables. Authorized views can only be created in a dataset that does not contain the tables queried by the view.
When you create an authorized view, you use the view's SQL query to restrict access to only the rows and columns you want the users to see.
Reference: https://cloud.google.com/bigquery/docs/views#authorized-views
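How a view's SQL limits the visible columns can be shown with SQLite from the Python standard library. This is only an analogy: SQLite has no access control, and in BigQuery the view must additionally be authorized on the source dataset. The table, view, and column names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees (id INTEGER, name TEXT, dept TEXT, salary REAL, ssn TEXT)"
)
conn.execute("INSERT INTO employees VALUES (1, 'Ada', 'Eng', 150000, '000-00-0000')")

# The view's SELECT list is what limits the visible columns.
conn.execute("CREATE VIEW employees_public AS SELECT id, name, dept FROM employees")

cursor = conn.execute("SELECT * FROM employees_public")
columns = [d[0] for d in cursor.description]
rows = cursor.fetchall()
print(columns, rows)
```

A user who can query only the view sees the first three columns and nothing else, which is the effect the authorized view achieves once access to the underlying table is withheld.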
NEW QUESTION 61
You have historical data covering the last three years in BigQuery and a data pipeline that delivers new data to BigQuery daily. You have noticed that when the Data Science team runs a query filtered on a date column and limited to 30-90 days of data, the query scans the entire table. You also noticed that your bill is increasing more quickly than you expected. You want to resolve the issue as cost-effectively as possible while maintaining the ability to conduct SQL queries. What should you do?
- A. Recommend that the Data Science team export the table to a CSV file on Cloud Storage and use Cloud Datalab to explore the data by reading the files directly.
- B. Write an Apache Beam pipeline that creates a BigQuery table per day. Recommend that the Data Science team use wildcards on the table name suffixes to select the data they need.
- C. Modify your pipeline to maintain the last 30-90 days of data in one table and the longer history in a different table to minimize full table scans over the entire history.
- D. Re-create the tables using DDL. Partition the tables by a column containing a TIMESTAMP or DATE Type.
Answer: D
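The cost effect of the date partitioning described in option D can be approximated with a small simulation that counts rows read. This is a toy model in plain Python, not BigQuery: the table contents and the helpers `full_scan`, `build_partitions`, and `pruned_scan` are hypothetical, and rows read stand in for bytes billed.

```python
from datetime import date, timedelta

# Hypothetical table: one row per day for three years (1,095 rows).
first_day = date(2021, 1, 1)
rows = [{"day": first_day + timedelta(days=i), "value": i} for i in range(1095)]


def full_scan(rows, start, end):
    """Unpartitioned table: every row is read, then filtered."""
    hits = [r for r in rows if start <= r["day"] <= end]
    return hits, len(rows)


def build_partitions(rows):
    """Group rows by their date column, one partition per day."""
    parts = {}
    for r in rows:
        parts.setdefault(r["day"], []).append(r)
    return parts


def pruned_scan(parts, start, end):
    """Date-partitioned table: only partitions inside the range are read."""
    hits, scanned = [], 0
    for day, part in parts.items():
        if start <= day <= end:
            hits.extend(part)
            scanned += len(part)
    return hits, scanned


end = rows[-1]["day"]
start = end - timedelta(days=29)  # most recent 30 days
hits_full, scanned_full = full_scan(rows, start, end)
hits_pruned, scanned_pruned = pruned_scan(build_partitions(rows), start, end)
print(scanned_full, scanned_pruned)  # 1095 rows read versus 30
```

Both scans return the same 30 rows, but the partitioned version reads only the partitions that match the date filter, and the data stays queryable with ordinary SQL.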
NEW QUESTION 62
You work for a manufacturing plant that batches application log files together into a single log file once a day at 2:00 AM. You have written a Google Cloud Dataflow job to process that log file. You need to make sure the log file is processed once per day as inexpensively as possible. What should you do?
- A. Change the processing job to use Google Cloud Dataproc instead.
- B. Manually start the Cloud Dataflow job each morning when you get into the office.
- C. Create a cron job with Google App Engine Cron Service to run the Cloud Dataflow job.
- D. Configure the Cloud Dataflow job as a streaming job so that it processes the log data immediately.
Answer: C
NEW QUESTION 63
Government regulations in the banking industry mandate the protection of clients' personally identifiable information (PII). Your company requires PII to be access controlled, encrypted, and compliant with major data protection standards. In addition to using Cloud Data Loss Prevention (Cloud DLP), you want to follow Google-recommended practices and use service accounts to control access to PII. What should you do?
- A. Assign the required Identity and Access Management (IAM) roles to every employee, and create a single service account to access protected resources.
- B. Use Cloud Storage to comply with major data protection standards. Use multiple service accounts attached to IAM groups to grant the appropriate access to each group.
- C. Use one service account to access a Cloud SQL database and use separate service accounts for each human user.
- D. Use Cloud Storage to comply with major data protection standards. Use one service account shared by all users.
Answer: B
NEW QUESTION 64
You are designing a cloud-native historical data processing system to meet the following conditions:
* The data being analyzed is in CSV, Avro, and PDF formats and will be accessed by multiple analysis tools including Cloud Dataproc, BigQuery, and Compute Engine.
* A streaming data pipeline stores new data daily.
* Performance is not a factor in the solution.
* The solution design should maximize availability.
How should you design data storage for this solution?
- A. Store the data in a multi-regional Cloud Storage bucket. Access the data directly using Cloud Dataproc, BigQuery, and Compute Engine.
- B. Create a Cloud Dataproc cluster with high availability. Store the data in HDFS, and perform analysis as needed.
- C. Store the data in a regional Cloud Storage bucket. Access the bucket directly using Cloud Dataproc, BigQuery, and Compute Engine.
- D. Store the data in BigQuery. Access the data using the BigQuery Connector or Cloud Dataproc and Compute Engine.
Answer: A