SnowPro Advanced: Data Engineer (DEA-C02) Free Practice Test

Question 1

You are building a data pipeline using Snowflake Tasks to orchestrate a series of transformations. One of the tasks, 'task_transform_data', depends on the successful completion of another task, 'task_extract_data'. However, 'task_extract_data' occasionally fails due to transient network issues. You want to implement a retry mechanism for 'task_extract_data' without significantly impacting the overall pipeline execution time. Which of the following approaches is the most appropriate and efficient way to achieve this within the Snowflake Task framework?

A. Configure the task with an error notification integration that sends alerts upon failure. Monitor these alerts and manually resume the task if it fails, using 'ALTER TASK task_extract_data RESUME;'.

B. Implement a TRY...CATCH block within the task definition to catch any exceptions. Inside the CATCH block, use SYSTEM$WAIT to pause for a few seconds, then re-execute the core logic of the task. Repeat this process a limited number of times before failing the task permanently.

C. Create a new root-level task that checks the status of 'task_extract_data'. If it failed, the root-level task will execute a copy of the 'task_extract_data' task. After this, it updates the 'AFTER' condition of 'task_transform_data' to depend on the new task that retries extraction.

D. Modify the task definition to call a stored procedure. The stored procedure implements a loop with a retry counter. Inside the loop, execute the data extraction logic. If an error occurs, catch the exception, wait for a few seconds, and retry the extraction. After a specified number of retries, raise an exception to signal task failure.

E. Use the 'AFTER' keyword in the 'CREATE TASK' statement for 'task_transform_data' so that it executes only if 'task_extract_data' succeeds on its first attempt. If 'task_extract_data' fails, the entire pipeline will stop, ensuring data consistency.

Correct Answer: D
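
The retry loop described in option D can be sketched in plain Python. This is only an illustration of the control flow (retry counter, catch, wait, retry, final failure), not Snowflake stored procedure code. The names 'run_with_retries', 'ExtractionError', and the 'extract' callable are hypothetical and do not come from the question.

```python
import time


class ExtractionError(Exception):
    """Hypothetical error raised once every retry has been used up."""


def run_with_retries(extract, max_attempts=3, wait_seconds=2.0, sleep=time.sleep):
    """Call extract() until it succeeds or max_attempts is reached.

    extract is any zero-argument callable holding the extraction logic.
    sleep is injectable so the wait can be skipped in tests.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return extract()
        except Exception as exc:  # treat any failure as possibly transient
            last_error = exc
            if attempt < max_attempts:
                sleep(wait_seconds)  # short pause before the next attempt
    # All attempts failed: raise so the caller (the task) is marked as failed.
    raise ExtractionError(
        f"extraction failed after {max_attempts} attempts"
    ) from last_error
```

Raising after the last attempt matters: it lets the failure surface to the task framework, so dependent tasks do not run on missing data.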

Question 2

You are tasked with optimizing a continuous data pipeline that loads data from an external stage into a Snowflake table using streams. The pipeline is experiencing significant latency during peak hours. The stream is defined on a very large table with frequent updates and deletes. Which of the following strategies would be MOST effective in reducing the latency of the data pipeline, considering stream performance and cost implications?

A. Implement a more aggressive pruning strategy on the base table to reduce the amount of data that the stream needs to track.

B. Implement a materialized view on top of the stream to pre-aggregate the data.

C. Create multiple streams on the same base table, each filtering for specific types of changes (e.g., INSERT, UPDATE, DELETE).

D. Increase the size of the virtual warehouse used for loading data. This will provide more compute resources for processing the stream.

E. Reduce the RETENTION TIME of the stream. This will limit the amount of historical data tracked and improve performance.

Correct Answer: A

Question 3

You are implementing a data pipeline in Snowpark that reads data from an external stage (e.g., AWS S3) and performs complex transformations, including joins with large Snowflake tables. You notice that the pipeline's performance is significantly slower than expected, despite having sufficient warehouse resources. Which of the following actions would MOST likely improve the performance of the Snowpark data pipeline?

A. Ensure that the external stage is properly configured with appropriate data formats (e.g., Parquet) and partitioning schemes that align with the join keys.

B. Reduce the number of partitions in the DataFrame representing the data from the external stage using 'df.repartition(1)'.

C. Increase the warehouse size to the largest available option (e.g., X-Large or larger).

D. Persist the DataFrame representing the data from the external stage using 'df.cache()' before performing the joins.

E. Optimize the SQL joins within the Snowpark DataFrame operations by using broadcast joins when appropriate and ensuring correct join key data types.

Correct Answer: A,D,E

Question 4

You are designing a data loading process for a high-volume streaming data source. The data arrives as Avro files in an AWS S3 bucket. You need to load this data into a Snowflake table with minimal latency and operational overhead. Which of the following combinations of Snowflake features and configurations would be MOST suitable for this scenario? (Select TWO)

A. Configure an external table pointing to the S3 bucket and query the Avro files directly from Snowflake.

B. Use a Kafka connector to stream data directly from the Kafka topic to Snowflake.

C. Create a custom Spark application that reads Avro files from S3, transforms the data, and then writes it to Snowflake using the Snowflake Spark connector.

D. Implement Snowpipe with auto-ingest configured to listen for S3 event notifications whenever a new Avro file is added to the bucket.

E. Use the 'COPY INTO' command with a scheduled task that runs every 5 minutes to load new files from the S3 bucket.

Correct Answer: B,D

Question 5

You have a complex data pipeline implemented using Snowpark Python. The pipeline involves multiple DataFrame transformations, joins, aggregations, and window functions. To enhance the maintainability and readability of the code, you want to modularize the pipeline into reusable functions. You also need to handle potential errors and exceptions gracefully. Consider the following code snippet:

A.

B.

C.

D.

Correct Answer: A,B

Question 6

You are designing a Snowflake data pipeline that continuously ingests clickstream data. You need to monitor the pipeline for latency and throughput, and trigger notifications if these metrics fall outside acceptable ranges. Which of the following combinations of Snowflake features and techniques would be MOST effective for achieving this goal?

A. Use Snowflake's 'QUERY_HISTORY' view to track query execution times and implement a scheduled task that queries this view, calculates latency and throughput, and sends email notifications using Snowflake's built-in email integration if thresholds are exceeded.

B. Implement a combination of Snowflake Streams, Tasks, and external functions. Streams capture changes, Tasks process the changes, and external functions send notifications to a monitoring service when latency or throughput issues are detected.

C. Rely on Snowflake's default resource monitors to track warehouse usage. If warehouse usage exceeds a certain threshold, assume there are performance issues and send a notification.

D. Create a custom dashboard using a BI tool that connects to Snowflake via JDBC/ODBC and visualizes data ingestion and processing metrics. Manually monitor the dashboard for anomalies.

E. Use Snowflake's Event Tables and Event Notifications to capture events related to data ingestion and processing. Configure alerts based on event patterns that indicate latency or throughput issues.

Correct Answer: B,E
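
The threshold check behind the question (compute latency and throughput, then notify when either is out of range) can be sketched in plain Python. This is a hedged illustration only: the function names, the '(event_time, loaded_time)' record shape, and the threshold parameters are assumptions made for the example, and the notification step itself is left out.

```python
from datetime import datetime, timedelta


def pipeline_metrics(events, window_seconds):
    """Summarize one monitoring window.

    events is a list of (event_time, loaded_time) datetime pairs, an
    assumed record shape: when the click happened and when it was loaded.
    """
    if not events:
        return {"max_latency_s": 0.0, "throughput_per_s": 0.0}
    latencies = [(loaded - created).total_seconds() for created, loaded in events]
    return {
        "max_latency_s": max(latencies),
        "throughput_per_s": len(events) / window_seconds,
    }


def breaches(metrics, max_latency_s, min_throughput_per_s):
    """Return the names of the metrics that fall outside the accepted range."""
    problems = []
    if metrics["max_latency_s"] > max_latency_s:
        problems.append("latency")
    if metrics["throughput_per_s"] < min_throughput_per_s:
        problems.append("throughput")
    return problems
```

A non-empty result from 'breaches' is the point at which a pipeline would hand off to whatever notification mechanism it uses.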

Question 7

You are using Snowpipe to ingest data from Azure Blob Storage into a Snowflake table. You have successfully set up the pipe and configured the event notifications. However, you notice that duplicate records are appearing in your target table. After reviewing the logs, you determine that the same file is being processed multiple times by Snowpipe. Which of the following strategies can you implement to prevent duplicate data ingestion, assuming you cannot modify the source data in Azure Blob Storage to include a unique ID or timestamp?

A. Modify the Azure Event Grid subscription configuration to filter events based on file size or creation time to avoid resending events for already processed files.

B. Implement idempotent logic within a Snowflake stored procedure that is triggered by a task after the data is loaded by Snowpipe. The stored procedure should identify and remove duplicate rows based on all other columns in the table.

C. Use a data masking policy with the 'MASK' function to obfuscate duplicate records based on their similarity, making them effectively invisible to downstream queries.

D. Configure the Snowpipe definition with the 'PURGE = TRUE' parameter. This will ensure that each file is only processed once.

E. Create a Snowflake stream on the target table and use it to incrementally load data into a separate, deduplicated table using a merge statement with conditional logic to insert or update records based on a combination of columns.

Correct Answer: E
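
The merge logic in option E (insert when the key is new, update when it already exists, with the key built from a combination of columns) can be illustrated with a small in-memory sketch. This is not Snowflake SQL; 'merge_changes' and the column names in the usage note are hypothetical.

```python
def merge_changes(target, changes, key_columns):
    """Apply change rows to target so each composite key appears once.

    target is a dict mapping a key tuple to a row dict.
    changes is a list of row dicts, possibly containing repeats because
    the same file was processed more than once.
    """
    for row in changes:
        key = tuple(row[col] for col in key_columns)
        # Matched key: the row is overwritten (update).
        # Unmatched key: the row is added (insert).
        target[key] = row
    return target
```

For example, with 'key_columns=("customer", "order_date")', applying the same batch of rows twice leaves the target unchanged after the first pass, which is the property that keeps the deduplicated table clean.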

Question 8

You are performing a series of complex data transformations on a large table named 'TRANSACTIONS' in Snowflake. After running several DML statements, you realize that an earlier transformation step introduced incorrect data into the table. You want to roll back the table to a state before that specific transformation occurred. Which of the following methods could be used to achieve this rollback, assuming you know the exact timestamp or query ID of the state you want to revert to? Select all that apply.

A. Create a clone of the 'TRANSACTIONS' table using Time Travel, specifying the 'AT' or 'BEFORE' clause with either the timestamp or query ID of the desired state. Then, replace the original table with the cloned table.

B. Use the UNDROP TABLE command if the table was dropped accidentally, then manually re-apply the correct transformations.

C. Create a new table with the correct data and load from the original table filtered by a range of transaction IDs excluding the incorrect range.

D. Restore the entire Snowflake account to a point in time before the incorrect transformation.

E. Use Time Travel to query the historical version of the 'TRANSACTIONS' table using the 'AT' or 'BEFORE' clause with either the timestamp or query ID. Then, use an 'INSERT OVERWRITE' or 'REPLACE TABLE' statement to replace the current content of the original table with the historical data.

Correct Answer: A,E
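
As a rough illustration of option A, the Python helper below only builds the SQL text for a Time Travel clone followed by a table swap; it does not connect to Snowflake. The helper name 'restore_statements' and the table name 'TRANSACTIONS_RESTORED' are hypothetical. The sketch assumes the standard 'CLONE ... AT | BEFORE' and 'ALTER TABLE ... SWAP WITH' forms and should be checked against the Snowflake documentation before use.

```python
def restore_statements(table, restored, *, timestamp=None, query_id=None):
    """Build SQL text that clones a past state of table and swaps it in.

    Exactly one of timestamp or query_id must be given. Values are
    interpolated directly, so this is for trusted input only.
    """
    if (timestamp is None) == (query_id is None):
        raise ValueError("give exactly one of timestamp or query_id")
    if timestamp is not None:
        # State of the table as of the given point in time.
        point = f"AT (TIMESTAMP => '{timestamp}'::TIMESTAMP_TZ)"
    else:
        # State of the table just before the given statement ran.
        point = f"BEFORE (STATEMENT => '{query_id}')"
    return [
        f"CREATE TABLE {restored} CLONE {table} {point};",
        f"ALTER TABLE {table} SWAP WITH {restored};",
    ]
```

Using 'BEFORE (STATEMENT => ...)' with the query ID of the faulty transformation targets the state immediately preceding it, which avoids guessing a timestamp.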

Question 9

You have a Snowflake Stream named 'ORDERS_STREAM' on an 'ORDERS' table, which is used to incrementally load data into a historical orders table named 'HISTORICAL_ORDERS'. The data pipeline involves a series of tasks: 1) Consume changes from the 'ORDERS_STREAM', 2) Apply transformations and data quality checks, and 3) Merge the changes into 'HISTORICAL_ORDERS' using a MERGE statement. After a recent data load, you notice that the 'HISTORICAL_ORDERS' table contains duplicate records for certain 'ORDER_ID' values. The MERGE statement uses 'ORDER_ID' as the matching key. You have confirmed that the transformation logic is correct and idempotent. Examine the MERGE statement below. What could be causing the duplicates, given the context of Streams and incremental loading?

A. The MERGE statement is not correctly handling updates and deletes from the stream. The 'WHEN NOT MATCHED' and 'WHEN MATCHED' clauses are not mutually exclusive, leading to potential insertions of duplicate rows.

B. The stream is not configured to capture DELETE operations from the 'ORDERS' table, causing records that should have been removed in 'HISTORICAL_ORDERS' to remain.

C. Multiple tasks are concurrently consuming from the same 'ORDERS_STREAM' without proper coordination, causing records to be processed multiple times.

D. The stream's 'AT' or 'BEFORE' clause is being used incorrectly, potentially rewinding the stream to an earlier point in time.

E. The 'ORDERS_STREAM' is retaining historical data beyond the data retention period, causing older records to be re-processed.

Correct Answer: C
