SnowPro Advanced: Data Scientist Certification Exam (DSA-C03) Free Practice Test

Question 1

You are a data scientist working with a large dataset of customer transactions stored in Snowflake. You need to identify potential fraud using statistical summaries. Which of the following approaches would be MOST effective in identifying unusual spending patterns, considering the need for scalability and performance within Snowflake?

A. Calculate the average transaction amount and standard deviation for each customer using window functions in SQL. Flag transactions that fall outside of 3 standard deviations from the customer's mean.

B. Export the entire dataset to a Python environment, use Pandas to calculate the average transaction amount and standard deviation for each customer, and then identify outliers based on a fixed threshold.

C. Implement a custom UDF (User-Defined Function) in Java to calculate the interquartile range (IQR) for each customer's transaction amounts and flag transactions as outliers if they are below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR.

D. Sample a subset of the data, calculate descriptive statistics using Snowpark Python and the 'describe()' function, and extrapolate these statistics to the entire dataset.

E. Use Snowflake's native anomaly detection functions (if available, and configured for streaming) to detect anomalies based on transaction amount and frequency, grouped by customer ID.

Correct Answer: A,E
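
The per-customer rule in option A can be sketched in plain Python to show the logic. This is an illustration only: inside Snowflake the same statistics would more likely come from AVG and STDDEV used as window functions partitioned by customer. The function name and the sample data below are hypothetical.

```python
from collections import defaultdict
from statistics import mean, stdev


def flag_outliers(transactions, z_threshold=3.0):
    """Return (customer_id, amount) pairs more than z_threshold sample
    standard deviations away from that customer's mean amount."""
    by_customer = defaultdict(list)
    for customer_id, amount in transactions:
        by_customer[customer_id].append(amount)

    flagged = []
    for customer_id, amounts in by_customer.items():
        if len(amounts) < 2:
            continue  # sample stdev is undefined for a single transaction
        mu, sigma = mean(amounts), stdev(amounts)
        if sigma == 0:
            continue  # all amounts identical, nothing to flag
        flagged.extend(
            (customer_id, a) for a in amounts if abs(a - mu) > z_threshold * sigma
        )
    return flagged


# Hypothetical sample data: 20 ordinary purchases and one very large one.
sample = [("c1", float(a)) for a in [48, 52, 50, 49, 51] * 4] + [("c1", 5000.0)]
sample += [("c2", 20.0), ("c2", 22.0), ("c2", 21.0)]
print(flag_outliers(sample))
```

One caveat: with the sample standard deviation, a value among n observations can sit at most (n - 1) / sqrt(n) standard deviations from the mean, so a 3-sigma rule cannot fire for customers with 10 or fewer transactions.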

Question 2

You've built a machine learning model in scikit-learn and want to deploy it to Snowflake for real-time inference. You have the following options for deploying the model. Select all options that apply and are considered best practices for cost and time optimization:

A. Use Snowflake's Snowpark Python API to directly load the model from a stage and execute inference using Snowpark DataFrames, which will implicitly handle the distributed processing of the data.

B. Create a Snowflake external function that calls a cloud-based (AWS SageMaker, Azure Machine Learning, GCP Vertex AI) endpoint for inference, passing the input data to the endpoint and receiving the prediction back.

C. Package the scikit-learn model using 'joblib' or 'pickle', store it in a Snowflake stage, and create a Snowflake UDF (User-Defined Function) in Python to load the model from the stage and perform inference.

D. Implement a custom microservice that reads data from Snowflake, performs inference using the scikit-learn model, and writes the predictions back to Snowflake.

E. Migrate your entire Snowflake data warehouse to a different platform which better supports real-time ML inference.

Correct Answer: A,C
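
The load-once pattern behind option C can be sketched without Snowflake or scikit-learn. The sketch below is an assumption-laden stand-in: a dictionary of logistic-regression coefficients plays the role of the fitted estimator, a temporary file plays the role of the staged model file, and `load_model` and `predict` are hypothetical names. The point it illustrates is that the handler deserializes the model once and reuses it for every row.

```python
import math
import os
import pickle
import tempfile
from functools import lru_cache

# Stand-in for a trained model: plain logistic-regression coefficients.
# A real deployment would pickle the fitted scikit-learn estimator instead.
MODEL_PATH = os.path.join(tempfile.mkdtemp(), "model.pkl")
with open(MODEL_PATH, "wb") as f:
    pickle.dump({"weights": [0.8, -0.5], "bias": 0.1}, f)


@lru_cache(maxsize=1)
def load_model(path):
    """Deserialize the model once; later calls reuse the cached object."""
    with open(path, "rb") as f:
        return pickle.load(f)


def predict(features):
    """Handler body: score one row with the cached model."""
    model = load_model(MODEL_PATH)
    z = model["bias"] + sum(w * x for w, x in zip(model["weights"], features))
    return 1.0 / (1.0 + math.exp(-z))


print(round(predict([1.0, 2.0]), 4))
```

Caching matters because a UDF handler may be invoked once per row; reading and unpickling the file on every call would dominate inference time.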

Question 3

A healthcare provider has a Snowflake table 'MEDICAL_RECORDS' containing patient notes stored as unstructured text in a column called 'NOTE_TEXT'. They want to identify different patient groups based on the topics discussed in these notes. They aim to use a combination of unsupervised and supervised learning. Which of the following represents a robust workflow to achieve this goal?

A. Perform topic modeling on a sample of the 'NOTE_TEXT' data using a Snowflake Python UDF. Manually review the top documents for each identified topic, and assign labels describing the patient group represented by each topic. Train a supervised multi-label classification model (e.g., using scikit-learn's MultiOutputClassifier wrapped around a Logistic Regression model) within Snowflake (using Snowpark), using the original 'NOTE_TEXT' as input features (TF-IDF or word embeddings) and the manually assigned topic labels as target variables. Use the trained model to classify the remaining patient notes into relevant patient groups.

B. Export all 'NOTE_TEXT' data to an external system, use an existing NLP pipeline for topic modeling and manual labeling, then create a Snowflake UDF that replicates this entire pipeline internally.

D. Perform topic modeling (e.g., LDA) directly on the 'NOTE_TEXT' column using a Python UDF in Snowflake. Manually label a subset of the resulting topics. Then, train a supervised classifier (e.g., Naive Bayes) to predict the identified topics for new patient notes.

E. Use a Snowflake external function to call a pre-trained topic modeling model (e.g., BERTopic) hosted on Google Cloud AI Platform. Assign topic probabilities to each patient note. Then, perform K-Means clustering on the topic probabilities to identify patient segments. No manual labeling is performed.

Correct Answer: A

Question 4

You are tasked with identifying fraudulent transactions in a large financial dataset stored in Snowflake using unsupervised learning. The dataset contains features like transaction amount, merchant ID, location, time, and user ID. You decide to use a combination of clustering and anomaly detection techniques. Which of the following steps and techniques would be MOST effective in achieving this goal while leveraging Snowflake's capabilities and minimizing false positives?

A. Apply Principal Component Analysis (PCA) for dimensionality reduction, then use DBSCAN clustering to identify dense regions of normal transactions and flag any transaction that is not within a dense region as potentially fraudulent. Afterward, review the anomalous data points.

B. Use a Snowflake Python UDF to perform feature selection, apply a combination of K-means clustering and anomaly detection techniques like Isolation Forest or Local Outlier Factor (LOF), and then score each transaction based on its likelihood of being fraudulent. Tune parameters and use a hold-out validation set to minimize false positives, using a Snowpark DataFrame to retrieve the data.

C. Implement an Isolation Forest algorithm directly in SQL using complex JOINs and window functions to identify anomalies based on transaction volume and velocity.

D. Use only the 'transaction amount' feature and perform histogram-based anomaly detection in Snowflake SQL by identifying values outside of the common ranges, disregarding other potentially relevant information.

E. Perform K-means clustering on the entire dataset using all available features, then flag any transaction that falls outside of any cluster as fraudulent. Ignore any feature selection or engineering to simplify the process.

Correct Answer: A,B

Question 5

You are preparing a dataset in Snowflake for a K-means clustering algorithm. The dataset includes features like 'age', 'income' (in USD), and 'number_of_transactions'. 'Income' has significantly larger values than 'age' and 'number_of_transactions'. To ensure that all features contribute equally to the distance calculations in K-means, which of the following scaling approaches should you consider, and why? Select all that apply:

A. Do not scale the data, as K-means is robust to differences in feature scales.

B. Apply StandardScaler to all three features ('age', 'income', 'number_of_transactions') to center the data around zero and scale it to unit variance.

C. Apply PowerTransformer to transform income and StandardScaler to other features to handle skewness.

D. Apply RobustScaler to handle outliers and then StandardScaler or MinMaxScaler to further scale the features.

E. Apply MinMaxScaler to all three features to scale them to a range between 0 and 1.

Correct Answer: B,D,E
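
To make options B and E concrete, here is a minimal pure-Python sketch of the two transformations. It mirrors what StandardScaler and MinMaxScaler compute per column (StandardScaler uses the population standard deviation), but the function names and the sample rows are hypothetical, and each column is assumed to be non-constant.

```python
from statistics import mean, pstdev


def standard_scale(values):
    """Center on zero and scale to unit (population) variance."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def min_max_scale(values):
    """Rescale linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical rows: (age, income in USD, number_of_transactions)
rows = [(25, 40_000, 5), (40, 85_000, 12), (60, 52_000, 3), (33, 120_000, 20)]
columns = list(zip(*rows))

standardized = list(zip(*(standard_scale(c) for c in columns)))
normalized = list(zip(*(min_max_scale(c) for c in columns)))
for row in normalized:
    print(tuple(round(v, 3) for v in row))
```

Without either transformation, the income column alone would determine nearly all of each Euclidean distance, because its differences are thousands of times larger than those of the other two features.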

Question 6

You're building a fraud detection model and want to determine if the average transaction amount for fraudulent transactions is significantly higher than the average transaction amount for legitimate transactions. You have two tables in Snowflake: 'FRAUDULENT_TRANSACTIONS' and 'LEGITIMATE_TRANSACTIONS', both with a 'TRANSACTION_AMOUNT' column. You believe that 'FRAUDULENT_TRANSACTIONS' contains fewer than 30 transactions. You don't know the population standard deviations. What are the proper steps to conduct the hypothesis test, and what is the correct hypothesis statement?

A. Perform a Z-test. Null Hypothesis: The average transaction amount for fraudulent transactions is equal to the average transaction amount for legitimate transactions. Alternative Hypothesis: The average transaction amount for fraudulent transactions is not equal to the average transaction amount for legitimate transactions.

B. Perform a chi-squared test. Null Hypothesis: There is no relationship between transaction amount and whether a transaction is fraudulent. Alternative Hypothesis: There is a relationship between transaction amount and whether a transaction is fraudulent.

C. Perform a t-test. Null Hypothesis: The average transaction amount for fraudulent transactions is less than or equal to the average transaction amount for legitimate transactions. Alternative Hypothesis: The average transaction amount for fraudulent transactions is greater than the average transaction amount for legitimate transactions.

D. Perform a Z-test. Null Hypothesis: The average transaction amount for fraudulent transactions is less than or equal to the average transaction amount for legitimate transactions. Alternative Hypothesis: The average transaction amount for fraudulent transactions is greater than the average transaction amount for legitimate transactions.

E. Perform a t-test. Null Hypothesis: The average transaction amount for fraudulent transactions is equal to the average transaction amount for legitimate transactions. Alternative Hypothesis: The average transaction amount for fraudulent transactions is not equal to the average transaction amount for legitimate transactions.

Correct Answer: C
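
The one-sided t-test in option C can be sketched as follows. The question does not say whether the two groups share a variance, so this sketch assumes they do not and uses Welch's form of the statistic; the function name and all sample amounts are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Return (t, df) for the alternative mean(sample_a) > mean(sample_b)."""
    n_a, n_b = len(sample_a), len(sample_b)
    se_a = variance(sample_a) / n_a  # squared standard error of each mean
    se_b = variance(sample_b) / n_b
    t = (mean(sample_a) - mean(sample_b)) / sqrt(se_a + se_b)
    # Welch-Satterthwaite approximation for the degrees of freedom
    df = (se_a + se_b) ** 2 / (se_a**2 / (n_a - 1) + se_b**2 / (n_b - 1))
    return t, df


# Hypothetical transaction amounts, for illustration only.
fraudulent = [410.0, 980.0, 1250.0, 760.0, 1430.0, 890.0]
legitimate = [35.0, 120.0, 60.0, 240.0, 85.0, 45.0, 150.0, 95.0, 70.0, 110.0]

t_stat, dof = welch_t(fraudulent, legitimate)
print(f"t = {t_stat:.2f}, df = {dof:.1f}")
```

The null hypothesis is rejected at level alpha when t exceeds the upper-tail critical value of the t distribution with the computed degrees of freedom. If SciPy is available, `scipy.stats.ttest_ind(fraudulent, legitimate, equal_var=False, alternative="greater")` should give the same statistic along with a p-value.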

Question 7

You are building a data science pipeline in Snowflake to predict customer churn. The pipeline involves extracting data, transforming it using Dynamic Tables, training a model using Snowpark ML, and deploying the model for inference. The raw data arrives in a Snowflake stage daily as Parquet files. You want to optimize the pipeline for cost and performance. Which of the following strategies are MOST effective, considering resource utilization and potential data staleness?

A. Use a combination of Dynamic Tables for feature engineering and Snowpark ML for model training and deployment, ensuring proper dependency management and refresh intervals for each Dynamic Table based on data freshness requirements.

B. Load all data into traditional Snowflake tables and use scheduled tasks with stored procedures written in Python to perform the transformations and model training.

C. Use a single, large Dynamic Table to perform all transformations in one step, relying on Snowflake's optimization to handle dependencies and incremental updates.

D. Schedule all data transformations and model training as a single large Snowpark Python script executed by a Snowflake task, ignoring data freshness requirements.

E. Implement a series of smaller Dynamic Tables, each responsible for a specific transformation step, with well-defined refresh intervals tailored to the data's volatility and the downstream model's requirements.

Correct Answer: A,E

Question 8

You are building a machine learning model using Snowflake data to predict customer churn. Your dataset includes a 'CUSTOMER_TYPE' column with the following possible values: 'New', 'Returning', and 'VIP'. You need to perform one-hot encoding on this column. Which of the following Snowflake SQL queries correctly implements one-hot encoding for the 'CUSTOMER_TYPE' column, creating separate binary columns for each customer type ('IS_NEW', 'IS_RETURNING', 'IS_VIP')?

A. Option E

B. Option D

C. Option C

D. Option A

E. Option B

Correct Answer: C,D,E
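
As a reference for what the target encoding looks like, here is a small Python sketch. It illustrates the encoding itself and is not a reconstruction of any answer option; in Snowflake SQL the same flags would typically be produced with a conditional expression per column. The function name is hypothetical.

```python
CUSTOMER_TYPES = ("New", "Returning", "VIP")


def one_hot_customer_type(customer_type):
    """Map a CUSTOMER_TYPE value to IS_NEW / IS_RETURNING / IS_VIP flags."""
    return {
        f"IS_{level.upper()}": int(customer_type == level)
        for level in CUSTOMER_TYPES
    }


for value in ("New", "Returning", "VIP", None):
    print(value, one_hot_customer_type(value))
```

Exactly one flag is set for each known value, and an unknown or missing value yields all zeros, which is one common way to handle it.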

Question 9

You have built an external function to train a PyTorch model using SageMaker. The model training process requires a significant amount of CPU and memory. The training data is passed from Snowflake to the external function in batches. The external function code in AWS Lambda is as follows:

The Snowflake external function is defined as follows:

During testing, you encounter '500 Internal Server Error' from the external function consistently. Upon inspection of the Lambda logs, you find messages indicating 'PayloadTooLargeError'. What is the most likely cause and how do you mitigate it within the context of Snowflake and AWS Lambda?

A. The Snowflake external function definition is incorrect. Change the 'RETURNS VARIANT' clause to 'RETURNS VARCHAR' as the Lambda function returns a JSON string.

B. The Lambda function is timing out before the model training can complete. Increase the Lambda function's timeout setting to allow sufficient time for the training process.

C. The IAM role associated with the Lambda function lacks the necessary permissions to invoke the SageMaker training job. Grant the Lambda function's IAM role the appropriate SageMaker permissions.

D. The size of the data being sent from Snowflake to the Lambda function exceeds the maximum payload size allowed by AWS API Gateway. Increase the maximum payload size limit in the API Gateway settings.

E. The size of the data being sent from Snowflake to the Lambda function exceeds the maximum payload size allowed by AWS API Gateway. Implement data partitioning in Snowflake and send smaller batches of data to the Lambda function, aggregating the results in a separate table.

Correct Answer: E
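
The batching idea in option E can be sketched as a greedy splitter that keeps each serialized request under a size budget. The function name, the sample rows, and the 1,000-byte budget are hypothetical (the budget is not the actual API Gateway limit); the `{"data": [...]}` wrapper follows the request shape Snowflake uses for external functions.

```python
import json


def split_into_batches(rows, max_bytes):
    """Greedily group rows so each serialized batch stays within max_bytes.

    A single row that is larger than max_bytes still gets its own batch,
    since a row cannot be split. Re-serializing on every step is simple
    rather than efficient.
    """
    batches, current = [], []
    for row in rows:
        candidate = current + [row]
        if current and len(json.dumps({"data": candidate}).encode()) > max_bytes:
            batches.append(current)
            current = [row]
        else:
            current = candidate
    if current:
        batches.append(current)
    return batches


rows = [[i, "x" * 100] for i in range(50)]  # hypothetical [row_number, payload]
batches = split_into_batches(rows, max_bytes=1_000)
print(len(batches), [len(b) for b in batches])
```

On the Snowflake side, batch size can also be capped declaratively, for example with the `MAX_BATCH_ROWS` parameter of `CREATE EXTERNAL FUNCTION`, which may be simpler than splitting the data manually.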

Question 10

A data scientist is developing a model within a Snowpark Python environment to predict customer churn. They have established a Snowflake session and loaded data into a Snowpark DataFrame named 'customer_data'. The feature engineering pipeline requires a custom Python function, 'calculate_engagement_score', to be applied to each row. This function takes several columns as input and returns a single score representing customer engagement. The data scientist wants to apply this function in parallel across the entire DataFrame using Snowpark's UDF capabilities. The following code snippet is used to define and register the UDF:

When the UDF is called the above error is observed. What change needs to be applied to make the UDF work as expected?

A. Change the function call to use the Snowpark DataFrame's 'select' function with column objects: 'customer_data.select(engagement_score_udf(F.col('num_transactions'), F.col('avg_transaction_value'),

B. Add '@F.sproc' decorator before the function definition.

C. Wrap the Python function inside a stored procedure using '@F.sproc' and call that stored procedure instead of the plain Python function.

D. Remove argument from 'session.udf.register' call. Snowpark can infer the input types automatically.

E. Redefine the function to accept string arguments and cast them to the correct data types within the function.

Correct Answer: A

Question 11

You are tasked with validating a regression model predicting customer lifetime value (CLTV). The model uses various customer attributes, including purchase history, demographics, and website activity, stored in a Snowflake table called 'CUSTOMER_DATA'. You want to assess the model's calibration: specifically, whether the predicted CLTV values align with the actual observed CLTV values over time. Which of the following evaluation techniques would be MOST suitable for assessing the calibration of your CLTV regression model in Snowflake?

A. Calculate the Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE) on a hold-out test set to quantify the overall prediction accuracy.

B. Conduct a Kolmogorov-Smirnov test to compare the distributions of the predicted and actual values.

C. Create a calibration curve (also known as a reliability diagram) by binning the predicted CLTV values, calculating the average predicted CLTV and the average actual CLTV within each bin, and plotting these averages against each other.

D. Evaluate the model's residuals by plotting them against the predicted values and checking for patterns or heteroscedasticity.

E. Calculate the R-squared score on a hold-out test set to assess the proportion of variance in the actual CLTV explained by the model.

Correct Answer: C
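
The binning step described in option C can be sketched in a few lines. This is a minimal illustration that assumes equal-count (quantile) bins, which the question does not specify; the function name and the sample values are hypothetical.

```python
from statistics import mean


def calibration_table(predicted, actual, n_bins=5):
    """Return one (mean_predicted, mean_actual, count) tuple per quantile bin."""
    pairs = sorted(zip(predicted, actual))  # order rows by predicted value
    n = len(pairs)
    table = []
    for i in range(n_bins):
        chunk = pairs[i * n // n_bins : (i + 1) * n // n_bins]
        if chunk:  # skip empty bins when there are fewer rows than bins
            table.append(
                (mean(p for p, _ in chunk), mean(a for _, a in chunk), len(chunk))
            )
    return table


# Hypothetical predicted and observed CLTV values.
predicted = [100.0, 200.0, 300.0, 400.0, 500.0, 600.0]
actual = [110.0, 190.0, 320.0, 380.0, 650.0, 750.0]
for row in calibration_table(predicted, actual, n_bins=3):
    print(row)
```

Plotting mean actual against mean predicted per bin gives the calibration curve; points on the diagonal indicate good calibration. In this made-up data the top bin sits well above the diagonal, meaning the model underpredicts its highest-value customers.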
