Author: Josh Patterson
Date: zzzzzzz xxth, 2022
In our last article we loaded our synthetic credit card data into the Snowflake platform.
We want to start building some models in Amazon SageMaker Studio Lab, but first we need to do some scalable feature engineering. In many cases we will have a non-trivial amount of raw data in our data warehouse, so anything we can do in-storage / in-database (scalably) is going to be an advantage in terms of saving compute time later on in our pipeline.
Snowpark in Snowflake gives us a great way to do scalable feature generation. In this article we are going to use Snowpark to build most of our core dataset features for model training.
Let's start off by giving a specific definition for "feature engineering":
Feature engineering is the process of using domain expertise to extract features (characteristics, properties) from raw data. We use the data we already have to generate new features (columns) in our dataset.
There is no standard way to do feature engineering for non-numerical or categorical features. There are a few core methods that are known to work well in practice (e.g., one-hot encoding for categorical features), but most of the time we use a combination of these practitioner methods and domain expertise to generate good features for our modeling workflow.
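As a quick illustration of the one-hot idea, here is a minimal sketch in Snowpark Scala. The dataframe dfRaw, the column TX_CATEGORY, and its values are hypothetical examples, not part of our dataset; the sketch just expands one categorical column into a 0/1 column per known value.

import com.snowflake.snowpark.functions._

// Hypothetical example: expand TX_CATEGORY into one 0/1 indicator column per known value
val categories = Seq("ONLINE", "IN_STORE", "ATM")
val dfEncoded = categories.foldLeft(dfRaw) { (df, cat) =>
  df.withColumn(s"TX_CATEGORY_$cat", iff(col("TX_CATEGORY") === lit(cat), lit(1), lit(0)))
}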
As previously mentioned, many times the raw dataset is large but the data we want to model is relatively small after processing. From this perspective, we want to do as much raw processing of the data in the data warehouse (here, Snowflake) as possible, with tools that scale with the size of the data input (e.g., Snowpark). Snowpark is nice because it lets us build data processing pipelines with a DataFrame API (similar to pandas in Python).
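To give a flavor of how pandas-like this feels, here is a small sketch using the CUSTOMER_CC_TRANSACTIONS table and columns introduced later in this article, and assuming a session object created as shown further below:

// Filter, group, and aggregate with a dataframe-style chain, all executed inside Snowflake
val dfSpendByCustomer = session.table("CUSTOMER_CC_TRANSACTIONS")
  .filter(col("TX_AMOUNT") > 0)
  .groupBy(col("CUSTOMER_ID"))
  .agg(sum(col("TX_AMOUNT")).alias("TOTAL_SPEND"))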
The work in this series is based on the dataset and general workflow from the online book Reproducible Machine Learning for Credit Card Fraud Detection - Practical Handbook.
Further, Mats Stellwall (Snowflake Engineer) wrote a blog post Feature Engineering with Snowflake, Using Snowpark and Scala based on the book above.
We use portions of Mats' Scala code as the basis for our Snowflake feature generation pipeline (our implementation is available here on GitHub).
With all of this in mind, let's jump into building our feature engineering pipeline in Snowpark.
Our team can help -- we help companies with Snowflake platform operations, analytics, and machine learning.
The raw transaction data has three types of features: identifiers, a timestamp, and a transaction amount.
Machine learning works with vectors (or tensors) of floating point numbers, requiring us to convert each feature into an integer or real number. In some cases we want to preserve ordering information about a feature as well.
Identifiers (e.g., transaction IDs, customer IDs, and terminal IDs) are not valuable for machine learning modeling due to their high cardinality (i.e., large numbers of unique values), so we generally drop them.
Raw dates (strings or raw millisecond timestamps) do not work well with machine learning, so we need to do some feature creation work with dates to create meaningful context for the machine learning algorithm to model.
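To make this concrete, here is a minimal sketch of dropping identifiers and pulling numeric pieces out of the timestamp. The table and timestamp column names come from later in this article; the identifier column names (for example TRANSACTION_ID) are assumptions for illustration.

// Drop high-cardinality identifiers and derive simple numeric features from the timestamp
val dfNumeric = session.table("CUSTOMER_CC_TRANSACTIONS")
  .drop("TRANSACTION_ID", "CUSTOMER_ID", "TERMINAL_ID")
  .withColumn("TX_HOUR", hour(col("TX_DATETIME")))
  .withColumn("TX_DAY_OF_WEEK", dayofweek(col("TX_DATETIME")))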
With our high-level notes about the data out of the way, let's now take a look at our specific feature engineering workflow overview.
The online book gives a detailed explanation (Chapter 3: Baseline feature transformation) of the hows and whys of the new features created. In this section we provide a brief summary of these features.
The book states that there are 3 types of feature transformations that are known to be relevant for payment card fraud detection:
With temporal (date/time) features we want to create binary features that indicate whether a transaction occurred during potentially relevant periods of time. In this pipeline we create two features of this type: TX_DURING_WEEKEND and TX_DURING_NIGHT.
The next type of features we want to create are centered on customer spending behaviors. To create these features we use the RFM (Recency, Frequency, Monetary value) method. We will create 6 new features that track the average spending amount and the number of transactions for each customer over 3 time window sizes.
The final group of features we'll create involve measuring the risk associated with each financial terminal. As more fraud is committed at a terminal, its risk profile will also rise. We will create 6 new features for this group.
To summarize the features we will create, the table below shows the complete list:
| Original feature name | Original feature type | Transformation | Number of new features | New feature(s) type |
|---|---|---|---|---|
| TX_DATETIME | Pandas timestamp | 0 if the transaction occurs during a weekday, 1 if it occurs during the weekend. The new feature is called TX_DURING_WEEKEND. | 1 | Integer (0/1) |
| TX_DATETIME | Pandas timestamp | 0 if the transaction occurs between 6am and midnight, 1 if it occurs between midnight and 6am. The new feature is called TX_DURING_NIGHT. | 1 | Integer (0/1) |
| CUSTOMER_ID | Categorical variable | Number of transactions by the customer in the last n day(s), for n in {1, 7, 30}. The new features are called CUSTOMER_ID_NB_TX_nDAY_WINDOW. | 3 | Integer |
| CUSTOMER_ID | Categorical variable | Average spending amount in the last n day(s), for n in {1, 7, 30}. The new features are called CUSTOMER_ID_AVG_AMOUNT_nDAY_WINDOW. | 3 | Real |
| TERMINAL_ID | Categorical variable | Number of transactions on the terminal in the last n+d day(s), for n in {1, 7, 30} and d = 7. The parameter d is called the delay and will be discussed later in the series. The new features are called TERMINAL_ID_NB_TX_nDAY_WINDOW. | 3 | Integer |
| TERMINAL_ID | Categorical variable | Average number of frauds on the terminal in the last n+d day(s), for n in {1, 7, 30} and d = 7. The parameter d is called the delay and will be discussed later in the series. The new features are called TERMINAL_ID_RISK_nDAY_WINDOW. | 3 | Real |
Let's now take a look at how we can build these feature transforms with Scala and Snowpark on Snowflake.
In this article we are going to port the creation of the date/time features and the customer spending features to Snowpark.
Our implementation of the Snowpark credit card transaction data feature engineering code can be found here on GitHub.
For the sake of time and space we are going to cover the highlights of the feature engineering code in the sub-sections below. You are more than welcome to download the project from GitHub and run it against your own Snowflake account.
Check our previous Snowpark article to understand how to set up and run Snowpark code.
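As a reminder of the build setup, the project pulls in the Snowpark Scala library via sbt; the version below is a placeholder, so use whichever Snowpark release your environment supports:

// build.sbt (version is a placeholder)
libraryDependencies += "com.snowflake" % "snowpark" % "<version>"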
Let's connect to Snowflake with a Session object:
import com.snowflake.snowpark._
import com.snowflake.snowpark.functions._
val session = Session.builder.configFile("conn.properties").create
val df_cust_txns = session.table("CUSTOMER_CC_TRANSACTIONS")
Our Snowflake account information is contained in the conn.properties file in the example above, letting our code authenticate with Snowflake.
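For reference, a conn.properties file for the Snowpark session builder typically contains key/value pairs like the following; the values here are placeholders, and the exact keys you need depend on your account and authentication method:

URL = https://<account_identifier>.snowflakecomputing.com
USER = <user_name>
PASSWORD = <password>
ROLE = <role_name>
WAREHOUSE = <warehouse_name>
DB = <database_name>
SCHEMA = <schema_name>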
In the code below we want to generate a feature that has a 0 if the transaction occurs between 6am and midnight and a 1 if it occurs between midnight and 6am. The new feature is called TX_DURING_NIGHT.
The code also generates a feature TX_DURING_WEEKEND that has a 0 if the transaction's date falls on a weekday and a 1 if it falls on the weekend.
val dfDateTimeFeat = df_cust_txns.withColumns(
  Seq("TX_DURING_WEEKEND", "TX_DURING_NIGHT"),
  Seq(
    // Weekend check: dayofweek 6 (Saturday) or 0 (Sunday) with Snowflake's default week settings
    iff(
      dayofweek(col("TX_DATETIME")) === 6 || dayofweek(col("TX_DATETIME")) === 0,
      lit(1),
      lit(0)
    ),
    // Night check: hour of day <= 6 (midnight through 6am)
    iff(hour(col("TX_DATETIME")) <= 6 || hour(col("TX_DATETIME")) > 23, lit(1), lit(0))
  ))
This feature creation passes two Seq collections to withColumns: one with the new column names and one with the corresponding column expressions, where iff generates the 0 or 1 value for each feature based on conditions in the data.
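To spot-check the result, we can preview the new columns (a quick sanity check, not part of the pipeline itself):

dfDateTimeFeat.select(col("TX_DATETIME"), col("TX_DURING_WEEKEND"), col("TX_DURING_NIGHT")).show(10)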
We want to calculate the number of transactions and the average spending amount for each customer over the following time windows: the previous 1, 7, and 30 days.
So we need to create the above 6 features for each transaction. Our general strategy (for both feature groups) in our Snowpark code is to: (1) find the date range covered by the transactions, (2) build a dataframe with one row per customer per day, (3) join the raw transactions onto it and aggregate the daily transaction count and amount per customer, and (4) apply window functions over those daily aggregates to produce the 1-, 7-, and 30-day features.
import java.text.SimpleDateFormat

// Find the first transaction date and the number of days covered by the data
val dateInfo = df_cust_txns.select(
  min(col("TX_DATETIME")).alias("END_DATE"),
  datediff("DAY", col("END_DATE"), max(col("TX_DATETIME")))
).collect()

val dStartDate = new SimpleDateFormat("yyyy-MM-dd").format(dateInfo(0).getTimestamp(0))
val nDays = dateInfo(0).getInt(1)

// Build a dataframe with one row per day in the date range
val dfDays = session.range(nDays)
  .withColumn("TX_DATE",
    to_date(dateadd("DAY", callBuiltin("SEQ4"), lit(dStartDate))))
val num_rows_txns_dfDays = dfDays.count()

// Cross join the days with the customers to get one row per customer per day
val dfCustomers = session.table("CUSTOMERS").select("CUSTOMER_ID")
val num_rows_txns_dfCustomers = dfCustomers.count()
val dfCustDay = dfDays.join(dfCustomers)
val num_rows_txns_dfCustDay = dfCustDay.count()

// ZEROIFNULL turns missing amounts (days with no transactions) into 0
val zeroifnull = builtin("ZEROIFNULL")
// df_cust_txns: the original table of raw transactions
// dfCustDay: the matrix of customers and possible dates
val dfCustTrxByDay = df_cust_txns.join(dfCustDay, df_cust_txns("CUSTOMER_ID") === dfCustDay("CUSTOMER_ID")
&& to_date(df_cust_txns("TX_DATETIME")) === dfCustDay("TX_DATE"), "rightouter")
.select(dfCustDay("CUSTOMER_ID").alias("CUSTOMER_ID"),
dfCustDay("TX_DATE"),
zeroifnull(df_cust_txns("TX_AMOUNT")).as("TX_AMOUNT"),
iff(col("TX_AMOUNT") > 0, lit(1), lit(0)).as("NO_TRX"))
.groupBy(col("CUSTOMER_ID"), col("TX_DATE"))
.agg(
sum(col("TX_AMOUNT")).alias("TOT_AMOUNT"),
sum(col("NO_TRX")).alias("NO_TRX")
)
// Quick inspection: daily aggregates for a single customer (CUSTOMER_ID = 0)
val dfSubset = dfCustTrxByDay.filter(col("CUSTOMER_ID") === lit(0)).sort(col("TX_DATE"))
// Windows over the previous 7 and 30 daily rows per customer (excluding the current day)
val custDate = Window.partitionBy(col("customer_id")).orderBy(col("TX_DATE"))
val win7dCust = custDate.rowsBetween(-7, -1)
val win30dCust = custDate.rowsBetween(-30, -1)
val dfCustFeatDay = dfCustTrxByDay
.select(col("TX_DATE"),col("CUSTOMER_ID"),
col("NO_TRX"), col("TOT_AMOUNT"),
lag(col("NO_TRX"),1).over(custDate).as("CUST_TX_PREV_1"),
sum(col("NO_TRX")).over(win7dCust).as("CUST_TX_PREV_7"),
sum(col("NO_TRX")).over(win30dCust).as("CUST_TX_PREV_30"),
lag(col("TOT_AMOUNT"),1).over(custDate).as("CUST_TOT_AMT_PREV_1"),
sum(col("TOT_AMOUNT")).over(win7dCust).as("CUST_TOT_AMT_PREV_7"),
sum(col("TOT_AMOUNT")).over(win30dCust).as("CUST_TOT_AMT_PREV_30"))
The original project generates 6 features for terminal usage. We're going to leave that function in Python in this series to show the difference in how these windows are created.
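For intuition, the terminal windows follow the same pattern as the customer windows, just shifted back by the delay d = 7. Below is a hedged sketch only: the per-terminal daily aggregate dfTermTrxByDay and its NB_TX and NB_FRAUD columns are assumptions for illustration, not part of the code shown above.

// Look at daily rows from (7 + 7) days back up to 7 days back (the delay period is excluded)
val termDate = Window.partitionBy(col("TERMINAL_ID")).orderBy(col("TX_DATE"))
val win7dDelayTerm = termDate.rowsBetween(-(7 + 7), -(7 + 1))
val dfTermRiskSketch = dfTermTrxByDay
  .withColumn("TERMINAL_ID_NB_TX_7DAY_WINDOW", sum(col("NB_TX")).over(win7dDelayTerm))
  .withColumn("TERMINAL_ID_RISK_7DAY_WINDOW",
    iff(sum(col("NB_TX")).over(win7dDelayTerm) > 0,
      sum(col("NB_FRAUD")).over(win7dDelayTerm) / sum(col("NB_TX")).over(win7dDelayTerm),
      lit(0)))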
Finally, we want to save the new features to a table in Snowflake with the Snowpark command:
dfCustBehaviurFeat.write.mode(SaveMode.Overwrite).saveAsTable("CUSTOMER_CC_TRANSACTION_FEATURES")
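As an optional sanity check, we can read the new table back and look at a few rows:

session.table("CUSTOMER_CC_TRANSACTION_FEATURES").show(10)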
At this point we have a table in Snowflake we can use for modeling in Amazon SageMaker Studio Lab.
In this post in our series we learned how to build features in Snowpark for Snowflake.
Our team can help -- we help companies with Snowflake platform operations, analytics, and machine learning.