Author: Josh Patterson
Date: zzzzzzz xxth, 2022
In our last article we loaded our synthetic credit card data into the Snowflake platform.
We want to start building some models in Amazon SageMaker Studio Lab, but first we need to do some scalable feature engineering. In many cases we will have a non-trivial amount of raw data in our data warehouse, so anything we can do in-storage / in-database (scalably) is going to be an advantage in terms of saving compute time later on in our pipeline.
Snowpark in Snowflake gives us a great way to do scalable feature generation. In this article we are going to use Snowpark to build most of our core dataset features for model training.
Let's start off by giving a specific definition for "feature engineering":
Feature engineering is the process of using domain expertise to extract features (characteristics, properties) from raw data. We use the data we already have to generate new features (columns) in our dataset.
There is no standard way to do feature engineering for non-numerical or categorical features. There are a few core methods that are known to work well in practice (e.g., one-hot encoding for categorical features), but most of the time we use a combination of these practitioner methods and domain expertise to generate good features for our modeling workflow.
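As a quick illustration of the one-hot idea, here is a minimal sketch in Snowpark Scala. The dataframe dfRaw, the column TX_CATEGORY, and its values are hypothetical examples, not part of our dataset; the sketch just expands one categorical column into a 0/1 column per known value.

import com.snowflake.snowpark.functions._

// Hypothetical example: expand TX_CATEGORY into one 0/1 indicator column per known value
val categories = Seq("ONLINE", "IN_STORE", "ATM")
val dfEncoded = categories.foldLeft(dfRaw) { (df, cat) =>
  df.withColumn(s"TX_CATEGORY_$cat", iff(col("TX_CATEGORY") === lit(cat), lit(1), lit(0)))
}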
As previously mentioned, many times the raw dataset is large but the data we want to model is relatively small after processing. From this perspective, we want to do as much raw processing of the data in the data warehouse (here, Snowflake) as possible, with tools that scale with the size of the data input (e.g., Snowpark). Snowpark is nice because it lets us build data processing pipelines with a DataFrame API (similar to pandas in Python).
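To give a flavor of how pandas-like this feels, here is a small sketch using the CUSTOMER_CC_TRANSACTIONS table and columns introduced later in this article, and assuming a session object created as shown further below:

// Filter, group, and aggregate with a dataframe-style chain, all executed inside Snowflake
val dfSpendByCustomer = session.table("CUSTOMER_CC_TRANSACTIONS")
  .filter(col("TX_AMOUNT") > 0)
  .groupBy(col("CUSTOMER_ID"))
  .agg(sum(col("TX_AMOUNT")).alias("TOTAL_SPEND"))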
The work in this series is based on the dataset and general workflow from the online book Reproducible Machine Learning for Credit Card Fraud Detection - Practical Handbook.
Further, Mats Stellwall (Snowflake Engineer) wrote a blog post Feature Engineering with Snowflake, Using Snowpark and Scala based on the book above.
We use portions of Mats' Scala code as the basis for our Snowflake feature generation pipeline (our implementation is available here on GitHub).
With all of this in mind, let's jump into building our feature engineering pipeline in Snowpark.
Our team can help -- we help companies with Snowflake platform operations, analytics, and machine learning.
The raw transaction data has three types of features: identifiers, a timestamp, and a transaction amount.
Machine learning works with vectors (or tensors) of floating point numbers, requiring us to convert each feature into an integer or real number. In some cases we want to preserve ordering information about a feature as well.
Identifiers (e.g., transaction IDs, customer IDs, and terminal IDs) are not valuable for machine learning modeling due to their high cardinality (i.e., large numbers of unique values), so we generally drop them.
Raw dates (strings or raw millisecond timestamps) do not work well with machine learning, so we need to do some feature creation work with dates to create meaningful context for the machine learning algorithm to model.
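To make this concrete, here is a minimal sketch of dropping identifiers and pulling numeric pieces out of the timestamp. The table and timestamp column names come from later in this article; the identifier column names (for example TRANSACTION_ID) are assumptions for illustration.

// Drop high-cardinality identifiers and derive simple numeric features from the timestamp
val dfNumeric = session.table("CUSTOMER_CC_TRANSACTIONS")
  .drop("TRANSACTION_ID", "CUSTOMER_ID", "TERMINAL_ID")
  .withColumn("TX_HOUR", hour(col("TX_DATETIME")))
  .withColumn("TX_DAY_OF_WEEK", dayofweek(col("TX_DATETIME")))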
With our high-level notes about the data out of the way, let's now take a look at our specific feature engineering workflow overview.
The online book gives a detailed explanation (Chapter 3: Baseline feature transformation) of the hows and whys of the new features created. In this section we provide a brief summary of these features.
The book states that there are 3 types of feature transformations that are known to be relevant for payment card fraud detection:
With temporal (date/time) features we want to create binary features that indicate whether a transaction occurred during potentially relevant periods of time. In this pipeline we create two features of this type: TX_DURING_WEEKEND and TX_DURING_NIGHT.
The next type of features we want to create are centered on customer spending behaviors. To create these features we use the RFM (Recency, Frequency, Monetary value) method. We will create 6 new features that track the average spending amount and the number of transactions for each customer over 3 time window sizes.
The final group of features we'll create involve measuring the risk associated with each financial terminal. As more fraud is committed at a terminal, its risk profile will also rise. We will create 6 new features for this group.
To summarize the features we will create, the table below shows the complete list:
| Original feature name | Original feature type | Transformation | Number of new features | New feature(s) type |
|---|---|---|---|---|
| TX_DATETIME | Pandas timestamp | 0 if the transaction occurs during a weekday, 1 if it occurs during the weekend. The new feature is called TX_DURING_WEEKEND. | 1 | Integer (0/1) |
| TX_DATETIME | Pandas timestamp | 0 if the transaction occurs between 6am and midnight, 1 if it occurs between midnight and 6am. The new feature is called TX_DURING_NIGHT. | 1 | Integer (0/1) |
| CUSTOMER_ID | Categorical variable | Number of transactions by the customer in the last n day(s), for n in {1, 7, 30}. The new features are called CUSTOMER_ID_NB_TX_nDAY_WINDOW. | 3 | Integer |
| CUSTOMER_ID | Categorical variable | Average spending amount in the last n day(s), for n in {1, 7, 30}. The new features are called CUSTOMER_ID_AVG_AMOUNT_nDAY_WINDOW. | 3 | Real |
| TERMINAL_ID | Categorical variable | Number of transactions on the terminal in the last n+d day(s), for n in {1, 7, 30} and d = 7. The parameter d is called the delay and will be discussed later in the series. The new features are called TERMINAL_ID_NB_TX_nDAY_WINDOW. | 3 | Integer |
| TERMINAL_ID | Categorical variable | Average number of frauds on the terminal in the last n+d day(s), for n in {1, 7, 30} and d = 7. The parameter d is called the delay and will be discussed later in the series. The new features are called TERMINAL_ID_RISK_nDAY_WINDOW. | 3 | Real |
Let's now take a look at how we can build these feature transforms with Scala and Snowpark on Snowflake.
In this article we are going to port the creation of the date/time features and the customer spending features to Snowpark.
Our implementation of the Snowpark credit card transaction data feature engineering code can be found here on GitHub.
For the sake of time and space we are going to cover the highlights of the feature engineering code in the sub-sections below. You are more than welcome to download the project from GitHub and run it against your own Snowflake account.
Check our previous Snowpark article to understand how to set up and run Snowpark code.
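As a reminder of the build setup, the project pulls in the Snowpark Scala library via sbt; the version below is a placeholder, so use whichever Snowpark release your environment supports:

// build.sbt (version is a placeholder)
libraryDependencies += "com.snowflake" % "snowpark" % "<version>"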
Let's connect to Snowflake with a Session object:
import com.snowflake.snowpark._
import com.snowflake.snowpark.functions._
val session = Session.builder.configFile("conn.properties").create
val df_cust_txns = session.table("CUSTOMER_CC_TRANSACTIONS")
Our Snowflake account information is contained in the conn.properties file in the example above, letting our code authenticate with Snowflake.
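For reference, a conn.properties file for the Snowpark session builder typically contains key/value pairs like the following; the values here are placeholders, and the exact keys you need depend on your account and authentication method:

URL = https://<account_identifier>.snowflakecomputing.com
USER = <user_name>
PASSWORD = <password>
ROLE = <role_name>
WAREHOUSE = <warehouse_name>
DB = <database_name>
SCHEMA = <schema_name>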
In the code below we want to generate a feature that has a 0 if the transaction occurs between 6am and midnight and a 1 if it occurs between midnight and 6am. The new feature is called TX_DURING_NIGHT.
The code also generates a feature TX_DURING_WEEKEND that has a 0 if the transaction's date falls on a weekday and a 1 if it falls on the weekend.
val dfDateTimeFeat = df_cust_txns.withColumns(
  Seq("TX_DURING_WEEKEND", "TX_DURING_NIGHT"),
  Seq(
    // Weekend check: dayofweek 6 (Saturday) or 0 (Sunday) with Snowflake's default week settings
    iff(
      dayofweek(col("TX_DATETIME")) === 6 || dayofweek(col("TX_DATETIME")) === 0,
      lit(1),
      lit(0)
    ),
    // Night check: hour of day <= 6 (midnight through 6am)
    iff(hour(col("TX_DATETIME")) <= 6 || hour(col("TX_DATETIME")) > 23, lit(1), lit(0))
  ))
This feature creation passes two Seq collections to withColumns: one with the new column names and one with the corresponding column expressions, where iff generates the 0 or 1 value for each feature based on conditions in the data.
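To spot-check the result, we can preview the new columns (a quick sanity check, not part of the pipeline itself):

dfDateTimeFeat.select(col("TX_DATETIME"), col("TX_DURING_WEEKEND"), col("TX_DURING_NIGHT")).show(10)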
We want to calculate the number of transactions and the average spending amount for each customer over the following time windows: the previous 1, 7, and 30 days.
So we need to create the above 6 features for each transaction. Our general strategy (for both feature groups) in our Snowpark code is to: (1) find the date range covered by the transactions, (2) build a dataframe with one row per customer per day, (3) join the raw transactions onto it and aggregate the daily transaction count and amount per customer, and (4) apply window functions over those daily aggregates to produce the 1-, 7-, and 30-day features.
import java.text.SimpleDateFormat

// Find the first transaction date and the number of days covered by the data
val dateInfo = df_cust_txns.select(
  min(col("TX_DATETIME")).alias("END_DATE"),
  datediff("DAY", col("END_DATE"), max(col("TX_DATETIME")))
).collect()

val dStartDate = new SimpleDateFormat("yyyy-MM-dd").format(dateInfo(0).getTimestamp(0))
val nDays = dateInfo(0).getInt(1)

// Build a dataframe with one row per day in the date range
val dfDays = session.range(nDays)
  .withColumn("TX_DATE",
    to_date(dateadd("DAY", callBuiltin("SEQ4"), lit(dStartDate))))
val num_rows_txns_dfDays = dfDays.count()

// Cross join the days with the customers to get one row per customer per day
val dfCustomers = session.table("CUSTOMERS").select("CUSTOMER_ID")
val num_rows_txns_dfCustomers = dfCustomers.count()
val dfCustDay = dfDays.join(dfCustomers)
val num_rows_txns_dfCustDay = dfCustDay.count()

// ZEROIFNULL turns missing amounts (days with no transactions) into 0
val zeroifnull = builtin("ZEROIFNULL")
// df_cust_txns: the original table of raw transactions
// dfCustDay: the matrix of customers and possible dates
val dfCustTrxByDay = df_cust_txns.join(dfCustDay, df_cust_txns("CUSTOMER_ID") === dfCustDay("CUSTOMER_ID")
&& to_date(df_cust_txns("TX_DATETIME")) === dfCustDay("TX_DATE"), "rightouter")
.select(dfCustDay("CUSTOMER_ID").alias("CUSTOMER_ID"),
dfCustDay("TX_DATE"),
zeroifnull(df_cust_txns("TX_AMOUNT")).as("TX_AMOUNT"),
iff(col("TX_AMOUNT") > 0, lit(1), lit(0)).as("NO_TRX"))
.groupBy(col("CUSTOMER_ID"), col("TX_DATE"))
.agg(
sum(col("TX_AMOUNT")).alias("TOT_AMOUNT"),
sum(col("NO_TRX")).alias("NO_TRX")
)
// Quick inspection: daily aggregates for a single customer (CUSTOMER_ID = 0)
val dfSubset = dfCustTrxByDay.filter(col("CUSTOMER_ID") === lit(0)).sort(col("TX_DATE"))
// Windows over the previous 7 and 30 daily rows per customer (excluding the current day)
val custDate = Window.partitionBy(col("customer_id")).orderBy(col("TX_DATE"))
val win7dCust = custDate.rowsBetween(-7, -1)
val win30dCust = custDate.rowsBetween(-30, -1)
val dfCustFeatDay = dfCustTrxByDay
.select(col("TX_DATE"),col("CUSTOMER_ID"),
col("NO_TRX"), col("TOT_AMOUNT"),
lag(col("NO_TRX"),1).over(custDate).as("CUST_TX_PREV_1"),
sum(col("NO_TRX")).over(win7dCust).as("CUST_TX_PREV_7"),
sum(col("NO_TRX")).over(win30dCust).as("CUST_TX_PREV_30"),
lag(col("TOT_AMOUNT"),1).over(custDate).as("CUST_TOT_AMT_PREV_1"),
sum(col("TOT_AMOUNT")).over(win7dCust).as("CUST_TOT_AMT_PREV_7"),
sum(col("TOT_AMOUNT")).over(win30dCust).as("CUST_TOT_AMT_PREV_30"))
The original project generates 6 features for terminal usage. We're going to leave that function in Python in this series to show the difference in how these windows are created.
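For intuition, the terminal windows follow the same pattern as the customer windows, just shifted back by the delay d = 7. Below is a hedged sketch only: the per-terminal daily aggregate dfTermTrxByDay and its NB_TX and NB_FRAUD columns are assumptions for illustration, not part of the code shown above.

// Look at daily rows from (7 + 7) days back up to 7 days back (the delay period is excluded)
val termDate = Window.partitionBy(col("TERMINAL_ID")).orderBy(col("TX_DATE"))
val win7dDelayTerm = termDate.rowsBetween(-(7 + 7), -(7 + 1))
val dfTermRiskSketch = dfTermTrxByDay
  .withColumn("TERMINAL_ID_NB_TX_7DAY_WINDOW", sum(col("NB_TX")).over(win7dDelayTerm))
  .withColumn("TERMINAL_ID_RISK_7DAY_WINDOW",
    iff(sum(col("NB_TX")).over(win7dDelayTerm) > 0,
      sum(col("NB_FRAUD")).over(win7dDelayTerm) / sum(col("NB_TX")).over(win7dDelayTerm),
      lit(0)))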
Finally, we want to save the new features to a table in Snowflake with the Snowpark command:
dfCustBehaviurFeat.write.mode(SaveMode.Overwrite).saveAsTable("CUSTOMER_CC_TRANSACTION_FEATURES")
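As an optional sanity check, we can read the new table back and look at a few rows:

session.table("CUSTOMER_CC_TRANSACTION_FEATURES").show(10)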
At this point we have a table in Snowflake we can use for modeling in Amazon SageMaker Studio Lab.
In this post in our series we learned how to build features in Snowpark for Snowflake.
Our team can help -- we help companies with Snowflake platform operations, analytics, and machine learning.