Author: Josh Patterson
Date: zzzzzzz xxth, 2022
In this blog post series we will demonstrate how to build a credit card fraud detection system on realistic data.
Series Key Take-Aways:
In this series we further explore feature engineering with Snowpark and then show you how to use Amazon SageMaker Studio Lab to model the feature data in Snowflake.
In this article (Part 1 of the series) we're going to document how to convert the raw transaction data to Parquet and bulk load it into a Snowflake table.
The work in this series is based on the dataset and general workflow from the online book Reproducible Machine Learning for Credit Card Fraud Detection - Practical Handbook.
The project also includes a separate GitHub repository containing the raw transaction data in .pkl files, which we'll use in our example further below in this article.
In places we've transformed the data to other formats or used different models for modeling the data, but the overall workflow ideas are the same. This also gives us a good reference baseline on which to compare Snowpark's dataframe API against the pandas dataframe API.
We'll start out by taking a look at how to convert the raw dataset's file format (pkl) into something that we can ingest into a Snowflake table.
There is a bit of magic in the new 🤗nlp library besides giving dead-simple access to 120+ datasets🧙♂️
We've tested it with @qlhoest and loading a 17GB+ dataset like English Wikipedia only takes... 9MB in RAM🐣
And you can iterate over the data at 2-3 Gbit/s🚀
Try it yourself👇 pic.twitter.com/wx7x7fzhqf
— Thomas Wolf (@Thom_Wolf) June 15, 2020
When bulk loading a lot of data into Snowflake there are multiple things to think about, such as which file format to use, how large each data file should be, and how the files will be staged before loading.
We note that Snowflake supports the following file formats for data loading: delimited files (CSV, TSV, etc.), JSON, Avro, ORC, Parquet, and XML.
We're not going to delve too deeply into the file format arguments (outside of the sidebar below), but for this blog series we're going to convert the pkl files to parquet files.
Our reasoning here is that while CSV files can load faster into Snowflake, CSVs are row-oriented (which means they're slow to query by themselves) and difficult to store efficiently. CSVs also lack the ability to carry a schema embedded in the file format.
We want to explore using Parquet files because they carry the schema embedded in the file format, they compress well (Parquet is an efficient columnar storage format that supports complex nested data structures in a flat columnar layout), and they are well suited to huge amounts of complex data (Parquet offers a variety of data compression and encoding options). Parquet files are also easier to work with because they are supported by so many different projects.
Pandas is a good example of a tool that uses both Parquet and Apache Arrow: a user can save a pandas DataFrame to Parquet and read a Parquet file into an in-memory Arrow table, and pandas can work directly with the Arrow columns, which is a great example of interoperability.
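To make that interoperability concrete, here is a minimal sketch (the tiny DataFrame, file name, and values are just placeholders for illustration) that saves a pandas DataFrame to Parquet, reads it back as an in-memory Arrow table, and converts it back to pandas:
import pandas as pd
import pyarrow.parquet as pq

# A small placeholder DataFrame standing in for real transaction data
df = pd.DataFrame({"TRANSACTION_ID": [1, 2, 3], "TX_AMOUNT": [57.16, 81.51, 146.00]})

# Save the DataFrame to a Parquet file; the schema travels with the file
df.to_parquet("example_txns.parquet")

# Read the Parquet file back as an in-memory Arrow table and inspect its schema
table = pq.read_table("example_txns.parquet")
print(table.schema)

# Convert the Arrow table back into a pandas DataFrame
df_roundtrip = table.to_pandas()
print(df_roundtrip.head())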
A great example of how performant Parquet files can be is how HuggingFace is able to iterate through 17GB of data in less than a minute with a RAM footprint of 9MB by using Apache Arrow under the hood.
Using Apache Arrow for internal serialization allows us to map blobs of data on-drive without doing any deserialization. This gives us the advantage of using our dataset directly on disk where we can use memory-mapping and pay effectively zero cost. Additionally, random access is O(1) which is powerful from an algorithmic standpoint.
So while we may only be using Parquet (and Arrow under the hood) to move data into Snowflake in this example, understanding how Parquet files work is a useful tool in our toolbelt, especially if we had to do any pre-processing on the data before loading into Snowflake.
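To make that concrete, here is a minimal sketch (assuming the single Parquet file we'll produce in the conversion step below) that shows two of those properties in action: the schema embedded in the file and cheap, column-level access via memory mapping:
import pyarrow.parquet as pq

# Path of the Parquet file produced in the conversion step below
parquet_path = "./your/path/here/cc_txn_data_pkl_all_files_noindex.parquet"

# The schema lives in the file footer, so we can inspect it without scanning the data
print(pq.ParquetFile(parquet_path).schema_arrow)

# Because Parquet is columnar, we can memory-map the file and pull back a single
# column without materializing the other columns in RAM
amounts = pq.read_table(parquet_path, columns=["TX_AMOUNT"], memory_map=True)
print(amounts.num_rows)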
To get a copy of the raw credit card transaction data from the project mentioned above, clone the GitHub repository with the command below:
git clone https://github.com/Fraud-Detection-Handbook/simulated-data-raw
That will give you a local copy of the raw credit card transaction data from the online book project (in the .pkl file format).
In the code listing below we show how to use the pyarrow Python module to convert the .pkl files into a single Parquet file containing 1,754,155 rows × 9 columns.
import os

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Directory containing the .pkl files cloned from the simulated-data-raw repository
strBasePath = "./source/pkl/data/path/"

# Read each pickle file and concatenate them into a single pandas DataFrame
all_df = pd.concat(
    [pd.read_pickle(strBasePath + pickle_file_name) for pickle_file_name in sorted(os.listdir(strBasePath))]
).reset_index(drop=True)

# Convert the DataFrame to an Arrow table and write it out as a single Parquet file
table = pa.Table.from_pandas(all_df)
parquet_path = "./your/path/here/cc_txn_data_pkl_all_files_noindex.parquet"
pq.write_table(table, parquet_path, version='1.0')
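As a quick sanity check on the conversion (a minimal sketch, assuming the output path used above), we can read the Parquet file back with pandas and confirm the expected 1,754,155 rows × 9 columns:
import pandas as pd

# Read the converted Parquet file back into pandas and confirm its shape
check_df = pd.read_parquet("./your/path/here/cc_txn_data_pkl_all_files_noindex.parquet")
print(check_df.shape)   # expected: (1754155, 9)
print(check_df.dtypes)  # the columns should line up with the Snowflake table we create later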
Once we have our Parquet file(s), we can load them into a Snowflake database in the Snowflake Data Cloud.
When data is loaded into Snowflake, Snowflake reorganizes that data into its internal optimized, compressed, columnar format, and this columnar database engine is queried with standard ANSI SQL.
Some things we want to consider when choosing where to manage our transaction data:
When we have to load large amounts of data we may want to look at efficient file formats that enable compression. Snowflake recommends data files roughly 100-250 MB (or larger) in size compressed; see the sketch just below these considerations for one way to size the output files.
Transactional data (e.g., credit card transactions, here) creates large amounts of repetitive data. Scalable systems such as Snowflake help us in the feature creation phase of machine learning because many times the data we actually model is smaller than the incoming raw transaction data.
As we wrote previously in this article, Snowflake supports many types of files including Parquet, which we're focused on in this article.
Snowflake Cloud Data Warehouse is a cloud-native, fully relational ANSI SQL data warehouse service available in both AWS and Azure. Its scalability and ability to quickly provision a data warehouse make it the right place to start our transaction data management operation for this project.
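Following up on the file-sizing recommendation above, here is a minimal sketch (the chunk count and paths are hypothetical; in practice you'd tune the chunk count so each compressed file lands in the 100-250 MB range) that splits our single converted file into several smaller Parquet files:
import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Read the single converted file back in (path from the conversion step above)
all_df = pd.read_parquet("./your/path/here/cc_txn_data_pkl_all_files_noindex.parquet")

# Split the DataFrame into N roughly equal chunks and write each as its own Parquet file;
# tune num_chunks so each compressed output file is roughly 100-250 MB
num_chunks = 8
for i, chunk in enumerate(np.array_split(all_df, num_chunks)):
    table = pa.Table.from_pandas(chunk, preserve_index=False)
    pq.write_table(table, f"./your/path/here/cc_txn_data_part_{i}.parquet")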
We have our data in Parquet, but we need to do a few things before we can load it into Snowflake: install the SnowSQL CLI client, create a database and a table, define a Parquet file format, stage the Parquet file, and copy the staged data into the table.
You'll need to use either the online Web UI for Snowflake or the command-line interface (SnowSQL CLI client). We're going to do everything from the command line via the SnowSQL CLI in this blog post:
https://docs.snowflake.com/en/user-guide/snowsql.html
Once you have the SnowSQL CLI Client installed, open a terminal window and log into your Snowflake account from the command line with the following command:
$ snowsql -a [account_name] -u [user_name]
This should show console output similar to the output below:
Now that we have connectivity, let's use the SnowSQL client to create a database and a table to manage our transaction data.
To create a database (CREDIT_CARD_COMPANY_DB), use the following command from the SnowSQL CLI:
create or replace database CREDIT_CARD_COMPANY_DB;
If we change our current database to CREDIT_CARD_COMPANY_DB (with use database CREDIT_CARD_COMPANY_DB;), the console output should confirm the change.
Next we want to create a table in our Snowflake database.
create or replace table CUSTOMER_CC_TRANSACTIONS (
    TRANSACTION_ID int,
    TX_DATETIME timestamp,
    CUSTOMER_ID int,
    TERMINAL_ID int,
    TX_AMOUNT FLOAT,
    TX_TIME_SECONDS int,
    TX_TIME_DAYS int,
    TX_FRAUD int,
    TX_FRAUD_SCENARIO int
);
We can execute the commands above interactively from SnowSQL or from a file via SnowSQL:
snowsql -a [xxxxxxx.us-east-1] -u [user_name] -f ./create_pm_db_and_table.sql
Now let's move on to uploading the Parquet-based credit card transaction data into our Snowflake table.
We can get more detail on our table CUSTOMER_CC_TRANSACTIONS with the describe table command (describe table CUSTOMER_CC_TRANSACTIONS;).
When we load the data into the Snowflake stage, we'll need to tell the system a bit about how the data is formatted. Since we're using Parquet files, we'll create a Parquet file format called parquet_format inside our database.
create or replace file format parquet_format type = parquet;
We'll take the default option values for our file format.
We next need to put the data into an internal stage in Snowflake. First we need to create our temporary internal stage with the command below:
create or replace temporary stage parquet_cc_data_loading_stage file_format = parquet_format;
Each table also has a Snowflake stage allocated to it by default for storing files (the "table stage"; check out the Snowflake staging documentation for details).
We can see which stages have been created for a database with the show stages; command.
We will use the stage we just created (@parquet_cc_data_loading_stage) to load the data into Snowflake. We can see the PUT command below:
PUT file:///tmp/cc_txn_data_pkl_all_files_noindex.parquet @parquet_cc_data_loading_stage;
We note that by default Snowflake reads Parquet data into a single Variant column (Variant is a tagged universal type that can hold up to 16 MB of any data type supported by Snowflake). The data in this Variant column can be queried with standard SQL (including joining against its data).
There is also the option to move the Parquet columns directly into separate Snowflake columns (extracting the individual columns into a structured schema) during the data load phase from stage to table. The structured approach gives us a little more performance on the load phase, and is easier to manage with the defined schema. We can see this load statement in action below.
copy into CUSTOMER_CC_TRANSACTIONS from (
    select
        $1:TRANSACTION_ID,
        $1:TX_DATETIME::varchar,
        $1:CUSTOMER_ID,
        $1:TERMINAL_ID,
        $1:TX_AMOUNT,
        $1:TX_TIME_SECONDS,
        $1:TX_TIME_DAYS,
        $1:TX_FRAUD,
        $1:TX_FRAUD_SCENARIO
    from @parquet_cc_data_loading_stage/cc_txn_data_pkl_all_files_noindex.parquet
);
In the SQL listing above, the main body of the COPY statement extracts the labeled fields contained in the Parquet data, mapping each one directly to the corresponding column in CUSTOMER_CC_TRANSACTIONS.
Loading data into a fully structured (columnarized) schema is ~10-20% faster than landing it in a VARIANT column (Reference: "How to Load Terabytes into Snowflake").
To do a quick visual check that the transaction data loaded correctly, run the following command:
select * from CUSTOMER_CC_TRANSACTIONS limit 4;
The output should show four rows of transaction data with all nine columns populated.
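If you'd rather verify the load programmatically than eyeball the rows, here is a minimal sketch (assuming the snowflake-connector-python package and placeholder credentials, neither of which is part of the SnowSQL-only workflow above) that compares the table's row count against the 1,754,155 rows we wrote to Parquet:
import snowflake.connector

# Connect with placeholder credentials -- substitute your own account details
conn = snowflake.connector.connect(
    account="your_account",
    user="your_user",
    password="your_password",
    warehouse="your_warehouse",
    database="CREDIT_CARD_COMPANY_DB",
    schema="PUBLIC",
)

# Count the rows that landed in the table and compare against the Parquet row count
cur = conn.cursor()
cur.execute("select count(*) from CUSTOMER_CC_TRANSACTIONS")
print(cur.fetchone()[0])  # expected: 1754155
conn.close()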
In this post in our series we learned some basics around Parquet files and how to load them into Snowflake.
Now that we have established our cloud-based credit card transaction data management platform and loaded our data, let's move on to building some features with Snowflake and Snowpark.
Normally we'd do a lot of EDA (as in our previous blog series on applied predictive maintenance), but here we're going to skip that step because it has already been done for us by the authors of the original project. Instead, we'll focus on building out the project on Snowpark and Amazon SageMaker Studio Lab in the next two entries in this series.
Next: Part 2 of 5: Scalable Feature Engineering with Snowpark
Our team can help -- we help companies with Snowflake platform operations, analytics, and machine learning.