Author: Josh Patterson
Date: January 26th, 2022
In our last article, the ACME Tool Co. team collectively established their goals and business constraints for their predictive maintenance pilot program.
In part 2 of this series, the ACME Tool Co. data science (ATDS) team starts its journey managing and analyzing the machine failure sensor data.
Key Takeaways:
The ATDS team knows what its requirements are, so let's introduce the sensor dataset the company provided for the predictive maintenance pilot project.
The dataset we will use for this post is the AI4I 2020 Predictive Maintenance Dataset Data Set from the UCI Machine Learning Repository.
The associated publication released with the dataset goes on to explain their rationale for using synthetic data:
"Since real predictive maintenance datasets are generally difficult to obtain and in particular difficult to publish, we present and provide a synthetic dataset that reflects real predictive maintenance encountered in industry to the best of our knowledge."
"Explainable Artificial Intelligence for Predictive Maintenance Applications", Stephan Matzka, Third International Conference on Artificial Intelligence for Industries (AI4I 2020), 2020 (in press)
Likewise, for our ACME Tool Co. blog series, we will use this dataset for analysis and modeling as it is available yet realistic. It's worth noting that this data is an aggregate of sensor data over time per machine. If we were working with the raw logs of sensor data, then before modeling (if we desired a tabular data structure) we'd run some type of aggregation query across the logs to build a similar aggregate form of the sensor data. In this way, this aggregated form of sensor data is an appropriate and realistic dataset to work with in this scenario.
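For illustration only, a minimal sketch of that kind of aggregation query is below; the raw_sensor_readings table and its columns are hypothetical stand-ins for whatever raw log schema a real deployment would have:
-- hypothetical raw log table: one row per sensor reading per machine
select
    machine_id,
    avg(air_temperature)     as air_temperature,
    avg(process_temperature) as process_temperature,
    avg(rotational_speed)    as rotational_speed,
    avg(torque)              as torque,
    max(tool_wear)           as tool_wear,
    max(machine_failure)     as machine_failure
from raw_sensor_readings
group by machine_id;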
Right now we know little about what the data contains. Before we can do any data exploration (before any modeling work), we need to get some data management and analysis tooling up and running. Let's start with getting our data into an analytic platform that can handle our aggregated data or could continuously ingest sensor data if we had that.
In part 1 of this blog series the ACME Tool Co. management team made it clear they had some constraints for the pilot:
The ACME Tool Co. pilot project needs to move fast to prove to the line of business that the team can hit the ROI metrics in their operational contract.
This also means the team needs to find a data platform in the cloud (because they can't bring in new on-premise hardware right now) that allows them to quickly prototype the pilot components while being scalable and cost-effective if the pilot program becomes a longer-term production system.
These constraints and timeline make the Snowflake Cloud Data Warehouse a great fit for data management and analytics components in this project.
Traditional analytical applications require extensive effort around building out platform infrastructure, which means your team spends a lot of time on non-value-generating activities (procuring hardware and software, configuring databases, ETL, etc.). These are activities the ACME Tool Co. team has neither the time nor the budget for in this pilot project.
Snowflake Cloud Data Warehouse is a cloud-native, fully relational ANSI SQL data warehouse service available in both AWS and Azure. Its scalability and ability to quickly provision a data warehouse make it the right place to start our sensor data management operation for our pilot program.
Snowflake has several features that make it a good fit here. It provides us with a rock solid analytical store in the cloud to collect raw machine data over time. Whenever we need to rebuild our model, we can query the raw data to pull the subset of data we need to build our model.
Historically you'd have to wait for, or request, the sensor/machine data to be pre-processed; Snowflake simplifies this aspect of data processing as there is no need for data engineering tricks or complex Spark jobs.
Given Snowflake's ability to scalably and quickly ingest diverse data sets, we don't have to expend a lot of energy and time on building out an ingestion framework. This lets our team focus on data processing to achieve line of business goals more quickly. Further, Snowflake only charges for what we use, so it provides us with an efficient data ingest mechanism and a low-cost data management platform for large amounts of machine-generated (e.g., "IoT") data.
With all of this in mind, let's get started working with Snowflake.
Our team can help -- we help companies with Snowflake platform operations, analytics, and machine learning.
To get started managing our sensor data in Snowflake we need to do the following: sign up for a Snowflake account, connect with the SnowSQL CLI client, create a database and table, stage and load the sensor CSV data, and connect our data science tools to Snowflake.
Let's start by signing up for Snowflake.
If you already have a Snowflake account, head to the site and log in. If you don't have an account, head to this link and sign up for the free account.
We're going to do everything from the command line via the SnowSQL (CLI Client): https://docs.snowflake.com/en/user-guide/snowsql.html
Once you have the SnowSQL CLI Client installed, open a terminal window and log into your Snowflake account from the command line with the following command:
$ snowsql -a [account_name] -u [user_name]
This should show console output similar to the output below:
Now that we have connectivity, let's move on to creating a database and table for our sensor data.
Now let's use the SnowSQL client to create a database and a table to manage our data. This tutorial has an associated github repository with the SQL scripts and Jupyter notebooks referenced: https://github.com/pattersonconsulting/predictive_maintenance
For reference, if you have git installed, you can quickly clone the project repository with the following command:
git clone https://github.com/pattersonconsulting/predictive_maintenance.git
To create a database (predictive_maintenance) and a table in that database (summary_sensor_data), use the following script (create_pm_db_and_table.sql):
create or replace database predictive_maintenance;

create or replace table summary_sensor_data (
    UDI int,
    Product_ID string,
    Type string,
    Air_temperature float,
    Process_temperature float,
    Rotational_speed float,
    Torque float,
    Tool_wear float,
    Machine_failure int,
    TWF int,
    HDF int,
    PWF int,
    OSF int,
    RNF int
);
We can execute the script included in the github repository from the terminal command line with the command:
snowsql -a [xxxxxxx.us-east-1] -u [user_name] -f ./create_pm_db_and_table.sql
And we should see output similar to the output below.
Now let's move the sensor data into our Snowflake table.
Download the UCI dataset from the following URL:
AI4I 2020 Predictive Maintenance Dataset Data Set (from the UCI Machine Learning Repository).
Now we load this raw CSV sensor data into Snowflake. For reference, check out the Snowflake documentation on "Loading data into Snowflake":
https://docs.snowflake.com/en/user-guide-data-load.html
If we change our current database to PREDICTIVE_MAINTENANCE, the output should look like:
Now we can ask Snowflake to show tables; and it will output:
We can get more detail on our table SUMMARY_SENSOR_DATA with the describe table command.
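For reference, the three console steps above boil down to the following standard SnowSQL commands, run from the SnowSQL prompt after connecting:
-- switch to the database we created, list its tables, and inspect our table's columns
use database PREDICTIVE_MAINTENANCE;
show tables;
describe table SUMMARY_SENSOR_DATA;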
To load the data into the target table we will:
1. Put the data in an internal stage on Snowflake (the table's implied stage is addressed via the % syntax)
2. View the staged data
3. Copy the staged data into the target table
Each table has a Snowflake stage allocated to it by default for storing files (this is the "Table Stage"; check out the staging documentation for details). We will use the associated table stage (@predictive_maintenance.public.%SUMMARY_SENSOR_DATA) to load the data into Snowflake. To stage the sensor data for loading, use the following command (which references the included SQL script stage_sensor_data.sql):
snowsql -a nna57244.us-east-1 -u jpatanooga -f ./stage_sensor_data.sql
Depending on where you downloaded the UCI sensor data CSV file, you may need to update the staging SQL script. Once the command is run, you should see output similar to the output below.
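For reference, the staging script essentially wraps a PUT into the table stage. A minimal sketch is below; the local file path and file name are assumptions, so point them at wherever you saved the UCI CSV:
use database predictive_maintenance;
-- upload the local CSV into the table stage for summary_sensor_data
-- (adjust the file path to your download location)
put file:///path/to/ai4i2020.csv @%summary_sensor_data;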
To confirm we staged the file on Snowflake, use the following list command from SnowSQL:
list @predictive_maintenance.public.%SUMMARY_SENSOR_DATA;
You should see something similar to the output below:
Now let's copy the staged data into the target table. We'll use the included script (copy_staged_data_to_table.sql), which has the following contents:
use database PREDICTIVE_MAINTENANCE;

COPY INTO summary_sensor_data
  FROM @%summary_sensor_data
  FILE_FORMAT = ( TYPE = CSV, SKIP_HEADER = 1 );
Run this script with the command:
snowsql -a nna57244.us-east-1 -u jpatanooga -f ./copy_staged_data_to_table.sql
And the output should look like:
Now if we log into the SnowSQL console and run the query:
select * from SUMMARY_SENSOR_DATA limit 5;
We should see:
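As an extra sanity check (this query is not part of the included scripts), a quick row count should match the number of data rows in the UCI CSV:
select count(*) from SUMMARY_SENSOR_DATA;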
Now that we have established our cloud-based sensor data management platform and loaded our data, let's connect some data science tools to Snowflake.
Google Colaboratory is a cloud-based Jupyter notebook server that is free to use. Given that we have a small team of data scientists who want to use Python in Jupyter notebooks and we need to run notebooks in the cloud, it's a great option for the ACME Tool Co. data science team to "move fast".
The ACME Tool Co. data science team needs to explore the sensor data before any modeling occurs, and they have some existing exploratory data analysis methods they want to apply. Fortunately there is a Snowflake Python connector, and it works from inside Google Colab, keeping all of our analysis and data management in the cloud.
The nice thing about the Snowflake Connector for Python is that we can just pip install it in the Colab notebook and keep on moving via:
!pip install snowflake-connector-python
Once we do that, we can quickly test our connection to our Snowflake account with the following code:
import snowflake.connector

# Gets the current Snowflake version to verify connectivity
ctx = snowflake.connector.connect(
    user='[your_user_here]',
    password='[your_pw_here]',
    account='[xxxxxxxxxx.us-east-1]'
)
cs = ctx.cursor()
try:
    cs.execute("SELECT current_version()")
    one_row = cs.fetchone()
    print(one_row[0])
finally:
    cs.close()
    ctx.close()
At this point the ACME Tool Co. data science (ATDS) team has a platform they can ingest sensor data into and then pull into a Jupyter notebook for analytics and machine learning in the cloud.
In this post in our series we learned how Snowflake provides us with a rock solid analytical store in the cloud to collect raw machine data over time. Once the ATDS team needs to rebuild their model, they can query the raw data to build any summary aggregates for analysis and modeling.
Before the ATDS team can start building machine learning models they need to better understand what is in the current dataset. This exploratory process is known as "Exploratory Data Analysis" (EDA) and is key to informing how to best start our machine learning activities. In the next post (part 3) we are going to dive straight into performing an EDA workflow on the sensor data stored in Snowflake.
Our team can help -- we help companies with Snowflake platform operations, analytics, and machine learning.