Author: Josh Patterson
Date: January 26th, 2022
In our last article, the ACME Tool Co. team collectively established their goals and business constraints for their predictive maintenance pilot program.
In part 2 of this series, the ACME Tool Co. data science (ATDS) team starts its journey managing and analyzing the machine failure sensor data.
Key Takeaways:
The ATDS team knows what its requirements are, so let's introduce the sensor dataset the company provided for the predictive maintenance pilot project.
The dataset we will use for this post is the AI4I 2020 Predictive Maintenance Dataset Data Set from the UCI Machine Learning Repository.
The associated publication released with the dataset goes on to explain their rationale for using synthetic data:
"Since real predictive maintenance datasets are generally difficult to obtain and in particular difficult to publish, we present and provide a synthetic dataset that reflects real predictive maintenance encountered in industry to the best of our knowledge."
"Explainable Artificial Intelligence for Predictive Maintenance Applications", Stephan Matzka, Third International Conference on Artificial Intelligence for Industries (AI4I 2020), 2020 (in press)
Likewise, for our ACME Tool Co. blog series, we will use this dataset for analysis and modeling as it is available yet realistic. It's worth noting that this data is an aggregate of sensor data over time per machine. If we were working with the raw logs of sensor data, then before modeling (if we desired a tabular data structure) we'd run some type of aggregation query across the logs to build a similar aggregate form of the sensor data. In this way, this aggregated form of sensor data is an appropriate and realistic dataset to work with in this scenario.
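For illustration only, a minimal sketch of that kind of aggregation query is below; the raw_sensor_readings table and its columns are hypothetical stand-ins for whatever raw log schema a real deployment would have:
-- hypothetical raw log table: one row per sensor reading per machine
select
    machine_id,
    avg(air_temperature)     as air_temperature,
    avg(process_temperature) as process_temperature,
    avg(rotational_speed)    as rotational_speed,
    avg(torque)              as torque,
    max(tool_wear)           as tool_wear,
    max(machine_failure)     as machine_failure
from raw_sensor_readings
group by machine_id;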
Right now we know little about what the data contains. Before we can do any data exploration (before any modeling work), we need to get some data management and analysis tooling up and running. Let's start with getting our data into an analytic platform that can handle our aggregated data or could continuously ingest sensor data if we had that.
In part 1 of this blog series the ACME Tool Co. management team made it clear they had some constraints for the pilot:
The ACME Tool Co. pilot project needs to move fast to prove to the line of business that the team can hit the ROI metrics in their operational contract.
This also means the team needs to find a data platform in the cloud (because they can't bring in new on-premise hardware right now) that allows them to quickly prototype the pilot components while being scalable and cost-effective if the pilot program becomes a longer-term production system.
These constraints and timeline make the Snowflake Cloud Data Warehouse a great fit for data management and analytics components in this project.
Traditional analytical applications require extensive effort around building out platform infrastructure, which means your team spends a lot of time on non-value-generating activities (procuring hardware and software, configuring databases, ETL, etc.). These are activities the ACME Tool Co. team has neither the time nor the budget for in this pilot project.
Snowflake Cloud Data Warehouse is a cloud-native, fully relational ANSI SQL data warehouse service available in both AWS and Azure. Its scalability and ability to quickly provision a data warehouse make it the right place to start our sensor data management operation for our pilot program.
Snowflake has several features that make it a good fit here. It provides us with a rock solid analytical store in the cloud to collect raw machine data over time. Whenever we need to rebuild our model, we can query the raw data to pull the subset of data we need to build our model.
Historically you'd have to wait for, or request, the sensor/machine data to be pre-processed; Snowflake simplifies this aspect of data processing as there is no need for data engineering tricks or complex Spark jobs.
Given Snowflake's ability to scalably and quickly ingest diverse data sets, we don't have to expend a lot of energy and time on building out an ingestion framework. This lets our team focus on data processing to achieve line of business goals more quickly. Further, Snowflake only charges for what we use, so it provides us with an efficient data ingest mechanism and a low-cost data management platform for large amounts of machine-generated (e.g., "IoT") data.
With all of this in mind, let's get started working with Snowflake.
Our team can help -- we help companies with Snowflake platform operations, analytics, and machine learning.
To get started managing our sensor data in Snowflake we need to do the following: sign up for a Snowflake account, connect with the SnowSQL CLI client, create a database and table, stage and load the sensor CSV data, and connect our data science tools to Snowflake.
Let's start by signing up for Snowflake.
If you already have a Snowflake account, head to the site and log in. If you don't have an account, head to this link and sign up for the free account.
We're going to do everything from the command line via the SnowSQL (CLI Client): https://docs.snowflake.com/en/user-guide/snowsql.html
Once you have the SnowSQL CLI Client installed, open a terminal window and log into your Snowflake account from the command line with the following command:
$ snowsql -a [account_name] -u [user_name]
This should show console output similar to the output below:
Now that we have connectivity, let's move on to creating a database and table for our sensor data.
Now let's use the SnowSQL client to create a database and a table to manage our data. This tutorial has an associated github repository with the SQL scripts and Jupyter notebooks referenced: https://github.com/pattersonconsulting/predictive_maintenance
For reference, if you have git installed, you can quickly clone the project repository with the following command:
git clone https://github.com/pattersonconsulting/predictive_maintenance.git
To create a database (predictive_maintenance) and a table in that database (summary_sensor_data), use the following script (create_pm_db_and_table.sql):
create or replace database predictive_maintenance;

create or replace table summary_sensor_data (
    UDI int,
    Product_ID string,
    Type string,
    Air_temperature float,
    Process_temperature float,
    Rotational_speed float,
    Torque float,
    Tool_wear float,
    Machine_failure int,
    TWF int,
    HDF int,
    PWF int,
    OSF int,
    RNF int
);
We can execute the script included in the github repository from the terminal command line with the command:
snowsql -a [xxxxxxx.us-east-1] -u [user_name] -f ./create_pm_db_and_table.sql
And we should see output similar to the output below.
Now let's move the sensor data into our Snowflake table.
Download the UCI dataset from the following URL:
AI4I 2020 Predictive Maintenance Dataset Data Set (from the UCI Machine Learning Repository).
Now we load this raw CSV sensor data into Snowflake. For reference, check out the Snowflake documentation on "Loading data into Snowflake":
https://docs.snowflake.com/en/user-guide-data-load.html
If we change our current database to PREDICTIVE_MAINTENANCE, the output should look like:
Now we can ask Snowflake to show tables; and it will output:
We can get more detail on our table SUMMARY_SENSOR_DATA with the describe table command.
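For reference, the three console steps above boil down to the following standard SnowSQL commands, run from the SnowSQL prompt after connecting:
-- switch to the database we created, list its tables, and inspect our table's columns
use database PREDICTIVE_MAINTENANCE;
show tables;
describe table SUMMARY_SENSOR_DATA;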
To load the data into the target table we will:
1. Put the data in an internal stage on Snowflake (the table's implied stage is addressed via the % syntax)
2. View the staged data
3. Copy the staged data into the target table
Each table has a Snowflake stage allocated to it by default for storing files (this is the "Table Stage"; check out the staging documentation for details). We will use the associated table stage (@predictive_maintenance.public.%SUMMARY_SENSOR_DATA) to load the data into Snowflake. To stage the sensor data for loading, use the following command (which references the included SQL script stage_sensor_data.sql):
snowsql -a nna57244.us-east-1 -u jpatanooga -f ./stage_sensor_data.sql
Depending on where you downloaded the UCI sensor data CSV file, you may need to update the staging SQL script. Once the command is run, you should see output similar to the output below.
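For reference, the staging script essentially wraps a PUT into the table stage. A minimal sketch is below; the local file path and file name are assumptions, so point them at wherever you saved the UCI CSV:
use database predictive_maintenance;
-- upload the local CSV into the table stage for summary_sensor_data
-- (adjust the file path to your download location)
put file:///path/to/ai4i2020.csv @%summary_sensor_data;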
To confirm we staged the file on Snowflake, use the following list command from SnowSQL:
list @predictive_maintenance.public.%SUMMARY_SENSOR_DATA;
You should see something similar to the output below:
Now let's copy the staged data into the target table. We'll use the included script (copy_staged_data_to_table.sql), which has the following contents:
use database PREDICTIVE_MAINTENANCE;

COPY INTO summary_sensor_data
  FROM @%summary_sensor_data
  FILE_FORMAT = ( TYPE = CSV, SKIP_HEADER = 1 );
Run this script with the command:
snowsql -a nna57244.us-east-1 -u jpatanooga -f ./copy_staged_data_to_table.sql
And the output should look like:
Now if we log into the SnowSQL console and run the query:
select * from SUMMARY_SENSOR_DATA limit 5;
We should see:
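As an extra sanity check (this query is not part of the included scripts), a quick row count should match the number of data rows in the UCI CSV:
select count(*) from SUMMARY_SENSOR_DATA;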
Now that we have established our cloud-based sensor data management platform and loaded our data, let's connect some data science tools to Snowflake.
Google Colaboratory is a cloud-based Jupyter notebook server that is free to use. Given that we have a small team of data scientists who want to use Python in Jupyter notebooks and we need to run notebooks in the cloud, it's a great option for the ACME Tool Co. data science team to "move fast".
The ACME Tool Co. data science team needs to explore the sensor data before any modeling occurs, and they have some existing exploratory data analysis methods they want to apply. Fortunately there is a Snowflake Python connector, and it works from inside Google Colab, keeping all of our analysis and data management in the cloud.
The nice thing about the Snowflake Connector for Python is that we can just pip install it in the Colab notebook and keep on moving via:
!pip install snowflake-connector-python
Once we do that, we can quickly test our connection to our Snowflake account with the following code:
import snowflake.connector

# Gets the current Snowflake version to verify connectivity
ctx = snowflake.connector.connect(
    user='[your_user_here]',
    password='[your_pw_here]',
    account='[xxxxxxxxxx.us-east-1]'
)
cs = ctx.cursor()
try:
    cs.execute("SELECT current_version()")
    one_row = cs.fetchone()
    print(one_row[0])
finally:
    cs.close()
    ctx.close()
At this point the ACME Tool Co. data science (ATDS) team has a platform they can ingest sensor data into and then pull into a Jupyter notebook for analytics and machine learning in the cloud.
In this post in our series we learned how Snowflake provides us with a rock solid analytical store in the cloud to collect raw machine data over time. Once the ATDS team needs to rebuild their model, they can query the raw data to build any summary aggregates for analysis and modeling.
Before the ATDS team can start building machine learning models they need to better understand what is in the current dataset. This exploratory process is known as "Exploratory Data Analysis" (EDA) and is key to informing how to best start our machine learning activities. In the next post (part 3) we are going to dive straight into performing an EDA workflow on the sensor data stored in Snowflake.
Our team can help -- we help companies with Snowflake platform operations, analytics, and machine learning.