Applied Predictive Maintenance

Part 1 of 6: "Making the Business Case for Predictive Maintenance"

Author: Josh Patterson
Date: January 12th, 2022

Applications of Predictive Maintenance

Predictive maintenance allows us to estimate when maintenance should be performed by predicting the condition of operational equipment. It improves uptime in industrial sectors and lowers costs. Relevant industries and applications:

| Industry | Application Area |
|---|---|
| Smart City | Traffic management, road safety, security, parking management, automotive telematics, public transport, environment protection |
| Oil and Gas | Equipment and infrastructure reliability and efficiency |
| Logistics | Predictive maintenance and equipment optimization |
| Power and Utilities | Transmission, generation, and distribution infrastructure reliability and efficiency |
| Manufacturing | Predicting device failure |
| Mining | Equipment and vehicle predictive failure/maintenance |
| Heavy Equipment | Predictive analytics and failure detection |

For more ideas around use cases in predictive maintenance, manufacturing, and other industry verticals, check out our online resource for Machine Learning Use Cases.

In this blog post series on building a real-world predictive maintenance application, we'll start with a business case from a fictional manufacturing company (ACME Tool Co.) and we'll work through how the company makes a business case and then builds a predictive application to meet the business requirements.

Series Key Take-Aways:

  1. Learning how to formulate a business case for a data science project
  2. Operationalizing the live application in cloud-based Jupyter notebooks and on Snowflake
  3. Understanding how to forecast the financial value of a data science project

Far too many data science projects fail due to poorly defined (or no) business goals. Many times this comes down to the fact that the line of business does not know how to create an operational framework around a data science project that gives the data science team the latitude to operate while keeping them focused on a real financial target.

In this series we'll look at how to start with our business objectives, financial goals, and cost-benefit information. This information will then be used to tell the data science team where to start and when they have "enough performance" to hit realistic financial goals with the model.

This framework of clearly defining a start point and delivery criteria for the relationship between the line of business and the data science group is critical to building successful data science products in the enterprise.

In this article we're going to document:

  • the business case for predictive maintenance at ACME Tool Co
  • business constraints on the project
  • minimum viable outcomes based on financial targets
Let's now get to know ACME Tool Company.

The Business Case for Predictive Maintenance: ACME Tool Manufacturing Co

In a highly competitive market, manufacturing companies are under pressure to raise output with the same or fewer resources. This puts pressure on the operations management team of manufacturing plants to find ways to reduce error (see also: anomaly detection article) and raise production uptime.

In this scenario we have ACME Tool Manufacturing Co. (ATMC), which produces a specialized wrench for oil rig maintenance. ACME maintains the following:

  • 5 Manufacturing sites
  • 50 manufacturing lines across the 5 manufacturing sites
  • Each manufacturing line has 10 stations with a key tool with a specific manufacturing step
  • In total, this gives the company 500 station machines to maintain to keep the production lines operational
  • Each tool is worth $20 in sales to the company, and the company can theoretically produce 5783 tools per day
The above metrics give you an idea of what peak operations look like for ACME Tool Manufacturing. In reality, ACME Tool never has "no downtime" and machines go down quite consistently, as we'll see in the operations analysis below. Now, let's dig into how market pressures are forcing ATMC to evolve.

The Red Queen Never Sleeps

A major competitor in the manufacturing space has just announced that it not only plans to start producing a similar tool for oil rigs, but will also price the tool 7.5% below ATMC's price point to aggressively capture market share. This sends shockwaves through the market.

In a margin-based business, this is a big deal, and nobody at ATMC is having a good week.

The executive team realizes that ACME will have to match this price or lose major market share. This means 7.5% less revenue coming in, which would put a non-trivial dent in the company's operating profit margin overnight (from 12% to 4.86%). This drop would adversely affect the value of the business to shareholders and hurt ACME's ability to raise new capital to compete in the market or expand. It would also make it hard for ACME to continue operating in the market long term.

ACME acquired this manufacturing unit 3 years prior with the expectation of being able to grow the market, but that was predicated on the ability to maintain or improve operating profit margins. Enduring a sudden reduction in operating profit margin would be disastrous from an investment standpoint.

The management team lays out two paths forward:

  1. Try and use the existing operations to produce more tools at a lower margin to make up some of the missing operating profit
  2. Use modern sensors and predictive maintenance methods to reduce downtime
The team analyzes the first option and decides that putting more stress on their manufacturing operations would be difficult at best, wearing down the equipment faster for less money per tool sold. Their manufacturing plants have largely been operated in a reactive maintenance manner, because "that's how it's always been done", and more throughput would create more downtime as well.

The ACME management team is nervous about the second option, but if they could pull it off they could potentially recover much of their operating profit margin (even with the price drop from $20 to $18.50) by recapturing some of the operational downtime in the manufacturing lines. The worry amongst the team is that traditionally it has been hard to model the value of machine learning models and their associated effectiveness over time. However, their data science team has the support of the Patterson Consulting team, allowing them to explore their options quickly and in a cost-effective manner.

ACME realizes that if they are going to transition into a modern manufacturing operation then the time is now. It's never going to get easier to compete going forward than it is now, so ACME has to adapt or be relegated to continually losing market share. In many ways, the cycle of "software eating the world" is really the old story where the Red Queen tells Alice that we have to "run as fast as we can to stay in the same place". ACME has to run faster than they ever have, and they have no choice but to adapt.

Analyzing Current Operations and Maintenance Team

Currently the 5 manufacturing sites produce:

| Metric | Value |
|---|---|
| Tools produced per hour per line | 14.46 |
| Old value of tool | $20 |
| Total tool production capacity per day (theoretical, if no downtime) | 5783 |
| Value of tools produced per day (all lines, theoretical, no downtime) | $115,662 |
| Actual tools produced per day (with downtime) | 3875 |
| Value of tools produced per day (with downtime) | $77,493.98 |

The company has downtime that prevents production during 33% of its operating hours. Unfortunately, the company is still paying the rent, electric bill, executives, and production personnel while the production line is down, so downtime is expensive. The team knows they need to reduce downtime, but first they have to analyze what specifically is causing the downtime.
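
As a quick sanity check, the production figures in the table above can be reproduced with a few lines of Python. This is only a sketch: the variable names are ours, and the 8-hour production day is not stated anywhere in the table but is what the capacity numbers imply.

```python
# Figures copied from the table above; the names are illustrative.
TOOLS_PER_HOUR_PER_LINE = 14.46
LINES = 50
TOOL_PRICE = 20.00
THEORETICAL_TOOLS_PER_DAY = 5783
DOWNTIME_FRACTION = 0.33

# Assumption: production hours per day are inferred, not given in the article.
implied_hours_per_day = THEORETICAL_TOOLS_PER_DAY / (TOOLS_PER_HOUR_PER_LINE * LINES)

# Downtime removes roughly a third of the theoretical output.
actual_tools_per_day = THEORETICAL_TOOLS_PER_DAY * (1 - DOWNTIME_FRACTION)

theoretical_value = THEORETICAL_TOOLS_PER_DAY * TOOL_PRICE
actual_value = actual_tools_per_day * TOOL_PRICE

print(f"Implied production hours per day: {implied_hours_per_day:.2f}")
print(f"Actual tools per day: {actual_tools_per_day:.0f}")
print(f"Theoretical daily value: ${theoretical_value:,.0f}")
print(f"Actual daily value: ${actual_value:,.0f}")
```

The results differ from the table by a few dollars because the table was evidently computed from unrounded inputs.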

Manufacturing puts wear and tear on the machines producing product in a manufacturing line and the machines frequently break down. The manufacturing industry measures and tracks how much life a machine has with a metric called "mean time between failures".

Mean Time Between Failure

Mean time between failures (MTBF) "is the predicted elapsed time between inherent failures of a mechanical or electronic system, during normal system operation". Using the MTBF model we can get a good estimate of how often our machines will fail based on historical data.

Using the open source reliability Python module and the ACME Tool historical data, the operational team calculates their MTBF to be 131.4 minutes. This calculation was based on 10,000 different tools in the dataset, so the team is reasonably confident (95% confidence interval) that the MTBF metric will hold up.
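
To make the idea concrete, here is a minimal sketch of an MTBF estimate using only the Python standard library. It is not the reliability module's API and not ACME's actual calculation: it treats MTBF as the plain sample mean of run-to-failure times and uses a normal approximation for the 95% interval. The sample values are made up for illustration.

```python
import math
import statistics

def mtbf_with_ci(minutes_to_failure, z=1.96):
    """Estimate MTBF as the sample mean of run-to-failure times.

    Returns (mtbf, (lower, upper)), where the interval is a
    normal-approximation 95% confidence interval on the mean.
    """
    n = len(minutes_to_failure)
    mean = statistics.fmean(minutes_to_failure)
    standard_error = statistics.stdev(minutes_to_failure) / math.sqrt(n)
    return mean, (mean - z * standard_error, mean + z * standard_error)

# Hypothetical run-to-failure times in minutes (not ACME's data).
sample = [120.0, 140.0, 125.0, 135.0, 137.0]
mtbf, (lower, upper) = mtbf_with_ci(sample)
print(f"MTBF: {mtbf:.1f} min, 95% CI: ({lower:.1f}, {upper:.1f})")
```

With 10,000 observations instead of five, the interval would be far narrower, which is why the team is comfortable relying on the metric.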

With the MTBF metric in hand, the ACME team analyzed their plant based on the sensor data collected:

| Metric | Value |
|---|---|
| Avg machine manufacturing minutes per tool | 0.15 min (9 seconds) |
| Avg machine wear per day | 17.35 min |
| Avg total machine wear across all tools per day | 8674.70 min |
| Mean Time Between Failure (MTBF) | 131.42 min |
| Average machine failures per day | 66 |
| Lost production hours per machine failure | 2 |
| Lost production hours per day due to machine failures | 132 |
| Lost revenue per day due to machine failures | $38,168 (33% of theoretical revenue max per day) |

If any machine in a manufacturing line goes down, the line cannot produce any tools, and the company loses $289.16 every hour the line is down. The ACME team has analyzed just how much revenue is lost each week ($190,843) and year ($9,542,168) due to downtime as part of quarterly operations management reviews.
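
The chain of arithmetic behind these numbers can be sketched as follows. The inputs come from the article; the 5-day week and 50-week year are our inference from the ratio of the daily, weekly, and yearly figures, not something the article states.

```python
# Inputs from the article; the names are illustrative.
MACHINES = 500
LINES = 50
THEORETICAL_TOOLS_PER_DAY = 5783
MINUTES_PER_TOOL = 0.15
MTBF_MINUTES = 131.42
HOURS_LOST_PER_FAILURE = 2
REVENUE_PER_LINE_HOUR = 289.16

# Assumptions inferred from the weekly and yearly totals.
WORK_DAYS_PER_WEEK = 5
WORK_WEEKS_PER_YEAR = 50

# Each machine on a line touches every tool that line produces.
wear_per_machine = THEORETICAL_TOOLS_PER_DAY / LINES * MINUTES_PER_TOOL
total_wear = wear_per_machine * MACHINES

# Total wear divided by MTBF gives the expected failures per day.
failures_per_day = total_wear / MTBF_MINUTES
lost_hours_per_day = round(failures_per_day) * HOURS_LOST_PER_FAILURE

lost_revenue_per_day = lost_hours_per_day * REVENUE_PER_LINE_HOUR
lost_revenue_per_week = lost_revenue_per_day * WORK_DAYS_PER_WEEK
lost_revenue_per_year = lost_revenue_per_week * WORK_WEEKS_PER_YEAR

print(f"Failures per day: {failures_per_day:.1f}")
print(f"Lost line-hours per day: {lost_hours_per_day}")
print(f"Lost revenue per day: ${lost_revenue_per_day:,.0f}")
print(f"Lost revenue per week: ${lost_revenue_per_week:,.0f}")
print(f"Lost revenue per year: ${lost_revenue_per_year:,.0f}")
```

The results land within rounding distance of the article's figures, which were presumably computed from unrounded inputs.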

ACME has grown into this space quickly and only recently started analyzing lost revenue and ways to mitigate the machine downtime. One option, proactively maintaining all machines, would take a lot of labor hours given the large number of machines.

No plant is going to operate with 100% uptime, but this is still a staggering number to deal with. What could we do with statistics and machine learning to reduce at least a portion of this downtime? If we could prevent a few of those machine failures per day, ACME could recoup some of the lost revenue on capacity it is already paying expenses for.

Each machine issue takes around 2 hours for a technician to service (they have to get to the specific location and line, triage, etc.). This means a technician can fix around 3-4 issues per day, and it takes 16 techs to keep operations running (and they are currently stretched thin as well).
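
A rough staffing check shows why the team is stretched thin. The failure count, fix time, and headcount are from the article; the 8-hour technician shift is our assumption.

```python
FAILURES_PER_DAY = 66
HOURS_PER_FIX = 2
TECHNICIANS = 16
SHIFT_HOURS = 8  # assumption: not stated in the article

repair_hours_per_day = FAILURES_PER_DAY * HOURS_PER_FIX
hours_per_technician = repair_hours_per_day / TECHNICIANS

# Upper bound on fixes per technician if every shift hour goes to repairs.
max_fixes_per_technician = SHIFT_HOURS / HOURS_PER_FIX
technicians_needed = FAILURES_PER_DAY / max_fixes_per_technician

print(f"Repair hours per day: {repair_hours_per_day}")
print(f"Hours per technician per day: {hours_per_technician:.2f}")
print(f"Technicians needed at full utilization: {technicians_needed:.1f}")
```

Under that assumption, each of the 16 technicians is already carrying slightly more than a full shift of repair work.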

Predictive Maintenance Pilot: Business Goals and Constraints Defined

The ACME Tool Co. operations management team now knows that:

  • The company faces an existential threat from reduction in operating profit margin
  • there is $190,843 of revenue per week that is not realized --- due to unplanned downtime
  • they have a rough idea of how many machines per day will fail (66), causing unplanned downtime
  • they don't have the resources to make a major overhaul of how they are operating (e.g., changing the maintenance patterns completely)
  • predictive models have the potential to build an ordered list of the most likely to fail machines

After internal discussion, the line of business and operations management team conclude that their operations likely will not change overnight "magically". They are aware, however, that predictive maintenance has the potential to rank the list of machines by which are most likely to fail based on their historical data. They are also aware that data science can be tricky to "get right" and that there will be some growing pains if they choose the path of predictive maintenance.

ACME Tool decides to investigate the prospect of a predictive maintenance pilot program, but under tight yet realistic constraints. They realize that machine learning is not "magic" and it will not just "tell them the machines that will fail with zero error". They do, however, want to understand "how good the predictive model can be" and "what the tolerance for error is", because if they invest in this path they are betting the financial future of the company on getting close to the expected results from the pilot program.

The executive team commits to funding a small team of technicians to do overnight maintenance each night on the 18 machines most likely to fail the next day. These 18 machines will be the top 18 in the data science team's model ranking of machines by likelihood of failure.
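
Selecting the nightly work list is a simple top-k selection once the model has produced a score per machine. The sketch below assumes the model outputs a dictionary of machine id to failure score; the function name, the id format, and the scores are all hypothetical.

```python
import heapq

def top_k_at_risk(failure_scores, k=18):
    """Return the k machine ids with the highest predicted failure score."""
    return heapq.nlargest(k, failure_scores, key=failure_scores.get)

# Hypothetical scores for 500 machines (not real model output).
scores = {f"M{i:03d}": i / 500 for i in range(500)}
tonight = top_k_at_risk(scores)
print(tonight[:3])
```

The value of k is fixed by the business (18 machines per night), so the model only has to get the ordering right near the top of the list.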

Pilot Financial Goals

The finance team has calculated that if they drop the price of the product from $20 to $18.50 (dropping annual revenue to $17,203,662) then this would drop the operating profit margin of the company from 12% (decent) to 4.86% (not good) because (in theory) cost of goods sold and operational expenses would stay the same (if they sold the same number of tools).

Based on financial analysis ACME realizes that an operating profit margin of 4.86% does not bode well for long term operations of the company.
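
The margin drop follows directly from holding costs and unit volume constant while cutting the price. Here is a small sketch of that calculation; the function name is ours, and it works in relative terms so no absolute cost figures are needed.

```python
def margin_after_price_cut(margin, price_cut):
    """Operating margin after a price cut, with unit volume and costs held constant."""
    revenue = 1.0                      # normalize current revenue to 1
    costs = revenue * (1 - margin)     # costs implied by the current margin
    new_revenue = revenue * (1 - price_cut)
    return (new_revenue - costs) / new_revenue

# A 7.5% price cut ($20 to $18.50) applied to a 12% operating margin.
new_margin = margin_after_price_cut(0.12, 0.075)
print(f"New operating margin: {new_margin:.2%}")
```

Because costs stay fixed, a 7.5% price cut removes well over half of the operating profit.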

PM Program Cost

The finance and operations team came up with the following metrics and costs that it would take to perform preventative maintenance on 18 machines each night:

  • Hours per Shift: 9 hours
  • Time required per PM Fix: 1 hour
  • Fixes per PM Shift: 9 machines fixed
  • Required PM Technicians: 2
  • Cost per Year per PM Technician: $64,800
  • Total PM Program Cost: $129,600
With this overhead in mind, let's take a look at our goals for the pilot.
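
The program cost can be derived from the shift parameters listed above. The variable names in this sketch are illustrative.

```python
import math

SHIFT_HOURS = 9
HOURS_PER_PM_FIX = 1
MACHINES_PER_NIGHT = 18
COST_PER_TECHNICIAN_PER_YEAR = 64_800

# One technician can service this many machines in a single shift.
fixes_per_technician = SHIFT_HOURS // HOURS_PER_PM_FIX

# Round up so every machine on the nightly list gets covered.
technicians = math.ceil(MACHINES_PER_NIGHT / fixes_per_technician)
program_cost = technicians * COST_PER_TECHNICIAN_PER_YEAR

print(f"Technicians required: {technicians}")
print(f"Annual PM program cost: ${program_cost:,}")
```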

Minimum Viable Goal

The finance team has determined that an operating profit margin of around 9.6% is a realistic minimum viable number to shoot for if the management team wants to remain minimally competitive in the market (e.g., being able to raise capital, etc).

In the best case, the finance team would like to get closer to the original operating profit margin (12%), but they realize this could be a difficult goal under the current circumstances. Regardless, the team decides to set that as the stretch goal for the project.

The team has calculated that if they can detect 11 failures per day (out of 18 predictions, or 61% accuracy), they could reach a margin of 9.6% with PM Program costs included. This seems like a safe and reasonable minimum viable goal; given the stakes, the team needs goals they have a high chance of clearing.
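
The "11 out of 18" target is what is usually called precision at k: of the k machines we flag, how many would actually have failed. A minimal sketch, with hypothetical machine ids and a hypothetical function name:

```python
def hits_in_top_k(ranked_machine_ids, failed_machine_ids, k=18):
    """Count how many of the top-k ranked machines actually failed."""
    return sum(1 for machine in ranked_machine_ids[:k] if machine in failed_machine_ids)

# Hypothetical day: 18 flagged machines, 11 of which really were about to fail.
ranked = [f"M{i:02d}" for i in range(18)]
failed = set(ranked[:11])

hits = hits_in_top_k(ranked, failed)
precision = hits / 18
print(f"{hits} of 18 correct ({precision:.0%})")
```

In a real pilot the "would have failed" label is hard to observe for machines that were serviced, so the team would need to agree on how a hit is recorded.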

Stretch Goal

The stretch goal is to get back to the 12% operating margin. If they can detect 14 out of 18 failures (78% of the top 18 predictions), this would get them to a 10.9% operating profit margin, short of 12% but well above the minimum viable target. If they could hit this rate 95% of the time, this would reduce daily failures from 66 to 52 (detecting and preventing 21% of the failures). With the lowered price and predictive maintenance reducing downtime, the company summarized that:

  • they could operate the same hours and produce 1,027,084 (+10.45%) tools
  • operational expenses would hold constant
  • cost of goods sold would increase slightly (labor would hold constant, but more products produced costs +10.45% more materials)
The finance team concluded that with the reduced downtime and lower price that ACME would sell $19,001,060 in product as well, with an operating profit margin of 10.9%.
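
The production uplift in the stretch scenario can be approximated from numbers already in the article. This sketch assumes each prevented failure recovers the full 2 line-hours (it ignores overlapping failures on the same line), and it derives baseline annual volume from the $17,203,662 revenue figure at the new price.

```python
# Figures from the article; the names are illustrative.
FAILURES_PER_DAY = 66
PREVENTED_PER_DAY = 14
HOURS_LOST_PER_FAILURE = 2
TOOLS_PER_HOUR_PER_LINE = 14.46
ACTUAL_TOOLS_PER_DAY = 3875
BASELINE_ANNUAL_REVENUE = 17_203_662  # at the new $18.50 price
NEW_PRICE = 18.50

# Assumption: every prevented failure recovers the full lost line-hours.
extra_tools_per_day = PREVENTED_PER_DAY * HOURS_LOST_PER_FAILURE * TOOLS_PER_HOUR_PER_LINE
uplift = extra_tools_per_day / ACTUAL_TOOLS_PER_DAY

baseline_annual_tools = BASELINE_ANNUAL_REVENUE / NEW_PRICE
stretch_annual_tools = baseline_annual_tools * (1 + uplift)
stretch_annual_revenue = stretch_annual_tools * NEW_PRICE

share_of_failures_prevented = PREVENTED_PER_DAY / FAILURES_PER_DAY

print(f"Production uplift: {uplift:.2%}")
print(f"Annual tools: {stretch_annual_tools:,.0f}")
print(f"Annual revenue: ${stretch_annual_revenue:,.0f}")
print(f"Failures prevented: {share_of_failures_prevented:.0%}")
```

The results match the finance team's figures to within rounding. The 10.9% margin itself depends on the split between cost of goods sold and operating expenses, which the article does not give, so it is not reproduced here.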

Needless to say, the executive team was excited at the prospect of being able to combat the price drop and potentially salvage their operating profit margin.

With this information in hand, the ACME Tool Co operations team sets up a meeting with the data science team to map out the best path to produce a pilot project.

Summary of Pilot Goals

To implement the new pilot system, ACME Tool Co. does not want to make major changes to their process, so they keep the reactive maintenance process in place during the normal daily operational hours.

Based on past experimentation, both the executive team and the data science team know that predictive models are not perfect and you have to set expectations to have any shot at being successful with a machine learning project. With this in mind, the executive team works with the data science team to set the following pilot parameters:

  • the company is willing to pay for a small team of technicians to do overnight maintenance on 18 machines that are likely to fail the next day
  • minimum viable goal: at least 11 out of 18 correct (61% out of top 18) 95% of the time
  • stretch goal: 14 out of 18 predictions correct (78% out of top 18) 95% of the time
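
One way to turn these parameters into an acceptance check is to count the share of pilot days on which the hit target was met. Reading "95% of the time" as "95% of days" is our interpretation, and the daily hit counts below are invented for illustration.

```python
def meets_goal(daily_hits, min_hits, required_rate=0.95):
    """True if at least required_rate of days had min_hits or more correct predictions."""
    days_ok = sum(1 for hits in daily_hits if hits >= min_hits)
    return days_ok / len(daily_hits) >= required_rate

# Hypothetical 40-day pilot: correct predictions out of 18 each day.
daily_hits = [12] * 38 + [14, 9]

minimum_viable_met = meets_goal(daily_hits, min_hits=11)
stretch_met = meets_goal(daily_hits, min_hits=14)
print(f"Minimum viable goal met: {minimum_viable_met}")
print(f"Stretch goal met: {stretch_met}")
```

In this made-up pilot the model clears the minimum viable bar on 39 of 40 days but reaches the stretch target on only one.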

The IT team started collecting machine sensor data about 2 years ago. They continue to collect sensor data daily on how each machine on the line is used (torque, minutes, etc). The manufacturer of the machines also provides data on 10k known devices across multiple companies in a data share agreement to help with maintenance modeling. So ACME Tool Co. has some data to work with, but it remains to be seen how valuable the data really is in the pursuit of predicting machine failure.

Operational Contract Between Line of Business and Data Science Team

The line of business doesn't need to understand everything that goes on in data science land, but there should be some sync points, along with a start state and acceptance criteria for completion.

The data science team is provided the 10k machines worth of historical data to begin their analysis with.

At this point, the data science team has a starting point (data, resources, ROI targets), and they commit to delivering a model and an analysis of the expected performance of the model under the supplied conditions (first deliverable). Together, the data science team and the business team will analyze how the model's performance impacts the business in financial terms.

The (second) deliverable from the data science team to the operations management team is a simple report of the top 18 "most likely to fail machines" every day based on the data up to that day. The data science team doesn't have to build any fancy apps, just a simple text report via email or SQL view.
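
A report this simple can be a few lines of formatting code. The sketch below is one possible shape for the plain-text version; the function name, machine ids, scores, and layout are all hypothetical, and a real report would list 18 machines.

```python
from datetime import date

def format_report(report_date, ranked_machines):
    """Render a plain-text report from (machine_id, score) pairs, highest risk first."""
    lines = [
        f"Top {len(ranked_machines)} machines most likely to fail - {report_date.isoformat()}"
    ]
    for rank, (machine_id, score) in enumerate(ranked_machines, start=1):
        lines.append(f"{rank:>2}. {machine_id}  score={score:.2f}")
    return "\n".join(lines)

# Hypothetical ranking, shortened to three machines for readability.
ranking = [("M-0417", 0.93), ("M-0032", 0.88), ("M-0291", 0.85)]
report = format_report(date(2022, 1, 12), ranking)
print(report)
```

The same ranked list could just as easily be written to a table and exposed as a SQL view, which is the other delivery option the teams agreed on.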

With two clearly established goals, parameters for the pilot, and a collaborative framework, the data science team now sets off to write a plan to achieve the business team's target goals for ROI with a predictive model.

Next Steps: Ingest and Store the Manufacturing Sensor Data

The ACME Tool Co. team has collectively established the goals and business constraints for their predictive maintenance pilot program.

Let's now move on to part 2 of our series where the ACME Tool Co. data science team starts their journey managing and analyzing the machine failure sensor data.