Documentation is a focal point in an organization. It sets down the standards and records history: the history of both innovation and failures. Good documentation helps train newcomers and holds all colleagues accountable. Without documentation, organizations are likely to repeat the same mistakes over and over, or have no idea how a mistake was made.

Documentation should be factual and contain enough information to be useful, yet remain concise. There should be documentation standards and a keeper of the records to maintain those standards. Below is a review of some standard types of documentation and the key factors in making those documents useful.

Document Master: There should be a standard format for each type of document. That format should contain author, approver(s), date and revision. The storage location should make it obvious which documents are approved for use. Access to documents that are under revision or obsolete should be limited. This can be done through a process, or through an electronic document control system (DCS). If an electronic DCS is not used, the naming convention needs to be standardized to speed finding the correct information. Suggested naming convention: document type + area + specific topic + date in YYMMDD format. For example, this document would be named BlogEngineerXXDocumentationYYMMDD.

Using the date format YYMMDD lets documents sort chronologically by file name. The MMDDYY convention will not sort in chronological order in a file listing, because the month is compared before the year.
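The sorting point can be checked with a short Python sketch. The function name `document_name` and the sample dates are made up for illustration; only the naming pattern (type + area + topic + YYMMDD) comes from the convention suggested above.

```python
from datetime import date


def document_name(doc_type, area, topic, issued):
    """Build a name as document type + area + topic + date in YYMMDD."""
    return f"{doc_type}{area}{topic}{issued:%y%m%d}"


# Illustrative issue dates, deliberately out of order.
issue_dates = [date(2023, 11, 5), date(2024, 1, 15), date(2023, 2, 20)]

# Plain alphabetical sort, as a file browser would do.
yymmdd_names = sorted(
    document_name("Blog", "EngineerXX", "Documentation", d) for d in issue_dates
)
mmddyy_names = sorted(
    f"BlogEngineerXXDocumentation{d:%m%d%y}" for d in issue_dates
)

print(yymmdd_names)  # oldest to newest
print(mmddyy_names)  # January 2024 lands ahead of both 2023 documents
```

With YYMMDD the alphabetical order and the chronological order match. With MMDDYY the January 2024 document sorts first even though it is the newest.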

Process Flows: Process flows are used for everything from mail delivery, to document control, through your most complex deliverable. Ideally all of these processes would be documented. However, that is often not practical, so a priority needs to be set.

List all the processes that need to be documented and order them by priority. Develop process flows for all the critical processes, and set a schedule to work through the rest of the list. One tip for getting through the whole list is to assign new employees to create process flows as part of their on-boarding. This helps with their indoctrination, and it gets the process flows documented.

Process flows should ideally fit on one page.  Sub processes can be included on subsequent pages as a drill down process.  Process flows should have swim lanes or other information to indicate who is accountable and responsible for the activities on the process flow.

Recommended priority for process flow development

  • High impact processes (either safety or dollars)
    • Processes that, done correctly, can save money or generate revenue
      • Example:  Assembly process of finished product
    • Processes that, done incorrectly, have severe safety consequences
      • Example:  The lockout tagout process
    • Processes that, done incorrectly, can cost significant money
      • Example:  A batch process that finishes at 5 minutes, but must be stopped before 10 minutes.  Removing the batch between 5 and 10 minutes is critical to success.  Early or late finishes are scrap.
  • Processes that cross departmental boundaries
    • When processes fall under different managers, it is important that they are documented so there are clear hand-offs and acceptance points.
  • Processes that are performed often and by multiple individuals
    • To ensure consistent output, processes that are carried out by multiple individuals need to be standardized.  These are often then turned into standard operating procedures (SOPs).

Standard Operating Procedures (SOP): Repetitive work should be documented as standard operating procedures. SOPs should cover management as well as frontline work. SOPs should contain pictures and instructions on how to perform the work. These are used to train new individuals and refresh training for all employees. SOPs are detailed instructions that can be condensed into checklists, short reminder versions that are used for daily/weekly interactions.

Checklists: Checklists are short reminders of work to be performed. They should concentrate on activities that have a direct impact on performance. Checklists should be designed to be completed in one time period. For instance, a checklist that takes longer than an hour or two should be broken into multiple checklists or routes. Few people work longer than two hours straight without taking a quick break (coffee, phone call, etc.).

Checklists should not be used as a management tool.  Checklists are to ensure that processes are followed or to check product and equipment condition, not as a substitute for management activities.  If the checklist activity does not have a direct effect on the organization’s output, it is likely not necessary.

Responsibility Accountability Consulted Informed (RACI) :  This is a matrix that assigns accountability and responsibility for activities.  It also helps colleagues understand their role in operational activities.  A RACI can help the consulted and informed understand why they are not able to significantly influence certain areas of the business.

Portion of Example RACI

Continuous Improvement (CI) and Root Cause Analysis (RCA): CI and RCA activities need to be documented to ensure that mistakes are not repeated. How often have you been in one of these activities when someone says, “Hey, didn’t we try that when …?”

Properly documented and filed CI and RCA activities should be reviewed at the beginning of every new CI or RCA.  These files can save lots of frustration and wasted effort and money.

Training and Qualification Matrix:  This matrix is useful when changing roles or when help is needed in a pinch.  Understanding the skills required for a job, as well as who in the organization has those skills helps define training needs and budgets.

Lube Training Matrix example

This is just a list of some of the types of documentation required in an organization. There are lots of forms and other types of documents required to run an organization. The ones listed here are key to managing the organization, but are often not formally documented. Large organizations may assume everyone “knows”. Small organizations may not realize the importance of this type of documentation as they outgrow the one-person-per-role structure.

Review and see if there are any gaps in your organization.  It takes time and discipline to create a good document system.  However, the benefits of having one are worth the effort required to build and maintain one.

Latent Root Cause

A good root cause analysis (RCA) should produce more than just the physical root cause.  It is important to discover the physical root cause to fix the immediate problem.  But delving deeper into the systemic, then all the way to the latent cause helps transform your business for the better.

The latent root cause involves practices and cultural norms that allow failures to happen. I’ll say that again: the latent root cause is the practice that allowed the failure to happen. Solving the latent root cause means solving a management problem.

It’s easy to know when you have reached the latent cause, because the terminology has changed from ‘they’ to ‘we’.  It is no longer someone else who has to act, but we, as management, that have to act.

Let me take you through an RCA that I was involved in as a plant manager. Note: after they are solved, RCAs can be explained simply. But the process is grueling, takes many iterations and can be extremely frustrating. It is easy to get distracted and end an RCA after the physical root cause. But if the problem was important enough to warrant an RCA (not just troubleshooting), then it is worth it to the organization to finish the RCA and find the systemic and latent causes.

Problem Statement:  Food Product did not meet consistency expectations, but did not present a food safety hazard.

Occurrence:  Product left plant at expectation, and arrived at most sites as expected, however product shipped over the Rocky Mountains lost consistency.

After much research into what made the product the right consistency, and a thorough inspection of the production equipment, it was found that the colloid mill was not shearing the product finely enough to produce the sustained viscosity needed.  When the product was shipped over high altitude it became runny.  Other product became runny before code date, but the high altitude shipment actually helped by alerting us to the problem within days of the production run.

Production Process :  The colloid mill blade spacing was set by adjusting the dial on the outside.  There was an SOP that indicated where to set the dial.

Physical Root Cause:  The mill was dismantled and the thickness of the blade was measured.  It was still a useable blade, but the dial setting should have been adjusted for blade wear.

Systemic Root Cause:  SOPs were created and did not take into account equipment wear.

Latent Root Cause: No one thought about how equipment wear would affect product quality. There was no program to adjust SOPs over time to account for blade wear. By requiring production to follow the SOP, management left no provision to adjust requirements as needed over time. The blade was known to wear, but the spacing adjustment in the SOP did not account for that wear. Management allowed for using a worn (but still within spec) blade, but did not provide instructions on how to use that thinner blade.

The result of determining the latent root cause was a plant wide review of all equipment that could wear.  There was already a program to periodically measure the mill blades to ensure they maintained safe thickness.  So, a process was added to the inspection to record the blade thickness and adjust SOPs accordingly.

Other equipment was reviewed to determine if changes needed to be made. Positive displacement lobe pumps are a common wear item in food plants. Pressure and flow settings can be used to compensate for lobe wear. Agitators and mixers were also reviewed to set standards, and a clearance inspection program was set up for them.

By driving to the latent root, we were able to apply the physical root cause (wear) to equipment beyond the colloid mill.  This should prevent future quality issues and lead to better care and understanding of the equipment and its importance in the manufacturing process.

Finding the latent root cause prompts management to act and change the processes.  It can be more expensive and involve more areas of the business than this problem.  It is not uncommon to see capital expenditures or changes in operating philosophy.

The more RCAs that your organization drives to latent root cause, the fewer RCAs overall will be needed. This is because solving these management issues has a broader impact than solving only the physical issues.

You will be operating more proactively, and less reactively.  I encourage you to use an experienced coach to learn the process of driving to the multiple root causes.  But once you understand the process, you and your organization will continue to drive toward solving the latent issues.  You will not be satisfied to stop at the physical root cause.


Qwerty must die

What if I told you that I could take something that you do dozens, maybe hundreds of times a day and make it much more difficult?  There is no benefit in making it more difficult, it is likely to cause you to make more mistakes, possibly contribute to injuries, and increase the time it takes you to do one of your most common tasks.

Some would say ‘no thanks’ to this offer. Others would question my intelligence, sanity, lineage, and generally say things best said with @#!. No one would say yes to that proposition. Yet the qwerty keyboard persists. This design dates from 1874 and was intended to allow typing without jamming the mechanical type bars.

Mechanical bars have not been a problem since the electric typewriter was created. Currently a significant number of keyboards are not even physical; they are touch screens. Typing often occurs on tiny smartphone screens where ten-finger typing is not practical. Yet there has been no mainstream advancement in English language keyboard layout for over 140 years.

A search for ergonomic keyboards yields some split keyboards that improve the ergonomic position of the hands, but no new layout of the actual keys. A search for keyboard apps yields emoji designs and other visuals, but no improvements on the qwerty layout.

Why not have touchpads and keyboards designed for optimal modern typing? The most used keys could be positioned better for both thumb and ten-finger typing. The keys could be sized to make it easier to hit the more common keys. Anyone who has watched Wheel of Fortune® or played Scrabble® knows that the vowels and a few consonants make up the majority of letters used in English. It seems like it would be a simple ergonomic engineering exercise to create the best keyboard layout for today’s typing needs.

Even better than one layout, how about a customizable layout? Letters could be arranged and sized based on the individual user’s needs. If my best friend’s name is Izzy, I may need to make the z key more prominent than other users would. Or if I decide that I want a keyboard without the letter “C”, I could make that my default layout. Let’s face it, we all know what a jerk C can be. K and S can make up the majority of necessary sounds (H will need a new pal for some chilling words). But really, who needs C anyway?

Whatever an individual’s keyboard needs, it is time to get rid of the qwerty keyboard. I cannot find a good reason to keep it, and have listed several reasons above to get rid of it. I advocate that #qwertymustdie. To do this, we need app designers and keyboard makers to create an alternative, hopefully one that allows for a customizable solution for each user. Layout, key size, color, relative proximity – these should be personal choice (tsoise ?) for a modern typing solution.

The use of non-OEM parts cannot automatically void a warranty

Can using non-OEM parts void a warranty? The answer is maybe, but probably not. The Magnuson-Moss Warranty Act regulates warranties on consumer products. The act specifically restricts tie-in requirements. A manufacturer cannot require specific maintenance or parts usage unless the company provides those services or materials free of charge during the warranty period.

However, why would you not follow recommended servicing guidelines, or use the OEM’s proprietary parts? The reason that you purchased capital equipment from the OEM is that the product fit your requirements. Keeping it in top condition should be a high priority.

Use OEM materials that are proprietary to keep your equipment in top shape.  If materials are commercially available materials that the OEM has rebranded, feel free to use the “generic” version of that part.  Some larger companies are requesting that OEMs provide the purchasing information for non-proprietary materials.  Even if you don’t have the buying power of the large companies, it is always a good idea to ask for a complete bill of materials.

So, using non-OEM parts will not automatically void your warranty. It is recommended that you have the warranty period maintenance discussion with your sales rep at the time of purchase. Understanding your rights and their rights under the Magnuson-Moss act should make the discussion very productive.

It is also recommended to use a warranty tracking process to get the most out of your warranty. Many CMMSs allow for tracking warranties. If yours doesn’t, set up a spreadsheet or database to track warranties and dates. Assign someone to monitor the warranty periods and ensure that, if there are problems with equipment during the warranty timeframe, the OEM is notified and allowed to correct defects or provide materials as required. The money you save by properly administering warranty claims for equipment should offset the time of the individual monitoring the warranty periods.
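If your CMMS does not track warranties, the tracking itself is simple enough to sketch in a few lines of Python. Everything here is hypothetical: the `WarrantyRecord` fields, the 60-day review window, and the sample assets and dates are stand-ins for whatever your own spreadsheet or database holds.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class WarrantyRecord:
    asset: str
    oem: str
    expires: date  # last day of warranty coverage


def is_covered(record, failure_date):
    """True if the failure occurred on or before the warranty end date."""
    return failure_date <= record.expires


def expiring_soon(records, today, window_days=60):
    """Warranties that end within the review window, for the person monitoring them."""
    cutoff = today + timedelta(days=window_days)
    return [r for r in records if today <= r.expires <= cutoff]


# Illustrative records only.
records = [
    WarrantyRecord("Colloid mill", "OEM A", date(2025, 6, 30)),
    WarrantyRecord("Lobe pump", "OEM B", date(2024, 9, 15)),
]

today = date(2024, 8, 1)
for record in expiring_soon(records, today):
    print(f"{record.asset}: warranty with {record.oem} ends {record.expires}")
```

The two checks correspond to the two jobs of the person assigned to monitor warranties: confirming that a failure falls inside coverage before paying for the repair, and reviewing equipment condition before coverage runs out.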

Data – So What

Internet of Things (IoT), Artificial Intelligence (AI), there’s an App for that, … – we now have available an abundance of data, and even some information. It’s what we do about it that matters.

Although IoT is a buzzword now, it has been in the works for years – decades even.  Dr. Jay Lee was one of the first to introduce me to the concept – if not the term.  The reality is we do have a lot of access to data, and we have machines turning some of that data into information – but – so what?!?

If we as people don’t get involved and make decisions for, about and with that data, then we have succeeded at nothing. I have seen companies working feverishly to capture the latest information on their machinery, only to ignore the actual data and let the machines run to failure. We need to step back regularly and look at the whole operation to determine what we really need to know and why. Also, there needs to be a plan to act on what is learned. Too often we do not act on what we already know, waiting to see if there is more information around the corner. IoT will not change behavior. The process to act on data/information must be in place to utilize IoT successfully.

Automated vehicles are in the news, specifically for the failures of people to act on the data – and even information, they were given.  Factory data rarely has fatal results if ignored, but the failure to act still has significant consequences.   What is the point of knowing your equipment health, if the planning and scheduling system is not allowed to fix equipment before it goes into catastrophic failure?  It is a common theme in after-action reviews to be able to pinpoint warnings, even multiple warnings that were ignored before the failure.  I have seen leadership teams brainstorm how they can get better warning systems, rather than figure out how to act on the warnings they do have.  It is always easier to push the responsibility down the road and wish for perfect information, than to accept the responsibility we have in utilizing the imperfect information already available.

I love data and information.  I am an analyst at heart, but I fear that the growing IoT available will lead us to more catastrophic, and possibly even fatal events.  I worry that folks in charge of making decisions will delay acting on the first sign of potential failure (P on our I-P-F curve), hoping to be ‘heroes’ by maximizing that P-F time and waiting for the really big warning from IoT to tell them time is up.

To avoid this propensity to put off making decisions on known failures, we need to reward managers who maximize the I-P portion of the curve and punish those who do not make decisions as soon as a problem is identified. That does not mean drop everything and fix problem equipment the second the defect is identified. But it does mean putting a mitigation strategy in place as soon as the defect is identified. Don’t wait for further indications of the downhill slope.

How can we make heroes of those that don’t delay? By measuring what doesn’t happen. How about measuring days since last production record? Promoting the equipment health score as a metric for everyone to be proud of? Measuring time from defect identified to fixed? Bonuses for everyone in an organization that doesn’t have a catastrophic failure? There are always ways to game any metric, but focusing on positive metrics, rather than negative ones (downtime, production lost, etc.), puts the focus on performance, rather than non-performance.
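As one illustration, the “time from defect identified to fixed” metric could be calculated along these lines in Python. The record layout, asset names, and dates are invented for the sketch; in practice the two dates would come from your CMMS work orders.

```python
from datetime import date
from statistics import mean

# Hypothetical defect log: when each defect was identified and when it was fixed.
defects = [
    {"asset": "Pump-1", "identified": date(2024, 3, 1), "fixed": date(2024, 3, 8)},
    {"asset": "Mixer-2", "identified": date(2024, 3, 10), "fixed": date(2024, 3, 13)},
    {"asset": "Mill-1", "identified": date(2024, 4, 2), "fixed": None},  # still open
]


def days_to_fix(defect):
    """Calendar days between identifying the defect and fixing it."""
    return (defect["fixed"] - defect["identified"]).days


closed = [d for d in defects if d["fixed"] is not None]
mean_days_to_fix = mean(days_to_fix(d) for d in closed)
open_defects = [d["asset"] for d in defects if d["fixed"] is None]

print(f"Mean days from defect identified to fixed: {mean_days_to_fix}")
print(f"Defects identified but not yet fixed: {open_defects}")
```

Reporting the open defects alongside the average keeps the metric honest, since a defect that is never fixed would otherwise never count against it.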

Does anyone have an organization that rewards on avoidance of failures?

I-P-F curve

Before implementing any data gathering program

  • Determine what you want to know (production numbers, equipment health, quality statistics)
  • Determine why you want to know that (product cost information, maintain equipment health, meet quality standards)
  • Determine how to capture the data
  • Determine how to analyze the data
    • Formulas
    • Frequency
    • Who is accountable for the analysis – audit of the analysis
  • Determine accountability for acting on the data
    • Specific title or name – one person needs to be accountable
    • Frequency of checking data and acting
    • Parameters for acting (if the reaction is too severe, the operation overshoots and oscillates like an under-damped control loop)
  • Determine how metrics will be published and used to drive team members to desired behavior regarding the What/Why you wanted to know

IoT is only as good as the management team that is operating it.


Ode to Machinists Handbook

The Machinery’s Handbook is a wonderful tool. Although it is often called the machinist’s handbook, it is a tool that every engineer should also own. Beyond machinists and engineers, everyone should be aware that it exists. I came across the handbook early in my career when a colleague pulled out the book and looked something up. I was hooked at that moment. I borrowed his book and looked through the myriad of offerings in it.

I learned what a grade 8 bolt was. Not only that it was not a grade A bolt – but what its strength was, and when to use one. I learned how to look at the markings and identify grade 8: six radial dashes arranged in a circle on the bolt head.

Grade 8 Bolt - 6 Radial Lines

I also learned that bolt strength designation is much like women’s clothing sizes: 2, 5, 5.2, 7, 8.   Although these numbers appear random, the handbook walks through the math to explain how these numbers are calculated, and why they are not sequential.  No such standard or logic exists for women’s clothing sizes.

Fastener types and specification, material properties, gear information – all that can be found in the handbook.  Bearing fits and tolerances are critical and specifically spelled out in the handbook.

When I tour a machine or rebuild shop, one of the things I look for is the Machinery’s Handbook on desks or in toolboxes. If I don’t see one, I ask how tolerances are calculated. Occasionally the shop will reference posters published by the component manufacturer or OEM. But often, the answer is ‘everyone just knows’. Even if someone has been rebuilding the same equipment for 20 years, I still want to see the charts that they are referencing. Even if they remember the tolerance requirements for equipment they see regularly, no one has memorized everything in that handbook. Machinists, rebuilders, and engineers who do not regularly check to confirm their assumptions and calculations are disrespecting their craft and customers.

There are other tools that provide the same information, but there is nothing as neatly packaged as the handbook. I urge everyone to have a copy and regularly glance through it, or reference it when needed. The 30th edition is available and it comes in hardcopy or e-copy. There are older versions available online, and there is even an app to help with calculations. The point is, this is a wonderful reference book. Do not go through life “just knowing”. Confirm what you know, and maybe even learn something new, by using this book.

Does anyone have other must-have reference books that they look for when auditing shops?



Factors affecting equipment lifecycle cost

Equipment or system life has 4 factors.

Design is the most important, yet often the most compromised, aspect of the capital equipment life cycle. The design and purchase of equipment is often a very small portion of the lifecycle cost, but it locks in the rest of the total cost of ownership (TCO).

Front-End Engineering Design (FEED), also referred to as Front-End Loading (FEL) and Pre-Project Planning (PPP), is robust planning and analysis during the design stage, when the ability to influence changes in design is relatively high and the cost to make those changes is relatively low. It is important that TCO calculations be made during this process to ensure the most effective and cost-efficient system is designed.

Though FEED adds cost and time to the design stage, these are minor compared to making changes at later stages in the project. Identifying and implementing cost-saving modifications in the design stage are the keys to optimizing TCO.

Before actual design work begins, FEED confirms and prioritizes the product/system requirements: what is critical, what would be “nice to have,” and what should be categorized as beyond the scope. Refining the requirements before design work begins is critical, because the requirements drive the design, which then determines the lifecycle. An overly broad scope negatively impacts other stages.

Once design work begins, FEED expands upon traditional engineering analysis, which focuses primarily on the operational function of a product/system. FEED goes beyond this by employing engineering best practices to analyze how design considerations impact each of the development lifecycle stages. These best practices include mechatronics, Reliability Centered Maintenance (RCM), Failure Modes and Effects Analysis (FMEA), 3-D modeling, and simulations.


Once design, selection, and purchase have been completed, the acceptance testing and installation phase begins. This phase needs planning as meticulous as the design phase. Care must be taken with how operational materials and personnel move around and interact with the equipment. Often equipment is designed with a “one-size-fits-all” operator in mind. This can mean that controls and gauges cannot be read properly by shorter operators. Mapping the path of the operators around the equipment will highlight any interference issues. The triad of material loading, controls, and product unload is as important as the sink-stove-refrigerator triad that home designers obsess about in home kitchens. Installations also need to consider both routine servicing requirements and equipment replacement.


The operational phase requires standard work instructions, trained operators, trained supervisors, and a continuous improvement mindset. This phase consumes most of the total cost of ownership. Its costs include raw materials, scrap, off-quality product, production labor, indirect labor, utilities, and supplies. All of these go into the cost of goods sold. Controlling these costs can either improve profits, or allow a reduction in sale price. Depreciation is also part of operating cost, but that is completely set at time of purchase.


Maintenance costs are best controlled by being designed into the equipment. Building condition monitoring into the system controls can ensure that all areas of the equipment are properly monitored. Operator care, and ensuring that repairs are made promptly and accurately, are also key components of optimizing the maintenance portion of total life cycle cost.

The final factor in a cost formula is disposal.  This may include rebuild/overhaul, sale, or decommissioning.

Total cost of ownership is built up across these phases, but the design phase is the most important in setting that cost. Often, in the rush to become operational or because of a short-sighted view of costs, the design phase is cut short and compromises are made. Good front-end engineering, project stage gate vetting, and capital budgeting are necessary to any company that values its manufacturing process as a competitive advantage.

Useful Life

Determining the expected life of equipment can be difficult.   Expected life vs actual equipment life is used when determining total cost of ownership.  I have read several root cause analyses that checked “full life wear out” for the physical cause.  But full life and equipment wear out are not necessarily the same thing.

True full life calculations require a lot of data and analysis. ISO 281 details the calculations for rolling bearings. This is just for one component. Equipment or system life is usually more relevant. The life of these has 4 factors:

  1. Design
  2. Installation
  3. Operation
  4. Maintenance

Let’s examine ways that the average maintenance and reliability group can determine life expectancy of their equipment.

First, let’s go back to the point about full life vs equipment wear out. Let’s say my brother and I buy identical cars on the same day. We both have similar driving requirements. We put about 100 miles per week on the vehicles. We each fill up around the time the refuel light clicks on. That is where our similarities end.

I ensure the oil and filter are changed every 5,000 miles. The fluids and air filter are checked at the time of service. I have the tires rotated at oil changes, and monitor tread wear. The brakes are checked when the tires are rotated, and I have the pads changed before they completely wear out. I wash the car at least once a month, more often in the winter. I vacuum and wipe down the interior every time I wash the car. My brother does none of this.

After 10 years, the check engine light on my brother’s vehicle comes on, and he brings it in for service. The tech says he needs an engine rebuild, which will cost more than the vehicle is worth. The car is worn out. My car is fine, and over 90% of similar vehicles (10 years / 50,000 miles) are perfectly roadworthy with no major repairs. In fact, the average vehicle is valued at or above 30% of its original value. Therefore, even though one vehicle is worn out, it did not live its full life. This is determined using the definition that full life is the point at which 90% of like equipment is still in working condition.

So, in order to determine if equipment is attaining full life, it is necessary to determine when 90% of like equipment is, or reasonably should be, still operating. Determining the equipment life of a large population is much easier than for a small population. So how do we determine life expectancy using the tools and information available?

Start with the OEM or design information.  When the equipment was selected, a life expectancy was used in the capital requisition.  That number should be the starting point.

Next, mine your CMMS data. Do not use built-in MTBF calculators, as they have trouble calculating from null values. That means that built-in CMMS reports only calculate MTBF for equipment that has the failure code marked against it. Instead, create your own calculation using population data. The population consists of all the similar assets. This is best done by using an asset type characteristic in the CMMS database. Run the report to determine the asset type population for the site or organization. Next, determine what will be considered a failure in the CMMS data. Ideally, the failure code is checked. However, numerous studies and empirical data show that very few organizations use failure codes, and fewer still use them rigorously. If you are in one of those less rigorous organizations, determine a factor that is recorded regularly and can be used as a trigger for equipment health. Consider any work orders that are not generated through the PM system, or work orders over a certain dollar amount. Determine how many of the assets in the population hit the trigger in a 12 month period. Divide the number of assets that triggered by the total population. If that number is close to 10%, then the life expectancy is 12 months. If it is not, change the timeframe until the calculation comes out close to 10%. As the time frame increases, the same asset may be in the trigger population more than once. This is acceptable for this calculation.
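The population calculation above can be sketched in Python. This is only an illustration under stated assumptions: the asset IDs, the trigger dates, and the function names (`add_months`, `trigger_fraction`, `estimate_life_months`) are hypothetical, and the trigger list stands in for whatever your CMMS export provides (for example, non-PM work orders over a chosen dollar amount).

```python
from datetime import date


def add_months(start, months):
    """Shift a first-of-month date forward by a whole number of months."""
    years, month_index = divmod(start.month - 1 + months, 12)
    return start.replace(year=start.year + years, month=month_index + 1)


def trigger_fraction(asset_ids, trigger_events, start, months):
    """Triggers inside the window divided by the total asset population.

    Repeat triggers on the same asset are counted each time, as described above.
    """
    end = add_months(start, months)
    ids = set(asset_ids)
    hits = sum(1 for asset, when in trigger_events
               if asset in ids and start <= when < end)
    return hits / len(ids)


def estimate_life_months(asset_ids, trigger_events, start, candidates, target=0.10):
    """Pick the candidate window whose trigger fraction is closest to the 10% target."""
    return min(candidates,
               key=lambda m: abs(trigger_fraction(asset_ids, trigger_events, start, m) - target))


# Hypothetical population of 20 similar pumps and their triggering work orders.
population = [f"PUMP-{n:02d}" for n in range(1, 21)]
triggers = [
    ("PUMP-03", date(2021, 6, 14)),
    ("PUMP-11", date(2022, 9, 2)),
    ("PUMP-03", date(2023, 4, 20)),
    ("PUMP-17", date(2023, 10, 5)),
]
start = date(2021, 1, 1)

for months in (12, 24, 36):
    fraction = trigger_fraction(population, triggers, start, months)
    print(f"{months} months: {fraction:.0%} of population triggered")

life_months = estimate_life_months(population, triggers, start, (12, 24, 36))
print(f"Estimated life expectancy: {life_months} months")
```

In this made-up data set, 5% of the population triggers within 12 months, 10% within 24 months, and 20% within 36 months, so the 24 month window is the life expectancy estimate.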

Compare your calculated equipment life to your projected life at time of capital requisition.  This is how to determine life expectancy in years.

Equipment life can also be determined in usage. For instance, vehicle life is more commonly thought of in miles, rather than years or time. This calculation would be more complicated. It would be easiest to calculate this from production data or the OEE system. Use the production numbers or the dollar value of product produced over the life of the asset before replacement or overhaul. Life would then be expressed in dollars or units produced, similar to mileage.

Standardized data is available for some equipment, see the list below.

Equipment life cycle charts

Once your actual equipment life is determined, you can monitor it and determine how to improve it.   My next post will go over how the four factors affect equipment life (Design, Installation, Operation, and Maintenance).

Does anyone have other methods for calculating equipment life?


Choosing the right tool for your analysis: RCM or FMEA

Reliability Centered Maintenance (RCM) reviews and Failure Modes and Effects Analyses (FMEAs) have a lot in common, but there are still some key differences.  Rather than go into the mechanics of each, let’s look at the philosophy to help you choose the appropriate tool for your organization.

RCM analyses as events are often overshadowed because folks have started using the term RCM to mean a proactive operating philosophy, which I call manufacturing excellence.  This is not a review of that philosophy.  Here I am talking about the RCM analysis pioneered by John Moubray and its legitimate offshoots (SAE JA1011_199908).

RCM is a member of the zero culture.  No failures are acceptable.  The RCM will identify all the potential failure points and these will be engineered away.  This may take the form of re-engineering equipment or processes, or engineering a proactive inspection to reduce the risk of an unplanned interruption (failure) to zero.  RCM will rank failures in a high-medium-low fashion, but the ultimate goal is to remove all potential failures, no matter the ranking.  RCM is to the maintenance organization what zero defects is to the quality organization.

Just as true believers of the zero quality defects philosophy removed quality inspectors, a true RCM organization would place less emphasis on mechanics rushing to breakdowns.  There would be no unplanned maintenance.  An RCM organization would take the opportunity of a breakdown to review their engineering efforts and determine how to never have this happen again.  Driving unplanned maintenance to zero would be the vision of the whole organization and resources would be applied appropriately.  This requires much upfront engineering and precise execution of planned maintenance.

FMEA culture does accept some failures.  Run to failure is an option in an FMEA philosophy, but that decision is made in advance, and with eyes wide open.  A key feature of the FMEA is the risk priority number (RPN).  The highest RPN the organization is willing to accept is determined, and this is called the RPN threshold.  The RPN threshold is the point at which the organization has said that the cost of reducing that failure is more than the cost of the failure itself, therefore it will not be engineered out.  Determining the failures and their mitigating activities is similar in both the RCM and the FMEA.  However, the FMEA assigns a number to each failure mode’s severity, occurrence, and detection to determine the risk priority number.  The organization then chooses an RPN threshold and only assigns resources to engineering out failures whose RPNs are above that threshold.  Therefore the organization accepts that failures with RPNs below the threshold will still occur.  They have accepted a breakdown culture, to a certain degree.  This organization will rely on a combination of engineering and maintenance to perform.  The FMEA organization will have fewer engineering resources and more maintenance and troubleshooting resources than the RCM organization.
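To make the threshold idea concrete, here is a minimal sketch.  It assumes the conventional FMEA practice of rating severity, occurrence, and detection on a 1 to 10 scale and multiplying the three to get the RPN; the post itself does not spell out the formula.  The failure modes, their ratings, and the threshold of 100 are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class FailureMode:
    name: str
    severity: int    # assumed 1-10 scale, 10 = worst consequence
    occurrence: int  # assumed 1-10 scale, 10 = most frequent
    detection: int   # assumed 1-10 scale, 10 = hardest to detect

    @property
    def rpn(self) -> int:
        # Conventional RPN: product of the three ratings.
        return self.severity * self.occurrence * self.detection


def split_by_threshold(modes, rpn_threshold):
    """Separate failure modes to engineer out from those accepted as run to failure."""
    engineer_out = [m for m in modes if m.rpn > rpn_threshold]
    accepted = [m for m in modes if m.rpn <= rpn_threshold]
    return engineer_out, accepted


# Hypothetical failure modes and a hypothetical threshold.
modes = [
    FailureMode("Bearing seizure", severity=8, occurrence=4, detection=5),
    FailureMode("Indicator lamp burnout", severity=2, occurrence=5, detection=3),
]
engineer_out, accepted = split_by_threshold(modes, rpn_threshold=100)
for m in engineer_out:
    print(f"Engineer out: {m.name} (RPN {m.rpn})")
for m in accepted:
    print(f"Accept, run to failure: {m.name} (RPN {m.rpn})")
```

Here the bearing seizure scores 160 and gets engineering resources, while the lamp burnout scores 30 and is accepted.  Lowering the threshold moves more failure modes into the first list, which is the lever the organization pulls when it wants to move toward a zero culture.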

Both the RCM organization with its zero tolerance and the FMEA organization with its limited acceptance of breakdowns are legitimate operating philosophies.  Both have many successful examples.  Airlines and power producers are examples of industries that follow the zero philosophy.  Failures in these industries cost the providers huge economic penalties, so the cost of the RCM implementation is easily recovered in cost avoidance.  Failures in either industry also carry a risk of loss of life and, actuarial tables aside, that cannot be measured in pure economic terms.

Many factories and producers adopt the FMEA philosophy of accepting risk.  However, problems arise when management provides resources to act in an FMEA environment and expects RCM zero results.  Management will keep the responsibility for the budget and approving projects to themselves, but assign the accountability for zero breakdowns to the maintenance or maintenance and engineering departments.  This mismatch in accountability and responsibility is what causes some organizations to spiral out of control and become a reactive culture.  Reactive culture is not a sustainable operating philosophy.  Just to be clear, reactive maintenance culture is not a sustainable operating philosophy.  It is not sustainable to operate your organization with a reactive maintenance philosophy.

So when choosing between FMEA and RCM, understand what the organization’s accountability and responsibility structure is for allocating and implementing engineering and maintenance resources.  It is often advisable to lean toward the RCM zero philosophy.  That way the projects to engineer out the failures are in proposal form, just waiting for the funding to be approved.  Let’s look at how a failure might be handled in each organizational philosophy:

FMEA – a failure occurs with a low RPN.  The organization demands an after action review of the failure.

The maintenance manager reviews the original FMEA, confirms the RPN is still valid, and reports to the rest of the site leadership team that this failure was one that “we” determined the organization could weather.  Added to that report are the cost of the failure and an estimate of what it might cost to mitigate that failure.  This confirms that run to failure was the most economical plan.

All is good until someone on the leadership team states “they” were not a part of the “we” and will not accept any failure at any time: a mismatch in philosophies.  Now the organization has to re-determine which philosophy it holds, or whether the RPN threshold should be lower.  This could trigger a review of all the FMEAs against a lower RPN threshold, or a removal of all RPNs to embrace a zero culture.


Alternatively, the maintenance manager reviews the original FMEA and determines the RPN has changed and is, in fact, above the threshold now.  This triggers a project for this specific instance.  It also triggers a review of all FMEAs to recalculate the RPNs for the current operating conditions.  This also sets up the need for a trigger to review RPNs as operating conditions change.

RCM – a failure occurs. The organization demands an after action review of the failure.

The maintenance manager reviews the files, finds the failure and the project associated with its mitigation.  The project is presented to the leadership team with an updated ROI given the recent failure.  The leadership team decides project resources and timing.  This may include deciding that the ROI on the project is still not viable, in which case the project goes back to waiting status.

Both the RCM and FMEA philosophies are acceptable ways to run an organization.  However, if the leadership team is constantly changing faces (individuals), or the operating conditions are constantly changing, it can be advantageous to run with the zero failure philosophy of RCM.  Operating under the FMEA philosophy may make more sense in the reality of limited funds, but it takes much more finesse and an understanding of risk analysis to promote and sustain.

Choose your methodology wisely and be able to explain the philosophy to both your peers and your team.  Confidence in and support for the methodology are much more important than the specific acronym you apply.  Please do choose a proactive approach, because reactive maintenance is not sustainable.  It costs way too much in lost production, equipment wear, and morale of the humans who have to operate in that environment.

Please share your stories of successful RCM or FMEA implementation.

Industrial Pipe Hacks

• All pipes are made of a long hole, surrounded by metal or plastic.

• All pipes are to be hollow throughout the entire length, do not use holes longer than the length of pipe.

• The inside diameter of the pipe must not exceed the outside diameter of the pipe, otherwise the hole will be on the outside.

• All pipes are to be supplied with nothing in the hole, so that water, steam or any other stuff may be put in at a later date.

• All pipe should be supplied without rust, this can be added later on the job site.  Some vendors are now able to supply pre-rusted pipe, if this is available in your area it may save some time on the job site.

• All pipe over 150 meters in length should have the words “long pipe”, clearly painted on each end, so the contractor will know that it is long pipe.

• All pipes over 1 kilometer long must have the words “really long pipe”, painted in the middle, so the contractor will not have to walk the entire length to determine whether it is long or short pipe.

• All pipe over 150mm inside diameter must have the words “big pipe”, painted on it, so the contractor will not mistake it for a small pipe.

• Flanges must be used on all pipe, the holes in the flange must be separated from the big hole in the middle.

• When ordering 90, 45, or 30 degree elbows make sure you specify right or left turn; otherwise you will end up having the pipe going the wrong way.

• Be sure to specify when you order the pipe, whether you want level, uphill or downhill pipe, otherwise if you use uphill pipe for going downhill, the stuff inside will flow the wrong way.

• All couplings should have either right hand or left hand threads, but do not mix the threads, otherwise the coupling being screwed on one pipe is being un-screwed from the other.

• Acme thread is only sold by the W.E. Coyote company, so be sure to order it well in advance, as they often have work stoppages for mishaps and accidents.

I hope these ‘hacks’ give you as big a chuckle as they give me.