Why do we need guiderails in data analytics?
In recent years, data analytics has become essential for many markets, including industrial manufacturing. When allied with domain knowledge, analytics can be key in finding the sources of uptime losses and margin leakage. However, results can prove sensitive to the context of the data, and sometimes data analysis can produce faulty outcomes.
I would like to be able to tell you, as the CTO of a machine learning start-up once told me, “just give me the data and I will sort out the problems.” Unfortunately, it does not work like that. Data analysis techniques, including machine learning, are portable across industries; domain knowledge is not, and you need both to succeed.
An analytical solution must correctly separate causation from simple correlation and alert only on true impending issues. But no data analysis technique, machine learning included, is a silver bullet. Only with ‘guiderails’ can analysis techniques find correct answers.
Otherwise, silly correlations emerge, such as the famous one suggesting that increased consumption of margarine leads to divorce in the state of Maine. The guiderails come from domain knowledge, translated into contextual data limits that establish reasonable expectations of behaviour and exclude the meaningless correlations that machine learning can find when working in isolation.
Machine learning will find all manner of correlations in data, many of them meaningless. Understanding causation requires knowledge and experience. What skills and experience will you need to attempt a solution, how long will it take, and will it scale? On its own, machine learning can only go so far.
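To make the point about meaningless correlations concrete, here is a minimal sketch using made-up numbers, not real margarine or divorce figures. Two unrelated series that both happen to drift downward show a very high Pearson correlation. Comparing period-to-period changes instead of levels is one common sanity check (an assumption of this example, not a method described above), and here it makes the apparent relationship vanish.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def differences(xs):
    """Change from one period to the next, which removes a shared trend."""
    return [b - a for a, b in zip(xs, xs[1:])]

# Illustrative, invented values: two unrelated quantities that both decline.
series_a = [8.0, 7.2, 7.0, 6.2, 6.0, 5.2, 5.0, 4.2, 4.0]
series_b = [5.0, 4.7, 4.4, 4.3, 4.2, 3.9, 3.6, 3.5, 3.4]

raw_corr = pearson(series_a, series_b)
change_corr = pearson(differences(series_a), differences(series_b))

print(f"correlation of levels:  {raw_corr:.2f}")
print(f"correlation of changes: {change_corr:.2f}")
```

The levels correlate strongly only because both series share a downward trend. A check like this does not establish causation; it only removes one common source of false confidence, which is why domain knowledge is still needed.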
Using ‘clustering’ techniques, a form of unsupervised learning, machine learning can detect and learn similar patterns. In the oil and gas sector, for example, clustering can learn what normal operational behaviour looks like from the signals coming from sensors on and around machines. Any deviations from normal, called anomalies, then serve to highlight operational issues with a piece of equipment.
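The clustering idea can be sketched in a few lines. The example below is an illustration under stated assumptions, not a production method: it uses a plain k-means written from scratch, invented sensor readings (vibration in mm/s, bearing temperature in °C), two assumed operating modes, and an arbitrary threshold of twice the largest distance seen in normal data. Features are left unscaled for brevity, which a real solution would not do.

```python
from math import dist

def kmeans(points, k, iterations=20):
    """Plain k-means; initial centroids are evenly spaced samples (a simplification)."""
    step = max(1, len(points) // k)
    centroids = [points[i * step] for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its cluster; keep it if the cluster is empty.
        centroids = [
            tuple(sum(col) / len(cluster) for col in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids

def distance_to_normal(reading, centroids):
    """Distance from a reading to the closest learned 'normal' cluster centre."""
    return min(dist(reading, c) for c in centroids)

def is_anomaly(reading, centroids, threshold):
    return distance_to_normal(reading, centroids) > threshold

# Hypothetical normal readings: (vibration mm/s, bearing temperature degC).
idle = [(1.0, 40.0), (1.2, 41.0), (0.9, 39.5), (1.1, 40.5)]
full_load = [(3.0, 70.0), (3.2, 71.0), (2.9, 69.0), (3.1, 70.5)]
normal = idle + full_load

centroids = kmeans(normal, k=2)
# Assumed rule of thumb: flag anything twice as far out as the worst normal reading.
threshold = 2.0 * max(distance_to_normal(p, centroids) for p in normal)

for reading in [(1.05, 40.2), (3.0, 85.0), (2.0, 55.0)]:
    print(reading, "anomaly" if is_anomaly(reading, centroids, threshold) else "normal")
```

Note that the algorithm only reports that a reading is unlike anything seen before. Deciding whether that matters, and why, still takes knowledge of the equipment.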
What is supervised learning?
Another machine learning technique, supervised learning, requires a human to declare an event: the date and time when something happened. Machine learning has no understanding of what happened at that moment; it takes domain knowledge and an understanding of data context to attach meaning to the event. Once an event is declared, however, the algorithm learns the signature of the precise patterns that led up to it.
For example, in heavy industry, an event could be a machine failure with an exact cause, such as a bearing failure. With its learned knowledge of the precise degradation and failure pattern, the model then tests new incoming patterns to detect recurrences well before the failure occurs. Such early notification allows action to avoid the degradation completely, or provides time to arrange a repair before major damage occurs. The results are much lower maintenance costs and more uptime producing valuable products.
At a plant site, expert staff understand the relationships between machine behaviour and subsequent degradation mechanisms. The staff provide such insight to direct machine learning to find the proper causation patterns. In addition, we are discovering that our complex first-principles and empirical models can forecast the likely ‘neighbourhood’ of specific results and consequently can also provide guiderails for machine learning to discover exact patterns of degradation.
All in all, data context is crucial to labelling events correctly, selecting variables and directing the data clean-up. Effective solutions always require marrying what you know about the processes emitting the data with expertise in the analytical techniques. Thus, the guiderails must be tough and robust.
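As a rough sketch of how a declared event becomes training data: a person supplies the failure time, windows of sensor data shortly before it are labelled "pre-failure", and a classifier learns to recognise that signature in new data. Everything specific here is an assumption for illustration: the vibration values are invented, the window width and lead time are arbitrary, and a simple nearest-centroid classifier stands in for whatever model a real solution would use.

```python
from math import dist

def features(window):
    """Summarise a window as (mean level, rise from first to last sample)."""
    return (sum(window) / len(window), window[-1] - window[0])

def labelled_windows(series, failure_index, width=5, lead=8):
    """Label each sliding window by how close its end is to the declared failure."""
    examples = []
    for end in range(width, len(series) + 1):
        label = "pre-failure" if failure_index - end < lead else "normal"
        examples.append((features(series[end - width:end]), label))
    return examples

def train(examples):
    """Nearest-centroid 'model': the average feature vector per label."""
    centroids = {}
    for label in {lab for _, lab in examples}:
        rows = [f for f, lab in examples if lab == label]
        centroids[label] = tuple(sum(col) / len(rows) for col in zip(*rows))
    return centroids

def classify(window, centroids):
    f = features(window)
    return min(centroids, key=lambda label: dist(f, centroids[label]))

# Invented history: steady vibration, then a ramp ending in a bearing failure.
history = [1.0 + 0.05 * (i % 3) for i in range(20)] + [1.1 + 0.2 * j for j in range(10)]
failure_index = 30  # declared by a human: the sample at which the failure occurred

model = train(labelled_windows(history, failure_index))

steady = [1.0, 1.05, 1.1, 1.0, 1.05]
rising = [1.05, 1.3, 1.5, 1.7, 1.9]
print(classify(steady, model))  # normal
print(classify(rising, model))  # pre-failure
```

The only thing the code knows about bearings is the single number `failure_index`. The choice of event, the lead time and the features all come from people who understand the machine.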
How does all this work in practice?
Take a two-phased approach. First, do the engineering. Learn about the process producing the data, correctly label the important events and perhaps calculate some imperative conditions, such as known physical limitations. Use this information as guiderails to cleanse the data and the subsequent event patterns, with an understanding of operating modes. Then, when the engineering effort is done, switch into data scientist mode.
By then you have supplied the data context, and the algorithms do not care about your particular problem domain. In the analytical depths, the data, algorithms and patterns do not know where they came from: it is just data. Scales, engineering units and data sources can be diverse and do not matter. In this context, we do not strictly need the rigour of engineering models and the implied complex differential equations.
In summary, the data input guiderails do matter. It always takes carefully ‘framed’ data sets to secure precise outcomes. Understanding the process is what frames the data with context. So, learn the pertinent process details for each solution and then transition from engineering to data science using the guiderails.
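The engineering phase can be pictured as a filter that sits in front of the algorithms. The sketch below is hypothetical throughout: the field names, the limit values and the operating modes are invented to show the shape of the idea, and real limits would come from the equipment's design data and from the people who run it.

```python
from dataclasses import dataclass

@dataclass
class Reading:
    timestamp: str
    mode: str            # operating mode recorded with the data
    pressure_bar: float
    temperature_c: float

# Hypothetical guiderails supplied by domain experts.
PHYSICAL_LIMITS = {
    "pressure_bar": (0.0, 60.0),
    "temperature_c": (-20.0, 450.0),
}
ANALYSED_MODES = {"running"}  # e.g. exclude start-up and shutdown transients

def within_limits(reading):
    """True when every limited field lies inside its physically plausible range."""
    return all(
        low <= getattr(reading, name) <= high
        for name, (low, high) in PHYSICAL_LIMITS.items()
    )

def apply_guiderails(readings):
    """Split readings into those passed to analysis and those rejected, with a reason."""
    kept, rejected = [], []
    for r in readings:
        if r.mode not in ANALYSED_MODES:
            rejected.append((r, "operating mode excluded"))
        elif not within_limits(r):
            rejected.append((r, "outside physical limits"))
        else:
            kept.append(r)
    return kept, rejected

readings = [
    Reading("2020-01-01T00:00", "running", 42.0, 310.0),
    Reading("2020-01-01T01:00", "startup", 12.0, 90.0),
    Reading("2020-01-01T02:00", "running", -5.0, 305.0),   # impossible pressure
    Reading("2020-01-01T03:00", "running", 43.5, 312.0),
]

kept, rejected = apply_guiderails(readings)
print(len(kept), "readings kept")
for r, reason in rejected:
    print(r.timestamp, "rejected:", reason)
```

Keeping the rejected readings together with the reason, rather than silently dropping them, makes it easier for process experts to review whether the guiderails themselves are set sensibly.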