Machine learning and artificial intelligence technologies are excellent methods for extracting insight from vast amounts of data. But how do you generate insight when data is limited or inaccurate?
Consider a scenario where a variety of outcomes exist with different degrees of certainty and data accuracy, such as measuring the risks of international oil and gas projects. These projects feature high capital intensity, high risk, and contract diversity. To help decision makers make sound decisions under uncertainty, those risks need to be measured.
Handling scenarios like this using machine learning techniques is very complex and requires a lot of accurate data. How, then, can you generate insight?
The simple answer is the Monte Carlo Simulation method.
The method can be applied to simulate the stochastic distribution of risk factors in a probabilistic model, which makes it useful across a wide range of problems.
To demonstrate the usage of this technique, we are going to apply this rightly popular method to an incredibly complex pipeline integrity case and discuss the way infrastructure and compute challenges were handled.
First, however, let’s break down exactly what the Monte Carlo Simulation (MCS) is and how it functions as such an important tool.
A Powerful Data Science Multitool
The Monte Carlo Simulation method is a broad class of mathematical algorithms that uses repeated random sampling to gain probabilistic insights into problems. MCS is used to model the probability of different outcomes in a process that cannot easily be predicted due to the intervention of random variables. It is a technique used to understand the impact of risk and uncertainty in prediction and forecasting models. A Monte Carlo Simulation can be used to tackle a range of problems in finance, oil and gas, engineering, project management, supply chain, risk analysis, and science.
The simulation is used when:
A process cannot be easily predicted due to involved risks and uncertainties
The relationship between influential factors is too complex for traditional statistical models
Data is sufficient but not abundant
Data on outcomes/results are impossible or costly to collect
When faced with significant uncertainty in the process of making a forecast or estimate, the Monte Carlo Simulation does better than replacing the uncertain variable with a single average number: it represents uncertainty with distributions. Through MCS, uncertainties in the input distributions are propagated to the output. The output is then a distribution of the millions of simulated scenarios that could occur given the input parameters. This makes MCS a very popular approach for risk analysis in business, engineering, and operational problems, where there is always a degree of uncertainty in input parameters and output.
Below is a simple diagram demonstrating the MCS method:
For an equation: x + y = z, where x and y are single values, z is simply a sum of x and y.
Now, imagine that instead of being single numbers, x and y are probability distributions. In order to solve for z, a random number is sampled from each of the distributions and plugged into the equation. This process of “sample -> substitute -> solve” is repeated many times, and the resulting values are collected to generate a distribution of z.
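The “sample -> substitute -> solve” loop can be sketched in a few lines of Python. This is a minimal illustration, not a production model: the Normal and Uniform distributions chosen for x and y, their parameters, and the iteration count are all assumptions made for the example.

```python
import random
import statistics


def simulate_z(n_iterations=100_000, seed=42):
    """Propagate uncertainty in x and y through z = x + y."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    z_values = []
    for _ in range(n_iterations):
        x = rng.gauss(10.0, 2.0)   # sample: x ~ Normal(mean=10, sd=2), assumed
        y = rng.uniform(0.0, 6.0)  # sample: y ~ Uniform(0, 6), assumed
        z_values.append(x + y)     # substitute and solve
    return z_values


z = simulate_z()
z_sorted = sorted(z)
print("mean of z:", round(statistics.fmean(z), 2))
print("5th percentile:", round(z_sorted[int(0.05 * len(z))], 2))
print("95th percentile:", round(z_sorted[int(0.95 * len(z))], 2))
```

The result is not a single number but a collection of z values, which can be summarized by its mean, its percentiles, or a histogram.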
The Benefits of Monte Carlo Simulation
MCS offers a number of advantages over deterministic statistical methods when approaching complex problems:
What-if scenario analysis can be conducted, and the impact of changes in input variables can be tested with no real consequences.
The simulation model can be calibrated by adjusting the distribution models for input variables.
Better decision-making is made possible, as the distribution of the estimated variable of interest provides more insight. Decisions can be made based on probable scenarios and a range of values instead of a single value result.
Accounting for uncertainties by using input distributions mitigates the risk of making assumptions. Input distributions innately take into account errors, variations, and uncertainties.
Extreme events and rare instances can be simulated when performing risk analysis.
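As a rough illustration of the what-if analysis mentioned above, the sketch below reruns a toy simulation of z = x + y with a shifted input and compares the probability that z exceeds a threshold. The distributions, the threshold of 18, and the function name `exceedance_probability` are hypothetical choices for the example. Reusing the same seed in both runs means both scenarios see the same random draws, so the difference reflects the changed input rather than sampling noise.

```python
import random


def exceedance_probability(mean_x, threshold=18.0, n_iterations=100_000, seed=7):
    """Estimate P(x + y > threshold) for an assumed toy model."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_iterations):
        # Assumed inputs: x ~ Normal(mean_x, sd=2), y ~ Uniform(0, 6)
        z = rng.gauss(mean_x, 2.0) + rng.uniform(0.0, 6.0)
        if z > threshold:
            hits += 1
    return hits / n_iterations


baseline = exceedance_probability(mean_x=10.0)  # current assumption
what_if = exceedance_probability(mean_x=12.0)   # what if x runs higher?
print("baseline P(z > 18):", baseline)
print("what-if  P(z > 18):", what_if)
```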
Challenges Utilizing MCS Across Industries
Despite all the benefits that MCS provides, the method is still not broadly utilized across industries. There are a number of roadblocks that organizations face when trying to run simulations at scale. Here are three of the most common challenges:
Failure to define the right distributions for input parameters: This requires deep knowledge of the behaviour of parameters individually and in relationship with other parameters, as well as the ability to translate that knowledge into available and existing data and define the right distribution for each input parameter.
Inefficient sample size in designing the simulation: MCS is computationally intensive and can easily become very costly. With the right sample size, companies can reduce compute, time, and cost and still achieve the desired accuracy. Optimizing sample size for different scenarios and thresholds allows organizations to run the simulation at scale in a cost- and time-effective way.
Lack of the right infrastructure to run the simulation in a cost- and time-effective manner: To achieve the required accuracy for effective decision making, simulations are usually run for millions of iterations. This takes a large amount of compute and requires advanced infrastructure and integration. Not every company has access to these technologies, and even for those that do, running the simulation can be very costly.
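The sample-size trade-off can be made concrete with the standard error of a Monte Carlo probability estimate, sqrt(p(1 - p) / N), where p is the probability being estimated and N is the number of iterations. The sketch below uses that textbook formula to ask how many iterations are needed for a target relative error. The function names and the example values of p and relative error are assumptions, and this is not necessarily how sample size was optimized in the use case discussed later.

```python
import math


def standard_error(p, n):
    """Standard error of a probability estimated from n independent trials."""
    return math.sqrt(p * (1.0 - p) / n)


def required_iterations(p, relative_error):
    """Smallest n for which standard_error(p, n) / p <= relative_error."""
    return math.ceil((1.0 - p) / (p * relative_error ** 2))


# Rarer events need far more iterations for the same relative error.
for p in (1e-2, 1e-4, 1e-6):
    n = required_iterations(p, relative_error=0.1)
    print(f"p = {p:g}: about {n:,} iterations for 10% relative error")
```

Because the required N grows roughly in proportion to 1/p, the iteration count is driven by the rarest event the analysis has to resolve.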
Utilizing MCS in Pipeline Failure Simulation
Integra has developed an MCS solution for a Pipeline Integrity use case, and addressed the above challenges through advanced analytics and data science, combined with subject matter expertise and its innovative platform Digital Hub™.
In this use case, MCS was used to estimate the severity of corrosion defects and the probability that they cause pipelines to fail safety criteria. The result of this estimation is used by integrity and risk specialists to make repair and re-inspection decisions to ensure pipelines are running safely and reliably. To be able to make those decisions, pipeline integrity specialists require a minimum precision of 10^-6.
In other words, each corrosion defect needs to be simulated at least 1 million times.
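The link between precision and iteration count follows from how a failure probability is estimated: it is the number of failing iterations divided by the total, so with N iterations the smallest nonzero value that can be reported is 1/N. The sketch below shows this for a single defect. Everything specific in it is hypothetical and chosen only for illustration: the wall thickness, the 80% depth criterion, the depth and growth-rate distributions, and the five-year horizon are not taken from Integra's model.

```python
import random


def failure_probability(n_iterations, seed=0):
    """Estimate the probability that one defect breaches a depth criterion."""
    rng = random.Random(seed)
    wall_thickness_mm = 10.0  # hypothetical pipe wall thickness
    limit_fraction = 0.8      # hypothetical criterion: depth > 80% of wall
    years = 5.0               # hypothetical time until next inspection
    failures = 0
    for _ in range(n_iterations):
        depth_mm = rng.gauss(4.0, 0.5)              # measured depth plus tool error
        growth_mm_per_year = rng.lognormvariate(-1.5, 0.5)  # corrosion growth rate
        future_depth = depth_mm + growth_mm_per_year * years
        if future_depth > limit_fraction * wall_thickness_mm:
            failures += 1
    return failures / n_iterations


n = 200_000
p_fail = failure_probability(n)
print("estimated probability of failure:", p_fail)
print("smallest nonzero estimate at this N:", 1 / n)
```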
The number of corrosion defects per km of pipeline varies from a few hundred to a few thousand, based on the pipeline length, age, and condition. In Integra’s Pipeline Failure Simulation, an MCS algorithm was developed to simulate the probability of failure of 350,000 corrosion defects across the pipeline with a required accuracy of 10^-7.
We will discuss how Integra addressed the infrastructure challenge of running MCS in its Pipeline Integrity use case. If you’re interested in reading more about how the first two challenges of utilizing the Monte Carlo Simulation, input distributions and sample size optimization, were handled, keep your eyes on these blog pages for our upcoming detailed logistical rundown.
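A back-of-the-envelope calculation shows the scale of this workload. It assumes that an accuracy of 10^-7 translates into 10^7 iterations per defect, following the same reading the article applies to 10^-6 and 1 million iterations. The per-worker throughput is a purely hypothetical figure used to turn the count into time.

```python
defects = 350_000
iterations_per_defect = 10**7  # assumption: accuracy 10^-7 -> 10^7 iterations
total_evaluations = defects * iterations_per_defect

evals_per_second_per_worker = 1_000_000  # hypothetical throughput


def wall_clock_hours(workers):
    """Hours needed if the work splits evenly across identical workers."""
    seconds = total_evaluations / (evals_per_second_per_worker * workers)
    return seconds / 3600


print(f"total evaluations: {total_evaluations:,}")  # 3,500,000,000,000
for workers in (1, 10, 100):
    print(f"{workers:>3} workers: {wall_clock_hours(workers):,.1f} hours")
```

Under these assumptions the job involves 3.5 trillion evaluations, which is why spreading the work over many workers matters.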
Handling Infrastructure and Compute Challenges
To be able to run MCS at scale, infrastructures and technologies need to exist to optimize compute, cost, and time. Infrastructure and computation challenges associated with using Monte Carlo Simulation at scale fall into three categories:
Lack of proper infrastructure and integration of technologies
Maintenance, troubleshooting, and compatibility difficulties
Maintaining a reasonable cost without compromising quality and potential
Let’s look at how these challenges impact the usability of MCS and how Integra solved them using its platform, Digital Hub™.
MCS processes can be computationally expensive. A limitation to the insights that can be generated from data is how much of it can easily be analyzed. This is difficult to do with traditional, monolithic architecture-based platforms and applications.
Advances in parallel and cloud computing have solved this problem. However, documentation relevant to setting up the required infrastructure is widely dispersed, and it is difficult to troubleshoot and problem-solve these cutting-edge approaches.
Technologies for parallel processing vary. Some technologies may be a better fit for certain applications than others. Selecting the best fit for a project is a task that requires considerable research, thought, and consideration.
Setting up and maintaining big data processing for projects is difficult and involves a steep learning curve. Maintaining a multi-server cluster to deploy applications onto is a significant task that not all users may be comfortable taking on. This, however, is necessary for MCS as a computationally intensive process.
How Integra Tackled the Infrastructure Challenge
Integra leveraged its innovative platform Digital Hub™ to remove these roadblocks and enable organizations to solve business problems using Monte Carlo Simulation without worrying about setting up and maintaining the infrastructure.
Digital Hub™ is a modular and integrated system of mainly open source tools and technologies. The platform enables the development and deployment of end-to-end, enterprise-grade AI and machine learning solutions, including MCS, in a cost-effective fashion. The applications required to develop an MC simulation and handle the data and compute are already integrated and actively maintained and updated. These software applications are organized into containers and deployed onto a scalable cloud infrastructure.
The system is built to handle big data, which is the common feature across MCS applications.
Data is streamlined across the applications and can be accessed and analyzed at any stage of the solution development. The compute challenge is handled through innovative cloud infrastructures that scale automatically to the point that is required to process the data and run simulation algorithms.
Scalable cloud-based architecture allows for the concurrent usage of multiple servers and parallel processing. This accelerates the processing of large amounts of data and reduces the direct compute cost.
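Monte Carlo iterations are independent of one another, which is what makes them easy to spread across workers. The sketch below shows the general pattern on a single machine using Python's standard library: split the iterations into chunks, give each chunk its own seed, run the chunks concurrently, and combine the counts. It is not Digital Hub™ code. The toy model (a standard normal draw exceeding a threshold) and all names are assumptions. A thread pool keeps the example portable, though in CPython threads do not speed up a pure-Python loop like this one; passing `ProcessPoolExecutor` as `executor_cls` from a script guarded by `if __name__ == "__main__":`, or handing the chunks to a cluster scheduler, follows the same structure.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def count_exceedances(task):
    """Run one chunk of iterations and return how many exceed the threshold."""
    n_iterations, seed, threshold = task
    rng = random.Random(seed)  # each chunk gets its own generator and seed
    hits = 0
    for _ in range(n_iterations):
        if rng.gauss(0.0, 1.0) > threshold:  # toy model, assumed
            hits += 1
    return hits


def parallel_estimate(total_iterations, n_chunks, threshold=2.0,
                      base_seed=1234, executor_cls=ThreadPoolExecutor):
    """Estimate P(Z > threshold) by combining independently run chunks."""
    chunk = total_iterations // n_chunks  # any remainder is dropped
    tasks = [(chunk, base_seed + i, threshold) for i in range(n_chunks)]
    with executor_cls(max_workers=n_chunks) as pool:
        hits = sum(pool.map(count_exceedances, tasks))
    return hits / (chunk * n_chunks)


estimate = parallel_estimate(200_000, n_chunks=4)
print("estimated P(Z > 2):", estimate)
```

Because each chunk only returns a count, very little data moves between workers, so the same structure scales from a few local workers to many servers.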
Utilizing open source applications reduces the cost of infrastructure, and with auto scaling capabilities, organizations only pay for the used compute and storage.
This means we can utilize MCS in a way that fully takes advantage of the advanced and open source cloud technologies out there.
The appropriate use of data to generate input distributions that accurately reflect uncertainties and deviations, well-designed algorithms that back the MCS, and the power of cloud computing technologies are the ingredients that make data-driven decisions possible.