We needed better and it didn’t exist, so we built it: An Open Source Data Science Ecosystem - Part 1
As a member of a data science team working in Oil and Gas, I have worked on many analytics projects solving real-world, everyday operational problems in Drilling, SAGD operations, Gas Production, Pipeline Integrity, and more. No matter the nature or complexity of the problem, I have consistently come across the same roadblocks. Throughout my career, I have spent a good amount of time breaking through these roadblocks on each project, just to get to the point where I could begin understanding the data, working out a solution, and eventually implementing it. And every other data scientist on the team had to go through the same pain.
We decided to stop wasting time. Instead, my Integra Data & Analytic Solutions team actually built a tool to tackle these roadblocks.
We know how powerful and transformative data can be, and we wanted to remove any and all barriers to entry and data science project success. Digital Hub™ is the platform we created for ourselves, using open source tools that are neatly packaged and ready to use, so we can focus on solving problems and doing what we do best: Data Science!
The Roadblocks That Kill Projects
Lack of centralized data storage: Everyone in the field knows the mantra: Data is everything! You need to be able to access the data at any stage of the project and from all the applications and tools you use to deliver your solution. There should be only one source of truth for the data - not many versions in different places. This is critical, especially when collaborating with others.
Absence of an integrated marketplace with all the required libraries and technologies: Most of my time at the beginning of each project should be spent planning and outlining the project’s roadmap to success. Instead, I spent a lot of it setting up the environment, installing packages, and finding the most up-to-date open source tools for exploratory data analysis. Certain tools are required for every data science project: a notebook for developing code, the packages and libraries (mostly open source) that perform the initial and advanced analysis of the data, a visualization tool for reporting, and so on. Integrating these tools and packages, keeping them up to date, and making them all work together has always been a challenge for data scientists. These steps are repetitive and more or less the same for every project.
Limited compute and processing power to handle big data: As a data scientist working on a desktop or personal laptop, you are always limited by the processing power of your system. This becomes a major issue when you need to process large amounts of data (such as high-frequency data) or perform compute-intensive tasks (such as running millions of simulations).
Lack of a collaboration environment for a team of data scientists: Most data science projects are carried out by a team of data scientists with different specialties and responsibilities. While you are running different algorithms for your machine learning problem, your teammate might be working on the EDA or visualizations. Everyone on the team should be able to share their work and collaborate while having access to the same source of data and technologies.
How Digital Hub™ Made Every Project Better
Enter our platform, Digital Hub™. In the simplest terms, it helps me accelerate my data science work. More significantly, it transforms the way my team works.
Leveraging Digital Hub™ capabilities, we are able to deliver our data science projects in very short timelines with incredible success.
1. It gave us centralized data storage to streamline data flow through all the applications.
Digital Hub™ creates cloud storage for each user that can also be shared within teams. With just a few lines of code, I can bring data from a client’s database - in this case Azure Blob Storage - into Digital Hub™. From there, I can access it at any stage of the project and from any associated technology.
Data flow from Database to Digital Hub™
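To give a feel for what "a few lines of code" might look like, here is a minimal sketch using the azure-storage-blob Python SDK. It is not the actual Digital Hub™ code: the helper names, the container name, the blob path, and the destination folder are all hypothetical placeholders, and it assumes a connection string supplied by the client.

```python
from pathlib import Path


def connect_to_container(connection_string: str, container_name: str):
    """Return a ContainerClient for the client's Azure Blob Storage container."""
    # Imported lazily so the copy helper below can be used without the SDK installed.
    from azure.storage.blob import BlobServiceClient

    service = BlobServiceClient.from_connection_string(connection_string)
    return service.get_container_client(container_name)


def copy_blob_to_storage(container_client, blob_name: str, dest_dir: str) -> Path:
    """Download one blob and write it under dest_dir, keeping its file name."""
    target = Path(dest_dir) / Path(blob_name).name
    target.parent.mkdir(parents=True, exist_ok=True)
    data = container_client.download_blob(blob_name).readall()
    target.write_bytes(data)
    return target


# Hypothetical usage; every name below is a placeholder, not a real resource:
# container = connect_to_container(os.environ["AZURE_STORAGE_CONNECTION_STRING"], "well-logs")
# copy_blob_to_storage(container, "raw/well_01.csv", "/home/shared/project-data")
```

Writing into a folder on shared storage, rather than a local temp directory, is what would let every notebook and tool on the team read the same copy of the data.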
2. It is an integrated platform with all the required tools and technologies we could need.
Digital Hub™ integrates a number of technologies that are required in any data science project. Its Jupyter notebook comes with all the required packages and libraries already installed, and with complex technologies such as Apache Spark already integrated. There are also pre-built notebooks for some common problems in Oil and Gas, so even if you are not a coder, you can still get started with your analysis. (Take a look at this notebook on Rock Type analysis using well log data.)
There’s a lot to this second point that needs to be unpacked. It means:
No more Exploratory Data Analysis (EDA) redundancies: The data science process starts with data exploration. In the traditional approach, I had to get deep into coding to understand the data, repeating the same EDA steps for each new dataset. Not exactly the most exciting part of being a data scientist. Digital Hub™, on the other hand, comes with a choice of open source EDA libraries, which create a comprehensive EDA report without a single line of code. With this capability, I can focus directly on drawing insights from the data.
Machine Learning (ML) implementation is made easy: After the initial data analysis, it’s time to implement a machine learning model. On all my earlier projects, I invested far more time than I would have liked going through the scikit-learn documentation to understand how to implement and evaluate ML models. This overhead was avoidable, as many open source libraries already generate model results and evaluation metrics. Digital Hub™ brings AutoML capabilities into the platform itself through H2O.ai, so there is no need to type out each line of code for different ML algorithms. H2O.ai’s AutoML automates the machine learning workflow, including automatic training and tuning of many models within a user-specified time limit. It not only implements different models, but also shows which model appears to work best based on the evaluation metrics. Below is an example of a machine learning model trained and evaluated using H2O.ai’s AutoML for a rock type classification problem (3 classes).
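To make those "redundant EDA steps" concrete, here is a small standard-library sketch of the kind of per-column summary an automated EDA report produces. It only illustrates the idea and is not one of the libraries bundled with Digital Hub™; the function name and the sample columns are made up for the example.

```python
import csv
import io
import statistics


def profile_columns(csv_text: str) -> dict:
    """Summarise each column: row count, missing count, and basic stats."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    report = {}
    for name in (rows[0].keys() if rows else []):
        raw = [row[name] for row in rows]
        present = [v for v in raw if v not in ("", None)]
        summary = {"count": len(raw), "missing": len(raw) - len(present)}
        try:
            numbers = [float(v) for v in present]
        except ValueError:
            # Non-numeric column: report the number of distinct values instead.
            summary["type"] = "categorical"
            summary["unique"] = len(set(present))
        else:
            summary["type"] = "numeric"
            if numbers:
                summary.update(
                    min=min(numbers),
                    max=max(numbers),
                    mean=statistics.fmean(numbers),
                )
        report[name] = summary
    return report


# Hypothetical well-log sample; the column names are placeholders.
sample = "depth,gamma_ray,rock_type\n1000,45.0,sand\n1001,,shale\n1002,55.0,sand\n"
for column, summary in profile_columns(sample).items():
    print(column, summary)
```

A real EDA library adds much more on top of this (distributions, correlations, warnings), which is exactly the boilerplate I no longer have to write for every new dataset.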
20 models implemented on the classification problem with the model errors
The benefit goes beyond model training and evaluation: many other important characteristics of the model can be obtained as well, one of them being variable importance.
Variable importance for the classification problem
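For readers who want to try something similar outside the platform, a time-boxed AutoML run with the h2o Python package can be sketched roughly as follows. This is an illustration under assumptions, not the exact notebook behind the results above: the file name, the target column, and the time limit are placeholders, and h2o needs a Java runtime available to start its local cluster.

```python
def predictor_columns(columns, target):
    """Every column except the target is treated as a predictor."""
    return [c for c in columns if c != target]


def run_automl(csv_path, target, max_runtime_secs=600, seed=1):
    """Train AutoML on a CSV file and return the run object (leaderboard included)."""
    # Imported lazily: h2o needs a Java runtime and starts a local cluster.
    import h2o
    from h2o.automl import H2OAutoML

    h2o.init()
    frame = h2o.import_file(csv_path)
    # Mark the target as categorical so AutoML treats this as classification.
    frame[target] = frame[target].asfactor()

    aml = H2OAutoML(max_runtime_secs=max_runtime_secs, seed=seed)
    aml.train(
        x=predictor_columns(frame.columns, target),
        y=target,
        training_frame=frame,
    )
    return aml


# Hypothetical usage; the file and column names are placeholders:
# aml = run_automl("well_logs.csv", target="rock_type", max_runtime_secs=300)
# print(aml.leaderboard.head())
```

The leaderboard ranks the trained models by their evaluation metrics, which is the kind of table shown in the first figure above.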
No more struggling with Visualization and Dashboarding: For reporting and insight sharing, I would use Matplotlib, Seaborn, Plotly, and many other libraries for visualizations. The dashboarding process was quite complex, and it was not easy to move data to and from the BI solutions. Lots of isolated pieces complicate the analysis pipeline and multiply the chances of error. Digital Hub™ integrates technologies like Superset and Grafana, which can be very efficient for data visualization, whether it’s time series data or any other data type. I can build a dashboard using Superset and share it with the stakeholders responsible for making critical drilling decisions.
No more upgrading package versions: We all know how cumbersome and frustrating it is to upgrade packages. Making all the technologies work with the new version is a pain we all go through in our data science efforts. Digital Hub™ has been designed to keep package versions consistently up to date, and its CI/CD process checks that the integrations still work. It is reassuring that I just need to execute and focus on my project, while the compatibility issues in the backend are taken care of.
I have covered only two of the listed roadblocks that Digital Hub™ helped me tackle, but I will walk you through how it took care of the compute, processing power, and collaboration challenges in upcoming blogs.
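As an illustration of the kind of check a CI/CD pipeline can run, here is a small sketch that compares installed package versions against a pinned list. I am not describing the actual Digital Hub™ pipeline here; the function names are hypothetical and the pins would come from your own requirements file.

```python
from importlib import metadata


def installed_version(package: str):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def find_version_drift(pins: dict, lookup=installed_version) -> dict:
    """Map each drifted package to (expected, found); an empty dict means all match."""
    drift = {}
    for package, expected in pins.items():
        found = lookup(package)
        if found != expected:
            drift[package] = (expected, found)
    return drift


# Hypothetical usage in a CI step; the pins below are placeholders:
# drift = find_version_drift({"pandas": "<pinned version>", "h2o": "<pinned version>"})
# if drift:
#     raise SystemExit(f"Package versions drifted from the pins: {drift}")
```

Failing the build when versions drift is one simple way to catch compatibility problems before they reach anyone's notebook.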
Stay tuned for Part 2!