
We needed better and it didn’t exist, so we built it: An Open Source Data Science Ecosystem - Part 1

As a member of a data science team working in Oil and Gas, I have worked on many analytics projects solving real-world, everyday operational problems in Drilling, SAGD operations, Gas Production, Pipeline Integrity, and more. No matter the nature or complexity of the problem, there are common roadblocks I have consistently come across. Throughout my career, I have spent a good amount of time breaking through these roadblocks on each project, just to reach the point where I could begin understanding the data, developing a solution, and eventually implementing it. On top of that, every data scientist on the team had to go through the same pain.

We decided to stop wasting time. Instead, my Integra Data & Analytic Solutions team actually built a tool to tackle these roadblocks.

In fact, we ended up with our dream data science platform, Digital Hub™.

We know how powerful and transformative data can be, and we wanted to remove every barrier to entry and to data science project success. Digital Hub™ is the platform we created for ourselves, using open source tools that are neatly packaged and ready to use, so we can focus on solving problems and doing what we do best: Data Science!

The Roadblocks That Kill Projects

  1. Lack of centralized data storage: Everyone in the field knows the mantra: Data is everything! It is important to be able to access the data at any stage of the project and from the various applications and tools you employ to deliver your solution. There should be only one source of truth for the data - not many versions in different places. This is critical, especially when collaborating with others.

  2. Absence of an integrated marketplace with all the required libraries and technologies: Most of my time at the beginning of each project should be spent planning and outlining the project’s roadmap to success. Instead, I spent a lot of it setting up the environment, installing packages, and finding the most up-to-date open source tools to perform the exploratory data analysis. Certain tools are required for every data science project: a notebook for developing code, the (mostly open source) packages and libraries needed for initial and advanced analysis of the data, a visualization tool for reporting, and so on. Integrating these tools and packages, keeping them up to date, and making them all work together has always been a challenge for data scientists. These steps are repetitive and more or less the same for every project.

  3. Limited compute and processing power to handle big data: As a data scientist working on a desktop or personal laptop, you are always limited by the processing power of your system. This becomes a major issue when you need to process large amounts of data (such as high-frequency data) or perform compute-intensive tasks (such as running millions of simulations).

  4. Lack of a collaboration environment for a team of data scientists: Most data science projects are done by a team of data scientists with different specialties and responsibilities. While you are running different algorithms for your machine learning problem, your teammate might be working on the EDA or visualizations. Everyone on the team should be able to share their work and collaborate while having access to the same source of data and technologies.

How Digital Hub™ Made Every Project Better

Enter our platform, Digital Hub™. In the simplest terms, it helps me accelerate my data science work. More significantly, it transforms the way my team works.

Leveraging Digital Hub™ capabilities, we are able to deliver our data science projects in very short timelines with incredible success.

1. It gave us centralized data storage to streamline data flow through all the applications.

Digital Hub™ creates cloud storage for each user that can also be shared within teams. With just a few lines of code, I can bring data from a client’s database - in this case Azure Blob Storage - into Digital Hub™. From there, I can access it at any stage of the project and in any associated technology.

Data flow from Database to Digital Hub™
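To give a feel for what "a few lines of code" can look like, here is a minimal sketch in Python. It is not Digital Hub™'s actual API: it assumes the open source azure-storage-blob package for the Azure side, and it treats the Digital Hub™ storage as an ordinary directory path. The function names connect_to_container and copy_blobs, and the dest_dir argument, are hypothetical and used only for illustration.

```python
from pathlib import Path


def connect_to_container(connection_string, container_name):
    """Return an Azure ContainerClient (requires the azure-storage-blob package)."""
    # Imported lazily so the copy logic below can be exercised without the SDK.
    from azure.storage.blob import ContainerClient

    return ContainerClient.from_connection_string(
        connection_string, container_name=container_name
    )


def copy_blobs(container_client, dest_dir, prefix=""):
    """Download every blob whose name starts with `prefix` into `dest_dir`.

    `dest_dir` stands in for a shared storage location (an assumption, not
    a documented Digital Hub path). Returns the list of local paths written.
    """
    dest_root = Path(dest_dir)
    written = []
    for blob in container_client.list_blobs(name_starts_with=prefix):
        # Blob names may contain "/" segments; mirror them as subdirectories.
        target = dest_root / blob.name
        target.parent.mkdir(parents=True, exist_ok=True)
        data = container_client.download_blob(blob.name).readall()
        target.write_bytes(data)
        written.append(target)
    return written
```

In use, this would be something like client = connect_to_container(conn_str, "my-container") followed by copy_blobs(client, "/path/to/shared/storage", prefix="well-logs/"), where the connection string, container name, destination path, and prefix are placeholders you would supply yourself.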

2. It is an integrated platform with all the required tools and technologies we could need.

Digital Hub™ has integrated a number of technologies that are required in any data science project. Its Jupyter notebook comes with all the required packages and libraries already installed, and complex technologies, such as Apache Spark, already integrated. There are also pre-built notebooks for some common problems in Oil and Gas, meaning that even if you are not a coder, you can get started with your analysis. (Take a look at this notebook on Rock Type analysis using well log data.)

There’s a lot to this second point that needs to be unpacked. It means:

  • No more Exploratory Data Analysis (EDA) redundancies: The data science process starts with data exploration. In the traditional approach, I had to get deep into coding to understand the data, and the same EDA steps had to be repeated for each new dataset. Not exactly the most exciting part of being a data scientist. Digital Hub™, on the other hand, comes with the option of open source EDA libraries, which create a comprehensive EDA report without a single line of code. With this capability, I can now focus directly on drawing insights from the data.