Abhay Paroha, an engineering leader with more than 15 years’ experience in leading product dev teams, joins SE Radio’s Kanchan Shringi to talk about cloud migration for oil and gas production operations. They discuss Abhay’s experiences in building a cloud foundation layer that includes a canonical data model for storing bi-temporal data. They further delve into his teams’ learnings from using Kubernetes for microservices, the transition from Java to Scala, and use of Akka streaming, along with tips for ensuring reliable operations. Brought to you by IEEE Computer Society and IEEE Software magazine.
Show Notes
Related Episodes
Patents
- Converting uni-temporal data to cloud based multi-temporal data
- Customized canonical data standardization, ingestion, and storage
Contact Info
- LinkedIn: @abhay-dutt-paroha-79766316/
Transcript
Transcript brought to you by IEEE Software magazine and IEEE Computer Society. This transcript was automatically generated. To suggest improvements in the text, please contact [email protected] and include the episode number.
Kanchan Shringi 00:00:18 Hello everyone. Welcome to this episode of Software Engineering Radio. Our guest today is Abhay Paroha. Abhay is an engineering leader with over 15 years of experience leading product dev teams in building scalable systems. Abhay has built up his expertise in cloud architecture and data integration with a focus on reliability and DevOps. In this episode, we are talking with Abhay about his experience building a cloud foundation for oil and gas production operations. It's so good to have you on today, Abhay. Welcome to the show. Is there anything you'd like to add to your bio before we get started?
Abhay Paroha 00:00:57 Thank you, Kanchan, for the nice introduction. I guess you covered everything, but I would like to start this talk with a brief introduction to the oil and gas industry. The oil and gas industry basically covers two kinds of companies: one is the oil and gas companies themselves, and the second is the companies that provide services to those oil and gas companies. I'm from the services domain, an upstream service company where we provide all kinds of services, including hardware, drilling of oil wells, and whatever software is needed to explore and extract hydrocarbons from beneath the earth.
Kanchan Shringi 00:01:38 Can we start with the business goal for the move to the cloud? What outcome was the business looking for?
Abhay Paroha 00:01:46 Sure. I am from the upstream production domain. In any upstream company there can be different segments: it could be drilling, or wireline, or production assurance. I am from production assurance, and our goal was to use the existing infrastructure, whatever was available, and expand on that infrastructure to build and deliver innovative solutions to the client. What does existing infrastructure mean? Before moving to the Cloud, we were using some production data management software. These were mostly on-prem desktop applications, and we were storing the data in SQL Server or Oracle and basically developing reports on those desktop apps. But these apps were limited in terms of scalability, because we were not able to run workflows that need a lot of historical data, like machine learning workflows. So we decided to move this data to the Cloud so that we can run advanced workflows where we can see historical records, maybe more than 25 years of data for any well, and we can develop new apps that give recommendations about any well in real time, or do well surveillance, or do forecasting. There are some other things like doing digital twin operations for any hardware or equipment installed in the field.
Abhay Paroha 00:03:25 So to achieve these kinds of advanced workflows, the first need was to have this data ready and collected on a Cloud platform so that we can expand on it using machine learning workflows or some recommendation agent. That was our main business goal.
Kanchan Shringi 00:03:42 Abhay, I'm sorry, what is a digital twin operation?
Abhay Paroha 00:03:45 Digital twins are an important concept for any upstream or midstream operation where we basically model a physical process or physical equipment. Whenever there is some equipment installed in the field where oil is producing, or there is a production facility, we can create a model of that physical equipment in the Cloud. So whenever we are ingesting data into the Cloud, we want to represent the physical equipment as a digital entity so that we can understand what the equipment behavior is when some incident happens. A digital twin basically helps us model this kind of equipment behavior in the digital world.
Kanchan Shringi 00:04:34 Can you give us an idea of the size of the data that we’re talking about here?
Abhay Paroha 00:04:39 In a typical small oil and gas company, that company may own, let's say, 500 wells, and an oil well can have oil flowing for maybe the last 20 years, or there are some wells that have been flowing or producing oil for maybe the last hundred years as well. So the data volume depends on the life of the oil well and the number of oil wells owned by the oil and gas company. For example, a major national oil company may have more than 10,000 wells, and because national oil companies have been producing oil for a long time, they may have data for more than 50 years. The second aspect of the volume is related to frequency. In any oil field, the frequency of data depends on the kind of operation we are performing. It could be second-based frequency data, and it could be weekly, daily, monthly, half-yearly, or yearly. These different aspects define the volume of the data, but the bare minimum could be second-based frequency data coming for more than a hundred properties from any oil well, and a simple multiplication will give you millions or billions of data points streaming from any field.
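As a rough illustration of that multiplication, here is a back-of-the-envelope Scala sketch using the figures mentioned above as assumptions (500 wells, ~100 properties per well, one reading per second), not real field numbers:

```scala
// Rough estimate of daily data volume for a small operator,
// using illustrative numbers from the discussion above.
object DataVolumeEstimate extends App {
  val wells             = 500L   // wells owned by a small operator
  val propertiesPerWell = 100L   // properties/sensors streaming per well
  val secondsPerDay     = 86400L // one reading per second per property

  val pointsPerDay = wells * propertiesPerWell * secondsPerDay
  println(f"Data points per day: $pointsPerDay%,d") // roughly 4.3 billion per day
}
```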
Kanchan Shringi 00:06:01 For second-based frequency data with millions or billions of data points, I do expect that it's really critical to have your Cloud deployment close to where the sensors are. What was your strategy in picking the Cloud regions that you're deploying to?
Abhay Paroha 00:06:19 Yeah, whenever we are doing a commercial agreement with a client, these clients are mostly oil and gas companies. Whenever they discuss the requirements, they always bring up the data residency aspect. For example, some clients in the Middle East and Asia are not very comfortable sending their data outside the Middle East, but there are some clients, like in South America, who are still okay with having their data in the USA. Based on the client requirement, we just set a clear expectation. For example, we always want to have the Cloud project or the Cloud cluster closer to the actual data source because it helps us ingest data as fast as we can. But if there is some limitation from the client side, and if they are okay with the ingest latency, then we still try to find which region is closest to that particular client.
Abhay Paroha 00:07:22 But the first requirement is to understand the data residency aspect because that is more important. For one of the clients in South America, the client was okay with having their Cloud project in the USA. For that, we were able to ingest the data with some latency gap. I mean it was minimal; it was not a huge consideration for the ingestion aspect. But yeah, mostly for the second-based data, or even for the higher frequency data, we always set a clear expectation with the client, and based on that we decide our ingestion-time latency and consumption-time latency.
Kanchan Shringi 00:08:01 What were the metrics you were using to track success as the project kicked off?
Abhay Paroha 00:08:07 As I said, our old application was mostly on-prem, and our main challenge was that we were only delivering a feature maybe once a year, and we were spending a lot of money on hardware costs, mostly around deploying these desktop-based applications to the client field. The major success criterion was: if we want to spend money on this Cloud infrastructure, the first thing in our mind was how can we deliver features incrementally, as soon as we build them, instead of relying on a slow software delivery cycle. The second thing was how can we customize this application easily per client. With the old application, we were mostly giving a fixed kind of workflow to all clients, but here we should be able to customize the workflow as per the client's need. Let's say some client is interested in seeing a certain dashboard in a certain way, we have that flexibility in place. So the major success criteria were to deliver features with agility and to be able to work the way the client needs and customize these workflows.
Kanchan Shringi 00:09:29 Put yourself back to when you got started with this project. What was the approach to getting the project kicked off? Were you evaluating Cloud providers? Were you forming a new team? Were you worrying about how you would train them? Was there a plan to create awareness and a mind shift about DevOps and other operational criteria in the team? Just give us an overview of what the discussions were.
Abhay Paroha 00:10:00 We started our Cloud journey back in the year 2016; it's almost eight years ago now. At that time our main goal was to develop something quick and showcase a demo to our client. Our main goal was to showcase some advanced workflow. By that time there was no application where we were doing performance advisory or giving recommendations to the client in real time. Our existing application was mostly generating a report maybe once a week or once a month, so there was a delay of at least one week in producing these reports. We were talking about the digital transformation journey across companies in our production assurance segment as well, so we decided to build something quick. The basic thing we wanted to demonstrate to any client was: let's say there is one asset field where, say, 80 or a hundred or fewer oil wells are available, how can we stream the historical data and the streaming operational data to the Cloud and do some real-time surveillance?
Abhay Paroha 00:11:15 For example, let's say there is some issue with some equipment's health, a user should be able to see that equipment health from any location instead of directly going to the oil field. To demonstrate this basic capability and feature of the Cloud, we decided to start the journey with one of the major Cloud providers, which was Google Cloud Platform. Why? Because Google Cloud Platform was a mandate across all companies, so we didn't have much choice at that time. We quickly used the managed services offered by Google Cloud; for example, at that time Google was providing App Engine for creating managed services. Basically we developed two pieces of software. One runs on premise to ingest the data into the Cloud, and the second piece of software basically stores the data in a Google Cloud hosted Bigtable instance. By just creating these two pieces, we were able to build a simple dashboard to show the equipment health.
Kanchan Shringi 00:12:20 Do you believe you got it right the first time, or were there any learnings midstream that led you to reevaluate the approach and refine it?
Abhay Paroha 00:12:28 The first time we didn't get it right, because I still remember the year 2017 when we installed the first version of the application for a client in South Asia. After installing our on-prem software, which is used to ingest the data streaming from the oil field to the Cloud, that piece of software stopped working after some days. Later the client contacted us: okay, we are not able to see any data on your surveillance app, what could be wrong? So they contacted us, but at that time we were new to the Cloud world. We were not aware of the different aspects of the Cloud. For example, there was no remote logging facility, there was no monitoring available, so we were clueless, and the only option was to ask the client to share the logs that were available on the on-prem machine. But this learning helped us understand: okay, this is not the way to work, because if there is some downtime we need to make a commitment to the client, instead of the client asking us, hey, your product is not streaming data.
Abhay Paroha 00:13:38 We needed to be more proactive in the approach, and we needed to change our strategy because we were mostly coming from the desktop world, so we were mostly relying on big log files that we could debug later. But no, this was not going to work in the Cloud. So as a team, we all got involved in some training; at that time we took training from Google Cloud Platform. We attended different trainings, mostly on-the-job training, and this training basically helped us learn the basic steps involved in building any Cloud-ready software. Later we applied these learnings. For example, the first version was only storing the logs on the on-prem machine, but later we built this capability into the software itself: along with sending the actual data, this piece of software can now send the log messages as well. It basically helped us remotely monitor the on-prem machine, and whenever it is down we can send a message from the Cloud and wake up that machine. This helped us learn that delivering features is not the only important part; the other part is operating as well. So later we expanded and put a lot of metrics in place, along with the capability of monitoring all this remotely.
Kanchan Shringi 00:14:57 So in the next section, maybe let's drill down into which Cloud provider you ended up with and what your learnings were in that area. Let's talk first about the microservices. You started with Google App Engine, is that what you mentioned?
Abhay Paroha 00:15:16 Yes. So we started with Google App Engine that is a managed platform provided by Google Cloud platform.
Kanchan Shringi 00:15:22 And did you stay there or was there more evolution to the architecture?
Abhay Paroha 00:15:28 We delivered our first microservice. At that time we had only one microservice, which was mostly responsible for ingesting the data received from this on-premise agent. For the first version, because we wanted to deliver something really quick, we built this service using the managed offering from Google Cloud, which was App Engine. But when we started adding more and more features to our software platform, we wanted to deploy this application to as many clients as we could, and at the same time we were receiving feedback from different oil and gas companies who were interested in our project. Not all companies were ready to store their data in Google Cloud, because some companies have their own private Cloud, some were using Azure, and other companies were using AWS. So a discussion started: okay, is there any way to make it Cloud agnostic? Because we were still in a very initial phase, could we shape our design in such a way that whenever we want to evolve in the future, we are able to support a different Cloud platform?
Abhay Paroha 00:16:47 At that time we started evaluating Kubernetes. Kubernetes was in a very initial phase then, but it was, and I think still is, the most commonly used platform for building Cloud-agnostic microservices. So to do a POC, we migrated this App Engine based service to a Kubernetes hosted microservice, and we learned a lot of things just by migrating this service from the managed platform to Kubernetes. There was a lot of learning involved, like how to run the Kubernetes cluster and how to scale the basic units of Kubernetes like pods and deployments. We just migrated our first service, and the pieces of software, build and release pipelines, and whatever scripts we developed in this migration journey formed the base for creating more new microservices. After that, for this data foundation platform, overall we delivered around 18 to 20 microservices using the same pattern we followed for our first microservice.
Kanchan Shringi 00:17:58 Okay, got it. So you quickly pivoted from just App Engine on Google Cloud to Kubernetes and Cloud agnostic. So hold that thought, we'll come back to the microservice classifications and design, but in the meantime can we get into the data layer? Can you start by talking about the types of data and the challenges?
Abhay Paroha 00:18:25 In any upstream production operation there are different kinds of data sets involved. For example, what kind of rock properties are available in a particular geographical location, that is called geological data. Other data could be the reservoir properties, or what kind of simulation model we are running to get the production rate. There is also what we call production data, which is mostly flow rates, pressure, temperature, and fluid composition. And the remaining data types are basically the streaming data coming in real time from the facility location. Overall, we categorize all these different kinds of data into two types. The first one is called structured data, which is basically static kind of data, and the second is time series data, because most of the data points coming from the sensors are time series in nature. And now we want to store this data in the Cloud.
Abhay Paroha 00:19:32 For time series data we evaluated Bigtable. Bigtable is a scalable cloud storage offering from Google Cloud. And for structured data we used another commercial database built by a company called Cognitect; that database is named Datomic. The most interesting part was the storage: we wanted our workflows to have all kinds of historical aspects so that we can run as-of-date workflows, for example, how a certain well looked on a certain date. To run these kinds of workflows, we need to maintain the history of all data. So we never deleted any data point or updated anything; we are always creating immutable records of all data. It helped us track the history and maintain auditing of all these data points. To achieve that, we used the bi-temporality aspect in Datomic and Bigtable.
Kanchan Shringi 00:20:39 Based on your explanation, I get that bi-temporal data has two time components associated with it. Can you contrast that with time series data?
Abhay Paroha 00:20:48 When I say time series data, I mean any sensor sending us a data value and a timestamp value. We are mostly interested in different characteristics of any oil well. For example, we are interested in knowing the pressure or temperature; we are also interested in knowing the oil volume or gas volume or water volume. When this data is streamed in the form of several data points, we further characterize the time series into sporadic and periodic time series. For example, let's say some user is running a well-test kind of workflow; that is sporadic in nature. Sporadic means if some event starts right now, then the second event starts, let's say, maybe after 10 days, and the third event could be after 20 days. So that is sporadic in nature: there is no fixed time interval between these events.
Abhay Paroha 00:21:49 And periodic is like a daily reading of a particular value. For example, there is a daily reading of pressure or a daily reading of temperature, so that is a fixed time interval. Once we categorize these into sporadic and periodic, what is important for us is that whenever we are running any calculation in the form of a recommendation engine, those calculations should understand the nature of the time series, I mean whether it is sporadic or periodic, because when we produce the recommendation we also need to understand if there is a data gap or something. We need to fill those data points with some value so that the user understands whether it is missing data or bad data. This characterization into sporadic and periodic basically helps us give the correct recommendation.
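To make the gap-filling idea for a periodic series concrete, here is a minimal Scala sketch. It is not the team's actual code; the `Reading` type, the fixed `intervalSeconds` assumption, and the use of `None` to mark missing points are all illustrative:

```scala
import java.time.Instant

// For a periodic series with an expected reading every `intervalSeconds`,
// insert explicit "missing" markers wherever a gap is detected, so that
// downstream calculations can distinguish missing data from real values.
final case class Reading(ts: Instant, value: Option[Double]) // None = missing

def fillGaps(readings: Seq[Reading], intervalSeconds: Long): Seq[Reading] = {
  val sorted = readings.sortBy(_.ts.toEpochMilli)
  if (sorted.isEmpty) sorted
  else sorted.zip(sorted.tail).flatMap { case (a, b) =>
    val gaps = Iterator
      .iterate(a.ts.plusSeconds(intervalSeconds))(_.plusSeconds(intervalSeconds))
      .takeWhile(_.isBefore(b.ts))
      .map(t => Reading(t, None)) // mark each expected-but-absent point
      .toSeq
    a +: gaps
  } :+ sorted.last
}
```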
Kanchan Shringi 00:22:39 For structured data, why did you specifically use Datomic as the database?
Abhay Paroha 00:22:44 Yes, because our static data was not time series in nature, so Bigtable was not an obvious choice to store it. We were looking for a database with default support for bi-temporality. Datomic is one of the databases whose storage engine supports the temporal feature by default. Whenever you store anything in Datomic, it is stored in an entity-attribute-value tuple format, and it always captures the two aspects of bi-temporality. The first one is the valid time: valid time means when a certain event was valid in the real world. For example, let's say I'm doing some drilling operation; that drilling operation happened at a certain valid time. The second aspect of bi-temporality is transaction time: transaction time means when the data is actually stored in that particular database. Datomic is one of the databases that has these characteristics built in, so it provides support for running temporal queries.
Abhay Paroha 00:23:56 What we did to store the static data, let me give you one example. Let's say some asset manager is looking for the total rollups of all oil production on a certain field. That asset manager is basically looking at more than 500 oil wells and checking the daily production report as of today. In the Datomic world, we are storing the hierarchy of this asset. For example, if the asset manager is in North America and he's looking into Texas, he's looking at a certain city. All these hierarchies are basically stored in a versioned format in Datomic. So whenever someone is using the APIs built over Datomic, any user can say, okay, give me the report as of today or give me the report as of last year, and it will give you the hierarchy back as of that particular date. That's why Datomic was chosen, due to the temporal aspect.
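As a rough illustration of this kind of "as of" read, here is a minimal Scala sketch against Datomic's Java peer API. It is not the project's actual code: the connection URI and the `:asset/name` attribute are hypothetical stand-ins for the asset-hierarchy schema described above, and it assumes the Datomic peer library is on the classpath:

```scala
import java.util.Date
import datomic.{Connection, Peer}

// Query the asset hierarchy exactly as it was recorded at a past date.
object AsOfExample extends App {
  // Hypothetical dev-transactor URI and database name.
  val conn: Connection = Peer.connect("datomic:dev://localhost:4334/assets")

  // Database value as it existed on 1 January 2024 (transaction time).
  val dbAsOfNewYear =
    conn.db().asOf(Date.from(java.time.Instant.parse("2024-01-01T00:00:00Z")))

  // :asset/name is a hypothetical attribute standing in for the real schema.
  val names = Peer.q("[:find ?name :where [?e :asset/name ?name]]", dbAsOfNewYear)
  names.forEach(row => println(row.get(0)))
}
```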
Kanchan Shringi 00:24:59 About Bigtable: can we take an example and explain the data structure and how that is different from a regular structured database?
Abhay Paroha 00:25:08 Yeah, sure. Bigtable is basically one of the column family databases. Column family means internally it has the structure of a multi-level map; you can think of it like a three-dimensional map. By map, I mean a hash map like we use in any programming language. The second important feature is that it is a column family database. When I say column family, it has the concept of a set of columns, and that set of columns is what is called a column family.
Kanchan Shringi 00:25:45 Is the column family aspect relevant to the time series data or?
Abhay Paroha 00:25:49 Yes, it is relevant for time series data. We use this concept to maintain bi-temporality. For example, let's take the example of pressure data. Let's say there is one well, that well is giving one data point every hour, and we are ingesting that data point from on-prem to the Cloud. Let's say I started my ingestion at nine o'clock in the morning, so there is one timestamp at nine o'clock, then another timestamp at 10 o'clock, and after that for the whole day I have at least 24 data points. This is the physical timestamp. Then there is a second timestamp that I'm assigning: when the data point reaches the Cloud ingestion service, I assign another timestamp, which we call a version timestamp. So now we have these two timestamps. When I'm storing these data points in Bigtable, I'm using a column family.
Abhay Paroha 00:26:55 And in the column family, the column name is the version timestamp, and the rows are basically the physical timestamp. Whenever some production engineer is interested in seeing the pressure value for an as-of-date workflow: as of date means clients are always interested in knowing, okay, I want to generate the report as of first January, or I want to generate the report as of first February. This version timestamp that I'm storing in the column family is basically helping us maintain the as-of-date workflow aspect, and the physical timestamp that I'm storing as the row key in Bigtable is basically helping us index it. Whenever I'm running any query, I'm considering both aspects, the column family and the row key, so it helps us return the output for queries like: give me the pressure data points from T1 to T2, a start date and an end date, as of a given date. So by combining these two aspects, it helps us return the bi-temporal data.
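To visualize the layout being described, here is an illustrative Scala sketch of the bi-temporal scheme: physical (sensor) time in the row key, version (ingestion) time as the column qualifier. This is not the actual production schema; the key format, family name, and helper names are assumptions made for the example:

```scala
import java.time.Instant

// One stored cell: row key carries the physical timestamp,
// the column qualifier carries the version (ingestion) timestamp.
final case class BiTemporalCell(
    rowKey: String,          // e.g. "well-42#pressure#2024-01-01T09:00:00Z"
    columnFamily: String,    // e.g. "versions"
    columnQualifier: String, // version timestamp: when the point reached the cloud
    value: Double)

def toCell(wellId: String, property: String,
           physicalTs: Instant, versionTs: Instant, value: Double): BiTemporalCell =
  BiTemporalCell(
    rowKey          = s"$wellId#$property#$physicalTs",
    columnFamily    = "versions",
    columnQualifier = versionTs.toString,
    value           = value)

// An "as of" read: keep, per row (physical timestamp), only the latest
// version whose qualifier is not after the requested as-of date.
def asOf(cells: Seq[BiTemporalCell], asOfDate: Instant): Seq[BiTemporalCell] =
  cells.groupBy(_.rowKey).values.flatMap { versions =>
    versions
      .filter(c => !Instant.parse(c.columnQualifier).isAfter(asOfDate))
      .sortBy(_.columnQualifier)
      .lastOption
  }.toSeq
```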
Kanchan Shringi 00:28:06 I'd like to refer listeners to Episode 623, Michael J. Freedman on TimescaleDB. That episode deals with time series data and the optimizations for storing it in columnar formats. So back to Bigtable. Bigtable, I understand, is part of Google Cloud, but what happened when you decided to look at other Cloud vendors as well?
Abhay Paroha 00:28:37 Our first version we built using Bigtable on Google Cloud, and when we migrated to Azure for some clients, Cosmos DB was the obvious choice in Azure to store this time series data. We also helped another client that had its own private Cloud or data center; to help that client, we basically used Cassandra. Cassandra has a similar nature and characteristics for storing time series data. And to achieve code compatibility when moving from one Cloud provider to another, we basically use facade APIs so that we can abstract the way we query the database. With minimal changes, by changing only the facade layer, we can easily deploy the same version of the code against different databases.
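A hypothetical sketch of that facade idea in Scala: the services depend only on a trait, and a per-store implementation (Bigtable, Cosmos DB, Cassandra) is selected at deployment time. The trait, method names, and types are illustrative, not the team's actual API:

```scala
import java.time.Instant
import scala.concurrent.Future

final case class DataPoint(physicalTs: Instant, versionTs: Instant, value: Double)

// The facade the rest of the code talks to; only implementations of this
// trait change when the deployment targets a different cloud or database.
trait TimeSeriesStore {
  def write(wellId: String, property: String, points: Seq[DataPoint]): Future[Unit]
  def read(wellId: String, property: String,
           from: Instant, to: Instant, asOf: Instant): Future[Seq[DataPoint]]
}

// One implementation per backing store, e.g. Bigtable.
final class BigtableStore extends TimeSeriesStore {
  def write(wellId: String, property: String, points: Seq[DataPoint]): Future[Unit] =
    ??? // would call the Bigtable client here
  def read(wellId: String, property: String,
           from: Instant, to: Instant, asOf: Instant): Future[Seq[DataPoint]] =
    ??? // scan the physical-time row range, then filter by version timestamp
}
```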
Kanchan Shringi 00:29:37 Okay, let's get back to the mid tier now. We've established that you are using Kubernetes to be Cloud agnostic, and we've talked about the databases used. Specifically in the area of actually designing the microservices, what was your approach? How did you decide the scope of a microservice and the order in which you would move it to Kubernetes and to the Cloud?
Abhay Paroha 00:30:03 Yeah, our first piece of software and first delivered microservice was mostly on the data ingestion part, which mostly covered whether data was ingested correctly or not; that started as a POC and we evolved it into the first microservice delivery. But later there were other domains of this problem. For example, how can any user say, okay, I'm only interested in this particular well, ingest data for this particular well and these properties of the well for, let's say, the last 50 years. So we needed some mapping kind of facility so that the user can map these properties using a Cloud hosted app. The second aspect was, okay, now we have the ingestion domain and the mapping domain available. The third piece was, okay, how to effectively create APIs so that our clients can use these APIs and build their own dashboards.
Abhay Paroha 00:31:05 So the third scope was how to create tier-one services over this time series data. After that, we wanted to create some workflows ourselves as well. The first part was delivering the APIs only, so that clients could use them in their dashboards. The second was, okay, what workflows can we create on our own? So the fourth scope was what kind of services we could build ourselves; for example, we built one microservice to do forecasting for an oil well, and other microservices for doing advisory or recommendations for an oil well. These were mostly the scopes, and the other criterion for delivering a microservice was how effectively we could deliver an end-to-end feature. Those were mostly scoped into smaller microservices, and we also had a few stateful, small monolithic applications that depended on different smaller microservices. The monolithic applications' delivery cycle was different from the stateless microservices'. So the basic criteria were whether we could split by domain or split by delivery frequency.
Kanchan Shringi 00:32:22 Did you have to build any infrastructure components for the pieces that you deployed to Kubernetes?
Abhay Paroha 00:32:27 For the Kubernetes service, we always relied on the Cloud-offered Kubernetes platform. We never tried to create our own VMs and run our own Kubernetes cluster, because there is a lot of complexity involved in it. Kubernetes has its own networking mechanism, and there are lots of complicated pieces involved. So we always worked with the Cloud provider hosted Kubernetes platform: for example, in Google Cloud it is GKE, and in Azure it is AKS. We always relied on these managed Kubernetes offerings. For the infrastructure part, we only deployed a few things ourselves, like when we needed a distributed cache or wanted to run some database in the Kubernetes cluster; in those cases we used our own expertise and deployed these databases using StatefulSets in Kubernetes. Otherwise, we were mostly relying on the infrastructure coming from the default Cloud provider.
Kanchan Shringi 00:33:33 Okay. So you mentioned a distributed cache. Was that used in deploying to all the Cloud providers? Was there a specific cache that you used there?
Abhay Paroha 00:33:43 When we started our journey with Google Cloud, Google Cloud had default cache offerings for Memcached and Redis. We developed our first version with Redis, but after operating that Redis for a few months, we found that the total cost of operating it was a little higher than expected, and we didn't want to invest a lot of money in the caching storage because most of our storage cost was going to Bigtable, which was our main storage. So we decided to run our own Redis cluster on Kubernetes. To achieve that, Kubernetes has an option to run a StatefulSet, and Redis by default provides a way to run a Redis instance in high availability mode as well, where we can run a three-node Redis cluster to achieve high availability. So it's a matter of using a StatefulSet from Kubernetes, which helps us provide persistent storage and run a high availability Redis cluster on Kubernetes pods. The main gain was, okay, now the cost is reduced, but the challenge was that we needed to learn about all the complexities of running a StatefulSet using persistent storage, because it's not a straightforward thing.
Kanchan Shringi 00:35:07 Can you explain StatefulSets?
Abhay Paroha 00:35:09 Kubernetes has this concept called a StatefulSet. A StatefulSet is a feature of Kubernetes for whenever we want to deploy a stateful application, like a message queue in the Kubernetes cluster, or a cache, or some database. Kubernetes provides a StatefulSet resource, or we can say artifact, where you can define your stateful process. When I say stateful, there is some persistent storage involved as well. Whenever I want to run some database in the Kubernetes cluster, by default Kubernetes deployments or pods are ephemeral in nature; I mean whenever they restart you may lose your data. So those things are not suitable for a database. With a StatefulSet, what you can do is attach persistent SSDs or hard disks.
Abhay Paroha 00:36:14 So let's say I want to run my Datomic database in Kubernetes. What I will do is apply the Datomic installation YAML file in the form of a StatefulSet, and I will also assign some SSD hard disk. So even if my Kubernetes pod or deployment restarts, I will not lose the state. Another important aspect is that it assigns a unique identification number to each pod, so whenever a pod restarts due to some network issue or some other issue, or we are applying some patch or upgrade, it automatically gets assigned the same number after the restart as well. These two concepts, persistent storage in the form of a persistent volume and persistent volume claim, and this unique number assignment, help it maintain the database state.
Kanchan Shringi 00:37:12 What about the language that these microservices were written in? Can you cover that?
Abhay Paroha 00:37:18 When we started writing our services, we were evaluating different technology choices. In our team we had Java expertise available; we had many software engineers with expertise in Java. As a Java framework we evaluated Spring Boot, which was available at that time, but it was not meeting our scalability needs. For example, we wrote the first version of the ingestion service using Spring; it was not meant for running a streaming kind of ingestion. So we started looking for other choices, and we evaluated Scala. Scala is a functional programming language, but the reason to choose Scala was not functional programming per se; we were mostly interested in Akka. Akka is basically an actor-based framework, a very well-known framework for building scalable and fault-tolerant applications. So we started evaluating Akka as a framework.
Kanchan Shringi 00:38:22 Could you define actor-based framework?
Abhay Paroha 00:38:25 The actor-based framework is not a new concept. It has been in the industry since the 1970s, or '74 I think. An actor is basically the logical or minimal unit you work with whenever you are writing code. For example, in an object-oriented language we use objects and create classes; in an actor framework, we are basically creating actors whenever we are writing code. For example, let's say I'm writing a simple service to consume data from some database. What I will do is create a controller layer for the APIs, and after that controller layer I will create another business layer. This business layer will be created using Akka actors. Akka provides a notion of an actor class or interface, and whenever you're writing code you can extend your classes from that actor. Internally, whenever your code is executing, it will create a small unit of execution and assign a mailbox to it. Whenever you are interacting with any running actor instance, you will always interact using message passing. One actor has no tight coupling with any other actor, so they can scale independently by just passing messages from one actor to another.
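A minimal Akka classic-actor sketch of the idea being described (illustrative names only, not the team's code): each actor keeps its own state, processes messages from its mailbox one at a time, and is addressed purely through message passing:

```scala
import akka.actor.{Actor, ActorSystem, Props}

final case class PressurePoint(wellId: String, value: Double)
case object GetCount

class PressureCounter extends Actor {
  private var received = 0 // state is private to the actor; no locks needed
  def receive: Receive = {
    case PressurePoint(well, v) =>
      received += 1
      println(s"got pressure $v for $well (total $received)")
    case GetCount =>
      sender() ! received // reply by sending a message back to the asker
  }
}

object ActorDemo extends App {
  val system  = ActorSystem("demo")
  val counter = system.actorOf(Props[PressureCounter](), "pressure-counter")
  counter ! PressurePoint("well-42", 1875.2) // fire-and-forget message passing
  counter ! PressurePoint("well-42", 1874.9)
}
```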
Kanchan Shringi 00:39:55 How does this actually coexist with Kubernetes?
Abhay Paroha 00:39:58 Yeah, Akka has very good integration with Kubernetes. It has its own clustering mechanism available, called Akka Cluster, but because we were using Kubernetes as the platform, we didn't go down that route; that route is specifically for people who want to run their own on-prem data center. Since we were using Kubernetes as the platform, we started looking at the other features of Akka. Our main application was the ingestion part, and the other one is obviously the consumption part. For ingestion, our main goal is to have good freshness of data points. For example, if some data point came in today at 12 o'clock, within 10 seconds I want to store that data point in our Cloud storage. We were looking for some library or feature where we could achieve streaming ingestion and scale it.
Abhay Paroha 00:41:01 Akka provides Akka Streams, and one important point about Akka Streams is that it basically handles asynchronous data ingestion. The other important point is that it handles backpressure as well. Backpressure means that when we are running an ingestion pipeline, there is a producer that is producing data points and a consumer that is basically consuming these data points and storing them. Backpressure means, let's say the consumer is still busy and still processing; it automatically sends a signal to the producer to stop. When we built our ingestion services, our producer was a messaging system, in our case Cloud Pub/Sub. It keeps producing, and Akka Streams is basically taking the data points in batches, and whenever our Bigtable consumer part is still busy writing, there is no need to do anything: it automatically sends a signal to the producer, okay, you hold on, I'm busy right now. So it basically helped us maintain scalability within a service as well; it minimizes the thread consumption and memory footprint. That was the main reason to use Akka Streams.
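A minimal sketch of that batching-with-backpressure shape in Akka Streams. The source, point type, and `writeBatch` stand-in are assumptions for illustration; in production the source would be the Pub/Sub subscription and the write would go to Bigtable. Because `mapAsync` bounds the number of in-flight writes, a slow storage stage automatically slows the upstream source:

```scala
import akka.actor.ActorSystem
import akka.stream.scaladsl.{Sink, Source}
import scala.concurrent.Future
import scala.concurrent.duration._

object IngestionSketch extends App {
  implicit val system: ActorSystem = ActorSystem("ingestion")
  import system.dispatcher

  final case class Point(wellId: String, ts: Long, value: Double)

  // Stand-in for the real storage write (e.g. a Bigtable mutation batch).
  def writeBatch(batch: Seq[Point]): Future[Unit] =
    Future { println(s"writing ${batch.size} points") }

  Source(1 to 100000)                                       // stand-in for the Pub/Sub source
    .map(i => Point("well-42", System.currentTimeMillis(), i.toDouble))
    .groupedWithin(500, 1.second)       // batch by size or time, whichever comes first
    .mapAsync(parallelism = 4)(writeBatch) // at most 4 outstanding writes; backpressure beyond that
    .runWith(Sink.ignore)
    .onComplete(_ => system.terminate())
}
```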
Kanchan Shringi 00:42:19 You came up with a canonical data model, and you talked about the challenges in ingesting data and the methodologies. Now the data is in; how did you use it?
Abhay Paroha 00:42:29 So now we have all of the data stored, and everything is stored in Datomic and Bigtable. The second aspect is using the APIs to consume this data and run specific production optimization workflows. When I say production optimization workflow, our end goal was to give some recommendation and action to our end users. Our end users are mostly asset managers or production engineers, so that they can take corrective action on any incident happening in the oil field. For example, let's say our data is stored and now I want to give a recommendation if some well is producing more water than oil. What I do in my workflow is run a calculation engine. This calculation engine is basically listening to the events whenever we are storing data into Bigtable, and after that it runs some calculations over that data; these calculations are mostly time series in nature, and it automatically generates a recommendation: okay, because your water flow has increased due to these conditions, you can take this preventive action. Whenever we are displaying the results of this calculation engine on some UI, the production engineer looking at the screen can see, okay, the engine is recommending this kind of action. So it helped them make the decision correctly: okay, now I can inform someone on the field, you can try this, so that it will minimize this kind of incident.
Kanchan Shringi 00:44:16 Did you meet your goal of scaling? I think you planned to go up to 20,000 oil wells.
Abhay Paroha 00:44:22 Yeah, so we tried different things. As I covered, for the ingestion side we tried to achieve auto-scalability. First we tried manual scaling, then auto-scaling. With auto-scaling using KEDA and custom metrics, we were able to achieve the ingestion scalability aspect, and we were able to handle one client with 8,000 wells and more than 50 properties per well, streaming 1.2 billion data points covering around 25 years. That is for the ingestion part. For the consumption part, we wanted to deliver our API results with minimum latency. There we used the default Kubernetes autoscaling, which scales based on CPU percentage or memory usage, so we were able to achieve the consumption scalability by monitoring those metrics. And within a microservice, to achieve scalability, I mean to minimize the threads and reduce the memory footprint of services, Akka also provides a monitoring mechanism.
Abhay Paroha 00:45:36 For example, let's say I have one microservice that is basically reading the data from Bigtable. When I'm writing the service using Akka actors, I can enable metrics on those Akka actors. These metrics are basically about what my actor is involved in. For example, if some message is waiting in the mailbox of an Akka actor, I can create a metric on that, or if some actor is consuming more CPU than expected, I can create a metric on that. By capturing all this metering and monitoring, I was able to tune the thread count and the different thresholds regarding the min and max size of the message queues within an actor. So it helped us achieve scalability within the service. We covered the different aspects of scalability, whether within the service, at the integration points, or in the storage.
Kanchan Shringi 00:46:35 You mentioned KEDA for auto-scaling. Can you talk about what KEDA is? What is its full form?
Abhay Paroha 00:46:41 KEDA stands for Kubernetes Event-Driven Autoscaler. We started using KEDA after trying a few other solutions for autoscaling. I mean, we started with manual scaling, where we manually scaled the number of Kubernetes pods. For example, let's say we deploy our application to some client, and some sensor is streaming a high volume of data one day, but on other days it is not streaming a high volume of data. What we did first was fix the number of ingestion pods in Kubernetes; for example, I want to run my ingestion pipeline with five instances. But the problem with that approach was that maybe sometimes all the pods are occupied, but at other times I'm just paying the money and nothing is streaming. So as a second approach, we tried the custom metrics autoscaler that is available with Kubernetes. By default, Kubernetes provides CPU-based or memory-based autoscaling.
Abhay Paroha 00:47:48 But custom metrics means you can write any custom metric, and based on the state of that metric, Kubernetes will autoscale. For our case, the custom metric was how many data points are residing in the queue, waiting to be written to the time series store. We used this custom metrics autoscaler, but there was still one problem with this approach. It solved the manual scaling problem, and there was no longer a need to always keep five instances of the service, but for the Kubernetes autoscaler to work, it always needs at least one instance of the service running, even if that service is not doing anything. So we went one step further and started using the Kubernetes Event-Driven Autoscaler, which is a Cloud Native Computing Foundation open-source project. The beautiful feature of KEDA is that it also provides custom metric based autoscaling, but there is no need to run even one instance of your service: whenever it is monitoring, it can scale from the zero state to the desired number of instances.
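As a toy illustration of the arithmetic such custom-metric autoscaling boils down to (heavily simplified; the real HPA and KEDA controllers add tolerances, cooldowns, and min/max bounds, and the queue-depth metric and target here are assumptions):

```scala
// Desired replicas from a "queue depth" metric and a per-replica target.
// With KEDA-style scaling, the count is also allowed to reach zero when idle.
object ScalingMath extends App {
  def desiredReplicas(queueDepth: Long, targetPerReplica: Long, allowZero: Boolean): Long =
    if (queueDepth == 0 && allowZero) 0
    else math.max(1L, math.ceil(queueDepth.toDouble / targetPerReplica).toLong)

  println(desiredReplicas(queueDepth = 12500, targetPerReplica = 5000, allowZero = true))  // 3 replicas
  println(desiredReplicas(queueDepth = 0,     targetPerReplica = 5000, allowZero = true))  // 0 (scale to zero)
  println(desiredReplicas(queueDepth = 0,     targetPerReplica = 5000, allowZero = false)) // 1 (plain HPA floor)
}
```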
Kanchan Shringi 00:49:04 Let’s switch tracks now and start talking about the migration of customers from your on-premise solution to the Cloud. How did that go?
Abhay Paroha 00:49:14 We had two kinds of customers: existing customers who were already using our proprietary software, and new customers who wanted to onboard onto our production operations platform. The customers who were already using our on-prem software were mostly using a desktop version of the application. To migrate these clients to the Cloud workflows, the main challenge was ingesting all their data; for that we used, as I covered already, the autonomous agent to ingest this data to the Cloud, and after that they started using the Cloud-based workflows. The second kind of client was new, and they had their data in their own data center or with a data provider. For them, we built a Cloud-specific autonomous agent and directly ingested their data from their data center or data provider into our Cloud. And a third kind of customer was not interested in moving their data outside their own Cloud, so we deployed our services in their data center only. This helped us cover different kinds of clients: on-prem to the Cloud, those interested in using our Cloud provider, and those who were not.
Kanchan Shringi 00:50:40 Is there something you would do differently if you had to do it all over again?
Abhay Paroha 00:50:44 I think, over the last five or six years since we started, we learned a lot. Initially, when we started our journey back in 2016, we were new to the Cloud world, I mean in how to operate it. So we learned a lot. Earlier, when we started operating for a client, we didn't use anything for monitoring, or at least not good monitoring, and we were not using SRE principles, Site Reliability Engineering principles. But over the years we improved a lot. One good learning we have right now, which I particularly use with my team: whenever we are planning some feature or new product for a new client, or maybe we are writing some new feature for an existing client, in the planning phase itself we consider the effort for the operational aspects as well. I mean, we are not only thinking, okay, I need to write the code and I'm done.
Abhay Paroha 00:51:45 Now we consider, okay, how to automate it, how to integrate it properly with CI and CD, what we need to monitor, and how to push the code without breaking anything. We consider all these aspects now, which we were not considering earlier. It basically helped us change our mindset about how to push code more frequently without breaking anything. Once we have all of this captured, even in our sprint planning meeting or any team meeting, people start thinking about different things. They're always more cautious: even if I am pushing one line of code, I need to make sure I'm not breaking anything.
Kanchan Shringi 00:52:32 Would you have any other advice to other folks that are starting on their Cloud journey?
Abhay Paroha 00:52:37 Yeah, I have a few pieces of advice based on my learnings over the years. What I understand is that the Cloud is not a silver bullet to solve everything. Most people say, okay, you will get infinite scalability and everything, but based on my experience, the scalability is there, but you need to manage the cost as well, because cost is what ultimately determines whether the business earns revenue and can scale sustainably. If I keep writing my services or code in a way that is always dependent on the Cloud, and I'm not focusing on good quality software or writing code in a good way, then we'll end up spending more and more money; we tried it and we failed. That's why we learned that whenever I'm writing anything, I need to think about the cost also, because if I'm spending all the money on operating the Cloud or operating our services, I'm not generating any revenue.
Kanchan Shringi 00:53:38 When you onboard a new customer, how much of the existing infrastructure do you leverage versus provisioning additional components?
Abhay Paroha 00:53:46 We mostly spend our time when we are writing some new service for a specific client need. For example, let's say one client has been using our software for the last three years, and for the functionality we delivered to that client, they're mostly using the non-machine-learning workflows. But the same client started asking for some other workflow: okay, I want to do well intervention screening. Well intervention is a different concept; we didn't use it in the past, so we wanted to write a machine learning workflow. So we wrote a different microservice and provisioned that, and for running the machine learning workflow, we used another enterprise software to achieve that functionality for that specific client. We only provisioned the machine learning or MLOps pieces specific to that client, plus the other microservice we built. So it depends on the client's needs; otherwise, most things we are able to achieve with the existing services only.
Kanchan Shringi 00:54:52 I see. So it sounds like you are still adding features to round out your solution, and my question was more, I think, geared towards potentially when you actually become SaaS, if that's your goal. In terms of when new customers onboard, do you actually go ahead and provision additional components and infrastructure for them, or are you more multi-tenant, leveraging things that are already provisioned and sharing infrastructure? Is there some vision in that area?
Abhay Paroha 00:55:21 Yeah, I think it's a good point. The existing application we built for this data ingestion and consumption part is mostly single tenant only. Single tenant means that for one client, the tenant is defined at the Cloud project level. But the new workflows we are building right now, like the machine learning workflow example I just gave you, are basically multi-tenant in nature. For the last one or two years, we have been building multi-tenant applications over this data foundation layer we built over the years. So this data foundation layer is still single-tenant, but the workflows we are running over this data foundation are multi-tenant. Whenever we are onboarding a new customer for a multi-tenant app, the provisioning is mostly around provisioning a tenant and the infrastructure associated with it. So yeah, the new workflows are multi-tenant, but the old data foundation is still single tenant only.
Kanchan Shringi 00:56:27 As we start to wrap up here, Abhay, can you talk about how the product definition process has changed as part of your move to the Cloud?
Abhay Paroha 00:56:36 In the oil industry particularly, the way we build products is different from other companies. We are not directly talking to external clients; we have domain expertise in our own team. For example, I have production engineers and petroleum engineers on the team. Whenever we are writing new software, the requirements come from these domain people; we call them product champions and product analysts. These people basically understand what is going on in the oil and gas market. For the last two or three years, the market has been moving more towards building SaaS and everything, which was not the case until 2020 or so. But now I think they are changing the focus: they want to efficiently utilize whatever we built, and SaaS and multi-tenancy are one way to achieve it. Earlier they were not focusing on cost optimization or other things, but now they understand, and they learned over the years. So now they're interested in offering these technical capabilities as a SaaS offering as well.
Kanchan Shringi 00:57:46 Got it. Sounds like besides all the technical learnings, you also had business learnings from this whole experience. It was great to have you here, Abhay. Certainly good luck on your journey going forward; this was extremely useful and helpful insight. I hope our listeners enjoyed it as well.
Abhay Paroha 00:58:03 Thanks for having me. Thank you.
[End of Audio]