Why is there a focus on becoming Cloud Agnostic?
People often talk about being cloud native these days, and for good reason. Some of the benefits include:
- Better scalability and reliability
- Potentially less costly than maintaining a multi-AZ, multi-region data center.
NextBillion.ai has been cloud native from Day 0 of its journey: we own no physical data centers, depend only on cloud infra, use Kubernetes as our orchestration layer, and develop and deploy through CI/CD.
However, we also have unique challenges for which cloud native itself isn’t enough. We are a B2B business with many enterprise customers from different regions of the world. These customers have varying needs and demands:
- They may already have a preferred cloud service provider and want our solution hosted with that same provider, or even deployed into their existing cloud infrastructure (in the same VPC, to minimize response latency and preserve privacy).
- They might be calling our APIs from different geographical regions.
- They might not be on the cloud yet and want our solution in their physical data centers.
For example, we had a customer who required a 5ms p95 response latency for simple routing requests. The only way to meet this requirement was to perform a custom deployment in the AWS US-West region (where the customer was making API calls) to minimize propagation delay. Of course, we also had to code a robust and blazing-fast API in the first place.
Therefore, we cannot afford to be selective about the cloud. We need to be present where our customers want us and in the manner they prefer. To achieve this, we have to be CLOUD AGNOSTIC, on top of cloud-native.
And how do we do that?
Problem definitions first:
- We need to be able to deploy our single-tenant/customized solution on major cloud service providers on demand
- We need to deploy our multi-tenant/standard solution globally for customers who just want a working API as is
- We need to support on-premise deployments
These add up to a HARD problem, and as far as we know, no one else in the industry has done this before.
Also, don’t forget that we are a small startup with big goals but limited resources at our disposal. We neither have a billion dollars to spend on infrastructure nor a whole team of DevOps engineers to manage concurrent deployment projects. In fact, for an extended period of time, we had only 2 people managing all kinds of deployments for all customers in the whole world.
Hence, we need to have A SINGLE PIPELINE that works for all scenarios.
What does it mean for us?
- We cannot depend on things native to any single cloud service provider
- We cannot fight every different kind of software and hardware that the world throws in our faces
Luckily, we happily married Kubernetes!
Centered around Kubernetes, we have crafted a deployment pipeline that supports all of these customer scenarios.
We declare our alliance with Kubernetes (K8s) in the following ways:
- We support cloud service providers as long as they offer a managed Kubernetes service.
Luckily, this means we support the major cloud providers, including GCP, AWS, Azure, Alicloud and others
- We support on-premise deployments only through Kubernetes.
You might ask: what if I have bare Linux boxes but no Kubernetes? The answer is simple: we install a Kubernetes cluster (or clusters) on your machines first, and then deploy our solution.
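As a rough sketch, assuming kubeadm as the installer (one of several options; every value below is illustrative, not our production configuration):

```yaml
# Hypothetical kubeadm bootstrap config for a bare-metal control plane.
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
kubernetesVersion: "1.27.0"
networking:
  podSubnet: "10.244.0.0/16"   # must match the CIDR the chosen CNI plugin expects
---
apiVersion: kubeadm.k8s.io/v1beta3
kind: InitConfiguration
nodeRegistration:
  criSocket: "unix:///var/run/containerd/containerd.sock"
```

Running `kubeadm init --config` with a file like this on the first machine, then `kubeadm join` on the rest, yields a cluster that the same pipeline can target just like any managed one.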
There are, of course, more considerations:
For routing/navigation purposes, most of the data is stored in the memory and disk of the routing engine service. We occasionally need relational and KV databases, but most of them are effectively ephemeral and can be backed up and restored at any time. So the design is to build/host/backup/restore them in K8s itself.
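As a minimal sketch of this pattern, a KV store such as Redis can be hosted in-cluster with a StatefulSet whose volume claims make backup and restore straightforward (names and sizes below are hypothetical, not our actual manifests):

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: kv-store            # hypothetical name
spec:
  serviceName: kv-store
  replicas: 1
  selector:
    matchLabels:
      app: kv-store
  template:
    metadata:
      labels:
        app: kv-store
    spec:
      containers:
        - name: redis
          image: redis:7
          volumeMounts:
            - name: data
              mountPath: /data   # Redis persists its dump here
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 10Gi
```

Because the database lives entirely inside K8s, the same manifest works unchanged on any provider or on-premise.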
Both ephemeral and persistent storage are extensively used in our solution. Luckily, managed Kubernetes services can usually provision both on demand. For on-premise deployments, we use https://min.io/ to achieve the same.
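This is what keeps storage requests provider-neutral: a workload claims capacity abstractly and the cluster's default StorageClass satisfies it, whatever the backing disk is. A hedged example (name and size are hypothetical):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: engine-data          # hypothetical claim name
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 100Gi
  # storageClassName omitted on purpose: the cluster's default class answers,
  # so the same claim binds to EBS on EKS, PD on GKE, or local volumes on-prem.
```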
Logs and metrics collection
Because our solutions are decentralized and distributed across multiple continents, we work with third-party providers to stream logs and metrics to a centralized location for monitoring, alerting and problem tracing.
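For metrics, for instance, a regional Prometheus can ship its samples to a central store via `remote_write`; the endpoint below is a placeholder, not our actual provider:

```yaml
# prometheus.yml fragment (illustrative): forward all scraped metrics
# from this regional cluster to one centralized time-series store.
remote_write:
  - url: https://metrics.example.com/api/v1/write
    basic_auth:
      username: regional-cluster     # hypothetical credentials
      password_file: /etc/prometheus/remote-write-password
```

Logs follow the same shape: a lightweight in-cluster agent tails pod logs and forwards them to the same central location.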
Global presence for multi-tenant cloud-hosted solution
ANYCAST with location-based smart routing and outage handling. Basically, customers of the multi-tenant solution are routed to and served by the nearest cluster to achieve low latency and high throughput. This is fairly complex and deserves a blog post of its own.
Infrastructure as code
All infrastructure is expressed using K8s manifests, configurations and blueprints (a format processed by our in-house construction tools). We use GitOps and CI/CD to develop and roll out our solutions.
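To illustrate the GitOps shape of this (using Argo CD as a representative tool, not our in-house blueprint tooling; every name and URL below is hypothetical), a per-customer deployment can be declared as an Application that continuously reconciles a git path into a cluster:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: customer-acme        # hypothetical customer deployment
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://git.example.com/infra/manifests.git
    targetRevision: main
    path: customers/acme     # the customer's rendered manifests live here
  destination:
    server: https://kubernetes.default.svc
    namespace: prod
  syncPolicy:
    automated:
      prune: true            # delete resources removed from git
      selfHeal: true         # revert manual drift back to the git state
```

The payoff is that "deploy to a new CSP" reduces to pointing the same git state at a different cluster.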
Today we are capable of meeting our customers’ needs: we are where they want us to be (most CSPs, almost anywhere in the world, or even on-premise) and how they want us to be (global multi-tenant standard API or customized solution), all thanks to our CLOUD AGNOSTIC mindset and execution since Day 0.
There was one major unfortunate incident with one of our CSPs where everything deployed there suddenly stopped working. Given our uncertainty about the underlying issue and the difficulty of swiftly restoring service with that CSP, we deemed an immediate migration to another CSP necessary.
The initial migration was completed in a little over 10 minutes, restoring most of the production services. The remaining big-data replication took several hours but was ultimately successful. There are certainly things to improve, such as maintaining a fully functional replica in a different CSP with the ability to switch over during similar incidents.
Budget constraints, especially in the early days, pose a challenge too. Despite the downtime, we feel lucky that this battle-tested our cloud architecture. It is a testament to being truly CLOUD AGNOSTIC!