Evolution of Infrastructure

The current ops model has been shaped by 2 very important concepts: virtualization and containerization.

Application Deployment Evolution

Bare Metal   

In the left column, we have the “old” way of deploying applications. We simply have some hardware resources ( CPU, RAM, Hard Drive, etc. ) which can be accessed using an Operating System. On top of that Operating System, we would run our application, which is nothing more than a process. This model is not really fancy or complex, but it is extremely time-consuming and expensive. Why? Well, in case we need to deploy a new application, we would first have to build a brand new machine. Also, choosing the right CPU, RAM, etc. is always painful. How can we properly dimension a machine when we don’t even know how many resources our application will require?

Virtual Machines

Having dedicated hardware for each application was extremely expensive because the machines were almost always over-dimensioned. Many companies figured out that there must be a more effective way to run applications, so starting in 2006, the adoption of Virtual Machines skyrocketed. Using software called a Hypervisor, we are now able to share the same hardware resources effectively, but in an isolated manner. Applications no longer have to run under the same operating system; they can be nicely organized based on the operating system and / or resources they need.

Containers

Virtual Machines are a great improvement compared to the “Bare Metal” style of deployment, but they still have a downside. Each Virtual Machine needs to run a separate Operating System. That’s a lot of overhead, considering that you only want the virtual machine in order to run your application. Containers to the rescue! Containers came as a solution to the overhead introduced by virtual machines: they offer almost the same level of isolation without a separate Operating System for each application.

Static vs. Dynamic Infrastructure

There are 3 areas in which I would like to compare these 2 terms:

  1. Provisioning
    • Static Infrastructure: do we have a new application? Then we need a new machine; in order to properly isolate the applications, we create dedicated environments for them;
    • Dynamic Infrastructure: we reuse the existing infrastructure as much as possible and use other means for proper isolation; also, we can scale the existing fleet up or down based on current demand;
  2. Accessibility
    • Static Infrastructure: how can we access an application which lives on another machine? Using the host IP of the machine, of course; since we know that the application will not be deployed to another machine, we can safely assume that it will “always” be accessible using that IP;
    • Dynamic Infrastructure: the same application may be deployed to different machines at different points in time, so we need a more intelligent way to access that application; to do that, we can use services that dynamically route our requests to the desired application;
  3. Security
    • Static Infrastructure: we know for sure which applications need to communicate, so we can easily deduce which machines need to talk to each other; knowing the machines means that we know the IPs involved, so we can set up firewall rules that allow those 2 machines to communicate;
    • Dynamic Infrastructure: when a request reaches an application, we can’t really tell whether it came from a trusted source, so we first need to determine the identity ( e.g: access tokens ) of the application / user that made the request; based on that, we can allow or deny access to the requested resources;
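
To make the Security comparison a bit more concrete, here is a minimal Python sketch that contrasts the two checks. Everything in it is an assumption made for illustration: the IP address, the signing key, the token format ( `<identity>.<signature>` ) and the function names are hypothetical, and a real system would more likely use an established standard such as signed JWTs or mutual TLS.

```python
import hashlib
import hmac
from typing import Optional

# Static approach: trust a caller because of where it lives.
ALLOWED_IPS = {"10.0.0.5"}  # hypothetical address of a known machine

# Dynamic approach: trust a caller because of who it proves to be.
SECRET = b"demo-secret"  # hypothetical signing key; load it from a secret store in practice


def allowed_by_ip(source_ip: str) -> bool:
    return source_ip in ALLOWED_IPS


def issue_token(identity: str) -> str:
    """Create a signed token of the form '<identity>.<signature>'."""
    signature = hmac.new(SECRET, identity.encode(), hashlib.sha256).hexdigest()
    return f"{identity}.{signature}"


def identity_from_token(token: str) -> Optional[str]:
    """Return the caller's identity if the signature is valid, otherwise None."""
    identity, _, signature = token.rpartition(".")
    if not identity:
        return None
    expected = hmac.new(SECRET, identity.encode(), hashlib.sha256).hexdigest()
    if hmac.compare_digest(signature.encode(), expected.encode()):
        return identity
    return None


token = issue_token("service-a")
print(identity_from_token(token))               # service-a
print(identity_from_token("service-a.forged"))  # None
```

The IP check stops being useful as soon as the application moves to another machine, whereas the token check keeps working wherever the caller happens to run.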

Static vs Dynamic Infrastructure

How to achieve Dynamic Infrastructure?

Dynamic Infrastructure is actually represented by a set of concepts and target areas so there isn’t a “one size fits all” solution. Achieving a completely dynamic infrastructure can be done incrementally by targeting individual areas from the following list:

Provisioning

Our apps need to run somewhere. In most cases, one or more virtual machines are required to provide our much-needed resources. In a static approach, we would have a limited set of VMs dimensioned based on the needs of our apps during peak traffic. This works, but it’s also very inefficient. A better approach would be to set the number of VMs based on current needs. For example: if we experience low traffic during the night, we should scale down to maybe one machine, whereas during the day we should be able to scale up to 3 machines. Most cloud providers offer solutions that enable auto-scaling based on some predefined configuration. Let’s say all the available machines have a CPU utilization of over 80%: we can automatically request new machines and make them available for usage.
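
The scaling rule described above can be sketched in a few lines of Python. This is only an illustration of the decision logic, not how any particular cloud provider implements it: the `ScalingPolicy` class, the function name and the 30% scale-down threshold are assumptions, while the 80% threshold and the 1 to 3 machine range come from the example in the text.

```python
from dataclasses import dataclass


@dataclass
class ScalingPolicy:
    """Hypothetical auto-scaling settings mirroring the example in the text."""
    min_machines: int = 1
    max_machines: int = 3
    scale_up_cpu: float = 80.0    # add a machine when every machine is above this
    scale_down_cpu: float = 30.0  # assumed threshold, not taken from the text


def desired_machine_count(cpu_per_machine: list[float], policy: ScalingPolicy) -> int:
    """Return how many machines we should run, given the current CPU readings (in %)."""
    current = len(cpu_per_machine)
    if current == 0:
        return policy.min_machines
    if all(cpu > policy.scale_up_cpu for cpu in cpu_per_machine):
        return min(current + 1, policy.max_machines)
    if all(cpu < policy.scale_down_cpu for cpu in cpu_per_machine):
        return max(current - 1, policy.min_machines)
    return current


policy = ScalingPolicy()
print(desired_machine_count([85.0, 92.0], policy))  # busy day: 3
print(desired_machine_count([10.0, 12.0], policy))  # quiet night: 1
```

Scaling by one machine at a time keeps the sketch simple; real auto-scalers usually add cool-down periods so that the fleet does not flap between sizes.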

Service Registry and Discovery

A Service Registry is nothing more than a central location that stores all the information regarding the services that are currently running. It may contain details like the type of service, its location ( e.g: which VM? ), how to access it ( e.g: which IP? which port? ), and even the number of running instances.

A Service Registry has 2 main functionalities:

  1. Registration – a service should be able to auto-register at startup and de-register when it is shut down. The process can be done automatically by the service itself ( e.g: Eureka Service Registry ) or through third-party software ( e.g: Istio Service Mesh ).
  2. Discovery – a client application should not be aware of the location of the service that it should connect to! Instead, it should only know the location of the service registry. The client application will ask the Service Registry for the current address of the service and then it will connect to the desired service.
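
The two functionalities can be sketched with a tiny in-memory registry in Python. This is a toy model under stated assumptions, not the API of Eureka, Istio or any other product: the `ServiceRegistry` and `Instance` names, the service name and the addresses are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    host: str
    port: int


class ServiceRegistry:
    """In-memory stand-in for a real registry; all names here are hypothetical."""

    def __init__(self) -> None:
        self._instances: dict[str, set[Instance]] = {}

    def register(self, service: str, instance: Instance) -> None:
        self._instances.setdefault(service, set()).add(instance)

    def deregister(self, service: str, instance: Instance) -> None:
        self._instances.get(service, set()).discard(instance)

    def discover(self, service: str) -> list[Instance]:
        # Sorted so that callers always get a stable order.
        return sorted(self._instances.get(service, set()), key=lambda i: (i.host, i.port))


registry = ServiceRegistry()
orders_1 = Instance("10.0.0.5", 8080)
orders_2 = Instance("10.0.0.9", 8080)
registry.register("orders", orders_1)    # done by the service at startup
registry.register("orders", orders_2)
registry.deregister("orders", orders_1)  # done by the service at shutdown

print(registry.discover("orders"))  # only the instance that is still running
```

The client only needs to know the name "orders" and where the registry lives; the actual host and port are looked up at call time.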

Service Configuration

Service Configuration is the equivalent of Service Registry but for configuration. Managing application configuration for one or two applications is quite easy, but when that number grows large enough, managing configuration becomes quite difficult. A simple solution would be to use a Configuration Server on which we store all the application configurations ( based on the type of the application, environment, etc. ). Some more intelligent solutions also enable auto-reloading of the app in the case of a configuration change: the Configuration Server notifies the app that the configuration has changed, and the app then restarts with the new configuration. An extremely promising solution is HashiCorp Consul which, besides Service Configuration, offers other features like Service Discovery and Service Mesh capabilities.
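
As a rough illustration of the idea, the following Python sketch models a Configuration Server that stores configuration per application and environment and notifies subscribers when it changes. It does not reflect how Consul or any other real product works; the `ConfigServer` class, its methods and the configuration keys are hypothetical.

```python
from typing import Callable

Config = dict[str, str]


class ConfigServer:
    """Toy configuration server keyed by (application, environment)."""

    def __init__(self) -> None:
        self._store: dict[tuple[str, str], Config] = {}
        self._listeners: dict[tuple[str, str], list[Callable[[Config], None]]] = {}

    def get(self, app: str, env: str) -> Config:
        # Return a copy so callers cannot modify the stored configuration.
        return dict(self._store.get((app, env), {}))

    def subscribe(self, app: str, env: str, on_change: Callable[[Config], None]) -> None:
        self._listeners.setdefault((app, env), []).append(on_change)

    def put(self, app: str, env: str, config: Config) -> None:
        self._store[(app, env)] = dict(config)
        # Notify every subscribed app so it can reload with the new values.
        for on_change in self._listeners.get((app, env), []):
            on_change(dict(config))


server = ConfigServer()
server.put("orders", "prod", {"db.url": "postgres://db-prod/orders", "log.level": "INFO"})

reloads: list[Config] = []  # stands in for the app's "reload" hook
server.subscribe("orders", "prod", reloads.append)
server.put("orders", "prod", {"db.url": "postgres://db-prod/orders", "log.level": "DEBUG"})

print(server.get("orders", "prod")["log.level"])  # DEBUG
print(len(reloads))                                # 1
```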

Network Segmentation

Network Segmentation is all about access. Consider 3 services ( A, B, and C ) deployed in a dynamic environment. That means they can all be deployed on a single machine or scattered across multiple machines. Now, the question is “how do we make sure that Service A can communicate with Service B but NOT with Service C?”. You’ve probably heard of this concept as “the principle of least privilege”. In a dynamic infrastructure, we can do this through virtual networks ( e.g: docker networks ) or Service Meshes.
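
Conceptually, the rule “A may talk to B but not to C” is a deny-by-default allow-list. The Python sketch below shows only that idea; the service names and the `may_call` function are hypothetical, and in practice the rule would be enforced by the network layer or the service mesh rather than by application code.

```python
# Allowed (source, destination) pairs; anything not listed is denied by default.
ALLOWED_CALLS: set[tuple[str, str]] = {
    ("service-a", "service-b"),
}


def may_call(source: str, destination: str) -> bool:
    """Least privilege: a call is allowed only if it is explicitly listed."""
    return (source, destination) in ALLOWED_CALLS


print(may_call("service-a", "service-b"))  # True
print(may_call("service-a", "service-c"))  # False
```

Note that the rule is expressed in terms of service names, not IPs, so it stays valid no matter which machine each service ends up on.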

Orchestration

I’m pretty sure you’ve heard this word before! Every application that we build is unique, and that means we need a very specific way to configure, deploy, and manage it. We can build our own tools to tackle all these concerns ( aka bash your way through it ), or we can use an orchestration tool. Such a tool gives us a standardized way to automate all the processes mentioned above. There are many options to choose from: Ansible, Chef, Puppet, etc., or, if we talk about container orchestration, Kubernetes, Docker Swarm, Marathon ( Apache Mesos ), etc.

Scheduling

If we take another look at the Application Deployment Evolution diagram, we can immediately notice that multiple applications ( or instances of the same application ) can end up on the same machine or on different machines. If these instances scale up and down dynamically, how can we decide which application ends up on which machine? The answer is we don’t! We let someone or something else do it. That something else is called a scheduler. The scheduler has underlying information about all the machines that it manages, and it knows which one is the most underutilized! So, when a new application instance needs to be deployed, it will pick the machine with the most resources available.
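
The placement rule described above ( pick the machine with the most resources available ) can be sketched in Python as follows. The `Machine` class, the `schedule` function and the numbers are assumptions for illustration; real schedulers such as the one in Kubernetes weigh many more factors than free CPU and memory.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Machine:
    name: str
    free_cpu: float      # free CPU cores
    free_memory_mb: int  # free memory in MB


def schedule(machines: list[Machine], cpu: float, memory_mb: int) -> Optional[Machine]:
    """Place one instance on the machine with the most free resources."""
    candidates = [m for m in machines if m.free_cpu >= cpu and m.free_memory_mb >= memory_mb]
    if not candidates:
        return None  # nothing fits; a real system might trigger provisioning here
    # Most free CPU wins; free memory breaks ties.
    best = max(candidates, key=lambda m: (m.free_cpu, m.free_memory_mb))
    best.free_cpu -= cpu
    best.free_memory_mb -= memory_mb
    return best


fleet = [Machine("vm-1", 1.0, 2048), Machine("vm-2", 3.5, 4096), Machine("vm-3", 2.0, 8192)]
placed = schedule(fleet, cpu=1.0, memory_mb=1024)
print(placed.name)  # vm-2
```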

Network Automation

Virtual Machines come and go, and applications are not always deployed on the same spot. This can mean only one thing: networking hell! Most network problems result from configuration errors, and most configuration changes are made manually. Imagine building subnets, creating firewall openings, and performing all the other operations required to ensure connectivity between services, at a very dynamic pace. No wonder that Network Automation is the way to go in a dynamic infrastructure!
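
One way to picture Network Automation is to derive the low-level rules from a high-level description instead of writing them by hand. The Python sketch below regenerates firewall openings from a list of service dependencies and the current location of each service. The function name, the rule format and the addresses are hypothetical and are not tied to any specific firewall or cloud API.

```python
def firewall_rules(
    dependencies: list[tuple[str, str, int]],
    locations: dict[str, list[str]],
) -> list[tuple[str, str, int]]:
    """Expand service-level dependencies into (source IP, destination IP, port) rules."""
    rules = set()
    for source, destination, port in dependencies:
        for source_ip in locations.get(source, []):
            for destination_ip in locations.get(destination, []):
                rules.add((source_ip, destination_ip, port))
    return sorted(rules)


# "web" may call "orders" on port 8080; where they run changes over time.
dependencies = [("web", "orders", 8080)]
locations = {"web": ["10.0.1.4"], "orders": ["10.0.2.7", "10.0.2.8"]}

for rule in firewall_rules(dependencies, locations):
    print(rule)
```

Whenever a service moves or scales, only `locations` changes and the rules are regenerated, so nobody has to edit them manually.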

Observability

Observability, compared to monitoring, is more of a property than an action. It allows us to deduce the internal state of the application based on the output that it produces: logs, metrics, events, and tracing data. With the increased complexity in our software architecture, monitoring alone isn’t enough to help us solve potential problems. Monitoring can only indicate that something is going wrong, whereas observability goes further by helping us understand why it is going wrong!

Traffic Shaping

In networking, we have 2 types of traffic:

  1. EAST-WEST Traffic – is represented by all the communication done within a data center ( or cloud ) 
  2. NORTH-SOUTH Traffic – is represented by all the communication done between an external host and a server within our data center ( or cloud )

Traffic Shaping is all about bandwidth management, with the main goal of ensuring that critical applications do not experience network degradation during high-load periods. Basically, not all applications will be treated the same. Some will receive more bandwidth in order to support all the incoming requests, whereas others, which contain less critical functionality, will receive less.
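
The prioritisation idea can be illustrated with a small Python sketch that splits a fixed amount of bandwidth between applications according to a priority weight. The application names, weights and the `allocate_bandwidth` function are assumptions; real traffic shapers typically work at the packet level ( for example with token buckets ), so this only shows how “more critical gets more” could be expressed.

```python
def allocate_bandwidth(total_mbps: float, priorities: dict[str, int]) -> dict[str, float]:
    """Split the total bandwidth between applications in proportion to their priority."""
    total_weight = sum(priorities.values())
    if total_weight <= 0:
        raise ValueError("at least one application needs a positive priority")
    return {app: total_mbps * weight / total_weight for app, weight in priorities.items()}


shares = allocate_bandwidth(1000, {"checkout": 6, "search": 3, "newsletter": 1})
print(shares)  # checkout gets 600 Mbps, search 300, newsletter 100
```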

Case Study: Kubernetes

If you are here, you already know this word! But … have you ever wondered why it is so popular? I know you may have thousands of reasons why you like working with Kubernetes, but have you ever analyzed Kubernetes based on all its capabilities?

Target Areas Solved by Kubernetes

Just by using a vanilla Kubernetes cluster, we can immediately tackle a lot of target areas. The remaining areas can be covered with third-party software. Below, you can find an example of such a setup:

Example Solutions Targeting Remaining Areas

Traefik – Traffic Shaping

ELK Stack – Observability

Terraform – Provisioning

Istio Service Mesh – Network Segmentation

Credits

This blog post was inspired by a private presentation done by Abhinav Sonkar ( Twitter, LinkedIn ) who managed to break down such a complex topic into easily understandable pieces! Many thanks!