Introduction: key concepts
1.- Virtualization & Cloud computing
For Cloud Computing, we rely on the NIST definition of this paradigm (NIST Special Publication 800-145), which describes it as "a model for enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction. This cloud model is composed of five essential characteristics, three service models, and four deployment models."
- Cloud Computing's five essential characteristics: on-demand self-service, broad network access, resource pooling, rapid elasticity, and measured service.
- Cloud Computing's three service models: Software as a Service (SaaS), Platform as a Service (PaaS), and Infrastructure as a Service (IaaS).
- Cloud Computing's four deployment models: private cloud, community cloud (also federated cloud), public cloud, and hybrid cloud.
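To make "rapid elasticity" more concrete, here is a minimal Python sketch of a scaling rule: it decides how many nodes should be provisioned for the current demand and clamps the result to fixed limits. The function name `desired_nodes` and its parameters are hypothetical and chosen only for illustration; this is not how any particular cloud provider or EC3 computes its scaling decisions.

```python
import math


def desired_nodes(pending_jobs: int, jobs_per_node: int,
                  min_nodes: int = 0, max_nodes: int = 10) -> int:
    """Return how many nodes to keep provisioned for the current demand.

    All names and defaults here are illustrative assumptions.
    """
    if jobs_per_node <= 0:
        raise ValueError("jobs_per_node must be positive")
    # Nodes needed to serve every pending job at once, rounded up.
    needed = math.ceil(pending_jobs / jobs_per_node)
    # Never go below the minimum or above the maximum allowed size.
    return max(min_nodes, min(max_nodes, needed))


if __name__ == "__main__":
    for pending in (0, 9, 100):
        print(pending, "pending jobs ->", desired_nodes(pending, jobs_per_node=4))
```

Resources are provisioned when demand grows and released when it shrinks, which is the behaviour the NIST characteristic describes.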
2.- Clusters and Local Resource Management Systems
According to Wikipedia, "a computer cluster is a set of computers that work together so that they can be viewed as a single system". These computers are connected to each other through fast local area networks, which lets them work together on computationally intensive tasks. In a cluster, each computer is referred to as a "node". If all the nodes have the same physical characteristics (i.e., the same number of CPUs or GPUs, RAM, disk, etc.) and the same OS, we have a homogeneous cluster. Diversity is also allowed, in which case we have a heterogeneous cluster. Notice that a cluster does not have to be composed of physical machines: it can also be deployed on a Cloud Computing platform, forming a virtual cluster composed of virtual machines. This is what the EC3 tool provides to its users.
Typically, a cluster has a small number of front-end nodes, usually one or two (for fault tolerance purposes), and a large number of compute nodes or working nodes. The front-end node is the computer that users log in to, and where they edit scripts, compile code, and submit jobs.
The jobs are automatically run on the compute nodes by the Local Resource Management System (LRMS), the software that schedules tasks and manages the nodes that compose the cluster. Several LRMSs are used in both science and business environments. The most widely used and best-known ones are:
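The homogeneous/heterogeneous distinction can be sketched in a few lines of Python. The `Node` class, its fields, and `is_homogeneous` are hypothetical names introduced only for this illustration, and the set of compared characteristics is an assumption based on the ones listed in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """One computer of the cluster (illustrative fields only)."""
    name: str
    cpus: int
    gpus: int
    ram_gb: int
    disk_gb: int
    os: str

    def profile(self) -> tuple:
        # Everything except the name defines the node's characteristics.
        return (self.cpus, self.gpus, self.ram_gb, self.disk_gb, self.os)


def is_homogeneous(nodes) -> bool:
    """True when every node has the same hardware profile and OS."""
    return len({node.profile() for node in nodes}) <= 1


if __name__ == "__main__":
    cluster = [
        Node("wn1", cpus=4, gpus=0, ram_gb=16, disk_gb=100, os="Ubuntu 22.04"),
        Node("wn2", cpus=4, gpus=0, ram_gb=16, disk_gb=100, os="Ubuntu 22.04"),
    ]
    print(is_homogeneous(cluster))
    cluster.append(Node("wn3", cpus=8, gpus=1, ram_gb=32, disk_gb=200, os="Ubuntu 22.04"))
    print(is_homogeneous(cluster))
```

The same check applies whether the nodes are physical machines or virtual machines of a virtual cluster.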
- SLURM is a workload manager designed specifically to satisfy the demanding needs of high performance computing (HPC). It is free and open-source, which facilitates its use at government laboratories, universities and companies worldwide. SLURM is highly configurable: it comes with a set of optional plugins that provide the functionality needed by demanding HPC centers.
- TORQUE is a resource manager that provides control over batch jobs and distributed computing resources. TORQUE can integrate with the non-commercial Maui Cluster Scheduler or the commercial Moab Workload Manager to improve overall utilization, scheduling and administration on a cluster.
- Kubernetes (also known as K8s) is an open-source system for automating deployment, scaling, and management of containerized applications. It groups containers that make up an application into logical units for easy management and discovery.
- Apache Mesos is a distributed systems kernel that abstracts CPU, memory, storage, and other compute resources away from machines (physical or virtual), enabling fault-tolerant and elastic distributed systems to easily be built and run effectively.
- HTCondor is a specialized workload management system for compute-intensive jobs. Like other full-featured batch systems, HTCondor provides a job queueing mechanism, scheduling policy, priority scheme, resource monitoring, and resource management. Users submit their serial or parallel jobs to HTCondor, HTCondor places them into a queue, chooses when and where to run the jobs based upon a policy, carefully monitors their progress, and ultimately informs the user upon completion.
Image from Jhon Voo Flickr account.
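As a rough illustration of what an LRMS does when jobs are submitted from the front-end node, the following Python sketch places queued jobs on compute nodes in strict first-in, first-out order. It is a toy model under stated assumptions: the `Job` class, the `schedule` function, and the node names are hypothetical, and real systems such as SLURM or HTCondor apply far richer policies (priorities, backfilling, fair share) that are not modelled here.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Job:
    """A submitted job and the number of CPUs it requests (illustrative)."""
    name: str
    cpus: int


def schedule(jobs, free_cpus):
    """Place jobs on nodes in strict FIFO order.

    jobs: jobs in submission order.
    free_cpus: mapping of node name to currently free CPUs.
    Returns (placements, still_queued), where placements maps each
    started job to its node and still_queued lists the waiting jobs.
    """
    free = dict(free_cpus)  # work on a copy, leave the caller's data intact
    queue = deque(jobs)
    placements = {}
    while queue:
        job = queue[0]
        # Pick the first node with enough free CPUs for the head job.
        node = next((n for n, c in free.items() if c >= job.cpus), None)
        if node is None:
            break  # strict FIFO: the head job blocks everything behind it
        free[node] -= job.cpus
        placements[job.name] = node
        queue.popleft()
    return placements, [j.name for j in queue]


if __name__ == "__main__":
    submitted = [Job("a", 2), Job("b", 4), Job("c", 1)]
    started, waiting = schedule(submitted, {"wn1": 4, "wn2": 2})
    print("started:", started)
    print("waiting:", waiting)
```

In this run, job "c" would fit on a free node but stays queued behind "b"; avoiding that kind of idle capacity is one reason production schedulers use more elaborate policies than plain FIFO.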
3.- Infrastructure-as-code tools
"A long time ago, in a data center far, far away, an ancient group of powerful beings known as sysadmins used to deploy infrastructure manually. Every server, every route table entry, every database configuration, and every load balancer was created and managed by hand. It was a dark and fearful age: fear of downtime, fear of accidental misconfiguration, fear of slow and fragile deployments, and fear of what would happen if the sysadmins fell to the dark side (i.e. took a vacation). The good news is that thanks to the DevOps Rebel Alliance, we now have a better way to do things: Infrastructure-as-Code (IAC)." Source: https://blog.gruntwork.io/.
Infrastructure as code (IaC) is the process of managing and provisioning computer data centers through machine-readable definition files, rather than physical hardware configuration or interactive configuration tools.
This has a number of benefits:
- You can automate your entire provisioning and deployment process, which makes it much faster and more reliable than any manual process.
- You can store those source files in version control, which means the entire history of your infrastructure is now captured in the commit log, which you can use to debug problems, and if necessary, roll back to older versions.
- You can validate each infrastructure change through code reviews and automated tests.
- You can create a library of reusable, documented, battle-tested infrastructure code that makes it easier to scale and evolve your infrastructure.
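The core idea behind these benefits is that the infrastructure is described as data, and a tool works out what has to change. The Python sketch below illustrates this with a toy "desired state versus current state" comparison. The functions `plan` and `apply` and the resource names are hypothetical; this is not how Ansible, Terraform, or any other specific tool is implemented, only a minimal model of the declarative approach.

```python
def plan(desired: dict, current: dict) -> list:
    """List the actions needed to turn `current` into `desired`."""
    actions = []
    for name, spec in sorted(desired.items()):
        if name not in current:
            actions.append(("create", name))
        elif current[name] != spec:
            actions.append(("update", name))
    for name in sorted(current):
        if name not in desired:
            actions.append(("delete", name))
    return actions


def apply(desired: dict, current: dict) -> dict:
    """Return the state obtained after executing the plan."""
    new_state = dict(current)
    for action, name in plan(desired, current):
        if action == "delete":
            del new_state[name]
        else:  # "create" and "update" both set the desired spec
            new_state[name] = desired[name]
    return new_state


if __name__ == "__main__":
    # The definition that would live in version control (illustrative values).
    desired = {"frontend": {"cpus": 2}, "wn1": {"cpus": 4}}
    current = {"wn1": {"cpus": 2}, "old-node": {"cpus": 1}}
    print(plan(desired, current))
    print(plan(desired, apply(desired, current)))
```

Because the definition is plain data, it can be reviewed, tested, and rolled back like any other source file, and applying it a second time produces no further changes.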
There are several tools for managing infrastructure as code. The best-known ones are Ansible (the one used by EC3), Puppet, Chef, SaltStack, Terraform and CloudFormation.