Container Orchestration Guidelines

This section describes a set of standards, conventions and guidelines for deploying application suites on Container Orchestration technologies.

Overview of Standards

These standards, best practices and guidelines are based on existing industry standards and tooling.

The standards are broken down into the following areas:

  • Structuring application suites for orchestration - general guidelines for breaking up application suites to run under container orchestration
  • Defining and building cloud native application suites - resource definitions, configuration, platform resource integration
  • Kubernetes primitives - a more detailed look at key components: Pods, Services, Ingress
  • Scheduling and running cloud native application suites - scheduling, execution, monitoring, logging, diagnostics, security considerations

Throughout this documentation, Kubernetes in conjunction with Helm is used as the reference implementation, with the canonical versions being Kubernetes v1.16.2 and Helm v3.1.2. However, the aim is to target compliance with the OCI specifications and CNF guidelines, so that alternative Container Orchestration solutions and tooling can be substituted.

A set of example Helm Charts is provided in the repository container-orchestration-chart-examples. These give an overall idea of how the components of a chart function together, and how the life cycle of a chart can be managed with make.

Structuring application suites for Orchestration

In order to understand how to structure application suites for orchestration, we first need to understand what the goals of Cloud Native software engineering are.

What is Cloud Native

It is the embodiment of modern software delivery practices supported by tools, frameworks, processes and platform interfaces.

These capabilities are the next evolution of Cloud Computing, raising the level of abstraction for all actors against the architecture from the hardware unit to the application component.

What does this mean? Developers and system operators (DevOps) interface with the platform architecture using abstract resource concepts, and should have next to no concern regarding the plumbing or wiring of the platform, while still being able to deploy and scale applications according to cost and usage.

Cloud Native exploits the advantages of the Cloud Computing delivery model:

  • PaaS (Platform as a Service) layered on top of IaaS (Infrastructure as a Service)
  • CI/CD (Continuous Integration/Delivery) – fully automated build, test, deploy
  • Modern DevOps – auto-scaling, monitoring feedback loop to tune resource requirements
  • Software abstraction from platform compute, network, storage
  • Portability across Cloud Services providers

Why Cloud Native SDLC (Software Development Life Cycle)?

Figure: How Kubernetes fits into the Cloud Native SDLC

Kubernetes provides cohesion for distributed projects:

  • Codifies standards through implementing testing gates
  • Ensures code quality, consistency and predictability of deployment success – CI/CD
  • Automation – build AND rebuild for zero day exploits at little cost
  • Portability of SDI (Software Defined Infrastructure) as well as code
  • Provides a codified reference implementation of best practices, and exemplars
  • Enables broad engagement – an open and collaborative system - a “Social Coding Platform”
  • Consistent set of standards for integration with SRC (SKA Regional Centres), and other projects – the future platform of integrated science projects through shared resources enabled by common standards

How does orchestration work

At the core of Cloud Native is the container orchestration platform. For the purposes of these guidelines, this consists of Kubernetes as the orchestration layer, over Docker as the container engine.

Figure: The architecture of Kubernetes at the centre of the Cloud Native platform

Kubernetes provides an abstraction layer from hardware infrastructure resources, enabling compute, network, storage, and other dependent services (other applications) to be treated as abstract concepts. A computing cluster is treated not as a collection of machines but as an opaque pool of resources, advertised through a consistent REST-based API. These resources can be customised to provide access to and accounting of specialised devices such as GPUs.

Through the Kubernetes API, the necessary resources that make up an application suite (compute, network, storage) are addressed as objects in an idempotent way that declares the desired state eg: this number of Pods running these containers, backed by this storage, on that network. The Kubernetes control plane (the scheduler together with its controllers) constantly moves the cluster towards this desired state, including in the event of application or node/hardware failure. This builds in robustness and auto-healing.

Both platform and service resources can be classified by performance characteristics and reservation criteria using labelling, which in turn is used by scheduling algorithms to determine optimum placement of workloads across the cluster.

All applications are deployed as sets of one or more containers in a minimum configuration called a Pod. Pods are the minimum scalable unit, distributed and replicated across the cluster according to the scheduling algorithm. A Pod is essentially a set of shared Linux namespaces (such as network and IPC) holding one or more containers. It only makes sense to put together containers that are tightly coupled and logically indivisible by design.

Pods can be scheduled in a number of patterns using Controllers, including bare Pod (a single Pod instance), Deployment (a replicated Pod set), StatefulSet (a Deployment with certain guarantees about naming and ordering of replicated units), DaemonSet (one Pod per scheduled compute node), and Job/CronJob (run to completion applications).
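
To make the idea of a declared desired state concrete, the following is a minimal sketch of a Deployment that asks for three replicas of a single container. The name example-web, the image and the port are hypothetical placeholders rather than part of any existing chart, and storage and network resources are omitted for brevity.

```yaml
# Hypothetical example: names, image and port are illustrative only.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-web
  labels:
    app.kubernetes.io/name: example-web
spec:
  # Desired state: three identical Pods. If a Pod or node fails,
  # a replacement Pod is created to restore this count.
  replicas: 3
  selector:
    matchLabels:
      app.kubernetes.io/name: example-web
  template:
    metadata:
      labels:
        app.kubernetes.io/name: example-web
    spec:
      containers:
        - name: web
          image: nginx:1.17        # assumed image; any container image works
          ports:
            - name: http
              containerPort: 80
```

Applying the same document repeatedly leaves the cluster unchanged, which is what makes the declaration idempotent.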

A detailed discussion of these features can be found in the main Kubernetes documentation under Concepts.

Structuring Application Suites

Architecting software to run in an orchestration environment builds on the guidelines given in the Container Standards ‘Structuring Containerised Applications’ section. The key concept is to treat run time containers as immutable and atomic applications, where any application state is explicitly dealt with through connections to storage mechanisms.

The application should be broken into components that represent:

  • an application component that has an independent development lifecycle
  • an individual process that performs a discrete task such as a micro service, specific database/web service, device, computational task etc.
  • a component that exposes a specific service to another application eg. a micro service or database
  • a reusable component that is applicable to multiple application deployments eg. a co-routine or proximity dependent service (logger, metrics collector, network helper, private database etc)
  • an independently scalable unit that can be replicated to match demand
  • the minimum unit required to match a resource profile at scheduling time such as storage, memory, cpu, specialised device

Above all, design software to scale horizontally through a UNIX process model so that individual components that have independent scaling characteristics can be replicated independently.

The application interface should be through the standard container run time interface contract:

  • inputs come via a configurable Port
  • outputs go to a configurable network service
  • logging goes to stdout/stderr and syslog and uses JSON to enrich metadata (see Container Standards ‘Logging’)
  • metrics are advertised via a standard such as Prometheus Exporters, or emit metrics in a JSON format over TCP consumable by ETL services such as LogStash
  • configuration is passed in using environment variables, and simple configuration files (eg: ini, or key/value pairs).
  • POSIX compliant storage IO is facilitated by bind mounted volumes.
  • connections to DBMS, queuing technologies and object storage are managed through configuration.
  • applications should have built-in recoverability so that prior state and context is automatically discovered on restart. This enables the cluster to auto-heal by re-launching workloads on other resources when nodes fail (a critical aspect of a micro-services architecture).
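
The interface contract above might translate into a Pod specification along the following lines. This is a sketch only: the component name, image, port numbers and environment variable names are all hypothetical, and an emptyDir volume stands in for real persistent storage.

```yaml
# Hypothetical Pod illustrating the run time interface contract.
apiVersion: v1
kind: Pod
metadata:
  name: example-processor
  labels:
    app.kubernetes.io/name: example-processor
spec:
  containers:
    - name: processor
      image: registry.example.org/example-processor:1.0.0  # assumed image
      ports:
        - name: input              # inputs arrive on a configurable port
          containerPort: 9000
      env:
        - name: OUTPUT_HOST        # outputs go to a configurable network service
          value: example-sink
        - name: OUTPUT_PORT
          value: "9100"
        - name: DB_HOST            # DBMS connection managed through configuration
          value: tangodb
      volumeMounts:
        - name: data               # POSIX storage IO via a mounted volume
          mountPath: /data
  volumes:
    - name: data
      emptyDir: {}                 # stand-in; a PersistentVolumeClaim in practice
```

Nothing in the container image needs to change between environments; only the values injected through the Pod specification differ.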

By structuring an application in this fashion, it can scale from the single instance desktop development environment up to a large parallel deployment in production. It needs no built-in understanding of the plumbing and wiring of each specific environment, because this is handled through external configuration at the Infrastructure management layer.

Example: Tango Controls

To help illustrate the Cloud Native application architecture concepts, a walk-through of a Tango application suite is used.

A Tango Controller System environment is typically made up of the following:

  • Database containing the system state eg: MySQL.
  • DatabaseDS Tango device server.
  • One or more Tango devices.
  • Optional components - Tango REST interface, Tango logviewer, SysAdmin and debugging tools such as Astor and Jive.

These components map to the following Kubernetes resources:

  • MySQL Database == StatefulSet.
  • DatabaseDS == Deployment or StatefulSet.
  • Tango REST interface == Deployment.
  • Tango Device == bare Pod, or single replica Deployment.

This example does not take into consideration an HA deployment of MySQL, treating MySQL as a single instance StatefulSet. Using a StatefulSet in this case gives the following guarantees above a Deployment:

  • Stable unique network identifiers.
  • Stable persistent storage.
  • Ordered graceful deployment and scaling.
  • Ordered automated rolling updates.

These characteristics are useful for stable service types such as databases and message queues.
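
As an illustration of these guarantees, a single instance MySQL StatefulSet could be sketched as below. The resource names, image tag, password handling and storage size are assumptions made for the example; a real chart would take them from configuration and keep the password in a Secret.

```yaml
# Hypothetical single-instance MySQL StatefulSet; values are illustrative.
apiVersion: v1
kind: Service
metadata:
  name: tangodb
spec:
  clusterIP: None            # headless Service gives each Pod a stable DNS identity
  selector:
    app.kubernetes.io/name: tangodb
  ports:
    - name: mysql
      port: 3306
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: tangodb
spec:
  serviceName: tangodb       # references the headless Service above
  replicas: 1                # the Pod is named tangodb-0 and keeps that name
  selector:
    matchLabels:
      app.kubernetes.io/name: tangodb
  template:
    metadata:
      labels:
        app.kubernetes.io/name: tangodb
    spec:
      containers:
        - name: mysql
          image: mysql:5.7                 # assumed image
          env:
            - name: MYSQL_ROOT_PASSWORD    # use a Secret in practice
              value: example-only
          ports:
            - name: mysql
              containerPort: 3306
          volumeMounts:
            - name: data
              mountPath: /var/lib/mysql
  volumeClaimTemplates:      # stable persistent storage, one claim per replica
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 1Gi
```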

DatabaseDS is a stateless and horizontally scalable service in its own right (state comes from MySQL). This makes it a fit for the Deployment (which in turn uses a ReplicaSet) or the StatefulSet. Deployments are a good fit for stateless components that require high availability through mechanisms such as rolling upgrades.

The Tango Devices are single instance applications that act as a proxy between the ‘real’ hardware being controlled and the DatabaseDS service that provides each Tango Device with a gateway to the Tango cluster state database (MySQL). Considering that in most cases an upgrade to a Device Pod is likely to be a delete and replace, we can use the simplest case of a bare Pod, which enables us to name each Pod after its intended device without the random suffix generated for Deployments.
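
A bare Pod for a single Tango Device could be sketched as follows. The Pod name, image and DatabaseDS Service name are hypothetical; only the TANGO_HOST variable and port 10000 follow the conventions used elsewhere in this document.

```yaml
# Hypothetical bare Pod for one Tango Device; image and names are assumptions.
apiVersion: v1
kind: Pod
metadata:
  name: tangotest            # fixed, predictable name with no generated suffix
  labels:
    app.kubernetes.io/name: tangotest
spec:
  restartPolicy: Always      # the container is restarted in place if it exits
  containers:
    - name: device
      image: registry.example.org/tango-device:1.0.0
      env:
        - name: TANGO_HOST   # DatabaseDS Service name and port
          value: databaseds:10000
```

Note that a bare Pod is not recreated elsewhere if its node fails; a single replica Deployment provides that at the cost of a generated name suffix.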

Example: MPI jobs

A typical MPI application consists of a head node, and worker nodes with the (run to completion) job being launched from the head node, which in turn controls the work distribution over the workers.

This can be broken into:

  • a generic component type that covers head node and worker nodes.
  • a launcher that triggers the application on the designated head node.

These components map to the following Kubernetes resources:

  • Worker node == DaemonSet or StatefulSet.
  • Launcher and Head node == Job.

MPI jobs typically only require a single instance per physical compute node, and this is exactly the use case of DaemonSets, where Kubernetes ensures exactly one instance of a Pod is running on each designated node. Using Jobs enables the launcher and the head node to be combined. Both Job and DaemonSet Pods will most likely need the same MPI libraries and tools, so they can share a single container image.
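
The mapping above might look like the following sketch. The image, the node label used to designate compute nodes and the launcher script path are all hypothetical, and the mechanism by which the launcher discovers the workers is deliberately left out.

```yaml
# Hypothetical sketch: image, node label and launcher script are assumptions.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: mpi-worker
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: mpi-worker
  template:
    metadata:
      labels:
        app.kubernetes.io/name: mpi-worker
    spec:
      nodeSelector:
        example.org/mpi: "true"    # assumed label marking the designated nodes
      containers:
        - name: worker
          # the image's default entrypoint is assumed to start the worker
          image: registry.example.org/mpi-app:1.0.0
---
apiVersion: batch/v1
kind: Job
metadata:
  name: mpi-launcher
spec:
  backoffLimit: 0                  # do not retry a failed run automatically
  template:
    spec:
      restartPolicy: Never         # run to completion
      containers:
        - name: launcher
          image: registry.example.org/mpi-app:1.0.0   # same image as the workers
          command: ["/opt/app/launch.sh"]             # hypothetical launcher script
```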

Linking Components Together

Components of an application suite, or even components in different suites, should use DNS for service discovery. This is achieved by using the Service resource. Services should always be declared before Pods so that the associated Environment Variables are generated in time for the subsequent Pods to discover them. Service names are permanent and predictable, and are tied to the Namespace that an application suite is deployed in. For example, in the namespace test, the DatabaseDS Tango component can find the MySQL database tangodb using the name tangodb or tangodb.test. This is distinct from the instance running in the qa namespace, which is also named tangodb but is addressable by tangodb.qa. This greatly simplifies configuration management for software deployment.
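
As a sketch of DNS-based discovery, the manifests below declare a Service named tangodb followed by a Pod that refers to it by name. The image and the MYSQL_HOST variable name are hypothetical; the namespace is supplied at apply time rather than in the metadata.

```yaml
# Hypothetical sketch; apply with: kubectl --namespace test apply -f <file>
apiVersion: v1
kind: Service
metadata:
  name: tangodb                # resolvable as tangodb or tangodb.test
spec:
  selector:
    app.kubernetes.io/name: tangodb
  ports:
    - name: mysql
      port: 3306
---
apiVersion: v1
kind: Pod
metadata:
  name: databaseds
spec:
  containers:
    - name: databaseds
      image: registry.example.org/tango-databaseds:1.0.0   # assumed image
      env:
        - name: MYSQL_HOST     # hypothetical variable name
          value: tangodb:3306  # short name resolves within the same namespace
```

With the default cluster domain, the fully qualified form of the same name is tangodb.test.svc.cluster.local, so the identical manifest deployed to the qa namespace resolves to the qa instance without any change.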

Defining and building cloud native application suites

All Kubernetes resource objects are described through the REST-based API. API documents are represented in either JSON or YAML; the preference is for YAML as the description language, as this tends to be more human readable. The API representations are declarative, specifying the end desired state. It is up to the Kubernetes control plane to make this a reality.

It is important to use generic syntax and Kubernetes resource types. Specialised resource types reduce portability of resource descriptors and templates, and increase dependency on 3rd party integrations. This could lead to upgrade paralysis because the SDLC is out of our control. An example of this might be using a non-standard 3rd party Database Operator for MySQL instead of the official Oracle one.

Metadata

Each resource is described with:

  • apiVersion - API version that this document should invoke
  • kind - resource type (object) that is to be handled
  • metadata - descriptive information including name, labels, annotations, namespace, ownership, references
  • spec(ification) - the body of the specification for this resource type denoted by kind

The following is an example of the start of a StatefulSet for the Tango DatabaseDS:

Resource description
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: databaseds-integration-tmc-webui-test
  labels:
    app.kubernetes.io/name: databaseds-integration-tmc-webui-test
    helm.sh/chart: integration-tmc-webui-0.1.0
    app.kubernetes.io/instance: test
    app.kubernetes.io/managed-by: helm
spec:
  ...

Namespaces

Even though it is possible to specify the namespace directly in the Metadata, it SHOULD NOT be, as this reduces the flexibility of any resource definition and templating solution employed such as Helm. The namespace can be specified at run time eg: kubectl --namespace test apply -f resource-file.yaml.

Name and Labels

Naming and labelling of all resources associated with a deployment should be consistent. This ensures that deployments that land in the same namespace can be identified along with all inter-dependencies. This is particularly useful when using the kubectl command line tool, as label based filtering can be employed to sieve out all related objects.

Labels are entirely flexible and free form, but as a minimum specify:

  • the name and app.kubernetes.io/name with the same identifier, with sufficient precision that the same application component deployed in the same namespace can be distinguished eg: a concatenation of <application>-<suite>-<release>. name and app.kubernetes.io/name are duplicated because label filter interaction between resources relies on labels eg: Service exposing Pods of a Deployment.
  • the labels of the deployment suite such as the helm.sh/chart for Helm, including the version.
  • the app.kubernetes.io/instance (which is release) of the deployment suite.
  • app.kubernetes.io/managed-by what tooling is used to manage this deployment - most likely helm.

Optional extras which are also useful for filtering are:

  • app.kubernetes.io/version the component version.
  • app.kubernetes.io/component the component type (most likely related to the primary container).
  • app.kubernetes.io/part-of what kind of application suite this component belongs to.

The recommended core label set is described under Kubernetes common labels.

metadata:
  name: databaseds-integration-tmc-webui-test
  labels:
    app.kubernetes.io/name: databaseds-integration-tmc-webui-test
    helm.sh/chart: integration-tmc-webui-0.1.0
    app.kubernetes.io/instance: test
    app.kubernetes.io/version: "1.0.3"
    app.kubernetes.io/component: databaseds
    app.kubernetes.io/part-of: tango
    app.kubernetes.io/managed-by: helm

Using this labelling scheme enables filtering for all deployment related objects eg: kubectl get all -l helm.sh/chart=integration-tmc-webui-0.1.0,app.kubernetes.io/instance=test.

kubectl label filtering
$ kubectl get all,configmaps,secrets,pv,pvc -l helm.sh/chart=integration-tmc-webui-0.1.0,app.kubernetes.io/instance=test
NAME                                          READY   STATUS     RESTARTS   AGE
pod/databaseds-integration-tmc-webui-test-0   1/1     Running    0          55s
pod/rsyslog-integration-tmc-webui-test-0      1/1     Running    0          55s
pod/tangodb-integration-tmc-webui-test-0      1/1     Running    0          55s
pod/tangotest-integration-tmc-webui-test      1/1     Running    0          55s
pod/webjive-integration-tmc-webui-test-0      0/6     Init:0/1   0          55s

NAME                                            TYPE        CLUSTER-IP    EXTERNAL-IP   PORT(S)                                       AGE
service/databaseds-integration-tmc-webui-test   ClusterIP   None          <none>        10000/TCP                                     55s
service/rsyslog-integration-tmc-webui-test      ClusterIP   None          <none>        514/TCP,514/UDP                               55s
service/tangodb-integration-tmc-webui-test      ClusterIP   None          <none>        3306/TCP                                      55s
service/webjive-integration-tmc-webui-test      ClusterIP   10.97.135.8   <none>        80/TCP,5004/TCP,3012/TCP,8080/TCP,27017/TCP   55s

NAME                                                     READY   AGE
statefulset.apps/databaseds-integration-tmc-webui-test   1/1     55s
statefulset.apps/rsyslog-integration-tmc-webui-test      1/1     55s
statefulset.apps/tangodb-integration-tmc-webui-test      1/1     55s
statefulset.apps/webjive-integration-tmc-webui-test      0/1     55s

NAME                                                  CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                                        STORAGECLASS   REASON   AGE
persistentvolume/rsyslog-integration-tmc-webui-test   10Gi       RWO            Retain           Bound    default/rsyslog-integration-tmc-webui-test   standard                56s
persistentvolume/tangodb-integration-tmc-webui-test   1Gi        RWO            Retain           Bound    default/tangodb-integration-tmc-webui-test   standard                55s
persistentvolume/webjive-integration-tmc-webui-test   1Gi        RWO            Retain           Bound    default/webjive-integration-tmc-webui-test   standard                55s

NAME                                                       STATUS   VOLUME                               CAPACITY   ACCESS MODES   STORAGECLASS   AGE
persistentvolumeclaim/rsyslog-integration-tmc-webui-test   Bound    rsyslog-integration-tmc-webui-test   10Gi       RWO            standard       56s
persistentvolumeclaim/tangodb-integration-tmc-webui-test   Bound    tangodb-integration-tmc-webui-test   1Gi        RWO            standard       55s
persistentvolumeclaim/webjive-integration-tmc-webui-test   Bound    webjive-integration-tmc-webui-test   1Gi        RWO            standard       55s

Templating the Application

While it is entirely possible to define all the necessary resources for an application suite to be deployed on Kubernetes in individual YAML files or in a single YAML file, this approach is static and quickly reveals its limitations in terms of creating reusable and composable application suites. This is why Helm Charts have been adopted by the Kubernetes community as the leading templating solution for deployment. Helm provides a mechanism for generically describing an application suite, separating out configuration, and rolling out deployment releases, all done in a declarative ‘configuration as code’ style.

All Helm Charts should target a minimum of three environments:

  • Minikube - the standalone developer environment.
  • CI/CD - the Continuous Integration testing environment which is typically the same benchmark as Minikube.
  • Production Cluster - the target production Kubernetes environment.

Minikube should be the default target environment for a Chart, as this will have the largest audience and should be optimised to work without modification of any configuration if possible.

When designing a Chart it is important to have clear separation of concerns:

  • the application - essentially the containers to run.
  • configuration - any variables that influence the application run time.
  • resources - any storage, networking, configuration files, secrets, ACLs.

The general structure of a Chart should follow:

charts/myapp/
        Chart.yaml          # A YAML file containing information about the chart and listing
                            # dependencies for the chart (refer to Helm 2 vs Helm 3 differences).
        LICENSE             # OPTIONAL: A plain text file containing the license for the chart
        README.md           # OPTIONAL: A human-readable README file
        values.yaml         # The default configuration values for this chart
        charts/             # A directory containing any charts upon which this chart depends.
        templates/          # A directory of templates that, when combined with values,
                            # will generate valid Kubernetes manifest files.
        templates/NOTES.txt # OPTIONAL: A plain text file containing short usage notes
        templates/tests     # A directory of test templates for running with 'helm test'

All template files in the templates/ directory should be named in a readily identifiable way after the component that they contain. If further clarification is required, the name should be suffixed with the Kind of resource eg: tangodb.yaml contains the StatefulSet for the Tango database, and tangodb-pv.yaml contains the PersistentVolume declaration for the Tango database. ConfigMaps should be grouped in configmaps.yaml and Secrets in secrets.yaml. The aim is to make it easy for others to understand the layout of the application suite being deployed.

Helm Best Practices

The Helm community have a well defined set of best practices. The following highlights key aspects of these practices that will help with achieving consistency and reliability.

  • charts should be placed in a charts/ directory within the parent project.
  • chart names should be lowercase and hyphenated and must match the directory name eg. charts/my-app.
  • name, version, description, home, maintainers and sources must be included in Chart.yaml.
  • version must follow the Semantic Versioning standards.
  • the chart must pass the helm lint charts/<chart-name> test.

Warning

Helm 2 vs Helm 3

It should be noted that we have now migrated to using Helm 3. Feel free to upgrade Helm in your development environments using our Ansible Playbook upgrade_helm.yml found in the SKA Ansible Playbooks repository.

A few changes may affect specific cases. For details, see the referenced blog post and Helm’s own FAQ page.

Example Chart.yaml file:

apiVersion: v2  # chart API version; Helm 3 requires this field
name: my-app
version: 1.0.0
description: Very important app
keywords:
- magic
- mpi
home: https://www.skatelescope.org/
icon: http://www.skatelescope.org/wp-content/uploads/2016/07/09545_NEW_LOGO_2014.png
sources:
- https://gitlab.com/ska-telescope/my-app
maintainers:
- name: myaccount
  email: myaccount@skatelescope.org

Metadata with Helm

All resources should have the following boilerplate metadata to ensure that all resources can be uniquely identified to the chart, application and release:

...
metadata:
  name: <component>-{{ template "my-app.name" . }}-{{ .Release.Name }}
  labels:
    app.kubernetes.io/name: <component>-{{ template "my-app.name" . }}-{{ .Release.Name }}
    helm.sh/chart: "{{ .Chart.Name }}-{{ .Chart.Version }}"
    app.kubernetes.io/instance: "{{ .Release.Name }}"
    app.kubernetes.io/managed-by: "{{ .Release.Service }}"
...

Defining resources

The Helm templating language is based on Go templates.

All resources go in the templates/ directory, with the general rule being one Kubernetes resource per template file. Files that render resources are suffixed .yaml, whilst files that contain only expressions and macros are suffixed .tpl.

Sample resource template for a Service generated by ‘helm create mychart’
apiVersion: v1
kind: Service
metadata:
  name: {{ include "mychart.fullname" . }}
  labels:
    app.kubernetes.io/name: {{ include "mychart.name" . }}
    helm.sh/chart: {{ include "mychart.chart" . }}
    app.kubernetes.io/instance: {{ .Release.Name }}
    app.kubernetes.io/managed-by: {{ .Release.Service }}
spec:
  type: {{ .Values.service.type }}
  ports:
  - port: {{ .Values.service.port }}
    targetPort: http
    protocol: TCP
    name: http
  selector:
    app.kubernetes.io/name: {{ include "mychart.name" . }}
    app.kubernetes.io/instance: {{ .Release.Name }}

Expression or macro template generated by ‘helm create mychart’
{{/* vim: set filetype=mustache: */}}
{{/*
Expand the name of the chart.
*/}}
{{- define "mychart.name" -}}
{{- default .Chart.Name .Values.nameOverride | trunc 63 | trimSuffix "-" -}}
{{- end -}}

{{/*
Create a default fully qualified app name.
We truncate at 63 chars because some Kubernetes name fields are limited to this (by the DNS naming spec).
If release name contains chart name it will be used as a full name.
*/}}
{{- define "mychart.fullname" -}}
{{- if .Values.fullnameOverride -}}
{{- .Values.fullnameOverride | trunc 63 | trimSuffix "-" -}}
{{- else -}}
{{- $name := default .Chart.Name .Values.nameOverride -}}
{{- if contains $name .Release.Name -}}
{{- .Release.Name | trunc 63 | trimSuffix "-" -}}
{{- else -}}
{{- printf "%s-%s" .Release.Name $name | trunc 63 | trimSuffix "-" -}}
{{- end -}}
{{- end -}}
{{- end -}}

{{/*
Create chart name and version as used by the chart label.
*/}}
{{- define "mychart.chart" -}}
{{- printf "%s-%s" .Chart.Name .Chart.Version | replace "+" "_" | trunc 63 | trimSuffix "-" -}}
{{- end -}}

Tightly coupled resources may go in the same template file where they are logically linked or there is a form of dependency.

An example of logically linked resources is a pair of PersistentVolume and PersistentVolumeClaim definitions. Keeping these together makes debugging and maintenance easier.

PersistentVolume and PersistentVolumeClaim definitions
---
kind: PersistentVolume
apiVersion: v1
metadata:
    name: tangodb-{{ template "tango-chart-example.name" . }}-{{ .Release.Name }}
    namespace: {{ .Release.Namespace }}
    labels:
        app.kubernetes.io/name: tangodb-{{ template "tango-chart-example.name" . }}
        app.kubernetes.io/instance: "{{ .Release.Name }}"
        app.kubernetes.io/managed-by: "{{ .Release.Service }}"
        helm.sh/chart: "{{ template "tango-chart-example.chart" . }}"
spec:
    storageClassName: standard
    capacity:
        storage: 1Gi
    accessModes:
        - ReadWriteOnce
    hostPath:
        path: /data/tangodb-{{ template "tango-chart-example.name" . }}/

---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
    name: tangodb-{{ template "tango-chart-example.name" . }}-{{ .Release.Name }}
    namespace: {{ .Release.Namespace }}
    labels:
        app.kubernetes.io/name: tangodb-{{ template "tango-chart-example.name" . }}
        app.kubernetes.io/instance: "{{ .Release.Name }}"
        app.kubernetes.io/managed-by: "{{ .Release.Service }}"
        helm.sh/chart: "{{ template "tango-chart-example.chart" . }}"
spec:
    storageClassName: standard
    accessModes:
        - ReadWriteOnce
    resources:
        requests:
            storage: 1Gi
    volumeName: tangodb-{{ template "tango-chart-example.name" . }}-{{ .Release.Name }}

An example of dependency is the declaration of a Service before the associated Pod/Deployment/StatefulSet/DaemonSet. Because both are in the same template file, the Service is guaranteed to be evaluated by the Kubernetes API first, so the Pod will get the environment variables set from the Service.

Service before the associated Pod/Deployment
---
apiVersion: v1
kind: Service
metadata:
  name: tango-rest-{{ template "tango-chart-example.name" . }}-{{ .Release.Name }}
  namespace: {{ .Release.Namespace }}
  labels:
    app.kubernetes.io/name: tango-rest-{{ template "tango-chart-example.name" . }}
    app.kubernetes.io/instance: "{{ .Release.Name }}"
    app.kubernetes.io/managed-by: "{{ .Release.Service }}"
    helm.sh/chart: "{{ template "tango-chart-example.chart" . }}"
spec:
  type: ClusterIP
  ports:
  - name: rest
    port: 80
    targetPort: rest
    protocol: TCP
  selector:
    app.kubernetes.io/name: tango-rest-{{ template "tango-chart-example.name" . }}
    app.kubernetes.io/instance: "{{ .Release.Name }}"

---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tango-rest-{{ template "tango-chart-example.name" . }}-{{ .Release.Name }}
  namespace: {{ .Release.Namespace }}
  labels:
    app.kubernetes.io/name: tango-rest-{{ template "tango-chart-example.name" . }}
    app.kubernetes.io/instance: "{{ .Release.Name }}"
    app.kubernetes.io/managed-by: "{{ .Release.Service }}"
    helm.sh/chart: "{{ template "tango-chart-example.chart" . }}"
spec:
  replicas: {{ .Values.tangorest.replicas }}
  # apps/v1 requires an explicit selector that matches the Pod template labels
  selector:
    matchLabels:
      app.kubernetes.io/name: tango-rest-{{ template "tango-chart-example.name" . }}
      app.kubernetes.io/instance: "{{ .Release.Name }}"
  template:
    metadata:
      labels:
        app.kubernetes.io/name: tango-rest-{{ template "tango-chart-example.name" . }}
        app.kubernetes.io/instance: "{{ .Release.Name }}"
        app.kubernetes.io/managed-by: "{{ .Release.Service }}"
        helm.sh/chart: "{{ template "tango-chart-example.chart" . }}"
    spec:
      containers:
      - name: tango-rest
        image: "{{ .Values.tangorest.image.registry }}/{{ .Values.tangorest.image.image }}:{{ .Values.tangorest.image.tag }}"
        imagePullPolicy: {{ .Values.tangorest.image.pullPolicy }}
        command:
        - /usr/local/bin/wait-for-it.sh
        - databaseds-{{ template "tango-chart-example.name" . }}-{{ .Release.Name }}:10000
        - --timeout=30
        - --strict
        - --
        - /usr/bin/supervisord
        - --configuration
        - /etc/supervisor/supervisord.conf
        env:
          - name: TANGO_HOST
            value: databaseds-{{ template "tango-chart-example.name" . }}-{{ .Release.Name }}:10000
        ports:
          - name: rest
            containerPort: 8080
            protocol: TCP
      restartPolicy: Always
{{- with .Values.nodeSelector }}
      nodeSelector:
{{ toYaml . | indent 8 }}
{{- end }}
{{- with .Values.affinity }}
      affinity:
{{ toYaml . | indent 8 }}
{{- end }}
{{- with .Values.tolerations }}
      tolerations:
{{ toYaml . | indent 8 }}
{{- end }}

Note

It may also be necessary to consider the alphabetic ordering of template files if there is a declaration dependency wider than the immediate file, for instance when a Service definition and its environment variables are necessary for multiple Deployment/StatefulSet/DaemonSet definitions. In this case, it may be necessary to use a numerical file prefix such as 00-service-and-pod.yaml, 01-db-statefulset.yaml …

Use comments liberally in the template files to describe the intended purpose of the resource declarations and any other features of the template markup. # YAML comments get copied through to the rendered template output and are a valuable help when debugging template issues with helm template charts/chart-name/ ... .

Managing configuration

Helm charts and the Go templating engine enable separation of application management concerns along multiple lines:

  • resources are broken out into related and named templates.
  • application specific configuration values are placed in ConfigMaps.
  • volatile run time configuration values are placed in the values.yaml file, and then templated into ConfigMaps, container commandline parameters or environment variables as required.
  • sensitive configuration is placed in Secrets.
  • template content is programable (iterators and operators) and this can be parameterised at template rendering time.

Variable names for template substitution should observe the following rules:

  • Use camel-case or lowercase variable names - never hyphenated.
  • Structure parameter values in shallow nested structures to make it easier to pass on the Helm command line eg: --set tangodb.db.connection.host=localhost is convoluted compared to --set tangodb.host=localhost.
  • Use explicitly typed values eg: enabled: false is not enabled: "false".
  • Be careful of how YAML parsers coerce value types - long integers get coerced into scientific notation so if in doubt use strings and type casting eg: foo: "12345678" and {{ .Values.foo | int }}.
  • Use comments liberally in the values.yaml file to describe the intended purpose of variables.
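
The effect of nesting depth on the Helm command line can be sketched with a small helper that flattens a values structure into --set arguments. This is an illustrative sketch only: to_set_args is a hypothetical helper and not part of Helm, and the enabled key is an assumption added for the example.

```python
def to_set_args(values, prefix=""):
    """Flatten a nested values dict into Helm-style --set arguments."""
    args = []
    for key, value in values.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # every nesting level adds another dotted segment to type
            args.extend(to_set_args(value, path))
        elif isinstance(value, bool):
            # keep booleans explicitly typed: false, not "False"
            args.append(f"--set {path}={str(value).lower()}")
        else:
            args.append(f"--set {path}={value}")
    return args


# deeply nested structure: convoluted to override
deep = {"tangodb": {"db": {"connection": {"host": "localhost"}}}}
# shallow structure: short, readable overrides
shallow = {"tangodb": {"host": "localhost", "enabled": False}}

print(to_set_args(deep))     # ['--set tangodb.db.connection.host=localhost']
print(to_set_args(shallow))  # ['--set tangodb.host=localhost', '--set tangodb.enabled=false']
```

The shallow form keeps each override to two path segments, which is the property the rules above aim for.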

Config in ConfigMaps

ConfigMaps can be used to populate Pod configuration files, environment variables and command line parameters where the values are largely stable, and should not be bundled with the container itself. This should include any (small) data artefacts that could be different (hence configured) between different instances of the running containers. Even files that already exist inside a given container image can be overwritten by using the volumeMounts example below.

ConfigMap values in Pods
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: special-config
  namespace: default
data:
  SPECIAL_LEVEL: very
  SPECIAL_TYPE: charming
  example.ini: |-
    property.1=value-1
    property.2=value-2
    property.3=value-3
---
apiVersion: v1
kind: Pod
metadata:
  name: dapi-test-pod
spec:
  containers:
    - name: test-container
      image: k8s.gcr.io/busybox
      # accessing ConfigMap values in the command line from env vars
      command: [ "/bin/sh", "-c", "echo $(SPECIAL_LEVEL_KEY) $(SPECIAL_TYPE_KEY); cat /etc/config/example.ini" ]
      env:
        # reference the map and key to assign to env var
        - name: SPECIAL_LEVEL_KEY
          valueFrom:
            configMapKeyRef:
              name: special-config
              key: SPECIAL_LEVEL
        - name: SPECIAL_TYPE_KEY
          valueFrom:
            configMapKeyRef:
              name: special-config
              key: SPECIAL_TYPE
      volumeMounts:
      # mount a ConfigMap file blob as a configuration file
      - name: config-volume
        mountPath: /etc/config/example.ini
        subPath: example.ini
        readOnly: true
  volumes:
    - name: config-volume
      configMap:
        # Provide the name of the ConfigMap containing the files you want
        # to add to the container
        name: special-config
  restartPolicy: Never
# check the logs with kubectl logs dapi-test-pod
# clean up with kubectl delete pod/dapi-test-pod configmap/special-config
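
From the application's point of view, the values injected above are ordinary environment variables and files. The following is a minimal sketch of how a Python process inside test-container might read them, assuming the env var names and mount path from the example. parse_properties and load_config are hypothetical helpers written for this illustration.

```python
import os


def parse_properties(text):
    """Parse simple key=value lines, such as those in example.ini."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip()
    return settings


def load_config(env=None, ini_path="/etc/config/example.ini"):
    """Combine env vars and the mounted file into one settings dict."""
    env = os.environ if env is None else env
    config = {
        "level": env.get("SPECIAL_LEVEL_KEY", ""),
        "type": env.get("SPECIAL_TYPE_KEY", ""),
    }
    try:
        with open(ini_path, encoding="utf-8") as handle:
            config.update(parse_properties(handle.read()))
    except FileNotFoundError:
        # the file is only present when the ConfigMap volume is mounted
        pass
    return config
```

Because the application only sees env vars and a file path, the same image runs unchanged when the ConfigMap content differs between environments.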

Where configuration objects are large or have a sensitive format, separate these out from the configmaps.yaml file, and then include them using the template directive: (tpl (.Files.Glob "configs/*").AsConfig . ) where the configs/ directory is relative to the charts/my-chart directory.

ConfigMap file blobs separated
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: config-{{ template "tango-chart-example.name" . }}-{{ .Release.Name }}
  labels:
    app.kubernetes.io/name: config-{{ template "tango-chart-example.name" . }}-{{ .Release.Name }}
    app.kubernetes.io/instance: "{{ .Release.Name }}"
    app.kubernetes.io/managed-by: "{{ .Release.Service }}"
    helm.sh/chart: "{{ template "tango-chart-example.chart" . }}"
data:
{{ (tpl (.Files.Glob "configs/*").AsConfig . ) | indent 2  }}

Secrets

Secret information is treated in almost exactly the same way as ConfigMaps. The default configuration (as at v1.14.x) stores Secrets in the etcd database Base64 encoded, which is an encoding and not encryption, so it is possible and expected that the Kubernetes cluster will be configured with encryption at rest (available from v1.13). All account details, passwords, tokens, keys and certificates should be extracted and managed using Secrets.
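
The reason Base64 storage on its own gives no protection can be shown in a few lines: anyone who can read the stored value can reverse it without a key. The password below is the placeholder value used in the examples in this section.

```python
import base64

# encode a value the way it appears in the data field of a Secret
encoded = base64.b64encode(b"mypassword").decode("ascii")

# decoding needs no key or credential, only the encoded text
decoded = base64.b64decode(encoded).decode("utf-8")

print(encoded)  # bXlwYXNzd29yZA==
print(decoded)  # mypassword
```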

As with ConfigMaps, separate Secrets out into the secrets.yaml template.

Secret values in Pods
---
apiVersion: v1
kind: Secret
metadata:
  name: mysecret
type: Opaque
stringData:
  username: myuser
  password: mypassword
  config.yaml: |-
    apiUrl: "https://my.api.com/api/v1"
    username: myuser
    password: mypassword

---
apiVersion: v1
kind: Pod
metadata:
  name: secret-env-pod
spec:
  containers:
  - name: mycontainer
    image: k8s.gcr.io/busybox
    # accessing Secret values in the command line from env vars
    command: [ "/bin/sh", "-c", "echo $(SECRET_USERNAME) $(SECRET_PASSWORD); cat /etc/config/example.yaml" ]
    env:
    - name: SECRET_USERNAME
      valueFrom:
        secretKeyRef:
          name: mysecret
          key: username
    - name: SECRET_PASSWORD
      valueFrom:
        secretKeyRef:
          name: mysecret
          key: password
    volumeMounts:
    - name: foo
      mountPath: "/etc/config"
  volumes:
  - name: foo
    secret:
      secretName: mysecret
      items:
      - key: config.yaml
        path: example.yaml
        mode: 511  # decimal value, equivalent to octal 0777
  restartPolicy: Never
# check the logs with kubectl logs secret-env-pod
# clean up with kubectl delete pod/secret-env-pod secret/mysecret
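
The mode: 511 value in the example above is a decimal number, which is easy to misread as a permission string. A short sketch of the conversion between the familiar octal notation and the decimal form follows. The stricter values shown are illustrative assumptions, not recommendations taken from this document.

```python
def octal_to_decimal(octal_text):
    """Convert a permission string such as '0400' to its decimal value."""
    return int(octal_text, 8)


print(oct(511))                  # 0o777 - the value used in the example
print(octal_to_decimal("0400"))  # 256 - read-only for the owner
print(octal_to_decimal("0644"))  # 420 - owner read/write, others read
```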

Where sensitive data objects are large or have a sensitive format, separate these out from the secrets.yaml file, and then include them using the template directive: (tpl (.Files.Glob "secrets/*").AsSecrets . ) where the secrets/ directory is relative to the charts/my-chart directory.

Secret file blobs separated
---
apiVersion: v1
kind: Secret
metadata:
  name: secret-{{ template "tango-chart-example.name" . }}-{{ .Release.Name }}
  labels:
    app.kubernetes.io/name: secret-{{ template "tango-chart-example.name" . }}-{{ .Release.Name }}
    app.kubernetes.io/instance: "{{ .Release.Name }}"
    app.kubernetes.io/managed-by: "{{ .Release.Service }}"
    helm.sh/chart: "{{ template "tango-chart-example.chart" . }}"
type: Opaque
data:
{{ (tpl (.Files.Glob "secrets/*").AsSecrets . ) | indent 2  }}

Storage

PersistentVolumes and partner PersistentVolumeClaims should be defined by default in a separate template. This template should be bracketed with a switch that enables the storage declaration to be turned off (eg: {{ if .Values.tangodb.createpv }}), as storage will most likely be dependent on, and optimised for, each environment.

On the PersistentVolume:

  • All storage should be treated as ephemeral by setting persistentVolumeReclaimPolicy: Delete.
  • Explicitly set volume mode eg: volumeMode: Filesystem so that it is clear whether Filesystem or Block is being requested.
  • Explicitly set the access mode eg: ReadWriteOnce, ReadOnlyMany, or ReadWriteMany so that it is clear what access rights containers are expected to have.
  • Always specify the storage class - this should default to standard eg: storageClassName: standard given that the default target environment is Minikube.

On the PersistentVolumeClaim:

  • Always specify the matching storage class eg: storageClassName: standard, so that it will bind to the intended PersistentVolume storage class.
  • Where possible, always specify an explicit PersistentVolume with volumeName eg: volumeName: tangodb-tango-chart-example-test. This will force the PersistentVolumeClaim to bind to a specific PersistentVolume and storage class, avoiding the loose binding issues that volumes can have.

Storage In Kubernetes Clusters Managed by the Systems Team

In any of the existing deployed Kubernetes clusters there are a number of default StorageClasses available that are backed by Ceph and integrated using Rook. The StorageClasses expose RBD block devices and CephFS Network File System based storage to Kubernetes.

The StorageClasses are as follows:

Classname   Maps to           Usage
nfss1       CephFS            Shared Network Filesystem - ReadWriteMany
nfs         alias to nfss1    Shared Network Filesystem - ReadWriteMany
bds1        RBD               Single concurrent use ext4 - ReadWriteOnce
block       alias to bds1     Single concurrent use ext4 - ReadWriteOnce

The StorageClass naming convention follows this pattern:

<xxx type><x class><n version>[-<location>]

  • xxx type - bd=block device, nfs=network filesystem
  • x class - s=standard,i=iops optimised (could be ssd/nvme), t=throughput optimised (could be hdd, or cheaper ssd)
  • n version - 1=first version,…
  • location - future tag for denoting location context, rack, dc, etc
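
The naming pattern above can be checked mechanically. The following is a minimal sketch of a parser for it, assuming the type and class codes listed above are the complete set. parse_storage_class and the bdi2-rack1 name are hypothetical and used only for illustration.

```python
import re
from typing import NamedTuple, Optional

TYPES = {"bd": "block device", "nfs": "network filesystem"}
CLASSES = {"s": "standard", "i": "iops optimised", "t": "throughput optimised"}
ALIASES = {"block": "bds1", "nfs": "nfss1"}

# <type><class><version>[-<location>]
PATTERN = re.compile(r"^(bd|nfs)([sit])(\d+)(?:-(.+))?$")


class StorageClassName(NamedTuple):
    type: str
    storage_class: str
    version: int
    location: Optional[str]


def parse_storage_class(name):
    """Split a StorageClass name into its convention components."""
    name = ALIASES.get(name, name)  # resolve shortcuts such as 'block'
    match = PATTERN.match(name)
    if match is None:
        raise ValueError("not a conventional StorageClass name: %s" % name)
    kind, cls, version, location = match.groups()
    return StorageClassName(TYPES[kind], CLASSES[cls], int(version), location)


print(parse_storage_class("nfss1"))
print(parse_storage_class("block"))
print(parse_storage_class("bdi2-rack1"))  # hypothetical future class
```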

Current classes:

  • bds1 - block device - single mount (ReadWriteOnce) - standard - version 1
  • nfss1 - network filesystem enabled storage (ReadWriteMany) - standard - version 1
  • block = shortcut for bds1
  • nfs = shortcut for nfss1

Tests

Helm Chart tests live in the templates/tests directory, and are essentially one Pod per file that must be run-to-completion (i.e. restartPolicy: Never). These Pods are annotated in one of two ways:

  • "helm.sh/hook": test-success - Pod is expected to exit with return code 0
  • "helm.sh/hook": test-failure - Pod is expected to exit with a non-zero return code

This is a simple solution for test assertions at the Pod scale.
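
The assertion rule described by the two annotations amounts to a comparison between the hook type and the Pod's exit code. The sketch below only illustrates that rule and is not how Helm itself is implemented. hook_outcome is a hypothetical name.

```python
def hook_outcome(hook, exit_code):
    """Return True when the test Pod's exit code matches its hook type."""
    if hook == "test-success":
        return exit_code == 0   # the Pod must exit cleanly
    if hook == "test-failure":
        return exit_code != 0   # the Pod is expected to fail
    raise ValueError("unknown test hook: %s" % hook)


print(hook_outcome("test-success", 0))  # True
print(hook_outcome("test-failure", 0))  # False
```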

As with any other resource definition, tests should have name and metadata correctly scoping them. End the Pod name with a string that indicates what the test does, suffixed with -test.

Helm tests must be self-contained, and should be atomic and non-destructive, as the intention is that a chart user can use the tests to determine that the chart installed correctly. The following example tests that Pods can reach the DatabaseDS service. Other tests might check that services are correctly exposed via Ingress.

Helm Chart test Pod - metadata and annotations on a simple connection test
---
apiVersion: v1
kind: Pod
metadata:
  name: databaseds-{{ template "tango-chart-example.name" . }}-{{ .Release.Name }}-connection-test
  namespace: {{ .Release.Namespace }}
  labels:
    app.kubernetes.io/name: databaseds-{{ template "tango-chart-example.name" . }}
    app.kubernetes.io/instance: "{{ .Release.Name }}"
    app.kubernetes.io/managed-by: "{{ .Release.Service }}"
    helm.sh/chart: "{{ template "tango-chart-example.chart" . }}"
  annotations:
    "helm.sh/hook": test-success
spec:
  {{- if .Values.pullSecrets }}
  imagePullSecrets:
  {{- range .Values.pullSecrets }}
    - name: {{ . }}
  {{- end}}
  {{- end }}
  containers:
  - name: databaseds-{{ template "tango-chart-example.name" . }}-{{ .Release.Name }}-connection-test
    image: "{{ .Values.powersupply.image.registry }}/{{ .Values.powersupply.image.image }}:{{ .Values.powersupply.image.tag }}"
    imagePullPolicy: {{ .Values.powersupply.image.pullPolicy }}
    command:
      - sh
    args:
      - -c
      - "( retry --max=10 -- tango_admin --ping-device test/power_supply/1 ) && echo 'test OK'"
    env:
    - name: TANGO_HOST
      value: databaseds-{{ template "tango-chart-example.name" . }}-{{ .Release.Name }}:10000
  restartPolicy: Never
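
The test above wraps its check in a bounded retry so that a slow-starting service does not fail the test on the first attempt. The same idea can be sketched in Python as follows. This is not the implementation of the retry tool used in the example, and the run parameter is an assumption added so the loop can be exercised without a real command.

```python
import subprocess
import time


def retry(command, max_attempts=10, delay=1.0, run=subprocess.call):
    """Run command until it exits 0, giving up after max_attempts.

    Returns the attempt number that succeeded, or None if all failed.
    """
    for attempt in range(1, max_attempts + 1):
        if run(command) == 0:
            return attempt
        if attempt < max_attempts:
            time.sleep(delay)  # give the service time to come up
    return None


# simulate a service that answers on the third attempt
results = iter([1, 1, 0])
print(retry(["ping-device"], delay=0, run=lambda cmd: next(results)))  # 3
```

Bounding the number of attempts keeps the test Pod run-to-completion: it either succeeds or exits with a failure instead of waiting indefinitely.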

Integrating a chart into the SKAMPI repo

Prerequisites

To integrate a helm chart into the SKAMPI repo, follow these steps:

Local steps

  • Clone the SKAMPI repo, available here.

  • Add a directory in charts with a descriptive name

  • Add your helm chart and associated files within that directory

  • Check the validity of the chart

    • Verify that the chart is formatted correctly

      helm lint ./charts/<your_chart_directory>/
      
    • Verify that the templates are rendered correctly and the output is as expected

      helm install --dry-run --debug --generate-name ./charts/<your_chart_directory>/
      
    • Check that your chart deploys locally (utilising minikube as per our standards) and behaves as expected

      make deploy KUBE_NAMESPACE=integration
      make deploy KUBE_NAMESPACE=integration HELM_CHART=<your_chart_directory>
      
  • Once functionality has been confirmed, go ahead and commit and push the changes

Gitlab

Once the changes have been pushed, they will be built in GitLab. Find the pipeline builds at https://gitlab.com/ska-telescope/skampi/pipelines.

If the pipeline completes successfully, the full integration environment will be available at https://integration.engageska-portugal.pt.

Kubernetes primitives

The following focuses on the core Kubernetes primitives - Pod, Service, and Ingress. These provide the core delivery chain of a networked application to the end consumer.

The Pod

The Pod is the basic deployable application unit in Kubernetes, and provides the primary configurable context of an application component. Within this construct, all configuration and resources are plugged in to the application.

This is a complete example that demonstrates container patterns, initContainers and life-cycle hooks discussed in the following sections.

Container patterns and life-cycle hooks
---
kind: Service
apiVersion: v1
metadata:
  name: pod-examples
spec:
  type: ClusterIP
  selector:
    app: pod-examples
  ports:
  - name: http
    protocol: TCP
    port: 80
    targetPort: http

---
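apiVersion: apps/v1
kind: Deployment
metadata:
  name: pod-examples
  labels:
    app: pod-examples
spec:
  replicas: 1
  # apps/v1 requires an explicit selector matching the Pod template labels
  selector:
    matchLabels:
      app: pod-examples
  template: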
    metadata:
      labels:
        app: pod-examples
    spec:
      volumes:
      # lifecycle containers as hooks share state using volumes
      - name: shared-data
        emptyDir: {}
      - name: the-end
        hostPath:
          path: /tmp
          type: Directory

      initContainers:
      # initContainers can initialise data, and do pre-flight checks
      - name: init-container
        image: alpine
        command: ['sh', '-c', "echo 'initContainer says: hello!' > /pod-data/status.txt"]
        volumeMounts:
        - name: shared-data
          mountPath: /pod-data

      containers:
      # primary data generator container
      - name: main-app-container
        image: alpine
        command: ["/bin/sh"]
        args: ["-c", "while true; do echo 'Main app says: ' `date` >> /pod-data/status.txt; sleep 5;done"]
        lifecycle:
          # postStart hook is async task called on Pod boot
          # useful for async container warmup tasks that are not hard dependencies
          # definitely not guaranteed to run before main container command
          postStart:
            exec:
              command: ["/bin/sh", "-c", "echo 'Hello from the postStart handler' >> /pod-data/status.txt"]
          # preStop hook is called on Pod termination, before the TERM signal is sent
          # useful for initiating termination cleanup tasks
          # not guaranteed to complete before the grace period expires and the container is killed (sig KILL)
          preStop:
            exec:
              command: ["/bin/sh", "-c", "echo 'Hello from the preStop handler' >> /the-end/last.txt"]
        volumeMounts:
        - name: shared-data
          mountPath: /pod-data
        - name: the-end
          mountPath: /the-end

      # Sidecar helper that exposes data over http
      - name: sidecar-nginx-container
        image: nginx
        ports:
          - name: http
            containerPort: 80
            protocol: TCP
        volumeMounts:
        - name: shared-data
          mountPath: /usr/share/nginx/html
        livenessProbe:
          httpGet:
            path: /index.html
            port: http
        readinessProbe:
          httpGet:
            path: /index.html
            port: http

      # Ambassador pattern used as a proxy or shim to access external inputs
      # gets date from Google and adds it to input
      - name: ambassador-container
        image: alpine
        command: ["/bin/sh"]
        args: ["-c", "while true; do echo 'Ambassador says: '`wget -S -q 'https://google.com/' 2>&1 | grep -i '^  Date:' | head -1 | sed 's/^  [Dd]ate: //g'` > /pod-data/input.txt; sleep 60; done"]
        volumeMounts:
        - name: shared-data
          mountPath: /pod-data

      # Adapter pattern used as a proxy or shim to generate/render outputs
      # fit for external consumption (similar to Sidecar)
      # reformats input data from sidecar and ambassador ready for output
      - name: adapter-container
        image: alpine
        command: ["/bin/sh"]
        args: ["-c", "while true; do cat /pod-data/status.txt | head -3 > /pod-data/index.html; cat /pod-data/input.txt | head -1 >> /pod-data/index.html; cat /pod-data/status.txt | tail -1 >> /pod-data/index.html;  echo 'All from your friendly Adapter' >> /pod-data/index.html; sleep 5; done"]
        volumeMounts:
        - name: shared-data
          mountPath: /pod-data

This will produce output that demonstrates each of the containers fulfilling their role:

$ curl http://`kubectl get service/pod-examples -o jsonpath="{.spec.clusterIP}"`
initContainer says: hello!
Main app says:  Thu May 2 03:45:42 UTC 2019
Hello from the postStart handler
Ambassador says: Thu, 02 May 2019 03:45:55 GMT
Main app says:  Thu May 2 03:46:12 UTC 2019
All from your friendly Adapter

$ kubectl delete deployment/pod-examples service/pod-examples
deployment.apps "pod-examples" deleted
service "pod-examples" deleted
$ cat /tmp/last.txt
Hello from the preStop handler

Container patterns

The Pod is a group of one or more containers that share the same resource namespaces. This enables the containers in a Pod to communicate as though they are on the same host, which preserves the one-process-per-container ideal while still delivering the orchestrated processes as a single application whose parts can be maintained separately.

All Pod deployments should be designed around having a core or leading container. All other containers in the Pod provide auxiliary or secondary services. There are three main patterns for multi-container Pods:

  • Sidecar - extend the primary container functionality eg: adds logging, metrics, health checks (as input to livenessProbe/readinessProbe).
  • Ambassador - container that acts as an out-bound proxy for the primary container by handling translations to external services.
  • Adapter - container that acts as an in-bound proxy for the primary container aligning interfaces with alternative standards.

initContainers

Any serial container action that does not neatly fit into the one-process-per-container pattern should be placed in an initContainer. These are typically actions like initialising databases, checking for upgrade processes, and executing migrations. initContainers are executed in order, and if any one of them fails, the Pod will be restarted in line with the restartPolicy. With this behaviour, it is important to ensure that the initContainer actions are idempotent, or there will be harmful side effects on restarts.
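
One common way to make an initContainer action idempotent is to record completion in a marker and skip the work when the marker already exists. The following is a minimal sketch of that approach. run_once and the marker path are hypothetical, and the marker is assumed to live on storage that outlives a restart (for example a mounted volume).

```python
import os
import tempfile


def run_once(marker_path, action):
    """Run action unless a previous run already recorded success."""
    if os.path.exists(marker_path):
        return False  # already done: a restart must not repeat the work
    action()
    # write the marker only after the action succeeded
    with open(marker_path, "w", encoding="utf-8") as marker:
        marker.write("done\n")
    return True


# simulate the initContainer being executed twice
with tempfile.TemporaryDirectory() as workdir:
    marker = os.path.join(workdir, "migration.done")
    print(run_once(marker, lambda: print("running migration")))  # True
    print(run_once(marker, lambda: print("running migration")))  # False
```

If the action fails part-way, no marker is written and the next restart tries again, so the action itself still needs to tolerate partial completion.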

postStart/preStop

Life-cycle hooks have very few effective use cases. There is no guarantee that a postStart task will run before the main container command does (this is demonstrated above), and there is no guarantee that a preStop task (which is only issued when a Pod is terminated - not when it completes) will finish before the KILL signal is issued to the parent container once the termination grace period expires (terminationGracePeriodSeconds, 30s by default).

The lifecycle hooks are generally best reserved for:

  • postStart - running an asynchronous non-critical task in the parent container that would otherwise slow down the boot time for the Pod and impact service availability.
  • preStop - initiating asynchronous clean up tasks via an external service - essentially an opportunity to send a quick message out before the Pod is fully terminated.

readinessProbe/livenessProbe

Readiness probes are used by the kubelet to determine whether the container is in a state ready to serve requests. Liveness probes are used by the kubelet to determine whether the container continues to be in a healthy state for serving requests. Where possible, both livenessProbe and readinessProbe should be specified. The results are used automatically to calculate whether a Pod is available and healthy, and whether it should be added to, and load balanced by, a Service. These features can play an important role in the continuity of service when clusters are auto-healed, workloads are shifted from node to node, or during rolling updates to deployments.

The following shows the registered probes and their status for the sidecar container in the examples above:

$ kubectl describe deployment.apps/pod-examples
...
sidecar-nginx-container:
    Image:        nginx
    Port:         80/TCP
    Host Port:    0/TCP
    Liveness:     http-get http://:http/index.html delay=0s timeout=1s period=10s #success=1 #failure=3
    Readiness:    http-get http://:http/index.html delay=0s timeout=1s period=10s #success=1 #failure=3
    Environment:  <none>
    Mounts:
    /usr/share/nginx/html from shared-data (rw)
...

While probes can be a command, it is better to make health checks an http service that is combined with an application metrics handler so that external applications can use the same feature to do health checking (eg: Prometheus, or Icinga).
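
A combined health and metrics endpoint of this kind can be sketched with the Python standard library alone. The /healthz and /metrics paths, the port and the app_uptime_seconds metric are assumptions chosen for illustration. A real application would report its own state and would more likely use a metrics client library.

```python
import json
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

START_TIME = time.time()


class HealthHandler(BaseHTTPRequestHandler):
    """Serve a probe endpoint and a plain-text metrics endpoint."""

    def do_GET(self):
        if self.path == "/healthz":
            # target for livenessProbe/readinessProbe httpGet
            self._reply(200, "application/json", json.dumps({"status": "ok"}))
        elif self.path == "/metrics":
            # one line of 'name value' text for an external scraper
            uptime = time.time() - START_TIME
            self._reply(200, "text/plain", "app_uptime_seconds %.0f\n" % uptime)
        else:
            self._reply(404, "text/plain", "not found\n")

    def _reply(self, status, content_type, body):
        payload = body.encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", content_type)
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

    def log_message(self, format, *args):
        pass  # keep frequent probe requests out of the application log


def make_server(port=8080, host="0.0.0.0"):
    """Build the server; call serve_forever() on the result to run it."""
    return HTTPServer((host, port), HealthHandler)
```

Running make_server().serve_forever() in a background thread of the application lets the probes and an external monitoring system use the same handler.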

Sharing, Networking, Devices, Host Resource Access

Sharing resources is often the bottleneck in High Performance Computing, and is where the greatest attention to detail is required with containerised applications in order to gain acceptable performance and efficiency.

Containers within a Pod can share resources with each other directly using shared volumes, network, and memory. These are the preferred methods because they are cross-platform portable for containers in general, Kubernetes and OS/hardware.

The following example demonstrates how to share memory as a volume between containers:

Pod containers sharing memory
---
kind: Service
apiVersion: v1
metadata:
  name: pod-sharing-memory-examples
  labels:
    app: pod-sharing-memory-examples
spec:
  type: ClusterIP
  selector:
    app: pod-sharing-memory-examples
  ports:
  - name: ncat
    protocol: TCP
    port: 5678
    targetPort: ncat

---
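apiVersion: apps/v1
kind: Deployment
metadata:
  name: pod-sharing-memory-examples
  labels:
    app: pod-sharing-memory-examples
spec:
  replicas: 1
  # apps/v1 requires an explicit selector matching the Pod template labels
  selector:
    matchLabels:
      app: pod-sharing-memory-examples
  template: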
    metadata:
      labels:
        app: pod-sharing-memory-examples
    spec:
      containers:
      # Producer - write to shared memory
      - name: producer-container
        image: python:3.7
        command: ["/bin/sh"]
        args: ["-c", "python3 /src/mmapexample.py -p; sleep infinity"]
        volumeMounts:
        - name: src
          mountPath: /src/mmapexample.py
          subPath: mmapexample.py
          readOnly: true
        - mountPath: /dev/shm
          name: dshm

      # Consumer - read from shared memory and publish on 5678
      - name: consumer-container
        image: python:3.7
        command: ["/bin/sh"]
        # mutating container - this is bad practice but we need netcat for this example
        args: ["-c", "apt-get update; apt-get -y install netcat-openbsd; python3 -u /src/mmapexample.py | nc -l -k -p 5678; sleep infinity"]
        ports:
        - name: ncat
          containerPort: 5678
          protocol: TCP
        volumeMounts:
        - name: src
          mountPath: /src/mmapexample.py
          subPath: mmapexample.py
          readOnly: true
        - mountPath: /dev/shm
          name: dshm

      volumes:
        - name: src
          configMap:
            name: pod-sharing-memory-examples
        - name: dshm
          emptyDir:
            medium: Memory

    # test with:
    # $ nc `kubectl get service/pod-sharing-memory-examples -o jsonpath="{.spec.clusterIP}"` 5678
    # Producers says: 2019-05-05 19:21:10
    # Producers says: 2019-05-05 19:21:11
    # Producers says: 2019-05-05 19:21:12
    # $ kubectl delete deployment,svc,configmap -l app=pod-sharing-memory-examples
    # deployment.apps "pod-sharing-memory-examples" deleted
    # service "pod-sharing-memory-examples" deleted
    # configmap "pod-sharing-memory-examples" deleted
    # debug with: kubectl logs -l app=pod-sharing-memory-examples -c producer-container

---
apiVersion: v1
kind: ConfigMap
metadata:
  name: pod-sharing-memory-examples
  labels:
    app: pod-sharing-memory-examples
data:
  mmapexample.py: |-
    #!/usr/bin/env python3
    # -*- coding: utf-8 -*-
    """ example mmap python client
    """

    import datetime
    import time
    import getopt
    import os
    import os.path
    import sys
    import logging
    from collections import namedtuple
    import mmap
    import signal


    def parse_opts():
        """ Parse out the command line options
        """
        options = {
            'mqueue': "/example_shared_memory_queue",
            'debug': False,
            'producer': False
        }

        try:
            (opts, _) = getopt.getopt(sys.argv[1:],
                                    'dpm:',
                                    ["debug",
                                    "producer",
                                    "mqueue="])
        except getopt.GetoptError:
            print('mmapexample.py [-d -p -m <message_queue_name>]')
            sys.exit(2)

        dopts = {}
        for (key, value) in opts:
            dopts[key] = value
        if '-p' in dopts:
            options['producer'] = True
        if '-m' in dopts:
            options['mqueue'] = dopts['-m']
        if '-d' in dopts:
            options['debug'] = True

        # container class for options parameters
        option = namedtuple('option', options.keys())
        return option(**options)


    # main
    def main():
        """ Main
        """
        options = parse_opts()

        # setup logging
        logging.basicConfig(level=(logging.DEBUG if options.debug
                                else logging.INFO),
                            format=('%(asctime)s [%(name)s] ' +
                                    '%(levelname)s: %(message)s'))
        logging.info('mqueue: %s mode: %s', options.mqueue,
                    ('Producer' if options.producer else 'Consumer'))

        # trap the keyboard interrupt
        def signal_handler(signal_caught, frame):
            """ Catch the keyboard interrupt and gracefully exit
            """
            logging.info('You pressed Ctrl+C!: %s/%s', signal_caught, frame)
            sys.exit(0)

        signal.signal(signal.SIGINT, signal_handler)

        mqueue_fd = os.open("/dev/shm/" + options.mqueue,
                            os.O_RDWR | os.O_SYNC | os.O_CREAT)

        last = ""
        while True:
            try:
                if options.producer:
                    now = datetime.datetime.now()
                    data = "Producers says: %s\n" % \
                        (now.strftime("%Y-%m-%d %H:%M:%S"))
                    logging.debug('sending out to mqueue: %s', data)
                    os.ftruncate(mqueue_fd, 512)
                    with mmap.mmap(mqueue_fd, 0) as mqueue:
                        mqueue.seek(0)
                        mqueue[0:len(data)] = data.encode('utf-8')
                        mqueue.flush()
                else:
                    with mmap.mmap(mqueue_fd, 0,
                                access=mmap.ACCESS_READ) as mqueue:
                        mqueue.seek(0)
                        data = mqueue.readline().rstrip().decode('utf-8')
                        logging.debug('from mqueue: %s', data)
                        if data == last:
                            logging.debug('same as last time - skipping')
                        else:
                            last = data
                            sys.stdout.write(data+"\n")
                            sys.stdout.flush()
            except Exception as ex:                 # pylint: disable=broad-except
                logging.debug('error: %s', repr(ex))

            time.sleep(1)

        logging.info('Finished')
        sys.exit(0)


    # main
    if __name__ == "__main__":

        main()

The following example demonstrates how to share memory over POSIX IPC between containers:

Pod containers sharing memory over POSIX IPC
---
kind: Service
apiVersion: v1
metadata:
  name: pod-ipc-sharing-examples
  labels:
    app: pod-ipc-sharing-examples
spec:
  type: ClusterIP
  selector:
    app: pod-ipc-sharing-examples
  ports:
  - name: ncat
    protocol: TCP
    port: 1234
    targetPort: ncat

---
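apiVersion: apps/v1
kind: Deployment
metadata:
  name: pod-ipc-sharing-examples
  labels:
    app: pod-ipc-sharing-examples
spec:
  replicas: 1
  # apps/v1 requires an explicit selector matching the Pod template labels
  selector:
    matchLabels:
      app: pod-ipc-sharing-examples
  template: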
    metadata:
      labels:
        app: pod-ipc-sharing-examples
    spec:
      volumes:
      - name: shared-data
        emptyDir: {}

      initContainers:
      # get and build the ipc shmem tool
      - name: builder-container
        image: golang:1.11
        command: ['sh', '-c', "export GOPATH=/src; go get github.com/ghetzel/shmtool"]
        volumeMounts:
        - name: shared-data
          mountPath: /src

      containers:
      # Producer
      - name: producer-container
        image: alpine
        command: ["/bin/sh"]

        args:
        - "-c"
        - >
          apk add -U util-linux;
          mkdir /lib64 && ln -s /lib/libc.musl-x86_64.so.1 /lib64/ld-linux-x86-64.so.2;
          ipcmk --shmem 1KiB;
          echo "ipcmk again as shmtool cannot handle 0 SHMID";
          ipcmk --shmem 1KiB > /pod-data/memaddr.txt;
          while true;
           do echo 'Main app (pod-ipc-sharing-examples) says: ' `date` | /pod-data/bin/shmtool open -s 1024 `ipcs -m | cut -d' ' -f 2 | sed  '/^$/d' | tail -1`;
              sleep 1;
           done
        volumeMounts:
        - name: shared-data
          mountPath: /pod-data

      # Consumer - read from shared memory and publish on 1234
      - name: consumer-container
        image: alpine
        command: ["/bin/sh"]
        args:
        - "-c"
        - >
          apk add --update coreutils util-linux;
          mkdir /lib64 && ln -s /lib/libc.musl-x86_64.so.1 /lib64/ld-linux-x86-64.so.2;
          sleep 3;
          (while true;
             do /pod-data/bin/shmtool read `ipcs -m | cut -d' ' -f 2 | sed  '/^$/d' | tail -1`;
                sleep 1;
             done) | stdbuf -i0 nc -l -k -p 1234
        ports:
        - name: ncat
          containerPort: 1234
          protocol: TCP
        volumeMounts:
        - name: shared-data
          mountPath: /pod-data

# test with:
#  $ nc `kubectl get service/pod-ipc-sharing-examples -o jsonpath="{.spec.clusterIP}"` 1234
#  Main app (pod-ipc-sharing-examples) says:  Tue May 7 20:46:03 UTC 2019
#  Main app (pod-ipc-sharing-examples) says:  Tue May 7 20:46:04 UTC 2019
#  Main app (pod-ipc-sharing-examples) says:  Tue May 7 20:46:05 UTC 2019
# $ kubectl delete deployment,svc -l app=pod-ipc-sharing-examples
# deployment.apps "pod-ipc-sharing-examples" deleted
# service "pod-ipc-sharing-examples" deleted

The following example demonstrates how to share over a named pipe between containers:

Pod containers sharing over named pipe
---
kind: Service
apiVersion: v1
metadata:
  name: pod-sharing-examples
  labels:
    app: pod-sharing-examples
spec:
  type: ClusterIP
  selector:
    app: pod-sharing-examples
  ports:
  - name: ncat
    protocol: TCP
    port: 1234
    targetPort: ncat

---
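apiVersion: apps/v1
kind: Deployment
metadata:
  name: pod-sharing-examples
  labels:
    app: pod-sharing-examples
spec:
  replicas: 1
  # apps/v1 requires an explicit selector matching the Pod template labels
  selector:
    matchLabels:
      app: pod-sharing-examples
  template: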
    metadata:
      labels:
        app: pod-sharing-examples
    spec:
      volumes:
      # the named pipe lives on a volume shared by all containers in the Pod
      - name: shared-data
        emptyDir: {}

      initContainers:
      # Setup the named pipe for inter-container communication
      - name: init-container
        image: alpine
        command: ['sh', '-c', "mkfifo /pod-data/piper"]
        volumeMounts:
        - name: shared-data
          mountPath: /pod-data

      containers:
      # Producer
      - name: producer-container
        image: alpine
        command: ["/bin/sh"]
        args: ["-c", "while true; do echo 'Main app (pod-sharing-examples) says: ' `date` >> /pod-data/piper; sleep 1;done"]
        volumeMounts:
        - name: shared-data
          mountPath: /pod-data

      # Consumer - read from the pipe and publish on 1234
      - name: consumer-container
        image: alpine
        command: ["/bin/sh"]
        args: ["-c", "apk add --update coreutils; tail -f /pod-data/piper | stdbuf -i0 nc -l -k -p 1234"]
        ports:
        - name: ncat
          containerPort: 1234
          protocol: TCP
        volumeMounts:
        - name: shared-data
          mountPath: /pod-data

# test with:
#  $ nc `kubectl get service/pod-sharing-examples -o jsonpath="{.spec.clusterIP}"` 1234
#  Main app says:  Thu May 2 20:48:56 UTC 2019
#  Main app says:  Thu May 2 20:49:53 UTC 2019
#  Main app says:  Thu May 2 20:49:56 UTC 2019
# $ kubectl delete deployment,svc -l app=pod-sharing-examples
# deployment.extensions "pod-sharing-examples" deleted
# service "pod-sharing-examples" deleted

The following example demonstrates how to share over the localhost network between containers:

Pod containers sharing over localhost network
---
kind: Service
apiVersion: v1
metadata:
  name: pod-sharing-network-examples
  labels:
    app: pod-sharing-network-examples
spec:
  type: ClusterIP
  selector:
    app: pod-sharing-network-examples
  ports:
  - name: ncat
    protocol: TCP
    port: 5678
    targetPort: ncat

---
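apiVersion: apps/v1
kind: Deployment
metadata:
  name: pod-sharing-network-examples
  labels:
    app: pod-sharing-network-examples
spec:
  replicas: 1
  # apps/v1 requires an explicit selector matching the Pod template labels
  selector:
    matchLabels:
      app: pod-sharing-network-examples
  template: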
    metadata:
      labels:
        app: pod-sharing-network-examples
    spec:
      containers:
      # Producer
      - name: producer-container
        image: alpine
        command: ["/bin/sh"]
        args: ["-c", "apk add --update coreutils; (while true; do echo 'Main app (pod-sharing-network-examples) says: ' `date`; sleep 1; done) | stdbuf -i0 nc -lk -p 1234"]

      # Consumer - read from the local port and publish on 5678
      - name: consumer-container
        image: alpine
        command: ["/bin/sh"]
        args: ["-c", "apk add --update coreutils; nc localhost 1234 | stdbuf -i0 nc -l -k -p 5678"]
        ports:
        - name: ncat
          containerPort: 5678
          protocol: TCP

    # test with:
    #  $ nc `kubectl get service/pod-sharing-network-examples -o jsonpath="{.spec.clusterIP}"` 5678
    #  Main app says:  Thu May 2 20:48:56 UTC 2019
    #  Main app says:  Thu May 2 20:49:53 UTC 2019
    #  Main app says:  Thu May 2 20:49:56 UTC 2019
    # $ kubectl delete deployment,svc -l app=pod-sharing-network-examples
    # deployment.extensions "pod-sharing-network-examples" deleted
    # service "pod-sharing-network-examples" deleted

Performance-driven networking requirements are a concern with HPC. Often the solution is to bind an application directly to a specific host network adapter. Historically, the solution for this in containers has been to escalate the privileges of the container so that it is running in the host namespace. This is achieved in Kubernetes using the following approach:

...
spec:
  containers:
    - name: my-privileged-container
      securityContext:
        privileged: true
...

This SHOULD be avoided at all costs. This pushes the container into the host namespace for processes, network and storage. A critical side effect of this is that any port that the container consumes can conflict with host services, which means that ONLY a single instance of this container can run on any given host. Beyond these functional concerns, it is a serious security risk, as the privileged container has full (root) access to the node, including any applications (and containers) running there.

To date, the only valid exceptions discovered have been:

  • Core daemon services running for the Kubernetes and OpenStack control plane that are deployed as containers but are node level services.
  • Storage, Network, or Device Kubernetes plugins that need to deploy OS kernel drivers.

As a first step to resolving a networking issue, the Kubernetes and Platform management team should always be approached to help resolve architectural issues to avoid this approach. In the event of not being able to reconcile the requirement, then the following hostNetwork solution should be attempted first:

...
spec:
  # hostNetwork is a Pod level setting, not part of the container securityContext
  hostNetwork: true
  containers:
    - name: my-hostnetwork-container
...

Use of Services

Service resources should be defined in the same template file as the associated application deployment and ordered at the top. This will ensure that service related environment variables will be passed into the deployment at scheduling time. It is good practice to have only a single Service resource per deployment that covers the port mapping/exposure for each application port. It is also important to have only one deployment per Service, as mapping a Service to more than one application makes debugging considerably harder. As part of this, ensure that the selector definition is specific to the fully qualified deployment, including release and version, to prevent leakage across multiple deployment versions. Fully qualify port definitions with name, port, protocol and targetPort so that the interface is self documenting. Setting targetPort to the same name as the port name is encouraged, as this can give useful hints as to the function of the container interface.

Service resource with fully qualified port description and specific selector
---
apiVersion: v1
kind: Service
metadata:
  name: tango-rest-{{ template "tango-chart-example.name" . }}-{{ .Release.Name }}
  namespace: {{ .Release.Namespace }}
  labels:
    app.kubernetes.io/name: tango-rest-{{ template "tango-chart-example.name" . }}
    app.kubernetes.io/instance: "{{ .Release.Name }}"
    app.kubernetes.io/managed-by: "{{ .Release.Service }}"
    helm.sh/chart: "{{ template "tango-chart-example.chart" . }}"
spec:
  type: ClusterIP
  ports:
  - name: rest
    protocol: TCP
    port: 80
    targetPort: rest
  selector:
    app.kubernetes.io/name: tango-rest-{{ template "tango-chart-example.name" . }}
    app.kubernetes.io/instance: "{{ .Release.Name }}"

type: ClusterIP is the default and should almost always be used and declared. NodePort should only be used under exceptional circumstances as it will reserve a fixed port on the underlying node using up the limited node port address range resource.

Only expose ports that are actually needed external to the deployment. This will help reduce clutter and reduce the surface area for attack on an application.
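
As a minimal sketch of the named targetPort convention described above, the container in the associated deployment declares a port whose name matches the Service targetPort. The container name and the containerPort number 8080 are assumptions for illustration and are not taken from the example chart:

```yaml
# Fragment of the Pod template that pairs with the Service above.
# The container name and containerPort 8080 are illustrative assumptions.
containers:
- name: tango-rest
  ports:
  - name: rest            # resolved by "targetPort: rest" in the Service
    containerPort: 8080
    protocol: TCP
```

Because the Service refers to the port by name, the container port number can change without the Service definition needing to be touched.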

Use of Ingress

A Helm chart represents an application to be deployed, so it follows that it is best practice to have a single Ingress resource per chart. This represents the single frontend for an application that exposes it to the outside world (relative to the Kubernetes cluster). If a chart seemingly requires multiple hostnames and/or has services that want to inhabit the same port or URI space, then consider splitting this into multiple charts so that the component application can be published independently.

It is useful to parameterise the control of SSL/TLS configuration so that it can be opted into under various deployment strategies (as below).

One Ingress per chart with TLS parameterised
---
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: rest-api-{{ template "tango-chart-example.name" . }}-{{ .Release.Name }}
  labels:
    app.kubernetes.io/name: rest-{{ template "tango-chart-example.name" . }}
    app.kubernetes.io/instance: "{{ .Release.Name }}"
    app.kubernetes.io/managed-by: "{{ .Release.Service }}"
    helm.sh/chart: "{{ template "tango-chart-example.chart" . }}"
  annotations:
    {{- range $key, $value := .Values.ingress.annotations }}
    {{ $key }}: {{ $value | quote }}
    {{- end }}
spec:
  rules:
    - host: {{ .Values.ingress.hostname }}
      http:
        paths:
          - path: /
            backend:
              serviceName:  tango-rest-{{ template "tango-chart-example.name" . }}-{{ .Release.Name }}
              servicePort: 80
{{- if .Values.ingress.tls.enabled }}
  tls:
    - secretName: {{ tpl .Values.ingress.tls.secretname . }}
      hosts:
        - {{ tpl .Values.ingress.hostname . }}
{{- end -}}
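
A values.yaml fragment that could drive the Ingress template above might look like the following sketch. The key names follow the template, but the hostname, the annotation and the secret name are hypothetical values chosen for illustration:

```yaml
# values.yaml (illustrative values only)
ingress:
  hostname: tango-example.minikube.local     # hypothetical hostname
  annotations:
    kubernetes.io/ingress.class: nginx       # assumes an NGINX ingress controller is installed
  tls:
    enabled: false                           # set to true to render the tls block
    # rendered through tpl, so template expressions are allowed here
    secretname: "tls-secret-{{ .Release.Name }}"
```

Keeping tls.enabled as a value allows the same chart to be deployed without TLS in development and with TLS in production by overriding a single parameter.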

Scheduling and running cloud native application suites

Security

Security covers many things, but this section will focus on RBAC, Pod Security Policies and Network Policies.

Roles

Kubernetes will implement role based access control which will be used to control external and internal user access to scheduling and consuming resources.

While it is possible to create serviceAccounts to modify the privileges for a deployment, this should generally be avoided so that the access control profile of the deploying user can be inherited at launch time.

Do not create ClusterRole and ClusterRoleBinding resources and/or allocate these to ServiceAccounts used in a deployment, as these have extended system wide access rights. Role and RoleBinding are scoped to the deployment Namespace, and so limit the scope for damage.
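
Where additional rights are genuinely unavoidable, a Namespace scoped Role and RoleBinding could be sketched as follows. The resource names, the ServiceAccount name and the choice of read-only access to ConfigMaps are all assumptions made for illustration:

```yaml
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: configmap-reader                 # hypothetical name
  namespace: {{ .Release.Namespace }}
rules:
- apiGroups: [""]                        # "" is the core API group
  resources: ["configmaps"]
  verbs: ["get", "list", "watch"]        # read-only access

---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: configmap-reader-binding         # hypothetical name
  namespace: {{ .Release.Namespace }}
subjects:
- kind: ServiceAccount
  name: example-service-account          # hypothetical ServiceAccount
  namespace: {{ .Release.Namespace }}
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: configmap-reader
```

Both resources only take effect inside the release Namespace, which is the property that distinguishes them from their ClusterRole and ClusterRoleBinding counterparts.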

Pod Security Policies

Pod Security Policies will affect what can be requested in the securityContext section.

It should be assumed that Kubernetes clusters will run restrictive Pod security policies, so it should be expected that:

  • Pods do not need to access resources outside the current Namespace.
  • Pods do not run as privileged: true and will not have privilege escalation.
  • hostNetwork activation will require discussion with operations.
  • hostIPC will be unavailable.
  • hostPID will be unavailable.
  • Containers should run as a non-root user.
  • host ports will be restricted.
  • host paths will be restricted (hostPath mounts).
  • it may be required to have a read only root filesystem (layer in container).
  • Capabilities may be dropped and a restricted list put in place to determine what can be added.
  • it should be expected that the default service account credentials will NOT be mounted into the running containers by default - applications should rarely need to query the Kubernetes API, so access will be removed by default.

In general, system level deployments such as Kubernetes control plane components (eg: admission controllers, device drivers, Operators, etc.) are the only deployments that should have cluster level rights.
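
A Pod specification that anticipates the restrictive policy expectations listed above might be sketched as follows. The container name and the UID 1000 are hypothetical, and the exact settings enforced will depend on the policies configured on a given cluster:

```yaml
...
spec:
  automountServiceAccountToken: false    # do not mount API credentials
  securityContext:
    runAsNonRoot: true
    runAsUser: 1000                      # hypothetical non-root UID
  containers:
  - name: example-app                    # hypothetical container name
    securityContext:
      privileged: false
      allowPrivilegeEscalation: false
      readOnlyRootFilesystem: true
      capabilities:
        drop: ["ALL"]                    # add back only what is explicitly permitted
...
```

Declaring these settings explicitly means the deployment behaves the same way whether or not the cluster policy would have imposed them.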

Network Policies

Explicit Network Policies are encouraged to restrict unintended access across deployments, and to secure applications from some forms of intrusion.

The following restricts access to the deployed TangoDB to only the DatabaseDS application.

NetworkPolicy restricting access to TangoDB to the DatabaseDS application
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: tangodb-{{ template "tango-chart-example.name" . }}-{{ .Release.Name }}-network-policy
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: tangodb-{{ template "tango-chart-example.name" . }}
      app.kubernetes.io/instance: "{{ .Release.Name }}"
  policyTypes:
  - Ingress
  - Egress
  ingress:
  - from:
    # enable the DatabaseDS interface
    - podSelector:
        matchLabels:
          app.kubernetes.io/name: databaseds-{{ template "tango-chart-example.name" . }}
          app.kubernetes.io/instance: "{{ .Release.Name }}"
    # NetworkPolicy ports accept only protocol and port (no name field)
    ports:
    - protocol: TCP
      port: 10000
  egress:
  - to:
    # anywhere in the standard Pod Network address range to all ports
    - ipBlock:
        cidr: 10.0.0.0/16
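
Explicit policies such as the one above are often paired with a default deny rule so that only declared traffic is admitted. The following is a minimal sketch of that pattern; the policy name is an assumption, and whether a default deny is appropriate depends on the deployment:

```yaml
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress             # hypothetical name
  namespace: {{ .Release.Namespace }}
spec:
  podSelector: {}                        # empty selector matches every Pod in the Namespace
  policyTypes:
  - Ingress                              # no ingress rules are listed, so all ingress is denied
```

Network Policies are additive, so any Pod that is also selected by a more specific policy still receives the traffic which that policy allows.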

Images, Tags, and pullPolicy

Only use images from trusted sources. In most cases this should be only from the official SKA repository, with a few exceptions such as the core vendor supported images for key services such as MySQL. It is anticipated that in the future the SKA will host mirrors and/or pull-through caches for key external software components, and will then firewall off access to external repositories that are not explicitly trusted.

As a general rule, use stable image tags that include at least the Major and Minor version number of Semantic Versioning, eg: mysql:5.7. As curated images come from trusted sources, this ensures that the deployment process gets a functionally stable starting point that will still accrue bug fixing and security patching over time. Do NOT use the latest tag, as it gives no way of guaranteeing feature parity and stability and is likely to break your application in future.

In Helm Charts, it is good practice to parameterise the registry, image and tag of each container so that these can be varied in different environment deployments by changing values. Also parameterise the pullPolicy so that communication with the registry at container boot time can be easily turned on and off.

...
containers:
- name: tangodb
  image: "{{ .Values.tangodb.image.registry }}/{{ .Values.tangodb.image.image }}:{{ .Values.tangodb.image.tag }}"
  imagePullPolicy: {{ .Values.tangodb.image.pullPolicy }}
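
The corresponding values.yaml entries for the template above might look like the following sketch. The key names follow the template, while the registry host, image name and tag are hypothetical:

```yaml
# values.yaml (illustrative values only)
tangodb:
  image:
    registry: registry.example.org/ska   # hypothetical registry
    image: tango-db                      # hypothetical image name
    tag: "1.0.0"                         # pinned version, never "latest"
    pullPolicy: IfNotPresent             # avoids a registry round trip when the image is cached
```

An environment that needs a different registry or tag can then override just these values at install time without editing the templates.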

Resource reservations and constraints

Compute platform level resources encompass:

  • Memory.
  • CPU.
  • Plugin based devices.
  • Extended resources - configured node level logical resources.

Resources can be either specified in terms of:

  • Limits - the maximum amount of resource a container is allowed to consume before it may be restarted or evicted.
  • Requests - the amount of resource a container requires to be available before it will be scheduled.

Limits and requests are specified at the individual container level:

...
containers:
- name: tango-device-thing
  resources:
    requests:
      cpu: 4000m     # 4 cores
      memory: 512Mi  # 0.5GiB
      skatelescope.org/widget: 3
    limits:
      cpu: 8000m     # 8 cores
      memory: 1024Mi # 1GiB

Resource requirements should be explicitly set both in terms of requests and limits (not normally applicable to extended resources) as this can be used by the scheduler to determine load balancing policy, and to determine when an application is misbehaving. These parameters should be set as configured values.yaml parameters.
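
One way to expose these settings as values.yaml parameters is sketched below. The tangodevice key is a hypothetical layout rather than one taken from the example charts, and the indent value must match the nesting depth of the real template:

```yaml
# values.yaml (hypothetical key layout)
tangodevice:
  resources:
    requests:
      cpu: 4000m
      memory: 512Mi
    limits:
      cpu: 8000m
      memory: 1024Mi

# template fragment: render the whole resources block from values
containers:
- name: tango-device-thing
  resources:
{{ toYaml .Values.tangodevice.resources | indent 4 }}
```

Rendering the block with toYaml keeps the template unchanged when an environment adds or removes individual resource entries.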

Restarts

Containers should be designed to cleanly crash - the main process should exit on a fatal error (no internal restart). This then will ensure that the configured livenessProbe and readinessProbe function correctly and where necessary, remove the affected Pod from Services ensuring that there are no dead service connections.
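
A minimal sketch of the probe configuration that this behaviour relies on is shown below. The container name, port number and the /healthz and /ready paths are assumptions, as the application must provide its own handlers:

```yaml
containers:
- name: example-app              # hypothetical container name
  ports:
  - name: http
    containerPort: 8080          # hypothetical port
  livenessProbe:                 # repeated failure causes the container to be restarted
    httpGet:
      path: /healthz             # hypothetical handler path
      port: http
    initialDelaySeconds: 10
    periodSeconds: 10
    failureThreshold: 3
  readinessProbe:                # failure removes the Pod from Service endpoints
    httpGet:
      path: /ready               # hypothetical handler path
      port: http
    periodSeconds: 5
```

The liveness probe covers the case where the process hangs rather than exits, while the readiness probe keeps traffic away from a Pod that is not yet able to serve.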

Logging

The SKA has adopted SKA Log Message Format as the logging standard to be used by all SKA software. This should be considered a base line standard and will be decorated with additional data by an infrastructure wide integrated logging solution (eg: ElasticStack). To ensure compliance with this, all containers must log to stdout/stderr and/or be configured to log to syslog. Connection to syslog should be configurable using standard container mechanisms such as mounted files (handled by ConfigMaps) or environment variables. This will ensure that any deployed application can be automatically plugged into the infrastructure wide logging and monitoring solution. A simple way to achieve this is to use a logging client library that is dynamically configurable for output destination such as import logging for Python.

Metrics

Each Pod should have an application metrics handler that emits the adopted container standard format. For efficiency purposes this should be amalgamated with the livenessProbe and readinessProbe.

Scheduling

Scheduling in Kubernetes enables the resources of the entire cluster to be allocated using a fine grained model. These resources can be partitioned according to user policies, namespaces, and quotas. The default scheduler is a comprehensive rules processing engine that should be able to satisfy most needs.

The primary mechanism for routing incoming tasks to execution is by having a labelling system throughout the cluster that reflects the distribution profile of workloads and types of resources required, coupled with Node and Pod affinity/anti-affinity rules. These are applied like a sieve to the available resources that the Scheduler keeps track of to determine if resources are available and where the next Pod can be placed.

Scheduling on Kubernetes behaves similarly to a force directed graph, in that the tensions between the interdependent rules form the pressures of the spring bars that influence relative placement across the cluster.

When creating scheduling constraints, attempt to keep them as generic as possible. Concentrate on declaring rules related to the individual Helm chart and the current chart in relation to any dependent charts (subcharts). Avoid coding in node specific requirements. Often it is more efficient to outsource the rules to the values.yaml file as they are almost guaranteed to change between environments.

---
...
{{- with .Values.nodeSelector }}
      nodeSelector:
{{ toYaml . | indent 8 }}
{{- end }}
{{- with .Values.affinity }}
      affinity:
{{ toYaml . | indent 8 }}
{{- end }}
{{- with .Values.tolerations }}
      tolerations:
{{ toYaml . | indent 8 }}
{{- end }}
...
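
A values.yaml fragment that could feed the template above might look like the following sketch. The rack label, the app label and the taint key and value are illustrative assumptions that will differ between environments:

```yaml
# values.yaml (illustrative values only)
nodeSelector:
  rack: rack01                       # assumes nodes carry a "rack" label
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      podAffinityTerm:
        labelSelector:
          matchLabels:
            app: example-app         # hypothetical label of the chart's own Pods
        topologyKey: kubernetes.io/hostname
tolerations:
- key: "key1"                        # hypothetical taint key
  operator: "Equal"
  value: "value1"
  effect: "NoSchedule"
```

Leaving any of the three keys unset in values.yaml causes the matching with block to be skipped, so the chart renders without scheduling constraints by default.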

Always remember that the Kubernetes API is declarative and expect that deployments will use the apply semantics of kubectl, with the scheduler constantly trying to move the system towards the desired state as and when resources become available as well as in response to failures. This means that scheduling is not guaranteed, so any downstream dependencies must be able to cope with that (also a tenet of micro-services architecture).

Examples of scheduling control patterns

The below scheduling scenarios are run using the following conditions:

  • Container replicas launched using a sleep command in busybox, defined in a StatefulSet.
  • A four node cluster - master and three minions.
  • The nodes have been split into two groups: rack01 - k8s-master-0 and k8s-minion-0, and rack02 - k8s-minion-1 and k8s-minion-2.
  • The master node has the labels: node-role.kubernetes.io/headnode, and node-role.kubernetes.io/master.

The scenarios cover the following placement controls:

  • Specific node.
  • Type of node.
  • Density - 1 per node, n per node.
  • Position next to another Pod - specific Pod, or Pod type.
  • Soft and hard rules.

The aim is to demonstrate how the scheduler works, and how to configure for the common use cases.

obs1 and obs2 - nodeSelector

Use nodeSelector to force all 3 replicas onto rack: rack01 for obs1-rack01 and rack02 for obs2-rack02:

node select rack01 for obs1-rack01 and rack02 for obs2-rack02
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: obs1-rack01
  labels:
    group: scheduling-examples
    app: obs1
spec:
  replicas: 3
  selector:
    matchLabels:
      app: obs1
  serviceName: obs1
  template:
    metadata:
      labels:
        group: scheduling-examples
        app: obs1
      annotations:
        description: node select rack01
    spec:
      containers:
      - image: busybox:1.28.3
        name: obs1-rack01
        command: ["sleep", "365d"]
      nodeSelector:
        rack: rack01

---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: obs2-rack02
  labels:
    group: scheduling-examples
    app: obs2
spec:
  replicas: 3
  selector:
    matchLabels:
      app: obs2
  serviceName: obs2
  template:
    metadata:
      labels:
        group: scheduling-examples
        app: obs2
      annotations:
        description: node select rack02
    spec:
      containers:
      - image: busybox:1.28.3
        name: obs2-rack02
        command: ["sleep", "365d"]
      nodeSelector:
        rack: rack02

Scenario obs1 - run 3 Pods on hosts allocated to rack01. Only nodes master-0, and minion-0 are used reflecting rack01.

NAME          DESC               STATUS  NODE
obs1-rack01-0 node select rack01 Running k8s-master-0
obs1-rack01-1 node select rack01 Running k8s-minion-0
obs1-rack01-2 node select rack01 Running k8s-master-0

and for Scenario obs2 - run 3 Pods on hosts allocated to rack02. Only minion-1 and minion-2 are used reflecting rack02.

NAME          DESC               STATUS  NODE
obs2-rack02-0 node select rack02 Running k8s-minion-2
obs2-rack02-1 node select rack02 Running k8s-minion-1
obs2-rack02-2 node select rack02 Running k8s-minion-2

obs3 - nodeAffinity exclusion

Use nodeAffinity operator: NotIn rules to exclude the master node from scheduling:

nodeAffinity NotIn master
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: obs3-node-affinity-not-master
  labels:
    group: scheduling-examples
    app: obs3
spec:
  replicas: 4
  selector:
    matchLabels:
      app: obs3
  serviceName: obs3
  template:
    metadata:
      labels:
        group: scheduling-examples
        app: obs3
      annotations:
        description: nodeAffinity NotIn master
    spec:
      containers:
      - image: busybox:1.28.3
        name: obs3-node-affinity-not-master
        command: ["sleep", "365d"]
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: node-role.kubernetes.io/master
                operator: NotIn
                values:
                - ""

Scenario obs3 - run 4 Pods on any host so long as it is not labelled node-role.kubernetes.io/master. In this case minion-0 and minion-1 have been selected, although minion-2 could also have been used.

NAME                            DESC                      STATUS  NODE
obs3-node-affinity-not-master-0 nodeAffinity NotIn master Running k8s-minion-1
obs3-node-affinity-not-master-1 nodeAffinity NotIn master Running k8s-minion-0
obs3-node-affinity-not-master-2 nodeAffinity NotIn master Running k8s-minion-1
obs3-node-affinity-not-master-3 nodeAffinity NotIn master Running k8s-minion-0
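
As an alternative sketch of the same exclusion, the DoesNotExist operator excludes any node that carries the label regardless of its value, which avoids having to match on an empty string. This variant is not part of the scenarios that were run above:

```yaml
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        # exclude every node that has the master role label, whatever its value
        - key: node-role.kubernetes.io/master
          operator: DoesNotExist
```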

obs4 - nodeAntiAffinity

Use podAffinity (hard requiredDuringSchedulingIgnoredDuringExecution) to position on the same node as obs1-rack01, and nodeAffinity with the NotIn operator (soft preferredDuringSchedulingIgnoredDuringExecution, acting as node anti-affinity) to steer Pods away from the node labelled ‘node-role.kubernetes.io/headnode’:

podAffinity require obs1-rack01, nodeAntiAffinity prefer headnode
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: obs4-pod-affinity-obs1-pref-not-headnode
  labels:
    group: scheduling-examples
    app: obs4
spec:
  replicas: 5
  selector:
    matchLabels:
      app: obs4
  serviceName: obs4
  template:
    metadata:
      labels:
        group: scheduling-examples
        app: obs4
      annotations:
        description: podAffinity req obs1, nodeAntiAffinity pref headnode
    spec:
      containers:
      - image: busybox:1.28.3
        name: obs4-pod-affinity-obs1-pref-not-headnode
        command: ["sleep", "365d"]
      affinity:
        podAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - obs1
            topologyKey: kubernetes.io/hostname
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            preference:
              matchExpressions:
              - key: node-role.kubernetes.io/headnode
                operator: NotIn
                values:
                - ""

Scenario obs4 - run 5 Pods using required Pod Affinity with obs1 and preferred Node Anti Affinity with headnode (master label). Pods have been scheduled on minion-0 and master-0 as this is where obs1 is. The preferred anti affinity rule with headnode further influences placement, so only one replica is on master-0.

NAME                                       DESC                                                 STATUS  NODE
obs4-pod-affinity-obs1-pref-not-headnode-0 podAffinity req obs1, nodeAntiAffinity pref headnode Running k8s-minion-0
obs4-pod-affinity-obs1-pref-not-headnode-1 podAffinity req obs1, nodeAntiAffinity pref headnode Running k8s-minion-0
obs4-pod-affinity-obs1-pref-not-headnode-2 podAffinity req obs1, nodeAntiAffinity pref headnode Running k8s-minion-0
obs4-pod-affinity-obs1-pref-not-headnode-3 podAffinity req obs1, nodeAntiAffinity pref headnode Running k8s-master-0
obs4-pod-affinity-obs1-pref-not-headnode-4 podAffinity req obs1, nodeAntiAffinity pref headnode Running k8s-minion-0

obs5 - podAntiAffinity

Use podAntiAffinity (hard requiredDuringSchedulingIgnoredDuringExecution) to ensure only one instance of self per node (topologyKey: “kubernetes.io/hostname”), and podAffinity to require a position on the same node as obs3:

podAntiAffinity require self and podAffinity require obs3
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: obs5-pod-one-per-node-and-obs3
  labels:
    group: scheduling-examples
    app: obs5
spec:
  replicas: 5
  selector:
    matchLabels:
      app: obs5
  serviceName: obs5
  template:
    metadata:
      labels:
        group: scheduling-examples
        app: obs5
      annotations:
        description: podAntiAffinity req self, podAffinity req obs3
    spec:
      containers:
      - image: busybox:1.28.3
        name: obs5-pod-one-per-node-and-obs3
        command: ["sleep", "365d"]
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - obs5
            topologyKey: "kubernetes.io/hostname"
        podAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - obs3
            topologyKey: "kubernetes.io/hostname"

Scenario obs5 - run Pods using required Pod Anti Affinity with self (force schedule one per node) and required Pod Affinity with obs3. Although 5 replicas are requested, scheduling is forced to one per node, and because obs3 is only running on two different nodes the 3rd replica is in a constant state of Pending. As a StatefulSet creates its Pods in order by default, the remaining replicas are not created while the 3rd is Pending. Pod Affinity is described with a topology key that is kubernetes.io/hostname ie. the node identifier. The topology key sets the scope for implementing the rule, so could be a node, a group of nodes, an OS or device classification etc.

NAME                             DESC                                           STATUS  NODE
obs5-pod-one-per-node-and-obs3-0 podAntiAffinity req self, podAffinity req obs3 Running k8s-minion-0
obs5-pod-one-per-node-and-obs3-1 podAntiAffinity req self, podAffinity req obs3 Running k8s-minion-1
obs5-pod-one-per-node-and-obs3-2 podAntiAffinity req self, podAffinity req obs3 Pending <none>

obs6 - Taint NoSchedule

First, the master node is tainted to disallow scheduling, for example with kubectl taint nodes <master node> key1=value1:NoSchedule (kubectl cordon <master node> also disallows scheduling).

Use nodeSelector to force all 3 replicas onto rack: rack01. The taint prevents scheduling on the master node, so all replicas are subsequently forced onto minion-0:

node select rack01, but trapped by Taint NoSchedule
---
# kubectl taint nodes k8s-master-0 key1=value1:NoSchedule, or kubectl cordon k8s-master-0
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: obs6-rack01-taint
  labels:
    group: scheduling-examples
    app: obs6
spec:
  replicas: 3
  selector:
    matchLabels:
      app: obs6
  serviceName: obs6
  template:
    metadata:
      labels:
        group: scheduling-examples
        app: obs6
      annotations:
        description: node select rack01, but trapped by Taint NoSchedule
    spec:
      containers:
      - image: busybox:1.28.3
        name: obs6-rack01-taint
        command: ["sleep", "365d"]
      nodeSelector:
        rack: rack01

The resulting schedule is:

NAME                READY STATUS  RESTARTS AGE IP              NODE NOMINATED NODE
obs6-rack01-taint-0 1/1   Running 0        32s 192.168.105.180 k8s-minion-0 <none>
obs6-rack01-taint-1 1/1   Running 0        31s 192.168.105.177 k8s-minion-0 <none>
obs6-rack01-taint-2 1/1   Running 0        29s 192.168.105.181 k8s-minion-0 <none>

For obs6, a StatefulSet that has nodeSelector:

nodeSelector:
  rack: rack01

The result shows that of the two nodes (k8s-master-0, and k8s-minion-0) in rack01, only k8s-minion-0 is available for these Pods.

obs7 - add toleration

Repeat obs6 as obs7 but add a toleration to the NoSchedule taint:

node select rack01, with Toleration to Taint NoSchedule
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: obs7-rack01-taint-and-tolleration
  labels:
    group: scheduling-examples
    app: obs7
spec:
  replicas: 3
  selector:
    matchLabels:
      app: obs7
  serviceName: obs7
  template:
    metadata:
      labels:
        group: scheduling-examples
        app: obs7
      annotations:
        description: node select rack01, with Tolleration to Taint NoSchedule
    spec:
      containers:
      - image: busybox:1.28.3
        name: obs7-rack01-taint-and-tolleration
        command: ["sleep", "365d"]
      nodeSelector:
        rack: rack01
      tolerations:
      - key: "key1"
        operator: "Equal"
        value: "value1"
        effect: "NoSchedule"

Now, with the added Toleration to the Taint, we have the following:

NAME                                READY STATUS RESTARTS AGE IP              NODE NOMINATED NODE
obs7-rack01-taint-and-tolleration-0 1/1   Running 0       33s 192.168.105.184 k8s-minion-0 <none>
obs7-rack01-taint-and-tolleration-1 1/1   Running 0       32s 192.168.72.27   k8s-master-0 <none>
obs7-rack01-taint-and-tolleration-2 1/1   Running 0       31s 192.168.105.182 k8s-minion-0 <none>

For a StatefulSet that has nodeSelector and Tolerations:

nodeSelector:
  rack: rack01
tolerations:
- key: "key1"
  operator: "Equal"
  value: "value1"
  effect: "NoSchedule"

The result shows that the two nodes k8s-master-0, and k8s-minion-0 in rack01, are available for these Pods.
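
Where the value of the taint is not important, a toleration can match on the key alone. The following sketch uses the Exists operator with the same hypothetical key1 taint key used above. It is a variation on obs7 rather than one of the scenarios that were run:

```yaml
nodeSelector:
  rack: rack01
tolerations:
# tolerate any NoSchedule taint with key "key1", whatever its value
- key: "key1"
  operator: "Exists"
  effect: "NoSchedule"
```

With the Exists operator no value is given, so the toleration continues to match if the taint value is later changed.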