Feature Flags

This section explains the fundamental concepts behind feature flags and their specific application within the SKAO context.

What are Feature Flags?

Feature flags (also known as feature toggles, flippers, or conditional features) are a software development technique that allows you to turn certain functionality on and off during runtime, without deploying new code.

At its core, a feature flag is a decision point in your code that can change the behaviour of the software based on the flag’s state (e.g., on or off).

Example

if feature_flags.is_enabled("new-shiny-feature"):
    # Execute the new code path
    show_new_shiny_feature()
else:
    # Execute the old code path
    show_old_feature()

Note

In this document, the term “feature flag” covers several different ways of achieving the same goal. The premise is the one described above, but the implementation may differ. A feature flag can be implemented as a runtime-configurable toggle (via API calls or Tango attributes), as a static configuration option (deployment configuration or program parameters), as a compile-time option (build flags), or as any other custom mechanism. The goal is to provide a consistent way to control the behaviour of the software across multiple components.

The flowchart below explains in more detail how to choose between these options.

Generally, many of the SKA components will only require static configuration options.
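
As a minimal sketch of the static-configuration option, a flag can be read once from an environment variable at start-up. The variable naming scheme (an FF_ prefix followed by the upper-cased flag name) and the helper flag_from_env are assumptions made for illustration, not an SKAO convention.

```python
import os

# Strings recognised as "on" and "off"; anything else keeps the default.
_TRUTHY = {"1", "true", "yes", "on"}
_FALSY = {"0", "false", "no", "off"}


def flag_from_env(flag_name: str, default: bool = False, environ=None) -> bool:
    """Read a static feature flag from an environment variable.

    Hypothetical naming scheme: "component-x-enable-new-function" is looked
    up as FF_COMPONENT_X_ENABLE_NEW_FUNCTION.
    """
    environ = os.environ if environ is None else environ
    var_name = "FF_" + flag_name.upper().replace("-", "_")
    raw = environ.get(var_name)
    if raw is None:
        return default
    value = raw.strip().lower()
    if value in _TRUTHY:
        return True
    if value in _FALSY:
        return False
    # Unrecognised values keep the default rather than guessing.
    return default
```

Passing environ explicitly keeps the helper easy to test; in a deployment the value would typically come from the chart or container configuration.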

Where They Can Be Useful

Feature flags offer several advantages:

Decouple Deployment from Release: Deploy code to production frequently, but only release features to users when ready or only configure the system fully when it’s ready.
Canary Releases & Gradual Rollouts: Release features to a small subset of users (e.g., PSIs, AIV, software-only cloud) before a full rollout, reducing risk.
A/B Testing/Compatibility: Expose different versions of a feature or its dependencies (e.g., different algorithms, data formats, or UIs) to test new functionality or compatibility with other components and to gather feedback.
Kill Switches: Quickly disable problematic features in production without needing a rollback or hotfix deployment.
Development: Allow developers to merge incomplete features (especially features which require updates to many components) to the main branch, hidden behind a flag, reducing merge conflicts and integration pain.
Operational Control: Enable/disable features for specific operational needs (e.g., disabling a resource-intensive feature during peak load).

You can read more about different types of feature flags here.

See Project examples for examples of how to use feature flags in SKA software.

Where They Shouldn’t Be Used (Anti-patterns)

While powerful, feature flags should be used judiciously:

Long-Term Configuration: Flags should be temporary. Avoid using them as a permanent configuration system; use proper configuration management for that. Plan for flag removal.
Excessive Complexity: Too many flags, especially nested ones, can make code hard to reason about, test, and maintain.
Replacing Proper Design: Avoid using flags as a substitute for architectural decisions, refactoring, or paying down technical debt.
Core Architectural Changes: Flags are generally unsuitable for toggling fundamental architectural differences, since such changes often introduce additional code and test complexity as well as migration and rollback issues. For example, moving from NoSQL to SQL shouldn’t be controlled via a feature flag.
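
To illustrate why the number of flags matters: each independent on/off flag doubles the number of configurations a test suite may need to cover. The sketch below enumerates them; the flag names are made up for the example.

```python
from itertools import product


def flag_combinations(flag_names):
    """Return every on/off combination of the given flags as a list of dicts."""
    return [
        dict(zip(flag_names, states))
        for states in product([False, True], repeat=len(flag_names))
    ]


# Hypothetical flag names, used only for illustration.
flags = ["component-x-new-parser", "component-x-new-cache", "component-x-new-ui"]
combos = flag_combinations(flags)
print(len(combos))  # 2 ** 3 = 8 configurations for three flags
```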

Naming Conventions

The following naming conventions are advised:

Prefix the flag name with the component and/or subsystem name. This helps to identify the feature flag and its purpose.
Use the same flag name across different repositories if a flag introduced in Component A needs to be controlled during the integration testing of Subsystem X or System 1. The definition and control plane shifts, but the flag name checked in the code remains the same.
Use the same flag name across different environments.
Use the same flag name across different components of the same system.
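
A convention like this can be checked mechanically, for example in CI. The sketch below assumes lower-case, hyphen-separated names and a known list of component prefixes; both the pattern and the prefix list are assumptions for illustration rather than an SKAO rule.

```python
import re

# Assumed style: lower-case words separated by single hyphens.
_NAME_PATTERN = re.compile(r"^[a-z0-9]+(-[a-z0-9]+)+$")


def is_valid_flag_name(name: str, known_prefixes) -> bool:
    """Check that a flag name is well-formed and starts with a known prefix."""
    if not _NAME_PATTERN.match(name):
        return False
    return any(name.startswith(prefix + "-") for prefix in known_prefixes)


# Hypothetical prefixes for the components that own flags.
prefixes = ["component-x", "subsystem-a"]
```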

Example: Suppose component-x-enable-new-function is a feature flag for a new feature in Component X. Because the flag name checked in the code should not have to change, it stays the same across all deployments and environments, whether the code runs as Component X alone, as part of Subsystem A, or as part of System 1.

The same flag can then be controlled via configuration in multiple projects and environments across different datacentres (e.g. STFC, ITF, PSI, AA), as shown below. If the deployments come from the same project, an environment option can be used to control the flag.

Datacentre	Environment	Component	Flag Status
STFC	CI/Test	X	enabled
STFC	Integration/Staging	Subsystem A	enabled
ITF	Integration	SKA MID ITF Integration	disabled
AA	Production	SKA MID AA	enabled
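
The table above can be expressed as per-deployment configuration in which the flag name stays fixed and only its state varies. The following is a sketch; the dictionary layout and the flag_state helper are assumptions, and the values simply mirror the table.

```python
FLAG_NAME = "component-x-enable-new-function"

# (datacentre, environment) -> {flag name: enabled?}, mirroring the table above.
DEPLOYMENT_FLAGS = {
    ("STFC", "CI/Test"): {FLAG_NAME: True},
    ("STFC", "Integration/Staging"): {FLAG_NAME: True},
    ("ITF", "Integration"): {FLAG_NAME: False},
    ("AA", "Production"): {FLAG_NAME: True},
}


def flag_state(datacentre: str, environment: str, flag: str, default: bool = False) -> bool:
    """Look up a flag for one deployment; unknown deployments get the default."""
    return DEPLOYMENT_FLAGS.get((datacentre, environment), {}).get(flag, default)
```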

Best Practices

Define flags at the highest level where they need to be controlled. If a flag in Component A only affects A’s internal behaviour and isn’t relevant to Subsystem X or Telescope, it could potentially be managed within its own project. However, if the feature controlled by the flag needs a coordinated rollout across the integrated system, its usage and default behaviour need to be clearly documented. Clear communication and naming conventions are crucial.
Use the same flag name across different environments if a flag introduced in Component A needs to be controlled during the integration testing of Subsystem X or Telescope. The definition and control plane shifts, but the flag name checked in the code remains the same.
Always use configurable options for initialisation of the Unleash client so that different datacentres can be used for different environments.
Ensure that feature flags are removed after their purpose has been fulfilled to maintain code clarity and reduce complexity.
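
As a sketch of the configurable initialisation recommended above, the client settings can be read from the environment so that each datacentre or environment supplies its own values. The environment variable names and the UnleashSettings type are hypothetical; the resulting values would be passed to whichever Unleash client library the component uses.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class UnleashSettings:
    url: str          # flag server or proxy endpoint for this deployment
    instance_id: str  # identifier issued by the flag backend
    app_name: str     # application name reported to the backend


def load_unleash_settings(environ=None) -> UnleashSettings:
    """Build client settings from (hypothetical) environment variables."""
    environ = os.environ if environ is None else environ
    required = ("UNLEASH_URL", "UNLEASH_INSTANCE_ID")
    missing = [key for key in required if key not in environ]
    if missing:
        raise RuntimeError(f"missing feature flag settings: {', '.join(missing)}")
    return UnleashSettings(
        url=environ["UNLEASH_URL"],
        instance_id=environ["UNLEASH_INSTANCE_ID"],
        app_name=environ.get("UNLEASH_APP_NAME", "development"),
    )
```

Failing fast on missing settings is one possible design choice; a component that must start without a flag server could instead fall back to defaults.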

Feature Flag Lifecycle

The following diagram illustrates the typical lifecycle of a feature controlled by a runtime feature flag backed by a client/server model (UnleashClient in this example, with GitLab as the feature flag store), specifically focusing on a hypothetical Component X using the flag new-x-feature. It shows the journey from initial development through various environments to production rollout and eventual cleanup, keeping system dependencies in mind. For other kinds of feature flags, disregard the Unleash components.

Other kinds of implementations (e.g. Tango attributes, static configuration, compile-time options) are not documented here, as they are more straightforward and do not require a client/server model. Please see the Project examples for implementation examples.

Note: As indicated at the bottom of the diagram, Component X runs within Subsystem A, which in turn runs within Telescope.

1. Local Development (Repo X)

Developer Action: A developer working within the Component X repository (Repo X) introduces new functionality. They wrap this new code path and the original (“Old logic”) code path within a conditional statement controlled by the new-x-feature flag using an Unleash client library (e.g., ff.is_enabled('new-x-feature', fallback = True)). Using a fallback is crucial to handle potential initialisation or network issues gracefully.
Testing: During local development, the developer uses a mock client, cached values, or a local Unleash instance to test flag behaviour without connecting to a central server.
Flag Definition: The diagram notes that the new-x-feature flag might initially be defined in the Repo X GitLab project settings. The team can then use it to control the flag for development and testing. (See Integration Testing for where control often resides.)
Outcome: The code containing the flag logic is committed and pushed via Git.
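
The fallback behaviour and the mock client mentioned above might look like the following sketch. MockFlagClient, safe_is_enabled, and the is_enabled(name) method they rely on are stand-ins for illustration; they are not the API of any particular Unleash client library.

```python
class MockFlagClient:
    """In-memory stand-in for a flag client, for local tests only."""

    def __init__(self, flags=None, fail=False):
        self._flags = dict(flags or {})
        self._fail = fail  # simulate an unreachable flag server

    def is_enabled(self, name: str) -> bool:
        if self._fail:
            raise ConnectionError("flag server unreachable")
        return self._flags[name]  # KeyError if the flag is not defined


def safe_is_enabled(client, name: str, fallback: bool) -> bool:
    """Return the flag state, or the fallback if the lookup fails for any reason."""
    try:
        return bool(client.is_enabled(name))
    except Exception:
        return fallback


def run_component_x(client) -> str:
    # Assumed choice for this sketch: fall back to the old logic on failure.
    if safe_is_enabled(client, "new-x-feature", fallback=False):
        return "new logic"
    return "old logic"
```

Which fallback value is appropriate depends on the feature; the point of the sketch is only that a failed lookup yields a defined result instead of an exception.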

CI/CD Pipeline

Trigger: The Git Push triggers the CI/CD Pipeline.
Action: This pipeline builds, tests, and orchestrates the deployment of the application (e.g., Telescope / System A, which includes Component X) to subsequent environments.

2. Integration Testing (Repo X / Subsystem A / CI)

Environment: This phase often occurs within the CI/CD pipeline itself or a dedicated, short-lived test environment (represented within Kubernetes).
Testing: Various automated tests run against the integrated code, including unit, component, and integration tests.
Flag Configuration Source: Tests in this environment fetch flag configurations from the GitLab feature flags defined within the Repo X project.
GitLab State: For this CI/test environment, the new-x-feature flag is configured to be ON in the Repo X GitLab project to allow testing of the new code path during the CI phase.
Client Interaction: The client (potentially via a proxy, as indicated by “Client (Proxy) -> GitLab”) checks the flag state against the configuration fetched from the Repo X GitLab project.
Outcome: Successful tests allow the pipeline to proceed to deploy to Staging.

3. Staging Environment

Deployment: The CI/CD pipeline deploys the integrated application to a persistent staging environment.
Unleash Proxy: An Unleash Proxy service runs within the Staging environment. It periodically fetches the flag configurations from the central GitLab instance and caches them.
Application Behaviour: The running application checks the flag status (Check Flag toggle) by querying the local Unleash Proxy within the staging environment.
GitLab State: The flag is configured in GitLab to be ON for the staging environment. Strategies might involve enabling it for specific subsystems (select subsystems for compatibility) or for specific users/groups.
Verification: AIV (Assembly, Integration and Verification), Cloud, or other designated Testers may interact with the staging system to manually verify the new feature.
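
The proxy’s fetch-and-cache behaviour can be sketched as follows. This is a simplified, single-process illustration rather than how Unleash Proxy is implemented: fetch is a caller-supplied function standing in for the request to GitLab, and the refresh happens lazily on lookup instead of on a background timer.

```python
import time


class CachingFlagSource:
    """Cache flag states and refresh them at most once per interval."""

    def __init__(self, fetch, refresh_seconds=15.0, clock=time.monotonic):
        self._fetch = fetch            # callable returning {flag name: bool}
        self._refresh = refresh_seconds
        self._clock = clock
        self._cache = {}
        self._fetched_at = None

    def _refresh_if_stale(self):
        now = self._clock()
        if self._fetched_at is not None and now - self._fetched_at < self._refresh:
            return
        try:
            self._cache = dict(self._fetch())
        except Exception:
            pass  # upstream unavailable: keep serving the last known states
        # Record the attempt either way so a failing upstream is not hammered.
        self._fetched_at = now

    def is_enabled(self, name, fallback=False):
        self._refresh_if_stale()
        return self._cache.get(name, fallback)
```

Serving the last known states when the upstream is unreachable means the application keeps its current behaviour during an outage of the flag store.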

4. Production Environment

Deployment: After successful staging validation, the application is deployed to Production.
Unleash Proxy: A dedicated Unleash Proxy may run in Production, fetching flag configurations from GitLab for the production environment scope.
Application Behaviour: Production instances check the flag status (Check Flag toggle) via the Production Unleash Proxy or directly from GitLab.
GitLab State (Rollout): The flag’s strategy for the production environment is managed in GitLab for a controlled rollout. The flag can be toggled OFF/ON, and a Gradual Rollout strategy may be used (e.g., enabling for specific subsystems, user percentages, user IDs). Eventually, the strategy is updated to enable the flag for 100% of users/subsystems.
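
A percentage-based gradual rollout is commonly implemented by hashing a stable identifier into a bucket, so that the same user or subsystem keeps getting the same answer as the percentage grows. The sketch below shows the idea with SHA-256; it is not the hashing scheme that Unleash or GitLab uses, and the function name is made up for the example.

```python
import hashlib


def in_rollout(flag_name: str, unit_id: str, percentage: int) -> bool:
    """Return True if unit_id falls inside the first `percentage` of 100 buckets."""
    if not 0 <= percentage <= 100:
        raise ValueError("percentage must be between 0 and 100")
    digest = hashlib.sha256(f"{flag_name}:{unit_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # stable bucket in 0..99
    return bucket < percentage
```

Including the flag name in the hash input means different flags select different subsets, so the same units are not always the first to receive every new feature.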

5. Cleanup (Post-Rollout)

Trigger: Once the feature is stable and fully rolled out in Production.
Actions:
1. Remove flag logic from Component X code (leaving only the new path).
2. Git Push & CI/CD Deploy: Push the cleaned code; the pipeline deploys the updated application without the flag logic.
3. Delete Flag definition from GitLab.
4. Remove related tests for the old code path that is no longer reachable.
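
Before deleting the flag definition, it can help to confirm that no code still references the flag. The sketch below scans a source tree for the flag name; the function name and the choice of file suffixes are assumptions, and a plain text search would do the same job.

```python
from pathlib import Path


def find_flag_references(root, flag_name, suffixes=(".py", ".yaml", ".yml")):
    """Return (path, line number) pairs for every line mentioning flag_name."""
    hits = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix not in suffixes:
            continue
        text = path.read_text(encoding="utf-8", errors="ignore")
        for lineno, line in enumerate(text.splitlines(), start=1):
            if flag_name in line:
                hits.append((str(path), lineno))
    return hits
```

An empty result for new-x-feature across Repo X suggests that the flag definition can be deleted without leaving dangling checks behind.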

Read more about feature flags on the developer portal: Feature Flags