mirror of https://github.com/Unleash/unleash.git synced 2025-03-27 00:19:39 +01:00

Update 11 principles docs (#7907)

Melinda Fekete 2024-08-20 17:19:12 +02:00 committed by GitHub
parent 51f61d5103
commit 500a9c7df4
No known key found for this signature in database
GPG Key ID: B5690EEEBB952194


@@ -1,339 +1,197 @@
---
title: "Feature flags: Best practices for building and scaling"
description: "Discover 11 essential principles for building robust, large-scale feature flag systems."
toc_max_heading_level: 2
---
# Feature flags: Best practices for building and scaling feature flag systems
# Best practices for building and scaling feature flags
import Figure from '@site/src/components/Figure/Figure.tsx'
Feature flags, sometimes called feature toggles or feature switches, are a software development technique that allows engineering teams to decouple the release of new functionality from software deployments. With feature flags, developers can turn specific features or code segments on or off at runtime, without the need for a code deployment or rollback. Organizations who adopt feature flags see improvements in all key operational metrics for DevOps: Lead time to changes, mean-time-to-recovery, deployment frequency, and change failure rate.
Feature flags, sometimes called feature toggles or feature switches, are a powerful software development technique that allows engineering teams to decouple the release of new functionality from software deployments.
There are 11 principles for building a large-scale feature flag system. These principles have their roots in distributed systems architecture and pay particular attention to security, privacy, and scale that is required by most enterprise systems. If you follow these principles, your feature flag system is less likely to break under load and will be easier to evolve and maintain.
With feature flags, developers can turn specific features or code segments on or off at runtime without needing a code deployment or rollback. Organizations that adopt feature flags see improvements in key DevOps metrics like lead time to changes, mean time to recovery, deployment frequency, and change failure rate.
At Unleash, we've defined 11 principles for building a large-scale feature flag system. These principles have their roots in distributed systems design and focus on security, privacy, and scalability—critical needs for enterprise systems. By following these principles, you can create a feature flag system that's reliable, easy to maintain, and capable of handling heavy loads.
These principles are:
- [Background](#background)
- [1. Enable run-time control. Control flags dynamically, not using config files.](#1-enable-run-time-control-control-flags-dynamically-not-using-config-files)
- [2. Never expose PII. Follow the principle of least privilege.](#2-never-expose-pii-follow-the-principle-of-least-privilege)
- [3. Evaluate flags as close to the user as possible. Reduce latency.](#3-evaluate-flags-as-close-to-the-user-as-possible-reduce-latency)
- [4. Scale Horizontally. Decouple reading and writing flags.](#4-scale-horizontally-decouple-reading-and-writing-flags)
- [5. Limit payloads. Feature flag payload should be as small as possible.](#5-limit-payloads-feature-flag-payload-should-be-as-small-as-possible)
- [6. Design for failure. Favor availability over consistency.](#6-design-for-failure-favor-availability-over-consistency)
- [7. Make feature flags short-lived. Do not confuse flags with application configuration.](#7-make-feature-flags-short-lived-do-not-confuse-flags-with-application-configuration)
- [8. Use unique names across all applications. Enforce naming conventions.](#8-use-unique-names-across-all-applications-enforce-naming-conventions)
- [9. Choose open by default. Democratize feature flag access.](#9-choose-open-by-default-democratize-feature-flag-access)
- [10. Do no harm. Prioritize consistent user experience.](#10-do-no-harm-prioritize-consistent-user-experience)
- [11. Enable traceability. Make it easy to understand flag evaluation.](#11-enable-traceability-make-it-easy-to-understand-flag-evaluation)
1. [Enable runtime control](#1-enable-runtime-control)
2. [Protect PII by evaluating flags server-side](#2-protect-pii-by-evaluating-flags-server-side)
3. [Evaluate flags as close to the user as possible](#3-evaluate-flags-as-close-to-the-user-as-possible)
4. [Scale horizontally by decoupling reads and writes](#4-scale-horizontally-by-decoupling-reads-and-writes)
5. [Limit feature flag payload](#5-limit-feature-flag-payload)
6. [Prioritize availability over consistency](#6-prioritize-availability-over-consistency)
7. [Make flags short-lived](#7-make-flags-short-lived)
8. [Ensure unique flag names](#8-ensure-unique-flag-names)
9. [Choose open by default](#9-choose-open-by-default)
10. [Prioritize consistent user experience](#10-prioritize-consistent-user-experience)
11. [Optimize for developer experience](#11-optimize-for-developer-experience)
## Background
Let's dive deeper into each principle.
Feature flags have become a central part of the DevOps toolbox along with Git, CI/CD and microservices. You can write modern software without all of these things, but it sure is a lot harder, and a lot less fun.
## 1. Enable runtime control
And just like the wrong Git repo design can cause interminable headaches, getting the details wrong when first building a feature flag system can be very costly.
A scalable feature management system evaluates flags at runtime. Flags are dynamic, not static. If you need to restart your application to turn on a flag, that's configuration, not a feature flag.
This set of principles for building a large-scale feature management platform is the result of thousands of hours of work building and scaling Unleash, an open-source feature management solution used by thousands of organizations.
A large-scale feature flag system that enables runtime control should have, at minimum, the following components: a service to manage feature flags, a database or data store, an API layer, a feature flag SDK, and a continuous update mechanism.
Before Unleash was a community and a company, it was an internal project, started by [one dev](https://github.com/ivarconr), for one company. As the community behind Unleash grew, patterns and anti-patterns of large-scale feature flag systems emerged. Our community quickly discovered that these are important principles for anyone who wanted to avoid spending weekends debugging the production system that is supposed to make debugging in production easier.
Let's break down these components.
“Large scale” means the ability to support millions of flags served to end-users with minimal latency or impact on application uptime or performance. That is the type of system most large enterprises are building today and the type of feature flag system that this guide focuses on.
- **Feature Flag Control Service**: A service that acts as the control plane for your feature flags, managing all flag configurations. The scope of this service should reflect the boundaries of your organization.
- **Database or data store**: A robust, scalable, and highly available database or data store that stores feature flag configurations reliably. Common options include SQL databases, NoSQL databases, or key-value stores.
- **API layer**: An API layer that exposes endpoints for your application to interact with the _Feature Flag Control Service_. This API should allow your application to request feature flag configurations.
- **Feature flag SDK**: An easy-to-use interface for fetching flag configurations and evaluating feature flags at runtime. When considering feature flags in your application, the call to the SDK should query the local cache, and the SDK should ask the central service for updates in the background.
- **Continuous update mechanism**: An update mechanism that enables dynamic updates to feature flag configurations without requiring application restarts or redeployments. The SDK should handle subscriptions or polling to the _Feature Flag Control Service_ for updates.
Our motivation for writing these principles is to share what we've learned building a large-scale feature flag solution with other architects and engineers solving similar challenges. Unleash is open-source, and so are these principles. Have something to contribute? [Open a PR](https://github.com/Unleash/unleash/pulls) or [discussion](https://github.com/orgs/Unleash/discussions) on our GitHub.
<Figure caption="The SDK holds an in-memory feature flag configuration cache which is continuously synced with the Feature Flag Control Service. You can then use the SDK to check the state of feature flags in your application." img="/img/feature-flag-scalable-architecture.png"/>
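The SDK behavior described above (reads served from a local cache, updates pulled in the background) can be sketched roughly as follows. This is a minimal illustration, not the Unleash SDK: `FlagClient`, `fetch_config`, and the flag name `new-checkout` are hypothetical, and the callable stands in for a request to the control service's API.

```python
import threading
from typing import Callable, Dict


class FlagClient:
    """Hypothetical SDK sketch: lookups read a local cache, updates arrive in the background."""

    def __init__(self, fetch_config: Callable[[], Dict[str, bool]]):
        self._fetch_config = fetch_config  # stands in for a call to the control service API
        self._cache: Dict[str, bool] = {}
        self._lock = threading.Lock()
        self._stop = threading.Event()

    def refresh(self) -> None:
        """Pull the latest configuration; keep the previous cache if the fetch fails."""
        try:
            fresh = self._fetch_config()
        except Exception:
            return
        with self._lock:
            self._cache = dict(fresh)

    def start_polling(self, interval_seconds: float) -> threading.Thread:
        """Refresh the cache on a daemon thread until stop() is called."""
        def loop() -> None:
            while not self._stop.is_set():
                self.refresh()
                self._stop.wait(interval_seconds)

        thread = threading.Thread(target=loop, daemon=True)
        thread.start()
        return thread

    def stop(self) -> None:
        self._stop.set()

    def is_enabled(self, name: str, default: bool = False) -> bool:
        """Answer from memory only; never touches the network."""
        with self._lock:
            return self._cache.get(name, default)


remote = {"new-checkout": False}       # pretend this lives in the control service
client = FlagClient(lambda: remote)
client.refresh()
before = client.is_enabled("new-checkout")
remote["new-checkout"] = True          # someone flips the flag at runtime
client.refresh()                       # normally triggered by the polling thread
after = client.is_enabled("new-checkout")
```

In a real application you would call `start_polling` once at startup instead of calling `refresh` by hand. The point of the sketch is that the flag changes without a restart or redeployment.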
## 1. Enable run-time control. Control flags dynamically, not using config files.
## 2. Protect PII by evaluating flags server-side
A scalable feature management system evaluates flags at runtime. Flags are dynamic, not static. If you need to restart your application to turn on a flag, you are using configuration, not feature flags.
Feature flags often require contextual data for accurate evaluation, which could include sensitive information such as user IDs, email addresses, or geographical locations. To safeguard this data, follow the data security [principle of least privilege (PoLP)](https://www.cyberark.com/what-is/least-privilege), ensuring that all [Personally Identifiable Information (PII)](https://www.investopedia.com/terms/p/personally-identifiable-information-pii.asp) remains confined to your application.
A large-scale feature flag system that enables runtime control should have at minimum the following components:
To implement the principle of least privilege, ensure that your _Feature Flag Control Service_ only handles the configuration for your feature flags and passes this configuration down to the SDKs connecting from your applications.
**1. Feature Flag Control Service**: Use a centralized feature flag service that acts as the control plane for your feature flags. This service will handle flag configuration. The scope of this service should reflect the boundaries of your organization.
Let's look at an example where feature flag evaluation happens inside the server-side application. This is where all the contextual application data lives. The flag configuration—all the information needed to evaluate the flags—is fetched from the _Feature Flag Control Service_.
Independent business units or product lines should potentially have their own instances, while business units or product lines that work closely together should most likely use the same instance in order to facilitate collaboration. This will always be a contextual decision based on your organization and how you organize the work, but keep in mind that you'd like to keep the management of the flags as simple as possible to avoid the complexity of cross-instance synchronization of feature flag configuration.
<Figure caption="Evaluating flags on the server side without exposing sensitive information." img="/img/feature-flag-server-side-evaluation.png"/>
**2. Database or Data Store**: Use a robust and scalable database or data store to store feature flag configurations. Popular choices include SQL or NoSQL databases or key-value stores. Ensure that this store is highly available and reliable.
Client-side applications, where the code resides on the user's machine in browsers or mobile devices, require a different approach. You can't evaluate flags on the client side because it raises significant security concerns by exposing potentially sensitive information such as API keys, flag data, and flag configurations. Placing these critical elements on the client side increases the risk of unauthorized access, tampering, or data breaches.
**3. API Layer**: Develop an API layer that exposes endpoints for your application to interact with the Feature Flag Control Service. This API should allow your application to request feature flag configurations.
**4. Feature Flag SDK**: Build or integrate a feature flag SDK into your application. This SDK should provide an easy-to-use interface for fetching flag configurations and evaluating feature flags at runtime. When evaluating feature flags in your application, the call to the SDK should query the local cache, and the SDK should ask the central service for updates in the background.
Build SDK bindings for each relevant language in your organization. Make sure that the SDKs uphold a standard contract governed by a set of feature flag client specifications that documents what functionality each SDK should support.
<Figure caption="The SDK holds an in-memory feature flag configuration cache which is continuously synced with the feature flag control service. The SDK can then be used to check the state of feature flags in your application." img="/img/feature-flag-scalable-architecture.png"/>
**5. Continuously Updated**: Implement update mechanisms in your application so that changes to feature flag configurations are reflected without requiring application restarts or redeployments. The SDK should handle subscriptions or polling to the feature flag service for updates.
## 2. Never expose PII. Follow the principle of least privilege.
To keep things simple, you may be tempted to evaluate the feature flags in your Feature Flag Control Service. Don't. Your Feature Flag Control Service should only handle the configuration for your feature flags and pass this configuration down to SDKs connecting from your applications.
The primary rationale behind this practice is that feature flags often require contextual data for accurate evaluation. This may include user IDs, email addresses, or geographical locations that influence whether a flag should be toggled on or off. Safeguarding this sensitive information from external exposure is paramount. This information may include Personally Identifiable Information (PII), which must remain confined within the boundaries of your application, following the data security principle of least privilege (PoLP).
<Figure caption="Evaluation happens inside the server-side application, where you have all the contextual application data. The flag configuration (how to evaluate the flags) is fetched from the feature flagging control service." img="/img/feature-flag-server-side-evaluation.png"/>
For client-side applications where the code resides on the user's machine, such as in the browser or on mobile devices, you'll want to take a different approach. You can't evaluate on the client side because it raises significant security concerns by exposing potentially sensitive information such as API keys, flag data, and flag configurations. Placing these critical elements on the client side increases the risk of unauthorized access, tampering, or data breaches.
Instead of performing client-side evaluation, a more secure and maintainable approach is to conduct feature flag evaluation within a self-hosted environment. Doing so can safeguard sensitive elements like API keys and flag configurations from potential client-side exposure. This strategy involves a server-side evaluation of feature flags, where the server makes decisions based on user and application parameters and then securely passes down the evaluated results to the frontend without any configuration leaking.
Instead of performing client-side evaluation, a more secure and maintainable approach is to evaluate feature flags within a self-hosted environment. Doing so can safeguard sensitive elements like API keys and flag configurations from potential client-side exposure. This strategy involves a server-side evaluation of feature flags, where the server makes decisions based on user and application parameters and then securely passes down the evaluated results to the frontend without any configuration leaking.
<Figure caption="In client-side setups, perform the feature flag evaluation on the server side. Connected client-side applications receive only evaluated feature flags to avoid leaking configuration." img="/img/feature-flag-architecture-client-side.png"/>
Here's how you can architect your solution to minimize PII or configuration leakage:
Here's how you can architect your solution to protect PII and flag configuration data:
1. **Server-Side Components**:
### Server-side components
In Principle 1, we proposed a set of architectural principles and components to set up a Feature Flag Control Service. The same architecture patterns apply here, with additional suggestions for achieving local evaluation. Refer to Principle 1 for patterns to set up a feature flagging service.
In [Principle 1](#1-enable-runtime-control), we proposed a set of architectural components for building a feature flag system. The same principles apply here, with additional suggestions for achieving local evaluation. For client-side setups, use a dedicated evaluation server that can evaluate feature flags and pass evaluated results to the client SDK.
**Feature Flag Evaluation Service**: If you need to use feature flags on the client side, where code is delivered to users' devices, you'll need an evaluation server that can evaluate feature flags and pass evaluated results down to the SDK in the client application.
### SDKs
2. **SDKs**:
[SDKs](/reference/sdks) make it more convenient to work with feature flags. Depending on the context of your infrastructure, you need different types of SDKs to talk to your feature flagging service. Server-side SDKs should fetch configurations from the _Feature Flag Control Service_ and evaluate flags locally using the application's context, reducing the need for frequent network calls.
SDKs will make it more comfortable to work with feature flags. Depending on the context of your infrastructure, you need different types of SDKs to talk to your feature flagging service. For the server side, you'll need SDKs that can talk directly to the feature flagging service and fetch the configuration.
For client-side feature flags, SDKs should send the context to an evaluation server and receive the evaluated results. The evaluated results are then cached in memory in the client-side application, allowing quick lookups without additional network overhead. This provides the performance benefits of local evaluation while minimizing the exposure of sensitive data.
The server-side SDKs should implement logic to evaluate feature flags based on the configuration received from the Feature Flag Control Service and the application-specific context. Local evaluation ensures that decisions are made quickly without relying on network roundtrips.
This approach enhances privacy by minimizing the amount of sensitive data sent to the _Feature Flag Control Service_, reducing the risk of data exposure and potential security threats.
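As a rough illustration of the client-side flow described above, the sketch below shows an evaluation server that keeps the flag configuration to itself and returns only evaluated booleans for a given context. The configuration shape, the `allowed_plans` rule, and the function name `evaluate_for_client` are assumptions made for this example, not an Unleash API.

```python
from typing import Any, Dict

# Hypothetical flag configuration. It stays on the server and is never sent to clients.
FLAG_CONFIG: Dict[str, Dict[str, Any]] = {
    "beta-dashboard": {"enabled": True, "allowed_plans": ["enterprise"]},
    "dark-mode": {"enabled": True, "allowed_plans": []},  # empty list means everyone
    "legacy-export": {"enabled": False, "allowed_plans": []},
}


def evaluate_for_client(context: Dict[str, str]) -> Dict[str, bool]:
    """Evaluate every flag for one context and return only the boolean results."""
    results: Dict[str, bool] = {}
    for name, rule in FLAG_CONFIG.items():
        plans = rule["allowed_plans"]
        matches_plan = not plans or context.get("plan") in plans
        results[name] = bool(rule["enabled"] and matches_plan)
    return results


# What a client-side SDK would receive and cache in memory for its context.
payload = evaluate_for_client({"plan": "free"})
```

The client sees which flags are on for its own context, but not the targeting rules or who else is targeted.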
For client-side feature flags, you'll need a different type of SDK. These SDKs will send the context to the Feature Flag Evaluation Service and receive the evaluated results. These results should be stored in memory and used when doing a feature flag lookup in the client-side application. By keeping the evaluated results for a specific context in memory in the client-side application, you avoid network roundtrips every time your application needs to check the status of a feature flag. It achieves the same level of performance as a server-side SDK, but the content stored in memory is different and limited to evaluated results on the client.
## 3. Evaluate flags as close to the user as possible
The benefits of this approach include:
For optimal performance, you should evaluate feature flags as close to your users as possible. Building on the server-side evaluation approach from [Principle 2](#2-protect-pii-by-evaluating-flags-server-side), let's look at how evaluating flags locally can bring key benefits in terms of performance, cost, and reliability:
**Privacy Protection**:
- **Reduced latency**: Network roundtrips introduce latency, slowing your application's response time. Local evaluation eliminates the need for these roundtrips, resulting in faster feature flag decisions. This makes your application more responsive, thereby improving the user experience.
- **Offline functionality**: Many applications need to function offline or in low-connectivity environments. Local evaluation ensures feature flags are still functional, even without an active network connection. This is especially important for mobile apps or services in remote locations.
- **Reduced bandwidth costs**: Local evaluation reduces the amount of data transferred between your application and the feature flag service. This can lead to significant cost savings, particularly if you have a large user base or high traffic volume.
- **Ease of development and testing**: Developers can continue their work in environments where a network connection to the feature flag service might be unstable or unavailable. Local evaluation allows teams to work on feature flag-related code without needing constant access to the service, streamlining the development process.
- **Resilience during service downtime**: If the feature flag service experiences downtime or outages, local evaluation allows your application to continue functioning without interruptions. This is important for maintaining service reliability and ensuring your application remains available even when the service is down.
a. **Data Minimization**: By evaluating feature flags in this way, you minimize the amount of data that needs to be sent to the Feature Flag Control Service. This can be crucial for protecting user privacy, as less user-specific data is transmitted over the network.
In summary, this principle emphasizes optimizing performance while protecting end-user privacy by evaluating feature flags as close to the end user as possible. This also leads to a highly available feature flag system that scales with your applications.
b. **Reduced Data Exposure**: Sensitive information about your users or application's behavior is less likely to be exposed to potential security threats. Data breaches or leaks can be mitigated by limiting the transmission of sensitive data.
## 4. Scale horizontally by decoupling reads and writes
## 3. Evaluate flags as close to the user as possible. Reduce latency.
When designing a scalable feature flag system, one of the most effective strategies is to separate read and write operations into distinct APIs. This architectural decision not only allows you to scale each component independently but also provides better performance, reliability, and control.
Feature flags should be evaluated as close to your users as possible, and the evaluation should always happen server side as discussed in Principle 2. In addition to security and privacy benefits, performing evaluation as close as possible to your users has multiple benefits:
<Figure caption="Separating the reading and writing of the database allows you to horizontally scale out the read APIs without scaling the write APIs." img="/img/feature-flag-horizontal-scaling.png"/>
1. **Performance Efficiency**:
By decoupling read and write operations, you gain the flexibility to scale horizontally based on the unique demands of your application. For example, if read traffic increases, you can add more servers or containers to handle the load without needing to scale the write operations.
a. **Reduced Latency**: Network roundtrips introduce latency, which can slow down your application's response time. Local evaluation eliminates the need for these roundtrips, resulting in faster feature flag decisions. Users will experience a more responsive application, which can be critical for maintaining a positive user experience.
The benefits of decoupling read and write operations extend beyond just scalability; let's look at a few others:
- **More efficient caching**: You can optimize your flag caching for read operations to reduce latency while keeping write operations consistent.
- **Granular access control**: You can apply different security measures and access controls to the two APIs, reducing the risk of accidental or unauthorized changes.
- **Improved monitoring and troubleshooting**: Monitoring and troubleshooting become more straightforward. It's easier to track and analyze the performance of each API independently. When issues arise, you can isolate the source of the problem more quickly and apply targeted fixes or optimizations.
- **Flexibility and maintenance**: Updates to one API won't directly impact the other, reducing the risk of unintended consequences. This separation of concerns allows development teams to work on each API separately, facilitating parallel development and deployment cycles.
- **Distributed traffic**: You can tailor load-balancing strategies to the specific needs of the read and write APIs. You can distribute traffic and resources accordingly to optimize performance and ensure that neither API becomes a bottleneck under heavy load.
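To make the read/write split concrete, here is a small in-process sketch: one write API owns the flag state, and several read replicas serve snapshots of it. All class names are hypothetical, and the round-robin iterator stands in for a load balancer.

```python
import itertools
from typing import Dict, List


class WriteApi:
    """Hypothetical write side: the only component allowed to change flags."""

    def __init__(self) -> None:
        self._flags: Dict[str, bool] = {}

    def set_flag(self, name: str, enabled: bool) -> None:
        self._flags[name] = enabled

    def snapshot(self) -> Dict[str, bool]:
        return dict(self._flags)


class ReadReplica:
    """Hypothetical read side: serves a cached snapshot and can be scaled out freely."""

    def __init__(self, source: WriteApi) -> None:
        self._source = source
        self._snapshot = source.snapshot()

    def sync(self) -> None:
        self._snapshot = self._source.snapshot()

    def get_flags(self) -> Dict[str, bool]:
        return dict(self._snapshot)


writer = WriteApi()
writer.set_flag("new-nav", True)

# Scale reads by adding replicas; the write API is untouched.
replicas: List[ReadReplica] = [ReadReplica(writer) for _ in range(3)]
round_robin = itertools.cycle(replicas)  # stands in for a load balancer
responses = [next(round_robin).get_flags() for _ in range(6)]

writer.set_flag("new-nav", False)
stale_read = replicas[0].get_flags()["new-nav"]  # replicas lag until they sync
replicas[0].sync()
fresh_read = replicas[0].get_flags()["new-nav"]
```

Replicas are briefly stale after a write, which is the trade-off this design accepts in exchange for independent read scaling.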
b. **Offline Functionality**: Applications often need to function offline or in low-connectivity environments. Local evaluation ensures that feature flags can still be used and decisions can be made without relying on a network connection. This is especially important for mobile apps or services in remote locations.
## 5. Limit feature flag payload
2. **Cost Savings**:
Minimizing the size of feature flag payloads is a critical aspect of maintaining the efficiency and performance of a feature flag system. Payload size can vary based on targeting rule complexity. For example, targeting based on individual user IDs may work with small user bases but becomes inefficient as the user base grows.
a. **Reduced Bandwidth Costs**: Local evaluation reduces the amount of data transferred between your application and the feature flag service. This can lead to significant cost savings, particularly if you have a large user base or high traffic volume.
Avoid storing large user lists directly in the feature flag configuration, which can lead to scaling issues. Instead, categorize users into logical groups at a higher layer (for example, by subscription plan or location) and use group identifiers for targeting within the feature flag system.
3. **Offline Development and Testing**:
Keeping the feature flag payload small results in:
a. **Development and Testing**: Local evaluation is crucial for local development and testing environments where a network connection to the feature flag service might not be readily available. Developers can work on feature flag-related code without needing constant access to the service, streamlining the development process.
- **Reduced network load**: Large payloads can lead to increased network traffic between the application and the feature flagging service. This can overwhelm the network and cause bottlenecks, leading to slow response times and degraded system performance. Even small optimizations can make a big difference at scale.
- **Faster flag evaluation**: Smaller payloads mean faster data transmission and flag evaluation, crucial for real-time decisions that affect user experience.
- **Improved memory efficiency**: Feature flagging systems often store flag configurations in memory for quick access during runtime. Larger payloads consume more memory, potentially causing memory exhaustion and system crashes. Limiting payloads ensures that the system remains memory-efficient, reducing the risk of resource-related issues.
- **Better scalability**: Smaller payloads require fewer resources, making it easier to scale your system as your application grows.
- **Lower infrastructure costs**: Optimized payloads reduce infrastructure needs and costs while simplifying system management.
- **Improved system reliability**: Delivering smaller, more manageable payloads minimizes the risk of network timeouts and failures.
- **Ease of monitoring and debugging**: Smaller payloads are easier to monitor and debug, making issue resolution faster.
4. **Resilience**:
For more insights into reducing payload size, visit our [Best practices for using feature flags at scale](/topics/feature-flags/best-practices-using-feature-flags-at-scale#14-avoid-giant-feature-flag-targeting-lists) guide.
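The difference between targeting individual users and targeting a group identifier is easy to measure. The sketch below serializes two hypothetical flag configurations and compares their sizes. The constraint format (`field`, `operator`, `values`) and the flag name are invented for illustration.

```python
import json

# Targeting 10,000 individual users: the list travels with every configuration fetch.
user_ids = [f"user-{i}" for i in range(10_000)]
by_user_list = {
    "name": "new-billing",
    "constraints": [{"field": "userId", "operator": "IN", "values": user_ids}],
}

# Targeting a group resolved elsewhere (for example, by subscription plan).
by_group = {
    "name": "new-billing",
    "constraints": [{"field": "plan", "operator": "IN", "values": ["enterprise"]}],
}

list_bytes = len(json.dumps(by_user_list).encode("utf-8"))
group_bytes = len(json.dumps(by_group).encode("utf-8"))

print(f"user list: {list_bytes} bytes, group: {group_bytes} bytes")
```

The user-list payload also grows with every new user, while the group-based payload stays the same size regardless of how many users are on the plan.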
a. **Service Outages**: If the feature flag service experiences downtime or outages, local evaluation allows your application to continue functioning without interruptions. This is important for maintaining service reliability and ensuring your application remains available even when the service is down.
## 6. Prioritize availability over consistency
In summary, this principle emphasizes the importance of optimizing performance while protecting end-user privacy by evaluating feature flags as close to the end user as possible. Done right, this also leads to a highly available feature flag system that scales with your applications.
Your application shouldn't have any dependency on the availability of your feature flag system. Robust feature flag systems avoid relying on real-time flag evaluations against a remote service, because any unavailability of the feature flag system would then cause application downtime, outages, degraded performance, or even a complete failure of your application.
## 4. Scale Horizontally. Decouple reading and writing flags.
If the feature flag system fails, your application should continue running smoothly. Feature flagging should degrade gracefully, preventing any unexpected behavior or disruptions for users.
Separating the reading and writing of feature flags into distinct APIs is a critical architectural decision for building a scalable and efficient feature flag system, particularly when considering horizontal scaling. This separation provides several benefits:
You can implement the following strategies to achieve a resilient architecture:
<Figure caption="Separating reading and writing of the database allows you to horizontally scale out the read APIs (for instance by placing them behind a load balancer) without scaling the write APIs." img="/img/feature-flag-horizontal-scaling.png"/>
- **Bootstrap SDKs with data**: Feature flagging SDKs should work with locally cached data, even when the network connection to the _Feature Flag Control Service_ is unavailable, using the last known configuration or defaults to ensure uninterrupted functionality.
- **Use local cache**: Maintaining a local cache of feature flag configurations helps reduce network round trips and dependency on external services. The local cache can periodically synchronize with the central _Feature Flag Control Service_ when it's available. This approach minimizes the impact of network failures or service downtime on your application.
- **Evaluate feature flags locally**: Whenever possible, the SDKs or application components should evaluate feature flags locally without relying on external services, ensuring uninterrupted feature flag evaluations even if the feature flagging service is down.
- **Prioritize availability over consistency**: In line with the [CAP theorem](https://www.ibm.com/topics/cap-theorem), design for availability over strict consistency. In the face of network partitions or downtime of external services, your application should favor maintaining its availability rather than enforcing perfectly consistent feature flag configuration caches. Eventually consistent systems can tolerate temporary inconsistencies in flag evaluations without compromising availability.
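The strategies above can be sketched as a flag store that starts from bootstrap data and keeps its last known configuration when a sync fails. The class name `ResilientFlags` and the simulated outage are assumptions for this example.

```python
from typing import Callable, Dict


class ResilientFlags:
    """Hypothetical sketch: favors availability by serving the last known configuration."""

    def __init__(self, bootstrap: Dict[str, bool]) -> None:
        # Bootstrap data (for example, shipped with the deployment) is usable immediately.
        self._config = dict(bootstrap)

    def sync(self, fetch: Callable[[], Dict[str, bool]]) -> bool:
        """Try to update from the control service; on failure, keep what we have."""
        try:
            self._config = dict(fetch())
            return True
        except (ConnectionError, TimeoutError):
            return False

    def is_enabled(self, name: str, default: bool = False) -> bool:
        return self._config.get(name, default)


def unreachable_service() -> Dict[str, bool]:
    raise ConnectionError("control service is down")


flags = ResilientFlags(bootstrap={"new-search": True})
synced = flags.sync(unreachable_service)          # the outage does not raise into the app
still_enabled = flags.is_enabled("new-search")    # served from the bootstrap data
unknown_flag = flags.is_enabled("unknown-flag")   # falls back to the default
```

Flag values may be slightly out of date during an outage, but the application keeps answering.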
1. **Horizontal Scaling**:
## 7. Make flags short-lived
- By separating read and write APIs, you can horizontally scale each component independently. This enables you to add more servers or containers to handle increased traffic for reading feature flags, writing updates, or both, depending on the demand.
The most common use case for feature flags is to manage the rollout of new functionality. Once a rollout is complete, you should remove the feature flag from your code and archive it. Remove any old code paths that the new functionality replaces.
2. **Caching Efficiency**:
Avoid using feature flags for static application configuration. Application configuration should be consistent, long-lived, and loaded during application startup. In contrast, feature flags are intended to be short-lived, dynamic, and updated at runtime. They prioritize availability over consistency and are designed to be modified frequently.
- Feature flag systems often rely on caching to improve response times for flag evaluations. Separating read and write APIs allows you to optimize caching strategies independently. For example, you can cache read operations more aggressively to minimize latency during flag evaluations while still ensuring that write operations maintain consistency across the system.
To succeed with feature flags in a large organization, follow these strategies:
3. **Granular Access Control**:
- **Set flag expiration dates**: Assign expiration dates to feature flags to track which flags are no longer needed. A good feature flag management tool will alert you to expired flags, making it easier to maintain your codebase.
- **Treat feature flags like technical debt**: Incorporate tasks to remove outdated feature flags into your sprint or project planning, just as you would with technical debt. Feature flags add complexity to your code by introducing multiple code paths that need context and maintenance. If you don't clean up feature flags in a timely manner, you risk losing the context as time passes or personnel changes, making them harder to manage or remove.
- **Archive old flags**: When feature flags are no longer in use, archive them after removing them from the codebase. This archive serves as an important audit log of feature flags and allows you to revive flags if you need to restore an older version of your application.
- Separation of read and write APIs simplifies access control and permissions management. You can apply different security measures and access controls to the two APIs. This helps ensure that only authorized users or systems can modify feature flags, reducing the risk of accidental or unauthorized changes.
While most feature flags should be short-lived, there are valid exceptions for long-lived flags, including:
- **Kill switches**: These act as inverted feature flags, allowing you to gracefully disable parts of a system with known weak spots.
- **Internal flags**: Used to enable additional debugging, tracing, and metrics at runtime, which are too costly to run continuously. Engineers can enable these flags while debugging issues.
4. **Better Monitoring and Troubleshooting**:
## 8. Ensure unique flag names
- Monitoring and troubleshooting become more straightforward when read and write operations are separated. It's easier to track and analyze the performance of each API independently. When issues arise, you can isolate the source of the problem more quickly and apply targeted fixes or optimizations.
5. **Flexibility and Maintenance**:
- Separation of concerns makes your system more flexible and maintainable. Changes or updates to one API won't directly impact the other, reducing the risk of unintended consequences. This separation allows development teams to work on each API separately, facilitating parallel development and deployment cycles.
6. **Load Balancing**:
- Load balancing strategies can be tailored to the specific needs of the read and write APIs. You can distribute traffic and resources accordingly to optimize performance and ensure that neither API becomes a bottleneck under heavy loads.
## 5. Limit payloads. Feature flag payload should be as small as possible.
Minimizing the size of feature flag payloads is a critical aspect of maintaining the efficiency and performance of a feature flag system. The configuration of your feature flags can vary in size depending on the complexity of your targeting rules. For instance, if you have a targeting engine that determines whether a feature flag should be active or inactive based on individual user IDs, you might be tempted to include all these user IDs within the configuration payload. While this approach may work fine for a small user base, it can become unwieldy when dealing with a large number of users.
If you find yourself facing this challenge, your instinct might be to store this extensive user information directly in the feature flagging system. However, this can also run into scaling problems. A more efficient approach is to categorize these users into logical groupings at a different layer and then use these group identifiers when you evaluate flags within your feature flagging system. For example, you can group users based on their subscription plan or geographical location. Find a suitable parameter for grouping users, and employ those group parameters as targeting rules in your feature flagging solution.
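The grouping approach can be sketched as follows. This is a minimal illustration, not the API of any particular SDK: the `FlagConfig` and `UserContext` shapes, the `targetPlans` field, and the `isEnabled` function are all hypothetical names. The point is that the flag configuration carries a short, bounded list of group identifiers, while the application resolves each user's group outside the feature flagging system.

```typescript
// Hypothetical shapes for illustration only; not a real SDK's types.
interface FlagConfig {
  name: string;
  enabled: boolean;
  // A short, bounded list of group identifiers instead of raw user IDs.
  targetPlans: string[];
}

interface UserContext {
  userId: string;
  // Resolved by the application (for example from a billing service),
  // so the flag payload never needs to list individual users.
  plan: string;
}

// Evaluates the flag against the user's group rather than the user's ID.
function isEnabled(flag: FlagConfig, ctx: UserContext): boolean {
  if (!flag.enabled) {
    return false;
  }
  return flag.targetPlans.includes(ctx.plan);
}
```

With this shape, the payload size depends on the number of groups, not the number of users, so it stays roughly constant as the user base grows.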
Imposing limitations on payloads is crucial for scaling a feature flag system:
1. **Reduced Network Load**:
- Large payloads, especially for feature flag evaluations, can lead to increased network traffic between the application and the feature flagging service. This can overwhelm the network and cause bottlenecks, leading to slow response times and degraded system performance. Limiting payloads helps reduce the amount of data transferred over the network, alleviating this burden. Even small numbers become large when multiplied by millions.
2. **Faster Evaluation**:
- Smaller payloads reduce latency which means quicker transmission and evaluation. Speed is essential when evaluating feature flags, especially for real-time decisions that impact user experiences. Limiting payloads ensures evaluations occur faster, allowing your application to respond promptly to feature flag changes.
3. **Improved Memory Efficiency**:
- Feature flagging systems often store flag configurations in memory for quick access during runtime. Larger payloads consume more memory, potentially causing memory exhaustion and system crashes. By limiting payloads, you ensure that the system remains memory-efficient, reducing the risk of resource-related issues.
4. **Scalability**:
- Scalability is a critical concern for modern applications, especially those experiencing rapid growth. Feature flagging solutions need to scale horizontally to accommodate increased workloads. Smaller payloads require fewer resources for processing, making it easier to scale your system horizontally.
5. **Lower Infrastructure Costs**:
- When payloads are limited, the infrastructure required to support the feature flagging system can be smaller and less costly. This saves on infrastructure expenses and simplifies the management and maintenance of the system.
6. **Reliability**:
- A feature flagging system that consistently delivers small, manageable payloads is more likely to be reliable. It reduces the risk of network failures, timeouts, and other issues when handling large data transfers. Reliability is paramount for mission-critical applications.
7. **Ease of Monitoring and Debugging**:
- Smaller payloads are easier to monitor and debug. When issues arise, it's simpler to trace problems and identify their root causes when dealing with smaller, more manageable data sets.
## 6. Design for failure. Favor availability over consistency.
Your feature flag system should not be able to take down your main application under any circumstance, including network disruptions. Follow these patterns to achieve fault tolerance for your feature flag system.
**Zero dependencies**: Your application's availability should have zero dependencies on the availability of your feature flag system. Robust feature flag systems avoid relying on real-time flag evaluations because the unavailability of the feature flag system will cause application downtime, outages, degraded performance, or even a complete failure of your application.
**Graceful degradation**: If the system goes down, it should not disrupt the user experience or cause unexpected behavior. Feature flagging should gracefully degrade in the absence of the Feature Flag Control service, ensuring that users can continue to use the application without disruption.
**Resilient Architecture Patterns**:
- **Bootstrapping SDKs with Data**: Feature flagging SDKs used within your application should be designed to work with locally cached data, even when the network connection to the Feature Flag Control service is unavailable. The SDKs can bootstrap with the last known feature flag configuration or default values to ensure uninterrupted functionality.
- **Local Cache**: Maintaining a local cache of feature flag configuration helps reduce network round trips and dependency on external services. The local cache can be periodically synchronized with the central Feature Flag Control service when it's available. This approach minimizes the impact of network failures or service downtime on your application.
- **Evaluate Locally**: Whenever possible, the SDKs or application components should be able to evaluate feature flags locally without relying on external services. This ensures that feature flag evaluations continue even when the feature flagging service is temporarily unavailable.
- **Availability Over Consistency**: As the CAP theorem teaches us, in distributed systems, prioritizing availability over strict consistency can be a crucial design choice. This means that, in the face of network partitions or downtime of external services, your application should favor maintaining its availability rather than enforcing perfectly consistent feature flag configuration caches. Eventually consistent systems can tolerate temporary inconsistencies in flag evaluations without compromising availability. In CAP theorem parlance, a feature flagging system should aim for AP over CP.
By implementing these resilient architecture patterns, your feature flagging system can continue to function effectively even in the presence of downtime or network disruptions in the feature flagging service. This ensures that your main application remains stable, available, and resilient to potential issues in the feature flagging infrastructure, ultimately leading to a better user experience and improved reliability.
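As a rough sketch of how these patterns fit together, the class below bootstraps from a known snapshot, evaluates flags locally, and keeps the last known configuration when a refresh fails. The `LocalFlagCache` class, the `FlagSnapshot` type, and the `fetchFlags` callback are hypothetical names chosen for this example; a real SDK would add polling, persistence, and richer flag types.

```typescript
// Hypothetical snapshot format: flag name -> on/off.
type FlagSnapshot = Record<string, boolean>;

class LocalFlagCache {
  private flags: Map<string, boolean>;

  // Bootstrap with the last known configuration (or shipped defaults),
  // so evaluation works before the first network call succeeds.
  constructor(bootstrap: FlagSnapshot) {
    this.flags = new Map(Object.entries(bootstrap));
  }

  // fetchFlags is an assumed callback that talks to the control service.
  // If it fails, the cache keeps serving the last known configuration:
  // availability is favored over consistency.
  async refresh(fetchFlags: () => Promise<FlagSnapshot>): Promise<boolean> {
    try {
      const snapshot = await fetchFlags();
      this.flags = new Map(Object.entries(snapshot));
      return true;
    } catch {
      return false;
    }
  }

  // Local evaluation: no network round trip, with a caller-supplied
  // default for flags the cache has never seen.
  isEnabled(name: string, defaultValue = false): boolean {
    return this.flags.get(name) ?? defaultValue;
  }
}
```

Because `isEnabled` never touches the network, an outage in the control service only delays updates; it does not affect evaluation.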
## 7. Make feature flags short-lived. Do not confuse flags with application configuration.
Feature flags have a lifecycle shorter than an application lifecycle. The most common use case for feature flags is to protect new functionality. That means that when the roll-out of new functionality is complete, the feature flag should be removed from the code and archived. If there were old code paths that the new functionality replaces, those should also be cleaned up and removed.
Feature flags should not be used for static application configuration. Application configuration is expected to be consistent, long-lived, and read when launching an application. Using feature flags to configure an application can lead to inconsistencies between different instances of the same application. Feature flags, on the other hand, are designed to be short-lived, dynamic, and changed at runtime. They are expected to be read and updated at runtime and favor availability over consistency.
To succeed with feature flags in a large organization, you should:
- **Use flag expiration dates**: By setting expiration dates for your feature flags, you make it easier to keep track of old feature flags that are no longer needed. A proper feature flag solution will inform you about potentially expired flags.
- **Treat feature flags like technical debt**: Plan to clean up old feature flags in sprint or project planning, the same way you plan to clean up technical debt in your code. Feature flags add complexity to your code. You'll need to know what code paths the feature flag enables, and while the feature flag lives, its context needs to be maintained and known within the organization. If you don't clean up feature flags, you may eventually lose the context surrounding them as time passes or personnel changes. The longer you wait, the harder it becomes to remove flags or to operate them effectively.
- **Archive old flags**: Feature flags that are no longer in use should be archived after their usage has been removed from the codebase. The archive serves as an important audit log of feature flags that are no longer in use, and allows you to revive them if you need to install an older version of your application.
There are valid exceptions to short-lived feature flags, but in general you should try to limit the number of long-lived feature flags. Some examples include:
- Kill-switches - these work like an inverted feature flag and are used to gracefully disable part of a system with known weak spots.
- Internal flags used to enable additional debugging, tracing, and metrics at runtime, which are too costly to run all the time. These can be enabled by software engineers while debugging issues.
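One way to picture expiration tracking is a periodic check that lists flags older than their expected lifetime while exempting the long-lived kinds described above. The sketch below is illustrative only: the `FlagKind` values, the `EXPECTED_LIFETIME_DAYS` table, and the 40-day figure are assumptions made for this example, not values prescribed by any tool.

```typescript
// Hypothetical flag kinds; "kill-switch" and "operational" model the
// long-lived exceptions, "release" models a normal short-lived flag.
type FlagKind = "release" | "kill-switch" | "operational";

interface FlagRecord {
  name: string;
  kind: FlagKind;
  createdAt: Date;
}

const DAY_MS = 24 * 60 * 60 * 1000;

// Assumed lifetimes in days; null means the flag is not expected to expire.
const EXPECTED_LIFETIME_DAYS: Record<FlagKind, number | null> = {
  release: 40,
  "kill-switch": null,
  operational: null,
};

// Returns the names of flags that have outlived their expected lifetime.
// `now` is a parameter so the check is deterministic and easy to test.
function findExpiredFlags(flags: FlagRecord[], now: Date): string[] {
  return flags
    .filter((flag) => {
      const lifetime = EXPECTED_LIFETIME_DAYS[flag.kind];
      if (lifetime === null) {
        return false;
      }
      return now.getTime() - flag.createdAt.getTime() > lifetime * DAY_MS;
    })
    .map((flag) => flag.name);
}
```

The resulting list could feed a cleanup task in sprint planning, which ties expiration dates to the technical-debt practice described above.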
## 8. Use unique names across all applications. Enforce naming conventions.
All flags served by the same Feature Flag Control service should have unique names across the entire cluster to avoid inconsistencies and errors.
- **Avoid zombies:** Uniqueness should be controlled using a global list of feature flag names. This prevents the reuse of old flag names to protect new features. Using old names can lead to accidental exposure of old features, still protected with the same feature flag name.
- **Naming convention enforcement:** Ideally, unique names are enforced at creation time. In a large organization, it is impossible for all developers to know all flags used. Enforcing a naming convention makes naming easier, ensures consistency, and provides an easy way to check for uniqueness.
Ensure that all flags within the same _Feature Flag Control Service_ have unique names across your entire system. Unique naming prevents the reuse of old flag names, reducing the risk of accidentally re-enabling outdated features with the same name.
Unique naming has the following advantages:
- **Flexibility over time**: Large enterprise systems are not static. Monoliths may split into microservices, microservices may merge, and applications change responsibility. Unique flag naming across your organization means that you can reorganize your flags to match the changing needs of your organization.
- **Fewer conflicts**: If two applications use the same feature flag name, it can become difficult to identify which flag controls which application. Even with separate namespaces, you risk toggling the wrong flag, leading to unexpected consequences.
- **Easier flag management**: Unique names make it simpler to track and identify feature flags. Searching across codebases becomes more straightforward, and it's easier to understand a flag's purpose and where it's used.
- **Improved collaboration**: A feature flag with a unique name in the organization simplifies collaboration across teams, products, and applications, ensuring that everyone refers to the same feature.
- **Flexibility over time:** Large enterprise systems are not static. Over time, monoliths are split into microservices, microservices are merged into larger microservices, and applications change responsibility. This means that the way flags are grouped will change over time, and a unique name for the entire organization ensures that you keep the option to reorganize your flags to match the changing needs of your organization.
- **Prevent conflicts**: If two applications use the same feature flag name, it can be impossible to know which flag controls which application. This can lead to accidentally flipping the wrong flag, even if they are separated into different namespaces (projects, workspaces, etc.).
- **Easier to manage:** It's easier to know what a flag is used for and where it is being used when it has a unique name. For example, it is easier to search across multiple codebases for references to a feature flag when it has a unique identifier across the entire organization.
- **Enables collaboration:** When a feature flag has a unique name in the organization, it simplifies collaboration across teams, products and applications. It ensures that we all talk about the same feature.
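Enforcing uniqueness at creation time can be sketched with a global registry that remembers every name ever used, including archived ones, so an old name cannot come back as a "zombie". The `FlagNameRegistry` class and the `NAME_PATTERN` convention (lowercase words joined by hyphens) are hypothetical and stand in for whatever convention your organization picks.

```typescript
// Assumed convention: at least two lowercase words joined by hyphens,
// for example "checkout-new-payment-form".
const NAME_PATTERN = /^[a-z][a-z0-9]*(-[a-z0-9]+)+$/;

class FlagNameRegistry {
  // Every name ever registered. Names are never removed, so archived
  // flags keep their names reserved.
  private readonly usedNames = new Set<string>();
  private readonly archivedNames = new Set<string>();

  // Rejects names that break the convention or were used before.
  register(name: string): void {
    if (!NAME_PATTERN.test(name)) {
      throw new Error(`Flag name "${name}" does not match the naming convention`);
    }
    if (this.usedNames.has(name)) {
      throw new Error(`Flag name "${name}" has already been used`);
    }
    this.usedNames.add(name);
  }

  // Archiving marks the flag as retired but keeps the name reserved.
  archive(name: string): void {
    if (!this.usedNames.has(name)) {
      throw new Error(`Unknown flag "${name}"`);
    }
    this.archivedNames.add(name);
  }

  isArchived(name: string): boolean {
    return this.archivedNames.has(name);
  }
}
```

Keeping archived names in the same set as active ones is the design choice that prevents reuse: code that still references an old flag can never be re-enabled by a new flag with the same name.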
## 9. Choose open by default
## 9. Choose open by default. Democratize feature flag access.
At Unleash, we believe in democratizing feature flag access. Making feature flag systems open by default enables engineers, product owners, and support teams to collaborate effectively and make informed decisions. Open access encourages productive discussions about feature releases, experiments, and their impact on the user experience.
Allowing engineers, product owners, and even technical support to have open access to a feature flagging system is essential for effective development, debugging, and decision-making. These groups should have access to the system, along with access to the codebase and visibility into configuration changes:
Access control and visibility are also key considerations for security and compliance. Tracking and auditing feature flag changes help maintain data integrity and meet regulatory requirements. While open access is key, it's equally important to integrate with corporate access controls, such as SSO, to ensure security. In some cases, additional controls like feature flag approvals using the [four-eyes principle](/reference/change-requests) are necessary for critical changes.
1. **Debugging and Issue Resolution**:
For open collaboration, consider providing the following:
- **Access to the codebase**: Engineers need direct access to the codebase where feature flags are implemented. This allows them to quickly diagnose and fix issues, minimizing downtime and performance problems.
- **Access to configuration**: Engineers, product owners, and even technical support should be able to view feature flag configuration. This transparency provides insights into which features are currently active, what conditions trigger them, and how they impact the application's behavior. Product owners can also make real-time decisions on feature rollouts or adjustments without relying solely on engineering resources.
- **Access to analytics**: Both engineers and product owners should be able to correlate feature flag changes with production metrics. This helps assess how flags impact user behavior, performance, and system health, enabling data-driven decisions for feature rollouts, optimizations, or rollbacks.
- **Code Access**: Engineers should have access to the codebase where feature flags are implemented. This access enables them to quickly diagnose and fix issues related to feature flags when they arise. Without code access, debugging becomes cumbersome, and troubleshooting becomes slower, potentially leading to extended downtimes or performance problems.
## 10. Prioritize consistent user experience
2. **Visibility into Configuration**:
Feature flagging solutions are indispensable tools in modern software development, enabling teams to manage feature releases and experiment with new functionality. However, one aspect that is absolutely non-negotiable in any feature flag solution is the need to ensure a consistent user experience. Feature flagging solutions must prioritize consistency and guarantee the same user experience every time, especially with percentage-based gradual rollouts.
- **Configuration Transparency**: Engineers, product owners, and even technical support should be able to view the feature flag configuration. This transparency provides insights into which features are currently active, what conditions trigger them, and how they impact the application's behavior. It helps understand the system's state and behavior, which is crucial for making informed decisions.
Strategies for consistency in percentage-based gradual rollouts:
- **Change History**: Access to a history of changes made to feature flags, including who made the changes and when, is invaluable. This audit trail allows teams to track changes to the system's behavior over time. It aids in accountability and can be instrumental in troubleshooting when unexpected behavior arises after a change.
- **User hashing**: Assign users to consistent groups using a secure hashing algorithm based on unique identifiers like user IDs or emails. This ensures that the same user consistently falls into the same group.
- **Segmentation control**: Provide controls within the feature flagging tool to allow developers to [segment](/reference/segments) users logically by criteria like location, subscription type, or other relevant factors, ensuring similar experiences for users within the same segment.
- **Fallback mechanisms**: Include fallback mechanisms in your architecture. If a user encounters issues or inconsistencies, the system should automatically switch them to a stable version or feature state.
- **Logging and monitoring**: Implement robust logging and monitoring. Continuously track which users are in which groups and what version of the feature they are experiencing. Monitor for anomalies or deviations and consider building automated processes to disable features that may be misbehaving.
- **Transparent communication**: Clearly communicate gradual rollouts through in-app notifications, tooltips, or changelogs, so users are informed about changes and know what to expect.
- **Correlating Changes with Metrics**: Engineers and product owners often need to correlate feature flag changes with production application metrics. This correlation helps them understand how feature flags affect user behavior, performance, and system health. It's essential for making data-driven decisions about feature rollouts, optimizations, or rollbacks.
## 11. Optimize for developer experience
3. **Collaboration**:
[Developer experience](https://www.opslevel.com/resources/devex-series-part-1-what-is-devex) is a critical factor to consider when implementing a feature flag solution. A positive developer experience enhances the efficiency of the development process and contributes to the overall success and effectiveness of feature flagging. One key aspect of developer experience is ensuring the testability of the SDK and providing tools for developers to understand how and why feature flags are evaluated.
- **Efficient Communication**: Open access fosters efficient communication between engineers and the rest of the organization. When it's open by default, everyone can see the feature flagging system and its changes, and have more productive discussions about feature releases, experiments, and their impact on the user experience.
To ensure a good developer experience, you should provide the following:
- **Simplified testing and debugging**: A testable SDK enables developers to quickly test and iterate on features, speeding up development cycles. Developers can toggle flags, simulate conditions, and observe results without significant code changes or redeployments. This makes it easier to identify and fix issues by examining flag configurations and logging decisions.
- **Visibility into flag behavior**: Developers need tools to understand how feature flags impact the user experience. Visibility into flag behavior helps them assess changes, debug effectively with multiple inputs, and collaborate more easily within cross-functional teams.
- **Effective monitoring**: A testable SDK should support real-time monitoring of flag performance, tracking metrics like evaluations, user engagement, and feature impact. Developers can use this data to evaluate the success of new features, conduct A/B tests, and make informed decisions about optimizations.
- **Usage metrics**: Provide aggregated insights into feature flag usage, helping developers confirm that everything is working as expected.
- **Documentation and training**: Offer clear, comprehensive documentation for the API, UI, and SDKs, with easy-to-follow examples. This simplifies onboarding for new developers and supports continuous training, ensuring the effective use of the feature flagging system.
4. **Empowering Product Decisions**:
- **Product Owner Involvement**: Product owners play a critical role in defining feature flags' behavior and rollout strategies based on user needs and business goals. Allowing them to access the feature flagging system empowers them to make real-time decisions about feature releases, rollbacks, or adjustments without depending solely on engineering resources.
5. **Security and Compliance**:
- **Security Audits**: Users of a feature flag system should be part of corporate access control groups such as SSO. Sometimes, additional controls are necessary, such as feature flag approvals using the four-eyes principle.
Access control and visibility into feature flag changes are essential for security and compliance purposes. It helps track and audit who has made changes to the system, which can be crucial in maintaining data integrity and adhering to regulatory requirements.
## 10. Do no harm. Prioritize consistent user experience.
Feature flagging solutions are indispensable tools in modern software development, enabling teams to manage feature releases and experiment with new functionality. However, one aspect that is absolutely non-negotiable in any feature flag solution is the need to ensure a consistent user experience. This isn't a luxury; it's a fundamental requirement. Feature flagging solutions must prioritize consistency and guarantee the same user experience every time, especially with percentage-based gradual rollouts.
**Why Consistency is Paramount:**
1. **User Trust**: Consistency breeds trust. When users interact with an application, they form expectations about how it behaves. Any sudden deviations can erode trust and lead to a sense of unreliability.
2. **Reduced Friction**: Consistency reduces friction. Users shouldn't have to relearn how to use an app every time they open it. A consistent experience reduces the cognitive load on users, enabling them to engage effortlessly.
3. **Quality Assurance**: Maintaining a consistent experience makes quality assurance more manageable. It's easier to test and monitor when you have a reliable benchmark for the user experience.
4. **Support and Feedback**: Inconsistent experiences lead to confused users, increased support requests, and muddied user feedback. Consistency ensures that user issues are easier to identify and address.
5. **Brand Integrity**: A consistent experience reflects positively on your brand. It demonstrates professionalism and commitment to user satisfaction, enhancing your brand's reputation.
**Strategies for Consistency in Percentage-Based Gradual Rollouts:**
1. **User Hashing**: Assign users to consistent groups using a secure hashing algorithm based on unique identifiers like user IDs or emails. This ensures that the same user consistently falls into the same group.
2. **Segmentation Control**: Provide controls within the feature flagging tool to allow developers to segment users logically. For instance, segment by location, subscription type, or any relevant criteria to ensure similar user experiences.
3. **Fallback Mechanisms**: Include fallback mechanisms in your architecture. If a user encounters issues or inconsistencies, the system can automatically switch them to a stable version or feature state.
4. **Logging and Monitoring**: Implement robust logging and monitoring. Continuously track which users are in which groups and what version of the feature they are experiencing. Monitor for anomalies or deviations and consider building automated processes to disable features that may be misbehaving.
5. **Transparent Communication**: Clearly communicate the gradual rollout to users. Use in-app notifications, tooltips, or changelogs to inform users about changes, ensuring they know what to expect.
In summary, consistency is a cornerstone of effective feature flagging solutions. When designing an architecture for percentage-based gradual rollouts, prioritize mechanisms that guarantee the same user gets the same experience every time. This isn't just about good software practice; it's about respecting your users and upholding their trust in your application. By implementing these strategies, you can create a feature flagging solution that empowers your development process and delights your users with a dependable and consistent experience.
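The user hashing strategy can be illustrated with a small sketch. It assumes a simple FNV-1a-style hash as a stand-in for whatever stable hash your feature flagging solution uses; the function names `stableHash`, `bucketFor`, and `isInRollout` are made up for this example. The important property is that the bucket depends only on the flag name and the user identifier, so the same user always lands in the same bucket.

```typescript
// FNV-1a-style 32-bit hash over the string's UTF-16 code units.
// A stand-in for illustration; any stable, well-distributed hash works.
function stableHash(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  // Convert to an unsigned 32-bit integer.
  return hash >>> 0;
}

// Maps a user to a bucket from 1 to 100 for a given flag. Including the
// flag name means a user's bucket differs from flag to flag.
function bucketFor(flagName: string, userId: string): number {
  return (stableHash(`${flagName}:${userId}`) % 100) + 1;
}

// A user is in the rollout when their bucket falls within the percentage.
function isInRollout(flagName: string, userId: string, percentage: number): boolean {
  return bucketFor(flagName, userId) <= percentage;
}
```

Because the bucket is fixed per user and flag, raising the rollout from 20% to 50% only adds users: everyone who already had the feature keeps it.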
## 11. Enable traceability. Make it easy to understand flag evaluation.
Developer experience is a critical factor to consider when implementing a feature flag solution. A positive developer experience enhances the efficiency of the development process and contributes to the overall success and effectiveness of feature flagging. One crucial aspect of developer experience is ensuring the testability of the SDK and providing tools for developers to understand how and why feature flags are evaluated. This is important because:
1. **Ease of Testing and Debugging:**
- **Faster Development Cycles:** A feature flagging solution with a testable SDK allows developers to quickly test and iterate on new features. They can easily turn flags on or off, simulate different conditions, and observe the results without needing extensive code changes or redeployments.
- **Rapid Issue Resolution:** When issues or unexpected behavior arise, a testable SDK enables developers to pinpoint the problem more efficiently. They can examine the flag configurations, log feature flag decisions, and troubleshoot issues more precisely.
2. **Visibility into Flag Behaviour:**
- **Understanding User Experience:** Developers need tools to see and understand how feature flags affect the user experience. This visibility helps them gauge the impact of flag changes and make informed decisions about when to roll out features to different user segments. Debugging a feature flag with multiple inputs simultaneously makes it easy for developers to compare the results and quickly figure out how a feature flag evaluates in different scenarios with multiple input values.
- **Enhanced Collaboration:** Feature flagging often involves cross-functional teams, including developers, product managers, and QA testers. Providing tools with a clear view of flag behavior fosters effective collaboration and communication among team members.
3. **Transparency and Confidence:**
- **Confidence in Flag Decisions:** A transparent feature flagging solution empowers developers to make data-driven decisions. They can see why a particular flag evaluates to a certain value, which is crucial for making informed choices about feature rollouts and experimentation.
- **Reduced Risk:** When developers clearly understand why flags evaluate the way they do, they are less likely to make unintentional mistakes that could lead to unexpected issues in production.
4. **Effective Monitoring and Metrics:**
- **Tracking Performance:** A testable SDK should provide developers with the ability to monitor the performance of feature flags in real time. This includes tracking metrics related to flag evaluations, user engagement, and the impact of flag changes.
- **Data-Driven Decisions:** Developers can use this data to evaluate the success of new features, conduct A/B tests, and make informed decisions about optimizations.
- **Usage metrics:** A feature flag system should provide insight on an aggregated level about the usage of feature flags. This is helpful for developers so that they can easily assess that everything works as expected.
5. **Documentation and Training:**
- **Onboarding and Training:** The entire feature flag solution, including API, UI, and the SDKs, requires clear and comprehensive documentation, along with easy-to-understand examples, in order to simplify the onboarding process for new developers. It also supports the ongoing training of new team members, ensuring that everyone can effectively use the feature flagging solution.
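To make "why did this flag evaluate this way" concrete, here is a minimal sketch of an evaluation function that returns its reasoning along with the result. The `Flag`, `Constraint`, and `EvaluationResult` shapes and the `evaluateWithTrace` function are hypothetical; real systems support richer operators, but the idea of recording one reason per rule carries over.

```typescript
// Hypothetical shapes for illustration only.
interface Constraint {
  field: string;
  values: string[];
}

interface Flag {
  name: string;
  enabled: boolean;
  constraints: Constraint[];
}

type EvaluationContext = Record<string, string | undefined>;

interface EvaluationResult {
  flag: string;
  enabled: boolean;
  // One human-readable entry per rule that was checked.
  reasons: string[];
}

// Evaluates every constraint (rather than stopping at the first failure)
// so the trace shows the full picture for debugging.
function evaluateWithTrace(flag: Flag, ctx: EvaluationContext): EvaluationResult {
  if (!flag.enabled) {
    return { flag: flag.name, enabled: false, reasons: ["flag is disabled"] };
  }
  const reasons: string[] = [];
  let enabled = true;
  for (const constraint of flag.constraints) {
    const actual = ctx[constraint.field];
    const passed = actual !== undefined && constraint.values.includes(actual);
    const verdict = passed ? "matches" : "does not match";
    reasons.push(
      `${constraint.field}=${actual ?? "<missing>"} ${verdict} [${constraint.values.join(", ")}]`
    );
    if (!passed) {
      enabled = false;
    }
  }
  return { flag: flag.name, enabled, reasons };
}
```

Running the same flag against several contexts and comparing the `reasons` arrays is a simple way to see how a flag behaves across different inputs.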
Thank you for reading
Thank you for reading. Our motivation for writing these principles is to share what we've learned building a large-scale feature flag solution with other architects and engineers solving similar challenges. Unleash is open-source, and so are these principles. Have something to contribute? [Open a PR](https://github.com/Unleash/unleash/pulls) or [discussion](https://github.com/orgs/Unleash/discussions) on our GitHub.