Best practice with feature flags is mostly prevention. The rules that matter aren’t ceremonies — they’re the ones that stop a flag from becoming the suspect in next quarter’s incident review. 73% of feature flags are never removed; the average enterprise application carries more than 200 active flags, and 60% of them are stale beyond 90 days (FlagShark, 2025). The same lack of discipline that produced those numbers is what reused a deprecated flag bit at Knight Capital in 2012 and cost the firm $460 million in 45 minutes (Henrico Dolfing, 2019).
Most “feature flag best practices” articles list rules without saying which failure mode each one prevents. This one does. Eight rules, each anchored to a specific way flags hurt teams that skip it, each with a concrete “what good looks like” check.
Key Takeaways
- Tag every flag with one of four lifecycle types at creation (release, experiment, kill switch, entitlement). Lifespan, ownership, and removal procedure all follow from the type.
- One toggle point per flag. The decision lives in a single named helper, called from everywhere; otherwise removing the flag becomes a graph-traversal problem.
- Test both states and the transition in CI. Slack’s May 2020 cascade started on a flag-on code path that had never run under production load (Slack Engineering).
- Roll out progressively (internal → 1–5% canary → 10–20% beta → 100%), then stabilise at 100% for one to two weeks before removing the flag.
- 73% of flags are never removed (FlagShark, 2025). The fix is creation-time discipline, not a quarterly purge.
How to use this guide. This post is the pillar — short on each rule, deep on the inventory. Where a rule has its own depth post, the link is there. For the operational cleanup playbook, jump to our feature flag cleanup post. For the inverse list, see feature flag anti-patterns. For the boundary with configuration, see feature flags vs environment variables. For structured composition between flags, see prerequisite flags.
1. Tag every flag with a lifecycle type at creation
Every flag is one of four types, and the type drives everything that follows. Release flags are temporary — they gate a new feature during rollout and should be removed within weeks of hitting 100%. Experiment flags live for the experiment window plus a stabilisation period and then come out. Kill switches are permanent, assigned an owner, and exercised on a schedule. Entitlements are permanent and gate behaviour by plan tier or tenant. A flag that isn’t tagged is, by default, a zombie waiting to happen.
LaunchDarkly’s own documentation is explicit: “Release flags are temporary. After you verify the new code is stable and roll out the feature to 100% of contexts, you should archive the flag” (LaunchDarkly). The teams that internalise that don’t enforce it with willpower — they enforce it at creation, when the flag is cheapest to govern.
What “good” looks like is concrete. Every flag-creation form requires the type, an owner, and — for non-permanent types — an expected removal date. The dashboard surfaces expired release flags for review the day they expire, not the next time someone runs a cleanup sprint. Kill switches carry a documented test cadence; if they can’t be flipped in a non-production environment without breaking something, they don’t actually work. The depth on the operational loop sits in our feature flag cleanup playbook. Tagging the type at creation is the upstream change that makes the loop cheap to run.
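The creation-time check described above can be sketched as a small validator. This is a minimal illustration, not any platform's schema: the field names (`type`, `owner`, `removeBy`, `testCadenceDays`) and the function names are assumptions made up for the example.

```javascript
// Hypothetical flag-definition validator. Field names are illustrative only.
const PERMANENT_TYPES = new Set(['killswitch', 'entitlement']);
const TEMPORARY_TYPES = new Set(['release', 'experiment']);

// True for a parseable YYYY-MM-DD string.
function isIsoDate(value) {
  return (
    typeof value === 'string' &&
    /^\d{4}-\d{2}-\d{2}$/.test(value) &&
    !Number.isNaN(Date.parse(value))
  );
}

// Returns a list of problems; an empty list means the flag may be created.
function validateFlagDefinition(def) {
  const errors = [];
  const knownType = PERMANENT_TYPES.has(def.type) || TEMPORARY_TYPES.has(def.type);
  if (!knownType) {
    errors.push('type must be release, experiment, killswitch, or entitlement');
  }
  if (!def.owner) {
    errors.push('owner is required');
  }
  // Non-permanent types need an expected removal date.
  if (TEMPORARY_TYPES.has(def.type) && !isIsoDate(def.removeBy)) {
    errors.push('temporary flags need a removeBy date (YYYY-MM-DD)');
  }
  // Kill switches need a documented test cadence.
  if (def.type === 'killswitch' && !def.testCadenceDays) {
    errors.push('kill switches need a documented test cadence');
  }
  return errors;
}
```

Wiring a check like this into the creation form or the flag-as-code review means an untagged flag never exists in the first place, which is cheaper than finding it later.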
2. Adopt a naming convention and enforce it in CI
A flag name should encode purpose, scope, and lifecycle at a glance. release_checkout_v3_2026q3 reads correctly six months later when nobody on the original PR is still on the team. newCheckout doesn’t — there’s no type, no scope, no quarter to anchor it. The cost of bad names is most visible at deprecation time, when somebody reuses a flag bit because the original purpose was unreadable from the name. That’s the failure mode that produced the Knight Capital incident.
The template most teams converge on is <type>_<area>_<feature>_<owner-or-quarter>. One good name, one bad:
    release_checkout_v3_2026q3   ← type + area + feature + quarter
    newCheckout                  ← no type, no scope, no lifecycle hint

A wiki page documenting the convention is worth almost nothing because nobody reads it at creation time. A regex check in CI is worth a lot, because it fails the PR that introduces a non-conforming name:

    // .github/workflows/flag-name-check.js: fail CI on non-conforming flag keys.
    const FLAG_KEY_PATTERN = /^(release|experiment|killswitch|entitlement)_[a-z0-9]+_[a-z0-9-]+_(20\d{2}q[1-4]|[a-z]+)$/;

    const newFlagKeys = collectFlagKeysFromDiff(); // your platform's API or grep
    const offenders = newFlagKeys.filter(k => !FLAG_KEY_PATTERN.test(k));

    if (offenders.length) {
      console.error('Non-conforming flag keys:', offenders);
      process.exit(1);
    }

Naming is the cheapest governance you can ship. The full depth (patterns by team size, migration paths off bad names, what to do when the convention itself evolves) belongs in its own post; until then, the creating flags guide covers the Featureflip-specific knobs.
3. Keep flag decisions flat — one toggle point per flag
When a flag’s decision logic is scattered across the codebase, removing the flag becomes a graph-traversal problem. Pete Hodgson’s Feature Toggles article on martinfowler.com calls out the distinction: the toggle point is where the decision is read; the toggle router is the logic that decides (martinfowler.com). Tangling them is what makes flag cleanup terrifying years later, because nobody knows what they’re deleting.
The anti-pattern looks like this — the SDK call repeated across five files, the evaluation context subtly different in each one:
    // In billing.js, checkout.js, signup.js, dashboard.js, mobile.js:
    if (client.evaluate('billing-v2', { tenantId, region, plan, betaCohort })) {
      // ...
    }

Wrap the decision once, behind a name, and import it everywhere:

    // One file. One named decision.
    export function isBillingV2Enabled(user) {
      return client.evaluate('billing-v2', {
        tenantId: user.tenantId,
        region: user.region,
        plan: user.plan,
        betaCohort: user.betaCohort,
      });
    }

Every other file calls isBillingV2Enabled(user). The decision lives once. Removal is one delete and a search-and-replace, not an archaeological dig through five evaluation contexts that drifted apart.
The rule extends to flag composition. If a section of behaviour depends on more than one flag, don’t nest the conditionals — fold them into a single named decision (getCheckoutVariant(user) → 'legacy' | 'beta' | 'rollout') and switch on the result. Where the composition is structural (one flag’s behaviour genuinely depends on another being on), use prerequisite flags — they’re the supported pattern. Where it’s accidental, flatten it. The inverse failure mode, including the combinatorics math, lives in the anti-patterns post.
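Folding two flags into one named decision might look like the sketch below. Both flag keys are invented for the example, and the evaluator is passed in as a plain function so the snippet runs without any SDK; a real client call would take its place.

```javascript
// Hypothetical flag keys; `evaluate` stands in for the SDK call.
function makeCheckoutDecision(evaluate) {
  return function getCheckoutVariant(user) {
    const context = { tenantId: user.tenantId, plan: user.plan };
    // The release flag gates everything; the experiment only matters once it is on.
    if (!evaluate('release_checkout_v3_2026q3', context)) return 'legacy';
    return evaluate('experiment_checkout_beta-cohort_2026q3', context) ? 'beta' : 'rollout';
  };
}

// Callers switch on the named result instead of nesting flag checks.
function renderCheckout(variant) {
  switch (variant) {
    case 'beta':
      return 'checkout-v3-beta';
    case 'rollout':
      return 'checkout-v3';
    default:
      return 'checkout-legacy';
  }
}
```

The nesting still exists, but it exists once, inside `getCheckoutVariant`. Removing either flag later means editing that one function and nothing else.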
4. Test both states — and the transition — in CI
A flag with only one tested state is a deploy waiting to surprise you. CI exercises the new code path because that’s what the PR added. The old path stops getting touched. When the flag flips back during an incident — or rolls forward to a cohort that exposes a latent bug — the un-exercised path executes for the first time in months.
Slack’s May 12, 2020 incident started exactly this way. Around 8:30am Pacific, a percentage-based flag rollout exposed a longstanding performance bug. Slack rolled the flag back within minutes, but the morning load spike had already pushed the backend into a cascade through stale HAProxy state and webapp autoscaling — by 4:45pm Pacific, the user-visible outage began (Slack Engineering). The flag itself wasn’t broken. The code under the flag had never run under production load.
Two rules cover the gap. Every PR that adds a flag check needs a CI test for both branches:
    // vitest / jest pattern: parameterise the suite over flag state.
    describe.each([
      ['flag on', true],
      ['flag off', false],
    ])('checkout (%s)', (_, billingV2Enabled) => {
      beforeEach(() => mockFlag('billing-v2', billingV2Enabled));

      it('charges the right amount', async () => {
        const result = await checkout(cart);
        expect(result.total).toBe(expectedTotal(billingV2Enabled));
      });
    });

And pre-rollout, the on path needs a load test or a canary that actually pushes traffic through it. Per-environment overrides on your flag platform exist for exactly this; pin the value in staging so the assertion is deterministic. If you’re only going to do one of the two, do the both-branch CI rule first: it costs nothing and catches the common case.
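The transition is the case the both-branch test misses: the flag changes value while a request or session is in flight. One way to make that testable, sketched here under assumptions, is to read the flag once at the start of a flow and carry the decision with the work. The in-memory flag store, the two-step checkout, and the 50-cent fee are all invented for illustration.

```javascript
// In-memory stand-in for a flag client; a real SDK would replace this.
function createFlagStore(initial) {
  const values = { ...initial };
  return {
    isOn: (key) => values[key] === true,
    set: (key, value) => {
      values[key] = value;
    },
  };
}

// Hypothetical two-step flow. The pricing version is chosen once at the start
// and stored on the order, so a mid-flow flip cannot mix v1 and v2 arithmetic.
function startCheckout(flags, cart) {
  const pricing = flags.isOn('billing-v2') ? 'v2' : 'v1';
  const subtotalCents = cart.reduce((sum, item) => sum + item.priceCents, 0);
  return { pricing, subtotalCents };
}

function completeCheckout(order) {
  // Assumed rule for illustration: v2 adds a flat 50-cent fee, v1 adds nothing.
  const feeCents = order.pricing === 'v2' ? 50 : 0;
  return { ...order, totalCents: order.subtotalCents + feeCents };
}
```

A transition test then starts a checkout with the flag off, flips it on, completes the checkout, and asserts the order still finishes on v1 pricing while the next order picks up v2.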
5. Roll out progressively — internal, canary, beta, GA
Progressive delivery rolls a change to expanding rings of users — internal team first, then a low-percentage canary, then beta, then general availability. Each ring is a checkpoint with a gating metric: error rate, p95 latency, the business metric the feature is supposed to move. If a ring fails its check, halt and rollback. The flag is the mechanism; the rings are the discipline. DORA’s continuous delivery capability calls this out directly: flags let teams separate deploy from release, which is the precondition for keeping change failure rate down at high deployment frequency (DORA).
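Percentage rings only work as checkpoints if a user stays in the same ring from request to request and stays included as the percentage grows. A common way to get both properties is to hash the flag key and user ID into a stable bucket. The sketch below uses a hand-rolled FNV-1a style hash over UTF-16 code units to stay dependency-free; it illustrates the idea and is not the algorithm of any particular flag platform.

```javascript
// 32-bit FNV-1a style hash over the string's UTF-16 code units.
function fnv1a32(text) {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0; // keep the value unsigned 32-bit
  }
  return hash;
}

// Stable bucket in 0..9999, which gives 0.01% granularity.
function bucketFor(flagKey, userId) {
  return fnv1a32(`${flagKey}:${userId}`) % 10000;
}

// A user is in the rollout when their bucket falls below the threshold.
// Raising `percent` only ever adds users; nobody drops out of an earlier ring.
function isInRollout(flagKey, userId, percent) {
  return bucketFor(flagKey, userId) < percent * 100;
}
```

Including the flag key in the hash input means the same user lands in different buckets for different flags, so one unlucky cohort does not absorb every canary.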
Once the flag reaches 100%, hold it there for 7–14 days before removing it. That stabilisation window matters more than it sounds: it catches latent bugs that are only visible at full traffic, before the flag (and the rollback path) is gone. LaunchDarkly’s lifecycle documentation recommends the same band (LaunchDarkly). The Featureflip-specific knobs (percentage rollout per environment, cohort targeting, the rollback control) live in rollout strategies.
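A ring gate can be as small as one function that compares the ring's metrics with a baseline and returns the next action. The sketch below is one possible shape, assuming error rate and p95 latency as the gating metrics; the tolerance values and the field names are placeholders to tune, not recommendations.

```javascript
const RINGS = ['internal', 'canary', 'beta', 'ga'];

// Assumed gate: allow a fixed tolerance over baseline before calling it a regression.
function nextAction(currentRing, metrics, baseline, tolerance = { errorRate: 0.002, p95Ms: 50 }) {
  const index = RINGS.indexOf(currentRing);
  if (index === -1) throw new Error(`unknown ring: ${currentRing}`);

  const errorRegressed = metrics.errorRate > baseline.errorRate + tolerance.errorRate;
  const latencyRegressed = metrics.p95Ms > baseline.p95Ms + tolerance.p95Ms;
  if (errorRegressed || latencyRegressed) {
    return { action: 'rollback', ring: currentRing };
  }

  // At GA there is nowhere left to promote to; hold for the stabilisation window.
  if (index === RINGS.length - 1) {
    return { action: 'stabilise', ring: 'ga' };
  }
  return { action: 'promote', ring: RINGS[index + 1] };
}
```

Running this from a scheduled job or a pipeline step keeps each promotion a rule rather than a judgment call made in chat.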
6. Treat flag changes as production changes
A flag flip in production is a production change. Audit who flipped what, gate prod-flag-edit permissions with role-based access, and require a second pair of eyes on high-blast-radius flags. The audit log isn’t compliance paperwork — it’s the first artifact incident responders reach for when a flag is the suspect.
What “good” looks like, concretely:
- Production flag edits gated by an SSO-backed role separate from dev and staging.
- Every flag change emits an event with actor, timestamp, before/after values, and an optional reason.
- High-risk flags (auth, billing, anything touching more than half of users) require a second approver.
- The audit log is retained for at least 90 days and queryable by both flag key and actor.
A flag-platform admin role that lets anyone on the engineering team flip any flag in production is the default that bites teams quickest. Split the permission early — read-everything is a different role to write-anything, and write-staging is a different role to write-production. The cost of getting this wrong shows up at incident-time, when “who turned that on?” returns a shrug.
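The audit event and the second-approver rule fit in a few lines. This sketch assumes a flag record with `tags` and an `audienceShare` between 0 and 1, and an in-memory array as the audit log; all of those names are illustrative, and a real system would persist the log and check roles before reaching this point.

```javascript
const HIGH_RISK_TAGS = new Set(['auth', 'billing']);

// High risk: touches auth or billing, or reaches more than half of users.
function requiresSecondApprover(flag) {
  return flag.tags.some((tag) => HIGH_RISK_TAGS.has(tag)) || flag.audienceShare > 0.5;
}

// Applies a change and records who did what, with before and after values.
function applyFlagChange(flag, change, auditLog, now = new Date()) {
  const needsApprover = requiresSecondApprover(flag);
  if (needsApprover && (!change.approver || change.approver === change.actor)) {
    throw new Error(`${flag.key} needs a second approver`);
  }
  auditLog.push({
    flagKey: flag.key,
    actor: change.actor,
    approver: change.approver ?? null,
    timestamp: now.toISOString(),
    before: flag.value,
    after: change.value,
    reason: change.reason ?? null, // optional, but worth prompting for
  });
  return { ...flag, value: change.value };
}
```

Because the event is written in the same call that applies the change, there is no path that flips a flag without leaving a record.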
Featureflip ships the audit log, RBAC, and per-evaluation events that the rules above lean on; the team management guide covers how to wire roles to your SSO. A built-in second-approver workflow on high-risk flags isn’t in the product today — that’s on us. Until it ships, the lightweight pattern is to gate prod edits behind a peer-pair convention enforced by humans, not the platform.
7. Keep evaluation local and observable
Flag evaluation should be sub-millisecond and observable. Sub-millisecond means in-process — the SDK bootstraps with a cached config and streams updates, so per-request evaluation pays no network cost. Observable means every evaluation emits an event with the flag key, the chosen variant, the reason (rule-match, fallthrough, prerequisite-failed), and a user identifier you can join with your error logs. The combination is what makes flag-suspected incidents debuggable in minutes instead of hours.
If the flag service is unreachable, a local-evaluation SDK should serve the last-known-good config, so the app keeps working. If the SDK can only evaluate by calling the service, the app’s startup latency now includes the flag service’s tail latency, and the flag service is a hard dependency you didn’t ask for. OpenFeature codifies the local-evaluation pattern as the cross-vendor standard for exactly this reason (openfeature.dev).
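To make the last-known-good behaviour concrete, here is a minimal in-process evaluator. It is a sketch under assumptions, not a real SDK: the config shape (`version`, `flags`, `rules` keyed on `plans`, `fallthrough`) and the `emit` callback are invented for the example.

```javascript
// Evaluates against an in-memory snapshot and emits one event per evaluation.
function createLocalEvaluator(bootstrapConfig, emit) {
  let config = bootstrapConfig; // last-known-good snapshot

  return {
    // Called by the streaming or polling layer. A missing or malformed
    // update is ignored, so evaluation keeps using the previous snapshot.
    applyUpdate(nextConfig) {
      if (nextConfig && nextConfig.flags && typeof nextConfig.flags === 'object') {
        config = nextConfig;
      }
    },

    // No network call on this path: everything is read from memory.
    evaluate(flagKey, user, defaultValue) {
      const flag = config.flags[flagKey];
      let value = defaultValue;
      let reason = 'flag-not-found';
      if (flag) {
        const rule = flag.rules.find((r) => r.plans.includes(user.plan));
        value = rule ? rule.value : flag.fallthrough;
        reason = rule ? 'rule-match' : 'fallthrough';
      }
      emit({ flagKey, value, reason, userId: user.id, configVersion: config.version });
      return value;
    },
  };
}
```

Recording the config version on each event is a small addition that pays off in triage, because it shows whether a given evaluation ran before or after a change landed.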
When a flag is the suspect in an incident, there are four questions to answer in the first ten minutes:
- Which flags changed in the last hour?
- For the affected user, which variant fired?
- What’s the current variant rollout percentage?
- Has the variant cohort’s error rate diverged from baseline?
Without per-flag evaluation events and an audit log, every one of those is a guess. With them, all four are dashboard queries. The minimum bar is an audit log of config changes; the good bar is per-evaluation events you can correlate with request-level error spikes. None of that is exotic — it’s the same telemetry you’d want for any production system. The difference is that flag changes happen during incidents, which is exactly when the data has to already exist. Local-evaluation support across all 11 of our SDKs is built around this.
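With both data sources in place, the four triage questions reduce to filters. The sketch below runs them over in-memory arrays to show the shape of the queries; in practice they would be queries against your log store. The record fields are assumptions, timestamps are epoch milliseconds, and the rollout figure is the share observed in evaluation events, which is a proxy for the configured percentage rather than the config value itself.

```javascript
// evaluations are assumed to be ordered oldest to newest.
function triage({ auditLog, evaluations, errors }, { userId, flagKey, now, windowMs = 60 * 60 * 1000 }) {
  const since = now - windowMs;

  // 1. Which flags changed in the window?
  const changedFlags = [
    ...new Set(auditLog.filter((e) => e.timestamp >= since).map((e) => e.flagKey)),
  ];

  // 2. Which variant did the affected user get most recently?
  const forFlag = evaluations.filter((e) => e.flagKey === flagKey);
  const lastForUser = forFlag.filter((e) => e.userId === userId).pop();

  // 3. What share of evaluations served the "on" variant?
  const onEvents = forFlag.filter((e) => e.value === true);
  const offEvents = forFlag.filter((e) => e.value !== true);

  // 4. Has the "on" cohort's error rate diverged from the "off" cohort's?
  const failedUsers = new Set(errors.map((e) => e.userId));
  const errorRate = (events) =>
    events.length === 0 ? 0 : events.filter((e) => failedUsers.has(e.userId)).length / events.length;

  return {
    changedFlags,
    variantForUser: lastForUser ? lastForUser.value : undefined,
    observedOnShare: forFlag.length === 0 ? 0 : onEvents.length / forFlag.length,
    errorRateOn: errorRate(onEvents),
    errorRateOff: errorRate(offEvents),
  };
}
```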
8. Schedule cleanup as recurring discipline, not a sprint
Flag debt accumulates in days but pays down in quarters. Without a recurring cleanup checkpoint, the share of stale flags climbs to roughly 60% of the total within 24 months (FlagShark, 2025). The fix isn’t a heroic quarterly sprint — it’s a small, scheduled checkpoint (monthly for high-velocity teams, quarterly for everyone else) that runs an automated staleness scan and produces a removal queue.
The actual removal procedure is mechanical when the upstream practices are in place: delete the flag check first (the platform now serves the new default), verify in production for a release cycle, then delete the dead branch. Two PRs, in that order. The four-step operational loop — detect, triage, remove, prevent — and the grep / AST scripts that drive it sit in the feature flag cleanup playbook. That’s the depth post. The pillar’s job is to put cleanup at the end on purpose: rules 1 through 7 are the practices that minimise the amount of cleanup you’ll ever have to do. Without them, cleanup is heroic. With them, it’s a 30-minute task on a recurring calendar invite.
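The automated staleness scan behind that checkpoint can start as something this small. It is a sketch with assumed field names (`type`, `owner`, `removeBy`, `lastModified` in epoch milliseconds) and an assumed 90-day threshold; the real scripts live with the cleanup playbook, and your platform's export will have its own shape.

```javascript
const DAY_MS = 24 * 60 * 60 * 1000;

// Builds a removal queue from temporary flags only; kill switches and
// entitlements are permanent by design and are skipped.
function buildRemovalQueue(flags, now, staleAfterDays = 90) {
  return flags
    .filter((flag) => flag.type === 'release' || flag.type === 'experiment')
    .map((flag) => {
      const reasons = [];
      if (flag.removeBy && Date.parse(flag.removeBy) < now) {
        reasons.push('past removeBy date');
      }
      if (now - flag.lastModified > staleAfterDays * DAY_MS) {
        reasons.push(`unchanged for ${staleAfterDays}+ days`);
      }
      return { key: flag.key, owner: flag.owner, reasons };
    })
    .filter((entry) => entry.reasons.length > 0)
    // Flags with the most reasons to go come first.
    .sort((a, b) => b.reasons.length - a.reasons.length);
}
```

Because every entry carries an owner, the output can be posted straight to the people who need to act on it, which is what keeps the checkpoint short.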
The shorter version
Most flag-related problems come from a small set of things people skip at creation time. Tag every flag with one of four lifecycle types — release, experiment, kill switch, or entitlement — and let ownership and removal follow from the type. Name flags so the type, area, and quarter are readable six months later; enforce the convention in CI, not on a wiki. Wrap each flag decision in one named helper so the toggle point and the toggle router stay separable. Test both states and the transition before the flag ships. Roll out progressively through internal, canary, beta, and GA, with a gating metric on each ring. Treat flag flips in production as production changes — audit log, RBAC, second approver for the risky ones. Keep evaluation local and emit a per-evaluation event so incident triage isn’t guesswork. And schedule a small recurring cleanup checkpoint instead of letting debt accumulate to the point of needing a sprint to pay it down.
Frequently asked questions
What are feature flag best practices?
The eight rules production teams converge on: tag every flag with a lifecycle type at creation, name flags with an enforced convention, keep each flag’s decision in one named helper, test both states in CI, roll out progressively through internal-canary-beta-GA, treat flag changes as production changes, keep evaluation local and observable, and schedule cleanup as a recurring checkpoint instead of a quarterly sprint.
How long should a feature flag live in production?
It depends on the type. Release flags should be removed within roughly 30 days of hitting 100%. Experiment flags should live for the experiment window plus a two-week stabilisation. Kill switches and entitlements are permanent if they’re documented as such and assigned an owner (LaunchDarkly). Tag the type at creation; the lifespan follows.
How do you manage feature flag debt?
By preventing it at creation, not cleaning it up at quarter-end. 73% of feature flags are never removed (FlagShark, 2025). Tag every flag with a lifecycle type, set an expiration date for non-permanent ones, and run a small recurring cleanup checkpoint to clear the queue. The full four-step loop — detect, triage, remove, prevent — lives in our cleanup playbook.
What’s the difference between a feature flag and a kill switch?
A feature flag is the mechanism — a runtime branch the platform can flip without redeploying. A kill switch is one type of feature flag: permanent, owned, exercised periodically, used to disable a misbehaving feature in production. All kill switches are flags; not all flags are kill switches. Tagging the type at creation makes the distinction enforceable instead of cultural.
How do you name feature flags?
The template most teams converge on is <type>_<area>_<feature>_<owner-or-quarter> — for example, release_checkout_v3_2026q3. Type tells you the lifecycle, area tells you which subsystem, feature is the specific change, and the quarter or owner anchors it in time. Enforce the convention with a regex check in CI; a wiki page never gets read at creation time.
How do you roll out a feature flag safely?
Through four rings: internal team and dogfood first (24–48 hours), canary at 1–5% of production (30–60 minutes per increment, gated on error rate and p95 latency), beta at 10–20% (24–72 hours, gated on the business metric), then general availability at 100%. Stabilise at 100% for 7–14 days before removing the flag (LaunchDarkly). Each promotion should be an automated rule, not a Slack message.
Featureflip is built around the practices in this post: sub-millisecond local evaluation, per-evaluation events through the SDK, an audit log on every flag, RBAC for production edits, and flag-type tagging at creation so cleanup follows the lifecycle. If you’d like to see how that plays out in practice, start with the Solo plan. It’s free forever for one project.