Feature Flag Best Practices: 8 Rules Production Teams Live By

Best practice with feature flags is mostly prevention. The rules that matter aren’t ceremonies; they’re the ones that stop a flag from becoming the suspect in next quarter’s incident review. Most feature flags are never removed; Pete Hodgson’s canonical treatment frames toggles as “inventory which comes with a carrying cost” that teams have to actively keep low (martinfowler.com). The same lack of discipline let Knight Capital reuse a deprecated flag bit in 2012, which cost the firm $460 million in 45 minutes (Henrico Dolfing, 2019).

Most “feature flag best practices” articles list rules without saying which failure mode each one prevents. This one does. Eight rules, each anchored to a specific way flags hurt teams that skip it, each with a concrete “what good looks like” check.

Key Takeaways

Tag every flag with one of four lifecycle types at creation (release, experiment, kill switch, entitlement). Lifespan, ownership, and removal procedure all follow from the type.

One toggle point per flag. The decision lives in a single named helper, called from everywhere; otherwise removing the flag becomes a graph-traversal problem.

Test both states and the transition in CI. Slack’s May 2020 cascade started on a flag-on code path that had never run under production load (Slack Engineering).

Roll out progressively (internal → 1–5% canary → 10–20% beta → 100%), then stabilise at 100% for one to two weeks before removing the flag.

Most flags are never removed. The fix is creation-time discipline, not a quarterly purge.

How to use this guide. This post is the pillar — short on each rule, deep on the inventory. Where a rule has its own depth post, the link is there. For the operational cleanup playbook, jump to our feature flag cleanup post. For the inverse list, see feature flag anti-patterns. For the boundary with configuration, see feature flags vs environment variables. For structured composition between flags, see prerequisite flags. For the upstream decision of whether to adopt a flag platform at all, see build vs buy feature flags.

1. Tag every flag with a lifecycle type at creation

Every flag is one of four types, and the type drives everything that follows. Release flags are temporary — they gate a new feature during rollout and should be removed within weeks of hitting 100%. Experiment flags live for the experiment window plus a stabilisation period and then come out. Kill switches are permanent, assigned an owner, and exercised on a schedule. Entitlements are permanent and gate behaviour by plan tier or tenant. A flag that isn’t tagged is, by default, a zombie waiting to happen.

LaunchDarkly’s own documentation is explicit: “Release flags are temporary. After you verify the new code is stable and roll out the feature to 100% of contexts, you should archive the flag” (LaunchDarkly). The teams that internalise that don’t enforce it with willpower — they enforce it at creation, when the flag is cheapest to govern.

The type is the contract. A release flag that lives past day 30 is overdue; a kill switch that's never been flipped isn't a kill switch.

What “good” looks like is concrete. Every flag-creation form requires the type, an owner, and — for non-permanent types — an expected removal date. The dashboard surfaces expired release flags for review the day they expire, not the next time someone runs a cleanup sprint. Kill switches carry a documented test cadence; if they can’t be flipped in a non-production environment without breaking something, they don’t actually work. The depth on the operational loop sits in our feature flag cleanup playbook. Tagging the type at creation is the upstream change that makes the loop cheap to run.
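As a rough illustration of the creation-form check described above, here is a minimal sketch. The field names (type, owner, removeBy, testCadenceDays) are assumptions made for the example, not a particular platform’s schema.

```javascript
// Hypothetical flag-creation validator; field names are illustrative only.
const FLAG_TYPES = ['release', 'experiment', 'killswitch', 'entitlement'];
const PERMANENT_TYPES = new Set(['killswitch', 'entitlement']);

// Returns a list of problems; an empty list means the flag may be created.
function validateNewFlag(flag) {
  const errors = [];
  if (!FLAG_TYPES.includes(flag.type)) {
    errors.push('type must be one of: ' + FLAG_TYPES.join(', '));
  }
  if (!flag.owner) {
    errors.push('owner is required');
  }
  // Temporary types (release, experiment) must say when they expect to be removed.
  if (!PERMANENT_TYPES.has(flag.type) && !flag.removeBy) {
    errors.push('temporary flags need an expected removal date');
  }
  // A kill switch without a test cadence is unverified.
  if (flag.type === 'killswitch' && !flag.testCadenceDays) {
    errors.push('kill switches need a documented test cadence');
  }
  return errors;
}
```

Rejecting the request at this point costs one form error; finding the same gap at cleanup time costs an investigation.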

2. Adopt a naming convention and enforce it in CI

A flag name should encode purpose, scope, and lifecycle at a glance. release_checkout_v3_2026q3 reads correctly six months later when nobody on the original PR is still on the team. newCheckout doesn’t — there’s no type, no scope, no quarter to anchor it. The cost of bad names is most visible at deprecation time, when somebody reuses a flag bit because the original purpose was unreadable from the name. That’s the failure mode that produced the Knight Capital incident.

The template most teams converge on is <type>_<area>_<feature>_<owner-or-quarter>. One good name, one bad:

release_checkout_v3_2026q3   ← type + area + feature + quarter
newCheckout                  ← no type, no scope, no lifecycle hint

A wiki page documenting the convention is worth almost nothing because nobody reads it at creation time. A regex check in CI is worth a lot, because it fails the PR that introduces a non-conforming name:

// .github/workflows/flag-name-check.js — fail CI on non-conforming flag keys.
const FLAG_KEY_PATTERN = /^(release|experiment|killswitch|entitlement)_[a-z0-9]+_[a-z0-9-]+_(20\d{2}q[1-4]|[a-z]+)$/;

const newFlagKeys = collectFlagKeysFromDiff();      // your platform's API or grep
const offenders   = newFlagKeys.filter(k => !FLAG_KEY_PATTERN.test(k));

if (offenders.length) {
  console.error('Non-conforming flag keys:', offenders);
  process.exit(1);
}
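The collectFlagKeysFromDiff() call above is left to the reader. One possible shape, sketched under the assumption that flag keys appear as the first string argument to client.evaluate(...) and that CI can supply a unified diff as text (for example from git diff); the helper name extractFlagKeysFromDiff is hypothetical:

```javascript
// Hypothetical helper: pull flag keys out of the added lines of a unified diff.
// Assumes keys are the first string argument of client.evaluate(...).
function extractFlagKeysFromDiff(diffText) {
  const callPattern = /client\.evaluate\(\s*['"]([^'"]+)['"]/g;
  const keys = new Set();
  for (const line of diffText.split('\n')) {
    // Added lines start with a single '+'; '+++' is the file header.
    if (!line.startsWith('+') || line.startsWith('+++')) continue;
    for (const match of line.matchAll(callPattern)) {
      keys.add(match[1]);
    }
  }
  return [...keys];
}
```

Only added lines are scanned, so existing non-conforming keys don’t block unrelated PRs while the team migrates off them.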

Naming is the cheapest governance you can ship. The full depth (patterns by team size, migration paths off bad names, what to do when the convention itself evolves) lives in the dedicated feature flag naming conventions guide; the creating flags guide covers the Featureflip-specific knobs.

3. Keep flag decisions flat — one toggle point per flag

When a flag’s decision logic is scattered across the codebase, removing the flag becomes a graph-traversal problem. Pete Hodgson’s feature-toggles article draws the distinction: the toggle point is where the decision is read; the toggle router is the logic that decides (martinfowler.com). Tangling them is what makes flag cleanup terrifying years later, because nobody knows what they’re deleting.

The anti-pattern looks like this — the SDK call repeated across five files, the evaluation context subtly different in each one:

// In billing.js, checkout.js, signup.js, dashboard.js, mobile.js:
if (client.evaluate('billing-v2', { tenantId, region, plan, betaCohort })) {
  // ...
}

Wrap the decision once, behind a name, and import it everywhere:

// One file. One named decision.
export function isBillingV2Enabled(user) {
  return client.evaluate('billing-v2', {
    tenantId: user.tenantId,
    region: user.region,
    plan: user.plan,
    betaCohort: user.betaCohort,
  });
}

Every other file calls isBillingV2Enabled(user). The decision lives once. Removal is one delete and a search-and-replace, not an archaeological dig through five evaluation contexts that drifted apart.

The rule extends to flag composition. If a section of behaviour depends on more than one flag, don’t nest the conditionals — fold them into a single named decision (getCheckoutVariant(user) → 'legacy' | 'beta' | 'rollout') and switch on the result. Where the composition is structural (one flag’s behaviour genuinely depends on another being on), use prerequisite flags — they’re the supported pattern. Where it’s accidental, flatten it. The inverse failure mode, including the combinatorics math, lives in the anti-patterns post.
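A folded decision might look like the sketch below. The two flag keys and the precedence between them are assumptions made for illustration, and the client is passed in as a parameter only to keep the example self-contained.

```javascript
// Hypothetical flag keys; client.evaluate(key, context) is assumed to return a boolean.
function getCheckoutVariant(user, client) {
  const context = { tenantId: user.tenantId, plan: user.plan };
  // Precedence is decided here, once, instead of in nested ifs at every call site.
  if (client.evaluate('release_checkout_v3_2026q3', context)) return 'rollout';
  if (client.evaluate('experiment_checkout_beta_2026q3', context)) return 'beta';
  return 'legacy';
}

// Call sites switch on the named result and never see the individual flags.
function renderCheckout(user, client) {
  switch (getCheckoutVariant(user, client)) {
    case 'rollout': return 'checkout-v3';
    case 'beta':    return 'checkout-v3-beta';
    default:        return 'checkout-legacy';
  }
}
```

When either flag is retired, only getCheckoutVariant changes; the switch statements stay as they are until a variant disappears entirely.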

4. Test both states — and the transition — in CI

A flag with only one tested state is a deploy waiting to surprise you. CI exercises the new code path because that’s what the PR added. The old path stops getting touched. When the flag flips back during an incident — or rolls forward to a cohort that exposes a latent bug — the un-exercised path executes for the first time in months.

Slack’s May 12, 2020 incident started exactly this way. Around 8:30am Pacific, a percentage-based flag rollout exposed a longstanding performance bug. Slack rolled the flag back within minutes, but the morning load spike had already set up a cascade through stale HAProxy state and webapp autoscaling, and the user-visible outage began at 4:45pm Pacific (Slack Engineering). The flag itself wasn’t broken. The code under the flag had never run under production load.

Two rules cover the gap. Every PR that adds a flag check needs a CI test for both branches:

// vitest / jest pattern: parameterise the suite over flag state.
describe.each([
  ['flag on',  true],
  ['flag off', false],
])('checkout (%s)', (_, billingV2Enabled) => {
  beforeEach(() => mockFlag('billing-v2', billingV2Enabled));

  it('charges the right amount', async () => {
    const result = await checkout(cart);
    expect(result.total).toBe(expectedTotal(billingV2Enabled));
  });
});

And pre-rollout, the on path needs a load test or a canary that actually pushes traffic through it. Per-environment overrides on your flag platform exist for exactly this; pin the value in staging so the assertion is deterministic. If you’re only going to do one of the two, do the both-branch CI rule first — it costs nothing and catches the common case.
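The parameterised suite covers the two states; the transition needs its own test. One way to sketch it, assuming a hypothetical in-memory flag store in place of the mocked SDK and a made-up pricing rule (v2 adds 10% tax) so that the two paths produce different totals:

```javascript
// Hypothetical in-memory flag store standing in for a mocked SDK.
const flags = { 'billing-v2': false };
const isBillingV2Enabled = () => flags['billing-v2'];

// Code under test: read the flag once per checkout and carry the decision through.
function checkout(cart, onPriced = () => {}) {
  const useV2 = isBillingV2Enabled();   // single read for the whole operation
  const subtotal = cart.items.reduce((sum, item) => sum + item.price, 0);
  onPriced();                           // test hook: lets a test flip the flag mid-flight
  const tax = useV2 ? Math.round(subtotal * 0.1) : 0;   // made-up rule for illustration
  return { engine: useV2 ? 'v2' : 'v1', total: subtotal + tax };
}

// Transition test: flip the flag between pricing and tax; the result must stay consistent.
function testFlipMidCheckout() {
  flags['billing-v2'] = false;
  const result = checkout({ items: [{ price: 100 }] }, () => { flags['billing-v2'] = true; });
  return result.engine === 'v1' && result.total === 100;
}
```

If checkout re-read the flag after the hook, the order would be priced by v1 and taxed by v2. That mixed state is what a mid-incident flip produces, and it is invisible to a suite that only tests on and off separately.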

5. Roll out progressively — internal, canary, beta, GA

Progressive delivery rolls a change to expanding rings of users: internal team first, then a low-percentage canary, then beta, then general availability. Each ring is a checkpoint with a gating metric: error rate, p95 latency, the business metric the feature is supposed to move. If a ring fails its check, halt and roll back. The flag is the mechanism; the rings are the discipline. DORA’s continuous delivery capability calls this out directly: flags let teams separate deploy from release, which is the precondition for keeping change failure rate down at high deployment frequency (DORA).

Four rings, each with a gating metric. Promotion is mechanical; rollback is one flag flip.

The 7–14 day stabilisation at 100% matters more than it sounds — it’s the window that catches latent bugs only visible at full traffic, before the flag (and the rollback path) is gone. LaunchDarkly’s lifecycle documentation recommends the same band (LaunchDarkly). The Featureflip-specific knobs — percentage rollout per environment, cohort targeting, the rollback control — live in rollout strategies. This rule is one slice of a wider discipline; for the full strategy, including the delegation pillar most teams skip, see progressive delivery, explained.
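To make “promotion is mechanical” concrete, here is a minimal sketch of a promotion rule. The ring percentages sit at the top of the bands used in this post; the thresholds (1% error rate, 400 ms p95) and the function name nextStep are placeholder assumptions, not recommendations.

```javascript
// Ring percentages follow the bands in this post; thresholds are placeholder assumptions.
const RINGS = [
  { name: 'internal', percent: 0 },    // targeted by cohort, not by percentage
  { name: 'canary',   percent: 5 },
  { name: 'beta',     percent: 20 },
  { name: 'ga',       percent: 100 },
];
const GATES = { maxErrorRate: 0.01, maxP95Ms: 400 };

// Returns the next action for a rollout given the current ring and its observed metrics.
function nextStep(currentRing, metrics, gates = GATES) {
  const index = RINGS.findIndex(ring => ring.name === currentRing);
  if (index === -1) throw new Error('unknown ring: ' + currentRing);

  const healthy = metrics.errorRate <= gates.maxErrorRate && metrics.p95Ms <= gates.maxP95Ms;
  if (!healthy) return { action: 'rollback', percent: 0 };
  if (index === RINGS.length - 1) return { action: 'hold', percent: 100 };

  const next = RINGS[index + 1];
  return { action: 'promote', ring: next.name, percent: next.percent };
}
```

A real gate would also check the business metric and the minimum soak time per ring; both are omitted here to keep the shape visible.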

6. Treat flag changes as production changes

A flag flip in production is a production change. Audit who flipped what, gate prod-flag-edit permissions with role-based access, and require a second pair of eyes on high-blast-radius flags. The audit log isn’t compliance paperwork — it’s the first artifact incident responders reach for when a flag is the suspect.

What “good” looks like, concretely:

Production flag edits gated by an SSO-backed role separate from dev and staging.
Every flag change emits an event with actor, timestamp, before/after values, and an optional reason.
High-risk flags (auth, billing, anything touching more than half of users) require a second approver.
The audit log is retained for at least 90 days and queryable by both flag key and actor.
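The checklist above can be sketched in a few lines. This is a generic illustration with hypothetical names (buildFlagChangeEvent, canApplyToProduction, a highRisk marker on the flag), not a description of any platform’s API.

```javascript
// Hypothetical change-event shape: actor, timestamp, before/after values, optional reason.
function buildFlagChangeEvent({ flagKey, actor, before, after, reason = null, approver = null }, now = new Date()) {
  return { flagKey, actor, before, after, reason, approver, timestamp: now.toISOString() };
}

// Second-approver rule for high-risk flags; everything else passes straight through.
function canApplyToProduction(event, flagMeta) {
  if (!flagMeta.highRisk) return { ok: true };
  if (!event.approver) return { ok: false, why: 'high-risk flag needs a second approver' };
  if (event.approver === event.actor) return { ok: false, why: 'approver must differ from actor' };
  return { ok: true };
}

// Audit-log lookups by flag key and by actor.
const byFlag  = (log, flagKey) => log.filter(event => event.flagKey === flagKey);
const byActor = (log, actor)   => log.filter(event => event.actor === actor);
```

Storing before and after values on the event is what turns “who turned that on?” into a single lookup instead of a diff between two config snapshots.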

A flag-platform admin role that lets anyone on the engineering team flip any flag in production is the default that bites teams quickest. Split the permission early — read-everything is a different role to write-anything, and write-staging is a different role to write-production. The cost of getting this wrong shows up at incident-time, when “who turned that on?” returns a shrug.

Featureflip ships the audit log, RBAC, and per-evaluation events that the rules above lean on; the team management guide covers how to wire roles to your SSO. A built-in second-approver workflow on high-risk flags isn’t in the product today — that’s on us. Until it ships, the lightweight pattern is to gate prod edits behind a peer-pair convention enforced by humans, not the platform.

7. Keep evaluation local and observable

Flag evaluation should be sub-millisecond and observable. Sub-millisecond means in-process — the SDK bootstraps with a cached config and streams updates, so per-request evaluation pays no network cost. Observable means every evaluation emits an event with the flag key, the chosen variant, the reason (rule-match, fallthrough, prerequisite-failed), and a user identifier you can join with your error logs. The combination is what makes flag-suspected incidents debuggable in minutes instead of hours.

If the flag service is unreachable, a local-evaluation SDK should serve the last-known-good config, so the app keeps working. If the SDK can only evaluate remotely, the app’s startup latency now includes the flag service’s tail latency, and the flag service is a hard dependency you didn’t ask for. OpenFeature codifies the local-evaluation pattern as the cross-vendor standard for exactly this reason (openfeature.dev).
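The last-known-good behaviour is easier to see in code. The sketch below is a toy client, not any vendor’s SDK: LocalFlagClient, fetchConfig, and the two simplified reason values ('config' and 'default') are assumptions for the example, and real SDKs report richer reasons such as rule-match or fallthrough.

```javascript
// Hypothetical client; fetchConfig is any async function returning { flagKey: value }.
class LocalFlagClient {
  constructor(fetchConfig, onEvaluation = () => {}) {
    this.fetchConfig = fetchConfig;
    this.onEvaluation = onEvaluation;
    this.config = {};   // last-known-good config, held in process memory
  }

  // Called at startup and on each update; never throws.
  async refresh() {
    try {
      this.config = await this.fetchConfig();
      return true;
    } catch {
      return false;     // flag service unreachable: keep serving the previous config
    }
  }

  // Synchronous: no network call on the request path.
  evaluate(flagKey, userId, defaultValue = false) {
    const known = Object.prototype.hasOwnProperty.call(this.config, flagKey);
    const variant = known ? this.config[flagKey] : defaultValue;
    this.onEvaluation({ flagKey, variant, reason: known ? 'config' : 'default', userId });
    return variant;
  }
}
```

The design choice that matters is that evaluate never awaits anything. Network failures can only affect refresh, which degrades to stale data instead of failed requests.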

When a flag is the suspect in an incident, there are four questions to answer in the first ten minutes:

Which flags changed in the last hour?
For the affected user, which variant fired?
What’s the current variant rollout percentage?
Has the variant cohort’s error rate diverged from baseline?

Without per-flag evaluation events and an audit log, every one of those is a guess. With them, all four are dashboard queries. The minimum bar is an audit log of config changes; the good bar is per-evaluation events you can correlate with request-level error spikes. None of that is exotic — it’s the same telemetry you’d want for any production system. The difference is that flag changes happen during incidents, which is exactly when the data has to already exist. Local-evaluation support across all 11 of our SDKs is built around this.
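Three of the four triage questions reduce to small queries once the events exist; the current rollout percentage is a read from the flag platform and is left out. The sketch assumes hypothetical event shapes: audit entries with flagKey and timestamp, and evaluation events already joined to request outcomes as { flagKey, variant, userId, error }.

```javascript
// Which flags changed since a given time? (audit log)
function flagsChangedSince(changes, since) {
  const recent = changes.filter(change => new Date(change.timestamp) >= since);
  return [...new Set(recent.map(change => change.flagKey))];
}

// For the affected user, which variant fired most recently? (evaluation events, oldest first)
function variantFor(evaluations, flagKey, userId) {
  const hits = evaluations.filter(e => e.flagKey === flagKey && e.userId === userId);
  return hits.length ? hits[hits.length - 1].variant : undefined;
}

// Has one variant's error rate diverged? Returns { variant: errors / evaluations }.
function errorRateByVariant(evaluations, flagKey) {
  const totals = {};
  for (const e of evaluations) {
    if (e.flagKey !== flagKey) continue;
    totals[e.variant] = totals[e.variant] || { count: 0, errors: 0 };
    totals[e.variant].count += 1;
    if (e.error) totals[e.variant].errors += 1;
  }
  return Object.fromEntries(
    Object.entries(totals).map(([variant, t]) => [variant, t.errors / t.count])
  );
}
```

In practice these would be queries against a log store instead of in-memory arrays, but the joins are the same: flag key, user identifier, and request outcome.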

8. Schedule cleanup as recurring discipline, not a sprint

Flag debt accumulates in days but pays down in quarters. Without a recurring cleanup checkpoint, the stale share of your flag inventory climbs steadily, because nothing in the toolchain fails when a flag goes stale. The fix isn’t a heroic quarterly sprint — it’s a small, scheduled checkpoint (monthly for high-velocity teams, quarterly for everyone else) that runs an automated staleness scan and produces a removal queue.

The actual removal procedure is mechanical when the upstream practices are in place: delete the flag check first (the platform now serves the new default), verify in production for a release cycle, then delete the dead branch. Two PRs, in that order. The four-step operational loop — detect, triage, remove, prevent — and the grep / AST scripts that drive it sit in the feature flag cleanup playbook. That’s the depth post. The pillar’s job is to put cleanup at the end on purpose: rules 1 through 7 are the practices that minimise the amount of cleanup you’ll ever have to do. Without them, cleanup is heroic. With them, it’s a 30-minute task on a recurring calendar invite.
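The staleness scan behind that calendar invite can start very small. The sketch below assumes hypothetical inventory records of the form { key, type, percent, lastChangedAt } and reuses the 30-day threshold from rule 1; a production scan would also look at evaluation counts and code references.

```javascript
// Hypothetical inventory records: { key, type, percent, lastChangedAt }.
const TEMPORARY_TYPES = new Set(['release', 'experiment']);
const DAY_MS = 24 * 60 * 60 * 1000;

// A temporary flag is stale once it has sat at 100% (or 0%) unchanged past the threshold.
function buildRemovalQueue(flags, now = new Date(), maxAgeDays = 30) {
  return flags
    .filter(flag => TEMPORARY_TYPES.has(flag.type))
    .filter(flag => flag.percent === 100 || flag.percent === 0)
    .map(flag => ({
      key: flag.key,
      idleDays: Math.floor((now - new Date(flag.lastChangedAt)) / DAY_MS),
    }))
    .filter(entry => entry.idleDays > maxAgeDays)
    .sort((a, b) => b.idleDays - a.idleDays);   // oldest first
}
```

Kill switches and entitlements are excluded by type, which is only possible because the type was recorded at creation.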

The shorter version

Most flag-related problems come from a small set of things people skip at creation time. Tag every flag with one of four lifecycle types — release, experiment, kill switch, or entitlement — and let ownership and removal follow from the type. Name flags so the type, area, and quarter are readable six months later; enforce the convention in CI, not on a wiki. Wrap each flag decision in one named helper so the toggle point and the toggle router stay separable. Test both states and the transition before the flag ships. Roll out progressively through internal, canary, beta, and GA, with a gating metric on each ring. Treat flag flips in production as production changes — audit log, RBAC, second approver for the risky ones. Keep evaluation local and emit a per-evaluation event so incident triage isn’t guesswork. And schedule a small recurring cleanup checkpoint instead of letting debt accumulate to the point of needing a sprint to pay it down.

Frequently asked questions

What are feature flag best practices?

The eight rules production teams converge on: tag every flag with a lifecycle type at creation, name flags with an enforced convention, keep each flag’s decision in one named helper, test both states in CI, roll out progressively through internal-canary-beta-GA, treat flag changes as production changes, keep evaluation local and observable, and schedule cleanup as a recurring checkpoint instead of a quarterly sprint.

How long should a feature flag live in production?

It depends on the type. Release flags should be removed within roughly 30 days of hitting 100%. Experiment flags should live for the experiment window plus a two-week stabilisation. Kill switches and entitlements are permanent if they’re documented as such and assigned an owner (LaunchDarkly). Tag the type at creation; the lifespan follows.

How do you manage feature flag debt?

By preventing it at creation, not cleaning it up at quarter-end. Most feature flags are never removed; toggles are “inventory which comes with a carrying cost” (martinfowler.com). Tag every flag with a lifecycle type, set an expiration date for non-permanent ones, and run a small recurring cleanup checkpoint to clear the queue. The full four-step loop — detect, triage, remove, prevent — lives in our cleanup playbook.

What’s the difference between a feature flag and a kill switch?

A feature flag is the mechanism — a runtime branch the platform can flip without redeploying. A kill switch is one type of feature flag: permanent, owned, exercised periodically, used to disable a misbehaving feature in production. All kill switches are flags; not all flags are kill switches. Tagging the type at creation makes the distinction enforceable instead of cultural.

How do you name feature flags?

The template most teams converge on is <type>_<area>_<feature>_<owner-or-quarter> — for example, release_checkout_v3_2026q3. Type tells you the lifecycle, area tells you which subsystem, feature is the specific change, and the quarter or owner anchors it in time. Enforce the convention with a regex check in CI; a wiki page never gets read at creation time.

How do you roll out a feature flag safely?

Through four rings: internal team and dogfood first (24–48 hours), canary at 1–5% of production (30–60 minutes per increment, gated on error rate and p95 latency), beta at 10–20% (24–72 hours, gated on the business metric), then general availability at 100%. Stabilise at 100% for 7–14 days before removing the flag (LaunchDarkly). Each promotion should be an automated rule, not a Slack message.

Featureflip is built around the practices in this post: sub-millisecond local evaluation, per-evaluation events through the SDK, an audit log on every flag, RBAC for production edits, and flag-type tagging at creation so cleanup follows the lifecycle. If you’d like to see how that plays out in practice, start with the Solo plan. It’s free forever for one project.