Feature Flag Cleanup: A Playbook for Paying Down Flag Debt

The most expensive feature flag in history was a single bit on one server, repurposed in 2012 because Knight Capital had run out of bits to use. The code path under the deprecated flag had been dead for nine years, until that one server executed it again and the company lost $460 million in 45 minutes.

Most teams aren’t a brokerage running an automated order-routing system. But the underlying problem is universal: feature flags are easy to create, awkward to delete, and almost never cleaned up on schedule. Industry data puts the share of flags that never get removed at 73%, with the average enterprise application carrying more than 200 active flags and 60% of them stale beyond 90 days (FlagShark, 2025).

Key Takeaways

73% of feature flags are never removed; engineers in flag-heavy codebases lose 3–5 hours per week navigating them (FlagShark, 2025).

The fix is a four-step loop (detect, triage, remove, prevent), not a one-off purge.

The two-PR removal pattern (delete the flag check first, dead branch second) is the specific procedure that prevents Knight Capital-class incidents.

Governance at flag creation time (naming conventions, owners, expiration dates) does more than any cleanup sprint.

Why feature flag debt accumulates

Creating a flag takes seconds. Removing one takes a code review, a deploy, and someone willing to verify nothing downstream depended on the dead branch. There’s no automated reminder, no compiler error, and no test that fails when a flag goes stale. So flags accumulate the way unused CSS classes accumulate: silently, and faster than anyone expects.

Modeled from FlagShark 2025 industry data — 60% of an average enterprise's 200+ flags are stale beyond 90 days.

The cost compounds in three places. Engineers spend 3–5 hours per week navigating flag-related conditional branches in code review and debugging. Pull request reviews take roughly 60% longer when reviewers have to mentally trace flag interactions. And incident resolution slows by about 40% in flag-heavy codebases, because the runbook now has to consider which paths are gated by which toggles (FlagShark, 2025). For a 50-engineer team, the lost productivity adds up to roughly $520,000 a year. That’s enough to hire three or four senior engineers.
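The inputs behind the $520,000 figure aren’t given here. As a rough sanity check, this is one set of inputs that lands on it. The 4 hours a week is the midpoint of the 3–5 hour range; the 50 working weeks and the $52 loaded hourly cost are assumptions chosen for illustration, not FlagShark’s numbers.

```python
def annual_flag_debt_cost(engineers, hours_per_week, weeks_per_year, hourly_cost):
    """Yearly cost of time spent navigating flags, in dollars."""
    return engineers * hours_per_week * weeks_per_year * hourly_cost


# All four inputs below are illustrative assumptions, not published data.
cost = annual_flag_debt_cost(
    engineers=50,        # team size used in the text
    hours_per_week=4,    # midpoint of the 3-5 hour range
    weeks_per_year=50,   # assumed working weeks
    hourly_cost=52,      # assumed loaded cost per engineer-hour
)
print(f"${cost:,}")
```

Plug in your own headcount and loaded rate; the point is the shape of the calculation, not the exact total.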

There’s a quieter cost too: every deprecated flag is a future Knight Capital, sitting under code that still runs but is no longer expected to.

Step 1 — Detect: find every flag and how stale it is

You can’t clean up what you can’t see. Detection has two halves: inventory (which flag keys exist?) and staleness (when was each one last meaningful?). Combine your platform’s flag list with a code-reference scan and a git log query, and you have a complete picture in under an hour.

If you’re on Featureflip, the staleness half is automatic. Every flag in every environment is classified as Active, Stale, or Dead from real evaluation traffic, with the reason attached: fully rolled out, rolled back, no traffic, or never used. Filter the flag list by status and you have your removal queue. The inventory half still needs a code search, which the tools below cover.

The grep one-liner

For a small codebase, grep gets you most of the way. Adjust the pattern to whatever evaluation API your SDK uses:

grep -rhoE "(boolVariation|isFeatureEnabled|client\.evaluate)\(['\"][^'\"]+['\"]" src/ \
  | sed -E "s/.*\(['\"]([^'\"]+)['\"]/\1/" | sort -u

This produces a rough catalogue of every flag key referenced in code. Crude, but useful as a sanity check against the platform’s flag list. Anything in code but not in the platform is probably dead. Anything in the platform but not in code is a removal candidate.
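The two comparisons above are plain set differences. A minimal sketch, assuming you have the code-side keys and the platform-side keys as two lists; the sample keys and the function name are hypothetical:

```python
def compare_flag_sets(code_keys, platform_keys):
    """Return (only_in_code, only_in_platform) as sorted lists."""
    code, platform = set(code_keys), set(platform_keys)
    return sorted(code - platform), sorted(platform - code)


# Hypothetical sample data standing in for the grep output and a platform export.
code_keys = ["checkout-v2", "new-search", "legacy-banner"]
platform_keys = ["checkout-v2", "new-search", "holiday-promo"]

only_in_code, only_in_platform = compare_flag_sets(code_keys, platform_keys)
print("in code, not on platform:", only_in_code)
print("on platform, not in code:", only_in_platform)
```

Both lists are review queues rather than verdicts. A key built from string concatenation, for example, won’t show up in a literal search.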

A small AST-style scanner

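Once you outgrow grep, a 30-line Python script gives you reference counts per flag key. It is AST-style in spirit only: it matches each key as a quoted string literal with a regular expression rather than parsing a syntax tree, which is enough for triage. Save the platform’s flag keys to a text file, one per line, and run: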

#!/usr/bin/env python3
"""Count feature flag references across a codebase.
Usage: python find-flags.py flags.txt src/"""
import re
import sys
from pathlib import Path

# One flag key per line; strip whitespace and skip blank lines.
lines = Path(sys.argv[1]).read_text().splitlines()
flag_keys = [k.strip() for k in lines if k.strip()]
root = Path(sys.argv[2])
extensions = {".ts", ".tsx", ".js", ".jsx", ".py", ".go", ".cs", ".java", ".rb"}

# Compile one pattern per key: the key as a quoted string literal.
patterns = {k: re.compile(rf'["\']{re.escape(k)}["\']') for k in flag_keys}
counts = dict.fromkeys(flag_keys, 0)

for path in root.rglob("*"):
    if not path.is_file() or path.suffix not in extensions:
        continue
    try:
        text = path.read_text(errors="ignore")
    except OSError:
        continue  # unreadable file, skip it
    for key, pattern in patterns.items():
        counts[key] += len(pattern.findall(text))

# Fewest references first, so removal candidates appear at the top.
for key, n in sorted(counts.items(), key=lambda kv: kv[1]):
    print(f"{n:6d}  {key}")

Flags with zero hits have no code references left and are the easiest to retire. A single hit usually indicates a stale check that was meant to be removed when rollout completed. The case worth a closer look is a couple of hits clustered in a single file: a leftover branch that nobody noticed.

Git archaeology for creation dates

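Reference counts tell you which flags look stale. git log tells you how old each one is:

while read -r flag; do
  # Oldest commit that changed how many times the key appears
  date=$(git log -S"$flag" --format=%cs --reverse \
         | head -1)
  printf "%s\t%s\n" "${date:-unknown}" "$flag"
done < flags.txt | sort

-S is git’s “pickaxe”: it selects commits that change the number of occurrences of a string, which in practice means the commits that introduced or removed it. With --reverse, the first line of output is the oldest such commit, so you get the date the flag key first appeared. Don’t add --diff-filter=A to narrow the search: that option filters on file status and keeps only newly added files, so a flag first introduced by editing an existing file would be missed. Cross-reference the date with your platform’s “last evaluated” timestamp, and any flag with a creation date older than 90 days and zero recent evaluations is a high-confidence cleanup candidate.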
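The cross-reference rule can be written down as a small predicate. This is a sketch under two assumptions: the function name is hypothetical, and “recent” is taken to mean the same 90-day window as the age threshold, which the rule above doesn’t specify.

```python
from datetime import date, timedelta
from typing import Optional


def is_cleanup_candidate(
    created: date,
    last_evaluated: Optional[date],
    today: date,
    threshold_days: int = 90,
) -> bool:
    """True when the flag is older than the threshold and has no recent traffic."""
    cutoff = today - timedelta(days=threshold_days)
    old_enough = created < cutoff
    # None means the platform has never recorded an evaluation.
    no_recent_traffic = last_evaluated is None or last_evaluated < cutoff
    return old_enough and no_recent_traffic


today = date(2025, 6, 1)
print(is_cleanup_candidate(date(2024, 1, 10), None, today))               # old, never evaluated
print(is_cleanup_candidate(date(2024, 1, 10), date(2025, 5, 20), today))  # old, still evaluated
```

Feed it the dates from the git loop and the platform export, and sort the survivors by age to get a removal queue.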

Code-reference scanners also exist as open-source tools published by some flag platforms, if you’d rather adopt one off the shelf than maintain the script above.

Platform-side staleness tooling varies more than most teams expect. ConfigCat ships a zombie flags report keyed off when a flag was last changed or toggled, while Featureflip classifies from live evaluation traffic. If staleness detection is a deciding factor in your platform choice, the ConfigCat alternative guide compares the approaches side by side.

Step 2 — Triage: not every old flag is removable

The easy mistake is treating “old” as a synonym for “removable.” Some flags belong forever. Triage means classifying every flag from Step 1 into one of four buckets before you touch the code.

1. Release flag, fully rolled out

The cleanest cleanup target. The percentage rollout finished weeks or months ago, the new code path has been the only one serving traffic since, and the old branch is dead weight. Remove the flag check, then remove the dead branch, then delete the flag from the platform.

2. Experiment flag, experiment ended

The experiment has a winner. The losing arm is dead code, and the flag is no longer evaluated meaningfully. Remove both the flag check and the losing arm — leaving the loser in place is how an A/B test silently turns into permanent dead code.

3. Permanent flag

Kill switches, plan-tier entitlements, regional gates, and per-tenant overrides legitimately stay forever. They’re not flags in the rollout-toggle sense; they’re runtime targeting rules. Don’t delete them. Do verify each one has an explicit owner, a perm- prefix in the key (so triage doesn’t have to re-evaluate it next quarter), and a sentence of documentation about what it gates.

4. Zombie flag

Evaluated nowhere — the platform shows zero traffic for 90+ days — but code references still exist. This is the dangerous bucket. The code path under the flag is probably dead, but it’s been waiting around long enough that nobody currently on the team remembers what it does. Treat zombies like any other production change: confirm with the team that owned the original feature, check error logs for any evidence the path still runs, and only then proceed to removal.

The Knight Capital incident is what happens when zombies are left in place. The flag bit they reused had been a zombie for nine years.
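The four buckets can be sketched as a small classifier. Everything here is illustrative: the Flag fields, the bucket labels, and the extra "active" fallthrough for flags that fit none of the four are assumptions, not a platform API.

```python
from dataclasses import dataclass


@dataclass
class Flag:
    key: str
    kind: str  # "release", "experiment", or "permanent"
    fully_rolled_out: bool = False
    experiment_ended: bool = False
    days_since_last_evaluation: int = 0
    code_references: int = 0


def triage(flag: Flag) -> str:
    """Sort one flag into a triage bucket."""
    # Permanent flags are checked first so they are never queued for removal.
    if flag.kind == "permanent":
        return "permanent"
    # No traffic for 90+ days but still referenced in code: investigate first.
    if flag.days_since_last_evaluation >= 90 and flag.code_references > 0:
        return "zombie"
    if flag.kind == "release" and flag.fully_rolled_out:
        return "release-done"
    if flag.kind == "experiment" and flag.experiment_ended:
        return "experiment-done"
    return "active"  # assumed fallthrough: still in use, leave alone


print(triage(Flag("release-billing-annual-plans-2026q2", "release", fully_rolled_out=True)))
print(triage(Flag("old-search", "release", days_since_last_evaluation=120, code_references=2)))
```

The ordering is the design choice that matters: the zombie check runs before the “safe to remove” checks, so a flag with no traffic gets a human look even if it is marked as rolled out.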

Step 3 — Remove: how to delete a flag without breaking production

Removing a flag is a production change. Treat it like one. The dangerous removal isn’t the obvious dead branch. It’s the branch that quietly turned out to still be reachable by some code path nobody traced.

The procedure that handles this safely has five steps:

1. Confirm zero unexpected traffic

The flag has been evaluated only at the rolled-out variant for at least 30 days. No edge cases, no weird per-user overrides still firing. If your platform shows surprise evaluations, find out why before continuing.

2. One flag per pull request

Never bundle flag removals. If something goes wrong, you want a clean revert that affects exactly one feature.

3. Delete the flag check first, the dead branch second — in two PRs

This is the single most important rule in flag cleanup, and the one most teams skip. Don’t do this:

// PR that does too much
- if (await client.boolVariation('checkout-v2', ctx, false)) {
-   return renderCheckoutV2(user);
- } else {
-   return renderCheckoutV1(user);
- }
- function renderCheckoutV1(user) { /* 200 lines of legacy code */ }
+ return renderCheckoutV2(user);

Do this instead. PR 1: make the code path unconditional, but leave the dead branch in place.

- if (await client.boolVariation('checkout-v2', ctx, false)) {
-   return renderCheckoutV2(user);
- } else {
-   return renderCheckoutV1(user);
- }
+ return renderCheckoutV2(user);
  function renderCheckoutV1(user) { /* still here, unused */ }

Deploy. Watch error rates and traffic for 24 hours. If anything was wrong, the revert is a single small commit, and the dead-but-present branch is still there to catch any callers you missed.

PR 2: delete the dead branch.

  return renderCheckoutV2(user);
- function renderCheckoutV1(user) { /* 200 lines of legacy code */ }

If you’re confident after PR 1 that nothing else calls renderCheckoutV1, this is safe. If anything does call it, your editor’s “find references” already told you so before you opened the PR.

Completing this pattern is what would have protected Knight Capital. Their problem wasn’t the flag itself. It was that the dead branch had never been deleted, so it was still in the binary on the one server the deploy missed when the reused flag re-activated it.

4. Watch error rates and traffic for 24 hours after PR 1

Treat flag removal like any deploy. Most cleanup-related incidents surface within hours, not days.

5. Then delete the flag from the platform

This is the irreversible step. Do it after both PRs have shipped and the system has been stable for a week. Once the flag is gone from the platform, any rollback path that depended on flipping it is closed.
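The traffic check in step 1 of the procedure above is easy to automate if your platform can export evaluation records. A minimal sketch, assuming the export is a list of (day, variant) pairs; that shape and the function name are hypothetical:

```python
from datetime import date, timedelta


def safe_to_remove(evaluations, rolled_out_variant, today, window_days=30):
    """Return (ok, surprises) for evaluations inside the look-back window.

    ok is True only when every evaluation in the window returned the
    rolled-out variant. surprises lists the ones that did not.
    """
    window_start = today - timedelta(days=window_days)
    recent = [(day, variant) for day, variant in evaluations if day >= window_start]
    surprises = [(day, variant) for day, variant in recent if variant != rolled_out_variant]
    return len(surprises) == 0, surprises


today = date(2025, 6, 30)
evaluations = [
    (date(2025, 4, 1), False),   # outside the 30-day window, ignored
    (date(2025, 6, 10), True),
    (date(2025, 6, 20), False),  # a surprise evaluation at the old variant
]
ok, surprises = safe_to_remove(evaluations, rolled_out_variant=True, today=today)
print(ok, surprises)
```

A non-empty surprises list is the signal to stop and find out which override or edge case is still firing.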

Step 4 — Prevent: governance that stops the cycle

If creation policy doesn’t change, cleanup is a treadmill: you’ll be back here in six months with another 200 flags. Three controls do almost all the work, and none of them require buying anything.

Naming conventions that encode lifecycle

A flag named new_feature loses its meaning the day after it ships. A flag named release-billing-annual-plans-2026q2 tells you the type, the owning team, the feature, and the rough timeline — even if every original author has left the company. A reasonable naming convention:

release-billing-annual-plans-2026q2     # rollout, removable
exp-onboarding-progress-bar-v2          # experiment, removable
perm-killswitch-fraud-detection         # permanent, never to be removed
perm-entitlement-pro-plan-features      # permanent, plan tier

The release-, exp-, and perm- prefixes do most of the work. They let triage immediately separate “should be removed” from “intentionally permanent” without context.
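A convention only holds if something checks it. Here is a minimal sketch of a key validator that could run in CI or a pre-commit hook. The exact pattern (lowercase letters, digits, hyphen-separated segments) is an assumption inferred from the example keys above, not a rule stated anywhere.

```python
import re

# Assumed shape: a lifecycle prefix, then one or more lowercase segments.
FLAG_KEY = re.compile(r"^(release|exp|perm)-[a-z0-9]+(?:-[a-z0-9]+)*$")
LIFECYCLE = {"release": "removable", "exp": "removable", "perm": "permanent"}


def classify_key(key):
    """Return 'removable' or 'permanent', or None if the key breaks the convention."""
    match = FLAG_KEY.match(key)
    if match is None:
        return None
    return LIFECYCLE[match.group(1)]


for key in ["release-billing-annual-plans-2026q2", "perm-killswitch-fraud-detection", "new_feature"]:
    print(key, "->", classify_key(key))
```

Rejecting nonconforming keys at creation time is cheaper than reclassifying them during triage.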

Expiration dates set at creation

Every flag platform should have a “scheduled review” or expiration field. If yours doesn’t, a calendar reminder works. The bar is low: a reminder that defaults to 30 days beats no reminder at all. Google’s internal practice is to expire experiment flags after 30 days unless an engineer explicitly renews them with a justification (Statsig glossary). It’s a mechanism worth borrowing whether you’re at Google scale or not.
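If your platform has no expiration field, the same mechanism fits in a few lines. This is a sketch under assumptions: the 30-day default mirrors the practice described above, while the data shape (key mapped to a created date and an optional renewal date) and the function names are hypothetical.

```python
from datetime import date, timedelta

DEFAULT_REVIEW_DAYS = 30


def review_date(created, renewed_on=None, review_days=DEFAULT_REVIEW_DAYS):
    """Review is due a fixed number of days after creation or the latest renewal."""
    start = renewed_on or created
    return start + timedelta(days=review_days)


def overdue_flags(flags, today):
    """flags maps key -> (created, renewed_on or None). Returns overdue keys, sorted."""
    return sorted(
        key
        for key, (created, renewed_on) in flags.items()
        if review_date(created, renewed_on) < today
    )


flags = {
    "exp-onboarding-progress-bar-v2": (date(2025, 1, 1), None),
    "release-billing-annual-plans-2026q2": (date(2025, 1, 1), date(2025, 2, 15)),
}
print(overdue_flags(flags, today=date(2025, 3, 1)))
```

Run it on a schedule and post the result wherever the owning team will see it.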

A recurring cleanup cadence

A monthly or quarterly “flag debt day” — one engineering hour per person — is more sustainable than a yearly purge. Treat it like dependency upgrades: ignored for a quarter, manageable; ignored for two years, terrifying.

Annual cost of sustained flag debt scales linearly with engineering headcount. Source: FlagShark productivity calculator, 2025.

The investment that prevents most of this is genuinely small: an hour at flag creation to set a name, owner, and expiration; an hour a month to triage. Skipping both, year after year, is what produces the curve in the chart above.

The Knight Capital lesson

In July 2012, Knight Capital deployed new code for its automated equity order-routing system, called SMARS. The deploy missed one of eight production servers. The new code reused a feature flag bit from a feature called Power Peg, which had been deprecated in 2003 and never removed from the binary (Henrico Dolfing case study, 2019).

When the flag was set in production, seven of eight servers ran the new logic. The eighth server, still carrying the old binary, ran Power Peg. Power Peg’s order-fulfillment reporting had been altered after deprecation, so completed orders were never marked as completed, and the system kept sending more. In 45 minutes, Knight bought $7 billion of stock it couldn’t pay for and lost roughly $460 million in the unwind. By December 2012 it had agreed to be acquired by Getco (Knight Capital Group, Wikipedia).

The framing most retellings reach for is “feature flags are dangerous.” That’s the wrong takeaway. The flag was perfectly safe in 2003, when the Power Peg feature was live. It became dangerous only because the flag was retired without removing the code path it gated, and stayed dangerous for nine years until something else turned the bit back on.

The lesson is narrower and more useful: deferred cleanup compounds. Every deprecated flag still in your binary is a future incident waiting for an unrelated change to step on it. The two-PR removal procedure in Step 3 isn’t bureaucracy. It’s the specific protocol that prevents the dead branch from outliving the flag.

For a deeper view of the lifecycle this fits into, see the broader question of when feature flags belong in your stack. Flag reuse is also one of the nine anti-patterns behind real production incidents.

Frequently asked questions

What percentage of feature flags are never removed?

Industry data puts the share at 73%, with the average enterprise application carrying more than 200 active flags and roughly 60% of them stale beyond 90 days (FlagShark, 2025). The accumulation creates measurable productivity costs: engineers spend 3–5 hours a week navigating flag-heavy code in review and debugging.

How do you find unused feature flags in your code?

Three tools, in increasing power. grep for a quick reference catalogue. A small Python script (about 30 lines) that takes your platform’s flag-key list and counts quoted references per flag. And git log -S<flag> --reverse to find when each flag was first introduced, so you can spot keys that have been around for years. Cross-reference with your platform’s “last evaluated” timestamp. On Featureflip that cross-reference is built in: flags are classified Stale or Dead from evaluation traffic, so the platform side of the audit starts as a filtered list rather than a spreadsheet.

How long should a feature flag live?

Most rollout flags should be retired within weeks of hitting 100%. Experiment flags should live for the duration of the experiment plus a two-week stabilization window. Permanent flags (kill switches, plan-tier entitlements, regional gates) live indefinitely if they’re documented as permanent and assigned an owner. Set an expiration or review date when the flag is created, not after.

Is it safe to just delete an old flag from the platform?

Not on its own. Deleting a flag from the platform without first removing the flag check from code is how you produce silent behavior changes. The SDK will start serving the default value, which may or may not match the rolled-out behavior. Delete the flag check from code first (in two PRs, as in Step 3), then delete the flag from the platform once you’ve verified the system is stable.
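A toy model makes the failure mode concrete. ToyFlagClient below is a hypothetical stand-in, not any real SDK. It assumes only the behavior described above: an unknown flag key falls back to the default passed at the call site.

```python
class ToyFlagClient:
    """Hypothetical in-memory stand-in for a flag SDK client."""

    def __init__(self, flags):
        self._flags = dict(flags)

    def bool_variation(self, key, default):
        # Unknown keys fall back to the caller's default.
        return self._flags.get(key, default)

    def delete_flag(self, key):
        self._flags.pop(key, None)


def render_checkout(client):
    # The flag check is still in code, with the pre-rollout default of False.
    if client.bool_variation("checkout-v2", False):
        return "v2"
    return "v1"


client = ToyFlagClient({"checkout-v2": True})
before = render_checkout(client)
client.delete_flag("checkout-v2")  # platform-side delete, code untouched
after = render_checkout(client)
print(before, "->", after)
```

Nothing errors and nothing is logged. Users simply land back on the old checkout, which is why the code-side removal has to come first.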

The shorter version

Cleanup is a four-step loop, not a one-off effort. Detect every flag and how stale it is. Triage into rollout, experiment, permanent, or zombie. Remove rollout and experiment flags in two PRs — flag check first, dead branch second. And prevent the next round by setting a name, owner, and expiration when the flag is created.

The Knight Capital story is the worst-case demonstration of why this matters, but it isn’t the typical case. The typical case is quieter: 200 flags accumulated over two years, three to five hours a week of mental tax per engineer, slower reviews, slower incident response, and a codebase that’s slightly harder to reason about than it should be. Every flag you create is a flag someone has to remember to delete.

Cleanup sits inside the broader practice. For the eight rules production teams converge on — type tagging at creation, naming conventions, flat decisions, both-state testing, progressive rollouts, governance, observability, and recurring cleanup — see the canonical best-practices reference.

Featureflip is built around this lifecycle — flag keys, owners, and last-evaluated timestamps are first-class in the dashboard, so naming conventions and stale-flag triage become observable rather than tribal knowledge. If you’d like to see how that plays out in practice, start with the Solo plan. It’s free forever for one project.