<?xml version="1.0" encoding="UTF-8"?><rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Featureflip Blog</title><description>Notes on feature flags, releases, and engineering from the Featureflip team.</description><link>https://featureflip.io/</link><item><title>Feature Flag Cleanup: A Playbook for Paying Down Flag Debt</title><link>https://featureflip.io/blog/feature-flag-cleanup/</link><guid isPermaLink="true">https://featureflip.io/blog/feature-flag-cleanup/</guid><description>73% of feature flags are never removed — one reused bit cost Knight Capital $460M. The four-step playbook to clean up flag debt safely.</description><pubDate>Mon, 27 Apr 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;The most expensive feature flag in history was a single bit on one server, repurposed in 2012 because Knight Capital had run out of bits to use. The code path under the deprecated flag had been dead for nine years — until that one server executed it again, and the company lost $460 million in 45 minutes.&lt;/p&gt;
&lt;p&gt;Most teams aren&apos;t a brokerage running an automated order-routing system. But the underlying problem is universal: feature flags are easy to create, awkward to delete, and almost never cleaned up on schedule. Industry data puts the share of flags that never get removed at 73%, with the average enterprise application carrying more than 200 active flags and 60% of them stale beyond 90 days (&lt;a href=&quot;https://flagshark.com/blog/feature-flag-graveyard-73-percent-never-removed/&quot;&gt;FlagShark&lt;/a&gt;, 2025).&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Key Takeaways&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;73% of feature flags are never removed; engineers in flag-heavy codebases lose 3–5 hours per week navigating them (&lt;a href=&quot;https://flagshark.com/blog/feature-flag-graveyard-73-percent-never-removed/&quot;&gt;FlagShark&lt;/a&gt;, 2025).&lt;/li&gt;
&lt;li&gt;The fix is a four-step loop (&lt;strong&gt;detect, triage, remove, prevent&lt;/strong&gt;), not a one-off purge.&lt;/li&gt;
&lt;li&gt;The two-PR removal pattern (delete the flag check first, dead branch second) is the specific procedure that prevents Knight Capital-class incidents.&lt;/li&gt;
&lt;li&gt;Governance at flag &lt;em&gt;creation&lt;/em&gt; time (naming conventions, owners, expiration dates) does more than any cleanup sprint.&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;hr /&gt;
&lt;h2&gt;Why feature flag debt accumulates&lt;/h2&gt;
&lt;p&gt;Creating a flag takes seconds. Removing one takes a code review, a deploy, and someone willing to verify nothing downstream depended on the dead branch. There&apos;s no automated reminder, no compiler error, and no test that fails when a flag goes stale. So flags accumulate the way unused CSS classes accumulate: silently, and faster than anyone expects.&lt;/p&gt;
&lt;figure&gt;
  &lt;p&gt;&lt;em&gt;Figure: active feature flag count over 24 months for a typical enterprise application. Total active flags grow from 0 to 210, while the stale-beyond-90-days portion grows to 126 (about 60% of the total) by month 24.&lt;/em&gt;&lt;/p&gt;
  &lt;figcaption&gt;Modeled from FlagShark 2025 industry data — 60% of an average enterprise&apos;s 200+ flags are stale beyond 90 days.&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;The cost compounds in three places. Engineers spend 3–5 hours per week navigating flag-related conditional branches in code review and debugging. Pull request reviews take roughly 60% longer when reviewers have to mentally trace flag interactions. And incident resolution slows by about 40% in flag-heavy codebases, because the runbook now has to consider which paths are gated by which toggles (&lt;a href=&quot;https://flagshark.com/blog/feature-flag-graveyard-73-percent-never-removed/&quot;&gt;FlagShark&lt;/a&gt;, 2025). For a 50-engineer team, the lost productivity adds up to roughly $520,000 a year. That&apos;s enough to hire three or four senior engineers.&lt;/p&gt;
&lt;p&gt;There&apos;s a quieter cost too: every deprecated flag is a future Knight Capital, sitting under code that still runs but is no longer expected to.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;Step 1 — Detect: find every flag and how stale it is&lt;/h2&gt;
&lt;p&gt;You can&apos;t clean up what you can&apos;t see. Detection has two halves: &lt;strong&gt;inventory&lt;/strong&gt; (which flag keys exist?) and &lt;strong&gt;staleness&lt;/strong&gt; (when was each one last meaningful?). Combine your platform&apos;s &lt;a href=&quot;/docs/concepts/feature-flags/&quot;&gt;flag list&lt;/a&gt; with a code-reference scan and a &lt;code&gt;git log&lt;/code&gt; query, and you have a complete picture in under an hour.&lt;/p&gt;
&lt;h3&gt;The grep one-liner&lt;/h3&gt;
&lt;p&gt;For a small codebase, grep gets you most of the way. Adjust the pattern to whatever evaluation API your SDK uses:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;grep -rhoE &quot;(boolVariation|isFeatureEnabled|client\.evaluate)\([&apos;\&quot;][^&apos;\&quot;]+&quot; src/ \
  | grep -oE &quot;[&apos;\&quot;][^&apos;\&quot;]+$&quot; | tr -d &quot;&apos;\&quot;&quot; | sort -u
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This produces a rough catalogue of every flag key referenced in code. Crude, but useful as a sanity check against the platform&apos;s flag list. Anything in code but not in the platform is probably dead. Anything in the platform but not in code is a removal candidate.&lt;/p&gt;
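&lt;p&gt;That cross-check is a set difference in either direction. A minimal sketch, with illustrative key lists loaded inline rather than from files:&lt;/p&gt;

```python
# Cross-check keys referenced in code against keys the platform knows about.
# In practice each set would come from the grep output and the platform's
# flag list; these inline examples are for illustration only.
code_keys = {"checkout-v2", "release-billing-annual-plans-2026q2", "old-banner"}
platform_keys = {"checkout-v2", "release-billing-annual-plans-2026q2", "exp-unused"}

dead_in_code = code_keys - platform_keys        # in code, unknown to the platform
removal_candidates = platform_keys - code_keys  # on the platform, never referenced

print("probably dead:", sorted(dead_in_code))            # ['old-banner']
print("removal candidates:", sorted(removal_candidates)) # ['exp-unused']
```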
&lt;h3&gt;A small reference-counting scanner&lt;/h3&gt;
&lt;p&gt;Once you outgrow grep, a 30-line Python script gives you reference &lt;em&gt;counts&lt;/em&gt; per flag key. Save the platform&apos;s flag keys to a text file, one per line, and run:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;#!/usr/bin/env python3
&quot;&quot;&quot;Count feature flag references across a codebase.
Usage: python find-flags.py flags.txt src/&quot;&quot;&quot;
import re, sys
from pathlib import Path

flag_keys = [k.strip() for k in Path(sys.argv[1]).read_text().splitlines()]
root = Path(sys.argv[2])
extensions = {&quot;.ts&quot;, &quot;.tsx&quot;, &quot;.js&quot;, &quot;.jsx&quot;, &quot;.py&quot;, &quot;.go&quot;, &quot;.cs&quot;, &quot;.java&quot;, &quot;.rb&quot;}

counts = {k: 0 for k in flag_keys if k}
for path in root.rglob(&quot;*&quot;):
    if not path.is_file() or path.suffix not in extensions:
        continue
    try:
        text = path.read_text(errors=&quot;ignore&quot;)
    except Exception:
        continue
    for key in counts:
        counts[key] += len(re.findall(rf&apos;[&quot;\&apos;]{re.escape(key)}[&quot;\&apos;]&apos;, text))

for key, n in sorted(counts.items(), key=lambda kv: kv[1]):
    print(f&quot;{n:6d}  {key}&quot;)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Flags with zero hits are obviously dead. Flags with a single hit usually indicate a stale check that was meant to be removed when rollout completed. The interesting case is one or two hits clustered in a single file: a leftover branch that nobody noticed.&lt;/p&gt;
&lt;h3&gt;Git archaeology for creation dates&lt;/h3&gt;
&lt;p&gt;Reference counts tell you what&apos;s stale. &lt;code&gt;git log&lt;/code&gt; tells you how long it&apos;s been stale:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;while read -r flag; do
  date=$(git log -S&quot;$flag&quot; --format=%cs --reverse | head -1)
  printf &quot;%s\t%s\n&quot; &quot;${date:-unknown}&quot; &quot;$flag&quot;
done &amp;lt; flags.txt | sort
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;code&gt;-S&lt;/code&gt; is git&apos;s &quot;pickaxe&quot;: it finds commits that change the number of occurrences of a string. With &lt;code&gt;--reverse&lt;/code&gt; and &lt;code&gt;head -1&lt;/code&gt;, the first date printed is the date the flag key first appeared. Cross-reference that with your platform&apos;s &quot;last evaluated&quot; timestamp, and any flag with a creation date older than 90 days &lt;em&gt;and&lt;/em&gt; zero recent evaluations is a high-confidence cleanup candidate.&lt;/p&gt;
&lt;p&gt;Code-reference scanners also exist as open-source tools published by some flag platforms, if you&apos;d rather adopt one off the shelf than maintain the script above.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;Step 2 — Triage: not every old flag is removable&lt;/h2&gt;
&lt;p&gt;The easy mistake is treating &quot;old&quot; as a synonym for &quot;removable.&quot; Some flags belong forever. Triage means classifying every flag from Step 1 into one of four buckets &lt;em&gt;before&lt;/em&gt; you touch the code.&lt;/p&gt;
&lt;h3&gt;1. Release flag, fully rolled out&lt;/h3&gt;
&lt;p&gt;The cleanest cleanup target. The &lt;a href=&quot;/docs/concepts/rollout-strategies/&quot;&gt;percentage rollout&lt;/a&gt; finished weeks or months ago, the new code path has been the only one serving traffic since, and the old branch is dead weight. Remove the flag check, then remove the dead branch, then delete the flag from the platform.&lt;/p&gt;
&lt;h3&gt;2. Experiment flag, experiment ended&lt;/h3&gt;
&lt;p&gt;The experiment has a winner. The losing arm is dead code, and the flag is no longer evaluated meaningfully. Remove both the flag check &lt;em&gt;and&lt;/em&gt; the losing arm — leaving the loser in place is how an A/B test silently turns into permanent dead code.&lt;/p&gt;
&lt;h3&gt;3. Permanent flag&lt;/h3&gt;
&lt;p&gt;Kill switches, plan-tier entitlements, regional gates, and per-tenant overrides legitimately stay forever. They&apos;re not flags in the rollout-toggle sense; they&apos;re runtime targeting rules. Don&apos;t delete them. Do verify each one has an explicit owner, a &lt;code&gt;perm-&lt;/code&gt; prefix in the key (so triage doesn&apos;t have to re-evaluate it next quarter), and a sentence of documentation about what it gates.&lt;/p&gt;
&lt;h3&gt;4. Zombie flag&lt;/h3&gt;
&lt;p&gt;Evaluated nowhere — the platform shows zero traffic for 90+ days — but code references still exist. This is the dangerous bucket. The code path under the flag is &lt;em&gt;probably&lt;/em&gt; dead, but it&apos;s been waiting around long enough that nobody currently on the team remembers what it does. Treat zombies like any other production change: confirm with the team that owned the original feature, check error logs for any evidence the path still runs, and only then proceed to removal.&lt;/p&gt;
&lt;p&gt;The Knight Capital incident is what happens when zombies are left in place. The flag bit they reused had been a zombie for nine years.&lt;/p&gt;
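&lt;p&gt;The four buckets can be sketched as a classification function. The inputs here are hypothetical stand-ins for your platform&apos;s analytics and the Step 1 scan; experiment flags with a declared winner follow the same removal path as rolled-out release flags:&lt;/p&gt;

```python
# Triage sketch. Inputs are illustrative: 90-day evaluation count and rollout
# percentage from the platform, code-reference count from the Step 1 scan.
def triage(key, evals_90d, rollout_pct, code_refs):
    if key.startswith("perm-"):
        return "permanent"           # kill switches, entitlements: keep, document
    if evals_90d == 0 and code_refs > 0:
        return "zombie"              # confirm with the owning team before touching
    if rollout_pct == 100:
        return "rolled-out release"  # cleanest removal target
    return "active"                  # still doing its job; leave alone

print(triage("perm-killswitch-fraud-detection", 9000, 100, 3))  # permanent
print(triage("power-peg", 0, 0, 2))                             # zombie
```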
&lt;hr /&gt;
&lt;h2&gt;Step 3 — Remove: how to delete a flag without breaking production&lt;/h2&gt;
&lt;p&gt;Removing a flag is a production change. Treat it like one. The dangerous removal isn&apos;t the obvious dead branch. It&apos;s the branch that quietly turned out to still be reachable by some code path nobody traced.&lt;/p&gt;
&lt;p&gt;The procedure that handles this safely has five steps:&lt;/p&gt;
&lt;h3&gt;1. Confirm zero unexpected traffic&lt;/h3&gt;
&lt;p&gt;The flag has been evaluated only at the rolled-out variant for at least 30 days. No edge cases, no weird per-user overrides still firing. If your platform shows surprise evaluations, find out why before continuing.&lt;/p&gt;
&lt;h3&gt;2. One flag per pull request&lt;/h3&gt;
&lt;p&gt;Never bundle flag removals. If something goes wrong, you want a clean revert that affects exactly one feature.&lt;/p&gt;
&lt;h3&gt;3. Delete the flag check first, the dead branch second — in two PRs&lt;/h3&gt;
&lt;p&gt;This is the single most important rule in flag cleanup, and the one most teams skip. Don&apos;t do this:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-ts&quot;&gt;// PR that does too much
- if (await client.boolVariation(&apos;checkout-v2&apos;, ctx, false)) {
-   return renderCheckoutV2(user);
- } else {
-   return renderCheckoutV1(user);
- }
- function renderCheckoutV1(user) { /* 200 lines of legacy code */ }
+ return renderCheckoutV2(user);
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Do this instead. &lt;strong&gt;PR 1&lt;/strong&gt;: make the code path unconditional, but leave the dead branch in place.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-ts&quot;&gt;- if (await client.boolVariation(&apos;checkout-v2&apos;, ctx, false)) {
-   return renderCheckoutV2(user);
- } else {
-   return renderCheckoutV1(user);
- }
+ return renderCheckoutV2(user);
  function renderCheckoutV1(user) { /* still here, unused */ }
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Deploy. Watch error rates and traffic for 24 hours. If anything was wrong, the revert is a one-line change and the dead-but-present branch is still there to catch any callers you missed.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;PR 2&lt;/strong&gt;: delete the dead branch.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-ts&quot;&gt;  return renderCheckoutV2(user);
- function renderCheckoutV1(user) { /* 200 lines of legacy code */ }
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If you&apos;re confident after PR 1 that nothing else calls &lt;code&gt;renderCheckoutV1&lt;/code&gt;, this is safe. If anything &lt;em&gt;does&lt;/em&gt; call it, your editor&apos;s &quot;find references&quot; already told you so before you opened the PR.&lt;/p&gt;
&lt;p&gt;This two-PR pattern is exactly what would have prevented Knight Capital. Their problem wasn&apos;t the flag itself — it was that the dead branch was still in the deployed binary on one of eight servers when an unrelated change re-activated it.&lt;/p&gt;
&lt;h3&gt;4. Watch error rates and traffic for 24 hours after PR 1&lt;/h3&gt;
&lt;p&gt;Treat flag removal like any deploy. Most cleanup-related incidents surface within hours, not days.&lt;/p&gt;
&lt;h3&gt;5. Then delete the flag from the platform&lt;/h3&gt;
&lt;p&gt;This is the irreversible step. Do it after both PRs have shipped and the system has been stable for a week. Once the flag is gone from the platform, any rollback path that depended on flipping it is closed.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;Step 4 — Prevent: governance that stops the cycle&lt;/h2&gt;
&lt;p&gt;If creation policy doesn&apos;t change, cleanup is a treadmill: you&apos;ll be back here in six months with another 200 flags. Three controls do almost all the work, and none of them require buying anything.&lt;/p&gt;
&lt;h3&gt;Naming conventions that encode lifecycle&lt;/h3&gt;
&lt;p&gt;A flag named &lt;code&gt;new_feature&lt;/code&gt; loses its meaning the day after it ships. A flag named &lt;code&gt;release-billing-annual-plans-2026q2&lt;/code&gt; tells you the type, the owning team, the feature, and the rough timeline — even if every original author has left the company. A reasonable &lt;a href=&quot;/docs/guides/creating-flags/&quot;&gt;naming convention&lt;/a&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;release-billing-annual-plans-2026q2     # rollout, removable
exp-onboarding-progress-bar-v2          # experiment, removable
perm-killswitch-fraud-detection         # permanent, never to be removed
perm-entitlement-pro-plan-features      # permanent, plan tier
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;code&gt;release-&lt;/code&gt;, &lt;code&gt;exp-&lt;/code&gt;, and &lt;code&gt;perm-&lt;/code&gt; prefixes do most of the work. They let triage immediately separate &quot;should be removed&quot; from &quot;intentionally permanent&quot; without context.&lt;/p&gt;
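&lt;p&gt;The convention is also cheap to enforce mechanically when a flag is created. A sketch of a lint check; the exact pattern is an assumption, not a standard:&lt;/p&gt;

```python
import re

# Require a lifecycle prefix followed by a kebab-case slug.
VALID_KEY = re.compile(r"^(release|exp|perm)-[a-z0-9]+(-[a-z0-9]+)*$")

def check_key(key):
    """Return True if the flag key follows the naming convention."""
    return bool(VALID_KEY.match(key))

print(check_key("release-billing-annual-plans-2026q2"))  # True
print(check_key("new_feature"))                          # False: no prefix, underscore
```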
&lt;h3&gt;Expiration dates set at creation&lt;/h3&gt;
&lt;p&gt;Every flag platform should have a &quot;scheduled review&quot; or expiration field. If yours doesn&apos;t, a calendar reminder works. The bar is low: any defaulted-to-30-days reminder beats no reminder at all. Google&apos;s internal practice is to expire experiment flags after 30 days unless an engineer explicitly renews them with a justification (&lt;a href=&quot;https://www.statsig.com/glossary/feature-flag-naming-convention&quot;&gt;Statsig glossary&lt;/a&gt;). It&apos;s a mechanism worth borrowing whether you&apos;re at Google scale or not.&lt;/p&gt;
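&lt;p&gt;Even without platform support, the mechanism is small enough to script. A sketch, assuming each flag record carries a creation date and an optional renewal date (field names are illustrative):&lt;/p&gt;

```python
from datetime import date, timedelta

# Illustrative records: every flag gets a 30-day review deadline at creation
# unless someone explicitly renews it with a later date.
flags = [
    {"key": "exp-progress-bar", "created": date(2026, 3, 1), "renewed_until": None},
    {"key": "release-checkout", "created": date(2026, 4, 10), "renewed_until": None},
]

def review_due(flag, today, default_ttl_days=30):
    deadline = flag["renewed_until"] or flag["created"] + timedelta(days=default_ttl_days)
    return today > deadline

today = date(2026, 4, 27)
overdue = [f["key"] for f in flags if review_due(f, today)]
print(overdue)  # ['exp-progress-bar']
```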
&lt;h3&gt;A recurring cleanup cadence&lt;/h3&gt;
&lt;p&gt;A monthly or quarterly &quot;flag debt day&quot; — one engineering hour per person — is more sustainable than a yearly purge. Treat it like dependency upgrades: ignored for a quarter, manageable; ignored for two years, terrifying.&lt;/p&gt;
&lt;figure&gt;
  &lt;p&gt;&lt;em&gt;Figure: annual flag-debt cost by engineering team size, scaling linearly: $104k at 10 engineers, $260k at 25, $520k at 50, and $1.04M at 100.&lt;/em&gt;&lt;/p&gt;
  &lt;figcaption&gt;Annual cost of sustained flag debt scales linearly with engineering headcount. Source: FlagShark productivity calculator, 2025.&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;The investment that prevents most of this is genuinely small: an hour at flag creation to set a name, owner, and expiration; an hour a month to triage. Skipping both, year after year, is what produces the curve in the chart above.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;The Knight Capital lesson&lt;/h2&gt;
&lt;p&gt;In July 2012, Knight Capital deployed new code for its automated equity order-routing system, called SMARS. The deploy missed one of eight production servers. The new code reused a feature flag bit from a feature called Power Peg, which had been deprecated in 2003 and never removed from the binary (&lt;a href=&quot;https://www.henricodolfing.ch/en/case-study-4-the-440-million-software-error-at-knight-capital/&quot;&gt;Henrico Dolfing case study&lt;/a&gt;, 2019).&lt;/p&gt;
&lt;p&gt;When the flag was set in production, seven of eight servers ran the new logic. The eighth server, still carrying the old binary, ran Power Peg. Power Peg&apos;s order-fulfillment reporting had been altered after deprecation, so completed orders were never marked as completed — and the system kept sending more. In 45 minutes, Knight bought $7 billion of stock it couldn&apos;t pay for, lost roughly $460 million in the unwind, and agreed to a merger with Getco before the year was out (&lt;a href=&quot;https://en.wikipedia.org/wiki/Knight_Capital_Group&quot;&gt;Knight Capital Group&lt;/a&gt;, Wikipedia).&lt;/p&gt;
&lt;p&gt;The framing most retellings reach for is &quot;feature flags are dangerous.&quot; That&apos;s the wrong takeaway. The flag was perfectly safe in 2003, when the Power Peg feature was live. It became dangerous only because the flag was retired without removing the code path it gated, and stayed dangerous for nine years until something else turned the bit back on.&lt;/p&gt;
&lt;p&gt;The lesson is narrower and more useful: &lt;strong&gt;deferred cleanup compounds&lt;/strong&gt;. Every deprecated flag still in your binary is a future incident waiting for an unrelated change to step on it. The two-PR removal procedure in Step 3 isn&apos;t bureaucracy. It&apos;s the specific protocol that prevents the dead branch from outliving the flag.&lt;/p&gt;
&lt;p&gt;For a deeper view of the lifecycle this fits into, see the broader question of &lt;a href=&quot;/blog/feature-flags-vs-environment-variables/&quot;&gt;when feature flags belong in your stack&lt;/a&gt;.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;Frequently asked questions&lt;/h2&gt;
&lt;h3&gt;What percentage of feature flags are never removed?&lt;/h3&gt;
&lt;p&gt;Industry data puts the share at 73%, with the average enterprise application carrying more than 200 active flags and roughly 60% of them stale beyond 90 days (&lt;a href=&quot;https://flagshark.com/blog/feature-flag-graveyard-73-percent-never-removed/&quot;&gt;FlagShark&lt;/a&gt;, 2025). The accumulation creates measurable productivity costs: engineers spend 3–5 hours a week navigating flag-heavy code in review and debugging.&lt;/p&gt;
&lt;h3&gt;How do you find unused feature flags in your code?&lt;/h3&gt;
&lt;p&gt;Three tools, in increasing power. &lt;code&gt;grep&lt;/code&gt; for a quick reference catalogue. A small reference-counting script (about 30 lines of Python) that takes your platform&apos;s flag-key list and counts references per flag. And &lt;code&gt;git log -S&amp;lt;flag&amp;gt; --reverse&lt;/code&gt; to find when each flag was first introduced, so you can spot keys that have been stale for years. Cross-reference with your platform&apos;s &quot;last evaluated&quot; timestamp.&lt;/p&gt;
&lt;h3&gt;How long should a feature flag live?&lt;/h3&gt;
&lt;p&gt;Most rollout flags should be retired within weeks of hitting 100%. Experiment flags should live for the duration of the experiment plus a two-week stabilization window. Permanent flags (kill switches, plan-tier entitlements, regional gates) live indefinitely &lt;em&gt;if&lt;/em&gt; they&apos;re documented as permanent and assigned an owner. Set an expiration or review date when the flag is created, not after.&lt;/p&gt;
&lt;h3&gt;Is it safe to just delete an old flag from the platform?&lt;/h3&gt;
&lt;p&gt;Not on its own. Deleting a flag from the platform without first removing the flag check from code is how you produce silent behavior changes. The SDK will start serving the default value, which may or may not match the rolled-out behavior. Delete the flag check from code first (in two PRs, as in Step 3), then delete the flag from the platform once you&apos;ve verified the system is stable.&lt;/p&gt;
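&lt;p&gt;The failure mode is easy to demonstrate with a toy evaluator. &lt;code&gt;evaluate&lt;/code&gt; here mimics the usual SDK contract of serving the code-side default when a key is unknown; it is a sketch, not any particular SDK&apos;s API:&lt;/p&gt;

```python
# Toy evaluator: serve the code-side default when the platform no longer
# knows the key. Illustrative, not any real SDK.
platform = {"checkout-v2": True}  # rolled out to 100% of users

def evaluate(key, default):
    return platform.get(key, default)

# Code still guards the new path with default=False...
assert evaluate("checkout-v2", False) is True   # users see the new checkout

del platform["checkout-v2"]  # flag deleted platform-side first: the mistake

# ...and every user silently falls back to the old path.
assert evaluate("checkout-v2", False) is False
```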
&lt;hr /&gt;
&lt;h2&gt;The shorter version&lt;/h2&gt;
&lt;p&gt;Cleanup is a four-step loop, not a one-off effort. Detect every flag and how stale it is. Triage into rollout, experiment, permanent, or zombie. Remove rollout and experiment flags in two PRs — flag check first, dead branch second. And prevent the next round by setting a name, owner, and expiration when the flag is created.&lt;/p&gt;
&lt;p&gt;The Knight Capital story is the worst-case demonstration of why this matters, but it isn&apos;t the typical case. The typical case is quieter: 200 flags accumulated over two years, three to five hours a week of mental tax per engineer, slower reviews, slower incident response, and a codebase that&apos;s slightly harder to reason about than it should be. Every flag you create is a flag someone has to remember to delete.&lt;/p&gt;
&lt;p&gt;Featureflip is built around this lifecycle — flag keys, owners, and last-evaluated timestamps are first-class in the dashboard, so naming conventions and stale-flag triage become observable rather than tribal knowledge. If you&apos;d like to see how that plays out in practice, &lt;a href=&quot;https://app.featureflip.io/sign-up&quot;&gt;start with the Solo plan&lt;/a&gt;. It&apos;s free forever for one project.&lt;/p&gt;
</content:encoded><category>cleanup</category><category>debt</category><category>fundamentals</category></item><item><title>Feature Flags vs Environment Variables: A Practical Guide</title><link>https://featureflip.io/blog/feature-flags-vs-environment-variables/</link><guid isPermaLink="true">https://featureflip.io/blog/feature-flags-vs-environment-variables/</guid><description>Env vars configure where the process runs; flags decide what each request sees. Decision rule, comparison table, and 4 production gotchas inside.</description><pubDate>Sat, 25 Apr 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Environment variables and feature flags both control application behavior, but they solve different problems. Conflating them leads to fragile systems, awkward deployment workflows, and flags that outlive their usefulness by years. The doctrine of putting config in the environment goes back to &lt;a href=&quot;https://12factor.net/config&quot;&gt;The Twelve-Factor App&lt;/a&gt;; the concept of a feature toggle was formalised by &lt;a href=&quot;https://martinfowler.com/articles/feature-toggles.html&quot;&gt;Martin Fowler and Pete Hodgson&lt;/a&gt;, who split toggles into release, experiment, ops, and permission categories. This post lays out when each tool belongs, and where teams confuse them.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Key Takeaways&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Env vars&lt;/strong&gt; configure &lt;em&gt;where&lt;/em&gt; and &lt;em&gt;how&lt;/em&gt; a process runs (database URLs, API keys, log levels). They are static for the life of the process and require a redeploy to change.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Feature flags&lt;/strong&gt; configure &lt;em&gt;what a specific request or user sees&lt;/em&gt;. They are evaluated per-request, support targeting, and change without a deploy.&lt;/li&gt;
&lt;li&gt;The one-line rule: if two users hitting the same running instance could ever need different values, it&apos;s a flag. If not, it&apos;s an env var.&lt;/li&gt;
&lt;li&gt;The most common mistake is &lt;code&gt;ENABLE_X=true&lt;/code&gt; env vars repurposed as poor-man&apos;s flags. Fine until the day you need a 10% rollout or a 2 AM kill switch (&lt;a href=&quot;/blog/feature-flag-cleanup/&quot;&gt;feature flag cleanup playbook&lt;/a&gt;).&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;h2&gt;What each one is&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;An environment variable&lt;/strong&gt; is a named string value set in the process environment at startup and treated as static for the lifetime of that process. It configures where the application connects, what mode it runs in, and what secrets it uses: things that differ between deployment targets, not between users.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;A feature flag&lt;/strong&gt; is a named boolean or multi-variant value evaluated at runtime, typically over a remote source of truth, that controls which code path executes for a given request or user. It changes behavior without a redeploy, and can be scoped to a specific segment of traffic.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;When should you use environment variables?&lt;/h2&gt;
&lt;p&gt;Environment variables excel at configuration that is static within an environment, secret, or determined before the process starts. The Twelve-Factor App&apos;s third factor codifies this: anything that varies between deploys (credentials, hosts, per-environment toggles) should live in the process environment, not in source code (&lt;a href=&quot;https://12factor.net/config&quot;&gt;12factor.net&lt;/a&gt;). Env vars are read once at startup and treated as immutable for the process&apos;s lifetime.&lt;/p&gt;
&lt;h3&gt;1. Database connection strings&lt;/h3&gt;
&lt;p&gt;&lt;code&gt;DATABASE_URL&lt;/code&gt; is the canonical example. It points to your Postgres instance, includes credentials, and is different in dev, staging, and production. It never changes while the app is running. Storing it as an env var means it stays out of source code, can be rotated by updating the deployment secret, and doesn&apos;t require any runtime evaluation logic.&lt;/p&gt;
&lt;h3&gt;2. API keys and secrets&lt;/h3&gt;
&lt;p&gt;Third-party service keys (payment processor secrets, object storage credentials, outbound email API keys) are secrets first and configuration second. Env vars compose naturally with secret management systems (Kubernetes secrets, Vault, Doppler) and satisfy security policies that require secrets to be kept out of application code. Evaluating a feature flag to find a secret is the wrong abstraction entirely.&lt;/p&gt;
&lt;h3&gt;3. Third-party service endpoints&lt;/h3&gt;
&lt;p&gt;&lt;code&gt;STRIPE_API_BASE&lt;/code&gt;, &lt;code&gt;OPENAI_API_HOST&lt;/code&gt;, &lt;code&gt;S3_ENDPOINT&lt;/code&gt;. These are environment-level choices. Staging points at sandbox endpoints, production points at live ones. These don&apos;t change per-user and don&apos;t need runtime toggle semantics.&lt;/p&gt;
&lt;h3&gt;4. Build-time and runtime mode flags&lt;/h3&gt;
&lt;p&gt;&lt;code&gt;NODE_ENV=production&lt;/code&gt;, &lt;code&gt;RAILS_ENV=production&lt;/code&gt;, &lt;code&gt;ASPNETCORE_ENVIRONMENT=Production&lt;/code&gt;. These inform the framework itself, not just your code. They affect which config files load, whether debug middleware is enabled, and how assets are bundled. They must be set before the process starts and cannot meaningfully be changed mid-flight.&lt;/p&gt;
&lt;h3&gt;5. Log levels and observability config&lt;/h3&gt;
&lt;p&gt;&lt;code&gt;LOG_LEVEL=warn&lt;/code&gt;, &lt;code&gt;OTEL_EXPORTER_OTLP_ENDPOINT&lt;/code&gt;, &lt;code&gt;SENTRY_DSN&lt;/code&gt;. These affect how the process reports on itself. They are environment-wide, they configure external systems, and they often come from your platform&apos;s secrets store. Routing them through a feature flag system adds a circular dependency (what if the flag system itself fails before logging is configured?).&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The common thread:&lt;/strong&gt; env vars are for &lt;em&gt;where&lt;/em&gt; and &lt;em&gt;how&lt;/em&gt; the process runs, not &lt;em&gt;what&lt;/em&gt; it does for a specific user.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;When should you use feature flags?&lt;/h2&gt;
&lt;p&gt;Feature flags shine when the question is: &quot;should this code path execute, for whom, and starting when?&quot; Martin Fowler&apos;s canonical taxonomy splits flags into four categories (&lt;em&gt;release&lt;/em&gt;, &lt;em&gt;experiment&lt;/em&gt;, &lt;em&gt;ops&lt;/em&gt;, and &lt;em&gt;permission&lt;/em&gt; toggles), each with different lifetimes and ownership (&lt;a href=&quot;https://martinfowler.com/articles/feature-toggles.html&quot;&gt;martinfowler.com&lt;/a&gt;). Three of those four are impossible to express cleanly as env vars, because they need per-request, per-user, or runtime-mutable evaluation.&lt;/p&gt;
&lt;h3&gt;1. Gradual rollouts&lt;/h3&gt;
&lt;p&gt;You&apos;ve merged a rework of your checkout flow. You want to expose it to 5% of users, monitor error rates and conversion, then ramp to 100% if the metrics look healthy, all without a second deployment. A feature flag does this; an env var does not.&lt;/p&gt;
&lt;h3&gt;2. Kill switches&lt;/h3&gt;
&lt;p&gt;Certain features carry operational risk: a new third-party integration, a resource-intensive background job, a new payment provider. A kill switch flag lets an on-call engineer disable it in seconds without touching infrastructure. An env var change requires a process restart (typically a rolling deploy that takes minutes, not seconds) and introduces its own risk.&lt;/p&gt;
&lt;h3&gt;3. A/B tests and experiments&lt;/h3&gt;
&lt;p&gt;Testing two button labels, two pricing page layouts, or two recommendation algorithms requires serving different variants to different users within the same deployed build. That&apos;s a flag, specifically a multivariate flag, not a config value.&lt;/p&gt;
&lt;h3&gt;4. Per-user and per-segment targeting&lt;/h3&gt;
&lt;p&gt;Beta programs, internal dogfooding, enterprise tenant overrides, and geographic feature launches all require the same question: &quot;should user X get behavior Y?&quot; Env vars have no concept of a user context. Flags evaluate against user attributes, segment membership, or a deterministic hash of the user ID.&lt;/p&gt;
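&lt;p&gt;The deterministic hash is worth seeing concretely. A sketch of percentage bucketing; the hashing scheme is illustrative, and real SDKs differ in details while following the same shape:&lt;/p&gt;

```python
import hashlib

def in_rollout(flag_key, user_id, pct):
    """Deterministically place a user in or out of a pct% rollout."""
    # Hash flag key together with user ID so each flag buckets users independently.
    digest = hashlib.sha256(f"{flag_key}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) % 100   # stable bucket in 0-99 per (flag, user)
    return pct > bucket                  # pct=5 admits buckets 0 through 4

# The same user gets the same answer on every request, with no stored state:
print(in_rollout("checkout-v2", "user-42", 5))
print(in_rollout("checkout-v2", "user-42", 100))  # True for everyone
```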
&lt;h3&gt;5. Paywall and plan-based variations&lt;/h3&gt;
&lt;p&gt;Showing premium features to paid users, gating beta features behind an opt-in, or launching a new UI only for enterprise accounts: these are targeting decisions made at evaluation time, per-request. Flags model this directly. A build-time config cannot.&lt;/p&gt;
&lt;h3&gt;6. Dark launches&lt;/h3&gt;
&lt;p&gt;You want to call the new code path in production, observe its behavior, and collect metrics, but not yet show its output to users. Wrap it in a flag that&apos;s off for everyone, deploy, then turn it on for internal users only. This is impossible to express cleanly as an env var.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The common thread:&lt;/strong&gt; flags are for &lt;em&gt;what a specific request or user experiences&lt;/em&gt;, and for control you need to exert without touching deployment.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;Side-by-side comparison&lt;/h2&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Environment variable&lt;/th&gt;
&lt;th&gt;Feature flag&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Changes at runtime&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;No, requires process restart or redeploy&lt;/td&gt;
&lt;td&gt;Yes, evaluated on every request&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Per-user targeting&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;No, process-wide value&lt;/td&gt;
&lt;td&gt;Yes, evaluates against user context&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Audit log&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;No built-in history&lt;/td&gt;
&lt;td&gt;Yes, changes tracked with timestamp and actor&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Restart required&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yes, set at process startup&lt;/td&gt;
&lt;td&gt;No, changes propagate to running instances&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Granularity&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Environment (dev / staging / prod)&lt;/td&gt;
&lt;td&gt;Per-user, per-segment, percentage-based&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Typical lifetime&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Indefinite (rotated when credentials change)&lt;/td&gt;
&lt;td&gt;Weeks to months (should be cleaned up after rollout)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The table understates one dimension: &lt;strong&gt;change-propagation latency&lt;/strong&gt;. The order-of-magnitude gap between deploying a new env var and flipping a flag is what makes flags suitable for kill switches and incident response.&lt;/p&gt;
&lt;figure&gt;
  &lt;p&gt;&lt;em&gt;[Bar chart, log scale: &quot;Change propagation: how fast does a switch take effect?&quot; Code change + deploy: 10–60 min. Env var (rolling deploy): 1–10 min. Feature flag flip: 1–5 sec.]&lt;/em&gt;&lt;/p&gt;
  &lt;figcaption&gt;Log scale. Flag SDKs propagate via streaming or short polling intervals; env var changes need a process restart (typically a rolling deploy); pure code changes also pay a CI build. That 100×–1000× gap is why kill switches and gradual rollouts belong in a flag system.&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;hr /&gt;
&lt;h2&gt;Decision flowchart: which one do I reach for?&lt;/h2&gt;
&lt;figure&gt;
  &lt;p&gt;&lt;em&gt;[Decision flowchart: is it a secret, credential, or static per deployment target? Yes → env var. No: per-user targeting, runtime change, or temporary rollout toggle? Yes → flag. No: could two users in the same instance ever need different values? Yes → flag; no → env var.]&lt;/em&gt;&lt;/p&gt;
  &lt;figcaption&gt;If you&apos;d ever want two users in the same running instance to see different values, it&apos;s a flag. Otherwise it&apos;s an env var.&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;If you prefer it as text:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Secret, credential, or static per deployment&lt;/strong&gt; (API keys, &lt;code&gt;DATABASE_URL&lt;/code&gt;, &lt;code&gt;LOG_LEVEL&lt;/code&gt;) → environment variable.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Runtime change, per-user targeting, or temporary rollout&lt;/strong&gt; (kill switches, A/B tests, beta gates) → feature flag.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Still ambiguous?&lt;/strong&gt; Ask whether two users hitting the same running instance could ever need different values. If yes → flag. If no → env var.&lt;/li&gt;
&lt;/ul&gt;
&lt;hr /&gt;
&lt;h2&gt;Common pitfalls&lt;/h2&gt;
&lt;h3&gt;Using env vars as poor-man&apos;s feature flags&lt;/h3&gt;
&lt;p&gt;The most frequent mistake: an engineer creates &lt;code&gt;ENABLE_NEW_CHECKOUT=true&lt;/code&gt; in the environment to toggle a feature. This works until you need to roll it out to 10% of users, or turn it off at 2 AM without waking up the DevOps rotation. At that point the team discovers they&apos;ve built a deployment-gated toggle instead of a runtime one: changing it requires a redeploy and process restart, not a flag flip, and migrating to a real flag while the feature is already live in production is uncomfortable. The kill-switch case has real-world stakes: the &lt;a href=&quot;/blog/feature-flag-cleanup/&quot;&gt;Knight Capital incident&lt;/a&gt; ($460M lost in 45 minutes when a deprecated flag&apos;s bit was reused) is the textbook example of a toggle that should have been a runtime kill switch backed by a cleanup workflow.&lt;/p&gt;
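&lt;p&gt;The anti-pattern in miniature (hypothetical names): the value is read once at process start and is identical for every request the process ever serves, so a 10% rollout or a 2 AM shutoff both mean a redeploy.&lt;/p&gt;

```typescript
// Read once at startup; frozen for the life of the process.
const ENABLE_NEW_CHECKOUT = process.env.ENABLE_NEW_CHECKOUT === "true";

function checkoutFlow(userId: string): string {
  // userId is ignored: an env var has no user context to target with,
  // so every caller gets the same answer until the next deploy.
  return ENABLE_NEW_CHECKOUT ? "checkout-v2" : "checkout-v1";
}
```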
&lt;h3&gt;Using flags for things that never change&lt;/h3&gt;
&lt;p&gt;The inverse mistake is routing static infrastructure config through a flag system. &lt;code&gt;POSTGRES_MAX_CONNECTIONS&lt;/code&gt; or &lt;code&gt;REDIS_CLUSTER_HOST&lt;/code&gt; do not need audit logs or gradual rollout semantics. Adding them to a flag system increases the surface area for misconfiguration and creates a dependency on the flag service during startup, which is exactly when you want the fewest external dependencies.&lt;/p&gt;
&lt;h3&gt;Stale flags rotting in the codebase&lt;/h3&gt;
&lt;p&gt;Feature flags are temporary by design, but in practice they rot. Industry data puts the share of flags that never get removed at 73%, with the average enterprise application carrying 200+ active flags and 60% stale beyond 90 days (&lt;a href=&quot;https://flagshark.com/blog/feature-flag-graveyard-73-percent-never-removed/&quot;&gt;FlagShark&lt;/a&gt;, 2025). A flag for a rollout that completed eight months ago is dead code wrapped in an &lt;code&gt;if&lt;/code&gt; statement, and nobody is sure whether it&apos;s safe to remove. Build retirement into your workflow: when a flag hits 100% rollout and metrics are stable, schedule the cleanup. The &lt;a href=&quot;/blog/feature-flag-cleanup/&quot;&gt;four-step cleanup playbook&lt;/a&gt; (detect, triage, remove, prevent) covers the specific procedure.&lt;/p&gt;
&lt;h3&gt;Leaking configuration concerns across layers&lt;/h3&gt;
&lt;p&gt;Checking &lt;code&gt;process.env.ENABLE_NEW_CHECKOUT&lt;/code&gt; deep in a service module creates an implicit coupling between deployment config and business logic. Flag evaluation, by contrast, passes a user context explicitly. This makes the behavior testable (inject a flag client that returns the value you want), auditable (the flag system records who changed it), and decoupled from deployment.&lt;/p&gt;
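&lt;p&gt;Injection is what makes the branch testable. A sketch, with an illustrative client interface rather than the real SDK type:&lt;/p&gt;

```typescript
// Illustrative flag-client interface (an assumption for this sketch).
type BoolFlags = {
  boolVariation(key: string, ctx: { user_id: string }, fallback: boolean): boolean;
};

// Business logic takes the client as a parameter instead of reaching
// into process.env, so tests can force either branch.
function checkoutLabel(flags: BoolFlags, userId: string): string {
  return flags.boolVariation("checkout-v2", { user_id: userId }, false)
    ? "Express checkout"
    : "Checkout";
}

// A stub client for tests: fixed values, no network, falls back to the
// code default for unknown keys.
function stubFlags(values: Record<string, boolean>): BoolFlags {
  return {
    boolVariation: (key, _ctx, fallback) => values[key] ?? fallback,
  };
}
```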
&lt;hr /&gt;
&lt;h2&gt;What flag evaluation looks like in production code&lt;/h2&gt;
&lt;p&gt;Here&apos;s a concrete example using Featureflip&apos;s Node.js SDK:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-ts&quot;&gt;import { FeatureflipClient } from &quot;@featureflip/node-sdk&quot;;

const client = await FeatureflipClient.create({
  sdkKey: process.env.FEATUREFLIP_SDK_KEY!,
});

const currentUser = { id: &quot;user-123&quot; }; // from your auth layer

const checkoutV2Enabled = client.boolVariation(
  &quot;checkout-v2&quot;,
  { user_id: currentUser.id },
  false, // default if evaluation fails or flag is missing
);

if (checkoutV2Enabled) {
  // new flow
} else {
  // old flow
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The SDK key itself is an env var: it&apos;s a credential scoped to an environment. The flag evaluation is runtime, per-user, and falls back gracefully to &lt;code&gt;false&lt;/code&gt; if the flag service is unreachable. The SDK key never changes between requests; the flag result does. For more on how Featureflip models targeting, segments, and rollout percentages, see the &lt;a href=&quot;/docs/concepts/rollout-strategies/&quot;&gt;rollout strategies&lt;/a&gt; and &lt;a href=&quot;/docs/concepts/environments/&quot;&gt;environments&lt;/a&gt; docs.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;Frequently asked questions&lt;/h2&gt;
&lt;h3&gt;Can I use environment variables for A/B testing?&lt;/h3&gt;
&lt;p&gt;Not effectively. A/B testing requires serving different variants to different users within the same running process, based on a stable hash of the user ID or a targeting rule. Environment variables are process-wide (every request sees the same value), so you can&apos;t split traffic without running multiple deployments. Use a multivariate feature flag instead.&lt;/p&gt;
&lt;h3&gt;How long should a feature flag live in the codebase?&lt;/h3&gt;
&lt;p&gt;Most rollout flags should be retired within weeks of hitting 100%. Permanent kill switches and entitlement flags (paywall, plan tier) live indefinitely by design. The trap is rollout flags that quietly become permanent: 73% of flags are never removed in surveys of mature flag installations (&lt;a href=&quot;https://flagshark.com/blog/feature-flag-graveyard-73-percent-never-removed/&quot;&gt;FlagShark&lt;/a&gt;, 2025). Treat retirement as part of the rollout, not an afterthought.&lt;/p&gt;
&lt;h3&gt;Are feature flags a security risk?&lt;/h3&gt;
&lt;p&gt;They can be, if misused. Putting secrets in a flag system is the obvious mistake: flag values are typically cached on the client and visible to anyone who inspects the SDK payload. Use environment variables and a secrets manager for credentials. Flags are safe for behavior toggles and targeting rules, which don&apos;t expose sensitive data even if the flag config leaks.&lt;/p&gt;
&lt;h3&gt;What happens if the feature flag service is down?&lt;/h3&gt;
&lt;p&gt;Every reputable SDK is built to degrade safely: evaluations fall back to the default value passed in code, and the SDK retries the connection in the background. That&apos;s why the third argument to &lt;code&gt;boolVariation(...)&lt;/code&gt; is &lt;code&gt;false&lt;/code&gt; in the example above. It&apos;s the value served if the flag service can&apos;t be reached, so your app stays up and the new code path simply doesn&apos;t activate.&lt;/p&gt;
&lt;h3&gt;Should I version-control my feature flag definitions?&lt;/h3&gt;
&lt;p&gt;Most teams keep flag &lt;em&gt;definitions&lt;/em&gt; (key, variations, targeting rules) out of source control and manage them through the flag service&apos;s UI or API, because the whole point is changing them without a deploy. What does belong in version control: the flag &lt;em&gt;key&lt;/em&gt; used in code, the fallback default passed at the call site, and a comment explaining what the flag gates and when it&apos;s expected to be retired.&lt;/p&gt;
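&lt;p&gt;One way to keep the in-code side reviewable is a single constants module per service; the layout below is a suggestion, not a Featureflip convention:&lt;/p&gt;

```typescript
// flags.ts (hypothetical): what lives in source control is the key, the
// code-side fallback, and a retirement note. Targeting rules live in the
// flag service.
const FLAGS = {
  // Gates the reworked checkout flow. Rollout flag: retire once stable at 100%.
  checkoutV2: { key: "checkout-v2", fallback: false },
  // Permanent kill switch for the recommendations integration.
  recsEnabled: { key: "recs-enabled", fallback: true },
} as const;
```

A reviewer can now see every flag the service references, and a cleanup pass starts from this file instead of a codebase-wide grep.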
&lt;hr /&gt;
&lt;h2&gt;The one-line rule&lt;/h2&gt;
&lt;p&gt;Configuration that belongs to the &lt;strong&gt;process&lt;/strong&gt; goes in env vars. Configuration that belongs to the &lt;strong&gt;request or user&lt;/strong&gt; goes in flags. Env vars answer &quot;where does this service connect?&quot;; flags answer &quot;what does this user see?&quot;&lt;/p&gt;
&lt;p&gt;If you remember one heuristic from this post: ask whether you&apos;d ever want this value to differ between two users hitting the same running instance. If yes, it&apos;s a flag. If no, it&apos;s an env var.&lt;/p&gt;
&lt;p&gt;For a deeper reference on how Featureflip models flags, targeting rules, and evaluation context, see &lt;a href=&quot;/docs/concepts/feature-flags/&quot;&gt;Feature flags overview&lt;/a&gt;.&lt;/p&gt;
</content:encoded><category>config</category><category>fundamentals</category></item></channel></rss>