A/B testing with feature flags
Serve two variations, keep every user in a sticky bucket, and log who saw what. The flag delivers the experiment; your metrics decide the winner.
An A/B test serves two versions of a feature to randomly assigned users and measures which one performs better on a metric you care about. A feature flag is the delivery mechanism: a 50/50 percentage split sends each user to variation A or B by a deterministic hash of their ID, and the same user stays in the same group for the life of the test. That stickiness is what keeps the comparison clean. The flag also records an impression for every evaluation, so you have a log of who saw which variation. In short, the flag owns assignment and exposure; you bring the metric and the analysis that calls the winner. For the bare definition, see what A/B testing is.
Why run an A/B test behind a flag?
Four reasons teams reach for a flag when they want to compare two versions in production.
1. Start and stop without a deploy
A test is a configuration, not a release. Set the split to 50/50 in the dashboard to begin, set it back to the control to end. The instrumented code ships once and sits dormant until you start the test on top of it, so the experiment never waits on a deploy window or a rollback.
2. Sticky assignment keeps groups clean
A user who flips between variation A and B on every page load pollutes both groups and ruins the result. The deterministic hash pins each user to one variation for the whole test, so the control and the treatment stay distinct populations you can actually compare.
3. The test becomes the rollout
Pick a winner and you rebuild nothing. Move the split to 100% on the winning variation and the A/B test turns into a progressive rollout of the version that won. One flag covers the test, the ramp, and the kill switch if that winner misbehaves later.
4. Decouple the experiment from the pipeline
Product and data folks can start, widen, or end a test from the dashboard with no code change and no deploy. Engineering ships the instrumented code once; the experiment's lifecycle lives in configuration from then on, where the people running it can reach it.
What an A/B test needs, and which half a flag covers
An A/B test has a delivery side and a measurement side. Feature flags own the delivery side completely. The measurement side stays with the metrics and analysis you already run.
| Ingredient | Where it comes from | What it covers |
|---|---|---|
| Random assignment | The feature flag | A percentage split sends each user to variation A or B by a deterministic hash of their ID, so the groups form at random rather than by who happened to log in first. |
| Sticky bucketing | The feature flag | The same user keeps the same variation for the life of the test. Without that stickiness a user who flips between A and B pollutes both groups and the result means nothing. |
| Exposure tracking | The feature flag | An impression event records which user saw which variation and when, giving you the log you join your outcomes against. |
| A hypothesis and a metric | You | The outcome you expect to move (signups, checkout rate, retained sessions) and the direction that counts as a win, both decided before the first user is bucketed. |
| The analysis | Your analytics stack | Joining exposure to outcomes and deciding whether the gap between the groups is a real effect or noise. This runs where your metrics already live. |
That division is the whole mental model. The flag guarantees a fair, sticky split and a record of who saw what. Whether variation B actually beat variation A is a question your analytics answer, against the outcome you decided mattered. A/B testing is the simplest form of experimentation, and the same split machinery powers richer designs as you grow into them.
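The "real effect or noise" question in the analysis row is often answered with a two-proportion z-test on conversion rates. What follows is a minimal sketch of that calculation as it might run in your analytics layer, not something Featureflip computes. The function names and the `{ exposed, conversions }` input shape are assumptions made for illustration.

```javascript
// Standard normal CDF using the Abramowitz-Stegun 7.1.26 erf approximation.
function normalCdf(z) {
  const x = Math.abs(z) / Math.SQRT2;
  const t = 1 / (1 + 0.3275911 * x);
  const poly =
    t * (0.254829592 +
    t * (-0.284496736 +
    t * (1.421413741 +
    t * (-1.453152027 +
    t * 1.061405429))));
  const erf = 1 - poly * Math.exp(-x * x);
  return z >= 0 ? 0.5 * (1 + erf) : 0.5 * (1 - erf);
}

// Two-sided z-test for the difference between two conversion rates.
// Each argument is { exposed, conversions } for one variation (hypothetical shape).
function twoProportionZTest(control, treatment) {
  const p1 = control.conversions / control.exposed;
  const p2 = treatment.conversions / treatment.exposed;
  const pooled =
    (control.conversions + treatment.conversions) /
    (control.exposed + treatment.exposed);
  const standardError = Math.sqrt(
    pooled * (1 - pooled) * (1 / control.exposed + 1 / treatment.exposed)
  );
  const z = (p2 - p1) / standardError;
  const pValue = 2 * (1 - normalCdf(Math.abs(z)));
  return { lift: p2 - p1, z, pValue };
}

const result = twoProportionZTest(
  { exposed: 5000, conversions: 500 }, // control converts at 10.0%
  { exposed: 5000, conversions: 560 }  // treatment converts at 11.2%
);
console.log(result);
```

With these made-up counts, z is about 1.95 and the two-sided p-value sits just above 0.05. A 12% relative lift can still fall short of the usual 5% threshold at this sample size.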
How to run an A/B test with a flag
Six steps. The first four are flag configuration; the last two are yours.
| Step | What you do | Detail |
|---|---|---|
| 1 | Define the variations | Create a flag with the versions you want to compare: a control that serves today's behaviour and one or more treatments. Two variations is the simplest test; the same split handles more. |
| 2 | Split the traffic | Put a percentage rollout on the fallthrough. A 50/50 split for a head-to-head test, or an even share across however many variations you are comparing. |
| 3 | Let bucketing hold the line | Each user is hashed to a stable bucket from their ID and the flag key, so they see the same variation on every visit for the whole test. Pass a consistent user identifier or the split cannot stay sticky. |
| 4 | Record exposure | Every evaluation emits an impression event: which user saw which variation, and when. The SDK buffers these and flushes them to the events API for you. |
| 5 | Measure against your hypothesis | Join the impression log to the outcome you picked up front and compare the groups in your analytics stack. The flag gave you a clean split; the metric tells you which variation won. |
| 6 | Ship the winner | Move the split to 100% on the variation that won and the test becomes a rollout. Archive the losing variation so the flag does not linger as flag debt once the decision is made. |
Steps two and three are the same percentage rollout mechanics a gradual release uses, held at a fixed split for measurement rather than ramped. You can also stack a targeting rule above the split to force internal staff onto one variation, or to hold a sensitive segment on the control while the rest of your users take part.
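To make steps two and three concrete, here is a minimal sketch of deterministic bucketing. It assumes a 32-bit FNV-1a hash and 10,000 buckets. Both are stand-ins chosen for illustration, not Featureflip's documented algorithm, and the function names are hypothetical.

```javascript
// FNV-1a over the string's UTF-16 code units. A stand-in hash for illustration only.
function fnv1a(str) {
  let hash = 0x811c9dc5;
  for (let i = 0; i < str.length; i++) {
    hash ^= str.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash;
}

// Map (flagKey, userId) to a stable bucket in [0, 10000).
// Including the flag key means a user's bucket differs from flag to flag.
function bucketFor(flagKey, userId) {
  return fnv1a(`${flagKey}:${userId}`) % 10000;
}

// Walk the cumulative percentages and return the variation whose range holds the bucket.
function assignVariation(flagKey, userId, split) {
  const bucket = bucketFor(flagKey, userId);
  let upperBound = 0;
  for (const { variation, percent } of split) {
    upperBound += percent * 100;
    if (bucket < upperBound) return variation;
  }
  // Guard for splits that do not sum to exactly 100.
  return split[split.length - 1].variation;
}

const split = [
  { variation: 'control', percent: 50 },
  { variation: 'treatment', percent: 50 },
];
const first = assignVariation('checkout-headline-test', 'user-123', split);
const second = assignVariation('checkout-headline-test', 'user-123', split);
console.log(first === second); // true: same inputs, same variation
```

No assignment is stored anywhere. Stickiness comes from recomputing the same hash from the same ID and flag key on every evaluation, which is why a consistent identifier matters.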
How Featureflip handles the delivery side
The assignment and exposure mechanics that make a test fair, sticky, and measurable, so the only open question is which variation won.
- Deterministic sticky bucketing. Featureflip hashes the user ID with the flag key into a stable bucket, so the same user always sees the same variation for that flag while the split is unchanged. Resizing a split does not reshuffle buckets; only users in the slice that changes hands see a different variation. See the rollout strategies docs for the algorithm.
- Two or more variations. A percentage split serves a control against one treatment for a classic A/B test, or against several treatments at once when you want to compare more than two versions in a single flag.
- Forced variations and holdbacks. Stack targeting rules above the split to pin internal staff to the treatment for dogfooding, or to keep a segment on the control while everyone else is in the test.
- Impression events out of the box. Each evaluation can emit an impression that records which user saw which variation. The SDKs buffer and flush these to the events API, giving you the exposure log your analysis joins against.
- Real-time start and stop. Change a split in the dashboard and every connected SDK picks it up over a Server-Sent Events stream within a second or two, fleet-wide, with no redeploy. Ending a test is just as immediate.
- One flag, test to rollout. When the result is in, move the split to 100% on the winner and the test becomes a progressive rollout. No new flag, no rebuild, no second wiring job.
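The forced variations and holdbacks above amount to "first matching rule wins, everyone else falls through to the split". A rough sketch of that evaluation order follows. The flag shape, the rule predicates, and the `fallthroughSplit` stand-in are all hypothetical and do not reflect Featureflip's configuration format.

```javascript
// Hypothetical flag shape: ordered targeting rules sitting above a fallthrough split.
const flag = {
  key: 'checkout-headline-test',
  rules: [
    // Dogfooding: pin internal staff to the treatment.
    { matches: (user) => (user.email || '').endsWith('@example.com'), variation: 'treatment' },
    // Holdback: keep a sensitive segment on the control.
    { matches: (user) => user.plan === 'enterprise', variation: 'control' },
  ],
};

// Stand-in for the percentage split. A real split would hash the user ID with the flag key.
function fallthroughSplit(user) {
  let sum = 0;
  for (const ch of user.id) sum += ch.codePointAt(0);
  return sum % 100 < 50 ? 'control' : 'treatment';
}

// First matching rule wins; users who match no rule fall through to the split.
function evaluateFlag(flag, user) {
  for (const rule of flag.rules) {
    if (rule.matches(user)) {
      return { variation: rule.variation, randomized: false };
    }
  }
  return { variation: fallthroughSplit(user), randomized: true };
}

console.log(evaluateFlag(flag, { id: 'u1', email: 'dev@example.com' }));
console.log(evaluateFlag(flag, { id: 'u2', plan: 'enterprise' }));
console.log(evaluateFlag(flag, { id: 'u3', plan: 'free' }));
```

The `randomized` marker is a design choice in this sketch. Users pinned by a rule were not randomly assigned, so it is usually worth excluding them when you compare the groups.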
What it looks like in your app
The application asks the SDK which variation this user is in and renders accordingly. The third argument is the fallback, returned if the SDK cannot reach Featureflip, so the control doubles as the safe default:
```javascript
// renderHeadline is an illustrative wrapper around the evaluation call.
function renderHeadline(client, user) {
  // Returns 'control' or 'treatment'. 'control' is the fallback,
  // served if the SDK can't reach Featureflip.
  const variation = client.evaluate('checkout-headline-test', user, 'control');
  if (variation === 'treatment') {
    return renderNewHeadline(user); // variation B
  }
  return renderCurrentHeadline(user); // variation A, the control
}
```
The split lives entirely in the dashboard. At a 50/50 setting, a stable half of identified users get treatment and the other half get control, and they hold those assignments for the run. The SDK records an impression each time, tying the user to the variation they saw. You join that exposure log to the conversion or retention metric you already track to decide the winner. The same surface works in every language the platform supports, from Python and Go to C#, Java, and Node. Pick a quickstart from the SDK overview.
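The join itself can be small. The sketch below assumes you have exported impressions and conversion events as arrays of plain objects. The field names (`userId`, `variation`, `timestamp`) are illustrative and may not match the real impression payload.

```javascript
// Hypothetical exposure log and outcome events.
const impressions = [
  { userId: 'u1', variation: 'control', timestamp: 1 },
  { userId: 'u1', variation: 'control', timestamp: 5 }, // repeat visit, same variation
  { userId: 'u2', variation: 'treatment', timestamp: 2 },
  { userId: 'u3', variation: 'treatment', timestamp: 3 },
  { userId: 'u4', variation: 'control', timestamp: 4 },
];
const conversions = [
  { userId: 'u2', timestamp: 9 },
  { userId: 'u4', timestamp: 8 },
  { userId: 'u5', timestamp: 7 }, // never exposed, so not counted
];

function summarize(impressions, conversions) {
  // Keep each user's first exposure so repeat impressions count once.
  const firstExposure = new Map();
  for (const imp of impressions) {
    const seen = firstExposure.get(imp.userId);
    if (!seen || imp.timestamp < seen.timestamp) firstExposure.set(imp.userId, imp);
  }

  // A conversion counts only if it happened at or after the user's first exposure.
  const converted = new Set();
  for (const conv of conversions) {
    const exposure = firstExposure.get(conv.userId);
    if (exposure && conv.timestamp >= exposure.timestamp) converted.add(conv.userId);
  }

  // Count exposed users and converted users per variation.
  const byVariation = {};
  for (const [userId, exposure] of firstExposure) {
    if (!byVariation[exposure.variation]) {
      byVariation[exposure.variation] = { exposed: 0, conversions: 0 };
    }
    byVariation[exposure.variation].exposed += 1;
    if (converted.has(userId)) byVariation[exposure.variation].conversions += 1;
  }
  return byVariation;
}

const summary = summarize(impressions, conversions);
console.log(summary);
```

Counting users rather than raw impressions keeps a heavy visitor from weighting their group, and ignoring conversions that predate exposure keeps the variation from taking credit for them.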
A/B test vs progressive rollout
They run on the same bucketing machinery but answer different questions. Knowing which one you are doing keeps you from reading a ramp like an experiment.
| Dimension | A/B test | Progressive rollout |
|---|---|---|
| Question it answers | Which variation is better? | Is this change safe to expand? |
| What you change | Hold a fixed split, often 50/50 | Raise the percentage in steps |
| How long you hold it | Until you have enough data to call it | Until it is at 100% and stable |
| What "done" looks like | A meaningful winner on your metric | The new path is the only path |
| Bucketing | Deterministic and sticky | Deterministic and sticky |
They compose well. Run an A/B test to find the variation that wins, then turn that winner into a progressive rollout to expand it safely. If the rolled-out winner ever turns sour, the same flag is its kill switch. One flag, three jobs across its life.
Common mistakes to avoid
The patterns that turn a test into a result you cannot trust. Most of them are about the analysis, not the flag.
Peeking and stopping early
Checking the numbers every hour and stopping the moment they look significant inflates your false-positive rate. A split that reads like a 10% win on day one is often just noise that has not averaged out yet. Decide the sample size or the run length before you start, and hold the test until you reach it.
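Deciding the sample size up front can be a one-line formula. This sketch uses the standard approximation for comparing two proportions, fixed at a two-sided 5% significance level and 80% power. The function name and the example rates are assumptions for illustration, and your analytics team may prefer a different method.

```javascript
// z-scores for a two-sided 5% significance level and 80% power.
const Z_ALPHA = 1.96;
const Z_BETA = 0.8416;

// Approximate users needed in EACH group to detect a move from
// baselineRate to expectedRate (both as fractions, e.g. 0.10 for 10%).
function sampleSizePerGroup(baselineRate, expectedRate) {
  const effect = expectedRate - baselineRate;
  if (effect === 0) {
    throw new Error('expectedRate must differ from baselineRate');
  }
  const variance =
    baselineRate * (1 - baselineRate) + expectedRate * (1 - expectedRate);
  return Math.ceil(((Z_ALPHA + Z_BETA) ** 2 * variance) / effect ** 2);
}

// Detecting a lift from a 10% to an 11% conversion rate.
const perGroup = sampleSizePerGroup(0.10, 0.11);
console.log(perGroup);
```

For that one-point lift the answer is roughly 14,750 users per group. Divide it by your daily exposed traffic to get a run length you can commit to before the test starts.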
Changing the split mid-test
Moving from 50/50 to 70/30 partway through re-weights the groups and muddies every metric measured across the change. Each user keeps the same hash bucket, but a 30% group cannot hold everyone who was in the 50% group, so the users in the slice that moved switch variation mid-test and the populations you are comparing no longer match. If you must adjust, treat it as a new test with a fresh start time, not a continuation of the old one.
Evaluating without a stable identifier
Percentage splits bucket on a hash of the user ID. Evaluate without a consistent identifier and the user drops to the fallback every time, so anonymous traffic and server jobs all land on the control and never really enter the test. Pass a stable ID, or scope the experiment to identified users only.
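One way to scope the experiment to identified users is a small guard in front of the evaluation call. This is a sketch that assumes users carry a stable string `id` field. That field name, the `variationForUser` helper, and the stub client are hypothetical; only the `client.evaluate(flagKey, user, fallback)` call shape comes from the example earlier on this page.

```javascript
// Evaluate the flag only for users with a stable ID.
// Everyone else gets the control and is marked as outside the test.
function variationForUser(client, flagKey, user) {
  const hasStableId = Boolean(
    user && typeof user.id === 'string' && user.id.length > 0
  );
  if (!hasStableId) {
    // The SDK is never called here, so no evaluation happens for this request.
    return { variation: 'control', inTest: false };
  }
  return { variation: client.evaluate(flagKey, user, 'control'), inTest: true };
}

// Stub client for illustration only; swap in your real SDK client.
const stubClient = {
  evaluate: (flagKey, user, fallback) => 'treatment',
};

console.log(variationForUser(stubClient, 'checkout-headline-test', { id: 'user-123' }));
console.log(variationForUser(stubClient, 'checkout-headline-test', {}));
```

Carrying the `inTest` marker into your outcome events lets the analysis drop anonymous traffic explicitly instead of letting it pile up in the control group.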
Running without a hypothesis
A test with no metric chosen up front and no expected direction turns into a fishing trip. You will find some difference somewhere, and it usually will not hold up when you look again. State the metric and what counts as a win before the first user is bucketed, so you are confirming a prediction rather than hunting for one.
A sample too small to conclude
Low-traffic features can run for weeks without reaching the volume needed to separate a real effect from noise. If you cannot gather enough exposures in a sensible window, a head-to-head test is the wrong instrument. Ship the more promising version behind a rollout and watch the broader release metrics instead.
When a flag-driven A/B test is the wrong tool
A flag makes the split easy, but not every decision is an A/B test. A few cases call for something else:
- There is not enough traffic to conclude. If a feature cannot gather enough exposures in a reasonable window, a head-to-head test will never separate signal from noise. Ship the stronger candidate behind a rollout and watch the broader metrics instead.
- Nothing measurable moves. An internal refactor or a copy fix that no metric responds to has nothing to compare. Ship it and move on; there is no experiment to run.
- The change cannot split per user. Infrastructure swaps, schema migrations, and atomic cutovers do not divide cleanly across users. Reach for a canary release or a staged migration, not an A/B test.
- The decision needs deeper experiment design. Sequential testing, multi-armed bandits, and variance reduction are the job of a dedicated experimentation platform. Featureflip still supplies the assignment and exposure underneath that analysis; the statistics run in your experimentation or analytics layer.
Frequently asked questions
- Can I A/B test with feature flags?
- Yes. A flag with two variations and a 50/50 percentage split is a working A/B test: it randomly assigns each user to a variation and holds them there with deterministic bucketing. Featureflip records an impression for every evaluation, so you get a log of who saw which variation. You find the winner by joining that exposure log to the conversion or retention metric you already track.
- What is the difference between an A/B test and a progressive rollout?
- Both split traffic with the same deterministic bucketing, but they answer different questions. An A/B test holds a fixed split, often 50/50, to learn which variation performs better. A progressive rollout raises the percentage in steps to expand one change safely toward 100%. Teams often test first to pick a winner, then roll that winner out.
- Does Featureflip calculate statistical significance?
- Featureflip handles the delivery side of a test: random assignment, sticky bucketing, and an impression log of who saw which variation through the events API. The significance calculation runs in your analytics or experimentation stack, where you join exposure to your outcome metric. That keeps the verdict tied to the metrics you already trust rather than a separate number to reconcile.
- How do users stay in the same group during a test?
- Featureflip hashes the user ID together with the flag key to assign a bucket. Because the hash is deterministic, the same user always lands in the same variation for that flag, on every visit and on every device that sends the same ID. That stickiness keeps the control and treatment groups distinct, which is what makes the comparison valid. Pass a stable user identifier on every evaluation for it to hold.
- Can I test more than two variations at once?
- Yes. A percentage rollout can split traffic across more than two variations, so you can compare a control against several treatments in a single flag. Each user is still bucketed deterministically to one variation. More variations need more total traffic to reach a conclusion on any one of them, so multi-variation tests suit higher-volume features.
- Do I need to redeploy to start or stop a test?
- No. Deploy the instrumented code once with the flag serving the control, then start the test by setting the split in the dashboard. The change reaches every connected SDK over a Server-Sent Events stream within a second or two. Stopping is the same operation in reverse: set the split back to the control and the test ends on the next evaluation, with no rebuild.
Put your next decision behind a flag
Free Solo plan covers 10 flags and 2 environments. No credit card, no demo call: create a variation and start the split.
Related
A/B testing (definition)
The glossary entry: the short definition and how a test differs from a percentage rollout.
Progressive rollouts
The same bucketing, ramped instead of held: expand the winning variation safely toward 100%.
Rollout strategies (docs)
Deterministic hashing, fixed serving, and how percentage splits are configured.
Experimentation
Where A/B testing fits in the wider practice of deciding product changes from data.