Wednesday, November 19, 2025

Feature Flag orchestration guide

If I am the maker of a website or app and want certain features killed in a specific build that will be deployed to several million devices, can you explain at the code level how specific features can be enabled or disabled on a per-device basis, and how all of this can be orchestrated?

Nice — this is exactly the sort of thing you want nailed before a multi-million device rollout. Below I’ll give a practical, code-level guide you can implement: concepts, architecture, safety knobs, and concrete code snippets (server and client) that show how to enable/disable features on a per-device basis and how to orchestrate rollouts.

1) Core idea — feature flags / remote config

  • Keep feature control data (flags, rules, percentages, targeting) separate from app binary.

  • Apps ask a centralized service (or a CDN cache) for the flag state for a given principal (device id / user id / session).

  • Evaluate flags server-side when possible (most secure) or client-side for responsiveness (with signed config + local cache).

  • Provide kill switches (global off), percentage rollouts (gradual), targeted rollouts (device model, country), and experiment buckets (A/B).

2) Basic components

  1. Flag store / control plane — where product/ops define flags and targeting rules (DB + UI).

  2. Evaluation service / SDK — evaluates which variant a given device should get. Can be:

    • Server-side evaluation API (recommended for sensitive control).

    • Lightweight client SDK with local cache + occasional sync.

  3. Delivery — CDN + TLS + signing for cached configs; use shorter TTLs for rapid kill.

  4. Telemetry / metrics — track flag exposures and success/failure metrics.

  5. Safety — global kill switch, audit logs, incremental rollout, circuit-breakers.

3) Flag model (simple)

Each flag entry:

{
  "key": "new_payment_flow",
  "default": false,
  "rules": [
    { "type": "user_id_allow", "ids": ["123","456"] },          // explicit allow
    { "type": "device_model", "values": ["Pixel5","iPhone12"] },
    { "type": "country", "values": ["IN","PK"] },
    { "type": "percentage", "percent": 20, "salt": "newpay-v1" } // 20% rollout
  ],
  "created_by": "pm@company.com",
  "created_at": "2025-11-01T10:00:00Z",
  "kill_switch": false
}

4) Deterministic bucketing (important)

To do percentage rollouts that are sticky per device, compute a deterministic hash of (salt + device_id) and map to 0–99. Devices with value < percent are in the cohort. Example function (JS):

// simple stable bucket: returns 0..99
function stableBucket(deviceId, salt = "") {
  // djb2-like hash computed character-by-character (deterministic)
  let h = 5381;
  const s = salt + "|" + deviceId;
  for (let i = 0; i < s.length; i++) {
    h = ((h << 5) + h) + s.charCodeAt(i); // h * 33 + c
    h &= 0xffffffff; // keep 32-bit
  }
  // convert to unsigned 32-bit, then map to 0..99
  return (h >>> 0) % 100;
}

// usage:
if (stableBucket("device-abcdef", "newpay-v1") < 20) {
  // in 20% rollout
}

5) Example: server-side evaluation service (Node + Redis)

A small HTTP endpoint that evaluates flags for a device; caches rules in Redis for speed.

// server/evaluator.js (Node/Express, node-redis v4)
const express = require('express');
const { createClient } = require('redis');
const app = express();
app.use(express.json()); // body-parser is built into Express 4.16+

const client = createClient();
client.connect().catch(console.error); // node-redis v4 requires an explicit connect

function stableBucket(deviceId, salt = "") {
  let h = 5381;
  const s = salt + "|" + deviceId;
  for (let i = 0; i < s.length; i++) {
    h = ((h << 5) + h) + s.charCodeAt(i);
    h &= 0xffffffff;
  }
  return (h >>> 0) % 100;
}

// load flags from Redis (or DB) -- assume cached JSON at "flags:all"
async function getFlags() {
  const data = await client.get('flags:all');
  return data ? JSON.parse(data) : {};
}

app.post('/evaluate', async (req, res) => {
  try {
    const { device_id, user_id, country, device_model } = req.body;
    const flags = await getFlags();
    const result = {};
    for (const f of Object.values(flags)) {
      if (f.kill_switch) { result[f.key] = false; continue; }
      let on = f.default;
      // allow rules short-circuit to true; explicit deny rules could be added the same way
      for (const rule of f.rules || []) {
        if (rule.type === 'user_id_allow' && rule.ids?.includes(user_id)) { on = true; break; }
        if (rule.type === 'device_model' && rule.values?.includes(device_model)) { on = true; break; }
        if (rule.type === 'country' && rule.values?.includes(country)) { on = true; break; }
        if (rule.type === 'percentage' &&
            stableBucket(device_id, rule.salt || f.key) < (rule.percent || 0)) { on = true; break; }
      }
      result[f.key] = on;
    }
    res.json({ flags: result, server_time: new Date().toISOString() });
  } catch (err) {
    res.status(500).json({ error: 'flag evaluation failed' });
  }
});

app.listen(3000);

6) Example: client SDK (JS) with local cache & fallback

The client requests /evaluate on startup and caches the result. If offline, it falls back to the last cached flags plus safe defaults. Use a short TTL for critical flags.

// client/flagClient.js
async function fetchFlags(deviceInfo) {
  try {
    const r = await fetch('https://flags.example.com/evaluate', {
      method: 'POST',
      body: JSON.stringify(deviceInfo),
      headers: {'content-type':'application/json'}
    });
    if (!r.ok) throw new Error(`evaluate failed: ${r.status}`);
    const json = await r.json();
    localStorage.setItem('flags:cached', JSON.stringify({ts: Date.now(), payload: json}));
    return json.flags;
  } catch (e) {
    // offline or network failure: fall back to the last cached result
    const cached = JSON.parse(localStorage.getItem('flags:cached') || 'null');
    if (cached) return cached.payload.flags;
    // final fallback: no flags -> code paths use safe defaults
    return {};
  }
}

// usage in app
(async () => {
  const deviceInfo = { device_id: DEVICE_ID, user_id: USER_ID, country: 'IN', device_model: 'Pixel5' };
  const flags = await fetchFlags(deviceInfo);
  if (flags['new_payment_flow']) {
    startNewPaymentFlow();
  } else {
    startLegacyPayment();
  }
})();

Security note: if a flag unlocks a sensitive server path, the server must authorize (server-side) — don't rely purely on client flags.

7) Orchestration & rollout strategies

  • Gradual % rollout: start at 0 → 1% → 5% → 25% → 100%. Use deterministic bucketing so devices stay in same bucket across updates.

  • Canary by cohort: route a percentage of traffic or specific devices (internal QA devices) to the new feature.

  • Geo / device targeting: limit to certain countries or device models.

  • User segment: power users, paid users, etc.

  • Time-based rules: enable within a date/time window (see the sketch after this list).

  • Kill switch: global boolean that can be toggled to instantly disable feature everywhere. Put kill_switch evaluation before rules.
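
Two of these as minimal sketches (the rule shape and ramp plan are illustrative, not from any specific product):

// time-window rule: on only between start and end
function inTimeWindow(rule, now = Date.now()) {
  const start = Date.parse(rule.start); // e.g. "2025-11-20T00:00:00Z"
  const end = Date.parse(rule.end);
  return now >= start && now <= end;
}

// ramp plan: an operator (or automation) raises the flag's percentage
// rule to the next step once the hold period passes health checks
const rampPlan = [
  { percent: 1, holdMinutes: 60 },
  { percent: 5, holdMinutes: 120 },
  { percent: 25, holdMinutes: 240 },
  { percent: 100 }
];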

8) Telemetry and safety

  • Emit an exposure event whenever the client or server evaluates a flag: {timestamp, flag_key, device_id_hash, variant, context}. Use a hashed device id to preserve privacy (see the sketch after this list).

  • Track errors and KPIs (error rate, latency, crash rate) by flag exposure. Ramp back if errors rise.

  • Automated alerting based on metric thresholds.

  • Keep audit trail of who changed flags + when.
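
A minimal browser-side sketch of that exposure event (the telemetry endpoint is an assumption):

// emit an exposure event with a SHA-256-hashed device id
async function emitExposure(flagKey, variant, deviceId, context = {}) {
  const digest = await crypto.subtle.digest('SHA-256', new TextEncoder().encode(deviceId));
  const hash = [...new Uint8Array(digest)].map(b => b.toString(16).padStart(2, '0')).join('');
  const event = {
    timestamp: new Date().toISOString(),
    flag_key: flagKey,
    device_id_hash: hash,
    variant,
    context
  };
  // fire-and-forget: never block the feature path on telemetry
  fetch('https://telemetry.example.com/exposures', {
    method: 'POST',
    headers: { 'content-type': 'application/json' },
    body: JSON.stringify(event)
  }).catch(() => {});
}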

9) Performance & caching

  • Keep flag config small and cache on CDN/edge. TTL tradeoffs:

    • Short TTL (e.g., 30s–1min): quick kill but more load.

    • Long TTL (10m–1h): less load but slower response to kills.

  • Use push (WebSocket/FCM/APNs) to notify clients of critical flag changes (e.g., kill switch) so they fetch immediately.
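
A minimal browser-side sketch, assuming a hypothetical notification socket at wss://flags.example.com/notify and a {type: "flags_changed"} message shape:

// listen for a change ping and refetch flags immediately
const ws = new WebSocket('wss://flags.example.com/notify');
ws.onmessage = async (msg) => {
  const { type } = JSON.parse(msg.data);
  if (type === 'flags_changed') {
    await fetchFlags(deviceInfo); // fetchFlags/deviceInfo from section 6
  }
};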

10) Data schema changes & migrations

  • Use flags to gate new database schema changes (e.g., feature off uses old schema). Migrate progressively:

    1. Add optional columns.

    2. Run the backend writing both old and new data under a gate (see the dual-write sketch after this list).

    3. Flip flag for backends and clients to read new schema.

    4. Remove old code after full rollout.
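
A minimal sketch of step 2 in code, assuming hypothetical savePaymentOld/savePaymentNew data-access helpers and the gate key new_payment_schema_write:

// dual-write under a gate; the old schema remains the source of truth
async function savePayment(record, flags) {
  await savePaymentOld(record); // always write the old schema
  if (flags['new_payment_schema_write']) {
    await savePaymentNew(record); // additionally write the new schema while gated
  }
}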

11) Rollback & incident playbook

  • Have a single global kill switch that bypasses rules (immediately off).

  • Maintain a team runbook: how to flip kill switch (UI + API + direct DB update), who authorizes, communication plan.
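
A minimal flip-endpoint sketch, reusing app, getFlags, and the Redis client from section 5 (auth and audit logging omitted; production needs RBAC and an audit trail):

// control plane: flip a flag's kill switch via the API
app.post('/flags/:key/kill', async (req, res) => {
  const flags = await getFlags();
  const flag = flags[req.params.key];
  if (!flag) return res.status(404).json({ error: 'unknown flag' });
  flag.kill_switch = true;
  await client.set('flags:all', JSON.stringify(flags)); // persist back to Redis
  res.json({ key: flag.key, kill_switch: true });
});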

12) Example: signed client config (prevent tampering)

If you do client evaluation, sign the flag bundle on the server and verify it in the client. Use an asymmetric scheme (e.g., Ed25519): the server signs with its private key and the client verifies with an embedded public key. (A plain HMAC won't work here: verifying it would require shipping the shared secret in the client, which defeats the purpose.)

// server: produce signed config
// pseudo:
const payload = JSON.stringify({flags: {...}, exp: Date.now()+60000});
const signature = SIGN(serverPrivateKey, payload); // e.g. Ed25519
return { payload, signature };

// client:
if (VERIFY(serverPublicKey, resp.payload, resp.signature)) {
  useFlags(JSON.parse(resp.payload).flags);
} else {
  // invalid or tampered: ignore, use fallback
}

(Use proper crypto libs; never embed private keys in the client.)
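
A concrete sketch using Node's built-in crypto module (the flag payload is illustrative; key storage, distribution, and rotation are out of scope; ship only the public key with the client):

// sign-and-verify sketch with Ed25519 (Node >= 12)
const { generateKeyPairSync, sign, verify } = require('crypto');

// one-time key generation; keep privateKey server-side only
const { publicKey, privateKey } = generateKeyPairSync('ed25519');

// server: sign the bundle (Ed25519 one-shot sign takes null as the algorithm)
const payload = Buffer.from(JSON.stringify({ flags: { new_payment_flow: true }, exp: Date.now() + 60000 }));
const signature = sign(null, payload, privateKey);

// client (or a test): verify before trusting the bundle
if (verify(null, payload, publicKey, signature)) {
  const { flags } = JSON.parse(payload.toString());
  // safe to use flags
}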

13) Multi-platform notes

  • Mobile (iOS/Android): use SDK that caches flags, receives push notifications for immediate refresh, uses deterministic device id (or hashed advertising id) for bucketing. Keep TTL small for kill switches.

  • Web: fetch flags on boot, store in cookie/localStorage; consider server-side rendering to ensure search engines and first render obey server evaluation.

  • Backend: do sensitive checks server-side (e.g., enable payment route only if server says so).
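
A minimal sketch, assuming the rule loop from section 5 is factored into an evalFlag(flag, ctx) helper (like the Python eval_flag) and that req.user comes from your auth middleware:

// server-side enforcement: gate a sensitive route on the flag itself
function requireFlag(flagKey) {
  return async (req, res, next) => {
    const flags = await getFlags();
    const flag = flags[flagKey];
    const ctx = { device_id: req.get('x-device-id'), user_id: req.user?.id };
    if (flag && !flag.kill_switch && evalFlag(flag, ctx)) return next();
    res.status(403).json({ error: 'feature not enabled' }); // deny even if the client claims it's on
  };
}

app.post('/payments/new-flow', requireFlag('new_payment_flow'), (req, res) => {
  res.json({ ok: true }); // new payment flow handler
});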

14) Danger / anti-patterns to avoid

  • Shipping the logic that enables sensitive features purely in the client (easy to tamper with). Always enforce critical checks on the server.

  • Using non-deterministic random for rollouts (users get different experiences every request). Use deterministic hashing.

  • Long TTLs for critical flags (makes kill slow).

  • No telemetry or metrics — you must measure.

15) Example: quick Python evaluator (for integrations)

Note: this bucketer hashes with SHA-1 while the JS examples use djb2, so the same device would land in different buckets across the two. Pick one hash function and use it in every implementation so cohorts stay consistent.

# evaluator.py
import hashlib

FLAGS = {}  # load from DB/redis

def stable_bucket(device_id: str, salt: str = "") -> int:
    # first 8 hex chars of SHA-1 -> 32-bit int -> bucket 0..99
    s = (salt + "|" + device_id).encode('utf-8')
    h = int(hashlib.sha1(s).hexdigest()[:8], 16)
    return h % 100

def eval_flag(flag, ctx):
    if flag.get('kill_switch'): return False
    on = flag.get('default', False)
    for rule in flag.get('rules', []):
        t = rule['type']
        if t == 'user_id_allow' and ctx.get('user_id') in rule.get('ids', []):
            return True
        if t == 'device_model' and ctx.get('device_model') in rule.get('values', []):
            return True
        if t == 'country' and ctx.get('country') in rule.get('values', []):
            return True
        if t == 'percentage':
            # default the salt to the flag key, matching the JS evaluator
            if stable_bucket(ctx.get('device_id', ''), rule.get('salt', flag.get('key', ''))) < rule.get('percent', 0):
                return True
    return on

16) Orchestration & workflow (who does what)

  • Product configures flag intent and rollout plan.

  • SRE/Release sets kill switch and rollout gates in the control plane.

  • Engineers instrument exposure events and gate code paths.

  • Data/Analytics monitor KPI changes and recommend ramp up/down.

  • Legal/Privacy ensure device ids are hashed and PII handled correctly.

17) Checklist before rollout to millions

  • Flag control UI + API (audit logs + RBAC).

  • Deterministic bucketing implemented.

  • Server enforcement for sensitive paths.

  • Client caching + fallback behavior defined.

  • Telemetry for exposures + metrics instrumented.

  • Kill switch tested (flip and observe).

  • Test groups (QA, internal) pinned.

  • Rollout plan (percent steps + criteria to proceed).

  • Alerts for spike in errors/crashes.


If you want, I can:

  • Produce a small, ready-to-deploy reference repo layout (server + client) in your preferred language/framework.

  • Draft the rollout playbook and monitoring queries you should add to Prometheus/Grafana (or Datadog).

  • Give a short, copyable policy text for RACI/approval for flips and kills.

Which of those would be most useful next?
