If I am the maker of a website or app and want certain features killed in a specific build that is going to be deployed to several million devices, can you explain at the code level how specific features can be enabled or disabled on a per-device basis and how all this can be orchestrated?
Nice — this is exactly the sort of thing you want nailed before a multi-million device rollout. Below I’ll give a practical, code-level guide you can implement: concepts, architecture, safety knobs, and concrete code snippets (server and client) that show how to enable/disable features on a per-device basis and how to orchestrate rollouts.
1) Core idea — feature flags / remote config
-
Keep feature control data (flags, rules, percentages, targeting) separate from app binary.
-
Apps ask a centralized service (or a CDN cache) for the flag state for a given principal (device id / user id / session).
-
Evaluate flags server-side when possible (most secure) or client-side for responsiveness (with signed config + local cache).
-
Provide kill switches (global off), percentage rollouts (gradual), targeted rollouts (device model, country), and experiment buckets (A/B).
2) Basic components
-
Flag store / control plane — where product/ops define flags and targeting rules (DB + UI).
-
Evaluation service / SDK — evaluates which variant a given device should get. Can be:
-
Server-side evaluation API (recommended for sensitive control).
-
Lightweight client SDK with local cache + occasional sync.
-
Delivery — CDN + TLS + signing for cached configs; use shorter TTLs for rapid kill.
-
Telemetry / metrics — track flag exposures and success/failure metrics.
-
Safety — global kill switch, audit logs, incremental rollout, circuit-breakers.
3) Flag model (simple)
Each flag entry (the // comments are annotations only; strip them if the store expects strict JSON):
{
  "key": "new_payment_flow",
  "default": false,
  "rules": [
    { "type": "user_id_allow", "ids": ["123","456"] }, // explicit allow
    { "type": "device_model", "values": ["Pixel5","iPhone12"] },
    { "type": "country", "values": ["IN","PK"] },
    { "type": "percentage", "percent": 20, "salt": "newpay-v1" } // 20% rollout
  ],
  "created_by": "pm@company.com",
  "created_at": "2025-11-01T10:00:00Z",
  "kill_switch": false
}
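Because a malformed flag can silently change behavior on millions of devices, it may be worth rejecting bad entries before they reach the store. The following is a minimal sketch of such a check for the shape above; `validateFlag` and `KNOWN_RULE_TYPES` are hypothetical names, and the exact constraints (for example, integer percentages) are assumptions rather than requirements of the model.

```javascript
// Hypothetical validator for the flag shape shown above.
const KNOWN_RULE_TYPES = new Set(['user_id_allow', 'device_model', 'country', 'percentage']);

// Returns a list of human-readable problems; an empty list means the flag looks valid.
function validateFlag(flag) {
  const errors = [];
  if (typeof flag.key !== 'string' || flag.key.length === 0) {
    errors.push('key must be a non-empty string');
  }
  if (typeof flag.default !== 'boolean') errors.push('default must be a boolean');
  if (typeof flag.kill_switch !== 'boolean') errors.push('kill_switch must be a boolean');

  for (const [i, rule] of (flag.rules || []).entries()) {
    if (!KNOWN_RULE_TYPES.has(rule.type)) {
      errors.push(`rules[${i}]: unknown type "${rule.type}"`);
      continue;
    }
    if (rule.type === 'percentage') {
      // Assumption: percentages are whole numbers in 0..100.
      if (!Number.isInteger(rule.percent) || rule.percent < 0 || rule.percent > 100) {
        errors.push(`rules[${i}]: percent must be an integer from 0 to 100`);
      }
    } else {
      const list = rule.type === 'user_id_allow' ? rule.ids : rule.values;
      if (!Array.isArray(list) || list.length === 0) {
        errors.push(`rules[${i}]: expected a non-empty list of values`);
      }
    }
  }
  return errors;
}

console.log(validateFlag({ key: 'new_payment_flow', default: false, kill_switch: false,
  rules: [{ type: 'percentage', percent: 150 }] }));
```

Running a check like this in the control plane's write path (and again in CI for flags kept in source control) keeps typos out of production config.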
4) Deterministic bucketing (important)
To do percentage rollouts that are sticky per device, compute a deterministic hash of (salt + device_id) and map it to 0 to 99. Devices whose value is less than percent are in the cohort. Every evaluator (server, client, other languages) must use the same hash, otherwise the same device lands in different buckets. Example function (JS):
// simple stable bucket: returns 0..99
function stableBucket(deviceId, salt = "") {
  // djb2-like hash computed character-by-character (deterministic)
  let h = 5381;
  const s = salt + "|" + deviceId;
  for (let i = 0; i < s.length; i++) {
    // h * 33 + c; ">>> 0" keeps the value an unsigned 32-bit integer
    // ("& 0xffffffff" would yield a signed value in JS)
    h = (((h << 5) + h) + s.charCodeAt(i)) >>> 0;
  }
  return h % 100;
}
// usage:
if (stableBucket("device-abcdef", "newpay-v1") < 20) {
  // in 20% rollout
}
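One property worth checking before relying on this: because a device's bucket never changes for a given salt, raising the percentage only adds devices and never removes anyone already in the cohort. The sketch below demonstrates that on synthetic device ids; the helper is repeated in an unsigned-arithmetic form so the block runs on its own, and the id format and cohort sizes are made up for illustration.

```javascript
// Stable bucket in 0..99 (djb2-style hash, kept unsigned with >>> 0).
function stableBucket(deviceId, salt = '') {
  let h = 5381;
  const s = salt + '|' + deviceId;
  for (let i = 0; i < s.length; i++) {
    h = (((h << 5) + h) + s.charCodeAt(i)) >>> 0;
  }
  return h % 100;
}

// Synthetic device ids, for illustration only.
const devices = Array.from({ length: 10000 }, (_, i) => `device-${i}`);

// Set of devices that fall inside a rollout of the given percentage.
const cohort = (percent) =>
  new Set(devices.filter((d) => stableBucket(d, 'newpay-v1') < percent));

const at5 = cohort(5);
const at20 = cohort(20);

// Every device enabled at 5% is still enabled at 20%.
const everyoneKept = [...at5].every((d) => at20.has(d));
console.log({ at5: at5.size, at20: at20.size, everyoneKept });
```

Changing the salt reshuffles every device, which is useful when a new experiment should not reuse the previous cohort, and harmful if done by accident mid-rollout.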
5) Example: server-side evaluation service (Node + Redis)
A small HTTP endpoint that evaluates flags for a device; rules are cached in Redis for speed.
// server/evaluator.js (Node/Express; assumes node-redis v4+, which exposes a promise API)
const express = require('express');
const redis = require('redis');

const app = express();
app.use(express.json()); // built-in JSON body parser (Express 4.16+)

const client = redis.createClient();
client.on('error', (err) => console.error('redis error', err));

// djb2-style hash of salt + device id, mapped to 0..99
function stableBucket(deviceId, salt = "") {
  let h = 5381;
  const s = salt + "|" + deviceId;
  for (let i = 0; i < s.length; i++) {
    h = (((h << 5) + h) + s.charCodeAt(i)) >>> 0; // h * 33 + c, kept unsigned 32-bit
  }
  return h % 100;
}

// load flags from Redis (or DB) -- assume cached JSON at "flags:all"
async function getFlags() {
  const data = await client.get('flags:all');
  return data ? JSON.parse(data) : {};
}

app.post('/evaluate', async (req, res) => {
  const { device_id, user_id, country, device_model } = req.body || {};
  let flags;
  try {
    flags = await getFlags();
  } catch (err) {
    // Express 4 does not catch errors thrown in async handlers;
    // answer 503 so clients fall back to their cached flags
    return res.status(503).json({ error: 'flag store unavailable' });
  }
  const result = {};
  for (const f of Object.values(flags)) {
    // kill switch wins over every rule
    if (f.kill_switch) { result[f.key] = false; continue; }
    let on = f.default;
    // rules are OR-ed and short-circuit: the first matching allow rule turns the flag on
    // (so a country match enables the flag regardless of the percentage rule);
    // explicit deny rules could be supported too
    for (const rule of f.rules || []) {
      if (rule.type === 'user_id_allow' && rule.ids?.includes(user_id)) { on = true; break; }
      if (rule.type === 'device_model' && rule.values?.includes(device_model)) { on = true; break; }
      if (rule.type === 'country' && rule.values?.includes(country)) { on = true; break; }
      // skip percentage rules when no device id was sent, rather than bucketing "undefined"
      if (rule.type === 'percentage' && device_id &&
          stableBucket(device_id, rule.salt || f.key) < (rule.percent || 0)) { on = true; break; }
    }
    result[f.key] = on;
  }
  res.json({ flags: result, server_time: new Date().toISOString() });
});

// connect to Redis before accepting traffic
client.connect().then(() => app.listen(3000));
6) Example: client SDK (JS) with local cache & fallback
Client requests /evaluate on startup and caches result. If offline, use last cached flags + safe defaults. Use short TTL for critical flags.
// client/flagClient.js
async function fetchFlags(deviceInfo) {
  try {
    const r = await fetch('https://flags.example.com/evaluate', {
      method: 'POST',
      body: JSON.stringify(deviceInfo),
      headers: {'content-type':'application/json'}
    });
    // fetch() does not reject on HTTP errors, so check the status explicitly
    if (!r.ok) throw new Error(`flag service returned ${r.status}`);
    const json = await r.json();
    localStorage.setItem('flags:cached', JSON.stringify({ts:Date.now(), payload: json}));
    return json.flags;
  } catch (e) {
    // offline, network failure, or bad response: use cache
    try {
      const cached = JSON.parse(localStorage.getItem('flags:cached') || 'null');
      if (cached) return cached.payload.flags;
    } catch (_) {
      // corrupt cache entry: fall through to the final fallback
    }
    // final fallback: empty (every flag reads as off)
    return {};
  }
}
// usage in app
(async () => {
  const deviceInfo = { device_id: DEVICE_ID, user_id: USER_ID, country: 'IN', device_model: 'Pixel5' };
  const flags = await fetchFlags(deviceInfo);
  if (flags['new_payment_flow']) {
    startNewPaymentFlow();
  } else {
    startLegacyPayment();
  }
})();
Security note: if a flag unlocks a sensitive server path, the server must authorize (server-side) — don't rely purely on client flags.
7) Orchestration & rollout strategies
-
Gradual % rollout: start at 0 → 1% → 5% → 25% → 100%. Use deterministic bucketing so devices stay in same bucket across updates.
-
Canary by cohort: route a percentage of traffic or specific devices (internal QA devices) to the new feature.
-
Geo / device targeting: limit to certain countries or device models.
-
User segment: power users, paid users, etc.
-
Time-based rules: enable on a date/time window.
-
Kill switch: global boolean that can be toggled to instantly disable feature everywhere. Put kill_switch evaluation before rules.
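The flag model earlier does not define a time-based rule, so here is one way it could look. This is a sketch under assumptions: the `time_window` rule type and its `start`/`end` fields (ISO 8601 strings, end exclusive) are hypothetical additions, and the clock is passed in so the check can be tested and so server time, not device time, can drive it.

```javascript
// Hypothetical rule shape: { type: 'time_window', start: '<ISO time>', end: '<ISO time>' }
// A missing start or end means the window is open on that side.
function inTimeWindow(rule, now = new Date()) {
  const start = rule.start ? Date.parse(rule.start) : -Infinity;
  const end = rule.end ? Date.parse(rule.end) : Infinity;
  // Fail closed: an unparseable date keeps the feature off.
  if (Number.isNaN(start) || Number.isNaN(end)) return false;
  const t = now.getTime();
  return t >= start && t < end; // start inclusive, end exclusive
}

const launch = { type: 'time_window', start: '2025-11-01T10:00:00Z', end: '2025-11-02T10:00:00Z' };
console.log(inTimeWindow(launch, new Date('2025-11-01T12:00:00Z')));
console.log(inTimeWindow(launch, new Date('2025-11-03T00:00:00Z')));
```

Evaluating the window on the server avoids depending on device clocks, which users can change and which are often wrong.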
8) Telemetry and safety
-
Emit an exposure event whenever the client or server evaluates a flag:
{timestamp, flag_key, device_id_hash, variant, context}. Use hashed device id to preserve privacy.
-
Track errors and KPIs (error rate, latency, crash rate) by flag exposure. Ramp back if errors rise.
-
Automated alerting based on metric thresholds.
-
Keep audit trail of who changed flags + when.
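As a rough illustration of the exposure event described above, the sketch below builds one with a keyed hash of the device id using Node's built-in crypto module. The function names and the `pepper` parameter (a server-held secret, so the hash cannot be reversed by simply hashing known ids) are assumptions for this example, not part of any particular SDK.

```javascript
const crypto = require('crypto');

// Keyed hash (HMAC-SHA256) of the device id; the raw id never leaves this function.
function hashDeviceId(deviceId, pepper) {
  return crypto.createHmac('sha256', pepper).update(deviceId).digest('hex');
}

// Builds the {timestamp, flag_key, device_id_hash, variant, context} event.
function buildExposureEvent({ flagKey, deviceId, variant, context = {} }, pepper, now = new Date()) {
  return {
    timestamp: now.toISOString(),
    flag_key: flagKey,
    device_id_hash: hashDeviceId(deviceId, pepper),
    variant,
    context,
  };
}

const event = buildExposureEvent(
  { flagKey: 'new_payment_flow', deviceId: 'device-abcdef', variant: 'on', context: { country: 'IN' } },
  'example-pepper-do-not-reuse'
);
console.log(event);
```

Joining these events against error and crash data by `flag_key` and `variant` is what makes the "ramp back if errors rise" decision measurable.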
9) Performance & caching
-
Keep flag config small and cache on CDN/edge. TTL tradeoffs:
-
Short TTL (e.g., 30s–1min): quick kill but more load.
-
Long TTL (10m–1h): less load but slower response to kills.
-
Use push (WebSocket/FCM/APNs) to notify clients of critical flag changes (e.g., kill switch) so they fetch immediately.
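The TTL tradeoff can also be applied per flag on the client: trust cached values for ordinary flags longer than for critical ones. A minimal sketch follows, assuming a cache entry of the form `{ ts, flags }` and a caller-supplied set of critical flag keys; the names and the TTL values are illustrative, not recommendations.

```javascript
// Illustrative TTLs in milliseconds: 30 s for critical flags, 10 min for the rest.
const TTL_MS = { critical: 30000, normal: 600000 };

// Starts from safe defaults and overlays only the cached values that are still fresh.
function resolveFromCache(cached, safeDefaults, criticalKeys, now = Date.now()) {
  const resolved = { ...safeDefaults };
  if (!cached) return resolved;
  const age = now - cached.ts;
  for (const [key, value] of Object.entries(cached.flags)) {
    const ttl = criticalKeys.has(key) ? TTL_MS.critical : TTL_MS.normal;
    if (age <= ttl) resolved[key] = value; // fresh enough to trust
  }
  return resolved;
}

const cachedEntry = { ts: 0, flags: { new_payment_flow: true, new_banner: true } };
const defaults = { new_payment_flow: false, new_banner: false };
// One minute after caching: the critical flag has expired, the ordinary one has not.
console.log(resolveFromCache(cachedEntry, defaults, new Set(['new_payment_flow']), 60000));
```

With this shape, a device that has been offline past the critical TTL falls back to the safe default for risky features while keeping the rest of its cached experience.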
10) Data schema changes & migrations
-
Use flags to gate new database schema changes (e.g., feature off uses old schema). Migrate progressively:
-
Add optional columns.
-
Run backend to write both old+new data under a gate.
-
Flip flag for backends and clients to read new schema.
-
Remove old code after full rollout.
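The dual-write and read-flip steps above can be sketched with two independent flags, one gating writes to the new schema and one gating reads from it. Everything here is hypothetical: the flag names `orders_dual_write` and `orders_read_new`, the in-memory stores standing in for real tables, and the example schema change.

```javascript
// In-memory stand-in for a table; a real store would be a database.
function makeStore() {
  const rows = new Map();
  return { write: (row) => rows.set(row.id, row), read: (id) => rows.get(id) };
}

// Example schema change: flat amount/currency fields become a nested "total" object.
function toNewSchema(order) {
  return { id: order.id, total: { amount: order.amount, currency: order.currency } };
}

// Step 2: always write the old shape; also write the new shape when the gate is on.
function saveOrder(order, flags, stores) {
  stores.legacy.write(order);
  if (flags.orders_dual_write) stores.next.write(toNewSchema(order));
}

// Step 3: flip reads to the new schema with a separate flag.
function loadOrder(id, flags, stores) {
  return flags.orders_read_new ? stores.next.read(id) : stores.legacy.read(id);
}

const stores = { legacy: makeStore(), next: makeStore() };
saveOrder({ id: 'o1', amount: 499, currency: 'INR' }, { orders_dual_write: true }, stores);
console.log(loadOrder('o1', { orders_read_new: true }, stores));
```

Keeping the two flags separate lets the read flip be rolled back on its own while dual writes continue, so the new store stays complete during an incident.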
11) Rollback & incident playbook
-
Have a single global kill switch that bypasses rules (immediately off).
-
Maintain a team runbook: how to flip kill switch (UI + API + direct DB update), who authorizes, communication plan.
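Part of the playbook can be automated: compare the error rate of devices exposed to the flag with the rate for everyone else, and trip the kill switch when the gap is large. The following is a sketch under stated assumptions; `shouldKill` and its thresholds (minimum sample size, ratio, absolute floor) are hypothetical and would need tuning against real traffic.

```javascript
// Decide whether to trip the kill switch, given request and error counts
// for the exposed cohort and the control cohort over the same time window.
function shouldKill({ exposed, control }, { minSamples = 1000, maxRatio = 2, minAbsolute = 0.01 } = {}) {
  // Too little data on either side: do not act on noise.
  if (exposed.requests < minSamples || control.requests < minSamples) return false;
  const exposedRate = exposed.errors / exposed.requests;
  const controlRate = control.errors / control.requests;
  // Require both a meaningful absolute error rate and a clear gap over control.
  return exposedRate >= minAbsolute && exposedRate > controlRate * maxRatio;
}

const decision = shouldKill({
  exposed: { requests: 5000, errors: 250 }, // 5% errors with the flag on
  control: { requests: 5000, errors: 50 },  // 1% errors with the flag off
});
console.log(decision);
```

A guard like this works best as a trigger for paging plus an automatic flip to off, with a human deciding when to turn the feature back on.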
12) Example: signed client config (prevent tampering)
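If you do client evaluation, sign the flag bundle with the server's private key and verify it in the client with the matching public key. This needs an asymmetric signature scheme (e.g. Ed25519 or ECDSA); an HMAC does not fit here, because verifying it requires the same secret that signs, and that secret cannot safely ship in the client.
// server: produce signed config
// pseudo:
const payload = JSON.stringify({flags: {...}, exp: Date.now()+60000});
const signature = sign(serverPrivateKey, payload); // asymmetric signature
return { payload, signature };
// client:
const parsed = verify(serverPublicKey, resp.payload, resp.signature)
  ? JSON.parse(resp.payload)
  : null;
if (parsed && parsed.exp > Date.now()) {
  useFlags(parsed.flags);
} else {
  // invalid signature or expired bundle: ignore, use fallback
}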
(Use proper crypto libs; never embed private keys in the client.)
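For a concrete version of signing and verifying, here is a sketch using Ed25519 via Node's built-in crypto module, so only a public key needs to live in the client. The key pair is generated inline purely so the example runs; in practice the private key would stay on the server and the public key would be bundled with the app. `signConfig` and `verifyConfig` are hypothetical names.

```javascript
const crypto = require('crypto');

// Generated inline for the demo only; a real private key never leaves the server.
const { publicKey, privateKey } = crypto.generateKeyPairSync('ed25519');

// Server side: serialize flags with an expiry and sign the exact bytes sent.
function signConfig(flags, ttlMs, now = Date.now()) {
  const payload = JSON.stringify({ flags, exp: now + ttlMs });
  // Ed25519 takes no separate digest algorithm, hence the null first argument.
  const signature = crypto.sign(null, Buffer.from(payload), privateKey).toString('base64');
  return { payload, signature };
}

// Client side: returns the flags, or null if the bundle is tampered with or expired.
function verifyConfig(resp, pubKey, now = Date.now()) {
  const ok = crypto.verify(null, Buffer.from(resp.payload), pubKey,
    Buffer.from(resp.signature, 'base64'));
  if (!ok) return null;
  const parsed = JSON.parse(resp.payload);
  if (typeof parsed.exp !== 'number' || parsed.exp < now) return null;
  return parsed.flags;
}

const bundle = signConfig({ new_payment_flow: true }, 60000);
console.log(verifyConfig(bundle, publicKey));
console.log(verifyConfig({ ...bundle, payload: bundle.payload.replace('true', 'false') }, publicKey));
```

The payload is verified as the exact string that was signed, before parsing, which avoids mismatches caused by re-serializing JSON with a different key order.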
13) Multi-platform notes
-
Mobile (iOS/Android): use SDK that caches flags, receives push notifications for immediate refresh, uses deterministic device id (or hashed advertising id) for bucketing. Keep TTL small for kill switches.
-
Web: fetch flags on boot, store in cookie/localStorage; consider server-side rendering to ensure search engines and first render obey server evaluation.
-
Backend: do sensitive checks server-side (e.g., enable payment route only if server says so).
14) Danger / anti-patterns to avoid
-
Shipping logic for enabling sensitive features purely to the client (easy to tamper). Always enforce critical checks on server.
-
Using non-deterministic random for rollouts (users get different experiences every request). Use deterministic hashing.
-
Long TTLs for critical flags (makes kill slow).
-
No telemetry or metrics — you must measure.
15) Example: quick Python evaluator (for integrations)
# evaluator.py
import hashlib

FLAGS = {}  # load from DB/redis

def stable_bucket(device_id: str, salt: str = "") -> int:
    # NOTE: this uses SHA-1, while the JS examples use a djb2-style hash.
    # The two assign different buckets to the same device, so standardize
    # on one hash across all evaluators before mixing them in a rollout.
    s = (salt + "|" + device_id).encode('utf-8')
    h = int(hashlib.sha1(s).hexdigest()[:8], 16)
    return h % 100

def eval_flag(flag, ctx):
    # kill switch wins over every rule
    if flag.get('kill_switch'):
        return False
    on = flag.get('default', False)
    for rule in flag.get('rules', []):
        t = rule['type']
        if t == 'user_id_allow' and ctx.get('user_id') in rule.get('ids', []):
            return True
        if t == 'device_model' and ctx.get('device_model') in rule.get('values', []):
            return True
        if t == 'country' and ctx.get('country') in rule.get('values', []):
            return True
        if t == 'percentage':
            # fall back to the flag key when no salt is set, as the JS evaluator does
            salt = rule.get('salt') or flag.get('key', '')
            if stable_bucket(ctx.get('device_id', ''), salt) < rule.get('percent', 0):
                return True
    return on
16) Orchestration & workflow (who does what)
-
Product configures flag intent and rollout plan.
-
SRE/Release sets kill switch and rollout gates in the control plane.
-
Engineers instrument exposure events and gate code paths.
-
Data/Analytics monitor KPI changes and recommend ramp up/down.
-
Legal/Privacy ensure device ids are hashed and PII handled correctly.
17) Checklist before rollout to millions
-
Flag control UI + API (audit logs + RBAC).
-
Deterministic bucketing implemented.
-
Server enforcement for sensitive paths.
-
Client caching + fallback behavior defined.
-
Telemetry for exposures + metrics instrumented.
-
Kill switch tested (flip and observe).
-
Test groups (QA, internal) pinned.
-
Rollout plan (percent steps + criteria to proceed).
-
Alerts for spike in errors/crashes.
If you want, I can:
-
Produce a small, ready-to-deploy reference repo layout (server + client) in your preferred language/framework.
-
Draft the rollout playbook and monitoring queries you should add to Prometheus/Grafana (or Datadog).
-
Give a short, copyable policy text for RACI/approval for flips and kills.
Which of those would be most useful next?