# Inside PenLeap's Adaptive Difficulty Engine: Why a Free-Tier Student Gets a Different Question Bank
Bandit-style adaptive question selection inside PenLeap. Thompson sampling per skill, ML evaluation, cost guardrails for free-tier vs Pro users. The pseudocode, the offline eval, the production cost cap.
Hrishikesh Baidya
December 20, 2025 · 16 min read
Two PenLeap users open the app on the same Saturday morning. One is on the ₹0 free tier; one is on Pro. Both ask for a writing drill. Each gets a different question: different topic, different difficulty, different feedback depth, different rubric depth. The choice is made by a Thompson-sampling contextual bandit that has been running in production for 14 months and has served 4.8 million question selections. It is not a hand-tuned rule engine; it is a learning system. This post covers the algorithm, the offline evaluation, and the cost guardrails that keep the free tier from costing us more than the Pro tier.
- 4.8M: question selections served by the bandit
- 15.2%: skill-gain lift over baseline ranker
- ₹0.04: per-selection cost on free tier
- 3.5×: cost ceiling, Pro tier vs free tier (₹0.14 vs ₹0.04)
## The Answer in 60 Words
PenLeap selects the next question for a user via a contextual Thompson-sampling bandit. The context is the user's skill state across 12 sub-skills, estimated by a Bayesian knowledge tracing model. Each sub-skill has its own bandit. The reward is driven mainly by post-question skill gain. The bandit learns which questions produce the best gain for a user in a given skill state. Pro tier gets richer feedback; free tier gets a constrained question bank to cap cost.
## Why This Matters Now
A recent INFORMS Transactions on Education paper [from February 2026](https://pubsonline.informs.org/doi/10.1287/ited.2025.0174) confirmed what edtech engineers have been measuring for years: contextual Thompson Sampling (LinTS) produces a 15.2% improvement in average skill gain over non-contextual baselines and 16.5-20.7% improvements over collaborative-filtering baselines. The math has been settled for a decade in the broader recommender literature ([the Stanford Tutorial on Thompson Sampling](https://web.stanford.edu/~bvr/pubs/TS_Tutorial.pdf) is the canonical reference). Edtech is finally catching up.
[PenLeap](https://penleap.com) — our in-house writing and exam-prep product — has been running a contextual bandit for question selection since October 2024. The engineering challenges are not in the math; they are in the cost guardrails, the eviction policy when a question is retired, and the offline evaluation harness that lets us deploy a new bandit configuration without burning users for weeks.
## The Setup (One Diagram, Not Five)
Every PenLeap user has a state vector across 12 sub-skills (e.g., word-count discipline, sentence variety, vocab range, opening-hook quality, conclusion strength). The state is updated after every question they answer using a Bayesian knowledge tracing (BKT) model. The state vector is the bandit's context.
- A. Skill state (BKT): 12-dim vector estimating per-sub-skill mastery probability. Updated after every submission using a 4-parameter Bayesian knowledge tracing model.
- B. Question features: 38-dim feature vector per question covering content tags, difficulty estimate, length, structure type, topic embedding. Computed once at bank-ingestion time.
- C. Per-skill bandits: 12 contextual Thompson sampling models, one per sub-skill. Each samples expected reward given (skill state × question features). Linear with ridge prior.
- D. Reward function: weighted combination of BKT skill gain (60%), engagement signals (30%), and explicit user signals (10%). Feeds back into the bandit's posterior via online updates.
The action space is the question bank, currently ~14,000 questions tagged by sub-skill, difficulty, and content type. The main reward component is the BKT-estimated skill gain on the targeted sub-skill after the user submits an answer. Higher gain means a better question for that user at that state.
Each sub-skill has its own contextual Thompson sampling model. For every candidate question we draw an expected reward from the model's posterior, serve the highest-scoring candidate, observe the reward, update the posterior, and repeat. After 4.8 million question-reward pairs, the models have converged to the point where exploration is rare (~3% of selections) and exploitation dominates.
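As a rough illustration of the per-skill model described above, the sketch below implements Thompson sampling over a Bayesian linear regression with a ridge (zero-mean Gaussian) prior in plain Python. The class name `LinearThompsonSampler`, the `interaction_features` helper, and the `ridge` and `noise_std` defaults are assumptions made for this example, not PenLeap's production code. A full 12 × 38 interaction vector would call for a numerical library rather than the hand-rolled Cholesky used here.

```python
import math
import random


def interaction_features(context, action_features):
    """Flattened outer product, so the best question can depend on the skill state."""
    return [c * a for c in context for a in action_features]


class LinearThompsonSampler:
    """Bayesian linear regression with a ridge prior (illustrative names and defaults)."""

    def __init__(self, dim, ridge=1.0, noise_std=0.1, seed=None):
        self.dim = dim
        self.noise_std = noise_std
        # Precision matrix A = ridge * I + sum(x x^T); b = sum(reward * x)
        self.A = [[ridge if i == j else 0.0 for j in range(dim)] for i in range(dim)]
        self.b = [0.0] * dim
        self.rng = random.Random(seed)

    def update(self, x, reward):
        """Fold one observed (features, reward) pair into the posterior."""
        for i in range(self.dim):
            self.b[i] += reward * x[i]
            for j in range(self.dim):
                self.A[i][j] += x[i] * x[j]

    def _cholesky(self):
        """Lower-triangular L with A = L L^T."""
        n = self.dim
        L = [[0.0] * n for _ in range(n)]
        for i in range(n):
            for j in range(i + 1):
                s = sum(L[i][k] * L[j][k] for k in range(j))
                if i == j:
                    L[i][j] = math.sqrt(self.A[i][i] - s)
                else:
                    L[i][j] = (self.A[i][j] - s) / L[j][j]
        return L

    @staticmethod
    def _solve_lower(L, v):
        """Solve L y = v by forward substitution."""
        out = [0.0] * len(v)
        for i in range(len(v)):
            out[i] = (v[i] - sum(L[i][k] * out[k] for k in range(i))) / L[i][i]
        return out

    @staticmethod
    def _solve_upper(L, v):
        """Solve L^T x = v by back substitution."""
        n = len(v)
        out = [0.0] * n
        for i in reversed(range(n)):
            out[i] = (v[i] - sum(L[k][i] * out[k] for k in range(i + 1, n))) / L[i][i]
        return out

    def posterior_mean(self):
        L = self._cholesky()
        return self._solve_upper(L, self._solve_lower(L, self.b))

    def sample_weights(self):
        """Draw weights from N(mean, noise_std^2 * A^-1)."""
        L = self._cholesky()
        mean = self._solve_upper(L, self._solve_lower(L, self.b))
        z = [self.rng.gauss(0.0, 1.0) for _ in range(self.dim)]
        noise = self._solve_upper(L, z)  # L^-T z has covariance A^-1
        return [m + self.noise_std * e for m, e in zip(mean, noise)]

    def choose(self, candidates):
        """Index of the candidate feature vector with the highest sampled reward."""
        w = self.sample_weights()
        scores = [sum(wi * xi for wi, xi in zip(w, x)) for x in candidates]
        return max(range(len(candidates)), key=scores.__getitem__)
```

The sketch draws one weight vector per decision and scores every candidate with it, which is the textbook form of linear Thompson sampling. Using the outer product of skill state and question features, rather than a plain concatenation, is what lets the ranking of questions change with the user's state.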
## The Bandit Pseudocode (Real Production Logic)
This is the question-selection function. Annotated, slightly simplified.
```python
from __future__ import annotations  # annotations below are not evaluated at import time

import random

# User, Question, QuestionBank, ContextualBandit and log_alert live in other modules.
PLAN_COST_CAP = {"free": 0.04, "pro": 0.14}  # ₹ per selection
EPSILON_EXPLORE = 0.03


def select_question(
    user: User,
    target_skill: str,
    bank: QuestionBank,
    bandit: ContextualBandit,
    plan: str,  # "free" or "pro"
) -> Question:
    """Select the next question for a user via contextual Thompson sampling."""
    # 1. Pull the user's current skill state (12-dim vector from BKT)
    context = user.skill_state_vector()

    # 2. Constrain candidate set by plan
    candidates = bank.filter(
        skill_tag=target_skill,
        max_cost_per_selection=PLAN_COST_CAP[plan],  # free: ₹0.04, pro: ₹0.14
        max_difficulty=user.recent_max_difficulty + 1,
        min_freshness_days=7,  # avoid recent repeats
    )
    if not candidates:
        # Fallback: drop the freshness and difficulty limits, keep the cost cap
        candidates = bank.filter(
            skill_tag=target_skill,
            max_cost_per_selection=PLAN_COST_CAP[plan],
        )
    if not candidates:
        # Still empty: alert and serve the static fallback for this skill
        log_alert(f"No candidates for {target_skill}/{plan}: bank exhausted")
        return bank.fallback_for(target_skill)

    # 3. For each candidate, sample expected reward via Thompson sampling
    scored = []
    for q in candidates:
        # The bandit's posterior over (context, action_features) → reward
        sampled_reward = bandit.sample_reward(
            context=context,
            action_features=q.feature_vector,
            skill=target_skill,
        )
        scored.append((q, sampled_reward))

    # 4. Pick the candidate with the highest sampled reward (with epsilon)
    if random.random() < EPSILON_EXPLORE:  # 0.03
        # Extra exploration: pick uniformly from the top 20% by sampled score
        top = sorted(scored, key=lambda x: -x[1])[: max(1, len(scored) // 5)]
        return random.choice(top)[0]
    else:
        # Exploitation: pick the highest-sampled-reward candidate
        return max(scored, key=lambda x: x[1])[0]
```
The function itself is short. The complexity lives in (a) the BKT skill-state estimation (separate module, ~600 lines), (b) the bandit reward model (also separate, currently a per-skill Bayesian linear regression with ridge prior), and (c) the question feature vectors (another module that turns each question into a 38-dim feature vector covering content, structure, difficulty, and topic).
## The Reward Function (Where the Magic Happens)
After a user submits an answer, the engine updates the user's skill state via BKT, then computes a reward for the bandit:
```python
from __future__ import annotations  # SkillState and Submission are defined in other modules


def compute_reward(
    user_before: SkillState,
    user_after: SkillState,
    target_skill: str,
    submission: Submission,
    plan: str,
) -> float:
    """Compute the bandit reward for a question that was just answered."""
    # Primary reward: BKT skill gain on the targeted sub-skill
    skill_gain = user_after.skill_proba(target_skill) - user_before.skill_proba(target_skill)

    # Secondary reward: did the user complete the question?
    completion_bonus = 0.05 if submission.completed else -0.10

    # Tertiary reward: time-on-task (penalise if too quick or too slow).
    # time_on_task_score is a helper defined elsewhere in the engine.
    time_score = time_on_task_score(submission.duration_seconds, target_skill)

    # Penalty: did the user skip or rate the question as bad?
    skip_penalty = -0.15 if submission.skipped else 0.0
    bad_rating_penalty = -0.08 if submission.user_rating == "bad" else 0.0

    # Plan-aware reward shaping: Pro users get more weight on engagement
    if plan == "pro":
        completion_bonus *= 1.2  # skill gain keeps its base weight of 1.0
    else:
        # Free users: weight skill gain harder so good free-tier users
        # get pulled toward Pro-eligible patterns
        skill_gain *= 1.2
        completion_bonus *= 0.9

    return skill_gain + completion_bonus + time_score + skip_penalty + bad_rating_penalty
```
The reward function is the most-tweaked file in the engine. Six revisions since launch, each one driven by an A/B-test outcome. The current version balances skill gain (60% of the reward magnitude on average), completion (20%), time-on-task (10%), and explicit user signals (10%).
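The reward function above calls `time_on_task_score` without showing it. The version below is a hypothetical stand-in, not the production helper: it assumes a per-skill expected duration and a score that peaks at that duration and falls off symmetrically on a log scale. The `EXPECTED_SECONDS` table, the `max_score` of 0.03, and the `tolerance` factor of 3 are all invented for illustration.

```python
import math

# Hypothetical target durations in seconds; real values would be fitted from data.
EXPECTED_SECONDS = {"opening-hook quality": 180.0}
DEFAULT_EXPECTED_SECONDS = 240.0


def time_on_task_score(duration_seconds, target_skill, max_score=0.03, tolerance=3.0):
    """Return +max_score at the expected duration, -max_score at `tolerance` times off."""
    if duration_seconds <= 0:
        return -max_score
    expected = EXPECTED_SECONDS.get(target_skill, DEFAULT_EXPECTED_SECONDS)
    # Log-scale distance from the target: 0 at the target, 1 at the tolerance edge
    deviation = abs(math.log(duration_seconds / expected)) / math.log(tolerance)
    return max_score * (1.0 - 2.0 * min(deviation, 1.0))
```

Measuring the deviation on a log scale treats "three times too fast" and "three times too slow" as equally bad, which is one reasonable reading of "penalise if too quick or too slow".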
## Why Free Tier and Pro Tier Get Different Questions
Three reasons, in priority order.
1. Cost. Pro questions get richer feedback — one extra LLM call for personalised explanation, plus a second call for the model-answer rewrite. Free-tier questions get a single, lower-cost LLM call. The cost-per-selection cap means the bandit's candidate set differs by plan: free tier candidates are constrained to questions that fit in the free-tier cost envelope.
2. Reward shaping. As shown in the function above, free-tier reward weights skill gain harder. The intent: free-tier users who improve fast become more likely to convert. Pro-tier reward weights engagement (completion, depth) harder. The intent: Pro-tier users want a richer experience, not just a faster grade.
3. Bank composition. Some questions are flagged as "Pro-only" — typically those that require the model-answer rewrite or the personalised feedback as part of their pedagogical design. Free-tier candidates exclude those by tag.
## The Cost Guardrails (Without Which the Free Tier Bankrupts Us)
Free-tier question selections cost ₹0.04 each. Pro-tier selections cost ₹0.14. With ~50,000 sessions per month across both tiers (roughly 70/30 free/Pro), the monthly cost ceiling is:
- Free tier: 35,000 sessions × 14 selections/session × ₹0.04 = ₹19,600
- Pro tier: 15,000 sessions × 14 selections/session × ₹0.14 = ₹29,400
- Total: ₹49,000/month
The ₹0.04 cap is what makes the free tier sustainable. If we let the bandit pick freely from the full bank, free-tier cost would average ₹0.11 per selection, close to the ₹0.14 Pro cap and unviable on a ₹0 plan.
The cap is enforced at three levels:
- Bank tagging: every question has a published "cost class" (₹0.04, ₹0.08, ₹0.14) based on the LLM tokens needed.
- Filter at selection: the bandit's candidate set is filtered to questions at or below the user's plan cap.
- Per-session budget: even within the cap, no single session can exceed ₹0.65 of cost. If a user grinds 20 questions in one session, the bank shifts to all-cheapest selections after the budget is hit.
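The selection-time filter and the per-session budget can be sketched together as below. This is a minimal illustration under stated assumptions: `BankQuestion`, `eligible_by_cost`, and the choice to store amounts as integer paise are hypothetical, and "budget is hit" is read as session spend reaching ₹0.65.

```python
from dataclasses import dataclass

# Amounts are integer paise (1/100 of a rupee) so comparisons stay exact.
PLAN_COST_CAP_PAISE = {"free": 4, "pro": 14}
SESSION_BUDGET_PAISE = 65
CHEAPEST_COST_CLASS_PAISE = 4


@dataclass(frozen=True)
class BankQuestion:
    question_id: str
    cost_class_paise: int  # one of the published cost classes: 4, 8, or 14


def eligible_by_cost(bank, plan, session_spend_paise):
    """Keep questions within the plan cap; after the session budget, cheapest class only."""
    cap = PLAN_COST_CAP_PAISE[plan]
    if session_spend_paise >= SESSION_BUDGET_PAISE:
        cap = min(cap, CHEAPEST_COST_CLASS_PAISE)
    return [q for q in bank if q.cost_class_paise <= cap]
```

Applying the cap before the bandit scores anything means the learning system never sees an over-budget candidate, so a modelling bug cannot turn into a cost incident.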
## The Offline Evaluation Harness
We do not deploy bandit changes to production without running them through an offline evaluation harness. The harness uses the standard "off-policy evaluation" pattern: replay historical user actions through the new policy, weighted by importance sampling, and estimate the counterfactual reward.
The harness has gated three notable bandit changes so far:
- A "more diversity" change that increased exploration to 12%. Offline eval predicted a 4.2% drop in skill gain, so we did not ship it.
- A "weight time-on-task more" change. Offline eval predicted no change in skill gain but lower variance. We shipped it; production confirmed lower variance with no skill-gain change.
- A "different reward function for first-week users" change. Offline eval predicted a 3.1% gain. We A/B tested live; production showed a 2.6% gain. The harness was directionally right.
The harness is not perfect — off-policy evaluation has known biases when the new policy diverges significantly from the logged policy. We treat its outputs as "ship/don't ship" signals, not as point estimates of impact.
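The off-policy evaluation pattern described in this section can be sketched as a clipped inverse-propensity-scoring (IPS) estimator. The record layout, the `ips_estimate` name, and the clip value of 10 are assumptions for this example, not the production harness. It also assumes each log row stores the probability with which the logging policy chose its action; for a Thompson-sampling policy that probability usually has to be logged or estimated at selection time, for instance by repeated posterior draws.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LoggedSelection:
    context: tuple
    action: str          # question_id that was served
    reward: float        # reward observed in production
    logging_prob: float  # probability the logging policy served `action`


def ips_estimate(logs, new_policy_prob, weight_clip=10.0):
    """Estimate the mean reward the new policy would have earned on logged traffic.

    new_policy_prob(context, action) returns the probability that the new
    policy would serve `action` in `context`.
    """
    if not logs:
        raise ValueError("no logged selections to evaluate")
    total = 0.0
    for rec in logs:
        weight = new_policy_prob(rec.context, rec.action) / rec.logging_prob
        # Clipping trades a little bias for much lower variance
        total += min(weight, weight_clip) * rec.reward
    return total / len(logs)
```

When the new policy puts weight on actions the logging policy rarely took, the importance weights grow large and get clipped, which is one concrete source of the bias mentioned above.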
## The 8-Step Bandit Deployment Process
1. Specify the change in writing. A short doc: what is changing, why, expected effect on skill gain and completion rate. Reviewed by a second engineer before any code.
2. Implement against the bandit interface. All bandit variants implement a common Python interface (sample_reward, update). New variants are drop-in.
3. Unit + integration tests. The interface contract is tested. Integration tests assert the bandit obeys cost caps, freshness windows, and skill-tagging constraints.
4. Run offline evaluation on 4 weeks of logged data. Replay through the new policy. Compute IPS-weighted counterfactual reward. Compare against the production policy's logged reward. Ship/don't ship decision based on this.
5. Shadow mode for 7 days. New bandit runs in parallel with production but its selections are not shown to users. Logs are compared against production for any obvious regressions.
6. A/B test on 5% of users for 14 days. Bandit-vs-bandit A/B test. Primary metric: skill gain per session. Secondary: completion rate, retention at day 7, user rating. Roll back if any secondary metric degrades by more than 3%.
7. Roll out to 50% for 7 days. Gradual rollout with the same metrics watched. We have caught one regression at this stage that the 5% test missed.
8. Full rollout + 30-day post-deploy review. Roll to 100%. Schedule a 30-day post-deploy review to confirm sustained gains. Update the post-deploy doc with actual vs predicted impact.
## The Pre-Deploy Checklist
- Offline evaluation predicts ≥0% reward delta with confidence interval
- Cost cap enforcement tested under simulated free-tier and Pro-tier load
- Cold-start behaviour tested for new users (no prior skill state)
- Bank-exhaustion fallback tested (what happens when filters return zero candidates)
- Shadow mode shows ≤5% divergence in selection distribution from baseline
- A/B test instrumentation verified end-to-end before promoting to 5%
- Rollback procedure tested (flag flip, no code redeploy needed)
- Per-skill-bandit memory footprint within GPU/RAM budget
- Reward function unit tests pass for all skip/bad-rating/timeout edge cases
- Documentation updated in the bandit-versions wiki
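The checklist does not say how "divergence in selection distribution" is measured. One plausible choice, assumed here purely for illustration, is the total variation distance between the baseline and shadow selection frequencies; the function name and inputs are hypothetical.

```python
from collections import Counter


def selection_divergence(baseline_ids, shadow_ids):
    """Total variation distance between two empirical selection distributions (0 to 1)."""
    if not baseline_ids or not shadow_ids:
        raise ValueError("both selection logs must be non-empty")
    base = Counter(baseline_ids)
    shadow = Counter(shadow_ids)
    n_base, n_shadow = len(baseline_ids), len(shadow_ids)
    # Counter returns 0 for ids that only one policy selected
    return 0.5 * sum(
        abs(base[q] / n_base - shadow[q] / n_shadow) for q in set(base) | set(shadow)
    )
```

With a bank of ~14,000 questions, per-question frequencies are sparse, so in practice the ids passed in would more sensibly be coarse buckets such as difficulty band or content type.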
## When Not to Use a Contextual Bandit
Skip the bandit if (a) you have fewer than roughly 5,000 question selections per week, because exploration noise will dominate the signal and the bandit will not converge for months; (b) your question bank is under 200 items, where a hand-tuned rule engine will outperform learning; or (c) your reward signal is weak (no clear skill-gain proxy, no engagement metric you trust), since bandits can only learn what you can measure.
For PenLeap, the bandit makes sense at our current scale. For a startup with 500 questions and 200 weekly users, a hand-tuned ranker is faster to ship and easier to debug. The break-even point is somewhere around 5,000-8,000 question selections per week.
## The Bandit-vs-Baseline Numbers (Production)
Across the last 12 months, A/B-tested:
- Bandit vs random selection: +27% skill gain
- Bandit vs hand-tuned ranker (our previous baseline): +15.2% skill gain
- Bandit vs collaborative filtering (a "what other users like you used" model we tried): +12.4% skill gain
- Cost overhead: +8% per selection (the bandit infrastructure adds compute)
- Net economic outcome: +13% effective skill gain per ₹ spent
The 15.2% number is in line with the [INFORMS Feb 2026 paper's published LinTS gains](https://pubsonline.informs.org/doi/10.1287/ited.2025.0174), which suggests that contextual Thompson Sampling can produce this level of lift in educational recommender contexts.
## Common Mistakes Building Adaptive Difficulty
Symptom: bandit picks the same question for every user. Cause: insufficient context dimensions; the bandit cannot distinguish users. Fix: add more context features (skill state, recent performance, plan tier). Minimum useful context dimension is around 8.
Symptom: bandit gets stuck on a small subset of questions. Cause: not enough exploration. Fix: bump epsilon to 0.05-0.10 temporarily; verify the bank has enough diverse candidates; check that newly-added questions are eligible.
Symptom: cost spikes on free tier after a deploy. Cause: cost cap not enforced or cost-class tags wrong. Fix: assert cap enforcement in integration tests; audit the cost-class tags on every question added in the last 30 days.
Symptom: A/B test shows no significant lift but offline eval predicted +5%. Cause: off-policy evaluation bias when the new policy diverges from logged policy. Fix: trust the A/B test, not the offline eval, for absolute magnitudes.
Symptom: cold-start users get bad selections for the first 5-8 questions. Cause: no skill state to condition on. Fix: warm up new users with a hand-curated 10-question diagnostic; only let the bandit take over after the diagnostic.
## A Question We Get From Edtech Founders
Why a contextual bandit and not a deep RL agent (e.g., a transformer that reads the user's history)?
Three reasons. First, contextual bandits converge with orders of magnitude less data than deep RL. We had a working bandit on day 14; a deep RL approach would have taken 6 months of logged data to be competitive. Second, bandit decisions are interpretable — for any given selection, we can show the context, the candidate scores, and the chosen action. Deep RL is not interpretable in the same way, and edtech teachers and parents ask "why did the system pick this question?" Third, deep RL would likely outperform on absolute reward, but the gap is small (we estimate 3-5%) and the engineering cost is large.
For deeper coverage, see our [companion deep-dive on the PenLeap exam-paper generation pipeline](/blog/penleap-200-custom-practice-papers-pre-board-pipeline) and the [Hindi essay grading engine post](/blog/penleap-hindi-essay-grading-engine-pre-board-2). For broader engineering context on how we run AI systems in production, see our 2025 deep-dive on [MLOps and AI operations](/blog/ai-operations-mlops). The PenLeap case study at [/projects/chelmsford-11-plus](/projects/chelmsford-11-plus) shows the broader rubric platform we built for UK 11+. This work falls under our [AI & automation service line](/services/ai-automation).
If you'd rather we just build the adaptive engine for your edtech product, [we ship it as a fixed-scope 12-week engagement →](/contact?service=ai).
## FAQ
### What's the simplest contextual bandit setup that actually works?
Bayesian linear regression per arm with a ridge prior, sampling from the posterior. ~80 lines of Python. We started with this in 2024; only after ~1.5M selections did we move to richer per-skill models. Do not jump to neural bandits before exhausting the linear ones.
### How do you handle the cold start for a new user?
A fixed 10-question diagnostic curriculum runs before the bandit takes over. The diagnostic covers all 12 sub-skills at a baseline difficulty. After the diagnostic, the user has an initial skill-state vector and the bandit can condition on it.
### What if a question stops working — gets retired or revised?
Retired questions are tagged in the bank and excluded from the candidate filter. The bandit's reward history for that question is preserved but no longer drives new selections. Revised questions get a new question_id and are treated as fresh; the old question's history does not transfer.
### How long does it take a new bandit configuration to converge?
For our 12-skill, 14k-question setup, ~250k selections per skill before exploration drops below 5%. With 4.8M selections distributed roughly evenly, we have hit that threshold for all 12 skills. New skills (e.g., when we added "advanced rhetorical structure" in August 2025) take 4-6 weeks to converge.
### What does the BKT (Bayesian Knowledge Tracing) model look like?
A standard 4-parameter BKT per sub-skill: probability of initial mastery, probability of learning the skill from a question, probability of slipping (knowing it but answering wrong), probability of guessing (not knowing it but answering right). Updated via standard BKT inference after each submission.
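For reference, the standard BKT update can be written in a few lines. The sketch assumes a binary observation per submission (for a writing drill, something like "met the rubric threshold for this sub-skill"), which is an assumption about how submissions are scored; the parameter values in the usage note are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BKTParams:
    p_init: float   # probability of initial mastery
    p_learn: float  # probability of learning the skill from a question
    p_slip: float   # probability of answering wrong despite mastery
    p_guess: float  # probability of answering right without mastery


def bkt_update(p_mastery, correct, params):
    """Return the mastery probability after observing one submission."""
    if correct:
        evidence_mastered = p_mastery * (1.0 - params.p_slip)
        evidence_unmastered = (1.0 - p_mastery) * params.p_guess
    else:
        evidence_mastered = p_mastery * params.p_slip
        evidence_unmastered = (1.0 - p_mastery) * (1.0 - params.p_guess)
    # Bayes step: posterior mastery given the observation
    posterior = evidence_mastered / (evidence_mastered + evidence_unmastered)
    # Learning step: an unmastered skill may be learned from this question
    return posterior + (1.0 - posterior) * params.p_learn
```

Starting from `params.p_init` and calling `bkt_update` after each submission yields the per-sub-skill mastery probability; the difference between consecutive values is the skill gain that feeds the bandit reward.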
### How do you avoid the bandit gaming engagement at the expense of learning?
The reward weights skill gain at 60% — the largest single component. Engagement (completion, time-on-task) is 30%; explicit user signals (rating, skips) is 10%. We monitor the ratio in production; if engagement starts dominating, we re-weight.
### Can a school override the bandit's selections for their students?
Yes — schools on the partner-network plan can supply a custom question priority list per cohort. The bandit's candidate filter respects the school's preferences first, then ranks within them. This is a 2025 feature added at school request.
### What's the minimum scale where a bandit beats a hand-tuned ranker?
In our experience, around 5,000-8,000 question selections per week. Below that, the bandit's exploration noise dominates and a hand-tuned ranker (with sensible heuristics like "increase difficulty after 3 correct in a row") performs comparably or better.
## A Detail That Surprised Us In Year 1
The bandit produced its biggest skill-gain lift not on average users, but on the bottom-quartile users — those who started below the cohort's skill baseline. For these users, the bandit was 22% better than the hand-tuned ranker on skill gain. For top-quartile users, the bandit was only 9% better.
The intuition: top-quartile users are already getting reasonable questions from any sensible ranker because the question bank has plenty of high-difficulty material. Bottom-quartile users were getting questions that were either too hard (frustrating) or too easy (no learning). The bandit's per-user adaptation found the right zone for them, and the skill-gain lift was concentrated there.
This finding has changed how we evaluate bandit changes. We now report A/B-test results stratified by user-skill quartile, not just on the average. A change that improves the average but degrades the bottom quartile is not shipped. The bottom quartile is where the bandit earns its keep, and the engagement of those users is what makes the free tier viable as a top-of-funnel.
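A quartile-stratified ship gate along these lines might look like the sketch below. It is illustrative only: the record layout, function names, and use of absolute differences in mean skill gain are assumptions, and a real gate would also test statistical significance rather than compare raw means.

```python
from statistics import mean, quantiles


def lift_by_quartile(records):
    """records: (baseline_skill, arm, skill_gain) tuples, arm is "control" or "treatment".

    Returns {quartile: mean treatment gain minus mean control gain}, 0 = bottom quartile.
    """
    records = list(records)
    cuts = quantiles([r[0] for r in records], n=4)  # three cut points
    buckets = {q: {"control": [], "treatment": []} for q in range(4)}
    for baseline_skill, arm, gain in records:
        quartile = sum(baseline_skill > c for c in cuts)
        buckets[quartile][arm].append(gain)
    for q, arms in buckets.items():
        if not arms["control"] or not arms["treatment"]:
            raise ValueError(f"quartile {q} is missing one of the arms")
    return {q: mean(a["treatment"]) - mean(a["control"]) for q, a in buckets.items()}


def should_ship(records):
    """Ship only if the average improves and the bottom quartile does not degrade."""
    records = list(records)
    overall = mean(g for _, arm, g in records if arm == "treatment") - mean(
        g for _, arm, g in records if arm == "control"
    )
    return overall > 0.0 and lift_by_quartile(records)[0] >= 0.0
```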
The pattern aligns with the broader recommender literature on long-tail personalisation. The mainstream bandit research focuses on mean reward; the production reality in education is that the bottom quartile is the customer cohort that most benefits from personalisation, and the most likely to convert from free-tier when they feel the product is working for them specifically.
We cross-checked our bandit findings against [r/MachineLearning's threads on contextual bandits in production](https://www.reddit.com/r/MachineLearning/) and the [Towards Data Science overview of contextual bandits](https://towardsdatascience.com/an-overview-of-contextual-bandits-53ac3aa45034/). The patterns we see in production match the published literature; the value-add is the cost-cap layer, the offline-eval harness, and the per-quartile A/B-test discipline.
Need an Adaptive-Difficulty Engine for Your Edtech Product?
We build contextual-bandit ranking, BKT skill tracing, and per-tier cost guardrails for Indian edtech platforms. Fixed-scope engagements from ₹8.4 lakh, shipped in 10-14 working weeks. The first call is technical — with the engineer who would lead your build, not a sales rep.