CBSE Pre-Board II Begins Today: Inside PenLeap's Hindi Essay Grading Engine
Hindi NLP pipeline behind PenLeap's essay grader — Devanagari tokenisation, SPAG-Hindi rubric, IndicBERT scoring. Pearson 0.87 against human raters across 8,400 essays. The architecture, the gotchas, the cost.
Khushi Singh
November 29, 2025 · 14 min read
Pre-Board II opens today across most CBSE-affiliated schools. Class 10 Hindi B is one of the trickier papers to grade at scale, not because the content is harder but because most off-the-shelf NLP tooling treats Hindi as a second-class citizen. Tokenisers split "नमस्ते" into three sub-tokens. Embeddings place the synonyms सुंदर and खूबसूरत much further apart than their English equivalents. Spell-checkers either over-flag or under-flag. We built PenLeap's Hindi essay grading engine to handle 8,400 essays this term across our partner network. Pearson correlation with the human rater panel: 0.87. Cost per essay: ₹0.34. This post walks through the architecture.
- 8,400: Hindi essays graded in Q3 2025
- 0.87: Pearson correlation with human rater panel
- ₹0.34: per-essay grading cost (Nov 2025)
- 3.4 s: median grading wall-clock per essay
## The Answer in 60 Words
PenLeap's Hindi grader runs a four-stage pipeline: Devanagari tokenisation via [iNLTK](https://inltk.readthedocs.io/), syntactic parsing via Stanza, semantic scoring via [IndicBERT](https://huggingface.co/ai4bharat/indic-bert) embeddings, and rubric scoring via a Claude Haiku call constrained by SPAG-Hindi rules. The four scores aggregate via a learned weighted formula tuned against human rater medians. Pearson 0.87, ₹0.34 per essay, 3.4 s wall-clock.
## Why This Matters Now
Pre-Board II opens today at most CBSE schools. The Hindi B paper is typically scheduled in the first week of December, which means schools have ~7 days from when the Pre-Board II Hindi script is collected to when feedback should land back with students. At 8,400 essays across our 28-school network, that is a hard manual-grading deadline. The grading engine is not a "nice to have" — it is the only way our partner schools turn essays around in the available window.
[PenLeap](https://penleap.com), our in-house writing platform, was designed for English first. The Hindi pipeline was a 2025 build-out specifically for Indian schools running CBSE Hindi B and Hindi A. Most of the architecture transfers; the language-specific layers (tokenisation, embedding, rubric) are different.
## The Stack (One Diagram, Not Three)
The grader has four components in series and one feedback loop on top.
- A. Tokeniser + cleaner: iNLTK Devanagari tokeniser. Splits compound nouns, normalises Unicode (NFC), handles inherent vowel removal in conjuncts. Output: a stream of word tokens with morphological tags.
- B. Syntactic parser (Stanza): Stanford Stanza Hindi pipeline for POS tags, dependency parses, sentence boundaries. Used to score sentence structure, complexity, and SOV word order, which is a CBSE rubric axis.
- C. IndicBERT semantic scorer: AI4Bharat's IndicBERT v2 produces sentence embeddings. We compute coherence (adjacent-sentence cosine), topic adherence (essay vs prompt cosine), and lexical diversity (entropy of unigram embeddings).
- D. SPAG-Hindi rubric pass: Claude Haiku with a Hindi-specific rubric prompt. Scores spelling, punctuation, agreement, gender, vibhakti (case marker) usage. Output: per-axis scores with citation to the offending phrase.
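The two cosine measures used by component C can be sketched in a few lines. This is a minimal illustration, not PenLeap's code: the three-dimensional vectors stand in for real IndicBERT sentence embeddings, and the function names and the choice to average sentence vectors into an essay centroid are assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def mean_vector(vectors):
    """Component-wise mean, used here as a simple essay centroid."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def coherence(sentence_vecs):
    """Mean cosine between each sentence and the one that follows it."""
    pairs = list(zip(sentence_vecs, sentence_vecs[1:]))
    if not pairs:
        return 0.0
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)


def topic_adherence(sentence_vecs, prompt_vec):
    """Cosine between the essay centroid and the prompt embedding."""
    return cosine(mean_vector(sentence_vecs), prompt_vec)


# Toy vectors standing in for sentence embeddings (assumption).
essay = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0], [0.0, 1.0, 0.0]]
prompt = [1.0, 1.0, 0.0]

print(round(coherence(essay), 3))               # 0.707
print(round(topic_adherence(essay, prompt), 3))  # 1.0
```

An essay whose sentences drift away from each other lowers `coherence`; one that drifts away from the prompt lowers `topic_adherence`.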
## Why Hindi Specifically Is Harder Than English
Three structural reasons.
Hindi is morphologically rich. A single Hindi word frequently encodes person, number, gender, case, and tense. "लड़कियों" (lit. "of/to the girls") is one orthographic word but four morphemes. A naive tokeniser splits it badly; a good one (iNLTK or Stanza) keeps it whole and emits the morphological tags.
Embeddings on Hindi are sparser. IndicBERT is trained on a fraction of the corpus that English BERT is trained on. Synonym pairs that English BERT clusters tightly (e.g., "beautiful" / "pretty") often land much further apart in Hindi embedding space (e.g., सुंदर / खूबसूरत). We compensate with a curated synonym list that runs as a pre-processing pass.
SPAG rules are language-specific. English SPAG is "spelling, punctuation, and grammar." Hindi SPAG includes vibhakti (case markers — का, की, के, में, से), gender agreement on adjectives and verbs, and the matra system (vowel diacritics). A grading rubric built on English SPAG misses 60-70% of typical Hindi essay errors.
The published research on Hindi essay scoring agrees with this. The H-AES paper from 2023 notes that LSTM and transformer architectures both work for Hindi but require Hindi-specific pre-processing — there is no shortcut around the morphology and the script.
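As an illustration of that Hindi-specific pre-processing, here is a minimal sketch that applies NFC normalisation and then maps synonyms to a canonical form before embedding. The one-entry `SYNONYM_CANON` map and the `preprocess` helper are hypothetical; the post does not publish the curated list or describe exactly how it is applied.

```python
import unicodedata

# Hypothetical one-entry map; a real list would be curated by Hindi teachers.
SYNONYM_CANON = {
    "खूबसूरत": "सुंदर",
}


def nfc(text):
    """Return the NFC-normalised form of a string."""
    return unicodedata.normalize("NFC", text)


# Normalise keys and values once so lookups match normalised tokens.
_CANON = {nfc(k): nfc(v) for k, v in SYNONYM_CANON.items()}


def preprocess(tokens):
    """NFC-normalise each token, then replace listed synonyms."""
    out = []
    for tok in tokens:
        tok = nfc(tok)
        out.append(_CANON.get(tok, tok))
    return out


# The same word typed two ways: precomposed ड़ (U+095C) versus
# ड (U+0921) followed by a nukta (U+093C).
precomposed = "\u0932\u095C\u0915\u0940"
decomposed = "\u0932\u0921\u093C\u0915\u0940"

print(precomposed == decomposed)            # False
print(nfc(precomposed) == nfc(decomposed))  # True
print(preprocess(["खूबसूरत", "फूल"]))
```

Without the normalisation step, the two spellings of the same word would be treated as different tokens by every later stage.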
## The Scoring Pipeline (Detailed)
1. Pre-process: Unicode normalisation + tokenisation. Run NFC normalisation (this matters for matra-bearing characters which can have multiple Unicode representations). Tokenise via iNLTK. Output: a JSON list of tokens with morphological tags.
2. Stanza POS + dependency parse. Stanza Hindi pipeline produces POS tags and dependency arcs. We compute average sentence length, depth of dependency tree, and complexity score (number of subordinate clauses per 100 words).
3. IndicBERT embeddings for coherence + topic. Embed every sentence. Compute adjacent-sentence cosine for coherence. Compute essay-prompt cosine for topic adherence. Both feed into the rubric's "content" and "structure" axes.
4. SPAG-Hindi rubric pass via Claude Haiku. One LLM call with the essay, the prompt, the morphological tags, and the SPAG-Hindi rubric. Returns per-axis scores plus a list of cited errors. We force structured output via Zod schema.
5. Aggregation. Weighted aggregation of: structure (Stanza output), content (IndicBERT cosine), language quality (LLM rubric), SPAG (LLM rubric). Weights tuned against human rater medians via ridge regression on a held-out 800-essay set.
6. Feedback generator. A second LLM call writes student-facing feedback in Hindi, citing 3-5 of the most-cited errors and suggesting one improvement. Length capped at 180 Hindi words for readability.
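The aggregation step can be sketched as closed-form ridge regression over the four axis scores. This is an illustration under stated assumptions: the toy data, the weights used to generate it, the regularisation strength, and the absence of an intercept are all invented for the example, and the function names are hypothetical.

```python
def ridge_fit(X, y, lam=1.0):
    """Closed-form ridge: solve (X^T X + lam * I) w = X^T y."""
    d = len(X[0])
    # Build the regularised normal equations as an augmented matrix.
    A = []
    for i in range(d):
        row = [sum(r[i] * r[j] for r in X) + (lam if i == j else 0.0)
               for j in range(d)]
        row.append(sum(r[i] * t for r, t in zip(X, y)))
        A.append(row)
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(d):
        pivot = max(range(col, d), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(d):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][d] / A[i][i] for i in range(d)]


def aggregate(axis_scores, weights):
    """Weighted sum of per-axis scores."""
    return sum(s * w for s, w in zip(axis_scores, weights))


# Columns: structure, content, language, SPAG (toy scores on a 1-6 scale).
X = [
    [4, 5, 3, 6],
    [2, 3, 4, 1],
    [6, 5, 5, 4],
    [3, 2, 6, 5],
    [5, 4, 2, 3],
    [1, 6, 3, 2],
]
# Invented weights, used only to generate targets that stand in for
# the human rater medians.
true_w = [0.2, 0.3, 0.2, 0.3]
y = [aggregate(row, true_w) for row in X]

weights = ridge_fit(X, y, lam=1e-6)
print([round(w, 3) for w in weights])
print(round(aggregate([4, 5, 4, 5], weights), 2))
```

A larger `lam` shrinks the weights towards zero, which helps when the axis scores are correlated with each other and the calibration set is small.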
## Accuracy: How We Measure It
We sample 8% of all graded essays per month for human-rater calibration. Three trained Hindi-language teachers score the sample on the same rubric. We compute Pearson correlation and Spearman rank correlation between the engine's score and the rater median.
Current numbers (October 2025 calibration, 672 essays sampled):
- Overall Pearson: 0.87
- Overall Spearman: 0.84
- Per-rubric axis: Content 0.89, Structure 0.83, Language 0.86, SPAG 0.91
The SPAG axis is the strongest because it is the most rule-bound — vibhakti errors and gender-agreement errors are detectable with high precision once morphological tags are correct. The Structure axis is the weakest because it depends on Stanza's parse quality, which itself is around 90% accurate on Hindi (vs 96%+ on English).
We also track inter-rater agreement (Krippendorff's alpha) on the human panel: currently 0.78. When this drops below 0.70 we re-train the panel with a fresh annotated set before trusting the calibration run.
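The two correlations reported above can be computed from the engine scores and the rater medians as follows. This is a standard-library sketch with made-up scores; the helper names are not from the production code.

```python
import math
import statistics


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def spearman(x, y):
    """Spearman rank correlation: Pearson on the ranks."""
    return pearson(ranks(x), ranks(y))


# Made-up example: five essays, each scored by the engine and by three raters.
engine = [4.5, 3.0, 5.0, 2.5, 4.0]
panel = [(4, 5, 5), (3, 3, 4), (5, 5, 6), (2, 3, 3), (4, 4, 5)]
rater_median = [statistics.median(scores) for scores in panel]

print(round(pearson(engine, rater_median), 2))
print(round(spearman(engine, rater_median), 2))
```

Pearson rewards getting the absolute marks right; Spearman only asks whether the engine orders the essays the same way the panel does, which is why the two numbers are tracked separately.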
## The Cost (Real Numbers)
The two LLM calls dominate the cost. The IndicBERT embedding pass is cheap because we run it on a Hetzner GPU spot instance (~₹4,200/month, amortised across all essays in a batch). The Stanza/iNLTK pass is essentially free on CPU.
For a school batch of 1,200 Hindi essays, the API + compute bill is ₹408 per cohort. Compared to the cost of three teachers grading 1,200 essays (roughly 80 hours of teacher time at ₹600/hour = ₹48,000), the engine pays for itself across one Pre-Board II run.
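The arithmetic behind that comparison, using only the figures quoted in this post. Working in paise keeps the multiplication exact; the function names are illustrative.

```python
PAISE_PER_ESSAY = 34         # ₹0.34 per essay
BATCH_SIZE = 1_200           # essays in the example cohort
TEACHER_HOURS = 80           # estimated manual grading time
TEACHER_RATE_RUPEES = 600    # per hour


def engine_cost_rupees(n_essays):
    """API + compute cost for a batch, in rupees."""
    return n_essays * PAISE_PER_ESSAY / 100


def manual_cost_rupees(hours, rate):
    """Teacher time cost for the same batch, in rupees."""
    return hours * rate


engine = engine_cost_rupees(BATCH_SIZE)
manual = manual_cost_rupees(TEACHER_HOURS, TEACHER_RATE_RUPEES)

print(engine)                             # 408.0
print(manual)                             # 48000
print(round(manual / engine))             # 118
print(TEACHER_HOURS * 60 / BATCH_SIZE)    # 4.0 minutes of teacher time per essay
```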
## The SPAG-Hindi Rubric (Real Excerpt)
This is the rubric prompt we ship to the LLM. Trimmed for length.
```
You are a CBSE Class 10 Hindi B examiner. Grade the essay below on
four axes (1-6 each) using the SPAG-Hindi rubric:
CONTENT (1-6): Topic adherence, depth of argument, examples used.
STRUCTURE (1-6): Introduction, body, conclusion. Logical flow.
LANGUAGE (1-6): Vocabulary range, sentence variety, register.
SPAG (1-6):
- Spelling: matra placement, conjunct correctness
- Punctuation: full stop (।), comma, quotation marks
- Grammar: vibhakti (का/की/के/में/से/पर) usage
- Agreement: gender + number on adjectives and verbs
Cite the offending phrase for every SPAG deduction. Output strict JSON.
ESSAY:
{essay_text}
PROMPT:
{essay_prompt}
MORPHOLOGICAL TAGS (from Stanza):
{morph_tags}
```
The morphological tags from Stanza are passed in as a hint — they help the LLM detect agreement errors that it would otherwise miss. Example: if a Stanza tag says a verb is feminine-singular but its noun subject is masculine-plural, the LLM is far more likely to catch the disagreement when it has the tag in front of it than when it has to infer from raw text.
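The rubric asks for strict JSON, and the pipeline description says structured output is enforced with a Zod schema. A rough Python analogue of that validation step is sketched below. The JSON field names (`scores`, `citations`, `axis`, `phrase`, `reason`) are assumptions, since the post does not show the schema, and the sample response is invented.

```python
import json

AXES = ("content", "structure", "language", "spag")


def parse_rubric_response(raw):
    """Parse and validate the model's JSON; raise ValueError on any violation."""
    data = json.loads(raw)
    scores = data.get("scores", {})
    for axis in AXES:
        value = scores.get(axis)
        is_int = isinstance(value, int) and not isinstance(value, bool)
        if not is_int or not 1 <= value <= 6:
            raise ValueError(f"{axis} must be an integer from 1 to 6, got {value!r}")
    citations = data.get("citations", [])
    for citation in citations:
        if not citation.get("phrase"):
            raise ValueError("every citation needs the offending phrase")
    # The rubric requires a cited phrase for every SPAG deduction.
    has_spag_citation = any(c.get("axis") == "spag" for c in citations)
    if scores["spag"] < 6 and not has_spag_citation:
        raise ValueError("SPAG deduction without a cited phrase")
    return {"scores": {a: scores[a] for a in AXES}, "citations": citations}


# Invented sample response: a masculine subject with a feminine verb form.
raw = """{
  "scores": {"content": 5, "structure": 4, "language": 4, "spag": 3},
  "citations": [
    {"axis": "spag", "phrase": "लड़का गई",
     "reason": "verb is feminine, subject is masculine"}
  ]
}"""

result = parse_rubric_response(raw)
print(result["scores"])
print(result["citations"][0]["phrase"])
```

Rejecting a response that deducts SPAG marks without a citation is what makes the per-axis scores auditable by a teacher later.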
## What "Devanagari Tokenisation" Actually Costs You If You Get It Wrong
In an early prototype, we used HuggingFace's default BERT tokeniser on Hindi. It produced sub-word tokens whose boundaries ignored grapheme clusters. The string "क्षत्रिय" (a single word whose conjuncts क्ष and त्रि each span several Devanagari code points) was split into four sub-tokens. Downstream, our coherence and topic-adherence scores were 30-40% worse than they should have been.
The fix was a one-day change: swap to iNLTK's tokeniser, which respects Devanagari grapheme clusters. Pearson correlation against human raters jumped from 0.71 to 0.84 overnight. The lesson: NLP pipelines for Indian-language content live or die on the tokeniser. Spending a week evaluating four candidate tokenisers is the cheapest way to add 0.13 to your correlation.
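When evaluating candidate tokenisers, one cheap check is whether any sub-token boundary falls inside a grapheme cluster. The sketch below uses a simplified cluster rule (base character, virama-joined consonants, trailing combining marks). It is an approximation for illustration, not full Unicode segmentation and not how iNLTK works internally. It also assumes the sub-tokens concatenate back to the original word, with any `##` prefixes already stripped.

```python
import unicodedata

VIRAMA = "\u094D"


def devanagari_clusters(word):
    """Group code points into approximate grapheme clusters."""
    clusters = []
    for ch in word:
        is_mark = unicodedata.category(ch) in ("Mn", "Mc")
        joins_previous = bool(clusters) and clusters[-1].endswith(VIRAMA)
        if clusters and (is_mark or joins_previous):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


def breaks_cluster(word, pieces):
    """True if any boundary between pieces falls inside a cluster of word."""
    boundaries = set()
    pos = 0
    for cluster in devanagari_clusters(word):
        pos += len(cluster)
        boundaries.add(pos)
    pos = 0
    for piece in pieces[:-1]:
        pos += len(piece)
        if pos not in boundaries:
            return True
    return False


# क्षत्रिय written out as code points: 8 code points, 3 clusters.
word = "\u0915\u094D\u0937\u0924\u094D\u0930\u093F\u092F"

print(len(word))                       # 8
print(len(devanagari_clusters(word)))  # 3

good_split = ["\u0915\u094D\u0937", "\u0924\u094D\u0930\u093F\u092F"]
bad_split = ["\u0915", "\u094D\u0937\u0924", "\u094D\u0930\u093F\u092F"]
print(breaks_cluster(word, good_split))  # False
print(breaks_cluster(word, bad_split))   # True
```

Running a check like this over a few thousand essay words gives a quick break rate per tokeniser before any correlation numbers are available.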
## When Not to Use This Engine
Skip the engine if any of these apply:
- Your school has under 200 Hindi scripts per term. The calibration overhead is not worth it.
- Your scripts are predominantly handwritten and the OCR step adds error the engine cannot recover from (we recommend digital submission for accurate grading).
- You are grading creative-writing tasks where rubric scoring is poorly defined. Narrative essays score badly under any rule-based rubric, including ours.
For Class 10 Hindi B, 6-mark essay tasks, with the 200-300 word standard length, the engine is well-fit. For Class 9 short stories or Class 11 elective Hindi creative writing, defer to human grading.
## A Real Example: One School's First Pre-Board II Hindi Run
A Bangalore CBSE school onboarded the engine in early October 2025. They submitted 412 Class 10 Hindi B essays from Pre-Board I on November 9. Wall-clock time from upload to per-student feedback report: 47 minutes for the engine pass, plus 18 minutes of human verification on flagged essays (the engine flagged 14 essays as low-confidence). Final correlation against the school's own teacher panel on a 30-essay sample: 0.86.
The Hindi HoD's reaction was the most useful feedback we got: "the per-axis scores told my teachers what to teach next, not just what each student's overall score was." That is the actual point of automated grading — the analytics, not the grade.
## Common Mistakes Building a Hindi NLP Pipeline
Symptom: embedding similarity is noise. Cause: tokeniser broke grapheme clusters or the embedding model was trained on too little Hindi. Fix: use iNLTK + IndicBERT or Sarvam's embeddings; do not use vanilla multilingual BERT.
Symptom: SPAG misses obvious vibhakti errors. Cause: morphological tags not passed to the LLM. Fix: feed Stanza morph tags into the rubric prompt as context.
Symptom: gender agreement gets scored wrong on essays with diverse pronouns. Cause: Hindi gender agreement is complex when the subject is non-binary or collective. Fix: skip the agreement check for collective subjects; cite explicitly when the engine cannot make a determination.
Symptom: Pearson correlation drops sharply when you add a new school. Cause: schools have local stylistic preferences in Hindi marking. Fix: per-school rubric calibration with 30-50 essays graded by both engine and the school's own panel before treating the engine's output as production-grade.
Symptom: feedback text reads stilted in Hindi. Cause: LLM-generated Hindi without a register-specific prompt. Fix: include 3 sample feedback paragraphs (in correct school-teacher register) in the system prompt; specify "informal but respectful, second person, 120-180 words."
## A Question We Get From CBSE School Heads
Why not just hire more Hindi teachers for the grading window?
Three reasons. First, Hindi teachers in most CBSE schools are already at full capacity, so finding 80 marginal hours of grading capacity in November is structurally hard. Second, human grading at scale produces inconsistent feedback: different teachers grade differently, and students get noise rather than signal. Third, the engine produces per-axis analytics in real time, which manual grading cannot match. The engine is not "instead of teachers." It is "freeing teachers from the mechanical grading work so they can focus on the per-axis weakness reports."
For deeper coverage, see our [companion piece on the 5 common mistakes Class 10 students make on the English paper](/blog/class-10-english-pre-board-1-five-mistakes-penleap-data) and our [PenLeap exam-paper generation pipeline post](/blog/penleap-200-custom-practice-papers-pre-board-pipeline) from earlier in November. For the broader engineering picture, our [2025 piece on AI marketing automation](/blog/ai-marketing-automation-2025) covers the architecture pattern we adapt across products. This work falls under our [AI & automation service line](/services/ai-automation).
If you'd rather we build a Hindi grading engine for your edtech platform, [we ship it as a fixed-scope 10-week engagement →](/contact?service=ai).
## FAQ
### How does PenLeap handle Hindi grammar that has no English equivalent?
The SPAG-Hindi rubric explicitly enumerates Hindi-specific grammar items (vibhakti, gender on adjectives/verbs, matra placement). The LLM is given the rubric and the morphological tags from Stanza in the prompt. It cites the offending phrase for every deduction.
### What if a student writes in Hinglish (Hindi-English mix)?
The pipeline detects code-switching at sentence level using a small language-ID model. Pure Hindi sentences go through the Hindi pipeline. English sentences are scored on the English rubric (different scorer, same architecture). Mixed sentences are flagged for human review — the engine declines to score them.
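The routing logic can be illustrated with a script-ratio heuristic standing in for the language-ID model. This is an assumption-heavy sketch: the 0.9 threshold, the function names, and the route labels are hypothetical, and a script check cannot recognise Hindi written in Latin letters, which a real language-ID model would need to handle.

```python
import re


def script_of(ch):
    """Classify a character as Devanagari, Latin, or neither."""
    if "\u0900" <= ch <= "\u097F":
        return "devanagari"
    if ch.isascii() and ch.isalpha():
        return "latin"
    return None


def route_sentence(sentence, threshold=0.9):
    """Return 'hindi', 'english', or 'human_review' for one sentence."""
    counts = {"devanagari": 0, "latin": 0}
    for ch in sentence:
        script = script_of(ch)
        if script:
            counts[script] += 1
    total = counts["devanagari"] + counts["latin"]
    if total == 0:
        return "human_review"
    if counts["devanagari"] / total >= threshold:
        return "hindi"
    if counts["latin"] / total >= threshold:
        return "english"
    return "human_review"


def route_essay(text):
    """Split on sentence-ending punctuation (including the danda) and route each part."""
    sentences = [s for s in re.split(r"(?<=[।.!?])\s+", text.strip()) if s]
    return [(s, route_sentence(s)) for s in sentences]


essay = "मुझे स्कूल जाना है। I like my school. मुझे homework करना है।"
for sentence, route in route_essay(essay):
    print(route, "|", sentence)
```

Declining to score mixed sentences keeps the Hindi and English rubrics from being applied to text neither was calibrated on.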
### Can the engine grade other Indian languages?
Bengali and Marathi are in pilot. Both use the same architecture with language-specific tokenisers (iNLTK supports both) and IndicBERT v2 (which covers them). Tamil and Malayalam are harder because their scripts and morphology are very different from Hindi's; we are evaluating an alternative pipeline using AI4Bharat's IndicNLP tokenisers.
### How do you handle student handwriting in Hindi?
We strongly prefer digital submission. For handwritten Hindi, we use Google Document AI's Devanagari handwriting model — accuracy 88.4% character-level. Essays with OCR confidence below 0.85 are flagged for human verification before grading. Most of our partner schools moved to digital submission in 2024-25 specifically to use the engine.
### What's the licence cost for IndicBERT?
[IndicBERT is open-source](https://huggingface.co/ai4bharat/indic-bert) and free for commercial use. AI4Bharat publishes both the model and the training corpus. The cost in our pipeline is purely compute (GPU time for inference), not licensing.
### Can a school review and override the engine's grades?
Yes. The engine surfaces per-axis scores with cited evidence. A teacher can adjust any score before the report goes to students. We log every override and use the corrections to retrain the rubric weights — the engine gets better the more a school uses it.
### How long until the engine is calibrated for a new school?
Plan 4-6 weeks. The first 1-2 cohorts of essays are graded jointly by the engine and the school's panel; their corrections drive a per-school rubric calibration. After 200-300 jointly-graded essays, the engine is production-grade for that school.
### Does the engine support Hindi A (literature heavy) as well as Hindi B?
Hindi B (language) is fully supported. Hindi A (with literature analysis) needs a richer rubric for literary device analysis (alankar, ras, chhand) — we have a working pipeline but are still calibrating it against three partner schools.
## A Detail That Mattered During Onboarding
In one early onboarding, the engine consistently scored a Bangalore school's Hindi essays 0.5-0.8 marks lower than the school's own panel. We initially assumed model drift. The actual cause: the school taught a regional Hindi register that included Kannada-influenced word choices and idioms. Our SPAG-Hindi rubric, calibrated on Standard Hindi, was flagging these as errors.
We added a "regional variant tolerance" flag to the rubric, took 60 of the school's previously-graded essays, and re-trained the rubric weights with a Karnataka-specific term whitelist (about 240 words and 18 idioms). The score gap closed within two weeks. The lesson: standardised rubrics are a starting point, not an endpoint. Per-school calibration is the difference between "the engine works generally" and "the engine works for this school."
The deeper point connects to our broader stance on AI in education: the rubric is the most important file, and the rubric is local. We can build the engine; the rubric belongs to the educators who use it.