Digital Humanities · OCR · Archives
The Roll of Attorneys is a digitization effort at the Supreme Court of Ohio. The ledgers are handwritten: hundreds of pages of attorney names, Ohio cities, and dates recorded in cursive by clerks across multiple decades. They have been scanned, but scanned images are not searchable data.
When I was first brought onto this project, someone floated the idea of transcribing six of these books by hand. I said absolutely not, smiled politely, and went home to figure out how to make a machine do it instead. This post documents that process: fine-tuning a custom OCR model on handwritten attorney admission entries and running it against real ledger pages for the first time.
The Records and Why They Matter
Court records are some of the most useful primary sources in American legal history. Attorney admission ledgers, the books that document when and where each lawyer was sworn into the bar, sit at an intersection of legal, biographical, and social history. They can answer questions like: when did women begin appearing on Ohio’s bar rolls? How did the geographic distribution of attorneys shift across the late nineteenth and early twentieth centuries? Which attorneys appear in both bar records and notable case histories?
Right now, those questions require a researcher to physically turn pages. The goal of this project is to change that: transform page images into structured, queryable text without losing the nuance that human transcription would preserve. OCR is the obvious tool. The problem is that standard OCR was built for print. Cursive handwriting, especially across multiple clerks and decades, is a different problem entirely.
Kraken, Not Tesseract
For this project I chose Kraken, an open-source OCR engine developed with historical documents specifically in mind. Unlike Tesseract, which was designed for printed text, Kraken uses a neural network architecture that can be trained or fine-tuned on custom handwriting samples. It is used widely in the digital humanities community for exactly this kind of archival work.
Kraken’s pipeline has two stages: segmentation, which finds where the lines of text are on a page, and recognition, which reads each line. For this experiment I used Kraken’s built-in blla.mlmodel for segmentation, loaded explicitly as a workaround for a Kraken v6 path change, and focused all the training effort on a custom recognition model for the ledger’s particular handwriting.
Building the Training Data
The foundation of any supervised machine learning model is labeled data. For handwriting recognition, that means image-transcription pairs: a cropped image of a single line, and the exact text it contains. Each cursive image lives in one folder, its typed transcription in another, matched by filename. The training script pairs them up, copies everything into a ground truth directory, and moves on to training.
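A minimal sketch of that pairing step, assuming the `_cursive.png` / `_typed.txt` naming convention visible in the log below (the folder layout and function name are illustrative, not the actual script):

```python
import shutil
from pathlib import Path

def prepare_ground_truth(img_dir, txt_dir, gt_dir):
    """Pair line images with transcriptions by shared filename stem and
    copy matched pairs into a Kraken-style ground truth directory."""
    gt = Path(gt_dir)
    gt.mkdir(exist_ok=True)
    prepared, skipped = 0, 0
    for img in sorted(Path(img_dir).glob("*_cursive.png")):
        txt = Path(txt_dir) / img.name.replace("_cursive.png", "_typed.txt")
        if not txt.exists():
            continue
        label = txt.read_text().strip()
        if not label:  # empty transcription file: skip and report
            print(f"SKIP - empty label: {txt.name}")
            skipped += 1
            continue
        shutil.copy(img, gt / img.name)
        # Kraken's 'path' format expects image.png + image.gt.txt side by side
        (gt / img.name.replace(".png", ".gt.txt")).write_text(label)
        print(f'ok {img.name} "{label}"')
        prepared += 1
    return prepared, skipped
```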
I assembled 71 such pairs from existing ledger pages, manually cropping individual entries and typing the corresponding transcriptions. Yes, I did do some manual transcription in the end. This is called irony.
```
ok  07_01_cursive.png  "Oscar M. Abt Canton O."
ok  07_02_cursive.png  "Harry D Auman Potsdam O"
ok  07_05_cursive.png  "Walter S. Adams Cleveland O."
ok  08_02_cursive.png  "Wm Agnew 4145 E 95th Street Cleveland Ohio"
ok  09_07_cursive.png  "Gary Rudolph Alburn Cleveland, Ohio"
SKIP - empty label: 19_07_typed.txt
71 pairs prepared, 1 skipped
```
The entries are predominantly attorney names paired with Ohio cities, which is actually a narrow enough domain to be an advantage. A model trained specifically on this writing style and vocabulary can outperform a general-purpose model that has never seen nineteenth-century Ohio cursive. The train/validation split was 90/10, computed automatically from the image list: 64 pairs for training, 7 held out for validation.
I fine-tuned from McCATMuS, a community-trained model for historical handwritten text. Fine-tuning rather than training from scratch meant I could work with a small dataset and still produce something usable. Training ran for 20 epochs using KrakenTrainer with resize=’both’ to handle any size mismatches between the base model’s expectations and the ledger images. Kraken saves a numbered checkpoint each time validation accuracy improves; when training finished, the script picked the highest-numbered checkpoint and saved it as attorney_best.mlmodel.
```python
rec_model = RecognitionModel(
    output=model_output,
    model=BASE_MODEL,          # McCATMuS base
    training_data=train_imgs,
    evaluation_data=val_imgs,
    resize='both',
    format_type='path',
    partition=0.9,
    num_workers=0,
)
trainer = KrakenTrainer(
    min_epochs=10,
    max_epochs=EPOCHS,         # 20
    enable_progress_bar=True,
)
trainer.fit(rec_model)
```
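The checkpoint-picking step at the end can be sketched like this. The `model_N.mlmodel` naming is an assumption based on Kraken's numbered-checkpoint behavior, and the function name is illustrative:

```python
import re
import shutil
from pathlib import Path

def pick_best_checkpoint(out_dir, final_name="attorney_best.mlmodel"):
    """Select the highest-numbered checkpoint in the output directory.
    Assumes Kraken saved checkpoints as model_0.mlmodel, model_1.mlmodel, ...
    each time validation accuracy improved."""
    pattern = re.compile(r"model_(\d+)\.mlmodel$")
    checkpoints = []
    for p in Path(out_dir).glob("*.mlmodel"):
        m = pattern.match(p.name)
        if m:
            checkpoints.append((int(m.group(1)), p))
    if not checkpoints:
        raise FileNotFoundError("no numbered checkpoints found")
    _, best = max(checkpoints)           # highest number = last improvement
    dest = Path(out_dir) / final_name
    shutil.copy(best, dest)
    return dest
```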
Model Architecture
The model is a standard CRNN (Convolutional Recurrent Neural Network) with 4.0 million parameters. The convolutional front end extracts visual features through four ActConv2D blocks with MaxPool downsampling and Dropout regularization. Those feature maps get passed through three bidirectional LSTM layers, which model the sequential left-to-right structure of cursive text. A final LinSoftmax layer outputs probabilities over 117 character classes.
Training uses CTC loss (Connectionist Temporal Classification), which handles sequence recognition tasks where the alignment between input features and output characters is not known in advance. This is exactly the situation with cursive, where letter boundaries are ambiguous and the clerks apparently had strong opinions about where one letter ended and another began. Apple’s MPS backend does not support CTC natively, so that computation fell back to CPU. Slower, but correct.
Model layers at a glance
ActConv2D x4 with MaxPool + Dropout: Four convolutional blocks extract visual features from the line image. MaxPool downsamples between stages. Dropout regularizes during training to reduce overfitting on a small dataset.
Bidirectional LSTM x3: Three stacked bidirectional LSTM layers read the feature sequence left-to-right and right-to-left simultaneously. This matters for cursive, where letter forms depend heavily on surrounding context.
LinSoftmax output: A linear layer followed by softmax outputs a probability distribution over 117 character classes at each time step. CTC loss then aligns those probabilities to the ground truth transcription without requiring per-character segmentation.
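Kraken describes architectures like this with compact VGSL spec strings. Purely as an illustration of the shape described above, not the actual McCATMuS spec, a network in this family looks like:

```
[1,120,0,1 Cr3,13,32 Do0.1,2 Mp2,2 Cr3,13,32 Do0.1,2 Mp2,2
 Cr3,9,64 Do0.1,2 Mp2,2 Cr3,9,64 Do0.1,2 S1(1x0)1,3
 Lbx200 Do0.1,2 Lbx200 Do0.1,2 Lbx200 Do0.1,2]
```

Reading left to right: `Cr` is a convolution with ReLU, `Do` dropout, `Mp` max pooling, `S` a reshape that folds the remaining height into the feature dimension, and `Lbx` a bidirectional LSTM. Kraken appends the final linear/softmax layer automatically, sized to the training alphabet, which is where the 117 classes come from.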
Running It Against Real Pages
The transcription script loads attorney_best.mlmodel, segments each page into lines using the baseline model, then runs rpred.rpred across every detected line. Each line gets a confidence score averaged across its characters; anything below 0.70 is flagged. Output is saved as one plain .txt file per page.
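The confidence logic is simple enough to sketch in plain Python. The 0.70 threshold comes from the script described above; the record structure here is a simplified stand-in for Kraken's ocr_record, which exposes a prediction string and per-character confidences:

```python
LOW_CONFIDENCE = 0.70  # lines below this get flagged for human review

def score_line(char_confidences):
    """Average per-character confidences into one line-level score."""
    if not char_confidences:
        return 0.0
    return sum(char_confidences) / len(char_confidences)

def format_page(records):
    """Render (prediction, confidences) pairs as flagged .txt output."""
    lines = []
    for prediction, confs in records:
        score = score_line(confs)
        flag = "  <-- LOW CONFIDENCE" if score < LOW_CONFIDENCE else ""
        lines.append(f"[{score:.2f}] {prediction}{flag}")
    return "\n".join(lines)
```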
Three test pages: page-075, page-076, and page-077. The segmentation model detected 65, 5, and 47 lines respectively. Page 076’s five lines are not a content issue. That page has a different layout and the segmenter largely missed it. Segmentation and recognition are separate problems, and a weak segmenter limits a strong recognizer.
```
[0.99] Cereland Ohio
[1.00] Clerrlar Oliio
[0.97] Clerelargl O
[0.96] Columbus Ohio
[0.97] Solerte, Ohio
[0.98] Apkron Ohiio
[1.00] Arngsurlel, Ohio
[0.96] Cohrigphus Ohio
```
The model has broadly learned what these entries look like. It knows it is reading names and cities. It is producing plausible character sequences. But letter-level accuracy still needs work. Cleveland becomes Cereland or Clerelargl. Akron becomes Apkron. Columbus, notably, comes through clean.
Confidence scores are high across the board, most lines above 0.90, which tells me the model is reading something. High confidence and high accuracy are not the same thing when the training set is small. The model is very sure of itself. The model is also wrong about Cleveland.
Where the Gaps Are
Known limitations going in
Dataset size: 71 pairs is a proof of concept, not a production dataset. Most serious handwriting recognition projects need thousands of labeled examples to generalize well. The model has learned patterns from a narrow slice of the ledger and does not have enough variety to handle all clerk hands, page conditions, or edge cases. More data is the answer to most of the problems here.
Alphabet gaps: During training, Kraken flagged a mismatch: 18 characters appeared only in training and not in validation, and one appeared only in validation. Accuracy metrics during training were incomplete as a result. The model may be better or worse than the validation scores suggest on unseen pages.
Legacy segmentation: The base McCATMuS model predates Kraken's current polygon extraction method, which triggers a deprecation warning at inference time and leaves some segmentation performance on the table. The workaround of explicitly loading the segmentation model handles the Kraken v6 path issue, but retraining from a newer base model is the real fix.
The CoreML warning: On Apple Silicon, coremltools throws a runtime warning about not being able to run predict() on the compiled model. This does not affect Kraken's inference path. Kraken uses PyTorch, not CoreML. It is noisy but not a real error.
What’s Next
More training data is the single highest-leverage thing I can do. I am aiming for several hundred additional pairs, targeting pages with different clerks and different decades to improve generalization. Yes, this means more manual transcription. The irony continues to deepen.
A post-processing step is also on the list. Many of the errors are phonetically or visually plausible confusions: Cereland for Cleveland, Apkron for Akron. Matching raw output against a known vocabulary of Ohio cities and common Ohio surnames could substantially improve downstream accuracy without requiring more training data.
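A first cut at that post-processing step can lean on the standard library. This sketch uses difflib against a toy vocabulary; the real list would come from a full gazetteer of Ohio places, and the cutoff value is a guess to be tuned:

```python
import difflib

# Toy vocabulary for illustration; the real list would be much longer
OHIO_CITIES = ["Cleveland", "Columbus", "Akron", "Canton", "Toledo", "Dayton"]

def correct_city(token, vocabulary=OHIO_CITIES, cutoff=0.6):
    """Snap an OCR token to the closest known city name, if one is
    close enough. Returns the original token when nothing matches."""
    matches = difflib.get_close_matches(token, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else token
```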
The end goal is not just transcribed text. It is structured data: name, city, date of admission, page number. Once transcription quality is sufficient, a parsing layer will extract those fields and populate a database that researchers can actually query. And once the model and pipeline are stable, batch transcription across the full scan set begins. Six books, machine-assisted. That is the plan.
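A rough sketch of that parsing layer, peeling tokens off the right edge of a transcribed line. The trailing-state assumption ("Ohio", "O.", or "O") is based on the sample entries; street addresses and admission dates would need more rules than this sketch has:

```python
from dataclasses import dataclass

STATE_TOKENS = {"Ohio", "O", "O."}  # forms seen in the sample entries

@dataclass
class Entry:
    name: str
    city: str
    state: str

def parse_entry(line):
    """Split a transcribed ledger line into name / city / state.
    Assumes the entry ends '<city> <state>'; returns None (flag for
    manual review) when the line does not fit that pattern."""
    tokens = line.replace(",", "").split()
    if len(tokens) < 3 or tokens[-1] not in STATE_TOKENS:
        return None
    return Entry(name=" ".join(tokens[:-2]), city=tokens[-2], state=tokens[-1])
```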
Thanks for reading my brain rambles.

