We worked with MSU to scrape MDAH images and digitize the text in those images, then link every student in the records so that their school journey could be traced.

Tech Stack

Python via CLI
Amazon Textract

MDAH Challenges

We had to scrape 211,000 images recorded between 1885 and 1957.

The challenge in scraping this many images was keeping our IP addresses from being blocked by the website, so we had to build a custom system.
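A minimal sketch of one common way to avoid being blocked while scraping at this scale: pause a randomized amount before each request and back off exponentially when a request fails. This is an illustration, not the custom system itself. The function names, the delay values, and the injectable `fetch` and `sleep` parameters are all assumptions made for the example.

```python
import random
import time
import urllib.error
import urllib.request


def backoff_delays(base=1.0, factor=2.0, retries=4, cap=60.0):
    """Return the wait in seconds before each retry: base, base*factor, ... capped."""
    return [min(cap, base * factor ** i) for i in range(retries)]


def fetch_politely(url, fetch=None, sleep=time.sleep, pause=1.0, jitter=0.5, retries=4):
    """Download one URL, pausing before the request and backing off on errors.

    `fetch` and `sleep` are hypothetical hooks so the logic can be exercised
    without touching the network.
    """
    if fetch is None:
        # Default fetcher: plain stdlib HTTP GET returning the response bytes.
        def fetch(u):
            with urllib.request.urlopen(u, timeout=30) as resp:
                return resp.read()

    # Randomized pause so requests do not arrive at a fixed rhythm.
    sleep(pause + random.uniform(0, jitter))

    delays = backoff_delays(retries=retries)
    for attempt in range(retries + 1):
        try:
            return fetch(url)
        except (urllib.error.URLError, TimeoutError):
            if attempt == retries:
                raise  # give up after the last retry
            sleep(delays[attempt])  # wait longer after each failure
```

With the defaults, a failing request is retried after 1, 2, 4, and 8 seconds before the error is raised to the caller.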

We had to digitize the text in all 211,000 images, some of them handwritten and some dating back to 1885.

Digitizing handwriting is hard enough, but handwriting from the 1800s and early 1900s is even harder because handwriting styles have changed considerably since then.
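A minimal sketch of how a single image might be sent to Amazon Textract and how low-confidence lines could be set aside for manual review, which matters most for old handwriting. It assumes the `boto3` client's `detect_document_text` call and the `Blocks`, `BlockType`, `Text`, and `Confidence` fields of its response. The 80.0 threshold, the region, and both function names are assumptions chosen for illustration.

```python
def extract_lines(response, min_confidence=80.0):
    """Split Textract LINE blocks into accepted text and lines needing review."""
    accepted, review = [], []
    for block in response.get("Blocks", []):
        if block.get("BlockType") != "LINE":
            continue  # skip PAGE and WORD blocks
        confident = block.get("Confidence", 0.0) >= min_confidence
        (accepted if confident else review).append(block.get("Text", ""))
    return accepted, review


def ocr_image(path, region_name="us-east-1"):
    """Run Textract text detection on one local image file."""
    import boto3  # imported lazily so extract_lines works without AWS set up

    client = boto3.client("textract", region_name=region_name)
    with open(path, "rb") as f:
        response = client.detect_document_text(Document={"Bytes": f.read()})
    return extract_lines(response)
```

Keeping the response parsing separate from the API call makes it possible to tune the confidence threshold against saved responses without paying for repeated OCR runs.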

We had to link relatives and students effectively over the years.

We had to create algorithms that generate a probability score that people within a county are siblings or other relatives.
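A minimal sketch of what a sibling score could look like, assuming each record carries a surname, a guardian name, a county, and a birth year. These field names, the weights, and the 15-year age window are all hypothetical, and the result is a heuristic score between 0 and 1, not a calibrated probability or the algorithm actually used on the project.

```python
from difflib import SequenceMatcher


def name_similarity(a, b):
    """Fuzzy string similarity in [0, 1], tolerant of OCR and spelling variation."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def sibling_score(rec_a, rec_b, max_age_gap=15):
    """Heuristic 0..1 score that two student records belong to siblings."""
    # Only compare people recorded in the same county.
    if rec_a["county"] != rec_b["county"]:
        return 0.0

    surname = name_similarity(rec_a["surname"], rec_b["surname"])
    guardian = name_similarity(rec_a["guardian"], rec_b["guardian"])

    # Siblings are usually close in age; the score fades to 0 at max_age_gap years.
    gap = abs(rec_a["birth_year"] - rec_b["birth_year"])
    age = max(0.0, 1.0 - gap / max_age_gap)

    # Hypothetical weights: names carry most of the evidence.
    return 0.4 * surname + 0.4 * guardian + 0.2 * age
```

Fuzzy matching is used instead of exact comparison because the same family name can be transcribed differently from one year's record to the next.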

