We worked with MSU to scrape MDAH images and digitize the text in those images, then link every student in the records so that their school journey could be traced.
Tech Stack
Python via CLI
Amazon Textract
MDAH Challenges
We had to scrape 211,000 images recorded between 1885 and 1957.
The challenge of scraping this many images was keeping our IP addresses from being blocked by the website, so we had to build a custom scraping system.
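One common way to lower the risk of being blocked is to retry failed requests with exponential backoff and random jitter. The sketch below is a minimal illustration of that idea, not the custom system built for the project; the function names, retry count, and delays are all assumptions chosen for the example.

```python
import random
import time
import urllib.error
import urllib.request


def download(url, timeout=30):
    """Fetch one image over HTTP and return its bytes."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read()


def fetch_politely(url, fetch=download, retries=4, base_delay=1.0,
                   sleep=time.sleep):
    """Fetch url, waiting longer after each failure before retrying.

    `fetch` and `sleep` are injectable so the retry logic can be
    exercised without touching the network (hypothetical design choice).
    """
    for attempt in range(retries + 1):
        try:
            return fetch(url)
        except (urllib.error.URLError, TimeoutError):
            if attempt == retries:
                raise  # out of retries: surface the error to the caller
            # Wait 1s, 2s, 4s, ... plus up to 50% random jitter.
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, delay / 2))
```

The jitter keeps requests from arriving at fixed, machine-like intervals, and the growing delay gives the server room to recover when it starts refusing connections.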
We had to digitize the text within 211,000 images, some of them handwritten and some dating back to 1885.
Digitizing handwriting is hard enough, but handwriting from the 1800s and early 1900s is even harder because handwriting styles have changed considerably since then.
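As a rough illustration of how a single image might be sent to Amazon Textract from Python, the sketch below calls the `detect_document_text` operation through boto3 and then separates the detected lines by confidence, so that low-confidence handwriting can be routed to manual review. The helper names and the threshold of 80 are assumptions for this example, not details from the project.

```python
def ocr_image(image_bytes, client=None):
    """Send one image to Amazon Textract and return the raw response."""
    if client is None:
        import boto3  # needs AWS credentials and a configured region
        client = boto3.client("textract")
    return client.detect_document_text(Document={"Bytes": image_bytes})


def split_lines_by_confidence(response, threshold=80.0):
    """Split detected text lines into confident ones and ones to review.

    `threshold` is a hypothetical cutoff on Textract's 0-100 confidence.
    """
    confident, review = [], []
    for block in response.get("Blocks", []):
        if block.get("BlockType") != "LINE":
            continue  # skip PAGE and WORD blocks
        is_confident = block.get("Confidence", 0.0) >= threshold
        target = confident if is_confident else review
        target.append(block.get("Text", ""))
    return confident, review
```

A typical call would read an image file in binary mode, pass the bytes to `ocr_image`, and hand the response to `split_lines_by_confidence`.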
We had to link students and their relatives reliably across the years.
We had to create algorithms that generate a probability score that two people within a county are siblings or other relatives.
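To give a sense of what such a score can look like, here is a deliberately simple sketch that combines surname similarity, a county match, and a plausible age gap into one number between 0 and 1. It is not the algorithm used in the project: the record fields, the weights, and the 12-year sibling window are assumptions made up for illustration, and the result is a heuristic score rather than a calibrated probability.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class Record:
    # Hypothetical fields; the real records may be structured differently.
    first_name: str
    last_name: str
    county: str
    record_year: int  # year the school record was written
    age: int          # age of the student in that year


def name_similarity(a, b):
    """Return a 0-1 similarity that tolerates small spelling differences."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def sibling_score(r1, r2, max_gap=12):
    """Heuristic 0-1 score that two records belong to siblings."""
    surname = name_similarity(r1.last_name, r2.last_name)
    county = 1.0 if r1.county.lower() == r2.county.lower() else 0.0
    # Compare implied birth years so records from different years line up.
    gap = abs((r1.record_year - r1.age) - (r2.record_year - r2.age))
    age_fit = 1.0 if gap <= max_gap else 0.0
    # Illustrative weights only; they were not tuned on any data.
    return 0.6 * surname + 0.2 * county + 0.2 * age_fit
```

Fuzzy surname matching matters here because OCR errors and inconsistent spelling in old records mean that exact string comparison would miss many true relatives.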