OCR and non-Latin text

The UBC Digitization Centre has created more than 50 collections, all available through the Open Collections website. The collections vary in format, content and language.

Materials that are not in English, or that are not written in a Latin-based alphabet, can be harder to access and search. Technology can help us reduce these barriers.

Laura Ferris and Rebecca Dickson, from the Digitization Centre, have worked out a process to generate searchable transcripts for non-Latin text. The idea originated from an article about a workshop on Optical Character Recognition for Bangla. The workshop concluded that Google Drive was the most accurate tool for generating transcripts of non-Latin text.

With that information in hand, Ferris and Dickson started to explore Google Drive to create an automated workflow for transcribing batches of items.

Are you interested in trying the workflow out for yourself? If so, follow the instructions that Rebecca prepared and give it a try!

1. Open Google Drive, create a “New folder” and rename it
2. Create a Google Sheet inside the folder
3. Open the Sheet, click on “Share”, then “Receive shared link”, and look for the sheet identifier (the numbers and letters between /d/ and /edit?)
4. In the Sheet, under the “Tools” menu, click “Script editor”
5. Paste the content from “gs” into the script editor
6. Update “folderName” with the name of your folder (defined in step 1)
7. Update “sheetId” with the identifier that you found in step 3
8. Click the “clock” icon and select the options “extractTextOnOpen”, “From spreadsheet” and “On open”
9. Save the script and close the script editor
10. Upload JPEG images to the folder (you can check out the sample items prepared for this work)
11. Open the spreadsheet and wait for Google to do the work!
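
If you would rather not pick the sheet identifier out of the link by eye (step 3), the lookup can be sketched in a few lines of Python. This is only an illustration and is not part of Rebecca's script: the function name `extract_sheet_id` and the example URL are made up, and the pattern assumes the identifier may contain hyphens and underscores as well as letters and numbers.

```python
import re
from typing import Optional

# Captures the run of characters that follows "/d/" in a Google Sheets URL.
# Assumption: identifiers use letters, digits, hyphens and underscores.
_SHEET_ID_PATTERN = re.compile(r"/d/([A-Za-z0-9_-]+)")


def extract_sheet_id(url: str) -> Optional[str]:
    """Return the identifier found after "/d/" in the URL, or None if absent."""
    match = _SHEET_ID_PATTERN.search(url)
    return match.group(1) if match else None


# Hypothetical link, for illustration only.
example_url = "https://docs.google.com/spreadsheets/d/1AbC_dEf-123/edit?usp=sharing"
print(extract_sheet_id(example_url))
```

The value printed is what you would paste into “sheetId” in step 7. If the function returns None, the link is probably not a direct link to the Sheet.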

To learn more, see Laura and Rebecca’s presentation slides on the topic. If you have questions, feel free to contact us.

Sources:

A workshop on Optical Character Recognition for Bangla (British Library)

OCR for non-English language text (Pixelating)

Pixelating-ocr (GitHub)
