Presentation of Ethiopic Script OCR App, with Next.js front end and FastAPI backend
Added: Mar 01, 2025
Slides: 25 pages
By Girma Eshete aka Menelik Berhan
About Myself I’m from Addis Ababa, Ethiopia. I studied Civil Engineering at Addis Ababa University and am now learning at @ALXSE to become a full-stack developer. I’m the sole member of the team.
Background I’m a long-time, self-confessed history ‘fanatic’, specifically of African history. In pursuing this interest, I try to read old Ethiopian historical documents written in Amharic and Ge’ez (both use the Ethiopic script). While reading old documents written in Amharic, I need to refer to Amharic-Amharic dictionaries for the meaning of words I don’t understand. And when reading documents written in Ge’ez, I need to refer to Ge’ez-Amharic dictionaries for the meaning of almost every word.
Background… But repeatedly searching for words in these dictionaries is an arduous task, mainly because: most of the comprehensive Amharic-Amharic and Ge’ez-Amharic dictionaries were written at least a century ago and can only be found as PDFs made from scanned images; and these dictionaries use an alphabet with an A, B, G, D … sequence instead of the current H, L, M … sequence. For these reasons I had yearned for digitized versions of these dictionaries long before I started my programming studies.
Background… And as fate would have it, the stars aligned and I got the chance to enroll at @ALXSE. I’ve developed this CLI app, which performs OCR on images and PDFs containing Ethiopic script, as a first stage in building web/mobile dictionary apps from scanned PDFs of old dictionaries. Due to the irregularities in those old dictionaries, I want to perfect the OCR process as much as possible before moving on to developing apps with a GUI.
Architecture
Technologies Command Line Interface (CLI) I’ve used Python’s cmd module to provide a CLI for the app. I chose cmd because it provides an easily understandable structure along with a means of providing detailed help for each command. The CLI has three basic commands:
- image: performs OCR on image files
- pdf: performs OCR on PDF files
- default: displays and sets default parameters
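A minimal sketch of how the three commands could be laid out with the cmd module. The class name, prompt, `history` list and the `dpi` default are hypothetical placeholders, not the app's actual code; the command bodies only record what was asked instead of running OCR.

```python
import cmd


class OcrShell(cmd.Cmd):
    """Hypothetical shell exposing the image, pdf and default commands."""

    intro = "Ethiopic OCR shell. Type help or ? to list commands."
    prompt = "(ocr) "

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.defaults = {"lang": "amh"}  # illustrative default parameters
        self.history = []                # records requested jobs (placeholder)

    def do_image(self, arg):
        """image PATH [options]: perform OCR on an image file."""
        self.history.append(("image", arg))
        print(f"Would run OCR on image: {arg}")

    def do_pdf(self, arg):
        """pdf PATH [options]: perform OCR on a PDF file."""
        self.history.append(("pdf", arg))
        print(f"Would run OCR on PDF: {arg}")

    def do_default(self, arg):
        """default [KEY VALUE]: show the defaults, or set one of them."""
        parts = arg.split()
        if len(parts) == 2:
            self.defaults[parts[0]] = parts[1]
        print(self.defaults)

    def do_quit(self, arg):
        """quit: leave the shell."""
        return True  # a true return value stops cmdloop()
```

Calling `OcrShell().cmdloop()` starts the interactive prompt, and the docstring of each `do_*` method is what cmd shows for `help image`, `help pdf` and so on.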
Technologies… PARSE INPUT For parsing the user input and storing the parameters in a dictionary, I’ve used the argparse module. This is a suitable choice since, as the official doc says, “… argparse makes it easy to write user-friendly command-line interfaces. The program defines what arguments it requires, and argparse will figure out how to parse those out of sys.argv”.
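As a rough illustration, a command line typed inside the shell can be split with shlex and handed to argparse, with the resulting namespace converted into a dictionary. The option names (`--output`, `--lang`, `--join`) are assumptions made for this sketch, not the app's real flags.

```python
import argparse
import shlex


def build_image_parser():
    """Build a parser for a hypothetical 'image' command."""
    parser = argparse.ArgumentParser(
        prog="image", description="Perform OCR on an image file or directory."
    )
    parser.add_argument("input", help="path to an image file or a directory")
    parser.add_argument(
        "-o", "--output", choices=["stdout", "txt", "docx", "pdf"],
        default=None, help="output type (falls back to the stored default)",
    )
    parser.add_argument("-l", "--lang", default=None, help="Tesseract language code")
    parser.add_argument(
        "-j", "--join", action="store_true",
        help="join the results of multiple inputs into one file",
    )
    return parser


def parse_line(line):
    """Parse the text typed after the command name into a plain dictionary."""
    namespace = build_image_parser().parse_args(shlex.split(line))
    return vars(namespace)
```

Because argparse exits the program on a parsing error, a shell built on cmd would typically wrap `parse_line` in a `try/except SystemExit` so that a typo does not close the session.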
Technologies… PDF Page Loading For PDF files, the pdf2image module’s convert_from_path function is used to convert the PDF into a list of images. pdf2image, a Python wrapper around the Poppler PDF utilities, was selected for its reliability and ease of use: Poppler has been around for well over a decade and is still consistently maintained, and pdf2image gives it an easy Python interface. After the list of images is generated, each image is stored in a bytes array, and NumPy is used to pass the array to OpenCV.
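The hand-off from pdf2image to OpenCV might look like the sketch below. It assumes pdf2image (with Poppler installed) and NumPy are available; the function names and the 300 dpi value are illustrative choices. pdf2image returns PIL images in RGB order, while OpenCV expects BGR arrays, so the channels are reversed.

```python
import numpy as np


def pil_to_bgr(page):
    """Convert an RGB PIL image (or RGB array) into a BGR array for OpenCV."""
    rgb = np.asarray(page)
    # Reverse the channel axis (RGB -> BGR) and make the result contiguous,
    # since OpenCV functions expect contiguous arrays.
    return np.ascontiguousarray(rgb[:, :, ::-1])


def load_pdf_pages(pdf_path, dpi=300):
    """Render every page of a PDF and return a list of BGR NumPy arrays."""
    # Imported lazily so the helper above works without pdf2image installed.
    from pdf2image import convert_from_path

    pages = convert_from_path(pdf_path, dpi=dpi)  # list of PIL images
    return [pil_to_bgr(page.convert("RGB")) for page in pages]
```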
Technologies… IMAGE PRE-PROCESSING To edit the images before OCR and get more accurate results, I’ve used OpenCV-Python, a library of Python bindings designed to solve computer vision problems. I opted for OpenCV because of the better functionality and performance it provides compared to other viable choices such as the Python Imaging Library (PIL). OpenCV provides advanced computer vision algorithms and extensive image manipulation tools, so it is likely the better choice when an application requires complex image processing tasks such as object detection or feature extraction. Furthermore, OpenCV is implemented in C++ with Python bindings and is generally optimized for performance, which makes it suitable for real-time applications or scenarios that require high-speed processing.
Technologies… OCR ENGINE Python-tesseract, a Python wrapper for Google’s open-source Tesseract-OCR engine, is used for performing the actual optical character recognition. There are alternative OCR engines, some with better accuracy, like Amazon’s Rekognition API. But that comes with trade-offs:
- It requires an internet connection to access the API.
- Access requires an Amazon Web Services account, which in turn requires a valid credit card.
- The free tier of the API service allows only a limited number of API calls per month.
- The network connection adds latency, so results take longer.
For these reasons I’ve chosen Tesseract, which offers support for over 100 languages. Tesseract has improved significantly over the years, provides accurate text recognition for various document types, and can be trained on custom data.
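A hedged sketch of calling Tesseract through pytesseract. The helper names are hypothetical and the values are only examples: `--oem 1` selects the LSTM engine, `--psm 6` treats the page as a single uniform block of text, and `amh` is Tesseract's language code for Amharic. It assumes pytesseract, the Tesseract binary and the Amharic language data are installed.

```python
def build_config(oem=1, psm=6, extra=None):
    """Build a Tesseract config string from a few commonly tuned settings."""
    parts = [f"--oem {oem}", f"--psm {psm}"]
    # Extra Tesseract parameters are passed as '-c name=value' pairs.
    for name, value in (extra or {}).items():
        parts.append(f"-c {name}={value}")
    return " ".join(parts)


def run_ocr(image, lang="amh", config=None):
    """Return the text recognized in an image (PIL image or NumPy array)."""
    # Imported lazily so build_config can be used without pytesseract installed.
    import pytesseract

    return pytesseract.image_to_string(
        image, lang=lang, config=config or build_config()
    )
```

For example, `build_config(extra={"preserve_interword_spaces": 1})` yields a config string that also asks Tesseract to keep the spacing between words.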
Technologies… OUTPUT to FILE I’ve used the docx module to write to MS Word files and the fpdf module to write to PDF files. Their comparatively easy learning curve and better availability are the main reasons for these choices. In addition, they have easily configurable parameters for output formatting, and there is plenty of documentation available for both since they have been in use for a while.
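As an illustration of the file-output step, the sketch below dispatches on the file extension and covers only plain text and MS Word output. The function name and the paragraph-splitting rule are assumptions; the .docx branch relies on python-docx (`Document`, `add_paragraph`, `save`). PDF output with fpdf is left out here because rendering Ethiopic text additionally requires registering a Unicode font.

```python
from pathlib import Path


def write_output(text, path):
    """Write OCR text to a .txt or .docx file and return the output path."""
    path = Path(path)
    suffix = path.suffix.lower()
    if suffix == ".txt":
        # UTF-8 is needed so that Ethiopic characters are stored correctly.
        path.write_text(text, encoding="utf-8")
    elif suffix == ".docx":
        # Imported lazily so plain-text output works without python-docx.
        from docx import Document

        document = Document()
        for paragraph in text.split("\n\n"):  # blank line = new paragraph
            document.add_paragraph(paragraph)
        document.save(str(path))
    else:
        raise ValueError(f"Unsupported output type: {suffix!r}")
    return path
```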
Features
- Perform OCR on image files (jpeg, jpg, png)
- Perform OCR on PDF files
- Accept a directory as input
- Output the OCR result to stdout, a plain text file (txt), an MS Word file (docx) or a PDF file
- Option to join the output into one file for multiple input files
- Option to set Tesseract configuration and output file formatting parameters
- Option to display the average character recognition confidence level
- Option to select either simple or detailed image preprocessing
- Set and display default parameters (such as input/output directory, Tesseract configuration parameters, and formatting and style for output files)
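One possible way to compute the average confidence listed among the features above, shown as a sketch rather than the app's actual method. It assumes the dictionary returned by `pytesseract.image_to_data(..., output_type=pytesseract.Output.DICT)`, which reports one confidence per recognized word (and -1 for boxes that hold no text), so the result is a word-level approximation of recognition confidence.

```python
def mean_confidence(data):
    """Average the confidences of non-empty words in Tesseract's data dict."""
    scores = []
    for text, conf in zip(data["text"], data["conf"]):
        conf = float(conf)  # some versions report confidences as strings
        # Skip layout-only boxes (conf == -1) and blank entries.
        if conf >= 0 and str(text).strip():
            scores.append(conf)
    return sum(scores) / len(scores) if scores else 0.0


def ocr_confidence(image, lang="amh"):
    """Run Tesseract on an image and return the mean word confidence."""
    # Imported lazily so mean_confidence can be used without pytesseract.
    import pytesseract

    data = pytesseract.image_to_data(
        image, lang=lang, output_type=pytesseract.Output.DICT
    )
    return mean_confidence(data)
```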
Code snippet… validate_parsed_cmd(parsed_dict):
- Check for valid syntax
- Check for required arguments
- Check for a proper combination of options and arguments
- Set values from defaults
- return(validated_dict)
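The validation steps listed above could be sketched as follows. Every key name, default value and rule here (for example, that joining needs a file output) is a hypothetical stand-in chosen for illustration; the real app's options may differ.

```python
# Hypothetical defaults and rules, used only for this sketch.
DEFAULTS = {"output": "stdout", "lang": "amh", "join": False}
REQUIRED = ("input",)
OUTPUT_TYPES = {"stdout", "txt", "docx", "pdf"}


def validate_parsed_cmd(parsed_dict, defaults=DEFAULTS):
    """Return a validated copy of parsed_dict with defaults filled in."""
    # Required arguments must be present and non-empty.
    missing = [key for key in REQUIRED if not parsed_dict.get(key)]
    if missing:
        raise ValueError(f"Missing required argument(s): {', '.join(missing)}")

    # Start from the defaults; values the user actually supplied override them.
    validated = dict(defaults)
    validated.update({k: v for k, v in parsed_dict.items() if v is not None})

    # Check individual values.
    if validated["output"] not in OUTPUT_TYPES:
        raise ValueError(f"Unknown output type: {validated['output']}")

    # Check combinations of options: joining only makes sense for files.
    if validated["join"] and validated["output"] == "stdout":
        raise ValueError("--join requires a file output (txt, docx or pdf)")

    return validated
```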
Code snippet… process_image(validated_dict):
- Load image using input path
- Perform pixel, contrast, threshold, … adjustments
- return(processed_image)
Process and timeline
- Research OCR engines and select the pragmatic choice
- Research Tesseract
- Research image preprocessing
- Research fonts
- Code an initial draft Python script
- Test using samples from old dictionaries
- Refactor into separate functions
- Add a CLI using cmd
- Add a parser using argparse, and validation using functions
- Do more research on image preprocessing and add more adjustments
- Refine parameter values for better accuracy through more testing
- Clean up
Timeline phases: OCT 27 - NOV 16, NOV 17 - DEC 7, DEC 8 - DEC 20
Challenges Understanding & Using Tesseract Installation: since Tesseract requires multiple modules and libraries as prerequisites, getting the installation right was a bit more challenging than expected. I followed the recommendations in the official documentation and used tips from Stack Overflow. Parameters: Tesseract has over 600 parameters that are used for different steps of processing, such as image processing, script and font detection, and output. Understanding how these parameters affect the overall efficiency of the app, and setting the proper values for accurate results, was a time-consuming process. I filtered the relevant parameters using recommendations from Stack Overflow and web searches. Grasping LSTM mode: in its newer versions (4.0 and later), Tesseract includes a Long Short-Term Memory (LSTM) mode that gives more accurate results. Before utilizing this feature, I had to spend considerable time researching the concept so that I could make full use of what this mode provides.
Challenges… Image preprocessing Because the scanned images this app is geared to process are of inconsistent quality, image preprocessing has to be done for better output accuracy. When using OpenCV for this purpose, the processing should be done with the Tesseract engine’s requirements in mind, since Tesseract also performs its own internal image processing. For this implementation, I had to learn many new concepts about how images are represented, edited and altered. In addition, since OpenCV provides different options for doing the same task, I had to understand how the module works, which put further strain on the project’s time constraints. As a pragmatic solution, I used templates from recommendations on Stack Overflow and added specific adjustments that best suit this app.
Challenges… Improving OCR result Accuracy Most of the historical documents my app is intended for were either written by hand (using ink on scrolls made of animal hides) or printed on old printing presses, and as a result their formatting and typefaces are not uniform. The irregularities include letter shapes and sizes, and varying spacing between words and between lines. Moreover, some of the letters have idiosyncratic appendages that negatively affect the OCR result, since Tesseract is trained on the current standardized and uniform characters. To tackle this problem I had planned to identify and set the Tesseract configuration parameters that would help increase the OCR confidence level, and also to create training data that best represents these irregularities.
Challenges… Improving OCR result Accuracy… But after researching how Tesseract works internally and how to prepare the planned training data, I realized that I had underestimated the breadth and depth of the new concepts I had to learn from scratch and the time needed to accomplish my goals. Understanding how the parameters, numbering more than 600, affect the overall efficiency of the app, and setting the proper values for accurate results, was a time-consuming process. After focused research, I singled out the parameters that most affect the accuracy and efficiency of the result for my particular case. Then I tested ranges of parameter values and took a minimal approach to configuring Tesseract’s parameters.
Challenges… Improving OCR result Accuracy… In trying to create custom training data, I came across multiple concepts such as understanding how fonts work, creating font boxes, and matching the characters. To deliver my MVP project on time, after testing with small sample inputs, I opted to use the best training data provided by Tesseract, and I will add layers on top of it to solve common errors in the output and further improve accuracy. As an additional measure, I expanded the image preprocessing done with OpenCV to tackle some of the irregularities mentioned above, using suggestions and examples from Tesseract’s official documentation page on GitHub. Adjusting each image’s color, contrast, border, pixels and size, binarizing it using threshold adjustments, and applying dilation and erosion, which grow (dilation) or shrink (erosion) the edges of characters against the background, helped increase the OCR output confidence to a level that is satisfactory for now.
Technical interests as a result of this project Personally, I’ve learned that discovering and learning new things is my forte. I’ve enjoyed exploring new topics such as LSTM learning, image processing, fonts, and the creation and formatting of MS Word and PDF documents using Python. I have developed or rediscovered an interest in furthering my studies in programming, especially focused on: AI; languages (lexicography, typesetting, etymology …); and data science.