Python text recognition from screen
ImageGrab and PyTesseract
ImageGrab is a Python module that helps to capture the contents of the screen. PyTesseract is an Optical Character Recognition (OCR) tool for Python. Together they can be used to read the contents of a section of the screen.

Installation –
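Both Python packages install with pip; pytesseract additionally needs the Tesseract engine binary itself. The package-manager commands below are examples for Debian/Ubuntu and macOS and may differ on your system:

```shell
# ImageGrab ships with Pillow; pytesseract wraps the Tesseract engine
pip install Pillow pytesseract

# The Tesseract engine itself (pytesseract calls this under the hood)
sudo apt-get install tesseract-ocr    # Debian/Ubuntu
# brew install tesseract              # macOS
```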
Implementation of code
The following functions were primarily used in the code –
The objectives of the code are:
Code : Python code to use ImageGrab and PyTesseract
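The original snippet is not reproduced here; the sketch below shows the general idea. The bounding-box coordinates and the Windows binary path are placeholders you will need to adjust, and running it requires a display and a working Tesseract installation:

```python
import numpy as np
import pytesseract
from PIL import ImageGrab

# On Windows, point pytesseract at the installed binary (example path):
# pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'

def im_to_string():
    # Grab a section of the screen; bbox is (left, top, right, bottom)
    cap = ImageGrab.grab(bbox=(0, 0, 800, 600))
    # pytesseract accepts PIL images and numpy arrays alike
    return pytesseract.image_to_string(np.array(cap), lang='eng')

if __name__ == '__main__':
    print(im_to_string())
```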
Output
The above code can be used to capture a certain section of the screen and read its text contents. Read about the other libraries used in the code: NumPy.

In this blog post, we will try to explain the technology behind the most used Tesseract engine, which was upgraded with the latest knowledge researched in optical character recognition. This article will also serve as a how-to guide/tutorial on how to implement OCR in Python using the Tesseract engine. We will be walking through the following modules:
Table of Contents

- Introduction
- Open source OCR tools
- Tesseract OCR
- Technology - How it works
- Installing Tesseract
- Running Tesseract with CLI
- OCR with Pytesseract and OpenCV
- Preprocessing for Tesseract
- Getting boxes around text
- Text template matching
- Page segmentation modes
- Detect orientation and script
- Detect only digits
- Whitelisting characters
- Blacklisting characters
- Detect in multiple languages
- Using tessdata_fast
- Training Tesseract on custom data
- Limitations of Tesseract
- OCR with Nanonets
- Conclusion
Introduction
OCR stands for Optical Character Recognition. In other words, OCR systems transform a two-dimensional image of text, which could contain machine-printed or handwritten text, from its image representation into machine-readable text. OCR as a process generally consists of several sub-processes that together make recognition as accurate as possible. The sub-processes are:
The sub-processes in the list above can of course differ, but these are roughly the steps needed to approach automatic character recognition. In OCR software, the main aim is to identify and capture all the unique words, across different languages, from written text characters. For almost two decades, optical character recognition systems have been widely used to provide automated text entry into computerized systems. Yet in all this time, conventional online OCR systems (like zonal OCR) have never overcome their inability to read more than a handful of type fonts and page formats. Proportionally spaced type (which includes virtually all typeset copy), laser printer fonts, and even many non-proportional typewriter fonts have remained beyond the reach of these systems. As a result, conventional OCR never achieved more than a marginal impact on the total number of documents needing conversion into digital form.

(Figure: the Optical Character Recognition process.)

Next-generation OCR engines deal with these problems really well by utilizing the latest research in the area of deep learning. By leveraging the combination of deep models and huge publicly available datasets, models achieve state-of-the-art accuracies on given tasks. Nowadays it is also possible to generate synthetic data with different fonts using generative adversarial networks and a few other generative approaches. Optical Character Recognition remains a challenging problem when text occurs in unconstrained environments, like natural scenes, due to geometrical distortions, complex backgrounds, and diverse fonts. The technology still holds immense potential due to the various use-cases of deep learning based OCR, like:
Have an OCR problem in mind? Want to reduce your organization's data entry costs? Head over to Nanonets and build OCR models to extract text from images or extract data from PDFs with AI based PDF OCR!

Open source OCR tools
There is a lot of optical character recognition software available. I did not find any quality comparison between them, but I will write about some that seem to be the most developer-friendly.

Tesseract - an open-source OCR engine that has gained popularity among OCR developers. Even though it can sometimes be painful to implement and modify, for the longest time there weren't many free and powerful OCR alternatives on the market. Tesseract began as a Ph.D. research project in HP Labs, Bristol. It gained popularity and was developed by HP between 1984 and 1994. In 2005 HP released Tesseract as open-source software, and since 2006 it has been developed by Google.

(Figure: Google Trends comparison of different open source OCR tools.)

OCRopus - an open-source OCR system allowing easy evaluation and reuse of OCR components by both researchers and companies. It is a collection of document analysis programs, not a turn-key OCR system. To apply it to your documents, you may need to do some image preprocessing, and possibly also train new models. In addition to the recognition scripts themselves, there are several easy-to-use scripts for ground truth editing and correction, measuring error rates, and determining confusion matrices.

Ocular - Ocular works best on documents printed using a hand press, including those written in multiple languages. It operates from the command line. It is a state-of-the-art historical OCR system. Its primary features are:
SwiftOCR - I will also mention an OCR engine written in Swift, since there is significant development being made into advancing the use of Swift as a programming language for deep learning. Check out this blog to find out more about why. SwiftOCR is a fast and simple OCR library that uses neural networks for image recognition. SwiftOCR claims that its engine outperforms the well-known Tesseract library. In this blog post, we will focus on Tesseract OCR and find out more about how it works and how it is used.

Tesseract OCR
Tesseract is an open source text recognition (OCR) engine, available under the Apache 2.0 license. It can be used directly, or (for programmers) through an API, to extract printed text from images. It supports a wide variety of languages. Tesseract doesn't have a built-in GUI, but there are several available from the 3rdParty page. Tesseract is compatible with many programming languages and frameworks through wrappers that can be found here. It can be used with the existing layout analysis to recognize text within a large document, or in conjunction with an external text detector to recognize text from an image of a single text line.

(Figure: OCR process flow to build an API with Tesseract, from a blog post.)

Tesseract 4.00 includes a new neural network subsystem configured as a text line recognizer. It has its origins in OCRopus' Python-based LSTM implementation but has been redesigned for Tesseract in C++. The neural network system in Tesseract pre-dates TensorFlow but is compatible with it, as there is a network description language called Variable Graph Specification Language (VGSL) that is also available for TensorFlow. To recognize an image containing a single character, we typically use a Convolutional Neural Network (CNN). Text of arbitrary length is a sequence of characters, and such problems are solved using RNNs, of which LSTM is a popular form. Read this post to learn more about LSTM.
Technology - How it works
LSTMs are great at learning sequences but slow down a lot when the number of states is too large. There are empirical results that suggest it is better to ask an LSTM to learn a long sequence than a short sequence of many classes. Tesseract's recognizer developed from the OCRopus model in Python, which was itself a fork of an LSTM implementation in C++ called CLSTM. CLSTM is an implementation of the LSTM recurrent neural network model in C++, using the Eigen library for numerical computations.

(Figure: the Tesseract 3 OCR process, from the paper.)

Legacy Tesseract 3.x depended on a multi-stage process where we can differentiate the following steps:
Word finding was done by organizing text lines into blobs, and the lines and regions were analyzed for fixed-pitch or proportional text. Text lines are broken into words differently according to the kind of character spacing. Recognition then proceeds as a two-pass process. In the first pass, an attempt is made to recognize each word in turn. Each word that is satisfactory is passed to an adaptive classifier as training data. The adaptive classifier then gets a chance to more accurately recognize text lower down the page.

Modernization of the Tesseract tool was an effort of code cleaning and adding a new LSTM model. The input image is processed in boxes (rectangles), line by line, feeding into the LSTM model and giving output. In the image below we can visualize how it works.

(Figure: how Tesseract uses the LSTM model.)

After adding a new training tool and training the model with a lot of data and fonts, Tesseract achieves better performance. Still, it is not good enough to work on handwritten text and weird fonts. It is possible to fine-tune or retrain the top layers for experimentation.

Installing Tesseract
Installing Tesseract on Windows is easy with the precompiled binaries found here. Do not forget to edit the "path" environment variable and add the tesseract path. On Linux or Mac it is installed with a few commands. After the installation, verify that everything is working by typing this command in the terminal or cmd:
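The verification command, assuming the binary is on your path:

```shell
tesseract --version
```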
The output will show the installed Tesseract version along with the image libraries it was compiled against.
You can install the Python wrapper for tesseract after this using pip. The Tesseract library ships with a handy command-line tool called tesseract. We can use this tool to perform OCR on images, with the output stored in a text file. If we want to integrate Tesseract in our C++ or Python code, we will use Tesseract's API.

Running Tesseract with CLI
Call the Tesseract engine on the image with image_path and convert the image to text, written line by line in the command prompt, by typing the following:
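A basic invocation; `stdout` tells Tesseract to print the recognized text straight to the terminal (replace `image_path` with your file):

```shell
tesseract image_path stdout
```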
To write the output text in a file:
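For example:

```shell
# Writes the recognized text to output.txt (the .txt extension is added automatically)
tesseract image_path output
```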
To specify the language model name, write the language shortcut after the -l flag; by default it takes English:
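For example, to OCR German text (assuming the corresponding language data is installed):

```shell
tesseract image_path stdout -l deu
```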
By default, Tesseract expects a page of text when it segments an image. If you're just seeking to OCR a small region, try a different segmentation mode, using the --psm argument. There are 14 modes available which can be found here. By default, Tesseract fully automates the page segmentation but does not perform orientation and script detection. To specify the parameter, type the following:
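For example:

```shell
# --psm 6 assumes a single uniform block of text;
# --psm 1 enables automatic segmentation with orientation and script detection
tesseract image_path stdout --psm 6
```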
There is also one more important argument, OCR engine mode (oem). Tesseract 4 has two OCR engines - the legacy Tesseract engine and the LSTM engine. There are four modes of operation, chosen using the --oem option.

OCR with Pytesseract and OpenCV
Pytesseract or Python-tesseract is an OCR tool for Python that also serves as a wrapper for the Tesseract-OCR Engine. It can read and recognize text in images and is commonly used in python ocr image to text use cases. It is also useful as a stand-alone invocation script for tesseract, as it can read all image types supported by the Pillow and Leptonica imaging libraries, including jpeg, png, gif, bmp, tiff, and others. More info about the Python approach can be read here. The code for this tutorial can be found in this repository.
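A minimal Pytesseract call might look like this (the filename is a placeholder; running it needs Pillow, pytesseract, and a Tesseract install):

```python
import pytesseract
from PIL import Image

# --oem 3 (default) picks the best available engine; --psm 6 assumes a block of text
custom_config = r'--oem 3 --psm 6'

img = Image.open('invoice-sample.jpg')  # placeholder filename
print(pytesseract.image_to_string(img, config=custom_config))
```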
Preprocessing for Tesseract
To avoid all the ways your tesseract output accuracy can drop, you need to make sure the image is appropriately pre-processed. This includes rescaling, binarization, noise removal, deskewing, etc. To preprocess an image for OCR, use any of the following python functions or follow the OpenCV documentation.
Let's work with an example to see things better. This is what our original image looks like (the Aurebesh writing system). After preprocessing the image (grayscale, thresholding, opening, edge detection)
and plotting the resulting images, we get the following results.

(Image: the image after preprocessing.)

The output for the original image looks like this -
Here's what the output for different preprocessed images looks like - Canny edge image (not so good)-
Thresholded image -
Opening image -
Getting boxes around text
Using Pytesseract, you can get bounding box information for your OCR results using the following code. The script below will give you bounding box information for each character detected by tesseract during OCR.
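A sketch of the character-level call (the filename is a placeholder; running it needs OpenCV, pytesseract, and a Tesseract install):

```python
import cv2
import pytesseract

img = cv2.imread('invoice-sample.jpg')  # placeholder filename
h, w, _ = img.shape

# image_to_boxes returns one line per character: <char> <x1> <y1> <x2> <y2> <page>
boxes = pytesseract.image_to_boxes(img)
for b in boxes.splitlines():
    b = b.split(' ')
    # Tesseract's box origin is the bottom-left corner, OpenCV's is the top-left,
    # so the y coordinates are flipped against the image height
    cv2.rectangle(img, (int(b[1]), h - int(b[2])), (int(b[3]), h - int(b[4])),
                  (0, 255, 0), 2)

cv2.imwrite('character_boxes.png', img)
```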
If you want boxes around words instead of characters, use the image_to_data function. We will use the sample invoice image above to test out our tesseract outputs.
This should give you the following output. Using this dictionary, we can get each word detected, its bounding box information, the text in it, and the confidence score for each. You can plot the boxes by using the code below -
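A sketch of plotting word-level boxes, assuming image_to_data was called with dictionary output (the filename is a placeholder, and the 60-confidence cutoff is an arbitrary choice):

```python
import cv2
import pytesseract
from pytesseract import Output

img = cv2.imread('invoice-sample.jpg')  # placeholder filename
d = pytesseract.image_to_data(img, output_type=Output.DICT)

n_boxes = len(d['text'])
for i in range(n_boxes):
    # Skip low-confidence detections; the cutoff of 60 is chosen for illustration
    if float(d['conf'][i]) > 60:
        (x, y, w, h) = (d['left'][i], d['top'][i], d['width'][i], d['height'][i])
        cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)

cv2.imwrite('word_boxes.png', img)
```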
Here's what this would look like for the image of a sample invoice.

Text template matching
Take the example of trying to find where a date is in an image. Here our template will be a regular expression pattern that we will match with our OCR results to find the appropriate bounding boxes. We will use a date regex together with the word-level results from image_to_data.
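A sketch of the matching logic, run here on a hypothetical image_to_data-style dictionary rather than a real image (pure Python, so no Tesseract install is needed to follow along):

```python
import re

# A simple dd/mm/yyyy pattern; adjust for the date formats in your documents
date_pattern = r'^(0[1-9]|[12][0-9]|3[01])/(0[1-9]|1[012])/(19|20)\d\d$'

def find_date_boxes(d, pattern):
    """Return (x, y, w, h) boxes for every OCR word matching the pattern.

    `d` is the dictionary produced by pytesseract.image_to_data(...,
    output_type=Output.DICT): parallel lists keyed by 'text', 'left',
    'top', 'width', and 'height'.
    """
    boxes = []
    for i, word in enumerate(d['text']):
        if re.match(pattern, word):
            boxes.append((d['left'][i], d['top'][i], d['width'][i], d['height'][i]))
    return boxes

# Hypothetical OCR output standing in for a real image_to_data result:
d = {'text': ['Invoice', 'date:', '31/01/2019', 'Total'],
     'left': [10, 80, 140, 10], 'top': [5, 5, 5, 40],
     'width': [60, 50, 90, 45], 'height': [12, 12, 12, 12]}

print(find_date_boxes(d, date_pattern))  # [(140, 5, 90, 12)]
```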
As expected, we get one box around the invoice date in the image.

Page segmentation modes
There are several ways a page of text can be analysed. The tesseract API provides several page segmentation modes if you want to run OCR on only a small region, in different orientations, etc. Here's the list of page segmentation modes supported by tesseract -

0    Orientation and script detection (OSD) only.
1    Automatic page segmentation with OSD.
2    Automatic page segmentation, but no OSD, or OCR.
3    Fully automatic page segmentation, but no OSD. (Default)
4    Assume a single column of text of variable sizes.
5    Assume a single uniform block of vertically aligned text.
6    Assume a single uniform block of text.
7    Treat the image as a single text line.
8    Treat the image as a single word.
9    Treat the image as a single word in a circle.
10   Treat the image as a single character.
11   Sparse text. Find as much text as possible in no particular order.
12   Sparse text with OSD.
13   Raw line. Treat the image as a single text line, bypassing Tesseract-specific hacks.

To change your page segmentation mode, change the --psm value in your custom config string.

Detect orientation and script
You can detect the orientation of text in your image and also the script in which it is written. The following image, after being run through the code that follows,
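A sketch using image_to_osd (the filename is a placeholder; running it needs a Tesseract install with OSD data):

```python
import re

import cv2
import pytesseract

img = cv2.imread('rotated_image.jpg')  # placeholder filename

# image_to_osd returns a small text report; pull out the rotation and script lines
osd = pytesseract.image_to_osd(img)
angle = re.search(r'(?<=Rotate: )\d+', osd).group(0)
script = re.search(r'(?<=Script: )\w+', osd).group(0)
print('angle:', angle)
print('script:', script)
```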
will print the following output.
Detect only digits
Take this image for example - the text extracted from this image looks like this.
You can recognise only digits by changing the config to the following
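A config that whitelists only digit characters; the OCR call is shown commented out, since running it needs an image and a Tesseract install:

```python
# Restrict the recognizable characters to digits only
custom_config = r'--oem 3 --psm 6 -c tessedit_char_whitelist=0123456789'

# With pytesseract installed and `img` loaded (e.g. via cv2.imread):
# text = pytesseract.image_to_string(img, config=custom_config)
```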
The output will look like this.
Whitelisting characters
Say you only want to detect certain characters from the given image and ignore the rest. You can specify your whitelist of characters (here, we have used only the lowercase characters from a to z) by using the following config.
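For example (the OCR call is commented out, since it needs an image and a Tesseract install):

```python
import string

# Whitelist only the lowercase letters a-z
custom_config = r'-c tessedit_char_whitelist=' + string.ascii_lowercase + r' --psm 6'

# text = pytesseract.image_to_string(img, config=custom_config)
```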
Output -
Blacklisting characters
If you are sure some characters or expressions definitely will not turn up in your text (otherwise the OCR will return wrong text in place of the blacklisted characters), you can blacklist those characters by using the following config.
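For example (the OCR call is again commented out, since it needs an image and a Tesseract install):

```python
# Drop the lowercase letters a-z from the recognizable character set
custom_config = r'-c tessedit_char_blacklist=abcdefghijklmnopqrstuvwxyz --psm 6'

# text = pytesseract.image_to_string(img, config=custom_config)
```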
Output -
Detect in multiple languages
You can check the languages available by typing this in the terminal:
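```shell
tesseract --list-langs
```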
To download tesseract for a specific language use
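The Debian/Ubuntu package naming is shown below; other platforms differ:

```shell
# Replace LANG with the three-letter code for the language you need
sudo apt-get install tesseract-ocr-LANG
```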
where LANG is the three letter code for the language you need. You can find out the LANG values here. Alternatively, you can download the .traineddata file for the language you need and place it in your tessdata directory. Note - only languages that have a .traineddata file are supported by tesseract. To specify the language you need your OCR output in, use the -l LANG argument in the config, where LANG is the three letter code.
Take this image for example - You can work with multiple languages by changing the LANG parameter as such -
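For example (the OCR call is commented out, since it needs an image and a Tesseract install with both language packs):

```python
# Recognize English and Portuguese in the same image; the first language is primary
custom_config = r'-l eng+por --psm 6'

# txt = pytesseract.image_to_string(img, config=custom_config)
```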
and you will get the following output -
Note - The language specified first with the -l flag is the primary language. Unfortunately tesseract does not have a feature to detect the language of the text in an image automatically. An alternative solution is provided by another python module called langdetect.
This module, again, does not detect the language of text from an image but needs string input to detect the language from. The best way to do this is by first using tesseract to get OCR text in whatever languages you might feel are in there, using the -l flag, and then running langdetect on that text. Say we have a text we thought was in English and Portuguese.
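A sketch of the detection step, using a hypothetical string in place of real OCR output (langdetect installs with `pip install langdetect`):

```python
from langdetect import detect_langs

# Hypothetical OCR result standing in for real pytesseract output
ocr_text = 'The quick brown fox jumped. O cachorro correu pelo parque.'

# detect_langs returns a list of language:probability pairs
print(detect_langs(ocr_text))
```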
This should output a list of languages in the text and their probabilities.
The language codes used by langdetect are two-letter ISO 639-1 codes (like en and pt), while tesseract expects its own three-letter codes (like eng and por), so the detected languages need to be mapped back. We get the text again by changing the config to include just the languages that were detected.
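A small mapping sketch (only a few illustrative entries; extend it for the languages you work with):

```python
# Map langdetect's two-letter codes to tesseract's three-letter codes
lang_map = {'en': 'eng', 'pt': 'por', 'fr': 'fra', 'de': 'deu', 'es': 'spa'}

detected = ['en', 'pt']  # hypothetical langdetect result
tess_langs = '+'.join(lang_map[code] for code in detected)
custom_config = f'-l {tess_langs} --psm 6'
print(custom_config)  # -l eng+por --psm 6
```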
Note - Tesseract performs badly when, in an image with multiple languages, the languages specified in the config are wrong or aren't mentioned at all. This can mislead the langdetect module quite a bit as well.

Using tessdata_fast
If speed is a major concern for you, you can replace your tessdata language models with tessdata_fast models, which are 8-bit integer versions of the tessdata models. According to the tessdata_fast github - This repository contains fast integer versions of trained models for the Tesseract Open Source OCR Engine. These models only work with the LSTM OCR engine of Tesseract 4.

To use the tessdata_fast models instead of the standard tessdata models, all you need to do is download the tessdata_fast data file for your language and place it inside your tessdata directory.

Training Tesseract on custom data
Tesseract 4.00 includes a new neural network-based recognition engine that delivers significantly higher accuracy on document images. Neural networks require significantly more training data and train a lot slower than base Tesseract. For Latin-based languages, the existing model data provided has been trained on about 400000 text lines spanning about 4500 fonts. In order to successfully run the Tesseract 4.0 LSTM training tutorial, you need to have a working installation of Tesseract 4 and the Tesseract 4 training tools, and also have the training scripts and required trained data files in certain directories. Visit the github repo for files and tools. Training Tesseract 4.00 from scratch takes from a few days to a couple of weeks, so instead of training from scratch, here are a few options for training:
A guide on how to train on your custom data and create .traineddata files can be found here. We will not be covering the code for training using Tesseract in this blog post.

Limitations of Tesseract
Tesseract works best when there is a clean segmentation of the foreground text from the background. In practice, it can be extremely challenging to guarantee this type of setup. There are a variety of reasons you might not get good quality output from Tesseract, for example if the image has noise in the background. The better the image quality (size, contrast, lighting), the better the recognition result. It requires a bit of preprocessing to improve the OCR results: images need to be scaled appropriately, have as much image contrast as possible, and the text must be horizontally aligned. Tesseract OCR is quite powerful but does have the limitations summed up in the list below.
There's of course a better, much simpler and more intuitive way to perform OCR tasks.

OCR with Nanonets
The Nanonets OCR API allows you to build OCR models with ease. You do not have to worry about pre-processing your images, matching templates, or building rule based engines to increase the accuracy of your OCR model. You can upload your data, annotate it, set the model to train, and wait to get predictions through a browser based UI, without writing a single line of code, worrying about GPUs, or finding the right architectures for your deep learning models. You can also acquire the JSON responses of each prediction to integrate it with your own systems and build machine learning powered apps built on state of the art algorithms and a strong infrastructure.

Using the GUI: https://app.nanonets.com/

You can also use the Nanonets-OCR API by following the steps below:

Step 1: Clone the Repo, Install dependencies
Step 2: Get your free API Key

Step 3: Set the API key as an Environment Variable
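For example (the variable name follows the Nanonets sample repo and is shown as an assumption; substitute your actual key):

```shell
export NANONETS_API_KEY=YOUR_API_KEY_GOES_HERE
```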
Step 4: Create a New Model
Note: This generates a MODEL_ID that you need for the next step.

Step 5: Add Model Id as Environment Variable
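For example (variable name assumed from the Nanonets sample repo; YOUR_MODEL_ID is the value from the previous step):

```shell
export NANONETS_MODEL_ID=YOUR_MODEL_ID
```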
Note: you will get YOUR_MODEL_ID from the previous step.

Step 6: Upload the Training Data
Step 7: Train Model
Step 8: Get Model State
Step 9: Make Prediction
Nanonets and Humans in the Loop
The 'Moderate' screen aids the correction and entry processes, reduces the manual reviewer's workload by almost 90%, and reduces costs by 50% for the organisation. Features include:
All the fields are structured into an easy to use GUI which allows the user to take advantage of the OCR technology and assist in making it better as they go, without having to type any code or understand how the technology works.

Conclusion
Just as deep learning has impacted nearly every facet of computer vision, the same is true for character recognition and handwriting recognition. Deep learning based models have managed to obtain unprecedented text recognition accuracy, far beyond traditional information extraction and machine learning image processing approaches. Tesseract performs well when document images follow these guidelines:
The latest release of Tesseract 4.0 supports deep learning based OCR that is significantly more accurate. The OCR engine itself is built on a Long Short-Term Memory (LSTM) network, a kind of Recurrent Neural Network (RNN). Tesseract is perfect for scanning clean documents and comes with pretty high accuracy and font variability since its training was comprehensive. I would say that Tesseract is a go-to tool if your task is scanning of books, documents and printed text on a clean white background. Further Reading