Build a Django OCR Application with Python and Tesseract: Extract Text from Images easily

Valentius Kryptix - Build a Django OCR Application with Python and Tesseract: Extract Text from Images easily
Sunayana Kamble Avatar

Tired of manually transcribing text from scanned documents or images in your Django app? Imagine a world where you can automate this process, saving countless hours of tedious work. Whether you’re dealing with invoices, receipts or Aadhaar cards, extracting text from images has never been easier. With the power of Python, Django, and Tesseract OCR, you can instantly convert images into editable, searchable text.

In this blog, I’ll walk you through a step-by-step guide on how to seamlessly integrate Tesseract OCR into your Django application, making image-to-text extraction both simple and efficient. Ready to revolutionize your data entry process? Let’s dive in!

Why Use Tesseract OCR with Django?

Tesseract is one of the most accurate open-source OCR engines, supporting 100+ languages. Pair it with Django, and you’ve got a powerful system for:

  • Extracting text from scanned documents, receipts, or ID cards
  • Automating data entry (e.g., Aadhaar card details, invoices)
  • Building searchable document archives

OCR Alternatives: Quick review

Text extraction from images is a common task, and several tools have been developed over time to do it efficiently. The most widely used frameworks, tools, and libraries are:

  1. Tesseract OCR:
    Description: An open-source OCR engine that supports over 100 languages.
    Use Case: Suitable for extracting text from scanned documents, images, or PDFs.
    Integration: The Python wrapper pytesseract allows seamless use with Python.
    Advantages: High accuracy, supports customization through training datasets.

  2. EasyOCR:
    Description: A Python library built on PyTorch for OCR tasks.
    Use Case: Supports 80+ languages and is easy to set up and use.
    Advantages: Simple interface, good for multilingual text extraction.

  3. PyOCR:
    Description: A Python wrapper for multiple OCR engines like Tesseract and CuneiForm.
    Use Case: Provides a unified interface to access various OCR engines.
    Advantages: Flexibility to switch between different OCR engines.

  4. OpenCV:
    Description: A computer vision library used for preprocessing images before applying OCR.
    Use Case: Enhances image quality (e.g., converting to grayscale, thresholding) to improve OCR accuracy.
    Advantages: Extensive image processing capabilities.

  5. OCRopus:
    Description: A document analysis and OCR system focused on layout analysis and text recognition.
    Use Case: Best for structured document processing.

  6. Cloud-Based APIs:
    Description: Examples include Google Cloud Vision API, AWS Textract, and Microsoft Azure Cognitive Services.
    Use Case: Ideal for enterprise-level applications requiring high scalability and accuracy.
    Advantages: No setup required; supports advanced features like handwriting recognition.

  7. PDFelement:
    Description: A desktop application with an integrated OCR engine for extracting text from images or PDFs.
    Use Case: Suitable for non-programmers looking for a ready-to-use tool.

Before diving in, let’s compare popular OCR tools:

Tool          | Description                       | Use Case                      | Advantages
------------- | --------------------------------- | ----------------------------- | ----------------------------
Tesseract OCR | Open-source, 100+ languages       | Scanned docs, images, PDFs    | High accuracy, customizable
EasyOCR       | PyTorch-based, 80+ languages      | Multilingual extraction       | Simple interface, easy setup
PyOCR         | Wrapper for multiple OCR engines  | Unified OCR interface         | Flexibility in engine choice
OpenCV        | Computer vision, preprocessing    | Image enhancement             | Extensive image processing
OCRopus       | Layout analysis, text recognition | Structured docs (e.g., books) | Complex layout optimization
Cloud APIs    | Google, AWS, Azure                | Enterprise apps, scalability  | No setup, advanced features
PDFelement    | Desktop app with OCR              | Non-programmers, ready-to-use | User-friendly, no coding

Tesseract wins for open-source flexibility and Django integration.

I have used Tesseract OCR for this project. Here’s a detailed step-by-step guide for extracting text from images using Tesseract OCR within a Django application:

Step-by-Step: Text Extraction in Django with Tesseract OCR

Step 1: Set Up the Environment and Install the Necessary Dependencies

You need pytesseract, Pillow, opencv-python and numpy. Install them using:

pip install pytesseract Pillow opencv-python numpy

Also make sure the Tesseract OCR engine itself is installed on your system, since pytesseract is only a wrapper around it. On Windows, set the path to the Tesseract executable in your code.

Step 2: Import the necessary Libraries

import re

import cv2
import numpy as np
import pytesseract
from django.shortcuts import render
from PIL import Image

from .forms import AadharCardForm
from .models import AadharData  # Needed in Step 10 to save the extracted data

Libraries used:
  • pytesseract: The Python wrapper for Tesseract OCR.
  • PIL (Pillow): A library for opening, manipulating, and saving image files.
  • re: The regular expression library for pattern matching.
  • cv2 (OpenCV): A library used for image processing tasks.
  • numpy: A library for numerical operations, used here for image data manipulation.

Step 3: Configure Tesseract Path

For Windows:

pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'

This allows pytesseract to access the OCR engine.
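
Hard-coding the Windows path breaks the same code on Linux or macOS, where the tesseract binary is usually found on the PATH. One way to keep a single code path is a small resolver like the sketch below. The helper name resolve_tesseract_cmd is hypothetical, and the Windows location is simply the path used above, assumed to be where Tesseract was installed.

```python
import os
import platform
import shutil

# Same Windows path as in Step 3; adjust it if Tesseract lives elsewhere.
WINDOWS_TESSERACT = r"C:\Program Files\Tesseract-OCR\tesseract.exe"


def resolve_tesseract_cmd(system=None):
    """Return a path to the tesseract executable, or None if it cannot be found.

    Hypothetical helper: prefers whatever is on PATH, then falls back to the
    Windows install location assumed above.
    """
    system = system or platform.system()
    on_path = shutil.which("tesseract")
    if on_path:
        return on_path
    if system == "Windows" and os.path.isfile(WINDOWS_TESSERACT):
        return WINDOWS_TESSERACT
    return None


# Possible use in views.py (requires pytesseract to be imported there):
#     cmd = resolve_tesseract_cmd()
#     if cmd:
#         pytesseract.pytesseract.tesseract_cmd = cmd
```

If the function returns None, the OCR engine is most likely not installed, which is worth reporting at startup rather than on the first upload.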

Step 4: Define the View Function

def extract_aadhar_data(request):
    name = ""
    yob = ""
    aadharno = ""
    addrs = ""
    raw_text = ""

(Initialize variables to store extracted information.)

Step 5: Handle Form Submission

    if request.method == 'POST' and 'image' in request.FILES:
        form = AadharCardForm(request.POST, request.FILES)

(Check if the request method is POST and if an image file is included in the request)

Step 6: Validate the Form

        if form.is_valid():

(Validate the form to ensure that all required fields are correctly filled out.)

Step 7: Open and Process the Image

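            image_file = request.FILES['image']
            print("Image uploaded:", image_file.name)
            # Normalize to 3-channel RGB so the OpenCV conversion works for
            # grayscale, palette, and RGBA uploads as well.
            img = Image.open(image_file).convert('RGB')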

Convert the image to grayscale for better OCR results:

            img_cv = cv2.cvtColor(np.array(img), cv2.COLOR_RGB2BGR)
            gray_img = cv2.cvtColor(img_cv, cv2.COLOR_BGR2GRAY)

  • Open the uploaded image using Pillow.
  • Convert the image to a format suitable for OpenCV processing (BGR).
  • Convert it to grayscale to enhance OCR accuracy.

Step 8: Extract Text Using Tesseract

            raw_text = pytesseract.image_to_string(gray_img)
            print("Extracted Text (Raw):", raw_text)

(Use Tesseract to extract text from the processed grayscale image.)

Step 9: Extract Specific Data Using Regular Expressions

            name_ptrn = r"\s[:-]?\s([A-Za-z\s]+)"
            name_srch = re.search(name_ptrn, raw_text)
            name_is = name_srch.group(1) if name_srch else "Not found"
            n = name_is.replace('\t', '')

Define a regex pattern to find the name in the extracted text and clean it up.

Filter Out Unwanted Words

            exclude = ['DOB', 'aH', 'att']
            wrds = n.split()
            final = [word for word in wrds if word not in exclude]
            only_name = ' '.join(final[:3])
            print("Extracted Name:", only_name)

Drop OCR noise tokens (such as 'DOB', 'aH', and 'att') and keep at most the first three remaining words as the name.

Extract Year of Birth

            yob_ptrn = r"\s[:-]?\s(\d{4})(?:/\d{2}/\d{4})?"
            yob_srch = re.search(yob_ptrn, raw_text)
            yob = yob_srch.group(1) if yob_srch else "Not found"
            print("Extracted Year of Birth:", yob)

Extract Aadhaar Number

            aadhar_ptrn = r"\s[:-]?\s(\d{4}\s?\d{4}\s?\d{4})"
            aadhar_srch = re.search(aadhar_ptrn, raw_text)
            aadharno = aadhar_srch.group(1) if aadhar_srch else "Not found"
            print("Extracted Aadhar Number:", aadharno)

Extract Address

            addrs = extract_address(raw_text)
            print("Extracted Address:", addrs)

Address Extraction Function

# Module-level helper in views.py, defined outside the view function.
def extract_address(text):
    addrs_ptrn = re.compile(
        r"Address:\s(\d{1,5}[A-Za-z\s]+(?:,\s[A-Za-z\s]+){3,}\d{6}\s[A-Za-z\s]+(?:,\s[A-Za-z\s]+),\s\d{6}\s*[A-Za-z\s]+)")
    addrs_match = addrs_ptrn.search(text)

    if addrs_match:
        address = addrs_match.group(1).strip()
        only_address = re.sub(r'\s+', ' ', address)  # Collapse runs of whitespace
        only_address = re.sub(r'\s*,\s*', ', ', only_address)  # Clean up comma spaces
        return only_address
    else:
        return "Not found"
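
Regex patterns for OCR output are easiest to tune against sample text outside Django, before wiring them into the view. The sketch below is a stricter variant for the Aadhaar number only. The pattern, the helper name find_aadhaar_number, and the sample string are assumptions for illustration, not the code used in the view above, and it returns None instead of "Not found" when nothing matches.

```python
import re

# Assumed pattern: exactly 12 digits, optionally split 4-4-4 by single spaces.
# The lookarounds stop it from matching inside a longer run of digits, and
# using a literal space (not \s) stops it from joining digits across lines.
AADHAAR_RE = re.compile(r"(?<!\d)(\d{4}) ?(\d{4}) ?(\d{4})(?!\d)")


def find_aadhaar_number(text):
    """Return the first 12-digit number as 'XXXX XXXX XXXX', or None."""
    match = AADHAAR_RE.search(text)
    if match is None:
        return None
    return " ".join(match.groups())


# Made-up sample of what raw OCR text might look like.
sample = "Year of Birth : 1990\n1234 5678 9012\n"
print(find_aadhaar_number(sample))  # 1234 5678 9012
```

Because the pattern only allows a plain space between groups, the four-digit year on the previous line is not pulled into the match.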

Step 10: Save Data to PostgreSQL

            AadharData.objects.create(
                name=only_name,
                year_of_birth=yob,
                aadhar_number=aadharno,  # Field name must match the model
                address=addrs,
                raw_text=raw_text,
            )
            print("Data saved to PostgreSQL!")

Save the extracted data into your PostgreSQL database using Django’s ORM.

Step 11: Render Template with Extracted Data

            return render(request, 'ocr/upload_image.html', {
                'form': form,
                'name': only_name,
                'yob': yob,
                'aadharno': aadharno,
                'addrs': addrs,
                'raw_text': raw_text,
            })

Step 12: Handle GET Requests

    else:
        form = AadharCardForm()

    return render(request, 'ocr/upload_image.html', {
        'form': form,
        'name': name,
        'yob': yob,
        'aadharno': aadharno,
        'addrs': addrs,
        'raw_text': raw_text
    })

Let’s look at how the different Django files align with the MTV (Model-Template-View) pattern.

1. Model (M)
  • Purpose: Defines the data structure of your application.
  • Explanation: In Django, the Model is responsible for the data layer. This file contains Python classes that represent database tables. Each class attribute corresponds to a database field, and Django automatically creates the database schema based on these models.
models.py

The models.py file defines the structure of the database table where extracted data is stored. In this case, it includes fields for storing Aadhaar card details.

from django.db import models

class AadharData(models.Model):
    name = models.CharField(max_length=255)  # Stores extracted name
    # The view stores "Not found" when no year matches, so allow more than 4 characters
    year_of_birth = models.CharField(max_length=10)  # Stores year of birth
    aadhar_number = models.CharField(max_length=14)  # Stores Aadhaar number
    address = models.TextField()  # Stores extracted address
    raw_text = models.TextField()  # Stores raw OCR text
    created_at = models.DateTimeField(auto_now_add=True)  # Timestamp for record creation

    def __str__(self):
        return self.name

2. View (V)
  • Purpose: Handles the logic of the application and responds to user requests.
  • Explanation: In Django, the View is where business logic resides. It acts as a controller in the traditional MVC pattern. A view receives user requests, processes them (with the help of models), and returns a response (usually by rendering a template).

All of the code discussed step by step above belongs in the views.py file.

3. Template (T)
Templates are used to render HTML pages dynamically based on data passed from views. In this case, the template displays the form for uploading images and shows extracted data along with historical records.

Purpose:

  • Displays a form to upload images.
  • Shows extracted data (name, year of birth, Aadhaar number, address, and raw text).
  • Lists historical records retrieved from the database.
4. URLs (Routing)
  • Purpose: Maps URLs to views.
  • Explanation: Django’s URLs file links the incoming web requests to specific views. In this file, you define URL patterns that point to your views, essentially creating routes for the application.
Project-Level urls.py
Location: Found in the root directory of the Django project (e.g., myproject/urls.py).

Purpose: Acts as the central URL configuration for the entire project. It routes incoming requests to specific app-level urls.py files or directly to views.

Key Features:

  • Uses include() to delegate URL routing to app-level configurations.
  • Provides a clear structure for managing multiple apps within a single project.
App-Level urls.py
Location: Found in individual app directories (e.g., myapp/urls.py).

Purpose: Manages URLs specific to the app, making it easier to organize and maintain routes for each app independently.

Key Features:

  • Defines URLs specific to the app’s functionality.
  • Can be included in the project-level urls.py using include() for modularity.
  • App-level urls.py files define specific routes and link them to corresponding views.
How They Work Together:
The project-level urls.py delegates requests to app-level urls.py files using include().
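
As a rough sketch, the two files could look like this for the layout used in this guide (project aadhar_extractor, app ocr, view extract_aadhar_data). The route paths and the URL name are assumptions, not taken from the original project, and the two sections belong in separate files.

```python
# aadhar_extractor/urls.py (project level)
from django.contrib import admin
from django.urls import include, path

urlpatterns = [
    path("admin/", admin.site.urls),
    # Hand every other request to the ocr app's own URL configuration.
    path("", include("ocr.urls")),
]


# ocr/urls.py (app level, a separate file shown here for comparison)
from django.urls import path

from . import views

urlpatterns = [
    # Assumed route: serve the upload form and OCR results at /upload/.
    path("upload/", views.extract_aadhar_data, name="extract_aadhar_data"),
]
```

With this arrangement, adding another app only needs one more include() line at the project level.
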
5. Forms – forms.py
  • Purpose: Handles user input, validation, and processing.
  • Explanation: In Django, Forms are used to handle user input, perform validation, and manage data. Forms are defined in the forms.py file and can be used to create or update model instances.
6. Settings – settings.py
  • Purpose: Configures the Django project.
  • Explanation: The Settings file contains global configuration for the entire Django project, such as database settings, installed apps, middleware, templates, static files, and more.
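
Since the extracted data is saved to PostgreSQL, the database section of settings.py could look roughly like the excerpt below. The environment variable names and fallback values are placeholders chosen for this sketch, not settings from the original project.

```python
import os

# Excerpt for aadhar_extractor/settings.py.
# Variable names and defaults below are assumptions for illustration.
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.postgresql",
        "NAME": os.environ.get("POSTGRES_DB", "aadhar_db"),
        "USER": os.environ.get("POSTGRES_USER", "postgres"),
        "PASSWORD": os.environ.get("POSTGRES_PASSWORD", ""),
        "HOST": os.environ.get("POSTGRES_HOST", "localhost"),
        "PORT": os.environ.get("POSTGRES_PORT", "5432"),
    }
}
```

The ocr app also has to be listed in INSTALLED_APPS, and a PostgreSQL driver such as psycopg2 must be installed for this backend to work.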

Project Directory Structure:

Here’s how these files fit into a typical Django project structure:

aadhar_extractor/
├── manage.py
├── aadhar_extractor/
│   ├── __init__.py
│   ├── settings.py
│   ├── urls.py
│   └── wsgi.py
├── ocr/
│   ├── migrations/
│   ├── templates/
│   │   └── ocr/
│   │       └── upload_image.html
│   ├── __init__.py
│   ├── admin.py
│   ├── apps.py
│   ├── forms.py
│   ├── models.py
│   ├── tests.py
│   ├── urls.py
│   └── views.py
└── db.sqlite3 (or PostgreSQL database configured)



FAQs: Python, Django & Tesseract OCR:

1. What is Tesseract OCR, and how does it work?

  • Answer: Tesseract OCR is an open-source optical character recognition (OCR) engine that extracts text from images. It supports over 100 languages and works well with scanned documents and images. It can be used in Python through the pytesseract wrapper.

2. How do I install and configure Tesseract OCR for use in Django?

  • Answer: To install Tesseract, download the executable for your operating system and install the required Python libraries (pytesseract, Pillow, opencv-python, numpy). Then, set the path to the Tesseract executable in your code for Python to access the OCR engine.

3. How can I improve OCR accuracy in Django?

  • Answer: Improve OCR accuracy by preprocessing images using OpenCV (grayscale conversion, thresholding, noise reduction) to enhance text visibility. Also, using high-quality images and customizing Tesseract with training data for specific fonts can further improve results.

4. Can I use Tesseract OCR for handwritten text recognition?

  • Answer: Tesseract OCR works well for printed text but struggles with handwritten text. For handwriting recognition, you can explore other OCR services like Google Cloud Vision or AWS Textract, which provide better support for handwritten text.

5. What are some use cases for integrating OCR in Django applications?

  • Answer: OCR can be used to automate data extraction from documents such as Aadhaar cards, invoices, receipts, or any form-based data. Django applications can process uploaded images or PDFs to extract key details like names, addresses, and identification numbers.

6. How can I extract specific data (e.g., name, Aadhaar number) from OCR text?

  • Answer: After extracting raw text with Tesseract OCR, you can use regular expressions (regex) to search for and extract specific data such as names, Aadhaar numbers, and addresses. This approach helps clean and structure the extracted text into meaningful information.

7. Is Tesseract better than EasyOCR?

  • Answer: Tesseract is generally more accurate for high-quality scans, especially for printed text. However, EasyOCR is easier to set up, supports more than 80 languages, and is useful for multilingual text extraction. Tesseract may be faster and more accurate for high-resolution images, while EasyOCR is more versatile for diverse languages.

8. How do I speed up OCR in Django applications?

  • Answer: Speed up OCR by:
    • Preprocessing images with OpenCV to enhance quality.
    • Using multiprocessing for bulk OCR tasks to handle multiple images at once.
    • Caching results with Redis to avoid redundant OCR processing and improve response times.
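
The caching idea can be sketched without any external service. In the example below an in-memory dict stands in for Redis, and the class name OcrCache and the fake_ocr function are hypothetical; in a real view the wrapped function would call pytesseract, and the dict would be replaced by a Redis-backed cache.

```python
import hashlib


class OcrCache:
    """Cache OCR results keyed by a SHA-256 digest of the image bytes."""

    def __init__(self, ocr_func):
        self._ocr_func = ocr_func  # e.g. a function wrapping pytesseract
        self._store = {}           # stand-in for Redis in this sketch

    def get_text(self, image_bytes):
        key = hashlib.sha256(image_bytes).hexdigest()
        if key not in self._store:
            # Only run the (slow) OCR function on bytes not seen before.
            self._store[key] = self._ocr_func(image_bytes)
        return self._store[key]


calls = []


def fake_ocr(image_bytes):
    """Placeholder for a real OCR call; records how often it runs."""
    calls.append(len(image_bytes))
    return "text for %d bytes" % len(image_bytes)


cache = OcrCache(fake_ocr)
cache.get_text(b"same image")
cache.get_text(b"same image")  # served from the cache, fake_ocr is not called
print(len(calls))  # 1
```

Keying on a digest of the file contents means a re-uploaded image is recognized even if its filename changes.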

These FAQs cover the most essential aspects of using OCR in Django applications, helping users understand the setup, use cases, optimization strategies, and comparison between Tesseract and other OCR libraries.


Conclusion and Final Thoughts

In this guide, we’ve explored how to leverage Tesseract OCR to extract text from images within a Django application. By following the step-by-step process, you can easily set up an OCR-powered system that processes images, extracts meaningful data and stores it in a database like PostgreSQL.

From setting up the environment to integrating OCR with Django views, models, and templates, we’ve seen how the different components of Django’s MTV (Model-Template-View) pattern work together. Using libraries like Pillow, OpenCV, and pytesseract, you can enhance OCR results and customize text extraction to meet specific needs, such as extracting Aadhaar card details from scanned images.

Extracting text with Python, Django, and Tesseract is a game-changer for automating data entry. Whether you’re processing Aadhaar cards, invoices, or receipts, this setup saves hours of manual work. By integrating OCR into your Django application, you can build systems that automate tedious tasks, significantly boosting productivity and reducing human error.

Next Steps

To further enhance your text extraction workflow, consider these next steps:

  1. PDF Text Extraction: Integrate PyPDF2 to extract embedded text from digital PDFs, and render scanned PDF pages to images so they can go through the same OCR pipeline, expanding your capabilities beyond image uploads.
  2. Live Camera OCR: Use OpenCV combined with Django Channels to capture and process live camera feeds, allowing real-time text extraction from images captured via a webcam or mobile camera.

With these additions, your Django-based OCR solution will be even more powerful, flexible, and ready for various real-world applications. Happy coding!

Got questions? Drop them below! 🚀

