Extract text from PDF in Python

You are probably familiar with PDFs; they are among the most widely used digital document formats. PDF stands for Portable Document Format and uses the .pdf extension. It is used to present and exchange documents reliably, independent of software, hardware, or operating system.

Extracting Text from PDF File

The Python package PyPDF2 can be used for text extraction, although it can do more than that: it can also generate, decrypt, and merge PDF files.

Note: For more information, refer to Working with PDF files in Python

Installation

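To install this package, run the command below in a terminal. The example uses the PdfFileReader API, which PyPDF2 3.0 removed, so the command pins a version below 3.

pip install "PyPDF2<3"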

Example:

Input PDF: example.pdf

import PyPDF2

pdfFileObj = open('example.pdf', 'rb')

pdfReader = PyPDF2.PdfFileReader(pdfFileObj)

print(pdfReader.numPages)

pageObj = pdfReader.getPage(0)

print(pageObj.extractText())

pdfFileObj.close()

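Output: the number of pages on the first line (20 for this example file), followed by the text of the first page.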

Let us try to understand the above code in chunks:

```
pdfFileObj = open('example.pdf', 'rb')
```
We open example.pdf in binary mode and save the file object as pdfFileObj.
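
Binary mode matters because a PDF is not plain text: the reader needs the raw bytes, with no decoding or newline translation. As a small illustration, the sketch below opens a file in 'rb' mode and checks for the `%PDF-` marker that PDF files start with. The helper name `looks_like_pdf` is hypothetical, and the check assumes the marker sits at the very start of the file, which is the usual case but not guaranteed for every file.

```python
import os
import tempfile


def looks_like_pdf(path):
    """Return True if the file at `path` begins with the PDF header marker."""
    # 'rb' yields bytes with no text decoding or newline translation.
    # The `with` block closes the file even if reading fails.
    with open(path, "rb") as pdf_file:
        return pdf_file.read(5) == b"%PDF-"


if __name__ == "__main__":
    # Write a tiny stand-in file so the check can be tried without a real PDF.
    with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as tmp:
        tmp.write(b"%PDF-1.4\n")
    print(looks_like_pdf(tmp.name))
    os.remove(tmp.name)
```

Using `with` here also removes the need for an explicit close() call.
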
```
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
```
Here, we create an object of the PdfFileReader class of the PyPDF2 module, passing it the PDF file object, and get a PDF reader object.
```
print(pdfReader.numPages)
```
The numPages property gives the number of pages in the PDF file. In our case, it is 20 (see the first line of the output).
```
pageObj = pdfReader.getPage(0)
```
Now, we create an object of the PageObject class of the PyPDF2 module. The PDF reader object has a getPage() function, which takes a page number (starting from index 0) as its argument and returns the page object.
```
print(pageObj.extractText())
```
The page object has an extractText() function to extract text from the PDF page.
```
pdfFileObj.close()
```
Finally, we close the PDF file object.
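
The names used in this walkthrough belong to the older PyPDF2 API. PyPDF2 3.0 removed PdfFileReader, numPages, getPage() and extractText() in favour of PdfReader, len(reader.pages), reader.pages[i] and extract_text(). The following is a minimal sketch of the same steps with the newer names, assuming PyPDF2 3.x is installed; `join_page_text` and `read_pdf_text` are hypothetical helper names, and the sketch reads every page rather than only the first.

```python
import sys


def join_page_text(pages):
    """Concatenate the text of every page, one page per line group."""
    # `or ""` guards against a page object that yields no text at all.
    return "\n".join((page.extract_text() or "") for page in pages)


def read_pdf_text(path):
    """Return (page_count, text) for the PDF at `path`."""
    # Imported lazily so join_page_text stays usable without PyPDF2.
    from PyPDF2 import PdfReader

    # The context manager closes the file even if extraction raises.
    with open(path, "rb") as pdf_file:
        reader = PdfReader(pdf_file)
        return len(reader.pages), join_page_text(reader.pages)


if __name__ == "__main__":
    # Usage: python script.py example.pdf
    if len(sys.argv) > 1:
        count, text = read_pdf_text(sys.argv[1])
        print(count)
        print(text)
```

Extraction happens inside the `with` block because the reader pulls page data from the open file on demand.
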

I am adding code to accomplish this. It works fine for me:

# This works in python 3
# required python packages
# tabula-py==1.0.0
# PyPDF2==1.26.0
# Pillow==4.0.0
# pdfminer.six==20170720

import os
import shutil
import warnings
from io import StringIO

import requests
import tabula
from PIL import Image
from PyPDF2 import PdfFileWriter, PdfFileReader
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage

warnings.filterwarnings("ignore")


def download_file(url):
    local_filename = url.split('/')[-1]
    local_filename = local_filename.replace("%20", "_")
    r = requests.get(url, stream=True)
    print(r)
    with open(local_filename, 'wb') as f:
        shutil.copyfileobj(r.raw, f)

    return local_filename


class PDFExtractor():
    def __init__(self, url):
        self.url = url

    # Split the PDF into one file per page; page numbers are 1-based, -1 means "no limit"
    def break_pdf(self, filename, start_page=-1, end_page=-1):
        pdf_reader = PdfFileReader(open(filename, "rb"))
        total_pages = pdf_reader.numPages
        if start_page == -1:
            start_page = 0
        elif start_page < 1 or start_page > total_pages:
            return "Start Page Selection Is Wrong"
        else:
            start_page = start_page - 1

        if end_page == -1:
            end_page = total_pages
        # The last page (end_page == total_pages) is a valid selection
        elif end_page < 1 or end_page > total_pages:
            return "End Page Selection Is Wrong"

        # Write each selected page to its own file
        for i in range(start_page, end_page):
            output = PdfFileWriter()
            output.addPage(pdf_reader.getPage(i))
            with open(str(i + 1) + "_" + filename, "wb") as outputStream:
                output.write(outputStream)

    def extract_text_algo_1(self, file):
        pdf_reader = PdfFileReader(open(file, 'rb'))
        # creating a page object
        pageObj = pdf_reader.getPage(0)

        # extracting text from the page
        text = pageObj.extractText()
        text = text.replace("\n", "").replace("\t", "")
        return text

    def extract_text_algo_2(self, file):
        pdfResourceManager = PDFResourceManager()
        retstr = StringIO()
        la_params = LAParams()
        device = TextConverter(pdfResourceManager, retstr, codec='utf-8', laparams=la_params)
        fp = open(file, 'rb')
        interpreter = PDFPageInterpreter(pdfResourceManager, device)
        password = ""
        max_pages = 0
        caching = True
        page_num = set()

        for page in PDFPage.get_pages(fp, page_num, maxpages=max_pages, password=password, caching=caching,
                                      check_extractable=True):
            interpreter.process_page(page)

        text = retstr.getvalue()
        text = text.replace("\t", "").replace("\n", "")

        fp.close()
        device.close()
        retstr.close()
        return text

    # Run both extractors and keep whichever returns more text
    def extract_text(self, file):
        text1 = self.extract_text_algo_1(file)
        text2 = self.extract_text_algo_2(file)

        if len(text2) > len(str(text1)):
            return text2
        else:
            return text1

    def extarct_table(self, file):

        # Read pdf into DataFrame
        try:
            df = tabula.read_pdf(file, output_format="csv")
        except Exception:
            print("Error Reading Table")
            return

        print("\nPrinting Table Content: \n", df)
        print("\nDone Printing Table Content\n")

    def tiff_header_for_CCITT(self, width, height, img_size, CCITT_group=4):
        tiff_header_struct = '
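
Looking back at break_pdf, its page-range handling can be isolated into a small helper that is easy to check without any PDF library. The sketch below is only an illustration: `page_index_range` is a hypothetical name, it raises ValueError where break_pdf returns a message string, and it additionally rejects an end page that comes before the start page, which the listing does not check.

```python
def page_index_range(total_pages, start_page=-1, end_page=-1):
    """Map 1-based, inclusive page numbers to a 0-based range.

    A value of -1 means "from the first page" or "to the last page".
    """
    if start_page == -1:
        start_page = 1
    if end_page == -1:
        end_page = total_pages
    if not 1 <= start_page <= total_pages:
        raise ValueError("start page selection is wrong")
    if not start_page <= end_page <= total_pages:
        raise ValueError("end page selection is wrong")
    # Page n lives at index n - 1; range() excludes its upper bound,
    # so end_page itself is the correct stop value.
    return range(start_page - 1, end_page)


if __name__ == "__main__":
    print(list(page_index_range(20, 3, 5)))  # [2, 3, 4]
```

Each index in the returned range could then be handed to getPage() in a loop like the one in break_pdf.
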


				
					

                 