PHPFixing
Showing posts with label ocr. Show all posts

Tuesday, November 1, 2022

[FIXED] How can I optimize my Python loop for speed

 November 01, 2022     for-loop, ocr, performance, python, python-tesseract     No comments   

Issue

I wrote some code that uses OCR to extract text from screenshots of follower lists and then transfers it into a data frame.

The reason I have to juggle the "name" / "display name" pairs and remove blank lines is that the initial text extraction looks something like this:

Screenname 1

name 1

Screenname 2

name 2

(and so on)

So I know in which order each extraction will be. My code works well for 1-30 images, but if I take more than that it gets a bit slow. My goal is to run around 5-10k screenshots through it at once. I'm pretty new to programming, so any ideas/tips on how to optimize the speed would be very appreciated! Thank you all in advance :)


from PIL import Image
from pytesseract import pytesseract
import os
import pandas as pd
from itertools import chain

list_name = []
liste_anzeigename = []
sort = []
f = r'/Users/PycharmProjects/pythonProject/images'
myconfig = r"--psm 4 --oem 3"

for file in os.listdir(f):
    f_img = f+"/"+file
    img = Image.open(f_img)
    img = img.crop((240, 400, 800, 2400))
    img.save(f_img)

for file in os.listdir(f):
    f_img = f + "/" + file
    test = pytesseract.image_to_string(Image.open(f_img), config=myconfig)

    lines = test.split("\n")
    list_raw = [line for line in lines if line.strip() != ""]
    sort.append(list_raw)

    name = {list_raw[0], list_raw[2], list_raw[4],
            list_raw[6], list_raw[8], list_raw[10],
            list_raw[12], list_raw[14], list_raw[16]}
    list_name.append(name)

    anzeigename = {list_raw[1], list_raw[3], list_raw[5],
                   list_raw[7], list_raw[9], list_raw[11],
                   list_raw[13], list_raw[15], list_raw[17]}
    liste_anzeigename.append(anzeigename)

reihenfolge_name = list(chain.from_iterable(list_name))
index_anzeigename = list(chain.from_iterable(liste_anzeigename))
sortieren = list(chain.from_iterable(sort))

sort_name = sorted(reihenfolge_name, key=sortieren.index)
sort_anzeigename = sorted(index_anzeigename, key=sortieren.index)

final = pd.DataFrame(zip(sort_name, sort_anzeigename), columns=['name', 'anzeigename'])
print(final)

Solution

Use a multiprocessing.Pool.

Combine the code under the for-loops and put it into a function process_file. This function should accept a single argument: the name of a file to process.

Next, using listdir, create a list of files to process. Then create a Pool and use its map method to process the list:

import multiprocessing as mp
import os

def process_file(name):
    # your code goes here.
    return anzeigename # Or whatever the result should be.


if __name__ == "__main__":
    f = r'/Users/PycharmProjects/pythonProject/images'
    p = mp.Pool()
    liste_anzeigename = p.map(process_file, os.listdir(f))

This will run your code in parallel on as many cores as your CPU has. For an N-core CPU this will take approximately 1/N of the time it takes without multiprocessing.

Note that the return value of the worker function must be picklable, because it has to be sent from the worker process back to the parent process.
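As a sketch of what the combined worker could look like (assuming pytesseract and Pillow are installed, and reusing the folder path and Tesseract config from the question), the two loops can be merged so each image is cropped in memory, OCR'd, and split into pairs in one pass:

```python
import multiprocessing as mp
import os

FOLDER = r'/Users/PycharmProjects/pythonProject/images'  # path from the question
CONFIG = r"--psm 4 --oem 3"

def split_names(lines):
    """Pair up alternating non-blank lines: (screen name, display name)."""
    lines = [line for line in lines if line.strip()]  # drop blank lines
    return list(zip(lines[0::2], lines[1::2]))        # even/odd positions

def process_file(filename):
    # Import inside the worker so this module stays importable even
    # without Pillow/pytesseract present in the parent environment.
    from PIL import Image
    from pytesseract import pytesseract
    img = Image.open(os.path.join(FOLDER, filename))
    img = img.crop((240, 400, 800, 2400))  # crop in memory, no re-saving
    text = pytesseract.image_to_string(img, config=CONFIG)
    return split_names(text.split("\n"))

if __name__ == "__main__" and os.path.isdir(FOLDER):
    with mp.Pool() as pool:
        # One list of (name, display name) pairs per screenshot,
        # in the same order as the sorted input file names.
        pairs = pool.map(process_file, sorted(os.listdir(FOLDER)))
```

Cropping in memory also removes the first loop from the question, which overwrote every screenshot on disk with its cropped version. Pool.map preserves input order, so no re-sorting by index is needed afterwards.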



Answered By - Roland Smith
Answer Checked By - Mary Flores (PHPFixing Volunteer)

Tuesday, July 26, 2022

[FIXED] How can I extract specific texts from an HTML file by using Notepad++ or Adobe Dreamweaver?

 July 26, 2022     data-entry, dreamweaver, html, notepad++, ocr     No comments   

Issue


I want to extract the id attribute from an HTML file using Notepad++ or Dreamweaver, and delete all other text.

For Eg:

<div id="header" class="header-blue sticky">
<div id="header-message" class="alert alert-dismissible">
<form id="contact-form" class="custom-form" method="POST" action="https://www.google.com">
<input id="your-email" type="email" class="form-email"  placeholder="Your Email">

I want to extract only ID attribute from HTML like this;

id="header"
id="header-message"
id="contact-form"
id="your-email"

So what can I do? Please help me.



Solution

  • Ctrl+H
  • Find what: <\w+[^>]+(id=".+?").*?>
  • Replace with: $1
  • CHECK Wrap around
  • CHECK Regular expression
  • Replace all

Explanation:

<\w+            # opening tag name
[^>]+           # 1 or more characters that are not >
(id=".+?")      # group 1: the id attribute and its value
.*?             # 0 or more of any character, not greedy
>               # end of the tag

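The same pattern can be tried outside Notepad++ as well. For example, a quick check with Python's re module, using the sample HTML from the question, pulls out just the captured group:

```python
import re

html = '''<div id="header" class="header-blue sticky">
<div id="header-message" class="alert alert-dismissible">
<form id="contact-form" class="custom-form" method="POST" action="https://www.google.com">
<input id="your-email" type="email" class="form-email"  placeholder="Your Email">'''

# Same regex as in the Notepad++ dialog; findall returns group 1 only.
ids = re.findall(r'<\w+[^>]+(id=".+?").*?>', html)
for attr in ids:
    print(attr)
```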
Screenshots (before and after) omitted.



Answered By - Toto
Answer Checked By - Clifford M. (PHPFixing Volunteer)

Wednesday, January 12, 2022

[FIXED] How to extract/recognize text from documents?

 January 12, 2022     lamp, ms-word, ocr, pdf, php     No comments   

Issue

I need to extract plain text from uploaded documents in order to make them searchable. Documents could be MS Word or PDF (either scanned or containing text). The application in question runs on a LAMP stack, but installing other software could be an option. Is there any tool, service, library, or combination of those that you could recommend to accomplish this task?


Solution

You can use a combination of shell utilities: pdftotext for PDFs, wvWare for DOC files, docx2txt.pl for DOCX files, as the textractor rubygem does.

# on Ubuntu
apt-get install wv xpdf-utils links

There are also native PHP classes for extracting text from PDF and DOCX.
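One way to glue those utilities together is a small dispatcher that picks the right command per file extension. This is only a sketch in Python (the same idea works from PHP via shell_exec); the exact output flags of wvText and docx2txt.pl should be checked against your installed versions:

```python
import os
import subprocess

# Extension -> argv list for the extracting tool ("-" means write to stdout).
EXTRACTORS = {
    ".pdf":  lambda path: ["pdftotext", path, "-"],
    ".doc":  lambda path: ["wvText", path, "/dev/stdout"],
    ".docx": lambda path: ["docx2txt.pl", path, "-"],
}

def extraction_command(path):
    """Return the command line (as an argv list) for extracting plain text."""
    ext = os.path.splitext(path)[1].lower()
    if ext not in EXTRACTORS:
        raise ValueError("no extractor registered for %r" % ext)
    return EXTRACTORS[ext](path)

def extract_text(path):
    """Run the tool and return its stdout as the document's plain text."""
    return subprocess.run(extraction_command(path), capture_output=True,
                          text=True, check=True).stdout
```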

Another rubygem, which even does OCR for you through Tesseract, is docsplit.

It might be a good idea to consider Solr for indexing and searching. You may use the Solr Cell plugin to index and search Word documents, PDFs, and more. I use it successfully in one of my projects. Solr Cell is based on several projects like Apache POI, Tika and PDFBox.

The tricky part is to set up all the Cell-dependent jars and the Solr schema, and to figure out the indexing request parameters, but all of it can be worked out from the wiki documentation. Here are my jars and schema to get you started; the relevant part of the schema is the line containing "attachment".

Solr Cell does not do OCR, though. You will have to use an OCR Engine first to make them searchable.

For OCR you can use the open-source engine Tesseract, which is developed by Google, or you might want to have a look at the commercial engine ABBYY. Both come as command-line utilities that you can run from your PHP scripts. To get results from Tesseract comparable to ABBYY's, you will have to do some pre- and postprocessing. There are also cloud services, which might be an easier option, for instance WiseTREND and ABBYY Cloud. The latter is in beta at the moment, so it is free of charge and has ready-to-go PHP code samples.
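Running Tesseract from a script follows the same pattern as the other shell utilities. The CLI writes its result to <output base>.txt, so a minimal wrapper (a sketch; add language or page-segmentation options as needed) could look like:

```python
import subprocess

def ocr_command(image_path, out_base):
    # `tesseract input.png out` writes the recognized text to out.txt.
    return ["tesseract", image_path, out_base]

def ocr_image(image_path, out_base="out"):
    """Run Tesseract on one image and return the recognized text."""
    subprocess.run(ocr_command(image_path, out_base), check=True)
    with open(out_base + ".txt", encoding="utf-8") as fh:
        return fh.read()
```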



Answered By - clyfe
Copyright © PHPFixing