python - pyocr with tesseract runs out of memory

I made a script to batch-process PDF scans into text with Tesseract and pyocr. The code is below. The problem is that when processing lots of files (20+), at some point the script runs out of memory and fails with an OSError. I have made it so that it can smoothly pick up where it crashed after a manual restart, but these manual restarts are tedious.

Since pyocr is a black box to me, I tried wrapping the script in other Python scripts that would restart it on a crash, but they all seem to go down on that error, and the memory is only freed when every related script terminates.

The only other solution I can think of is to make an external wrapper that checks whether the script is running and restarts it if it is not and there are still unprocessed files.
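One way to sketch the external-wrapper idea is to run the OCR step for each file in its own child process, so that whatever memory a run leaks is returned to the operating system when that child exits. This is only a sketch under assumptions: `run_isolated` is a hypothetical helper, and it assumes the per-file OCR logic has been moved into a separate script (called `ocr_one.py` here purely for illustration) that takes one file name as its argument and exits with a non-zero status on failure.

```python
import subprocess
import sys


def run_isolated(worker_cmd, files):
    """Run worker_cmd once per file, each time in a fresh child process.

    Returns the list of files whose child exited with a non-zero status.
    """
    failed = []
    for name in files:
        # Each child has its own address space, so memory it leaks is
        # released as soon as the child process terminates.
        result = subprocess.run(worker_cmd + [name])
        if result.returncode != 0:
            failed.append(name)
    return failed


# Hypothetical usage, assuming ocr_one.py OCRs the file given in argv[1]:
#   leftover = run_isolated([sys.executable, 'ocr_one.py'], ['a.pdf', 'b.pdf'])
```

Files that come back in the returned list could be retried or logged. Starting a process per file costs some time, but for OCR jobs that is likely small next to the recognition itself.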

But maybe there is a better solution? Or maybe I wrote lame code that can be improved to avoid these memory crashes? (Other than that, I know it is lame, but it works well enough :) ).

    from io import BytesIO
    from wand.image import Image
    from PIL import Image as PI
    import pyocr
    import pyocr.builders
    import io
    import os
    import os.path
    import ast


    def daemon_ocr(tool, img, lang):
        txt = tool.image_to_string(
            PI.open(BytesIO(img)),
            lang=lang,
            builder=pyocr.builders.TextBuilder()
        )
        return txt


    def daemon_wrap(image_pdf, tool, lang, iteration):
        print(iteration)
        req_image = []
        final_text = ''
        image_pdf_bckp = image_pdf
        image_jpeg = image_pdf.convert('jpeg')

        for img in image_jpeg.sequence:
            img_page = Image(image=img)
            req_image.append(img_page.make_blob('jpeg'))

        for img in req_image:
            txt = daemon_ocr(tool, img, lang)
            final_text += txt + '\n '
        if 'работ' not in final_text and 'фактура' not in final_text and 'Аренда' not in final_text and 'Сумма' not in final_text\
                and 'аренде' not in final_text and 'товара' not in final_text:
            if iteration < 5:
                iteration += 1
                image_pdf = image_pdf.rotate(90)
                final_text = daemon_wrap(image_pdf_bckp, tool, lang, iteration)
        return final_text


    def daemon_pyocr(food):
        tool = pyocr.get_available_tools()[0]
        lang = tool.get_available_languages()[0]
        iteration = 1
        image_pdf = Image(filename='{doc_name}'.format(doc_name=food), resolution=300)
        final_text = daemon_wrap(image_pdf, tool, lang, iteration)
        return final_text


    files = [f for f in os.listdir('.') if os.path.isfile(f)]
    output = {}
    print(files)
    path = os.path.dirname(os.path.abspath(__file__))
    if os.path.exists('{p}/output'.format(p=path)):
        text_file = open("output", "a")
        first = False
    else:
        text_file = open("output", "w")
        first = True

    for f in files:
        if f != 'ocr.py' and f != 'output':
            try:
                output[f] = daemon_pyocr(f)
                print('{f} done'.format(f=f))
                if first:
                    text_file.write(str(output)[1:-1])
                    first = False
                else:
                    text_file.write(', {d}'.format(d=str(output)[1:-1]))
                output = {}
                os.rename('{p}/{f}'.format(p=path, f=f), "{p}/done/{f}".format(p=path, f=f))
            except OSError:
                print('{f} failed: not enough memory.'.format(f=f))

I was having the same issue and sorted it out. The real problem is not pyocr but the sequence of wand.image.Image.

You can use the destroy() method of the Image object to free the memory. Use the with statement when dealing with wand.

There are related questions on this topic here and here.

Here is my code, which converts a PDF to image blobs, if it helps you:

    # Import assumed; the snippet as posted did not show where `wi` comes from.
    from wand.image import Image as wi


    def convert_pdf_to_image_blob(pdf):
        req_image = []
        with wi(filename=pdf, resolution=150) as image_jpeg:
            image_jpeg.compression_quality = 99
            image_jpeg = image_jpeg.convert('jpeg')

            for img in image_jpeg.sequence:
                with wi(image=img) as img_page:
                    req_image.append(img_page.make_blob('jpeg'))
        image_jpeg.destroy()  # frees memory used by the image object.

        return req_image
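If holding every page blob in a list is itself too heavy for large PDFs, a generator variant of the same idea might help: it yields one page at a time and lets the with statement close each wand image as soon as its blob has been produced. This is a sketch, not the answer's original code. `iter_page_blobs` and its `image_cls` parameter are hypothetical names introduced here; `image_cls` defaults to `wand.image.Image` and exists only so the function can be exercised without ImageMagick installed.

```python
def iter_page_blobs(pdf_path, resolution=150, image_cls=None):
    """Yield one JPEG blob per PDF page, closing each image promptly."""
    if image_cls is None:
        # Imported lazily so the module loads even where wand is missing.
        from wand.image import Image as image_cls
    with image_cls(filename=pdf_path, resolution=resolution) as pdf:
        for page in pdf.sequence:
            # The inner with block releases the page image right after
            # its blob has been created.
            with image_cls(image=page) as page_img:
                yield page_img.make_blob('jpeg')
```

Each yielded blob can be passed straight to the OCR call and then dropped, so at most one page blob needs to be alive at a time.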

thanks
