Python - pyocr with Tesseract runs out of memory
I made a script to batch-process PDF scans into text using Tesseract and pyocr. The code is below. The problem is that when processing lots of files (20+), at some point the script runs out of memory and fails with an OSError. I made it so that it can pick up smoothly where it crashed after a manual restart, but these manual restarts are tedious.
Since pyocr is a black box to me, I tried wrapping the script in other Python scripts that would restart it on a crash, but they all seem to go down on that error, and the memory is only freed when every related script terminates.
The only other solution I can think of is to make an external wrapper that checks whether the script is running and restarts it if it is not and there are still unprocessed files.
But maybe there is a better solution? Or maybe I wrote lame code that can be improved to avoid these memory crashes? (Other than that, I know it's lame, but it works well enough :) ).
from io import BytesIO
from wand.image import Image
from PIL import Image as PI
import pyocr
import pyocr.builders
import io
import os
import os.path
import ast


def daemon_ocr(tool, img, lang):
    txt = tool.image_to_string(
        PI.open(BytesIO(img)),
        lang=lang,
        builder=pyocr.builders.TextBuilder()
    )
    return txt


def daemon_wrap(image_pdf, tool, lang, iteration):
    print(iteration)
    req_image = []
    final_text = ''
    image_pdf_bckp = image_pdf
    image_jpeg = image_pdf.convert('jpeg')
    for img in image_jpeg.sequence:
        img_page = Image(image=img)
        req_image.append(img_page.make_blob('jpeg'))
    for img in req_image:
        txt = daemon_ocr(tool, img, lang)
        final_text += txt + '\n '
    if 'работ' not in final_text and 'фактура' not in final_text and 'Аренда' not in final_text and 'Сумма' not in final_text\
            and 'аренде' not in final_text and 'товара' not in final_text:
        if iteration < 5:
            iteration += 1
            image_pdf = image_pdf.rotate(90)
            final_text = daemon_wrap(image_pdf_bckp, tool, lang, iteration)
    return final_text


def daemon_pyocr(food):
    tool = pyocr.get_available_tools()[0]
    lang = tool.get_available_languages()[0]
    iteration = 1
    image_pdf = Image(filename='{doc_name}'.format(doc_name=food), resolution=300)
    final_text = daemon_wrap(image_pdf, tool, lang, iteration)
    return final_text


files = [f for f in os.listdir('.') if os.path.isfile(f)]
output = {}
print(files)
path = os.path.dirname(os.path.abspath(__file__))
if os.path.exists('{p}/output'.format(p=path)):
    text_file = open("output", "a")
    first = False
else:
    text_file = open("output", "w")
    first = True
for f in files:
    if f != 'ocr.py' and f != 'output':
        try:
            output[f] = daemon_pyocr(f)
            print('{f} done'.format(f=f))
            if first:
                text_file.write(str(output)[1:-1])
                first = False
            else:
                text_file.write(', {d}'.format(d=str(output)[1:-1]))
            output = {}
            os.rename('{p}/{f}'.format(p=path, f=f), "{p}/done/{f}".format(p=path, f=f))
        except OSError:
            print('{f} failed: not enough memory.'.format(f=f))
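The external-wrapper idea mentioned above could be sketched with the standard subprocess module: each file is handed to a fresh Python process, so whatever memory the OCR step was holding goes back to the operating system when that process exits. This is only a rough sketch and not part of the original script; run_isolated, worker_cmd, and the ocr_one.py worker named in the comment are hypothetical names.

```python
import subprocess
import sys


def run_isolated(worker_cmd, file_name, retries=2):
    """Run worker_cmd + [file_name] in a fresh process, retrying on failure.

    Returns the worker's stdout on success, or None if every attempt failed.
    All names here are illustrative; none of them come from pyocr or Wand.
    """
    for attempt in range(1, retries + 2):
        result = subprocess.run(
            worker_cmd + [file_name],
            capture_output=True,  # collect stdout/stderr instead of printing
            text=True,            # decode the output to str
        )
        if result.returncode == 0:
            return result.stdout
        print('{f}: attempt {a} exited with code {c}'.format(
            f=file_name, a=attempt, c=result.returncode), file=sys.stderr)
    return None


# Hypothetical usage: ocr_one.py would OCR a single file and print its text.
# text = run_isolated([sys.executable, 'ocr_one.py'], 'scan_001.pdf')
```

The trade-off is a new interpreter start per file, which is usually small next to the OCR time itself.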
I was having the same issue and sorted it out. The real problem is not pyocr, but the sequence of wand.image.Image.
You can use the destroy() method of the Image object to free the memory. Also, use the with statement when dealing with Wand.
There are questions on this topic here and here.
Here is my code, which converts a PDF to image blobs, in case it helps you:
from wand.image import Image as WI  # "WI" alias assumed from the snippet


def convert_pdf_to_image_blob(pdf):
    req_image = []
    # The with statement releases the Wand image when the block exits.
    with WI(filename=pdf, resolution=150) as image_jpeg:
        image_jpeg.compression_quality = 99
        image_jpeg = image_jpeg.convert('jpeg')
        for img in image_jpeg.sequence:
            with WI(image=img) as img_page:
                req_image.append(img_page.make_blob('jpeg'))
    image_jpeg.destroy()  # frees memory used by the Image object.
    return req_image
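Going one step further, the pages could be handed to the OCR step one at a time instead of being collected in a list first, so that only a single page blob is alive at any moment. The following is a minimal sketch under that assumption, not the code I actually ran; iter_page_blobs and ocr_pages are made-up names, and Wand is imported inside the generator only so that ocr_pages can be tried without Wand installed.

```python
def iter_page_blobs(pdf_path, resolution=150):
    """Yield one JPEG blob per PDF page, closing each Wand image right away."""
    # Imported lazily so ocr_pages below stays usable without Wand.
    from wand.image import Image as WI

    with WI(filename=pdf_path, resolution=resolution) as pdf:
        for page in pdf.sequence:
            # Each page image is released as soon as its blob has been made.
            with WI(image=page) as page_image:
                yield page_image.make_blob('jpeg')


def ocr_pages(page_blobs, ocr_blob):
    """Run ocr_blob on each blob in turn and join the page texts."""
    parts = []
    for blob in page_blobs:
        parts.append(ocr_blob(blob))
    return '\n'.join(parts)


# Hypothetical usage with the question's daemon_ocr(tool, img, lang):
# text = ocr_pages(iter_page_blobs('scan.pdf'),
#                  lambda blob: daemon_ocr(tool, blob, lang))
```

Because iter_page_blobs is a generator, the PDF stays open only while pages are being consumed, and it is closed once the loop in ocr_pages finishes.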
Thanks.