[Question] Using pytesseract for OCR

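Author: PHONm (USA~USA)   2016-08-08 17:32:48
I want to do character recognition, but some of the character images are broken up,
and some characters come out as garbage.
So I preprocess the image with OpenCV first, save it as a new file, and run OCR again.
The second run raises a UnicodeDecodeError, even though my code contains no Chinese at all @@!
I'm not sure whether the problem comes from the OpenCV file conversion step.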
=====================Result=====================
<PIL.PngImagePlugin.PngImageFile image mode=RGBA size=397x112 at 0x3C0DF28>
24-D 1813f-ml 1-1
154?Dbb
<PIL.BmpImagePlugin.BmpImageFile image mode=RGB size=397x112 at 0x1131080>
Traceback (most recent call last):
  File "C:/Users/cash.chien/PycharmProjects/OCR/OCRv1.1.py", line 19, in <module>
    str2 = image_to_string(img2)
  File "C:\Anaconda3\lib\site-packages\pytesseract\pytesseract.py", line 167, in image_to_string
    return f.read().strip()
UnicodeDecodeError: 'cp950' codec can't decode byte 0xe2 in position 11: illegal multibyte sequence
=====================Source code=====================
from pytesseract import image_to_string
from PIL import Image
import time
import cv2
import numpy as np
img = Image.open('12.bmp')
print(img)
str = image_to_string(img)
print(str)
img1 = cv2.imread('12.bmp',1)
kernel = np.ones((3,3))
opening = cv2.morphologyEx(img1, cv2.MORPH_OPEN, kernel)
cv2.imwrite('opening.bmp',opening)
img2 = Image.open('opening.bmp')
print(img2)
str2 = image_to_string(img2)
print(str2)
Thanks!
Author: Sunal (SSSSSSSSSSSSSSSSSSSSSSS)   2016-08-09 02:16:00
It's probably a cmd output problem. Try switching it to utf8.
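For what it's worth, here is a minimal sketch of why "no Chinese in the code" doesn't protect you. It assumes (the traceback hints at this, but nothing in the thread confirms it) that the second OCR result contains a non-ASCII character such as a curly quote, which Tesseract writes out as UTF-8 bytes starting with 0xE2, and that the result is then decoded with the Windows locale codec cp950. The sample string and the helper name decode_ocr_bytes are made up for illustration; they are not part of pytesseract.

```python
# Made-up sample: ASCII text followed by a left curly quote (U+2018).
# In UTF-8 that quote is the three bytes E2 80 98.
raw = "24-D 1813f \u2018ml".encode("utf-8")


def decode_ocr_bytes(data):
    """Hypothetical helper: decode OCR output bytes explicitly as UTF-8."""
    return data.decode("utf-8")


# Decoding the same bytes with cp950 (the default codec on a
# Traditional Chinese Windows install) fails at the 0xE2 byte.
try:
    raw.decode("cp950")
    failed_as_cp950 = False
except UnicodeDecodeError as exc:
    failed_as_cp950 = True
    print("cp950 failed:", exc)

text = decode_ocr_bytes(raw)
print(repr(text))
```

If this is what is happening, the fix belongs wherever the OCR result file gets read (for example open(path, encoding='utf-8')), not in the console settings or the OpenCV step.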
Author: goldflower (金色小黃花)   2016-08-11 00:22:00
Convert the encoding of the str first. Also, naming a variable str like that isn't a great idea XD
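On the naming point, a small sketch of what goes wrong once a variable is called str. The function names shadowed and renamed are hypothetical and only exist for this demo; the sample values are borrowed from the output in the original post.

```python
def shadowed():
    # Rebinding the name `str` hides the built-in type inside this scope.
    str = "24-D 1813f"
    try:
        return str(154)  # this now calls a string object, not the type
    except TypeError as exc:
        return "TypeError: " + exc.args[0]


def renamed():
    # A different variable name leaves the built-in untouched.
    text = "24-D 1813f"
    return text + " " + str(154)


print(shadowed())
print(renamed())
```

In the posted script the shadowing is harmless because str() is never called afterwards, but renaming the results to something like text1 and text2 avoids the trap if the script grows.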
