I use the Google Vision API for text recognition on an image. The problem is that it recognizes the text, but does not recognize the columns.
Let us imagine for example the following table:
[image: example table]
On the surface the solution seems simple: to get the text from, say, column 3, row 4, I just need to specify two points in the image (the upper-left and lower-right corners) of that cell.
However, the problem is that a column may have a different width in another document, and then the coordinates of that cell will change.
One could instead search for the cell by the text of its header, but if the image quality is poor, the header may not be recognized, because the header font is a few points smaller than the body text.
I tried using OpenCV from Node.js, but it gave me not the cell boundaries, rather the boundaries of everything, down to the borders of each individual letter.
In what direction should I look to find a solution to my problem?
Answer 1, Authority 100%
I can offer a solution I once wrote. It does NOT work out of the box, but its core does. It is written in Python 3 and uses
OpenCV. If you have questions, you can write to me and I will build it for you. Two years ago there were no open solutions that solved this problem more or less satisfactorily. My tool gives fairly stable cell extraction as long as the cells are not too small. The result is returned as a dict of lists of cell corners, so the cells can be parsed and treated as separate images.
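Consuming such a result amounts to slicing each cell out of the page image. A sketch under assumed names (the exact structure of the tool's dict is not documented here, so the shape of `cells` below is a guess for illustration):

```python
import numpy as np

# Assumed output shape: row index -> list of cell corner boxes
# (x1, y1, x2, y2). The names and layout are illustrative only.
cells = {
    0: [(10, 10, 100, 40), (100, 10, 200, 40)],
    1: [(10, 40, 100, 80), (100, 40, 200, 80)],
}

page = np.zeros((120, 220), dtype=np.uint8)  # stand-in for the page image

# Slice every cell out as an independent image; each crop can then be
# fed to an OCR engine (e.g. Tesseract) on its own.
crops = {
    row: [page[y1:y2, x1:x2] for (x1, y1, x2, y2) in boxes]
    for row, boxes in cells.items()
}
print(crops[0][1].shape)  # (30, 100)
```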
There is Adobe Reader, which can do this, but it is expensive.
HoChiMinh is designed to extract textual information from tables in PDFs, images, or other formats. If the table is heavily skewed, the data will not be extracted correctly. A PDF is first converted into images (one per page), and then each image is processed separately.
The utility ignores the text layer and treats any PDF as an image.
It is assumed that each page contains exactly one table, and that the table does not continue onto the next page. If there are several tables on a page, recognition will most likely not work correctly.
HoChiMinh cannot work with complex tables in which several cells are merged.
The detection idea is to extract connected components: sufficiently long blocks that are likely to be lines. These lines are then closed, and the cells are extracted from them. All of this is seasoned with a number of heuristics picked up by experience.
The approximate phases of work:
- Detection of horizontal and vertical lines
- Filtering out unnecessary points
To run it you need OpenCV and hochiminh itself:
from hochiminh import pdf_parser
from hochiminh.image_processing import hochiminh
from hochiminh.image_processing.connected_components import ConnectedComponents
from hochiminh.image_processing.cross_detector import CrossDetector
from hochiminh.image_processing.lines_detector import SobelDirector
from hochiminh.image_processing.ocr import TesseractWrapper
from hochiminh.io import pdfconverter, reader

path = "../data/test/ho_chi_minh/"
parser = pdf_parser.PDFParser(
    table_extractor=hochiminh.HoChiMinh(
        reader=reader.ImagePDFReader(
            pdfconverter.PDFConverter(in_path=path + 'pdf/',
                                      out_path=path + 'pdf/images/',
                                      resolution=130)
        ),
        lines_detector=SobelDirector(),
        connected_components=ConnectedComponents(),
        cross_detector=CrossDetector(max_steps=20, detected_steps=18, line_width=8),
        ocr=TesseractWrapper(),
        binarization=210
    )
)

tables = parser.extract_table()
for table in tables:
    print('--------------- Table ---------------')
    for cell in table:
        print(cell.__dict__)
Use the Canny edge detector on the image, then discard all non-vertical / non-horizontal lines. Or use minAreaRect, like here.