Does anyone have experience with the accuracy of iTextSharp when reading text from a multi-page scanned PDF?
The thing is, I have tried to read a PDF with both the basic search function within Adobe Reader and with iTextSharp.
iTextSharp manages to find roughly 50% of the occurrences of a given word, compared to (what I call) 100% by Adobe.
Is this a known 'problem'?

Edit: I should add that the PDF has already been OCR'ed by the time I'm searching.
Jens Langenbach
As @ChrisHaas already explained, without code and PDF samples it's hard to be specific.
First of all, saying iTextSharp manages to find roughly 50% of the occurrences of a given word is a bit misleading, as iText(Sharp) does not directly expose methods to find specific text in a PDF and, therefore, actually finds 0%. It merely provides a framework and some simple examples for text extraction.
Using this framework for seriously searching for a given word requires more than applying those simple sample usages (provided by the SimpleTextExtractionStrategy and the LocationTextExtractionStrategy, also working under the hood when using PdfTextExtractor.GetTextFromPage(myReader, pageNum)) in combination with some Contains(word) call.
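For reference, such a naive search might look like the following minimal sketch (the file name and search word are placeholders; this is essentially the sample usage referred to above, not a recommended solution):

    using System;
    using iTextSharp.text.pdf;
    using iTextSharp.text.pdf.parser;

    class NaiveSearch
    {
        static void Main()
        {
            PdfReader reader = new PdfReader("scanned.pdf"); // placeholder file name
            for (int page = 1; page <= reader.NumberOfPages; page++)
            {
                // LocationTextExtractionStrategy is the strategy working under
                // the hood of the plain GetTextFromPage(reader, page) call.
                string text = PdfTextExtractor.GetTextFromPage(
                    reader, page, new LocationTextExtractionStrategy());
                if (text.Contains("searchterm")) // placeholder search word
                    Console.WriteLine("Found on page " + page);
            }
            reader.Close();
        }
    }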
You have to:

- create a better text extraction strategy which
  - has a better algorithm to recognize which glyphs belong to which line; e.g. the sample strategies can fail utterly for scanned pages with OCR'ed text where the text lines are not 100% straight but minimally ascending;
  - recognizes poor man's bold (printing the same letter twice with a very small offset to achieve the impression of a bold character style) and similar constructs and transforms them accordingly;
- create a text normalization which
  - resolves ligatures;
  - unifies alternative glyphs of semantically identical or similar characters;
- normalize both the extracted text and your search term, and only then search (a minimal normalization sketch follows this list).
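To illustrate the normalization steps, here is a minimal sketch, assuming .NET's Unicode compatibility normalization is good enough for the ligatures and alternative glyph forms you encounter (the class and method names are made up for the example):

    using System.Text;

    static class SearchNormalizer
    {
        // Apply this to both the extracted page text and the search term
        // before comparing them.
        public static string Fold(string text)
        {
            // Compatibility normalization (NFKC) resolves standard ligatures
            // (e.g. "ﬁ" -> "fi") and unifies many alternative glyph forms of
            // semantically equivalent characters.
            string normalized = text.Normalize(NormalizationForm.FormKC);
            // Case folding so that small caps, all-caps headings etc. do not
            // derail the comparison.
            return normalized.ToUpperInvariant();
        }
    }

Used e.g. as SearchNormalizer.Fold(pageText).Contains(SearchNormalizer.Fold(word)). Anything NFKC does not cover (exotic glyph variants, poor man's bold duplicates) still needs explicit handling in the extraction strategy itself.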
Furthermore, as @ChrisHaas mentioned, special attention has to be paid to spaces in the text.
If you create an iText-based text search with those criteria in mind, you'll surely get an acceptable hit rate. Getting as good as Adobe Reader is quite a task, though, as they have already invested considerable resources into this feature.
For completeness' sake, you should not only search the page content and everything referred to from there but also the annotations, which can carry quite some text content, too, and may even appear as if they were part of the page, e.g. in the case of free text annotations.
mkl
Without knowing the specifics of your situation (the PDF in question, the code used, etc.) we can't help you too much.
However, I can tell you that iTextSharp has more of a literal text extractor. Since text in a PDF can be, and often is, non-contiguous and non-linear, iTextSharp takes any contiguous characters and builds what we think of as words and sentences. It also tries to combine characters that appear to be 'pretty much on the same line' and does the same (such as text on a slight angle, as OCR'd text often is). Then there are 'spaces', which should be simple ASCII 32 characters but often aren't; iTextSharp goes the extra mile and attempts to calculate whether two text runs should be separated by spaces.
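If you want to see what iTextSharp is actually working with, you can dump the raw text chunks together with their baseline positions. Below is a minimal sketch against the iTextSharp 5 parser API (the file name and the ChunkDumper class name are just illustrative):

    using System;
    using iTextSharp.text.pdf;
    using iTextSharp.text.pdf.parser;

    // Prints every text chunk with the start point of its baseline, which makes
    // the non-linear ordering and the slight vertical drift of OCR'ed lines visible.
    class ChunkDumper : IRenderListener
    {
        public void BeginTextBlock() { }
        public void EndTextBlock() { }
        public void RenderImage(ImageRenderInfo renderInfo) { }

        public void RenderText(TextRenderInfo renderInfo)
        {
            Vector start = renderInfo.GetBaseline().GetStartPoint();
            Console.WriteLine("x={0:F1} y={1:F1} '{2}'",
                start[Vector.I1], start[Vector.I2], renderInfo.GetText());
        }
    }

    class Program
    {
        static void Main()
        {
            PdfReader reader = new PdfReader("scanned.pdf"); // placeholder file name
            PdfReaderContentParser parser = new PdfReaderContentParser(reader);
            for (int page = 1; page <= reader.NumberOfPages; page++)
                parser.ProcessContent(page, new ChunkDumper());
            reader.Close();
        }
    }

On an OCR'ed scan you will typically see the y-coordinate drift slightly within what looks like a single line, which is exactly what the 'pretty much on the same line' heuristic has to cope with.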
Adobe probably has further heuristics that are able to guess even more about text. My guess would be that they have a larger threshold for guessing at combining non-linear text.
Chris Haas