Lengthy comments below discuss representing the smaller images on a DjVu document page as separate objects. This is not easily possible, because a DjVu document page is itself just a single image with an optional text layer, with no "information" about smaller images as separate objects.

We can use the djvu2hocr command (from the ocrodjvu package) to extract the hidden text layer from a DjVu file. It doesn't do any OCR or similar; it just extracts the text layer with its geometry, i.e.:

    djvu2hocr -p 10 sample.djvu | sed 's/ocrx/ocr/g' > pg10.html

The sed intervention corrects class names in the output hOCR (which is just a simple HTML file).

Now we extract the DjVu page to TIFF format with:

    ddjvu -format=tiff -page=10 sample.djvu pg10.tif

So we end up with these files in our work folder: sample.djvu, pg10.html and pg10.tif.

This is where pdfbeads comes into play: we simply execute it in the work folder. This nifty program then takes care of everything inside the folder (HTML and TIFF files with the same base name) and produces the output PDF file, along with some by-products.

The resulting PDF is visually identical to the corresponding page of the input DjVu file and has the text layer inside.
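The two extraction steps above can be wrapped in a small shell function so they are easy to repeat for another page. This is only a sketch: the function names `fix_hocr_classes` and `extract_page` and the `pgN` naming scheme are illustrative choices, and it assumes djvu2hocr (ocrodjvu) and ddjvu (DjVuLibre) are installed and on the PATH. The pdfbeads step is left out and would still be run in the same folder afterwards.

```shell
#!/bin/sh
# Sketch only: helper names below are hypothetical, not part of any tool.

# Rename "ocrx" to "ocr" in hOCR read from stdin, using the same
# sed substitution as the walkthrough.
fix_hocr_classes() {
    sed 's/ocrx/ocr/g'
}

# extract_page DJVU_FILE PAGE_NUMBER
# Writes pgPAGE.html (text layer as hOCR) and pgPAGE.tif (page image)
# into the current folder, so both files share the same base name.
extract_page() {
    djvu="$1"
    page="$2"
    base="pg${page}"

    # Text layer with geometry; no OCR is performed here.
    djvu2hocr -p "$page" "$djvu" | fix_hocr_classes > "${base}.html" &&
        # Page image in TIFF format.
        ddjvu -format=tiff -page="$page" "$djvu" "${base}.tif"
}
```

Calling `extract_page sample.djvu 10` in the work folder should leave pg10.html and pg10.tif next to sample.djvu, matching the manual steps.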