PDF to normal editable text?

Beekeeping & Apiculture Forum


Poly Hive

I have three scanned articles that I need to convert to normal text to upload to a website. Any suggestions as to how to convert the scans, which are in PDF, to normal text please?

PH
 
Just tried it out in Linux Mint: click on the PDF file to open it, "select all", "copy", then paste into a new document in LibreOffice. No drama or hassle.

That's for "text". If the PDF uses "images" of the text, you'll have to run it through an OCR (optical character recognition) programme.
 
The difficulty with "images" arises because PDF files can hold both text and images. If you scan a document and just bung the resultant "image" into a PDF file, it's effectively a "picture" such as you would take with a camera (it could be a black and white teapot as far as the computer's concerned). If that's the case, you could probably copy and paste it as an "image", but it wouldn't be editable in the sense of being able to alter the text in any way. That is where "OCR" comes in. High-end scanners used to include it, so that the scanner would effectively "read" the image when scanned and spit out proper text. If you have a PDF file with an image of text, you may be able to get editable text by using OCR (there are several free downloadable programmes). It can be pretty effective, but it's a bit like using Google Translate: you'll need to check it thoroughly for pidgin English!
 
The basic point to remember is that the PDF file is only a container. What is inside can be text, with or without formatting and font information, or it can be images in various formats (JPEG, TIFF, PNG etc). A quick test of what you have in an unknown PDF is to try selecting a block of text. If you can select individual words, it's text; if the whole page highlights, it's a scanned image. A scanned image of a page remains an image until it is run through OCR (optical character recognition) software. That needs a clear scan of regular fonts to work out which blob on the image is which letter. Small print, marks on the paper, unusual fonts and printing on cheap paper that smudges characters together (e.g. old newspapers) all make OCR less accurate.
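The same "can I select words?" test can be roughly automated. What follows is only a minimal sketch, assuming Poppler's pdftotext is installed and on the PATH; the function names and the 20-character threshold are made up for illustration. It pulls out whatever text layer exists and guesses "scan" if almost nothing readable comes back.

```python
import subprocess


def extract_text_layer(pdf_path: str) -> str:
    """Return the text layer that pdftotext finds in the PDF.

    Passing "-" as the output file makes pdftotext write to stdout.
    """
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def looks_like_scan(extracted: str, min_chars: int = 20) -> bool:
    """Guess that a PDF is image-only when hardly any characters came out.

    min_chars is an arbitrary, hypothetical threshold; tune it to taste.
    """
    # Count only letters and digits so page breaks and whitespace are ignored.
    meaningful = sum(1 for ch in extracted if ch.isalnum())
    return meaningful < min_chars
```

Something like `looks_like_scan(extract_text_layer("article.pdf"))` (the file name is just an example) would then say whether OCR is likely to be needed. It is only a heuristic: a scan that has already been through OCR carries a hidden text layer and will show up as text.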

There are paid-for OCR applications, there are versions bundled with some scanners and there are a few free options. When I did more OCR at home than I do now, the Abbyy Finereader that came bundled with a scanner did a basic job of extracting text from most pages, but the result always needed proofreading and reformatting. There is a free trial of the latest paid-for version listed on their web site. It's only one of many OCR suites. I've used some of the Fuji and Xerox products before, but the versions I've seen are tied into their full document processing software suites and scanner hardware. One free route that I know some occasional OCR users take is to go via Google Docs.

An alternative to doing OCR straight away is to put the image PDF on the website and link to it with a title, intro or summary paragraph that you type in. That leaves the original scan at the best available resolution for the reader to make whatever they like of it. Eventually, if the website is regularly crawled by the Google bots, the PDF will appear as a 'Quick view', which is their automated OCR rendition into HTML.
 
[Attached image: HHKL2_002.jpg]


or OCR
 
On Linux I use pdftotext to get the text out and pdfimages to extract the images.
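For anyone wanting to script those two tools, here is a rough sketch, assuming the Poppler versions of pdftotext and pdfimages are on the PATH. The helper names and the output layout are my own invention, and the -layout and -png options are just one reasonable choice.

```python
import subprocess
from pathlib import Path


def pdftotext_cmd(pdf: str, txt: str) -> list[str]:
    # -layout asks pdftotext to keep the physical layout of the page.
    return ["pdftotext", "-layout", pdf, txt]


def pdfimages_cmd(pdf: str, prefix: str) -> list[str]:
    # -png writes the embedded images as PNG files named after the prefix.
    return ["pdfimages", "-png", pdf, prefix]


def extract_text_and_images(pdf: str, out_dir: str) -> None:
    """Write the text to <out_dir>/<name>.txt and the images alongside it."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    stem = Path(pdf).stem
    subprocess.run(pdftotext_cmd(pdf, str(out / f"{stem}.txt")), check=True)
    subprocess.run(pdfimages_cmd(pdf, str(out / stem)), check=True)
```

Calling `extract_text_and_images("something.pdf", "extracted")` would leave a text file and the image files in the `extracted` folder. On a pure scan, the text file comes out essentially empty and the images are just the page scans.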

I just tested pdftohtml (which does what it says on the tin) and it does a tolerable job on the text.
 

Blimey, I just tried "pdftohtml -c something.pdf" and it creates an html file including the images.

There is a Windows binary of a program with the same name at http://sourceforge.net/projects/pdftohtml/files/ . I have not tried that version.

Ahh, I just re-read the original post and you state scanned PDF files. This will obviously not work for you. You need to go down the OCR route. It might be useful to someone else though.
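For what it's worth, one possible free OCR route on Linux can be sketched like this. Nobody in this thread has named a particular free OCR program, so take the choice of Tesseract here as my assumption, along with the helper names and the 300 dpi setting. The idea is to turn each page into a PNG with Poppler's pdftoppm and then run the tesseract command-line tool over each page image.

```python
import subprocess
from pathlib import Path


def rasterise_cmd(pdf: str, prefix: str, dpi: int = 300) -> list[str]:
    # pdftoppm renders every page to <prefix>-<page number>.png.
    return ["pdftoppm", "-r", str(dpi), "-png", pdf, prefix]


def ocr_cmd(image: str, out_base: str, lang: str = "eng") -> list[str]:
    # tesseract writes its result to <out_base>.txt.
    return ["tesseract", image, out_base, "-l", lang]


def ocr_pdf(pdf: str, work_dir: str) -> str:
    """Rasterise a scanned PDF and OCR it page by page, returning the text."""
    work = Path(work_dir)
    work.mkdir(parents=True, exist_ok=True)
    subprocess.run(rasterise_cmd(pdf, str(work / "page")), check=True)

    pages = []
    # pdftoppm zero-pads page numbers, so a plain sort keeps page order.
    for image in sorted(work.glob("page-*.png")):
        out_base = image.with_suffix("")
        subprocess.run(ocr_cmd(str(image), str(out_base)), check=True)
        pages.append(out_base.with_suffix(".txt").read_text(encoding="utf-8"))
    return "\n".join(pages)
```

The text that comes back will still need proofreading, as mentioned earlier in the thread, especially for small print or smudged pages.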
 
On a practical level, I have kindly been offered assistance.

Linux-wise, it did not work, as it is a scan.

It's a long article in two parts printed in the ABJ, which will eventually be available for all to read, as it may be of some interest to those who have an interest in swarming.....

PH
 