Convert Searchable Image PDF to html?

Does anyone know of a simple way to convert a searchable PDF + text to a webpage without any visible differences?

My Great Aunt made a scrapbook of family history in the 60s, and I have been scanning it with a view to making an annotated version of it - OCR on her typewritten notes, updated with links to the latest sources online, and popups with neatened high-resolution copies of the photos which are stuck on the pages.

I’ve found some programmes that say they convert PDFs, but all they do is create a similar copy - i.e. similar layout, font etc. - not a page that looks as it did originally.

I have an idea that I may be able to do this within an actual pdf and then use a pdf viewer on a webpage, but I feel the size of the pdf would be far too large to expect someone to wait to download it in one go when they first go to the page.

Does anyone have any ideas on the best way of getting the result I’m after?



Well, the short answer is that it is not really possible to automate.

Long answer is that you are trying to fit an octagonal peg into a postcard, which, if you think about it, is impossible. Here’s why. PDF is a layout format that is designed to ensure that the document appears the same across all platforms. Margins, placement, and kerning are all ordinal, i.e. fixed in absolute terms. HTML, on the other hand, is intentionally designed to be relational instead of ordinal: elements are placed relative to each other and to the window. This allows for different screen resolutions, “tool bars”, etc.

Now CSS has made it much easier to create an illusion of ordinal positioning, but it will never be perfect like a PDF.
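
For what it’s worth, here is a rough sketch of what that illusion can look like: a small Python function that writes HTML showing the scan as the visible image, with the OCR text laid over it in transparent, absolutely positioned spans. The function name `overlay_page`, the word-box format, and the file name in the usage line are all made up for illustration; real coordinates would have to come from whatever OCR tool you use.

```python
from html import escape


def overlay_page(image_url, page_w, page_h, words):
    """Return HTML showing a scan with invisible, selectable text on top.

    words: list of (text, x, y, w, h) boxes in pixels of the original scan.
    This tuple format is hypothetical; adapt it to your OCR output.
    """
    spans = []
    for text, x, y, w, h in words:
        # Percentages keep each word lined up with the image as it scales.
        style = (
            f"left:{100 * x / page_w:.2f}%;top:{100 * y / page_h:.2f}%;"
            f"width:{100 * w / page_w:.2f}%;height:{100 * h / page_h:.2f}%"
        )
        spans.append(f'<span style="{style}">{escape(text)}</span>')
    return (
        "<style>"
        ".page{position:relative;display:inline-block}"
        ".page img{display:block;max-width:100%}"
        ".page span{position:absolute;color:transparent;"
        "overflow:hidden;white-space:nowrap}"
        "</style>\n"
        f'<div class="page"><img src="{escape(image_url, quote=True)}" alt="">'
        + "".join(spans)
        + "</div>"
    )


# Example: one word box on a 1000 x 2000 pixel scan.
html_out = overlay_page("page01.jpg", 1000, 2000, [("Aunt & Uncle", 100, 500, 250, 40)])
```

The text is in the page source and can be selected, but the selection highlight will only roughly match the typewritten words, since the browser’s font will not match the scan.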

While I am sure this is not “helpful” I hope it explains the problem.

I had a feeling that might be the case.

Is there some way that you can put hidden machine-readable text on the image? Would it be possible to hide a text file of the OCRed content so that search engines can read it and at least associate it with the correct page as an image?

If it doesn’t look neat when someone selects the text on the page it’s no biggie - I’d really like anyone else who has a family connection to be able to find the information and make use of it, yet at the same time have the experience of reading the scrapbook.


Doesn’t Google index PDFs?

I have seen something like that, with the same sort of similar layout but totally different look that I mentioned before - but I’d like someone on my actual page to be able to select and copy if they want.

I think with the Google approach, you can click to go to the original page, but once you navigate from there (and I could be wrong) you would be back to images only within the scheme of the original website.

At least people could google the pages, though, which would be great. Does anyone know if Yahoo and the other search engines do something similar, or are able to use Google’s versions of pages?

Ah, I meant that Google simply displays search results for PDFs – sorry, didn’t see your caveat about size.

Once you have the text, why not simply display the text as HTML? Provide a link to the real image for those who want to see the original. Provide a thumbnail or low-res version for inclusion on the HTML.
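
As a minimal sketch of that idea, assuming you already have the OCR text as a list of paragraphs, something like the following Python would write one plain HTML page per scrapbook page, with a thumbnail that links to the full scan. The function name `text_page` and the file names are hypothetical.

```python
from html import escape


def text_page(title, paragraphs, thumb_url, full_url):
    """Return a simple HTML page: thumbnail linked to the full scan, then text."""
    body = "\n".join(f"<p>{escape(p)}</p>" for p in paragraphs)
    return (
        "<!DOCTYPE html>\n"
        f'<html><head><meta charset="utf-8"><title>{escape(title)}</title></head>\n'
        "<body>\n"
        f"<h1>{escape(title)}</h1>\n"
        # The small image is shown on the page; clicking it opens the original.
        f'<a href="{escape(full_url, quote=True)}">'
        f'<img src="{escape(thumb_url, quote=True)}" alt="Scan of {escape(title, quote=True)}"></a>\n'
        f"{body}\n"
        "</body></html>"
    )


# Example with made-up file names.
page = text_page("Page 1", ["Tom & Ann, married 1923."], "page01_thumb.jpg", "page01_full.jpg")
```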

Better yet, get a CSS goddess to help you define popup text that appears when you hover over the segments of the image that contain OCRed text. Use a low-res version to decrease download time. If you do it right, the text will be part of the HTML code and therefore indexed by Google.
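
If no CSS goddess is handy, one plain-HTML way to approximate the hover idea is an image map whose areas carry the OCR text in their `title` attributes, which browsers should show as a tooltip. Below is a minimal Python sketch that generates one; the function name `hover_map`, the region format, and the file names are assumptions for illustration. It also assumes the low-res image is displayed at exactly the pixel size given, because image map coordinates do not scale with the image.

```python
from html import escape


def hover_map(image_url, full_url, width, height, regions, map_name="page"):
    """Return an <img> plus <map> whose areas show OCR text on hover.

    regions: list of (x1, y1, x2, y2, text) rectangles in pixels of the
    low-res image as displayed (a hypothetical format).
    """
    areas = []
    for x1, y1, x2, y2, text in regions:
        safe = escape(text, quote=True)
        # Each area links to the full-resolution scan and carries the text.
        areas.append(
            f'<area shape="rect" coords="{x1},{y1},{x2},{y2}" '
            f'href="{escape(full_url, quote=True)}" title="{safe}" alt="{safe}">'
        )
    return (
        f'<img src="{escape(image_url, quote=True)}" width="{width}" '
        f'height="{height}" usemap="#{map_name}" alt="">\n'
        f'<map name="{map_name}">\n' + "\n".join(areas) + "\n</map>"
    )


# Example: one typewritten note on a 600 x 800 pixel low-res page.
snippet = hover_map(
    "page01_small.jpg", "page01_full.jpg", 600, 800,
    [(10, 20, 300, 80, 'Born 1902, known as "Nellie"')],
)
```

Because the text lives in attributes rather than in the page body, it cannot be selected and copied, so this suits short captions better than whole pages of notes.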

Thanks Jaysen,

That may be the way to go.

I know no CSS Goddess, but I have a GUI-based website programme which I think I can trick into believing portions of the image have incredibly long punctuated titles. :)

And your idea of the separate page with text could be just the trick, with a link back to the original image in case someone found it through searches.

Thanks for everyone’s input,