The Intellectual Property Forum


Topic: OCR/PDF recommendation?

dbmax

« on: 10-31-16 at 11:57 pm »

I've had no luck finding OCR software that can recognize underlined or struck-through text and save it with that formatting intact, so that it can be searched (in MS Word, for example) by the underline or strike-through font effect.

The software I have seen produces plain unformatted text (strike-throughs and underlines omitted) beneath the scanned image.

That's pretty awkward for searching or editing marked-up scanned PDFs from the USPTO.
 
Anyone have any suggestions in that regard?

db

MYK

« Reply #1 on: 11-01-16 at 12:32 am »

Never heard of one, sorry. Google's Tesseract is open-source, though, so if you want to train it... It's not all that great at recognition, though.

"The life of a patent solicitor has always been a hard one."  Judge Giles Rich, Application of Ruschig, 379 F.2d 990.

Disclaimer: not only am I not a lawyer, I'm not your lawyer.  Therefore, this does not constitute legal advice.

Robert K S

« Reply #2 on: 11-01-16 at 11:31 am »

If it's really worth it to you, for a few thousand bucks you could probably contract with a developer through freelancer.com or a similar service to build what you need. I once hired a developer in Portugal to write a software application and was pleased with the result. You don't always want to go with the lowest bidder; instead, look for the ones who ask the best questions.

This post is made in the context of professional discussion of general patent law issues and nothing contained herein may be construed as legal advice.

dbmax

« Reply #3 on: 11-01-16 at 11:15 pm »

MYK, Robert K S,

The best OCR I've used (for speed, accuracy, and interface) is Foxit Phantom, which uses the ABBYY OCR engine. Others are not even close. I have not tried Acrobat's OCR, but have not seen any encouraging reports on it.

But as with the others I have tried, ABBYY recognizes struck-through and underlined characters as unformatted text, thus corrupting the entire converted document.

Licensing ABBYY and then hiring a contractor to modify it to preserve underline and strikethrough characters is an unappealing option.

Thanks,
db
« Last Edit: 11-04-16 at 12:26 am by dbmax »

Robert K S

« Reply #4 on: 11-02-16 at 10:07 am »

I had never heard of ABBYY but it looks like there is an SDK available (or a few different SDKs, actually) (https://www.abbyy.com/sdk/), although I'm not sure if they let a programmer into the guts of the recognition engine to retrain on underline and strikethrough text.  I'm surprised ABBYY doesn't at least handle underlining appropriately given that their marketing materials advertise that their software "precisely retains the formatting ... of original documents".

It's probably a nontrivial problem to train for strikethrough, since stricken characters essentially constitute a whole new character set. The software would have difficulty telling stricken characters apart from ordinary ones and would have to be pretty heavily context-reliant. (Are two crossed lines a stricken-through lowercase L, or a lowercase T? It may be impossible to say, unless it can be determined with certainty that the surrounding letters are stricken through.)
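One way around the single-character ambiguity is to look for the line itself across a whole word, rather than retraining on a new character set. What follows is only a minimal sketch of that idea, not how ABBYY or any other engine works. It assumes the page has already been binarized and cropped into word images, and the function names and thresholds are made up for illustration.

```python
def find_full_width_rows(bitmap, min_fill=0.9):
    """Return indices of rows in which at least min_fill of the pixels are dark.

    bitmap is a list of rows, each row a list of 0 (white) or 1 (dark).
    """
    rows = []
    for y, row in enumerate(bitmap):
        if row and sum(row) / len(row) >= min_fill:
            rows.append(y)
    return rows


def classify_line_effect(bitmap, min_fill=0.9):
    """Guess 'strikethrough', 'underline', or 'none' for one cropped word image.

    Assumption (illustrative only): a nearly full-width dark row in the
    middle third of the crop is a strikethrough, and one in the bottom
    fifth is an underline. Real scans would need skew and noise handling.
    """
    height = len(bitmap)
    if height == 0:
        return "none"
    rows = find_full_width_rows(bitmap, min_fill)
    # Check the middle band first: a line through the text body.
    for y in rows:
        if height / 3 <= y < 2 * height / 3:
            return "strikethrough"
    # Then the bottom band: a line beneath the text.
    for y in rows:
        if y >= 0.8 * height:
            return "underline"
    return "none"
```

Because the crossbar of a lowercase T spans only one character, it does not fill the row of a multi-letter word crop, which is why the check is done per word rather than per character. A one-letter word would still be ambiguous.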

Sometimes the solutions you need now are still years away, and that might be the case here.

inventurous

« Reply #5 on: 11-02-16 at 01:20 pm »

I use Acrobat XI Pro and have been pretty happy with the OCR abilities, although it does sometimes fail to keep the columns clean (including line numbers as part of a column). Not sure how it handles markups, since I never scan any, but you might give the trial a shot.

dbmax

« Reply #6 on: 01-05-17 at 12:07 am »

Quote from: inventurous
"I use Acrobat XI Pro and have been pretty happy with the OCR abilities, although it does sometimes fail to keep the columns clean (including line numbers as part of a column). Not sure how it handles markups, since I never scan any, but you might give the trial a shot."

Thanks inventurous,

I downloaded a trial of Acrobat DC Pro. It's not as accurate as Foxit. But one benefit is that after text recognition, a spell checker can be used to crudely locate stricken text, since Acrobat tends to garble it into misspellings. In my tests, Foxit typically recognized the stricken text without error, which makes it harder to locate after recognition.
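The spell-checker trick can be roughed out in a few lines. This is a minimal sketch, assuming the OCR output is available as plain text and that you supply your own word list; the function name and the sample vocabulary are hypothetical, and it only flags candidates for a human to review.

```python
import re


def flag_suspect_words(ocr_text, vocabulary):
    """Return (position, word) pairs for words that are not in vocabulary.

    Words the OCR garbled (for example because a line ran through them)
    tend to fall outside the dictionary, so they surface here. Correctly
    spelled stricken text will NOT be flagged.
    """
    known = {w.lower() for w in vocabulary}
    suspects = []
    for match in re.finditer(r"[A-Za-z]+", ocr_text):
        word = match.group(0)
        if word.lower() not in known:
            suspects.append((match.start(), word))
    return suspects


# Example with a made-up word list and a made-up OCR result.
vocabulary = {"the", "widget", "comprises", "a", "housing", "and", "lid"}
text = "The widget cornprises a hcusing and a lid"
for position, word in flag_suspect_words(text, vocabulary):
    print(position, word)
```

Technical terms and proper names will also be flagged, so expect false positives on patent text unless the word list is extended.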

db

www.intelproplaw.com

Terms of Use
Feel free to contact us:
Sorry, spam is killing us.

iKnight Technologies Inc.

www.intelproplaw.com

Page created in 0.109 seconds with 20 queries.