Document Imaging

Converting Paper Documents into Electronic Files

Converting paper documents into electronic files helps us manage, store, access and archive the organizational information we have “locked up” in paper documents. We use high-quality document scanners, a top-end six-engine Optical Character Recognition (OCR) system and Quality Controls to provide a successful Imaging solution. Once converted, these electronic files can be indexed and searched, stored more easily, and accessed and distributed faster, more easily and more cheaply than their paper originals.

Benefits and Advantages

The biggest benefit of converting your documents can be summarized in one comparison: up to 60,000 pages can be stored on one DVD! Other benefits include:

  • Make "Corporate Knowledge" locked up in paper documents accessible through indexable and searchable files.
  • Electronic files provide search results that are both faster and more complete than manual searches.
  • Electronic files can be stored on file servers and posted on web servers, giving broader access to information than paper documents.
  • Office space can be saved by converting cabinets of paper into CDs and microfilm.
  • Conversion is in line with government guidance per the President's Management Agenda on Expanded Electronic Government, DOE's e-Gov Initiative, the Government Paperwork Elimination Act and other policy.
  • The recommended Acrobat Image + Text format retains the scan of the original document, i.e., no information is lost during OCR.

Preparation and Processing Steps

  • Prepare your documents by separating them according to the electronic files to be created, i.e., do the 50 pages in this folder make one file or ten files?
  • Use our Document Separator Page to identify who you are and what name you want these files to have.
  • Document/file names should be no longer than 61 characters to allow for transfer and archival on CDs and DVDs, and should follow all universal file naming rules.
  • Our default processing is black and white scanning. If you want any pages or documents scanned in grayscale or color, please identify them.
  • Make sure the documents are nicely contained in boxes or other reliable containers for transportation between the customer's office and the Imaging office. If you are unable to do so please contact our office for assistance.
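
The 61-character naming limit in the steps above can be checked before documents are sent for scanning. The following is a minimal sketch, assuming the limit applies to the whole file name including its extension. The function name `check_file_name` and the set of rejected characters are illustrative assumptions, not the Imaging Group's actual rules.

```python
# Hypothetical pre-submission check for requested file names.
MAX_NAME_LENGTH = 61  # limit stated in the preparation steps

# Assumption: characters that many file systems and disc formats reject.
FORBIDDEN_CHARS = set('\\/:*?"<>|')


def check_file_name(name: str) -> list[str]:
    """Return a list of problems found in a requested file name (empty if none)."""
    problems = []
    if not name.strip():
        problems.append("name is empty")
    if len(name) > MAX_NAME_LENGTH:
        problems.append(
            f"name has {len(name)} characters; limit is {MAX_NAME_LENGTH}"
        )
    bad = sorted(set(name) & FORBIDDEN_CHARS)
    if bad:
        problems.append("name contains forbidden characters: " + " ".join(bad))
    return problems


if __name__ == "__main__":
    for candidate in ["Annual_Report_2013.pdf", "budget/draft?.pdf", "x" * 70]:
        print(candidate[:30], "->", check_file_name(candidate) or "OK")
```

An empty list means the name passed every check in this sketch; it does not guarantee the name is acceptable to the Imaging office.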

The Imaging Group will:

  • Prepare documents for scanning by removing staples, clips, and any binding including spiral or glue binding.
  • Prepare the documents for feeding into a sheet-fed scanner.
  • Scan the documents. This includes Quality Control to make sure all the pages are scanned and as readable as possible.
  • Run Optical Character Recognition (OCR) software to create the text of the pages. DOE's standard format is Acrobat Image + Text, and other formats are also available. Please see the File Format section for more information.
  • Write a CD or DVD with the final files. When possible this will be held until there is enough data to fill a disc; one CD can hold about 10,000 pages, and one DVD can hold about 60,000 pages of Acrobat Image + Text files.
  • Bind, clip, or otherwise group the pages to make sure the documents stay in order after processing and during transport. The staff will not re-staple documents or re-bind documents.
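
The approximate disc capacities quoted above (about 10,000 pages per CD and about 60,000 pages per DVD of Acrobat Image + Text files) give a rough way to estimate how many discs a job will need. This is only a sketch: the function name `discs_needed` is hypothetical, and real page counts per disc will vary with page content and scan settings.

```python
import math

# Approximate capacities quoted in the processing steps.
PAGES_PER_CD = 10_000
PAGES_PER_DVD = 60_000


def discs_needed(pages: int, pages_per_disc: int) -> int:
    """Estimate the number of discs required for a job of `pages` pages."""
    if pages < 0:
        raise ValueError("pages must not be negative")
    if pages_per_disc <= 0:
        raise ValueError("pages_per_disc must be positive")
    # Round up: a partly filled disc still counts as one disc.
    return math.ceil(pages / pages_per_disc)


if __name__ == "__main__":
    job = 25_000
    print("CDs: ", discs_needed(job, PAGES_PER_CD))   # 3
    print("DVDs:", discs_needed(job, PAGES_PER_DVD))  # 1
```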

File Format Information

DOE's standard file format for static document archival is Acrobat Image + Text. This is an Acrobat file that contains the actual scans of the pages for viewing and printing purposes, and has the OCR'd text behind the image for indexing and searching. Because the file contains the actual scanned page, no information is lost: all handwriting, charts, photos, etc. are viewed and printed. With OCR'd text behind the image, these files can be indexed and searched like any other text-based file, and the text can be copied or exported if desired (though there are some text and formatting issues depending on the document). Acrobat files can be indexed with the Acrobat program, document storage and management systems, and search engines that can be PC, file server or web server based.

There are other file formats we can convert to, which include: 

  • Text only (ASCII)
  • Rich Text Format (.rtf) for use in word processors. Note that the process accurately recognizes (OCRs) machine-readable words but does not create a formatted, original word processing file; the output still needs to be cleaned up and formatted by the document owner.
  • HTML for web posting. This is a down-and-dirty way of getting paper documents into web format very quickly. It is still advised for web posting to use Acrobat files.
  • The other Acrobat file formats:
    • Acrobat Text Only - similar to RTF, the OCR software replicates, as best possible, the text and formatting of the document, and puts it in Acrobat format.
    • Acrobat Image Only - this just puts an Acrobat wrapper on the scanned images.

Other Technical Details

Hardware:

  • 1 NEW high-speed color scanner, scans both sides of the page in the same pass, capable of scanning up to 11"x17".
  • 1 high-speed B&W scanner, scans both sides of the page in the same pass, capable of scanning up to 11"x17".
  • 1 color scanner with an 11"x17" flatbed for manual scanning.
  • Documents are normally scanned at 300 d.p.i. for B&W and 200 or 240 d.p.i. for color.

Software:

We use PrimeOCR, a top-end OCR Server System that achieves up to 80% better accuracy than the best conventional OCR products through the implementation of "Voting OCR" technology. PrimeOCR can use up to six OCR engines instead of just one, utilizing engines from ABBYY, Caere/Calera/ScanSoft, ExperVision, and NewSoft. PrimeOCR passes the scanned image through each of these OCR engines and uses voting technology along with artificial intelligence algorithms to determine the characters in the image.
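
The general idea behind voting can be illustrated with a very small sketch. This is not PrimeOCR's algorithm, which is proprietary and more sophisticated; it is a simple per-character majority vote, assuming each engine returns a string of the same length for the same line of text. The function name `vote` and the sample readings are made up for illustration.

```python
from collections import Counter


def vote(readings: list[str]) -> str:
    """Majority vote per character position across equally long engine outputs."""
    if not readings:
        raise ValueError("at least one engine output is required")
    length = len(readings[0])
    if any(len(r) != length for r in readings):
        # Real systems must align outputs of differing lengths; this sketch does not.
        raise ValueError("this sketch assumes outputs of equal length")
    result = []
    for chars in zip(*readings):
        # Pick the character most engines agree on; on a tie, the character
        # reported by the earliest engine in the list wins.
        result.append(Counter(chars).most_common(1)[0][0])
    return "".join(result)


if __name__ == "__main__":
    # Three hypothetical engines, two of which each misread one character.
    print(vote(["Energy", "Encrgy", "Energv"]))  # Energy
```

Each engine here makes a different mistake, so the other engines outvote it at that position. That is the intuition for why combining engines can reduce errors.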

The best single-engine OCR software products achieve about 98% average accuracy recognizing text on typical quality document images. While 98% accuracy sounds impressive, that still leaves 40 errors on a typical 2000 character text page! As a result, many imaging installations that want to use OCR software wind up with inadequate and untrustworthy files. Installations that use PrimeOCR can cut errors per page by 65-80% (down to 8 errors per average 2000 character page).
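
The figures in the paragraph above follow from simple arithmetic, worked through in this short sketch. The function names are illustrative; the numbers (98% accuracy, 2000 characters per page, 65-80% error reduction) are the ones quoted above.

```python
def errors_per_page(accuracy_pct: float, chars_per_page: int) -> float:
    """Expected character errors on a page at a given recognition accuracy."""
    return chars_per_page * (100 - accuracy_pct) / 100


def after_reduction(errors: float, reduction_pct: float) -> float:
    """Errors remaining after cutting the error count by reduction_pct percent."""
    return errors * (100 - reduction_pct) / 100


if __name__ == "__main__":
    baseline = errors_per_page(98, 2000)
    print(baseline)                       # 40.0
    print(after_reduction(baseline, 65))  # 14.0
    print(after_reduction(baseline, 80))  # 8.0
```

So a 65-80% reduction corresponds to roughly 14 down to 8 remaining errors on a 2000-character page.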

Here at DOE, by utilizing PrimeOCR and all six of its engines, we can create an Imaged file that has indexable and searchable text that can be trusted and relied upon for accurate search results. Give us a try and see how accurate your Document Imaging can be.

Use this link to the Contact Us page for the Document Imaging contacts.

Use this link to go to the FAQs for Document Imaging.

Last updated 10/27/14