Converting Paper Documents into Electronic Files
Converting paper documents into electronic files helps us manage, store, access, and archive the organizational information we have “locked up” in paper documents. A successful imaging solution relies on high-quality document scanners, top-end Optical Character Recognition (OCR) systems, and consistent quality controls. Once converted, these electronic files can be indexed and searched, stored more easily, and accessed and distributed faster, more easily, and more cheaply than their paper originals.
Benefits and Advantages
The biggest benefits of converting your documents can be summarized in one comparison: up to 60,000 pages can be stored on one DVD! Other benefits include:
- Make "Corporate Knowledge" locked up in paper documents accessible through searchable files.
- Searches of electronic files return results faster and more completely than manual searches of paper.
- Electronic files can be stored on file servers and posted on web servers giving broader access to information than paper documents.
- Office space can be saved by converting cabinets of paper into electronic files that are available on-demand/as needed.
- Conversion is in line with government guidance, including the President's Management Agenda on Expanded Electronic Government, DOE's e-Gov Initiative, the Government Paperwork Elimination Act, and other policy.
- The recommended Acrobat Image Plus Text format (Acrobat Searchable Image - Exact) retains the scan of the original document, so no information is lost during OCR. The scanned page is always used for viewing and printing, while the OCR'd text is available for text searching and even extraction.
Preparation and Processing Steps
- Prepare your documents by separating them according to the electronic files to be created. For example, do the 50 pages in this folder make one file or ten files?
- Use our Document Separator Page to identify who you are and what name you want these files to have.
- Document/file names should be no longer than 61 characters and should follow universal file naming rules, to allow for transfer and archival on CDs and DVDs.
- Help identify whether your documents need to be scanned in full color or black-and-white (B&W). Full color allows for better readability even if the content is predominantly B&W; however, if you have B&W-only content, scanning in B&W produces significantly smaller files.
- Make sure the documents are nicely contained in boxes or other reliable containers for transportation between the customer's office and the Imaging office. If you are unable to do so please contact our office for assistance.
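The 61-character naming limit above can be checked before documents are handed over. The following is a minimal sketch, not the Imaging Group's own tooling: the function name `check_file_name` is hypothetical, the limit is assumed to apply to the whole name including its extension, and the allowed character set is an assumed conservative reading of "universal file naming rules", which the guidance does not spell out.

```python
import re

MAX_NAME_LENGTH = 61  # limit stated in the preparation steps

# Assumption: a conservative portable character set (letters, digits,
# space, dot, underscore, hyphen). The guidance does not list exact rules.
_ALLOWED = re.compile(r"[A-Za-z0-9 ._-]+")


def check_file_name(name):
    """Return a list of problems with a proposed file name (empty if none)."""
    if not name:
        return ["name is empty"]
    problems = []
    if len(name) > MAX_NAME_LENGTH:
        problems.append(
            f"name is {len(name)} characters; limit is {MAX_NAME_LENGTH}"
        )
    if not _ALLOWED.fullmatch(name):
        problems.append("name contains characters outside the assumed portable set")
    return problems


if __name__ == "__main__":
    for candidate in ["Site_Report_2018.pdf", "budget:draft?.pdf", "x" * 70]:
        print(candidate[:30], "->", check_file_name(candidate) or "OK")
```

Names that pass this check should be safe to write on the separator page; anything flagged can be shortened or simplified first.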
The Imaging Group will:
- Prepare documents for scanning by removing staples, clips, and any binding, including spiral or glue binding.
- Prepare the documents for feeding into a sheet-fed scanner.
- Scan the documents. This includes Quality Control to make sure all the pages are scanned and as readable as possible.
- Run Optical Character Recognition (OCR) software to create the text of the pages. DOE's standard format is Acrobat Image + Text, and other formats are also available.
- Write a CD or DVD with the final files. When possible this will be held until there is enough data to fill a disc; one CD can hold about 10,000 pages, and one DVD can hold about 60,000 pages of Acrobat files.
- Transfer files by other methods where needed, including e-mail or direct posting to a client's shared drive. Direct shared-drive access may take a little time to set up to ensure proper security controls are instituted.
- Bind, clip, or otherwise group the pages to make sure the documents stay in order after processing and during transport. The staff will not re-staple documents or re-bind documents.
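The disc capacities quoted above (about 10,000 pages per CD and about 60,000 pages per DVD) give a quick way to estimate how many discs a job will fill. This is a rough sketch only: the per-disc figures are the approximate ones from the text, actual capacity depends on whether pages are scanned in color or B&W, and the function name `discs_needed` is hypothetical.

```python
# Approximate capacities taken from the text; real capacity varies with
# page content and scan mode (color files are larger than B&W files).
PAGES_PER_CD = 10_000
PAGES_PER_DVD = 60_000


def discs_needed(total_pages, pages_per_disc):
    """Return the number of discs needed to hold total_pages (rounded up)."""
    if total_pages < 0:
        raise ValueError("total_pages must not be negative")
    if pages_per_disc <= 0:
        raise ValueError("pages_per_disc must be positive")
    # Integer ceiling division: a partly filled disc still counts as one disc.
    return -(-total_pages // pages_per_disc)


if __name__ == "__main__":
    job_pages = 150_000  # hypothetical job size
    print("CDs: ", discs_needed(job_pages, PAGES_PER_CD))
    print("DVDs:", discs_needed(job_pages, PAGES_PER_DVD))
```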
File Format Information
DOE's standard file format for static document archival is Acrobat Image + Text / Acrobat Searchable Image - Exact. This is an Acrobat file that contains the actual scans of the pages for viewing and printing purposes, and has the character-recognized text behind the image for indexing and searching. Because the file contains the actual scanned page, no information is lost: all handwriting, charts, photos, etc. are viewed and printed. With text behind the image, these files can be indexed and searched like any other text-based file, and the text can be copied or exported if desired (though there are some text and formatting issues depending on the document). Acrobat files can be indexed with the Acrobat program, document storage and management systems, and search engines, which can be PC, file server, or web server based.
There are other file formats we can convert to, which include:
- Text only (ASCII)
- Rich Text Format (.rtf) for use in word processors. Note that the process accurately recognizes (OCRs) machine readable words but does not create a formatted, original word processing file; the output still needs to be cleaned up and formatted by the document owner.
- The other Acrobat file formats:
- Acrobat Text Only - similar to rtf, the OCR software replicates, as best possible, the text and formatting of the document, and puts it in Acrobat format.
- Acrobat Image Only - this just puts an Acrobat wrapper on the scanned images
Other Technical Details
Scanner Hardware: High-speed, scans in black-and-white and in color, capable of scanning pages up to 11"x17".
NARA Standards Information
The files created by the Document Imaging Group meet NARA standards, with a few howevers. It is the responsibility of the organization/office whose documents need to be NARA compliant to independently determine that the files created and delivered meet the NARA standards for their content. If needed, the group can make some accommodations/changes in its workflow, though other, more time-intensive changes may not be possible. If you require NARA compliance, please work with the Records Manager for your organization to identify the standards for your type of records.
NARA Published Resources
The best documents describing what NARA requires when converting paper documents into electronic files are probably:
- U.S. National Archives and Records Administration (NARA) Technical Guidelines for Digitizing Archival Materials for Electronic Access: Creation of Production Master Files – Raster Images. For the Following Record Types- Textual, Graphic Illustrations/Artwork/Originals, Maps, Plans, Oversized, Photographs, Aerial Photographs, and Objects/Artifacts (pdf)
- NARA Transfer Instructions for Permanent Electronic Records in PDF format, section 3.4 Requirements for Scanned Paper or Image Formats Converted to PDF
Service Yes's and However's
- The service does "produce digital images that look like the original records (textual, photograph, map, plan, etc.) and are a reasonable reproduction without enhancement" (see the above-referenced NARA Technical Guidelines, page 5, Introduction).
- Documents are scanned at 300 dpi, the NARA standard for laser printed documents, documents with poor legibility or handwriting, and documents "where color is important to the interpretation of the information or content, or desire to produce the most accurate representation" (see the above-referenced NARA Technical Guidelines, page 51).
- If specified we can scan at 400 dpi.
- Scanning is done in either B&W or color, based on the best mode to reproduce the content of the original documents. In cases where straight B&W is not adequate, the job is scanned in color, even if some of the pages could be just B&W. Color scanning supports not only color-based images but also easier reading/viewing of hard-to-read pages, such as old pages and pages with poor contrast in the content.
- The group delivers Acrobat Image + Text files, also known as the "Searchable Image - Exact" format, which is the proper file format.
- "NARA will accept PDF records that have been OCR'd using processes that do not alter the original bit-mapped image. An example of an output process that accomplishes this requirement is Searchable Image - Exact." See NARA Transfer Instructions for Permanent Electronic Records in PDF format, section 3.4 Requirements for Scanned Paper or Image Formats Converted to PDF.
- The group does not normally enter metadata into the files. A client/document owner would need to do that.
- However (a double however), if the metadata requirements are simple, and there is a simple method for the group to know what the metadata is, the group could consider entering it.
- The group does reasonable file naming if the document owners provide good information on naming conventions. The document owners may still need to name or rename the files themselves if there is a specific naming convention that cannot be documented prior to conversion.
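The resolutions named in the list above (300 dpi by default, 400 dpi if specified) translate directly into pixel dimensions, which also shows why B&W scans are so much smaller than color ones. The sketch below is illustrative arithmetic only: the function names are hypothetical, the bit depths (1 bit per pixel for B&W, 24 for color) are assumed typical values, and the sizes are for raw uncompressed pixel data, ignoring row padding and the compression used in real Acrobat files.

```python
def pixel_dimensions(width_in, height_in, dpi):
    """Return (width, height) in pixels for a page scanned at the given dpi."""
    return round(width_in * dpi), round(height_in * dpi)


def uncompressed_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Return the raw, uncompressed size in bytes of one scanned page."""
    width_px, height_px = pixel_dimensions(width_in, height_in, dpi)
    total_bits = width_px * height_px * bits_per_pixel
    return (total_bits + 7) // 8  # round up to whole bytes


if __name__ == "__main__":
    # Assumed bit depths: 1 bit/pixel for B&W, 24 bits/pixel for full color.
    for label, (w, h) in {"8.5x11": (8.5, 11), "11x17": (11, 17)}.items():
        for dpi in (300, 400):
            px = pixel_dimensions(w, h, dpi)
            bw = uncompressed_bytes(w, h, dpi, 1)
            color = uncompressed_bytes(w, h, dpi, 24)
            print(f"{label} at {dpi} dpi: {px[0]} x {px[1]} pixels, "
                  f"B&W {bw:,} bytes, color {color:,} bytes")
```

Under these assumptions a color page carries 24 times the raw data of a B&W page at the same resolution, which is consistent with the advice to scan B&W-only content in B&W.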
MAAdm updated 11/26/2018 - New Format