Converting Paper Documents into Electronic Files
The Document Imaging group at Headquarters uses high-quality document scanners and top-end Optical Character Recognition (OCR) systems, and maintains quality controls, to provide a thorough and sustainable conversion of paper documents into electronic files. The conversion produces Acrobat Image + Text format files, which contain the actual scans of the pages for viewing and printing and a layer of searchable text behind the images to facilitate searching and indexing.
Benefits and Advantages
Converting paper documents into electronic files helps us manage, store, access and archive the organizational information we have “locked up” in paper documents. The resulting electronic files can be stored on computer systems and document storage systems, and accessed and distributed through standard electronic sharing and communications.
- Converts corporate knowledge that is locked up in paper documents into electronic files, giving broader access to their information and resources;
- Saves office space by converting cabinets of paper into electronic files that are available on demand;
- Follows government guidance in the President's Management Agenda on Expanded Electronic Government, DOE's e-Gov Initiative, the Government Paperwork Elimination Act, and other policy;
- Uses the recommended Acrobat Image Plus Text format (Acrobat Searchable Image - Exact), which retains the scan of the original document so that no information is lost during the conversion. The scanned page is always used for viewing and printing, while the OCR'd text is available for searching and even basic extraction (with limitations).
Preparation and Processing Steps
Please discuss the pre-scanning preparation process with a Document Imaging representative to coordinate the best method to organize documents and plan for file creation to meet your needs.
The basic steps to coordinate are:
- Prepare your documents by separating them according to the electronic files to be created; for example, decide whether the 50 pages in a folder make one file or ten files. This may already be done if the documents are in an organized filing system.
- Specify, or have a system to determine, what file name should be given to each file.
- Document/file names should be no longer than 61 characters to allow for transfer and archival, and follow universal file naming rules.
- Help identify whether the documents need to be scanned in full color or black-and-white (B&W). Full color allows for better readability even if the content is predominantly B&W; however, if you have B&W-only content, B&W scanning produces significantly smaller files.
- Make sure the documents are securely packed in boxes or other reliable containers for transportation between the customer's office and the Imaging office. If you are unable to do so, please contact our office for assistance.
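The file naming step above can be checked before documents are handed over. The following is a minimal sketch, not a tool provided by the Imaging Group. It assumes the 61-character limit includes the file extension, and it interprets "universal file naming rules" as avoiding characters that are reserved on common file systems; both are assumptions, and the function name check_file_name is hypothetical.

```python
import re

# Limit stated in the preparation steps; assumed here to include the extension.
MAX_NAME_LENGTH = 61

# Assumption: "universal file naming rules" is read as avoiding characters
# reserved on common file systems, plus control characters.
RESERVED_CHARS = re.compile(r'[<>:"/\\|?*\x00-\x1f]')


def check_file_name(name):
    """Return a list of problems with a proposed file name (empty if none)."""
    problems = []
    if not name:
        return ["name is empty"]
    if len(name) > MAX_NAME_LENGTH:
        problems.append(
            f"name is {len(name)} characters; limit is {MAX_NAME_LENGTH}"
        )
    if RESERVED_CHARS.search(name):
        problems.append("name contains a reserved character")
    if name != name.strip() or name.endswith("."):
        problems.append("name starts or ends with a space, or ends with a period")
    return problems


if __name__ == "__main__":
    # Hypothetical example names, for illustration only.
    for proposed in ["Budget_Report_FY2021.pdf", "Meeting notes: draft.pdf"]:
        issues = check_file_name(proposed)
        print(proposed, "->", issues if issues else "OK")
```

Running a list of planned file names through a check like this before scanning can surface names that need to be shortened or adjusted; confirm the actual rules with a Document Imaging representative.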
The Imaging Group will:
- Prepare documents for scanning by removing staples, clips, and any binding, including spiral or glue binding.
- Prepare the documents for feeding into a sheet-fed scanner.
- Scan the documents. This includes Quality Control steps to ensure all the pages are scanned and as readable as possible.
- Run Optical Character Recognition (OCR) software to create the text layer of the pages. DOE's standard format is Acrobat Image + Text, and other formats are also available.
- Coordinate delivery of the files. Due to the restrictions on removable media on DOECOE PCs, delivery may require further coordination to create electronic containers on customer data systems and possibly to ensure that the proper security controls are in place. Traditional delivery on USB drives and DVDs is possible.
- Bind, clip, or otherwise group the pages to make sure the documents stay in order after processing and during transport back to the owner. The staff will not re-staple documents or re-bind documents. Storage, disposition, and/or destruction of the documents is the responsibility of the document owner.
File Format Information
DOE's standard file format for static document archival is Acrobat Image + Text / Acrobat Searchable Image - Exact. This is an Acrobat file that contains the actual scans of the pages for viewing and printing purposes, and has the recognized text behind the image for indexing and searching. Because the file contains the actual scanned page, no information is lost: all handwriting, charts, photos, etc. are viewed and printed as scanned.
Other Technical Details
Scanner Hardware: High-speed, scans in black-and-white and in color, capable of scanning pages up to 11"x17".
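To give a rough sense of how page size, resolution, and color mode affect the amount of image data, the sketch below computes pixel dimensions and uncompressed image sizes. It assumes 1 bit per pixel for B&W and 24 bits per pixel for full color, and uses the 300 dpi and 400 dpi resolutions mentioned elsewhere on this page. Delivered PDF files are compressed, so actual file sizes will differ considerably; the function names are hypothetical and the numbers only illustrate why color scans are larger.

```python
def scan_pixels(width_in, height_in, dpi):
    """Pixel dimensions of a page scanned at the given resolution."""
    return round(width_in * dpi), round(height_in * dpi)


def raw_size_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Uncompressed image size in bytes (ignores any row padding)."""
    width_px, height_px = scan_pixels(width_in, height_in, dpi)
    total_bits = width_px * height_px * bits_per_pixel
    return (total_bits + 7) // 8  # round up to whole bytes


if __name__ == "__main__":
    # Assumed bit depths: 1 bit for B&W, 24 bits for full color.
    modes = {"B&W": 1, "color": 24}
    pages = {"letter 8.5x11": (8.5, 11), "ledger 11x17": (11, 17)}
    for page_name, (w, h) in pages.items():
        for dpi in (300, 400):
            for mode_name, bits in modes.items():
                size_mb = raw_size_bytes(w, h, dpi, bits) / 1_000_000
                print(f"{page_name} @ {dpi} dpi, {mode_name}: "
                      f"{scan_pixels(w, h, dpi)} px, {size_mb:.1f} MB raw")
```

Under these assumptions, a letter-size page at 300 dpi is 2550 x 3300 pixels, and the uncompressed color image holds 24 times as much data as the B&W image of the same page.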
NARA Standards Information
The files created by the Document Imaging Group meet NARA standards, with a few howevers. It is the responsibility of the organization/office whose documents need to be NARA compliant to independently determine that the files created and delivered meet the NARA standards for their content. If needed the group can make some accommodations/changes in its workflow, such as increasing the scanned resolution, though other more time-intensive changes may not be possible. If you require NARA compliance, please work with the Records Manager for your organization to identify the standards for your type of records.
NARA Published Resources
The documents that best describe what NARA requires for converting paper documents into electronic files are probably:
- U.S. National Archives and Records Administration (NARA) Technical Guidelines for Digitizing Archival Materials for Electronic Access: Creation of Production Master Files – Raster Images. For the Following Record Types- Textual, Graphic Illustrations/Artwork/Originals, Maps, Plans, Oversized, Photographs, Aerial Photographs, and Objects/Artifacts (pdf)
- NARA Transfer Instructions for Permanent Electronic Records in PDF format, section 3.4 Requirements for Scanned Paper or Image Formats Converted to PDF
Service Yes's and Howevers
- The service does "produce digital images that look like the original records (textual, photograph, map, plan, etc.) and are a reasonable reproduction without enhancement." (see above referenced NARA Technical Guidelines page 5, Introduction).
- Documents are scanned at 300 dpi, the NARA standard for laser printed documents, documents with poor legibility or handwriting, and documents "where color is important to the interpretation of the information or content, or desire to produce the most accurate representation" (see above referenced NARA Technical Guidelines, page 51).
- If specified, we can scan at 400 dpi.
- Scanning is done in either B&W or color, based on the best mode to reproduce the content of the original documents. Color scanning supports not only color-based images but also easier reading and viewing of hard-to-read pages, such as old pages and pages with poor contrast in the content.
- The service delivers Acrobat Image + Text files, also known as "Searchable Image - Exact" format, which is the proper file format.
- "NARA will accept PDF records that have been OCR'd using processes that do not alter the original bit-mapped image. An example of an output process that accomplishes this requirement is Searchable Image - Exact." See NARA Transfer Instructions for Permanent Electronic Records in PDF format, section 3.4 Requirements for Scanned Paper or Image Formats Converted to PDF.
- The group does not enter metadata into the files. A client/document owner would need to do that.
- However (a double however), if the metadata requirements are simple and there is a simple method for the group to know what the metadata is, the group could consider entering it.
- The group names files reasonably if the document owners provide good information on naming conventions. Document owners may still need to name or rename the files themselves if there is a specific naming convention that cannot be documented prior to conversion.
MAAdm updated 12/10/2021 - New Format