Create OCR for Multiple Documents (Bulk OCR)

OCR (optical character recognition) refers to the process of creating machine-readable text from the printed text in image files. The text extracted from documents in this way can then be indexed for search purposes.

The Processing module lets you create OCR, including word coordinates for existing images, for document pages that lack it, using the English dictionary or another language dictionary. The results are as follows:

  • OCR content is added to the Extracted Text tab in Review.

  • It is also added to the case’s EXTRACTEDTEXT field.

  • Word coordinates are created for images to allow search highlighting.

NOTES: Before performing bulk OCR operations, note the following:

  • Documents lacking image files are skipped during the bulk OCR process. Documents containing no readable text (for example, a photo file) are also skipped.

  • If OCR already exists for a document, it is not overwritten unless you select one of the re-OCR options in the OCR wizard.

  • In this topic, “OCR” or “extracted text” refers to the content in the Extracted Text tab or field, as well as word coordinates.

  • If your case includes documents in different languages, identify the languages to be used. Only one language can be processed at a time.
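The notes above amount to a small planning rule, which the following Python sketch illustrates. It is not product code: the `Document` record, its fields, and the `plan_ocr_passes` function are hypothetical names chosen for this example, and the product exposes this behavior only through the OCR wizard. The sketch assumes default behavior (no re-OCR options selected) and does not model documents that are skipped because they contain no readable text, since that is only known once OCR runs.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Document:
    """Hypothetical record for illustration; not a product data structure."""
    doc_id: str
    has_images: bool
    has_ocr: bool
    language: str = "English"


def plan_ocr_passes(documents):
    """Group documents that would receive OCR into one pass per language.

    Mirrors the notes above: documents without image files are skipped,
    existing OCR is left alone, and one language is processed at a time.
    """
    passes = defaultdict(list)
    for doc in documents:
        if not doc.has_images:
            continue  # no image files: skipped by bulk OCR
        if doc.has_ocr:
            continue  # existing OCR is not overwritten by default
        passes[doc.language].append(doc.doc_id)
    return dict(passes)


docs = [
    Document("DOC-001", has_images=True, has_ocr=False),
    Document("DOC-002", has_images=False, has_ocr=False),
    Document("DOC-003", has_images=True, has_ocr=True),
    Document("DOC-004", has_images=True, has_ocr=False, language="German"),
]

# Each dictionary entry corresponds to one run of the OCR wizard.
for language, doc_ids in plan_ocr_passes(docs).items():
    print(language, doc_ids)
```

Planning this way before you start makes it easier to build one saved search or private folder per language, so that each wizard run uses a single language dictionary.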

To create OCR:

  1. Getting started:

    1. Identify which documents need OCR: either an entire case or documents matching particular search criteria. If a search will be used, identify the needed search criteria or saved search.

      TIP: If the documents are diverse from a search standpoint, one approach is to place them all in a private folder and then search for documents in the folder.

    2. Select an appropriate time to add OCR.

      Note: Depending on the size of the database, system capabilities, and other factors, global functions may consume significant system resources and time. It is recommended that you carry out bulk operations during “off hours” to minimize the impact on the system and your users.

    3. If users might be logged on, alert them that performance may be affected while you perform this bulk operation.

  2. On the Dashboard, click the Processing module.

  3. In the left navigation panel, click the OCR tab.

  4. On the OCR page, select a Client ID and a Case Name from the drop-down menus.

  5. Click Start. The OCR wizard is displayed.

  6. Select the scope of the operation in the box labeled Documents to Process:

    • Process Entire Case: OCR will be created for all images for which it is missing.

    • Saved Search: Select this option and then select the needed saved search.

    • Advanced Search: Select this option, click Search, then define and run the needed search. For details on searching, see the Use Advanced Search section in Review. The Doc Count field will display the number of documents in the search results.

  7. If the selected images contain a language other than English, complete the following steps to ensure that the correct language dictionary is used for this OCR session:

    1. Click Select OCR Language.

    2. Select the needed language. Only one language can be selected at a time.

  8. Select the needed options from the following list.

    Bulk OCR Options

    • Use Default Viewer Rotation: If pages are rotated correctly in Review, select this option to OCR the pages based on the Desktop rotation. This is the preferred selection.

    • Use Auto-Rotation: If this option is selected, a sample of image text is analyzed to identify the correct rotation. This option may not work for certain images, such as those with text in the document margins.

    • Include Image Key in OCR Text: Select this option to add the document’s image key at the beginning of the extracted text, in the Extracted Text field and tab.

    • RE-OCR Rotated Pages: Some pages may have been initially scanned with the wrong rotation, yielding meaningless OCR. If the images for these pages are rotated correctly in Review, select this option to re-OCR the pages based on the Desktop rotation.

    • RE-OCR All Documents and Pages: Select this option to create OCR for all pages of all documents selected; existing OCR will be overwritten in the Extracted Text field and tab.

    • Replace Existing Extracted Text: This option can be selected only when RE-OCR Rotated Pages or RE-OCR All Documents and Pages is selected. Select this option to create OCR for all pages of all documents selected. OCR will be overwritten in the Extracted Text field and tab.

    • Select Document Processing Timeout Value: Select this option to set how long processing can run before timing out.

  9. When all options are selected, click Start. OCR text will be placed in the case database directory.

  10. Observe the status at the bottom of the workspace. The amount of time the operation takes depends on how many images are being processed.

  11. When the operation completes, note any important information in the status bar. If errors exist, click View Log File to evaluate the error log.
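
The one dependency stated among the Bulk OCR Options (Replace Existing Extracted Text requires one of the two RE-OCR options) can be written as a simple check. The following Python sketch is illustrative only: the `BulkOcrOptions` class, its field names, and `validate` are hypothetical and do not correspond to any product API. The wizard enforces this rule itself, so the sketch is useful mainly for documenting or reviewing planned settings before a run.

```python
from dataclasses import dataclass


@dataclass
class BulkOcrOptions:
    """Hypothetical model of three wizard check boxes; names are illustrative."""
    reocr_rotated_pages: bool = False
    reocr_all_documents_and_pages: bool = False
    replace_existing_extracted_text: bool = False


def validate(options):
    """Return a list of problems; an empty list means the combination is allowed."""
    problems = []
    needs_reocr = options.replace_existing_extracted_text
    has_reocr = (
        options.reocr_rotated_pages or options.reocr_all_documents_and_pages
    )
    if needs_reocr and not has_reocr:
        problems.append(
            "Replace Existing Extracted Text requires RE-OCR Rotated Pages "
            "or RE-OCR All Documents and Pages."
        )
    return problems


# Replace selected without either RE-OCR option: one problem is reported.
print(validate(BulkOcrOptions(replace_existing_extracted_text=True)))

# Replace selected together with RE-OCR All Documents and Pages: no problems.
print(validate(BulkOcrOptions(reocr_all_documents_and_pages=True,
                              replace_existing_extracted_text=True)))
```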


Related Topics

Overview: Processing Files