Create OCR for Multiple Documents (Bulk OCR)

OCR (optical character recognition) refers to the process of creating machine-readable text from the printed text in image files. The text extracted from documents in this way can then be indexed for search purposes.

Bulk OCR processing creates OCR, including word coordinates for existing images, for document pages that lack it, using an English or other language dictionary. Results include:

  • OCR content is added to the Extracted Text tab in Eclipse SE Desktop.

  • It is also added to the case’s EXTRACTEDTEXT field.

  • Word coordinates are created for images to allow search highlighting.
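To illustrate why word coordinates matter for search highlighting, the following is a minimal sketch, not Eclipse SE's actual data format or code. The `WordBox` record and the `highlight_boxes` function are hypothetical names introduced only for this example; the idea is that each recognized word carries a rectangle on the page image, so a search hit can be mapped back to a region to highlight.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WordBox:
    """Hypothetical record: one recognized word and its rectangle on the image."""
    text: str
    left: int
    top: int
    width: int
    height: int


def highlight_boxes(words, term):
    """Return the boxes whose word matches the search term (case-insensitive)."""
    needle = term.casefold()
    # Strip common punctuation so "total:" still matches a search for "total".
    return [w for w in words
            if w.text.casefold().strip(".,;:!?\"'()") == needle]


# Made-up sample data for one page.
page_words = [
    WordBox("Quarterly", 72, 90, 110, 18),
    WordBox("invoice", 190, 90, 80, 18),
    WordBox("total:", 72, 120, 60, 18),
    WordBox("Invoice", 72, 150, 82, 18),
]

hits = highlight_boxes(page_words, "invoice")
for box in hits:
    print(box.left, box.top, box.width, box.height)
```

Without coordinates, a search could still find the document through its extracted text, but a viewer would have no way to know where on the image the matching words appear.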

NOTES: Before performing bulk OCR operations, note the following:

  • Documents that lack image files are skipped during the bulk OCR process. Documents that contain no readable text (for example, a photo file) are also skipped.

  • By default, if OCR already exists for a document, it is not overwritten. The RE-OCR options described in the procedure change this behavior.

  • In this topic, “OCR” or “extracted text” refers to the content in the Extracted Text tab or field, as well as word coordinates.

  • If your case includes documents in different languages, identify the languages to be used. Only one language can be processed at a time, so plan a separate bulk OCR session for each language.
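The notes above can be summarized as a planning exercise: exclude documents that will be skipped, then group the rest by language so that each group can be handled in its own session. The sketch below is only an illustration of that reasoning under stated assumptions. The `Doc` record, its fields, and `plan_runs` are hypothetical and are not part of Eclipse SE; in particular, it assumes you already know each document's language.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Doc:
    """Hypothetical summary of a document for planning purposes."""
    doc_id: str
    has_images: bool
    has_ocr: bool
    language: str


def plan_runs(docs):
    """Group documents that still need OCR by language, one run per language."""
    runs = defaultdict(list)
    for doc in docs:
        if not doc.has_images:
            continue  # documents lacking image files are skipped
        if doc.has_ocr:
            continue  # existing OCR is not overwritten by default
        runs[doc.language].append(doc.doc_id)
    return dict(runs)


# Made-up sample data.
docs = [
    Doc("DOC-001", has_images=True, has_ocr=False, language="English"),
    Doc("DOC-002", has_images=False, has_ocr=False, language="English"),
    Doc("DOC-003", has_images=True, has_ocr=True, language="German"),
    Doc("DOC-004", has_images=True, has_ocr=False, language="German"),
]

plan = plan_runs(docs)
for language, ids in plan.items():
    print(language, ids)
```

Each key in the resulting plan corresponds to one pass through the procedure, for example by placing that language's documents in a folder or saved search.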

For more information about creating OCR for multiple documents, view the Bulk OCR video.

To create OCR:

  1. Getting started:

    1. Identify which documents need OCR, either an entire case or documents matching particular search criteria. If a search will be used, identify the needed search criteria or saved search.

      TIP: If the documents cannot easily be found with a single search, one approach is to place them all in a private folder and then search for documents in the folder.

    2. Select an appropriate time to add OCR.

      Note: Depending on the size of the database, system capabilities, and other factors, global functions may consume significant system resources and time. It is recommended that you carry out bulk operations during “off hours” to minimize the impact on the system and your users.

    3. If users might be logged on, alert them that performance may be affected while you perform this bulk operation.

  2. In Eclipse SE Administration, click the Case Management tab.

  3. In the navigation panel, expand the Processing menu and click OCR.

  4. When the (bulk) OCR workspace appears, select the needed client and case.

  5. Select the scope of the operation in the box labeled Documents to Process:

    • Process Entire Case: OCR will be created for all images for which it is missing.

    • Saved Search: Select this option and then select the needed saved search.

    • Advanced Search: Select this option, click Search, then define and run the needed search. For details on searching, see the Use Advanced Search section in Eclipse SE Desktop. The Doc Count field will display the number of documents in the search results.

  6. If the selected images contain a language other than English, complete the following steps to ensure the correct language dictionary is used for this OCR session:

    1. Click Select OCR Language.

    2. Select the needed language. Only one language can be selected at a time.

  7. Select the needed options, described under Bulk OCR Options.

    Bulk OCR Options

    • Include Image Key in OCR Text: Select this option to add the document’s image key at the beginning of the extracted text, in the Extracted Text field and tab.

    • RE-OCR Pages Rotated since last Bulk OCR: Some pages may have been initially scanned with the wrong rotation, yielding meaningless OCR. If the images for these pages are rotated correctly in Eclipse SE Desktop, select this option to re-OCR the pages based on the Desktop rotation.

    • RE-OCR All Documents and Pages: Select this option to create OCR for all pages of all documents selected; existing OCR will be overwritten in the Extracted Text field and tab.

    • Do Not Replace Existing Extracted Text: Select this option to create word-coordinate files without updating any existing extracted text.

    • Image Rotation: Choose one of the rotation options:

      • User Viewer Rotation: If pages are rotated correctly in Eclipse SE Desktop, select this option to OCR the pages based on the Desktop rotation. This is the preferred selection.

      • Auto-Rotate Images: If this option is selected, Eclipse SE analyzes a sample of image text to identify the correct rotation. This option may not work for certain images, such as those with text in the document margins.

  8. When all options are selected, click Start. OCR text will be placed in the case database directory.

  9. Observe status at the bottom of the workspace. The amount of time the operation will take depends on how many images are being processed.

  10. When the operation completes, note important information in the status bar. If errors exist, click View Log File to evaluate the error log.
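As a way to reason about how the Bulk OCR Options interact, the following sketch models which pages would be processed under each selection. It is an interpretation of the option descriptions above, not Eclipse SE's implementation. The `Page` and `Options` records, `needs_ocr`, and `build_text` are hypothetical, and the newline used to separate the image key from the text is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Page:
    """Hypothetical page record."""
    image_key: str
    has_ocr: bool
    rotated_since_last_ocr: bool = False


@dataclass
class Options:
    """Hypothetical mirror of three of the Bulk OCR Options check boxes."""
    reocr_all: bool = False
    reocr_rotated: bool = False
    include_image_key: bool = False


def needs_ocr(page, opts):
    """Decide whether a page would be (re)processed under the given options."""
    if opts.reocr_all:
        return True  # RE-OCR All Documents and Pages: everything is redone
    if opts.reocr_rotated and page.rotated_since_last_ocr:
        return True  # RE-OCR Pages Rotated since last Bulk OCR
    return not page.has_ocr  # default: only pages missing OCR


def build_text(page, recognized, opts):
    """Optionally place the image key at the beginning of the extracted text."""
    if opts.include_image_key:
        # Assumption: the key and the text are separated by a newline.
        return f"{page.image_key}\n{recognized}"
    return recognized


# Made-up sample data.
done = Page("IMG0001", has_ocr=True)
rotated = Page("IMG0002", has_ocr=True, rotated_since_last_ocr=True)
missing = Page("IMG0003", has_ocr=False)

default_opts = Options()
print([needs_ocr(p, default_opts) for p in (done, rotated, missing)])
print([needs_ocr(p, Options(reocr_rotated=True)) for p in (done, rotated, missing)])
print(build_text(missing, "Sample text", Options(include_image_key=True)))
```

Under this reading, the default selection touches only pages without OCR, the rotation option adds pages whose rotation changed since the last run, and RE-OCR All Documents and Pages processes every selected page regardless of existing OCR.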

 

Related Topics

Overview: Processing Files