Manage Ingestion Settings

You can view and modify ingestion settings for a Processing case in OPEN DISCOVERY. These settings, which are also found in eCapture, let you customize how files are ingested into your case. Settings modified in OPEN DISCOVERY are updated in eCapture, and vice versa.

Follow the instructions below to learn how to view and modify ingestion settings in OPEN DISCOVERY, as well as to review detailed definitions for each setting.

Note: There are additional ingestion settings available in eCapture. For further customization options, use the eCapture app. See Create a Streaming Discovery Job for more information.

For a visual overview of ingestion settings, see the video below.

View and Modify Ingestion Settings

To view and/or modify ingestion settings:

  1. Open Case Management and locate the Processing case whose settings you would like to view or modify.

  2. Click the hamburger icon corresponding to the needed case.

  3. From the menu that appears, select Case Settings.

  4. Click the Ingestion tab at the top of the Case Settings work area.

  5. On this page you can view a selection of ingestion and filtering settings. To view and/or modify the full set of ingestion settings available in OPEN DISCOVERY, click the Manage Ingestion button.

  6. Update settings as needed. See the table below for more information about these settings.

  7. When finished, click the Save button in the top-right corner of the page. If you exit without saving, any changes you made will be lost.

    Note: You can discard any changes you have made by clicking the Cancel button in the top-right corner of the page, or by exiting the page.

How Hash Values are Generated For Deduplication

When hashing a file for deduplication, only the file's content is used; the filename is not taken into account. Emails are handled differently: you can customize the fields used to generate the hash value. For documents that are part of a family, the entire family is included when deduplicating.
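The following is a minimal sketch of content-only hashing, not the product's actual implementation. The file names and the `content_hash` helper are hypothetical, and SHA-256 is an assumption, because this page does not name the hash algorithm used.

```python
import hashlib


def content_hash(data: bytes) -> str:
    """Hash only the file content; the filename never enters the digest."""
    # SHA-256 is an assumption; the hash algorithm is not named on this page.
    return hashlib.sha256(data).hexdigest()


# Hypothetical files: two share the same content under different names.
files = {
    "report_final.docx": b"Quarterly results",
    "copy of report.docx": b"Quarterly results",
    "notes.txt": b"Meeting notes",
}

hashes = {name: content_hash(data) for name, data in files.items()}

# Same content, different filenames: the hashes match, so the files are duplicates.
print(hashes["report_final.docx"] == hashes["copy of report.docx"])
# Different content: the hashes differ.
print(hashes["report_final.docx"] == hashes["notes.txt"])
```

Because the filename is never passed to the hash function, renaming a file has no effect on whether it is identified as a duplicate.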

Understand Ingestion Settings

Review the table below to learn more about the various ingestion settings you can update in OPEN DISCOVERY.

Setting

Definition

Time Zone

Select the time zone to be applied to extracted date metadata. Use the dropdown to locate the needed time zone.

For more information about Time Zone Handling, see How IPRO eCapture Handles Dates and Time Zones.

Container handling

Determine the treatment of archive (.zip, .rar, etc.) and PDF Portfolio/Package containers. When "Treat as containers" is selected, all extracted files are treated as a single family of documents, with the container identified as the parent.

File extraction

The "File extraction is on" option is selected by default. The related Extract options are also selected by default and may be cleared independently, if desired.

If this option is disabled (which also clears the related Extract options) and data is submitted for extraction, nothing is extracted from container file types such as mail stores and archives. This allows documents to be sent through Streaming Discovery when all of the documents, including file parents (e.g., emails and e-docs), have already been extracted.

Note: Node records are generated for container files such as .PST, .NSF, and archives; however, no items are extracted. The status indicator states: "No Content extracted, file extraction disabled by user".

  • Extract inline images from emails: When enabled, inline images in email messages (e.g., signature files) are extracted as attachments and treated as child documents.

    When disabled, inline images are not extracted as children. The images are not treated as separate documents, and therefore are not OCRed, language-identified, or indexed. The images are rendered inline as they would look in the native file.

    Note: Selecting this option can lead to a significant increase in the number of documents extracted.

  • Extract embedded files when possible: An embedded file is an object that has been inserted into a document and, if extracted, can act as a standalone document. This option consolidates Excel documents, Word documents, PowerPoint documents, Email File Attachments (Outlook.FileAttach), Visio drawings, Package-Embedded documents, Acrobat documents, Email Message Attachments (MailMsgATT), and Email File Attachments (MailFileAtt).

    When selected, the embedded files are extracted as separate documents and treated as child documents. If this option is not selected, then the embedded files are not extracted as separate documents.

    When this option is selected, all files embedded inside non-emails (e-docs) are extracted. These files are sent through discovery, text extraction, metadata extraction, and export with their parent. When this option is not selected, files embedded inside non-emails (e-docs) are not extracted; they are ignored, and only the parent document is processed.

OCR

Configure your OCR settings. Optical Character Recognition (OCR) identifies and extracts text from image-based files so that the text can be indexed and searched in Review. To use OCR on image-based files such as .pdf, .jpg, .bmp, .tiff, etc., ensure the slider at the top is set to "OCR is on".

Note: You can turn off OCR by selecting the slider so that it turns gray. Disabling OCR prevents the identification and extraction of text from image-based files.

You can likewise set the following options:

  • OCR Images: Images are OCRed to retrieve any available text from the image. The OCR text is available for indexing and searching in the Review application.

  • OCR PowerPoint: Turn this option on to perform OCR on Microsoft PowerPoint files during indexing to get text from embedded content in the slides. This results in slower indexing speeds for PowerPoint files, but more accurate search results.

  • OCR PDF pages missing text: PDF pages with no embedded text are OCRed before indexing or language identification. PDF pages with embedded text (text-behind) have that text extracted. Comments on a PDF file are also extracted. The OCR text is added to any text extracted from the PDF. All text is available for indexing and searching in the Review application.

    Optionally, select Specify OCR threshold for PDFs and enter a value. The default value is 75 characters, and the maximum value is 100. If the amount of text on a page is less than the threshold you set, eCapture sends the page to be OCRed; otherwise, the text is simply extracted.

  • Minimum average OCR confidence (1-100): The value can range from 1 to 100. The default is 50. The confidence level is the average confidence percentage across all pages of a document on which OCR was performed. Success or failure of OCR results is based on the average confidence level of the document. If the average confidence level is below the selected threshold, the page is considered an OCR error.

  • Use OCR Workers: Select this option to simultaneously create an OPEN DISCOVERY OCR job with the OPEN DISCOVERY Streaming Discovery Job.

    Workers that are OPEN DISCOVERY Eligible or OPEN DISCOVERY Exclusive will accept OCR tasks if licensing is available. A different task table may be specified for OPEN DISCOVERY OCR Workers.

    Selecting this option can improve performance. If the Use OCR Workers option is not selected, OCR tasks are assigned to licensed Streaming Discovery Workers.

    OCR worker task table: If a custom task table is selected from the drop-down menu, OCR tasks are sent to those Workers assigned to the selected task table.

    Note: For information about the OCR Worker Task Table, see Create Task Tables and Assign Task Tables to Workers.

  • OCR Languages: OPEN DISCOVERY includes multi-language OCR capability. A valid multi-language OCR license must be available to modify the originally selected languages.

    In the OCR Languages field, click to display the OCR Foreign Language dialog box.

    After selecting the languages, click OK to close the dialog box.

    The following languages are supported:

    • English

    • Arabic

    • Chinese Simplified

    • Chinese Traditional

    • Japanese

    • Korean

    • Afrikaans

    • Albanian

    • Basque

    • Belarusian

    • Bulgarian

    • Catalan

    • Croatian

    • Czech

    • Danish

    • Dutch

    • Estonian

    • Faroese

    • Finnish

    • French

    • Galician

    • German

    • Greek

    • Hungarian

    • Icelandic

    • Indonesian

    • Italian

    • Latvian

    • Lithuanian

    • Macedonian

    • Norwegian

    • Polish

    • Portuguese

    • Portuguese Brazil

    • Romanian

    • Russian

    • Serbian

    • Serbian Cyrillic

    • Slovak

    • Slovenian

    • Spanish

    • Swedish

    • Turkish

    • Ukrainian

    Caveats to OCR Foreign Language handling:

    English is the only language that is selected by default. The more languages that are selected, the lower the confidence level will be for correctly identifying the languages in a document.

    • If English is selected, Arabic will not be available for selection.

    • If Arabic is selected, all other languages will not be available for selection.

    • If one of the CJK (Chinese, Japanese, Korean) languages is selected, the remaining CJK languages will not be available for selection. Other languages (excluding Arabic) may be selected.

    • If Chinese Simplified is selected, Chinese Traditional, Japanese, and Korean will not be available for selection.

    • If Chinese Traditional is selected, Chinese Simplified, Japanese, and Korean will not be available for selection.

    • If Japanese is selected, Chinese Simplified, Chinese Traditional, and Korean will not be available for selection.

    • If Korean is selected, Chinese Simplified, Chinese Traditional, and Japanese will not be available for selection.
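The decision rules described in the OCR options above can be sketched as follows. This is an illustration under stated assumptions, not the product's implementation: the function names are hypothetical, the PDF threshold is assumed to be compared against the number of embedded text characters on a page, and the language check encodes only the caveats listed above.

```python
from statistics import mean
from typing import Iterable, Sequence

CJK = {"Chinese Simplified", "Chinese Traditional", "Japanese", "Korean"}


def page_needs_ocr(embedded_text: str, threshold: int = 75) -> bool:
    """Send a PDF page to OCR when its embedded text is below the threshold."""
    # Assumption: the compared value is the page's embedded character count.
    return len(embedded_text) < threshold


def is_ocr_error(page_confidences: Sequence[float], minimum: float = 50) -> bool:
    """Flag an OCR error when the average confidence is below the minimum."""
    if not page_confidences:
        return False  # no pages were OCRed, so there is nothing to evaluate
    return mean(page_confidences) < minimum


def language_selection_ok(selected: Iterable[str]) -> bool:
    """Check a language selection against the caveats listed above."""
    chosen = set(selected)
    if "Arabic" in chosen and len(chosen) > 1:
        return False  # Arabic cannot be combined with any other language
    return len(chosen & CJK) <= 1  # at most one CJK language


print(page_needs_ocr("Page 3"))                       # 6 characters, below 75
print(is_ocr_error([90, 40, 10]))                     # average is below 50
print(language_selection_ok({"English", "Japanese"}))
print(language_selection_ok({"Japanese", "Korean"}))
```

In the application these rules are enforced through the settings page and the OCR Foreign Language dialog box; the sketch only makes the conditions explicit.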

Email Hash

When an email is discovered within OPEN DISCOVERY, it is assigned a hash value based on fields chosen by the user. The values of these fields are concatenated, and the resulting text is hashed.

Note: Changing the selected fields midway through a case does not update the hash values of documents that were already processed. This may result in the failure to identify future duplicates.

Select from the following email fields to generate the hash value:

  • Subject

  • Body

    • Body Whitespace: Whitespace in the email body can cause slight differences between otherwise identical emails, which can result in different hashes being generated. On the Body Whitespace drop-down menu, select either Remove or Include. Remove strips all whitespace between lines of text in the email body before hashing. Include keeps the whitespace.

  • Recipients

  • From/Author

  • Email Date: Sent Date is used for all emails. When no Sent Date is available, Create Date will be used instead.

  • CC

  • BCC

  • Attachment Count

  • Attachment Names
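The email hashing described above (concatenate the selected field values, then hash the text) can be sketched as follows. This is a rough illustration only. The `email_hash` and `email_date` helpers and the sample emails are hypothetical, SHA-256 is an assumption because this page does not name the algorithm, and the Remove option is approximated here by stripping every whitespace character from the body, which may be broader than what the product does.

```python
import hashlib
from typing import Dict, List, Optional


def email_date(sent: Optional[str], created: Optional[str]) -> str:
    """Use Sent Date when available; otherwise fall back to Create Date."""
    return sent or created or ""


def email_hash(fields: Dict[str, str], selected: List[str],
               remove_whitespace: bool = True) -> str:
    """Concatenate the selected field values and hash the resulting text."""
    parts = []
    for name in selected:
        value = str(fields.get(name, ""))
        if name == "Body" and remove_whitespace:
            # Approximation of the Remove option: drop all whitespace.
            value = "".join(value.split())
        parts.append(value)
    # Field order and plain concatenation are assumptions of this sketch.
    text = "".join(parts)
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


# Two hypothetical copies of one email that differ only in body whitespace.
first = {
    "Subject": "Q3 numbers",
    "Body": "Hello,\n\nSee attached.\n",
    "From/Author": "ann@example.com",
    "Email Date": email_date(None, "2024-01-05T10:00:00Z"),
}
second = dict(first, Body="Hello,\nSee attached.")
selected = ["Subject", "Body", "From/Author", "Email Date"]

# Remove: the copies hash identically. Include: the hashes differ.
print(email_hash(first, selected, True) == email_hash(second, selected, True))
print(email_hash(first, selected, False) == email_hash(second, selected, False))
```

The sketch also shows why changing the selected fields mid-case matters: a different `selected` list produces a different hash for the same email, so earlier and later copies would no longer match.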

 

Related Topics

Manage Settings for a Processing Case

Manage Filtering Settings