MARS Scanner Data Extraction

Once a document ingested by the Helix MARS Scanner has been separated, classified, and run through OCR, the next step is to extract specific data fields from it. This extraction is typically guided by rules or templates associated with the identified document type.

Data Extraction Techniques Used:

Zonal Extraction: This method relies on predefined coordinates (zones) within a document template. The Scanner extracts the data found within the specified rectangular area on the page. This is highly effective for structured forms where fields appear in consistent locations.
Pattern-Based Extraction: Uses content patterns to locate and extract data. This often involves:
    Regular Expressions (RegEx): Defining precise patterns to find and capture data like dates, ID numbers, email addresses, or specific codes within the recognized text.
    Keyword Anchoring: Locating data based on its proximity to known text labels (e.g., extracting the value immediately following "Subtotal:").
OCR/ICR/OMR/Barcode Results: The data recognized by the various character, mark, and barcode recognition engines serves as direct input for extraction. For example, the decoded string from a barcode is extracted as a specific field value.
Table Extraction: Algorithms identify the rows and columns of tabular data in the document and extract the cell values, where possible preserving each value's relationship to its row and column headers.
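
The techniques listed above can be illustrated with a minimal Python sketch. It is not the Scanner's implementation or API: the Word class, the function names, the coordinate convention (origin at top-left, y increasing downward), and the ISO date pattern are all assumptions chosen for illustration. It assumes the OCR stage yields words with bounding boxes.

```python
import re
from dataclasses import dataclass


@dataclass
class Word:
    """One OCR-recognized word with its bounding box (hypothetical structure)."""
    text: str
    x0: float
    y0: float
    x1: float
    y1: float

    @property
    def cx(self):
        return (self.x0 + self.x1) / 2

    @property
    def cy(self):
        return (self.y0 + self.y1) / 2


def extract_zone(words, zone):
    """Zonal extraction: join the words whose centers fall inside `zone`.

    `zone` is (x0, y0, x1, y1) in the same coordinates as the words.
    """
    zx0, zy0, zx1, zy1 = zone
    hits = [w for w in words if zx0 <= w.cx <= zx1 and zy0 <= w.cy <= zy1]
    # Simple reading order: top to bottom, then left to right.
    hits.sort(key=lambda w: (w.y0, w.x0))
    return " ".join(w.text for w in hits)


# Assumed date format for this sketch: ISO-style YYYY-MM-DD.
DATE_RE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")


def extract_date(text):
    """Pattern-based extraction: return the first date-like match, or None."""
    match = DATE_RE.search(text)
    return match.group(0) if match else None


def extract_after_label(text, label):
    """Keyword anchoring: return the token immediately following `label`."""
    match = re.search(re.escape(label) + r"\s*(\S+)", text)
    return match.group(1) if match else None


def extract_table(header_words, body_words, row_tol=5.0):
    """Table extraction: one dict per row, keyed by column header text.

    Each body word is assigned to the header with the nearest x-center;
    words whose y-centers lie within `row_tol` of the current row join it.
    """
    rows = []  # list of (row y-center, {header: cell text})
    for w in sorted(body_words, key=lambda w: (w.y0, w.x0)):
        column = min(header_words, key=lambda h: abs(h.cx - w.cx)).text
        if rows and abs(rows[-1][0] - w.cy) <= row_tol:
            row = rows[-1][1]
        else:
            row = {}
            rows.append((w.cy, row))
        row[column] = (row.get(column, "") + " " + w.text).strip()
    return [row for _, row in rows]
```

For instance, calling extract_after_label on the recognized text with the label "Subtotal:" returns the token that follows the label, while extract_zone returns whatever text lies inside a template-defined rectangle regardless of its content. Real forms usually need more tolerance than this sketch provides, such as skew correction, multi-word values, and merged table cells.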

The extracted data, now structured, is typically passed to the next stage, often validation or export, together with metadata such as the document classification.
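
As a rough sketch of what such a hand-off might look like, the following Python bundles extracted fields with classification metadata and serializes them for a downstream stage. The ExtractionResult class, its field names, and the choice of JSON are assumptions for illustration; the source does not specify the Scanner's actual output format.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class ExtractionResult:
    """Extracted fields plus metadata for one document (hypothetical layout)."""
    document_type: str                      # classification from the earlier stage
    fields: dict                            # field name -> extracted value
    source_pages: list = field(default_factory=list)


def to_export_payload(result):
    """Serialize the result as JSON for a validation or export stage."""
    return json.dumps(asdict(result), sort_keys=True)
```

Keeping the classification next to the field values lets a later validation step choose rules by document type without re-reading the document.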