MARS Scanner Classification Methods

After separating a scanned batch into potential documents and performing OCR, the Helix MARS Scanner application classifies each document by type (e.g., invoice, application, correspondence). The assigned type determines which data extraction rules are applied to the document.

Classification Approaches:

Template-Based Rules: Classification is often driven by matching document content or layout against predefined templates. These templates contain rules based on:
Keywords: Identifying the presence, absence, or specific location of unique words or phrases characteristic of a document type (e.g., the word "INVOICE" appearing prominently). Regular expressions can be used for more flexible keyword matching.
Layout Patterns: Recognizing specific structural elements or positional relationships between text blocks or graphical elements that are consistent for a particular document type.
Machine Learning (ML) Classification: For more complex scenarios, or where explicit rules are difficult to define, ML models (for example, Support Vector Machines) can be trained on large sets of example documents. These models learn the distinguishing visual and textual features of each document class and then classify new, unseen documents based on those learned patterns.
Combined Approaches: Rule-based checks and ML models are often combined to achieve higher accuracy and to handle variations within document types.
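
The approaches above can be illustrated with a minimal Python sketch. It is not the MARS Scanner implementation or its template format: the Template class, its fields (required, forbidden, header), the example patterns, and the classify function are all hypothetical names chosen for illustration. The sketch checks keyword rules with regular expressions, approximates "appearing prominently" as "appearing in the first few lines of OCR text", and hands off to an optional fallback classifier (standing in for a trained ML model) when no template matches.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Template:
    """Hypothetical rule set for one document type (not the MARS format)."""
    doc_type: str
    required: List[str]                                 # regexes that must all match
    forbidden: List[str] = field(default_factory=list)  # regexes that must not match
    header: Optional[str] = None                        # regex that must match near the top

    def matches(self, text: str, header_lines: int = 5) -> bool:
        flags = re.IGNORECASE
        # Location rule: the header pattern must occur in the first few lines.
        if self.header is not None:
            top = "\n".join(text.splitlines()[:header_lines])
            if not re.search(self.header, top, flags):
                return False
        # Absence rule: any forbidden pattern disqualifies the template.
        if any(re.search(p, text, flags) for p in self.forbidden):
            return False
        # Presence rule: every required pattern must occur somewhere.
        return all(re.search(p, text, flags) for p in self.required)


# Illustrative templates only; real rules would be tuned per document type.
TEMPLATES = [
    Template("invoice",
             required=[r"\b(total|amount)\s+due\b"],
             forbidden=[r"\bapplication\s+form\b"],
             header=r"\binvoice\b"),
    Template("application",
             required=[r"\bapplicant\b"],
             header=r"\bapplication\b"),
]


def classify(text: str,
             fallback: Optional[Callable[[str], str]] = None) -> str:
    """Return the first matching template's type, else defer to the fallback."""
    for template in TEMPLATES:
        if template.matches(text):
            return template.doc_type
    if fallback is not None:
        return fallback(text)  # e.g. a trained ML model's prediction
    return "unknown"
```

For example, classify("INVOICE No. 1001\nTotal due: $45.00") selects the invoice template, while a letter that matches no template is passed to the fallback if one is supplied. Trying rules first and the model second is only one way to combine the two; a system could equally run both and reconcile their answers.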

Accurate classification ensures that the appropriate data extraction logic (defined in associated templates or rules) is applied to each document, maximizing the success rate of automated data capture.
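
To show why the assigned type matters for extraction, the following Python sketch maps each document type to its own set of field patterns. The EXTRACTION_RULES table, the field names, and the regular expressions are assumptions made for illustration and do not reflect how MARS Scanner stores its templates or rules.

```python
import re
from typing import Dict

# Hypothetical extraction rules keyed by document type: field name -> regex
# with one capture group holding the value.
EXTRACTION_RULES: Dict[str, Dict[str, str]] = {
    "invoice": {
        "invoice_number": r"invoice\s+(?:no\.?|number)\s*:?\s*(\S+)",
        "total": r"total\s+due\s*:?\s*\$?([\d,]+\.\d{2})",
    },
    "application": {
        "applicant": r"applicant\s+name\s*:\s*(.+)",
    },
}


def extract_fields(doc_type: str, text: str) -> Dict[str, str]:
    """Apply only the rules registered for the given document type."""
    fields: Dict[str, str] = {}
    for name, pattern in EXTRACTION_RULES.get(doc_type, {}).items():
        match = re.search(pattern, text, re.IGNORECASE)
        if match:
            fields[name] = match.group(1).strip()
    return fields
```

In this sketch, an invoice that is misclassified as an application is run against the wrong patterns and yields no fields, which is the failure that accurate classification is meant to prevent.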