MDMS Form Design Capabilities
The MARS Data Mining Studio (MDMS) provides an interactive environment for designing extraction 'forms' or templates. These act as blueprints that define how data should be located and captured from specific types of documents, especially those with consistent layouts such as business forms, invoices, statements, or standardized reports.
Core Elements within Form Design:
- Field Definition and Naming: Users can define logical names for each piece of information they intend to extract (e.g., customer_id, invoice_date, subtotal, product_sku). These names become the keys or labels for the extracted data in the structured output.
- Data Typing and Formatting: For each defined field, users specify the expected data type, which assists both the extraction engine in interpreting found values and the validation engine in checking correctness. Supported types include:
- Text: For general alphanumeric strings, often combined with pattern matching (RegEx, wildcards).
- Date / Time / DateTime: Allows specification of various expected input formats (e.g., MM/DD/YYYY, YYYY-MM-DD, DD-Mon-YYYY HH:MM) for accurate parsing.
- Currency: Designed to handle numeric values representing money, often recognizing currency symbols and decimal separators according to locale settings.
- Number: Supports various numeric representations including integers, decimals (floating-point), and potentially scientific notation.
- Boolean / Choice: Used for capturing selections, typically extracted from checkboxes (via OMR) or specific keywords indicating binary states (Yes/No, True/False).
- Extraction Method Mapping: The crucial step involves linking each defined field to one or more specific extraction methods within the MDMS toolkit. This is often done visually on a sample document:
- Zoning: Drawing a box around the area where the field's data is consistently located.
- Pattern Definition: Specifying a RegEx or wildcard pattern that uniquely identifies the data for the field.
- Anchor Definition: Defining keywords or labels near the target data that help pinpoint its location.
- OCR/ICR/OMR/Barcode Configuration: Specifying the type of recognition needed if the field data is graphical or encoded.
- Table Region Definition: For data appearing in tables, specialized tools allow defining the table boundaries and mapping columns to specific output fields.
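To make the relationship between field names, data types, and extraction methods concrete, here is a minimal Python sketch. It is not MDMS code: the Field class, the parser functions, and the sample invoice text are all hypothetical, and the label in each RegEx plays the role of an anchor while the capture group plays the role of the pattern. Currency parsing assumes a US-style locale.

```python
import re
from dataclasses import dataclass
from datetime import datetime
from decimal import Decimal
from typing import Callable


def parse_date(raw: str) -> datetime:
    # Try each expected input format in turn; these are the strptime
    # equivalents of MM/DD/YYYY, YYYY-MM-DD and DD-Mon-YYYY HH:MM.
    for fmt in ("%m/%d/%Y", "%Y-%m-%d", "%d-%b-%Y %H:%M"):
        try:
            return datetime.strptime(raw, fmt)
        except ValueError:
            continue
    raise ValueError(f"unrecognized date: {raw!r}")


def parse_currency(raw: str) -> Decimal:
    # Assumption: "," groups thousands and "." is the decimal separator.
    return Decimal(re.sub(r"[^\d.\-]", "", raw))


def parse_boolean(raw: str) -> bool:
    return raw.strip().lower() in {"yes", "true", "x"}


@dataclass
class Field:
    """Hypothetical field definition: output key, locator, type converter."""
    name: str
    pattern: str                          # anchor label + one capture group
    parse: Callable[[str], object] = str  # Text type by default


def extract(fields, text):
    """Apply every field definition to the text and build one record."""
    record = {}
    for field in fields:
        match = re.search(field.pattern, text)
        record[field.name] = field.parse(match.group(1)) if match else None
    return record


SAMPLE = """Invoice Date: 03/15/2024
Customer ID: C-10442
Subtotal: $1,250.00
Paid: Yes"""

FIELDS = [
    Field("invoice_date", r"Invoice Date:\s*(\S+)", parse_date),
    Field("customer_id", r"Customer ID:\s*(C-\d+)"),
    Field("subtotal", r"Subtotal:\s*(\S+)", parse_currency),
    Field("paid", r"Paid:\s*(\w+)", parse_boolean),
]

record = extract(FIELDS, SAMPLE)
print(record)
```

Fields whose pattern does not match are recorded as None rather than raising, which leaves the decision about missing values to a later validation step.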
The result of this design process is a saved template (Helix Config File) containing a precise, reusable set of instructions for MDMS to follow when extracting data from documents conforming to that layout and structure.
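The idea of a saved, reusable template can be sketched as follows. The actual Helix Config File format is not described here, so this example uses JSON purely as a stand-in; the ZoneField class, the pixel-based zone coordinates, and the (text, x, y) word tuples standing in for OCR output are all assumptions made for illustration.

```python
import json
from dataclasses import dataclass, asdict
from typing import List, Tuple


@dataclass
class ZoneField:
    """Hypothetical zoned field: name, declared type, bounding box."""
    name: str
    data_type: str
    zone: Tuple[int, int, int, int]  # left, top, right, bottom (pixels)


def save_template(fields: List[ZoneField]) -> str:
    # Serialize the field definitions; JSON stands in for the real format.
    return json.dumps({"fields": [asdict(f) for f in fields]}, indent=2)


def load_template(text: str) -> List[ZoneField]:
    # JSON has no tuple type, so the zone list is converted back to a tuple.
    return [
        ZoneField(d["name"], d["data_type"], tuple(d["zone"]))
        for d in json.loads(text)["fields"]
    ]


def words_in_zone(words, zone) -> str:
    # Keep only words whose position falls inside the drawn box.
    left, top, right, bottom = zone
    return " ".join(
        w for w, x, y in words if left <= x <= right and top <= y <= bottom
    )


# Design time: define the template once and save it.
template = [
    ZoneField("invoice_number", "Text", (400, 20, 560, 50)),
    ZoneField("total", "Currency", (400, 590, 560, 620)),
]
saved = save_template(template)

# Run time: reload the template and apply it to a new page's words.
words = [
    ("Invoice", 40, 30), ("INV-2024-001", 420, 32),
    ("Total", 40, 600), ("1,250.00", 430, 602),
]
restored = load_template(saved)
result = {f.name: words_in_zone(words, f.zone) for f in restored}
print(result)
```

Separating the saved definition from the page data is what makes the template reusable: the same restored field list can be applied to any document that conforms to the layout.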