MDMS Structuring Unstructured Data

A key capability of the MARS Data Mining Studio (MDMS) is processing unstructured data (information that lacks a predefined organizational schema, such as free-form text documents, scanned images, emails, or complex PDFs) and extracting relevant elements to generate structured data outputs. This transformation unlocks the value held in large volumes of enterprise information that was previously inaccessible.

The Structuring Workflow:

Ingestion of Unstructured Content: MDMS takes various forms of unstructured or semi-structured data as input. This can range from scanned TIF images and PDFs to emails (MSG/EML), word processing documents, text files, and even complex print stream formats retrieved from archives or feeds.
Intelligent Analysis and Extraction: Using its array of extraction techniques (including OCR/ICR for images, RegEx and pattern matching for text, zoning for layout, keyword analysis, and AI/ML classification), MDMS identifies and isolates specific, targeted data points within the unstructured source. This step is guided by predefined templates or rules designed for the specific type of unstructured content being processed. For example, a template for scanned vendor invoices might direct MDMS to find and extract the 'Invoice Number', 'Date', and 'Total Amount' from each of thousands of documents.
Data Cleansing, Validation, and Transformation: Once the raw data elements are extracted, MDMS can apply rules to cleanse them (e.g., removing currency symbols, standardizing date formats), validate them against expected criteria (e.g., checking if a value falls within a valid range, verifying checksums), and transform them if necessary (e.g., converting units, concatenating fields).
Generation of Structured Output: The final step involves assembling the cleansed, validated, and transformed data elements into a well-defined, structured format suitable for downstream use. Common structured outputs generated by MDMS include:
- XML: Data organized according to a user-defined or industry-standard XML schema.
- CSV: Comma-separated value files, easily importable into spreadsheets or databases.
- Database Inserts: Direct population of tables within relational databases.
- JSON: JavaScript Object Notation format, suitable for web services and modern applications.
- Formatted Text: Custom-layout text files for specific reporting or system input needs.
- Searchable PDF Layers: Extracted text embedded as a hidden layer within output PDF files, making them searchable.
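
The extraction, cleansing, validation, and output steps described above can be illustrated with a minimal sketch. This is not MDMS's API or template format: it uses only the Python standard library, and the sample text, the field patterns, and the function names (extract, cleanse, validate, to_json, to_csv) are hypothetical, chosen to mirror the invoice example. It assumes the text has already been recovered from the source document (for instance by OCR) and that dates are written month-first.

```python
import csv
import io
import json
import re
from datetime import datetime

# Hypothetical text recovered from one scanned invoice (illustrative only).
RAW_TEXT = """ACME Supplies Ltd.
Invoice Number: INV-2024-0042
Date: 03/15/2024
Total Amount: $1,234.50
"""

# Hypothetical extraction rules: one regular expression per target field.
FIELD_PATTERNS = {
    "invoice_number": re.compile(r"Invoice Number:\s*(\S+)"),
    "date": re.compile(r"Date:\s*(\d{2}/\d{2}/\d{4})"),
    "total_amount": re.compile(r"Total Amount:\s*([$€£]?[\d,]+\.\d{2})"),
}


def extract(text):
    """Return the first match for each field, or None when a field is absent."""
    record = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        record[name] = match.group(1) if match else None
    return record


def cleanse(record):
    """Strip currency symbols and separators; convert the date to ISO 8601."""
    cleaned = dict(record)
    if record["total_amount"] is not None:
        cleaned["total_amount"] = float(re.sub(r"[^\d.]", "", record["total_amount"]))
    if record["date"] is not None:
        # Assumes MM/DD/YYYY input; a real rule set would make this configurable.
        parsed = datetime.strptime(record["date"], "%m/%d/%Y")
        cleaned["date"] = parsed.date().isoformat()
    return cleaned


def validate(record, max_total=1_000_000.0):
    """Return a list of problems; an empty list means the record passed."""
    errors = [f"missing field: {name}" for name, value in record.items() if value is None]
    total = record["total_amount"]
    if total is not None and not (0 < total <= max_total):
        errors.append(f"total_amount out of range: {total}")
    return errors


def to_json(record):
    """Serialize one record as a JSON object."""
    return json.dumps(record, indent=2)


def to_csv(records):
    """Serialize records as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(FIELD_PATTERNS))
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


record = cleanse(extract(RAW_TEXT))
problems = validate(record)
if problems:
    raise ValueError("; ".join(problems))

print(to_json(record))
print(to_csv([record]), end="")
```

Keeping each stage as a separate function mirrors the workflow: the patterns can change per document type without touching the cleansing or output code, and a record that fails validation can be routed for review instead of being written out.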

By systematically converting valuable information from unstructured sources into usable structured data, MDMS enables improved analytics, enhanced automation (like automated data entry), and seamless integration with enterprise applications that require structured input.