Making Sense of the Mess: LLMs' Role in Unstructured Data Extraction




Recent advancements in hardware, such as the Nvidia H100 GPU, have significantly enhanced computational capabilities. Nvidia cites up to nine times faster AI training than the A100 on large models, which makes these GPUs well suited to deep learning workloads. This advancement has spurred the commercial use of generative AI in natural language processing (NLP) and computer vision, enabling automated and intelligent data extraction. Businesses can now more readily convert unstructured data into valuable insights, marking a significant leap forward in technology integration.

Traditional Methods of Data Extraction 

Manual Data Entry 

Surprisingly, many companies still rely on manual data entry, despite the availability of more advanced technologies. This method involves hand-keying information directly into the target system. It is often easier to adopt due to its lower initial costs. However, manual data entry is not only tedious and time-consuming but also highly prone to errors. Additionally, it poses a security risk when handling sensitive data, making it a less desirable option in the age of automation and digital security. 

Optical Character Recognition (OCR)  

OCR technology, which converts images and handwritten content into machine-readable data, offers a faster and more cost-effective solution for data extraction. However, the quality can be unreliable. For example, characters like “S” can be misinterpreted as “8” and vice versa.  

OCR's performance depends heavily on the complexity and characteristics of the input data; it works well with high-resolution scanned images free from issues such as orientation tilts, watermarks, or overwriting. However, it struggles with handwritten text, especially when the visuals are intricate or difficult to process, and such inputs may need additional adaptation to produce usable results. Data extraction tools built on OCR often stack several layers of post-processing to improve the accuracy of the extracted data, but even these solutions cannot guarantee 100% accurate results.

Text Pattern Matching 

Text pattern matching is a method for identifying and extracting specific information from text using predefined rules or patterns. It is fast and can offer a higher ROI than other methods. It works on simple and complex documents alike, and it can reach 100% accuracy on files that share a similar layout.

However, its reliance on word-for-word matches limits adaptability: extraction succeeds only on a 100% exact match. Synonyms are a particular challenge, because the method cannot recognize related terms such as “weather” and “climate” as equivalent. Text pattern matching also lacks contextual sensitivity, with no awareness that a word can carry different meanings in different contexts. Striking the right balance between rigidity and adaptability remains a constant challenge in employing this method effectively.
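The rigidity described above can be sketched in a few lines of Python. This is a minimal illustration, assuming a hypothetical invoice layout; the field labels, the sample text, and the `extract_fields` helper are invented for the example and do not come from any particular product. It shows an exact label match succeeding on the expected layout and returning nothing when a synonym ("Amount Due") replaces the expected label ("Total").

```python
import re

# Hypothetical patterns written for one known invoice layout (illustrative only).
PATTERNS = {
    "invoice_number": re.compile(r"Invoice No:\s*(\S+)"),
    "total": re.compile(r"Total:\s*\$?([\d,]+\.\d{2})"),
}


def extract_fields(text):
    """Return the first match for each pattern, or None when the label is absent."""
    result = {}
    for field, pattern in PATTERNS.items():
        match = pattern.search(text)
        result[field] = match.group(1) if match else None
    return result


known_layout = "Invoice No: INV-1042\nTotal: $1,250.00"
variant_layout = "Invoice No: INV-1043\nAmount Due: $980.00"

# The known layout yields both fields; the variant loses "total"
# because the rule only knows the literal label "Total:".
print(extract_fields(known_layout))
print(extract_fields(variant_layout))
```

Covering the variant would mean adding another rule for each synonym, which is the maintenance burden the rigidity implies.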

Named Entity Recognition (NER)  

Named entity recognition (NER), an NLP technique, identifies and categorizes key information in text. 

NER has two main limitations. First, its extractions are confined to predefined entities like organization names, locations, personal names, and dates. In other words, NER systems lack the inherent capability to extract custom entities beyond this predefined set, which could be specific to a particular domain or use case. Second, NER's focus on key values associated with recognized entities does not extend to data extraction from tables, limiting its applicability to more complex or structured data types.

 As organizations deal with increasing amounts of unstructured data, these challenges highlight the need for a comprehensive and scalable approach to extraction methodologies. 

Unlocking Unstructured Data with LLMs 

Leveraging large language models (LLMs) for unstructured data extraction is a compelling solution with distinct advantages that address critical challenges. 

Context-Aware Data Extraction 

LLMs possess strong contextual understanding, honed through extensive training on large datasets. Their ability to go beyond the surface and understand context intricacies makes them valuable in handling diverse information extraction tasks. For instance, when tasked with extracting weather values, they capture the intended information and consider related elements like climate values, seamlessly incorporating synonyms and semantics. This advanced level of comprehension establishes LLMs as a dynamic and adaptive choice in the domain of data extraction.  

Harnessing Parallel Processing Capabilities 

Transformer-based LLMs process the tokens of an input in parallel rather than one at a time, making tasks quicker and more efficient. Unlike sequential models, they can spread work across the available hardware, which accelerates data extraction and improves the overall performance of the extraction process.

Adapting to Varied Data Types 

While some models, like Recurrent Neural Networks (RNNs), must consume input strictly as an ordered sequence, LLMs handle data that is not strictly sequential and accommodate varied sentence structures with ease. This versatility extends to diverse data forms such as tables and images.

Enhancing Processing Pipelines 

The use of LLMs marks a significant shift in automating both preprocessing and post-processing stages. LLMs reduce the need for manual effort by automating extraction processes accurately, streamlining the handling of unstructured data. Their extensive training on diverse datasets enables them to identify patterns and correlations missed by traditional methods. 

This figure of a generative AI pipeline illustrates the applicability of models such as BERT, GPT, and OPT in data extraction. These LLMs can perform various NLP operations, including data extraction. Typically, the user gives the generative AI model a prompt describing the desired data, and the model's response contains the extracted data. For instance, a prompt like “Extract the names of all the vendors from this purchase order” can yield a response containing all vendor names present in the semi-structured report. Subsequently, the extracted data can be parsed and loaded into a database table or a flat file, facilitating seamless integration into organizational workflows.
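The prompt, parse, and load steps can be sketched as follows. This is a hedged illustration rather than a reference pipeline: `call_llm` is a hypothetical stand-in that returns a canned reply so the example runs without any model or API, and the JSON reply format, the sample purchase order, and the `vendor_name` column are assumptions made for the example.

```python
import csv
import io
import json


def build_prompt(document_text):
    """Describe the desired data and ask for JSON so the reply is easy to parse."""
    return (
        "Extract the names of all the vendors from this purchase order. "
        'Reply with JSON only, in the form {"vendors": ["..."]}.\n\n'
        + document_text
    )


def call_llm(prompt):
    """Hypothetical stand-in for a real model call; returns a canned reply."""
    return '{"vendors": ["Acme Supplies", "Northwind Traders"]}'


def parse_vendors(response_text):
    """Parse the model reply; malformed or unexpected output yields an empty list."""
    try:
        payload = json.loads(response_text)
    except json.JSONDecodeError:
        return []
    if not isinstance(payload, dict):
        return []
    vendors = payload.get("vendors", [])
    if not isinstance(vendors, list):
        return []
    return [v.strip() for v in vendors if isinstance(v, str) and v.strip()]


def to_csv(vendors):
    """Serialize the vendor names as a one-column flat file."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["vendor_name"])
    for name in vendors:
        writer.writerow([name])
    return buffer.getvalue()


purchase_order = "PO 7781: 10 crates from Acme Supplies; 4 pallets from Northwind Traders."
vendors = parse_vendors(call_llm(build_prompt(purchase_order)))
print(to_csv(vendors))
```

Asking for a fixed JSON shape and validating it before loading is one way to keep free-form model output from reaching the database unchecked.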

Evolving AI Frameworks: RNNs to Transformers in Modern Data Extraction 

Generative AI operates within an encoder-decoder framework featuring two collaborative neural networks. The encoder processes input data, condensing essential features into a “Context Vector.” This vector is then utilized by the decoder for generative tasks, such as language translation. This architecture, leveraging neural networks like RNNs and Transformers, finds applications in diverse domains, including machine translation, image generation, speech synthesis, and data entity extraction. These networks excel in modeling intricate relationships and dependencies within data sequences. 

Recurrent Neural Networks 

Recurrent Neural Networks (RNNs) have been designed to tackle sequence tasks like translation and summarization, excelling in certain contexts. However, they struggle with accuracy in tasks involving long-range dependencies.  

RNNs excel at extracting key-value pairs from sentences, yet they face difficulty with table-like structures. Tables demand careful handling of sequence and positional placement, so specialized approaches are needed to extract data from them. RNN adoption was also limited by low ROI and subpar performance on most text processing tasks, even after training on large volumes of data.

Long Short-Term Memory Networks 

Long Short-Term Memory (LSTM) networks address the limitations of RNNs, particularly through a selective updating and forgetting mechanism. Like RNNs, LSTMs excel at extracting key-value pairs from sentences. However, they face similar challenges with table-like structures, which demand strategic consideration of sequence and positional elements.

GPUs rose to prominence in deep learning in 2012, when they were used to train the famous AlexNet CNN model. Subsequently, some RNNs were also trained using GPUs, though they did not yield good results. Today, despite the availability of GPUs, these models have largely fallen out of use and have been replaced by transformer-based LLMs.

Transformer – Attention Mechanism 

The introduction of transformers in the groundbreaking “Attention Is All You Need” paper (2017) revolutionized NLP. The transformer architecture enables parallel computations and adeptly captures long-range dependencies, unlocking new possibilities for language models. LLMs like GPT, BERT, and OPT have harnessed transformer technology. At the heart of transformers lies the “attention” mechanism, a key contributor to enhanced performance in sequence-to-sequence data processing.

The “attention” mechanism in transformers computes a weighted sum of values based on the compatibility between the ‘query’ (question prompt) and the ‘key’ (model's understanding of each word). This approach allows focused attention during sequence generation, which supports precise extraction. Two pivotal components within the attention mechanism are Self-Attention, which captures the importance between words in the input sequence, and Multi-Head Attention, which enables diverse attention patterns for specific relationships.
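The weighted sum described above can be sketched with the scaled dot-product form from the 2017 paper, softmax(QKᵀ/√d_k)V, here reduced to a single query in plain Python. The toy vectors are chosen by hand for illustration; real models use learned projections, many heads, and matrix operations on accelerators.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that sum to 1 (shifted for numerical stability)."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    scale = math.sqrt(len(query))
    # Compatibility between the query and each key.
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    # Weighted sum of the value vectors, one output dimension at a time.
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, output


# Toy 2-dimensional vectors, chosen by hand for illustration.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
query = [1.0, 0.0]

weights, output = attention(query, keys, values)
print(weights)
print(output)
```

Keys that align with the query receive larger weights, so their values dominate the output; the orthogonal key still contributes, only less.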

In the context of Invoice Extraction, Self-Attention recognizes the relevance of a previously mentioned date when extracting payment amounts, while Multi-Head Attention focuses independently on numerical values (amounts) and textual patterns (vendor names). Unlike RNNs, transformers don't inherently understand the order of words. To address this, they use positional encoding to track each word's place in a sequence. This technique is applied to both input and output embeddings, aiding in identifying keys and their corresponding values within a document.  
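Positional encoding can be illustrated with the sinusoidal scheme from the original transformer paper, where even dimensions use a sine and odd dimensions a cosine of the position scaled by a frequency. The sketch below is a plain-Python illustration with a tiny model width; the toy embeddings are invented, and many later models use learned or rotary position schemes instead.

```python
import math


def positional_encoding(position, d_model):
    """Sinusoidal encoding for one position: sine on even indices, cosine on odd."""
    encoding = []
    for i in range(d_model):
        # Each sine/cosine pair shares one frequency.
        angle = position / (10000 ** ((2 * (i // 2)) / d_model))
        encoding.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return encoding


def add_position(embeddings):
    """Add the positional encoding to each token embedding in a sequence."""
    d_model = len(embeddings[0])
    return [
        [e + p for e, p in zip(vector, positional_encoding(pos, d_model))]
        for pos, vector in enumerate(embeddings)
    ]


# Two identical toy embeddings become distinguishable once position is added.
tokens = [[0.5, 0.5, 0.5, 0.5], [0.5, 0.5, 0.5, 0.5]]
encoded = add_position(tokens)
print(encoded[0])
print(encoded[1])
```

Because the same word at two positions now maps to two different vectors, attention can tell a key from the value that follows it.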

The combination of attention mechanisms and positional encodings is vital for a large language model's capability to recognize a structure as tabular, considering its content, spacing, and text markers. This skill sets it apart from other unstructured data extraction techniques.

Current Trends and Developments 

The AI space unfolds with promising trends and developments, reshaping the way we extract information from unstructured data. Let's delve into the key facets shaping the future of this field. 

Advancements in Large Language Models (LLMs) 

Generative AI is witnessing a transformative phase, with LLMs taking center stage in handling complex and diverse datasets for unstructured data extraction. Two notable strategies are propelling these advancements: 

  1. Multimodal Learning: LLMs are expanding their capabilities by simultaneously processing various types of data, including text, images, and audio. This development enhances their ability to extract valuable information from diverse sources, increasing their utility in unstructured data extraction. Researchers are exploring efficient ways to use these models, aiming to eliminate the need for GPUs and enable the operation of large models with limited resources.
  2. RAG Applications: Retrieval Augmented Generation (RAG) is an emerging trend that combines large pre-trained language models with external search mechanisms to enhance their capabilities. By accessing a vast corpus of documents during the generation process, RAG transforms basic language models into dynamic tools tailored for both business and consumer applications.
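The retrieval step of RAG can be sketched without any model at all. The example below is a toy illustration, assuming word overlap as the relevance score and a three-document corpus invented for the purpose; production systems typically use vector embeddings and a search index, and would send the resulting prompt to an LLM.

```python
import re


def tokenize(text):
    """Lowercase word tokens; a deliberately simple stand-in for real embeddings."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question, corpus, top_k=1):
    """Rank documents by word overlap with the question and keep the best top_k."""
    q_tokens = tokenize(question)
    ranked = sorted(corpus, key=lambda doc: len(q_tokens & tokenize(doc)), reverse=True)
    return ranked[:top_k]


def build_rag_prompt(question, corpus, top_k=1):
    """Prepend the retrieved passages so the model can ground its answer in them."""
    context = "\n".join(retrieve(question, corpus, top_k))
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer using only the context."


# Hypothetical document snippets, invented for illustration.
corpus = [
    "Invoice INV-1042 from Acme Supplies is due on 2024-03-15.",
    "The warehouse in Austin received 40 pallets last week.",
    "Employee onboarding takes place every Monday morning.",
]

question = "When is the invoice from Acme Supplies due?"
prompt = build_rag_prompt(question, corpus)
print(prompt)
```

Only the best-matching passage reaches the prompt, which keeps the context short and lets the model answer from documents it was never trained on.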

Evaluating LLM Performance 

The challenge of evaluating LLMs' performance is met with a strategic approach, incorporating task-specific metrics and innovative evaluation methodologies. Key developments in this space include: 

  1. Fine-tuned metrics: Tailored evaluation metrics are emerging to assess the quality of information extraction tasks. Precision, recall, and F1-score metrics are proving effective, particularly in tasks like entity extraction.
  2. Human Evaluation: Human assessment remains pivotal alongside automated metrics, ensuring a comprehensive evaluation of LLMs. Integrating automated metrics with human judgment, hybrid evaluation methods offer a nuanced view of contextual correctness and relevance in extracted information.
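Precision, recall, and F1 for an entity extraction task can be computed as in the sketch below. It assumes exact string matching between predicted and expected entities, which is one simple scoring choice among several; the vendor names are hypothetical.

```python
def extraction_scores(predicted, expected):
    """Precision, recall, and F1 over sets of extracted entities."""
    predicted, expected = set(predicted), set(expected)
    true_positives = len(predicted & expected)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(expected) if expected else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical gold labels and model output for one document.
expected = {"Acme Supplies", "Northwind Traders", "Globex"}
predicted = {"Acme Supplies", "Globex", "Initech"}

precision, recall, f1 = extraction_scores(predicted, expected)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Precision penalizes the spurious entity, recall penalizes the missed one, and F1 is their harmonic mean.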

Image and Document Processing  

Multimodal LLMs are increasingly taking over work once done by standalone OCR. Users can convert scanned text from images and documents into machine-readable text, with the ability to identify and extract information directly from visual content using vision-based modules.

Data Extraction from Links and Websites 

LLMs are evolving to meet the increasing demand for data extraction from websites and web links. These models are increasingly adept at web scraping, converting data from web pages into structured formats. This trend is invaluable for tasks like news aggregation, e-commerce data collection, and competitive intelligence, enhancing contextual understanding and extracting relational data from the web.

The Rise of Small Giants in Generative AI 

The first half of 2023 saw a focus on developing huge language models based on the “bigger is better” assumption. Yet recent results show that smaller models like TinyLlama and Dolly-v2-3B, with fewer than 3 billion parameters, can perform well in tasks like reasoning and summarization, earning them the title of “small giants.” These models use less compute power and storage, making AI more accessible to smaller companies without the need for expensive GPUs.


Early generative AI models, including generative adversarial networks (GANs) and variational autoencoders (VAEs), introduced novel approaches for managing image-based data. However, the real breakthrough came with transformer-based large language models. These models surpassed all prior techniques in unstructured data processing owing to their encoder-decoder structure, self-attention, and multi-head attention mechanisms, granting them a deep understanding of language and enabling human-like reasoning capabilities.

While generative AI offers a promising start to mining textual data from reports, the scalability of such approaches is limited. Initial steps often involve OCR processing, which can introduce errors, and extracting text from images embedded in reports remains a challenge.

Multimodal data processing and the extended token limits in models such as GPT-4, Claude 3, and Gemini offer a promising path forward. However, it's important to note that these models are accessible solely through APIs. While using APIs for data extraction from documents is both effective and cost-efficient, it comes with its own set of limitations, such as latency, limited control, and security risks.

A more secure and customizable solution lies in fine-tuning an in-house LLM. This approach not only mitigates data privacy and security concerns but also enhances control over the data extraction process. Fine-tuning an LLM for document layout understanding and for grasping the meaning of text based on its context offers a robust method for extracting key-value pairs and line items. Leveraging zero-shot and few-shot learning, a fine-tuned model can adapt to diverse document layouts, ensuring efficient and accurate unstructured data extraction across various domains.
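Few-shot prompting for key-value extraction can be sketched as follows. This is an assumed prompt format, not a prescribed one: the `build_few_shot_prompt` helper, the field names, and the sample documents are hypothetical, and the resulting string would be sent to whichever in-house model is in use. Passing an empty example list gives the zero-shot case.

```python
import json


def build_few_shot_prompt(examples, document_text):
    """Show the model a few worked (document, fields) pairs before the new document."""
    parts = ["Extract the key-value pairs from each document as JSON."]
    for doc, fields in examples:
        parts.append(f"Document:\n{doc}\nFields:\n{json.dumps(fields, sort_keys=True)}")
    # The final block leaves the fields blank for the model to fill in.
    parts.append(f"Document:\n{document_text}\nFields:")
    return "\n\n".join(parts)


# Hypothetical labeled examples with two different layouts.
examples = [
    ("Invoice No: INV-1042 | Total: $1,250.00", {"invoice_number": "INV-1042", "total": "1250.00"}),
    ("Amount Due 980.00 USD, Ref INV-1043", {"invoice_number": "INV-1043", "total": "980.00"}),
]

prompt = build_few_shot_prompt(examples, "Bill #INV-1044, balance owed: $415.20")
print(prompt)
```

Showing examples with different layouts gives the model a chance to generalize to a third layout it has not seen.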

Jay Mishra, COO at Astera, a leading provider of no-code data solutions, is a seasoned data and analytics leader with 20+ years of experience driving transformative strategies to empower organizations through AI-powered data solutions.