
Thursday, August 29, 2024

Insights and Solutions for Analyzing and Classifying Large-Scale Data Records (Tens of Thousands of Excel Entries) Using LLM and GenAI Tools

Traditional software tools are often a poor fit for complex, one-off, or infrequent tasks, where developing a dedicated solution is impractical. Excel scripts or similar tools can be used, but writing them requires data insights that only emerge from thorough analysis of the data itself, a circular dependency that makes it hard to quickly code a script for the task.

As a result, using GenAI tools to analyze, classify, and label large datasets, followed by rapid modeling and analysis, becomes a highly effective choice.

In an experiment, we used GPT-4o to address this problem. The task must be broken into multiple small steps and completed progressively: when categorizing and analyzing data for modeling, decompose complex tasks into simpler ones and use AI to assist with each step in turn.

The following solution and practice guide outlines a detailed process for effectively categorizing these data descriptions:

1. Preparation and Preliminary Processing

Export the Excel file as a CSV: Retain only the fields relevant to classification, such as serial number, name, description, display volume, click volume, and other foundational fields needed for modeling. Since large language models (LLMs) work best with plain text and have limited context windows, keeping only the necessary information improves processing efficiency.

If the data format or column meanings are unclear (e.g., column names do not match their actual contents), clean the data manually and make sure every record has a unique ID so that subsequent classification results can be mapped back correctly.
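A minimal sketch of this preparation step in Python with pandas, assuming a hypothetical workbook named parts.xlsx and the column names listed above (adjust both to your actual schema):

```python
import pandas as pd

# Load the workbook and keep only the fields relevant to classification.
# The file name and column names are assumptions -- adjust to your schema.
df = pd.read_excel("parts.xlsx")
df = df[["serial_number", "name", "description", "display_volume", "click_volume"]]

# Ensure every record carries a unique ID so classification results
# can be mapped back to the original rows later.
if df["serial_number"].duplicated().any():
    df["record_id"] = range(1, len(df) + 1)
else:
    df["record_id"] = df["serial_number"]

df.to_csv("parts.csv", index=False)
```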

2. Data Splitting

Split the large CSV file into multiple smaller files: Given the context window limits and the higher error rate on long inputs, it is recommended to split large files into smaller ones for processing. AI can help write a program for this, with the number of records per file determined experimentally.
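For example, a splitting script might look like the following sketch; the chunk size of 100 records is a placeholder to be tuned against the model's context window and observed error rate:

```python
import pandas as pd

CHUNK_SIZE = 100  # records per file; tune experimentally

df = pd.read_csv("parts.csv")
for i, start in enumerate(range(0, len(df), CHUNK_SIZE)):
    # Write each slice of rows to its own numbered CSV file.
    df.iloc[start:start + CHUNK_SIZE].to_csv(f"parts_chunk_{i:04d}.csv", index=False)
```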

3. Prompt Creation

Define the classification and data structure: Predefine the parts categories and the output data structure, for instance in JSON format, so that downstream programs can parse and process the results easily.

Draft a prompt: AI can assist in generating the category list, data structure definitions, and example prompts. The prompt should take part IDs and descriptions as input and return classification results in JSON format.
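One possible prompt template, with hypothetical category names and a JSON output schema that preserves record_id so results can be joined back to the source rows:

```python
# The category list below is illustrative -- replace it with your own taxonomy.
PROMPT_TEMPLATE = """You are a parts-classification assistant.
Classify each part below into exactly one category from this list:
["fastener", "bearing", "electrical", "hydraulic", "other"]

Return ONLY a JSON array, one object per part, in exactly this form:
[{{"record_id": "<id>", "category": "<category>"}}]

Parts (one per line, formatted as "record_id: description"):
{parts_list}
"""
```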

4. Programmatically Calling LLM API

Write a program to call the API: If the user has programming skills, they can write a program that performs the following functions (a sketch follows this list):

  • Read and parse the contents of the small CSV files.
  • Call the LLM API and pass in the optimized prompt with the parts list.
  • Parse the API’s response to obtain the correlation between part IDs and classifications, and save it to a new CSV file.
  • Loop over files: process all the split CSV files in a loop until classification and analysis are complete.
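A sketch of such a program, reusing PROMPT_TEMPLATE from step 3 and the OpenAI Python SDK as one possible client; the model name and the fence-stripping heuristic are assumptions, and production code should add retries and validation of the returned JSON:

```python
import glob
import json
import pandas as pd
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def classify_chunk(path: str) -> pd.DataFrame:
    """Classify one split CSV file and return (record_id, category) rows."""
    df = pd.read_csv(path)
    parts_list = "\n".join(
        f"{row.record_id}: {row.description}" for row in df.itertuples()
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # model choice is an assumption
        messages=[{"role": "user",
                   "content": PROMPT_TEMPLATE.format(parts_list=parts_list)}],
    )
    text = response.choices[0].message.content.strip()
    if text.startswith("```"):  # strip markdown fences some models add
        text = text.split("\n", 1)[1].rsplit("```", 1)[0]
    return pd.DataFrame(json.loads(text))

# Loop over all split files until every chunk has been classified.
for path in sorted(glob.glob("parts_chunk_*.csv")):
    out = path.replace("parts_chunk_", "classified_")
    classify_chunk(path).to_csv(out, index=False)
```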

5. File Merging

Merge all classified CSV files: The final step is to merge all generated CSV files with classification results into a complete file and import it back into Excel.
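Continuing the same hypothetical file naming, the merge can be a few lines of pandas:

```python
import glob
import pandas as pd

# Combine the per-chunk results, join them back to the original data on
# record_id, and export a single CSV that Excel can open directly.
classified = pd.concat(
    (pd.read_csv(p) for p in sorted(glob.glob("classified_*.csv"))),
    ignore_index=True,
)
merged = pd.read_csv("parts.csv").merge(classified, on="record_id", how="left")
merged.to_csv("parts_classified_full.csv", index=False)
```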

Solution Constraints and Limitations

Where the solution's limitations constrain the modeling objectives, revise the prompts describing your columns and data, and iterate on prompt construction until the results meet the modeling goals.

Important Considerations:

  • LLM Context Window Length: The LLM’s context window is limited, making it impossible to process large volumes of records at once, necessitating file splitting.
  • Model Understanding Ability: Given that the task involves classifying complex and granular descriptions, the LLM may not accurately understand and categorize all information, requiring human-AI collaboration.
  • Need for Human Intervention: While AI offers significant assistance, the final classification results still require manual review to ensure accuracy.

By breaking complex tasks into simple sub-tasks and combining human review with AI assistance, efficient classification can be achieved. This approach not only improves classification accuracy but also makes effective use of existing AI capabilities, avoiding the errors that can arise from processing large volumes of data in one pass.

The preprocessing, data splitting, prompt design, and API-calling programs can all be drafted with the help of AI chatbots such as ChatGPT and Claude. Novices should start with basic data processing, gradually master prompt writing and API calls, and optimize each step through experimentation.
