What if the LLM is the ultimate data janitor?
This week, I will cover why I think data janitor work is dying, why companies built on top of data janitor work could be ripe for disruption through LLMs, and what to do about it.
A data janitor is a person who works to take big data and condense it into useful amounts of information. Also known as a "data wrangler", a data janitor sifts through data for companies in the information technology industry.
If this phrase/definition does not take you back to the 2010s, I do not know what would.
This post has three sections; I want to expand each section into its own post in a lot more detail. Usual programming will resume next week. Let me know what you think about these commentaries in the comments section or through email.
Schema-Free Learning: why we no longer need schemas in the data, or separate cleaning steps, for learning. This does not mean that data quality is unimportant; data cleaning will still be crucial, but data in a schema/table is no longer a requirement or prerequisite for learning and analytics purposes.
Analytics/Answers are included (batteries included in LLMs): When consuming the data after the data janitor work, we no longer have to depend on tables, spreadsheets, or your other favorite analytics tool to massage and format the dataset into the decks/presentations you use to communicate insights and learnings. This could even mean that the traditional analytics tooling itself is ripe for disruption.
One LLM to Rule Them All?: No, not really. Data quality, preprocessing, weighting and even sampling are still important, but cleaning structured data out of natural language is no longer necessary in the world of LLMs. If you were previously using various taggers to tag different entities, learn what each entity is about, and learn new relations in the entity space, an LLM can do it all without needing separate systems for any of that processing. However, you still need data, accurate data, and guardrails so that when the LLM does not have the facts, it does not make things up (so-called hallucination).
Enter the Big Data Era
Prior to 2020, and specifically in the 2010s, there was “big data”. This era laid out the foundations of the datasets that we use; think Spark, Hadoop, MapReduce, Kafka, MongoDB {insert your favorite streaming/batching data solution}; good old data-heavy times. Big data covered ML capabilities as well, but it was a different era of ML: it definitely did not cover deep learning capabilities, and LLMs were not a thing back then.
This created multiple data infrastructure companies (Databricks, Looker → Google, Cloudera, Confluent) as well as data-based businesses that aggregate data in a structured manner and sell it to various customers. I will go over this in more detail and show different categories of companies in the second post.
Schema-Free Learning
We have moved away from schemas in datasets. When we think about the quality of a dataset, what matters is no longer the schema, or how structured the data is, but whether the data itself is factual and accurate and whether the labels that come with it are good.
Prior to the 2020s, schemas, and how datasets complied with them, were very important for a variety of purposes. Companies wanted to index the data (for search engines), and they wanted to understand the relationships between entities in order to learn facts about those entities or form new relationships. Schema.org was established by tech companies to provide a schema framework so that websites could use it to structure the information on a web page.
Before LLMs, there was Never-Ending Language Learning (NELL). This program was similar in spirit to what modern LLMs are doing, but of course it predated the technological breakthroughs that brought us closer to the program's goal/mission. Back then, the structure of the information, the graph of entities, how they related to each other, and what types of relations existed inside the graph were deemed crucial, as most of the language learning models were still schema-based.
The machinery for learning among entities still required entities to be separated out and structured in a much more schema-heavy manner, as the machine learning model would exploit the graph structure and the entities to build relationships between them.
For these and other efforts, data traditionally had to be meticulously structured and organized into predefined schemas before it could be analyzed or used for machine learning purposes. This process often involved complex data cleaning and wrangling tasks, such as the following (see the sketch after this list):
Data Parsing: Extracting data from various sources (e.g., logs, documents, databases) and converting it into a structured format.
Data Transformation: Applying rules and functions to transform data into a consistent and usable format (e.g., date formatting, unit conversions, data type conversions).
Data Validation: Ensuring data adheres to predefined rules and constraints, identifying and handling missing values, outliers, and inconsistencies.
Data Normalization: Scaling and transforming data to a common range or distribution to ensure compatibility across different data sources and models.
Data Deduplication: Identifying and removing duplicate records or entries to maintain data integrity and consistency.
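To make the list above concrete, here is a minimal sketch of what those steps typically looked like with pandas. The file name, column names, and currency rates are all hypothetical, not taken from the original post.

```python
import pandas as pd

# Data Parsing: read a hypothetical semi-structured export into a DataFrame
df = pd.read_csv("transactions.csv")  # assumed columns: date, amount, currency, customer_id

# Data Transformation: consistent types and units
df["date"] = pd.to_datetime(df["date"], errors="coerce")
df["amount"] = pd.to_numeric(df["amount"], errors="coerce")
usd_rate = {"USD": 1.0, "EUR": 1.08, "GBP": 1.27}  # assumed static conversion rates
df["amount_usd"] = df["amount"] * df["currency"].map(usd_rate)

# Data Validation: enforce basic constraints, drop rows that violate them
df = df.dropna(subset=["date", "amount_usd", "customer_id"])
df = df[df["amount_usd"] >= 0]

# Data Normalization: scale to a common distribution for downstream models
df["amount_scaled"] = (df["amount_usd"] - df["amount_usd"].mean()) / df["amount_usd"].std()

# Data Deduplication: remove exact duplicate records
df = df.drop_duplicates(subset=["customer_id", "date", "amount_usd"])
```

Every one of these lines is work that exists only to force messy input into a shape the downstream tooling can accept.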
Because of that, the companies that built various data-wrangling businesses thrived in this era: the data was available, and the tools to ingest it and apply these transformations (Hadoop, Spark, Kafka, etc.) were also created as part of this effort.
However, with the advent of LLMs, we can now leverage their ability to understand and process unstructured data in its raw form, bypassing the need for extensive data cleaning and schema-fitting.
An LLM of course does not need any of these schemas, because it is able to “understand” natural language as people actually use it, in free-text form; if the text is available on the open web, the model can learn the information from the natural language itself. LLMs can be trained on vast amounts of textual data, enabling them to comprehend and extract insights from unstructured sources such as emails, documents, and social media posts without the need for data normalization, transformation, or parsing.

For example, an LLM trained on a large corpus of financial reports and news articles could analyze unstructured data sources without extensive data cleaning or transformation, identifying and extracting relevant financial metrics, trends, and insights directly from the raw text and eliminating the need for complex parsing and transformation pipelines. While data quality remains crucial, rigid schemas are no longer a prerequisite for learning and analytics. LLMs can effectively process and make sense of data in its natural, unstructured form, opening up new possibilities for data-driven insights and decision-making.
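As a rough illustration of that financial-report example, here is a minimal sketch of schema-free extraction, assuming the OpenAI Python client; the model name, the report snippet, and the requested JSON keys are my own illustrative assumptions, not anything specific from the post.

```python
from openai import OpenAI

client = OpenAI()

raw_text = """
Acme Corp reported Q3 revenue of $4.2B, up 11% year over year, while operating
margin contracted to 18.5% on higher logistics costs. Management guided Q4
revenue to $4.4B-$4.6B.
"""

response = client.chat.completions.create(
    model="gpt-4o",  # any capable chat model would do
    messages=[
        {"role": "system",
         "content": "Extract financial metrics from the text as JSON with keys "
                    "metric, value, period. Do not invent numbers."},
        {"role": "user", "content": raw_text},
    ],
    response_format={"type": "json_object"},
)

# Raw prose goes in; structured metrics come out, with no parser, schema, or ETL job in between.
print(response.choices[0].message.content)
```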
Analytics/Answers are included (batteries included in LLMs)
Traditional data analysis often involved a complex workflow, starting with extracting data from various sources, followed by cleaning and transforming it using specialized tools and scripts. This workflow typically included tasks such as the following (see the end-to-end sketch after this list):
Data Extraction: Retrieving data from databases, flat files, APIs, or other sources using SQL queries, scripting languages (e.g., Python, R), or specialized ETL (Extract, Transform, Load) tools.
Data Cleaning: Handling missing values, removing duplicates, correcting inconsistencies, and ensuring data integrity using techniques like imputation, deduplication, and data profiling.
Data Transformation: Converting data into a format suitable for analysis, such as pivoting, merging, or reshaping data using tools like pandas or dplyr.
Data Visualization: Creating visualizations like charts, graphs, and dashboards using specialized tools like Tableau, Power BI, or Python libraries like Matplotlib or Plotly.
Reporting and Presentation: Generating reports, presentations, or interactive dashboards to communicate insights and findings to stakeholders.
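For contrast with the LLM workflow described next, here is a minimal sketch of that traditional extract-clean-transform-visualize-report chain, assuming a local SQLite file and table whose names and columns are entirely illustrative.

```python
import sqlite3
import pandas as pd
import matplotlib.pyplot as plt

# Data Extraction: pull rows with SQL from an assumed sales.db
conn = sqlite3.connect("sales.db")
df = pd.read_sql_query("SELECT region, month, revenue FROM sales", conn)

# Data Cleaning: drop incomplete rows and duplicates
df = df.dropna().drop_duplicates()

# Data Transformation: reshape for analysis
pivot = df.pivot_table(index="month", columns="region", values="revenue", aggfunc="sum")

# Data Visualization: chart for the deck
pivot.plot(kind="line", title="Monthly revenue by region")
plt.savefig("revenue_by_region.png")

# Reporting and Presentation: a stakeholder-facing summary table
pivot.describe().to_csv("revenue_summary.csv")
```

Several tools, several handoffs, and every step assumes the previous one produced exactly the right shape of data.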
With LLMs, however, the entire process can be streamlined and simplified. These models are not only capable of understanding and processing unstructured data but can also generate human-readable insights, summaries, and even visualizations directly from the raw data sources.

For example, an LLM trained on a large corpus of customer feedback data could analyze and summarize customer sentiments, identify common pain points or areas of satisfaction, and even generate visualizations or reports highlighting key findings, all without separate data cleaning, transformation, or visualization tools. Recently, OpenAI has made a number of announcements about its data analysis and reporting capabilities, which help the LLM create digestible, concise information and reports.
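As a sketch of that customer-feedback example, the same kind of request can go straight to the model. Again, the feedback snippets, model name, and requested report shape are assumptions for illustration, using the OpenAI Python client.

```python
from openai import OpenAI

client = OpenAI()

feedback = [
    "Checkout kept timing out on mobile, gave up after three tries.",
    "Love the new dashboard, but exports to CSV are still broken.",
    "Support resolved my billing issue in under an hour. Impressed!",
]

prompt = (
    "Analyze the customer feedback below. Return a short report with overall sentiment, "
    "top pain points, top positives, and a suggested bar chart (category labels and counts) "
    "summarizing issue categories.\n\n" + "\n".join(f"- {item}" for item in feedback)
)

response = client.chat.completions.create(
    model="gpt-4o",  # any capable chat model would do
    messages=[{"role": "user", "content": prompt}],
)

# The same raw feedback that used to need cleaning, tagging, and BI tooling
# comes back as a ready-to-share summary plus a chart specification.
print(response.choices[0].message.content)
```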
This part is currently a bit undersold and under-emphasized, but I believe it will become a lot more important as the models themselves become more capable. Especially when numerical accuracy matters, an LLM can learn to call a calculator instead of doing the computation itself, saving precious GPU resources and using the host machine's CPU cycles for the arithmetic. By doing so, it increases its numerical accuracy, and its analytics capabilities become much more reliable. There are already LLMs doing this today.
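Here is a minimal sketch of that calculator pattern using tool calling, assuming the OpenAI Python client; the tool schema, model name, and toy expression evaluator are my own assumptions.

```python
import json
from openai import OpenAI

client = OpenAI()

# Describe a calculator tool the model can choose to call instead of doing arithmetic itself.
tools = [{
    "type": "function",
    "function": {
        "name": "calculator",
        "description": "Evaluate a basic arithmetic expression exactly.",
        "parameters": {
            "type": "object",
            "properties": {"expression": {"type": "string"}},
            "required": ["expression"],
        },
    },
}]

messages = [{"role": "user", "content": "What is 1284.75 * 12 - 3991.2?"}]

response = client.chat.completions.create(model="gpt-4o", messages=messages, tools=tools)
call = response.choices[0].message.tool_calls[0]
args = json.loads(call.function.arguments)

# The host CPU does the exact arithmetic; a toy eval stands in for a real expression parser.
result = eval(args["expression"], {"__builtins__": {}})

# Hand the result back so the model can phrase the final answer.
messages.append(response.choices[0].message)
messages.append({"role": "tool", "tool_call_id": call.id, "content": str(result)})
final = client.chat.completions.create(model="gpt-4o", messages=messages, tools=tools)
print(final.choices[0].message.content)
```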
One LLM to Rule Them All? Not Quite (yet)
While LLMs offer unprecedented capabilities in processing unstructured data and generating insights, it would be an oversimplification to claim that a single LLM can address all data-related challenges. Data quality, preprocessing, weighting, and sampling remain crucial aspects that require careful consideration and domain expertise.

LLMs excel at understanding and processing natural language data, but they may struggle with highly specialized or technical domains where precise terminology and domain-specific knowledge are essential. For example, in the field of bioinformatics, where data often consists of complex genomic sequences and annotations, additional data preprocessing and feature engineering may still be necessary to ensure accurate and reliable results.

Also, while LLMs can generate human-readable insights and summaries, they are not immune to the risk of hallucination: the tendency to generate plausible-sounding but factually incorrect information when faced with gaps in their knowledge. To mitigate this risk, one still has to implement robust data quality checks, fact-checking mechanisms, and guardrails to ensure the integrity and accuracy of the LLM's outputs. For instance, in the financial domain, where accurate and reliable data is critical for decision-making, organizations may need to implement additional checks and validation processes to ensure that the LLM's outputs align with regulatory requirements, accounting standards, and industry best practices.
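One very simple guardrail along those lines: before accepting an LLM-written summary, check that every number it quotes actually appears in the source material. This is a minimal sketch under my own assumptions (a regex over numeric tokens); real systems would layer on fact-checking and domain-specific validation.

```python
import re

def extract_numbers(text: str) -> set[str]:
    """Pull numeric tokens (commas stripped) out of free text."""
    return {m.replace(",", "") for m in re.findall(r"\d[\d,]*\.?\d*", text)}

def grounded(summary: str, source: str) -> bool:
    """Reject summaries that cite figures absent from the source material."""
    unsupported = extract_numbers(summary) - extract_numbers(source)
    return not unsupported

source_report = "Q3 revenue was $4.2B with an operating margin of 18.5%."
llm_summary = "Revenue reached $4.2B; margin expanded to 21%."

print(grounded(llm_summary, source_report))  # False: flag for review instead of publishing
```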
These capabilities are going to take time, but they will arrive at some point in the future, and at that point one LLM may indeed rule them all.