Eight Things to Know about Large Language Models
SELFRec, Pipeline RL, TRL, FramePack, KGAT, Dataherald
Articles
Samuel Bowman has a rather interesting paper discussing some of the properties of LLMs and the future direction of the area. He argues that the following eight things about LLMs are true:
LLMs predictably get more capable with increasing investment, even without targeted innovation.
Many important LLM behaviors emerge unpredictably as a byproduct of increasing investment.
LLMs often appear to learn and use representations of the outside world.
There are no reliable techniques for steering the behavior of LLMs.
Experts are not yet able to interpret the inner workings of LLMs.
Human performance on a task isn’t an upper bound on LLM performance.
LLMs need not express the values of their creators nor the values encoded in web text.
Brief interactions with LLMs are often misleading.
I will expand on each of these points in the following sections:
1. LLMs Predictably Get More Capable With Increasing Investment, Even Without Targeted Innovation
A foundational insight is that the capabilities of LLMs improve predictably as a function of scale, measured along three primary dimensions:
Model size (number of parameters)
Training data volume
Compute used for training (FLOPs)
This relationship is formalized in scaling-law papers, which show that as these dimensions increase, the model's performance on a broad range of language tasks improves smoothly and predictably, often following power-law trends.
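As a rough illustration, the sketch below uses a Chinchilla-style parametric form (loss as a sum of power-law terms in parameters and tokens). The functional form follows the scaling-law literature, but the coefficient values here are illustrative placeholders rather than fitted numbers from any specific paper.

```python
# A minimal sketch of a Chinchilla-style scaling law: predicted pretraining
# loss as a function of model size N and training tokens D. The coefficients
# are illustrative placeholders, not fitted values from any specific paper.

def predicted_loss(n_params: float, n_tokens: float,
                   E: float = 1.7, A: float = 400.0, B: float = 410.0,
                   alpha: float = 0.34, beta: float = 0.28) -> float:
    """Irreducible loss E plus power-law terms that shrink with N and D."""
    return E + A / n_params ** alpha + B / n_tokens ** beta

# Scaling model size and data together lowers the predicted loss smoothly,
# which is what makes performance forecastable before an expensive run.
for n, d in [(1e9, 20e9), (10e9, 200e9), (100e9, 2e12)]:
    print(f"N={n:.0e}, D={d:.0e} -> predicted loss ~ {predicted_loss(n, d):.3f}")
```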
For example, the progression from GPT to GPT-2 to GPT-3 involved relatively minor architectural changes but massive increases in training compute (up to 20,000× more for GPT-3) and data, resulting in qualitative leaps in capability. GPT-4 continued this trend, outperforming humans on many professional exams.
This predictable scaling gives LLMs the following properties:
Predictability: Scaling laws allow researchers to estimate performance improvements before training expensive models, reducing trial-and-error.
Economic justification: The ability to forecast returns on investment has driven multi-billion-dollar funding rounds.
Limited innovation needed: Most gains come from investing more compute/data rather than fundamentally new architectures or training algorithms.
2. Many Important LLM Behaviors Emerge Unpredictably as a Byproduct of Increasing Investment
While overall performance improves predictably, specific capabilities often emerge abruptly and unpredictably once the model crosses certain scale thresholds. This phenomenon is known as emergent abilities. Some of the emergent abilities demonstrated by LLMs are:
Few-shot learning: The ability to perform new tasks from just a few examples in the prompt, which was not present in smaller models.
Chain-of-thought reasoning: The capacity to generate step-by-step reasoning, improving performance on complex tasks (see the prompt sketch below).
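For a concrete sense of what these look like in practice, here are two hypothetical prompts; the task wording and examples are my own illustration, not taken from the paper.

```python
# Hypothetical prompts illustrating the two emergent abilities above.

# Few-shot learning: the task is specified purely through in-context examples.
few_shot_prompt = """Translate English to French.
sea otter -> loutre de mer
cheese -> fromage
peppermint ->"""

# Chain-of-thought: asking for intermediate steps before the final answer
# tends to help sufficiently large models on multi-step problems.
cot_prompt = """Q: A cafeteria had 23 apples. It used 20 to make lunch and bought 6 more.
How many apples does it have now?
A: Let's think step by step."""
```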
These emergent behaviors have the following implications:
Uncertainty: Developers know that larger models will be better overall but cannot reliably predict which new skills will appear or when.
“Mystery box” effect: Investing in larger models is akin to buying a black box with unknown but potentially valuable new capabilities.
Planning challenges: Responsible deployment and preparation for novel capabilities require flexibility and ongoing monitoring.
3. LLMs Often Appear to Learn and Use Representations of the Outside World
Despite being trained solely to predict text on datasets compiled from various sources, LLMs develop internal representations that correspond to real-world concepts and abstractions:
Semantic representations: Models encode color concepts in ways that align with human perception.
Theory of mind: LLMs can infer what an author knows or believes and use this to predict text continuation.
Spatial and object representations: Models track properties and locations of objects in stories, sometimes representing spatial layouts.
Visual reasoning: Even without direct visual training, models like GPT-4 can generate instructions in graphics languages to draw objects.
Game state tracking: Models trained on textual game move descriptions learn internal representations of board states.
Common sense and fact-checking: LLMs can distinguish misconceptions from facts and estimate claim plausibility.
Passing reasoning tests: LLMs perform well on benchmarks like the Winograd Schema Challenge, which require commonsense reasoning beyond surface text cues (e.g., resolving which noun a pronoun refers to in "The trophy doesn't fit in the suitcase because it is too small").
These capabilities point to the following properties of LLMs:
Beyond next-word prediction: While technically LLMs predict text, their learned representations enable abstract reasoning and world modeling.
Weak but growing: These abilities are currently imperfect and sporadic but improve with scale and training innovations.
Augmentation: Integration with vision models and external tools further enhances world understanding.
4. There Are No Reliable Techniques for Steering the Behavior of LLMs
LLMs are pretrained to predict text continuations, but practical applications require them to follow instructions or behave in desired ways. Steering model behavior is challenging:
Fine-tuning and instruction tuning: Adjusting model weights on specialized datasets can improve alignment but is costly and imperfect.
Prompt engineering: Carefully crafting input prompts can guide outputs but is brittle and often unreliable.
Reinforcement learning from human feedback (RLHF): Human preferences guide model outputs but cannot guarantee consistent behavior.
Lack of interpretability: Without understanding internal mechanisms, it is hard to predict or guarantee model responses.
These limitations create the following problems for LLMs:
Unpredictability: Models can produce harmful, biased, or nonsensical outputs despite steering attempts.
Safety concerns: Deploying LLMs safely requires extensive monitoring and fallback mechanisms.
Research gap: Developing robust, reliable steering methods remains a major open challenge.
5. Experts Are Not Yet Able to Interpret the Inner Workings of LLMs
LLMs are deep neural networks with billions of parameters and highly distributed representations:
Opaque internals: The learned weights and activations do not correspond to human-understandable concepts.
Lack of interpretability tools: Current methods (e.g., attention visualization, neuron activation analysis) provide limited insight.
Complex emergent phenomena: Behaviors arise from interactions of many components, not isolated modules.
Ongoing research: Efforts in mechanistic interpretability aim to reverse-engineer model reasoning but are nascent.
This opacity creates the following problems:
Black-box nature: Understanding why a model produces a certain output is difficult.
Trust and accountability: Lack of interpretability complicates debugging, auditing, and regulatory compliance.
Safety risks: Hidden failure modes and biases may go undetected.
6. Human Performance on a Task Isn’t an Upper Bound on LLM Performance
LLMs have demonstrated superhuman performance on many benchmarks:
Standardized exams: GPT-4 outperforms average qualified humans on the bar exam, SAT, and other professional tests.
Speed and scale: LLMs can process and generate text orders of magnitude faster than humans.
Novel capabilities: Some tasks, like large-scale code generation or multi-document synthesis, exceed typical human abilities.
This has the following implications:
Revising expectations: Human benchmarks are not ceilings; LLMs can surpass human performance in many domains.
New applications: This opens possibilities for automating complex cognitive tasks.
Ethical considerations: Superhuman performance raises questions about job displacement, decision-making authority, and AI governance.
7. LLMs Need Not Express the Values of Their Creators Nor the Values Encoded in Web Text
LLMs are trained on vast datasets scraped from the internet, which contain diverse and often conflicting viewpoints, biases, and cultural values:
Value misalignment: Models may generate outputs that do not reflect the ethical or normative values of developers or society.
Bias amplification: Models can reproduce or amplify harmful stereotypes present in training data.
Steering attempts: Efforts to align models with human values (e.g., RLHF) are imperfect and context-dependent.
Value pluralism: The multiplicity of values in training data means models may express contradictory or incoherent value systems.
These properties raise the following issues:
Ethical challenges: Ensuring that LLMs behave in socially acceptable ways is complex.
Governance needs: Transparency, auditing, and stakeholder engagement are critical.
Customization: Tailoring models to domain- or community-specific values may be necessary.
8. Brief Interactions with LLMs Are Often Misleading
Users often form impressions of LLMs based on short, anecdotal interactions, which can be deceptive:
Overestimation: Early impressive outputs may lead users to overestimate model understanding or reliability.
Inconsistency: Models can produce contradictory or erroneous answers on repeated queries.
Context sensitivity: Small changes in prompts or conversation history can drastically alter outputs.
Evaluation difficulty: Measuring true model capabilities requires systematic, large-scale testing rather than isolated demos.
This has the following implications:
Caution in interpretation: Users and policymakers should avoid drawing conclusions from limited interactions.
Need for rigorous evaluation: Benchmarking and adversarial testing provide more reliable assessments.
User education: Training users to understand model limitations is essential for responsible use.
Overall, the paper offers interesting observations about LLMs along eight dimensions, highlighting both positive and negative aspects worth paying attention to.
Libraries
SELFRec is a Python framework for self-supervised recommendation (SSR) which integrates commonly used datasets and metrics, and implements many state-of-the-art SSR models. SELFRec has a lightweight architecture and provides user-friendly interfaces. It can facilitate model implementation and evaluation.
Pipeline RL is a scalable asynchronous reinforcement learning implementation with in-flight weight updates. It is designed to maximize GPU utilization while staying as on-policy as possible.
TRL is a cutting-edge library designed for post-training foundation models using advanced techniques like Supervised Fine-Tuning (SFT), Proximal Policy Optimization (PPO), and Direct Preference Optimization (DPO). Built on top of the 🤗 Transformers ecosystem, TRL supports a variety of model architectures and modalities, and can be scaled-up across various hardware setups.
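As a quick taste, here is a minimal supervised fine-tuning sketch; it assumes the SFTTrainer/SFTConfig interface from recent TRL releases, and the model and dataset names are placeholders rather than recommendations.

```python
# Minimal SFT sketch with TRL; model, dataset, and output_dir are placeholders.
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

dataset = load_dataset("trl-lib/Capybara", split="train")  # any chat/text dataset

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-0.5B",  # any causal LM checkpoint name
    train_dataset=dataset,
    args=SFTConfig(output_dir="./sft-out", max_steps=100),
)
trainer.train()
```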
FramePack is a next-frame (next-frame-section) prediction neural network structure that generates videos progressively.
FramePack compresses input contexts to a constant length so that the generation workload is invariant to video length.
FramePack can process a very large number of frames with 13B models even on laptop GPUs.
FramePack can be trained with a much larger batch size, similar to the batch size for image diffusion training.
Knowledge Graph Attention Network (KGAT) is a recommendation framework tailored to knowledge-aware personalized recommendation. Built upon the graph neural network framework, KGAT explicitly models the high-order relations in a collaborative knowledge graph to provide better recommendations with item side information.
Dataherald is a natural language-to-SQL engine built for enterprise-level question answering over relational data. It allows you to set up an API from your database that can answer questions in plain English. You can use Dataherald to:
Allow business users to get insights from the data warehouse without going through a data analyst
Enable Q+A from your production DBs inside your SaaS application
Create a ChatGPT plug-in from your proprietary data
NeuralNote is an audio plugin that brings state-of-the-art audio-to-MIDI conversion into your favorite Digital Audio Workstation.
Works with any tonal instrument (voice included)
Supports polyphonic transcription
Supports pitch bend detection
Lightweight and very fast transcription
Lets you adjust the parameters while listening to the transcription
Lets you scale and time-quantize transcribed MIDI directly in the plugin
AudioCraft is a PyTorch library for deep learning research on audio generation. AudioCraft contains inference and training code for two state-of-the-art AI generative models producing high-quality audio: AudioGen and MusicGen.
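For example, generating a short clip with MusicGen looks roughly like the sketch below; it follows AudioCraft's documented MusicGen interface, with the checkpoint name and prompt as placeholders.

```python
# Small MusicGen sketch; checkpoint name and prompt are placeholders.
from audiocraft.models import MusicGen
from audiocraft.data.audio import audio_write

model = MusicGen.get_pretrained("facebook/musicgen-small")
model.set_generation_params(duration=8)  # seconds of audio to generate
wav = model.generate(["lo-fi piano with soft rain in the background"])

# One waveform per prompt; write it to disk with loudness normalization.
audio_write("sample", wav[0].cpu(), model.sample_rate, strategy="loudness")
```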
mlx-audio is a text-to-speech (TTS) and Speech-to-Speech (STS) library built on Apple's MLX framework, providing efficient speech synthesis on Apple Silicon.
Below The Fold
Posting is an HTTP client, not unlike Postman and Insomnia. As a TUI application, it can be used over SSH and enables efficient keyboard-centric workflows. Your requests are stored locally in simple YAML files, so they're easy to read and version control.
With CubeCL, you can program your GPU using Rust, taking advantage of zero-cost abstractions to develop maintainable, flexible, and efficient compute kernels. CubeCL currently fully supports functions, generics, and structs, with partial support for traits, methods and type inference. As the project evolves, we anticipate even broader support for Rust language primitives, all while maintaining optimal performance.
LazyDocker is a simple terminal UI for both docker and docker-compose, written in Go with the gocui library.
Prefect is a workflow orchestration framework for building data pipelines in Python. It's the simplest way to elevate a script into a production workflow. With Prefect, you can build resilient, dynamic data pipelines that react to the world around them and recover from unexpected changes.
With just a few lines of code, data teams can confidently automate any data process with features such as scheduling, caching, retries, and event-based automations.
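For instance, a tiny flow with retries might look like the following sketch; the task names and schedule are illustrative, using Prefect's standard @task/@flow decorators.

```python
# Tiny Prefect flow sketch; names and settings are illustrative.
from prefect import flow, task

@task(retries=3, retry_delay_seconds=10)
def extract() -> list[int]:
    return [1, 2, 3]

@task
def load(rows: list[int]) -> None:
    print(f"loaded {len(rows)} rows")

@flow(log_prints=True)
def etl() -> None:
    load(extract())

if __name__ == "__main__":
    etl()  # run once locally
    # etl.serve(name="nightly-etl", cron="0 2 * * *")  # or serve on a schedule
```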
DrawDB is a robust and user-friendly database entity relationship (DBER) editor right in your browser. Build diagrams with a few clicks, export SQL scripts, customize your editor, and more without creating an account. See the full set of features here.
Index is the SOTA open-source browser agent for autonomously executing complex tasks on the web.
UIT is a library for performant, modular, low-memory file processing at scale, in the Cloud. It works by offering a 4-step process to gather a file hierarchy from any desired modality, apply filters and transformations, and output it in any desired modality.
Performance: speed is of the essence when navigating and searching through large amounts of data.
Low memory: by applying streaming and parallelization, it can run in low-memory environments such as Cloudflare Workers.
Modular: composability gives a clear high-level overview of all building blocks; also, not all building blocks can be run in the same runtime or location.