LLMs can’t replace expert qualitative research

Published

January 5, 2025

I notice that there is a belief among folks who preside over qualitative research that large language models (such as Llama and GPT-4o)1 will be able to replace the work done by qualitative researchers. Echoing back to the (unfulfilled) promises of quantitative data mining, there are folks who believe that LLMs will not just be able to perform at parity with expert qualitative researchers in the synthesis of non-numeric information, but that they will be able to uncover deeper things. That they’ll find patterns that our feeble cranial blobs are simply too unsophisticated to detect. And with these new, deep insights, we’ll be able to have unprecedented success and make several zillion dollars.

However, those who have actually worked with LLMs know that out-of-the-box solutions rarely perform as well as even non-novices in crystallizing themes from textual responses provided by users, survey respondents, etc. What they can do is perform passable, surface-level analyses faster than any human can2. But anyone who thinks that they, in their current form, can replace trained qualitative researchers is wrong for three reasons.

1. Most human-generated text is incredibly banal

Unless you’re generating data from bona fide experts in whatever domain you’re studying, I can promise you that the vast majority of texts that humans create do not lend themselves to sophisticated analyses. This sounds like it can get Carlin-esque pretty quickly3, but it’s not because your average Joes and Janes are dumb—it’s because non-experts usually don’t approach tasks in the same way as experts do. Look at how a critic breaks down a movie compared to the average Letterboxd or Rotten Tomatoes user. And now remember that even these “average” folks are sufficiently motivated and interested in the topic to write anything at all. They’re actually not average—the true average user writes nothing at all.

I’m not saying that critics are always right or anything; this isn’t a matter of taste. But critics are often more verbose and expose4 more involved thought processes than average people do. A lot of it comes down to connections and comparisons; as you see/do more of a type of thing, you have more that you can draw from and reference. The language you use to describe it often expands and changes as you pick up new vocabulary. If you’re sampling from a “general” population of users, attendees, citizens, etc., the vast majority of sentiments you get from their words are going to boil down to 1-2 short sentences. And these can usually be summed up as “I liked this part and I didn’t like that part.” “Hidden” meanings don’t magically emerge at greater volumes like they do in quantitative analyses5. The best that happens is that you learn that particular words load strongly onto different latent dimensions, often because they’re used flexibly in various contexts. Most text responses are short and unflourished. You cannot squeeze blood from a stone.

2. The architecture of LLMs prioritizes common correspondences

For the vast majority of people, LLMs are basically magic. You submit some sort of text (for most people, it will be through a chatbot interface) and you, often, get a cogent, reasonably accurate response. These outputs can appear to be very “creative” in that they are usually responses to relatively unique or off-the-wall prompts (“write a poem about my dog going on a walk in the style of Homer’s Odyssey”). But they aren’t “creative” in the sense of reliably synthesizing disparate threads of information into novel-yet-accurate statements. That is by design. I strongly recommend everyone interested in LLMs watch 3Blue1Brown’s series on neural networks and LLMs (or, at least, this 8-minute short version), but here’s the TLDW: these models effectively work by ingesting gargantuan volumes of text data to create a predictive model that answers the following question: “Given all of the text that has come before, what is the most likely word to come next?” “Likely” here usually means “with the greatest probability,” and that probability is largely informed empirically. It’s determined by looking at the text in the training data. So if the words the model saw were "It was a dark and stormy", it’d be much more likely to return "night" over "afternoon", but both of those are way more likely to be spat out than "cocker spaniel". Because I’m pretty sure that I’m the first human being ever to imply that “cocker spaniel” should be associated with that phrase.
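
To make that “most likely next word” step concrete, here’s a minimal sketch (in Python, with invented numbers) of the final stage of that prediction: the model assigns a score to every candidate token, and a softmax turns those scores into probabilities. A real model scores an entire vocabulary of tens of thousands of tokens; the four candidates below are made up purely for illustration.

```python
import math

# Hypothetical, made-up scores for a few candidate words following the
# prompt "It was a dark and stormy". Real models score every token in
# their vocabulary; these numbers are purely illustrative.
logits = {"night": 9.2, "evening": 6.8, "afternoon": 5.1, "cocker spaniel": 0.3}

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    exps = {word: math.exp(s) for word, s in scores.items()}
    total = sum(exps.values())
    return {word: v / total for word, v in exps.items()}

for word, p in sorted(softmax(logits).items(), key=lambda kv: -kv[1]):
    print(f"{word:>15}: {p:.3f}")
# "night" dominates because it follows that phrase most often in the
# training text; "cocker spaniel" ends up with a vanishingly small share.
```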

Let’s pause on that point for a moment. The model’s job is to predict the most likely next word, and that determination is made based on preexisting combinations of text, which, again, are overwhelmingly unsophisticated and average. The whole idea of “finding entirely novel meanings” is anathema to the model’s architecture! Its whole job is to return expected and quotidian words.6 Think of what the alternative would entail: a model that shot back unexpected results would be prone to gibberish and hallucinations. It wouldn’t be at all useful! But that then necessarily means that those looking to these models to deliver them better-than-expert syntheses are tilting at windmills.7

3. Expertise is what happens outside of the context window

So if I’ve just spent the last few paragraphs arguing that:

  1. Human text is often too short and banal to find “deeper meanings.”
  2. LLMs are designed to present expected (and therefore common and unoriginal) “interpretations” of text.

Why on earth should it be the case that human qualitative researchers will do better? Machines beat us at chess, Go, and (if you’re a scrub like me) first-person shooters and real-time strategy games. Why should people be any better at this?

First, I would like to remind everyone that there are, in fact, lots of things that are actually pretty tough for machines to do and that humans can do quite reliably. Like understanding objects in three-dimensional space or counting the number of “r”s in “strawberry”.

Second, it’s because LLMs are not omniscient. They are incredibly impressive, but one of their core limitations is their so-called “context window”: the number of tokens that they can reliably use when doing the whole “predicting what word comes next” thing. Some models boast truly impressive context windows of hundreds of thousands to a million tokens. Most, though, are in the tens of thousands. This is why long-running chat conversations often seem to involve the AI “forgetting” what was said before. But unless you’ve managed to translate someone’s knowledge into perfectly parseable text (knowledge gained over hundreds if not thousands of hours of study, observation, and experience), the LLMs are simply not going to have access to it. And if they don’t have access to it, they can’t draw the connections that actually generate new insights.
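
As a rough sketch of what that limit means in practice: anything that falls outside the window never reaches the model at all. The toy function below is illustrative only; whitespace splitting stands in for real tokenization (which is more granular), and the window size is invented.

```python
# Toy illustration of a context window: only the most recent `max_tokens`
# "tokens" are kept, and everything earlier is silently dropped.
def fit_to_context(conversation_history: str, max_tokens: int = 8_000) -> str:
    tokens = conversation_history.split()  # stand-in for real tokenization
    if len(tokens) <= max_tokens:
        return conversation_history
    # The model never sees anything before this cutoff, so it cannot
    # "remember" it, no matter how important it was.
    return " ".join(tokens[-max_tokens:])
```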

In practice, generating “novel insights” is less about what words are within the text you’re analyzing and more about how those words relate to other knowledge that experts have about the subject matter. For example, if I had a bunch of comments talking about people feeling hot while standing in a queue, the LLM might derive the novel insight of “provide shade and fans to people so that they’re more comfortable.” But I might know that the line is actually inside a building, so putting in shade would be a pretty silly use of my time. A better use would be to check whether the AC is working properly or whether we need to have fewer people in that room. Actual, useful business decisions come from knowing the context of the business. And that’s not often something that LLMs can magically deduce just from the text that was provided. But the humans paid to understand both the text and the business context in which that text was generated are likely to do a pretty decent job of offering a next step forward.

Footnotes

  1. Or whatever the current trendy player is by the time you read this; the landscape is changing almost daily.↩︎

  2. I know that this verbiage sounds pejorative, but sometimes a passable, surface-level analysis is more than sufficient for a particular question, in very much the same way that a simple cross-tab or COUNT query is often more appropriate than a full-on multilevel Bayesian regression analysis.↩︎

  3. a la “Think of how stupid the average person is and realize that half of them are stupider than that.”↩︎

  4. Emphasis is on “expose” here. I genuinely think the stuff that goes on between most people’s ears is far more complex than they are able to reliably write down. Even the most verbose, effectual, loquacious writers flail about when trying to clearly convey what’s happening in their heads.↩︎

  5. And, I want to emphasize, a lot of “hidden relationships” found in “big data mining” exercises are either entirely spurious or too small to practically matter.↩︎

  6. At least, to an extent. You can change the “creativity” by changing the “temperature” of the models—but this largely has the effect of artificially tweaking the probabilities of selection for that next word. But it’s a tweak. It’ll make "afternoon" more likely to pop up in the "dark and stormy" example, but you’d better not hold your breath waiting on "bamboo shoot" or something.↩︎
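
     A rough sketch of that tweak, with invented scores as in the earlier example: dividing each score by a temperature before the softmax flattens (or sharpens) the resulting probabilities, but it never reorders them.

     ```python
     import math

     # Same kind of made-up scores as before; purely illustrative.
     logits = {"night": 9.2, "evening": 6.8, "afternoon": 5.1, "bamboo shoot": 0.1}

     def softmax_with_temperature(scores, temperature=1.0):
         """Higher temperature flattens the distribution; it never reorders it."""
         exps = {w: math.exp(s / temperature) for w, s in scores.items()}
         total = sum(exps.values())
         return {w: round(v / total, 3) for w, v in exps.items()}

     for t in (0.7, 1.0, 1.5):
         print(t, softmax_with_temperature(logits, t))
     # "afternoon" gains ground as the temperature rises, but "bamboo shoot"
     # remains a long shot because the underlying scores never change.
     ```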

  7. It’s not impossible to train or fine-tune a model to do better, but that also means you need a large enough volume of the specific type of text to teach it new associations. And, remember, these things were trained off the entire bloody internet.↩︎