What do you use ChatGPT, Claude, or other large language models (LLMs) for?
If you're like most people, a big chunk of that is writing. In fact, research shows that a staggering 62% of ChatGPT requests are writing-related. From drafting emails and essays to generating marketing copy and code documentation, we rely on these models to string words together, and want them to follow our instructions when doing so.
(If you want to see the full benchmark before continuing with the blog post, you can find it here.)
At Typetone, we leverage LLMs to automate content marketing for small and medium businesses—generating a full month's worth of social media posts, blog articles, and more in minutes.
We always assumed that better models would also make our product better. And the models did get better over the last year! Just not at the things that mattered for our AI marketing agent Sarah.
Models seem to be improving on coding, reasoning, and math-related tasks. But even OpenAI itself admits that people often prefer older models for tasks like Personal Writing and Editing Text.
Existing leaderboards (Chatbot Arena, SEAL, LLM Hallucination Index, SWE-bench, MMLU, Tau-bench) focus heavily on reasoning, knowledge, and agentic task completion, but we were surprised to find few that focus on writing, despite it being the #1 use case for casual AI users.
That’s how we realized we can’t rely solely on publicly available benchmarks to pick the best model for our use case. So we decided to create our own benchmark and set up proper evals.
Before diving into where models fall short, it’s worth clarifying how we evaluate writing in the first place.
If we asked a human to write or edit something for us, how would we know if they did a good job? The same standards apply to LLMs—and they break down into a few core dimensions:
1a. Following writing instructions
This is about how well the model adheres to the instructions for writing something new. That includes things like staying within a given word count, using (or avoiding) certain keywords, formatting correctly, and matching tone or style guidelines.
If you asked a freelancer to write a 100-word LinkedIn post in a casual tone without emojis or exclamation marks, you'd expect them to follow that brief. Same deal here.
1b. Following editing instructions
Closely related, this tests how well models can edit existing text according to specific instructions—like shortening a paragraph, changing passive voice to active, or removing jargon.
We excluded the editing-specific evaluation from this version of the benchmark, but we prepare for it by evaluating a model’s ability to recognize instruction violations in a text, which is a necessary precondition for editing it.
And again, we’d expect the same of any human editor.
2. Varying structure and style across topics
A strong writer doesn’t use the same sentence structure or vocabulary for every piece of content. One of the biggest tells that something was machine-generated is the repetition of structure: starting every post with a question, or using the same phrase template again and again.
Good evaluation asks: does the model adapt its style to the prompt, or does it fall back on safe defaults?
3. Avoiding LLM-speak
This one’s harder to pin down. As mentioned before, repetition is one giveaway.
But LLM-speak is the uncanny sense that something was written by a machine—overly formal, stuffed with generic buzzwords, or trying too hard to sound inspirational. Ironically, this is hard to avoid for both humans and AI.
The most common way to evaluate LLM-speak is to check for overuse of certain words that LLMs typically favor. The graph above shows the prevalence of a few such words in academic papers over time, but this approach is not 100% robust: other research suggests people are also starting to use more “delves” and “intricates” in normal speech.
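To illustrate, such a check can be as simple as counting the rate of marker words per thousand words. The marker list and example sentences below are hypothetical, not the lists used in the research above.

```python
# Tiny sketch of an LLM-speak check: rate of typical "AI words" per 1,000 words.
# The marker list and example texts are illustrative assumptions.
import re

MARKERS = {"delve", "delves", "intricate", "tapestry", "leverage", "elevate"}

def llm_speak_rate(text: str) -> float:
    words = re.findall(r"[a-z']+", text.lower())
    hits = sum(1 for w in words if w in MARKERS)
    return 1000 * hits / max(len(words), 1)

print(llm_speak_rate("Let's delve into the intricate tapestry of modern banking."))
print(llm_speak_rate("Banks now answer questions faster and catch fraud sooner."))
```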
What makes something sound “AI-ish” is a fuzzy mix of tone, rhythm, repetition, and phrasing that’s still being researched. So while we include it as a key quality axis, it’s one that requires a more experimental approach to evaluate.
We tested 18 high-performing models from leading AI labs and providers, including GPT-4o and o3-mini, Claude 3.5 and 3.7, Gemini 1.5 and 2.0, DeepSeek-V3, and various Llama, Mistral, and Qwen variants. Each model was invoked via its respective API using a shared prompt format, and responses were scored using a suite of automated evaluation functions tailored to each constraint.
- gpt-4o-2024-08-06 (which we nickname gpt-4o-stable)
- gpt-4o-2024-11-20 (which we nickname gpt-4o-writing)
- gpt-4o-mini
- o3-mini
- claude-3-5-haiku-20241022
- claude-3-5-sonnet-20241022
- claude-3-7-sonnet-20250219
- gemini-2.0-flash
- gemini-2.0-flash-lite
- gemini-1.5-flash
- gemini-1.5-pro
- meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo
- meta-llama/Llama-3.3-70B-Instruct-Turbo
- meta-llama/Meta-Llama-3.1-405B-Instruct-Turbo
- Qwen/Qwen2.5-7B-Instruct-Turbo
- Qwen/Qwen2.5-72B-Instruct-Turbo
- deepseek-ai/DeepSeek-V3
- mistralai/Mistral-Small-24B-Instruct-2501
Each model was queried with the same instruction/user prompt pair using a standardized temperature setting (typically 0.7). The resulting text was then evaluated using a task-specific set of rule-based functions (e.g., regex, string pattern matching, or numeric parsing) to assess compliance with the original instructions.
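For illustration, a minimal sketch of the query loop might look like the following. The model subset, prompts, and the use of a single OpenAI-compatible client are assumptions for brevity; in practice each provider's own API wrapper was used.

```python
# Minimal sketch of the query loop (illustrative, not the exact benchmark code).
# Assumes an OpenAI-compatible endpoint; other providers need their own clients.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

MODELS = ["gpt-4o-2024-08-06", "gpt-4o-2024-11-20", "gpt-4o-mini"]  # subset for illustration

SYSTEM_PROMPT = "Write all text in upper case.\nUse emojis in the text."  # sampled constraints
USER_PROMPT = "The benefits of remote work"  # sampled content topic

def query_model(model: str, system_prompt: str, user_prompt: str) -> str:
    """Send the same instruction/user prompt pair to a model at temperature 0.7."""
    response = client.chat.completions.create(
        model=model,
        temperature=0.7,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
    )
    return response.choices[0].message.content

outputs = {m: query_model(m, SYSTEM_PROMPT, USER_PROMPT) for m in MODELS}
```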
There’s a variety of writing instructions to follow. Many of them are about the content of the text, but we leave those out of scope as they are tough to assess programmatically. Instead, we focus on stylistic and formatting instructions, as these are simple to check with regex in Python.
Here’s an overview of writing instructions and how outputs were evaluated:
Each prompt passed to the models was built by sampling from the list of tasks above. The sampling randomly selected a mix of instruction types, such as casing requirements, emoji usage, and phrase blacklists (see the example prompt below).
For each sampled constraint, a system prompt was generated (e.g., “Do not use emojis”) along with an evaluation function to check compliance in the model's output.
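To make this concrete, here is a rough sketch of how constraint sampling and checking could be wired together. The constraint set, regexes, and function names are simplified assumptions rather than the benchmark's actual code.

```python
# Illustrative sketch: pairing a constraint's prompt fragment with a rule-based checker.
# The constraint set and regexes are simplified assumptions, not the benchmark's exact rules.
import random
import re

BLACKLIST = ["in today's", "to the next level", "fast-paced", "changing the game"]

EMOJI_RE = re.compile(
    "[\U0001F300-\U0001FAFF\U00002600-\U000027BF]"  # rough emoji ranges
)

CONSTRAINTS = {
    "uppercase": (
        "Write all text in upper case.",
        lambda text: text.upper() == text,
    ),
    "use_emojis": (
        "Use emojis in the text.",
        lambda text: EMOJI_RE.search(text) is not None,
    ),
    "blacklist": (
        "Do not use any of these phrases in your output: " + ", ".join(BLACKLIST),
        lambda text: not any(p in text.lower() for p in BLACKLIST),
    ),
    "max_words": (
        "Use at most 100 words.",
        lambda text: len(text.split()) <= 100,
    ),
}

def sample_prompt_and_checks(k: int = 2):
    """Sample k constraints; return the system prompt and the paired checker functions."""
    chosen = random.sample(list(CONSTRAINTS), k)
    system_prompt = "\n".join(CONSTRAINTS[name][0] for name in chosen)
    checks = {name: CONSTRAINTS[name][1] for name in chosen}
    return system_prompt, checks

system_prompt, checks = sample_prompt_and_checks()
output = "EXAMPLE OUTPUT TEXT 🙂"
results = {name: check(output) for name, check in checks.items()}
print(results)  # e.g. {'uppercase': True, 'use_emojis': True}
```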
The user prompt—a short social media topic like “The benefits of remote work” or “Why AI writing tools are overhyped”—was sampled from a pool of realistic content ideas, which are real-life examples of content Sarah, Typetone’s AI marketing agent, came up with.
Each model received the same combined prompt via its respective API wrapper. Here’s an excerpt:
...rest of prompt...
Write all text in upper case.
Use emojis in the text.
Do not use any of these phrases in your output: in today's, to the next level, fast-paced, changing the game, the future of, here's why, a world where, the world of, the power of, not just about, whether you're
...rest of prompt...
WHAT IF YOUR BANK KNEW EXACTLY WHAT YOU NEEDED BEFORE YOU EVEN ASKED?
AI-driven insights are making it obvious that banks are listening. By analyzing precise data, banks are understanding customer needs without being obvious or intrusive. This means no more waiting on hold or explaining your needs repeatedly. Instead, expect swift, unbothered service that anticipates your requirements.
AI is transforming customer service in ways you might not even notice. Imagine walking into a bank on a busy Monday 15/05/2024 and being greeted with a lively, personalized experience. AI can analyze your transaction history, preferences, and even your mood to provide illuminating, tailored services. No more generic responses or endless queues. Banks are using AI to predict and solve problems before they happen.
- AI-driven chatbots available 24/7.
- Personalized financial advice based on your spending habits.
- Fraud detection systems that learn and adapt to new threats.
The result? A smoother, more efficient banking experience. AI is **not just about** automating processes; it's about creating a seamless, intuitive journey for customers. Banks are using AI to analyze vast amounts of data and provide insights that enhance service quality. From detecting fraud to offering financial advice, AI is making banking more precise and secure.
Yet, it's not about robots taking over; it's about making your life easier. Next time you visit your bank, notice the subtle but significant ways AI is enhancing your experience.
Disclaimer: the next sections are quite technical and in-depth; casual readers may prefer to jump straight to the full benchmark results here.
Using the texts produced in the instruction-following task above, we could also evaluate the outputs for stylistic and vocabulary diversity.
Vocabulary diversity
Models tend to recycle sentence starters:
"Ready to ditch writer’s block? Ready to ditch your wallet? Ready to future-proof your tech career?"
Measuring this is relatively easy: check whether certain words (unigrams) or pairs of words (bigrams) are overused by a model.
Measurement: we used Expectation-Adjusted Distinct unigrams and bigrams (EAD) on the first sentence. Higher EAD = richer vocabulary.
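For reference, here is a minimal sketch of one common formulation of EAD, which divides the number of distinct n-grams by the number you would expect under uniform sampling. The vocabulary-size choice below is an assumption, not necessarily the benchmark's exact implementation.

```python
# Minimal sketch of Expectation-Adjusted Distinct (EAD); simplified assumptions,
# not necessarily the benchmark's exact implementation.

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ead(sentences, n=1, vocab_size=None):
    """EAD-n over a list of tokenized sentences (e.g. the first sentence of each output).

    Plain distinct-n = |distinct n-grams| / |n-grams| penalizes longer texts, so EAD
    divides the distinct count by its expectation under uniform sampling instead:
        E[distinct] = V * (1 - ((V - 1) / V) ** C)
    where V is the vocabulary size and C the total number of n-grams.
    """
    all_ngrams = [g for sent in sentences for g in ngrams(sent, n)]
    distinct = len(set(all_ngrams))
    total = len(all_ngrams)
    if vocab_size is None:
        # assumption: use the observed token vocabulary as V
        vocab_size = len(set(tok for sent in sentences for tok in sent))
    expected = vocab_size * (1 - ((vocab_size - 1) / vocab_size) ** total)
    return distinct / expected

first_sentences = [
    "Ready to ditch writer's block?".lower().split(),
    "Ready to ditch your wallet?".lower().split(),
    "Ready to future-proof your tech career?".lower().split(),
]
print(ead(first_sentences, n=1), ead(first_sentences, n=2))
```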
Syntactic diversity
But even different-looking sentences often rely on similar structures, and may start to sound repetitive.
"Creating a strong..." / "Finding the perfect..." / "Saving money..." → [Gerund Phrase] ... but it ...
This is harder to measure with a token lookup. These sentences are similar not in which words they use but in how they are constructed.
Measurement: we parsed sentences using Stanford CoreNLP to get a constituency parse tree. A parse tree is a structure that looks like this; it describes a sentence in terms of phrases and their syntactic categories. Examining the whole tree is somewhat complicated, but we observed that the first few words of a sentence make the biggest impression on diversity when you read multiple pieces of content side by side.
So we measure syntactic diversity as the entropy of the first top-level phrase category across the first sentences of all texts produced by each LLM.
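As a small illustration, the metric itself boils down to Shannon entropy over the category labels. The label lists below are hypothetical stand-ins; in the benchmark they come from the CoreNLP parse of each output's first sentence.

```python
# Sketch: syntactic diversity as the entropy of first top-level phrase categories.
# The category labels below are hypothetical examples; in the benchmark they come from
# a Stanford CoreNLP constituency parse of each output's first sentence.
import math
from collections import Counter

def category_entropy(labels):
    """Shannon entropy (in bits) of the distribution of phrase-category labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# A model that almost always opens with the same adverbial question scores low...
gemini_labels = ["ADVP", "ADVP", "ADVP", "ADVP", "NP"]
# ...while a model that varies its openings scores higher.
gpt_labels = ["ADJP", "NP", "SQ", "ADVP", "VP"]

print(category_entropy(gemini_labels))  # ~0.72 bits
print(category_entropy(gpt_labels))     # ~2.32 bits
```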
- gemini-1.5-pro (ADVP): "Ever feel like UI design is a stressful juggling act?"
- gpt-4o-2024-11-20 (ADJP): "Ready to shine in UI design?"
- gemini-1.5-pro (ADVP): "Ever feel like data is a dazzling, coruscating enigma?"
- gpt-4o-2024-11-20 (NP): "Data is everywhere, yet many remain nonchalant about its potential!"
- gemini-1.5-pro (ADVP): "Ever feel like marketing is a whirlwind of algorithms and automation?"
- gpt-4o-2024-11-20 (SQ): "Is your marketing strategy purposeful or just adding to the noise?"
Wordcloud for o3-mini
Wordcloud for Gemini 2.0 Flash-Lite
Finally, we would like to see how good the models are at editing tasks. The specific editing tasks and experiments fall outside the scope of this benchmark due to time constraints on our side, but an important foundation for editing is the capacity of LLMs to detect violations of writing instructions.
Since we can programmatically assess whether the models followed the instructions, we can also compare that ground-truth assessment with the LLM's own assessment. LLMs are increasingly used as evaluators, mostly for cases where code-based evaluations are not feasible. But to be good at editing, models also need to spot mistakes before correcting them.
In this short section we show how the models perform on this task.
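Here is a minimal sketch of how such a comparison could be run. The judge prompt, the choice of gpt-4o-mini as judge, and the toy samples are illustrative assumptions rather than our exact setup.

```python
# Illustrative sketch: compare an LLM judge's violation verdicts against rule-based ground truth.
# The judge prompt and the use of gpt-4o-mini here are assumptions, not the benchmark's exact setup.
from openai import OpenAI

client = OpenAI()

def llm_says_violated(instruction: str, text: str, judge_model: str = "gpt-4o-mini") -> bool:
    """Ask a judge model whether `text` violates `instruction`; expects a YES/NO answer."""
    response = client.chat.completions.create(
        model=judge_model,
        temperature=0,
        messages=[
            {
                "role": "user",
                "content": (
                    f"Instruction: {instruction}\n\nText:\n{text}\n\n"
                    "Does the text violate the instruction? Answer only YES or NO."
                ),
            }
        ],
    )
    return response.choices[0].message.content.strip().upper().startswith("YES")

# Each item: (instruction, model output, ground-truth violation from the rule-based checker)
samples = [
    ("Write all text in upper case.", "AI IS TRANSFORMING BANKING.", False),
    ("Write all text in upper case.", "AI is transforming banking.", True),
    ("Do not use the phrase 'not just about'.", "It's not just about automation.", True),
]

correct = sum(
    llm_says_violated(instr, text) == truth for instr, text, truth in samples
)
print(f"Judge agreement with rule-based checks: {correct}/{len(samples)}")
```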
This mirrors findings in recent research, especially LLMBar, a benchmark designed specifically to test how well LLMs can act as evaluators on instruction-following tasks. It distinguishes between outputs that superficially look good and those that actually follow the instructions.
Our internal experiments align with the study's findings.
Negative constraints are hard: getting a model to avoid doing something is surprisingly difficult.
This isn't just an anecdotal quirk. Recent research, such as studies by Truong et al. (2023) and Jang et al. (2022), specifically investigates how LLMs handle negation and negated prompts.
Their findings confirm that models across the board—from GPT-style architectures to OPT—struggle significantly with understanding and correctly acting on negative instructions. Perhaps most counterintuitively, this research reveals an inverse scaling phenomenon for negation. While we usually expect larger models to perform better, both Truong et al. and Jang et al. found that on tasks requiring understanding negation (like identifying what something isn't or generating an incorrect answer), larger models often perform worse than smaller ones.
This suggests that simply increasing model size doesn't solve, and might even exacerbate, the problem of understanding "NOT." This aligns with our benchmark findings, where we observed high violation rates for blacklist instructions across several models. It indicates the issue goes deeper than just missing a keyword; it's about fundamentally processing the negative command.
Lack of stylistic diversity is an artefact of RLHF: The study by Kirk et al. (2024) found that models fine-tuned with Reinforcement Learning from Human Feedback (RLHF) – the process heavily used for models like ChatGPT and Claude – show substantially lower EAD scores compared to models simply fine-tuned on examples.
This indicates RLHF models tend to use a narrower range of words and phrases, especially when generating multiple possible outputs for the same input (lower per-input diversity).
Our benchmark, contextualized by recent research, paints a clearer picture of modern LLM capabilities and limitations in writing:
Key Takeaways:
In short, there's no clear winner that aces every dimension of creative writing and editing. If you want models that sound less like AI, check out Claude 3.5 Sonnet. If you want more diverse outputs, a small model like Llama 3.1-8B might be a good choice (or check out a non-Instruct model).
But either way, don't forget to do your evals, folks!