
WETT: Writing & Editing Typetone LLM Benchmark

Azamat Omuraliev
April 4, 2025
20 min

What do you use ChatGPT, Claude, or other large language models (LLMs) for?

If you're like most people, a big chunk of that is writing. In fact, research shows that a staggering 62% of ChatGPT requests are writing-related. From drafting emails and essays to generating marketing copy and code documentation, we rely on these models to string words together, and want them to follow our instructions when doing so.

(if you want to see the full benchmark before continuing with the blogpost, find it here)

Examples of real conversations with ChatGPT, from AllenAI research

At Typetone, we leverage LLMs to automate content marketing for small and medium businesses—generating a full month's worth of social media posts, blog articles, and more in minutes.

We always assumed that as models got better, our product would get better too. And the models did get better over the past year! Just not at the things that mattered for our AI marketing agent Sarah.

Models seem to be improving on coding, reasoning, and math-related tasks. But even OpenAI itself admits that people often prefer older models for tasks like Personal Writing and Editing Text.

Existing leaderboards (Chatbot Arena, SEAL, LLM Hallucination Index, SWE-bench, MMLU, Tau-bench) focus heavily on reasoning, knowledge, and agentic task completion, but we were surprised to find few with a focus on writing, despite the fact that this is the #1 use case for a casual AI user.

That’s how we realized we can’t rely only on publicly available benchmarks to pick the best model for our use case. So we decided to build our own benchmark and set up proper evals.

https://x.com/gdb/status/1733553161884127435

How do you evaluate writing from LLMs (or humans)?

Before diving into where models fall short, it’s worth clarifying how we evaluate writing in the first place.

If we asked a human to write or edit something for us, how would we know if they did a good job? The same standards apply to LLMs—and they break down into a few core dimensions:

1a. Following writing instructions

This is about how well the model adheres to the instructions for writing something new. That includes things like staying within a given word count, using (or avoiding) certain keywords, formatting correctly, and matching tone or style guidelines.

If you asked a freelancer to write a 100-word LinkedIn post in a casual tone without emojis or exclamation marks, you'd expect them to follow that brief. Same deal here.

1b. Following editing instructions

Closely related, this tests how well models can edit existing text according to specific instructions—like shortening a paragraph, changing passive voice to active, or removing jargon.

We excluded the editing-specific evaluation from this version of the benchmark, but we prepare for it by evaluating each model’s ability to recognize instruction violations in a text, which is a necessary precondition for editing it.

And again, we’d expect the same of any human editor.

2. Varying structure and style across topics

A strong writer doesn’t use the same sentence structure or vocabulary for every piece of content. One of the biggest tells that something was machine-generated is the repetition of structure: starting every post with a question, or using the same phrase template again and again.

Good evaluation asks: does the model adapt its style to the prompt, or does it fall back on safe defaults?

3. Avoiding LLM-speak

This one’s harder to pin down. As mentioned before, repetition is one giveaway.

But LLM-speak is the uncanny sense that something was written by a machine—overly formal, stuffed with generic buzzwords, or trying too hard to sound inspirational. Ironically, this is hard to avoid for both humans and AI.

The most common way to evaluate LLM-speak is to check for overuse of certain words that LLMs typically favor. The graph above shows the prevalence of a few such words in academic papers over time, but this approach is not 100% robust: other research suggests people are also starting to use more “delves” and “intricates” in normal speech.

What makes something sound “AI-ish” is a fuzzy mix of tone, rhythm, repetition, and phrasing that’s still being researched. So while we include it as a key quality axis, it’s one that requires a more experimental approach to evaluate.

Which models did we test?

We tested 18 high-performing models from leading AI labs and providers, including GPT-4o, Claude 3.5 and 3.7, Gemini 1.5 and 2.0, and various Llama, Qwen, Mistral, and DeepSeek variants. Each model was invoked via its respective API using a shared prompt format, and responses were scored using a suite of automated evaluation functions tailored to each constraint.

Models tested (grouped by company/platform):

  • OpenAI (via the OpenAI API):
    • gpt-4o-2024-08-06 which we nickname gpt-4o-stable
    • gpt-4o-2024-11-20 which we nickname gpt-4o-writing
    • gpt-4o-mini
    • o3-mini
  • Anthropic (Claude):
    • claude-3-5-haiku-20241022
    • claude-3-5-sonnet-20241022
    • claude-3-7-sonnet-20250219
  • Google DeepMind (Gemini):
    • gemini-2.0-flash
    • gemini-2.0-flash-lite
    • gemini-1.5-flash
    • gemini-1.5-pro
  • Meta (via Together API):
    • meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo
    • meta-llama/Llama-3.3-70B-Instruct-Turbo
    • meta-llama/Meta-Llama-3.1-405B-Instruct-Turbo
  • Alibaba (Qwen, via Together API):
    • Qwen/Qwen2.5-7B-Instruct-Turbo
    • Qwen/Qwen2.5-72B-Instruct-Turbo
  • DeepSeek (via Together API):
    • deepseek-ai/DeepSeek-V3
  • Mistral (via Together API):
    • mistralai/Mistral-Small-24B-Instruct-2501
    • We would have also liked to benchmark the bigger Mistral models, but were limited by their availability on the Together API.

Each model was queried with the same instruction/user prompt pair using a standardized temperature setting (typically 0.7). The resulting text was then evaluated using a task-specific set of rule-based functions (e.g., regex, string pattern matching, or numeric parsing) to assess compliance with the original instructions.
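
To make that pipeline concrete, here is a minimal sketch of the query-and-score loop for the OpenAI models, assuming the official openai Python client. The helper names and the checkers dictionary are ours for illustration; the actual benchmark code differs.

```python
# Minimal sketch of the query-and-score loop (illustrative, not the actual benchmark code).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def generate(model: str, system_prompt: str, user_prompt: str, temperature: float = 0.7) -> str:
    """Query one model with the shared system/user prompt pair."""
    response = client.chat.completions.create(
        model=model,
        temperature=temperature,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
    )
    return response.choices[0].message.content

def score(text: str, checkers: dict) -> dict:
    """Run each rule-based checker (regex, string matching, numeric parsing) over the output.
    `checkers` is a hypothetical mapping from constraint name to a function text -> bool."""
    return {name: check(text) for name, check in checkers.items()}
```

Models served through the Together API were queried the same way, just through a different client.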

Task 1: Writing instruction following

Overview of tasks

There’s a variety of writing instructions to follow. Many of them concern the content of the text, but we leave those out of scope, as they are tough to assess programmatically. Instead, we focus on stylistic and formatting instructions, as these are simple to check with regex in Python (a sketch of a few such checks follows the overview below).

Here’s an overview of writing instructions and how outputs were evaluated:

  • blacklist: Models were told not to use certain words. The test checked for the presence of banned terms like “amazing” or “best.”
  • blacklist_phrase: Similar to blacklist, but applied to full phrases rather than individual words.
  • bullets: Assessed whether models used or avoided bulleted formatting as instructed (e.g., “Use a bulleted list” vs. “Avoid using bullets”).
  • case: Instructed models to write entirely in lowercase, uppercase, or title case, and checked casing consistency.
  • conciseness: Limited the number of words per sentence (e.g., max 10 words). Each sentence was evaluated for compliance.
  • date: Tested adherence to a specified date format like “YYYY-MM-DD.” Dates in output were parsed and checked.
  • emoji: Assessed presence or absence of emojis depending on the instruction.
  • greeting: Checked if models avoided starting with a greeting like “Hi,” “Hey there,” or “Wow.”
  • hashtag: Evaluated whether hashtags were lowercase and free of underscores (e.g., #electricbikes, not #Electric_Bikes).
  • length: Required output to be an exact word count (e.g., “Your output should be exactly 100 words.”).
  • markdown: Ensured models avoided Markdown syntax such as *, **, and # if instructed not to use them.
  • numbers: Assessed if numerical formatting followed specified thousands and decimal separators (e.g., 1.000,00).
  • whitelist: Required inclusion of specific words (e.g., “Include these terms: energy, remote, creator”)—checked all were present.
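
To make these checks concrete, here is a simplified sketch of what a few of the rule-based evaluation functions could look like (the real checks handle more edge cases):

```python
import re

def check_blacklist(text: str, banned: list[str]) -> bool:
    """Pass only if none of the banned words or phrases appear (case-insensitive)."""
    return not any(re.search(rf"\b{re.escape(term)}\b", text, re.IGNORECASE) for term in banned)

def check_case_upper(text: str) -> bool:
    """Pass only if every alphabetic character is uppercase."""
    letters = [c for c in text if c.isalpha()]
    return bool(letters) and all(c.isupper() for c in letters)

def check_length(text: str, exact_words: int) -> bool:
    """Pass only if the output is exactly the requested number of words."""
    return len(text.split()) == exact_words

def check_conciseness(text: str, max_words: int) -> bool:
    """Pass only if every sentence stays within the word limit."""
    sentences = re.split(r"[.!?]+", text)
    return all(len(s.split()) <= max_words for s in sentences if s.strip())
```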

Task Construction and Prompt Sampling

Each prompt passed to the models was built by sampling from the instruction types listed above. The sampling randomly selected a mix of instruction types such as:

  • Use or avoid emojis
  • Write in lower/upper/title case
  • Follow specific number/date formatting
  • Include or avoid certain words or phrases
  • Use or avoid bullet points
  • Limit sentence length for conciseness

For each sampled constraint, a system prompt was generated (e.g., “Do not use emojis”) along with an evaluation function to check compliance in the model's output.

The user prompt—a short social media topic like “The benefits of remote work” or “Why AI writing tools are overhyped”—was sampled from a pool of realistic content ideas, which are real-life examples of content Sarah, Typetone’s AI marketing agent, came up with.

Each model received the same full prompt:

  • A system prompt defining the stylistic and structural rules
  • A user prompt with the content topic

This combined prompt was sent to each model via its respective API wrapper.
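
Here is a simplified sketch of how such a prompt could be assembled; the constraint texts, topics, and function names are illustrative rather than the actual benchmark code:

```python
import random

# Illustrative constraint pool: each entry pairs a system-prompt fragment with a checker name.
CONSTRAINTS = [
    ("Do not use emojis.", "emoji"),
    ("Write all text in lower case.", "case"),
    ("Use the date format YYYY-MM-DD.", "date"),
    ("Do not use any of these words in your output: amazing, best.", "blacklist"),
    ("Keep every sentence under 10 words.", "conciseness"),
    ("Use a bulleted list.", "bullets"),
]

TOPICS = [
    "The benefits of remote work",
    "Why AI writing tools are overhyped",
]

def sample_prompt(n_constraints: int = 3) -> tuple[str, str, list[str]]:
    """Sample a mix of constraints and a topic; return (system_prompt, user_prompt, checker names)."""
    chosen = random.sample(CONSTRAINTS, k=n_constraints)
    system_prompt = "You are writing a short social media post.\n" + "\n".join(text for text, _ in chosen)
    user_prompt = random.choice(TOPICS)
    return system_prompt, user_prompt, [name for _, name in chosen]
```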

Writing instructions results

  • At first glance, the tasks don’t seem too hard for the models. Greeting and date formatting instructions are very easy, with all models scoring 100%. Some other categories are only hard for a few models (hashtag, markdown, emoji, case), with smaller models performing worse at these.
  • The real challenge is avoiding specific words and phrases and keeping sentences under a certain length limit. No model avoids “LLM-speak” words in more than 90% of cases. That’s fine if you’re chatting with the model yourself, but for production-grade auto-generated content it doesn’t fly.
  • Conciseness is easy for o3-mini, which is not a surprise given its reasoning capabilities: it essentially gets a bunch of extra tokens, compared to all the other models, to think about how to write the text and plan the length of its sentences. Interestingly, the non-reasoning Qwen models also perform really well on this task.
  • If some of these numbers seem weirdly low, here’s an example of a pretty blatant violation from Mistral Small 3:
...rest of prompt...
Write all text in upper case.
Use emojis in the text.
Do not use any of these phrases in your output: in today's, to the next level, fast-paced, changing the game, the future of, here's why, a world where, the world of, the power of, not just about, whether you're
...rest of prompt...


WHAT IF YOUR BANK KNEW EXACTLY WHAT YOU NEEDED BEFORE YOU EVEN ASKED? 

AI-driven insights are making it obvious that banks are listening. By analyzing precise data, banks are understanding customer needs without being obvious or intrusive. This means no more waiting on hold or explaining your needs repeatedly. Instead, expect swift, unbothered service that anticipates your requirements.
AI is transforming customer service in ways you might not even notice. Imagine walking into a bank on a busy Monday 15/05/2024 and being greeted with a lively, personalized experience. AI can analyze your transaction history, preferences, and even your mood to provide illuminating, tailored services. No more generic responses or endless queues. Banks are using AI to predict and solve problems before they happen.
- AI-driven chatbots available 24/7.
- Personalized financial advice based on your spending habits.
- Fraud detection systems that learn and adapt to new threats.

The result? A smoother, more efficient banking experience. AI is **not just about** automating processes; it's about creating a seamless, intuitive journey for customers. Banks are using AI to analyze vast amounts of data and provide insights that enhance service quality. From detecting fraud to offering financial advice, AI is making banking more precise and secure. 

Yet, it's not about robots taking over; it's about making your life easier. Next time you visit your bank, notice the subtle but significant ways AI is enhancing your experience.

Task 2: Stylistic and vocabulary diversity

Disclaimer: the next sections are quite technically in-depth; casual readers may prefer to check out the full benchmark results here instead.

Evaluation method

Since the first task already produced text written under instruction, we could also evaluate those outputs on stylistic and vocabulary diversity.

Vocabulary diversity

Models tend to recycle sentence starters:

"Ready to ditch writer’s block? Ready to ditch your wallet? Ready to future-proof your tech career?"

Measuring this is fairly easy: you just have to check whether certain words (unigrams) or pairs of words (bigrams) are overused by a model.

Measurement: we used Expectation-Adjusted Distinct unigrams and bigrams (EAD) on the first sentence. Higher EAD = richer vocabulary.
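
For reference, here is a minimal sketch of one way to compute EAD: the number of distinct n-grams divided by the number you would expect if the same number of n-grams were drawn uniformly from a vocabulary of size V. The exact formulation used in the benchmark may differ in details.

```python
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ead(first_sentences, n=1, vocab_size=None):
    """Expectation-Adjusted Distinct n-grams over a collection of first sentences.

    Distinct-n divided by its expected value under uniform sampling,
    E[distinct] = V * (1 - ((V - 1) / V) ** C), where C is the total n-gram count.
    """
    all_ngrams = []
    for sentence in first_sentences:
        all_ngrams.extend(ngrams(sentence.lower().split(), n))
    c = len(all_ngrams)                  # total n-grams drawn
    distinct = len(Counter(all_ngrams))  # distinct n-grams observed
    v = vocab_size or distinct           # fall back to the observed vocabulary if V is unknown
    if c == 0 or v == 0:
        return 0.0
    expected_distinct = v * (1 - ((v - 1) / v) ** c)
    return distinct / expected_distinct
```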

Syntactic diversity

But even different-looking sentences often rely on similar structures, and may start to sound repetitive.

"Creating a strong..." / "Finding the perfect..." / "Saving money..." → [Gerund Phrase] ... but it ...

This is harder to measure with a token lookup. These sentences are similar not in which words they use but in how they are constructed.

Measurement: we parsed sentences using Stanford CoreNLP to get a constituency parse tree. A parse tree is a structure that looks like this, and it describes a sentence in terms of phrases and their syntactic categories. Examining the whole tree is somewhat complicated, but we observed that the first few words of a sentence make the biggest impression on diversity when you read multiple pieces of content side by side.

So we measure syntactic diversity as the entropy of the first top-level phrase category across the first sentences of all texts produced by the LLM.

This is what a constituency parse tree looks like.
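
Given the first top-level phrase category of each output (extracted from the CoreNLP parses), the diversity score itself is just a Shannon entropy over those categories. A minimal sketch:

```python
import math
from collections import Counter

def syntactic_diversity(first_phrase_categories):
    """Shannon entropy (in bits) of the first top-level phrase category (e.g. NP, VP, ADVP, SQ)
    across the first sentences of a model's outputs. Higher = more varied openers."""
    counts = Counter(first_phrase_categories)
    total = sum(counts.values())
    return sum((c / total) * math.log2(total / c) for c in counts.values())

# A model that always opens with a noun phrase scores 0.0;
# one that mixes its openers scores higher.
print(syntactic_diversity(["NP", "NP", "NP", "NP"]))    # 0.0
print(syntactic_diversity(["NP", "VP", "ADVP", "SQ"]))  # 2.0
```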

Style diversity results

  • There’s no clear winner that scores high both on vocabulary and syntactic diversity. o3-mini has the most diverse vocabulary, while Gemini 1.5 Pro uses the most varied syntax structure in its text.
  • But there are a few models that sit nicely in the middle of this Pareto front. The writing-optimized GPT-4o release, Claude 3.5 Sonnet, and the smallest Llama version all score well on both metrics.
  • Here are some illustrative examples that show outputs of Gemini and GPT on the same prompts, with the top-level syntactic category displayed per sentence.
gemini-1.5-pro ADVP 
Ever feel like UI design is a stressful juggling act?
gpt-4o-2024-11-20 ADJP 
Ready to shine in UI design?

gemini-1.5-pro ADVP 
Ever feel like data is a dazzling, coruscating enigma?
gpt-4o-2024-11-20 NP 
Data is everywhere, yet many remain nonchalant about its potential!

gemini-1.5-pro ADVP 
Ever feel like marketing is a whirlwind of algorithms and automation?
gpt-4o-2024-11-20 SQ 
Is your marketing strategy purposeful or just adding to the noise?
  • To help visualize which syntactic structures are preferred by which model, we also plot the distributions. This shows that noun phrases are the most prevalent opener, with verb phrases a close second.
  • We also produced wordclouds of the vocabulary distribution for each model, but showing them all would be a bit much for this blogpost. We share the wordclouds for the most and least diverse models here.

Wordcloud for o3-mini

Wordcloud for Gemini 2.0 Flash-Lite

Task 3: Self-evaluation capabilities

Finally, we would like to see how good the models are at editing tasks. The specific editing tasks and experiments fall out of scope for this benchmark due to time constraints on our side, but an important foundation for them is the capacity of LLMs to detect violations of writing instructions.

Since we could programmatically assess whether the models followed the instructions, we could also compare that ground-truth assessment with the LLM's own assessment. LLMs are increasingly used as evaluators, mostly for cases where code-based evaluations are not feasible. But to be good at editing, models also need to know how to spot mistakes before correcting them.

In this short section we show how the models perform on this task.
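
Conceptually, the setup looks like the sketch below: ask the model whether a given output violates a given instruction, then compare its verdict against the ground truth from the rule-based checks. The prompt wording and helper names are illustrative, and `generate` is the hypothetical query helper from the earlier sketch.

```python
def judge_violation(generate, model, instruction, text):
    """Ask `model` whether `text` violates `instruction`; returns True if it claims a violation."""
    judge_prompt = (
        f"Instruction: {instruction}\n\nText:\n{text}\n\n"
        "Does the text violate the instruction? Answer with YES or NO only."
    )
    answer = generate(model, system_prompt="You are a strict copy editor.", user_prompt=judge_prompt)
    return answer.strip().upper().startswith("YES")

def self_eval_accuracy(samples, generate, model):
    """`samples` is a list of (instruction, text, violated) triples, where `violated`
    is the ground-truth verdict from the rule-based checks."""
    correct = sum(judge_violation(generate, model, instruction, text) == violated
                  for instruction, text, violated in samples)
    return correct / len(samples)
```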

This mirrors findings in recent research, especially from LLMBAR, a benchmark designed specifically to test how well LLMs can act as evaluators in instruction-following tasks. It distinguishes between outputs that superficially look good and those that actually follow the instructions.

The study found that:

  • Even top models like GPT-4 often fall for more polished but incorrect outputs.
  • ChatGPT and other popular models performed worse than random chance on adversarial examples.
  • Prompting strategy matters: reflection performance improves significantly when models are given structured evaluation prompts with rules, metrics, or reference outputs to compare against.

Our internal experiments align with these insights.

Why do LLMs struggle with negative instructions and style diversity?

Negative constraints are hard: Telling a model not to do something is surprisingly difficult.

  • Example: "Avoid greeting the reader with 'Hey there'... Also avoid starting with 'Wow' or 'Boom'."
    LLM: "Woah, 14% of PCs shipped worldwide..." → Oops.
  • Example: "Don't use 'game-changer'."
    LLM: "Empathy can be a game-changer." → Double oops.

This isn't just an anecdotal quirk. Recent research, such as studies by Truong et al. (2023) and Jang et al. (2022), specifically investigates how LLMs handle negation and negated prompts.

Their findings confirm that models across the board—from GPT-style architectures to OPT—struggle significantly with understanding and correctly acting on negative instructions. Perhaps most counterintuitively, this research reveals an inverse scaling phenomenon for negation. While we usually expect larger models to perform better, both Truong et al. and Jang et al. found that on tasks requiring understanding negation (like identifying what something isn't or generating an incorrect answer), larger models often perform worse than smaller ones.

This suggests that simply increasing model size doesn't solve, and might even exacerbate, the problem of understanding "NOT." This aligns with our benchmark findings, where we observed high violation rates for blacklist instructions across several models. It indicates the issue goes deeper than just missing a keyword; it's about fundamentally processing the negative command.

Lack of stylistic diversity is an artefact of RLHF: The study by Kirk et al. (2024) found that models fine-tuned with Reinforcement Learning from Human Feedback (RLHF) – the process heavily used for models like ChatGPT and Claude – show substantially lower EAD scores compared to models simply fine-tuned on examples.

This indicates RLHF models tend to use a narrower range of words and phrases, especially when generating multiple possible outputs for the same input (lower per-input diversity).

Conclusion

Our benchmark, contextualized by recent research, paints a clearer picture of modern LLM capabilities and limitations in writing:

Key Takeaways:

  • LLM-speak is real: Overused words and patterns hurt authenticity.
  • Negative and length instructions are hard: Especially when constraints are negative or precise.
  • Diversity is sacrificed: RLHF, while boosting generalization, demonstrably reduces output diversity (mode collapse), both lexically and structurally (Kirk et al.). SFT retains more diversity but may be less robust on unseen inputs.
  • The generalization-diversity tradeoff: There appears to be an inherent tension between making models generalize well (RLHF's strength) and making them produce varied outputs (SFT's strength) using current fine-tuning methods (Kirk et al.).

In short, there's no clear winner that aces every dimension of creative writing and editing. If you want models that sound less like AI, check out Claude 3.5 Sonnet. If you want more diverse outputs, a small model like Llama 3.1 8B might be a good choice (or check out a non-Instruct model).

But either way: don't forget to do your evals, folks!

Azamat Omuraliev

AI Engineer at Typetone
