Why Large Language Models (LLMs) Do Not Use Schema Markup in Their Core Training

Published by NewsPR Today | August 2025

Structured data, or schema markup, has quickly become part of the fabric of the modern SEO industry. It helps search engines interpret content more clearly and improves the way information is presented in search results.

Meanwhile, Large Language Models (LLMs) like GPT-4 have become disruptive technologies that generate coherent, context-aware text without needing pre-specified templates.

This comparison raises the question: Why don’t LLMs make use of schema markup in their core training corpora?

The reason lies in how LLMs are trained, how they internally process language, and what these models are designed to do. This article works through those aspects: what tokenization is, how schema is handled during training, some illustrative examples, use cases where schema and LLMs intersect, the benefits and difficulties involved, and a brief discussion of whether schema could be used by LLM-based systems in the future.

What is Schema Markup?

Schema markup is a subset of structured data that is written in a standardized format (e.g. JSON-LD, RDFa, Microdata). It is a fragment of code that you can add to webpages to enable search engines to understand the content in a more accurate manner. For instance, a recipe page may have schema markup where you can mark the name of the meal, its cooking time, calorie content, etc.

Search engines such as Google and Bing process this kind of data to create rich results, for example for recipes, events or product reviews. Schema markup works because it is explicit rather than implicit: it tells the machine exactly what the content in question pertains to, with no inference needed.

Freeform text, on the other hand, is implicit and must be interpreted. The function of an LLM is to process this implicit information and make sense of it without the help of structured metadata.


How Large Language Models Process Text

Tokenization

LLMs are trained on huge amounts of unstructured text by cutting it into smaller pieces called tokens. Tokenization does not rely on rules of grammar; it divides text into a sequence of tokens (typically subword units). For instance, the fragment:

"@type": "Organization"

would not be treated as a single concept. Instead, it is broken down into individual tokens such as “@”, “type” and “Organization”. Each of these is just a token as far as the model is concerned, the same as any other it sees in text, regardless of whether it comes from schema markup, a work of fiction, or a status update.
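To make this concrete, here is a toy sketch of how such a fragment falls apart into tokens. Real tokenizers (BPE, WordPiece) use learned subword vocabularies, so the exact splits differ; the crude regex below is only an illustrative stand-in:

```python
import re

def toy_tokenize(text):
    # Split runs of word characters from punctuation; a crude
    # stand-in for learned subword tokenizers such as BPE.
    return re.findall(r"\w+|[^\w\s]", text)

tokens = toy_tokenize('"@type": "Organization"')
print(tokens)
# ['"', '@', 'type', '"', ':', '"', 'Organization', '"']
```

Note that nothing in the output marks “@” plus “type” as a schema property; each piece is an ordinary token in the stream.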

Training Objective

The training objective of an LLM is to predict the next token in a sequence, over billions of samples. Over time, the model builds up statistical priors on how tokens are likely to co-occur. There is, in principle, no innate marker of whether a given token comes from schema or from free text.

That is to say, the explicit relationships that schema markup encodes are effectively “flattened” into the same space as ordinary language. The structural signal is discarded not by design but as a by-product of self-supervised training, in which LLMs learn from patterns rather than obeying rules.
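The next-token objective can be illustrated with a toy bigram model: counting which token follows which, a crude stand-in for what LLMs learn at scale. Schema tokens in the corpus would be counted exactly the same way as any other token:

```python
from collections import Counter, defaultdict

corpus = "the model predicts the next token in the sequence".split()

# Count bigram transitions: a miniature version of the
# next-token objective LLMs optimize over billions of examples.
transitions = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    transitions[prev][nxt] += 1

def predict_next(word):
    # Return the most frequent continuation seen in "training".
    return transitions[word].most_common(1)[0][0]

print(predict_next("the"))  # one of: 'model', 'next', 'sequence'
```

Nothing in the counts records where a token came from; only co-occurrence statistics survive.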

Why Schema Is “Destroyed” in Tokenization

The magic of schema markup is its explicit nature. Labelling only works when it is applied consistently and properly.

For example, the schema property “@type”: “Organization” specifies unambiguously that the entity is an organization. But this explicitness does not survive tokenization.

The model does not understand the word as a label; “Organization” is only a token, just like the same word in a sentence such as “She works for an international organization.” Nor does the character @ carry any special meaning as a schema marker. It is a symbol like any other, interpreted purely from context.

This means that, at training time, the value of schema markup to an LLM is reduced to just another string of text. The meaningful structure is lost, and the relationships the markup was meant to express no longer exist.

Examples

Here’s an example of a schema markup:

{
  "@context": "https://schema.org",
  "@type": "Person",
  "name": "John Smith",
  "jobTitle": "Software Engineer",
  "affiliation": {
    "@type": "Organization",
    "name": "TechCorp"
  }
}

For a search engine, this markup makes it obvious that John Smith is a person, that he has a specific job title, and that he works for a particular company. Each property serves its purpose.

For an LLM, however, tokenization yields a sequence of tokens like {, “@”, “context”, “Person”, “name”, “jobTitle”, “affiliation”, “Organization”. The model does not retain their full hierarchical relationship. It does not learn the meaning of the properties, only that certain tokens are closely associated with others.
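The contrast can be sketched with Python’s standard json and re modules: a JSON parser preserves the nesting that gives the markup its meaning, while a flat token stream (again approximated with a crude regex) does not:

```python
import json
import re

markup = '''{
  "@context": "https://schema.org",
  "@type": "Person",
  "name": "John Smith",
  "jobTitle": "Software Engineer",
  "affiliation": {"@type": "Organization", "name": "TechCorp"}
}'''

# A parser keeps the hierarchy: affiliation is nested inside the person.
data = json.loads(markup)
print(data["affiliation"]["name"])  # TechCorp

# Tokenization flattens the same text into a sequence with no nesting.
flat = re.findall(r"\w+|[^\w\s]", markup)
print(flat[:8])  # ['{', '"', '@', 'context', '"', ':', '"', 'https']
```

In the flat sequence, “Organization” is no longer attached to “affiliation”; the relationship exists only as proximity in a stream of tokens.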

The distinction between “jobTitle” as a schema property and “job title” as ordinary text therefore becomes essentially invisible: both are reduced to token sequences, and the model has no reliable signal that one of them carries an explicit semantic label.

Use Cases Where Schema Could Intersect with LLMs

Although schema does not influence the core training of LLMs, there are scenarios where it could play a role when combined with these models:

  1. Retrieval-Augmented Generation (RAG): Schema could be used to retrieve relevant structured information, which is then passed to an LLM for contextualized responses.
  2. Domain-Specific Fine-Tuning: An LLM fine-tuned on schema-rich datasets for a narrow purpose, such as product catalogues or medical data, could learn to associate tokens with specific meanings more consistently.
  3. Hybrid Search Systems: Search engines already combine structured and unstructured data. LLMs integrated into these systems could leverage schema as part of a broader pipeline, even if not in their core training.
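A minimal sketch of the RAG scenario, where schema types are used as an explicit retrieval filter before the text ever reaches the model. The records list, retrieve, and build_prompt are hypothetical names, not part of any real framework:

```python
# Hypothetical structured records, as might be extracted from
# schema.org markup on crawled pages.
records = [
    {"@type": "Person", "name": "John Smith", "jobTitle": "Software Engineer"},
    {"@type": "Organization", "name": "TechCorp"},
]

def retrieve(query, schema_type):
    # Filter by explicit @type first, then do a naive keyword
    # match on the property values.
    return [r for r in records
            if r["@type"] == schema_type
            and any(query.lower() in str(v).lower() for v in r.values())]

def build_prompt(query, hits):
    # Serialize the structured facts into plain-text context
    # that an LLM can consume.
    facts = "; ".join(f"{k}={v}" for r in hits for k, v in r.items())
    return f"Context: {facts}\nQuestion: {query}"

hits = retrieve("smith", "Person")
print(build_prompt("Where does John Smith work?", hits))
```

Here the schema’s explicitness does real work at the retrieval stage, even though the LLM itself still only sees plain text.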

Benefits of Schema for Machines

Although schema is not beneficial to LLMs in their raw training phase, it is clearly advantageous beyond training:

  • Clarity and Rigour: Schema prevents vagueness by making relationships explicit.
  • Search Rankings: It enables rich snippets for articles and pages.
  • Interoperability: Schema markup is a standardized format that both machines and humans can read.
  • M2M Communication: When systems communicate (e.g., via APIs or knowledge graphs), explicit structure is essential.

For these reasons, schema remains essential for search engines, web applications, and knowledge graphs.

The Difficulties of Schema in LLM Training

Loss of Explicitness

As noted, tokenization removes the structured nature that makes schema useful. LLMs do not preserve hierarchies or the relationships between properties and their values.

Scale and Coverage

Schema markup is also far from universal. Many sites do not use it at all, and when they do, it can be incomplete or contain errors. For LLMs trained on large datasets, schema markup would be only a small and unreliable fraction of the training data.

Purpose Misalignment

The main goal of an LLM is to understand unstructured text. Introducing schema does not fit that objective. Instead, it adds redundancy, as LLMs already pick up the same information from freeform language.

Future Outlook

While schema markup does not lend itself to LLM core training, it may become more pertinent for hybrid systems. For instance, structured data can be incorporated at the retrieval stage of Retrieval-Augmented Generation to obtain more accurate results for factual queries. Similarly, enterprise applications might combine knowledge graphs, schema and LLMs to provide outputs that are both explainable and trustworthy.

Over a longer horizon, as AI models increasingly combine symbolic reasoning with statistical prediction, schema may become more relevant. For the moment, however, its explicitness does not survive LLM tokenization.

Conclusion

Schema markup and large language models are intended for different purposes. Schema is explicit, ordered and rule-based, which makes it very useful for search engines and machine-to-machine communication. LLMs, by contrast, feed on freeform, unstructured text via tokenization, which obliterates schema’s explicitness.

While schema does not influence LLMs’ core training, it still plays an important role in the wider digital information ecosystem. The most likely way forward is hybrid systems, where structured data is used for retrieval and validation, and LLMs handle natural language understanding and generation.

Understanding this division of labour allows SEOs, AI experts and data scientists to use each tool where it applies, rather than pitting one against the other.

About Nitesh Gupta

Hi, I'm Nitesh Gupta, SEO Manager at NewsPR Today. As a writer and digital marketing enthusiast, I simplify Google algorithm updates, AI advancements, and digital trends. At NewsPR Today, we inform, educate, and empower readers with clear, up-to-date insights for...
