Semantic HTML as an LLM indexing strategy
LLM-based systems use HTML structure to understand content hierarchy. Semantic HTML is no longer just about accessibility; it is also about discoverability.
We ran an experiment. We created two versions of the same article: one with semantic HTML5 elements (<article>, <section>, <time>, <figure>) and one with generic <div> soup. We measured how often each was cited by ChatGPT, Perplexity, and Claude.
Across the three systems, the semantic version was cited roughly 3.2x more often.
Why HTML structure matters to LLMs
When LLM-based systems process web pages, they do not just read raw text. The crawlers and retrieval pipelines that feed them parse the DOM tree, and that structure provides context:
- <article> signals self-contained content
- <section> with aria-labelledby creates topical boundaries
- <time datetime="..."> provides machine-readable dates
- <figure> + <figcaption> associates images with descriptions
- <dl>, <dt>, <dd> create definition relationships
These elements act as chunk boundaries for retrieval systems. A well-structured page is easier to embed, chunk, and retrieve.
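To make the idea of chunk boundaries concrete, here is a minimal sketch that splits a page into one chunk per top-level <section>, using only Python's standard library. It is an illustration under our own assumptions, not the pipeline any particular LLM product uses; the names SectionChunker, chunk_by_section, and SAMPLE are hypothetical, and nested sections are simply folded into their outermost parent.

```python
from html.parser import HTMLParser


class SectionChunker(HTMLParser):
    """Collect the text of each top-level <section> as one retrieval chunk."""

    def __init__(self):
        super().__init__()
        self.chunks = []    # finished chunks, in document order
        self._depth = 0     # nesting depth of currently open <section> tags
        self._buffer = []   # text fragments of the section being read

    def handle_starttag(self, tag, attrs):
        if tag == "section":
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == "section" and self._depth > 0:
            self._depth -= 1
            if self._depth == 0:
                # Collapse whitespace and close the chunk.
                text = " ".join(" ".join(self._buffer).split())
                if text:
                    self.chunks.append(text)
                self._buffer = []

    def handle_data(self, data):
        # Text outside any <section> is ignored in this sketch.
        if self._depth > 0:
            self._buffer.append(data)


def chunk_by_section(html):
    """Return a list of text chunks, one per top-level <section>."""
    parser = SectionChunker()
    parser.feed(html)
    parser.close()
    return parser.chunks


# Hypothetical sample page, loosely modelled on the article described here.
SAMPLE = """
<article>
  <h1>Vector databases</h1>
  <section aria-labelledby="what">
    <h2 id="what">What they are</h2>
    <p>They store embeddings.</p>
  </section>
  <section aria-labelledby="why">
    <h2 id="why">Why they matter</h2>
    <p>They enable similarity search.</p>
  </section>
</article>
"""

chunks = chunk_by_section(SAMPLE)
for chunk in chunks:
    print(chunk)
```

With <div> soup there is no equivalent signal to split on, so a chunker has to fall back on heuristics such as fixed token windows or class-name guessing.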
The experiment
We published two articles about vector databases:
Version A: Semantic HTML with proper heading hierarchy, <article> wrapper, <section> elements, and schema.org microdata.
Version B: Same text wrapped in <div> elements with class names.
Both had identical CSS styling. Both ranked similarly on Google. But LLM citations differed dramatically:
| System | Version A | Version B |
|---|---|---|
| ChatGPT | 47 citations | 14 citations |
| Perplexity | 38 citations | 12 citations |
| Claude | 29 citations | 9 citations |
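The headline multiplier can be checked directly from the table. The article does not state how the figure was aggregated, so this short sketch computes it two plausible ways: the pooled ratio of total citations, and the mean of the per-system ratios. The variable names are ours.

```python
# Citation counts from the table: (Version A, Version B).
citations = {
    "ChatGPT": (47, 14),
    "Perplexity": (38, 12),
    "Claude": (29, 9),
}

# Per-system ratio of semantic to <div>-based citations.
ratios = {name: a / b for name, (a, b) in citations.items()}

# Pooled ratio: total Version A citations over total Version B citations.
total_a = sum(a for a, _ in citations.values())
total_b = sum(b for _, b in citations.values())
pooled = total_a / total_b

# Unweighted mean of the three per-system ratios.
mean_ratio = sum(ratios.values()) / len(ratios)

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}x")
print(f"pooled: {pooled:.2f}x, mean of ratios: {mean_ratio:.2f}x")
```

Both aggregates come out at roughly 3.25x, and each individual system sits between about 3.2x and 3.4x, so the effect is not driven by a single system.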
Implementation checklist
For every article we publish, we verify:
- Single <h1> per page
- Logical heading hierarchy (no skipped levels)
- <article> for main content
- <section> with headings for topical divisions
- <time datetime="..."> for all dates
- <figure> and <figcaption> for images
- <blockquote cite="..."> for quotations
- <address> for author information
- <nav> for table of contents
- Schema.org JSON-LD for structured data
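Several of these checks are mechanical enough to automate. The sketch below covers a subset of the checklist (single <h1>, no skipped heading levels, presence of <article>, <time> with a datetime attribute, <figure> with a <figcaption>) using Python's standard html.parser. It is an assumption about how such a check could look, not the tooling we actually run; MarkupAudit, audit, GOOD, and BAD are hypothetical names.

```python
from html.parser import HTMLParser


class MarkupAudit(HTMLParser):
    """Record the facts needed for a few checklist items."""

    def __init__(self):
        super().__init__()
        self.headings = []                # heading levels in document order
        self.counts = {}                  # tag name -> number of start tags
        self.times_missing_datetime = 0
        self.figures_missing_caption = 0
        self._figure_stack = []           # per open <figure>: caption seen?

    def handle_starttag(self, tag, attrs):
        self.counts[tag] = self.counts.get(tag, 0) + 1
        if len(tag) == 2 and tag[0] == "h" and tag[1] in "123456":
            self.headings.append(int(tag[1]))
        elif tag == "time" and not dict(attrs).get("datetime"):
            self.times_missing_datetime += 1
        elif tag == "figure":
            self._figure_stack.append(False)
        elif tag == "figcaption" and self._figure_stack:
            self._figure_stack[-1] = True

    def handle_endtag(self, tag):
        if tag == "figure" and self._figure_stack:
            if not self._figure_stack.pop():
                self.figures_missing_caption += 1


def audit(html):
    """Return a list of human-readable problems; empty means all checks pass."""
    page = MarkupAudit()
    page.feed(html)
    page.close()
    problems = []
    if page.headings.count(1) != 1:
        problems.append("expected exactly one <h1>")
    for prev, cur in zip(page.headings, page.headings[1:]):
        if cur > prev + 1:
            problems.append(f"heading level skipped: h{prev} -> h{cur}")
    if page.counts.get("article", 0) == 0:
        problems.append("no <article> element")
    if page.times_missing_datetime:
        problems.append("<time> without datetime attribute")
    if page.figures_missing_caption:
        problems.append("<figure> without <figcaption>")
    return problems


# Hypothetical pages: one that follows the checklist, one that does not.
GOOD = (
    '<article><h1>Title</h1><section><h2>Part</h2>'
    '<time datetime="2024-01-15">Jan 15</time>'
    '<figure><img src="x.png" alt=""><figcaption>Chart</figcaption></figure>'
    '</section></article>'
)
BAD = (
    '<div><h1>Title</h1><h3>Part</h3><time>yesterday</time>'
    '<figure><img src="x.png" alt=""></figure></div>'
)

good_report = audit(GOOD)
bad_report = audit(BAD)
for problem in bad_report:
    print(problem)
```

Items such as <blockquote cite="...">, <address>, and valid JSON-LD need content-aware judgment and are left out of this sketch.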
Accessibility and GEO alignment
This is a rare case where accessibility and generative engine optimization (GEO) point in the same direction. Screen readers and LLMs both benefit from semantic structure. A page that passes WCAG 2.1 AA is likely to be well optimized for LLM retrieval too.
The future
We expect search engines to increasingly weight semantic structure. As LLM-based search grows, the incentives for clean HTML will strengthen. The best time to fix your markup was when you built the site. The second best time is now.