Semantic HTML as an LLM indexing strategy

We ran an experiment. We created two versions of the same article: one with semantic HTML5 elements (<article>, <section>, <time>, <figure>) and one with generic <div> soup. We measured how often each was cited by ChatGPT, Perplexity, and Claude.

The semantic version was cited roughly 3.2 times as often.

Why HTML structure matters to LLMs

When LLM-backed systems process web pages, they do not just read flat text. Their crawlers and retrieval pipelines can work from the parsed HTML, and that structure provides context:

<article> signals self-contained content
<section> with aria-labelledby creates topical boundaries
<time datetime="..."> provides machine-readable dates
<figure> + <figcaption> associates images with descriptions
<dl>, <dt>, <dd> pair terms with their descriptions

These elements can act as natural chunk boundaries for retrieval systems, so a well-structured page is easier to chunk, embed, and retrieve.
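To make the chunk-boundary idea concrete, here is a minimal sketch that splits a page into one chunk per top-level <section>, using only Python's standard-library HTMLParser. It is an illustration under assumptions, not a description of how any particular LLM or search product works: the names SectionChunker and chunk_by_section and the sample markup are hypothetical.

```python
from html.parser import HTMLParser


class SectionChunker(HTMLParser):
    """Collects the visible text of each top-level <section> as one chunk."""

    def __init__(self):
        super().__init__()
        self.chunks = []    # finished chunks, in document order
        self._depth = 0     # number of <section> elements currently open
        self._buffer = []   # text pieces of the chunk being built

    def handle_starttag(self, tag, attrs):
        if tag == "section":
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == "section" and self._depth > 0:
            self._depth -= 1
            # Nested sections are folded into their outermost parent.
            if self._depth == 0 and self._buffer:
                self.chunks.append(" ".join(self._buffer))
                self._buffer = []

    def handle_data(self, data):
        # Text outside any <section> is ignored in this sketch.
        if self._depth > 0 and data.strip():
            self._buffer.append(data.strip())


def chunk_by_section(html):
    parser = SectionChunker()
    parser.feed(html)
    parser.close()
    return parser.chunks


# Hypothetical sample page, invented for illustration.
SAMPLE = """
<article>
  <h1>Vector databases</h1>
  <section aria-labelledby="what">
    <h2 id="what">What they are</h2>
    <p>They store embeddings.</p>
  </section>
  <section aria-labelledby="when">
    <h2 id="when">When to use one</h2>
    <p>When you need similarity search.</p>
  </section>
</article>
"""

for chunk in chunk_by_section(SAMPLE):
    print(chunk)
```

With <div> markup there is no equivalent signal, so a chunker has to fall back on heuristics such as class names or fixed token windows.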

The experiment

We published two articles about vector databases:

Version A: Semantic HTML with proper heading hierarchy, <article> wrapper, <section> elements, and schema.org microdata.

Version B: Same text wrapped in <div> elements with class names.

Both had identical CSS styling. Both ranked similarly on Google. But LLM citations differed dramatically:

System	Version A (semantic HTML), citations	Version B (<div> markup), citations
ChatGPT	47	14
Perplexity	38	12
Claude	29	9
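The headline ratio can be recomputed from the table. The short script below is only an arithmetic check on the counts listed above. The variable names are ours for illustration, and it shows two reasonable ways to summarize the ratio, since the result depends slightly on which one is used.

```python
# Citation counts copied from the table above: (Version A, Version B).
citations = {
    "ChatGPT": (47, 14),
    "Perplexity": (38, 12),
    "Claude": (29, 9),
}

# Ratio of Version A to Version B citations for each system.
per_system = {name: a / b for name, (a, b) in citations.items()}

# Summary 1: average of the three per-system ratios.
mean_ratio = sum(per_system.values()) / len(per_system)

# Summary 2: pool all citations, then take a single ratio.
total_a = sum(a for a, _ in citations.values())
total_b = sum(b for _, b in citations.values())
pooled_ratio = total_a / total_b

for name, r in per_system.items():
    print(f"{name}: {r:.2f}x")
print(f"mean of per-system ratios: {mean_ratio:.2f}x")
print(f"pooled ({total_a} vs {total_b}): {pooled_ratio:.2f}x")
```

The per-system ratios all fall between about 3.2 and 3.4, so the effect is consistent across the three systems rather than driven by one of them.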

Implementation checklist

For every article we publish, we verify:

Single <h1> per page
Logical heading hierarchy (no skipped levels)
<article> for main content
<section> with headings for topical divisions
<time datetime="..."> for all dates
<figure> and <figcaption> for images
<blockquote cite="..."> for quotations
<address> for author contact information
<nav> for table of contents
Schema.org JSON-LD for structured data
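
Several of these checks are mechanical and could be automated. The following is a rough sketch using Python's standard-library HTMLParser, covering only a subset of the list (single <h1>, no skipped heading levels, <article> present, <figure> paired with <figcaption>, <time> carrying datetime). The names MarkupAudit and audit, the sample markup, and the placeholder date are hypothetical, and the figure check is a crude count comparison rather than a true nesting check.

```python
from html.parser import HTMLParser

HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class MarkupAudit(HTMLParser):
    """Records the tags needed for a few checklist items."""

    def __init__(self):
        super().__init__()
        self.heading_levels = []   # e.g. [1, 2, 2, 3] in document order
        self.counts = {"article": 0, "figure": 0, "figcaption": 0}
        self.time_without_datetime = 0

    def handle_starttag(self, tag, attrs):
        if tag in HEADINGS:
            self.heading_levels.append(int(tag[1]))
        elif tag in self.counts:
            self.counts[tag] += 1
        elif tag == "time" and "datetime" not in dict(attrs):
            self.time_without_datetime += 1


def audit(html):
    """Returns a list of human-readable problems; empty means all checks passed."""
    parser = MarkupAudit()
    parser.feed(html)
    parser.close()

    problems = []
    levels = parser.heading_levels
    if levels.count(1) != 1:
        problems.append(f"expected exactly one <h1>, found {levels.count(1)}")
    # A heading may go at most one level deeper than the one before it.
    for prev, cur in zip(levels, levels[1:]):
        if cur > prev + 1:
            problems.append(f"heading level skipped: h{prev} followed by h{cur}")
    if parser.counts["article"] == 0:
        problems.append("no <article> element")
    if parser.counts["figure"] != parser.counts["figcaption"]:
        problems.append("<figure> and <figcaption> counts differ")
    if parser.time_without_datetime:
        problems.append(
            f"{parser.time_without_datetime} <time> element(s) lack datetime"
        )
    return problems


# Hypothetical pages, invented for illustration.
GOOD = (
    '<article><h1>T</h1><section><h2>A</h2>'
    '<time datetime="2024-01-05">5 Jan</time></section></article>'
)
BAD = (
    '<div><h1>T</h1><h3>A</h3><time>5 Jan</time>'
    '<figure><img src="x.png" alt=""></figure></div>'
)

print(audit(GOOD))
for problem in audit(BAD):
    print(problem)
```

A check like this fits naturally into a pre-publish step. The remaining items (quotations, author contact details, table of contents, structured data) need either more context or human review.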

Accessibility and GEO alignment

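This is a rare case where accessibility and generative engine optimization (GEO) are closely aligned. Screen readers and LLM retrieval pipelines both benefit from semantic structure. A page that passes WCAG 2.1 AA will usually have much of this structure in place already, although conformance alone does not guarantee good LLM retrieval.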

The future

We expect search engines to increasingly weight semantic structure. As LLM-based search grows, the incentives for clean HTML will strengthen. The best time to fix your markup was when you built the site. The second best time is now.