Safely normalize HTML for testing, semantic diffs and readability.
Semantics-preserving and semantics-determined reformatting:
- whitespace collapsed, making no assumptions about CSS styling
- attributes, class values and inline style declarations sorted
- self-closing tags, void elements, quotation marks, escaped text and more normalized
- lines intelligently reflowed and indented for readability
- declare tags, attributes to be ignored, or normalized with regex
- optional conservative mode that assumes CSS might change the layout semantics of most anything.
- tag soup left as-is, not auto-fixed
which enables:
- program output and test case format independence
- readable tests
- clean and succinct diffs on test failures
You are developing software that outputs HTML. You want to write tests based on expected outputs. For at least these tests the exact formatting doesn't matter.
<blockquote id='wt13', class='pullquote, lyrics'>
<ul>
<li>Wild thing, <p>I think I love you</p></li>
<li>But I wanna know for sure</li>
</ul>
</blockquote>
(B)
Actual Output
<blockquote class="lyrics, pullquote", id="wt13" ><ul><li>Wild thing,<p>I think I love you</p></li><li>But I wanna know for sure</li></ul></blockquote>'
Even though (B) has attributes and class values in different orders, double-quotes instead of single and all insignificant whitespace removed, it is semantically identical to (A). The test should pass. You don't want to change the program output, or you can't because it's out of your control. What do you do?
You could try to do a semantic comparison by parsing both into DOM trees and then analyzing. That's very hard to implement. Computationally expensive. Unnecessary. You can't easily produce a diff that succinctly highlights any errors.
Simply pass both expected and actual values through htmlnorm
and use your
chosen assertion library's simple string comparison:
import htmlnorm from 'htmlnorm'
let expected = ...
let actual = program.render(...)
assert.strictEqual(htmlnorm(actual), htmlnorm(expected))
(C)
Result of Both htmlnorm(A)
and htmlnorm(B)
<blockquote class='lyrics, pullquote' id='wt13'>
<ul>
<li>
Wild thing,
<p>I think I love you</p>
</li>
<li>
But I wanna know for sure
</li>
</ul>
</blockquote>
The test will now pass
as it should.
Good assertion libraries produce a diff when a string equality assertion fails. By processing with htmlnorm
first, the diff is better.
Before:
TODO
After:
TODO
By using htmlnorm
on both expected and actual values, your tests achieve
format independence. Your program can generate HTML in whatever form it needs
— maybe for programming convenience, or execution efficiency, or because it uses
a library it can't control. It doesn't matter. Your tests can be written however
is best for the tests. And you won't have to update your tests just because a
new version of the code adds or removes inconsequential whitespace or any other
difference handled by htmlnorm
.
See also Why semantic comparison is needed for stable tests.
Many pretty printers basically make adjustments to the source HTML, removing whitespace here, adding newlines and spaces there. There is no guarantee that two semantically equivalent strings of HTML will be identical after pretty-printing.
htmlnorm
fully parses the HTML parser and rewrites it from scratch.
In the process it mimics browser logic for collapsing whitespace and performs
all the other normalizations listed above.
I tried various HTML pretty printers, but they all came up short. You can't just hack a regex to collapse whitespace in accordance with HTML rules. You can't just take a run of whitespace and reducing it to one or zero spaces.
Whitespace rules can be tricky
remove all spaces | keep a space before tag | keep a space after tag | |
---|---|---|---|
before | ⇢␠␠␠␠<em>⇢Thing</em> |
Wild⇢<em>⇢Thing</em> |
Wild<em>⇢Thing</em>. |
after | <em>Thing</em> |
Wild <em>Thing</em> |
Wild<em> Thing</em>. |
So even if your program suddenly started producing the following zany output, maybe because you switched HTML rendering libraries, your tests will still pass as long as it has not changed semantically:
(D)
Semantically the same as (A), (B) and (C), will normalize to (C)
\t<blockquote class="lyrics, pullquote", id="wt13">\t<ul>\t<li>\tWild thing,\t<p>\tI think I love you </p>
</li> <li> But\tI wanna know for sure </li>
</ul> </blockquote>
One browser behavior that htmlnorm
does not mimic is
tag soup parsing. Doing so would hide
test failures. htmlnorm
leaves malformed HTML as-is.
For a comprehensive explanation and illustration of everything htmlnorm
does, [TODO, link to DOCS.md]
> npm install htmlnorm
htmlnorm
is an ESM module exporting one function:
import htmlnorm from 'htmlnorm'
let html_with_a_personality = // html producing code here
let normalized_html = htmlnorm(html_with_a_personality)
If you are unfamiliar with ESM modules or unsure about how to use them, this might help you. I don't plan to publish a CommonJS build unless someone makes a very compelling case. Node.js fully supports ESM now, and everyone should just make the switch.
This alpha version of htmlnorm
has no options, but I expect to add some in
later releases if there is enough demand. For example:
- User choice on reformatting strategies, particularly element wrapping and indentation.
- Control over which kinds of differences are considered significant. For example, some tests may care whether a character was escaped, even if the HTML spec didn't require it.
- support option to not close tags? (right now we just error out.) See https://html.spec.whatwg.org/#syntax-tag-omission
There are also some good ideas worth adopting from other libraries.
This is extensively tested, but it is still the v0.1. If something doesn't work as you expect/want, instead of just walking away, please submit an issue. I promise a quick turn-around:
- bugs will be fixed pronto,
- feature that aligned with the goals of
htmlnorm
and otherwise make sense will be put on the roadmap. In any case I will respond to your request with an honest answer. - requests that I don't think fit will get an quick and honest reply explaining why.
it's easy to make a request, just format your request like the test cases: an example input and your expected output.
As long as they make sense, and don't add unnecessary UX complexity.
I hunted high and low, though no doubt missed a possibly excellent library out there. Let me know. Here are comments about the ones that I did find:
library | strategy | why it didn't work for me |
---|---|---|
prettydiff | semantic diff? | Is a rather large library supporting a large number of languages (code as well as markup), and using a complex parser (sparse). I hesitate to use something so large. See prior README. See MY TODO.md notes. Seems to do whitespace collapsing correctly! CC0 Creative Commons license. |
Open Web Components Semantic Dom Diff | semantic diff | This is appears to be a very powerful tool! It does a lot of good things, including things I would like to add to future versions of htmlnorm . But it removes more than whitespace with no way to disable that. |
AngleSharp.Diffing | semantic diff | Another powerful library, but it's written in C#. I has a lot of configuration options, including control for how whitespace is handled, but since it's C# I wasn't able to test whether it does properly. |
js-beautify | pretty-print | only does simplistic collapsing of whitespace runs into a single space. Can try it online: https://beautifier.io |
pretty | pretty-print | just a customization of js-beautify , same issues |
diffable-html | "opinionated formatter" focused on diffs | "Be aware that this plugin is intended for making HTML diffs more readable. We took the compromise of not dealing with white-spaces like the browsers do." adds whitespace between inline elements where there was none, materially changing it doesn't sort attribs or classes. prob more diffs. |
htmlnorm | diffable-html | prettydiff | Semantic Dom Diff | js-beautify | |
---|---|---|---|---|---|
sort attribs | ✅ | ❌ | |||
sort classes | ✅ | ❌ | |||
properly preserves inline whitespace | ✅ | ❌1 | ✅ (need to confirm) | ||
Conservative mode | ✅ | ❌ | |||
small, focused library | ✅ | ❌ |
Footnotes
-
diffable-html adds leading and trailing whitespace to inline elements in order to display their text content on separate lines. This is NOT a semantics preserving transformation. ↩