Note that Georg Heeg's public wiki (which I found once and never again :)) has a T-Gen-based HTML parser that prefers input sanitized with HTML Tidy.
The heeg.de domain appears to be down, but Google found: www.heeg.de/english/services.downloads.html and http://www.heeg.de/english/start.html.
--- Noel