[Seaside] How to think about "Unicode spew"

Esteban Maringolo emaringolo at gmail.com
Fri Feb 14 14:14:19 UTC 2020


There is something odd in the code you shared: it has regular tags
(<body><p>) alongside things within angle brackets ( <, >) that are
URLs. Following that, you have &lt; and &gt; entities that seem to
belong to tags.

I suggest you run the input through tidy [1] before doing any HTML
parsing, and avoid "string replacements" such as `copyReplaceAll:
'&lt;' with: '<'`.
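
A minimal workspace sketch of why the blanket replacement is risky,
assuming it turns the &lt; and &gt; entities back into literal angle
brackets. The `sample` string is a made-up fragment modelled on the
page quoted below, not actual parser output, and only core String and
Collection messages are used.

```smalltalk
| sample replaced |

"Hypothetical fragment: real tags plus an escaped URL that is meant to stay text."
sample := '<p>a Frankish &lt;https://www.wikipedia.org/wiki/Franks&gt; noble</p>'.

"The blanket replacement un-escapes every entity, wanted or not."
replaced := (sample copyReplaceAll: '&lt;' with: '<')
	copyReplaceAll: '&gt;' with: '>'.

"Before: two '<' characters, both belonging to real markup (<p> and </p>)."
Transcript show: (sample occurrencesOf: $<) printString; cr.

"After: a third '<' now opens <https://...>, which a browser or HTML
 parser will try to read as a tag instead of showing it as text."
Transcript show: (replaced occurrencesOf: $<) printString; cr.
Transcript show: replaced; cr.
```

Cleaning the input first means the decision about what is markup is
made once, by a tool that understands HTML, rather than by a string
search over the whole page.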

Regards,

[1] http://www.html-tidy.org/

Esteban A. Maringolo

On Fri, Feb 14, 2020 at 11:01 AM Karsten Kusche <karsten at heeg.de> wrote:
>
> That doesn’t look like an encoding problem. The only place these question marks appear is right after a <. Try looking at the source with a hex editor to identify the actual character that follows each <. My guess would be character 0 or something similar.
>
> Karsten
>
> Georg Heeg eK
> Wallstraße 22
> 06366 Köthen
>
> Tel.: 03496/214328
> FAX: 03496/214712
> Amtsgericht Dortmund HRA 12812
>
>
> Am 13. Februar 2020 um 21:27:18, tty (gettimothy at zoho.com) schrieb:
>
> Hi Folks.
>
> Over at http://menmachinesmaterials.com/WikitextParser ***
>
> When hitting HamburgerIcon->Database->Random Page I occasionally get what
> I call "Unicode spew"
>
> Here is a portion of a page.
> *<�!DOCTYPE html><�html class="no-js" lang="en"
> dir="ltr"><�head><�title>WikitextParser<�/title><�meta
> charset="utf-8"/><�link rel="stylesheet" type="text/css"
> href="/files/WADevelopmentFiles/development.css"/>...*
>
>
> However, on the image, if I run the page manually, the resulting XMLElement
> looks just fine.
>
> Here is the thing that caused the spew.
>
> *<body><p> Thierry IV or Theoderic IV ({{circa}} 720{{spaced ndash}}c. 782)
> was a Frankish <https://www.wikipedia.org/wiki/Franks> noble. Count of
> Autun <https://www.wikipedia.org/wiki/Autun> and Toulouse
> <https://www.wikipedia.org/wiki/Toulouse> ; he was thought to be a son of
> Sigebert V <https://www.wikipedia.org/wiki/Sigebert_V> , and grandson of
> Sigebert IV of Raze <https://www.wikipedia.org/wiki/Sigebert_IV_of_Raze> .
> It is now well documented that his supposed Davidic blood was a hoax (see
> Priory of Sion <https://www.wikipedia.org/wiki/Priory_of_Sion> ). Thierry
> married Auda <https://www.wikipedia.org/wiki/Auda_of_France> , daughter of
> Charles Martel <https://www.wikipedia.org/wiki/Charles_Martel> , sister of
> Pepin III <https://www.wikipedia.org/wiki/Pepin_III> .</p>
> Children
> <ul><li><a
> href="https://www.wikipedia.org/wiki/William_of_Gellone">William of
> Gellone</a> (755 – 28 May 812/4)</li><li>Alda of Gellone (born ca.
> 770); married Fredalon</li><li><a
> href="https://www.wikipedia.org/wiki/Adalhelm_of_Autun">Adalhelm of
> Autun</a></li></ul><p>{{Persondata <div/>| NAME = Thierry
> 04| ALTERNATIVE NAMES =| SHORT DESCRIPTION = Frankish noble| DATE OF BIRTH
> =| PLACE OF BIRTH =| DATE OF DEATH =| PLACE OF DEATH
> =}}{{DEFAULTSORT:Thierry 04}} Category:720s births
> <https://www.wikipedia.org/wiki/Category:720s_births> Category:780s deaths
> <https://www.wikipedia.org/wiki/Category:780s_deaths> Category:Counts of
> Autun <https://www.wikipedia.org/wiki/Category:Counts_of_Autun>
> Category:Counts of Toulouse
> <https://www.wikipedia.org/wiki/Category:Counts_of_Toulouse>
> Category:Frankish people
> <https://www.wikipedia.org/wiki/Category:Frankish_people>
> </p><p>{{France-noble-stub}}</p></body>*
>
>
> The method that posts the output is straightforward enough:
>
> *renderParsedOn: html
>     | wikiGrammar wikiParser input actor |
>
>     actor := PEGWikiMediaGeneratorTables new.
>     actor transcripton
>         ifTrue: [Transcript clear].
>
>     wikicode isNil
>         ifTrue: [input := '== Welcome To WikitextParserBrowser ==']
>         ifFalse: [input := wikicode].
>
>     wikiGrammar := PEGParser grammarWikiMediaTables reading positioning.
>     wikiParser := PEGParser parserPEG parse: 'Grammar' stream: wikiGrammar
>         actor: PEGParserParser new.
>     [[output := wikiParser parse: 'Page' stream: input actor: actor]
>         on: Error
>         do: [:ex | output := '
> Error parsing. see Wikicode tab for source
> ']]
>         ensure: [
>             output := ((output asString copyReplaceAll: '<body>' with: '')
>                 copyReplaceTokens: '</body>' with: '').
>             output := (output asString copyReplaceAll: '&gt;' with: '>'
>                 asTokens: false).
>             output := (output asString copyReplaceAll: '&lt;' with: '<'
>                 asTokens: false)].
>     html break; break.
>     html html: output.
>
> *
>
> Is there something I should be doing to "output" to make the garbage go
> away?
>
> thanks in advance
> *** Alpha/Beta dev tool. If you get a DNU just hit the back button and try
> again. Please do not hit Debug (:
>
>
>
> --
> Sent from: http://forum.world.st/Seaside-General-f86180.html
> _______________________________________________
> seaside mailing list
> seaside at lists.squeakfoundation.org
> http://lists.squeakfoundation.org/cgi-bin/mailman/listinfo/seaside

