[squeak-dev] editing the Squeak wiki XML file library

Chris Cunnington brasspen at gmail.com
Fri Oct 24 16:47:00 UTC 2014


https://www.dropbox.com/s/xs1gjsi93i4d7rn/pages.zip?dl=0

I was talking to Tim at CSV about the Squeak wiki and he was very interested in editing it. He has a number of ideas for pruning the Squeak wiki. 

As such, I provide here the entire Squeak wiki as it relates to its XML files. This is all the text content. There are no supporting files for images, code files, etc. 
It’s 22 MB if you want to download it, and it unpacks to 6191 XML files. The latest number in box2 is 6199, so it’s almost complete. Updating the content is as easy as copying the eight missing files into your set.

After describing the way the XML is laid out, Tim said: “Well maybe we can just sed and grep away some of the pages.” Precisely. 
There is one caveat to be aware of, and Tim was aware of it: links to other pages. I don’t think it’s really a problem. Here’s why.

Take for example a snippet from 6182.xml:

<text>!Introduction
*2726* started out very useful and very popular.

The links all have asterisks around them (i.e. *2726*). That’s not some pointer to an object. That’s a text reference to a file named 2726.xml. Simple, right? The page formatter replaces that with the page name stored in that file. A quick look tells me that 2726.xml is titled SqueakMap:

<name>SqueakMap</name>
<text>SqueakMap is a tool used by the Squeak community 

The <name></name> token is interpolated into 6182.xml when it’s formatted so the link appears as SqueakMap and it’s a hyperlink to the other file. 
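
A minimal sketch of that interpolation in Python, for anyone who wants to experiment. It is not the wiki's actual formatter. It assumes a link is always a run of digits between asterisks and that each file holds a single <name> element; the href format and the function names are made up for illustration.

```python
import re

# Assumption: a link is a run of digits wrapped in asterisks, e.g. *2726*.
LINK = re.compile(r"\*(\d+)\*")
# Assumption: the page title sits in the first <name>...</name> pair of the file.
NAME = re.compile(r"<name>(.*?)</name>", re.DOTALL)


def page_name(xml_source):
    """Return the contents of the first <name> element, or None if absent."""
    match = NAME.search(xml_source)
    return match.group(1).strip() if match else None


def render_links(text, names):
    """Replace each *NNNN* with a hyperlink labelled with the target's name.

    names maps a page number (as a string) to that page's <name>.
    Numbers with no known name are left exactly as they were.
    """
    def substitute(match):
        number = match.group(1)
        name = names.get(number)
        if name is None:
            return match.group(0)
        # Hypothetical href; the real wiki's URL scheme is not shown here.
        return '<a href="{0}">{1}</a>'.format(number, name)

    return LINK.sub(substitute, text)
```

For example, with names = {"2726": "SqueakMap"}, the line "*2726* started out very useful" comes back with the link labelled SqueakMap.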

All this to say, if Tim or others did want to edit this content, the only thing you need to be aware of before throwing out files is the numbers between the asterisks (i.e. *4444*). Say you wanted to delete 3333.xml (and its companion 3333.old), then you might want to do a sed/grep for *3333* over the balance of the files (escaping the asterisks, which are regex metacharacters) to look for links you have to account for. Perhaps create a script to erase those detected links in a clean way? If it’s a hyperlink in a sentence, then you’d need to replace *3333* with the <name></name> token of 3333, so the sentence still reads clearly. It would simply no longer be a hyperlink.
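
Such a script could look something like the following Python sketch. It works on a dictionary of file name to file text so it can be tried without touching the real files; the function names are hypothetical, and reading and writing the files (including their encoding) is left out on purpose.

```python
import re


def link_pattern(number):
    """Build a regex for the literal token *number*.

    The asterisks are escaped because a bare "*" is a regex quantifier.
    """
    return re.compile(r"\*" + re.escape(str(number)) + r"\*")


def pages_linking_to(pages, number):
    """Return the sorted names of files whose text contains *number*.

    pages maps a file name to that file's text.
    """
    pattern = link_pattern(number)
    return sorted(name for name, text in pages.items() if pattern.search(text))


def unlink(text, number, replacement):
    """Replace every *number* in text with plain replacement text.

    A function is passed to sub() so that backslashes in the replacement
    are inserted literally rather than read as group references.
    """
    return link_pattern(number).sub(lambda match: replacement, text)
```

Before deleting 3333.xml you would call pages_linking_to(pages, 3333) to see what points at it, then run unlink over each of those files with the <name> of 3333 as the replacement.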

So, editing this set of files is pretty trivial. I bet somebody wrote a program to create a graph showing links between XML/HTML files that could be configured to look for tokens of the form *1234*.
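
Failing that, a link graph is small enough to build by hand. The Python sketch below is one way to do it, assuming files are named NNNN.xml and links are digits between asterisks. The UTF-8 encoding in read_pages is a guess, and all function names are invented for the example.

```python
import re
from pathlib import Path

# Assumption: a link is a run of digits wrapped in asterisks, e.g. *1234*.
LINK = re.compile(r"\*(\d+)\*")


def read_pages(directory):
    """Read every NNNN.xml in directory into {page number: text}.

    The encoding is a guess; bytes that do not decode are replaced, which is
    fine for finding links but not for writing the text back out.
    """
    pages = {}
    for path in Path(directory).glob("*.xml"):
        if path.stem.isdigit():
            pages[path.stem] = path.read_text(encoding="utf-8", errors="replace")
    return pages


def link_graph(pages):
    """Return {page: numerically sorted list of distinct pages it links to}."""
    return {
        number: sorted(set(LINK.findall(text)), key=int)
        for number, text in pages.items()
    }


def inbound(graph):
    """Invert the graph: {target: numerically sorted list of pages linking to it}."""
    result = {}
    for source, targets in graph.items():
        for target in targets:
            result.setdefault(target, []).append(source)
    return {target: sorted(sources, key=int) for target, sources in result.items()}
```

Pages that never appear as a key in inbound(link_graph(read_pages("pages"))) have nothing linking to them, which makes them the easiest candidates for pruning. The directory name "pages" is only a placeholder.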

FWIW, 
Chris 

