Below is a copy of a recent discussion on the box-admins Slack channel. I am copying it here to the mailing list so the discussion will not be lost. Some of this is administrivia but it includes Levente's explanation of why this is a problem on our Seaside/SqueakSource services, and what can be done to improve the situation.

In addition to blocking the bot traffic, it also may be possible (details below) to make Seaside much less vulnerable to this activity.

Dave


=== thread from slack chat below ===

lewis
  4 days ago
I see some evidence that squeaksource.com running on dan.box.squeak.org is being hit by a bot of some sort. I can see activity in the Squeak process browser (VNC connection), and I also see activity in /proc/<squeakpid>/fd/ that looks like it may be someone scanning projects through the squeaksource.com service. This may be the source of the high system load and sluggish response that we have been seeing, and it does seem to be getting worse over time. Is there any kind of log or utility that I can use to get an idea of where these connections are coming from? These would be connections routed through alan to connect to dan. Thanks for any tips or suggestions.


leves
  3 days ago
I think I have mentioned it a couple times that ~99% of all traffic is bot traffic.
The way seaside handles urls is very different than how the rest of the web does, and that confuses bots. They think that the urls they get with the session id and page key can be visited later, but when they do, they'll just create a new session with many new links to visit.
Seaside's session management is quadratic overall: each create/access/delete of a session takes as many operations as there are existing sessions.
A long time ago, I created an alternative session registry for Seaside 2.8, that requires amortized constant time to create/access/delete a session. http://leves.web.elte.hu/linkeddictionary/ . But the seaside team decided to go down a different path about session management. It became pluggable, so my version couldn't be used since Seaside 2.9.
Anyway, we can filter out some of the bot traffic to reduce the load if there's need.
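
To illustrate the difference between a per-operation cost that grows with the number of sessions and an amortized constant one, here is a minimal Python sketch. It is not the Seaside-Registry code and makes no claim about how that package is implemented; it only shows the general idea of a hash table whose entries are also kept in access order, so that expired sessions can be dropped from the front without scanning everything. The class name, method names, and the explicit `now` parameter are all hypothetical.

```python
from collections import OrderedDict


class SessionRegistry:
    """Hypothetical registry: hash lookup plus least-recently-used ordering."""

    def __init__(self, timeout_seconds):
        self.timeout = timeout_seconds
        # key -> (last_access_time, session); oldest access is at the front
        self._sessions = OrderedDict()

    def __len__(self):
        return len(self._sessions)

    def add(self, key, session, now):
        self._expire(now)
        self._sessions[key] = (now, session)
        self._sessions.move_to_end(key)  # most recently used goes last

    def get(self, key, now):
        self._expire(now)
        entry = self._sessions.get(key)
        if entry is None:
            return None
        self._sessions[key] = (now, entry[1])  # refresh the access time
        self._sessions.move_to_end(key)
        return entry[1]

    def remove(self, key):
        self._sessions.pop(key, None)

    def _expire(self, now):
        # Entries are ordered by last access, so stop at the first live one.
        # Each session is popped at most once, hence amortized constant time.
        while self._sessions:
            _, (last_access, _) = next(iter(self._sessions.items()))
            if now - last_access < self.timeout:
                break
            self._sessions.popitem(last=False)
```

With this layout a flood of new bot sessions still consumes memory until the sessions time out, but creating or looking up one session no longer gets slower as the number of sessions grows.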


leves
  3 days ago
And to answer your question, yes, there are logs on alan.
/var/log/nginx/squeaksourcecom-access.log has the current day's log (according to UTC) and /var/log/nginx/squeaksourcecom-access.log.1 has the previous day's log. The latter file currently has 2.6 million entries, so there were that many requests.


leves
  3 days ago
Just noticed that you don't have a user on alan yet. I can either create a user for you, or I can copy some of the files over to dan.


lewis
  3 days ago
Thanks Levente. I would appreciate if you can give me an account on alan. I will use it with care, and it will help me to figure out problems like this. Thank you, and thank you for the explanation of the Seaside issues.


lewis
  3 days ago
And yes, if there is a way to filter out some of the bot traffic, I think we are at a point where it is becoming necessary. I am mainly watching dan.box.squeak.org but I expect that the same issues apply to our source.squeak.org server on andreas.box.squeak.org.


lewis
  3 days ago
@leves the Seaside that we are using in our SqueakSource servers appears to be about 15 to 20 years old, with some local patches to keep it working in later Squeak images. I am not sure of the history behind this, but I don't think that our squeaksource servers really care what version of Seaside they are running on, just as long as it works. If you can point to any other version of Seaside that contains your alternative session registry for Seaside 2.8, then maybe we should try it? I am happy to work on it.


tim
  3 days ago
I can aver that current-ish Seaside (3.4.etc) works decently on Squeak 6+. There’s a few tweaks I have that are submitted (kinda) but not yet adopted.


lewis
  3 days ago
I guess my question is this - since we obviously do not care about being current with Seaside (we are running a 20 year old version now), is there some runnable version of it that does contain Levente's enhancement that we could use instead? I do not care about being "up to date" I just want it to work.


tim
  1 day ago
That’s an interesting question.


tim
  1 day ago
On one side, taking the current image and loading Levente’s session changes is ‘simple’ for certain definitions of the word. It changes as little as possible BUT leaves us running very old code.


tim
  1 day ago
On the other side, adopting a current image and Seaside version involves more changes but brings in a couple of decades of improvements.


lewis
  1 day ago
I'm not too interested in a couple of decades of improvements unless the improvements actually improve something. But if @leves has an alternative session registry that addresses the issues we are seeing on the squeaksource boxes, then I would be very happy to try running it regardless of what version of Seaside it uses.


lewis
  1 day ago
I guess I should also say that I really don't know what version of Seaside we are running now for our SqueakSource services. The MCZ packages seem to be of ancient vintage so I am assuming that they might be something reasonably compatible with an alternative session registry.


leves
  23 hours ago
IIRC you can just load the Seaside-Registry-ul.3.mcz package into Seaside 2.8 and it should work, though it was 15 years ago, so I may be wrong. :)
Anyway, I've blocked a bunch of bots from the Singaporean Alibaba cloud. Those were responsible for about 3/4 of all traffic. Since they used a fake user agent string, I don't know what service they were representing.


leves
  23 hours ago
Another bot that gets down the infinite session rabbit hole is ClaudeBot. I tried to block it via robots.txt, but it's not easy to set up robots.txt with the current nginx setup (there's already a rule for robots.txt that cannot be overridden in nginx...).
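
For reference, a robots.txt rule set can be checked offline before it is deployed. The sketch below uses Python's standard `urllib.robotparser`; the robots.txt content and the helper name `is_allowed` are hypothetical and are not the rules currently served for squeaksource.com.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content: ask ClaudeBot to stay out entirely.
ROBOTS_TXT = """\
User-agent: ClaudeBot
Disallow: /
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

Note that robots.txt is only advisory. It can help with crawlers that honor it, but it does nothing against clients that ignore it or send a fake user agent string.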


lewis
  23 hours ago
@leves Thank you! The high system load problem on squeaksource.com and source.squeak.org seems to be resolved. I have not been able to watch closely but I think that both services have been back to normal for the last couple of days.


leves
  23 hours ago
There was some outage ~11 hours ago. During that I decided to filter out those bots. I haven't checked the configuration of source.squeak.org, so the bots can still reach that.


lewis
  23 hours ago
I expect that squeaksource.com is the most vulnerable due to the large number of projects to scan. But source.squeak.org would be vulnerable to the same issues so it would be good to block the bots there also. I suspect that some of the problems that @cmm has been working on (mutex and process scheduling questions) may in fact be caused by the scanning bots.


leves
  23 hours ago
I think the bot traffic just triggers the issues. But I agree that we would be better off without the bots.
Here are the user agents that sent more than 1000 requests to source.squeak.org yesterday:
   1146 Mozilla/5.0 (compatible; AwarioBot/1.0; +https://awario.com/bots.html)
   2225 Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)
   2421 CCBot/2.0 (https://commoncrawl.org/faq/)
   4260 Mozilla/5.0 (compatible; AhrefsBot/7.0; +http://ahrefs.com/robot/)
   4282 Mozilla/5.0 (Linux; Android 7.0;) AppleWebKit/537.36 (KHTML, like Gecko) Mobile Safari/537.36 (compatible; PetalBot;+https://webmaster.petalsearch.com/site/petalbot)
  10562 Mozilla/5.0 (compatible; BLEXBot/1.0; +http://webmeup-crawler.com/)
  11025 Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm) Chrome/116.0.1938.76 Safari/537.36
  13001 Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.1.1 Safari/605.1.15 (Applebot/0.1; +http://www.apple.com/go/applebot)
  20779 Mozilla/5.0 (compatible; SemrushBot/7~bl; +http://www.semrush.com/bot.html)
  30990 Mozilla/5.0 (compatible; DotBot/1.2; +https://opensiteexplorer.org/dotbot; help@moz.com)
  95781 Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/600.2.5 (KHTML, like Gecko) Version/8.0.2 Safari/600.2.5 (Amazonbot/0.1; +https://developer.amazon.com/support/amazonbot)
 248770 Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.0.0 Safari/537.36
 406213 Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; ClaudeBot/1.0; +claudebot@anthropic.com)
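
A tally like the one above can be reproduced with a short script. This is a minimal sketch, not the command that was actually used. It assumes the access log is in nginx's default "combined" format, where the user agent is the last double-quoted field on each line; the function names and the threshold parameter are hypothetical.

```python
import re
from collections import Counter

# Assumption: "combined" log format, user agent is the last quoted field.
LAST_QUOTED_FIELD = re.compile(r'"([^"]*)"\s*$')


def count_user_agents(lines):
    """Count requests per user agent string in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        match = LAST_QUOTED_FIELD.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def heavy_hitters(counts, threshold=1000):
    """Return (count, agent) pairs with count > threshold, smallest first."""
    return sorted((n, agent) for agent, n in counts.items() if n > threshold)


def format_report(pairs):
    """Format pairs as right-aligned count followed by the agent string."""
    return "\n".join(f"{n:7d} {agent}" for n, agent in pairs)
```

Usage would be along the lines of opening the previous day's log file, passing the file object to `count_user_agents`, and printing `format_report(heavy_hitters(counts))`.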


leves
  22 hours ago
There are about 10000 requests from yesterday that are not listed above. About 10% of those are non-bot traffic. So, ~1000 requests out of ~1 million are non-bot traffic.


leves
  22 hours ago
The 248770 requests with the fake chrome on a mac are the bots from Singapore.


lewis
  22 hours ago
Wow, I was not aware of this at all. Thank you. And I agree with you that 1) bot traffic just triggers the issues and 2) we would be better off without the bots. I am amazed to see the amount of bot traffic.