Below is a copy of a recent discussion on the box-admins Slack channel. I am copying it here to the mailing list so the discussion will not be lost. Some of this is administrivia, but it includes Levente's explanation of why this is a problem on our Seaside/SqueakSource services, and what can be done to improve the situation.
In addition to blocking the bot traffic, it also may be possible (details below) to make Seaside much less vulnerable to this activity.
Dave
=== thread from slack chat below ===
lewis 4 days ago I see some evidence that squeaksource.com running on dan.box.squeak.org is being hit by a bot of some sort. I can see activity in the Squeak process browser (VNC connection), and I also see activity in /proc/<squeakpid>/fd/ that looks like it may be someone scanning projects through the squeaksource.com service. This may be the source of the high system load and sluggish response that we have been seeing, and it does seem to be getting worse over time. Is there any kind of log or utility that I can use to get an idea of where these connections are coming from? These would be connections routed through alan to connect to dan. Thanks for any tips or suggestions.
leves 3 days ago I think I have mentioned a couple of times that ~99% of all traffic is bot traffic. The way Seaside handles URLs is very different from how the rest of the web does, and that confuses bots. They think that the URLs they get with the session id and page key can be visited later, but when they do, they'll just create a new session with many new links to visit. Seaside's session management is quadratic: creating/accessing/deleting a session requires as many operations as there are sessions. A long time ago, I created an alternative session registry for Seaside 2.8 that requires amortized constant time to create/access/delete a session. http://leves.web.elte.hu/linkeddictionary/ . But the Seaside team decided to go down a different path for session management. It became pluggable, so my version couldn't be used since Seaside 2.9. Anyway, we can filter out some of the bot traffic to reduce the load if needed.
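[Editor's note: Levente's LinkedDictionary is Smalltalk code for Seaside 2.8, linked above. As an illustration of the idea only, here is a minimal Python sketch (hypothetical class and method names, not Levente's actual implementation) of a session registry in which creating, accessing, and deleting a session, as well as expiring the oldest one, are all amortized O(1), instead of requiring a scan over every live session:

```python
from collections import OrderedDict

class SessionRegistry:
    """Illustrative sketch only (not the actual LinkedDictionary):
    sessions live in a hash map that also preserves order, so
    create/access/delete are O(1) and evicting the oldest session
    when the registry is full is O(1) as well."""

    def __init__(self, max_sessions=1000):
        self.max_sessions = max_sessions
        self.sessions = OrderedDict()

    def create(self, session_id, session):
        if len(self.sessions) >= self.max_sessions:
            self.sessions.popitem(last=False)  # drop the oldest session
        self.sessions[session_id] = session

    def access(self, session_id):
        session = self.sessions.get(session_id)
        if session is not None:
            self.sessions.move_to_end(session_id)  # mark as most recently used
        return session

    def delete(self, session_id):
        self.sessions.pop(session_id, None)
```

With a registry like this, heavy bot traffic creating throwaway sessions only ever pays a constant cost per request, rather than a cost proportional to the number of sessions already alive.]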
leves 3 days ago And to answer your question, yes, there are logs on alan. /var/log/nginx/squeaksourcecom-access.log has the current day's log (according to UTC) and /var/log/nginx/squeaksourcecom-access.log.1 has the previous day's log. The latter file currently has 2.6 million entries, so there were that many requests.
leves 3 days ago Just noticed that you don't have a user on alan yet. I can either create a user for you, or I can copy some of the files over to dan.
lewis 3 days ago Thanks Levente. I would appreciate if you can give me an account on alan. I will use it with care, and it will help me to figure out problems like this. Thank you, and thank you for the explanation of the Seaside issues.
lewis 3 days ago And yes, if there is a way to filter out some of the bot traffic, I think we are at a point where it is becoming necessary. I am mainly watching dan.box.squeak.org but I expect that the same issues apply to our source.squeak.org server on andreas.box.squeak.org.
lewis 3 days ago @leves the Seaside that we are using in our SqueakSource servers appears to be about 15 to 20 years old, with some local patches to keep it working in later Squeak images. I am not sure of the history behind this, but I don't think that our squeaksource servers really care what version of Seaside they are running on, just as long as it works. If you can point to any other version of Seaside that contains your alternative session registry for Seaside 2.8, then maybe we should try it? I am happy to work on it. Also sent to the channel
tim 3 days ago I can aver that current-ish Seaside (3.4.etc) works decently on Squeak 6+. There are a few tweaks I have that are submitted (kinda) but not yet adopted.
lewis 3 days ago I guess my question is this - since we obviously do not care about being current with Seaside (we are running a 20 year old version now), is there some runnable version of it that does contain Levente's enhancement that we could use instead? I do not care about being "up to date" I just want it to work. (edited)
tim 1 day ago That's an interesting question.
tim 1 day ago On one side, taking the current image and loading Levente's session changes is 'simple' for certain definitions of the word. It changes as little as possible BUT leaves us running very old code.
tim 1 day ago On the other side, adopting a current image and Seaside version involves more changes but brings in a couple of decades of improvements.
lewis 1 day ago I'm not too interested in a couple of decades of improvements unless the improvements actually improve something. But if @leves has an alternative session registry that addresses the issues we are seeing on the squeaksource boxes, then I would be very happy to try running it regardless of what version of Seaside it uses.
lewis 1 day ago I guess I should also say that I really don't know what version of Seaside we are running now for our SqueakSource services. The MCZ packages seem to be of ancient vintage so I am assuming that they might be something reasonably compatible with an alternative session registry.
leves 23 hours ago IIRC you can just load the Seaside-Registry-ul.3.mcz package into Seaside 2.8 and it should work, though that was 15 years ago, so I may be wrong. :slightly_smiling_face: Anyway, I've blocked a bunch of bots from the Singaporean Alibaba cloud. Those were responsible for about 3/4 of all traffic. Since they used a fake user agent string, I don't know what service they were representing.
leves 23 hours ago Another bot that goes down the infinite session rabbit hole is ClaudeBot. I tried to block it via robots.txt, but it's not easy to set up robots.txt with the current nginx setup (there's already a rule for robots.txt that cannot be overridden in nginx...).
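[Editor's note: for readers unfamiliar with nginx, user-agent filtering of the kind described here is commonly done with a `map` block rather than robots.txt, since it does not depend on the crawler cooperating. The following is only an illustrative sketch under assumed server names; it is not the actual configuration on alan, and the bot names are taken from the list later in this thread:

```nginx
# Sketch only: return 403 to selected crawlers, matched by
# user-agent substring. The surrounding server block is assumed.
map $http_user_agent $is_blocked_bot {
    default        0;
    ~*ClaudeBot    1;
    ~*Amazonbot    1;
    ~*DotBot       1;
    ~*SemrushBot   1;
}

server {
    # ... existing squeaksource.com proxy configuration ...
    if ($is_blocked_bot) {
        return 403;
    }
}
```

Unlike robots.txt, this rejects the request at the front end before it ever reaches the Seaside image, so no session is created.]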
lewis 23 hours ago @leves Thank you! The high system load problem on squeaksource.com and source.squeak.org seems to be resolved. I have not been able to watch closely but I think that both services have been back to normal for the last couple of days.
leves 23 hours ago There was some outage ~11 hours ago. During that outage, I decided to filter out those bots. I haven't checked the configuration of source.squeak.org, so the bots can still reach that.
lewis 23 hours ago I expect that squeaksource.com is the most vulnerable due to the large number of projects to scan. But source.squeak.org would be vulnerable to the same issues so it would be good to block the bots there also. I suspect that some of the problems that @cmm has been working on (mutex and process scheduling questions) may in fact be caused by the scanning bots.
leves 23 hours ago I think the bot traffic just triggers the issues. But I agree that we would be better off without the bots. Here are the user agents that sent more than 1000 requests to source.squeak.org yesterday:
   1146  Mozilla/5.0 (compatible; AwarioBot/1.0; +https://awario.com/bots.html)
   2225  Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)
   2421  CCBot/2.0 (https://commoncrawl.org/faq/)
   4260  Mozilla/5.0 (compatible; AhrefsBot/7.0; +http://ahrefs.com/robot/)
   4282  Mozilla/5.0 (Linux; Android 7.0;) AppleWebKit/537.36 (KHTML, like Gecko) Mobile Safari/537.36 (compatible; PetalBot;+https://webmaster.petalsearch.com/site/petalbot)
  10562  Mozilla/5.0 (compatible; BLEXBot/1.0; +http://webmeup-crawler.com/)
  11025  Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm) Chrome/116.0.1938.76 Safari/537.36
  13001  Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.1.1 Safari/605.1.15 (Applebot/0.1; +http://www.apple.com/go/applebot)
  20779  Mozilla/5.0 (compatible; SemrushBot/7~bl; +http://www.semrush.com/bot.html)
  30990  Mozilla/5.0 (compatible; DotBot/1.2; +https://opensiteexplorer.org/dotbot; help@moz.com)
  95781  Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/600.2.5 (KHTML, like Gecko) Version/8.0.2 Safari/600.2.5 (Amazonbot/0.1; +https://developer.amazon.com/support/amazonbot)
 248770  Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.0.0 Safari/537.36
 406213  Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; ClaudeBot/1.0; +claudebot@anthropic.com)
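[Editor's note: counts like the above can be reproduced from an nginx access log, assuming the default "combined" log format, where the user agent is the third double-quoted field on each line. A small illustrative script (the log path in the usage comment is an assumption from earlier in this thread):

```python
from collections import Counter

def user_agent_counts(lines):
    """Count user agents in nginx 'combined'-format log lines.

    In the combined format the user agent is the third quoted field,
    i.e. index 5 after splitting the line on double quotes.
    """
    counts = Counter()
    for line in lines:
        parts = line.split('"')
        if len(parts) >= 6:
            counts[parts[5]] += 1
    return counts

# Example usage (path as mentioned earlier in the thread):
# with open('/var/log/nginx/squeaksourcecom-access.log.1') as f:
#     for agent, n in sorted(user_agent_counts(f).items(), key=lambda kv: kv[1]):
#         if n > 1000:
#             print(n, agent)
```

The same result is often produced with an awk/sort/uniq pipeline; the script form just makes the field position explicit.]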
leves 22 hours ago There are about 10000 requests from yesterday that are not listed above. About 10% of those are non-bot traffic. So, ~1000 requests out of ~1 million is non-bot traffic.
leves 22 hours ago The 248770 requests with the fake chrome on a mac are the bots from Singapore.
lewis 22 hours ago Wow, I was not aware of this at all. Thank you. And I agree with you that 1) bot traffic just triggers the issues and 2) we would be better off without the bots. I am amazed to see the amount of bot traffic.
Hi all,
I'm new to Squeak and have been lurking on the mailing lists for a while trying to glean information about the language, community, etc. and have enjoyed being a bystander while I try to find spare time to learn Smalltalk.
That said, I have to say I'm kind of amazed by the idea of running a 20-year-old version of a web application framework for a public-facing, effectively insecure site. While I completely understand not wanting to chase the new and shiny, the web has changed significantly in 20 years, in particular the requirements for running a site securely. There are some simple, basic things that would help immediately, like redirecting HTTP requests to HTTPS (you can currently access both of these sites over insecure HTTP) and putting a captcha on any form where a user can POST/PUT/DELETE. I don't know enough about Seaside to know whether it has been tested for security vulnerabilities the way the large-scale web application frameworks used in industry are, but I have to believe that any 15-to-20-year-old version of web application software, regardless of language or runtime, has multiple security vulnerabilities at this point. I also wonder whether it wouldn't be better to put these sites completely behind a login for all access (again, with a captcha at a minimum, and forcing HTTPS). I did some checks on how many pages are indexed in Google, and it's minimal, so the likelihood of searching and landing on a squeaksource.com or source.squeak.org page seems pretty low. Run these queries in Google, for instance (remove the quotes):
"site:squeaksource.com -asdf" - returns 695 pages indexed
"site:source.squeak.org -asdf" - returns 125 pages indexed
Just for context, I grew up in the days of the web when I would happily dial in on my modem to participate in an open community of like-minded people. I still pine for those days, but we all know that place is sadly long gone. squeaksource.com is highly exposed in multiple risky ways for the community the way it's set up right now. Also, the idea that the bot traffic is simply web crawlers following links is almost certainly not correct, and modifying robots.txt won't work for most crawlers anyway, as only a few of them (namely big US companies like Google, Yahoo, and Microsoft) abide by it anymore. The reality is that the web is infested with malicious bots (acting as crawlers and web agents) that are probing for a way *onto* the host server: not to pull down pages for a search engine, but to install malware and build botnets that do real damage.
Anyway, apologies if I'm coming across too harsh/negative, that's not my intent. I would just really be saddened to see squeaksource get crushed by bots or worse having its servers exploited and have to be shut down as it provides a nice service for the community.
My workload lately has been off the charts, so I haven't had as much time to dive into Smalltalk as I've wanted, but I have over 25 years of web engineering experience in high-traffic/distributed web systems and would be happy to assist in any way that might be helpful. I just don't have the Smalltalk chops to write code yet. ;)
Mike
On Wed, Apr 24, 2024 at 7:52 PM lewis@mail.msen.com wrote:
Hi Michael
I love Seaside as an application framework that runs over HTTP. But web tech has moved on, and for the better.
The new web... thinks differently; I do not think Seaside should be shoehorned into it.
Now, Seaside inside Croquet... peer to peer... not public... yes.
---- On Thu, 25 Apr 2024 07:03:15 -0400 mike.engelhart@gmail.com wrote ----
Hi,
Thanks for the reply. Just to be clear, I wasn't suggesting abandoning Seaside or http as a transport mechanism. I was merely saying that the http protocol is insecure, and that the squeaksource site already works over the secure https protocol, which is identical to http except that all data is encrypted in transit, i.e. http://squeaksource.com vs. https://squeaksource.com
They return the same page to the browser, except that one is susceptible to a number of attack vectors that https isn't susceptible to.
By having the code that handles the page request inspect the protocol, it could detect when requests for http://squeaksource.com pages arrive and redirect them to https://squeaksource.com
Mike
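The protocol check Mike describes is normally handled in front of the application rather than inside the page-handling code. Since nginx already fronts these services (per the log discussion elsewhere in this thread), a minimal sketch of such a redirect, with assumed server names and a standard nginx layout:

```nginx
# Hypothetical fragment, not the actual box configuration.
# Plain-HTTP requests get nothing but a permanent redirect, so the
# Seaside image only ever sees traffic that arrived over TLS.
server {
    listen 80;
    listen [::]:80;
    server_name squeaksource.com;
    return 301 https://squeaksource.com$request_uri;
}
```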
On Thu, Apr 25, 2024 at 7:29 PM gettimothy via Squeak-dev <squeak-dev@lists.squeakfoundation.org> wrote:
Hi Michael
I love Seaside as an application framework that runs over http. But web tech has moved on and for the good.
The new web....thinks differently ; I do not think Seaside should be shoehorned into it.
Now, Seaside inside Croquet .....peer to peer....not public .....yes.
---- On Thu, 25 Apr 2024 07:03:15 -0400, mike.engelhart@gmail.com wrote ----
Hi all,
I'm new to Squeak and have been lurking on the mailing lists for a while trying to glean information about the language, community, etc. and have enjoyed being a bystander while I try to find spare time to learn Smalltalk.
That said, I have to say I'm kind of amazed by the idea of running a 20 year old version of a web application framework for a public facing, effectively insecure site. While I completely understand the idea of not wanting to chase the new and shiny, the web has changed significantly in 20 years, in particular the requirements for running a site securely. There are some simple, basic things that would help immediately, like forcing HTTPS via a redirect from any HTTP request (you can currently access both of these sites over insecure HTTP) and putting a captcha on any forms where a user can POST/PUT/DELETE. I don't know enough about Seaside to know whether it's been tested for security vulnerabilities the way the large-scale web application frameworks used in industry are, but I have to believe that any 15-20 year old version of web application software, regardless of language or runtime, has multiple security vulnerabilities at this point. I also wondered whether it wouldn't be better to put these sites completely behind a login for all access (again, with a captcha at a minimum, and forcing HTTPS). I did some checks to see how many pages are indexed in Google, and it's minimal, so the likelihood of searching and landing on a squeaksource.com or source.squeak.org page seems pretty low. Run these queries in Google, for instance (without the quotes): "site:squeaksource.com -asdf" returns 695 pages indexed; "site:source.squeak.org -asdf" returns 125 pages indexed.
Just for context, I grew up in the days of the web where I would happily dial in on my modem to participate in an open community of like-minded people. I still pine for those days, but we all know that place is sadly long gone. The way it's set up right now, squeaksource.com is highly exposed in multiple risky ways for the community. Also, the idea that the bot traffic is simply web crawlers trying to crawl pages by following links is almost certainly not correct, and modifying robots.txt won't work for most crawlers anyway, as there are only a few crawlers (namely big US companies like Google, Yahoo, and Microsoft) that abide by it anymore. The reality is that the web is infested with malicious bots (acting as crawlers and web agents) that are probing for a way *onto* the host server, not to pull down pages for a search engine for end users, but so they can install malware to create botnets that do real damage.
Anyway, apologies if I'm coming across too harsh/negative, that's not my intent. I would just really be saddened to see squeaksource get crushed by bots or worse having its servers exploited and have to be shut down as it provides a nice service for the community.
My workload lately has been off the charts so I haven't had as much time to dive into Smalltalk as I've wanted but I have over 25 years of web engineering experience in high traffic/distributed web systems and would be happy to assist in any way that might be helpful, I just don't have the Smalltalk chops to write code just yet.. ;)
Mike
On Wed, Apr 24, 2024 at 7:52 PM lewis@mail.msen.com wrote:
Below is a copy of a recent discussion on the box-admins Slack channel. I am copying it here to the mailing list so the discussion will not be lost. Some of this is administrivia but it includes Levente's explanation of why this is a problem on our Seaside/SqueakSource services, and what can be done to improve the situation.
In addition to blocking the bot traffic, it also may be possible (details below) to make Seaside much less vulnerable to this activity.
Dave
=== thread from slack chat below ===
lewis 4 days ago I see some evidence that squeaksource.com running on dan.box.squeak.org is being hit by a bot of some sort. I can see activity in the Squeak process browser (VNC connection), and I also see activity in /proc/<squeakpid>/fd/ that looks like it may be someone scanning projects through the squeaksource.com service. This may be the source of the high system load and sluggish response that we have been seeing, and it does seem to be getting worse over time. Is there any kind of log or utility that I can use to get an idea of where these connections are coming from? These would be connections routed through alan to connect to dan. Thanks for any tips or suggestions. 22 replies
leves 3 days ago I think I have mentioned it a couple times that ~99% of all traffic is bot traffic. The way seaside handles urls is very different from how the rest of the web does, and that confuses bots. They think that the urls they get with the session id and page key can be visited later, but when they do, they'll just create a new session with many new links to visit. Seaside's session management is quadratic: creating/accessing/deleting a session requires as many operations as there are sessions. A long time ago, I created an alternative session registry for Seaside 2.8 that requires amortized constant time to create/access/delete a session: http://leves.web.elte.hu/linkeddictionary/ . But the seaside team decided to go down a different path about session management. It became pluggable, so my version couldn't be used from Seaside 2.9 on. Anyway, we can filter out some of the bot traffic to reduce the load if there's a need.
leves 3 days ago And to answer your question, yes, there are logs on alan. /var/log/nginx/squeaksourcecom-access.log has the current day's log (according to UTC) and /var/log/nginx/squeaksourcecom-access.log.1 has the previous day's log. The latter file currently has 2.6 million entries, so there were that many requests.
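The per-user-agent counts quoted later in this thread can be pulled out of such a log with a few lines of code. A minimal sketch in Python, assuming nginx's default "combined" log format, where the user agent is the last double-quoted field of each line (the log path in the comment is the one from the message above):

```python
import re
from collections import Counter

# nginx's default "combined" log format ends with the referrer and the
# user agent as the last two double-quoted fields; this grabs the final
# quoted field of each line.  Adjust if the box uses a custom log_format.
UA_PATTERN = re.compile(r'"([^"]*)"\s*$')

def count_user_agents(lines):
    """Return a Counter mapping each user-agent string to its request count."""
    counts = Counter()
    for line in lines:
        match = UA_PATTERN.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts

# Example use:
#   with open('/var/log/nginx/squeaksourcecom-access.log.1') as log:
#       for agent, n in count_user_agents(log).most_common(20):
#           print(f'{n:>8} {agent}')
```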
leves 3 days ago Just noticed that you don't have a user on alan yet. I can either create a user for you, or I can copy some of the files over to dan.
lewis 3 days ago Thanks Levente. I would appreciate if you can give me an account on alan. I will use it with care, and it will help me to figure out problems like this. Thank you, and thank you for the explanation of the Seaside issues.
lewis 3 days ago And yes, if there is a way to filter out some of the bot traffic, I think we are at a point where it is becoming necessary. I am mainly watching dan.box.squeak.org but I expect that the same issues apply to our source.squeak.org server on andreas.box.squeak.org.
lewis 3 days ago @leves the Seaside that we are using in our SqueakSource servers appears to be about 15 to 20 years old, with some local patches to keep it working in later Squeak images. I am not sure of the history behind this, but I don't think that our squeaksource servers really care what version of Seaside they are running on, just as long as it works. If you can point to any other version of Seaside that contains your alternative session registry for Seaside 2.8, then maybe we should try it? I am happy to work on it.
tim 3 days ago I can aver that current-ish Seaside (3.4.etc) works decently on Squeak 6+. There’s a few tweaks I have that are submitted (kinda) but not yet adopted.
lewis 3 days ago I guess my question is this - since we obviously do not care about being current with Seaside (we are running a 20 year old version now), is there some runnable version of it that does contain Levente's enhancement that we could use instead? I do not care about being "up to date" I just want it to work.
tim 1 day ago That’s an interesting question.
tim 1 day ago On one side, taking the current image and loading Levente’s session changes is ‘simple’ for certain definitions of the word. It changes as little as possible BUT leaves us running very old code.
tim 1 day ago On the other side, adopting a current image and Seaside version involves more changes but brings in a couple of decades of improvements.
lewis 1 day ago I'm not too interested in a couple of decades of improvements unless the improvements actually improve something. But if @leves has an alternative session registry that addresses the issues we are seeing on the squeaksource boxes, then I would be very happy to try running it regardless of what version of Seaside it uses.
lewis 1 day ago I guess I should also say that I really don't know what version of Seaside we are running now for our SqueakSource services. The MCZ packages seem to be of ancient vintage so I am assuming that they might be something reasonably compatible with an alternative session registry.
leves 23 hours ago IIRC you can just load the Seaside-Registry-ul.3.mcz package into Seaside 2.8 and it should work, though it was 15 years ago, so I may be wrong. :slightly_smiling_face: Anyway, I've blocked a bunch of bots from the Singaporean Alibaba cloud. Those were responsible for about 3/4 of all traffic. Since they used fake user agent strings, I don't know what service they were representing.
leves 23 hours ago Another bot that goes down the infinite-session rabbit hole is ClaudeBot. I tried to block it via robots.txt, but it's not easy to set up robots.txt with the current nginx setup (there's already a rule for robots.txt that cannot be overridden in nginx...).
lewis 23 hours ago @leves Thank you! The high system load problem on squeaksource.com and source.squeak.org seems to be resolved. I have not been able to watch closely but I think that both services have been back to normal for the last couple of days.
leves 23 hours ago There was some outage ~11 hours ago. During that I decided to filter out those bots. I haven't checked the configuration of source.squeak.org, so the bots can still reach that.
lewis 23 hours ago I expect that squeaksource.com is the most vulnerable due to the large number of projects to scan. But source.squeak.org would be vulnerable to the same issues so it would be good to block the bots there also. I suspect that some of the problems that @cmm has been working on (mutex and process scheduling questions) may in fact be caused by the scanning bots.
leves 23 hours ago I think the bot traffic just triggers the issues. But I agree that we would be better off without the bots. Here are the user agents that sent more than 1000 requests to source.squeak.org yesterday:
    1146 Mozilla/5.0 (compatible; AwarioBot/1.0; +https://awario.com/bots.html)
    2225 Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)
    2421 CCBot/2.0 (https://commoncrawl.org/faq/)
    4260 Mozilla/5.0 (compatible; AhrefsBot/7.0; +http://ahrefs.com/robot/)
    4282 Mozilla/5.0 (Linux; Android 7.0;) AppleWebKit/537.36 (KHTML, like Gecko) Mobile Safari/537.36 (compatible; PetalBot;+https://webmaster.petalsearch.com/site/petalbot)
   10562 Mozilla/5.0 (compatible; BLEXBot/1.0; +http://webmeup-crawler.com/)
   11025 Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm) Chrome/116.0.1938.76 Safari/537.36
   13001 Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.1.1 Safari/605.1.15 (Applebot/0.1; +http://www.apple.com/go/applebot)
   20779 Mozilla/5.0 (compatible; SemrushBot/7~bl; +http://www.semrush.com/bot.html)
   30990 Mozilla/5.0 (compatible; DotBot/1.2; +https://opensiteexplorer.org/dotbot; help@moz.com)
   95781 Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/600.2.5 (KHTML, like Gecko) Version/8.0.2 Safari/600.2.5 (Amazonbot/0.1; +https://developer.amazon.com/support/amazonbot)
  248770 Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.0.0 Safari/537.36
  406213 Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; ClaudeBot/1.0; +claudebot@anthropic.com)
leves 22 hours ago There are about 10000 requests from yesterday that are not listed above. About 10% of those are non-bot traffic. So, ~1000 requests out of ~1 million are non-bot traffic.
leves 22 hours ago The 248770 requests with the fake chrome on a mac are the bots from Singapore.
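As a sketch of what "filtering out those bots" can look like at the nginx layer, here is a hypothetical fragment built from the user agents listed above; the real configuration on alan may look quite different:

```nginx
# Hypothetical fragment: flag the heaviest offenders from the list
# above by user agent, then refuse flagged requests before they are
# proxied to the Seaside image.  map blocks live in the http context.
map $http_user_agent $is_scanner_bot {
    default          0;
    ~*claudebot      1;
    ~*amazonbot      1;
    ~*semrushbot     1;
    ~*dotbot         1;
    ~*blexbot        1;
}

server {
    # ... existing listen/server_name/proxy settings ...
    if ($is_scanner_bot) {
        return 403;
    }
}
```

Note that this only helps against bots that keep a recognizable user agent; the fake-Chrome traffic from Singapore mentioned above has to be blocked by address range instead.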
lewis 22 hours ago Wow, I was not aware of this at all. Thank you. And I agree with you that 1) bot traffic just triggers the issues and 2) we would be better off without the bots. I am amazed to see the amount of bot traffic.
On 4/25/24 17:38, Michael Engelhart wrote:
Hi,
Thanks for the reply. Just to be clear, I wasn't suggesting abandoning Seaside or http as a transport mechanism. I was merely saying that the http protocol is insecure, and that the squeaksource site already works over the secure https protocol, which is identical to http except that all data is encrypted in transit, i.e. http://squeaksource.com vs. https://squeaksource.com
They return the same page to the browser, except that one is susceptible to a number of attack vectors that https isn't susceptible to.
By having the code that handles the page request inspect the protocol, it could detect when requests for http://squeaksource.com pages arrive and redirect them to https://squeaksource.com
This still leaves the client and server open to insecure HTTP attacks.
Monticello (or whatever client) would need to upgrade HTTP repositories to HTTPS ones before sending the request out... but sometimes you aren't running HTTPS because you arranged for security some other way, so the client can't do this blindly for every repository. You also can't safely downgrade back to HTTP if HTTPS fails because there's an attack for that.
I do agree that if HTTPS is reliable enough it should be the default protocol.
Good point that the clients and Monticello would need to upgrade to support HTTPS. I was more focused on the site itself due to the reports of bot traffic
Hmm
I would like to highlight one specific issue that Levente explained, and that I think may deserve follow up. Quoting from Levente's message earlier in this thread:
The way seaside handles urls is very different from how the rest of the web does, and that confuses bots. They think that the urls they get with the session id and page key can be visited later, but when they do, they'll just create a new session with many new links to visit. Seaside's session management is quadratic: creating/accessing/deleting a session requires as many operations as there are sessions. A long time ago, I created an alternative session registry for Seaside 2.8 that requires amortized constant time to create/access/delete a session: http://leves.web.elte.hu/linkeddictionary/ . But the seaside team decided to go down a different path about session management. It became _pluggable_, so my version couldn't be used from Seaside 2.9 on.
Dave
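Levente's actual registry is in the package linked above. Purely to illustrate the idea of amortized constant-time session management, here is a hypothetical sketch in Python; an ordered hash map (a dictionary threaded with a doubly linked list) gives O(1) lookup, insertion, and deletion, and lets expiry pop stale sessions off the front instead of scanning the whole registry:

```python
import time
from collections import OrderedDict

class LinkedSessionRegistry:
    """Illustrative session registry with amortized O(1) operations.

    Sessions are kept in an OrderedDict ordered by last access.  Store,
    access, and remove are O(1); expiry pops stale entries from the
    front, so its cost is amortized over the requests that created them.
    """

    def __init__(self, timeout_seconds=600, clock=time.monotonic):
        self._sessions = OrderedDict()   # key -> (session, last_access)
        self._timeout = timeout_seconds
        self._clock = clock              # injectable for testing

    def store(self, key, session):
        now = self._clock()
        self._sessions[key] = (session, now)
        self._sessions.move_to_end(key)  # most recently used at the back
        self._expire(now)

    def access(self, key):
        entry = self._sessions.get(key)
        if entry is None:
            return None
        session, _ = entry
        now = self._clock()
        self._sessions[key] = (session, now)
        self._sessions.move_to_end(key)
        self._expire(now)
        return session

    def remove(self, key):
        self._sessions.pop(key, None)

    def _expire(self, now):
        # Oldest entries sit at the front; stop at the first live one.
        while self._sessions:
            key, (_, last_access) = next(iter(self._sessions.items()))
            if now - last_access <= self._timeout:
                break
            del self._sessions[key]
```

With a registry shaped like this, a bot that opens a million sessions costs a million constant-time operations rather than the quadratic behavior described above, since each expired session is removed exactly once.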
Also from Levente, with details on what to load:
IIRC you can just load the Seaside-Registry-ul.3.mcz package into Seaside 2.8 and it should work, though it was 15 years ago, so I may be wrong.