Knowledge base - Posts tagged with: always be clever
It's not the servers - it is your braindead WordPress developers!
At Zubr Communications most of our business is designing, deploying and maintaining highly critical infrastructure that carries the flashy things others put on top of it. Some say we do hosting. I say we do the Internet plumbing and we do it very well. We may not be experts in Flash or the latest VRML tricks, but we know how the gear (be it server or network) actually works, why it works the way it does and what it means for the applications and the OS that run on it.
So there's probably nothing that concerns us more than those customers who choose to continuously shoot themselves in the foot by listening to web designers/developers/flakes who between watching Jerry Springer and Maury somehow obtained their "computer and networking degree" from a diploma mill. Sooner or later these "developers" do something that results in the sites going down, and we have to explain yet another time that "No, it was not our servers that caused it. It was your software".
On HP DL140 G3, PS/2 keyboard and mouse... are USB devices.
While tuning the OS is more art than science, deploying a tuned OS on a known platform should be a piece of cake. It was not the case for the upgrade of a customer's HP DL140 G3 to a new clean and tight packaging of Linux optimized to the hardware and specific tasks this particular server was supposed to perform...
[zdeploy@deploy-master] $ zdeploy --target border1.phl2/3:2 --load_target 192.168.2.22 --payload f13-zubrcom-small-web-x64
9 minutes later the system rebooted and came back on the network with 64bit Fedora 13. Everything seemed fine but the keyboard. It was solidly locked with Num Lock lit.
dmesg indicated that the PS/2 keyboard simply was not found.
RAID is Good... Backups are Better... RAID + Backups is The Best
I'm a big fan of RAID-1. Nearly every single one of our servers is RAID-1. I don't even remember the last time we deployed a Managed Services server for a customer without RAID-1. But I do have a hunch that in this case RAID-1 would not have helped, so let this be a cautionary tale to those who think RAID-1 servers mean that backups are not needed.
Server crashes...
02:16:22 MONITOR – s221-5 – HTTP - GET FAILED – TIMEOUT - 20 SECONDS
02:16:22 MONITOR – s221-5 – SMTP – HELO FAILED – TIMEOUT – 5 SECONDS
02:16:23 MONITOR – s221-5 – ICMP – ECHO REQUEST FAILED – TIMEOUT – 3 SECONDS
02:19:04 MONITOR – s221-5 – MONITORING SUSPENDED – requested by ZUBRWATCH
02:19:11 POWER – s221-5 – PSU FEED – OFF – auto-request by ZUBRWATCH
02:19:15 POWER – s221-5 – PSU FEED – ON – auto-request by ZUBRWATCH
"If it does not scale, it is broken by design"
Today a server of a customer with fantastic uptime suddenly lost its MySQL process while the customer was in the middle of a minor tweak of the WordPress platform.
Investigation revealed that the InnoDB storage engine was not able to allocate memory pages for a routine operation and, in its most bizarre way of handling errors, did a safe crash of the MySQL server. (No, there is no such thing as a "safe" crash, so please, dear MySQL folks, add sane error handling or stop pretending you are an "industrial-strength" SQL server!)
Further conversation with the customer revealed that the developer, following an example in the PHP Manual, decided it was a good idea to do this:
Scalable Drupal Architecture
When bored, our CTO tends to become a bit of a bull in a china shop, taking up a random instance of underperformance and smashing anything that stands in his way of fixing it.
Last Christmas he looked at a traffic graph of a customer having issues with scalability of Drupal...
Drupal performance degradation:


In addition to the cluster flatlining on the Drupal side (he says, "They don't know what they are doing. The code is horrible"), you can clearly see that the performance of the cluster drops over time.
He found this unacceptable. One week, several kernel patches, and two MySQL patches later, his version of the cluster started flatlining the network connections of browsers, not servers, while still maintaining sub-1-second fetch times!
Post Alex's scaling: hey, can we add some more browsers?
State of Internet Backbone Companies
There's really only one word that can be used to describe the current sorry state of IP backbone providers. That word is "pathetic."
I will spare you the details of how the following started; those reasons are totally irrelevant. Suffice to say that within a week I got two "maintenance" notifications on one of the transit circuits. Both were to be service-affecting: the first to upgrade the software on the router, the second to do something with the fiber the circuit rides on. Both were supposed to be about 30 minutes long.
A tale of backup transit...
A Senior Sales Manager of an unnamed company that claims to be a "regional leader in business connectivity", familiar with our requirements for backup transit (gige via in-building cross-connect, BGP, low CIR, etc.), tells a Senior Sales Droid to call us with a quote. The conversation went like this:
Zubrcom: I need transit over gige PNI at 401 N. Broad. Can you do that?
Senior Sales Droid: You need PRI?
Zubrcom: No, I need transit over gige PNI.
Senior Sales Droid: Private Network Interface.
Zubrcom: Yes.
Senior Sales Droid: What do you need that for?
Zubrcom: Transit.
Senior Sales Droid: Between where and where?
Sun burning through the clouds....
When a sales droid is selling you virtualization as a way to save money over your clueful service provider, not only is he selling you rainbows and magic, but also this level of availability:
Mon Jul 26 08:46:04 2010|http://www.importantsite.com|Failure|Code: 500|61 second(s).
Mon Jul 26 08:48:03 2010|http://www.importantsite.com|Failure|Code: 500|61 second(s).
Mon Jul 26 08:48:47 2010|http://www.importantsite.com|Failure|Code: 500|45 second(s).
Mon Jul 26 08:49:11 2010|http://www.importantsite.com|Failure|Code: 500|9 second(s).
Mon Jul 26 08:50:04 2010|http://www.importantsite.com|Failure|Code: 503|1 second(s).
Mon Jul 26 08:51:03 2010|http://www.importantsite.com|Failure|Code: 503|1 second(s).
Mon Jul 26 08:52:03 2010|http://www.importantsite.com|Failure|Code: 503|1 second(s).
Mon Jul 26 08:53:02 2010|http://www.importantsite.com|Failure|Code: 503|0 second(s).
Mon Jul 26 08:54:03 2010|http://www.importantsite.com|Failure|Code: 503|1 second(s).
Basic Linux Kernel Networking Tuning
Network Stack TCP tuning
These changes should go into /etc/sysctl.conf
The default maximums for buffers allocated to TCP are totally insane. While the optimal numbers need to be calculated for each specific environment, the following settings are a good start:
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
Increase Linux auto-tuning TCP buffer limits:
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
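For "calculated for specific environments", the usual starting point is the bandwidth-delay product (BDP): the number of bytes that must be in flight to keep a path full. The sketch below is not from the original post. The link speeds and round-trip times in it are made-up examples, and the result is only a rough lower bound, since the kernel spends part of each socket buffer on overhead.

```python
# Rough buffer sizing via the bandwidth-delay product (BDP).
# The paths listed below are hypothetical; plug in your own measurements.

BUFFER_MAX = 16777216  # the rmem_max / wmem_max value suggested above


def bdp_bytes(bandwidth_bits_per_s: int, rtt_ms: int) -> int:
    """Bytes in flight needed to keep a path of this speed and RTT full."""
    # bits/s * ms -> divide by 1000 for seconds and by 8 for bytes
    return bandwidth_bits_per_s * rtt_ms // 8000


EXAMPLE_PATHS = [
    ("fast ethernet, 20 ms RTT", 100_000_000, 20),
    ("gigabit, 10 ms RTT", 1_000_000_000, 10),
    ("gigabit, 100 ms RTT", 1_000_000_000, 100),
]

for label, bandwidth, rtt in EXAMPLE_PATHS:
    need = bdp_bytes(bandwidth, rtt)
    verdict = "fits in" if need <= BUFFER_MAX else "exceeds"
    print(f"{label}: BDP {need} bytes, {verdict} {BUFFER_MAX}")
```

By this estimate a 16 MB ceiling covers a single gigabit flow out to roughly 130 ms of round-trip time; longer or faster paths would need a larger maximum.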
Use robust congestion control:
net.ipv4.tcp_congestion_control = htcp
Don't cache ssthresh from the previous connection - well, duh:
net.ipv4.tcp_no_metrics_save = 1
net.ipv4.tcp_moderate_rcvbuf = 1
Pretend we always have a gigabit ethernet, even if we only have fast ethernet:
net.core.netdev_max_backlog = 2500
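To see which of the settings above a running box actually has, the live values can be read back from /proc/sys. The following is a minimal sketch rather than part of the original post: the helper names are made up, and it assumes a sysctl key maps to a path by replacing dots with slashes, which holds for the keys listed here.

```python
# Compare live kernel settings against the values suggested above.
# Helper names are illustrative; run on the Linux box being tuned.
from pathlib import Path

RECOMMENDED = {
    "net.core.rmem_max": "16777216",
    "net.core.wmem_max": "16777216",
    "net.ipv4.tcp_rmem": "4096 87380 16777216",
    "net.ipv4.tcp_wmem": "4096 65536 16777216",
    "net.ipv4.tcp_congestion_control": "htcp",
    "net.ipv4.tcp_no_metrics_save": "1",
    "net.ipv4.tcp_moderate_rcvbuf": "1",
    "net.core.netdev_max_backlog": "2500",
}


def read_sysctl(key, root="/proc/sys"):
    """Return the current value with whitespace normalized, or None if unreadable."""
    path = Path(root) / key.replace(".", "/")
    try:
        # Multi-value entries such as tcp_rmem are tab-separated in /proc/sys.
        return " ".join(path.read_text().split())
    except OSError:
        return None


def diff_sysctls(wanted, root="/proc/sys"):
    """Map each key whose live value differs to a (current, wanted) pair."""
    mismatches = {}
    for key, want in wanted.items():
        have = read_sysctl(key, root)
        if have != " ".join(want.split()):
            mismatches[key] = (have, want)
    return mismatches


if __name__ == "__main__":
    for key, (have, want) in diff_sysctls(RECOMMENDED).items():
        print(f"{key}: current={have!r} wanted={want!r}")
```

A value reported as None means the entry could not be read, for example because the kernel does not expose that key.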
Interface Transmit Queue Size
The default setting is idiotic. Change it to something sane:

