It's not the servers - it is your braindead WordPress developers!
At Zubr Communications most of our business is designing, deploying and maintaining highly critical infrastructure that carries the flashy things others put on top of it. Some say we do hosting. I say we do the Internet plumbing and we do it very well. We may not be experts in Flash or the latest VRML tricks, but we know how the gear (be it server or network) actually works, why it works the way it does and what it means for the applications and the OS that run on it.
So there's probably nothing that concerns us more than those customers who choose to continuously shoot themselves in the foot by listening to web designers/developers/flakes who between watching Jerry Springer and Maury somehow obtained their "computer and networking degree" from a diploma mill. Sooner or later these "developers" do something that results in the sites going down, and we have to explain yet another time that "No, it was not our servers that caused it. It was your software".
I have already brought up that pretty much all PHP apps running via Apache's mod_php suffer from being broken by design. This time I'll talk about what happens when a marginally well-written application is paired with a theme created by a clue-challenged developer.
So there's this extremely popular piece of software called WordPress. It is a nice CMS. It is kind of scalable and, under proper supervision, it can be made to create kick-ass websites easily. One of the best things about WordPress is that it has a very easy-to-use theming interface that allows for snazzy customization of sites. And the worst thing about WordPress is that the same theming API is so easy to use that every two-bit web designer that fancies himself a web developer uses it... typically with disastrous results.
Let's look at the log file on a crash-happy server that used to successfully push 120Mbit/sec without breaking a sweat before a new theme was added.
c-71-225-230-19.hsd1.nj.comcast.net - - [10/Jan/2011:18:25:03 -0500] "GET /wp-content/themes/somesite/thumb.php?src=http://www.somesite.com/wp-content/uploads/2011/01/lucky-7.jpg&h=200&w=350&zc=1&q=90 HTTP/1.1" 200 22538 "http://www.somesite.com/" "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; InfoPath.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729; OfficeLiveConnector.1.3; OfficeLivePatch.0.0)"
See the two problems?
Problem #1 – Images are resized on the fly.
There’s absolutely no reason to resize images on the fly unless images change size dynamically. I dare say that there are pretty much no applications that require images to change sizes outside a couple of well-defined dimensions known at design time. In this case, every time a web browser requests an image it is being resized to the same size. If 1000 web browsers request that image, it will be resized 1000 times to the identical size. You say “But we have more than enough CPU to do the resizing. It is quick! I tested it! I have a gazillion cores!” While that may be true, you never want to do a single task more than once in a dynamically built page because you are already using CPU cycles to build the page. It may work fine when you get a hit every couple of seconds, but when your site ends up, however briefly, on top of Google News you are going to be getting dozens if not hundreds of hits per second and, unless you want to pay for really beefy hardware, your site WILL become unavailable.
Problem #2 – The source images are requested via web from the website.
If this is not the pinnacle of idiocy, it is definitely a close second. The thumb.php script fetches the image over HTTP from the server itself! Hello? If the images are stored on the server, then the PHP program should know where they are located. Not only is fetching the image via HTTP from the same host much slower than reading it directly from disk, but each such connection also uses one of the slots the server has available to process requests. Since the connection is closed after the file is fetched, the next time the server requests another image it needs to re-create a connection from scratch! Considering that in this case a typical page had between 10 and 25 images displayed via thumb.php, each page required 10 to 25 internal connections just to render images, and that’s without all the connections the web server would need to maintain to the browser.
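For the record, here is roughly what the sane version of both fixes looks like. This is a minimal sketch in Python rather than the theme's PHP, and everything in it is an assumption made for illustration: the function names, the cache layout, and the `resize` callable, which stands in for whatever image library actually does the scaling. The point is only the shape of it: map the URL to a file on disk instead of fetching it over HTTP, and resize once per (image, size) instead of once per hit.

```python
import hashlib
import os
from urllib.parse import urlparse


def local_path_from_url(src_url, doc_root):
    """Map a same-site image URL to its file under doc_root, with no HTTP fetch."""
    rel = urlparse(src_url).path.lstrip("/")
    full = os.path.realpath(os.path.join(doc_root, rel))
    root = os.path.realpath(doc_root)
    # Refuse paths that escape the document root (e.g. via "..").
    if os.path.commonpath([full, root]) != root:
        raise ValueError("path escapes document root: %r" % src_url)
    return full


def cached_thumbnail(src_path, width, height, cache_dir, resize):
    """Return the path of a thumbnail, calling resize() only on a cache miss.

    resize is a hypothetical callable: resize(image_bytes, width, height) -> bytes.
    """
    # The source mtime is part of the key, so a re-uploaded image gets a new thumbnail.
    mtime = os.stat(src_path).st_mtime_ns
    key = "%s|%d|%d|%d" % (src_path, mtime, width, height)
    name = hashlib.sha1(key.encode("utf-8")).hexdigest() + os.path.splitext(src_path)[1]
    out_path = os.path.join(cache_dir, name)
    if not os.path.exists(out_path):
        os.makedirs(cache_dir, exist_ok=True)
        with open(src_path, "rb") as f:
            data = resize(f.read(), width, height)
        # Write to a temporary name, then rename, so readers never see a partial file.
        tmp_path = out_path + ".tmp.%d" % os.getpid()
        with open(tmp_path, "wb") as f:
            f.write(data)
        os.replace(tmp_path, out_path)
    return out_path
```

With something like this, 1000 requests for the same 350x200 thumbnail cost one resize and zero internal HTTP connections; after the first hit the web server can hand out the cached file as plain static content.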
Damn "developers"...
On HP DL140 G3, PS/2 keyboard and mouse... are USB devices.
While tuning the OS is more art than science, deploying a tuned OS on a known platform should be a piece of cake. It was not the case for the upgrade of a customer's HP DL140 G3 to a new clean and tight packaging of Linux optimized to the hardware and specific tasks this particular server was supposed to perform...
[zdeploy@deploy-master] $ zdeploy --target border1.phl2/3:2 --load_target 192.168.2.22 --payload f13-zubrcom-small-web-x64
9 minutes later the system rebooted and came back on the network with 64-bit Fedora 13. Everything seemed fine except the keyboard. It was solidly locked with Num Lock lit.
dmesg indicated that the PS/2 keyboard simply was not found.
Googling around yielded the standard gamut of useless results – no one seems to be able to systematically debug the problem to figure out why it is happening (it seems people only whine about the problem, try tons of workarounds and jump through dozens of hoops offered by developers who quickly lose interest, followed by the bug reports getting WONT-FIX status). Trust us – we systematically tried all the possible workarounds. None of them worked, regardless of the posters' claims (we now know they could not have possibly worked). We even read the source code of the PS/2 driver. Nothing made sense. That's when someone remembered that we had played with a Vyatta router on a DL140 with a working keyboard.
[zubrcom@localhost ~]$ dmesg | less
PNP: No PS/2 controller found. Probing ports directly.
i8042.c: Warning: Keylock active.
Failed to disable AUX port, but continuing anyway... Is this a SiS?
If AUX port is really absent please use the 'i8042.noaux' option.
No, this is not a SiS and we had already tried all the possible i8042.* options. If nothing here changed, why did the keyboard work now? The PS/2 controller was still not found...
Further down the dmesg output we found the answer:
generic-usb 0003:0000:0000.0001: input: USB HID v1.11 Keyboard [ServerEngines SE USB Device] on usb-0000:00:1d.2-1/input0
input: ServerEngines SE USB Device as /devices/pci0000:00/0000:00:1d.2/usb4/4-1/4-1:1.1/input/input3
generic-usb 0003:0000:0000.0002: input: USB HID v1.11 Mouse [ServerEngines SE USB Device] on usb-0000:00:1d.2-1/input1
So even though the keyboard is PS/2 and mouse is PS/2, HP decided to add some additional logic to connect them internally as USB devices and not clearly document it anywhere.
Compile in EHCI and HID support and enjoy a working keyboard. Thanks HP! Hint – when I see PS/2 keyboard and mouse ports on the back of a server I expect them to be PS/2 ports unless they are explicitly documented not to be. *plonk*
P.S. Wouldn't it be nice if those who claim to have solved the problem checked to see if their solution actually worked?
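If you want to check for the same surprise on your own gear without eyeballing pages of dmesg, something like the following will do. It is a rough sketch in Python; the regular expression is an assumption based only on the format of the lines quoted above, and other kernel versions may word these messages differently.

```python
import re

# Pattern assumed from the "generic-usb ... input: USB HID ..." lines quoted above.
USB_HID_INPUT = re.compile(
    r"input: USB HID v(?P<hid>[\d.]+) (?P<kind>Keyboard|Mouse) "
    r"\[(?P<name>[^\]]+)\] on (?P<bus>usb-\S+)"
)


def usb_input_devices(dmesg_text):
    """Return (kind, device name, bus path) for every USB HID keyboard or mouse."""
    found = []
    for line in dmesg_text.splitlines():
        match = USB_HID_INPUT.search(line)
        if match:
            found.append((match.group("kind"), match.group("name"), match.group("bus")))
    return found
```

Feed it the saved output of dmesg. If the "PS/2" keyboard shows up in the result, it is not a PS/2 keyboard as far as the kernel is concerned, and no i8042 option is going to help.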
P.P.S. memtest86 also locks up the keyboard on HP DL140 G3 - it just does it differently - Caps Lock, Num Lock and Scroll Lock work while the other keys do not.
RAID is Good... Backups are Better... RAID + Backups is The Best
I'm a big fan of RAID-1. Nearly every single one of our servers is RAID-1. I don't even remember the last time we deployed a Managed Services server for a customer without RAID-1. But I do have a hunch that in this case RAID-1 would not have helped, so let this be a cautionary tale to those who think RAID-1 servers mean that backups are not needed.
Server crashes...
02:16:22 MONITOR – s221-5 – HTTP - GET FAILED – TIMEOUT - 20 SECONDS
02:16:22 MONITOR – s221-5 – SMTP – HELO FAILED – TIMEOUT – 5 SECONDS
02:16:23 MONITOR – s221-5 – ICMP – ECHO REQUEST FAILED – TIMEOUT – 3 SECONDS
02:19:04 MONITOR – s221-5 – MONITORING SUSPENDED – requested by ZUBRWATCH
02:19:11 POWER – s221-5 – PSU FEED – OFF – auto-request by ZUBRWATCH
02:19:15 POWER – s221-5 – PSU FEED – ON – auto-request by ZUBRWATCH
Strangely, the reboot did not fix it. Rarely do we have server crashes caused by something other than memory starvation on a high-traffic website where someone forgot how to cleanly release memory resources. Reboots always fix those.
02:22:15 MONITOR – s221-5 – MONITORING RESTORED – auto-request by ZUBRWATCH
02:25:22 MONITOR – s221-5 – ALERT - ESCALATION/ONCALL – "s221-5 down hard, cleanup failed"
02:25:22 ZUBRWATCH – DISPATCH – "s221-5 down hard since 02:16:22 EST 2010-12-30"
Of course it had to happen on the 30th of December when the only person responding happened to be me.
At first glance, server s221-5 seemed fine – it had a solid "Power on" light on the front; its network card had a "Link" light on the back, though that one was not blinking, indicating that the server did in fact experience a hard crash. Even stranger, the switch on the other side of s221-5's cable did not see the link.
As I plugged the keyboard into s221-5 the unmistakable smell of burned plastic finally reached me - something is overheating... The only question is what? When you have racks and racks full of 1U and 2U live servers identifying the source of the smell is difficult and the last thing I want to do is unrack the servers one at a time to locate the server with a component that's about to be BBQed.
Oh well, let's make sure s221-5 is recovered first as none of the environmental monitors inside the servers are sending any alarms... yet
What the hell? The keyboard plugged into s221-5 lights up with an orange light on Num Lock. I did not even know a keyboard light could be orange. Unplug the keyboard. Light goes away. Plug it back in. Light comes back. Unplug the keyboard from s221-5 and plug it into a different server... Orange light is back. "Crap. Now we have a dead keyboard." Finally, power off s221-5 and power it back on - this time with a different keyboard... Now the 2nd keyboard has the orange light. It is 3am and only one keyboard remains. Since I'm stuck there's nothing else to do but unrack s221-5. As its cover is popped it becomes obvious s221-5 is the source of the smell.
Things are looking up: at least I know what failed and I don't need to unrack more stuff. Plus I just need to pull the drives from the dead server, pop them into a different server, change the MAC address to match the new hardware and the s221-5 replacement will be back up. The entire procedure should take no more than 15 minutes...
And that's when I realize that s221-5 is one of 4 servers that do not have a 2nd drive, which means it is not RAID-1, which means the drive in it is the only drive... I think to myself "Oh well, when was the last time a dead power supply killed the drive?" (probably the same time a dead power supply gave keyboards an orange Num Lock light, one after the other).
Move the drive to a different machine: it is not recognized... In fact, I can't even feel it moving when power is applied to it. Just great... We will probably need to do a transplant from a donor drive - that's when we either take the interface PCB off a matching drive and replace the dead PCB, or open the drive and transplant heads or drive platters to restore the data.
In this case the interface PCB is definitely damaged:


Notice the charred chips in the top right. Fifteen minutes later that PCB is replaced with the donor. As the drive is powered on I hear what to me could only be the sound of heads landing on the platters... Ouch. The drive gets powered off and opened. That's when the level of destruction becomes clear - the inside of the drive smells like burned plastic. The temperature must have been high enough to melt some of it onto the heads and have the heads crash onto the platters while continuing to move around, scraping the magnetic coating off the platters. The drive, as well as everything else that was mounted inside the server, was fried:
1 motherboard
1 CPU
2 memory sticks
1 hard drive
2 keyboards
P.S. The surgery on the power supply showed a few blown capacitors and half a dozen resistors on the 110V-240V side. Considering all this, my money is on any other drives attached to this server dying as well, killing the RAID.
"If it does not scale, it is broken by design"
Today a server of a customer with fantastic uptime suddenly lost its MySQL process while the customer was in the middle of a minor tweak of the WordPress platform.
Investigation revealed that the InnoDB storage engine was not able to allocate memory pages for a routine operation and, in its most bizarre way of handling errors, did a "safe" crash of the MySQL server (No, there is no such thing as a "safe" crash, so please, dear MySQL folks, add sane error handling or stop pretending you are an "industrial" strength SQL server!)
Further conversation with the customer revealed that the developer, following an example in the PHP manual, decided it was a good idea to do this:
header("Content-type: application/force-download");
//header('Content-type: video/x-ms-wmv');
header("Content-Transfer-Encoding: Binary");
//header("Content-length: ".filesize($file));
header("Content-disposition: attachment; filename=\"".basename($file)."\"");
readfile($file);
And since some of the files could be nearly 500M in size, the developer changed the PHP memory limit to accommodate.
Add a few thousand Apache processes on a server busily serving hundreds of video files using such a code path and you are virtually guaranteed to use all available RAM in minutes, if not seconds. But of course it worked on a test system with one web browser.
Why? Because embedded PHP is designed not to exit, its garbage collector is totally stupid, and probably 90% of the examples in the PHP manual are written by people who do not understand the concept of scalability.
And that's why 90% of PHP applications are broken by design. That's the cause of performance collapse over time for CMS-systems written in PHP.
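For contrast, here is what a download path that does scale looks like in principle: push the file out in fixed-size chunks, so the memory a request needs is bounded by the chunk size and not by the size of the video. This is a sketch in Python rather than PHP, and the names and the 64 KB chunk size are assumptions picked for illustration, not anything the customer's code used.

```python
CHUNK_SIZE = 64 * 1024  # assumed chunk size; tune for the workload


def stream_file(path, out, chunk_size=CHUNK_SIZE):
    """Copy the file at path to the writable object out, one chunk at a time.

    At most chunk_size bytes of the file are held in memory at once,
    no matter how large the file is. Returns the number of bytes sent.
    """
    sent = 0
    with open(path, "rb") as src:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:  # empty read means end of file
                break
            out.write(chunk)
            sent += len(chunk)
    return sent
```

A 500M file served this way costs each worker 64 KB of buffer, so a few thousand concurrent downloads need a few hundred megabytes in total, not a few hundred megabytes each. And nobody has to touch the memory limit.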
Scalable Drupal Architecture
When bored, our CTO tends to become a bit of a bull in a china shop, taking up a random instance of underperformance and smashing anything that stands in his way of fixing it.
Last Christmas he looked at a traffic graph of a customer having issues with scalability of Drupal...
Drupal performance degradation:


In addition to the cluster flatlining on the Drupal side (he says, "They don't know what they are doing. The code is horrible"), you can clearly see that the performance of the cluster drops over time.
He found this unacceptable. One week, several kernel patches, and two MySQL patches later his version of the cluster started flatlining the network connections of browsers, not servers, while still maintaining sub-1-second fetch time!
Post Alex's scaling: hey, can we add some more browsers?



Drupal still does not scale, but at least we can make it work over 20 times better than anyone else on the identical hardware.
State of Internet Backbone Companies
There's really only one word that can be used to describe the current sorry state of IP backbone providers. That word is "pathetic."
I will spare you the details of how the following started — those reasons are totally irrelevant. Suffice to say that within a week I got two "maintenance" notifications on one of the transit circuits. Both were to be service-affecting: first one to be used to upgrade the software on the router; the second one to do something with the fiber the circuit rides on. Both were supposed to be about 30 minutes long.
As the two tasks are not connected, any sane company would schedule both to be done at the same time — while one group does physical work on the fiber, the other group does the software upgrade (this is in no way different from upgrading server RAM in the same outage window as upgrading the operating system). But hey, that would be just too logical of a decision — probably last used around the time I was making them on the 2nd floor of 111 8th Avenue at the old AboveNet.
So when a new sales person of the carrier (it seems the old sales person was no longer with the company — he disappeared just after he tried to slip some creative provisions into a new contract before the end of the fiscal quarter) offered dedicated transport VLAN riding on their fiber backbone (but never touching the IP portion) to a different city I wanted to know a little bit more about how similar maintenance issues would affect this transport service.
My questions were simple:
Over last week your company scheduled two service-affecting outages for next week in PHL, which is pretty darn high considering that I have had entire 1 maintenance on Provider A gige in last 3 years, and 0 on Provider B gige.
Will the transport VLAN take the same path and be affected by this rate of service maintenance ?
Is it a protected VLAN or is it a single channel that will be going down all the time during the path maintenance?
By morning there was a response:
Now this is rich... If we call this "maintenance" then it is not an outage. Huh?! Wondering if there would be a better explanation I inquired again:
XXXXX Co scheduled two outages — and 20 minute downtime during a maintenance window is an outage regardless of how you call it — in one week.
The description of the issue in the notification email clearly indicates that it is your fiber issue and it is not between me and XXXXX Co. If that's the case there should be no customer-visible outage unless you are running every customer on a single fiber pair (or even using single fiber with wave splitting to avoid having to run 2nd fiber all together). And if you are running everything on one fiber you are unlikely to get any more of my business.
I thought that maybe I would get a better response to my second and third questions... Nope. Oh goodie, both of my questions are answered:
He must be confused. The concept of a shortest open path comes from OSPF - the IP routing protocol, which operates on layer 3. We have been talking about VLANs, which operate on layer 2 and don't care about what runs above them. Unless of course they have some über-smart super-secret sauce that when dripped on the Cisco gear they use magically makes things happen. That must be it... but I must know more:
I leave you with a little nugget hidden in a footnote:
Because, as we all know, when buying a car one always pays extra for the tires and pedals.
A tale of backup transit...
A Senior Sales Manager of an unnamed company that claims to be a "regional leader in business connectivity", familiar with our requirements for backup transit (gige via in-building cross-connect, BGP, low CIR, etc.), tells a Senior Sales Droid to call us with a quote. The conversation went like this:
Zubrcom: I need transit over gige PNI at 401 N. Broad. Can you do that?
Senior Sales Droid: You need PRI?
Zubrcom: No, I need transit over gige PNI.
Senior Sales Droid: Private Network Interface.
Zubrcom: Yes.
Senior Sales Droid: What do you need that for?
Zubrcom: Transit.
Senior Sales Droid: Between where and where?
Zubrcom: No transport. Transit. To the internet.
Senior Sales Droid: Ah to the internet... That's what we call Direct Internet Access
Zubrcom: OK.
Senior Sales Droid: Transit is a circuit that goes between two places on our network.
Senior Sales Droid: So do you want to talk about it?
Zubrcom: No. I need pricing.
Senior Sales Droid: You mean you don't want to talk?
Zubrcom: No. I just want pricing.
Senior Sales Droid: Oh that will take some time, I need to run models.
And the pricing... the pricing was just special.
Sun burning through the clouds....
When a sales droid is selling you virtualization as a way to save over your clueful service provider, not only is he selling you the rainbows and magic, but also this level of availability:
Mon Jul 26 08:46:04 2010|http://www.importantsite.com|Failure|Code: 500|61 second(s).
Mon Jul 26 08:48:03 2010|http://www.importantsite.com|Failure|Code: 500|61 second(s).
Mon Jul 26 08:48:47 2010|http://www.importantsite.com|Failure|Code: 500|45 second(s).
Mon Jul 26 08:49:11 2010|http://www.importantsite.com|Failure|Code: 500|9 second(s).
Mon Jul 26 08:50:04 2010|http://www.importantsite.com|Failure|Code: 503|1 second(s).
Mon Jul 26 08:51:03 2010|http://www.importantsite.com|Failure|Code: 503|1 second(s).
Mon Jul 26 08:52:03 2010|http://www.importantsite.com|Failure|Code: 503|1 second(s).
Mon Jul 26 08:53:02 2010|http://www.importantsite.com|Failure|Code: 503|0 second(s).
Mon Jul 26 08:54:03 2010|http://www.importantsite.com|Failure|Code: 503|1 second(s).
Mon Jul 26 08:56:03 2010|http://www.importantsite.com|Failure|Code: 500|61 second(s).
Mon Jul 26 08:57:03 2010|http://www.importantsite.com|Failure|Code: 500|61 second(s).
Mon Jul 26 08:58:02 2010|http://www.importantsite.com|Failure|Code: 500|61 second(s).
Mon Jul 26 08:59:02 2010|http://www.importantsite.com|Failure|Code: 500|61 second(s).
Mon Jul 26 08:59:47 2010|http://www.importantsite.com|Failure|Code: 500|45 second(s).
Mon Jul 26 09:01:04 2010|http://www.importantsite.com|Failure|Code: 500|61 second(s).
Mon Jul 26 09:02:03 2010|http://www.importantsite.com|Failure|Code: 500|61 second(s).
Mon Jul 26 09:02:48 2010|http://www.importantsite.com|Failure|Code: 500|46 second(s).
Mon Jul 26 09:03:47 2010|http://www.importantsite.com|Failure|Code: 500|46 second(s).
Mon Jul 26 09:04:47 2010|http://www.importantsite.com|Failure|Code: 500|46 second(s).
Mon Jul 26 09:05:48 2010|http://www.importantsite.com|Failure|Code: 500|46 second(s).
Mon Jul 26 09:06:47 2010|http://www.importantsite.com|Failure|Code: 500|46 second(s).
Mon Jul 26 09:07:23 2010|http://www.importantsite.com|Failure|Code: 500|22 second(s).
Mon Jul 26 09:08:47 2010|http://www.importantsite.com|Failure|Code: 500|45 second(s).
Mon Jul 26 09:09:02 2010|http://www.importantsite.com|Success|Code: 200|0 second(s).
Mon Jul 26 09:10:03 2010|http://www.importantsite.com|Success|Code: 200|1 second(s).
Mon Jul 26 09:11:48 2010|http://www.importantsite.com|Failure|Code: 500|46 second(s).
Mon Jul 26 09:12:05 2010|http://www.importantsite.com|Success|Code: 200|4 second(s).
Mon Jul 26 09:13:02 2010|http://www.importantsite.com|Success|Code: 200|0 second(s).
Mon Jul 26 09:14:02 2010|http://www.importantsite.com|Success|Code: 200|1 second(s).
Mon Jul 26 09:15:03 2010|http://www.importantsite.com|Success|Code: 200|1 second(s).
Mon Jul 26 09:16:02 2010|http://www.importantsite.com|Success|Code: 200|1 second(s).
Mon Jul 26 09:17:03 2010|http://www.importantsite.com|Success|Code: 200|1 second(s).
Mon Jul 26 09:18:02 2010|http://www.importantsite.com|Success|Code: 200|1 second(s).
Mon Jul 26 09:19:02 2010|http://www.importantsite.com|Success|Code: 200|0 second(s).
Mon Jul 26 09:20:04 2010|http://www.importantsite.com|Success|Code: 200|1 second(s).
Mon Jul 26 09:21:03 2010|http://www.importantsite.com|Success|Code: 200|1 second(s).
Mon Jul 26 09:22:03 2010|http://www.importantsite.com|Success|Code: 200|1 second(s).
Mon Jul 26 09:23:02 2010|http://www.importantsite.com|Success|Code: 200|1 second(s).
Mon Jul 26 09:24:03 2010|http://www.importantsite.com|Success|Code: 200|1 second(s).
Mon Jul 26 09:26:03 2010|http://www.importantsite.com|Failure|Code: 500|61 second(s).
Mon Jul 26 09:27:03 2010|http://www.importantsite.com|Failure|Code: 500|61 second(s).
Mon Jul 26 09:28:02 2010|http://www.importantsite.com|Failure|Code: 500|60 second(s).
Mon Jul 26 09:29:02 2010|http://www.importantsite.com|Failure|Code: 500|61 second(s).
Mon Jul 26 09:29:47 2010|http://www.importantsite.com|Failure|Code: 500|45 second(s).
Mon Jul 26 09:31:04 2010|http://www.importantsite.com|Failure|Code: 500|61 second(s).
Mon Jul 26 09:32:02 2010|http://www.importantsite.com|Failure|Code: 500|61 second(s).
Mon Jul 26 09:33:08 2010|http://www.importantsite.com|Failure|Code: 503|7 second(s).
Mon Jul 26 09:33:56 2010|http://www.importantsite.com|Failure|Code: 500|114 second(s).
Mon Jul 26 09:34:02 2010|http://www.importantsite.com|Failure|Code: 500|0 second(s).
Mon Jul 26 09:35:03 2010|http://www.importantsite.com|Failure|Code: 500|1 second(s).
Mon Jul 26 09:36:20 2010|http://www.importantsite.com|Success|Code: 200|19 second(s).
Mon Jul 26 09:38:02 2010|http://www.importantsite.com|Failure|Code: 500|61 second(s).
Mon Jul 26 09:39:03 2010|http://www.importantsite.com|Failure|Code: 500|61 second(s).
Mon Jul 26 09:40:03 2010|http://www.importantsite.com|Failure|Code: 500|61 second(s).
Mon Jul 26 09:40:48 2010|http://www.importantsite.com|Failure|Code: 500|46 second(s).
Mon Jul 26 09:41:27 2010|http://www.importantsite.com|Failure|Code: 500|25 second(s).
Mon Jul 26 09:42:02 2010|http://www.importantsite.com|Failure|Code: 500|1 second(s).
Mon Jul 26 09:43:02 2010|http://www.importantsite.com|Success|Code: 200|0 second(s).
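If you would rather have a number to wave at the sales droid than a wall of log lines, the log is trivial to summarize. A small sketch in Python, assuming only the pipe-delimited format shown above; the function and field names are made up for illustration.

```python
def parse_check(line):
    """Split one 'time|url|status|Code: N|N second(s).' line into a dict."""
    stamp, url, status, code, duration = line.strip().split("|")
    return {
        "time": stamp,
        "url": url,
        "ok": status == "Success",
        "code": int(code.split(":")[1]),
        "seconds": int(duration.split()[0]),
    }


def availability(lines):
    """Fraction of checks that succeeded, from 0.0 to 1.0."""
    checks = [parse_check(line) for line in lines if line.strip()]
    if not checks:
        return 0.0
    return sum(1 for check in checks if check["ok"]) / len(checks)
```

Run it over an hour like the one above and compare the result with the number of nines in the brochure.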
Basic Linux Kernel Networking Tuning
Network Stack TCP tuning
These changes should go into /etc/sysctl.conf
The default maximums for buffers allocated to TCP are totally insane (far too small for fast links). While the optimal numbers need to be calculated for each specific environment, the following settings are a good start:
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
Increase Linux auto-tuning TCP buffer limits:
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
Use robust congestion control:
net.ipv4.tcp_congestion_control = htcp
Don't cache ssthresh from previous connection - well, duh:
net.ipv4.tcp_no_metrics_save = 1
net.ipv4.tcp_moderate_rcvbuf = 1
Pretend we always have a gigabit ethernet, even if we only have fast ethernet:
net.core.netdev_max_backlog = 2500
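As for calculating the numbers for a specific environment: the usual rule of thumb is that a single TCP connection needs a buffer of at least bandwidth times round-trip time (the bandwidth-delay product) to keep the pipe full. A quick sketch in Python; the link speed and RTT in the usage note are example assumptions, not measurements from any of our networks.

```python
MAX_BUFFER = 16777216  # the 16 MiB cap used in the settings above


def bdp_bytes(bandwidth_bits_per_sec, rtt_ms):
    """Bandwidth-delay product in bytes for a link speed and a round-trip time."""
    # bits/sec * milliseconds / (8 bits per byte * 1000 ms per second)
    return bandwidth_bits_per_sec * rtt_ms // 8000


def fits_in_buffer(bandwidth_bits_per_sec, rtt_ms, max_buffer=MAX_BUFFER):
    """True if one connection can fill the link within the buffer cap."""
    return bdp_bytes(bandwidth_bits_per_sec, rtt_ms) <= max_buffer
```

For example, bdp_bytes(1_000_000_000, 100) comes to 12,500,000 bytes, so a gigabit path with a 100 ms round trip still fits under the 16 MiB cap, while the same gigabit at 200 ms does not.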
Interface Transmit Queue Size
The default setting is idiotic. Change it to something sane:
ifconfig eth0 txqueuelen 1000
This, unfortunately, can't go into /etc/sysctl.conf. Put it in /etc/rc.d/rc.local instead.

