"If it does not scale, it is broken by design"
Today a customer's server with fantastic uptime suddenly lost its MySQL process while the customer was in the middle of a minor tweak to the WordPress platform.
Investigation revealed that the InnoDB storage engine was not able to allocate memory pages for a routine operation and, in its most bizarre way of handling errors, did a "safe" crash of the MySQL server. (No, there is no such thing as a "safe" crash, so please, dear MySQL folks, add sane error handling or stop pretending you are an "industrial strength" SQL server!)
Further conversation with the customer revealed that the developer, following an example in the PHP Manual, decided it was a good idea to do this:
header("Content-type: application/force-download");
//header('Content-type: video/x-ms-wmv');
header("Content-Transfer-Encoding: Binary");
//header("Content-length: ".filesize($file));
header("Content-disposition: attachment; filename=\"".basename($file)."\"");
readfile($file);
And since some of the files could be nearly 500 MB in size, the developer raised the PHP memory limit to accommodate them.
Add a few thousand Apache processes on a server busily serving hundreds of video files using such a code path and you are virtually guaranteed to use all available RAM in minutes, if not seconds. But of course it worked on a test system with one web browser.
Why? Because embedded PHP is designed not to exit, its garbage collector is totally stupid, and probably 90% of the examples in the PHP manual are written by people who do not understand the concept of scalability.
And that's why 90% of PHP applications are broken by design. That's the cause of performance collapse over time in CMS systems written in PHP.
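For what it's worth, the alternative is not rocket science. Below is a minimal sketch, written in Python rather than PHP and not anything the customer actually ran, of sending a file in fixed-size chunks so that each request holds only one chunk in memory. It also includes the back-of-the-envelope arithmetic for the worst case, assuming every process ends up buffering a whole file. The names stream_file and worst_case_ram_bytes and the 64 KiB chunk size are made up for illustration.

```python
import io

CHUNK_SIZE = 64 * 1024  # hypothetical chunk size, not a recommendation


def stream_file(src, dst, chunk_size=CHUNK_SIZE):
    """Copy src to dst one chunk at a time and return the bytes copied.

    Only one chunk is held in memory at any moment, regardless of file size.
    """
    total = 0
    while True:
        chunk = src.read(chunk_size)
        if not chunk:  # empty read means end of file
            break
        dst.write(chunk)
        total += len(chunk)
    return total


def worst_case_ram_bytes(processes, bytes_per_request):
    """Worst case if every process buffers a whole file (an assumption)."""
    return processes * bytes_per_request


if __name__ == "__main__":
    # In-memory stand-ins for a file on disk and a client socket.
    source = io.BytesIO(b"x" * (1024 * 1024))
    client = io.BytesIO()
    print("bytes sent:", stream_file(source, client))

    # 1000 processes, each holding a 500 MiB file, versus one chunk each.
    print("whole-file worst case:", worst_case_ram_bytes(1000, 500 * 1024 * 1024))
    print("chunked worst case:", worst_case_ram_bytes(1000, CHUNK_SIZE))
```

The point of the second function is only to show the scale of the difference between per-request memory that grows with file size and per-request memory that is capped by a constant.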
Scalable Drupal Architecture
When bored, our CTO tends to become a bit of a bull in a china shop, taking up a random instance of underperformance and smashing anything that stands in his way of fixing it.
Last Christmas he looked at a traffic graph of a customer having issues with scalability of Drupal...
[Graph: Drupal performance degradation]
In addition to the cluster flatlining on the Drupal side (he says: "They don't know what they are doing. The code is horrible"), you can clearly see that the performance of the cluster drops over time.
He found this unacceptable. One week, several kernel patches, and two MySQL patches later, his version of the cluster started flatlining the network connections of the browsers, not the servers, while still maintaining a sub-1-second fetch time!
[Graph: Post Alex's scaling: hey, can we add some more browsers?]
Drupal still does not scale, but at least we can make it work over 20 times better than anyone else on the identical hardware.
State of Internet Backbone Companies
There's really only one word that can be used to describe the current sorry state of IP backbone providers. That word is "pathetic."
I will spare you the details of how the following started; those reasons are totally irrelevant. Suffice it to say that within a week I got two "maintenance" notifications on one of the transit circuits. Both were to be service-affecting: the first to upgrade the software on the router, the second to do something with the fiber the circuit rides on. Both were supposed to be about 30 minutes long.
As the two tasks are not connected, any sane company would schedule both to be done at the same time: while one group does physical work on the fiber, the other group does the software upgrade (this is in no way different from upgrading server RAM in the same outage window as upgrading the operating system). But hey, that would be just too logical a decision, probably last used around the time I was making them on the 2nd floor of 111 8th Avenue at the old AboveNet.
So when a new sales person of the carrier offered a dedicated transport VLAN to a different city, riding on their fiber backbone but never touching the IP portion, I wanted to know a little bit more about how similar maintenance issues would affect this transport service. (It seems the old sales person was no longer with the company: he disappeared just after he tried to slip some creative provisions into a new contract before the end of the fiscal quarter.)
My questions were simple:
Over the last week your company scheduled two service-affecting outages for next week in PHL, which is pretty darn high considering that I have had a grand total of 1 maintenance on the Provider A gige in the last 3 years, and 0 on the Provider B gige.
Will the transport VLAN take the same path and be affected by this rate of service maintenance?
Is it a protected VLAN or is it a single channel that will be going down all the time during the path maintenance?
By morning there was a response:
Now this is rich... If we call this "maintenance" then it is not an outage. Huh?! Wondering if there would be a better explanation I inquired again:
XXXXX Co scheduled two outages — and 20 minute downtime during a maintenance window is an outage regardless of how you call it — in one week.
The description of the issue in the notification email clearly indicates that it is your fiber issue and it is not between me and XXXXX Co. If that's the case there should be no customer-visible outage unless you are running every customer on a single fiber pair (or even using a single fiber with wave splitting to avoid having to run a 2nd fiber altogether). And if you are running everything on one fiber you are unlikely to get any more of my business.
I thought that maybe I would get a better response to my second and third questions... Nope. Oh goodie, both of my questions are answered:
He must be confused. The concept of a shortest open path comes from OSPF (Open Shortest Path First), an IP routing protocol that operates on layer 3. We have been talking about VLANs, which operate on layer 2 and don't care about what runs above them. Unless, of course, they have some über-smart super-secret sauce that, when dripped on the Cisco gear they use, magically makes things happen. That must be it... but I must know more:
I leave you with a little nugget hidden in a footnote:
Because, as we all know, when buying a car one always pays extra for the tires and pedals.
A tale of backup transit...
A Senior Sales Manager of an unnamed company that claims to be a "regional leader in business connectivity", familiar with our requirements for backup transit (gige via in-building cross-connect, BGP, low CIR, etc.), tells a Senior Sales Droid to call us with a quote. The conversation went like this:
Zubrcom: I need transit over gige PNI at 401 N. Broad. Can you do that?
Senior Sales Droid: You need PRI?
Zubrcom: No, I need transit over gige PNI.
Senior Sales Droid: Private Network Interface.
Zubrcom: Yes.
Senior Sales Droid: What do you need that for?
Zubrcom: Transit.
Senior Sales Droid: Between where and where?
Zubrcom: No transport. Transit. To the internet.
Senior Sales Droid: Ah, to the internet... That's what we call Direct Internet Access.
Zubrcom: OK.
Senior Sales Droid: Transit is a circuit that goes between two places on our network.
Senior Sales Droid: So do you want to talk about it?
Zubrcom: No. I need pricing.
Senior Sales Droid: You mean you don't want to talk?
Zubrcom: No. I just want pricing.
Senior Sales Droid: Oh that will take some time, I need to run models.
And the pricing... the pricing was just special.
Sun burning through the clouds....
When a sales droid is selling you virtualization as a way to save money compared to your clueful service provider, he is selling you not only rainbows and magic, but also this level of availability:
Mon Jul 26 08:46:04 2010|http://www.importantsite.com|Failure|Code: 500|61 second(s).
Mon Jul 26 08:48:03 2010|http://www.importantsite.com|Failure|Code: 500|61 second(s).
Mon Jul 26 08:48:47 2010|http://www.importantsite.com|Failure|Code: 500|45 second(s).
Mon Jul 26 08:49:11 2010|http://www.importantsite.com|Failure|Code: 500|9 second(s).
Mon Jul 26 08:50:04 2010|http://www.importantsite.com|Failure|Code: 503|1 second(s).
Mon Jul 26 08:51:03 2010|http://www.importantsite.com|Failure|Code: 503|1 second(s).
Mon Jul 26 08:52:03 2010|http://www.importantsite.com|Failure|Code: 503|1 second(s).
Mon Jul 26 08:53:02 2010|http://www.importantsite.com|Failure|Code: 503|0 second(s).
Mon Jul 26 08:54:03 2010|http://www.importantsite.com|Failure|Code: 503|1 second(s).
Mon Jul 26 08:56:03 2010|http://www.importantsite.com|Failure|Code: 500|61 second(s).
Mon Jul 26 08:57:03 2010|http://www.importantsite.com|Failure|Code: 500|61 second(s).
Mon Jul 26 08:58:02 2010|http://www.importantsite.com|Failure|Code: 500|61 second(s).
Mon Jul 26 08:59:02 2010|http://www.importantsite.com|Failure|Code: 500|61 second(s).
Mon Jul 26 08:59:47 2010|http://www.importantsite.com|Failure|Code: 500|45 second(s).
Mon Jul 26 09:01:04 2010|http://www.importantsite.com|Failure|Code: 500|61 second(s).
Mon Jul 26 09:02:03 2010|http://www.importantsite.com|Failure|Code: 500|61 second(s).
Mon Jul 26 09:02:48 2010|http://www.importantsite.com|Failure|Code: 500|46 second(s).
Mon Jul 26 09:03:47 2010|http://www.importantsite.com|Failure|Code: 500|46 second(s).
Mon Jul 26 09:04:47 2010|http://www.importantsite.com|Failure|Code: 500|46 second(s).
Mon Jul 26 09:05:48 2010|http://www.importantsite.com|Failure|Code: 500|46 second(s).
Mon Jul 26 09:06:47 2010|http://www.importantsite.com|Failure|Code: 500|46 second(s).
Mon Jul 26 09:07:23 2010|http://www.importantsite.com|Failure|Code: 500|22 second(s).
Mon Jul 26 09:08:47 2010|http://www.importantsite.com|Failure|Code: 500|45 second(s).
Mon Jul 26 09:09:02 2010|http://www.importantsite.com|Success|Code: 200|0 second(s).
Mon Jul 26 09:10:03 2010|http://www.importantsite.com|Success|Code: 200|1 second(s).
Mon Jul 26 09:11:48 2010|http://www.importantsite.com|Failure|Code: 500|46 second(s).
Mon Jul 26 09:12:05 2010|http://www.importantsite.com|Success|Code: 200|4 second(s).
Mon Jul 26 09:13:02 2010|http://www.importantsite.com|Success|Code: 200|0 second(s).
Mon Jul 26 09:14:02 2010|http://www.importantsite.com|Success|Code: 200|1 second(s).
Mon Jul 26 09:15:03 2010|http://www.importantsite.com|Success|Code: 200|1 second(s).
Mon Jul 26 09:16:02 2010|http://www.importantsite.com|Success|Code: 200|1 second(s).
Mon Jul 26 09:17:03 2010|http://www.importantsite.com|Success|Code: 200|1 second(s).
Mon Jul 26 09:18:02 2010|http://www.importantsite.com|Success|Code: 200|1 second(s).
Mon Jul 26 09:19:02 2010|http://www.importantsite.com|Success|Code: 200|0 second(s).
Mon Jul 26 09:20:04 2010|http://www.importantsite.com|Success|Code: 200|1 second(s).
Mon Jul 26 09:21:03 2010|http://www.importantsite.com|Success|Code: 200|1 second(s).
Mon Jul 26 09:22:03 2010|http://www.importantsite.com|Success|Code: 200|1 second(s).
Mon Jul 26 09:23:02 2010|http://www.importantsite.com|Success|Code: 200|1 second(s).
Mon Jul 26 09:24:03 2010|http://www.importantsite.com|Success|Code: 200|1 second(s).
Mon Jul 26 09:26:03 2010|http://www.importantsite.com|Failure|Code: 500|61 second(s).
Mon Jul 26 09:27:03 2010|http://www.importantsite.com|Failure|Code: 500|61 second(s).
Mon Jul 26 09:28:02 2010|http://www.importantsite.com|Failure|Code: 500|60 second(s).
Mon Jul 26 09:29:02 2010|http://www.importantsite.com|Failure|Code: 500|61 second(s).
Mon Jul 26 09:29:47 2010|http://www.importantsite.com|Failure|Code: 500|45 second(s).
Mon Jul 26 09:31:04 2010|http://www.importantsite.com|Failure|Code: 500|61 second(s).
Mon Jul 26 09:32:02 2010|http://www.importantsite.com|Failure|Code: 500|61 second(s).
Mon Jul 26 09:33:08 2010|http://www.importantsite.com|Failure|Code: 503|7 second(s).
Mon Jul 26 09:33:56 2010|http://www.importantsite.com|Failure|Code: 500|114 second(s).
Mon Jul 26 09:34:02 2010|http://www.importantsite.com|Failure|Code: 500|0 second(s).
Mon Jul 26 09:35:03 2010|http://www.importantsite.com|Failure|Code: 500|1 second(s).
Mon Jul 26 09:36:20 2010|http://www.importantsite.com|Success|Code: 200|19 second(s).
Mon Jul 26 09:38:02 2010|http://www.importantsite.com|Failure|Code: 500|61 second(s).
Mon Jul 26 09:39:03 2010|http://www.importantsite.com|Failure|Code: 500|61 second(s).
Mon Jul 26 09:40:03 2010|http://www.importantsite.com|Failure|Code: 500|61 second(s).
Mon Jul 26 09:40:48 2010|http://www.importantsite.com|Failure|Code: 500|46 second(s).
Mon Jul 26 09:41:27 2010|http://www.importantsite.com|Failure|Code: 500|25 second(s).
Mon Jul 26 09:42:02 2010|http://www.importantsite.com|Failure|Code: 500|1 second(s).
Mon Jul 26 09:43:02 2010|http://www.importantsite.com|Success|Code: 200|0 second(s).
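To put a number on a log like the one above, the checks can be parsed and counted. The following is a minimal sketch, assuming the pipe-delimited format shown (timestamp, URL, Success or Failure, HTTP code, elapsed seconds). The helper names parse_check and availability are hypothetical, and the fraction it computes is the share of successful checks, not a measure of downtime in minutes.

```python
from datetime import datetime


def parse_check(line):
    """Split one monitoring line into its five pipe-delimited fields."""
    stamp, url, status, code, elapsed = line.strip().split("|")
    return {
        "time": datetime.strptime(stamp, "%a %b %d %H:%M:%S %Y"),
        "url": url,
        "ok": status == "Success",
        "code": int(code.split(":")[1]),   # "Code: 500" -> 500
        "seconds": int(elapsed.split()[0]),  # "61 second(s)." -> 61
    }


def availability(lines):
    """Fraction of checks that succeeded; 0.0 for an empty log."""
    checks = [parse_check(line) for line in lines if line.strip()]
    if not checks:
        return 0.0
    return sum(1 for check in checks if check["ok"]) / len(checks)


if __name__ == "__main__":
    # Three lines copied from the log above.
    sample = [
        "Mon Jul 26 08:46:04 2010|http://www.importantsite.com|Failure|Code: 500|61 second(s).",
        "Mon Jul 26 09:09:02 2010|http://www.importantsite.com|Success|Code: 200|0 second(s).",
        "Mon Jul 26 09:11:48 2010|http://www.importantsite.com|Failure|Code: 500|46 second(s).",
    ]
    print("successful checks: {:.0%}".format(availability(sample)))
```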
Basic Linux Kernel Networking Tuning
Network Stack TCP tuning
These changes should go into /etc/sysctl.conf.
The default maximum buffer sizes allocated to TCP are totally insane. While the optimal numbers need to be calculated for each specific environment, the following settings are a good start:
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
Increase Linux auto-tuning TCP buffer limits:
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
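As for calculating numbers for a specific environment, the usual starting point is the bandwidth-delay product: link speed multiplied by round-trip time gives the amount of data that can be in flight, and the maximum buffer should be at least that large. A minimal sketch follows. The function names are hypothetical, and the link speeds and round-trip time are example inputs, not measurements from any real network.

```python
def bdp_bytes(bandwidth_bits_per_sec, rtt_ms):
    """Bandwidth-delay product in bytes, using integer arithmetic.

    bits/s * ms / 1000 gives bits in flight; dividing by 8 gives bytes.
    """
    return bandwidth_bits_per_sec * rtt_ms // 8000


def fits_in_buffer(bandwidth_bits_per_sec, rtt_ms, buffer_max_bytes):
    """True if a single connection can fill the path with this buffer cap."""
    return bdp_bytes(bandwidth_bits_per_sec, rtt_ms) <= buffer_max_bytes


if __name__ == "__main__":
    BUFFER_MAX = 16777216  # the 16 MiB maximum used in the settings above

    # Example inputs (assumptions): gigabit and 10-gigabit paths at 100 ms RTT.
    for label, bits_per_sec in (("1 Gbit/s", 1_000_000_000),
                                ("10 Gbit/s", 10_000_000_000)):
        needed = bdp_bytes(bits_per_sec, 100)
        verdict = "fits" if fits_in_buffer(bits_per_sec, 100, BUFFER_MAX) else "too small"
        print(label, "at 100 ms needs", needed, "bytes:", verdict)
```

By this arithmetic a 16 MiB cap covers a gigabit path at 100 ms, which needs about 12.5 MB, while faster or longer paths would call for larger values.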
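Use robust congestion control (H-TCP here; it has to be listed in net.ipv4.tcp_available_congestion_control, which may require loading the tcp_htcp module):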
net.ipv4.tcp_congestion_control=htcp
Don't cache ssthresh from the previous connection (well, duh), and keep receive buffer auto-tuning enabled:
net.ipv4.tcp_no_metrics_save = 1
net.ipv4.tcp_moderate_rcvbuf = 1
Pretend we always have a gigabit ethernet, even if we only have fast ethernet:
net.core.netdev_max_backlog = 2500
Interface Transmit Queue Size
The default setting is idiotic. Change it to something sane:
ifconfig eth0 txqueuelen 1000
This, unfortunately, can't go into /etc/sysctl.conf. Put it in /etc/rc.d/rc.local.
Linux Networking Housekeeping
Routing Table Cleanup
While the 169.254.0.0/16 route shows up only on machines with multiple NICs, it has no place on these systems at all. Get rid of the annoying Zero Configuration crap by adding the following line to /etc/sysconfig/network:
NOZEROCONF=yes
MySQL: Fix bad connection error defaults
For whatever reason, MySQL server software ships with really stupid defaults for handling problems with incoming connections. Obviously, Zubrcom's installation of MySQL does not suffer from this problem. However, if you have braved rolling your own installation, it may be helpful to add the following lines to /etc/my.cnf:
[mysqld]
max_connections = 1100 # max number of clients if each client is non-threaded
max_connect_errors = 99999999 # don't block a host until it has caused this many connection errors
max_user_connections = 1100 # max simultaneous connections per user
If MySQL is already running, log in to the root account and use the SET GLOBAL command (for example, SET GLOBAL max_connect_errors = 99999999;) to change the settings without restarting MySQL.
