Recovery Monkey: Musings on backups, storage, tuning and more


Mon
30
Aug '10

NetApp benefits for virtualization - benchmarked and proven

My colleague Vaughn Stewart explains it in detail here. I didn’t feel we gave this the publicity it deserves…

In a nutshell: We have numbers (published only after VMware engineering themselves approved the paper as accurate and gave their permission) proving that, compared to “traditional” arrays, running virtualized workloads on NetApp gear needs fewer resources while providing excellent performance.

If you don’t want to spend time reading Vaughn’s article, this link has the goods in impressive detail…

It’s worth noting the “traditional” array had a lot more disks and RAM, but the NetApp array had a Flash Cache module. We are not allowed to publish the vendor of the “traditional” array due to licensing restrictions, but, as mentioned, VMware engineering verified the results – the test was legit.

Some pictures for the impatient:

 

image

 

image

 

image

image

 

Key take-aways:

  1. A lot less disk space needed with NetApp
  2. A lot quicker to provision the VMs
  3. Faster performance than RAID10 even without the Flash Cache (and dramatically higher with)
  4. No-compromise RAID-DP offers same protection as RAID6 without the penalty
  5. NFS for VMware can be pretty fast indeed given the appropriate storage behind!

 

D

Sat
7
Aug '10

FUD tales from the blogosphere: when vendors attack (and a wee bit on expanding and balancing RAID groups)

Haven’t blogged in a while, way too busy. Against my better judgment, I thought I’d respond to some comments I’ve seen on the blogosphere, adding one of my trademark extremely long titles. Part response, part tutorial. People with no time to read it all: Skip to the end and see if you know the answer to the question or if you have ideas on how to do such a thing.

It’s funny how some vendors won’t hesitate to wholeheartedly agree when some “independent” blogger criticizes their competition (before I get flamed, independent in quotes since, as I discussed before, there ain’t no such thing whether said blogger realizes it or not – being biased is a basic human condition).

The equivalent of someone posting in an Audi forum about excessive brake dust, and having guys from Mercedes and BMW chime in and claim how they “tested” Audis and indeed they had issues (but of course!) and how their cars are better now and indeed maybe Audi doesn’t have as much of a lead any more (if, indeed, they ever did). I think the term for that is “shill” but I can understand taking every opportunity to harm an opponent.

So the “Storage Architect” posted entries asking about certain features to be implemented on NetApp storage, one of them being able to reduce the size of an aggregate. Then everyone and their mum jumped on and complained how on earth such an important feature isn’t there… :) BTW – I’m not saying such a thing wouldn’t be useful to have from time to time. I’ll just try to explain why it’s tricky to implement and maybe ways to avoid problems.

For the uninitiated, an “aggregate” is a collection of RAID-DP RAID groups that are pooled and striped, so that I/O hits all the drives from all RAID groups equally for performance. You then carve volumes out of that aggregate (containers for NFS, CIFS, iSCSI, FC).

A pretty simple structure, really, but effective. Similar constructs are used by many other storage vendors that allow pooling.

So, the question was, why not be able to make an aggregate smaller? (you can already make it bigger on-the-fly, as well as grow or shrink the existing volumes within).

An HP guy then proceeded to complain about how he put too few drives in an aggregate and ended up with an imbalanced configuration while trying to test a NetApp box.

So, some basics… the following picture shows a well-balanced pool – notice the equal number of drives per RAID group:

image

The idea being that everything is load-balanced:

image

Makes sense, right?

You then end up with pieces of data across all disks, which is the intent. Growing it is easy – which is, after all, what 99.99% of customers ever want to do.

However, the HP dude didn’t have enough disks to create a balanced config with the default-sized RAID group (16). So he ended up with something like this, not performance-optimal:

image

So what the HP dude wanted to do, was to reduce the size of the RAID group and remove drives, even though he expanded the aggregate (and by extension the RAID group) originally.

Normally, before one starts creating pools of storage (with any storage system), one also knows (or should) what one has to play with in order to get the best overall config. It’s like – “I want to build a 12-cylinder car engine, but I only have 9 cylinders”. Well – either buy more cylinders, or build an 8-cylinder engine… don’t start building the 12-cylinder engine and go “oops” :) This is just Storage 101. Mistakes can and do happen, of course.

So, with the current state of tech, if I only had 20 drives to play with (and no option to get more), assuming no spares, I’d rather do one of the following:

  1. Aggregate with 10 + 10 RAID groups inside or…
  2. Use all 20 drives in a single RAID group for max space
  3. Ask someone that knows the system better than I do for some advice

This is common sense and both doable and trivial with a NetApp system. The idea is you set the desired RAID group size for that aggregate BEFORE you put in disks. Not really difficult and pretty logical.

For instance, aggr options HPdudeAggr raidsize 10 before adding the drives would have achieved #1 above. Graphically, the Web GUI has that option in there as well, when you modify an aggregate. The option exists and it’s well-known and documented. Not knowing about it is a basic education issue. Arguing that no education should be needed to use a storage device (with an extreme number of features) properly even for deeply involved, low-level operations, is a romantic notion at best. Maybe some day. We are all working hard to make it a reality. Indeed, a lot of things that would take a really long time in the past (or still, with other boxes) have become trivialized – look at SnapDrive and the SnapManager products, for instance.

Back to our example: if, in the future, 10 more disks were purchased, and approach #1 above was taken, one would simply add the ten disks to the aggregate with aggr add HPdudeAggr 10. Resulting in a 10+10+10 config.

But what if I had done #2 above (make a 20-drive RAID group the default for that aggregate)?

Then, simply, you’d end up imbalanced again, with a 20+10. Some thought is needed before embarking on such journeys.

Maybe a better approach would be to add, say, a more reasonable number of drives to achieve good balance? Adding 12 more drives, for example, would allow for an aggregate with 16+16 drives. So, one could simply change the raidsize using aggr options HPdudeAggr raidsize 16, then add the 12 disks to the aggregate with aggr add HPdudeAggr -g all 12.

This would expand both RAID groups contained within the aggregate dynamically to 16 drives per, resulting in a 16+16 configuration. Which, BTW, is not something you can easily do with most other storage systems…
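To make the balancing arithmetic concrete, here is a minimal Python sketch. It assumes, as the 20-drive examples in this post imply, that disks fill one RAID group up to raidsize before the next group is started. The function names are made up for illustration and are not part of any NetApp tool.

```python
def raid_group_layout(total_disks, raidsize):
    """Return RAID group sizes, assuming each group is filled up to
    raidsize before a new group is started (an assumption of this sketch)."""
    if total_disks < 0 or raidsize < 1:
        raise ValueError("need total_disks >= 0 and raidsize >= 1")
    full_groups, remainder = divmod(total_disks, raidsize)
    layout = [raidsize] * full_groups
    if remainder:
        layout.append(remainder)
    return layout


def is_balanced(layout):
    """A layout is balanced when every RAID group has the same disk count."""
    return len(set(layout)) <= 1


if __name__ == "__main__":
    # (total disks, raidsize) pairs taken from the scenarios discussed above.
    for disks, raidsize in [(20, 16), (20, 10), (30, 10), (30, 20), (32, 16)]:
        layout = raid_group_layout(disks, raidsize)
        state = "balanced" if is_balanced(layout) else "imbalanced"
        print(f"{disks} disks, raidsize {raidsize}: {layout} ({state})")
```

Running it flags 20 disks at raidsize 16 and 30 disks at raidsize 20 as imbalanced, while 10+10, 10+10+10 and 16+16 come out even.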

Having said all that, I think that for people that are not storage savvy (or for the storage savvy that are suffering from temporary brain fog), a good enhancement would be for the interfaces to warn you about imbalanced final configs and show you what will be created in a nice graphical fashion, asking you if you agree (and possibly providing hints on how it could be done better).

I’m not aware of any other storage system that does that degree of handholding but hey, I don’t know everything.

Indeed, maybe the nature of the other posts was being bait so I’ll obligingly take the bait and ask the question so you can advertise your wares here: :)

Is anyone aware of a well-featured storage system from an established, viable vendor that currently (Aug 7, 2010, not roadmap or “Real Soon Now”) allows the creation of a wide-striped pool of drives with some RAID structures underneath; then allows one to evacuate and then destroy some of those underlying RAID groups selectively, non-disruptively, without losing data, even though they already contain parts of the stripes; then change the RAID layout to something else using those same existing drives and restripe without requiring some sort of data migration to another pool and without needing to buy more drives? Again, NOT for expansion, but for the shrinking of the pool?

To clarify even further: What the HP guy did was exactly this: He had 20 drives to play with, he created by mistake a pool with 2 RAID groups, 14+2 and a 2+2, how would your solution take those 2 RAID groups, with data, and change the config to something like 10 + 10 without needing more drives or the destruction of anything?

Can you dynamically reduce a RAID group? (NetApp can dynamically expand, but not reduce a RAID group).

I’m not implying such a thing doesn’t exist, I’m merely curious. I could see ways to make this work by virtualizing RAID further. Still, it’s just one (small) part of the storage puzzle.

The one without sin may cast the first stone! :)

D


Mon
29
Mar '10

Filesystem benchmark extravaganza – Win, Linux, NTFS, EXT4, XFS, BFS scheduler impact and more…


It’s been a while since I checked the status of Linux-land regarding filesystems and CPU and I/O schedulers, so I thought I’d post some results.

A bunch of new distributions are coming out with Linux kernel 2.6.32 as standard (Ubuntu 10.04LTS being one of them), and one distribution (PCLinuxOS) will have Con Kolivas’ BFS as default process scheduler.

Since I’m a scheduler and filesystem aficionado, I was intrigued by the simplicity of the BFS and wanted to see how it might affect I/O performance. I don’t believe that there’s ever a single scheduling algorithm that fits all possible use cases, and, furthermore, I think the default Linux CFS scheduler is getting a bit unwieldy in its complexity (though the goal is to make it scale to very large numbers of cores, not all machines have that).

So I grabbed a spare machine that doesn’t have a ton of horsepower on purpose, and tested using the same benchmark (NetApp’s venerable postmark) on the same part of the disk, with Windows, Ubuntu 10.04LTS beta, and PCLOS 2010 beta2.

Here’s some of the averaged data (did multiple runs):

 

Configuration             Time  IOPS  Creation  Reads  Appends  Deletes  Read MB/s  Write MB/s
win tuned                  208   244       156    121      123      183       2.68       5.658
win supercache             197   224       200    111      113      175       2.78        5.88
win uptempo                172   281       277    139      141      156       3.19        6.73
Ubuntu 10.04b              154   169       454     84       85      727       3.56        7.52
Ubuntu 10.04b deadline     136   173       500     86       87    10184       4.03        8.51
PCLOS ext4 CFQ             105   239      1011    118      120     4478       5.25       11.09
PCLOS ext4 deadline         97   240      1158    119      121     7828       5.73       12.21
PCLOS XFS tuned deadline   187   136       476     68       68      509       2.93        6.19

Some explanation on the fields:

  • Time: Total time to run
  • IOPS: mixed transactions per second (maybe the most useful number)
  • Creation: files created per second (not mixed with transactions)
  • Reads: reads per second
  • Appends: appends per second
  • Deletes: pure file deletions per second (not mixed with transactions)
  • Read and write MB/s: the effective MB/s
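For anyone who wants ratios rather than raw numbers, the sketch below copies the Time and IOPS columns from the table above and expresses each configuration relative to a chosen baseline. The function name and the choice of baseline are only illustrative.

```python
# (time, IOPS) pairs copied from the results table above.
RESULTS = {
    "win tuned": (208, 244),
    "win supercache": (197, 224),
    "win uptempo": (172, 281),
    "Ubuntu 10.04b": (154, 169),
    "Ubuntu 10.04b deadline": (136, 173),
    "PCLOS ext4 CFQ": (105, 239),
    "PCLOS ext4 deadline": (97, 240),
    "PCLOS XFS tuned deadline": (187, 136),
}


def relative_to(baseline, results=RESULTS):
    """Return {config: (time speedup, IOPS ratio)} versus the baseline.

    Time speedup is baseline_time / time, so values above 1.0 mean the
    configuration finished the run faster than the baseline did.
    """
    base_time, base_iops = results[baseline]
    return {
        name: (round(base_time / time, 2), round(iops / base_iops, 2))
        for name, (time, iops) in results.items()
    }


if __name__ == "__main__":
    for name, (speedup, iops_ratio) in relative_to("Ubuntu 10.04b").items():
        print(f"{name:26s} time x{speedup:<5} IOPS x{iops_ratio}")
```

The two ratios need not agree for a given row, presumably because the total run time also covers the file creation and deletion phases, which are not part of the mixed transactions.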

Some notes:

  • Windows was just vanilla XP with all service packs, tuned as a server (box was too weak to take Win7)
  • Supercache was using a portion of memory for the Supercache filter driver
  • Uptempo is a similar filter driver that provides caching (see previous posts here and here)
  • Ubuntu was the latest available beta of 10.04
  • Whenever you see “deadline” the deadline scheduler was used instead of CFQ
  • PCLOS is the latest available beta of PCLOS 2010.
  • I mounted ext4 with barrier=0,noatime
  • I mounted XFS with nobarrier,noatime,nodiratime,logbufs=8,logbsize=256k and made it with mkfs.xfs -f -d agcount=4 -l lazy-count=1 -l size=128m

Some pretty pictures for the ADD among us:

image

image

image

image

Some observations:

  • Some metadata-heavy operations, like deletion, can get heavily cached in Linux, which skews the MB/s and total time results when 10,000 files can be deleted in an instant. I wanted to show the numbers because, depending on what you do, that may or may not be important.
  • Intelligent caching still helps tremendously, note the results for Uptempo on Windows and the crazy file creation times on Linux (also skews the MB/s but is a useful number to know if you’re creating a lot of files constantly)
  • Depending on what you’re doing, Windows can be Just Fine…
  • The BFS seems to have been an excellent scheduler choice for PCLOS and something other Linux vendors should start looking at seriously – unless there are serious other I/O tweaks in the kernel that I don’t know about, the difference in performance is staggering between Ubuntu and PCLOS (Phoronix figured that out as well here – they have tons of additional benchmarks showing things other than I/O).
  • And, last but by no means least – the deadline scheduler is ahead of CFQ again, both for Ubuntu and PCLOS. Not by a huge margin, but it’s a safe choice, especially if you’re running Linux on enterprise storage that has its own decent I/O scheduler built in. There have been cases of CFQ dramatically lowering performance with external arrays. Remember – most of the guys writing code for Linux don’t have access to enterprise storage… using the defaults could be harming your Linux performance!
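For those who want to check which I/O scheduler a disk is using: Linux exposes it per block device in /sys/block/<device>/queue/scheduler, as a line such as "noop deadline [cfq]" with the active one in square brackets. The sketch below parses that line and can select another scheduler. The helper names are made up for illustration, writing needs root, and the change does not survive a reboot.

```python
from pathlib import Path


def parse_scheduler_line(line):
    """Split a line such as 'noop deadline [cfq]' into (active, available)."""
    active = None
    available = []
    for token in line.split():
        if token.startswith("[") and token.endswith("]"):
            token = token[1:-1]  # the bracketed entry is the active scheduler
            active = token
        available.append(token)
    return active, available


def scheduler_path(device):
    """Path of the sysfs file that holds the scheduler for one block device."""
    return Path("/sys/block") / device / "queue" / "scheduler"


def set_scheduler(device, name):
    """Select an I/O scheduler for a device (needs root, not persistent)."""
    _, available = parse_scheduler_line(scheduler_path(device).read_text())
    if name not in available:
        raise ValueError(f"{name!r} not offered for {device}: {available}")
    scheduler_path(device).write_text(name)
```

For example, set_scheduler("sda", "deadline") would switch a disk called sda, assuming such a device exists and the kernel offers that scheduler.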

D


Tue
9
Mar '10

More tales from the field: Sizing best practices – does Compellent follow them?


Note: I edited this a bit to remove some confusing pieces of info.

Another one came in. I’ll keep calling the offenders out until the craziness stops. Fellow engineers - remember that, regardless of where we work, our mission should be to help the customer out first and foremost. Then make a sale, if possible/applicable. I implore you to get your priorities straight. If it looks like you’re losing the fight, figure out what your true value is. If you have no true value, you always have the option of bombing the price. But please, don’t sell someone an under-configured system…

This time, it’s Compellent not seeming to follow basic sizing rules in a specific campaign (I’m not implying this is how all Compellent deals go down). The executive summary: In a deal I’m involved in, they seem to be proposing far fewer disks than are necessary for a specific workload, just so they are perceived as being lower in price. This is their second strike as far as I’m concerned (first case I witnessed was Exchange sizing where they were proposing a single shelf for a workload that needed several times the number of drives). Third strike gets you a personal visit. You will never repeat the offense after that, but it gets tiring. Education is better.

And before someone jumps on me and tells me that I don’t know how to properly size for Compellent (which I freely admit) I’ll ask you to consider the following:

There is no magic.

This is not a big NetApp FAS+PAM vs multi-engine Symmetrix V-Max discussion, where the gigantic caches will play a huge role. No – this specific case is a fight between 2 very small systems, both with very limited cache and regular ol’ 15K SAS drives. They’re not quoting SSD that could alleviate the read IOPS issue, and we’re not quoting PAM.

Ergo, this is about to get spindle-bound…

And for all the seasoned pros out there: I know you may know all this, it’s not for you, so don’t complain that it’s too basic. This post is for people new to performance sizing (and maybe some engineers :) )

Some preliminaries:

This is a Windows-only environment. So, the customer sent perfmon data for their servers over for me to analyze and recommend a box.

They’ll be running Exchange plus some databases.

From my days of doing EMC I learned some very important sizing lessons (thanks guys) that I will try to summarize here.

For instance - there is peak performance, average, and what we called “steady-state”.

In any application, there will be some very high I/O spikes from time to time. Those spikes are normal and are usually absorbed by host and array caches. This is the “peak performance”.

The trick is to figure out how long the spikes last for, and see if the caches would be able to accommodate them. If a spike is lasting for 30 min it’s not a spike any more, but rather a real workload you need to accommodate.

If the spikes are in the range of seconds, then cache is usually enough. Depends on the magnitude of the spike, the length of the spike and the size of the cache :)

Then, you have your average performance. That just takes a straight math average across all performance points - so, for instance, if you have, at night, very long periods of inactivity, they will affect the average dramatically. Short-lived spike data points won’t affect it as much since there are so few of them. So the average typically gets skewed towards the low end.

Then there’s the concept of “steady state”.

This effectively tries to get a more meaningful average of steady-state performance during normal working periods. Easy to eyeball actually if you’re looking at the IOPS graphs instead of letting Excel do its averaging for you.

A picture will make things clearer:

image

In this simplified example chart, the vertical axis represents the IOPS and the horizontal is the individual samples over time. You can see there are very quiet periods, a brief spike, then sustained periods of activity. Without needing a degree in Statistics, one can see that the IOPS needed are about 500 in this chart. However, if you just take the average, that’s only 260, or about half! Obviously, not a small difference. But, again obviously, some extra care is required in order to figure out the real requirements instead of just calculating averages!
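The same point can be shown numerically. The sketch below uses made-up samples (not the data behind the chart), chosen so that the plain average lands on 260 while the busy periods sit at 500. Taking the median of the non-idle samples is just one simple stand-in for eyeballing the graph, and the idle threshold is an assumption you would pick per workload.

```python
from statistics import mean, median

# Hypothetical IOPS samples: quiet periods, sustained activity, one brief spike.
SAMPLES = [20] * 6 + [500] * 5 + [980] + [500] * 5 + [20] * 7


def steady_state_iops(samples, idle_threshold):
    """Estimate steady-state IOPS as the median of the non-idle samples.

    Dropping idle samples stops quiet nights from dragging the figure down,
    and using the median keeps a short spike from dragging it up.
    """
    busy = [s for s in samples if s > idle_threshold]
    if not busy:
        return 0.0
    return float(median(busy))


if __name__ == "__main__":
    print("plain average:", mean(SAMPLES))
    print("steady state :", steady_state_iops(SAMPLES, idle_threshold=100))
```

With these samples the plain average is 260 and the steady-state estimate is 500, the same gap the chart illustrates.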

So, to summarize: it’s usually not correct to size for maximum or average since they’re both misleading (unless you’re sizing for a minimum-latency DB application – then you often size for maximums to accommodate any and all performance requirements). This is the same for every array vendor. The array and host cache accommodate some of the maximum spikes anyway, but the true average steady-state is what you’re trying to accommodate.

So, now that you know the steady-state true average the customer is seeing, the next step in estimating performance is to look at the current disk queues and service times.

I won’t go into disk queuing theory but, simply speaking, if you have a lot of outstanding I/O requests, they end up getting queued up, and the disk tries to service them ASAP but it just can’t quite catch up. You typically want to see low numbers for the queue (as in the very low single digits).

Then, there’s the response time. If the current response times are overly long (anything over 20ms for most DB/email work), then you have a problem…

What this means is that the observed steady-state workload is often constrained by the current hardware. By examining performance reports, all you are seeing is what the current system is doing.

So, the trick is to find out what performance the customer actually NEEDS, at a reasonably low ms response time with low queuing. The perfmon data is just to ensure you don’t make the performance even WORSE than they’re currently seeing! Finding out the true requirements is really the difficult part.

Finally, once you figure out the final, desired steady-state IOPS requirements, you need to translate them into your specific system, since there’s cache helping, but always some overhead to be considered. For instance, in a system that relies on RAID10/RAID5, you need to adjust for the read/write penalties of RAID. That increases the IOPS needed by nature. Again, this is the same for all array vendors – the only time there’s no I/O penalty, is if you’re doing RAID0 (= no protection).

You see, RAID5 for instance, in order to perform writes, has to do some reads as well, to calculate and write the parity. All very normal for the algorithm. Depending on the read/write mix, this extra I/O can be significant, and absolutely needs to be considered when sizing storage! RAID10 doesn’t need to read in order to write, but has to write 2 of everything, so that needs to be considered as well.
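As a rough illustration of that adjustment, the sketch below converts host IOPS into disk IOPS using the commonly quoted rule-of-thumb write penalties: 4 disk I/Os per small random RAID5 write (read data, read parity, write data, write parity) and 2 per RAID10 write. These factors are assumptions of the sketch. They ignore cache and full-stripe writes, and specific arrays may behave differently.

```python
# Rule-of-thumb disk I/Os generated per host write (assumed, cache ignored).
WRITE_PENALTY = {"raid0": 1, "raid10": 2, "raid5": 4}


def backend_iops(host_iops, write_fraction, raid_level):
    """Disk IOPS needed to serve host_iops at the given write fraction."""
    if not 0.0 <= write_fraction <= 1.0:
        raise ValueError("write_fraction must be between 0 and 1")
    penalty = WRITE_PENALTY[raid_level]
    reads = host_iops * (1.0 - write_fraction)   # reads cost one disk I/O each
    writes = host_iops * write_fraction          # writes are multiplied by the penalty
    return reads + writes * penalty


if __name__ == "__main__":
    # Hypothetical workload: 1,000 host IOPS with 30% writes.
    for level in ("raid0", "raid10", "raid5"):
        print(level, round(backend_iops(1000, 0.30, level)))
```

For that hypothetical workload the disks see roughly 1,000, 1,300 and 1,900 IOPS respectively, which is why the read/write mix has to be known before sizing.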

You also need to figure out read vs write percentage, I/O block size distributions, random vs sequential… not rocket science, but definitely extra work in order to do right.

The last thing that needs to be taken into account is the working set. Basically, it means this:

Imagine you have a 10TB database, but you’re really only accessing about 100GB of it repeatedly and consistently. Your working set is that 100GB, not the entire 10TB DB. Which is why the more advanced arrays have ways of prioritizing/partitioning cache allocations, since you typically don’t want a big 50TB file share with 10,000 users causing cache starvation for your 10TB DB with the 100GB working set. You need to retain as much of the cache as possible for the DB, since the 50TB file share is too large and unpredictable a working set to fit in cache.

Unless you understand the true working set, you will have no idea how much cache will be able to truly help that particular workload.

Going back to the reason I wrote this post in the first place:

In this specific, small environment, the non-RAID steady-state percentile IOPS required were close to 3,000, with a working set and I/O pattern that wouldn’t fit in the cache of the small systems. Once adjusted for RAID5, the specific I/O mix demanded 50% more IOPS from the disk. The spikes were fairly high, in excess of 10x the steady-state.

Back to basics: A 15K RPM disk can provide about 220 IOPS with reasonable (<20ms) latency, so about 14 disks are needed to accommodate the pre-RAID performance with under 20ms latency. Remember – that doesn’t include spares or RAID overheads, and will not even accommodate I/O spikes. Calculating with the RAID overhead, about 21 drives are needed, at a minimum. Add a spare or two, and you’re up to 22-23 drives, bare minimum, to satisfy steady-state performance without cache starvation in this specific workload.
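The drive counts above are simple to reproduce. The sketch below uses only the figures quoted in this post (3,000 steady-state IOPS, 220 IOPS per 15K drive, 50% extra for this RAID5 I/O mix). The function name and parameters are made up for illustration, and cache is deliberately ignored.

```python
import math


def drives_needed(required_iops, iops_per_drive, raid_overhead=0.0, spares=0):
    """Minimum drive count for a steady-state IOPS target, cache ignored.

    raid_overhead is the extra fraction of disk IOPS caused by RAID,
    for example 0.5 for the 50% quoted above.
    """
    backend = required_iops * (1.0 + raid_overhead)
    return math.ceil(backend / iops_per_drive) + spares


if __name__ == "__main__":
    print("pre-RAID       :", drives_needed(3000, 220))
    print("with RAID5 mix :", drives_needed(3000, 220, raid_overhead=0.5))
    print("plus two spares:", drives_needed(3000, 220, raid_overhead=0.5, spares=2))
```

This gives 14, 21 and 23 drives. By the same rule of thumb, 11 drives deliver about 2,420 IOPS, which is below even the pre-RAID requirement.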

And, finally, the offense in question:

Compellent said that with their combo RAID1-RAID5 they only needed a single 12-drive SAS enclosure for the entire workload. Take spares out, and, best case, you’re talking about 11 drives doing I/O. Apparently, the writes happen in RAID1, and the reads as RAID5. I’m not the expert, I’m sure someone will chime in. Maybe my math is a bit off since Compellent has the funky RAID1/RAID5 mix, but there are still I/O penalties…

Based on the above analysis, this somehow doesn’t compute with 11 drives, half what my calculations indicate… so, my final question is:

How do Compellent engineers size for performance?

D

Wed
3
Mar '10

EMC’s incredible marketing and the FAST fairy tale (and a bit on how to reduce tiers)

I’m in MN prepping to teach a course (my signature anti-FUD extravaganza), and thought I’d get a few things off my chest that I’ve been meaning to write about for a while. Some Stravinsky to provide the vibes and I’m good to go. It’s getting really late BTW and I’m sure this will progressively get less coherent as time goes by, but I like to write my posts in one shot…

I never cease to be amazed by what’s possible with the power of great marketing/propaganda. And EMC is a company that has some of the best marketing anywhere. Other companies should take note!

Think about it: Especially on the CX, they took an auto-tiering implementation as baked as wheat that hasn’t been planted yet, and managed to create so much noise and excitement around it that many people think EMC actually invented the concept and, heavens, some even believe that the existing implementation is actually decent. Worse still, some have actually purchased it. Kudos to EMC. With the exception of some of Microsoft’s work, nobody reputable has the stones any more to release, amidst such fanfare, a product this unpolished. Talk about selling futures…

Perception is reality.

I’m an engineer by training and by trade first and foremost, and, regardless of bias, I consider the existing FAST implementation an affront. Allow me to explain, gentle reader…

The tiering concept

Some background info is in order. Most arrays of any decent size and complexity sold nowadays are configured with different kinds of disk, purely out of cost considerations. For instance, there may be 30 really fast drives where a bunch of important low-latency DBs live, another 100 pretty fast drives where most VMs and Exchange live, then 200 SATA drives for bulk storage and backups.

Don’t kid yourself: If the customer buying the aforementioned array had enough dough, they’d be getting the wunderbox with all super-fast drives inside – all the exact same kind of drives. That’s just simpler to deal with from a management standpoint and obviously the performance is stellar. Remember this point since we’ll get back to it…

Of course, not everyone is made of money, so arrays that look like the 3-tier example above are extremely common. Just enough drives of each type are purchased in order to achieve the end result.

What typically ends up happening is that, over time, some pieces of data end up in the wrong tier, for one reason or another. Maybe a DB that was super-important once now only needs to be accessed once a year; or a DB that was on SATA now has become the most frequently-accessed piece of data in the array. Or, perhaps, the importance of a DB flip-flops during a month, so it only needs to be fast maybe for month-end-processing. So now, you need to move stuff around so that what needs to be fast is shifted to the fast drives.

Pressure points and the need for passing the hot potato

But wait, there’s more…

The entire performance problem is created in the first place due to most array architectures being older than mud. In legacy array architectures, LUNs are carved out of RAID groups, typically made of relatively few disks. So, in an EMC Clariion, it’s best practice to have a 5-disk RAID5 group. You then ideally split up that group into no more than 2 LUNs and assign one to each controller.

With disks getting bigger and bigger, creating 1-2 LUNs can become exceedingly difficult – a 5-disk R5 group made with 450GB drives in a Clariion offers a bit over 1.5TB of space, which is too much for many application needs – maybe you just need 50GB here, another 300GB there… in the end, you may have 10 LUNs in that RAID group that’s supposed to have no more than 2. The new 600GB FC drives make this even worse.

So, in summary, what ends up happening is that you split up that RAID group into too many LUNs in order to avoid waste. And that’s where your array develops a serious pressure problem.

You see, now you may have 10 different servers hitting the exact same RAID group, creating undue pressure on the 5 poor disks struggling to cope with the crazy load. I/O service times get too high, queue lengths get crazy, users get cranky.

Again – this whole problem exists exactly because legacy array architectures don’t automatically balance I/O among all drives.

But for those afflicted Paleolithic systems, wouldn’t it be nice if we could move some of those hot LUNs, non-disruptively, to other RAID groups that don’t suffer from high pressure?

That’s what EMC’s FAST for the Symmetrix and CX does. It attempts to move entire LUNs to faster tiers like SSD. Which, BTW, is something you can do manually, but FAST attempts to automate the task (kinda, depends, etc).

The current FAST pitfalls

Let’s first examine how FAST (Fully Automated Storage Tiering) is implemented, since it’s really 3 utterly different solutions, depending on whether you have Symm, CX or NS:

On the Symmetrix it’s always been there in the form of Symmetrix Optimizer, which may not have been aware of tiers but it definitely knew about migrating to less busy disks. Now you can teach it about tiers, too. But it’s not, in my mind, a new product, even if EMC would like you to believe it is. It looks to me too much like Optimizer + some new heuristics. But the Gods of Marketing managed to create unbelievable commotion about something that was an old feature. What amazes me is that nobody seems to have made the connection – maybe I’m really missing something. I’m sure someone from EMC will correct me if I’m wrong. In my experience, Optimizer, when purchased, often did more harm than good, was difficult to manage and, ultimately, was left inactive in many shops – with the beancounters lamenting the spending of precious funds on something that never quite worked that well. Oh, and it seems the current version doesn’t support thin LUNs. But of the FAST implementations on EMC gear it is the more complete version, exactly because Optimizer has been there for a long time…

On the far more popular CX platform, what happens is like a tribute to kludges everywhere. Consider this:

  1. Movement is one-way only (FC to SATA, or FC to SSD). More of a one-shot tool than continuous optimization!
  2. You need a separate PC that will crunch Navisphere Analyzer performance logs, this takes a while
  3. The PC will then provide a list of recommendations
  4. Depending on which LUNs you approve it will invoke a NaviCLI command to move the specified LUNs in the box
  5. Doesn’t support thin provisioning
  6. Not sure if it supports MetaLUNs
  7. It is NOT automatic, since you have to approve the move! Ergo, it should not be sold under the name “FAST”, since the “A” stands for “Automated”. Aren’t there laws against false advertising?

On the Celerra NS platform (EMC’s NAS), one needs to purchase the Rainfinity FMA boxes, which then can move files between tiers of disk based on frequency of access. One is then limited by the scalability of the FMA – how many files can it track? How dynamically can it react to changing workloads? What if the FMA breaks? Why do I need yet more boxes to do this?

Ah, but it gets better with FASTv2! Or does it?

EMC has been upfront that FAST will become way cooler with v2. It better be, since as you can see it’s no great shakes at the moment. From what the various EMC bloggers have been posting, it seems FASTv2 will use the thin provisioning subsystem to go to a sub-LUN level of granularity.

The granularity will obviously depend on how many disks you have in the virtual provisioning pool, since a LUN (just like with MetaLUNs) will be split up so that it occupies all the disks in the pool. The bigger the pool, the better. This should provide better performance (it does with other vendors) yet EMC in their docs state the current version of virtual provisioning (at least on the CX) has higher overhead when compared to their traditional LUNs and will provide less performance. I guess that’s a subject for another day, and maybe they’ll finally revamp the architecture to fix it. Back to FASTv2:

The “busyness” of each LUN segment will be analyzed, and that segment will then move, if applicable, to another tier. Of course, how efficient that will end up being will depend on how you do I/O to the LUN in the first place! If the LUN I/O is fairly spatially uniform, then the whole thing will have to move just like FASTv1. But I guess with v2 there’s at least the potential of sub-LUN migration, for cases where a clearly delineated part of the LUN is really “hot” or “cold”. Obviously, since the chunk size will still be significantly large, expect a bunch of non-applicable data to move with the stuff that should be moved.
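To see why the I/O pattern matters so much, consider this toy Python sketch. It is not how FASTv2 decides anything (those details aren't public here). It simply marks a chunk as hot when its I/O count reaches a hypothetical threshold, and the sample numbers are invented.

```python
def hot_chunks(io_per_chunk, hot_threshold):
    """Indexes of the chunks whose I/O count reaches the threshold."""
    return [i for i, io in enumerate(io_per_chunk) if io >= hot_threshold]


def fraction_to_move(io_per_chunk, hot_threshold):
    """Share of the LUN that would have to migrate to the faster tier."""
    if not io_per_chunk:
        return 0.0
    return len(hot_chunks(io_per_chunk, hot_threshold)) / len(io_per_chunk)


if __name__ == "__main__":
    # Two hypothetical LUNs of 8 chunks each, both doing 800 I/Os in total.
    uniform = [100] * 8
    skewed = [10, 10, 370, 370, 10, 10, 10, 10]
    print("uniform:", fraction_to_move(uniform, hot_threshold=80))
    print("skewed :", fraction_to_move(skewed, hot_threshold=80))
```

The spatially uniform LUN has to move in its entirety (fraction 1.0), just like a whole-LUN migration, while the skewed one only needs a quarter of its chunks moved.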

The real problem

First, to give credit where it’s due: Compellent already has had sub-LUN moves for a long, long time. Give those guys props. They actually deserve it.

However – both the Compellent approach as well as FASTv2 and, even worse, v1, suffer from this fundamental issue:

Lack of real-time acceleration.

Think about it – performance has to be analyzed periodically, heuristics followed, then LUNs or pieces of LUNs have to be moved around. This is not something that can respond instantly to performance demands.

Consider this scenario:

You have a payroll DB that, during most of the month, does absolutely nothing. A fully automated tiering system will say “hey, nobody has touched this LUN in weeks, I better move it to SATA!”

Then crunch time comes, and the DB is on the SATA drives. Oopsie.

People complain, and the storage admin is forced to manually migrate it back to SSD.

Kinda defeats the whole purpose… unless I’m missing something the size of the Titanic.

So, you may have to write all kinds of exception rules (provided the system lets you). Some rules for most DBs, Exchange, a few apps here and there…

Soon, you’re actually in a worse state than where you began: You have the added complexity and cost of FAST, plus you have to worry about creating exception rules.
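The payroll scenario can be sketched as a toy simulation. The policy below is hypothetical and is not any vendor's actual algorithm: the array looks at each day's I/O once, demotes a LUN after a run of idle days and promotes it after a run of busy days. Migration time is ignored, which flatters the tiering approach.

```python
def tier_history(daily_io, idle_days_before_demote=14, busy_days_before_promote=1):
    """Return the tier ('fast' or 'sata') the LUN sits on at the START of each day.

    Toy policy with made-up thresholds: demote after N consecutive idle days,
    promote after M consecutive busy days, evaluated once per day.
    """
    tier, idle, busy, history = "fast", 0, 0, []
    for io in daily_io:
        history.append(tier)  # tier in effect while today's I/O happens
        if io == 0:
            idle, busy = idle + 1, 0
        else:
            busy, idle = busy + 1, 0
        if tier == "fast" and idle >= idle_days_before_demote:
            tier = "sata"
        elif tier == "sata" and busy >= busy_days_before_promote:
            tier = "fast"
    return history

# 27 idle days, then 3 days of payroll crunch.
month = [0] * 27 + [5000] * 3
tiers = tier_history(month)
print(tiers[27:])  # tiers in effect during the crunch days
```

Even with the most eager promotion rule possible (one busy day), the first crunch day runs entirely on SATA, because the policy can only react after it has seen the load.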

Now here’s a novel idea…

What if you actually put your data in the right tier to begin with and what if, even if you didn’t, it didn’t matter too much?

For instance – normal fileshares, deep archives, large media files, backups to disk – most people would agree that those workloads should probably forever be on SATA if you’re trying to save some money. With 2TB drives, the SATA tier has become super-dense, which can be very useful for quite a few use cases.

DBs, VM OS files – should usually be on faster disk. But no need to go nuts with several tiers of fast disk, a single fast tier should be sufficient!

LUNs and other array objects should try to automatically span as many drives as possible by default without you having to tell the array to do that… that way you avoid the hot spots in the first place by design, thereby reducing or even removing the need for migrations (I can still see some very limited cases where migration would be useful).

And finally, large, intelligent cache (as in really large) to help with real-time workload demands, dynamically and as-needed, by caching tiny 4K chunks and not wasting space on gigantic pieces… with the ability to prioritize the caching if needed. Not to mention being deduplication-aware.

Wouldn’t that be a bit simpler to manage, more nimble and more useful in real-world scenarios? The cache will help out even the slower drives for both file and OLTP-type workloads.

Maybe life doesn’t need to be complicated after all.

It’s almost 0300 so I’d better go to bed…

D

Mon
8
Feb '10

NetApp disk rebuild impact on performance (or lack thereof)

Due to the craziness in the previous blog, I decided to post an actual graph showing a NetApp system’s I/O latency while under load and during a disk rebuild. It was a bakeoff vs another large storage vendor (which NetApp won).

The test was done at a large media company with over 70,000 Exchange seats. It was with no more than 84 drives, so we’re not talking about some gigantic lab queen system (I love Farley’s term). The box was set up per best practices, with aggregate size being 28 disks in this case.

(Edited at the request of EMC’s CTO to include the performance tidbit): Over 4K IOPS were hitting each aggregate (much more than the customer needed) and the system had quite a lot of steam left in it.

There were several Exchange clusters hitting the box in parallel.

All of the testing for both vendors was conducted by Microsoft personnel for the customer.  The volume names have been removed from the graph to protect the identity of the customer:

clip_image001

Under a 53:47 read/write ratio 8K-size IOPS, a single disk was pulled.  Pretty realistic failure scenario, a disk breaks while the system is under production-level load. Plenty of writes, too, almost 50%.

Ok.  The fuzzy line around 6ms is the read latency.  At point 1 a disk was pulled and at point 2 the rebuild completed.  Read latency increased to 8ms during the rebuild, but dropped back down to 5 after the rebuild completed.  The line at less than 1 ms response time straight across the bottom is the write latency. Yes it’s that good.

So - there was a tiny bit of performance degradation for the reads but I wouldn’t say that it “killed” performance as a competitor alleged.

The rebuild time is a tad faster than 30 hours as well (look at the graph :) ) but then again the box used faster, 15K drives (and smaller, 300GB vs 500GB), so before anyone complains, it’s not apples-to-apples compared to the Demartek report.

I just wanted to illustrate a real example from a real test at a real customer using a real application, and show the real effects of drive failures in a properly-implemented RAID-DP system.

The FUD-busting will continue, stay tuned…

D

Sat
9
Jan '10

What if you could dramatically improve your application testing times? What would happen to your productivity and to the company’s bottom line?

So, let’s say the DBA (or insert some other discipline) wants to do some testing for a new product (known to happen occasionally) – and the way he would really like to test is to create 20 test cases, which requires 20 copies of the main database. He would then automate the test and therefore get results very quickly.

He approaches the storage admin with the problem, only to be told this isn’t possible since there isn’t enough space on the array. The DBA goes back to his cube frustrated, and figures out some ghetto way of creating at least 1 copy of the database, which creates the following problems:

  1. He has to figure out a way to do it (takes time)
  2. He can only test 1 case at a time (time)
  3. He cannot easily compare what-if scenarios between test cases (lack of flexibility)
  4. His ghetto way of doing it may involve single 1TB disks in a workstation (lack of reliability, time)

Ultimately, the testing takes longer, is error-prone, and the DBA’s productivity level goes way down.

What if the storage admin could, instead, tell the DBA that he can even take hundreds of copies of the DB, there’s no issue doing that?

What would happen to the DBA’s productivity?    

What new ideas would he be able to come up with?

How would that affect the quality of the product?

How would that affect the company’s bottom line? Being able to go to market with improved quality and quicker than the competition?

You see, intelligent storage – intelligently deployed – can solve many more problems than just “give me some space” or “give me more performance”.

There aren’t many technologies out there that can comfortably do this, which is probably why most storage people aren’t aware of this. But an array that can create space- and performance-efficient application-consistent DB clones is the ticket. Being able to create full copies and/or virtual space-efficient copies that end up being unusably slow doesn’t count… :)

The only vendor I know of that can pull this off (properly) is NetApp with their FlexClone technology. One can even use it to deploy thousands of identical VMs… there are some use cases for that, too :)

Activision (the company that makes the famous Guitar Hero game) is a good example of using this technology to rapidly accelerate development – and ended up making the Christmas deadline, which resulted in several more millions in sales. See here.

Oracle is another small company that uses this technology pervasively.

If anyone else knows of more vendors that can do this (properly) please chime in.

D

Sun
14
Jun '09

New ext4 vs XFS benchmarks using Fedora 11 Leonidas

What a difference a kernel rev and/or distribution makes. If you recall from a previous post, I was unable to complete postmark testing on Ubuntu 9.04 using ext4, and had to recommend against ext4. Now, with the release of Fedora 11 “Leonidas”, a new kernel seems to make a big difference in performance and stability of ext4.

Some other observations before I show any numbers:

  • This is NOT the same computer as was used in the previous test, don’t use these numbers to compare between Ubuntu and Fedora. It’s a desktop with a 64-bit Athlon and 1GB RAM. I know, I know… I didn’t have access to the other box. Look at Phoronix.com for a comparison of the two.
  • The 2.6.29 kernel seems to have a much better implementation of the CFQ I/O elevator, I only noticed a slight decrease in performance using deadline instead of the increase I usually get with XFS (ext3 and ext4 have always been tuned for CFQ).
  • In this version, using my usual (and sometimes unsafe and daring) mount switches didn’t seem to make a huge difference on XFS and none in ext4 or even ext3, Fedora 11 is really a distribution that the developers want you to be able to use without much fussing.
  • On all tests, I created XFS with mkfs.xfs -f -l lazy-count=1 -l size=128m /dev/…  - this enables the 2 main (and safe) tunings that I believe everyone should follow with XFS. Kinda hard to do while installing a distribution, the Fedora 11 installer wasn’t happy about it. Ubuntu is more forgiving, it lets you boot into the LiveCD and you can manually create partitions before you let the installer do its thing. Convenient for single-root-partition installs…
  • “XFS tuned” means mounted with noatime,logbsize=256k,nobarrier (nobarrier is unsafe unless you’re on a UPS).
  • “ext3 tuned” means barrier=0,noatime,data=writeback. Used to make a big difference…
  • The same disk area was used for all tests
  • Scribefire on Firefox sucks compared to Mac- or Windows-based offline blog editors. There are some KDE-based ones but I didn’t want to download 100s of MB of KDE support infrastructure to run a 600K blog program…

Postmark numbers:

Filesystem               Read MB/s   Write MB/s   IOPS
XFS defaults             4.9         10.34        215
XFS tuned                6.23        13.16        263
XFS noatime,logbsize     6.38        13.47        263
ext4 noatime             9.62        20.32        416
ext3 noatime             5.71        12.06        238
ext3 “tuned”             5.32        11.24        219
ext3 writeback,noatime   4.73        9.98         192
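
To make the Postmark gap easier to see, here is a small sketch that expresses each IOPS figure from the table as a multiple of a baseline. The numbers are copied from the table; the choice of ext3 noatime as the baseline and the helper name are mine.

```python
# Postmark IOPS from the table above (Fedora 11, same disk area for all tests).
postmark_iops = {
    "XFS defaults": 215,
    "XFS tuned": 263,
    "XFS noatime,logbsize": 263,
    "ext4 noatime": 416,
    "ext3 noatime": 238,
    "ext3 tuned": 219,
    "ext3 writeback,noatime": 192,
}

def relative_to(baseline: str, results: dict) -> dict:
    """Express every result as a multiple of the chosen baseline."""
    base = results[baseline]
    return {name: round(value / base, 2) for name, value in results.items()}

for name, ratio in relative_to("ext3 noatime", postmark_iops).items():
    print(f"{name:24s} {ratio:.2f}x")
```

On this box ext4 comes out at roughly 1.75 times the ext3 figure, which is where the small-file verdict further down comes from.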

Bonnie++ numbers:

Filesystem               IOPS    Block writes KB/s   Rewrite KB/s
XFS defaults             328.4   116600              52066
XFS tuned                328.6   119981              51639
XFS noatime,logbsize     333     119781              50519
ext4 noatime             335.1   117285              48797
ext3 noatime             294.6   100771              43033

Verdict

  • Ext4 shows great promise!
  • For sheer MB/s on large files, XFS is still better by a small margin
  • If you want to be doing operations on many small files, ext4 is great
  • The reworked CFQ scheduler rocks

D

Mon
18
May '09

Linux filesystem benchmark extravaganza - including Deadline vs CFQ schedulers and ext4 instability

I have some spare time these days so I figured I’d finally test as many filesystems on Linux as I could…

The new ext4 is an option with modern kernels so I loaded Ubuntu 9.04 and tried postmark and bonnie++ on the same partition using various filesystems and switching between the CFQ and Deadline schedulers.

Switching schedulers permanently can be achieved by changing the boot options and appending, say, elevator=deadline, but you can also switch them on the fly by running the following:

echo deadline > /sys/block/sda/queue/scheduler

You can check what’s currently selected by simply typing

cat /sys/block/sda/queue/scheduler

You’ll get back something like:

noop anticipatory [deadline] cfq

The scheduler in brackets is the currently selected one.
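
If you want to check this from a script rather than by eye, a minimal sketch follows. It only parses the bracket convention shown above; the function names are made up for illustration, and the sysfs reader obviously only works on a Linux box that exposes that file.

```python
import re

def active_scheduler(sysfs_line: str) -> str:
    """Return the scheduler name shown in [brackets] in a sysfs scheduler line."""
    match = re.search(r"\[([^\]]+)\]", sysfs_line)
    if match is None:
        raise ValueError(f"no active scheduler marked in: {sysfs_line!r}")
    return match.group(1)

def read_active_scheduler(device: str = "sda") -> str:
    """Read the active scheduler for a block device from sysfs (Linux only)."""
    with open(f"/sys/block/{device}/queue/scheduler") as f:
        return active_scheduler(f.read())

print(active_scheduler("noop anticipatory [deadline] cfq"))
```

Handy when you are looping over several schedulers in a benchmark script and want to confirm the switch actually took before each run.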

Reader beware: Running postmark on ext4 locked up the system repeatedly during the transaction phase of the benchmark, using either my own compiled version or the one from the repository, so obviously there is some issue there and I cannot at this time recommend ext4. No other filesystem caused lockups. I did run bonnie++ as well since that didn’t crash with ext4.

The objective of this exercise wasn’t to show which filesystem is fastest, but rather to illustrate that, depending on what you want to do, you may want to re-examine the choice of filesystem and scheduler with your application if you’re running Linux. BTW the current recommendation for databases and fast intelligent external arrays – and Ubuntu’s default in the server edition – is the Deadline scheduler, and not CFQ. However, all other distributions at the moment use CFQ!

So, without further ado, some benchmarks… (I’m not including the entire postmark output since it would be far too large, I just kept the most important metrics, anyone that wants the entire results is more than welcome to send me an email and I’ll hook you up).

Postmark MB/s:

Filesystem        Read MB/s   Write MB/s   IOPS
Reiser CFQ        4.85        10.25        227
Reiser Deadline   5.38        11.35        246
XFS CFQ           2.33        4.93         109
XFS Deadline      2.35        4.97         105
XFS Tuned         2.73        5.76         120
JFS CFQ           1.75        3.69         78
JFS Deadline      1.73        3.65         76
Ext3 CFQ          2.71        5.73         115
Ext3 Deadline     2.86        6.03         122

 

MBPS

Postmark IOPS:

iops

Bonnie++ write speed:

Filesystem        IOPS   Block writes KB/s   Rewrite KB/s
Reiser CFQ        428    31657               18199
Reiser Deadline   462    32290               18154
XFS CFQ           471    39901               18557
XFS Deadline      483    39840               19653
XFS Tuned         592    40604               20746
JFS CFQ           433    31651               18528
JFS Deadline      452    39106               18755
Ext3 CFQ          403    31108               17235
Ext3 Deadline     338    31803               17885
Ext4 CFQ          451    39265               18519
Ext4 Deadline     446    39257               18221

bonnieMBPS

Bonnie++ IOPS:

bonnieiops

Observations:

The Deadline scheduler seems to be consistently better for anything that’s not ext-based! A lot of work has been done on the Linux kernel to optimize it for the ext2-3-4 filesystems, and that shows. However, depending on what you want to do, ext3 may not be the best option (I don’t know yet about ext4 for postmark-type loads but based on the bonnie++ results it’s solid).
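
As a quick sanity check of that observation, the sketch below computes the percentage change in Bonnie++ IOPS when going from CFQ to Deadline, using the figures from the Bonnie++ table. The helper name is mine. Note that the Postmark IOPS figures are a closer call (XFS and JFS are a few IOPS lower under Deadline there), so treat this as one view of the data rather than the whole story.

```python
# Bonnie++ IOPS from the table above, as (CFQ, Deadline) pairs.
bonnie_iops = {
    "Reiser": (428, 462),
    "XFS": (471, 483),
    "JFS": (433, 452),
    "Ext3": (403, 338),
    "Ext4": (451, 446),
}

def deadline_gain_pct(cfq: float, deadline: float) -> float:
    """Percentage change in IOPS when moving from CFQ to Deadline."""
    return round((deadline - cfq) / cfq * 100, 1)

gains = {fs: deadline_gain_pct(*pair) for fs, pair in bonnie_iops.items()}
for fs, pct in gains.items():
    print(f"{fs:7s} {pct:+.1f}%")
```

The non-ext filesystems all gain a few percent under Deadline in this test, while ext3 drops by about 16% and ext4 is essentially flat.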

Here’s a list of some considerations:

  • Will the filesystem host many many small files or a few large ones? Reiser still rules the “many small files” use case, by far. The rest are fairly close, and JFS seriously lags. For large files, XFS is great.
  • Do you care if the filesystem takes a long time to fsck? Ext3 still takes quite long, whereas something like XFS doesn’t. Ext4 should remedy this.
  • Do you care for something that’s still actively being maintained? In this case only ext3-4 and XFS are the options.
  • Do you want defrag tools? Choose wisely since few filesystems do (XFS and ext4).

My current overall recommendation is XFS since it’s mature and also very tunable. For reference, here’s how I got the better results for XFS (the results in the graphs for tuned XFS were with the deadline scheduler):

mkfs.xfs -f -d agcount=4 -l lazy-count=1 -l size=64m /dev/sda7

mount -o nobarrier,noatime,nodiratime,logbufs=8 /dev/sda7 /test

Don’t just follow the above blindly, normally mkfs tries to auto-adjust those (i.e. the agcount) but the important ones to look for are the log size and the mount options, especially the nobarrier and logbufs. Remember though that nobarrier is only recommended if you have battery backup.

D

Mon
5
Jan '09

So what exactly is IBM trying to do with the XIV?

By now most people dealing with storage know that IBM acquired the XIV technology. What IBM is doing now is trying to push the technology to everyone and their dog, for reasons we’ll get into…

I just hope IBM gets their storage act together since now they’re selling products made by 4-5 different vendors, with zero interoperability between them (maybe SVC is the “one ring to rule them all”?)

In a nutshell, the way XIV works is by using normal servers running Linux and the XIV “sauce” and coupling them together via an Ethernet backbone. A few of the nodes get FC cards and can become FC targets. A few more of the features:

  • Thin provisioning
  • Snaps
  • Synchronous (only) replication
  • Easy to use (there’s not much you can do with it)
  • Uses RAID-X (no global spares, merely there’s space on each drive, faster rebuilds are possible)
  • Only mirrored
  • A good amount of total cache per system since each server has several GB of RAM BUT the cache is NOT global (each node simply caches the data for its local disks).

IBM claims insane performance numbers with the XIV (“it will destroy DMX/USP!” — sure). But let’s take a look at how everything looks:

  • 180 maximum (or minimum) drives (you can get a half config but I think you always get the 180 drives but license half, I might be mistaken - I believe you have to make a commitment that you’ll buy the whole thing in 1 year)
  • Normal Linux servers do everything
  • Only SATA
  • The backbone is Ethernet, not FC or Infiniband (much, much higher latency is incurred by Ethernet vs the other technologies)

The way IBM claims they can sustain high speed is to not try and make the SATA drives get bound by their low transactional performance vs 15K FC drives or, even worse, SSDs. From what I understand (and IBM employees feel free to chime in) XIV:

  1. Ingests data using a few of the front-end nodes
  2. Tries to break up the datastream into 1MB chunks
  3. The algorithm tries to pseudo-randomly spread the 1MB chunks and mirror them among the nodes (the simple rule being that a 1MB chunk cannot have a mirror on the same server/shelf of drives!)

Obviously, by doing effectively as much as possible large block writes to the SATA drives and using the cache to great effect, one should be able to see the 180 SATA drives perform pretty much as fast as possible (ideally, the drives should be seeing streaming instead of random data). However (there’s always that little word…)

  1. There is no magic!
  2. If the incoming random IOPS are coming at too great a rate (OLTP scenarios), any cache can get saturated (the writes HAVE to be flushed to disk, I don’t care what array you have!) and it all boils down to the actual number of disks in the box. The box is said to do 20,000 IOPS if that happens - which I think is optimistic at 111 IOPS/drive! At any rate, 20,000 IOPS is less than what even small boxes from EMC or other vendors can do when they run out of cache. Where’s the performance advantage of XIV?
  3. The “randomization removing algorithm”, if indeed there’s such a thing in the box, will have issues with more than 1-2 servers sending it stuff
  4. See #1!

Like with anything, you can only extract so much efficiency out of a given system before it blows up.

An EMC CX4-960 could be configured with 960 drives. Even assuming that not all are used due to spares etc. you are left with a system with over 5 times the number of physical disks vs an XIV, tons more capacity etc. Even if the “magic” of XIV makes it more efficient, are those XIV SATA drives really 5 times more efficient? (5 times would make it EQUAL to the 960 performance; XIV would have to be well over 5 times more efficient than an EMC box of equivalent size to beat the 960.)
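
The arithmetic behind the 111 IOPS/drive and "over 5 times" figures is simple enough to spell out. All three inputs come straight from the text above; the variable names are just for illustration.

```python
XIV_DRIVES = 180               # drive count of a full XIV
XIV_CACHE_MISS_IOPS = 20_000   # figure quoted for a cache-saturated XIV
CX4_960_DRIVES = 960           # maximum drive count of a CX4-960

iops_per_drive = XIV_CACHE_MISS_IOPS / XIV_DRIVES
drive_ratio = CX4_960_DRIVES / XIV_DRIVES

print(f"{iops_per_drive:.0f} IOPS per SATA drive")
print(f"{drive_ratio:.2f}x as many drives in a full CX4-960")
```

So each XIV drive would need to do the work of a bit more than five drives in a fully populated CX4-960 just to draw level, before spares are taken out.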

Let’s put it that way:

If my system was as efficient as IBM claims, and I had IBM’s money, I’d buy all the competitive arrays, even at several times the size of my box, and publicize all kinds of benchmarks showing just how cool my box is vs the competition. You just can’t find that info anywhere, though.

Regarding innovation: Other vendors have had similar chunklet wide striping for years now (HP EVA, 3Par, Compellent if I’m not mistaken, maybe more). 3Par for sure does hot sparing similar to an XIV (they reserve space on each drive). 3Par can also grow way bigger than XIV (over 1,000 drives).

So, if I want a box with thin provisioning, wide striping, sparing like XIV but the ability to choose among different drive types, why not just get a 3Par? What is the compelling value of XIV, short of being able to push 180 SATA drives well? Nobody has been able to answer this.

I’m just trying to understand XIV’s value prop since:

  1. It’s not faster unless you compare it to poorly architected configs
  2. It has less than 50% efficiency at best, so it’s not good for bulk storage
  3. It’s not cheap from what I’ve seen
  4. Burns a ton of power
  5. Cannot scale AT ALL
  6. Cannot tier within the box (NO drive choices besides 1TB SATA)
  7. Cannot replicate asynchronously
  8. Has no application integration
  9. No Quality of Service performance guarantees
  10. No ability to granularly configure it
  11. Is highly immature technology with a small handful of reference customers and a tiny number of installs! (I guess everyone has to start somewhere but do YOU want to be the guinea pig?)

Unless your needs are exactly what XIV provides, why would you ever buy one? Even if your space/performance needs are in the XIV neighborhood there are other far more capable storage systems out there for less money!

IBM is not stupid, or at least I hope not. So, what IBM is doing is pretty much handing out XIVs to whoever will take one. If you get one, think of yourself as a beta tester. Because I hardly believe that IBM bought the XIV IP without seeing some kind of roadmap, otherwise the purchase would be kinda stupid! If you are a beta-tester, be aware that:

  • XIV cheats with benchmarks that write zeros to the disk or read from not previously-accessed addresses
  • XIV will be super-fast with 1-2 hosts pushing it, push it realistically with a real number of hosts
  • Try to load up the box since if it’s not full enough you’ll get an extremely skewed view of performance - put even dummy data inside but fill it to 80% and then run benchmarks!
  • Test with your applications, not artificial benchmarks
  • Do not accept the box in your datacenter before you see a quote! In at least 3 cases that I know of IBM drops off the box without giving you even a ballpark figure. I think that’s insane.

And last, but not least: I keep hearing and reading about the following being true, I’d love IBM engineers to disprove it:

If you remove 2-3 drives from different trays simultaneously from a loaded system then you will suffer a catastrophic failure (logically makes sense looking at how the chunks get allocated but I’d love to know how it works in real life). And before someone tells me that this never happens in real life, it’s personally happened to me at least once (lost 2 drives in rapid succession) and to many other people I know that have any serious real-world experience…
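
Here is a back-of-the-envelope sketch of why the chunk allocation makes this plausible. It rests on several assumptions that are not from IBM: the 180 drives are laid out as 15 modules of 12 drives, every 1MB chunk picks its two drives (in different modules) uniformly at random, and the box holds about 10 TB of data. It says nothing about how the real system behaves, for example how quickly a rebuild closes the exposure window.

```python
from math import comb

def cross_module_pairs(modules: int, drives_per_module: int) -> int:
    """Number of drive pairs whose two drives sit in different modules."""
    total = modules * drives_per_module
    return comb(total, 2) - modules * comb(drives_per_module, 2)

def p_pair_shares_chunk(mirrored_chunks: int, pairs: int) -> float:
    """Chance that one specific cross-module drive pair holds both copies of
    at least one chunk, if every chunk picks its pair uniformly at random."""
    return 1 - (1 - 1 / pairs) ** mirrored_chunks

# Assumed layout: 15 modules x 12 drives = 180 drives.
pairs = cross_module_pairs(modules=15, drives_per_module=12)

# Assumed load: 10 TB written as 1 MB chunks is roughly 10 million chunks.
print(pairs, p_pair_shares_chunk(10_000_000, pairs))
```

Under those assumptions there are 15,120 possible cross-module pairs and millions of chunks to spread over them, so practically every pair of drives ends up holding both copies of something. That is the flip side of wide striping: any two simultaneous failures in different trays would hit at least some data.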

D

Mon
15
Dec '08

Cinebench benchmarks - performance comparison between Vista 64 and Mac OS X

Been a while since I posted anything - there’s a TON of material but some of us actually do more than blog, it’s quarter/year end, and I barely have time to go to the bathroom…

But this was an easy one so I thought I’d post it real quick. Using Scribefire, a blogging plug-in for Firefox. I hate it.

Disclaimer: The machines used are not identical.

However, the CPUs supposedly are pretty close in speed (2.6 vs 2.8 GHz). Memory is the same.

Graphics are also similar but the Lenovo box has 128MB VRAM whereas the Mac has 512MB and is a faster GPU.

The contenders: Macbook Pro 2.8GHz vs Lenovo T62p (14″ model) running Vista 64, 2.6GHz.

The Mac is running a 32-bit OS (64-bitness is coming with Snow Leopard next year). It also has switchable graphics and one can choose between the on-chipset Nvidia 9400 or the discrete 9600. Typically on-board graphics are pretty crappy.

Despite the dissimilarity of the machines here are some notables:

  • Cinebench really takes off in 64-bit mode in Vista
  • OS X seems to do quite well even though it’s not 64-bit yet
  • The integrated graphics on the new Mac are awesome
  • The discrete graphics are great for a laptop
  • OS X seems to be more efficient than Vista when doing multi-CPU work, at least in this case
  • If someone is looking for a decent modern laptop they can do far worse than the new Macs, even a plain Macbook would be pretty decent

Here’s a chart of the results:

OS/Config                           1-CPU   2-CPU   GFX    Multiprocessor speedup
Macbook Pro 2.8GHz integrated GFX   3208    6051    4813   1.87
Macbook Pro 2.8GHz discrete GFX     3213    5926    6130   1.84
Lenovo 2.6GHz 32-bit                2693    4755    4264   1.77
Lenovo 2.6GHz 64-bit                3040    5367    4256   1.77
Sun
2
Nov '08

Postmark on late 2008 Macbook Pro

So I’m now the proud owner of a tricked-out 2.8GHz MBP.

Naturally I’ve been tinkering with it already (only had it for 2 days). I’ve disabled swapfile encryption, for instance, and I think it makes it have teh snappy.

I compiled postmark for it with -O3 -m64 and ran the usual. Before doing so though I did disable the Spotlight indexer like this:

sudo launchctl unload /System/Library/LaunchDaemons/com.apple.metadata.mds.plist

PostMark v1.5 : 3/27/01
pm>set number 10000
pm>set transactions 20000
pm>set subdirectories 5
pm>set size 500 100000
pm>set read 4096
pm>set write 4096
pm>run

Time:
273 seconds total
256 seconds of transactions (78 per second)

Files:
20092 created (73 per second)
Creation alone: 10000 files (833 per second)
Mixed with transactions: 10092 files (39 per second)
9935 read (38 per second)
10064 appended (39 per second)
20092 deleted (73 per second)
Deletion alone: 10184 files (2036 per second)
Mixed with transactions: 9908 files (38 per second)

Data:
548.25 megabytes read (2.01 megabytes per second)
1158.00 megabytes written (4.24 megabytes per second)

I then enabled spotlight and re-ran the benchmark:

Time:
483 seconds total
468 seconds of transactions (42 per second)

Files:
20092 created (41 per second)
Creation alone: 10000 files (909 per second)
Mixed with transactions: 10092 files (21 per second)
9935 read (21 per second)
10064 appended (21 per second)
20092 deleted (41 per second)
Deletion alone: 10184 files (2546 per second)
Mixed with transactions: 9908 files (21 per second)

Data:
548.25 megabytes read (1.14 megabytes per second)
1158.00 megabytes written (2.40 megabytes per second)

Obviously Spotlight is very aggressive in its indexing and tries to do it ASAP - you lose almost half your performance when doing metadata-intensive processing. The results though, while sucky for the specs of the box, are far, far removed from (and much better than) what an old colleague got on his beastie: http://recoverymonkey.net/wordpress/?p=62 - granted, my box is faster but it shouldn’t be THAT much faster.

I urge my newfound Mac brethren to help out in determining the cause.
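
For the record, here is the size of the Spotlight hit worked out from the two runs. The figures are copied from the PostMark output above; the dictionary keys and helper name are mine.

```python
# Figures from the two PostMark runs above.
without_spotlight = {"total_s": 273, "tps": 78, "read_mb_s": 2.01, "write_mb_s": 4.24}
with_spotlight = {"total_s": 483, "tps": 42, "read_mb_s": 1.14, "write_mb_s": 2.40}

def pct_change(before: float, after: float) -> float:
    """Percentage change from `before` to `after` (negative means a drop)."""
    return round((after - before) / before * 100, 1)

for metric in without_spotlight:
    change = pct_change(without_spotlight[metric], with_spotlight[metric])
    print(f"{metric:10s} {change:+.1f}%")
```

Transactions per second drop by about 46% and the run takes about 77% longer with the indexer enabled.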

More benchmarks to follow.

D

Thu
16
Oct '08

A word of caution when setting up a deduplicating VTL

Based on some recent experiences I wanted to make people aware of some caveats with setting up a VTL with deduplication. This is specifically regarding the EMC DL3D (AKA Quantum DXi) but applies to all of them. This will be a mercifully short and to the point post. Here’s the rub:

  • Create small virtual tapes (100GB max, I’d go even smaller, obviously depends on your environment)
  • Create a bunch of virtual tape drives (you might have to create 20-30!)
  • Do NOT I repeat NOT multiplex in the backup software! It screws up the deduplication algorithm.
  • Do not compress the data before the backup
  • Do not encrypt the data
  • Be mindful of your retention policies, start gently then work your way up.
  • I’d personally not multi-stream a server at all, just so I can keep the tape utilization high. What I mean: Say you do not do multiplexing but you are multistreaming – i.e. you’re sending 10 streams from your client. This means you will need 10 tapes without multiplexing, so you’ll end up writing a tiny bit on each tape. It doesn’t take a genius to realize that you’ll end up with a ton of tapes with not much data on them, which will cause them to be appended to with more tiny amounts of data, which will in turn cause them to expire way later than you’d like.
  • If you can use the box as NAS and know how to get the throughput up there then do so, that way there’s no issue with multiple streams. My Data Domain boys are chuckling now (they always prefer to do NAS, but that also has to do with the fact that their box can’t really do VTL properly yet. Oh, the cattiness! BTW my company does sell quite a lot of their stuff).

The same rules apply otherwise as in my previous post about tuning NetBackup for large environments.

Regarding using the DL3D/DXi as NAS: Plug in as many GigE ports as you can, but make sure your switch can do straight-up EtherChannel (not LACP). So you pretty much need to have a “proper” Cisco switch in order to get the full benefit. Then use multiple media servers. Use a separate NAS share per media server. Team the NICs on the backup servers for performance (do LACP or PaGP there, whatever works with the server’s NIC software). Then call me in the morning.

D

Mon
16
Jun '08

Massive benchmark comparison between Windows XP, Vista and 2008 Server, 32- and 64-bit

Found this while surfing and couldn’t resist posting the link. This guy did a massive array of tests on pretty much all versions of Windows that matter at the moment. The short version? If it’s performance you’re after, there’s no clear winner, since they all have their strengths. Overall, of the currently-supported OSes 2008 server seems to have the edge, as indicated by my own experiences. Attaching the results below, but here’s a link, too.

microsoft OS benchmarks

Tue
10
Jun '08

Virtualized Windows I/O performance on Xen with and without the optimized PV drivers, and versus the Linux host

One of my readers, Randall Ehren, was kind enough to provide benchmarks for Xen-virtualized Windows 2003 and XP with and without the optimized PV driver, and also compare to the underlying host. Most of the text below is copied verbatim from his correspondence with me, I just added some clarification in places.

physical machine description:
dell poweredge r200 server, 8GB ram, 2×250GB SATA 7200rpm in a RAID1

Xen host: ubuntu 8.04 LTS Server edition running xen 3.2 hypervisor (this is referred to as the dom0 machine)

This server is hosting two virtual servers (1 - windows 2003 server (1GB RAM), 2 - windows xp (1GB RAM)) and I performed two postmark benchmarks - one with an out of the box windows installation (indicated by “no PV drivers”), the other with a paravirtualized disk driver (indicated by “with Xen PV 0.9.6 drivers”) whose purpose is to greatly increase disk & network performance for windows-based virtual machines running under Xen. the drivers can be found here:

 http://wiki.xensource.com/xenwiki/XenWindowsGplPv

Postmark settings:

set number 10000
set transactions 20000
set subdirectories 5
set size 500 100000
set read 4096
set write 4096

Underlying host

##
## server: ubu 8 amd64 / iron / ext3fs on LVM2
## storage: direct attached / raid1 / 2×250g 7.2k sata / idle
##

Linux vm5 2.6.24-17-xen #1 SMP Thu May 1 15:55:31 UTC 2008 x86_64 GNU/Linux

Time:

        73 seconds total
        59 seconds of transactions (338 per second)

Files:
        20092 created (275 per second)
                Creation alone: 10000 files (10000 per second)
                Mixed with transactions: 10092 files (171 per second)
        9935 read (168 per second)
        10064 appended (170 per second)
        20092 deleted (275 per second)
                Deletion alone: 10184 files (783 per second)
                Mixed with transactions: 9908 files (167 per second)

Data:
        548.25 megabytes read (7.51 megabytes per second)
        1158.00 megabytes written (15.86 megabytes per second)


## server: w2k3 32bit / xen 3.2 / ntfs (no PV drivers)
## storage: direct attached / raid1 / 2×250g 7.2k sata / idle
## xen notes: using ‘tap:aio’ file driver for windows 2003 server “drive”

Time:

        193 seconds total
        123 seconds of transactions (162 per second)

Files:
        20092 created (104 per second)
                Creation alone: 10000 files (166 per second)
                Mixed with transactions: 10092 files (82 per second)
        9935 read (80 per second)
        10064 appended (81 per second)
        20092 deleted (104 per second)
                Deletion alone: 10184 files (1018 per second)
                Mixed with transactions: 9908 files (80 per second)

Data:
        548.25 megabytes read (2.84 megabytes per second)
        1158.00 megabytes written (6.00 megabytes per second)


## server: w2k3 32bit / xen 3.2 / ntfs (with Xen PV 0.9.6 drivers)
## storage: direct attached / raid1 / 2×250g 7.2k sata / idle
## xen notes: using ‘tap:aio’ file driver for windows 2003 server “drive”

Time:

        129 seconds total
        68 seconds of transactions (294 per second)

Files:
        20092 created (155 per second)
                Creation alone: 10000 files (204 per second)
                Mixed with transactions: 10092 files (148 per second)
        9935 read (146 per second)
        10064 appended (148 per second)
        20092 deleted (155 per second)
                Deletion alone: 10184 files (848 per second)
                Mixed with transactions: 9908 files (145 per second)

Data:
        548.25 megabytes read (4.25 megabytes per second)
        1158.00 megabytes written (8.98 megabytes per second)




## server: winxp 32bit / xen 3.2 / ntfs (no PV drivers)


## storage: direct attached / raid1 / 2×250g 7.2k sata / idle
## xen notes: using ‘tap:aio’ file driver for windows xp “drive”

Time:

        336 seconds total
        274 seconds of transactions (72 per second)

Files:
        20092 created (59 per second)
                Creation alone: 10000 files (178 per second)
                Mixed with transactions: 10092 files (36 per second)
        9935 read (36 per second)
        10064 appended (36 per second)
        20092 deleted (59 per second)
                Deletion alone: 10184 files (1697 per second)
                Mixed with transactions: 9908 files (36 per second)

Data:
        548.25 megabytes read (1.63 megabytes per second)
        1158.00 megabytes written (3.45 megabytes per second)




## server: winxp 32bit / xen 3.2 / ntfs (with Xen PV 0.9.6 drivers)
## storage: direct attached / raid1 / 2×250g 7.2k sata / idle
## xen notes: using ‘tap:aio’ file driver for windows xp “drive”

Time:

        233 seconds total
        181 seconds of transactions (110 per second)

Files:
        20092 created (86 per second)
                Creation alone: 10000 files (222 per second)
                Mixed with transactions: 10092 files (55 per second)
        9935 read (54 per second)
        10064 appended (55 per second)
        20092 deleted (86 per second)
                Deletion alone: 10184 files (1454 per second)
                Mixed with transactions: 9908 files (54 per second)

Data:
        548.25 megabytes read (2.35 megabytes per second)
        1158.00 megabytes written (4.97 megabytes per second)

 

Conclusion: it seems the PV driver does help greatly with I/O performance. Of course, compared to the performance of the underlying host, the VMs suck. I’d like to see Randall run the test and use the same box to run at least 2003 in native mode and then post; that should give a great comparison between NTFS and ext3.
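For anyone who wants to put a number on "helps greatly", here is a minimal Python sketch that pulls the headline figures out of a Postmark report and compares two runs. The function names and the two sample strings are my own for illustration (the figures are copied from the w2k3 runs above); it only looks at the two "Time:" lines and ignores everything else in the report.

```python
import re

def parse_postmark(report: str) -> dict:
    """Extract total run time and transaction rate from a Postmark report."""
    total = re.search(r"(\d+) seconds total", report)
    txn = re.search(r"(\d+) seconds of transactions \((\d+) per second\)", report)
    if total is None or txn is None:
        raise ValueError("text does not look like a Postmark report")
    return {
        "total_s": int(total.group(1)),
        "txn_s": int(txn.group(1)),
        "txn_per_s": int(txn.group(2)),
    }

def speedup(baseline: dict, tuned: dict) -> float:
    """How many times faster the tuned run finished compared to the baseline."""
    return baseline["total_s"] / tuned["total_s"]

# "Time:" sections copied from the two w2k3 runs above.
W2K3_NO_PV = """
        193 seconds total
        123 seconds of transactions (162 per second)
"""
W2K3_PV = """
        129 seconds total
        68 seconds of transactions (294 per second)
"""

no_pv = parse_postmark(W2K3_NO_PV)
pv = parse_postmark(W2K3_PV)
print(f"w2k3 total-time speedup with PV drivers: {speedup(no_pv, pv):.2f}x")
print(f"w2k3 transactions/s: {no_pv['txn_per_s']} -> {pv['txn_per_s']}")
```

The same two calls work for the XP runs or the underlying host; just paste in the relevant "Time:" section.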

Randall/D

Wed
28
May '08

Finally, some postmark results for OSX! And how does it do versus Windows?

My colleague Ian (last name withheld to save him from the Mac zealots) compiled the postmark code on his beloved Mac and ran it with the same settings I use in general (see older blog posts, just search for postmark).

I’ve been curious for the longest time to see how OSX performs in this test, since most UNIX and -alike systems work great with it. I wanted to see if OSX would be appropriate for a high IOPS-type environment (my belief being that due to the choice of kernel and filesystem it would suck - Mach and HFS+ not being exactly ideally suited to such tasks).

This is obviously not the most scientific test but I think it is good enough to get a rough gauge.

I’m still waiting for the specifics on the Mac but it’s an older Intel-based 17″ MacBook Pro with a 2.16GHz CPU, 5400 RPM HD and 2GB RAM.

The horrendous result (I think my rusty abacus did better once):

Time:
1259 seconds total
1186 seconds of transactions (16 per second)

Files:
20092 created (15 per second)
Creation alone: 10000 files (163 per second)
Mixed with transactions: 10092 files (8 per second)
9935 read (8 per second)
10064 appended (8 per second)
20092 deleted (15 per second)
Deletion alone: 10184 files (848 per second)
Mixed with transactions: 9908 files (8 per second)

Data:
548.25 megabytes read (445.92 kilobytes per second)
1158.00 megabytes written (941.85 kilobytes per second)

To compare and contrast (and save you from searching the older posts):

On a similar-spec Thinkpad T60 running XP (1.8GHz Core Duo, 2GB RAM, 60GB 5400 RPM HD):

Time:
174 seconds total
91 seconds of transactions (219 per second)

Files:
20092 created (115 per second)
Creation alone: 10000 files (222 per second)
Mixed with transactions: 10092 files (110 per second)
9935 read (109 per second)
10064 appended (110 per second)
20092 deleted (115 per second)
Deletion alone: 10184 files (268 per second)
Mixed with transactions: 9908 files (108 per second)

Data:
548.25 megabytes read (3.15 megabytes per second)
1158.00 megabytes written (6.66 megabytes per second)

 

And on the spankin’ T61p running 2008 Server (2.6GHz Core 2 Duo, 4GB RAM, 200GB 7200 RPM HD):

Time:
110 seconds total
39 seconds of transactions (512 per second)

Files:
20092 created (182 per second)
Creation alone: 10000 files (454 per second)
Mixed with transactions: 10092 files (258 per second)
9935 read (254 per second)
10064 appended (258 per second)
20092 deleted (182 per second)
Deletion alone: 10184 files (207 per second)
Mixed with transactions: 9908 files (254 per second)

Data:
548.25 megabytes read (4.98 megabytes per second)
1158.00 megabytes written (10.53 megabytes per second)

 

Drive and CPU speed are important, but postmark results are largely a function of filesystem and cache efficiency. It’s also worth noting that postmark is in no way optimized for Windows, since it is just standard C code and indeed was meant to be run on Unix boxes. Typically, good Unix filesystems beat Windows in postmark (my record run time was something like under 10s on a Solaris box and DMX).

Unless something is wrong, HFS+ and/or the OSX cache are execrable for this kind of workload, which is a pity. Maybe there are better mount options? Some tuning options?

This is huge! Even if there’s some issue (disk fragmentation, for instance) the difference in sheer IOPS performance between OSX and pretty much anything else is staggering.

Any Mac users out there that want to chime in and save the day please let me know, I’ll send you the source and you can compile it whichever which way. I truly hope there is some serious error here.

D

Tue
13
May '08

Lowest-impact antivirus tool I’ve ever tried

I’ve been trying out ESET’s NOD32 on my 64-bit 2008 Server box. Before that I’d tried Avast! – which has great detection but noticeably slows down my computer, even when it’s loading pre-cached and pre-checked content (easy test: load Firefox with and without Avast! several times. It’s ALWAYS much slower to load with antivirus on than without. Without Avast! it loads instantly).

So I put in NOD32 Business Edition and the performance difference is staggering. Indeed, I can’t tell the difference between having it on or off. Unless you ask for a scan of the entire box the antivirus process never even goes to 1% of CPU consumption. If you check various online tests of the different antivirus programs they do show NOD32 having some of the best performance overall (including possibly the best heuristics engine with practically zero false positives), plus it works with 2008.

Other progs (Like Kaspersky) also work well but they’re much slower. I think I’ve found my holy grail when it comes to virus protection.

The one massive drawback is that Business Edition (which is the only one that supports 2008) is ONLY sold in 5-computer packs. It’s not expensive (boils down to like $40 per box, same as Home Edition) but I don’t HAVE 5 servers, I just have my 1 laptop that runs 2008.

I asked ESET and they wouldn’t sell me a single Business license. That’s just silly. The product is priced right and is totally solid, but they won’t sell you fewer than 5 licenses. I won’t spend 200-odd bucks for one machine.

Their response was that most businesses have more than 5 computers in general, so even if they have only 1-2 servers the rest of the licenses can be used on desktops/laptops. Which makes sense but it doesn’t help me :)

The only other product I’d consider now is Avira’s Antivir (same great speed and detection rate, however it provides many more false positives) but I hear it doesn’t work on 2008.

Damn the box is fast now, I forgot how blazing 2008 feels unencumbered by other fluff :)

D

Tue
25
Mar '08

Windows Server 2008 RTM 64-bit performance versus Vista SP1 64-bit, and using 2008 as a workstation

I’ve been using Vista x64 for a while now, just so I can make use of all the memory on my machine (an über-thinkpad), and because I like shiny new things and 64-bitness and don’t want to be one-upped by smug Mac users with their feline-named OSes, mock turtlenecks and their newfound 64-bit capabilities. Of course, with the good comes some bad – Vista, while in my opinion a step forward in many ways, does take a step backward when it comes to some areas of performance and sheer resource requirements. A lot of it can be attributed to poorly-written drivers, especially any Aero GUI slowdowns with nVidia cards.

Since space was running out I bought a new hard drive (200GB Seagate 7200 RPM) and decided to install the RTM 2008 bits. If something went wrong I figured I could always either go back to my old drive or just move Vista to the new drive with some imaging utility or other, no biggie. If 2008 worked out, I’d keep it.

The reason this comparison is worthwhile is that 2008 and Vista SP1 have the same exact kernel – I checked, NTOSKRNL.EXE is the same in both OSes. One would think that the differences wouldn’t be huge and that therefore there’s no point going to 2008. Of course, there are a lot of other pieces aside from the kernel, and I think that Microsoft checks to see what OS you’re running and maybe disables certain features in the kernel accordingly – I couldn’t get the LargeSystemCache registry parameter to have any effect on Vista, for example.

Let’s compare CPU- and Graphics-benchmarks first, since those shouldn’t really be different. I used Cinebench 64-bit.

 

Vista:

Rendering (Single   CPU): 3040 CB-CPU
Rendering (Multiple CPU): 5367 CB-CPU
Multiprocessor Speedup: 1.77
Shading (OpenGL Standard)          : 4256 CB-GFX

 

2008:

Rendering (Single   CPU): 3053 CB-CPU
Rendering (Multiple CPU): 5379 CB-CPU
Multiprocessor Speedup: 1.86
Shading (OpenGL Standard)          : 4478 CB-GFX

 

Slightly better scores for 2008 it seems, but not dramatically better. Next, postmark, since I/O should be where it shines, it being a server and all:

 

Vista:

Time:
        170 seconds total
        98 seconds of transactions (204 per second)

Files:
        20092 created (118 per second)
                Creation alone: 10000 files (200 per second)
                Mixed with transactions: 10092 files (102 per second)
        9935 read (101 per second)
        10064 appended (102 per second)
        20092 deleted (118 per second)
                Deletion alone: 10184 files (462 per second)
                Mixed with transactions: 9908 files (101 per second)

Data:
        548.25 megabytes read (3.23 megabytes per second)
        1158.00 megabytes written (6.81 megabytes per second)

 

2008:

Initially I had enabled the “advanced performance” in the device manager for disk, since everyone tells you to do so in all tuning guides…

 

Time:
136 seconds total
45 seconds of transactions (444 per second)

Files:
20092 created (147 per second)
Creation alone: 10000 files (263 per second)
Mixed with transactions: 10092 files (224 per second)
9935 read (220 per second)
10064 appended (223 per second)
20092 deleted (147 per second)
Deletion alone: 10184 files (192 per second)
Mixed with transactions: 9908 files (220 per second)

Data:
548.25 megabytes read (4.03 megabytes per second)
1158.00 megabytes written (8.51 megabytes per second)

 

Much faster than Vista. I then disabled the “enable advanced performance” to see how much slower it would become:

 

Time:
110 seconds total
39 seconds of transactions (512 per second)

Files:
20092 created (182 per second)
Creation alone: 10000 files (454 per second)
Mixed with transactions: 10092 files (258 per second)
9935 read (254 per second)
10064 appended (258 per second)
20092 deleted (182 per second)
Deletion alone: 10184 files (207 per second)
Mixed with transactions: 9908 files (254 per second)

Data:
548.25 megabytes read (4.98 megabytes per second)
1158.00 megabytes written (10.53 megabytes per second)

 

Amazingly, much faster, not slower! I did some checking and this is what the setting actually does… it re-introduces an older, somewhat undesirable behavior. A bit hard to find the proper explanation, and I hope Microsoft makes what happens behind the scenes a bit more obvious. At the moment it’s quite obscure, and every guide tells you to enable it for performance. Just leave it alone. BTW the Vista score is with the setting disabled.

 

Could I have run other benchmarks like Sandra etc? Sure, but I just wanted to keep it simple and there just wasn’t enough time.

 

The next step is to run the tests on the same hardware with XP. That’s forthcoming.

 

Conclusion:

 

Seems like Microsoft did something right. Even with the 64-bit version (which naturally takes more RAM than the 32-bit one), 2008 Server takes less memory than Vista (200-300MB less at any given time in my case), runs quicker and just feels better, kinda like an unencumbered Vista. Simple things like searching a huge index in Outlook happen much faster than before. The Server Manager app is awesome, and one can try out the Hyper-V Hypervisor (BTW that, predictably, clashes with VMware and disables your power management, so beware). A server OS is in general also more secure and, over time, probably more reliable, given the workloads it’s supposed to run.

 

Can everyone run it? Should they? No, not unless you have a license for 2008 through MSDN or somesuch, otherwise it’s expensive. Some assembly is also required, and you do need to know what you’re doing. However, if you’re so inclined, you can easily get the demo version of 2008. Apparently there are clean, documented ways to increase the evaluation period (no cracks or BIOS spoofers) that I think come from Microsoft but I’m not going to list them here just in case…

 

In addition, while almost all my apps installed fine (including games and hairy driver stuff like Daemon Tools), 2 things didn’t: Bluetooth and my Logitech mouse drivers. I don’t quite use Bluetooth but I liked some of the features of my mouse (the utterly kickass Logitech VX Revolution), now it’s just like a normal mouse. I’m still keeping 2008. I’m sure other stuff will have issues, like DRM/BluRay. For people that like the Windows Sidebar: there are hacks to get it working that involve copying stuff from Vista. I think the sidebar is largely useless.

 

FYI, there are 2 notable omissions in 2008: Readyboost and Superfetch. Superfetch exists as a service but to even get it to start you have to edit the registry. I didn’t think it helped much so I disabled it again. Readyboost isn’t even an option. And the old-style boot prefetch that worked in 2003 Server doesn’t seem to be there. So it does boot a bit slower than Vista, but not much. Once you get the box up and running it’s fast though.

 

In the end, I’m leaving 2008 on my box, and that’s all that matters.

 

D

Mon
4
Feb '08

NetApp posts SPC-1 results

NetApp posted some SPC results showing their 3040 box performing pretty well in SPC-1 relative to an EMC box.

There have been rumors that performance suffers when running multiple features on a NetApp box. Which kinda negates the whole value prop of NetApp (since that’s when people typically choose NetApp - they want one box to do everything).

A realistic test would be to have OTHER apps sharing the array (on other spindles), as is usually the case. Almost nobody dedicates an entire array of that size to a single app.

Have the box do CIFS, NFS, iSCSI AND FC.

Show performance over a significant period of time (another point NetApp detractors use – performance declines over time due to WAFL fragmentation).

THEN show the performance delta as each feature is enabled.

Obviously hard to do and maintain kosher SPC results but it would be a worthwhile addendum and, if successful, would shut up the NetApp detractors (since that’s a usual technique for selling against NetApp). I’d also show performance in degraded mode.

Anyone have any data on NetApp performing either way when used as a multi-role box?

A note on the EMC config and interpreting those benchmarks in general, be they SPC or SPEC or whatever: ALWAYS READ THE FULL DISCLOSURE regarding the test, don’t just look at the graph. If you’re not technical, get a techie to explain it to you.

For instance, looking at the way the EMC box was set up, I highly doubt it was done using EMC’s best practices. To wit:

  1. They didn’t maximize the write cache
  2. They seem to not have used separate spindles for the snapshot area (a differentiator since, unlike NetApp, EMC not only allows such a thing to happen but actually encourages it)
  3. They could have used MetaLUNs more instead of striping using Windows.

I’d be willing to bet dollars to nuts that the NetApp box was set up properly :)

Another thing: look at the response times in the graphs.

Like they say, “only believe 50% of the statistics you read”.

D

Sun
9
Dec '07

We need more wizards!

No, I don’t mean Gandalf, I mean the software kind. And before I’m accused of being Gates’ live-in cabana boy (it’s all baseless rumors), let me clarify.

It’s a known fact that most OSes need tuning (sometimes significant) to perform well with heavy-duty applications (I’m not talking about your home web server, I’m talking about Exchange, SAP, Oracle, IIS, Apache etc. in large deployments. I acknowledge the fact that most OSes, out of the box, will work OK for anything small).

Most frequently the application documentation will have some kind of tuning guidelines telling you approximately what to do in each OS. The installer sometimes will apply some tunings for you after asking for your permission. Often, the suggested settings are woefully inadequate for truly large implementations, as with NetBackup (the Veritas-suggested tunings work for smaller environments but I have some magical kernel tunings as posted before that make it truly fly when the ridiculous is asked of it – and the difference in the parameters between my config and what Veritas suggests is huge. Oh, and some of my parameters are way smaller than what Veritas recommends. And I won’t call them Symantec, Veritas is a way cooler name anyway, look it up in a Latin-English dictionary).

Frequently, some tunings are so common that I don’t even know why they’re not in the default configuration in certain OSes. Different conversation.

The problem is, there are experts that DO know how to set up and tune the systems properly, but said experts are rarely the admins that install and administer the thing. Usually, a fair portion of those experts do work at the companies that make the OSes and apps.

The elitist among us might say, “tough, the lowly admins need to learn all this stuff, otherwise they’re not worth what they’re paid”. To which I respond with the following points:

  • Not everyone has the time to learn the arcana of several OSes and applications, learning most of the important features is complicated enough and some shops are truly short-staffed
  • The über-experts themselves don’t know it all: They may know how to perfectly set up Exchange but wouldn’t know how to do the same thing with Oracle, how can the basic admins be expected to have such multi-discipline expertise?
  • I firmly believe in the simplicity of the appliance computing model
  • We all have more important things to do (like taking care of the big picture) than constantly worrying about minutiae
  • The people that complain that the admins should be more intelligent are typically the people that actually enjoy dealing with the arcane; their jobs are secure anyway
  • There’s money to be made in the simplification of IT – look at Microsoft, EMC/VMware and NetApp. People like simplicity and are willing to pay for it.

Of course, many larger companies will opt for professional services to do the job, but the quality of people just varies dramatically. Just because you’re getting an expensive Veritas PS guy doesn’t mean that

  1. He knows what the hell he’s doing beyond what’s in the installation manual (you know who you are!) and (less significantly)
  2. Is even a Veritas employee, despite his badge (most vendors subcontract smaller companies).

At the moment, most OSes just apply generic formulas based on memory and/or number of CPUs, though somehow do not take into account CPU speed and load, and, indeed, the ancient formulas are a pain with today’s very large memory systems (usually you have to limit some tunables in large-memory HP-UX and Solaris boxes, otherwise some parameters get out of control).

I understand that truly self-tuning OSes are not here yet, nor will they be for a while (64-bitness has taken away some of the pain though, at least in Windows). In the interim, there are better ways to approach the problem. My suggestion: Modernize the formulas that build the tunables and use simple AI techniques like Expert Systems. At installation time, benchmark the hardware and ask the user: what will the server be running? OK, so if the answer is a web server, under what conditions? How many users? And so on. Admins are far more likely to know the answers to those questions than “how many open file handles do you think you’ll need?”

Based on the answers and the benchmark results, the system should either tell you what you want is possible, or bitch.

If the box is to be serving double-duty (or quintuple, in some cases), the wizard should check and see if the tunings will conflict and, if not, tune the whole box so that it can accommodate all the applications.
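To make the idea a bit more concrete, here is a toy sketch in Python of what such a wizard could look like. Everything in it is made up for illustration: the role names, the two tunables, the formulas and the 75% memory budget are assumptions of mine, not anything a real OS installer does. The point is only the shape: the admin answers "what role, how many users", rules turn that into tunables, and a multi-role box either gets one combined plan or a complaint.

```python
from dataclasses import dataclass

@dataclass
class Workload:
    role: str    # what the admin says the box will run (hypothetical role names)
    users: int   # expected concurrent users

# Hypothetical rule base: per role, a formula for each tunable.
RULES = {
    "web": {
        "max_open_files": lambda users: 1024 + 4 * users,
        "shared_mem_mb": lambda users: 256 + users // 10,
    },
    "database": {
        "max_open_files": lambda users: 4096 + 16 * users,
        "shared_mem_mb": lambda users: 1024 + users,
    },
}

def recommend(workload: Workload) -> dict:
    """Turn one answered questionnaire into tunables for that role."""
    if workload.role not in RULES:
        raise ValueError(f"no tuning rules for role {workload.role!r}")
    return {name: rule(workload.users) for name, rule in RULES[workload.role].items()}

def plan(workloads: list, ram_mb: int) -> dict:
    """Combine all roles on one box, or complain if they cannot coexist."""
    totals = {"max_open_files": 0, "shared_mem_mb": 0}
    for workload in workloads:
        for name, value in recommend(workload).items():
            totals[name] += value
    budget_mb = ram_mb * 3 // 4  # assumed: leave a quarter of RAM for the OS
    if totals["shared_mem_mb"] > budget_mb:
        raise ValueError(
            f"needs {totals['shared_mem_mb']}MB shared memory, budget is {budget_mb}MB"
        )
    return totals

# A box doing double duty with 4GB of RAM.
print(plan([Workload("web", 2000), Workload("database", 500)], ram_mb=4096))
```

Run the same double-duty plan with `ram_mb=2048` and it raises instead of returning numbers, which is the "or bitch" part.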

If you’re creating a filesystem, what will the intended use be? The defaults for almost all filesystems are wrong! One size fits only the people that have that size. The problem is that, once you’ve put in several TB on filesystems someone built with the default parameters, changing them is almost impossible: you have to take a backup, destroy the filesystems, rebuild them then restore the data. Which could have been avoided if, say, maybe not the OS but at least Oracle had the smarts to query the FS and figure out it’s using insufficient log and block sizes and that performance will suck. At which point it should puke and tell you “sorry, this is sub-optimal, either do such-and-such to fix it or continue anyway at your peril”. But of course you’re using raw disks for Oracle, right? Right?

Or take the example of Logical Volume Managers. They are cool, yes. They can work great. They will also let you do insane things such as create multiple LVs and stripe them, even if they’re on the same physical disk! The checks that should have been performed are so ridiculously simple it boggles the mind.
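Just to show how simple that check is, here is a sketch in Python. It is not tied to any real volume manager: the input is a plain mapping from each stripe member to the physical disk it lives on (names are hypothetical), and how you would obtain that mapping from a given LVM is left out entirely.

```python
def shared_disk_conflicts(members: dict) -> list:
    """Return a complaint for every physical disk that backs more than one
    member of the same stripe. `members` maps member name -> physical disk."""
    by_disk = {}
    for member, disk in members.items():
        by_disk.setdefault(disk, []).append(member)
    return [
        f"{disk}: {', '.join(sorted(names))}"
        for disk, names in sorted(by_disk.items())
        if len(names) > 1
    ]

# Hypothetical layout: two of the three stripe members sit on the same spindle.
stripe = {"lv_data1": "disk0", "lv_data2": "disk0", "lv_data3": "disk1"}
for problem in shared_disk_conflicts(stripe):
    print("refusing to stripe, members share a disk ->", problem)
```

An empty list means every member is on its own disk and the stripe is at least not self-defeating.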

HP kinda started doing something like this a while ago – look at the templates in SAM, you can apply 2-3 different (useless) templates based on what the box will be doing that will affect a few tunables. HP-UX is guilty of needing the most tuning of any current OS I can think of, BTW (It also pays great dividends if you know what you’re doing, I took a Superdome to 2x the I/O performance once, felt proud but it took a lot of effort and research that could have been avoided).

Seems like the intelligence that would make our lives easier is like the proverbial hot potato: always someone else’s problem.

I know it’s a tall order: the whole solution would rely on much deeper interoperability between the various components than we’re used to. But I think the end result would be worth it.

In the meantime, if you have to do it all yourself, at least use common sense and have some golden OS builds that are each good for a different use, then just replicate them as needed.

Anyway, all this is aggravating my hemorrhoids (I call them The Grapes of Wrath), better stop now.

D

 

Fri
7
Dec '07

(Very) Preliminary Windows Server 2008 impressions and Vista Multimedia Performance under battery power

Out of curiosity, I very briefly tried the new Server 2008 Release Candidate (freely available from Microsoft). I’ve been using Vista 64-bit since I need to see all the memory in my machine and, while it works mostly OK, there are some low-level scheduling issues with it – for instance, sound is really choppy on battery power, no matter what I do with the power settings, so I can’t use the thing to watch a DVD or listen to music on the plane. Many others seem to be having the same issues, despite the funky Multimedia Class Scheduler nonsense that Microsoft put in the OS that makes networking slower (great info here), even though older incarnations were not suffering from media playback issues under load. And no, if I disable the Multimedia Scheduler it does NOT work better, it actually gets worse, which means that the service is there to fix some other kludge-y issue Microsoft introduced with the scheduler or something like excessive power throttling of certain devices.

But, as usual, I digress. This is about Server 2008. What’s noteworthy is that Vista SP1 inherits the exact same kernel as Server 2008.

This will be a short entry, there are others online talking more about 2008. What I noticed:

  1. It’s light for a Windows OS. There’s no excessive bloat guys, the thing takes about 300MB of RAM with the default install, and more can be saved by trimming unnecessary services (of which there are very few).
  2. It’s fast. Under preliminary benchmarking, even the RC code (that probably has some features missing and extra debugging code) seems about as fast as 2003 after SP2 (unlike others that have been releasing benchmarks of, say, Vista SP1 in its pre-release form, I’d rather wait until the final code is out).
  3. Seems to work with most Vista drivers so, if you want to turn it into a workstation, you can. You can also install the Vista GUI if you’re so inclined with no adverse effects (aside from the ones that come with the Vista UI that is). Runs very smooth.
  4. Application compatibility is similar to that of Server 2003.
  5. The OS does NOT suffer from the same issues as Vista regarding media playback (I made sure I installed the Power Management driver and selected the same kind of PM scheme as Vista). Maybe a good omen come Vista SP1? We shall see.

The new management interfaces are nicely laid out, and selecting Roles for the server and adding or removing features as needed is very simple. It feels more like a well-integrated 2003 R3 rather than Vista.

I didn’t get to play with the new virtualization, it doesn’t seem to be in the RC code (though, reading some documentation, it seems as if it will have VMotion-like capabilities, which I will believe when I see).

UPDATE: 12/17/07

There is no more Vista multimedia performance issue on 2 separate computers. Some patches just released by Microsoft removed the issue (plus the issue of the mouse cursor stuttering). Interestingly, the patches had no mention of fixing said issues. I thought it was a fluke but having seen this fixed on 2 different boxes (one 32-bit, one 64) I don’t think it is.

For the Vista detractors: I’d advise everyone to wait until SP1 – as with most Microsoft releases. It’s no different. They’re actually getting better, NT4 was unusable until SP3 at least… given the unreal amount of code in the system, I’m surprised it runs this well. They really need to slim it down. Supposedly, Windows 7 will be slimmer (http://apcmag.com/7668/beyond_vista_windows_7_what_we_know_so_far). However, it mostly targets the kernel and it was never the Windows kernel that was the issue (it’s actually surprisingly decent), it’s all the crud around it.

D

Mon
15
Oct '07

Uptempo cache can get paged out! (EDIT: After all, it does NOT).

I normally don’t do retractions unless proven wrong. So, ignore the text below and read Nick’s comment.

—————————-

A warning to those who use Datacore’s Uptempo:

While it works wonderfully as long as the server doesn’t suffer a low memory condition, the memory it reserves for cache will get paged out in low-memory situations.

I found out the hard way (as usual), while running some very demanding VMs (I only have 2GB and not the best laptop, a new machine is forthcoming). The way Uptempo reserves memory is by using a specific process, Dscaddmemory or something like that (I’ve now removed it from my system so I can’t remember the exact name). If you look at Task Manager, that process has as much memory allocated to it as you’ve allocated Uptempo.

When I was running out of RAM, I noticed that the process started shrinking in size, until it was 16MB (out of 280MB). Windows, since it looks like a normal process, decided to page it out in order to reclaim RAM.

Of course, this kinda defeats the purpose. I’d rather page out everything BUT my fancy dedicated cache, the way HP-UX does it if you tell it to (story for another day but HP-UX cache tends to work better if you specify the min and max sizes as the same and not let it auto-allocate).

My real beef with Uptempo is that it didn’t try to reclaim the memory when there most obviously was enough memory for it (after it paged itself out needlessly, I had over 350MB free and plenty in the Windows cache).

It didn’t even try to reclaim the RAM after I quit VMware and had 1.5GB free.

Obviously, either I’m missing something fundamental or some work needs to be done. Granted, any time you are forced to swap heavily, cache won’t help much, but they should at least be giving the memory back to the process afterwards.

Supercache never shows up as a process, it grabs the memory when the system boots (it’s one of the first things that happen) and nothing can swap it out. It’s also configurable on-the-fly, Uptempo needs a reboot for any size changes.

With 64-bit all these helper caching programs will probably become obsolete since cache is not limited to 1GB any longer. Though I’m not sure I subscribe to Vista’s Superfetch, since it does make the HD work like crazy when you first start the box and is more suited for boxes that are not shut down it seems. Once it settles down it works OK.

D

Thu
20
Sep '07

WAN acceleration for remote workers

The deluge of WAN accelerators from Cisco, Riverbed, Juniper, Expand, Packeteer, Bluecoat, Silverpeak etc. etc. is proving good for datacenters. Not sure how many vendors will remain viable in a year or two, but the selection at the moment is decent.

However, most of the vendors don’t address remote desktop acceleration, say for people using 3G cards on their laptops or even cable modems - sometimes the routing to corporate networks can be arcane enough that the ms of latency add up, plus most home connections are asymmetrical anyway.

So, it would be pretty cool to have a WAN accelerator in your laptop, right? Well, so far only two companies have stepped forward:

The far more established product, even if you’ve never heard of it, is AcceleNet Enterprise from ICT (Intelligent Compression Technologies, www.ictcompress.com - they were recently bought by ViaSat). ICT has been doing just this for years, with a veritable who’s who of clients (no they haven’t paid me to say this, I just think the stuff is cool). Lots of service providers use it.

ICT deploys a server that acts as a proxy, then you install an agent on your laptop. Transfers are compressed both ways.

The other vendor is known to us all - it’s Riverbed. They now have what’s called Steelhead Mobile. Effectively, it puts a Riverbed box inside your laptop. It needs a normal Steelhead to communicate with, as well as a Steelhead Mobile Controller for management. I saw pricing for the controller and it was a bit dear…

You can even adjust how much cache to give your mini-Riverbed, so if you have the space, go nuts.

Of course, you can also use this technology for servers and save money on appliance costs - I wonder if they have something that checks if you’ve installed it on a server OS, and how much CPU it takes to do its thing.

I heard somewhere Cisco is also working on something similar, unsurprisingly.

D

Fri
17
Aug '07

Processor scheduling and quanta in Windows (and a bit about Unix/Linux)

One of the more exotic and exciting IT subjects is the one of processor scheduling (if you’re not excited, read on, practical stuff to be seen later in the text). Multi-tasking OSes just give the illusion that they’re doing things in parallel - in reality, the CPUs rapidly skip from task to task using various algorithms and heuristics, making one think the processes truly are running simultaneously. The choice of scheduling algorithm can be immensely important.

Wikipedia has a nice article on schedulers in general: en.wikipedia.org/wiki/Scheduling_%28computing%29, good primer.

To cut a long story short: the processors are allowed to spend finite chunks of time (quanta) per process. Note that the quantum has nothing to do with task priority, it’s simply the amount of time the CPU will spend on the task. Every time the CPU switches to a new process, there’s what’s called a context switch (en.wikipedia.org/wiki/Context_switch), which is computationally expensive. Obviously, we need to avoid excessive context switching but still maintain the illusion of concurrency.

In Windows Server (that uses a multi-level feedback queue algorithm, FYI), the default quantum is a fixed 120ms, close to many UNIX variants (100ms) and generally accepted as a reasonably short length of time that can fool humans into believing concurrency. Compare this to the workstation-level products (Windows Vista/XP/2000 Pro) that have a variable quantum that’s much shorter and also provide a quantum (not priority) boost to the foreground process (the process in the currently active window). In the workstation products, the quantum ranges from 20-60ms typically, with the background processes always relegated to the smallest possible quantum, ensuring that the application one is currently using “feels” responsive and that no background task hampers perceived performance too much. Typically, in a box that’s used as a busy terminal server this will be the better setting to use since it will ensure that the numerous “in-focus” user processes will all get a quantum sooner rather than later.

The longer, fixed quantum of Windows Server means that fewer system resources are wasted on context switching, and that all processes have the same quantum. More total system throughput can be realized with such a scheme, and it’s more of a fair scheduler. It also explains the higher benchmark numbers when running the scheduler in “background services” mode. It’s obviously best for systems that are running a few intensive processes that can benefit from the longer quantum (and, believe it or not, games and pro audio apps run better like this).
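To put rough numbers on the quantum vs. context switch trade-off, here is a back-of-the-envelope sketch. The 10-microsecond switch cost is purely an assumption for illustration (I have not measured it, and real costs vary a lot with hardware), and the model assumes every thread burns its entire quantum.

```python
# Rough model: each quantum of useful work is followed by one context switch.
# ASSUMPTION: the switch cost below is an illustrative figure, not a measurement.
ASSUMED_SWITCH_COST_MS = 0.01  # 10 microseconds, hypothetical


def switch_overhead(quantum_ms, switch_cost_ms=ASSUMED_SWITCH_COST_MS):
    """Fraction of CPU time spent switching instead of running threads."""
    return switch_cost_ms / (quantum_ms + switch_cost_ms)


def switches_per_second(quantum_ms, switch_cost_ms=ASSUMED_SWITCH_COST_MS):
    """Context switches per second if every thread uses its full quantum."""
    return 1000.0 / (quantum_ms + switch_cost_ms)


if __name__ == "__main__":
    for quantum in (20, 60, 120):  # quanta mentioned in the text, in ms
        print(f"{quantum:>3} ms quantum: "
              f"{switches_per_second(quantum):6.1f} switches/s, "
              f"{switch_overhead(quantum):.4%} direct overhead")
```

Keep in mind this only counts the direct cost of the switch. The cache and TLB disruption that follows each switch is much harder to model and is usually where the real pain is.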

Note that I/O-bound threads (processes waiting on disk, mouse, screen and keyboard I/O) are given priority over CPU-bound threads anyway, which explains why the longer quantum doesn’t harm interactivity much. Try it - have 4 winzip/winrar/7zip sessions running concurrently. You CAN still move your mouse :) Here’s a great primer on internal windows architecture: elqui.dcsc.utfsm.cl/apuntes/guias-free/Windows.pdf. Another, deeper dive: download.microsoft.com/download/5/b/3/5b38800c-ba6e-4023-9078-6e9ce2383e65/C06X1116607.pdf.

Of course, there are ways to tune the timeslice in a more fine-grained fashion. In the registry, check out HKLM\SYSTEM\CurrentControlSet\Control\PriorityControl\Win32PrioritySeparation . Two great explanations of how it works: www.microsoft.com/technet/prodtechnol/windows2000serv/reskit/regentry/29623.mspx?mfr=true and www.microsoft.com/mspress/books/sampchap/4354c.aspx.

For instance - what if you don’t care to increase the quantum on the foreground window but, instead, just want short, fixed quanta (effectively around 60ms) for all processes to improve response time on a system with a lot of processes? Setting Win32PrioritySeparation to 0x28 will take care of that.

Here’s a useful Win32PrioritySeparation chart from forums.guru3d.com/showthread.php?p=1451631#post1451631:

2A Hex = Short, Fixed, High foreground boost.
29 Hex = Short, Fixed, Medium foreground boost.
28 Hex = Short, Fixed, No foreground boost.

26 Hex = Short, Variable, High foreground boost.
25 Hex = Short, Variable, Medium foreground boost.
24 Hex = Short, Variable, No foreground boost.

1A Hex = Long, Fixed, High foreground boost.
19 Hex = Long, Fixed, Medium foreground boost.
18 Hex = Long, Fixed, No foreground boost.

16 Hex = Long, Variable, High foreground boost.
15 Hex = Long, Variable, Medium foreground boost.
14 Hex = Long, Variable, No foreground boost.
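The chart above looks less like magic once you split each value into three 2-bit fields. Here is a small sketch that decodes a value and reproduces every row of the chart. The field layout (bits 5-4 for length, 3-2 for fixed/variable, 1-0 for boost) is my reading of the chart and the Microsoft pages linked above, and treating any other field value as "Default" or "Unknown" is an assumption on my part.

```python
# Decode Win32PrioritySeparation into its three 2-bit fields.
# ASSUMED layout, consistent with the chart above:
#   bits 5-4: quantum length, bits 3-2: fixed/variable, bits 1-0: foreground boost.
LENGTH = {0b01: "Long", 0b10: "Short"}
KIND = {0b01: "Variable", 0b10: "Fixed"}
BOOST = {0b00: "No", 0b01: "Medium", 0b10: "High"}


def decode(value):
    """Return a human-readable description of a Win32PrioritySeparation value."""
    length = LENGTH.get((value >> 4) & 0b11, "Default")  # other values: assumed OS default
    kind = KIND.get((value >> 2) & 0b11, "Default")      # other values: assumed OS default
    boost = BOOST.get(value & 0b11, "Unknown")
    return f"{length}, {kind}, {boost} foreground boost"


if __name__ == "__main__":
    for value in (0x2A, 0x29, 0x28, 0x26, 0x25, 0x24,
                  0x1A, 0x19, 0x18, 0x16, 0x15, 0x14):
        print(f"{value:02X} Hex = {decode(value)}.")
```

So 0x28 is binary 10 10 00: short, fixed, no boost.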

Here are some other pages where others have figured out the effective quanta (and remember the numbers are not in ms): blogs.msdn.com/embedded/archive/2006/03/04/543141.aspx (for embedded Windows, I have doubts about the accuracy of his calculations regarding the effective quantum but still interesting), www.microsoft.com/technet/sysinternals/information/windows2000quantums.mspx (for Windows 2000, probably still valid).

Here’s a really nice article on the effects of schedulers and I/O-bound processes on virtualization: regions.cmg.org/regions/mcmg/m102006_files/6187_Mark_Friedman_Virtualization.doc

Linux, on the other hand, has not one but several totally different CPU schedulers and I/O elevators available. Just see this page, comparing 2.6.22 with Vista’s kernel, and note how many non-standard features are available as patches: widefox.pbwiki.com/Scheduler . You can get schedulers with cool names such as genetic, anticipatory, etc. Linux used to suffer on the desktop, but with recent patches interactivity has improved tremendously, and it is now far more viable as a desktop OS. Here’s some cool info on anticipatory schedulers: www.cs.rice.edu/~ssiyer/r/antsched/. The anticipatory scheduler can help systems with slower I/O (laptops and desktops, especially) feel more interactive, and it was the default I/O elevator for a while (CFQ is the current default for I/O, though it can cause issues for desktop users, see ubuntuforums.org/showthread.php?t=456692). A list of all the I/O elevators in the kernel: ebergen.net/wordpress/2006/01/26/io-scheduling/. Whitepapers: www.cs.ccu.edu.tw/%7Elhr89/linux-kernel/Linux%20IO%20Schedulers.pdf, www.linuxinsight.com/files/ols2004/pratt-reprint.pdf, www.linuxinsight.com/files/ols2005/seelam-reprint.pdf .
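If you want to see which I/O elevators your own kernel offers, 2.6 kernels expose them per block device under sysfs, with the active one in square brackets. A minimal sketch follows; the device name "sda" is just an assumption, substitute your own.

```python
def parse_elevators(text):
    """Parse a line such as 'noop anticipatory deadline [cfq]'.

    Returns (available, active): the list of elevator names and the one
    shown in square brackets (None if nothing is bracketed).
    """
    available, active = [], None
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            token = token[1:-1]
            active = token
        available.append(token)
    return available, active


def read_elevators(device="sda"):
    """Read the elevator list for one block device (device name is an assumption)."""
    with open(f"/sys/block/{device}/queue/scheduler") as handle:
        return parse_elevators(handle.read())


if __name__ == "__main__":
    print(parse_elevators("noop anticipatory deadline [cfq]"))
```

Writing one of the listed names back into the same file (as root) switches the elevator for that device on the fly.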

Recently, Linux moved to the Completely Fair Scheduler model (www.osnews.com/story.php/18240/Linux-Switches-to-CFS-Scheduler-in-2.6.23), sparking a lot of controversy (www.osnews.com/story.php/18350/Linus-On-CFS-vs.-SD) since it’s not quite done yet (kerneltrap.org/node/14055). More info on CFS: immike.net/blog/2007/08/01/what-is-the-completely-fair-scheduler/.

Interesting benchmarks showing the effects of scheduling on Linux performance: developer.osdl.org/craiger/hackbench/, math.nmu.edu/~randy/Research/Speaches/Disk%20Scheduling%20In%20Linux.ppt.

For anyone wishing to test the various Linux schedulers’ impact on interactivity, Con Kolivas has something: members.optusnet.com.au/ckolivas/interbench/. Con’s Staircase/Deadline (SD) scheduler (lwn.net/Articles/224865/) didn’t make it to the mainline kernel, unfortunately, and a miffed Con announced he’s dropping out of kernel development. Pity, since I think he single-handedly contributed more to the advancement of Linux interactivity on the desktop than anyone else. It’s great to have the choice of schedulers depending on how you’re planning to use your system - it’s already done with the I/O elevator, let it be done with the CPU scheduler. Instead, Linus invoked his Papal-like powers and made what I consider to be an unsound decision.

The real issue with Linux though is the userland. Here’s a great paper showing issues with the userland and how it robs us of speed: ols2006.108.redhat.com/reprints/jones-reprint.pdf . A lot of the CPU and I/O scheduler design is workarounds for those issues. Unless one deliberately chooses a stripped-down Linux distribution, the amount of bloat in the current code is incredible.

Finally, Solaris 10 also comes with a bunch of different schedulers, which you can assign globally or on a per-process/project basis. Tons more info: www.princeton.edu/~unix/Solaris/troubleshoot/schedule.html, blogs.sun.com/andrei/date/20050131, wiki.its.queensu.ca/display/JES/Solaris+10+Containers+and+Fair+Share+Scheduling, docs.sun.com/app/docs/doc/816-0222/6m6nmlsug?l=en&a=view.

Heady reading, no?

D

Mon
30
Jul '07

Just how much is your antivirus harming your I/O?

I just got a new corporate laptop, a nice, shiny T60 (OK, it’s IBM black and therefore thoroughly incapable of reflecting on any part of the spectrum).

I noticed that doing disk-intensive work was much slower than I’ve been used to. I configured it as a server (see previous posts) and that helped a bit, but not as much as I’d like.

It seems the antivirus software is checking each and every file, and takes 100% of a CPU to do so. Were this not a dual-core box it would be begging for mercy.

Taking an entire CPU is unacceptable IMO. So I ran some benchmarks - the trusty postmark once more to the rescue:

 

After tweaking as a server, antivirus running, 100% CPU utilization while bench running:

Time:
344 seconds total
230 seconds of transactions (86 per second)

Files:
20092 created (58 per second)
Creation alone: 10000 files (95 per second)
Mixed with transactions: 10092 files (43 per second)
9935 read (43 per second)
10064 appended (43 per second)
20092 deleted (58 per second)
Deletion alone: 10184 files (1131 per second)
Mixed with transactions: 9908 files (43 per second)

Data:
548.25 megabytes read (1.59 megabytes per second)
1158.00 megabytes written (3.37 megabytes per second)

 

With a more efficient antivirus program instead, variable CPU utilization (from 10%-100%):

Time:
276 seconds total
174 seconds of transactions (114 per second)

Files:
20092 created (72 per second)
Creation alone: 10000 files (123 per second)
Mixed with transactions: 10092 files (58 per second)
9935 read (57 per second)
10064 appended (57 per second)
20092 deleted (72 per second)
Deletion alone: 10184 files (484 per second)
Mixed with transactions: 9908 files (56 per second)

Data:
548.25 megabytes read (1.99 megabytes per second)
1158.00 megabytes written (4.20 megabytes per second)

 

Disabling the antivirus makes it way faster for transactions:

Time:
174 seconds total
91 seconds of transactions (219 per second)

Files:
20092 created (115 per second)
Creation alone: 10000 files (222 per second)
Mixed with transactions: 10092 files (110 per second)
9935 read (109 per second)
10064 appended (110 per second)
20092 deleted (115 per second)
Deletion alone: 10184 files (268 per second)
Mixed with transactions: 9908 files (108 per second)

Data:
548.25 megabytes read (3.15 megabytes per second)
1158.00 megabytes written (6.66 megabytes per second)

Caching with UpTempo for a nice 40-45% boost in performance:

Time:
121 seconds total
65 seconds of transactions (307 per second)

Files:
20092 created (166 per second)
Creation alone: 10000 files (277 per second)
Mixed with transactions: 10092 files (155 per second)
9935 read (152 per second)
10064 appended (154 per second)
20092 deleted (166 per second)
Deletion alone: 10184 files (509 per second)
Mixed with transactions: 9908 files (152 per second)

Data:
548.25 megabytes read (4.53 megabytes per second)
1158.00 megabytes written (9.57 megabytes per second)

Not tweaking the laptop as a server resulted in > 400s runtimes in the default config (sometimes 500s). FYI, the drive is a smaller, 5400 RPM jobbie, not the 200GB 7200 RPM SATA I have my eye on.

One could extrapolate these results. On a bigger box the absolute numbers will differ, but the relative differences should remain similar.
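For those who don’t want to do the division in their heads, here is a quick sketch that turns the four total runtimes above into relative speedups. The labels are mine; the seconds are copied straight from the postmark output.

```python
# Total runtimes in seconds, copied from the four postmark runs above.
RUNS = [
    ("antivirus on, pegging a CPU", 344),
    ("more efficient antivirus", 276),
    ("antivirus disabled", 174),
    ("UpTempo caching", 121),
]


def speedups(runs):
    """Return (label, seconds, speedup vs first run, speedup vs previous run)."""
    results = []
    baseline = runs[0][1]
    previous = baseline
    for label, seconds in runs:
        results.append((label, seconds, baseline / seconds, previous / seconds))
        previous = seconds
    return results


if __name__ == "__main__":
    for label, seconds, vs_first, vs_prev in speedups(RUNS):
        print(f"{label:<30} {seconds:>4}s  "
              f"{vs_first:.2f}x vs first, {vs_prev:.2f}x vs previous")
```

Going by total runtime, the jump from the first config to the last is a bit under 3x.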

Obviously, antivirus is sorely needed in this day and age, but if you’re planning on doing heavy I/O be careful what antivirus program you pick and how it’s configured. Depending on the server, I’d gladly trade some protection in exchange for a bunch more performance. Or you can go Unix/Linux and not really have to bother.

I’d say setting up an antivirus program to scan only extensions that can be infected, and only on creates/modifies rather than reads, can boost performance significantly.

Interestingly, caching didn’t help much with antivirus enabled - most of the bottleneck was the antivirus since everything had to go through it first. What if this was a database/email/fileserver with heavy activity?

D

Sat
2
Jun '07

IBRIX at EMC World

I’ve known about IBRIX for a while, but it was refreshing to talk to a decent techie that knew the product. They have improved it a lot over the past year.

For the uninitiated, IBRIX can be used in any of the following ways:

  1. A network-based filesystem using the IBRIX client and protocol
  2. Also accessible using NFS or CIFS
  3. SAN-based parallel filesystem

The product’s claim to fame is its scalability and performance (realized by adding extra nodes “hot”). Their most famous client is probably Pixar, who replaced a ton of NetApp boxes with an IBRIX cluster and realized huge performance benefits and vastly reduced costs. I’ve always liked cool filesystem technologies and this definitely falls under the realm of “cool”. Some highlights based on notes I took on my Blackberry during the session and questions I asked:

  • No limits on filesystem size (they have deployed single namespace filesystems several PB in size).
  • 300MB/s read, 200MB/s write per node on a small box. Bigger boxes can do 1.2GB/s per node; of course, your storage needs to be able to keep up.
  • No limit on the number of nodes.
  • Automatic rebalancing of data over time. When you add new disk you rebalance to keep things humming.
  • Dedicated ibrix backup node, works with 3rd party backup SW, can have many backup servers for backup speed.
  • Has snaps now (global), this was a failing of the product before since it was lacking snapshots.
  • No real limit on the number of files per FS.
  • Biggest file size they have tested on production is an 8TB file, no software limit.
  • Nodes use FC to access storage, clients use Ethernet.
  • Client on Windows or Linux, otherwise general NFS and CIFS. Client is fastest.
  • Your prod servers can be the ibrix nodes, but it’s very compute-intensive. They recommend the client (IP-based, bonded), or get an 8-core box.
  • There is no single lock manager - this is the coolest thing. There is global metadata and global locking, all nodes participate equally.
  • How are node failures handled? All nodes interchangeable. All see same storage. Storage allocated to remaining servers if you lose a node. Can lose all but 1 server.
  • Back-end storage size per node? Unlimited.
  • Multipathing per node? Powerpath works. Can do bonded GigE up to 8 ports per.
  • How are files allocated? The file inode contains the info concerning which node it needs to go to. Round-robin allocation or preferred servers per file type. Also if server over 50% full then it’s skipped.
  • All volumes accessible by all nodes.
  • Can stripe huge files across many nodes.
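The allocation rule from my notes (round-robin, skipping any server over 50% full) is easy to picture in code. This is only my toy illustration of that one rule, with made-up node names and a hypothetical class; it is not IBRIX’s actual implementation, and it ignores the "preferred servers per file type" option.

```python
from itertools import cycle


class RoundRobinAllocator:
    """Toy round-robin placement that skips nodes above a fullness threshold."""

    def __init__(self, usage, skip_above=0.5):
        # usage: node name -> fraction of capacity used (0.0 to 1.0), hypothetical data
        self.usage = dict(usage)
        self.skip_above = skip_above
        self._order = cycle(list(self.usage))
        self._count = len(self.usage)

    def next_node(self):
        """Return the next eligible node, trying each node at most once."""
        for _ in range(self._count):
            node = next(self._order)
            if self.usage[node] <= self.skip_above:
                return node
        raise RuntimeError("every node is over the fullness threshold")


if __name__ == "__main__":
    allocator = RoundRobinAllocator({"node1": 0.2, "node2": 0.7, "node3": 0.4})
    print([allocator.next_node() for _ in range(4)])
```

In this example node2 is 70% full, so new files alternate between node1 and node3 until a rebalance frees it up.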

I’m stoked! I can think of so many uses for this product:

  1. Data mining
  2. Digital media
  3. Oil and gas
  4. Backups

D

Tue
22
May '07

Netbackup best practices for ridiculously busy environments (but not exclusively).

While waiting for another EMC World session to start (this one is at “Guru” level, let’s see) I thought I might share some of my experience regarding running Netbackup on very large setups - nothing like learning through pain.

Don’t get me wrong - NBU has its marketshare for a reason. However, I want to make sure I dispel everyone’s deluded romantic notions about NBU being the be-all, end-all backup tool. It can work well, but only if you truly know its idiosyncrasies.

I can’t say I was tending the busiest NBU systems but, at one point, just one of my environments was doing about 15,000 backup jobs a day. Which is way too much - we fixed that pronto…

I won’t go too deep into each point. If anyone cares then post a comment and I will expand on it.

If you have a small shop running NBU on a single server, much of this is not for you - but there may still be a nugget or two in there… However, if you don’t at least use barcodes, I will go after you. Use tar or Windows backup, or even a rusty abacus, go to your corner and be quiet.

 

  1. Have a dedicated master server - if there are many jobs, the last thing you want is your master also being busy doing backups and vaults. It’s the half-witted brains of the operation, don’t stress it.
  2. Go way beyond the tuning recommendations in the manual - if you know what you’re doing. For instance, I have some voodoo tunings for Solaris (up to 9) that make a huge difference. Prepare for comments from Veritas (Symantec, whatever) support… “no sir it’s not like in the book sir, we can’t guarantee it will work sir…” whatever, I’ve gotten such ridiculously bad advice from their support I still cringe (and sometimes pee a little) every time I get a flashback, not to mention the endless dreams and the screaming that wake me up at night.
  3. Separate HBA ports for disk and tape. No exceptions. I don’t care what vendors say.
  4. Separate TAN (Tape Area Network), if you can swing it.
  5. Separate backup LAN. And/or Ethernet port bonding/trunking/teaming (whatever nomenclature appears in your systems). 4 gig ports per media server. 10G if you have the dough. 4 10G ports teamed and I will do the Wayne’s World “we’re not worthy” bit in front of you. Offer ends Dec 2007.
  6. Experiment with TOE cards, such as the Alacritech ones. You will get closer to full gig, though they’re expensive. Bonding is way cheaper and effective if you have many clients.
  7. Try to use port bonding that works at the switch level, too - 802.3ad is the standard, Cisco’s Etherchannel is Cisco’s. The software on the server and the setting on the switch have to jibe. Half-assed intermediate approaches are just that.
  8. Don’t use weak switches at the core. I’m tired of seeing people with Cisco 4506 switches (6509 wannabe) and 8:1 oversubscribed 48-port cards. YOU WILL HAVE PROBLEMS!!!! Do your homework, find out whether or not the switch is oversubscribed, find out the total backplane throughput, figure out the blade throughput, don’t plug everything in the same port octet if you’re going to be oversubscribed - i.e. a 4-port team going to the octet that shares 1Gbit in a 4506 will not give you 4Gbits, it will give you, at best, a thoroughly blocked 150Mbits per port, tops, with problems. Did you know that if one of the 8 ports starts out before the rest and continues pumping, the rest will NOT make the first port reduce its speed but will instead trickle along at 10Mbits sometimes? Even after the initial transfer that was fast is finished and there’s nothing else going on? As Rutger Hauer said in Blade Runner, “I have… seen things you people wouldn’t believe”. Figure THAT one out when you’re having throughput problems.
  9. Use jumbo frames if you can. Bigger is better in this case. Do your homework, there are caveats.
  10. Use the right block size for your tape devices. Windows users, beware. Patches are necessary. SP1 broke block sizes over 64K on 2003 Server.
  11. Don’t go nuts with SSO! Among the myriad things Veritas doesn’t tell you unless you know the right people is that at around 250 instances of devices you will have weird device problems (25 tape drives shared among 10 media servers would make 250 instances). The safe number is closer to 150. Ignore this at your peril. If you use VTL just make more virtual drives.
  12. Use snapshots as much as possible.
  13. If you have more than a couple of media servers, consider a VTL.
  14. If you have DBAs that insist on flushing the redo logs to tape every few seconds, get a heavy-gauge jumpstart cable and a power supply that can put out, say, 20KV, a coat hanger, and wearing nothing but a stained leather apron go to work on them until they regain their senses (or not). Good times.
  15. If the DBAs can’t be persuaded even after their various body parts have been charred by high voltage, try to send the smaller backups to disk. Do NOT send frequent backups to tape. If a job is going to take less than 10min send it to disk.
  16. As a corollary to #15, only use tape for large jobs that will actually stream your tape drives.
  17. Know what your boxes can push. Most servers, even very large ones, will be hard-pressed to push 2 LTO3 drives, let alone LTO4. FYI, I’ve gotten LTO3 to go as fast as 130MB/s, sustained. Do the math. Beat the score! I cheated, BTW.
  18. Know what expansion slots to use - not all are equal, even if they look the same.
  19. Don’t push too much backup traffic over switch ISLs. Preferably don’t push any.
  20. Be super-careful with command-line manipulation of the NBU DB. Perfectly legitimate commands will not function as you might think due to silly heuristics (or lack thereof). Stay tuned, there will be a large post outing NBU in the future. The amount of dirt I have is beyond staggering. Maybe I shouldn’t have said that, I might have to look out for contract killers or Veritas people offering payola, not sure which is preferable. I’m 5 feet tall, with a goatee, skinny and blond, by the way. You can’t miss me. I also have a pronounced limp.
  21. Beware of multiplexing. Too much and restores take forever. Too little and you can’t stream your devices. Disk is your friend. Anything beyond 4-way multiplexing on tape is not.
  22. Do not send tapes offsite only once a week. You are asking for pervy uncle Murphy to pay you a visit, and he is a known repeat sex offender. He won’t discriminate, either.
  23. If you use tapes, have 2 copies of everything.
  24. Replicate to remote sites if at all possible. Tape should be a last resort.
  25. Use VMWare if at all possible. Along with #12 and #24, this helps quick recovery.
  26. Do at least 2-3 different backups of the NBU catalog. In really busy systems it’s impossible to do it after each session - there’s just no quiet time. Just have a copy on disk and 2 on tape (you can do the ones on tape inline, will create 2 at the same time, it works), then send the ones on tape to 2 different offsite locations. Have NBU email you the tape(s) barcodes it used for the catalog if you’re doing a non-standard catalog backup. Send an extra email to an externally available address. You’re not paranoid if they’re really out to get you!
  27. Can you even read from disk as fast as you can write to your backup medium? Benchmark.
  28. What’s your current network throughput if you max out all the media servers? Benchmark.
  29. Don’t use your production systems as media servers. You are inviting uncle Murphy again and he’s feeling randy.
  30. Use storage unit groups. Why on earth would you not?
  31. Cluster the master.
  32. Do NOT put media traffic through firewalls, it’s too much. ACLs on switches can work just fine.
  33. Do NOT put a dedicated media server for a subset of your boxes that are secured from the main network. If they lose access to that media server, backups fail. At any rate you’ll have to allow a few ports for the master to communicate with the media server, might as well let media server traffic through. If it seems that #32 and #33 are somewhat self-contradictory, give yourself a cigar.
  34. Simplify your life. Elaborate and numerous policies are more ways to invite uncle Murphy.
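A few of the rules above boil down to simple arithmetic, so here is a rough planning sketch for #11 (SSO device instances), #15 (short jobs go to disk) and #17 (can you even feed the drives). The thresholds come from the list; the function names and the 125MB/s figure (the theoretical ceiling of a single gigabit link) are my own, and real-world gigabit throughput will be lower.

```python
import math

SSO_TROUBLE_INSTANCES = 250   # from #11: weird device problems start around here
SSO_SAFE_INSTANCES = 150      # from #11: the safer number
DISK_CUTOFF_MINUTES = 10      # from #15: shorter jobs go to disk
GIGE_CEILING_MB_S = 125.0     # theoretical maximum of one 1Gbit/s link


def sso_instances(shared_drives, media_servers):
    """Each shared drive counts once per media server that can see it."""
    return shared_drives * media_servers


def sso_risk(shared_drives, media_servers):
    """Classify an SSO layout against the thresholds in #11."""
    instances = sso_instances(shared_drives, media_servers)
    if instances >= SSO_TROUBLE_INSTANCES:
        return "trouble"
    if instances > SSO_SAFE_INSTANCES:
        return "risky"
    return "ok"


def destination(expected_minutes):
    """Route a job to disk or tape based on its expected duration (#15, #16)."""
    return "disk" if expected_minutes < DISK_CUTOFF_MINUTES else "tape"


def gige_links_needed(drives, mb_per_s_per_drive):
    """Gigabit links needed, at theoretical best, to keep the drives streaming."""
    return math.ceil(drives * mb_per_s_per_drive / GIGE_CEILING_MB_S)


if __name__ == "__main__":
    print(sso_instances(25, 10), sso_risk(25, 10))
    print(destination(3), destination(90))
    print(gige_links_needed(2, 130))
```

Two LTO3 drives at the 130MB/s I mentioned need three perfectly utilized gigabit links just for the network side, which is why the bonding and 10G advice matters.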

 

That’s all I have for now. Is there more? Tons, but I need to pee.

D

Fri
4
May '07

Another windows tuning I forgot to mention

I use my laptop so much that I sometimes forget about some server-type tunings.

I resuscitated my hot-rod AMD box - it’s a grossly overclocked monster but only has 1GB RAM (since it’s hard to find that kind of fast RAM in bigger sizes, and using 4 sticks prohibits me from overclocking it so much). Let’s just say the CPU is running a full GHz faster than stock, and with air, not water or peltier coolers.

Anyway, since it only has 1GB RAM and I use it for Photoshop and games, I can’t really use something like Supercache or Uptempo on it.

So I tried O&O Software’s Clevercache. By far not as good as the other 2 products - however, it does a decent job of automatically managing cache so you always have enough free RAM.

Then I tried the DisablePagingExecutive registry tweak - not that obscure, tons of references around.

BTW, there is a way to stop postmark from using caching - set buffering false is the command. However, I want to see the benchmark run on a system that would run normally, not measure the raw speed of my disks. Nobody cares about that anyway, especially in the big leagues (unless the config is truly moronic, of course). Cache is everything. But I digress.

So - postmark once more.

Stock:

Time:
177 seconds total
144 seconds of transactions (138 per second)

Files:
20092 created (113 per second)
Creation alone: 10000 files (333 per second)
Mixed with transactions: 10092 files (70 per second)
9935 read (68 per second)
10064 appended (69 per second)
20092 deleted (113 per second)
Deletion alone: 10184 files (3394 per second)
Mixed with transactions: 9908 files (68 per second)

Data:
548.25 megabytes read (3.10 megabytes per second)
1158.00 megabytes written (6.54 megabytes per second)

After tuning as a server with the background services setting, large cache and fsutil as described previously:

Time:
107 seconds total
85 seconds of transactions (235 per second)

Files:
20092 created (187 per second)
Creation alone: 10000 files (526 per second)
Mixed with transactions: 10092 files (118 per second)
9935 read (116 per second)
10064 appended (118 per second)
20092 deleted (187 per second)
Deletion alone: 10184 files (3394 per second)
Mixed with transactions: 9908 files (116 per second)

Data:
548.25 megabytes read (5.12 megabytes per second)
1158.00 megabytes written (10.82 megabytes per second)

With Clevercache:

Time:
97 seconds total
71 seconds of transactions (281 per second)

Files:
20092 created (207 per second)
Creation alone: 10000 files (454 per second)
Mixed with transactions: 10092 files (142 per second)
9935 read (139 per second)
10064 appended (141 per second)
20092 deleted (207 per second)
Deletion alone: 10184 files (2546 per second)
Mixed with transactions: 9908 files (139 per second)

Data:
548.25 megabytes read (5.65 megabytes per second)
1158.00 megabytes written (11.94 megabytes per second)

Hell, I guess I might get Clevercache for this system - sped it up a bit and manages memory consumption.

But look at this:

All the above plus using the DisablePagingExecutive registry tweak: BOOYA!

Time:
45 seconds total
28 seconds of transactions (714 per second)

Files:
20092 created (446 per second)
Creation alone: 10000 files (1111 per second)
Mixed with transactions: 10092 files (360 per second)
9935 read (354 per second)
10064 appended (359 per second)
20092 deleted (446 per second)
Deletion alone: 10184 files (1273 per second)
Mixed with transactions: 9908 files (353 per second)

Data:
548.25 megabytes read (12.18 megabytes per second)
1158.00 megabytes written (25.73 megabytes per second)

I guess the box is staying this way.
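A quick sanity check on the numbers above, as a sketch. The run figures are copied from the output; the 20,000 transaction count is NOT stated anywhere in the post. I am inferring it from the file counts (10092 creates plus 9908 deletes during the transaction phase), so treat it as an assumption. The check also assumes the reported rate is rounded down.

```python
# (label, total seconds, seconds of transactions, reported transactions/second)
RUNS = [
    ("stock", 177, 144, 138),
    ("tuned as server", 107, 85, 235),
    ("plus Clevercache", 97, 71, 281),
    ("plus DisablePagingExecutive", 45, 28, 714),
]

# ASSUMPTION: inferred from the create/delete counts, not stated in the post.
ASSUMED_TRANSACTIONS = 20000


def expected_rate(transactions, seconds):
    """Transactions per second, rounded down (assumed reporting behaviour)."""
    return transactions // seconds


def speedup_vs_first(runs):
    """Speedup of each run relative to the first, by total runtime."""
    baseline = runs[0][1]
    return [(label, baseline / total) for label, total, _, _ in runs]


if __name__ == "__main__":
    for label, total, tx_seconds, reported in RUNS:
        match = expected_rate(ASSUMED_TRANSACTIONS, tx_seconds) == reported
        print(f"{label:<28} reported {reported}/s, consistent: {match}")
    for label, factor in speedup_vs_first(RUNS):
        print(f"{label:<28} {factor:.2f}x vs stock")
```

All four reported rates line up with that assumption, and by total runtime the final config is nearly 4x faster than stock.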

More info on the registry tweak:

http://technet2.microsoft.com/windowsserver/en/library/3d3b3c16-c901-46de-8485-166a819af3ad1033.mspx?mfr=true

In a nutshell, it disables the paging of kernel and driver code, so it’s always memory-resident. Makes sense in some cases, as you can see above :)

It’s so unusual that it gave me that much of a boost, though. I’d tried it a long time ago and it wasn’t quite as dramatic, but that was on a much older system.

One could argue that postmark lied, but using a stopwatch and just eyeballing the sucker, it was way quicker doing the transactions.

On servers I just didn’t normally set it because I figured they had enough RAM. Maybe I should start doing it on boxes that do a lot of transactional I/O. Damn, I need to try this with Supercache.

Obviously, your mileage may vary.

WARNING: DO NOT DO THIS ON ANY MACHINE THAT NEEDS TO SUSPEND!!!

Which is why I just didn’t do it on the laptop.

D

Wed
2
May '07

Cisco WAAS benchmarks, and WAN optimizers in general

Lately I’ve been dealing with WAN accelerators a lot, with the emphasis on Cisco’s WAAS (some other, smaller players are Riverbed, Juniper, Bluecoat, Tacit/Packeteer and Silverpeak). The premise is simple and compelling: Instead of having all those servers at your edge locations, move your users’ data to the core and make accessing the data feel almost as fast as having it locally, by deploying appliances that act as proxies. At the same time, you will actually decrease the WAN utilization, enabling you to use cheaper pipes, or at least not have to upgrade, where in the past you were planning to anyway.

There are significant other benefits (massive MAPI acceleration, HTTP, ftp, and indeed any TCP-based application will be optimized). Many Microsoft protocols are especially chatty, and the WAN accelerators pretty much remove the chattiness, optimize the TCP connection (automatically resizing Send/Receive windows based on latency, for instance), LZ-compress the data, and to top it all will not transfer data blocks that have already been transferred.

At this point I need to point out that there is a lot of similarity with deduplication technologies - for example, Cisco’s DRE (Data Redundancy Elimination) is, at heart, a dedup algorithm not unlike Avamar’s or Data Domain’s. So, if a Powerpoint file has gone through the DRE cache already, and someone modifies the file and sends it over the WAN again, only the modified parts will really go through. It really works and it’s really fast (and I’m about the most jaded technophile you’re likely to meet).

The reason I’m not opposed to this use of dedup (see previous posts) is that the datasets are kept at a reasonable size. For instance, at the edge you’re typically talking about under 200GB of cache, not several TB. Doing the hash calculations is not as time-consuming with a smaller dataset and, indeed, it’s set up so that the hashes are kept in-memory. You see, the whole point of this appliance is to reduce latency, not increase it with unnecessary calculations. Compare this to the multi-TB deals of the “proper” dedup solutions used for backups…
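To make the idea concrete, here is a minimal sketch of hash-based block dedup with an in-memory index. To be clear, this is NOT Cisco’s DRE algorithm (which I haven’t seen the internals of); the fixed 4KB chunk size, SHA-256, and the class name are all assumptions chosen to keep the example short.

```python
import hashlib

CHUNK_SIZE = 4096  # hypothetical fixed chunk size, for illustration only


class DedupCache:
    """Toy sender-side dedup: known chunks travel as a hash reference only."""

    def __init__(self):
        self.chunks = {}  # digest -> chunk bytes, kept entirely in memory

    def send(self, data):
        """Return the number of bytes that would actually cross the WAN."""
        wire_bytes = 0
        for offset in range(0, len(data), CHUNK_SIZE):
            chunk = data[offset:offset + CHUNK_SIZE]
            digest = hashlib.sha256(chunk).digest()
            if digest in self.chunks:
                wire_bytes += len(digest)               # reference only
            else:
                self.chunks[digest] = chunk
                wire_bytes += len(digest) + len(chunk)  # reference plus payload
        return wire_bytes


if __name__ == "__main__":
    cache = DedupCache()
    original = b"".join(bytes([i]) * CHUNK_SIZE for i in range(4))
    modified = original[:CHUNK_SIZE] + b"\xff" * CHUNK_SIZE + original[2 * CHUNK_SIZE:]
    print("first transfer :", cache.send(original), "bytes")
    print("after an edit  :", cache.send(modified), "bytes")
```

The second transfer only pays for the one changed chunk plus a handful of hashes. Fixed-size chunks are the simplification here: insert a single byte near the start of a file and every chunk boundary shifts, which is why real products are smarter about where they cut.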

Indeed, why the hell would you need dedup-based backup solutions if you deploy a WAN accelerator? Chances are there won’t be anything at the edge sites to back up, so the whole argument behind dedup-based backups for remote sites sort of evaporates. Dedup now only makes sense in VTLs, just so you can store a bit more.

On Dedup VTLs: Refreshingly, Quantum doesn’t quote crazy compression ratios - I’ve seen figures of about 9:1 as an average, which is still pretty good (and totally dependent on what kind of data you have). I just cringe when I see the 100:1, 1000:1 or whatever insanity Data Domain typically states. I’m still worried about the effect on restore times, but I digress. See previous posts.

Anyway, back to WAN accelerators. So how do these boxes work? All fairly similarly. Cisco’s, for instance, does 3 main kinds of optimizations: TFO, DRE and LZ. TFO means Transport Flow Optimization, and takes care of snd/rcv window scaling, enables large initial windows, enables SACK and BIC TCP (the latter 2 help with packet loss).
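Why window scaling matters so much on a WAN: a single TCP connection can have at most one window of data in flight per round trip. A rough sketch of the arithmetic follows; the 80ms RTT and 45Mbit/s link are example figures I picked, not anything from a Cisco document.

```python
UNSCALED_WINDOW_BYTES = 65535  # largest TCP window without window scaling


def max_throughput_mbps(window_bytes, rtt_ms):
    """Upper bound for one TCP connection: one full window per round trip."""
    return window_bytes * 8 / (rtt_ms / 1000.0) / 1e6


def window_needed_bytes(link_mbps, rtt_ms):
    """Bandwidth-delay product: bytes in flight needed to fill the link."""
    return link_mbps * 1e6 / 8 * (rtt_ms / 1000.0)


if __name__ == "__main__":
    rtt_ms = 80      # example round-trip time, an assumption
    link_mbps = 45   # example link speed, an assumption
    print(f"64KB window at {rtt_ms}ms: "
          f"{max_throughput_mbps(UNSCALED_WINDOW_BYTES, rtt_ms):.2f} Mbit/s max")
    print(f"window to fill {link_mbps}Mbit/s at {rtt_ms}ms: "
          f"{window_needed_bytes(link_mbps, rtt_ms):,.0f} bytes")
```

In other words, without scaling, one connection over that example link could never use more than a fraction of the pipe, no matter how fat the pipe is.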

DRE is the dedup part of the equation, as mentioned before.

LZ is simply LZ compression of data, in addition to everything else mentioned above.
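
As a rough illustration of what LZ-style compression buys on repetitive data, here is a sketch using Python's zlib module (DEFLATE, which is LZ77 plus Huffman coding). This is a stand-in and not the compressor WAAS actually uses; the payload is made up.

```python
import zlib

# Hypothetical payload: repetitive text, which compresses very well.
payload = b"Quarterly report: revenue up, costs down. " * 200

compressed = zlib.compress(payload, 6)   # 6 is zlib's usual default level
restored = zlib.decompress(compressed)

print(len(payload), "bytes before,", len(compressed), "bytes after")
```

Real traffic is far less repetitive than this, so the ratio here is a best case and not something to expect in production.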

Other vendors may call their features something else, but at the end there aren’t too many ways to do this. It all boils down to:

  1. Who has the best implementation speed-wise

  2. Who is the best administration-wise

  3. Who is the most stable in an enterprise setting

  4. What company has the highest chance of staying alive (like it or not, Cisco destroys the other players here)

  5. What company is committed to the product the most

  6. As a corollary to #5, what company does the most R&D for the product

Since Cisco is, by far, the largest of the companies that provide WAN accelerators (indeed, they probably spend more on light bulbs per year than the net worth of the other companies combined), in my opinion they’re the obvious force to be reckoned with, not someone like Riverbed. As cool as Riverbed is, they’re too small, and will either fizzle out or get bought - though Cisco didn’t buy them, which is food for thought. If Riverbed is so great, why wouldn’t Cisco simply acquire them?

Case in point: When Cisco bought Actona (which is the progenitor of the current WAAS product) they only really had the Windows file-caching part shipping (WAFS). It was great for CIFS but not much else. Back then, they were actually lagging compared to the other players when it came to complete application acceleration. Fast forward a mere few months: They now accelerate anything going over TCP, their WAFS portion is still there but it’s even better and more transparent, the product works with WCCP and inline cards (making deployment at the low-end easy) and is now significantly faster than the competitors. Helps to have deep pockets.

For an enterprise, here are the main benefits of going with Cisco the way I see them:

  1. Your switches and routers are probably already Cisco so you have a relationship.

  2. WAAS interfaces seamlessly with the other Cisco gear.

  3. The best way to interface a WAN accelerator is WCCP. And it was actually developed by Cisco.

  4. The Cisco appliances are tunnel-less and totally transparent (I met someone that had Riverbed everywhere - a software glitch rendered ALL WAN traffic inoperable, instead of having it go through unaccelerated which is the way it is supposed to work. He’s now looking at Cisco).

  5. WAAS appliances don’t mess with QoS you may have already set.

  6. The WAAS boxes are actually faster in almost anything compared to the competition.

And now for the inevitable benchmarks:

Depending on the latency, you can get more or less of a speed-up. For a comprehensive test see this: http://www.cisco.com/application/pdf/en/us/guest/products/ps6870/c1031/cdccont_0900aecd8054f827.pdf

Another, longer rev: http://www.cisco.com/web/CA/channels/pdf/Miercom-on-Cisco-WAAS-Riverbed-Juniper-competitive.pdf

Yes, this is on Cisco’s website but it’s kinda hard to find any performance statistics on the other players’ sites showing Cisco’s WAAS (any references to WAFS are for an obsolete product). At least this one compares truly recent codebases of Cisco, Riverbed and Juniper. For me, the most telling numbers were the ones showing how much traffic the server at the datacenter actually sees. Cisco was almost 100x better than the competition - where the other products passed several Mbits through to the server, Cisco only needed to pass 50Kbits or so.

It is kinda weird that the other vendors don’t have any public-facing benchmarks like this, don’t you think?

However, since I tend to not completely believe vendor-sponsored benchmark numbers as much as I may like the vendor in question, I ran my own.

I used NISTnet (a free WAN simulator, http://www-x.antd.nist.gov/nistnet/) to emulate latency and throughput indicative of standard telco links (i.e. a T1). The fact that the simulator is freely available and can be used by anyone is compelling since it allows testing without disrupting production networks (for the record, I also tested on a few production networks with similar results, though the latency was lower than with the simulator).

The first test scenario is that of the typical T1 connection (approx. 1.5Mbits/s or 170KB/s at best) and 40ms of round-trip delay. I tested with zero packet loss, which is not totally realistic but it makes the benchmarks even more compelling. Usually there is a little packet loss, which makes transfer speeds even worse. This is one of the most common connections to remote sites one will encounter in production environments.

The second scenario is that of a bigger pipe (3Mbit) but much higher latency (300ms), emulating a long-distance link such as a remote site in Asia over which developers do their work. I injected a 0.2% packet loss (a small number, given the distance).
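
A quick way to see why the second link is the harder case is the bandwidth-delay product: how many bytes must be in flight to keep the pipe full. The sketch below assumes the 300ms figure is a round-trip time like the 40ms one, and rounds the T1 to 1.5Mbit/s; 65535 bytes is the largest TCP window possible without the window scale option.

```python
MAX_UNSCALED_WINDOW = 65535  # largest TCP window without window scaling


def bdp_bytes(bandwidth_bps, rtt_s):
    """Bytes that must be in flight to fill a link of this speed and delay."""
    return bandwidth_bps * rtt_s / 8


scenarios = {
    "T1, 40ms RTT": (1.5e6, 0.040),
    "3Mbit, 300ms RTT": (3.0e6, 0.300),
}

for name, (bandwidth, rtt) in scenarios.items():
    bdp = bdp_bytes(bandwidth, rtt)
    if bdp > MAX_UNSCALED_WINDOW:
        verdict = "needs window scaling"
    else:
        verdict = "fits in an unscaled window"
    print(f"{name}: about {bdp:.0f} bytes in flight ({verdict})")
```

The T1 needs only a few KB in flight, while the long-haul link needs more than an unscaled TCP window can hold, which is where the window scaling part of TFO comes in.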

It is important to note that, in the interests of simplicity and expediency, these tests are not comprehensive. A comprehensive WAAS test consists of:

  • Performance without WAAS but with latency

  • Performance with WAAS but data not already in cache (cold cache hits). Such a test shows the real-time efficiency of the TFO, DRE and LZ algorithms.

  • Performance with the data already in the cache (hot cache hits).

  • Performance with pre-positioning of fileserver data. This would be the fastest a WAAS solution would perform, almost like a local fileserver.

  • Performance without WAAS and without latency (local server). This would be the absolute fastest performance in general.

The one cold cache test I performed involved downloading a large ISO file (400MB) using HTTP over the simulated T1 link. The performance ranged from 1.5-1.8MB/s (a full 10 times faster than without WAAS) for a cold cache hit. After the file was transferred (and was therefore in cache) the performance went to 2.5MB/s. The amazing performance might have been due to a highly compressible ISO image but, nevertheless, is quite impressive. The ISO was a full Windows 2000 install CD with SP4 slipstreamed - a realistic test with realistic data, since one might conceivably want to distribute such CD images over a WAN. Frankly this went through so quickly that I keep thinking I did something wrong.

T1 results
ftp without WAAS:
ftp: 3367936 bytes received in 19.53Seconds 168.40Kbytes/sec

Very normal T1 behavior with the simulator (for a good-quality T1).

ftp with WAAS:
ftp: 3367936 bytes received in 1.34Seconds 2505.90Kbytes/sec (15x improvement).

Sending data was even faster:
ftp: 3367936 bytes sent in 0.36Seconds 9381.44Kbytes/sec.

waasT1

 

High Latency/High Bandwidth results

The high latency (300ms) link, even though it had double the theoretical throughput of the T1 link, suffers significantly:

ftp without WAAS
ftp: 3367936 bytes received in 125.73Seconds 26.79Kbytes/sec.

I was surprised at how much the high latency hurt the ftp transfers. I ran the test several times with similar results.
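
One plausible explanation (an assumption on my part, I did not capture the actual window sizes) is that the transfer was window-limited: TCP can send at most one window per round trip, so throughput tops out at window / RTT no matter how fat the pipe is. Working backwards from the observed rate gives the window that would produce it. The sketch treats 300ms as the round-trip time and ftp's "Kbytes" as 1000 bytes.

```python
def implied_window_bytes(throughput_bytes_per_s, rtt_s):
    """Window size that would cap a transfer at the observed throughput."""
    return throughput_bytes_per_s * rtt_s


def window_limited_rate(window_bytes, rtt_s):
    """Best-case throughput when only one window can be sent per round trip."""
    return window_bytes / rtt_s


rtt = 0.300                      # assumed round-trip time in seconds
observed = 26.79 * 1000          # bytes/sec, from the ftp run above

print("implied window:", round(implied_window_bytes(observed, rtt)), "bytes")
print("ceiling with a 65535-byte window:", round(window_limited_rate(65535, rtt)), "bytes/sec")
```

The implied window comes out at roughly 8KB, and even a full unscaled 64KB window could not fill a 3Mbit link at this latency. The 0.2% packet loss makes matters worse on top of that.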

ftp with WAAS
ftp: 3367936 bytes received in 2.16Seconds 1562.12Kbytes/sec. (58x improvement).
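
For anyone who wants to recheck the improvement factors, here is a small sketch that pulls the rate out of the ftp summary lines and divides. The regular expression and function name are mine; the input lines are the ones quoted in this post.

```python
import re

# Matches the tail of a Windows ftp summary line, e.g. "in 19.53Seconds 168.40Kbytes/sec"
FTP_RATE = re.compile(r"in ([\d.]+)Seconds ([\d.]+)Kbytes/sec")


def kbytes_per_sec(line):
    """Extract the reported transfer rate from an ftp summary line."""
    match = FTP_RATE.search(line)
    if match is None:
        raise ValueError(f"not an ftp summary line: {line!r}")
    return float(match.group(2))


results = {
    "T1": (
        "ftp: 3367936 bytes received in 19.53Seconds 168.40Kbytes/sec",
        "ftp: 3367936 bytes received in 1.34Seconds 2505.90Kbytes/sec",
    ),
    "high latency": (
        "ftp: 3367936 bytes received in 125.73Seconds 26.79Kbytes/sec",
        "ftp: 3367936 bytes received in 2.16Seconds 1562.12Kbytes/sec",
    ),
}

speedups = {
    name: kbytes_per_sec(after) / kbytes_per_sec(before)
    for name, (before, after) in results.items()
}

for name, factor in speedups.items():
    print(f"{name}: {factor:.1f}x")
```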

waaslat

 

I have more results with office-type apps but they will make for too big of a blog entry, not that this isn’t big. In any case, the thing works as advertised. I need to build a test Exchange server so I can see how much stuff like attachments are accelerated. Watch this space. Oh, and there’s another set of results at http://www.gotitsolutions.org/2007/05/18/cisco-waas-performance-benchmarks.html

Comments? Complaints? You know what to do.

D

Mon
19
Feb '07

Some clarification on the caching

Re the previous post:

If you want to use supercache or uptempo the idea is that you take AWAY from windows/SQL/exchange cache and add to the fancy cache.

So, even on Windows Server, under “File and Printer Sharing for Microsoft Networks” (found, bizarrely enough, in the properties for your network card), you could select “maximize throughput for network applications”. In the various apps you’d just minimize the cache (i.e. only 10MB for Exchange) and give the rest to supercache/uptempo.

Be aware that supercache is on a PER VOLUME basis, not global (its blessing and its curse at the same time). If you have a lot of volumes maybe just cache a few key data volumes, tempdb and the pagefile partition, or use uptempo, which allows you to allocate a single global cache pool that is then shared among the volumes you choose.

For SQL, using a RAM disk for tempdb seems to work even better.

Having seen the products work wonders with only 128MB dedicated to them, and bearing in mind that most servers have 4GB RAM or more, I’d say go nuts. I’d buy 4GB of RAM and make it cache in a heartbeat.

D



On windows filesystem tuning and funky cache mechanisms

Edited: I just realized I must have used different postmark settings for vista and XP. Do NOT use the following numbers to compare Vista to XP performance.

I won’t go into a diatribe on how to tune Windows - there are excellent guides on Microsoft’s and IBM’s sites, among others.

But I wanted to share some goodness based on some recent findings of mine.

First, the part that most probably know (works on XP and 2003):

From a command window do

fsutil behavior set disablelastaccess 1

This will disable last-access time recording, which IMO is useless unless you really do care when a file was accessed, and is harmless only if there isn’t much going on with your disk (or you’re on some fancy EMC box with tons of cache). If you have busy disks, disabling it typically helps a bit.
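
If you want to see the timestamp in question, the sketch below reads a file's last-access and last-modified times through Python's os.stat. It only displays the values; whether the access time actually moves when a file is read depends on the OS and on settings such as the one above. The temporary file is just a stand-in.

```python
import os
import tempfile
import time


def timestamps(path):
    """Return the last-access and last-modified times of a file (epoch seconds)."""
    info = os.stat(path)
    return {"accessed": info.st_atime, "modified": info.st_mtime}


# Create a throwaway file so the example is self-contained.
with tempfile.NamedTemporaryFile(delete=False) as handle:
    handle.write(b"hello")
    path = handle.name

stamps = timestamps(path)
for label, value in stamps.items():
    print(label, time.ctime(value))

os.remove(path)
```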

On 2003, you can also increase the size of the lookaside buffer if you have many concurrent file operations:

fsutil behavior set memoryusage 2

This also works on Vista but not XP, sadly. See more here: http://technet2.microsoft.com/WindowsServer/en/library/9fcf44c8-68f4-4204-b403-0282273bc7b31033.mspx?mfr=true

Now, for the interesting part. I use a laptop that’s pretty decent (100GB 7200RPM drive, 2GB RAM). I hammer my disk since I use the laptop for vmware and other duties (music software with thousands of files, for instance).

I like postmark and iozone for measuring performance. Here’s how I configure postmark:

set number 10000

set transactions 20000

set subdirectories 5

set size 500 100000

set read 4096

set write 4096

run

This will create 10,000 files, then perform 20,000 transactions on them. The files will range from 500 bytes to 100KB in size. This is brutal on CPU, cache and disk. If you want different-sized files you just specify the min and max sizes, just be careful with the number (if you leave it at 10,000 and tell it to make 100GB files, better make sure you have the space).
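
A back-of-the-envelope check of the disk space needed before hitting run, as a sketch. It assumes postmark picks file sizes roughly uniformly between the two bounds (my assumption, check the postmark docs), and it only covers the initial file pool; the transaction phase creates and appends more on top.

```python
def expected_pool_bytes(number, min_size, max_size):
    """Expected size of the initial file pool, assuming uniformly distributed sizes."""
    return number * (min_size + max_size) / 2


def worst_case_pool_bytes(number, max_size):
    """Upper bound: every file comes out at the maximum size."""
    return number * max_size


# The settings used above: 10,000 files between 500 bytes and 100,000 bytes.
expected = expected_pool_bytes(10000, 500, 100000)
worst = worst_case_pool_bytes(10000, 100000)
print(f"expected: about {expected / 2**20:.0f} MiB, worst case: about {worst / 2**20:.0f} MiB")

# The cautionary example: 10,000 files of up to 100GB each.
print(f"100GB files: up to {worst_case_pool_bytes(10000, 100 * 10**9) / 10**15:.0f} PB")
```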

Anyway, here are some results:

Vista untweaked (10000 files and transactions, 512 byte I/O):

Time:
181 seconds total
51 seconds of transactions (196 per second)

Files:
15047 created (83 per second)
Creation alone: 10000 files (121 per second)
Mixed with transactions: 5047 files (98 per second)
4945 read (96 per second)
5055 appended (99 per second)
15047 deleted (83 per second)
Deletion alone: 10094 files (210 per second)
Mixed with transactions: 4953 files (97 per second)

Data:
257.93 megabytes read (1.43 megabytes per second)
826.79 megabytes written (4.57 megabytes per second)

Vista tweaked with fsutil as described above:

Time:
159 seconds total
51 seconds of transactions (196 per second)

Files:
15047 created (94 per second)
Creation alone: 10000 files (158 per second)
Mixed with transactions: 5047 files (98 per second)
4945 read (96 per second)
5055 appended (99 per second)
15047 deleted (94 per second)
Deletion alone: 10094 files (224 per second)
Mixed with transactions: 4953 files (97 per second)

Data:
257.93 megabytes read (1.62 megabytes per second)
826.79 megabytes written (5.20 megabytes per second)

So it’s a bit better.

Another thing you can do is set the processor quanta to fixed 120ms chunks (right-click “My Computer”, then Properties, Advanced, Performance, Settings, Advanced, and set processor scheduling to “Background services”). Yes, I’ve had by far the best luck with XP by tuning it like a server. Your mileage may vary, but this also improves postmark results a bit.

You can also play with increasing the cache: in that advanced pane again, select “system cache” and, with regedit, go to HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\lanmanserver\parameters\size and make it a 3. This is all if you have XP. In 2003 it comes just like that, unless you want to run SQL, IIS or Exchange, in which case there’s a setting, “maximize throughput for network applications”. This limits cache to 512MB, and lets the apps cache on their own.

OR, you can actually spend some money and ridiculously increase performance by getting a caching product like Superspeed’s Supercache or Datacore’s Uptempo (I tried O&O Clevercache as well and was thoroughly underwhelmed).

Here are results with 20,000 transactions and 4K I/O, XP tuned just like a server:

Time:
386 seconds total
308 seconds of transactions (64 per second)

Files:
20092 created (52 per second)
Creation alone: 10000 files (142 per second)
Mixed with transactions: 10092 files (32 per second)
9935 read (32 per second)
10064 appended (32 per second)
20092 deleted (52 per second)
Deletion alone: 10184 files (1273 per second)
Mixed with transactions: 9908 files (32 per second)

Data:
548.25 megabytes read (1.42 megabytes per second)
1158.00 megabytes written (3.00 megabytes per second)

And here are results with the exact same settings but with 256MB of Supercache on that volume, lazy writes on:

Time:
196 seconds total
163 seconds of transactions (122 per second)

Files:
20092 created (102 per second)
Creation alone: 10000 files (344 per second)
Mixed with transactions: 10092 files (61 per second)
9935 read (60 per second)
10064 appended (61 per second)
20092 deleted (102 per second)
Deletion alone: 10184 files (2546 per second)
Mixed with transactions: 9908 files (60 per second)

Data:
548.25 megabytes read (2.80 megabytes per second)
1158.00 megabytes written (5.91 megabytes per second)

I am a believer. The size of the dataset far exceeded the capacity of supercache, but it helped tremendously regardless.

Since I don’t believe all benchmarks, I also ran iozone.

File size    Record size:
                 4096      8192     16384
64                  -         -         -
128                 -         -         -
256                 -         -         -
512                 -         -         -
1024                -         -         -
2048                -         -         -
4096            70011         -         -
8192            29264     50257         -
16384           26229     33289     37198
32768           27578     28827     34778
65536           26982     27890     28997
131072          20901     21680     22223
262144          21769     20789     22249
524288          23076     25270     26258

The top row shows record size, the left column file size. The above is without the cache. Now with cache:

File size    Record size:
                 4096      8192     16384
64                  -         -         -
128                 -         -         -
256                 -         -         -
512                 -         -         -
1024                -         -         -
2048                -         -         -
4096           279746         -         -
8192           264110    262117         -
16384          250322    249355    238230
32768          233373    238932    233980
65536          204786    232418    234544
131072         234552    230336    225731
262144         164434    227792    222540
524288          35515     31533     41262

These results are for writes, in both cases. Iozone’s output is too large to include here but I’ll gladly send the entire file to anyone that wants it. I would ignore record sizes under 4K since windows will coalesce writes to 4K and up anyway (up to 64K).
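
To put a number on the difference, here is a small sketch that divides the cached figures by the uncached ones, cell by cell. The values are copied from the two tables above; the dictionary layout is just for illustration, with each list holding the 4096, 8192 and 16384 record-size columns where present.

```python
# Write results keyed by file size; list entries follow the record-size columns.
without_cache = {
    4096: [70011],
    8192: [29264, 50257],
    16384: [26229, 33289, 37198],
    32768: [27578, 28827, 34778],
    65536: [26982, 27890, 28997],
    131072: [20901, 21680, 22223],
    262144: [21769, 20789, 22249],
    524288: [23076, 25270, 26258],
}
with_cache = {
    4096: [279746],
    8192: [264110, 262117],
    16384: [250322, 249355, 238230],
    32768: [233373, 238932, 233980],
    65536: [204786, 232418, 234544],
    131072: [234552, 230336, 225731],
    262144: [164434, 227792, 222540],
    524288: [35515, 31533, 41262],
}

# Ratio of cached to uncached throughput for every cell present in both tables.
speedup = {
    size: [cached / plain for cached, plain in zip(with_cache[size], without_cache[size])]
    for size in without_cache
}

for size, ratios in speedup.items():
    print(size, " ".join(f"{ratio:.1f}x" for ratio in ratios))
```

The ratios stay high until the largest file size and then drop sharply, though they remain above 1.
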
It seems that these products are worth a serious look. In most cases, significant benefits will be realized by caching the volume that holds the swapfile, even if only using 128MB. In one case I went from 124 seconds for a postmark run to 70s by caching the swap volume, even though I had ample memory and Windows shouldn’t have been using swap.

Unix is generally a bit more robust for caching and virtual memory, so you don’t need extra products. Looks like Windows needs a bit of help. Indeed, Microsoft uses Supercache on the servers that host MSN, I found out…

Anyway, you can see that for file sizes up to 256MB, supercache kicks windows’ cache ass. Now remember, this is a box tuned just like a server; it was using about 1GB of cache even without supercache. Once you exceed the size of the cache with the large 512MB test file, you still realize some benefits, as you can see.

Datacore’s uptempo produced similar results, is far less tunable, uses a unified cache (instead of a chunk per partition), is easier to configure and can be more or less expensive - Supercache for 4 CPUs is like $1K, but half that for 2 CPUs. UpTempo is about $700 regardless. Another difference is that UpTempo is 32-bit only at the moment.

D