Recovery Monkey: Musings on backups, tuning and more


Sun 14 Jun '09

New ext4 vs XFS benchmarks using Fedora 11 Leonidas

What a difference a kernel rev and/or distribution makes. If you recall from a previous post, I was unable to complete postmark testing on Ubuntu 9.04 using ext4, and had to recommend against ext4. Now, with the release of Fedora 11 "Leonidas", a new kernel seems to make a big difference in the performance and stability of ext4.

Some other observations before I show any numbers:

  • This is NOT the same computer as was used in the previous test, so don’t use these numbers to compare between Ubuntu and Fedora. It’s a desktop with a 64-bit Athlon and 1GB RAM. I know, I know… I didn’t have access to the other box. Look at Phoronix.com for a comparison of the two.
  • The 2.6.29 kernel seems to have a much better implementation of the CFQ I/O elevator: with XFS I only noticed a slight decrease in performance using deadline, instead of the increase I usually get (ext3 and ext4 have always been tuned for CFQ).
  • In this version, using my usual (and sometimes unsafe and daring) mount switches didn’t seem to make a huge difference on XFS and none on ext4 or even ext3. Fedora 11 is really a distribution that the developers want you to be able to use without much fussing.
  • On all tests, I created XFS with mkfs.xfs -f -l lazy-count=1 -l size=128m /dev/…  - this enables the 2 main (and safe) tunings that I believe everyone should follow with XFS. Kinda hard to do while installing a distribution: the Fedora 11 installer wasn’t happy about it. Ubuntu is more forgiving, since it lets you boot into the LiveCD and manually create partitions before you let the installer do its thing. Convenient for single-root-partition installs…
  • "XFS tuned" means mounted with noatime,logbsize=256k,nobarrier (nobarrier is unsafe unless you’re on a UPS).
  • "ext3 tuned" means barrier=0,noatime,data=writeback. Used to make a big difference…
  • The same disk area was used for all tests
  • Scribefire on Firefox sucks compared to Mac- or Windows-based offline blog editors. There are some KDE-based ones but I didn’t want to download 100s of MB of KDE support infrastructure to run a 600K blog program…   

Postmark numbers:
  

Filesystem               Read MB/s   Write MB/s   IOPS
XFS defaults             4.9         10.34        215
XFS tuned                6.23        13.16        263
XFS noatime,logbsize     6.38        13.47        263
ext4 noatime             9.62        20.32        416
ext3 noatime             5.71        12.06        238
ext3 “tuned”             5.32        11.24        219
ext3 writeback,noatime   4.73        9.98         192

Bonnie++ numbers:

Filesystem             IOPS    Block writes KB/s   Rewrite KB/s
XFS defaults           328.4   116600              52066
XFS tuned              328.6   119981              51639
XFS noatime,logbsize   333     119781              50519
ext4 noatime           335.1   117285              48797
ext3 noatime           294.6   100771              43033

Verdict

  • Ext4 shows great promise!
  • For sheer MB/s on large files, XFS is still better by a small margin
  • If you want to be doing operations on many small files, ext4 is great
  • The reworked CFQ scheduler rocks

D

Sat 13 Jun '09

About the Data Domain acquisition – and is EMC really the best place for Data Domain?

Much has already been written about this imminent acquisition of Data Domain by either NetApp or EMC and, since opinions are like you-know-what, and I have one, here it is… if I ramble, forgive me. I have too much to say and I’m trying to be PC… I wrote and subsequently erased all kinds of stuff that could probably get me in trouble (the more you work with a company the more dirt you uncover, and I have several earth movers’ worth).

I do think that both companies waited too long to try and acquire Data Domain – frankly, it’s staggering to me that other companies that make decent products like CommVault haven’t been acquired yet (I mean, seriously, if EMC want to compete in the backup software space they should just drop Networker and buy CommVault). Consolidation is the trend…

Maybe both NetApp and EMC thought their in-house deduplication would work out for everything, maybe they thought Data Domain wouldn’t become a contender. Maybe they thought it was just a phase. Either way, the backup market is still strong, most people don’t want to move en masse to something like Avamar, not everyone needs VTL, and Data Domain does provide a very convenient way to keep using your existing backup product, make next to no changes, and get better efficiencies.

The simple truth is that EMC needed SOMETHING to combat Data Domain so they signed the agreement with Quantum and rushed the product to market. And then tried to strong-arm the resellers into forgetting about Data Domain and instead selling the new and amazing DL3D (that backfired BTW).

As far as EMC is concerned, the attempt to acquire Data Domain is a slap in the face for Quantum and all the customers that have been pitched/sold DL3D (the OEM’ed Quantum DXi product). EMC has spent quite a bit of time belittling Data Domain and instead pushing a product that has seen very limited testing (I know, I’ve been burned personally by it several times). A good example: EMC recently released a patch to allow backups done with EMC’s Networker to actually be deduplicated (talk about a reason to return a product if there ever was one – like a car that can’t go faster than 10 mph or that gets 2 mpg instead of 20 mpg). You see, there was an issue with the filter that figures out what backup app you’re using, and Networker backups were getting only plain old compression, NO deduplication. This is no secret, if anyone bothers to read the release notes of the recent patches they’ll see this info. Maybe if you’re a DL3D customer you should insist on reading the release notes if they’re not easily available? After all, you have a right to know what’s changing!

Think about this: EMC’s own backup product was not tested with DL3D. Yet EMC happily sold DL3D to customers with Networker. To me, this is a sales-driven company, not a customer-driven company.

Not to mention other crippling bugs, slow startup times (especially in the case of unclean shutdowns) and the abysmal performance which simply stems from how the product is designed – it’s spindle-happy and needs about 2 trays of drives to work well. Oh, and don’t EVER fill it beyond 80% capacity. You’re also not supposed to use it as a normal CIFS/NFS share for archiving anything like email or normal files (arguably a great place for dedup).

So, EMC knew about the DL3D issues (well, some of them, it’s not their product after all, indeed I helped them identify some of the bugs) and played coy with customers. Then, they saw NetApp making a move for Data Domain and realized that by buying Data Domain EMC could accomplish several things:

  • Minimize NetApp’s cash reserves if NetApp does in the end succeed in acquiring Data Domain (but is that necessarily a bad thing for NetApp?)
  • Remove the flailing DL3D and replace it with a product that actually works and is selling very well
  • Get a bunch of solid deduplication and consistency checking algorithms
  • Assimilate a competitor that’s been a huge thorn in EMC’s side in that space
  • Reduce the efficiency of NetApp as a competitor

But think from the customer standpoint for a minute (most of the analysts so far seem to miss the most important player here – and that’s certainly not EMC, NetApp or Data Domain, but the customer). You’ve been pitched DL3D, and now you must forget about that and all the bad things you were told about Data Domain – it’s all good now that it belongs to EMC, you’ll be taken care of. Or you can buy the DL3D if you still want it (and I don’t see EMC derailing ANY existing DL3D campaign, no matter what).

If I were a DL3D prospect/customer, I’d be worried no matter what.

Let’s talk about the best place for Data Domain to end up. As far as investors go of course, if they want to make a quick buck and run, the EMC cash offer is tantalizing. But for Data Domain employees, EMC can be a black hole and the added complexity and bureaucracy anything but fun. EMC has become almost too diversified – let’s look at just some of EMC’s storage solutions (I won’t mention the software since then it’d be a REALLY long and weird post):

  • Symmetrix
  • Clariion
  • Celerra
  • Centera
  • Atmos
  • EDL
  • DL3D
  • RecoverPoint
  • Avamar (that’s both a software solution and an appliance)

What’s interesting is that, by and large, the teams in charge of the above products don’t talk much, if at all, with each other. Talk about islands! And, when it comes to sales, EMC has internally competing groups of people that sell the above products – for instance, “NAS overlay” guys only get paid on Celerra sales, and I’ve seen them screw up campaigns that were clearly a pure Clariion play just so they could somehow get some Celerra in so they get paid. The basic EMC sales guy you meet can sell them all and indeed doesn’t care, but the people he relies on for support cannot sell them all and do care about what gets sold. It’s all very fragmented and, again, not a model that operates with the customer’s best interests always in mind. It always baffled me why EMC would allow so much fluff in their sales organization.

So, if Data Domain got absorbed, they’d probably not be enjoying all the “melting pot” advantages the EMC corporate bloggers seem so keen on advertising, and the “large startup” feel (maybe it’s like that in MA for a few chosen people – in most other locations it’s decidedly not like that). They’d just be another acquired unit, internally competing with other units, dealing with large-company politics and other inefficiencies. The EMC stock wouldn’t really become much higher than it is now, if at all. It’s been about the same for quite some time now.

Let’s examine the scenario of NetApp buying Data Domain:

  • NetApp is much more focused than EMC – indeed they have literally less than a handful of major offerings that don’t really compete with each other
  • The NetApp sales force is unified and doesn’t internally compete about what to sell
  • NetApp culture is much closer to Data Domain culture
  • It’s not good for innovation to have one company hoarding 3 dedup technologies; NetApp + Data Domain will actually push EMC more and be better for the customers
  • Data Domain could make NetApp much stronger against EMC, in turn driving NetApp’s stock price up significantly. Which, in turn, would give investors back much more than $2bn, thereby making this the better deal.

The only drawback I see (as do most writing about this) is NetApp’s relatively poor history in managing the few acquisitions they’ve made. But I believe that as long as they leave Data Domain alone and slowly try to integrate the technology in the other products it will all work out.

Hopefully all this made some sense…

D

Mon 18 May '09

Linux filesystem benchmark extravaganza - including Deadline vs CFQ schedulers and ext4 instability

I have some spare time these days so I figured I’d finally test as many filesystems on Linux as I could…

The new ext4 is an option with modern kernels so I loaded Ubuntu 9.04 and tried postmark and bonnie++ on the same partition using various filesystems and switching between the CFQ and Deadline schedulers.

Switching schedulers permanently can be achieved by changing the boot options and appending, say, elevator=deadline, but you can also switch them on the fly by running the following:

echo deadline > /sys/block/sda/queue/scheduler

You can check what’s currently selected by simply typing

cat /sys/block/sda/queue/scheduler

You’ll get back something like:

noop anticipatory [deadline] cfq

The scheduler in brackets is the currently selected one.
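
If you want to check this from a script rather than by eye, a minimal sketch could look like the following. It assumes the sysfs file has the same bracketed format shown above; the function names and the default device name are my own choices for illustration.

```python
import re
from pathlib import Path


def active_scheduler(line: str) -> str:
    """Return the scheduler shown in [brackets] in a sysfs scheduler line."""
    match = re.search(r"\[([^\]]+)\]", line)
    if match is None:
        raise ValueError(f"no active scheduler marked in: {line!r}")
    return match.group(1)


def scheduler_for(device: str = "sda") -> str:
    """Read /sys/block/<device>/queue/scheduler and return the active entry."""
    path = Path("/sys/block") / device / "queue" / "scheduler"
    return active_scheduler(path.read_text())


if __name__ == "__main__":
    print(active_scheduler("noop anticipatory [deadline] cfq"))  # deadline
```

Reading the file needs no special privileges; only writing a new scheduler to it requires root.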

Reader beware: Running postmark on ext4 locked up the system repeatedly during the transaction phase of the benchmark, using either my own compiled version or the one from the repository, so obviously there is some issue there and I cannot at this time recommend ext4. No other filesystem caused lockups. I did run bonnie++ as well since that didn’t crash with ext4.

The objective of this exercise wasn’t to show which filesystem is fastest, but rather to illustrate that, depending on what you want to do, you may want to re-examine the choice of filesystem and scheduler with your application if you’re running Linux. BTW the current recommendation for databases and fast intelligent external arrays – and Ubuntu’s default in the server edition – is the Deadline scheduler, and not CFQ. However, all other distributions at the moment use CFQ!

So, without further ado, some benchmarks… (I’m not including the entire postmark output since it would be far too large, I just kept the most important metrics, anyone that wants the entire results is more than welcome to send me an email and I’ll hook you up).

Postmark MB/s:

Filesystem        Read MB/s   Write MB/s   IOPS
Reiser CFQ        4.85        10.25        227
Reiser Deadline   5.38        11.35        246
XFS CFQ           2.33        4.93         109
XFS Deadline      2.35        4.97         105
XFS Tuned         2.73        5.76         120
JFS CFQ           1.75        3.69         78
JFS Deadline      1.73        3.65         76
Ext3 CFQ          2.71        5.73         115
Ext3 Deadline     2.86        6.03         122

 

[Graph: Postmark MB/s]

Postmark IOPS:

[Graph: Postmark IOPS]

Bonnie++ write speed:

Filesystem        IOPS   Block writes KB/s   Rewrite KB/s
Reiser CFQ        428    31657               18199
Reiser Deadline   462    32290               18154
XFS CFQ           471    39901               18557
XFS Deadline      483    39840               19653
XFS Tuned         592    40604               20746
JFS CFQ           433    31651               18528
JFS Deadline      452    39106               18755
Ext3 CFQ          403    31108               17235
Ext3 Deadline     338    31803               17885
Ext4 CFQ          451    39265               18519
Ext4 Deadline     446    39257               18221

[Graph: Bonnie++ write speed]

Bonnie++ IOPS:

[Graph: Bonnie++ IOPS]

Observations:

The Deadline scheduler seems to be consistently better for anything that’s not ext-based! A lot of work has been done on the Linux kernel to optimize it for the ext2-3-4 filesystems, and that shows. However, depending on what you want to do, ext3 may not be the best option (I don’t know yet about ext4 for postmark-type loads but based on the bonnie++ results it’s solid).

Here’s a list of some considerations:

  • Will the filesystem host many many small files or a few large ones? Reiser still rules the “many small files” use case, by far. The rest are fairly close, and JFS seriously lags. For large files, XFS is great.
  • Do you care if the filesystem takes a long time to fsck? Ext3 still takes quite long, whereas something like XFS doesn’t. Ext4 should remedy this.
  • Do you care for something that’s still actively being maintained? In this case only ext3-4 and XFS are the options.
  • Do you want defrag tools? Choose wisely since few filesystems have them (XFS and ext4).

My current overall recommendation is XFS since it’s mature and also very tunable. For reference, here’s how I got the better results for XFS (the results in the graphs for tuned XFS were with the deadline scheduler):

mkfs.xfs -f -d agcount=4 -l lazy-count=1 -l size=64m /dev/sda7

mount -o nobarrier,noatime,nodiratime,logbufs=8 /dev/sda7 /test

Don’t just follow the above blindly. Normally mkfs tries to auto-adjust those (i.e. the agcount), but the important ones to look for are the log size and the mount options, especially nobarrier and logbufs. Remember though that nobarrier is only recommended if you have battery backup.
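
As a rough illustration of that last caveat, here is a sketch of a helper that builds the mount option string and only includes nobarrier when you tell it the storage is battery-backed. The helper and its parameter names are hypothetical; it just prints a command line and does not mount anything.

```python
def xfs_mount_options(battery_backed: bool, logbufs: int = 8) -> str:
    """Build an -o string like the one above, dropping nobarrier when unsafe."""
    options = ["noatime", "nodiratime", f"logbufs={logbufs}"]
    if battery_backed:
        # nobarrier trades safety for speed; only use it with battery backup.
        options.insert(0, "nobarrier")
    return ",".join(options)


if __name__ == "__main__":
    print("mount -o", xfs_mount_options(battery_backed=True), "/dev/sda7 /test")
    print("mount -o", xfs_mount_options(battery_backed=False), "/dev/sda7 /test")
```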

D

Mon 27 Apr '09

So, what’s the best way to back up VMs?

Backing up VMs seems to be one of those topics nobody can agree on, despite a plethora of reading material on the subject… and maybe because of said plethora.

I will focus on VMware since it is the leading and prevalent virtualization method in the marketplace today (I’m sure the KVM, Xen and Hyper-V fanboys will have their 15 minutes of fame someday).

VMware has several ways for backing up VMs:

  1. Install a backup agent in the VM, just as with a normal client
  2. Back up the entire VM by installing a backup agent in the ESX console
  3. Use VCB (VMware Consolidated Backup).

     

They all have their pros and cons so the short answer to the topic is that there’s no best method, instead you’ll get the “it depends” answer. Sorry. Here’s the skinny on each method:

 

1. Install a backup agent in the VM, just as with a normal client

 

Pros:

  • Everyone understands this, since it works just like a real physical client and can do most of the same things
  • Can do incrementals
  • File-level recovery is straightforward with no confusion as to which VM owns which file
  • Advanced backup features such as DB agents work fine

 

Cons:

  • Impact on the host and network
  • Deployment just as difficult as when using the physical clients
  • Can make backup software licensing more expensive than needed
  • Bare-metal-recovery of VMs only a bit less difficult than with physical boxes

 

2. Back up the entire VM by installing a backup agent in the ESX console

 

Pros:

  • Licensing cost for backup software minimized (1 license needed per ESX server)
  • The entire VM is backed up so recovery is like Bare Metal Recovery – you’ll get the entire box back with a very high probability of success
  • Fast since the virtualization layers are bypassed

 

Cons:

  • Still significant impact on the host and network
  • Cannot restore individual files
  • Advanced backup agents won’t work (no hot backups of SQL or Exchange, for instance)
  • Backups always large since a full backup is required every time
  • Backups take long (see previous point)
  • Requires some scripting knowledge to deploy properly.

 

3. Use VCB (VMware Consolidated Backup).

 

Pros:

  • Works with most backup software
  • Almost no impact on the host or network (backups can be entirely SAN-based)
  • Reduced backup software licensing cost
  • Works with VSS in Windows to provide better backup reliability
  • Allows for incremental backups
  • Uses VM snapshots
  • No disk space used for staging of incrementals
  • Very simple DR
  • File-level backups are possible

 

Cons:

  • Cannot back up RDMs in Physical Compatibility Mode
  • Advanced functionality (file-level backups and application integration) won’t work with non-Windows VMs
  • Cannot back up clustered VMs (i.e. MSCS-clustered VMs can’t be backed up)
  • FullVM backup speed is limited to 1GB/min (a limitation of Windows’ cmd.exe; you can get around it by creating multiple threads, I guess, but you could have speed issues if the jobs are large and you cannot break them up)
  • Significant disk space needed for Holding Tank (where FullVM copies are placed)
  • Advanced backup agents will not work
  • File-level backups won’t back up the Windows registry
  • File-level recovery is complex and generally a two-step process

 

The lists could go on but as you can see there are serious wrinkles with all the approaches.

The problem is compounded by the fact that most modern backup software has arcane licensing schemes that depend on whether an agent is on a VM or not (CommVault, for instance), or that allow you unlimited agents per ESX server as long as you buy the more expensive client license for the ESX server (NetBackup), and various permutations thereof.

Another wrinkle is Deduplication. Products that do source-based Deduplication such as EMC’s Avamar can comfortably have their agents inside the VMs or in the service console since subsequent backups take only a fraction of the time and there’s almost no space penalty. So, with Avamar one could be doing both kinds of backup (entire VM and individual files) and be covered both ways and only worrying about time and space when reading Hawking’s books… The negative is cost.

NetBackup offers another interesting twist since their implementation of VCB allows individual files to be recovered from a FullVM backup – the rationale being that you use their PureDisk Deduplication to store everything in order to reduce the expense of backup disk.

In the end, the only recommendation I can give that doesn’t depend too much on your individual circumstances is to try and do both file-level and FullVM-type backup so that you’re covered in multiple ways. Then replicate those backups, etc… you know the drill by now.

D

 

 

Thu 5 Feb '09

The true XIV fail condition finally revealed (?)

I just got this information:

For XIV to be in jeopardy you need to lose 1 drive from one of the host-facing ingest nodes AND 1 drive from the normal data nodes within 30 minutes while writing to the thing.

Have no way of confirming this but it did come from a reliable source.

A customer recently tried pulling random drives and XIV didn’t shut down and was working fine, but they were from the data nodes.

Why can’t anyone post something concrete here? I’m sure IBM won’t post since the confusion serves them well.

For what it’s worth, the customer is really happy with the simplicity of the XIV GUI.

D

Mon 5 Jan '09

So what exactly is IBM trying to do with the XIV?

By now most people dealing with storage know that IBM acquired the XIV technology. What IBM is doing now is trying to push the technology to everyone and their dog, for reasons we’ll get into…

I just hope IBM gets their storage act together since now they’re selling products made by 4-5 different vendors, with zero interoperability between them (maybe SVC is the “one ring to rule them all”?)

In a nutshell, the way XIV works is by using normal servers running Linux and the XIV “sauce” and coupling them together via an Ethernet backbone. A few of the nodes get FC cards and can become FC targets. A few more of the features:

  • Thin provisioning
  • Snaps
  • Synchronous (only) replication
  • Easy to use (there’s not much you can do with it)
  • Uses RAID-X (no global spares; instead there’s spare space on each drive, which makes faster rebuilds possible)
  • Only mirrored
  • A good amount of total cache per system since each server has several GB of RAM BUT the cache is NOT global (each node simply caches the data for its local disks).

IBM claims insane performance numbers with the XIV (“it will destroy DMX/USP!” — sure). But let’s take a look at how everything looks:

  • 180 maximum (or minimum) drives (you can get a half config but I think you always get the 180 drives but license half, I might be mistaken - I believe you have to make a commitment that you’ll buy the whole thing in 1 year)
  • Normal Linux servers do everything
  • Only SATA
  • The backbone is Ethernet, not FC or Infiniband (much, much higher latency is incurred by Ethernet vs the other technologies)

The way IBM claims they can sustain high speed is to not try and make the SATA drives get bound by their low transactional performance vs 15K FC drives or, even worse, SSDs. From what I understand (and IBM employees feel free to chime in) XIV:

  1. Ingests data using a few of the front-end nodes
  2. Tries to break up the datastream into 1MB chunks
  3. The algorithm tries to pseudo-randomly spread the 1MB chunks and mirror them among the nodes (the simple rule being that a 1MB chunk cannot have a mirror on the same server/shelf of drives!)
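
The three steps above can be sketched roughly as follows. To be clear, this is NOT IBM’s actual algorithm, which I haven’t seen; it’s a toy model of "pseudo-random placement where the mirror never lands on the same node", and every name in it is made up for illustration.

```python
import math
import random

CHUNK_BYTES = 1024 * 1024  # 1MB chunks, per the description above


def place_chunks(total_bytes: int, nodes: list[str], seed: int = 0) -> list[tuple[str, str]]:
    """Return a (primary, mirror) node pair for every 1MB chunk.

    Placement is pseudo-random, with the one rule from the text: a chunk and
    its mirror never land on the same node. Node names must be unique.
    """
    if len(nodes) < 2:
        raise ValueError("mirroring needs at least two nodes")
    rng = random.Random(seed)
    chunk_count = math.ceil(total_bytes / CHUNK_BYTES)
    # sample() picks two distinct nodes, so primary != mirror by construction
    return [tuple(rng.sample(nodes, 2)) for _ in range(chunk_count)]


if __name__ == "__main__":
    layout = place_chunks(10 * CHUNK_BYTES, [f"node{i}" for i in range(6)])
    for chunk_no, (primary, mirror) in enumerate(layout):
        print(chunk_no, primary, mirror)
```

Even in this toy version you can see the side effect: every node ends up holding mirrors for chunks on pretty much every other node.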

Obviously, by turning as much of the workload as possible into large block writes to the SATA drives and using the cache to great effect, one should be able to see the 180 SATA drives perform pretty much as fast as possible (ideally, the drives should be seeing streaming instead of random data). However (there’s always that little word…)

  1. There is no magic!
  2. If the incoming random IOPS are coming at too great a rate (OLTP scenarios), any cache can get saturated (the writes HAVE to be flushed to disk, I don’t care what array you have!) and it all boils down to the actual number of disks in the box. The box is said to do 20,000 IOPS if that happens - which I think is optimistic at 111 IOPS/drive! At any rate, 20,000 IOPS is less than what even small boxes from EMC or other vendors can do when they run out of cache. Where’s the performance advantage of XIV?
  3. The “randomization removing algorithm”, if indeed there’s such a thing in the box, will have issues with more than 1-2 servers sending it stuff
  4. See #1!
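
For anyone who wants to check the per-drive figure in point 2, the arithmetic is trivial. The function name is mine; the 20,000 IOPS and 180 drives are the numbers quoted above.

```python
def iops_per_drive(total_iops: float, drive_count: int) -> float:
    """Average IOPS each drive must sustain once cache stops helping."""
    return total_iops / drive_count


if __name__ == "__main__":
    print(round(iops_per_drive(20_000, 180)))  # 111
```

Note this is a simple average. If the 20,000 figure is host-side writes, mirroring means each write lands on two drives, so the real per-drive load would be higher still.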

Like with anything, you can only extract so much efficiency out of a given system before it blows up.

An EMC CX4-960 could be configured with 960 drives. Even assuming that not all are used due to spares etc. you are left with a system with over 5 times the number of physical disks vs an XIV, tons more capacity etc. Even if the “magic” of XIV makes it more efficient, are those XIV SATA drives really 5 times more efficient? (5 times would make it EQUAL to the 960 performance; XIV would have to be well over 5 times more efficient than an EMC box of equivalent size to beat the 960.)
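
The "5 times" figure is just the ratio of drive counts. A quick sketch, under the simplistic assumption that performance scales with the number of spindles (the function name is hypothetical):

```python
def breakeven_efficiency(rival_drives: int, xiv_drives: int = 180) -> float:
    """How many times harder each XIV drive must work to match a rival box,
    assuming (simplistically) that performance scales with drive count."""
    return rival_drives / xiv_drives


if __name__ == "__main__":
    print(f"{breakeven_efficiency(960):.2f}x")  # 5.33x
```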

Let’s put it this way:

If my system was as efficient as IBM claims, and I had IBM’s money, I’d buy all the competitive arrays, even at several times the size of my box, and publicize all kinds of benchmarks showing just how cool my box is vs the competition. You just can’t find that info anywhere, though.

Regarding innovation: Other vendors have had similar chunklet wide striping for years now (HP EVA, 3Par, Compellent if I’m not mistaken, maybe more). 3Par for sure does hot sparing similar to an XIV (they reserve space on each drive). 3Par can also grow way bigger than XIV (over 1,000 drives).

So, if I want a box with thin provisioning, wide striping, sparing like XIV but the ability to choose among different drive types, why not just get a 3Par? What is the compelling value of XIV, short of being able to push 180 SATA drives well? Nobody has been able to answer this.

I’m just trying to understand XIV’s value prop since:

  1. It’s not faster unless you compare it to poorly architected configs
  2. It has less than 50% efficiency at best, so it’s not good for bulk storage
  3. It’s not cheap from what I’ve seen
  4. Burns a ton of power
  5. Cannot scale AT ALL
  6. Cannot tier within the box (NO drive choices besides 1TB SATA)
  7. Cannot replicate asynchronously
  8. Has no application integration
  9. No Quality of Service performance guarantees
  10. No ability to granularly configure it
  11. Is highly immature technology with a small handful of reference customers and a tiny number of installs! (I guess everyone has to start somewhere but do YOU want to be the guinea pig?)

Unless your needs are exactly what XIV provides, why would you ever buy one? Even if your space/performance needs are in the XIV neighborhood there are other far more capable storage systems out there for less money!

IBM is not stupid, or at least I hope not. So, what IBM is doing is pretty much handing out XIVs to whoever will take one. If you get one, think of yourself as a beta tester. I can hardly believe that IBM bought the XIV IP without seeing some kind of roadmap; otherwise the purchase would be kinda stupid! If you are a beta tester, be aware that:

  • XIV cheats with benchmarks that write zeros to the disk or read from not previously-accessed addresses
  • XIV will be super-fast with 1-2 hosts pushing it, so push it realistically with a real number of hosts
  • Try to load up the box since if it’s not full enough you’ll get an extremely skewed view of performance - put even dummy data inside but fill it to 80% and then run benchmarks!
  • Test with your applications, not artificial benchmarks
  • Do not accept the box in your datacenter before you see a quote! In at least 3 cases that I know of, IBM dropped off the box without giving even a ballpark figure. I think that’s insane.

And last, but not least: I keep hearing and reading about the following being true, I’d love IBM engineers to disprove it:

If you remove 2-3 drives from different trays simultaneously from a loaded system then you will suffer a catastrophic failure (logically makes sense looking at how the chunks get allocated but I’d love to know how it works in real life). And before someone tells me that this never happens in real life, it’s personally happened to me at least once (lost 2 drives in rapid succession) and to many other people I know that have any serious real-world experience…
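
Here is why the claim "logically makes sense", as a rough estimate rather than anything IBM has published. I’m assuming 180 drives in modules of 12, 1TB drives full of 1MB chunks (about a million chunks per drive), and mirror partners spread uniformly over all drives outside the chunk’s own module. All of those are my assumptions, as is the function name.

```python
def expected_shared_chunks(chunks_per_drive: int, total_drives: int = 180,
                           drives_per_module: int = 12) -> float:
    """Expected number of chunks whose two copies sit on one specific pair of
    drives in different modules, under uniform pseudo-random mirroring."""
    eligible_partners = total_drives - drives_per_module
    return chunks_per_drive / eligible_partners


if __name__ == "__main__":
    print(round(expected_shared_chunks(1_000_000)))  # 5952
```

Under those assumptions any two drives in different modules would have thousands of chunks in common, so losing both before the rebuild finishes would mean lost data. How fast the rebuild closes that window is exactly the part I’d love IBM to explain.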

D


On current smartphones

The time has come for me to get a new phone (my current one can’t keep up with the demands and the speed or lack thereof ends up frustrating me).

So I’ve been looking at the plethora of devices out there - Berries, Windows Mobile, iPhone, etc (disclaimer: I’ve been a Blackberry user for many years now).

For me, the ideal mobile device needs to:

  • Synchronize seamlessly all my Exchange stuff
  • Be able to display PDF and office docs (not necessarily edit them)
  • Be a great phone (reception, clarity)
  • Have tethering ability
  • Be fast when I multitask on it
  • Offer GPS (almost all current ones do)
  • Have a decent supply of third-party apps
  • Be able to last me for a whole day (NOT just a business day) of pretty heavy usage
  • Have not so much an intuitive OS but an OS that doesn’t get in the way
  • Let me input text very, very fast (I’m writing this on my phone now)
  • Be tough (Mil-Spec would be great)
  • The ability to play music/videos is not essential but is nice to have (all do it now)
  • Camera nice-to-have but it doesn’t need to be amazing
  • Should be able to have a decent web browser
  • Shouldn’t be ridiculously huge…

So here’s the Executive Summary if your needs coincide with mine:

- Get a Blackberry Bold

For the nitty-gritty:

We HAVE to mention the iPhone: it’s a marvel of social engineering, industrial design and amazing marketing/branding. Of course the battery life utterly sucks if you try to use it the way I’d need to but, most importantly: I cannot type on the sucker! I don’t have abnormally large sausage fingers, indeed I believe my digits are downright elegant, yet I just cannot type fast or accurately on the iPhone (this paragraph might have taken me 10 mins to write on it). So we strike that one out.

Then you have the new Blackberry Storm, also touch-screen. On this one, the entire screen is a gigantic button that you need to press in order for it to register. I found that this approach seems to make it way more accurate for me than the iPhone. The battery life and build are also great. Too bad that the hardware can’t keep up, it feels decidedly slow, more so than the iPhone. Scratch that one too.

Then we have the narrow-form-factor Berries. Can’t type on them quickly. Out they all go.

Next are the various and sundry Windows Mobile devices. Almost too much choice here, huge third-party support, some great hardware from a few vendors. But I find that the OS really gets in my way and all of them also feel amazingly slow. Battery life is no great shakes, either.

Nokia has some good ones, the E71 is my favorite, but they don’t sync that elegantly with Exchange plus the keyboard is weird. Great build, though. If you like its keyboard go for it. OS can take some getting used to…

What remains is the Blackberry Bold. Sure, size-wise it feels like holding a slipper against your head (fortunately size was never a very important criterion) but it passes almost all the other tests! It also lets you send/receive emails while on the phone, feels fast, and has an amazing keyboard. Probably because it’s slipper-sized…

Well-made but it’s so nice that it needs a decent case to get ruggedized so you keep it looking nice, in which case it’ll look more like a size 13 boot against my face and I won’t be able to see just how nice it is anyway.

Am I alone in believing that many people would gladly pay a premium for a sleek, ruggedized device that doesn’t look like a Casio G-Shock? I’d be totally OK with the silly and easily disfigured plastic chromed bits being replaced by Kevlar or rubber, a scratch-proof screen, the ability to withstand immersion for 30′, successful drop-testing from 1 story to concrete, flexible circuit boards (not the ultra-thin ribbon type, you can get boards that are almost rubbery), Mil-Spec connectors, port covers…

It’s all possible, it just adds to the cost. But I guarantee most professionals will pay $100-200 more for the ruggedized model that doesn’t need clunky cases. Ericsson, Siemens and Nokia all had standard phones (never made it to the US) that would fit the above description with the exception of the scratch-proof screen (the Ericsson one was pretty amazing - they suggested you wash it to get rid of the dirt - albeit pretty large), but they slowly stopped making them. They weren’t even much more expensive than the plain models!

The old, thick Blackberries used to be pretty tough, I dropped mine onto concrete many, many times (drop-kicked it once) and the only damage was that the vibrating thingy inside stopped working 100% of the time, a no-no among Berry addicts. It did look scratched but it wasn’t painted on so the scratches weren’t that visible.

I hear the iPhone can be tough, at least the original one. The 3G - not so sure. A colleague had his stop working after he dropped it 3 feet. It landed on its back (should be an easy knock to absorb), you can’t even see a scratch on it. Unacceptable, IMO.

D

Sat
27
Dec '08

So, how frequently do you really test DR?

It’s after 4AM, I can’t sleep since I’m in pain after a car accident and I’ve had altogether too much caffeine. I’ve already watched 3 movies. BTW, “I am Legend” - WTF! Never have I seen a decent book butchered so much! The ideas in the book were so much stronger. Seriously, go get the book and forget the movie. Sorry, Will.

Now I’m writing from The Throne Chamber once more (blessed be the Colon Drano caffeine). I’m all cramped up and can’t get up, so I thought why not post something… can’t promise it will make sense since my brain ain’t the clearest at the moment…

So - when was the last time you tested DR? Really?

If I had a penny for every time I heard the line “we back up our servers to tape but we don’t test DR, but we’re confident we’ll be up and running within 36 hours in the event of a disaster” I’d be paying Trump more money than he ever made just so he can shine my shoes, and he’d be thankful.

Let me make something clear: You need to test DR a minimum of twice a year, preferably once a quarter. Anything less and you’re just setting yourself up for failure.

Start by testing the most important machines. You probably won’t even have to artificially inject extra problems to solve (Pervy Uncle Murphy usually is right there beside you to take care of that). Marvel at how long it really takes.

If things go real peachy, did you hit your RPO and RTO? If yes, test with more machines, until you can test with the full complement of boxes your company truly needs to be up and running and making money. Document it all.

If you didn’t hit your RPO/RTO, how much did you miss them by? If it’s by a ridiculous amount, maybe the way you’re going about DR will simply not work - try replication and/or VMware…
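A minimal sketch of how you might score each DR test against its targets. The class name, the fields and the sample numbers are all made up for illustration (the 36-hour target just echoes the quote above); plug in whatever your business actually agreed to.

```python
from dataclasses import dataclass


@dataclass
class DrTestResult:
    """One DR test run for one system. All values are in hours."""
    name: str
    rto_target_h: float   # how fast the business says it must be back
    rto_actual_h: float   # how long the recovery really took
    rpo_target_h: float   # how much data loss the business accepts
    rpo_actual_h: float   # age of the data you actually recovered

    def misses(self):
        """Hours over target for RTO and RPO (0.0 when the target was met)."""
        return {
            "rto_over_h": max(0.0, self.rto_actual_h - self.rto_target_h),
            "rpo_over_h": max(0.0, self.rpo_actual_h - self.rpo_target_h),
        }

    def passed(self):
        m = self.misses()
        return m["rto_over_h"] == 0 and m["rpo_over_h"] == 0


# Hypothetical run: the mail server came back in 52 hours against a 36-hour target.
mail = DrTestResult("mail", rto_target_h=36, rto_actual_h=52,
                    rpo_target_h=24, rpo_actual_h=24)
print(mail.name, "passed" if mail.passed() else "missed", mail.misses())
```

Keep one of these per machine per test and the "how much did you miss by" question answers itself over time.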

Once you get good at it, start inventing scenarios. For instance:

- Pretend one of your tapes is bad. See how long your offsite vendor takes to bring you a fresh set once you figure out which barcodes you need.
- Pretend one of the critical servers can’t be recovered and you need to go back 3 weeks. How does this affect the business?
- Recover to dissimilar hardware.
- Pretend you’re dead. Are your documented procedures clear enough for your underling to follow? Are they clear enough for the janitor? The janitor’s 3-year-old kid? The kid’s parakeet? Ultimately, your DR runbooks need to be so clear that even your CEO can follow them easily, and he needs to be able to do so right out of bed, before he’s had his morning ablutions, quad-vanilla-soy-latte and his Zoloft.

Ultimately (and sorry if I’m repeating myself), you probably need to be making at least 2 tape copies, 2 copies of your backup catalog, replicating (ideally CDP) and using VMware all at the same time to have any real insurance policy against disaster.

And if you ever tell me “well, we don’t have the time to be doing DR tests” - do you really think you’ll have the time once disaster really strikes?

And, if you think that a disaster is an RGE (Resume Generating Event) then you probably are working for the wrong company and won’t get much job satisfaction there anyway.

I think I’d better get up before I lose my legs.

Nighty-night

D

Mon
22
Dec '08

My frustration with the quality and education of CIOs, CTOs, IT Directors, what have you… what caliber IT manager should you choose?

I routinely deal with all kinds of IT manager types during a campaign.

Sometimes said managers are well-versed in technology. Other times they have biases, are bigoted, and so on. Which is fine, I’m more opinionated, cranky and obnoxious than most.

It agitates me to encounter IT management types that:

  • Have no technology experience
  • Have no concept of how IT relates to the business
  • Have no idea how much technology costs
  • Have no idea how much being penny wise and dollar foolish can hurt their business
  • Cannot recognize an amazing deal due to their lack of a holistic viewpoint.

However, as annoying as the above bullets are, someone with sufficient intelligence and perseverance that cares will eventually “get it” and become able to at least have a conversation about technology. No, what bothers me more are the managers that:

  1. Do not care about technology
  2. Were promoted “from within” because they either knew someone or they were just the nearest body whose temperature was higher than ambient and are also guilty of #1
  3. Have an IQ less than their shoe size (US units)
  4. Are unable to delegate
  5. Are unable to pick proper subordinates (invariably they pick people whose IQ is in the single digits)
  6. Due to their unbelievable ignorance, pathologically distrust whatever vendors tell them or (the even more irksome)
  7. Get blinded by inane and irrelevant marketing gimmicks (look, the box can do a million IOPS with 10 drives, yours is nowhere near as fast!)
  8. Just believe whatever the last vendor told them
  9. Do not value the work solid partners do for them - there are truly few people that will actually add value, instead of just wanting to take your money!

I lost a couple of deals recently because of #7 and #8. If you’re reading this, you know full well who you are. Here’s an example - would you not be pissed if:

  • You educated the customer far more than any other vendor - they freely admitted they had no idea what they’d need and indeed asked you to figure it out and suggest a way forward
  • You analyzed the performance of their environment and properly engineered a solution that will, scientifically, accommodate what they have plus a pre-stated amount of future growth without just throwing product at them
  • You analyzed their actual business needs and where they need to be and provided a plan to get there
  • You used best practices for DR and backups
  • You did it all while being less expensive than the competition, especially when considering the lack of essential features the competition suffered from.

And what happens? Next thing you know, they’re picking the competitor that:

  • Is unproven (not even a handful of installs where we are)
  • Does not have useful functionality that they will need a few years down the line (VMware SRM anyone?)
  • Did not educate them - indeed, recommended plainly wrong “best practices” that could bring an iSCSI environment to its knees (interesting what you hear when a storage vendor has no idea about Ethernet, switching, port channeling etc)
  • Blinded them with things like “look we have more cache” or “our box takes more drives!” (they’ll never need them)
  • Did not do thorough (or any) performance analysis (“looking” at random perfmon data doesn’t guarantee success)
  • Cannot even do replication
  • Did not offer them snapshots or any application awareness for backups and DR

I guess I was outsold. As someone I greatly respect and like but am frequently infuriated by likes to say, “tell them what they want to hear”. Maybe I need to become more corrupt.

So what would an ideal IT higher-up look like? I know I could do the job while being drowned and quartered, let alone in my sleep. But I’d get bored. A few pointers on who you should hire:

  1. Someone with real IT experience - ideally someone that started on the operational side and migrated to management
  2. Someone that not just understands but actually likes and appreciates technology (too hard to keep up otherwise)
  3. Someone that understands the financial and business ramifications of action or inaction when it comes to IT purchases
  4. Someone that understands the value of partnerships! Indeed, someone that already has solid partnerships.
  5. Even if you have semi-competent people within, sometimes it’s better to just hire someone with real experience and not wait till the internal hire figures it out, especially if you have projects on the line
  6. Get someone that understands RPO, RTO and what those mean in financial terms
  7. Find someone that used to work in a large corporation but “just had enough” - their experience is invaluable but they’re looking to go to a smaller outfit
  8. They should be able to sell better than most salespeople that visit them!

I could go on but you get the idea.

D

Mon
15
Dec '08

Cinebench benchmarks - performance comparison between Vista 64 and Mac OS X

Been a while since I posted anything - there’s a TON of material but some of us actually do more than blog, it’s quarter/year end, and I barely have time to go to the bathroom…

But this was an easy one so I thought I’d post it real quick. Using Scribefire, a blogging plug-in for Firefox. I hate it.

Disclaimer: The machines used are not identical.

However, the CPUs supposedly are pretty close in speed (2.6 vs 2.8 GHz). Memory is the same.

Graphics are also similar but the Lenovo box has 128MB VRAM whereas the Mac has 512MB and a faster GPU.

The contenders: Macbook Pro 2.8GHz vs Lenovo T62p (14″ model) running Vista 64, 2.6GHz.

The Mac is running a 32-bit OS (64-bitness is coming with Snow Leopard next year). It also has switchable graphics and one can choose between the on-chipset Nvidia 9400 or the discrete 9600. Typically on-board graphics are pretty crappy.

Despite the dissimilarity of the machines here are some notables:

  • Cinebench really takes off in 64-bit mode in Vista
  • OS X seems to do quite well even though it’s not 64-bit yet
  • The integrated graphics on the new Mac are awesome
  • The discrete graphics are great for a laptop
  • OS X seems to be more efficient than Vista when doing multi-CPU work, at least in this case
  • If someone is looking for a decent modern laptop they can do far worse than the new Macs, even a plain Macbook would be pretty decent

Here’s a chart of the results:

OS/Config                           1-CPU   2-CPU   GFX    Multiprocessor speedup
Macbook Pro 2.8GHz integrated GFX   3208    6051    4813   1.87
Macbook Pro 2.8GHz discrete GFX     3213    5926    6130   1.84
Lenovo 2.6GHz 32-bit                2693    4755    4264   1.77
Lenovo 2.6GHz 64-bit                3040    5367    4256   1.77
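
For anyone who wants to play with the numbers, here is a quick sketch that recomputes the ratios from the chart. The speedup here is simply the 2-CPU score divided by the 1-CPU score, which is my assumption about how that column is derived; Cinebench prints its own figure, so the recomputed values can differ from the chart by a hundredth or two. The dictionary keys are just the row labels.

```python
# (1-CPU, 2-CPU) rendering scores copied from the chart.
scores = {
    "Macbook Pro 2.8GHz integrated GFX": (3208, 6051),
    "Macbook Pro 2.8GHz discrete GFX": (3213, 5926),
    "Lenovo 2.6GHz 32-bit": (2693, 4755),
    "Lenovo 2.6GHz 64-bit": (3040, 5367),
}


def speedup(single, multi):
    """Multiprocessor speedup: 2-CPU score divided by 1-CPU score."""
    return multi / single


def gain_percent(before, after):
    """Percentage improvement going from `before` to `after`."""
    return (after / before - 1) * 100


for name, (single, multi) in scores.items():
    print(f"{name}: {speedup(single, multi):.2f}x")

# Same Lenovo hardware, 32-bit vs 64-bit Vista.
lenovo32 = scores["Lenovo 2.6GHz 32-bit"]
lenovo64 = scores["Lenovo 2.6GHz 64-bit"]
print(f"64-bit gain, 1 CPU: {gain_percent(lenovo32[0], lenovo64[0]):.1f}%")
print(f"64-bit gain, 2 CPU: {gain_percent(lenovo32[1], lenovo64[1]):.1f}%")
```

The 64-bit gain on the Lenovo is the cleanest comparison in the chart since the hardware is identical between those two rows.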
Sun
2
Nov '08

Postmark on late 2008 Macbook Pro

So I’m now the proud owner of a tricked-out 2.8GHz MBP.

Naturally I’ve been tinkering with it already (only had it for 2 days). I’ve disabled swapfile encryption, for instance, and I think it makes it have teh snappy.

I compiled postmark for it with -O3 -m64 and ran the usual. Before doing so though I did disable the Spotlight indexer like this:

sudo launchctl unload /System/Library/LaunchDaemons/com.apple.metadata.mds.plist

PostMark v1.5 : 3/27/01
pm>set number 10000
pm>set transactions 20000
pm>set subdirectories 5
pm>set size 500 100000
pm>set read 4096
pm>set write 4096
pm>run

Time:
273 seconds total
256 seconds of transactions (78 per second)

Files:
20092 created (73 per second)
Creation alone: 10000 files (833 per second)
Mixed with transactions: 10092 files (39 per second)
9935 read (38 per second)
10064 appended (39 per second)
20092 deleted (73 per second)
Deletion alone: 10184 files (2036 per second)
Mixed with transactions: 9908 files (38 per second)

Data:
548.25 megabytes read (2.01 megabytes per second)
1158.00 megabytes written (4.24 megabytes per second)

I then enabled Spotlight and re-ran the benchmark:

Time:
483 seconds total
468 seconds of transactions (42 per second)

Files:
20092 created (41 per second)
Creation alone: 10000 files (909 per second)
Mixed with transactions: 10092 files (21 per second)
9935 read (21 per second)
10064 appended (21 per second)
20092 deleted (41 per second)
Deletion alone: 10184 files (2546 per second)
Mixed with transactions: 9908 files (21 per second)

Data:
548.25 megabytes read (1.14 megabytes per second)
1158.00 megabytes written (2.40 megabytes per second)

Obviously Spotlight is very aggressive in its indexing and tries to do it ASAP - you lose roughly half your performance when doing metadata-intensive processing. The results though, while sucky for the specs of the box, are far, far removed from (and much better than) what an old colleague got on his beastie: http://recoverymonkey.net/wordpress/?p=62 - granted, my box is faster but it shouldn’t be THAT much faster.
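
To put a number on the Spotlight hit, here is a small sketch using only the totals and transaction rates from the two runs above. The helper name is mine, nothing here comes from PostMark itself.

```python
# Figures copied from the two PostMark runs above.
runs = {
    "spotlight_off": {"total_s": 273, "tps": 78},
    "spotlight_on": {"total_s": 483, "tps": 42},
}


def percent_change(before, after):
    """Signed percentage change going from `before` to `after`."""
    return (after - before) / before * 100.0


tps_change = percent_change(runs["spotlight_off"]["tps"],
                            runs["spotlight_on"]["tps"])
time_change = percent_change(runs["spotlight_off"]["total_s"],
                             runs["spotlight_on"]["total_s"])

print(f"Transactions per second: {tps_change:+.1f}%")
print(f"Total run time:          {time_change:+.1f}%")
```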

I urge my newfound Mac brethren to help out in determining the cause.

More benchmarks to follow.

D

Thu
16
Oct '08

A word of caution when setting up a deduplicating VTL

Based on some recent experiences I wanted to make people aware of some caveats with setting up a VTL with deduplication. This is specifically regarding the EMC DL3D (AKA Quantum DXi) but applies to all of them. This will be a mercifully short and to the point post. Here’s the rub:

  • Create small virtual tapes (100GB max, I’d go even smaller, obviously depends on your environment)
  • Create a bunch of virtual tape drives (you might have to create 20-30!)
  • Do NOT I repeat NOT multiplex in the backup software! It screws up the deduplication algorithm.
  • Do not compress the data before the backup
  • Do not encrypt the data
  • Be mindful of your retention policies, start gently then work your way up.
  • I’d personally not multi-stream a server at all, just so I can keep the tape utilization high. What I mean: Say you do not do multiplexing but you are multistreaming – i.e. you’re sending 10 streams from your client. This means you will need 10 tapes without multiplexing, so you’ll end up writing a tiny bit on each tape. It doesn’t take a genius to realize that you’ll end up with a ton of tapes with not much data on them, which will cause them to be appended to with more tiny amounts of data, which will in turn cause them to expire way later than you’d like.
  • If you can use the box as NAS and know how to get the throughput up there then do so, that way there’s no issue with multiple streams. My Data Domain boys are chuckling now (they always prefer to do NAS, but that also has to do with the fact that their box can’t really do VTL properly yet. Oh, the cattiness! BTW my company does sell quite a lot of their stuff).
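
Here is a back-of-the-envelope sketch of the multistreaming problem from the bullets above. It is a deliberately crude model and every number in it is hypothetical: I assume each stream lands on its own virtual tape, the nightly backup is split evenly across streams, tapes get appended to until full, and a tape can only be recycled once the newest image on it has expired.

```python
import math


def nights_to_fill(tape_gb, nightly_gb, streams):
    """Nights of appends needed before one tape is full."""
    per_tape_per_night = nightly_gb / streams
    return math.ceil(tape_gb / per_tape_per_night)


def days_until_reusable(tape_gb, nightly_gb, streams, retention_days):
    """Days from the first write until the tape can be recycled.

    The last append happens on night (nights_to_fill - 1), and the tape
    is stuck until that last image ages out.
    """
    return nights_to_fill(tape_gb, nightly_gb, streams) - 1 + retention_days


# Hypothetical client: 50GB a night, 100GB virtual tapes, 30-day retention.
for streams in (1, 10):
    fill = nights_to_fill(100, 50, streams)
    reuse = days_until_reusable(100, 50, streams, 30)
    print(f"{streams:2d} stream(s): tape fills in {fill} nights, "
          f"reusable after {reuse} days")
```

With these made-up numbers the 10-stream case keeps each tape tied up for weeks longer than the single-stream case, which is the effect described above. Smaller virtual tapes shrink the problem, which is one reason to keep them small.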

The same rules apply otherwise as in my previous post about tuning NetBackup for large environments.

Regarding using the DL3D/DXi as NAS: Plug in as many GigE ports as you can, but make sure your switch can do straight-up EtherChannel (not LACP). So you pretty much need to have a “proper” Cisco switch in order to get the full benefit. Then use multiple media servers. Use a separate NAS share per media server. Team the NICs on the backup servers for performance (do LACP or PAgP there, whatever works with the server’s NIC software). Then call me in the morning.

D

 

Wed
20
Aug '08

What is the value of your data? Do you have the money to implement proper DR that works? How are you deciding what kind of storage and DR strategy you’ll follow? And how does Continuous Data Protection like EMC’s RecoverPoint help?

Maybe the longest title for a post ever. And one of my longest, most rambling posts ever, it seems.

Recently we did a demo for a customer that I thought opened an interesting can of worms. Let’s set the stage – and, BTW, let it be known that I lost my train of thought multiple times writing this over multiple days so it may seem a bit incoherent (unusually, it wasn’t written in one shot).

The customer at the moment uses DASD and is looking to go to some kind of SAN for all the usual reasons. They were looking at EMC initially, then Dell told them they should look at Equallogic (imagine that). Not that there’s anything wrong with Dell or Equallogic… everything has its place.

So they get the obligatory “throw some sh1t on the wall and see what sticks” quote from Dell – literally Dell just sent them pricing on a few different models with wildly varying performance and storage capacities, apparently without rhyme or reason. I guess the rep figured they could afford at least one of the boxes.

So we start the meeting with yours truly asking the pointed questions, as is my idiom. It transpires that:

  1. Nobody looked at what their business actually does
  2. Nobody checked current and expected performance
  3. Nobody checked current and expected DR SLAs
  4. Nobody checked growth potential and patterns
  5. Nobody asked them what functionality they would like to have
  6. Nobody asked them what functionality they need to have
  7. Nobody asked how much storage they truly need
  8. Nobody asked them just how valuable their data is
  9. Nobody asked them how much money they can really spend, regardless of how valuable their data is and what they need.

So we do the dog-and-pony – and unfortunately, without really asking them anything about money, show them RecoverPoint first, which is even worse than showing a Lamborghini (or insert your favorite grail car) to someone that’s only ever used and seen badly-maintained rickshaws, to use a car analogy.

To the uninitiated, EMC’s RecoverPoint is the be-all, end-all CDP (Continuous Data Protection) product, all nicely packaged in appliance format. It used to be Kashya (which seems to mean either “hard question” or “hard problem” in Hebrew), then EMC wisely bought Kashya, and changed the name to something that makes more marketing sense. Before EMC bought them, Kashya was the favorite replication technology of several vendors that just didn’t have anything decent in-place for replication (like Pillar). Obviously, with EMC now owning Kashya, it would look very, very bad if someone tried to sell you a Pillar array and their replication system came from EMC (it comes from FalconStor now). But I digress.

RecoverPoint lets you roll your disks back and forth in time, very much like a super-fine-grained TiVo for storage. It does this by creating a space equal to the space consumed by the original data that acts as a mirror, plus the use of what is essentially a redo log (so to use it locally you need 2x the storage + redo log space). The bigger the redo log, the more you can go back in time (you could literally go back several days). Oh, and they like to call the redo log The Journal.

It works by effectively mirroring the writes so they go to their target and to RecoverPoint. You can implement the “splitter” at the host level, the array (as long as it’s a Clariion from EMC) or with certain intelligent fiber switches using SSM modules (the last option being by far the most difficult and expensive to implement).

In essence, if you want to see a different version of your data, you ask RecoverPoint to present an “image” of what the disks would look like at a specified point-in-time (which can be entirely arbitrary or you can use an application-aware “bookmark”). You can then mount the set of disks the image represents (called a consistency group) to the same server or another server and do whatever you need to do. Obviously there are numerous uses for something like that. Recovering from data corruption while losing the least amount of data is the most obvious use case but you can use it to run what-if scenarios, migrations, test patches, do backups, etc.

You can also use RecoverPoint to replicate data to a remote site (where you need just 1x the storage + redo log). It does its own deduplication and TCP optimizations during replication, and is amazingly efficient (far more so than any other replication scheme in my opinion). They call it CRR (Continuous Remote Replication). Obviously, you get the TiVo-like functionality at the remote side as well.
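
A rough sizing sketch for the 2x + journal (local) and 1x + journal (remote) rules mentioned above. Big caveat: the journal formula is purely my own simplification (average write rate multiplied by the rollback window you want), not vendor sizing guidance, and it ignores metadata overhead and any compression. The function names and sample numbers are hypothetical.

```python
def journal_gb(avg_write_mb_per_s, window_hours):
    """Crude journal estimate: keep every write made during the window."""
    return avg_write_mb_per_s * 3600 * window_hours / 1024


def local_cdp_total_gb(production_gb, journal):
    """Local protection: production data + mirror copy + journal."""
    return 2 * production_gb + journal


def remote_crr_total_gb(production_gb, journal):
    """Capacity needed at the remote site: one copy + journal."""
    return production_gb + journal


# Hypothetical shop: 2TB of production data, 5MB/s average writes,
# and a wish to roll back up to 24 hours.
j = journal_gb(avg_write_mb_per_s=5, window_hours=24)
print(f"Journal:               {j:,.0f} GB")
print(f"Local total (2x + j):  {local_cdp_total_gb(2000, j):,.0f} GB")
print(f"Remote site (1x + j):  {remote_crr_total_gb(2000, j):,.0f} GB")
```

Even with a crude model like this you can see why the rollback window you ask for drives the bill: the journal grows linearly with both the write rate and the number of hours you want to be able to rewind.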

The kicker is the granularity of CRR/CDP. Obviously, as with anything, there can be no magic, but, given the optimizations it does, if the pipe is large enough you can do near-synchronous replication over distances previously unheard of, and get per-write granularity both locally and remotely. All without needing a WAN accelerator to help out, expensive FC-IP bridges and whatnot.

There’s one pretender that likes to take fairly frequent snapshots but even those are several minutes apart at best, can hurt performance and are limited in their ultimate number. Moreover, their recovery is nowhere near as slick, reliable and foolproof.

To wit: We did demos going back and forth a single transaction in SQL Server 2005. Trading firms love that one. The granularity was a couple of microseconds at the IOPS we were running. We recovered the DB back to entirely arbitrary points in time, always 100% successfully. Forget tapes or just having the current mirrored data!

We also showed Exchange being recovered at a remote Windows cluster. Due to Windows cluster being what it is, it had some issues with the initial version of disks it was presented. The customer exclaimed “this happened to me before during a DR exercise, it took me 18 hours to fix!!” We then simply used a different version of the data, going back a few writes. Windows was happy and Exchange started OK on the remote cluster. Total effort: the time spent clicking around the GUI asking for a different time + the time to present the data, less than a minute total. The guy was amazed at how streamlined and straightforward it all was.

It’s important to note that Exchange suffers more from those issues than other DBs since it’s not a “proper” relational DB like SQL is, the back-end DB is Jet and don’t let me get started… the gist is that replicating Exchange is not always straightforward. RecoverPoint gave us the chance to easily try different versions of the Exchange data, “just in case”.

How would you do that with traditional replication technologies?

How would you do that with other so-called CDP that is nowhere near as granular? How much data would you lose? Is that competing solution even functional? Anyone remember Mendocino? They kinda tried to do something similar, the stuff wouldn’t work right in a pristine lab environment, I gave up on it. RecoverPoint actually works.

Needless to say, the customer loved the demo (they always do, never seen anyone not like RecoverPoint, it’s like crack for IT guys). It solves all their DR issues, works with their stuff, and is almost magical. Problem is, it’s also pretty expensive – to protect the amount of data that customer has they’d almost need to spend as much on RecoverPoint as on the actual storage itself.

Which brings us to the source of the problem. Of course they like the product. But for someone that is considering low-end boxes from Dell, IBM etc. this will be a huge price shock. They keep asking me to see the price, then I hear they’re looking at stuff from HDS and IBM and (no disrespect) that doesn’t make me any more confident that they can afford RecoverPoint.

Our mistake is that we didn’t at first figure out their budget. And we didn’t figure out the value of their data – maybe they don’t need the absolute best DR technology extant since it won’t cost them that much if their data isn’t there for a few hours.

The best way to justify any DR solution is to figure out how much it costs the business if you can achieve, say, 1 day of RTO and 5 hours of RPO vs 5 minutes of RTO and near-zero RPO. Meaning, what is the financial impact to the business for the longer RPO and RTO? And how does it compare to the cost of the lower RPO and RTO recovery solution?
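
The exercise above can be sketched in a few lines. The RTO/RPO pairs are the ones from the paragraph; everything else (the hourly cost figures, the incident rate, the planning horizon) is a hypothetical placeholder you would have to get from the business, and the linear cost model is an assumption on my part.

```python
def outage_cost(rto_hours, rpo_hours, downtime_cost_per_hour,
                lost_data_cost_per_hour):
    """Cost of one incident: time spent down plus work that must be redone."""
    return (rto_hours * downtime_cost_per_hour
            + rpo_hours * lost_data_cost_per_hour)


def expected_savings(cost_slow, cost_fast, incidents_per_year, years):
    """Expected money saved by the faster solution over the horizon."""
    return (cost_slow - cost_fast) * incidents_per_year * years


# Hypothetical business figures.
DOWNTIME_PER_HOUR = 10_000
LOST_DATA_PER_HOUR = 20_000

slow = outage_cost(24, 5, DOWNTIME_PER_HOUR, LOST_DATA_PER_HOUR)      # 1 day RTO, 5h RPO
fast = outage_cost(5 / 60, 0, DOWNTIME_PER_HOUR, LOST_DATA_PER_HOUR)  # 5 min RTO, ~0 RPO

savings = expected_savings(slow, fast, incidents_per_year=0.5, years=3)
print(f"Per-incident cost, slow recovery: ${slow:,.0f}")
print(f"Per-incident cost, fast recovery: ${fast:,.0f}")
print(f"Expected savings over 3 years:    ${savings:,.0f}")
```

If the price difference between the two solutions is well under the expected savings, the expensive option pays for itself. If it is well over, you have your answer too, and either way you now have a number to show the boss.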

The real issue with DR is that almost no company truly goes through that exercise. Almost everyone says “my data is critical and I can afford zero data loss” but nobody seems to be in touch with reality, until presented with how much it will cost to give them the zero RPO capability.

The stages one goes through in order to reach DR maturity are like the stages of grief – Denial, Anger, Bargaining, Depression, and Acceptance.

Once people see the cost, they hit the Denial stage and do a 180: “You know what, I really don’t need this data back that quickly and can afford a week of data loss!!! I’ll mail punch cards to the DR site!” – typically, this is removed from reality and is a complete knee-jerk reaction to the price.

Then comes Anger – “I can’t believe you charge this much for something essential like this! It should be free! You suck! It’s like charging a man dying of thirst for water! I’ll sue! I’ll go to the competition!”

Then they realize there’s no competition to speak of so we reach the Bargaining stage: “Guys, I’ll give you my decrepit HDS box as a trade-in. I also have a cool camera collection you can have, baseball cards, and I’ll let you have fun with my sister for a week!”

After figuring out how much money we can shave off by selling his HDS box, cameras and baseball cards on ebay and his sister to some sinister-looking guys with portable freezers (whoopsie, he did say only a week), it’s still not cheap enough. This is where Depression sets in. “I’m screwed, I’ll never get the money to do this, I’ll be out of a job and homeless! Our DR is an absolute joke! I’ll be forced to use simple asynchronous mirroring! What if I can’t bring up Exchange again? It didn’t work last time!”

The final stage is Acceptance – either you come to terms with the fact you can’t afford the gear and truly try to build the best possible alternative, or you scrounge up the money somehow by becoming realistic: “well, I’m only gonna use RecoverPoint for my Exchange and SQL box and maybe the most critical VMs, everything else will be replicated using archaic methods but at least my important apps are protected using the best there is”.

It would save everyone a lot of heartache and time if we just jump straight to the Acceptance phase where RecoverPoint is concerned:

  • Yes, it really works that well.
  • Yes, it’s that easy.
  • Yes, it’s expensive because it’s the best.
  • Yes, you might be able to afford it if you become realistic about what you need to protect.
  • Yes, you’ll have to do your homework to justify the cost. If nothing else, you’ll know how much an outage truly costs your business! Maybe your data is more important than your bosses realize. Or maybe it’s a lot LESS important than what everyone would like to think. Either way you’re ahead!
  • Yes, leasing can help make the price more palatable. Leasing is not always evil.
  • No, it won’t be free.
  • If you have no money at all why are you even bothering the vendors? Read the brochures instead.
  • If you have some money please be upfront with exactly how much you can spend, contrary to popular belief not everyone is out to screw you out of all your IT budget. After all we know you can compare our pricing to others’ so there’s no point in trying to screw anyone. Moreover, the best customers are repeat customers, and we want the best customers! Just like with cars, there’s some wiggle room but at some point if you’re trying to get the expensive BMW you do need to have the dough.

     

Anyway, I rambled enough…

D

Mon
16
Jun '08

Massive benchmark comparison between Windows XP, Vista and 2008 Server, 32- and 64-bit

Found this while surfing and couldn’t resist posting the link. This guy did a massive array of tests on pretty much all versions of Windows that matter at the moment. The short version? If it’s performance you’re after, there’s no clear winner, since they all have their strengths. Overall, of the currently-supported OSes Server 2008 seems to have the edge, which matches my own experience. Attaching the results below, but here’s a link, too.

microsoft OS benchmarks

Tue
10
Jun '08

Virtualized Windows I/O performance on Xen with and without the optimized PV drivers, and versus the Linux host

One of my readers, Randall Ehren, was kind enough to provide benchmarks for Xen-virtualized Windows 2003 and XP with and without the optimized PV driver, and also compare to the underlying host. Most of the text below is copied verbatim from his correspondence with me, I just added some clarification in places.

physical machine description:
dell poweredge r200 server, 8GB ram, 2×250GB SATA 7200rpm in a RAID1

Xen host: Ubuntu 8.04 LTS Server edition running xen 3.2 hypervisor (this is referred to as the dom0 machine)

This server is hosting two virtual servers (1 - windows 2003 server (1GB RAM), 2 - windows xp (1GB RAM)) and I performed two postmark benchmarks - one with an out of the box windows installation (indicated by “no PV drivers”), the other with a paravirtualized disk driver (indicated by “with Xen PV 0.9.6 drivers”) whose purpose is to greatly increase disk & network performance for windows-based virtual machines running under Xen. The drivers can be found here:

 http://wiki.xensource.com/xenwiki/XenWindowsGplPv

Postmark settings:

set number 10000
set transactions 20000
set subdirectories 5
set size 500 100000
set read 4096
set write 4096

Underlying host

##
## server: ubu 8 amd64 / iron / ext3fs on LVM2
## storage: direct attached / raid1 / 2×250g 7.2k sata / idle
##

Linux vm5 2.6.24-17-xen #1 SMP Thu May 1 15:55:31 UTC 2008 x86_64 GNU/Linux

Time:

        73 seconds total
        59 seconds of transactions (338 per second)

Files:
        20092 created (275 per second)
                Creation alone: 10000 files (10000 per second)
                Mixed with transactions: 10092 files (171 per second)
        9935 read (168 per second)
        10064 appended (170 per second)
        20092 deleted (275 per second)
                Deletion alone: 10184 files (783 per second)
                Mixed with transactions: 9908 files (167 per second)

Data:
        548.25 megabytes read (7.51 megabytes per second)
        1158.00 megabytes written (15.86 megabytes per second)


## server: w2k3 32bit / xen 3.2 / ntfs (no PV drivers)
## storage: direct attached / raid1 / 2×250g 7.2k sata / idle
## xen notes: using ‘tap:aio’ file driver for windows 2003 server “drive”

Time:

        193 seconds total
        123 seconds of transactions (162 per second)

Files:
        20092 created (104 per second)
                Creation alone: 10000 files (166 per second)
                Mixed with transactions: 10092 files (82 per second)
        9935 read (80 per second)
        10064 appended (81 per second)
        20092 deleted (104 per second)
                Deletion alone: 10184 files (1018 per second)
                Mixed with transactions: 9908 files (80 per second)

Data:
        548.25 megabytes read (2.84 megabytes per second)
        1158.00 megabytes written (6.00 megabytes per second)


## server: w2k3 32bit / xen 3.2 / ntfs (with Xen PV 0.9.6 drivers)
## storage: direct attached / raid1 / 2×250g 7.2k sata / idle
## xen notes: using ‘tap:aio’ file driver for windows 2003 server “drive”

Time:

        129 seconds total
        68 seconds of transactions (294 per second)

Files:
        20092 created (155 per second)
                Creation alone: 10000 files (204 per second)
                Mixed with transactions: 10092 files (148 per second)
        9935 read (146 per second)
        10064 appended (148 per second)
        20092 deleted (155 per second)
                Deletion alone: 10184 files (848 per second)
                Mixed with transactions: 9908 files (145 per second)

Data:
        548.25 megabytes read (4.25 megabytes per second)
        1158.00 megabytes written (8.98 megabytes per second)




## server: winxp 32bit / xen 3.2 / ntfs (no PV drivers)
## storage: direct attached / raid1 / 2×250g 7.2k sata / idle
## xen notes: using ‘tap:aio’ file driver for windows xp “drive”

Time:

        336 seconds total
        274 seconds of transactions (72 per second)

Files:
        20092 created (59 per second)
                Creation alone: 10000 files (178 per second)
                Mixed with transactions: 10092 files (36 per second)
        9935 read (36 per second)
        10064 appended (36 per second)
        20092 deleted (59 per second)
                Deletion alone: 10184 files (1697 per second)
                Mixed with transactions: 9908 files (36 per second)

Data:
        548.25 megabytes read (1.63 megabytes per second)
        1158.00 megabytes written (3.45 megabytes per second)




## server: winxp 32bit / xen 3.2 / ntfs (with Xen PV 0.9.6 drivers)
## storage: direct attached / raid1 / 2×250g 7.2k sata / idle
## xen notes: using ‘tap:aio’ file driver for windows xp “drive”

Time:

        233 seconds total
        181 seconds of transactions (110 per second)

Files:
        20092 created (86 per second)
                Creation alone: 10000 files (222 per second)
                Mixed with transactions: 10092 files (55 per second)
        9935 read (54 per second)
        10064 appended (55 per second)
        20092 deleted (86 per second)
                Deletion alone: 10184 files (1454 per second)
                Mixed with transactions: 9908 files (54 per second)

Data:
        548.25 megabytes read (2.35 megabytes per second)
        1158.00 megabytes written (4.97 megabytes per second)

 

Conclusion: the PV drivers clearly help I/O performance a great deal. Of course, compared with the performance of the underlying host, the VMs still suck. I’d like to see Randall use the same box to run at least 2003 natively and post the results; that should give a great comparison between NTFS and ext3.
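To put a rough number on that, here is a minimal sketch that works out the gain for the two Windows XP runs above. The figures are copied straight from the postmark output; the dictionary layout and function name are only illustrative.

```python
# Postmark totals copied from the two Windows XP runs above.
# The names used here are illustrative; they are not part of postmark.
runs = {
    "winxp, no PV drivers": {"total_s": 336, "tps": 72},
    "winxp, PV 0.9.6 drivers": {"total_s": 233, "tps": 110},
}


def speedup(before, after):
    """Return how many times faster `after` finished than `before`."""
    return before["total_s"] / after["total_s"]


base = runs["winxp, no PV drivers"]
pv = runs["winxp, PV 0.9.6 drivers"]

# Run-time speedup and the relative gain in transactions per second.
print(f"Run time speedup with PV drivers: {speedup(base, pv):.2f}x")
print(f"Transaction rate gain: {(pv['tps'] / base['tps'] - 1) * 100:.0f}%")
```

For these two runs that comes out to roughly 1.44x on total run time and about 53% more transactions per second.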

Randall/D

Wed
28
May '08

Finally, some postmark results for OSX! And how does it do versus Windows?

My colleague Ian (last name withheld to save him from the Mac zealots) compiled the postmark code on his beloved Mac and ran it with the same settings I use in general (see older blog posts, just search for postmark).

I’ve been curious for the longest time to see how OSX performs in this test, since most UNIX and -alike systems work great with it. I wanted to see if OSX would be appropriate for a high IOPS-type environment (my belief being that due to the choice of kernel and filesystem it would suck - Mach and HFS+ not being exactly ideally suited to such tasks).

This is obviously not the most scientific test but I think it is good enough to get a rough gauge.

I’m still waiting for the specifics on the Mac but it’s an older Intel-based 17″ MacBook Pro with a 2.16GHz CPU, 5400 RPM HD and 2GB RAM.

The horrendous result (I think my rusty abacus did better once):

Time:
1259 seconds total
1186 seconds of transactions (16 per second)

Files:
20092 created (15 per second)
Creation alone: 10000 files (163 per second)
Mixed with transactions: 10092 files (8 per second)
9935 read (8 per second)
10064 appended (8 per second)
20092 deleted (15 per second)
Deletion alone: 10184 files (848 per second)
Mixed with transactions: 9908 files (8 per second)

Data:
548.25 megabytes read (445.92 kilobytes per second)
1158.00 megabytes written (941.85 kilobytes per second)

To compare and contrast (and save you from searching the older posts):

On a similar-spec Thinkpad T60 running XP (1.8GHz Core Duo, 2GB RAM, 60GB 5400 RPM HD):

Time:
174 seconds total
91 seconds of transactions (219 per second)

Files:
20092 created (115 per second)
Creation alone: 10000 files (222 per second)
Mixed with transactions: 10092 files (110 per second)
9935 read (109 per second)
10064 appended (110 per second)
20092 deleted (115 per second)
Deletion alone: 10184 files (268 per second)
Mixed with transactions: 9908 files (108 per second)

Data:
548.25 megabytes read (3.15 megabytes per second)
1158.00 megabytes written (6.66 megabytes per second)

 

And on the spankin’ T61p running 2008 Server (2.6GHz Core 2 Duo, 4GB RAM, 200GB 7200 RPM HD):

Time:
110 seconds total
39 seconds of transactions (512 per second)

Files:
20092 created (182 per second)
Creation alone: 10000 files (454 per second)
Mixed with transactions: 10092 files (258 per second)
9935 read (254 per second)
10064 appended (258 per second)
20092 deleted (182 per second)
Deletion alone: 10184 files (207 per second)
Mixed with transactions: 9908 files (254 per second)

Data:
548.25 megabytes read (4.98 megabytes per second)
1158.00 megabytes written (10.53 megabytes per second)

 

Drive and CPU speed are important, but postmark results are largely a function of filesystem and cache efficiency. It’s also worth noting that postmark is in no way optimized for Windows, since it is just standard C code and was indeed meant to be run on Unix boxes. Typically, good Unix filesystems beat Windows in postmark (my record run time was under 10s on a Solaris box and DMX).

Unless something is wrong, HFS+ and/or the OSX cache are execrable for this kind of workload, which is a pity. Maybe there are better mount options? Some tuning options?

This is huge! Even if there’s some issue (disk fragmentation, for instance) the difference in sheer IOPS performance between OSX and pretty much anything else is staggering.
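For a sense of scale, the sketch below compares the three runs above against the OSX result. The numbers are copied from the postmark summaries; the tuple layout and the helper name are just for illustration.

```python
# (label, total seconds, transactions per second) copied from the runs above.
results = [
    ("MacBook Pro, OSX", 1259, 16),
    ("T60, XP", 174, 219),
    ("T61p, 2008 Server", 110, 512),
]


def relative_to(baseline, rows):
    """Return (label, run-time ratio, transaction-rate ratio) versus the baseline."""
    _, base_total, base_tps = baseline
    return [(label, base_total / total_s, tps / base_tps)
            for label, total_s, tps in rows]


for label, time_ratio, tps_ratio in relative_to(results[0], results):
    print(f"{label:<20} {time_ratio:5.1f}x faster overall, "
          f"{tps_ratio:5.1f}x the transactions/s")
```

On these figures the T60 finishes about 7 times sooner than the Mac and the T61p about 11 times sooner, with 32 times the transaction rate.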

Any Mac users out there that want to chime in and save the day please let me know, I’ll send you the source and you can compile it whichever which way. I truly hope there is some serious error here.

D

Tue
13
May '08

Lowest-impact antivirus tool I’ve ever tried

I’ve been trying out ESET’s NOD32 on my 64-bit 2008 Server box. Before that I’d tried Avast! – which has great detection but noticeably slows down my computer, even when it’s loading pre-cached and pre-checked content (easy test: load Firefox with and without Avast! several times. It’s ALWAYS much slower to load with antivirus on than without. Without Avast! it loads instantly).

So I put in NOD32 Business Edition and the performance difference is staggering. Indeed, I can’t tell the difference between having it on or off. Unless you ask for a scan of the entire box the antivirus process never even goes to 1% of CPU consumption. If you check various online tests of the different antivirus programs they do show NOD32 having some of the best performance overall (including possibly the best heuristics engine with practically zero false positives), plus it works with 2008.

Other progs (like Kaspersky) also work well but they’re much slower. I think I’ve found my holy grail when it comes to virus protection.

The one massive drawback is that Business Edition (which is the only one that supports 2008) is ONLY sold in 5-computer packs. It’s not expensive (boils down to like $40 per box, same as Home Edition) but I don’t HAVE 5 servers, I just have my 1 laptop that runs 2008.

I asked ESET and they wouldn’t sell me a single Business license. That’s just silly. The product is priced right and is totally solid, but they won’t sell you fewer than 5 licenses. I won’t spend 200-odd bucks for one machine.

Their response was that most businesses have more than 5 computers in general, so even if they have only 1-2 servers the rest of the licenses can be used on desktops/laptops. Which makes sense but it doesn’t help me :)

The only other product I’d consider now is Avira’s Antivir (same great speed and detection rate, however it provides many more false positives) but I hear it doesn’t work on 2008.

Damn the box is fast now, I forgot how blazing 2008 feels unencumbered by other fluff :)

D

Thu
8
May '08

Retarded storage and thin-skinned people

So this is kind of a long but funny story and a rant against oversensitive people at the same time.

About a year ago, this sales guy and I go to this architecture firm since they told us they are in dire need of a better storage solution.

We meet with their admin, real nice young guy, let’s call him Mike. He explains to me how they have this old <insert few-letter-company name> clustered NAS with some JBOD behind it. They’re having performance issues, it’s not scalable, they don’t replicate it or do snaps, the list goes on about how much he hates that box. It’s just not working out.

He then mentions he wasn’t part of the decision to buy the box and he just wants to get rid of it and get something much better.

So I start explaining to him the higher-end NAS solutions, I talk about the EMC Celerra, all the things it does etc. The whole explanation takes like 2 hours since he really was unfamiliar with a lot of the basics so I started from the ground up, explained the entire concept and architecture etc.

By the end of this we’re bonding with the guy, he’s throwing some F bombs in casual conversation, all in all we’re comfortable. He tells me he finally gets it, he realizes it took him a while to see the big picture but now he totally understands the value prop. He’s excited.

I feel stoked since I like the guy and it’s not often that you get to educate someone and make them that happy. Very rewarding. So we’re joking some more and I mention how the old box is pretty much retarded when compared to the EMC box, since the EMC box does so much more it’s ridiculous.

He laughs about that and agrees, we joke some more, I promise him I’ll send him a config to look over and we leave.

On the way out he tells me how great it all was, and cautions me jokingly that it’s probably not a good idea to mention to more conservative customers that their existing storage is retarded. We laugh and part ways in a very friendly fashion. Of course I don’t normally say something like that, I only did because we were joking around and bonding and, most importantly, he told me it wasn’t his baby and that he hates it. Usually the coast is clear after something like that :)

So I send him his config, he’s getting a great deal, all very well architected. No response. I call him, no response. Eventually the rep calls him, and Mike tells the rep how he was offended that I called his storage retarded and he doesn’t want to do business with us. I thought this was the weirdest thing ever. My initial reaction is that maybe someone close to him is mentally retarded – but if that were the case, he should have shown some kind of reaction when I first mentioned the dim-wittedness of his existing storage.

But wait, there’s more.

About a year later… different gig, different rep. I get the invite to go to this place and talk about storage. They’ve had problems for years and have a really old and bad system in place and really need a replacement. I walk in, and of course it’s the same exact architecture firm! I tell the rep that this is probably a bad idea and that I should leave. I don’t have time though because Mike comes to greet us.

The moment he sees me, he’s like “sorry guys, this is not gonna happen, you just leave now so we don’t waste each other’s time”. He says that he really respects my expertise but he won’t do business with a company I work at. He doesn’t want to speak to another engineer and pretty much kicks us out. I can’t shut up any more and I tell Mike that he has really, really thin skin.

Needless to say the new sales guy is dumbfounded.

The sales guy calls Mike a day or so later and gets an explanation out of him. Mike claims he doesn’t want to deal with engineers that belittle his equipment since how do I know in what financial dire straits they were? Maybe they were forced to buy the retarded storage.

Which is fine but shows that either Mike lied throughout our entire first meeting or has an amazingly bad short-term memory.

I wish Mike all the best in his future endeavors and still stand by my original assertion: get off your retarded storage if it’s causing you problems. Even if you don’t have money there are other appliance-type solutions to be had on the cheap (or free)!

Here are some easy-to-use appliances that are quite good:

You could try all of them as virtual machines if you don’t want to dedicate hardware to them to begin with. That way you can test all of them easily. You can also roll your own with Solaris 10 or Linux, of course it requires one to know what they’re doing but it’s amazing what can be accomplished for next to zero dollars nowadays.

And Mike, if you ever read this:

Get some thicker skin. And maybe some Gingko Biloba. Moreover, if the real reason I offended you was that someone close to you is retarded – get over it, it’s just an expression!

People are just too damn sensitive these days. Just get the job done.

D

Tue
25
Mar '08

Windows Server 2008 RTM 64-bit performance versus Vista SP1 64-bit, and using 2008 as a workstation

I’ve been using Vista x64 for a while now, just so I can make use of all the memory on my machine (an über-thinkpad), and because I like shiny new things and 64-bitness and don’t want to be one-upped by smug Mac users with their feline-named OSes, mock turtlenecks and their newfound 64-bit capabilities. Of course, with the good comes some bad – Vista, while in my opinion a step forward in many ways, does take a step backward when it comes to some areas of performance and sheer resource requirements. A lot of it can be attributed to poorly-written drivers, especially any Aero GUI slowdowns with nVidia cards.

Since space was running out I bought a new hard drive (200GB Seagate 7200 RPM) and decided to install the RTM 2008 bits. If something went wrong I figured I could always either go back to my old drive or just move Vista to the new drive with some imaging utility or other, no biggie. If 2008 worked out, I’d keep it.

The reason this comparison is worthwhile is that 2008 and Vista SP1 have the same exact kernel – I checked, NTOSKRNL.EXE is the same in both OSes. One would think that the differences wouldn’t be huge and that therefore there’s no point going to 2008. Of course, there are a lot of other pieces aside from the kernel, and I think that Microsoft checks to see what OS you’re running and maybe disables certain features in the kernel accordingly – I couldn’t get the LargeSystemCache registry parameter to have any effect on Vista, for example.
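If you want to repeat that kind of check yourself, one way is to compare cryptographic hashes of the two files. This is a minimal sketch, not necessarily how the check above was done, and the paths in the usage comment are placeholders for wherever you copied each kernel.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_file_contents(path_a, path_b):
    """True if both files hash to the same SHA-256 digest."""
    return sha256_of(path_a) == sha256_of(path_b)


# Usage (placeholder paths, adjust to your own copies):
#   same_file_contents(r"D:\compare\vista\ntoskrnl.exe",
#                      r"D:\compare\2008\ntoskrnl.exe")
```

Identical hashes only tell you the binaries match byte for byte; they say nothing about behavior that is switched on or off at run time.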

Let’s compare CPU- and Graphics-benchmarks first, since those shouldn’t really be different. I used Cinebench 64-bit.

 

Vista:

Rendering (Single   CPU): 3040 CB-CPU
Rendering (Multiple CPU): 5367 CB-CPU
Multiprocessor Speedup: 1.77
Shading (OpenGL Standard)          : 4256 CB-GFX

 

2008:

Rendering (Single   CPU): 3053 CB-CPU
Rendering (Multiple CPU): 5379 CB-CPU
Multiprocessor Speedup: 1.86
Shading (OpenGL Standard)          : 4478 CB-GFX

 

Slightly better scores for 2008 it seems, but not dramatically better. Next, postmark, since I/O should be where it shines, it being a server and all:
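As a quick sanity check on "slightly better", this sketch turns the Cinebench scores above into percentage differences. The scores are copied from the two result blocks; the key names are just labels for illustration.

```python
# Cinebench 64-bit scores copied from the two result blocks above.
vista = {"single_cpu": 3040, "multi_cpu": 5367, "opengl": 4256}
server_2008 = {"single_cpu": 3053, "multi_cpu": 5379, "opengl": 4478}


def percent_gain(old, new):
    """Percentage change going from the old score to the new one."""
    return (new - old) / old * 100


for name in vista:
    print(f"{name:<11} {percent_gain(vista[name], server_2008[name]):+.1f}%")
```

The CPU scores differ by well under half a percent, which is noise for a single run; only the OpenGL score moves noticeably, by about 5%.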

 

Vista:

Time:

        170 seconds total
        98 seconds of transactions (204 per second)

Files:
        20092 created (118 per second)
                Creation alone: 10000 files (200 per second)
                Mixed with transactions: 10092 files (102 per second)
        9935 read (101 per second)
        10064 appended (102 per second)
        20092 deleted (118 per second)
                Deletion alone: 10184 files (462 per second)
                Mixed with transactions: 9908 files (101 per second)

Data:
        548.25 megabytes read (3.23 megabytes per second)
        1158.00 megabytes written (6.81 megabytes per second)

 

2008:

Initially I had enabled the “enable advanced performance” option in Device Manager for the disk, since every tuning guide tells you to do so…

 

Time:

        136 seconds total
        45 seconds of transactions (444 per second)

Files:
        20092 created (147 per second)
                Creation alone: 10000 files (263 per second)
                Mixed with transactions: 10092 files (224 per second)
        9935 read (220 per second)
        10064 appended (223 per second)
        20092 deleted (147 per second)
                Deletion alone: 10184 files (192 per second)
                Mixed with transactions: 9908 files (220 per second)

Data:
        548.25 megabytes read (4.03 megabytes per second)
        1158.00 megabytes written (8.51 megabytes per second)

 

Much faster than Vista. I then disabled the “enable advanced performance” setting to see how much slower it would become:

 

Time:

        110 seconds total
        39 seconds of transactions (512 per second)

Files:
        20092 created (182 per second)
                Creation alone: 10000 files (454 per second)
                Mixed with transactions: 10092 files (258 per second)
        9935 read (254 per second)
        10064 appended (258 per second)
        20092 deleted (182 per second)
                Deletion alone: 10184 files (207 per second)
                Mixed with transactions: 9908 files (254 per second)

Data:
        548.25 megabytes read (4.98 megabytes per second)
        1158.00 megabytes written (10.53 megabytes per second)

 

Amazingly, much faster, not slower! I did some checking and this is what the setting actually does… it re-introduces an older, somewhat undesirable behavior. A bit hard to find the proper explanation, and I hope Microsoft makes what happens behind the scenes a bit more obvious. At the moment it’s quite obscure, and every guide tells you to enable it for performance. Just leave it alone. BTW the Vista score is with the setting disabled.
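To line the three postmark runs up side by side, the sketch below ranks them against the fastest one. Totals and transaction rates are copied from the results above; the labels and the helper name are only illustrative.

```python
# (label, total seconds, transactions per second) from the three runs above.
runs = [
    ("Vista SP1, setting off", 170, 204),
    ("2008, advanced performance on", 136, 444),
    ("2008, advanced performance off", 110, 512),
]


def slower_by_percent(total_s, best_s):
    """How much longer a run took than the fastest run, in percent."""
    return (total_s - best_s) / best_s * 100


fastest = min(runs, key=lambda run: run[1])

for label, total_s, tps in sorted(runs, key=lambda run: run[1]):
    print(f"{label:<32} {total_s:4d}s  {tps:4d} tps  "
          f"+{slower_by_percent(total_s, fastest[1]):.1f}% vs fastest")
```

With the setting on, 2008 takes about 24% longer than with it off, and Vista takes roughly 55% longer than the fastest 2008 run.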

 

Could I have run other benchmarks like Sandra etc? Sure, but I just wanted to keep it simple and there just wasn’t enough time.

 

The next step is to run the tests on the same hardware with XP. That’s forthcoming.

 

Conclusion:

 

Seems like Microsoft did something right. Even with the 64-bit version (which naturally takes more RAM than the 32-bit one), 2008 Server takes less memory than Vista (200-300MB less at any given time in my case), runs quicker and just feels better, kinda like an unencumbered Vista. Simple things like searching a huge index in Outlook happen much faster than before. The Server Manager app is awesome, and one can try out the Hyper-V Hypervisor (BTW that, predictably, clashes with VMware and disables your power management, so beware). A server OS is in general also more secure and, over time, probably more reliable, given the workloads it’s supposed to run.

 

Can everyone run it? Should they? No, not unless you have a license for 2008 through MSDN or somesuch, otherwise it’s expensive. Some assembly is also required, and you do need to know what you’re doing. However, if you’re so inclined, you can easily get the demo version of 2008. Apparently there are clean, documented ways to increase the evaluation period (no cracks or BIOS spoofers) that I think come from Microsoft but I’m not going to list them here just in case…

 

In addition, while almost all my apps installed fine (including games and hairy driver stuff like Daemon Tools), 2 things didn’t: Bluetooth and my Logitech mouse drivers. I don’t quite use Bluetooth but I liked some of the features of my mouse (the utterly kickass Logitech VX Revolution), now it’s just like a normal mouse. I’m still keeping 2008. I’m sure other stuff will have issues, like DRM/BluRay. For people that like the Windows Sidebar: there are hacks to get it working that involve copying stuff from Vista. I think the sidebar is largely useless.

 

FYI, there are 2 notable omissions in 2008: ReadyBoost and SuperFetch. SuperFetch exists as a service but to even get it to start you have to edit the registry. I didn’t think it helped much so I disabled it again. ReadyBoost isn’t even an option. And the old-style boot prefetch that worked in 2003 Server doesn’t seem to be there. So it does boot a bit slower than Vista, but not by much. Once you get the box up and running it’s fast though.

 

In the end, I’m leaving 2008 on my box, and that’s all that matters.

 

D

Mon
4
Feb '08

NetApp posts SPC-1 results

NetApp posted some SPC results showing their 3040 box performing pretty well in SPC-1 relative to an EMC box.

There have been rumors that performance suffers when you run multiple features on a NetApp box. That kinda negates the whole value prop of NetApp (since that’s when people typically choose NetApp - they want one box to do everything).

A realistic test would be to have OTHER apps sharing the array (on other spindles), as is usually the case. Almost nobody dedicates an entire array of that size to a single app.

Have the box do CIFS, NFS, iSCSI AND FC.

Show performance over a significant period of time (another point NetApp detractors use – performance declines over time due to WAFL fragmentation).

THEN show the performance delta as each feature is enabled.

Obviously hard to do and maintain kosher SPC results but it would be a worthwhile addendum and, if successful, would shut up the NetApp detractors (since that’s a usual technique for selling against NetApp). I’d also show performance in degraded mode.

Anyone have any data on NetApp performing either way when used as a multi-role box?

A note on the EMC config and interpreting those benchmarks in general, be they SPC or SPEC or whatever: ALWAYS READ THE FULL DISCLOSURE regarding the test, don’t just look at the graph. If you’re not technical, get a techie to explain it to you.

For instance, looking at the way the EMC box was set up, I highly doubt it was done using EMC’s best practices. To wit:

  1. They didn’t maximize the write cache
  2. They seem to not have used separate spindles for the snapshot area (a differentiator since, unlike NetApp, EMC not only allows such a thing to happen but actually encourages it)
  3. They could have used MetaLUNs more instead of striping using Windows.

I’d be willing to bet dollars to nuts that the NetApp box was set up properly :)

Another thing: look at the response times in the graphs.

Like they say, “only believe 50% of the statistics you read”.

D

Thu
20
Dec '07

Ate at Delmonico’s in NYC

I was helping out a customer with some backup issues in the Wall Street area and they happened to be literally across the street from Delmonico’s.

At the end of a particularly long day I thought I’d reward myself with a nice steak, and the proximity to the steakhouse made it hard to resist.

Delmonico’s is one of those places that have been around forever. Bit stuffy inside, I didn’t opt for the wet-aged Delmonico cut but instead went for the T-Bone (dry-aged on-premises). I also had a rather excellent salad with roasted tomatoes, herbs and mozzarella.

This is not going to be one of those inspired entries – the steak just wasn’t that good. It was undercooked, underseasoned and just lacked flavor. I probably should have gone for the house’s signature cut (the famous Delmonico cut) but any decent steakhouse should have no problems making a proper T-Bone…

Maybe I’ll give it another chance. Prolly not.

D

Sun
9
Dec '07

We need more wizards!

No, I don’t mean Gandalf, I mean the software kind. And before I’m accused of being Gates’ live-in cabana boy (it’s all baseless rumors), let me clarify.

It’s a known fact that most OSes need tuning (sometimes significant) to perform well with heavy-duty applications (I’m not talking about your home web server, I’m talking about Exchange, SAP, Oracle, IIS, Apache etc. in large deployments. I acknowledge the fact that most OSes, out of the box, will work OK for anything small).

Most frequently the application documentation will have some kind of tuning guidelines telling you approximately what to do in each OS. The installer sometimes will apply some tunings for you after asking for your permission. Often, the suggested settings are woefully inadequate for truly large implementations, as with NetBackup (the Veritas-suggested tunings work for smaller environments but I have some magical kernel tunings as posted before that make it truly fly when the ridiculous is asked of it – and the difference in the parameters between my config and what Veritas suggests is huge. Oh, and some of my parameters are way smaller than what Veritas recommends. And I won’t call them Symantec, Veritas is a way cooler name anyway, look it up in a Latin-English dictionary).

Frequently, some tunings are so common that I don’t even know why they’re not in the default configuration in certain OSes. Different conversation.

The problem is, there are experts that DO know how to set up and tune the systems properly, but said experts are rarely the admins that install and administer the thing. Usually, a fair portion of those experts do work at the companies that make the OSes and apps.

The elitist among us might say, “tough, the lowly admins need to learn all this stuff, otherwise they’re not worth what they’re paid”. To which I respond with the following points:

  • Not everyone has the time to learn the arcana of several OSes and applications, learning most of the important features is complicated enough and some shops are truly short-staffed
  • The über-experts themselves don’t know it all: They may know how to perfectly set up Exchange but wouldn’t know how to do the same thing with Oracle, how can the basic admins be expected to have such multi-discipline expertise?
  • I firmly believe in the simplicity of the appliance computing model
  • We all have more important things to do (like taking care of the big picture) than constantly worrying about minutiae
  • The people that complain that the admins should be more intelligent are typically the people that actually enjoy dealing with the arcane; their jobs are secure anyway
  • There’s money to be made in the simplification of IT – look at Microsoft, EMC/VMware and NetApp. People like simplicity and are willing to pay for it.

Of course, many larger companies will opt for professional services to do the job, but the quality of people just varies dramatically. Just because you’re getting an expensive Veritas PS guy doesn’t mean that

  1. He knows what the hell he’s doing beyond what’s in the installation manual (you know who you are!) and (less significantly)
  2. He is even a Veritas employee, despite his badge (most vendors subcontract to smaller companies).

At the moment, most OSes just apply generic formulas based on memory and/or number of CPUs, though somehow do not take into account CPU speed and load, and, indeed, the ancient formulas are a pain with today’s very large memory systems (usually you have to limit some tunables in large-memory HP-UX and Solaris boxes, otherwise some parameters get out of control).

I understand that making OSes truly self-tuning is not here yet, nor will it be for a while (64-bitness has taken away some of the pain though, at least in Windows). In the interim, there are better ways to approach the problem. My suggestion: Modernize the formulas that build the tunables and use simple AI techniques like Expert Systems. At installation time, benchmark the hardware and ask the user what will the server be running? OK, so if the answer is a web server, under what conditions? How many users? And so on. Admins are far more likely to know the answers to those questions than “how many open file handles do you think you’ll need?”

Based on the answers and the benchmark results, the system should either tell you what you want is possible, or bitch.

If the box is to be serving double-duty (or quintuple, in some cases), the wizard should check and see if the tunings will conflict and, if not, tune the whole box so that it can accommodate all the applications.
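To make the idea a bit more concrete, here is a toy sketch of such a wizard as a tiny rule-based system. Everything in it is hypothetical: the tunable names, the formulas and the questions are invented for illustration and do not correspond to any real OS. Taking the larger value when two roles disagree is just one possible policy.

```python
from dataclasses import dataclass


@dataclass
class Answers:
    """What the wizard asked the admin (all fields are illustrative)."""
    role: str            # e.g. "web" or "database"
    expected_users: int
    ram_gb: int


# Each rule: (condition, tunable name, function giving the suggested value).
# Tunable names and formulas are made up; they are not real OS parameters.
RULES = [
    (lambda a: a.role == "web", "max_open_files",
     lambda a: max(4096, a.expected_users * 8)),
    (lambda a: a.role == "web", "listen_backlog",
     lambda a: min(65535, a.expected_users)),
    (lambda a: a.role == "database", "shared_memory_mb",
     lambda a: a.ram_gb * 1024 // 2),
    (lambda a: a.role == "database", "max_open_files",
     lambda a: 65536),
]


def suggest(answers):
    """Apply every matching rule and return {tunable: value}."""
    return {name: value(answers)
            for cond, name, value in RULES if cond(answers)}


def merge(*suggestions):
    """Combine suggestions for a multi-role box.

    Keeps the larger value when roles disagree and records which
    tunables they disagreed on so the admin can be warned.
    """
    merged, conflicts = {}, []
    for suggestion in suggestions:
        for name, value in suggestion.items():
            if name in merged and merged[name] != value:
                conflicts.append(name)
                merged[name] = max(merged[name], value)
            else:
                merged[name] = value
    return merged, conflicts


web = suggest(Answers("web", expected_users=2000, ram_gb=16))
db = suggest(Answers("database", expected_users=200, ram_gb=16))
tunables, conflicts = merge(web, db)
print("tunables:", tunables)
print("roles disagreed on:", conflicts)
```

A real wizard would also feed in the benchmark results and refuse outright when the answers ask for more than the hardware can deliver; this sketch only covers the question-to-tunable mapping and the conflict check.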

If you’re creating a filesystem, what will the intended use be? The defaults for almost all filesystems are wrong! One size fits only the people that have that size. The problem is that, once you’ve put in several TB on filesystems someone built with the default parameters, changing them is almost impossible: you have to take a backup, destroy the filesystems, rebuild them then restore the data. Which could have been avoided if, say, maybe not the OS but at least Oracle had the smarts to query the FS and figure out it’s using insufficient log and block sizes and that performance will suck. At which point it should puke and tell you “sorry, this is sub-optimal, either do such-and-such to fix it or continue anyway at your peril”. But of course you’re using raw disks for Oracle, right? Right?
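The kind of check described above does not need to be complicated. A minimal sketch, assuming a Unix-like system where Python's `os.statvfs` is available: it only looks at the preferred block size (log size cannot be queried this way), and the 8192-byte threshold is an arbitrary example, not a recommendation from any vendor.

```python
import os


def block_size_warning(actual_bytes, wanted_bytes):
    """Return a warning string if the filesystem block size is smaller
    than what the application wants, otherwise None."""
    if actual_bytes < wanted_bytes:
        return (f"sub-optimal: filesystem block size is {actual_bytes} bytes, "
                f"application prefers at least {wanted_bytes}")
    return None


def filesystem_block_size(path):
    """Ask the OS for the preferred block size of the filesystem
    holding `path` (Unix only)."""
    return os.statvfs(path).f_bsize


# Example: check the filesystem holding the current directory.
# The 8192 threshold is an arbitrary value chosen for illustration.
if hasattr(os, "statvfs"):
    warning = block_size_warning(filesystem_block_size("."), wanted_bytes=8192)
    print(warning or "block size looks fine")
```

An application could run something like this at install time and either stop or ask for confirmation before carrying on.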

Or take the example of Logical Volume Managers. They are cool, yes. They can work great. They will also let you do insane things such as create multiple LVs and stripe them, even if they’re on the same physical disk! The checks that should have been performed are so ridiculously simple it boggles the mind.
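To show how simple that check could be, here is a sketch that refuses to stripe across LVs sharing a physical disk. The layout dictionary and all the names in it are hypothetical; a real volume manager would pull this mapping from its own metadata.

```python
def stripes_sharing_a_disk(stripe_members, lv_to_disks):
    """Return {disk: [LVs on it]} for every physical disk that backs
    more than one member of the proposed stripe."""
    by_disk = {}
    for lv in stripe_members:
        for disk in lv_to_disks[lv]:
            by_disk.setdefault(disk, []).append(lv)
    return {disk: lvs for disk, lvs in by_disk.items() if len(lvs) > 1}


# Hypothetical layout: two of the three LVs live on the same disk.
layout = {
    "lv_data1": {"disk0"},
    "lv_data2": {"disk0"},
    "lv_data3": {"disk1"},
}

clashes = stripes_sharing_a_disk(["lv_data1", "lv_data2", "lv_data3"], layout)
for disk, lvs in clashes.items():
    print(f"refusing to stripe: {', '.join(lvs)} all live on {disk}")
```

Striping across `lv_data1` and `lv_data2` here would just make one spindle seek back and forth, which is exactly the situation the check flags.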

HP kinda started doing something like this a while ago – look at the templates in SAM, you can apply 2-3 different (useless) templates based on what the box will be doing that will affect a few tunables. HP-UX is guilty of needing the most tuning of any current OS I can think of, BTW (It also pays great dividends if you know what you’re doing, I took a Superdome to 2x the I/O performance once, felt proud but it took a lot of effort and research that could have been avoided).

Seems like the intelligence that would make our lives easier is like the proverbial hot potato: always someone else’s problem.

I know it’s a tall order: the whole solution would rely on much deeper interoperability between the various components than we’re used to. But I think the end result would be worth it.

In the meantime, if you have to do it all yourself, at least use common sense and have some golden OS builds that are each good for a different use, then just replicate them as needed.

Anyway, all this is aggravating my hemorrhoids (I call them The Grapes of Wrath), better stop now.

D

 

Fri
7
Dec '07

(Very) Preliminary Windows Server 2008 impressions and Vista Multimedia Performance under battery power

Out of curiosity, I very briefly tried the new Server 2008 Release Candidate (freely available from Microsoft). I’ve been using Vista 64-bit since I need to see all the memory in my machine and, while it works mostly OK, there are some low-level scheduling issues with it – for instance, sound is really choppy on battery power, no matter what I do with the power settings, so I can’t use the thing to watch a DVD or listen to music on the plane. Many others seem to be having the same issues, despite the funky Multimedia Class Scheduler nonsense that Microsoft put in the OS that makes networking slower (great info here), even though older incarnations were not suffering from media playback issues under load. And no, if I disable the Multimedia Scheduler it does NOT work better, it actually gets worse, which means that the service is there to fix some other kludge-y issue Microsoft introduced with the scheduler or something like excessive power throttling of certain devices.

But, as usual, I digress. This is about Server 2008. What’s noteworthy is that Vista SP1 inherits the exact same kernel as Server 2008.

This will be a short entry, there are others online talking more about 2008. What I noticed:

  1. It’s light for a Windows OS. There’s no excessive bloat guys, the thing takes about 300MB of RAM with the default install, and more can be saved by trimming unnecessary services (of which there are very few).
  2. It’s fast. Under preliminary benchmarking, even the RC code (that probably has some features missing and extra debugging code) seems about as fast as 2003 after SP2 (unlike others that have been releasing benchmarks of, say, Vista SP1 in its pre-release form, I’d rather wait until the final code is out).
  3. Seems to work with most Vista drivers so, if you want to turn it into a workstation, you can. You can also install the Vista GUI if you’re so inclined with no adverse effects (aside from the ones that come with the Vista UI that is). Runs very smooth.
  4. Application compatibility is similar to that of Server 2003.
  5. The OS does NOT suffer from the same issues as Vista regarding media playback (I made sure I installed the Power Management driver and selected the same kind of PM scheme as Vista). Maybe a good omen come Vista SP1? We shall see.

The new management interfaces are nicely laid out, and selecting Roles for the server and adding or removing features as needed is very simple. It feels more like a well-integrated 2003 R3 rather than Vista.

I didn’t get to play with the new virtualization, it doesn’t seem to be in the RC code (though, reading some documentation, it seems as if it will have VMotion-like capabilities, which I will believe when I see).

UPDATE: 12/17/07

There is no more Vista multimedia performance issue on 2 separate computers. Some patches just released by Microsoft removed the issue (plus the issue of the mouse cursor stuttering). Interestingly, the patches had no mention of fixing said issues. I thought it was a fluke but having seen this fixed on 2 different boxes (one 32-bit, one 64) I don’t think it is.

For the Vista detractors: I’d advise everyone to wait until SP1 – as with most Microsoft releases. It’s no different. They’re actually getting better, NT4 was unusable until SP3 at least… given the unreal amount of code in the system, I’m surprised it runs this well. They really need to slim it down. Supposedly, Windows 7 will be slimmer (http://apcmag.com/7668/beyond_vista_windows_7_what_we_know_so_far). However, it mostly targets the kernel and it was never the Windows kernel that was the issue (it’s actually surprisingly decent), it’s all the crud around it.

D

Thu
15
Nov '07

My opinion on the Sun/NetApp altercation: Both companies should be grateful instead of resorting to lawsuits

Since opinions are like you-know-what, and since I’m decidedly anatomically complete in that respect (some, indeed, claim all of me is composed of the implied anatomical part, so maybe that’s why I’m so opinionated), I thought I’d throw my $0.02 in the pot and not stay silent. The whole issue irks me quite a bit, actually.

Like my colleague, Rich, and I think most digerati (there’s a nice word whose time came and went, it seems), I have been following the machismo display between Sun and NetApp (see some representative comments from both sides here and here). BTW, I doubt anything will really happen with the lawsuits, and highly doubt even that money will change hands out-of-court to settle this. This is more about chest-thumping than anything else. But, in a nutshell, it seems it all started due to NetApp wanting to buy some STK patents (from before the STK acquisition), Sun not wanting to sell but instead asking for $36m to license the patents, NetApp being upset and telling Sun they infringe their WAFL patent with ZFS, then Sun telling NetApp to stop selling filers. Those guys are all nuts. I may be missing some facts (NetApp is super-cagey about what STK stuff they wanted) but they are all still nuts.

It seems people will try to patent anything these days. But going after people that you think infringed your patents can be pathetic if your story is not airtight and your goals are not noble – remember SCO?

I do believe in protecting one’s IP in some way – whether the best way is a patent I’m not so sure, there’s always copyright. I’m not as naïve as some open source zealots that think all patents are evil and that all software should be free. I wonder where they work and how they all make their living? Do those guys all work in places that only do open source and just give away stuff? If I develop a piece of truly cool IP that can result in me making money, rest assured I’ll try to capitalize on it.

However, I do believe that the current patent system is flawed. It’s also difficult (I think impossible) to find people technically competent enough to oversee the process. For instance (and, to cut to the chase), I would have denied NetApp the WAFL patent, since

  1. It’s a simple evolution and/or modification of existing block allocation schemes to facilitate writes (more technical info later on)
  2. There were other COW (Copy On Write) filesystems prior to NetApp, such as LFS and numerous research projects. Specifically,
  3. Daniel Phillips had done most of the COW work prior to NetApp’s patent, but had to abandon work on the tux2 filesystem due to fear of patent laws (see here). He didn’t file a patent first, since nobody that does open source development is thus inclined.

     

But where do you draw the line on what’s truly new and patentable? And what if enforcing a patent is detrimental to the common good? Should Xerox have patented the mouse? It was totally new back then. What if they’d enforced the patent and told Apple and later Microsoft that they are not allowed, no matter what, to use a mouse? Or if HG Wells patented the science fiction novel? If Hoover patented the vacuum cleaner? If RCA patented the television? You get my drift. There would be zero innovation.

I think patenting obvious stuff should just not be allowed. And, if your patent is based on prior art (regardless of whether it’s been patented), it should be summarily denied. If the patent is granted but is then proven after the fact that someone else had figured out the idea first (as in the case of Mr. Phillips), the patent should automatically be invalidated. Complex, no?

Which is why many think that patenting software should not be allowed.

At the end, with some problems, there is only a finite number of solutions (often only one). Researchers may be working simultaneously on the problem. Eventually, only one will be first with a solution. I am opposed to penalizing the other guy simply because he used a similar algorithm to mine (especially when, mathematically, there may be zero other solutions, making every approach to solve the problem produce the same result).

Back to Sun and NetApp. The truth is, I think, pretty simple. While I have enormous respect for both companies (a bit more for Sun, due to their history and my extensive personal experiences), both companies’ major products are based on a tremendous amount of prior art (patented or not, nobody seems to have complained to either company). Truly, they stand on the shoulders of proverbial IT giants. Sun has the PR benefit of having contributed vast amounts of IP to the world, compared to NetApp (though some technologies like NFS and Java have been pretty painful, so it’s a mixed blessing).

NetApp code heavily borrows from Unix, Sun, IBM, Cisco, EMC and many others. For instance, since Data ONTAP (NetApp’s OS) can’t scale beyond 2 boxes, NetApp purchased Spinnaker – SpinOS creates a single namespace that can transcend many nodes (BTW other products such as IBRIX, Exanet and others can do the same thing really well). The current GX OS is bits from the older ONTAP on top of FreeBSD with some SpinOS bits. However, both the older 7G and the newer GX OSes are offered, since 7G does a lot more (SpinOS can be just large-scale NAS – no iSCSI or FC block device targets, even if those targets on a 7G box are just files, but I digress). Of course NetApp wants to move everyone to SpinOS, which explains NetApp’s current craze with NFS everywhere. It’s infectious, now all of a sudden once again everyone wants to use NFS – VMWare, Oracle, senile grannies running compute clusters all over the world. We get it, it’s a shared-namespace, network-based FS, and sure, you can run pretty much anything on it. People have been for decades. How quickly we forget that it really isn’t the best network-based filesystem, and that there was a reason people developed cool alternative technologies such as AFS, Coda, PVFS, the native IBRIX mode, and many others. The new CIFS that’s part of Windows Server 2008 is actually a really decent implementation, but I’ll probably get flamed by the NFS fanbois for saying so.

And how quickly people forget that it was Sun that gave us NFS, warts and all (well, v4.1 ain’t too bad but that’s a collective effort – the wonders of open source). The rather execrable CIFS, BTW, (the other main NetApp “technology”) was not invented by Microsoft but rather by IBM in 1983. IBM and Cisco invented iSCSI. Legato (now owned by EMC) played a fundamental role in developing NDMP. And I can’t even remember who first created versioning filesystems but I fondly remember my VAXes and they used to do that stuff ages before NetApp even existed (not to mention proper manly-man single-system-image clustering, but that’s a story for another day). I’m pretty sure NetApp didn’t develop Fibre Channel, either.

Cut to today: Now everyone can do snapshots, it’s almost de rigueur, and the truly cool do application-aware snaps.

Volume management is standard, too.

Filesystem expansion is everywhere.

Thin provisioning (not a fan but anyway) is becoming more and more prevalent.

iSCSI is everywhere.

So, the real ZFS issues NetApp is complaining about seem to be the “Write Anywhere” and COW parts, since those are really the only true similarities with WAFL. Seriously, like that’s what’s the most important aspect of Sun’s ZFS. Indeed, while very quick for initial writes, a write-anywhere algorithm can lead to horrific fragmentation and continuously-declining performance over time (which is why you have to defrag NetApp filers). It’s just a safe, easy and computationally cheap method for allocating blocks to minimize write time for write-heavy applications such as NFS. Possibly one of the reasons NetApp did it was because in their boxes there are no RAID controllers, there’s just a CPU or two (486’s I believe in the original boxes) that has to do EVERYTHING – RAID calcs, rebuilds, snaps, caching, etc (the back end of all NetApp gear is JBOD). Using WAFL a lot of the inefficiencies in RAID are bypassed, since it will schedule multiple writes in order to fill a RAID stripe. A more elegant approach such as extent-based allocation (like VxFS) would have been too computationally-intensive, especially for writes. Dave and his pals have a good paper on WAFL here, BTW.
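To make the fragmentation point a bit more concrete, here is a toy model of write-anywhere, copy-on-write allocation in Python. It’s only a sketch: the class name, the always-take-the-next-free-block allocator and the block counts are my own assumptions for illustration, and it is in no way how WAFL or ZFS are actually implemented.

```python
class WriteAnywhereVolume:
    """Toy copy-on-write volume: an overwrite never touches the old block."""

    def __init__(self):
        self.next_free = 0   # hypothetical allocator: always hand out the next unused block
        self.block_map = {}  # logical block number -> physical block number

    def write(self, logical_block):
        # Write anywhere: new data lands in a fresh physical block and the map
        # is repointed. The old block is left intact, so a snapshot can keep it.
        self.block_map[logical_block] = self.next_free
        self.next_free += 1

    def snapshot(self):
        # A snapshot is just a frozen copy of the map, which is why snaps are cheap.
        return dict(self.block_map)

    def physical_layout(self):
        # Physical locations listed in logical (file) order.
        return [self.block_map[b] for b in sorted(self.block_map)]


def count_seeks(layout):
    """Count places where logically adjacent blocks are not physically adjacent."""
    return sum(1 for a, b in zip(layout, layout[1:]) if b != a + 1)


vol = WriteAnywhereVolume()
for block in range(8):        # initial sequential write of an 8-block file
    vol.write(block)
fresh_layout = vol.physical_layout()
snap = vol.snapshot()

for block in (5, 1, 6, 2):    # a few scattered overwrites later on
    vol.write(block)
aged_layout = vol.physical_layout()

print("fresh:", fresh_layout, "seeks:", count_seeks(fresh_layout))
print("aged: ", aged_layout, "seeks:", count_seeks(aged_layout))
print("snapshot still points at the old block 5:", snap[5])
```

The writes themselves are always sequential on disk (fast), but a sequential read of the aged file has to hop around, and the more overwrites, the worse it gets. The snapshot costs nothing beyond the map because the old blocks were never touched.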

Here’s what ZFS is: It was not meant to be a NetApp killer, it’s just a truly modern FS, with few limits, and an amalgam of all the current “cool” technologies and ideas. Snaps, thin provisioning, expansion, volume management, pools, quotas, self-healing, all in a single technology, that’s surprisingly well thought out, and easy to use even from the command line. ZFS is not the raison d’être of the Solaris OS, but merely a feature of it. Plus it does data checksumming with every write, which other filesystems don’t. Your data is exceptionally safe in ZFS. Some test results here. More features here, and it’s easy to see NetApp getting annoyed after reading that page (though they just think COW is a good idea, the other tremendous features are not in NetApp’s WAFL). Not sure if they fixed the read performance issues NetApp has with their implementation, I need to do some testing of my own.

In my opinion, the only reason NetApp became popular is because it trivialized the whole NAS aspect. Made it easy to build decent, clustered NFS/CIFS boxes without the need to know UNIX. If Sun had put a wizard-driven GUI to perform such actions in their boxes 10 years ago, NetApp might not exist today. To date, I think Sun’s management tools are pathetic, no matter how amazingly solid the underlying tech might be. There’s a GUI for ZFS but, again, that’s beside the point. Aside from initial write performance, a NetApp filer is not about WAFL, extending disk pools and whatnot, it’s about all-around ease-of-use and the sheer amount of cool features.

If NetApp wants to sue someone so badly, maybe they need to sue the Openfiler or FreeNAS developers? Or, if they want to go after someone that’s not open source, how about Open-E? That stuff sure looks much more similar to NetApp than anything made by Sun. Really cool, too. Or maybe they need to sue EMC. Those guys sure make some nice, full-featured NAS gear. Among a myriad other solutions…

Suing someone over a filesystem that’s newer and better in almost every single way than yours but uses one common (and unavoidable in the case of COW) design methodology is just plain silly… and, BTW, how did this escape the patent trolls? Another COW implementation?

And if more developers like Daniel Phillips get scared because of patent laws, then innovation will truly be stifled. The whole point of research is that you can reference other people’s ideas so you don’t always have to re-invent the wheel.

NetApp needs to innovate a bit more themselves. They developed a cool technology and have milked it to death, and even made it do things it shouldn’t (like iSCSI and FC targets, the NetApp approach is really unclean but they are trying to force their OS to do everything, whereas companies like EMC go for the more modular approach and are criticized for being “complex”).

I think I’ll stop writing now since it’s getting late. Never was one to save posts for editing later.

D

Fri
26
Oct '07

Ate at the Staghorn steakhouse in NYC

At the insistence of my colleagues (that seem to enjoy the steak posts more than the high-falutin’ technology ones) I decided to visit another NYC steakhouse.

It was raining, I didn’t feel like going further so I went to a place near the office at 2 Penn Plaza (Madison Sq. Garden).

It’s a newer place called the Staghorn on 36th, just west of 8th Ave. Really nice and modern inside, unlike most other NYC steakhouses. Almost totally empty.

The prices are a bit below other joints, probably because the cuts are not quite as colossal.

I opted for a T-bone this time and a house salad. All the cuts had the same price, BTW.

The salad had an excellent vinaigrette with a touch of oregano. I fortified it with a tiny bit of blue cheese.

The steak was truly excellent, dry-aged, with a wonderful nuttiness and caramelization, exhibiting slight undertones of hazelnut.

Not perfect though - had the cut been a bit thicker it would have been juicier, another 4-5 oz wouldn’t be too much to add. Nonetheless, a wonderful piece of beef. In the thicker parts it was amazing in tenderness, texture and flavor.

I finished with a rather good tiramisu that was a touch on the oversoaked side but very tasty.

Recommended. This place shouldn’t be as obscure.

D

Mon
15
Oct '07

Uptempo cache can get paged out! (EDIT: After all, it does NOT).

I normally don’t do retractions unless proven wrong. So, ignore the text below and read Nick’s comment.

—————————-

A warning to those who use Datacore’s Uptempo:

While it works wonderfully as long as the server doesn’t suffer a low memory condition, the memory it reserves for cache will get paged out in low-memory situations.

I found out the hard way (as usual), while running some very demanding VMs (I only have 2GB and not the best laptop, a new machine is forthcoming). The way Uptempo reserves memory is by using a specific process, Dscaddmemory or something like that (I’ve now removed it from my system so I can’t remember the exact name). If you look at Task Manager, that process has as much memory allocated to it as you’ve allocated Uptempo.

When I was running out of RAM, I noticed that the process started shrinking in size, until it was 16MB (out of 280MB). Windows, since it looks like a normal process, decided to page it out in order to reclaim RAM.

Of course, this kinda defeats the purpose. I’d rather page out everything BUT my fancy dedicated cache, the way HP-UX does it if you tell it to (story for another day but HP-UX cache tends to work better if you specify the min and max sizes as the same and not let it auto-allocate).

My real beef with Uptempo is that it didn’t try to reclaim the memory when there most obviously was enough memory for it (after it paged itself out needlessly, I had over 350MB free and plenty in the Windows cache).

It didn’t even try to reclaim the RAM after I quit VMWare and had 1.5GB free.

Obviously, either I’m missing something fundamental or some work needs to be done. Granted, any time you are forced to swap heavily, cache won’t help much, but they should at least be giving the memory back to the process afterwards.

Supercache never shows up as a process, it grabs the memory when the system boots (it’s one of the first things that happen) and nothing can swap it out. It’s also configurable on-the-fly, Uptempo needs a reboot for any size changes.

With 64-bit all these helper caching programs will probably become obsolete since cache is not limited to 1GB any longer. Though I’m not sure I subscribe to Vista’s Superfetch, since it does make the HD work like crazy when you first start the box and is more suited for boxes that are not shut down it seems. Once it settles down it works OK.

D

Wed
26
Sep '07

Ate at The Old Homestead in NYC

I’ve been hopped up on uppers all day (relax, just a huge amount of chocolate-covered high-test espresso beans, though the amount of caffeine was surely enough to get me disqualified from competing in any sport - every time I pee it smells like freshly-brewed coffee). Needing something to relax me, and since my bowel movements have been altogether too easy lately, I thought I’d go for steak. Two birds with one stone.

It’s been a while since my last red meat extravaganza, and, at the behest of my buddies, I tried The Old Homestead, on 14th and 9th.

The place is a bit old-fashioned, as befits most NYC steakhouses. There’s this weird old sign, stating this place is “the king of beef”.

I bumped into Odin on the way in, he was ordering takeout for the lads. We exchanged knowing nods, told him to say hi.

I was served by a decrepit waiter with a handlebar moustache, he probably was almost too old to fight when he was drafted in WWI. He had an accent so I asked him where his pith helmet was. He, in turn, recommended the 36oz ribeye, priced no more than lighter fare on the menu. Once again, I asked for an internal temperature between 145F and 150F, once again I got a blank stare. So far, only the people at Emeril’s Delmonico in Vegas have been able to respond to this request without batting an eyelid. But that is a story for another day.

I also ordered a chopped salad since I’ve been told I need some roughage. The salad was amazing, and enough for two. I ate the whole thing, not one to ignore roughage consumption guidelines.

Then the steak came.

Oh dear.

The bone wasn’t even that big. The rest was all meat and a bit of fat. This is, to date, the largest single steak I’ve had (though not, alarmingly, the largest amount of meat I’ve consumed in one sitting). And was it good! It was served with a roasted head of garlic, French style. Not quite the consistency of the steak in Flames (that was almost like good Ahi) but still awesome.

I almost couldn’t eat the whole thing. But I did, it was that good. By the end I felt like Mr. Creosote in Monty Python’s The Meaning of Life. And I did not have the “waffer thin mint”.

On the way back to the train, it was hot and, after all this food, I started sweating profusely. I passed by a funeral parlor on 14th and the proprietor eyed me appreciatively. This is not hot-weather food!

Highly recommended.

D

Thu
20
Sep '07

WAN acceleration for remote workers

The deluge of WAN accelerators from Cisco, Riverbed, Juniper, Expand, Packeteer, Bluecoat, Silverpeak etc. etc. is proving good for datacenters. Not sure how many vendors will remain viable in a year or two, but the selection at the moment is decent.

However, most of the vendors don’t address remote desktop acceleration, say for people using 3G cards on their laptops or even cable modems - sometimes the routing to corporate networks can be arcane enough that the ms of latency add up, plus most home connections are asymmetrical anyway.

So, it would be pretty cool to have a WAN accelerator in your laptop, right? Well, so far only two companies have stepped forward:

The far more established product, even if you’ve never heard of it, is AcceleNet Enterprise from ICT (Intelligent Compression Technologies, www.ictcompress.com - they were recently bought by ViaSat). ICT has been doing just this for years, with a veritable who’s who of clients (no they haven’t paid me to say this, I just think the stuff is cool). Lots of service providers use it.

ICT deploys a server that acts as a proxy, then you install an agent on your laptop. Transfers are compressed both ways.

The other vendor is known to us all - it’s Riverbed. They now have what’s called Steelhead Mobile. Effectively, it puts a Riverbed box inside your laptop. It needs a normal Steelhead to communicate with, as well as a Steelhead Mobile Controller for management. I saw pricing for the controller and it was a bit dear…

You can even adjust how much cache to give your mini-Riverbed, so if you have the space, go nuts.

Of course, you can also use this technology for servers and save money on appliance costs - I wonder if they have something that checks if you’ve installed it on a server OS, and how much CPU it takes to do its thing.

I heard somewhere Cisco is also working on something similar, unsurprisingly.

D

Fri
17
Aug '07

Processor scheduling and quanta in Windows (and a bit about Unix/Linux)

One of the more exotic and exciting IT subjects is the one of processor scheduling (if you’re not excited, read on, practical stuff to be seen later in the text). Multi-tasking OSes just give the illusion that they’re doing things in parallel - in reality, the CPUs rapidly skip from task to task using various algorithms and heuristics, making one think the processes truly are running simultaneously. The choice of scheduling algorithm can be immensely important.

Wikipedia has a nice article on schedulers in general: en.wikipedia.org/wiki/Scheduling_%28computing%29, good primer.

To cut a long story short: the processors are allowed to spend finite chunks of time (quanta) per process. Note that the quantum has nothing to do with task priority, it’s simply the amount of time the CPU will spend on the task. Every time the CPU switches to a new process, there’s what’s called a context switch (en.wikipedia.org/wiki/Context_switch), which is computationally expensive. Obviously, we need to avoid excessive context switching but still maintain the illusion of concurrency.

In Windows Server (that uses a multi-level feedback queue algorithm, FYI), the default quantum is a fixed 120ms, close to many UNIX variants (100ms) and generally accepted as a reasonably short length of time that can fool humans into believing concurrency. Compare this to the workstation-level products (Windows Vista/XP/2000 Pro) that have a variable quantum that’s much shorter and also provide a quantum (not priority) boost to the foreground process (the process in the currently active window). In the workstation products, the quantum ranges from 20-60ms typically, with the background processes always relegated to the smallest possible quantum, ensuring that the application one is currently using “feels” responsive and that no background task hampers perceived performance too much. Typically, in a box that’s used as a busy terminal server this will be the better setting to use since it will ensure that the numerous “in-focus” user processes will all get a quantum sooner rather than later.

The longer, fixed quantum of Windows Server means that fewer system resources are wasted on context switching, and that all processes have the same quantum. More total system throughput can be realized with such a scheme, and it’s more of a fair scheduler. It also explains the higher benchmark numbers when running the scheduler in “background services” mode. It’s obviously best for systems that are running a few intensive processes that can benefit from the longer quantum (and, believe it or not, games and pro audio apps run better like this).
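A back-of-the-envelope way to see the quantum/throughput tradeoff, as a Python sketch. The cost of a single context switch below is a number I made up purely for illustration (it is not a measurement of any Windows version), and the model assumes the worst case where every thread burns its entire quantum before being switched out.

```python
def switch_overhead(quantum_ms, switch_cost_ms):
    """Fraction of CPU time lost to context switches when every thread
    runs for its full quantum and is then switched out."""
    return switch_cost_ms / (quantum_ms + switch_cost_ms)


SWITCH_COST_MS = 0.1  # assumed figure for illustration only, not a measurement

quanta = [
    ("server, fixed", 120),
    ("workstation, long end", 60),
    ("workstation, short end", 20),
]

for label, quantum_ms in quanta:
    pct = 100 * switch_overhead(quantum_ms, SWITCH_COST_MS)
    print(f"{label:24s} {quantum_ms:4d} ms quantum -> {pct:.3f}% lost to switching")
```

Whatever the real switch cost is on your hardware, the shape is the same: cut the quantum to a sixth and you pay roughly six times the switching overhead. The model only counts the direct cost of the switch, so treat it as a lower bound.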

Note that I/O-bound threads (processes waiting on disk, mouse, screen and keyboard I/O) are given priority over CPU-bound threads anyway, which explains why the longer quantum doesn’t harm interactivity much. Try it - have 4 winzip/winrar/7zip sessions running concurrently. You CAN still move your mouse :) Here’s a great primer on internal windows architecture: elqui.dcsc.utfsm.cl/apuntes/guias-free/Windows.pdf. Another, deeper dive: download.microsoft.com/download/5/b/3/5b38800c-ba6e-4023-9078-6e9ce2383e65/C06X1116607.pdf.

Of course, there are ways to tune the timeslice in a more fine-grained fashion. In the registry, check out HKLM\SYSTEM\CurrentControlSet\Control\PriorityControl\Win32PrioritySeparation . Two great explanations of how it works: www.microsoft.com/technet/prodtechnol/windows2000serv/reskit/regentry/29623.mspx?mfr=true and www.microsoft.com/mspress/books/sampchap/4354c.aspx.

For instance - what if you don’t care to increase the quantum on the foreground window but, instead, just want short, fixed quanta (effectively around 60ms) for all processes to improve response time on a system with a lot of processes? Setting Win32PrioritySeparation to 0x28 will take care of that.

Here’s a useful Win32PrioritySeparation chart from forums.guru3d.com/showthread.php?p=1451631#post1451631:

2A Hex = Short, Fixed , High foreground boost.
29 Hex = Short, Fixed , Medium foreground boost.
28 Hex = Short, Fixed , No foreground boost.

26 Hex = Short, Variable , High foreground boost.
25 Hex = Short, Variable , Medium foreground boost.
24 Hex = Short, Variable , No foreground boost.

1A Hex = Long, Fixed, High foreground boost.
19 Hex = Long, Fixed, Medium foreground boost.
18 Hex = Long, Fixed, No foreground boost.

16 Hex = Long, Variable, High foreground boost.
15 Hex = Long, Variable, Medium foreground boost.
14 Hex = Long, Variable, No foreground boost.
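The chart follows a pattern: read the value as three 2-bit fields (quantum length, fixed vs. variable, foreground boost). Here is a small Python sketch that decodes a value that way. The field meanings are inferred purely from the twelve rows above, the function name is mine, and any field value the chart doesn’t list is reported as "not in chart" rather than guessed at.

```python
# Field meanings inferred from the chart above; anything the chart does not
# list is reported as "not in chart" instead of being guessed.
LENGTH = {1: "Long", 2: "Short"}
KIND = {1: "Variable", 2: "Fixed"}
BOOST = {0: "No", 1: "Medium", 2: "High"}
UNKNOWN = "not in chart"


def decode_priority_separation(value):
    """Split a Win32PrioritySeparation value into its three 2-bit fields."""
    length = (value >> 4) & 0b11   # bits 5-4: quantum length
    kind = (value >> 2) & 0b11     # bits 3-2: fixed or variable quanta
    boost = value & 0b11           # bits 1-0: foreground boost
    return (
        LENGTH.get(length, UNKNOWN),
        KIND.get(kind, UNKNOWN),
        BOOST.get(boost, UNKNOWN),
    )


for value in (0x2A, 0x28, 0x26, 0x18, 0x14):
    length, kind, boost = decode_priority_separation(value)
    print(f"{value:02X} Hex = {length}, {kind}, {boost} foreground boost.")
```

So 0x28 is binary 10 10 00: short quanta, fixed, no foreground boost, which matches the chart.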

Here are some other pages where others have figured out the effective quanta (and remember the numbers are not in ms): blogs.msdn.com/embedded/archive/2006/03/04/543141.aspx (for embedded Windows, I have doubts about the accuracy of his calculations regarding the effective quantum but still interesting), www.microsoft.com/technet/sysinternals/information/windows2000quantums.mspx (for Windows 2000, probably still valid).

Here’s a really nice article on the effects of schedulers and I/O-bound processes on virtualization: regions.cmg.org/regions/mcmg/m102006_files/6187_Mark_Friedman_Virtualization.doc

Linux, on the other hand, has not one but several totally different CPU schedulers and I/O elevators available. Just see this page, comparing 2.6.22 with Vista’s kernel, and note how many non-standard features are available as patches: widefox.pbwiki.com/Scheduler . You can get schedulers with cool names such as genetic, anticipatory, etc. Linux used to suffer on the desktop, but with recent patches interactivity has improved tremendously, and Linux is now far more viable as a desktop OS. Here’s some cool info on anticipatory schedulers: www.cs.rice.edu/~ssiyer/r/antsched/. The anticipatory scheduler can help systems with slower I/O (laptops and desktops, especially) feel more interactive, and it was the default I/O elevator for a while (CFQ is the current default for I/O, though it can have issues with desktop users, see ubuntuforums.org/showthread.php?t=456692). A list of all the I/O elevators in the kernel: ebergen.net/wordpress/2006/01/26/io-scheduling/. Whitepapers: www.cs.ccu.edu.tw/%7Elhr89/linux-kernel/Linux%20IO%20Schedulers.pdf, www.linuxinsight.com/files/ols2004/pratt-reprint.pdf, www.linuxinsight.com/files/ols2005/seelam-reprint.pdf .
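If you want to see which I/O elevator a disk is using, 2.6 kernels expose it in sysfs under /sys/block/<device>/queue/scheduler, with the active one shown in square brackets. A minimal Python sketch for reading it follows; the helper names are mine, "sda" is just a placeholder device, and the sample line is an example of the format (the actual list depends on how your kernel was built).

```python
import re
from pathlib import Path


def active_elevator(scheduler_line):
    """Return the elevator shown in [brackets], or None if there isn't one."""
    match = re.search(r"\[([^\]]+)\]", scheduler_line)
    return match.group(1) if match else None


def available_elevators(scheduler_line):
    """Return every elevator named on the line, brackets stripped."""
    return scheduler_line.replace("[", "").replace("]", "").split()


def read_scheduler_line(device="sda"):
    """Read the raw sysfs line for a block device ("sda" is a placeholder)."""
    return Path(f"/sys/block/{device}/queue/scheduler").read_text().strip()


# Example line in the sysfs format; not read from a real machine.
sample = "noop anticipatory deadline [cfq]"
print("active:   ", active_elevator(sample))
print("available:", available_elevators(sample))
```

On a live box you would call active_elevator(read_scheduler_line("sda")). Switching elevators on the fly is done by writing one of the listed names to that same file as root.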

Recently, Linux moved to the Completely Fair Scheduler model (www.osnews.com/story.php/18240/Linux-Switches-to-CFS-Scheduler-in-2.6.23), sparking a lot of controversy (www.osnews.com/story.php/18350/Linus-On-CFS-vs.-SD) since it’s not quite done yet (kerneltrap.org/node/14055). More info on CFS: immike.net/blog/2007/08/01/what-is-the-completely-fair-scheduler/.

Interesting benchmarks showing the effects of scheduling on Linux performance: developer.osdl.org/craiger/hackbench/, math.nmu.edu/~randy/Research/Speaches/Disk%20Scheduling%20In%20Linux.ppt.

For anyone wishing to test the various Linux schedulers’ impact on interactivity, Con Kolivas has something: members.optusnet.com.au/ckolivas/interbench/. Con’s Staircase/Deadline (SD) scheduler (lwn.net/Articles/224865/) didn’t make it to the mainline kernel, unfortunately, and a miffed Con announced he’s dropping out of kernel development. Pity, since I think he single-handedly contributed more to the advancement of Linux interactivity on the desktop than anyone else. It’s great to have the choice of schedulers depending on how you’re planning to use your system - it’s already done with the I/O elevator, let it be done with the CPU scheduler. Instead, Linus invoked his Papal-like powers and made what I consider to be an unsound decision.

The real issue with Linux though is the userland. Here’s a great paper showing issues with the userland and how it robs us of speed: ols2006.108.redhat.com/reprints/jones-reprint.pdf . A lot of the CPU and I/O scheduler design is workarounds for those issues. Unless one deliberately chooses a stripped-down Linux distribution, the amount of bloat in the current code is incredible.

Finally, Solaris 10 also comes with a bunch of different schedulers, which you can assign globally or on a per-process/project basis. Tons more info: www.princeton.edu/~unix/Solaris/troubleshoot/schedule.html, blogs.sun.com/andrei/date/20050131, wiki.its.queensu.ca/display/JES/Solaris+10+Containers+and+Fair+Share+Scheduling, docs.sun.com/app/docs/doc/816-0222/6m6nmlsug?l=en&a=view.

Heady reading, no?

D

Thu
9
Aug '07

Ate at AJ Maxwell’s in Manhattan

Once more, dear reader, I place my colon’s health at peril for your reading pleasure and culinary edification.

I could have gone to Via Brazil for a proper feijoada by walking a few yards from my hotel but, instead, I sacrificed variety on the altar of dedication and had another bone-in ribeye. It is my mission to eat at all the decent NYC steakhouses.

For those who don’t know me (and many who do): I don’t eat steak all the time… indeed, I consider myself a veritable gourmand (and I do know the difference between gourmand and gourmet, as do my belts).

Anyway: ordered a medium-rare ribeye. They chargrill their steaks at AJ Maxwell’s so if you don’t like them that way don’t go. If you do, the steaks are good. The meat was tender and flavorful. It looks colossal but it is (they say) just 22oz. It looked huge and was over 2in thick. Probably 22oz after cooking.

I read some reviews and typically the people that complain asked for medium or medium well. If the piece is that thick and they chargrill it, rest assured the exterior will be pretty crispy if you want medium. By the same token, getting medium rare could mean some parts are pretty rare indeed. Not the place to be if you like medium and above.

I actually thought it was better than Bobby Van’s though still not as good as Flames. However, eating once someplace is not enough of a statistical sample. It’s beef after all, not purified water. Not the easiest thing in the world to be consistent with. Hence the incredulity of most people when I tell them that I had the best steak of my life at Wollensky’s. Maybe I got lucky. Hey, at least I said Wollensky’s, not Applebee’s… it’s a legitimate steakhouse.

After a few months I’ll definitely need colonics to get rid of the barnacles.

BTW, if you just want to read about technology you can select the topics at the top of the screen so you don’t have to read about my steak-eating adventures. Or vice versa.

D

Wed
8
Aug '07

Ate at Bobby Van’s in Manhattan

After the glowing reviews of a colleague I ate at Bobby Van’s on 230 Park. It’s considered to be one of the better NYC steakhouses (there are 4 in the chain, most in NYC).

I got a bone-in ribeye and some mushrooms.

I asked for a 145°F internal temperature and the decrepit waiter looked at me like I had three heads. “What does that mean?” I said medium rare…

The steak was pretty good, though slightly overcooked and not as flavorful as what I had at Flames. It was also a bit dry for a ribeye and totally unseasoned. Still, not a bad cut.

The mushrooms provided some lubrication.

Not a religious experience, I’ll try the Old Homestead tomorrow hopefully.

D

Mon
30
Jul '07

Just how much is your antivirus harming your I/O?

I just got a new corporate laptop, a nice, shiny T60 (OK, it’s IBM black and therefore thoroughly incapable of reflecting on any part of the spectrum).

I noticed that doing disk-intensive work was much slower than I’ve been used to. I configured it as a server (see previous posts) and that helped a bit but not as much as I’d like to.

It seems the antivirus software is checking each and every file, and takes 100% of a CPU to do so. Were this not a dual-core box it would be begging for mercy.

Taking an entire CPU is unacceptable IMO. So I ran some benchmarks - the trusty postmark once more to the rescue:

 

After tweaking as a server, antivirus running, 100% CPU utilization while bench running:

Time:
344 seconds total
230 seconds of transactions (86 per second)

Files:
20092 created (58 per second)
Creation alone: 10000 files (95 per second)
Mixed with transactions: 10092 files (43 per second)
9935 read (43 per second)
10064 appended (43 per second)
20092 deleted (58 per second)
Deletion alone: 10184 files (1131 per second)
Mixed with transactions: 9908 files (43 per second)

Data:
548.25 megabytes read (1.59 megabytes per second)
1158.00 megabytes written (3.37 megabytes per second)

 

With a more efficient antivirus program instead, variable CPU utilization (from 10%-100%):

Time:
276 seconds total
174 seconds of transactions (114 per second)

Files:
20092 created (72 per second)
Creation alone: 10000 files (123 per second)
Mixed with transactions: 10092 files (58 per second)
9935 read (57 per second)
10064 appended (57 per second)
20092 deleted (72 per second)
Deletion alone: 10184 files (484 per second)
Mixed with transactions: 9908 files (56 per second)

Data:
548.25 megabytes read (1.99 megabytes per second)
1158.00 megabytes written (4.20 megabytes per second)

 

Disabling the antivirus makes it way faster for transactions:

Time:
174 seconds total
91 seconds of transactions (219 per second)

Files:
20092 created (115 per second)
Creation alone: 10000 files (222 per second)
Mixed with transactions: 10092 files (110 per second)
9935 read (109 per second)
10064 appended (110 per second)
20092 deleted (115 per second)
Deletion alone: 10184 files (268 per second)
Mixed with transactions: 9908 files (108 per second)

Data:
548.25 megabytes read (3.15 megabytes per second)
1158.00 megabytes written (6.66 megabytes per second)

Caching with UpTempo for a nice 50% boost in performance:

Time:
121 seconds total
65 seconds of transactions (307 per second)

Files:
20092 created (166 per second)
Creation alone: 10000 files (277 per second)
Mixed with transactions: 10092 files (155 per second)
9935 read (152 per second)
10064 appended (154 per second)
20092 deleted (166 per second)
Deletion alone: 10184 files (509 per second)
Mixed with transactions: 9908 files (152 per second)

Data:
548.25 megabytes read (4.53 megabytes per second)
1158.00 megabytes written (9.57 megabytes per second)

Not tweaking the laptop as a server resulted in > 400s runtimes in the default config (sometimes 500s). FYI, the drive is a smaller, 5400 RPM jobbie, not the 200GB 7200 RPM SATA I have my eye on.
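To put the four runs side by side, here is a quick sketch that does nothing but arithmetic on the totals and transaction rates posted above. The labels are my own shorthand, and I'm assuming the UpTempo run was done with the antivirus disabled; nothing here is re-measured.

```python
# (total seconds, transactions per second), copied from the postmark runs above.
runs = {
    "antivirus, 100% CPU": (344, 86),
    "more efficient antivirus": (276, 114),
    "antivirus disabled": (174, 219),
    "UpTempo caching": (121, 307),  # assumed: antivirus disabled for this run
}

def speedup(baseline, candidate):
    """How many times faster `candidate` finished than `baseline` (total runtime)."""
    return runs[baseline][0] / runs[candidate][0]

def tps_gain(baseline, candidate):
    """Fractional increase in transactions per second."""
    return runs[candidate][1] / runs[baseline][1] - 1

if __name__ == "__main__":
    worst = "antivirus, 100% CPU"
    for name in runs:
        print(f"{name:26s} {speedup(worst, name):4.2f}x vs. worst case")
    gain = tps_gain("antivirus disabled", "UpTempo caching")
    print(f"UpTempo transaction-rate gain: {gain:.0%}")
```

By this arithmetic the cached run pushes roughly 40% more transactions per second and finishes in about 30% less time than the uncached one, so read the "50%" above as a round number.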

One could extrapolate these results. On a bigger box the end results will differ but everything will remain relatively similar.

Obviously, antivirus is sorely needed in this day and age, but if you’re planning on doing heavy I/O be careful what antivirus program you pick and how it’s configured. Depending on the server, I’d gladly trade some protection in exchange for a bunch more performance. Or you can go Unix/Linux and not really have to bother.

I’d say setting up an antivirus program to only scan extensions that can be infected and only scan on creates/modifies and not reads, can boost performance significantly.
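As a rough illustration of that policy (not any real antivirus product's API; the extension list and operation names are made up for the example), the decision boils down to something like this:

```python
from pathlib import PureWindowsPath

# Hypothetical policy: these extensions and operation names are examples only.
SCANNED_EXTENSIONS = {".exe", ".dll", ".doc", ".xls", ".js", ".vbs"}
SCANNED_OPERATIONS = {"create", "modify"}

def should_scan(path: str, operation: str) -> bool:
    """Return True only for risky file types on write-type operations."""
    if operation not in SCANNED_OPERATIONS:
        return False  # plain reads skip the scanner entirely
    return PureWindowsPath(path).suffix.lower() in SCANNED_EXTENSIONS

if __name__ == "__main__":
    print(should_scan(r"C:\data\setup.exe", "create"))    # scanned
    print(should_scan(r"C:\data\setup.exe", "read"))      # skipped
    print(should_scan(r"C:\bench\postmark.tmp", "modify"))  # skipped
```

A benchmark like postmark, which churns through thousands of small scratch files, would mostly bypass the scanner under a policy like this.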

Interestingly, caching didn’t help much with antivirus enabled - most of the bottleneck was the antivirus since everything had to go through it first. What if this was a database/email/fileserver with heavy activity?

D

Wed
13
Jun '07

Ate at Murphy’s Style Grill, in Red Bank, NJ

Will be demonstrating Cisco’s WAAS tomorrow in NYC, so today we spent some time going through a testing protocol so we can show people different things.

After we finished we had dinner at Murphy’s in NJ. Strange place. It’s not a classy steakhouse or anything - nor does it have aspirations to be one.

The menu is, to quote Kipling, as immutable as the hills. Apparently any substitutions or deviations are swiftly and sternly stamped out, as though they signify an impending revolution that threatens all that we hold holy. Dressing on the side? Heresy! Burn!

I got the 24oz Delmonico. I was urged not to ask anything about it, lest they bring out someone to take me to the back. He also suggested generous amounts of A1.

At least it was inexpensive (about $17) and properly cooked. If you’re looking for flavor and marbling, look elsewhere. Much of it looked like solid marble, though. Had to surgically remove a good amount of gristle.

Better than the steak at Bowling Green, I have to admit.

D

Fri
8
Jun '07

This has been one of the worst trips ever - because of one of the silliest DR exercises ever

Well, aside from visiting Flames and helping fix a severe customer problem. Those were rewarding. I still haven’t pooped that steak, BTW.

I was supposed to only stay for 1 day in Manhattan, fix the issue, ba da bing. I ended up staying an extra day - had no extra clothes and no time to get anything. Washed my undies on my own and used the hair dryer over a period of hours to dry them. I learned my lesson now and will always have extra stuff with me.

So I try to go back home today and guess what - Air Traffic Control computers had a major glitch (abcnews.go.com/Business/wireStory?id=3259992) that messed up the whole country’s air travel. Thousands of flights delayed and canceled. Mine was canceled, after I spent about 10 hours in the airport. Another 2 hours in the line to simply rebook the flight since they had 3 people trying to serve hordes. And all because, at least according to the report, a system failed and the failover system didn’t have the capacity to sustain the whole load.

So, while I wait in the airport to catch a stand-by flight tomorrow morning, unbathed and frankly looking a bit menacing, I decided to vent a bit. No hotels, no cars.

Maybe this is too much conjecture and if I’m wrong please enlighten me, but let’s enumerate some of the things wrong with this picture:

  1. First things first: While it’s cool to fail over to a completely separate location, typically you want a robust local cluster first so you can fail over to another system in the original location.
  2. If the original location is SO screwed up (meaning that a local cluster has failed, which typically means something really ominous for most places) ONLY THEN do you fail over to another facility altogether.
  3. Last but not least: Whatever facility you fail over to has to have enough capacity (demonstrated during tests) to sustain enough load to let operations proceed. Ideally, for critical systems, the loss of any one site should hardly be noticeable.
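The three rules above can be written down as a tiny decision function. This is only a sketch under my own assumptions: the `Site` type, the "load units" and the names are hypothetical, and a real DR runbook has far more moving parts.

```python
from dataclasses import dataclass

@dataclass
class Site:
    name: str
    healthy: bool
    tested_capacity: int  # load units proven in an actual DR test (hypothetical unit)

def choose_failover(local_standby: Site, remote: Site, required_load: int) -> str:
    """Apply the three rules in order and return the name of the target."""
    # Rules 1 and 2: prefer the local cluster; leave the site only if it is gone.
    if local_standby.healthy and local_standby.tested_capacity >= required_load:
        return local_standby.name
    # Rule 3: the remote site must have demonstrated enough capacity.
    if remote.healthy and remote.tested_capacity >= required_load:
        return remote.name
    raise RuntimeError("no target can carry the load; failing over would cascade")

if __name__ == "__main__":
    local = Site("local-standby", healthy=False, tested_capacity=100)
    remote = Site("remote-dr", healthy=True, tested_capacity=100)
    print(choose_failover(local, remote, required_load=80))
```

The point of the last branch is that refusing to fail over to an undersized site is a decision too, and a better one than letting it cave under the load.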

According to the report none of the aforementioned simple rules were followed. Someone made the decision to fail over to another facility, which promptly caved under the load. A cascade effect ensued.

I mean, seriously: One of the most important computer systems in the country does not have a well-thought-out and -tested DR implementation. Guys, those are rookie mistakes. Like some airports having 1 link to the outside world, or 2 links but with the same provider. Use some common sense!

So, I guess I’ll put that in the list together with using what’s tantamount to unskilled labor securing our airports instead of highly trained and well-paid personnel that’s been screened extremely intensely and actually takes pride in the job. Maybe some of those unskilled people are running the computers, it might be like the Clone Army in Star Wars. A mass of cheap, expendable labor that collectively has the IQ of my left nut (I’m not being overly harsh - my left nut is quite formidable). The armed forces heading the same way isn’t the most reassuring thought, either.

Yes, I’m upset!!!


D

Thu
7
Jun '07

ZFS in OSX

Not amazing news but an official announcement nonetheless: Saw this (www.macnn.com/articles/07/06/06/zfs.in.leopard/) and I couldn’t resist posting. This means a few things:

  1. Sun figured out how to make ZFS bootable (at least on OSX)
  2. Someone figured out how to deal with ZFS and resource forks (I can’t believe they are willing to break compatibility with so much software otherwise).

Now I just need a Mac so I can run some benchmarks before and after. I have some buddies that might oblige… finally the Macs get a decent FS.

Now if only Apple could lose the silly Mach legacy, it’s a common misconception that the kernel in OSX is FreeBSD - it ain’t. Run lmbench (www.bitmover.com/lmbench/) on different platforms and compare results such as context switching, thread creation and whatnot. Then you’ll see why OSX can’t always make a decent server OS.

D


Ate at Flames in Manhattan

I was helping a client in the Wall Street District today with some rather obscure CIFS performance issues (Opportunistic Locks anyone? Berserk BDCs causing issues? Multi-user Access DBs over WAN?)

Had to stay overnight (unplanned) so after putting in some solid hours I decided to get some steak, and NYC is the place to get decent steak.

Did some research and found out that Flames was walking distance from my hotel, so I went.

Got a T-Bone this time (usually go for strip or ribeye but the waiter insisted, even though they had far more expensive cuts on offer). Some creamed spinach and a small salad and I was set.

Flames is one of those fancy places where they cut your steak for you. At least they don’t feed you or, indeed, help you masticate.

Not that they would need to - the dry-aged steak had fantastic flavor and was reasonably tender (not the most tender but good). I wish it had been a tad less cooked but it was still great, and I devoured it in atavistic glory, almost beating the man-pelt on my chest in ecstasy. It’s been a while since I’ve had proper dry-aged beef.

The creamed spinach wasn’t too creamy or salty. The salad was just OK, I typically use salads for intestinal lubrication anyway and it served the purpose.

I did overhear some patrons asking for well done steaks, this is one of those places where they won’t try to talk you out of it, sadly. I think steakhouses should make you actually sign a waiver if you want to commit such culinary atrocity.

I also overheard a waiter trying to sell some $100 “Kobe” steak to some ladies, telling them how they massage the cows 4 times a day. I discreetly shook my head at them and they got the message.

Anyway - long story short, strongly recommended, and don’t dare order anything beyond medium-rare.

Now back to washing and drying my Superman underoos - I had no change of clothes and I’m writing this naked. It kinda is an appropriate image for this review though…

D

Sat
2
Jun '07

IBRIX at EMC World

I’ve known about IBRIX for a while, but it was refreshing to talk to a decent techie that knew the product. They have improved it a lot over the past year.

For the uninitiated, IBRIX can be any of the following:

  1. A network-based filesystem using the IBRIX client and protocol
  2. A network-based filesystem accessed over NFS or CIFS
  3. A SAN-based parallel filesystem

The product’s claim to fame is its scalability and performance (realized by adding extra nodes “hot”). Their most famous client is probably Pixar, which replaced a ton of NetApp boxes with an IBRIX cluster and realized huge performance benefits and vastly reduced costs. I always liked cool filesystem technologies and this definitely falls under the realm of “cool”. Some highlights based on notes I took on my Blackberry during the session and questions I asked:

  • No limits on filesystem size (they have deployed single namespace filesystems several PB in size).
  • 300MB/s read, 200MB/s write per node on a small box. Bigger boxes can do 1.2GB/s per node; of course your storage needs to be able to keep up.
  • No limit on the number of nodes.
  • Automatic rebalancing of data over time. When you add new disk you rebalance to keep things humming.
  • Dedicated ibrix backup node, works with 3rd party backup SW, can have many backup servers for backup speed.
  • Has snaps now (global), this was a failing of the product before since it was lacking snapshots.
  • No real limit on the number of files per FS.
  • Biggest file size they have tested on production is an 8TB file, no software limit.
  • Nodes use FC to access storage, clients use Ethernet.
  • Client on Windows or Linux, otherwise general NFS and CIFS. Client is fastest.
  • Your prod servers can be the ibrix nodes, but it’s very compute-intensive. They recommend the client (IP-based, bonded), or get an 8-core box.
  • There is no single lock manager - this is the coolest thing. There is global metadata and global locking, all nodes participate equally.
  • How are node failures handled? All nodes interchangeable. All see same storage. Storage allocated to remaining servers if you lose a node.
    Can lose all but 1 server.
  • Back-end storage size per node? Unlimited.
  • Multipathing per node? Powerpath works. Can do bonded GigE up to 8 ports per.
  • How are files allocated? The file inode contains the info concerning which node it needs to go to. Round-robin allocation or preferred servers per file type. Also if server over 50% full then it’s skipped.
  • All volumes accessible by all nodes.
  • Can stripe huge files across many nodes.
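To make the allocation bullet concrete, here is a toy version of "round-robin, but skip any server over 50% full". It is only my reading of the notes above, not IBRIX code: the class, the node names and the fill numbers are all hypothetical, and the per-file-type preferred servers are left out.

```python
from itertools import cycle

class Allocator:
    """Toy round-robin placement that skips servers more than 50% full."""

    def __init__(self, used_fraction: dict, skip_above: float = 0.5):
        self.used_fraction = used_fraction      # server name -> fraction used
        self.skip_above = skip_above
        self._ring = cycle(sorted(used_fraction))

    def next_server(self) -> str:
        # Look at each server at most once per call before giving up.
        for _ in range(len(self.used_fraction)):
            server = next(self._ring)
            if self.used_fraction[server] <= self.skip_above:
                return server
        raise RuntimeError("every server is over the fill threshold")

if __name__ == "__main__":
    alloc = Allocator({"node1": 0.2, "node2": 0.7, "node3": 0.4})
    print([alloc.next_server() for _ in range(4)])
```

Here node2 never receives new files until rebalancing brings it back under the threshold, which is presumably why the automatic rebalancing matters after you add disk.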

I’m stoked! I can think of so many uses for this product:

  1. Data mining
  2. Digital media
  3. Oil and gas
  4. Backups

D


Ate at Trotter’s Tavern in Bowling Green, OH

I had some great customer meetings in OH this week. One meeting took me to Bowling Green, cute town.

The locals like to eat steak at Trotter’s Tavern. They only serve fist-sized and -shaped chunks of sirloin in some weird sauce that has at least some Worcestershire in it but is more tangy. No other cut choices, you get either 10 or 16 ounces and that’s it.

I asked the waitress how it was aged and got a blank stare back. I could almost read her mind: “we just defrost it in the microwave”.

Well, had it been cooked properly it might have been OK, but mine was well-done (which I hadn’t asked for). Ate it anyway, as is my idiom, but I can’t say I recommend the place. Maybe if you get the 10-ouncer and ask for medium rare it might be medium by the time you get it. It’s tough to cook a thick piece of meat properly.

At least the place is relatively inexpensive, their most expensive piece is $25 and comes with all the trimmings.

There was one weird thing though: The restroom was festooned with carvings (yes, carvings) asserting the gayness of various people.

D

Wed
23
May '07

Data Domain Update

I’m not known for retractions and I’m not posting one. I did however check out the new DD boxes and the really big ones are far more capable than the old ones.

So, the techies (hats off for enduring a half hour with me) explained to me a few things:

  1. The smallest block is 4K
  2. The highest possible performance for the biggest box is 200MB/s
  3. The biggest box can do a bit over 30TB raw
  4. They scrub the disk continuously so it’s effectively defragged (see below for caveat) - they did admit performance totally sucks over time if you don’t do it (finally vindicated!)

This is good news, since it’s obviously far bigger than the old ones.

Some issues though (based on what the techies told me):

  1. It scrubs the disk by virtue of NBU deleting the old images, then it knows what to get rid of. If your retentions are long then you will have performance problems. They suggested just dumping it all to tape and starting afresh once in a while. Which just confirms my suspicions on how the stuff truly works.
  2. Each “controller” is really a separate box. The 16 controller limit does not mean it’s a larger appliance, it’s the limit of the management software.
  3. Ergo, each controller can be a separate VTL or separate NFS mount. You cannot aggregate all your controllers in one large VTL. This sucks since if you need to do backups at 1GB/s or so, you’ll need at least 5-6 boxes, and you will have to define a separate library and drives per box. If you do NFS, you need to define 1-2 shares per box. This is a management nightmare. Make it all a single library! Copan has the same issue. I don’t know how they can do it though based on their architecture.

So, it looks to me like it may be a fit for some people, though I have no idea about the price points. If you want performance then you’ll need a ton of the boxes, and you’ll need to spend time configuring them. If 10 maxed-out boxes cost the same (or, worse, more) than a big EMC DL4400 (that can do 2.2GB/s) then it’s not an easy sell. Especially since EMC will be adding dedupe to their VTL - plus, you won’t have to define a bunch of separate libraries. Will EMC’s dedupe be similar? No idea, but if it doesn’t impact performance then it’s pretty compelling.
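The box-count math is simple enough to sketch. I'm assuming 1GB/s = 1000MB/s and that every box actually sustains the 200MB/s best case quoted by the techies, which is optimistic.

```python
import math

BOX_MB_PER_S = 200  # top speed of the biggest box, per the techies

def boxes_needed(target_mb_per_s: float, per_box: float = BOX_MB_PER_S) -> int:
    """Smallest number of boxes whose combined best-case rate meets the target."""
    return math.ceil(target_mb_per_s / per_box)

if __name__ == "__main__":
    print(boxes_needed(1000))  # the ~1GB/s backup target
    print(boxes_needed(2200))  # matching a 2.2GB/s DL4400
```

That works out to 5 boxes for 1GB/s and 11 to match the DL4400 on paper, each one a separate library or set of NFS shares to define.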

Thoughts? You know the drill.

D


Storage Virtualization - is there a point?

This has been bothering me for a while, and I think I’m not alone.

Hitachi has been making great progress with their virtualization gear, as has IBM, Falconstor before them, etc.

They claim you’ll be freed from the vendors’ shackles, achieve greater utilization of your arrays, simplify administration, cure cancer etc.

Well, here’s what I think:

  1. You will instead be shackled to the virtualization provider
  2. You won’t have a clue where your stuff is
  3. If you want to retire an array you could have problems (imagine creating a LUN composed of LUNs from 3 different arrays)
  4. You STILL have to use the management interfaces of the back-end arrays, since you still have to provision the storage. Instead of provisioning to hosts you provision to the virtualizer.

 

So, what have you gained, exactly?

D


Ate at Del Frisco’s steakhouse in Orlando

Superb.

Not much fanfare, steaks wet-aged 21 days.

Got the strip. So much better than Charley’s it wasn’t even funny. Great flavor, tender, perfectly cooked. 8/10. (Charley’s claims double the aging time but their stuff just wasn’t that good.)

Sides were maybe too rich (the spinach could clog a Yak’s arteries). Bisque too thick and nowhere near my sublime experience in Savannah, GA. At least they gave me sherry to put in it, I can’t believe it’s not SOP in any place serving bisque. Heathens!

Dessert was just OK.

Something tells me (maybe it’s my impacted colon) that I should not eat steak again tonight.

D


Should EMC move to more multi-functional devices?

Here’s the deal: EMC has a lot of cool stuff. Lots of it came through acquisitions. Lots of it runs on the x86 platform, believe it or not.

At the moment one needs to buy multiple boxes from EMC to do NAS, SAN, archiving, etc.

Imagine if you instead got generic boxes (there could be a few models, with power relative to cost).

In each box you could run a Clariion, Centera, Celerra, a print server, WAAS (even though it’s Cisco it’s really a Linux box), something like Recoverpoint, and so on.

All the products could be custom Virtual Machine Appliances, possibly running on a modified ESX platform (so you can’t just run them anywhere). You’d get all the benefits of cool technologies such as VMotion and HA. You could easily add to it.

This doesn’t preclude the use of specialized hardware to accelerate certain functions, though in this age of quad-core CPUs even that may be unnecessary.

Think about it. EMC owns the IP for all that technology.

They don’t need to make less money - if anything, since all the platforms would be virtual, production would be greatly streamlined. They could even have a single type of box (say a quad quad with tons of RAM and expansion capability) as the hardware. You need more speed for NAS? Add an extra box, an extra license and load-balance a new virtual data mover.

This of course is unattainable at the moment - I don’t think VMware can provide such low latency and high throughput but maybe I’m wrong.

Such a move won’t fix the proliferation of management interfaces, but EMC could build a common interface.

Thoughts?

D

Tue
22
May '07

Netbackup best practices for ridiculously busy environments (but not exclusively).

While waiting for another EMC World session to start (this one is at “Guru” level, let’s see) I thought I might share some of my experience regarding running Netbackup on very large setups - nothing like learning through pain.

Don’t get me wrong - NBU has its marketshare for a reason. However, I want to make sure I dispel everyone’s deluded romantic notions about NBU being the be-all, end-all backup tool. It can work well, but only if you truly know its idiosyncrasies.

I can’t say I was tending the busiest NBU systems but, at one point, just one of my environments was doing about 15,000 backup jobs a day. Which is way too much - we fixed that pronto…

I won’t go too deep into each point. If anyone cares then post a comment and I will expand on it.

If you have a small shop running NBU on a single server, much of this is not for you - but there may still be a nugget or two in there… However, if you don’t at least use barcodes, I will go after you. Use tar or Windows backup, or even a rusty abacus, go to your corner and be quiet.

 

  1. Have a dedicated master server - if there are many jobs, the last thing you want is your master also being busy doing backups and vaults. It’s the half-witted brains of the operation, don’t stress it.
  2. Go way beyond the tuning recommendations in the manual - if you know what you’re doing. For instance, I have some voodoo tunings for Solaris (up to 9) that make a huge difference. Prepare for comments from Veritas (Symantec, whatever) support… “no sir it’s not like in the book sir, we can’t guarantee it will work sir…” whatever, I’ve gotten such ridiculously bad advice from their support I still cringe (and sometimes pee a little) every time I get a flashback, not to mention the endless dreams and the screaming that wake me up at night.
  3. Separate HBA ports for disk and tape. No exceptions. I don’t care what vendors say.
  4. Separate TAN (Tape Area Network), if you can swing it.
  5. Separate backup LAN. And/or Ethernet port bonding/trunking/teaming (whatever nomenclature appears in your systems). 4 gig ports per media server. 10G if you have the dough. 4 10G ports teamed and I will do the Wayne’s World “we’re not worthy” bit in front of you. Offer ends Dec 2007.
  6. Experiment with TOE cards, such as the Alacritech ones. You will get closer to full gig, though they’re expensive. Bonding is way cheaper and effective if you have many clients.
  7. Try to use port bonding that works at the switch level, too - 802.3ad is the standard, EtherChannel is Cisco’s. The software on the server and the setting on the switch have to jibe. Half-assed intermediate approaches are just that.
  8. Don’t use weak switches at the core. I’m tired of seeing people with Cisco 4506 switches (6509 wannabe) and 8:1 oversubscribed 48-port cards. YOU WILL HAVE PROBLEMS!!!! Do your homework, find out whether or not the switch is oversubscribed, find out the total backplane throughput, figure out the blade throughput, don’t plug everything in the same port octet if you’re going to be oversubscribed - i.e. a 4-port team going to the octet that shares 1Gbit in a 4506 will not give you 4Gbits, it will give you, at best, a thoroughly blocked 150Mbits per port, tops, with problems. Did you know that if one of the 8 ports starts out before the rest and continues pumping, the rest will NOT make the first port reduce its speed but will instead trickle along at 10Mbits sometimes? Even after the initial transfer that was fast is finished and there’s nothing else going on? As Rutger Hauer said in Blade Runner, “I have… seen things you people wouldn’t believe”. Figure THAT one out when you’re having throughput problems.
  9. Use jumbo frames if you can. Bigger is better in this case. Do your homework, there are caveats.
  10. Use the right block size for your tape devices. Windows users, beware. Patches are necessary. SP1 broke block sizes over 64K on 2003 Server.
  11. Don’t go nuts with SSO! Among the myriad things Veritas doesn’t tell you unless you know the right people is that at around 250 instances of devices you will have weird device problems (25 tape drives shared among 10 media servers would make 250 instances). The safe number is closer to 150. Ignore this at your peril. If you use VTL just make more virtual drives.
  12. Use snapshots as much as possible.
  13. If you have more than a couple of media servers, consider a VTL.
  14. If you have DBAs that insist on flushing the redo logs to tape every few seconds, get a heavy-gauge jumpstart cable and a power supply that can put out, say, 20KV, a coat hanger, and wearing nothing but a stained leather apron go to work on them until they regain their senses (or not). Good times.
  15. If the DBAs can’t be persuaded even after their various body parts have been charred by high voltage, try to send the smaller backups to disk. Do NOT send frequent backups to tape. If a job is going to take less than 10min send it to disk.
  16. As a corollary to #15, only use tape for large jobs that will actually stream your tape drives.
  17. Know what your boxes can push. Most servers, even very large ones, will be hard-pressed to push 2 LTO3 drives, let alone LTO4. FYI, I’ve gotten LTO3 to go as fast as 130MB/s, sustained. Do the math. Beat the score! I cheated, BTW.
  18. Know what expansion slots to use - not all are equal, even if they look the same.
  19. Don’t push too much backup traffic over switch ISLs. Preferably don’t push any.
  20. Be super-careful with command-line manipulation of the NBU DB. Perfectly legitimate commands will not function as you might think due to silly heuristics (or lack thereof). Stay tuned, there will be a large post outing NBU in the future. The amount of dirt I have is beyond staggering. Maybe I shouldn’t have said that, I might have to look out for contract killers or Veritas people offering payola, not sure which is preferable. I’m 5 feet tall, with a goatee, skinny and blond, by the way. You can’t miss me. I also have a pronounced limp.
  21. Beware of multiplexing. Too much and restores take forever. Too little and you can’t stream your devices. Disk is your friend. Anything beyond 4-way multiplexing on tape is not.
  22. Do not send tapes offsite only once a week. You are asking for pervy uncle Murphy to pay you a visit, and he is a known repeat sex offender. He won’t discriminate, either.
  23. If you use tapes, have 2 copies of everything.
  24. Replicate to remote sites if at all possible. Tape should be a last resort.
  25. Use VMware if at all possible. Along with #12 and #24, this helps quick recovery.
  26. Do at least 2-3 different backups of the NBU catalog. In really busy systems it’s impossible to do it after each session - there’s just no quiet time. Just have a copy on disk and 2 on tape (you can do the ones on tape inline, will create 2 at the same time, it works), then send the ones on tape to 2 different offsite locations. Have NBU email you the tape(s) barcodes it used for the catalog if you’re doing a non-standard catalog backup. Send an extra email to an externally available address. You’re not paranoid if they’re really out to get you!
  27. Can you even read from disk as fast as you can write to your backup medium? Benchmark.
  28. What’s your current network throughput if you max out all the media servers? Benchmark.
  29. Don’t use your production systems as media servers. You are inviting uncle Murphy again and he’s feeling randy.
  30. Use storage unit groups. Why on earth would you not?
  31. Cluster the master.
  32. Do NOT put media traffic through firewalls, it’s too much. ACLs on switches can work just fine.
  33. Do NOT put a dedicated media server for a subset of your boxes that are secured from the main network. If they lose access to that media server, backups fail. At any rate you’ll have to allow a few ports for the master to communicate with the media server, might as well let media server traffic through. If it seems that #32 and #33 are somewhat self-contradictory, give yourself a cigar.
  34. Simplify your life. Elaborate and numerous policies are more ways to invite uncle Murphy.
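Two of the rules above (#8 and #11) are really just arithmetic, so here is a back-of-the-envelope sketch. The thresholds are the ones quoted in the list; the function names and the "safe/risky/trouble" labels are mine, and the fair-share number is an idealized ceiling, not a measurement.

```python
SSO_TROUBLE_AT = 250  # instances where the weird device problems start (rule 11)
SSO_SAFE_LIMIT = 150  # the safer ceiling from rule 11

def sso_instances(tape_drives: int, media_servers: int) -> int:
    """Every shared drive counts once per media server that shares it."""
    return tape_drives * media_servers

def sso_verdict(tape_drives: int, media_servers: int) -> str:
    n = sso_instances(tape_drives, media_servers)
    if n >= SSO_TROUBLE_AT:
        return "trouble"
    return "safe" if n <= SSO_SAFE_LIMIT else "risky"

def fair_share_mbits(shared_mbits: float, busy_ports: int) -> float:
    """Ideal per-port bandwidth when busy ports share one oversubscribed uplink (rule 8)."""
    return shared_mbits / busy_ports

if __name__ == "__main__":
    print(sso_instances(25, 10), sso_verdict(25, 10))
    print(fair_share_mbits(1000, 4))  # 4-port team in one octet sharing 1Gbit
```

Even the ideal split for a 4-port team in one shared octet is only 250Mbits per port, and as noted in #8, what you actually see is lower than that.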

 

That’s all I have for now. Is there more? Tons, but I need to pee.

D


EMC World: Replication Manager and Exchange 2007

Just attended a session. Seems like the new rev of RM supports 2007 fully. They also support Recoverpoint clones (or will, later this week).

For whoever is not aware of it, EMC Replication Manager is like a front-end that manages local replicas of your salient Exchange data for the purposes of backup and restore.

Can be fiddly to set up but if you have EMC gear and Exchange, you really should look at it.

D

Mon
21
May '07

Just ate at Charley’s steakhouse in Orlando

As has been my idiom lately, I will comment on food.

Went to Charley’s steakhouse while attending EMC World.

They made a huge deal of showing off their steaks - which looked good. Wet-aged, 6 weeks for the bone-in ones, 4 weeks for the rest. Aged in-house. I prefer dry-aged but it’s hard to find outside NYC.

So I had a chunky strip, medium-rare.

Observations:

  1. Too seared on the outside, too rare on the inside (would be classified as rare in other places)
  2. Really not that tender
  3. Way too stringy
  4. Others complained theirs was too salty, mine was OK.
  5. Shoulda gone for the ribeye or porterhouse.

Escargot were OK but needed more salt and garlic.

Next time I’m getting fish, or maybe a fillet (which is too boring a cut but at least it’s hard to screw up).

D


At EMC World

Currently attending EMC World. The first day bored me to tears, I hope the rest will be more exciting (though it utterly depends on the presenters). Some of the material is too introductory, even if one attends the advanced sessions they’re not that advanced.

More to follow.

D

Tue
8
May '07

I wonder when dedup will make it to the arrays

Anyone feel that deduplication is not finding its final resting place in backups and WAN accelerators?

It’s only a matter of time before the algorithms are run as a matter of choice on the array processors.

Of course, that means fewer disk sales, but also bigger/faster/more expensive processors.

Replication will also become more efficient - see EMC’s recent acquisition of Kashya (now RecoverPoint - one of its functions is dedup during replication from array to array, how long do you think it will take them to move this functionality to the array processors?)

Just some random thoughts…

D

Fri
4
May '07

Another Windows tuning I forgot to mention

I use my laptop so much that I sometimes forget about some server-type tunings.

I resuscitated my hot-rod AMD box - it’s a grossly overclocked monster but only has 1GB RAM (since it’s hard to find that kind of fast RAM in bigger sizes, and using 4 sticks prohibits me from overclocking it so much). Let’s just say the CPU is running a full GHz faster than stock, and with air, not water or peltier coolers.

Anyway, since it only has 1GB RAM and I use it for Photoshop and games, I can’t really use something like Supercache or Uptempo on it.

So I tried O&O Software’s Clevercache. By far not as good as the other 2 products - however, it does a decent job of automatically managing cache so you always have enough free RAM.

Then I tried the DisablePagingExecutive registry tweak - not that obscure, tons of references around.
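For reference, the tweak is a single DWORD under the Memory Management key. The sketch below only builds the text of a .reg file so you can eyeball it before importing; the function name is mine, the key path is the commonly documented location rather than something spelled out in this post, and you'd need a reboot for it to take effect.

```python
KEY = (r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control"
       r"\Session Manager\Memory Management")

def paging_executive_reg(disable: bool = True) -> str:
    """Build the text of a .reg file that sets DisablePagingExecutive."""
    value = 1 if disable else 0  # 1 keeps kernel code and drivers resident in RAM
    return (
        "Windows Registry Editor Version 5.00\r\n\r\n"
        f"[{KEY}]\r\n"
        f'"DisablePagingExecutive"=dword:{value:08x}\r\n'
    )

if __name__ == "__main__":
    # Print only; saving and importing the file is left as a manual step.
    print(paging_executive_reg())
```

Calling `paging_executive_reg(False)` produces the matching file to put things back to stock.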

BTW, there is a way to stop postmark from using caching - set buffering false is the command. However, I want to see the benchmark run on a system that would run normally, not measure the raw speed of my disks. Nobody cares about that anyway, especially in the big leagues (unless the config is truly moronic, of course). Cache is everything. But I digress.

So - postmark once more.

Stock:

Time:
177 seconds total
144 seconds of transactions (138 per second)

Files:
20092 created (113 per second)
Creation alone: 10000 files (333 per second)
Mixed with transactions: 10092 files (70 per second)
9935 read (68 per second)
10064 appended (69 per second)
20092 deleted (113 per second)
Deletion alone: 10184 files (3394 per second)
Mixed with transactions: 9908 files (68 per second)

Data:
548.25 megabytes read (3.10 megabytes per second)
1158.00 megabytes written (6.54 megabytes per second)

after tuning as server with the background process, large cache and fsutil as described previously:

Time:
107 seconds total
85 seconds of transactions (235 per second)

Files:
20092 created (187 per second)
Creation alone: 10000 files (526 per second)
Mixed with transactions: 10092 files (118 per second)
9935 read (116 per second)
10064 appended (118 per second)
20092 deleted (187 per second)
Deletion alone: 10184 files (3394 per second)
Mixed with transactions: 9908 files (116 per second)

Data:
548.25 megabytes read (5.12 megabytes per second)
1158.00 megabytes written (10.82 megabytes per second)

with clevercache:

Time:
97 seconds total
71 seconds of transactions (281 per second)

Files:
20092 created (207 per second)
Creation alone: 10000 files (454 per second)
Mixed with transactions: 10092 files (142 per second)
9935 read (139 per second)
10064 appended (141 per second)
20092 deleted (207 per second)
Deletion alone: 10184 files (2546 per second)
Mixed with transactions: 9908 files (139 per second)

Data:
548.25 megabytes read (5.65 megabytes per second)
1158.00 megabytes written (11.94 megabytes per second)

Hell, I guess I might get Clevercache for this system - sped it up a bit and manages memory consumption.

But look at this:

All the above plus using the DisablePagingExecutive registry tweak: BOOYA!

Time:
45 seconds total
28 seconds of transactions (714 per second)

Files:
20092 created (446 per second)
Creation alone: 10000 files (1111 per second)
Mixed with transactions: 10092 files (360 per second)
9935 read (354 per second)
10064 appended (359 per second)
20092 deleted (446 per second)
Deletion alone: 10184 files (1273 per second)
Mixed with transactions: 9908 files (353 per second)

Data:
548.25 megabytes read (12.18 megabytes per second)
1158.00 megabytes written (25.73 megabytes per second)

I guess the box is staying this way.
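
To put the four runs side by side, here is a small Python sketch that works out the speedup of each configuration over stock. It only uses the total times and transaction rates reported above; the labels and the function name are made up for the illustration.

```python
# Totals copied from the four postmark runs above: (total seconds, transactions/sec).
runs = {
    "stock": (177, 138),
    "server tuning": (107, 235),
    "clevercache": (97, 281),
    "DisablePagingExecutive": (45, 714),
}


def speedups(results, baseline="stock"):
    """Return {name: (time speedup, transaction-rate speedup)} relative to baseline."""
    base_secs, base_tps = results[baseline]
    return {
        name: (round(base_secs / secs, 2), round(tps / base_tps, 2))
        for name, (secs, tps) in results.items()
    }


for name, (time_x, tps_x) in speedups(runs).items():
    print(f"{name:24s} {time_x:5.2f}x faster overall, {tps_x:5.2f}x the transaction rate")
```

Most of the gain from the registry tweak shows up in the transaction phase rather than in file creation or deletion, which is why the transaction-rate ratio is larger than the total-time ratio.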

More info on the registry tweak:

http://technet2.microsoft.com/windowsserver/en/library/3d3b3c16-c901-46de-8485-166a819af3ad1033.mspx?mfr=true

In a nutshell, it disables the paging of kernel and driver code, so it’s always memory-resident. Makes sense in some cases, as you can see above :)
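
For anyone who wants to try it, here is a minimal sketch that generates a .reg file for the tweak. The key path is the standard Memory Management location as far as I know (it is not quoted in this post, so check it against the TechNet link above), and the function name is hypothetical. The script only prints the file contents; importing the file and rebooting is left to you.

```python
# Assumed location of the setting; verify against Microsoft's documentation.
KEY = r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Session Manager\Memory Management"


def reg_file(enabled=True):
    """Build the text of a .reg file that sets DisablePagingExecutive to 1 (or back to 0)."""
    value = 1 if enabled else 0
    return "\r\n".join([
        "Windows Registry Editor Version 5.00",
        "",
        f"[{KEY}]",
        f'"DisablePagingExecutive"=dword:{value:08x}',
        "",
    ])


# Print the contents for review instead of touching the registry directly.
print(reg_file(True))
```

Calling `reg_file(False)` produces the file that puts the value back to 0, which is handy if the box turns out to need suspend after all.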

It’s so unusual that it gave me that much of a boost, though. I’d tried it a long time ago and it wasn’t quite as dramatic, but that was on a much older system.

One would argue that postmark lied but using a stopwatch and just eyeballing the sucker it was way quicker doing the transactions.

On servers I just didn’t normally set it because I figured they had enough RAM. Maybe I should start doing it on boxes that do a lot of transactional I/O. Damn, I need to try this with Supercache.

Obviously, your mileage may vary.

WARNING: DO NOT DO THIS ON ANY MACHINE THAT NEEDS TO SUSPEND!!!

Which is why I just didn’t do it on the laptop.

D

Wed
2
May '07

Cisco WAAS benchmarks, and WAN optimizers in general

Lately I’ve been dealing with WAN accelerators a lot, with the emphasis on Cisco’s WAAS (some other, smaller players are Riverbed, Juniper, Bluecoat, Tacit/Packeteer and Silverpeak). The premise is simple and compelling: Instead of having all those servers at your edge locations, move your users’ data to the core and make accessing the data feel almost as fast as having it locally, by deploying appliances that act as proxies. At the same time, you will actually decrease the WAN utilization, enabling you to use cheaper pipes, or at least not have to upgrade, where in the past you were planning to anyway.

There are significant other benefits (massive MAPI acceleration, HTTP, ftp, and indeed any TCP-based application will be optimized). Many Microsoft protocols are especially chatty, and the WAN accelerators pretty much remove the chattiness, optimize the TCP connection (automatically resizing Send/Receive windows based on latency, for instance), LZ-compress the data, and to top it all will not transfer data blocks that have already been transferred.

At this point I need to point out that there is a lot of similarity with deduplication technologies - for example, Cisco’s DRE (Data Redundancy Elimination) is, at heart, a dedup algorithm not unlike Avamar’s or Data Domain’s. So, if a Powerpoint file has gone through the DRE cache already, and someone modifies the file and sends it over the WAN again, only the modified parts will really go through. It really works and it’s really fast (and I’m about the most jaded technophile you’re likely to meet).
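
To make the idea concrete, here is a toy Python sketch of the "only send what the other side hasn't seen" principle. It is emphatically not Cisco's DRE: the fixed 4KB chunk size, the SHA-256 digests and the `Sender`/`Receiver` names are all assumptions made for the illustration.

```python
import hashlib

CHUNK = 4096  # hypothetical chunk size, picked only for this illustration


def split(data, size=CHUNK):
    """Cut data into fixed-size pieces (the last one may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


class Sender:
    def __init__(self):
        self.sent = set()  # digests of chunks the far side already holds

    def encode(self, data):
        """Return (digest, payload) pairs; payload is None for chunks sent before."""
        messages = []
        for piece in split(data):
            digest = hashlib.sha256(piece).digest()
            if digest in self.sent:
                messages.append((digest, None))
            else:
                self.sent.add(digest)
                messages.append((digest, piece))
        return messages


class Receiver:
    def __init__(self):
        self.cache = {}  # digest -> chunk bytes

    def decode(self, messages):
        """Rebuild the data, filling in known chunks from the local cache."""
        out = []
        for digest, payload in messages:
            if payload is not None:
                self.cache[digest] = payload
            out.append(self.cache[digest])
        return b"".join(out)


def wire_bytes(messages):
    """Bytes that actually cross the WAN: every digest plus any new payloads."""
    return sum(len(d) + (len(p) if p is not None else 0) for d, p in messages)


# A 100-chunk "file", then the same file with one chunk modified.
original = b"".join(bytes([i]) * CHUNK for i in range(100))
edited = original[:50 * CHUNK] + bytes([200]) * CHUNK + original[51 * CHUNK:]

sender, receiver = Sender(), Receiver()
first, second = sender.encode(original), sender.encode(edited)
assert receiver.decode(first) == original
assert receiver.decode(second) == edited
print(wire_bytes(first), wire_bytes(second))
```

The second transfer carries one real chunk plus a digest per chunk, so it is a small fraction of the first. Note that an edit which inserts bytes and shifts everything after it would defeat fixed-size chunks like these, so a real product has to be smarter about where it draws chunk boundaries.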

The reason I’m not opposed to this use of dedup (see previous posts) is that the datasets are kept at a reasonable size. For instance, at the edge you’re typically talking about under 200GB of cache, not several TB. Doing the hash calculations is not as time-consuming with a smaller dataset and, indeed, it’s set up so that the hashes are kept in-memory. You see, the whole point of this appliance is to reduce latency, not increase it with unnecessary calculations. Compare this to the multi-TB deals of the “proper” dedup solutions used for backups…

Indeed, why the hell would you need dedup-based backup solutions if you deploy a WAN accelerator? Chances are there won’t be anything at the edge sites to back up, so the whole argument behind dedup-based backups for remote sites sort of evaporates. Dedup now only makes sense in VTLs, just so you can store a bit more.

On Dedup VTLs: Refreshingly, Quantum doesn’t quote crazy compression ratios - I’ve seen figures of about 9:1 as an average, which is still pretty good (and totally dependent on what kind of data you have). I just cringe when I see the 100:1, 1000:1 or whatever insanity Data Domain typically states. I’m still worried about the effect on restore times, but I digress. See previous posts.

Anyway, back to WAN accelerators. So how do these boxes work? All fairly similarly. Cisco’s, for instance, does 3 main kinds of optimizations: TFO, DRE and LZ. TFO means Transport Flow Optimization, and takes care of snd/rcv window scaling, enables large initial windows, enables SACK and BIC TCP (the latter 2 help with packet loss).

DRE is the dedup part of the equation, as mentioned before.

LZ is simply LZ compression of data, in addition to everything else mentioned above.

Other vendors may call their features something else, but at the end there aren’t too many ways to do this. It all boils down to:

  1. Who has the best implementation speed-wise

  2. Who is the best administration-wise

  3. Who is the most stable in an enterprise setting

  4. What company has the highest chance of staying alive (like it or not, Cisco destroys the other players here)

  5. What company is committed to the product the most

  6. As a corollary to #5, what company does the most R&D for the product

Since Cisco is, by far, the largest of the companies that provide WAN accelerators (indeed, they probably spend more on light bulbs per year than the net worth of the other companies combined), in my opinion they’re the obvious force to be reckoned with, not someone like Riverbed. As cool as Riverbed is, they’re too small, and will either fizzle out or get bought - though Cisco didn’t buy them, which is food for thought. If Riverbed is so great, why would Cisco not simply acquire them?

Case in point: When Cisco bought Actona (which is the progenitor of the current WAAS product) they only really had the Windows file-caching part shipping (WAFS). It was great for CIFS but not much else. Back then, they were actually lagging compared to the other players when it came to complete application acceleration. Fast forward a mere few months: They now accelerate anything going over TCP, their WAFS portion is still there but it’s even better and more transparent, the product works with WCCP and inline cards (making deployment at the low-end easy) and is now significantly faster than the competitors. Helps to have deep pockets.

For an enterprise, here are the main benefits of going with Cisco the way I see them:

  1. Your switches and routers are probably already Cisco so you have a relationship.

  2. WAAS interfaces seamlessly with the other Cisco gear.

  3. The best way to interface a WAN accelerator is WCCP. And it was actually developed by Cisco.

  4. The Cisco appliances are tunnel-less and totally transparent (I met someone that had Riverbed everywhere - a software glitch rendered ALL WAN traffic inoperable, instead of having it go through unaccelerated which is the way it is supposed to work. He’s now looking at Cisco).

  5. WAAS appliances don’t mess with QoS you may have already set.

  6. The WAAS boxes are actually faster in almost anything compared to the competition.

And now for the inevitable benchmarks:

Depending on the latency, you can get more or less of a speed-up. For a comprehensive test see this: http://www.cisco.com/application/pdf/en/us/guest/products/ps6870/c1031/cdccont_0900aecd8054f827.pdf

Another, longer rev: http://www.cisco.com/web/CA/channels/pdf/Miercom-on-Cisco-WAAS-Riverbed-Juniper-competitive.pdf

Yes, this is on Cisco’s website but it’s kinda hard to find any performance statistics on the other players’ sites showing Cisco’s WAAS (any references to WAFS are for an obsolete product). At least this one compares truly recent codebases of Cisco, Riverbed and Juniper. For me, the most telling numbers were the ones showing how much traffic the server at the datacenter actually sees. Cisco was almost 100x better than the competition - where the other products passed several Mbits through to the server, Cisco only needed to pass 50Kbits or so.

It is kinda weird that the other vendors don’t have any public-facing benchmarks like this, don’t you think?

However, since I tend to not completely believe vendor-sponsored benchmark numbers as much as I may like the vendor in question, I ran my own.

I used NISTnet (a free WAN simulator, http://www-x.antd.nist.gov/nistnet/) to emulate latency and throughput indicative of standard telco links (i.e. a T1). The fact that the simulator is freely available and can be used by anyone is compelling since it allows testing without disrupting production networks (for the record, I also tested on a few production networks with similar results, though the latency was lower than with the simulator).

The first test scenario is that of the typical T1 connection (approx. 1.5Mbits/s or 170KB/s at best) and 40ms of round-trip delay. I tested with zero packet loss, which is not totally realistic but it makes the benchmarks even more compelling. Usually there is a little packet loss, which makes transfer speeds even worse. This is one of the most common connections to remote sites one will encounter in production environments.

The second scenario is that of a bigger pipe (3Mbit) but much higher latency (300ms), emulating a long-distance link such as a remote site in Asia over which developers do their work. I injected a 0.2% packet loss (a small number, given the distance).

It is important to note that, in the interests of simplicity and expediency, these tests are not comprehensive. A comprehensive WAAS test consists of:

  • Performance without WAAS but with latency

  • Performance with WAAS but data not already in cache (cold cache hits). Such a test shows the real-time efficiency of the TFO, DRE and LZ algorithms.

  • Performance with the data already in the cache (hot cache hits).

  • Performance with pre-positioning of fileserver data. This would be the fastest a WAAS solution would perform, almost like a local fileserver.

  • Performance without WAAS and without latency (local server). This would be the absolute fastest performance in general.

The one cold cache test I performed involved downloading a large ISO file (400MB) using HTTP over the simulated T1 link. The performance ranged from 1.5-1.8MB/s (a full 10 times faster than without WAAS) for a cold cache hit. After the file was transferred (and was therefore in cache) the performance went to 2.5MB/s. The amazing performance might have been due to a highly compressible ISO image but, nevertheless, is quite impressive. The ISO was a full Windows 2000 install CD with SP4 slipstreamed - a realistic test with realistic data, since one might conceivably want to distribute such CD images over a WAN. Frankly this went through so quickly that I keep thinking I did something wrong.

T1 results
ftp without WAAS:
ftp: 3367936 bytes received in 19.53Seconds 168.40Kbytes/sec

Very normal T1 behavior with the simulator (for a good-quality T1).

ftp with WAAS:
ftp: 3367936 bytes received in 1.34Seconds 2505.90Kbytes/sec (15x improvement ).

Sending data was even faster:
ftp: 3367936 bytes sent in 0.36Seconds 9381.44Kbytes/sec.

High Latency/High Bandwidth results

The high latency (300ms) link, even though it had double the theoretical throughput of the T1 link, suffers significantly:

ftp without WAAS
ftp: 3367936 bytes received in 125.73Seconds 26.79Kbytes/sec.

I was surprised at how much the high latency hurt the ftp transfers. I ran the test several times with similar results.
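
One plausible explanation, offered as an assumption rather than something measured in these tests: a single TCP stream can only have one window of data in flight per round trip, so its throughput is capped at window size divided by round-trip time, whatever the size of the pipe. The sketch below assumes a hypothetical 8KB effective window and treats the 300ms figure as round-trip time; it ignores packet loss, which only makes things worse.

```python
def max_throughput_kbytes(window_bytes, rtt_seconds, link_bits_per_sec):
    """Upper bound for one TCP stream: the link rate, or one window per round trip."""
    link_limit = link_bits_per_sec / 8          # bytes per second the pipe can carry
    window_limit = window_bytes / rtt_seconds   # bytes per second the window allows
    return min(link_limit, window_limit) / 1024


WINDOW = 8 * 1024  # hypothetical effective window; not measured in the tests above

# 3Mbit link with 300ms of delay, then the T1 with 40ms of delay.
print(round(max_throughput_kbytes(WINDOW, 0.300, 3_000_000), 2))
print(round(max_throughput_kbytes(WINDOW, 0.040, 1_500_000), 2))
```

Under that assumption the long-distance link is window-bound at roughly 27 Kbytes/sec, which is in the same ballpark as the number above, while the T1 stays link-bound. It would also explain why window scaling on the WAAS side makes such a difference at high latency.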

ftp with WAAS
ftp: 3367936 bytes received in 2.16Seconds 1562.12Kbytes/sec. (58x improvement ).

I have more results with office-type apps but they will make for too big of a blog entry, not that this isn’t big. In any case, the thing works as advertised. I need to build a test Exchange server so I can see how much stuff like attachments are accelerated. Watch this space. Oh, and there’s another set of results at http://www.gotitsolutions.org/2007/05/18/cisco-waas-performance-benchmarks.html

Comments? Complaints? You know what to do.

D

Mon
30
Apr '07

On traveling lately

Been a while since I updated this blog. Too busy running around, evangelizing cool technologies, eating rich food, not exercising and spending WAY too much time in airports delayed due to bad weather. Someone needs to either:

  1. Change the rules so that planes fly even under more adverse conditions (which is, technically, possible)

  2. Improve the planes (long shot)

  3. Give testosterone shots to all involved since, on occasion, I think they’re being way too conservative. I used to work for an airline, and though many rules are sound, others just piss me off.

The other thing that needs to happen is someone needs to figure out how to deal with the middle seats in planes. At the moment they’re only comfortable for seriously emaciated people, let alone anyone normal. I look like I could wrestle a gorilla so I’m decidedly uncomfortable in middle seats, but that’s another subject. Suffice it to say that dieting would not help in my case - in the shoulder area, I’d need a meat cleaver and/or bone saw to see any lateral reduction. I know I’m not the first to complain but come on! Once I had the middle seat and to either side of me were people of similar (if not larger) stature. At the end of the flight I felt like we’d been married. Here are some suggestions that are not politically correct but I’m not known for that…

  1. Collect biometric info on all passengers (namely, weight and body dimensions, not necessarily security-related biometrics but that would be an easier way to get the info if it became a government mandate)

  2. Using the biometric info figure out where people should sit so that:

    1. Weight is balanced

    2. People of similar sizes are not sitting together

    3. Middle seats are assigned to slim people and/or

    4. Only 1 large person per 3 seats

    5. Still try to sit families together

    6. While checking in, offer the option of more comfort (not just legroom). Initially, maybe charge for it!

I think this makes sense. Really, algorithmically it’s not too bad.

D

Wed
28
Feb '07

Just ate at Keens Steakhouse in NYC

Well, just finished the meal. Steak was ordered medium-rare, arrived medium, a bit chewy (but still tasty) and not hot. I was too tired to complain and ate military style (i.e. it was gone in a minute).

The 26oz ribeye I had at Wollensky’s a couple years ago was a religious experience, comparatively. That thing needed a butterknife, at most. Sometimes staring at it hard enough was sufficient to lop off a piece.

I admit I don’t have enough of a statistical sample for either joint.

Just thought I’d share this.

D

Mon
19
Feb '07

It’s all about data classification and searching

I don’t know if this has been discussed elsewhere but I felt like I had an epiphany so there…

The way I see it, in a decade or two the most important technology regarding data will be data classification and search technologies.

Consider this: At the moment, all the rage is archiving and storage tiers. The reason is that it simply is too expensive to buy the fastest disks, and even if you do buy them they’re smaller than the slower-spinning drives.

Imagine if speed and size were not issues. I know that’s a big assumption but let’s play along for a second… (let’s just say that there are plenty of revolutionary advances in the storage space coming our way within, say, 10-20 years, that will make this concept not seem that far-fetched).

Nobody would really care any longer about storage tiers or archiving. Backups would simply consist of extra copies of everything, to be kept forever if needed, and replicated to multiple locations (this is already happening, it’s just expensive, so it’s not common). Indeed, everyone would just let all kinds of data accumulate and scrubbing would not be quite as frequent as it is now. Multiple storage islands would also be clustered seamlessly so they present a single, coherent space, compounding the problem further.

Within such a chaotic architecture, the only real problems are data classification and mining. I.e. figuring out what you have and actually getting at it. The where it is is not quite such an issue - nobody cares, as long as they can get to it in a timely fashion.

I can tell that OS designers are catching on. Microsoft, of all companies, wanted a next-gen filesystem for Vista/Longhorn, that would really be SQL on top of NTFS, with files stored as BLOBs. It got delayed so we didn’t get it, but they’re saying it should be out in a few years (there were issues with scalability and speed).

Let’s forget about the Microsoft-specific implementation and just think about the concept instead (I’d use something like a decent database on raw disk and not NTFS, for instance). No more real file structure as we know it - it’s just a huge database occupying the entire drive.

Think of the advantages:

  1. Far more resilient to failures
  2. Proper rollbacks in case of problems, and easy rebuilding using redo logs if need be
  3. Replication via log shipping
  4. Amazing indexing
  5. Easy expandability
  6. The potential for great performance, if done right
  7. Lots of tuning options (maybe too many for some).

With such a technology, you need a lot more metadata for each file so you can present it in different ways and also search for it efficiently. Let’s consider a simple text document - you’re trying to sell some storage, so you write a proposal for a new client. You could have metadata on:

  • Author
  • Filename
  • Client name
  • Type of document - proposal
  • Project name
  • Excerpt
  • Salesperson’s name
  • Solution keywords, such as EMC DMX with McData (sorry, Brocade) switches
  • Document revision (possibly automatically generated)

… and so on. A lot of these fields already are to be found in the properties of any MS Word document.

The database would index the metadata at the very least, when the file is created, and any time the metadata changes. Searches would be possible based on any of the fields. Then, a virtual directory structure could be created:

  • Create a virtual directory with all files pertaining to that specific client (most common way people would organize it)
  • Show all the material for this specific project
  • Show all proposals that have to do with this salesperson

… and so on.
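
As a rough sketch of the concept (not of any shipping filesystem), here is what "files as rows, folders as queries" could look like using Python's built-in sqlite3 module. The table layout, the sample documents and the `virtual_folder` helper are all hypothetical.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    """CREATE TABLE files (
           id INTEGER PRIMARY KEY,
           filename TEXT, author TEXT, client TEXT,
           doc_type TEXT, project TEXT, salesperson TEXT,
           body BLOB)"""
)
# Index every metadata field so that searches on any of them stay cheap.
FIELDS = ("filename", "author", "client", "doc_type", "project", "salesperson")
for field in FIELDS:
    db.execute(f"CREATE INDEX idx_{field} ON files({field})")

# Made-up sample documents; the body column stands in for the file contents.
db.executemany(
    "INSERT INTO files (filename, author, client, doc_type, project, salesperson, body)"
    " VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        ("acme-san.doc", "Dana", "Acme", "proposal", "SAN refresh", "Pat", b"..."),
        ("acme-notes.doc", "Dana", "Acme", "notes", "SAN refresh", "Pat", b"..."),
        ("initech-vtl.doc", "Lee", "Initech", "proposal", "VTL rollout", "Pat", b"..."),
    ],
)


def virtual_folder(**criteria):
    """List the filenames whose metadata matches every field=value pair given."""
    unknown = set(criteria) - set(FIELDS)
    if unknown:
        raise ValueError(f"unknown metadata fields: {sorted(unknown)}")
    where = " AND ".join(f"{field} = ?" for field in criteria) or "1"
    query = f"SELECT filename FROM files WHERE {where} ORDER BY filename"
    return [row[0] for row in db.execute(query, tuple(criteria.values()))]


print(virtual_folder(client="Acme"))
print(virtual_folder(doc_type="proposal", salesperson="Pat"))
```

The same file shows up in as many "folders" as it has matching metadata, without ever being copied or moved.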

Virtual folders exist now for Mac OSX (can be created after a Spotlight search), Vista (saved searches) and even Gnome 2.14, but the underlying engine is simply not as powerful as what I just described. Normal searches are used, and metadata is not that extensive for most files anyway (mp3 files being an exception since metadata creation is almost forced when you rip a CD).

It should be obvious by now that to enable this kind of functionality properly you need really good ways of classifying and indexing your data and actually create all the metadata that needs to be there, as automatically as possible. Future software will probably force you to create the metadata in some way, of course.

Existing software that does this classification is fairly poor, in my opinion. Please correct me if I’m wrong.

The other piece that needs to be there is extremely robust search and indexing capabilities. Some of that technology is there (google desktop and its ilk) but natural language search has to be - well, natural, but unambiguous at the same time.

I hope you can now see why I believe these technologies are important. If Google continues the way it’s going, it may well become the most important company in the next decade (some might argue it’s the most important one already).

For any sci-fi fans out there, this is a good novel that’s a bit related to the chaotic storage systems of the future: http://www.scifi.com/sfw/books/sfw7677.html

D

Some clarification on the caching

Re the previous post:

If you want to use supercache or uptempo the idea is that you take AWAY from windows/SQL/exchange cache and add to the fancy cache.

So, even on Windows Server, in the “File and Printer Sharing for Microsoft Networks” properties (found in the properties for your network card, bizarrely enough), you could select “Maximize data throughput for network applications”. In the various apps you’d just minimize the cache (i.e. only 10MB for Exchange) and just give the rest to supercache/uptempo.

Be aware that supercache is on a PER VOLUME basis, not global (its blessing and its curse at the same time). If you have a lot of volumes maybe just cache a few key data volumes, tempdb and the pagefile partition, or use uptempo, which allows you to allocate a single global cache pool that is then shared among the volumes you choose.

For SQL, using a RAM disk for tempdb seems to work even better.

Having seen the products work wonders only with 128MB dedicated to them, and bearing in mind that most servers have 4GB RAM or more, I’d say go nuts. I’d buy 4GB of RAM and make it cache in a heartbeat.

D


On deduplication and Data Domain appliances

One subject I keep hearing about is deduplication. The idea being that you save a ton of space since a lot of your computers have identical data.
One way to do it is with an appliance-based solution such as Data Domain. Effectively, they put a little server and a cheap-but-not-cheerful, non-expandable 6TB RAID together, then charge a lot for it, claiming it can hold 90TB or whatever. Use many of them to scale.

The technology chops up incoming files into pieces. Then, the server calculates a unique numeric ID using a hash algorithm.

The ID is then associated with the block and both are stored.

If the ID of another block matches one already stored, the new block is NOT stored, but its ID is, as is the association with the rest of the blocks in the file (so that deleting a file won’t adversely affect blocks it has in common with other files).

This is what allows dedup technologies to store a lot of data.
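
The steps above can be sketched in a few lines of Python. This is a toy model and not Data Domain's implementation: the fixed 4KB block size, the use of SHA-256 for the ID and the reference counting used to protect shared blocks on delete are all assumptions made for the illustration.

```python
import hashlib


class DedupStore:
    def __init__(self, block_size=4096):
        self.block_size = block_size  # hypothetical fixed block size
        self.blocks = {}    # ID -> block data, each unique block stored once
        self.refcount = {}  # ID -> how many times stored files reference it
        self.files = {}     # filename -> ordered list of block IDs

    def put(self, name, data):
        """Chop data into blocks, store only the blocks not seen before."""
        if name in self.files:
            self.delete(name)
        ids = []
        for i in range(0, len(data), self.block_size):
            block = data[i:i + self.block_size]
            block_id = hashlib.sha256(block).hexdigest()
            if block_id not in self.blocks:
                self.blocks[block_id] = block
            self.refcount[block_id] = self.refcount.get(block_id, 0) + 1
            ids.append(block_id)
        self.files[name] = ids

    def get(self, name):
        """Reassemble a file from its list of block IDs."""
        return b"".join(self.blocks[block_id] for block_id in self.files[name])

    def delete(self, name):
        """Drop a file; a block goes away only when no other file still uses it."""
        for block_id in self.files.pop(name):
            self.refcount[block_id] -= 1
            if self.refcount[block_id] == 0:
                del self.refcount[block_id]
                del self.blocks[block_id]


store = DedupStore()
file_a = b"A" * 4096 + b"B" * 4096
file_b = b"A" * 4096 + b"C" * 4096
store.put("a", file_a)
store.put("b", file_b)
print(len(store.blocks))  # unique blocks held for the two files
store.delete("a")
assert store.get("b") == file_b
```

Two files of two blocks each end up as three stored blocks, and deleting one file leaves the shared block in place for the other.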

Now, why the amount you can store depends on your data:

If you’re backing up many different unique files (like images), there will be almost no similarity, so everything will be backed up.
If you’re backing up 1000 identical windows servers (including the windows directory) then there WILL be a lot of similarity, and great efficiencies.

Now the drawbacks (and why I never bought it):

The thing relies on a weak server and a small database. As you’re backing up more and more, there will be millions (maybe billions) of IDs in the database (remember, a single file may have multiple IDs).

Imagine you have 2 billion entries.

Imagine you’re trying to back up someone’s 1GB PST, or other large file, that stays mostly the same over time (ideal dedup scenario). The file gets chopped up in, say, 100 blocks.

Each block has its ID calculated (CPU-intensive).

Then, EACH ID has to be compared with the ENTIRE database to determine whether there’s a match or not.

This can take a while, depending on what search/sort/store algorithms they use.
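
Some back-of-the-envelope arithmetic shows why the algorithm and the size of the box matter. The per-entry sizes below are hypothetical (a 160-bit hash plus an 8-byte disk location); the real layout is whatever the vendor chose.

```python
import math

ENTRIES = 2_000_000_000   # the 2 billion IDs imagined above
HASH_BYTES = 20           # hypothetical: a 160-bit hash per ID
LOCATION_BYTES = 8        # hypothetical: where the block lives on disk


def index_size_gib(entries, hash_bytes, location_bytes):
    """Size of a bare-bones ID index, ignoring any data-structure overhead."""
    return entries * (hash_bytes + location_bytes) / 2**30


def comparisons_per_lookup(entries):
    """Worst-case comparisons for a binary search over a sorted index."""
    return math.ceil(math.log2(entries))


print(f"{index_size_gib(ENTRIES, HASH_BYTES, LOCATION_BYTES):.1f} GiB")
print(comparisons_per_lookup(ENTRIES))
```

With a sorted or hashed index the number of comparisons per ID is small, so the real worry under these assumptions is that an index of this size won't fit in the RAM of a small server, and every lookup that misses memory turns into disk I/O.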

I asked data domain about this and all they kept telling me was “try it, we can’t predict your performance”. I asked them whether they had even tested the box to see what the limits were, and they hadn’t. Hmmm.

I did find out that, at best, the thing works at 50MB/s (slower than an LTO3 tape drive), unless you use tons of them.

Now, imagine you’re trying to RECOVER your 1GB PST.

Say you try to recover from a “full” backup on the data domain, but that file has been living in it for a year, with the new blocks being added to it.

When requesting the file, the data domain box has to synthesize the file (remember, even the “full” doesn’t include the whole file). It will read the IDs needed to recreate it and put the blocks together so it can present the final file, as it should have looked.

This is CPU- and disk-intensive. Takes a while.

The whole point of doing backups to disk is to back up and restore faster and more reliably. If you’re slowing things down in order to compress your disk as much as possible, you’re doing yourself a disservice.

Don’t get me wrong, dedup tech has its place, but I just don’t like the appliance model for performance and scalability reasons.
EMC just purchased Avamar, a dedup company that does the exact same thing but lets you install the software on whatever you want.

There are also Asigra and Evault, both great backup/dedup products that can be installed on ANY server and work with ANY disk, not just the el cheapo quasi-JBOD data domain sells.

So, you can leverage your investment in disk and load the software on a beefy box that will actually work properly.

Another tack would be to use virtual tape. It doesn’t do dedup yet, but it will: EMC bought Avamar, and Adic (now Quantum) also acquired another dedup company and will put the stuff in their VTL, so you can get the best of both worlds. In the meantime, it does compression just like real tape.

Plus, even the cheapest EMC virtual tape box works at over 300MB/s.

I sort of detest the “drop at the customer site” model data domain (and a bunch of the smaller storage vendors) use. They expect you to put the box in and, if it works OK, to find it easier to keep it than to send it back.

Most people will keep the first thing they try (unless it fails horrifically), since they don’t want to go through the trouble of testing 5 different products (unless we’re talking about huge companies that have dedicated testing staff).

Let me know what you think…

D

Do you need a VTL or not?

I first posted this as a comment on http://www.gotitsolutions.org but this is its rightful place.

Having deployed what was, at the time, the largest VTL in the world, and subsequently numerous other VTL and ATA Solutions, I think I can offer a somewhat different perspective:

It depends on the number of data movers you have and how much manual work you’re prepared to do. Oh, and speed.

Licensing for VTL is now capacity-based for most packages (at least the famous/infamous/important ones like CommVault, Networker and NetBackup, not respectively).

Also, I’d forget about using VTL features such as replication and using the VTL to write directly to tape (unless you’re retarded, insane or the backup software is running ON the VTL, as is the case now with EMC’s CDL). Just use the VTL like tape. I’ve been so vehement about this that even the very stubborn and opinionated Curtis Preston is now afraid to say otherwise with me in the room… (I shut him up REALLY effectively during one Veritas Vision session we were co-presenting a couple years ago. I like Curtis but he’s too far removed from the real world. Great presenter, though, and funny).

Even dedup features are suspect in my opinion, since they rely on hashes and searches of databases of hashes, which progressively get slower the more you store in them. Most companies selling dedup (data domain, avamar, to name a couple major names) are sorta cagey when you confront them with questions such as “I have 5 servers with 50 million files each, how well will this thing work?”

Answer is, it won’t, even for far fewer files. Just get some raw-based backup method that also indexes, such as Networker’s snapimage or NBU’s flashbackup.

Dedup also fails with very large files such as database files.
I can expand on any of the above comments if anyone cares.

But back on the data movers (Media Agents, Storage Nodes, Media Servers):

Whether you use VTL or ATA, you effectively need to divvy up the available space.

With ATA, you either allocate a fixed amount of space to each data mover, or use a cluster filesystem (such as Adic’s Stornext) to allow all data movers to see the same disk.

With VTL, the smallest quantum of space you can allocate to a data mover is, simply, a virtual tape. A virtual tape, just like a real tape, gets automatically allocated, as needed.

So, imagine you have a large datacenter, with maybe 40 data movers and multiple backup masters.

Imagine you have a 64TB ATA array.

You can either:

1. Split the array into 40 chunks, and have a management nightmare
2. Deploy stornext so all servers see a SINGLE 64TB filesystem (at an extra 3-4K per server, plus probably 50K more for maintenance, central servers and failover) - easy to deal with but complex to deploy, and more software on your boxes
3. Deploy VTL and be done with it.

For such a large environment, option #3 is the best choice, hands down.

With filesystems, you have to worry about space, fragmentation, mount options, filesystem creation-time tunables, runtime tunables, esoteric kernel tunings, fancy disk layouts, and so on. If you’re weird like me and thoroughly enjoy such things, then go for it. As time goes by though, the novelty factor diminishes greatly. Been there, done that, smashed some speed records on the way.

What’s needed in the larger shops, aside from performance, is scalability, ease of use and deployment, and simplicity.

With VTL, you get all of that.

The other issue with disk is that backup vendors, while they’re getting better, impose restrictions on the # streams in/out, copy to tape and so on. No such restrictions on tape.

One issue with VTL: depending on your backup software, setting up all those new virtual drives etc. can be a pain (esp. on NBU).
For a small shop (fewer than 2 data movers), a VTL is probably overkill.

D

So who am I?

Hello everyone,

My name is Dimitris Krekoukias.

This blog used to be on another server, I moved it here - hopefully this hosting facility will be more stable.

I resemble a silverback gorilla more than a monkey (man-pelt and all), and could probably wrestle one (and have a fair chance of winning).
I have extensive experience in the backup and recovery arena, and indeed know far more about certain products than I (or the vendors) would like to.
This blog will not be just about recovery - I have other interests, such as storage, OS design, tuning, filesystems, HPC, and other exotica. Plus a ton of non-IT-related hobbies - but that’s a story for another day.
Hopefully everyone will find this blog stimulating, controversial and, at times, annoying - in which case, tough.

D


On Windows filesystem tuning and funky cache mechanisms

Edited: I just realized I must have used different postmark settings for Vista and XP. Do NOT use the following numbers to compare Vista to XP performance.

I won’t go into a diatribe on how to tune Windows - there are excellent guides on Microsoft’s and IBM’s sites, among others.

But I wanted to share some goodness based on some recent findings of mine.

First, the part that most probably know (works on XP and 2003):

From a command window do

fsutil behavior set disablelastaccess 1

This will disable last-access time recording, which IMO is useless unless you really do care when a file was accessed. The overhead won’t matter much if there isn’t a lot going on with your disk (or you’re on some fancy EMC box with tons of cache), but if you have busy disks, this typically helps a bit.

On 2003, you can also increase the size of the lookaside buffer if you have many concurrent file operations:

fsutil behavior set memoryusage 2

This also works on Vista but not XP, sadly. See more here: http://technet2.microsoft.com/WindowsServer/en/library/9fcf44c8-68f4-4204-b403-0282273bc7b31033.mspx?mfr=true

Now, for the interesting part. I use a laptop that’s pretty decent (100GB 7200RPM drive, 2GB RAM). I hammer my disk since I use the laptop for vmware and other duties (music software with thousands of files, for instance).

I like postmark and iozone for measuring performance. Here’s how I configure postmark:

set number 10000

set transactions 20000

set subdirectories 5

set size 500 100000

set read 4096

set write 4096

run

This will create 10,000 files, then perform 20,000 transactions on them. The files will range from 500 bytes to 100KB in size. This is brutal on CPU, cache and disk. If you want different-sized files, you just specify the min and max sizes. Just be careful with the number of files (if you leave it at 10,000 and tell it to make 100GB files, better make sure you have the space).
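To get a feel for how much space a given postmark configuration needs before running it, here is a minimal Python sketch. It assumes file sizes are spread evenly between the min and max (an assumption on my part, not something the settings above state), and it only covers the initial pool of files, not the extra ones created during transactions. The function name is made up for illustration.

```python
def estimate_postmark_footprint(num_files, min_size, max_size):
    """Rough size in bytes of the initial file pool.

    Assumes sizes are spread evenly between min_size and max_size,
    so the average file is simply the midpoint of the two bounds.
    """
    average_size = (min_size + max_size) / 2
    return num_files * average_size


# The settings used above: 10,000 files between 500 bytes and 100,000 bytes.
footprint = estimate_postmark_footprint(10_000, 500, 100_000)
print(f"about {footprint / 1024**2:.0f} MB for the initial pool")
```

With the settings above this works out to roughly half a gigabyte, which is why the run goes well past what a small cache can hold.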

Anyway, here are some results:

Vista untweaked (10000 files and transactions, 512 byte I/O):

Time:
181 seconds total
51 seconds of transactions (196 per second)

Files:
15047 created (83 per second)
Creation alone: 10000 files (121 per second)
Mixed with transactions: 5047 files (98 per second)
4945 read (96 per second)
5055 appended (99 per second)
15047 deleted (83 per second)
Deletion alone: 10094 files (210 per second)
Mixed with transactions: 4953 files (97 per second)

Data:
257.93 megabytes read (1.43 megabytes per second)
826.79 megabytes written (4.57 megabytes per second)

Vista tweaked with fsutil as described above:

Time:
159 seconds total
51 seconds of transactions (196 per second)

Files:
15047 created (94 per second)
Creation alone: 10000 files (158 per second)
Mixed with transactions: 5047 files (98 per second)
4945 read (96 per second)
5055 appended (99 per second)
15047 deleted (94 per second)
Deletion alone: 10094 files (224 per second)
Mixed with transactions: 4953 files (97 per second)

Data:
257.93 megabytes read (1.62 megabytes per second)
826.79 megabytes written (5.20 megabytes per second)

So it’s a bit better.
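To put a number on "a bit better", here is a small Python sketch that computes the percentage change between the two Vista runs. The figures are copied from the postmark output above; the helper name and the choice of which lines to compare are mine.

```python
def percent_change(before, after):
    """Percentage change from before to after (negative means it went down)."""
    return (after - before) / before * 100


# (untweaked, tweaked) pairs taken from the two Vista runs above.
runs = {
    "total seconds": (181, 159),
    "seconds of transactions": (51, 51),
    "creation alone, files per second": (121, 158),
    "deletion alone, files per second": (210, 224),
    "megabytes written per second": (4.57, 5.20),
}

for name, (untweaked, tweaked) in runs.items():
    print(f"{name}: {percent_change(untweaked, tweaked):+.1f}%")
```

The transaction phase took 51 seconds in both runs, so the gain comes from the pure creation and deletion phases.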

Another thing you can do is set the processor quanta to be fixed 120ms chunks (simply done by right-clicking on “My Computer”, then properties, advanced, performance, settings, advanced, and setting processor scheduling for background services). Yes, I’ve had by far the best luck with XP by tuning it like a server. Your mileage may vary, but this also increases postmark results a bit.

You can also play with increasing the cache (in that advanced pane again select “system cache” and, with regedit, go to HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\lanmanserver\parameters\size and make it a 3). This is all if you have XP. In 2003 it comes just like that, unless you want to run SQL, IIS or Exchange, in which case there’s a setting, “maximize throughput for network applications”. This limits cache to 512MB, and lets the apps cache on their own.
OR, you can actually spend some money and ridiculously increase performance by getting a caching product like Superspeed’s Supercache or Datacore’s UpTempo (I tried O&O Clevercache as well and was thoroughly underwhelmed).
Here are results with 20,000 transactions and 4K I/O, XP tuned just like a server:

Time:
386 seconds total
308 seconds of transactions (64 per second)

Files:
20092 created (52 per second)
Creation alone: 10000 files (142 per second)
Mixed with transactions: 10092 files (32 per second)
9935 read (32 per second)
10064 appended (32 per second)
20092 deleted (52 per second)
Deletion alone: 10184 files (1273 per second)
Mixed with transactions: 9908 files (32 per second)

Data:
548.25 megabytes read (1.42 megabytes per second)
1158.00 megabytes written (3.00 megabytes per second)

And here are results with the exact same settings but with 256MB of Supercache on that volume, lazy writes on:

Time:
196 seconds total
163 seconds of transactions (122 per second)

Files:
20092 created (102 per second)
Creation alone: 10000 files (344 per second)
Mixed with transactions: 10092 files (61 per second)
9935 read (60 per second)
10064 appended (61 per second)
20092 deleted (102 per second)
Deletion alone: 10184 files (2546 per second)
Mixed with transactions: 9908 files (60 per second)

Data:
548.25 megabytes read (2.80 megabytes per second)
1158.00 megabytes written (5.91 megabytes per second)
I am a believer. The size of the dataset far exceeded the capacity of Supercache, but it helped tremendously regardless.
Since I don’t believe all benchmarks, I also ran iozone.

File size     4096     8192    16384
       64        -        -        -
      128        -        -        -
      256        -        -        -
      512        -        -        -
     1024        -        -        -
     2048        -        -        -
     4096    70011        -        -
     8192    29264    50257        -
    16384    26229    33289    37198
    32768    27578    28827    34778
    65536    26982    27890    28997
   131072    20901    21680    22223
   262144    21769    20789    22249
   524288    23076    25270    26258

The top row shows record size, the left column file size. The above is without the cache. Now with cache:

File size     4096     8192    16384
       64        -        -        -
      128        -        -        -
      256        -        -        -
      512        -        -        -
     1024        -        -        -
     2048        -        -        -
     4096   279746        -        -
     8192   264110   262117        -
    16384   250322   249355   238230
    32768   233373   238932   233980
    65536   204786   232418   234544
   131072   234552   230336   225731
   262144   164434   227792   222540
   524288    35515    31533    41262

These results are for writes, in both cases. Iozone’s output is too large to include here but I’ll gladly send the entire file to anyone who wants it. I would ignore record sizes under 4K since Windows will coalesce writes to 4K and up anyway (up to 64K).
It seems that these products are worth a serious look. In most cases, significant benefits will be realized by caching the volume that holds the swapfile, even if only using 128MB. In one case I went from 124 seconds for a postmark run to 70s by caching the swap volume, even though I had ample memory and Windows shouldn’t have been using swap.

Unix is generally a bit more robust for caching and virtual memory, so you don’t need extra products. Looks like Windows needs a bit of help. Indeed, Microsoft uses Supercache on the servers that host MSN, I found out…
Anyway, you can see that up to 256MB, Supercache kicks the Windows cache’s ass. Now remember, this is a box tuned just like a server; it was using like 1GB of cache even without Supercache. After you exceed the size of the cache by using the large 512MB test file, you still realize some benefits, as you can see.
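For anyone who wants the ratios rather than eyeballing two tables, here is a short Python sketch that divides the cached write numbers by the uncached ones. It uses only the 4096 record-size column copied from the iozone results above; the variable and function names are made up for illustration.

```python
# Write throughput from the iozone tables above, 4096 record-size column only.
# Keys are file sizes, values are the numbers exactly as iozone reported them.
without_cache = {
    4096: 70011, 8192: 29264, 16384: 26229, 32768: 27578,
    65536: 26982, 131072: 20901, 262144: 21769, 524288: 23076,
}
with_cache = {
    4096: 279746, 8192: 264110, 16384: 250322, 32768: 233373,
    65536: 204786, 131072: 234552, 262144: 164434, 524288: 35515,
}


def speedups(baseline, cached):
    """Ratio of cached to baseline throughput for each file size."""
    return {size: cached[size] / baseline[size] for size in baseline}


ratios = speedups(without_cache, with_cache)
for file_size, ratio in ratios.items():
    print(f"file size {file_size}: {ratio:.1f}x")
```

The ratio drops sharply at the largest file size, where the test file no longer fits in the 256MB cache, but it stays above 1.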

Datacore’s UpTempo produced similar results, is far less tunable, uses a unified cache (instead of a chunk per partition) and is easier to configure. It can be more or less expensive: Supercache for 4 CPUs is like $1K, but half that for 2 CPUs, while UpTempo is about $700 regardless. Another difference is that UpTempo is 32-bit only at the moment.

D