Recovery Monkey: Musings on backups, storage, tuning and more

Mon 29 Mar '10

Filesystem benchmark extravaganza – Win, Linux, NTFS, EXT4, XFS, BFS scheduler impact and more…

It’s been a while since I checked the status of Linux-land regarding filesystems and CPU and I/O schedulers, so I thought I’d post some results.

A bunch of new distributions are coming out with Linux kernel 2.6.32 as standard (Ubuntu 10.04 LTS is one of them), and one distribution (PCLinuxOS) will have Con Kolivas’ BFS as its default process scheduler.

Since I’m a scheduler and filesystem aficionado, I was intrigued by the simplicity of BFS and wanted to see how it might affect I/O performance. I don’t believe that any single scheduling algorithm fits all possible use cases and, furthermore, I think the default Linux CFS scheduler is getting a bit unwieldy in its complexity (the goal is to make it scale to very large numbers of cores, but not all machines have that many).

So I grabbed a spare machine that, on purpose, doesn’t have a ton of horsepower, and tested using the same benchmark (NetApp’s venerable PostMark) on the same part of the disk, with Windows, Ubuntu 10.04 LTS beta, and PCLOS 2010 beta2.

Here’s some of the averaged data (did multiple runs):

 

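| Configuration | Time (s) | IOPS | Creation (files/s) | Reads/s | Appends/s | Deletes (files/s) | Read MB/s | Write MB/s |
|---|---|---|---|---|---|---|---|---|
| win tuned | 208 | 244 | 156 | 121 | 123 | 183 | 2.68 | 5.658 |
| win supercache | 197 | 224 | 200 | 111 | 113 | 175 | 2.78 | 5.88 |
| win uptempo | 172 | 281 | 277 | 139 | 141 | 156 | 3.19 | 6.73 |
| Ubuntu 10.04b | 154 | 169 | 454 | 84 | 85 | 727 | 3.56 | 7.52 |
| Ubuntu 10.04b deadline | 136 | 173 | 500 | 86 | 87 | 10184 | 4.03 | 8.51 |
| PCLOS ext4 CFQ | 105 | 239 | 1011 | 118 | 120 | 4478 | 5.25 | 11.09 |
| PCLOS ext4 deadline | 97 | 240 | 1158 | 119 | 121 | 7828 | 5.73 | 12.21 |
| PCLOS XFS tuned deadline | 187 | 136 | 476 | 68 | 68 | 509 | 2.93 | 6.19 |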

Some explanation on the fields:

  • Time: Total time to run
  • IOPS: mixed transactions per second (maybe the most useful number)
  • Creation: files created per second (not mixed with transactions)
  • Reads: reads per second
  • Appends: appends per second
  • Deletes: pure file deletions per second (not mixed with transactions)
  • Read and write MB/s: the effective MB/s

Some notes:

  • Windows was just vanilla XP with all service packs, tuned as a server (box was too weak to take Win7)
  • Supercache was using a portion of memory for the Supercache filter driver
  • Uptempo is a similar filter driver that provides caching (see previous posts here and here)
  • Ubuntu was the latest available beta of 10.04
  • Whenever you see “deadline” the deadline scheduler was used instead of CFQ
  • PCLOS is the latest available beta of PCLOS 2010.
  • I mounted ext4 with barrier=0,noatime
  • I mounted XFS with nobarrier,noatime,nodiratime,logbufs=8,logbsize=256k and made it with mkfs.xfs -f -d agcount=4 -l lazy-count=1 -l size=128m

Some pretty pictures for the ADD among us:

[Images: four charts]

Some observations:

  • Some metadata-heavy operations, like deletion, can get heavily cached in Linux, which skews the MB/s and total time results when 10,000 files can be deleted in an instant. I wanted to show the numbers because, depending on what you do, that may or may not be important.
  • Intelligent caching still helps tremendously: note the results for Uptempo on Windows and the crazy file creation rates on Linux (this also skews the MB/s but is a useful number to know if you’re creating a lot of files constantly)
  • Depending on what you’re doing, Windows can be Just Fine…
  • The BFS seems to have been an excellent scheduler choice for PCLOS and something other Linux vendors should start looking at seriously – unless there are serious other I/O tweaks in the kernel that I don’t know about, the difference in performance is staggering between Ubuntu and PCLOS (Phoronix figured that out as well here – they have tons of additional benchmarks showing things other than I/O).
  • And, last but by no means least – the deadline scheduler is ahead of CFQ again, both for Ubuntu and PCLOS. Not by a huge margin, but it’s a safe choice, especially if you’re running Linux on enterprise storage that has its own decent I/O scheduler built in. There have been cases of CFQ dramatically lowering performance with external arrays. Remember – most of the guys writing code for Linux don’t have access to enterprise storage… using the defaults could be harming your Linux performance!
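
For anyone who wants to check which elevator a disk is actually using, here is a small Python sketch. It assumes a Linux box that exposes /sys/block/<device>/queue/scheduler, where the kernel wraps the active scheduler in square brackets. The device name sda and the helper names are placeholders made up for illustration.

```python
from pathlib import Path
from typing import List, Tuple


def parse_schedulers(text: str) -> Tuple[str, List[str]]:
    """Parse the contents of /sys/block/<dev>/queue/scheduler.

    The kernel lists the available elevators separated by spaces and
    wraps the active one in brackets, e.g. "noop deadline [cfq]".
    Returns (active, all_available).
    """
    active = ""
    available = []
    for name in text.split():
        if name.startswith("[") and name.endswith("]"):
            name = name[1:-1]
            active = name
        available.append(name)
    return active, available


def current_scheduler(device: str = "sda") -> str:
    """Read the active scheduler for a block device (device name is an assumption)."""
    path = Path("/sys/block") / device / "queue" / "scheduler"
    return parse_schedulers(path.read_text())[0]


if __name__ == "__main__":
    # Parse a sample string instead of touching a real device.
    print(parse_schedulers("noop anticipatory deadline [cfq]"))
```

Switching is done by writing the scheduler name into that same file as root, or by booting with elevator=deadline to change the default.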

D

Thu 14 Jan '10

Pillar claiming their RAID5 is more reliable than RAID6? Wizardry or fiction?

Competing against Pillar at an account. One of the things they said: that their RAID5 is superior in reliability to RAID6. I wanted to put this in the public domain and, if true, invite Pillar engineers to comment here and explain how it works for all to see. If untrue, again I invite the Pillar engineers to comment and explain why it’s untrue.

The way I see it: very simply, RAID5 is N+1 protection, RAID6 is N+2. Mathematically, RAID5 is about 4,000 times more likely to lose data than a RAID6 group with the same number of data disks. Even RAID10 is about 160 times more likely to lose data than RAID6.
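
To give a feel for where ratios of that size come from, here is a rough sketch using the classic textbook MTTDL (mean time to data loss) approximations for single and double parity. The drive MTTF, rebuild time and group size below are hypothetical inputs chosen for illustration, not figures from this post, and the model ignores unrecoverable read errors and correlated failures.

```python
def mttdl_raid5(data_disks: int, mttf_h: float, mttr_h: float) -> float:
    """Approximate MTTDL (hours) for single parity: data is lost when a
    second drive fails while the first is still rebuilding."""
    g = data_disks + 1  # total drives in the group
    return mttf_h ** 2 / (g * (g - 1) * mttr_h)


def mttdl_raid6(data_disks: int, mttf_h: float, mttr_h: float) -> float:
    """Approximate MTTDL (hours) for double parity: data is lost when a
    third drive fails while two are still rebuilding."""
    g = data_disks + 2  # total drives in the group
    return mttf_h ** 3 / (g * (g - 1) * (g - 2) * mttr_h ** 2)


if __name__ == "__main__":
    # Hypothetical inputs: 8 data disks, 500,000 h MTTF, 12 h rebuild.
    d, mttf, mttr = 8, 500_000.0, 12.0
    ratio = mttdl_raid6(d, mttf, mttr) / mttdl_raid5(d, mttf, mttr)
    print(f"RAID6 outlasts RAID5 by a factor of about {ratio:,.0f}")
```

In this simplified model the ratio reduces to MTTF / ((data disks + 2) × MTTR), so it swings a lot with the assumed rebuild time and group size. With these particular inputs it lands in the low thousands.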

The only downside to RAID6 is performance – if you want the protection of RAID6 but with extremely high performance, look at NetApp: the RAID-DP that NetApp employs by default has, in many cases, better performance than even RAID10. Oracle has several PB of DBs running on NetApp RAID-DP. Can’t be all that bad.

See here for some info…

D

Sat 9 Jan '10

What if you could dramatically improve your application testing times? What would happen to your productivity and to the company’s bottom line?

So, let’s say the DBA (or insert some other discipline) wants to do some testing for a new product (known to happen occasionally) – and the way he would really like to test is to create 20 test cases, which requires 20 copies of the main database. He would then automate the test and therefore get results very quickly.

He approaches the storage admin with the problem, only to be told this isn’t possible since there isn’t enough space on the array. The DBA goes back to his cube frustrated, and figures out some ghetto way of creating at least 1 copy of the database, which creates the following problems:

  1. He has to figure out a way to do it (takes time)
  2. He can only test 1 case at a time (time)
  3. He cannot easily compare what-if scenarios between test cases (lack of flexibility)
  4. His ghetto way of doing it may involve single 1TB disks in a workstation (lack of reliability, time)

Ultimately, the testing takes longer, is error-prone, and the DBA’s productivity level goes way down.

What if the storage admin could, instead, tell the DBA that he can even take hundreds of copies of the DB, there’s no issue doing that?

What would happen to the DBA’s productivity?    

What new ideas would he be able to come up with?

How would that affect the quality of the product?

How would that affect the company’s bottom line? Being able to go to market with improved quality and quicker than the competition?

You see, intelligent storage – intelligently deployed – can solve many more problems than just “give me some space” or “give me more performance”.

There aren’t many technologies out there that can comfortably do this, which is probably why most storage people aren’t aware of this. But an array that can create space- and performance-efficient application-consistent DB clones is the ticket. Being able to create full copies and/or virtual space-efficient copies that end up being unusably slow doesn’t count… :)

The only vendor I know of that can pull this off (properly) is NetApp with their FlexClone technology. One can even use it to deploy thousands of identical VMs… there are some use cases for that, too :)

Activision (the company that makes the famous Guitar Hero game) is a good example of using this technology to rapidly accelerate development – and ended up making the Christmas deadline, which resulted in several more millions in sales. See here.

Oracle is another small company that uses this technology pervasively.

If anyone else knows of more vendors that can do this (properly) please chime in.

D

Sun 14 Jun '09

New ext4 vs XFS benchmarks using Fedora 11 Leonidas

What a difference a kernel rev and/or distribution makes. If you recall from a previous post, I was unable to complete postmark testing on Ubuntu 9.04 using ext4, and had to recommend against ext4. Now, with the release of Fedora 11 “Leonidas”, a new kernel seems to make a big difference in performance and stability of ext4.

Some other observations before I show any numbers:

  • This is NOT the same computer as was used in the previous test, so don’t use these numbers to compare between Ubuntu and Fedora. It’s a desktop with a 64-bit Athlon and 1GB RAM. I know, I know… I didn’t have access to the other box. Look at Phoronix.com for a comparison of the two.
  • The 2.6.29 kernel seems to have a much better implementation of the CFQ I/O elevator: with XFS I only noticed a slight decrease in performance using deadline, instead of the increase I usually get (ext3 and ext4 have always been tuned for CFQ).
  • In this version, using my usual (and sometimes unsafe and daring) mount switches didn’t seem to make a huge difference on XFS and none on ext4 or even ext3. Fedora 11 is really a distribution that the developers want you to be able to use without much fussing.
  • On all tests, I created XFS with mkfs.xfs -f -l lazy-count=1 -l size=128m /dev/…  - this enables the 2 main (and safe) tunings that I believe everyone should follow with XFS. Kinda hard to do while installing a distribution; the Fedora 11 installer wasn’t happy about it. Ubuntu is more forgiving: it lets you boot into the LiveCD and you can manually create partitions before you let the installer do its thing. Convenient for single-root-partition installs…
  • “XFS tuned” means mounted with noatime,logbsize=256k,nobarrier (nobarrier is unsafe unless you’re on a UPS).
  • “ext3 tuned” means barrier=0,noatime,data=writeback. Used to make a big difference…
  • The same disk area was used for all tests
  • Scribefire on Firefox sucks compared to Mac- or Windows-based offline blog editors. There are some KDE-based ones but I didn’t want to download 100s of MB of KDE support infrastructure to run a 600K blog program…

Postmark numbers:

| Filesystem | Read MB/s | Write MB/s | IOPS |
|---|---|---|---|
| XFS defaults | 4.9 | 10.34 | 215 |
| XFS tuned | 6.23 | 13.16 | 263 |
| XFS noatime,logbsize | 6.38 | 13.47 | 263 |
| ext4 noatime | 9.62 | 20.32 | 416 |
| ext3 noatime | 5.71 | 12.06 | 238 |
| ext3 “tuned” | 5.32 | 11.24 | 219 |
| ext3 writeback,noatime | 4.73 | 9.98 | 192 |

Bonnie++ numbers:

| Filesystem | IOPS | Block writes KB/s | Rewrite KB/s |
|---|---|---|---|
| XFS defaults | 328.4 | 116600 | 52066 |
| XFS tuned | 328.6 | 119981 | 51639 |
| XFS noatime,logbsize | 333 | 119781 | 50519 |
| ext4 noatime | 335.1 | 117285 | 48797 |
| ext3 noatime | 294.6 | 100771 | 43033 |

Verdict

  • Ext4 shows great promise!
  • For sheer MB/s on large files, XFS is still better by a small margin
  • If you want to be doing operations on many small files, ext4 is great
  • The reworked CFQ scheduler rocks

D

Thu 5 Feb '09

The true XIV fail condition finally revealed (?)

I just got this information:

For XIV to be in jeopardy you need to lose 1 drive from one of the host-facing ingest nodes AND 1 drive from the normal data nodes within a few minutes (so there’s no time to rebuild) while writing to the thing.

Have no way of confirming this but it did come from a reliable source.

A customer recently tried pulling random drives and XIV didn’t shut down and was working fine, but they were from the data nodes.

Why can’t anyone post something concrete here? I’m sure IBM won’t post since the confusion serves them well.

For what it’s worth, the customer is really happy with the simplicity of the XIV GUI.

D

Mon 5 Jan '09

So what exactly is IBM trying to do with the XIV?

By now most people dealing with storage know that IBM acquired the XIV technology. What IBM is doing now is trying to push the technology to everyone and their dog, for reasons we’ll get into…

I just hope IBM gets their storage act together since now they’re selling products made by 4-5 different vendors, with zero interoperability between them (maybe SVC is the “one ring to rule them all”?)

In a nutshell, the way XIV works is by using normal servers running Linux and the XIV “sauce” and coupling them together via an Ethernet backbone. A few of the nodes get FC cards and can become FC targets. A few more of the features:

  • Thin provisioning
  • Snaps
  • Synchronous (only) replication
  • Easy to use (there’s not much you can do with it)
  • Uses RAID-X (no global spares; instead, spare space is reserved on each drive, which makes faster rebuilds possible)
  • Only mirrored
  • A good amount of total cache per system since each server has several GB of RAM BUT the cache is NOT global (each node simply caches the data for its local disks).

IBM claims insane performance numbers with the XIV (“it will destroy DMX/USP!” — sure). But let’s take a look at how everything looks:

  • 180 maximum (or minimum) drives (you can get a half config, but I think you always get the 180 drives and license half of them; I might be mistaken, but I believe you have to commit to buying the whole thing within 1 year)
  • Normal Linux servers do everything
  • Only SATA
  • The backbone is Ethernet, not FC or Infiniband (much, much higher latency is incurred by Ethernet vs the other technologies)

The way IBM claims they can sustain high speed is by not letting the SATA drives get bound by their low transactional performance (low compared to 15K FC drives and, even more so, SSDs). From what I understand (and IBM employees feel free to chime in) XIV:

  1. Ingests data using a few of the front-end nodes
  2. Tries to break up the datastream into 1MB chunks
  3. The algorithm tries to pseudo-randomly spread the 1MB chunks and mirror them among the nodes (the simple rule being that a 1MB chunk cannot have a mirror on the same server/shelf of drives!)
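
The placement rule in step 3 can be sketched in a few lines of Python. This is only a toy illustration of the rule as I described it (1MB chunks, pseudo-random spread, mirror never on the same node); the real XIV algorithm isn't public, and the function and node names here are invented.

```python
import random

CHUNK_BYTES = 1 << 20  # 1MB chunks, as described above


def place_chunks(data_len: int, nodes: list, seed: int = 0) -> list:
    """Return (chunk_index, primary_node, mirror_node) for each 1MB chunk.

    Primary and mirror are drawn pseudo-randomly and are always two
    different nodes, so no chunk has both copies in the same place.
    """
    rng = random.Random(seed)
    n_chunks = (data_len + CHUNK_BYTES - 1) // CHUNK_BYTES  # round up
    placements = []
    for index in range(n_chunks):
        primary, mirror = rng.sample(nodes, 2)  # two distinct nodes
        placements.append((index, primary, mirror))
    return placements


if __name__ == "__main__":
    nodes = [f"node{i}" for i in range(15)]  # hypothetical node names
    for chunk in place_chunks(5 * CHUNK_BYTES + 1, nodes):
        print(chunk)
```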

Obviously, by turning as much as possible into large-block writes to the SATA drives and using the cache to great effect, one should be able to see the 180 SATA drives perform pretty much as fast as possible (ideally, the drives should be seeing streaming instead of random data). However (there’s always that little word…)

  1. There is no magic!
  2. If the incoming random IOPS are coming at too great a rate (OLTP scenarios), any cache can get saturated (the writes HAVE to be flushed to disk, I don’t care what array you have!) and it all boils down to the actual number of disks in the box. The box is said to do 20,000 IOPS if that happens - which I think is optimistic at 111 IOPS/drive! At any rate, 20,000 IOPS is less than what even small boxes from EMC or other vendors can do when they run out of cache. Where’s the performance advantage of XIV?
  3. The “randomization removing algorithm”, if indeed there’s such a thing in the box, will have issues with more than 1-2 servers sending it stuff
  4. See #1!
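
The per-drive figure in point 2 is simple division, and it can be compared with a back-of-the-envelope estimate of what a 7,200 RPM drive does on purely random I/O. The 8.5 ms average seek time below is an assumed, typical-looking value rather than a measured one, and the helper name is made up.

```python
def random_iops_per_drive(rpm: float, avg_seek_ms: float) -> float:
    """Rough random IOPS for one spindle: one seek plus, on average,
    half a rotation per I/O (ignores caching and queueing)."""
    rotational_latency_ms = 0.5 * 60_000 / rpm
    return 1000 / (avg_seek_ms + rotational_latency_ms)


claimed_total_iops = 20_000
drives = 180

# What each drive would have to deliver to reach the claimed total.
claimed_per_drive = claimed_total_iops / drives

# Estimate for a 7,200 RPM SATA drive with an assumed 8.5 ms average seek.
estimated_per_drive = random_iops_per_drive(7200, 8.5)

if __name__ == "__main__":
    print(f"claimed:   {claimed_per_drive:.0f} IOPS/drive")
    print(f"estimated: {estimated_per_drive:.0f} IOPS/drive")
    print(f"estimated total for {drives} drives: {estimated_per_drive * drives:.0f} IOPS")
```

Under those assumptions the estimate comes out below the 111 IOPS/drive the claim needs, which is why I call 20,000 optimistic.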

Like with anything, you can only extract so much efficiency out of a given system before it blows up.

An EMC CX4-960 can be configured with 960 drives. Even assuming that not all are used due to spares etc., you are left with a system with over 5 times the number of physical disks vs an XIV, tons more capacity etc. Even if the “magic” of XIV makes it more efficient, are those XIV SATA drives really 5 times more efficient? (5 times would only make it EQUAL to the 960’s performance; XIV would have to be well over 5 times more efficient than an EMC box of equivalent size to beat the 960.)

Let’s put it this way:

If my system was as efficient as IBM claims, and I had IBM’s money, I’d buy all the competing arrays, even at several times the size of my box, and publicize all kinds of benchmarks showing just how cool my box is vs the competition. You just can’t find that info anywhere, though.

Regarding innovation: Other vendors have had similar chunklet wide striping for years now (HP EVA, 3Par, Compellent if I’m not mistaken, maybe more). 3Par for sure does hot sparing similar to an XIV (they reserve space on each drive). 3Par can also grow way bigger than XIV (over 1,000 drives).

So, if I want a box with thin provisioning, wide striping, sparing like XIV but the ability to choose among different drive types, why not just get a 3Par? What is the compelling value of XIV, short of being able to push 180 SATA drives well? Nobody has been able to answer this.

I’m just trying to understand XIV’s value prop since:

  1. It’s not faster unless you compare it to poorly architected configs
  2. It has less than 50% efficiency at best, so it’s not good for bulk storage
  3. It’s not cheap from what I’ve seen
  4. Burns a ton of power
  5. Cannot scale AT ALL
  6. Cannot tier within the box (NO drive choices besides 1TB SATA)
  7. Cannot replicate asynchronously
  8. Has no application integration
  9. No Quality of Service performance guarantees
  10. No ability to granularly configure it
  11. Is highly immature technology with a small handful of reference customers and a tiny number of installs! (I guess everyone has to start somewhere but do YOU want to be the guinea pig?)

Unless your needs are exactly what XIV provides, why would you ever buy one? Even if your space/performance needs are in the XIV neighborhood there are other far more capable storage systems out there for less money!

IBM is not stupid, or at least I hope not. So, what IBM is doing is pretty much handing out XIVs to whoever will take one. If you get one, think of yourself as a beta tester. I can hardly believe that IBM bought the XIV IP without seeing some kind of roadmap; otherwise the purchase would be kinda stupid! If you are a beta tester, be aware that:

  • XIV cheats with benchmarks that write zeros to the disk or read from addresses that were never previously accessed
  • XIV will be super-fast with 1-2 hosts pushing it, so push it realistically with a real number of hosts
  • Try to load up the box since if it’s not full enough you’ll get an extremely skewed view of performance - put even dummy data inside but fill it to 80% and then run benchmarks!
  • Test with your applications, not artificial benchmarks
  • Do not accept the box in your datacenter before you see a quote! In at least 3 cases that I know of IBM drops off the box without giving you even a ballpark figure. I think that’s insane.

And last, but not least: I keep hearing and reading about the following being true, I’d love IBM engineers to disprove it:

If you remove 2-3 drives from different trays simultaneously from a loaded system then you will suffer a catastrophic failure (logically makes sense looking at how the chunks get allocated but I’d love to know how it works in real life). And before someone tells me that this never happens in real life: it’s personally happened to me at least once (lost 2 drives in rapid succession), and to many other people I know who have any serious real-world experience…
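
Here is why it "logically makes sense", as a rough sketch. It assumes a layout of 15 trays with 12 drives each (180 total) and that every 1MB chunk has its two copies on a uniformly random pair of drives in different trays. Both are simplifying assumptions on my part, and the real placement logic may be smarter than this.

```python
from math import comb


def cross_tray_pairs(trays: int, drives_per_tray: int) -> int:
    """Number of drive pairs that sit in different trays."""
    total = trays * drives_per_tray
    return comb(total, 2) - trays * comb(drives_per_tray, 2)


def expected_lost_chunks(stored_chunks: int, trays: int = 15,
                         drives_per_tray: int = 12) -> float:
    """Expected chunks with BOTH copies on one specific cross-tray pair."""
    return stored_chunks / cross_tray_pairs(trays, drives_per_tray)


def prob_some_chunk_lost(stored_chunks: int, trays: int = 15,
                         drives_per_tray: int = 12) -> float:
    """Chance that at least one chunk has both copies on that pair."""
    pairs = cross_tray_pairs(trays, drives_per_tray)
    return 1 - (1 - 1 / pairs) ** stored_chunks


if __name__ == "__main__":
    chunks = 40_000_000  # hypothetical: roughly 40TB of data in 1MB chunks
    print(cross_tray_pairs(15, 12))
    print(f"{expected_lost_chunks(chunks):.0f} chunks expected on any one pair")
    print(f"P(at least one chunk lost) = {prob_some_chunk_lost(chunks):.6f}")
```

With tens of millions of chunks spread over only about fifteen thousand possible drive pairs, every pair ends up sharing some chunks, so under this model losing two such drives before a rebuild finishes means losing data.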

D

Wed 20 Aug '08

What is the value of your data? Do you have the money to implement proper DR that works? How are you deciding what kind of storage and DR strategy you’ll follow? And how does Continuous Data Protection like EMC’s RecoverPoint help?

Maybe the longest title for a post ever. And one of my longest, most rambling posts ever, it seems.

Recently we did a demo for a customer that I thought opened an interesting can of worms. Let’s set the stage – and, BTW, let it be known that I lost my train of thought multiple times writing this over multiple days so it may seem a bit incoherent (unusually, it wasn’t written in one shot).

The customer at the moment uses DASD and is looking to go to some kind of SAN for all the usual reasons. They were looking at EMC initially, then Dell told them they should look at Equallogic (imagine that). Not that there’s anything wrong with Dell or Equallogic… everything has its place.

So they get the obligatory “throw some sh1t on the wall and see what sticks” quote from Dell – literally Dell just sent them pricing on a few different models with wildly varying performance and storage capacities, apparently without rhyme or reason. I guess the rep figured they could afford at least one of the boxes.

So we start the meeting with yours truly asking the pointed questions, as is my idiom. It transpires that:

  1. Nobody looked at what their business actually does
  2. Nobody checked current and expected performance
  3. Nobody checked current and expected DR SLAs
  4. Nobody checked growth potential and patterns
  5. Nobody asked them what functionality they would like to have
  6. Nobody asked them what functionality they need to have
  7. Nobody asked how much storage they truly need
  8. Nobody asked them just how valuable their data is
  9. Nobody asked them how much money they can really spend, regardless of how valuable their data is and what they need.

So we do the dog-and-pony – and unfortunately, without really asking them anything about money, show them RecoverPoint first, which is even worse than showing a Lamborghini (or insert your favorite grail car) to someone that’s only ever used and seen badly-maintained rickshaws, to use a car analogy.

To the uninitiated, EMC’s RecoverPoint is the be-all, end-all CDP (Continuous Data Protection) product, all nicely packaged in appliance format. It used to be Kashya (which seems to mean either “hard question” or “hard problem” in Hebrew), then EMC wisely bought Kashya, and changed the name to something that makes more marketing sense. Before EMC bought them, Kashya was the favorite replication technology of several vendors that just didn’t have anything decent in-place for replication (like Pillar). Obviously, with EMC now owning Kashya, it would look very, very bad if someone tried to sell you a Pillar array and their replication system came from EMC (it comes from FalconStor now). But I digress.

RecoverPoint lets you roll your disks back and forth in time, very much like a super-fine-grained TiVo for storage. It does this by creating a space equal to the space consumed by the original data that acts as a mirror, plus the use of what is essentially a redo log (so to use it locally you need 2x the storage + redo log space). The bigger the redo log, the more you can go back in time (you could literally go back several days). Oh, and they like to call the redo log The Journal.

It works by effectively mirroring the writes so they go to their target and to RecoverPoint. You can implement the “splitter” at the host level, the array (as long as it’s a Clariion from EMC) or with certain intelligent fiber switches using SSM modules (the last option being by far the most difficult and expensive to implement).

In essence, if you want to see a different version of your data, you ask RecoverPoint to present an “image” of what the disks would look like at a specified point-in-time (which can be entirely arbitrary or you can use an application-aware “bookmark”). You can then mount the set of disks the image represents (called a consistency group) to the same server or another server and do whatever you need to do. Obviously there are numerous uses for something like that. Recovering from data corruption while losing the least amount of data is the most obvious use case but you can use it to run what-if scenarios, migrations, test patches, do backups, etc.
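
The idea of a mirror plus a redo log that can present an image at any point in time can be shown with a toy model. This is a conceptual sketch only, not how RecoverPoint is implemented; the class and method names are invented, and it assumes writes are logged in timestamp order.

```python
class JournaledVolume:
    """Toy block device that logs every write so it can be rolled back."""

    def __init__(self, n_blocks: int):
        self.blocks = [b""] * n_blocks
        # Each entry: (timestamp, block number, data before, data after).
        self.journal = []

    def write(self, timestamp: float, block_no: int, data: bytes) -> None:
        """Apply a write and remember what it overwrote."""
        self.journal.append((timestamp, block_no, self.blocks[block_no], data))
        self.blocks[block_no] = data

    def image_at(self, timestamp: float) -> list:
        """Return the volume as it looked right after `timestamp`,
        by undoing newer writes on a copy, newest first."""
        image = list(self.blocks)
        for when, block_no, before, _after in reversed(self.journal):
            if when <= timestamp:
                break
            image[block_no] = before
        return image


if __name__ == "__main__":
    vol = JournaledVolume(2)
    vol.write(1.0, 0, b"good")
    vol.write(2.0, 1, b"also good")
    vol.write(3.0, 0, b"corrupt")
    print(vol.image_at(2.0))  # state before the bad write at t=3.0
```

The longer the journal is allowed to grow, the further back image_at can reach, which is the "bigger redo log, more time travel" trade-off.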

You can also use RecoverPoint to replicate data to a remote site (where you need just 1x the storage + redo log). It does its own deduplication and TCP optimizations during replication, and is amazingly efficient (far more so than any other replication scheme in my opinion). They call it CRR (Continuous Remote Replication). Obviously, you get the TiVo-like functionality at the remote side as well.

The kicker is the granularity of CRR/CDP. Obviously, as with anything, there can be no magic, but, given the optimizations it does, if the pipe is large enough you can do near-synchronous replication over distances previously unheard of, and get per-write granularity both locally and remotely. All without needing a WAN accelerator to help out, expensive FC-IP bridges and whatnot.

There’s one pretender that likes to take fairly frequent snapshots but even those are several minutes apart at best, can hurt performance and are limited in their ultimate number. Moreover, their recovery is nowhere near as slick, reliable and foolproof.

To wit: We did demos going back and forth a single transaction in SQL Server 2005. Trading firms love that one. The granularity was a couple of microseconds at the IOPS we were running. We recovered the DB back to entirely arbitrary points in time, always 100% successfully. Forget tapes or just having the current mirrored data!

We also showed Exchange being recovered at a remote Windows cluster. Due to Windows cluster being what it is, it had some issues with the initial version of disks it was presented. The customer exclaimed “this happened to me before during a DR exercise, it took me 18 hours to fix!!” We then simply used a different version of the data, going back a few writes. Windows was happy and Exchange started OK on the remote cluster. Total effort: the time spent clicking around the GUI asking for a different time + the time to present the data, less than a minute total. The guy was amazed at how streamlined and straightforward it all was.

It’s important to note that Exchange suffers more from those issues than other DBs since it’s not a “proper” relational DB like SQL is, the back-end DB is Jet and don’t let me get started… the gist is that replicating Exchange is not always straightforward. RecoverPoint gave us the chance to easily try different versions of the Exchange data, “just in case”.

How would you do that with traditional replication technologies?

How would you do that with other so-called CDP that is nowhere near as granular? How much data would you lose? Is that competing solution even functional? Anyone remember Mendocino? They kinda tried to do something similar, the stuff wouldn’t work right in a pristine lab environment, I gave up on it. RecoverPoint actually works.

Needless to say, the customer loved the demo (they always do, never seen anyone not like RecoverPoint, it’s like crack for IT guys). It solves all their DR issues, works with their stuff, and is almost magical. Problem is, it’s also pretty expensive – to protect the amount of data that customer has they’d almost need to spend as much on RecoverPoint as on the actual storage itself.

Which brings us to the source of the problem. Of course they like the product. But for someone that is considering low-end boxes from Dell, IBM etc. this will be a huge price shock. They keep asking me to see the price, then I hear they’re looking at stuff from HDS and IBM and (no disrespect) that doesn’t make me any more confident that they can afford RecoverPoint.

Our mistake is that we didn’t at first figure out their budget. And we didn’t figure out the value of their data – maybe they don’t need the absolute best DR technology extant since it won’t cost them that much if their data isn’t there for a few hours.

The best way to justify any DR solution is to figure out how much it costs the business if you can achieve, say, 1 day of RTO and 5 hours of RPO vs 5 minutes of RTO and near-zero RPO. Meaning, what is the financial impact to the business for the longer RPO and RTO? And how does it compare to the cost of the lower RPO and RTO recovery solution?
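
That comparison is easy to put into numbers. The sketch below uses entirely hypothetical dollar figures and outage frequency; the point is the shape of the calculation, not the values.

```python
def yearly_outage_cost(rto_hours: float, rpo_hours: float,
                       downtime_cost_per_hour: float,
                       lost_data_cost_per_hour: float,
                       outages_per_year: float) -> float:
    """Expected yearly cost of outages: time spent down (RTO) plus
    work that has to be redone or is gone for good (RPO)."""
    per_outage = (rto_hours * downtime_cost_per_hour
                  + rpo_hours * lost_data_cost_per_hour)
    return outages_per_year * per_outage


# Hypothetical business: $10,000 per hour down, $20,000 per hour of
# lost data, and one serious outage every two years.
COSTS = dict(downtime_cost_per_hour=10_000,
             lost_data_cost_per_hour=20_000,
             outages_per_year=0.5)

basic = yearly_outage_cost(rto_hours=24, rpo_hours=5, **COSTS)
premium = yearly_outage_cost(rto_hours=5 / 60, rpo_hours=0, **COSTS)

if __name__ == "__main__":
    print(f"1 day RTO / 5 h RPO:      ${basic:,.0f} per year")
    print(f"5 min RTO / near-zero RPO: ${premium:,.0f} per year")
    print(f"most you should pay extra: ${basic - premium:,.0f} per year")
```

If the better solution costs more per year than the difference between the two, the cheaper RPO and RTO is the rational choice.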

The real issue with DR is that almost no company truly goes through that exercise. Almost everyone says “my data is critical and I can afford zero data loss” but nobody seems to be in touch with reality, until presented with how much it will cost to give them the zero RPO capability.

The stages one goes through in order to reach DR maturity are like the stages of grief – Denial, Anger, Bargaining, Depression, and Acceptance.

Once people see the cost, they hit the Denial stage and do a 180: “You know what, I really don’t need this data back that quickly and can afford a week of data loss!!! I’ll mail punch cards to the DR site!” – typically, this is removed from reality and is a complete knee-jerk reaction to the price.

Then comes Anger – “I can’t believe you charge this much for something essential like this! It should be free! You suck! It’s like charging a man dying of thirst for water! I’ll sue! I’ll go to the competition!”

Then they realize there’s no competition to speak of so we reach the Bargaining stage: “Guys, I’ll give you my decrepit HDS box as a trade-in. I also have a cool camera collection you can have, baseball cards, and I’ll let you have fun with my sister for a week!”

After figuring out how much money we can shave off by selling his HDS box, cameras and baseball cards on ebay and his sister to some sinister-looking guys with portable freezers (whoopsie, he did say only a week), it’s still not cheap enough. This is where Depression sets in. “I’m screwed, I’ll never get the money to do this, I’ll be out of a job and homeless! Our DR is an absolute joke! I’ll be forced to use simple asynchronous mirroring! What if I can’t bring up Exchange again? It didn’t work last time!”

The final stage is Acceptance – either you come to terms with the fact you can’t afford the gear and truly try to build the best possible alternative, or you scrounge up the money somehow by becoming realistic: “well, I’m only gonna use RecoverPoint for my Exchange and SQL box and maybe the most critical VMs, everything else will be replicated using archaic methods but at least my important apps are protected using the best there is”.

It would save everyone a lot of heartache and time if we just jump straight to the Acceptance phase where RecoverPoint is concerned:

  • Yes, it really works that well.
  • Yes, it’s that easy.
  • Yes, it’s expensive because it’s the best.
  • Yes, you might be able to afford it if you become realistic about what you need to protect.
  • Yes, you’ll have to do your homework to justify the cost. If nothing else, you’ll know how much an outage truly costs your business! Maybe your data is more important than your bosses realize. Or maybe it’s a lot LESS important than what everyone would like to think. Either way you’re ahead!
  • Yes, leasing can help make the price more palatable. Leasing is not always evil.
  • No, it won’t be free.
  • If you have no money at all why are you even bothering the vendors? Read the brochures instead.
  • If you have some money please be upfront with exactly how much you can spend, contrary to popular belief not everyone is out to screw you out of all your IT budget. After all we know you can compare our pricing to others’ so there’s no point in trying to screw anyone. Moreover, the best customers are repeat customers, and we want the best customers! Just like with cars, there’s some wiggle room but at some point if you’re trying to get the expensive BMW you do need to have the dough.

     

Anyway, I rambled enough…

 

D

    

Tue 25 Mar '08

Windows Server 2008 RTM 64-bit performance versus Vista SP1 64-bit, and using 2008 as a workstation

I’ve been using Vista x64 for a while now, just so I can make use of all the memory on my machine (an über-thinkpad), and because I like shiny new things and 64-bitness and don’t want to be one-upped by smug Mac users with their feline-named OSes, mock turtlenecks and their newfound 64-bit capabilities. Of course, with the good comes some bad – Vista, while in my opinion a step forward in many ways, does take a step backward when it comes to some areas of performance and sheer resource requirements. A lot of it can be attributed to poorly-written drivers, especially any Aero GUI slowdowns with nVidia cards.

Since space was running out I bought a new hard drive (200GB Seagate 7200 RPM) and decided to install the RTM 2008 bits. If something went wrong I figured I could always either go back to my old drive or just move Vista to the new drive with some imaging utility or other, no biggie. If 2008 worked out, I’d keep it.

The reason this comparison is worthwhile is that 2008 and Vista SP1 have the same exact kernel – I checked, NTOSKRNL.EXE is the same in both OSes. One would think that the differences wouldn’t be huge and that therefore there’s no point going to 2008. Of course, there are a lot of other pieces aside from the kernel, and I think that Microsoft checks to see what OS you’re running and maybe disables certain features in the kernel accordingly – I couldn’t get the LargeSystemCache registry parameter to have any effect on Vista, for example.

Let’s compare CPU- and Graphics-benchmarks first, since those shouldn’t really be different. I used Cinebench 64-bit.

 

Vista:

Rendering (Single   CPU): 3040 CB-CPU
Rendering (Multiple CPU): 5367 CB-CPU
Multiprocessor Speedup: 1.77
Shading (OpenGL Standard)          : 4256 CB-GFX

 

2008:

Rendering (Single   CPU): 3053 CB-CPU
Rendering (Multiple CPU): 5379 CB-CPU
Multiprocessor Speedup: 1.86
Shading (OpenGL Standard)          : 4478 CB-GFX

 

Slightly better scores for 2008 it seems, but not dramatically better. Next, postmark, since I/O should be where it shines, it being a server and all:

 

Vista:

Time:

        170 seconds total

        98 seconds of transactions (204 per second)

 

Files:

        20092 created (118 per second)

                Creation alone: 10000 files (200 per second)

                Mixed with transactions: 10092 files (102 per second)

        9935 read (101 per second)

        10064 appended (102 per second)

        20092 deleted (118 per second)

                Deletion alone: 10184 files (462 per second)

                Mixed with transactions: 9908 files (101 per second)

 

Data:

        548.25 megabytes read (3.23 megabytes per second)

        1158.00 megabytes written (6.81 megabytes per second)

 

2008:

Initially I had enabled the “advanced performance” in the device manager for disk, since everyone tells you to do so in all tuning guides…

 

Time:

136 seconds total

45 seconds of transactions (444 per second)

 

Files:

20092 created (147 per second)

Creation alone: 10000 files (263 per second)

Mixed with transactions: 10092 files (224 per second)

9935 read (220 per second)

10064 appended (223 per second)

20092 deleted (147 per second)

Deletion alone: 10184 files (192 per second)

Mixed with transactions: 9908 files (220 per second)

 

Data:

548.25 megabytes read (4.03 megabytes per second)

1158.00 megabytes written (8.51 megabytes per second)

 

Much faster than Vista. I then disabled the “enable advanced performance” to see how much slower it would become:

 

Time:

110 seconds total

39 seconds of transactions (512 per second)

 

Files:

20092 created (182 per second)

Creation alone: 10000 files (454 per second)

Mixed with transactions: 10092 files (258 per second)

9935 read (254 per second)

10064 appended (258 per second)

20092 deleted (182 per second)

Deletion alone: 10184 files (207 per second)

Mixed with transactions: 9908 files (254 per second)

 

Data:

548.25 megabytes read (4.98 megabytes per second)

1158.00 megabytes written (10.53 megabytes per second)

 

Amazingly, much faster, not slower! I did some checking and this is what the setting actually does… it re-introduces an older, somewhat undesirable behavior. A bit hard to find the proper explanation, and I hope Microsoft makes what happens behind the scenes a bit more obvious. At the moment it’s quite obscure, and every guide tells you to enable it for performance. Just leave it alone. BTW the Vista score is with the setting disabled.
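To put the three postmark runs side by side, here is a small sketch that recomputes the rates from the totals printed above. The 20,000 transaction count is an inference (10092 creates plus 9908 deletes mixed with transactions), not something postmark printed directly, and the run labels are just names chosen for this example.

```python
# Totals copied from the three postmark runs above.
MB_READ = 548.25
MB_WRITTEN = 1158.00
# Assumption: 20,000 transactions, inferred from 10092 creates + 9908 deletes.
TRANSACTIONS = 20000

runs = {
    "Vista": {"total_s": 170, "txn_s": 98},
    "2008, advanced performance on": {"total_s": 136, "txn_s": 45},
    "2008, advanced performance off": {"total_s": 110, "txn_s": 39},
}

def summarize(run):
    """Derive transactions/s and MB/s from the raw totals of one run."""
    return {
        "tps": TRANSACTIONS / run["txn_s"],
        "read_mb_s": MB_READ / run["total_s"],
        "write_mb_s": MB_WRITTEN / run["total_s"],
    }

baseline = summarize(runs["Vista"])
for name, run in runs.items():
    s = summarize(run)
    print(f"{name}: {s['tps']:.0f} tps, "
          f"{s['read_mb_s']:.2f} MB/s read, {s['write_mb_s']:.2f} MB/s write, "
          f"{s['tps'] / baseline['tps']:.2f}x the Vista transaction rate")
```

Note that the data rates are divided by total run time, while the transaction rate only covers the transaction phase, which is why the two improve by different factors.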

 

Could I have run other benchmarks like Sandra etc? Sure, but I just wanted to keep it simple and there just wasn’t enough time.

 

The next step is to run the tests on the same hardware with XP. That’s forthcoming.

 

Conclusion:

 

Seems like Microsoft did something right. Even with the 64-bit version (which naturally takes more RAM than the 32-bit one), 2008 Server takes less memory than Vista (200-300MB less at any given time in my case), runs quicker and just feels better, kinda like an unencumbered Vista. Simple things like searching a huge index in Outlook happen much faster than before. The Server Manager app is awesome, and one can try out the Hyper-V Hypervisor (BTW that, predictably, clashes with VMware and disables your power management, so beware). A server OS is in general also more secure and, over time, probably more reliable, given the workloads it’s supposed to run.

 

Can everyone run it? Should they? No, not unless you have a license for 2008 through MSDN or somesuch, otherwise it’s expensive. Some assembly is also required, and you do need to know what you’re doing. However, if you’re so inclined, you can easily get the demo version of 2008. Apparently there are clean, documented ways to increase the evaluation period (no cracks or BIOS spoofers) that I think come from Microsoft but I’m not going to list them here just in case…

 

In addition, while almost all my apps installed fine (including games and hairy driver stuff like Daemon Tools), 2 things didn’t: Bluetooth and my Logitech mouse drivers. I don’t really use Bluetooth, but I liked some of the features of my mouse (the utterly kickass Logitech VX Revolution); now it’s just like a normal mouse. I’m still keeping 2008. I’m sure other stuff will have issues, like DRM/BluRay. For people that like the Windows Sidebar: there are hacks to get it working that involve copying stuff from Vista. I think the sidebar is largely useless.

 

FYI, there are 2 notable omissions in 2008: Readyboost and Superfetch. Superfetch exists as a service but to even get it to start you have to edit the registry. I didn’t think it helped much so I disabled it again. Readyboost isn’t even an option. And the old-style boot prefetch that worked in 2003 Server doesn’t seem to be there. So it does boot a bit slower than Vista, but not much. Once you get the box up and running it’s fast though.

 

In the end, I’m leaving 2008 on my box, and that’s all that matters.

 

D

Fri
7
Dec '07

(Very) Preliminary Windows Server 2008 impressions and Vista Multimedia Performance under battery power

Out of curiosity, I very briefly tried the new Server 2008 Release Candidate (freely available from Microsoft). I’ve been using Vista 64-bit since I need to see all the memory in my machine and, while it works mostly OK, there are some low-level scheduling issues with it – for instance, sound is really choppy on battery power, no matter what I do with the power settings, so I can’t use the thing to watch a DVD or listen to music on the plane. Many others seem to be having the same issues, despite the funky Multimedia Class Scheduler nonsense that Microsoft put in the OS that makes networking slower (great info here), even though older incarnations were not suffering from media playback issues under load. And no, if I disable the Multimedia Scheduler it does NOT work better, it actually gets worse, which means that the service is there to fix some other kludge-y issue Microsoft introduced with the scheduler or something like excessive power throttling of certain devices.

But, as usual, I digress. This is about Server 2008. What’s noteworthy is that Vista SP1 inherits the exact same kernel as Server 2008.

This will be a short entry, there are others online talking more about 2008. What I noticed:

  1. It’s light for a Windows OS. There’s no excessive bloat guys, the thing takes about 300MB of RAM with the default install, and more can be saved by trimming unnecessary services (of which there are very few).
  2. It’s fast. Under preliminary benchmarking, even the RC code (that probably has some features missing and extra debugging code) seems about as fast as 2003 after SP2 (unlike others that have been releasing benchmarks of, say, Vista SP1 in its pre-release form, I’d rather wait until the final code is out).
  3. Seems to work with most Vista drivers so, if you want to turn it into a workstation, you can. You can also install the Vista GUI if you’re so inclined with no adverse effects (aside from the ones that come with the Vista UI that is). Runs very smooth.
  4. Application compatibility is similar to that of Server 2003.
  5. The OS does NOT suffer from the same issues as Vista regarding media playback (I made sure I installed the Power Management driver and selected the same kind of PM scheme as Vista). Maybe a good omen come Vista SP1? We shall see.

The new management interfaces are nicely laid out, and selecting Roles for the server and adding or removing features as needed is very simple. It feels more like a well-integrated 2003 R3 rather than Vista.

I didn’t get to play with the new virtualization, it doesn’t seem to be in the RC code (though, reading some documentation, it seems as if it will have VMotion-like capabilities, which I will believe when I see it).

UPDATE: 12/17/07

There is no more Vista multimedia performance issue on 2 separate computers. Some patches just released by Microsoft removed the issue (plus the issue of the mouse cursor stuttering). Interestingly, the patches had no mention of fixing said issues. I thought it was a fluke but having seen this fixed on 2 different boxes (one 32-bit, one 64) I don’t think it is.

For the Vista detractors: I’d advise everyone to wait until SP1 – as with most Microsoft releases. It’s no different. They’re actually getting better, NT4 was unusable until SP3 at least… given the unreal amount of code in the system, I’m surprised it runs this well. They really need to slim it down. Supposedly, Windows 7 will be slimmer (http://apcmag.com/7668/beyond_vista_windows_7_what_we_know_so_far). However, it mostly targets the kernel and it was never the Windows kernel that was the issue (it’s actually surprisingly decent), it’s all the crud around it.

D

Thu
20
Sep '07

WAN acceleration for remote workers

The deluge of WAN accelerators from Cisco, Riverbed, Juniper, Expand, Packeteer, Bluecoat, Silverpeak etc. etc. is proving good for datacenters. Not sure how many vendors will remain viable in a year or two, but the selection at the moment is decent.

However, most of the vendors don’t address remote desktop acceleration, say for people using 3G cards on their laptops or even cable modems - sometimes the routing to corporate networks can be arcane enough that the ms of latency add up, plus most home connections are asymmetrical anyway.

So, it would be pretty cool to have a WAN accelerator in your laptop, right? Well, so far only two companies have stepped forward:

The far more established product, even if you’ve never heard of it, is AcceleNet Enterprise from ICT (Intelligent Compression Technologies, www.ictcompress.com - they were recently bought by ViaSat). ICT has been doing just this for years, with a veritable who’s who of clients (no they haven’t paid me to say this, I just think the stuff is cool). Lots of service providers use it.

ICT deploys a server that acts as a proxy, then you install an agent on your laptop. Transfers are compressed both ways.

The other vendor is known to us all - it’s Riverbed. They now have what’s called Steelhead Mobile. Effectively, it puts a Riverbed box inside your laptop. It needs a normal Steelhead to communicate with, as well as a Steelhead Mobile Controller for management. I saw pricing for the controller and it was a bit dear…

You can even adjust how much cache to give your mini-Riverbed, so if you have the space, go nuts.

Of course, you can also use this technology for servers and save money on appliance costs - I wonder if they have something that checks if you’ve installed it on a server OS, and how much CPU it takes to do its thing.

I heard somewhere Cisco is also working on something similar, unsurprisingly.

D

Thu
7
Jun '07

ZFS in OSX

Not amazing news but an official announcement nonetheless: Saw this (www.macnn.com/articles/07/06/06/zfs.in.leopard/) and I couldn’t resist posting. This means a few things:

  1. Sun figured out how to make ZFS bootable (at least on OSX)
  2. Someone figured out how to deal with ZFS and resource forks (I can’t believe they are willing to break compatibility with so much software otherwise).

Now I just need a Mac so I can run some benchmarks before and after. I have some buddies that might oblige… finally the Macs get a decent FS.

Now if only Apple could lose the silly Mach legacy. It’s a common misconception that the kernel in OSX is FreeBSD - it ain’t. Run lmbench (www.bitmover.com/lmbench/) on different platforms and compare results such as context switching, thread creation and whatnot. Then you’ll see why OSX can’t always make a decent server OS.

D

Sat
2
Jun '07

IBRIX at EMC World

I’ve known about IBRIX for a while, but it was refreshing to talk to a decent techie that knew the product. They have improved it a lot over the past year.

For the uninitiated, IBRIX can be any of the following:

  1. A network-based filesystem using the IBRIX client and protocol
  2. A filesystem accessed using NFS or CIFS
  3. A SAN-based parallel filesystem

The product’s claim to fame is its scalability and performance (realized by adding extra nodes “hot”). Their most famous client is probably Pixar, who replaced a ton of NetApp boxes with an IBRIX cluster and realized huge performance benefits and vastly reduced costs. I always liked cool filesystem technologies and this definitely falls under the realm of “cool”. Some highlights based on notes I took on my Blackberry during the session and questions I asked:

  • No limits on filesystem size (they have deployed single namespace filesystems several PB in size).
  • 300MB/s read, 200MB/s write per node on a small box. Bigger boxes can do 1.2GB/s per node; of course, your storage needs to be able to keep up.
  • No limit on the number of nodes.
  • Automatic rebalancing of data over time. When you add new disk you rebalance to keep things humming.
  • Dedicated IBRIX backup node, works with 3rd party backup SW, can have many backup servers for backup speed.
  • Has snaps now (global), this was a failing of the product before since it was lacking snapshots.
  • No real limit on the number of files per FS.
  • Biggest file size they have tested on production is an 8TB file, no software limit.
  • Nodes use FC to access storage, clients use Ethernet.
  • Client on Windows or Linux, otherwise general NFS and CIFS. Client is fastest.
  • Your prod servers can be the IBRIX nodes, but it’s very compute-intensive. They recommend using the client (IP-based, bonded) or getting an 8-core box.
  • There is no single lock manager - this is the coolest thing. There is global metadata and global locking, all nodes participate equally.
  • How are node failures handled? All nodes are interchangeable and all see the same storage. If you lose a node, its storage is allocated to the remaining servers. Can lose all but 1 server.
  • Back-end storage size per node? Unlimited.
  • Multipathing per node? Powerpath works. Can do bonded GigE up to 8 ports per.
  • How are files allocated? The file inode contains the info concerning which node it needs to go to. Round-robin allocation or preferred servers per file type. Also if server over 50% full then it’s skipped.
  • All volumes accessible by all nodes.
  • Can stripe huge files across many nodes.
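The allocation rule in the notes above (preferred servers per file type, otherwise round-robin, skipping any server over 50% full) can be sketched in a few lines. This is a toy model and not IBRIX code: the `Allocator` class, the node names and the use of file extensions as the "file type" are all assumptions made for illustration.

```python
from itertools import cycle

class Allocator:
    """Toy placement policy based on the notes above; not IBRIX code."""

    def __init__(self, servers, preferred=None, full_threshold=0.5):
        self.servers = servers            # server name -> fraction of capacity used
        self.preferred = preferred or {}  # file extension -> preferred server names
        self.full_threshold = full_threshold
        self._ring = cycle(list(servers))  # round-robin order

    def _usable(self, name):
        # Servers over the fullness threshold are skipped.
        return self.servers[name] <= self.full_threshold

    def place(self, filename):
        ext = filename.rsplit(".", 1)[-1] if "." in filename else ""
        # Preferred servers for this file type win if they have room.
        for name in self.preferred.get(ext, []):
            if self._usable(name):
                return name
        # Otherwise take the next usable server in round-robin order.
        for _ in range(len(self.servers)):
            name = next(self._ring)
            if self._usable(name):
                return name
        raise RuntimeError("every server is over the fullness threshold")

alloc = Allocator({"node1": 0.2, "node2": 0.7, "node3": 0.4},
                  preferred={"mov": ["node3"]})
placements = [alloc.place(f) for f in ["a.txt", "b.txt", "c.mov", "d.txt"]]
print(placements)
```

Here node2 never receives new files because it is 70% full, and the .mov file goes straight to its preferred node.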

I’m stoked! I can think of so many uses for this product:

  1. Data mining
  2. Digital media
  3. Oil and gas
  4. Backups

D

Wed
23
May '07

Storage Virtualization - is there a point?

This has been bothering me for a while, and I think I’m not alone.

Hitachi has been making great progress with their virtualization gear, as has IBM, Falconstor before them, etc.

They claim you’ll be freed from the vendors’ shackles, achieve greater utilization of your arrays, simplify administration, cure cancer etc.

Well, here’s what I think:

  1. You will instead be shackled to the virtualization provider
  2. You won’t have a clue where your stuff is
  3. If you want to retire an array you could have problems (imagine creating a LUN composed of LUNs from 3 different arrays)
  4. You STILL have to use the management interfaces of the back-end arrays, since you still have to provision the storage. Instead of provisioning to hosts you provision to the virtualizer.

 

So, what have you gained, exactly?

D

Tue
8
May '07

I wonder when dedup will make it to the arrays

Anyone else feel that deduplication has not yet found its final resting place in backups and WAN accelerators?

It’s only a matter of time before the algorithms are run as a matter of choice on the array processors.

Of course, that means fewer disk sales, but also bigger/faster/more expensive processors.

Replication will also become more efficient - see EMC’s recent acquisition of Kashya (now RecoverPoint - one of its functions is dedup during replication from array to array, how long do you think it will take them to move this functionality to the array processors?)

Just some random thoughts…

D

Wed
2
May '07

Cisco WAAS benchmarks, and WAN optimizers in general

Lately I’ve been dealing with WAN accelerators a lot, with the emphasis on Cisco’s WAAS (some other, smaller players are Riverbed, Juniper, Bluecoat, Tacit/Packeteer and Silverpeak). The premise is simple and compelling: Instead of having all those servers at your edge locations, move your users’ data to the core and make accessing the data feel almost as fast as having it locally, by deploying appliances that act as proxies. At the same time, you will actually decrease the WAN utilization, enabling you to use cheaper pipes, or at least not have to upgrade, where in the past you were planning to anyway.

There are significant other benefits (massive MAPI acceleration, HTTP, ftp, and indeed any TCP-based application will be optimized). Many Microsoft protocols are especially chatty, and the WAN accelerators pretty much remove the chattiness, optimize the TCP connection (automatically resizing Send/Receive windows based on latency, for instance), LZ-compress the data, and to top it all will not transfer data blocks that have already been transferred.

At this point I need to point out that there is a lot of similarity with deduplication technologies - for example, Cisco’s DRE (Data Redundancy Elimination) is, at heart, a dedup algorithm not unlike Avamar’s or Data Domain’s. So, if a Powerpoint file has gone through the DRE cache already, and someone modifies the file and sends it over the WAN again, only the modified parts will really go through. It really works and it’s really fast (and I’m about the most jaded technophile you’re likely to meet).

The reason I’m not opposed to this use of dedup (see previous posts) is that the datasets are kept at a reasonable size. For instance, at the edge you’re typically talking about under 200GB of cache, not several TB. Doing the hash calculations is not as time-consuming with a smaller dataset and, indeed, it’s set up so that the hashes are kept in-memory. You see, the whole point of this appliance is to reduce latency, not increase it with unnecessary calculations. Compare this to the multi-TB deals of the “proper” dedup solutions used for backups…

Indeed, why the hell would you need dedup-based backup solutions if you deploy a WAN accelerator? Chances are there won’t be anything at the edge sites to back up, so the whole argument behind dedup-based backups for remote sites sort of evaporates. Dedup now only makes sense in VTLs, just so you can store a bit more.

On Dedup VTLs: Refreshingly, Quantum doesn’t quote crazy compression ratios - I’ve seen figures of about 9:1 as an average, which is still pretty good (and totally dependent on what kind of data you have). I just cringe when I see the 100:1, 1000:1 or whatever insanity Data Domain typically states. I’m still worried about the effect on restore times, but I digress. See previous posts.

Anyway, back to WAN accelerators. So how do these boxes work? All fairly similarly. Cisco’s, for instance, does 3 main kinds of optimizations: TFO, DRE and LZ. TFO means TCP Flow Optimizations, and takes care of snd/rcv window scaling, enables large initial windows, enables SACK and BIC TCP (the latter 2 help with packet loss).

DRE is the dedup part of the equation, as mentioned before.

LZ is simply LZ compression of data, in addition to everything else mentioned above.
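To make the DRE-plus-LZ idea concrete, here is a minimal sketch of the general technique, not Cisco’s implementation: the sender replaces chunks the far side has already seen with a short hash reference, and compresses everything else. The fixed 4KB chunk size, SHA-256 and zlib (standing in for the LZ stage) are all assumptions chosen for illustration.

```python
import hashlib
import os
import zlib

CHUNK = 4096  # hypothetical fixed chunk size

def encode(data, peer_has):
    """Turn data into wire messages: a hash reference for chunks the peer
    already holds, compressed bytes for chunks it has not seen."""
    messages = []
    for i in range(0, len(data), CHUNK):
        chunk = data[i:i + CHUNK]
        digest = hashlib.sha256(chunk).digest()
        if digest in peer_has:
            messages.append(("ref", digest))
        else:
            peer_has.add(digest)
            messages.append(("raw", zlib.compress(chunk)))
    return messages

def decode(messages, cache):
    """Rebuild the data on the receiving side, filling the cache as we go."""
    out = bytearray()
    for kind, payload in messages:
        if kind == "ref":
            out += cache[payload]
        else:
            chunk = zlib.decompress(payload)
            cache[hashlib.sha256(chunk).digest()] = chunk
            out += chunk
    return bytes(out)

def wire_bytes(messages):
    return sum(len(payload) for _, payload in messages)

peer_has, receiver_cache = set(), {}
v1 = os.urandom(CHUNK * 50)                                   # first version of a file
v2 = v1[:CHUNK * 10] + os.urandom(CHUNK) + v1[CHUNK * 11:]    # one chunk modified

first = encode(v1, peer_has)
restored1 = decode(first, receiver_cache)
second = encode(v2, peer_has)
restored2 = decode(second, receiver_cache)
print(wire_bytes(first), wire_bytes(second))
```

The second transfer only carries one new chunk plus 49 short references. Fixed-size chunks are the simplest choice but fall apart when bytes are inserted rather than overwritten, since every later chunk boundary shifts.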

Other vendors may call their features something else, but in the end there aren’t too many ways to do this. It all boils down to:

  1. Who has the best implementation speed-wise

  2. Who is the best administration-wise

  3. Who is the most stable in an enterprise setting

  4. What company has the highest chance of staying alive (like it or not, Cisco destroys the other players here)

  5. What company is committed to the product the most

  6. As a corollary to #5, what company does the most R&D for the product

Since Cisco is, by far, the largest of the companies that provide WAN accelerators (indeed, they probably spend more on light bulbs per year than the net worth of the other companies combined), in my opinion they’re the obvious force to be reckoned with, not someone like Riverbed. As cool as Riverbed is, they’re too small, and will either fizzle out or get bought, though Cisco didn’t buy them, which is food for thought. If Riverbed is so great, why would Cisco simply not acquire them?

Case in point: When Cisco bought Actona (which is the progenitor of the current WAAS product) they only really had the Windows file-caching part shipping (WAFS). It was great for CIFS but not much else. Back then, they were actually lagging compared to the other players when it came to complete application acceleration. Fast forward a mere few months: They now accelerate anything going over TCP, their WAFS portion is still there but it’s even better and more transparent, the product works with WCCP and inline cards (making deployment at the low-end easy) and is now significantly faster than the competitors. Helps to have deep pockets.

For an enterprise, here are the main benefits of going with Cisco the way I see them:

  1. Your switches and routers are probably already Cisco so you have a relationship.

  2. WAAS interfaces seamlessly with the other Cisco gear.

  3. The best way to interface a WAN accelerator is WCCP. And it was actually developed by Cisco.

  4. The Cisco appliances are tunnel-less and totally transparent (I met someone that had Riverbed everywhere - a software glitch rendered ALL WAN traffic inoperable, instead of having it go through unaccelerated which is the way it is supposed to work. He’s now looking at Cisco).

  5. WAAS appliances don’t mess with QoS you may have already set.

  6. The WAAS boxes are actually faster in almost anything compared to the competition.

And now for the inevitable benchmarks:

Depending on the latency, you can get more or less of a speed-up. For a comprehensive test see this: http://www.cisco.com/application/pdf/en/us/guest/products/ps6870/c1031/cdccont_0900aecd8054f827.pdf

Another, longer rev: http://www.cisco.com/web/CA/channels/pdf/Miercom-on-Cisco-WAAS-Riverbed-Juniper-competitive.pdf

Yes, this is on Cisco’s website but it’s kinda hard to find any performance statistics on the other players’ sites showing Cisco’s WAAS (any references to WAFS are for an obsolete product). At least this one compares truly recent codebases of Cisco, Riverbed and Juniper. For me, the most telling numbers were the ones showing how much traffic the server at the datacenter actually sees. Cisco was almost 100x better than the competition - where the other products passed several Mbits through to the server, Cisco only needed to pass 50Kbits or so.

It is kinda weird that the other vendors don’t have any public-facing benchmarks like this, don’t you think?

However, since I tend to not completely believe vendor-sponsored benchmark numbers as much as I may like the vendor in question, I ran my own.

I used NISTnet (a free WAN simulator, http://www-x.antd.nist.gov/nistnet/) to emulate latency and throughput indicative of standard telco links (i.e. a T1). The fact that the simulator is freely available and can be used by anyone is compelling since it allows testing without disrupting production networks (for the record, I also tested on a few production networks with similar results, though the latency was lower than with the simulator).

The first test scenario is that of the typical T1 connection (approx. 1.5Mbits/s or 170KB/s at best) and 40ms of round-trip delay. I tested with zero packet loss, which is not totally realistic but it makes the benchmarks even more compelling. Usually there is a little packet loss, which makes transfer speeds even worse. This is one of the most common connections to remote sites one will encounter in production environments.

The second scenario is that of a bigger pipe (3Mbit) but much higher latency (300ms), emulating a long-distance link such as a remote site in Asia over which developers do their work. I injected a 0.2% packet loss (a small number, given the distance).

It is important to note that, in the interests of simplicity and expediency, these tests are not comprehensive. A comprehensive WAAS test consists of:

  • Performance without WAAS but with latency

  • Performance with WAAS but data not already in cache (cold cache hits). Such a test shows the real-time efficiency of the TFO, DRE and LZ algorithms.

  • Performance with the data already in the cache (hot cache hits).

  • Performance with pre-positioning of fileserver data. This would be the fastest a WAAS solution would perform, almost like a local fileserver.

  • Performance without WAAS and without latency (local server). This would be the absolute fastest performance in general.

The one cold cache test I performed involved downloading a large ISO file (400MB) using HTTP over the simulated T1 link. The performance ranged from 1.5-1.8MB/s (a full 10 times faster than without WAAS) for a cold cache hit. After the file was transferred (and was therefore in cache) the performance went to 2.5MB/s. The amazing performance might have been due to a highly compressible ISO image but, nevertheless, is quite impressive. The ISO was a full Windows 2000 install CD with SP4 slipstreamed - a realistic test with realistic data, since one might conceivably want to distribute such CD images over a WAN. Frankly this went through so quickly that I keep thinking I did something wrong.

T1 results
ftp without WAAS:
ftp: 3367936 bytes received in 19.53Seconds 168.40Kbytes/sec

Very normal T1 behavior with the simulator (for a good-quality T1).

ftp with WAAS:
ftp: 3367936 bytes received in 1.34Seconds 2505.90Kbytes/sec (15x improvement).

Sending data was even faster:
ftp: 3367936 bytes sent in 0.36Seconds 9381.44Kbytes/sec.


 

High Latency/High Bandwidth results

The high latency (300ms) link, even though it had double the theoretical throughput of the T1 link, suffers significantly:

ftp without WAAS
ftp: 3367936 bytes received in 125.73Seconds 26.79Kbytes/sec.

I was surprised at how much the high latency hurt the ftp transfers. I ran the test several times with similar results.
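One likely contributor is plain TCP arithmetic: a single stream can have at most one window of data in flight per round trip, so throughput is capped at window size divided by RTT. The sketch below works that out for both test links, using the largest window TCP allows without window scaling (65,535 bytes). Treating the measured transfer as purely window-limited is an assumption; the injected packet loss would slow it down further.

```python
def window_limited_kb_s(window_bytes, rtt_s):
    """Upper bound on single-stream TCP throughput: one window per round trip."""
    return window_bytes / rtt_s / 1000

MAX_UNSCALED_WINDOW = 65535  # largest TCP window without the window-scale option

links = [
    ("T1, 40ms", 0.040, 170),      # link rate in KB/s, from the T1 scenario
    ("3Mbit, 300ms", 0.300, 375),  # 3,000,000 bits/s divided by 8
]
for label, rtt_s, link_kb_s in links:
    ceiling = window_limited_kb_s(MAX_UNSCALED_WINDOW, rtt_s)
    print(f"{label}: window ceiling {ceiling:.0f} KB/s, link {link_kb_s} KB/s, "
          f"best case {min(ceiling, link_kb_s):.0f} KB/s")

# Effective window implied by the measured 26.79 KB/s at 300ms,
# if the transfer were purely window-limited (an assumption).
implied_window = 26.79 * 1000 * 0.300
print(f"implied effective window: {implied_window:.0f} bytes")
```

On the T1 the window ceiling is far above the link speed, so latency barely matters. At 300ms even a full 64KB window cannot fill the 3Mbit pipe, and the measured rate corresponds to an effective window of only about 8KB, which is the kind of thing the window scaling part of TFO is meant to address.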

ftp with WAAS
ftp: 3367936 bytes received in 2.16Seconds 1562.12Kbytes/sec (58x improvement).


 

I have more results with office-type apps but they will make for too big of a blog entry, not that this isn’t big. In any case, the thing works as advertised. I need to build a test Exchange server so I can see how much stuff like attachments are accelerated. Watch this space. Oh, and there’s another set of results at http://www.gotitsolutions.org/2007/05/18/cisco-waas-performance-benchmarks.html

Comments? Complaints? You know what to do.

D

Mon
19
Feb '07

It’s all about data classification and searching

I don’t know if this has been discussed elsewhere but I felt like I had an epiphany so there…

The way I see it, in a decade or two the most important technology regarding data will be data classification and search technologies.

Consider this: At the moment, all the rage is archiving and storage tiers. The reason is that it simply is too expensive to buy the fastest disks, and even if you do buy them they’re smaller than the slower-spinning drives.

Imagine if speed and size were not issues. I know that’s a big assumption but let’s play along for a second… (let’s just say that there are plenty of revolutionary advances in the storage space coming our way within, say, 10-20 years, that will make this concept not seem that far-fetched).

Nobody would really care any longer about storage tiers or archiving. Backups would simply consist of extra copies of everything, to be kept forever if needed, and replicated to multiple locations (this is already happening, it’s just expensive, so it’s not common). Indeed, everyone would just let all kinds of data accumulate and scrubbing would not be quite as frequent as it is now. Multiple storage islands would also be clustered seamlessly so they present a single, coherent space, compounding the problem further.

Within such a chaotic architecture, the only real problems are data classification and mining, i.e. figuring out what you have and actually getting at it. Where it is stored is not such an issue - nobody cares, as long as they can get to it in a timely fashion.

I can tell that OS designers are catching on. Microsoft, of all companies, wanted a next-gen filesystem for Vista/Longhorn, that would really be SQL on top of NTFS, with files stored as BLOBs. It got delayed so we didn’t get it, but they’re saying it should be out in a few years (there were issues with scalability and speed).

Let’s forget about the Microsoft-specific implementation and just think about the concept instead (I’d use something like a decent database on raw disk and not NTFS, for instance). No more real file structure as we know it - it’s just a huge database occupying the entire drive.

Think of the advantages:

  1. Far more resilient to failures
  2. Proper rollbacks in case of problems, and easy rebuilding using redo logs if need be
  3. Replication via log shipping
  4. Amazing indexing
  5. Easy expandability
  6. The potential for great performance, if done right
  7. Lots of tuning options (maybe too many for some).

With such a technology, you need a lot more metadata for each file so you can present it in different ways and also search for it efficiently. Let’s consider a simple text document - you’re trying to sell some storage, so you write a proposal for a new client. You could have metadata on:

  • Author
  • Filename
  • Client name
  • Type of document - proposal
  • Project name
  • Excerpt
  • Salesperson’s name
  • Solution keywords, such as EMC DMX with McData (sorry, Brocade) switches
  • Document revision (possibly automatically generated)

… and so on. A lot of these fields can already be found in the properties of any MS Word document.

The database would index the metadata at the very least, when the file is created, and any time the metadata changes. Searches would be possible based on any of the fields. Then, a virtual directory structure could be created:

  • Create a virtual directory with all files pertaining to that specific client (most common way people would organize it)
  • Show all the material for this specific project
  • Show all proposals that have to do with this salesperson

… and so on.
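As a rough illustration of the concept (and not of Microsoft’s design), here is a sketch using SQLite: files live as BLOBs in a table, the metadata columns are indexed, and a “virtual directory” is just a query. The table layout, the `virtual_dir` helper and all the sample names are made up for this example.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE files (
        id INTEGER PRIMARY KEY,
        filename TEXT, author TEXT, client TEXT,
        doc_type TEXT, project TEXT, salesperson TEXT,
        body BLOB
    )
""")
# Index the metadata fields people are most likely to browse by.
for col in ("client", "project", "salesperson"):
    db.execute(f"CREATE INDEX idx_{col} ON files ({col})")

# Hypothetical sample documents.
rows = [
    ("acme-san-proposal.doc", "Dana", "Acme", "proposal", "SAN refresh", "Pat", b"proposal text"),
    ("acme-san-diagram.vsd", "Dana", "Acme", "diagram", "SAN refresh", "Pat", b"diagram data"),
    ("globex-backup-proposal.doc", "Lee", "Globex", "proposal", "Backup redesign", "Pat", b"proposal text"),
]
db.executemany(
    "INSERT INTO files (filename, author, client, doc_type, project, salesperson, body) "
    "VALUES (?, ?, ?, ?, ?, ?, ?)",
    rows,
)

SEARCHABLE = {"author", "client", "doc_type", "project", "salesperson"}

def virtual_dir(**criteria):
    """List the filenames whose metadata matches every given field."""
    if not criteria or not set(criteria) <= SEARCHABLE:
        raise ValueError("unknown or missing metadata field")
    where = " AND ".join(f"{field} = ?" for field in criteria)
    cur = db.execute(
        f"SELECT filename FROM files WHERE {where} ORDER BY filename",
        tuple(criteria.values()),
    )
    return [row[0] for row in cur]

print(virtual_dir(client="Acme"))
print(virtual_dir(doc_type="proposal", salesperson="Pat"))
```

The same file shows up in as many virtual directories as it has matching metadata, which is exactly what a fixed folder hierarchy cannot do.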

Virtual folders exist now for Mac OSX (can be created after a Spotlight search), Vista (saved searches) and even Gnome 2.14, but the underlying engine is simply not as powerful as what I just described. Normal searches are used, and metadata is not that extensive for most files anyway (mp3 files being an exception since metadata creation is almost forced when you rip a CD).

It should be obvious by now that to enable this kind of functionality properly you need really good ways of classifying and indexing your data and actually create all the metadata that needs to be there, as automatically as possible. Future software will probably force you to create the metadata in some way, of course.

Existing software that does this classification is fairly poor, in my opinion. Please correct me if I’m wrong.

The other piece that needs to be there is extremely robust search and indexing capabilities. Some of that technology is there (Google Desktop and its ilk) but natural language search has to be - well, natural, but unambiguous at the same time.

I hope you can now see why I believe these technologies are important. If Google continues the way it’s going, it may well become the most important company in the next decade (some might argue it’s the most important one already).

For any sci-fi fans out there, this is a good novel that’s a bit related to the chaotic storage systems of the future: http://www.scifi.com/sfw/books/sfw7677.html

D


On deduplication and Data Domain appliances

One subject I keep hearing about is deduplication, the idea being that you save a ton of space since a lot of your computers have identical data.
One way to do it is with an appliance-based solution such as Data Domain. Effectively, they put a little server and a cheap-but-not-cheerful, non-expandable 6TB RAID together, then charge a lot for it, claiming it can hold 90TB or whatever. Use many of them to scale.

The technology chops up incoming files into pieces (blocks). Then, the server calculates a unique numeric ID for each block using a hash algorithm.

The ID is then associated with the block and both are stored.

If the ID of another block matches one already stored, the new block is NOT stored, but its ID is, as is the association with the rest of the blocks in the file (so that deleting a file won’t adversely affect common blocks with other files).

This is what allows dedup technologies to store a lot of data.
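The steps above can be sketched in a few lines of Python. This is a toy under stated assumptions, not how Data Domain actually does it: I'm assuming fixed-size blocks and SHA-256 as the hash, and `DedupStore` and its methods are names I made up. Each file is kept as an ordered list of block IDs, and a reference count keeps shared blocks alive when one of the files using them is deleted.

```python
import hashlib


class DedupStore:
    """Toy block-level dedup store (fixed-size blocks, SHA-256 IDs)."""

    def __init__(self, chunk_size=4096):
        self.chunk_size = chunk_size
        self.blocks = {}    # block ID -> block bytes, each unique block once
        self.refcount = {}  # block ID -> number of references from files
        self.files = {}     # file name -> ordered list of block IDs

    def put(self, name, data):
        """Chop data into blocks and store only the blocks not seen before."""
        if name in self.files:
            self.delete(name)
        ids = []
        for offset in range(0, len(data), self.chunk_size):
            block = data[offset:offset + self.chunk_size]
            block_id = hashlib.sha256(block).hexdigest()
            if block_id not in self.blocks:
                self.blocks[block_id] = block  # new block: store it
            self.refcount[block_id] = self.refcount.get(block_id, 0) + 1
            ids.append(block_id)               # the ID is always recorded
        self.files[name] = ids

    def get(self, name):
        """Reassemble a file from its list of block IDs."""
        return b"".join(self.blocks[block_id] for block_id in self.files[name])

    def delete(self, name):
        """Remove a file; drop a block only when no other file uses it."""
        for block_id in self.files.pop(name):
            self.refcount[block_id] -= 1
            if self.refcount[block_id] == 0:
                del self.refcount[block_id]
                del self.blocks[block_id]

    def stored_bytes(self):
        """Bytes actually kept on 'disk' after deduplication."""
        return sum(len(block) for block in self.blocks.values())


store = DedupStore(chunk_size=4)
store.put("a.txt", b"AAAABBBBCCCC")
store.put("b.txt", b"AAAABBBBDDDD")  # shares its first two blocks with a.txt
```

Here 24 bytes of files end up as four unique 4-byte blocks, and deleting `a.txt` removes only the block that `b.txt` doesn't share.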

Now, why how much you can store depends on your data:

If you’re backing up many different unique files (like images), there will be almost no similarity, so everything will be backed up.
If you’re backing up 1000 identical Windows servers (including the Windows directory) then there WILL be a lot of similarity, and great efficiencies.
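A quick way to see the two extremes is to compute the dedup ratio (logical bytes divided by bytes actually stored) for each case. The sketch below assumes fixed-size blocks and SHA-256 IDs, and uses random bytes as a stand-in for already-compressed images and a small random blob as a stand-in for a server image; `dedup_ratio` is a made-up helper, not any vendor's tool.

```python
import hashlib
import os


def dedup_ratio(files, chunk_size=4096):
    """Return logical bytes / stored bytes for a list of byte strings."""
    logical = 0
    unique = {}  # block ID -> block length, one entry per distinct block
    for data in files:
        for offset in range(0, len(data), chunk_size):
            block = data[offset:offset + chunk_size]
            logical += len(block)
            unique.setdefault(hashlib.sha256(block).digest(), len(block))
    stored = sum(unique.values())
    return logical / stored if stored else 1.0


# Stand-in for unique images: random data has no repeated blocks.
images = [os.urandom(16 * 1024) for _ in range(10)]

# Stand-in for 1000 identical servers: the same image backed up 1000 times.
server_image = os.urandom(16 * 1024)
servers = [server_image] * 1000

image_ratio = dedup_ratio(images)    # no savings at all
server_ratio = dedup_ratio(servers)  # one copy stored for 1000 backups
```

Real data sits somewhere in between, which is why a headline capacity number for one of these boxes only means something for a given mix of data.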

Now the drawbacks (and why I never bought it):

The thing relies on a weak server and a small database. As you’re backing up more and more, there will be millions (maybe billions) of IDs in the database (remember, a single file may have multiple IDs).

Imagine you have 2 billion entries.

Imagine you’re trying to back up someone’s 1GB PST, or other large file, that stays mostly the same over time (ideal dedup scenario). The file gets chopped up into, say, 100 blocks.

Each block has its ID calculated (CPU-intensive).

Then, EACH ID has to be compared with the ENTIRE database to determine whether there’s a match or not.

This can take a while, depending on what search/sort/store algorithms they use.
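How long "a while" is depends heavily on the data structure. I don't know what Data Domain uses internally, so the sketch below just contrasts two generic options in Python: a naive scan that really does compare against every stored ID, and a hash-based set where a lookup touches only a handful of entries. It also estimates how big the index gets, assuming 32-byte SHA-256 IDs (an assumption on my part). The function names are made up.

```python
import hashlib


def linear_lookup(ids, wanted):
    """Naive scan of an unsorted list. Returns (found, comparisons made)."""
    comparisons = 0
    for candidate in ids:
        comparisons += 1
        if candidate == wanted:
            return True, comparisons
    return False, comparisons


def index_size_gib(entries, id_bytes=32):
    """Raw size of the IDs alone, in GiB, ignoring any per-entry overhead."""
    return entries * id_bytes / 2**30


# A small stand-in database of 100,000 block IDs.
ids = [hashlib.sha256(str(i).encode()).digest() for i in range(100_000)]
new_block_id = hashlib.sha256(b"a block we have never seen").digest()

# Worst case for the scan: a brand new block is compared against everything.
found, comparisons = linear_lookup(ids, new_block_id)

# A hash-based index answers the same question without a full scan.
id_set = set(ids)
found_fast = new_block_id in id_set

# 2 billion entries at 32 bytes each is roughly 60 GiB of IDs alone.
size = index_size_gib(2_000_000_000)
```

So the comparison itself doesn't have to be the bottleneck, but an index that size won't fit in the RAM of a small server, and once lookups start hitting disk the weak-hardware concern comes right back.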

I asked Data Domain about this and all they kept telling me was “try it, we can’t predict your performance”. I asked them whether they had even tested the box to see what the limits were, and they hadn’t. Hmmm.

I did find out that, at best, the thing works at 50MB/s (slower than an LTO3 tape drive), unless you use tons of them.

Now, imagine you’re trying to RECOVER your 1GB PST.

Say you try to recover from a “full” backup on the Data Domain, but that file has been living in it for a year, with the new blocks being added to it.

When requesting the file, the Data Domain box has to synthesize the file (remember, even the “full” doesn’t include the whole file). It will read the IDs needed to recreate it and put the blocks together so it can present the final file, as it should have looked.

This is CPU- and disk-intensive. Takes a while.

The whole point of doing backups to disk is to back up and restore faster and more reliably. If you’re slowing things down in order to compress your disk as much as possible, you’re doing yourself a disservice.

Don’t get me wrong, dedup tech has its place, but I just don’t like the appliance model for performance and scalability reasons.
EMC just purchased Avamar, a dedup company that does the exact same thing but lets you install the software on whatever you want.

There are also Asigra and Evault, both great backup/dedup products that can be installed on ANY server and work with ANY disk, not just the el cheapo quasi-JBOD Data Domain sells.

So, you can leverage your investment in disk and load the software on a beefy box that will actually work properly.

Another tack would be to use virtual tape. It doesn’t do dedup yet, but it will: EMC bought Avamar, and Adic (now Quantum) also acquired another dedup company and will put the stuff in their VTL, so you can get the best of both worlds. In the meantime, it does compression just like real tape.

Plus, even the cheapest EMC virtual tape box works at over 300MB/s.

I sort of detest the “drop at the customer site” model Data Domain (and a bunch of the smaller storage vendors) use. They expect you to put the box in and, if it works OK, to find it easier to keep it than to send it back.

Most people will keep the first thing they try (unless it fails horrifically), since they don’t want to go through the trouble of testing 5 different products (unless we’re talking about huge companies that have dedicated testing staff).

Let me know what you think…

D