Recovery Monkey: Musings on backups, storage, tuning and more


Mon
22
Feb '10

Protecting your existing legacy storage investment with virtualization – do’s and don’ts

It’s an undeniable fact that many customers, while they would love to use the highly advanced features of modern disk arrays, have already made a big investment in legacy storage. Sure, it doesn’t have all the great features, but it’s already there, frequently there’s a lot of it, and the maintenance isn’t expiring for another year or two so it’s not economically feasible to get rid of it.

Another issue most enterprises face is data migration – whether that’s to move from old to new on the same vendor, or from vendor to vendor. No matter how you cut it, you’ll have to do it someday.

A third issue is performance on the existing gear – maybe you have a ton of legacy storage but it’s just not performing the way you’d expect.

The final issue is managing disparate arrays. Nobody really wants to do that.

There are storage virtualization products that, conceptually, try to solve some of those issues in a similar way to how VMware, Hyper-V and Xen address similar issues with servers.

The idea is that you virtualize your existing storage behind gear that gives it some extra capabilities and centralized management, thereby extending its service life and maybe even eking some more performance out of it. Your existing hosts will typically address the storage via the virtualizing device, so obviously some assembly is required (rezoning etc.).

The devices I’m aware of fall into 3 basic categories:

  1. Devices that encapsulate existing LUNs and don’t need other equipment or much reconfiguration besides dropping them in, zoning and presenting the LUNs to the hosts through them. Examples are: FalconStor NSS, IBM SVC, HDS USP-V, HP SVSP.
  2. Devices that don’t need other equipment, offer some compelling extra features but cannot encapsulate LUNs and therefore need an initial migration besides the zoning. Example: NetApp V-Series.
  3. Devices that need extensive fabric upgrades besides reconfiguration. Example: EMC Invista (I’m not sure if it needs LUN migrations, I don’t think so but I’m sure someone from EMC will chime in).

There are other differences in the devices listed above, so I created a table and highlighted the areas where there’s either the odd man out or there’s some feature not available with the others. I’m aware that the table is nowhere near complete, but as it is I doubt it will fit onto a web page nicely. If there are inaccuracies, let me know and I’ll fix it. I admit I know little about HP’s SVSP. (re-posted with some SVC edits).

 

| Feature | EMC | HP | FalconStor | HDS | IBM | NetApp |
|---|---|---|---|---|---|---|
| Thin Provisioning | N | Y (? perf impact) | Y (? perf impact) | Y (perf impact) | Y (no perf impact) | Y (no perf impact) |
| Thin Clones | N | ? | Y (perf impact) | ? | Y (perf impact) | Y (no perf impact) |
| Snapshots | N | Y (? perf impact) | Y (perf impact) | Y (perf impact) | Y (perf impact) | Y (no perf impact) |
| Also an Array | N | N | N | Y | Y (limited 4x SSD per node) | Y |
| In-Band | N | split-path | Y | Y | Y | Y |
| Deduplication | N | N | N | N | N | Y |
| Replication | N (needs RecoverPoint) | Y | Y | Y | Y | Y |
| Needs Migration | ? (prob N) | ? | Y | N | N | Y |
| NAS | N | N | N | N | N | Y |
| Needs Fabric Upgrade | Y | N | N | N | N | N |
| FCoE | N | N | N | N | N | Y (also 10GbE) |
| Perf Acceleration | N | N | Y (SSD cache) | Y (huge cache, RAM) | Y (192GB large cache with 8 nodes) | Y (gigantic cache, multi-TB) |
| Can do live FC migrations | Y | Y | Y | Y | Y | N (iSCSI, NFS, CIFS only at present) |
| Needs some space on array | N | N | N | N | N | Y |

 

The design decisions are interesting.

Of the above, IBM and FalconStor take the “pure appliance” approach, using Linux servers with custom code – that’s what those boxes were designed to do from the get-go. The idea is that you either have a bunch of old arrays or you buy a bunch of new, cheap and not very capable arrays, then front them with SVC or NSS, thereby making them decent.

Since IBM and FalconStor were always designed to perform this function, they are also, in my opinion, the best-suited for tasks like migrations. Indeed, I believe one can do a “hit and run” with said boxes, i.e. do the migration then remove the boxes from the fabric, making them popular with certain PS organizations.

On the other hand, HDS and NetApp instead offer the virtualization functionality as an additional feature to their arrays – as in, “you’ll probably buy our disk but we can enhance your legacy box, too”.

EMC took a completely different approach and uses out-of-band control servers and intelligent fabric switches to perform the virtualization trickery.

It’s important to note that NetApp lacks the live migration feature, but instead offers deduplication, application-aware snaps, great replication and NAS, and is arguably the most feature-rich platform (I’m trying to not be biased as I’m writing this). The biggest caveat (a deal-breaker for some) is that it can’t encapsulate your existing LUNs – instead, you need to chop up your RAID groups into LUNs, then present them to the NetApp system, which will then need to reformat said LUNs. This process also takes away some space for extra checksum calculations and other overheads. Arguably, you can make this up (and then some) in the end after using the features on tap (sorry). But you still need to figure time to migrate your stuff over gradually.

I believe EMC offers the fewest features and the most complex implementation – you can do stuff like mirror your LUNs from box to box and do migrations, but your arrays don’t really gain any new features. I have yet to meet a customer that owns this solution. I know there are a few big ones that went that way; it’s just not very common.

Of the devices mentioned above, the SVC is probably the most commonly used, then the USP-V (IBM and HDS always argue on that point since the capability to virtualize comes with HDS boxes whereas virtualization is the only thing the SVC does), then come FalconStor and NetApp, then HP with the relative newcomer SVSP, and last EMC (Invista hasn’t been a particularly successful product for EMC).

Storage Virtualization do’s and don’ts

I’d say that you should only really consider buying a virtualization product if you have well over 10TB of older gear (over 50TB, IMHO) that is not TOO old (i.e. not older than 3-4 years). Quite frequently, if your gear is really old, refreshing it with new just ends up being cheaper. Of course, there’s always eBay.

I’d also recommend not buying new low-end arrays and using virtualization to make them “better”. You are introducing more complexity into the environment, and it won’t necessarily be cheaper, either (something like the SVC has licenses that cost by the TB). Just buy a decent modern array that has all the features you need and be done with it.

Furthermore – don’t get into virtualization just to migrate from your older to your newer arrays. There are other ways.

You should use common sense (imagine that). As you’re not supposed to mix drive types within RAID groups even if you can, you typically don’t want to have an application straddling 5 different arrays, all vastly different in capability, just because you can.

It’s tempting to say “I’ll create a LUN that’s striped among every single disk on 5 different arrays”. Not to say that this should never be done (I’ve RAID-0’d across Symmetrix to get enough performance, long story), but only do it if you know what you’re doing and the exact layout that you’ll end up with. Nothing spells misery like RAID0 across many LUNs in an existing RAID group… :)

Finally – figure out what features are the most important to you. If you want dedupe, NAS and tight app integration, NetApp is the ticket. If you prefer ease of migration, you may want to look at the other solutions.

The guarantees

In order to entice customers to try their stuff, HDS and NetApp have some space savings guarantees in place regarding virtualization. HDS has a flat 50% guarantee (predicated upon converting from RAID1 to RAID5 + thin provisioning) or 20% guarantee (just thin provisioning).
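To get a feel for where a number like 50% could come from, here is some back-of-the-envelope arithmetic in Python. Everything in it is an assumption for illustration: the 7+1 RAID5 group width, the 100TB of provisioned capacity and the 70% of it that hosts have actually written. It is not HDS’s own calculation, just a sketch of why RAID1 to RAID5 plus thin provisioning frees up so much raw disk.

```python
def raw_needed(usable_tb: float, data_disks: int, total_disks: int) -> float:
    """Raw capacity needed to deliver usable_tb with a given RAID geometry."""
    return usable_tb * total_disks / data_disks


def saving(before_tb: float, after_tb: float) -> float:
    """Fraction of raw capacity freed by going from before_tb to after_tb."""
    return 1 - after_tb / before_tb


if __name__ == "__main__":
    provisioned_tb = 100.0   # hypothetical capacity presented to hosts
    written_fraction = 0.7   # hypothetical share of it actually written

    # RAID1 mirrors everything: 1 data disk out of every 2.
    raid1 = raw_needed(provisioned_tb, data_disks=1, total_disks=2)
    # RAID5 in an assumed 7+1 layout: 7 data disks out of every 8.
    raid5 = raw_needed(provisioned_tb, data_disks=7, total_disks=8)
    # Thin provisioning only backs the capacity that has been written.
    raid5_thin = raw_needed(provisioned_tb * written_fraction, 7, 8)

    print(f"RAID1, thick: {raid1:6.1f} TB raw")
    print(f"RAID5, thick: {raid5:6.1f} TB raw ({saving(raid1, raid5):.0%} less)")
    print(f"RAID5, thin:  {raid5_thin:6.1f} TB raw ({saving(raid1, raid5_thin):.0%} less)")
```

With these made-up inputs, the RAID conversion alone frees roughly 43% of the raw capacity and thin provisioning takes it to roughly 60%. Change the group width or the written fraction and the result moves accordingly.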

NetApp has the ZIP program. It’s a bit different – there’s no hard number in the savings. Rather, the customer’s data is analyzed and the customer presented with the savings % NetApp guarantees to achieve in their case. If the customer agrees and NetApp achieves the guaranteed savings, then the gear gets purchased. If the savings are not reached, then the customer gets to keep the gear free of charge (that’s right).

Such guarantee programs have been much ridiculed by the vendors that don’t offer them, but I think they show the respective companies believe in their products enough to wrap some kind of guarantee around them.

In conclusion…

Properly deployed, storage virtualization can be effective in increasing the efficiencies of legacy storage footprints lacking in functionality. Just be careful and examine your motives for virtualization before making the move. Sometimes it’s a decidedly false economy.

D

Thu
18
Feb '10

So, are there any independent bloggers? Really?

There was some weird backlash against my site and my person recently – see here and here and in the comments here. Chuck Hollis got all uppity about whether I work at NetApp (with, for) or not.

I find it interesting that this only came up when I wrote something pro-NetApp. Wasn’t even anti-EMC.

It never came up when I was extolling the virtues of RecoverPoint (which I still think is awesome). I didn’t see anyone from NetApp or any EMC competitor start questioning where I worked, where the full disclosure was etc etc. Maybe they all just assumed I worked for EMC. Well – not directly, I was selling a ton of EMC gear, which was in turn paying my mortgage, which is as good as. But, ultimately, I just like the product since, properly deployed, it can solve some real problems.

So why is NetApp the company everyone loves to hate? Is it fear? Disrespect? Lack of understanding? All the above? But, I digress. NetApp customers love the product, and the company’s recent earnings announcement, as well as the fact we sold 1 Exabyte of enterprise storage last year, tells the real story. The People want their highly-functional, space-efficient, simple-to-use, application-aware storage, not 50 different products that are loosely integrated. Volkslagerung! Is that right, German-speaking readers? (edit: Volksdatenspeicher seems better as “storage for the people”).

So, I clarified things in the About page (upper left), I thought it was already clear but apparently not. Chuck is still not satisfied, so I think I’ll have to figure out a way to show some fancy animation of me in some NetApp uniform, hugging Hitz, Lau, Georgens and Mendoza and receiving my MVP award. Plus another animation showing the super-secret initiation ceremony and the extensive branding on my left buttock. Right.

What was most interesting in this ad hominem attack was that the important discussion topics were largely ignored, a very efficient tactic to lure the unsuspecting reader’s mind away from the real issues.

Which brings us to the subject of this post.

There seems to be this cute, romantic notion that there is such a thing as a truly independent blogger, and if I’m not independent, then what I say is tainted.

Well – let me break it to you and disabuse you of this notion: There ain’t no such thing as an independent blogger.

We are all biased, one way or another, about everything. Our past experiences shape our biases and the automatic stories our brains will create to explain any information we are presented with.

It doesn’t matter whether we work for a storage vendor or are customers – indeed, customers are typically among the most biased IT folks around! (storage vendor employees are usually crusty, jaded, cynical, have been around the block and typically have the dirt on multiple technologies).

I’ve been in customer meetings where I was told the customer doesn’t ever want to talk to EMC again because they treated him badly 10 years ago, or that he doesn’t want to talk to NetApp because he read in Barry’s blog that it only has 30% usable space, another that has FC queuing issues with HDS gear and wants to get rid of it at all costs, yet another that has had some controller panics with IBM gear and wants to get off of that and never touch IBM ever again, the list goes on. Those guys become zealots.

Then you have the other customer type, the one that receives Rolexes and other cool gifts in order to say whatever he’s told to say. Some actually will demand it (I’ve been in one of those meetings, too – “if you give me your watch we may have a deal”. I chose to assume he was kidding, lest I completely lose my faith in mankind).

You then have your “analyst” type that’s an independent industry “expert” – most of those guys haven’t touched the products they’re writing about, ever, and are just rehashing whatever they read in other publications or are told by their vendor drinking buddy. Yet they’re among the most trusted and read. They, too, have their personal favorite horses they’re backing…

Finally you have your VAR bloggers. People – those guys make money selling the stuff. Yes, they know the tech, but don’t exactly expect an impartial discussion… plus, they get all kinds of incentives from vendors.

So, who do you trust, when you can’t even trust yourself? Since, by definition, you are also biased, gentle reader…

I wish I could tell you. Ultimately, everyone has an agenda, whether conscious or subconscious. You just need to become shrewd enough to see through the agenda.

Maybe a good starting point is a truly intelligent, fact-based discussion bereft of ad hominem attacks?

D

Wed
10
Feb '10

More FUD busting: Deduplication – is variable-block better than fixed-block, and should you care?

Before all the variable-block aficionados go up in arms, I freely admit variable-block deduplication may overall squeeze more dedupe out of your data.

I won’t go into a laborious explanation of variable vs fixed, but, in a nutshell, fixed-block deduplication means that data is split into equal chunks, each chunk given a signature, compared to a DB and the common chunks are not stored.
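For the curious, here is a minimal Python sketch of the fixed-block idea just described. It is a toy, not any vendor’s implementation: the 4096-byte block size, the SHA-256 signature and the in-memory dictionary standing in for the signature DB are all assumptions made for the example.

```python
import hashlib


def fixed_block_dedupe(data: bytes, block_size: int = 4096):
    """Split data into equal-size blocks and store each unique block once."""
    store = {}   # signature -> block contents (stands in for the signature DB)
    recipe = []  # ordered signatures needed to rebuild the original data
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]
        signature = hashlib.sha256(block).hexdigest()
        if signature not in store:
            store[signature] = block  # first time we see this content
        recipe.append(signature)
    return store, recipe


def rebuild(store, recipe) -> bytes:
    """Reassemble the original data from the stored unique blocks."""
    return b"".join(store[signature] for signature in recipe)


if __name__ == "__main__":
    # Four blocks' worth of data, but only two distinct block contents.
    data = b"A" * 4096 + b"B" * 4096 + b"A" * 4096 + b"B" * 4096
    store, recipe = fixed_block_dedupe(data)
    print(len(recipe), "blocks written,", len(store), "blocks stored")
    assert rebuild(store, recipe) == data
```

Four blocks go in and two are stored; the recipe is what lets you read the data back unchanged.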

Variable-block basically means the chunk size is variable, with more intelligent algorithms also having a sliding window, so that even if the content in a file is shifted, the commonality will still be discovered.
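To make the “shifted content” point concrete, here is a small Python sketch that chunks the same data both ways and then inserts a single byte at the front. It is only an illustration under stated assumptions: the 16-byte window, the 6-bit mask and the use of CRC32 recomputed at every position are choices made for readability (real products typically use a proper rolling hash such as a Rabin fingerprint, plus minimum and maximum chunk sizes, none of which is shown here).

```python
import hashlib
import random
import zlib

WINDOW = 16   # bytes examined when deciding on a boundary (illustrative)
MASK = 0x3F   # cut when the low 6 bits are zero: roughly 64-byte average chunks


def variable_chunks(data: bytes):
    """Cut after any position whose trailing WINDOW bytes hash to a chosen value."""
    chunks, start = [], 0
    for end in range(WINDOW, len(data) + 1):
        if zlib.crc32(data[end - WINDOW:end]) & MASK == 0:
            chunks.append(data[start:end])
            start = end
    if start < len(data):
        chunks.append(data[start:])  # whatever is left after the last cut
    return chunks


def fixed_chunks(data: bytes, size: int = 64):
    """Cut every `size` bytes, regardless of content."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def signatures(chunks):
    return {hashlib.sha256(chunk).hexdigest() for chunk in chunks}


def shared(chunker, original: bytes, modified: bytes) -> int:
    """Number of distinct chunks of `original` that also occur in `modified`."""
    return len(signatures(chunker(original)) & signatures(chunker(modified)))


if __name__ == "__main__":
    rng = random.Random(0)
    original = bytes(rng.getrandbits(8) for _ in range(8192))
    shifted = b"!" + original  # one byte inserted at the very front
    for name, chunker in (("fixed", fixed_chunks), ("variable", variable_chunks)):
        total = len(signatures(chunker(original)))
        print(name, shared(chunker, original, shifted), "of", total, "chunks still match")
```

Because the variable-block boundaries depend only on the bytes inside the window, every cut point simply moves along with the data, so at most the first chunk changes. The fixed-block boundaries stay put while the data slides underneath them, so every block from the insertion point on has different contents.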

With that out of the way, let’s get to the FUD part of the post.

I recently had a TLA vendor tell my customer: “NetApp deduplication is fixed-block vs our variable-block, therefore far less efficient, therefore you must be sick in the head to consider buying that stuff for primary storage!”

This is a very good example of FUD that is based on accurate facts which, in addition, focuses the customer’s mind on the tech nitty-gritty and away from the big picture (that being “primary storage” in this case).

Using the argument for a pure backup solution is actually valid. But what if the customer is not just shopping for a backup solution? Or, what if, for the same money, they could have it all?

My question is: Why do we use deduplication?

At the most basic level, deduplication will reduce the amount of data stored on a medium, enabling you to buy less of said medium yet still store quite a bit of data.

So, backups were the most obvious place to deploy deduplication. Backup-to-Disk is all the rage: what if you could store more backups on target disk with less gear? That’s pretty compelling. In that space you have of course Data Domain and the Quantum DXi as two of the more usual backup target suspects.

Another reason to deduplicate is to not only achieve more storage efficiency but also improve backup times by not even transferring over the network data that’s already been transferred. In that space there’s Avamar, PureDisk, Asigra, Evault and others.

NetApp simply came up with a few more reasons to deduplicate, not mutually exclusive with the other 2 use cases above:

  1. What if you could deduplicate your primary storage – typically the most expensive part of any storage investment – and as a result buy less?
  2. What if deduplication could actually dramatically improve your performance in some cases, while not hindering it in most cases? (the cache is deduplicated as well, more info later).
  3. What if deduplication was not limited to infrequently-accessed data but, instead, could be used for high-performance access?
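One way to picture the second point: if many logical blocks point at the same physical block after deduplication, a cache indexed by physical block only needs to hold that block once. The Python sketch below is a toy model of that idea, not NetApp’s implementation; the class name, the LRU eviction policy and the VDI numbers are all made up for illustration.

```python
from collections import OrderedDict


class DedupeAwareCache:
    """LRU read cache keyed by physical block, not by logical address."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.blocks = OrderedDict()  # physical block id -> data
        self.hits = 0
        self.misses = 0

    def read(self, physical_block: int, fetch):
        if physical_block in self.blocks:
            self.blocks.move_to_end(physical_block)  # mark as recently used
            self.hits += 1
        else:
            self.misses += 1
            self.blocks[physical_block] = fetch(physical_block)
            if len(self.blocks) > self.capacity:
                self.blocks.popitem(last=False)  # evict least recently used
        return self.blocks[physical_block]


if __name__ == "__main__":
    cache = DedupeAwareCache(capacity=100)
    # Hypothetical VDI farm: 10 desktops cloned from one image, so logical
    # block n of every desktop maps to the same physical block n.
    for desktop in range(10):
        for logical_block in range(100):
            physical_block = logical_block  # shared by all desktops
            cache.read(physical_block, fetch=lambda b: f"data-{b}")
    print(cache.hits, "hits,", cache.misses, "misses")
```

In this toy run the first desktop pays for 100 misses and the other nine are served entirely from cache (900 hits). A cache keyed per desktop would need ten times the space to do the same.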

For the uninitiated, NetApp is the only vendor, to date, that can offer block-level deduplication for all primary storage protocols for production data - block and file, FC, iSCSI, CIFS, NFS.

Which is a pretty big deal, as is anything useful AND exclusive.

What the FUD carefully fails to mention is that:

  1. Deduplication is free to all NetApp customers (whoever didn’t have it before can get it via a firmware upgrade for free)
  2. NetApp customers that use this free technology see primary storage savings that I’ve seen range anywhere from 10% to 95%, despite all the limitations the FUD-slingers keep mentioning
  3. It works amazingly well with virtualization and actually greatly speeds things up especially for VDI
  4. Things that would defeat NetApp dedupe will also defeat the other vendors’ dedupe (movies, compressed images, large DBs with a lot of block shuffling). There is no magic.

So, if a customer is considering a new primary storage system, like it or not, NetApp is the only game in town with deduplication across all storage protocols.

Which brings us back to whether fixed-block is less efficient than variable-block:

WHO CARES? If, even with whatever limitations it may have, NetApp dedupe can reduce your primary storage footprint by any decent percentage, you’re already ahead! Heck, even 20% savings can mean a lot of money in a large primary storage system!

Not bad for a technology given away with every NetApp system.

D

Mon
8
Feb '10

NetApp disk rebuild impact on performance (or lack thereof)

Due to the craziness in the previous post, I decided to post an actual graph showing a NetApp system’s I/O latency while under load and during a disk rebuild. It was a bakeoff vs another large storage vendor (which NetApp won).

The test was done at a large media company with over 70,000 Exchange seats. It was with no more than 84 drives, so we’re not talking about some gigantic lab queen system (I love Farley’s term). The box was set up per best practices, with aggregate size being 28 disks in this case.

(Edited at the request of EMC’s CTO to include the performance tidbit): Over 4K IOPS were hitting each aggregate (much more than the customer needed) and the system had quite a lot of steam left in it.

There were several Exchange clusters hitting the box in parallel.

All of the testing for both vendors was conducted by Microsoft personnel for the customer.  The volume names have been removed from the graph to protect the identity of the customer:

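[Graph: read and write latency over time during the test, with point 1 (disk pulled) and point 2 (rebuild complete) marked]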

Under a load of 8K I/Os at a 53:47 read/write ratio, a single disk was pulled.  Pretty realistic failure scenario: a disk breaks while the system is under production-level load. Plenty of writes, too, almost 50%.

Ok.  The fuzzy line around 6ms is the read latency.  At point 1 a disk was pulled and at point 2 the rebuild completed.  Read latency increased to 8ms during the rebuild, but dropped back down to 5 after the rebuild completed.  The line at less than 1 ms response time straight across the bottom is the write latency. Yes it’s that good.

So - there was a tiny bit of performance degradation for the reads but I wouldn’t say that it “killed” performance as a competitor alleged.

The rebuild time is a tad faster than 30 hours as well (look at the graph :) ) but then again the box used faster, 15K drives (and smaller, 300GB vs 500GB), so before anyone complains, it’s not apples-to-apples compared to the Demartek report.

I just wanted to illustrate a real example from a real test at a real customer using a real application, and show the real effects of drive failures in a properly-implemented RAID-DP system.

The FUD-busting will continue, stay tuned…

D

Tue
2
Feb '10

Vendor FUD-slinging – at what point should legal action be taken? And who do you believe as a customer?

I’m all for a good fight, but in the storage industry it seems that all too many creative liberties are taken when competing.

Let’s assume, for a moment, that we’re talking about the car industry instead. I like cars, and I love car analogies. So we’ll use that, and it illustrates the absurdity really well.

The competitors in this example will be BMW and Mercedes. Nobody would argue that they are two of the most prominent names in luxury cars today.

BMW has the high-performance M-series. Let’s take as an example the M6 – a 500HP performance coupe. Looks nice on paper, right?

Let’s say that Mercedes has this hypothetical new marketing campaign to discredit BMW, with the following claims (I need to, again, clarify that this campaign is entirely fictitious, and used only to illustrate my point, lest I get attacked by their lawyers):

  1. Claim the M6 doesn’t really have 500HP, but more like 200HP.
  2. Claim the M6 only does 0-60 in under 5 seconds with only 5% of the gas tank filled, a 50lb driver, downhill, with a tail wind and help from nitrous.
  3. Claim that if you fill the gas tank past 50%, performance will drop so the M6 does 0-60 in more like 30 seconds. Downhill.
  4. Claim that it breaks like clockwork past 5K miles.
  5. Claim that they have one, they tested it, and it performs as they say.
  6. Claim that, since they are Mercedes, the top name in the automotive industry, you should trust them implicitly.

Imagine Mercedes, at all levels, going to market with this kind of information – official company announcements, messages from the CEO, company blogs, engineers, sales reps, dealer reps and mechanics…

Now, imagine BMW’s reaction.

How quickly do you think they’d start suing Mercedes?

How quickly would they have 10 independent authorities testing 10 different M6 cars, full of gas, in uphill courses, with overweight drivers, just to illustrate how absurd Mercedes’ claims are?

How quickly would Mercedes issue a retraction?

And, to the petrolheads among us – wouldn’t such a stunt look like Mercedes is really, really afraid of the M6? And don’t we all know better?

More to the point – do you ever see Mercedes pulling such a stunt?

Ah, but you can get away with stuff like that in the storage industry!

Unfortunately, the storage industry is rife with vendors claiming all kinds of stuff about each other. Some of it is or was true, much of it is blown all out of proportion, and some is blatant fabrication.

For instance, XIV breaking if you pull 2 disks out – as I state in a previous post, it’s possible if the right 2 drives fail within a few minutes of each other. I think it’s unacceptable, even though it’s highly unlikely to happen in real life. But I’ve seen sales campaigns against the XIV use this as the mantra, to the point that the fallacy is finally stated: “ANY 2 drive failure will bring down the system”.

Obviously this is not true and IBM can demonstrate how untrue that is. Still, it may slow down the IBM campaign.

Other fallacies are far more complicated to prove wrong, unfortunately.

An example: Pillar Data has an asinine yet highly detailed report by Demartek showing NetApp and EMC arrays having significantly lower rebuild speeds than Pillar (as if that’s the most important piece of data management, but anyway – rebuild speed hasn’t helped Pillar sales much, even if it’s true).

Anyone that knows how to configure NetApp and EMC would see that the Pillar box was correctly configured, whereas the others were intentionally made to look 4x worse (in the case of NetApp, they went not just against best practices but blatantly against system defaults in order to make it slower). However, some CIOs might read this and give credence to it, since they don’t know the details and don’t read past the first graph.

For EMC and NetApp to dispute this, they have to go to the trouble of configuring, properly, a similar system, and running similar tests, then writing a detailed and coherent response. It’s like wounding an enemy soldier instead of killing them: their squadmates have to help them out, wasting manpower. I get it – it’s effective in war. But is it legal in the business world?

Last but not least: EMC and HP, at the very least, have anti-NetApp reports, blogs, PPTs etc. that literally look just like the absurd Mercedes/BMW example above, sometimes worse. Some of it was true a long time ago (the famous FUD “2x + snap delta” space requirement for LUNs is really “1x + snap delta” and has been for years), some of it is pure fabrication (“it slows down to 5% of its original speed if you fill it up!”). See here for a good explanation.

Of course, again that’s like wounding the enemy soldiers: NetApp engineers have to go and defend their honor, show all kinds of reports, customer examples, etc etc. Even so, at some point many CIOs will just say “I trust EMC/HP, I’ve been buying their stuff forever, I’ll just keep buying it, it works”. The FUD is enough to make many people that were just about to consider something else, go running back to mama HP.

Should NetApp sue? I’ve seen some of the FUD examples and literally they are not just a bit wrong but magnificently, spectacularly, outrageously wrong. Is that slander? Tortious interference? Simply a mistake? I’m sure some lawyer, somewhere, knows the answer. Maybe that lawyer needs to talk to some engineers and marketing people.

Let’s flip the tables:

If NetApp went ahead and simply claimed an EMC CX4-960 can only really hold 450TB, what would EMC do?

I can only imagine the insanity that would ensue.

I’ll finish with something simple from the customer standpoint:

NetApp sold 1 Exabyte of enterprise storage last year. If it was as bad as the other (obviously worried) vendors are saying, does that mean all those customers buying it by the truckload and getting all those efficiencies and performance are stupid and wasted their money?

D