Recovery Monkey: Musings on backups, storage, tuning and more


Mon 30 Aug '10

NetApp benefits for virtualization - benchmarked and proven

My colleague Vaughn Stewart explains it in detail here. I didn’t feel we gave this the publicity it deserves…

In a nutshell: We have numbers (published only after VMware engineering themselves approved the paper as accurate and gave their permission) proving that, compared to “traditional” arrays, running virtualized workloads on NetApp gear needs fewer resources while providing excellent performance.

If you don’t want to spend time reading Vaughn’s article, this link has the goods in impressive detail…

It’s worth noting the “traditional” array had a lot more disks and RAM, but the NetApp array had a Flash Cache module. We are not allowed to publish the vendor of the “traditional” array due to licensing restrictions, but, as mentioned, VMware engineering verified the results – the test was legit.

Some pictures for the impatient:

 

image

 

image

 

image

image

 

Key take-aways:

  1. A lot less disk space needed with NetApp
  2. A lot quicker to provision the VMs
  3. Faster performance than RAID10 even without the Flash Cache (and dramatically higher with)
  4. No-compromise RAID-DP offers same protection as RAID6 without the penalty
  5. NFS for VMware can be pretty fast indeed given the appropriate storage behind it!

 

D

Sat 7 Aug '10

FUD tales from the blogosphere: when vendors attack (and a wee bit on expanding and balancing RAID groups)

Haven’t blogged in a while, way too busy. Against my better judgment, I thought I’d respond to some comments I’ve seen on the blogosphere, adding one of my trademark extremely long titles. Part response, part tutorial. People with no time to read it all: Skip to the end and see if you know the answer to the question or if you have ideas on how to do such a thing.

It’s funny how some vendors won’t hesitate to wholeheartedly agree when some “independent” blogger criticizes their competition (before I get flamed, independent in quotes since, as I discussed before, there ain’t no such thing whether said blogger realizes it or not – being biased is a basic human condition).

It’s the equivalent of someone posting in an Audi forum about excessive brake dust, and having guys from Mercedes and BMW chime in to claim that they “tested” Audis and indeed found issues (but of course!), that their own cars are better now, and that maybe Audi doesn’t have as much of a lead any more (if, indeed, they ever did). I think the term for that is “shill” but I can understand taking every opportunity to harm an opponent.

So the “Storage Architect” posted entries asking about certain features to be implemented on NetApp storage, one of them being able to reduce the size of an aggregate. Then everyone and their mum jumped on and complained how on earth such an important feature isn’t there… :) BTW – I’m not saying such a thing wouldn’t be useful to have from time to time. I’ll just try to explain why it’s tricky to implement and maybe ways to avoid problems.

For the uninitiated, an “aggregate” is a collection of RAID-DP RAID groups that are pooled and striped, so that I/O hits all the drives from all RAID groups equally for performance. You then carve volumes out of that aggregate (containers for NFS, CIFS, iSCSI, FC).

A pretty simple structure, really, but effective. Similar constructs are used by many other storage vendors that allow pooling.

So, the question was, why not be able to make an aggregate smaller? (you can already make it bigger on-the-fly, as well as grow or shrink the existing volumes within).

An HP guy then proceeded to complain about how he put too few drives in an aggregate and ended up with an imbalanced configuration while trying to test a NetApp box.

So, some basics… the following picture shows a well-balanced pool – notice the equal number of drives per RAID group:

image

The idea being that everything is load-balanced:

image

Makes sense, right?

You then end up with pieces of data across all disks, which is the intent. Growing it is easy – which is, after all, what 99.99% of customers ever want to do.

However, the HP dude didn’t have enough disks to create a balanced config with the default-sized RAID group (16). So he ended up with something like this, not performance-optimal:

image

So what the HP dude wanted to do was reduce the size of the RAID group and remove drives, even though he had originally expanded the aggregate (and, by extension, the RAID group).

Normally, before one starts creating pools of storage (with any storage system), one also knows (or should) what one has to play with in order to get the best overall config. It’s like – “I want to build a 12-cylinder car engine, but I only have 9 cylinders”. Well – either buy more cylinders, or build an 8-cylinder engine… don’t start building the 12-cylinder engine and go “oops” :) This is just Storage 101. Mistakes can and do happen, of course.

So, with the current state of tech, if I only had 20 drives to play with (and no option to get more), assuming no spares, I’d rather do one of the following:

  1. Aggregate with 10 + 10 RAID groups inside or…
  2. Use all 20 drives in a single RAID group for max space
  3. Ask someone that knows the system better than I do for some advice

This is common sense and both doable and trivial with a NetApp system. The idea is you set the desired RAID group size for that aggregate BEFORE you put in disks. Not really difficult and pretty logical.

For instance, aggr options HPdudeAggr raidsize 10 before adding the drives would have achieved #1 above. Graphically, the Web GUI has that option in there as well, when you modify an aggregate. The option exists and it’s well-known and documented. Not knowing about it is a basic education issue. Arguing that no education should be needed to use a storage device (with an extreme number of features) properly even for deeply involved, low-level operations, is a romantic notion at best. Maybe some day. We are all working hard to make it a reality. Indeed, a lot of things that would take a really long time in the past (or still, with other boxes) have become trivialized – look at SnapDrive and the SnapManager products, for instance.

Back to our example: if, in the future, 10 more disks were purchased, and approach #1 above was taken, one would simply add the ten disks to the aggregate with aggr add HPdudeAggr 10. Resulting in a 10+10+10 config.
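The effect of raidsize on the resulting layout can be sketched in a few lines of Python. This is only an illustrative model, not how the system itself is implemented: it assumes disks simply fill one RAID group up to raidsize before the next group is started, and the function names are made up for the example.

```python
RAID_DP_PARITY = 2  # RAID-DP uses 2 parity drives per RAID group


def raid_group_layout(total_disks, raidsize):
    """Assumed model: fill each RAID group to raidsize, the last one may be short."""
    groups = []
    remaining = total_disks
    while remaining > 0:
        size = min(raidsize, remaining)
        groups.append(size)
        remaining -= size
    return groups


def data_disks(groups):
    """Data drives left in each group once the two parity drives are taken out."""
    return [max(size - RAID_DP_PARITY, 0) for size in groups]


# 20 drives with the default raidsize (16), with raidsize 10, and as one big group
for raidsize in (16, 10, 20):
    groups = raid_group_layout(20, raidsize)
    print(f"raidsize {raidsize}: groups {groups}, data drives {data_disks(groups)}")
```

With 20 drives, the default raidsize of 16 gives the lopsided 16 + 4 layout (14+2 and 2+2), while raidsize 10 gives 10 + 10, and 30 drives at raidsize 10 give 10 + 10 + 10.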

But what if I had done #2 above (make a 20-drive RAID group the default for that aggregate)?

Then, simply, you’d end up imbalanced again, with a 20+10. Some thought is needed before embarking on such journeys.

Maybe a better approach would be to add a more reasonable number of drives to achieve good balance? Adding 12 more drives, for example, would allow for an aggregate with 16+16 drives. So, one could simply change the raidsize using aggr options HPdudeAggr raidsize 16, then add the 12 disks to the aggregate with aggr add HPdudeAggr -g all 12.

This would expand both RAID groups contained within the aggregate dynamically to 16 drives per, resulting in a 16+16 configuration. Which, BTW, is not something you can easily do with most other storage systems…
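Here is a rough Python model of that expansion, to make the arithmetic concrete. It assumes new disks first top up existing RAID groups (smallest first) until they reach raidsize, and that any leftovers start new groups. The real placement logic may well differ, and add_disks_to_all is a made-up name for this sketch.

```python
def add_disks_to_all(groups, raidsize, new_disks):
    """Assumed model of growing existing RAID groups before creating new ones."""
    groups = list(groups)  # work on a copy

    # Top up existing groups one disk at a time, always the smallest one first.
    while new_disks > 0:
        open_groups = [i for i, size in enumerate(groups) if size < raidsize]
        if not open_groups:
            break
        smallest = min(open_groups, key=lambda i: groups[i])
        groups[smallest] += 1
        new_disks -= 1

    # Whatever is left over starts new RAID groups.
    while new_disks > 0:
        size = min(raidsize, new_disks)
        groups.append(size)
        new_disks -= size
    return groups


print(add_disks_to_all([10, 10], raidsize=16, new_disks=12))  # [16, 16]
print(add_disks_to_all([20], raidsize=20, new_disks=10))      # [20, 10]
```

The first call is the 10 + 10 aggregate growing to 16 + 16. The second is the single 20-drive group ending up as an imbalanced 20 + 10.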

Having said all that, I think that for people that are not storage savvy (or for the storage savvy that are suffering from temporary brain fog), a good enhancement would be for the interfaces to warn you about imbalanced final configs and show you what will be created in a nice graphical fashion, asking you if you agree (and possibly providing hints on how it could be done better).

I’m not aware of any other storage system that does that degree of handholding but hey, I don’t know everything.

Indeed, maybe the nature of the other posts was being bait so I’ll obligingly take the bait and ask the question so you can advertise your wares here: :)

Is anyone aware of a well-featured storage system from an established, viable vendor that currently (Aug 7, 2010, not roadmap or “Real Soon Now”) allows the creation of a wide-striped pool of drives with some RAID structures underneath; then allows one to evacuate and then destroy some of those underlying RAID groups selectively, non-disruptively, without losing data, even though they already contain parts of the stripes; then change the RAID layout to something else using those same existing drives and restripe without requiring some sort of data migration to another pool and without needing to buy more drives? Again, NOT for expansion, but for the shrinking of the pool?

To clarify even further, what the HP guy did was exactly this: he had 20 drives to play with, and by mistake created a pool with 2 RAID groups, a 14+2 and a 2+2. How would your solution take those 2 RAID groups, with data, and change the config to something like 10 + 10 without needing more drives or destroying anything?

Can you dynamically reduce a RAID group? (NetApp can dynamically expand, but not reduce a RAID group).

I’m not implying such a thing doesn’t exist, I’m merely curious. I could see ways to make this work by virtualizing RAID further. Still, it’s just one (small) part of the storage puzzle.

The one without sin may cast the first stone! :)

D


Mon 24 May '10

Et tu, Brute? EMC offering capacity guarantees? The sky is falling! Will Chuck resign?

It came to my attention that EMC is offering a 20% efficiency guarantee vs the competition (they seem to be focusing on NetApp as usual but that’s beside the point in this post). See here.

Now, I won’t go ahead and attack their guarantee. Good luck with that, more power to you etc etc. They need all the competitive edge they can get.

No, what I’ll do is expose yet more EMC messaging inconsistency. If you’ve been following the posts on my site you’ll notice that I have absolutely nothing against EMC products – but I do have issues with how they’re sold and marketed and what they’ll say about the competition.

First and foremost – most major storage players, with the notable exception of EMC, have been offering some kind of efficiency guarantee. Sure, you needed to read the fine print to see if your specific use case would be covered (like with every binding document), but at least the guarantees were there. NetApp was first with our 50% efficiency guarantee, then came others (HDS and 3Par are just some that come to mind). We even offer a 35% guarantee if we virtualize EMC arrays :)

We all have different ways of getting the efficiency – NetApp has a combo of deduplication, thin provisioning, snapshots, highly efficient RAID and thin cloning, for instance. Others have a subset (3Par has their really good thin provisioning, for instance). Regardless, we all tried to offer some measure of extra efficiency in these hard economic times.

And it’s not just marketing – I have multiple customers that, especially on virtualized environments, save at least 70% (that’s a real 70%, not 70% because we switched them from RAID10 to RAID-DP – literally, a 10TB data set is occupying 3TB). And for deployments like VDI, the savings are in the extreme range.

EMC’s stance was to, at a minimum, ridicule said guarantees. The inimitable Barry Burke (the storage anarchist) had this pretty funny post.

Chuck Hollis has been far more polemic about this – the worst was when he said he’d quit if EMC tried to do something similar (see here in the comments). BTW – we are all waiting for that resignation :) (on a more serious note, Chuck, if you don’t resign because of this, at least refrain from promising next time).

He also called other guarantees “shenanigans” here. I guess he’s really against the idea of guarantees.

But now it’s all good – you see, EMC is offering a blanket 20% efficiency guarantee versus the competition! I.e. – they will be able to provide 20% more actual usable storage or else they’ll give you free drives to cover the difference. You see, this guarantee is real, not like the ones all the other companies offer :)

Kidding aside, methinks they’re missing the point – this (to go back to my favorite car analogies) is like saying: “Both our car and your car have a 3-liter engine, but yours has twin turbos and a racing intercooler and 3 times the horsepower – but we won’t take any of that into account, we will strictly examine whether you indeed have a 3-liter engine, and we’ll bore ours out to make it 3.6 liters for free”. Alrighty then. I’ll keep my turbos. But how will they deal with an existing NetApp customer that’s getting something like 3x efficiency already?

If a NetApp customer is getting 3x the usable storage due to deduplication and other means, will EMC come up with the difference or will they just make sure they offer 20% more raw storage? 

To the customer, all that matters is how much effective storage they’re able to use, not how much raw storage is in the box.
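A quick back-of-the-envelope comparison, sketched in Python. The 10TB baseline is a made-up figure for illustration; the 20% and 3x multipliers are the ones discussed above.

```python
def effective_tb(usable_tb, multiplier):
    """Effective capacity the customer can actually use, in TB."""
    return usable_tb * multiplier


baseline_tb = 10.0  # hypothetical usable capacity, same for both systems

with_20_percent_more = effective_tb(baseline_tb, 1.2)  # the 20% guarantee
with_3x_efficiency = effective_tb(baseline_tb, 3.0)    # dedupe, clones etc.

print(f"20% guarantee: {with_20_percent_more:.1f}TB effective")
print(f"3x efficiency: {with_3x_efficiency:.1f}TB effective")
print(f"gap left to cover: {with_3x_efficiency - with_20_percent_more:.1f}TB")
```

Under those assumptions the guarantee yields 12TB against 30TB, which is the gap the question above is getting at.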

But, still, this is not what this post is about.

Throughout the years, NetApp and other vendors have offered true innovation on different fronts. Each time that happens, EMC (that also innovates - through acquisition mostly - but likes to act as if nobody else does) employs their usual “minimize and divert” technique. Either they will trivialize the innovation (“who’d want to do that?”) or they will proclaim it false, then divert attention to something they already do (or will do in a few years).

This is even the case for technologies EMC eventually acquired, like Data Domain. Before EMC acquired Data Domain, they disparaged the product, claimed it was the worst kind of device you’d ever want in your datacenter, then tried to sell you the execrable DL3D (AKA Quantum DXi – don’t get me started, the first release was an utter mess).

We all know what happened to that story eventually… at the moment, EMC is offering to swap out existing DL3Ds for free in many cases, and put Data Domain in their place since it’s infinitely better. But wait, weren’t they saying how terrible Data Domain was compared to DL3D?

Some will say this is fine since they’re just trying to compete, and “all is fair”. Personally, if I were approached by sales teams with those about-face tactics, I’d be annoyed.

So, without further ado, I present you with a slide a colleague created. Some of the timing may be a bit off, but the gist should be fairly clear… :)

image

I could have added a few more lines (Flash Cache, for instance) but it would have made for too busy a slide.

EDIT: I’ll add something I posted as a comment on someone else’s blog that I think is germane.

Since, to provide apples-to-apples protection, EMC HAS to be configured with RAID6, where are the public benchmarks showing EMC RAID6? As you well know, ALL NetApp benchmarks (SPEC, SPC) are with RAID-DP. Any EMC benchmarks around are with RAID10.

Maybe another guarantee is needed:

Provide no worse protection, functionality, space and performance than X competitor.

Otherwise, you’re only tackling a relatively unimportant part of the big picture.

D


Fri 7 May '10

NetApp usable space – beyond the FUD

I come across all kinds of FUD, and some of the most ridiculous claims against NetApp regard usable space. I won’t post screenshots from competitive docs since who knows who’ll complain, but suffice it to say that one of the usual strategies against NetApp is to claim the system has something like well under 50% space efficiency using a variety of calculations, anecdotes and obsolete information. In one case, 34% usable space :) Right…

The purpose of this post is to outline the state of the art regarding NetApp usable space as of Spring of 2010. 

Since NetApp systems can use free space in various ways instead of just for LUNs, there is frequent confusion regarding what each space-related parameter means, and what the best practices are. NetApp’s recommendations have changed over the years as the technology matured – my goal is to bring everybody up to speed.

Executive summary

Depending on the number and type of drives and the design, aside from edge cases dealing with small systems with a very low number of disks, the real usable space in NetApp systems can easily exceed 75% of the real usable space in the drives. I’ve seen it as high as about 78%. That’s amazingly efficient for something with double-parity protection as default and includes spares. This number is the same whether it represents NAS or SAN data and doesn’t include deduplication, compression or space-efficient clones, which could inflate it to over 1000%. Indeed, NetApp systems are used in the biggest storage installations on the planet partly because they’re so space-efficient. Now, on to the details…

What’s space good for anyway?

Legacy arrays use space in very simple terms – you create RAID groups, then you create LUNs on them and those LUNs pretend they’re normal disks, and that’s that. Figuring out where your space goes is easy – there’s a 1:1 relationship between LUN size and space used on the array. You buy an array that can provide 10TB after RAID and spares, and that’s all you ever get – nothing more, nothing less.

Legacy arrays can sometimes use features such as snapshots, but frequently there are so many caveats around their use (performance being a big one) that either they’re never implemented, or their number is too small to make them really useful.

Since NetApp gear doesn’t suffer from those limitations, customers invariably end up using snapshots a lot, and for various reasons, not just backup. I have customers with over 10,000 snapshots in their arrays… they replicate all those snapshots to another array, can retrieve data that’s several months old, and have stopped relying on legacy backup software, saving money and achieving far faster and easier DR in the process, since with snapshots there’s no restore needed.

What’s your effective space with NetApp gear?

If you consider that each snapshot looks like a complete copy of your data, without factoring in any deduplication at all, the effective logical space could be many, many times more than the physical space. A large law firm I deal with manages to fit about 2.5PB of data into 8TB of snapshot delta space – which is pretty efficient by anyone’s standards. We’re not talking about backups done on deduplicated disk here that need to be restored to become useful – we’re talking about many thousands of straight-up, application-consistent, full copies of LUNs, CIFS and NFS shares that you can mount at full speed instantly, without needing to restore from another medium or backup application.

Once you add deduplication and thin cloning, the storage efficiency goes even higher.

It’s not the size of your disk that matters, it’s how you use it

If you use a NetApp system like a legacy disk array, without taking advantage of any of the advanced features (maybe you just care for the multi-protocol functionality, with great performance and reliability) then your usable space falls right within norms. Once you start using the advanced snapshot features, they start eating space of course – but giving you something in return. What you need to figure out is if the tradeoffs are worth it: for instance, if I can keep a month’s worth of Exchange backups with a nominal capacity increase, what is that worth for me? Maybe:

  • I can eliminate backup software licenses
  • I can shrink my storage footprint
  • Avoid purchasing external disk for backups
  • I don’t need to buy external CDP hardware/software and a bunch of extra disk
  • My restores take seconds
  • DR becomes trivial

Or, if I can create 150 clones of my SQL database that my developers can simultaneously use and only chew up a small fraction of the space I’d otherwise need, what is that worth? With other systems, I’d need 150x the space…

Or, create thousands of VM clones for VDI…

How much money are you saving?

What do simplicity and speed mean to your business from an OpEx savings standpoint?

Another way to look at it:

How much more efficient would your business be if you weren’t hampered by the limitations of legacy technology? It’s all about becoming aware of the expanded possibilities.

What you buy

FYI, and to clear any misconceptions in case you can’t be bothered to read the rest: if you ask me for a 10TB usable system, you’ll get a system that will truly provide 10TB usable, honest-to-goodness Base2 space protected against dual-drive failure (no RAID5), and after all overheads, spares etc. have been taken out. If you want snapshot space we’ll have to add some (like you’d need to with any other vendor). It’s as simple as that.

Right-sized, real space vs raw capacity

Others have explained some of this before but, for completeness, I’ll take a stab:

  • The real usable size of, say, a 450GB drive is not really 450GB regardless of the manufacturer.
  • The real usable capacity quoted depends on whether it’s Base2 or Base10 math and a bunch of other factors…
  • All vendors that source drives from multiple manufacturers that use RAID groups need to right-size their drives – meaning that, if manufacturer A offers a tad more space in the drive than manufacturer B, in order to use both kinds of drives in the same RAID group, you kinda need to make them seem like the exact same size, meaning you go for the lowest common denominator.
  • Using our 450GB example above, the real addressable right-sized Base10 space in that drive is 438.3GB, and even “less” in Base2 (402.2). Base2 math simply means 1024 bytes in 1K, not 1000, and the rest follows.
  • Beware of analyses, comparisons or quotes showing Base10 from one vendor and Base2 from another, or raw disk space from one vendor vs right-sized from another! Always ask which base you’re seeing and whether the numbers reflect right-sized drives! If you look at the right-sized drive Base2 space from various vendors, it’s usually pretty close. Base your % usable calculations on that number and not the marketing 450GB number that’s not real for any vendor anyway.
  • Everyone pretty much buys the same drives from the same drive manufacturers…
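To see how much the choice of base alone matters, here is a minimal Python sketch of the unit conversion. It only converts between Base10 and Base2 units; right-sizing is a separate, vendor-specific reduction that is not modelled here, and the function name is made up for the example.

```python
def base10_gb_to_base2_gb(gb):
    """Convert Base10 GB (10**9 bytes) into Base2 GB (2**30 bytes)."""
    return gb * 10**9 / 2**30


marketing_gb = 450
base2_gb = base10_gb_to_base2_gb(marketing_gb)

print(f"{marketing_gb}GB (Base10) = {base2_gb:.1f}GB (Base2)")
print(f"'lost' to the change of base alone: {1 - base2_gb / marketing_gb:.1%}")
```

A “450GB” drive is about 419.1GB in Base2 before any right-sizing at all, roughly a 6.9% difference, which is why mixing bases in a comparison skews the percentages.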

Some space reservation axioms

Any system that allows snapshots, clones etc. typically needs some space for those advanced operations. For instance, if you completely fill up a system and then want to take a snapshot, it may let you but if you modify any data then it won’t have space to store the writes and the snapshot will be invalidated and deleted – kinda pointless.

As usual, there is no magic. If you expect to be able to store multiple snapshots, the system needs space to store the data changed between snapshots, regardless of array vendor!

And, out of curiosity – how many man-made devices do you own that you max out all the time? Not leaving breathing room is a recipe for trouble for any piece of equipment.

Explanation of the NetApp data organization

For the uninitiated, here’s a hierarchical list of NetApp structures:

  1. Disks
  2. RAID groups – made of multiple disks. Default RAID is RAID-DP. The system automatically makes them, you don’t need to define them or worry about back-end balancing etc. NetApp RAID groups are typically large, 16 disks or so. RAID-DP ensures better protection than RAID10 (the math shows 163x better than RAID10 and 4,000x better than RAID5).
  3. Parity drives – drives containing extra information that can be used to rebuild data. RAID-DP uses 2 parity drives per RAID group.
  4. Spares – drives that can replace failed or failing drives (no need to wait until the drive is truly dead)
  5. Aggregates – a collection of RAID groups and the basic unit from which space is allocated. That’s really what you define, then the system figures out automatically how to allocate disks and create RAID groups for you (can even expand RAID groups on the fly as you add more disks to the aggregate, even 1 disk at a time).
  6. Volumes – a container that takes space from an Aggregate. A volume can be NAS or SAN. A volume can only belong to one Aggregate, and there will typically be many volumes within an Aggregate. Most people will enable the automatic growing of Volumes.
  7. LUNs – they are placed inside the Volumes. One or more per volume, depending on what you’re trying to do. Usually one.
  8. Snapshots – logical, space-efficient copies of either entire Volumes or structures within volumes. There are 3 kinds depending on what you’re trying to do (Snapshot, Snapvault and Flexclone) but they all use similar underlying technology. I might get into the differences in a future post. Briefly: Snapshot – shorter term, Snapvault – longer term, Flexclone – writeable Snapshot.
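To put the RAID group and parity items above into numbers, here is a small Python sketch of the parity overhead per RAID group. The 16-disk RAID-DP group comes from the list above; the RAID5, RAID6 and RAID10 group sizes are assumptions picked as common examples, not figures for any particular vendor.

```python
def parity_overhead(group_size, protection_disks):
    """Fraction of a RAID group's drives not available for data."""
    return protection_disks / group_size


# (drives in group, drives used for parity or mirroring)
layouts = {
    "RAID-DP, 16-disk group": (16, 2),
    "RAID6, 6+2 group (assumed)": (8, 2),
    "RAID5, 4+1 group (assumed)": (5, 1),
    "RAID10, mirrored pair": (2, 1),
}

for name, (size, protection) in layouts.items():
    print(f"{name}: {parity_overhead(size, protection):.1%} overhead")
```

Under these assumptions a 16-disk RAID-DP group gives up 12.5% of its drives to parity, compared with 25% for a 6+2 RAID6 group, 20% for a 4+1 RAID5 group and 50% for RAID10.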

Explanation of the NetApp space allocations

  1. Snapshot Reserve – an accounting feature that sets aside a logical percentage of space on a Volume. For instance, if you create a 10TB volume and set a 10% Snap Reserve, the client system will see 9TB usable. Most people will enable automatic deletion of Snapshots. The percentage to set aside is at your discretion and is variable on the fly. The actual amount of space consumed is related to your rate of change between snapshots. See here for some real averages across thousands of systems.
  2. Aggregate Snap Reserve – this is pretty unique. One can actually roll back an entire Aggregate on a NetApp system – can come in handy if you accidentally deleted whole Volumes or in general did some gigantic boo-boo. Rolling back the entire Aggregate will undo whatever was done to that aggregate to break it! This feature is enabled by default and has a 5% reservation. It is not mandatory unless you are running Syncmirror (mostly in Metrocluster setups). Depending on what you want to do, you could disable this altogether or set it to a small number like 1% (my recommendation).
  3. Fractional Reserve – The one that confuses everyone. In a nutshell: it’s a legacy safety net in case you want to modify all the data within a LUN yet still keep the snapshots. Think about it: Let’s say you took a snapshot and you then went ahead and modified every single block of your data. Your snap delta would balloon to the total size of the LUN – regardless of whether you use NetApp, EMC, XIV, Compellent, 3Par, HDS, HP etc etc. The data has to go someplace! There’s a great explanation in this document and I suggest you read it since it covers quite a bit more, too. This one is great, too. Long story short: With snapshot autodelete, and/or volume autogrow, you can set it to zero. If you use the SnapManager products, they take care of snapshot deletion themselves.
  4. System reserve – this is the only one that’s not optional. It’s set to 10% by default. You can actually change it but I’m not telling you how. That space is there for a reason, and changing it will potentially cause problems with high write rate environments. That 10% is used for various operations and has been found to be a good percentage to maintain good performance. All NetApp sizing takes this into account. BTW – ask other vendors if it’s perfectly safe to fill their systems at 100% all the time and whether that impacts performance or prevents them from being able to do certain things… And finally, that 10% lost is gained back in spades with the other NetApp efficiency methodologies (starting at the low level with RAID-DP – please do some simple math based on our 16+ drive RAID group vs typical RAID group sizes) so it doesn’t even matter.

Bottom line: Aside from the 10% system reserve, the rest is all usable space.
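The reservations above can be sketched as simple arithmetic in Python. This is a simplified accounting model: it assumes the system reserve is taken first and the aggregate snap reserve is a percentage of what remains, and the function names are made up for the example.

```python
def visible_volume_tb(volume_tb, snap_reserve=0.0):
    """Space the client sees in a volume once the Snapshot Reserve is set aside."""
    return volume_tb * (1 - snap_reserve)


def aggregate_usable_tb(post_raid_tb, system_reserve=0.10, aggr_snap_reserve=0.05):
    """Aggregate space left after the system reserve and aggregate snap reserve."""
    return post_raid_tb * (1 - system_reserve) * (1 - aggr_snap_reserve)


# The example from the list: a 10TB volume with a 10% Snap Reserve
print(f"client sees {visible_volume_tb(10.0, snap_reserve=0.10):.2f}TB")

# 10TB of space after RAID and spares, with different aggregate snap reserves
for reserve in (0.05, 0.01, 0.0):
    usable = aggregate_usable_tb(10.0, aggr_snap_reserve=reserve)
    print(f"aggregate snap reserve {reserve:.0%}: {usable:.2f}TB usable")
```

With the default 5% aggregate snap reserve this model leaves 8.55TB out of 10TB, 8.91TB at 1%, and 9.00TB with the reserve disabled, so only the 10% system reserve remains as fixed overhead.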

The NetApp defaults and some advice

So, here’s where it can get interesting (and confusing) and where the competition gets all their ammunition. Depending on the age of the documentation and firmware, different best practices and defaults apply.

So, if you look at competitive docs from other vendors, they claim that if you use NetApp for LUNs you waste double the space for fractional reserve. That recommendation was true many years ago and it was a safety precaution regarding fractional reserve. The documentation was updated years ago with zero fractional reserve as the recommendation, but of course that doesn’t help competitors so they kept the old messaging. So here’s a basic list of quick recommendations for LUNs:

  1. Snap reserve – 0
  2. Fractional reserve – 0
  3. Snap autodelete on (unless you have SnapManager products managing the snap deletion)
  4. Volume autogrow on
  5. Leave at least a little space available in your volumes, don’t let a LUN 100% fill a volume (the LUN space can be thick but the volume space can be thin-provisioned). This space is needed for deduplication and other processes temporarily
  6. Do consider embracing thin provisioning, even if you don’t want to oversubscribe your disk. It’s much more flexible long-term, and allows for storage elasticity.

So, look at the defaults and ask your engineer if it’s OK to change them if they don’t agree with the settings above. Especially on older systems, I notice that the fractional reserve is still 100%, even after getting updated with the latest software (the update doesn’t change your config). Nothing like giving someone a bunch of disk space back with a few clicks…
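To show where the “double the space” claim comes from, here is a simplified Python model of the space set aside for a LUN at different fractional reserve settings. It treats the reserve as a flat percentage of the LUN size held in the volume, which glosses over when the reserve actually kicks in; the 500GB LUN and the function name are made up for the example.

```python
def lun_volume_footprint_gb(lun_gb, fractional_reserve, snapshot_delta_gb=0.0):
    """Assumed model: LUN + overwrite reserve + space for snapshot deltas."""
    overwrite_reserve_gb = lun_gb * fractional_reserve
    return lun_gb + overwrite_reserve_gb + snapshot_delta_gb


lun_gb = 500.0  # hypothetical LUN size

old_default = lun_volume_footprint_gb(lun_gb, fractional_reserve=1.0)
recommended = lun_volume_footprint_gb(lun_gb, fractional_reserve=0.0)

print(f"fractional reserve 100%: {old_default:.0f}GB set aside")
print(f"fractional reserve 0%:   {recommended:.0f}GB set aside")
print(f"space handed back:       {old_default - recommended:.0f}GB")
```

In this model, moving a 500GB LUN from 100% to 0% fractional reserve hands back 500GB, which only makes sense with snapshot autodelete and/or volume autogrow acting as the safety net instead.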

If you want to do thin provisioning, depending on the firmware, you may see that using thin provisioning on a volume forces the fractional reserve to 100% – but, ultimately, no real space is being consumed. This was OK in 7.2x, changed to the 100% behavior in 7.3.1, and was fixed in 7.3.3 since it was confusing everyone.

The bottom line

Ultimately, I want you thinking of how you can use your storage as a resource that enables you to do more than just storing your LUNs. And, finally, I wanted to dispel notions that NetApp storage has less storage efficiency than legacy systems. Comments are always appreciated!

D

Tue 27 Apr '10

What exactly is Unified Storage and who can sell it to you?

It’s come to my attention that pretty much every storage manufacturer is trying to imitate NetApp’s thought leadership and keeps announcing “Unified Storage” products. Everyone can do it now, it seems :)

Now, this post is not going to be bashing them or claiming they don’t work.

This post is about arguing what “Unified Storage” really means. And, more importantly, whether you should care about the differences.

NetApp has been shipping Unified Storage for 8+ years, and has shipped 150,000 Unified Storage systems to date. See here and here. So, I’d think nobody can dispute that NetApp has quite a bit of experience in the technology and, indeed, was the very first to do it. Depending on your definition of “Unified”, NetApp may still be the only one doing it, but read on.

The crazy success of NetApp’s Unified Storage (just look at the company’s growth) has forced the other vendors, who initially dismissed the concept, to take a harder look – imagine that, customers actually like the idea of a Unified Storage System!

Here’s how most (if not all) other vendors approach ”Unified Storage”:

  • Start with your legacy Fibre Channel Array, use that to serve FC and maybe iSCSI – it’s probably a decent box, no reason to re-invent the wheel.
  • Connect some kind of Windows, Linux or UNIX server(s) to it that will then serve CIFS and NFS and maybe iSCSI (this is the NAS part)
  • Replicate them using different mechanisms for the FC and NAS parts

Pretty simple, really. You end up with the base legacy array, plus more boxes on top (ideally 2+ to ensure redundancy, plus some of them need an extra box or two called a “Control Station”).

It all works – after all, it’s just like putting servers in front of your storage, you’re doing that anyway. You are able to serve FC, iSCSI, NFS and CIFS out of the same rack, if we assume that the rack is the termination point for the cables and that you don’t care much about exactly what happens within. So, most C-level execs are OK with it – the rack can serve out all those protocols, ergo the “Unified Storage” claim seems justified.

Here are some potentially business-impacting issues with this approach:

  1. Aside from a couple of exceptions, the add-on boxes used by the storage vendors to add the NAS protocols aren’t made by that vendor (neither the OS nor the hardware). Obviously that raises some concerns with interoperability, manageability and the longevity of whatever NAS vendor was chosen. Support is now maybe not as robust since you are relying on using tech someone licensed from someone else.
  2. Replication gets complicated since you need to do it a few different ways depending on what protocol you’re replicating.
  3. Patching is more time-consuming since, apart from the legacy array, you need to also patch all the NAS paraphernalia.
  4. Management is frequently totally separate and laborious – you take care of the legacy array separately from the NAS part
  5. Certain important features are only available to one part of the solution (file-level single-instancing/“dedupe”, for example, only available for CIFS and NFS and not for iSCSI or FC).
  6. And, finally, what I think is the biggest problem: Space allocation is split between the FC and NAS parts and you can’t reduce one to increase the other. For instance, if you started with a 50/50 split, once you’ve allocated the space to the NAS (that always has its own Volume Manager and now owns that 50% chunk of array space), and you realize you’re only using 10% of that space after all, you can’t go ahead and return the remainder of the space to the FC part. This can cause serious inefficiency, inflexibility, cost and manageability issues.
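To put rough numbers on that last point, here is a small sketch. The 100TB array size is an assumption picked purely for illustration; the 50/50 split and the 10% utilization come from the example above.

```python
def stranded_tb(array_tb, nas_pct, nas_used_pct):
    """Space handed to the NAS volume manager that sits unused
    and cannot be returned to the FC side."""
    nas_tb = array_tb * nas_pct / 100
    return nas_tb * (100 - nas_used_pct) / 100

# Hypothetical 100TB array, 50/50 split, NAS side only 10% used.
print(stranded_tb(100, 50, 10))  # 45.0
```

In this made-up case, 45TB of a 100TB array is paid for but unavailable to the side that might actually need it.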

The NetApp approach

NetApp decided to do things a bit differently. Maybe by virtue of how the original systems started out, it turned out it was easier for NetApp to create what is effectively a protocol engine. Maybe “Protocol Engine with Integrated Disk Control and Protection” is more appropriate than “Unified Storage” but it’s a bit wordy…

Effectively, a single NetApp box, without external hangers-on, allows you to:

  • Connect using a variety of methods – FC, 1GbE, 10GbE, FCoE
  • Use the proprietary NetApp RAID-DP protection for great performance and better protection than RAID10
  • Provision FC, iSCSI, CIFS and NFS out of the same pool of physical disk space
  • Deduplicate FC, iSCSI, CIFS and NFS workloads
  • Perform application-aware replication regardless of protocol
  • Take application-aware snapshots regardless of protocol
  • Clone VMs, DBs and indeed, anything you like, without chewing up space and without impacting performance
  • Virtualize legacy arrays and impart on them the NetApp features
  • Perform workload and cache prioritization
  • Auto-migrate hot blocks to large flash cache to increase speeds (at a super-efficient 4K granularity)

As you can see, everything happens within one system, there’s no separate RAID controller or NAS box or replication box. And, like it or not, that’s a pretty impressive list of capabilities.

The potential business benefits with a true Unified Storage system

  1. Single product – you’re not relying on the marriage of completely different boxes.
  2. Better reliability, fewer things to break.
  3. Better support – no finger-pointing, it’s a single system from a single company.
  4. Consistent replication – one way to replicate things, yet still application-aware for 100% recoverability, improved CapEx and OpEx.
  5. Management simplicity – lower OpEx.
  6. All performance-enhancing and efficiency features are available to all protocols – Improved CapEx.
  7. There’s no dichotomy between FC, iSCSI and NAS space – allocations are fluid – Improved CapEx and OpEx.
  8. Protect your existing investment by virtualizing existing legacy disk arrays – improved CapEx and OpEx.
  9. Overall lower OpEx and CapEx – in addition to the significant space-saving features (avoid purchasing as much storage long-term), there’s significant cost avoidance since you potentially don’t need to purchase: Backup software, deduplication appliances, replication appliances, fileservers, OS licenses…

So, should you care how “Unified Storage” is architected?

Beyond the philosophical debate (one box vs multiple), given what you read, what do you think? I believe that the multi-box approach has some inherent drawbacks that are difficult to overcome. Comments welcome as always.

D

Mon
29
Mar '10

Filesystem benchmark extravaganza – Win, Linux, NTFS, EXT4, XFS, BFS scheduler impact and more…

It’s been a while since I checked the status of Linux-land regarding filesystems and CPU and I/O schedulers, so I thought I’d post some results.

A bunch of new distributions are coming out with Linux kernel 2.6.32 as standard (Ubuntu 10.04LTS one of them), and one distribution (PC Linux OS) will have Con Kolivas’ BFS as default process scheduler.

Since I’m a scheduler and filesystem aficionado, I was intrigued by the simplicity of the BFS and wanted to see how it might affect I/O performance. I don’t believe that there’s ever a single scheduling algorithm that fits all possible use cases, and, furthermore, I think the default Linux CFS scheduler is getting a bit unwieldy in its complexity (though the goal is to make it scale to very large numbers of cores, not all machines have that).

So I grabbed a spare machine that doesn’t have a ton of horsepower on purpose, and tested using the same benchmark (NetApp’s venerable postmark) on the same part of the disk, with Windows, Ubuntu 10.04LTS beta, and PCLOS 2010 beta2.

Here’s some of the averaged data (did multiple runs):

 

| Configuration | Time | IOPS | Creation | Reads | Appends | Deletes | Read MB/s | Write MB/s |
|---|---|---|---|---|---|---|---|---|
| win tuned | 208 | 244 | 156 | 121 | 123 | 183 | 2.68 | 5.658 |
| win supercache | 197 | 224 | 200 | 111 | 113 | 175 | 2.78 | 5.88 |
| win uptempo | 172 | 281 | 277 | 139 | 141 | 156 | 3.19 | 6.73 |
| Ubuntu 10.04b | 154 | 169 | 454 | 84 | 85 | 727 | 3.56 | 7.52 |
| Ubuntu 10.04b deadline | 136 | 173 | 500 | 86 | 87 | 10184 | 4.03 | 8.51 |
| PCLOS ext4 CFQ | 105 | 239 | 1011 | 118 | 120 | 4478 | 5.25 | 11.09 |
| PCLOS ext4 deadline | 97 | 240 | 1158 | 119 | 121 | 7828 | 5.73 | 12.21 |
| PCLOS XFS tuned deadline | 187 | 136 | 476 | 68 | 68 | 509 | 2.93 | 6.19 |

Some explanation on the fields:

  • Time: Total time to run
  • IOPS: mixed transactions per second (maybe the most useful number)
  • Creation: files created per second (not mixed with transactions)
  • Reads: reads per second
  • Appends: appends per second
  • Deletes: pure file deletions per second (not mixed with transactions)
  • Read and write MB/s: the effective MB/s

Some notes:

  • Windows was just vanilla XP with all service packs, tuned as a server (box was too weak to take Win7)
  • Supercache was using a portion of memory for the Supercache filter driver
  • Uptempo is a similar filter driver that provides caching (see previous posts here and here)
  • Ubuntu was the latest available beta of 10.04
  • Whenever you see “deadline” the deadline scheduler was used instead of CFQ
  • PCLOS is the latest available beta of PCLOS 2010.
  • I mounted ext4 with barrier=0,noatime
  • I mounted XFS with nobarrier,noatime,nodiratime,logbufs=8,logbsize=256k and made it with mkfs.xfs -f -d agcount=4 -l lazy-count=1 -l size=128m
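If you want to check which I/O scheduler a box is actually using before running something like this: on 2.6 kernels the scheduler for a disk can be read from sysfs (the name in square brackets is the active one), and root can switch it by writing a scheduler name into the same file. A minimal sketch of the read side follows. The device name `sda` is an assumption, and the helper names are mine.

```python
from pathlib import Path

def active_scheduler(sysfs_text):
    """Return the scheduler shown in [brackets], i.e. the one in use."""
    for name in sysfs_text.split():
        if name.startswith("[") and name.endswith("]"):
            return name[1:-1]
    return None

def read_scheduler(device="sda"):
    # Assumes a Linux box exposing /sys/block/<device>/queue/scheduler.
    path = Path(f"/sys/block/{device}/queue/scheduler")
    return active_scheduler(path.read_text())

# Parsing a typical line, without touching the real system:
print(active_scheduler("noop anticipatory deadline [cfq]"))  # cfq
```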

Some pretty pictures for the ADD among us:

image

image

image

image

Some observations:

  • Some operations that are metadata-heavy like deletion, can get heavily cached in Linux, so that skews the MB/s and total time results when 10,000 files can be deleted in an instant. I wanted to show the numbers because, depending on what you do, that may or may not be important.
  • Intelligent caching still helps tremendously, note the results for Uptempo on Windows and the crazy file creation times on Linux (also skews the MB/s but is a useful number to know if you’re creating a lot of files constantly)
  • Depending on what you’re doing, Windows can be Just Fine…
  • The BFS seems to have been an excellent scheduler choice for PCLOS and something other Linux vendors should start looking at seriously – unless there are serious other I/O tweaks in the kernel that I don’t know about, the difference in performance is staggering between Ubuntu and PCLOS (Phoronix figured that out as well here – they have tons of additional benchmarks showing things other than I/O).
  • And, last but by no means least – the deadline scheduler is ahead of CFQ again, both for Ubuntu and PCLOS. Not by a huge margin but it’s a safe choice, especially if you’re running Linux on enterprise storage that has its own decent I/O scheduler built-in. There have been cases of CFQ dramatically lowering performance with external arrays. Remember – most of the guys writing code for Linux don’t have access to enterprise storage… using the defaults could be harming your Linux performance!
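For those who like to see the margin as a number, this quick sketch works it out from the total run times in the results table (the plain Ubuntu run used the default CFQ). Nothing here is new data, it is just arithmetic on the figures already posted.

```python
# Total run times copied from the results table (lower is better).
run_time = {
    "Ubuntu 10.04b": {"cfq": 154, "deadline": 136},
    "PCLOS ext4": {"cfq": 105, "deadline": 97},
}

def pct_faster(baseline, candidate):
    """Percentage reduction in run time relative to the baseline."""
    return (baseline - candidate) / baseline * 100

for distro, t in run_time.items():
    gain = pct_faster(t["cfq"], t["deadline"])
    print(f"{distro}: deadline finishes {gain:.1f}% sooner than CFQ")
```

That comes out to roughly 12% for Ubuntu and under 8% for PCLOS, which matches the "not by a huge margin" remark.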

D

Tue
9
Mar '10

More tales from the field: Sizing best practices – does Compellent follow them?

Note: I edited this a bit to remove some confusing pieces of info.

Another one came in. I’ll keep calling the offenders out until the craziness stops. Fellow engineers - remember that, regardless of where we work, our mission should be to help the customer out first and foremost. Then make a sale, if possible/applicable. I implore you to get your priorities straight. If it looks like you’re losing the fight, figure out what your true value is. If you have no true value, you always have the option of bombing the price. But please, don’t sell someone an under-configured system…

This time, it’s Compellent not seeming to follow basic sizing rules in a specific campaign (I’m not implying this is how all Compellent deals go down). The executive summary: In a deal I’m involved in, they seem to be proposing far fewer disks than are necessary for a specific workload, just so they are perceived as being lower in price. This is their second strike as far as I’m concerned (the first case I witnessed was Exchange sizing where they were proposing a single shelf for a workload that needed several times the number of drives). Third strike gets you a personal visit. You will never repeat the offense after that, but it gets tiring. Education is better.

And before someone jumps on me and tells me that I don’t know how to properly size for Compellent (which I freely admit) I’ll ask you to consider the following:

There is no magic.

This is not a big NetApp FAS+PAM vs multi-engine Symmetrix V-Max discussion, where the gigantic caches will play a huge role. No – this specific case is a fight between 2 very small systems, both with very limited cache and regular ol’ 15K SAS drives. They’re not quoting SSD that could alleviate the read IOPS issue, and we’re not quoting PAM.

Ergo, this is about to get spindle-bound…

And for all the seasoned pros out there: I know you may know all this, it’s not for you, so don’t complain that it’s too basic. This post is for people new to performance sizing (and maybe some engineers :) )

Some preliminaries:

This is a Windows-only environment. So, the customer sent perfmon data for their servers over for me to analyze and recommend a box.

They’ll be running Exchange plus some databases.

From my days of doing EMC I learned some very important sizing lessons (thanks guys) that I will try to summarize here.

For instance - there is peak performance, average, and what we called “steady-state”.

In any application, there will be some very high I/O spikes from time to time. Those spikes are normal and are usually absorbed by host and array caches. This is the “peak performance”.

The trick is to figure out how long the spikes last for, and see if the caches would be able to accommodate them. If a spike is lasting for 30 min it’s not a spike any more, but rather a real workload you need to accommodate.

If the spikes are in the range of seconds, then cache is usually enough. Depends on the magnitude of the spike, the length of the spike and the size of the cache :)
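A back-of-the-envelope way to reason about this, as a sketch only: treat a write burst as data arriving faster than the disks can drain it, and see whether the backlog fits in cache. All the numbers below are hypothetical, and a real array's cache behaves in far more complicated ways (read spikes, for one, depend on hit rates rather than on backlog).

```python
def spike_fits_in_cache(spike_iops, drain_iops, io_kb, seconds, cache_mb):
    """True if the backlog built up during a write burst stays under the cache size."""
    excess_iops = max(0, spike_iops - drain_iops)
    backlog_mb = excess_iops * io_kb * seconds / 1024
    return backlog_mb <= cache_mb

# Hypothetical: a 30,000 IOPS burst of 8KB writes, disks that can drain
# 3,000 IOPS, and 1GB of write cache.
print(spike_fits_in_cache(30000, 3000, 8, 2, 1024))     # 2-second burst: True
print(spike_fits_in_cache(30000, 3000, 8, 1800, 1024))  # 30-minute "spike": False
```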

Then, you have your average performance. That just takes a straight math average across all performance points - so, for instance, if you have, at night, very long periods of inactivity, they will affect the average dramatically. Short-lived spike data points won’t affect it as much since there are so few of them. So the average typically gets skewed towards the low end.

Then there’s the concept of “steady state”.

This effectively tries to get a more meaningful average of steady-state performance during normal working periods. Easy to eyeball actually if you’re looking at the IOPS graphs instead of letting Excel do its averaging for you.

A picture will make things clearer:

image

In this simplified example chart, the vertical axis represents the IOPS and the horizontal is the individual samples over time. You can see there are very quiet periods, a brief spike, then sustained periods of activity. Without needing a degree in Statistics, one can see that the IOPS needed are about 500 in this chart. However, if you just take the average, that’s only 260, or about half! Obviously, not a small difference. But, again obviously, some extra care is required in order to figure out the real requirements instead of just calculating averages!
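If you’d rather not eyeball it, here is one possible way to approximate the same thing in code: throw away the idle samples and take the median of what’s left, so a brief spike doesn’t drag the number around. This is a sketch, not the one true method. The samples are made up to resemble the chart, and the idle threshold of 100 IOPS is an assumption you would tune per environment.

```python
from statistics import mean, median

def steady_state_iops(samples, idle_threshold):
    """Median of the samples taken while the system was actually busy."""
    active = [s for s in samples if s > idle_threshold]
    if not active:
        raise ValueError("no samples above the idle threshold")
    return median(active)

# Made-up samples: a long quiet period, one brief spike, then sustained work.
samples = [20] * 52 + [1500] + [500] * 47

plain_average = mean(samples)
steady = steady_state_iops(samples, idle_threshold=100)
print(f"plain average: {plain_average:.0f} IOPS, steady state: {steady:.0f} IOPS")
```

With this invented data set the plain average lands around 260 while the busy-period median is 500, the same kind of gap the chart shows.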

So, to summarize: it’s usually not correct to size for maximum or average since they’re both misleading (unless you’re sizing for a minimum-latency DB application – then you often size for maximums to accommodate any and all performance requirements). This is the same for every array vendor. The array and host cache accommodate some of the maximum spikes anyway, but the true average steady-state is what you’re trying to accommodate.

So, now that you know the steady-state true average the customer is seeing, the next step in estimating performance is to look at the current disk queues and service times.

I won’t go into disk queuing theory but, simply speaking, if you have a lot of outstanding I/O requests, they end up getting queued up, and the disk tries to service them ASAP but it just can’t quite catch up. You typically want to see low numbers for the queue (as in the very low single digits).

Then, there’s the response time. If the current response times are overly long (anything over 20ms for most DB/email work), then you have a problem…
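A trivial filter along these lines can flag the perfmon samples where the existing hardware is probably the bottleneck. The 20ms limit is the one mentioned above; the queue limit of 3 is my assumption for "very low single digits", and the sample values are invented. I’m also assuming the inputs are the usual PhysicalDisk counters (Avg. Disk Queue Length, and Avg. Disk sec/Transfer converted to milliseconds).

```python
def looks_constrained(avg_queue, avg_response_ms, max_queue=3, max_response_ms=20):
    """True if a sample suggests the current storage is holding the workload back."""
    return avg_queue > max_queue or avg_response_ms > max_response_ms

# Hypothetical (queue length, response time in ms) samples.
samples = [(1.2, 8), (2.0, 15), (9.5, 42)]
flagged = [s for s in samples if looks_constrained(*s)]
print(flagged)  # [(9.5, 42)]
```

If a large share of the busy-period samples get flagged, the observed IOPS are a floor, not the real requirement.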

What this means is that the observed steady-state workload is often constrained by the current hardware. By examining performance reports, all you are seeing is what the current system is doing.

So, the trick is to find out what performance the customer actually NEEDS, at a reasonably low ms response time with low queuing. The perfmon data is just to ensure you don’t make the performance even WORSE than they’re currently seeing! Finding out the true requirements is really the difficult part.

Finally, once you figure out the final, desired steady-state IOPS requirements, you need to translate them into your specific system, since there’s cache helping, but always some overhead to be considered. For instance, in a system that relies on RAID10/RAID5, you need to adjust for the read/write penalties of RAID. That increases the IOPS needed by nature. Again, this is the same for all array vendors – the only time there’s no I/O penalty, is if you’re doing RAID0 (= no protection).

You see, RAID5 for instance, in order to perform writes, has to do some reads as well, to calculate and write the parity. All very normal for the algorithm. Depending on the read/write mix, this extra I/O can be significant, and absolutely needs to be considered when sizing storage! RAID10 doesn’t need to read in order to write, but has to write 2 of everything, so that needs to be considered as well.
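As a rough sketch of that adjustment: the multipliers below are the usual textbook rules of thumb for small random writes (RAID5 turning one write into four back-end I/Os, RAID10 into two), not any vendor’s published figures, and they ignore cache and full-stripe writes. The 1,000 IOPS at 70% reads is a hypothetical workload.

```python
# Rule-of-thumb back-end I/Os generated per front-end write (small random writes).
WRITE_PENALTY = {"RAID0": 1, "RAID10": 2, "RAID5": 4}

def backend_iops(frontend_iops, read_pct, raid_level):
    """IOPS the disks must deliver once the RAID write penalty is included."""
    reads = frontend_iops * read_pct / 100
    writes = frontend_iops - reads
    return reads + writes * WRITE_PENALTY[raid_level]

# Hypothetical workload: 1,000 host IOPS, 70% reads.
for level in WRITE_PENALTY:
    print(level, backend_iops(1000, 70, level))
```

The same 1,000 host IOPS become 1,300 at the disks with RAID10 and 1,900 with RAID5, which is why the read/write mix matters so much.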

You also need to figure out read vs write percentage, I/O block size distributions, random vs sequential… not rocket science, but definitely extra work in order to do right.

The last thing that needs to be taken into account is the working set. Basically, it means this:

Imagine you have a 10TB database, but you’re really only accessing about 100GB of it repeatedly and consistently. Your working set is that 100GB, not the entire 10TB DB. Which is why the more advanced arrays have ways of prioritizing/partitioning cache allocations, since you typically don’t want a big 50TB file share with 10,000 users causing cache starvation for your 10TB DB with the 100GB working set. You need to retain as much of the cache as possible for the DB, since the 50TB file share is too large and unpredictable a working set to fit in cache.

Unless you understand the true working set, you will have no idea how much cache will be able to truly help that particular workload.

Going back to the reason I wrote this post in the first place:

In this specific, small environment, the non-RAID steady-state percentile IOPS required were close to 3,000, with a working set and I/O pattern that wouldn’t fit in the cache of the small systems. Once adjusted for RAID5, the specific I/O mix demanded 50% more IOPS from the disk. The spikes were fairly high, in excess of 10x the steady-state.

Back to basics: A 15K RPM disk can provide about 220 IOPS with reasonable (<20ms) latency, so about 14 disks are needed to accommodate the pre-RAID performance with under 20ms latency. Remember – that doesn’t include spares or RAID overheads, and will not even accommodate I/O spikes. Calculating with the RAID overhead, about 21 drives are needed, at a minimum. Add a spare or two, and you’re up to 22-23 drives, bare minimum, to satisfy steady-state performance without cache starvation in this specific workload.
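The arithmetic above, spelled out so anyone can plug in their own numbers. The inputs are the ones quoted in this post (3,000 IOPS, a 1.5x RAID5 adjustment for this I/O mix, 220 IOPS per 15K drive); the function name is just for illustration.

```python
import math

def drives_needed(frontend_iops, raid_multiplier, iops_per_drive, spares):
    """Minimum spindle count for steady-state IOPS, ignoring cache and spikes."""
    backend_iops = frontend_iops * raid_multiplier
    return math.ceil(backend_iops / iops_per_drive) + spares

print(drives_needed(3000, 1.0, 220, 0))  # 14: pre-RAID
print(drives_needed(3000, 1.5, 220, 0))  # 21: with the RAID5 overhead
print(drives_needed(3000, 1.5, 220, 2))  # 23: plus two spares
```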

And, finally, the offense in question:

Compellent said that with their combo RAID1-RAID5 they only needed a single 12-drive SAS enclosure for the entire workload. Take spares out, and, best case, you’re talking about 11 drives doing I/O. Apparently, the writes happen in RAID1, and the reads as RAID5. I’m not the expert, I’m sure someone will chime in. Maybe my math is a bit off since Compellent has the funky RAID1/RAID5 mix, but there are still I/O penalties…

Based on the above analysis, this somehow doesn’t compute with 11 drives, half what my calculations indicate… so, my final question is:

How do Compellent engineers size for performance?

D

Wed
3
Mar '10

EMC’s incredible marketing and the FAST fairy tale (and a bit on how to reduce tiers)

I’m in MN prepping to teach a course (my signature anti-FUD extravaganza), and thought I’d get a few things off my chest that I’ve been meaning to write about for a while. Some Stravinsky to provide the vibes and I’m good to go. It’s getting really late BTW and I’m sure this will progressively get less coherent as time goes by, but I like to write my posts in one shot…

I never cease to be amazed by what’s possible with the power of great marketing/propaganda. And EMC is a company that has some of the best marketing anywhere. Other companies should take note!

Think about it: Especially on the CX, they took an auto-tiering implementation as baked as wheat that hasn’t been planted yet, and managed to create so much noise and excitement around it that many people think EMC actually invented the concept and, heavens, some even believe that the existing implementation is actually decent. Worse still, some have actually purchased it. Kudos to EMC. With the exception of some of Microsoft’s work, nobody reputable has the stones any more to release, amidst such fanfare, a product this unpolished. Talk about selling futures…

Perception is reality.

I’m an engineer by training and by trade first and foremost, and, regardless of bias, I consider the existing FAST implementation an affront. Allow me to explain, gentle reader…

The tiering concept

Some background info is in order. Most arrays of any decent size and complexity sold nowadays are configured with different kinds of disk, purely out of cost considerations. For instance, there may be 30 really fast drives where a bunch of important low-latency DBs live, another 100 pretty fast drives where most VMs and Exchange live, then 200 SATA drives for bulk storage and backups.

Don’t kid yourself: If the customer buying the aforementioned array had enough dough, they’d be getting the wunderbox with all super-fast drives inside – all the exact same kind of drives. That’s just simpler to deal with from a management standpoint and obviously the performance is stellar. Remember this point since we’ll get back to it…

Of course, not everyone is made of money, so arrays that look like the 3-tier example above are extremely common. Just enough drives of each type are purchased in order to achieve the end result.

What typically ends up happening is that, over time, some pieces of data end up in the wrong tier, for one reason or another. Maybe a DB that was super-important once now only needs to be accessed once a year; or a DB that was on SATA now has become the most frequently-accessed piece of data in the array. Or, perhaps, the importance of a DB flip-flops during a month, so it only needs to be fast maybe for month-end-processing. So now, you need to move stuff around so that what needs to be fast is shifted to the fast drives.

Pressure points and the need for passing the hot potato

But wait, there’s more…

The entire performance problem is created in the first place due to most array architectures being older than mud. In legacy array architectures, LUNs are carved out of RAID groups, typically made of relatively few disks. So, in an EMC Clariion, it’s best practice to have a 5-disk RAID5 group. You then ideally split up that group into no more than 2 LUNs and assign one to each controller.

With disks getting bigger and bigger, creating 1-2 LUNs can become exceedingly difficult – a 5-disk R5 group made with 450GB drives in a Clariion offers a bit over 1.5TB of space, which is too much for many application needs – maybe you just need 50GB here, another 300GB there… in the end, you may have 10 LUNs in that RAID group that’s supposed to have no more than 2. The new 600GB FC drives make this even worse.

So, in summary, what ends up happening is that you split up that RAID group into too many LUNs in order to avoid waste. And that’s where your array develops a serious pressure problem.

You see, now you may have 10 different servers hitting the exact same RAID group, creating undue pressure on the 5 poor disks struggling to cope with the crazy load. I/O service times get too high, queue lengths get crazy, users get cranky.

Again – this whole problem exists exactly because legacy array architectures don’t automatically balance I/O among all drives.

But for those afflicted Paleolithic systems, wouldn’t it be nice if we could move some of those hot LUNs, non-disruptively, to other RAID groups that don’t suffer from high pressure?

That’s what EMC’s FAST for the Symmetrix and CX does. It attempts to move entire LUNs to faster tiers like SSD. Which, BTW, is something you can do manually, but FAST attempts to automate the task (kinda, depends, etc).

The current FAST pitfalls

Let’s first examine how FAST (Fully Automated Storage Tiering) is implemented, since it’s really 3 utterly different solutions, depending on whether you have Symm, CX or NS:

On the Symmetrix it’s always been there in the form of Symmetrix Optimizer, which may not have been aware of tiers but it definitely knew about migrating to less busy disks. Now you can teach it about tiers, too. But it’s not, in my mind, a new product, even if EMC would like you to believe it is. It looks to me too much like Optimizer + some new heuristics. But the Gods of Marketing managed to create unbelievable commotion about something that was an old feature. What amazes me is that nobody seems to have made the connection – maybe I’m really missing something. I’m sure someone from EMC will correct me if I’m wrong. In my experience, Optimizer, when purchased, often did more harm than good, was difficult to manage and, ultimately, was left inactive in many shops – with the beancounters lamenting the spending of precious funds on something that never quite worked that well. Oh, and it seems the current version doesn’t support thin LUNs. But of the FAST implementations on EMC gear it is the more complete version, exactly because Optimizer has been there for a long time…

On the far more popular CX platform, what happens is like a tribute to kludges everywhere. Consider this:

  1. Movement is one-way only (FC to SATA, or FC to SSD). More of a one-shot tool than continuous optimization!
  2. You need a separate PC that will crunch Navisphere Analyzer performance logs, this takes a while
  3. The PC will then provide a list of recommendations
  4. Depending on which LUNs you approve it will invoke a NaviCLI command to move the specified LUNs in the box
  5. Doesn’t support thin provisioning
  6. Not sure if it supports MetaLUNs
  7. It is NOT automatic since you have to approve the move! Ergo, it should not be sold under the name “FAST” since the “A” stands for “Automated”, aren’t there laws for false advertising?

On the Celerra NS platform (EMC’s NAS), one needs to purchase the Rainfinity FMA boxes, which then can move files between tiers of disk based on frequency of access. One is then limited by the scalability of the FMA – how many files can it track? How dynamically can it react to changing workloads? What if the FMA breaks? Why do I need yet more boxes to do this?

Ah, but it gets better with FASTv2! Or does it?

EMC has been upfront that FAST will become way cooler with v2. It better be, since as you can see it’s no great shakes at the moment. From what the various EMC bloggers have been posting, it seems FASTv2 will use the thin provisioning subsystem to go to a sub-LUN level of granularity.

The granularity will obviously depend on how many disks you have in the virtual provisioning pool, since a LUN (just like with MetaLUNs) will be split up so that it occupies all the disks in the pool. The bigger the pool, the better. This should provide better performance (it does with other vendors) yet EMC in their docs state the current version of virtual provisioning (at least on the CX) has higher overhead when compared to their traditional LUNs and will provide less performance. I guess that’s a subject for another day, and maybe they’ll finally revamp the architecture to fix it. Back to FASTv2:

The “busyness” of each LUN segment will be analyzed, and that segment will then move, if applicable, to another tier. Of course, how efficient that will end up being will depend on how you do I/O to the LUN in the first place! If the LUN I/O is fairly spatially uniform, then the whole thing will have to move just like FASTv1. But I guess with v2 there’s at least the potential of sub-LUN migration, for cases where a clearly delineated part of the LUN is really “hot” or “cold”. Obviously, since the chunk size will still be significantly large, expect a bunch of non-applicable data to move with the stuff that should be moved.
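To illustrate why the I/O pattern matters so much here, this is a toy model of sub-LUN "heat" tracking. It is emphatically not EMC’s (or anyone’s) actual algorithm: the 1GB chunk size, the 80% cutoff and both I/O traces are invented for the example. It simply counts I/Os per chunk and picks the fewest chunks that account for most of the activity.

```python
from collections import Counter

def hot_chunks(io_offsets_mb, chunk_mb, hot_pct=80):
    """Smallest set of chunks that together receive hot_pct percent of all I/O."""
    hits = Counter(offset // chunk_mb for offset in io_offsets_mb)
    total = sum(hits.values())
    chosen, covered = [], 0
    for chunk, count in hits.most_common():
        if covered * 100 >= hot_pct * total:
            break
        chosen.append(chunk)
        covered += count
    return sorted(chosen)

# A 10GB LUN split into hypothetical 1GB chunks.
skewed = [100] * 80 + [i * 1024 for i in range(10)] * 2   # one clearly hot region
uniform = [i * 1024 for i in range(10)] * 10              # I/O spread evenly

print(hot_chunks(skewed, 1024))        # [0]
print(len(hot_chunks(uniform, 1024)))  # 8
```

With the skewed trace only one chunk needs to move; with the uniform one, 8 of the 10 chunks qualify, which is as good as moving the whole LUN.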

The real problem

First, to give credit where it’s due: Compellent already has had sub-LUN moves for a long, long time. Give those guys props. They actually deserve it.

However – both the Compellent approach as well as FASTv2 and, even worse, v1, suffer from this fundamental issue:

Lack of real-time acceleration.

Think about it – performance has to be analyzed periodically, heuristics followed, then LUNs or pieces of LUNs have to be moved around. This is not something that can respond instantly to performance demands.

Consider this scenario:

You have a payroll DB that, during most of the month, does absolutely nothing. A fully automated tiering system will say “hey, nobody has touched this LUN in weeks, I better move it to SATA!”

Then crunch time comes, and the DB is on the SATA drives. Oopsie.

People complain, and the storage admin is forced to manually migrate it back to SSD.

Kinda defeats the whole purpose… unless I’m missing something the size of the Titanic.

So, you may have to write all kinds of exception rules (provided the system lets you). Some rules for most DBs, Exchange, a few apps here and there…

Soon, you’re actually in a worse state than where you began: You have the added complexity and cost of FAST, plus you have to worry about creating exception rules.
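The payroll scenario boils down to something like the toy policy below. It is a deliberately naive idle-time rule with an exception list, not how any shipping product decides; the 14-day cutoff and the LUN name are made up.

```python
def place(lun, days_idle, demote_after_days=14, pinned=()):
    """Tier chosen by a naive idle-time policy; pinned LUNs are the exception rules."""
    if lun in pinned:
        return "fast"
    return "sata" if days_idle >= demote_after_days else "fast"

# Payroll sits idle for most of the month, so the policy demotes it...
print(place("payroll_db", days_idle=25))                         # sata
# ...unless somebody remembered to write an exception for it.
print(place("payroll_db", days_idle=25, pinned={"payroll_db"}))  # fast
```

Every LUN that ends up in the pinned set is one more thing a human has to know about and maintain.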

Now here’s a novel idea…

What if you actually put your data in the right tier to begin with and what if, even if you didn’t, it didn’t matter too much?

For instance – normal fileshares, deep archives, large media files, backups to disk – most people would agree that those workloads should probably forever be on SATA if you’re trying to save some money. With 2TB drives, the SATA tier has become super-dense, which can be very useful for quite a few use cases.

DBs, VM OS files – should usually be on faster disk. But no need to go nuts with several tiers of fast disk, a single fast tier should be sufficient!

LUNs and other array objects should try to automatically span as many drives as possible by default without you having to tell the array to do that… that way you avoid the hot spots in the first place by design, thereby reducing or even removing the need for migrations (I can still see some very limited cases where migration would be useful).
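A simplified sketch of why spanning helps: stripe a LUN round-robin across the drives and look at what share of the I/O the busiest drive sees. Parity, caching and real layout schemes are ignored, and the stripe size and disk counts are assumptions for illustration.

```python
from collections import Counter

def disk_for_block(block, disks, stripe_blocks=16):
    """Round-robin striping: which disk holds a given logical block."""
    return (block // stripe_blocks) % disks

def busiest_disk_share(blocks, disks):
    """Fraction of all I/O landing on the most heavily hit disk."""
    hits = Counter(disk_for_block(b, disks) for b in blocks)
    return max(hits.values()) / sum(hits.values())

io = range(16000)  # one I/O to each block of a hypothetical hot LUN
print(busiest_disk_share(io, 5))   # 0.2  (5-disk RAID group)
print(busiest_disk_share(io, 50))  # 0.02 (same LUN across 50 drives)
```

Same workload, but each spindle does a tenth of the work, so the hot spot never forms in the first place.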

And finally, large, intelligent cache (as in really large) to help with real-time workload demands, dynamically and as-needed, by caching tiny 4K chunks and not wasting space on gigantic pieces… with the ability to prioritize the caching if needed. Not to mention being deduplication-aware.

Wouldn’t that be a bit simpler to manage, more nimble and more useful in real-world scenarios? The cache will help out even the slower drives for both file and OLTP-type workloads.

Maybe life doesn’t need to be complicated after all.

It’s almost 0300 so I’d better go to bed…

D

 

 

 

 

 

Mon
22
Feb '10

Protecting your existing legacy storage investment with virtualization – do’s and don’ts

It’s an undeniable fact that many customers, while they would love to use the highly advanced features of modern disk arrays, have already made a big investment in legacy storage. Sure, it doesn’t have all the great features, but it’s already there, frequently there’s a lot of it, and the maintenance isn’t expiring for another year or two so it’s not economically feasible to get rid of it.

Another issue most enterprises face is data migration – whether that’s to move from old to new on the same vendor, or from vendor to vendor. No matter how you cut it, you’ll have to do it someday.

A third issue is performance on the existing gear – maybe you have a ton of legacy storage but it’s just not performing the way you’d expect.

The final issue is managing disparate arrays. Nobody really wants to do that.

There are storage virtualization products that, conceptually, try to solve some of those issues in a similar way to how VMware, Hyper-V and Xen address similar issues with servers.

The idea is that you virtualize your existing storage behind gear that will give it some extra capabilities, centralized management and thereby extend its service life and maybe even eke out some more performance out of it. Your existing hosts will typically address the storage via the virtualizing device, so obviously some assembly is required (rezoning etc).

The devices I’m aware of fall into 3 basic categories:

  1. Devices that encapsulate existing LUNs and don’t need other equipment or much reconfiguration besides dropping them in, zoning and presenting the LUNs to the hosts through them. Examples are: FalconStor NSS, IBM SVC, HDS USP-V, HP SVSP.
  2. Devices that don’t need other equipment, offer some compelling extra features but cannot encapsulate LUNs and therefore need an initial migration besides the zoning. Example: NetApp V-Series.
  3. Devices that need extensive fabric upgrades besides reconfiguration. Example: EMC Invista (I’m not sure if it needs LUN migrations, I don’t think so but I’m sure someone from EMC will chime in).

There are other differences in the devices listed above, so I created a table and highlighted the areas where there’s either the odd man out or there’s some feature not available with the others. I’m aware that the table is nowhere near complete, but as it is I doubt it will fit onto a web page nicely. If there are inaccuracies, let me know and I’ll fix it. I admit I know little about HP’s SVSP. (re-posted with some SVC edits).

 

| Vendor | Thin Provisioning | Thin Clones | Snapshots | Also an Array | In-Band | Deduplication | Replication | Needs Migration | NAS | Needs Fabric Upgrade | FCoE | Perf Acceleration | Can Do Live FC Migrations | Needs Some Space on Array |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| EMC | N | N | N | N | N | N | N (needs RecoverPoint) | ? (prob N) | N | Y | N | N | Y | N |
| HP | Y (? perf impact) | ? | Y (? perf impact) | N | split-path | N | Y | ? | N | N | N | N | Y | N |
| FalconStor | Y (? perf impact) | Y (perf impact) | Y (perf impact) | N | Y | N | Y | Y | N | N | N | Y (SSD cache) | Y | N |
| HDS | Y (perf impact) | ? | Y (perf impact) | Y | Y | N | Y | N | N | N | N | Y (huge cache, RAM) | Y | N |
| IBM | Y (no perf impact) | Y (perf impact) | Y (perf impact) | Y (limited 4x SSD per node) | Y | N | Y | N | N | N | N | Y (192GB large cache with 8 nodes) | Y | N |
| NetApp | Y (no perf impact) | Y (no perf impact) | Y (no perf impact) | Y | Y | Y | Y | Y | Y | N | Y (also 10GbE) | Y (gigantic cache, multi-TB) | N (iSCSI, NFS, CIFS only at present) | Y |

 

The design decisions are interesting.

Of the above, IBM and FalconStor take the “pure appliance” approach, using Linux servers with custom code – that’s what those boxes were designed to do from the get-go. The idea is that you either have a bunch of old arrays or you buy a bunch of new, cheap and not very capable arrays, then front them with SVC or NSS, thereby making them decent.

Since IBM and FalconStor were always designed to perform this function, they are also, in my opinion, the best-suited for tasks like migrations. Indeed, I believe one can do a “hit and run” with said boxes, i.e. do the migration then remove the boxes from the fabric, making them popular with certain PS organizations.

On the other hand, HDS and NetApp instead offer the virtualization functionality as an additional feature to their arrays – as in, “you’ll probably buy our disk but we can enhance your legacy box, too”.

EMC took a completely different approach and uses out-of-band control servers and intelligent fabric switches to perform the virtualization trickery.

It’s important to note that NetApp lacks the live migration feature, but instead offers deduplication, application-aware snaps, great replication and NAS, and is arguably the most feature-rich platform (I’m trying to not be biased as I’m writing this). The biggest caveat (a deal-breaker for some) is that it can’t encapsulate your existing LUNs – instead, you need to chop up your RAID groups into LUNs, then present them to the NetApp system, which will then need to reformat said LUNs. This process also takes away some space for extra checksum calculations and other overheads. Arguably, you can make this up (and then some) in the end after using the features on tap (sorry). But you still need to factor in the time to migrate your stuff over gradually.

I believe EMC offers the fewest features and the most complex implementation – you can do stuff like mirror your LUNs from box to box and do migrations, but your arrays don’t really gain any new features. I have yet to meet a customer that owns this solution. I know there are a few big ones that went that way; it’s just not very common.

Of the devices mentioned above, the SVC is probably the most commonly used, then the USP-V (IBM and HDS always argue on that point since the capability to virtualize comes with HDS boxes whereas virtualization is the only thing the SVC does), then come FalconStor and NetApp, then HP with the relative newcomer SVSP, and last EMC (Invista hasn’t been a particularly successful product for EMC).

Storage Virtualization do’s and don’ts

I’d say that you should only really consider buying a virtualization product if you have well over 10TB of older gear (over 50TB, IMHO) that is not TOO old (i.e. not older than 3-4 years). Quite frequently, if your gear is really old, refreshing it with new just ends up being cheaper. Of course, there’s always eBay.

I’d also recommend not buying new low-end arrays and using virtualization to make them “better”. You are introducing more complexity into the environment, and it won’t necessarily be cheaper, either (something like the SVC has licenses priced per TB). Just buy a decent modern array that has all the features you need and be done with it.

Furthermore – don’t get into virtualization just to migrate from your older to your newer arrays. There are other ways.

You should use common sense (imagine that). Just as you’re not supposed to mix drive types within RAID groups even if you can, you typically don’t want to have an application straddling 5 different arrays, all vastly different in capability, just because you can.

It’s tempting to say “I’ll create a LUN that’s striped among every single disk on 5 different arrays”. Not to say that this should never be done (I’ve RAID-0’d across Symmetrix to get enough performance, long story), but only do it if you know what you’re doing and the exact layout that you’ll end up with. Nothing spells misery like RAID-0 across many LUNs in an existing RAID group… :)
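To see why the exact layout matters, here is a minimal sketch of how a striped (RAID-0) volume maps logical blocks onto its member LUNs, and how much of one I/O range ends up on each back-end RAID group. The LUN names, the RAID group mapping and the 128-block stripe size are all made up for illustration; no particular virtualization product works exactly this way.

```python
from collections import Counter


def locate(lba: int, stripe_blocks: int, luns: list) -> tuple:
    """Map a logical block on the striped volume to (LUN, block offset in that LUN)."""
    stripe_index, within = divmod(lba, stripe_blocks)
    row, column = divmod(stripe_index, len(luns))
    return luns[column], row * stripe_blocks + within


def raid_groups_hit(first_lba: int, blocks: int, stripe_blocks: int,
                    luns: list, raid_group_of: dict) -> Counter:
    """Count how many blocks of one I/O range land on each back-end RAID group."""
    hits = Counter()
    for lba in range(first_lba, first_lba + blocks):
        lun, _ = locate(lba, stripe_blocks, luns)
        hits[raid_group_of[lun]] += 1
    return hits


# Hypothetical layout: four LUNs in the stripe, three carved from one RAID group.
LUNS = ["a_lun0", "a_lun1", "a_lun2", "b_lun0"]
RAID_GROUP_OF = {"a_lun0": "arrayA/rg1", "a_lun1": "arrayA/rg1",
                 "a_lun2": "arrayA/rg1", "b_lun0": "arrayB/rg1"}

if __name__ == "__main__":
    print(raid_groups_hit(0, 512, 128, LUNS, RAID_GROUP_OF))
```

With this made-up layout, a 512-block sequential read that looks evenly spread over four LUNs puts three quarters of its blocks on the same spindles in arrayA/rg1, so the stripe buys far less parallelism than the LUN count suggests.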

Finally – figure out what features are the most important to you. If you want dedupe, NAS and tight app integration, NetApp is the ticket. If you prefer ease of migration, you may want to look at the other solutions.

The guarantees

In order to entice customers to try their stuff, HDS and NetApp have some space savings guarantees in place regarding virtualization. HDS has a flat 50% guarantee (predicated upon converting from RAID1 to RAID5 + thin provisioning) or a 20% guarantee (just thin provisioning).

NetApp has the ZIP program. It’s a bit different – there’s no hard number in the savings. Rather, the customer’s data is analyzed and the customer is presented with the savings percentage NetApp guarantees to achieve in their case. If the customer agrees and NetApp achieves the guaranteed savings, then the gear gets purchased. If the savings are not reached, then the customer gets to keep the gear free of charge (that’s right).
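A rough back-of-the-envelope sketch shows why a RAID1-to-RAID5 conversion plus thin provisioning can plausibly clear a 50% bar. Every number here is my own assumption (100TB provisioned, a 7+1 RAID5 group, hosts having written 80% of what they were given); none of it comes from the actual guarantee terms.

```python
def raw_tb(provisioned_tb: float, raid_efficiency: float,
           written_fraction: float = 1.0) -> float:
    """Raw disk needed to back the provisioned capacity.

    raid_efficiency: usable/raw ratio (0.5 for RAID1, 7/8 for a 7+1 RAID5 group).
    written_fraction: share of provisioned space actually written; it only
    matters with thin provisioning (thick provisioning backs 100%).
    """
    return provisioned_tb * written_fraction / raid_efficiency


def savings(before_tb: float, after_tb: float) -> float:
    """Fraction of raw capacity saved going from `before_tb` to `after_tb`."""
    return 1 - after_tb / before_tb


PROVISIONED_TB = 100.0        # assumed
WRITTEN = 0.8                 # assumed: 80% of provisioned space is really in use
RAID1, RAID5_7P1 = 0.5, 7 / 8

thick_raid1 = raw_tb(PROVISIONED_TB, RAID1)               # today: thick RAID1
thin_raid1 = raw_tb(PROVISIONED_TB, RAID1, WRITTEN)       # thin provisioning only
thin_raid5 = raw_tb(PROVISIONED_TB, RAID5_7P1, WRITTEN)   # RAID5 + thin provisioning

if __name__ == "__main__":
    print(f"thin only:   {savings(thick_raid1, thin_raid1):.0%}")
    print(f"RAID5 + thin: {savings(thick_raid1, thin_raid5):.0%}")
```

Under these assumptions thin provisioning alone saves about a fifth of the raw capacity, and adding the RAID conversion saves a bit more than half. Change the written fraction or the RAID group width and the picture shifts accordingly.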

Such guarantee programs have been much ridiculed by the vendors that don’t offer them, but I think they show the respective companies believe in their products enough to wrap some kind of guarantee around them.

In conclusion…

Properly deployed, storage virtualization can be effective in increasing the efficiencies of legacy storage footprints lacking in functionality. Just be careful and examine your motives for virtualization before making the move. Sometimes it’s a decidedly false economy.

D

Wed
10
Feb '10

More FUD busting: Deduplication – is variable-block better than fixed-block, and should you care?

Before all the variable-block aficionados go up in arms, I freely admit variable-block deduplication may overall squeeze more dedupe out of your data.

I won’t go into a laborious explanation of variable vs fixed, but, in a nutshell, fixed-block deduplication means that data is split into equal chunks, each chunk is given a signature, the signature is compared against a DB, and chunks already present there are not stored again.

Variable-block basically means the chunk size is variable, with more intelligent algorithms also having a sliding window, so that even if the content in a file is shifted, the commonality will still be discovered.
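For the curious, the two approaches can be sketched in a few lines. This is a toy illustration and not any vendor’s implementation: the 4096-byte chunk size, the CRC32 window checksum, the cut mask and the min/max chunk sizes are all assumptions picked to keep the example short.

```python
import hashlib
import zlib


def fixed_chunks(data: bytes, size: int = 4096) -> list:
    """Fixed-block: cut the data into equal-size chunks (the last may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def variable_chunks(data: bytes, window: int = 16, mask: int = 0x3FF,
                    min_size: int = 256, max_size: int = 8192) -> list:
    """Variable-block: cut wherever a checksum of the last `window` bytes
    matches a bit pattern, so boundaries follow content rather than offsets."""
    chunks, start = [], 0
    for i in range(len(data)):
        length = i - start + 1
        if length < min_size:
            continue
        # Recomputed at every position for clarity; a real product would
        # use a rolling hash that updates in constant time.
        fingerprint = zlib.crc32(data[max(0, i - window + 1):i + 1])
        if (fingerprint & mask) == 0 or length >= max_size:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks


def stored_bytes(chunk_lists) -> int:
    """Bytes kept after dedupe: one copy per unique chunk signature."""
    store = {}  # signature -> chunk; stands in for the signature DB
    for chunks in chunk_lists:
        for chunk in chunks:
            store.setdefault(hashlib.sha256(chunk).hexdigest(), chunk)
    return sum(len(chunk) for chunk in store.values())
```

Feed it a file plus a copy of the same file with one byte inserted at the front: the fixed-block version finds nothing in common, because every chunk boundary has moved by one byte, while the variable-block version resynchronizes after the first chunk or so and stores little more than one copy. That shifted-content case is exactly what the sliding window is for.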

With that out of the way, let’s get to the FUD part of the post.

I recently had a TLA vendor tell my customer: “NetApp deduplication is fixed-block vs our variable-block, therefore far less efficient, therefore you must be sick in the head to consider buying that stuff for primary storage!”

This is a very good example of FUD that is based on accurate facts which, in addition, focuses the customer’s mind on the tech nitty-gritty and away from the big picture (that being “primary storage” in this case).

Using the argument for a pure backup solution is actually valid. But what if the customer is not just shopping for a backup solution? Or, what if, for the same money, they could have it all?

My question is: Why do we use deduplication?

At the most basic level, deduplication will reduce the amount of data stored on a medium, enabling you to buy less of said medium yet still store quite a bit of data.

So, backups were the most obvious place to deploy deduplication. Backup-to-Disk is all the rage, so what if you could store more backups on target disk with less gear? That’s pretty compelling. In that space you have, of course, Data Domain and the Quantum DXi as two of the more usual backup target suspects.

Another reason to deduplicate is to achieve not only more storage efficiency but also shorter backup times, by never sending over the network data that has already been transferred. In that space there’s Avamar, PureDisk, Asigra, Evault and others.

NetApp simply came up with a few more reasons to deduplicate, not mutually exclusive with the other 2 use cases above:

  1. What if you could deduplicate your primary storage – typically the most expensive part of any storage investment – and as a result buy less?
  2. What if deduplication could actually dramatically improve your performance in some cases, while not hindering it in most cases? (the cache is deduplicated as well, more info later).
  3. What if deduplication was not limited to infrequently-accessed data but, instead, could be used for high-performance access?

For the uninitiated, NetApp is the only vendor, to date, that can offer block-level deduplication for all primary storage protocols for production data - block and file, FC, iSCSI, CIFS, NFS.

Which is a pretty big deal, as is anything useful AND exclusive.

What the FUD carefully fails to mention is that:

  1. Deduplication is free to all NetApp customers (whoever didn’t have it before can get it via a firmware upgrade for free)
  2. NetApp customers that use this free technology get primary storage savings that, in my experience, range anywhere from 10% to 95%, despite all the limitations the FUD-slingers keep mentioning
  3. It works amazingly well with virtualization and actually greatly speeds things up, especially for VDI
  4. Things that would defeat NetApp dedupe will also defeat the other vendors’ dedupe (movies, compressed images, large DBs with a lot of block shuffling). There is no magic.

So, if a customer is considering a new primary storage system, like it or not, NetApp is the only game in town with deduplication across all storage protocols.

Which brings us back to whether fixed-block is less efficient than variable-block:

WHO CARES? If, even with whatever limitations it may have, NetApp dedupe can reduce your primary storage footprint by any decent percentage, you’re already ahead! Heck, even 20% savings can mean a lot of money in a large primary storage system!

Not bad for a technology given away with every NetApp system.

D

Mon
8
Feb '10

NetApp disk rebuild impact on performance (or lack thereof)

Due to the craziness in the previous post, I decided to post an actual graph showing a NetApp system’s I/O latency while under load during a disk rebuild. It was a bakeoff vs another large storage vendor (which NetApp won).

The test was done at a large media company with over 70,000 Exchange seats. It was with no more than 84 drives, so we’re not talking about some gigantic lab queen system (I love Farley’s term). The box was set up per best practices, with aggregate size being 28 disks in this case.

(Edited at the request of EMC’s CTO to include the performance tidbit): Over 4K IOPS were hitting each aggregate (much more than the customer needed) and the system had quite a lot of steam left in it.

There were several Exchange clusters hitting the box in parallel.

All of the testing for both vendors was conducted by Microsoft personnel for the customer.  The volume names have been removed from the graph to protect the identity of the customer:

[Graph: NetApp read and write latency over time during the disk pull and rebuild]

Under a load of 8K I/Os at a 53:47 read/write ratio, a single disk was pulled. It’s a pretty realistic failure scenario: a disk breaks while the system is under production-level load. Plenty of writes, too, almost 50%.

OK. The fuzzy line around 6ms is the read latency. At point 1 a disk was pulled and at point 2 the rebuild completed. Read latency increased to 8ms during the rebuild, but dropped back down to 5ms after the rebuild completed. The line straight across the bottom, at less than 1ms response time, is the write latency. Yes, it’s that good.

So - there was a tiny bit of performance degradation for the reads but I wouldn’t say that it “killed” performance as a competitor alleged.

The rebuild time is a tad faster than 30 hours as well (look at the graph :) ) but then again the box used faster, 15K drives (and smaller, 300GB vs 500GB), so before anyone complains, it’s not apples-to-apples compared to the Demartek report.

I just wanted to illustrate a real example from a real test at a real customer using a real application, and show the real effects of drive failures in a properly-implemented RAID-DP system.

The FUD-busting will continue, stay tuned…

D