Recovery Monkey: Musings on backups, storage, tuning and more

Mon 29 Mar '10

Filesystem benchmark extravaganza – Win, Linux, NTFS, EXT4, XFS, BFS scheduler impact and more…

It’s been a while since I checked the status of Linux-land regarding filesystems and CPU and I/O schedulers, so I thought I’d post some results.

A bunch of new distributions are coming out with Linux kernel 2.6.32 as standard (Ubuntu 10.04 LTS is one of them), and one distribution (PCLinuxOS, or PCLOS) will have Con Kolivas’ BFS as its default process scheduler.

Since I’m a scheduler and filesystem aficionado, I was intrigued by the simplicity of the BFS and wanted to see how it might affect I/O performance. I don’t believe that there’s ever a single scheduling algorithm that fits all possible use cases, and, furthermore, I think the default Linux CFS scheduler is getting a bit unwieldy in its complexity (though the goal is to make it scale to very large numbers of cores, not all machines have that).

So I grabbed a spare machine that doesn’t have a ton of horsepower on purpose, and tested using the same benchmark (NetApp’s venerable postmark) on the same part of the disk, with Windows, Ubuntu 10.04LTS beta, and PCLOS 2010 beta2.

Here’s some of the averaged data (did multiple runs):

 

Configuration             Time  IOPS  Creation  Reads  Appends  Deletes  Read MB/s  Write MB/s
win tuned                  208   244       156    121      123      183       2.68       5.658
win supercache             197   224       200    111      113      175       2.78        5.88
win uptempo                172   281       277    139      141      156       3.19        6.73
Ubuntu 10.04b              154   169       454     84       85      727       3.56        7.52
Ubuntu 10.04b deadline     136   173       500     86       87    10184       4.03        8.51
PCLOS ext4 CFQ             105   239      1011    118      120     4478       5.25       11.09
PCLOS ext4 deadline         97   240      1158    119      121     7828       5.73       12.21
PCLOS XFS tuned deadline   187   136       476     68       68      509       2.93        6.19

Some explanation on the fields:

  • Time: Total time to run
  • IOPS: mixed transactions per second (maybe the most useful number)
  • Creation: files created per second (not mixed with transactions)
  • Reads: reads per second
  • Appends: appends per second
  • Deletes: pure file deletions per second (not mixed with transactions)
  • Read and write MB/s: the effective MB/s

Some notes:

  • Windows was just vanilla XP with all service packs, tuned as a server (box was too weak to take Win7)
  • Supercache was using a portion of memory for the Supercache filter driver
  • Uptempo is a similar filter driver that provides caching (see previous posts here and here)
  • Ubuntu was the latest available beta of 10.04
  • Whenever you see “deadline” the deadline scheduler was used instead of CFQ
  • PCLOS is the latest available beta of PCLOS 2010.
  • I mounted ext4 with barrier=0,noatime
  • I mounted XFS with nobarrier,noatime,nodiratime,logbufs=8,logbsize=256k and made it with mkfs.xfs -f -d agcount=4 -l lazy-count=1 -l size=128m
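For anyone who wants to check which I/O scheduler a disk is actually using: on Linux kernels of this vintage it is exposed in /sys/block/<device>/queue/scheduler, with the active one shown in square brackets. Here is a minimal Python sketch that parses that line. The device name sda is an assumption (use your own), and the helper names are mine, purely for illustration.

```python
from pathlib import Path


def active_scheduler(sysfs_text: str) -> str:
    """Return the scheduler shown in [brackets] in a sysfs 'scheduler' line."""
    for token in sysfs_text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    raise ValueError(f"no active scheduler marked in {sysfs_text!r}")


def read_scheduler(device: str = "sda") -> str:
    # Assumption: 'sda' is only a placeholder device name.
    path = Path(f"/sys/block/{device}/queue/scheduler")
    return active_scheduler(path.read_text())


# Parsing a sample line, so the sketch runs even without sysfs access.
print(active_scheduler("noop anticipatory deadline [cfq]"))
```

Switching is done by writing the scheduler name into that same file as root; the change does not survive a reboot unless you also set it at boot time.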

Some pretty pictures for the ADD among us:

[Four charts; images not preserved]

Some observations:

  • Some metadata-heavy operations, like deletion, can get heavily cached in Linux, which skews the MB/s and total time results when 10,000 files can be deleted in an instant. I wanted to show the numbers because, depending on what you do, that may or may not be important.
  • Intelligent caching still helps tremendously: note the results for Uptempo on Windows and the crazy file creation rates on Linux (this also skews the MB/s, but it is a useful number to know if you’re creating a lot of files constantly).
  • Depending on what you’re doing, Windows can be Just Fine…
  • The BFS seems to have been an excellent scheduler choice for PCLOS and something other Linux vendors should start looking at seriously – unless there are serious other I/O tweaks in the kernel that I don’t know about, the difference in performance is staggering between Ubuntu and PCLOS (Phoronix figured that out as well here – they have tons of additional benchmarks showing things other than I/O).
  • And, last but by no means least – the deadline scheduler is ahead of CFQ again, both for Ubuntu and PCLOS. Not by a huge margin, but it’s a safe choice, especially if you’re running Linux on enterprise storage that has its own decent I/O scheduler built in. There have been cases of CFQ dramatically lowering performance with external arrays. Remember – most of the guys writing code for Linux don’t have access to enterprise storage… using the defaults could be harming your Linux performance!
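To put numbers on the last two observations, here is a small Python sketch that computes the relative differences straight from the results table (time and IOPS columns only). The values are copied from the table; the helper name is mine. It says nothing about why the numbers differ, only by how much.

```python
# (time, IOPS) pairs copied from the results table.
results = {
    "Ubuntu 10.04b": {"cfq": (154, 169), "deadline": (136, 173)},
    "PCLOS ext4": {"cfq": (105, 239), "deadline": (97, 240)},
}


def pct_change(old, new):
    """Percentage change going from old to new."""
    return (new - old) / old * 100.0


# Deadline vs CFQ within each distribution.
for distro, runs in results.items():
    t_cfq, iops_cfq = runs["cfq"]
    t_dl, iops_dl = runs["deadline"]
    print(f"{distro}: deadline vs CFQ, "
          f"time {pct_change(t_cfq, t_dl):+.1f}%, "
          f"IOPS {pct_change(iops_cfq, iops_dl):+.1f}%")

# PCLOS vs Ubuntu, both on CFQ.
ubuntu_iops = results["Ubuntu 10.04b"]["cfq"][1]
pclos_iops = results["PCLOS ext4"]["cfq"][1]
print(f"PCLOS vs Ubuntu (CFQ): IOPS {pct_change(ubuntu_iops, pclos_iops):+.1f}%")
```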

D


Thu 18 Mar '10

FUD and The Invention of Lying

I watched the movie “The Invention of Lying” the other day. Fairly entertaining, and it had an interesting concept:

Imagine a society where nobody can lie – the very concept of lying is alien and never even enters anyone’s mind. Obviously, tons of jokes can be made using that premise, and the movie is riddled with them – such as their fictional Pepsi ad: “Pepsi: when they’re out of Coke!”

In the movie, a single man stumbles upon the concept of lying, and realizes he can do whatever he wishes since nobody else can tell he’s lying.

Obviously, in our society lying is quite prevalent – a large percentage of the population wouldn’t have jobs or offspring without lying.

I thought – what if, just for fun, we applied “The Invention of Lying” movie concept to IT sales? (I guess this is another take on comparing vendors to cars or wines and whatnot). I’m going for an alphabetical, non-comprehensive list (and added a few non-storage entries). I’ll leave it to the reader to figure out if this is more accurate from the standpoint of a rep that cannot lie, or vice versa… :)

  • 3Par: Our best asset is Marc Farley, his highly entertaining blog is what sells our gear. Our gear is pretty fast, though the software is not as good as others’. Unsure how we are still in business. Also unsure why nobody has bought us yet. We do have a handful of very large, loyal customers.
  • Apple: Our stuff is prettier but inside it’s all the same, actually often slower than others. Oh, and it’s a lot more expensive. But the software is cool (when you can find it). You’ll probably need to run Windows in a VM anyway to get the full functionality. Did we mention our stuff is prettier?
  • Bluearc: We have limited-functionality NAS with good sequential and random read speeds but not so much for random writes. Oh, and no application integration. But it’s good for certain workloads. Why is nobody acquiring us?
  • Compellent: Data Progression is the coolest thing we do, and we’ll probably go under now that the big vendors can do it. Oh, and it never did much in the real world, especially for performance. Hopefully we’ll get acquired, but if our technology is that good, why did nobody acquire us yet? We’re extremely affordable!
  • Equallogic: We’ll give you free storage (the first hit is free) if/since you also buy Dell servers. We might even throw in a free laptop and a projector. And a mouse pad. Make sure you convert everything to iSCSI since that’s all we do. Oh, you wanted to know specifics about the storage? Well – it’s free! If you buy some servers. You really want to know about the storage? Well, it’s free if… What? You want to understand the failure math of RAID 50? It’s atrocious, but the box is free if…
  • EMC: We buy companies since innovating is kinda hard and time-consuming, so our solutions end up being a mish-mash of technologies. It all mostly works, though interoperability between platforms sucks. Regarding storage, you should really only buy Symmetrix since all our other stuff doesn’t even come close to that quality, we have the other boxes just to meet price points and plug portfolio holes. We trash competitors until we acquire them or until we build something good enough that’s similar. We also sell futures. Hard. We focus too much on NetApp.
  • HDS: We don’t know how to write software but our high-end gear hardware is pretty solid. The cheaper stuff is OK, severely lacks in functionality but we’ll just drop the price enough that you’ll buy it anyway. Capisce?
  • HP: Seems that buying companies works for EMC, we’ll do the same, let’s see what happens. We used to make the best calculators in the world. Oh, and our best array is actually made by HDS. Our servers are great! Please, also buy some printers, they’re pretty good.
  • IBM: We used to be some of the best in storage, now our only 2 products are SVC and DS8K (oops, and now XIV), everything else we resell after we put our faceplates on it. Our biggest sellers are products made by LSI and NetApp. Oh, and we internally compete with the XIV team we acquired. Our storage solutions don’t talk to one another since they’re all made by different people. But SVC can tie it all together! Well, some of it, anyway.
  • Intel: We are so big that even if AMD has better stuff, eventually we catch up. Just you wait. In the meantime, buy more Intel to keep us going. Resistance is futile.
  • Isilon: We are decent for bulk sequential-access NAS, just don’t do any kind of random workload on our gear.
  • LeftHand: If you want any reasonable storage efficiency plus resiliency you need to buy a bunch of boxes (5 or so), since each box is essentially an HP server with internal disks, and the whole server can die. Oh, and we only do iSCSI. So you better make sure you only do iSCSI.
  • NetApp: We probably have some of the worst marketing of all vendors, and often can’t clearly articulate what makes our systems better to C-level execs, focusing almost entirely on techies. We also have issues with making some acquisitions pan out. ONTAP 8 is taking us forever to release, and until then you won’t have very wide striping (update: GA’d 3/19/10). We complicate sales because our engineers are too technical and insist on explaining how the boxes work at a low level, frequently confusing customers, who seldom care about understanding Row-Diagonal Parity equations. Too much good information is tribal knowledge, including performance tuning and the gigantic customers we have. We focus too much on EMC.
  • Pillar: We cry ourselves to sleep because all we have is Larry Ellison and QoS. Maybe Larry will finally force Oracle to buy some of his^H^H^H our gear? I wonder how that will go down since Oracle is already using a superior technology and achieving great savings… but we do make a fairly fast box if you’re OK with limited functionality and RAID50.
  • Sun: We can sell you some LSI storage, but even that may be going away. You can also get the exact same storage from IBM, which also resells LSI. How about a Thumper? We may also have some leftover HDS gear that we can give you real cheap.
  • Xiotech: Our value prop is extremely obscure and only understood well by about 5 engineers. Out of those 5 engineers, 2 understand the exact failure scenarios of our ISE architecture, and they can’t explain it to anyone else. We are pretty cheap though.
  • XIV: We believe in success through obfuscation. Our box can only do about 17K IOPS if the workload isn’t cache-friendly but we know how to cheat in benchmarks and make it seem faster (make sure your benchmark writes all zeros and/or fits in cache). The box also consumes more power and space than any other storage system. Our reps compete with IBM reps even though we are owned by IBM, since we only get paid on XIV sales, regardless of what the customer’s needs are. Oh, and under certain conditions, a 2-disk failure will bring down the entire system. But don’t you worry about that. BTW, the GUI is amazingly pretty.

Hope you had a chuckle reading some of this!

(minor edits – typo plus some on Twitter complained I was too gentle in the NetApp section :) )

D

Wed 17 Mar '10

Are you using the features of your existing platforms? And, if not, why not?

This is going to be another post that was inspired by sheer frustration…

It’s one thing talking to someone about adopting a totally new platform and meeting with resistance – I get it, it’s not what they’re used to, it’s new stuff, they don’t know if it will work etc. etc.

However, recently I’m encountering an alarming percentage of existing users of technology that are not using a lot of the features available to them – and I don’t mean small things, I’m talking about the features that someone literally buys the equipment for…

I understand if we’re talking about a feature you actually have to pay extra for, there may not be money in the budget for it. But this is not what this post is about…

Do you use the freely available or already paid for features? How do you know?

Consider this (I have more examples but we’ll keep it simple): I have a handful of customers that use our equipment (NetApp) with VMware that steadfastly refuse to even consider:

  • Deduplication
  • Thin Provisioning
  • Snapshots
  • Rapid, thin VM cloning

Those 4 technologies are frequently the reasons someone buys NetApp in the first place for virtualized environments, since they can lead to:

  • Vastly reduced storage footprint
  • Faster performance
  • Easier management
  • Easier and faster backup and recovery
  • Tremendous money savings

In my sample base, those customers absolutely would benefit from those technologies – it’s not a “maybe” or “your mileage may vary”. I know how their data is laid out and what kind of data it is, and the difference will be staggering.

Unjustified anger

I’ve also had customers tell me “where are my promised efficiencies?” They get really irate, and when I tell them exactly what to do in order to get said efficiencies, they start backpedalling and telling me how they can’t turn the features on during production hours. They then promise to turn some on during a maintenance window, then time goes by, they seem to forget about it and call me again, irate, complaining about the lack of features and efficiencies. And the cycle continues.

Is it an education problem? Lack of time?

Maybe it’s just a matter of education, but when someone is presented with the facts, several use cases from other local and global customers (including huge household names everyone recognizes), customers with hundreds of PB of data, all of them using the technology and achieving in many cases more than a 3:1 reduction in storage footprint, and still ignores the advice, there’s something wrong…

The other excuse for “shelfware” (software you never use but you just leave on the shelf) is lack of time to implement the features. For complex software I can see time being an issue, but my example is about things that can be done with a few mouse clicks.

The not invented here syndrome

There’s a term called “the not invented here syndrome”. This is an affliction suffered by professionals in all kinds of fields, not just IT. Some symptoms include:

  • Extreme resistance to any new ideas that were not developed within the company (frequently, by that person)
  • Extreme resistance to any kind of change, no matter how benign, low-risk, low-cost and beneficial it might be
  • Dismissing irrefutable proof
  • Thinking that your problems are more challenging than everyone else’s
  • The inability to recognize the real challenges facing their organization (“can’t see the forest for the trees”)

This is a perfectly normal human condition. We each have our world view, and some of us really don’t like having that view challenged. The human mind will actually go to amazing lengths to ensure that the existing worldview stays unmodified. The examples are all around us – people ignore what seems to be common sense all the time. History is full of horrific examples. I don’t want to depress anyone, so here are some humorous examples:

"I don’t trust fire, it can burn you!"

"That wheel thing seems like the devil’s own work!"

"Nobody needs more than 640K RAM in their PC".

Some friendly advice…

Back to the IT world. There are a few simple things you can do in order to make life a bit easier for all.

  1. Please read the documentation suggested by your engineer
  2. Then read it again and take notes and prepare questions
  3. Be open to new ideas – “luddite technologist” is a contradiction in terms
  4. Be flexible – try new things on copies of data or less important data, there’s always a way
  5. Reach out to your engineer, don’t always wait for them to reach out (our schedules are usually crazy)
  6. Think in terms of the business problems you’re trying to solve, not in terms of technology (you may not know that what you have can already solve your problems)
  7. If your vendor reaches out to you, maybe it’s not just to sell you more stuff… maybe we’re even trying to help out. Imagine that!
  8. Never assume anything (including that you always know better than the vendor, or that everyone’s lying to you, especially if you already own their gear!)
  9. If presented with irrefutable proof of something, consider graciously conceding
  10. Be aware of your shortcomings and prejudices (we all have them)
  11. Accept you don’t know it all (guess what – the customer is not always right!)
  12. And, last but not least: put the business first, and your ego a distant last.

I’ll get off my soapbox now.

D

Tue 9 Mar '10

More tales from the field: Sizing best practices – does Compellent follow them?

Note: I edited this a bit to remove some confusing pieces of info.

Another one came in. I’ll keep calling the offenders out until the craziness stops. Fellow engineers - remember that, regardless of where we work, our mission should be to help the customer out first and foremost. Then make a sale, if possible/applicable. I implore you to get your priorities straight. If it looks like you’re losing the fight, figure out what your true value is. If you have no true value, you always have the option of bombing the price. But please, don’t sell someone an under-configured system…

This time, it’s Compellent not seeming to follow basic sizing rules in a specific campaign (I’m not implying this is how all Compellent deals go down). The executive summary: In a deal I’m involved in, they seem to be proposing far fewer disks than are necessary for a specific workload, just so they are perceived as being lower in price. This is their second strike as far as I’m concerned (the first case I witnessed was Exchange sizing, where they were proposing a single shelf for a workload that needed several times the number of drives). Third strike gets you a personal visit. You will never repeat the offense after that, but it gets tiring. Education is better.

And before someone jumps on me and tells me that I don’t know how to properly size for Compellent (which I freely admit) I’ll ask you to consider the following:

There is no magic.

This is not a big NetApp FAS+PAM vs multi-engine Symmetrix V-Max discussion, where the gigantic caches will play a huge role. No – this specific case is a fight between 2 very small systems, both with very limited cache and regular ol’ 15K SAS drives. They’re not quoting SSD that could alleviate the read IOPS issue, and we’re not quoting PAM.

Ergo, this is about to get spindle-bound…

And for all the seasoned pros out there: I know you may know all this, it’s not for you, so don’t complain that it’s too basic. This post is for people new to performance sizing (and maybe some engineers :) )

Some preliminaries:

This is a Windows-only environment. So, the customer sent perfmon data for their servers over for me to analyze and recommend a box.

They’ll be running Exchange plus some databases.

From my days of doing EMC I learned some very important sizing lessons (thanks guys) that I will try to summarize here.

For instance - there is peak performance, average, and what we called “steady-state”.

In any application, there will be some very high I/O spikes from time to time. Those spikes are normal and are usually absorbed by host and array caches. This is the “peak performance”.

The trick is to figure out how long the spikes last, and see if the caches would be able to accommodate them. If a spike lasts for 30 min, it’s not a spike any more, but rather a real workload you need to accommodate.

If the spikes are in the range of seconds, then cache is usually enough. Depends on the magnitude of the spike, the length of the spike and the size of the cache :)
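To make that “depends” a bit more concrete, here is a deliberately crude Python sketch for write spikes. It assumes that whatever arrives faster than the disks can destage simply piles up in write cache until the spike ends. Every number below (spike rate, drain rate, I/O size, cache size) is hypothetical, and real arrays are far more nuanced (read spikes behave differently, watermarks kick in, and so on), so treat it as an illustration only.

```python
def spike_backlog_mib(spike_iops, drain_iops, duration_s, io_size_kib):
    """MiB that accumulate in cache while a spike outruns the disks."""
    excess_iops = max(0.0, spike_iops - drain_iops)
    return excess_iops * duration_s * io_size_kib / 1024.0


def cache_absorbs_spike(spike_iops, drain_iops, duration_s, io_size_kib, cache_mib):
    backlog = spike_backlog_mib(spike_iops, drain_iops, duration_s, io_size_kib)
    return backlog <= cache_mib


# Hypothetical box: disks drain 4,500 IOPS, 2 GiB of write cache, 8 KiB I/Os.
for seconds in (10, 1800):
    fits = cache_absorbs_spike(30000, 4500, seconds, 8, 2048)
    print(f"{seconds:>5}s spike at 30,000 IOPS absorbed by cache: {fits}")
```

With these made-up numbers a spike lasting seconds fits in cache, while the same spike sustained for 30 minutes does not come close, which is the point of the paragraphs above.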

Then, you have your average performance. That just takes a straight math average across all performance points - so, for instance, if you have, at night, very long periods of inactivity, they will affect the average dramatically. Short-lived spike data points won’t affect it as much since there are so few of them. So the average typically gets skewed towards the low end.

Then there’s the concept of “steady state”.

This effectively tries to get a more meaningful average of performance during normal working periods. It’s actually easy to eyeball if you’re looking at the IOPS graphs instead of letting Excel do its averaging for you.

A picture will make things clearer:

[Chart: IOPS samples over time; image not preserved]

In this simplified example chart, the vertical axis represents the IOPS and the horizontal is the individual samples over time. You can see there are very quiet periods, a brief spike, then sustained periods of activity. Without needing a degree in Statistics, one can see that the IOPS needed are about 500 in this chart. However, if you just take the average, that’s only 260, or about half! Obviously, not a small difference. But, again obviously, some extra care is required in order to figure out the real requirements instead of just calculating averages!
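The same effect is easy to reproduce in a few lines of Python. The sample series below is made up (it is not the data behind the chart), and using “the median of the samples above an idle threshold” as the steady-state estimate is just one simple stand-in for eyeballing the graph; the threshold of 100 IOPS is an assumption as well.

```python
from statistics import mean, median

# Hypothetical samples: a long quiet period, one brief spike, then sustained work.
samples = [20] * 26 + [1500] + [500] * 23

IDLE_THRESHOLD = 100  # assumed cut-off between "idle" and "working"


def steady_state(iops_samples, idle_threshold):
    """Median of the samples taken while the system was actually busy."""
    active = [s for s in iops_samples if s > idle_threshold]
    return median(active)


plain_average = mean(samples)
steady = steady_state(samples, IDLE_THRESHOLD)

print(f"plain average: {plain_average:.0f} IOPS")
print(f"steady state:  {steady:.0f} IOPS")
```

The plain average lands far below the steady-state figure because the idle samples drag it down, while the median of the busy samples shrugs off the single spike.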

So, to summarize: it’s usually not correct to size for maximum or average since they’re both misleading (unless you’re sizing for a minimum-latency DB application – then you often size for maximums to accommodate any and all performance requirements). This is the same for every array vendor. The array and host cache accommodate some of the maximum spikes anyway, but the true average steady-state is what you’re trying to accommodate.

So, now that you know the steady-state true average the customer is seeing, the next step in estimating performance is to look at the current disk queues and service times.

I won’t go into disk queuing theory but, simply speaking, if you have a lot of outstanding I/O requests, they end up getting queued up, and the disk tries to service them ASAP but it just can’t quite catch up. You typically want to see low numbers for the queue (as in the very low single digits).

Then, there’s the response time. If the current response times are overly long (anything over 20ms for most DB/email work), then you have a problem…

What this means is that the observed steady-state workload is often constrained by the current hardware. By examining performance reports, all you are seeing is what the current system is doing.

So, the trick is to find out what performance the customer actually NEEDS, at a reasonably low ms response time with low queuing. The perfmon data is just to ensure you don’t make the performance even WORSE than they’re currently seeing! Finding out the true requirements is really the difficult part.

Finally, once you figure out the final, desired steady-state IOPS requirements, you need to translate them into your specific system, since there’s cache helping, but always some overhead to be considered. For instance, in a system that relies on RAID10/RAID5, you need to adjust for the read/write penalties of RAID. That increases the IOPS needed by nature. Again, this is the same for all array vendors – the only time there’s no I/O penalty, is if you’re doing RAID0 (= no protection).

You see, RAID5 for instance, in order to perform writes, has to do some reads as well, to calculate and write the parity. All very normal for the algorithm. Depending on the read/write mix, this extra I/O can be significant, and absolutely needs to be considered when sizing storage! RAID10 doesn’t need to read in order to write, but has to write 2 of everything, so that needs to be considered as well.
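Here is a minimal Python sketch of that adjustment, using the textbook small-random-write penalties: 4 back-end I/Os per host write for RAID5 (read data, read parity, write data, write parity), 2 for RAID10, and 1 for RAID0. It assumes no help from cache or full-stripe writes, so it is a worst-case illustration rather than any vendor’s sizing formula. The function name and the 30% write mix are mine.

```python
# Back-end I/Os generated by one host write (classic small random write case).
WRITE_PENALTY = {"raid0": 1, "raid10": 2, "raid5": 4}


def backend_iops(host_iops, write_fraction, raid_level):
    """IOPS the disks must deliver for a given host workload and RAID level."""
    reads = host_iops * (1.0 - write_fraction)
    writes = host_iops * write_fraction
    return reads + writes * WRITE_PENALTY[raid_level]


# Hypothetical workload: 1,000 host IOPS, 30% writes.
for level in ("raid0", "raid10", "raid5"):
    print(f"{level}: {backend_iops(1000, 0.30, level):.0f} back-end IOPS")
```

Under the same assumptions, a RAID5 workload whose back-end load comes out 50% higher than the host load would have roughly one write in every six I/Os, since 1 + 3w = 1.5 gives w = 1/6.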

You also need to figure out read vs write percentage, I/O block size distributions, random vs sequential… not rocket science, but definitely extra work in order to do right.

The last thing that needs to be taken into account is the working set. Basically, it means this:

Imagine you have a 10TB database, but you’re really only accessing about 100GB of it repeatedly and consistently. Your working set is that 100GB, not the entire 10TB DB. Which is why the more advanced arrays have ways of prioritizing/partitioning cache allocations, since you typically don’t want a big 50TB file share with 10,000 users causing cache starvation for your 10TB DB with the 100GB working set. You need to retain as much of the cache as possible for the DB, since the 50TB file share is too large and unpredictable a working set to fit in cache.

Unless you understand the true working set, you will have no idea how much cache will be able to truly help that particular workload.

Going back to the reason I wrote this post in the first place:

In this specific, small environment, the non-RAID steady-state percentile IOPS required were close to 3,000, with a working set and I/O pattern that wouldn’t fit in the cache of the small systems. Once adjusted for RAID5, the specific I/O mix demanded 50% more IOPS from the disk. The spikes were fairly high, in excess of 10x the steady-state.

Back to basics: A 15K RPM disk can provide about 220 IOPS with reasonable (<20ms) latency, so about 14 disks are needed to accommodate the pre-RAID performance with under 20ms latency. Remember – that doesn’t include spares or RAID overheads, and will not even accommodate I/O spikes. Calculating with the RAID overhead, about 21 drives are needed, at a minimum. Add a spare or two, and you’re up to 22-23 drives, bare minimum, to satisfy steady-state performance without cache starvation in this specific workload.
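For those who like to see the arithmetic spelled out, here it is as a short Python sketch. The inputs are the figures quoted above (3,000 steady-state IOPS, a 1.5x RAID5 adjustment, about 220 IOPS per 15K drive); the function name is mine, and caches and spikes are deliberately ignored.

```python
import math


def drives_needed(host_iops, raid_multiplier, iops_per_drive, spares=0):
    """Minimum drive count: round up, because you can't buy part of a drive."""
    data_drives = math.ceil(host_iops * raid_multiplier / iops_per_drive)
    return data_drives + spares


pre_raid = drives_needed(3000, 1.0, 220)
with_raid5 = drives_needed(3000, 1.5, 220)

print(f"pre-RAID:         {pre_raid} drives")
print(f"with RAID5:       {with_raid5} drives")
print(f"plus 1-2 spares:  {with_raid5 + 1}-{with_raid5 + 2} drives")

# For comparison: what 11 working drives can deliver at the same per-drive rate.
print(f"11 drives deliver about {11 * 220} IOPS")
```

At the same per-drive rate, 11 drives fall short of the 3,000 steady-state IOPS even before any RAID adjustment is applied.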

And, finally, the offense in question:

Compellent said that with their combo RAID1-RAID5 they only needed a single 12-drive SAS enclosure for the entire workload. Take spares out, and, best case, you’re talking about 11 drives doing I/O. Apparently, the writes happen in RAID1, and the reads as RAID5. I’m not the expert, I’m sure someone will chime in. Maybe my math is a bit off since Compellent has the funky RAID1/RAID5 mix, but there are still I/O penalties…

Based on the above analysis, this somehow doesn’t compute with 11 drives, half what my calculations indicate… so, my final question is:

How do Compellent engineers size for performance?

D

Wed 3 Mar '10

EMC’s incredible marketing and the FAST fairy tale (and a bit on how to reduce tiers)

I’m in MN prepping to teach a course (my signature anti-FUD extravaganza), and thought I’d get a few things off my chest that I’ve been meaning to write about for a while. Some Stravinsky to provide the vibes and I’m good to go. It’s getting really late BTW and I’m sure this will progressively get less coherent as time goes by, but I like to write my posts in one shot…

I never cease to be amazed by what’s possible with the power of great marketing/propaganda. And EMC is a company that has some of the best marketing anywhere. Other companies should take note!

Think about it: Especially on the CX, they took an auto-tiering implementation as baked as wheat that hasn’t been planted yet, and managed to create so much noise and excitement around it that many people think EMC actually invented the concept and, heavens, some even believe that the existing implementation is actually decent. Worse still, some have actually purchased it. Kudos to EMC. With the exception of some of Microsoft’s work, nobody reputable has the stones any more to release, amidst such fanfare, a product this unpolished. Talk about selling futures…

Perception is reality.

I’m an engineer by training and by trade first and foremost, and, regardless of bias, I consider the existing FAST implementation an affront. Allow me to explain, gentle reader…

The tiering concept

Some background info is in order. Most arrays of any decent size and complexity sold nowadays are configured with different kinds of disk, purely out of cost considerations. For instance, there may be 30 really fast drives where a bunch of important low-latency DBs live, another 100 pretty fast drives where most VMs and Exchange live, then 200 SATA drives for bulk storage and backups.

Don’t kid yourself: If the customer buying the aforementioned array had enough dough, they’d be getting the wunderbox with all super-fast drives inside – all the exact same kind of drives. That’s just simpler to deal with from a management standpoint and obviously the performance is stellar. Remember this point since we’ll get back to it…

Of course, not everyone is made of money, so arrays that look like the 3-tier example above are extremely common. Just enough drives of each type are purchased in order to achieve the end result.

What typically ends up happening is that, over time, some pieces of data end up in the wrong tier, for one reason or another. Maybe a DB that was super-important once now only needs to be accessed once a year; or a DB that was on SATA now has become the most frequently-accessed piece of data in the array. Or, perhaps, the importance of a DB flip-flops during a month, so it only needs to be fast maybe for month-end-processing. So now, you need to move stuff around so that what needs to be fast is shifted to the fast drives.

Pressure points and the need for passing the hot potato

But wait, there’s more…

The entire performance problem is created in the first place due to most array architectures being older than mud. In legacy array architectures, LUNs are carved out of RAID groups, typically made of relatively few disks. So, in an EMC Clariion, best practice is to have a 5-disk RAID5 group. You then ideally split up that group into no more than 2 LUNs and assign one to each controller.

With disks getting bigger and bigger, creating 1-2 LUNs can become exceedingly difficult – a 5-disk R5 group made with 450GB drives in a Clariion offers a bit over 1.5TB of space, which is too much for many application needs – maybe you just need 50GB here, another 300GB there… in the end, you may have 10 LUNs in that RAID group that’s supposed to have no more than 2. The new 600GB FC drives make this even worse.
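A quick Python sketch of why this happens. It uses the simple rule that a RAID5 group gives you one disk less than you put in, with nominal drive sizes; that raw figure comes out higher than the “bit over 1.5TB” quoted above, presumably because of formatting overhead and unit differences. The 150GB “typical LUN” is a hypothetical number picked for illustration.

```python
def raid5_data_capacity_gb(disks_in_group, disk_size_gb):
    """Nominal data capacity of a RAID5 group: one disk's worth goes to parity."""
    if disks_in_group < 3:
        raise ValueError("RAID5 needs at least 3 disks")
    return (disks_in_group - 1) * disk_size_gb


def luns_that_fit(capacity_gb, typical_lun_gb):
    """How many LUNs of a given size end up sharing the same spindles."""
    return capacity_gb // typical_lun_gb


TYPICAL_LUN_GB = 150  # hypothetical average request

for size_gb in (450, 600):
    capacity = raid5_data_capacity_gb(5, size_gb)
    luns = luns_that_fit(capacity, TYPICAL_LUN_GB)
    print(f"5 x {size_gb}GB RAID5: {capacity}GB nominal, "
          f"room for {luns} LUNs of {TYPICAL_LUN_GB}GB on 5 spindles")
```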

So, in summary, what ends up happening is that you split up that RAID group into too many LUNs in order to avoid waste. And that’s where your array develops a serious pressure problem.

You see, now you may have 10 different servers hitting the exact same RAID group, creating undue pressure on the 5 poor disks struggling to cope with the crazy load. I/O service times get too high, queue lengths get crazy, users get cranky.

Again – this whole problem exists exactly because legacy array architectures don’t automatically balance I/O among all drives.

But for those afflicted Paleolithic systems, wouldn’t it be nice if we could move some of those hot LUNs, non-disruptively, to other RAID groups that don’t suffer from high pressure?

That’s what EMC’s FAST for the Symmetrix and CX does. It attempts to move entire LUNs to faster tiers like SSD. Which, BTW, is something you can do manually, but FAST attempts to automate the task (kinda, depends, etc).

The current FAST pitfalls

Let’s examine first how FAST (Fully Automated Storage Tiering) is implemented, since it’s really 3 utterly different solutions, depending on whether you have Symm, CX or NS:

On the Symmetrix it’s always been there in the form of Symmetrix Optimizer, which may not have been aware of tiers but it definitely knew about migrating to less busy disks. Now you can teach it about tiers, too. But it’s not, in my mind, a new product, even if EMC would like you to believe it is. It looks to me too much like Optimizer + some new heuristics. But the Gods of Marketing managed to create unbelievable commotion about something that was an old feature. What amazes me is that nobody seems to have made the connection – maybe I’m really missing something. I’m sure someone from EMC will correct me if I’m wrong. In my experience, Optimizer, when purchased, often did more harm than good, was difficult to manage and, ultimately, was left inactive in many shops – with the beancounters lamenting the spending of precious funds on something that never quite worked that well. Oh, and it seems the current version doesn’t support thin LUNs. But of the FAST implementations on EMC gear it is the more complete version, exactly because Optimizer has been there for a long time…

On the far more popular CX platform, what happens is like a tribute to kludges everywhere. Consider this:

  1. Movement is one-way only (FC to SATA, or FC to SSD). More of a one-shot tool than continuous optimization!
  2. You need a separate PC that will crunch Navisphere Analyzer performance logs, which takes a while
  3. The PC will then provide a list of recommendations
  4. Depending on which LUNs you approve it will invoke a NaviCLI command to move the specified LUNs in the box
  5. Doesn’t support thin provisioning
  6. Not sure if it supports MetaLUNs
  7. It is NOT automatic since you have to approve the move! Ergo, it should not be sold under the name “FAST” since the “A” stands for “Automated”. Aren’t there laws against false advertising?

On the Celerra NS platform (EMC’s NAS), one needs to purchase the Rainfinity FMA boxes, which then can move files between tiers of disk based on frequency of access. One is then limited by the scalability of the FMA – how many files can it track? How dynamically can it react to changing workloads? What if the FMA breaks? Why do I need yet more boxes to do this?

Ah, but it gets better with FASTv2! Or does it?

EMC has been upfront that FAST will become way cooler with v2. It better be, since as you can see it’s no great shakes at the moment. From what the various EMC bloggers have been posting, it seems FASTv2 will use the thin provisioning subsystem to go to a sub-LUN level of granularity.

The granularity will obviously depend on how many disks you have in the virtual provisioning pool, since a LUN (just like with MetaLUNs) will be split up so that it occupies all the disks in the pool. The bigger the pool, the better. This should provide better performance (it does with other vendors), yet EMC’s own docs state that the current version of virtual provisioning (at least on the CX) has higher overhead than their traditional LUNs and will provide less performance. I guess that’s a subject for another day, and maybe they’ll finally revamp the architecture to fix it. Back to FASTv2:

The “busyness” of each LUN segment will be analyzed, and that segment will then move, if applicable, to another tier. Of course, how efficient that will end up being will depend on how you do I/O to the LUN in the first place! If the LUN I/O is fairly spatially uniform, then the whole thing will have to move just like FASTv1. But I guess with v2 there’s at least the potential of sub-LUN migration, for cases where a clearly delineated part of the LUN is really “hot” or “cold”. Obviously, since the chunk size will still be significantly large, expect a bunch of non-applicable data to move with the stuff that should be moved.
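To make the sub-LUN idea concrete, here is a minimal sketch of per-chunk “heat” tracking. Everything in it is an assumption for illustration: the 1 GiB chunk size, the “twice the average” threshold and the function names are hypothetical, and none of this comes from EMC documentation.

```python
from collections import Counter

CHUNK_SIZE = 1024 ** 3  # hypothetical 1 GiB chunk; the real chunk size is not known here


def chunk_heat(io_offsets, chunk_size=CHUNK_SIZE):
    """Count I/Os per chunk, given the byte offset of each I/O within the LUN."""
    return Counter(offset // chunk_size for offset in io_offsets)


def hot_chunks(heat, total_chunks, factor=2.0):
    """Chunks receiving at least `factor` times the average per-chunk I/O count."""
    total = sum(heat.values())
    if total == 0:
        return []
    average = total / total_chunks
    return sorted(c for c, n in heat.items() if n >= factor * average)


def collateral_fraction(hot_bytes, chunk_size=CHUNK_SIZE):
    """Share of a moved chunk that is NOT the data you actually wanted to move."""
    return 1.0 - hot_bytes / chunk_size


# A 10-chunk LUN with spatially uniform I/O: no chunk stands out.
uniform = [c * CHUNK_SIZE + i * 4096 for c in range(10) for i in range(100)]
# The same LUN with nearly all I/O landing in chunk 3.
skewed = [3 * CHUNK_SIZE + i * 4096 for i in range(900)]
skewed += [c * CHUNK_SIZE for c in range(10) if c != 3 for _ in range(11)]

print(hot_chunks(chunk_heat(uniform), 10))   # []
print(hot_chunks(chunk_heat(skewed), 10))    # [3]
print(collateral_fraction(64 * 1024 ** 2))   # 0.9375
```

In the uniform case no chunk is hotter than any other, so sub-LUN granularity buys nothing and the decision becomes all-or-nothing for the whole LUN. In the skewed case only chunk 3 qualifies, but if the truly hot region is 64 MiB, moving its 1 GiB chunk drags along about 94% bystander data.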

The real problem

First, to give credit where it’s due: Compellent has had sub-LUN moves for a long, long time. Give those guys props. They actually deserve it.

However – the Compellent approach, FASTv2 and, even worse, v1 all suffer from this fundamental issue:

Lack of real-time acceleration.

Think about it – performance has to be analyzed periodically, heuristics followed, then LUNs or pieces of LUNs have to be moved around. This is not something that can respond instantly to performance demands.

Consider this scenario:

You have a payroll DB that, during most of the month, does absolutely nothing. A fully automated tiering system will say “hey, nobody has touched this LUN in weeks, I better move it to SATA!”

Then crunch time comes, and the DB is on the SATA drives. Oopsie.

People complain, and the storage admin is forced to manually migrate it back to SSD.

Kinda defeats the whole purpose… unless I’m missing something the size of the Titanic.
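The payroll scenario can be replayed with a toy simulation. This is a sketch of a deliberately naive policy (demote after 14 idle days, re-evaluate once a night); the thresholds, the IOPS numbers and the function name are all hypothetical and not taken from any vendor’s implementation.

```python
def simulate_tiering(daily_iops, idle_days_before_demote=14, hot_iops=1000):
    """Replay daily load through a naive nightly tiering policy.

    Returns the list of days (1-based) on which a busy LUN was still on SATA.
    """
    tier = "SSD"
    idle_days = 0
    bad_days = []
    for day, iops in enumerate(daily_iops, start=1):
        # The workload runs on whatever tier the LUN sits on today.
        if iops >= hot_iops and tier == "SATA":
            bad_days.append(day)
        # The analysis only runs after the day is over, so it reacts a day late.
        if iops >= hot_iops:
            idle_days = 0
            tier = "SSD"
        else:
            idle_days += 1
            if idle_days >= idle_days_before_demote:
                tier = "SATA"
    return bad_days


# Hypothetical payroll DB: 27 quiet days, then 3 days of crunch time, for two months.
payroll = ([5] * 27 + [5000] * 3) * 2
print(simulate_tiering(payroll))  # [28, 58]
```

Under these assumptions the first day of every payroll run lands on SATA, every single month, because the policy can only react after the fact.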

So, you may have to write all kinds of exception rules (provided the system lets you). Some rules for most DBs, Exchange, a few apps here and there…

Soon, you’re actually in a worse state than where you began: You have the added complexity and cost of FAST, plus you have to worry about creating exception rules.

Now here’s a novel idea…

What if you actually put your data in the right tier to begin with and what if, even if you didn’t, it didn’t matter too much?

For instance – normal fileshares, deep archives, large media files, backups to disk – most people would agree that those workloads should probably forever be on SATA if you’re trying to save some money. With 2TB drives, the SATA tier has become super-dense, which can be very useful for quite a few use cases.

DBs, VM OS files – should usually be on faster disk. But no need to go nuts with several tiers of fast disk; a single fast tier should be sufficient!

LUNs and other array objects should try to automatically span as many drives as possible by default without you having to tell the array to do that… that way you avoid the hot spots in the first place by design, thereby reducing or even removing the need for migrations (I can still see some very limited cases where migration would be useful).
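A quick arithmetic sketch shows why wide striping avoids hot spots by design. The LUN names, IOPS figures and disk counts below are invented for illustration, and the model assumes a perfectly even spread across disks, which real arrays only approximate.

```python
def per_disk_load_raid_groups(lun_iops, lun_to_group, disks_per_group):
    """Each LUN's I/O lands only on the disks of its own RAID group."""
    group_iops = {}
    for lun, iops in lun_iops.items():
        group = lun_to_group[lun]
        group_iops[group] = group_iops.get(group, 0) + iops
    return {g: total / disks_per_group for g, total in group_iops.items()}


def per_disk_load_wide_striping(lun_iops, total_disks):
    """Every LUN spans every disk, so all disks share the total load."""
    return sum(lun_iops.values()) / total_disks


# Hypothetical workload: 4 RAID groups of 5 disks each, 20 disks in total.
lun_iops = {"oltp": 1500, "exchange": 900, "archive": 50, "fileshare": 100, "backup": 50}
lun_to_group = {"oltp": 0, "exchange": 0, "archive": 1, "fileshare": 2, "backup": 3}

print(per_disk_load_raid_groups(lun_iops, lun_to_group, disks_per_group=5))
# {0: 480.0, 1: 10.0, 2: 20.0, 3: 10.0}
print(per_disk_load_wide_striping(lun_iops, total_disks=20))
# 130.0
```

With the legacy layout, the five disks in group 0 each take 480 IOPS while the other fifteen sit nearly idle. Spread the same LUNs over all 20 disks and each one sees 130 IOPS, with nothing to migrate.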

And finally, a large, intelligent cache (as in really large) to help with real-time workload demands, dynamically and as needed, by caching tiny 4K chunks and not wasting space on gigantic pieces… with the ability to prioritize the caching if needed. Not to mention being deduplication-aware.

Wouldn’t that be a bit simpler to manage, more nimble and more useful in real-world scenarios? The cache will help out even the slower drives for both file and OLTP-type workloads.
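As a rough illustration of why a small-block cache reacts faster than migration, here is a minimal read-cache sketch that works on 4K blocks. Plain LRU eviction is an assumption chosen for simplicity, not a claim about how any particular array’s cache works, and prioritization and deduplication awareness are left out entirely.

```python
from collections import OrderedDict

BLOCK_SIZE = 4096  # cache in 4K blocks rather than large chunks


class BlockCache:
    """Toy LRU read cache keyed by 4K block number (hypothetical design)."""

    def __init__(self, capacity_blocks):
        self.capacity = capacity_blocks
        self.blocks = OrderedDict()
        self.hits = 0
        self.misses = 0

    def read(self, offset):
        """Return True on a cache hit, False if the read had to go to disk."""
        block = offset // BLOCK_SIZE
        if block in self.blocks:
            self.blocks.move_to_end(block)  # mark as most recently used
            self.hits += 1
            return True
        self.misses += 1
        self.blocks[block] = True
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)  # evict the least recently used block
        return False


# A cold database wakes up and reads the same 100 blocks three times over.
cache = BlockCache(capacity_blocks=256)
for _ in range(3):
    for block in range(100):
        cache.read(block * BLOCK_SIZE)

print(cache.hits, cache.misses)  # 200 100
```

Only the first pass goes to the slow drives. From the second touch onward the hot blocks are served from cache, with no analysis window and no LUN move, and only the 4K blocks actually read take up cache space.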

Maybe life doesn’t need to be complicated after all.

It’s almost 0300 so I’d better go to bed…

D