Recovery Monkey: Musings on backups, storage, tuning and more

Wed
17
Mar '10

Are you using the features of your existing platforms? And, if not, why not?

This is going to be another post that was inspired by sheer frustration…

It’s one thing talking to someone about adopting a totally new platform and meeting with resistance – I get it, it’s not what they’re used to, it’s new stuff, they don’t know if it will work etc. etc.

However, recently I’m encountering an alarming percentage of existing users of technology that are not using a lot of the features available to them – and I don’t mean small things, I’m talking about the features that someone literally buys the equipment for…

I understand if we’re talking about a feature you actually have to pay extra for, there may not be money in the budget for it. But this is not what this post is about…

Do you use the freely available or already paid for features? How do you know?

Consider this (I have more examples but we’ll keep it simple): I have a handful of customers that use our equipment (NetApp) with VMware that steadfastly refuse to even consider:

  • Deduplication
  • Thin Provisioning
  • Snapshots
  • Rapid, thin VM cloning

Those 4 technologies are frequently the reasons someone buys NetApp in the first place for virtualized environments, since they can lead to:

  • Vastly reduced storage footprint
  • Faster performance
  • Easier management
  • Easier and faster backup and recovery
  • Tremendous money savings

In my sample base, those customers absolutely would benefit from those technologies – it’s not a “maybe” or “your mileage may vary”. I know how their data is laid out and what kind of data it is, and the difference will be staggering.

Unjustified anger

I’ve also had customers tell me “where are my promised efficiencies?” They get really irate, and when I tell them exactly what to do in order to get said efficiencies, they start backpedalling and telling me how they can’t turn the features on during production hours. They then promise to turn some on during a maintenance window, then time goes by, they seem to forget about it and call me again, irate, complaining about the lack of features and efficiencies. And the cycle continues.

Is it an education problem? Lack of time?

Maybe it’s just a matter of education, but when someone is presented with the facts, several use cases from other local and global customers (including huge household names everyone recognizes), customers with hundreds of PB of data, all of them using the technology and achieving in many cases more than a 3:1 reduction in storage footprint, and still ignores the advice, there’s something wrong…

The other excuse for “shelfware” (software you never use but you just leave on the shelf) is lack of time to implement the features. For complex software I can see time being an issue, but my example is about things that can be done with a few mouse clicks.

The not invented here syndrome

There’s a term called “the not invented here syndrome”. This is an affliction suffered by professionals in all kinds of fields, not just IT. Some symptoms include:

  • Extreme resistance to any new ideas that were not developed within the company (frequently, by that person)
  • Extreme resistance to any kind of change, no matter how benign, low-risk, low-cost and beneficial it might be
  • Dismissing irrefutable proof
  • Thinking that your problems are more challenging than everyone else’s
  • The inability to recognize the real challenges facing their organization (“can’t see the forest for the trees”)

This is a perfectly normal human condition. We each have our world view, and some of us really don’t like having that view challenged. The human mind will actually go to amazing lengths to ensure that the existing worldview stays unmodified. The examples are all around us – people ignore what seems to be common sense all the time. History is full of horrific examples. I don’t want to depress anyone, so here are some humorous examples:

"I don’t trust fire, it can burn you!"

"That wheel thing seems like the devil’s own work!"

"Nobody needs more than 640K RAM in their PC".

Some friendly advice…

Back to the IT world. There are a few simple things you can do in order to make life a bit easier for all.

  1. Please read the documentation suggested by your engineer
  2. Then read it again and take notes and prepare questions
  3. Be open to new ideas – “luddite technologist” is a contradiction in terms
  4. Be flexible – try new things on copies of data or less important data, there’s always a way
  5. Reach out to your engineer, don’t always wait for them to reach out (our schedules are usually crazy)
  6. Think in terms of the business problems you’re trying to solve, not in terms of technology (you may not know that what you have can already solve your problems)
  7. If your vendor reaches out to you, maybe it’s not just to sell you more stuff… maybe we’re even trying to help out. Imagine that!
  8. Never assume anything (including that you always know better than the vendor, or that everyone’s lying to you, especially if you already own their gear!)
  9. If presented with irrefutable proof of something, consider graciously conceding
  10. Be aware of your shortcomings and prejudices (we all have them)
  11. Accept you don’t know it all (guess what – the customer is not always right!)
  12. And, last but not least: put the business first, and your ego a distant last.

I’ll get off my soapbox now.

D

Mon
8
Feb '10

NetApp disk rebuild impact on performance (or lack thereof)

Due to the craziness in the previous blog, I decided to post an actual graph showing a NetApp system’s I/O latency while under load and during a disk rebuild. It was a bakeoff vs another large storage vendor (which NetApp won).

The test was done at a large media company with over 70,000 Exchange seats. It was with no more than 84 drives, so we’re not talking about some gigantic lab queen system (I love Farley’s term). The box was set up per best practices, with aggregate size being 28 disks in this case.

(Edited at the request of EMC’s CTO to include the performance tidbit): Over 4K IOPS were hitting each aggregate (much more than the customer needed) and the system had quite a lot of steam left in it.

There were several Exchange clusters hitting the box in parallel.

All of the testing for both vendors was conducted by Microsoft personnel for the customer.  The volume names have been removed from the graph to protect the identity of the customer:

[Graph: I/O latency of the NetApp system under load, before, during and after the disk rebuild]

Under a 53:47 read/write ratio 8K-size IOPS, a single disk was pulled.  Pretty realistic failure scenario, a disk breaks while the system is under production-level load. Plenty of writes, too, almost 50%.

Ok.  The fuzzy line around 6ms is the read latency.  At point 1 a disk was pulled and at point 2 the rebuild completed.  Read latency increased to 8ms during the rebuild, but dropped back down to 5ms after the rebuild completed.  The line at less than 1ms response time straight across the bottom is the write latency. Yes, it’s that good.

So - there was a tiny bit of performance degradation for the reads but I wouldn’t say that it “killed” performance as a competitor alleged.
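For a sense of scale, here is a back-of-the-envelope sketch of the per-aggregate workload described above. It treats "over 4K IOPS" as exactly 4,000, so the results are lower bounds; the function name and the dictionary layout are just for illustration.

```python
# Inputs taken from the post; 4,000 is a floor, since the text says "over 4K IOPS".
IOPS_PER_AGGREGATE = 4_000   # IOPS hitting each aggregate
IO_SIZE_KB = 8               # 8K I/O size
READ_PCT = 53                # 53:47 read/write ratio


def workload_profile(iops: int, io_size_kb: int, read_pct: int) -> dict:
    """Split an IOPS figure into reads and writes and derive throughput."""
    reads = iops * read_pct // 100
    writes = iops - reads
    return {
        "read_iops": reads,
        "write_iops": writes,
        "throughput_mib_s": iops * io_size_kb / 1024,
    }


profile = workload_profile(IOPS_PER_AGGREGATE, IO_SIZE_KB, READ_PCT)
print(profile)
```

So each aggregate was absorbing at least roughly 2,100 reads and 1,900 writes per second while the rebuild was running.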

The rebuild time is a tad faster than 30 hours as well (look at the graph :) ) but then again the box used faster, 15K drives (and smaller, 300GB vs 500GB), so before anyone complains, it’s not apples-to-apples compared to the Demartek report.

I just wanted to illustrate a real example from a real test at a real customer using a real application, and show the real effects of drive failures in a properly-implemented RAID-DP system.

The FUD-busting will continue, stay tuned…

D

Tue
2
Feb '10

Vendor FUD-slinging – at what point should legal action be taken? And who do you believe as a customer?

I’m all for a good fight, but in the storage industry it seems that all too many creative liberties are taken when competing.

Let’s assume, for a moment, that we’re talking about the car industry instead. I like cars, and I love car analogies. So we’ll use that, and it illustrates the absurdity really well.

The competitors in this example will be BMW and Mercedes. Nobody would argue that they are two of the most prominent names in luxury cars today.

BMW has the high-performance M-series. Let’s take as an example the M6 – a 500HP performance coupe. Looks nice on paper, right?

Let’s say that Mercedes has this hypothetical new marketing campaign to discredit BMW, with the following claims (I need to, again, clarify that this campaign is entirely fictitious, and used only to illustrate my point, lest I get attacked by their lawyers):

  1. Claim the M6 doesn’t really have 500HP, but more like 200HP.
  2. Claim the M6 only does 0-60 in under 5 seconds with only 5% of the gas tank filled, a 50lb driver, downhill, with a tail wind and help from nitrous.
  3. Claim that if you fill the gas tank past 50%, performance will drop so the M6 does 0-60 in more like 30 seconds. Downhill.
  4. Claim that it breaks like clockwork past 5K miles.
  5. Claim that they have one, they tested it, and it performs as they say.
  6. Claim that, since they are Mercedes, the top name in the automotive industry, you should trust them implicitly.

Imagine Mercedes, at all levels, going to market with this kind of information – official company announcements, messages from the CEO, company blogs, engineers, sales reps, dealer reps and mechanics…

Now, imagine BMW’s reaction.

How quickly do you think they’d start suing Mercedes?

How quickly would they have 10 independent authorities testing 10 different M6 cars, full of gas, in uphill courses, with overweight drivers, just to illustrate how absurd Mercedes’ claims are?

How quickly would Mercedes issue a retraction?

And, to the petrolheads among us – wouldn’t such a stunt look like Mercedes is really, really afraid of the M6? And don’t we all know better?

More to the point – do you ever see Mercedes pulling such a stunt?

Ah, but you can get away with stuff like that in the storage industry!

Unfortunately, the storage industry is rife with vendors claiming all kinds of stuff about each other. Some of it is or was true, much of it is blown all out of proportion, and some is blatant fabrication.

For instance, XIV breaking if you pull 2 disks out – as I state in a previous post, it’s possible if the right 2 drives fail within a few minutes of each other. I think it’s unacceptable, even though it’s highly unlikely to happen in real life. But I’ve seen sales campaigns against the XIV use this as the mantra, to the point that the fallacy is finally stated: “ANY 2 drive failure will bring down the system”.

Obviously this is not true and IBM can demonstrate how untrue that is. Still, it may slow down the IBM campaign.

Other fallacies are far more complicated to prove wrong, unfortunately.

An example: Pillar Data has an asinine yet highly detailed report by Demartek showing NetApp and EMC arrays having significantly lower rebuild speeds than Pillar (as if that’s the most important piece of data management, but anyway – rebuild speed hasn’t helped Pillar sales much, even if it’s true).

Anyone who knows how to configure NetApp and EMC would see that the Pillar box was correctly configured, whereas the others were intentionally made to look 4x worse (in the case of NetApp, they literally went against not just best practices but blatantly against system defaults in order to make it slower). However, some CIOs might read this and give credence to it, since they don’t know the details and don’t read past the first graph.

For EMC and NetApp to dispute this, they have to go to the trouble of configuring, properly, a similar system, and running similar tests, then writing a detailed and coherent response. It’s like wounding the enemy soldier instead of killing them, their squadmates have to help them out, wasting manpower. I get it – it’s effective in war. But is it legal in the business world?

Last but not least: EMC and HP, at the very least, have anti-NetApp reports, blogs, PPTs etc. that literally look just like the absurd Mercedes/BMW example above, sometimes worse. Some of it was true a long time ago (the famous FUD “2x + snap delta” space requirement for LUNs is really “1x + snap delta” and has been for years), some of it is pure fabrication (“it slows down to 5% of its original speed if you fill it up!”). See here for a good explanation.

Of course, again that’s like wounding the enemy soldiers: NetApp engineers have to go and defend their honor, show all kinds of reports, customer examples, etc etc. Even so, at some point many CIOs will just say “I trust EMC/HP, I’ve been buying their stuff forever, I’ll just keep buying it, it works”. The FUD is enough to make many people that were just about to consider something else, go running back to mama HP.

Should NetApp sue? I’ve seen some of the FUD examples and literally they are not just a bit wrong but magnificently, spectacularly, outrageously wrong. Is that slander? Tortious interference? Simply a mistake? I’m sure some lawyer, somewhere, knows the answer. Maybe that lawyer needs to talk to some engineers and marketing people.

Let’s flip the tables:

If NetApp went ahead and simply claimed an EMC CX4-960 can only really hold 450TB, what would EMC do?

I can only imagine the insanity that would ensue.

I’ll finish with something simple from the customer standpoint:

NetApp sold 1 Exabyte of enterprise storage last year. If it were as bad as the other (obviously worried) vendors are saying, does that mean all those customers buying it by the truckload and getting all those efficiencies and performance are stupid and wasted their money?

D

Thu
14
Jan '10

Pillar claiming their RAID5 is more reliable than RAID6? Wizardry or fiction?

Competing against Pillar at an account. One of the things they said: That their RAID5 is superior in reliability to RAID6. I wanted to put this in the public domain and, if true, invite Pillar engineers to comment here and explain how it works for all to see. If untrue, again I invite the Pillar engineers to comment and explain why it’s untrue.

The way I see it: very simply, RAID5 is N+1 protection, RAID6 is N+2. Mathematically, RAID5 is about 4,000 times more likely to lose data than a RAID6 group with the same number of data disks. Even RAID10 is about 160 times more likely to lose data than RAID6.
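To illustrate why N+2 beats N+1, here is a minimal sketch using a plain binomial model: data is lost when more disks fail in the same window than the RAID level tolerates. The group size and the per-disk failure probability are hypothetical placeholders, and failures are assumed to be independent. This simplified model is not the one behind the 4,000x and 160x figures above and will not reproduce them; those depend on rebuild times, drive MTBF, unrecoverable read errors and so on.

```python
from math import comb


def p_data_loss(n_disks: int, tolerated: int, p_fail: float) -> float:
    """Probability that more than `tolerated` of `n_disks` fail in one window.

    Assumes independent failures, each with probability `p_fail`
    (a simplification; real-world failures are often correlated).
    """
    p_survive = sum(
        comb(n_disks, k) * p_fail**k * (1 - p_fail) ** (n_disks - k)
        for k in range(tolerated + 1)
    )
    return 1 - p_survive


DATA_DISKS = 14   # hypothetical number of data disks per group
P_FAIL = 0.001    # hypothetical per-disk failure probability per window

raid5 = p_data_loss(DATA_DISKS + 1, tolerated=1, p_fail=P_FAIL)  # N+1
raid6 = p_data_loss(DATA_DISKS + 2, tolerated=2, p_fail=P_FAIL)  # N+2

print(f"RAID5 loss probability: {raid5:.3e}")
print(f"RAID6 loss probability: {raid6:.3e}")
print(f"Ratio RAID5/RAID6 in this toy model: {raid5 / raid6:.0f}x")
```

Whatever inputs you plug in, the extra parity disk moves the loss probability from "two failures" territory to "three failures" territory, which is the whole point.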

The only downside to RAID6 is performance – if you want the protection of RAID6 but with extremely high performance then look at NetApp: the RAID-DP that NetApp employs by default has, in many cases, better performance than even RAID10. Oracle has several PB of DBs running on NetApp RAID-DP. Can’t be all that bad.

See here for some info…

D

Sat
9
Jan '10

What if you could dramatically improve your application testing times? What would happen to your productivity and to the company’s bottom line?

So, let’s say the DBA (or insert some other discipline) wants to do some testing for a new product (known to happen occasionally) – and the way he would really like to test is to create 20 test cases, which requires 20 copies of the main database. He would then automate the test and therefore get results very quickly.

He approaches the storage admin with the problem, only to be told this isn’t possible since there isn’t enough space on the array. The DBA goes back to his cube frustrated, and figures out some ghetto way of creating at least 1 copy of the database, which creates the following problems:

  1. He has to figure out a way to do it (takes time)
  2. He can only test 1 case at a time (time)
  3. He cannot easily compare what-if scenarios between test cases (lack of flexibility)
  4. His ghetto way of doing it may involve single 1TB disks in a workstation (lack of reliability, time)

Ultimately, the testing takes longer, is error-prone, and the DBA’s productivity level goes way down.

What if the storage admin could, instead, tell the DBA that he can even take hundreds of copies of the DB, there’s no issue doing that?

What would happen to the DBA’s productivity?    

What new ideas would he be able to come up with?

How would that affect the quality of the product?

How would that affect the company’s bottom line? Being able to go to market with improved quality and quicker than the competition?

You see, intelligent storage – intelligently deployed – can solve many more problems than just “give me some space” or “give me more performance”.

There aren’t many technologies out there that can comfortably do this, which is probably why most storage people aren’t aware of this. But an array that can create space- and performance-efficient application-consistent DB clones is the ticket. Being able to create full copies and/or virtual space-efficient copies that end up being unusably slow doesn’t count… :)

The only vendor I know of that can pull this off (properly) is NetApp with their FlexClone technology. One can even use it to deploy thousands of identical VMs… there are some use cases for that, too :)
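The space argument behind the 20-copies scenario can be sketched with simple arithmetic. This is a simplified model of space-efficient clones in general (a clone shares unchanged blocks with its parent and only consumes space for blocks that change), not a description of FlexClone internals. The database size and the changed fraction are hypothetical.

```python
def full_copy_space_tb(db_tb: float, copies: int) -> float:
    """Space needed if every test case gets a full physical copy."""
    return db_tb * copies


def thin_clone_space_tb(db_tb: float, copies: int, changed_fraction: float) -> float:
    """Space needed if each clone only stores the blocks it changes."""
    return db_tb * changed_fraction * copies


DB_TB = 2.0        # hypothetical size of the main database
COPIES = 20        # the 20 test cases from the example above
CHANGED = 0.05     # hypothetical: each test rewrites 5% of the data

print(f"Full copies: {full_copy_space_tb(DB_TB, COPIES):.1f} TB")
print(f"Thin clones: {thin_clone_space_tb(DB_TB, COPIES, CHANGED):.1f} TB")
```

With full copies the storage admin has to find 20x the database size; with clones the bill depends only on how much each test actually changes.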

Activision (the company that makes the famous Guitar Hero game) is a good example of using this technology to rapidly accelerate development – and ended up making the Christmas deadline, which resulted in several more millions in sales. See here.

Oracle is another small company that uses this technology pervasively.

If anyone else knows of more vendors that can do this (properly) please chime in.

D

Should techies or business owners decide on technology (or both)?

It’s no secret that, in most companies, the technology folks are primarily the ones deciding on which new technologies to adopt – after all, they are the ones that understand the technology, right? Business owners explain the business problem to the technologists, and the techies take it from there – and ultimately present 2-3 different solutions that will work and the business picks the cheapest.

This could be great – if it weren’t for the fact that, like everyone, techies have their own agenda, which ends up tainting the decision process. Consider some of the following:

  • Comfort level with existing vendor (if it ain’t broke why fix it? This assumes all of the vendor’s products work equally well)
  • Job security (“why learn something new? Maybe they’ll hire someone that already knows this!”)
  • Delusions of grandeur (“I have the power!”)
  • Fear (“it sounds amazing, but what if the stuff doesn’t work?”)
  • Disbelief (“my current gear can’t do this, there’s no way this new stuff is that good!”)
  • Laziness (“you mean I have to test this new stuff? It cuts into my online gaming time!”)
  • Envy (“my buddy at this other company has this stuff, I must have something cooler/bigger!”)
  • Lack of time (“I really don’t have the time to test this new stuff!”)
  • Vendor kickbacks (we all know it happens in one form or another, and to the perennially under-paid techies, an expensive gift may be something they will never otherwise be able to afford, so it gains huge importance in their eyes)
  • The inability to grasp the real business drivers
  • The inability to think strategically
  • Being wowed by “cool” features that are of dubious business importance (see other post here)
  • Conversely, not understanding features that could be of immense business importance, that could save the company millions and increase productivity tenfold.

Of course, someone like a CIO or CTO normally acts as the bridge that spans the techie and business worlds, but of course that doesn’t always work (see here).

The only way around the issue is to create a new decision process for the company, one that involves all the interested parties from all departments. As complex as it may sound, this does work, and most of the time new ideas/issues get unearthed (“what do you mean my database is not backed up now?” or “what do you mean it would take 2 weeks to recover my lab environment?”)

Try it, you may be surprised at what happens!

D

Fri
4
Dec '09

Is EMC under-sizing RecoverPoint and Avamar deals to win business?

It’s been a while since I wrote anything – unlike some, I actually have a day job! Well, at least that’s my excuse.

My admiration for RecoverPoint is well known (see older post, which is referenced internally within EMC as a great pro-RecoverPoint article). It really is a good product and, next to VMware, my favorite EMC acquisition.

So it incenses me when I see a good product being misconfigured, and reminds me of Hanlon’s Razor: “Never attribute to malice that which can be adequately explained by stupidity”. You see, I’d rather chalk this up as sales not knowing what they’re doing rather than assume that EMC knows full well the ramifications of their decision and goes ahead and does the dirty deed anyway.

However, I’ve seen multiple cases recently where RecoverPoint and/or Avamar were most decidedly incorrectly sized to support the customer’s workload. The customer likes the price and goes for the solution, only to be in for a nasty surprise later on. Not to worry, everything can be fixed with some more boxes, licenses and hard disks! After all, it’s tough and expensive to rip the stuff out!

To start with RecoverPoint: it can be a wonderful DR tool but, like any tool, needs to be used correctly in order to be most effective. For instance, there are several aspects when designing a RecoverPoint solution:

  • One needs to take into account the sustained throughput each device can handle (minuscule when compared to the total bandwidth of a CX4 or V-Max), and add extra devices in order to comfortably sustain the throughput the customer needs – even if that means you go beyond the 2-device-per-site RecoverPoint SE maximum and into the realm of “full” RecoverPoint (which can do more than 2 appliances per site, for added performance).
  • To expand on the previous point, assume that one of the RecoverPoint devices is “gravy” and is there to fail over if another box breaks. So, you effectively don’t want to be relying on having the full complement of RecoverPoint boxes working. This is especially important in 2-box RecoverPoint SE configs. If one box breaks (and they’re plain Dell 1950 servers) then that should not be debilitating to your performance while you’re waiting for a new box.
  • Licensing is capacity-based, which also needs to be explained to the customer (including what it means price-wise if you go beyond what RecoverPoint SE will support).
  • There is an absolute ceiling for TB replicated
  • There’s a different price depending on whether you want to do local only, remote only or both kinds of replication (CDP, CRR and CLR licenses)
  • Beware of the increased I/O on the array! When doing any kind of traffic through RecoverPoint, at the very least you get quite a bit more I/O on the “journal” (the redo log part of RecoverPoint) in addition to your main disk. If you want to also do local recovery, you could be doing as much as 3x the I/O! You see, you have to send the normal copy of the data through first, then Clariion splits off the I/Os to RecoverPoint, which then writes data to a full local mirror, then also to the journal. Obviously, the array needs enough fast disks to cope with this.
  • As a corollary to the previous point, to do CDP you need at least 2x the space plus a percentage for the journal (depends on the change rate and how far back in time you want to be able to go to)
  • Additionally, you can’t present multiple clones of the data simultaneously, from different points in time – you have to do them one at a time. Could be important in some use cases.
  • Creating a full-speed-access snapshot of your data can take quite a while, again could be important in some cases.
  • Last but not least – RecoverPoint, while efficient, is still subject to the laws of physics, so if you are told you’ll get zero RPO/RTO over a multi-thousand-mile link, stop what you’re doing, email me and I’ll overnight you an industrial-strength cattleprod, gratis… which you can then use on the rep in question.
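A few of the sizing points above can be turned into a rough checklist calculation. This is only a sketch of the reasoning in the bullets: the throughput figures, journal fraction and host I/O numbers are hypothetical placeholders, not RecoverPoint specifications, and the function names are made up for the example. The 2-appliance limit is the RecoverPoint SE per-site maximum mentioned above.

```python
from math import ceil

SE_MAX_APPLIANCES_PER_SITE = 2  # RecoverPoint SE limit quoted in the text


def appliances_needed(required_mb_s: float, per_appliance_mb_s: float,
                      spares: int = 1) -> int:
    """Appliances to sustain the load, plus spare(s) so one failure doesn't hurt."""
    return ceil(required_mb_s / per_appliance_mb_s) + spares


def cdp_capacity_tb(production_tb: float, journal_fraction: float) -> float:
    """2x the space (production + full local mirror) plus the journal."""
    return 2 * production_tb + production_tb * journal_fraction


def worst_case_array_writes(host_write_iops: int) -> int:
    """With local recovery: production write + local mirror write + journal write."""
    return 3 * host_write_iops


# Hypothetical customer numbers, for illustration only.
n = appliances_needed(required_mb_s=150, per_appliance_mb_s=80)
print(f"Appliances per site: {n} (beyond SE limit: {n > SE_MAX_APPLIANCES_PER_SITE})")
print(f"CDP capacity: {cdp_capacity_tb(10, 0.2):.1f} TB for 10 TB of production data")
print(f"Array write IOPS: {worst_case_array_writes(2_000)} for 2,000 host write IOPS")
```

If the quote in front of you shows fewer boxes, less disk or slower spindles than a calculation like this suggests, that is the moment to start asking questions.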

So – all I’m saying is, ask all the right questions before sending that PO over…

Avamar is a different case altogether. It’s a dedup backup appliance that dedupes the data before it’s sent over the network. It’s very efficient at doing rapid backups over poor WAN connections. You don’t have to pay per-client fees, it supports most major OSes and applications, and is fairly easy to use. However – the original use case for the product was doing centralized backups of multiple small remote sites that are connected via poor links, and it still excels in that. Doing backups of large datasets at the datacenter, on the other hand, is not really what it was designed to do, yet I see it positioned in such a way.

I also see EMC selling really, really small Avamar configs (1-2 boxes), the hope being that dedup will be so effective that it’ll all be a wash in the end. Well – deduplication, in general, is the ultimate “it depends” solution!

Here are some considerations:

  • Not all data deduplicates equally! Make sure you run the EMC dedup estimator not just on fileserver data but also on your DBs! (DBs don’t really dedupe well, and media files and in general anything compressed dedupes even worse). Make sure you really get a good sample of your data analyzed, ideally all of it if possible.
  • If the sizer and dedup tool have only been run for plain fileserver data and that’s not what you have, don’t believe anything you see…
  • Explain your desired retentions and insist you see the Avamar sizer results. A good rule of thumb is that if your data is 5TB, then even with dedupe and compression, you’ll still need about 5TB once you factor in retention, unless you’re one of those rare cases that had tremendous duplication to begin with.
  • Make sure you understand the ramifications of not going to the RAIN grid in the first place – if you get a couple of Avamar boxes they can’t be part of the RAIN architecture, and if you lose one then the entire system is down hard. If you have RAIN, you could lose an entire node and it will be OK (kinda like RAID5 for servers) but migrating from non-RAIN to RAIN is non-trivial. Ask for the details. Ideally, even if you don’t need enough capacity to go RAIN, just buy the appliances to go RAIN but don’t buy the capacity licenses (i.e. you could buy 1TB of capacity yet have 5 nodes that theoretically can have a bunch more capacity).
  • Figure out if you want fast backups or fast recovery or both, and choose product accordingly (the fastest recovery is always replication/snapshots of primary data). Remember – usually, the desired end result is to recover, not to back up!
  • Understand exactly how Avamar can go to tape – the solution is not clean and it’s excessively slow. The product is really meant for those that want to go tapeless.
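To see why running the estimator on fileserver data alone is misleading, here is a small sketch that weights the dedupe ratio by data type. The sizes and ratios are hypothetical placeholders chosen only to follow the ordering described above (fileserver data dedupes best, DBs worse, media worst); they are not Avamar measurements, and retention is not modeled at all.

```python
def deduped_size_tb(datasets: dict) -> float:
    """Sum of size / ratio over all data types.

    `datasets` maps a name to (size_tb, dedupe_ratio), where a ratio
    of 4.0 means 4:1 and 1.0 means no reduction at all.
    """
    return sum(size / ratio for size, ratio in datasets.values())


# Hypothetical 5 TB environment with a realistic mix of data types.
mixed = {
    "fileserver": (2.0, 4.0),
    "databases": (2.0, 1.5),
    "media": (1.0, 1.0),
}

# The same 5 TB, sized as if it were all fileserver data.
optimistic = {"everything": (5.0, 4.0)}

print(f"Sized on fileserver sample only: {deduped_size_tb(optimistic):.2f} TB")
print(f"Sized on the real data mix:      {deduped_size_tb(mixed):.2f} TB")
```

The gap between the two numbers is exactly the kind of nasty surprise that shows up after the PO has been sent, and that is before retention is factored in.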

That’s all I have for now.

D

 

Sat
15
Aug '09

Should your backups to disk consume more disk than you use for production? Seriously?

So, let’s talk about this not-so-hypothetical customer… They have:

  • A few sites
  • A lot of data per site
  • Much of the data is DBs and Multimedia
  • No replication currently
  • Can’t back up everything currently
  • No proper DR
  • Fairly significant rate of change
  • Not the fastest pipes between sites

They asked me to propose a solution that will back everything up and cross-replicate the backups between the sites. They want to move as far away from tape as possible.

After much deliberation and examination of the data and requirements, we concluded that backing everything up to disk (while sticking to their requirements) would need about 3x the total amount of production space, even with various kinds of dedupe (I sized the solution with best practices for the usual suspects). The culprits are the rate of change and the large amount of data with poor undedupability (that can’t possibly be a word).

So, we declined to propose a solution. I want to sell something as much as the next guy but primarily I want repeat customers and the only way to get a happy repeat customer is to not screw him the first time… And selling them 3x the space only for backups doesn’t make too much sense to me when they could be spending their money much more wisely.

I explained how it doesn’t make sense to spend that kind of money on disk that’s just for backups! After all, backups are a last resort. My list of preferred methods for recovery (from best to worst):

  1. Local and remote replication + application-aware snapshots
  2. Backups to disk
  3. Backups to tape
  4. Snot, a claw hammer, duct tape and baling wire (sometimes actually works better than tape but anyway…)

Wouldn’t it be a slightly better idea to use maybe 2x the disk, possibly even spend less money compared to the backup-only solution, and instead:

  • Cross-replicate the production data for rapid recovery
  • Achieve full local and remote DR
  • Be able to go back in time with snapshots both locally and remotely
  • Replicate the snapshots themselves automatically
  • Still get dedupe but this time on primary storage (make the current storage last longer)
  • Not need a forklift upgrade (investment protection)
  • Reduce or eliminate tape and reliance on the backup software
  • Get even longer retention than with backups to disk
  • No pipe upgrades
  • Drastically simplify administration
  • Potentially save millions over the next few years!

We’ll see what they decide to do. There was tremendous resistance to what I and a horde of seasoned engineers believe is the proper solution, with all kinds of very reasonable excuses being voiced (“we have no time, no resources, the stakeholders don’t care” etc). However, my position on this is clear. Yes, there’s more short-term pain in order to transform the infrastructure to the utopian vision of the bullets above, but the long-term gains are staggering!

I’ll let everyone know what happened the moment I hear. This one is really interesting…

D

Wed
29
Jul '09

Ease Of Use, Backup and Recovery And Efficiency in Modern Disk Arrays – What Questions Should You Really Be Asking the Vendors?

It’s interesting how many storage vendors claim their products are easy to use and, indeed, show nice canned demos full of wizards and elves and whatnot that seem to impress most. There are also grandiose claims of magically reliable hardware and other pixie dust… Ultimately, the reality is that:

  1. Most modern arrays, as long as you’re comparing like to like (i.e. from the same class, same kind of RAID), properly configured, will be reliable enough for most uses
  2. Similarly to #1 above: aside from insane marketing cache IOPS (a certain prominent vendor quotes IOPS numbers not even from cache but from the buffers of the FC ports, how realistic is that?) performance is not crazily different between similar-class boxes with similar numbers of disks. Ultimately, cache runs out and you need to hit the spindles… (so, boxes that can contain gigantic amounts of cache such as NetApp with PAM cache boards or EMC’s V-Max with multiple engines have a leg up there)
  3. There Is No Magic
  4. Almost everyone is using the same bits internally (CPUs, disks, RAM…) – with some key enhancements here and there. Don’t let the exact hardware details cloud your judgment. A good example: Let’s say Array X has 2 CPUs at 2GHz and Array Y has 2 CPUs at 3GHz. Unless the arrays come from the same manufacturer and run the same code, it’s VERY difficult to compare. Even if the CPUs are exactly the same, it’s tough to compare. The reason? Running anything (let’s pick Oracle) on the exact SAME hardware may produce wildly different benchmarks depending on whether the OS is Solaris, Linux or Windows, the tunings employed, and whether it’s 64- or 32-bit - the variable here being the OS.
  5. It all comes down to the intelligence, efficiency and reliability of the array software

Some business-related questions to ask the vendor:

  1. How is the support? Is it outsourced or not?
  2. Is the company viable? Is it profitable? Growing? Or is it tiny, struggling and depends on a single cool feature to woo prospects?
  3. References? Are the customers loving it or is it just OK? A cool one I heard today: “since I stopped using <TLA> my blood pressure dropped”…
  4. What large companies are using the technology? It’s one thing to have a reference from a mom-and-pop shop, and another to have one from Oracle, Microsoft etc. (and have multiple PB deployed inside those large companies)
  5. How many PB is the vendor deploying daily? How many total installations are CURRENTLY UNDER SUPPORT? I don’t care how many there have been since the company’s inception, since a “since inception” number includes people that got RID of the solution.
  6. Is the vendor OK with giving a performance guarantee (e.g. that, based on your workload, you will get 100,000 IOPS) and giving you a 100% refund if they fail to meet the metrics?
  7. To expand on the previous item: Is the vendor OK with doing a “Right of Return” - let you return the box if it doesn’t meet some agreed-upon criteria?
  8. Is the vendor OK with doing a Proof-of-Concept?

The prospective customers should probably ask for a bit more detail – and focus on things that will be statistically more important day-to-day than cool features of debatable real-world use. Some technical questions I’d ask:

  • Can I add drives on my own? Easily? Or do I need PS?
  • What requires downtime? Why?
  • What protocols does the array support? Can I use whatever or am I locked in?
  • Do I need extra appliances to support more protocols or are they all truly built-in?
  • Can I expand the ports?
  • Can I switch a LUN so it’s presented via iSCSI instead of FC (or vice versa)?
  • How do I do stuff like add drives to a RAID group? Is it on-the-fly? Do I need to destroy the RAID group?
  • Do I need to add disks in groups or can I add 1-2 if I want?
  • How much realistic protection do the available RAID schemes afford me? And what do I give up?
  • Can I lose any 2 drives in rapid succession without losing data? (dual-drive-loss has happened to various people I know and to me twice, it’s not as rare as you’d think. I lost data…)
  • Does RAID6 result in a performance decrease?
  • What is the real usable capacity, after RAID, based on real disk capacities (base-2 not base-10) and not marketing? You see, a 1TB drive doesn’t really offer 1TB…
  • Explain all the overheads in the system if best practices are followed - in some systems, even after RAID, 10TB usable is more like 5TB usable…
  • How easy is it to have a LUN span multiple disks in multiple RAID groups for performance? Meaning, in practical terms – do I need to worry about the back-end or will the system just take care of it for me?
  • Do I need to worry about adding disks in certain multiples, especially when dealing with such spanned LUNs?
  • Can I move stuff around the array?
  • How quick is the rebuild of drives?
  • Does the array detect impending failures and fail drives before they actually fail, in order to avoid a parity rebuild?
  • Do I need to care and know a lot about the back-end in order to optimize performance?
  • Is it easy to set up replication? Do I need extra appliances? Can I use FC and/or IP?
  • What’s the replication delta? (some arrays have a pretty huge minimum chunk they need to send over, can affect RPO)
  • Is compression supported for replication?
  • Regarding replication (both local and remote): Can I set up logical LUN groups that get treated as one in order to maintain consistency?
  • Can I grow a LUN?
  • Can I shrink a LUN?
  • Can I do it all from 1 place (and when I say all I mean all the way to having the LUN visible in the OS as a Filesystem, complete with proper partition offsets) OR do I need to visit like 3 different interfaces? Most vendors focus on the creation of a LUN. Easily creating a LUN is only a small piece of the puzzle!
  • Can I multi-purpose my disks or do I need to dedicate some to NAS, some to FC?
  • Can I prioritize my I/O?
  • Can I prioritize and tune my cache?
  • Do thin provisioning and snapshots adversely impact performance?
  • How many snapshots can I keep?
  • Can I keep a snapshot for, say, a year without messing up my performance and without needing a ton of space?
  • Can I use snapshots to clone LUNs so that they can be used to rapidly provision, say, servers or VMs without occupying too much space?
  • How easily and quickly can I backup and restore?
  • What kind of application integration is available? Some vendors offer basic VSS integration for Windows, but can I, say, recover individual emails and clone DBs without needing to use my backup application? How easily and quickly?
  • What about integration with applications that aren’t on Windows and may even be custom? Is it easy to properly integrate them?
  • Can I increase the cache size if needed? By how much?
  • Can I tier my data?
  • Does it work with the primary backup apps and VMware SRM?
  • Can I get data encryption?
  • Can I get data compression for all kinds of data?
  • Can I get deduplication? And does it work for backup data only or also for my production data so I can save space?
  • What is the deduplication impact?
  • Can I script operations if I want to?
  • What kind of reporting and data gathering is available?
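
To put some numbers behind the usable-capacity questions above, here is a rough sketch of the arithmetic. It is only an illustration: the function name, the single-RAID-group layout and the overhead figures (parity drives, spares, a reserve percentage) are my own assumptions, not any vendor’s actual numbers, so plug in whatever your vendor tells you.

```python
# Hypothetical sketch: the layout and overhead figures are illustrative
# assumptions, not taken from any specific array.

TB = 10**12   # what the drive label means by "1TB" (base-10)
TIB = 2**40   # what most operating systems actually report (base-2)


def usable_tib(drive_count, drive_tb, parity_drives, spare_drives=0,
               reserve_fraction=0.0):
    """Estimate usable capacity in TiB for one RAID group.

    drive_tb         -- marketing (base-10) capacity of each drive, in TB
    parity_drives    -- e.g. 1 for single parity, 2 for dual parity
    spare_drives     -- hot spares that hold no data
    reserve_fraction -- share held back for snapshots, metadata and so on
    """
    if not 0.0 <= reserve_fraction < 1.0:
        raise ValueError("reserve_fraction must be in [0, 1)")
    data_drives = drive_count - parity_drives - spare_drives
    if data_drives <= 0:
        raise ValueError("no drives left for data")
    raw_bytes = data_drives * drive_tb * TB
    return raw_bytes * (1.0 - reserve_fraction) / TIB


# A lone "1TB" drive, before any RAID or other overhead:
print(f"{usable_tib(1, 1.0, 0):.3f} TiB")  # 0.909 TiB

# 14 x 1TB drives, dual parity, 1 hot spare, 20% held in reserve (made-up mix):
print(f"{usable_tib(14, 1.0, 2, 1, 0.2):.2f} TiB")
```

Ask the vendor for every term that goes into a calculation like this for their box; the more overheads they add, the further the last number drifts from the 14TB you thought you bought.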

This is not even a comprehensive list and I’m sure everyone has their own (if you haven’t written your own list down I suggest you do!) but represents what I feel are features that are realistically valuable.
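
Going back to the replication delta question in the list: a quick sketch of why the minimum chunk size matters for RPO. Everything here is an assumption for illustration (the function names, the write pattern, the chunk sizes and the 100 Mbit/s link); it ignores compression and protocol overhead and simply counts how many whole chunks get dirtied by small scattered writes.

```python
# Hypothetical sketch: write pattern, chunk sizes and link speed are made up.

KIB = 1024
MIB = 1024 * KIB


def replication_payload_bytes(write_offsets, write_size, chunk_size):
    """Bytes to send if the array can only replicate whole chunks.

    write_offsets -- byte offsets of the writes since the last replication cycle
    write_size    -- size of each write, in bytes
    chunk_size    -- smallest unit the array is able to ship
    """
    dirty_chunks = set()
    for offset in write_offsets:
        first = offset // chunk_size
        last = (offset + write_size - 1) // chunk_size
        dirty_chunks.update(range(first, last + 1))
    return len(dirty_chunks) * chunk_size


def transfer_seconds(payload_bytes, link_megabits_per_sec):
    """Time to push the payload over the link, ignoring all overheads."""
    return payload_bytes * 8 / (link_megabits_per_sec * 1_000_000)


# 1,000 scattered 4 KiB writes, each landing 1 MiB apart.
offsets = [i * MIB for i in range(1000)]

for chunk in (4 * KIB, 1 * MIB):
    payload = replication_payload_bytes(offsets, 4 * KIB, chunk)
    secs = transfer_seconds(payload, 100)
    print(f"chunk {chunk // KIB:>5} KiB: {payload / MIB:8.1f} MiB, {secs:6.1f} s")
```

Same 4MB or so of real changes in both cases, but with the big chunk the array has to ship hundreds of times more data, and that extra transfer time comes straight out of your RPO.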

What do you think? Comments always welcome…

D


Sat
13
Jun '09

About the Data Domain acquisition – and is EMC really the best place for Data Domain?

Much has already been written about this imminent acquisition of Data Domain by either NetApp or EMC and, since opinions are like you-know-what, and I have one, here it is… if I ramble, forgive me. I have too much to say and I’m trying to be PC… I wrote and subsequently erased all kinds of stuff that could probably get me in trouble (the more you work with a company the more dirt you uncover, and I have several earth movers’ worth).

I do think that both companies waited too long to try and acquire Data Domain – frankly, it’s staggering to me that other companies that make decent products like CommVault haven’t been acquired yet (I mean, seriously, if EMC want to compete in the backup software space they should just drop Networker and buy CommVault). Consolidation is the trend…

Maybe both NetApp and EMC thought their in-house deduplication would work out for everything, maybe they thought Data Domain wouldn’t become a contender. Maybe they thought it was just a phase. Either way, the backup market is still strong, most people don’t want to move en masse to something like Avamar, not everyone needs VTL, and Data Domain does provide a very convenient way to keep using your existing backup product, make next to no changes, and get better efficiencies.

The simple truth is that EMC needed SOMETHING to combat Data Domain so they signed the agreement with Quantum and rushed the product to market. And then tried to strong-arm the resellers into forgetting about Data Domain and instead selling the new and amazing DL3D (that backfired BTW).

As far as EMC is concerned, the attempt to acquire Data Domain is a slap in the face for Quantum and all the customers that have been pitched/sold DL3D (the OEM’ed Quantum DXi product). EMC has spent quite a bit of time belittling Data Domain and instead pushing a product that has seen very limited testing (I know, I’ve been burned personally by it several times). A good example: EMC recently released a patch to allow backups done with EMC’s Networker to actually be deduplicated (talk about a reason to return a product if there ever was one – like a car that can’t go faster than 10 mph or that gets 2 mpg instead of 20 mpg). You see, there was an issue with the filter that figures out what backup app you’re using, and Networker backups were getting only plain old compression, NO deduplication. This is no secret, if anyone bothers to read the release notes of the recent patches they’ll see this info. Maybe if you’re a DL3D customer you should insist on reading the release notes if they’re not easily available? After all, you have a right to know what’s changing!

Think about this: EMC’s own backup product was not tested with DL3D. Yet EMC happily sold DL3D to customers with Networker. To me, this is a sales-driven company, not a customer-driven company.

Not to mention other crippling bugs, slow startup times (especially in the case of unclean shutdowns) and the abysmal performance which simply stems from how the product is designed – it’s spindle-happy and needs about 2 trays of drives to work well. Oh, and don’t EVER fill it beyond 80% capacity. You’re also not supposed to use it as a normal CIFS/NFS share for archiving anything like email or normal files (arguably a great place for dedup).

So, EMC knew about the DL3D issues (well, some of them, it’s not their product after all, indeed I helped them identify some of the bugs) and played coy with customers. Then, they saw NetApp making a move for Data Domain and realized that by buying Data Domain EMC could accomplish several things:

  • Minimize NetApp’s cash reserves if NetApp does in the end succeed in acquiring Data Domain (but is that necessarily a bad thing for NetApp?)
  • Remove the flailing DL3D and replace it with a product that actually works and is selling very well
  • Get a bunch of solid deduplication and consistency checking algorithms
  • Assimilate a competitor that’s been a huge thorn in EMC’s side in that space
  • Reduce the efficiency of NetApp as a competitor

But think from the customer standpoint for a minute (most of the analysts so far seem to miss the most important player here – and that’s certainly not EMC, NetApp or Data Domain, but the customer). You’ve been pitched DL3D, and now you must forget about that and all the bad things you were told about Data Domain – it’s all good now that it belongs to EMC, you’ll be taken care of. Or you can buy the DL3D if you still want it (and I don’t see EMC derailing ANY existing DL3D campaign, no matter what).

If I were a DL3D prospect/customer, I’d be worried no matter what.

Let’s talk about the best place for Data Domain to end up. As far as investors go of course, if they want to make a quick buck and run, the EMC cash offer is tantalizing. But for Data Domain employees, EMC can be a black hole and the added complexity and bureaucracy anything but fun. EMC has become almost too diversified – let’s look at just some of EMC’s storage solutions (I won’t mention the software since then it’d be a REALLY long and weird post):

  • Symmetrix
  • Clariion
  • Celerra
  • Centera
  • Atmos
  • EDL
  • DL3D
  • RecoverPoint
  • Avamar (that’s both a software solution and an appliance)

What’s interesting is that, by and large, the teams in charge of the above products don’t talk much, if at all, with each other. Talk about islands! And, when it comes to sales, EMC has internally competing groups of people that sell the above products – for instance, “NAS overlay” guys only get paid on Celerra sales, and I’ve seen them screw up campaigns that were clearly a pure Clariion play just so they could somehow get some Celerra in so they get paid. The basic EMC sales guy you meet can sell them all and indeed doesn’t care, but the people he relies on for support cannot sell them all and do care about what gets sold. It’s all very fragmented and, again, not a model that operates with the customer’s best interests always in mind. It always baffled me why EMC would allow so much fluff in their sales organization.

So, if Data Domain got absorbed, they’d probably not be enjoying all the “melting pot” advantages the EMC corporate bloggers seem so keen on advertising, and the “large startup” feel (maybe it’s like that in MA for a few chosen people – in most other locations it’s decidedly not like that). They’d just be another acquired unit, internally competing with other units, dealing with large-company politics and other inefficiencies. The EMC stock wouldn’t really become much higher than it is now, if at all. It’s been about the same for quite some time now.

Let’s examine the scenario of NetApp buying Data Domain:

  • NetApp is much more focused than EMC – indeed they have literally less than a handful of major offerings that don’t really compete with each other
  • The NetApp sales force is unified and doesn’t internally compete about what to sell
  • NetApp culture is much closer to Data Domain culture
  • It’s not good for innovation to have one company hoarding 3 dedup technologies, NetApp + Data Domain will actually push EMC more and be better for the customers
  • Data Domain could make NetApp much stronger against EMC, in turn driving NetApp’s stock price up significantly. Which, in turn, would give investors back much more than $2bn, thereby making this the better deal.

The only drawback I see (as do most writing about this) is NetApp’s relatively poor history in managing the few acquisitions they’ve made. But I believe that as long as they leave Data Domain alone and slowly try to integrate the technology in the other products it will all work out.

Hopefully all this made some sense…

D

Mon
5
Jan '09

On current smartphones

The time has come for me to get a new phone (my current one can’t keep up with the demands and the speed or lack thereof ends up frustrating me).

So I’ve been looking at the plethora of devices out there - Berries, Windows Mobile, iPhone, etc (disclaimer: I’ve been a Blackberry user for many years now).

For me, the ideal mobile device needs to:

  • Synchronize seamlessly all my Exchange stuff
  • Be able to display PDF and office docs (not necessarily edit them)
  • Be a great phone (reception, clarity)
  • Have tethering ability
  • Be fast when I multitask on it
  • Offer GPS (almost all current ones do)
  • Have a decent supply of third-party apps
  • Be able to last me for a whole day (NOT just a business day) of pretty heavy usage
  • Have not so much an intuitive OS but an OS that doesn’t get in the way
  • Let me input text very, very fast (I’m writing this on my phone now)
  • Be tough (Mil-Spec would be great)
  • The ability to play music/videos is not essential but is nice to have (all do it now)
  • Camera nice-to-have but it doesn’t need to be amazing
  • Should be able to have a decent web browser
  • Shouldn’t be ridiculously huge…

So here’s the Executive Summary if your needs coincide with mine:

- Get a Blackberry Bold

For the nitty-gritty:

We HAVE to mention the iPhone, it’s a marvel of social engineering, industrial design and amazing marketing/branding. Of course the battery life utterly sucks if you try to use it the way I’d need to but, most importantly: I cannot type on the sucker! I don’t have abnormally large sausage fingers, indeed I believe my digits are downright elegant, yet I just cannot type fast or accurately on the iPhone (this paragraph might have taken me 10 mins to write on it). So we strike that one out.

Then you have the new Blackberry Storm, also touch-screen. On this one, the entire screen is a gigantic button that you need to press in order for it to register. I found that this approach seems to make it way more accurate for me than the iPhone. The battery life and build are also great. Too bad that the hardware can’t keep up, it feels decidedly slow, more so than the iPhone. Scratch that one too.

Then we have the narrow-form-factor Berries. Can’t type on them quickly. Out they all go.

Next are the various and sundry Windows Mobile devices. Almost too much choice here, huge third-party support, some great hardware from a few vendors. But I find that the OS really gets in my way and all of them also feel amazingly slow. Battery life is no great shakes, either.

Nokia has some good ones, the E71 is my favorite, but they don’t sync that elegantly with Exchange plus the keyboard is weird. Great build, though. If you like its keyboard go for it. OS can take some getting used to…

What remains is the Blackberry Bold. Sure, size-wise it feels like holding a slipper against your head (fortunately size was never a very important criterion) but it passes almost all the other tests! It also lets you send/receive emails while on the phone, feels fast, and has an amazing keyboard. Probably because it’s slipper-sized…

It’s well-made, but it’s so nice that it needs a decent case to ruggedize it and keep it looking nice, in which case it’ll look more like a size 13 boot against my face and I won’t be able to see just how nice it is anyway.

Am I alone in believing that many people would gladly pay a premium for a sleek, ruggedized device that doesn’t look like a Casio G-Shock? I’d be totally OK with the silly and easily disfigured plastic chromed bits being replaced by Kevlar or rubber, a scratch-proof screen, the ability to withstand immersion for 30′, successful drop-testing from 1 story to concrete, flexible circuit boards (not the ultra-thin ribbon type, you can get boards that are almost rubbery), Mil-Spec connectors, port covers…

It’s all possible, it just adds to the cost. But I guarantee most professionals will pay $100-200 more for the ruggedized model that doesn’t need clunky cases. Ericsson, Siemens and Nokia all had standard phones (never made it to the US) that would fit the above description with the exception of the scratch-proof screen (the Ericsson one was pretty amazing - they suggested you wash it to get rid of the dirt - albeit pretty large), but they slowly stopped making them. They weren’t even much more expensive than the plain models!

The old, thick Blackberries used to be pretty tough, I dropped mine onto concrete many, many times (drop-kicked it once) and the only damage was that the vibrating thingy inside stopped working 100% of the time, a no-no among Berry addicts. It did look scratched but it wasn’t painted on so the scratches weren’t that visible.

I hear the iPhone can be tough, at least the original one. The 3G - not so sure. A colleague had his stop working after he dropped it 3 feet. It landed on its back (should be an easy knock to absorb), you can’t even see a scratch on it. Unacceptable, IMO.

D

Tue
22
May '07

EMC World: Replication Manager and Exchange 2007

Just attended a session. Seems like the new rev of RM supports Exchange 2007 fully. They also support RecoverPoint clones (or will, later this week).

For whoever is not aware of it, EMC Replication Manager is like a front-end that manages local replicas of your salient Exchange data for the purposes of backup and restore.

Can be fiddly to set up but if you have EMC gear and Exchange, you really should look at it.

D

Mon
21
May '07

At EMC World

Currently attending EMC World. The first day bored me to tears, I hope the rest will be more exciting (though it utterly depends on the presenters). Some of the material is too introductory; even the advanced sessions are not that advanced.

More to follow.

D

Mon
19
Feb '07

So who am I?

Hello everyone,

My name is Dimitris Krekoukias.

This blog used to be on another server, I moved it here - hopefully this hosting facility will be more stable.

I resemble a silverback gorilla more than a monkey (man-pelt and all), and could probably wrestle one (and have a fair chance of winning).
I have extensive experience in the backup and recovery arena, and indeed know far more about certain products than I (or the vendors) would like to.
This blog will not be just about recovery - I have other interests, such as storage, OS design, tuning, filesystems, HPC, and other exotica. Plus a ton of non-IT-related hobbies - but that’s a story for another day.
Hopefully everyone will find this blog stimulating, controversial and, at times, annoying - in which case, tough.

D