Recovery Monkey: Musings on backups, storage, tuning and more


Wed 28 Feb '07

Just ate at Keens Steakhouse in NYC

Well, just finished the meal. Steak was ordered medium-rare, arrived medium, a bit chewy (but still tasty) and not hot. I was too tired to complain and ate military style (i.e. it was gone in a minute).

The 26oz ribeye I had at Wollensky’s a couple years ago was a religious experience, comparatively. That thing needed a butterknife, at most. Sometimes staring at it hard enough was sufficient to lop off a piece.

I admit I don’t have enough of a statistical sample for either joint.

Just thought I’d share this.

D

Mon 19 Feb '07

It’s all about data classification and searching

I don’t know if this has been discussed elsewhere but I felt like I had an epiphany so there…

The way I see it, in a decade or two the most important technologies regarding data will be data classification and search.

Consider this: At the moment, all the rage is archiving and storage tiers. The reason is that it simply is too expensive to buy the fastest disks, and even if you do buy them they’re smaller than the slower-spinning drives.

Imagine if speed and size were not issues. I know that’s a big assumption but let’s play along for a second… (let’s just say that there are plenty of revolutionary advances in the storage space coming our way within, say, 10-20 years, that will make this concept not seem that far-fetched).

Nobody would really care any longer about storage tiers or archiving. Backups would simply consist of extra copies of everything, to be kept forever if needed, and replicated to multiple locations (this is already happening, it’s just expensive, so it’s not common). Indeed, everyone would just let all kinds of data accumulate, and scrubbing would not be quite as frequent as it is now. Multiple storage islands would also be clustered seamlessly so they present a single, coherent space, compounding the problem further.

Within such a chaotic architecture, the only real problems are data classification and mining, i.e. figuring out what you have and actually getting at it. Where the data lives is not much of an issue - nobody cares, as long as they can get to it in a timely fashion.

I can tell that OS designers are catching on. Microsoft, of all companies, wanted a next-gen filesystem for Vista/Longhorn, that would really be SQL on top of NTFS, with files stored as BLOBs. It got delayed so we didn’t get it, but they’re saying it should be out in a few years (there were issues with scalability and speed).

Let’s forget about the Microsoft-specific implementation and just think about the concept instead (I’d use something like a decent database on raw disk and not NTFS, for instance). No more real file structure as we know it - it’s just a huge database occupying the entire drive.

Think of the advantages:

  1. Far more resilient to failures
  2. Proper rollbacks in case of problems, and easy rebuilding using redo logs if need be
  3. Replication via log shipping
  4. Amazing indexing
  5. Easy expandability
  6. The potential for great performance, if done right
  7. Lots of tuning options (maybe too many for some).

With such a technology, you need a lot more metadata for each file so you can present it in different ways and also search for it efficiently. Let’s consider a simple text document - you’re trying to sell some storage, so you write a proposal for a new client. You could have metadata on:

  • Author
  • Filename
  • Client name
  • Type of document - proposal
  • Project name
  • Excerpt
  • Salesperson’s name
  • Solution keywords, such as EMC DMX with McData (sorry, Brocade) switches
  • Document revision (possibly automatically generated)

… and so on. A lot of these fields can already be found in the properties of any MS Word document.

The database would index the metadata at the very least, when the file is created, and any time the metadata changes. Searches would be possible based on any of the fields. Then, a virtual directory structure could be created:

  • Create a virtual directory with all files pertaining to that specific client (most common way people would organize it)
  • Show all the material for this specific project
  • Show all proposals that have to do with this salesperson

… and so on.
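
To make the idea concrete, here is a minimal sketch of metadata-indexed storage with virtual directories, using Python's built-in sqlite3 module as a stand-in for the "decent database". The table layout, the helper names (open_store, add_file, virtual_directory) and the sample clients and salespeople are all hypothetical, and this is in no way how Microsoft's or anyone else's implementation works.

```python
import sqlite3

# Metadata fields taken from the proposal example; the schema itself is an assumption.
FIELDS = ("filename", "author", "client", "doc_type", "project", "salesperson")

def open_store():
    """Create an in-memory database that holds files as BLOBs plus their metadata."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE files (id INTEGER PRIMARY KEY, filename TEXT, author TEXT, "
        "client TEXT, doc_type TEXT, project TEXT, salesperson TEXT, content BLOB)"
    )
    # Index the fields people are most likely to browse by.
    for field in ("client", "project", "salesperson"):
        conn.execute(f"CREATE INDEX idx_{field} ON files({field})")
    return conn

def add_file(conn, content, **metadata):
    """Store a file's bytes together with whatever metadata is known."""
    row = {field: metadata.get(field) for field in FIELDS}
    row["content"] = content
    conn.execute(
        "INSERT INTO files (filename, author, client, doc_type, project, salesperson, content) "
        "VALUES (:filename, :author, :client, :doc_type, :project, :salesperson, :content)",
        row,
    )

def virtual_directory(conn, **criteria):
    """Return the filenames matching every criterion: a 'folder' that is really a query."""
    for field in criteria:
        if field not in FIELDS:
            raise ValueError(f"unknown metadata field: {field}")
    sql = "SELECT filename FROM files"
    if criteria:
        sql += " WHERE " + " AND ".join(f"{field} = :{field}" for field in criteria)
    sql += " ORDER BY filename"
    return [row[0] for row in conn.execute(sql, criteria)]

conn = open_store()
add_file(conn, b"...", filename="acme-proposal.doc", author="D", client="Acme",
         doc_type="proposal", project="SAN refresh", salesperson="Pat")
add_file(conn, b"...", filename="acme-design.doc", author="D", client="Acme",
         doc_type="design", project="SAN refresh", salesperson="Pat")
add_file(conn, b"...", filename="globex-proposal.doc", author="D", client="Globex",
         doc_type="proposal", project="Backup", salesperson="Lee")

print(virtual_directory(conn, client="Acme"))
print(virtual_directory(conn, doc_type="proposal", salesperson="Pat"))
```

The same file shows up in as many virtual directories as it has matching metadata, which is the whole point: the "folder" is just a saved query.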

Virtual folders exist now for Mac OS X (can be created after a Spotlight search), Vista (saved searches) and even GNOME 2.14, but the underlying engine is simply not as powerful as what I just described. Normal searches are used, and metadata is not that extensive for most files anyway (mp3 files being an exception since metadata creation is almost forced when you rip a CD).

It should be obvious by now that to enable this kind of functionality properly you need really good ways of classifying and indexing your data, and of actually creating all the metadata that needs to be there, as automatically as possible. Future software will probably force you to create the metadata in some way, of course.

Existing software that does this classification is fairly poor, in my opinion. Please correct me if I’m wrong.

The other piece that needs to be there is extremely robust search and indexing capabilities. Some of that technology is there (Google Desktop and its ilk) but natural language search has to be - well, natural, but unambiguous at the same time.

I hope you can now see why I believe these technologies are important. If Google continues the way it’s going, it may well become the most important company in the next decade (some might argue it’s the most important one already).

For any sci-fi fans out there, this is a good novel that’s a bit related to the chaotic storage systems of the future: http://www.scifi.com/sfw/books/sfw7677.html

D


Some clarification on the caching

Re the previous post:

If you want to use Supercache or UpTempo, the idea is that you take memory AWAY from the Windows/SQL/Exchange caches and give it to the fancy cache.

So, even on Windows Server, in the properties for “File and Printer Sharing for Microsoft Networks” (found in the properties for your network card, bizarrely enough), you could select “maximize throughput for network applications”. In the various apps you’d just minimize the cache (e.g. only 10MB for Exchange) and give the rest to Supercache/UpTempo.

Be aware that supercache is on a PER VOLUME basis, not global (its blessing and its curse at the same time). If you have a lot of volumes maybe just cache a few key data volumes, tempdb and the pagefile partition, or use uptempo, which allows you to allocate a single global cache pool that is then shared among the volumes you choose.

For SQL, using a RAM disk for tempdb seems to work even better.

Having seen the products work wonders with only 128MB dedicated to them, and bearing in mind that most servers have 4GB RAM or more, I’d say go nuts. I’d buy 4GB of RAM and make it cache in a heartbeat.

D



On deduplication and Data Domain appliances

One subject I keep hearing about is deduplication, the idea being that you save a ton of space since a lot of your computers hold identical data.
One way to do it is with an appliance-based solution such as Data Domain. Effectively, they put a little server and a cheap-but-not-cheerful, non-expandable 6TB RAID together, then charge a lot for it, claiming it can hold 90TB or whatever. Use many of them to scale.

The technology chops up incoming files into pieces (blocks). Then, the server calculates a unique numeric ID for each block using a hash algorithm.

The ID is then associated with the block and both are stored.

If the ID of another block matches one already stored, the new block is NOT stored, but its ID is, as is the association with the rest of the blocks in the file (so that deleting a file won’t adversely affect blocks it has in common with other files).

This is what allows dedup technologies to store a lot of data.
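
The mechanism described above can be sketched in a few lines. This is a toy illustration only: the DedupStore class, the fixed-size chunking and the choice of SHA-256 are assumptions made for the example, not a description of what Data Domain (or any other product) does internally.

```python
import hashlib

class DedupStore:
    """Toy block store: each distinct block is kept once, files are lists of block IDs."""

    def __init__(self, chunk_size=4096):
        self.chunk_size = chunk_size
        self.blocks = {}    # block ID -> block bytes (stored once)
        self.refcount = {}  # block ID -> how many times stored files reference it
        self.files = {}     # file name -> ordered list of block IDs

    def put(self, name, data):
        """Chop data into blocks, store only the blocks not seen before."""
        if name in self.files:
            self.delete(name)
        ids = []
        for offset in range(0, len(data), self.chunk_size):
            block = data[offset:offset + self.chunk_size]
            block_id = hashlib.sha256(block).hexdigest()
            if block_id not in self.blocks:
                self.blocks[block_id] = block
            self.refcount[block_id] = self.refcount.get(block_id, 0) + 1
            ids.append(block_id)
        self.files[name] = ids

    def delete(self, name):
        """Drop a file; a block is freed only when no other file still references it."""
        for block_id in self.files.pop(name):
            self.refcount[block_id] -= 1
            if self.refcount[block_id] == 0:
                del self.refcount[block_id]
                del self.blocks[block_id]

store = DedupStore(chunk_size=4)
store.put("a.txt", b"AAAABBBBCCCC")
store.put("b.txt", b"AAAABBBBDDDD")   # shares two blocks with a.txt
blocks_before = len(store.blocks)
store.delete("a.txt")                  # must not remove the shared blocks
blocks_after = len(store.blocks)
print(blocks_before, blocks_after)
```

Six blocks' worth of data fit in four stored blocks, and deleting a.txt frees only the one block that b.txt does not use.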

Now, why the amount you can store depends on your data:

If you’re backing up many different unique files (like images), there will be almost no similarity, so everything will be backed up.
If you’re backing up 1000 identical windows servers (including the windows directory) then there WILL be a lot of similarity, and great efficiencies.
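
A quick way to see both extremes is to measure the ratio of logical data to stored data. The sketch below is a rough illustration under stated assumptions: dedup_ratio is a hypothetical helper, blocks are fixed-size, and random bytes stand in for both the "server image" and the unique files.

```python
import hashlib
import os

def dedup_ratio(files, chunk_size=4096):
    """Return logical bytes divided by the bytes actually stored after dedup."""
    seen = set()
    logical = stored = 0
    for data in files:
        for offset in range(0, len(data), chunk_size):
            block = data[offset:offset + chunk_size]
            logical += len(block)
            block_id = hashlib.sha256(block).digest()
            if block_id not in seen:
                seen.add(block_id)
                stored += len(block)
    return logical / stored

# Stand-in for 1000 identical servers: the same 64KB image, 1000 times over.
server_image = os.urandom(64 * 1024)
identical_ratio = dedup_ratio([server_image] * 1000)

# Stand-in for unique files such as images: 10 unrelated random blobs.
unique_ratio = dedup_ratio([os.urandom(64 * 1024) for _ in range(10)])

print(f"identical servers: {identical_ratio:.0f}:1, unique files: {unique_ratio:.0f}:1")
```

Real data sits somewhere between the two, which is why any capacity claim for a dedup box only means something for a specific data set.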

Now the drawbacks (and why I never bought it):

The thing relies on a weak server and a small database. As you’re backing up more and more, there will be millions (maybe billions) of IDs in the database (remember, a single file may have multiple IDs).

Imagine you have 2 billion entries.

Imagine you’re trying to back up someone’s 1GB PST, or other large file, that stays mostly the same over time (ideal dedup scenario). The file gets chopped up into, say, 100 blocks.

Each block has its ID calculated (CPU-intensive).

Then, EACH ID has to be compared with the ENTIRE database to determine whether there’s a match or not.

This can take a while, depending on what search/sort/store algorithms they use.

I asked data domain about this and all they kept telling me was “try it, we can’t predict your performance”. I asked them whether they had even tested the box to see what the limits were, and they hadn’t. Hmmm.

I did find out that, at best, the thing works at 50MB/s (slower than an LTO3 tape drive), unless you use tons of them.

Now, imagine you’re trying to RECOVER your 1GB PST.

Say you try to recover from a “full” backup on the data domain, but that file has been living in it for a year, with the new blocks being added to it.

When requesting the file, the data domain box has to synthesize the file (remember, even the “full” doesn’t include the whole file). It will read the IDs needed to recreate it and put the blocks together so it can present the final file, as it should have looked.

This is CPU- and disk-intensive. Takes a while.
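
Here is a minimal sketch of why the restore has to touch every block, even when the latest backup added very little. The backup and restore helpers, the 4-byte blocks and SHA-256 are assumptions for illustration; no vendor's on-disk format is implied.

```python
import hashlib

def backup(data, store, chunk_size=4):
    """Store any new blocks; return the file's recipe (ordered IDs) and the count of new blocks."""
    recipe = []
    new_blocks = 0
    for offset in range(0, len(data), chunk_size):
        block = data[offset:offset + chunk_size]
        block_id = hashlib.sha256(block).digest()
        if block_id not in store:
            store[block_id] = block
            new_blocks += 1
        recipe.append(block_id)
    return recipe, new_blocks

def restore(recipe, store):
    """Synthesize the file by looking up every block in its recipe, in order."""
    return b"".join(store[block_id] for block_id in recipe)

store = {}
version1 = b"AAAABBBBCCCCDDDD"
version2 = b"AAAABBBBXXXXDDDD"   # one block changed since the first backup

recipe1, new1 = backup(version1, store)
recipe2, new2 = backup(version2, store)
restored = restore(recipe2, store)

print(f"first backup stored {new1} blocks, second stored {new2}")
print(f"restore of version 2 read {len(recipe2)} blocks")
```

The second backup writes a single block, but restoring it still means one lookup and one read per block in the recipe, scattered wherever those blocks happened to land over time.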

The whole point of doing backups to disk is to back up and restore faster and more reliably. If you’re slowing things down in order to compress your disk as much as possible, you’re doing yourself a disservice.

Don’t get me wrong, dedup tech has its place, but I just don’t like the appliance model for performance and scalability reasons.
EMC just purchased Avamar, a dedup company that does the exact same thing but lets you install the software on whatever you want.

There are also Asigra and Evault, both great backup/dedup products that can be installed on ANY server and work with ANY disk, not just the el cheapo quasi-JBOD data domain sells.

So, you can leverage your investment in disk and load the software on a beefy box that will actually work properly.

Another tack would be to use virtual tape. It doesn’t do dedup yet, but it will: EMC bought Avamar, and Adic (now Quantum) also acquired a dedup company and will put the stuff in their VTL, so you will get the best of both worlds. In the meantime it does compression, just like real tape.

Plus, even the cheapest EMC virtual tape box works at over 300MB/s.

I sort of detest the “drop at the customer site” model that Data Domain (and a bunch of the smaller storage vendors) use. They expect you to put the box in and, if it works OK, to find it easier to keep it than to send it back.

Most people will keep the first thing they try (unless it fails horrifically), since they don’t want to go through the trouble of testing 5 different products (unless we’re talking about huge companies that have dedicated testing staff).

Let me know what you think…

D


Do you need a VTL or not?

I first posted this as a comment on http://www.gotitsolutions.org but this is its rightful place.

Having deployed what was, at the time, the largest VTL in the world, and subsequently numerous other VTL and ATA solutions, I think I can offer a somewhat different perspective:

It depends on the number of data movers you have and how much manual work you’re prepared to do. Oh, and speed.

Licensing for VTL is now capacity-based for most packages (at least the famous/infamous/important ones like CommVault, Networker and NetBackup, not respectively).

Also, I’d forget about using VTL features such as replication and using the VTL to write directly to tape (unless you’re retarded, insane or the backup software is running ON the VTL, as is the case now with EMC’s CDL). Just use the VTL like tape. I’ve been so vehement about this that even the very stubborn and opinionated Curtis Preston is now afraid to say otherwise with me in the room… (I shut him up REALLY effectively during one Veritas Vision session we were co-presenting a couple years ago. I like Curtis but he’s too far removed from the real world. Great presenter, though, and funny).

Even dedup features are suspect in my opinion, since they rely on hashes and searches of databases of hashes, which progressively get slower the more you store in them. Most companies selling dedup (data domain, avamar, to name a couple major names) are sorta cagey when you confront them with questions such as “I have 5 servers with 50 million files each, how well will this thing work?”

Answer is, it won’t, even for far fewer files. Just get some raw-based backup method that also indexes, such as Networker’s snapimage or NBU’s flashbackup.

Dedup also fails with very large files such as database files.
I can expand on any of the above comments if anyone cares.

But back on the data movers (Media Agents, Storage Nodes, Media Servers):

Whether you use VTL or ATA, you effectively need to divvy up the available space.

With ATA, you either allocate a fixed amount of space to each data mover, or use a cluster filesystem (such as Adic’s Stornext) to allow all data movers to see the same disk.

With VTL, the smallest quantum of space you can allocate to a data mover is, simply, a virtual tape. A virtual tape, just like a real tape, gets automatically allocated, as needed.

So, imagine you have a large datacenter, with maybe 40 data movers and multiple backup masters.

Imagine you have a 64TB ATA array.

You can either:

1. Split the array into 40 chunks, and have a management nightmare
2. Deploy StorNext so all servers see a SINGLE 64TB filesystem (at an extra 3-4K per server, plus probably 50K more for maintenance, central servers and failover) - easy to deal with, but complex to deploy and more software on your boxes
3. Deploy VTL and be done with it.

For such a large environment, option #3 is the best choice, hands down.
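
For a sense of scale, the numbers for options 1 and 2 can be worked out directly from the figures above. The sketch only restates the post's own estimates (64TB, 40 data movers, 3-4K per server, roughly 50K of extras); the variable names are made up and no real price list is implied.

```python
ARRAY_TB = 64
DATA_MOVERS = 40
STORNEXT_PER_SERVER = (3_000, 4_000)  # the "3-4K per server" estimate
STORNEXT_EXTRAS = 50_000              # the "probably 50K more" estimate

# Option 1: a fixed slice of the array per data mover.
chunk_tb = ARRAY_TB / DATA_MOVERS

# Option 2: cluster filesystem licensing on every data mover, plus the extras.
stornext_low = DATA_MOVERS * STORNEXT_PER_SERVER[0] + STORNEXT_EXTRAS
stornext_high = DATA_MOVERS * STORNEXT_PER_SERVER[1] + STORNEXT_EXTRAS

print(f"Option 1: {DATA_MOVERS} chunks of {chunk_tb:.1f}TB each")
print(f"Option 2: roughly {stornext_low // 1000}K to {stornext_high // 1000}K extra for StorNext")
```

So option 1 means babysitting 40 separate 1.6TB pools that cannot lend space to each other, and option 2 adds somewhere around 170K to 210K on top of the array.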

With filesystems, you have to worry about space, fragmentation, mount options, filesystem creation-time tunables, runtime tunables, esoteric kernel tunings, fancy disk layouts, and so on. If you’re weird like me and thoroughly enjoy such things, then go for it. As time goes by though, the novelty factor diminishes greatly. Been there, done that, smashed some speed records on the way.

What’s needed in the larger shops, aside from performance, is scalability, ease of use and deployment, and simplicity.

With VTL, you get all of that.

The other issue with disk is that backup vendors, while they’re getting better, impose restrictions on the number of streams in/out, copy to tape and so on. No such restrictions on tape.

One issue with VTL: depending on your backup software, setting up all those new virtual drives etc. can be a pain (esp. on NBU).
For a small shop (fewer than 2 data movers), a VTL is probably overkill.

D


So who am I?

Hello everyone,

My name is Dimitris Krekoukias.

This blog used to be on another server; I moved it here, and hopefully this hosting facility will be more stable.

I resemble a silverback gorilla more than a monkey (man-pelt and all), and could probably wrestle one (and have a fair chance of winning).
I have extensive experience in the backup and recovery arena, and indeed know far more about certain products than I (or the vendors) would like to.
This blog will not be just about recovery - I have other interests, such as storage, OS design, tuning, filesystems, HPC, and other exotica. Plus a ton of non-IT-related hobbies - but that’s a story for another day.
Hopefully everyone will find this blog stimulating, controversial and, at times, annoying - in which case, tough.

D


On windows filesystem tuning and funky cache mechanisms

Edited: I just realized I must have used different postmark settings for vista and XP. Do NOT use the following numbers to compare Vista to XP performance.

I won’t go into a diatribe on how to tune Windows - there are excellent guides on Microsoft’s and IBM’s sites, among others.

But I wanted to share some goodness based on some recent findings of mine.

First, the part that most probably know (works on XP and 2003):

From a command window do

fsutil behavior set disablelastaccess 1

This will disable access time recording, which IMO is useless unless you really do care when a file was accessed. It costs little if there isn’t much going on with your disk (or if you are on some fancy EMC box with tons of cache), but if you have busy disks, turning it off typically helps a bit.

On 2003, you can also increase the size of the lookaside buffer if you have many concurrent file operations:

fsutil behavior set memoryusage 2

This also works on Vista but not XP, sadly. See more here: http://technet2.microsoft.com/WindowsServer/en/library/9fcf44c8-68f4-4204-b403-0282273bc7b31033.mspx?mfr=true

Now, for the interesting part. I use a laptop that’s pretty decent (100GB 7200RPM drive, 2GB RAM). I hammer my disk since I use the laptop for vmware and other duties (music software with thousands of files, for instance).

I like postmark and iozone for measuring performance. Here’s how I configure postmark:

set number 10000
set transactions 20000
set subdirectories 5
set size 500 100000
set read 4096
set write 4096
run

This will create 10,000 files, then perform 20,000 transactions on them. The files will range from 500 bytes to 100KB in size. This is brutal on CPU, cache and disk. If you want different-sized files you just specify the min and max sizes. Just be careful with the number of files (if you leave it at 10,000 and tell it to make 100GB files, better make sure you have the space).
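
A back-of-the-envelope check of the space needed before hitting "run" might look like this. It is a rough sketch that assumes postmark picks file sizes evenly between the two bounds; postmark_pool_bytes is a hypothetical helper, not part of postmark.

```python
def postmark_pool_bytes(number, min_size, max_size):
    """Estimate the initial file pool size, assuming sizes spread evenly between the bounds."""
    return number * (min_size + max_size) // 2

# The settings above: 10,000 files of 500 to 100,000 bytes.
pool = postmark_pool_bytes(10_000, 500, 100_000)
print(f"initial pool: about {pool / 1e6:.1f} MB")

# The cautionary case: leaving the count at 10,000 with files of up to 100GB.
huge = postmark_pool_bytes(10_000, 500, 100 * 10**9)
print(f"with 100GB files: about {huge / 1e12:.0f} TB")
```

The transactions create and append to files as well, so treat the estimate as a floor rather than an exact figure.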

Anyway, here are some results:

Vista untweaked (10000 files and transactions, 512 byte I/O):

Time:
181 seconds total
51 seconds of transactions (196 per second)

Files:
15047 created (83 per second)
Creation alone: 10000 files (121 per second)
Mixed with transactions: 5047 files (98 per second)
4945 read (96 per second)
5055 appended (99 per second)
15047 deleted (83 per second)
Deletion alone: 10094 files (210 per second)
Mixed with transactions: 4953 files (97 per second)

Data:
257.93 megabytes read (1.43 megabytes per second)
826.79 megabytes written (4.57 megabytes per second)

Vista tweaked with fsutil as described above:

Time:
159 seconds total
51 seconds of transactions (196 per second)

Files:
15047 created (94 per second)
Creation alone: 10000 files (158 per second)
Mixed with transactions: 5047 files (98 per second)
4945 read (96 per second)
5055 appended (99 per second)
15047 deleted (94 per second)
Deletion alone: 10094 files (224 per second)
Mixed with transactions: 4953 files (97 per second)

Data:
257.93 megabytes read (1.62 megabytes per second)
826.79 megabytes written (5.20 megabytes per second)

So it’s a bit better.

Another thing you can do is set the processor quanta to be fixed 120ms chunks (simply done by right clicking on “My Computer”, then properties, advanced, performance, settings, advanced, and setting processor scheduling to background services). Yes, I’ve had by far the best luck with XP by tuning it like a server. Your mileage may vary but this also increases postmark results a bit.

You can also play with increasing the cache: in that advanced pane again select “system cache” and, with regedit, go to HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\lanmanserver\parameters\size and make it a 3. This is all if you have XP. In 2003 it comes just like that, unless you want to run SQL, IIS or Exchange, in which case there’s a setting, “maximize throughput for network applications”. This limits cache to 512MB, and lets the apps cache on their own.
OR, you can actually spend some money and ridiculously increase performance by getting a caching product like Superspeed’s Supercache or Datacore’s Uptempo (I tried O&O Clevercache as well and was thoroughly underwhelmed).
Here are results with 20,000 transactions and 4K I/O, XP tuned just like a server:

Time:
386 seconds total
308 seconds of transactions (64 per second)

Files:
20092 created (52 per second)
Creation alone: 10000 files (142 per second)
Mixed with transactions: 10092 files (32 per second)
9935 read (32 per second)
10064 appended (32 per second)
20092 deleted (52 per second)
Deletion alone: 10184 files (1273 per second)
Mixed with transactions: 9908 files (32 per second)

Data:
548.25 megabytes read (1.42 megabytes per second)
1158.00 megabytes written (3.00 megabytes per second)

And here are results with the exact same settings but with 256MB of Supercache on that volume, lazy writes on:

Time:
196 seconds total
163 seconds of transactions (122 per second)

Files:
20092 created (102 per second)
Creation alone: 10000 files (344 per second)
Mixed with transactions: 10092 files (61 per second)
9935 read (60 per second)
10064 appended (61 per second)
20092 deleted (102 per second)
Deletion alone: 10184 files (2546 per second)
Mixed with transactions: 9908 files (60 per second)

Data:
548.25 megabytes read (2.80 megabytes per second)
1158.00 megabytes written (5.91 megabytes per second)
I am a believer. The size of the dataset far exceeded the capacity of supercache, but it helped tremendously regardless.
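
To put numbers on "a bit better" and "tremendously", here is a small sketch that computes the improvement from the total run times reported above. Each pair compares a run against the same OS with the tweak applied, so it does not mix Vista and XP figures; the labels and the improvement helper are just for illustration.

```python
# (label, total seconds before, total seconds after), taken from the postmark runs above.
runs = [
    ("Vista, fsutil tweaks", 181, 159),
    ("XP, 256MB Supercache", 386, 196),
]

def improvement(before_s, after_s):
    """Return (speedup factor, percentage of run time saved)."""
    speedup = before_s / after_s
    saved_pct = 100 * (before_s - after_s) / before_s
    return speedup, saved_pct

results = {label: improvement(before, after) for label, before, after in runs}

for label, (speedup, saved_pct) in results.items():
    print(f"{label}: {speedup:.2f}x faster, {saved_pct:.1f}% less time")
```

The fsutil tweaks shave about an eighth off the run, while the cache cuts the run time roughly in half.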
Since I don’t believe all benchmarks, I also ran iozone.

file/rec    4096    8192   16384
64
128
256
512
1024
2048
4096       70011
8192       29264   50257
16384      26229   33289   37198
32768      27578   28827   34778
65536      26982   27890   28997
131072     20901   21680   22223
262144     21769   20789   22249
524288     23076   25270   26258

The top row shows record size, the left column file size. The above is without the cache. Now with cache:

file/rec    4096    8192   16384
64
128
256
512
1024
2048
4096      279746
8192      264110  262117
16384     250322  249355  238230
32768     233373  238932  233980
65536     204786  232418  234544
131072    234552  230336  225731
262144    164434  227792  222540
524288     35515   31533   41262

These results are for writes, in both cases. Iozone’s output is too large to include here but I’ll gladly send the entire file to anyone that wants it. I would ignore record sizes under 4K since windows will coalesce writes to 4K and up anyway (up to 64K).
It seems that these products are worth a serious look. In most cases, significant benefits will be realized by caching the volume that holds the swapfile, even if only using 128MB. In one case I went from 124 seconds for a postmark run to 70 seconds by caching the swap volume, even though I had ample memory and Windows shouldn’t have been using swap.

Unix is generally a bit more robust for caching and virtual memory, so you don’t need extra products. Looks like Windows needs a bit of help. Indeed, Microsoft uses Supercache on the servers that host MSN, I found out…
Anyway, you can see that for file sizes up to 256MB Supercache kicks Windows’ cache ass. Now remember, this is a box tuned just like a server; it was using like 1GB of cache even without Supercache. Once you exceed the size of the cache by using the large 512MB test file, you still realize some benefits, as you can see.
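
The size of the gap is easier to read as a ratio. The sketch below simply divides the cached iozone write figures by the uncached ones, per file size and record size, using the numbers from the two tables above; the dictionary layout is just an assumption made for the example.

```python
# iozone write results from the tables above: file size -> values for record sizes 4096, 8192, 16384.
without_cache = {
    4096: [70011],
    8192: [29264, 50257],
    16384: [26229, 33289, 37198],
    32768: [27578, 28827, 34778],
    65536: [26982, 27890, 28997],
    131072: [20901, 21680, 22223],
    262144: [21769, 20789, 22249],
    524288: [23076, 25270, 26258],
}
with_cache = {
    4096: [279746],
    8192: [264110, 262117],
    16384: [250322, 249355, 238230],
    32768: [233373, 238932, 233980],
    65536: [204786, 232418, 234544],
    131072: [234552, 230336, 225731],
    262144: [164434, 227792, 222540],
    524288: [35515, 31533, 41262],
}

# Ratio of cached to uncached throughput for every cell present in both tables.
speedup = {
    size: [cached / plain for cached, plain in zip(with_cache[size], without_cache[size])]
    for size in without_cache
}

for size, ratios in speedup.items():
    print(size, " ".join(f"{ratio:5.1f}x" for ratio in ratios))
```

Most cells come out several times faster with the cache, and the largest test file drops back to a much smaller gain once the file no longer fits.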

Datacore’s UpTempo produced similar results. It is far less tunable, uses a unified cache (instead of a chunk per partition), is easier to configure, and can be more or less expensive depending on your box: Supercache for 4 CPUs is like $1K, but half that for 2 CPUs, while UpTempo is about $700 regardless. Another difference is that UpTempo is 32-bit only at the moment.

D