Thursday, March 01, 2007

Search and Index Sizing and Planning - Real world data from MSW

Source: Joel Oleson's blog

"I've heard such a huge variety of guidance around index sizing. This is a topic where the numbers cover a huge swath, so it's very important to understand your data or to be conservative. If you have the ability to resize later or add larger disks, you may find this data compelling.

In the capacity planning document you'll see we recommend 30% of disk for the index. What does that mean? Well, since we now have a search database, the index edb file, and the SSP database, it can be confusing. Reading the response from Sam about Microsoft's internal intranet deployment, you can see how mileage really does vary. If the 12TB deployment which is currently being indexed were to have 30% of disk just for the size of the index on disk, the file would be about 3.6TB! Currently the index file on disk is 83GB, and the search database is 243GB. This is with 19.4 million documents indexed. Since the recommendation is to have 2X the size of the index file for swapping it out on the query server, a planner would say that they should plan for 60% of disk, or over 7TB. That would be quite a waste given that the current "real world" size is 83GB. They currently have 300GB allocated for that drive, and since it's on a SAN they can grow the disk if needed. Obviously a heavy records management repository or a page-heavy site will have different results, so be sure to understand your content.

My tip: don't over-plan or let this kill your design. The recommendation has gone from 50% to 30% over time, and maybe you've even seen 10%. My recommendation is to understand your data. Remember that the content of audio, video, archives, ZIPs, PDFs, MDB, MPP, MSG, VSD, GIF, JPG, PSD, CAD, WAV, MSI, EXE, and hundreds of other file types is not indexed by default. You have to add IFilters for the file types that are not indexed by default (the default list is pretty decent: most Office file types and text-based formats), and you should be selective about what you add, since many IFilters are not multithreaded. The other consideration is file size. Since files larger than 16MB are not indexed by default, the biggest files taking up the most space on disk won't be indexed. When you have a 15MB PPT, how much of it is even indexed? Maybe 100K worth (if you have verbose notes)?

Our indexes are larger than they were in SPS 2003, but my recommendation is to be conservative and plan for 10%, with the expectation that you'll really see something around 1-5%. Don't underestimate the search database though. In the MSW farm, from what I hear, it is the most actively written-to database in the farm. That makes sense, given that it is the property store. Although I've almost completely gone to RAID 5 in all my planning to minimize cost, I do recommend a RAID 0+1 drive for the search and config databases, and RAID 0+1 for the transaction logs.

Message from Sam...

The real-world data today is that we are indexing 12TB of SharePoint content worldwide, plus an unknown amount of non-SharePoint content, from our Redmond SSP. The numbers look like this:

Number of documents indexed: 19.4 million
Size of search database: 243GB
Size of index on disk: 83GB

Thus you could consider the amount of disk used to be about 326GB. Assuming, say, a 14TB total corpus (just a guess, really), the real-world data would indicate 2.33%. Of course this is very much a 'mileage may vary' exercise, as everyone's document mix is different."
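
To make the arithmetic in the post easier to check, here is a minimal Python sketch that compares the guideline-based plan with the measured figures. The function names are made up for illustration, and the 1TB = 1000GB conversion is an assumption, since the post does not say which convention it uses.

```python
GB_PER_TB = 1000  # assumption: decimal units; the post does not specify


def guideline_gb(corpus_tb, index_fraction=0.30, swap_factor=2):
    """Disk the planning guideline would reserve for the index file, in GB."""
    return corpus_tb * GB_PER_TB * index_fraction * swap_factor


def observed_percent(index_gb, search_db_gb, corpus_tb):
    """Index file plus search database as a percentage of the corpus."""
    return 100 * (index_gb + search_db_gb) / (corpus_tb * GB_PER_TB)


# 30% of the 12TB deployment, doubled for the swap copy on the query server.
planned = guideline_gb(12)

# Sam's figures (83GB index, 243GB search database) against his 14TB guess.
actual = observed_percent(83, 243, 14)

print(f"guideline: {planned:.0f} GB")
print(f"observed:  {actual:.2f}% of corpus")
```

Under these assumptions the guideline reserves 7200 GB for the index file alone, while the index and search database together come to about 2.33% of the corpus, which is the gap the post is pointing at.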

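The advice to "understand your data" can also be roughed out in code. The sketch below estimates what share of a corpus, by bytes, is even eligible for indexing under the two defaults the post mentions: the file types not indexed by default and the 16MB size limit. The extension set is taken from the post's list, while the function name and the sample files are invented for illustration only.

```python
import os

# Extensions the post lists as not indexed by default. "CAD" is a category
# rather than a single extension, so it is left out of this illustrative set.
NOT_INDEXED_BY_DEFAULT = {
    ".zip", ".pdf", ".mdb", ".mpp", ".msg", ".vsd",
    ".gif", ".jpg", ".psd", ".wav", ".msi", ".exe",
}
MAX_INDEXED_BYTES = 16 * 1024 * 1024  # the 16MB default mentioned in the post


def crawlable_fraction(files):
    """Share of total bytes eligible for indexing.

    files: iterable of (name, size_in_bytes) pairs.
    """
    total = 0
    eligible = 0
    for name, size in files:
        total += size
        ext = os.path.splitext(name)[1].lower()
        if ext not in NOT_INDEXED_BY_DEFAULT and size <= MAX_INDEXED_BYTES:
            eligible += size
    return eligible / total if total else 0.0


# Made-up sample inventory, not real data.
sample = [
    ("plan.doc", 2_000_000),
    ("deck.ppt", 15_000_000),
    ("scan.pdf", 40_000_000),
    ("video.wmv", 300_000_000),
    ("setup.exe", 43_000_000),
]
share = crawlable_fraction(sample)
print(f"eligible share of bytes: {share:.2%}")
```

Even the eligible share is an upper bound: as the post notes, a 15MB PPT may contribute only around 100K of indexed text.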