Defrag the data identifiers


If you pick a new random GUID each time you add an item to the FileStorage, the FileStorage's index file will grow rapidly. With around 140,000 items in a file storage, it is not unlikely that the index file grows to around 4 GB (yes, that's right, four gigabytes; about one DVD filled with data just for indexing). The index file may well end up larger than the contents you store in the .data file.

Rather than using unstructured, truly random GUIDs, in some cases it's OK to use incremental GUIDs. If this is acceptable, you might want to defrag your data identifiers using the defrag command in the CLI.
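To make "incremental GUIDs" concrete, here is a minimal sketch in Python. It is not part of NFileStorage; the function name incremental_guid is hypothetical and only illustrates the idea of using a counter as the 128-bit GUID value.

```python
import uuid


def incremental_guid(counter: int) -> uuid.UUID:
    """Return a GUID whose 128-bit value equals the given counter.

    Hypothetical helper for illustration; not an NFileStorage API.
    """
    return uuid.UUID(int=counter)


# The first four incremental identifiers, counting from zero.
first_four = [str(incremental_guid(n)) for n in range(4)]
for guid in first_four:
    print(guid)
```

Identifiers built this way differ only in their last few digits, in contrast to random GUIDs, which are spread over the whole 128-bit range.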

Defragging using the CLI

If you want to defrag a FileStorage yourself, you can use the CLI by passing the old file storage name, the new file storage name, a SQL table name, and a SQL column name:

H:\CLI>FileStorageCmd.exe defrag youroldfile DefraggedFileStorage mytable mycolumn
. 4 (4 files/sec, 00:00:00 mins) Reading indexes....................................................
/ 4 (4 files/sec, 00:00:00 mins) Writing............................................................
File storage optimization finished
This operation took 611 msecs

What happens when defragging

If we compare the verbose 'dir' output of the original and the defragged file storage, we can see what has happened:


H:\Proj\CodePlex\NFileStorage\FileStorageCmd\bin\Debug>FileStorageCmd.exe dir opt_v1.3 verbose
. 4 (4 files/sec, 00:00:00 mins) Dir................................................................
Data identifier                      | Text identifier  | Creation date     | Size
2f97b016-bcba-4d1e-a5e8-1371d1fc4a21 | **************** | 20090412 07:38:22 | 11.264
9f0e3f19-eb0f-4381-a530-c8c1e5eae4d8 | **************** | 20090412 07:38:23 | 11.264
9ff92cbf-1529-44cf-bdfa-5f5cc70bb6d8 | **************** | 20090412 07:38:23 | 11.264
a337a5fa-ca4d-4166-a10e-eaafc7acf679 | **************** | 20090412 07:38:23 | 11.264
4 files found (45.056 bytes)
This operation took 469 msecs


H:\Proj\CodePlex\NFileStorage\FileStorageCmd\bin\Debug>FileStorageCmd.exe dir DefraggedFileStorage verbose
. 4 (4 files/sec, 00:00:00 mins) Dir................................................................
Data identifier                      | Text identifier  | Creation date     | Size
00000000-0000-0000-0000-000000000000 | **************** | 20090412 07:43:52 | 11.264
00000000-0000-0000-0000-000000000001 | **************** | 20090412 07:43:52 | 11.264
00000000-0000-0000-0000-000000000002 | **************** | 20090412 07:43:52 | 11.264
00000000-0000-0000-0000-000000000003 | **************** | 20090412 07:43:52 | 11.264
4 files found (45.056 bytes)
This operation took 384 msecs

So each (random) GUID from the original is mapped to a new incremental GUID. Note that altering only the data identifiers would cause trouble for the program that uses them. You will likely have a database that contains a pointer to a specific item (like a Person record pointing to the data identifier that holds information about that person). If we alter the data identifier in the file storage, we also have to alter the GUID in the database, of course. This is why the CLI defrag command has two additional parameters that let you specify a table name and a column name. Besides the index and data files, the defrag command produces a third file: a SQL file. The SQL file will assist you in updating your DB contents.

Below you can see an example of the contents of this .sql file:

H:\CLI>type DefraggedFileStorage.FileStorage.index.fc.sql
-- SQL Patch script to adjust the dataidentifiers
update mytable set mycolumn='2f97b016-bcba-4d1e-a5e8-1371d1fc4a21' where mycolumn='00000000-0000-0000-0000-000000000000'
update mytable set mycolumn='9f0e3f19-eb0f-4381-a530-c8c1e5eae4d8' where mycolumn='00000000-0000-0000-0000-000000000001'
update mytable set mycolumn='9ff92cbf-1529-44cf-bdfa-5f5cc70bb6d8' where mycolumn='00000000-0000-0000-0000-000000000002'
update mytable set mycolumn='a337a5fa-ca4d-4166-a10e-eaafc7acf679' where mycolumn='00000000-0000-0000-0000-000000000003'
-- EOF
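
The remapping idea can also be sketched outside the CLI. The Python example below is an illustration under stated assumptions, not the tool's implementation: it assumes the old identifiers are numbered in sorted order (which matches the pairing in the listings above), it assumes the goal is to replace the old random identifiers in the database with the new incremental ones, and it uses an in-memory SQLite database as a stand-in for your real one. The helper name build_mapping is hypothetical.

```python
import sqlite3
import uuid

# Old (random) data identifiers, taken from the 'dir' listing above.
old_ids = [
    "9ff92cbf-1529-44cf-bdfa-5f5cc70bb6d8",
    "2f97b016-bcba-4d1e-a5e8-1371d1fc4a21",
    "a337a5fa-ca4d-4166-a10e-eaafc7acf679",
    "9f0e3f19-eb0f-4381-a530-c8c1e5eae4d8",
]


def build_mapping(identifiers):
    """Map each old identifier to an incremental one.

    Assumption: identifiers are numbered in sorted order.
    """
    ordered = sorted(identifiers)
    return {old: str(uuid.UUID(int=n)) for n, old in enumerate(ordered)}


mapping = build_mapping(old_ids)

# Stand-in database: one table with one column holding the identifiers.
conn = sqlite3.connect(":memory:")
conn.execute("create table mytable (mycolumn text)")
conn.executemany("insert into mytable values (?)", [(i,) for i in old_ids])

# Apply the remapping in a single transaction (old value -> new value).
with conn:
    conn.executemany(
        "update mytable set mycolumn = ? where mycolumn = ?",
        [(new, old) for old, new in mapping.items()],
    )

rows = [r[0] for r in conn.execute(
    "select mycolumn from mytable order by mycolumn")]
for row in rows:
    print(row)
```

Running all updates inside one transaction means the table is never left half-remapped if something fails midway.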

Cons of defragging

  • When using random GUIDs you automatically 'protect' your resources. Let's imagine we have a website that holds very important information, let's say pictures, stored in a FileStorage. The pictures would be exposed to the outside world using a certain URL, like so: http://yoursite/picture.aspx?a={000-000-000-000-001}. Some users might try to see what happens if they give the browser another URL: http://yoursite/picture.aspx?a={000-000-000-000-002}. If the GUIDs are randomly picked, the chance of a user guessing a valid second GUID is extremely small (since there are so many GUIDs to choose from). So be careful with defragging in these cases. You might want to consider a solution that is less space-efficient than incremental GUIDs but much better than purely random ones. A possible solution would be to use Comb GUIDs (aka Combined GUIDs), where certain bytes of the GUID are filled with some semi-unique key, like a combination of year, month, day, hour, minute, second, and milliseconds, or the 'Ticks' property of 'DateTime.Now'.
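
A Comb GUID can be sketched as follows. This is a minimal Python illustration, not NFileStorage code: the function name comb_guid is hypothetical, and placing a millisecond timestamp in the last six bytes is an assumption borrowed from common Comb GUID layouts. Which bytes work best depends on how the index orders identifiers.

```python
import os
import time
import uuid


def comb_guid(now_ms=None):
    """Build a GUID from 10 random bytes plus a 6-byte timestamp.

    Hypothetical helper; the byte layout is an assumption. The result
    does not set the RFC 4122 version bits, so it is a GUID-shaped
    value rather than a standard version 4 UUID.
    """
    if now_ms is None:
        now_ms = time.time_ns() // 1_000_000  # milliseconds since epoch
    random_part = os.urandom(10)  # 80 bits that remain hard to guess
    # Keep the low 48 bits of the timestamp, most significant byte first.
    time_part = (now_ms & ((1 << 48) - 1)).to_bytes(6, "big")
    return uuid.UUID(bytes=random_part + time_part)


print(comb_guid())
```

Identifiers created close together in time share their trailing bytes, while the random leading bytes keep individual values hard to guess.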

Pros of defragging

  • Way less storage space is required for the NFileStorage index file
    • before defragging, using random GUIDs, the index is (in my case) 3.9 GB:
01-04-2009  11:36     3.992.971.364 example.FileStorage.index.fc
    • after defragging, the GUIDs are incrementally ordered and the index consumes 'just' 1.2 MB:
11-04-2009  16:57         1.200.228 example_optimized.FileStorage.index.fc
  • Big gain in performance while interacting with the index file
    • building a collection of all GUIDs in a file storage from the index file (around 150,000 data identifiers in my case) takes many minutes with the random-GUID file, simply because a lot of read I/O happens on the index file.


Last edited May 30, 2009 at 10:26 AM by barkgj, version 2

