Tutorial: Indexing a 'non-guessable' key for quick retrieval?



Question:

I'm not getting everything I want from Google Analytics, so I'm building my own simple tracking system to fill in some of the gaps.

I have a session key that I send to the client as a cookie. This is a GUID. I also have a surrogate IDENTITY int column.
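For reference, here is a minimal sketch of the kind of table I mean (the names and the LastSeen column are illustrative, not my actual schema):

-- Hypothetical session table: int surrogate key plus a GUID for the cookie
create table Sessions (
    SessionId   int identity(1,1) not null,                  -- surrogate key, never sent to the client
    SessionGuid uniqueidentifier  not null default newid(),  -- value sent as the cookie
    LastSeen    datetime          not null default getdate()
);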

I will frequently have to access the session row to make updates to it during the life of the client. Finding this session row to make updates is where my concern lies.

I only send the GUID to the client browser, because:

a) I don't want my technically savvy 'hacker' users to be able to gauge what 'user id' they are, i.e. to know how many visitors the site has had in total

b) I want to make sure no one messes with data maliciously; nobody can guess a GUID

I know GUID indexes are inefficient, but I'm not sure exactly how inefficient. I'm also not clear on how to maximize the efficiency of multiple updates to the same row.

I don't know which of the following I should do:

  • Index the GUID column and always use that to find the row.
  • Do a table scan to find the row based on the GUID (assuming recent sessions are easy to find), scanning in reverse date order (if that's even possible!).
  • Avoid a GUID index and keep a hashtable of active sessions in my application tier, an IDictionary<GUID, int>, so the 'secret' IDENTITY surrogate key can be found from the 'non-secret' GUID key.

There may be several thousand sessions a day.

PS. I am just trying to better understand the SQL aspects of this. I know I can do other clever things, like only writing to the table on session expiration, but please keep answers SQL/index related.


Solution:1

In this case, I'd just create an index on the GUID. Thousands of sessions a day is a completely trivial load for a modern database.

Some notes:

  • If you create the GUID index as non-clustered, the index will be small and will probably be cached in memory. By default, most databases cluster on the primary key (see the sketch after this list).
  • A GUID column is larger than an integer, but that is hardly a big issue nowadays, and you need a GUID for the application.
  • An index on a GUID is just like an index on a string, for example Last Name, and that works efficiently.
  • The B-tree of an index on a GUID is harder to keep balanced than an index on an identity column (though not harder than an index on Last Name). This effect can be countered by starting with a low fill factor and reorganizing the index in a weekly job. That is a micro-optimization for databases that handle a million inserts an hour or more.
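A minimal sketch of that layout, assuming SQL Server and reusing the illustrative names from the question: cluster on the identity column, and give the GUID a non-clustered index with a low fill factor:

-- Cluster on the small identity key; keep the GUID index non-clustered
create table Sessions (
    SessionId   int identity(1,1) not null primary key clustered,
    SessionGuid uniqueidentifier  not null,
    LastSeen    datetime          not null
);

-- A low fill factor leaves room for the randomly ordered GUID inserts
create unique nonclustered index IX_Sessions_Guid
    on Sessions (SessionGuid)
    with (fillfactor = 70);

-- The weekly maintenance job would then run something like:
alter index IX_Sessions_Guid on Sessions reorganize;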


Solution:2

Assuming you are using SQL Server 2005 or above, your scenario might benefit from NEWSEQUENTIALID(), the function that gives you ordered GUIDs.

Consider this quote from the article Performance Comparison - Identity() x NewId() x NewSequentialId:

"The NEWSEQUENTIALID system function is an addition to SQL Server 2005. It seeks to bring together, what used to be, conflicting requirements in SQL Server 2000; namely identity-level insert performance, and globally unique values."

Declare your table as:

create table MyTable (
    id uniqueidentifier default newsequentialid() not null primary key clustered
);
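As a small usage sketch (assuming the table above): NEWSEQUENTIALID() can only appear in a default constraint, so you read the generated value back with an OUTPUT clause:

-- Insert a row and capture the server-generated sequential GUID
insert into MyTable
output inserted.id
default values;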

However, keep in mind, as Andomar noted, that the sequential nature of the GUIDs produced also makes them easy to predict. There are ways to make this harder, but none that would make it better than applying the same techniques to sequential integer keys.

Like the other authors, I seriously doubt that the overhead of using plain NEWID() GUIDs would be significant enough for your application to notice. You would be better off focusing on minimizing round trips to your DB than on implementing custom caching schemes such as the dictionary you propose.
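For instance, one statement can update the session row and return its surrogate key in a single round trip; a sketch against the hypothetical Sessions table from the question:

-- Touch the session and fetch its surrogate key in one round trip
declare @g uniqueidentifier;
set @g = newid();   -- stand-in for the GUID read from the cookie

update Sessions
set LastSeen = getdate()
output inserted.SessionId
where SessionGuid = @g;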


Solution:3

If I understand what you're asking, you're worried that indexing and looking up your users by their GUID might slow your application down. I'm with Andomar: this is unlikely to matter unless you're inserting rows so fast that updating the index slows things down. That might happen only on something like a logging table, and then only with complicated indexes.

More importantly, did you profile it first? You don't have to guess why your program is slow; you can find out which bits are slow with a profiler. Otherwise you'll waste hours optimizing bits of code that are either (a) never used or (b) already fast enough.
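For a quick, profiler-free read on the lookup's cost in SQL Server, you can turn on the statistics output before running the query (a sketch, reusing the earlier hypothetical names):

-- Report I/O and timing for the GUID lookup
set statistics io on;
set statistics time on;

declare @g uniqueidentifier;
set @g = newid();   -- stand-in for a real session GUID

select SessionId
from Sessions
where SessionGuid = @g;

set statistics io off;
set statistics time off;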

