Tutorial :Slow distinct query in SQL Server over large dataset



Question:

We're using SQL Server 2005 to track a fair amount of constantly incoming data (5-15 updates per second). We noticed after it has been in production for a couple months that one of the tables has started to take an obscene amount of time to query.

The table has 3 columns:

  • id -- autonumber (clustered)
  • typeUUID -- GUID generated before the insert happens; used to group the types together
  • typeName -- The type name (duh...)

One of the queries we run is a distinct on the typeName field:

SELECT DISTINCT [typeName] FROM [types] WITH (nolock);  

The typeName field has a non-clusted, non-unique ascending index on it. The table contains approximately 200M records at the moment. When we run this query, the query took 5m 58s to return! Perhaps we're not understanding how the indexes work... But I didn't think we mis-understood them that much.

To test this a little further, we ran the following query:

SELECT DISTINCT [typeName] FROM (SELECT TOP 1000000 [typeName] FROM [types] WITH (nolock)) AS [subtbl]  

This query returns in about 10 seconds, as I would expect, it's scanning the table.

Is there something we're missing here? Why does the first query take so long?

Edit: Ah, my apologies, the first query returns 76 records, thank you ninesided.

Follow up: Thank you all for your answers, it makes more sense to me now (I don't know why it didn't before...). Without an index, it's doing a table scan across 200M rows, with an index, it's doing an index scan across 200M rows...

SQL Server does prefer the index, and it does give a little bit of a performance boost, but nothing to be excited about. Rebuilding the index did take the query time down to just over 3m instead of 6m, an improvement, but not enough. I'm just going to recommend to my boss that we normalize the table structure.

Once again, thank you all for your help!!


Solution:1

You do misunderstand the index. Even if it did use the index it would still do an index scan across 200M entries. This is going to take a long time, plus the time it takes to do the DISTINCT (causes a sort) and it's a bad thing to run. Seeing a DISTINCT in a query always raises a red flag and causes me to double check the query. In this case, perhaps you have a normalization issue?


Solution:2

I doubt SQL Server will even try to use the index, it'd have to do practically the same amount of work (given the narrow table), reading all 200M rows regardless of whether it looks at the table or the index. If the index on typeName was clustered it may reduce the time taken as it shouldn't need to sort before grouping.

If the cardinality of your types is low, how about maintaining a summary table which holds the list of distinct type values? A trigger on insert/update of the main table would do a check on the summary table and insert a new record when a new type is found.


Solution:3

There is an isse with the SQL Server optimizer when using the DISTINCT keyword. The solution was to force it to keep the same query plan by breaking out the distinct query separately.

So we too queries such as:

SELECT DISTINCT [typeName] FROM [types] WITH (nolock);  

and break it up into the following

SELECT typeName INTO #tempTable1 FROM types WITH (NOLOCK)  SELECT DISTINCT typeName FROM #tempTable1  

Another way to get around it is to use a GROUP BY, which gets a different optimization plan.


Solution:4

As others have already pointed out - when you do a SELECT DISTINCT (typename) over your table, you'll end up with a full table scan no matter what.

So it's really a matter of limiting the number of rows that need to be scanned.

The question is: what do you need your DISTINCT typenames for? And how many of your 200M rows are distinct? Do you have only a handful (a few hundred at most) distinct typenames??

If so - you could have a separate table DISTINCT_TYPENAMES or something and fill those initially by doing a full table scan, and then on inserting new rows to the main table, just always check whether their typename is already in DISTINCT_TYPENAMES, and if not, add it.

That way, you'd have a separate, small table with just the distinct TypeName entries, which would be lightning fast to query and/or to display.

Marc


Solution:5

My first thought is statistics. To find last updated:

SELECT      name AS index_name,       STATS_DATE(object_id, index_id) AS statistics_update_date  FROM      sys.indexes   WHERE      object_id = OBJECT_ID('MyTable');  

Edit: Stats are updated when indexes are rebuilt, which I see are not maintained

My second thought is that is the index still there? The TOP query should still use an index. I've just tested on one of my tables with 57 million rows and both use the index.


Solution:6

A looping approach should use multiple seeks (but loses some parallelism). It might be worth a try for cases with relatively few distinct values compared to the total number of rows (low cardinality).

Idea was from this question:

select typeName into #Result from Types where 1=0;    declare @t varchar(100) = (select min(typeName) from Types);  while @t is not null  begin      set @t = (select top 1 typeName from Types where typeName > @t order by typeName);          if (@t is not null)          insert into #Result values (@t);  end    select * from #Result;  

And looks like there are also some other methods (notably the recursive CTE @Paul White):

different-ways-to-find-distinct-values-faster-methods

sqlservercentral Topic873124-338-5


Solution:7

Second query works on 1000000 records but the first one 200M. I think this is a big difference :)


Solution:8

I should try something like this:

SELECT typeName FROM [types] WITH (nolock)  group by typeName;  

And like other i would say you need to normalize that column.


Solution:9

An index helps you quickly find a row. But you're asking the database to list all unique types for the entire table. An index can't help with that.

You could run a nightly job which runs the query and stores it in a different table. If you require up-to-date data, you could store the last ID included in the nightly scan, and combine the results:

select type  from nightlyscan  union  select distinct type  from verybigtable  where rowid > lastscannedid  

Another option is to normalize the big table into two tables:

talbe1: id, guid, typeid  type table: typeid, typename  

This would be very beneficial if the number of types was relatively small.


Solution:10

I could be missing something but would it be more efficient if an overhead on load to create a view with distinct values and query that instead?

This would give almost instant responses to the select if the result set is significantly smaller with the overhead over populating it on each write though given the nature of the view that might be trivial in itself.

It does ask the question how many writes compared to how often you want the distinct and the importance of the speed when you do.


Solution:11

An indexed view can make this faster.

create view alltypes  with schemabinding as  select typename, count_big(*) as kount  from dbo.types  group by typename    create unique clustered index idx  on alltypes (typename)  

The work to keep the view up to date on each change to the base table should be moderate (depending on your application, of course -- my point is that it doesn't have to scan the whole table each time or do anything insanely expensive like that.)

Alternatively you could make a small table holding all values:

select distinct typename  into alltypes  from types    alter table alltypes  add primary key (typename)    alter table types add foreign key (typename) references alltypes  

The foreign key will make sure that all values used appear in the parent alltypes table. The trouble is in ensuring that alltypes does not contain values not used in the child types table.


Note:If u also have question or solution just comment us below or mail us on toontricks1994@gmail.com
Previous
Next Post »