Tutorial :SQL to determine minimum sequential days of access?



Question:

The following User History table contains one record for every day a given user has accessed a website (in a 24 hour UTC period). It has many thousands of records, but only one record per day per user. If the user has not accessed the website for that day, no record will be generated.

  Id      UserId   CreationDate  ------  ------   ------------  750997      12   2009-07-07 18:42:20.723  750998      15   2009-07-07 18:42:20.927  751000      19   2009-07-07 18:42:22.283  

What I'm looking for is a SQL query on this table with good performance, that tells me which userids have accessed the website for (n) continuous days without missing a day.

In other words, how many users have (n) records in this table with sequential (day-before, or day-after) dates? If any day is missing from the sequence, the sequence is broken and should restart again at 1; we're looking for users who have achieved a continuous number of days here with no gaps.

Any resemblance between this query and a particular Stack Overflow badge is purely coincidental, of course.. :)


Solution:1

The answer is obviously:

SELECT DISTINCT UserId  FROM UserHistory uh1  WHERE (         SELECT COUNT(*)          FROM UserHistory uh2          WHERE uh2.CreationDate          BETWEEN uh1.CreationDate AND DATEADD(d, @days, uh1.CreationDate)        ) = @days OR UserId = 52551  

EDIT:

Okay here's my serious answer:

DECLARE @days int  DECLARE @seconds bigint  SET @days = 30  SET @seconds = (@days * 24 * 60 * 60) - 1  SELECT DISTINCT UserId  FROM (      SELECT uh1.UserId, Count(uh1.Id) as Conseq      FROM UserHistory uh1      INNER JOIN UserHistory uh2 ON uh2.CreationDate           BETWEEN uh1.CreationDate AND               DATEADD(s, @seconds, DATEADD(dd, DATEDIFF(dd, 0, uh1.CreationDate), 0))          AND uh1.UserId = uh2.UserId      GROUP BY uh1.Id, uh1.UserId      ) as Tbl  WHERE Conseq >= @days  

EDIT:

[Jeff Atwood] This is a great fast solution and deserves to be accepted, but Rob Farley's solution is also excellent and arguably even faster (!). Please check it out too!


Solution:2

How about (and please make sure the previous statement ended with a semi-colon):

WITH numberedrows       AS (SELECT ROW_NUMBER() OVER (PARTITION BY UserID                                          ORDER BY CreationDate)                  - DATEDIFF(day,'19000101',CreationDate) AS TheOffset,                  CreationDate,                  UserID           FROM   tablename)  SELECT MIN(CreationDate),         MAX(CreationDate),         COUNT(*) AS NumConsecutiveDays,         UserID  FROM   numberedrows  GROUP  BY UserID,            TheOffset    

The idea being that if we have list of the days (as a number), and a row_number, then missed days make the offset between these two lists slightly bigger. So we're looking for a range that has a consistent offset.

You could use "ORDER BY NumConsecutiveDays DESC" at the end of this, or say "HAVING count(*) > 14" for a threshold...

I haven't tested this though - just writing it off the top of my head. Hopefully works in SQL2005 and on.

...and would be very much helped by an index on tablename(UserID, CreationDate)

Edited: Turns out Offset is a reserved word, so I used TheOffset instead.

Edited: The suggestion to use COUNT(*) is very valid - I should've done that in the first place but wasn't really thinking. Previously it was using datediff(day, min(CreationDate), max(CreationDate)) instead.

Rob


Solution:3

If you can change the table schema, I'd suggest adding a column LongestStreak to the table which you'd set to the number of sequential days ending to the CreationDate. It's easy to update the table at login time (similar to what you are doing already, if no rows exist of the current day, you'll check if any row exists for the previous day. If true, you'll increment the LongestStreak in the new row, otherwise, you'll set it to 1.)

The query will be obvious after adding this column:

if exists(select * from table            where LongestStreak >= 30 and UserId = @UserId)     -- award the Woot badge.  


Solution:4

Some nicely expressive SQL along the lines of:

select          userId,      dbo.MaxConsecutiveDates(CreationDate) as blah  from      dbo.Logins  group by      userId  

Assuming you have a user defined aggregate function something along the lines of (beware this is buggy):

using System;  using System.Data.SqlTypes;  using Microsoft.SqlServer.Server;  using System.Runtime.InteropServices;    namespace SqlServerProject1  {      [StructLayout(LayoutKind.Sequential)]      [Serializable]      internal struct MaxConsecutiveState      {          public int CurrentSequentialDays;          public int MaxSequentialDays;          public SqlDateTime LastDate;      }        [Serializable]      [SqlUserDefinedAggregate(          Format.Native,          IsInvariantToNulls = true, //optimizer property          IsInvariantToDuplicates = false, //optimizer property          IsInvariantToOrder = false) //optimizer property      ]      [StructLayout(LayoutKind.Sequential)]      public class MaxConsecutiveDates      {          /// <summary>          /// The variable that holds the intermediate result of the concatenation          /// </summary>          private MaxConsecutiveState _intermediateResult;            /// <summary>          /// Initialize the internal data structures          /// </summary>          public void Init()          {              _intermediateResult = new MaxConsecutiveState { LastDate = SqlDateTime.MinValue, CurrentSequentialDays = 0, MaxSequentialDays = 0 };          }            /// <summary>          /// Accumulate the next value, not if the value is null          /// </summary>          /// <param name="value"></param>          public void Accumulate(SqlDateTime value)          {              if (value.IsNull)              {                  return;              }              int sequentialDays = _intermediateResult.CurrentSequentialDays;              int maxSequentialDays = _intermediateResult.MaxSequentialDays;              DateTime currentDate = value.Value.Date;              if (currentDate.AddDays(-1).Equals(new DateTime(_intermediateResult.LastDate.TimeTicks)))                  sequentialDays++;              else              {                  maxSequentialDays = Math.Max(sequentialDays, maxSequentialDays);                  sequentialDays = 1;              }              _intermediateResult = new MaxConsecutiveState                                        {                                            CurrentSequentialDays = sequentialDays,                                            LastDate = currentDate,                                            MaxSequentialDays = maxSequentialDays                                        };          }            /// <summary>          /// Merge the partially computed aggregate with this aggregate.          /// </summary>          /// <param name="other"></param>          public void Merge(MaxConsecutiveDates other)          {              // add stuff for two separate calculations          }            /// <summary>          /// Called at the end of aggregation, to return the results of the aggregation.          /// </summary>          /// <returns></returns>          public SqlInt32 Terminate()          {              int max = Math.Max((int) ((sbyte) _intermediateResult.CurrentSequentialDays), (sbyte) _intermediateResult.MaxSequentialDays);              return new SqlInt32(max);          }      }  }  


Solution:5

Seems like you could take advantage of the fact that to be continuous over n days would require there to be n rows.

So something like:

SELECT users.UserId, count(1) as cnt  FROM users  WHERE users.CreationDate > now() - INTERVAL 30 DAY  GROUP BY UserId  HAVING cnt = 30  


Solution:6

Doing this with a single SQL query seems overly complicated to me. Let me break this answer down in two parts.

  1. What you should have done until now and should start doing now:
    Run a daily cron job that checks for every user wether he has logged in today and then increments a counter if he has or sets it to 0 if he hasn't.
  2. What you should do now:
    - Export this table to a server that doesn't run your website and won't be needed for a while. ;)
    - Sort it by user, then date.
    - go through it sequentially, keep a counter...


Solution:7

If this is so important to you, source this event and drive a table to give you this info. No need to kill the machine with all those crazy queries.


Solution:8

You could use a recursive CTE (SQL Server 2005+):

WITH recur_date AS (          SELECT t.userid,                 t.creationDate,                 DATEADD(day, 1, t.created) 'nextDay',                 1 'level'             FROM TABLE t           UNION ALL          SELECT t.userid,                 t.creationDate,                 DATEADD(day, 1, t.created) 'nextDay',                 rd.level + 1 'level'            FROM TABLE t            JOIN recur_date rd on t.creationDate = rd.nextDay AND t.userid = rd.userid)     SELECT t.*      FROM recur_date t     WHERE t.level = @numDays  ORDER BY t.userid  


Solution:9

Joe Celko has a complete chapter on this in SQL for Smarties (calling it Runs and Sequences). I don't have that book at home, so when I get to work... I'll actually answer this. (assuming history table is called dbo.UserHistory and the number of days is @Days)

Another lead is from SQL Team's blog on runs

The other idea I've had, but don't have a SQL server handy to work on here is to use a CTE with a partitioned ROW_NUMBER like this:

WITH Runs  AS    (SELECT UserID           , CreationDate           , ROW_NUMBER() OVER(PARTITION BY UserId                               ORDER BY CreationDate)             - ROW_NUMBER() OVER(PARTITION BY UserId, NoBreak                                 ORDER BY CreationDate) AS RunNumber    FROM       (SELECT UH.UserID             , UH.CreationDate             , ISNULL((SELECT TOP 1 1                 FROM dbo.UserHistory AS Prior                 WHERE Prior.UserId = UH.UserId                 AND Prior.CreationDate                    BETWEEN DATEADD(dd, DATEDIFF(dd, 0, UH.CreationDate), -1)                    AND DATEADD(dd, DATEDIFF(dd, 0, UH.CreationDate), 0)), 0) AS NoBreak        FROM dbo.UserHistory AS UH) AS Consecutive  )  SELECT UserID, MIN(CreationDate) AS RunStart, MAX(CreationDate) AS RunEnd  FROM Runs  GROUP BY UserID, RunNumber  HAVING DATEDIFF(dd, MIN(CreationDate), MAX(CreationDate)) >= @Days  

The above is likely WAY HARDER than it has to be, but left as an a brain tickle for when you have some other definition of "a run" than just dates.


Solution:10

A couple of SQL Server 2012 options (assuming N=100 below).

;WITH T(UserID, NRowsPrevious)       AS (SELECT UserID,                  DATEDIFF(DAY,                           LAG(CreationDate, 100)                               OVER                                   (PARTITION BY UserID                                        ORDER BY CreationDate),                            CreationDate)           FROM   UserHistory)  SELECT DISTINCT UserID  FROM   T  WHERE  NRowsPrevious = 100   

Though with my sample data the following worked out more efficient

;WITH U           AS (SELECT DISTINCT UserId               FROM   UserHistory) /*Ideally replace with Users table*/      SELECT UserId      FROM   U             CROSS APPLY (SELECT TOP 1 *                          FROM   (SELECT                                          DATEDIFF(DAY,                                                   LAG(CreationDate, 100)                                                     OVER                                                      (ORDER BY CreationDate),                                                    CreationDate)                                  FROM   UserHistory UH                                  WHERE  U.UserId = UH.UserID) T(NRowsPrevious)                          WHERE  NRowsPrevious = 100) O  

Both rely on the constraint stated in the question that there is at most one record per day per user.


Solution:11

Something like this?

select distinct userid  from table t1, table t2  where t1.UserId = t2.UserId     AND trunc(t1.CreationDate) = trunc(t2.CreationDate) + n    AND (      select count(*)      from table t3      where t1.UserId  = t3.UserId        and CreationDate between trunc(t1.CreationDate) and trunc(t1.CreationDate)+n     ) = n  


Solution:12

I used a simple math property to identify who consecutively accessed the site. This property is that you should have the day difference between the first time access and last time equal to number of records in your access table log.

Here are SQL script that I tested in Oracle DB (it should work in other DBs as well):

-- show basic understand of the math properties     select    ceil(max (creation_date) - min (creation_date))                max_min_days_diff,             count ( * ) real_day_count      from   user_access_log  group by   user_id;      -- select all users that have consecutively accessed the site     select   user_id      from   user_access_log  group by   user_id    having       ceil(max (creation_date) - min (creation_date))             / count ( * ) = 1;        -- get the count of all users that have consecutively accessed the site     select   count(user_id) user_count      from   user_access_log  group by   user_id    having   ceil(max (creation_date) - min (creation_date))             / count ( * ) = 1;  

Table prep script:

-- create table   create table user_access_log (id           number, user_id      number, creation_date date);      -- insert seed data   insert into user_access_log (id, user_id, creation_date)    values   (1, 12, sysdate);    insert into user_access_log (id, user_id, creation_date)    values   (2, 12, sysdate + 1);    insert into user_access_log (id, user_id, creation_date)    values   (3, 12, sysdate + 2);    insert into user_access_log (id, user_id, creation_date)    values   (4, 16, sysdate);    insert into user_access_log (id, user_id, creation_date)    values   (5, 16, sysdate + 1);    insert into user_access_log (id, user_id, creation_date)    values   (6, 16, sysdate + 5);  


Solution:13

declare @startdate as datetime, @days as int  set @startdate = cast('11 Jan 2009' as datetime) -- The startdate  set @days = 5 -- The number of consecutive days    SELECT userid        ,count(1) as [Number of Consecutive Days]  FROM UserHistory  WHERE creationdate >= @startdate  AND creationdate < dateadd(dd, @days, cast(convert(char(11), @startdate, 113)  as datetime))  GROUP BY userid  HAVING count(1) >= @days  

The statement cast(convert(char(11), @startdate, 113) as datetime) removes the time part of the date so we start at midnight.

I would assume also that the creationdate and userid columns are indexed.

I just realized that this won't tell you all the users and their total consecutive days. But will tell you which users will have been visiting a set number of days from a date of your choosing.

Revised solution:

declare @days as int  set @days = 30  select t1.userid  from UserHistory t1  where (select count(1)          from UserHistory t3          where t3.userid = t1.userid         and t3.creationdate >= DATEADD(dd, DATEDIFF(dd, 0, t1.creationdate), 0)          and t3.creationdate < DATEADD(dd, DATEDIFF(dd, 0, t1.creationdate) + @days, 0)          group by t3.userid  ) >= @days  group by t1.userid  

I've checked this and it will query for all users and all dates. It is based on Spencer's 1st (joke?) solution, but mine works.

Update: improved the date handling in the second solution.


Solution:14

This should do what you want but I don't have enough data to test efficiency. The convoluted CONVERT/FLOOR stuff is to strip the time portion off the datetime field. If you're using SQL Server 2008 then you could use CAST(x.CreationDate AS DATE).

  DECLARE @Range as INT  SET @Range = 10    SELECT DISTINCT UserId, CONVERT(DATETIME, FLOOR(CONVERT(FLOAT, a.CreationDate)))    FROM tblUserLogin a  WHERE EXISTS     (SELECT 1         FROM tblUserLogin b        WHERE a.userId = b.userId          AND (SELECT COUNT(DISTINCT(CONVERT(DATETIME, FLOOR(CONVERT(FLOAT, CreationDate)))))                 FROM tblUserLogin c                WHERE c.userid = b.userid                  AND CONVERT(DATETIME, FLOOR(CONVERT(FLOAT, c.CreationDate))) BETWEEN CONVERT(DATETIME, FLOOR(CONVERT(FLOAT, a.CreationDate))) and CONVERT(DATETIME, FLOOR(CONVERT(FLOAT, a.CreationDate)))+@Range-1) = @Range)  

Creation script

  CREATE TABLE [dbo].[tblUserLogin](      [Id] [int] IDENTITY(1,1) NOT NULL,      [UserId] [int] NULL,      [CreationDate] [datetime] NULL  ) ON [PRIMARY]  


Solution:15

Spencer almost did it, but this should be the working code:

SELECT DISTINCT UserId  FROM History h1  WHERE (      SELECT COUNT(*)       FROM History      WHERE UserId = h1.UserId AND CreationDate BETWEEN h1.CreationDate AND DATEADD(d, @n-1, h1.CreationDate)  ) >= @n  


Solution:16

Off the top of my head, MySQLish:

SELECT start.UserId  FROM UserHistory AS start    LEFT OUTER JOIN UserHistory AS pre_start ON pre_start.UserId=start.UserId      AND DATE(pre_start.CreationDate)=DATE_SUB(DATE(start.CreationDate), INTERVAL 1 DAY)    LEFT OUTER JOIN UserHistory AS subsequent ON subsequent.UserId=start.UserId      AND DATE(subsequent.CreationDate)<=DATE_ADD(DATE(start.CreationDate), INTERVAL 30 DAY)  WHERE pre_start.Id IS NULL  GROUP BY start.Id  HAVING COUNT(subsequent.Id)=30  

Untested, and almost certainly needs some conversion for MSSQL, but I think that give some ideas.


Solution:17

How about one using Tally tables? It follows a more algorithmic approach, and execution plan is a breeze. Populate the tallyTable with numbers from 1 to 'MaxDaysBehind' that you want to scan the table (ie. 90 will look for 3 months behind,etc).

declare @ContinousDays int  set @ContinousDays = 30  -- select those that have 30 consecutive days    create table #tallyTable (Tally int)  insert into #tallyTable values (1)  ...  insert into #tallyTable values (90) -- insert numbers for as many days behind as you want to scan    select [UserId],count(*),t.Tally from HistoryTable   join #tallyTable as t on t.Tally>0  where [CreationDate]> getdate()-@ContinousDays-t.Tally and         [CreationDate]<getdate()-t.Tally   group by [UserId],t.Tally   having count(*)>=@ContinousDays    delete #tallyTable  


Solution:18

Tweaking Bill's query a bit. You might have to truncate the date before grouping to count only one login per day...

SELECT UserId from History   WHERE CreationDate > ( now() - n )  GROUP BY UserId,   DATEADD(dd, DATEDIFF(dd, 0, CreationDate), 0) AS TruncatedCreationDate    HAVING COUNT(TruncatedCreationDate) >= n  

EDITED to use DATEADD(dd, DATEDIFF(dd, 0, CreationDate), 0) instead of convert( char(10) , CreationDate, 101 ).

@IDisposable I was looking to use datepart earlier but i was too lazy to look up the syntax so i figured i d use convert instead. I dint know it had a significant impact Thanks! now i know.


Solution:19

assuming a schema that goes like:

create table dba.visits  (      id  integer not null,      user_id integer not null,      creation_date date not null  );  

this will extract contiguous ranges from a date sequence with gaps.

select l.creation_date  as start_d, -- Get first date in contiguous range      (          select min(a.creation_date ) as creation_date           from "DBA"."visits" a               left outer join "DBA"."visits" b on                      a.creation_date = dateadd(day, -1, b.creation_date ) and                      a.user_id  = b.user_id               where b.creation_date  is null and                    a.creation_date  >= l.creation_date  and                    a.user_id  = l.user_id       ) as end_d -- Get last date in contiguous range  from  "DBA"."visits" l      left outer join "DBA"."visits" r on           r.creation_date  = dateadd(day, -1, l.creation_date ) and           r.user_id  = l.user_id       where r.creation_date  is null  

Note:If u also have question or solution just comment us below or mail us on toontricks1994@gmail.com
Previous
Next Post »