Tutorial :Calculate TF-IDF using Sql



Question:

I have a table in my DB containning a free text field column.

I would like to know the frequency each word appears over all the rows, or maybe even calc a TF-IDF for all words, where my documents are that field's values per row.

Is it possible to calculate this using an Sql Query? if not or there's a simpler way could you please direct me to it?

Many Thanks,

Jon


Solution:1

In SQL Server 2008 depending on your needs you could apply full text indexing to the column then query the sys.dm_fts_index_keywords and sys.dm_fts_index_keywords_by_document table valued functions to get the occurrence count.

Edit: Actually even without creating a persistent full text index you can still leverage the parser

WITH testTable AS  (  SELECT 1 AS Id, N'how now brown cow' AS txt UNION ALL  SELECT 2, N'she sells sea shells upon the sea shore' UNION ALL  SELECT 3, N'red lorry yellow lorry' UNION ALL  SELECT 4, N'the quick brown fox jumped over the lazy dog'  )    SELECT display_term, COUNT(*) As Cnt  FROM testTable  CROSS APPLY sys.dm_fts_parser('"' + REPLACE(txt,'"','""') + '"', 1033, 0,0)  WHERE TXT IS NOT NULL  GROUP BY display_term  HAVING COUNT(*) > 1  ORDER BY Cnt DESC  

Returns

display_term                   Cnt  ------------------------------ -----------  the                            3  brown                          2  lorry                          2  sea                            2  


Solution:2

Solution for SQL Server 2008:

here is the table:

CREATE TABLE MyTable (id INT, txt VARCHAR(MAX));  

here is SQL query:

SELECT sum(case when TSplitted.txt_word = 'searched' then 1 else 0 end) as cnt_searched       , count(*) as cnt_all  FROM MyTable MYT   INNER JOIN Fn_Split(MYT.id,' ',MYT.txt) TSplitted on MYT.id=TSplitted.id  

here is table valued function Fn_Split(@id int, @separator VARCHAR(32), @string VARCHAR(MAX)) (taken from here):

CREATE FUNCTION Fn_Split (@id int, @separator VARCHAR(32), @string VARCHAR(MAX))    RETURNS @t TABLE      (          ret_id INT         ,txt_word VARCHAR(MAX)      )     AS      BEGIN          DECLARE @xml XML          SET @XML = N'<root><r>' + REPLACE(@s, @separator, '</r><r>') + '</r></root>'            INSERT INTO @t(ret_id, val)          SELECT @id, r.value('.','VARCHAR(5)') as Item          FROM @xml.nodes('//root/r') AS RECORDS(r)            RETURN      END  

Note:If u also have question or solution just comment us below or mail us on toontricks1994@gmail.com
Previous
Next Post »