Tutorial: Remove duplicates by field from one table using another using LINQ


I need to keep in a DataTable only the records whose dates are not already present in the database.

So I read all existing dates using the stored procedure (is it correct?):

    SELECT DISTINCT CAST(S.[date] AS DATE) -- original date is DATETIME2(0)
    FROM ...
    WHERE ...

and load it to a DataTable:

    var tableDate = new DataTable();
    new SqlDataAdapter(command).Fill(tableDate);

How can I now remove all the unnecessary rows from the other table? I think LINQ could help, but I'm not sure how.


I'm looking at your answer, which you say works, and you just want to know how to do it in a "single LINQ query." Keep in mind that these queries all have deferred execution, so the following two queries are functionally equivalent:

    var q =
        from d in dates
        select d.Field<DateTime>("date");

    return
        (from r in records
         where !q.Contains(r.Field<DateTime>("date"))
         select r).CopyToDataTable();


    return
        (from r in records
         where !dates
             .Select(d => d.Field<DateTime>("date"))
             .Contains(r.Field<DateTime>("date"))
         select r).CopyToDataTable();

The second version is a lot harder to read, but nevertheless, it is "one query."
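Because `Contains` on a deferred query walks the `dates` sequence again for every record, a variant that first materializes the existing dates into a `HashSet` may scale better on large tables. The following is only a minimal sketch, assuming both tables have a DateTime column named "date"; the `DateFilter` class and `KeepNewDates` method are hypothetical names, not part of the original code.

```csharp
using System;
using System.Collections.Generic;
using System.Data;
using System.Linq;

static class DateFilter
{
    // Hypothetical helper: returns a new table holding only the rows of
    // `records` whose "date" value does not occur in `dates`.
    public static DataTable KeepNewDates(DataTable records, DataTable dates)
    {
        // Materialize the existing dates once so each lookup is a hash probe.
        var existing = new HashSet<DateTime>(
            dates.AsEnumerable().Select(d => d.Field<DateTime>("date")));

        var kept = records.AsEnumerable()
            .Where(r => !existing.Contains(r.Field<DateTime>("date")))
            .ToList();

        // CopyToDataTable throws InvalidOperationException on an empty
        // sequence, so fall back to an empty table with the same schema.
        return kept.Count > 0 ? kept.CopyToDataTable() : records.Clone();
    }
}
```

The empty-sequence guard applies equally to the two queries above: if every record's date already exists, `CopyToDataTable()` has no rows to copy and throws.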

Having said this, none of these examples really seem to match your question title, which suggests that you are trying to remove duplicate rows. If that is indeed what you are trying to do, here is a method that will do that:

    static DataTable RemoveDuplicates(DataTable dt)
    {
        // Field<string> assumes a string column; use Field<DateTime>
        // if "date" is typed as DateTime.
        return
            (from row in dt.Rows.OfType<DataRow>()
             group row by row.Field<string>("date") into g
             select g
                .OrderBy(r => r.Field<int>("ID"))
                .First()).CopyToDataTable();
    }

If you don't care which of the duplicates is kept, you can just remove the OrderBy line. You can test this as follows:

    static void Main(string[] args)
    {
        using (DataTable original = CreateSampleTable())
        using (DataTable filtered = RemoveDuplicates(original))
        {
            DumpTable(filtered);
        }
        Console.ReadKey();
    }

    static DataTable CreateSampleTable()
    {
        DataTable dt = new DataTable();
        dt.Columns.Add("ID", typeof(int));
        dt.Columns.Add("Code", typeof(string));
        dt.Columns.Add("Name", typeof(string));
        dt.Rows.Add(1, "123", "Alice");
        dt.Rows.Add(2, "456", "Bob");
        dt.Rows.Add(3, "456", "Chris");
        dt.Rows.Add(4, "789", "Dave");
        dt.Rows.Add(5, "123", "Elen");
        dt.Rows.Add(6, "123", "Frank");
        return dt;
    }

    static void DumpTable(DataTable dt)
    {
        foreach (DataRow row in dt.Rows)
        {
            Console.WriteLine("{0},{1},{2}",
                row.Field<int>("ID"),
                row.Field<string>("Code"),
                row.Field<string>("Name"));
        }
    }

(just replace "date" with "Code" in the RemoveDuplicates method for this example)

Hopefully one of these answers your question. Otherwise I think you're going to have to be more clear with your requirements.


You could use Except()

return records.Except(dates);

UPDATED: If your DataTable has typed fields, then it should be like the following:

    var excluded = arbDates.Rows.OfType<System.Data.DataRow>()
        .Select(a => a[0])
        .Except(excDates.Rows.OfType<System.Data.DataRow>()
            .Select(e => e[0]));

otherwise you could cast it:

    var excluded = arbDates.Rows.OfType<System.Data.DataRow>()
        .Select(a => Convert.ToDateTime(a[0].ToString()))
        .Except(excDates.Rows.OfType<System.Data.DataRow>()
            .Select(e => Convert.ToDateTime(e[0].ToString())));
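One thing worth keeping in mind: if `records` and `dates` are sequences of DataRow, a plain `Except` compares the rows by reference, so rows coming from two different tables never match. Projecting the values first (as above) avoids this; another option is to pass `DataRowComparer.Default`, which compares rows by their column values. The sketch below assumes both tables have exactly the same columns (here a single DateTime column), since the comparer looks at every column; `RowExcept` and `NewRows` are hypothetical names.

```csharp
using System.Collections.Generic;
using System.Data;
using System.Linq;

static class RowExcept
{
    // Hypothetical helper: rows of `records` that have no value-equal
    // counterpart in `dates`. Assumes both tables share the same columns.
    public static List<DataRow> NewRows(DataTable records, DataTable dates)
    {
        return records.AsEnumerable()
            // DataRowComparer.Default compares column values, not references.
            .Except(dates.AsEnumerable(), DataRowComparer.Default)
            .ToList();
    }
}
```

Note that `Except` has set semantics, so value-equal rows within `records` itself are also collapsed to a single row in the result.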


Your SQL statement looks fine. As I understand it, you're casting to get the default time value starting at midnight. The dates in the other table being compared must therefore also have a midnight time component for the comparison to work. If they don't, you can still use the code below, but you must add the .Date property wherever the tableResult row's field is referenced. I have used Field<DateTime>(0), but depending on your query, and based on your earlier example, you may need to use Field<DateTime>("date") instead.

There's no need for a custom comparer. To merge your LINQ queries into a single query you could simply use the let keyword and carry the intermediate result through the query and reference it.

Give this a try:

    var tableDate = new DataTable();
    new SqlDataAdapter(command).Fill(tableDate);

    // this is the other table that has other dates, so populate as needed
    var tableResult = new DataTable();

    var newTable =
        (from row in tableResult.AsEnumerable()
         let uniqueRows = tableResult.AsEnumerable().Select(r => r.Field<DateTime>(0))
                                     .Except(tableDate.AsEnumerable().Select(r => r.Field<DateTime>(0)))
         where uniqueRows.Contains(row.Field<DateTime>(0))
         select row).CopyToDataTable();

In dot notation the query would be:

    var newTable = tableResult.AsEnumerable()
        .Select(row => new
        {
            Row = row,
            UniqueRows = tableResult.AsEnumerable()
                                    .Select(r => r.Field<DateTime>(0))
                                    .Except(tableDate.AsEnumerable().Select(r => r.Field<DateTime>(0)))
        })
        .Where(item => item.UniqueRows.Contains(item.Row.Field<DateTime>(0)))
        .Select(item => item.Row)
        .CopyToDataTable();

Instead of tableResult.AsEnumerable() you could use tableResult.Rows.Cast<DataRow>() or tableResult.Rows.OfType<DataRow>(). The results are the same between all these approaches.

If you want to remove duplicates from the existing table (rather than copy it to a new table), you could remove the items returned by the Intersect method from the table:

    var commonDates = tableDate.AsEnumerable().Select(row => row.Field<DateTime>(0))
                               .Intersect(tableResult.AsEnumerable().Select(row => row.Field<DateTime>(0)));

    for (int index = tableResult.Rows.Count - 1; index >= 0; index--)
    {
        if (commonDates.Contains(tableResult.Rows[index].Field<DateTime>(0)))
        {
            tableResult.Rows.RemoveAt(index);
        }
    }
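If the values in tableResult carry a time component, the .Date normalization mentioned earlier can be folded into the same in-place loop. The following is a sketch only, assuming column 0 of both tables is a DateTime; `InPlaceFilter` and `RemoveExistingDates` are hypothetical names. It also collects the existing dates into a `HashSet` once, rather than re-evaluating a deferred query on every iteration.

```csharp
using System;
using System.Collections.Generic;
using System.Data;
using System.Linq;

static class InPlaceFilter
{
    // Hypothetical helper: deletes from `tableResult` every row whose
    // calendar date (column 0) already occurs in `tableDate`.
    // Returns the number of rows removed.
    public static int RemoveExistingDates(DataTable tableResult, DataTable tableDate)
    {
        // .Date strips the time part so only calendar days are compared.
        var existing = new HashSet<DateTime>(
            tableDate.AsEnumerable().Select(r => r.Field<DateTime>(0).Date));

        int removed = 0;
        // Iterate backwards so removals do not shift unvisited indexes.
        for (int i = tableResult.Rows.Count - 1; i >= 0; i--)
        {
            if (existing.Contains(tableResult.Rows[i].Field<DateTime>(0).Date))
            {
                tableResult.Rows.RemoveAt(i);
                removed++;
            }
        }
        return removed;
    }
}
```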


As I understand the problem, you are trying to de-dup data coming from some import. You may not need to do this using LINQ. Although the post title suggests LINQ, you later question whether LINQ might be the best solution and, given what we know, I think you could do this using a single Insert statement.

First, I'd suggest bulk copying the data into a temporary location in the db (if you are not already doing this) like so:

    Create Table TempBulkCopyData
    (
        Id int not null identity(1,1)
        , Date DateTime2 not null
        , ...
    )

One of the advantages of bulk copying into a temporary location is that you can add indexes and such to speed up the cleaning process. To de-dup the data, you could then run a query like so:

    Insert DestinationData(...)
    Select ...
    From TempBulkCopyData As BCD
    Where BCD.Id = (
                Select Min(BCD2.[Id])
                From TempBulkCopyData As BCD2
                Where Cast(BCD2.[Date] As Date) = Cast(BCD.[Date] As Date)
                )


or, comparing the days with DateDiff instead:

    Insert DestinationData(...)
    Select ...
    From TempBulkCopyData As BCD
    Where BCD.Id = (
                Select Min(BCD2.[Id])
                From TempBulkCopyData As BCD2
                Where DateDiff(d, BCD.[Date], BCD2.[Date]) = 0
                )

This will pull the first row it finds for each date (the one with the lowest Id). This is obviously somewhat arbitrary, but to get more refined we'd need to know more about the data structure and requirements.
