Tutorial :Categorizing input data into sets based on attribute



I feel like this problem is related to the bin packing problem, as well as potentially to the set partitioning problem... I just want to bounce this off of someone before I head down the path too deeply.

I have input data (in a datafile) as follows:

entry_one 55  entry_two 56  entry_three 61  entry_four 62  entry_five 62  entry_six 68  entry_seven 72  entry_eight 73  entry_nine 78  entry_ten 79  entry_eleven 84  entry_twelve 85  entry_thirteen 91  entry_fourteen 92  entry_fifteen 99  entry_sixteen 100  entry_seventeen 121  entry_eighteen 125  entry_nineteen 127  entry_twenty 161  

With this data I want to have an algorithm that: groups the entries into groups such that the entries's associated numerical values within a group are within X (in my case, X is 16.) So for example, one arrangement could be:

group one:   entry_one   entry_two   entry_three   entry_four   entry_five   entry_six    group two:   entry_seven   entry_eight   entry_nine   entry_ten   entry_eleven   entry_twelve    group three:   entry_thirteen   entry_fourteen   entry_fifteen   entry_sixteen    group four:   entry_seventeen   entry_eighteen   entry_nineteen    group five:   entry_twenty  

This particular arrangement was achieved using a naive greedy algorithm in which I started with the lowest value (entry_one's 55), and allowed all values that were under 55+16 to be part of that group. I then started with the very next entry which was not in that group (entry_seven's 72) and allowed all values that were under 72+16 to be part of that group (group two), and so on in that order.

I believe that although a naive greedy algorithm works, it is unlikely to give me an optimal grouping/categorization, where I define "optimal grouping" such that the total number of groups is what is being minimized (in my case, this is for job scheduling, so I want to group like work as best as possible to minimize changeover.)

Any thoughts, modules, algorithms, sample code out there that people can suggest?


EDIT: I thought I should add how this is different from the bin packing problem. In the bin packing problem the optimization is: "given these bins of fixed size, with these objects of fixed size, how can I stuff the most of these fixed size objects into these bins without overflowing each bin." In my case, what I have is bins of infinite size but of filtered entry, so that if an object does "match" the signature for a bin, it can be inserted into said bin, but what we want is to minimize the total number of bins that we need to create.


  1. Start at the point nearest the midpoint
  2. select all values within half the range of it
  3. Use your existing solution going out in both directions.
  4. Repeat from step 2 using every other point in the original selection.
  5. select the best result.

Worst case this is an O(n2) problem as you can replace step 4 with "repeat with all points".


If I understand the problem right, then the greedy algorithm is in fact optimal. Let G be the greedy solution and let S be any solution. Consider the first group where G differs from S. Since G is a greedy solution, G's group is larger than S's. For example,

G : [1, 2, 3][4, 5, 6, 7][8, 9, 10][11, 12]...  S : [1, 2, 3][4, 5, 6][7, 8, 9][10, 11, 12]...  

We can derive a new solution S' from S that agrees with G on a longer prefix without introducing a new group. Just steal some items from the next group. The resulting groups are valid because one is already in G and the other is a subgroup of a valid group.

S': [1, 2, 3][4, 5, 6, 7][8, 9][10, 11, 12]...  

By induction, it follows that G is in fact optimal.

Note:If u also have question or solution just comment us below or mail us on toontricks1994@gmail.com
Next Post »