Tutorial: Splitting a Lucene index into two halves



Question:

What is the best way to split an existing Lucene index into two halves, i.e. so that each split contains half of the documents in the original index?


Solution:1

The easiest way to split an existing index (without reindexing all the documents) is to:

  1. Make another copy of the existing index (e.g. cp -r myindex mycopy)
  2. Open the first index, and delete half the documents (doc IDs 0 up to, but not including, maxDoc / 2)
  3. Open the second index, and delete the other half (doc IDs maxDoc / 2 up to, but not including, maxDoc)
  4. Optimize both indices, so that the deleted documents are actually purged from disk

This is probably not the most efficient way, but it requires very little coding to do.
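
The boundary between the two halves is easy to get wrong by one, so here is a minimal Java sketch of just the range arithmetic. The class and method names are hypothetical, and the actual deletion through Lucene's API is not shown. It treats both ranges as half-open, so every doc ID is deleted from exactly one copy. Note that maxDoc also counts documents that are already deleted, so the halves are only approximately equal if the index has pending deletions.

```java
/** Hypothetical helper: doc-ID ranges to delete from each copy, as half-open [from, to). */
final class HalfSplitRanges {
    private HalfSplitRanges() {}

    /** Range to delete from the first copy: the lower half, [0, maxDoc / 2). */
    static int[] deleteFromFirstCopy(int maxDoc) {
        return new int[] {0, maxDoc / 2};
    }

    /** Range to delete from the second copy: the upper half, [maxDoc / 2, maxDoc). */
    static int[] deleteFromSecondCopy(int maxDoc) {
        return new int[] {maxDoc / 2, maxDoc};
    }

    /** True if docId falls inside the half-open range [range[0], range[1]). */
    static boolean inRange(int docId, int[] range) {
        return docId >= range[0] && docId < range[1];
    }
}
```

With an odd maxDoc the second copy's deletion range is one document larger, so the first copy ends up keeping the extra document.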


Solution:2

A fairly robust mechanism is to use a checksum of the document, modulo the number of indexes, to decide which index it will go into.
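
As a rough illustration of that idea, the sketch below routes a document to an index by taking a CRC32 checksum of a string key, modulo the number of indexes. The class name, the method name, and the choice of a unique string key as the checksum input are assumptions made for the example, not something the answer specifies. Only the JDK's java.util.zip.CRC32 is used, and reading or writing the Lucene indexes is left out.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

/** Hypothetical router: picks a target index from a checksum of the document key. */
final class ChecksumRouter {
    private ChecksumRouter() {}

    /** Returns a value in [0, numIndexes) that is stable for a given docKey. */
    static int indexFor(String docKey, int numIndexes) {
        if (numIndexes <= 0) {
            throw new IllegalArgumentException("numIndexes must be positive");
        }
        CRC32 crc = new CRC32();
        crc.update(docKey.getBytes(StandardCharsets.UTF_8));
        // getValue() is an unsigned 32-bit value held in a long, so the modulo is never negative.
        return (int) (crc.getValue() % numIndexes);
    }
}
```

Because the assignment depends only on the key, the same document always lands in the same index, which also makes later updates and deletes easy to route. The split is only statistically even, so the two indexes will rarely hold exactly half of the documents each.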


Solution:3

Recent versions of Lucene have a dedicated tool to do this (IndexSplitter and MultiPassIndexSplitter under contrib/misc).


Solution:4

This question was one of the first I found when I was researching answers to this problem, so I'm leaving my solution here for future generations. In my case, I needed to split my index along specific lines, not arbitrarily down the middle or into thirds or what have you. This is a C# solution using Lucene 3.0.3.

My app's index is over 300GB in size, which was becoming a little unmanageable. Each document in the index is associated with one of the manufacturing plants that use the app. There is no business reason that one plant would ever search for another plant's data, so I needed to cleanly divide the index along those lines. Here's the code I wrote to do so:

    var distinctPlantIDs = databaseRepo.GetDistinctPlantIDs();
    var sourceDir = GetOldIndexDir();

    foreach (var plantID in distinctPlantIDs)
    {
        var query = new TermQuery(new Term("PlantID", plantID.ToString()));
        var targetDir = GetNewIndexDirForPlant(plantID); //returns a unique directory where this plant's index will go

        //read each plant's documents and write them to the new index
        using (var analyzer = new StandardAnalyzer(Version.LUCENE_30, CharArraySet.EMPTY_SET))
        using (var sourceSearcher = new IndexSearcher(sourceDir, true))
        using (var destWriter = new IndexWriter(targetDir, analyzer, true, IndexWriter.MaxFieldLength.UNLIMITED))
        {
            var numHits = sourceSearcher.DocFreq(query.Term);
            if (numHits <= 0) continue;

            var hits = sourceSearcher.Search(query, numHits).ScoreDocs;
            foreach (var hit in hits)
            {
                var doc = sourceSearcher.Doc(hit.Doc);
                destWriter.AddDocument(doc);
            }

            destWriter.Optimize();
            destWriter.Commit();
        }

        //delete the documents out of the old index
        using (var analyzer = new StandardAnalyzer(Version.LUCENE_30, CharArraySet.EMPTY_SET))
        using (var sourceWriter = new IndexWriter(sourceDir, analyzer, false, IndexWriter.MaxFieldLength.UNLIMITED))
        {
            sourceWriter.DeleteDocuments(query);
            sourceWriter.Commit();
        }
    }

The part that deletes the records out of the old index is there because, in my case, one plant's records took up the majority of the index (over two thirds). So in my real version there is some extra code to handle that plant last: instead of splitting it out like the others, it optimizes the remaining index (which by then holds only that plant) and then moves it to its new directory.

Anyway, hope this helps someone out there.

