Selvam S
July 11, 2016
This post is about reading SOLR data by directly accessing the Lucene index folder. If you are looking for a normal query method, you should look at a different SOLR tutorial; this post deals with some low-level details.
Coming to our scenario: we had to fetch all unique IDs from SOLR 5.5 running in cloud mode and write them to a file, so that we could run a comparison to find missing IDs. We had a total of 350 million records, so we needed all 350 million IDs. Normally, you can query SOLR from, say, Java, PHP or Python and write the results to a file. But with this many documents, the query led to memory errors even with 80GB of RAM.
Looking for a faster solution, we decided to go low level and use Lucene directly. We used lucene-core-5.5.1.jar with a simple Java program that reads the document IDs and writes them to a file. See the code below.
import java.io.FileWriter;
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;

import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.MultiFields;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Bits;

public class readIdSolr {

    // Writes the collected IDs to <file>.txt, one ID per line.
    private static void createFile(String file, ArrayList<String> arrData)
            throws IOException {
        try (FileWriter writer = new FileWriter(file + ".txt")) {
            for (String str : arrData) {
                writer.write(str);
                writer.write(System.lineSeparator());
            }
        }
    }

    public static void main(String[] args) throws IOException {
        Path path = Paths.get("index"); // replace with the path to the index folder
        Directory dirIndex = FSDirectory.open(path);
        IndexReader indexReader = DirectoryReader.open(dirIndex);
        ArrayList<String> docIds = new ArrayList<String>();
        System.out.println("In--total--" + indexReader.numDocs());

        // Internal document numbers run from 0 to maxDoc() - 1 and still include
        // deleted documents, so loop to maxDoc() and skip anything not marked live.
        Bits liveDocs = MultiFields.getLiveDocs(indexReader); // null if there are no deletions
        int cnt = 0;
        for (int i = 0; i < indexReader.maxDoc(); i++) {
            if (liveDocs != null && !liveDocs.get(i)) {
                continue; // deleted document
            }
            Document doc = indexReader.document(i);
            String id = doc.get("id"); // only works if "id" is a stored field
            if (id != null) {
                docIds.add(id);
            }
            cnt += 1;
            if (cnt % 10000 == 0) {
                System.out.println("Current cnt " + cnt);
            }
        }

        createFile("MyDataFile", docIds);
        indexReader.close();
        dirIndex.close();
    }
}
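The program above collects every ID in an ArrayList before writing the file. With hundreds of millions of IDs, it may be worth writing each ID as soon as it is read instead. The following is a minimal sketch of that idea using only the standard library; `StreamingIdWriter` and `writeAll` are hypothetical names, and the `Iterator<String>` is a stand-in for the loop over the index reader.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.Iterator;

public class StreamingIdWriter {

    // Writes each ID to outFile as soon as it is produced, one per line,
    // and returns how many were written. Null values are skipped.
    static long writeAll(Iterator<String> ids, Path outFile) throws IOException {
        long written = 0;
        try (BufferedWriter writer = Files.newBufferedWriter(outFile, StandardCharsets.UTF_8)) {
            while (ids.hasNext()) {
                String id = ids.next();
                if (id == null) {
                    continue;
                }
                writer.write(id);
                writer.newLine();
                written++;
            }
        }
        return written;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in data (assumption): in the real program the iterator
        // would wrap the loop that reads documents from the index.
        Iterator<String> ids = Arrays.asList("doc-1", "doc-2", "doc-3").iterator();
        Path out = Files.createTempFile("ids", ".txt");
        long count = writeAll(ids, out);
        System.out.println("Wrote " + count + " ids to " + out);
    }
}
```

Because nothing is accumulated, memory use stays roughly constant no matter how many documents the index holds.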
You can replace "id" in doc.get("id") with whichever stored fields you wish to fetch.
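Once the IDs are in a file, the comparison mentioned earlier (finding missing IDs) can be done offline. Here is a rough sketch, assuming two text files with one ID per line; `MissingIdFinder`, `findMissing` and the file roles are hypothetical and not part of the original setup. It loads one file into a HashSet, which may itself need a lot of memory at 350 million IDs, so sorting both files and comparing them line by line could be a better fit at that scale.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class MissingIdFinder {

    // Returns every ID listed in sourceFile that does not appear in solrFile.
    // Both files are assumed to contain one ID per line; blank lines are ignored.
    static List<String> findMissing(Path sourceFile, Path solrFile) throws IOException {
        Set<String> solrIds = new HashSet<>();
        try (BufferedReader reader = Files.newBufferedReader(solrFile, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (!line.isEmpty()) {
                    solrIds.add(line);
                }
            }
        }

        List<String> missing = new ArrayList<>();
        try (BufferedReader reader = Files.newBufferedReader(sourceFile, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (!line.isEmpty() && !solrIds.contains(line)) {
                    missing.add(line);
                }
            }
        }
        return missing;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: java MissingIdFinder <source-ids.txt> <solr-ids.txt>");
            return;
        }
        for (String id : findMissing(Paths.get(args[0]), Paths.get(args[1]))) {
            System.out.println(id);
        }
    }
}
```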