At KnackForge we strive to deliver the best Drupal sites. As part of that effort, we opt to use Apache Solr as the search engine for suitable Drupal sites, which makes them more scalable.
Recently, the KnackForge team was engaged in developing a massive science news publishing site inspired by eScienceNews.com. More details about this site can be found on its case study page on drupal.org.
We had a requirement to show similar articles in the sidebar. I would like to share some snippets from the module that we used to implement this feature. In our case, the similar articles are found using an Apache Mahout clustering algorithm, which gave us more precise matches than Solr's MoreLikeThis component. That precision is the main reason for using Mahout here.
We architected Mahout to use a Solr core as persistent storage. The output of the Mahout clustering program is written to a Solr index in a separate core. Later, when an individual article node page is viewed in Drupal, the related articles are queried from that core and displayed in a sidebar block. That is the architecture of this site in brief. Of course, there were plenty of hiccups here and there.
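The lookup step described above can be sketched as follows. This is not the Drupal module itself, only a minimal Python illustration of querying a separate Solr core for related articles. The base URL, the core name `clusters`, and the fields `source_nid`, `related_nid`, and `title` are all assumptions made for the example (one index document per source/related article pair); the real schema may look quite different.

```python
import json
from urllib.parse import urlencode

# Everything below is hypothetical: the host, the core name and the field
# names are illustrative and are not taken from the original module.
SOLR_BASE = "http://localhost:8983/solr"
CLUSTER_CORE = "clusters"


def build_related_query(node_id, rows=5):
    """Build the select URL asking the cluster core for articles related to node_id."""
    params = {
        "q": f"source_nid:{int(node_id)}",  # hypothetical field holding the viewed node's id
        "fl": "related_nid,title",          # only return the fields the block needs
        "rows": rows,                       # number of sidebar entries
        "wt": "json",                       # ask Solr for a JSON response
    }
    return f"{SOLR_BASE}/{CLUSTER_CORE}/select?{urlencode(params)}"


def parse_related(response_text):
    """Extract (nid, title) pairs from a standard Solr JSON response body."""
    docs = json.loads(response_text)["response"]["docs"]
    return [(doc["related_nid"], doc["title"]) for doc in docs]
```

In a Drupal site the same two steps would run when the sidebar block is rendered: build the query from the current node ID, then turn the returned documents into links.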
The code below handles the Drupal part of the tasks explained above:
Solr has been a revolution in the search world with its major implementations. Mahout is an exciting tool for machine learning work. In this article I am going to cover the integration of Solr and Mahout for classification.
Classification here is the process of assigning a piece of content to one of a predefined set of categories. The classification process depends on a model built from training sets. I will cover Mahout classification in my next blog post.
I am going to hook into the Solr update process, call the Mahout classifier, and add a category field based on the classifier's result. This way, every document gets its category assigned automatically at indexing time. Add the following configuration to solrconfig.xml.
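Independently of the Solr configuration, the per-document logic of such a hook can be sketched in a few lines. The Python below only illustrates the idea and is not Solr or Mahout code: in Solr itself this step would presumably live in a Java update request processor, and `keyword_classifier` is a toy stand-in for a trained Mahout model. The field names `body` and `category` are likewise assumptions.

```python
def keyword_classifier(text):
    """Toy stand-in for the Mahout classifier: a keyword lookup, not a trained model."""
    rules = {
        "physics": ["quantum", "particle"],
        "biology": ["cell", "gene"],
    }
    lowered = text.lower()
    for category, keywords in rules.items():
        if any(word in lowered for word in keywords):
            return category
    return "uncategorized"


def add_category(doc, classify=keyword_classifier,
                 text_field="body", category_field="category"):
    """Return a copy of doc with category_field filled in from the classifier."""
    enriched = dict(doc)  # leave the caller's document untouched
    enriched[category_field] = classify(str(doc.get(text_field, "")))
    return enriched


def process_batch(docs, classify=keyword_classifier):
    """Apply the hook to every document before it is handed to the indexer."""
    return [add_category(doc, classify) for doc in docs]
```

Passing the classifier in as a parameter keeps the hook independent of the model, so the toy lookup could be swapped for a call to a real classifier without touching the indexing step.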