Solr Streaming Expressions

Solr 5.1 introduced a new Streaming API, and Solr 5.2 adds Streaming Expressions on top of it. If you have ever wondered how to run nested queries in Solr or how to run queries in parallel, this could be the answer.

Streaming Expressions provide a simple query language for SolrCloud that merges search with parallel computing. Under the covers, they are backed by a Java Streaming API that provides a fast map/reduce implementation for SolrCloud. Streaming Expressions are composed of functions. All functions behave like streams, which means that they don't hold all the data in memory at once. Read more about the basics at https://cwiki.apache.org/confluence/display/solr/Streaming+Expressions

Setup:

The steps below assume a Debian-based system such as Ubuntu 12.04 or 14.04. If you have not installed Solr 5.2 yet, download the latest release (for example, from https://apache.mirror1.spango.com/lucene/solr/5.2.1/) and extract it.

Setup Solr in Cloud Mode:

Cloud mode lets you create collections and nodes. See https://cwiki.apache.org/confluence/display/solr/Getting+Started+with+SolrCloud for more details.

bin/solr -e cloud

Enter the port and other details when prompted.

To start a single node, use:

bin/solr start -cloud -s example/cloud/node1/solr -p 8983

Streaming API:

Now comes the interesting part. The following streaming functions are available:

  • Search
  • Merge
  • Unique
  • Group
  • Top
  • Parallel

I am going to write about Search, Merge, and Unique. Let us assume the indexed documents have two fields, id and city.
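To follow along you need a few documents with those two fields. The sketch below writes a small JSON file for that purpose. The file name cities.json and the four documents are made up for illustration; only the field names and the two city values come from the examples in this post.

```shell
# Write a small, hypothetical sample data set with the two fields used below.
cat > cities.json <<'EOF'
[
  {"id": "1", "city": "San Pedro"},
  {"id": "2", "city": "Stockbridge"},
  {"id": "3", "city": "San Pedro"},
  {"id": "4", "city": "Stockbridge"}
]
EOF

# Count the documents in the file (one per line that contains an "id" key).
grep -c '"id"' cities.json
```

With the node from the setup step running, bin/post -c gettingstarted cities.json should index the file. How the city field ends up being typed depends on the collection's schema, so it is worth checking that before relying on filters or sorts.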

Search:

Search is the basic streaming function. It runs a single query against a collection and returns the matching documents as a stream.

curl --data-urlencode 'stream=search(gettingstarted,q="*:*",fl="id, city", fq="city:San Pedro",sort="id asc")' http://localhost:8983/solr/gettingstarted/stream

'gettingstarted' is the collection name. The fl parameter lists the fields to return, fq filters the results, and sort defines the order of the stream.
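If you run many searches that differ only in the city, it can help to build the expression in a small shell function. This is just a convenience sketch: search_expr and run_stream are hypothetical helper names, not part of Solr, and the URL assumes a local node listening on port 8983 over plain HTTP.

```shell
# Hypothetical helper: build a search() expression for one collection and city.
# $1 = collection name, $2 = city value used in the filter query.
search_expr() {
  printf 'search(%s,q="*:*",fl="id,city",fq="city:%s",sort="id asc")' "$1" "$2"
}

# Hypothetical helper: POST an expression to the collection's /stream handler.
# $1 = collection name, $2 = streaming expression.
# Assumes a local SolrCloud node on port 8983.
run_stream() {
  curl --data-urlencode "stream=$2" "http://localhost:8983/solr/$1/stream"
}

# Print the expression so it can be checked before it is sent.
search_expr gettingstarted "San Pedro"
echo
```

Calling run_stream gettingstarted "$(search_expr gettingstarted "San Pedro")" would then send a request equivalent to the curl command above.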

Merge:

Merges two Streaming Expressions and maintains the ordering of the underlying streams. Both streams need to be sorted by the field named in the on parameter.

curl --data-urlencode 'stream=merge(search(gettingstarted,q="*:*",fl="id, city", fq="city:San Pedro",sort="id asc", rows=5), search(gettingstarted, q="*:*",fl="id, city", fq="city:Stockbridge",sort="id asc", rows=10), on="id asc" )' http://localhost:8983/solr/gettingstarted/stream

Here, two expressions are merged into a single one. Note that each search has its own 'rows' attribute, which limits the number of records for that expression separately. Merge supports only two expressions by default; to combine more, you can nest the merge calls.

For example:

curl --data-urlencode 'stream=merge(search(gettingstarted,q="*:*",fl="id, city", fq="city:San Pedro",sort="id asc"), merge(search(gettingstarted,q="*:*",fl="id, city", fq="city:Stockbridge",sort="id asc", rows=5),search(gettingstarted,q="*:*",fl="id, city", fq="city:Stockbridge",sort="id asc", rows=5),on="id asc"), on="id asc")' http://localhost:8983/solr/gettingstarted/stream
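Writing nested merges by hand gets error-prone quickly. As a sketch, the nesting can be generated by folding a list of expressions into two-way merges. nested_merge is a hypothetical helper, not a Solr command, and A, B and C stand in for complete search(...) expressions.

```shell
# Hypothetical helper: fold any number of expressions into nested two-way merges.
# Usage: nested_merge "<sort spec>" expr1 expr2 [expr3 ...]
nested_merge() {
  on="$1"; shift
  expr="$1"; shift
  for next in "$@"; do
    # Wrap what we have so far and the next expression in one more merge().
    expr="merge(${expr},${next},on=\"${on}\")"
  done
  printf '%s\n' "$expr"
}

# A, B and C are placeholders for complete search(...) expressions.
nested_merge "id asc" A B C
# prints: merge(merge(A,B,on="id asc"),C,on="id asc")
```

The helper nests on the left, while the hand-written example nests on the right. As long as every input is sorted by the same field, the merged order should come out the same either way.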

Unique:

Wraps a Streaming Expression and emits a unique stream of Tuples based on the over parameter.

curl --data-urlencode 'stream=unique(merge(search(gettingstarted,q="*:*",fl="id, city", fq="city:San Pedro",sort="id asc"), search(gettingstarted, q="*:*",fl="id, city", fq="city:Stockbridge",sort="id asc"), on="id asc"), over="id asc")' http://localhost:8983/solr/gettingstarted/stream

Note that I have used merge inside unique. By combining functions this way, you can build much richer queries.
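To build intuition for what merge and unique do with sorted streams, here is a local analogy that uses standard Unix tools instead of Solr. It is only an analogy: the two files and their ids are made up, sort -m plays the role of merge, and uniq plays the role of unique.

```shell
# Two already-sorted "streams" of ids (hypothetical sample data).
printf '1\n3\n5\n' > san_pedro_ids.txt
printf '2\n3\n6\n' > stockbridge_ids.txt

# sort -m merges pre-sorted inputs without re-sorting them, like merge.
# uniq drops adjacent duplicates, which is enough here because the
# merged output is still sorted, like unique over the same field.
merge_unique() {
  sort -n -m "$1" "$2" | uniq
}

merge_unique san_pedro_ids.txt stockbridge_ids.txt
```

The id 3 appears in both files but only once in the result. The analogy also shows why the inputs have to arrive sorted by the merge field: neither step ever looks back at earlier records.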
