
Solr Streaming Expressions

Solr 5.1 introduced a revolutionary Streaming API, and Solr 5.2 adds Streaming Expressions on top of it. If you have ever wondered how to run nested queries in Solr or how to use its parallel computing capabilities, this could be the answer.

Streaming Expressions provide a simple query language for SolrCloud that merges search with parallel computing. Under the covers, Streaming Expressions are backed by a Java Streaming API that provides a fast map/reduce implementation for SolrCloud. Streaming Expressions are composed of functions. All functions behave like streams, which means that they don't hold all the data in memory at once. Read more about the basics here: https://cwiki.apache.org/confluence/display/solr/Streaming+Expressions

Setup:

I am assuming a Debian-based system, say Ubuntu 12.04 or 14.04. If you have not installed Solr 5.2, grab the latest release (for example from http://apache.mirror1.spango.com/lucene/solr/5.2.1/) and extract it.

Set up Solr in cloud mode.

Cloud mode lets you create collections and nodes. See https://cwiki.apache.org/confluence/display/solr/Getting+Started+with+SolrCloud for more details.

bin/solr -e cloud

Enter the port and other details.

To start a single node, use:

bin/solr start -cloud -s example/cloud/node1/solr -p 8983

Streaming API:

Now comes the interesting part. We have the following streaming API functions:

  • Search
  • Merge
  • Unique
  • Group
  • Top
  • Parallel

I am going to write about Search, Merge and Unique. Let us assume we have two fields, called id and city.

Search:

Search is the basic streaming function. It processes a single expression and returns the matching data as a stream.

curl --data-urlencode 'stream=search(gettingstarted,q="*:*",fl="id, city", fq="city:San Pedro",sort="id asc")' http://localhost:8983/solr/gettingstarted/stream

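'gettingstarted' is the collection name. We use the fl parameter to choose the returned fields and the fq parameter to filter the results. Note that the standard query parser reads city:San Pedro as city:San plus a separate term Pedro against the default field, so an exact match on a two-word city needs a phrase query.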
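Since every example in this post has the same shape, it can help to build the expression in a small shell function and send it separately. The following is only a sketch: search_expr and run_stream are hypothetical helper names of my own, not part of Solr, and the host, port and field list are simply the ones assumed in this post.

```shell
# Hypothetical helper: print a search() expression for one collection and one filter query.
# Usage: search_expr COLLECTION FILTER_QUERY
search_expr() {
  printf 'search(%s,q="*:*",fl="id, city",fq="%s",sort="id asc")\n' "$1" "$2"
}

# Hypothetical helper: POST an expression to the collection's /stream handler.
# Usage: run_stream COLLECTION EXPRESSION  (needs a running SolrCloud node)
run_stream() {
  curl --data-urlencode "stream=$2" "http://localhost:8983/solr/$1/stream"
}

# Print the expression only; nothing is sent to Solr here.
search_expr gettingstarted 'city:Stockbridge'
```

With a node running, run_stream gettingstarted "$(search_expr gettingstarted 'city:Stockbridge')" would send the same kind of request as the curl command above.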

Merge:

Merges two Streaming Expressions and maintains the ordering of the underlying streams.

curl --data-urlencode 'stream=merge(search(gettingstarted,q="*:*",fl="id, city", fq="city:San Pedro",sort="id asc", rows=5), search(gettingstarted, q="*:*",fl="id, city", fq="city:Stockbridge",sort="id asc", rows=10), on="id asc" )' http://localhost:8983/solr/gettingstarted/stream

Here, we have two expressions that are merged into a single one. Also note that the 'rows' attribute inside each expression limits the records for that individual expression separately. In Solr 5.2, merge takes only two expressions. If you want to combine more than two, you can nest the merge calls.

For example:

curl --data-urlencode 'stream=merge(search(gettingstarted,q="*:*",fl="id, city", fq="city:San Pedro",sort="id asc"), merge(search(gettingstarted,q="*:*",fl="id, city", fq="city:Stockbridge",sort="id asc", rows=5),search(gettingstarted,q="*:*",fl="id, city", fq="city:Stockbridge",sort="id asc", rows=5),on="id asc"), on="id asc")' http://localhost:8983/solr/gettingstarted/stream
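Writing nested merges by hand gets error-prone once there are more than three streams. Here is a rough sketch of a shell function that folds any number of expressions into nested two-way merges. nest_merge is a hypothetical name of my own, and A, B and C are placeholders standing in for full search(...) expressions.

```shell
# Hypothetical helper: fold two or more expressions into nested two-way merge() calls.
# Usage: nest_merge ON_CLAUSE EXPR1 EXPR2 [EXPR3 ...]
nest_merge() {
  # Need the on clause plus at least one expression.
  [ "$#" -ge 2 ] || return 1
  local on="$1"; shift
  local acc="$1"; shift
  local s
  for s in "$@"; do
    # Wrap what we have so far and the next stream in one more merge().
    acc="merge(${acc},${s},on=\"${on}\")"
  done
  printf '%s\n' "$acc"
}

# A, B and C stand in for real search(...) expressions.
nest_merge 'id asc' A B C
# prints merge(merge(A,B,on="id asc"),C,on="id asc")
```

This sketch nests to the left while the curl example above nests to the right. Every stream is sorted on the same field, so either nesting should yield the same merged order.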

Unique:

Wraps a Streaming Expression and emits a unique stream of Tuples based on the over parameter.

curl --data-urlencode 'stream=unique(merge(search(gettingstarted,q="*:*",fl="id, city", fq="city:San Pedro",sort="id asc"), search(gettingstarted, q="*:*",fl="id, city", fq="city:Stockbridge",sort="id asc"), on="id asc"), over="id asc")' http://localhost:8983/solr/gettingstarted/stream

Notice that I have used merge inside unique. By combining the functions this way, you can do a lot of things.
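If the combination of merge and unique feels abstract, a local analogy with plain Unix tools may help. This is only an illustration and does not talk to Solr: merge_unique is a hypothetical name, the ids are made up, and the duplicate id 3 is contrived to show the effect. As I understand it, unique (like uniq) relies on its input already being sorted on the over field, which is why the sorted merge comes first.

```shell
# Local analogy only: each file holds ids, one per line, already sorted ascending,
# like the output of a search(...) with sort="id asc".
merge_unique() {
  # sort -m interleaves two pre-sorted inputs (like merge ... on="id asc");
  # uniq drops adjacent duplicates (like unique ... over="id asc").
  LC_ALL=C sort -m "$1" "$2" | uniq
}

tmpdir=$(mktemp -d)
printf '%s\n' 1 3 5 > "$tmpdir/san_pedro"
printf '%s\n' 2 3 6 > "$tmpdir/stockbridge"

# Prints 1, 2, 3, 5 and 6, one per line; the second 3 is dropped.
merge_unique "$tmpdir/san_pedro" "$tmpdir/stockbridge"

rm -r "$tmpdir"
```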

Comments

Sabeer (not verified)

Thu, 03/02/2017 - 01:00

Is it possible to use more conditions in join's on clause? For example

leftOuterJoin(
  search(people, q=*:*, fl="personId,name", sort="personId asc"),
  search(pets, q=type:cat, fl="ownerId,petName", sort="ownerId asc"),
  on="personId=ownerId,name=petName"
)
