Selvam S
August 10, 2015
Solr 5.1 introduced a new Streaming API, and Solr 5.2 adds Streaming Expressions on top of it. If you have ever wondered how to run nested queries in Solr or how to use its parallel computing capabilities, this could be the answer.
Streaming Expressions provide a simple query language for SolrCloud that merges search with parallel computing. Under the covers, Streaming Expressions are backed by a Java Streaming API that provides a fast map/reduce implementation for SolrCloud. Streaming Expressions are composed of functions. All functions behave like streams, which means that they don't hold all the data in memory at once. Read more about the basics here: https://cwiki.apache.org/confluence/display/solr/Streaming+Expressions
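To make the "behaves like a stream" idea concrete, here is a minimal Python sketch of a function that wraps another stream and passes tuples along one at a time. It is only a local analogy, not Solr's Java implementation; the names `numbered_tuples` and `only_city` and the sample data are made up for illustration.

```python
from itertools import islice

produced = []  # records which tuples the source has actually generated

def numbered_tuples():
    """Hypothetical source stream: yields tuples one at a time, forever."""
    n = 0
    while True:
        n += 1
        produced.append(n)
        yield {"id": n, "city": "San Pedro" if n % 2 else "Stockbridge"}

def only_city(stream, city):
    """Wrap a stream and pass through tuples for one city, lazily."""
    for t in stream:
        if t["city"] == city:
            yield t

# Pull just three tuples; the source stops as soon as the consumer does.
first_three = list(islice(only_city(numbered_tuples(), "San Pedro"), 3))
print([t["id"] for t in first_three])  # [1, 3, 5]
print(len(produced))                   # 5
```

Even though the source is endless, only five tuples are ever created, because each wrapper asks for the next tuple only when its consumer needs one.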
These steps assume a Debian-based system, say Ubuntu 12.04 or 14.04. If you have not installed Solr 5.2, grab the latest release (for example, from https://apache.mirror1.spango.com/lucene/solr/5.2.1/) and extract it.
Cloud mode lets you create collections and nodes. See https://cwiki.apache.org/confluence/display/solr/Getting+Started+with+SolrCloud for more details.
bin/solr -e cloud
Enter the port and other details when prompted.
To start a single node, use:
bin/solr start -cloud -s example/cloud/node1/solr -p 8983
Now comes the interesting part. The Streaming API offers several functions; I am going to write about Search, Merge, and Unique. Let us assume we have two fields, called id and city.
Search is the basic streaming method that processes a single expression and returns the data.
curl --data-urlencode 'stream=search(gettingstarted,q="*:*",fl="id, city", fq="city:San Pedro",sort="id asc")' http://localhost:8983/solr/gettingstarted/stream
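'gettingstarted' is the collection name. We use the fl parameter to choose the fields returned and the fq parameter to filter by city; sort fixes the order in which tuples are streamed.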
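If you would rather issue the same request from a script, the following Python sketch builds the search expression and prepares the POST that curl sends. The helper names `search_expr` and `stream_request` are hypothetical, and the base URL assumes the default local node on port 8983 over plain HTTP; the request is only prepared here, not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

def search_expr(collection, q, fl, sort, fq=None, rows=None):
    """Build a search(...) expression string like the one in the curl call."""
    params = [f'q="{q}"', f'fl="{fl}"']
    if fq is not None:
        params.append(f'fq="{fq}"')
    params.append(f'sort="{sort}"')
    if rows is not None:
        params.append(f"rows={rows}")
    return f"search({collection},{','.join(params)})"

def stream_request(base_url, collection, expr):
    """Prepare (but do not send) a POST to the collection's /stream handler."""
    body = urlencode({"stream": expr}).encode("utf-8")
    return Request(f"{base_url}/{collection}/stream", data=body)

expr = search_expr("gettingstarted", "*:*", "id, city", "id asc",
                   fq="city:San Pedro")
req = stream_request("http://localhost:8983/solr", "gettingstarted", expr)
print(expr)
print(req.full_url)
# To actually send it against a running node:
#   import urllib.request; print(urllib.request.urlopen(req).read().decode())
```

Passing `data` makes this a form-encoded POST, which matches what `curl --data-urlencode` does.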
Merge combines two Streaming Expressions and maintains the ordering of the underlying streams.
curl --data-urlencode 'stream=merge(search(gettingstarted,q="*:*",fl="id, city", fq="city:San Pedro",sort="id asc", rows=5), search(gettingstarted, q="*:*",fl="id, city", fq="city:Stockbridge",sort="id asc", rows=10), on="id asc" )' http://localhost:8983/solr/gettingstarted/stream
Here, we have two expressions that are merged into a single one. Also note that each expression has its own 'rows' attribute, which limits the records for that expression separately. By default, merge supports only two expressions; to combine more than two, you can nest the merge methods.
For example:
curl --data-urlencode 'stream=merge(search(gettingstarted,q="*:*",fl="id, city", fq="city:San Pedro",sort="id asc"), merge(search(gettingstarted,q="*:*",fl="id, city", fq="city:Stockbridge",sort="id asc", rows=5),search(gettingstarted,q="*:*",fl="id, city", fq="city:Stockbridge",sort="id asc", rows=5),on="id asc"), on="id asc")' http://localhost:8983/solr/gettingstarted/stream
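Writing nested merges by hand gets tedious, so here is a small Python sketch that folds any number of expressions into two-way merges. `merge_expr` and `merge_all` are hypothetical helper names, not part of Solr, and the sketch assumes every inner expression is already sorted on the field named in `on`.

```python
from functools import reduce

def merge_expr(left, right, on):
    """Wrap exactly two stream expressions in a merge(...) call."""
    return f'merge({left},{right},on="{on}")'

def merge_all(exprs, on):
    """Fold any number of expressions into nested two-way merges."""
    if not exprs:
        raise ValueError("need at least one expression")
    return reduce(lambda acc, nxt: merge_expr(acc, nxt, on), exprs)

# Placeholder names stand in for full search(...) expressions.
print(merge_all(["A", "B", "C"], "id asc"))
# merge(merge(A,B,on="id asc"),C,on="id asc")
```

This version nests on the left, while the curl example above nests on the right; since every level merges on the same sort, either shape should yield the same ordering.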
Unique wraps a Streaming Expression and emits a unique stream of tuples based on the over parameter.
curl --data-urlencode 'stream=unique(merge(search(gettingstarted,q="*:*",fl="id, city", fq="city:San Pedro",sort="id asc"), search(gettingstarted, q="*:*",fl="id, city", fq="city:Stockbridge",sort="id asc"), on="id asc"), over="id asc")' http://localhost:8983/solr/gettingstarted/stream
Notice that I have used the merge method inside unique. You can do a lot of things by combining the methods this way.
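To get a feel for what unique(merge(...)) does to the tuples, here is a local Python analogy using plain lists in place of Solr streams. It is not how Solr implements these functions; the function names and sample tuples are invented, and the sketch assumes that the first tuple of each run of duplicates is the one kept.

```python
import heapq
from itertools import groupby
from operator import itemgetter

def merge_streams(left, right, key):
    """Interleave two streams already sorted by `key`, keeping that order."""
    return heapq.merge(left, right, key=itemgetter(key))

def unique_stream(stream, over):
    """Emit the first tuple of each run of equal `over` values."""
    for _, run in groupby(stream, key=itemgetter(over)):
        yield next(run)

# Invented sample data; both lists are sorted by id, as sort="id asc" requires.
san_pedro = [{"id": 1, "city": "San Pedro"}, {"id": 4, "city": "San Pedro"}]
stockbridge = [{"id": 2, "city": "Stockbridge"}, {"id": 4, "city": "Stockbridge"}]

result = list(unique_stream(merge_streams(san_pedro, stockbridge, "id"), "id"))
print([(t["id"], t["city"]) for t in result])
# [(1, 'San Pedro'), (2, 'Stockbridge'), (4, 'San Pedro')]
```

Because the merged stream arrives sorted by id, duplicates sit next to each other, so unique only has to compare each tuple with the previous one. That is why the inner streams, the merge, and the unique all refer to the same field.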