
Apache Solr extract, highlight HTML elements based on query, filter query terms

Update (+18 days): edited the title and provided an answer addressing the original question.


tl;dr

I am indexing HTML pages and dumping the <p>...</p> content as a snippet in search results. However, I don’t want or need all that content, just the context around the text matched by the query.

Background

With these in my [classic] schema,

JavaScript

and these in my solrconfig.xml

JavaScript

I get this result [Solr Admin UI; facsimile shown here],

JavaScript

In the source HTML document those sentences each occur in their own p-tag, e.g. <p>Sentence 1.</p>, <p>Sentence 2.</p>, …

Questions

  1. How can I index them individually? My rationale is that I want to display a snippet of the context around the search result target (not the entire p-tagged content).

  2. Additionally, the Linux grep command can return, e.g., one line before and after the matched line (the context argument, -C1). Can we do something similar here?

    That is, if the Solr query matches in Sentence 2, could the snippet contain Sentences 1-3?

I tried assigning unique ids to the p-elements (<p id="a">...</p> <p id="b">...</p>), but I just got this in Solr,

JavaScript


Answer

Update [2020-12-31]

  • Please overlook the answering of my own question, as 18 days have passed with one comment and no answers.

I am building a search page with Solr as the backend, inspired by the Ajax Solr tutorial: https://github.com/evolvingweb/ajax-solr

Ultimately, I decided to forgo Solr highlighting in favor of a more flexible, bespoke JavaScript (JS) solution.

Basically, I:

  • collect the Solr query (q) and filter query (fq) values (terms) in an array (simplified example shown below; more complete JS code appended)

    JavaScript
  • extract sentences matching those terms (words) via a JS regular expression

    JavaScript

    where doc.p is a Solr field (defined in schema.xml) corresponding to indexed HTML p-element (<p>…</p>) text.

  • highlight those query terms

    JavaScript
  • use those term-highlighted strings as snippets on the frontend

  • apply a similar approach to the highlighting of query terms in the full documents, doc.p.toString()
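
The extraction and highlighting steps listed above could be sketched roughly as follows. This is a minimal illustration, not the original code: the helper names (escapeRegExp, extractSentences, highlightTerms), the naive sentence splitter, the use of <mark> tags, and the sample doc are all assumptions; only doc.p comes from the description above.

```javascript
// Hypothetical helpers; only the doc.p field name is taken from the post.

// Escape regex metacharacters so a query term is matched literally.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Build one alternation pattern from all terms (whole-word match).
function termPattern(terms, flags) {
  return new RegExp('\\b(' + terms.map(escapeRegExp).join('|') + ')\\b', flags);
}

// Naive sentence split (a sentence ends at '.', '!' or '?'),
// keeping only the sentences that contain at least one term.
function extractSentences(text, terms) {
  if (terms.length === 0) return [];
  const sentences = text.match(/[^.!?]+[.!?]+|[^.!?]+$/g) || [];
  const re = termPattern(terms, 'i'); // no 'g' flag, so test() is stateless
  return sentences.map(s => s.trim()).filter(s => re.test(s));
}

// Wrap every matched term in <mark>...</mark> (assumed markup).
function highlightTerms(sentence, terms) {
  if (terms.length === 0) return sentence;
  return sentence.replace(termPattern(terms, 'gi'), '<mark>$1</mark>');
}

// Sample data (assumption): doc.p may be a string or an array of strings.
const doc = {
  p: ['Sentence one mentions apples.',
      'Sentence two mentions pears. Sentence three mentions Apples and plums.']
};
const terms = ['apples', 'plums'];

const text = [].concat(doc.p).join(' ');
const snippets = extractSentences(text, terms).map(s => highlightTerms(s, terms));
console.log(snippets);
```

The splitter will mis-handle abbreviations such as "e.g.", and the \b word boundary assumes terms begin and end with word characters, so a real page would likely need something more robust.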


Addendum

Here is the JS code I wrote to collect the Solr “q” and “fq” terms in an array. Note that Solr returns a single fq as a string, and multiple fq values as an array.

JavaScript
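
As a rough sketch of the idea described in this addendum (not the original code), the string-versus-array difference for fq can be normalized before extracting terms. The function name collectTerms, the shape of the params object, and the simple field-prefix and operator handling are assumptions made for illustration.

```javascript
// Hypothetical helper: params is assumed to hold the q and fq values
// as Solr reports them (fq absent, a single string, or an array).
function collectTerms(params) {
  // Normalize fq to an array regardless of how it arrived.
  const fq = params.fq === undefined ? [] : [].concat(params.fq);

  // Keep only real string clauses and skip the match-all query.
  const clauses = [params.q, ...fq]
    .filter(v => typeof v === 'string' && v !== '*:*');

  const terms = [];
  for (const clause of clauses) {
    // Drop an optional leading "field:" prefix, e.g. type:fruit -> fruit.
    const value = clause.replace(/^[\w.]+:/, '');
    // Take quoted phrases whole, otherwise take bare words.
    for (const m of value.matchAll(/"([^"]+)"|([^\s"()]+)/g)) {
      const term = (m[1] || m[2]).trim();
      // Ignore boolean operators (simplified handling).
      if (term && !/^(AND|OR|NOT)$/.test(term)) terms.push(term);
    }
  }
  return [...new Set(terms)]; // de-duplicate, keeping first-seen order
}

console.log(collectTerms({ q: 'apples', fq: 'type:fruit' }));
console.log(collectTerms({
  q: 'apples OR pears',
  fq: ['type:"stone fruit"', 'color:red']
}));
```

The resulting array can then be passed to whatever routine builds the highlighting regular expression. Range queries, nested field prefixes, and local params are not handled by this sketch.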