Querying MarkLogic merged collection

I’m trying to write a query to get the unique values of an attribute from the final merged collection (sm-Survey-merged). Something like:

select distinct(participantID) from sm-Survey-merged;

I get a tree-cache error with the equivalent JS query below. Can someone help me with a better query?

[...new Set(fn.collection("sm-Survey-merged").toArray().map(doc => doc.root.participantID.valueOf()).sort(), "unfiltered")]

Answer

If there are a lot of documents and you attempt to read them all in a single query, then you run the risk of blowing out the Expanded Tree Cache. You can try bumping up that limit, but with a large database holding a lot of documents you are still likely to hit it.
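For reference, the Expanded Tree Cache size is a group-level setting that can be changed in the Admin UI or through the Admin API. A minimal sketch using the Admin API (the group name "Default" and the size are assumptions for illustration; the value is in MB):

const admin = require('/MarkLogic/admin.xqy');

// Sketch: raise the Expanded Tree Cache size for a group.
// "Default" and 16384 (MB) are placeholder assumptions; adjust for your cluster.
let config = admin.getConfiguration();
const groupId = admin.groupGetId(config, 'Default');
config = admin.groupSetExpandedTreeCacheSize(config, groupId, 16384);
admin.saveConfiguration(config);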

The fastest and most efficient way to produce a list of the unique values is to create a range index and select the values from that lexicon with cts.values().
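For example, assuming a string range index has been configured on the participantID JSON property, a call along these lines would read the distinct values straight from the lexicon without ever loading the documents:

// Sketch, assuming a range index exists on the participantID property.
// cts.values() reads from the lexicon, so no documents are expanded.
cts.values(
  cts.jsonPropertyReference("participantID"),
  null,                                    // no starting value
  [],                                      // default options
  cts.collectionQuery("sm-Survey-merged")  // limit to the merged collection
);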

Without an index, you could attempt to perform iterative queries that search and retrieve a set of random values, then perform additional searches excluding the values already seen. This still runs the risk of blowing out the Expanded Tree Cache, hitting timeouts, etc., so it may not be ideal, but it would let you get some information now without reindexing the data.

You could experiment with the number of iterations and the search page size to see whether they stay within limits and produce consistent results. Maybe add some logging or flags to record whether you hit the iteration limit while values were still being returned, so you know whether the list is complete. You could also try running without an iteration limit, but then you risk OOM or Expanded Tree Cache errors.

function distinctParticipantIDs(iterations, values) {
  const participantIDs = new Set();
  // Search the collection, excluding docs whose participantID was already seen,
  // and pull a random page of up to 1,000 results.
  const docs = fn.subsequence(
    cts.search(
      cts.andNotQuery(
        cts.collectionQuery("sm-Survey-merged"),
        cts.jsonPropertyValueQuery("participantID", Array.from(values))
      ),
      ["unfiltered", "score-random"]),
    1, 1000);

  for (const doc of docs) {
    const participantID = doc.root.participantID.valueOf();
    participantIDs.add(participantID);
  }
  
  const uniqueParticipantIDs = new Set([...values, ...participantIDs]);
  
  if (iterations > 0 && participantIDs.size > 0) {
    // there are still unique values, and we haven't hit our iteration limit, so keep searching
    return distinctParticipantIDs(iterations - 1, uniqueParticipantIDs);
  } else {
    return uniqueParticipantIDs;
  }
}

[...distinctParticipantIDs(100, new Set())];

Another option would be to run a CoRB job against the database and apply the EXPORT-FILE-SORT option with ascending|distinct or descending|distinct to dedupe the values written to the output file.
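A minimal sketch of a CoRB options file (the connection URI and module names are hypothetical placeholders; EXPORT-FILE-SORT is the relevant option):

# Sketch of a CoRB options file; URI and module names are placeholders.
XCC-CONNECTION-URI=xcc://user:password@localhost:8000
URIS-MODULE=get-survey-uris.sjs|ADHOC
PROCESS-MODULE=extract-participant-id.sjs|ADHOC
EXPORT-FILE-NAME=participant-ids.txt
# Sort the exported lines and drop duplicates.
EXPORT-FILE-SORT=ascending|distinct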
