Context
I’m building a set of ‘extractor’ functions that pull out anything that looks like a component from a page (using jsdom and Node.js). The final result should be these ‘component’ objects, ordered by where they originally appeared in the page.
Problem
The last part of this process is a bit problematic. As far as I can see, there’s no easy way to tell where a given element sits in the source of a given DOM document.
The numeric depth or a CSS/XPath-like path also doesn’t feel helpful in this case.
Example
With the given extractors…
const extractors = [
  // Extract buttons
  dom => Array.from(dom.window.document.querySelectorAll('button'))
    .map(elem => ({
      type: 'button',
      name: elem.name,
      position: /* this part needs to be computed from elem */
    })),
  // Extract links
  dom => Array.from(dom.window.document.querySelectorAll('a'))
    .map(elem => ({
      type: 'link',
      name: elem.textContent,
      position: /* this part needs to be computed from elem */,
      link: elem.href,
    })),
];
…and the given document (I know, it’s an ugly and un-semantic example..):
<html>
  <body>
    <a href="/">Home</a>
    <button>Login</button>
    <a href="/about">About</a>
    ...
I need something like:
[
  { type: 'button', name: 'Login', position: 45, ... },
  { type: 'link', name: 'Home', position: 20, ... },
  { type: 'link', name: 'About', position: 72, ... },
]
(which can later be ordered by item.position)
For example, 45 is the position/offset of the <button in the example HTML string.
Answer
You could just iterate all the elements in the DOM and assign them an index, given your DOM doesn’t change:
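// Symbol is not a constructor, so it is called without `new`
const pos = Symbol('document position');
// `dom` is the JSDOM instance that is also passed to the extractors
for (const [index, element] of dom.window.document.querySelectorAll('*').entries()) {
  element[pos] = index;
}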
Then your extractor can just use that:
dom => Array.from(dom.window.document.querySelectorAll('a'), elem => ({
  type: 'link',
  name: elem.textContent,
  position: elem[pos],
  link: elem.href,
})),
Alternatively, JSDOM can attach the source position in the parsed HTML text to every node; see the includeNodeLocations option. The startOffset values will be in document order as well. So if you parse the input with that option enabled (new JSDOM(html, { includeNodeLocations: true })), you can use:
dom => Array.from(dom.window.document.querySelectorAll('a'), elem => ({
  type: 'link',
  name: elem.textContent,
  position: dom.nodeLocation(elem).startOffset,
  link: elem.href,
})),
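With either approach, every component ends up with a numeric position, so the remaining step is to run all extractors and sort the combined results. Here is a minimal sketch of that step; collectComponents is a hypothetical helper name (not part of jsdom), and the stand-in extractors below return hard-coded positions from the question's example so the snippet runs without a real document:

```javascript
// Hypothetical helper: run every extractor against the same dom object,
// flatten the per-extractor arrays, and sort by the numeric position.
function collectComponents(dom, extractors) {
  return extractors
    .flatMap(extract => extract(dom))
    .sort((a, b) => a.position - b.position);
}

// Stand-in extractors with the positions from the question's example.
// Real extractors would compute position from elem[pos] or
// dom.nodeLocation(elem).startOffset as shown above.
const fakeExtractors = [
  () => [{ type: 'button', name: 'Login', position: 45 }],
  () => [
    { type: 'link', name: 'Home', position: 20 },
    { type: 'link', name: 'About', position: 72 },
  ],
];

const ordered = collectComponents(null, fakeExtractors);
console.log(ordered.map(c => c.name)); // [ 'Home', 'Login', 'About' ]
```

In real use you would pass the JSDOM instance and the extractors array from the question instead of null and the stand-ins.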