Finding the position of a DOM node in the document source

Context

I’m building a set of ‘extractor’ functions that extract what look like components from a page (using jsdom and Node.js). The final result should be these ‘component’ objects, ordered by where they originally appeared in the page.

Problem

The last part of this process is a bit problematic. As far as I can see, there’s no easy way to tell where a given element sits in a DOM document’s source code.

The numeric depth or a CSS/XPath-like path also doesn’t seem helpful in this case.

Example

With the given extractors…

const extractors = [

  // Extract buttons
  dom => 
    Array.from(dom.window.document.querySelectorAll('button'))
    .map(elem => ({
      type: 'button',
      name: elem.name,
      position: undefined, /* this part needs to be computed from elem */
    })),

  // Extract links
  dom => 
    Array.from(dom.window.document.querySelectorAll('a'))
    .map(elem => ({
      type: 'link',
      name: elem.textContent,
      position: undefined, /* this part needs to be computed from elem */
      link: elem.href,
    })),

];

…and the given document (I know, it’s an ugly and unsemantic example):

<html>
  <body>
    <a href="/">Home</a>
    <button>Login</button>
    <a href="/about">About</a>
...

I need something like:

[
  { type: 'button', name: 'Login', position: 45, ... },
  { type: 'link', name: 'Home', position: 20, ... },
  { type: 'link', name: 'About', position: 72, ... },
]

(which can later be sorted by item.position)

For example, 45 is the character offset of <button within the example HTML string.
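
Once every component carries a numeric position, merging the extractor outputs is a flatten followed by a sort. A minimal sketch, assuming each extractor returns an array of objects with a numeric position; collectComponents is a hypothetical helper name, not part of jsdom:

```javascript
// Hypothetical helper: runs every extractor against the same jsdom instance
// and returns all components ordered by their position in the source.
function collectComponents(dom, extractors) {
  return extractors
    .flatMap(extract => extract(dom))          // one flat list of components
    .sort((a, b) => a.position - b.position);  // ascending source position
}
```

With the sample offsets above (45, 20, 72), this would order the result as Home, Login, About.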

Answer

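You could just iterate over all the elements in the DOM and assign each one an index, provided your DOM doesn’t change afterwards. querySelectorAll returns elements in document order, so the index reflects where each element appears in the source: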

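// Symbol is not a constructor, so it is called without `new`
const pos = Symbol('document position');
// With jsdom, `document` here is dom.window.document
for (const [index, element] of document.querySelectorAll('*').entries()) {
    element[pos] = index;
}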

Then your extractor can just use that:

dom => Array.from(dom.window.document.querySelectorAll('a'), elem => ({
  type: 'link',
  name: elem.textContent,
  position: elem[pos],
  link: elem.href,
})),
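
If you would rather not attach properties to the elements themselves, the same index can live in a WeakMap instead. This is a sketch of that variation, not something the approach above requires; indexByDocumentOrder is a hypothetical name, and it accepts any iterable of nodes so it is not tied to jsdom:

```javascript
// Hypothetical helper: maps each node to its index within the given iterable.
// Pass it document.querySelectorAll('*'), which yields elements in document order.
function indexByDocumentOrder(nodes) {
  const positions = new WeakMap();
  let index = 0;
  for (const node of nodes) {
    positions.set(node, index++);
  }
  return positions;
}

// Usage inside an extractor (assumes `dom` is a jsdom instance):
// const positions = indexByDocumentOrder(dom.window.document.querySelectorAll('*'));
// ... position: positions.get(elem) ...
```

As with the symbol version, the index is only valid as long as the DOM is not modified after it has been built.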

Alternatively, jsdom can record, for every node, its source position in the parsed HTML text; see the includeNodeLocations option. The startOffset values are in document order as well, and they are character offsets like the ones in your example. So if you parse the input with that option enabled, you can use:

dom => Array.from(dom.window.document.querySelectorAll('a'), elem => ({
  type: 'link',
  name: elem.textContent,
  position: dom.nodeLocation(elem).startOffset,
  link: elem.href,
})),
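
A small wrapper can make the lookup more forgiving. This sketch assumes the jsdom package is installed and that the document was parsed with includeNodeLocations: true; sourceOffset is a hypothetical helper. As far as I know, jsdom has no location for nodes that did not come from the original parse (for example, ones inserted by a script), so the helper falls back to Infinity to sort those last. Treat that fallback as an assumption, not documented behaviour:

```javascript
// Hypothetical helper: source offset of a node, or Infinity when jsdom
// reports no location for it (assumption: such nodes should sort last).
function sourceOffset(dom, node) {
  const location = dom.nodeLocation(node);
  return location ? location.startOffset : Infinity;
}

// Usage (requires `npm install jsdom`):
// const { JSDOM } = require('jsdom');
// const dom = new JSDOM(html, { includeNodeLocations: true });
// const button = dom.window.document.querySelector('button');
// sourceOffset(dom, button); // offset of "<button" within `html`
```

In an extractor you would then write position: sourceOffset(dom, elem) in place of the direct nodeLocation call.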
