How can I convert HTML to Object structure with text and formatting?

Question

I need to convert an HTML string with nested tags into an array of objects, where each object holds a piece of text together with its formatting. I managed the conversion with DOMParser() as long as there are no nested tags. I am not able to get it running with nested tags, like in the last paragraph, so my whole paragraph is bold, but …

Accepted Answer

You can use recursion, and this seems a good case for a generator function. As it was not clear which tags should be retained in format (apparently not p), I left this as a configuration to provide:

    const formatTags = new Set(["b", "big", "code", "del", "em", "i", "pre", "s", "small", "strike", "strong", "sub", "sup", "u"]);

    function* iterLeafNodes(nodes, format=[]) {
        for (let node of nodes) {
            if (node.nodeType == 3) { // text node: emit it with the formats collected so far
                yield ({text: node.nodeValue, format: format.length ? [...format] : null});
            } else if (node.nodeType == 1) { // element node; comments etc. are skipped
                const tag = node.tagName.toLowerCase();
                yield* iterLeafNodes(node.childNodes, formatTags.has(tag) ? format.concat(tag) : format);
            }
        }
    }

    // Example input (tag names assumed: <p>, <strong>, <em>)
    const strHTML = "<p>Hello World</p>"
        + "<p>I am a text with <strong>bold</strong> word</p>"
        + "<p><strong>I am bold text with nested <em>italic</em> Word.</strong></p>";

    const nodes = new DOMParser().parseFromString(strHTML, 'text/html').body.childNodes;
    let result = [...iterLeafNodes(nodes)];
    console.log(result);

Note that this will still split the text when it is spread over multiple tags that are considered non-formatting tags, like span.

Secondly, I’m not convinced that having null as a possible value for format is more useful than just an empty array [], but anyway, the above produces null in that case.

Special case – insertion of `\n`

In comments you ask for the insertion of a line break after each p element. The code below will generate that extra element. Here I also used [] instead of null for format:

    const formatTags = new Set(["b", "big", "code", "del", "em", "i", "pre", "s", "small", "strike", "strong", "sub", "sup", "u"]);

    function* iterLeafNodes(nodes, format=[]) {
        for (let node of nodes) {
            if (node.nodeType == 3) { // text node
                yield ({text: node.nodeValue, format: [...format]});
            } else if (node.nodeType == 1) { // element node; comments etc. are skipped
                const tag = node.tagName.toLowerCase();
                yield* iterLeafNodes(node.childNodes, formatTags.has(tag) ? format.concat(tag) : format);
                // After a paragraph's content, emit the extra line break
                if (tag === "p") yield ({text: "\n", format: [...format]});
            }
        }
    }

    // Example input (tag names assumed: <p>, <strong>, <em>)
    const strHTML = "<p>Hello World</p>"
        + "<p>I am a text with <strong>bold</strong> word</p>"
        + "<p><strong>I am bold text with nested <em>italic</em> Word.</strong></p>";

    const nodes = new DOMParser().parseFromString(strHTML, 'text/html').body.childNodes;
    let result = [...iterLeafNodes(nodes)];
    console.log(result);

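The answer points out that text is still split when it is spread over several non-formatting tags such as span. If that is unwanted, one possible follow-up step is to merge neighbouring entries whose format is identical. The sketch below is only an illustration under that assumption: `mergeAdjacentRuns` and `sameFormat` are hypothetical helper names that do not appear in the answer, and the sample array is hand-written to resemble the generator's output rather than produced by it.

```javascript
// Hypothetical helpers (not part of the original answer).

// Two format values match when they list the same tags in the same order.
// null and [] are treated alike, since the answer shows both conventions.
function sameFormat(a, b) {
    const x = a ?? [];
    const y = b ?? [];
    return x.length === y.length && x.every((tag, i) => tag === y[i]);
}

// Merge consecutive {text, format} entries that share the same format.
function mergeAdjacentRuns(items) {
    const merged = [];
    for (const item of items) {
        const last = merged[merged.length - 1];
        if (last && sameFormat(last.format, item.format)) {
            last.text += item.text;   // extend the previous run
        } else {
            merged.push({ ...item }); // copy so the input entries stay untouched
        }
    }
    return merged;
}

// Hand-written sample, shaped like the output for text split across spans
// and followed by a bold word.
const sample = [
    { text: "Hello ", format: [] },
    { text: "World", format: [] },
    { text: " ", format: [] },
    { text: "bold", format: ["b"] },
];

const mergedSample = mergeAdjacentRuns(sample);
console.log(mergedSample);
// → [ { text: 'Hello World ', format: [] }, { text: 'bold', format: [ 'b' ] } ]
```

With the generator it could be applied as `mergeAdjacentRuns([...iterLeafNodes(nodes)])`. Be aware that in the line-break variant, a `\n` entry would also be merged into a neighbouring entry with the same format.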