I need to convert an HTML string with nested tags, like this one:

    const strHTML = "<p>Hello World</p><p>I am a text with <strong>bold</strong> word</p><p><strong>I am bold text with nested <em>italic</em> Word.</strong></p>"

into an array of objects with the following structure:

    const result = [
        { text: "Hello World", format: null },
        { text: "I am a text with", format: null },
        { text: "bold", format: ["strong"] },
        { text: " word", format: null },
        { text: "I am bold text with nested", format: ["strong"] },
        { text: "italic", format: ["strong", "em"] },
        { text: "Word.", format: ["strong"] }
    ];
I managed the conversion with DOMParser as long as there are no nested tags. With nested tags, as in the last paragraph, it does not work: the whole paragraph comes out as bold, whereas the word “italic” should be both bold and italic. I have not been able to write this as a recursion.
Any help would be appreciated.
This is the code I have so far:
    export interface Phrase {
        text: string;
        format: string | string[];
    }

    export class HTMLParser {
        public parse(text: string): void {
            const parser = new DOMParser();
            const sourceDocument = parser.parseFromString(text, "text/html");
            this.parseChildren(sourceDocument.body.childNodes);
            // HERE SHOULD BE the result
            console.log("RESULT of CONVERSION", this.phrasesProcessed);
        }

        public phrasesProcessed: Phrase[] = [];

        private parseChildren(toParse: NodeListOf<ChildNode>) {
            this.phrasesProcessed = [];
            try {
                Array.from(toParse)
                    .map(item => {
                        if (item.nodeType === Node.ELEMENT_NODE && item instanceof HTMLElement) {
                            return Array.from(item.childNodes).map(child => ({
                                text: child.textContent,
                                format: (child.nodeType === Node.ELEMENT_NODE && child instanceof HTMLElement) ? child.tagName : null
                            }));
                        } else {
                            return Array.from(item.childNodes).map(child => ({
                                text: child.textContent,
                                format: null
                            }));
                        }
                    })
                    .filter(line => line.length) // only non-empty arrays
                    .map(element => ([...element, { text: "\n", format: null }])) // add linebreak after each P
                    .reduce((acc: (Phrase)[], val) => acc.concat(val), []) // flatten
                    .forEach(element => {
                        // console.log("ELEMENT", element);
                        this.phrasesProcessed.push(element);
                    });
            } catch (e) {
                console.warn(e);
            }
        }
    }
Answer
You can use recursion, and this seems a good case for a generator function. As it was not clear which tags should be retained in format (apparently not p), I left this as a configuration to provide:
    const formatTags = new Set(["b", "big", "code", "del", "em", "i", "pre", "s", "small", "strike", "strong", "sub", "sup", "u"]);

    function* iterLeafNodes(nodes, format=[]) {
        for (let node of nodes) {
            if (node.nodeType == 3) {
                yield ({text: node.nodeValue, format: format.length ? [...format] : null});
            } else {
                const tag = node.tagName.toLowerCase();
                yield* iterLeafNodes(node.childNodes, formatTags.has(tag) ? format.concat(tag) : format);
            }
        }
    }

    // Example input
    const strHTML = "<p>Hello World</p><p>I am a text with <strong>bold</strong> word</p><p><strong>I am bold text with nested <em>italic</em> Word.</strong></p>"
    const nodes = new DOMParser().parseFromString(strHTML, 'text/html').body.childNodes;
    let result = [...iterLeafNodes(nodes)];
    console.log(result);
Note that this will still split the text when it is spread over multiple tags that are not considered formatting tags, like span.
Secondly, I’m not convinced that having null as a possible value for format is more useful than just an empty array [], but anyway, the above produces null in that case.
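If the splitting across non-formatting tags is unwanted, one possible approach is to merge neighbouring entries that carry the same format after the generator has run. The sketch below is not part of the answer's code: mergeAdjacent and sameFormat are hypothetical helper names, and it assumes that null and [] both mean "no formatting".

```javascript
// Hypothetical helpers, not part of the original answer.
// Two format values are equal when they list the same tags in the same order;
// null is treated like an empty list (an assumption of this sketch).
function sameFormat(a, b) {
    const x = a ?? [];
    const y = b ?? [];
    return x.length === y.length && x.every((tag, i) => tag === y[i]);
}

// Concatenates the text of consecutive phrases that share the same format.
function mergeAdjacent(phrases) {
    const merged = [];
    for (const phrase of phrases) {
        const last = merged[merged.length - 1];
        if (last && sameFormat(last.format, phrase.format)) {
            last.text += phrase.text;
        } else {
            merged.push({ ...phrase }); // copy, so the input objects stay untouched
        }
    }
    return merged;
}

// Phrases as the generator yields them for
// "<p>Hello <span>big</span> World</p><p><strong>bold</strong></p>"
const phrases = [
    { text: "Hello ", format: null },
    { text: "big", format: null },
    { text: " World", format: null },
    { text: "bold", format: ["strong"] },
];
console.log(mergeAdjacent(phrases));
// → [ { text: "Hello big World", format: null }, { text: "bold", format: ["strong"] } ]
```

Be aware that this also merges across paragraph boundaries when the end of one paragraph and the start of the next have the same format, so it may need to be applied per paragraph if those boundaries matter.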
Special case – insertion of \n
In comments you ask for the insertion of a line break after each p element.
The code below will generate that extra element. Here I also used [] instead of null for format:
    const formatTags = new Set(["b", "big", "code", "del", "em", "i", "pre", "s", "small", "strike", "strong", "sub", "sup", "u"]);

    function* iterLeafNodes(nodes, format=[]) {
        for (let node of nodes) {
            if (node.nodeType == 3) {
                yield ({text: node.nodeValue, format: [...format]});
            } else {
                const tag = node.tagName.toLowerCase();
                yield* iterLeafNodes(node.childNodes, formatTags.has(tag) ? format.concat(tag) : format);
                if (tag === "p") yield ({text: "\n", format: [...format]});
            }
        }
    }

    // Example input
    const strHTML = "<p>Hello World</p><p>I am a text with <strong>bold</strong> word</p><p><strong>I am bold text with nested <em>italic</em> Word.</strong></p>"
    const nodes = new DOMParser().parseFromString(strHTML, 'text/html').body.childNodes;
    let result = [...iterLeafNodes(nodes)];
    console.log(result);
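One caveat with the generator above: every node that is not a text node is treated as an element, so an HTML comment in the input (node type 8, which has no tagName) would throw. A slightly more defensive variant could look like the sketch below. The name iterLeafNodesSafe is made up for illustration, and passing the tag set as a parameter is a choice of this sketch, not something the question asks for. The example uses plain objects that mimic DOM nodes so that it also runs outside a browser.

```javascript
const ELEMENT_NODE = 1; // same value as Node.ELEMENT_NODE
const TEXT_NODE = 3;    // same value as Node.TEXT_NODE

// Hypothetical variant of iterLeafNodes: only element nodes are descended into;
// comments and other node types are skipped.
function* iterLeafNodesSafe(nodes, formatTags, format = []) {
    for (const node of nodes) {
        if (node.nodeType === TEXT_NODE) {
            yield { text: node.nodeValue, format: [...format] };
        } else if (node.nodeType === ELEMENT_NODE) {
            const tag = node.tagName.toLowerCase();
            const childFormat = formatTags.has(tag) ? format.concat(tag) : format;
            yield* iterLeafNodesSafe(node.childNodes, formatTags, childFormat);
            if (tag === "p") yield { text: "\n", format: [...format] };
        }
    }
}

// Stand-in for "<p><!-- a comment -->Hello <strong>World</strong></p>"
const fakeNodes = [
    { nodeType: 1, tagName: "P", childNodes: [
        { nodeType: 8, nodeValue: " a comment " },
        { nodeType: 3, nodeValue: "Hello " },
        { nodeType: 1, tagName: "STRONG", childNodes: [
            { nodeType: 3, nodeValue: "World" },
        ] },
    ] },
];
const safeResult = [...iterLeafNodesSafe(fakeNodes, new Set(["strong", "em"]))];
console.log(safeResult);
// → [ { text: "Hello ", format: [] }, { text: "World", format: ["strong"] }, { text: "\n", format: [] } ]
```

In a browser, the first argument would be the body.childNodes of the document returned by DOMParser, exactly as in the answer's example.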