Text and emoji extraction from HTML

Question

I need to perform text and emoji extraction from HTML (I have no control over the HTML I get). I found it fairly simple to remove HTML tags using the following function; however, it strips out the emojis embedded within an <img> tag. The result should be plain text + emoji characters. I don't care much about spaces, but the

Accepted Answer

If you would rather use a DOM parser than a pure regular expression, which also gives you more control over the HTML, here is an example of how to do it:

const htmlString = "your content...";

const toRawString = (htmlString) => {
  if (!htmlString) {
    return null;
  }
  const parser = new DOMParser();
  const parsedHTML = parser.parseFromString(htmlString, "text/html");

  // Replace every image, in place, with the text of its alt attribute
  // (this is where the emoji character lives).
  // If you need data from other attributes, you can adapt this loop.
  parsedHTML.querySelectorAll("img").forEach((image) => {
    image.replaceWith(parsedHTML.createTextNode(image.alt));
  });

  // Collapse runs of whitespace into a single space
  return parsedHTML.body.textContent.replace(/\s\s+/g, " ");
};
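For comparison, the "pure regexp" route mentioned above can be sketched as follows. It may be useful where DOMParser is not available (for example, plain Node.js). This is only a minimal sketch: the name htmlToText is hypothetical, and it assumes that the emoji sits in a quoted alt attribute and that no attribute value contains a ">" character. It does not decode HTML entities, and a regular expression is not a robust HTML parser, so prefer the DOM approach when you can.

```javascript
// Hypothetical helper (not from the original answer): regex-only fallback.
// Assumes quoted alt attributes and no ">" inside attribute values.
const htmlToText = (html) => {
  if (!html) {
    return null;
  }
  return html
    // Swap each <img ...> tag for the content of its alt attribute, if any
    .replace(/<img\b[^>]*>/gi, (tag) => {
      const match = tag.match(/\salt\s*=\s*(?:"([^"]*)"|'([^']*)')/i);
      return match ? (match[1] ?? match[2]) : "";
    })
    // Drop every remaining tag
    .replace(/<[^>]+>/g, "")
    // Collapse runs of whitespace into a single space
    .replace(/\s\s+/g, " ")
    .trim();
};

console.log(htmlToText('<p>Hello <img class="emoji" alt="😀" src="smile.png"> world</p>'));
// prints "Hello 😀 world"
```

Replacing the image tags before stripping the rest is what keeps the emoji: a plain tag-stripping pass would discard the alt text together with the tag.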


Answer