Text and emoji extraction from HTML

Question

I need to perform text and emoji extraction from HTML (I have no control over the HTML I get). I found it fairly simple to remove HTML tags using the following function; however, it strips out the emojis embedded within an <img> tag. The result should be plain text + emoji characters. I don’t care…

Accepted Answer

If you would rather use a DOM parser than a pure regular expression, which gives you more control over the HTML, here is an example of how to do it:

    const htmlString = "your content...";

    const toRawString = (htmlString) => {
      if (!htmlString) {
        return null;
      }

      const parser = new DOMParser();
      const parsedHTML = parser.parseFromString(htmlString, "text/html");

      // Get all images and keep only the alt attribute content.
      // If you need data from other attributes, you can adapt this loop.
      const images = parsedHTML.querySelectorAll("img");
      images.forEach((image) => {
        const altSpan = parsedHTML.createElement("span");
        // textContent avoids re-parsing the alt text as HTML
        altSpan.textContent = image.alt;
        // Put the alt text where the image was, so the text order is preserved
        image.replaceWith(altSpan);
      });

      // Collapse runs of whitespace into a single space
      return parsedHTML.body.textContent.replace(/\s\s+/g, " ");
    };

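For comparison, here is a minimal sketch of the regex-only route, which may be useful where DOMParser is not available (plain Node.js, for example). The names imgAltToText and toRawStringRegex are hypothetical and not part of the answer above. It assumes reasonably well-formed markup with quoted alt attributes, and it does not decode HTML entities, so treat it as an illustration rather than a robust parser.

```javascript
// Hypothetical helper: replace each <img ...> tag with its alt text.
// Assumes the alt value is quoted and contains no ">" character.
const imgAltToText = (html) =>
  html.replace(/<img\b[^>]*>/gi, (tag) => {
    const match = tag.match(/\salt\s*=\s*(?:"([^"]*)"|'([^']*)')/i);
    // match[1] holds a double-quoted value, match[2] a single-quoted one.
    return match ? (match[1] ?? match[2]) : "";
  });

// Hypothetical regex-only counterpart to a DOM-based extractor.
const toRawStringRegex = (htmlString) => {
  if (!htmlString) {
    return null;
  }
  return imgAltToText(htmlString)
    .replace(/<[^>]*>/g, "") // strip all remaining tags
    .replace(/\s\s+/g, " ") // collapse runs of whitespace
    .trim();
};

console.log(toRawStringRegex('<p>Hello <img src="s.png" alt="😀"> world</p>'));
// prints "Hello 😀 world"
```

The DOM-based approach handles entities, unquoted attributes, and malformed markup for you, so the regex version only makes sense when the input is simple or a DOM is not at hand.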

Answer