I need to extract text and emojis from HTML (I have no control over the HTML I get). Removing the HTML tags with the following function was fairly simple; however, it also strips out the emojis, which are embedded in the alt attribute of <img> tags. The result should be plain text + emoji characters.
I don’t care much about spaces, but the cleaner it is, the better.
// this cleans the HTML quite well, but I need to extend it to keep the emojis
const stripTags = (html: string, ...args) => {
  return html.replace(/<(\/?)(\w+)[^>]*\/?>/g, (_, endMark, tag) => {
    return args.includes(tag) ? "<" + endMark + tag + ">" : ""
  }).replace(/<!--.*?-->/g, "")
}
<div>
  <div class="text-bold">
    <span dir="auto">
      <div>
        <div dir="auto" style="text-align: start;">Herman is 10 and was born in Louisiana. he now lives a wonderful life in Wisconsin.</div>
      </div>
      <div class="">
        <div dir="auto" style="text-align: start;">he's (mostly) a Beagle and Jack Russell mix.</div>
      </div>
      <div class="">
        <div dir="auto" style="text-align: start;">
          <span class=""><img height="16" width="16" alt="🐕" src="https://someweb.com/images/emoji/bpp/2/16/1f415.png"></span> : @House…
          <div class="" role="button" tabindex="0">Something else</div>
        </div>
      </div>
    </span>
  </div>
</div>
Expected output:
Herman is 10 and was born in Louisiana. he now lives a wonderful life in Wisconsin. he's (mostly) a Beagle and Jack Russell mix.🐕 : @House… Something else.
Answer
If you would rather use a DOM parser than a pure regexp and get more control over the HTML, here's an example of how to do it:
const htmlString = "<div>your content...</div>";

const toRawString = (htmlString) => {
  if (!htmlString) {
    return null;
  }

  const parser = new DOMParser();
  const parsedHTML = parser.parseFromString(htmlString, "text/html");

  // Get all images and keep only the alt attribute content.
  // If you need data from other attributes, you can reuse the same pattern.
  const images = parsedHTML.querySelectorAll("img");
  images.forEach((image) => {
    // Swap the image for its alt text in place, so the emoji keeps its position
    // (appending to the parent would move it after any sibling text).
    image.replaceWith(parsedHTML.createTextNode(image.alt));
  });

  // Collapse runs of whitespace into a single space
  return parsedHTML.body.textContent.replace(/\s\s+/g, " ");
};
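If no DOM is available (DOMParser is a browser API), the regex approach from the question could be extended instead. Below is a minimal sketch, not a robust HTML parser: it assumes the alt attribute is always quoted, and stripTagsKeepAlt is a hypothetical name that does not appear in the original post. It first swaps each <img> for its alt text, then strips comments and the remaining tags.

```typescript
// Hypothetical helper (not from the original post): keep the alt text of
// <img> tags, then strip all remaining tags and comments.
// Assumption: alt values are wrapped in double or single quotes.
const stripTagsKeepAlt = (html: string): string => {
  return html
    // Replace <img ... alt="X" ...> with X.
    .replace(
      /<img\b[^>]*?\salt\s*=\s*(?:"([^"]*)"|'([^']*)')[^>]*>/gi,
      (_match: string, dq?: string, sq?: string) => dq ?? sq ?? ""
    )
    // Remove comments, including ones that span several lines.
    .replace(/<!--[\s\S]*?-->/g, "")
    // Replace every remaining tag with a space so adjacent blocks do not fuse.
    .replace(/<\/?\w+[^>]*>/g, " ")
    // Collapse whitespace runs and trim the ends.
    .replace(/\s+/g, " ")
    .trim();
};

const sample =
  '<div><span class=""><img height="16" width="16" alt="🐕" ' +
  'src="https://someweb.com/images/emoji/bpp/2/16/1f415.png"></span>' +
  ' : @House… <div role="button">Something else</div></div>';

console.log(stripTagsKeepAlt(sample));
```

Replacing tags with a space rather than an empty string keeps text from neighbouring blocks apart, at the cost of occasionally adding a space around inline tags. For untrusted or unusual markup, the DOM-based version above is likely the safer choice.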