I need to perform text and emoji extraction from HTML (I have no control over the HTML I get). I found it fairly simple to remove HTML tags using the following function; however, it strips out the emojis embedded within an <img> tag. The result should be plain text + emoji characters.
I don’t care much about spaces, but the cleaner it is, the better.
// This cleans the HTML quite well, but I need to extend it to keep the emojis.
// Tag names passed in `args` are kept (without their attributes); all other tags are removed.
const stripTags = (html: string, ...args: string[]) => {
  return html.replace(/<(\/?)(\w+)[^>]*\/?>/g, (_, endMark, tag) => {
    return args.includes(tag) ? "<" + endMark + tag + ">" : ""
  }).replace(/<!--.*?-->/g, "")
}
<div>
  <div class="text-bold">
    <span dir="auto">
      <div>
        <div dir="auto" style="text-align: start;">Herman is 10 and was born in Louisiana. he now lives a wonderful life in Wisconsin.</div>
      </div>
      <div class="">
        <div dir="auto" style="text-align: start;">he's (mostly) a Beagle and Jack Russell mix.</div>
      </div>
      <div class="">
        <div dir="auto" style="text-align: start;">
          <span class=""><img height="16" width="16" alt="🐕" src="https://someweb.com/images/emoji/bpp/2/16/1f415.png"></span> : @House…
          <div class="" role="button" tabindex="0">Something else</div>
        </div>
      </div>
    </span>
  </div>
</div>
Expected output:
Herman is 10 and was born in Louisiana. he now lives a wonderful life in Wisconsin. he's (mostly) a Beagle and Jack Russell mix.🐕 : @House… Something else.
Answer
If you would rather use a DOM parser than a pure regexp, and get more control over the HTML, here’s an example of how to achieve this:
const htmlString = "<div>your content...</div>";
const toRawString = (htmlString) => {
  if (!htmlString) {
    return null;
  }
  const parser = new DOMParser();
  const parsedHTML = parser.parseFromString(htmlString, "text/html");
  // Get all images and keep only the content of their alt attribute.
  // If you need data from other attributes, you can adapt the loop below.
  const images = parsedHTML.querySelectorAll("img");
  images.forEach((image) => {
    // Swap each image for a text node in place, so the emoji keeps its
    // position and the alt text is never parsed as HTML.
    image.replaceWith(parsedHTML.createTextNode(image.alt));
  });
  // Collapse runs of two or more whitespace characters into a single space
  return parsedHTML.body.textContent.replace(/\s\s+/g, " ");
};
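If DOMParser is not available (for example in plain Node.js without a DOM library), the same idea can be roughly approximated with regular expressions alone. The sketch below is not part of the answer above. The name htmlToTextWithEmoji is hypothetical, and it assumes every emoji image carries a double-quoted alt attribute, as in the sample HTML. It does not decode HTML entities, and it will misbehave on unusual markup such as a ">" inside an attribute value.

```typescript
// Hypothetical helper: regex-only fallback for environments without DOMParser.
// Assumption: emoji images look like <img ... alt="🐕" ...> with a double-quoted alt.
const htmlToTextWithEmoji = (html: string): string => {
  return html
    // Drop HTML comments first so tags inside them are not processed.
    .replace(/<!--[\s\S]*?-->/g, "")
    // Replace each <img> that has an alt attribute with the alt text itself.
    .replace(/<img\b[^>]*?\salt="([^"]*)"[^>]*>/gi, "$1")
    // Replace every remaining tag with a space so adjacent blocks do not run together.
    .replace(/<\/?\w+[^>]*>/g, " ")
    // Collapse whitespace runs into one space and trim both ends.
    .replace(/\s+/g, " ")
    .trim();
};

// Usage with a shortened version of the sample markup:
const emojiLine = htmlToTextWithEmoji(
  '<span><img alt="🐕" src="x.png"></span> : @House…<div>Something else</div>'
);
console.log(emojiLine); // 🐕 : @House… Something else
```

Replacing tags with a space rather than an empty string keeps text from neighbouring block elements apart, at the cost of an occasional extra space that the final whitespace pass cleans up.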