Text and emoji extraction from HTML

I need to extract text and emoji from HTML (I have no control over the HTML I receive). Removing the HTML tags with the following function is fairly simple; however, it also strips out the emojis, which are stored in the alt attribute of <img> tags. The result should be plain text plus emoji characters.

I don’t care much about spaces, but the cleaner it is, the better.

// this cleans the HTML quite well, but I need to extend it to keep the emojis
const stripTags = (html: string, ...args: string[]) => {
    return html.replace(/<(\/?)(\w+)[^>]*\/?>/g, (_, endMark, tag) => {
        return args.includes(tag) ? "<" + endMark + tag + ">" : ""
    }).replace(/<!--.*?-->/g, "")
}
<div>
   <div class="text-bold">
      <span dir="auto">
         <div>
            <div dir="auto" style="text-align: start;">Herman is 10 and was born in Louisiana. he now lives a wonderful life in Wisconsin.</div>
         </div>
         <div class="">
            <div dir="auto" style="text-align: start;">he's (mostly) a Beagle and Jack Russell mix.</div>
         </div>
         <div class="">
            <div dir="auto" style="text-align: start;">
               <span class=""><img height="16" width="16" alt="🐕" src="https://someweb.com/images/emoji/bpp/2/16/1f415.png"></span> : @House… 
               <div class="" role="button" tabindex="0">Something else</div>
            </div>
         </div>
      </span>
   </div>
</div>

Expected output:

Herman is 10 and was born in Louisiana. he now lives a wonderful life in Wisconsin.
he's (mostly) a Beagle and Jack Russell mix.🐕 : @House… Something else.

Answer

If you would like to use a DOM parser instead of pure regex and get more control over the HTML, here is an example of how to achieve this:

const htmlString = "<div>your content...</div>";

const toRawString = (htmlString) => {
  if (!htmlString) {
    return null;
  }

  const parser = new DOMParser();
  const parsedHTML = parser.parseFromString(htmlString, "text/html");

  // Replace every image with a text node holding its alt text (the emoji),
  // so the emoji stays at the image's original position in the text.
  // If you need data from other attributes, you can read them here the same way.
  const images = parsedHTML.querySelectorAll("img");
  images.forEach((image) => {
    image.replaceWith(parsedHTML.createTextNode(image.alt));
  });

  // Collapse runs of two or more whitespace characters into a single space
  return parsedHTML.body.textContent.replace(/\s\s+/g, " ");
};
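
If DOMParser is not available (for example in Node without a DOM library), a regex-only variant in the spirit of the question's stripTags might look like the sketch below. The name htmlToTextWithEmoji is hypothetical and not part of the original question or answer. The sketch assumes that alt values are quoted and that no attribute value contains a ">" character; it does not decode HTML entities such as &amp;.

```typescript
// Hypothetical helper (not from the original post): strips tags but keeps
// the alt text of <img> elements, where the emoji characters are stored.
const htmlToTextWithEmoji = (html: string): string => {
  return html
    // Drop HTML comments first, including multi-line ones.
    .replace(/<!--[\s\S]*?-->/g, "")
    // Replace each <img ...> tag with the value of its alt attribute.
    // Assumption: the alt value is wrapped in double or single quotes.
    .replace(/<img\b[^>]*>/gi, (imgTag: string): string => {
      const match = /\salt\s*=\s*(?:"([^"]*)"|'([^']*)')/i.exec(imgTag);
      return match ? match[1] || match[2] || "" : "";
    })
    // Remove every remaining opening, closing, or self-closing tag.
    .replace(/<\/?\w+[^>]*>/g, "")
    // Collapse whitespace runs and trim the ends.
    .replace(/\s+/g, " ")
    .trim();
};
```

For instance, calling it on '<div>Hi <span><img height="16" alt="🐕" src="x.png"></span> : dog</div>' should return "Hi 🐕 : dog". For untrusted or malformed HTML, the DOM parser approach above is likely the more robust choice, since regular expressions cannot handle every edge case of HTML syntax.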
User contributions licensed under: CC BY-SA