I want to scrape a page that contains some news items. Here is a simplified HTML version of what I have:
<info id="random_number" class="news">
  <div class="author"> Name of author </div>
  <div class="news-body">
    <blockquote>...</blockquote>
    Here it's the news text
  </div>
</info>
<info id="random_number" class="news">
  <div class="author"> Name of author </div>
  <div class="news-body"> Here it's the news text </div>
</info>
I want to get the author and body text of each news item, without the blockquote part. So I wrote this code:
// newsPage is the Puppeteer page; the result gets its own name so it does not shadow it
const newsItems = await newsPage.$$("info.news");
for (const news of newsItems) { // Loop through each element
  const author = await news.$eval('.author', s => s.textContent.trim());
  const textBody = await news.$eval('.news-body', s => s.textContent.trim());
  console.log('Author :' + author);
  console.log('TextBody :' + textBody);
}
It works well, but I don't know how to remove the blockquote from the "news-body" element before getting the text. How can I do this?
EDIT: Sometimes a blockquote exists, sometimes not.
Answer
You can use optional chaining with ChildNode.remove(), so that nothing happens when there is no blockquote. You may also find the output of innerText more readable.
const textBody = await news.$eval('.news-body', (element) => {
  // ?. skips the call when the element has no blockquote
  element.querySelector('blockquote')?.remove();
  return element.innerText.trim();
});
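If a news body can contain more than one blockquote, or you would rather not modify the live page, a variation along these lines might help. This is only a sketch: textWithoutBlockquotes is a hypothetical helper name, not part of Puppeteer, and it reads textContent instead of innerText because the copy is detached from the document and innerText depends on rendering.

```javascript
// Hypothetical helper (not a Puppeteer API). It is meant to be passed to
// $eval, so it must not reference any variables from the Node.js side.
function textWithoutBlockquotes(element) {
  // Work on a deep copy so the page itself is left untouched.
  const copy = element.cloneNode(true);
  // Remove every blockquote; the loop does nothing if there are none.
  for (const quote of copy.querySelectorAll('blockquote')) {
    quote.remove();
  }
  // The copy is not rendered, so use textContent rather than innerText.
  return copy.textContent.trim();
}

// Assumed usage inside the loop from the question:
// const textBody = await news.$eval('.news-body', textWithoutBlockquotes);
```

The answer above is shorter and perfectly fine when at most one blockquote is present and changing the page does not matter.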