I want to scrape a page that contains news items. Here is a simplified version of the HTML I have:
    <info id="random_number" class="news">
      <div class="author">
        Name of author
      </div>
      <div class="news-body">
        <blockquote><blockquote>
        Here it's the news text
      </div>
    </info>
    <info id="random_number" class="news">
      <div class="author">
        Name of author
      </div>
      <div class="news-body">
        Here it's the news text
      </div>
    </info>
I want to get the author and the text body of each news item, without the blockquote part. So I wrote this code:
    // newsPage is the Puppeteer page; the result gets its own name, because
    // `let newsPage = await newsPage.$$(...)` would throw a ReferenceError
    const newsItems = await newsPage.$$("info.news");
    for (const news of newsItems) { // Loop through each element
      const author = await news.$eval('.author', s => s.textContent.trim());
      const textBody = await news.$eval('.news-body', s => s.textContent.trim());
      console.log('Author: ' + author);
      console.log('TextBody: ' + textBody);
    }
It works well, but I don’t know how to remove the blockquote from the “news-body” element before getting the text. How can I do this?
EDIT: Sometimes a blockquote exists, sometimes not.
Answer
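You can use optional chaining with ChildNode.remove(), so the call is simply skipped when the element has no blockquote. You may also find the output of innerText more readable than that of textContent.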
    let textBody = await news.$eval('.news-body', (element) => {
      // Remove the blockquote if there is one; ?. skips the call otherwise
      element.querySelector('blockquote')?.remove();
      return element.innerText.trim();
    });
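As a possible variation on the answer above, here is a minimal sketch that leaves the live page untouched and also handles more than one blockquote. The helper name `textWithoutBlockquotes` is hypothetical (it is not part of Puppeteer or of the original answer). It assumes the function is passed to `$eval`, so it runs in the browser context and must not rely on variables from the Node.js side.

```javascript
// Hypothetical helper, meant to be passed to $eval so it runs in the page.
// It works on a deep clone, so the blockquotes stay in the real document,
// and it removes every blockquote it finds, not only the first one.
function textWithoutBlockquotes(element) {
  const copy = element.cloneNode(true); // detached deep copy of .news-body
  for (const quote of copy.querySelectorAll('blockquote')) {
    quote.remove();
  }
  // The clone is not rendered, so textContent is used instead of innerText.
  return copy.textContent.trim();
}

// Usage inside the loop from the question (requires a Puppeteer element handle):
// const textBody = await news.$eval('.news-body', textWithoutBlockquotes);
```

The trade-off is that `textContent` keeps whitespace exactly as it appears in the markup, so the result may be less tidy than `innerText` on the live element. If mutating the page is acceptable, the shorter version in the answer is enough.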