I am using Node and Puppeteer to scrape some information from a webpage, and I am having issues selecting the right elements.
The markup below describes the exact situation. I want to select only the ‘Hello’ text, which is always in the first child. The only difference is that there are about 50 blocks of DOM exactly like this, and I want to select the ‘Hello’ in every one of them.
<span class='first'>
  <span class='second'>
    <span class='third'>
      <span> <a class='forth'>Hello</a> </span>
    </span>
  </span>
  <span class='second'>
    <span class='third'>
      <span> <a class='forth'>Some text</a> </span>
    </span>
  </span>
  <span class='second'>
    <span class='third'>
      <span> <a class='forth'>Different text</a> </span>
    </span>
  </span>
Answer
If the emphasis is on the “Hello” text, you can use an XPath expression with the contains() function, which matches elements containing the given text, and run it with page.$x, which returns an array of element handles:
await page.$x("//a[contains(text(), 'Hello')]")
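Since page.$x resolves to an array of element handles rather than strings, the text still has to be read from each handle. The following is a minimal sketch of that step; the helper name textsByXPath is hypothetical, and the fallback branch assumes that Puppeteer releases without page.$x accept the xpath/ selector prefix in page.$$, which is worth checking against the version you have installed.

```javascript
// Hypothetical helper: returns the trimmed text of every node matched by an XPath.
// `page` is assumed to be a Puppeteer Page that has already navigated.
async function textsByXPath(page, xpath) {
  // Older Puppeteer versions expose page.$x; the `xpath/` prefix used in the
  // fallback is an assumption about newer versions that dropped page.$x.
  const handles = typeof page.$x === 'function'
    ? await page.$x(xpath)
    : await page.$$('xpath/' + xpath);

  // Evaluate inside the browser context to read each element's text.
  return Promise.all(
    handles.map(handle => handle.evaluate(el => el.textContent.trim()))
  );
}
```

Called as await textsByXPath(page, "//a[contains(text(), 'Hello')]"), this would return one string per matching link, so 50 matching blocks should give an array of 50 entries.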
You can also grab only the first child with a CSS selector:
await page.$('body > span > span:nth-child(1) > span > span > a')
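page.$ returns only the first match in the document, while the question asks for the first child of every one of the roughly 50 blocks. One way to get all of them is page.$$eval, sketched below. The selector is built from the class names in the question's markup (first, second, forth) and is an assumption about the real page; the helper name firstLinkTexts is hypothetical.

```javascript
// Hypothetical helper: for every `.first` block, take the link inside its
// first `.second` child and return the link text.
async function firstLinkTexts(page) {
  // `:first-child` restricts the match to the first element child of each block.
  const selector = 'span.first > span.second:first-child a.forth';

  // $$eval runs the callback in the browser with all matched elements.
  return page.$$eval(selector, links =>
    links.map(link => link.textContent.trim())
  );
}
```

With the markup from the question repeated 50 times, await firstLinkTexts(page) should yield 50 strings, one per block. The class-based selector is used here instead of the body > span > ... chain because it does not depend on where the blocks sit under body.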
You can then read its content with page.evaluate, where selector is the CSS selector string from above:
const text = await page.evaluate(el => el.innerText, await page.$(selector))
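One caveat with the line above: page.$ resolves to null when nothing matches, and el.innerText would then throw inside the page. A small guard avoids that. This is a sketch only, and the helper name textOf is hypothetical.

```javascript
// Hypothetical helper: returns the innerText of the first element matching
// `selector`, or null when the selector matches nothing.
async function textOf(page, selector) {
  const handle = await page.$(selector);
  if (handle === null) {
    return null; // no match, so there is nothing to evaluate
  }
  const text = await page.evaluate(el => el.innerText, handle);
  await handle.dispose(); // release the handle once the text has been read
  return text;
}
```

For example, await textOf(page, 'body > span > span:nth-child(1) > span > span > a') would give the link text, or null if the selector needs adjusting.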
Did you know? If you right-click an element in the Chrome DevTools “Elements” tab and choose “Copy”, you can copy the element's exact selector or XPath. You can then switch to the “Console” tab and test the selector with the console utilities before using it in your Puppeteer script. E.g.: $x("//a[contains(text(), 'Hello')]")[0].innerText
or $('body > span > span:nth-child(1) > span > span > a').innerText
should show the text of the link you expect. Note that $x returns an array in the console, so you need to index into it. If the result is not what you expect, adjust the selector or check whether more than one element matches it. This can help you find more appropriate selectors.