Puppeteer not retrieving JavaScript rendered page

I am trying to load the product page using Puppeteer, but it's not working.

const puppeteer = require('puppeteer')

async function start(){
    const browser = await puppeteer.launch()
    const page = await browser.newPage()

    // A timeout of 0 disables the navigation timeout entirely
    page.setDefaultNavigationTimeout(0);

    await page.setUserAgent('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36');

    // Declared with const so it does not leak as an implicit global
    const url = "https://www.coupang.com/vp/products/2275049712?itemId=3903560010"
    await page.goto(url, {'waitUntil' : ['load', 'domcontentloaded', 'networkidle0', 'networkidle2']})
    await page.screenshot({path: "screenshot3.png", fullPage:true})
    await browser.close();
}

start()

If we open this URL in a browser, only the top half of the page loads at first; the rest loads as we scroll down.

I also tried scrolling, but that did not work either.

The function I used is the following:

    const waitTillHTMLRendered = async (page, timeout = 30000) => {
      const checkDurationMsecs = 1000;
      const maxChecks = timeout / checkDurationMsecs;
      let lastHTMLSize = 0;
      let checkCounts = 1;
      let countStableSizeIterations = 0;
      const minStableSizeIterations = 3;

      while (checkCounts++ <= maxChecks) {
        const html = await page.content();
        const currentHTMLSize = html.length;

        const bodyHTMLSize = await page.evaluate(() => document.body.innerHTML.length);

        console.log('last: ', lastHTMLSize, ' <> curr: ', currentHTMLSize, " body html size: ", bodyHTMLSize);

        // Count consecutive checks in which the HTML size did not change
        if (lastHTMLSize !== 0 && currentHTMLSize === lastHTMLSize)
          countStableSizeIterations++;
        else
          countStableSizeIterations = 0; // reset the counter

        if (countStableSizeIterations >= minStableSizeIterations) {
          console.log("Page rendered fully..");
          break;
        }

        lastHTMLSize = currentHTMLSize;
        // Plain timer instead of page.waitForTimeout, which newer Puppeteer releases removed
        await new Promise(resolve => setTimeout(resolve, checkDurationMsecs));
      }
    };

Answer

When I run this headfully, the page never loads fully: the review content does not appear. The site seems to detect the bot and block the reviews from coming through, regardless of scrolling.

Using puppeteer-extra-plugin-stealth headfully avoids detection, but headless stealth is still blocked. I’ll update if I can find a solution, but I figure this is at least a step forward.

const puppeteer = require("puppeteer-extra"); // ^3.2.3
const StealthPlugin = require("puppeteer-extra-plugin-stealth"); // ^2.9.0
puppeteer.use(StealthPlugin());

let browser;
(async () => {
  browser = await puppeteer.launch({headless: false});
  const [page] = await browser.pages();
  const url = "https://www.coupang.com/vp/products/2275049712?itemId=3903560010";
  await page.goto(url, {waitUntil: "domcontentloaded"});
  await page.waitForSelector(".sdp-review__article__list__review__content");
  await page.waitForNetworkIdle();
  await page.screenshot({path: "screenshot3.png", fullPage: true});
})()
  .catch(err => console.error(err))
  .finally(() => browser?.close())
;

In the future, if you see waitForSelector timeouts when running headlessly, it’s a good idea to add console.log(await page.content()); to your script. The output will usually show that you’ve been blocked, before you waste time on scrolling and other futile strategies.
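
As a rough sketch of that debugging tip, the check can be wrapped in a small helper. The name waitForSelectorOrDump and the maxChars option are my own inventions, not part of Puppeteer; the only Puppeteer calls used are page.waitForSelector and page.content. It assumes that a block page will be recognizable from the first couple of thousand characters of HTML, which may not hold for every site.

```javascript
// Hypothetical helper (not a Puppeteer API): wait for a selector and, if the
// wait fails (for example on a timeout), log the beginning of the page's HTML
// before rethrowing, so a block or captcha page is visible right away.
async function waitForSelectorOrDump(page, selector, {timeout = 30000, maxChars = 2000} = {}) {
  try {
    return await page.waitForSelector(selector, {timeout});
  } catch (err) {
    // page.content() resolves to the full serialized HTML of the page
    const html = await page.content();
    console.error(`waitForSelector("${selector}") failed: ${err.message}`);
    console.error(html.slice(0, maxChars)); // only the first maxChars characters
    throw err; // keep the original failure visible to the caller
  }
}
```

In the script above, this would replace the bare call: await waitForSelectorOrDump(page, ".sdp-review__article__list__review__content");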

See also Why does headless need to be false for Puppeteer to work?

User contributions licensed under: CC BY-SA