I’m having a problem with Puppeteer: I want to extract a list of items, and it works when headless is false but not when it is true.
First things first, I want to get those elements before mapping over them.
Here’s my script. It’s really basic, so you should be able to reproduce the problem.

    const puppeteer = require("puppeteer");
    const chalk = require("chalk");

    const baseUrl = "https://www.interencheres.com/recherche/lots?search=";
    const searchTerm = "Apple";
    const searchUrl = baseUrl + searchTerm;

    (async () => {
      const browser = await puppeteer.launch({
        headless: false,
        ignoreHTTPSErrors: true,
        args: [`--window-size=1920,1080`],
        defaultViewport: {
          width: 1920,
          height: 1080,
        },
      });
      const page = await browser.newPage();

      // Begin navigation
      console.log(chalk.yellow("Beginning navigation."));
      await page.goto(searchUrl);

      // Await List of elements;
      console.log(chalk.yellow("Wait for Network Idle..."));
      await page.waitForNetworkIdle();

      // get Items
      const findElements = await page.evaluate(() => {
        const elements = document.querySelectorAll(".sale-item");
        console.log(elements);
        return elements;
      });
      console.log(findElements);

      console.log(chalk.blue("Waiting..."));
      await page.waitForTimeout(10000);
      await browser.close();
      console.log(chalk.red("Closed."));
    })();

Expected results:

    {
      '0': { _prevClass: 'sale-item pa-1 col-sm-6 col-md-4 col-lg-3 col-12' },
      '1': { _prevClass: 'sale-item pa-1 col-sm-6 col-md-4 col-lg-3 col-12' },
      '2': { _prevClass: 'sale-item pa-1 col-sm-6 col-md-4 col-lg-3 col-12' },
      '3': { _prevClass: 'sale-item pa-1 col-sm-6 col-md-4 col-lg-3 col-12' },
      '4': { _prevClass: 'sale-item pa-1 col-sm-6 col-md-4 col-lg-3 col-12' },
      .
      .
    }

Answer
For starters, I’d prefer page.waitForSelector(yourSelector) over page.waitForNetworkIdle(). In most cases, it’s a more direct guarantee that the data you want is on the page, whereas network idle can block waiting for all sorts of requests that are totally irrelevant to the data you’re trying to scrape. Another option is page.waitForResponse(predicate).
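As a rough sketch of the page.waitForResponse(predicate) option: the predicate is called with each response and should return true for the one that carries your data. The helper names and the urlFragment parameter below are made up for illustration, and I haven't checked which endpoint this site loads its lots from, so look in the browser's network tab and substitute the real one.

```javascript
// Build a predicate for page.waitForResponse().
// `urlFragment` is a placeholder for part of the real data endpoint's URL.
function makeResponsePredicate(urlFragment) {
  return (response) =>
    response.url().includes(urlFragment) && response.ok();
}

// Start waiting before navigating so the response cannot be missed.
async function gotoAndWaitForData(page, pageUrl, urlFragment) {
  const [response] = await Promise.all([
    page.waitForResponse(makeResponsePredicate(urlFragment)),
    page.goto(pageUrl, {waitUntil: "domcontentloaded"}),
  ]);
  return response;
}
```

Calling gotoAndWaitForData(page, searchUrl, "/some/endpoint") would resolve with the matching response, which you could then read with response.json() if the endpoint returns JSON.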
Some websites check the headers to block scrapers. You can try adding a user agent header as described in the Puppeteer GitHub issue Different behavior between { headless: false } and { headless: true } #665:

    const puppeteer = require("puppeteer"); // ^19.6.3

    const baseUrl = "https://www.interencheres.com/recherche/lots?search=";
    const searchTerm = "Apple";
    const searchUrl = baseUrl + encodeURIComponent(searchTerm);

    let browser;
    (async () => {
      browser = await puppeteer.launch();
      const [page] = await browser.pages();
      const ua =
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36";
      await page.setUserAgent(ua);
      await page.goto(searchUrl, {waitUntil: "domcontentloaded"});
      await page.waitForSelector(".sale-item");
      const elements = await page.$$(".sale-item");
      console.log(elements.length); // => 48
    })()
      .catch(err => console.error(err))
      .finally(() => browser?.close());

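If you'd rather not hard-code a user agent string (the one above pins an old Chrome version), one variation is to take the browser's own user agent and drop the headless marker. This is only a sketch: it assumes the headless browser identifies itself with a "HeadlessChrome" token, as headless Chrome has typically done, and stripHeadlessMarker and useRegularUserAgent are hypothetical helper names.

```javascript
// Replace the "HeadlessChrome" token with "Chrome"; everything else is untouched.
// Assumption: this token is the only headless giveaway in the string.
function stripHeadlessMarker(userAgent) {
  return userAgent.replace("HeadlessChrome", "Chrome");
}

// Apply the cleaned-up user agent to a page. Call this before page.goto().
async function useRegularUserAgent(browser, page) {
  const ua = await browser.userAgent();
  await page.setUserAgent(stripHeadlessMarker(ua));
}
```

This keeps the reported Chrome version in step with the browser Puppeteer actually launched. It only changes the user agent, so it may not be enough for sites that use other detection signals.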
Using puppeteer-extra as described in Why does headless need to be false for Puppeteer to work? is another option you can try. It also anonymizes the user agent headers.
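A minimal sketch of the puppeteer-extra route, assuming the puppeteer-extra and puppeteer-extra-plugin-stealth packages are installed alongside puppeteer. The function names are made up for this example, the selector is the one from the question, and I haven't verified that the stealth plugin is sufficient for this particular site.

```javascript
let stealthPuppeteer; // cached so the plugin is registered only once

// Load puppeteer-extra lazily and register the stealth plugin.
function getStealthPuppeteer() {
  if (!stealthPuppeteer) {
    const puppeteer = require("puppeteer-extra");
    const StealthPlugin = require("puppeteer-extra-plugin-stealth");
    puppeteer.use(StealthPlugin());
    stealthPuppeteer = puppeteer;
  }
  return stealthPuppeteer;
}

// Open the search page in a stealth browser and count the matching elements.
async function countSaleItems(searchUrl) {
  const browser = await getStealthPuppeteer().launch();
  try {
    const [page] = await browser.pages();
    await page.goto(searchUrl, {waitUntil: "domcontentloaded"});
    await page.waitForSelector(".sale-item");
    return (await page.$$(".sale-item")).length;
  } finally {
    await browser.close();
  }
}
```

Usage would look like countSaleItems(searchUrl).then(console.log).catch(console.error), with searchUrl built as in the script above.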