Why can’t Puppeteer scrape an element from an iframe even if I add the selector?


I have written a small web scraper using puppeteer, but I can’t seem to properly extract the information I want. Could you please help me find the mistake?

Background: I want to scrape a website that indicates how much of a premium the city allows a landlord to add to the rent of rent-controlled apartments (e.g. for a prime location).

What I have done so far (code below): I am able to navigate through the site, access the iframe, enter some input, click a button and get a resulting summary form. I want to extract the date and euro values of the first two rows and save them to a JSON file. Ultimately, I want to do this for a bunch of addresses (I still need to check how I can easily do this) and then aggregate this info (difference in the premium compared to the previous period, etc.).

The problem: I can isolate the selectors for the information I want, but frame.$$eval or frame.$ returns nothing (though it runs without error). I then tried waitForSelector, which timed out, and frame.evaluate, which threw an error. It was all very weird. My next approach was to scrape the whole form/summary, and this worked! When I printed the object to the console, I got one long string with everything on the page, including my info. However, it was highly unstructured and I couldn’t figure out how to isolate my info from it. In addition, I couldn’t save it to the JSON file (only a portion of the text was saved).

const puppeteer = require("puppeteer");
const chalk = require("chalk");
const fs = require('fs');
const error = chalk.bold.red;
const success = chalk.keyword("green");

(async () => {
  try {
    // open the headless browser
      var browser = await puppeteer.launch({slowMo: 250});

    // open a new page
      var page = await browser.newPage();

    // enter url in page
      await page.goto(`https://mein.wien.gv.at/Meine-Amtswege/richtwert?subpage=/lagezuschlag/`, {waitUntil: 'networkidle2'});
   // continue without newsletter
      await page.click('#dss-modal-firstvisit-form > button.btn.btn-block.btn-light');
   // let everything load
      await page.waitFor(5000)
      console.log('waiting for iframe with form to be ready.');
      //wait until selector is available
      await page.waitForSelector('iframe');
      console.log('iframe is ready. Loading iframe content');
      //choose the relevant iframe
      const elementHandle = await page.$(
          'iframe[src="/richtwertfrontend/lagezuschlag/"]',
      );
      //go into frame in order to input info
      const frame = await elementHandle.contentFrame();
      //enter address
      console.log('filling form in iframe');
      await frame.type('#input_adresse', 'Gumpendorfer Straße 12, 1060 Wien', { delay: 1000 });

      //choose first option from dropdown
      console.log('Choosing from dropdown');
      await frame.click('#react-autowhatever-1--item-0');

      console.log('pressing button');
      //press button to search
      await frame.click('#next-button');

      // scraping data
      console.log('scraping')
      const optionsResult = await frame.$$eval('#summary', (options) => {
          const result = options.map(option => option.textContent);
          return result;
            });

    console.log(optionsResult);

   await browser.close();

          fs.writeFile("data.json", JSON.stringify(optionsResult), 'utf8', function(err) {
            if(err) {
                return console.log(error(err));
            }
            console.log(success("The data has been scraped and saved successfully! View it at './data.json'"));
        });

    console.log(success("Browser Closed"));
  } catch (err) {
      // Catch and display errors
      console.log(error(err));
      await browser.close();
      console.log(error("Browser Closed"));
    }


})();

I am posting the whole code for completeness; the important bit is the section marked “// scraping data” near the end.

I have perused SO and read many different threads but haven’t yet found the solution. I hope everything is clear and I would appreciate any help!

PS: I am quite new to JS/Node.js/Puppeteer, so apologies if there are some inaccuracies; I don’t know the ins and outs of the language yet.

Answer

Some considerations.

  1. In await frame.type('#input_adresse', 'Gumpendorfer Straße 12, 1060 Wien', { delay: 1000 }); a delay of 1000 ms per keystroke seems too long; 100 or even 50 will probably suffice.

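  2. Prefer innerText to textContent to get more readable content: innerText returns the text roughly as it is rendered (with line breaks, without hidden elements), while textContent just concatenates every text node.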

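  3. This is how you can get more structured data: a two-dimensional array of rows and cells. Note the waitForSelector call before reading; the summary presumably only renders after the button click, so the rows have to exist before they can be scraped: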

      // scraping data
      console.log('scraping')
      await frame.waitForSelector('#summary > div > div > br ~ div');
      const optionsResult = await frame.evaluate(() => {
        const rows = [...document.querySelectorAll('#summary > div > div > br ~ div')];
        const cells = rows.map(
          row => [...row.querySelectorAll('div')]
                   .map(cell => cell.innerText)
        );
        return cells;
      });
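
To get from that two-dimensional array to the date and euro values of the first two rows, something like the following might work. It is only a sketch: the sample rows, the assumption that each row is [date, amount], and the helper names parseEuro and toRecords are all made up for illustration, so adjust them to what the page actually returns. It uses fs.writeFileSync, which is synchronous, so the file is fully written before the script moves on.

```javascript
const fs = require('fs');

// Hypothetical sample of what frame.evaluate() might return: one inner array
// per row, assumed here to hold [date, euro amount]. The real cell layout
// on the page may differ.
const sampleRows = [
  ['01.04.2019', '3,34 €'],
  ['01.04.2017', '2,90 €'],
  ['01.04.2014', '1.234,50 €'],
];

// Convert an amount written with a decimal comma, e.g. "1.234,50 €", to a number.
function parseEuro(text) {
  const normalized = text
    .replace(/[^\d.,-]/g, '') // drop the currency symbol and whitespace
    .replace(/\./g, '')       // drop thousands separators
    .replace(',', '.');       // decimal comma -> decimal point
  return Number.parseFloat(normalized);
}

// Keep only the first `count` rows and give the cells names.
function toRecords(rows, count = 2) {
  return rows.slice(0, count).map(([date, amount]) => ({
    date: date.trim(),
    euro: parseEuro(amount),
  }));
}

const records = toRecords(sampleRows);

// Synchronous write: returns only after the whole file has been written.
fs.writeFileSync('data.json', JSON.stringify(records, null, 2), 'utf8');
```

In the real script you would pass optionsResult to toRecords instead of sampleRows. Keeping the amounts as numbers rather than strings should also make it easier to compute differences between periods later.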


Source: stackoverflow