I have written a small web scraper using puppeteer, but I can’t seem to properly extract the information I want. Could you please help me find the mistake?
Background: I want to scrape a website that shows how much of a premium the city allows a landlord to add to rent-controlled apartments (e.g. for a prime location).
What I have done so far (code below): I am able to navigate through the site, access the iframe, enter some input, click a button, and get a resulting summary form. I want to extract the date and euro values of the first two rows and save them to a JSON file. Ultimately, I want to do this for a bunch of addresses (I still need to work out how to do this easily) and then aggregate the info (difference in the premium compared to the previous period, etc.).
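For the multi-address step, this is roughly the loop I have in mind, with a hypothetical `scrapeAddress` helper that would wrap the form-filling and extraction steps for one address (the second address is just a placeholder):

```js
// Hypothetical: scrapeAddress(address) would run the type/click/extract
// steps for one address and resolve to its scraped rows.
const addresses = [
  'Gumpendorfer Straße 12, 1060 Wien',
  'Neubaugasse 5, 1070 Wien', // placeholder second address
];

async function scrapeAll(scrapeAddress) {
  const results = {};
  // sequential on purpose, so the single iframe form is reused
  // for one address at a time
  for (const address of addresses) {
    results[address] = await scrapeAddress(address);
  }
  return results;
}
```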
The Problem: I can isolate the selectors for the relevant info I want, but frame.$$eval and frame.$ return nothing (though they run through without error). So I used waitForSelector, which timed out, and frame.evaluate threw an error. It was all very weird. My next approach was to scrape the whole form/summary, and this worked! When I printed the object to the console, I had one long string containing everything on the page, including my info. However, it was highly unstructured and I couldn't figure out how to isolate my info from it. In addition, I couldn't save it to the JSON file (only a portion of the text was saved).
```js
const puppeteer = require("puppeteer");
const chalk = require("chalk");
const fs = require('fs');

const error = chalk.bold.red;
const success = chalk.keyword("green");

(async () => {
  try {
    // open the headless browser
    var browser = await puppeteer.launch({ slowMo: 250 });
    // open a new page
    var page = await browser.newPage();
    // enter url in page
    await page.goto(`https://mein.wien.gv.at/Meine-Amtswege/richtwert?subpage=/lagezuschlag/`, { waitUntil: 'networkidle2' });
    // continue without newsletter
    await page.click('#dss-modal-firstvisit-form > button.btn.btn-block.btn-light');
    // let everything load
    await page.waitFor(5000);

    console.log('waiting for iframe with form to be ready.');
    // wait until selector is available
    await page.waitForSelector('iframe');
    console.log('iframe is ready. Loading iframe content');
    // choose the relevant iframe
    const elementHandle = await page.$(
      'iframe[src="/richtwertfrontend/lagezuschlag/"]',
    );
    // go into frame in order to input info
    const frame = await elementHandle.contentFrame();

    // enter address
    console.log('filling form in iframe');
    await frame.type('#input_adresse', 'Gumpendorfer Straße 12, 1060 Wien', { delay: 1000 });
    // choose first option from dropdown
    console.log('Choosing from dropdown');
    await frame.click('#react-autowhatever-1--item-0');
    console.log('pressing button');
    // press button to search
    await frame.click('#next-button');

    // scraping data
    console.log('scraping');
    const optionsResult = await frame.$$eval('#summary', (options) => {
      const result = options.map(option => option.textContent);
      return result;
    });
    console.log(optionsResult);

    await browser.close();
    fs.writeFile("data.json", JSON.stringify(optionsResult), 'utf8', function (err) {
      if (err) {
        return console.log(error(err));
      }
      console.log(success("The data has been scraped and saved successfully! View it at './data.json'"));
    });
    console.log(success("Browser Closed"));
  } catch (err) {
    // Catch and display errors
    console.log(error(err));
    await browser.close();
    console.log(error("Browser Closed"));
  }
})();
```
I am posting the whole code for completeness; the important bit is the scraping section marked with the `// scraping data` comment.
I have perused SO and read many different threads but haven’t yet found the solution. I hope everything is clear and I would appreciate any help!
PS: I am quite new to JS/Node.js/Puppeteer, so apologies if there are some inaccuracies; I don't know the ins and outs of the language yet.
Answer
Some considerations.
```js
await frame.type('#input_adresse', 'Gumpendorfer Straße 12, 1060 Wien', { delay: 1000 });
```

A delay of 1000 ms per keystroke seems too long; 100 or even 50 will probably suffice.

Prefer `innerText` to `textContent` to get more readable content.

This is how you can get more structured data: a multidimensional array with rows and cells.
```js
// scraping data
console.log('scraping');
await frame.waitForSelector('#summary > div > div > br ~ div');
const optionsResult = await frame.evaluate(() => {
  const rows = [...document.querySelectorAll('#summary > div > div > br ~ div')];
  const cells = rows.map(
    row => [...row.querySelectorAll('div')]
      .map(cell => cell.innerText)
  );
  return cells;
});
```
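From there, turning the `cells` array into the JSON the question asks for is plain post-processing in Node. A sketch, assuming each relevant row is `[date, euro]` — the actual column order on the page may differ, so treat the indices as placeholders to adjust:

```js
// Parse a German-formatted euro string like "1.234,56 €" into a number:
// strip everything but digits/commas/dots/minus, drop thousands dots,
// then turn the decimal comma into a dot.
function parseEuro(s) {
  return parseFloat(s.replace(/[^\d,.-]/g, '').replace(/\./g, '').replace(',', '.'));
}

// Keep the first two rows and label the cells. The [0]/[1] indices
// are assumptions about the table layout and may need adjusting.
function toRecords(cells) {
  return cells.slice(0, 2).map(row => ({
    date: row[0],
    euro: parseEuro(row[1]),
  }));
}

// Difference in the premium compared to the previous period.
function premiumDiff(records) {
  return records[0].euro - records[1].euro;
}

const sample = [
  ['01.04.2019', '1,23 €'],
  ['01.04.2017', '1,12 €'],
];
console.log(JSON.stringify(toRecords(sample)));
```

The result of `toRecords` can then be written out with `fs.writeFile` (or `fs.promises.writeFile`) exactly as in the question, but as structured objects instead of one long string.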