I have written a small web scraper using puppeteer, but I can’t seem to properly extract the information I want. Could you please help me find the mistake?
Background: I want to scrape a website that indicates how much of a premium the city allows a landlord to add to rent-controlled apartments (e.g. for a prime location).
What I have done so far (code below): I am able to navigate through the site, access the iframe, enter some input, click a button and get a resulting summary form. I want to extract the date and euro values of the first two rows and save them to a JSON file. Ultimately, I want to do this for a bunch of addresses (I still need to check how I can easily do this) and then aggregate this info (difference in the premium to the previous period, etc.).
The Problem: I can isolate the selectors for the relevant info I want, but using frame.$$eval or frame.$ delivers nothing (though it runs through without error). So I used waitForSelector, which timed out, and frame.evaluate threw an error. It was all very weird. My next approach was to scrape the whole form/summary – this worked! When I printed the object to the console, I had one long character string with everything on the page, including my info. However, this was highly unstructured and I couldn’t figure out how to work with it to isolate my info. In addition, I couldn’t save it to the JSON file (only a portion of the text was saved).
const puppeteer = require("puppeteer");
const chalk = require("chalk");
const fs = require('fs');

const error = chalk.bold.red;
const success = chalk.keyword("green");

(async () => {
  try {
    // open the headless browser
    var browser = await puppeteer.launch({ slowMo: 250 });
    // open a new page
    var page = await browser.newPage();
    // enter url in page
    await page.goto(`https://mein.wien.gv.at/Meine-Amtswege/richtwert?subpage=/lagezuschlag/`, { waitUntil: 'networkidle2' });
    // continue without newsletter
    await page.click('#dss-modal-firstvisit-form > button.btn.btn-block.btn-light');
    // let everything load
    await page.waitFor(5000);
    console.log('waiting for iframe with form to be ready.');
    // wait until selector is available
    await page.waitForSelector('iframe');
    console.log('iframe is ready. Loading iframe content');
    // choose the relevant iframe
    const elementHandle = await page.$(
      'iframe[src="/richtwertfrontend/lagezuschlag/"]',
    );
    // go into frame in order to input info
    const frame = await elementHandle.contentFrame();
    // enter address
    console.log('filling form in iframe');
    await frame.type('#input_adresse', 'Gumpendorfer Straße 12, 1060 Wien', { delay: 1000 });
    // choose first option from dropdown
    console.log('Choosing from dropdown');
    await frame.click('#react-autowhatever-1--item-0');
    console.log('pressing button');
    // press button to search
    await frame.click('#next-button');
    // scraping data
    console.log('scraping');
    const optionsResult = await frame.$$eval('#summary', (options) => {
      const result = options.map(option => option.textContent);
      return result;
    });
    console.log(optionsResult);
    await browser.close();
    fs.writeFile("data.json", JSON.stringify(optionsResult), 'utf8', function(err) {
      if (err) {
        return console.log(error(err));
      }
      console.log(success("The data has been scraped and saved successfully! View it at './data.json'"));
    });
    console.log(success("Browser Closed"));
  } catch (err) {
    // catch and display errors
    console.log(error(err));
    await browser.close();
    console.log(error("Browser Closed"));
  }
})();
I am posting the whole code for completeness; the important bit is the “scraping” section starting on line 45.
I have perused SO and read many different threads but haven’t yet found a solution. I hope everything is clear and I would appreciate any help!
PS I am quite new with JS/node.js/puppeteer so apologies if there are some inaccuracies and I don’t know the ins and outs of the language yet.
Answer
Some considerations:
- In await frame.type('#input_adresse', 'Gumpendorfer Straße 12, 1060 Wien', { delay: 1000 }), a delay of 1000 ms seems too long; maybe 100 or even 50 will suffice.
- Prefer innerText to textContent to get more readable content.

This is how you can get more structured data, a multidimensional array with rows and cells:
// scraping data
console.log('scraping');
await frame.waitForSelector('#summary > div > div > br ~ div');
const optionsResult = await frame.evaluate(() => {
  const rows = [...document.querySelectorAll('#summary > div > div > br ~ div')];
  const cells = rows.map(
    row => [...row.querySelectorAll('div')]
      .map(cell => cell.innerText)
  );
  return cells;
});