I was experimenting with Puppeteer on YouTube, trying to scrape the image URL (src) and the title text from the thumbnails on the YouTube home page. The scraper itself runs fine. The issue is in the output: when it logs the URL src values, the program starts skipping lines, while the title text is logged without gaps. When I looked for a pattern in the skipped lines, I found a 7-line jump followed by a 14-line jump, repeating every time the program scrapes the information. I can't figure out why the gaps occur for the URL src but not for the title text. Does it have anything to do with how I handle the infinite scroll?
const puppeteer = require('puppeteer');

async function scrape(url) {
  const browser = await puppeteer.launch({ headless: false });
  const page = await browser.newPage();
  await page.setViewport({ width: 1200, height: 800 });

  const navigationPromise = page.waitForNavigation();
  await page.goto(url, { timeout: 0 });

  // Scroll down one viewport height to trigger loading of more content
  await page.evaluate(_ => {
    window.scrollBy(0, window.innerHeight);
  });
  await page.waitFor(5000); // note: page.waitFor is deprecated in newer Puppeteer releases
  await page.waitForSelector('#img');
  await navigationPromise;

  const loadThumbnailText = [];
  const loadThumbnailSrc = [];

  for (let i = 0; i < 50; i++) {
    const textSelector = 'h3 > a > #video-title';
    const srcSelector = 'ytd-thumbnail > a > yt-img-shadow > #img';
    await page.waitForSelector(textSelector);
    await page.waitForSelector(srcSelector);

    // Collect all matches for each selector, one value per line
    const ytTextData = await page.$$eval(textSelector, elems => elems.map(el => el.textContent).join('\n'));
    const ytSrcData = await page.$$eval(srcSelector, elems => elems.map(el => el.src).join('\n'));

    if (ytTextData && ytSrcData) {
      console.log({ ytTextData, ytSrcData });
      loadThumbnailText.push(ytTextData);
      loadThumbnailSrc.push(ytSrcData);
      console.log(ytTextData, ytSrcData);
    }
  }

  await browser.close();
}
Answer
It looks like the issue is with the selectors you have. I ran the following in the console on youtube.com:
document.querySelectorAll('h3 > a > #video-title').length
document.querySelectorAll('ytd-thumbnail > a > yt-img-shadow > #img').length
The first gave me 29 and the second 42, so the two selectors do not match the same set of elements and their results cannot line up one-to-one. It looks like there are hidden videos on the homepage that only show up after clicking a down-arrow. Your text selector is picking up those videos but your source selector is not.
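One way to keep titles and src values aligned, whatever each selector matches on its own, is to query both from the same video container instead of building two independent lists. The following is only a minimal sketch: `scrapeCards`, `formatRows`, and the `cardSelector` parameter are hypothetical names, and you would need to inspect the page to find the element that actually wraps one video. It also assumes that a thumbnail which has not loaded yet reports an empty `src`, which may or may not be what is happening in your case.

```javascript
// Pair each title with the thumbnail src from the SAME card, so the two
// columns cannot drift apart. `cardSelector` is a hypothetical selector for
// the element wrapping one video; it is not taken from the question.
async function scrapeCards(page, cardSelector) {
  return page.$$eval(cardSelector, cards =>
    cards.map(card => {
      const title = card.querySelector('#video-title');
      const img = card.querySelector('#img');
      return {
        title: title ? title.textContent.trim() : '',
        // Assumption: an image that has not loaded yet has an empty src.
        src: img && img.src ? img.src : null,
      };
    })
  );
}

// Render one line per card, marking missing thumbnails explicitly
// instead of leaving a blank line in the output.
function formatRows(rows) {
  return rows
    .map(({ title, src }) => `${title}\t${src === null ? '(no src yet)' : src}`)
    .join('\n');
}
```

Inside `scrape`, this could be used as `console.log(formatRows(await scrapeCards(page, cardSelector)))`. Every output line then belongs to exactly one video, so a missing src shows up as a labeled placeholder rather than a gap.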