How to limit Apify web crawler scope to first three list pages?




I have written the following web scraper in Apify (jQuery), but I am struggling to limit it to only look at certain list pages.

The crawler scrapes articles I have published at https://www.beet.tv/author/randrews, an author page with 102 paginated index pages, each containing 20 article links. The crawler works fine when executed manually and in full; it gets everything, 2,000+ articles.

However, I wish to use Apify’s scheduler to trigger an occasional crawl that scrapes articles only from the first three of those index (LIST) pages (i.e. 60 articles).

The scheduler uses cron and allows settings to be passed via input JSON. As advised, I am using “customData”…

{
  "customData": 3
}

… and then the below to take that value and use it to limit…

var maxListDepth = parseInt(context.customData); // Jakub's suggestion, Nov 20 2018
if(!maxListDepth || (maxListDepth && pageNumber <= maxListDepth)) {
    context.enqueuePage({

This should allow the script to limit the scope when executed via the scheduler, but to carry on as normal and get everything in full when executed manually.

However, while the scheduler successfully fires the crawler, the crawler still runs right through the whole set again; it doesn’t cap out at /page/3.

How can I ensure I only get the first three pages up to /page/3?

Have I malformed something?

In the code, you can see, now commented-out, my previous version of the above addition.


Those LIST pages should only be…

  1. The STARTing one, with an implied “/page/1” URL (https://www.beet.tv/author/randrews)
  2. https://www.beet.tv/author/randrews/page/2
  3. https://www.beet.tv/author/randrews/page/3

… and not the likes of /page/101 or /page/102, which may surface.
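If the text-based page check proves fragile (WordPress pagination links such as “Next” have no numeric text, so `parseInt($(this).text())` yields NaN for them), one alternative sketch, not part of the original crawler, is to derive the page number from the link’s href instead:

```javascript
// Sketch (assumption, not the original crawler's code): derive the LIST
// page number from the URL itself rather than from the link text.
// The START URL has no /page/N segment, so it is treated as page 1.
function pageNumberFromUrl(url) {
  var match = /\/page\/(\d+)/.exec(url);
  return match ? parseInt(match[1], 10) : 1;
}
```

Inside the `$('a.page-numbers').each(...)` loop, `pageNumberFromUrl($(this).attr('href'))` would then return a usable number even for “Next”/“Previous” links.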


Here are the key terms…

START https://www.beet.tv/author/randrews
LIST https://www.beet.tv/author/randrews/page/[\d+]
DETAIL https://www.beet.tv/*
Clickable elements a.page-numbers

And here is the crawler script…

function pageFunction(context) {

 // Called on every page the crawler visits, use it to extract data from it
 var $ = context.jQuery;

 // If page is START or a LIST,
 if (context.request.label === 'START' || context.request.label === 'LIST') {

     context.skipOutput();

     // First, gather LIST page
     $('a.page-numbers').each(function() {
         // lines added to accept number of pages via customData in Scheduler...
         var pageNumber = parseInt($(this).text());
         // var maxListDepth = context.customData;
         var maxListDepth = parseInt(context.customData); // Jakub's suggestion, Nov 20 2018
         if(!maxListDepth || (maxListDepth && pageNumber <= maxListDepth)) {
           context.enqueuePage({
               url: /*window.location.origin +*/ $(this).attr('href'),
               label: 'LIST'
           });
         }
     });

     // Then, gather every DETAIL page
     $('h3>a').each(function(){
         context.enqueuePage({
             url: /*window.location.origin +*/ $(this).attr('href'),
             label: 'DETAIL'
         });
     });

 // If page is actually a DETAIL target page
 } else if (context.request.label === 'DETAIL') {

     /* context.skipLinks(); */

     var categories = [];
     $('span.cat-links a').each( function() {
         categories.push($(this).text());    
     });
     var tags = [];
     $('span.tags-links a').each( function() {
         tags.push($(this).text());    
     });

     var result = {
         title: $('h1').text(),
         entry: $('div.entry-content').html().trim(),
         datestamp: $('time').attr('datetime'),
         photo: $('meta[name="twitter:image"]').attr('content'),
         categories: categories,
         tags: tags
     };

 }
 return result;
 }

Answer

There are two options in the advanced settings that can help: Max pages per crawl and Max result records. In your case, I would set Max result records to 60; the crawler then stops after outputting 60 pages (from the first 3 LIST pages).
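As a side note on why the customData guard may silently do nothing: extracted as a pure function (a sketch for illustration, not the poster’s code verbatim), the enqueue condition shows that any falsy maxListDepth, including the NaN that parseInt returns for a missing or non-numeric customData, disables the cap entirely, which matches the observed full-set crawl.

```javascript
// Sketch for illustration: the enqueue guard from the question as a pure
// function. Any falsy maxListDepth (undefined, 0, or NaN from a failed
// parseInt) makes the guard pass everything, so every LIST page is enqueued.
function shouldEnqueueList(pageNumber, maxListDepth) {
  return !maxListDepth || pageNumber <= maxListDepth;
}
```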



Source: Stack Overflow