Is there a way to scrape a website using cheerio if the image I want to scrape is protected by Cloudflare and returns a 1020 error?

I am building a manga scraping website as a personal project. Just as I finished the whole site, I found out that the images can't be scraped or displayed by my website, and when I open an image link directly I get a 1020 "access denied" error. Is there any way to get past that error without getting an authorization token from the website owner?

If the answer is no, can anyone explain how Cloudflare protects the image from scraping? As far as I know, everything that is on the frontend can be scraped.

Edit: Here is one of the images that I want to scrape, but when I open it in the browser it gives a 1020 access denied error.

Answer

With that website, in order to download an image like this one, you need this header on the HTTP request:

Referer: "https://mangakakalot.com/"

Add that header and then it successfully returns the desired image. Remove that header and you get an error (403 in this case).

Here’s a simple test app:

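// Note: require() works with got v11 and earlier; got v12+ is ESM-only and needs `import`.
const got = require('got');

const url = "https://s61.mkklcdnv61.com/mangakakalot/u1/uh918990/chapter_0_prologue/1.jpg";

// The image host rejects requests that lack a Referer pointing at the main site.
const options = {
    headers: {
        Referer: "https://mangakakalot.com/",
    }
};

got(url, options).then(result => {
    // Success: the response object for the image request
    console.log(result);
}).catch(err => {
    // Without the Referer header this branch runs with a 403 error
    console.log(err);
});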
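
If you want the image saved to disk rather than logged, the same idea can be sketched with only Node's built-in https and fs modules. This is a minimal sketch, not part of the test app above: the function names buildRequestOptions and downloadImage and the destination path are made up for illustration, and it assumes the server answers with a plain 200 and no redirect.

```javascript
const https = require('https');
const fs = require('fs');

// Build request options carrying the Referer header the image host checks.
// (Hypothetical helper name, used only for this sketch.)
function buildRequestOptions(referer) {
    return { headers: { Referer: referer } };
}

// Download `url` into `destPath`; resolves with the number of bytes written.
// Assumes a direct 200 response; redirects are treated as errors here.
function downloadImage(url, destPath, referer) {
    return new Promise((resolve, reject) => {
        https.get(url, buildRequestOptions(referer), res => {
            if (res.statusCode !== 200) {
                res.resume(); // discard the body so the socket is released
                reject(new Error(`Unexpected status ${res.statusCode}`));
                return;
            }
            const chunks = [];
            res.on('data', chunk => chunks.push(chunk));
            res.on('error', reject);
            res.on('end', () => {
                const body = Buffer.concat(chunks);
                fs.writeFile(destPath, body, err => {
                    if (err) reject(err);
                    else resolve(body.length);
                });
            });
        }).on('error', reject);
    });
}
```

Calling downloadImage(url, "1.jpg", "https://mangakakalot.com/") with the url from the test app should then write the file, provided the Referer check is the only thing the server enforces.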

FYI, if you’re wondering how I figured this out, I went to the web page that contains this image. I looked in the Network tab of the Chrome DevTools and found the request where the browser downloaded this particular image. I then looked at exactly which headers the browser sent on that request. I added two easy ones (Referer and User-Agent) to more accurately mimic the browser. That changed the response from a 403 to a 200. Then I experimented to see whether I could remove either of these headers, and it worked with only the Referer header.
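
The remove-one-header-at-a-time experiment can also be automated. The sketch below is only an illustration of that idea: headerSubsets and findMinimalHeaders are hypothetical helper names, and probe is an assumed callback that you supply, which sends a request with the given headers and resolves to the HTTP status code.

```javascript
// Return every subset of the given header object, smallest subsets first.
function headerSubsets(headers) {
    const names = Object.keys(headers);
    const subsets = [];
    for (let mask = 0; mask < (1 << names.length); mask++) {
        const subset = {};
        names.forEach((name, i) => {
            if (mask & (1 << i)) subset[name] = headers[name];
        });
        subsets.push(subset);
    }
    // Sort by number of headers so the first success is a smallest working set.
    return subsets.sort((a, b) => Object.keys(a).length - Object.keys(b).length);
}

// Try each subset in order and return the first one for which probe()
// resolves to 200, or null if none of them works.
async function findMinimalHeaders(headers, probe) {
    for (const subset of headerSubsets(headers)) {
        if (await probe(subset) === 200) return subset;
    }
    return null;
}
```

With got, a probe could look like h => got(url, { headers: h, throwHttpErrors: false }).then(r => r.statusCode), where throwHttpErrors: false makes got resolve instead of reject on a 403. The number of requests doubles with each header you include, so this is only practical for a handful of candidate headers.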

I’m guessing that the difference between the 403 error here and the 1020 error you saw when you went directly to that link in the browser probably has to do with the version of HTTP being used (the browser being more advanced than my Node.js script). But the point is that you can now download the image with the above script.

User contributions licensed under: CC BY-SA