Skip to content
Advertisement

How can I return all data from different scope and receive them?

I tried to scrape a website using Node.JS + Cheerio + Axios, I’ve get all the things I need, but the problem is I don’t know how to return the data from different scopes to receive it (I can only receive the url, not the data inside another scope).

The only data I can receive is the url, but all the data in another scope, I’m quite of can’t figure out how to receive it together with the url

How’s my module work, it scrapes multiple url, and inside of each url contains things like title, description, subtitle, etc, so that’s why I have to map 2 times.

Here’s my code:

The services that I’m using to scrape:

exports.getSlides = async () => {
    const { data } = await client.get("/")

    const $ = cheerio.load(data)
    return $(".MovieListTop .TPostMv")
        .toArray()
        .map((element) => {
            const listItem = $(element)

            const url = listItem.find("a").attr("href")

            axios(url).then((res) => {
                const new$ = cheerio.load(res.data)

                new$(".TpRwCont")
                    .toArray()
                    .map((element) => {
                        const item = new$(element)

                        const title = item.find(".Title").first().text().trim()
                        const subTitle = item.find(".SubTitle").first().text().trim()
                        const description = item.find(".Description").first().text().trim()
                        const time = item.find(".Time").first().text().trim()
                        const date = item.find(".Date").first().text().trim()
                        const view = item.find(".View").first().text().trim()

                        // console.log({ title, subTitle, description, time, date, view })
                        return { data: { title, subTitle, description, time, date, view } }
                    })
            })
            return { url }
        })
}

The controller that I’m using to receive the data:

const movieServices = require("../services/index")

exports.getSlides = async (req, res, next) => {
    const data = await movie.getSlides()
    try {
        res.json({
            message: "Success",
            data: data,
        })
    } catch (err) {
        next(err)
    }
}

What I’m expected:

{
  "message:": "Success",
  "data": [
    {
      "url": "url1",
      "data": {
        "title": "titleA",
        "subTitle": "subTitleA",
        ...key : value
      }
    },
    {
      "url": "url2",
      "data": {
        "title": "titleB",
        "subTitle": "subTitleB",
        ...key : value
      }
    },
    {
      "url": "url3",
      "data": {
        "title": "titleC",
        "subTitle": "subTitleC"
        ...key : value
      },
      more objects
    }
  ]
}

Advertisement

Answer

Here’s a reworked version that uses async/await in order to serialize the requests, organize the data and return the data in a promise. The caller can then use await or .then() to get the data out of the promise.

I’m not entirely sure I understood what result you wanted because what you described in your question and comments doesn’t quite match with what the code produces. This code gets a top level array of URLs and then for each URL, there is an array of data objects for each newsElement that URL has. So, there’s an array of objects where each object has a url and an array of data. The data is an array of newsElement objects in the url’s page like this:

[
    {
      url: url1, 
      data: [
        {
          title: someTitle1, 
          subTitle: someSubTitle1, 
          description: someDescription1, 
          time: someTime1, 
          date: someDate1, 
          view: someView1
        },
        {
          title: someTitle2, 
          subTitle: someSubTitle2, 
          description: someDescription2, 
          time: someTime2, 
          date: someDate2, 
          view: someView2
        }
      ]
    },
    {
      url: url2, 
      data: [
        {
          title: someTitle3, 
          subTitle: someSubTitle3, 
          description: someDescription3, 
          time: someTime3, 
          date: someDate3, 
          view: someView3
        },
        {
          title: someTitle4, 
          subTitle: someSubTitle4, 
          description: someDescription4, 
          time: someTime4, 
          date: someDate4, 
          view: someView4
        }
      ]
   },
]

And, here’s the code:

exports.getSlides = async () => {
    const { data } = await client.get("/");
    const $ = cheerio.load(data);
    const elements = $(".MovieListTop .TPostMv").toArray();
    const results = [];
    for (let element of elements) {
        const listItem = $(element);
        const url = listItem.find("a").attr("href");
        // for each url, we collect an array of objects where
        // each object has title, subTitle, etc.. from a newsElement
        const urlData = [];
        const res = await axios(url);
        const new$ = cheerio.load(res.data);
        const newsElements = new$(".TpRwCont").toArray();
        for (let newsElement of newsElements) {
            const item = new$(newsElement);
            const title = item.find(".Title").first().text().trim()
            const subTitle = item.find(".SubTitle").first().text().trim()
            const description = item.find(".Description").first().text().trim()
            const time = item.find(".Time").first().text().trim()
            const date = item.find(".Date").first().text().trim()
            const view = item.find(".View").first().text().trim()

            // console.log({ title, subTitle, description, time, date, view })
            urlData.push({ title, subTitle, description, time, date, view });
        }
        results.push({ url, data: urlData });
    }
    return results;
}

If you want to data collected slightly differently, you should be able to modify this code to change how it organizes the data.

User contributions licensed under: CC BY-SA
1 People found this is helpful
Advertisement