Is it possible to create a Chrome extension containing a Puppeteer script to scrape pages and do some browser automation?
I would like to create one where a user enters a URL and clicks a button, and then a Puppeteer script runs. Is this possible, and if so, what would be the best way to implement it?
I have seen some answers referring to puppeteer-web, but it seems the Puppeteer team removed puppeteer-web. Is there a new way of implementing this?
Answer
The short answer is: no, it is not possible.
Puppeteer runs only on Node.js at the moment, which means it is a backend solution. A browser extension is considered client-side, so there is no way to run your script other than running it on a server.
In theory:*
However, you could use Express to expose your Puppeteer results through an API endpoint, where you define which page you want to scrape with a GET URL parameter (e.g. Google’s homepage: https://my-server.com/my-puppeteer-endpoint?url=https://google.com). Your extension’s button click could then call this endpoint.
Note: this means https://my-server.com should be available 24/7 to serve your extension. As an example, this is how the Grammarly or Google Translate browser extensions communicate with their official APIs.
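The call made from the extension's click, as described above, could look roughly like the following. This is only a sketch: it assumes the placeholder server and endpoint from the example URL, a global `fetch` (available in extension pages and service workers), and the helper names `buildEndpointUrl` and `scrapeViaServer` are hypothetical.

```javascript
// Placeholder endpoint taken from the example URL above.
const ENDPOINT = 'https://my-server.com/my-puppeteer-endpoint'

// Hypothetical helper: builds the GET URL and percent-encodes the target
// page, so its own ':', '/', '?' and '&' characters cannot break the query.
const buildEndpointUrl = (targetUrl) => {
  const endpoint = new URL(ENDPOINT)
  endpoint.searchParams.set('url', targetUrl)
  return endpoint.toString()
}

// Hypothetical helper: calls the server and returns the parsed JSON result.
const scrapeViaServer = async (targetUrl) => {
  const response = await fetch(buildEndpointUrl(targetUrl))
  if (!response.ok) {
    throw new Error(`Server responded with status ${response.status}`)
  }
  return response.json()
}
```

A click handler in the extension's popup could then call `scrapeViaServer` with the URL the user entered. Express decodes the query string, so `req.query.url` on the server receives the original, unencoded URL.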
Fragments of the advised solution:
// puppeteer
const getPage = async (url) => {
  ...
  await page.goto(url)
  ...
  return resultsOfScraping
}

// express
app.get('/my-puppeteer-endpoint', async (req, res) => {
  try {
    const url = req.query.url
    const response = await getPage(url)
    res.json(response)
    console.log(`/my-puppeteer-endpoint?url=${url} endpoint has been called!`)
  } catch (e) {
    console.error(e)
    // respond on failure too, otherwise the request hangs until the client times out
    res.status(500).json({ error: 'scraping failed' })
  }
})
You can get more ideas from Thomas Dondorf’s evergreen answer on client-side puppeteer usage: How to make Puppeteer work with a ReactJS application on the client-side
On the extension side, you need to make sure that you give permission to your server https://my-server.com to be called without CORS errors, see this question/answer.
*EDIT/WARNING: on the server you will need the '--no-sandbox' Puppeteer launch flag, so in general I advise setting up your own sandbox on a Linux server instead if you go this way (see the link above).
Another option is to keep a whitelist of domains you trust: pages on those domains are allowed and all others are refused to the extension. This check has to be implemented on the server side.
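A server-side whitelist check could be sketched as follows. The host list and the name `isAllowedUrl` are hypothetical, and matching on the exact hostname is just one possible policy, not the only way to do it.

```javascript
// Hypothetical whitelist; replace with the domains you actually trust.
const ALLOWED_HOSTS = new Set(['google.com', 'www.google.com'])

// Returns true only for well-formed http(s) URLs whose exact hostname is
// on the whitelist. Anything that cannot be parsed is rejected.
const isAllowedUrl = (rawUrl, allowedHosts = ALLOWED_HOSTS) => {
  let parsed
  try {
    parsed = new URL(rawUrl)
  } catch (e) {
    return false
  }
  const isHttp = parsed.protocol === 'http:' || parsed.protocol === 'https:'
  return isHttp && allowedHosts.has(parsed.hostname)
}
```

In the Express handler, you could call `isAllowedUrl(req.query.url)` before `getPage` and answer with a 403 status when it returns false. Comparing the parsed hostname, rather than searching the raw string, keeps look-alike hosts such as `google.com.evil.com` out.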