How to select all tags except anchors (including anchors nested inside other elements) with document.querySelectorAll?

Question

edit: Is it possible to get all the inner text from the tags in an HTML document except the text from anchor tags (including anchors nested inside other elements) with the document.querySelectorAll method? My program has an input field that allows users to insert some selector to get the t…

Accepted Answer

Your question is how to allow users to extract information from arbitrary hypertext [documents]. Deciding which elements to scrape is only part of that problem. The other part is how to transform the set of scraped elements into the data set the user is ultimately interested in.

This means CSS selectors alone won’t do. You need a data transformation that takes the set of elements as input and yields the data set of interest as output. Your question illustrates this with the case of wanting the text content of some elements, or of the entire document, as if the a elements were not there. That is the transformation procedure in this particular case.

You do state, however, that you want to let users specify what they want to scrape. That means the transformation procedure has further variables and may need to be general with respect to the kinds of transformations it can perform.

With this in mind, I would suggest you look into technologies like XSLT, which is designed for exactly this: transforming data.

Depending on how computer literate you expect your users to be, you might need to encapsulate the raw power and complexity of XSLT, for example by giving users a simple UI that translates their queries into XSL stylesheets and then feeds those stylesheets to an XSLT processor. In any case, XSLT itself can carry a lot of the load. You also won’t need both XSLT and CSS selectors, because XSLT uses XPath, which you can use yourself and even expose to users.

Let’s consider the following short example of an HTML document you want scraped:

I think the document you are looking for is at example.com.

If you want all text extracted except the text inside a elements, the following XSL stylesheet will configure an XSLT processor to yield exactly that:
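A minimal sketch of such a stylesheet, not necessarily the one the answer originally showed. It assumes XSLT 1.0 and a source document whose elements are in no namespace (for namespaced XHTML the match pattern would need a prefix bound to the XHTML namespace). It relies on the built-in template rules, which recurse through elements and copy text nodes to the output, and adds a single empty template that suppresses a elements along with everything inside them.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Emit plain text rather than XML or HTML markup -->
  <xsl:output method="text"/>

  <!-- An empty template: every a element, including its text and
       descendants, produces no output. This applies at any depth,
       so anchors nested inside other elements are skipped too. -->
  <xsl:template match="a"/>

  <!-- No other templates are needed. The built-in rules visit all
       remaining elements and copy their text nodes to the output. -->

</xsl:stylesheet>
```

Applied to the example above, and assuming example.com is the content of an a element in the source markup, this would yield the text "I think the document you are looking for is at ." with the link text dropped. Letting users vary the match pattern (an XPath-based pattern such as a, or something more specific) is one way to expose the choice of excluded elements.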
