Is it possible to get all the inner text from the tags in an HTML document, except the text from anchor tags <a> (including <a> anchors nested inside other elements), with the document.querySelectorAll method?
My program has an input field that allows users to insert some selector to get the text for certain tags in a given site page.
So, if I want to insert a selector that gets text from all nodes except <a> tags, how can I accomplish that?
I mean, *:not(a) does not work, because it selects tags that may have <a> descendants, and the :not() selector does not accept complex selectors, so *:not(* a) does not work either.
I know I could delete those nodes from the document first, but is it possible to accomplish this task by only selecting the nodes I want with the document.querySelectorAll method?
Example:
<html>
  <... lots of other tags with text inside>
    <div>
      <p> one paragraph </p>
      <a> one link </a>
    </div>
  </...>
</html>
I want all the text in the HTML except “one link”.
Edit:
If you do document.querySelectorAll('*:not(a)'), you select the div, which has an a element inside it. So the innerText of this div contains the text from the a element.
Thank you
Answer
Your question is how to allow users to extract information from arbitrary hypertext [documents]. This means that solving the problem of “which elements to scrape” is just part of it. The other part is “how to transform the set of elements to scrape into a data set that the user ultimately is interested in”.
This means that CSS selectors alone won’t do. You need a data transformation, which takes the set of elements as input and yields the data set of interest as output. In your question, this is illustrated by the case of wanting just the text content of some elements, or of the entire document, but as if the a elements were not there. That is your transformation procedure in this particular case.
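For this one particular case, the transformation can be sketched in plain JavaScript as a recursive walk that collects text nodes and drops every excluded subtree. This is only a minimal sketch, not a general solution: the function name textExcluding and its parameters are made up for illustration, and it assumes DOM-like nodes that expose nodeType, nodeName, nodeValue and childNodes.

```javascript
// Hypothetical helper: return the text of `root` as if the excluded
// elements (by default every <a>) were not in the tree at all.
const ELEMENT_NODE = 1;
const TEXT_NODE = 3;

function textExcluding(root, excluded = ["a"]) {
  const skip = new Set(excluded.map((name) => name.toLowerCase()));
  const parts = [];

  (function visit(node) {
    if (node.nodeType === TEXT_NODE) {
      parts.push(node.nodeValue);
      return;
    }
    if (node.nodeType === ELEMENT_NODE && skip.has(node.nodeName.toLowerCase())) {
      return; // drop the whole subtree, e.g. an <a> and its text
    }
    // Elements, documents and fragments: descend into the children.
    for (const child of Array.from(node.childNodes || [])) {
      visit(child);
    }
  })(root);

  return parts.join("");
}
```

In a browser you would call it as textExcluding(document.body). Note that it also collects the contents of script and style elements unless you add them to the excluded list.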
You do state, however, that you want to allow users to specify what they want to scrape. This translates to your transformation procedure having other variables and possibly being general with respect to the kind of transformations it can do.
With this in mind, I would suggest you look into technologies like XSLT. XSLT, for one, is designed for these things — transforming data.
Depending on how computer literate you expect your users to be, you might need to encapsulate the raw power and complexity of XSLT, giving users a simple UI which translates their queries to XSLT and then feeds the resulting XSL stylesheets to an XSLT processor, for example. In any case, XSLT itself will be able to carry a lot of load. You also won’t need both XSLT and CSS selectors — the former uses XPath which you can utilize and even expose to users.
Let’s consider the following short example of an HTML document you want scraped:
<html>
  <body>
    <p>I think the document you are looking for is at <a href="example.com">example.com</a>.</p>
  </body>
</html>
If you want all text extracted except that of a elements, the following XSL stylesheet will configure an XSLT processor to yield exactly that:
<?xml version="1.0" encoding="utf-8" ?>
<stylesheet version="1.0" xmlns="http://www.w3.org/1999/XSL/Transform">
  <output method="text" />
  <!-- empty template element, meaning that the transformation result for every 'a' element is empty text -->
  <template match="a" />
</stylesheet>
The result of transforming the HTML document with the above XSL stylesheet document is the following text:
I think the document you are looking for is at .
Note how the a element is “stripped”, leaving an empty space between “at” and the sentence punctuation (“.”). The template element, being empty, configures the XSLT processor to not produce any text when transforming a elements ("a" is a valid, if very simple, XPath expression, by the way: it selects all a elements). The text of every other element still appears because XSLT’s built-in template rules copy text nodes to the output by default. This is all part of XSLT, of course.
I have tested this with Free Online XSL Transformer, which uses the very potent Saxon library.
Of course, you can cover one particular use case — yours — with JavaScript, without XSLT. But how are you going to let your users express what they want scraped? You will probably need to invent some [simple] language — which might as well be [the already invented] XSLT.
XSLT isn’t available out of the box in every user agent or JavaScript runtime. Both Firefox and Chrome do provide native XSLT 1.0 implementations (through the XSLTProcessor class), but that interface is not specified by any standards body, so it may be missing in your particular runtime environment. You may be able to find a suitable JavaScript implementation; in any case, you can run the scraper on the server side.
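Where the native implementation is present, driving it from JavaScript could look roughly like the sketch below. The helper names hasNativeXslt and transformToText are made up for illustration, and the sketch assumes the input is well-formed XML (as the example document above is) and that a global document exists to own the result.

```javascript
// Hypothetical feature check: is there a native XSLT processor in `env`?
function hasNativeXslt(env = globalThis) {
  return typeof env.XSLTProcessor === "function" &&
         typeof env.DOMParser === "function";
}

// Hypothetical helper: apply an XSL stylesheet to an XML source string
// and return the resulting text. Browser-only.
function transformToText(xmlSource, xslSource) {
  if (!hasNativeXslt()) {
    throw new Error("No native XSLT support in this runtime");
  }
  const parser = new DOMParser();
  const input = parser.parseFromString(xmlSource, "application/xml");
  const stylesheet = parser.parseFromString(xslSource, "application/xml");

  const processor = new XSLTProcessor();
  processor.importStylesheet(stylesheet);
  // transformToFragment needs an owner document for the nodes it creates.
  const fragment = processor.transformToFragment(input, document);
  // Some implementations return null when the transformation fails.
  return fragment ? fragment.textContent : null;
}
```

Calling hasNativeXslt() first lets you fall back to a JavaScript implementation or to a server-side processor when the class is missing.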
Whether to encapsulate the XSLT language behind some simpler query language and user interface is something you will need to decide. If you’re going to give your users the kind of possibilities you say you want them to have, they need to express their queries somehow, whether through a WYSIWYG form or with text.
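As a very small illustration of such an encapsulation, the sketch below turns a list of element names, which could come from a form field, into a stylesheet of the same shape as the one shown earlier. The function name buildStripStylesheet is hypothetical, and restricting input to plain element names is my assumption, made so that user input cannot inject arbitrary markup into the stylesheet.

```javascript
// Hypothetical translator: element names to strip -> XSL stylesheet text.
function buildStripStylesheet(elementNames) {
  // Accept only simple names such as "a" or "nav"; reject anything else.
  const NAME = /^[A-Za-z_][A-Za-z0-9_.-]*$/;
  const templates = elementNames.map((name) => {
    if (!NAME.test(name)) {
      throw new Error(`Not a plain element name: ${name}`);
    }
    // An empty template produces no output for the matched element.
    return `  <template match="${name}" />`;
  });
  return [
    '<?xml version="1.0" encoding="utf-8" ?>',
    '<stylesheet version="1.0" xmlns="http://www.w3.org/1999/XSL/Transform">',
    '  <output method="text" />',
    ...templates,
    '</stylesheet>',
  ].join("\n");
}
```

For example, buildStripStylesheet(["a"]) produces a stylesheet equivalent to the one above, and the resulting string can be handed to whichever XSLT processor you end up using.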