
Python crawler to get DOM info using Selenium and PhantomJS

I am using Selenium and PhantomJS to get data from a website that uses JavaScript to build the DOM.

The simple code below works, but not reliably. Most of the time it returns an empty page on which the JavaScript has not executed; only occasionally does it return the information I want.


It very often returns an empty string:


Is the web server blocking crawlers? What can I do to fix my code?

What's more, all the info I need can be found in a <meta> tag inside <head>. (As shown above, the element has the id MetaDescription.)

Or is there a simpler way to get just the data in the <head> tag?


Answer

First of all, driver = webdriver.PhantomJS is not the correct way to initialize a Selenium webdriver in Python: it assigns the class itself instead of creating an instance. Replace it with:

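A minimal sketch of the corrected initialization, assuming a Selenium release that still ships the PhantomJS driver (it was deprecated in later 3.x releases and removed in 4.x) and a phantomjs binary on the PATH. The helper name make_driver is only illustrative:

```python
def make_driver():
    """Create a PhantomJS-backed webdriver instance (illustrative helper)."""
    # Imported lazily so this module can be loaded without Selenium installed.
    from selenium import webdriver

    # webdriver.PhantomJS is the class; the parentheses call its constructor,
    # which starts the phantomjs process and returns a usable driver.
    return webdriver.PhantomJS()
```

Without the parentheses, driver refers to the PhantomJS class rather than to a browser session, so calls such as driver.get(url) fail.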

The symptoms you describe are typical of a timing issue: the page source is read before the JavaScript has finished building the DOM. Add an explicit wait for the desired element(s) to be present before trying to get the page source:

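One way to write such a wait is with Selenium's WebDriverWait and expected_conditions. This is a sketch rather than the only approach: the function name fetch_page_source and the 10-second default timeout are assumptions, and the element id MetaDescription is taken from the question.

```python
def fetch_page_source(url, timeout=10):
    """Load url, wait for the meta element, then return the page source."""
    # Imported lazily so this module can be loaded without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.PhantomJS()
    try:
        driver.get(url)
        # Block until the element exists in the DOM, or raise
        # TimeoutException after `timeout` seconds.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.ID, "MetaDescription"))
        )
        return driver.page_source
    finally:
        # Always shut down the phantomjs process, even on timeout.
        driver.quit()
```

If the element never appears, the wait raises TimeoutException instead of silently returning an empty page, which makes the failure easier to diagnose.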

You may also need to ignore SSL errors and set the SSL protocol to any. In some cases, pretending not to be PhantomJS (for example, by sending a different user agent) helps as well.
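These options might be passed as in the sketch below, assuming the Selenium 3.x PhantomJS constructor, which accepts service_args and desired_capabilities. The two command-line flags are PhantomJS options; the helper names with_user_agent and make_driver are hypothetical, and the user agent string is whatever you choose to supply.

```python
# PhantomJS command-line flags: ignore SSL errors, accept any SSL protocol.
SERVICE_ARGS = ["--ignore-ssl-errors=true", "--ssl-protocol=any"]


def with_user_agent(base_capabilities, user_agent):
    """Return a copy of the capabilities with a custom user agent set."""
    caps = dict(base_capabilities)  # copy so the shared defaults stay untouched
    caps["phantomjs.page.settings.userAgent"] = user_agent
    return caps


def make_driver(user_agent):
    """Create a PhantomJS driver with relaxed SSL settings (illustrative)."""
    # Imported lazily so this module can be loaded without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

    caps = with_user_agent(DesiredCapabilities.PHANTOMJS, user_agent)
    return webdriver.PhantomJS(
        service_args=SERVICE_ARGS,
        desired_capabilities=caps,
    )
```

Calling make_driver with the user agent string of a regular desktop browser makes requests look less like they come from PhantomJS; whether that changes what the site serves depends on the site.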

User contributions licensed under: CC BY-SA