Python crawler to get DOM info by using Selenium and PhantomJS

I am using Selenium and PhantomJS hoping to get data from a website that uses JavaScript to build the DOM.

The simple code below works, but it isn't reliable. Most of the time it returns an empty page on which the JavaScript has not been executed; only occasionally does it get the correct info I want.

from selenium import webdriver
from bs4 import BeautifulSoup

url = 'http://mall.pchome.com.tw/prod/QAAO6V-A9006XI59'
driver = webdriver.PhantomJS
driver.get(url)

print(driver.page_source, file=open('output.html','w'))

soup = BeautifulSoup(driver.page_source,"html5lib")
print(soup.select('#MetaDescription'))

There is a high probability that it returns an empty description:

[<meta content="" id="MetaDescription" name="description"/>]

Is the website's server blocking web crawlers? What can I do to fix my code?

What's more, all the info I need can be found in the <meta> tag inside <head>. (As shown above, the element has the id MetaDescription.)

Or is there a simpler way to get just the data in the <head> tag?


Answer

First of all, driver = webdriver.PhantomJS is not a correct way to initialize a Selenium webdriver in Python: it assigns the class itself rather than creating an instance. Replace it with:

driver = webdriver.PhantomJS()

The symptoms you are describing are typical of a timing issue: the page source is read before the JavaScript has finished building the DOM. Add an explicit wait for the desired element(s) to be present before getting the page source:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait

url = 'http://mall.pchome.com.tw/prod/QAAO6V-A9006XI59'
driver = webdriver.PhantomJS()
driver.get(url)

# waiting for presence of an element
wait = WebDriverWait(driver, 10)
wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "#MetaDescription")))

print(driver.page_source, file=open('output.html','w'))

driver.close()

# further HTML parsing here
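
For the "further HTML parsing" part, a minimal sketch of extracting the description, reusing BeautifulSoup with html5lib as in the question (this assumes the page source was saved to output.html by the code above):

from bs4 import BeautifulSoup

# parse the saved page source and read the meta description's content attribute
with open('output.html') as f:
    soup = BeautifulSoup(f, "html5lib")

meta = soup.select_one('#MetaDescription')
if meta is not None:
    print(meta.get('content'))  # non-empty once the JavaScript has rendered the page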

You may also need to tell PhantomJS to ignore SSL errors and to set the SSL protocol to "any". In some cases, pretending not to be PhantomJS (by overriding its default user agent) helps as well.
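
For example, PhantomJS accepts these switches through service_args, and the user agent can be overridden via desired capabilities. A rough sketch (the exact user-agent string below is only an illustration, not something this particular site is known to require):

from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

# pretend to be a regular desktop browser instead of PhantomJS
capabilities = dict(DesiredCapabilities.PHANTOMJS)
capabilities["phantomjs.page.settings.userAgent"] = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/53.0 Safari/537.36"
)

# ignore SSL errors and accept any SSL protocol
driver = webdriver.PhantomJS(
    desired_capabilities=capabilities,
    service_args=["--ignore-ssl-errors=true", "--ssl-protocol=any"],
)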
