I used Selenium and PhantomJS hoping to scrape data from a website that uses JavaScript to build the DOM.
The simple code below works, but not reliably: most of the time it returns an empty page where the JavaScript was never executed, and only occasionally does it return the info I want.
from selenium import webdriver
from bs4 import BeautifulSoup

url = 'http://mall.pchome.com.tw/prod/QAAO6V-A9006XI59'
driver = webdriver.PhantomJS
driver.get(url)
print(driver.page_source, file=open('output.html', 'w'))

soup = BeautifulSoup(driver.page_source, "html5lib")
print(soup.select('#MetaDescription'))
Most of the time it returns an empty description:
[<meta content="" id="MetaDescription" name="description"/>]
Is the website's server blocking web crawlers? How can I fix my code?
What's more, all the info I need can be found in the <head>'s <meta> tag (as shown above, the data has the id MetaDescription). Is there a simpler way to get just the data in the <head> tag?
Answer
First of all, driver = webdriver.PhantomJS
is not the correct way to initialize a Selenium webdriver in Python; replace it with:
driver = webdriver.PhantomJS()
The symptoms you are describing are typical of a timing issue: the page source is read before the JavaScript has finished rendering. Add an explicit wait for the desired element to be present before getting the page source:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait

driver = webdriver.PhantomJS()
driver.get(url)

# waiting for presence of an element
wait = WebDriverWait(driver, 10)
wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "#MetaDescription")))

print(driver.page_source, file=open('output.html', 'w'))
driver.close()

# further HTML parsing here
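The "further HTML parsing" step can then work exactly as in your question. A minimal sketch, using a small inline HTML snippet in place of the real page source (the stdlib html.parser is used here; html5lib works the same way):

```python
from bs4 import BeautifulSoup

# In the real script this would be driver.page_source; a small
# snippet with the same structure stands in for it here.
page_source = (
    '<html><head>'
    '<meta id="MetaDescription" name="description" content="Example product description"/>'
    '</head><body></body></html>'
)

soup = BeautifulSoup(page_source, "html.parser")
meta = soup.select_one("#MetaDescription")
print(meta["content"])  # -> Example product description
```

Since the element you want has an id, select_one() returns it directly instead of the one-element list that select() gives you.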
You may also need to ignore SSL errors and set the SSL protocol to any. In some cases, pretending not to be PhantomJS helps as well.
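A sketch of both tweaks, assuming an older Selenium version that still ships PhantomJS support (the User-Agent string below is just an example value, and the PhantomJS binary must be on your PATH):

```python
from selenium import webdriver

# PhantomJS command-line options: ignore SSL errors, accept any SSL protocol
service_args = ['--ignore-ssl-errors=true', '--ssl-protocol=any']

# Override the default PhantomJS User-Agent so the site sees a regular
# browser instead of PhantomJS (example UA string, not a required value)
capabilities = dict(webdriver.DesiredCapabilities.PHANTOMJS)
capabilities['phantomjs.page.settings.userAgent'] = (
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/60.0.3112.90 Safari/537.36'
)

driver = webdriver.PhantomJS(service_args=service_args,
                             desired_capabilities=capabilities)
```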