I used Selenium with PhantomJS hoping to scrape data from a website that uses JavaScript to build the DOM.
The simple code below works, but not reliably: most of the time it returns an empty page on which the JavaScript has not been executed, and only occasionally does it return the correct info I want.
from selenium import webdriver
from bs4 import BeautifulSoup
url = 'http://mall.pchome.com.tw/prod/QAAO6V-A9006XI59'
driver = webdriver.PhantomJS
driver.get(url)
print(driver.page_source, file=open('output.html','w'))
soup = BeautifulSoup(driver.page_source,"html5lib")
print(soup.select('#MetaDescription'))
Most of the time the result is an empty string:
[<meta content="" id="MetaDescription" name="description"/>]
Is the website server not allowing web crawlers? What can I do to fix my code?
What's more, all the info I need can be found in the <head>'s <meta> tag. (As shown above, the element has the id MetaDescription.)
Or is there a simpler way to get just the data in the <head> tag?
Answer
First of all, driver = webdriver.PhantomJS
is not the correct way to initialize a Selenium webdriver in Python (it assigns the class itself instead of creating an instance); replace it with:
driver = webdriver.PhantomJS()
The symptoms you are describing are typical of a timing issue: the page source is read before the JavaScript has finished building the DOM. Add an explicit wait for the desired element to be present before reading the page source:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait
url = 'http://mall.pchome.com.tw/prod/QAAO6V-A9006XI59'
driver = webdriver.PhantomJS()
driver.get(url)
# waiting for presence of an element
wait = WebDriverWait(driver, 10)
wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "#MetaDescription")))
print(driver.page_source, file=open('output.html','w'))
driver.close()
# further HTML parsing here
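For the "further HTML parsing" step, the meta description can be pulled out with BeautifulSoup as in the question, or even with the standard library alone. A minimal sketch of the stdlib approach (the sample HTML string below is invented for illustration, standing in for driver.page_source):

```python
from html.parser import HTMLParser

class MetaDescriptionParser(HTMLParser):
    """Collects the content attribute of the tag with id="MetaDescription"."""
    def __init__(self):
        super().__init__()
        self.description = None

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            attrs = dict(attrs)
            if attrs.get("id") == "MetaDescription":
                self.description = attrs.get("content")

# Invented sample standing in for driver.page_source after rendering.
page_source = ('<html><head><meta id="MetaDescription" name="description" '
               'content="some product info"/></head><body></body></html>')

parser = MetaDescriptionParser()
parser.feed(page_source)
print(parser.description)  # -> some product info
```

This avoids a third-party dependency when all you need is one attribute from the <head>.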
You may also need to ignore SSL errors and set the SSL protocol to any
. In some cases, pretending not to be PhantomJS (by overriding the User-Agent string) helps as well.
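A sketch of how those options could be wired up. The --ignore-ssl-errors and --ssl-protocol flags are PhantomJS command-line options passed via service_args, and the phantomjs.page.settings.userAgent capability overrides the reported User-Agent; the UA string here is an arbitrary desktop-Chrome example, and the final (commented) line assumes the PhantomJS binary is installed:

```python
# Flags passed to the PhantomJS binary itself.
service_args = [
    '--ignore-ssl-errors=true',  # don't abort on certificate problems
    '--ssl-protocol=any',        # negotiate any SSL/TLS version
]

# Spoof a regular desktop browser UA string (any mainstream UA works).
capabilities = {
    'phantomjs.page.settings.userAgent': (
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
        '(KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36'
    ),
}

# With the PhantomJS binary installed, these would be passed as:
# driver = webdriver.PhantomJS(service_args=service_args,
#                              desired_capabilities=capabilities)
```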