基本使用

AJAX动态网页数据抓取、Selenium使用

什么是AJAX

AJAX（Asynchronouse JavaScript And XML）异步JavaScript和XML。过在后台与服务器进行少量数据交换，Ajax 可以使网页实现异步更新。这意味着可以在不重新加载整个网页的情况下，对网页的某部分进行更新。传统的网页（不使用Ajax）如果需要更新内容，必须重载整个网页页面。因为传统的在传输数据格式方面，使用的是XML语法。因此叫做AJAX，其实现在数据交互基本上都是使用JSON。使用AJAX加载的数据，即使使用了JS，将数据渲染到了浏览器中，在右键->查看网页源代码还是不能看到通过ajax加载的数据，只能看到使用这个url加载的html代码。

获取ajax数据的方式

直接分析ajax调用的接口。然后通过代码请求这个接口。
使用Selenium+chromedriver模拟浏览器行为获取数据。

接口分析

豆瓣电影爬取

https://movie.douban.com/typerank?type_name=%E5%96%9C%E5%89%A7&type=24&
interval_id=100:90&action=

打开页面，电影页面没有分页显示按钮，当鼠标向下滚动时，会加载新的电影，

查看网页，查看Request URL:分析

https://movie.douban.com/j/chart/top_list?type=24&
interval_id=100%3A90&action=&start=0&limit=20

URL中的 start 参数是动态变化的

测试，能获取‘一页’的电影

import requests
url = "https://movie.douban.com/j/chart/top_list?type=24&
        interval_id=100%3A90&action=&start=20&limit=20"
res = requests.get(url)
print(res.text)

测试，获取所有电影

import requests
base_url = "https://movie.douban.com/j/chart/top_list?type=24&
                interval_id=100%3A90&action=&start={}&limit=20"
#可以适当的增加 limit= 的值 提高效率

i = 0
while True:
    print(i)
    url=base_url.format(i*20)
    res=requests.get(url)

    info = res.text
    print(len(info))
    if len(info) == 0:
        break

    i += 1

Selenium+chromedriver获取动态数据

Selenium 相当于是一个机器人。可以模拟人类在浏览器上的一些行为，自动处理浏览器上的一些行为，比如点击，填充数据，删除cookie等。chromedriver是一个驱动Chrome浏览器的驱动程序，使用他才可以驱动浏览器。当然针对不同的浏览器有不同的driver。以下列出了不同浏览器及其对应的driver：

安装Selenium和chromedriver

安装Selenium：Selenium有很多语言的版本，有java、ruby、python等。我们下载python版本的就可以了

pip install selenium

安装chromedriver：下载完成后，放到不需要权限的纯英文目录下就可以了。

快速入门

现在以一个简单的获取百度首页的例子来讲下Selenium和chromedriver如何快速入门：

from selenium import webdriver

# chromedriver的绝对路径
driver_path = r'D:\ProgramApp\chromedriver\chromedriver.exe'

# 初始化一个driver，并且指定chromedriver的路径
driver = webdriver.Chrome(executable_path=driver_path)
# 或者
driver = webdriver.Chrome()

# 请求网页
driver.get("https://www.baidu.com/")

# 通过page_source获取网页代码（浏览器渲染完后的代码）
print(driver.page_source)

selenium常用操作

更多教程请参考

定位元素方法

Selenium提供了一下方法来定义一个页面中的元素：

find_element_by_id

通过ID查找元素

当你知道一个元素的 id 时，你可以使用本方法。在该策略下，页面中第一个该 id 元素会被匹配并返回。如果找不到任何元素，会抛出 NoSuchElementException 异常。

作为示例，页面元素如下所示:
```
<html>
 <body>
  <form id="loginForm">
   <input name="username" type="text" />
   <input name="password" type="password" />
   <input name="continue" type="submit" value="Login" />
  </form>
 </body>
<html>
```
可以这样查找表单(form)元素:
```
login_form = driver.find_element_by_id('loginForm')
```

find_element_by_name
通过Name查找元素

当你知道一个元素的 name 时，你可以使用本方法。在该策略下，页面中第一个该 name 元素会被匹配并返回。如果找不到任何元素，会抛出 NoSuchElementException 异常。

作为示例，页面元素如下所示:

<html>
 <body>
  <form id="loginForm">
   <input name="username" type="text" />
   <input name="password" type="password" />
   <input name="continue" type="submit" value="Login" />
   <input name="continue" type="button" value="Clear" />
  </form>
</body>
<html>

name属性为 username & password 的元素可以像下面这样查找:

username = driver.find_element_by_name('username')
password = driver.find_element_by_name('password')

这会得到 “Login” 按钮，因为他在 “Clear” 按钮之前:

continue = driver.find_element_by_name('continue')

find_element_by_xpath

通过XPath查找元素

XPath是XML文档中查找结点的语法。因为HTML文档也可以被转换成XML(XHTML)文档， Selenium的用户可以利用这种强大的语言在web应用中查找元素。 XPath扩展了（当然也支持）这种通过id或name属性获取元素的简单方式，同时也开辟了各种新的可能性，例如获取页面上的第三个复选框。

使用XPath的主要原因之一就是当你想获取一个既没有id属性也没有name属性的元素时，你可以通过XPath使用元素的绝对位置来获取他（这是不推荐的），或相对于有一个id或name属性的元素（理论上的父元素）的来获取你想要的元素。XPath定位器也可以通过非id和name属性查找元素。

绝对的XPath是所有元素都从根元素的位置（HTML）开始定位，只要应用中有轻微的调整，会就导致你的定位失败。但是通过就近的包含id或者name属性的元素出发定位你的元素，这样相对关系就很靠谱，因为这种位置关系很少改变，所以可以使你的测试更加强大。

作为示例，页面元素如下所示:
```
<html>
 <body>
  <form id="loginForm">
   <input name="username" type="text" />
   <input name="password" type="password" />
   <input name="continue" type="submit" value="Login" />
   <input name="continue" type="button" value="Clear" />
  </form>
</body>
<html>
```
可以这样查找表单(form)元素:
```
login_form = driver.find_element_by_xpath("/html/body/form[1]")
login_form = driver.find_element_by_xpath("//form[1]")
login_form = driver.find_element_by_xpath("//form[@id='loginForm']")
```
1. 绝对定位 (页面结构轻微调整就会被破坏)
2. HTML页面中的第一个form元素
3. 包含 id 属性并且其值为 loginForm 的form元素
username元素可以如下获取:
```
username = driver.find_element_by_xpath("//form[input/@name='username']")
username = driver.find_element_by_xpath("//form[@id='loginForm']/input[1]")
username = driver.find_element_by_xpath("//input[@name='username']")
```
“Clear” 按钮可以如下获取:
```
clear_button = driver.find_element_by_xpath("//input[@name='continue'][@type='button']")
clear_button = driver.find_element_by_xpath("//form[@id='loginForm']/input[4]")
```
这些实例都是一些举出用法, 为了学习更多有用的东西，下面这些参考资料推荐给你:
- W3Schools XPath Tutorial
- W3C XPath Recommendation
- XPath Tutorial - with interactive examples.
还有一些非常有用的插件，可以协助发现元素的XPath:
- XPath Checker - suggests XPath and can be used to test XPath results.
- Firebug - XPath suggestions are just one of the many powerful features of this very useful add-on.
- XPath Helper - for Google Chrome

find_element_by_link_text

通过链接文本获取超链接

当你知道在一个锚标签中使用的链接文本时使用这个。在该策略下，页面中第一个匹配链接内容锚标签会被匹配并返回。如果找不到任何元素，会抛出 NoSuchElementException 异常。

作为示例，页面元素如下所示:
```
<html>
 <body>
  <p>Are you sure you want to do this?</p>
  <a href="continue.html">Continue</a>
  <a href="cancel.html">Cancel</a>
</body>
<html>
```
continue.html 超链接可以被这样查找到:
```
continue_link = driver.find_element_by_link_text('Continue')
continue_link = driver.find_element_by_partial_link_text('Conti')
```

find_element_by_partial_link_text
find_element_by_tag_name

通过标签名查找元素

当你向通过标签名查找元素时使用这个。在该策略下，页面中第一个匹配该标签名的元素会被匹配并返回。如果找不到任何元素，会抛出 NoSuchElementException 异常。

作为示例，页面元素如下所示:
```
<html>
 <body>
  <h1>Welcome</h1>
  <p>Site content goes here.</p>
</body>
<html>
```
h1 元素可以如下查找:
```
heading1 = driver.find_element_by_tag_name('h1')
```

find_element_by_class_name

当你向通过class name查找元素时使用这个。在该策略下，页面中第一个匹配该class属性的元素会被匹配并返回。如果找不到任何元素，会抛出 NoSuchElementException 异常。

作为示例，页面元素如下所示:
```
<html>
 <body>
  <p class="content">Site content goes here.</p>
</body>
<html>
```
p 元素可以如下查找:
```
content = driver.find_element_by_class_name('content')
```

find_element_by_css_selector

当你向通过CSS选择器查找元素时使用这个。在该策略下，页面中第一个匹配该CSS 选择器的元素会被匹配并返回。如果找不到任何元素，会抛出 NoSuchElementException 异常。

作为示例，页面元素如下所示:
```
<html>
 <body>
  <p class="content">Site content goes here.</p>
</body>
<html>
```
p 元素可以如下查找:
```
content = driver.find_element_by_css_selector('p.content')
```

下面是查找多个元素（这些方法将返回一个列表）

find_elements_by_name
find_elements_by_xpath
find_elements_by_link_text
find_elements_by_partial_link_text
find_elements_by_tag_name
find_elements_by_class_name
find_elements_by_css_selector

常用方法是通过xpath相对路径进行定位, 例如

<html>
 <body>
  <form id="loginForm">
   <input name="username" type="text" />
   <input name="password" type="password" />
   <input name="continue" type="submit" value="Login" />
   <input name="continue" type="button" value="Clear" />
  </form>
</body>
<html>

定位username元素的方法如下：

username = driver.find_element_by_xpath("//form[input/@name='username']")
username = driver.find_element_by_xpath("//form[@id='loginForm']/input[1]")
username = driver.find_element_by_xpath("//input[@name='username']")

[1] 第一个form元素通过一个input子元素，name属性和值为username实现 [2] 通过id=loginForm值的form元素找到第一个input子元素 [3] 属性名为name且值为username的第一个input元素

操作元素方法

通常所有的操作与页面交互都将通过WebElement接口，常见的操作元素方法如下：

clear 清除元素的内容
send_keys 模拟按键输入
click 点击元素
submit 提交表单

举例自动访问FireFox浏览器自动登录163邮箱。

from selenium import webdriver  
from selenium.webdriver.common.keys import Keys  
import time

# Login 163 email
driver = webdriver.Firefox()  
driver.get("http://mail.163.com/")

elem_user = driver.find_element_by_name("username")
elem_user.clear()
elem_user.send_keys("15201615157")  
elem_pwd = driver.find_element_by_name("password")
elem_pwd.clear()
elem_pwd.send_keys("******")  
elem_pwd.send_keys(Keys.RETURN)
#driver.find_element_by_id("loginBtn").click()
#driver.find_element_by_id("loginBtn").submit()
time.sleep(5)  
assert "baidu" in driver.title  
driver.close()  
driver.quit()

首先通过name定位用户名和密码，再调用方法clear()清除输入框默认内容，如“请输入密码”等提示，通过send_keys("**")输入正确的用户名和密码，最后通过click()点击登录按钮或send_keys(Keys.RETURN)相当于回车登录，submit()提交表单。 PS：如果需要输入中文，防止编码错误使用send_keys(u"中文用户名")。

WebElement接口获取值

通过WebElement接口可以获取常用的值，这些值同样非常重要。

size 获取元素的尺寸
text 获取元素的文本
get_attribute(name) 获取属性值
location 获取元素坐标，先找到要获取的元素，再调用该方法
page_source 返回页面源码
driver.title 返回页面标题
current_url 获取当前页面的URL
is_displayed() 设置该元素是否可见
is_enabled() 判断元素是否被使用
is_selected() 判断元素是否被选中
tag_name 返回元素的tagName

from selenium import webdriver  
from selenium.webdriver.common.keys import Keys  
import time

driver = webdriver.PhantomJS(executable_path="G:\phantomjs-1.9.1-windows\phantomjs.exe")   
driver.get("http://www.baidu.com/")

size = driver.find_element_by_name("wd").size
print size
#尺寸: {'width': 500, 'height': 22}

news = driver.find_element_by_xpath("//div[@id='u1']/a[1]").text
print news
#文本: 新闻

href = driver.find_element_by_xpath("//div[@id='u1']/a[2]").get_attribute('href')
name = driver.find_element_by_xpath("//div[@id='u1']/a[2]").get_attribute('name')
print href,name
#属性值: http://www.hao123.com/ tj_trhao123

location = driver.find_element_by_xpath("//div[@id='u1']/a[3]").location
print location
#坐标: {'y': 19, 'x': 498}

print driver.current_url
#当前链接: https://www.baidu.com/
print driver.title
#标题: 百度一下， 你就知道

result = location = driver.find_element_by_id("su").is_displayed()
print result
#是否可见: True

鼠标操作

在现实的自动化测试中关于鼠标的操作不仅仅是click()单击操作，还有很多包含在ActionChains类中的操作。如下：

context_click(elem) 右击鼠标点击元素elem，另存为等行为
double_click(elem) 双击鼠标点击元素elem，地图web可实现放大功能
drag_and_drop(source,target) 拖动鼠标，源元素按下左键移动至目标元素释放
move_to_element(elem) 鼠标移动到一个元素上
click_and_hold(elem) 按下鼠标左键在一个元素上
perform() 在通过调用该函数执行ActionChains中存储行为

关闭浏览器

driver.close()：关闭当前页面。
driver.quit()：退出整个浏览器。

方法	区别
close	关闭当前的浏览器窗口
quit	不仅关闭窗口，还会彻底的退出webdriver，释放与driver server之间的连接

页面等待

现在的网页越来越多采用了 Ajax 技术，这样程序便不能确定何时某个元素完全加载出来了。如果实际页面等待时间过长导致某个dom元素还没出来，但是你的代码直接使用了这个WebElement，那么就会抛出NullPointer的异常。为了解决这个问题。所以 Selenium 提供了两种等待方式：一种是隐式等待、一种是显式等待。

隐式等待：调用driver.implicitly_wait。那么在获取不可用的元素之前，会先等待10秒中的时间。示例代码如下：

driver = webdriver.Chrome(executable_path=driver_path)
driver.implicitly_wait(10)
# 请求网页
driver.get("https://www.douban.com/")

selenium_通过selenium控制浏览器滚动条

目的：通过selenium控制浏览器滚动条原理：通过 driver.execute_script()执行js代码，达到目的

 driver.execute_script("window.scrollBy(0,1000)")

语法：scrollBy(x,y)

参数    描述
x        必需。向右滚动的像素值。
y        必需。向下滚动的像素值。

或者使用

js="var q=document.documentElement.scrollTop=10000"
driver.execute_script(js)

例如在京东的某个页面，当鼠标向下滚动的时候才会把整个页面的商品加载出来，利用 driver.execute_script() 可以设置某个阈值将页面一次滑倒最底端

from selenium import webdriver
from lxml import etree
from time import sleep

url = 'https://search.jd.com/Search?keyword=mac&enc=utf-8&wq=mac&pvid=9862d03c24e741c6a58079d004f5aabf'

chrome = webdriver.Chrome()
chrome.get(url)

js = 'document.documentElement.scrollTop=100000'
chrome.execute_script(js)
sleep(3)
html = chrome.page_source
e = etree.HTML(html)
prices = e.xpath('//div[@class="gl-i-wrap"]/div[@class="p-price"]/strong/i/text()')
names = e.xpath('//div[@class="gl-i-wrap"]/div[@class="p-name p-name-type-2"]/a/em')

print(len(names))
for name, price in zip(names, prices):
    print(name.xpath('string(.)'), ":", price)
chrome.quit()

selenium操作无界面chrome浏览器

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
req_url = "https://www.baidu.com"

chrome_options=Options()
#设置chrome浏览器无界面模式
chrome_options.add_argument('--headless')
browser = webdriver.Chrome(chrome_options=chrome_options)

# 开始请求
browser.get(req_url)
#打印页面源代码
print(browser.page_source)
#关闭浏览器
browser.close()
#关闭chreomedriver进程
browser.quit()

selenium爬取局部动态刷新网站（URL始终固定）

测试，虎牙直播，当进入某个直播分类的时候，点击不同的分页，URL不发生变化，可以通过selenium的click()事件，实现翻页的情况，

from selenium import webdriver
from lxml import etree
import time
import  requests

driver_path=r'D:\chromedriver_win32\chromedriver.exe'
driver = webdriver.Chrome(executable_path=driver_path)
content=[]
url = 'https://www.huya.com/g/wzry'
driver.get(url)
driver.implicitly_wait(1)
m=1

def page():
    global  m
    global  content
    while m<3:
        m+=1
        source=driver.page_source
        mm(source)
        next_btn = driver.find_element_by_xpath("//a[@class='laypage_next']")
        print('over'*50)
        if next_btn:
            next_btn.click()
            time.sleep(5)
        else:
            break
def mm(source):
    html = etree.HTML(source)
    links = html.xpath('//ul[@class="live-list clearfix"]//a[2]/@href')
    for i in links:
        # print(i)
        request_detail_page(i)
        # break


def request_detail_page(u):
    # pass
    global  content
    # driver.get(u)
    # sour=driver.page_source
    sour=requests.get(u).text
    html = etree.HTML(sour)
    title=html.xpath('//div[@class="host-info"]//h1[@id="J_roomTitle"]/text()')[0]
    name=html.xpath('//div[@class="host-info"]//h3[@class="host-name"]/text()')[0]


    content.append({'name':name,'title':title,'url':url})
    print(name,title,u)



if __name__=="__main__":
    page()

    with open('huya.txt','w') as fp:
        for i , n in enumerate(content):
            fp.write(str(i)+n['name']+'\t'+n['title']+'\t'+n['url']+'\n')

基本使用

基本使用

什么是AJAX

获取ajax数据的方式

接口分析

Selenium+chromedriver获取动态数据

安装Selenium和chromedriver

快速入门

定位元素方法

操作元素方法

WebElement接口获取值

鼠标操作

关闭浏览器

页面等待

selenium_通过selenium控制浏览器滚动条

selenium操作无界面chrome浏览器

selenium爬取局部动态刷新网站（URL始终固定）

参考

results matching ""

No results matching ""