Sina Sports: Scraping Basketball and Football Live Feeds and Match Reports
Introduction to the Packages Used
This scraper for basketball and football live feeds and match reports from the Sina Sports live hall is written against Python 3.6.5; the complete list of standard-library and third-party packages it imports appears in the full code at the end. Below I pick out the few packages most closely tied to the crawling itself.
1. lxml
Python's standard library ships with an xml module, but its performance is mediocre and it lacks some user-friendly APIs. The third-party lxml library, by contrast, is implemented in Cython and adds many practical features, which makes it a sharp tool for processing scraped pages. Most of lxml's functionality lives in lxml.etree.
XML is a tree structure, and lxml represents nodes and trees with etree._Element and etree._ElementTree respectively. etree._Element is a cleverly designed type: treat it as an object to read the node's own text, treat it as a list whose items are its child nodes, or treat it as a dictionary to iterate over its attributes.
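A minimal sketch of those three access patterns, using a made-up HTML fragment:

```python
from lxml import etree

root = etree.HTML('<div id="box"><p>first</p><p>second</p></div>')
div = root.find('.//div')            # an etree._Element
print(div.get('id'))                 # dict-like: attribute lookup -> 'box'
print(div.keys(), div.items())       # iterate its attributes like a mapping
for child in div:                    # list-like: items are the child nodes
    print(child.tag, child.text)     # p first / p second
```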
2. BeautifulSoup
The official description:
Beautiful Soup provides simple, Pythonic functions for navigating, searching, and modifying a parse tree. It is a toolkit that parses a document and hands you the data you want to scrape; because it is simple, a complete application takes very little code.
Beautiful Soup automatically converts input documents to Unicode and output documents to UTF-8, so you never have to think about encodings, unless the document does not declare one and Beautiful Soup cannot detect it, in which case you only need to state the original encoding.
Beautiful Soup turns a complex HTML document into a tree of Python objects, all of which fall into four kinds:
- Tag
- NavigableString
- BeautifulSoup
- Comment
Tag: an individual HTML tag
```html
<title>The Dormouse's story</title>
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
```

An HTML tag together with everything inside it is a Tag. A Tag has two important attributes: name and attrs.
Fetching a single attribute:

```python
print(soup.p['class'])   # ['title']
```

NavigableString: the text inside a tag

```python
print(soup.p.get_text())
```

BeautifulSoup: represents the entire content of a document
Most of the time it can be treated as a Tag object: it is a special Tag, and we can likewise read its type, name, and attributes.
Comment: a special kind of NavigableString whose printed output omits the comment markers.
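A short sketch exercising all four object types on a tiny made-up document:

```python
from bs4 import BeautifulSoup, Comment

soup = BeautifulSoup('<p class="title"><!--a note--><b>The Dormouse</b></p>', 'lxml')
print(type(soup).__name__)            # BeautifulSoup: the whole document
tag = soup.p                          # a Tag
print(tag.name, tag.attrs)            # p {'class': ['title']}
print(type(soup.b.string).__name__)   # NavigableString
comment = tag.contents[0]             # the <!--a note--> node
print(isinstance(comment, Comment), comment)   # True, printed without <!-- -->
```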
3. Selenium
An automated-testing tool. It supports all the mainstream desktop browsers, including Chrome, Safari, and Firefox; with the matching browser driver installed you can conveniently automate tests against the web UI. In other words, Selenium drives these browsers through their drivers.
Declaring a browser object:

```python
chromedriver = r'C:\Program Files (x86)\Google\Chrome\Application\chromedriver.exe'
driver = webdriver.Chrome(chromedriver)
```

When automating with Selenium you sometimes need to simulate mouse operations: single click, double click, right click, drag and drop, and so on. Selenium provides a class for exactly these events, ActionChains (a usage sketch follows the method list below).
While writing the code I referred to this blog post:
https://blog.csdn.net/huilan_same/article/details/52305176
(selenium之 玩轉鼠標鍵盤操作(ActionChains))
click(on_element=None): click the left mouse button
click_and_hold(on_element=None): click the left button and hold it down
context_click(on_element=None): click the right mouse button
double_click(on_element=None): double-click the left button
drag_and_drop(source, target): drag to an element and release
drag_and_drop_by_offset(source, xoffset, yoffset): drag to an offset and release
key_down(value, element=None): press a keyboard key
key_up(value, element=None): release a key
move_by_offset(xoffset, yoffset): move the mouse from its current position by an offset
move_to_element(to_element): move the mouse onto an element
move_to_element_with_offset(to_element, xoffset, yoffset): move to a position offset from an element's top-left corner
perform(): execute every action queued in the chain
release(on_element=None): release the left button over an element
send_keys(*keys_to_send): send keys to the element that currently has focus
send_keys_to_element(element, *keys_to_send): send keys to the given element
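A quick usage sketch (the locators here are placeholders, not elements that actually exist on the page); note that actions only accumulate in the chain, and nothing runs until perform():

```python
from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains

driver = webdriver.Chrome()
driver.get('http://match.sports.sina.com.cn/')
menu = driver.find_element_by_class_name('main_data')   # placeholder locator
target = driver.find_element_by_tag_name('a')           # placeholder locator
ActionChains(driver).move_to_element(menu).click(target).perform()
driver.quit()
```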
Selenium also makes locating elements very convenient; see this post:
https://blog.csdn.net/jojoy_tester/article/details/53453888
(selenium WebDriver定位元素學習總結)
Selenium offers eight element-location strategies:
id
name
className
tagName
linkText
partialLinkText
xpath
cssSelector
For details see https://blog.csdn.net/kaka1121/article/details/51850881
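In the Python bindings these eight strategies map onto the find_element_by_* methods. Every selector below is purely illustrative (a missing element raises NoSuchElementException):

```python
from selenium import webdriver

driver = webdriver.Chrome()
driver.get('http://match.sports.sina.com.cn/')
driver.find_element_by_id('link1')
driver.find_element_by_name('q')
driver.find_element_by_class_name('cont_figure_lis')
driver.find_element_by_tag_name('div')
driver.find_element_by_link_text('戰報')
driver.find_element_by_partial_link_text('戰')
driver.find_element_by_xpath('//div[@class="main_data"]')
driver.find_element_by_css_selector('div.main_data > div')
```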
4. Others
re: regular expressions
xlrd, xlwt: reading and writing Excel spreadsheets from Python
json: converting between JSON text and Python objects (json.loads parses a JSON string, json.dumps serializes one)
…
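As a small taste of how json and xlwt are combined later, here is a sketch with a fabricated one-line sample shaped like the NBA feed parsed in section 5:

```python
import json
import xlwt

sample = '{"team_name": "熱火", "game_clock": "11:32", "home_score": "2", "visitor_score": "0"}'
row = json.loads(sample)              # JSON text -> Python dict

workbook = xlwt.Workbook()
sheet = workbook.add_sheet('sheet1', cell_overwrite_ok=True)
sheet.write(0, 0, row['team_name'])   # row 0, column 0
sheet.write(0, 1, row['game_clock'])
workbook.save('demo.xls')
```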
The Structure of the Sina Live-Hall Pages
First, open the Sina Sports homepage. As the screenshot showed, the top navigation bar has a Live (直播) entry, which holds the content we want to scrape:
The Sina Sports live hall, a clean and clearly organized page:
Opening it jumps straight to the current date, and the date picker on the right of the page goes back as far as January 1, 2012, a very large time span for the available data compared with Tencent's live hall, which only reaches back to 2015. That, plus the page's relatively simple construction, is the main reason we chose the Sina Sports live hall.
I recommend opening the page in Google Chrome, whose developer tools are excellent. Now let's look at the page's source and construction.
The main content in the middle of the page all sits inside one large div with class 'main clearfix'. It splits into three parts:
- a toptab section, with "all broadcasts" selected by default
- a topcont section listing every match of the day
- a comment section at the bottom
Expand the second section, the one we need, then the match timetable inside it; opening the main_data and cont_figure_list fields in turn leads to the list of matches. Odd and even entries carry different classes because their background colors alternate.
Each match lives under its own div, so next we can look at the concrete markup of a single match entry and work out which parts we need and how to scrape them.
Each entry corresponds to six items laid out horizontally on the page (marked with red boxes in the screenshot). Initially I planned to use the class of the leftmost icon's hyperlink to tell basketball from football apart, but when I clicked into individual matches the CBA live and report data turned out to be very messy, so I limited myself to the NBA. Since CBA and NBA share the same leftmost icon, I instead check whether the text of the third item equals 'NBA' to decide whether to enter a match and scrape it.
That makes classifying a match easy. Note, however, that e-sports entries also use the football icon, and football itself spans far more than the World Cup and the Euros; there are many competition types. This is why we can rely neither on the icon class nor on comparing the category label against a single value to decide whether a match is football.
Notice that almost everything on the page (pictures, team names, and so on) is a hyperlink (very thoughtful, very convenient). Of the two parts I marked, one link leads into the match details and the other jumps straight to the report page; these are the two we need.
Next, clicking the score opens the match detail page:
At the top is the final score and an overview of the whole match; below it the report tab is shown by default, essentially a headline and a picture followed by the report text. The page is divided into four tabs, 戰報 (report), 直播 (live), 統計 (stats), and 評論 (comments), switched by four hyperlinks in the header.
The live tab is also user-friendly: the two sides show the full data for both starting line-ups, and the middle lists the on-court events in time order; apart from entries like half-time breaks, most lines are a team's player plus an action. These list-style live entries are what we need. Below them are team-stat comparisons and similar material that we won't scrape for now.
Each live entry is one list item, and its tag attributes record the team number and player number the line belongs to, though in a rather odd, garbled-looking form. Opening the items reveals their internal structure:
Inside, we find the team name and the live text we can scrape. The report text, of course, sits directly inside <p> tags.
I also noticed that when the live hall's date changes, only the scheduledate parameter of the URL changes; for June 12, 2018, for instance, the URL is
http://match.sports.sina.com.cn/index.html#type=schedule&matchtype=all&filtertype=time&livetype=all&scheduledate=2018-06-12
Changing the trailing date parameter therefore jumps straight to any day's full match list, which saved me a lot of time.
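Since only scheduledate varies, a sketch of how the daily URLs can be generated (the actual crawl below builds its own year/month/day lists by hand):

```python
from datetime import date, timedelta

base = ('http://match.sports.sina.com.cn/index.html#type=schedule&matchtype=all'
        '&filtertype=time&livetype=all&scheduledate=')
day = date(2012, 1, 1)
while day <= date(2012, 1, 3):
    print(base + day.isoformat())   # ...scheduledate=2012-01-01, -02, -03
    day += timedelta(days=1)
```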
That completes the exploration of the Sina Sports live hall alongside its source code, and it gives us a basic plan for scraping the data.
Code and Scraping Steps
1. Preparation
First define a webdriver and use it to visit the data while imitating a real browser. Define two global variables, page_list and cangoin: the running number used to label scraped files, and a flag for whether the match detail page can be used to scrape reports. I found that for matches before July 2014 the report can only be scraped from the standalone report page, because the report section inside the single-match page is completely empty (presumably a problem on Sina's back end). A URL for June 6, 2012 is picked for testing.
```python
chromedriver = r'C:\Program Files (x86)\Google\Chrome\Application\chromedriver.exe'
driver = webdriver.Chrome(chromedriver)
# driver = webdriver.Chrome()
global page_list
global cangoin
cangoin = 0
page_return = 1
driver.implicitly_wait(2)
url = 'http://match.sports.sina.com.cn/index.html#type=schedule&matchtype=all&filtertype=time&livetype=ed&scheduledate=2012-06-06'
```

2. Picking Out All the NBA Matches
First point the driver at the page we want and build a BeautifulSoup, then walk down through the tags that identify the match type, using soup.find('tag', attr=value) and soup.find_all('tag', attr=value); the latter returns a list. For testing, every non-NBA match is printed to the console. When an NBA match is found, its detail URL is fetched and passed to the search method, which scrapes the report and the live feed; for the matches whose detail page cannot serve a report, we instead click through the report hyperlink to its standalone page, for which I wrote a getzb method. The isNBA code is therefore:
```python
def isNBA(url):
    driver.get(url)
    soup = BeautifulSoup(driver.page_source, 'lxml')
    div_m = soup.find('div', class_='main_data')
    div_c = div_m.find('div', class_='cont_figure')
    div_l = div_c.find('div', class_='cont_figure_lis')
    div = div_l.find_all('div', recursive=False)
    for d in div:
        dd = d.find_all('div', recursive=False)[2]   # the third item holds the category text
        dp = dd.find('p')
        if dp.get_text() == 'NBA':
            dzb = d.find_all('div', recursive=False)[3]
            dfl = dzb.find('div', class_='cont_figure_li03_m')
            span = dfl.find('span', class_='cRed')
            sa = span.find('a')                      # the score link leads into the match
            url_into = sa['href']
            print(url_into)
            search(url_into)
            print(cangoin)
            if not cangoin:
                # older matches: follow the standalone report link instead
                p = dfl.find('p', recursive=False)
                a = p.find('a', text='戰報')
                url_zb = a['href']
                print(url_zb)
                getzb(url_zb)
        else:
            print('not NBA:', dp.get_text())
```

3. Entering a Single Match: search for the Report and Live Feed
After entering a single match, the report tab is shown by default, so when my cangoin flag is 1 the report can be scraped right there; I factored the scraping and saving of the report out of search into a getnews method. Next, the live tab link is clicked; since the element has no id, I simply copied its XPath from DevTools, which works very well. Clicking it and then trying to scrape the live feed, however, turned up almost none of the entries in the static page source: the feed is loaded dynamically, so instead I pull the match id out of a link on the page and request Sina's JSON interface directly (the de_json call in the code).
The search code is as follows:
```python
def search(url_into):
    print(url_into)
    driver.get(url_into)
    global page_return
    global real_name
    global real_time
    soup = BeautifulSoup(driver.page_source, 'lxml')
    if cangoin:
        getnews(url_into)
    # the live tab has no id, so its XPath was copied straight from DevTools
    driver.find_element_by_xpath('/html/body/section[2]/div/div[1]/div[1]/a[4]').click()
    # the match id is the tail of this link's href
    span = soup.find('span', class_='qq_spanoption')
    as_ = span.find('a', class_='qq_login_h')
    print(as_['href'])
    id = as_['href'][-10:]
    href = ('http://api.sports.sina.com.cn/pbp/?format=json&source=web&withhref=1&mid='
            + id + '&pid=&eid=0&dpc=1')
    de_json(href)
    a = soup.find('a', tab='live')
    print(a['class'])
    # abandoned attempt to read the live list straight from the static DOM:
    # div = soup.find('div', class_='ppc03_cast_cont', stype='auto')
    # if div is not None:
    #     ol = div.find('ol', recursive=False)
    #     div_d = div.find('div', recursive=False)
    #     guest = div_d.find('div', class_="ppc03_cast_select bselector01 fr")
    #     select = guest.find('select')
    #     li = ol.find_all('li', recursive=False)
    #     for l in li:
    #         div1 = l.find('div', recursive=False)  # class_='ppc03_cast_time f1'
    #         real_name.append(div1.get_text())
    # else:
    #     return
```

4. Scraping Reports from the Standalone Report Page: getzb
The standalone report page makes scraping the text trivial: you only need to find the section holding the big block of text, because the report paragraphs are laid out neatly inside <p> tags. Then build a folder path, using page_list as the report's serial number and file name, and open the file with open().
```python
def getzb(url):
    global page_list
    driver.get(url)
    soup = BeautifulSoup(driver.page_source, 'lxml')
    db = soup.find('div', class_='blkContainerSblk')
    dbody = db.find('div', id='artibody')
    ps = dbody.find_all('p', recursive=False)
    page_list = page_list - 1
    write_path = 'D:\\其他\\戰報\\' + str(page_list - 1) + '.txt'
    fo = open(write_path, 'w', encoding='utf-8')
    for p in ps:
        pt = p.get_text()
        print(pt)
        fo.write(pt.replace(' ', ''))
        fo.write('\n')
    fo.close()
```

5. Parsing the JSON to Scrape the Live Feed
Opened directly, the JSON URL found above actually looks like this:
Running that blob through a JSON decoder reveals its real structure:
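Since the screenshot is lost, here is the rough shape of the decoded payload, reconstructed from the fields the code below reads; the key names match the code, but the values are invented and the real feed may contain more keys:

```json
{
  "result": {
    "data": {
      "pbp_msgs": {
        "4012243": {
          "team_name": "熱火",
          "game_clock": "11:32",
          "description": "<a href='...'>詹姆斯</a>上籃得分",
          "home_score": "2",
          "visitor_score": "0"
        }
      }
    }
  }
}
```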
json.loads(json_text) decodes the JSON into that tree structure, after which plain indexing reaches the information we need. I keep five potentially useful fields per entry: the team name, the current game clock, the play description, the home score, and the visitor score, and then use xlwt to open an Excel workbook and write the rows.
The code is as follows:
```python
def de_json(url):
    driver.get(url)
    soup = BeautifulSoup(driver.page_source, 'lxml')
    pre = soup.find('pre')
    json_t = pre.get_text()
    json_string = json.loads(json_t)        # JSON text -> Python objects
    workbook = xlwt.Workbook()              # open an Excel workbook
    sheet1 = workbook.add_sheet('sheet1', cell_overwrite_ok=True)
    write_path = 'D:\\其他\\直播\\' + str(page_list - 1) + '.xls'
    page_in_list = 0
    for i in json_string['result']['data']['pbp_msgs']:
        msg = json_string['result']['data']['pbp_msgs'][i]
        txt = re.sub(r'<.*?>', '', msg['description'])   # strip embedded HTML tags
        print(i, msg['team_name'], msg['game_clock'], txt,
              msg['home_score'], msg['visitor_score'])
        sheet1.write(page_in_list, 0, msg['team_name'])
        sheet1.write(page_in_list, 1, msg['game_clock'])
        sheet1.write(page_in_list, 2, txt)
        sheet1.write(page_in_list, 3, msg['home_score'])
        sheet1.write(page_in_list, 4, msg['visitor_score'])
        page_in_list = page_in_list + 1
    workbook.save(write_path)
```

6. Scraping Reports on the Match Page: getnews()
Similar to getzb(); the code is as follows:
```python
def getnews(url):
    driver.get(url)
    soup = BeautifulSoup(driver.page_source, 'lxml')
    divc = soup.find('div', class_='barticle_content')
    ps = divc.find_all('p', recursive=False)
    write_path = 'D:\\其他\\戰報\\' + str(page_list) + '.txt'
    fo = open(write_path, 'w', encoding='utf-8')
    for p in ps:
        pt = p.get_text()
        print(pt)
        fo.write(pt.replace(' ', ''))
        fo.write('\n')
    fo.close()
```

7. Stepping Through the Dates
Enumerate all the dates, substitute each one into the URL, and scrape every day's data.
```python
# years, mouth and days1..days4 are the date lists defined in the complete code below
base = ('http://match.sports.sina.com.cn/index.html#type=schedule&matchtype=all'
        '&filtertype=time&livetype=ed&scheduledate=')

for i in years[0:]:
    if i == '2012':
        for m in mouth[5:]:  # adjust this slice when resuming after a crash
            if m in ['01', '03', '05', '07', '08', '10', '12']:
                for n in days2:
                    print(i + '-' + m + '-' + n)
                    isNBA(base + i + '-' + m + '-' + n)
            elif m in ['02']:
                for n in days4:  # 2012 is a leap year
                    print(i + '-' + m + '-' + n)
                    isNBA(base + i + '-' + m + '-' + n)
            else:
                for n in days1:
                    if m == '06' and n == '02':  # skipped in the original run
                        continue
                    print(i + '-' + m + '-' + n)
                    isNBA(base + i + '-' + m + '-' + n)
    elif i == '2016':
        for m in mouth[5:]:
            if m in ['01', '03', '05', '07', '08', '10', '12']:
                for n in days2:
                    print(i + '-' + m + '-' + n)
                    isNBA(base + i + '-' + m + '-' + n)
            elif m in ['02']:
                for n in days4:  # 2016 is a leap year
                    print(i + '-' + m + '-' + n)
                    isNBA(base + i + '-' + m + '-' + n)
            else:
                for n in days1:
                    print(i + '-' + m + '-' + n)
                    isNBA(base + i + '-' + m + '-' + n)
    else:
        for m in mouth[0:]:
            if m in ['01', '03', '05', '07', '08', '10', '12']:
                if i == '2014' and m == '08':
                    cangoin = 1  # from here on the match page serves reports
                for n in days2:
                    print(i + '-' + m + '-' + n)
                    isNBA(base + i + '-' + m + '-' + n)
            elif m in ['02']:
                for n in days3:
                    print(i + '-' + m + '-' + n)
                    isNBA(base + i + '-' + m + '-' + n)
            else:
                for n in days1:
                    print(i + '-' + m + '-' + n)
                    isNBA(base + i + '-' + m + '-' + n)
```

8. The Complete Code
最后不能忘了關閉driver
```python
# -*- coding:utf-8 -*-
from lxml import etree
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
import time
from distutils import log
import os
import sys
from selenium.webdriver.common.action_chains import *
import re
import xlrd
import xlwt
import json
import io

sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='utf8')


def setpl():
    global page_list
    page_list = 4056
    global cangoin

# isNBA, getzb, search, getnews and de_json are exactly the functions
# shown in sections 2-6 above.

chromedriver = r'C:\Program Files (x86)\Google\Chrome\Application\chromedriver.exe'
driver = webdriver.Chrome(chromedriver)
# driver = webdriver.Chrome()
global page_list
global cangoin
cangoin = 0
setpl()
page_return = 1
real_name = []
driver.implicitly_wait(2)

url = 'http://match.sports.sina.com.cn/index.html#type=schedule&matchtype=all&filtertype=time&livetype=ed&scheduledate=2012-06-06'
# isNBA(url)
print('url1 is done!')
url2 = 'http://match.sports.sina.com.cn/index.html#type=schedule&matchtype=all&filtertype=time&livetype=all&scheduledate=2014-02-10'
# isNBA(url2)
# search('http://sports.sina.com.cn/nba/live.html?id=2014101502')

years = ['2012', '2013', '2014', '2015', '2016', '2017']
mouth = ['01', '02', '03', '04', '05', '06', '07', '08', '09', '10', '11', '12']
days1 = ['%02d' % d for d in range(1, 31)]  # 30-day months
days2 = ['%02d' % d for d in range(1, 32)]  # 31-day months
days3 = ['%02d' % d for d in range(1, 29)]  # February, common year
days4 = ['%02d' % d for d in range(1, 30)]  # February, leap year

# the date loop from section 7 runs here, then:
driver.quit()
```

Scraping the Football Matches
For the reasons explained earlier, football has far too many competition types to filter by icon name or by a single label, so I filtered by exclusion instead. I also filtered the scraped reports themselves, discarding ones too messy to use as data: for example, reports containing strong tags are dropped, because an article with strong tags covers two or more matches. Apart from that, the flow is much the same as for basketball.
The code is as follows:
```python
#encoding=utf-8
from lxml import etree
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
import time
from distutils import log
import os
import sys
from selenium.webdriver.common.action_chains import *
import re
import xlrd
import xlwt
import json
import io

sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='utf8')


def search(url):
    print(url)
    driver.get(url)
    global page_return
    global real_name
    global real_time
    soup = BeautifulSoup(driver.page_source, 'lxml')
    div = soup.find('div', class_='cont_figure_lis')
    if div is None:
        return
    div = div.find_all('div', recursive=False)
    page_return = 1
    # category keywords that mark a non-football entry
    exclude = ['奧', '籃', '排', 'NBA', '網', 'U', '抽簽', '斯', '拳擊', 'F', 'BA',
               '棋', '牌', '排球', '乒乓', '羽', '游泳', '亞俱杯', '東京賽',
               '美傳奇巨星', '新秀賽', '冰壺', 'NCAA']
    for i in div:
        name_judge = 0
        name = []
        real_name = []
        real_time = []
        index = 0
        url_set = []
        page_return = 1
        # locate the report/live links for this match
        if not i.find('div'):
            continue
        m = i('div')[2]
        text = m.get_text()
        if text.replace(' ', '').strip() == '' or any(k in text for k in exclude):
            continue
        m = i.find_all('div', recursive=False)[1]
        for m_time in re.findall(r'[0-9]{2,2}', m.get_text()):
            real_time.append(m_time)          # month and day of the fixture
        n = i.find_all('div', recursive=False)[3]
        if len(n.find_all('div', recursive=False)) != 3:
            continue
        n1 = n.find_all('div', recursive=False)[1]   # the middle column with the text
        if not n1.find('h4'):
            continue
        n1_1 = n1.find('h4')
        if not n1_1.find_all('a', recursive=False):
            continue
        for name_ in n1_1.find_all('a', recursive=False):
            real_name.append(name_.get_text())       # the two team names
        print(real_name)
        # hand-maintained exceptions for fixtures with broken data
        for name_judge_ in real_name:
            if ('籃' in name_judge_ or '排' in name_judge_ or '美傳奇巨星' in name_judge_
                    or '李娜' in name_judge_
                    or ('長春亞泰' in name_judge_ and real_time[0] == '07' and real_time[1] == '09')
                    or '利物浦' in name_judge_
                    or ('廣州恒大' in name_judge_ and real_time[0] == '11' and real_time[1] == '04')
                    or ('拜仁' in name_judge_ and real_time[0] == '12' and real_time[1] == '16')):
                name_judge = 1
                break
        if name_judge == 1:
            continue
        if not n1.find('p'):
            continue
        n1_2 = n1.find('p')
        if not n1_2.find('a'):
            continue
        for n1_2_ in n1_2('a'):
            if n1_2_.get_text() == '戰報' or n1_2_.get_text() == '實錄':
                print(n1_2_.get_text())
                index = index + 1
                print(n1_2_['href'])
                url_set.append(n1_2_['href'])
        if index != 2:   # we need both a report link and a live-log link
            continue
        get_txt(url_set[0], name)
        if page_return != 0:
            get_livetxt(url_set[1])


def get_txt(url, name):
    driver.get(url)
    global page_return
    soup = BeautifulSoup(driver.page_source, 'lxml')
    strong_list = 0
    txt_list = []
    test_time = []
    # publication date of the article, to cross-check against the fixture date
    if soup.find('span', id='pub_date'):
        mtime = soup.find('span', id='pub_date')
        test_time.append(re.findall(r'[0-9]{2,2}\u6708', mtime.get_text())[0][0:2])  # month
        test_time.append(re.findall(r'[0-9]{2,2}\u65e5', mtime.get_text())[0][0:2])  # day
    if len(test_time) == 0:
        if soup.find('span', class_='article-a__time'):
            mtime = soup.find('span', class_='article-a__time')
            test_time.append(re.findall(r'[0-9]{2,2}\u6708', mtime.get_text())[0][0:2])
            test_time.append(re.findall(r'[0-9]{2,2}\u65e5', mtime.get_text())[0][0:2])
    print(test_time)
    if len(test_time) == 2:
        if test_time[0] != real_time[0]:
            page_return = 0
            return
        if int(test_time[1]) < int(real_time[1]) - 1 or int(test_time[1]) > int(real_time[1]) + 1:
            page_return = 0
            return
    # the article body has appeared under several different classes over the years
    for class1 in ('BSHARE_POP blkContainerSblkCon clearfix blkContainerSblkCon_14',
                   'blkContainerSblkCon',
                   'article-a__content',
                   'layout-equal-height__item layout-fl layout-of-hidden layout-pt-c '
                   'layout-wrap-b layout-pr-a layout-br-a'):
        if soup.find('div', class_=class1):
            intern_deal(class1, soup, txt_list, name, strong_list)
            return
    page_return = 0


def intern_deal(class1, soup, txt_list, name, strong_list):
    global page_list
    global page_return
    tag = 1            # flips to 0 once the goal-information block is reached
    start_list = 0     # 1 if the article opens with an empty <p>
    previous_list = 0  # number of minute lines collected so far
    txt1 = soup.find('div', class_=class1)
    if not txt1.find('p'):
        page_return = 0
        return
    if txt1.find('p').get_text().replace(' ', '').strip() == '':
        start_list = 1
    list_number = 0
    if len(txt1('p')) <= 4 + start_list:
        page_return = 0
        return
    # a usable report must mention minutes within its first few paragraphs
    newstag = 0
    for news_tag in txt1.find_all('p', recursive=False)[0:5]:
        if re.match(r'.*\u5206\u949f.*', news_tag.get_text()) is not None:
            newstag = 1
    if newstag == 0:
        page_return = 0
        return
    # two or more bold lead-ins means the article covers several matches: skip it
    for i in txt1.find_all('p', recursive=False)[1 + start_list:4 + start_list]:
        if i.find('strong'):
            if (i('strong')[0].get_text() == i.get_text().strip()[0:len(i('strong')[0].get_text())]
                    and not re.match(r'.*\u5206\u949f.*', i('strong')[0].get_text())):
                strong_list = strong_list + 1
                if strong_list >= 2:
                    page_return = 0
                    return
    for i in txt1.find_all('p')[1 + start_list:-1]:
        if i.attrs != {}:
            continue
        if i.get_text().replace(' ', '').strip()[0:2] in ('進球', '信息'):
            tag = 0
            continue
        if len(i.get_text().replace(' ', '').strip()) <= 35 and tag == 0:
            continue
        if ((re.match(r'.*[0-9]-[\u4e00-\u9fa5].*', i.get_text()) is not None
                or re.match(r".*[0-9]'", i.get_text()) is not None)
                and list_number >= 3):
            # reached the squad list at the bottom: remember the first team name
            name.append(i.get_text().replace(' ', '').strip()[0:2])
            break
        list_number = list_number + 1
        # skip paragraphs before the first minute-by-minute line
        if (re.match(r'.*\u5206\u949f.*', i.get_text()) is None
                and '開場' not in i.get_text() and '開始' not in i.get_text()
                and previous_list == 0):
            continue
        final_txt = i.get_text()
        # strip link, script, style and span text embedded in the paragraph
        for sub_tag in ('a', 'script', 'style', 'span'):
            for sub in i.find_all(sub_tag):
                final_txt = final_txt.replace(sub.get_text(), '')
        final_txt = final_txt.replace('[點擊觀看視頻]', '').replace('[點擊觀看進球視頻]', '')
        final_txt = (final_txt.replace(' ', '').replace('[', '').replace(']', '')
                     .replace(':', '').replace('【', '').replace('】', '')
                     .replace('(', '').replace(')', '').replace('(', '').replace(')', '').strip())
        if len(final_txt) >= 10:
            txt_list.append(final_txt)
        previous_list = previous_list + 1
    # the last <p> is often empty, so walk backwards for the closing sentence
    zuihounumber = -1
    maxxunhuan = 10
    while zuihounumber < 0 and maxxunhuan > 0:
        last = txt1('p')[zuihounumber].get_text().replace(' ', '').strip()
        if (re.findall(r'([\u4e00-\u9fa5].*[\u4e00-\u9fa5])', last)
                and not txt1('p')[zuihounumber].find('a')):
            txt_list.append(re.findall(r'([\u4e00-\u9fa5].*[\u4e00-\u9fa5])', last)[0])
            zuihounumber = 1
        else:
            zuihounumber = zuihounumber - 1
            maxxunhuan = maxxunhuan - 1
    print(txt_list)
    if len(name) == 0:
        page_return = 0
        return
    # the team name found in the squad list must match one of the fixture's teams
    if ((name[0] not in real_name[0] and name[0] not in real_name[1])
            and name[0] != '女王' and name[0] != '托'
            and name[0] != '皇馬' and name[0] != '巴薩'):
        page_return = 0
    else:
        write_path = '/Users/hejie/Desktop/課外學習/數據集/新浪直播數據/戰報/' + str(page_list) + '.txt'
        fo = open(write_path, 'w', encoding='utf-8')
        for t in txt_list:
            print(t)
            fo.write(t.replace(' ', ''))
            fo.write('\n')
        fo.close()


def get_livetxt(url):
    print(url)
    number = re.findall(r'\b[0-9][0-9]{4,7}\b', url)[0]
    # the real JSON endpoint behind the live-log page
    url_ = 'http://api.sports.sina.com.cn/?p=live&s=livecast&a=livecastlog&id=' + number + '&dpc=1'
    print(url_)
    driver.get(url_)
    soup = BeautifulSoup(driver.page_source, 'lxml')
    if soup.find('pre') is None:
        # the id points at an ordinary page: parse its table
        msg, total_time, score = get_txt_direct(url)
    else:
        # the id points at a raw data file: parse the JSON
        msg, total_time, score = get_txt_indirect(url_)
    global page_list
    workbook = xlwt.Workbook()
    sheet1 = workbook.add_sheet('sheet1', cell_overwrite_ok=True)
    write_path = '/Users/hejie/Desktop/課外學習/數據集/新浪直播數據/實錄/' + str(page_list) + '.xls'
    print(write_path)
    page_in_list = 0
    for i in range(len(msg)):
        sheet1.write(page_in_list, 0, msg[i])
        sheet1.write(page_in_list, 1, total_time[i])
        sheet1.write(page_in_list, 2, score[i])
        page_in_list = page_in_list + 1
    workbook.save(write_path)
    page_list = page_list + 1


def get_txt_indirect(url):
    # live pages whose minutes are not shown directly: read the JSON log
    msg, total_time, score = [], [], []
    driver.get(url)
    soup = BeautifulSoup(driver.page_source, 'lxml')
    txt = re.findall(r'\[.*\]', soup.find('pre').get_text())[0]
    jo = json.loads(txt)
    shang_end = 0   # minute at which the first half ended
    for i in jo:
        if 'st' not in i or i['st'] is None or 'q' not in i:
            continue
        if ':' in i['m']:
            continue
        if i['q'] == 1:          # first half
            msg.append(i['m'].replace('.', ''))
            total_time.append((i['st'] // 60) + 1)
            score.append(i['s']['s1'] + '-' + i['s']['s2'])
            shang_end = (i['st'] // 60) + 1
        elif i['q'] == 2:        # second half: add the first-half minutes
            msg.append(i['m'].replace('.', ''))
            total_time.append((i['st'] // 60) + 1 + shang_end)
            score.append(i['s']['s1'] + '-' + i['s']['s2'])
        elif i['q'] == 5:        # full time
            if len(re.findall(r'[0-9]-[0-9]', i['m'])) == 1:
                msg.append(i['m'].replace('.', ''))
                total_time.append('完賽')
                score.append(i['s']['s1'] + '-' + i['s']['s2'])
                break
    return msg, total_time, score


def get_txt_direct(url):
    # live pages where the minute is printed in the table directly
    msg, total_time, score = [], [], []
    driver.get(url)
    soup = BeautifulSoup(driver.page_source, 'lxml')
    txt = soup.find('tbody')
    start_time = 0   # last minute seen, used to recognise the full-time row
    for i in txt('tr')[::-1]:
        if not i.find('th'):
            continue
        th_text = i('th')[0].get_text()
        times = re.findall(r'[0-9]+', th_text)
        if times:
            if ':' in i('td')[0].get_text():
                continue
            msg.append(i('td')[0].get_text().replace('.', '').strip())
            if len(times) == 1:
                total_time.append(times[0])
            else:
                # stoppage time such as 45+2: sum the pieces
                total_time.append(sum(int(t) for t in times))
            start_time = times[0]
            score.append(i('td')[1].get_text().strip())
        elif th_text.replace(' ', '').strip() == '' and int(start_time) > 80:
            if re.findall(r'[0-9]-[0-9]', i('td')[0].get_text()):
                if ':' in i('td')[0].get_text():
                    continue
                msg.append(i('td')[0].get_text().replace('.', '').strip())
                total_time.append('完賽')
                score.append(i('td')[1].get_text().strip())
                break
    return msg, total_time, score


# browser setup
driver = webdriver.Chrome()
page_list = 4564
page_return = 1
real_name = []
driver.implicitly_wait(2)

years = ['2012', '2013', '2014', '2015', '2016', '2017']
mouth = ['01', '02', '03', '04', '05', '06', '07', '08', '09', '10', '11', '12']
days1 = ['%02d' % d for d in range(1, 31)]  # 30-day months
days2 = ['%02d' % d for d in range(1, 32)]  # 31-day months
days3 = ['%02d' % d for d in range(1, 29)]  # February, common year
days4 = ['%02d' % d for d in range(1, 30)]  # February, leap year

base = ('http://match.sports.sina.com.cn/index.html#type=schedule&matchtype=all'
        '&filtertype=time&livetype=ed&scheduledate=')

# The original run hard-coded per-month start slices (days1[16:], days2[22:] and
# so on) so that an interrupted crawl could be resumed by hand; cleaned up, a
# full pass over every date looks like this:
try:
    for i in years:
        for m in mouth:
            if m in ['01', '03', '05', '07', '08', '10', '12']:
                days = days2
            elif m == '02':
                days = days4 if i in ['2012', '2016'] else days3  # leap years
            else:
                days = days1
            for n in days:
                print(i + '-' + m + '-' + n)
                search(base + i + '-' + m + '-' + n)
    driver.quit()
except Exception as e:
    print(e)
```