Scraping - Python In-Depth Guide

Fri, Aug 7, 2020

Fetching web data by scraping

Basic form

import requests

url = 'http://www.yahoo.co.jp/'

response = requests.get(url)
response.encoding = response.apparent_encoding
# response.text honors response.encoding; content.decode() would ignore it and assume UTF-8
print(response.text)

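The line response.encoding = response.apparent_encoding guards against garbled characters (mojibake): it makes response.text decode the body with the encoding detected from the content itself.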
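To see why the encoding matters, here is a minimal sketch that uses only the standard library. The sample bytes are a made-up stand-in for a page body served as Shift_JIS; no real response is involved.

```python
# Hypothetical page body: Japanese text encoded as Shift_JIS, as some older sites serve it.
raw = 'こんにちは'.encode('shift_jis')

# Decoding with the wrong codec garbles the text (errors='replace' avoids an exception).
garbled = raw.decode('utf-8', errors='replace')

# Decoding with the right codec recovers it; apparent_encoding is requests' guess at this codec name.
decoded = raw.decode('shift_jis')

print(garbled)
print(decoded)
```

Setting response.encoding before reading response.text plays the role of picking the right codec here.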

Spoofing a browser

import requests

url = 'http://www.yahoo.co.jp/'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36'}
response = requests.get(url, headers=headers)
response.encoding = response.apparent_encoding
# response.text honors response.encoding; content.decode() would ignore it and assume UTF-8
print(response.text)

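Some pages cannot be fetched unless the request looks like it comes from a browser.
The example above sends Chrome's User-Agent string.
As before, response.encoding = response.apparent_encoding guards against garbled characters.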
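When several pages are fetched, repeating the headers on every call gets tedious. One possible arrangement is a requests.Session that carries the User-Agent for all requests. The helper names make_session and fetch and the 10-second timeout are my own assumptions, not part of the original notes.

```python
import requests

# Same Chrome User-Agent string as in the example above.
UA = ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
      '(KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36')


def make_session(user_agent=UA):
    """Hypothetical helper: build a session that sends the User-Agent on every request."""
    session = requests.Session()
    session.headers.update({'User-Agent': user_agent})
    return session


def fetch(session, url, timeout=10):
    """Hypothetical helper: fetch a page and return its decoded text."""
    response = session.get(url, timeout=timeout)
    response.raise_for_status()  # raise an exception on 4xx/5xx instead of parsing an error page
    response.encoding = response.apparent_encoding
    return response.text


session = make_session()
# html = fetch(session, 'http://www.yahoo.co.jp/')  # uncomment to perform the request
```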

Fetching with Beautiful Soup

import requests
from bs4 import BeautifulSoup

url = 'http://www.yahoo.co.jp/'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36'}
response = requests.get(url, headers=headers)
response.encoding = response.apparent_encoding

soup = BeautifulSoup(response.text, 'html.parser')

links = [a.get('href') for a in soup.find_all('a')]
print(links)
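The list above can contain None (anchors without href) and relative paths. A small sketch of one way to tidy that up, run on made-up inline HTML so it works offline; the sample markup and base_url are assumptions for illustration.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for a fetched page.
html = ('<a href="/news">News</a>'
        '<a name="top">Top</a>'
        '<a href="https://example.com/x">X</a>')
base_url = 'https://example.com/'

soup = BeautifulSoup(html, 'html.parser')

# href=True keeps only anchors that have an href attribute;
# urljoin turns relative paths into absolute URLs.
links = [urljoin(base_url, a['href']) for a in soup.find_all('a', href=True)]
print(links)
```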

Specifying attributes with find_all

soup.find_all('div',class_='wrapper_')
soup.find_all('div',id='wrapper_')
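A quick check of both forms on made-up HTML. Attribute names containing a hyphen cannot be written as Python keyword arguments, so they go through the attrs dictionary; the data-sli-test attribute here is a hypothetical example.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup.
html = ('<div class="wrapper_ main" id="top"></div>'
        '<div id="wrapper_"></div>'
        '<a data-sli-test="resultlink" href="/r">r</a>')
soup = BeautifulSoup(html, 'html.parser')

# class_ matches any element that has this class among its classes.
by_class = soup.find_all('div', class_='wrapper_')

# id is matched against the id attribute.
by_id = soup.find_all('div', id='wrapper_')

# Hyphenated attribute names are passed through attrs.
by_data = soup.find_all('a', attrs={'data-sli-test': 'resultlink'})

print(by_class, by_id, by_data)
```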

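When indexing find_all directly with [0] fails

find_all returns a ResultSet, which is a subclass of list and not a tuple.
***soup.find_all(…)[0]*** therefore works whenever at least one element matches, and fails only when the result is empty.
Assigning the result to a variable such as res first lets you check that it is non-empty before reading an element.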

res = soup.find_all('a',dataslitest="resultlink")
res[0]

find_all result raises IndexError: list index out of range

soup.find_all('a',dataslitest="resultlink")[0]

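An expression like this raises IndexError when find_all matches nothing, because the returned list is empty.
See the preceding section on indexing find_all results, and check that the result is non-empty before indexing it.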
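A minimal sketch of guarding against an empty result, on made-up HTML that contains no links. find, which returns None when nothing matches, is shown as an alternative.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup with no <a> elements at all.
soup = BeautifulSoup('<p>no links here</p>', 'html.parser')

res = soup.find_all('a')
print(isinstance(res, list))  # ResultSet is a list subclass

# Index only when something matched; otherwise fall back to None.
first = res[0] if res else None

# find returns the first match, or None when nothing matches.
also_first = soup.find('a')

print(first, also_first)
```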

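Use select when find_all cannot find the element

find_all sometimes returns no match for reasons I have not pinned down. One known cause is passing several classes in a single class_ string, which only matches that exact attribute value.
Specifying the target with a CSS selector through the select method, much like jQuery, can find the element in such cases.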

soup.select('.color_')
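The difference can be reproduced on made-up HTML. This sketch covers only the multi-class case; other causes of a find_all miss may exist. The class names item and large are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup.
html = ('<div class="item color_ large">A</div>'
        '<div class="color_">B</div>')
soup = BeautifulSoup(html, 'html.parser')

# A multi-class string is compared with the whole attribute value,
# so a different order or a missing class yields no match.
exact = soup.find_all('div', class_='color_ item')

# A CSS selector matches regardless of class order or extra classes.
by_selector = soup.select('div.item.color_')

# A single class selector matches every element carrying that class.
all_color = soup.select('.color_')

print(exact, by_selector, all_color)
```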

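Use find_all when onclick cannot be retrieved

When I fetched elements with select, I could not read their onclick attribute, while find_all let me read it.
Note that select returns a list of Tag objects, so the attribute has to be read from each element rather than from the list itself.
For searches that only select can match, the workaround below is tedious but is the best answer I have so far: re-parse the select result and run find_all on it.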

html2 = soup.select('.size_')
soup2 = BeautifulSoup(str(html2), 'html.parser')
for item in soup2.find_all('div'):
    if item.has_attr('onclick'):
        print(item['onclick'])
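Before re-parsing, it may be worth trying to read the attribute from each element that select returns. On the made-up HTML below this works directly; whether it covers the case that prompted the workaround above is an assumption on my part.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup.
html = ('<div class="size_" onclick="go(1)">a</div>'
        '<div class="size_">b</div>')
soup = BeautifulSoup(html, 'html.parser')

# Read the attribute per element, skipping elements without it.
handlers = [tag['onclick'] for tag in soup.select('.size_')
            if tag.has_attr('onclick')]

# Alternatively, let the selector require the attribute.
same = [tag['onclick'] for tag in soup.select('.size_[onclick]')]

print(handlers, same)
```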

Fetching tables with Pandas

import pandas as pd

url = 'https://weather.yahoo.co.jp/weather/week/'
dfs = pd.read_html(url)
dfs[2]
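A hard-coded index such as dfs[2] breaks when the page layout changes. As a sketch, the match parameter of read_html can narrow the result to tables containing a given text. The inline HTML and its column names are made up, and read_html needs an HTML parser such as lxml installed.

```python
import io

import pandas as pd

# Hypothetical sample page with two tables.
html = """
<table>
  <tr><th>City</th><th>Pop</th></tr>
  <tr><td>Tokyo</td><td>1</td></tr>
</table>
<table>
  <tr><th>Day</th><th>High</th></tr>
  <tr><td>Mon</td><td>30</td></tr>
  <tr><td>Tue</td><td>28</td></tr>
</table>
"""

# match keeps only the tables whose text matches the given string or regex.
dfs = pd.read_html(io.StringIO(html), match='High')
weather = dfs[0]
print(weather)
```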

→ Pandas in-depth guide

Fetching tables with Pandas (with browser spoofing)

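import io

import requests
import pandas as pd

url = 'https://weather.yahoo.co.jp/weather/week/'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36'}

response = requests.get(url, headers=headers)
response.encoding = response.apparent_encoding

# Wrap the HTML in a file-like object; recent pandas versions deprecate passing raw HTML directly
dfs = pd.read_html(io.StringIO(response.text))
dfs[2]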