Beautiful Soup Crawling (with Python)

N코딩 2022. 4. 7. 00:15

1. What is Beautiful Soup?

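Beautiful Soup is a Python package for parsing HTML and XML documents.
It builds a parse tree for each parsed page, and that tree can be used to extract data from the HTML, which makes it useful for web scraping.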
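
As a minimal sketch of what "parse tree" means here, the snippet below parses a small hand-written HTML string and walks the resulting tree. The markup is made up for illustration and does not come from any real site.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML used only for illustration.
html = (
    "<html><head><title>Sample</title></head>"
    "<body><p class='intro'>Hello, <b>soup</b>!</p></body></html>"
)

soup = BeautifulSoup(html, "html.parser")

# Each tag becomes a node that can be reached by attribute access.
page_title = soup.title.text
paragraph = soup.body.p

print(page_title)            # Sample
print(paragraph["class"])    # ['intro'] (class is a multi-valued attribute)
print(paragraph.get_text())  # Hello, soup!
```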

2. Basic crawling skeleton

import requests
from bs4 import BeautifulSoup

url = 'https://movie.naver.com/movie/sdb/rank/rmovie.nhn?sel=pnt&date=20200303'  # URL of the site to crawl

headers = {'User-Agent' : 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'}
data = requests.get(url, headers=headers)

soup = BeautifulSoup(data.text, 'html.parser')
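
The skeleton above could be wrapped in a small helper so that failed requests are noticed early. This is only a sketch: the names fetch_soup and make_soup are hypothetical, and the 10-second timeout is an arbitrary assumption.

```python
import requests
from bs4 import BeautifulSoup


def make_soup(html):
    """Parse an HTML string with Python's built-in html.parser."""
    return BeautifulSoup(html, "html.parser")


def fetch_soup(url, headers=None, timeout=10):
    """Download a page and parse it.

    Raises requests.HTTPError when the server answers with a 4xx/5xx status.
    """
    response = requests.get(url, headers=headers, timeout=timeout)
    response.raise_for_status()
    return make_soup(response.text)
```

With these helpers, the last two lines of the skeleton would become soup = fetch_soup(url, headers=headers).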

3. Usage

Open the web page -> F12 (Inspect) -> right-click the element you want to crawl in the Elements panel -> Copy -> Copy selector

1. 
title = soup.select_one('') 
Paste the copied selector between the quotes ''

-- print(title)
-> <a href="/movie/bi/mi/basic.naver?code=171539" title="그린 북">그린 북</a>

-- print(title.text)
-> 그린 북

-- print(title['href'])  (to get an attribute value)
-> /movie/bi/mi/basic.naver?code=171539
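
The same calls can be tried offline on a small fragment. In the sketch below, the surrounding div and the selector '#old_content > a' are assumptions made for illustration; only the <a> tag itself is taken from the output above.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment built around the <a> tag shown above.
html = (
    '<div id="old_content">'
    '<a href="/movie/bi/mi/basic.naver?code=171539" title="그린 북">그린 북</a>'
    '</div>'
)
soup = BeautifulSoup(html, "html.parser")

title = soup.select_one("#old_content > a")
print(title.text)          # 그린 북
print(title["href"])       # /movie/bi/mi/basic.naver?code=171539
print(title.get("class"))  # None: .get() returns None for a missing attribute

missing = soup.select_one("#no_such_id > a")
print(missing)             # None: select_one returns None when nothing matches
```

Because select_one returns None when nothing matches, indexing the result directly (for example missing['href']) raises a TypeError.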

2. 
trs = soup.select('#old_content > table > tbody > tr')   # the result is a list

-> 
trs = soup.select('#old_content > table > tbody > tr')

for tr in trs:
    a_tag = tr.select_one('td.title > div > a')
    print(a_tag)

-> <a href="/movie/bi/mi/basic.naver?code=82432" title="헬프">헬프</a>
<a href="/movie/bi/mi/basic.naver?code=17159" title="포레스트 검프">포레스트 검프</a>
<a href="/movie/bi/mi/basic.naver?code=181700" title="안녕 베일리">안녕 베일리</a>
<a href="/movie/bi/mi/basic.naver?code=29217" title="글래디에이터">글래디에이터</a>

The title <a> tag of each row in the list is printed, one after another. Rows without a title link print None.

- Skipping None values (is not None)

trs = soup.select('#old_content > table > tbody > tr')

for tr in trs:
    a_tag = tr.select_one('td.title > div > a')
    if a_tag is not None:
        title = a_tag.text
        rank = tr.select_one('td.ac > img')['alt']
        point = tr.select_one('td.point').text
        print(rank, title, point)
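
The None check can be tried without a network connection. The sketch below runs the same loop over a hand-written table; the movie names, ranks, and scores are invented, and the middle row imitates a separator row that has no title link.

```python
from bs4 import BeautifulSoup

# Hypothetical table; the second row stands in for a separator row.
html = """
<div id="old_content"><table><tbody>
<tr><td class="ac"><img alt="01"></td><td class="title"><div><a href="#">Movie A</a></div></td><td class="point">9.5</td></tr>
<tr><td colspan="3"></td></tr>
<tr><td class="ac"><img alt="02"></td><td class="title"><div><a href="#">Movie B</a></div></td><td class="point">9.4</td></tr>
</tbody></table></div>
"""
soup = BeautifulSoup(html, "html.parser")

rows = []
for tr in soup.select("#old_content > table > tbody > tr"):
    a_tag = tr.select_one("td.title > div > a")
    if a_tag is None:
        continue  # separator row: no title link, so skip it
    rank = tr.select_one("td.ac > img")["alt"]
    point = tr.select_one("td.point").text
    rows.append((rank, a_tag.text, point))

print(rows)  # [('01', 'Movie A', '9.5'), ('02', 'Movie B', '9.4')]
```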

4. Crawling with meta tag information (when the element cannot be selected by its identifier)

* For dynamic pages, use Selenium

* Crawl using the meta tags in the head section of the page source

import requests
from bs4 import BeautifulSoup

url = 'https://movie.naver.com/movie/bi/mi/basic.naver?code=171539'

headers = {'User-Agent' : 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'}
data = requests.get(url, headers=headers)

soup = BeautifulSoup(data.text, 'html.parser')

title = soup.select_one('meta[property="og:title"]')['content']
image = soup.select_one('meta[property="og:image"]')['content']
description = soup.select_one('meta[property="og:description"]')['content']
print(title, image, description)
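
Not every page defines all three og: tags, and indexing a missing tag raises a TypeError. One possible way to guard against that is sketched below; the helper name og and the sample HTML are hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical head section; og:description is deliberately missing.
html = """
<html><head>
<meta property="og:title" content="Sample title">
<meta property="og:image" content="https://example.com/poster.jpg">
</head><body></body></html>
"""
soup = BeautifulSoup(html, "html.parser")


def og(soup, name, default=None):
    """Return the content of <meta property="og:NAME">, or default if absent."""
    tag = soup.select_one(f'meta[property="og:{name}"]')
    if tag is not None and tag.has_attr("content"):
        return tag["content"]
    return default


title = og(soup, "title")
image = og(soup, "image")
description = og(soup, "description", default="")
print(title, image, description)
```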
