
Web Crawling using BeautifulSoup in Python

by Adnan Amin, Jan 12th, 2022


A web crawler's basic workflow:

  • Start from a seed URL. Web crawlers use this initial URL to access the first web page to be crawled.
  • When crawling a website, we must first retrieve the HTML content of the page and then parse it to obtain the URLs of all pages linked to it.
  • Set up a queue for these URLs.
  • To crawl a page, loop through the queue, reading each URL in turn.
  • Check the stop condition: the crawler keeps going until it can no longer find a new URL (a minimal sketch of this loop follows the list).
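
To make these steps concrete, here is a minimal sketch of a queue-based crawler. The function name, page limit, and timeout are illustrative choices, not something prescribed by this article.

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
from collections import deque

def crawl(seed_url, max_pages=10):
    """Breadth-first crawl starting from seed_url, stopping after max_pages."""
    queue = deque([seed_url])   # URLs waiting to be crawled
    visited = set()             # URLs already crawled

    while queue and len(visited) < max_pages:
        url = queue.popleft()
        if url in visited:
            continue
        try:
            response = requests.get(url, timeout=10)
        except requests.RequestException:
            continue                          # skip unreachable pages
        visited.add(url)
        soup = BeautifulSoup(response.content, 'lxml')
        # Collect every linked URL and add unseen ones to the queue
        for anchor in soup.find_all('a', href=True):
            link = urljoin(url, anchor['href'])
            if link not in visited:
                queue.append(link)
    return visited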

You'll need the following libraries. 

  • “beautifulsoup4” is a library for parsing HTML and XML documents.
  • “requests” sends HTTP requests (such as GET and POST); we will mostly use it to fetch a website's source code.
  • “lxml” is the fast HTML/XML parser that BeautifulSoup uses under the hood.
pip install beautifulsoup4
pip install requests
pip install lxml

Or follow these steps (if you use Anaconda):

conda install beautifulsoup4
conda install requests
conda install lxml
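
Once installed, a quick way to confirm the libraries import correctly (a small sanity check, not part of the original steps):

import bs4, requests
from lxml import etree

print('beautifulsoup4', bs4.__version__)
print('requests', requests.__version__)
print('lxml', etree.__version__)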

Create a BeautifulSoup object from the downloaded page and specify the parser as lxml (here f is the response returned by requests.get):

soup = BeautifulSoup(f.content, 'lxml')
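
For instance, fetching a page and handing it to BeautifulSoup might look like this (https://example.com is only a placeholder):

import requests
from bs4 import BeautifulSoup

f = requests.get('https://example.com')      # download the page
soup = BeautifulSoup(f.content, 'lxml')      # parse the HTML with the lxml parser
print(soup.title.string)                     # the text inside the <title> tag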

BeautifulSoup parses the page content, and its search methods extract the relevant information. Three methods are the most commonly used:

  • find_all() returns all matching nodes.
  • find() returns the first matching node.
  • select() accepts a CSS selector.
movies = soup.find('table', {'class': 'table'}).find_all('a')
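
A short illustration of the three methods on a hand-written snippet of HTML (the markup below is invented for demonstration):

from bs4 import BeautifulSoup

html = """
<table class="table">
  <tr><td><a href="/m/movie_one">Movie One</a></td></tr>
  <tr><td><a href="/m/movie_two">Movie Two</a></td></tr>
</table>
"""
soup = BeautifulSoup(html, 'lxml')

first_link = soup.find('a')                   # first matching node
all_links = soup.find_all('a')                # every matching node
css_links = soup.select('table.table a')      # CSS-selector equivalent

print(first_link.string)                      # Movie One
print([a['href'] for a in all_links])         # ['/m/movie_one', '/m/movie_two']
print(len(css_links))                         # 2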

Complete Source Code (Python):

# Import the required Python libraries
import requests
import lxml  # parser used by BeautifulSoup
from bs4 import BeautifulSoup
from xlwt import Workbook

# Prepare an Excel workbook to store the downloaded data
myworkbook = Workbook(encoding='utf-8')
tbl = myworkbook.add_sheet('downloaded_data')
tbl.write(0, 0, 'Number')
tbl.write(0, 1, 'movie_url')
tbl.write(0, 2, 'movie_name')
tbl.write(0, 3, 'movie_overview')
line = 1

# Set the URL to be crawled, create the header information, then send a network request and wait for the response
url = "https://www.rottentomatoes.com/top/bestofrt/"
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36 QIHU 360SE'
}
getdata = requests.get(url, headers=headers)

movies_lst = []
# Webpage parse: create a BeautifulSoup object and specify the parser as lxml
soup = BeautifulSoup(getdata.content, 'lxml')
# Every movie link sits inside the ranking table
movies = soup.find('table', {'class': 'table'}).find_all('a')

num = 0
for anchor_tag in movies:
    if num <= 3:
        urls = 'https://www.rottentomatoes.com' + anchor_tag['href']
        movies_lst.append(urls)
        num += 1
        movie_url = urls

        # Download and parse each movie's own page to read its synopsis
        movie_f = requests.get(movie_url, headers=headers)
        movie_soup = BeautifulSoup(movie_f.content, 'lxml')
        movie_content = movie_soup.find('div', {'class': 'movie_synopsis clamp clamp-6 js-clamp'})

        print(num, urls, '\n', 'Movie:' + anchor_tag.string.strip())
        print('Movie info:' + movie_content.string.strip())

        # Write one row per movie into the spreadsheet
        tbl.write(line, 0, num)
        tbl.write(line, 1, urls)
        tbl.write(line, 2, anchor_tag.string.strip())
        tbl.write(line, 3, movie_content.string.strip())
        line += 1

print('Records Downloaded', num)
myworkbook.save('movies_top4.xls')
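
If you want to double-check the saved file, the rows can be read back with xlrd, which still supports the legacy .xls format (this check is an addition, not part of the script above):

import xlrd

book = xlrd.open_workbook('movies_top4.xls')
sheet = book.sheet_by_name('downloaded_data')
for row in range(sheet.nrows):
    print(sheet.row_values(row))   # header row followed by one row per movie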