Allison is coding...

100 Days of Code: Day 45

Notes on 100 Days of Code: The Complete Python Pro Bootcamp.

Beautiful Soup is a Python library for pulling data out of HTML and XML files.

BeautifulSoup: documentation.

Install BeautifulSoup: pip install beautifulsoup4

Import BeautifulSoup in Python:


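from bs4 import BeautifulSoup

# Read the local HTML file into a string
with open("website.html", encoding="utf-8") as file:
    content = file.read()

# Parse the HTML with Python's built-in parser
soup = BeautifulSoup(content, "html.parser")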

Use class_ instead of class to filter by CSS class, because class is a reserved keyword in Python.

class_is_heading = soup.find_all(class_="heading") # find all the elements with class "heading"
print(class_is_heading)

Use a CSS selector to find specific elements.

name = soup.select_one(selector="#name") # find the element with id "name"
print(name)

headings = soup.select(".heading") # find all the elements with class "heading"
print(headings)

Final Project

Generate an ascending “100 Greatest Movies” list from EMPIRE.

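from bs4 import BeautifulSoup
from requests import get

# Fetch an archived snapshot of the EMPIRE list
response = get("https://web.archive.org/web/20200518073855/https://www.empireonline.com/movies/features/best-movies-2/")
response.raise_for_status()  # stop early if the request failed

website_html = response.text

soup = BeautifulSoup(website_html, "html.parser")

# Each movie title is an <h3 class="title"> element
movie_list = [movie.get_text() for movie in soup.find_all(name="h3", class_="title")]

# Reverse the scraped order so the list is ascending
movie_list.reverse()

# Write one title per line, using utf-8 so every character can be encoded
with open("movies.txt", mode="w", encoding="utf-8") as file:
    for movie in movie_list:
        file.write(f"{movie}\n")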

If the script raises UnicodeEncodeError: 'charmap' codec can't encode characters in position 10-11: character maps to <undefined>, Python is trying to write Unicode characters to a file using the cp1252 encoding, and some of those characters aren’t supported by cp1252. This typically happens on Windows when open() is called without an explicit encoding.

How to fix it:

Understanding the Problem:

  • Unicode: Python strings are Unicode by default, which means they can represent a wide range of characters from different languages.
  • cp1252: This is a character encoding that’s often used on Windows systems. It’s a limited encoding that doesn’t support all Unicode characters.
  • The Error: The error message tells you that the characters at positions 10 and 11 of the string you’re trying to write cannot be encoded using cp1252. This usually happens when the text contains characters outside the cp1252 character set, such as letters with macrons (ō), non-Latin scripts, or certain symbols.

The most common and recommended solution is to use the utf-8 encoding, which can represent every Unicode character.

So, pass encoding="utf-8" to open() to make sure the script writes the file using the utf-8 encoding.
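To make the difference between the two encodings concrete, here is a minimal sketch. The sample title and the temporary file path are assumptions chosen for illustration; they are not taken from the scraped list.

```python
import os
import tempfile

# Hypothetical sample title containing characters that cp1252 cannot represent
title = "Seven Samurai (七人の侍)"

# cp1252 has no mapping for the Japanese characters, so encoding fails
try:
    title.encode("cp1252")
    cp1252_ok = True
except UnicodeEncodeError:
    cp1252_ok = False

# Writing with an explicit utf-8 encoding succeeds regardless of the platform default
path = os.path.join(tempfile.mkdtemp(), "movies.txt")
with open(path, mode="w", encoding="utf-8") as file:
    file.write(f"{title}\n")

# Read the file back with the same encoding to confirm nothing was lost
with open(path, encoding="utf-8") as file:
    round_trip = file.read().strip()

print(cp1252_ok)            # False
print(round_trip == title)  # True
```

Reading the file back with the same encoding matters too: opening a utf-8 file with a different default encoding can produce garbled text or a decode error.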

To reverse a list in Python:

Use [::-1].

[::-1] is a slicing technique used to reverse a sequence, such as a string, list, or tuple. Let’s break it down:

  • Slicing: Python’s slicing syntax is [start:stop:step].
  • start: The starting index of the slice (inclusive). If omitted, it defaults to the beginning of the sequence, or to the end when step is negative.
  • stop: The ending index of the slice (exclusive). If omitted, it defaults to the end of the sequence, or to the beginning when step is negative.
  • step: The increment between each index in the slice. If omitted, it defaults to 1.
  • [::-1]:
    • Omitting start and stop means the slice will cover the entire sequence.
    • -1 as the step value means the slice will iterate through the sequence in reverse order, from the last element to the first.

Therefore, [::-1] creates a reversed copy of the original sequence.
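A short sketch of the slicing behaviour described above. The movie titles are made-up sample data, not the scraped list.

```python
# Hypothetical sample data
movies = ["The Godfather", "Jaws", "Alien"]

# [::-1] returns a new, reversed list and leaves the original untouched
reversed_movies = movies[::-1]
print(reversed_movies)  # ['Alien', 'Jaws', 'The Godfather']
print(movies)           # ['The Godfather', 'Jaws', 'Alien']

# The same slice works on strings and tuples
reversed_text = "abc"[::-1]
reversed_tuple = (1, 2, 3)[::-1]
print(reversed_text)    # cba
print(reversed_tuple)   # (3, 2, 1)

# Explicit start, stop, and step: walk backwards from index 5, two at a time,
# stopping before index 0
every_other = [0, 1, 2, 3, 4, 5][5:0:-2]
print(every_other)      # [5, 3, 1]
```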

Use list.reverse(), which reverses the list in place and returns None.

Use a for loop and list.insert(), inserting each item at index 0 of a new list.
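The last two approaches could look like the sketch below. The helper name reverse_with_insert and the sample numbers are hypothetical, and inserting at index 0 is an assumption about how the loop is meant to be written.

```python
def reverse_with_insert(items):
    """Build a reversed copy by inserting each item at the front of a new list."""
    result = []
    for item in items:
        result.insert(0, item)  # each new item goes before the ones already added
    return result


numbers = [1, 2, 3, 4]  # hypothetical sample data

# list.reverse() changes the list in place and returns None
in_place = list(numbers)
return_value = in_place.reverse()
print(in_place)      # [4, 3, 2, 1]
print(return_value)  # None

# The loop version returns a new list and leaves the input unchanged
print(reverse_with_insert(numbers))  # [4, 3, 2, 1]
print(numbers)                       # [1, 2, 3, 4]
```

Because list.reverse() returns None, writing movie_list = movie_list.reverse() would discard the list; call it on its own line instead.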

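[::-1] and list.reverse() both run in O(n) time. A for loop that calls list.insert(0, item) is O(n²) overall, because each insertion at the front of a list has to shift every element already in it.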