100 Days of Code: Day 45
Notes from 100 Days of Code: The Complete Python Pro Bootcamp.
Beautiful Soup is a Python library for pulling data out of HTML and XML files.
BeautifulSoup: document.
Install BeautifulSoup: pip install beautifulsoup4
Import BeautifulSoup in Python:
from bs4 import BeautifulSoup
with open("website.html") as file:
    content = file.read()
soup = BeautifulSoup(content, "html.parser")
Use class_ instead of class to filter by the class attribute, because class is a reserved keyword in Python.
class_is_heading = soup.find_all(class_="heading") # find all the elements with class "heading"
print(class_is_heading)
Use CSS selectors to find specific elements.
name = soup.select_one(selector="#name") # find the first element with id "name"
print(name)
headings = soup.select(".heading") # find all the elements with class "heading"
print(headings)
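The calls above can be tried without a website.html file. This is a minimal sketch, assuming a made-up HTML string (sample_html and its contents are hypothetical and only serve the illustration). It shows that find_all(class_=...) and select(".heading") match the same tags, and that select_one returns None when nothing matches.

```python
from bs4 import BeautifulSoup

# Made-up HTML used only for this illustration (not the course's website.html).
sample_html = """
<html>
  <body>
    <h1 id="name">Sample Name</h1>
    <h3 class="heading">Books</h3>
    <h3 class="heading">Hobbies</h3>
    <p class="intro">Not a heading</p>
  </body>
</html>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# find_all with the class_ keyword returns every tag with that class.
by_keyword = soup.find_all(class_="heading")

# select with a CSS class selector matches the same tags.
by_selector = soup.select(".heading")

# select_one returns the first match, or None when nothing matches.
name_tag = soup.select_one("#name")
missing_tag = soup.select_one("#does-not-exist")

heading_texts = [tag.get_text() for tag in by_selector]
print(heading_texts)
print(name_tag.get_text())
print(missing_tag)
```

Checking for None before calling get_text() on a select_one result avoids an AttributeError when the selector matches nothing.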
Final Project
Generate an ascending “100 Greatest Movies” list from EMPIRE.
from bs4 import BeautifulSoup
from requests import get
response = get("https://web.archive.org/web/20200518073855/https://www.empireonline.com/movies/features/best-movies-2/")
website_html = response.text
soup = BeautifulSoup(website_html, "html.parser")
movie_list = [movie.getText() for movie in soup.find_all(name="h3", class_="title")]
movie_list.reverse()
with open("movies.txt", mode="w", encoding="utf-8") as file:
    for movie in movie_list:
        file.write(f"{movie}\n")
If you get the error UnicodeEncodeError: 'charmap' codec can't encode characters in position 10-11: character maps to <undefined>, it means the Python script is trying to write Unicode characters to a file using the cp1252 encoding, but some of those characters aren’t supported by cp1252.
How to fix it:
Understanding the Problem:
- Unicode: Python strings are Unicode by default, which means they can represent a wide range of characters from different languages.
- cp1252: This is a character encoding that’s often used on Windows systems. It’s a limited encoding that doesn’t support all Unicode characters.
- The Error: The error message tells you that the characters at positions 10 and 11 of the string you’re trying to write cannot be encoded using cp1252. This usually happens when you have special characters, accented letters, or symbols that aren’t in the cp1252 character set.
The most common and recommended solution is to use the utf-8 encoding, which supports virtually all Unicode characters.
So, add encoding="utf-8" to the open() call to make sure the script writes the file as utf-8 instead of falling back to the platform's default encoding.
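The failure and the fix can be reproduced on any platform by encoding explicitly. This is a small sketch, assuming a sample title chosen only because it contains characters outside cp1252; the file name movies_demo.txt is hypothetical and is written to a temporary directory.

```python
import os
import tempfile

# Sample title picked for illustration: the Japanese characters are not in cp1252.
title = "Seven Samurai (七人の侍)"

# Encoding with cp1252 raises the same error the script hits on Windows.
try:
    title.encode("cp1252")
    cp1252_ok = True
except UnicodeEncodeError:
    cp1252_ok = False

# Writing with an explicit utf-8 encoding works regardless of the platform default.
path = os.path.join(tempfile.mkdtemp(), "movies_demo.txt")
with open(path, mode="w", encoding="utf-8") as file:
    file.write(f"{title}\n")

# Read the file back with the same encoding to confirm nothing was lost.
with open(path, encoding="utf-8") as file:
    round_trip = file.read().rstrip("\n")

print(cp1252_ok)
print(round_trip == title)
```

Passing the same encoding when reading the file back matters as much as when writing it, otherwise the text may be decoded incorrectly.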
To reverse a list in Python:
Use [::-1].
[::-1] is a slicing technique used to reverse a sequence, such as a string, list, or tuple. Let’s break it down:
- Slicing: Python’s slicing syntax is [start:stop:step].
  - start: The starting index of the slice (inclusive). If omitted, it defaults to the beginning of the sequence (or to the end when step is negative).
  - stop: The ending index of the slice (exclusive). If omitted, it defaults to the end of the sequence (or to the beginning when step is negative).
  - step: The increment between each index in the slice. If omitted, it defaults to 1.
- [::-1]:
  - Omitting start and stop means the slice will cover the entire sequence.
  - -1 as the step value means the slice will iterate through the sequence in reverse order, from the last element to the first.
Therefore, [::-1] creates a reversed copy of the original sequence.
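A few slices on a short list make the roles of start, stop, and step concrete. This is an illustrative sketch; the list numbers and the variable names are made up for the example.

```python
numbers = [10, 20, 30, 40, 50]

# start and stop only: indices 1, 2, 3 (stop is exclusive).
middle = numbers[1:4]

# step only: every second element across the whole list.
every_other = numbers[::2]

# Negative step with start and stop omitted: the whole list, back to front.
reversed_copy = numbers[::-1]

# Negative step with explicit bounds: indices 3, 2, 1 (index 0 is excluded).
partial_reverse = numbers[3:0:-1]

# Slicing works the same way on strings and tuples.
word_reversed = "movie"[::-1]

print(middle, every_other, reversed_copy, partial_reverse, word_reversed)
```

The original list is left unchanged in every case, because slicing always builds a new object.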
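Use list.reverse(), which reverses the list in place and returns None.
Use a for loop and list.insert(), inserting each item at index 0 of a new list.
Slicing and list.reverse() both have O(n) time complexity. The for loop with list.insert(0, item) is O(n²), because every insert at the front shifts all the elements already in the list.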
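The three approaches can be compared side by side. This is a minimal sketch with a made-up list of placeholder titles; it also shows which approaches copy the list and which modify it in place. Note that insert(0, ...) has to shift the existing elements on every call, so the loop does more work per item than the other two.

```python
# Placeholder data for illustration only.
movies = ["C", "B", "A"]

# 1. Slicing: builds a new reversed list and leaves the original untouched.
sliced = movies[::-1]

# 2. list.reverse(): reverses in place and returns None, so work on a copy here.
in_place = list(movies)
result = in_place.reverse()

# 3. for loop with list.insert(): push each item to the front of a new list.
looped = []
for movie in movies:
    looped.insert(0, movie)

print(sliced, in_place, looped, result)
```

Assigning the return value of list.reverse() to a variable is a common mistake, since it is always None.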