
How do I read an HTML file with Python?

This Python BeautifulSoup tutorial is an introduction to the BeautifulSoup library. The examples find tags, traverse the document tree, modify the document, and scrape web pages.

BeautifulSoup is a Python library for parsing HTML and XML documents. It is often used for web scraping. BeautifulSoup transforms a complex HTML document into a tree of Python objects, such as tags, navigable strings, and comments.

Installing BeautifulSoup

We use the pip3 command to install the necessary modules.

$ sudo pip3 install lxml

We need to install the lxml module, which BeautifulSoup uses as its parser.

$ sudo pip3 install bs4

BeautifulSoup is installed with the above command.

In the examples, we will use the following HTML file:

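<html>
    <head>
        <title>Header</title>
    </head>

    <body>
        <h2>Operating systems</h2>

        <ul>
            <li>Solaris</li>
            <li>FreeBSD</li>
            <li>Debian</li>
            <li>NetBSD</li>
            <li>Windows</li>
        </ul>

        <p>FreeBSD is an advanced computer operating system used to power modern servers, desktops, and embedded platforms.</p>

        <p>Debian is a Unix-like computer operating system that is composed entirely of free software.</p>
    </body>
</html>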

Python BeautifulSoup simple example

In the first example, we use the BeautifulSoup module to get three tags.

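#!/usr/bin/python

from bs4 import BeautifulSoup

# Read the whole HTML file into a string.
with open('index.html', 'r') as f:

    contents = f.read()

    # Parse the markup with the lxml parser.
    soup = BeautifulSoup(contents, 'lxml')

    # Each attribute returns the first tag with that name.
    print(soup.h2)
    print(soup.head)
    print(soup.li)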

The code example prints the HTML code of three tags.

from bs4 import BeautifulSoup

We import the BeautifulSoup class from the bs4 module. BeautifulSoup is the main class for doing work.

with open('index.html', 'r') as f:

    contents = f.read()

We open the index.html file and read its contents with the read method.

soup = BeautifulSoup(contents, 'lxml')

A BeautifulSoup object is created from the file contents. The second argument specifies the parser.

print(soup.h2)
print(soup.head)

Here we print the HTML code of two tags: h2 and head.

print(soup.li)

There are multiple li elements; the line prints the first one.

$ ./simple.py
<h2>Operating systems</h2>
<head>
<title>Header</title>
</head>
<li>Solaris</li>
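The parser named in the second argument does not have to be lxml. Python ships with a built-in parser called 'html.parser', which needs no extra install. The following is a minimal sketch of the same lookup with that parser; the inline `html` string is an illustrative stand-in for index.html, not the tutorial's file.

```python
from bs4 import BeautifulSoup

# Illustrative stand-in for index.html (an assumption, not the original file).
html = """
<html>
  <head><title>Header</title></head>
  <body>
    <h2>Operating systems</h2>
    <ul><li>Solaris</li><li>FreeBSD</li></ul>
  </body>
</html>
"""

# 'html.parser' ships with Python; 'lxml' must be installed separately.
soup = BeautifulSoup(html, "html.parser")

first_h2 = soup.h2  # first <h2> tag in the document
first_li = soup.li  # first <li> tag, even though there are several

print(first_h2)
print(first_li)
```

Different parsers can build slightly different trees from malformed markup, so it is usually safer to pick one parser and keep it for a whole project.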

The name attribute of a tag gives its name, and the text attribute gives its text content.

The code example prints the HTML code, name, and text of a tag.
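A minimal sketch of reading the name and text attributes. The choice of the h2 tag and the inline markup are assumptions made for illustration.

```python
from bs4 import BeautifulSoup

# Inline markup used for illustration only.
soup = BeautifulSoup("<h2>Operating systems</h2>", "html.parser")

tag = soup.h2
tag_html = str(tag)   # the HTML code of the tag
tag_name = tag.name   # the tag's name
tag_text = tag.text   # the tag's text content

print(f"HTML: {tag_html}, name: {tag_name}, text: {tag_text}")
```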

BeautifulSoup traverse tags

BeautifulSoup can walk through the whole HTML document, visiting every node in the tree.

The example walks the document tree and prints the names of all the HTML tags that the document contains.
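One way to sketch such a traversal is with the `descendants` generator of the soup object, which yields every node in document order. This is an assumption about the approach; the method used in the original listing may differ. Text nodes are skipped with an `isinstance` check against `Tag`.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Inline markup used for illustration only.
html = (
    "<html><head><title>Header</title></head>"
    "<body><h2>Operating systems</h2><p>FreeBSD</p></body></html>"
)
soup = BeautifulSoup(html, "html.parser")

# Walk every node in document order and keep only real tags.
tag_names = [node.name for node in soup.descendants if isinstance(node, Tag)]

for name in tag_names:
    print(name)
```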

With the children attribute, we can get the children of a tag.

The example retrieves the children of the html tag, places them into a Python list, and prints them to the console. Since the children attribute also returns the whitespace between the tags, we add a condition to include only the tag names.

The html tag has two children: head and body.
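A minimal sketch of listing the direct children of the html tag. The inline markup is illustrative, and the `isinstance` check is one possible way to drop the whitespace strings between tags.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Inline markup used for illustration; note the whitespace between the tags.
html = """
<html>
  <head><title>Header</title></head>
  <body><h2>Operating systems</h2></body>
</html>
"""
soup = BeautifulSoup(html, "html.parser")

root = soup.html

# .children yields direct children only, including whitespace text nodes.
child_names = [child.name for child in root.children if isinstance(child, Tag)]

print(child_names)
```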

BeautifulSoup element descendants

With the descendants attribute, we get all descendants (children of all levels) of a tag.

The example retrieves and prints all descendants of one tag.
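A minimal sketch of collecting descendants at every depth. Starting from the body tag is an assumption made for illustration, as is the inline markup.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Inline markup used for illustration only.
html = (
    "<html><head><title>Header</title></head><body>"
    "<h2>Operating systems</h2>"
    "<ul><li>Solaris</li><li>FreeBSD</li></ul>"
    "<p>Debian</p>"
    "</body></html>"
)
soup = BeautifulSoup(html, "html.parser")

# .descendants reaches nested tags too, unlike .children.
body_descendants = [
    node.name for node in soup.body.descendants if isinstance(node, Tag)
]

print(body_descendants)
```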

BeautifulSoup web scraping

Requests is a simple Python HTTP library. It provides methods for accessing web resources over HTTP.

The example retrieves the title of a simple web page. It also prints the title's parent.

We get the HTML data of the page, then retrieve the HTML code of the title, the text of the title, and the HTML code of the title's parent.
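The following sketch separates the download from the parsing. The helper names `fetch_html` and `title_details` are hypothetical, and the page URL is left to the caller because the original one did not survive. The sample string stands in for a downloaded page so that the sketch runs without network access.

```python
from bs4 import BeautifulSoup


def fetch_html(url: str) -> str:
    """Download a page with Requests (imported lazily; needs `pip3 install requests`)."""
    import requests

    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return response.text


def title_details(html: str):
    """Return the title's HTML, its text, and the HTML of its parent tag."""
    soup = BeautifulSoup(html, "html.parser")
    title = soup.title
    return str(title), title.text, str(title.parent)


# Stand-in for fetch_html(url), an assumption so the sketch works offline.
sample = "<html><head><title>Header</title></head><body><p>Hi</p></body></html>"
title_html, title_text, parent_html = title_details(sample)

print(title_html)
print(title_text)
print(parent_html)
```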

With the prettify method, we can make the HTML code look better.

We prettify the HTML code of a simple web page.
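A minimal sketch of prettify on a one-line document; the inline markup is illustrative. The exact indentation can vary between BeautifulSoup versions, so no output is shown.

```python
from bs4 import BeautifulSoup

# A one-line document used for illustration only.
html = "<html><head><title>Header</title></head><body><h2>Operating systems</h2></body></html>"
soup = BeautifulSoup(html, "html.parser")

# prettify() returns a string with one tag or text node per line, indented by depth.
pretty = soup.prettify()

print(pretty)
```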

We can also serve HTML pages with Python's simple built-in HTTP server.

We create a directory and copy index.html there.

Then we start the Python HTTP server.

Now we get the document from the locally running server.
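The steps above can be sketched in a single script with the standard library's http.server module. This is an assumption about the setup, not the original commands: the temporary directory, the `QuietHandler` class, and the `fetch_from_local_server` helper are all hypothetical names, and the page is fetched with urllib rather than Requests to keep the sketch free of extra dependencies.

```python
import functools
import http.server
import tempfile
import threading
import urllib.request
from pathlib import Path

from bs4 import BeautifulSoup

# Illustrative page content (an assumption, not the tutorial's index.html).
PAGE = "<html><head><title>Header</title></head><body><h2>Operating systems</h2></body></html>"


class QuietHandler(http.server.SimpleHTTPRequestHandler):
    """Static file handler that does not log each request to stderr."""

    def log_message(self, format, *args):
        pass


def fetch_from_local_server(page: str) -> str:
    """Serve `page` as index.html from a temporary directory and fetch it back."""
    with tempfile.TemporaryDirectory() as directory:
        Path(directory, "index.html").write_text(page, encoding="utf-8")
        handler = functools.partial(QuietHandler, directory=directory)
        # Port 0 asks the operating system for any free port.
        with http.server.ThreadingHTTPServer(("127.0.0.1", 0), handler) as server:
            thread = threading.Thread(target=server.serve_forever, daemon=True)
            thread.start()
            try:
                url = f"http://127.0.0.1:{server.server_port}/index.html"
                # An empty ProxyHandler bypasses any proxy set in the environment.
                opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
                with opener.open(url, timeout=5) as response:
                    return response.read().decode("utf-8")
            finally:
                server.shutdown()
                thread.join()


html = fetch_from_local_server(PAGE)
soup = BeautifulSoup(html, "html.parser")
print(soup.title.text)
```

From a shell, `python3 -m http.server` run inside the directory serves the same files on port 8000 by default.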

With the find method, we can find elements by various means, including the element id.

The code example finds a tag by its id. The commented line shows an alternative way of doing the same task.
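A minimal sketch of both lookups. The `mylist` id and the ul tag are assumed values chosen for illustration, since the originals are not recoverable from the text.

```python
from bs4 import BeautifulSoup

# Inline markup with an assumed id value, for illustration only.
html = '<ul id="mylist"><li>Solaris</li><li>FreeBSD</li></ul>'
soup = BeautifulSoup(html, "html.parser")

# Keyword form: any attribute name can be passed as a keyword argument.
by_keyword = soup.find("ul", id="mylist")

# Alternative form: a dictionary of attributes.
by_attrs = soup.find("ul", attrs={"id": "mylist"})

print(by_keyword)
```

The dictionary form is handy for attribute names that are not valid Python keywords, such as data-* attributes.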

BeautifulSoup find all tags

With the find_all method, we can find all elements that meet some criteria.

The code example finds and prints all tags with a given name.
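A minimal sketch of find_all with a single tag name. Searching for li tags is an assumption made for illustration, as is the inline markup.

```python
from bs4 import BeautifulSoup

# Inline markup used for illustration only.
html = "<ul><li>Solaris</li><li>FreeBSD</li><li>Debian</li></ul>"
soup = BeautifulSoup(html, "html.parser")

# find_all returns a list-like result with every matching tag, in document order.
items = soup.find_all("li")
item_texts = [item.text for item in items]

for item in items:
    print(item)
```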

The find_all method can take a list of elements to search for.

The example finds all elements that match either of two tag names, one of them h2, and prints their text.
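A minimal sketch of passing a list to find_all. Pairing h2 with p is an assumption; only h2 is recoverable from the text.

```python
from bs4 import BeautifulSoup

# Inline markup used for illustration only.
html = (
    "<h2>Operating systems</h2>"
    "<ul><li>Solaris</li></ul>"
    "<p>FreeBSD is an advanced computer operating system.</p>"
    "<p>Debian is a Unix-like computer operating system.</p>"
)
soup = BeautifulSoup(html, "html.parser")

# A list matches any tag whose name is one of its entries.
tags = soup.find_all(["h2", "p"])
texts = [tag.text for tag in tags]

for text in texts:
    print(text)
```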

The find_all method can also take a function that determines which elements should be returned.

The example prints empty elements. The document contains only one empty element.
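A minimal sketch of a function filter. The predicate name `is_empty` is hypothetical, and the meta tag in the inline markup is an assumed example of an empty element, since the name of the original one is not recoverable.

```python
from bs4 import BeautifulSoup

# Inline markup for illustration; the <meta> tag is an assumed empty element.
html = (
    '<html><head><title>Header</title><meta charset="utf-8"></head>'
    "<body><h2>Operating systems</h2></body></html>"
)
soup = BeautifulSoup(html, "html.parser")


def is_empty(tag):
    """Return True for tags written without content, such as <meta> or <br>."""
    return tag.is_empty_element


# find_all calls the function once per tag and keeps those where it returns True.
empty_tags = soup.find_all(is_empty)
empty_names = [tag.name for tag in empty_tags]

print(empty_tags)
```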

It is also possible to find elements by using regular expressions.

The example prints the content of elements that contain the 'BSD' string.
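A minimal sketch of a regular-expression search. It assumes the `string` argument of find_all, which matches text nodes rather than tag names; the original listing may have been written differently.

```python
import re

from bs4 import BeautifulSoup

# Inline markup used for illustration only.
html = (
    "<ul><li>Solaris</li><li>FreeBSD</li><li>NetBSD</li></ul>"
    "<p>FreeBSD is an advanced operating system.</p>"
)
soup = BeautifulSoup(html, "html.parser")

# string=<compiled pattern> returns every text node that the pattern matches.
strings = soup.find_all(string=re.compile("BSD"))
matches = [s.strip() for s in strings]

for match in matches:
    print(match)
```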

With the select and select_one methods, we can use some CSS selectors to find elements.

This example uses a CSS selector to print the HTML code of the third li element.
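A minimal sketch of picking the third li element with a CSS selector. The `li:nth-of-type(3)` selector is an assumption about how the original expressed it, and the inline markup is illustrative.

```python
from bs4 import BeautifulSoup

# Inline markup used for illustration only.
html = "<ul><li>Solaris</li><li>FreeBSD</li><li>Debian</li><li>NetBSD</li></ul>"
soup = BeautifulSoup(html, "html.parser")

# select() always returns a list, even when only one element matches.
third = soup.select("li:nth-of-type(3)")

print(third)
```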

The # character is used in CSS to select tags by their id attributes.

The example prints the element that has the given id.
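A minimal sketch of an id selector with select_one, which returns the first match or None. The `mylist` id is an assumed value used for illustration.

```python
from bs4 import BeautifulSoup

# Inline markup with an assumed id value, for illustration only.
html = '<ul id="mylist"><li>Solaris</li><li>FreeBSD</li></ul><p id="note">Hi</p>'
soup = BeautifulSoup(html, "html.parser")

# '#mylist' matches the element whose id attribute equals "mylist".
selected = soup.select_one("#mylist")
missing = soup.select_one("#no-such-id")

print(selected)
```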

BeautifulSoup append element

The append method appends a new tag to the HTML document.

The example appends a new li tag.

First, we create a new tag. We then get a reference to the list tag, append the newly created tag to it, and print the list in a neat format.
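The steps above can be sketched as follows. The use of `new_tag` to create the element, the ul list, and the 'OpenBSD' text are assumptions made for illustration.

```python
from bs4 import BeautifulSoup

# Inline markup used for illustration only.
html = "<ul><li>Solaris</li><li>FreeBSD</li></ul>"
soup = BeautifulSoup(html, "html.parser")

# Create a detached <li> tag and give it some text (assumed value).
new_li = soup.new_tag("li")
new_li.string = "OpenBSD"

# Get the list tag and attach the new tag as its last child.
ul = soup.ul
ul.append(new_li)

appended_texts = [li.text for li in ul.find_all("li")]

print(ul.prettify())
```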

BeautifulSoup insert element

The insert method inserts a tag at the specified location.

The example inserts a li tag at the third position into the list tag.
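A minimal sketch of inserting at an index. The ul list and the 'OpenBSD' text are assumed values, and the markup is written without whitespace between the tags so that index 2 really is the third child.

```python
from bs4 import BeautifulSoup

# No whitespace between tags, so each child of <ul> is an <li>.
html = "<ul><li>Solaris</li><li>FreeBSD</li><li>Debian</li></ul>"
soup = BeautifulSoup(html, "html.parser")

new_li = soup.new_tag("li")
new_li.string = "OpenBSD"  # assumed text, for illustration

ul = soup.ul

# Positions are zero-based, so index 2 is the third slot.
ul.insert(2, new_li)

inserted_texts = [li.text for li in ul.find_all("li")]

print(ul.prettify())
```

When the source markup has line breaks and indentation between the tags, those whitespace strings also count as children, so the index has to account for them.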

BeautifulSoup replace text

The replace_with method replaces the text of an element.

The example finds a specific element with the find method and replaces its content with the replace_with method.
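A minimal sketch of the find and replace_with combination. Which text gets replaced is not recoverable from the source, so swapping 'Windows' for 'OpenBSD' is an assumption made for illustration.

```python
from bs4 import BeautifulSoup

# Inline markup used for illustration only.
html = "<ul><li>Solaris</li><li>Debian</li><li>Windows</li></ul>"
soup = BeautifulSoup(html, "html.parser")

# find(string=...) returns the matching text node, not the enclosing tag.
target = soup.find(string="Windows")

# Swap the text node for a new string; the surrounding <li> stays in place.
target.replace_with("OpenBSD")

replaced_texts = [li.text for li in soup.find_all("li")]

print(soup.ul.prettify())
```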

How do I fetch HTML content with Python?

The simplest solutions are as follows.

import requests
print(requests.get(url='https://google.com').text)

import urllib.request as r
page = r.urlopen('https://google.com')
