What is an HTML parser in Java?

Java Developer | Libraries

Easily parse HTML, extract specified elements, validate structure, and sanitize content.

By Mert Çalışkan

Today, developers of enterprise Java web applications use HTML in every aspect of a project. This work sometimes becomes difficult because parsing HTML content is a tedious task, and doing it without a parsing framework is a most unwelcome chore. Fortunately, a handful of Java-based HTML parsers are available. In this article, I will focus on one of my favorites, jsoup, which was first released as open source in January 2010. It has been under active development since then by Jonathan Hedley, and the code uses the liberal MIT license.

What It Is

jsoup can manipulate content: the HTML element itself, its attributes, or its text. It updates older content based on HTML 4.x to HTML5 or XHTML by converting deprecated tags to new versions. It can also perform whitelist-based cleaning, tidy the HTML output, and complete unbalanced tags automatically. I will demonstrate these features with some working examples.
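As a quick illustration of the automatic tag balancing mentioned above, the following sketch parses a fragment with unclosed tags and returns the tidied markup. It assumes jsoup is on the classpath. The class name TagBalancingSketch and the sample markup are made up for this illustration; Jsoup.parse(), Document.body(), and Element.html() are the library calls.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class TagBalancingSketch {

    // Parses possibly unbalanced HTML and returns the normalized body markup.
    static String tidy(String html) {
        Document document = Jsoup.parse(html);
        return document.body().html();
    }

    public static void main(String... args) {
        // Neither the <div> nor the <p> is closed in the input (hypothetical sample).
        System.out.println(tidy("<div><p>Lorem ipsum"));
    }
}
```

The printed markup should contain the closing </p> and </div> tags that the input lacked; the exact indentation depends on the document's output settings.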

All of the examples in this article are based on jsoup version 1.10.2, which was the latest available version at the time of writing. The complete source code for this article is available on GitHub.

DOM and jsoup Essentials

DOM is a language-independent representation of HTML documents that defines the structure and the styling of the document. Figure 1 shows the class diagram of the jsoup framework classes. Later, I'll show you how they map to the DOM elements.

The abstract class Node is the key element of jsoup. It represents a node in the DOM tree, which could be the document itself, a text node, a comment, or an element, including form elements, within the document. The Node class refers to its parent node and knows all of the parent's child nodes.

The Element class represents an HTML element, which consists of a tag name, attributes, and child nodes. The Attributes class is a container for the attributes of HTML elements and is composed within the Node class.
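The relationships between these classes can be explored with a few calls. This is a minimal sketch, assuming jsoup is on the classpath; the class name NodeModelSketch and the sample paragraph are hypothetical, while getElementById(), tagName(), parent(), attributes(), and childNodes() are existing jsoup methods.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;

class NodeModelSketch {

    // Parses a small, made-up fragment and returns its <p> element.
    static Element intro() {
        Document document = Jsoup.parse(
                "<p id=\"intro\" class=\"lead\">Hello <b>World</b></p>");
        return document.getElementById("intro");
    }

    public static void main(String... args) {
        Element p = intro();
        System.out.println("tag: " + p.tagName());
        // jsoup wraps the fragment in <html><body>, so the parent is body.
        System.out.println("parent: " + p.parent().nodeName());
        // The Attributes container holds id and class.
        System.out.println("attributes: " + p.attributes().size());
        // Child nodes include text nodes as well as elements.
        for (Node child : p.childNodes()) {
            System.out.println("child: " + child.nodeName());
        }
    }
}
```

Note that the text "Hello " is a child node of its own (reported as #text), which is why an element's child nodes are not limited to other elements.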

Figure 1. jsoup class diagram

Getting Started

You can obtain the latest version of jsoup from Maven's central repository with the following dependency definition. The current release runs on any Java version from Java 5 onward.

<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.10.2</version>
</dependency>

Gradle users can retrieve the artifact with

org.jsoup:jsoup:1.10.2

The main access point class, org.jsoup.Jsoup, is the principal way to use the functionality of jsoup. It provides base methods that can parse an HTML document passed to it as a file or an input stream, a string, or an HTML document provided through a URL. The example in Listing 1 parses HTML text and outputs first the node name of each element and then the text owned by that element, as shown immediately below the code.

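Listing 1.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class Example1Main {

    static String htmlText = "<!DOCTYPE html>" +
            "<html>" +
            "<head>" +
            "<title>Java Magazine</title>" +
            "</head>" +
            "<body>" +
            "<h2>Hello World!</h2>" +
            "</body>" +
            "</html>";

    public static void main(String... args) {
        Document document = Jsoup.parse(htmlText);
        // getAllElements() returns the document and every element beneath it
        Elements allElements = document.getAllElements();
        for (Element element : allElements) {
            // nodeName() is the tag name; ownText() excludes text of child elements
            System.out.println(element.nodeName()
                    + " " + element.ownText());
        }
    }
}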

The output is

#document

html 
head 
title Java Magazine
body 
h2 Hello World!
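Besides strings, the entry-point class also accepts input streams and files, as noted earlier. The sketch below parses the same markup from all three sources without touching the network. It assumes jsoup is on the classpath and Java 8 or later for the sketch itself; the class name ParseSourcesSketch, the temporary file name, and the sample markup are hypothetical. The overloads Jsoup.parse(String), Jsoup.parse(InputStream, String, String), and Jsoup.parse(File, String) are part of the library.

```java
import java.io.ByteArrayInputStream;
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;

class ParseSourcesSketch {

    static final String HTML =
            "<html><head><title>Java Magazine</title></head>"
            + "<body><h2>Hello World!</h2></body></html>";

    // Returns the page title as read from a string, a stream, and a file.
    static List<String> titles() {
        List<String> titles = new ArrayList<String>();
        byte[] bytes = HTML.getBytes(StandardCharsets.UTF_8);
        try {
            // 1. Parse directly from a string.
            titles.add(Jsoup.parse(HTML).title());

            // 2. Parse from an input stream; the empty base URI means
            //    relative links are left unresolved.
            try (InputStream in = new ByteArrayInputStream(bytes)) {
                titles.add(Jsoup.parse(in, "UTF-8", "").title());
            }

            // 3. Parse from a file with an explicit charset.
            File file = File.createTempFile("jsoup-sketch", ".html");
            try {
                Files.write(file.toPath(), bytes);
                titles.add(Jsoup.parse(file, "UTF-8").title());
            } finally {
                file.delete();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return titles;
    }

    public static void main(String... args) {
        for (String title : titles()) {
            System.out.println(title);
        }
    }
}
```

Parsing from a URL works the same way through Jsoup.connect(), which the listings that follow use.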


Ways to select DOM elements. jsoup provides several ways to iterate through the parsed HTML elements and find the requested ones. You can use either DOM-specific methods, such as getElementsByTag(), or CSS and jQuery-like selectors. I will demonstrate both approaches by parsing a web page and extracting all links that have HTML <a> tags. The code in Listing 2 parses the Java Champions bio page and extracts the link names for all the Java Champions marked as "New!" (see Figure 2).

Figure 2. Part of the parsed HTML page

The marking is done by adding a <font> tag with the text New! right next to the link. So, I check the content of the next sibling element of each link.

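Listing 2.

import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class Example2Main {

    public static void main(String... args)
            throws IOException {
        // Jsoup.connect() needs an absolute http or https URL;
        // timeout(0) disables the timeout
        Document document = Jsoup.connect(
                "http://java.net/website/" +
                "java-champions/bios.html")
                .timeout(0).get();

        Elements allElements =
                document.getElementsByTag("a");
        for (Element element : allElements) {
            // The "New!" marker is the element right after the link
            Element next = element.nextElementSibling();
            if (next != null
                    && "New!".equals(next.ownText())) {
                System.out.println(element.ownText());
            }
        }
    }
}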

The same extraction of links can also be done with selectors, as shown in Listing 3. This code extracts the links whose href value contains # (the *= operator in the selector matches a substring).

Listing 3.

import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class Example3Main {

    public static void main(String... args)
            throws IOException {
        Document document = Jsoup.connect(
                "http://java.net" +
                "/website/java-champions/bios.html")
                .timeout(0).get();
        // Select only the links whose href contains '#'
        Elements allElements =
                document.select("a[href*=#]");
        for (Element element : allElements) {
            Element next = element.nextElementSibling();
            if (next != null
                    && "New!".equals(next.ownText())) {
                System.out.println(element.ownText());
            }
        }
    }
}
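The attribute operators in a selector differ in what they match: *= matches a substring, ^= matches a prefix, and $= matches a suffix. The following sketch compares them on three made-up links, so it runs without a network connection. It assumes jsoup is on the classpath; the class name AttributeSelectorSketch and the sample links are hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class AttributeSelectorSketch {

    // Three hypothetical links: a pure fragment, a page with a fragment, a plain page.
    static final Document DOCUMENT = Jsoup.parse(
            "<a href=\"#top\">Top</a>"
            + "<a href=\"page.html#section\">Section</a>"
            + "<a href=\"index.html\">Home</a>");

    // Counts the elements matched by the given CSS query.
    static int count(String cssQuery) {
        return DOCUMENT.select(cssQuery).size();
    }

    public static void main(String... args) {
        System.out.println("href contains #:     " + count("a[href*=#]"));
        System.out.println("href starts with #:  " + count("a[href^=#]"));
        System.out.println("href ends with html: " + count("a[href$=html]"));
    }
}
```

With these sample links, the substring match finds two links, while the prefix and suffix matches find one each. If only same-page anchors are wanted, ^= is the stricter choice.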

Selectors are powerful compared with DOM-specific methods, and they can be combined to refine a selection. In the preceding code examples, we check for the New! text ourselves, which is simple but manual work. The example in Listing 4 selects the <font> tag that contains the New! text and that resides after a link whose href contains #. This really shows the power of selectors.

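Listing 4.

import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class Example4Main {

    public static void main(String... args)
            throws IOException {
        Document document = Jsoup.connect(
                "http://java.net" +
                "/website/java-champions/bios.html")
                .timeout(0).get();
        // A <font> whose own text contains "New!" and that follows,
        // as a sibling, a link whose href contains '#'
        Elements allElements = document.select(
                "a[href*=#] ~ font:containsOwn(New!)");
        for (Element element : allElements) {
            // Step back from the marker to the link itself
            System.out.println(element
                    .previousElementSibling()
                    .ownText());
        }
    }
}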

Here, the selector locates the <font> tag as the element. I then call the previousElementSibling() method on it to step one element back to the link. The select() method is available in the Document, Element, and Elements classes. Currently (as of version 1.10.2), jsoup does not support XPath queries on selectors. More information about selectors is available on the jsoup website.
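Because the Java Champions page requires a network connection, the combined selector can also be tried on an inline fragment. This is a minimal sketch, assuming jsoup is on the classpath; the class name SiblingSelectorSketch and the two sample entries (Alice and Bob) are hypothetical and only imitate the marking described above.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class SiblingSelectorSketch {

    // Made-up markup: only the first link is followed by a "New!" marker.
    static final String HTML =
            "<p><a href=\"#alice\">Alice</a><font color=\"red\">New!</font></p>"
            + "<p><a href=\"#bob\">Bob</a></p>";

    // Returns the link texts that have a "New!" <font> marker as a later sibling.
    static List<String> newNames() {
        Document document = Jsoup.parse(HTML);
        List<String> names = new ArrayList<String>();
        for (Element marker : document.select(
                "a[href*=#] ~ font:containsOwn(New!)")) {
            // The selector yields the marker; step back to reach the link.
            names.add(marker.previousElementSibling().ownText());
        }
        return names;
    }

    public static void main(String... args) {
        System.out.println(newNames());
    }
}
```

Only the first entry is reported, because the second link has no marker after it. Note that ~ matches any later sibling; the + combinator would restrict the match to the immediately following one.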

Traversing nodes. jsoup provides the NodeVisitor interface, which contains two methods: head() and tail(). By implementing an anonymous class from that interface and passing it as a parameter to the traverse() method, it is possible to have a callback when a node is first visited and when it is last visited. The code in Listing 5 uses this technique to traverse a simple HTML text and output all the node details.

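Listing 5.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Node;
import org.jsoup.select.NodeVisitor;

public class Example5Main {

    static String htmlText = "<!DOCTYPE html>" +
            "<html>" +
            "<head>" +
            "<title>Java Magazine</title>" +
            "</head>" +
            "<body>" +
            "<h2>Hello World!</h2>" +
            "</body>" +
            "</html>";

    public static void main(String... args) {
        Document document = Jsoup.parse(htmlText);

        document.traverse(new NodeVisitor() {
            // Called when a node is first reached, before its children
            @Override
            public void head(Node node, int depth) {
                System.out.println("Node start: "
                        + node.nodeName());
            }

            // Called after all of the node's children have been visited
            @Override
            public void tail(Node node, int depth) {
                System.out.println("Node end: " +
                        node.nodeName());
            }
        });
    }
}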

The output from this traversal is as follows:

Node start: #document

Node start: #doctype
Node end: #doctype
Node start: html
Node start: head
Node start: title
Node start: #text
Node end: #text
Node end: title
Node end: head
Node start: body
Node start: h2
Node start: #text
Node end: #text
Node end: h2
Node end: body
Node end: html
Node end: #document
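The depth argument passed to head() and tail() is not used in the traversal above, but it makes it easy to print a tree outline. The following sketch indents each element name by its depth. It assumes jsoup is on the classpath; the class name OutlineSketch and the sample list markup are hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.select.NodeVisitor;

class OutlineSketch {

    // Builds an indented outline of the elements below (and including) <body>.
    static String outline(String html) {
        final StringBuilder out = new StringBuilder();
        Jsoup.parse(html).body().traverse(new NodeVisitor() {
            @Override
            public void head(Node node, int depth) {
                // Skip text and comment nodes; keep elements only.
                if (node instanceof Element) {
                    for (int i = 0; i < depth; i++) {
                        out.append("  ");
                    }
                    out.append(node.nodeName()).append('\n');
                }
            }

            @Override
            public void tail(Node node, int depth) {
                // Nothing to do when leaving a node in this sketch.
            }
        });
        return out.toString();
    }

    public static void main(String... args) {
        System.out.print(outline("<ul><li>One</li><li>Two</li></ul>"));
    }
}
```

The node on which traverse() is called is reported at depth 0, so body is printed without indentation, ul with one level, and each li with two.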

Parsing XML files. jsoup supports parsing of XML files with a built-in XML parser. The example in Listing 6 parses an XML text and outputs it with appropriate formatting. Note once again how easily this is accomplished.

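Listing 6.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.parser.Parser;

public class Example6Main {

    static String xml =
         "" +
         "xxx" +
         "yyy" +
         "xxx" +
         "zzz" +
         "";

    public static void main(String... args) {
        // Parser.xmlParser() selects the XML parser; the empty string is the base URI
        Document doc =
          Jsoup.parse(xml, "", Parser.xmlParser());
        System.out.println(doc.toString());
    }
}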

As you would expect, the output is the XML text printed with appropriate formatting.

It is also possible to use selectors to pick values from specified XML tags. The code snippet in Listing 7 selects tags that reside inside specified parent tags.

Listing 7.
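To illustrate selecting values from nested XML tags, here is a minimal sketch. The XML document, its tag names (catalog, book, title, year), and the class name XmlSelectSketch are hypothetical and chosen only for illustration; Jsoup.parse() with Parser.xmlParser() and select() are the library calls. It assumes jsoup is on the classpath.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.parser.Parser;

class XmlSelectSketch {

    // Hypothetical XML; the tag names are illustrative only.
    static final String XML =
            "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
            + "<catalog>"
            + "<book><title>Dune</title><year>1965</year></book>"
            + "<book><title>Emma</title><year>1815</year></book>"
            + "</catalog>";

    // Returns the text of every <title> that is a direct child of a <book>.
    static List<String> titles() {
        Document document = Jsoup.parse(XML, "", Parser.xmlParser());
        List<String> titles = new ArrayList<String>();
        for (Element title : document.select("book > title")) {
            titles.add(title.text());
        }
        return titles;
    }

    public static void main(String... args) {
        System.out.println(titles());
    }
}
```

The same selector syntax used for HTML applies here; the XML parser simply leaves the markup as written instead of wrapping it in html and body elements.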

Preventing XSS attacks. Many sites prevent cross-site scripting (XSS) attacks by prohibiting the user from submitting HTML content or by enforcing the use of an alternative markup syntax, such as Markdown. A clever solution for preventing malicious HTML input is to use a WYSIWYG editor and filter the HTML output with jsoup's whitelist sanitizer. The whitelist sanitizer parses the HTML, iterates through it, and removes the unwanted tags, attributes, or values according to the whitelist built into the framework.
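As a rough sketch of the filtering step, the code below runs untrusted markup through Jsoup.clean() with the predefined Whitelist.basic() list. It assumes jsoup 1.10.2 on the classpath; later jsoup releases renamed the Whitelist class to Safelist, so the import may need adjusting there. The class name SanitizeSketch and the sample input are hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Whitelist;

class SanitizeSketch {

    // Returns the input with everything outside the basic whitelist removed.
    static String sanitize(String untrustedHtml) {
        return Jsoup.clean(untrustedHtml, Whitelist.basic());
    }

    public static void main(String... args) {
        // Made-up input with an event-handler attribute and a script element.
        String input = "<p>Hi <b onclick=\"steal()\">there</b>"
                + "<script>alert(1)</script></p>";
        System.out.println(sanitize(input));
    }
}
```

The allowed formatting (the paragraph and the bold text) should survive, while the onclick attribute and the script element are dropped. Filtering on the server side matters even when a WYSIWYG editor is used, because a client can submit arbitrary markup.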

Ví dụ trong việc liệt kê 8 định nghĩa một phương thức kiểm tra làm sạch văn bản HTML theo danh sách trắng văn bản đơn giản. Danh sách này, như bạn sẽ thấy trong một khoảnh khắc, chỉ cho phép định dạng văn bản đơn giản với các thẻ HTML:

public class Example2Main {


    public static void main[String... args] 
        throws IOException {
        Document document = Jsoup.connect[
            "//java.net/website/" + 
            "java-champions/bios.html" ]
            .timeout[0].get[];

        Elements allElements = 
            document.getElementsByTag["a"];
        for [Element element : allElements] {
            if ["New!".equals[
                 element.nextElementSibling[]!=null 
                 ? element.nextElementSibling[]
                   .ownText[]
                 : ""]] {
                   System.out.println[
                       element.ownText[]];
            }
        }
    }
}

public class Example2Main {


    public static void main[String... args] 
        throws IOException {
        Document document = Jsoup.connect[
            "//java.net/website/" + 
            "java-champions/bios.html" ]
            .timeout[0].get[];

        Elements allElements = 
            document.getElementsByTag["a"];
        for [Element element : allElements] {
            if ["New!".equals[
                 element.nextElementSibling[]!=null 
                 ? element.nextElementSibling[]
                   .ownText[]
                 : ""]] {
                   System.out.println[
                       element.ownText[]];
            }
        }
    }
}

public class Example2Main {


    public static void main[String... args] 
        throws IOException {
        Document document = Jsoup.connect[
            "//java.net/website/" + 
            "java-champions/bios.html" ]
            .timeout[0].get[];

        Elements allElements = 
            document.getElementsByTag["a"];
        for [Element element : allElements] {
            if ["New!".equals[
                 element.nextElementSibling[]!=null 
                 ? element.nextElementSibling[]
                   .ownText[]
                 : ""]] {
                   System.out.println[
                       element.ownText[]];
            }
        }
    }
}

public class Example2Main {


    public static void main[String... args] 
        throws IOException {
        Document document = Jsoup.connect[
            "//java.net/website/" + 
            "java-champions/bios.html" ]
            .timeout[0].get[];

        Elements allElements = 
            document.getElementsByTag["a"];
        for [Element element : allElements] {
            if ["New!".equals[
                 element.nextElementSibling[]!=null 
                 ? element.nextElementSibling[]
                   .ownText[]
                 : ""]] {
                   System.out.println[
                       element.ownText[]];
            }
        }
    }
}

Listing 8 defines a test method that cleans up HTML text according to a simple text whitelist. This list, as you will see in a moment, allows only simple text formatting with HTML tags: b, em, i, strong, and u.

Listing 8.

The Whitelist class provides prebuilt lists such as simpleText(), which limits the HTML to the previous elements. There are other acceptance options, such as none(), basic(), basicWithImages(), and relaxed().
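A minimal sketch of cleaning with the simple text whitelist, assuming jsoup 1.10.2 (org.jsoup:jsoup:1.10.2) is on the classpath. The class name, method name, and input string are illustrative assumptions, not the article's original listing. In jsoup 1.14.1 and later the class is named Safelist instead of Whitelist.

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Whitelist;

public class SimpleTextCleanSketch {

    // Keeps only the tags permitted by Whitelist.simpleText() (b, em, i, strong, u).
    static String cleanSimpleText(String dirtyHtml) {
        return Jsoup.clean(dirtyHtml, Whitelist.simpleText());
    }

    public static void main(String... args) {
        // Hypothetical input: an allowed tag (b), a disallowed tag (a), and a script.
        String dirty = "<b>Java</b> <a href='http://example.com'>Magazine</a>"
                + "<script>alert('x');</script>";
        System.out.println(cleanSimpleText(dirty));
    }
}
```

With this input, the b element should survive, the a element should be reduced to its text, and the script should be dropped entirely.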

Listing 9 shows an example of the usage of basic(), which allows these HTML tags: a, b, blockquote, br, cite, code, dd, dl, dt, em, i, li, ol, p, pre, q, small, span, strike, strong, sub, sup, u, and ul.

Listing 9.

As seen in the test, the script call is removed, and tags that are not allowed are also removed. In addition, jsoup automatically completes unbalanced tags, such as the missing closing tag in our example.
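The behavior described above can be sketched as follows, assuming jsoup 1.10.2 and its Whitelist.basic() list. The class name, method name, and input markup are hypothetical and chosen only to exercise the three effects: a script, a tag outside the basic list (img), and an unclosed p element.

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Whitelist;

public class BasicCleanSketch {

    // Cleans the fragment against Whitelist.basic(); the input string itself is not changed.
    static String cleanBasic(String dirtyHtml) {
        return Jsoup.clean(dirtyHtml, Whitelist.basic());
    }

    public static void main(String... args) {
        // Hypothetical input: the p element is never closed, the link carries an
        // onclick handler, and img and script are not part of the basic list.
        String dirty = "<p>Read <a href='http://example.com/' onclick='steal()'>this</a>"
                + " <img src='x.png'><script>steal();</script>";
        System.out.println(cleanBasic(dirty));
    }
}
```

The link and its href should be kept while the onclick attribute, the img element, and the script are dropped, and the output should contain a closing tag for the p element.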

Sự kết luận

Bài viết này, trước đây đã xuất hiện trên tạp chí Java nhưng đã được cập nhật ở đây, chỉ hiển thị một tập hợp con của những gì Jsoup có thể làm. Nó cũng cung cấp các tính năng như TIDING HTML, thao tác các thẻ HTML Thẻ hoặc văn bản, v.v. Nói cách khác, bất kỳ xử lý HTML nào bạn có thể cần làm là một ứng cử viên có khả năng sử dụng JSOUP.

This article was originally published in Java Magazine.

Mert Çalişkan (@0hjc) is a Java Champion and coauthor of PrimeFaces Cookbook and Beginning Spring (Wiley Publications). He is the founder of AnkaraJUG, which is the most active Java user group in Turkey.

What is the best HTML parser?

cheerio. A fast, flexible, and lean implementation of core jQuery designed specifically for the server. TypeScript definitions: built in.

parse5. An HTML parsing and serialization toolset for Node.js, compliant with the WHATWG HTML Living Standard (also known as HTML5).

htmlparser2. A fast and forgiving HTML and XML parser.

Why is jsoup used?

jsoup can parse HTML files, input streams, URLs, or even strings. It eases data extraction from HTML by offering Document Object Model (DOM) traversal methods and CSS and jQuery-like selectors. jsoup can manipulate the content: the HTML element itself, its attributes, or its text.
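A minimal sketch of those capabilities (parsing a string, selecting with a CSS selector, and changing an attribute and text) might look like the following. The class name JsoupUsageSketch, the sample markup, and the rel value are made up for illustration, and the jsoup library is assumed to be on the classpath.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JsoupUsageSketch {

    // Sample markup invented for this sketch; it is not taken from the article's listings.
    static final String HTML =
            "<html><head><title>Java Magazine</title></head>"
            + "<body><h1 id=\"greeting\">Hello World!</h1>"
            + "<a href=\"/issues\">Issues</a></body></html>";

    // Parse a string and read the text of the first element matching a CSS selector.
    static String headingText() {
        Document doc = Jsoup.parse(HTML);
        Element heading = doc.select("h1#greeting").first();
        return heading.text();
    }

    // Add an attribute to the first link and replace its text, then return the link.
    static Element rewrittenLink() {
        Document doc = Jsoup.parse(HTML);
        Element link = doc.select("a[href]").first();
        link.attr("rel", "nofollow");
        link.text("All issues");
        return link;
    }

    public static void main(String[] args) {
        System.out.println(Jsoup.parse(HTML).title());
        System.out.println(headingText());
        System.out.println(rewrittenLink().outerHtml());
    }
}
```

Each helper parses its own copy of the markup, so the edits made in rewrittenLink do not leak into headingText. In real code, select(...).first() returns null when nothing matches and should be checked.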

How do I add HTML code to a file in Java?

Writing HTML to a file from Java begins with code along these lines:

import java.awt.Desktop;
import java.io.*;

class ShowGeneratedHtml {
    public static void main(String[] args) throws Exception {
        File f = new File("source.htm");
        BufferedWriter bw = new BufferedWriter(new FileWriter(f));
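The fragment above stops right after the writer is opened. As a hedged, self-contained sketch of the same idea, the following writes a small page and closes the writer. The class name WriteHtmlSketch, the writeHtml helper, and the page content are assumptions made for illustration; the sketch uses java.nio.file instead of FileWriter and leaves out the java.awt.Desktop step that would open the result in a browser.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class WriteHtmlSketch {

    // Write a tiny HTML page to the given path; title and heading are caller-supplied.
    static void writeHtml(Path target, String title, String heading) throws IOException {
        // try-with-resources flushes and closes the writer even if a write fails.
        try (BufferedWriter bw = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            bw.write("<!DOCTYPE html>");
            bw.newLine();
            bw.write("<html><head><title>" + title + "</title></head>");
            bw.newLine();
            bw.write("<body><h1>" + heading + "</h1></body></html>");
            bw.newLine();
        }
    }

    public static void main(String[] args) throws IOException {
        // A temporary file keeps the sketch from overwriting anything in the working directory.
        Path target = Files.createTempFile("generated", ".htm");
        writeHtml(target, "Java Magazine", "Hello World!");
        System.out.println("Wrote " + target);
    }
}
```

The title and heading are inserted without escaping, so this sketch is only suitable for trusted text.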

What does jsoup clean do?

clean. Creates a new, clean document, from the original dirty document, containing only elements allowed by the safelist. The original document is not modified. Only elements from the dirty document's body are used.
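The behavior described above can be sketched as follows. This is an illustration under stated assumptions: it presumes a recent jsoup release in which the class is named Safelist (older releases, such as the 1.10.2 version used in the article, call it Whitelist), and the class name CleanerSketch and the sample input are invented here.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.safety.Cleaner;
import org.jsoup.safety.Safelist;

public class CleanerSketch {

    // Untrusted sample input invented for this sketch: a div wrapper and a script element.
    static final String DIRTY_HTML =
            "<div><p>Hello <b>World</b>!</p><script>alert('x')</script></div>";

    // Build a new document that holds only what Safelist.basic() permits.
    static Document clean(Document dirty) {
        return new Cleaner(Safelist.basic()).clean(dirty);
    }

    public static void main(String[] args) {
        Document dirty = Jsoup.parse(DIRTY_HTML);
        Document cleaned = clean(dirty);
        // The cleaned body keeps p and b, but not div or script.
        System.out.println(cleaned.body().html());
        // The dirty document is left untouched and still has its script element.
        System.out.println(dirty.select("script").size());
    }
}
```

Safelist.basic() permits simple text-formatting tags such as p and b, so the div wrapper and the script are dropped while the paragraph text survives.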
