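Taming HTML with Java: Write a Small Crawler in Minutes

Author | HelloGitHub-秦人

Source | HelloGitHub (ID: GitHub520)

This installment of HelloGitHub's "Explaining Open Source Projects" series introduces jsoup, an open-source Java framework for parsing web page elements that lets a program fetch web page data automatically.

Project source: https://github.com/jhy/jsoup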

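Project Introduction

jsoup is an HTML parser for Java. It can parse the HTML content at a URL directly, and it provides a convenient API for extracting and manipulating data through DOM traversal, CSS selectors, and jQuery-like methods.

Main features of jsoup:

Parse HTML from a URL, a file, or a string.
Find and extract data using DOM traversal or CSS selectors.
Manipulate HTML elements, attributes, and text.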
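
The three features listed above can be tried offline before touching any real site. The following is a minimal sketch, assuming jsoup is on the classpath; the HTML string, the class name JsoupFeaturesSketch, and the selector are made up for illustration and do not come from any real page.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

class JsoupFeaturesSketch {

    // Hypothetical HTML used only for this illustration.
    static final String HTML =
            "<html><head><title>Demo</title></head><body>"
          + "<ul id=\"books\">"
          + "<li><a href=\"/book/1\" title=\"First\">Book One</a></li>"
          + "<li><a href=\"/book/2\" title=\"Second\">Book Two</a></li>"
          + "</ul></body></html>";

    static Document parse() {
        // 1. Parse HTML from a string. Jsoup.connect(url).get() and
        //    Jsoup.parse(file, charsetName) cover the URL and file cases.
        return Jsoup.parse(HTML);
    }

    public static void main(String[] args) {
        Document doc = parse();

        // 2. Find and extract data with a CSS selector.
        Elements links = doc.select("#books > li > a");
        for (Element link : links) {
            System.out.println(link.text() + " -> " + link.attr("href"));
        }

        // 3. Manipulate an element: replace its text and one attribute.
        Element first = links.first();
        first.text("Book One (updated)");
        first.attr("title", "Changed");
        System.out.println(first.outerHtml());
    }
}
```

Parsing from a string keeps the experiment independent of network access, which makes it a convenient way to test selectors.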

Using the Framework

2.1 Prerequisites

Know HTML syntax
Know how to debug pages in the Chrome browser
Know the basics of the IntelliJ IDEA development tool

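2.2 Studying the Source Code

Import the project into IDEA, which automatically downloads the dependencies the Maven project needs. The structure of the source project is as follows:

Getting up to speed on a code base quickly is an essential skill for every programmer. Here is the approach I use:

Read the project's README to find out quickly what the project does.
Skim the project's pom.xml to see which dependencies it uses.
Look at the project structure, the source directory, and the test directory. A good project has a clear structure with well-defined layers.
Run the tests or examples to get a quick feel for the project.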

2.3 Downloading the Project

git clone https://github.com/jhy/jsoup

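2.4 Running the Project's Example Code

Following the steps above, we quickly find that the example directory holds the sample code, so let's run it directly. Note: some of the samples need small changes before they will run.

For example, jsoup's Wikipedia sample: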

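import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class Wikipedia {

    public static void main(String[] args) throws IOException {
        // Fetch and parse the Wikipedia front page.
        Document doc = Jsoup.connect("http://en.wikipedia.org/").get();
        log(doc.title());

        // Links inside bold text in the "In the news" box.
        Elements newsHeadlines = doc.select("#mp-itn b a");
        for (Element headline : newsHeadlines) {
            log("%s\n\t%s", headline.attr("title"), headline.absUrl("href"));
        }
    }

    private static void log(String msg, String... vals) {
        System.out.println(String.format(msg, vals));
    }
}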

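Explanation: the code above fetches the page (http://en.wikipedia.org/), selects every element matching the selector (#mp-itn b a), and prints each element's title and href attributes. Wikipedia is not accessible from mainland China, so this code fails when run there.

The modified, runnable version is: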

public static void main(String[] args) throws IOException {
    Document doc = Jsoup.connect("https://www.baidu.com/").get();
    // Select every anchor that has an href attribute.
    Elements newsHeadlines = doc.select("a[href]");
    for (Element headline : newsHeadlines) {
        System.out.println("href: " + headline.absUrl("href"));
    }
}
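
Both versions print absUrl("href") rather than attr("href"). The difference is easy to see offline: attr returns the attribute exactly as written in the page, while absUrl resolves it against the document's base URI. Here is a small sketch, assuming jsoup is on the classpath; the HTML fragment and the class name AbsUrlSketch are invented for the illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class AbsUrlSketch {

    static String[] hrefs() {
        // Hypothetical fragment standing in for one headline inside #mp-itn.
        String html = "<div id=\"mp-itn\"><b>"
                + "<a href=\"/wiki/Example\" title=\"Example\">Example</a>"
                + "</b></div>";
        // The second argument is the base URI used to resolve relative links.
        Document doc = Jsoup.parse(html, "https://en.wikipedia.org/");
        Element link = doc.select("#mp-itn b a").first();
        return new String[] { link.attr("href"), link.absUrl("href") };
    }

    public static void main(String[] args) {
        String[] result = hrefs();
        System.out.println("attr:   " + result[0]); // the relative path as written
        System.out.println("absUrl: " + result[1]); // resolved against the base URI
    }
}
```

When a page is loaded with Jsoup.connect(url).get(), the base URI is set from the request URL, so absUrl works without any extra setup.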

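How It Works

jsoup works as follows: you specify a URL, the framework sends an HTTP request and receives the response page, and you then extract data from the page with selectors. The whole workflow is shown in the figure below:

Taking the example above:

3.1 Sending the Request

Document doc = Jsoup.connect("https://www.baidu.com/").get();

This line sends the HTTP request and retrieves the page returned in the response.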

3.2 Filtering the Data

Elements newsHeadlines = doc.select("a[href]");

Define a selector and get the elements that match it.

3.3 Processing the Data

for (Element headline : newsHeadlines) {
    System.out.println("href: " + headline.absUrl("href"));
}

Here the data is simply printed, but it could just as well be written to a file or a database.
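
As one way to go beyond printing, the collected links can be written to a text file with the JDK alone. This is only a sketch: the class name LinkFileWriterSketch, the file name links.txt, and the sample links are assumptions for illustration. In the loop above, the list would be filled from headline.absUrl("href").

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

class LinkFileWriterSketch {

    // Writes one link per line (UTF-8) and returns the path written to.
    static Path writeLinks(List<String> links, Path target) throws IOException {
        return Files.write(target, links, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Sample values; a crawler would collect these from the parsed page.
        List<String> links = Arrays.asList(
                "https://www.baidu.com/",
                "https://news.baidu.com/");
        Path out = writeLinks(links, Paths.get("links.txt")); // hypothetical file name
        System.out.println("wrote " + Files.readAllLines(out).size() + " lines to " + out);
    }
}
```

Files.write replaces the file if it already exists, which suits a crawler that regenerates its output on every run.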

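Hands-On Practice

The goal is to fetch the basic information for every book listed under Douban Books -> New Book Express: title, cover image link, author, content summary (from the detail page), author bio (from the detail page), and the Dangdang price (from the detail page), and then save the collected data to an Excel file.

Target URL: https://book.douban.com/latest?icn=index-latestbook-all

4.1 The Project's pom.xml

The project pulls in three libraries: jsoup, lombok, and easyexcel.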

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>org.example</groupId>
    <artifactId>JsoupTest</artifactId>
    <version>1.0-SNAPSHOT</version>

    <properties>
        <maven.compiler.target>1.8</maven.compiler.target>
        <maven.compiler.source>1.8</maven.compiler.source>
    </properties>

    <dependencies>
        <!-- The dependency versions are missing from this listing; add a <version> to each entry. -->
        <dependency>
            <groupId>org.jsoup</groupId>
            <artifactId>jsoup</artifactId>
        </dependency>
        <dependency>
            <groupId>org.projectlombok</groupId>
            <artifactId>lombok</artifactId>
        </dependency>
        <dependency>
            <groupId>com.alibaba</groupId>
            <artifactId>easyexcel</artifactId>
        </dependency>
    </dependencies>
</project>

4.2 Parsing the Page Data

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class BookInfoUtils {

    public static List<BookEntity> getBookInfoList(String url) throws IOException {
        List<BookEntity> bookEntities = new ArrayList<>();
        Document doc = Jsoup.connect(url).get();
        // Each <li> in the list holds one book.
        Elements liDiv = doc.select("#content > div > div.article > ul > li");
        for (Element li : liDiv) {
            Elements urls = li.select("a[href]");
            Elements imgUrl = li.select("a > img");
            Elements bookName = li.select(" div > h2 > a");
            Elements starsCount = li.select(" div > p.rating > span.font-small.color-lightgray");
            Elements author = li.select("div > p.color-gray");
            Elements description = li.select(" div > p.detail");

            // The first link in the item points to the book's detail page.
            String bookDetailUrl = urls.get(0).attr("href");
            BookDetailInfo detailInfo = getDetailInfo(bookDetailUrl);

            BookEntity bookEntity = BookEntity.builder()
                    .detailPageUrl(bookDetailUrl)
                    .bookImgUrl(imgUrl.attr("src"))
                    .bookName(bookName.html())
                    .starsCount(starsCount.html())
                    .author(author.text())
                    .bookDetailInfo(detailInfo)
                    .description(description.html())
                    .build();
            // System.out.println(bookEntity);
            bookEntities.add(bookEntity);
        }
        return bookEntities;
    }

    /**
     * Fetches a book's detail page and extracts the author and price.
     *
     * @param detailUrl URL of the book's detail page
     * @return the author, author link, and price found on that page
     * @throws IOException if the page cannot be fetched
     */
    public static BookDetailInfo getDetailInfo(String detailUrl) throws IOException {
        Document doc = Jsoup.connect(detailUrl).get();
        Elements content = doc.select("body");
        Elements price = content.select("#buyinfo-printed > ul.bs.current-version-list > li:nth-child(2) > div.cell.price-btn-wrapper > div.cell.impression_track_mod_buyinfo > div.cell.price-wrapper > a > span");
        Elements author = content.select("#info > span:nth-child(1) > a");

        BookDetailInfo bookDetailInfo = BookDetailInfo.builder()
                .author(author.html())
                .authorUrl(author.attr("href"))
                .price(price.html())
                .build();
        return bookDetailInfo;
    }
}
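
The article does not show the BookEntity and BookDetailInfo classes. Given the lombok dependency and the builder calls in the listing, they are presumably Lombok classes annotated with @Data and @Builder, like ColumnData in section 4.3. For readers unfamiliar with Lombok, the following hand-written stand-in shows roughly what that pair of annotations provides for the three fields used in getDetailInfo. The class name BookDetailInfoSketch is hypothetical, and the sketch omits the setters, equals, hashCode, and toString that @Data would also generate.

```java
class BookDetailInfoSketch {

    private final String author;
    private final String authorUrl;
    private final String price;

    private BookDetailInfoSketch(Builder builder) {
        this.author = builder.author;
        this.authorUrl = builder.authorUrl;
        this.price = builder.price;
    }

    // Entry point of the fluent builder, matching the builder() call in the listing.
    static Builder builder() {
        return new Builder();
    }

    String getAuthor() { return author; }
    String getAuthorUrl() { return authorUrl; }
    String getPrice() { return price; }

    static class Builder {
        private String author;
        private String authorUrl;
        private String price;

        // Each method stores one value and returns the builder so calls can be chained.
        Builder author(String author) { this.author = author; return this; }
        Builder authorUrl(String authorUrl) { this.authorUrl = authorUrl; return this; }
        Builder price(String price) { this.price = price; return this; }

        BookDetailInfoSketch build() {
            return new BookDetailInfoSketch(this);
        }
    }

    public static void main(String[] args) {
        // Sample values for illustration only.
        BookDetailInfoSketch info = BookDetailInfoSketch.builder()
                .author("Some Author")
                .authorUrl("https://example.com/author")
                .price("42.00")
                .build();
        System.out.println(info.getAuthor() + " / " + info.getPrice());
    }
}
```

With Lombok the whole class shrinks to the three field declarations plus the two annotations, which is why the article relies on it.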

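The key step here is obtaining the selector for each element on the page.

For example, in li.select("div > p.color-gray"), how do we know that the selector is div > p.color-gray?

Chrome users have probably guessed already. Open Chrome's developer tools, press Ctrl + Shift + C to pick an element, then right-click the highlighted node in the HTML panel and choose Copy -> Copy selector to get that element's selector, as shown in the figure below: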
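
A selector copied from the browser typically pins one specific element, often with :nth-child(...), so it usually needs to be generalized before it matches every item in a list. The sketch below illustrates this offline, assuming jsoup is on the classpath. The HTML is invented and only loosely modeled on the structure implied by the selectors in this article, and CopiedSelectorSketch is a hypothetical name.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class CopiedSelectorSketch {

    // Hypothetical markup with two list items, one per book.
    static final String HTML =
            "<div id=\"content\"><div><div class=\"article\"><ul>"
          + "<li><div><h2><a href=\"/b/1\">Book A</a></h2>"
          + "<p class=\"color-gray\">Author A</p></div></li>"
          + "<li><div><h2><a href=\"/b/2\">Book B</a></h2>"
          + "<p class=\"color-gray\">Author B</p></div></li>"
          + "</ul></div></div></div>";

    // Returns how many elements the selector matches in the sample markup.
    static int count(String selector) {
        Document doc = Jsoup.parse(HTML);
        return doc.select(selector).size();
    }

    public static void main(String[] args) {
        // Roughly what a copied selector looks like: it targets the first item only.
        String copied = "#content > div > div.article > ul > li:nth-child(1) > div > p.color-gray";
        // Dropping :nth-child(1) makes it match the same element in every item.
        String generalized = "#content > div > div.article > ul > li > div > p.color-gray";
        System.out.println("copied matches:      " + count(copied));
        System.out.println("generalized matches: " + count(generalized));
    }
}
```

The article's code takes the same approach in two steps: it first selects every li, then applies the short relative selector div > p.color-gray inside each one.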

4.3 Saving the Data to Excel

To make the data easier to inspect, I save the data scraped with jsoup to an Excel file, using easyexcel to generate the file quickly.

Excel header definition

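import com.alibaba.excel.annotation.ExcelProperty;
import lombok.Builder;
import lombok.Data;

// One row of the sheet; each @ExcelProperty value becomes a column header.
@Data
@Builder
public class ColumnData {

    @ExcelProperty("Title")
    private String bookName;

    @ExcelProperty("Rating")
    private String starsCount;

    @ExcelProperty("Author")
    private String author;

    @ExcelProperty("Cover image")
    private String bookImgUrl;

    @ExcelProperty("Description")
    private String description;

    @ExcelProperty("Price")
    private String price;
}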

Generating the Excel file

import java.util.ArrayList;
import java.util.List;

import com.alibaba.excel.EasyExcel;

public class EasyExcelUtils {

    public static void simpleWrite(List<BookEntity> bookEntityList) {
        // Backslashes in a Windows path must be escaped inside a Java string literal.
        String fileName = "D:\\devEnv\\JsoupTest\\bookList" + System.currentTimeMillis() + ".xlsx";
        EasyExcel.write(fileName, ColumnData.class).sheet("Book Details").doWrite(data(bookEntityList));
        System.out.println("Excel file generated...");
    }

    // Flattens each BookEntity into one row of the sheet.
    private static List<ColumnData> data(List<BookEntity> bookEntityList) {
        List<ColumnData> list = new ArrayList<>();
        bookEntityList.forEach(b -> {
            ColumnData data = ColumnData.builder()
                    .bookName(b.getBookName())
                    .starsCount(b.getStarsCount())
                    .author(b.getBookDetailInfo().getAuthor())
                    .bookImgUrl(b.getBookImgUrl())
                    .description(b.getDescription())
                    .price(b.getBookDetailInfo().getPrice())
                    .build();
            list.add(data);
        });
        return list;
    }
}

4.4 Final Result

The final result is shown in the figure below:

That takes us from idea to practice, using jsoup's basic operations in a real example.

Full code: https://github.com/hellowHuaairen/JsoupTest

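Conclusion

jsoup, the Java HTML parser library, turns out to be quite handy as a simple crawler, doesn't it?

Why talk about crawlers at all? The era of big data and artificial intelligence runs on data, so data matters. As people with some technical skill, we should have a way to collect data from the web. Tools such as Fiddler and webscraper can also capture the data you want.

By this point you should have a feel for jsoup. Isn't programming fun? With the hands-on example above as a reference, there are plenty of websites to practice on.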