Nutch Development (Part 3)

Published: 2024/9/19

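Contents

- Nutch Development (Part 3)
  - Development environment
  - 1. Nutch URL filtering
  - 2. Example
  - 3. Building the index in Solr
    - Solr field configuration
  - 4. Nutch plugins
  - 5. Nutch default configuration
  - 6. Using the metadata plugin
  - 7. Nutch 2.4 storage configuration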

Development environment

Linux, Ubuntu 20.04 LTS
IDEA
Nutch 1.18
Solr 8.11

Please credit the source when reposting. By 鴨梨的藥丸哥

1. Nutch URL filtering

Nutch's URL filtering rules live mainly in regex-urlfilter.txt. Editing that file customizes which URLs the crawler keeps and which it filters out.

    # The default url filter.
    # Better for whole-internet crawling.
    # Please comment/uncomment rules to your needs.

    # Each non-comment, non-blank line contains a regular expression
    # prefixed by '+' or '-'. The first matching pattern in the file
    # determines whether a URL is included or ignored. If no pattern
    # matches, the URL is ignored.

    # '+' keeps a URL, '-' filters it out
    # the first regular expression that matches decides the URL's fate
    # rules are tried from top to bottom
    # a URL that matches no rule is filtered out by default

    # filter out file, ftp, mailto and similar URLs
    # skip file: ftp: and mailto: urls
    -^(?:file|ftp|mailto):

    # skip URLs longer than 2048 characters, see also db.max.outlink.length
    #-^.{2049,}

    # skip image and other suffixes we can't yet parse
    # for a more extensive coverage use the urlfilter-suffix plugin
    # filter out URLs for images, css, js and similar resources
    -(?i)\.(?:gif|jpg|png|ico|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|exe|jpeg|bmp|js|svg)$

    # skip URLs containing certain characters as probable queries, etc.
    # filter out dynamic pages
    -[?*!@=]
    #-[!@]

    # filter out looping URLs, such as http://www.baidu.com/p/x/p/x/p/
    # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
    -.*(/[^/]+)/[^/]+\1/[^/]+\1/

    # then accept all other URLs
    # accept anything else
    #+.
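The first-match behaviour described in the file's comments can be sketched in a few lines of Python. This is an illustration of the rule semantics only, not Nutch's own code: `load_rules` and `accepts` are hypothetical helper names, the rule set is a shortened copy of the file above, and the final `+.` is left enabled (as in the stock file) so that something gets accepted. It also assumes a rule matches when its pattern is found anywhere in the URL.

```python
import re

# Shortened rule set for illustration (assumption: "+." is enabled here).
SAMPLE_RULES = r"""
# skip file: ftp: and mailto: urls
-^(?:file|ftp|mailto):
-(?i)\.(?:gif|jpg|png|ico|css|js|svg)$
-[?*!@=]
-.*(/[^/]+)/[^/]+\1/[^/]+\1/
+.
"""

def load_rules(text):
    """Parse regex-urlfilter style lines into (accept, compiled_pattern) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comments carry no rule
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError(f"rule must start with '+' or '-': {line!r}")
        rules.append((sign == "+", re.compile(pattern)))
    return rules

def accepts(rules, url):
    """The first matching rule decides; a URL matching no rule is ignored."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False

RULES = load_rules(SAMPLE_RULES)

for url in ("https://example.com/index.html",
            "https://example.com/search?q=nutch",
            "https://example.com/logo.PNG"):
    print(url, "->", "keep" if accepts(RULES, url) else "drop")
```

With `+.` commented out, as in the file shown above, every URL that is not explicitly accepted by an earlier `+` rule falls through and is dropped.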

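2. Example

Rule order matters, because the first regular expression that matches decides whether a URL is kept or filtered out. Add the rules below to the file. With them in place, Nutch can crawl the posts on several blogging sites.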

    # append at the end of the file
    # first, include the blog home pages
    +^https://www\.(cnblogs|jianshu|csdn|oschina)\.(net|com)$
    +^https://cloud\.tencent\.com/developer$
    +^https://developer\.aliyun\.com$
    +^https://segmentfault\.com$

    # next, include each blog's own domain and article path format
    +^https://blog\.csdn\.net/[^/]+/article/details/.*
    +^https://my\.oschina\.net/.*/blog/.*
    +^https://cloud\.tencent\.com/developer/article/.+
    +^https://www\.jianshu\.com/p/.+
    +^https://www\.cnblogs\.com/.+/p/.+
    +^https://developer\.aliyun\.com/article/.+
    +^https://segmentfault\.com/a/.+

    # finally, exclude everything else under those home pages
    -^https://developer\.aliyun\.com/.+
    -^https://segmentfault\.com/.+
    -^https://cloud\.tencent\.com/developer/.+
    -^https://www\.(cnblogs|jianshu|csdn|oschina)\.(net|com)/.+
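To see why the include rules have to come before the exclude rules, here is a small Python sketch that runs a subset of the rules above against a few URLs. It only mimics the first-match logic and is not how Nutch itself is invoked; `decide` is a hypothetical helper, hostname dots are escaped in this sketch, and the jianshu and cnblogs paths are made-up sample URLs.

```python
import re

# A subset of the rules, in file order; '+' keeps a URL, '-' drops it.
RULES = [
    r"+^https://www\.(cnblogs|jianshu|csdn|oschina)\.(net|com)$",
    r"+^https://blog\.csdn\.net/[^/]+/article/details/.*",
    r"+^https://www\.jianshu\.com/p/.+",
    r"+^https://www\.cnblogs\.com/.+/p/.+",
    r"-^https://www\.(cnblogs|jianshu|csdn|oschina)\.(net|com)/.+",
]

def decide(rules, url):
    """Return the sign of the first matching rule, or '-' if none matches."""
    for rule in rules:
        if re.search(rule[1:], url):
            return rule[0]
    return "-"

# Same rules, but with the exclude rule moved to the top.
REORDERED = [RULES[-1]] + RULES[:-1]

SAMPLE_URLS = [
    "https://www.cnblogs.com",                                   # home page
    "https://blog.csdn.net/musicmtv/article/details/22758817",   # article
    "https://www.jianshu.com/p/abc123",                          # article (hypothetical)
    "https://www.jianshu.com/u/someone",                         # profile (hypothetical)
]

for url in SAMPLE_URLS:
    print(decide(RULES, url), decide(REORDERED, url), url)
```

In the reordered list the jianshu article is dropped, because the broad exclude rule matches before the narrower include rule is ever tried.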

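3. Building the index in Solr

Nutch can submit indexes to a variety of full-text search servers. This comes from its plugin-based design: by including the relevant plugins, Nutch can index the crawled data on a search server with little effort.

Use the nutch script under bin/: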

./nutch solrindex ../nutch/crawldb/ -dir ../nutch/segments/ -deleteGone

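The location of the Solr server was covered in an earlier post, but to repeat it here: in Nutch 1.18, all configuration related to index building lives in the index-writers.xml configuration file, which can be found in the conf/ directory.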
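As a quick way to check which Solr URL a writer is configured with, the sketch below reads the parameters of one writer from an index-writers.xml style document. The XML sample is an assumption: a simplified fragment modeled on the writer/parameters/param layout, with a hypothetical core URL, and it is not copied from the shipped file. The shipped file may also declare an XML namespace, which this sketch does not handle. `writer_params` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Simplified, assumed sample; the URL and values are placeholders.
INDEX_WRITERS_SAMPLE = """
<writers>
  <writer id="indexer_solr_1"
          class="org.apache.nutch.indexwriter.solr.SolrIndexWriter">
    <parameters>
      <param name="type" value="http"/>
      <param name="url" value="http://localhost:8983/solr/nutch"/>
      <param name="commitSize" value="1000"/>
    </parameters>
  </writer>
</writers>
"""

def writer_params(xml_text, writer_id):
    """Return the name/value parameters of the writer with the given id."""
    root = ET.fromstring(xml_text)
    for writer in root.iter("writer"):
        if writer.get("id") == writer_id:
            return {p.get("name"): p.get("value") for p in writer.iter("param")}
    return {}  # no writer with that id

params = writer_params(INDEX_WRITERS_SAMPLE, "indexer_solr_1")
print("Solr URL:", params.get("url"))
```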

Solr field configuration

To see which fields Nutch will index, run the nutch script command below. (The URL is another post of mine, about using the IK analyzer in Solr.)

./nutch indexchecker https://blog.csdn.net/musicmtv/article/details/22758817

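Unlike earlier releases such as Nutch 1.8, Nutch 1.18 does not provide a schema.xml, the configuration file used when setting up the Solr core.

Below is the schema.xml from Nutch 2.4 for reference. Configure the individual fields to suit your own setup; the schema.xml files from other Nutch versions are also useful references.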

    <fields>
      <field name="_version_" type="long" indexed="true" stored="true"/>
      <field name="id" type="string" stored="true" indexed="true" required="true"/>
      <field name="batchId" type="string" stored="true" indexed="false"/>
      <field name="digest" type="string" stored="true" indexed="false"/>
      <field name="boost" type="float" stored="true" indexed="false"/>
      <field name="host" type="url" stored="false" indexed="true"/>
      <field name="url" type="url" stored="true" indexed="true"/>
      <field name="content" type="text_general" stored="true" indexed="true"/>
      <field name="title" type="text_general" stored="true" indexed="true" multiValued="true"/>
      <field name="cache" type="string" stored="true" indexed="false"/>
      <field name="tstamp" type="date" stored="true" indexed="false"/>
      <field name="text" type="text_general" stored="false" indexed="true" multiValued="true"/>
      <field name="anchor" type="text_general" stored="true" indexed="true" multiValued="true"/>
      <field name="type" type="string" stored="true" indexed="true" multiValued="true"/>
      <field name="contentLength" type="string" stored="true" indexed="false"/>
      <field name="lastModified" type="date" stored="true" indexed="false"/>
      <field name="date" type="tdate" stored="true" indexed="true"/>
      <dynamicField name="meta_*" type="string" stored="true" indexed="true"/>
      <field name="lang" type="string" stored="true" indexed="true"/>
      <field name="subcollection" type="string" stored="true" indexed="true" multiValued="true"/>
      <field name="author" type="string" stored="true" indexed="true"/>
      <field name="tag" type="string" stored="true" indexed="true" multiValued="true"/>
      <field name="feed" type="string" stored="true" indexed="true"/>
      <field name="publishedDate" type="date" stored="true" indexed="true"/>
      <field name="updatedDate" type="date" stored="true" indexed="true"/>
      <field name="cc" type="string" stored="true" indexed="true" multiValued="true"/>
      <field name="tld" type="string" stored="false" indexed="false"/>
      <field name="rawcontent" type="string" stored="true" indexed="false"/>
    </fields>
    <uniqueKey>id</uniqueKey>
    <defaultSearchField>text</defaultSearchField>
    <solrQueryParser defaultOperator="OR"/>
    <copyField source="content" dest="text"/>
    <copyField source="url" dest="text"/>
    <copyField source="title" dest="text"/>
    <copyField source="anchor" dest="text"/>
    <copyField source="author" dest="text"/>
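When adapting a schema like this, it helps to list which fields are stored and which ones are copied into the default search field `text`. The Python sketch below does that for a trimmed copy of the schema above. The wrapping `<schema>` root element is added here only so the fragment parses on its own, and `copy_sources` and `stored_fields` are hypothetical helper names.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the schema above, wrapped in an assumed <schema> root.
SCHEMA_FRAGMENT = """
<schema>
  <fields>
    <field name="url" type="url" stored="true" indexed="true"/>
    <field name="content" type="text_general" stored="true" indexed="true"/>
    <field name="title" type="text_general" stored="true" indexed="true" multiValued="true"/>
    <field name="text" type="text_general" stored="false" indexed="true" multiValued="true"/>
    <field name="tstamp" type="date" stored="true" indexed="false"/>
  </fields>
  <copyField source="content" dest="text"/>
  <copyField source="url" dest="text"/>
  <copyField source="title" dest="text"/>
</schema>
"""

def copy_sources(schema_xml, dest):
    """Names of the fields whose values are copied into `dest`."""
    root = ET.fromstring(schema_xml)
    return [c.get("source") for c in root.iter("copyField") if c.get("dest") == dest]

def stored_fields(schema_xml):
    """Names of the fields declared with stored="true"."""
    root = ET.fromstring(schema_xml)
    return [f.get("name") for f in root.iter("field") if f.get("stored") == "true"]

print("copied into text:", copy_sources(SCHEMA_FRAGMENT, "text"))
print("stored:", stored_fields(SCHEMA_FRAGMENT))
```

In this fragment `text` is indexed but not stored, so it can be searched but its contents are not returned with results; the stored source fields are returned instead.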

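4. Nutch plugins

Nutch's functionality can be extended by adding plugins of various types. The available plugins can be found in the plugins/ directory. To choose which of them are used, set the plugin.includes property in conf/nutch-site.xml.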
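The value of plugin.includes is a regular expression over plugin ids, so adding a plugin means adding another alternative to that expression. The Python sketch below shows the idea. The expression is an assumption modeled on a typical default, so check nutch-default.xml for the value in your version; `enabled` is a hypothetical helper, and the sketch assumes a plugin id must match the whole expression.

```python
import re

# Assumed value, modeled on a typical default; verify against nutch-default.xml.
PLUGIN_INCLUDES = (
    "protocol-http|urlfilter-(regex|validator)|parse-(html|tika)"
    "|index-(basic|anchor)|indexer-solr|scoring-opic"
    "|urlnormalizer-(pass|regex|basic)"
)

def enabled(plugin_ids, includes):
    """Return the plugin ids that fully match the includes expression."""
    pattern = re.compile(includes)
    return [pid for pid in plugin_ids if pattern.fullmatch(pid)]

CANDIDATES = ["urlfilter-regex", "parse-metatags", "indexer-solr", "index-metadata"]

# With the assumed default, the two metadata-related plugins are not selected.
print(enabled(CANDIDATES, PLUGIN_INCLUDES))

# Appending alternatives to the expression switches them on.
WITH_METADATA = PLUGIN_INCLUDES + "|parse-metatags|index-metadata"
print(enabled(CANDIDATES, WITH_METADATA))
```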

5. Nutch default configuration

All of Nutch's default settings can be found in nutch-default.xml. Read that file to learn what can be configured, and add properties to conf/nutch-site.xml to override the defaults.
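The override relationship between the two files can be pictured as a dictionary merge: defaults are read first, then site values replace any property with the same name. The Python sketch below illustrates this with two tiny inline documents in the `<configuration>`/`<property>` format. It is a model of the behaviour, not Nutch's loader; the property values are illustrative, `my-test-crawler` is a made-up agent name, and `read_properties` and `effective_config` are hypothetical helpers.

```python
import xml.etree.ElementTree as ET

# Illustrative stand-ins for nutch-default.xml and nutch-site.xml.
DEFAULT_XML = """
<configuration>
  <property><name>http.agent.name</name><value></value></property>
  <property><name>fetcher.threads.fetch</name><value>10</value></property>
</configuration>
"""

SITE_XML = """
<configuration>
  <property><name>http.agent.name</name><value>my-test-crawler</value></property>
</configuration>
"""

def read_properties(xml_text):
    """Map each property name to its value (empty string if no value text)."""
    root = ET.fromstring(xml_text)
    return {p.findtext("name"): p.findtext("value", default="")
            for p in root.iter("property")}

def effective_config(default_xml, site_xml):
    """Start from the defaults, then let site properties override them."""
    merged = read_properties(default_xml)
    merged.update(read_properties(site_xml))
    return merged

CONFIG = effective_config(DEFAULT_XML, SITE_XML)
for name, value in sorted(CONFIG.items()):
    print(f"{name} = {value!r}")
```

Only the properties that differ from the defaults need to appear in nutch-site.xml; everything else keeps its value from nutch-default.xml.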

6. Using the metadata plugin

I wrote a separate post on how to use it; see the link below.

Nutch: capturing meta tag data from pages with the metadata plugin (鴨梨的藥丸哥's blog, CSDN)

7. Nutch 2.4 storage configuration

I wrote a post on this as well; the link is below.

Nutch 2.4 storage configuration (鴨梨的藥丸哥's blog, CSDN)
