Nutch Development (Part 3)
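Table of contents
- Nutch Development (Part 3)
- Development environment
- 1. Nutch URL filtering
- 2. Example
- 3. Building an index in Solr
- Solr field configuration
- 4. About Nutch plugins
- 5. Nutch's default configuration
- 6. Using the metadata plugin
- 7. Nutch 2.4 storage configuration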
Development environment
- Linux (Ubuntu 20.04 LTS)
- IDEA
- Nutch 1.18
- Solr 8.11
Please credit the source when reposting! By 鴨梨的藥丸哥
1. Nutch URL filtering
Nutch's URL filtering rules live mainly in regex-urlfilter.txt. Editing this file lets you customize which URLs the crawler keeps and which it drops.
    # The default url filter.
    # Better for whole-internet crawling.
    # Please comment/uncomment rules to your needs.

    # Each non-comment, non-blank line contains a regular expression
    # prefixed by '+' or '-'. The first matching pattern in the file
    # determines whether a URL is included or ignored. If no pattern
    # matches, the URL is ignored.
    # ('+' keeps the URL, '-' drops it; rules are tried from top to bottom.)

    # skip file: ftp: and mailto: urls
    -^(?:file|ftp|mailto):

    # skip URLs longer than 2048 characters, see also db.max.outlink.length
    #-^.{2049,}

    # skip image and other suffixes we can't yet parse
    # for a more extensive coverage use the urlfilter-suffix plugin
    -(?i)\.(?:gif|jpg|png|ico|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|exe|jpeg|bmp|js|svg)$

    # skip URLs containing certain characters as probable queries, etc.
    # (this filters out dynamic pages)
    -[?*!@=]
    #-[!@]

    # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
    # (for example http://www.baidu.com/p/x/p/x/p/)
    -.*(/[^/]+)/[^/]+\1/[^/]+\1/

    # accept anything else
    #+.

2. Example
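Rule order matters, because the first regular expression that matches decides whether a URL is kept or dropped. Add the following filter rules to the file; with them in place, Nutch can crawl the posts on several blogging sites.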
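The first-match behaviour can be sketched in Python as follows. This is only an illustration under stated assumptions, not Nutch's implementation: `load_rules` and `accept` are hypothetical helper names, the rule text is a small subset of the rules used in this post, and Python's regex dialect differs slightly from the Java one Nutch uses.

```python
import re

# Hypothetical subset of regex-urlfilter.txt, in file order.
RULES_TEXT = r"""
# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
+^https://blog.csdn.net/[^/]+/article/details/.*
-^https://www\.(cnblogs|jianshu|csdn|oschina)\.(net|com)/.+
"""

def load_rules(text):
    """Parse '+regex' / '-regex' lines into (keep, compiled_pattern) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comments carry no rule
        if line[0] not in "+-":
            raise ValueError("rule must start with '+' or '-': " + line)
        rules.append((line[0] == "+", re.compile(line[1:])))
    return rules

def accept(url, rules):
    """Return the verdict of the first matching rule; drop the URL if none match."""
    for keep, pattern in rules:
        if pattern.search(url):
            return keep
    return False

rules = load_rules(RULES_TEXT)
for url in (
    "https://blog.csdn.net/someone/article/details/123",
    "https://blog.csdn.net/someone/article/details/123?utm=x",
    "https://www.csdn.net/nav/java",
    "https://example.org/",
):
    print(url, "->", "keep" if accept(url, rules) else "drop")
```

The second URL is dropped even though it matches the `+` rule for CSDN articles, because the earlier `-[?*!@=]` rule matches first. This is why the article-path rules have to come before the broad exclusion rules for the same sites.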
    # append these to the end of the file
    # first, include the blog home pages
    +^https://www\.(cnblogs|jianshu|csdn|oschina)\.(net|com)$
    +^https://cloud.tencent.com/developer$
    +^https://developer.aliyun.com$
    +^https://segmentfault.com$

    # then include each blog's own domain and article path format
    +^https://blog.csdn.net/[^/]+/article/details/.*
    +^https://my\.oschina\.net/.*/blog/.*
    +^https://cloud.tencent.com/developer/article/.+
    +^https://www.jianshu.com/p/.+
    +^https://www.cnblogs.com/.+/p/.+
    +^https://developer.aliyun.com/article/.+
    +^https://segmentfault.com/a/.+

    # finally, exclude every other resource on those sites apart from the home pages
    -^https://developer.aliyun.com/.+
    -^https://segmentfault.com/.+
    -^https://cloud.tencent.com/developer/.+
    -^https://www\.(cnblogs|jianshu|csdn|oschina)\.(net|com)/.+

3. Building an index in Solr
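Nutch can submit crawled documents to a variety of full-text search servers for indexing. This comes from Nutch's plugin-based design: by including the right plugins, Nutch can easily build an index of the crawled data on a search server.
Use the nutch script under bin/: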
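    ./nutch solrindex ../nutch/crawldb/ -dir ../nutch/segments/ -deleteGone

The location of the Solr server was covered in an earlier post, but to repeat it here: in Nutch 1.18, all settings related to building the index live in the index-writers.xml configuration file, which can be found in the conf/ directory.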
    <writer id="indexer_solr_1" class="org.apache.nutch.indexwriter.solr.SolrIndexWriter">
      <!-- server location and related settings -->
      <parameters>
        <param name="type" value="http"/>
        <param name="url" value="http://localhost:8983/solr/nutch"/>
        <param name="collection" value=""/>
        <param name="weight.field" value=""/>
        <param name="commitSize" value="1000"/>
        <param name="auth" value="false"/>
        <param name="username" value="username"/>
        <param name="password" value="password"/>
      </parameters>
      <!-- field mappings are configured here -->
      <mapping>
        <!-- copy the value of one field and append it to the value of another field -->
        <copy>
          <!-- <field source="content" dest="search"/> -->
          <!-- <field source="title" dest="title,search"/> -->
        </copy>
        <!-- renames for the fields used by the index-metadata plugin -->
        <rename>
          <field source="metatag.description" dest="description"/>
          <field source="metatag.keywords" dest="keywords"/>
        </rename>
        <!-- remove fields -->
        <remove>
          <field source="segment"/>
        </remove>
      </mapping>
    </writer>

Solr field configuration
To see which fields Nutch will index for a page, run the nutch script command below. (The URL is another post of mine, about using the IK analyzer in Solr.)
    ./nutch indexchecker https://blog.csdn.net/musicmtv/article/details/22758817

Unlike older releases such as Nutch 1.8, Nutch 1.18 does not provide a schema.xml to use as the configuration file when creating the Solr core.
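Below is the schema.xml file from Nutch 2.4 for reference. The actual field configuration should follow your own requirements, but the schema.xml shipped with other Nutch versions is a useful guide.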
    <!-- fields used by the various plugins in Nutch 2.4 -->
    <fields>
      <!-- This field is used internally by Solr, for example by features like
           partial update functionality and update log. It is NOT required
           if updateLog is turned off in your updateHandler, however it is advised
           to include it as performance improvements are minimal. -->
      <field name="_version_" type="long" indexed="true" stored="true"/>
      <field name="id" type="string" stored="true" indexed="true" required="true"/>

      <!-- core fields -->
      <field name="batchId" type="string" stored="true" indexed="false"/>
      <field name="digest" type="string" stored="true" indexed="false"/>
      <field name="boost" type="float" stored="true" indexed="false"/>

      <!-- fields for index-basic plugin -->
      <field name="host" type="url" stored="false" indexed="true"/>
      <field name="url" type="url" stored="true" indexed="true"/>
      <!-- stored=true for highlighting, use term vectors and positions for fast highlighting -->
      <field name="content" type="text_general" stored="true" indexed="true"/>
      <field name="title" type="text_general" stored="true" indexed="true" multiValued="true"/>
      <field name="cache" type="string" stored="true" indexed="false"/>
      <field name="tstamp" type="date" stored="true" indexed="false"/>

      <!-- catch-all field -->
      <field name="text" type="text_general" stored="false" indexed="true" multiValued="true"/>

      <!-- fields for index-anchor plugin -->
      <field name="anchor" type="text_general" stored="true" indexed="true" multiValued="true"/>

      <!-- fields for index-more plugin -->
      <field name="type" type="string" stored="true" indexed="true" multiValued="true"/>
      <field name="contentLength" type="string" stored="true" indexed="false"/>
      <field name="lastModified" type="date" stored="true" indexed="false"/>
      <field name="date" type="tdate" stored="true" indexed="true"/>

      <!-- fields for index-metadata plugin -->
      <dynamicField name="meta_*" type="string" stored="true" indexed="true"/>

      <!-- fields for languageidentifier plugin -->
      <field name="lang" type="string" stored="true" indexed="true"/>

      <!-- fields for subcollection plugin -->
      <field name="subcollection" type="string" stored="true" indexed="true" multiValued="true"/>

      <!-- fields for feed plugin (tag is also used by microformats-reltag) -->
      <field name="author" type="string" stored="true" indexed="true"/>
      <field name="tag" type="string" stored="true" indexed="true" multiValued="true"/>
      <field name="feed" type="string" stored="true" indexed="true"/>
      <field name="publishedDate" type="date" stored="true" indexed="true"/>
      <field name="updatedDate" type="date" stored="true" indexed="true"/>

      <!-- fields for creativecommons plugin -->
      <field name="cc" type="string" stored="true" indexed="true" multiValued="true"/>

      <!-- fields for tld plugin -->
      <field name="tld" type="string" stored="false" indexed="false"/>

      <!-- fields for index-html plugin
           Note: although raw document content may be binary,
           index-html adds a String to the index field -->
      <field name="rawcontent" type="string" stored="true" indexed="false"/>
    </fields>

    <uniqueKey>id</uniqueKey>
    <defaultSearchField>text</defaultSearchField>
    <solrQueryParser defaultOperator="OR"/>

    <!-- copyField commands copy one field to another at the time a document
         is added to the index. It's used either to index the same field differently,
         or to add multiple fields to the same field for easier/faster searching. -->
    <copyField source="content" dest="text"/>
    <copyField source="url" dest="text"/>
    <copyField source="title" dest="text"/>
    <copyField source="anchor" dest="text"/>
    <copyField source="author" dest="text"/>

4. About Nutch plugins
Nutch can be extended by adding plugins of various kinds. The available plugins can be found in the plugins/ directory. To choose which ones to use, set the plugin.includes property by adding it to conf/nutch-site.xml.
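As a rough illustration of how a plugin.includes value selects plugins, the sketch below matches a regular expression against plugin directory names. It assumes the whole name has to match the expression; `selected_plugins`, the shortened pattern, and the list of directory names are hypothetical and only serve the example.

```python
import re

# Shortened plugin.includes value, for illustration only.
PLUGIN_INCLUDES = (
    r"protocol-http|urlfilter-regex|parse-(html|tika|metatags)"
    r"|index-(basic|anchor|metadata)|indexer-solr"
)

# Hypothetical plugin directory names.
PLUGIN_DIRS = [
    "protocol-http",
    "protocol-ftp",
    "parse-html",
    "parse-js",
    "index-metadata",
    "indexer-solr",
    "indexer-elastic",
]

def selected_plugins(includes, names):
    """Return the names that match the includes expression in full."""
    pattern = re.compile(includes)
    return [name for name in names if pattern.fullmatch(name)]

for name in selected_plugins(PLUGIN_INCLUDES, PLUGIN_DIRS):
    print(name)
```

A name such as `parse-js` is left out because no alternative in the expression covers it, so enabling another plugin means adding its directory name to the expression.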
    <property>
      <name>plugin.includes</name>
      <!-- use a regular expression to select the plugins you need -->
      <value>protocol-http|protocol-httpclient|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|metadata)|indexer-solr|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
      <description>Regular expression naming plugin directory names to
      include. Any plugin not matching this expression is excluded.
      In any case you need at least include the nutch-extensionpoints plugin. By
      default Nutch includes crawling just HTML and plain text via HTTP,
      and basic indexing and search plugins. In order to use HTTPS please enable
      protocol-httpclient, but be aware of possible intermittent problems with the
      underlying commons-httpclient library.</description>
    </property>

5. Nutch's default configuration
All of Nutch's default settings can be found in nutch-default.xml. Read that file to learn what can be configured, then add properties to conf/nutch-site.xml to override the defaults.
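The override relationship between the two files can be sketched as a simple merge in which site values replace defaults with the same name. This is a minimal sketch, not Nutch's own loading code: `read_properties` and `effective_config` are hypothetical helpers, and the property values in both XML snippets are made up for the example.

```python
import xml.etree.ElementTree as ET

# Stand-in for nutch-default.xml; the values are invented for illustration.
DEFAULT_XML = """<configuration>
  <property><name>plugin.includes</name><value>protocol-http|urlfilter-regex</value></property>
  <property><name>db.max.outlink.length</name><value>2048</value></property>
</configuration>"""

# Stand-in for conf/nutch-site.xml; only the overridden property is listed.
SITE_XML = """<configuration>
  <property><name>plugin.includes</name><value>protocol-http|urlfilter-regex|indexer-solr</value></property>
</configuration>"""

def read_properties(xml_text):
    """Collect name/value pairs from the <property> elements of one file."""
    props = {}
    for prop in ET.fromstring(xml_text).iter("property"):
        name = prop.findtext("name")
        if name:
            props[name.strip()] = (prop.findtext("value") or "").strip()
    return props

def effective_config(default_xml, site_xml):
    """Start from the defaults, then let the site file override them."""
    merged = read_properties(default_xml)
    merged.update(read_properties(site_xml))
    return merged

config = effective_config(DEFAULT_XML, SITE_XML)
for name, value in sorted(config.items()):
    print(name, "=", value)
```

Properties that are absent from the site file keep their default value, so nutch-site.xml only needs to list the settings that differ.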
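6. Using the metadata plugin
I wrote a separate post on how to use it; see the link below.
Nutch: capturing meta tag data from pages with the metadata plugin (鴨梨的藥丸哥's blog, CSDN)
7. Nutch 2.4 storage configuration
I wrote a post on this as well; the link is:
Nutch 2.4 storage configuration (鴨梨的藥丸哥's blog, CSDN)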