youkimra

Fixing "x point org.apache.nutch.net.URLNormalizer not found." when running Nutch

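I recently hit a bottleneck at work, mainly because I did not understand Nutch well enough and was using it inefficiently. I am now optimizing our Nutch setup, and I will keep recording the problems I run into while learning it. The first one is "x point org.apache.nutch.net.URLNormalizer not found.", an exception thrown when running Nutch. The message points to URLNormalizer, the component Nutch uses to normalize URLs during inject. It is implemented as a plugin, so I suspected a plugin problem. After careful checking I found that the plugin.folders path in nutch-default.xml was set incorrectly; after changing it from lib/plugin to plugin, Nutch ran normally. A configuration file problem can also cause this error.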
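A quick way to catch a wrong plugin.folders path before running inject is to read the value from the configuration file and check that each entry exists on disk. The following is a minimal sketch in Python, not part of Nutch: the function names `read_property` and `missing_plugin_dirs` are made up for illustration, and resolving relative entries against a single `base_dir` is a simplifying assumption (Nutch itself may resolve them differently, for example through the classpath).

```python
import os
import tempfile
import xml.etree.ElementTree as ET


def read_property(conf_path, name):
    """Return the <value> text of the named <property>, or None if absent."""
    root = ET.parse(conf_path).getroot()
    for prop in root.iter("property"):
        if prop.findtext("name", default="").strip() == name:
            return (prop.findtext("value") or "").strip()
    return None


def missing_plugin_dirs(conf_path, base_dir):
    """List plugin.folders entries that are not directories under base_dir.

    Assumption: relative entries are resolved against base_dir.
    """
    value = read_property(conf_path, "plugin.folders") or ""
    folders = [f.strip() for f in value.split(",") if f.strip()]
    return [f for f in folders if not os.path.isdir(os.path.join(base_dir, f))]


if __name__ == "__main__":
    # Hypothetical layout: a Nutch home that has "plugin" but not "lib/plugin".
    with tempfile.TemporaryDirectory() as home:
        os.mkdir(os.path.join(home, "plugin"))
        conf = os.path.join(home, "nutch-default.xml")
        with open(conf, "w", encoding="utf-8") as fh:
            fh.write(
                "<configuration><property>"
                "<name>plugin.folders</name><value>lib/plugin</value>"
                "</property></configuration>"
            )
        # Prints the configured entries that do not exist on disk.
        print(missing_plugin_dirs(conf, home))
```

An empty result means every configured folder was found; any entry that is returned is a candidate for the kind of path mistake described above.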

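   1. The JAVA_HOME environment variable is not set.
   2. No filter rules are set in conf/crawl-urlfilter.txt.
   3. Fetcher: No agents listed in 'http.agent.name' property.
      Cause: nutch-site.xml was not modified.
   4. No pages are fetched.
      Cause: the URL pattern in conf/crawl-urlfilter.txt (*.TARGET.COM) differs in letter case from the URLs in urls.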
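The case mismatch in item 4 is easy to reproduce outside Nutch. The sketch below imitates the filter file format in Python: a line starting with `+` accepts, a line starting with `-` rejects, the first matching rule decides, and a URL that matches no rule is rejected. It is an approximation under stated assumptions: it uses Python's `re` rather than Java's regex engine, the helper names `parse_rules` and `accepts` are invented, and the pattern shape `^http://([a-z0-9]*\.)*TARGET.COM/` is assumed from the `*.TARGET.COM` placeholder mentioned in the list.

```python
import re


def parse_rules(text):
    """Parse filter lines: '+regex' accepts, '-regex' rejects, '#' is a comment."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        if line[0] in "+-":
            rules.append((line[0] == "+", re.compile(line[1:])))
    return rules


def accepts(rules, url):
    """First matching rule decides; a URL that matches no rule is rejected."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False


if __name__ == "__main__":
    seed = "http://www.target.com/"  # hypothetical seed URL, lower case
    upper = parse_rules(r"+^http://([a-z0-9]*\.)*TARGET.COM/")
    lower = parse_rules(r"+^http://([a-z0-9]*\.)*target.com/")
    # The upper-case pattern does not match the lower-case seed URL.
    print(accepts(upper, seed), accepts(lower, seed))
```

With a rule set like `upper`, every seed URL is filtered out before fetching, which would explain a crawl that fetches nothing.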

Problems encountered while debugging:

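   1. javax.security.auth.login.LoginException. Cause: Nutch relies on Cygwin. c:\cygwin\bin must be added to the PATH environment variable.
   2. java.lang.OutOfMemoryError. The VM memory size needs to be raised in Eclipse: under Debug Configurations, add -Xmx768m to the VM arguments.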
   3. plugin.folders is not set (java.lang.IllegalArgumentException: plugin.folders is not set). Fix: add the conf directory to the project's source folders.
   4. java.lang.RuntimeException: x point org.apache.nutch.net.URLNormalizer not found.
      Cause: Nutch does not accept the regular expression in crawl-urlfilter.txt.
   5. java.lang.IllegalArgumentException: Fetcher: No agents listed in 'http.agent.name'
      Cause: http.agent.name is empty in nutch-default.xml.
      Fix:
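To check whether http.agent.name is effectively empty, one can merge the two configuration files the way overriding is usually understood to work, with values in nutch-site.xml taking precedence over nutch-default.xml. This is a minimal Python sketch under that assumption, not Nutch's own loading code; the helper names and the agent name `my-crawler` are hypothetical.

```python
import xml.etree.ElementTree as ET


def load_properties(xml_text):
    """Map each <property> name to its stripped <value> text."""
    props = {}
    for prop in ET.fromstring(xml_text).iter("property"):
        name = (prop.findtext("name") or "").strip()
        if name:
            props[name] = (prop.findtext("value") or "").strip()
    return props


def effective_value(name, default_xml, site_xml):
    """Return the value after letting site_xml override default_xml.

    Assumption: the site file is loaded after the default file and wins.
    """
    merged = load_properties(default_xml)
    merged.update(load_properties(site_xml))
    return merged.get(name, "")


# Hypothetical file contents, reduced to the one property of interest.
DEFAULT_XML = """<configuration>
  <property><name>http.agent.name</name><value></value></property>
</configuration>"""

SITE_XML = """<configuration>
  <property><name>http.agent.name</name><value>my-crawler</value></property>
</configuration>"""

EMPTY_SITE_XML = "<configuration></configuration>"

if __name__ == "__main__":
    for site in (EMPTY_SITE_XML, SITE_XML):
        agent = effective_value("http.agent.name", DEFAULT_XML, site)
        print("http.agent.name is set" if agent else "http.agent.name is empty")
```

An empty result corresponds to the condition the exception message complains about.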
Comments
#1 ChenHotOne 2017-10-09
Hello, I ran into the problem you describe in item 4, java.lang.RuntimeException: x point org.apache.nutch.net.URLNormalizer not found.
The full output is:
java.lang.Exception: java.lang.RuntimeException: x point org.apache.nutch.net.URLNormalizer not found.
	at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
	at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:522)
Caused by: java.lang.RuntimeException: x point org.apache.nutch.net.URLNormalizer not found.
	at org.apache.nutch.net.URLNormalizers.<init>(URLNormalizers.java:141)
	at org.apache.nutch.crawl.InjectorJob$UrlMapper.setup(InjectorJob.java:94)
	at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:142)
	at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
	at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:243)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
2017-10-09 19:05:08,610 ERROR crawl.InjectorJob - InjectorJob: java.lang.RuntimeException: job failed: name=apache-nutch-2.3.1.jar, jobid=job_local456134380_0001
	at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:120)
	at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:231)
	at org.apache.nutch.crawl.InjectorJob.inject(InjectorJob.java:252)
	at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:275)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
	at org.apache.nutch.crawl.InjectorJob.main(InjectorJob.java:284)


My regex-urlfilter.txt contains:
# accept anything else
+^http://www.aossama.com/
#+^http://([a-z0-9]*\.)*nutch.apache.org/
# +.
The urls directory contains seed.txt, which holds:
http://www.aossama.com/
The error occurs when I run the following command:
./bin/nutch inject urls/
