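The main function of Crawl contains this statement:
// initialize crawlDb
injector.inject(crawlDb, rootUrlDir);
Quoting [李阳]: the inject operation calls the Injector class in the crawl package, one of Nutch's core packages.
What the inject operation mainly does:
1. Normalizes and filters the set of URLs, discards invalid URLs, sets each URL's status to UNFETCHED, and assigns an initial score according to a defined method;
2. Merges the URLs and eliminates duplicate URL entries;
3. Stores the URLs with their status and score in the crawldb database; where a URL duplicates one already in the database, the old entry is deleted and replaced by the new one.
Result of the inject operation: the contents of the crawldb database are updated, including the URLs and their status.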
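The three steps above can be sketched in plain Java without Hadoop. This is only an in-memory illustration under stated assumptions, not Nutch's actual code: the class InjectSketch, the Datum type (a stand-in for CrawlDatum), the initial score of 1.0, and the very rough normalize and accept rules are all hypothetical. The replace-on-duplicate merge follows the description above.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Simplified, in-memory stand-in for the inject steps; not Nutch's actual code.
class InjectSketch {
    // Hypothetical status label and initial score, chosen for illustration.
    static final String UNFETCHED = "UNFETCHED";
    static final float INITIAL_SCORE = 1.0f;

    // Hypothetical, minimal counterpart of CrawlDatum: status plus score.
    static final class Datum {
        final String status;
        final float score;
        Datum(String status, float score) { this.status = status; this.score = score; }
    }

    // Step 1a (rough): trim whitespace and lowercase the scheme and host part.
    static String normalize(String raw) {
        String url = raw.trim();
        int schemeEnd = url.indexOf("://");
        if (schemeEnd < 0) return url;
        int pathStart = url.indexOf('/', schemeEnd + 3);
        if (pathStart < 0) pathStart = url.length();
        return url.substring(0, pathStart).toLowerCase(Locale.ROOT) + url.substring(pathStart);
    }

    // Step 1b (rough): accept only http/https URLs with a host and no whitespace.
    static boolean accept(String url) {
        return url.matches("https?://[^/\\s]+(/\\S*)?");
    }

    static Map<String, Datum> inject(Map<String, Datum> crawlDb, List<String> seeds) {
        // Steps 1 and 2: one map entry per normalized URL, so duplicates collapse.
        Map<String, Datum> injected = new LinkedHashMap<>();
        for (String seed : seeds) {
            String url = normalize(seed);
            if (accept(url)) injected.put(url, new Datum(UNFETCHED, INITIAL_SCORE));
        }
        // Step 3: merge into the existing entries; an injected URL replaces the old entry.
        Map<String, Datum> merged = new LinkedHashMap<>(crawlDb);
        merged.putAll(injected);
        return merged;
    }

    public static void main(String[] args) {
        Map<String, Datum> db = new LinkedHashMap<>();
        db.put("http://example.com/", new Datum("FETCHED", 2.5f));
        List<String> seeds = Arrays.asList(
            " HTTP://Example.com/ ", "http://example.com/",
            "ftp://example.org/file", "http://example.org/Page");
        inject(db, seeds).forEach((url, d) ->
            System.out.println(url + "\t" + d.status + "\t" + d.score));
    }
}
```

In the real Injector these steps run as Hadoop jobs over files rather than over an in-memory map; the sketch only shows the order of operations.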
Let's look at the function that inject calls:
public void inject(Path crawlDb, Path urlDir) throws IOException {
    // Create a temporary directory with a random name
    Path tempDir = new Path(getConf().get("mapred.temp.dir", ".")
        + "/inject-temp-"
        + Integer.toString(new Random().nextInt(Integer.MAX_VALUE)));

    // map text input file to a <url,CrawlDatum> file
    // (produces a file of <url,CrawlDatum> key-value pairs)
    JobConf sortJob = new NutchJob(getConf());
    sortJob.setJobName("inject " + urlDir);
    FileInputFormat.addInputPath(sortJob, urlDir);
    sortJob.setMapperClass(InjectMapper.class);
    FileOutputFormat.setOutputPath(sortJob, tempDir);
    sortJob.setOutputFormat(SequenceFileOutputFormat.class);
    sortJob.setOutputKeyClass(Text.class);
    sortJob.setOutputValueClass(CrawlDatum.class);
    sortJob.setLong("injector.current.time", System.currentTimeMillis());
    JobClient.runJob(sortJob);
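This part uses Hadoop. The input directory is the URL directory specified by the user, and the output directory is the temporary directory created above. Hadoop: The Definitive Guide explains the SequenceFileOutputFormat used here with a passage that begins "Imagine a logfile, where each log ..."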
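The temporary directory name built at the start of inject can be tried in isolation. The following is a minimal sketch that assumes java.util.Properties as a stand-in for Hadoop's Configuration; the class TempDirSketch and its tempDir helper are hypothetical names used only for illustration.

```java
import java.util.Properties;
import java.util.Random;

// Stand-alone illustration of the temp directory name; Properties stands in
// for Hadoop's Configuration, which is an assumption of this sketch.
class TempDirSketch {
    static String tempDir(Properties conf, Random random) {
        // Fall back to the current directory when mapred.temp.dir is not set.
        String base = conf.getProperty("mapred.temp.dir", ".");
        // Same pattern as in inject: <base>/inject-temp-<random non-negative int>
        return base + "/inject-temp-" + Integer.toString(random.nextInt(Integer.MAX_VALUE));
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        System.out.println(tempDir(conf, new Random()));

        conf.setProperty("mapred.temp.dir", "/tmp/nutch");
        System.out.println(tempDir(conf, new Random()));
    }
}
```

The random suffix makes it unlikely that two inject runs sharing one temp location write to the same directory.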