Web crawlers are typically built on multi-threaded, multi-task logic. Spring Boot already ships a thread pool wrapper (ThreadPoolTaskExecutor), so we only need to use it.
In this section we fetch page link information with multiple threads and store that information in queues.
Adding new dependencies
Add the new dependencies to the pom as follows:
```xml
<dependency>
    <groupId>org.apache.commons</groupId>
    <artifactId>commons-lang3</artifactId>
</dependency>
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.8.3</version>
</dependency>
<dependency>
    <groupId>org.projectlombok</groupId>
    <artifactId>lombok</artifactId>
    <scope>provided</scope>
</dependency>
```
To keep the code short, Lombok is used here. Your IDE needs the Lombok plugin installed, otherwise the project will not compile.
Configuration management
Spring Boot keeps its configuration in application.properties (or .yml), so the crawler-related settings are also managed there and bound to a class through the @ConfigurationProperties annotation. Straight to the code:
```java
package mobi.huanyuan.spider.config;

import lombok.Data;
import org.springframework.boot.context.properties.ConfigurationProperties;

@Data
@ConfigurationProperties(prefix = "huanyuan.spider")
public class SpiderConfig {

    /**
     * Maximum crawl depth
     */
    public int maxDepth = 2;

    /**
     * Number of page-crawling threads
     */
    public int minerHtmlThreadNum = 2;

    /**
     * Thread pool: core pool size
     */
    private int corePoolSize = 4;

    /**
     * Thread pool: maximum pool size
     */
    private int maxPoolSize = 100;

    /**
     * Thread pool: task queue capacity
     */
    private int queueCapacity = 1000;

    /**
     * Thread pool: idle thread keep-alive time in seconds
     */
    private int keepAliveSeconds = 300;
}
```
To change these settings later, just edit application.properties (or .yml):
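For example, an application.properties might contain entries like the following. The values shown are simply the defaults from SpiderConfig, and the property names come from Spring Boot's relaxed binding of the field names:

```properties
# Crawler settings bound to SpiderConfig (prefix huanyuan.spider); values are the defaults
huanyuan.spider.max-depth=2
huanyuan.spider.miner-html-thread-num=2
# Thread pool settings
huanyuan.spider.core-pool-size=4
huanyuan.spider.max-pool-size=100
huanyuan.spider.queue-capacity=1000
huanyuan.spider.keep-alive-seconds=300
```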
Thread pool
The thread pool is Spring Boot's own ThreadPoolTaskExecutor; its parameters come from the configuration class above, so all that is left is to initialize it:
```java
package mobi.huanyuan.spider.config;

import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.scheduling.concurrent.ThreadPoolTaskExecutor;

import java.util.concurrent.ThreadPoolExecutor;

@Configuration
public class ThreadPoolConfig {
    @Autowired
    private SpiderConfig spiderConfig;

    @Bean(name = "threadPoolTaskExecutor")
    public ThreadPoolTaskExecutor threadPoolTaskExecutor() {
        ThreadPoolTaskExecutor executor = new ThreadPoolTaskExecutor();
        executor.setMaxPoolSize(spiderConfig.getMaxPoolSize());
        executor.setCorePoolSize(spiderConfig.getCorePoolSize());
        executor.setQueueCapacity(spiderConfig.getQueueCapacity());
        executor.setKeepAliveSeconds(spiderConfig.getKeepAliveSeconds());
        executor.setRejectedExecutionHandler(new ThreadPoolExecutor.CallerRunsPolicy());
        return executor;
    }
}
```
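One thing to watch: SpiderConfig only carries @ConfigurationProperties, so it must also be registered as a bean for the @Autowired injection above to work. The original code for that part is not shown here; a minimal sketch, assuming the application class is the SpiderApplication referenced again at the end of this section, might look like this:

```java
package mobi.huanyuan.spider;

import mobi.huanyuan.spider.config.SpiderConfig;
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.boot.context.properties.EnableConfigurationProperties;

// Registers SpiderConfig so it can be injected into ThreadPoolConfig and Spider
@SpringBootApplication
@EnableConfigurationProperties(SpiderConfig.class)
public class SpiderApplication {
    public static void main(String[] args) {
        SpringApplication.run(SpiderApplication.class, args);
    }
}
```

Annotating SpiderConfig with @Component would work just as well; either way, the properties get bound and the bean becomes injectable.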
Queue management
This section only crawls URLs and stores them, so two queues are involved: one for pages waiting to be crawled and one for pages waiting to be analyzed (the analysis happens in the next section; here we only store into it). In addition, to avoid crawling the same URL twice, a Set keeps a record of the addresses that have already been visited.
```java
package mobi.huanyuan.spider;

import lombok.Getter;
import mobi.huanyuan.spider.bean.SpiderHtml;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.util.HashSet;
import java.util.LinkedList;
import java.util.Queue;
import java.util.Set;

public class SpiderQueue {
    private static Logger logger = LoggerFactory.getLogger(SpiderQueue.class);

    /**
     * URLs already seen, used to avoid crawling the same address twice
     */
    private static volatile Set<String> urlSet = new HashSet<>();

    /**
     * Queue of pages waiting to be crawled
     */
    private static volatile Queue<SpiderHtml> unVisited = new LinkedList<>();

    /**
     * Queue of fetched pages waiting to be analyzed (used in the next section)
     */
    private static volatile Queue<SpiderHtml> waitingMine = new LinkedList<>();

    public synchronized static void addUrlSet(String url) {
        urlSet.add(url);
    }

    public static int getUrlSetSize() {
        return urlSet.size();
    }

    public synchronized static void addUnVisited(SpiderHtml spiderHtml) {
        if (null != spiderHtml && !urlSet.contains(spiderHtml.getUrl())) {
            logger.info("Adding to the to-visit queue [{}], depth [{}], thread [{}]",
                    spiderHtml.getUrl(), spiderHtml.getDepth(), Thread.currentThread().getName());
            unVisited.add(spiderHtml);
        }
    }

    public synchronized static SpiderHtml unVisitedPoll() {
        return unVisited.poll();
    }

    public synchronized static void addWaitingMine(SpiderHtml html) {
        waitingMine.add(html);
    }

    public synchronized static SpiderHtml waitingMinePoll() {
        return waitingMine.poll();
    }

    public static int waitingMineSize() {
        return waitingMine.size();
    }
}
```
Crawl task
Straight to the code:
```java
package mobi.huanyuan.spider.runable;

import mobi.huanyuan.spider.SpiderQueue;
import mobi.huanyuan.spider.bean.SpiderHtml;
import mobi.huanyuan.spider.config.SpiderConfig;
import org.apache.commons.lang3.StringUtils;
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class SpiderHtmlRunnable implements Runnable {
    private static final Logger logger = LoggerFactory.getLogger(SpiderHtmlRunnable.class);
    private static boolean done = false;
    private SpiderConfig config;

    public SpiderHtmlRunnable(SpiderConfig config) {
        this.config = config;
    }

    @Override
    public void run() {
        while (!SpiderHtmlRunnable.done) {
            done = true;
            minerHtml();
            done = false;
        }
    }

    public synchronized void minerHtml() {
        SpiderHtml minerUrl = SpiderQueue.unVisitedPoll();
        try {
            // Skip empty entries and anything deeper than the configured maximum depth
            if (null == minerUrl || StringUtils.isBlank(minerUrl.getUrl()) || minerUrl.getDepth() > config.getMaxDepth()) {
                return;
            }
            // Skip entries that are not http(s) addresses
            if (!minerUrl.getUrl().contains("http")) {
                logger.info("URL [{}] does not contain http, skipping", minerUrl.getUrl());
                return;
            }
            logger.info("Crawling page [{}] at depth [{}] on thread [{}]", minerUrl.getUrl(), minerUrl.getDepth(), Thread.currentThread().getName());
            Connection conn = Jsoup.connect(minerUrl.getUrl());
            conn.header("User-Agent", "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) AppleWebKit/525.13 (KHTML, like Gecko) Chrome/0.2.149.27 Safari/525.13");
            Document doc = conn.get();
            String page = doc.html();

            SpiderHtml spiderHtml = new SpiderHtml();
            spiderHtml.setUrl(minerUrl.getUrl());
            spiderHtml.setHtml(page);
            spiderHtml.setDepth(minerUrl.getDepth());

            System.out.println(spiderHtml.getUrl());
            // Hand the fetched page over to the waiting-to-analyze queue
            SpiderQueue.addWaitingMine(spiderHtml);
        } catch (Exception e) {
            logger.info("Failed to crawl page, URL [{}]", minerUrl.getUrl());
            logger.info("Error info [{}]", e.getMessage());
        }
    }
}
```
This is just a Runnable task. Its job is to pull a URL off the queue, fetch the page, wrap the result in a SpiderHtml object, and put it into the waiting-to-analyze queue.
It uses jsoup, a Java toolkit for fetching and parsing HTML; if you have not used it before, it is worth a quick look, since the analysis in later sections relies on it as well.
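As a taste of what the analysis step will do, here is a minimal, self-contained sketch of extracting link URLs from an HTML string with jsoup. The HTML snippet and the class name are purely illustrative:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class JsoupLinkDemo {
    public static void main(String[] args) {
        // Illustrative HTML; in the crawler this would be the content stored in SpiderHtml#getHtml()
        String html = "<html><body><a href='https://example.com/a'>A</a>"
                + "<a href='https://example.com/b'>B</a></body></html>";
        Document doc = Jsoup.parse(html);
        // Select every <a> tag that carries an href attribute
        Elements links = doc.select("a[href]");
        for (Element link : links) {
            System.out.println(link.attr("href"));
        }
    }
}
```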
Other pieces
The page wrapper class SpiderHtml:
```java
package mobi.huanyuan.spider.bean;

import lombok.Data;

import java.io.Serializable;

@Data
public class SpiderHtml implements Serializable {

    /**
     * Page URL
     */
    private String url;

    /**
     * Page HTML content
     */
    private String html;

    /**
     * Crawl depth of this page
     */
    private int depth;
}
```
The crawler main class:
```java
package mobi.huanyuan.spider;

import mobi.huanyuan.spider.bean.SpiderHtml;
import mobi.huanyuan.spider.config.SpiderConfig;
import mobi.huanyuan.spider.runable.SpiderHtmlRunnable;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.scheduling.concurrent.ThreadPoolTaskExecutor;
import org.springframework.stereotype.Component;

import java.util.concurrent.TimeUnit;

@Component
public class Spider {
    private static Logger logger = LoggerFactory.getLogger(Spider.class);

    @Autowired
    private ThreadPoolTaskExecutor threadPoolTaskExecutor;
    @Autowired
    private SpiderConfig spiderConfig;

    public void start(SpiderHtml spiderHtml) {
        // Seed the to-visit queue with the start page and record its URL as seen
        SpiderQueue.addUnVisited(spiderHtml);
        SpiderQueue.addUrlSet(spiderHtml.getUrl());

        for (int i = 0; i < spiderConfig.getMinerHtmlThreadNum(); i++) {
            SpiderHtmlRunnable minerHtml = new SpiderHtmlRunnable(spiderConfig);
            threadPoolTaskExecutor.execute(minerHtml);
        }
        // TODO: temporary logic, to be replaced in later chapters
        try {
            TimeUnit.SECONDS.sleep(20);
            logger.info("Size of the waiting-to-analyze URL queue: {}", SpiderQueue.waitingMineSize());
            threadPoolTaskExecutor.shutdown();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
```
The code after the "// TODO:" comment is temporary; it will gradually be removed as later chapters fill in the real logic.
Finally
To run this section's code, add the following to the main method of the Spring Boot project:
```java
ConfigurableApplicationContext context = SpringApplication.run(SpiderApplication.class, args);
Spider spider = context.getBean(Spider.class);
SpiderHtml startPage = new SpiderHtml();
startPage.setUrl("$URL");
startPage.setDepth(2);
spider.start(startPage);
```
$URL is the address of the page you want to crawl.
Once the Spring Boot application starts, it has to be stopped manually; there is no logic yet for stopping automatically when crawling is finished.
The run output looks like the figure below:
Finally, once this chapter is done, the overall project structure looks like the figure below: