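Continuing from the previous section.

Requirements

The first version of data storage saves the crawled page data to the local disk as files.

Configuration

1. Add the following settings to the SpiderConfig class: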
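```java
// Number of threads that run the storage task.
public int minerStoreThreadNum;

// Where crawled data is stored; defaults to local files.
public StoreType storeType = StoreType.FILE;

// Base directory for locally stored page files.
public String storeLocalPath;
```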
2. Update application.properties (or .yml) to add the new configuration properties, as shown in the figure below:
Data storage task

The data storage task saves the crawled data that matches the rules. The storage types are:
```java
public enum StoreType {
    DB("DB"),
    FILE("FILE");

    private String type;

    private StoreType(String type) {
        this.type = type;
    }

    public String getType() {
        return type;
    }
}
```
For now, only storage to local files is implemented. The code is as follows:
```java
package mobi.huanyuan.spider.runable;

import mobi.huanyuan.spider.SpiderApplication;
import mobi.huanyuan.spider.SpiderQueue;
import mobi.huanyuan.spider.bean.SpiderHtml;
import mobi.huanyuan.spider.config.SpiderConfig;
import org.apache.commons.lang3.StringUtils;
import org.apache.commons.lang3.time.DateFormatUtils;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.io.*;
import java.nio.charset.StandardCharsets;

/**
 * Takes crawled pages from the store queue and persists them.
 */
public class SpiderStoreRunnable implements Runnable {
    private static final Logger logger = LoggerFactory.getLogger(SpiderStoreRunnable.class);

    private SpiderConfig config;

    public SpiderStoreRunnable(SpiderConfig config) {
        this.config = config;
    }

    @Override
    public void run() {
        // Keep storing until the application signals shutdown.
        while (!SpiderApplication.isStopping) {
            store();
        }
    }

    private void store() {
        switch (config.getStoreType()) {
            case FILE: {
                fileStore();
                break;
            }
            case DB:
            default:
                // Database storage is not implemented yet.
                logger.error("Don't support this store type[{}].", config.getStoreType());
                break;
        }
    }

    public synchronized void fileStore() {
        SpiderHtml html = SpiderQueue.storePoll();
        // Nothing queued, or the page has no content.
        if (null == html || StringUtils.isBlank(html.getHtml())) {
            return;
        }
        // The file name is derived from the page URL.
        String title = fileName(html.getUrl());
        // Skip names that are missing or longer than 255 characters,
        // a limit that many file systems impose on file names.
        if (title == null || title.length() > 255) {
            return;
        }
        storeHtmlToLocal(title, html.getHtml());
        logger.info("Finished saving data file, current thread[{}]", Thread.currentThread().getName());
    }

    /**
     * Removes the characters \ / : * ? " < > | which are not allowed in file names.
     */
    public String fileName(String title) {
        // Guard against a missing URL so the caller's null check can take effect.
        if (title == null) {
            return null;
        }
        return title
                .replaceAll("\\\\", "")
                .replaceAll("/", "")
                .replaceAll(":", "")
                .replaceAll("\\*", "")
                .replaceAll("\\?", "")
                .replaceAll("\"", "")
                .replaceAll("<", "")
                .replaceAll(">", "")
                .replaceAll("\\|", "");
    }

    private void storeHtmlToLocal(String title, String content) {
        // One directory per day. The date is appended directly to the base path,
        // so storeLocalPath is expected to end with a path separator.
        String path = config.getStoreLocalPath()
                + DateFormatUtils.format(System.currentTimeMillis(), "yyyyMMdd");
        makeDir(path);
        // try-with-resources closes the writer even if the write fails.
        try (Writer writer = new OutputStreamWriter(
                new FileOutputStream(new File(path + File.separator + title)),
                StandardCharsets.UTF_8)) {
            writer.write(content);
            writer.flush();
        } catch (IOException e) {
            logger.error(e.getMessage(), e);
        }
    }

    public void makeDir(String path) {
        File file = new File(path);
        if (!file.exists()) {
            file.mkdirs();
            logger.info("Created storage directory[{}]", path);
        }
    }
}
```
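The file-writing step can also be expressed with java.nio.file, which creates the dated directory and writes the content in two calls. The sketch below is only an illustration and is not part of the project: the class name FileStoreSketch, its method names, and the sample URL are assumptions. Only the set of stripped characters and the yyyyMMdd directory layout are taken from the listing.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

// Hypothetical helper for illustration; not part of the original project.
final class FileStoreSketch {

    // Strips the characters \ / : * ? " < > | in a single pass.
    static String fileName(String url) {
        return url.replaceAll("[\\\\/:*?\"<>|]", "");
    }

    // Writes html as UTF-8 to <baseDir>/<yyyyMMdd>/<sanitized url> and returns that path.
    static Path store(Path baseDir, String url, String html) throws IOException {
        String day = LocalDate.now().format(DateTimeFormatter.BASIC_ISO_DATE);
        // createDirectories does nothing if the directory already exists.
        Path dir = Files.createDirectories(baseDir.resolve(day));
        Path file = dir.resolve(fileName(url));
        return Files.write(file, html.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        Path base = Files.createTempDirectory("spider-store");
        Path saved = store(base, "https://example.com/a?b=1", "<html>hello</html>");
        System.out.println(saved.getFileName()); // prints httpsexample.comab=1
    }
}
```

Because resolve joins path segments itself, this variant does not depend on the base directory ending with a separator.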
Changes to the main spider class

Modify the start method of Spider to add the logic that launches the data storage threads.
```java
for (int i = 0; i < spiderConfig.getMinerStoreThreadNum(); i++) {
    SpiderStoreRunnable minerStoreThread = new SpiderStoreRunnable(spiderConfig);
    threadPoolTaskExecutor.execute(minerStoreThread);
}
```
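The loop above submits one long-running task per storage thread. A minimal, self-contained sketch of that pattern using the JDK's ExecutorService is given here. StoreWorkerSketch, its queue, and its stopping flag are hypothetical stand-ins for SpiderQueue and SpiderApplication.isStopping, whose implementations are not shown in this section. The workers poll with a timeout so that an idle thread waits instead of spinning; that is a design choice of this example and not necessarily what the project does.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical stand-in for the storage workers; not part of the original project.
final class StoreWorkerSketch {

    // Starts threadNum workers, lets them drain the queue, and returns what they "stored".
    static List<String> storeAll(int threadNum, List<String> pages) throws InterruptedException {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>(pages);
        List<String> stored = new CopyOnWriteArrayList<>();
        AtomicBoolean stopping = new AtomicBoolean(false);
        ExecutorService pool = Executors.newFixedThreadPool(threadNum);

        for (int i = 0; i < threadNum; i++) {
            pool.execute(() -> {
                // Keep working until shutdown is requested and the queue is empty.
                while (!stopping.get() || !queue.isEmpty()) {
                    try {
                        // Wait briefly for work instead of busy-looping on an empty queue.
                        String page = queue.poll(100, TimeUnit.MILLISECONDS);
                        if (page != null) {
                            stored.add(page); // stand-in for writing the page to a file
                        }
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                        return;
                    }
                }
            });
        }

        stopping.set(true);                          // request shutdown
        pool.shutdown();                             // accept no new tasks
        pool.awaitTermination(5, TimeUnit.SECONDS);  // wait for the workers to finish
        return stored;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(storeAll(2, Arrays.asList("a", "b", "c")).size()); // prints 3
    }
}
```

The order in which pages are stored depends on thread scheduling, so callers should not rely on it.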
Other

At this point, the basic logic of a crawler is complete. The project structure is as follows: