A Crawler Built on the JDK 11 HttpClient
Published: 2019-06-18



Goal: collect the words from a vocabulary-learning site. I wrote a simple little crawler for it, using JDK 11.


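With that, the plan is clear!

Step one: open the refrigerator door... oops, wrong script, sorry!!

Step one: call the login endpoint and get the session id!

Step two: take the session id to the word list page, fetch the body, parse it into a Document, and start "borrowing" words!

So easy, right?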
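The hand-off between the two steps is pulling one cookie out of the login response's `Set-Cookie` headers. Here is a minimal sketch of just that part, using `java.net.HttpCookie` from the standard library. The class name `SessionCookie`, the method `find`, and the header values in `main` are made up for illustration; the real site's headers may look different.

```java
import java.net.HttpCookie;
import java.util.List;
import java.util.Optional;

public class SessionCookie {

    /**
     * Returns the value of the named cookie from a list of raw Set-Cookie
     * header values, or an empty Optional if no header carries that name.
     */
    static Optional<String> find(List<String> setCookieHeaders, String name) {
        for (String header : setCookieHeaders) {
            // HttpCookie.parse separates the name=value pair from attributes
            // such as Path or HttpOnly
            for (HttpCookie cookie : HttpCookie.parse(header)) {
                if (cookie.getName().equals(name)) {
                    return Optional.of(cookie.getValue());
                }
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Invented header values, standing in for headers.allValues("set-cookie")
        List<String> headers = List.of(
                "lang=zh; Path=/",
                "sid=abc123; Path=/; HttpOnly");
        System.out.println(find(headers, "sid").orElse("<none>"));
    }
}
```

Note that `HttpCookie.parse` throws an `IllegalArgumentException` on a malformed header, so a real crawler may want to catch that and skip the offending line.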

package com.***;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import org.junit.Test;

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.HashMap;

/**
 * @author jqw1122@foxmail.com
 * @description crawl, crawl, crawl
 * @date 2/23/2019 17:14
 */
public class Crawler {

    @Test
    public void crawler() {
        String loginUrl = "http://www.cikuang.me/login";
        String formBody = "username=jqw1122@foxamil.com&password=qweqwe123";
        String wordSetUrl = "http://www.cikuang.me/member/learningset?id=4573";

        HttpClient httpClient = HttpClient.newBuilder().build();
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(loginUrl))
                .header("Content-Type", "application/x-www-form-urlencoded")
                .POST(HttpRequest.BodyPublishers.ofString(formBody))
                .build();

        httpClient.sendAsync(request, HttpResponse.BodyHandlers.ofString())
                .thenApply(HttpResponse::headers)
                .thenAccept(headers -> {
                    // So many cookies, but all I want is the sid, dammit!
                    var cookieMap = new HashMap<String, String>();
                    // allValues returns an empty list (not null) when the header is absent
                    headers.allValues("set-cookie").forEach(c -> {
                        for (String s : c.split(";")) {
                            // trim the space that follows each ';' and split on the first '=' only
                            String[] pair = s.trim().split("=", 2);
                            if (pair.length == 2) cookieMap.put(pair[0], pair[1]);
                        }
                    });

                    // Take the sid to the word list page
                    String cookieSid = cookieMap.get("sid");
                    HttpRequest request2 = HttpRequest.newBuilder()
                            .uri(URI.create(wordSetUrl))
                            .header("Content-Type", "application/x-www-form-urlencoded")
                            .header("Cookie", "sid=" + cookieSid)
                            .GET()
                            .build();

                    httpClient.sendAsync(request2, HttpResponse.BodyHandlers.ofString())
                            .thenApply(HttpResponse::body)
                            .thenAccept(htmlString -> {
                                // Parse the body into a Document, which makes the "borrowing" easier...
                                Document htmlDocument = Jsoup.parse(htmlString);
                                // Look up the word table by its id
                                Element wordListTable = htmlDocument.getElementById("wordListTable");
                                Elements trs = wordListTable.getElementsByTag("tr");
                                trs.forEach(t -> {
                                    Elements tds = t.children();
                                    String en = tds.get(0).child(0).text();
                                    String cn = tds.get(1).text();
                                    System.out.println("word ---->>> " + en + ":" + cn);
                                });
                            }).join();
                }).join();
    }
}
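Two optional variations on the login request, sketched under assumptions rather than tested against the site. First, the form body can be URL-encoded so that characters like `@` in the username survive intact. Second, the client can be given a `CookieManager`, which stores cookies from responses and replays them on later requests, so the `sid` would not need to be carried by hand. The class name `LoginRequestSketch`, its methods, the credentials, and the `example.com` URL are all placeholders; whether the site's session works through a plain `CookieManager` is an assumption.

```java
import java.net.CookieManager;
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

public class LoginRequestSketch {

    /** Encodes key/value pairs as an application/x-www-form-urlencoded body. */
    static String formBody(Map<String, String> fields) {
        return fields.entrySet().stream()
                .map(e -> URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8)
                        + "=" + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8))
                .collect(Collectors.joining("&"));
    }

    /** A client that remembers cookies from responses and sends them back later. */
    static HttpClient sessionClient() {
        return HttpClient.newBuilder()
                .cookieHandler(new CookieManager())
                .build();
    }

    /** Builds (but does not send) a form POST to the given URL. */
    static HttpRequest loginRequest(String url, Map<String, String> fields) {
        return HttpRequest.newBuilder()
                .uri(URI.create(url))
                .header("Content-Type", "application/x-www-form-urlencoded")
                .POST(HttpRequest.BodyPublishers.ofString(formBody(fields)))
                .build();
    }

    public static void main(String[] args) {
        // Placeholder credentials; LinkedHashMap keeps the field order stable
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("username", "user@example.com");
        fields.put("password", "p@ss word");

        System.out.println(formBody(fields));
        System.out.println(loginRequest("http://example.com/login", fields).method());
        System.out.println(sessionClient().cookieHandler().isPresent());
    }
}
```

With a client like this, the second request would be sent through the same `HttpClient` instance without an explicit `Cookie` header.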

Crawler 2. Goal: download the audio for every TOEFL listening practice question on KMF.


import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;
import org.junit.Test;

import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.net.URL;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import java.time.LocalTime;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

/**
 * @author jqw1122@foxmail.com
 * @description
 * @date 2/23/2019 17:14
 */
public class Crawler {

    @Test
    public void crawlerKMF() {
        String mainUrl = "https://toefl.kmf.com";
        String mainUrl1 = "https://toefl.kmf.com/listen/ets/order/";
        String localFilePath = "C:\\kmf_audio\\";
        HttpClient httpClient = HttpClient.newBuilder().build();

        // Pass 1: walk the list pages and collect the link to each detail page
        List<String> detailUrlList = new ArrayList<>();
        e:
        for (int i = 0; i <= 5; i++) {
            for (int j = 1; j <= 4; j++) {
                String url = mainUrl1 + i + "/0/" + j;
                HttpRequest request = HttpRequest.newBuilder()
                        .uri(URI.create(url))
                        .header("Content-Type", "application/x-www-form-urlencoded")
                        .GET()
                        .build();
                httpClient.sendAsync(request, HttpResponse.BodyHandlers.ofString())
                        .thenApply(HttpResponse::body)
                        .thenAccept(bodyString -> {
                            Document htmlDocument = Jsoup.parse(bodyString);
                            Elements elements = htmlDocument.getElementsByAttributeValue("class", "check-links js-check-link");
                            elements.forEach(tagA -> detailUrlList.add(tagA.attr("href")));
                            System.out.println("page detail number:" + elements.size());
                        }).join();
                // For a quick test, stop after the first page:
                // if (1==1) break e;
            }
        }
        System.out.println("page/file number: " + detailUrlList.size());

        // Pass 2: open each detail page and read the title and the audio URL.
        // The list is filled from a parallel stream, so it must be thread-safe.
        List<Map<String, String>> fileList = Collections.synchronizedList(new ArrayList<>());
        System.out.println(LocalTime.now().toString() + " start get audio file url in detail page");
        detailUrlList.parallelStream().forEach(href -> {
            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create(mainUrl + href))
                    .header("Content-Type", "application/x-www-form-urlencoded")
                    .GET()
                    .build();
            httpClient.sendAsync(request, HttpResponse.BodyHandlers.ofString())
                    .thenApply(HttpResponse::body)
                    .thenAccept(bodyString -> {
                        Document htmlDocument = Jsoup.parse(bodyString);
                        Elements bts = htmlDocument.getElementsByAttributeValue("class", "i-title js-top-title");
                        String fileName = bts.get(0).text();
                        Elements audios = htmlDocument.getElementsByAttributeValue("class", "question-audio-cont js-question-audio g-player-control video-left-content js-player-record");
                        String fileUrl = audios.get(0).attr("data-url");
                        fileList.add(Map.of("fileName", fileName.toLowerCase().replace(" ", "_") + ".mp3", "fileUrl", fileUrl));
                        // System.out.println(fileName + "--" + fileUrl);
                    }).join();
        });
        System.out.println(LocalTime.now().toString() + " finish get audio file url in detail page! start downloading files to local!");

        // Pass 3: download every audio file into the local directory
        fileList.parallelStream().forEach(t -> {
            try (InputStream ins = new URL(t.get("fileUrl")).openStream()) {
                Path target = Paths.get(localFilePath, t.get("fileName"));
                // Uncomment if the target directory may not exist yet:
                // Files.createDirectories(target.getParent());
                Files.copy(ins, target, StandardCopyOption.REPLACE_EXISTING);
            } catch (IOException e) {
                System.out.println("download failed! fileName:" + t.get("fileName") + " fileUrl:" + t.get("fileUrl"));
                e.printStackTrace();
            }
        });
        System.out.println(LocalTime.now().toString() + " download completed");
    }
}
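A side note on the async calls: calling `join()` right after each `sendAsync` waits for that response before the next request starts, so the requests effectively run one at a time (the parallel stream is what provides the concurrency in the later passes). A possible alternative, sketched here without any network access, is to start all requests first and wait once with `CompletableFuture.allOf`. The names `FanOutSketch`, `fetchAll`, and `fakeFetch` are invented for illustration; `fakeFetch` stands in for `httpClient.sendAsync(...).thenApply(HttpResponse::body)`.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.function.Function;
import java.util.stream.Collectors;

public class FanOutSketch {

    /**
     * Starts one asynchronous task per input, waits for all of them,
     * and returns the results in input order.
     */
    static <T, R> List<R> fetchAll(List<T> inputs, Function<T, CompletableFuture<R>> fetch) {
        // Start every task first; collecting the futures does not wait for them
        List<CompletableFuture<R>> pending = inputs.stream()
                .map(fetch)
                .collect(Collectors.toList());

        // Block once, until every future has completed
        CompletableFuture.allOf(pending.toArray(new CompletableFuture[0])).join();

        // All futures are done now, so these joins return immediately
        return pending.stream()
                .map(CompletableFuture::join)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Stand-in for a real HTTP call: completes asynchronously with a fake body
        Function<String, CompletableFuture<String>> fakeFetch =
                url -> CompletableFuture.supplyAsync(() -> "body of " + url);

        List<String> bodies = fetchAll(List.of("/a", "/b", "/c"), fakeFetch);
        bodies.forEach(System.out::println);
    }
}
```

Because results come back as a plain list in input order, no shared mutable list is needed. Firing every request at once may be unkind to the target server, so some throttling would be sensible in practice.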

The download succeeded....
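One thing to watch when file names come from page titles: a title containing a character such as `:` or `?` cannot be used as a file name on Windows, and `Files.copy` would fail for that entry. A small sketch of a more defensive name builder follows. The class `FileNameSketch`, the method `toFileName`, and the sample title are hypothetical; I have not checked which characters actually appear in the site's titles.

```java
import java.util.Locale;

public class FileNameSketch {

    /** Turns a page title into a lowercase, underscore-separated .mp3 file name. */
    static String toFileName(String title) {
        String cleaned = title.trim()
                .toLowerCase(Locale.ROOT)
                // drop the characters Windows rejects in file names: \ / : * ? " < > |
                .replaceAll("[\\\\/:*?\"<>|]", "")
                // collapse each run of whitespace into a single underscore
                .replaceAll("\\s+", "_");
        return cleaned + ".mp3";
    }

    public static void main(String[] args) {
        // Hypothetical title, not taken from the site
        System.out.println(toFileName("TPO 1: Lecture 2?"));
    }
}
```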

Reposted from: https://my.oschina.net/jiangqw/blog/3013965
