How to Clone a JavaScript-Rendered Website

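Result: I did not get it working.
Problem: I tried to clone a JavaScript-rendered website, and none of the many methods I tried succeeded. Even after downloading the HTML, opening it in a browser still shows a blank page with no content.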

It looked like such a simple problem...
Frustrating.

try0 - Download the page with wget

Failed.

Saving the page as HTML with right-click "Save as" did not work either.

try1 - HTTrack

Failed.
It can only clone static websites.

try2 - Web2Disk, Octoparse, WebCopy

Opened in a browser, the result is still a blank page with no content.

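try3 - Python

Approach:
Web crawlers usually run into two related problems: they cannot see content that is rendered by JavaScript, because they cannot execute the JavaScript that generates it. One way around this:

Use Selenium: Selenium is a browser automation tool (originally built for testing) that drives a real browser. The page's JavaScript runs as it would for a user, so dynamically generated content can be captured.

Problem:
The saved page is still blank when opened in a browser, and the run also logged errors. The error output: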

[24600:25308:0828/200815.476:ERROR:ssl_client_socket_impl.cc(882)] handshake failed; returned -1, SSL error code 1, net_error -107
[24600:25308:0828/200816.267:ERROR:ssl_client_socket_impl.cc(882)] handshake failed; returned -1, SSL error code 1, net_error -107

After changing the url to a different website, the content is scraped normally and the program reports no errors.

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
import os

chrome_driver_path = "C:/Users/whale/.cache/selenium/chromedriver/win64/128.0.6613.84/chromedriver.exe" # Replace with your actual path to chromedriver
service = Service(executable_path=chrome_driver_path)

chrome_options = Options()
#chrome_options.add_argument('--ignore-certificate-errors') # Ignore SSL errors
chrome_options.add_argument('--incognito') # Add incognito mode

driver = webdriver.Chrome(service=service, options=chrome_options)

url = 'http://xx:8008/wui/index.html#/?logintype=1&_key=ag0e2j'
driver.maximize_window()
driver.get(url)
data = driver.page_source

with open('./test.html', 'w', encoding='utf-8') as fp:
    fp.write(data)

driver.quit()
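
One possible reason the saved file is blank: the script reads `driver.page_source` right after `driver.get()`, which may be before the scripts have rendered anything, and the saved HTML still refers to its assets by root-relative paths. The following is a minimal sketch of one way to address both, not a verified fix for this site. It assumes the app renders into an element with id `container` (as the `/wui/index.html` response captured later in this post suggests). The function names, the 30-second timeout, and the idea of injecting a `<base>` tag are my own assumptions.

```python
import re


def inject_base_tag(html: str, base_url: str) -> str:
    """Insert <base href=...> right after the first opening <head> tag, if any."""
    base = f'<base href="{base_url}">'
    # (?:\s[^>]*)? allows attributes on <head> without matching <header>.
    return re.sub(
        r"(<head(?:\s[^>]*)?>)",
        lambda m: m.group(1) + base,
        html,
        count=1,
        flags=re.IGNORECASE,
    )


def save_rendered_page(url: str, base_url: str, out_path: str, timeout: float = 30.0) -> None:
    # Imported here so the pure helper above works without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Chrome()  # Selenium 4.6+ can locate chromedriver itself
    try:
        driver.get(url)
        # Wait until the (assumed) mount point has at least one child element.
        WebDriverWait(driver, timeout).until(
            lambda d: d.find_elements(By.CSS_SELECTOR, "#container > *")
        )
        html = driver.page_source
    finally:
        driver.quit()

    with open(out_path, "w", encoding="utf-8") as fp:
        fp.write(inject_base_tag(html, base_url))
```

Even with the `<base>` tag, the scripts in the snapshot run again when the file is opened, and API calls made from a `file://` page are cross-origin, so the local copy may still not match the live page.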

try4 - Capture the traffic with Burp

Request

GET /wui/index.html HTTP/1.1
Host: oa.szeastroc.com:8008
Cache-Control: max-age=0
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.6099.71 Safari/537.36
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7
Accept-Encoding: gzip, deflate, br
Accept-Language: en-US,en;q=0.9
Cookie: __clusterSessionIDCookieName=10095022-e780-4382-b180-9d549d17f48f; __clusterSessionCookieName=8EF52A11A84B11C4FBE02783959AFD0C; JSESSIONID=10095022-e780-4382-b180-9d549d17f48f; ecology_JSessionid=10095022-e780-4382-b180-9d549d17f48f; __randcode__=f1b13cbf-cb4b-49cf-b00d-af2562d7b26a
If-None-Match: "Cadx5d5LI9+"
If-Modified-Since: Thu, 30 May 2024 08:15:48 GMT
Connection: close

Response

HTTP/1.1 304 Not Modified
Server: WVS
X-Frame-Options: SAMEORIGIN
X-XSS-Protection: 1
X-UA-Compatible: IE=8
Cache-Control: no-cache,must-revalidate,proxy-revalidate,max-age=0
ETag: "Cadx5d5LI9+"
Connection: close
Date: Wed, 28 Aug 2024 12:51:08 GMT

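HTTP/1.1 304 Not Modified is a response status indicating that the requested resource has not changed since the version the client already holds. It is used for cache revalidation, which saves network bandwidth and speeds up page loads.

Why 304 Not Modified is returned:

  1. Browser cache: when the browser requests a resource (an HTML file, CSS stylesheet, JavaScript file, or image) for which it holds a cached copy that has to be revalidated, it sends an If-Modified-Since and/or If-None-Match header so the server can check whether the resource has been updated.

  2. Server response: the server uses these headers to check whether the requested resource has changed.

    • If the resource has not changed, the server returns 304 Not Modified and does not send the body (the file content).
    • If the resource has changed, the server returns 200 OK with the updated resource.

Typical uses:

  • Better performance: the browser avoids re-downloading unchanged resources, which reduces server load and bandwidth use.
  • Faster page loads: unchanged resources are served straight from the cache instead of being transferred again.

304 Not Modified is a normal HTTP response code used to optimize resource loading and reduce bandwidth. It means the copy in the client's cache is current and does not need to be downloaded again. This is usually the desired behavior because it improves page performance. If you need to make sure you always get the full, latest content, adjust the caching policy or disable caching in the request, for example by removing the If-None-Match and If-Modified-Since headers.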
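
The request captured above carries `If-None-Match` and `If-Modified-Since`, which is why the server answered with a body-less 304. As a rough sketch of how a script could avoid that, the helper below drops those two headers before sending a request with the standard-library `urllib`. The function names and the 10-second timeout are illustrative assumptions, and this has not been run against the site in this post.

```python
import urllib.request

# Header names that turn a request into a conditional (revalidation) request.
CONDITIONAL_HEADERS = {"if-none-match", "if-modified-since"}


def strip_conditional_headers(headers: dict) -> dict:
    """Return a copy of headers without the cache validators (case-insensitive)."""
    return {k: v for k, v in headers.items() if k.lower() not in CONDITIONAL_HEADERS}


def fetch_full(url: str, headers: dict) -> bytes:
    """Fetch url unconditionally, so the server cannot reply with 304."""
    req = urllib.request.Request(url, headers=strip_conditional_headers(headers))
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.read()
```

In Burp, deleting the same two header lines from the request in Repeater should have the same effect.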

Re-requesting the root

So I re-requested the root path: GET / HTTP/1.1

HTTP/1.1 200 OK
Server: WVS
X-Frame-Options: SAMEORIGIN
X-XSS-Protection: 1
X-UA-Compatible: IE=8
Cache-Control: private,max-age=86400000
ETag: "+bN7Mb724j6"
Last-Modified: Tue, 17 Jul 2018 09:20:10 GMT
Content-Type: text/html; charset=utf-8
Content-Length: 3139
Connection: close
Date: Wed, 28 Aug 2024 12:53:19 GMT

<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<script type="text/javascript" src="/js/jquery/jquery_wev8.js"></script>
<script type="text/javascript" src="/js/jquery/plugins/client/jquery.client_wev8.js"></script>
<script type="text/javascript" src="/system/index_wev8.js"></script>
</head>

<script language="javascript1.1">

window.onload=function()
{
    var redirectUrl = "/wui/index.html#/?logintype=1" ;
    window.location.href=redirectUrl

}

function checkPopupBlocked(poppedWindow) {
    setTimeout(function(){
        var flag= false
        if(jQuery.client.browser=="Chrome"){
            flag = doCheckPopupBlocked(poppedWindow);
        }else{
            if(poppedWindow!=null){
                flag = false;
            }else{
                flag = true;
            }
        }

        if(flag){
            var redirectUrl = "/wui/index.html#/?logintype=1" ;

            var helpurl=getHelpUrl();
            if(helpurl!=""){
                var yn = confirm(msg);
                if(!yn) location.href = redirectUrl;
                if(yn) location.href = helpurl;
            }else{
                location.href = redirectUrl;
            }

        }else{

            window.open('','_self');
            window.close();
        }
    },500);
}

function doCheckPopupBlocked(poppedWindow) {

    var result = false;
    //alert(poppedWindow.closed)
    try {
        if (typeof poppedWindow == 'undefined') {
            // Safari with popup blocker... leaves the popup window handle undefined
            result = true;
            //alert(1)
        }
        else if (poppedWindow && poppedWindow.closed) {
            // This happens if the user opens and closes the client window...
            // Confusing because the handle is still available, but it's in a "closed" state.
            // We're not saying that the window is not being blocked, we're just saying
            // that the window has been closed before the test could be run.
            result = false;
            //alert(2)
        }
        else if (poppedWindow && poppedWindow.outerWidth == 0) {
            // This is usually Chrome's doing. The outerWidth (and most other size/location info)
            // will be left at 0, EVEN THOUGH the contents of the popup will exist (including the
            // test function we check for next). The outerWidth starts as 0, so a sufficient delay
            // after attempting to pop is needed.
            result = true;
            //alert(3)
        }
        else if (poppedWindow && poppedWindow.test) {
            // This is the actual test. The client window should be fine.
            result = false;
            //alert(4)
        }
        else {
            // Else we'll assume the window is not OK
            result = false;
            //alert(5)
        }

    } catch (err) {
        //if (console) {
        // console.warn("Could not access popup window", err);
        //}
    }

    return result
}

function getHelpUrl()
{

    return "/help/sys/help.html"

}

</script>
</html>
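
The root page above has no visible content of its own. Its `window.onload` handler sets `window.location.href` to `redirectUrl`, so a client that does not run JavaScript (wget, for example) never follows the redirect. The sketch below shows one way a script could pull that target out of the HTML and resolve it. Note that everything after `#` is a fragment that stays in the browser, which matches the Burp capture where only `/wui/index.html` was requested. The regular expression and function name are illustrative assumptions tied to this particular page's markup.

```python
import re
from typing import Optional, Tuple
from urllib.parse import urldefrag, urljoin

# Matches: var redirectUrl = "<target>"
REDIRECT_RE = re.compile(r'var\s+redirectUrl\s*=\s*"([^"]+)"')


def find_js_redirect(html: str, page_url: str) -> Optional[Tuple[str, str]]:
    """Return (url_to_request, fragment) for the first redirectUrl assignment, or None."""
    match = REDIRECT_RE.search(html)
    if match is None:
        return None
    absolute = urljoin(page_url, match.group(1))
    # The fragment is handled client-side and is never sent to the server.
    url, fragment = urldefrag(absolute)
    return url, fragment
```

With a placeholder host such as `http://example.test:8008/`, the redirect target in this response resolves to `http://example.test:8008/wui/index.html` with the fragment `/?logintype=1`.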

Requesting /wui/index.html

The response is shown below.

Opened in a browser, it is a blank page with no content.

HTTP/1.1 200 OK
Server: WVS
X-Frame-Options: SAMEORIGIN
X-XSS-Protection: 1
X-UA-Compatible: IE=8
Cache-Control: no-cache,must-revalidate,proxy-revalidate,max-age=0
ETag: "Cadx5d5LI9+"
Last-Modified: Thu, 30 May 2024 08:15:48 GMT
Content-Type: text/html; charset=utf-8
Content-Length: 5406
Connection: close
Date: Wed, 28 Aug 2024 12:55:42 GMT



<!doctype html><html><head><meta charset="UTF-8"><meta http-equiv="X-UA-Compatible" content="IE=edge"><meta name="viewport" content="width=device-width,initial-scale=1,minimum-scale=1,maximum-scale=1,user-scalable=no"/><link rel="stylesheet" href="/cloudstore/resource/pc/com/v1/index.min.css?v=1717056632140"><link rel="stylesheet" href="/cloudstore/resource/pc/com/v1/ecCom.min.css?v=1717056632140"><script type="text/javascript" src="/cloudstore/resource/pc/jquery/jquery.min.js?v=20200605"></script><script type="text/javascript">var faviconUrl=window.ecologyContentPath+"/favicon.ico";document.writeln('<link rel="icon" href="'+faviconUrl+'" mce_href="'+faviconUrl+'" type="image/x-icon">'),document.writeln('<link rel="shortcut icon" href="'+faviconUrl+'" mce_href="'+faviconUrl+'" type="image/x-icon">')</script><script>localStorage.setItem("staticVersion",1717056636036)</script></head><body><div id="container"></div><script type="text/javascript" src="/cloudstore/resource/pc/polyfill/polyfill.min.js"></script><!-- Polyfills --><!--[if lt IE 10]>

<script type="text/javascript" src="/cloudstore/resource/pc/shim/shim.min.js"></script>

<![endif]--><script type="text/javascript">var agent=window.navigator.userAgent;if(-1<agent.indexOf("Windows")&&-1<agent.indexOf("Safari")&&agent.indexOf("Chrome")<0)window.location.href="/wui/common/page/sysRemind.jsp?labelid=7";else if((-1<agent.indexOf("Chrome")||-1<agent.indexOf("CriOS"))&&-1<agent.indexOf("Safari")){var regStr_chrome=/chrome\/[\d.]+/gi,chrome_info=agent.match(regStr_chrome),chrome_version=chrome_info[0].replace("Chrome/","").split(".")[0];chrome_version<49&&$.ajax({url:"/api/system/info/interceptionChrome",type:"GET",success:function(e){"true"===JSON.parse(e).isInterceptChorme&&(window.location.href="/wui/common/page/sysRemind.jsp?labelid=6")}})}else{var isIE=-1<agent.indexOf("MSIE")||-1<agent.indexOf("Trident"),IEVersion=0;agent.replace(/MSIE ([\d.]+)/g,function(e){var n=parseInt(e.replace(/^.*MSIE ([\d.]+).*$/,"$1"),10);return IEVersion<n&&(IEVersion=n),e}),isIE&&0<IEVersion&&IEVersion<10&&(window.location.href="/wui/common/page/sysRemind.jsp?labelid=4")}</script><script type="text/javascript" src="/cloudstore/resource/pc/react16/react.production.min.js?v=1607482078878"></script><script type="text/javascript" src="/cloudstore/resource/pc/react16/react-dom.production.min.js?v=1607482078878"></script><script type="text/javascript" src="/cloudstore/resource/pc/react16/prop-types.min.js"></script><script type="text/javascript" src="/cloudstore/resource/pc/react16/create-react-class.min.js"></script><script>React.PropTypes=PropTypes,React.createClass=createReactClass</script><!-- 全局依赖 --><script type="text/javascript" src="/cloudstore/resource/pc/promise/promise.min.js"></script><script type="text/javascript" src="/cloudstore/resource/pc/fetch/fetch.min.js"></script><!-- 组件库 --><script type="text/javascript" src="/spa/moduleConfig.js?v=1717056632140"></script><script type="text/javascript" src="/spa/coms/ssoConfig/config.js?v=1717056632140"></script><script type="text/javascript" 
src="/cloudstore/resource/pc/com/v1/index.min.js?v=1717056632140"></script><script type="text/javascript" src="/cloudstore/resource/pc/com/v1/ecCom.min.js?v=1717056632140"></script><!-- mobx --><script type="text/javascript" src="/cloudstore/resource/pc/mobx-3.1.16/mobx.umd.js"></script><script type="text/javascript" src="/cloudstore/resource/pc/mobx-react-4.2.1/index.js?v=1593485931122"></script><script type="text/javascript" src="/cloudstore/resource/pc/react-router/ReactRouter.min.js?v=1593485931122"></script><script type="text/javascript" src="/spa/coms/index.mobx.js?v=1717056632140"></script><!-- zDialog --><script type="text/javascript" src="/js/ecology8/lang/weaver_lang_7_wev8.js"></script><script type="text/javascript" src="/wui/theme/ecology8/jquery/js/zDialog_wev8.js"></script><!-- 门户公共 --><link rel="stylesheet" href="/spa/portal/public/index.css?v=1685480433504"><script type="text/javascript" src="/spa/portal/public/index.js?v=1685480433504"></script><!-- 登录接口 --><script type="text/javascript" src="/spa/hrm/staticLoginNew/loginNew.js?v=1685480858767"></script><!-- 门户页面 --><style id="portal-style"></style><link rel="stylesheet" href="/spa/portal/static/index.css?v=1685480721740"><script type="text/javascript" src="/spa/portal/static/index.js?v=1685480721740"></script><!-- 主题 --><style id="e9theme-style"></style><link rel="stylesheet" href="/spa/theme/static/index.css?v=1685480845785"><script type="text/javascript" src="/spa/theme/static/index.js?v=1685480845785"></script><!-- 收藏 --><link rel="stylesheet" href="/spa/favourite/static/index.css?v=1685481411304"/><!--多时区相关的js--><script type="text/javascript" src="/js/timeZone/timeZone.js"></script><!--rsa加密--><script type="text/javascript" src="/js/rsa/jsencrypt.js"></script><script type="text/javascript" src="/js/rsa/rsa.js"></script><script type="text/javascript" src="/formmode/js/CryptoJS3.1.2/aes_wev8.js"></script><script type="text/javascript" 
src="/formmode/js/CryptoJS3.1.2/mode_ecb_wev8.js"></script><script type="text/javascript" src="/js/weaver_encrypt/weaver_encrypt.js"></script><!-- <script type="text/javascript" src="/js/jquery/jquery_wev8.js"></script> --><script type="text/javascript" src="/spa/main/index-mobx.js?v=5ed2399c"></script></body></html>
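
A likely explanation for the blank page: this response is only a shell. The body contains an empty `<div id="container"></div>`, and everything visible is produced by the scripts it references through root-relative paths such as `/cloudstore/...` and `/spa/...`. When the saved file is opened from disk, those paths no longer point at the server, so nothing loads. As a starting point for mirroring them, the sketch below lists the script and stylesheet URLs a page references, using only the standard library. The class and function names are my own, and downloading the assets is left out.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class AssetCollector(HTMLParser):
    """Collect <script src> and <link href> values in document order."""

    def __init__(self) -> None:
        super().__init__()
        self.assets = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if tag == "script" and attr_map.get("src"):
            self.assets.append(attr_map["src"])
        elif tag == "link" and attr_map.get("href"):
            self.assets.append(attr_map["href"])


def list_assets(html: str, page_url: str) -> list:
    """Return absolute URLs for every script and link asset referenced by html."""
    collector = AssetCollector()
    collector.feed(html)
    # Resolve root-relative paths against the page they came from.
    return [urljoin(page_url, asset) for asset in collector.assets]
```

Fetching these files alone may not be enough, since the scripts will probably also call server APIs at runtime (the inline script above already references `/api/system/info/interceptionChrome`).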