Note: link to the original article [click here]
1. This is based on the original article with slight modifications: code for the MySQL database was added, but the results displayed in the database are not ideal, mainly because I am not sure how its format should be set.
2. The highlight of this article is using JSON to parse asynchronously loaded data. The method used is somewhat clumsy, but it works: open Firefox's Network panel and filter by XHR, then scroll the page down gradually to trigger GET requests; inspect the responses on the right, and work out the offset step pattern and the corresponding JSON URL, as shown in the figure below (see also the quick check after this item):
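As a quick sanity check that the endpoint found in the XHR panel really returns the article data, a single request can be made by hand. This is a minimal sketch; the trailing `_` timestamp parameter seen in the captured URL is a cache-buster and is assumed to be safe to omit here:

#coding:utf-8
import requests

# Fetch one page of the JSON API discovered via the Network/XHR panel
resp = requests.get('http://www.guokr.com/apis/minisite/article.json'
                    '?retrieve_type=by_subject&limit=20&offset=20')
print(resp.json().keys())  # should include 'result', the list of articles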
3. Setup: MongoDB needs to be installed and configured; installing its GUI tool Robomongo is also recommended, as it makes things much more convenient. The MongoDB code:
#coding:utf-8
import json

import requests
import pymongo

def dealData(url):
    # Connect to the local MongoDB server and select database/collection
    client = pymongo.MongoClient('localhost', 27017)
    guoke = client['guoke']
    guokeData = guoke['guokeData']
    # The URL returns JSON directly, so no HTML parsing is needed
    web_data = requests.get(url)
    datas = json.loads(web_data.text)
    print(datas.keys())
    # Each entry in 'result' is one article record (a dict);
    # insert it into MongoDB unchanged
    for data in datas['result']:
        guokeData.insert_one(data)

def start():
    # offset steps by 20 per page, matching the XHR requests
    # observed while scrolling
    urls = ['http://www.guokr.com/apis/minisite/article.json?retrieve_type=by_subject&limit=20&offset={}&_=1462252453410'.format(str(i)) for i in range(20, 100, 20)]
    for url in urls:
        dealData(url)

start()
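To verify the inserts without a GUI, the collection can also be inspected directly from pymongo. A small sketch, assuming the same connection settings as above:

import pymongo

client = pymongo.MongoClient('localhost', 27017)
guokeData = client['guoke']['guokeData']
print(guokeData.count_documents({}))  # total number of stored articles
print(guokeData.find_one())           # one document, to inspect its fields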
4. The output as shown in Robomongo, see the figure:
5. Using the MySQL database: since I was not sure what format the result should have, I used varchar, but that stuffs an entire dict into a single varchar field, which makes the values extremely long and unpleasant to look at. Recorded here for reference only. The code:
#coding:utf-8
import json

import requests
import pymysql

def dealData(url):
    web_data = requests.get(url)
    datas = json.loads(web_data.text)
    print(datas.keys())
    for data in datas['result']:
        print(data)
        print(type(data))  # each entry is a dict
        connection = pymysql.connect(host='localhost',
                                     user='root',
                                     password='XXXX',
                                     db='guoke',
                                     charset='utf8')
        try:
            # Create a cursor
            with connection.cursor() as cursor:
                # Build the SQL statement; the whole dict is
                # serialized into one varchar column
                sql = 'insert into `guoke1` (`result`) values (%s)'
                # Execute the statement; note the one-element parameter tuple
                cursor.execute(sql, (str(data),))
                # Commit the transaction
                connection.commit()
        finally:
            connection.close()

def start():
    urls = ['http://www.guokr.com/apis/minisite/article.json?retrieve_type=by_subject&limit=20&offset={}&_=1462252453410'.format(str(i)) for i in range(20, 100, 20)]
    for url in urls:
        dealData(url)

start()
6. The MySQL data display, shown in the figure below, is rather messy:
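One way to make the MySQL output less messy would be to pull a few known fields out of each dict into typed columns, instead of dumping the whole dict into one varchar. This is a hedged sketch, not code from the original post: the field names 'id', 'title' and 'url' are assumptions about the keys in the article.json items, and the table `guoke2` is hypothetical:

#coding:utf-8
import json

import requests
import pymysql

# Assumed (hypothetical) table:
# CREATE TABLE guoke2 (id VARCHAR(32), title VARCHAR(255), url VARCHAR(255));
connection = pymysql.connect(host='localhost', user='root',
                             password='XXXX', db='guoke', charset='utf8')
datas = json.loads(requests.get(
    'http://www.guokr.com/apis/minisite/article.json'
    '?retrieve_type=by_subject&limit=20&offset=20').text)
try:
    with connection.cursor() as cursor:
        for data in datas['result']:
            # .get() returns None for missing keys instead of raising
            sql = 'insert into `guoke2` (`id`, `title`, `url`) values (%s, %s, %s)'
            cursor.execute(sql, (str(data.get('id')),
                                 data.get('title'),
                                 data.get('url')))
    connection.commit()
finally:
    connection.close()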
7. Summary: scraping asynchronously loaded pages is somewhat troublesome; later on I will use the Selenium tool, which makes this relatively easier to handle. At the same time, one can feel that MongoDB has its strengths here: its schema-free documents take the JSON records as-is.
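For reference, this is roughly what the Selenium approach might look like: a minimal sketch of the assumed follow-up, not code from this post. It drives Firefox (the geckodriver binary must be on the PATH), scrolls to trigger the lazy-loading requests, then reads the rendered HTML:

#coding:utf-8
import time
from selenium import webdriver

driver = webdriver.Firefox()  # requires geckodriver on the PATH
driver.get('http://www.guokr.com/scientific/')
for _ in range(3):
    # Scrolling to the bottom triggers the same XHR requests that
    # were observed in the Network panel
    driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')
    time.sleep(2)  # crude wait for the new articles to load
html = driver.page_source  # fully rendered page, ready for BeautifulSoup
driver.quit()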