Crawling Guokr's asynchronously loaded data with json and mongo [Python]

Note: based on an original article (link: 【点我】).

1. This post slightly modifies the original: I added MySQL code, but the way the data displays in the database is not ideal, mainly because I am unsure what column format to use for it.

2. The highlight of this post is using json to parse the asynchronously loaded data. The method is somewhat tedious, but it works: open Firefox's Network panel with the XHR filter, scroll the page down gradually to trigger the GET requests, inspect the responses on the right, and from those work out the offset step pattern and the corresponding JSON URL, as shown in the figure below. A quick probe of that endpoint follows.
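Before running the full crawler, it helps to confirm the response structure once by hand. This is a minimal sketch: the endpoint and its limit/offset parameters are the ones used later in this post (the trailing `_=...` cache-buster parameter is omitted here, which I assume the API tolerates), and it checks for the top-level 'result' key the crawler relies on.

#coding:utf-8
# Minimal probe of the JSON endpoint found via Firefox's Network/XHR panel.
import requests

probe_url = ('http://www.guokr.com/apis/minisite/article.json'
             '?retrieve_type=by_subject&limit=20&offset=20')
resp = requests.get(probe_url)
data = resp.json()          # parse the JSON body directly
print(data.keys())          # expect a top-level 'result' key
print(len(data['result']))  # one page should hold up to 20 articles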

3. Setup: install and configure MongoDB; I also recommend installing its GUI client Robomongo, which makes things much more convenient. Mongo code:

#coding:utf-8
import requests
import json
import pymongo

# the page being crawled; its content is loaded asynchronously from the JSON API below
url = 'http://www.guokr.com/scientific/'

def dealData(url):
    # connect to the local MongoDB instance and select the database/collection
    client = pymongo.MongoClient('localhost', 27017)
    guoke = client['guoke']
    guokeData = guoke['guokeData']
    web_data = requests.get(url)
    datas = json.loads(web_data.text)   # parse the JSON response body
    print(datas.keys())
    for data in datas['result']:        # each item is one article record (a dict)
        guokeData.insert_one(data)      # store the dict directly as a document

def start():
    # offsets 20, 40, 60, 80 -- one JSON page of 20 articles per request
    urls = ['http://www.guokr.com/apis/minisite/article.json?retrieve_type=by_subject&limit=20&offset={}&_=1462252453410'.format(i) for i in range(20, 100, 20)]
    for url in urls:
        dealData(url)

start()

4. Output as seen in Robomongo, as shown in the figure below.
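The inserted documents can also be checked without the GUI by querying through pymongo. This is a minimal sketch using the same database and collection names as above; it assumes a reasonably recent pymongo (count_documents needs 3.7+), and the 'title' and 'url' field names are assumptions about the API's records, which is why .get() is used.

#coding:utf-8
# quick sanity check of what was inserted, using the same names as above
import pymongo

client = pymongo.MongoClient('localhost', 27017)
guokeData = client['guoke']['guokeData']
print(guokeData.count_documents({}))    # total number of stored articles
for doc in guokeData.find().limit(3):   # peek at a few documents
    print(doc.get('title'), doc.get('url'))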

5. Using MySQL: since I was not sure what column type the result objects should use, I went with varchar; but that stuffs a whole dict into a single varchar value, which makes the rows extremely long and unpleasant to look at. Recorded here for reference only (a cleaner alternative is sketched after the code). Code:

#coding:utf-8
import requests
import json
import pymysql.cursors

# the page being crawled; its content is loaded asynchronously from the JSON API below
url = 'http://www.guokr.com/scientific/'

def dealData(url):
    web_data = requests.get(url)
    datas = json.loads(web_data.text)   # parse the JSON response body
    print(datas.keys())
    # one connection per page is enough; no need to reconnect for every row
    connection = pymysql.connect(host='localhost',
                                 user='root',
                                 password='XXXX',
                                 db='guoke',
                                 charset='utf8')
    try:
        # create a cursor for this session
        with connection.cursor() as cursor:
            sql = 'insert into `guoke1` (`result`) values (%s)'
            for data in datas['result']:
                print(data)
                print(type(data))
                # note the trailing comma: execute expects a tuple of parameters
                cursor.execute(sql, (str(data),))
        # commit once per page
        connection.commit()
    finally:
        connection.close()

def start():
    # offsets 20, 40, 60, 80 -- one JSON page of 20 articles per request
    urls = ['http://www.guokr.com/apis/minisite/article.json?retrieve_type=by_subject&limit=20&offset={}&_=1462252453410'.format(i) for i in range(20, 100, 20)]
    for url in urls:
        dealData(url)

start()
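As an alternative to stuffing str(data) into a varchar, each record could be serialized with json.dumps and kept in a TEXT column (or a JSON column on MySQL 5.7+). This is only a sketch of that idea: the table name guoke2 and its schema are my assumptions, not part of the original post.

#coding:utf-8
# Alternative storage sketch: serialize each record as a JSON string.
# Assumed table (not from the original post):
#   CREATE TABLE guoke2 (id INT AUTO_INCREMENT PRIMARY KEY, result TEXT);
import json
import pymysql.cursors

def save_as_json(records):
    connection = pymysql.connect(host='localhost', user='root',
                                 password='XXXX', db='guoke',
                                 charset='utf8')
    try:
        with connection.cursor() as cursor:
            sql = 'insert into `guoke2` (`result`) values (%s)'
            for record in records:
                # ensure_ascii=False keeps Chinese text readable in the table
                cursor.execute(sql, (json.dumps(record, ensure_ascii=False),))
        connection.commit()
    finally:
        connection.close()

One advantage of this over str(data): the stored strings can be loaded back into dicts with json.loads, whereas str(data) produces Python repr syntax, which is not valid JSON.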

6. The MySQL data display, as shown in the figure below, is rather messy.

7. Summary: crawling asynchronously loaded pages is somewhat troublesome; a follow-up will use Selenium, which makes this kind of operation relatively convenient. At the same time, you can feel MongoDB's strengths here: the JSON records can be inserted directly as documents, with no column format to agonize over.