Python Multithreaded Crawler Basics

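Multithreading in Python is used much like multithreading in Java. In both, you can subclass the thread class and override its run method inside that subclass to carry out the task you want.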

A simple Python multithreading example

Run two threads at the same time, one printing "I'm thread A" and the other "I'm thread B".

#!/usr/bin/python3
import threading

class A(threading.Thread):
    def __init__(self):
        # Initialize the base Thread class before anything else.
        super().__init__()

    def run(self):
        # run() is what executes in the new thread after start() is called.
        # range(1, 10) yields 1..9, so the message is printed nine times.
        for i in range(1, 10):
            print("I'm thread A")

class B(threading.Thread):
    def __init__(self):
        super().__init__()

    def run(self):
        for i in range(1, 10):
            print("I'm thread B")

# start() launches each thread; the two loops then run concurrently,
# so the order of the A and B lines in the output is not fixed.
thread1 = A()
thread1.start()
thread2 = B()
thread2.start()

A simple example of using a Python queue

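#!/usr/bin/python3
import queue

a = queue.Queue()

# Items come back out in the order they were put in (FIFO).
a.put("小张")
a.put("小郑")
a.put("小李")

print(a.get())
a.task_done()  # after each get(), call task_done() once the item has been processed
print(a.get())
a.task_done()  # after each get(), call task_done() once the item has been processed
print(a.get())
a.task_done()  # after each get(), call task_done() once the item has been processed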
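
The reason for calling task_done() after every get() shows up once queue.join() is involved: join() blocks until task_done() has been called for every item that was put into the queue. Below is a minimal sketch of that interaction. The worker function, the results list, and the None sentinel are assumptions made for illustration and are not part of the example above.

```python
import queue
import threading

def worker(q, results):
    # Hypothetical consumer: process items until a None sentinel arrives.
    while True:
        item = q.get()
        if item is None:
            q.task_done()  # the sentinel also counts as an item
            break
        results.append(item.upper())
        q.task_done()      # mark this item as fully processed

q = queue.Queue()
results = []
t = threading.Thread(target=worker, args=(q, results))
t.start()

for name in ["zhang", "zheng", "li"]:
    q.put(name)
q.put(None)  # tell the worker to stop

q.join()     # returns only after task_done() was called for every item
t.join()
print(results)
```

If a task_done() call were left out, q.join() would never return, which is why each get() is paired with one.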

Approach to a multithreaded crawler

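First plan the overall flow of the program, along with the role of each thread and how the threads relate to one another. Three threads are needed in total.
Thread 1 is dedicated to collecting the relevant URLs and resolving them into real URLs, then writing them to the queue urlqueue, which exists solely to hold the URLs of individual articles.
Thread 2 runs in parallel with thread 1. It takes the article URLs supplied by thread 1 one at a time, crawls and processes the corresponding article, and saves the result locally.
Thread 3 is the control thread, which checks whether the program has finished. Without it, the program would not end on its own even after threads 1 and 2 have done all their work. So we create a separate thread dedicated to overall control: it sleeps for 60 seconds at a time, and if it then finds no tasks left in the queue, it terminates the program.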
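
The three-thread plan above can be sketched roughly as follows. This is only an outline under stated assumptions, not a finished crawler: resolve_url and fetch_article are hypothetical placeholders that perform no real network access, the example.com URLs are made up, the saved dictionary stands in for saving to local disk, and the polling interval is shortened from 60 seconds so the sketch finishes quickly.

```python
import queue
import threading
import time

urlqueue = queue.Queue()  # holds the URLs of individual articles
saved = {}                # stands in for "save to local disk"

def resolve_url(page):
    # Hypothetical: turn a listing entry into a real article URL.
    return "https://example.com/article/%d" % page

def fetch_article(url):
    # Hypothetical placeholder for the real HTTP request and parsing.
    return "content of " + url

class UrlProducer(threading.Thread):
    """Thread 1: collect article URLs and put them on urlqueue."""
    def __init__(self, pages):
        super().__init__()
        self.pages = pages

    def run(self):
        for page in self.pages:
            urlqueue.put(resolve_url(page))

class ArticleWorker(threading.Thread):
    """Thread 2: take URLs from urlqueue, then fetch and store each article."""
    def __init__(self):
        # Daemon thread: it blocks on get() forever, so it must not
        # keep the process alive once the controller has finished.
        super().__init__(daemon=True)

    def run(self):
        while True:
            url = urlqueue.get()
            saved[url] = fetch_article(url)
            urlqueue.task_done()

class Controller(threading.Thread):
    """Thread 3: wake up periodically and stop once no work is left."""
    def __init__(self, producer, interval):
        super().__init__()
        self.producer = producer
        self.interval = interval

    def run(self):
        while True:
            time.sleep(self.interval)
            if not self.producer.is_alive() and urlqueue.empty():
                urlqueue.join()  # wait for an item still being processed
                break

producer = UrlProducer(range(1, 4))
worker = ArticleWorker()
controller = Controller(producer, interval=0.1)  # the text uses 60 seconds
producer.start()
worker.start()
controller.start()
controller.join()
print(len(saved), "articles saved")
```

Making the worker a daemon thread is one way to let the program exit once the control thread returns. Another option would be to send the worker a sentinel value, as is often done with queue-based consumers.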