crawlingPeopleRelatedWeiboOriginalPost(pid, person_name)
Description
Extracts original Weibo posts related to a person from Sina Weibo and inserts them into the database.
Parameters
Parameter | Necessity | Type | Description |
---|---|---|---|
pid | required | int | person id |
person_name | required | string | name of the person to crawl |
Output
None
Implementation
- masterStart()
- wapLogIn()
- weiBoWapSearch(person_name, pid)
Related Work
None
Issues About The Crawler
- The Sina Weibo API is not very effective: it requires authorization, and the crawler would not pass Sina's review and verification.
- The crawler logs in to the Sina account using the browser's cookies.
- The crawler fetches data from weibo.cn instead of www.weibo.com, because the latter's post data is embedded in JavaScript and is difficult to extract.
- Multiple proxies are used to prevent Sina from blocking our IP.
- Multiple processes and accounts are used to speed up the crawler.
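
The points above can be illustrated with a minimal sketch. It is not the crawler's actual code: the helper names `build_request`, `opener_for_proxy` and `assign_jobs` are hypothetical, the cookie and proxy values are placeholders, and only the Python standard library is assumed. It shows how a cookie copied from a logged-in browser could be attached to a weibo.cn request, how a request could be routed through a proxy, and how people could be spread across several accounts and proxies before being handed to worker processes.

```python
import itertools
import urllib.request
from typing import Dict, List, Sequence, Tuple

# Mobile site named in the text; any concrete search path is left to the caller.
WAP_URL = "https://weibo.cn/"


def build_request(url: str, cookie: str) -> urllib.request.Request:
    """Build a request carrying a cookie string copied from a logged-in browser."""
    headers = {"Cookie": cookie, "User-Agent": "Mozilla/5.0"}
    return urllib.request.Request(url, headers=headers)


def opener_for_proxy(proxy: str) -> urllib.request.OpenerDirector:
    """Return an opener that sends HTTP and HTTPS traffic through one proxy."""
    handler = urllib.request.ProxyHandler({"http": proxy, "https": proxy})
    return urllib.request.build_opener(handler)


def assign_jobs(
    people: Sequence[Tuple[int, str]],
    cookies: Sequence[str],
    proxies: Sequence[str],
) -> List[Dict[str, object]]:
    """Pair each (pid, person_name) with an account cookie and a proxy, round-robin."""
    cookie_cycle = itertools.cycle(cookies)
    proxy_cycle = itertools.cycle(proxies)
    return [
        {
            "pid": pid,
            "person_name": name,
            "cookie": next(cookie_cycle),
            "proxy": next(proxy_cycle),
        }
        for pid, name in people
    ]
```

Each job returned by `assign_jobs` is self-contained, so one possible design is to pass the list to a `multiprocessing.Pool` whose worker calls `opener_for_proxy(job["proxy"]).open(build_request(url, job["cookie"]))`. Rotating accounts and proxies together spreads the request load so that no single account or IP address carries all the traffic.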