crawlingRetweet(retweet_url, weibo_id)

Description

Extracts the retweets of an original Weibo post and inserts them into the database.

Parameters

Parameter    Necessity  Type    Description
retweet_url  required   string  URL of the retweet page of the original Weibo post
weibo_id     required   string  ID of the original Weibo post

Output

Parameter  Type    Description
status     string  Running status of the crawler
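
The interface above can be sketched in Python roughly as follows. This is a minimal sketch under stated assumptions, not the crawler's actual code: the snake_case name crawling_retweet, the injected fetch and conn arguments, the retweet table layout, the `<div class="c">` markup and the wording of the status strings are all hypothetical and chosen only for illustration.

```python
import sqlite3
from html.parser import HTMLParser


class RetweetParser(HTMLParser):
    """Collect the text of every <div class="c"> element.

    Assumption: each retweet on the page is wrapped in such a div.
    """

    def __init__(self):
        super().__init__()
        self.retweets = []
        self._depth = 0    # nesting depth inside the current retweet div
        self._chunks = []  # text fragments of the current retweet

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self._depth > 0:
            self._depth += 1          # nested div inside a retweet
        elif ("class", "c") in attrs:
            self._depth = 1           # a new retweet starts here
            self._chunks = []

    def handle_endtag(self, tag):
        if tag == "div" and self._depth > 0:
            self._depth -= 1
            if self._depth == 0:
                text = "".join(self._chunks).strip()
                if text:
                    self.retweets.append(text)

    def handle_data(self, data):
        if self._depth > 0:
            self._chunks.append(data)


def crawling_retweet(retweet_url, weibo_id, fetch, conn):
    """Fetch the retweet page, store each retweet, return a status string.

    fetch(url) -> str and conn (a sqlite3 connection) are hypothetical
    hooks, injected so the sketch can run without network access.
    """
    if not retweet_url or not weibo_id:
        return "error: retweet_url and weibo_id are required"
    try:
        html = fetch(retweet_url)
    except OSError as exc:  # urllib.error.URLError is a subclass of OSError
        return "error: %s" % exc

    parser = RetweetParser()
    parser.feed(html)

    conn.execute(
        "CREATE TABLE IF NOT EXISTS retweet (weibo_id TEXT, content TEXT)"
    )
    conn.executemany(
        "INSERT INTO retweet VALUES (?, ?)",
        [(weibo_id, text) for text in parser.retweets],
    )
    conn.commit()
    return "ok: stored %d retweets" % len(parser.retweets)
```

Passing fetch in as an argument keeps the parsing and storage logic independent of how the page is downloaded, so it can be exercised with a canned HTML string and an in-memory SQLite database.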

Implementation

  1. masterStart(). Creates multiple processes to begin crawling data.
  2. wapLogIn(). Logs in to the Sina account.
  3. weiBoWapSearch(searchStr, Sid). Uses searchStr (a person or company name) and Sid (the corresponding person or company ID) to search for related Weibo posts.
    • extractTopic(person_name or company_name, person_id or company_id). Extracts the Weibo text and inserts it into the database.
    • getRetweet(retweet_url, weibo_id). Extracts the retweets of the original Weibo post and inserts them into the database.
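
The call order of the steps above can be sketched as follows. This is only an illustration of the control flow, assuming Python and the standard multiprocessing module; the function bodies are placeholders, and the snake_case names, the task tuple format and the returned strings are hypothetical.

```python
from multiprocessing import Pool


def wap_log_in():
    """Placeholder for wapLogIn(): would return an authenticated session."""
    return {"logged_in": True}


def weibo_wap_search(session, search_str, sid):
    """Placeholder for weiBoWapSearch().

    The real step would search for search_str and, for every hit, call
    extractTopic() and getRetweet(). Here it only reports what it was given.
    """
    return "done: %s (%s)" % (search_str, sid)


def worker(task):
    """Run one search task in its own process: log in, then search."""
    search_str, sid = task
    session = wap_log_in()
    return weibo_wap_search(session, search_str, sid)


def master_start(tasks, processes=4):
    """Distribute (search_str, sid) tasks over a pool of worker processes.

    With processes <= 1 the tasks run serially in the current process,
    which is convenient for debugging.
    """
    if processes <= 1:
        return [worker(task) for task in tasks]
    with Pool(processes=processes) as pool:
        return pool.map(worker, tasks)
```

For example, master_start([("Example Corp", "c001")], processes=1) runs the single task in the calling process. When a process count above 1 is used from a script, the call should sit under an `if __name__ == "__main__":` guard so that worker processes can be started safely on every platform.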

Related Work

None

Issues About The Crawler

  1. The Sina Weibo API is not very effective: it requires authorization, and the crawler would not pass Sina's review and verification.
  2. The crawler logs in to the Sina account using the browser's cookies.
  3. The crawler uses weibo.cn instead of www.weibo.com, because the latter's tweet data is embedded in JavaScript and is difficult to extract.
  4. Multiple proxies are used to prevent Sina from blocking our IP.
  5. Multiple processes and accounts are used to speed up the crawler.
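
Points 2, 4 and 5 can be combined in a small helper. The following is a minimal sketch using Python's standard urllib.request, not the crawler's own code: the names make_opener and build_openers, the cookie strings, the proxy addresses and the User-Agent value are assumptions for illustration only.

```python
import itertools
import urllib.request


def make_opener(cookie_header, proxy):
    """Build an opener that sends one account's browser cookie via one proxy.

    cookie_header is the raw Cookie header value copied from a logged-in
    browser session; proxy is a URL such as "http://127.0.0.1:8080".
    """
    proxy_handler = urllib.request.ProxyHandler({"http": proxy, "https": proxy})
    opener = urllib.request.build_opener(proxy_handler)
    opener.addheaders = [
        ("Cookie", cookie_header),
        ("User-Agent", "Mozilla/5.0"),  # assumed value, mimics a browser
    ]
    return opener


def build_openers(cookie_headers, proxies):
    """Pair every account cookie with a proxy, reusing proxies round-robin."""
    if not proxies:
        raise ValueError("at least one proxy is required")
    return [
        make_opener(cookie, proxies[i % len(proxies)])
        for i, cookie in enumerate(cookie_headers)
    ]


def rotate(openers):
    """Endless iterator over the openers, one per request."""
    return itertools.cycle(openers)
```

A caller would take next(rotation) before each request and call its open() method with the page URL, so that consecutive requests alternate between accounts and proxies. No network traffic happens until open() is called.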
 
projs/clans/docs/crawlingretweet.txt · Last modified: 2014/02/04 18:32 by yangjunfeng0317