I’ve been using the Twitter API to search tweets, which has been good for ad hoc queries and low-volume analysis, returning hundreds to low thousands of tweets. I’d heard about the Twitter ‘firehose’ and the potential to access all tweets, and thought I’d dig a bit deeper.
Twitter provide three forms of access (see BrightPlanet):
- Twitter’s Search API (pull): returns at most 3200 tweets per request, reaches only the last 5000 tweets for a keyword, and caps the number of requests allowed in a given time period
- Twitter’s Streaming API (push): users register a set of criteria (keywords, usernames, locations, named places, etc.) and as tweets match the criteria, they are pushed directly to the user. Provides a sample of tweets – anywhere between 1% and 40% of available tweets
- Twitter’s Firehose: guaranteed to provide 100% of tweets that match your search criteria
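The pull versus push distinction in the list above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `TWEETS`, `matches`, `search_pull` and `stream_push` are hypothetical names invented here, the tweet source is an in-memory list, and nothing below calls the real Twitter API or reproduces its limits.

```python
from typing import Callable, Iterable, List

# Hypothetical in-memory stand-in for a tweet source (assumption: not real data,
# and no real Twitter endpoint is contacted anywhere in this sketch).
TWEETS: List[str] = [
    "Firehose pricing is steep",
    "Streaming API sample rates vary",
    "Lunch was great today",
    "Search API rate limits again",
]


def matches(text: str, keywords: Iterable[str]) -> bool:
    """Return True if any keyword occurs in the text, ignoring case."""
    lowered = text.lower()
    return any(keyword.lower() in lowered for keyword in keywords)


def search_pull(keywords: Iterable[str], max_results: int) -> List[str]:
    """Pull model: the client asks once and gets back a capped batch of matches."""
    keywords = list(keywords)
    hits = [tweet for tweet in TWEETS if matches(tweet, keywords)]
    return hits[:max_results]


def stream_push(
    source: Iterable[str],
    keywords: Iterable[str],
    on_tweet: Callable[[str], None],
) -> int:
    """Push model: criteria are registered once; each match is handed to a callback."""
    keywords = list(keywords)
    delivered = 0
    for tweet in source:
        if matches(tweet, keywords):
            on_tweet(tweet)
            delivered += 1
    return delivered


if __name__ == "__main__":
    # Pull: one request, capped result set.
    print(search_pull(["api"], max_results=1))

    # Push: matching tweets arrive one at a time through the callback.
    received: List[str] = []
    count = stream_push(iter(TWEETS), ["api"], received.append)
    print(count, received)
```

In the pull case the caller decides when to ask and has to live with the cap; in the push case the caller only supplies criteria and a handler, and the source decides when data arrives.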
Option 3 sounds wonderful, so why would one ever bother with the limitations of 1 and 2? Because 1 and 2 are free and 3 incurs a substantial charge. Twitter work with four companies, which have access to the Firehose and provide that access to data users:
- Gnip
- DataSift
- Topsy
- NTT Data
Apple bought Topsy in December 2013 and Twitter acquired Gnip in April 2014, leaving DataSift and NTT Data (which is based in Japan and focuses on Japanese tweet data) as the independent providers of Firehose data (The Guardian).
With access to the Firehose starting at around $500 to $3000 per month, depending on the specificity of access, it’s out of reach for all but the best-funded academics. In early 2014 Twitter gave limited (and competitive) access to the Firehose (via Gnip) to academics (Wired, Poynter). The scheme closed on 15 March 2014 and is not currently accepting submissions. To get an idea of the ways in which Twitter data is used for research, visit the Twitter engineering site.