The constant stream of breaking news, real-time updates, and individual opinions about events, products, or people is a rich source of information that can turn into valuable insights and business intelligence, if extracted in the right way.
Twitter offers an API platform that allows the search and collection of tweets from the public timeline using keywords or the feed of specific accounts. Currently, there are three different tiers of search APIs: Standard (free), Premium, and Enterprise. These tiers differ with regard to data limits, search operators, and technical support from Twitter. In most cases, creating a developer account to use the Standard API is sufficient.
But how exactly does the Twitter search API work?
Let’s say that we would like to collect the latest tweets that talk about tasty food. The query may be written as:
- `tasty food`, returning tweets containing both *tasty* and *food*, but not necessarily in that order
- `"tasty food"`, returning tweets with the exact phrase *tasty food*
- `tasty OR food`, returning tweets containing either *tasty* or *food*
- `#tastyfood`, returning tweets containing the hashtag #tastyfood
Of course, there are more operators listed in the Twitter API documentation for narrowing our search, some of which are only available for the Premium and Enterprise accounts (premium operators). The Twitter API also has the option of returning the latest tweets, the most popular tweets, or a mix of the latest and most popular tweets; in this case, we have used the option for the latest tweets.
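To make this concrete, here is a minimal sketch of how the example queries above map onto a standard search request. The endpoint shown is the v1.1 `search/tweets.json` URL; the `result_type` values (`recent`, `popular`, `mixed`) correspond to the latest, most popular, or mixed results described above. No request is actually sent here, and `build_search_request` is a helper invented for illustration:

```python
import urllib.parse

# Standard search endpoint (Twitter API v1.1)
SEARCH_URL = "https://api.twitter.com/1.1/search/tweets.json"

def build_search_request(query, result_type="recent", count=100):
    """Build the URL for one standard search request.

    result_type may be 'recent' (latest tweets), 'popular',
    or 'mixed' (a blend of the latest and most popular).
    """
    params = {"q": query, "result_type": result_type, "count": count}
    return SEARCH_URL + "?" + urllib.parse.urlencode(params)

# The four example queries from above, URL-encoded automatically:
for q in ['tasty food', '"tasty food"', 'tasty OR food', '#tastyfood']:
    print(build_search_request(q))
```

Note that the operators need no special treatment on our side: URL encoding takes care of spaces, quotes, and the `#` character.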
Once a query is submitted, the API sends a request to retrieve the tweets that match it. Depending on the account tier, the API limits the number of requests allowed within a given window of time. The most commonly used Standard API allows 180 requests per 15-minute window and searches up to 7 days back in the Twitter archive. Each request returns up to 100 tweets, so we can retrieve up to 18,000 tweets every 15 minutes until we reach the 7-day limit. If our task required access to tweets from the last 30 days (or even the full Twitter archive, which reaches back to 2006), as well as a greater number of retrieved tweets per request, then we would need a paid Premium or Enterprise API account.
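The collection loop itself can be sketched as follows, assuming the Standard-tier limits just described. Here `fetch_page` stands in for the actual HTTP call (so the loop can be shown without network access), and paging backward through the archive via `max_id` follows the standard search API's timeline-paging convention:

```python
import time

RATE_LIMIT = 180           # requests allowed per rate-limit window
WINDOW_SECONDS = 15 * 60   # length of one rate-limit window

def collect_tweets(fetch_page, max_requests=180, sleep=time.sleep):
    """Collect tweets page by page until the archive is exhausted.

    fetch_page(max_id) should return a list of tweet dicts with an
    'id' key, or an empty list once the 7-day archive runs out.
    """
    tweets, max_id, requests_made = [], None, 0
    while requests_made < max_requests:
        if requests_made and requests_made % RATE_LIMIT == 0:
            sleep(WINDOW_SECONDS)      # wait out the rate-limit window
        page = fetch_page(max_id)
        requests_made += 1
        if not page:
            break                      # reached the end of the archive
        tweets.extend(page)
        max_id = min(t["id"] for t in page) - 1  # page backward in time
    return tweets
```

In a real deployment the sleep should be driven by the rate-limit headers the API returns, rather than a fixed counter; the fixed counter is used here only to keep the sketch self-contained.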
The data is returned in JSON format and includes the actual tweet along with metadata about each tweet and the user who created it. Consequently, a JSON parser is required to have all this information “translated” into a tabular format and/or imported into a database.
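A minimal sketch of that flattening step is shown below. The sample payload is invented, but mirrors the shape of the standard search response: a `statuses` list in which each tweet carries a nested `user` object. The `to_rows` helper is hypothetical:

```python
import json

# Invented sample payload, shaped like a standard search response
sample = json.dumps({
    "statuses": [
        {"id": 1, "created_at": "Wed Oct 10 20:19:24 +0000 2018",
         "text": "Such tasty food!",
         "user": {"screen_name": "foodie", "followers_count": 42}},
    ]
})

def to_rows(payload):
    """Flatten the JSON response into one flat dict per tweet."""
    rows = []
    for status in json.loads(payload)["statuses"]:
        rows.append({
            "tweet_id": status["id"],
            "created_at": status["created_at"],
            "text": status["text"],
            "user": status["user"]["screen_name"],
            "followers": status["user"]["followers_count"],
        })
    return rows

print(to_rows(sample))
```

Each flat row can then be written to a CSV file or inserted into a database table.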
The time it takes to collect our data depends on the number of tweets matching our query, as well as the request and time-window limits of our tier. To continue with our previous example: if we have a Standard API account and our query for #tastyfood matches 500 tweets within the past 7 days, it will take 5 requests and a few seconds to collect them. But if there are 90,000 tweets within the past 7 days that match our query, it will take 900 requests spread over five 15-minute windows (a little over an hour) to complete our data collection.
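The arithmetic behind those estimates can be checked with a few lines, again assuming the Standard-tier limits (100 tweets per request, 180 requests per window):

```python
import math

def collection_estimate(n_tweets, per_request=100, per_window=180,
                        window_min=15):
    """Estimate requests, windows, and waiting time for a collection."""
    requests = math.ceil(n_tweets / per_request)
    windows = math.ceil(requests / per_window)
    wait_minutes = (windows - 1) * window_min  # waits between full windows
    return requests, windows, wait_minutes

print(collection_estimate(500))     # (5, 1, 0): a few seconds, no waiting
print(collection_estimate(90000))   # (900, 5, 60): about an hour of waiting
```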
Even though the free Standard search API only allows us to go back 7 days, we can still collect longitudinal datasets by moving forward instead of backward. For example, we can automate the task of sampling tweets at a regular interval, say once a week. Keep in mind that if it is a highly active topic, we may lose some tweets between our sampling windows; and if it is a less active topic, we may have some duplicates in our dataset that we would need to filter out.
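The forward-sampling idea can be sketched as below, with de-duplication by tweet ID to handle the overlap between windows. Here `run_query` is a placeholder for one full collection pass, and the scheduling itself (sleeping until the next pass) is omitted to keep the sketch self-contained:

```python
def sample_longitudinally(run_query, passes):
    """Accumulate tweets across repeated passes, dropping duplicates.

    run_query() should return the list of tweets collected in one
    pass; tweets seen in an earlier pass are skipped by ID.
    """
    seen, dataset = set(), []
    for _ in range(passes):
        for tweet in run_query():
            if tweet["id"] not in seen:  # overlap between windows
                seen.add(tweet["id"])
                dataset.append(tweet)
        # a real deployment would sleep here until the next pass
    return dataset
```

Because tweet IDs are unique and increase over time, keeping a set of seen IDs is enough to make repeated passes safe to merge.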
One of the main barriers for researchers and companies who need to collect data from Twitter is the requirement of basic programming skills to use the API and its output. For this reason, Megaputer has created a user-friendly Twitter data collection feature in its software, PolyAnalyst™, which takes the programming out of collecting the data by providing a convenient graphical user interface (GUI). Users only need to enter their account information and query, then select any further search options offered by the Twitter API (e.g., the language of the tweets). PolyAnalyst™ then collects the data, automatically parses the returned JSON output, and imports the data into a table for further analysis. In addition, the built-in Scheduler feature allows users to schedule longitudinal data collection over a specific period of time, which will then update the data analysis workflow to include the most recently collected data.
Try it out yourself by requesting a free trial.