“A simple collection of JSON grabbed from the general twitter stream, for the purposes of research, history, testing and memory. This is the “Spritzer” version, the most light and shallow of Twitter grabs. Unfortunately, we do not currently have access to the Sprinkler or Garden Hose versions of the stream.”
While this Twitter Stream content is a sampled *spritzer* version, each month contains approximately 50 gigabytes of compressed Twitter content. Dates range from 2011 through April 2018 (as of 10/02/2018), although coverage is not 100% complete as some months are missing.
Monthly archives are tarballs (.tar) containing hourly Tweet archives compressed as bzip2 files (.bz2). Once decompressed, the files are in the standard Twitter JSON format and contain all fields.
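As a rough illustration of this layout, the sketch below walks an extracted monthly archive and reads tweets straight from the .bz2 files without decompressing them to disk first. It is not taken from the notebook. It assumes each line of a decompressed file holds one JSON object, and that non-tweet records (such as delete notices) can be recognized by the absence of a `text` field. The function name `iter_tweets` is hypothetical.

```python
import bz2
import json
from pathlib import Path


def iter_tweets(root):
    """Yield one dict per tweet from every .bz2 file found under root.

    Assumes one JSON object per line (an assumption about the archive
    layout, not something this sketch verifies).
    """
    for path in sorted(Path(root).rglob("*.bz2")):
        # bz2.open in text mode decompresses on the fly, line by line
        with bz2.open(path, mode="rt", encoding="utf-8") as handle:
            for line in handle:
                line = line.strip()
                if not line:
                    continue
                try:
                    record = json.loads(line)
                except json.JSONDecodeError:
                    continue  # skip truncated or malformed lines
                # Assumed filter: keep only records that carry tweet text,
                # dropping other stream records such as delete notices
                if isinstance(record, dict) and "text" in record:
                    yield record
```

A caller could then loop over `iter_tweets("path/to/extracted/month")`, where the path is a placeholder for wherever the .tar file was extracted. Because the function is a generator, only one tweet is held in memory at a time, which matters at roughly 50 gigabytes per month.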
We developed a Python Jupyter Notebook to assist with parsing these files. It can be downloaded here: https://baylor.box.com/s/w3scjg51nrav429bso8r3i49xwaln0rl.
“”” Parses Twitter archives from Archive Team: The Twitter Stream Grab for a list of user-defined keywords
The Archive Team: The Twitter Stream Grab (https://archive.org/details/twitterstream) provides historic
downloads of Twitter archives by month. This script helps researchers to mine this content for a list of
words, phrases, or hashtags. This script requires the monthly archives to be downloaded and extracted from
the .tar archive before use.
Output is a .csv file containing one record per relationship. Relationships are classified as either
(1) reply, (2) mention, or (3) tweet. A reply is a direct response to another user’s post. A mention is
where another user is mentioned, but not in a direct reply. A tweet relationship is a tweet that contains
neither a reply nor a mention.
See the modify section below to specify (1) keywords/hashtags, (2) top-level directory, and
(3) output file name.