I would like to set up two processes to consume data from Twitter's Streaming API:
1) Consume all tweets that match a set of keywords, along with each tweet's username, user profile pic, longitude, latitude, and time created. This will use the Streaming API's "track" method (see the sketch after this list). [url removed, login to view]
2) Consume all data from Twitter's "Spritzer" stream, which is a ~1% random sample of all their tweets (also covered by the sketch below).
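To make the two consumers concrete, here is a minimal sketch of both, assuming the v1.1 statuses/filter and statuses/sample endpoints, the requests/requests_oauthlib packages, and placeholder OAuth credentials; the field names follow the standard tweet JSON payload:

```python
import json

import requests
from requests_oauthlib import OAuth1

# Placeholder credentials -- substitute real app tokens.
auth = OAuth1("CONSUMER_KEY", "CONSUMER_SECRET", "ACCESS_TOKEN", "ACCESS_SECRET")

def stream(method, url, **kwargs):
    """Yield one parsed tweet dict per line from a streaming endpoint."""
    with requests.request(method, url, auth=auth, stream=True, **kwargs) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines():
            if line:  # skip keep-alive blank lines
                yield json.loads(line)

def extract(tweet):
    """Pull out the fields the schema needs; coordinates are often null."""
    point = (tweet.get("coordinates") or {}).get("coordinates") or [None, None]
    return {
        "text": tweet.get("text"),
        "username": tweet["user"]["screen_name"],
        "profile_pic": tweet["user"]["profile_image_url"],
        "longitude": point[0],
        "latitude": point[1],
        "created_at": tweet.get("created_at"),
    }

# Process 1: keyword tracking (statuses/filter takes POST with a "track" param).
for tweet in stream("POST", "https://stream.twitter.com/1.1/statuses/filter.json",
                    data={"track": "keyword1,keyword2"}):
    print(extract(tweet))

# Process 2: the ~1% "Spritzer" sample would reuse the same helper:
#   stream("GET", "https://stream.twitter.com/1.1/statuses/sample.json")
```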
The processes must be deployed to Amazon EC2.
The data will be loaded into a database for which I already have a schema. Currently I am using MySQL, but depending on the data volume and load speed, I may want to scale up to AWS-hosted storage such as S3 (or a comparable resource), in which case I will need your help deploying the database there. The SLA is that Streaming API data must be available in the database no more than 10 seconds after it is received.
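For the load step, here is a minimal sketch assuming the pymysql driver, a hypothetical `tweets` table matching the fields above, and the `extract()` output from the previous sketch; autocommit makes each row visible immediately, which helps meet the 10-second SLA:

```python
import pymysql

# Placeholder connection details for the MySQL instance.
conn = pymysql.connect(host="localhost", user="twitter", password="secret",
                       database="tweets_db", autocommit=True)

INSERT_SQL = """
    INSERT INTO tweets (text, username, profile_pic, longitude, latitude, created_at)
    VALUES (%s, %s, %s, %s, %s, %s)
"""

def load(record):
    """Insert one extracted tweet record; the driver escapes the parameters."""
    with conn.cursor() as cur:
        cur.execute(INSERT_SQL, (record["text"], record["username"],
                                 record["profile_pic"], record["longitude"],
                                 record["latitude"], record["created_at"]))
```

If single-row inserts cannot keep up with the Spritzer volume, batching rows through cursor.executemany() is the usual first scaling step before moving off MySQL.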
The processes should be reliable and should report daily on data consumption and CPU/memory usage. If any of the errors listed here ([url removed, login to view]) are encountered, they should be logged, and in some cases an email alert should be triggered.
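On the monitoring side, here is a minimal sketch assuming the psutil package, a local SMTP server, placeholder addresses, and a hypothetical set of alert-worthy HTTP status codes (the actual set should come from the error list linked above); the daily report function would run from cron:

```python
import logging
import smtplib
from email.message import EmailMessage

import psutil

logging.basicConfig(filename="consumer.log", level=logging.INFO)

# Hypothetical alert set -- replace with the codes from the linked error list.
ALERT_CODES = {401, 403, 420, 503}

def record_error(status_code):
    """Log every Streaming API error; email only the alert-worthy ones."""
    logging.error("Streaming API returned HTTP %s", status_code)
    if status_code in ALERT_CODES:
        send_alert("Streaming API error %s" % status_code)

def send_alert(body):
    msg = EmailMessage()
    msg["Subject"] = "Twitter consumer alert"
    msg["From"] = "consumer@example.com"
    msg["To"] = "ops@example.com"
    msg.set_content(body)
    with smtplib.SMTP("localhost") as smtp:
        smtp.send_message(msg)

def daily_report(tweets_consumed):
    """Summarise consumption and host load once a day (e.g. from cron)."""
    logging.info("daily report: %d tweets, cpu %.1f%%, memory %.1f%%",
                 tweets_consumed, psutil.cpu_percent(interval=1),
                 psutil.virtual_memory().percent)
```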
Ideally the project will be written in Python, but I can compromise here.
If the project is completed in Python, meets the requirements, and is delivered on time, I can add a $125 bonus.