Decide on the best database to use and design the most cost-effective and appropriate database schema based on the application's needs. We host all of our servers on AWS and would prefer to keep using it for this database, but ultimately we want the best database for the job.
This database will store all compromised user information found on the internet, with the key identifier being a user's email address. Each hack inserted into the database may contain different information associated with a user. Similar services include [url removed, login to view] and [url removed, login to view]. The following link describes how one service did it with Azure: [url removed, login to view]
<b>Requirements:</b>
- Scale to billions of rows.
- Return all results for a given email address in milliseconds.
- Return all results for a given email domain in milliseconds.
- Store all associated information related to the email address.
- Store large amounts of unstructured data related to the email address.
- Allow detailed reporting on certain data points (for example, the most common password or the most common email domain).
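To make the reporting requirement concrete, here is a minimal sketch of the two example reports (most common password, most common email domain) run over a handful of records held in memory. The sample records, field names, and function names are hypothetical, and a real solution would run these aggregations inside the chosen database rather than in application code. Note that the records deliberately carry different fields, as each hack may.

```python
from collections import Counter

# Hypothetical sample records; field names and values are illustrative only.
# Each hack may supply different fields for a user.
records = [
    {"email": "alice@example.com", "breach": "site-a", "password": "123456"},
    {"email": "bob@example.com", "breach": "site-a", "password": "qwerty", "username": "bobby"},
    {"email": "alice@example.com", "breach": "site-b", "password": "123456", "ip": "203.0.113.7"},
    {"email": "carol@example.org", "breach": "site-b"},
]


def domain_of(email):
    """Return the lower-cased part of the address after the last '@'."""
    return email.rsplit("@", 1)[1].lower()


def most_common_password(rows):
    """Return (password, count) for the most frequent password, or None."""
    counts = Counter(r["password"] for r in rows if "password" in r)
    return counts.most_common(1)[0] if counts else None


def most_common_domain(rows):
    """Return (domain, count) for the most frequent email domain, or None."""
    counts = Counter(domain_of(r["email"]) for r in rows)
    return counts.most_common(1)[0] if counts else None
```

At billions of rows these counts would likely need to be precomputed or handled by an analytics-oriented store, which is part of what the suggested design should address.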
<b>Normal searches on the database:</b>
1. Search based on a specific email.
2. Search based on a specific email domain.
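The two searches above can be sketched as follows. This is only an illustration of the access patterns, using Python's built-in `sqlite3` module as a stand-in; it is not a recommendation of SQLite or of this schema. The table name `leaks`, its columns, and the helper functions are all hypothetical. The idea shown is storing the domain as its own indexed column so that a domain search does not have to scan every email.

```python
import sqlite3


def open_store():
    """Create an in-memory table with one index per search pattern."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE leaks ("
        "email TEXT NOT NULL, domain TEXT NOT NULL, breach TEXT NOT NULL)"
    )
    conn.execute("CREATE INDEX idx_leaks_email ON leaks (email)")
    conn.execute("CREATE INDEX idx_leaks_domain ON leaks (domain)")
    return conn


def add_leak(conn, email, breach):
    """Normalise the address and store its domain alongside it."""
    email = email.strip().lower()
    domain = email.rsplit("@", 1)[1]
    conn.execute("INSERT INTO leaks VALUES (?, ?, ?)", (email, domain, breach))


def find_by_email(conn, email):
    """Search 1: all results for a specific email address."""
    cur = conn.execute(
        "SELECT email, breach FROM leaks WHERE email = ? ORDER BY breach",
        (email.strip().lower(),),
    )
    return cur.fetchall()


def find_by_domain(conn, domain):
    """Search 2: all results for a specific email domain."""
    cur = conn.execute(
        "SELECT email, breach FROM leaks WHERE domain = ? ORDER BY email, breach",
        (domain.strip().lower(),),
    )
    return cur.fetchall()
```

For example, after `add_leak(conn, "alice@example.com", "site-a")`, calling `find_by_domain(conn, "example.com")` returns that row without touching addresses from other domains.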
<b>Deliverables:</b>
1. Suggested Database
2. Suggested Database Schema
3. How do we deal with multiple reported hacks for a single email address? Do we allow only one "database row" per email address, or add a new row every time a new hack is discovered? In most cases, 20% to 30% or more of the emails are duplicates.
4. Help setting up the initial database on AWS.
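To illustrate the two options raised in question 3, here is a minimal sketch that models each layout with a plain Python dictionary. It is meant only to show the trade-off, not to answer the question; the function names and record shapes are hypothetical, and a real database would express the same idea through its own key design.

```python
def upsert_per_email(table, email, breach, details):
    """Option A: one row per email; each new hack is merged into that row."""
    key = email.strip().lower()
    row = table.setdefault(key, {"email": key, "breaches": {}})
    row["breaches"][breach] = details  # re-loading the same hack overwrites it
    return row


def upsert_per_breach(table, email, breach, details):
    """Option B: one row per (email, hack) pair."""
    key = (email.strip().lower(), breach)
    table[key] = {"email": key[0], "breach": breach, "details": details}
    return table[key]


# Hypothetical input: the same address appears in two hacks, and one hack
# is loaded twice.
incoming = [
    ("Alice@example.com", "site-a", {"password": "x"}),
    ("alice@example.com", "site-b", {"username": "al"}),
    ("alice@example.com", "site-a", {"password": "x"}),
]

per_email, per_breach = {}, {}
for email, breach, details in incoming:
    upsert_per_email(per_email, email, breach, details)
    upsert_per_breach(per_breach, email, breach, details)

print(len(per_email))   # number of rows under option A
print(len(per_breach))  # number of rows under option B
```

With option A the row for an address grows as more hacks are found, while option B keeps rows small but needs a key that groups all rows for one address together. In both sketches, loading the same hack twice does not create a duplicate.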