Sometimes you just need some data to test and stress things. But randomly generated data is awful: it doesn’t have realistic distributions, and it makes it hard to judge whether your results are meaningful and correct. Real or quasi-real data is best. Whether you’re looking for a couple of megabytes or many terabytes, the following sources of data might help you benchmark and test under more realistic conditions.
- The venerable Sakila test database: a small, fake database of movies and rentals for a fictional DVD rental store.
- The employees test database: a small, fake database of employees.
- The Wikipedia page-view statistics database: a large set of real website traffic data.
- The …