Best free data samples to teach databases, SQL, ETL, and OLAP

Preparing my Big Data class for technical audience, I stumbled upon the need to use sample SQL database schemas, available under unrestricted license, for teaching purposes. Here are some resources I recommend.

I wanted a set where I could demonstrate how SQL queries  become resource-intensive, thus creating a practical use case to introduce indexes, then build a data warehouse, to introduce the concepts of snowflake and star schema, OLTP, OLAP and ETL and discuss how all this impacts the database performance issues in general. Noting the audience for my classes are technical and participants had exposure to SQL, I still needed a reasonably simple use case which could be used during 2 hours.

I quickly found online found tons of online resources of fantastic data for teaching. However most of them proved not that great. I wanted practical commercial examples (e.g. corporate sales rather than US road accidents or health statistics) and focus on relational nature of the data rather than  machine learning feature extraction. I also preferred thousands of records rather than millions to make sure students can run stuff on laptops. It was tempting to create an artificial database from scratch full of random stuff, however knowing that humans are pretty bad at generating random data, I preferred downloading something real. So I digged further and asked around, and came up with some nice sets. Here is the list I ended up using. I am posting it here as it might become useful to someone else in the future.

  1. chinook database is cool and good enough for many simple purposes. It is especially nice because it comes with versions for several RDBMSs, so you can compare Oracle, PosgreSQL, MySQL, SQLite on exact same scenario. For instance, it is informative to study how SQLite .eqp compares to Oracle EXPLAIN PLAN on exact same query and same data!
  2. Oracle Tutorial has nice, albeit small Oracle-specific sample database. Check it out. By the way, Oracle Tutorial itself is really well organized and easier for a kickstart than the Oracle official docs.
  3. Oracle Database has nice sample schemas built-in… or has it? Unfortunately, the sample schemas instruction page is misleading, because those schemas have been pulled out from the Express Edition, except the HR schema. The described Database Configuration Assistant is also missing. Solution: install just the HR schema, by unlocking hte HR user, following the instruction on the stated page. Then, you can download the remaining schemas including OE, PM and SH separately from github here, but installing them is more tricky due to user roles and privileges.
  4. I found that instead of looking for a preconfigured schema, it is faster to build one based on an csv file, which are easier to find online (in Oracle, if you are not comfortable with commandline SQL*Plus, use SQL Developer’s import feature). You will then of course need to do some operations to split the data into several tables, but at least you have a real data instead of artificial one. A few good resources of nice comma-separated data come from several communities, not necessarily database related: Vincent Arel-Bundock’s Data sets for R with several small files, Spatialkey data sets with very practical examples from insurance or real estate, Watson sample sales data, perhaps not perfect because data has been preaggregated into quarters, but still sizeable and nice.
  5. Finally, our world in data csv sets are quite nice. It does not meet all my criteria (the data is not that commercial) but contains generic knowledge in a concise format, importantly – it is a type of data everyone is concerned of and students understand quickly.

There is plenty of other resources. But many of them inadequate to the purpose I described. I have spent several hours on dismissing dozens data sets I did not like for various practical reasons, so I hope the filtered list above will save the reader at least a few hours.

To end with, it is tempting to think I did a great job with this research and wow, what a benefit for humanity! Truth is, someone / somewhere made much bigger effort than this, to prepare those data sets listed here and make them available for free. I want to give big thanks to the authors and contributors of those online resources thanks to whom I my students could learn a lot. I aso contacted all of them in private (when I was able to identify the author) to give them personal feedback. This short article is also intended to be helpful to them. I am purposely not keeping the results of my research private but linking them from my site, so the authors will benefit somewhat from google ranking and publicity .

The reason for creating this writeup is that I I know from experience how rarely people give you credit for the work you put online, and how great it feels to receive one of those rare credits. If you use someone’s work, share it!

Please follow and like us: