My Sample Dataset: The Fionnula Alberton Mysteries

For the sake of any demos, blogs, or simple testing I’ve gone ahead and created my own data set. The point of this blog post is just for me to refer back to, in order to explain the data a bit without taking too much time in the blogs themselves.

Why not use data available online?
Long story, but let me try to make it an even longer story.

Typography, or the study of fonts, has long been a hobby of mine – like it is for many trendsetters and influencers around the world. It really is this century’s bungee jumping.

Ok fine, it’s at best this century’s stamp collecting (for younger readers, mail used to be hand written on physical paper, carried in paper containers called envelopes…). So, I understand when I’m at a customer site and they want to simply take a font from one machine and put it on another. And I understand the surprise when I ask about their font licensing situation.

Yeah, fonts are made by people for a price with specific licensing. You have to honor those agreements, even though you may not think about it.

So, when I went to a website to get detailed data on film profits to use as a default data set for demos and for use in this blog, at the last minute I thought to ask whether I was allowed to use this data. I was not. I was not licensed, and could not afford the license.

But, it’s more interesting and easier to understand film data than it is to understand, say, geothermic power plant efficiency from 1943 to 1972, or vegetable consumption rates in Luxembourg in the 1700s. And so, I decided to make my own data set.

Meet Fionnula Alberton
In this fictional and completely made up by me long running series of films, amateur detective Fionnula Alberton solves the murders afflicting her small Canadian town – murders which seem to be responsible for offing about a third of the population. Over the decades they have been running, they’ve had their highs and lows, including one disastrous reboot in the low column.

The brain child of author Ruby Price, who became disillusioned with the series almost right away, but continued writing them due to mysterious dump trucks full of money appearing at her door, the beloved series includes such hits as “Original Sin,” “More of the Same,” and “The 3rd Suspect is Always Guilty.”

Several actors have played multiple roles. The series took place over so many years that inflation plays a big role in the data. There’s basically a lot to work with.

How the data was made
I’m not ashamed to admit that I mainly used a spreadsheet program, and my best guesses for how film ticket prices and film-going have changed in the last 30 years. I also added a bit of randomization to make sure that the numbers weren’t too smooth.

The Structure
This is a diagram of my data, made in Lucidchart (which I will blog about eventually):

If you really want a closer look, here’s a visio diagram: Fionnula.vdx

The color coding is as follows:

  • Blue is for Fact tables
  • Green is for Primary Star Schema like lookup tables
  • Yellow is for Snowflake lookup tables
  • Red is for the relationship table which joins actors and characters together. This was necessary because some actors have played multiple characters, and some characters have been played by multiple actors so to create a 1-to-many relationship this had to be modeled through a relationship table.

As you can see, it’s designed to match MicroStrategy’s snowflake model, and will almost also work as a star schema with a minimum amount of modification (I’d need to put the descriptions into the Primary Star Schema tables). This is honestly how I prefer to design my schemas, for three reasons:

  1. Denormalizing the data infinitesimally increases the speed of reporting in MicroStrategy
  2. Basing the design around a core star schema prevents arguments with DBAs
  3. It also makes the model structurally able to be tested with any other tool, allowing for both one-to-one comparisons and easy conversion to another tool if your corporate needs change