Every professional is as good as his toolset and as a QlikView developer, there’s always room for one more trick under your sleeve. Today I will show you one of the most powerful –yet underused– chart for analyzing data: the histogram. Even though it is easy to create it I haven’t seen a lot of developers take advantage of it.
The important lesson here is that histograms are not exactly bar charts. The main difference is that bar charts are used to compare categorical variables whilst histograms represent distributions. Sounds interesting? No? Well, here’s an example.
A couple of days ago I was looking for a data set to try some functions and I got my hands on the ENEM results for 2011. The ENEM is a national exam taken by brazilian high school graduates that evaluates each institution (private and public) in subjects like mathematics, language and natural sciences.
With a traditional bar chart, you can address some questions like: Are there more public or private schools? Which state has the most schools? Which schools are the best ranked in mathematics?
But as your inner data analyst grows more interested, you will start asking more complex questions. When we create a bar chart for the top 10 schools in mathematics, we may realize that there’s a big discrepancy between the best and the worst elements:
Are there some extremely good schools raising the national average? Or alarming bad institutions brining it down? Are they all consistent? Can we separate them in groups (good, normal, bad)? Where’s the majority of the schools? Is there a significant difference between public and private institutions? Or between states?
Let’s see what QlikView can do for us. First, we’re going to change our classic conception of a bar chart by using the X-axis as the grade and the Y-axis as number of schools that got it. As you can see, the data adopts a shape that gives us a better perspective of the situation:
Far in the right, we’ve got great schools (there are not a lot of them, but their scores are pretty high). On the other end, those who might need a little help (grades below 420 points) and in the middle, the majority of the schools. We can appreciate that the curve is skewed to the left, with most of the schools scoring from 440 to 560. Remember, a higher bar represents a bigger number of schools. For example, the red bar (the highest of all the histogram) is conformed by 296 schools that got grades between 500 and 505 points.
Is there a difference between public and private high schools? Well, if we separate our histogram using a second dimension we’ll get something like this:
Some thoughts that will probably cross your mind are:
- It looks like there are more public than private institutions. Yes, 5 278 against 4 798 actually. But I wouldn’t recommend this type of chart for that analysis. Shapes and distributions can easily deceive our eyes. I you want to make that kind of comparison, a traditional bar chart will be better.
- The private schools are (on average) a bit better than their public counterparts.
- The shapes are slightly different. The red curve is wider, which means that we can find good, normal and bad private schools. On the other hand, the blue curve is more compact, meaning that public schools are consistently… not so good.
- Let’s check the position of both curves. It is interesting how the worst private schools (red left tail) are actually on pair with the average public schools. Also, the best public institutions get more or less the same grades as the average private ones.
If we add a Trellis dimension, we can see that some states like Sao Paulo and Minas Gerais are “two humped” (the gap between public and private schools is bigger). Others like Rio de Janeiro are flatter and some of them (like Ceará) are clearly more skewed.
1.- You can download the app and the XLS file here.
2.- The script is pretty simple, so I won’t go into detail. Just one thing, if you don’t have an internet connection you should comment the first lines. I connect to Wikipedia to get the state names.
3.- Create a Bar Chart with a calculated dimension:
In this case, the best way to group the schools in bins according their grade is the class function.
[I’d recommend using a variable instead of hardcoding a number.]
4.- Our expression will be:
5.- Don’t forget to sort the chart using the Numeric Value
6.- Your chart should look like this:
The class function returns this format “400 <= x < 405”. If you want to make it easier to read, we can omit the last part, taking only the first 3 characters, so edit the calculated dimension:
7.- In order to avoid the mess in the X-axis, I’d recommend turning off the Stagger Labels option in the Axes tab.
8.- Apply a little make-up and voilà!
As you can see, the histograms are not exactly bar charts and they certainly can be very useful to understand your data. I hope you liked today’s post. If you have any comments, leave them in the section below!