Every professional is as good as his toolset and as a QlikView developer, there’s always room for one more trick under your sleeve. Today I will show you one of the most powerful –yet underused– chart for analyzing data: the histogram. Even though it is easy to create it I haven’t seen a lot of developers take advantage of it.
The important lesson here is that histograms are not exactly bar charts. The main difference is that bar charts are used to compare categorical variables whilst histograms represent distributions. Sounds interesting? No? Well, here’s an example.
A couple of days ago I was looking for a data set to try some functions and I got my hands on the ENEM results for 2011. The ENEM is a national exam taken by brazilian high school graduates that evaluates each institution (private and public) in subjects like mathematics, language and natural sciences.
With a traditional bar chart, you can address some questions like: Are there more public or private schools? Which state has the most schools? Which schools are the best ranked in mathematics?
But as your inner data analyst grows more interested, you will start asking more complex questions. When we create a bar chart for the top 10 schools in mathematics, we may realize that there’s a big discrepancy between the best and the worst elements:
Are there some extremely good schools raising the national average? Or alarming bad institutions brining it down? Are they all consistent? Can we separate them in groups (good, normal, bad)? Where’s the majority of the schools? Is there a significant difference between public and private institutions? Or between states?
Let’s see what QlikView can do for us. First, we’re going to change our classic conception of a bar chart by using the X-axis as the grade and the Y-axis as number of schools that got it. As you can see, the data adopts a shape that gives us a better perspective of the situation:
Far in the right, we’ve got great schools (there are not a lot of them, but their scores are pretty high). On the other end, those who might need a little help (grades below 420 points) and in the middle, the majority of the schools. We can appreciate that the curve is skewed to the left, with most of the schools scoring from 440 to 560. Remember, a higher bar represents a bigger number of schools. For example, the red bar (the highest of all the histogram) is conformed by 296 schools that got grades between 500 and 505 points.
Is there a difference between public and private high schools? Well, if we separate our histogram using a second dimension we’ll get something like this:
Some thoughts that will probably cross your mind are: