Percona Live MySQL User's Conference, San Francisco, April 10-12th, 2012 Book: "Pentaho Kettle Solutions", Matt Casters, Roland Bouman, & Jos van Dongen, Wiley 2010 Book: "Pentaho Solutions", Roland Bouman & Jos van Dongen, Wiley 2009

Monday, May 21, 2007

Debunking GROUP BY myths

There is a popular myth about the SQL GROUP BY clause. The myth holds that 'standard SQL' requires columns referenced in the SELECT list of a query to also appear in the GROUP BY clause, unless these columns appear exclusively in an aggregated expression. MySQL is often accused of violating this standard.

In this article I will attempt to debunk this myth, and to provide a more balanced view regarding MySQL's treatment of GROUP BY at the same time.

To do that, I will first demonstrate that MySQL can be instructed to only accept GROUP BY clauses that include all non-aggregated expressions referred to in the SELECT list, thus making MySQL's behaviour conform more to that of other well-known rdbms-products.

Second, I will show that it is very important to clearly define which version of the SQL-standard is being referred to. The two most recent versions use a rather sophisticated way of defining the required relationships between expressions appearing in the GROUP BY clause and the SELECT list. Contrary to a popular belief, these standards do not literally require that all non-aggregated SELECT list columns appear in the GROUP BY clause.

Third, I will use a simple yet realistic example to illustrate in an informal manner what I believe is the intent expressed in the more recent versions of the SQL standard. Hopefully I will be able to convince you why it may even be better to not blindly include all non-aggregated columns from the SELECT list in the GROUP BY clause.

Before we dive into the details, I'll start with a brief introduction with regard to GROUP BY for those that are not too familiar at all with the construct. In the introduction, I will illustrate why most database products require all non-aggregated columns that are referenced in the SELECT list to appear in the GROUP BY clause, and why users run into trouble sometimes due to MySQL's treatment of GROUP BY clause.

The GROUP BY-clause


So, what is the GROUP BY-clause, and what does it do?

The GROUP BY-clause is an optional element of SQL SELECT expressions. Syntactically, the GROUP BY-clause consists of the keyword sequence GROUP BY, followed by a comma-separated list of (scalar) expressions. If the SELECT expression contains a GROUP BY-clause, it must appear after the WHERE clause. (If the WHERE clause is omitted, the GROUP BY clause will immediately follow after the FROM clause.)

When included, GROUP BY specifies that rows from the intermediate result set are to be divided in a number of groups, returning one single row for each such group. The list of expressions provided in the GROUP BY list defines how the grouping takes place. All rows that have the same combination of values for all expressions specified in the GROUP BY are in the same group.

Let's do a few simple example queries to illustrate the effect of the GROUP BY-clause. (For these examples I'll use the pet table from the menagerie database.) The following query retrieves all rows from the pet table in the menagerie database:

SELECT *
FROM menagerie.pet

The query returns a result that might look like this:

+----------+--------+---------+------+------------+------------+
| name | owner | species | sex | birth | death |
+----------+--------+---------+------+------------+------------+
| Fluffy | Harold | cat | f | 1993-02-04 | NULL |
| Claws | Gwen | cat | m | 1994-03-17 | NULL |
| Buffy | Harold | dog | f | 1989-05-13 | NULL |
| Fang | Benny | dog | m | 1990-08-27 | NULL |
| Bowser | Diane | dog | m | 1979-08-31 | 1995-07-29 |
| Chirpy | Gwen | bird | f | 1998-09-11 | NULL |
| Whistler | Gwen | bird | NULL | 1997-12-09 | NULL |
| Slim | Benny | snake | m | 1996-04-29 | NULL |
| Puffball | Diane | hamster | f | 1999-03-30 | NULL |
+----------+--------+---------+------+------------+------------+

(Because we did not specify an ORDER BY-clause, the rows are returned in some order determined by the database, so your results might not look exactly like this. However, for this example, the actual rows are important - not the order.)

Now, suppose we want to make groups for each species. The following addition of the GROUP BY-clause does just that:

SELECT species
FROM menagerie.pet
GROUP BY species -- make one group for each species

The query returns this result:

+---------+
| species |
+---------+
| bird |
| cat |
| dog |
| hamster |
| snake |
+---------+

At a glance, it seems as if the GROUP BY clause does nothing more than scan for unique occurrences in the species column and return those. However, it is better to think of each row in the GROUP BY result as a summary row that represents a group of rows that have the same value in the species column. So, in this case, the bird row represents the group of pets that are birds, ("Chirpy" and "Whistler"); the cat row represents the group of pets that are cats ("Fluffy" and "Claws"), and so on and so forth.

Calculating Aggregates for a group of rows


A GROUP BY query allows one to apply aggregate functions on the collection of rows associated with each group defined by the GROUP BY clause. An aggregate function can process expressions for each row in a group of rows to compute a single return value. A number of well-known standard aggregate functions are COUNT, MIN, MAX, and SUM.

(Aggregate functions can also be used without a GROUP BY clause, in which case the entire intermediate resultset is treated as one big group. Try imagining the effect of the GROUP BY operation with an empty GROUP BY list: the query will return just one row that summarizes all rows from the intermediate result set).

Expanding our previous example query for GROUP BY, the following example illustrates the effect of some of these aggregate functions:

SELECT species
, GROUP_CONCAT(name) -- make a list of pets per species
, COUNT(*) -- count pets per species
, MIN(birth) -- birthdate of oldest pet per species
, MAX(birth) -- birthdate of youngest pet per species
FROM menagerie.pet
GROUP BY species

This example also includes usage of the MySQL specific GROUP_CONCAT aggregate function which will prove to be very useful to illustrate the effect of the GROUP BY-clause.

The result looks something like this:

+---------+--------------------+----------+------------+------------+
| species | GROUP_CONCAT(name) | COUNT(*) | MIN(birth) | MAX(birth) |
+---------+--------------------+----------+------------+------------+
| bird | Chirpy,Whistler | 2 | 1997-12-09 | 1998-09-11 |
| cat | Fluffy,Claws | 2 | 1993-02-04 | 1994-03-17 |
| dog | Buffy,Fang,Bowser | 3 | 1979-08-31 | 1990-08-27 |
| hamster | Puffball | 1 | 1999-03-30 | 1999-03-30 |
| snake | Slim | 1 | 1996-04-29 | 1996-04-29 |
+---------+--------------------+----------+------------+------------+

Again, we see one row for each group of rows that have an identical value in the species column, but this time, we also see the effect of processing the individual rows for each species using aggregate functions:

  • The GROUP_CONCAT function was applied to the name column. For each species in our pet table, there maybe multiple pets, and GROUP_CONCAT concatenates their names, separating the individual names by default with a comma. Thus in this example, the GROUP_CONCAT expression reveals the make-up of each group of pets of a single species.

  • The COUNT function is used with a wildcard *, instructing it to count the number of rows associated with each group. Verifying this with the previous column, we can immediately see that the number of rows in the group is consistent with the number of names concatenated by the GROUP_CONCAT expression

  • The MIN and MAX functions are applied to the birth column and for each species, respectively report the birth date of whichever pet is oldest (MIN(birth), the birth date that is smaller than any of the other birth dates) and youngest (MAX(birth), the birth date that is larger than any of the other birth dates).


So, aggregate functions are applied to expressions taken from a group of rows and have the effect of 'condensing' (aggregating) the group, yielding a single value. Most aggregate functions calculate or determine some kind of statistical metric which serves to characterize the group as a whole. The MySQL specific GROUP_CONCAT function is an exception: it simply enumerates all members in the group passed to the function, and as such it is not a statistical function. However, it still exposes the main property of aggregate functions, namely the ability to turn the expressions from a group of rows into a single value.

Running into trouble with GROUP BY


So far, we've seen a few examples with GROUP BY that make perfect sense. Yet is easy to run into trouble with GROUP BY. Take a look at the following query:

SELECT species
, MIN(birth) -- birthdate of oldest pet per species
, MAX(birth) -- birthdate of youngest pet per species
, birth -- birthdate of ... uh oh...!
FROM menagerie.pet
GROUP BY species

This query is similar to the previous query, where we calculated a few aggregates for each group of pets belonging to the same species. However, this time, we also include the a plain reference to the birth column in the SELECT list.

An attempt to run this query on Oracle results in an error:

SQL> SELECT species
2 , MIN(birth)
3 , MAX(birth)
4 , birth
5 FROM sakila.pet
6 GROUP BY species;
, birth
*
ERROR at line 4:
ORA-00979: not a GROUP BY expression

Running this query on MySQL however does return a result, which may look something like this:

mysql> SELECT species
-> , MIN(birth) -- birthdate of oldest pet per species
-> , MAX(birth) -- birthdate of youngest pet per species
-> , birth -- birthdate of ... uh oh...!
-> FROM menagerie.pet
-> GROUP BY species;
+---------+------------+------------+------------+
| species | MIN(birth) | MAX(birth) | birth |
+---------+------------+------------+------------+
| bird | 1997-12-09 | 1998-09-11 | 1998-09-11 |
| cat | 1993-02-04 | 1994-03-17 | 1993-02-04 |
| dog | 1979-08-31 | 1990-08-27 | 1989-05-13 |
| hamster | 1999-03-30 | 1999-03-30 | 1999-03-30 |
| snake | 1996-04-29 | 1996-04-29 | 1996-04-29 |
+---------+------------+------------+------------+
5 rows in set (0.00 sec)

What is happening here? Why do we see such different behaviours? In fact, what is MySQL's behaviour in this case? Sometimes, the birth column reports a value that looks like the maximum value for birth within the species (first row), and sometimes we see the maximum value (second row). We even see one case where the returned value is in between the minimum and maximum values (row 3). How can we explain this seemingly random behaviour?

Understanding the Problem


It's not too hard to deduce what is happening here. All we need to do is go back to our explanation of the effect of the GROUP BY clause, and see how it applies to our last query.

It was already explained that the GROUP BY clause returns one row for each group of rows in the intermediate result, and that the groups are defined by the expression list defined in the GROUP BY clause. So, in this case, we are creating one result row for each group of rows that belong to the same species because the GROUP BY list only contains the species column. Yet there are several pets that may belong to a specific species, so the birth column may have (and often has) a different value for each row in a particular species group.

Having realized that, we can now ask ourselves: assuming there are multiple values in the birth column for a particular value of species, which one should be returned? What did we mean when we specified the birth column in the SELECT list?

There is no good answer to this question. It is certainly possible to select just one of the possible values for the birth column: in fact, this is exactly what MySQL does. However, it is impossible to define the significance of which ever of the possible values is chosen. That is, it makes no sense to want to mix the plain values from the rows that belong to the group with the group itself. Therefore it does not make sense to even include the birth column in the SELECT list of the query.

Another way of looking at it is to say that the 'grain' of species values is different (and therefore, incompatible) with the grain of the birth values. It does not mean we cannot access the values in the birth column; it merely means we must use an aggregate function to compute the 'right' value from a whole group of them.

Avoiding the Problem


What about the behaviour such as exercised by Oracle? Doesn't it make more sense to issue an error message rather than returning non-sense data? Put this way, most people will probably agree. Of course, the error message itself is somewhat puzzling:

, birth
*
ERROR at line 4:
ORA-00979: not a GROUP BY expression

This seems to suggest that the problem is that the birth column in the SELECT list is not included in the GROUP BY clause. In turn, this raises the question whether the problem would be solved if we would have included the birth column in the GROUP BY clause.

Well, including the birth column in the GROUP BY clause certainly gets rid of the error message. However, a lot of users with entry level skills in SQL fail to understand that this yields quite a different query. The original GROUP BY species yields one group for each species, whereas GROUP BY species, birth yields a group for each combination of values in species and birth - most probably not at all what is intended.

On the other hand, we cannot expect the database management system to know what we were thinking when we included the birth in the SELECT list in the first place. So, despite that the error message may seem a bit puzzling, it is still preferable over silently returning non-sense data.

But do we really have to put up with this? The answer is "No!".

Including ONLY_FULL_GROUP_BY in MySQL's sql_mode


Nowadays, MySQL is capable of detecting this problem too, and it is perfectly possible to make MySQL reject the previous query to avoid the problem of returning non-sense data. This is achieved by including ONLY_FULL_GROUP_BY in the sql_mode.

Like many server settings, we can specify the sql_mode using the --sql-mode command line argument to the MySQL server executable (mysqld), or we can include it in an option file. For example, including the following line in the option file will enable ONLY_FULL_GROUP_BY when the server starts up:

sql_mode=ONLY_FULL_GROUP_BY

Beginning with MySQL 4.1, it is also possible to set the sql_mode at runtime using the SET syntax. In this way, the sql_mode can be set globally or for the session level. The latter is the most useful, as it allows one to setup a sql_mode most suited for a particular application without affecting any other applications that run on the server. (Some applications don't expect anything other than the default setting and may run into trouble with a particular sql_mode.)

The following snippet illustrates how to include ONLY_FULL_GROUP_BY in the sql_mode at runtime:

mysql> SET sql_mode := CONCAT('ONLY_FULL_GROUP_BY,',@@sql_mode);
Query OK, 0 rows affected (0.00 sec)

The @@sql_mode server variable contains a possibly empty, comma-separated string of current sql_mode settings. The CONCAT expression prepends whatever the current setting is of the sql_mode with ONLY_FULL_GROUP_BY. Note the comma immediately following ONLY_FULL_GROUP_BY. If the value of @@sql_mode is the empty string, the value of the CONCAT expression will have a trailing comma, but this is allowed in the assignment (trimming off the comma in the process).

When we now attempt to execute our query again, it fails with an error message:

mysql> SELECT species
-> , MIN(birth) -- birthdate of oldest pet per species
-> , MAX(birth) -- birthdate of youngest pet per species
-> , birth -- birthdate of ... uh oh...!
-> FROM menagerie.pet
-> GROUP BY species;
ERROR 1055 (42000): 'menagerie.pet.birth' isn't in GROUP BY

The error messages indicates that we did not include the birth column in the GROUP BY clause. Now, MySQL behaves similar to Oracle for this query.

(Alas, MySQL's ONLY_FULL_GROUP_BY is not as clever as it should be, and there are particular cases where ONLY_FULL_GROUP_BY is too restrictive in enforcing only full GROUP BY clauses. The details are described here. The good news is that the community can help to fix this bug! Go to MySQL Forge and check out Worklog task 2489.)

What does 'the' SQL standard say


In the previous sections, we've seen how Oracle and MySQL react very differently to the same SQL GROUP BY query. But what do the standards say? How is the GROUP BY clause supposed to behave?

In the introduction of this article, I claimed I would debunk a popular myth that holds that
...standard SQL requires columns referenced in the SELECT list of a query to also appear in the GROUP BY clause, unless these columns appear exclusively in an aggregated expression.


Now, I don't want to pretend I'm an expert as far as the SQL standard (ISO/IEC 9075) is concerned. In fact, I've noticed repeatedly that the sheer volume of the documentation as well as the persistent formal wording prevent me from obtaining a clear overview of it. But, let's give it a try anyway.

The 1992 version of the standard, 7.9 - 7 states that:
If T is a grouped table, then each <column reference> in each <value expression> that references a column of T shall reference a grouping column or be specified within a <set function specification>.
As said, I'm not an expert in this area, but the way I read it it boils down to:

  • Queries that include a GROUP BY clause can only include column references in SELECT-ed expressions if the column appears in the GROUP BY clause, or if that column appears as part of an aggregate.


Now, in the 7.12 - 15 of the 2003 version of the standard we find this:

If T is a grouped table, then let G be the set of grouping columns of T. In each <value expression> contained in <select list>, each column reference that references a column of T shall reference some column C that is functionally dependent on G or shall be contained in an aggregated argument of a <set function specification> whose aggregation query is QS.

The 1999 version of the standard contains a similar rule. The important thing to note here is that both versions that succeeded the 1992 version stopped requiring explicitly that all non-aggregated columns in the SELECT list must be present in the GROUP BY clause. Instead they require that any non-aggregated column appearing in the SELECT list is functionally dependent upon the GROUP BY clause.

Functional dependencies


What would the 1999 and 2003 version of the SQL standard mean by the term "functionally dependent"? The answer to that question is also defined by the standard. Unfortunately, it cannot be illustrated by a simple quote, as the formal definition of what exactly constitutes a functional dependency according to the standard is fairly extensive and complicated.

Luckily, the concept of functional dependencies can be easily illustrated in a less formal way. Suppose we have two expressions, A and B. Now, B is functionally dependent upon A if B has exactly one value for a particular value of A. Consider this snippet of code:

mysql> SELECT @A:=1 AS A
-> , @B:=@A + 1 AS B;
+---+------+
| A | B |
+---+------+
| 1 | 2 |
+---+------+

Here, the column B is functionally dependent upon column A. The value of B can be derived from the value of A in a very straightforward manner, namely by adding 1 to whatever the value of A happens to be. We know how to compute the value of B for any given value of A, and for any given value of A, the corresponding value of B will always be the same.

The functional dependency concept can also be be applied to multiple columns:

SELECT CONCAT(A,B) C
FROM someTable

Column C is defined as the result of CONCAT(A,B) expression. If we have the values of both A and B we can compute the result of CONCAT(A,B). Of course, for a given pair of expressions for A and B the result of CONCAT(A,B) will always be the same. Therefore, C is functionally dependent upon the column pair A and B.

Note that it is not enough just to only know the method to calculate B out of A. Consider this example:

mysql> SELECT @A:=1 AS A
-> , @B:=@A + RAND() AS B;

Here, we know the recipe to derive the value of A from the value of A: we have to take the value of A and add the value returned by a call to the RAND() function. However, the RAND() function will return a different value each time it is called. Therefore, the value of B will be different too every time, and hence we cannot say that B is functionally dependent upon A. There is thus not a single value for B for a given value of A, and thus B is not functionally dependent upon A.

Functional dependency and Normalization


The term functional depencency is also used with regard to normalization. Part of the normalization process involves discovering functional dependencies between different groups of columns.

Normalization requires that each table has at least one key. A key is a column or group of columns that may be used to identify a single record in the table. By definition, if we have a key, all non-key columns are functionally dependent upon the key. Another way to think about this is to imagine that we look up a row using a key. If the key entry exists, this will result in exactly one row. By definition, each column in that row has exactly one value, so the value of each non-key column can be determined by the key.

Functional dependencies between a group of columns that makes up a key and any other group of columns is allowed, but functional dependencies between two groups of non-key columns are eliminated by the normalization process. (This is done by splitting off the groups of columns that expose the functional dependency to a new table, making one of the groups of columns a key in the new table.)

Functional dependency and GROUP BY


We have just seen that the 1999 and 2003 versions of the SQL standard require that the columns appearing in the SELECT list are functionally dependent upon the groups defined by the GROUP BY clause. In other words, if we know that a column contains only one value for any given combination of values in the columns appearing in the GROUP BY clause, we may reference the column in the SELECT list even if it does not appear in an aggregate expression.

We've also seen that if we have a (primary or unique) key, all columns that are not included in the key are by definition functionally dependent upon the key. This means that if we include all key columns in the GROUP BY clause, we can reference any column we like in the SELECT list, even if they appear outside an aggregate expression.

The following example from the sakila sample database might help to illustrate this:

mysql> SELECT film_id -- primary key
-> , title -- non-key column
-> , COUNT(*) -- one row per group
-> FROM sakila.film
-> GROUP BY film_id; -- group by on primary key
+---------+-----------------------------+----------+
| film_id | title | COUNT(*) |
+---------+-----------------------------+----------+
| 1 | ACADEMY DINOSAUR | 1 |
. . . .
. . . .
[...not showing 998 rows...]
. . . .
. . . .
| 1000 | ZORRO ARK | 1 |
+---------+-----------------------------+----------+
1000 rows in set (0.05 sec)

Here, we query the film table. The primary key of the film table consists of only the film_id column. The GROUP BY clause contains the film_id column. As a result, the query returns a collection of groups, each of which summarizes only one row. Of course, because there is only one row per group, there can be only one value in each of the other columns of the film table. Therefore, it is safe to include whatever column we like in the SELECT list. For this reason, it is perfectly ok to include the film_title column in the SELECT list.

Of course, the GROUP BY in the previous query does not make sense logically. Because the primary key of the film table consists of only the film_id column, we already know there can be only one row for any given value of film_id. However, it becomes interesting when we include another table in the query. Consider the next example:

mysql> SELECT f.film_id
-> , f.title
-> , COUNT(fa.actor_id)
-> FROM film f
-> LEFT JOIN film_actor fa
-> ON f.film_id = fa.film_id
-> GROUP BY f.film_id;
+---------+-----------------------------+----------+
| film_id | title | COUNT(*) |
+---------+-----------------------------+----------+
| 1 | ACADEMY DINOSAUR | 10 |
. . . .
. . . .
[...not showing 998 rows...]
. . . .
. . . .
| 1000 | ZORRO ARK | 3 |
+---------+-----------------------------+----------+
1000 rows in set (0.02 sec)

Here, we have added a LEFT JOIN to calculate the number of actors per film. This time, the GROUP BY clause on the film_id does make sense: we now get one group of actors that play a role in each film. At the same time, we know that all columns in the film table are functionally dependent upon the film_id column. Each group returned by the GROUP BY clause corresponds to exactly one row from the film table, and this means that for each group, there is only one value in any column of the film table. So, it's perfectly safe to reference the columns from the film table in the SELECT list, even if we don't use them in an aggregate expression.

It is important to realize that we cannot reference just any column in the SELECT list: we can only reference those columns that are functionally dependent upon the film_id column of the film table. This means that it is wrong to reference any column of the film_actor table in the SELECT list directly: we may do so only in an aggregate expression.

The previous example demonstrates a pattern. The film table acts as the so-called master, and the film_actor table acts as the detail. The master-detail pattern is very common: Order and Order Items, Vendor and Products, Country and Cities are all examples of this pattern.

So why would I do that?


Ok, hopefully, I've been able to explain under which circumstances it is safe for GROUP BY queries to reference columns in the SELECT list directly. One might wonder though what the advantages and disadvantages are. I mean, just because you can doesn't mean you should, right?

Disadvantages?


Well, there certainly are reasons for always writing a full GROUP BY clause.

First of all, many rdbms products will only allow a full GROUP BY clause anyway, so in those cases there really is no choice. In MySQL however, we do have the choice as long as we are not using ONLY_FULL_GROUP_BY in the sql_mode.

Another reason for always writing the GROUP BY clause in full is that other developers might not understand under which circumstances it is ok to use a partial GROUP BY clause. In many cases, they've spent considerable time to learn to blindly repeat all SELECT columns in the GROUP BY clause, and they will usually point out that it is wrong to not adhere to that rule.

Of course, it is impossible to distinguish between a query that intentionally omits columns from the GROUP BY clause and one that accidentally forgot to include them. When it is the intention to always write a full GROUP BY clause, it is easy to verify whether the column references in the GROUP BY clause and the the SELECT list match.

I have heard people argue that including only the key in the GROUP BY clause will lead to problems when the definition of the key is changed. Personally, I think this is a bogus argument. When you are considering to change the definition of the key, you are most likely going to review all your queries anyway, because all your joins will need to reflect this change too. I just mean to say that changing a few GROUP BY's here and there is probably least of your problems when you are considering to change the definition of a key.

Another argument I have heard is that it is somehow 'more clear', 'cleaner' or 'prettier' to repeat all columns referenced in the SELECT list in the GROUP BY clause. Personally, I think this is a bogus argument too. At the very least, these are all a matter of opinion.

Advantages


Personally, I feel it is more clear and prettier to GROUP BY only on key columns where possible. I argued that it is a matter of opinion what is 'clear' or 'pretty', so I must discard this argument likewise.

I would argue that full GROUP BY clauses are harder to maintain. Many changes will require two edits of the code instead of one. Of course, this might or might not outweigh any of the advantages of a full GROUP BY clause.

A full GROUP BY clause might be slower than a partial one. The following query finds all film titles that gathered more than 300 $ worth of payments:

mysql> SELECT f.film_id
-> , f.title
-> , sum(p.amount) sum_amount
-> FROM film f
-> LEFT JOIN inventory i
-> ON f.film_id = i.film_id
-> LEFT JOIN rental r
-> ON i.inventory_id = r.inventory_id
-> LEFT JOIN payment p
-> ON r.rental_id = p.rental_id
-> GROUP BY f.film_id
-> HAVING sum_amount > 300;
Empty set (0.18 sec)

Using only the film_id column in the GROUP BY clause, it takes 0.18 seconds to discover there are no film titles that satisfy this criterion. Now, let's compare this to the equivalent query using a full GROUP BY clause:

mysql> SELECT f.film_id
-> , f.title
-> , sum(p.amount) sum_amount
-> FROM film f
-> LEFT JOIN inventory i
-> ON f.film_id = i.film_id
-> LEFT JOIN rental r
-> ON i.inventory_id = r.inventory_id
-> LEFT JOIN payment p
-> ON r.rental_id = p.rental_id
-> GROUP BY f.film_id
-> , f.title
-> HAVING sum_amount > 300;
Empty set (0.51 sec)

This query takes almost three times as long to complete! With EXPLAIN we can retrieve the execution plans for these queries. Without a full GROUP BY clause, we see a fairly normal execution plan:

mysql> EXPLAIN
-> SELECT f.film_id
-> , f.title
-> , sum(p.amount) sum_amount
-> FROM film f
-> LEFT JOIN inventory i
-> ON f.film_id = i.film_id
-> LEFT JOIN rental r
-> ON i.inventory_id = r.inventory_id
-> LEFT JOIN payment p
-> ON r.rental_id = p.rental_id
-> GROUP BY f.film_id
-> HAVING sum_amount > 300
-> \G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: f
type: index
possible_keys: NULL
key: PRIMARY
key_len: 2
ref: NULL
rows: 953
Extra:
*************************** 2. row ***************************
id: 1
select_type: SIMPLE
table: i
type: ref
possible_keys: idx_fk_film_id
key: idx_fk_film_id
key_len: 2
ref: sakila.f.film_id
rows: 2
Extra: Using index
*************************** 3. row ***************************
id: 1
select_type: SIMPLE
table: r
type: ref
possible_keys: idx_fk_inventory_id
key: idx_fk_inventory_id
key_len: 3
ref: sakila.i.inventory_id
rows: 1
Extra: Using index
*************************** 4. row ***************************
id: 1
select_type: SIMPLE
table: p
type: ref
possible_keys: fk_payment_rental
key: fk_payment_rental
key_len: 5
ref: sakila.r.rental_id
rows: 1
Extra:
4 rows in set (0.00 sec)

With a full GROUP BY list, we notice a difference for the film table:

mysql> EXPLAIN
-> SELECT f.film_id
-> , f.title
-> , sum(p.amount) sum_amount
-> FROM film f
-> LEFT JOIN inventory i
-> ON f.film_id = i.film_id
-> LEFT JOIN rental r
-> ON i.inventory_id = r.inventory_id
-> LEFT JOIN payment p
-> ON r.rental_id = p.rental_id
-> GROUP BY f.film_id
-> , f.title
-> HAVING sum_amount > 300
-> \G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: f
type: index
possible_keys: NULL
key: idx_title
key_len: 767
ref: NULL
rows: 953
Extra: Using index; Using temporary; Using filesort
*************************** 2. row ***************************
id: 1
select_type: SIMPLE
table: i
type: ref
possible_keys: idx_fk_film_id
key: idx_fk_film_id
key_len: 2
ref: sakila.f.film_id
rows: 2
Extra: Using index
*************************** 3. row ***************************
id: 1
select_type: SIMPLE
table: r
type: ref
possible_keys: idx_fk_inventory_id
key: idx_fk_inventory_id
key_len: 3
ref: sakila.i.inventory_id
rows: 1
Extra: Using index
*************************** 4. row ***************************
id: 1
select_type: SIMPLE
table: p
type: ref
possible_keys: fk_payment_rental
key: fk_payment_rental
key_len: 5
ref: sakila.r.rental_id
rows: 1
Extra:
4 rows in set (0.01 sec)

In case you did not yet notice, the film table now has Extra: Using index; Using temporary; Using filesort.

What I think that is happening is that MySQL takes the GROUP BY clause literally and performs the GROUP BY algorithm for each of the specified expressions. MySQL implements GROUP BY by sorting the rows according to the GROUP BY expressions. In this particular case, adding the title column to the GROUP BY clause does not allow the server to sort the rows in-memory, and forces the GROUP BY to be evaluated using a temporary table and a file sort. This requires extra IO operations which is causing a decrease in performance.

Of course, it would be nice if MySQL was smart enough to deduce that the result cannot possibly be different due to including the title column in the GROUP BY list. It could attempt to detect that a key of the film table is included in the GROUP BY list and that the title column can be completely ignored when evaluating the GROUP BY clause, because there will be exactly one value in the title column for each film_id column. But, then again, MySQL does not require us to write a full group by list. So if performance is paramount, be smart and do not write a full GROUP BY clause.

Aggregating on functionally dependent columns


We just argued that it is safe to include columns in the SELECT list as long as these columns are functionally dependent upon the GROUP BY list. The reasoning is that since the columns are functionally dependent upon the GROUP BY clause, there will be one value for each result group anyway. It was also shown that it may be a bad idea to include these columns in the GROUP BY clause, as it can hurt performance.

For some people, it may still seem unacceptable to include functionally dependent columns in the SELECT list without referencing these columns in the GROUP BY clause. For example, you might be using a rdbms that requires all columns that are not referenced in the GROUP BY to be aggregated. In those cases is might be better to apply an aggregate function to the functionally dependent column rather than including it in the GROUP BY clause. Consider the following query:

mysql> SELECT f.film_id
-> , MAX(f.title) AS title
-> , sum(p.amount) sum_amount
-> FROM film f
-> LEFT JOIN inventory i
-> ON f.film_id = i.film_id
-> LEFT JOIN rental r
-> ON i.inventory_id = r.inventory_id
-> LEFT JOIN payment p
-> ON r.rental_id = p.rental_id
-> GROUP BY f.film_id
-> HAVING sum_amount > 300;
Empty set (0.20 sec)

Again, we GROUP BY on the film_id column, which makes up the primary key of the film table. This time however, we apply the MAX aggregate function on the title column. We know that there is only one value in the title column for each value in the film_id column, so the aggregation will not influence the result. In fact, we could've used MIN equally well.

These aggregate functions will return the right value for exactly the same reason why it is safe to not include the functionally dependent column in the GROUP BY clause. Technically the aggregation is therefore unnecessary: it is just a trick to fool the rdbms.

Most likely, applying the aggregate function will be somewhat slower than not applying the aggregate. However, it will in most cases be faster than including the functionally dependent column in the GROUP BY clause.

Conclusion


Contrary to popular belief, the SQL standard does not require GROUP BY queries to reference all non-aggregated columns from the SELECT list in the GROUP BY clause. As of the 1999 version of the SQL standard, it is explicitly allowed for the SELECT list to reference non-aggregated expressions as long as they are functionally dependent upon the GROUP BY list.

Each expression that has exactly one value for each group defined by the GROUP BY clause is functionally dependent upon the GROUP BY clause. Functional dependencies can be witnessed in a common query pattern: whenever we have a join between a master and a detail table to calculate aggregates over the detail rows for each row from the master, we can GROUP BY over the primary or unique key from the master. All non-key columns of the master row will be functionally dependent upon the key, and can thus appear in the SELECT-list outside an aggregate expression.

In MySQL, one can write GROUP BY queries that reference non-aggregated columns in the SELECT list that are not included in the GROUP BY clause, even if these columns are not functionally dependent upon the GROUP BY clause. This behaviour conforms to none of the SQL standard's versions. It is possible to avoid this behaviour by including ONLY_FULL_GROUP_BY in the sql_mode server setting, but it might make more sense to take advantage of the ability to write only partial GROUP BY clauses.

In a nutshell:

  • It is completely safe to write partial GROUP BY clauses as long as all non-aggregated columns in the SELECT list are functionally dependent upon the GROUP BY clause.

  • A partial GROUP BY list can result in better performance, because it keeps the server from evaluating the entire GROUP BY list.

  • If one does not want to write partial GROUP BY clauses, consider using MIN or MAX to 'aggregate' the functionally dependent columns in the SELECT list rather than moving the functionally dependent columns to the GROUP BY clause.

10 comments:

rudy said...

i love the article!!! congratulations

one thing that is not clear -- in the conclusion, there are three paragraphs, and paragraph 2 seems to contradict paragraph 1

in particular, "This behaviour conforms to none of the SQL standard's versions."

however, in the first paragraph, you do seem to say that the SQL99 standard allows it "as long as they are functionally dependent upon the GROUP BY list."

could you please clear up this discrepancy for me?

thanks again for a wonderful article

i shall be referring people to it frequently

Roland Bouman said...

Rudy,

thanks for your comment! You are right, thanks for mentioning it. The sentence in the 2nd paragraph:

"In MySQL, one can write GROUP BY queries that reference non-aggregated columns in the SELECT list that are not included in the GROUP BY clause, even if these columns are not functionally dependent upon the GROUP BY clause."

I forgot the trailing: "...even if these columns are not functionally dependent upon the GROUP BY clause."

So, MySQL will allow a query like:

SELECT f.film_id
FROM sakila.film f
GROUP BY f.rating

Even though f.film_id is not functionally dependent upon f.rating.

This behaviour can be avoided by including ONLY_FULL_GROUP_BY in the server sql_mode.

Jeff Carouth said...

wonderful article. great topic. well written.

Rishi Agarwal said...

Thanks a lot. I wrote a View last week, which was similar to your 'master' and 'detail' example. I was under the impression that the View was perfect until I read your blog.

Thanks again for clearing all doubts.

serid said...

Absolute masterpiece !

This is the best article with regards to RDBMS I have ever read.

Well structured, well thought, well written, well done !

It has just landd in my favourites to share it with my friends.

Thanks a lot !

Roland Bouman said...

Jeff, Rishi, Serid: Thanks! I appreciate it a lot :)

amit said...

I have come across this article a little late. Nevertheless. i got my concepts cleared. Thanks a lot

Rob said...

Just found this article - very interesting read (now I'm off to check SQL Server, as that's the dB I use). Nice work.

Devang said...

Dear Sir;

We have two servers production and test.

On production we are using MySQL with below specifications

RHEL: Red Hat Enterprise Linux Server release 5.3 (Tikanga)
MySQL : mysql Ver 14.12 Distrib 5.0.45, for redhat-linux-gnu (x86_64) using readline 5.0

while

On test server we are using MySQL with below specifications
RHEL: Red Hat Enterprise Linux Server release 5.6 (Tikanga)
MySQL: mysql Ver 14.12 Distrib 5.0.77, for redhat-linux-gnu (x86_64) using readline 5.1

On production server when i am issuing below sql query and it is rejecting with an error

Query: Select type, count(*) from location
Error: #1140 - Mixing of GROUP columns (MIN(),MAX(),COUNT(),...) with no GROUP columns

On test server the same query is not giving us error.

In the same line when i referred MySQL documentation ,
I was told that we need to set SQL_MODE parameter on test server to generate
same error on both servers so I set it.

SQL_MODE=ONLY_FULL_GROUP_BY

But exactly after this an another query with Having clause start giving error on test server.

The error is "non-grouping field '' is used in HAVING clause"

This is also identified in the document.
http://dev.mysql.com/doc/refman/5.0/en/server-sql-mode.html#sqlmode_only_full_group_by

We never set such parameters on Production server or test server we are using almost default
setup.

Can you please share with me why such difference is their.
What made such difference between two MySQL version.
How we can solve this issue?

thanks

Devang

Roland Bouman said...

Hi Devang!

"Query: Select type, count(*) from location"

ouch. This does look like a wrong query. there should be a GROUP BY type clause.

"On test server the same query is not giving us error."

No, but the result will be unreliable. Silently.

"I was told that we need to set SQL_MODE parameter on test server to generate
same error on both servers so I set it."

I agree that this should make the test server give you the same error.

"But exactly after this an another query with Having clause start giving error on test server."

"What made such difference between two MySQL version."

First let me get this straight. Are you saying that the query that errs with the "non-grouping field '' is used in HAVING clause" is performing ok on the production server? If it doesn't fail there too, there is probably some bug in the (older) version you're running on the production server. If you want to prevent these surprises you should always make sure you're running the same MySQL version on both test and production servers.

"How we can solve this issue?"

Personally I disrecommend using the ONLY_FULL_GROUP_BY sqlmode. My main reason is that it is buggy, see: http://bugs.mysql.com/bug.php?id=8510
To prevent bad queries from being used, like "Query: Select type, count(*) from location"
I suggest you have code reviews.

If that is for some reason unacceptable to you, you can take the query that is causing the "non-grouping field '' is used in HAVING clause" error, and see if it makes any difference if you use the aggregate function in the HAVING clause instead of an alias. It's kinda hard to explain without having your specific query, but going by the example in the docs you cite: if this is the original query (that errs)

SELECT name, COUNT(name) AS c
FROM orders
GROUP BY name
HAVING c = 1;

You could try rewriting that to

SELECT name, COUNT(name) AS c
FROM orders
GROUP BY name
HAVING COUNT(name) = 1;