Logical Operators and Aggregation
In the previous section we saw how a single rule could be added to a query using a WHERE clause. While this is useful, usually more than a single rule is required to produce the correct result. To support multiple rules we need to make use of NOT, AND, OR and parentheses.
The basic way of supporting multiple rules in a single query is by making use of AND. AND provides a way of connecting two rules together such that ALL the rules must be true before the row is shown. Lets make use again of the CAR table:
Consider the case where a police eye witness spots a car driving away from a crime. The witness reports that the car was BLUE and had the character '5' somewhere in the REGNO field. Taking these rules seperately...
SELECT regno from CAR WHERE colour = 'BLUE';
SELECT regno from CAR WHERE regno LIKE '%5%'
We are looking for a REGNO in common to both these results, which means the car we are looking for is 'A155 BDE'. Rather than doing this ourselves we want the computer to identify the right car in a single query. The two rules in question are linked together with an AND.
SELECT regno from CAR WHERE colour = 'BLUE' AND regno LIKE '%5%' ;
Remember that the layout of the SQL is independent of spaces and newlines, so this query is identical to:
SELECT regno FROM CAR WHERE colour = 'BLUE' AND regno LIKE '%5%' ;
You can link as many rules together as you like. So for instance if the witness said that the car was BLUE, had a 5 in the registration number, and that someone said the car was owned by Bob, we could write a query:
SELECT regno FROM CAR WHERE colour = 'BLUE' AND regno LIKE '%5%' AND owner LIKE 'Bob %' ;
AND allows us to link rules together such that all rules must be true to see that row. Think of AND as 'As well as'. Sometimes we want to say 'or that' or 'either' rather than 'As well as'. To do this we use OR. For instance, lets say that the police witness said that the car colour was either RED or BLUE, and they were not sure which. If you said:
WHERE colour = 'RED' AND colour = 'BLUE'
then no rows would be produced, as you are saying you want rows where the colour is both RED and BLUE at the same time (RED as well as BLUE). What we need is either RED OR BLUE.
SELECT REGNO,COLOUR from CAR WHERE colour = 'RED' OR colour = 'BLUE';
The NOT operator does the opposite of whatever comparison is being done. NOT is not frequently needed, as there is usually an opposite operator already. For instance, if you wanted the opposite of:
WHERE colour = 'RED'You could say
WHERE colour != 'RED'Using NOT you could also say
WHERE NOT colour = 'RED'While not particularly useful in these simple examples, NOT comes into its own once you start to use parentheses.
AND, OR, and NOT become more complex to understand when you mix them together in a single query. The problem is that the rules are combined together, not in the order you write them, but in the order of their precedence. This states that NOT is done first, then AND, and finally OR. This can make a BIG difference to your queries!
Consider the case of the police witness. Lets say that the car being looked for had a 5 in the registration number, and was either RED or BLUE.
SELECT REGNO,COLOUR from CAR WHERE colour = 'RED' -- 1 OR colour = 'BLUE' -- 2 AND regno LIKE '%5%' -- 3 ;
In this query, rule 3 and rule 2 and ANDed together first, as they have a higher precedent. Only then is rule 1 ORed in. Thus this query says "The car is BLUE with a 5 in the regno" OR "the car is RED". What was wanted was to have rules 1 and 2 done first, and then 3, so that the query says "The car is either RED or BLUE" AND "the car had a 5 in the regno". To do this we need to use parenthesis.
Parenthesis, or brackets, are used to instruct the database which rules should be done first. The database uses a simple ruleset to understand brackets. If you have any brackets, then the rule in the brackets is done first. If you have brackets within brackets, then the inner brackets are done first. In the example above, the right query can be generated as:
SELECT REGNO,COLOUR from CAR WHERE (colour = 'RED' OR colour = 'BLUE') AND regno LIKE '%5%' ;
The following queries are all identical in function to the above query...
SELECT REGNO,COLOUR from CAR WHERE (colour = 'RED' OR colour = 'BLUE') AND regno LIKE '%5%'; SELECT REGNO,COLOUR from CAR WHERE ( (colour = 'RED' OR colour = 'BLUE') AND regno LIKE '%5%');
Do not use brackets where they are not needed, as it makes the query harder for users to understand whats going on.
Lets say you want a list of all the colours of cars in the database. The COLOUR field of CAR gives you this, and thus:
SELECT colour FROM car;
This result was not the ideal one wanted. BLUE for some reason appears 3 times. It does this because BLUE appears 3 times in the original data. Sometimes this duplication is what is wanted, other times we want only to see the colours appearing once. To tell the database to show the rows only once, you can use the keyword DISTINCT. This appears immediately after the word SELECT. DISTINCT effectively means that all rows which appear must be unique, and any duplicate rows will be deleted.
SELECT DISTINCT colour FROM car;
When a query is executed the results are displayed in an almost random order. The order is dependent on how the database management system was written. This is fine usually, but sometimes giving the data out in a particular order would make the data must more useful. There is a special clause, ORDER BY, which can be added to the end of a query to give the data a particular order.
SELECT make FROM car;
To order alphabetically (which in SQL is known as ascending or ASC) you can use ORDER BY or ORDER BY ASC.
SELECT make FROM car ORDER BY make;
This is identical to
SELECT make FROM car ORDER BY make ASC;
To order things in the reverse ordering, you can use ORDER BY DESC.
SELECT make FROM car ORDER BY make DESC;
For complex orderings involving more than one column, you can specify multiple columns in the ORDER BY statement, simply by separating each column name with a comma. Thus a query to sort cars by colour and then make would look like:
SELECT make,colour FROM car ORDER BY colour,make;
List the registration numbers of cars which are either SKODA or SMART
SELECT regno,make FROM car WHERE make = 'SKODA' or make='SMART' ;
This can be rewritten using IN.
SELECT regno,make FROM car WHERE make IN ('SKODA','SMART') ;
A good way to think of IN is to consider it as "is one of the following".
Operators exist in SQL to give results based on the statistics of a group of values stored in the database. Such operators include "what is the maximum number" and "what is the average". These functions are called SET or AGGREGATE functions.
To calculate the average of a column you use the AVG function.
SELECT price FROM car;
SELECT avg(price) FROM car;
To calculate the SUM of all values in a column you use the SUM function.
SELECT sum(price) FROM car;
To calculate the maximum or biggest value present in a particular column you can use the MAX function.
SELECT max(price) FROM car;
To calculate the minimum or smallest value present in a particular column you can use the MIN function.
SELECT min(price) FROM car;
To work out how many rows are in a particular query result you can use the COUNT function.
Using "count(column)" counts how many rows exist in the answer where that column is NOT NULL. Using "count(*)" counts how many rows exist independent of NULL values.
SELECT count(price) FROM car;
In this case, the following SQL produces the same answer.
SELECT count(*) FROM car;
Sometimes you do not want to count how many rows are in a particular column, but how many different values are stored in a column. There is a special variation of count which allows you to do that, known as COUNT DISTINCT. Its syntax is a little unusual...
SELECT count(colour) from car;
SELECT count(DISTINCT colour) from car;
The aggregate functions are excellent when all you want is a single number answer. Frequently what is needed is statistical analysis in groups. For instance, what is the maximum cost of a car given its colour. Here we are wanting two columns, one the car colour, and the second column the highest cost. Intuitively one might think:
SELECT colour,max(price) FROM car ;
If you were to run this query it would produce a "group by" error.
Instead, what you have to do is consider all aggregate functions in your query, and over which columns they are going to be grouped. In this case we are grouping on colour, and want the maximum price within each "colour" group. To tell the computer this we use GROUP BY.
SELECT colour,price FROM car ;
SELECT colour,max(price) FROM car GROUP BY colour ;
If you are ever confused by what to put in the GROUP BY, then here is a simple rule which is 99% accurate... If you have a SELECT line with aggregate functions, then you need a GROUP BY listing all the column names from the SELECT line which are not used by the functions. In this example "price" and "colour" are columns from SELECT, but as "price" is used in MAX, only "colour" needs to go into the GROUP BY statement.
One annoying feature of SQL is that aggregate functions are executed at almost the last stage of the query process. This makes writing queries like "Which owners own more than 1 car" quite complex. Ideally we would like to write:
SELECT owner from car where count(owner) > 1;The problem is that this does not work! Aggregate functions cannot appear in a WHERE clause, so this query is illegal... To get around this you can have the HAVING clause. HAVING works in an identical way to WHERE, except that it runs very late in the process and allows aggregate functions. It is also VERY expensive for the database to use, so do not use it until it is absolutely essential.
Our query can now be rewritten thus:
SELECT owner,count(regno) FROM car GROUP BY owner HAVING count(regno)>1;
This query also shows how many cars the owner owns. You do not have to have the function in the HAVING on the SELECT line. The following also works:
SELECT owner FROM car GROUP BY owner HAVING count(regno)>1;
If you remember count(*) counts how many rows there are in the answer. With a GROUP BY, it counts how many rows are in each group. The difference between a count with * or with a column name is that using a column name makes the count ignore NULL entries in that column, whereas with * NULL entries are counted too. In our example, REGNO is never NULL, so the query is also identical to:
SELECT owner FROM car GROUP BY owner HAVING count(*)>1;