Tuesday, December 08, 2009

SQL GROUP BY

One aspect of the versatile SELECT statement that seems to confuse many people is the GROUP BY clause.  It is very important to group your rows in the proper place.  Always push GROUP BY aggregations as far into your nested SELECT statements as possible – if you have a situation in which you are grouping by long lists of columns that are not part of primary keys, you are probably have not structured your query correctly.

Here's a classic example that returns total sales per customer, in addition to returning the customer's name and address:


SELECT
  C.CustomerID, C.CustomerName,  
  C.CustomerType, C.Address1, C.City,
  C.State, SUM(S.Sales) as TotalSales
FROM
  Customers C
INNER JOIN Sales S
  ON C.CustomerID = S.CustomerID
GROUP BY
  C.CustomerID, C.CustomerName,
  C.CustomerType, C.Address1, C.City, C.State

I can’t say how many times I see SELECT's written this way and it is simply wrong. You should only be grouping on CustomerID, and not on all those other columns.  Push the grouping down a level, into a derived table:

SELECT
  C.CustomerID, C.CustomerName,
  C.CustomerType, C.Address1, C.City,
  C.State, S.TotalSales
FROM
  Customers C
INNER JOIN
  (SELECT 
     CustomerID, SUM(Sales) as TotalSales
   FROM
     Sales
   GROUP BY 
     CustomerID) S
ON
  C.CustomerID = S.CustomerID

Hopefully, you will agree that it is a much cleaner SELECT statement now, it is more efficient and it logically makes more sense when you look at it. 

One more common mistake is that people just mimic the expressions in their SELECT list in the GROUP BY clause, without thinking logically about what grouping is really necessary.  For example:

SELECT LastName + ', ' + FirstName, ... etc ...
FROM Names
GROUP BY LastName + ', ' + FirstName 

Again, that is logically wrong and also less efficient.  You should not be grouping by the expression itself; you should be grouping by what is needed to evaluate that expression. The correct grouping is:

GROUP BY LastName, FirstName

Too many people just keep stuffing column names and expressions into the GROUP BY clause until the errors go away without stepping back and thinking logically about how grouping works.  Take a minute to really consider what you need to return and how it should be grouped, and try using derived tables more frequently when writing aggregate queries to help keep them structured and efficient.

Check For Row Existance



It’s something you need to take care of in almost every ASP.NET Application. What if you would just like to know if that specific row exists or not, in SQL Server 2005 or SQL Server 2008. For example: Did this user vote on that topic already or not? There are different possibilities for that kind of situation. However, I wanted the fastet solution. If it’s about a query which looks like this:

SELECT somefield FROM sometable WHERE somefield = ’somevalue’

In this case ’somevalue’ is used as a literal key value, which also has an index running on it.

So if this query will always return either 0 or 1 rows, then – from an I/O point of view – using SELECT COUNT(*) or using EXISTS will be equally fast.

Why? Unfortunately SQL Server is not shortcutting an index seek if the value is being discovered in an intermediate index level. It also doesn’t shortcut it, if the value is out of range. So we always got a logical read for all index levels.

For the discussion COUNT(*) versus EXISTS, it does not matter whether the index on ’somefield’ is clustered or not. However, the definition of the clustered index (if present) does affect performance. That’s about testing, you need to check out if the clustered on is faster or not. It mainly depends on the key size of the clustered index, the row size and the size of ’somefield’. In theory, the fastest situation would be a clustered “and” nonclustered index on ’somefield’. This will make the nonclustered index on ’somefield’ the most shallow index possible, so the index seek on thisindex will use the least amount of I/O.

Finally handing over a resultset will be more costly than returning a return value.

Listbox Databind

Using datasets and SQL Server to bind data to controls is all very well and good, but what if you don’t have a database? You could add items from an array or list using a foreach loop, but luckily there is a better way.

.Net allows you to easily bind a list object to a list box control without the need to iterate through every item in the list. This method even works for custom types such as classes or structures.

In this example I have created a custom Person structure that holds first name and last name. I then set the data source of a list box to the instance of the Person object. When the program is run, the list box will contain “John” and “Jane”. For advanced data types you can set the property to be displayed using the DisplayMember property of the list box.


 
public partial class Form1 : Form
{
public Form1()
{
InitializeComponent();
}

private void Form1_Load(object sender, EventArgs e)
{
Person[] People = new Person[] {
new Person("John","Smith"),
new Person("Jane" ,"Doe")};
lstPersons.DataSource = People;
lstPersons.DisplayMember = "FirstName";
}
}

public struct Person
{
private string firstName, lastName;

public Person(string firstName, string lastName)
{
this.firstName = firstName;
this.lastName = lastName;
}

public string FirstName
{
get
{
return firstName;
}
}

public string LastName
{
get
{
return lastName;
}
}
}


If you do not specify a DisplayMember it will use the value from the ToString() method (default will display “ListBoxBinding.Person”, but you can override the ToString() method to provide data in a format you require.