Codelines

Debugging the art of software

Database And In-Memory Paging With ADO.NET And LINQ

When performing queries over large datasets there is often the need to page the data, such that it can be displayed in smaller amounts, aka pages, incrementally. When thinking about database paging two immediate distinctions can be made; in-memory paging and database paging.

Depending on the size and usage of the data, each has its advantages and disadvantages.

Database and In-Memory paging considerations

Database paging is efficient as the result set is filtered by the database before the data is returned. This is a good thing when querying over very large datasets where the transferring of large amounts of data over the wire is not desirable. Depending on the number of pages that occur, many queries/and or connections (depending on whether a persistent connection is used throughout the life of paging) may need to be made in contrast to “in-memory” paging where all the data is retrieved from the database first, and pagination occurs on the in-memory model instead of the database. The latter obviously involves transferring more information over the wire but saves on trips made back to the database.

Depending on the amount of data transferred, the type of connection to the database and whether or not the data being paged is allowed to be “stale”; a side affect of working with in memory data, in memory paging may still be desirable.

In this post, I am going to take a look at both database and in memory paging with ADO.NET and LINQ.

Data paging with ADO.NET 

 

In-Memory paging with ADO.NET

In-memory paging with ADO.NET involves running a command object, filling a dataset with a data adapter and using a data reader to linearly read the datasets contents. Paging occurs by keeping an index of what record is currently being used and how many records should be returned in the next page.

See below for example:

public class AdoMemoryPager : IAdoPager
{
    public void Page(DbProviderFactory factory, DbConnection connection,
                          int pageSize, string selection)
    {
        // Find all the people that are interested in the
        // chosen technology.
        string commandText = "SELECT * FROM person as p"
        + " INNER JOIN PersonTechnologyAssignment AS pta"
        + " ON p.PersonID = pta.PersonID"
        + " INNER JOIN Technology as t"
        + " ON pta.TechnologyID = t.TechnologyID"
        + " WHERE t.TechnologyName = '" + selection + "'";
 
        DbCommand dbCommand = ADOUtility.CreateCommandObject(factory,
                         connection, commandText);
 
        // Have provider create adapter.
        DbDataAdapter adapter = factory.CreateDataAdapter();
        adapter.SelectCommand = dbCommand;
 
        int index = 0;
        DataSet dataset = new DataSet();
 
        // Have provider fill adapter.
        index += adapter.Fill(dataset, index, pageSize, commandText);
 
        IDataReader dbReader;
        while (index > 0)
        {
            // Get a reader over dataset.
            dbReader = dataset.CreateDataReader();
            while (dbReader.Read())
            {
                Console.WriteLine("{0} {1}", dbReader["FirstName"],
                                             dbReader["Surname"]);
            }
            // Empty the dataset.
            dataset.Clear();
            Console.ReadKey();
            int paged = adapter.Fill(dataset, index, pageSize,
                                      commandText);
            if (paged > 0)
                index += paged;
            else
                index = -1; // We are finished.
        }
    }
}

Database paging with ADO.NET

Coming soon…

Data paging with LINQ

The key thing to remember with LINQ is that it uses a deferred execution model. This means that a query is not run at the point of declaration, but at the point of usage, typically when traversing or enumerating over an enumerable query. You will see later why this is important with respect to data paging.

In-Memory paging with LINQ

In memory paging with linq is the same idea, except using a linq query over an in memory collection. The LINQ query is first defined and the result of query execution is stored in a list so that in memory paging can be performed rather than shifting the job of paging over to the database.

Notice the actual paging is performed by the Skip(…) and Take(…) extension methods when we query in the memory list.

class LinqMemoryPager : ILinqPager
{
    #region ILinqPager Members
 
    public void Page(TechnologyDataContext context, int index, int pageSize,
                          string selection)
    {
        IEnumerable people;
        people = (from p in context.PersonTechnologyAssignments
                     where p.Technology.TechnologyName.Equals(selection)
                     select p.Person);
 
        // Query is run here. We pull it into memory, and page off of
        // the memory model. This is stale data, but doesn't require
        // going back to database.
        IList pList = people.ToList();
 
        while (true)
        {
            IEnumerable pagedPeople;
            pagedPeople = (from Person p in pList
                                 select p).Skip(index).Take(pageSize);
 
            IList pList = pagedPeople.ToList();
 
            foreach (var person in pList )
            {
                Console.WriteLine("\t\t {0} {1}", person.FirstName,
                                                   person.Surname);
            }
 
            // ... Omitted for brevity as this section is much the same.
        }
    }
 
    #endregion
}

Database paging with LINQ

Again, this version is essentially the same, except the skip and take is performed on the enumerable instead of the in memory collection. If you have logging enabled and are connecting to SQL Server 2005 or later…

dataContext.Logging = Console.Out;

You will notice that because paging is being performed in the database, the LINQ query has generated optimised SQL for this. If you are using SQLServer 2005 or later, you should notice that LINQ has generated SQL using the newer SQL code that was added in 2005 for data paging. This demonstrates that one great advantage of data paging in LINQ over ADO.NET is that you as a programmer do not have to worry about optimising your SQL queries as you switch databases. LINQ will do it for you automatically.

class LinqDatabasePager : ILinqPager
{
    #region ILinqPager Members
 
    public void Page(TechnologyDataContext context, int index, int pageSize,
                     string selection)
    {
        while (true)
        {
            var people = Queryable.Skip(
                    (from p in context.PersonTechnologyAssignments
                    where p.Technology.TechnologyName.Equals(selection)
                    select p.Person), index
            ).Take(pageSize);
 
            // Query is run when entering.
            foreach (var person in people)
            {
                Console.WriteLine("\t\t {0} {1}", person.FirstName,
                                                  person.Surname);
            }
            // Query is re-run here.
            int length = people.Count();
 
           // ... Omitted for brevity as this section is much the same.
        }
    }
    #endregion
}

One more thing to notice with the database paging example is that every time you evaluate the query expression, the query is re-generated and executed on the database. Even calling people.Count() in this case causes a new query to be run on the database. This makes sense when you consider the default behaviour of the deferred execution model, as the query and or data source could have changed.

Being aware of this behaviour however is important. Even in this trivial example the database is hit twice for every page.

In-Memory paging with Compiled LINQ Queries

Similar to the in-memory paging we looked at earlier, except this time the paging is performed by a compiled linq query. Compiled LINQ queries are cached for the life of their scope (in this case because the LINQ function is static, it is cached for the life of the program). This means the LINQ query is only compiled once.

Again, if you have logging enabled you will notice that only the paramaters provided to the LINQ function change. Specifically, the index and page size in this example.

class CompiledLinqMemoryPager : ILinqPager
{
    #region ILinqPager Members
 
    public void Page(TechnologyDataContext context, int index, int pageSize,
                          string selection)
    {
        while (true)
        {
            IList people = CompiledPageQuery(context, selection,
                        index, pageSize).ToList();
 
            // Query is run when entering.
            foreach (var person in people)
            {
                Console.WriteLine("\t\t {0} {1}", person.FirstName,
                                                  person.Surname);
            }
            // ...Omitted for brevity as this section is much teh same.
        }
    }
 
    // Compiled linq query
    private static Func> CompiledPageQuery =
    CompiledQuery.Compile((TechnologyDataContext context, string selection,
                                   int toSkip, int toTake) =>
        (from p in context.PersonTechnologyAssignments
        where p.Technology.TechnologyName.Equals(selection)
        select p.Person).Skip(toSkip).Take(toTake).AsEnumerable());
 
    #endregion
}

Conclusions

Data paging can be performed by the database or in-memory depending on the need and preferred behaviour. Implementations range SQL paging, stored procedures in ADO.NET to LINQ queries and even compiled LINQ queries.

Database paging in LINQ especially has its advantages over ADO.NET in that LINQ takes care of generating the underlying SQL to be executed. This means when switching to another database vendor or upgrading to a newer version of the same database provider, LINQ will automatically generate the most optimised code it can without you having to re-write and optimise the SQL code directly (which you would have to do in ADO.NET). Looking at database paging in LINQ on SQL Server 2005 logging shows us that LINQ is making use of the SQL Server 2005’s data paging additions.

LINQ further makes data paging easier than ADO.NET by providing two simple extension methods on enumerable objects; the Skip(…) and Take(…) functions.

All in all, LINQ’s ease of use and automatic optimisation of underlying SQL code makes it a good choice. However, deferred execution means you will be hitting the database twice for every page in this example (because we obtain the count afterward). If you switch to in-memory paging you lose the advantage of fast optimised data paging when the database supports it.

Benefits of The ADO .Net Managed Provider Factory Model

The .NET managed provider factory model makes it easy to switch between database providers without having to change or re-compile any code. How does it achieve this vendor independent behaviour? Through a software pattern or model known as the database provider model.

Database Providers

Typically, when writing applications that make use of ADO.NET you make use of a database provider; a namespace that essentially defines a collection of types for connecting to, and operating on, a specific database implementation. These include creating the initial connection, running SQL commands, reading or operating on the results etc. Each database provider, defines these types in its namespace. One such example, is the System.Data.SqlClient namespace which defines all the types need to connect and operate on a SQL Server database. The System.Data.SqlClient namesapce defines types such as SqlConneciton, SqlCommand and SqlDataReader and SqlDataAdapter.

Benefits to using data provider specific implementations

Using data provider specific types creates two benefits for the developer; these types can come with optimsations for the underlying database whilst offering extra provider specific functionality specific to the database implementation which is not common across all databases.

Disadvantages to using data provider specific implementations

Whilst using provider specifc types yields the aforementioned benefits it also has the unwanted side effect of binding an application strongly to a specific provider. This makes portability difficult since switching database providers would require a lot of code re-write.

In steps .NET 2.0 Data Provider Factory

.NET 2.0 brings with it a new data provider factory which is essentially a common namespace for database types; connection, command, data reader, data adapter, exception etc. This namespace; System.Data.Common provides data provider independent implementations of these types.

Any data provider implementations that inherit from the abstract System.Data.Common types can now be interchanged without instantiating specific types. This latter part, is provided by the System.Data.Common.DbProviderFactories, which is responsible for returning the correct types according to the information the factory is instantiated with. So if you provide the factory with an Sql provider, it will take care of returning you the correct Sql* implementations of the types you request. Similarly, providing the factory with a MySQL provider will will result in MySQL* types being returned. Note that you are always working with the generic abstract type and not the provider specific type.

 

Using application settings, you can define the provider and connection information outside your code which means you can change your database provider without having to re-compile or touch a single line of code!

public class DataProviderSpike
{
    private static DbProviderFactory _DbProviderFactory;
 
    public static void Main(String[] args)
    {
        Console.WriteLine("===== Data Provider Factory Pattern ======\n");
 
        // Get connection string infromation from *.config
        string dbProvider = ConfigurationManager.AppSettings["provider"];
        string conn= ConfigurationManager.AppSettings["connectionString"];
 
        // Get a factory for the provider information in our
        // application settings.
        _DbProviderFactory = DbProviderFactories.GetFactory(dbProvider);
 
        // Create generic connection object using helper utility.
        DbConnection dbConn = ADOUtility.OpenConnection(_DbProviderFactory,
                                         conn);
 
        string commandText = "SELECT * FROM Person";
 
        // Create generic command object.
        DbCommand dbCommand = DbProviderFactory .CreateCommand();
        dbCommand.Connection = dbConnection;
        dbCommand.CommandText = commandText;
 
        // Create generic data reader.
        DbDataReader reader = dbCommand.ExecuteReader();
 
        // Helper utlity for printing out provider info.
        ADOUtility.PrintProviderInfo(dbConn, dbCommand, reader);
 
        Console.WriteLine("\n\nRunning command: {0}\nResults:", commandText);
 
        while (reader.Read())
        {
            Console.WriteLine("\t\t {0} {1}", reader["FirstName"],
                                    reader["Surname"]);
        }
        Console.WriteLine("\nPress any key to exit...");
        // Pause execution.
        Console.ReadLine();
    }
}

Summary

The portability and flexability of the data provider factory pattern comes at the cost of losing the extra “goodies” and optimised types that would have otherwise been provided by the data provider specific implementation. Depending on the project, it is a case of deciding which is more important. Database specific functionality with potential provider optimisations, or ease of migration to different database vendors.

Debugging Linq Queries And Object Initializer Gotchas

LINQ is a great time saving technology that distances the programmer from having to write complex SQL queries when working with databases. It offers a generic means for querying heterogenous IEnumerable data sources. It provides simple object relational mapping.

Object Initializers provide a short hand way of initializing an object in a single statement, rather than having to A) Build numerous constructors for different property combinations or B) Manually set properties one after that other post object creation.

One thing that quickly becomes evident when using object initializers with LINQ queries however (which is a very common thing to do by the way), is the problems associated with debugging LINQ queries that utilise object initializers.

Take for example a very simple query that returns all the people whose first name begins with ‘S’. In this example, we are just returning an anonymous type but you could just as easily return a typed object.

var people = from p in context.Persons
                 where p.FirstName.StartsWith("S")
                 select new
                 {
                     FirstName = (string)p.FirstName,
                     Surname = (string)p.Surname,
                     EmailAddress = (string)p.EmailAddress,
                     Age = (int)p.Age
                 };

One basic rule to remember when working with object initialisers is that they are an all or nothing approach towards object initialisation. This is to protect the object from being only partly initialised. If any one property fails initialisation, the object fails initialisation and cannot be constructed (resulting in an error).

In the case of our LINQ query above, should any one property fail we get an error pointing us to a problem beginning the line our query was declared on; var people = …. Depending on what data source you were working with, what you were trying to do when assigning the property that may have failed and what sort of qualifiers or constraints you had imposed on the LINQ query (such as Single() ) it can be a bit of a trial and error approach to discovering which property or constaint caused the error.

Consider another simple example where again, we are looking for contacts whose first name begins with ‘S’ but this time are quering a Linq to XML data source.

XDocument document = new XDocument("addressbook.xml");
var people = (from c in document.Descendants("Contact")
                where c.Element("FirstName").Value.StartsWith("S")
                select new
                {
                    AddressBookName = (string)c.Ancestors("Addressbook")
                                                .Single().Element("Name"),
                    AddressBookType = (string)c.Ancestors("Addressbook")
                                               .Single().Element("Type"),
                    FirstName = (string)c.Element("FirstName"),
                    Surname = (string)c.Element("Surname"),
                    EmailAddress = (string)c.Element("EmailAddress"),
                    Age = (int)c.Element("Age
                }).Single();

This is a completely fictious example to demonstrate a point. But looking at this example there are easily several problems that could occur. The query could return more than one result where we have declared it to be Single. This could be the entire query where we are only expecting one contact, or it could be the AddressBookName or AddressBookType. Any of the string castings could fail giving us no indication as to which property failed.

Using object initializers with LINQ provides quick, flexible and powerful object querying and manipulation. Whilst quick to write, it still pays to be aware of the potential maintenance issues further down the track when you encounter data your query doesn’t handle so well. Even trivial examples like the aforementioned can take multiple debug attempts in order to ascertain the real problem.

Do you have any good techniques for debugging linq queries and object initializers?