Fabian's Mix

Mixins, .NET, and more

re-linq: Support for SelectMany without result selector

without comments

While this has been missing from the start, different people reported it at about the same time a few weeks ago: support for the SelectMany overload without a result selector.

What does that mean?

Consider the following C# LINQ query:

from c in Cooks

from k in Kitchens

select c.Name + ” at “ + k.Name

This is equivalent to this query, which makes the query operators being used explicit:

Cooks.SelectMany (c => Kitchens, (c, k) => c.Name + ” at “ + k.Name)

SelectMany is a query operator that combines each item from an input sequence with each item from a second input sequence. The latter is calculated by means of a collection selector – the first argument to the SelectMany method in the example above. The collection selector returns the Kitchens sequence for each call; but it could potentially return a separate sequence for each c.

So, the purpose of SelectMany is to combine items from the sequences. But how does it combine them? This is where the second argument comes into play: the result selector. In the example, it builds a combined string from the cook’s and kitchen’s names.

Okay, now this is LINQ 101, where is the point?

There are cases where you don’t really want to combine the items of two sequences; you just want to “flatten” a nested sequence. Here’s an example:

var allAssistants = Cooks.SelectMany (c => c.Assistants);

This code simply takes all assistants of all cooks and returns them as a flattened sequence.

While the SelectMany overload is convenient, it will never be used when you write a LINQ query using the C# syntactic sugar form (from … select …). That’s why re-linq hasn’t included any default support for it – until now. Starting with build 1.13.88 (released today), re-linq can now parse this query out of the box.

About the technical details: when encountering a query using the simpler overload, such as the one above, re-linq just pretends there is a trivial result selector:

var allAssistants = Cooks.SelectMany (c => c.Assistants, (c, a) => a);

That way, the parsing code didn’t need to be changed at all. The QueryModel also doesn’t care about which overload you used – it will always represent SelectMany as an AdditionalFromClause. This means that LINQ providers using re-linq don’t have to change anything to make this work.

Which is certainly a good thing.

Written by Fabian

December 23rd, 2010 at 3:27 pm

Posted in re-linq

C# 5: await and ThreadPool.SwitchTo()

with one comment

Last week, I managed to read the 38-page whitepaper explaining the .NET Task-based Asynchronous Pattern in detail. That’s the pattern behind the new C# and VB.NET “async” support everybody is blogging about, for example, Eric Lippert, Alexandra Rusina, or Jon Skeet.

Those blog posts I’ve linked to give good examples on how you can write code orchestrating asynchronous tasks in a very simple, linear fashion. For example, from Eric Lippert’s blog:

async void ArchiveDocuments(List<Url> urls)
{
Task archive = null;
  for(int i = 0; i < urls.Count; ++i)
  {
    var document = await FetchAsync(urls[i]);
    if (archive != null)
      await archive;
    archive = ArchiveAsync(document);
  }
}

This code starts tasks that asynchronously (“without blocking the application”) fetch documents and archive them, with the coordination between the tasks written using an ordinary for loop and an if clause. This is cool because with previous asynchronous patterns in .NET, ie., the Begin…/End… Asynchronous Programming Model, or the Event-Based Asynchronous Pattern (heavily used by Silverlight applications), coordination of asynchronous tasks is quite cumbersome, and cannot be written in such a linear manner.

With the Task-based Asynchronous Pattern, the term asynchronous does not necessarily mean multithreaded – whether the FetchAsync and ArchiveAsync methods run their work on a background thread, in a separate process, in the cloud, or via some esoteric operating system mechanism does not matter for the orchestration code. The Task-based Asynchronous Pattern uses the .NET synchronization context and task scheduler mechanisms to ensure the coordination method is resumed on the correct thread when an awaited task has finished performing its work; it doesn’t care how (or on which thread) exactly the work was performed.

There is an interesting consequence of this, which I haven’t seen anyone blog about yet: you can also await pseudo-tasks that do not schedule work, but only switch to another context!

Here is an example from the aforementioned whitepaper:

public async void button1_Click(object sender, EventArgs e)
{
  string text = txtInput.Text;

  await ThreadPool.SwitchTo(); // jump to the ThreadPool

  string result = ComputeOutput(text);
  string finalResult = ProcessOutput(result);

  await txtOutput.Dispatcher.SwitchTo(); // jump to the TextBox’s thread
 
  txtOutput.Text = finalResult;
}

I like this a lot! Without await, you would use the ThreadPool.QueueUserWorkItem() method, passing in a continuation delegate that first performs the expensive computations, then calls Dispatcher.BeginInvoke(), passing in a continuation delegate that sets the text box text:

public async void button1_Click(object sender, EventArgs e)
{
  string text = txtInput.Text;

  ThreadPool.QueueUserWorkItem (() => {
      string result = ComputeOutput(text);
      string finalResult = ProcessOutput(result);

      txtOutput.Dispatcher.BeginInvoke ((Action) () => txtOutput.Text = finalResult);
  });
}

Of course, await more or less uses the same continuation concepts under the hoods. But code using await is so much easier to read than explicitly wiring up the delegates.

Asynchrony is one of these situations where syntactic sugar matters a lot, I think.

Written by Fabian

November 8th, 2010 at 9:29 am

Posted in .NET development

re-linq: Extensibility: Custom query operators

with 2 comments

This is a blog post about re-linq that I’ve been planning to write for a very long time. It is the documentation of how to extend the re-linq front-end so that it can detect custom query methods, and it’s going to be quite long for a blog post. Read it if you’re a user of re-linq and you are trying to get your LINQ provider to understand your own custom extension methods.

As I’ve mentioned before, re-linq differentiates between clauses, which form the main part of a query, and result operators, which describe operations conducted on the result of the query. Query methods such as Select and Where are parsed into clause objects, whereas methods such as Distinct, Take, or Count are parsed into result operator objects. Both clauses and result operators are represented in the QueryModel that represents the query analyzed and simplified by re-linq’s front-end.

The distinction between result operators and clauses is important because it defines what goes into a sub-query, and what is simply appended to the current query. For example, consider the following query:

Query<Order>()
  .Where (o => o.DeliveryDate <= DateTime.UtcNow)
  .OrderByDescending (o => o.DeliveryDate)
  .Take (5)
  .OrderBy (o => o.OrderNumber)

In this case, the Where and OrderByDescending method calls are parsed into clauses of a single QueryModel. The Take method call is appended to that QueryModel as a result operator. The OrderBy call following the Take call, however, is parsed into a different, outer QueryModel, which embeds the former QueryModel as a sub-query. Like this:

from o’ in (
  from o in Query<Order>()
  where [o].DeliveryDate <= DateTime.UtcNow
   orderby [o].DeliveryDate desc
   select [i]).Take (5)
orderby [o’].OrderNumber
select [o’]

(It is necessary to form a sub-query in this case because the OrderBy operation must take place after the Take operation. This makes a huge semantic difference!)

I’ve talked about the sub-query system in the past, and I believe it to have many positive properties with regards to scoping of identifiers, and similar. Also note how the final parsed query looks very similar to how the query would be written in C#’s query syntax. However, the point I was trying to make is that a query method parsed as a clause might cause the previous stuff to be wrapped into a sub-query, whereas a query method parsed as a result operator is always simply attached to the current QueryModel.

Therefore, if your goal is to extend a query by simply attaching some information to it, you will want to go the result operator route. Which is as follows:

  1. Create an extension method for queries that represents the information you want to attach to the query. This is the user’s entry point in attaching the information, and it has nothing to do with re-linq; it’s simply how LINQ works.
  2. Define a result operator class that represents the information in the parsed QueryModel.
  3. Write an expression node parser that translates the call to the extension method to your result operator and wire up that parser with your LINQ provider.
  4. Add code handling the result operator to your LINQ provider back-end. This, again, has not much to do with re-linq.

The rest of this post will walk you through a sample. Only steps 2 and 3 are really specific to re-linq, but I’ll include the other steps for completeness.

This is the query we want to be able to run in the end:

var query = (from o in QuerySource

             where o.ID != 0

             select o).AdditionalInfo (o => o.OrderDate);

 

var result = query.ToArray ();

1. Create an extension method

Here’s the AdditionalInfo method:

public static class MyQueryExtensions

{

  public static IQueryable<T> AdditionalInfo<T> (
      this IQueryable<T> source, 
      Expression<Func<T, int>> parameter)

  {

    return source.Provider.CreateQuery<T> (
        Expression.Call (
            ((MethodInfo) MethodBase.GetCurrentMethod ())
                .MakeGenericMethod (typeof (T)),

            source.Expression,

            Expression.Quote (parameter)));

  }

}

(Bug fixed on 2010-12-02)

This is the “standard” way to write LINQ extension methods. Ouch.

Note: If you write a lot of these, you might want to use some kind of expression generation helper:

public static class MyQueryExtensions

{

  public static IQueryable<T> AdditionalInfo<T> (
      this IQueryable<T> source,
      Expression<Func<T, int>> parameter)

  {

    return CreateQuery (source, s => s.AdditionalInfo (parameter));

  }

 

  private static IQueryable<T> CreateQuery<T, TR> (
      IQueryable<T> source, Expression<Func<IQueryable<T>, TR>> expression)

  {

    var newQueryExpression = ReplacingExpressionTreeVisitor.Replace (
        expression.Parameters[0],
        source.Expression,
        expression.Body);

    return source.Provider.CreateQuery<T> (newQueryExpression);

  }

}

This allows you to specify the new expression tree without having to build it by hand.

2. Define a result operator that represents the information in the parsed QueryModel

Here it is:

public class AdditionalInfoResultOperator
    : SequenceTypePreservingResultOperatorBase

{

  public AdditionalInfoResultOperator (Expression parameter)

  {

    Parameter = parameter;

  }

 

  public Expression Parameter { get; private set; }

 

  public override string ToString ()

  {

    return “AdditionalInfo (“
      + FormattingExpressionTreeVisitor.Format (Parameter)
      + “)”;

  }

 

  public override ResultOperatorBase Clone (CloneContext cloneContext)

  {

    return new AdditionalInfoResultOperator (Parameter);

  }

 

  public override void TransformExpressions (
      Func<Expression, Expression> transformation)

  {

    Parameter = transformation (Parameter);

  }

 

  public override StreamedSequence ExecuteInMemory<T> (StreamedSequence input)

  {

    return input; // sequence is not changed by this operator

  }

}

As mentioned before, the purpose of this class is mainly to represent the information expressed by the extension method in the resulting QueryModel. That’s why it has a property holding the extension method’s “parameter” value. The Clone, TransformExpressions, and ExecuteInMemory methods are used when the QueryModel holding the result operator is cloned, transformed, or executed in memory. If a LINQ provider does not make use of these features, the methods needn’t be implemented; but since it’s not much work, I did it here for completeness. The ToString method is overridden only for diagnostic reasons.

Note: If you find yourself in a case where you need a lot of parameterless result operators that represent some sort of ‘options’, you should probably define a single OptionsResultOperator that has an “optionKind” discriminator value (enum, string, or even MethodInfo).

3. Write an expression node parser that translates the method to the operator

Now, what’s this for? Can’t re-linq just automatically instantiate an AdditionalInfoResultOperator when it detects the extension method? By registration, or even by convention, for example?

In the most trivial cases, it probably could. There are a lot of more complex cases, though, especially if the extension method takes parameters. One of the tasks of the re-linq front-end is, after all, to remove transparent identifiers and to analyze where the data for a clause (or operator) stems from. This task can’t really be automated, and that’s why there’s one additional abstraction between query methods and result operators: the expression node parser. Here it is:

public class AdditionalInfoExpressionNode : ResultOperatorExpressionNodeBase

{

  public static MethodInfo[] SupportedMethods =
      new[] { typeof (MyQueryExtensions).GetMethod (“AdditionalInfo”) };

 

  private readonly LambdaExpression _parameterLambda;

 

  public AdditionalInfoExpressionNode (
      MethodCallExpressionParseInfo parseInfo, LambdaExpression parameter)

      : base (parseInfo, null, null)

  {

    _parameterLambda = parameter;

  }

 

  protected override ResultOperatorBase CreateResultOperator (
      ClauseGenerationContext clauseGenerationContext)

  {

    var resolvedParameter = Source.Resolve (
        _parameterLambda.Parameters[0],
        _parameterLambda.Body,
        clauseGenerationContext);

    return new AdditionalInfoResultOperator (resolvedParameter);

  }

 

 public override Expression Resolve (
      ParameterExpression inputParameter,
      Expression expressionToBeResolved,
      ClauseGenerationContext clauseGenerationContext)

  {

    return Source.Resolve (
        inputParameter,
        expressionToBeResolved,
        clauseGenerationContext);

  }

}

This is probably the most “complicated” class in the whole process, because it’s not obvious at first glance what that class actually does; but it is a case of necessary complexity. re-linq translates and simplifies queries as it parses expression trees. If you want to extend that process, you need to tell re-linq how to perform that translation and simplification. And this is exactly what the two methods, CreateResultOperator and Resolve do.

But first things first. As you can see, the node class is derived from a base class specifically designed for the parsing of result operator query methods: ResultOperatorExpressionNodeBase. The SupportedMethods field is a convention that makes it easier to register the parser with re-linq later on.

When re-linq traverses an expression tree and encounters a MethodCallExpression instance that matches a registered expression node parser, that parser is instantiated, and the method call’s actual argument expressions are passed to the parser’s constructor. Therefore, in our example, the constructor has two parameters. The first one, parseInfo, represents the extension method’s source parameter. It contains information about the expression being parsed, the previous expression node in the query, and so on. Our parser doesn’t use this apart from passing it to its base class. The second parameter, “parameter”, directly corresponds to the second argument of the extension method. Where the extension method takes an Expression<Func<T, int>>, the node takes a LambdaExpression (the non-generic version of the former). The parser stores the parameter for later use.

The CreateResultOperator method tells re-linq how to translate the MethodCallExpression being parsed into a result operator to be put into a QueryModel. It simply returns an instance of the AdditionalInfoResultOperator declared before, passing in the method call’s parameter. Before doing so, however, it has its Source resolve the parameter. What does that mean?

Often, when you create a result operator, you don’t really care about the LambdaExpression passed to the extension method in the query. An expression such as o => o.OrderDate doesn’t tell you which o’s OrderDate should be retrieved. After all, the query before the extension method might not even contain an item called “o”.

Therefore, re-linq allows the expression node parser to resolve the origin of its lambda’s input data while it is translating the expression into a result operator. To do so, the parser calls the Resolve method on the expression node directly preceding itself in the query. That way, o => o.OrderDate becomes [o].OrderDate, with [o] being a reference to the corresponding clause producing the items.

To enable this mechanism, each expression node parser must implement the Resolve method in such a way that the changes made by the result operator on the query result are reflected in the resolution result.

For example, if the result operator represents a step in the query that produces new items, its Resolve method returns an expression describing those new items. If it changes the items coming from its own Source, it returns an expression representing that change. If it simply passes through the items coming from its Source without adding new ones or changing existing ones, it simply passes on the Resolve request to the Source. Our sample result operator is purely informational and does not do anything with the items flowing through the query; so it takes that last option.

And that’s all there’s to say about expression node parsers for result operators. To summarize, each parser instance represents one MethodCallExpression instance. The parser has two tasks, so it has two methods. CreateResultOperator tells re-linq how to translate the extension method into a result operator, optionally simplifying the LambdaExpression parameter in the process. Resolve allows subsequent expression node parsers to simplify their own LambdaExpressions.

To have re-linq use the expression node parser, register it as follows:

nodeTypeRegistry.Register (

    AdditionalInfoExpressionNode.SupportedMethods,

    typeof (AdditionalInfoExpressionNode));

The nodeTypeRegistry can be obtained from a QueryParser via the ExpressionTreeParser and NodeTypeRegistry properties, or it can be injected into the constructor of DefaultQueryProvider when a new queryable instance is created.

Note: If you have several extension methods that are all mapped to the same result operator, you should only create a single expression node parser that has multiple SupportedMethods.

4. Add code handling the result operator to your LINQ provider back-end

Now, this is simple again. Using the registered expression node parser, re-linq now detects the extension methods and creates a result operator for it. You can inspect a QueryModel’s result operators via the ResultOperators collection. If you have a QueryModelVisitor, just override the VisitResultOperator method and check the result operator’s type. You can also choose to implement extended (acyclic) visitors by overriding the result operator’s Accept method.

In a nutshell, adding new extension methods to your own LINQ provider is quite possible once you understand how re-linq works. Apart from the inherent stuff, like defining an extension method, you need to define two classes: a result operator and an expression node parser. The result operator represents the information in the QueryModel, the parser translates from the expression tree to the result operator.

If you need samples, take a look at re-linq’s source code; for example the TakeResultOperator and TakeExpressionNode classes.

And if you have good ideas about how to simplify this specific extensibility point – let me know.

Update (2010-12-02): There was a bug in the example code given under step 1. I’ve fixed it.

Written by Fabian

October 28th, 2010 at 2:23 pm

Posted in re-linq

re-linq: Query operators defined by interfaces

with 4 comments

And another question that makes for a good blog post: Why don’t we support automatic detection of interface implementation in the re-linq front-end when we detect query operators?

Consider the following example:

// This works

var cookNamesArray = new[] { "john", "joe", "jim" };

var query1 =
from c in Cooks
where cookNamesArray.Contains (c.FirstName)
select c.FirstName;

As the comment implies, this works – the Contains call is detected as a query operator. It is so because the expression above refers to Enumerable.Contains<T>(), for which a node type parser is, by default, registered in the MethodCallExpressionNodeTypeRegistry. re-linq uses this registry to determine what methods are query operators, and how to parse them.

Now, consider this example:

// This does not work

var cookNamesMyList = new MyList<string> { “john”, “joe”, “jim” };

var query2 =

    from c in Cooks

    where cookNamesMyList.Contains (c.FirstName)

    select c.FirstName;

 

// This works again

IEnumerable<string> cookNamesMyListAsEnumerable = new MyList<string> { “john”, “joe”, “jim” };

var query3 =

    from c in Cooks 
    where cookNamesMyListAsEnumerable.Contains (c.FirstName)

    select c.FirstName;

The first Contains call in that second example is not recognized as a query operator because MyList<T>.Contains() has not been registered as a query operator. Since re-linq analyzes the expression tree generated by the compiler, the second call is recognized again because it is again the Enumerable.Contains<T>() method.

Now, a sensible suggestion would be the following: Since MyList<T> probably implements ICollection<T>, why not register a parser for ICollection<T>.Contains()?

And the answer is: because it wouldn’t help. (Unless you cast to ICollection<T>, that is.)

As long as the method in the Expression tree isn’t ICollection<T>.Contains() but MyList<T>.Contains(), re-linq wouldn’t be able to determine that the registered parser should be used. But why not? After all, we can do it, why can’t re-linq?

Automatic detection of interface implementations for query operators is a difficult topic, primarily because it’s hard to get right with good performance. Currently, the re-linq front-end is quite fast because it only does dictionary lookups for the methods coming in via MethodCallExpressions (plus GetGenericTypeDefinition()/GetGenericMethodDefinition()).

With interface method implementations, this is not that simple – we can’t perform dictionary lookups, which are O(1) operations (leaving the calls mentioned above aside); we have to get an interface map for the registered interfaces and perform a linear search through the TargetMethods array of that interface map. This makes the lookup an O(N*M) operation, where N is the number of registered interface types, and M is the number of methods in those interfaces. (And probably, GetInterfaceMap() adds another dimension of complexity.)

We could try to make it perform better by detecting interface methods by name first, and only then check the interface map, but it would still be much slower than an ordinary lookup – and you must not forget that re-linq has to perform that lookup for every MethodCallExpression in an expression tree in order to detect whether a method call refers to a query operator or not. So this has to be really fast.

Complex logic like this could only be implemented via sophisticated caching, which we’ve been able to avoid before. We’d have to devise a good caching scheme, expose control over the cache’s lifetime to the user of re-linq (ie., the LINQ provider), and so on.

So, this is not a small feature to implement, which is why we haven’t done it up to now. However, I’ve created a JIRA feature request for it, and we’ll consider implementing it in the future. (Oh, and we accept patches if they come with the necessary unit tests. If somebody would like to implement this feature, let’s discuss the design on the re-motion developers list.)

Until the feature is implemented, there are a few workarounds:

  • You can use Enumerable.Contains<T>(), as illustrated above. This works out of the box right now.
  • You can register a parser for common collection types (eg., List<T>) to at least support those. (List<T>.Contains() will become a default operator with RM-3340.)
  • You can register parsers for ICollection<T>.Contains() or IList<T>.Contains(), but have to cast the collection to the standard interface type: where ((ICollection<string>) cookNamesMyList).Contains (c.FirstName). This will also be implemented with RM-3340.
  • You can allow your users to register parsers for their own query operators (MyList<T>.Contains()) by exposing the MethodCallExpressionNodeTypeRegistry.
  • You can register a parser for the “Contains” method name. This is not a good option because it would also cause re-linq to detect string.Contains() as a query operator.

The standard ContainsExpressionNode parser should work for all of these cases.

Written by Fabian

September 23rd, 2010 at 9:35 am

Posted in re-linq

re-linq: Dealing with sub-queries in a from clause

without comments

On our re-motion users list, Fabio Maulo asks:

for this query
from Dog o in session.Query<Dog>() select o

I would catch the Cast method but it is parsed as SubQueryExpression.

Have you any advise ?
Thanks.

My short answer is: Look for the CastResultOperator in the ResultOperators collection of the SubQueryExpression’s QueryModel. Detect that the inner QueryModel is trivial by inspecting QueryModel.IsIdentityQuery(). Do this in your back-end’s respective VisitSubQueryExpression visitor method, or use the SubQueryFromClauseFlattener.

The long answer:

re-linq’s front-end works in such a way that it always wraps expressions that contain query operators (Select, Where, OrderBy, Count, Cast, and similar method calls) into QueryModel instances. If those expressions are part of a larger expression tree (eg., in the expression of a FromClause), re-linq will wrap the QueryModel into a SubQueryExpression and include that in the larger tree.

This sub-query system has several advantages, for example, it is highly consistent both in representation and analysis (because you can usually apply the same QueryModel analyzer to the outer and inner queries), it’s easy and fast to parse (no context is needed when re-linq’s structural parser encounters a Select MethodCallExpression, for example), and so on.

However, it means you have to explicitly deal with those SubQueryExpressions in your parser if you need to support such queries.

You have several options:

Option 1, if your target has good support for sub-queries (SQL has, for example), you can just translate those SubQueryExpressions into target sub-queries as they are. In the expression visitor parsing the expressions of from clauses, in the VisitSubQueryExpression method, apply your QueryModelVisitor to the expression’s inner QueryModel, then insert the resulting query as a sub-query into your result.

I’ll explain by using SQL. In the example above, you’d have the following:

SELECT [q0].*
FROM (SELECT [t0].* FROM [DogTable] [t0]) [q0]

You can choose to add an optimization step to the VisitSubQueryExpression method: if the QueryModel is trivial (QueryModel.IsIdentityQuery()) and all of its result operators can be handled inline, you can simply “flatten” the sub-query instead of parsing the inner QueryModel:

SELECT [t0].* FROM [DogTable] [t0]

As you can see, for this flattening, you have to take the FromExpression of the inner QueryModel and use it as the FromExpression of the outer query model. (If you also want to flatten non-trivial queries, you’ll need to perform an additional reference replacement step, but I won’t go into this now.)

This is how we do it in re-linq’s SQL backend (or how we will do it, since the flattener hasn’t been implemented yet). The advantage is that this concept supports all constructs that build SubQueryExpressions automatically, and that you can still build optimized statements.

Option 2, if your target has no good support of sub-queries, you can still detect the SubQueryExpression in the expression visitor’s VisitSubQueryExpression method. Check whether the inner QueryModel is trivial (IsIdentityQuery()) and whether all the result operators can be expressed in-line. If so, use the inner QueryModel’s FromExpression and modify it according to the result operators. You can do this, for example, using a dedicated SubQueryModelVisitor that returns a flattened result, which you can then insert into your outer query. If the QueryModel is non-trivial (eg., if somebody wrote “from o in session.Query<Dog>().Where (…).OrderBy (…) select o”), throw a NotSupportedException or something similar.

This is similar to option 1, but the optimization becomes the default (and only) route.

Option 3, if you feel this is too complicated to do in your back-end, you can perform a recursive sub-query flattening step on the outermost QueryModel before you feed it into your back-end’s analyzer or transformer. I’ve blogged about this before, and re-linq already contains a simple SubQueryFromClauseFlattener. If you want to use that, you’ll probably need to override CheckFlattenable() for the example above in order to make sure that the CastResultOperator is ignored.

If the existing SubQueryFromClauseFlattener is not enough for your purposes (or you don’t want to throw an exception if the sub-query cannot be flattened), you can implement your own transformation class in a similar way, of course.

The advantage of this option is that it keeps logic out of your back-end, which is likely complex enough anyway. The disadvantage is, however, that it requires a separate pass over the QueryModel. If performance is very important, it’s probably not the best option.

To summarize, I realize that at the first glance, re-linq’s sub-query model might sometimes look like overkill, especially when a SubQueryExpression is created just because of a single Cast, Count, or Single method call. On the other hand, it makes it possible to handle Cast, Count, or Single the same way, no matter if they are at the top of the expression tree or embedded in a Select expression, for example. And they can be handled the same way whether they follow directly on a query source (session.Query<Dog>().Count()) or not (session.Query<Dog>().Where(…).Select(…).Count()). To fully leverage this simplification, you need a target system that can also represent those sub-queries. If you got one, prefer option 1 from above.

Written by Fabian

September 23rd, 2010 at 7:44 am

Posted in re-linq

Handling the DefaultIfEmpty result operator in a re-linq-based provider

without comments

My recent musings about the DefaultIfEmpty LINQ query operator and how it is represented by re-linq had a cause: On the NHibernate mailing list, there is a discussion going on about (among other things) how to implement DefaultIfEmpty in NHibernate’s LINQ provider. Since that LINQ provider is based on re-linq, I thought I’d write a few things about this.

Let’s take a look at an example query similar to the one I used in my last post. I’ve simplified it a bit because I don’t want to talk about group joins in this post. Group joins are difficult and may be a topic for a future post on their own; here, I want to talk about DefaultIfEmpty. So, here’s the simplified query:

var query = from c in Cooks

            from k in Kitchens.DefaultIfEmpty()

            select new { c, k };

We have a query selecting the Cartesian product of the elements in the Cooks and Kitchens sequences, and we use DefaultIfEmpty so that we get tuples for each cook in the first sequence even if there are no kitchens in the second one.

This is the pretty-printed query model of that query:

from Cook c in TestQueryable<Cook>()
from Kitchen k in {TestQueryable<Kitchen>() => DefaultIfEmpty()}
select new <>f__AnonymousType2f`2(c = [c], k = [k])

As in the last post, we can see that the DefaultIfEmpty query operator causes a sub-query to be created, which selects its items from a query source (TestQueryable<Kitchen>()) before applying the operator.

The easiest way to handle the DefaultIfEmpty result operator in this query model is to keep the sub-query structure intact while analyzing the query. To illustrate what I mean, I’m going to show you step-by-step how I’d translate the second from clause to SQL if I were writing a SQL-generating LINQ provider (which, in fact, I am):

Okay, so here I am, I’ve implemented a QueryModelVisitor class which visits the query model’s clauses, one after the other, and generates SQL while doing so. I’ve already visited the main from clause, but no other clause yet, so my generated SQL looks as follows:

SELECT ???
FROM [CookTable] [t0]

Now, I’m in the VisitAdditionalFromClause method of my visitor, inspecting the additional from clause. An additional from clause usually means a Cartesian product, which is called “CROSS JOIN” in SQL:

SELECT ???
FROM [CookTable] [t0]
CROSS JOIN

I apply my ExpressionTreeVisitor to the clause’s FromExpression, and the visitor finds a SubQueryExpression. I keep the sub-query structure for now, and recursively generate SQL for the nested query by invoking a new instance of my QueryModelVisitor on the sub-query’s query model. I visit the sub-query’s main from clause and its select clause:

(SELECT [t1].*
FROM [KitchenTable] [t1])

Then, my visitor’s VisitResultOperator method is invoked, I detect the DefaultIfEmpty result operator. What now? I need a way to ensure that the query I just generated will always return at least one item; if there is none, I want NULLs – this is the semantics of DefaultIfEmpty.

In SQL, the way to achieve this is to perform a left outer join – but with what? I’m in my sub-query, I don’t have a left side I can join with? Shouldn’t I have descended into the sub-query so quickly? Maybe I should have flattened it somehow?

Yes, I could have done this – in this specific scenario. However, the important thing to notice about DefaultIfEmpty is that it can occur in the most absurd of places – for example on the source of the outermost main from clause. In such cases, there is no simple way to flatten the subquery. If I want a generic solution, I’ll have to find a way that always achieves the DefaultIfEmpty semantics in SQL. And, luckily, there is one:

(SELECT [q2].*
FROM (SELECT NULL AS [Dummy]) [Dummy]
LEFT OUTER JOIN (SELECT [t1].* FROM [KitchenTable] [t1]) [q2] ON 1 = 1
)

Can you see what I’ve done? I’ve put the sub-query emitted so far into yet another sub-query – and used this as the right side of a left outer join – with a dummy left side. Such are the tricks of those who translate LINQ queries to SQL…

Okay, I’ve now finished translating the sub-query, I can insert the sub-query into the outer query and finish translating the outer query:

SELECT [t0].*, [q3].*
FROM [CookTable] [t0]
CROSS JOIN (SELECT [q2].*
FROM (SELECT NULL AS [Dummy]) [Dummy]
LEFT OUTER JOIN (SELECT [t1].* FROM [KitchenTable] [t1]) [q2] ON 1 = 1
) [q3]

This does not look pretty, no it doesn’t. But you know what? It works. And, most importantly, this way of translating DefaultIfEmpty to SQL always works, no matter where a DefaultIfEmpty stands in the query. The reason for why it works is that I was able to keep the sub-query structure re-linq presented to me, and find an implementation for DefaultIfEmpty that works locally to that sub-query. This a convenient effect of the sub-query + result operators representation re-linq chooses for complex queries.  If you can find a way to keep the sub-queries and honor the result operators locally, you can literally translate every LINQ query to your target query system.

But what about optimization?

There is definitely potential for optimization here: Whenever I’m inserting sub-queries into their outer queries, I can apply a sub-query optimizer. The optimizer checks whether the sub-query contains a TOP, COUNT, GROUP BY, or similar construct, and, if it doesn’t, realizes that the sub-query’s content can be inlined. It first does so at the innermost sub-query, the one within the join:

(SELECT [t1].*
FROM (SELECT NULL AS [Dummy]) [Dummy]
LEFT OUTER JOIN [KitchenTable] [t1] ON 1 = 1
)

Then, it does the same at the next level:

SELECT [t0].*, [t1].*
FROM [CookTable] [t0]
CROSS JOIN (SELECT NULL AS [Dummy]) [Dummy]
LEFT OUTER JOIN [KitchenTable] [t1] ON 1 = 1

If I also implement the dummy select in such a way that it only appears if the LEFT OUTER JOIN would otherwise stand as the first table in the generated SQL, I’ve got a good final SQL query from my translation:

SELECT [t0].*, [t1].*
FROM [CookTable] [t0]
LEFT OUTER JOIN [KitchenTable] [t1] ON 1 = 1

Note that the CROSS JOIN has just become a LEFT OUTER JOIN. Since I’m doing all this while still in the VisitAdditionalFromClause method, this shouldn’t be too hard. There is a bit of work to do to get the SELECT clause to reference the right sub-query (q2, not q3), but that’s totally doable as well.

While working on the re-linq SQL backend (which supports DefaultIfEmpty in exactly the same way outlined above, but at the time of this writing still lacks the sub-query optimizer (which is planned)),  I’ve learned one thing: when dealing with complex LINQ queries, it’s really important to solve transformations as locally as possible. I always try to perform my transformations just in the scope of the current query model; consolidation of the sub-query with the outer query can be done when the visitors return to that outer query.

Any other way lies madness. In my opinion.

Written by Fabian

September 4th, 2010 at 9:40 pm

Posted in re-linq

How re-linq represents the DefaultIfEmpty query operator

without comments

In my last post, I explained what the DefaultIfEmpty query operator actually does, and why anybody should use it. In this post, I’ll analyze how re-linq’s front-end represents that query operator in its query model.

First, let’s look at one of the queries from the last post:

var query1 = from c in Cooks

             join a in Cooks on c equals a.Assisted into assistants

             from a in assistants.DefaultIfEmpty()

             select new { c, a };

Console.WriteLine (new QueryParser ().GetParsedQuery (query1.Expression));

This code causes the LINQ Expression tree built for the query to be analyzed by the re-linq front-end, and the ToString representation to be written to the console. After adding newlines and indentation, it looks like this:

from Cook c in TestQueryable<Cook>()

join Cook a in TestQueryable<Cook>()
  on [c] equals [a].Assisted
  into IEnumerable`1 assistants

from Cook a in {[assistants] => DefaultIfEmpty()}
select new <>f__AnonymousType30`2(c = [c], a = [a])

Looks astoundingly similar to the C# query, doesn’t it? But what’s that “{[assistants] => DefaultIfEmpty()}” part?

Before explaining this notation, I should probably say a few more words about the philosophy behind re-linq’s query model. It’s not a coincidence that the query model’s ToString output looks similar to the C# query above. The main goal behind the re-linq front-end was to simplify the LINQ query and represent it in a form that would be easy to understand for programmers building LINQ providers.

As mentioned above, the front-end constructs this from the Expression tree produced for the query, which looks like this (this is the ToString representation of the Expression tree):

TestQueryable<Cook>()
  .GroupJoin(
    TestQueryable<Cook>(),
    c => c,
    a => a.Assisted,
    (c, assistants) => new <>f__AnonymousType2f`2(
      c = c,
      assistants = assistants))
  .SelectMany(
    <>h__TransparentIdentifierf =>
      <>h__TransparentIdentifierf.assistants.DefaultIfEmpty(),
    (<>h__TransparentIdentifierf, a) =>
      new <>f__AnonymousType30`2(
        c = <>h__TransparentIdentifierf.c,
        a = a))

It then creates a number of clauses – the main from clause, additional from clauses, where clauses, a select clause, etc – and puts them together into a query model object. This is very similar to how a C# programmer would think of the query, and very different from what the Expression tree looks like.

The problem we had, however, is that a simple model can only represent a subset of what’s possible with LINQ. For example, in re-linq’s query model, look-alike to C#, you can only have one select clause in a query. In the expression tree, however, there can be more than one call to the Select method. We didn’t want to introduce the complexity of having multiple select clauses in the query model, but we couldn’t ignore that such things were possible (and even easy) to do with pure LINQ.

When we devised the query model, we therefore used a famous trick: we made the data structure recursive. The query model was kept very simple, but we added the feature of sub-query models. In other words, when the model could not represent a certain query, for example because it had two Select method calls, we put part of the query into a sub-query, which was again simple and well-defined, and nested that sub-query inside an outer query, again simple and well-defined.

In re-linq, sub-queries can stand, among other things, for source sequences, eg., in a from clause. The curly braces in the “{[assistants] => DefaultIfEmpty()}” part of the query model above represent such a sub-query. It’s abbreviated for “(from x in [assistants] select x) => DefaultIfEmpty()”. The square brackets around assistants indicate a reference to items coming from another query source.

The arrow represents a second trick we used: a result operator. Result operators are used by re-linq to embed information about query operators for which we don’t have clauses. For example, Count() is a result operator, as is Distinct(), or DefaultIfEmpty(). Result operators always stand at the end of a query model – they operate on the results of the query. If they can’t stand at the end because of the semantics of the query, a sub-query is built around them; then they can stand at the end again.

That explained, let’s take a look at the full query model again:

from Cook c in TestQueryable<Cook>()

join Cook a in TestQueryable<Cook>()
  on [c] equals [a].Assisted
  into IEnumerable`1 assistants

from Cook a in {[assistants] => DefaultIfEmpty()}
select new <>f__AnonymousType30`2(c = [c], a = [a])

Knowing what we do now, we can see that the second from clause selects its Cook items (named “a”) from a sub-query that gets its items directly from the elements of the assistants query source and has a DefaultIfEmpty result operator attached to its end.

Now, the one million dollars question: How do I handle such a sub-query with a DefaultIfEmpty result operator in my re-linq-based LINQ provider?

That will be answered in my next post.

Written by Fabian

September 4th, 2010 at 8:35 pm

Posted in re-linq

The DefaultIfEmpty LINQ query operator

without comments

The DefaultfEmpty LINQ query operator is a strange beast. According to its documentation, it:

Returns the elements of the specified sequence or the type parameter’s default value in a singleton collection if the sequence is empty.

What? I thought DefaultIfEmpty had to do something with joins?

Let’s dig into this; first, let’s analyze what the documentation says. When applied to a non-empty sequence, such as {1, 2, 3}, DefaultIfEmpty will return an equivalent sequence: {1, 2, 3}. When applied to an empty sequence, {}, it will return a singleton sequence containing the default value of the sequence’s element type: {0} (assuming the original sequence was an empty sequence of integers).

But what would anyone use that for?

When you think about this question, it turns out that DefaultIfEmpty is quite useful when sequences are combined; for instance, with additional from clauses. Consider the following example:

var query = from c in Cooks

            from a in c.Assistants

            select new { c, a };

The second from clause in this query (this is really a SelectMany query operator, when C#’s syntactic sugar is removed) causes a Cartesian product to be formed from the elements of the Cooks sequence and the elements of the sequences returned by the c.Assistants expression, which is evaluated once for each c.

Consider, for example, the following input data:

Cooks = { c1, c2 }
c1.Assistants = { a1, a2 }
c2.Assistants = { a3, a4 }

With this as input, the result of the query will be as follows:

{ c1, a1 }, { c1, a2 },{ c2, a3 }, { c2, a4 }

Consider, now, one of the Assistants sequences to be empty:

Cooks = { c1, c2 }
c1.Assistants = { a1, a2 }
c2.Assistants = { }

In this case, the result will be:

{ c1, a1 }, { c1, a2 }

As you can see, the result contains no entries for c2 – the Cartesian product discards items from the left sequence for which there is no item in the right sequence. Here is, where DefaultIfEmpty comes in handy. We can rewrite the query as follows:

var query = from c in Cooks

            from a in c.Assistants.DefaultIfEmpty()

            select new { c, a };

Now, the DefaultIfEmpty operator will guarantee that the right sequence always contains at least one element, and therefore, each of the elements of the Cooks collection is contained in the result set at least once:

{ c1, a1 }, { c1, a2 }, { c2, null }

If you know your SQL, this will indeed remind you strongly of the concept of left outer joins, and in fact, the MSDN documentation for DefaultIfEmpty does hint at this in its Remarks section:

This method can be used to produce a left outer join when it is combined with the GroupJoin) [sic] method.

It’s true, DefaultIfEmpty can be combined with a GroupJoin, to form something very similar to a SQL left outer join:

var query = from c in Cooks

            join a in Cooks on c equals a.Assisted into assistants

            from a in assistants.DefaultIfEmpty()

            select new { c, a };

But as shown above, DefaultIfEmpty can also be very useful when applied to sequences other than a GroupJoin’s result.

And, of course, it can also be used in contexts outside of joins or cartesian products:

var query = Cooks.DefaultIfEmpty ();

 

// This should really be done using the Any() operator

var query = from c in Cooks

            where c.Assistants.DefaultIfEmpty ().First() != null

            select c;

These situations don’t really match the concept of a left outer join (at first sight), but they’re still perfectly legal usages of DefaultIfEmpty. All things considered, the MSDN authors were probably right about their initial assessment. Generally speaking, that query operator simply:

Returns the elements of the specified sequence or the type parameter’s default value in a singleton collection if the sequence is empty.

Exactly.

Written by Fabian

September 4th, 2010 at 7:43 pm

Posted in re-linq

Making it hard to implement language-compatible LINQ providers

without comments

There are two things compiler writers can do in order to give us LINQ provider writers significant headaches:

  • Firstly, they can make certain kinds of expressions much more common than other languages do. For example, they can use overloads of query operators or produce expression combinations that are otherwise rarely used.
  • Secondly, they can embed completely custom operators in their expression trees, masked as MethodCallExpressions to some compiler service methods.

The first issue is annoying because missing features that aren’t a problem for other languages suddenly get really important. Implementing a LINQ provider is a LOT of work, so there’ll always have to be some feature prioritizing. One of my favorite quotes on this is by Frans Bouma, author of LLBLGen Pro’s LINQ provider:

Writing a Linq provider is a lot of work which requires a lot of code. If you’re dealing with a Linq provider which is just, say, 32KB in size, you can be sure it will not support the majority of situations you will run into. However, the O/R mapper developer likely simply said ‘We have Linq support’, and it’s even likely the provider can handle the more basic examples of a single entity type fetch with a Where, an Order By or even a Group By. But in real life, once you as a developer have tasted the joy of writing compact, powerful queries using Linq, you will write queries with much more complexity than these Linq 101 examples. Will the Linq ‘provider’ you chose be able to handle these as well? In other words: is it a full Linq provider or, as some would say, a ‘toy’ ?

Usually, there is a time limit when implementing a LINQ provider, so you have to decide what features to implement, what query methods to support – and what query methods not to support. You’d usually implement the most important ones first, and if you decide one query method is not that important because you don’t know another language requires it, that might be a problem.

However, it could be argued that these features also provide benefits for users of the other languages. For example, while the C# compiler won’t specify a result selector in the GroupBy calls it generates, C# users can of course manually call the respective method overloads. So, implementing support for that overload of GroupBy is not a language-specific feature, it’s just the priority that differs between the languages.

The second issue, on the other hand, the one of embedding custom operators in the expression tree, is the real problem of LINQ provider compatibility. You need to support these operators only for the sake of one programming language. You won’t even know these operators exist unless you perform extensive testing in that language. The programming language might just as well construct expression trees with self-defined nodes instead of using the standard ones – the effect would be similar: there’s no way the LINQ provider could handle them unless it was built with explicit support for that language.

That’s a problem. Especially since VisualBasic.NET, of course, does both of these things.

VB.NET does make use of expressions uncommon in, say, a C# program; for example, it uses the GroupBy overloads taking a result selector, it generates additional, unnecessary Select (x => x) calls, and so on. These things you thought no programmer would commonly use in a LINQ query – the VB.NET compiler does use them, all too often.

And VB.NET does define its own operators. Expression trees coming from VB.NET may contain calls to the Microsoft.VisualBasic.CompilerServices.LikeOperator’s methods because VB does have a LIKE operator – which can’t be represented using standard expression types. And even string comparisons, which are usually represented as BinaryExpressions, cause special operators to be used.

Of course, there are good arguments as of why VB needs those operators: There is no standard expression type for LIKE, and VB has some specific string comparison semantics. Still, I wonder if the LINQ teams at Microsoft shouldn’t have thought about these problems a little harder, a little longer before releasing VB.NET’s LINQ support as is.

At the very least, some documentation of these VB-specifics would have saved a few headaches. Definitely.

Written by Fabian

July 16th, 2010 at 12:51 pm

Posted in re-linq

Support for VB-specific string comparisons in re-linq

with one comment

In a previous post of mine, I’ve discussed the VB.NET compiler’s characteristics of emitting very special expressions when a string comparison is performed in a LINQ query. Instead of emitting a simple BinaryExpression, as C# would, VB.NET emits a MethodCallExpression to the Microsoft.VisualBasic.CompilerServices.Operators.CompareString method. The reason for this is that VB.NET has some special features regarding string comparison – and the standard BinaryExpressions cannot represent those features.

Many LINQ providers are written in C# and therefore never learn of that specialty, until one day a VB.NET user submits a bug report about string comparisons leading to an exception in the LINQ provider. Therefore I thought it sensible to provide support for this VB.NET quirk directly in re-linq’s front-end.

It has taken some time – there was a lot of other work to do –, but starting with re-linq 1.13.65, there is now built-in support for VB.NET string comparisons.

(Unfortunately, the release notes of that version describe the feature in a different form than was implemented. The release notes for version 1.13.68 will contain an updated version of the feature description that is consistent with my explanations below.)

There were a few different options about how to actually handle this issue because of two conflicting goals:

  • make it as simple for the standard LINQ provider – one that simply needs to know that two strings were compared, without any VB-specific quirks – as possible, and
  • provide all necessary information to those LINQ providers that want to simulate VB’s string comparison semantics.

The first goal’s trivial implementation would be to have re-linq detect the MethodCallExpressions and replace them with standard BinaryExpressions [1]. That way, the LINQ provider would never even notice that VB.NET produces different expression trees; VB.NET’s comparisons would be handled just like C#’s ones.

On the other hand, this couldn’t possibly fulfill the second goal. Once the MethodCallExpression has been replaced, vital information (the textCompare parameter to the CompareString method, for example) is lost. To fulfill the second goal, the most sensible thing would therefore be to leave the MethodCallExpression in the expression tree – or to replace it with a specific custom expression, let’s call it VBStringComparisonExpression, which would hold all the required metadata about the comparison. Then, the LINQ provider could recognize this expression and deal with the expression the VB way. On the downside, this would require every LINQ provider to deal with those VBStringComparisonExpressions. And back to square one.

So, the question was, how to reach both goals at once?

One possibility would have been to make it configurable. I didn’t like this option for a number of reasons. First, as a library user, I like APIs that just work, without me having to configure anything, reading through documentation, and so on. Second, re-linq is not currently designed to be configurable at this part in the front-end, and while it would have been possible to add this feature, it seemed strange to add a configuration infrastructure just for this one option of turning automatic VB support on and off. Third, I knew there had to be a better way.

And find one I did: The .NET BCL 4.0 made some changes to the LINQ Expression classes, which were required because the Dynamic Language Runtime uses the same classes to represent their abstract syntax trees. There were a few features I liked so much that I re-implemented them in re-linq (which also has to run on .NET 3.5), in the ExtensionExpression base class. That class contains some boilerplate code for the Visitor pattern, and it also has a feature for reducing expressions. If an extension expression returns true from its CanReduce property, this means that it’s Reduce method can be used to transform the expression into a simplified version with the same (more or less) semantics. And this is the feature I used to implement the VBStringComparisonExpression with.

Here are the facts:

  • When parsing an expression tree into a QueryModel, re-linq’s frontend will now automatically detect calls to VB.NET’s CompareString operator.
  • It will simplify that comparison by using a BinaryExpression [1]. The BinaryExpression will be wrapped into a VBStringComparisonExpression, which also contains the textCompare flag defining how VB.NET would compare the string.
  • VBStringComparisonExpression is implemented so that its VisitChildren method will automatically visit the simplified BinaryExpression. This means that ordinary expression visitors (derived from the ExpressionTreeVisitor base class) can simply choose to ignore the VBStringComparisonExpressions. The simplified BinaryExpressions will automatically be visited.
  • VBStringComparisonExpression.Reduce is implemented so that it returns the simplified BinaryExpression. Visitors that are defined to throw an exception on any expression whose type they don’t know should try to reduce such expressions before giving up. That way, they can deal with complex, unknown expressions as long as they reduce to something simpler. ThrowingExpressionTreeVisitor, the base class that implements this behavior, does this automatically. This means that throwing expression visitors (derived from the ThrowingExpressionTreeVisitor base class) can also ignore the VBStringComparisonExpressions. The BinaryExpression will automatically be used instead.
  • If a visitor wants to explicitly deal with VBStringComparisonExpressions, it can implement the IVBSpecificExpressionVisitor. (Of course, such a visitor has to visit the expression before any ThrowingExpressionTreeVisitor reduces the nodes. Ordinary ExpressionTreeVisitors won’t reduce, so they are safe to be used at any time.)

And here is a picture:

VBStringComparisonExpression

The concept works quite well, I’ve tested it with re-linq’s own SQL backend – which does not handle VB comparisons in a special way. So far, the backend works nicely with VB expressions without explicitly dealing with them in any way.

It seems as if it was indeed possible to have the cake, and eat it too.


[1] This is actually a simplification. VB.NET’s CompareString is also produced for the <, <=, >, and >= operators. BinaryExpression can’t represent those operators for strings, so the “standard” expressions for them involve a MethodCallExpression to String.CompareTo.

Written by Fabian

July 15th, 2010 at 6:03 pm

Posted in re-linq