Fabian's Mix

Mixins, .NET, and more

Archive for the ‘re-linq’ Category

re-linq: Subquery boundaries after GroupBy, Union, and similar operators?

without comments

Working on https://www.re-motion.org/jira/browse/RM-4093 (UnionResultOperator should act as a query source) made me think about the nature of result operators (i.e., query operators operating on the result of a from/where/join/orderby/select block) that mangle the incoming data so much that it can’t be associated with the original data any more.

For example, consider three queries:

  (from o in Orders
   select o)
      .Take (4)
      .Any (o2 => o2.OrderDate > DateTime.Now)

  • The result of the from/select block is a sequence of Order objects that are directly traceable to the MainFromClause: [o] (re-linq’s notation for: reference to the query source called "o").
  • The result of the Take operator is still a sequence of the same Order objects: [o].
  • Therefore, the "o2" in the Any operator still refers to the items produced by the MainFromClause: [o].OrderDate > DateTime.Now.

  (from o in Orders
   select o)
      .GroupBy (o => o.OrderDate)
      .Any (g => g.Key > DateTime.Now)

  • As in the first query, the result of the from/select block is a sequence of items coming from the MainFromClause: [o].
  • The result of the GroupBy operator is a sequence of IGrouping<DateTime, Order> objects which are no longer traceable to the MainFromClause, but only to the GroupResultOperator (let’s call it “g”): [g].
  • Therefore, the "g" in the Any operator refers to items produced by the GroupResultOperator: [g].Key > DateTime.Now.

  (from o in Orders
   select o)
      .Union (OtherOrders)
      .Any (o2 => o2.OrderDate > DateTime.Now)

  • As in the first query, the result of the from/select block is a sequence of items coming from the MainFromClause: [o].
  • The result of the Union operator is a sequence of Order objects which are, however, no longer traceable to the MainFromClause, since they could potentially come from the second query source (OtherOrders): [u].
  • Therefore, the "o2" in the Any operator refers to items produced by the UnionResultOperator: [u].OrderDate > DateTime.Now.

Okay, so far so good. However, the question is how useful it is to have a result operator refer to another result operator coming before it. Back-ends that produce SQL will always have to add a sub-query boundary after such a result operator, because the produced SQL should look something like this:

  SELECT CASE
    WHEN EXISTS(
      SELECT *
      FROM (SELECT * FROM [OrderTable] AS [t0]
            UNION
            SELECT * FROM [OtherOrderTable]) AS [q0]
      WHERE ([q0].[OrderDate] > @1)
      )
    THEN 1
    ELSE 0
    END AS [value]

As you can see, the part before the Any (which is equivalent to the EXISTS clause) needs to be moved into a sub-query – there is a subquery boundary after the UnionResultOperator. Therefore, we could change re-linq’s front-end to (always) automatically insert such boundaries before operators such as Union and GroupBy:

  (from u in
       (from o in Orders
        select o)
       .Union (OtherOrders)
   select u)
      .Any (u => u.OrderDate > DateTime.Now)

The question I’m asking myself is whether there is a good reason for not making this change. Is there any advantage to keeping GroupBy and Union in the same query as the following operators?

I’d appreciate any input on this on our mailing list: https://groups.google.com/d/topic/re-motion-users/MgZcKlHAn1g/discussion.

Written by Fabian

October 12th, 2012 at 9:57 am

Posted in re-linq

re-linq: Now on NuGet (with symbols)

without comments

In July 2011, Henrik Feldt created a NuGet package for re-linq 1.13.111. Only one year later, we took notice of that, when Chris Eldredge and Gordon Watts gently asked us to update the outdated package.

Starting with re-linq 1.13.161, we’ve therefore decided to regularly publish NuGet packages in the course of our weekly build. The release policy is currently the same as for our CodePlex releases: whenever we have changes such as bugfixes or new features, we’ll release a new build of re-linq.

Note that the versioning of re-linq is still bound to that of re-motion as, for now, the two projects are built together. This means that, at the moment, re-linq does not follow semantic versioning – new releases may contain fixes, new features, or breaking changes, no matter how the version number changes. Also, strictly speaking, all 1.13 builds are internal builds without any support guarantees. On the other hand, these are also the builds we’re using at rubicon, and we have confidence in their quality, so there isn’t much reason not to use them.

Also note that the Remotion.Linq NuGet package only contains the re-linq front-end. If there is demand for a SQL Backend package, please tell us so.

The packages we release to NuGet are symbol packages, which means that you can debug them and step into the code if you want to. You need to configure Visual Studio to disable Just My Code, enable source server support, and add SymbolSource.org as a symbol file location. And then…

Debugging re-linq result

There is one caveat with debugging: The NuGet package contains the Release build of re-linq, so debugging might sometimes be bumpy, since the optimizations made by the C# compiler are not reflected in the source code. But even so, when I tried it, it worked quite nicely.

Written by Fabian

August 28th, 2012 at 12:08 pm

Posted in re-linq

re-linq: How to recognize if a method is a query operator?

without comments

On the users’ mailing list, Alex Norcliffe, Lead Architect of Umbraco 5, describes a problem they are currently facing with re-linq. As an illustration, consider the following two queries, especially the sub-queries within the where clauses:

var query1 = from c in QuerySource
             where c.Assistants.Any ()
             select c;

var query2 = from c in QuerySource
             where c.GetAssistants().Any()
             select c;

When re-linq analyzes those sub-queries within the where clauses, the first query will produce a SubQueryExpression with a QueryModel whose MainFromClause has the following expression: [c].Assistants. In other words, the items produced by the sub-query are those identified by the Assistants property.

The second query, however, will produce an exception:

Remotion.Linq.Parsing.ParserException : Cannot parse expression 'c' as it has an unsupported type. Only query sources (that is, expressions that implement IEnumerable) and query operators can be parsed.

----> Remotion.Linq.Utilities.ArgumentTypeException : Expected a type implementing IEnumerable<T>, but found 'Remotion.Linq.UnitTests.Linq.Core.TestDomain.Cook'.

Why’s that?

re-linq assumes that all methods occurring in a query operator call chain should be treated like query operators (Where, Select, etc.). This means that for the sub-query within query2, re-linq regards c as the start of the query operator chain. And, since c’s type does not implement IEnumerable<T>, it throws the exception shown above. Even if c’s type implemented IEnumerable<T>, an exception would be thrown stating that GetAssistants() “is currently not supported”, unless one registers a custom node parser for that method.

Of course, what Alex actually wanted was re-linq treating both query1 and query2 in an equivalent way. I.e., a SubQueryExpression with a QueryModel whose MainFromClause has the following expression: [c].GetAssistants().

There is an easy workaround for now (see the mailing list), but I’m wondering how we could change re-linq to produce this result out of the box. I can think of two possibilities, both of which have certain drawbacks:

1 – Have re-linq treat MethodCallExpressions the same way as MemberExpressions. I.e., if the method has a registered node parser, treat it as a query operator. Otherwise, treat it (and all expression parts of the call chain before it) as the start of the query.

This would work quite well in the scenario shown above, and it would be nicely symmetric to how MemberExpressions work in re-linq.

However, it would become a very breaking change regarding diagnostics. Consider this example, in which CustomOperator is actually a custom query operator:

source.Where(…).CustomOperator().Select(…)

Currently, re-linq will throw an exception that it can’t parse CustomOperator() if one forgets to register the respective parser, and the LINQ provider backend won’t even get a QueryModel to process.

If we change this behavior, the frontend will no longer throw an exception, and the backend will suddenly get a MainFromClause with a FromExpression of "[source].Where(…).CustomOperator()". I think it would be difficult to understand for LINQ provider implementers why exactly this occurs. I can even imagine people believing this must be "right" (as no exception occurred) and start manually parsing the Where(…) and CustomOperator() calls, effectively reimplementing logic from re-linq…

2 – Have re-linq only treat MethodCallExpressions called on enumerables as query operators. Otherwise, treat them (and all expression parts of the call chain before the method call) as the start of the query.

This would also work in the given scenario, and it has the advantage of still providing good diagnostics when methods taking IEnumerable<T> have no associated expression node parser. However, it’s still a heuristic way of parsing, and it is asymmetric (both with MemberExpressions and in itself). Consider the following three examples:

instanceImplementingEnumerable.StartQueryHere().Where(…)
instanceNotImplementingEnumerable.StartQueryHere().Where(…)
instanceImplementingEnumerable.StartQueryHere.Where(…)

re-linq would parse the StartQueryHere method in the first example as a query operator (and throw an exception if there isn’t an expression node parser registered for it). The StartQueryHere method and property in the second and third examples, on the other hand, would parse just fine. I believe this would be just as difficult to understand.

What do other people think of these two options? If you want to see this scenario to be supported out of the box, please give me some feedback about it on the developer mailing list: http://groups.google.com/group/re-motion-dev/t/f9f6198bbbecd796.

Written by Fabian

December 20th, 2011 at 9:29 am

Posted in re-linq

re-linq: Customizability explained

without comments

In a previous post, I mentioned a set of features for better customizability of how the re-linq front-end parses expression trees. This time, I want to explain why you’d use those features, and how to decide which one to use.

To start, here’s an updated version of the diagram showing re-linq’s pipeline:

re-linq pipeline 2

(Read my previous post for an explanation of this pipeline.)

There are four important points of extensibility in that pipeline:

  • The query parser,
  • the expression tree processors,
  • the expression transformers, and
  • the expression nodes.

Let’s take a look at each of them.

Replacing the Query Parser

The query parser’s responsibility is to build a QueryModel from a LINQ expression tree. This responsibility is defined by the IQueryParser interface, and implemented by the QueryParser class. The latter performs quite a good job, so why would you want to replace it?

In reality, you wouldn’t usually want to replace re-linq’s query parser, but maybe you need to extend it. By implementing IQueryParser and decorating the existing QueryParser class, you can:

  • Do anything you want with the expression tree before re-linq gets to see it;
  • adapt the QueryModel after re-linq is done constructing it;
  • react to the fact that re-linq is asked to construct a QueryModel; e.g., to implement logging, or to perform some checks; or
  • avoid the re-linq pipeline at all; e.g, if you want to perform caching at the QueryModel level.

The point about caching deserves some explanation. Let’s say you have a LINQ provider that translates queries to SQL. If your users issue the same queries again and again and your LINQ provider proves to be a bottleneck, you might need a way to cache the translated SQL queries based on the incoming expression trees. re-linq currently does not have any support for caching query translation results, and while it’s possible to build a caching subclass of QueryProviderBase that intercepts Execute, this doesn’t really fit re-linq’s architecture with the IQueryExecutor interface.

It is, however, easily possible to implement a two-part cache using the existing architecture. First, implement a caching IQueryParser implementation that calculates a cache key from the incoming expression tree, and caches QueryModel references based on that key. Then, implement a caching IQueryExecutor that keeps track of the generated SQL by QueryModel reference.
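To make the first half of that concrete, here is a rough sketch of such a caching parser (this is not part of re-linq; it assumes re-linq's IQueryParser interface with its GetParsedQuery method, and the cache-key calculation is deliberately naive):

```csharp
using System.Collections.Concurrent;
using System.Linq.Expressions;
using Remotion.Linq;
using Remotion.Linq.Parsing.Structure;

// Decorates re-linq's IQueryParser with a QueryModel cache. The cache key
// used here (Expression.ToString()) is only a placeholder: it does not
// account for constant values captured in closures, so a real
// implementation needs a proper expression tree comparer/hasher.
public class CachingQueryParser : IQueryParser
{
  private readonly IQueryParser _innerParser;
  private readonly ConcurrentDictionary<string, QueryModel> _cache =
      new ConcurrentDictionary<string, QueryModel> ();

  public CachingQueryParser (IQueryParser innerParser)
  {
    _innerParser = innerParser;
  }

  public QueryModel GetParsedQuery (Expression expressionTreeRoot)
  {
    var key = expressionTreeRoot.ToString ();
    return _cache.GetOrAdd (key, _ => _innerParser.GetParsedQuery (expressionTreeRoot));
  }
}
```

The second half – caching generated SQL per QueryModel reference – would live in the IQueryExecutor implementation and is not shown here.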

That said, replacing the query parser is probably not the most interesting extensibility point in re-linq, so let’s go on to the next one.

Adding (or replacing) Expression Tree Processors

Expression tree processors are implementations of the IExpressionTreeProcessor interface that re-linq applies to the expression tree before that tree is actually analyzed. You can add your own processors to replace and transform the expression tree if your LINQ provider needs to do this. re-linq already defines two processors that are included in the pipeline by default: the partial evaluator and the transforming processor.

The partial evaluator is responsible for analyzing the parsed expression tree for any sub-expressions that can be evaluated in memory and replacing those sub-expressions with the result of the evaluation.

Here’s an example: Consider you write a query such as the following:

from o in Orders
where o.Date == DateTime.Today
select o

In this query, the partial evaluator will detect that the sub-expression “DateTime.Today” can be evaluated locally and will replace it with a constant expression that holds the respective date. (Note: If you implement query caching as explained above, keep this in mind!)
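This is not re-linq's actual implementation, but the core idea of partial evaluation can be sketched with plain System.Linq.Expressions: a sub-expression that does not reference the lambda parameter can be compiled, executed, and substituted as a constant:

```csharp
using System;
using System.Linq.Expressions;

public static class PartialEvalDemo
{
  public static void Main ()
  {
    // "where o.Date == DateTime.Today": the right-hand side does not
    // reference the lambda parameter, so it can be evaluated locally.
    Expression<Func<DateTime, bool>> predicate = d => d == DateTime.Today;
    var body = (BinaryExpression) predicate.Body;

    // Evaluate the parameter-independent sub-expression by compiling it
    // into a parameterless delegate and invoking it.
    var value = Expression.Lambda<Func<DateTime>> (body.Right).Compile () ();

    // A partial evaluator would now replace that sub-expression with a
    // constant holding the evaluated value.
    var rewritten = Expression.Equal (
        body.Left, Expression.Constant (value, typeof (DateTime)));
    Console.WriteLine (rewritten);
  }
}
```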

By default, the partial evaluator is the first processor to be executed. You can replace it if you need a custom evaluator.

The second default processor, the transforming processor, is responsible for executing the expression transformers explained below.

To add or replace expression tree processors, create a customized instance of the QueryParser and ExpressionTreeParser classes. See this description for the details: https://www.re-motion.org/jira/browse/RM-3721.

You’d typically add your own processor if you have an expression visitor that needs to be applied to the expression tree prior to query analysis. Note, however, that the number of expression visitors involved in query parsing can negatively affect the performance of your LINQ provider. For simple processing, expression transformers may be the better alternative, so let’s look at those next.

Adding (or replacing) Expression Transformers

Expression transformers are a light-weight, efficient way of transforming sub-expressions in an expression tree. They work similarly to the Visit… methods of the ExpressionTreeVisitor class, but unlike expression visitors, transformers are only meant for local transformations (i.e., transformations of expression patterns that can be detected by looking at a single expression and its (more or less) immediate children). Transformers are written for a dedicated expression type (e.g., MethodCallExpression), and they should not build up any state that spans multiple expression nodes in a tree.

Here’s an example from re-linq’s source code:

/// <summary>
/// Replaces calls to <see cref="Nullable{T}.Value"/> and
/// <see cref="Nullable{T}.HasValue"/> with casts and null checks. This
/// allows LINQ providers to treat nullables like reference types.
/// </summary>
public class NullableValueTransformer
    : IExpressionTransformer<MemberExpression>
{
  public ExpressionType[] SupportedExpressionTypes
  {
    get { return new[] { ExpressionType.MemberAccess }; }
  }

  public Expression Transform (MemberExpression expression)
  {
    ArgumentUtility.CheckNotNull ("expression", expression);

    if (expression.Member.Name == "Value"
        && IsDeclaredByNullableType (expression.Member))
      return Expression.Convert (expression.Expression, expression.Type);
    else if (expression.Member.Name == "HasValue"
        && IsDeclaredByNullableType (expression.Member))
      return Expression.NotEqual (
          expression.Expression,
          Expression.Constant (null, expression.Member.DeclaringType));
    else
      return expression;
  }

  private bool IsDeclaredByNullableType (MemberInfo memberInfo)
  {
    return memberInfo.DeclaringType.IsGenericType
        && memberInfo.DeclaringType.GetGenericTypeDefinition()
            == typeof (Nullable<>);
  }
}

The NullableValueTransformer implements the IExpressionTransformer<T> interface for MemberExpression because it will handle that expression type, similar to an expression visitor implementing VisitMemberExpression. The SupportedExpressionTypes property is used to determine what expressions exactly should be handled by this transformer. (If the type parameter and the expression types don’t match, an exception is thrown at run-time.)

When the expression tree is parsed, the transforming processor (see above) will visit each sub-expression of the expression tree and pick the corresponding transformers based on the node type values. When it picks the NullableValueTransformer, it calls the Transform method, and the transformer may then decide whether to return the (untransformed) input expression, or to return a different expression. In the example, the transformer replaces calls to the Nullable<T>.Value and HasValue properties with cast expressions and null checks.

The transformers are called “inside out”, i.e., child nodes are transformed before their parent and ancestor nodes. When more than one transformer qualifies for the same expression, the transformers are called in a chain in the order of registration. When a transformer changes the expression, the chain is aborted and transformers are again chosen for the new expression (which may have a different type than the original one).

re-linq comes with a number of predefined transformers, including the nullable value transformer, a few VB syntax transformers, a transformer that detects invocations of LambdaExpressions, and a set of transformers that add metadata to constructor invocations for tuple types.

I’d think that most pre-processing requirements that a LINQ provider may have can be efficiently implemented as transformers. To add custom transformations, you again need to create a customized instance of the QueryParser and ExpressionTreeParser classes, see https://www.re-motion.org/jira/browse/RM-3721.

Custom Expression Nodes

Last, but not least, there are the expression node parsers used by re-linq for translating query operators into query model clauses and result operators. I’ve written about these before, see https://www.re-motion.org/blogs/mix/archive/2010/10/28/re-linq-extensibility-custom-query-operators.aspx. Add your own node parsers to support non-default query operators. Again, https://www.re-motion.org/jira/browse/RM-3721 shows how they are integrated into the pipeline.

Well, that’s it. I have to say I’m quite happy with these customization features. If you have any comments, questions, or suggestions regarding them, feel free to post at our mailing list: http://groups.google.com/group/re-motion-users.

Written by Fabian

April 29th, 2011 at 3:06 pm

Posted in re-linq

re-linq: Weekly builds now on CodePlex

without comments

Starting with build 1.13.92, we’ve changed our build system to automatically upload the re-linq weekly builds to CodePlex if any re-linq features were implemented in that week.

Previously, we published our weekly builds on our own server (https://re-motion.org/builds/) and only created CodePlex releases for “major” changes. For example, it was planned to publish the next CodePlex release only when re-linq’s SQL backend was finished.

That had the disadvantage, however, of giving the impression that re-linq wasn’t being worked on. For example, version 1.13.41 had been the “recommended release” for a very long time. It was released on January 13, 2010, and some people assumed that re-linq hadn’t been changing at all in the meantime. This was not correct – a lot of changes were made to re-linq between 1.13.41 and 1.13.92.

This concept was described on re-linq’s CodePlex homepage, but apparently, this wasn’t enough. So, now, every weekly build will be published on CodePlex. re-linq is developed in a test-driven way, and we usually consider our weekly builds to be “stable” (i.e., ready for production). Here’s what the homepage has to say about stability:

Note that due to the goodness of TDD, weekly builds are generally considered stable and we do often use those in production. However, if you need a bug fix you will have to upgrade to a newer version. Hotfixes are only produced for release versions (even/odd scheme: release versions have even minor version numbers, such as the upcoming 1.14.0, and hotfixes will be numbered 1.14.1, 1.14.2 etc.).

So, be aware that there can be “breaks” between weekly builds of odd minor versions (like 1.13), and always check out the release notes for the versions between your current one and the new one if you want to upgrade. The easiest way to do this is via our JIRA.

Apart from that, just keep monitoring the releases on CodePlex (RSS link) in order to get notified of new versions.

Written by Fabian

January 31st, 2011 at 10:55 am

Posted in re-linq

re-linq: A lot of new customizability

without comments

This week, we’ve implemented a set of features for easier customizability of the way re-linq’s front-end handles expression trees. With those features, you – as a re-linq-based LINQ provider author – now can:

  • extend, customize, or even replace the whole process of how to get from a LINQ expression tree to a re-linq QueryModel;
  • add custom expression tree transformation steps (or replace the existing ones); and
  • add light-weight expression transformation logic.

I’ll explain those in detail. Consider the following picture, which shows the important classes involved in the execution of a LINQ query with a re-linq based LINQ provider:

re-linq pipeline

As you can see, a user starts the process by either calling GetEnumerator on an instance of a class derived from QueryableBase<T> or by using one of the methods for immediate execution (such as Count, Single, First, etc.) defined by the .NET framework’s Queryable class (passing in an instance derived from QueryableBase<T>). Both cause QueryProviderBase.Execute to be called. Up to here, everything is according to the standard LINQ provider implementation patterns.

Now comes the part specific to re-linq. The QueryProviderBase class will now ask a QueryParser to parse the query into a QueryModel, and then pass that QueryModel to an implementation of IQueryExecutor. The executor must be defined by the actual LINQ provider – this is where the actual data is queried from the underlying data source.

In order to produce the QueryModel, the QueryParser asks an ExpressionTreeParser to parse the given query expression tree into a chain of expression nodes. For each top-level query operator, one (or more) node instances are created, and all of those nodes are then asked to apply themselves onto (and thus build) the QueryModel to be returned.

The ExpressionTreeParser performs several steps before it creates the expression nodes: first, it partially evaluates the expression tree – i.e., it simplifies those parts of the tree that can be executed in memory –, then, it applies some transformation steps, and finally, it creates the nodes for the top-level query operators using a user-extensible registry. (For simplicity, I’ve left out the analysis of method call arguments – which involves the detection of sub-queries – from the picture.)

Now that you understand the re-linq (front-end) pipeline, let me try again to explain what we’ve added. You can now:

  • extend, customize, or even replace the whole process of how to get from a LINQ expression tree to a re-linq QueryModel by providing a custom query parser;
  • add custom expression tree transformation steps (or replace the existing ones) by adding custom expression tree processing steps; and
  • add light-weight expression transformation logic by adding custom expression transformers.

I’ll probably describe the details of each of those possibilities in separate blog posts. For now, it will suffice to say that the new customizability features will be released with re-linq build 1.13.92 on CodePlex tomorrow (2011-01-28). By tomorrow evening (CET), the release notes will also be completed (available via JIRA).

Written by Fabian

January 27th, 2011 at 6:44 pm

Posted in re-linq

re-linq: Support for SelectMany without result selector

without comments

While this has been missing from the start, different people reported it at about the same time a few weeks ago: support for the SelectMany overload without a result selector.

What does that mean?

Consider the following C# LINQ query:

from c in Cooks
from k in Kitchens
select c.Name + " at " + k.Name

This is equivalent to this query, which makes the query operators being used explicit:

Cooks.SelectMany (c => Kitchens, (c, k) => c.Name + " at " + k.Name)

SelectMany is a query operator that combines each item from an input sequence with each item from a second input sequence. The latter is calculated by means of a collection selector – the first argument to the SelectMany method in the example above. The collection selector returns the Kitchens sequence for each call; but it could potentially return a separate sequence for each c.

So, the purpose of SelectMany is to combine items from the sequences. But how does it combine them? This is where the second argument comes into play: the result selector. In the example, it builds a combined string from the cook’s and kitchen’s names.
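The combination can be seen with a minimal in-memory illustration (hypothetical sample data standing in for Cooks and Kitchens):

```csharp
using System;
using System.Linq;

public static class SelectManyDemo
{
  public static void Main ()
  {
    var cooks = new[] { "Alice", "Bob" };      // hypothetical sample data
    var kitchens = new[] { "North", "South" };

    // The collection selector returns the second sequence for each cook;
    // the result selector combines each cook with each kitchen.
    var pairs = cooks.SelectMany (c => kitchens, (c, k) => c + " at " + k);

    // Prints: Alice at North, Alice at South, Bob at North, Bob at South
    Console.WriteLine (string.Join (", ", pairs));
  }
}
```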

Okay, now this is LINQ 101, where is the point?

There are cases where you don’t really want to combine the items of two sequences; you just want to “flatten” a nested sequence. Here’s an example:

var allAssistants = Cooks.SelectMany (c => c.Assistants);

This code simply takes all assistants of all cooks and returns them as a flattened sequence.

While this SelectMany overload is convenient, it will never be used when you write a LINQ query using the C# syntactic sugar form (from … select …). That’s why re-linq hasn’t included any default support for it – until now. Starting with build 1.13.88 (released today), re-linq can parse this query out of the box.

About the technical details: when encountering a query using the simpler overload, such as the one above, re-linq just pretends there is a trivial result selector:

var allAssistants = Cooks.SelectMany (c => c.Assistants, (c, a) => a);

That way, the parsing code didn’t need to be changed at all. The QueryModel also doesn’t care about which overload you used – it will always represent SelectMany as an AdditionalFromClause. This means that LINQ providers using re-linq don’t have to change anything to make this work.

Which is certainly a good thing.

Written by Fabian

December 23rd, 2010 at 3:27 pm

Posted in re-linq

re-linq: Extensibility: Custom query operators

with 2 comments

This is a blog post about re-linq that I’ve been planning to write for a very long time. It is the documentation of how to extend the re-linq front-end so that it can detect custom query methods, and it’s going to be quite long for a blog post. Read it if you’re a user of re-linq and you are trying to get your LINQ provider to understand your own custom extension methods.

As I’ve mentioned before, re-linq differentiates between clauses, which form the main part of a query, and result operators, which describe operations conducted on the result of the query. Query methods such as Select and Where are parsed into clause objects, whereas methods such as Distinct, Take, or Count are parsed into result operator objects. Both clauses and result operators are represented in the QueryModel that represents the query analyzed and simplified by re-linq’s front-end.

The distinction between result operators and clauses is important because it defines what goes into a sub-query, and what is simply appended to the current query. For example, consider the following query:

Query<Order>()
  .Where (o => o.DeliveryDate <= DateTime.UtcNow)
  .OrderByDescending (o => o.DeliveryDate)
  .Take (5)
  .OrderBy (o => o.OrderNumber)

In this case, the Where and OrderByDescending method calls are parsed into clauses of a single QueryModel. The Take method call is appended to that QueryModel as a result operator. The OrderBy call following the Take call, however, is parsed into a different, outer QueryModel, which embeds the former QueryModel as a sub-query. Like this:

from o’ in (
  from o in Query<Order>()
  where [o].DeliveryDate <= DateTime.UtcNow
  orderby [o].DeliveryDate desc
  select [o]).Take (5)
orderby [o’].OrderNumber
select [o’]

(It is necessary to form a sub-query in this case because the OrderBy operation must take place after the Take operation. This makes a huge semantic difference!)
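The semantic difference is easy to see with LINQ-to-Objects (hypothetical numbers standing in for the order data):

```csharp
using System;
using System.Linq;

public static class OperatorOrderDemo
{
  public static void Main ()
  {
    var orderNumbers = new[] { 5, 3, 9, 1, 7 }; // hypothetical data

    // Take (2) applied after the descending sort picks the two largest
    // values; the outer OrderBy then sorts just those two ascending.
    var takeThenOrder = orderNumbers
        .OrderByDescending (n => n).Take (2).OrderBy (n => n);
    Console.WriteLine (string.Join (", ", takeThenOrder)); // 7, 9

    // Sorting ascending first and then taking two yields a completely
    // different result set.
    var orderThenTake = orderNumbers.OrderBy (n => n).Take (2);
    Console.WriteLine (string.Join (", ", orderThenTake)); // 1, 3
  }
}
```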

I’ve talked about the sub-query system in the past, and I believe it to have many positive properties with regard to scoping of identifiers and similar concerns. Also note how the final parsed query looks very similar to how the query would be written in C#’s query syntax. However, the point I was trying to make is that a query method parsed as a clause might cause the previous stuff to be wrapped into a sub-query, whereas a query method parsed as a result operator is always simply attached to the current QueryModel.

Therefore, if your goal is to extend a query by simply attaching some information to it, you will want to go the result operator route. Which is as follows:

  1. Create an extension method for queries that represents the information you want to attach to the query. This is the user’s entry point in attaching the information, and it has nothing to do with re-linq; it’s simply how LINQ works.
  2. Define a result operator class that represents the information in the parsed QueryModel.
  3. Write an expression node parser that translates the call to the extension method to your result operator and wire up that parser with your LINQ provider.
  4. Add code handling the result operator to your LINQ provider back-end. This, again, has not much to do with re-linq.

The rest of this post will walk you through a sample. Only steps 2 and 3 are really specific to re-linq, but I’ll include the other steps for completeness.

This is the query we want to be able to run in the end:

var query = (from o in QuerySource
             where o.ID != 0
             select o).AdditionalInfo (o => o.OrderDate);

var result = query.ToArray ();

1. Create an extension method

Here’s the AdditionalInfo method:

public static class MyQueryExtensions
{
  public static IQueryable<T> AdditionalInfo<T> (
      this IQueryable<T> source,
      Expression<Func<T, int>> parameter)
  {
    return source.Provider.CreateQuery<T> (
        Expression.Call (
            ((MethodInfo) MethodBase.GetCurrentMethod ())
                .MakeGenericMethod (typeof (T)),
            source.Expression,
            Expression.Quote (parameter)));
  }
}

(Bug fixed on 2010-12-02)

This is the “standard” way to write LINQ extension methods. Ouch.

Note: If you write a lot of these, you might want to use some kind of expression generation helper:

public static class MyQueryExtensions
{
  public static IQueryable<T> AdditionalInfo<T> (
      this IQueryable<T> source,
      Expression<Func<T, int>> parameter)
  {
    return CreateQuery (source, s => s.AdditionalInfo (parameter));
  }

  private static IQueryable<T> CreateQuery<T, TR> (
      IQueryable<T> source, Expression<Func<IQueryable<T>, TR>> expression)
  {
    var newQueryExpression = ReplacingExpressionTreeVisitor.Replace (
        expression.Parameters[0],
        source.Expression,
        expression.Body);

    return source.Provider.CreateQuery<T> (newQueryExpression);
  }
}

This allows you to specify the new expression tree without having to build it by hand.

2. Define a result operator that represents the information in the parsed QueryModel

Here it is:

public class AdditionalInfoResultOperator
    : SequenceTypePreservingResultOperatorBase
{
  public AdditionalInfoResultOperator (Expression parameter)
  {
    Parameter = parameter;
  }

  public Expression Parameter { get; private set; }

  public override string ToString ()
  {
    return "AdditionalInfo ("
        + FormattingExpressionTreeVisitor.Format (Parameter)
        + ")";
  }

  public override ResultOperatorBase Clone (CloneContext cloneContext)
  {
    return new AdditionalInfoResultOperator (Parameter);
  }

  public override void TransformExpressions (
      Func<Expression, Expression> transformation)
  {
    Parameter = transformation (Parameter);
  }

  public override StreamedSequence ExecuteInMemory<T> (StreamedSequence input)
  {
    return input; // sequence is not changed by this operator
  }
}

As mentioned before, the purpose of this class is mainly to represent the information expressed by the extension method in the resulting QueryModel. That’s why it has a property holding the extension method’s “parameter” value. The Clone, TransformExpressions, and ExecuteInMemory methods are used when the QueryModel holding the result operator is cloned, transformed, or executed in memory. If a LINQ provider does not make use of these features, the methods needn’t be implemented; but since it’s not much work, I did it here for completeness. The ToString method is overridden only for diagnostic reasons.

Note: If you find yourself in a case where you need a lot of parameterless result operators that represent some sort of ‘options’, you should probably define a single OptionsResultOperator that has an “optionKind” discriminator value (enum, string, or even MethodInfo).
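Such an options operator could look roughly like the following. This is a hypothetical sketch mirroring the AdditionalInfoResultOperator above; the class and member names are made up, and only the discriminator idea is the point:

```csharp
// Hypothetical sketch: one result operator class carrying an "option kind"
// discriminator instead of a separate operator class per parameterless option.
public class OptionsResultOperator : SequenceTypePreservingResultOperatorBase
{
  public OptionsResultOperator (string optionKind)
  {
    OptionKind = optionKind;
  }

  // Tells the back-end which option was specified in the query.
  public string OptionKind { get; private set; }

  public override ResultOperatorBase Clone (CloneContext cloneContext)
  {
    return new OptionsResultOperator (OptionKind);
  }

  public override void TransformExpressions (
      Func<Expression, Expression> transformation)
  {
    // Nothing to do: the operator holds no expressions, it's just a marker.
  }

  public override StreamedSequence ExecuteInMemory<T> (StreamedSequence input)
  {
    return input; // options don't change the sequence
  }
}
```

The back-end then switches on OptionKind instead of on the operator's type.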

3. Write an expression node parser that translates the method to the operator

Now, what’s this for? Can’t re-linq just automatically instantiate an AdditionalInfoResultOperator when it detects the extension method? By registration, or even by convention, for example?

In the most trivial cases, it probably could. There are, however, many more complex cases, especially when the extension method takes parameters. One of the tasks of the re-linq front-end is, after all, to remove transparent identifiers and to analyze where the data for a clause (or operator) stems from. This task can’t really be automated, and that’s why there’s one additional abstraction between query methods and result operators: the expression node parser. Here it is:

public class AdditionalInfoExpressionNode : ResultOperatorExpressionNodeBase
{
  public static MethodInfo[] SupportedMethods =
      new[] { typeof (MyQueryExtensions).GetMethod ("AdditionalInfo") };

  private readonly LambdaExpression _parameterLambda;

  public AdditionalInfoExpressionNode (
      MethodCallExpressionParseInfo parseInfo, LambdaExpression parameter)
      : base (parseInfo, null, null)
  {
    _parameterLambda = parameter;
  }

  protected override ResultOperatorBase CreateResultOperator (
      ClauseGenerationContext clauseGenerationContext)
  {
    var resolvedParameter = Source.Resolve (
        _parameterLambda.Parameters[0],
        _parameterLambda.Body,
        clauseGenerationContext);
    return new AdditionalInfoResultOperator (resolvedParameter);
  }

  public override Expression Resolve (
      ParameterExpression inputParameter,
      Expression expressionToBeResolved,
      ClauseGenerationContext clauseGenerationContext)
  {
    return Source.Resolve (
        inputParameter,
        expressionToBeResolved,
        clauseGenerationContext);
  }
}

This is probably the most “complicated” class in the whole process, because it’s not obvious at first glance what that class actually does; but it is a case of necessary complexity. re-linq translates and simplifies queries as it parses expression trees. If you want to extend that process, you need to tell re-linq how to perform that translation and simplification. And this is exactly what the two methods, CreateResultOperator and Resolve do.

But first things first. As you can see, the node class is derived from a base class specifically designed for the parsing of result operator query methods: ResultOperatorExpressionNodeBase. The SupportedMethods field is a convention that makes it easier to register the parser with re-linq later on.

When re-linq traverses an expression tree and encounters a MethodCallExpression instance that matches a registered expression node parser, that parser is instantiated, and the method call’s actual argument expressions are passed to the parser’s constructor. Therefore, in our example, the constructor has two parameters. The first one, parseInfo, represents the extension method’s source parameter. It contains information about the expression being parsed, the previous expression node in the query, and so on. Our parser doesn’t use this apart from passing it to its base class. The second parameter, “parameter”, directly corresponds to the second argument of the extension method. Where the extension method takes an Expression<Func<T, int>>, the node takes a LambdaExpression (the non-generic version of the former). The parser stores the parameter for later use.

The CreateResultOperator method tells re-linq how to translate the MethodCallExpression being parsed into a result operator to be put into a QueryModel. It simply returns an instance of the AdditionalInfoResultOperator declared before, passing in the method call’s parameter. Before doing so, however, it has its Source resolve the parameter. What does that mean?

Often, when you create a result operator, you don’t really care about the LambdaExpression passed to the extension method in the query. An expression such as o => o.OrderDate doesn’t tell you which o’s OrderDate should be retrieved. After all, the query before the extension method might not even contain an item called “o”.

Therefore, re-linq allows the expression node parser to resolve the origin of its lambda’s input data while it is translating the expression into a result operator. To do so, the parser calls the Resolve method on the expression node directly preceding itself in the query. That way, o => o.OrderDate becomes [o].OrderDate, with [o] being a reference to the corresponding clause producing the items.

To enable this mechanism, each expression node parser must implement the Resolve method in such a way that the changes made by the result operator on the query result are reflected in the resolution result.

For example, if the result operator represents a step in the query that produces new items, its Resolve method returns an expression describing those new items. If it changes the items coming from its own Source, it returns an expression representing that change. If it simply passes through the items coming from its Source without adding new ones or changing existing ones, it simply passes on the Resolve request to the Source. Our sample result operator is purely informational and does not do anything with the items flowing through the query; so it takes that last option.

And that’s all there is to say about expression node parsers for result operators. To summarize, each parser instance represents one MethodCallExpression instance. The parser has two tasks, so it has two methods. CreateResultOperator tells re-linq how to translate the extension method into a result operator, optionally simplifying the LambdaExpression parameter in the process. Resolve allows subsequent expression node parsers to simplify their own LambdaExpressions.

To have re-linq use the expression node parser, register it as follows:

nodeTypeRegistry.Register (
    AdditionalInfoExpressionNode.SupportedMethods,
    typeof (AdditionalInfoExpressionNode));

The nodeTypeRegistry can be obtained from a QueryParser via the ExpressionTreeParser and NodeTypeRegistry properties, or it can be injected into the constructor of DefaultQueryProvider when a new queryable instance is created.
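Put together, the wiring might look roughly like this. This is a sketch only: MyQueryable<> and MyQueryExecutor are placeholders for your own queryable and IQueryExecutor types, and the exact constructor overloads vary between re-linq versions:

```csharp
// Sketch: register the node parser and create a query provider that uses it.
var nodeTypeRegistry = MethodCallExpressionNodeTypeRegistry.CreateDefault ();
nodeTypeRegistry.Register (
    AdditionalInfoExpressionNode.SupportedMethods,
    typeof (AdditionalInfoExpressionNode));

// Build a QueryParser on top of the customized registry and hand it to the
// provider backing your IQueryable implementation.
var queryParser = new QueryParser (new ExpressionTreeParser (nodeTypeRegistry));
var queryProvider = new DefaultQueryProvider (
    typeof (MyQueryable<>), queryParser, new MyQueryExecutor ());
```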

Note: If you have several extension methods that are all mapped to the same result operator, you should only create a single expression node parser that has multiple SupportedMethods.

4. Add code handling the result operator to your LINQ provider back-end

Now, this is simple again. Using the registered expression node parser, re-linq now detects the extension method and creates a result operator for it. You can inspect a QueryModel’s result operators via the ResultOperators collection. If you have a QueryModelVisitor, just override the VisitResultOperator method and check the result operator’s type. You can also choose to implement extended (acyclic) visitors by overriding the result operator’s Accept method.
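As a sketch, the visitor route could look like this (MyQueryModelVisitor is a made-up back-end class; what you do with the stored parameter is up to your provider):

```csharp
// Sketch: picking up the AdditionalInfoResultOperator in a back-end visitor.
public class MyQueryModelVisitor : QueryModelVisitorBase
{
  // The information attached via AdditionalInfo(), for later use
  // by the back-end (e.g. during SQL generation).
  public Expression AdditionalInfoParameter { get; private set; }

  public override void VisitResultOperator (
      ResultOperatorBase resultOperator, QueryModel queryModel, int index)
  {
    var additionalInfo = resultOperator as AdditionalInfoResultOperator;
    if (additionalInfo != null)
    {
      AdditionalInfoParameter = additionalInfo.Parameter;
      return;
    }

    base.VisitResultOperator (resultOperator, queryModel, index);
  }
}
```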

In a nutshell, adding new extension methods to your own LINQ provider is quite possible once you understand how re-linq works. Apart from the inherent stuff, like defining an extension method, you need to define two classes: a result operator and an expression node parser. The result operator represents the information in the QueryModel, the parser translates from the expression tree to the result operator.

If you need samples, take a look at re-linq’s source code; for example the TakeResultOperator and TakeExpressionNode classes.

And if you have good ideas about how to simplify this specific extensibility point – let me know.

Update (2010-12-02): There was a bug in the example code given under step 1. I’ve fixed it.

Written by Fabian

October 28th, 2010 at 2:23 pm

Posted in re-linq

re-linq: Query operators defined by interfaces

with 4 comments

And another question that makes for a good blog post: Why don’t we support automatic detection of interface implementation in the re-linq front-end when we detect query operators?

Consider the following example:

// This works
var cookNamesArray = new[] { "john", "joe", "jim" };
var query1 = from c in Cooks where cookNamesArray.Contains (c.FirstName) select c.FirstName;

As the comment implies, this works – the Contains call is detected as a query operator. It is so because the expression above refers to Enumerable.Contains<T>(), for which a node type parser is, by default, registered in the MethodCallExpressionNodeTypeRegistry. re-linq uses this registry to determine what methods are query operators, and how to parse them.

Now, consider this example:

// This does not work

var cookNamesMyList = new MyList<string> { "john", "joe", "jim" };
var query2 =
    from c in Cooks
    where cookNamesMyList.Contains (c.FirstName)
    select c.FirstName;

// This works again
IEnumerable<string> cookNamesMyListAsEnumerable = new MyList<string> { "john", "joe", "jim" };
var query3 =
    from c in Cooks
    where cookNamesMyListAsEnumerable.Contains (c.FirstName)
    select c.FirstName;

The first Contains call in that second example is not recognized as a query operator because MyList<T>.Contains() has not been registered as a query operator. Since re-linq analyzes the expression tree generated by the compiler, the second call is recognized again because it is again the Enumerable.Contains<T>() method.

Now, a sensible suggestion would be the following: Since MyList<T> probably implements ICollection<T>, why not register a parser for ICollection<T>.Contains()?

And the answer is: because it wouldn’t help. (Unless you cast to ICollection<T>, that is.)

As long as the method in the Expression tree isn’t ICollection<T>.Contains() but MyList<T>.Contains(), re-linq wouldn’t be able to determine that the registered parser should be used. But why not? After all, we can do it, why can’t re-linq?

Automatic detection of interface implementations for query operators is a difficult topic, primarily because it’s hard to get right with good performance. Currently, the re-linq front-end is quite fast because it only does dictionary lookups for the methods coming in via MethodCallExpressions (plus GetGenericTypeDefinition()/GetGenericMethodDefinition()).

With interface method implementations, this is not that simple – we can’t perform dictionary lookups, which are O(1) operations (leaving the calls mentioned above aside); we have to get an interface map for the registered interfaces and perform a linear search through the TargetMethods array of that interface map. This makes the lookup an O(N*M) operation, where N is the number of registered interface types, and M is the number of methods in those interfaces. (And probably, GetInterfaceMap() adds another dimension of complexity.)

We could try to make it perform better by detecting interface methods by name first, and only then check the interface map, but it would still be much slower than an ordinary lookup – and you must not forget that re-linq has to perform that lookup for every MethodCallExpression in an expression tree in order to detect whether a method call refers to a query operator or not. So this has to be really fast.

Complex logic like this could only be implemented via sophisticated caching, which we’ve been able to avoid before. We’d have to devise a good caching scheme, expose control over the cache’s lifetime to the user of re-linq (i.e., the LINQ provider), and so on.

So, this is not a small feature to implement, which is why we haven’t done it up to now. However, I’ve created a JIRA feature request for it, and we’ll consider implementing it in the future. (Oh, and we accept patches if they come with the necessary unit tests. If somebody would like to implement this feature, let’s discuss the design on the re-motion developers list.)

Until the feature is implemented, there are a few workarounds:

  • You can use Enumerable.Contains<T>(), as illustrated above. This works out of the box right now.
  • You can register a parser for common collection types (e.g., List<T>) to at least support those. (List<T>.Contains() will become a default operator with RM-3340.)
  • You can register parsers for ICollection<T>.Contains() or IList<T>.Contains(), but have to cast the collection to the standard interface type: where ((ICollection<string>) cookNamesMyList).Contains (c.FirstName). This will also be implemented with RM-3340.
  • You can allow your users to register parsers for their own query operators (MyList<T>.Contains()) by exposing the MethodCallExpressionNodeTypeRegistry.
  • You can register a parser for the “Contains” method name. This is not a good option because it would also cause re-linq to detect string.Contains() as a query operator.

The standard ContainsExpressionNode parser should work for all of these cases.
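The registration-based workarounds boil down to something like the following sketch. The MethodInfo lookup details are simplified here; depending on the re-linq version you may need to pass the registerable method definition of the open generic type rather than a closed method:

```csharp
// Sketch: reuse the built-in ContainsExpressionNode for additional
// Contains methods. MyList<T> is the user-defined collection from above.
nodeTypeRegistry.Register (
    new[] { typeof (List<>).GetMethod ("Contains") },
    typeof (ContainsExpressionNode));

nodeTypeRegistry.Register (
    new[] { typeof (MyList<>).GetMethod ("Contains") },
    typeof (ContainsExpressionNode));
```

Registering the interface methods (ICollection<T>.Contains(), IList<T>.Contains()) works the same way, but, as noted above, only matches when the query actually casts the collection to the interface type.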

Written by Fabian

September 23rd, 2010 at 9:35 am

Posted in re-linq

re-linq: Dealing with sub-queries in a from clause

without comments

On our re-motion users list, Fabio Maulo asks:

for this query
from Dog o in session.Query<Dog>() select o

I would catch the Cast method but it is parsed as SubQueryExpression.

Have you any advise ?
Thanks.

My short answer is: Look for the CastResultOperator in the ResultOperators collection of the SubQueryExpression’s QueryModel. Detect that the inner QueryModel is trivial by inspecting QueryModel.IsIdentityQuery(). Do this in your back-end’s respective VisitSubQueryExpression visitor method, or use the SubQueryFromClauseFlattener.
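In code, the short answer might be sketched like this, inside the back-end's expression visitor. VisitExpression stands in for your own dispatch method, and the fallback branch is only indicated:

```csharp
// Sketch: spotting a trivial "from Dog o in session.Query<Dog>()" sub-query.
protected override Expression VisitSubQueryExpression (SubQueryExpression expression)
{
  var queryModel = expression.QueryModel;
  var isTrivialCast =
      queryModel.IsIdentityQuery ()
      && queryModel.ResultOperators.Count == 1
      && queryModel.ResultOperators[0] is CastResultOperator;

  if (isTrivialCast)
  {
    // The sub-query is just a cast over its source; translate the source
    // directly and handle the cast inline (flattening the sub-query).
    return VisitExpression (queryModel.MainFromClause.FromExpression);
  }

  // Otherwise, translate the inner QueryModel as a real sub-query
  // (see the options discussed below).
  throw new NotSupportedException ("Non-trivial sub-queries not handled here.");
}
```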

The long answer:

re-linq’s front-end works in such a way that it always wraps expressions that contain query operators (Select, Where, OrderBy, Count, Cast, and similar method calls) into QueryModel instances. If those expressions are part of a larger expression tree (e.g., in the expression of a FromClause), re-linq will wrap the QueryModel into a SubQueryExpression and include that in the larger tree.

This sub-query system has several advantages, for example, it is highly consistent both in representation and analysis (because you can usually apply the same QueryModel analyzer to the outer and inner queries), it’s easy and fast to parse (no context is needed when re-linq’s structural parser encounters a Select MethodCallExpression, for example), and so on.

However, it means you have to explicitly deal with those SubQueryExpressions in your parser if you need to support such queries.

You have several options:

Option 1, if your target has good support for sub-queries (SQL has, for example), you can just translate those SubQueryExpressions into target sub-queries as they are. In the expression visitor parsing the expressions of from clauses, in the VisitSubQueryExpression method, apply your QueryModelVisitor to the expression’s inner QueryModel, then insert the resulting query as a sub-query into your result.

I’ll explain by using SQL. In the example above, you’d have the following:

SELECT [q0].*
FROM (SELECT [t0].* FROM [DogTable] [t0]) [q0]

You can choose to add an optimization step to the VisitSubQueryExpression method: if the QueryModel is trivial (QueryModel.IsIdentityQuery()) and all of its result operators can be handled inline, you can simply “flatten” the sub-query instead of parsing the inner QueryModel:

SELECT [t0].* FROM [DogTable] [t0]

As you can see, for this flattening, you have to take the FromExpression of the inner QueryModel and use it as the FromExpression of the outer query model. (If you also want to flatten non-trivial queries, you’ll need to perform an additional reference replacement step, but I won’t go into this now.)

This is how we do it in re-linq’s SQL backend (or how we will do it, since the flattener hasn’t been implemented yet). The advantage is that this concept supports all constructs that build SubQueryExpressions automatically, and that you can still build optimized statements.

Option 2, if your target has no good support of sub-queries, you can still detect the SubQueryExpression in the expression visitor’s VisitSubQueryExpression method. Check whether the inner QueryModel is trivial (IsIdentityQuery()) and whether all the result operators can be expressed in-line. If so, use the inner QueryModel’s FromExpression and modify it according to the result operators. You can do this, for example, using a dedicated SubQueryModelVisitor that returns a flattened result, which you can then insert into your outer query. If the QueryModel is non-trivial (e.g., if somebody wrote “from o in session.Query<Dog>().Where (…).OrderBy (…) select o”), throw a NotSupportedException or something similar.

This is similar to option 1, but the optimization becomes the default (and only) route.

Option 3, if you feel this is too complicated to do in your back-end, you can perform a recursive sub-query flattening step on the outermost QueryModel before you feed it into your back-end’s analyzer or transformer. I’ve blogged about this before, and re-linq already contains a simple SubQueryFromClauseFlattener. If you want to use that, you’ll probably need to override CheckFlattenable() for the example above in order to make sure that the CastResultOperator is ignored.

If the existing SubQueryFromClauseFlattener is not enough for your purposes (or you don’t want to throw an exception if the sub-query cannot be flattened), you can implement your own transformation class in a similar way, of course.

The advantage of this option is that it keeps logic out of your back-end, which is likely complex enough anyway. The disadvantage is, however, that it requires a separate pass over the QueryModel. If performance is very important, it’s probably not the best option.

To summarize, I realize that at first glance, re-linq’s sub-query model might sometimes look like overkill, especially when a SubQueryExpression is created just because of a single Cast, Count, or Single method call. On the other hand, it makes it possible to handle Cast, Count, or Single the same way, no matter whether they are at the top of the expression tree or embedded in a Select expression, for example. And they can be handled the same way whether they follow directly on a query source (session.Query<Dog>().Count()) or not (session.Query<Dog>().Where(…).Select(…).Count()). To fully leverage this simplification, you need a target system that can also represent those sub-queries. If you have one, prefer option 1 above.

Written by Fabian

September 23rd, 2010 at 7:44 am

Posted in re-linq