Fabian's Mix

Mixins, .NET, and more

re-linq: Extensibility: Custom query operators


This is a blog post about re-linq that I’ve been planning to write for a very long time. It is the documentation of how to extend the re-linq front-end so that it can detect custom query methods, and it’s going to be quite long for a blog post. Read it if you’re a user of re-linq and you are trying to get your LINQ provider to understand your own custom extension methods.

As I’ve mentioned before, re-linq differentiates between clauses, which form the main part of a query, and result operators, which describe operations conducted on the result of the query. Query methods such as Select and Where are parsed into clause objects, whereas methods such as Distinct, Take, or Count are parsed into result operator objects. Both clauses and result operators are represented in the QueryModel that represents the query analyzed and simplified by re-linq’s front-end.

The distinction between result operators and clauses is important because it defines what goes into a sub-query, and what is simply appended to the current query. For example, consider the following query:

Query<Order>()
  .Where (o => o.DeliveryDate <= DateTime.UtcNow)
  .OrderByDescending (o => o.DeliveryDate)
  .Take (5)
  .OrderBy (o => o.OrderNumber)

In this case, the Where and OrderByDescending method calls are parsed into clauses of a single QueryModel. The Take method call is appended to that QueryModel as a result operator. The OrderBy call following the Take call, however, is parsed into a different, outer QueryModel, which embeds the former QueryModel as a sub-query. Like this:

from o’ in (
  from o in Query<Order>()
  where [o].DeliveryDate <= DateTime.UtcNow
  orderby [o].DeliveryDate desc
  select [o]).Take (5)
orderby [o’].OrderNumber
select [o’]

(It is necessary to form a sub-query in this case because the OrderBy operation must take place after the Take operation. This makes a huge semantic difference!)
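To make that semantic difference concrete, here is a minimal sketch using plain LINQ to Objects instead of a re-linq provider. The sample data and the names TakeOrderDemo, TakeThenOrder, and OrderThenTake are made up for illustration; the second method only approximates what would happen if the outer OrderBy were folded into the inner query.

```csharp
using System;
using System.Linq;

public static class TakeOrderDemo
{
    // Hypothetical sample data: order number plus delivery day (larger = more recent).
    public static readonly (int OrderNumber, int DeliveryDay)[] Orders =
    {
        (3, 10), (1, 30), (2, 20), (4, 40),
    };

    // Sub-query semantics: pick the two most recent orders, then sort those by number.
    public static int[] TakeThenOrder() =>
        Orders.OrderByDescending(o => o.DeliveryDay)
              .Take(2)
              .OrderBy(o => o.OrderNumber)
              .Select(o => o.OrderNumber)
              .ToArray();

    // What you would get if the second ordering were applied before Take:
    // the two lowest order numbers, regardless of delivery date.
    public static int[] OrderThenTake() =>
        Orders.OrderByDescending(o => o.DeliveryDay)
              .OrderBy(o => o.OrderNumber)
              .Take(2)
              .Select(o => o.OrderNumber)
              .ToArray();

    public static void Main()
    {
        Console.WriteLine(string.Join(", ", TakeThenOrder())); // 1, 4
        Console.WriteLine(string.Join(", ", OrderThenTake())); // 1, 2
    }
}
```

Both variants use the same operators; only the position of Take relative to the final ordering changes the result.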

I’ve talked about the sub-query system in the past, and I believe it has many positive properties with regard to the scoping of identifiers and similar concerns. Also note how the final parsed query looks very similar to how the query would be written in C#’s query syntax. However, the point I was trying to make is that a query method parsed as a clause might cause everything before it to be wrapped into a sub-query, whereas a query method parsed as a result operator is always simply attached to the current QueryModel.

Therefore, if your goal is to extend a query by simply attaching some information to it, you will want to go the result operator route, which works as follows:

  1. Create an extension method for queries that represents the information you want to attach to the query. This is the user’s entry point in attaching the information, and it has nothing to do with re-linq; it’s simply how LINQ works.
  2. Define a result operator class that represents the information in the parsed QueryModel.
  3. Write an expression node parser that translates the call to the extension method to your result operator and wire up that parser with your LINQ provider.
  4. Add code handling the result operator to your LINQ provider back-end. This, again, doesn’t have much to do with re-linq.

The rest of this post will walk you through a sample. Only steps 2 and 3 are really specific to re-linq, but I’ll include the other steps for completeness.

This is the query we want to be able to run in the end:

var query = (from o in QuerySource
             where o.ID != 0
             select o).AdditionalInfo (o => o.OrderDate);

var result = query.ToArray ();

1. Create an extension method

Here’s the AdditionalInfo method:

public static class MyQueryExtensions
{
  public static IQueryable<T> AdditionalInfo<T> (
      this IQueryable<T> source,
      Expression<Func<T, int>> parameter)
  {
    return source.Provider.CreateQuery<T> (
        Expression.Call (
            ((MethodInfo) MethodBase.GetCurrentMethod ())
                .MakeGenericMethod (typeof (T)),
            source.Expression,
            Expression.Quote (parameter)));
  }
}

(Bug fixed on 2010-12-02)

This is the “standard” way to write LINQ extension methods. Ouch.
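To see what such a method actually hands to the provider, here is a small sketch that runs the same extension method against the IQueryable wrapper of LINQ to Objects. That stand-in provider is an assumption made only to keep the example self-contained; with re-linq you would use your own queryable. The Order class, its int-typed OrderDate, and the InspectDemo class are hypothetical.

```csharp
using System;
using System.Linq;
using System.Linq.Expressions;
using System.Reflection;

// Hypothetical entity; OrderDate is an int only to match Func<T, int>.
public class Order
{
    public int ID { get; set; }
    public int OrderDate { get; set; }
}

public static class MyQueryExtensions
{
    public static IQueryable<T> AdditionalInfo<T>(
        this IQueryable<T> source, Expression<Func<T, int>> parameter)
    {
        // Wrap the current query expression in a call to this very method.
        return source.Provider.CreateQuery<T>(
            Expression.Call(
                ((MethodInfo) MethodBase.GetCurrentMethod()).MakeGenericMethod(typeof(T)),
                source.Expression,
                Expression.Quote(parameter)));
    }
}

public static class InspectDemo
{
    public static MethodCallExpression BuildCall()
    {
        // AsQueryable() gives us a provider that accepts arbitrary expression trees
        // as long as they are never executed.
        var query = Enumerable.Empty<Order>().AsQueryable()
            .Where(o => o.ID != 0)
            .AdditionalInfo(o => o.OrderDate);
        return (MethodCallExpression) query.Expression;
    }

    public static void Main()
    {
        var call = BuildCall();
        Console.WriteLine(call.Method.Name);           // AdditionalInfo
        Console.WriteLine(call.Arguments[1].NodeType); // Quote
    }
}
```

The outermost node of the query is a MethodCallExpression for AdditionalInfo whose second argument is the quoted lambda. That node is what a provider's front-end has to recognize.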

Note: If you write a lot of these, you might want to use some kind of expression generation helper:

public static class MyQueryExtensions
{
  public static IQueryable<T> AdditionalInfo<T> (
      this IQueryable<T> source,
      Expression<Func<T, int>> parameter)
  {
    return CreateQuery (source, s => s.AdditionalInfo (parameter));
  }

  private static IQueryable<T> CreateQuery<T, TR> (
      IQueryable<T> source, Expression<Func<IQueryable<T>, TR>> expression)
  {
    var newQueryExpression = ReplacingExpressionTreeVisitor.Replace (
        expression.Parameters[0],
        source.Expression,
        expression.Body);

    return source.Provider.CreateQuery<T> (newQueryExpression);
  }
}

This allows you to specify the new expression tree without having to build it by hand.
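The core of that helper is a substitution: every occurrence of the lambda's parameter is replaced with another expression. As a rough illustration of the idea, here is a sketch built on the BCL's ExpressionVisitor rather than on re-linq's ReplacingExpressionTreeVisitor. ParameterReplacer and ReplaceDemo are hypothetical names, and the sketch handles parameters only.

```csharp
using System;
using System.Linq.Expressions;

// Hypothetical stand-in that swaps one parameter for an arbitrary expression.
public sealed class ParameterReplacer : ExpressionVisitor
{
    private readonly ParameterExpression _parameter;
    private readonly Expression _replacement;

    private ParameterReplacer(ParameterExpression parameter, Expression replacement)
    {
        _parameter = parameter;
        _replacement = replacement;
    }

    public static Expression Replace(
        ParameterExpression parameter, Expression replacement, Expression tree)
    {
        return new ParameterReplacer(parameter, replacement).Visit(tree);
    }

    protected override Expression VisitParameter(ParameterExpression node)
    {
        // Only the targeted parameter is replaced; all others are left alone.
        return node == _parameter ? _replacement : base.VisitParameter(node);
    }
}

public static class ReplaceDemo
{
    public static int Evaluate()
    {
        Expression<Func<int, int>> addOne = x => x + 1;

        // Turn "x + 1" into "41 + 1" by substituting a constant for x.
        var body = ParameterReplacer.Replace(
            addOne.Parameters[0], Expression.Constant(41), addOne.Body);

        return Expression.Lambda<Func<int>>(body).Compile()();
    }

    public static void Main()
    {
        Console.WriteLine(Evaluate()); // 42
    }
}
```

In the helper above, the same kind of substitution puts source.Expression where the lambda parameter s used to be.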

2. Define a result operator that represents the information in the parsed QueryModel

Here it is:

public class AdditionalInfoResultOperator
    : SequenceTypePreservingResultOperatorBase
{
  public AdditionalInfoResultOperator (Expression parameter)
  {
    Parameter = parameter;
  }

  public Expression Parameter { get; private set; }

  public override string ToString ()
  {
    return "AdditionalInfo ("
      + FormattingExpressionTreeVisitor.Format (Parameter)
      + ")";
  }

  public override ResultOperatorBase Clone (CloneContext cloneContext)
  {
    return new AdditionalInfoResultOperator (Parameter);
  }

  public override void TransformExpressions (
      Func<Expression, Expression> transformation)
  {
    Parameter = transformation (Parameter);
  }

  public override StreamedSequence ExecuteInMemory<T> (StreamedSequence input)
  {
    return input; // sequence is not changed by this operator
  }
}

As mentioned before, the purpose of this class is mainly to represent the information expressed by the extension method in the resulting QueryModel. That’s why it has a property holding the extension method’s “parameter” value. The Clone, TransformExpressions, and ExecuteInMemory methods are used when the QueryModel holding the result operator is cloned, transformed, or executed in memory. If a LINQ provider does not make use of these features, the methods needn’t be implemented; but since it’s not much work, I did it here for completeness. The ToString method is overridden only for diagnostic reasons.

Note: If you find yourself in a case where you need a lot of parameterless result operators that represent some sort of ‘options’, you should probably define a single OptionsResultOperator that has an “optionKind” discriminator value (enum, string, or even MethodInfo).

3. Write an expression node parser that translates the method to the operator

Now, what’s this for? Can’t re-linq just automatically instantiate an AdditionalInfoResultOperator when it detects the extension method? By registration, or even by convention, for example?

In the most trivial cases, it probably could. There are a lot of more complex cases, though, especially if the extension method takes parameters. One of the tasks of the re-linq front-end is, after all, to remove transparent identifiers and to analyze where the data for a clause (or operator) stems from. This task can’t really be automated, and that’s why there’s one additional abstraction between query methods and result operators: the expression node parser. Here it is:

public class AdditionalInfoExpressionNode : ResultOperatorExpressionNodeBase
{
  public static MethodInfo[] SupportedMethods =
      new[] { typeof (MyQueryExtensions).GetMethod ("AdditionalInfo") };

  private readonly LambdaExpression _parameterLambda;

  public AdditionalInfoExpressionNode (
      MethodCallExpressionParseInfo parseInfo, LambdaExpression parameter)
      : base (parseInfo, null, null)
  {
    _parameterLambda = parameter;
  }

  protected override ResultOperatorBase CreateResultOperator (
      ClauseGenerationContext clauseGenerationContext)
  {
    var resolvedParameter = Source.Resolve (
        _parameterLambda.Parameters[0],
        _parameterLambda.Body,
        clauseGenerationContext);

    return new AdditionalInfoResultOperator (resolvedParameter);
  }

  public override Expression Resolve (
      ParameterExpression inputParameter,
      Expression expressionToBeResolved,
      ClauseGenerationContext clauseGenerationContext)
  {
    return Source.Resolve (
        inputParameter,
        expressionToBeResolved,
        clauseGenerationContext);
  }
}

This is probably the most “complicated” class in the whole process, because it’s not obvious at first glance what that class actually does; but it is a case of necessary complexity. re-linq translates and simplifies queries as it parses expression trees. If you want to extend that process, you need to tell re-linq how to perform that translation and simplification. And this is exactly what the two methods, CreateResultOperator and Resolve, do.

But first things first. As you can see, the node class is derived from a base class specifically designed for the parsing of result operator query methods: ResultOperatorExpressionNodeBase. The SupportedMethods field is a convention that makes it easier to register the parser with re-linq later on.

When re-linq traverses an expression tree and encounters a MethodCallExpression instance that matches a registered expression node parser, that parser is instantiated, and the method call’s actual argument expressions are passed to the parser’s constructor. Therefore, in our example, the constructor has two parameters. The first one, parseInfo, represents the extension method’s source parameter. It contains information about the expression being parsed, the previous expression node in the query, and so on. Our parser doesn’t use this apart from passing it to its base class. The second parameter, “parameter”, directly corresponds to the second argument of the extension method. Where the extension method takes an Expression<Func<T, int>>, the node takes a LambdaExpression (the non-generic version of the former). The parser stores the parameter for later use.

The CreateResultOperator method tells re-linq how to translate the MethodCallExpression being parsed into a result operator to be put into a QueryModel. It simply returns an instance of the AdditionalInfoResultOperator declared before, passing in the method call’s parameter. Before doing so, however, it has its Source resolve the parameter. What does that mean?

Often, when you create a result operator, you don’t really care about the LambdaExpression passed to the extension method in the query. An expression such as o => o.OrderDate doesn’t tell you which o’s OrderDate should be retrieved. After all, the query before the extension method might not even contain an item called “o”.

Therefore, re-linq allows the expression node parser to resolve the origin of its lambda’s input data while it is translating the expression into a result operator. To do so, the parser calls the Resolve method on the expression node directly preceding itself in the query. That way, o => o.OrderDate becomes [o].OrderDate, with [o] being a reference to the corresponding clause producing the items.

To enable this mechanism, each expression node parser must implement the Resolve method in such a way that the changes made by the result operator on the query result are reflected in the resolution result.

For example, if the result operator represents a step in the query that produces new items, its Resolve method returns an expression describing those new items. If it changes the items coming from its own Source, it returns an expression representing that change. If it simply passes through the items coming from its Source without adding new ones or changing existing ones, it simply passes on the Resolve request to the Source. Our sample result operator is purely informational and does not do anything with the items flowing through the query; so it takes that last option.
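The three cases can be sketched with a toy model. None of the types below are re-linq classes: IToyNode, ToySourceNode, ToySelectNode, ToyPassThroughNode, and Substituter are hypothetical stand-ins built only on System.Linq.Expressions. The real Resolve method also takes a ClauseGenerationContext, which is left out here, and a placeholder parameter named [o] stands in for the reference to the clause producing the items.

```csharp
using System;
using System.Linq.Expressions;

public class Order { public DateTime OrderDate { get; set; } }

// Toy counterpart of an expression node: it can resolve a later node's lambda.
public interface IToyNode
{
    Expression Resolve(ParameterExpression inputParameter, Expression expressionToBeResolved);
}

// Replaces one parameter with another expression.
public sealed class Substituter : ExpressionVisitor
{
    private readonly ParameterExpression _from;
    private readonly Expression _to;
    private Substituter(ParameterExpression from, Expression to) { _from = from; _to = to; }

    public static Expression Apply(ParameterExpression from, Expression to, Expression tree)
        => new Substituter(from, to).Visit(tree);

    protected override Expression VisitParameter(ParameterExpression node)
        => node == _from ? _to : node;
}

// Case 1: the node produces new items, so it resolves to a reference to them.
public sealed class ToySourceNode : IToyNode
{
    private readonly ParameterExpression _itemReference;
    public ToySourceNode(Type itemType, string itemName)
        => _itemReference = Expression.Parameter(itemType, "[" + itemName + "]");

    public Expression Resolve(ParameterExpression inputParameter, Expression expressionToBeResolved)
        => Substituter.Apply(inputParameter, _itemReference, expressionToBeResolved);
}

// Case 2: the node changes the items (like Select), so it substitutes its own,
// already resolved, selector for the input parameter.
public sealed class ToySelectNode : IToyNode
{
    private readonly IToyNode _source;
    private readonly LambdaExpression _selector;
    public ToySelectNode(IToyNode source, LambdaExpression selector)
    {
        _source = source;
        _selector = selector;
    }

    public Expression Resolve(ParameterExpression inputParameter, Expression expressionToBeResolved)
    {
        var resolvedSelector = _source.Resolve(_selector.Parameters[0], _selector.Body);
        return Substituter.Apply(inputParameter, resolvedSelector, expressionToBeResolved);
    }
}

// Case 3: the node leaves the items alone and delegates to its source.
public sealed class ToyPassThroughNode : IToyNode
{
    private readonly IToyNode _source;
    public ToyPassThroughNode(IToyNode source) => _source = source;

    public Expression Resolve(ParameterExpression inputParameter, Expression expressionToBeResolved)
        => _source.Resolve(inputParameter, expressionToBeResolved);
}

public static class ResolveDemo
{
    public static Expression ResolveSample()
    {
        Expression<Func<Order, DateTime>> selector = o => o.OrderDate; // Select (o => o.OrderDate)
        Expression<Func<DateTime, int>> later = d => d.Year;           // lambda of a later node

        IToyNode chain = new ToyPassThroughNode(
            new ToySelectNode(new ToySourceNode(typeof(Order), "o"), selector));

        // The later node resolves its lambda through the pass-through node.
        return chain.Resolve(later.Parameters[0], later.Body);
    }

    public static void Main()
    {
        // The result is a member access chain (Year on OrderDate) rooted at the [o] placeholder.
        Console.WriteLine(ResolveSample());
    }
}
```

In this sketch the pass-through node contributes nothing to the resolved expression, which is the behavior chosen for the AdditionalInfo node.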

And that’s all there is to say about expression node parsers for result operators. To summarize, each parser instance represents one MethodCallExpression instance. The parser has two tasks, so it has two methods. CreateResultOperator tells re-linq how to translate the extension method into a result operator, optionally simplifying the LambdaExpression parameter in the process. Resolve allows subsequent expression node parsers to simplify their own LambdaExpressions.

To have re-linq use the expression node parser, register it as follows:

nodeTypeRegistry.Register (
    AdditionalInfoExpressionNode.SupportedMethods,
    typeof (AdditionalInfoExpressionNode));

The nodeTypeRegistry can be obtained from a QueryParser via the ExpressionTreeParser and NodeTypeRegistry properties, or it can be injected into the constructor of DefaultQueryProvider when a new queryable instance is created.

Note: If you have several extension methods that are all mapped to the same result operator, you should only create a single expression node parser that has multiple SupportedMethods.

4. Add code handling the result operator to your LINQ provider back-end

Now, this is simple again. Using the registered expression node parser, re-linq now detects the extension method and creates a result operator for it. You can inspect a QueryModel’s result operators via the ResultOperators collection. If you have a QueryModelVisitor, just override the VisitResultOperator method and check the result operator’s type. You can also choose to implement extended (acyclic) visitors by overriding the result operator’s Accept method.

In a nutshell, adding new extension methods to your own LINQ provider is quite possible once you understand how re-linq works. Apart from the inherent stuff, like defining an extension method, you need to define two classes: a result operator and an expression node parser. The result operator represents the information in the QueryModel; the parser translates from the expression tree to the result operator.

If you need samples, take a look at re-linq’s source code; for example the TakeResultOperator and TakeExpressionNode classes.

And if you have good ideas about how to simplify this specific extensibility point – let me know.

Update (2010-12-02): There was a bug in the example code given under step 1. I’ve fixed it.

Written by Fabian

October 28th, 2010 at 2:23 pm

Posted in re-linq

2 Responses to 're-linq: Extensibility: Custom query operators'


  1. I’m trying to follow these instructions – this is my first use of re-linq. And I’m baffled by the nodeTypeRegistry – or finding it.

    I’m doing something very simple right now – just adding a custom query operator that has no parameters. So, I have the following:

    var h = new LINQToTTree.QueriableTTree<BasicNtupleModel>(fi, "btag").ProjectToHistogram();

    And the QueriableTTree:

    public QueriableTTree (FileInfo rootFile, string treeName)
    : base (new TTreeQueryExecutor(rootFile, treeName))
    {
    }

    And then the TTreeQueryExecutor just implements IQueryExecutor. So, it looks to me like I need to get a hold of nodeTypeRegistry in the QueriableTTree object. But for the life of me I can’t figure out when to do that without doing something like the following in the ctor above:

    (this.Provider as DefaultQueryProvider).ExpressionTreeParser.NodeTypeRegistry

    Could you give me a pointer or two? Many thanks!

    Gordon Watts

    29 Nov 10 at 08:45

  2. Gordon: The node type registry is part of the QueryProvider. Therefore, the code you’ve provided works: you can access the registry via "this.Provider". The cast is (unfortunately) necessary because the Provider property (defined by the LINQ interfaces) is of type IQueryProvider, not DefaultQueryProvider. This is currently the only way to obtain the registry from a fully constructed Queryable instance. (I might change this in the future.)

    In order to initialize the registry when initializing your QueriableTTree class, you can also use a different base constructor:

    public QueriableTTree (FileInfo rootFile, string treeName)
    : base (new DefaultQueryProvider (typeof (QueriableTTree<>), new TTreeQueryExecutor (rootFile, treeName), CreateNodeTypeRegistry()))
    {
    }

    private static ExpressionNodeTypeRegistry CreateNodeTypeRegistry()
    {
      var registry = ExpressionNodeTypeRegistry.CreateDefault();
      // register your custom handlers here
      return registry;
    }

    In this code, I’ve used the constructor that allows me to create my own provider instance, and I’ve passed in a node type registry to that provider. I’ve created the registry via a static method – this is the place to put custom handler registrations.

    Hope that helps. If you have any further questions, please contact me via our Google Group: "groups.google.com/group/re-motion-users".

    fabian

    29 Nov 10 at 16:56
