
Content Processing Part 1: FAST Search Server 2010 for SharePoint, custom meta tags and date formats

09/16/2012

Overview

This is the first in a short series of blog posts about custom content processing with FAST Search Server 2010 for SharePoint (FS4SP) and SharePoint 2013 Preview.  I think it will be interesting to lay out a couple of use cases for content processing, provide something of a how-to for each scenario with FS4SP, and then try to refactor the solutions for SharePoint 2013 Preview to see how they differ and what has become easier or more challenging.

Very briefly, content processing refers to modifying content with your own custom code (or that of a 3rd party solution) as it travels through the document processing pipeline.  Use cases for custom processing range from simple re-formatting (think dates, names, multi-valued properties, etc.) to something as powerful as enhancing search index items with data from other systems.

In FS4SP, the place to do this is in the pipeline extensibility stage of the document processing pipeline.  The diagram below shows the flow in and out of the FS4SP index and a detailed section describing the item processing pipeline including the extensibility stage.  When the item being crawled hits the extensibility stage, your custom code can be called and the item can be impacted by your code.  As you can see, the custom processing happens towards the end of the pipeline and before the item makes it into the index.

The following link is Microsoft’s MSDN overview of custom item processing:  http://msdn.microsoft.com/en-us/library/ff795801.aspx

So onto the first use case.

Use Case

I worked with a customer recently who was crawling HTML files that have custom meta tags in the header.  They wanted to know how to get the metadata stored in those meta tags into the index so they could use it as a refiner or to display it in the search results.  Additionally, some of these metadata fields contain date information and they wanted to know how to wire that info to existing date properties that they are using such as “Last Modified Date”.  This is a great use case because we can solve two separate, but related problems in one shot.

Custom Meta Tag

We’ll handle a single custom meta property for this example called MyCustomDate.  This is what the file looked like and if you notice, the date format is as follows:  “Tue, 31 Jul 2012 07:42:28 -0400”

So the first thing we need to figure out is how to get the property into the index – turns out that’s the easy part.  Being the expert  researcher that I am, I figured this out by tweeting Mikael Svenson and he told me the answer which is that they end up in the crawled properties under the “Web” category.  After all that research I had to take a break and get a snack 😉  More on Mikael later in this post.

So to test this out, first we need to get the document into the index.  Instead of kicking off a crawl, which can be a bit time consuming, I performed a docpush for my sample file.  Docpush.exe is an out of the box FS4SP utility that allows you to add or remove one document at a time from the index.  Performing a docpush sends the item through all of the stages just as if it had been found during a full crawl.  It’s perfect for testing and debugging.  More about docpush here:  http://technet.microsoft.com/en-us/library/ee943508.aspx

Once the item has been processed and is in the index, the custom meta tags are automatically added as crawled properties in the “Web” category – just as Mikael told me.  To find them navigate to Central Admin, Manage Service Apps, FAST Query SSA, FAST Search Administration.

From there, choose Crawled property categories and then “Web”.  In there all of the meta tags can be found.

Date Formats

So that is some progress, but right off this looks like trouble.  I see that my custom date crawled property is of type “Text” instead of “DateTime”.  Just to check whether this was a problem, I went ahead and mapped the mycustomdate (Text) crawled property to one managed property of type “DateTime” and one of type “Text”.

After creating and mapping these managed properties I performed my docpush again and executed a search for the document.  The results of my search told the story (see below) – the text property was populated but the date property was not.  This indicated that FS4SP didn’t want to map my text crawled property to the DateTime managed property.

Quick side-note.  The results in the screenshot above are from an out of the box search center, however I replaced the normal XSLT of the core results web part with my own that looks like this:

This will display a raw XML representation of all of the properties that are passed into the core results web part (in the fetched properties section of the web part).  So in my case I added mycustomdate and mytesttext to the fetched properties.

So the task at hand now is to convert the text string from MyCustomDate to something that FS4SP will recognize as a DateTime – and this is where the custom processing comes in.

Custom Pipeline Stage

Now typically when I create a custom processing stage I use Mikael Svenson’s example in this blog post as a starting point.  For those that don’t know Mikael, he has a tremendous amount of experience and knowledge around FS4SP, search and SharePoint in general and he is extremely generous with his knowledge.  I would also suggest that anyone working with FS4SP read his book Working with FAST Search Server 2010 for SharePoint.  He’s one of my SharePoint heroes 🙂

So, based on Mikael’s post, I created my custom processing console application and wired it up as he describes.

In the program, first I find the property and then get the date string which, if you remember from our document, is “Tue, 31 Jul 2012 07:42:28 -0400”
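The lookup step can be sketched roughly as follows.  This is a minimal sketch, not the code from my project: it assumes the stage receives an XML document of CrawledProperty elements carrying propertySet, propertyName and varType attributes, and that the “Web” category uses the property set GUID shown (check the category in Central Administration to confirm it).  The class and method names are hypothetical, and the sample XML is inlined so the snippet runs on its own; a real stage would load the input file passed to it on the command line instead.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

static class CrawledPropertyReader
{
    // Assumption: property set GUID of the "Web" crawled property category.
    public static readonly Guid WebPropertySet =
        new Guid("d1b5d3f0-c0b3-11cf-9a92-00a0c908dbf1");

    // Returns the values of every crawled property matching the set and name.
    public static IEnumerable<string> GetValues(
        XDocument input, Guid propertySet, string propertyName)
    {
        foreach (XElement cp in input.Descendants("CrawledProperty"))
        {
            Guid set;
            // Guid.TryParse returns false for a missing (null) attribute.
            if (!Guid.TryParse((string)cp.Attribute("propertySet"), out set))
                continue;

            bool nameMatches = string.Equals(
                (string)cp.Attribute("propertyName"),
                propertyName,
                StringComparison.OrdinalIgnoreCase);

            if (set == propertySet && nameMatches)
                yield return cp.Value;
        }
    }
}

static class Program
{
    static void Main()
    {
        // Inlined sample input; a real stage would call XDocument.Load on the input file.
        XDocument input = XDocument.Parse(
            @"<Document>
                <CrawledProperty propertySet='d1b5d3f0-c0b3-11cf-9a92-00a0c908dbf1'
                                 propertyName='mycustomdate'
                                 varType='31'>Tue, 31 Jul 2012 07:42:28 -0400</CrawledProperty>
              </Document>");

        List<string> res = CrawledPropertyReader
            .GetValues(input, CrawledPropertyReader.WebPropertySet, "mycustomdate")
            .ToList();

        Console.WriteLine(res.First());
    }
}
```

Matching on both the property set and the name keeps the stage from picking up a same-named property from another category.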

Then I reformat it with two lines of simple code.  In the example below “res” is a variable that contains the date.

DateTime MyDateTime = DateTime.Parse(res.First());

string myFormatedDate = MyDateTime.ToString("s") + "Z";

Which yields this:  “2012-07-31T07:42:28Z”.  This is the suggested date format for FS4SP.  So it still remains a string at this point but we have it in the correct format.
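One thing worth keeping in mind: when the input string carries an offset such as -0400, DateTime.Parse converts the value to the local time zone of the machine running the code, so appending “Z” (which denotes UTC) is only accurate on a server whose clock is set to UTC.  If you need the stored value to be true UTC, a variation like the sketch below may be safer.  The Fs4spDate class and ToUtcString method are hypothetical names for illustration, not part of FS4SP or of the original program.

```csharp
using System;
using System.Globalization;

static class Fs4spDate
{
    // Parses a date string that includes a UTC offset and returns it
    // as a sortable ISO 8601 string expressed in UTC.
    public static string ToUtcString(string rawDate)
    {
        // DateTimeOffset keeps the offset from the input instead of
        // silently converting to the machine's local time zone.
        DateTimeOffset parsed =
            DateTimeOffset.Parse(rawDate, CultureInfo.InvariantCulture);

        return parsed.UtcDateTime.ToString("s", CultureInfo.InvariantCulture) + "Z";
    }
}

static class Program
{
    static void Main()
    {
        // 07:42:28 at -0400 is 11:42:28 in UTC.
        Console.WriteLine(Fs4spDate.ToUtcString("Tue, 31 Jul 2012 07:42:28 -0400"));
        // prints 2012-07-31T11:42:28Z
    }
}
```

Whether you want the UTC value or the original wall-clock time in the index depends on how the date will be displayed and refined on, so treat this as an option rather than a required change.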

Finally, I write that new value to a crawled property with the same name but give it the variant type “64”, which is DateTime.
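Writing the value back might look something like the sketch below.  It assumes the output document uses the same Document / CrawledProperty XML shape as the input and reuses the “Web” property set GUID, which is an assumption you should confirm in your own environment.  CrawledPropertyWriter and BuildOutput are hypothetical names, and the snippet prints the XML instead of saving it so that it runs on its own.

```csharp
using System;
using System.Xml.Linq;

static class CrawledPropertyWriter
{
    // Variant type 64 is DateTime, as noted in the text.
    public const int VarTypeDateTime = 64;

    // Builds an output document holding one DateTime crawled property.
    public static XDocument BuildOutput(
        Guid propertySet, string propertyName, string isoDate)
    {
        return new XDocument(
            new XElement("Document",
                new XElement("CrawledProperty",
                    new XAttribute("propertySet", propertySet.ToString()),
                    new XAttribute("propertyName", propertyName),
                    new XAttribute("varType", VarTypeDateTime),
                    isoDate)));
    }
}

static class Program
{
    static void Main()
    {
        // Assumption: property set GUID of the "Web" crawled property category.
        Guid webPropertySet = new Guid("d1b5d3f0-c0b3-11cf-9a92-00a0c908dbf1");

        XDocument output = CrawledPropertyWriter.BuildOutput(
            webPropertySet, "mycustomdate", "2012-07-31T07:42:28Z");

        // A real stage would save to the output file path it was given,
        // for example output.Save(outputPath).
        Console.WriteLine(output);
    }
}
```

As far as I know, the new crawled property (with its variant type) also has to be declared as an output of the stage in the pipeline extensibility configuration, as Mikael’s post describes.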

Here is a reference for the different variant types.  I usually just peek at an existing property in Central Administration myself, but it is a good reference.  http://technet.microsoft.com/en-us/library/ff191231.aspx

Now, once I deployed the program, reset my docprocs (“psctrl reset” in the FS4SP Admin Console) and performed the docpush again, I saw the additional crawled property of type “DateTime”.

It’s probably a good time to high-five someone at this point – we’re almost there.

Lastly, I mapped the new crawled property to a managed property called MyCustomDate in Central Administration and as you can see below, the custom date is now populated.

Now, as you can see, the format is not particularly friendly to humans.  But don’t worry.  When I added the same crawled property to the “Write” (which is last modified) field, the date format came out perfectly like the screenshot below.  This is because the XSLT templates used in the out of the box core search results web part handle the formatting of dates in the search results.

Conclusions

  • Meta tags (even custom tags) are automatically surfaced as crawled properties and are found in the “Web” category.
  • If you want to map a custom date field, make sure it is a crawled property of variant type “64” (DateTime).
  • Creating a custom pipeline stage to format a date is fairly straightforward, especially if you start with some example code (thanks Mikael).

I like to say this a lot in my live presentations.  Don’t be afraid of writing a little code, even if you’re not a “developer”.  For encapsulated tasks like this (rather than large complex programs with lots of moving parts) you don’t have to be a rock star coder.  You can easily do this in an afternoon even if it’s your first time (assuming you have some familiarity with Visual Studio).  Just make sure one of your proper developer friends has a good look at it and that it’s been tested thoroughly before it moves to production 😉  I’m being funny here, but do make sure you test thoroughly and have experienced developers and testers take a look at the code before deploying to production!  That said, don’t be afraid to roll up your sleeves and tinker around.  I’m not a “developer” but I don’t let this stuff scare me 😉

Source Code

The environment I used to perform the tasks above was one of CloudShare’s pre-configured FS4SP servers.  I’ve made it available as a “permalink” here:  https://use.cloudshare.com/Pro/ShareEnv/0KZTTZGWVLUW

If you would like the source code click the “permalink”, create a CloudShare profile and once you get into the environment you’ll find the code on the desktop.

Stay Tuned

Stay tuned for the next blog in the series where we’ll explore a different use case.  Then in a subsequent post or two I’ll attempt to accomplish the same modifications in SharePoint 2013 so we can compare and contrast the tools and methods.

Happy processing!
