Indexing PDFs with Sitecore 7.5 and a custom crawler using iTextSharp

As you probably know, Sitecore indexes PDFs using the Adobe iFilter.

The Adobe iFilter technology is not really friendly: it relies on COM objects, which means you are going to have a lot of security issues and dependencies on those COM objects.

On top of that, it seems that there are proven solutions based on the iFilter up to version 9, but unfortunately you can now download only version 11; you can read more about this issue here

So I can recommend two solutions:

  1. Buy a license for a third-party tool (like Foxit)
  2. Write your own media crawler following this post

When it comes to parsing PDFs there are several options; I have chosen iTextSharp, which seems widely used and supported.

This is the code that you need for your custom media crawler:

using System;
using System.IO;
using System.Text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;
using Sitecore.ContentSearch;
using Sitecore.ContentSearch.ComputedFields;
using Sitecore.ContentSearch.Diagnostics;
using Sitecore.ContentSearch.Extracters.IFilterTextExtraction;
using Sitecore.Data.Items;
using Sitecore.Diagnostics;

namespace xxx.Crawler.Pdf
{
    public class MediaContentExtractor : IComputedIndexField
    {
        public string FieldName { get; set; }
        public string ReturnType { get; set; }

        public object ComputeFieldValue(IIndexable indexable)
        {
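            // Skip indexables that are not Sitecore items instead of throwing,
            // so a single unexpected entry cannot interrupt the crawl.
            var indexableItem = indexable as SitecoreIndexableItem;
            Item item = indexableItem == null ? null : indexableItem.Item;

            object result = null;
            if (item != null && item.Paths.IsMediaItem)
            {
                MediaItem _media = item;
                // Guard against a missing extension and normalise case before comparing.
                string ext = (_media.Extension ?? string.Empty).ToLowerInvariant();
                if (ext == "pdf" || _media.MimeType == "application/pdf")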
                {
                    result = ParsePDF(_media);
                }
                else
                {
                    result = ParseItemsWithIfilters(_media);
                }
            }

            return result;
        }


        private string ParsePDF(MediaItem mediaItem)
        {
            var builder = new StringBuilder();
            if (mediaItem != null)
            {
                PdfReader reader = null;
                try
                {
                    using (Stream stream = mediaItem.GetMediaStream())
                    {
                        reader = new PdfReader(stream);

                        // Index the document metadata along with the page text,
                        // separated by spaces so adjacent values do not merge.
                        foreach (string key in new[] { "Title", "Subject", "Keywords" })
                        {
                            if (reader.Info.ContainsKey(key))
                            {
                                builder.Append(reader.Info[key]).Append(' ');
                            }
                        }

                        for (int pagenumber = 1; pagenumber <= reader.NumberOfPages; pagenumber++)
                        {
                            // Use a fresh strategy per page: a shared SimpleTextExtractionStrategy
                            // keeps accumulating text, so earlier pages would be appended again.
                            ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
                            builder.Append(PdfTextExtractor.GetTextFromPage(reader, pagenumber, strategy)).Append(' ');
                        }
                    }
                }
                catch (Exception ex)
                {
                    CrawlingLog.Log.Error(ex.ToString(), ex);
                    return string.Empty;
                }
                finally
                {
                    // Release the resources held by the reader.
                    if (reader != null)
                    {
                        reader.Close();
                    }
                }
            }
            return builder.ToString();
        }


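        private string ParseItemsWithIfilters(MediaItem mediaItem)
        {
            string content = string.Empty;
            try
            {
                using (Stream stream = mediaItem.GetMediaStream())
                {
                    // FilterReader needs a file path, so this only works for file-based media;
                    // media stored in the database does not come back as a FileStream.
                    var fileStream = stream as FileStream;
                    if (fileStream == null)
                    {
                        return content;
                    }

                    using (TextReader reader = new FilterReader(fileStream.Name))
                    {
                        content = reader.ReadToEnd();
                    }
                }
            }
            catch (Exception ex)
            {
                CrawlingLog.Log.Error(ex.ToString(), ex);
            }

            if (!string.IsNullOrWhiteSpace(content))
            {
                // Replace line breaks with a space so words on adjacent lines are not merged.
                content = content.Replace("\r\n", " ").ToLowerInvariant();
            }

            return content;
        }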
    }
}

You also need to amend the file Sitecore.ContentSearch.Lucene.DefaultIndexConfiguration.config, commenting out the default _content field and registering the new class in its place:

<!--
<field fieldName="_content" type="Sitecore.ContentSearch.ComputedFields.MediaItemContentExtractor,Sitecore.ContentSearch">
  <mediaIndexing ref="contentSearch/indexConfigurations/defaultLuceneIndexConfiguration/mediaIndexing"/>
</field>
-->
<field fieldName="_content" storageType="no" indexType="tokenized">xxx.Crawler.Pdf.MediaContentExtractor, xxx.Crawler.Pdf</field>
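
Once the index has been rebuilt, one way to check that the PDF text really ended up in _content is to query the index through the ContentSearch LINQ API. The following is only a minimal sketch: the class name PdfIndexCheck, the method name and the index name "sitecore_web_index" are assumptions for illustration, so adjust them to your solution. Because the field is configured with storageType="no", the text can be searched but is not returned with the results, which is why the sketch returns item paths.

```csharp
using System.Collections.Generic;
using System.Linq;
using Sitecore.ContentSearch;
using Sitecore.ContentSearch.SearchTypes;

namespace xxx.Crawler.Pdf
{
    // Hypothetical helper, not part of Sitecore: lists items whose _content matches a term.
    public static class PdfIndexCheck
    {
        public static List<string> FindPathsByContent(string term)
        {
            // Assumption: the default web index name; use your own index name if it differs.
            ISearchIndex index = ContentSearchManager.GetIndex("sitecore_web_index");

            using (IProviderSearchContext context = index.CreateSearchContext())
            {
                // SearchResultItem.Content is mapped to the _content index field.
                return context.GetQueryable<SearchResultItem>()
                              .Where(hit => hit.Content.Contains(term))
                              .Take(10)
                              .ToList()
                              .Select(hit => hit.Path)
                              .ToList();
            }
        }
    }
}
```

Calling PdfIndexCheck.FindPathsByContent with a word that appears only inside one of your PDFs should return the path of that media item if the crawler is wired up correctly.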

WebApi Attribute routing is not working with Sitecore 7.5

If you are into WebApi, you were probably very excited to get attribute routing working on Sitecore 7.2.

You can read about it in several blogs, and it works pretty well:

http://kamsar.net/index.php/2014/05/using-web-api-2-attribute-routing-with-sitecore/

http://patrickdelancy.com/2013/08/sitecore-webapi-living-harmony/#.VHXGi4usV1B

Unfortunately, in Sitecore 7.5 rev. 141003 attribute routing is not working anymore. You can read more about it here

and this is the error that you would get:

Message: An error has occurred.
ExceptionMessage: Value cannot be null. Parameter name: key
ExceptionType: System.ArgumentNullException
StackTrace:
   at System.Collections.Generic.Dictionary`2.FindEntry(TKey key)
   at System.Collections.Generic.Dictionary`2.TryGetValue(TKey key, TValue& value)
   at Sitecore.Services.Infrastructure.Web.Http.Dispatcher.NamespaceHttpControllerSelector.SelectController(HttpRequestMessage request)
   at System.Web.Http.Dispatcher.HttpControllerDispatcher.SendAsyncCore(HttpRequestMessage request, CancellationToken cancellationToken)
   at System.Web.Http.Dispatcher.HttpControllerDispatcher.d__0.MoveNext()

So my advice, if you want to use WebApi with Sitecore 7.5, is to use something like this:

url: api/Sdbother/getid

public class SdbotherController : ApiController
{
    [System.Web.Http.HttpGet]
    public string GetId()
    {
        return "test";
    }
}

Then make global.asax inherit from the following class (through the Inherits attribute of its Application directive):

using System;
using System.Net.Http.Formatting;
using System.Web.Http;

public class GlobalExtended : Sitecore.Web.Application
{
    protected void Application_Start(object sender, EventArgs e)
    {
        GlobalConfiguration.Configure(ConfigureRoutes);
    }

    public static void ConfigureRoutes(HttpConfiguration config)
    {
        // Convention-based route that serves api/{controller}/{action}.
        config.Routes.MapHttpRoute("DefaultApiRoute",
            "api/{controller}/{action}/{id}",
            new { id = RouteParameter.Optional });

        // Use the instance passed in; it is the same object as GlobalConfiguration.Configuration.
        config.MapHttpAttributeRoutes();
        config.Formatters.Clear();
        config.Formatters.Add(new JsonMediaTypeFormatter());
    }
}

Getting Started with Sitecore 7.5 Analytics and MongoDb

As most of you have probably already heard, the most exciting new functionality in Sitecore 7.5 is around Analytics and MongoDB.
In this guide, I will help you with your first configuration of Analytics and MongoDB.

Assuming that you have completed the setup of SC 7.5 and Sitecore is running fine on your machine, it is now time to download Mongo from http://www.mongodb.org/downloads. Once you have installed Mongo, you can decide whether to run it as a service or as a standalone application.

If you have decided on the standalone application, it is now time to start it. You can run the following from the command line (the folder passed as the data path must already exist):

cd C:\Program Files\MongoDB 2.6 Standard\bin
mongod.exe --dbpath C:\MongoData

You should now see the MongoDB status and activity tracking in the command line window.

Now you can browse your Sitecore instance, ensuring that the visitor identification tag is present in your layout (the Web Forms control or the MVC helper, respectively):

<sc:VisitorIdentification runat="server" />
@Html.Sitecore().VisitorIdentification()

and that the following markup is rendered on the pages:

<link href="/layouts/System/VisitorIdentification.aspx" rel="stylesheet" type="text/css" />

Assuming that you are running everything on the default MongoDB port 27017, these should be your connection strings:

<add name="analytics" connectionString="mongodb://localhost/analytics"/>
<add name="tracking.live" connectionString="mongodb://localhost/tracking_live"/>
<add name="tracking.history" connectionString="mongodb://localhost/tracking_history"/>
<add name="reporting" connectionString="user id=user;password=password;Data Source=(local);Database=sc75wizardSitecore_Analytics"/>

Now that you have collected some visitor data, you can either run the latest visits report (in Engagement Analytics) or use a MongoDB viewer (like MongoVUE) to see your data in the analytics database, Contacts collection.


Associate a visitor to a Contact in Sitecore 7.5 Analytics

This is a very important concept in Sitecore Analytics. Every time you identify a user (e.g. after a successful login, or when you obtain their email address for any reason), you need to convert the existing “anonymous” visit to a registered contact. This way, Analytics will associate the anonymous session data with the identified “contact” data:

var tracker = Sitecore.Analytics.Tracker.Current;
var identifier = tracker.Contact.Identifiers;
if (identifier.IdentificationLevel != Sitecore.Analytics.Model.ContactIdentificationLevel.Known)
{
   tracker.Session.Identify("stelio@thisIsMyEmail.com");
}
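
In a real site you would probably call this from your login (or newsletter sign-up) handler rather than hard-coding the identifier. The following is a minimal sketch under that assumption: the namespace, the class VisitorIdentifier and the method IdentifyCurrentVisitor are hypothetical names, and the only Sitecore calls used are the ones shown above. It also guards against the tracker not being available for the current request.

```csharp
using Sitecore.Analytics;
using Sitecore.Analytics.Model;

namespace xxx.Web.Analytics
{
    // Hypothetical helper, not part of Sitecore: wraps the identification call shown above.
    public static class VisitorIdentifier
    {
        // Returns true only when the current visit was identified by this call.
        public static bool IdentifyCurrentVisitor(string identifier)
        {
            if (string.IsNullOrWhiteSpace(identifier))
            {
                return false;
            }

            // The tracker may be missing, for example when analytics is disabled.
            var tracker = Tracker.Current;
            if (tracker == null || tracker.Contact == null)
            {
                return false;
            }

            // Nothing to do if the contact is already known.
            if (tracker.Contact.Identifiers.IdentificationLevel == ContactIdentificationLevel.Known)
            {
                return false;
            }

            tracker.Session.Identify(identifier);
            return true;
        }
    }
}
```

After a successful login you would then call VisitorIdentifier.IdentifyCurrentVisitor with whatever uniquely identifies the user in your solution, such as the email address.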