Indexing PDFs with Sitecore 7.5 and a custom crawler using iTextSharp

As you probably know, Sitecore indexes PDFs using the Adobe iFilter…

The Adobe iFilter technology is not very friendly: it relies on COM objects, which means you are likely to run into security issues and take on dependencies on those COM objects.

On top of that, it seems that there are proven solutions based on the iFilter up to version 9, but unfortunately only version 11 is now available for download. You can read more about this issue here.

So I can recommend two solutions:

  1. Buy a license for a third-party tool (like Foxit)
  2. Write your own media crawler following this post

When it comes to parsing PDFs there are several options; I chose iTextSharp, which seems widely used and supported.

This is the code that you need for your custom media crawler:

using System;
using System.IO;
using System.Text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;
using Sitecore.ContentSearch;
using Sitecore.ContentSearch.ComputedFields;
using Sitecore.ContentSearch.Diagnostics;
using Sitecore.ContentSearch.Extracters.IFilterTextExtraction;
using Sitecore.Data.Items;
using Sitecore.Diagnostics;

namespace xxx.Crawler.Pdf
{
    public class MediaContentExtractor : IComputedIndexField
    {
        public string FieldName { get; set; }
        public string ReturnType { get; set; }

        public object ComputeFieldValue(IIndexable indexable)
        {
            // Non-Sitecore indexables yield null here and are simply skipped.
            Item item = indexable as SitecoreIndexableItem;

            object result = null;
            if (item != null && item.Paths.IsMediaItem)
            {
                MediaItem _media = item;
                string ext = (_media.Extension ?? string.Empty).ToLowerInvariant();
                if (ext == "pdf" || _media.MimeType == "application/pdf")
                {
                    result = ParsePDF(_media);
                }
                else
                {
                    result = ParseItemsWithIfilters(_media);
                }
            }

            return result;
        }


        private string ParsePDF(MediaItem mediaItem)
        {
            var builder = new StringBuilder();
            if (mediaItem != null)
            {
                PdfReader reader = null;
                try
                {
                    reader = new PdfReader(mediaItem.GetMediaStream());

                    // Index the document metadata first, separated by spaces
                    // so adjacent values do not merge into a single token.
                    foreach (string key in new[] { "Title", "Subject", "Keywords" })
                    {
                        if (reader.Info.ContainsKey(key))
                        {
                            builder.Append(reader.Info[key]).Append(' ');
                        }
                    }

                    for (int pagenumber = 1; pagenumber <= reader.NumberOfPages; pagenumber++)
                    {
                        // Use a fresh strategy per page: a reused SimpleTextExtractionStrategy
                        // keeps the text of earlier pages and would duplicate it.
                        ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
                        builder.Append(PdfTextExtractor.GetTextFromPage(reader, pagenumber, strategy)).Append(' ');
                    }
                }
                catch (Exception ex)
                {
                    CrawlingLog.Log.Error(ex.ToString(), ex);
                    return string.Empty;
                }
                finally
                {
                    // Release the reader whether or not parsing succeeded.
                    if (reader != null)
                    {
                        reader.Close();
                    }
                }
            }
            return builder.ToString();
        }


        private string ParseItemsWithIfilters(MediaItem mediaItem)
        {
            string content = string.Empty;
            try
            {
                using (Stream mediaStream = mediaItem.GetMediaStream())
                {
                    // FilterReader needs a file path, so only file-based media
                    // can be handled; anything else is skipped.
                    var fileStream = mediaStream as FileStream;
                    if (fileStream == null)
                    {
                        return content;
                    }

                    using (TextReader reader = new FilterReader(fileStream.Name))
                    {
                        content = reader.ReadToEnd();
                    }
                }
            }
            catch (Exception ex)
            {
                CrawlingLog.Log.Error(ex.ToString(), ex);
            }

            if (!string.IsNullOrWhiteSpace(content))
            {
                content = content.Replace("\r\n", string.Empty).ToLower();
            }

            return content;
        }
    }
}
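
Before wiring the crawler into Sitecore, it can help to check what iTextSharp actually extracts from one of your PDFs. The following is only a minimal sketch of a console program, assuming iTextSharp 5.x is referenced; the class name PdfTextDump and the command-line argument are made up for illustration and are not part of the crawler above.

```csharp
using System;
using System.Text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

// Hypothetical helper for trying the extraction outside Sitecore.
public static class PdfTextDump
{
    public static string ExtractText(string path)
    {
        var text = new StringBuilder();
        var reader = new PdfReader(path);
        try
        {
            for (int page = 1; page <= reader.NumberOfPages; page++)
            {
                // A fresh strategy per page avoids carrying text over between pages.
                text.AppendLine(PdfTextExtractor.GetTextFromPage(
                    reader, page, new SimpleTextExtractionStrategy()));
            }
        }
        finally
        {
            reader.Close();
        }
        return text.ToString();
    }

    public static void Main(string[] args)
    {
        // Assumption: the first argument is the path to a PDF file.
        Console.WriteLine(ExtractText(args[0]));
    }
}
```

If the output is empty for a given file, the PDF is probably a scanned image without a text layer, and no text extractor will help without OCR.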

Obviously you also need to amend the file Sitecore.ContentSearch.Lucene.DefaultIndexConfiguration:

<!--<field fieldName="_content" type="Sitecore.ContentSearch.ComputedFields.MediaItemContentExtractor,Sitecore.ContentSearch">
  <mediaIndexing ref="contentSearch/indexConfigurations/defaultLuceneIndexConfiguration/mediaIndexing"/>
</field>-->
<field fieldName="_content" storageType="no" indexType="tokenized">xxx.Crawler.Pdf.MediaContentExtractor, xxx.Crawler.Pdf</field>
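
Once the index has been rebuilt, the extracted text lands in the _content field, so it can be queried like any other content. As a rough sketch, assuming the media items are in the default sitecore_master_index and using the stock SearchResultItem type (whose Content property maps to _content), a search could look like this; the class and method names are hypothetical.

```csharp
using System.Linq;
using Sitecore.ContentSearch;
using Sitecore.ContentSearch.SearchTypes;

// Hypothetical example class, not part of the crawler itself.
public static class PdfSearchExample
{
    public static string[] FindMediaByContent(string term)
    {
        // Assumption: the index name is the Sitecore default for the master database.
        ISearchIndex index = ContentSearchManager.GetIndex("sitecore_master_index");
        using (IProviderSearchContext context = index.CreateSearchContext())
        {
            return context.GetQueryable<SearchResultItem>()
                .Where(r => r.Content.Contains(term)) // matches against _content
                .Take(10)
                .ToList()
                .Select(r => r.Path)
                .ToArray();
        }
    }
}
```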

SwitchOnRebuildLuceneIndex in Sitecore 7.2

SwitchOnRebuildLuceneIndex is an exciting new feature introduced in Sitecore 7.2.

The idea is to keep a READ ONLY Lucene index available while you are rebuilding your index.

As you probably already know, you should plan your index strategy carefully to avoid site downtime while an index rebuilds, but over the lifetime of an enterprise project you will sooner or later be forced to kick off an index rebuild to troubleshoot some issue.

SwitchOnRebuildLuceneIndex lets you avoid that downtime by serving a read-only index. This means that if somebody adds items to the index during the rebuild, those items won't be returned until the rebuild has completed.

At a low level, when you enable SwitchOnRebuildLuceneIndex it simply keeps a copy of the index folder available while the index is rebuilt in a different folder, and switches the two folders once the rebuild has completed.
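
The folder switch can be pictured with a small in-memory sketch. This is not Sitecore's implementation, only an illustration of the idea: readers keep using the current snapshot while the rebuild fills a second one, and the swap happens in a single step at the end. The class name SwitchOnRebuildSketch is made up for this example.

```csharp
using System.Collections.Generic;
using System.Threading;

// Hypothetical illustration of the switch-on-rebuild idea.
public sealed class SwitchOnRebuildSketch
{
    // Stands in for the "active" index folder that searches read from.
    private List<string> _active = new List<string>();

    // Searches always read whichever snapshot is currently active.
    public IReadOnlyList<string> Search()
    {
        return Volatile.Read(ref _active);
    }

    // A rebuild fills a separate list and swaps it in only when complete.
    public void Rebuild(IEnumerable<string> source)
    {
        var secondary = new List<string>(source); // built off to the side
        Volatile.Write(ref _active, secondary);   // the "folder switch"
    }
}
```

A snapshot obtained from Search() before a rebuild keeps its old contents; only calls made after the swap see the newly indexed items, which is the read-only behaviour described above.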

The configuration to enable it is the following:

<configuration xmlns:patch="http://www.sitecore.net/xmlconfig/">
  <sitecore>
    <contentSearch>
      <configuration type="Sitecore.ContentSearch.LuceneProvider.LuceneSearchConfiguration, Sitecore.ContentSearch.LuceneProvider">
        <indexes hint="list:AddIndex">
          <index id="sitecore_web_index" type="Sitecore.ContentSearch.LuceneProvider.SwitchOnRebuildLuceneIndex, Sitecore.ContentSearch.LuceneProvider">
            <param desc="name">$(id)</param>
            <param desc="folder">$(id)</param>
            <!-- the rest of the index definition is not shown here -->
          </index>
        </indexes>
      </configuration>
    </contentSearch>
  </sitecore>
</configuration>

You can find more information about it in the following post: http://blog.eldblom.dk/2014/07/31/avoiding-downtime-while-rebuilding-your-lucene-search-indexes-in-sitecore-asp-net-cms/

Simplest way to search a Lucene index in Sitecore

This is the simplest code to search a Lucene index in Sitecore…

using (IndexSearchContext indexSearchContext = SearchManager.GetIndex("CustomIndex").CreateSearchContext())
{
    // Exact-match query on a single field.
    var query = new BooleanQuery();
    query.Add(
        new TermQuery(new Term(FieldNameLegacyId, legacyId)),
        BooleanClause.Occur.MUST);

    SearchHits hits = indexSearchContext.Search(query, int.MaxValue);

    // Resolve hits to items, skipping entries whose item no longer exists.
    List<Item> results =
        hits.FetchResults(0, int.MaxValue)
            .Select(r => r.GetObject<Item>())
            .Where(c => c != null && c.Parent != null)
            .ToList();

    if (results.Any())
    {
        return results.First().ID.ToGuid().ToString("N");
    }
}

Lucene Indexes in Sitecore 6.6 getting corrupted (ArgumentOutOfRangeException)

Have you ever seen this exception?

System.ArgumentOutOfRangeException 
Message: Non-negative number required. 
Parameter name: capacity Source: mscorlib 
  at System.Collections.Hashtable..ctor(Int32 capacity, Single loadFactor)
  at System.Collections.Hashtable.Clone() 
  at SupportClass.WeakHashTable.Clean() 
  at SupportClass.WeakHashTable.CleanIfNeeded() 
  at SupportClass.WeakHashTable.Add(Object key, Object value) 
  at Lucene.Net.Util.CloseableThreadLocal.Set(Object object) 
  at Lucene.Net.Index.TermInfosReader.GetThreadResources() 
  at Lucene.Net.Index.TermInfosReader.Get(Term term, Boolean useCache) 
  at Lucene.Net.Index.SegmentReader.DocFreq(Term t) 
  at Lucene.Net.Index.DirectoryReader.DocFreq(Term t) 
  at Lucene.Net.Search.Similarity.IdfExplain(Term term, Searcher searcher)
  at Lucene.Net.Search.TermQuery.CreateWeight(Searcher searcher) 
  at Lucene.Net.Search.BooleanQuery.BooleanWeight..ctor(BooleanQuery enclosingInstance, Searcher searcher) 
  at Lucene.Net.Search.BooleanQuery.CreateWeight(Searcher searcher) 
  at Lucene.Net.Search.Query.Weight(Searcher searcher) 
  at Lucene.Net.Search.Hits..ctor(Searcher s, Query q, Filter f, Sort o) 
  at Lucene.Net.Search.Searcher.Search(Query query, Sort sort) 
  at scSearchContrib.Searcher.QueryRunner.RunQuery(Query query, Boolean showAllVersions, String sortField, Boolean reverse, Int32 start, Int32 end) 
  at scSearchContrib.Searcher.QueryRunner.GetItems(IEnumerable`1 parameters, Boolean showAllVersions, String sortField, Boolean reverse, Int32 start, Int32 end) 

It means that your index is getting corrupted and needs an app pool recycle…
To avoid it, you can try this trick:

<setting name="MaxWorkerThreads" value="100"/>