Lucene Index got corrupted on Sitecore 7.5

I recently ran into a bizarre issue: the application pool on my Content Delivery instances was restarting every 15 minutes. The problem started right after a deployment.

The site runs on Sitecore 7.5 and uses Lucene indexes.

As is often the case, the Application log in the Event Viewer showed the error:

Application: w3wp.exe
Framework Version: v4.0.30319
Description: The process was terminated due to an unhandled exception.
Exception Info: Lucene.Net.Index.MergePolicy+MergeException
Stack:
   at Lucene.Net.Index.ConcurrentMergeScheduler.HandleMergeException(System.Exception)
   at Lucene.Net.Index.ConcurrentMergeScheduler+MergeThread.Run()
   at System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)
   at System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)
   at System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object)
   at System.Threading.ThreadHelper.ThreadStart()

My first suspicion was that a content author publishing while I was deploying had created a lock or race condition inside the index folder.

I tried to inspect the index with the index viewer, without success. Given the time pressure and the fact that the site was going down every 15 minutes, the only fix I found was to delete the physical folder containing the indexes, rebuild the index on the Content Authoring instance, and copy the rebuilt index files back into the index folder.

Once I had done this, everything ran smoothly.

My takeaway: if you are using Lucene, always be ready to delete and re-upload your index folder and, above all, make sure your site does not rely completely on the indexes.
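
If you end up repeating this delete-and-copy operation, it can be scripted. The following is only a minimal sketch of the idea, not something I ran during the incident: the class name, method names and folder paths are hypothetical, and it assumes nothing is holding the target index files open (for example, the application pool is stopped) while the copy runs.

```csharp
using System;
using System.IO;

// Hypothetical helper: replaces a (possibly corrupted) index folder with a freshly rebuilt copy.
public static class IndexFolderReplacer
{
    public static void Replace(string rebuiltIndexFolder, string targetIndexFolder)
    {
        if (!Directory.Exists(rebuiltIndexFolder))
        {
            throw new DirectoryNotFoundException("Rebuilt index not found: " + rebuiltIndexFolder);
        }

        // Remove the old index completely so no stale segment files survive.
        if (Directory.Exists(targetIndexFolder))
        {
            Directory.Delete(targetIndexFolder, true);
        }

        CopyDirectory(rebuiltIndexFolder, targetIndexFolder);
    }

    // Recursively copies every file and subfolder from source to destination.
    private static void CopyDirectory(string source, string destination)
    {
        Directory.CreateDirectory(destination);

        foreach (string file in Directory.GetFiles(source))
        {
            File.Copy(file, Path.Combine(destination, Path.GetFileName(file)), true);
        }

        foreach (string directory in Directory.GetDirectories(source))
        {
            CopyDirectory(directory, Path.Combine(destination, Path.GetFileName(directory)));
        }
    }
}
```

You would call it with the folder of the index rebuilt on the Content Authoring instance as the first argument and the index folder of the broken instance as the second.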

Indexing PDF with sitecore 7.5 and a custom crawler using ITextSharp

As you probably know, Sitecore indexes PDFs using the Adobe iFilter.

The Adobe iFilter technology is not very friendly: it relies on COM objects, which means you will run into plenty of security issues and COM dependencies.

On top of that, it seems there are proven iFilter-based solutions up to version 9, but unfortunately you can now download only version 11. You can read more about this issue here.

So I can recommend two solutions:

  1. Buy a license for a third-party tool (like Foxit)
  2. Write your own media crawler following this post

When it comes to parsing PDFs there are several options; I chose iTextSharp, which seems widely used and supported.

This is the code you need for your custom media crawler:

using System;
using System.IO;
using System.Text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;
using Sitecore.ContentSearch;
using Sitecore.ContentSearch.ComputedFields;
using Sitecore.ContentSearch.Diagnostics;
using Sitecore.ContentSearch.Extracters.IFilterTextExtraction;
using Sitecore.Data.Items;
using Sitecore.Diagnostics;

namespace xxx.Crawler.Pdf
{
    public class MediaContentExtractor : IComputedIndexField
    {
        public string FieldName { get; set; }
        public string ReturnType { get; set; }

        public object ComputeFieldValue(IIndexable indexable)
        {
            Item item = (SitecoreIndexableItem) indexable;
            Assert.ArgumentNotNull(item, "item");

            object result = null;
            if (item != null && item.Paths.IsMediaItem)
            {
                MediaItem _media = item;
                string ext = _media.Extension.ToLower();
                if (ext == "pdf" || _media.MimeType == "application/pdf")
                {
                    result = ParsePDF(_media);
                }
                else
                {
                    result = ParseItemsWithIfilters(_media);
                }
            }

            return result;
        }


        private string ParsePDF(MediaItem mediaItem)
        {
            var builder = new StringBuilder();
            if (mediaItem != null)
            {
                PdfReader reader = null;
                try
                {
                    // Dispose the media stream as soon as parsing is done.
                    using (Stream stream = mediaItem.GetMediaStream())
                    {
                        reader = new PdfReader(stream);

                        // Index the document metadata, separated by spaces so words don't run together.
                        foreach (string key in new[] { "Title", "Subject", "Keywords" })
                        {
                            if (reader.Info.ContainsKey(key))
                            {
                                builder.Append(reader.Info[key]).Append(' ');
                            }
                        }

                        for (int pagenumber = 1; pagenumber <= reader.NumberOfPages; pagenumber++)
                        {
                            // Use a fresh strategy per page: a reused SimpleTextExtractionStrategy
                            // accumulates text, so earlier pages would be appended again.
                            ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
                            builder.Append(PdfTextExtractor.GetTextFromPage(reader, pagenumber, strategy)).Append(' ');
                        }
                    }
                }
                catch (Exception ex)
                {
                    CrawlingLog.Log.Error(ex.ToString(), ex);
                    return string.Empty;
                }
                finally
                {
                    if (reader != null)
                    {
                        reader.Close();
                    }
                }
            }
            return builder.ToString();
        }


        private string ParseItemsWithIfilters(MediaItem mediaItem)
        {
            string content = string.Empty;
            try
            {
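                // FilterReader needs a path on disk, so this only works for file-based media;
                // any other stream type fails the cast and is logged below.
                string path;
                using (Stream stream = mediaItem.GetMediaStream())
                {
                    path = ((FileStream) stream).Name;
                }
                using (TextReader reader = new FilterReader(path))
                {
                    content = reader.ReadToEnd();
                }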
            }
            catch (Exception ex)
            {
                CrawlingLog.Log.Error(ex.ToString(), ex);
            }

            if (!string.IsNullOrWhiteSpace(content))
            {
                // Replace line breaks with a space so words on adjacent lines are not merged.
                content = content.Replace("\r\n", " ").ToLower();
            }

            return content;
        }
    }
}

You also need to amend the file Sitecore.ContentSearch.Lucene.DefaultIndexConfiguration, commenting out the default _content field and registering the custom extractor instead:

<!--<field fieldName="_content" type="Sitecore.ContentSearch.ComputedFields.MediaItemContentExtractor,Sitecore.ContentSearch">
  <mediaIndexing ref="contentSearch/indexConfigurations/defaultLuceneIndexConfiguration/mediaIndexing"/>
</field>-->
<field fieldName="_content" storageType="no" indexType="tokenized">xxx.Crawler.Pdf.MediaContentExtractor, xxx.Crawler.Pdf</field>

SwitchOnRebuildLuceneIndex in sitecore 7.2

SwitchOnRebuildLuceneIndex is an exciting new feature introduced in Sitecore 7.2.

The idea is to keep a read-only Lucene index available while you are rebuilding your index.

As you probably already know, you should plan your indexing strategy carefully to avoid downtime when the index rebuilds, but over the lifetime of an enterprise project you will sooner or later be forced to kick off an index rebuild to troubleshoot some issue.

SwitchOnRebuildLuceneIndex lets you avoid that downtime by serving a read-only index. This means that items added to the index during the rebuild won't be returned until the rebuild has completed.

At a low level, SwitchOnRebuildLuceneIndex keeps a copy of the index folder available while it rebuilds the index in a different folder, then switches the two folders once the rebuild has completed.

It is configured as follows:
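
To make the folder switch more concrete, here is a small illustrative sketch of the general technique. It is not Sitecore's implementation: the class, its members and the "_sec" folder suffix are all hypothetical, and the actual index build is passed in as a delegate.

```csharp
using System;
using System.IO;

// Illustrative only: NOT Sitecore's code. All names here are hypothetical.
public sealed class SwitchingIndexFolders
{
    private readonly string primary;
    private readonly string secondary;
    private volatile string active;

    public SwitchingIndexFolders(string root, string indexName)
    {
        primary = Path.Combine(root, indexName);
        secondary = Path.Combine(root, indexName + "_sec"); // assumed naming
        Directory.CreateDirectory(primary);
        active = primary;
    }

    // Folder that searches should read from.
    public string ActiveFolder
    {
        get { return active; }
    }

    // Rebuilds into the inactive folder. Readers keep using the active folder
    // until the rebuild has finished; only then are the two roles swapped.
    public void Rebuild(Action<string> buildIndexInto)
    {
        string target = active == primary ? secondary : primary;
        if (Directory.Exists(target))
        {
            Directory.Delete(target, true);
        }
        Directory.CreateDirectory(target);

        buildIndexInto(target); // if this throws, the active folder is unchanged
        active = target;
    }
}
```

The point of the design is that a failed or slow rebuild never touches the folder that is currently serving searches.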

<configuration xmlns:patch="http://www.sitecore.net/xmlconfig/">
  <sitecore>
    <contentSearch>
        <configuration type="Sitecore.ContentSearch.LuceneProvider.LuceneSearchConfiguration, Sitecore.ContentSearch.LuceneProvider">
          <indexes hint="list:AddIndex">
            <index id="sitecore_web_index" type="Sitecore.ContentSearch.LuceneProvider.SwitchOnRebuildLuceneIndex, Sitecore.ContentSearch.LuceneProvider">
              <param desc="name">$(id)</param>
              <param desc="folder">$(id)</param>

You can find more information in the following post: http://blog.eldblom.dk/2014/07/31/avoiding-downtime-while-rebuilding-your-lucene-search-indexes-in-sitecore-asp-net-cms/

Simplest way to search a Lucene index in Sitecore

This is the simplest code to search a Lucene index in Sitecore:

using (IndexSearchContext indexSearchContext = SearchManager.GetIndex("CustomIndex").CreateSearchContext())
{
    var query = new BooleanQuery();
    query.Add(
        new TermQuery(new Term(FieldNameLegacyId, legacyId)),
        BooleanClause.Occur.MUST);
    SearchHits hits = indexSearchContext.Search(query, int.MaxValue);
    List<Item> results =
        hits.FetchResults(0, int.MaxValue)
            .Select(r => r.GetObject<Item>())
            .Where(c => c != null && c.Parent != null)
            .ToList();

    if (results.Any())
    {
        return results.First().ID.ToGuid().ToString("N");
    }
}

Lucene Indexes in Sitecore 6.6 getting corrupted (ArgumentOutOfRangeException)

Have you ever seen this exception?

System.ArgumentOutOfRangeException 
Message: Non-negative number required. 
Parameter name: capacity Source: mscorlib 
  at System.Collections.Hashtable..ctor(Int32 capacity, Single loadFactor)
  at System.Collections.Hashtable.Clone() 
  at SupportClass.WeakHashTable.Clean() 
  at SupportClass.WeakHashTable.CleanIfNeeded() 
  at SupportClass.WeakHashTable.Add(Object key, Object value) 
  at Lucene.Net.Util.CloseableThreadLocal.Set(Object object) 
  at Lucene.Net.Index.TermInfosReader.GetThreadResources() 
  at Lucene.Net.Index.TermInfosReader.Get(Term term, Boolean useCache) 
  at Lucene.Net.Index.SegmentReader.DocFreq(Term t) 
  at Lucene.Net.Index.DirectoryReader.DocFreq(Term t) 
  at Lucene.Net.Search.Similarity.IdfExplain(Term term, Searcher searcher)
  at Lucene.Net.Search.TermQuery.CreateWeight(Searcher searcher) 
  at Lucene.Net.Search.BooleanQuery.BooleanWeight..ctor(BooleanQuery enclosingInstance, Searcher searcher) 
  at Lucene.Net.Search.BooleanQuery.CreateWeight(Searcher searcher) 
  at Lucene.Net.Search.Query.Weight(Searcher searcher) 
  at Lucene.Net.Search.Hits..ctor(Searcher s, Query q, Filter f, Sort o) 
  at Lucene.Net.Search.Searcher.Search(Query query, Sort sort) 
  at scSearchContrib.Searcher.QueryRunner.RunQuery(Query query, Boolean showAllVersions, String sortField, Boolean reverse, Int32 start, Int32 end) 
  at scSearchContrib.Searcher.QueryRunner.GetItems(IEnumerable`1 parameters, Boolean showAllVersions, String sortField, Boolean reverse, Int32 start, Int32 end) 

It means that your index is getting corrupted and the application pool needs to be recycled.
To avoid it, you can try this trick:

<setting name="MaxWorkerThreads" value="100"/>