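Lucene UnifiedHighlighter Example

UnifiedHighlighter is the highest-performing of Lucene's highlighters, especially for large documents. In this tutorial, learn to highlight search terms in indexed documents/files.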

Table of Contents

Project Structure
Highlight Fragments Using UnifiedHighlighter
Write Files to Lucene Index
Source Code

Project Structure

I am creating a Maven project to run this example and have added these Lucene dependencies:

<properties>
	<lucene.version>6.6.0</lucene.version>
</properties>

<dependency>
	<groupId>org.apache.lucene</groupId>
	<artifactId>lucene-core</artifactId>
	<version>${lucene.version}</version>
</dependency>
<dependency>
	<groupId>org.apache.lucene</groupId>
	<artifactId>lucene-analyzers-common</artifactId>
	<version>${lucene.version}</version>
</dependency>
<dependency>
	<groupId>org.apache.lucene</groupId>
	<artifactId>lucene-queryparser</artifactId>
	<version>${lucene.version}</version>
</dependency>

<!-- To include highlight support-->
<dependency>
	<groupId>org.apache.lucene</groupId>
	<artifactId>lucene-highlighter</artifactId>
	<version>${lucene.version}</version>
</dependency>

The project structure now looks like this:

Lucene UnifiedHighlighter - Project Structure

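Note that we will use these two folders in the project:

  • inputFiles – contains all the text files that we want to index.
  • indexedFiles – contains the Lucene index files. We will run our searches against this index.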
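If you do not have any text files at hand, the inputFiles folder can be filled with a few throwaway files first. The following is only a minimal sketch that uses the plain JDK: the class name, file names, and sentences are placeholders I made up for illustration, not the data used in this article.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class SampleInputFiles {

    // Creates <baseDir>/inputFiles with three small text files and returns the folder
    public static Path createSampleFiles(Path baseDir) throws IOException {
        Path inputDir = Files.createDirectories(baseDir.resolve("inputFiles"));

        // Hypothetical file names and contents, chosen so that a search
        // for "Questions" matches some files but not all of them
        String[][] samples = {
            {"data1.txt", "Questions are welcome here. Lucene is a search library."},
            {"data2.txt", "This file has no match. It is indexed all the same."},
            {"data3.txt", "Some Questions have several answers."}
        };

        for (String[] sample : samples) {
            // Files.write creates the file or truncates an existing one
            Files.write(inputDir.resolve(sample[0]), sample[1].getBytes(StandardCharsets.UTF_8));
        }
        return inputDir;
    }

    public static void main(String[] args) throws IOException {
        // Write the sample files next to the project root
        Path inputDir = createSampleFiles(Paths.get("."));
        System.out.println("Sample files written to " + inputDir.toAbsolutePath().normalize());
    }
}
```

Run it once from the project root before building the index. The indexedFiles folder does not need to be created by hand if the index writer creates it for you.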

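Highlight Fragments Using UnifiedHighlighter

A Java example that uses UnifiedHighlighter to highlight the searched phrase or query terms in Lucene search results.

Here, I am searching the Lucene index created in the folder indexedFiles. In the next section, we will learn how to write this index.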

package com.how2codex.demo.lucene.highlight;

import java.nio.file.Paths;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.uhighlight.UnifiedHighlighter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class LuceneUnifiedHighlighterExample
{
	//This contains the lucene indexed documents
	private static final String INDEX_DIR = "indexedFiles";

	public static void main(String[] args) throws Exception 
	{
		//Get directory reference
		Directory dir = FSDirectory.open(Paths.get(INDEX_DIR));
		
		//Index reader - an interface for accessing a point-in-time view of a lucene index
		IndexReader reader = DirectoryReader.open(dir);
		
		//Create the Lucene searcher. It searches over a single IndexReader.
		IndexSearcher searcher = new IndexSearcher(reader);
		
		//analyzer with the default stop words
		Analyzer analyzer = new StandardAnalyzer();
		
		//Query parser to be used for creating TermQuery
		QueryParser qp = new QueryParser("contents", analyzer);
		
		//Create the query
		Query query = qp.parse("Questions");
		
		//Search the lucene documents
		TopDocs hits = searcher.search(query, 10, Sort.INDEXORDER);
		
		System.out.println("Search terms found in :: " + hits.totalHits + " files");
		
		UnifiedHighlighter highlighter = new UnifiedHighlighter(searcher, analyzer);
        String[] fragments = highlighter.highlight("contents", query, hits);

        for(String f : fragments)
        {
        	System.out.println(f);
        }
		
		//To find out which fragment belongs to which doc/file (fragments[i] corresponds to hits.scoreDocs[i])

		/*for (int i = 0; i < hits.scoreDocs.length; i++) 
        {
			int docid = hits.scoreDocs[i].doc;
            Document doc = searcher.doc(docid);
            
            String filePath = doc.get("path");
            System.out.println(filePath);
            System.out.println(fragments[i]);
        }*/

		//Release the reader and the directory
		reader.close();
		dir.close();
	}
}

Output:

Search terms found in :: 3 files
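 Questions girl private rich did or both. 
Questions also explained her son's preferences preferred strangers. 
Questions or neglected discovery summed up sportsmen.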
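Each fragment in the output is a whole sentence. As far as I know, that is because UnifiedHighlighter, unless configured otherwise, divides the field text into passages with the JDK's sentence BreakIterator and then returns the best-scoring passage. The sketch below uses only the JDK (no Lucene) to show that splitting step on its own. The class name, method name, and sample text are made up for illustration, and this is not the highlighter's actual code.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class SentencePassageSketch {

    // Splits text into sentences using the JDK sentence BreakIterator
    public static List<String> splitIntoSentences(String text) {
        BreakIterator iterator = BreakIterator.getSentenceInstance(Locale.ROOT);
        iterator.setText(text);

        List<String> sentences = new ArrayList<>();
        int start = iterator.first();
        // Walk from one sentence boundary to the next until the iterator is exhausted
        for (int end = iterator.next(); end != BreakIterator.DONE; start = end, end = iterator.next()) {
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                sentences.add(sentence);
            }
        }
        return sentences;
    }

    public static void main(String[] args) {
        // Hypothetical content of one indexed file
        String content = "Lucene is a search library. Questions are welcome here. Thanks for reading.";

        // Print each candidate passage on its own line
        for (String sentence : splitIntoSentences(content)) {
            System.out.println(sentence);
        }
    }
}
```

In the real highlighter, the passage that contains the query term (here, the sentence with "Questions") is the one that ends up in the fragments array.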

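Write Files to Lucene Index

I am iterating over all the files in the inputFiles folder and indexing them. I am creating 3 fields:

  1. path: the file path [Field.Store.YES]
  2. modified: the file's last-modified timestamp
  3. contents: the file content [Field.Store.YES]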

LuceneWriteIndexFromFileExample.java

package com.how2codex.demo.lucene.file;

import java.io.IOException;
import java.io.InputStream;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.document.LongPoint;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.IndexWriterConfig.OpenMode;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class LuceneWriteIndexFromFileExample 
{
	public static void main(String[] args)
	{
		//Input folder
		String docsPath = "inputFiles";
		
		//Output folder
		String indexPath = "indexedFiles";

		//Input Path Variable
		final Path docDir = Paths.get(docsPath);

		try 
		{
			//org.apache.lucene.store.Directory instance
			Directory dir = FSDirectory.open( Paths.get(indexPath) );
			
			//analyzer with the default stop words
			Analyzer analyzer = new StandardAnalyzer();
			
			//IndexWriter Configuration
			IndexWriterConfig iwc = new IndexWriterConfig(analyzer);
			iwc.setOpenMode(OpenMode.CREATE_OR_APPEND);
			
			//IndexWriter writes new index files to the directory
			IndexWriter writer = new IndexWriter(dir, iwc);
			
			//Walk the input path and index every file found under it
			indexDocs(writer, docDir);

			writer.close();
		} 
		catch (IOException e) 
		{
			e.printStackTrace();
		}
	}
	
	static void indexDocs(final IndexWriter writer, Path path) throws IOException 
	{
		//Directory?
		if (Files.isDirectory(path)) 
		{
			//Iterate directory
			Files.walkFileTree(path, new SimpleFileVisitor<Path>() 
			{
				@Override
				public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) throws IOException 
				{
					try 
					{
						//Index this file
						indexDoc(writer, file, attrs.lastModifiedTime().toMillis());
					} 
					catch (IOException ioe) 
					{
						ioe.printStackTrace();
					}
					return FileVisitResult.CONTINUE;
				}
			});
		} 
		else 
		{
			//Index this file
			indexDoc(writer, path, Files.getLastModifiedTime(path).toMillis());
		}
	}

	static void indexDoc(IndexWriter writer, Path file, long lastModified) throws IOException 
	{
		try (InputStream stream = Files.newInputStream(file)) 
		{
			//Create lucene Document
			Document doc = new Document();
			
			doc.add(new StringField("path", file.toString(), Field.Store.YES));
			doc.add(new LongPoint("modified", lastModified));
			doc.add(new TextField("contents", new String(Files.readAllBytes(file)), Store.YES));
			
			//Updates a document by first deleting the document(s)
			//containing the term and then adding the new
			//document. The delete and then add are atomic as seen
			//by a reader on the same index
			writer.updateDocument(new Term("path", file.toString()), doc);
		}
	}
}

Source Code

Use the link given below to download the source code.

Happy Learning!

