Lucene: Indexing and Searching Text Files

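In this Lucene 6 example, we will learn how to create an index from files and then search the indexed documents for tokens. To learn about setting up Lucene, refer to the Lucene index and search example.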

Table of Contents

Project Structure
Index Text File Contents
Search Indexed Files
Demo
Source Code

Project Structure

I am creating a Maven project to run this example, and I have added these Lucene dependencies:

<properties>
    <lucene.version>6.6.0</lucene.version>
</properties>
<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-core</artifactId>
    <version>${lucene.version}</version>
</dependency>
<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-analyzers-common</artifactId>
    <version>${lucene.version}</version>
</dependency>
<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-queryparser</artifactId>
    <version>${lucene.version}</version>
</dependency>

The project structure now looks like this:

Figure: Lucene index files, project structure

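Note that we will use these two folders in the project:

  • inputFiles: contains all the text files we want to index.
  • indexedFiles: contains the Lucene index files. This is the index we will search.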

Index Text File Contents

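I will iterate over all the files in the inputFiles folder and index each of them. I am creating 3 fields per document:

  1. path: the file path [Field.Store.YES]
  2. modified: the file's last-modified timestamp
  3. contents: the file contents [Field.Store.YES]

If a field is indexed but not stored, documents can be found by searching on it, but its value is not returned with the search results. A value of YES makes Lucene store the original field value in the index.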
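
To make the stored-versus-indexed distinction concrete, here is a minimal sketch, assuming Lucene 6.6.0 is on the classpath as in the Maven setup above. The class StoredFieldSketch, the method valueFromHit, and the sample strings are hypothetical names introduced for illustration, and an in-memory RAMDirectory stands in for the indexedFiles folder. One document is indexed with a stored path and a contents field that is indexed only; a search on contents still finds the document, but the contents value cannot be read back from the hit.

```java
import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;

// Hypothetical demo class, not part of the article's project
class StoredFieldSketch
{
    // Indexes one sample document, searches the "contents" field for "lucene",
    // and returns the value of fieldName read back from the first hit
    // (null if that field was not stored or nothing matched).
    static String valueFromHit(String fieldName) throws IOException
    {
        try (Directory dir = new RAMDirectory())
        {
            try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer())))
            {
                Document doc = new Document();
                // Stored: the original value comes back with search results
                doc.add(new StringField("path", "inputFiles/sample.txt", Field.Store.YES));
                // Indexed but not stored: searchable, but the value is not returned
                doc.add(new TextField("contents", "lucene stores and searches text", Field.Store.NO));
                writer.addDocument(doc);
            }

            try (IndexReader reader = DirectoryReader.open(dir))
            {
                IndexSearcher searcher = new IndexSearcher(reader);
                TopDocs hits = searcher.search(new TermQuery(new Term("contents", "lucene")), 1);
                if (hits.scoreDocs.length == 0)
                {
                    return null;
                }
                return searcher.doc(hits.scoreDocs[0].doc).get(fieldName);
            }
        }
    }

    public static void main(String[] args) throws IOException
    {
        System.out.println("path     : " + valueFromHit("path"));
        System.out.println("contents : " + valueFromHit("contents"));
    }
}
```

The document is found in both calls because contents is indexed; only the stored path field yields a value, while contents comes back as null.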

LuceneWriteIndexFromFileExample.java

package com.how2codex.demo.lucene.file;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.document.LongPoint;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.IndexWriterConfig.OpenMode;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
public class LuceneWriteIndexFromFileExample
{
    public static void main(String[] args)
    {
        //Input folder
        String docsPath = "inputFiles";
        
        //Output folder
        String indexPath = "indexedFiles";
        //Input Path Variable
        final Path docDir = Paths.get(docsPath);
        try
        {
            //org.apache.lucene.store.Directory instance
            Directory dir = FSDirectory.open( Paths.get(indexPath) );
            
            //analyzer with the default stop words
            Analyzer analyzer = new StandardAnalyzer();
            
            //IndexWriter Configuration
            IndexWriterConfig iwc = new IndexWriterConfig(analyzer);
            iwc.setOpenMode(OpenMode.CREATE_OR_APPEND);
            
            //IndexWriter writes new index files to the directory
            IndexWriter writer = new IndexWriter(dir, iwc);
            
            //Recursively index all files under the input directory
            indexDocs(writer, docDir);
            writer.close();
        }
        catch (IOException e)
        {
            e.printStackTrace();
        }
    }
    
    static void indexDocs(final IndexWriter writer, Path path) throws IOException
    {
        //Directory?
        if (Files.isDirectory(path))
        {
            //Iterate directory
            Files.walkFileTree(path, new SimpleFileVisitor<Path>()
            {
                @Override
                public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) throws IOException
                {
                    try
                    {
                        //Index this file
                        indexDoc(writer, file, attrs.lastModifiedTime().toMillis());
                    }
                    catch (IOException ioe)
                    {
                        ioe.printStackTrace();
                    }
                    return FileVisitResult.CONTINUE;
                }
            });
        }
        else
        {
            //Index this file
            indexDoc(writer, path, Files.getLastModifiedTime(path).toMillis());
        }
    }
    static void indexDoc(IndexWriter writer, Path file, long lastModified) throws IOException
    {
        //Read the whole file as text (assumes the input files are UTF-8 encoded)
        String contents = new String(Files.readAllBytes(file), java.nio.charset.StandardCharsets.UTF_8);

        //Create lucene Document
        Document doc = new Document();

        doc.add(new StringField("path", file.toString(), Field.Store.YES));
        doc.add(new LongPoint("modified", lastModified));
        doc.add(new TextField("contents", contents, Store.YES));

        //Updates a document by first deleting the document(s)
        //containing the term and then adding the new document.
        //The delete and then add are atomic as seen by a reader
        //on the same index.
        writer.updateDocument(new Term("path", file.toString()), doc);
    }
}
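
The directory traversal in indexDocs can be tried in isolation, without Lucene. The sketch below uses only the JDK and collects every file under a folder with Files.walkFileTree, the same call the indexer relies on. The class FileWalkSketch and its method listFiles are hypothetical names introduced for illustration.

```java
import java.io.IOException;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical demo class, not part of the article's project
class FileWalkSketch
{
    // Returns every file found below root (or root itself if it is a single file),
    // sorted by path. Returns an empty list if root does not exist.
    static List<Path> listFiles(Path root) throws IOException
    {
        final List<Path> files = new ArrayList<>();
        if (Files.isDirectory(root))
        {
            Files.walkFileTree(root, new SimpleFileVisitor<Path>()
            {
                @Override
                public FileVisitResult visitFile(Path file, BasicFileAttributes attrs)
                {
                    //Called once for each non-directory entry in the tree
                    files.add(file);
                    return FileVisitResult.CONTINUE;
                }
            });
        }
        else if (Files.exists(root))
        {
            files.add(root);
        }
        Collections.sort(files);
        return files;
    }

    public static void main(String[] args) throws IOException
    {
        //Defaults to the article's input folder when no argument is given
        Path root = Paths.get(args.length > 0 ? args[0] : "inputFiles");
        for (Path file : listFiles(root))
        {
            System.out.println(file + " (last modified: " + Files.getLastModifiedTime(file).toMillis() + ")");
        }
    }
}
```

Running it against inputFiles before indexing is a quick way to check which files the indexer will pick up, including files in subfolders.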

Search Indexed Files

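In this section, we will search the index created in the previous step, that is, we will look for documents that contain the term given in the search query.

LuceneReadIndexFromFileExample.java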

package com.how2codex.demo.lucene.file;
import java.io.IOException;
import java.nio.file.Paths;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
public class LuceneReadIndexFromFileExample
{
    //directory contains the lucene indexes
    private static final String INDEX_DIR = "indexedFiles";
    public static void main(String[] args) throws Exception
    {
        //Create lucene searcher. It searches over a single IndexReader.
        IndexSearcher searcher = createSearcher();
        
        //Search indexed contents using search term
        TopDocs foundDocs = searchInContent("frequently", searcher);
        
        //Total found documents
        System.out.println("Total Results :: " + foundDocs.totalHits);
        
        //Print the path of each file that contains the search term
        for (ScoreDoc sd : foundDocs.scoreDocs)
        {
            Document d = searcher.doc(sd.doc);
            System.out.println("Path : "+ d.get("path") + ", Score : " + sd.score);
        }
    }
    
    private static TopDocs searchInContent(String textToFind, IndexSearcher searcher) throws Exception
    {
        //Create search query
        QueryParser qp = new QueryParser("contents", new StandardAnalyzer());
        Query query = qp.parse(textToFind);
        
        //search the index
        TopDocs hits = searcher.search(query, 10);
        return hits;
    }
    private static IndexSearcher createSearcher() throws IOException
    {
        Directory dir = FSDirectory.open(Paths.get(INDEX_DIR));
        
        //It is an interface for accessing a point-in-time view of a lucene index
        IndexReader reader = DirectoryReader.open(dir);
        
        //Index searcher
        IndexSearcher searcher = new IndexSearcher(reader);
        return searcher;
    }
}
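
QueryParser runs the query text through the analyzer it is given, here the same StandardAnalyzer that was used at index time, so a query such as "Frequently" is lowercased before it is matched against the index. The following is a minimal sketch of that analysis step, assuming Lucene 6.6.0 is on the classpath. AnalyzerSketch and tokens are hypothetical names, and the input string is only sample data.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

// Hypothetical demo class, not part of the article's project
class AnalyzerSketch
{
    // Returns the tokens StandardAnalyzer produces for the given text
    static List<String> tokens(String text) throws IOException
    {
        List<String> result = new ArrayList<>();
        try (Analyzer analyzer = new StandardAnalyzer();
             TokenStream stream = analyzer.tokenStream("contents", text))
        {
            CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
            //reset() must be called before the first incrementToken()
            stream.reset();
            while (stream.incrementToken())
            {
                result.add(term.toString());
            }
            stream.end();
        }
        return result;
    }

    public static void main(String[] args) throws IOException
    {
        // prints [frequently, asked, questions]
        System.out.println(tokens("Frequently asked Questions"));
    }
}
```

Using a different analyzer for searching than for indexing can make terms that are present in the files fail to match, which is why both listings construct a StandardAnalyzer.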

Demo

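  1. Let's create 3 files in the inputFiles folder with the following content.

    data1.txt

    Society is excited by the private cottage. Begin from the wound. A rich girl will do either or both. Declared as rejoiced together. He left himself a delighted impression. She also objected that the Mrs wanted to remove the gift of giving.

    data2.txt

    The question also explained her son's fondness for the preferred stranger. Setting up a shy office kept his females away from him. Improved with news, but how his own shy son cheered. Judge others quickly and ask leave first, she being in charge. Indeed, or the discourse was always reserved. In an instant the neglected preferred man who suffered pretense can be delivered. Perhaps fertile Brandon did imagine an agreeable cottage.

    data3.txt

    Or find the agreeable conclusion that was discovered, sportsman. John goes to work every week. The son elegantly uses the wedding to part. Asking too, matters formed the county prosecutor's office objection to talent. He sometimes moves in at once or has to depend. Hardly any frequently judged matter was decisively made simple. He did less in a year, but was uncertain.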
  2. Run LuceneWriteIndexFromFileExample.java using its main() method. Verify that the Lucene index files are created in the indexedFiles folder.
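  3. Suppose I want to search for documents containing the word "agreeable". Change the search term passed to searchInContent() in the main() method of LuceneReadIndexFromFileExample.java (the listing above searches for "frequently"), then run the class using its main() method. Verify the output:
    Total Results :: 2
    Path : inputFiles\data3.txt, Score : 0.47632512
    Path : inputFiles\data2.txt, Score : 0.38863274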
  4. Search for more terms and verify the results yourself.

Source Code

Use the link given below to download the source code.

Happy Learning!

saigon has written 1440 articles
