This is essentially just Lucene's analyzer chain; Solr merely makes it convenient to use: you wire the tokenizer and filters together by editing an XML config file. Sometimes, though, we need to use the same chain directly from our own code. This post records how to do that.
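For comparison, in Solr the chain is just a fieldType declaration in schema.xml, roughly like the sketch below (the fieldType name and dictionary paths are illustrative, and it assumes the mmseg4j-solr jar is on the classpath); the rest of this post builds the same chain programmatically:

<fieldType name="text_cn" class="solr.TextField">
  <analyzer>
    <tokenizer class="com.chenlb.mmseg4j.solr.MMSegTokenizerFactory" mode="complex" dicPath="dict"/>
    <filter class="solr.SynonymFilterFactory" synonyms="dict/synonyms.txt" expand="true" ignoreCase="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="dict/stop_words_cn.txt"/>
  </analyzer>
</fieldType>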
Let's look at the complete code first (Groovy):
import org.apache.lucene.analysis.Analyzer
import org.apache.lucene.analysis.TokenFilter
import org.apache.lucene.analysis.Tokenizer
import org.apache.lucene.analysis.core.StopFilterFactory
import org.apache.lucene.analysis.synonym.SynonymFilterFactory
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute
import org.apache.lucene.analysis.util.ClasspathResourceLoader
// MMSegTokenizerFactory comes from the mmseg4j-solr package
import com.chenlb.mmseg4j.solr.MMSegTokenizerFactory

class MyAnalyzer {

    def analyzer = new Analyzer() {
        @Override
        protected TokenStreamComponents createComponents(String fieldName) {
            def loader = new ClasspathResourceLoader()

            // create the tokenizer
            def factory = new MMSegTokenizerFactory(["mode": "complex", "dicPath": "dict"])
            factory.inform(loader)
            Tokenizer tokenizer = factory.create()

            // create the token filters, each one wrapping the previous stage
            factory = new SynonymFilterFactory(["synonyms": "dict/synonyms.txt", "expand": "true", "ignoreCase": "true"])
            factory.inform(loader)
            TokenFilter filter = factory.create(tokenizer)

            factory = new StopFilterFactory(["ignoreCase": "true", "words": "dict/stop_words_cn.txt"])
            factory.inform(loader)
            filter = factory.create(filter)

            return new TokenStreamComponents(tokenizer, filter)
        }
    }

    def tokenize(String text) {
        def tokens = []
        def ts = analyzer.tokenStream("text", text)
        // the attribute object is registered once and reused for every token
        def termAttr = ts.addAttribute(CharTermAttribute.class)
        ts.reset()
        while (ts.incrementToken()) {
            tokens.add(termAttr.toString())
        }
        ts.end()
        ts.close()
        return tokens
    }

    public static void main(String[] args) {
        MyAnalyzer analyzer = new MyAnalyzer()
        println(analyzer.tokenize("我是一个粉刷匠"))
    }
}
A few key points:
- The custom analyzer must extend Analyzer and implement the createComponents method
- A ResourceLoader is needed to load the data files (dictionary, synonyms, stop words); here ClasspathResourceLoader reads them from the classpath
- The example chains three factories, but you can chain as many as you need: TokenizerFactory, SynonymFilterFactory, StopFilterFactory, ...
- Extracting the tokens is a little unusual: you read them through an Attribute class, CharTermAttribute, while driving the TokenStream yourself (see the sketch after this list)
- The TokenStream must be consumed in exactly this order (Lucene's contract...): reset, incrementToken, end, close
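To make that contract concrete, here is a minimal, self-contained sketch (my own illustration, using the stock StandardAnalyzer rather than the custom chain above) that walks reset → incrementToken → end → close and also reads token offsets:

import org.apache.lucene.analysis.standard.StandardAnalyzer
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute

def analyzer = new StandardAnalyzer()
def ts = analyzer.tokenStream("text", "hello lucene analyzer chain")
def term = ts.addAttribute(CharTermAttribute.class)    // reused; holds the current token's text
def offset = ts.addAttribute(OffsetAttribute.class)    // reused; holds the current token's offsets
try {
    ts.reset()                                          // required before the first incrementToken()
    while (ts.incrementToken()) {                       // returns false when the stream is exhausted
        println "${term} ${offset.startOffset()}-${offset.endOffset()}"
    }
    ts.end()                                            // lets filters record end-of-stream state
} finally {
    ts.close()                                          // releases the underlying reader
}

The attribute objects never change identity: each call to incrementToken() overwrites their contents with the next token, which is why tokenize() above copies termAttr.toString() into a list instead of keeping the attribute itself.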