http://juanggrande.wordpress.com/2010/12/20/solr-index-size-analysis/
In this post I’m going to talk about a set of benchmarks that I’ve run with Solr. The goal is to see how each parameter defined in the schema affects the size of the index and the performance of the system.
The first step was to fetch the set of documents that I was going to use in the tests. I wanted the documents to be composed of real text, so I started to look for sources on the Internet. The first one that I really liked was Twitter. They provide a REST API that allows you to read a continuous stream of tweets, composed of approximately 1% of all the public tweets. Each tweet is expressed as a JSON object and carries metadata about the message and the author. While this source allowed me to get a good number of documents in a short time (about 1.7 million tweets in 2 days), they were really small, so I started to look for a source of bigger documents and finally chose Wikipedia. I downloaded the documents through HTTP using the “Random Article” feature on their site, obtaining about 160,000 articles in a couple of days. At the time of writing, the site download.wikipedia.org, which provides an easy way of downloading a bunch of articles, was out of service.
The next step was to design the schema. Because one of the objectives is to see how each change in the schema affects the size of the index, I used many different combinations of parameters, so as to measure the influence of each one of them. In each case, the list of stop words was populated with the top 100 terms of each set of documents, obtained from the administration panel of Solr. For both data-sets, the “omitNorms”, “termVectors” and “stopWords” parameters refer to the “text” field. In all cases, the value of the parameters “termOffsets” and “termPositions” is the same as that of “termVectors”.
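To make the parameter combinations concrete, here is a minimal sketch in Python that enumerates them and prints a candidate definition of the “text” field for each one. It is only an illustration, not the schema actually used in the tests: the field type names “text_stop” and “text_nostop” are hypothetical, stored="true" is an assumption, and the post does not say that all eight combinations were benchmarked. Note that in Solr stop words are configured in the analyzer of the field type rather than as an attribute of the field, which is why “stopWords” shows up here as a choice of field type.

```python
from itertools import product
import xml.etree.ElementTree as ET


def text_field_variants():
    """Yield (label, field XML) for every combination of the three switches."""
    for omit_norms, term_vectors, stop_words in product([False, True], repeat=3):
        # termPositions and termOffsets always follow termVectors, as in the post.
        vectors_flag = "true" if term_vectors else "false"
        field = ET.Element("field", {
            "name": "text",
            # Hypothetical field types: one with a stop filter, one without.
            "type": "text_stop" if stop_words else "text_nostop",
            "indexed": "true",
            "stored": "true",  # assumption, not stated in the post
            "omitNorms": "true" if omit_norms else "false",
            "termVectors": vectors_flag,
            "termPositions": vectors_flag,
            "termOffsets": vectors_flag,
        })
        label = (f"omitNorms={omit_norms} termVectors={term_vectors} "
                 f"stopWords={stop_words}")
        yield label, ET.tostring(field, encoding="unicode")


if __name__ == "__main__":
    for label, xml in text_field_variants():
        print(label)
        print("  " + xml)
```

Each printed element could be pasted into the fields section of a schema.xml, provided that field types with those names are defined there.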
In the first figure you can see the size of the index for each schema for the Twitter data-set, and which proportion of the index corresponds to each parameter. Remember that this data-set has lots of documents (about 1.7 million) but each one is small (240 bytes on average). There are several remarkable things here. First, the space occupied by the term vectors (~280 MiB when not using stop words) is almost equal to the space occupied by the inverted index itself (~240 MiB). Second, the space saved by omitting norms is almost negligible (~2 MiB). Third, the space saved by using stop words more than doubles as a share of the index when storing term vectors, going from about 4% of the index to about 10%. Finally, the space occupied by the stored fields (~340 MiB) is considerably bigger than the space occupied by the inverted index itself.
In the second figure you can see the same information for the Wikipedia data-set. The space occupied by the norms is still negligible (< 1 MiB). However, the share of the index taken up by stop words has increased to about 22% of the index size when not storing term vectors, and about 25% when storing them. This time, the space occupied by the term vectors (~1067 MiB) is almost three times the space occupied by the inverted index itself (~380 MiB). Finally, the size of the stored documents (~6330 MiB) is more than four times the size of the index with term vectors stored.
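The comparisons quoted for both data-sets can be checked with a few lines of Python. This is just a sketch: the sizes are the approximate MiB values given in the text, and the dictionary keys and the function name are my own.

```python
# Approximate sizes in MiB, as quoted in the text.
SIZES_MIB = {
    "twitter": {"inverted_index": 240, "term_vectors": 280, "stored_fields": 340},
    "wikipedia": {"inverted_index": 380, "term_vectors": 1067, "stored_fields": 6330},
}


def ratios(sizes):
    """Return the size ratios that the text compares for one data-set."""
    inverted = sizes["inverted_index"]
    with_vectors = inverted + sizes["term_vectors"]
    return {
        "term_vectors / inverted_index": sizes["term_vectors"] / inverted,
        "stored_fields / inverted_index": sizes["stored_fields"] / inverted,
        "stored_fields / (inverted_index + term_vectors)":
            sizes["stored_fields"] / with_vectors,
    }


if __name__ == "__main__":
    for name, sizes in SIZES_MIB.items():
        print(name)
        for label, value in ratios(sizes).items():
            print(f"  {label}: {value:.2f}")
```

For Wikipedia the term vectors come out at roughly 2.8 times the inverted index, and the stored documents at roughly 4.4 times the index with term vectors. For Twitter the term vectors are roughly 1.2 times the inverted index.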
At this point, we can state some conclusions concerning the size of the index:
1. When the number of fields is small, the size of the norms is negligible, independently of the size and number of documents.
2. When the documents are large, stop words help reduce the size of the index significantly. Two things are worth noting here. First, the documents fetched from Wikipedia are written using traditional language, and are all written in English, while the documents fetched from Twitter are written using modern language, and in many different languages. Second, I didn’t measure the precision and recall of the system when using stop words, so it is possible that the findability in a real scenario won’t be good.
3. If you’re storing the documents, and they are big enough, storing the term vectors adds relatively little to the total size, so if you’re using a feature such as highlighting and you are looking for good performance, you should store them. If you’re not storing documents, or your documents are small, you should think twice before storing the term vectors, because they’re going to increase the size of your index significantly.
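The third conclusion can be illustrated with the approximate figures quoted earlier. The sketch below computes how much the term vectors add relative to what is on disk without them, with and without stored documents. The function name and the layout of the data are my own, and the numbers are rounded values read from the text, so take the percentages as rough.

```python
def term_vector_growth(inverted_index, term_vectors, stored_fields=0.0):
    """Relative growth caused by adding term vectors to a baseline index."""
    baseline = inverted_index + stored_fields
    return term_vectors / baseline


# (inverted index, term vectors, stored fields), approximate MiB from the text.
DATASETS = {
    "Twitter": (240, 280, 340),
    "Wikipedia": (380, 1067, 6330),
}

if __name__ == "__main__":
    for name, (inverted, vectors, stored) in DATASETS.items():
        without_docs = term_vector_growth(inverted, vectors)
        with_docs = term_vector_growth(inverted, vectors, stored)
        print(f"{name}: +{without_docs:.0%} without stored documents, "
              f"+{with_docs:.0%} with stored documents")
```

With these numbers, term vectors roughly double the Twitter index and nearly quadruple the Wikipedia index when documents are not stored, but add only about 16% to the Wikipedia index when the documents are stored.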
I hope you find this post useful. Currently I’m working on a set of benchmarks to measure the influence of each one of these parameters on the performance of the system, so if you liked this post, stay tuned!