solr导入mysql数据内存不足

在solr使用dataImport时,在测试机上由于内存太小,创建索引时不成功.

###开启batchSize

在data-config.xml文件的dataSource中加入batchSize=”-1”的配置.参考http://wiki.apache.org/solr/DataImportHandlerFaq

I’m using DataImportHandler with a MySQL database. My table is huge and DataImportHandler is going out of memory. Why does DataImportHandler bring everything to memory?

DataImportHandler is designed to stream row one-by-one. It passes a fetch size value (default: 500) to Statement#setFetchSize which some drivers do not honor. For MySQL, add batchSize property to dataSource configuration with value -1. This will pass Integer.MIN_VALUE to the driver as the fetch size and keep it from going out of memory for large tables.

Should look like:

1
2
<dataSource type="JdbcDataSource" name="ds-2" driver="com.mysql.jdbc.Driver" url="jdbc:mysql://localhost:8889/mysqldatabase" batchSize="-1" user="root" password="root"/>

###分批处理

在data-config.xml的query节点使用limit来分批处理数据,比如

1
2
query="select *  from tb_content limit ${dataimporter.request.begin},50000 "

使用了一个begin的参数来每次分批处理50000条记录.

然后访问的连接参数,第一次clean=true清理旧的索引,后面的clean=false不清理索引.访问连接如下

1
2
3
4
5
6
http://localhost:8080/solr/core1/dataimport?wt=json&commit=true&clean=true&command=full-import&begin=0
http://localhost:8080/solr/core1/dataimport?wt=json&commit=true&clean=false&command=full-import&begin=50000
http://localhost:8080/solr/core1/dataimport?wt=json&commit=true&clean=false&command=full-import&begin=100000
....


分批建立索引php脚本

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
<?php
/**
* 用于分批创建索引
*/

set_time_limit(0);

$http_url="http://localhost:8080/solr/core1/dataimport";

$arr = array();
$arr['wt'] ='json';
$arr['commit'] ='true';
$arr['clean'] ='true';
$arr['command'] ='full-import';
$arr['begin'] =0;

$url = $http_url.'?'.http_build_query($arr);

file_get_contents($url);
echo $url;

$arr['clean'] = 'false';

while( 1 ) {
$url = $http_url.'?'.'command=status&wt=json';

$content = file_get_contents($url);
$content = json_decode($content);
if( $content->status == 'busy' ){//执行
var_dump( http_build_query( $arr ) );
var_dump( $content->statusMessages );
}
if( $content->status == 'idle' ){//完成
$arr['begin'] = intval($arr['begin']) + 50000;

if( $arr['begin'] > 500000 ){
break;
}
$url = $http_url.'?'.http_build_query( $arr );
file_get_contents( $url );
echo $url ;
sleep( 5 );
}
sleep( 1 );
}


作者

张巍

发布于

2015-01-22

更新于

2015-01-22

许可协议

评论