compression flag is not working


Anita
Compression flag is not working.

ElasticSearch version : 0.19.1
Gateway : Hadoop
Hadoop version: 0.20.2.cdh3u3

We are planning to use ElasticSearch with the Hadoop gateway to store and search log data.
There will be a few hundred terabytes of log data.


TEST1:
Index: order

INDEX STATUS
{
  state: open
  settings: {
    index.number_of_shards: 10
    index.number_of_replicas: 0
    index.version.created: 190199
    index._source.compress: true
  }
}
The compression flag was set by running this curl request:
curl -XPUT localhost:9200/order/_settings -d '{"index": {"_source" : {"compress" : true} }}'
Then I ran optimize:
curl -XPOST 'http://localhost:9200/order/_optimize?max_num_segments=5'

Order
size: 1.2gb (1.2gb)
docs: 349345 (349345)

Each log message is about 1K.

I am writing the raw data to a file to check how compression works.
Raw data : 381978518 Apr 20 12:40 test_es.log (uncompressed, ~380M)

Data written to HDFS : 1304717764 bytes (1.2G)

The raw data size is only about 380M, so how has ElasticSearch written about 1.2G of data even with the compress flag enabled?
ElasticSearch wrote about 3 times the raw data size.
I do not see the data getting compressed at all. In addition to the log message fields, ElasticSearch writes a few more fields, but how can the ElasticSearch data size be 3 times the original data size? Am I missing anything here?
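As a rough reference point (using a synthetic, made-up log line, not our real data), repetitive log-style text usually compresses several-fold with gzip, so if the _source were being compressed I would expect the stored source portion to shrink noticeably:

```shell
# Build roughly 1K of synthetic, log-like text (hypothetical fields, for illustration only)
line='2012-04-20 12:40:00 INFO order-service trade executed id=12345 qty=100 px=9.95'
for i in $(seq 1 12); do echo "$line field$i=value$i"; done > sample_msg.txt

# Compare the raw size against the gzip-compressed size
raw=$(wc -c < sample_msg.txt)
gzip -c sample_msg.txt > sample_msg.txt.gz
comp=$(wc -c < sample_msg.txt.gz)
echo "raw=${raw} bytes, compressed=${comp} bytes"
```

Of course, the compress flag only covers the stored _source field; the inverted index and other stored metadata are not compressed by it, which could account for some, but probably not all, of the 3x overhead.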

Each document contains

_index: order
_type: trade
_id: 5c10PkonSeOQrBZkH8JVpg
_version: 1
_score: 1
_source: {

     // source contains the log message. It has 12 fields. The log message is about 1K.

}



TEST2:
I tested with/without compression.

Cluster : 3 node cluster
Replication : 2
Number of shards : 10
Gateway : hadoop (5 node cluster, replication 3)

Compression Test

INDEX META DATA
{
state: open
settings: {
index.number_of_shards: 10
index.number_of_replicas: 2
index.version.created: 190199
index._source.compress: true
}

index._source.compress: true

order
size: 2.2gb (6.7gb)
docs: 627336 (627336)

Writing the same log data to a file:

685678405 Apr 20 10:44 test_es.log ( size 685M)  

Raw data is only about 685M

Without Compression
order
size: 2.2gb (6.8gb)
docs: 631361 (631361)

690077730 Apr 20 12:45 test_es.log (SIZE 690M)

I do not see any difference in data size with or without compression when I run "du -sh" on the data directory.
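Besides "du -sh", the per-index sizes can also be read from the status and stats APIs (a sketch; it assumes the cluster is reachable on localhost:9200). These report the primary size and the total size including replicas separately, so replica copies are not mistaken for uncompressed data:

```shell
# Index status: store size per index (primary vs total including replicas)
curl -s 'http://localhost:9200/order/_status?pretty'

# Indices stats: document count and store size
curl -s 'http://localhost:9200/order/_stats?pretty'
```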

Is this the right way to set the compression flag?
curl -XPUT localhost:9200/order/_settings -d '{"index": {"_source" : {"compress" : true} }}'
curl -XPUT localhost:9200/order/_settings -d '{"index": {"_source" : {"compress_threshold" : 10000} }}'

I really appreciate any help on this.

Please give me some guidelines on how to scale to large data volumes.


Re: compression flag is not working

Michael Sick
Do you have any fields included in the _source? Compression does not happen on the index overall, just those fields included in the _source field.

http://www.elasticsearch.org/guide/reference/mapping/source-field.html 
The _source field is an automatically generated field that stores the actual JSON that was used as the indexed document. 

...

Includes / Excludes

Allows you to specify paths in the source that will be included / excluded when it is stored, supporting * as a wildcard annotation.



On Fri, Apr 20, 2012 at 6:49 PM, Anita <[hidden email]> wrote:
--
View this message in context: http://elasticsearch-users.115913.n3.nabble.com/compression-flag-is-not-wrking-tp3927247p3927247.html
Sent from the ElasticSearch Users mailing list archive at Nabble.com.


Re: compression flag is not working

Igor Motov-3
_source is a field, so it should be configured in the mapping, like this:

curl  -XPUT http://localhost:9200/order/trade/_mapping -d '{
  "trade" : {
    "_source" : { "compress" : true }
  }
}'
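Note that compression is applied at indexing time, so documents indexed before the mapping change will remain uncompressed until they are re-indexed. After applying the mapping you can read it back to confirm the flag took effect (assuming the cluster is on localhost:9200):

```shell
# Read the mapping back; the _source section should now show "compress" : true
curl -s 'http://localhost:9200/order/trade/_mapping?pretty'
```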


On Wednesday, April 25, 2012 8:24:31 AM UTC-4, Michael Sick wrote: