[Hadoop] Slow performance of Elasticsearch-Hadoop + Spark SQL

[Hadoop] Slow performance of Elasticsearch-Hadoop + Spark SQL

Dmitriy Fingerman
Hi,

I see a big difference in performance between the same query expressed via Spark SQL and via curl.
With curl the query runs in less than a second; in Spark SQL it takes 15 seconds.
The index/type I am querying contains 1M documents.
Can you please explain why there is such a big difference in performance?
Are there any ways to tune the performance of Elasticsearch + Spark SQL?

Environment (everything is running on the same box):
    Elasticsearch 1.4.4
    elasticsearch-hadoop 2.1.0.BUILD-SNAPSHOT
    Spark 1.3.0

CURL:

curl -XPOST "http://localhost:9200/summary/intervals/_search" -d'
{
    "query" : {
        "filtered" : {
            "query" : { "match_all" : {}},
            "filter" : {
                "bool" : {
                    "must" : [
                        {
                            "term" : { "User" : "Robert Greene" }
                        },
                        {
                            "term" : { "DataStore" : "PROD_HK_HR" }
                        },
                        {
                            "term" : { "EventAffectedCount" : 56 }
                        }
                    ]
                }
            }
        }
    }
}'

Spark:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext
    import org.elasticsearch.spark.sql._   // adds esDF to SQLContext

    val sparkConf = new SparkConf().setAppName("Test1")

    // increasing scroll size to 5000 from the default 50 improved performance by 2.5 times
    sparkConf.set("es.scroll.size", "5000")

    val sc = new SparkContext(sparkConf)
    val sqlContext = new SQLContext(sc)

    // expose the Elasticsearch index/type as a DataFrame and register it for SQL
    val intv = sqlContext.esDF("summary/intervals")
    intv.registerTempTable("INTERVALS")

    val intv2 = sqlContext.sql("select EventCount, Hour      " +
                                      "from intervals               " +
                                      "where User = 'Robert Greene' " +
                                      "and DataStore = 'PROD_HK_HR' " +
                                      "and EventAffectedCount = 56  ")
    intv2.show(1000)

--
Please update your bookmarks! We have moved to https://discuss.elastic.co/
---
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [hidden email].
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/8fc0d384-23bd-4807-8eae-a2ef2011f6ed%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Re: [Hadoop] Slow performance of Elasticsearch-Hadoop + Spark SQL

Costin Leau
The best way is to use a profiler to understand where the time is spent.
Spark, while significantly faster than Hadoop, cannot compete with curl.
The latter is a single REST connection; the former starts a JVM, Scala, Akka and Spark,
which in turn triggers es-hadoop, which makes parallel calls against all the nodes, retrieves the data in JSON format,
converts it into Scala/Java objects and applies a schema on top for Spark SQL to run with.

If you turn on logging, you'll see that es-hadoop in fact makes multiple REST/curl calls.
With the JVM/Scala warmed up, you should see less than 15s; however, it depends on how much hardware you have available.
Note that the curl comparison is not really fair: adding a SQL layer on top of that is bound to cost you something.
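
To check how much of the 15s is warm-up rather than query time, one option is to run the same action several times in one JVM and compare the first run with the later ones. Below is a minimal sketch in plain Scala. `WarmupTiming` and `timeRuns` are hypothetical names used only for illustration and are not part of Spark or es-hadoop. The summing loop is a stand-in workload; in the job above it would be replaced by something like `intv2.collect()`.

```scala
// Hypothetical helper for illustration; not part of Spark or es-hadoop.
object WarmupTiming {

  /** Runs `body` `runs` times and returns the wall-clock duration of each run in milliseconds. */
  def timeRuns[A](runs: Int)(body: => A): Seq[Long] =
    (1 to runs).map { _ =>
      val start = System.nanoTime()
      body // by-name parameter: re-evaluated on every run
      (System.nanoTime() - start) / 1000000L
    }

  def main(args: Array[String]): Unit = {
    // Stand-in workload (assumption); in the Spark job this would be e.g. intv2.collect().
    val durations = timeRuns(5) { (1 to 1000000).map(_.toLong).sum }

    // The first run typically includes class loading and JIT compilation.
    durations.zipWithIndex.foreach { case (ms, i) =>
      println(s"run ${i + 1}: $ms ms")
    }
  }
}
```

If the later runs are much faster than the first, most of the gap is start-up cost. If they stay slow, a profiler or the es-hadoop logs are the next place to look.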


On 6/1/15 8:47 PM, Dmitriy Fingerman wrote:

> Hi,
>
> I see a big difference in performance between the same query expressed via Spark SQL and via curl.
> With curl the query runs in less than a second; in Spark SQL it takes 15 seconds.
> The index/type I am querying contains 1M documents.
> Can you please explain why there is such a big difference in performance?
> Are there any ways to tune the performance of Elasticsearch + Spark SQL?

--
Costin
