Statistics

Hippo maintains data distribution statistics internally and uses them to choose the best way to execute each query. The command below checks or updates these statistics. Note that updating statistics is resource-intensive, and Hippo refreshes them automatically once the amount of changed data reaches a certain threshold, so there is generally no need to trigger an update manually.

curl -u shiva:shiva -XPOST 'localhost:8902/hippo/v1/{table}/_analyze_db?pretty' -H 'Content-Type: application/json' -d'{
  "is_update" : false,
  "columns" : ["word_count", "book_id"],
  "wait_for_completion" : true,
  "timeout" : "10m"
}';

Result:

{
  "job_id" : "0725c69b15734f3fa82abcf7fdf5fd9b",
  "job_status" : "SHIVA_JOB_SUCCESS",
  "task_results" : [
    {
      "id" : "d53e6662c3474a7193fe099a82012175",
      "status" : "TASK_SUCCESS",
      "server" : "172.29.203.189:27841",
      "execute_time" : 0.0,
      "column statistics" : [
        {
          "column_id" : 0,
          "num_distinct_keys" : 300,
          "num_keys" : 300,
          "null_frequency" : 0.0,
          "correlation" : 0.0
        },
        {
          "column_id" : 1,
          "num_distinct_keys" : 100,
          "num_keys" : 300,
          "null_frequency" : 0.0,
          "correlation" : 0.0
        }
      ]
    }
  ]
}
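
As a rough illustration of how a client might read this result, the sketch below parses a trimmed copy of the sample response and derives the share of distinct values per column. The function name summarize_column_statistics and the "distinct ratio" metric are assumptions introduced here for illustration; they are not part of the Hippo API.

```python
import json

# Trimmed copy of the sample result above; only the fields used here are kept.
response_text = """
{
  "job_id": "0725c69b15734f3fa82abcf7fdf5fd9b",
  "job_status": "SHIVA_JOB_SUCCESS",
  "task_results": [
    {
      "id": "d53e6662c3474a7193fe099a82012175",
      "status": "TASK_SUCCESS",
      "column statistics": [
        {"column_id": 0, "num_distinct_keys": 300, "num_keys": 300, "null_frequency": 0.0},
        {"column_id": 1, "num_distinct_keys": 100, "num_keys": 300, "null_frequency": 0.0}
      ]
    }
  ]
}
"""


def summarize_column_statistics(response):
    """Return {column_id: distinct_ratio} across all task results.

    distinct_ratio (num_distinct_keys / num_keys) is a hypothetical derived
    metric, not a field returned by Hippo.
    """
    summary = {}
    for task in response.get("task_results", []):
        # The key contains a space, exactly as it appears in the sample output.
        for col in task.get("column statistics", []):
            num_keys = col["num_keys"]
            ratio = col["num_distinct_keys"] / num_keys if num_keys else 0.0
            summary[col["column_id"]] = ratio
    return summary


response = json.loads(response_text)
summary = summarize_column_statistics(response)
for column_id, ratio in sorted(summary.items()):
    print(f"column {column_id}: distinct ratio {ratio:.2f}")
```

A ratio close to 1 means almost every key in the column is distinct, while a low ratio means values repeat often.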

Parameter description:

Parameter	Description	Required
database_name	Database in which the destination table is located	No, defaults to the "default" database
table_name	Destination table name (supplied as {table} in the request URL)	Yes
is_update	Whether to trigger a statistics update	No, defaults to false
columns	Columns whose statistics are checked	No, defaults to all columns; only takes effect when is_update is set to true
wait_for_completion	Whether to wait until the job is done	No, defaults to true
timeout	Operation timeout	No, defaults to 5 minutes

Table 32 Statistics (RESTful API)
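
The following sketch shows one way the parameters above could be assembled into a request, mirroring the curl example. It only builds the URL and JSON body and does not send anything. The helper name build_analyze_request, the table name "books", and the explicit http:// scheme are assumptions made for illustration. database_name is left out because this section does not show how it is passed.

```python
import json


def build_analyze_request(table_name, is_update=False, columns=None,
                          wait_for_completion=True, timeout=None,
                          host="localhost:8902"):
    """Return (url, body) for the statistics endpoint used in the curl example.

    Hypothetical helper: optional fields that are not supplied are omitted
    so that the server-side defaults from the parameter table apply.
    """
    url = f"http://{host}/hippo/v1/{table_name}/_analyze_db?pretty"
    body = {
        "is_update": is_update,
        "wait_for_completion": wait_for_completion,
    }
    if columns:
        # Omitting "columns" falls back to the documented default (all columns).
        body["columns"] = list(columns)
    if timeout is not None:
        # Omitting "timeout" falls back to the documented default (5 minutes).
        body["timeout"] = timeout
    return url, json.dumps(body)


# "books" is a placeholder table name; the columns come from the curl example.
url, body = build_analyze_request(
    "books",
    is_update=True,
    columns=["word_count", "book_id"],
    timeout="10m",
)
print(url)
print(body)
```

The returned URL and body can then be sent with any HTTP client, using the same basic authentication credentials and Content-Type header as the curl command.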