HASH Join and memory usage


  1. Does a HASH join always have to be requested via a hint, i.e. will the optimizer never choose a HASH join on its own? If so, is that because the optimizer is rule-based? And will the upcoming cost-based optimizer choose a HASH join over an NL (nested-loop) join when it finds the cost to be lower?

  2. A HASH join needs memory to hold the in-memory hash table. On which node is that memory used: the node running the Data service, the Index service, or the Query service?


Yes. HASH JOIN is available in the EE version. With the rule-based optimizer, a HASH JOIN is chosen only via hints. With the Cost-Based Optimizer, if statistics are available, it chooses a HASH JOIN or NL join based on the lower cost; if no cost information is available, it falls back to the rule-based approach.
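For reference, a hint looks like this in an ANSI JOIN (the keyspace and field names here are hypothetical, chosen only for illustration):

```sql
-- USE HASH(BUILD) asks the optimizer to hash-join, building the
-- in-memory hash table from `customers` (the right-hand side) and
-- probing it with `orders`. USE HASH(PROBE) would flip the roles.
SELECT o.order_id, c.name
FROM orders o
JOIN customers c USE HASH(BUILD) ON o.cust_id = c.cust_id;
```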

The hash table is built from the query predicate, on the query node where the query is running. At present it is an in-memory hash table only (no spill to disk).
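To make the memory behavior concrete, here is a minimal sketch of the two phases of a generic hash join (illustrative only, not Couchbase's actual implementation): the whole build side is held in an in-memory table by the process executing the join, which in this discussion is the Query service node.

```python
def hash_join(build_rows, probe_rows, build_key, probe_key):
    # Build phase: the entire build side is materialized in memory.
    # This table is what consumes memory on the node running the join.
    table = {}
    for row in build_rows:
        table.setdefault(row[build_key], []).append(row)

    # Probe phase: stream the probe side, emitting joined rows on match.
    for row in probe_rows:
        for match in table.get(row[probe_key], []):
            yield {**match, **row}

# Hypothetical data for illustration.
customers = [{"cust_id": 1, "name": "Ada"}, {"cust_id": 2, "name": "Bob"}]
orders = [{"order_id": 10, "cust_id": 1}, {"order_id": 11, "cust_id": 1}]

joined = list(hash_join(customers, orders, "cust_id", "cust_id"))
print(joined)
```

Because there is no spill to disk, the build side must fit entirely in the query node's memory for the duration of the join.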

cc @bingjie.miao

Thank you.

But there is no provision to assign a memory quota to the Query service. I researched this and found that the Query service competes for memory. Does this mean it can potentially hog all the memory on the node (if many queries are running)?

Is there a way to identify the actual memory used by a query that uses a HASH JOIN, so that we can ensure the nodes are sized correctly?

In 7.0 there will be a per-request memory quota, meaning that at startup you can enforce the maximum amount of memory taken by values for the processing of a query (there are several reasons why we prefer a per-request quota to a per-service quota, but that's too long to discuss here).
This includes memory used for values exchanged between operators, as well as memory for aggregates, sorts, and hash joins.
When the memory quota is on, at the end of a request the metrics report the maximum amount of value memory used to execute the request, so you can use that to size the request quota at development time.
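A hedged sketch of how this might look from the cbq shell in 7.0; the parameter name (`memory_quota`) and the metrics field shown are assumptions based on this thread, so check the 7.0 documentation before relying on them:

```
cbq> \SET -memory_quota 20;      -- assumed per-request quota setting
cbq> SELECT ... ;                -- run the request as usual
...
"metrics": { ..., "usedMemory": 13631488 }   -- reported after the request
```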

Thank you Marco.

Will both (the per-request memory quota and the metrics) be available only in 7.0? If so, is there absolutely no way in 6.x (metadata, statistics, etc.) to find out how much memory was used?

Looking at system tools like top, vmstat, etc. could be one way, but many of our queries are very short-lived, a few hundred milliseconds to a couple of seconds. In that case it is almost impossible to capture the memory usage, and hence to attribute it to the query.