When creating a view with a relationship to an entity reference field (to join the referred-to entity in the query), duplicates will be produced if the entity reference field is multi-valued (i.e. cardinality > 1). Users assume that checking the box at Advanced -> Query settings -> Distinct will solve this problem, but it doesn't.
This is because from a database perspective, the records actually aren't duplicates. Each different entity reference in the multi-valued field will produce a different row as the target IDs are different, even though this isn't necessarily reflected in the view itself if not all fields are included. Said another way, the view results may look the same on the front-end, even though they're not on the back-end.
It should be noted that this affects all entity types, whether it be taxonomy terms, nodes, paragraphs, custom entities, etc. There are several duplicates of this issue in various other queues that should be marked as such, with all efforts directed here.
Proposed solution
If the Distinct box is checked in the view configuration, after the query has executed, remove results whose entity IDs are already in the list.
Remaining tasks
Produce a patch to solve the basic problem.- Add support for pagers (as per #12).
- Add support for aggregation (as per the D7 version of Views Distinct).
- Add tests.
Original report
If you have a content type with a multivalue taxonomy term field and add a relationship based on that field in a view (in my case I want to use the term name as an argument) the resulting views result has duplicates if one or more of the nodes has multiple terms in that field.
The DISTINCT keyword does not help, since the tid is added as a field in the SELECT part of the query.
I have the following query:
SELECT DISTINCT node_field_data.sticky AS node_field_data_sticky, node_field_data.created AS node_field_data_created, node_field_data.nid AS nid, taxonomy_term_field_data_node_field_data.tid AS taxonomy_term_field_data_node_field_data_tid
FROM
{node_field_data} node_field_data
LEFT JOIN {taxonomy_index} taxonomy_index ON node_field_data.nid = taxonomy_index.nid
LEFT JOIN {taxonomy_term_field_data} taxonomy_term_field_data_node_field_data ON taxonomy_index.tid = taxonomy_term_field_data_node_field_data.tid
WHERE (node_field_data.promote = '1') AND (node_field_data.status = '1')
ORDER BY node_field_data_sticky DESC, node_field_data_created DESC
.. which produces duplicates, even though the DISTINCT keyword is present, because the taxonomy_term_field_data_node_field_data.tid makes the duplicate nodes in the results 'distinct'.
Ideally the query should just exclude the field from the SELECT part like this:
SELECT DISTINCT node_field_data.sticky AS node_field_data_sticky, node_field_data.created AS node_field_data_created, node_field_data.nid AS nid
FROM
{node_field_data} node_field_data
LEFT JOIN {taxonomy_index} taxonomy_index ON node_field_data.nid = taxonomy_index.nid
LEFT JOIN {taxonomy_term_field_data} taxonomy_term_field_data_node_field_data ON taxonomy_index.tid = taxonomy_term_field_data_node_field_data.tid
WHERE (node_field_data.promote = '1') AND (node_field_data.status = '1')
ORDER BY node_field_data_sticky DESC, node_field_data_created DESC
which works an nodes with multiple terms in the field are now only appearing once.
I have couldn't find a way to actually omit the field in the SELECT part of the query, and I'm also not sure, if it could create other issues if removed. But the tid in the select the DISTINCT keyword is useless..
Steps to reproduce.
1. add multivalue taxonomy term field to node type, and add a node with multiple terms selected
2. add a view which uses that field to add a relationship to the term (needed to use the term name as argument)
3. make the view results DISTINCT
4. optionally add the argument (contextual filter) that uses the relationship to filter the views result on the term name.
5. if no argument is given, the node with multiple terms in the field will appear multiple times