on 08-09-2016 4:42 AM
Doing the following query:
select 'wat ' as x from dummy
union
select 'wat' as x from dummy
In other words, I'm doing a union on two rows, where one value has a trailing whitespace.
This is what I get back:
x | count(*) |
---|---|
wat | 1 |
Since the strings are not the same, I expect to get 2 different results. Is there something I'm missing?
Just out of fun, could you try and check the result for
select to_varchar( 'wat ') as x from dummy
union
select to_varchar('wat') as x from dummy
?
I suspect an implicit type conversion to CHAR is happening here.
You must be a registered user to add a comment. If you've already registered, sign in. Otherwise, register and sign in.
Nicolas, I believe you can avoid your issue by using "union all":
The simple "union" operator will check the results for a real difference and whitespace isn't sufficient to constitute a difference. While "union all" will not perform a comparison and therefore (as Lars suggested) refrain from an implicit conversion.
I was just using union as a way to build a simple test case. This is a problem that affects other operations.
Join
with
test1 as (
select 'wat ' as trail from dummy
),
test2 as (
select 'wat' as normal from dummy
)
select
trail
, normal
, length(trail) as trail_length
, length(normal) as normal_length
from test1
join test2 on test1.trail = test2.normal
which results in this:
trailing | normal | length(trailing) | length(normal) |
---|---|---|---|
wat | wat | 4 | 3 |
Group By
select x, count(*) from (
select 'wat ' as x from dummy
union all
select 'wat' as x from dummy
)
group by x
...results in one "wat, 2" row.
Distinct
select distinct x from (
select 'wat ' as x from dummy
union all
select 'wat' as x from dummy
)
... results in one "wat" row.
The big issue with this bug is that it hides underlying data quality problems and results in incorrect reporting. I wonder if this is actually a feature, intended to clean people's bad data. If that's the case, then there should be a way to turn it off through a config or something.
Ok, this is not a bug, but ANSI SQL standard behavior.
For comparisons (except for LIKE) of two strings, the shorter string gets right-padded with space before comparison.
Operations like UNION and JOIN perform string-wise comparisons and therefore implement this semantic.
HASH_MD5, LENGTH and e.g. DISTINCT on the other hand focus don't follow this approach.
If you look around for 'distinct char trailing' you'll find that this specific bit of ANSI conform behavior has spread quite a bit of confusion around nearly all DBMS out there.
You must be a registered user to add a comment. If you've already registered, sign in. Otherwise, register and sign in.
When the given select statement is executed, system performs the union on two sets and picks the distinct set of values. It seems the 'wat' and 'wat '(with trailing white-space) are treated as same value by the row engine and an output of 1 records. "trailing white-sapces are not considered"
whereas the values with leading white-space are treated as different values and its returning two records for the below query
select ' wat ' as x from dummy
union
select ' wat' as x from dummy
--
KRPK
You must be a registered user to add a comment. If you've already registered, sign in. Otherwise, register and sign in.
User | Count |
---|---|
87 | |
10 | |
10 | |
10 | |
7 | |
6 | |
6 | |
5 | |
5 | |
4 |
You must be a registered user to add a comment. If you've already registered, sign in. Otherwise, register and sign in.