High Performance PostgreSQL for Rails: Confusing `SUBSTR` arguments in `SCRUB_EMAIL` (page 52)

I’m having trouble understanding the intent of the start and count arguments to SUBSTR in the SCRUB_EMAIL function.

I understand that we want to return the first five or more characters from the MD5 value. The PostgreSQL documentation examples indicate that string indexes start at 1, not 0. Why does the SCRUB_EMAIL code call SUBSTR with a starting index of 0 and a count of 6 or greater? I see that the code produces the desired outcome, but wouldn’t it be equivalent and clearer to use a starting index of 1 and a count of 5 or greater?

1 Like

Hi @douglasshuang. Thanks for the detailed write-up again here. I don’t remember for sure whether I knew functions like substring() had 1-based indexes and not 0-based.

I studied that documentation page and didn’t explicitly see it mentioned, however it can be deduced from the examples, for example:

substring('Thomas' from 2 for 3)hom

1,2,3,4,5,6
----------
T,h,o,m,a,s

We can see it starts at position 2 “h” and goes for a count of 3 (inclusive of “m”) to get “hom”.

With that in mind, I agree the code/sql/scrub_email_function_full.sql function would be more sensible to start from 1 and go for a count of “5”.

I guess before with a count of 0 it doesn’t raise an error, but isn’t a valid index position, so that’s why it was set to 6 to get “5” characters plus 1 invalid position.

Start at 1, go for 5 is more clear!

Thanks again for finding this. I’m submitting this fix and hope it goes into a new version.

1 Like

Hi @andatki, that sounds good. Thank you!