One of my children was exploring sounds at school and asked me to give them a hand.
One of the tasks set for them was writing down a list of word pairs that are spelt differently but pronounced identically.
While I managed to come up with a few, it was a bit more of a challenge than I originally thought.
Anyway, being the techie that I am (or at least think I am 🙂 ), I decided to see if technology could help me. Specifically, could SQL Server help me out?
I’d worked with SOUNDEX in Oracle many, many years ago, but I’d never really had reason to explore it in SQL Server – well, now was my chance.
So, not only does SQL Server have a SOUNDEX function, it also has a DIFFERENCE function that indicates how different SQL Server thinks two words may be from one another.
As you might expect with functions based around strings, collation is a factor.
SOUNDEX returns a four-character code that is based on how the string sounds when spoken in English. You can read more about how those codes are put together in the docs.
DIFFERENCE returns an integer value between 0 and 4, with 0 meaning weak or no similarity and 4 representing a very close match.
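To get a feel for the two functions before throwing a whole list at them, here's a minimal sketch using a pair of names that aren't in my test data (Robert and Rupert, the usual textbook Soundex pair). The column aliases are just ones I made up for illustration.

```sql
-- Both names keep their first letter (R) and then encode the consonants
-- that follow: b/p -> 1, r -> 6, t -> 3. Vowels are skipped.
SELECT [SOUNDEX(Robert)] = SOUNDEX('Robert')                -- R163
     , [SOUNDEX(Rupert)] = SOUNDEX('Rupert')                -- R163
     , [Difference]      = DIFFERENCE('Robert' , 'Rupert'); -- 4, because the two codes match exactly
```

When the two SOUNDEX codes are identical, DIFFERENCE comes back as 4; the lower values show up when only some of the characters in the codes line up.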
I experimented with a few different combinations, ranging from the identical to the clearly different – I even used some numbers.
Here’s my test data set
USE tempdb;
GO

SELECT Word
     , [SOUNDEX(word)]    = SOUNDEX(Word)
     , Compare
     , [SOUNDEX(Compare)] = SOUNDEX(Compare)
     , [Difference]       = DIFFERENCE(Word , Compare)
FROM (
    VALUES ('Talk' , 'Torque')
         , ('witch' , 'which')
         , ('green' , 'greene')
         , ('bee' , 'be')
         , ('piece' , 'peace')
         , ('right' , 'write')
         , ('absence' , 'absents')
         , ('awe' , 'oar')
         , ('ball' , 'bawl')
         , ('band' , 'banned')
         , ('beach' , 'beech')
         , ('seed' , 'cede')
         , ('heard' , 'herd')
         , ('licker' , 'liquor')
         , ('hertz' , 'hurts')
         , ('centre' , 'center')
         , ('carat' , 'carrot')
         , ('heroin' , 'heroine')
         , ('metal' , 'mettle')
         , ('lightening' , 'lightning')
         , ('Martin' , 'Luke')
         , ('you''ll' , 'yule')
         , ('123' , '123')
         , ('9183646262' , '9183646262')
         , ('a' , 'A')
         , ('boy' , 'girl')
) AS Sounds(Word , Compare)
ORDER BY [Difference] DESC;
And here are the results
We can see that it found most homophones to be a 4 – indicating a very strong similarity.
Interestingly enough, it compared numbers OK, and it got the difference between the strings ‘martin’ and ‘luke’ right: not similar at all. However, it thought the strings ‘boy’ and ‘girl’ had at least some similarity, which seemed odd. It didn’t get ‘right’ and ‘write’ correct, likely because they begin with different letters.
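The first-letter theory is easy to check by looking at the codes for that pair on their own. This is just a small sketch, and again the column aliases are my own.

```sql
-- SOUNDEX keeps the first letter of the string as the first character
-- of the code, so a word starting with R and a word starting with W
-- can never end up with the same code, however alike they sound.
SELECT [SOUNDEX(right)] = SOUNDEX('right')  -- R230
     , [SOUNDEX(write)] = SOUNDEX('write'); -- W630
```

Since the leading character is the letter itself rather than a sound, homophones with different initial letters (or a silent first letter, like the w in 'write') will never produce identical codes.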
On the whole very encouraging – but not perfect.
An interesting venture into the world of sounds in SQL Server, and into what my kids are up to at school.
Have a great day.