A small glimpse at Natural Language Processing & Phonetics

Kelvin Arellano
4 min read · Feb 11, 2021

Have you ever wondered how Alexa and Siri actually came to understand us? I remember how, in older movies, automated answering machines were the bane of everyone’s existence, but now people are becoming more and more accustomed to interacting with machines verbally. I wanted to know how that came to be and how it works.

Computers and people can interact through Natural Language Processing (NLP). It is what makes human language intelligible to computers: a blend of linguistics and computer science used to study language and build systems for analyzing and understanding it.

A basic understanding of how it works: text is transformed into something a machine can understand (text vectorization), machine learning algorithms are fed training data along with the expected outcomes (tags), and the resulting model is then tested on unseen data. To make the process easier, something called lemmatization is often used. This is the process of taking different forms of the same word, e.g. am, is, and been, and reducing them to the same root, be.
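To make that a little more concrete, here is a minimal sketch in Python of both steps, assuming NLTK (with its WordNet data downloaded) and scikit-learn are installed; the sentences are made up for the example.

```python
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer

# Lemmatization: different forms of the same word collapse to one root.
# (Requires nltk.download("wordnet") the first time.)
lemmatizer = WordNetLemmatizer()
for word in ["am", "is", "been"]:
    print(word, "->", lemmatizer.lemmatize(word, pos="v"))  # all reduce to "be"

# Text vectorization: raw sentences become numeric feature vectors
# that a machine learning algorithm can be trained on.
sentences = ["John inputs the data", "the data is clean"]
vectorizer = CountVectorizer()
vectors = vectorizer.fit_transform(sentences)
print(vectorizer.get_feature_names_out())  # the learned vocabulary
print(vectors.toarray())                   # word counts per sentence
```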

The branch of Natural Language Processing that I’m going to focus on today is phonetic analysis. It deals with how sounds are processed when we talk and how words are related to sounds, and more specifically with how one sound differs from another and how the melding of sounds conveys meaning.

A “small” rabbit hole that I fell into for a bit is articulatory phonetics. It deals with understanding how human vocal organs produce the sounds we get meaning from, and with reproducing those sounds artificially, not through recordings or electronic playback but by actually recreating the muscle and tongue movements. This could help close the uncanny valley and push human and computer interaction across that threshold into indistinguishability.

Acoustic phonetics is the branch that deals with the sound waves humans produce, and with processing and distinguishing them.
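As a small illustration of what that raw material looks like, here is a sketch of loading a recording and computing its spectrogram with SciPy; the file name is hypothetical and a mono WAV file is assumed.

```python
from scipy.io import wavfile
from scipy.signal import spectrogram

# Load a (hypothetical) mono recording of speech.
sample_rate, samples = wavfile.read("speech.wav")

# A spectrogram shows how acoustic energy is spread over frequency
# and time: the raw material for telling one speech sound from another.
freqs, times, power = spectrogram(samples, fs=sample_rate)

print(power.shape)  # (number of frequency bins, number of time frames)
```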

One of the unique difficulties with this branch of phonetics is that no two people use exactly the same set of words, pronounced exactly the same way, to produce exactly the same meaning. Even the same individual will vary in pronunciation and meaning. And that is just the individual level: zooming out to the bigger picture of language, there are thousands of languages on Earth, with estimates usually falling somewhere between three and seven thousand. A small comfort when presented with this fact is that eight languages (Mandarin, English, Hindi, Spanish, Russian, Arabic, Bengali, and Portuguese) are spoken by roughly half of the world’s population.

Different languages use different sound systems: they use different sounds and combine them in different ways. As an example, the “th” sound in think does not exist in French, and the “ch” and “sh” sounds are distinct in English but not in Spanish.

The analysis of spoken words can usually be broken down into six different levels, even for simple utterances, e.g. “John inputs the data” (a short code sketch of a couple of these levels follows the list).

  1. Pragmatically: the reason it was said. Is it declarative, stating a fact, or informative, providing context or a response?
  2. Semantically: the meaning of the words. John refers to a specific individual and inputs is the action to be taken.
  3. Syntactically: the order of the words to denote meaning. In English we can break down a sentence like this into verbs and nouns for further classification and to better understand the sentence structure as a whole.
  4. Morphologically: the internal structure of words. Inputs in this case has a prefix in-, a root word put, and a third-person marker -s.
  5. Phonologically: the distinct sounds that words make. English spelling is notoriously only a partial guide to pronunciation. In this case the h in John is silent, and the two a’s in data are pronounced differently.
  6. Phonetically: how the words sound together. This takes into account pauses and the stressing of syllables.
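
For a rough idea of what the syntactic and morphological levels look like in practice, here is what a parser such as spaCy reports for the example sentence, assuming spaCy 3.x and the en_core_web_sm model are installed:

```python
import spacy

# Assumes: pip install spacy && python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")
doc = nlp("John inputs the data")

for token in doc:
    # Surface form, part of speech (syntax), dictionary form (lemma),
    # and morphological features for each word.
    print(token.text, token.pos_, token.lemma_, token.morph)
```

On a typical English model, the third-person -s on inputs shows up in the verb’s morphological features, which lines up with the morphological level above.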

This kind of analysis is what formed the basis for our understanding of language, and it has served as the bedrock of linguistics. Over the last 30 or so years, however, language processing for things like automated voice recognition and chatbots has instead relied on techniques such as audio wave recognition: assigning the most likely sequence of words to a given stretch of audio.

The system then produces a transcription of the speech and makes its predictions based on that.
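As a toy illustration of that “most likely sequence” idea, with entirely made-up scores rather than a real recognizer, a system might combine an acoustic score (how well a candidate transcription matches the audio) with a language-model score (how plausible the word sequence is):

```python
# Hypothetical candidate transcriptions for one stretch of audio,
# each with a made-up acoustic log-probability (how well the sounds match).
candidates = {
    "John inputs the data": -4.1,
    "John in puts the data": -3.9,
    "Joan inputs the date": -5.0,
}

# A toy language model: made-up log-probabilities for how plausible
# each word sequence is on its own.
language_model = {
    "John inputs the data": -2.0,
    "John in puts the data": -6.5,
    "Joan inputs the date": -4.0,
}

# Pick the transcription with the best combined score.
best = max(candidates, key=lambda s: candidates[s] + language_model[s])
print(best)  # "John inputs the data"
```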

Ultimately this is a very deep branch of Natural Language Processing, and it’s worth taking a deeper look to understand the intricacies.
