Speech recognition

Speech recognition is the interdisciplinary subfield of computational linguistics that develops methodologies and technologies that enable the recognition and translation of spoken language into text by computers.

Voice user interface

Speech recognition applications include voice user interfaces such as voice dialing (e.g. "call home"), call routing (e.g. "I would like to make a collect call"), domotic appliance control, search (e.g. find a podcast where particular words were spoken), simple data entry (e.g., entering a credit card number), preparation of structured documents (e.g. a radiology report), determining speaker characteristics, speech-to-text processing (e.g., word processors or emails), and aircraft (usually termed direct voice input).
A voice-user interface (VUI) makes spoken human interaction with computers possible, using speech recognition to understand spoken commands and questions, and typically text to speech to play a reply.

Speaker recognition

The term voice recognition or speaker identification refers to identifying the speaker, rather than what they are saying.
The term voice recognition can refer to speaker recognition or speech recognition. Speaker verification (also called speaker authentication) contrasts with identification, and speaker recognition differs from speaker diarisation (recognizing when the same speaker is speaking).

Deep learning

Most recently, the field has benefited from advances in deep learning and big data. Many aspects of speech recognition have been taken over by a deep learning method called long short-term memory (LSTM), a recurrent neural network architecture published by Sepp Hochreiter and Jürgen Schmidhuber in 1997.
Deep learning architectures such as deep neural networks, deep belief networks and recurrent neural networks have been applied to fields including computer vision, speech recognition, natural language processing, audio recognition, social network filtering, machine translation, bioinformatics, drug design, medical image analysis, material inspection and board game programs, where they have produced results comparable to and in some cases superior to human experts.

SoundHound

These speech industry players include Google, Microsoft, IBM, Baidu, Apple, Amazon, Nuance, GoVivace Inc., SoundHound, and iFLYTEK, many of which have publicized the core technology in their speech recognition systems as being based on deep learning.
SoundHound Inc., founded in 2005, is an audio and speech recognition company.

Nuance Communications

These speech industry players include Google, Microsoft, IBM, Baidu, Apple, Amazon, Nuance, GoVivace Inc., SoundHound, and iFLYTEK, many of which have publicized the core technology in their speech recognition systems as being based on deep learning.
This permitted the use of the system, known as speaker-independent natural-language speech recognition (abbreviated SI-NLSR or just NLSR), for call automation.

Frederick Jelinek

By the mid-1980s, IBM's Fred Jelinek and his team had created a voice-activated typewriter called Tangora, which could handle a 20,000-word vocabulary. Jelinek's statistical approach put less emphasis on emulating the way the human brain processes and understands speech in favor of statistical modeling techniques such as HMMs.
Frederick Jelinek (18 November 1932 – 14 September 2010) was a Czech-American researcher in information theory, automatic speech recognition, and natural language processing.

Windows Speech Recognition

Huang went on to found the speech recognition group at Microsoft in 1993.
Windows Speech Recognition (WSR) is a speech recognition component developed by Microsoft for the Windows Vista operating system that enables the use of voice commands to control the desktop user interface; dictate text in electronic documents and email; navigate websites; perform keyboard shortcuts; and to operate the mouse cursor.

CMU Sphinx

Raj Reddy's former student, Xuedong Huang, developed the Sphinx-II system at CMU.
CMU Sphinx, also called simply Sphinx, is the general term for a group of speech recognition systems developed at Carnegie Mellon University.

Dynamic time warping

Around this time, Soviet researchers invented the dynamic time warping (DTW) algorithm and used it to create a recognizer capable of operating on a 200-word vocabulary.
A well-known application of DTW has been automatic speech recognition, where it copes with different speaking speeds.
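A minimal sketch of the DTW recurrence in Python may help make the idea concrete. The 1-D features, toy sequences, and the dtw_distance name are illustrative assumptions; real recognizers compare multi-dimensional acoustic feature vectors such as MFCCs.

```python
# A minimal sketch of dynamic time warping (DTW), assuming 1-D features.

def dtw_distance(a, b):
    """Return the DTW alignment cost between sequences a and b."""
    n, m = len(a), len(b)
    INF = float("inf")
    # cost[i][j] = best cost of aligning a[:i] with b[:j]
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])  # local distance between frames
            # best predecessor: a advances, b advances, or both advance
            cost[i][j] = d + min(cost[i - 1][j],
                                 cost[i][j - 1],
                                 cost[i - 1][j - 1])
    return cost[n][m]

# The same "word" spoken slowly still aligns cheaply: frames are
# stretched rather than penalized, which is DTW's whole point.
template = [1.0, 2.0, 3.0, 2.0, 1.0]
slow = [1.0, 1.0, 2.0, 2.0, 3.0, 3.0, 2.0, 1.0]
print(dtw_distance(template, slow))  # small cost despite different lengths
```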

Natural-language understanding

They thought speech understanding would be key to making progress in speech recognition; this later proved untrue.
NLU is the post-processing of text after NLP algorithms (identifying parts of speech, etc.) have been applied. It uses context from recognition devices in all of its forms (automatic speech recognition [ASR], vision recognition, the previous conversation, misrecognized words from ASR, personalized profiles, microphone proximity, etc.) to discern the meaning of fragmented and run-on sentences and to execute an intent, typically from voice commands.

IFlytek

These speech industry players include Google, Microsoft, IBM, Baidu, Apple, Amazon, Nuance, GoVivace Inc., SoundHound, and iFLYTEK, many of which have publicized the core technology in their speech recognition systems as being based on deep learning.
It creates voice recognition software and more than ten voice-based internet and mobile products covering the education, communication, music, and intelligent toy industries.

Lernout & Hauspie

Lernout & Hauspie, a Belgium-based speech recognition company, acquired several other companies, including Kurzweil Applied Intelligence in 1997 and Dragon Systems in 2000.
Lernout & Hauspie Speech Products, or L&H, was a leading Belgium-based speech recognition technology company, founded by Jo Lernout and Pol Hauspie, that went bankrupt in 2001 because of a fraud engineered by management.

Kai-Fu Lee

Raj Reddy's student Kai-Fu Lee joined Apple where, in 1992, he helped develop a speech interface prototype for the Apple computer known as Casper.
Lee developed the world's first speaker-independent, continuous speech recognition system as his Ph.D. thesis at Carnegie Mellon.

GOOG-411

The first product was GOOG-411, a telephone based directory service.
GOOG-411 (or Google Voice Local Search) was a telephone service launched by Google in 2007 that provided a speech-recognition-based business directory search and placed a call to the resulting number in the United States or Canada.

Recurrent neural network

Many aspects of speech recognition have been taken over by a deep learning method called long short-term memory (LSTM), a recurrent neural network architecture published by Sepp Hochreiter and Jürgen Schmidhuber in 1997.
This makes them applicable to tasks such as unsegmented, connected handwriting recognition or speech recognition.
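To illustrate how an LSTM carries information across time steps, here is a minimal single-cell sketch in Python. The scalar gates and toy weights are assumptions for readability; real models use vector states and learned weight matrices.

```python
# A minimal sketch of one LSTM time step with scalar gates.
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, w):
    """One LSTM step; w maps gate name -> (input, hidden, bias) weights."""
    i = sigmoid(w["i"][0] * x + w["i"][1] * h_prev + w["i"][2])   # input gate
    f = sigmoid(w["f"][0] * x + w["f"][1] * h_prev + w["f"][2])   # forget gate
    o = sigmoid(w["o"][0] * x + w["o"][1] * h_prev + w["o"][2])   # output gate
    g = math.tanh(w["g"][0] * x + w["g"][1] * h_prev + w["g"][2]) # candidate
    c = f * c_prev + i * g        # cell state carries long-range memory
    h = o * math.tanh(c)          # hidden state is the cell's output
    return h, c

# Toy weights: the cell state lets an early input influence later steps.
w = {k: (0.5, 0.5, 0.0) for k in ("i", "f", "o", "g")}
h, c = 0.0, 0.0
for x in [1.0, 0.0, 0.0, 1.0]:
    h, c = lstm_step(x, h, c, w)
print(h, c)
```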

DARPA Global autonomous language exploitation program

In the 2000s DARPA sponsored two speech recognition programs: Effective Affordable Reusable Speech-to-Text (EARS) in 2002 and Global Autonomous Language Exploitation (GALE).
The program encompassed three main challenges: automatic speech recognition, machine translation, and information retrieval.

Lawrence Rabiner

Dragon Dictate, a consumer product, was released in 1990. AT&T deployed the Voice Recognition Call Processing service in 1992 to route telephone calls without the use of a human operator; the technology was developed by Lawrence Rabiner and others at Bell Labs.
Lawrence R. Rabiner (born 28 September 1943) is an electrical engineer working in the fields of digital signal processing and speech processing; in particular in digital signal processing for automatic speech recognition.

Acoustic model

The use of deep feedforward (non-recurrent) networks for acoustic modeling was introduced during the later part of 2009 by Geoffrey Hinton and his students at the University of Toronto and by Li Deng and colleagues at Microsoft Research, initially in collaborative work between Microsoft and the University of Toronto that was subsequently expanded to include IBM and Google (hence the "The shared views of four research groups" subtitle in their 2012 review paper). Both acoustic modeling and language modeling are important parts of modern statistically based speech recognition algorithms.
An acoustic model is used in automatic speech recognition to represent the relationship between an audio signal and the phonemes or other linguistic units that make up speech.
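As an illustration of the audio-to-phoneme relationship an acoustic model captures, the following sketch scores a feature value against per-phoneme Gaussians. The phoneme labels, means, and variances are hypothetical; real systems model multi-dimensional features with Gaussian mixtures or neural networks.

```python
# A minimal sketch of a Gaussian acoustic model with one 1-D Gaussian
# per phoneme; the numbers below are hypothetical.
import math

phoneme_models = {"aa": (700.0, 100.0), "iy": (300.0, 80.0)}  # (mean, std)

def log_likelihood(feature, mean, std):
    """Log density of the feature under a 1-D Gaussian."""
    return (-0.5 * math.log(2 * math.pi * std ** 2)
            - (feature - mean) ** 2 / (2 * std ** 2))

def most_likely_phoneme(feature):
    """Pick the phoneme whose model best explains the observed feature."""
    return max(phoneme_models,
               key=lambda p: log_likelihood(feature, *phoneme_models[p]))

print(most_likely_phoneme(650.0))  # "aa": closest to that phoneme's mean
```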

Computational linguistics

Speech recognition is the interdisciplinary subfield of computational linguistics that develops methodologies and technologies that enable the recognition and translation of spoken language into text by computers.
These systems, such as Siri on the iOS operating system, use a pattern-recognition technique similar to that of text-based systems, but with user input conducted through speech recognition.

Electrical engineering

It is also known as automatic speech recognition (ASR), computer speech recognition, or speech to text (STT). It incorporates knowledge and research from the linguistics, computer science, and electrical engineering fields.
In such products, DSP may be responsible for noise reduction, speech recognition or synthesis, encoding or decoding digital media, wirelessly transmitting or receiving data, triangulating position using GPS, and other kinds of image processing, video processing, audio processing, and speech processing.

Language model

Both acoustic modeling and language modeling are important parts of modern statistically-based speech recognition algorithms.
Language modeling is used in speech recognition, machine translation, part-of-speech tagging, parsing, Optical Character Recognition, handwriting recognition, information retrieval and other applications.
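A minimal bigram language model sketch shows how language modeling ranks word sequences for a recognizer. The toy corpus and add-one smoothing are illustrative assumptions; production recognizers use far larger n-gram or neural models.

```python
# A minimal bigram language model with Laplace (add-one) smoothing.
from collections import Counter

corpus = "call home please call the office".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
vocab = len(unigrams)

def bigram_prob(w1, w2):
    """P(w2 | w1) with add-one smoothing over the vocabulary."""
    return (bigrams[(w1, w2)] + 1) / (unigrams[w1] + vocab)

# Seen sequences score higher, which helps a recognizer pick
# "call home" over an acoustically similar but unlikely alternative.
print(bigram_prob("call", "home"))    # higher: observed in the corpus
print(bigram_prob("call", "office"))  # lower: never observed
```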

Keyword spotting

In the United States, the National Security Agency has made use of a type of speech recognition for keyword spotting since at least 2006.
Since speech recognition technology forms the core of keyword spotting, the solution can also be used to build content based indexes of audio archives for intelligence and business applications.
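A minimal sketch of keyword spotting over recognizer output follows. The transcript format, timestamps, and keyword set are hypothetical, and production systems often score keywords directly on acoustic lattices rather than on a final transcript.

```python
# A minimal keyword-spotting sketch over (word, start, end) ASR output.
transcript = [  # hypothetical recognizer output with word timestamps
    ("please", 0.0, 0.4), ("call", 0.4, 0.7), ("home", 0.7, 1.1),
]
keywords = {"call", "home"}

# Each hit records the word and when it occurred, which is exactly
# the information a content-based audio index needs.
hits = [(word, start) for word, start, _ in transcript if word in keywords]
print(hits)  # [('call', 0.4), ('home', 0.7)]
```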

Hidden Markov model

A decade later, at CMU, Raj Reddy's students James Baker and Janet M. Baker began using the Hidden Markov Model (HMM) for speech recognition.
Hidden Markov models are especially known for their application in reinforcement learning and temporal pattern recognition such as speech, handwriting, gesture recognition, part-of-speech tagging, musical score following, partial discharges and bioinformatics.
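A minimal Viterbi decoding sketch shows how an HMM recovers the most likely hidden state sequence from observations. The two-state model and its probabilities are toy assumptions; speech HMMs use states per phoneme and emission densities over acoustic feature vectors rather than a small symbol table.

```python
# A minimal Viterbi decoder for a toy discrete HMM.

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most likely state sequence for the observations."""
    # best[t][s] = probability of the best path ending in state s at time t
    best = [{s: start_p[s] * emit_p[s][obs[0]] for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        best.append({})
        back.append({})
        for s in states:
            prob, prev = max(
                (best[t - 1][p] * trans_p[p][s] * emit_p[s][obs[t]], p)
                for p in states
            )
            best[t][s] = prob
            back[t][s] = prev
    # trace back from the best final state
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path))

# Toy two-state model with hand-set probabilities.
states = ["S", "T"]
start_p = {"S": 0.6, "T": 0.4}
trans_p = {"S": {"S": 0.7, "T": 0.3}, "T": {"S": 0.4, "T": 0.6}}
emit_p = {"S": {"a": 0.9, "b": 0.1}, "T": {"a": 0.2, "b": 0.8}}
print(viterbi(["a", "b", "b"], states, start_p, trans_p, emit_p))
# ['S', 'T', 'T']: the decoder infers the hidden switch from S to T
```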

Babel program

Some government research programs have focused on intelligence applications of speech recognition, e.g. DARPA's EARS program and IARPA's Babel program.
The IARPA Babel program developed speech recognition technology for noisy telephone conversations.

James K. Baker

A decade later, at CMU, Raj Reddy's students James Baker and Janet M. Baker began using the Hidden Markov Model (HMM) for speech recognition.
James Baker is an expert in speech recognition technology and a Distinguished Career Professor at Carnegie Mellon University.