Speech recognition

Speech recognition is an interdisciplinary subfield of computational linguistics that develops methodologies and technologies that enable the recognition and translation of spoken language into text by computers.

Voice user interface

Speech recognition applications include voice user interfaces such as voice dialing (e.g. "call home"), call routing (e.g. "I would like to make a collect call"), domotic appliance control, search key words (e.g. find a podcast where particular words were spoken), simple data entry (e.g., entering a credit card number), preparation of structured documents (e.g. a radiology report), determining speaker characteristics, speech-to-text processing (e.g., word processors or emails), and aircraft (usually termed direct voice input).
A voice-user interface (VUI) makes spoken human interaction with computers possible, using speech recognition to understand spoken commands and answer questions, and typically text to speech to play a reply.

Speaker recognition

The term voice recognition or speaker identification refers to identifying the speaker, rather than what they are saying.
The term voice recognition can refer to speaker recognition or speech recognition.

Deep learning

Most recently, the field has benefited from advances in deep learning and big data. Today, however, many aspects of speech recognition have been taken over by a deep learning method called Long short-term memory (LSTM), a recurrent neural network published by Sepp Hochreiter & Jürgen Schmidhuber in 1997.
Deep learning architectures such as deep neural networks, deep belief networks, recurrent neural networks and convolutional neural networks have been applied to fields including computer vision, speech recognition, natural language processing, audio recognition, social network filtering, machine translation, bioinformatics, drug design, medical image analysis, material inspection and board game programs, where they have produced results comparable to and in some cases superior to human experts.

Dynamic time warping

Around this time Soviet researchers invented the dynamic time warping (DTW) algorithm and used it to create a recognizer capable of operating on a 200-word vocabulary.
A well-known application has been automatic speech recognition, where it is used to cope with different speaking speeds.
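
The idea can be sketched in a few lines of Python. The `dtw_distance` function below is a hypothetical, minimal illustration over sequences of single numbers; real recognizers align vectors of acoustic features, but the warping logic is the same.

```python
# Minimal dynamic time warping (DTW) sketch over 1-D sequences.
def dtw_distance(a, b):
    """Alignment cost between two sequences, tolerant of
    differences in speaking rate (sequence length)."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = cheapest alignment of a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # stretch a
                                 cost[i][j - 1],      # stretch b
                                 cost[i - 1][j - 1])  # advance both
    return cost[n][m]

# The "same word" spoken twice as slowly still aligns at zero cost:
print(dtw_distance([1, 2, 3, 4], [1, 1, 2, 2, 3, 3, 4, 4]))  # → 0.0
```

Because each frame of one sequence may align with several frames of the other, a 200-word-vocabulary recognizer of the era could match an utterance against stored templates regardless of how quickly the word was spoken.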

Frederick Jelinek

* By the mid-1980s, IBM's Fred Jelinek's team created a voice-activated typewriter called Tangora, which could handle a 20,000-word vocabulary. Jelinek's statistical approach put less emphasis on emulating the way the human brain processes and understands speech in favor of using statistical modeling techniques like HMMs.
Frederick Jelinek (18 November 1932 – 14 September 2010) was a Czech-American researcher in information theory, automatic speech recognition, and natural language processing.

Windows Speech Recognition

Huang went on to found the speech recognition group at Microsoft in 1993.
Windows Speech Recognition (WSR) is a speech recognition component developed by Microsoft for Windows Vista that enables the use of voice commands to control the desktop user interface, dictate text in electronic documents and email, navigate websites, perform keyboard shortcuts, and operate the mouse cursor.

CMU Sphinx

Raj Reddy's former student, Xuedong Huang, developed the Sphinx-II system at CMU.
CMU Sphinx, also called Sphinx in short, is the general term to describe a group of speech recognition systems developed at Carnegie Mellon University.

Natural-language understanding

They thought speech understanding would be key to making progress in speech recognition; this later proved to be untrue.
NLU is the post-processing of text, after the use of NLP algorithms (identifying parts of speech, etc.), that utilizes context from recognition devices (automatic speech recognition [ASR], vision recognition, the last conversation, misrecognized words from ASR, personalized profiles, microphone proximity, etc.), in all of its forms, to discern the meaning of fragmented and run-on sentences and execute an intent, typically from voice commands.

Kai-Fu Lee

Raj Reddy's student Kai-Fu Lee joined Apple where, in 1992, he helped develop a speech interface prototype for the Apple computer known as Casper.
Lee developed the world's first speaker-independent, continuous speech recognition system as his Ph.D. thesis at Carnegie Mellon.

Lernout & Hauspie

Lernout & Hauspie, a Belgium-based speech recognition company, acquired several other companies, including Kurzweil Applied Intelligence in 1997 and Dragon Systems in 2000.
Lernout & Hauspie Speech Products, or L&H, was a leading Belgium-based speech recognition technology company, founded by Jo Lernout and Pol Hauspie, that went bankrupt in 2001 because of a fraud engineered by management.

BBN Technologies

BBN, IBM, Carnegie Mellon and Stanford Research Institute all participated in the program.
In recent years, BBN has led a wide range of research and development projects, including the standardization effort for the security extension to the Border Gateway Protocol (BGPsec), mobile ad hoc networks, advanced speech recognition, the military's Boomerang mobile shooter detection system, cognitive radio spectrum use via the DARPA XG program.

GOOG-411

The first product was GOOG-411, a telephone based directory service.
GOOG-411 (or Google Voice Local Search) was a telephone service launched by Google in 2007 that provided a speech-recognition-based business directory search and placed a call to the resulting number in the United States or Canada.

Recurrent neural network

Today, however, many aspects of speech recognition have been taken over by a deep learning method called Long short-term memory (LSTM), a recurrent neural network published by Sepp Hochreiter & Jürgen Schmidhuber in 1997.
This makes them applicable to tasks such as unsegmented, connected handwriting recognition or speech recognition.

Nuance Communications

The speech technology from L&H was bought by ScanSoft which became Nuance in 2005.
This permitted the use of so-called speaker-independent natural-language speech recognition (abbreviated SI-NLSR, or just NLSR) for call automation.

DARPA Global autonomous language exploitation program

In the 2000s DARPA sponsored two speech recognition programs: Effective Affordable Reusable Speech-to-Text (EARS) in 2002 and Global Autonomous Language Exploitation (GALE).
The program encompassed three main challenges: automatic speech recognition, machine translation, and information retrieval.

Lawrence Rabiner

Books like "Fundamentals of Speech Recognition" by Lawrence Rabiner can be useful to acquire basic knowledge but may not be fully up to date (1993).
Lawrence R. Rabiner (born 28 September 1943) is an electrical engineer working in the fields of digital signal processing and speech processing; in particular in digital signal processing for automatic speech recognition.

Long short-term memory

Today, however, many aspects of speech recognition have been taken over by a deep learning method called Long short-term memory (LSTM), a recurrent neural network published by Sepp Hochreiter & Jürgen Schmidhuber in 1997.
For example, LSTM is applicable to tasks such as unsegmented, connected handwriting recognition, speech recognition and anomaly detection in network traffic or IDS's (intrusion detection systems).
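
A single LSTM step can be sketched as follows. This is an illustrative scalar version with arbitrary toy weights; real networks operate on vectors with learned weight matrices, but the gating structure that lets the cell retain long-range context is the same.

```python
import math

def lstm_step(x, h_prev, c_prev, W):
    """One LSTM step over scalars. W maps each gate name
    to toy (input weight, recurrent weight, bias) triples."""
    sigmoid = lambda z: 1.0 / (1.0 + math.exp(-z))
    # Each gate sees the current input and the previous hidden state.
    f = sigmoid(W["f"][0] * x + W["f"][1] * h_prev + W["f"][2])   # forget gate
    i = sigmoid(W["i"][0] * x + W["i"][1] * h_prev + W["i"][2])   # input gate
    o = sigmoid(W["o"][0] * x + W["o"][1] * h_prev + W["o"][2])   # output gate
    g = math.tanh(W["g"][0] * x + W["g"][1] * h_prev + W["g"][2]) # candidate
    c = f * c_prev + i * g       # cell state: gated long-term memory
    h = o * math.tanh(c)         # hidden state: gated short-term output
    return h, c

# Run the cell over a short (invented) feature sequence.
W = {gate: (1.0, 1.0, 0.0) for gate in ("f", "i", "o", "g")}
h, c = 0.0, 0.0
for x in [0.5, -0.3, 0.8]:
    h, c = lstm_step(x, h, c, W)
```

The forget gate decides how much of the old cell state to keep, which is what allows the network to carry information across the long, unsegmented input sequences that arise in speech.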

Acoustic model

The use of deep feedforward (non-recurrent) networks for acoustic modeling was introduced during the later part of 2009 by Geoffrey Hinton and his students at the University of Toronto and by Li Deng and colleagues at Microsoft Research, initially in the collaborative work between Microsoft and the University of Toronto, which was subsequently expanded to include IBM and Google (hence "The shared views of four research groups" subtitle in their 2012 review paper). Both acoustic modeling and language modeling are important parts of modern statistically-based speech recognition algorithms.
An acoustic model is used in automatic speech recognition to represent the relationship between an audio signal and the phonemes or other linguistic units that make up speech.
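
As an illustrative sketch of that relationship, a toy acoustic model might score a single acoustic feature against one Gaussian per phoneme. The phoneme labels and parameters below are invented; real systems use Gaussian mixtures or neural networks over multi-dimensional feature vectors.

```python
import math

# Hypothetical (mean, variance) of one acoustic feature per phoneme.
PHONEME_MODELS = {
    "s":  (6.0, 1.0),
    "iy": (2.0, 1.0),
}

def log_likelihood(feature, phoneme):
    """Log-density of the feature under the phoneme's Gaussian."""
    mean, var = PHONEME_MODELS[phoneme]
    return -0.5 * (math.log(2 * math.pi * var) + (feature - mean) ** 2 / var)

def best_phoneme(feature):
    """The phoneme whose model best explains this frame."""
    return max(PHONEME_MODELS, key=lambda p: log_likelihood(feature, p))

print(best_phoneme(5.5))  # → s  (closest to the "s" mean)
```

A decoder combines such per-frame acoustic scores with language-model scores to choose among competing word hypotheses.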

Computational linguistics

Speech recognition is an interdisciplinary subfield of computational linguistics that develops methodologies and technologies that enable the recognition and translation of spoken language into text by computers.
These systems, such as Siri of the iOS operating system, operate on a similar pattern-recognizing technique as that of text-based systems, but with the former, the user input is conducted through speech recognition.

RIPAC (microprocessor)

* 1987 – The back-off model allowed language models to use multiple length n-grams, and CSELT used HMM to recognize languages (both in software and in hardware specialized processors, e.g. RIPAC).
RIPAC was designed to provide efficient real-time speech recognition services to the Italian telephone system operated by SIP.

Hidden Markov model

A decade later, at CMU, Raj Reddy's students James Baker and Janet M. Baker began using the Hidden Markov Model (HMM) for speech recognition.
Hidden Markov models are especially known for their application in reinforcement learning and temporal pattern recognition such as speech, handwriting, gesture recognition, part-of-speech tagging, musical score following, partial discharges and bioinformatics.
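
To sketch how an HMM decoder recovers the most likely hidden sequence (e.g., phonemes) from observations (e.g., quantized acoustic frames), here is a minimal Viterbi implementation. The states, observations, and probabilities are invented toy values, not drawn from any real recognizer.

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most likely hidden state sequence for an observation sequence."""
    # V[t][s] = probability of the best path ending in state s at time t
    V = [{s: start_p[s] * emit_p[s][obs[0]] for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        V.append({})
        back.append({})
        for s in states:
            prob, prev = max(
                (V[t - 1][p] * trans_p[p][s] * emit_p[s][obs[t]], p)
                for p in states)
            V[t][s] = prob
            back[t][s] = prev
    # Trace the best final state back to the start.
    last = max(V[-1], key=V[-1].get)
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        last = back[t][last]
        path.append(last)
    return path[::-1]

# Toy setup: two phoneme-like states emitting "hi"/"lo" energy frames.
states = ("s", "iy")
start_p = {"s": 0.6, "iy": 0.4}
trans_p = {"s": {"s": 0.7, "iy": 0.3}, "iy": {"s": 0.3, "iy": 0.7}}
emit_p = {"s": {"hi": 0.9, "lo": 0.1}, "iy": {"hi": 0.2, "lo": 0.8}}
print(viterbi(["hi", "hi", "lo"], states, start_p, trans_p, emit_p))
# → ['s', 's', 'iy']
```

The same dynamic program, scaled up to thousands of states and combined with a language model, is what made HMM-based recognizers practical.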

Electrical engineering

It incorporates knowledge and research in the linguistics, computer science, and electrical engineering fields.
In such products, DSP may be responsible for noise reduction, speech recognition or synthesis, encoding or decoding digital media, wirelessly transmitting or receiving data, triangulating position using GPS, and other kinds of image processing, video processing, audio processing, and speech processing.

James K. Baker

A decade later, at CMU, Raj Reddy's students James Baker and Janet M. Baker began using the Hidden Markov Model (HMM) for speech recognition.
James Baker is an expert in speech recognition technology and a Distinguished Career Professor at Carnegie Mellon University.

Babel program

Some government research programs focused on intelligence applications of speech recognition, e.g. DARPA's EARS's program and IARPA's Babel program.
The IARPA Babel program developed speech recognition technology for noisy telephone conversations.

Language model

Both acoustic modeling and language modeling are important parts of modern statistically-based speech recognition algorithms.
Language modeling is used in speech recognition, machine translation, part-of-speech tagging, parsing, Optical Character Recognition, handwriting recognition, information retrieval and other applications.
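
A toy bigram model with a crude back-off to unigram frequencies illustrates the idea (and, loosely, the back-off model mentioned earlier). The corpus and the `backoff_weight` constant are invented; real systems use properly discounted estimates such as Katz back-off.

```python
from collections import Counter

# Invented toy corpus of voice-command transcripts.
corpus = "call home please call mom please call home".split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def score(prev, word, backoff_weight=0.4):
    """P(word | prev): use the bigram if seen, else back off to unigrams."""
    if bigrams[(prev, word)]:
        return bigrams[(prev, word)] / unigrams[prev]
    return backoff_weight * unigrams[word] / sum(unigrams.values())

# "call home" was observed, "call please" was not, so the latter
# falls back to the (penalized) unigram estimate:
print(score("call", "home") > score("call", "please"))  # → True
```

In a recognizer, such scores bias the decoder toward word sequences that are plausible in the language, resolving acoustically ambiguous hypotheses.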