Sam Liang longs for his mother and wishes he could recapture the things she told him when he was in high school.
“I really miss her,” he said of his mother, who died in 2001. “Those were precious lifetime moments.”
Liang, the chief executive and a co-founder of Otter.ai, a Silicon Valley startup, has set out to do something about that. His company offers a service that automatically transcribes speech with high enough accuracy that it is gaining popularity with journalists, students, podcasters and corporate workers.
Improvements in software technology have made automatic speech transcription possible. By capturing vast quantities of human speech, neural network programs can be trained to recognize spoken language with accuracy rates that in the best circumstances approach 95%. Coupled with the plunging cost of storing data, it is now possible to use human language in ways that were unthinkable just a few years ago.
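To make the idea concrete, here is a minimal sketch of that kind of neural transcription, using a publicly available wav2vec 2.0 model shipped with the torchaudio library. The audio file name is a placeholder, and nothing here implies the companies in this article use this particular model or decoding scheme.

```python
# A minimal sketch of neural-network speech transcription, using a
# pretrained wav2vec 2.0 model shipped with torchaudio. "speech.wav"
# is a placeholder for any short mono recording.
import torch
import torchaudio

bundle = torchaudio.pipelines.WAV2VEC2_ASR_BASE_960H  # trained on 960 hours of read English
model = bundle.get_model()

waveform, sample_rate = torchaudio.load("speech.wav")
waveform = torchaudio.functional.resample(waveform, sample_rate, bundle.sample_rate)

with torch.inference_mode():
    emissions, _ = model(waveform)  # per-frame scores over a character vocabulary

# Greedy CTC decoding: take the best label per frame, collapse repeats,
# drop the blank token "-", and turn the word separator "|" into spaces.
labels = bundle.get_labels()
frame_ids = torch.argmax(emissions[0], dim=-1)
collapsed = torch.unique_consecutive(frame_ids).tolist()
transcript = "".join(labels[i] for i in collapsed if labels[i] != "-")
print(transcript.replace("|", " "))
```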
Liang, a Stanford-educated electrical engineer who was a member of the original team that designed Google Maps, said that data compression had made it possible to capture the speech conversation of a person’s entire life in just two terabytes of information—compact enough to fit on storage devices that cost less than $50.
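The arithmetic behind that figure is easy to check. Under illustrative assumptions (not Liang’s own numbers) of 80 years of audio at 16 waking hours a day, compressed by a low-bitrate speech codec to roughly 10 kilobits per second, a lifetime of conversation lands right around two terabytes:

```python
# Back-of-the-envelope check of the "lifetime of speech in ~2 TB" claim.
# Every input here is an illustrative assumption, not a figure from Otter.ai.
years = 80
waking_hours_per_day = 16   # audio captured whether anyone is speaking or not
bitrate_kbps = 10           # roughly what a low-bitrate speech codec produces

seconds = years * 365 * waking_hours_per_day * 3600
total_bytes = seconds * bitrate_kbps * 1000 / 8

print(f"{total_bytes / 1e12:.1f} TB")  # prints "2.1 TB"
```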
The rapid improvement in speech recognition technology, which over the past decade has given rise to virtual assistants such as Apple’s Siri, Amazon’s Alexa, Google’s Assistant and Microsoft’s Cortana, is now spilling into new areas and beginning to have a significant impact on the workplace.
These consumer speech portals have already raised extensive new privacy concerns. “Computers have a much greater ability to organize, access and evaluate human communications than do people,” said Marc Rotenberg, president and executive director of the Electronic Privacy Information Center in Washington. In 2015, the group filed a complaint with the Federal Trade Commission against Samsung, arguing that the capture and storage of conversations by its smart TVs was a new threat to privacy. Speech transcription potentially pushes traditional privacy concerns into new arenas, both at home and at work, he said.
The rapid advances in the automated transcription market over the past year show striking near-term potential in a growing array of new applications. This fall, for example, at the University of California, Los Angeles, students who require assistance with note taking, such as those who are hearing-impaired, are being equipped with the Otter.ai service. It is designed to replace the current arrangement, in which other students take notes during class and then share them.
In May, when the former first lady, Michelle Obama, visited campus as part of a student signing day celebration, deaf students were given access to an instantaneous transcription of her speech generated by the service.
Zoom, maker of a web-based video conferencing system, offers a transcription option powered by the Otter.ai service that makes it possible to instantaneously capture a transcript of a business meeting that can be stored and searched online. One of the features that is offered by Otter.ai and other companies is the ability to easily separate and then label different speakers in a single transcription.
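Separating and labeling speakers is known among speech engineers as speaker diarization. A common approach is to compute a voice “embedding” for each short segment of audio and then cluster segments whose embeddings are close. The toy sketch below shows only that clustering step, on synthetic embeddings, and makes no claim about how Otter.ai or Zoom implement the feature.

```python
# A toy sketch of the clustering step in speaker diarization: given one
# voice embedding per audio segment, group the segments by speaker. The
# embeddings here are synthetic; real systems derive them from a neural
# speaker-embedding model.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(0)

# Fake 8-dimensional embeddings: segments 0-3 from one voice, 4-6 from another.
speaker_a = rng.normal(loc=0.0, scale=0.1, size=(4, 8))
speaker_b = rng.normal(loc=1.0, scale=0.1, size=(3, 8))
embeddings = np.vstack([speaker_a, speaker_b])

# Merge segments whose embeddings are close; with the speaker count unknown
# in advance, a distance threshold decides how many clusters emerge.
clustering = AgglomerativeClustering(n_clusters=None, distance_threshold=2.0)
segment_labels = clustering.fit_predict(embeddings)

for segment, label in enumerate(segment_labels):
    print(f"segment {segment}: Speaker {label + 1}")
```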
Rev, which began in 2010 using temporary workers to offer transcription for $1 a minute, now also offers an automated speech transcription service for 10 cents a minute. As a result, transcription is pushing into a variety of new areas, including captioning for YouTube channels and corporate training videos, and transcripts of focus groups for market research firms.
The Rev system allows customers to choose between greater accuracy and a quicker turnaround at lower cost, said Jason Chicola, the company’s founder and chief executive. Increasingly, his customers correct machine-generated text rather than having recordings transcribed from scratch. He said that while Rev had 40,000 human transcribers, he did not believe automated transcription would decimate his workforce. “Humans and machines will work together for the foreseeable future,” he said.
In the medical field, automated transcription is being used to change the way doctors take notes. In recent years, electronic health record systems have become part of the routine office visit, and doctors have been criticized for looking at their screens and typing rather than maintaining eye contact with patients. Now, several health startups offer transcription services that capture audio, and potentially video, in the examining room and use a remote human transcriber, or scribe, to edit the automated text and produce a “structured” set of notes from the patient visit.
One of the companies, Robin Healthcare, based in Berkeley, California, records office visits with an automated speech transcription system that is then annotated by a staff of human “scribes” who work in the United States, according to Noah Auerhahn, the company’s chief executive. Most of the scribes are pre-med students who listen to the doctor’s conversation, then produce a finished record within two hours of the patient’s visit. The Robin Healthcare system is being used at the University of California, San Francisco, and at Duke University.
A competitor, DeepScribe, also based in Berkeley, takes a more automated approach to generating electronic health records. The firm uses several speech engines from large technology companies such as Google and IBM to transcribe the conversation, then creates a summary of the examination that is checked by a human. By relying more heavily on speech automation, DeepScribe is able to offer a less expensive service, said Akilesh Bapu, the company’s chief executive.
In the past, human speech transcription was largely limited to the legal and medical fields. This year, the cost of automated transcription has collapsed as rival startups have competed for a rapidly growing market. Companies such as Otter.ai and Descript, a rival San Francisco-based startup created by Groupon founder Andrew Mason, are giving away basic transcription services and charging for subscriptions that offer enhanced features.
Speech scientists emphasize that while automated transcription systems have improved significantly, they are still far from perfect. The 95% accuracy rate is attainable only under the best circumstances; an accent, a poorly positioned microphone or background noise can cause accuracy to fall sharply.
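Accuracy figures like that 95% are conventionally the complement of the word error rate: the minimum number of word substitutions, insertions and deletions needed to turn the machine’s transcript into a human reference transcript, divided by the length of the reference. A short sketch, with invented example sentences:

```python
# Word error rate (WER), the standard accuracy metric for transcription:
# the minimum number of word substitutions, insertions and deletions
# needed to turn the hypothesis into the reference, divided by the
# reference length. The example sentences are invented.
def word_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # Classic edit-distance dynamic program, computed over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

wer = word_error_rate("the meeting starts at noon", "the meetings start at noon")
print(f"WER: {wer:.0%}")  # 40%: two of the five reference words are wrong
```

In these terms, a “95% accurate” transcript is one with a word error rate of about 5%.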
The hope for the future lies in another speech technology, natural language processing, which tries to capture the meaning of words and sentences and could raise computer accuracy to human levels. For now, though, natural language processing remains one of the most challenging frontiers in the field of artificial intelligence.
© 2019 The New York Times