By John Gruber
Let me just reiterate up front that my suspicions surrounding Google’s Duplex recordings are not suspicions regarding the idea of Duplex itself. If I had to bet on who will be the first to create an AI voice system that passes for human, even within the limited constraints of a single well-defined task like booking reservations, it would be Google. If Vegas had a betting line on this, Amazon would probably have decent odds too, but surely Google would be the favorite.
We can all hear for ourselves how well Google Assistant works today. I’m not alleging that these recordings are complete fabrications, or betting against Google being further ahead in this effort than anyone else.
But everything about the way Google announced this — the curious details of the calls released so far, the fact that no one in the media has been allowed to see an actual call happen live — makes me suspect that for one or more reasons, the current state of Duplex is less than what Sundar Pichai implied on stage. His words before the first recording was played: “What you’re going to hear is the Google Assistant actually calling a real salon to schedule an appointment for you. Let’s listen.” And after the second recording: “Again, that was a real call.”
You can parse those words precisely and argue that Pichai never said they were unscripted or un-coached, or that the recordings are unedited. But that’s like saying Bill Clinton was technically truthful with his “I did not have sexual relations with that woman” statement. The implication of Clinton’s statement was that he wasn’t involved sexually with his intern, and that wasn’t true. The implication of Pichai’s statement was that right now, today, Google has a version of Duplex in its lab that can call a real restaurant or hair salon and book a reservation and sound truly human while doing so. Not soon, today. Look at the news coverage from the announcement — Mashable, The Guardian, The Verge, The Evening Standard — all of those reports on Duplex’s announcement are written in the present tense, as though it’s something Google has working, as heard, with no or very minimal editing, today.
If a few months or more from now Google can demonstrate a real Duplex call, live, that wouldn’t disprove my suspicion that they can’t do it right now in May 2018 — even though Sundar Pichai clearly implied last week that they can. If I’m wrong — if stories come out in the next week or two from journalists granted behind-the-scenes access to listen to Duplex make live calls (and watch them be parsed correctly, creating calendar events and notifications of the reservation dates and times), and those calls sound every bit as realistically human as the recordings Google has released so far — my suspicion will be proven false. And I’d be delighted by that. Part of the reason I’m so focused on Duplex is that if it really works like it does in these recordings, it’s one of the most amazing advances in technology in years.
But Google hasn’t done that, and the more I think about it, and the longer Google stonewalls on press inquiries about Duplex, the more suspicious I get that they can’t. Even if Duplex still has a low success rate, it would be amazing if, say, half its calls worked as well and sounded as good as these recordings. That would be perfectly understandable for a technology still in development.
But Pichai also said “This will be rolling out in the coming weeks as an experiment.” On the one hand, that makes me feel like maybe I am off my rocker for being so skeptical. Why in the world would Pichai say that if they weren’t at a stage in internal testing where Duplex works as the recordings suggest? But on the other hand, if they are that close, why haven’t they invited anyone from the media to see Duplex in action?
They did invite Richard Nieva from CNET to a behind-the-scenes preview before I/O, but all he got to hear were recordings, too:
In a building called the Partnerplex on Google’s sprawling campus in Mountain View, California, I’ve been invited to hear a 51-second phone recording of someone making a dinner reservation. […]
As I listen to what sounds like a man and a woman talking, Google’s top executives for Assistant, the search giant’s digital helper, watch closely to gauge my reaction. They’re showing off the Assistant’s new tricks a few days before Google I/O, the company’s annual developer conference that starts Tuesday.
Turns out this particular trick is pretty wild.
That’s because Person 2, the one who sounds like a man, isn’t a person at all. It’s the Google Assistant.
Why not let Nieva hear it live? Why not let Nieva answer the phone and book the reservation himself, as though he works at the restaurant? If it’s “weeks” away from rolling out in a limited beta to the public, that should be possible.
The job of journalists is to verify these things, not just to take a company’s word for it. Here’s Om Malik, linking to Dan Primack’s Axios story on Google’s stonewalling:
“Google may well have created a lifelike voice assistant…Or it was partially staged. Or something else entirely. We just don’t know, because Google won’t answer the questions.” @danprimack doing what journalists are supposed to do. Verify and dig deeper!
Finally journalism starts asking obvious questions of tech.
Tech journalism has never asked basic questions like “how did you do this?”
Apple once used my software to demo their tech, which wasn’t ready.
Reporters refused to ask about this.
“How did you do this?” is a necessary question. But even broader, when you’re only shown a recording, the question is “How do we know this is real?”
Maybe Duplex, today, works just as well and sounds just as human as these recordings suggest. But maybe it doesn’t work as well as they claimed, or doesn’t sound so human,¹ or takes pauses that were edited out of the clips they’ve released. We don’t know, because Google hasn’t allowed anyone to verify anything about it. It’s like a card trick where the magician, rather than an audience member, picks the card and shuffles the deck.
It’s the difference between, say, watching video of a purported self-driving car versus watching — or even better, riding as a passenger in — an actual self-driving car.
The headlines last week should have been along the lines of “Google Claims Assistant Can Make Human-Sounding Phone Calls”, not “Google Assistant Can Make Human-Sounding Phone Calls”. There’s a difference.
A recording is not a demo. You can demo hardware and software that isn’t shipping yet; most companies do, because that’s when the products are still under wraps and can make for a surprise. But there’s an obligation to be clear about the current state of the product, and to demo what you currently have working “for real”. Showing it privately to select members of the media is another acceptable strategy. Just to cite one famous example from Apple: in January 2007 the original iPhone was six months away from shipping and still needed a lot of work. But what Steve Jobs showed on stage was real: early-stage software running on prototype hardware. Everything demoed was live, not a recording. And then to further prove that, after the keynote, select members of the media, including Jason Snell, Andy Ihnatko, and David Pogue, got up to 45 minutes of actual hands-on time with a prototype, even though the software was at such an early stage that some of the default apps only showed screenshots of what they were supposed to look like.
That’s how you prove to the world that a demo was what you said it was. It is damn curious that Google won’t do that with Duplex.
1. Google now claims their plan all along has been to have Duplex identify itself to humans. I don’t understand how that squares with the efforts they clearly went through to make Duplex sound convincingly human. It seems clear that they only started thinking about disclosing Duplex as a bot to humans in response to the ethical outcry after the keynote. Ethics aside though, what makes the promise of Duplex so tantalizing as a technology is its seeming humanness.