Many of you who are into gaming or serious video editing know NVIDIA as creators of the leading graphics processing technology on the market. But NVIDIA is also a leader in the areas of artificial intelligence and deep learning; specifically in how these technologies can improve how we experience graphics, text and video synthesis, and conversational AI.
Some of their work is showcased in a series of videos they’ve put together called I AM AI, which offer a compelling look at what is (and what will be) available to us to improve how we experience the world – and each other. And recently I had the opportunity to have a LinkedIn Live conversation with Bryan Catanzaro, Vice President of Applied Deep Learning Research at NVIDIA, to hear more about their work using AI to reimagine how we experience sights and sounds.
Below is an edited transcript of a portion of our conversation. Click on the embedded SoundCloud player to hear the full conversation.
Make sure to watch the embedded clips as they help to frame our conversation.
Brent Leary: That voice in that video sounded like a real human being to me. We’re used to hearing Alexa and Siri, and we don’t even want to talk about the voices that came before them. But that one really sounded like a human being, with human inflection and some depth. Is that what we’re looking at when you talk about reinventing graphics and reinventing voice technology – using newer technology, including AI and deep learning, not only to change the look of graphics but to change the feel and sound of a machine, to make it sound more like one of us?
Bryan Catanzaro: I should make sure you understand that although that voice was synthesized, it was also closely directed. So I wouldn’t say that was a push-button speech synthesis system like you might use when you talk with a virtual assistant. Instead, it was a controllable voice that our algorithms allowed the producers of the video to create. And one of the ways they do that is by modeling the inflection and the rhythm and the energy they want a particular part of the narration to have. So I would say it’s not just a story about AI getting better, but also a story about how humans work more closely with AI to build things, and about having the ability to make synthetic voices that are controllable in this way.
I think this opens up new opportunities for speech synthesis in entertainment and the arts. I think that’s exciting, but it’s something that you and your audience should understand was actually very closely directed by a person. Now, of course, we’re hard at work on algorithms that are able to predict all of that humanity – the rhythm, the inflection, the pitch. And I think we’re going to see some pretty amazing advances over the next few years, where we can have a fully push-button speech synthesis system with the right inflection to go along with the meaning of the text, because when you speak, a lot of the meaning is conveyed through the inflection of your voice, not just the words that you choose.
And if we have models that are able to understand the meaning of text, like some of these amazing language models I was referring to earlier, we should be able to use those to direct speech synthesis in a way that carries meaning. That’s something I’m very excited about. It’s interesting.
I feel that we have kind of a cultural bias – maybe it’s specific to the United States, I’m not sure – that computers can’t speak in a human-like way. And maybe it comes somewhat from Star Trek: The Next Generation, where Data was this incredible computing machine who could solve any problem and invent new theories of physics, but could never speak in quite the same way a human could. Or maybe it traces back to, you know...
Brent Leary: Spock, maybe.
Bryan Catanzaro: His voice was off-putting – kind of creepy, you know. And so we have 50 years, several generations, of culture telling us that a computer can’t speak in a human-like way. And I actually just think that’s not the case. I think we can make a computer speak in a more human-like way, and we will. And I also think the benefits of that technology are going to be pretty great for all of us.
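To make the “closely directed” voice Catanzaro describes above a little more concrete, here is a minimal, hypothetical sketch of a prosody-controllable text-to-speech model – an illustration only, not NVIDIA’s system – in which pitch, energy, and per-phoneme duration are explicit inputs that a human director can edit before the audio is rendered:

```python
# Hypothetical sketch of a prosody-controllable TTS model (illustration only).
# Pitch, energy, and duration are passed in per phoneme so a human director
# can edit them; a production model would also predict sensible defaults.
import torch
import torch.nn as nn

class ControllableTTS(nn.Module):
    def __init__(self, n_phonemes=100, d_model=256, n_mels=80):
        super().__init__()
        self.embed = nn.Embedding(n_phonemes, d_model)
        self.encoder = nn.GRU(d_model, d_model, batch_first=True)
        self.pitch_proj = nn.Linear(1, d_model)    # inflection control
        self.energy_proj = nn.Linear(1, d_model)   # emphasis control
        self.decoder = nn.GRU(d_model, d_model, batch_first=True)
        self.to_mel = nn.Linear(d_model, n_mels)

    def forward(self, phonemes, pitch, energy, durations):
        # phonemes: (1, T) ints; pitch, energy: (1, T) floats; durations: (1, T) ints
        h, _ = self.encoder(self.embed(phonemes))
        h = h + self.pitch_proj(pitch.unsqueeze(-1)) + self.energy_proj(energy.unsqueeze(-1))
        # Repeat each phoneme's features for its duration: this is the rhythm control.
        frames = torch.repeat_interleave(h, durations[0], dim=1)
        out, _ = self.decoder(frames)
        return self.to_mel(out)   # mel spectrogram; a vocoder turns this into audio

tts = ControllableTTS()
phonemes = torch.randint(0, 100, (1, 12))
pitch = torch.zeros(1, 12); pitch[0, 5] = 2.0             # director raises the pitch here
energy = torch.zeros(1, 12)
durations = torch.full((1, 12), 4); durations[0, 5] = 8   # and holds that phoneme longer
mel = tts(phonemes, pitch, energy, durations)
print(mel.shape)   # (1, total_frames, 80)
```

In a setup like this, the fully push-button mode he anticipates would simply let the model predict its own pitch, energy, and duration values instead of taking them from a person.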
Brent Leary: The other thing that stood out in that clip was Amelia Earhart, with her picture seeming to come to life. Can you talk about that? I’m guessing it’s part of reinventing graphics using AI.
Bryan Catanzaro: Yeah, that’s right. NVIDIA Research has been really involved in a lot of technologies to synthesize videos and images using artificial intelligence. That’s one example – you saw one where the neural network was colorizing an image, sort of giving us new ways of looking at the past. And when you think about what’s involved in colorizing an image, the AI needs to understand the contents of the image in order to assign plausible colors to them. For example, grass is usually green, but if you don’t know where the grass is, then you shouldn’t color anything green. Traditional approaches to colorizing images were, I would say, a little bit risk averse. But as the AI gets better at understanding the contents of an image – what objects are there and how the objects relate to each other – it can do a much better job of assigning plausible colors to the image, which kind of brings it to life.
That’s one example, this image colorization problem. But I think in that video, we saw several other examples where we were able to take images and then animate them in various ways.
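As a toy illustration of the colorization setup described above – not NVIDIA’s model – the sketch below shows a network that sees only the grayscale channel and must predict the color channels, which it can only do plausibly to the extent that it recognizes what is in the picture:

```python
# Toy sketch of the colorization idea (not NVIDIA's model): the network sees
# only the grayscale (L) channel and predicts the two color (ab) channels, so
# it has to infer plausible colors from what it recognizes in the image.
import torch
import torch.nn as nn

class Colorizer(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),   # deeper layers learn "what is where"
            nn.Conv2d(64, 2, 3, padding=1), nn.Tanh(),    # ab channels, scaled to [-1, 1]
        )

    def forward(self, gray):      # gray: (B, 1, H, W)
        return self.net(gray)     # ab:   (B, 2, H, W)

# Training data is free: convert color photos to Lab space, hide the ab
# channels, and regress the model's prediction against the true colors.
model = Colorizer()
gray = torch.rand(4, 1, 128, 128)
print(model(gray).shape)          # (4, 2, 128, 128)
```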
Conditional Video Synthesis
One of the technologies we’ve been really interested in is called conditional video synthesis, where you’re able to create a video based on a sort of sketch. For something like this, what you would do is run pose recognition that analyzes the structure of objects – for example, a face: here are the eyes and here’s the nose – and then assigns positions and sizes to the objects.
And that becomes a kind of cartoon, like the stick figure a child might draw. Then you send that into another routine that animates the stick figure and makes the person move their head, or smile, or talk. If we want to animate a person speaking a certain text, we can make a model that predicts how their stick-figure model is going to evolve as the person speaks. And once we have that animated stick-figure drawing that shows how the person should move, we put it through a neural network that synthesizes a video from it – starting from the initial image, which has the appearance of the person and the background and so forth, and then animating it via this stick-figure animation to make the video.
And we call that conditional video generation, because there are many different videos you could produce from the same stick figure. So what we want to do is choose one that seems plausible, conditioned on some other information – maybe the text the person is speaking, or maybe some sort of animation we want to create. Conditional video generation is a very powerful idea, and it’s something that I think will evolve over time into a new way of generating graphics – a new way of rendering and creating graphics.
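The pipeline Catanzaro outlines can be sketched roughly in code. The following is an illustrative reconstruction under assumed components (a stand-in for a pose detector’s keypoints, a toy animation step, and a small conditional generator), not NVIDIA’s implementation:

```python
# Illustrative sketch of the conditional video synthesis pipeline described
# above (assumed structure, not NVIDIA's code): keypoints act as the "stick
# figure", a toy routine animates them, and a small generator renders frames
# conditioned on the reference appearance plus each animated pose.
import torch
import torch.nn as nn

class FrameGenerator(nn.Module):
    """Maps (reference image, pose heatmaps) to one video frame."""
    def __init__(self, n_keypoints=17):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3 + n_keypoints, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 3, 3, padding=1), nn.Sigmoid(),
        )

    def forward(self, reference, pose_maps):
        return self.net(torch.cat([reference, pose_maps], dim=1))

def animate_keypoints(keypoints, n_frames):
    # Toy "animation": jitter the keypoints a little each frame. A real system
    # would predict this motion from text, audio, or a driving video.
    return [keypoints + 0.01 * t * torch.randn_like(keypoints) for t in range(n_frames)]

def keypoints_to_heatmaps(keypoints, size=64):
    # One blurry heatmap per keypoint: the "stick figure" the generator sees.
    ys = torch.linspace(0, 1, size).view(size, 1)
    xs = torch.linspace(0, 1, size).view(1, size)
    maps = [torch.exp(-((ys - y) ** 2 + (xs - x) ** 2) / 0.01) for x, y in keypoints]
    return torch.stack(maps).unsqueeze(0)      # (1, K, H, W)

reference = torch.rand(1, 3, 64, 64)           # the appearance image (person + background)
keypoints = torch.rand(17, 2)                  # stand-in for a pose detector's output
generator = FrameGenerator()
video = [generator(reference, keypoints_to_heatmaps(k))
         for k in animate_keypoints(keypoints, n_frames=8)]
print(len(video), video[0].shape)              # 8 frames of shape (1, 3, 64, 64)
```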
Brent Leary: There is even a piece of that video where the person basically said, “draw this,” and it actually started getting drawn.
Bryan Catanzaro: Right. The power of deep learning is that it’s a very flexible way of mapping from one space to another. In that video, we’re seeing a lot of examples of that, and this is another one. From the point of view of the AI technology, they’re all similar, because what we’re doing is trying to learn a mapping that goes from X to Y. In this case, we’re trying to learn a mapping that goes from a text description of a scene to a stick figure – a cartoon of that scene. Let’s say I said “a lake surrounded by trees in the mountains.” I want the model to understand that mountains go in the background and have a certain shape.
Then the trees go in the foreground, and right in the middle there’s usually going to be a big lake. It’s possible to train a model on, say, a thousand or a million images of natural landscapes where you have annotations that show what the contents of those images are. Then you can train the model to go the other way and say: given the text, can you create a sort of stick-figure cartoon of what the scene should look like? Where do the mountains go? Where do the trees go? Where does the water go? And once you have that stick figure, you can send it into a model that elaborates it into an image. And so that’s what you saw in that video.
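As a rough sketch of that two-stage idea – with invented class names, hand-written layout rules standing in for the learned text-to-layout model, and a toy generator – it might look like this:

```python
# Hypothetical two-stage sketch: a hand-written stand-in for the learned
# text-to-layout model places coarse "mountain/tree/water" bands, and a toy
# generator elaborates the label map into an image. All names and rules here
# are invented for illustration.
import torch
import torch.nn as nn

CLASSES = {"sky": 0, "mountain": 1, "tree": 2, "water": 3}

def text_to_layout(description, size=64):
    layout = torch.zeros(size, size, dtype=torch.long)             # default: sky
    if "mountain" in description:
        layout[size // 4: size // 2, :] = CLASSES["mountain"]      # background band
    if "tree" in description:
        layout[size // 2: 3 * size // 4, :] = CLASSES["tree"]      # midground band
    if "lake" in description or "water" in description:
        layout[3 * size // 4:, :] = CLASSES["water"]               # foreground band
    return layout

class LayoutToImage(nn.Module):
    """A trained model would map annotated label maps to realistic images."""
    def __init__(self, n_classes=len(CLASSES)):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(n_classes, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 3, 3, padding=1), nn.Sigmoid(),
        )

    def forward(self, layout):
        one_hot = nn.functional.one_hot(layout, len(CLASSES)).permute(2, 0, 1).float()
        return self.net(one_hot.unsqueeze(0))                      # (1, 3, H, W)

layout = text_to_layout("a lake surrounded by trees in the mountains")
image = LayoutToImage()(layout)
print(image.shape)   # (1, 3, 64, 64)
```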
Digital Avatars and Zoom Calls
Watch this short video to see how this technology could make Zoom calls a much better experience in the near future. In this scenario, a guy is being interviewed for a job via a Zoom call.
Brent Leary: What was cool about that is, at the end, he said that image of him was generated from one photo of him, and it was his voice. On the screen you could see the movement of the mouth. The audio quality was great, and he was sitting in a coffee shop, where there could be lots of sound going on, but we didn’t hear any of that sound.
Bryan Catanzaro: Yeah, well, we were really proud of that demo. I should also note that it won best in show at the SIGGRAPH conference this year, which is the biggest graphics conference in the world. That model was a generalized video synthesis model. We were talking earlier about how you can take a kind of stick-figure representation of a person and then animate it. Well, one of the limitations of models in the past is that you had to train an entirely new model for every situation. So let’s say if I’m at home, I have one model. If I’m in the coffee shop with a different background, I need another model. Or if you wanted to do this yourself, you would need one model for yourself in this place and another model for yourself in another place. Every time you create one of these models, you have to capture a dataset in that location, with maybe that set of clothes or those glasses on, and then spend a week on a supercomputer training a model. That’s really expensive, right? So most of us could never do that, and that would really limit the way this technology could be used.
I think the technical innovation behind that particular animation was that they came up with a generalized model that could work with basically anyone. You just have to provide one picture of yourself, which is cheap enough – anybody can do that, right? And if you go to a new location, or you’re wearing different clothes or glasses that day, you can just take a picture. And then the model, because it’s general, is able to resynthesize your appearance using just that one photo as a reference.
I think that’s pretty exciting. Now, later on in that video, they actually switched to a speech synthesis model as well. So what we heard in that clip was the main character speaking with his own voice, but later on the coffee shop gets so noisy that he ends up switching over to text. So he’s just typing, and the audio is being produced by one of our speech synthesis models.
I think giving people the opportunity to communicate in new ways only helps bring people closer together.
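The shift Catanzaro describes, from training a separate model for every person and location to one shared, generalized model driven by a single photo, can be sketched as an interface like the one below (a hypothetical illustration, not the demo’s actual code):

```python
# Hedged interface sketch (not the demo's code): one pretrained, generalized
# generator is shared by everyone; a single reference photo supplies the
# appearance, and per-frame pose heatmaps drive the animation.
import torch
import torch.nn as nn

class OneShotAvatar(nn.Module):
    """Trained once on many identities; never retrained for a new user."""
    def __init__(self, n_keypoints=17):
        super().__init__()
        self.appearance_encoder = nn.Conv2d(3, 32, 3, padding=1)   # reads the one photo
        self.renderer = nn.Sequential(
            nn.Conv2d(32 + n_keypoints, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 3, 3, padding=1), nn.Sigmoid(),
        )

    def forward(self, reference_photo, pose_heatmaps):
        appearance = self.appearance_encoder(reference_photo)      # identity + background features
        return self.renderer(torch.cat([appearance, pose_heatmaps], dim=1))

# At call time: snap one photo in today's clothes and background, then drive
# it frame by frame with pose keypoints tracked from the live camera feed.
model = OneShotAvatar()                  # shared, pretrained weights
photo = torch.rand(1, 3, 64, 64)         # the single enrollment picture
pose = torch.rand(1, 17, 64, 64)         # this frame's detected pose heatmaps
print(model(photo, pose).shape)          # (1, 3, 64, 64)
```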
Brent Leary: Conversational AI, how is that going to change how we communicate and collaborate in the years to come?
Bryan Catanzaro: The primary way humans communicate is through conversation, just like you and I are having right now, but it’s very difficult for humans to have a meaningful conversation with a computer, for a number of reasons. One is that it doesn’t feel natural, right? If it sounds like you’re speaking to a robot, that’s a barrier that inhibits communication. It doesn’t look like a person, it doesn’t react like a person, and obviously most of the systems that you and I have interacted with don’t understand what humans can understand. And so conversational AI in some ways is the ultimate AI challenge. In fact, you may be familiar with the Turing test. Alan Turing, who is considered by many to be the father of artificial intelligence, set conversational AI as the end goal of artificial intelligence.
Because if you have a machine that’s able to intelligently converse with a human, then you’ve basically solved any kind of intelligence question you can imagine, because any information, any wisdom, any idea that humans have created over the past many thousands of years has been expressed through language. Language is a general enough medium – it’s really the only way for humans to communicate complicated ideas. And if we’re able to make computers that can understand and communicate intelligently, and with low friction, so it actually feels like you’re interacting with a person, then I think we’ll be able to solve a lot of problems.
I think conversational AI is going to continue to be a focus of research from the entire industry for a long time. I think it is as deep a subject as all of human understanding and knowledge. If you and I were having a podcast on, let’s say Russian literature, there would be a lot of specialist ideas that someone with a PhD in Russian literature would be able to talk about better than I would, for example, right? So even amongst humans, our capabilities in various subjects are going to differ. And that’s why I think conversational AI is going to be a challenge that continues to engage us for the foreseeable future, because it really is a challenge to understand everything that humans understand. And we aren’t close to doing that.
This article, “Bryan Catanzaro of NVIDIA – Conversational AI in Some Ways is the Ultimate AI Challenge” was first published on Small Business Trends