Saturday, April 25, 2020

Deep learning ideas


Here are some ideas for amazing things that could be developed with deep learning, and that I will probably never have time to work on myself. So I'll just keep waiting for the deep learning experts to come up with those ideas themselves and solve them.

 

1. Phonetics-based speech synthesis

Currently existing text-to-speech applications take as input a text in a given (or auto-detected) language. The given language is used to select the appropriate trained model, which is then run to generate the sound. That model is therefore language-specific.
I do not know whether modern systems still go through the intermediate step of looking up the input words in a dictionary of phonemes before feeding those to the speech synthesis module, or whether the text is given directly as input to a neural network.
Let's assume it is the former, and that the representation of those phonemes is not language-specific but international (1), e.g. the IPA.
The idea is to package those tools into a program that expects as input a text given as a sequence of phonetic symbols, and outputs the synthesized speech.
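A minimal sketch of the front end such a program would need: splitting an IPA string into discrete phoneme tokens that a language-agnostic synthesis model could consume. This is my own toy illustration, not an existing tool, and the handling of diacritics is deliberately simplified (combining marks and the length mark are attached to the preceding symbol; stress marks are kept as their own tokens).

```python
import unicodedata

def tokenize_ipa(text):
    """Split an IPA string into phoneme tokens.

    Combining diacritics (Unicode category Mn, e.g. the nasalization
    tilde) and the length mark "\u02d0" are attached to the preceding
    base symbol; whitespace separates words and is discarded.
    """
    tokens = []
    for ch in text:
        if ch.isspace():
            continue
        if tokens and (unicodedata.category(ch) == "Mn" or ch == "\u02d0"):
            tokens[-1] += ch  # attach diacritic/length mark to previous symbol
        else:
            tokens.append(ch)
    return tokens
```

For example, the French word "bonjour" transcribed as /bɔ̃ʒuʁ/ yields five tokens, with the nasal vowel kept as a single unit. A real front end would also need to handle affricates, ties, and suprasegmentals, which this sketch ignores.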

Such a program would make it possible to automatically generate speech for user-defined languages, either lesser-known languages or entirely new ones. This would be useful for language-learning apps, which could generate audio for each word without requiring the course creator to record audio files themselves, and it would also reduce storage space. It would likewise be applicable to generating audio samples for newly constructed languages, and could for example be used in a video game where the characters speak an imaginary language.

 

2. Photorealistic image from drawing

Nowadays, deep photo style transfer is a very popular problem and its solutions are pretty good. We pick a photo and a drawing or painting style, and we obtain the restyled picture.

The reverse problem is mentioned less frequently: based on a painting or a line drawing, we would like to generate a photorealistic image.
In 2017, the app Pix2Pix went viral, letting you generate a face with photorealistic textures from a simple drawing made of black lines.
In 2019, NVidia released GauGAN, which generates photorealistic landscapes from areas of flat colors representing various types of terrain.

Now, we would like a generalization of NVidia's work: the algorithm should learn to recognize landscape elements and objects from a painting or a drawing, without any previously agreed-upon color code.
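To make the "agreed-upon color code" concrete, here is a toy sketch in the spirit of GauGAN's segmentation-map input. The palette and labels are my own illustrative choices: each flat-color pixel is decoded into a semantic label by nearest-color lookup. This fixed convention is precisely what the generalized algorithm would have to do without.

```python
# Hypothetical color code: flat colors -> terrain labels (illustrative only).
PALETTE = {
    (0, 0, 255): "water",
    (0, 128, 0): "forest",
    (135, 206, 235): "sky",
    (139, 69, 19): "mountain",
}

def label_map(image):
    """Map each RGB pixel of a 2D grid to the nearest palette label
    by squared Euclidean distance in RGB space."""
    def nearest(px):
        return min(PALETTE, key=lambda c: sum((a - b) ** 2 for a, b in zip(c, px)))
    return [[PALETTE[nearest(px)] for px in row] for row in image]
```

A slightly off-palette blue like (0, 0, 250) still decodes to "water"; the generalized system would instead have to infer "water" from brush strokes and context alone.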

 

3. Description-based image synthesis

The solutions to the problem of describing images are getting pretty good.

What is quite unheard of, however, is the reverse problem: synthesizing a photorealistic image based on a text description. I would very much like to see what GANs could come up with to solve this.

 

4. 3D scene from 2D image

Rendering a 3D scene onto a 2D image is an extremely common problem, arising every time a frame has to be rendered in a video game or in a 3D animated movie.

The reverse problem is more challenging, and some work exists for it too. Existing solutions involve depth estimation, possibly supplemented by some background filling, for example to create a Ken Burns effect out of a still picture. If we allow more than one input picture, the NeRF algorithm published in March 2020 showed some stunning results. (2)
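The geometry underlying these depth-based approaches can be sketched with the standard pinhole camera model: once a depth has been estimated for each pixel, every pixel can be lifted back into a 3D point. This is a minimal sketch assuming known intrinsics (focal lengths fx, fy and principal point cx, cy), not a reconstruction of any particular paper's method.

```python
def backproject(u, v, depth, fx, fy, cx, cy):
    """Lift pixel (u, v) with estimated depth (in meters) into a 3D
    camera-space point via the pinhole camera model:
    x = (u - cx) * z / fx,  y = (v - cy) * z / fy,  z = depth."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)

def depth_to_points(depth_map, fx, fy, cx, cy):
    """Turn a dense depth map (2D list of depths) into a point cloud."""
    return [
        backproject(u, v, d, fx, fy, cx, cy)
        for v, row in enumerate(depth_map)
        for u, d in enumerate(row)
    ]
```

A pixel at the principal point maps straight onto the optical axis, and the point cloud only covers surfaces visible from the camera, which is exactly why the "other side" of the scene has to be hallucinated.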
But all this research mostly focuses on generating just one side of the scene. In December 2019, NVidia published a paper looking at a different aspect of the problem: identifying the main object in the picture and generating a 3D model of it, textured on all sides.

I would like an AI that mixes these two approaches and takes even more liberties, filling in any missing parts of the scene, maybe even deep-dreaming a 360-degree panorama, so that the 3D scene can be viewed from all sides.

 

5. Description-based 3D scene synthesis

Finally, if the problems of the last two sections are solved, then put them together and you can synthesize an entire 3D scene based solely on a text description. Alternatively, it would probably be a better idea to bypass the 2D step, allowing the information contained in the description to influence the content of the 3D scene directly.
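The chaining itself is trivial once the two pieces exist. In this sketch the two stages are hypothetical stand-ins (real systems would be large neural networks); the point is only that composing them gives a text-to-scene function, and that a direct model could expose the same interface while skipping the intermediate 2D image.

```python
# Hypothetical stand-ins for the models of sections 3 and 4.
def image_from_text(description):
    """Stage 1 (section 3): text description -> 2D image."""
    return f"image({description})"

def scene_from_image(image):
    """Stage 2 (section 4): 2D image -> 3D scene."""
    return f"scene({image})"

def scene_from_text(description):
    """Chained pipeline: text -> 2D image -> 3D scene. A direct
    text -> 3D model would replace this whole composition."""
    return scene_from_image(image_from_text(description))
```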
Congratulations! You have now become a world creator, akin to Atrus creating new ages from the tip of his quill in the Myst game series.

 

6. Summary

To sum up, AIs must be able to perform any of the conversions indicated by the arrows in the graph below:
 

7. Another dimension

You can even expand all the previous problems into another dimension: time. The problems that were dealing with images will then deal with videos, and 3D scenes will include 3D animations. This is also the topic of a lot of ongoing research.


1. These are definitely wrong assumptions, because the speech synthesis module needs to be informed of what language it is reading, at least to use a correct intonation and all the other little details that will make it sound less like a robot.
2. But then we fall into the category of photogrammetry, which is a whole different ball game.
