Generating random Turkish first names with a super basic RNN

Inspired by the mini “Dinosaur name generator” project I did a while ago, I wanted to re-use the core code (with some modifications and refactorings of my own) and try to generate Turkish first names. The project was about building a simple recurrent neural network using numpy and it was a part of the Sequence Models course of Andrew Ng’s awesome Deep Learning Specialization series on Coursera.

Main Flow

The project builds a simple character-level language model to generate new first names. Roughly, the steps are to:

1. Initialize the weights
2. Forward propagate and calculate the loss
3. Backward propagate and calculate the gradients, using the loss
4. Clip the gradients with a max value, so that they don’t explode
5. Update the weights using the gradients
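The five steps above can be sketched in numpy roughly like this. This is a minimal illustration and not the code from my repo: the names `init_params` and `train_step`, the hidden size, the learning rate and the clipping value are all assumptions, and `None` in the input simply stands for an all-zero first input vector.

```python
import numpy as np

def init_params(n_hidden, vocab_size, seed=0):
    # Step 1: small random weights, zero biases (hypothetical helper).
    rng = np.random.default_rng(seed)
    return {
        "Wax": rng.standard_normal((n_hidden, vocab_size)) * 0.01,
        "Waa": rng.standard_normal((n_hidden, n_hidden)) * 0.01,
        "Wya": rng.standard_normal((vocab_size, n_hidden)) * 0.01,
        "ba": np.zeros((n_hidden, 1)),
        "by": np.zeros((vocab_size, 1)),
    }

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def train_step(params, xs, ys, lr=0.01, max_grad=5.0):
    """One pass over a single name. xs and ys are lists of character
    indices, where ys is xs shifted by one position. A None entry in xs
    means "all-zero input vector"."""
    Wax, Waa, Wya, ba, by = (params[k] for k in ("Wax", "Waa", "Wya", "ba", "by"))
    vocab_size = Wax.shape[1]
    a = {-1: np.zeros((Waa.shape[0], 1))}
    x, p = {}, {}
    loss = 0.0

    # Step 2: forward propagation and cross-entropy loss.
    for t, (ix, iy) in enumerate(zip(xs, ys)):
        x[t] = np.zeros((vocab_size, 1))
        if ix is not None:
            x[t][ix] = 1
        a[t] = np.tanh(Wax @ x[t] + Waa @ a[t - 1] + ba)
        p[t] = softmax(Wya @ a[t] + by)
        loss -= np.log(p[t][iy, 0])

    # Step 3: backward propagation through time.
    grads = {k: np.zeros_like(v) for k, v in params.items()}
    da_next = np.zeros_like(a[0])
    for t in reversed(range(len(xs))):
        dy = p[t].copy()
        dy[ys[t]] -= 1                    # gradient of softmax + cross-entropy
        grads["Wya"] += dy @ a[t].T
        grads["by"] += dy
        da = Wya.T @ dy + da_next
        dz = (1 - a[t] ** 2) * da         # derivative of tanh
        grads["Wax"] += dz @ x[t].T
        grads["Waa"] += dz @ a[t - 1].T
        grads["ba"] += dz
        da_next = Waa.T @ dz

    # Step 4: clip every gradient element into [-max_grad, max_grad].
    for g in grads.values():
        np.clip(g, -max_grad, max_grad, out=g)

    # Step 5: plain gradient descent update.
    for k in params:
        params[k] -= lr * grads[k]
    return loss
```

Calling `train_step` repeatedly on the same short sequence should make the returned loss go down, which is a quick sanity check that the forward and backward passes agree with each other.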

The input we provide is the sequence of characters of each name in the dataset, and the target output at every time step is the next character. When generating a name, the recurrent neural network picks a character at each time step by sampling from a probability distribution, given the previous character (and the hidden state carried over from earlier steps).
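To make the "picking a character according to a probability distribution" part concrete, here is a hedged sketch of the sampling loop. The function name `sample_name`, the parameter names, and the choice of a newline character as the end-of-name marker are assumptions for illustration. The demo at the bottom uses untrained random weights, so its output is gibberish.

```python
import numpy as np

def sample_name(params, chars, end_ix, max_len=30, seed=None):
    """Sample one name, character by character.
    chars maps an index to a character; end_ix marks the end of a name."""
    rng = np.random.default_rng(seed)
    Wax, Waa, Wya = params["Wax"], params["Waa"], params["Wya"]
    ba, by = params["ba"], params["by"]
    vocab_size = Wax.shape[1]

    x = np.zeros((vocab_size, 1))         # first input: all zeros
    a = np.zeros((Waa.shape[0], 1))       # initial hidden state
    picked = []
    for _ in range(max_len):
        a = np.tanh(Wax @ x + Waa @ a + ba)
        z = Wya @ a + by
        p = np.exp(z - z.max())
        p /= p.sum()                      # softmax: probabilities over characters
        ix = rng.choice(vocab_size, p=p.ravel())
        if ix == end_ix:
            break
        picked.append(chars[ix])
        x = np.zeros((vocab_size, 1))     # the picked character is the next input
        x[ix] = 1
    return "".join(picked)

# Demo with untrained, random weights (illustrative only).
chars = ["\n"] + list("abcçdefgğhıijklmnoöprsştuüvyz")
demo_rng = np.random.default_rng(0)
n_hidden, vocab_size = 16, len(chars)
demo_params = {
    "Wax": demo_rng.standard_normal((n_hidden, vocab_size)) * 0.01,
    "Waa": demo_rng.standard_normal((n_hidden, n_hidden)) * 0.01,
    "Wya": demo_rng.standard_normal((vocab_size, n_hidden)) * 0.01,
    "ba": np.zeros((n_hidden, 1)),
    "by": np.zeros((vocab_size, 1)),
}
print(sample_name(demo_params, chars, end_ix=0, max_len=12, seed=1))
```

Sampling (instead of always taking the most likely character) is what makes every generated name different.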

Preprocessing the data

I wanted to try this model with two approaches:

1. Get the most common 10 thousand names within the whole dataset (so that the dataset consists of unique names)
2. Get 3 million names, regardless of their frequency (so that the frequent names appear more in the dataset)

For both approaches, I needed to clean the data first because some of the names contained characters like dots, parentheses, hyphens, digits and whatnot.

There were also some non-Turkish names in the dataset, so I removed names that contain “x”, “w” or “q”, as these letters are not in the Turkish alphabet.

Lastly, there were a lot of first names consisting of two or three names. Since I didn’t want my model to generate names consisting of two or more names, I split these into single ones. While processing the names one by one, I added each split name to my list of names only if it hadn’t already been added in an earlier iteration.

There were a lot of names like this in both my 10K and 3m datasets. So after reducing them to single names, I ended up with only ~700 and ~5000 distinct names respectively.
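The cleaning steps described above could look roughly like the sketch below. It is not the exact code I used: the function names are made up, and two details are assumptions. First, a whole entry is skipped if it contains “x”, “w” or “q”. Second, a part containing anything other than letters is dropped rather than repaired. The `turkish_lower` helper exists because Python’s default `lower()` does not follow Turkish casing rules for “I” and “İ”.

```python
FOREIGN_LETTERS = set("xwq")  # letters that are not in the Turkish alphabet

def turkish_lower(text):
    # In Turkish, "I" lowercases to "ı" and "İ" lowercases to "i".
    return text.replace("İ", "i").replace("I", "ı").lower()

def extract_single_names(raw_names):
    """Clean raw entries, split multi-part names, keep each name once."""
    seen = set()
    result = []
    for raw in raw_names:
        entry = turkish_lower(raw)
        if FOREIGN_LETTERS & set(entry):
            continue                      # assumed: skip the whole entry
        for part in entry.split():        # "ayşe nur" -> "ayşe", "nur"
            if not part.isalpha():
                continue                  # drops parts with dots, digits, hyphens, ...
            if part not in seen:          # only add a name the first time we see it
                seen.add(part)
                result.append(part)
    return result

raw = ["Ayşe Nur", "Mehmet", "AYŞE", "Max", "Ali-Can", "Işıl (2)", "İbrahim"]
print(extract_single_names(raw))
# ['ayşe', 'nur', 'mehmet', 'ışıl', 'ibrahim']
```

Keeping a `seen` set next to the result list preserves the original order while making the "has it been added before?" check cheap.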

Getting predictions

I wanted to see how familiar the generated names would look when the model is trained on the most common Turkish names (10K dataset), and when it is trained on names that mostly come from the long tail (3m dataset).

So I took 10 different output samples from each model.

10K dataset	3m dataset
Sin, Badir, Sedef, Akmul, Salan, Betad, Behan, Siga, Kanefi, Sac	Bevit, Ezer, Kumul, Zarit, Semiye, Gursap, Hukdeye, Sevar, Tahiye, Ozar
Vuran, Bahir, Urdi, Akir, Sam, Curec, Celkemi, Sula, Kari, Saba	Goley, Niyur, Menan, Apmat, Mindorlah, Azda, Sulki, Guka, Kime, Ergi
Radugi, Dukdur, Maha, Zusuf, Mayih, Nefah, Aysu, Selsem, Gabsun, Kahat	Zuraye, Rettime, Sengelli, Gulcime, Sortedi, Sugon, Zaki, Memra, Kunni, Hucan
Bercan, Habbat, Nultan, Ayiz, Yurev, Havgi, Bakes, Yaydil, Melif, Kalis	Gungu, Aybeden, Yavrun, Samine, Ruydar, Arnur, Hazti, Vuteton, Cahan, Talize
Baytul, Sicen, Ekep, Ale, Murgal, Kerif, Furem, Zihan, Kubza, Luse	Tekayu, Cabilatviye, Feyhan, Badabed, Sasir, Gugiki, Guhteviz, Sehhun, Sakser, Edana
Becer, Hase, Gahat, Betri, Sidar, Erzan, Gulsul, Benan, Kevay, Efere	Uzatil, Iren, Buhneyan, Sefdan, Vezhap, Fimnan, Bikan, Hirdar, Havdin, Hadrure
Suray, Bali, Guhan, Sele, Bezih, Zaza, Zacet, Saya, Dasen, Sevdam	Kurt, Iyarlan, Kagiy, Kiysan, Samay, Bervan, Safi, Fur, Zetir, Biheti
Beyfa, Fahat, Dula, Zatkun, Semelten, Gondur, Gulerun, Buysut, Gukpek, Alpfehatengel	Gocdan, Buhac, Sabip, Belet, Iben, Buhise, Aran, Semum, Sencin, Ceni
Aseye, Emra, Fera, Oyas, Zumul, Azel, Fular, Guli, Saner, Fasi	Urman, Erikmar, Buhan, Rahif, Hadul, Rosen, Daban, Hana, Rani, Zizasel
Cevli, Seylut, Sayih, Gece, Kaki, Dilsin, Semi, Aser, Sabil, Boylu	Heri, Soket, Serme, Selzikcan, Nurlay, Harfir, Seyyulga, Zeyar, Dilal, Gunsan

My thoughts

For me, the results from the model trained on the 10K dataset sound more familiar. That’s probably because I have only heard a limited number of Turkish names in my lifetime, and when looking at the whole dataset, there are many names which feel weird / funny to me :) Naturally, these weird-sounding Turkish names are more common in the 3m dataset, so what that model generates may sound more like them.

What also made me happy is that all the generated names are at least pronounceable in Turkish. Normally in Turkish, no more than 2 consonants appear next to each other. A vowel is usually followed by a consonant and a consonant is usually followed by another consonant or a vowel. The model seems to have followed this rule in a high percentage of the names it generated, so I’m at least interpreting this as a good result.
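The "no more than 2 consonants in a row" rule is easy to check mechanically. Here is a small sketch, assuming the helper name `longest_consonant_run` and treating every letter that is not one of the eight Turkish vowels as a consonant. The sample names are taken from the tables above.

```python
TURKISH_VOWELS = set("aeıioöuü")

def longest_consonant_run(name):
    """Length of the longest stretch of consecutive consonants in a name."""
    longest = run = 0
    for ch in name.lower():
        if ch.isalpha() and ch not in TURKISH_VOWELS:
            run += 1
            longest = max(longest, run)
        else:
            run = 0                       # a vowel (or non-letter) ends the run
    return longest

for name in ["Sedef", "Kurt", "Mindorlah", "Alpfehatengel"]:
    print(name, longest_consonant_run(name))
# Sedef 1
# Kurt 2
# Mindorlah 2
# Alpfehatengel 3
```

Running this over all the generated names would give the share of names that stay within the limit, instead of judging it by eye.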

And the most fun part of doing this project was laughing out loud at the funny names in the dataset and the names my model generated. So at the very least, having a good laugh was worth the time I spent on this :]

If you’re curious about the implementation, head over to my repo on GitHub and check out the details ✌️

Published Jun 6, 2018

deep-learning

atoms with consciousness matter with curiosity
Aslı Sabancı on Twitter