A Journey Into Deep Learning - II. NLP


NLP for Text Generation with RNNs

What is NLP?

Natural Language Processing (NLP) extracts useful information from a given text or sentence using machine learning and deep learning techniques. These techniques require the text to be converted into a set of real numbers (a vector).

The process of converting text into numbers is called vectorization.

Great summary in Prabhu’s blog.


How do we generate useful text?

The course demonstrates how to use a recurrent neural network (RNN) to generate text. Generation can be done character by character. See Andrej Karpathy's blog for details.

Step 0 - Importing main libraries

These libraries are used throughout:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import tensorflow as tf

Step 1 - import text & understand the characters

Free text is available here

Importing text:

path_to_file = 'FILENAME.txt'
# Use a context manager so the file is closed after reading
with open(path_to_file, 'r') as f:
    text = f.read()
print(text[:1000])

Understanding the characters:

vocab = sorted(set(text))
print(vocab)
len(vocab)

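Note: set() accepts any iterable (a list, tuple, dictionary, or string) and returns its unique elements. Here it yields the unique characters in the text, and sorted() puts them in a fixed order.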
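As a quick illustration of how the vocabulary is built, here is a minimal sketch on a tiny string. The names sample_text and sample_vocab are hypothetical stand-ins for the real corpus and vocabulary, not part of the course code.

```python
# Hypothetical sample string standing in for the real corpus
sample_text = "hello world"

# set() keeps one copy of each character; sorted() gives a stable order
sample_vocab = sorted(set(sample_text))

print(sample_vocab)       # [' ', 'd', 'e', 'h', 'l', 'o', 'r', 'w']
print(len(sample_vocab))  # 8
```

The space sorts first because its character code is lower than any letter's.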

Step 2 - text processing

Because a neural network can’t work with string data, the text must be converted into numbers. This section creates two lookups: one that goes from character to numeric index and one that goes from numeric index back to character.

char_to_ind = {char:ind for ind, char in enumerate(vocab)}
char_to_ind['H']
#33

ind_to_char = np.array(vocab)
ind_to_char[33]
#'H'

Note: enumerate is a built-in function of Python, which allows us to loop over something and have an automatic counter.
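To make the counter behaviour concrete, here is a small sketch using a hypothetical three-character vocabulary (demo_vocab and demo_char_to_ind are illustrative names only):

```python
# Hypothetical miniature vocabulary, for illustration only
demo_vocab = ['a', 'b', 'c']

# enumerate yields (index, item) pairs, counting from 0
pairs = list(enumerate(demo_vocab))
print(pairs)  # [(0, 'a'), (1, 'b'), (2, 'c')]

# The same pairs, flipped into a character -> index dictionary
demo_char_to_ind = {char: ind for ind, char in enumerate(demo_vocab)}
print(demo_char_to_ind)  # {'a': 0, 'b': 1, 'c': 2}
```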

Create the mapping so we can go back and forth between characters and numbers:

# Encode the whole text as a NumPy array of integer indices
encoded_text = np.array([char_to_ind[c] for c in text])
# First 20 characters of the text
sample = text[:20]
# The matching numeric indices for those 20 characters
encoded_text[:20]
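To check that the two lookups really are inverses, the encoding can be decoded back into text. The following is a minimal sketch on a hypothetical short string (demo_text and the other demo_ names are illustrative), assuming the same dictionary and array construction as above:

```python
import numpy as np

# Hypothetical short string standing in for the real corpus
demo_text = "hello world"
demo_vocab = sorted(set(demo_text))

# Character -> index, and index -> character (array position is the index)
demo_char_to_ind = {char: ind for ind, char in enumerate(demo_vocab)}
demo_ind_to_char = np.array(demo_vocab)

# Encode: one integer per character
demo_encoded = np.array([demo_char_to_ind[c] for c in demo_text])
print(demo_encoded)  # [3 2 4 4 5 0 7 5 6 4 1]

# Decode: index the character array with the integer array, then join
demo_decoded = "".join(demo_ind_to_char[demo_encoded])
print(demo_decoded)  # hello world
```

Storing the index-to-character lookup as a NumPy array lets a whole sequence of indices be decoded in one indexing operation.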

Step 3 - create batches