Abstract. Word embeddings are important for many NLP tasks. A word embedding maps each word in a dictionary to a vector. The skip gram model is one model for learning word embeddings. It tries to predict the surrounding words within a certain distance of the current word; that is, it predicts the context from a given word. Words that occur in similar contexts tend to have similar meanings, so the model can capture semantic relationships between words. This paper explains word embedding with the skip gram model, covering its architecture and implementation.

Keywords: Word embedding; Skip gram model.



The web has a vast vocabulary. Each word contributes subjective and objective meaning to a sentence, and every word can be understood differently depending on the situation or context. With the rapid growth of Natural Language Processing (NLP) tasks, there is a need to consider all words and the relationships between them, such as synonyms and antonyms, based on context. Deep learning methodologies are increasingly considered in research in place of traditional machine learning methodologies. Deep learning uses a neural network structure whose basic element is the neuron, and with these neurons a large amount of data can be processed at once. For embedding a large number of words, the skip gram model therefore offers a good implementation.

Word Embeddings

All words are represented as vectors of numbers, i.e., the text is converted into numbers. This is done because a model cannot process plain text, strings, or words in raw form. Word embedding maps each word in a dictionary to a vector. The meaning of a word can be approximated by the set of contexts in which it occurs, and words with similar vectors are semantically similar. The input representation is a one-hot encoded vector, in which 1 marks the position of the word in the vocabulary and 0 appears everywhere else. Starting from these vectors, the model learns embeddings that encode the semantic relationships among words.

Word embeddings support tasks such as measuring the degree of similarity between two words, finding the odd one out, and estimating the probability of a text under a document. Applications of word embeddings include machine translation, sentiment analysis, named entity recognition, and chatbots.
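As an illustration of the similarity task mentioned above, the sketch below scores pairs of word vectors with cosine similarity. Cosine similarity is one common choice and is an assumption here, since the text does not name a measure; the three-dimensional vectors are made-up values, not trained embeddings.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical 3-dimensional embeddings, invented for illustration only.
vec_cat = [0.9, 0.1, 0.3]
vec_dog = [0.8, 0.2, 0.4]
vec_tree = [0.1, 0.9, 0.2]

sim_cat_dog = cosine_similarity(vec_cat, vec_dog)
sim_cat_tree = cosine_similarity(vec_cat, vec_tree)
```

With these invented values, "cat" scores closer to "dog" than to "tree", which is the kind of ordering a trained embedding is expected to produce for words that share contexts.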

Skip gram model

The skip gram model is built for learning word embeddings. It tries to predict the surrounding words within a certain distance of the current word. The idea behind the model is to take a word and predict all the related contextual words; in short, its aim is to predict the context when a word is given.

The skip gram model uses a simple neural network with a single hidden layer. The main intuition is that, given a word w at the kth position within a sentence, the model tries to predict the most probable surrounding context. The word is represented by its index i within the vocabulary V and fed into a projection layer that turns this index into a continuous vector, given by the ith row of the layer's weight matrix.

The skip gram model belongs to the prediction-based family of word vector methods. It works well with small amounts of training data, and infrequent words are well represented. Words occurring in similar contexts tend to have similar meanings, so the model can capture semantic relationships between words. In this sense the model resembles a simple logistic regression (softmax) model.


For this model, all words in the vocabulary must be distinct. These distinct words are fed into the input layer of the model. The number of nodes in the hidden layer is the dimensionality of the word vectors. The hidden layer is represented by a weight matrix with one row for every word in the vocabulary and one column for every hidden neuron (i.e., dimension). The rows of this weight matrix are the actual word vectors, so evaluating the hidden layer amounts to a table lookup. The output layer is a softmax regression classifier: each output neuron, one per word in the vocabulary, produces a value between 0 and 1, and these values sum to 1. In the skip gram model the target word is fed in at the input, the hidden layer remains the same, and the output layer is replicated multiple times to accommodate the chosen number of context words.

Fig. 1: Skip gram model architecture


All unique words in the vocabulary are given to the input layer. We select a central word to perform the mapping. For the selected central word, a search is performed for the nearest words in the sequence, i.e., semantically or logically related words. The input to the network is encoded using a “1-out-of-V” representation, meaning that only one input line is set to one and the rest are set to zero.

3.1 Implementation

Simple steps involved in implementing the skip gram model:
(i) Build a corpus and vocabulary: any text dataset can serve as the corpus. The vocabulary, which is like a dictionary of all distinct words in the corpus, should be arranged in alphabetical order. It serves as a look-up table for mapping words to their positions.
(ii) Build a skip-gram generator that produces pairs of the form (target, context): here the target is the word whose neighboring words we want to find, and those neighbors are the context words.
(iii) Build the skip gram model architecture, so that the pairs from the generator can be passed to the input layer to obtain the related context words at the output layer.
(iv) Train the model, so that it keeps working as new words are added.
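The four steps can be sketched end to end in plain Python. This is a minimal illustration under stated assumptions, not a reference implementation: the three sentences are the ones used in the example section of this paper, while the window of 2 words on each side, the 3 hidden neurons, the learning rate of 0.1, the 100 epochs, the random initialisation, and the use of plain stochastic gradient descent with a full softmax are all choices made here for illustration.

```python
import math
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# (i) Corpus and alphabetically ordered vocabulary.
corpus = ["the dog saw a cat", "the dog chased the cat", "the cat climbed a tree"]
tokens = [sentence.split() for sentence in corpus]
vocab = sorted({word for sent in tokens for word in sent})
index = {word: i for i, word in enumerate(vocab)}

# (ii) (target, context) index pairs within an assumed window of 2 words.
WINDOW = 2
pairs = []
for sent in tokens:
    for i, target in enumerate(sent):
        for j in range(max(0, i - WINDOW), min(len(sent), i + WINDOW + 1)):
            if j != i:
                pairs.append((index[target], index[sent[j]]))

# (iii) Two weight matrices: input->hidden (V x N) and hidden->output (N x V).
V, N = len(vocab), 3
W_in = [[random.uniform(-0.5, 0.5) for _ in range(N)] for _ in range(V)]
W_out = [[random.uniform(-0.5, 0.5) for _ in range(V)] for _ in range(N)]

def forward(t):
    """Return the hidden vector and softmax probabilities for target index t."""
    h = W_in[t]  # the hidden layer is a row lookup
    scores = [sum(h[k] * W_out[k][j] for k in range(N)) for j in range(V)]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return h, [e / total for e in exps]

def train_epoch(lr):
    """One pass of stochastic gradient descent; returns the mean cross-entropy loss."""
    loss = 0.0
    for t, c in pairs:
        h, probs = forward(t)
        loss -= math.log(probs[c])
        # Gradient of the loss with respect to the scores: probs - one_hot(c).
        err = [p - (1.0 if j == c else 0.0) for j, p in enumerate(probs)]
        # Gradient for the hidden vector, computed before W_out is changed.
        grad_h = [sum(W_out[k][j] * err[j] for j in range(V)) for k in range(N)]
        for k in range(N):
            for j in range(V):
                W_out[k][j] -= lr * h[k] * err[j]
        for k in range(N):
            W_in[t][k] -= lr * grad_h[k]
    return loss / len(pairs)

# (iv) Train and keep the loss history; rows of W_in are the word vectors.
losses = [train_epoch(lr=0.1) for _ in range(100)]
embeddings = {word: W_in[index[word]] for word in vocab}
```

After training, `embeddings` maps each vocabulary word to its row of the input weight matrix, and `losses` can be inspected to check that the mean loss has fallen. A corpus this small only demonstrates the mechanics; practical implementations usually replace the full softmax with a cheaper approximation.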

The input matrix is represented as:

W11 W12 W13
W21 W22 W23
W31 W32 W33
W41 W42 W43
W51 W52 W53

W11 – weight of the connection from input node w1 to hidden node h1
W12 – weight of the connection from input node w1 to hidden node h2

The function of the input-to-hidden connection is essentially to copy the input word vector to the hidden layer. We define a window called the “skip-window”, which is the number of words to move backward and forward from the selected word. The input words are converted to a numerical representation.

The output matrix is represented as:

W’11 W’12 W’13 W’14 W’15
W’21 W’22 W’23 W’24 W’25
W’31 W’32 W’33 W’34 W’35

W’11 – weight of the connection from hidden node h1 to output node O11
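The way the two matrices are used can be sketched as follows, with the same shapes as above: a 5 x 3 input matrix and a 3 x 5 output matrix. All weight values are invented for illustration, and the helper name `vec_times_matrix` is hypothetical. The sketch shows that multiplying a one-hot input by the input matrix simply selects one row, and that the output matrix then yields one raw activation per vocabulary word.

```python
# Hypothetical weights for a 5-word vocabulary and 3 hidden neurons.
W = [  # input -> hidden, one row per vocabulary word (5 x 3)
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
    [1.0, 1.1, 1.2],
    [1.3, 1.4, 1.5],
]
W_prime = [  # hidden -> output, one column per vocabulary word (3 x 5)
    [1.0, 0.0, 0.0, 1.0, 0.5],
    [0.0, 1.0, 0.0, 1.0, 0.5],
    [0.0, 0.0, 1.0, 1.0, 0.5],
]

def vec_times_matrix(x, M):
    """Multiply row vector x by matrix M."""
    return [sum(x[i] * M[i][j] for i in range(len(x))) for j in range(len(M[0]))]

x = [0, 1, 0, 0, 0]                          # one-hot input selecting word w2
hidden = vec_times_matrix(x, W)              # same values as the second row of W
scores = vec_times_matrix(hidden, W_prime)   # one raw activation per output word
```

Because the product with a one-hot vector only picks out a row, an implementation can skip the multiplication and index the row directly, which is the lookup-table behaviour of the hidden layer.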

Evaluation/ Example

Consider the sentences “the dog saw a cat”, “the dog chased the cat”, and “the cat climbed a tree”. The corpus vocabulary has eight words when ordered alphabetically. Our eight words are: a, cat, chased, climbed, dog, saw, the, tree.
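The vocabulary for these three sentences can be built with a few lines of Python. This is a small sketch that assumes whitespace tokenisation is sufficient for such short sentences; the variable names are illustrative.

```python
sentences = ["the dog saw a cat", "the dog chased the cat", "the cat climbed a tree"]

# Collect the distinct words and sort them alphabetically.
vocabulary = sorted({word for s in sentences for word in s.split()})

# Map each word to its 1-based position in the vocabulary.
position = {word: i for i, word in enumerate(vocabulary, start=1)}
```

Here `vocabulary` holds the eight distinct words in alphabetical order, and `position` gives the slot that each word occupies in a one-hot vector.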


The skip gram generator output for the sentence “the dog saw a cat” is (the, dog), (the, saw), (the, a), (the, cat), (dog, saw), (dog, a), (dog, cat), (saw, a), (saw, cat), (a, cat). The same format is generated for the other sentences.
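The ten pairs listed for this sentence pair each word with every word that follows it. A short sketch that reproduces the list, using the standard library's `itertools.combinations`, is given below; treating the generator this way is an interpretation of the listed pairs rather than a stated definition.

```python
from itertools import combinations

sentence = "the dog saw a cat".split()

# Pair each word with every word that follows it in the sentence.
pairs = list(combinations(sentence, 2))
```

A generator that also looks backward from the target word would additionally emit the reversed pairs, such as (dog, the).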

From the built corpus, let the input word be “cat” and the target word be “climbed”. The input vector is then 0 1 0 0 0 0 0 0 and the output vector is 0 0 0 1 0 0 0 0.
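These two vectors can be checked with a small sketch, assuming the alphabetically ordered vocabulary of the three example sentences. The function name `one_hot` is illustrative.

```python
vocabulary = ["a", "cat", "chased", "climbed", "dog", "saw", "the", "tree"]

def one_hot(word, vocab):
    """Return a list with 1 at the word's vocabulary position and 0 elsewhere."""
    return [1 if w == word else 0 for w in vocab]

input_vector = one_hot("cat", vocabulary)
output_vector = one_hot("climbed", vocabulary)
```

“cat” is the second word and “climbed” the fourth, so the single 1 falls in the second and fourth positions respectively.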

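The output of the kth neuron is computed as:

$$y_k = \Pr(\text{word}_k \mid \text{word}_{\text{context}}) = \frac{\exp(\text{activation}(k))}{\sum_{n=1}^{V} \exp(\text{activation}(n))}$$

where activation(n) is the activation value of the nth output-layer neuron and V is the number of words in the vocabulary.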
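The output computation can be sketched as a small softmax function. The activation values below are invented for an eight-word vocabulary, and subtracting the maximum before exponentiating is a standard numerical safeguard added here; it does not change the result.

```python
import math

def softmax(activations):
    """Turn raw output activations into probabilities that sum to 1."""
    shift = max(activations)  # subtracting the max avoids overflow in exp
    exps = [math.exp(a - shift) for a in activations]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical activations, one per word of an eight-word vocabulary.
activations = [0.2, 1.1, -0.3, 2.0, 0.0, 0.5, -1.0, 0.7]
probabilities = softmax(activations)
```

Each entry of `probabilities` lies between 0 and 1, and the largest activation receives the largest probability.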


Thus we obtain the target words, which are the words related in context to the selected input word. The skip gram model therefore helps to embed words that share similar contexts. It can capture two semantics for a single word, and it can be used for multi-domain sentiment analysis. The model works well with a small amount of training data, even for rare words or phrases. Word embeddings are widely used in NLP tasks today, and they are used to find better word representations than existing ones.