Defining Deep Learning, Part 4: Understanding Text (Deeply!)
We help our customers discover and reach the right audiences for their ABM campaigns. We do this by deploying AI models that understand their accounts: the personas and buying groups that populate these targeted organizations, the kinds of products and services they work with, and their roles and capacities within their hierarchies and departments.
That way, we’re able to take more than just their job titles into account. That three-dimensionality is crucial; as we’ve mentioned before, job titles are tricky and can be misleading as hell if an ABM marketer isn’t careful.
Much of the useful information on the web is raw, unstructured data in the form of text, links and image media. IDC estimates unstructured data accounts for over 95% of all digital content and is predicted to grow exponentially; even within the “controlled” confines of an enterprise, 90% of data is unstructured.
Until the last 5 years or so, it was infeasible to uncover topics and emotions across the web without powerful computing resources. Engineers didn’t have efficient methods to make sense of words and documents at a large scale. Now, with deep learning, we can convert unstructured text to computable formats, effectively incorporating semantic knowledge for training machine learning models. Harnessing the vast data troves of the digital world can help us understand people more directly, going beyond the limitations of collecting data points through measurements and survey results.
For each individual in our data set, we compile a profile of attributes and text across multiple data sources. These attributes can be treated as a row consisting of numbers and categories, known as a vector. Each entry or column is a separate dimension. Different vectors of the same dimensions can be compared using linear algebra (the mathematics of vector spaces and linear functions). If we can represent every person’s text as a fixed-length vector and append it to the attribute vector (concatenation), then each person can be represented in a “universal” format that can be processed by computer algorithms.
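As a minimal sketch of that concatenation step (the attribute values and the text embedding below are made up for illustration), the “universal” vector for a person might be built like this:

```python
import numpy as np

# Hypothetical structured attributes (numbers and encoded categories) for one person.
attributes = np.array([3.0, 1.0, 0.0, 12.5])
# Hypothetical fixed-length embedding of that person's profile text.
text_embedding = np.array([0.21, -0.44, 0.08, 0.93, -0.15])

# Concatenation yields one fixed-length vector per person.
person_vector = np.concatenate([attributes, text_embedding])
print(person_vector.shape)  # (9,)
```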
With the vector space neatly summarizing any person’s identity, we can use algorithms that learn to assess relevance in terms of customer profile and propensity to buy. These algorithms are essentially statistical models that specialize in pattern recognition. We can use cluster analysis to explore related concepts (interests, products and other keywords) or quickly identify groups of similar accounts in a client dataset.
These findings are used to define detailed personas, which we use to train classifiers that detect new potential customers within a large population. By narrowing this population to our client’s highest-quality leads, we maximize the rate of content engagement we achieve.
The following plots show data points representing customer accounts. On the left, similar account vectors have been clustered into groups. Suppose we identify the group with the red circle as a target persona. We can produce a lead scoring model by using these vectors to train a binary classifier. To prepare the input labels, we would mark all the red cluster vectors as positive (with a 1) and all other vectors as negative (with a 0). This classifier learns a decision boundary in the vector space so it can distinguish whether any new vector belongs to the persona. The distance of any vector from that boundary determines the value of the lead score.
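A rough sketch of that workflow using scikit-learn appears below. The random account vectors, the cluster count and the choice of logistic regression are illustrative stand-ins, not our production pipeline:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

# Toy account vectors standing in for the real profile vectors.
rng = np.random.default_rng(0)
account_vectors = rng.normal(size=(500, 8))

# Step 1: cluster accounts and pick one cluster as the target persona.
clusters = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(account_vectors)
target_cluster = 2                                    # the "red" cluster in the plot
labels = (clusters == target_cluster).astype(int)     # 1 = persona, 0 = everyone else

# Step 2: train a binary classifier on those labels.
clf = LogisticRegression().fit(account_vectors, labels)

# Step 3: a new account's signed distance from the decision boundary serves as its lead score.
new_accounts = rng.normal(size=(3, 8))
lead_scores = clf.decision_function(new_accounts)
print(lead_scores)
```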
How do we construct this vector space? First, we assemble a large collection of relevant documents as a text corpus, which we use to build a vocabulary of relevant words and phrases (using frequency and co-occurrence statistics). We then use a special algorithm to assign vector representations or embeddings for every vocabulary item. We feed these word embeddings to a neural network to learn the composition of word sequences such as job titles. For each individual, we can concatenate these embeddings of keyword lists, descriptions and job title to yield an overall vector.
What are embeddings?
An embedding is a modestly sized vector (perhaps several to a few hundred dimensions) that encodes information in a continuous space. This means every dimension can take on any decimal value rather than being limited to fixed increments. Hence, we consider embeddings to be dense representations instead of sparse ones. Common examples of sparse representations include lists of distinct items or categorical data. These vectors may hold integers (which are discontinuous) or decimals, but the majority of their dimensions are zeros.
For example, you can express your location as a sparse vector categorizing every city. If you are present in New York City, that dimension is 1 while the remainder are 0 (you are absent in London, Paris, Shanghai, etc.). This is called a one-hot encoding, since only one “wire” is live at a time. It is a standard encoding for a categorical item, though quite inefficient, as a dimension is needed for each tracked location.
Couldn’t this sparse vector be compressed by indicating everything is 0 except the index for New York? Yes, this can save space for computer storage, but the mathematical operation performed in memory requires all relevant vector dimensions to be present and aligned. We could leave out irrelevant dimensions but of course we don’t know which are relevant until we actually feed the data into a statistical model. Therefore, input data intended for machine learning needs a consistent number of relevant vector dimensions that are aligned in the same order.
A better way to express location is with GPS coordinates, specifically a 2-dimensional vector for latitude and longitude. This is a dense representation where each dimension corresponds with an underlying physical feature (a polar reference on the Earth’s surface). Thus, each dimension is a continuous quantity, and mathematical operations between coordinates yield meaningful results. For instance, distance and relative direction can be calculated by subtracting one coordinate from another. This isn’t possible with our sparse city location vector.
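A quick illustration of the difference, with a hypothetical city list and coordinates:

```python
import numpy as np

cities = ["london", "new york", "paris", "shanghai"]

# Sparse one-hot encoding: one dimension per tracked city.
one_hot_nyc = np.zeros(len(cities))
one_hot_nyc[cities.index("new york")] = 1.0           # [0, 1, 0, 0]

# Dense representation: 2-D GPS coordinates (latitude, longitude).
new_york = np.array([40.71, -74.01])
san_francisco = np.array([37.77, -122.42])

# Subtracting dense vectors gives a meaningful displacement;
# subtracting one-hot city vectors does not.
displacement = new_york - san_francisco
print(displacement)
```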
By distilling unstructured content into embeddings, we can represent data points in an efficient format for comparison. Next, we’ll discuss how text data is processed to produce vectors.
Representing documents: Tackling text the old way
Although digital images may appear complex, they’re essentially pixel values that are easily converted into vectors for machine learning. Text is actually much more complicated, since it consists of sequences of characters that are purely categorical (encoded using ASCII or Unicode). An incremental change like flipping one bit changes a character, mangling a word or changing its meaning. To represent a document, we could consider the words it contains (found by searching for contiguous alphanumeric characters) and store a tally of each word in a count vector.
The dimensions of this vector correspond to a predetermined vocabulary of a sufficient breadth, often tens of thousands at minimum to capture an assortment of topics. An example of a simple heuristic to build a vocabulary would be to tally the 10,000 most frequent words in a corpus.
Consider a toy example of four sentences shown below:
Sentence 0: “I bought bananas at the grocery shop”
Sentence 1: “We purchased apples from the corner store”
Sentence 2: “The cat suddenly jumped over that dog”
Sentence 3: “We bought apples from the grocery store”
The corpus vocabulary as a Python style list would be:
['apples', 'at', 'bananas', 'bought', 'cat', 'corner', 'dog', 'from', 'grocery', 'i', 'jumped', 'over', 'purchased', 'shop', 'store', 'suddenly', 'that', 'the', 'we']
Note that the vocabulary order is an arbitrary decision and doesn’t affect later results; often it is alphabetical or sorted by frequency, as long as the table is kept for reference. Each word is associated with its vocabulary index. Here, “apples” is 0, “at” is 1, and we increment until the last item, “we”, at 18. By tallying the words, we come up with the count vectors shown below. Note that these values can be any count, not just 1.
For the sake of brevity, we leave out a couple of common techniques for improving topic discrimination. We could keep a list of frequent filler words, called stop words, to exclude from vectorization and reduce the vocabulary. Another technique is scaling down the counts of words that appear in many documents using inverse document frequency weighting (idf or tf-idf).
Count Vector 0: [0 1 1 1 0 0 0 0 1 1 0 0 0 1 0 0 0 1 0]
Count Vector 1: [1 0 0 0 0 1 0 1 0 0 0 0 1 0 1 0 0 1 1]
Count Vector 2: [0 0 0 0 1 0 1 0 0 0 1 1 0 0 0 1 1 1 0]
Count Vector 3: [1 0 0 1 0 0 0 1 1 0 0 0 0 0 1 0 0 1 1]
We can judge the similarity of documents by examining the overlapping mix of words. An operation called a dot product between vectors quantifies this similarity, giving a value between 0 and 1 once the vectors are normalized as described below (in other domains, negative vector values can push a dot product down to -1, but negatives don’t occur in count vectors). The dot product is simply the sum of the products of each aligned dimension. So the dot product between vectors A = [a0 a1 a2] and B = [b0 b1 b2] equals a0*b0 + a1*b1 + a2*b2.
Before these operations, we scale each vector so its magnitude (the square root of the sum of the squared elements, like the hypotenuse of a triangle but with many dimensions) is equal to 1. Since we care about comparing the relative distribution of words when gauging similarity, we don’t want the total number of words to affect the calculation (document length isn’t relevant). Each of our four sentences has a magnitude of √7, so we divide every element of each vector by that magnitude. The popular Scikit-learn library for Python has a CountVectorizer and a normalize function to perform the steps we elaborated on. Taking the dot product of vectors normalized to 1 computes the cosine similarity, an angle measurement between two vectors, or how far apart their directions point.
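As a sketch, the whole pipeline for our toy sentences might look like the following (the token_pattern override keeps single-letter words like “I” so the vocabulary matches the one above):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import normalize

sentences = [
    "I bought bananas at the grocery shop",
    "We purchased apples from the corner store",
    "The cat suddenly jumped over that dog",
    "We bought apples from the grocery store",
]

# Build the vocabulary and the raw count vectors (keeping one-character tokens).
vectorizer = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
counts = vectorizer.fit_transform(sentences)        # sparse matrix of word counts
print(vectorizer.get_feature_names_out())           # alphabetical vocabulary

# Scale each count vector to unit length; dot products are then cosine similarities.
unit_vectors = normalize(counts)
similarities = (unit_vectors @ unit_vectors.T).toarray()
print(similarities.round(2))   # e.g. ~0.14 for sentences 0 and 1, ~0.71 for 1 and 3
```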
Comparing a document with itself gives perfect similarity, so the diagonal entries in the table below are all ones. Unfortunately, though sentences 0 and 1 are similar in meaning, their score of 0.14 is low, comparable to their scores with sentence 2, whose meaning is quite different. This is because these three sentences only share the word “the”. Because there is no notion of similarity between words (they all exist in separate dimensions without a continuum of meaning), comparing the sparse vectors misses out on the rich semantic relations between words (such as synonyms: recall that “bought” and “purchased” are separate dimensions in our example). The most similar sentences, 1 and 3, do have the highest similarity at 0.71.
Table of Similarity Scores
|         | Sent. 0 | Sent. 1 | Sent. 2 | Sent. 3 |
|---------|---------|---------|---------|---------|
| Sent. 0 | 1.00    | 0.14    | 0.14    | 0.43    |
| Sent. 1 | 0.14    | 1.00    | 0.14    | 0.71    |
| Sent. 2 | 0.14    | 0.14    | 1.00    | 0.14    |
| Sent. 3 | 0.43    | 0.71    | 0.14    | 1.00    |
In this example, the count vectors fail to detect similarity when the word choice differs even though the meaning stays nearly the same. The example also demonstrates the curse of dimensionality, where adding dimensions to data makes it more sparse. A sparse document vector over a broad vocabulary is mostly zeros, and we cannot learn statistical relations unless some training examples share similar non-zero values along their dimensions. So with more dimensions we may require much more training data.
Can we find a suitable dense representation to encode the meaning of words? There is a measurable, physical relation for New York with respect to location, so we can simply ponder the distance between New York and San Francisco or calculate it using retrieved coordinates. But what is the distance between “pencil”, “pen” and “paintbrush” or very different words such as “punitive” and “pumpkin”? If we can uncover these semantic dimensions then we might be able to do computations on words just like we did with GPS.
Word2vec is an efficient algorithm to do exactly that. It uses a simple neural network for unsupervised learning of word embeddings on a large, unlabelled corpus (supervised learning, in contrast, requires true labels for each training sample). Word2vec trains each word’s embedding to predict the embeddings of its neighboring words, so words that share similar contexts end up close together in the embedding space.
In the following section, we’ll review neural network architecture and training, but you’re welcome to skip it if you’re already up to speed.
A review of neural networks
A neural network has an input layer and an output layer on either side of multiple “hidden” layers. The training process tunes the parameters of the network’s neurons so that its output layer gives good predictions on input data. Each input sample is a vector whose values are fed to the input layer.
Diagram of a simple feed-forward neural network
Each layer consists of multiple neurons, with each neuron receiving values from the output signals of the layer before it. Each neuron contains a parameter for each input value that acts as a weighting factor, judging that input’s contribution to the neuron’s output signal. These weights are multiplied with their respective input values and the products are added together; this operation is a dot product between the weight vector and the input vector. A bias value, a parameter particular to each neuron, is then added on. The result is a linear transformation of the input.
This value is passed through an activation function, a nonlinear, continuously increasing function (such as the sigmoid, whose output is squashed between 0 and 1), to yield the output value for this neuron. This nonlinear behavior is what gives stacked layers their powerful feature-learning abilities.
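A single neuron’s computation, sketched in NumPy with arbitrary weights and a sigmoid activation:

```python
import numpy as np

def sigmoid(x):
    # Squashes any real value into the range 0 to 1.
    return 1.0 / (1.0 + np.exp(-x))

# One neuron with three inputs: a weight per input plus a bias.
inputs = np.array([0.5, -1.2, 3.0])
weights = np.array([0.8, 0.1, -0.4])
bias = 0.2

# Dot product of weights and inputs, plus the bias, passed through the activation.
pre_activation = np.dot(weights, inputs) + bias
output = sigmoid(pre_activation)
print(output)
```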
One sufficiently wide layer can theoretically model any mathematical function of the input, but using multiple narrower layers lets the network quickly learn a hierarchy of simpler functions that compose into a complex one.
In essence, a neural network is a hierarchy of simple decision-makers. The lower layers learn to process raw data and the upper layers learn to extract useful features from the lower ones. The final output layer represents a prediction computed from the knowledge embedded within the parameters across the entire network.
Suppose there is an image classification network for discriminating between photos of cats, dogs and photos containing neither. Its output layer would contain three neurons predicting 1 out of 3 possible categories. The training data would be labeled with three possible vectors [1, 0, 0], [0, 1, 0] and [0, 0, 1] representing one of three categories. A special function on this layer called a softmax ensures that its output values always sum to 1. This neural network can be trained to model a probability distribution across categories.
For every training sample, a prediction error is a necessary signal to adjust the network’s parameters (towards that sample’s corresponding true label). We use a loss function to compute the deviation between the network output against that sample’s true label. The computed loss value is then back-propagated to all the neurons in the previous layers so that an optimization algorithm can tune the weight parameters (in a process called gradient descent, since we are searching the parameter space to push down the loss value) in the neurons of each layer. After going through each training example (a forward pass and a backward pass) and many iterations (a.k.a. epochs) over the data set, the optimized parameters should help the network make good predictions for new input data.
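Putting the pieces together, a minimal Keras sketch of the cat/dog/neither classifier described above might look like this. The layer sizes and the random placeholder data are purely illustrative:

```python
import numpy as np
from tensorflow import keras

# Placeholder data: 100 flattened "images" of 1,024 pixel values each,
# labeled as one of three categories (cat, dog, neither).
x_train = np.random.rand(100, 1024)
y_train = keras.utils.to_categorical(np.random.randint(3, size=100), num_classes=3)

model = keras.Sequential([
    keras.layers.Input(shape=(1024,)),
    keras.layers.Dense(64, activation="sigmoid"),   # hidden layer of 64 neurons
    keras.layers.Dense(3, activation="softmax"),    # three outputs that sum to 1
])

# The loss measures deviation from the true label; gradient descent back-propagates
# it to tune the weights over several passes (epochs) through the data.
model.compile(optimizer="sgd", loss="categorical_crossentropy", metrics=["accuracy"])
model.fit(x_train, y_train, epochs=10, batch_size=16)
```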
Word2vec uses a simple two-layer network to train word embeddings. It slides through a corpus in a skip-gram fashion: it selects a word as a source word and trains its embedding to predict the embeddings of the words in its context window. This window includes the words immediately to the left and right of the source word. Then the window slides to the next word, repeating the process across every word in the corpus.
The embedding selection is performed with an embedding layer, which is a lookup table of vectors rather than a layer of neurons. Each word is associated with a vocabulary index, a unique integer used to select the word’s vector from the embedding layer. The vector for the source word is taken from the source embedding layer while the vectors for the context words are taken from the context embedding layer. At the end of training, embeddings from either layer can be used in applications, but typically the source layer is chosen.
Word2vec takes each pair of vectors and adjusts their values according to the loss between their dot product and the value 1 (perfectly aligned vectors). All vectors are initialized with different random values before training. Training every word only to predict its neighbors would eventually pull all word vectors to the same coordinates, so for each training pair we also add a few random words to act as contrasting negative samples2. In effect, Word2vec pulls related words together and pushes unrelated words apart in the embedding space.
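For a concrete feel, here is a skip-gram model with negative sampling trained on a toy corpus using the Gensim library (parameter names assume Gensim 4.x); real training would use millions of tokenized sentences:

```python
from gensim.models import Word2Vec

# A tiny toy corpus of pre-tokenized sentences.
corpus = [
    ["we", "purchased", "apples", "from", "the", "corner", "store"],
    ["we", "bought", "apples", "from", "the", "grocery", "store"],
    ["the", "cat", "suddenly", "jumped", "over", "that", "dog"],
]

model = Word2Vec(
    sentences=corpus,
    vector_size=100,   # embedding dimensions
    window=5,          # context words on each side of the source word
    sg=1,              # skip-gram: the source word predicts its context
    negative=5,        # random negative samples per training pair
    min_count=1,
)

print(model.wv["apples"][:5])   # first few dimensions of a learned embedding
```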
When trained on a broad and information-rich data source such as Wikipedia, the resulting vector space exhibits startling properties. For example, subtracting the vector for “man” from the vector for “king” and adding the vector for “woman” results in a vector closest to “queen”.
Many other analogies are possible, and some examples are shown below. This demonstrates that semantic knowledge is encoded within the vector space.
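With Gensim’s downloadable pre-trained vectors (a sizable download), the analogy query looks like this; the exact neighbors returned depend on the vectors used:

```python
import gensim.downloader as api

# Pre-trained Word2vec vectors trained on Google News.
vectors = api.load("word2vec-google-news-300")

# king - man + woman lands closest to queen.
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
```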
Word2vec is efficient and scalable. You can train on all Wikipedia articles in a few hours on a typical modern computer. Word2vec can train with batches of vector pairs and the computation time increases linearly as corpus size grows. Since the embedding layer selects only the relevant word pairs for training, larger vocabularies do not increase the complexity of the network or its training process.
If you don’t have a large corpus but want to try machine learning with word vectors, you can download pre-trained embeddings from the web. Word2vec embeddings from Google are available for download, as are embeddings from other algorithms such as GloVe from Stanford4 and FastText embeddings in 294 languages from Facebook5.
Now that we have vectors for words, how do we handle paragraphs or documents? One simple but effective method is computing the average of the word vectors in a document. Another is introducing a new document vector with the same dimensions (initialized randomly) and training it to predict all the words in the document. This paragraph vector6 or Doc2vec method works just like Word2vec, except that the entire document serves as the context and the context embeddings are “frozen” so that training only optimizes the document vector. The Gensim library for Python is a useful open-source implementation for running Doc2vec and an assortment of other topic modelling algorithms.
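A minimal Doc2vec sketch with Gensim, using two toy documents; real corpora would be far larger:

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

documents = [
    TaggedDocument(words=["we", "bought", "apples", "from", "the", "grocery", "store"], tags=[0]),
    TaggedDocument(words=["the", "cat", "suddenly", "jumped", "over", "that", "dog"], tags=[1]),
]

model = Doc2Vec(documents, vector_size=50, window=3, min_count=1, epochs=40)

# Each training document now has its own vector...
print(model.dv[0][:5])
# ...and unseen documents can be embedded by inference.
print(model.infer_vector(["we", "purchased", "apples"])[:5])
```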
Let’s try out the pre-trained FastText vectors and see how they compare the four sentences from the previous section. We find the average word vector for each sentence and then compute the cosine similarity between each pair of averages. Sentences 0 and 1 now have a similarity of 0.9 (much better than the sparse vector similarity of 0.14), while each is further from sentence 2 (similarities 0.59 and 0.61). Sentence 3, our slight tweak of sentence 1, scores 0.94 against it. And just for fun, if we add a completely different word, “rainbow”, the score drops to a low 0.32. These values match our intuition far better than the sparse vector comparisons.
S0: “I bought bananas at the grocery shop”
S1: “We purchased apples from the corner store”
S2: “The cat suddenly jumped over that dog”
S3: “We bought apples from the grocery store”
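A rough sketch of the averaging approach, assuming Gensim’s downloadable FastText vectors; the similarity values you get will vary with the vectors used:

```python
import numpy as np
import gensim.downloader as api

# Pre-trained FastText vectors (a large download).
vectors = api.load("fasttext-wiki-news-subwords-300")

def sentence_vector(sentence):
    # Average the word vectors, then scale to unit length for cosine comparison.
    words = sentence.lower().split()
    avg = np.mean([vectors[w] for w in words if w in vectors], axis=0)
    return avg / np.linalg.norm(avg)

s0 = sentence_vector("I bought bananas at the grocery shop")
s1 = sentence_vector("We purchased apples from the corner store")
print(float(np.dot(s0, s1)))   # cosine similarity; high for semantically similar sentences
```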
The sparse and dense representations discussed so far are bag-of-words (BoW) models: they consider the mix of words present but do not account for their ordering. A more effective model needs to capture the nuance of sentence structure for more sophisticated tasks, such as judging the emotion of a tweet or understanding the intent of a question.
Going deeper with recurrent networks
A recurrent neural network (RNN) contains neural layers with a temporal feedback loop: a neuron in such a layer receives the current inputs as well as its own outputs from the previous time-step. An RNN can operate across a sequence of inputs since the recurrent layer “remembers” previous inputs while processing the current one. When an RNN is computed, the software “unrolls” the network (as shown below) by cloning the recurrent layer across all the time-steps and computing the signals for the forward pass. During the backward pass, the loss is back-propagated through time (BPTT) as in a feed-forward network, except that the parameter adjustments are shared across the clones.
We use a sophisticated RNN called Long Short-Term Memory (LSTM)7, whose neurons include gated activations that act like inner switches for advanced memory capabilities. Another distinction for LSTMs is passing along a “hidden state” of activations to the next time-step, separate from its outputs.
With a dataset of product reviews, we could feed an RNN word by word and predict the rating with a softmax layer after it receives the final word. Another RNN application is a language model, where each input predicts the next input. Well-trained RNNs have generated text mimicking the style of Shakespeare’s plays, Obama’s speeches and lines of computer code, and have even composed music.
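As an illustrative sketch (not a tuned model), a review-rating RNN could be wired in Keras like this, with placeholder vocabulary and sequence sizes:

```python
from tensorflow import keras

VOCAB_SIZE = 10000   # word indices 0..9999
MAX_WORDS = 200      # reviews padded/truncated to 200 tokens
NUM_RATINGS = 5      # predict a 1-5 star rating

model = keras.Sequential([
    keras.layers.Input(shape=(MAX_WORDS,)),
    keras.layers.Embedding(VOCAB_SIZE, 64),                  # word index -> 64-d embedding
    keras.layers.LSTM(128),                                   # reads the review word by word
    keras.layers.Dense(NUM_RATINGS, activation="softmax"),    # rating after the final word
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
model.summary()
```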
Using dual RNNs to encode one language and decode to another language, we can train powerful and elegant sequence-to-sequence translation models (“seq2seq” learning)8. Much of Google Translate’s intelligence now relies on this technology. These RNNs work with one-hot input vectors or untrained/pre-trained embeddings, representing text either at a character level or word level.
Solving titles: Our Seq2BoW model
We can combine RNNs with Word2vec to map job titles to interests using a model we’ll call “Seq2BoW” (sequence to bag of words). The RNN learns to compose the embedding for any title from its word sequence, giving us two advantages:
- First, we no longer require a vocabulary list to exhaustively capture so many combinations of similar titles.
- Second, the titles and interests exist in the same embedding space so we can query across the two vocabularies to see how they relate in meaning. Not only can we assess how similar titles are, we can find out why by contrasting their inferred interests.
The training set consists of job descriptions containing titles and extracted keywords. We use a new embedding layer for the title words (tokens) and feed these through an LSTM to compute a fixed title embedding. This new title embedding is trained to predict the interests occurring in the same description.
To do this, we use a linear layer to project the new title embedding onto the vector space for interests (pre-trained on our own corpus using Word2vec). A linear layer is a neural layer without a nonlinear activation, so it is merely a linear transformation from the LSTM output to the existing interest vector space. We train the word predictions as in Word2vec, with positive and negative embeddings chosen according to the keywords that appear in the job description for the associated title.
For speed, we combine the dot products between the RNN projection and the positive and negative words into a single training sample, using a softmax layer that predicts only the positive word, rather than training separate embedding pairs as in Word2vec.
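To make the architecture concrete, here is a rough Keras sketch of how such a Seq2BoW model could be wired. This is not the exact production model: the sizes are placeholders, and a full softmax over the interest vocabulary stands in for the sampled positive/negative word prediction described above.

```python
from tensorflow import keras

TITLE_VOCAB = 20000     # title word tokens
INTEREST_VOCAB = 50000  # interest keywords
MAX_TITLE_LEN = 10
INTEREST_DIM = 24       # matches the pre-trained interest embeddings

title_tokens = keras.layers.Input(shape=(MAX_TITLE_LEN,))
x = keras.layers.Embedding(TITLE_VOCAB, 32, mask_zero=True)(title_tokens)  # new title-word embeddings
x = keras.layers.LSTM(32)(x)                        # compose a fixed title embedding
title_vec = keras.layers.Dense(INTEREST_DIM)(x)     # linear projection into the interest space

# Softmax over the interest vocabulary; in a fuller implementation its weights
# would be tied to the frozen pre-trained interest vectors.
interest_probs = keras.layers.Dense(INTEREST_VOCAB, activation="softmax", use_bias=False)(title_vec)

model = keras.Model(title_tokens, interest_probs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.summary()
```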
We use Keras on the TensorFlow backend. Running on an NVIDIA GPU gave us the computational power to blaze through 10 million job descriptions in 15 minutes (a 32-unit RNN and 24-dimensional pre-trained interest word vectors). We can demonstrate the representational abilities of these vectors with a few examples below (all lists are computer “generated”).
Using our interest vocabulary trained with Word2vec, we can enter any interest keyword, look up its vector and find all the words belonging to the closest vectors:
Marketing Content: Content generation, Corporate blogging, Content syndication, Bylined articles, Social media strategy, Online content creation, Content curation, Content production
Juggling: Roller skating, Ventriloquism, Circus arts, Unicycle, Street dance, Swing dance, Comedic timing, Acrobatics
Brain Surgery: Medical research, Neurocritical care, Skull base surgery, Endocrine surgery, Brain tumors, Medical education, Pediatric cardiology, Hepatobiliary surgery
With the Seq2BoW title model, we can find related interests, given any title:
Marketing Analytics: Marketing mix modeling, Adobe insight, Lifetime value, Attribution modeling, Customer analysis, Spss clementine, Data segmentation, Spss modeler
Data Engineer: Spark, Apache Pig, Hive, Pandas, Map Reduce, Apache Spark, Octave, Vertica
Winemaker: Viticulture, Winemaking, Wineries, Red wine, Wine tasting, Food pairing, Champagne, Beer
We can create a separate title vocabulary by computing and storing the vectors for the most frequent titles. Then, we can query among these vectors to find related titles:
CEO: Chairman, General Partner, Chief Executive, Coo, President, Founder/Ceo, President/Ceo, Board Member
Dishwasher: Crew Member, Crew, Kitchen Staff, Busser, Barback, Shift Leader, Carhop, Sandwich Artist
Code Monkey: Senior Software Development Engineer, Lead Software Developer, Senior Software Engineer II, Software Designer, Software Engineer III, Lead Software Engineer, Technical Principal, Lead Software Development Engineer
We can also find titles near any interest:
Cold Calling: Account management, Sales presentations, Direct sales, Sales process, Sales operations, Outside sales, Sales, Sales management
Baking: Chef Instructor, Culinary Arts Instructor, Culinary Instructor, Baker, Head Baker, Pastry Chef, Pastry, Assistant Pastry Chef
Neural Networks: Senior Data Scientist, Principal Data Scientist, Machine Learning, Data Scientist, Algorithm Engineer, Quantitative Researcher, Research Programmer, Lead Scientist
We can extend beyond relating interests and titles and add various inputs or outputs to the Seq2BoW model. For example, we could consider company information, educational background, geographic location or other individual social and consumer insights and harness the flexibility of deep learning to understand how these relate.
Understanding text powers ABM at scale
The ABM intelligence used by MarianaIQ relies on powerful and concise representations of identity. We use Deep Learning to compute semantic embeddings for keywords and titles. To train useful machine learning models, we feed in unique labeled vectors of individuals and accounts, each containing attributes concatenated with our embeddings – a heterogeneous but harmonious combination of structured and unstructured data.
By learning from data collected across the web, we avoid the narrower and biased perspective of traditional consultants. We can quickly and accurately provide quantitative assessments such as deciding who will be more responsive to particular topics or identifying who looks more like a potential buyer. These analyses are only feasible through today’s machine learning, but they’re the secret sauce that allows ABM to function at scale for our customers.
In our final Defining Deep Learning post, we’ll cover how AI helps ABM by building account models. Look for it soon!
1Efficient Estimation of Word Representations in Vector Space. Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean (2013).
2A Fast and Simple Algorithm for Training Neural Probabilistic Language Models. Andriy Mnih, Yee Whye Teh (2012).
3Distributed Representations of Words and Phrases and their Compositionality. Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, Jeffrey Dean (2013).
4GloVe: Global Vectors for Word Representation. Jeffrey Pennington, Richard Socher, Christopher D. Manning (2014).
5Enriching Word Vectors with Subword Information. Piotr Bojanowski, Edouard Grave, Armand Joulin, Tomas Mikolov (2016).
6Distributed Representations of Sentences and Documents. Quoc V. Le, Tomas Mikolov (2014).
7Long Short-Term Memory. Sepp Hochreiter, Jürgen Schmidhuber (1997).
8Sequence to Sequence Learning with Neural Networks. Ilya Sutskever, Oriol Vinyals, Quoc V. Le (2014).