Infrastructure for Deep Learning: How to train an AI when your data is crap
At MarianaIQ, we face an unusual set of hurdles as a machine learning startup focused on B2B marketing. Separately, any startup, machine learning developer, or B2B company has its own difficulties to overcome. When the three intersect? It’s a lesson in the multiplication of challenges, frankly.
For example, a startup needs to be fast and stay agile, and be able to pass that benefit along to its customers. But for a machine learning provider, it’s not always simple to move “fast” with algorithms that can take hours or, in some cases, a day or more to run.
So we’ve had to develop our own solutions in situations like this, especially when there’s another complication in play: we’re faced with building good algorithms on very limited data. Which is a nice way of saying the data is incomplete or even, to drop some data science jargon, “crappy.”
Teaching a system when the data isn’t deep
First, let’s step back and frame the problem. A customer may come to us with a set of positive data – data we want the machine learning system to train itself on and recognize as positives. In this case, that’s the customer’s existing customer list and sales data.
We want to find more contacts who may prove to be good customers, so we take their list and segment it into different personas to understand the types of buyers they already have. We then want the system to build a classifier from that segmented data, which it can run against other potential customers to generate a P(customer) score for each targeted account, so we can target only those individual contacts who are likely to convert.
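To make that flow concrete, here’s a minimal sketch using scikit-learn. The features, persona labels, and model choice are illustrative stand-ins, not our production pipeline:

```python
# Minimal sketch of the persona-scoring flow (illustrative, not production).
# Assumes each contact has already been featurized into a numeric vector.
import numpy as np
from sklearn.linear_model import LogisticRegression

# X_train: feature vectors for the customer's existing contacts
# y_train: persona labels from the segmentation step
X_train = np.array([[0.9, 0.1], [0.8, 0.3], [0.2, 0.7], [0.1, 0.9]])
y_train = np.array(["ops_buyer", "ops_buyer", "marketing_buyer", "marketing_buyer"])

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Score prospective contacts: per-persona probabilities. Once a negative
# class is added (see below), summing the non-negative classes gives an
# overall P(customer)-style score per contact.
X_prospects = np.array([[0.85, 0.2], [0.15, 0.8]])
for probs in clf.predict_proba(X_prospects):
    print(dict(zip(clf.classes_, np.round(probs, 2))))
```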
In machine learning, the core of a good model is good data. Algorithmic options and quality of results get better and better the more data you have to train on. But as a B2B company, we’re often limited in the quantity of data we’re given, because our customers are B2B firms that haven’t been in the habit of capturing great amounts of detailed customer data.
The data we’re given is often manually entered, dirty, incomplete, or rife with errors, making the job even harder. So we do our best to clean the customer and sales data we receive and produce good, labeled data on which we can build a useful model, but inevitably not everything can be cleaned enough to be useful, so we lose some data in the process.
And after doing that, we’re still left with a multiclass classification problem, meaning there are more than two (binary) classes to be predicted. Even with customers who provide a lot of data, the smallest classes may have only a couple dozen positive examples for the system to use. For customers with little data to pass along, the least-populous classes frequently have fewer than a dozen positive examples. At times, we’ve needed to build models using as few as three positive examples.
Another challenge to address before dealing with the positive data deficit? We have a segmented list of positives, but no negatives. Or if we do have negatives, they’re only barely negatives: a list of people who came this close to converting into customers, for example.
But negative examples are necessary for a machine learning system to create classifiers: it’s the old analogy of children learning “what tastes good to eat” by being exposed to ice cream (positive!) versus Brussels sprouts (negative!). (Note – as an adult I love roasted Brussels sprouts, but as a kid they were usually steamed to mush, so not so good.) So we either need a way for the system to create a classifier (and train itself to produce predictions) without negatives, or we need to generate a good negative set for it to work with.
We first tried avoiding negatives entirely, using clustering or distance-based scoring approaches. While these kind of worked, they weren’t flexible enough to handle combinations of broad and narrow personas within the same data set, and they required difficult manual tuning.
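For reference, the distance-based idea looks roughly like the sketch below (assuming contacts are already vectorized); the hand-picked threshold at the end is exactly the tuning knob that made this painful:

```python
# Rough sketch of the centroid/distance scoring we tried first (illustrative).
import numpy as np

def centroid_scores(persona_vectors, candidate_vectors):
    """Cosine similarity of each candidate to the persona centroid."""
    centroid = persona_vectors.mean(axis=0)
    centroid /= np.linalg.norm(centroid)
    cands = candidate_vectors / np.linalg.norm(candidate_vectors, axis=1, keepdims=True)
    return cands @ centroid

persona = np.array([[1.0, 0.1], [0.9, 0.2]])      # a narrow persona
candidates = np.array([[0.95, 0.15], [0.1, 1.0]])
scores = centroid_scores(persona, candidates)

# The pain point: one cutoff rarely suits both broad and narrow personas
# in the same data set, so this had to be hand-tuned per persona.
THRESHOLD = 0.9
print(scores > THRESHOLD)
```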
So we decided to try picking a random set of potential customers within the database and calling them negatives. But that wasn’t quite right, either. While most of those labels will be accurate, there’ll inevitably be false negatives, and their effect can’t be corrected with typical outlier handling when there are few positive examples.
So we made a couple of tweaks. First, we changed the concept of these contacts from being negatives to being likely negatives. Next, we treated these likely negatives as documents and filtered out any that shared keywords with the positive data. Our reasoning? While strong keyword overlap isn’t enough of a signal to label these data as positive, it is enough of a signal to say they’re no longer likely negatives.
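In sketch form, the sample-then-filter step looks something like this; the whitespace tokenizer and overlap cutoff are stand-ins for real keyword extraction and our actual thresholds:

```python
# Sketch of building "likely negatives": sample candidates at random, then
# drop any with strong keyword overlap with the positives (illustrative).
import random

def keywords(doc):
    # Stand-in for real keyword extraction (e.g., TF-IDF top terms).
    return set(doc.lower().split())

positives = ["vp of marketing, demand generation", "director of growth marketing"]
database = ["software engineer, backend", "head of demand generation",
            "accountant", "warehouse operations manager"]

positive_terms = set().union(*(keywords(d) for d in positives))

sampled = random.sample(database, k=3)  # random candidates, presumed negative
OVERLAP_CUTOFF = 2  # assumption: two or more shared keywords disqualifies
likely_negatives = [d for d in sampled
                    if len(keywords(d) & positive_terms) < OVERLAP_CUTOFF]
print(likely_negatives)
```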
We fed this set of likely negatives into our classifier training and converted everything to vectors. Then, based on the idea that our positive sets should occupy blobby regions of the vector space, we downsampled the likely negatives so that each labeled class outnumbers them within its own dense region. This leaves us with a good negative set, uniform and roughly as dense as the least-dense labeled class.
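Roughly, that downsampling looks like the sketch below. The centroid-and-radius definition of a class’s “dense region” is a stand-in heuristic, not our exact rule:

```python
# Sketch of density-aware downsampling of likely negatives (illustrative).
import numpy as np

rng = np.random.default_rng(0)

def downsample_negatives(class_vectors, negatives, coverage=0.9):
    """Within each class's dense region, keep fewer negatives than positives.

    A class's "dense region" here is a ball around its centroid covering
    most of its members -- a stand-in for however you define density.
    """
    keep = np.ones(len(negatives), dtype=bool)
    for vecs in class_vectors:
        centroid = vecs.mean(axis=0)
        radius = np.quantile(np.linalg.norm(vecs - centroid, axis=1), coverage)
        inside = np.flatnonzero(
            keep & (np.linalg.norm(negatives - centroid, axis=1) <= radius))
        excess = len(inside) - (len(vecs) - 1)  # keep the class more populous
        if excess > 0:
            keep[rng.choice(inside, size=excess, replace=False)] = False
    return negatives[keep]

classes = [rng.normal(0.0, 0.3, size=(20, 2)),   # a well-populated persona
           rng.normal(3.0, 0.3, size=(8, 2))]    # a sparse persona
negatives = rng.uniform(-1, 4, size=(200, 2))
print(len(downsample_negatives(classes, negatives)))
```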
Now we have data the system can learn on, but the sparsity of positive data still presents issues. In a high-dimensional space, it’s nearly impossible to learn a generalizable classifier with fewer than a dozen positive examples of a class, as was sometimes the case.
So we first turned to manual augmenting or, as we called it, bootstrapping. While this produces good-quality data, it’s expensive and tedious, and has to be done separately for every customer, so it doesn’t scale well.
But as it happened, we were simultaneously considering ways to define personas without customer data, and hit upon the idea of adapting the keyword-filtering approach mentioned above. By adjusting the criteria and thresholds for keyword detection, perhaps we could use a keyword-based approach to augment the positives.
A lack of data can be an issue here too, so we used the auto-generated keyword terms as a starting point, manually reviewing them and making adjustments before finalizing the augmenting terms. This still needs to be done for each persona for each customer, but since it only has to happen at the persona level rather than for mountains of individual data points, it’s a lot less work, so it’s scalable. Also, since this makes positive augmenting just a search function, we can leverage the strong existing search tech within Elasticsearch.
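The augmenting search itself then reduces to a query against a contact index. Here’s a minimal sketch with the official Python Elasticsearch client (8.x-style API); the index name, field, and keyword list are made up for illustration:

```python
# Sketch of augmenting positives via search (elasticsearch-py, 8.x-style API).
# Index name, field names, and the keyword list are illustrative.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Keywords auto-suggested per persona, then manually reviewed and adjusted.
reviewed_keywords = ["demand generation", "growth marketing", "field marketing"]

resp = es.search(
    index="contacts",  # hypothetical index of potential customers
    query={
        "bool": {
            "should": [{"match_phrase": {"title": kw}} for kw in reviewed_keywords],
            "minimum_should_match": 1,
        }
    },
    size=100,
)
augmented_positives = [hit["_source"] for hit in resp["hits"]["hits"]]
```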