June 20, 2022
Machine learning is everywhere. Our smartphones use unsupervised machine learning to group our photos into categories on our behalf. And natural language processing makes it easier and less time-consuming to compose emails.
And nowhere is machine learning talked about more than in the digital advertising industry. Every adtech platform and point solution boasts about its use of machine learning. But are some machine learning models better than others? What makes machine learning good? Or bad? And where does machine learning fit in the changing regulatory environment?
To find out more, I sat down with Dr. Sechan Oh, Moloco’s Director of Machine Learning, to learn about deep neural networks, and why Moloco’s approach to machine learning is particularly suited to helping marketers drive strong campaign results.
Sechan joined Moloco in 2016, after spending seven years at IBM as a research scientist in machine learning and business optimization. He holds a PhD in Management Science and Engineering from Stanford University, where he specialized in operations research, which is essentially the combination of statistical modeling and optimization.
Ryan: Every DSP – and ad tech point solution for that matter – claims to be driven by machine learning. Is all machine learning more or less the same, or are some models better than others?
Sechan: This is a question I get asked a lot. It’s true that everyone says they use machine learning, and even deep learning, but, in reality, using a deep learning model for a large-scale DSP system is extremely challenging from an engineering perspective.
In general, DSPs have about 100 milliseconds to see an ad request, predict a user’s likely response, calculate the right price to pay, and then place a bid. That’s not a lot of time. Now consider that Moloco can fire up to 10 inference models, each one an independent deep neural network, at a single bid request, which means we need tremendous computing capacity to process over 6 million bid requests per second.
The truth is, few companies outside of the traditional walled gardens actually deploy deep learning for extremely high volume and low latency services. Moloco is one of them, but keep in mind we view ourselves as a machine learning company, and our DSP is just one use case of our deep neural network (DNN). We built the necessary computing capacity to support true deep learning from the start.
Ryan: So Moloco’s machine learning engine was purpose-built for deep learning?
Sechan: That’s correct.
Ryan: Can you explain what a deep neural network is in very simple terms?
Sechan: A deep neural network is just one class of machine learning models, but it is arguably the most complex one available.
The name and structure of neural networks are inspired by how biological neurons interact in human brains. A neural network consists of layers of nodes, or artificial neurons. A neuron sends a signal to other neurons when it’s activated. This signal is scaled by a numeric weight unique to each pair of connected neurons, and these weights are learned during model training. The term “deep neural network” simply refers to neural networks that have multiple layers of artificial neurons. Although it sounds like magic, a deep neural network can be seen as a very complex numeric function that maps input values to output values.
There is a very high degree of freedom in designing how the artificial neurons are connected to each other, and the large number of connections between neurons implies a large number of parameters to learn. This flexibility is the main reason why deep neural networks are so powerful in learning complex relations, but it also makes the model training process a challenging problem.
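To make the "numeric function" idea concrete, here is a minimal Python sketch of a small multi-layer network. It is an illustration only, not Moloco's model: the layer sizes, the random (untrained) weights, and the helper names `dense`, `init_layer`, and `forward` are all assumptions chosen for the example.

```python
import math
import random


def relu(x):
    return max(0.0, x)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dense(inputs, weights, biases, activation):
    """One layer: each neuron sums its weighted inputs, adds a bias, then activates."""
    return [
        activation(sum(w * x for w, x in zip(neuron_weights, inputs)) + b)
        for neuron_weights, b in zip(weights, biases)
    ]


def init_layer(n_in, n_out, rng):
    """Random starting weights; training (not shown) would adjust these values."""
    weights = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def forward(inputs, layers):
    """Pass inputs through the hidden layers (ReLU) and a final sigmoid layer."""
    activations = inputs
    for weights, biases in layers[:-1]:
        activations = dense(activations, weights, biases, relu)
    weights, biases = layers[-1]
    return dense(activations, weights, biases, sigmoid)


rng = random.Random(0)
# 4 inputs -> 8 hidden neurons -> 8 hidden neurons -> 1 output score
layers = [init_layer(4, 8, rng), init_layer(8, 8, rng), init_layer(8, 1, rng)]

# Every connection is one weight and every neuron has one bias.
n_params = sum(len(w) * len(w[0]) + len(b) for w, b in layers)
score = forward([0.5, -1.0, 0.25, 1.0], layers)[0]

print(f"parameters to learn: {n_params}")
print(f"output score: {score:.3f}")
```

Even this toy network has 121 parameters to learn, which hints at why networks with far more inputs and layers are hard to train.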
Ryan: So, if a neural network is a brain, a deep neural network is a bigger brain. What are the unique benefits of that?
Sechan: Deep neural networks have the biggest capacity, which means they’re the most effective at expressing the relationship between inputs and outputs.
So, what does that mean, exactly? Basically, machine learning models are trained to connect or explain an output based on the inputs it receives. In the advertising world, the inputs are the things we see at a bid request: ad format, connection type, device type, publisher’s app, time of day, location, and so on. The outputs are the actual actions the user took based on seeing an ad. Did the user ignore the ad? Install the app? Make a purchase? The more we can understand the relationship between inputs and outputs, the better we get at targeting the right people with ads.
Capacity is so important here because we are dealing with an extremely large number of input feature values. Imagine the number of cities in the world, the number of mobile devices in use, and the number of publishing apps that serve advertising. We need a model that can capture the interaction of all such feature values in order to accurately predict whether a user is likely to take a desired action as a result of seeing an ad.
The other key benefit of deep neural networks is that they excel at generalization, which, in data science, is a good thing. It means that input/output relationships from one action, such as account registration, are often useful for another target action, such as an in-app purchase. Moloco’s models are very good at understanding the relationships between multiple publishers’ apps. In many cases, the things we observe in one publishing app can apply to a similar publishing app. This allows us to deliver strong campaign performance very quickly.
Ryan: Marketers today are worried about limit ad tracking (LAT) and how it will affect the amount of data that’s available to fuel their campaigns. How can a deep neural network mitigate that challenge?
Sechan: Deep learning, like all machine learning, works best when the training dataset is very large. With third-party data going away, first-party features will have to play a bigger role in advertising, and that, in turn, will dictate how we leverage deep learning technology.
To understand why, let's use the analogy of machine-based language translation, like Google Translate. When we use machines to translate text from one language to another, say French to Korean, the translation actually happens in two steps.
First, we translate the French sentence into a numerical representation of the sentence itself, which can be seen as a machine language. This representation is the same regardless of whether we then translate that sentence into Korean or Japanese. Next, we convert that numerical representation to a Korean sentence. The great advantage of this approach is that it can translate between pairs of languages that have never been seen together before.
We apply this same concept when using a brand’s first-party data in the deep learning dataset. We can understand the meaning of a brand’s unique first-party data with very limited campaign observations, and thus, can scale up advertising performance quickly.
Ryan: That’s a lot to unpack. Can you give me an example?
Sechan: Sure. Let's say a brand wants to leverage its first-party data in a bunch of advertising campaigns. As we discussed, we’d typically need a lot of data – essentially, the entire campaign log – to understand the relationship between the inputs and outputs. In other words, what are input features, and how do they help us to understand how the results unfolded? Or why did certain users take a desired action while others did not? Most DSPs require a huge volume of responses in order to build a strong machine learning model.
So, what happens when we don’t have an abundance of campaign data, like when a brand tries to use its first-party data for the first time? We can use deep learning technology to translate the publisher’s raw first-party features into a kind of artificial language within our system, similar to the way a French word is translated into a numerical representation. We then estimate various conversion probabilities using the numerical representations. This allows us to generalize from a brand’s first-party data very, very well, even though we have a relatively small number of observations.
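The two steps described above (raw features to numerical representation, then representation to conversion probability) can be sketched in a few lines of Python. This is a rough illustration under loud assumptions: a real system would learn the vectors during training, whereas here `embed` derives stand-in vectors from a hash, and the feature names, weights, and `conversion_probability` function are hypothetical.

```python
import hashlib
import math

EMBEDDING_DIM = 4


def embed(feature_name, raw_value, dim=EMBEDDING_DIM):
    """Map a raw categorical value to a fixed-length numeric vector.

    Stand-in only: the vector comes from a hash so the example is
    deterministic. In a trained model these numbers would be learned.
    """
    digest = hashlib.sha256(f"{feature_name}={raw_value}".encode()).digest()
    return [digest[i] / 255.0 * 2.0 - 1.0 for i in range(dim)]


def conversion_probability(features, weights, bias=0.0):
    """Concatenate the feature vectors and score them with a logistic model."""
    vector = [
        x
        for name, value in sorted(features.items())
        for x in embed(name, value)
    ]
    z = sum(w * x for w, x in zip(weights, vector)) + bias
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical bid-request and first-party features.
features = {
    "publisher_app": "com.example.puzzle",
    "device_type": "phone",
    "loyalty_tier": "gold",
}
weights = [0.1] * (EMBEDDING_DIM * len(features))  # placeholder, untrained

print(f"estimated conversion probability: {conversion_probability(features, weights):.3f}")
```

The design point is that once every raw value lives in the same numeric space, a value the model has rarely seen can still be scored by the same downstream function.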
Ryan: Moloco’s customers tell us that they realize ROAS faster than other DSPs they’ve used. Why is that?
Sechan: The industry typically uses past results – past responses to ad impressions – as machine learning training datasets. Conversions are pretty sparse, especially when compared to the number of ad impressions generated. So, when we’re looking for features of a conversion, we’re looking at a sparse dataset.
It can take months to realize ROAS if you wait until you receive enough positive samples – in this case, conversions – to begin training your model. Yet, this is the approach that most DSPs take.
We decided to take a completely different approach. Instead of waiting for live campaign data to come in, we begin translating the first-party dataset into machine learning features (like I just described) before the campaign even launches. This gives us a significant head start.
Another big difference in our approach is we update our model with real-time results every hour. We don’t wait for a day, a week, or a month to pass. This allows us to deliver sustained ROAS, and avoid the phenomenon of diminishing returns.
Ryan: Earlier you said that Moloco uses 10 different inference models. What are they, and what do they do?
Sechan: Inference models take in a bunch of disparate data points and calculate an output, such as a numerical score, that represents a prediction. That prediction can be the likelihood that a user installs an app, or the price we need to bid in order to win the impression.
Moloco’s models are grouped into three categories. The first focuses on infrastructure cost optimization. We said earlier that it’s really difficult to engineer deep learning on a DSP. The first group of inference models saves a significant portion of infrastructure cost with almost no impact on revenue or performance.
The second group is core to advertising since it predicts the advertising output result. Will this user install an app, make a purchase, start playing a game again after being dormant for six months, etc.? This group of models predicts the various kinds of user responses when we see the bid request. Such predictions give us an estimated return from serving an advertiser's ad impression.
The third group ensures we bid competitively to acquire impressions for our clients, and, at the same time, it prevents us from overpaying for inventory or spending our clients’ budgets unwisely. This is a bit tricky, as the right price point will change during the course of a day or week. So we deploy these bidding models to determine the optimal bid price as well as the optimal budget spend strategy to maximize the advertiser’s return.
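One common textbook way to combine a response prediction with a bid price is to bid the expected value of the impression and then scale it by a budget pacing factor. The sketch below assumes that simplification; it is not Moloco's bidding logic, and `pacing_multiplier`, `compute_bid`, and all the numbers are hypothetical.

```python
def pacing_multiplier(spent, budget, elapsed_fraction):
    """Scale bids down when spend runs ahead of an even-spend schedule."""
    if spent >= budget:
        return 0.0  # budget exhausted: stop bidding
    target = budget * elapsed_fraction
    if spent <= target:
        return 1.0  # on or behind schedule: bid at full value
    # Ahead of schedule: shrink in proportion to the overshoot.
    return max(0.0, 1.0 - (spent - target) / (budget - target))


def compute_bid(p_conversion, value_per_conversion, max_bid,
                spent, budget, elapsed_fraction):
    """Expected value of the impression, paced and capped at max_bid."""
    expected_value = p_conversion * value_per_conversion
    paced = expected_value * pacing_multiplier(spent, budget, elapsed_fraction)
    return min(max_bid, paced)


# A 2% predicted conversion rate on a conversion worth 50 gives an expected
# value of 1.0. Halfway through the day, 60 of a 100 budget is already spent,
# so the pacing factor trims the bid.
bid = compute_bid(
    p_conversion=0.02,
    value_per_conversion=50.0,
    max_bid=5.0,
    spent=60.0,
    budget=100.0,
    elapsed_fraction=0.5,
)
print(f"bid: {bid:.2f}")
```

Because the right price shifts over a day or week, a production system would recompute both the prediction and the pacing factor continuously rather than using fixed inputs like these.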
Ryan: And all of this happens in less than 100 milliseconds?
Sechan: Yes! And it actually happens for each of the 350 billion bid requests Moloco sees each day. So you can see why this is a difficult engineering feat.
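Taking the figures quoted in this interview at face value, a quick back-of-the-envelope calculation shows the scale involved. The numbers below come from the interview; the averaging over a full day and the variable names are assumptions of this sketch, and real traffic will peak above the daily average.

```python
SECONDS_PER_DAY = 24 * 60 * 60

bid_requests_per_day = 350_000_000_000  # 350 billion bid requests per day
max_models_per_request = 10             # up to 10 inference models per request
latency_budget_ms = 100                 # roughly 100 ms to respond

# Average load if requests were spread evenly over the day.
avg_requests_per_second = bid_requests_per_day // SECONDS_PER_DAY

# Upper bound on model evaluations if every request fired all 10 models.
max_inferences_per_second = avg_requests_per_second * max_models_per_request

# Little's law (in flight = arrival rate x time in system): an upper bound
# if every request used the full latency budget.
max_requests_in_flight = avg_requests_per_second * latency_budget_ms // 1000

print(f"average bid requests per second: {avg_requests_per_second:,}")
print(f"model inferences per second (upper bound): {max_inferences_per_second:,}")
print(f"requests in flight at once (upper bound): {max_requests_in_flight:,}")
```

That works out to roughly 4 million bid requests per second on average, which is consistent with the peak capacity of over 6 million per second mentioned earlier.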