What is this… ‘Driver Analysis'?
Business decision makers often rely on driver analysis to assess what is REALLY important to their customers. For example, customers often claim that price is the most important thing to them when asked. More often than not, however, driver analysis reveals that service or other experiences are actually what drives overall satisfaction. This type of insight is what allows driver analysis to sort the ‘signal from the noise’. It means decision makers can allocate their limited resources where they will yield the best return.
Sounds great, what’s the catch?
Because of the reliance on these techniques, and the dollars at stake, marketers need to have confidence that their driver modelling produces robust, accurate and actionable results. The forces changing the Market Research industry don’t necessarily help achieve this. Respondents want shorter surveys, and clients want faster insights. We collect more feedback as text than ever, and we survey people across many channels and stages of their customer journey. All of this makes it harder to collect the right sort of data for driver analysis.
We therefore need new driver analysis methods that solve these problems and don’t require long surveys with endless rated statements and massive sample sizes. At Evolve Research, we not only apply these techniques, but have developed our own capabilities that enable driver analysis to keep pace with changing client needs and research methods. In this article, we’ll discuss how we’re doing this, but first, a brief primer on driver analysis for those who don’t have a technical understanding.
OK, so how does it actually work?
Driver analysis is a two-stage process: we ‘model’ a relationship between variables, then derive information about that relationship. In effect, we attempt to recreate an accurate model of a dependent variable (e.g. customer satisfaction) from the scores on other variables alone. Think of a ship in a bottle: we know a real ship is made of wood, linen and steel, but it isn’t just the materials, it’s also getting the proportions right. A good statistical model is one that is an accurate approximation of the real thing.
In stage two, when we have modelled a suitably strong relationship, we can inspect the individual variables to determine their importance in the relationship relative to the other variables.
To be specific, we want to know the relative weight of the different variables in driving the dependent variable. For example, we may conduct a survey and ask customers to rate their satisfaction with price, range, service and after-sales support. We then use driver analysis to determine the relative ‘impact’ of each variable on overall satisfaction. This is a fairly standard application and one that researchers can perform using any number of software packages, typically applying regression-based techniques.
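To make this concrete, here is a minimal sketch of the standard regression-based approach in Python. The data is simulated and the attribute names are just our running example; standardising the predictors is one common way to make the coefficients comparable as relative weights.

```python
# A minimal regression-based driver analysis sketch.
# Attribute names and the simulated ratings are illustrative only.
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
n = 500

# Simulated 1-10 ratings for four attributes (stand-ins for survey data).
X = pd.DataFrame({
    "price": rng.integers(1, 11, n),
    "range": rng.integers(1, 11, n),
    "service": rng.integers(1, 11, n),
    "after_sales": rng.integers(1, 11, n),
})
# Overall satisfaction driven mostly by service in this invented data.
y = (0.2 * X["price"] + 0.1 * X["range"] + 0.5 * X["service"]
     + 0.2 * X["after_sales"] + rng.normal(0, 1, n))

# Standardise predictors so the coefficients are comparable 'weights'.
X_std = StandardScaler().fit_transform(X)
model = LinearRegression().fit(X_std, y)

for name, coef in zip(X.columns, model.coef_):
    print(f"{name:12s} relative weight: {coef:.2f}")
```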
However, driver analysis has broader uses; the dependent variable need not be satisfaction but can be practically anything, such as the risk of defection or the probability of the sale of a product. Economists use much the same techniques for budget allocation, hiring decisions, and managing stock portfolios.
The accuracy of the model depends on several things including the sample size, the amount of variation in the sample, the type of data being collected and how well you’ve designed your questionnaire.
These models can also predict other useful things such as customer segments or buying behaviour. It all depends on the type of data and the nature of what you want to predict, which determines the form of the model and what it does.
As mentioned earlier, obtaining a strong model is the first consideration. A strong model is one that is good at predicting the right answer. There are various statistical procedures to determine this, but the basic one is called ‘r square’ for numeric analysis and accuracy for binary (yes/no) analysis. R square expresses the proportion of variance in the dependent variable that your model explains, while accuracy expresses the proportion of respondents the model classifies correctly. A high score means that you can accurately predict the dependent variable just from knowing the scores on the predictor (aka independent) variables. A low score means just the opposite and typically arises because the questionnaire is not measuring all causal factors. Other technical factors can also reduce the r square of the model, for example unmet model assumptions, poor data quality or low sample sizes in the survey.
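Both measures are simple to compute. The numbers below are made up purely to show the calculations: r square compares predicted and actual scores for a numeric outcome, and accuracy counts correct classifications for a binary one.

```python
# Model strength: R-squared for numeric outcomes, accuracy for binary ones.
# The values here are invented purely to demonstrate the calculations.
import numpy as np
from sklearn.metrics import r2_score, accuracy_score

# Numeric case: actual vs. predicted overall-satisfaction scores.
y_actual = np.array([7.0, 3.5, 8.2, 6.1, 9.0, 4.4])
y_predicted = np.array([6.5, 4.0, 7.8, 6.5, 8.6, 5.0])
print("R-squared:", round(r2_score(y_actual, y_predicted), 2))

# Binary case: actual vs. predicted 'satisfied' (1) / 'dissatisfied' (0).
actual = np.array([1, 1, 1, 0, 1, 0])
predicted = np.array([1, 1, 1, 1, 1, 0])
print("Accuracy:", round(accuracy_score(actual, predicted), 2))
```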
Once we know the strength of the model, we can move on to look at the individual predictor variable impacts. These tell us the relative weight of each predictor on the dependent variable being predicted. For example, fuel economy might have twice as much weight on shoppers’ desire to buy a car as comfort. A one-point improvement in fuel economy is ‘worth’ two points of comfort. Invest in a better engine rather than more comfortable seats.
There are several approaches to determining variable importance. Some techniques treat the model as a formula or recipe for predicting satisfaction: one cup of price, two cups of service, and a dash of car park make a satisfied customer. Other techniques replace actual respondent data with random noise and measure the impact on model accuracy to determine the importance of each variable.
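The best-known version of that second idea is permutation importance: shuffle one variable at a time and see how much the model’s score suffers. Here is an illustrative sketch on simulated data (the attributes and weights are invented, and this is a generic textbook technique rather than our production method).

```python
# Permutation importance sketch: shuffle one predictor at a time and measure
# how much the model's score drops. Data is simulated for illustration only.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
n = 400
X = pd.DataFrame({
    "price": rng.integers(1, 11, n),
    "service": rng.integers(1, 11, n),
    "car_park": rng.integers(1, 11, n),
})
# Invented relationship: service matters most in this fake data.
y = 0.3 * X["price"] + 0.6 * X["service"] + 0.1 * X["car_park"] + rng.normal(0, 1, n)

model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
result = permutation_importance(model, X, y, n_repeats=10, random_state=0)

for name, imp in zip(X.columns, result.importances_mean):
    print(f"{name:10s} importance: {imp:.3f}")
```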
Ok, so what can go wrong with that?
Well, a lot actually. Worse than not getting an answer, poorly performed driver modelling can yield the wrong answer. Working within the ‘rules’ of driver modelling is critical to getting a valid and true model.
The following is a list of challenges associated with driver analysis. Most of these have existed since people started using driver modelling techniques. Market research professionals may encounter some of these problems in their projects. Accurate, robust results depend on working through these issues with your project analyst.
- Non-normal data
- Missing data
- Multicollinearity
- Class imbalance
Non-normal data
Many modelling techniques come with a fixed set of assumptions about the data. Older driver analysis techniques rely on data being ‘normally distributed’, better known as the ‘bell curve’. Responses to satisfaction questions often don’t look like a bell curve: respondents tend to cluster at one end of the scale, violating this assumption.
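A quick way to see this for yourself is to check the skewness of a ratings variable. The handful of ratings below are invented to show the typical lopsided pattern; a skewness near zero and a large Shapiro-Wilk p-value would be consistent with a bell curve.

```python
# A quick check of whether satisfaction ratings look 'bell curved'.
# The ratings are invented to show a typically skewed pattern.
import numpy as np
from scipy import stats

# Ratings bunched up at one end of a 1-10 scale.
ratings = np.array([9, 10, 8, 9, 10, 9, 7, 10, 9, 8, 10, 9, 3, 9, 10, 8])

print("Skewness:", round(stats.skew(ratings), 2))  # ~0 for a symmetric bell curve
stat, p = stats.shapiro(ratings)                   # small p-value => not normal
print("Shapiro-Wilk p-value:", round(p, 4))
```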
Missing data
Missing data happens when respondents skip a question due to survey logic, or are asked a question and answer ‘don’t know’. As per our example above, some customers may walk to the store and never use the car park, so they can’t rate it. Evolve uses sophisticated techniques to fill in missing data and allow fair results for the car park attribute. We run the analysis several times, each time with a different missing data treatment, and average the results. Older techniques may simply exclude any respondent with missing data, resulting in a model that only takes into account car users, not all customers.
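To illustrate the run-several-times-and-average idea (not our exact production method), the sketch below fills the car park gaps by randomly drawing from the observed ratings, fits a model, repeats with different random draws, and averages the weights.

```python
# Sketch of handling missing data by imputing several times and averaging.
# Uses simple random hot-deck imputation; data and attributes are invented.
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
n = 300
df = pd.DataFrame({
    "service": rng.integers(1, 11, n).astype(float),
    "car_park": rng.integers(1, 11, n).astype(float),
})
df["satisfaction"] = 0.6 * df["service"] + 0.3 * df["car_park"] + rng.normal(0, 1, n)
# Customers who walk to the store never rated the car park.
df.loc[rng.random(n) < 0.3, "car_park"] = np.nan

coefs = []
for seed in range(10):
    imp = df.copy()
    local_rng = np.random.default_rng(seed)
    observed = imp["car_park"].dropna().to_numpy()
    missing = imp["car_park"].isna()
    # Fill each gap with a randomly drawn observed rating, then model.
    imp.loc[missing, "car_park"] = local_rng.choice(observed, size=missing.sum())
    model = LinearRegression().fit(imp[["service", "car_park"]], imp["satisfaction"])
    coefs.append(model.coef_)

print("Averaged weights (service, car_park):", np.round(np.mean(coefs, axis=0), 2))
```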
Multicollinearity
A technical term which in practice means the variables all pretty much say the same thing. This can occur when asking about too many predictor attributes that are similar in their interpretation, or that respondents are generally ambivalent about. Whereas traditional techniques can produce highly varied results in this situation, our techniques remain stable.
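A standard diagnostic for this problem is the variance inflation factor (VIF). The sketch below computes it from first principles on simulated data in which one attribute is deliberately a near-copy of another; it shows how to spot the issue rather than how we stabilise the results.

```python
# Detecting multicollinearity with the variance inflation factor (VIF).
# VIF for a predictor = 1 / (1 - R^2) from regressing it on the others.
# Values above roughly 5-10 suggest the predictors largely say the same thing.
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
n = 300
service = rng.integers(1, 11, n).astype(float)
X = pd.DataFrame({
    "service": service,
    "staff_friendliness": service + rng.normal(0, 0.5, n),  # nearly a copy of 'service'
    "price": rng.integers(1, 11, n).astype(float),
})

for col in X.columns:
    others = X.drop(columns=col)
    r2 = LinearRegression().fit(others, X[col]).score(others, X[col])
    print(f"{col:20s} VIF: {1 / (1 - r2):.1f}")
```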
Class imbalance
When measuring satisfaction on a survey, it is generally the case that most customers are satisfied and very few are dissatisfied. This doesn’t sound like a problem at all – unless you’re an analyst! In these instances, we can quickly design a simple, accurate and totally useless model: simply predict that every respondent is satisfied, and we’ve achieved 90% accuracy! This is not useful, so we need special techniques to force a fairer analysis – that is, one that is sensitive enough to correctly allocate dissatisfied customers to the dissatisfied group.
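The sketch below makes the problem concrete on simulated data: the ‘everyone is satisfied’ model scores around 90% accuracy yet finds no dissatisfied customers, while re-weighting the classes (one common remedy, not necessarily the one we use) trades a little accuracy for much better sensitivity.

```python
# The class-imbalance problem: a 'predict everyone is satisfied' model scores
# high accuracy but never finds a dissatisfied customer. Data is simulated;
# class_weight='balanced' is shown as one common, generic remedy.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, recall_score

rng = np.random.default_rng(3)
n = 1000
service = rng.normal(7, 2, n)
# Roughly 10% dissatisfied (1 = dissatisfied), more likely when service is poor.
dissatisfied = (service + rng.normal(0, 1, n) < 4).astype(int)
X = service.reshape(-1, 1)

naive = np.zeros(n, dtype=int)  # predict 'satisfied' for everyone
print("Naive accuracy:", round(accuracy_score(dissatisfied, naive), 2))
print("Naive recall of dissatisfied:", round(recall_score(dissatisfied, naive), 2))

balanced = LogisticRegression(class_weight="balanced").fit(X, dissatisfied)
pred = balanced.predict(X)
print("Balanced accuracy:", round(accuracy_score(dissatisfied, pred), 2))
print("Balanced recall of dissatisfied:", round(recall_score(dissatisfied, pred), 2))
```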
- Weighting
- Large data sets
- Text data
- Estimating variance
Weighting
Often in market research we need to weight data; we will go into detail about this technique in another article. In short, we often need to value some respondents more than others in our analysis. Our driver analysis technique can handle weighting requirements: our approach sub-samples the data at random in proportion to each respondent’s weight, much like our missing data handling, and averages the results across runs.
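As a rough illustration of the idea (not our exact implementation), the sketch below draws respondents with probability proportional to their weight, fits a model on each draw, and averages the resulting weights.

```python
# Handling respondent weights by repeated weighted re-sampling:
# draw respondents in proportion to their weight, model each draw, average.
# Data, attributes and weights are invented for illustration.
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(4)
n = 500
df = pd.DataFrame({
    "price": rng.integers(1, 11, n).astype(float),
    "service": rng.integers(1, 11, n).astype(float),
    "weight": rng.uniform(0.5, 2.0, n),  # e.g. weights correcting sample skews
})
df["satisfaction"] = 0.3 * df["price"] + 0.6 * df["service"] + rng.normal(0, 1, n)

probs = (df["weight"] / df["weight"].sum()).to_numpy()
coefs = []
for _ in range(20):
    idx = rng.choice(n, size=n, replace=True, p=probs)   # weight-proportional draw
    sample = df.iloc[idx]
    model = LinearRegression().fit(sample[["price", "service"]], sample["satisfaction"])
    coefs.append(model.coef_)

print("Averaged weights (price, service):", np.round(np.mean(coefs, axis=0), 2))
```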
Large data sets
Analysing hundreds of thousands of responses can take a long time. Some techniques only use a small subset of a larger data set. We do this as well, but we re-sample several times, which helps ensure we use all the data to get the most accurate results possible.
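The mechanics mirror the weighted re-sampling sketch above; here is a short illustration that models several random subsets of a large simulated file and averages the results.

```python
# Working with a very large dataset: model several random subsets rather than
# everything at once, then average the results. Purely illustrative.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(5)
n = 200_000                              # stand-in for a very large survey file
price = rng.integers(1, 11, n).astype(float)
service = rng.integers(1, 11, n).astype(float)
satisfaction = 0.3 * price + 0.6 * service + rng.normal(0, 1, n)
X = np.column_stack([price, service])

coefs = []
for _ in range(10):
    idx = rng.choice(n, size=10_000, replace=False)   # a manageable subset
    coefs.append(LinearRegression().fit(X[idx], satisfaction[idx]).coef_)

print("Averaged weights (price, service):", np.round(np.mean(coefs, axis=0), 2))
```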
Text data
Typically, text data is not in a format compatible with driver analysis. We use cutting-edge techniques to extract thematic information from text data so that driver analysis can be run on it. This allows respondents to set the agenda of what is important, rather than questionnaire writers.
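As a simple illustration of theme extraction (a generic TF-IDF plus topic-model sketch, not our text analytics engine), the snippet below turns a handful of invented comments into numeric theme scores that could then sit alongside rated attributes in a driver model.

```python
# Turning open-ended comments into numeric 'theme' scores for driver analysis.
# TF-IDF plus NMF topic extraction; the comments are invented and the approach
# is a generic illustration only.
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

comments = [
    "staff were friendly and the service was quick",
    "prices are too high for what you get",
    "great service, very helpful staff",
    "expensive compared to other stores, poor value",
    "quick checkout and helpful service",
    "good value and reasonable prices",
]

tfidf = TfidfVectorizer(stop_words="english")
doc_term = tfidf.fit_transform(comments)

nmf = NMF(n_components=2, random_state=0)
theme_scores = nmf.fit_transform(doc_term)   # one score per comment per theme

terms = tfidf.get_feature_names_out()
for i, weights in enumerate(nmf.components_):
    top = [terms[j] for j in weights.argsort()[::-1][:3]]
    print(f"Theme {i + 1}: {', '.join(top)}")
# theme_scores can now sit alongside rated attributes as predictors.
```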
Estimating variance
Usually driver analysis involves creating a single model, which can make it hard to determine the variance in the model. A model built with thousands of respondents can, on paper, look as strong as one built with very few. Our techniques report on this variation, giving you a better understanding of the statistical power and informing judgement about the validity of the model.
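One common way to estimate that variation is bootstrapping: refit the model on many resamples of the data and look at the spread of the strength statistic. The sketch below does this on a deliberately small simulated sample; it illustrates the general idea rather than our exact reporting.

```python
# Estimating model variance by bootstrapping: refit on many resamples and
# look at the spread of R-squared. Illustrative simulated data only.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(6)
n = 80                                      # deliberately small sample
price = rng.integers(1, 11, n).astype(float)
service = rng.integers(1, 11, n).astype(float)
satisfaction = 0.3 * price + 0.6 * service + rng.normal(0, 1.5, n)
X = np.column_stack([price, service])

r2s = []
for _ in range(200):
    idx = rng.choice(n, size=n, replace=True)       # bootstrap resample
    model = LinearRegression().fit(X[idx], satisfaction[idx])
    r2s.append(r2_score(satisfaction, model.predict(X)))

print(f"R-squared: {np.mean(r2s):.2f} +/- {np.std(r2s):.2f}")
```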
Wrapping it all up
Considering all the above, driver analysis is a relatively simple concept that is quite easy to stuff up. For that reason, it has traditionally been the domain of trained analysts. At Evolve we are developing processes to make these driver analysis techniques available on the desktop, with the hard work and decisions handled behind the scenes. We are developing methodologies that are robust, accurate and soon to be available as part of our HumanListening™ platform, alongside our other analytical tools.
Find out more
For more information about our analytics capabilities, visit our Data Science & Analytics page.