BI for beginners

Hi all, this week (I enjoy maintaining the delusion that this is a weekly blog; but it still beats the schedule of my favorite web comic) I would like to write about some basic BI stuff. This week’s topic is Key Performance Indicators (KPIs): which ones to use, how to calculate them, and what they mean (paying attention to some of the traps).

Key Performance Indicators are statistics about your application; generally they are about user behavior. Like most things, KPIs can be formed into arbitrary groups by people other than me and into meaningful groups by me. The deeply meaningful and aesthetically pleasing groups that I assign KPIs to are “cohort” and “user” metrics. “Cohort Metrics” (CM) group users by their install day whereas “User Metrics” (UM) group users by their activity dates.

This post describes CM. The most useful of these are installs, average revenue per install (ARPI), average revenue per paying install (ARPPI), percent conversion, average indirect-revenue per install (AIPI), average indirect-revenue per monetizing install (AIPMI), Retention, and first (I am not above enjoying that the last element of the list is “first”). Many people call ARPI and ARPPI, ARPU and ARPPU, respectively. However, they are choosing to say “poo” and “poo-poo” far too much for my liking, i.e., “who’s poo-poo? It is ARPPU!” Even if you do like saying “poo” in business meetings you must agree that Pie >> Poo; if not, you no longer have my permission to continue reading this blog…

Returning to CM KPIs, installs is the number of installs that occur in the time segment of interest. Commonly these are grouped by install day or week; however, other time periods can be useful. It is also common to group installs by builds or versions of a product. If your data is organized in a sensible way it should be very easy to make a SQL query to extract it:

“SELECT day, count(id) as installs FROM user_table GROUP BY 1 ORDER BY 1;”

Note that a “count(distinct id)” is not necessary because a user_table should only have one entry per user! For user acquisition it may be helpful to think about user attributes: what device they are running, country they are from, or ad type (are they a paid install or organic? if paid, which network?). These attributes can either be included in additional group by terms or with case statements:

“SELECT day, country, count(id) as installs_all, count(CASE WHEN ad_type = 'organic' THEN id END) as installs_org FROM user_table GROUP BY 1,2 ORDER BY 2,1;”

Installs is quite straightforward, so the only “trap” is the ubiquitous SQL one that a row is only returned if there is at least one entry, i.e., if there are days with zero installs, an average of the number of installs per day will be artificially high because all the zero days will be excluded.
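One way around this trap (a rough sketch only; dates_table is a hypothetical calendar table with one row per day, so swap in whatever your warehouse actually has) is to left join the installs onto the calendar so that the zero days survive:

“SELECT d.day, count(u.id) as installs
FROM dates_table d
LEFT JOIN user_table u ON u.day = d.day
GROUP BY 1
ORDER BY 1;”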

Moving on to ARPI, ARPPI, and percent conversion, these KPIs describe (direct) monetization. ARPI gives the value per install, i.e., what a user is worth. Normally ARPI has a number indicating the calendar age in days of the user (e.g., ARPI 7). This is important as it allows for the comparison of different cohorts of users. While users that installed a month ago have had more time to spend than users that installed last week, they have the same time to spend in their first 7 days. If your data is organized into a user_table and an iap_table then an SQL query could be:

“SELECT *, rev_7/installs_7 as arpi_7
FROM
(SELECT first_contact_date,
count(distinct CASE WHEN cal_age >= 6 THEN id END) as installs_7, -- users that are at least 7 days old
SUM(CASE WHEN iap_age <= 6 THEN rev END) as rev_7 -- revenue earned in the first 7 days
FROM
(SELECT Q1.*, Q2.iap_time, Q2.rev, datediff('day', Q1.first_contact_date, Q2.iap_time ) as iap_age,
datediff('day', Q1.first_contact_date, Q1.max_date ) as cal_age
FROM
(
(SELECT id, first_contact_date, (select max(first_contact_date) FROM user_table) as max_date -- for if you don't get new data every day, otherwise use date()
FROM user_table) Q1
LEFT JOIN
(SELECT id, iap_time, rev
FROM iap_table) Q2
ON Q1.id = Q2.id
)
) detail
GROUP BY 1
) cohorts
ORDER BY 1;”

This is a bit verbose; most organizations would put a lifetime value for key ages on each user in the user summary table. However, I think showing it this way makes it clear what ARPI is. To analyze ARPI data it is helpful to make what I call a strata plot; Fig. 1 is an example of such a plot. The R code, as usual, is appended at the end of the post.


Figure 1: Strata plot of ARPI data, color indicates user age. The ARPI for each install cohort is written (in dollars) if there is room.

Note that with such plots the top bin is generally not meaningful. In the example, if you were interested in ARPI for users after 30 days the top bin would have to be sliced into more segments and some of the install weeks may not have any data (they would not have reached 30 days in age). Note that this can be done for a mix of all users or just organic or a particular network/device/country. Also note that for organic users their ARPI will be a function of time since game launch (golden cohort effect) and app store rank.
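As mentioned above, most organizations would pre-compute a lifetime value per user at key ages and store it in the user summary table. A rough sketch of that per-user step (the ltv_7 name is mine, not something the earlier queries require):

“SELECT u.id, u.first_contact_date,
SUM(CASE WHEN datediff('day', u.first_contact_date, i.iap_time) <= 6 THEN i.rev ELSE 0 END) as ltv_7
FROM user_table u
LEFT JOIN iap_table i ON i.id = u.id
GROUP BY 1,2;”

With a column like that in place, ARPI 7 for a cohort is just the average of ltv_7 over users that are at least 7 days old.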

ARPI can be thought of as the product of the KPIs ARPPI and percent conversion. Percent conversion is the percent of users that convert (make an In-App Purchase (IAP)) by a given age. ARPPI is the value of converters. Given a change in ARPI it is important to understand whether it is more users buying or the same number of users buying more/higher-value IAPs. The following SQL query gives percent conversion by age 7.

“SELECT *, convert_7/installs_7::float as percent_con_7
FROM
(SELECT first_contact_date,
count(CASE WHEN cal_age >= 6 THEN id END) as installs_7,
count(CASE WHEN (cal_age >= 6) and (con_age <= 6) THEN id_con END) as convert_7
FROM
(SELECT Q1.*, Q2.id as id_con,
datediff('day', Q1.first_contact_date, Q1.max_date ) as cal_age,
datediff('day', Q1.first_contact_date, Q2.con_time ) as con_age
FROM
(
(SELECT id, first_contact_date, (select max(first_contact_date) FROM user_table) as max_date
FROM user_table) Q1
LEFT JOIN
(SELECT id, min(iap_time) as con_time -- first purchase time per user
FROM iap_table
GROUP BY 1) Q2
ON Q1.id = Q2.id
)
) detail
GROUP BY 1
) cohorts
ORDER BY 1;”

The following query gives ARPPI. It is very similar to the query for ARPI; however, the join is in the other order (I wrote it with a left join but swapped Q1 and Q2 because right joins are the worst; worse than Fred Durst). With the subqueries in this order all users must be in the IAP table; however, they do not have to spend before the required age, so the extra (iap_age <= 6) condition on installs_7 is also added.

“SELECT *, rev_7/installs_7 as arppi_7
FROM
(SELECT first_contact_date,
count(distinct CASE WHEN (cal_age >= 6) AND (iap_age <= 6) THEN id END) as installs_7, -- paying installs at least 7 days old that spent in their first 7 days
SUM(CASE WHEN iap_age <= 6 THEN rev END) as rev_7
FROM
(SELECT Q1.*, Q2.iap_time, Q2.rev, datediff('day', Q1.first_contact_date, Q2.iap_time ) as iap_age,
datediff('day', Q1.first_contact_date, Q1.max_date ) as cal_age
FROM
(
(SELECT id, iap_time, rev
FROM iap_table) Q2
LEFT JOIN
(SELECT id, first_contact_date, (select max(first_contact_date) FROM user_table) as max_date
FROM user_table) Q1
ON Q1.id = Q2.id
)
) detail
GROUP BY 1
) cohorts
ORDER BY 1;”
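As a sanity check on the claim above that ARPI is the product of ARPPI and percent conversion, if the three KPIs have been written into a cohort summary table (kpi_by_cohort is a hypothetical table name, not something built above) the identity is easy to eyeball:

“SELECT first_contact_date, arpi_7, arppi_7 * percent_con_7 as arpi_7_check
FROM kpi_by_cohort
ORDER BY 1;”

Any large gap between arpi_7 and arpi_7_check usually means the age filters in the three queries are not consistent.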

The AIPI, AIPMI, and percent indirect monetizers are almost exactly like ARPI, ARPPI, and percent conversion but would come from an indirect revenue table. This revenue could be real, e.g., from ads, or it could be theoretical, e.g., the value of attracting another install. Either way, similar queries to the direct revenue KPIs should work.
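For example, here is a rough AIPI 7 sketch; indirect_rev_table and its (id, rev_time, rev) columns are names I am assuming, so rename them to match your schema. It is just the ARPI query with the IAP table swapped out:

“SELECT *, rev_7/installs_7 as aipi_7
FROM
(SELECT first_contact_date,
count(distinct CASE WHEN cal_age >= 6 THEN id END) as installs_7,
SUM(CASE WHEN rev_age <= 6 THEN rev END) as rev_7
FROM
(SELECT Q1.*, Q2.rev, datediff('day', Q1.first_contact_date, Q2.rev_time ) as rev_age,
datediff('day', Q1.first_contact_date, Q1.max_date ) as cal_age
FROM
(
(SELECT id, first_contact_date, (select max(first_contact_date) FROM user_table) as max_date
FROM user_table) Q1
LEFT JOIN
(SELECT id, rev_time, rev
FROM indirect_rev_table) Q2
ON Q1.id = Q2.id
)
) detail
GROUP BY 1
) cohorts
ORDER BY 1;”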

Retention is the probability that a user returns k days after install. This should be defined in absolute time, not calendar days; however, with enough users spread out over different time zones there is little difference. The following query gives Retention 7 for each install date.

“SELECT *, n_ret_7/installs::float as ret_7
FROM
(SELECT first_contact_date, count(distinct id) as installs, count(distinct CASE WHEN login_age = 7 THEN id END) as n_ret_7
FROM
(SELECT Q1.*, Q2.login_time, floor(datediff('hour', Q1.first_contact_date, Q2.login_time )/24::float) as login_age
FROM
(
(SELECT id, first_contact_date
FROM user_table) Q1
LEFT JOIN
(SELECT id, login_time
FROM login_table) Q2
ON Q1.id = Q2.id
)
) detail
GROUP BY 1
) cohorts
ORDER BY 1;”

Commonly an analyst will be more interested in the number of users that return and the number that don’t, because this can be modeled as a binomial, but for reporting most people think in percent retention. Figure 2 gives the probability of retention with confidence intervals for two groups of users. The retention values are calculated for each age individually. To maximize the value of the comparison the decay curves should be modeled parametrically; however, because this is game specific I am unwilling to share the distribution I use at HotHead.


Figure 2: The indicated-day retention for two groups of users with 95% central credibility intervals.

Note that although the two decay curves ultimately end up at the same retention value, users from group 0 would have more logins in their first 30 days than users from group 1.
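Relatedly, if you want the binomial counts mentioned above (the number that returned and the number that did not) rather than the percentage, here is a rough sketch, assuming the cohort-level results of the Retention 7 query have been written to a hypothetical retention_by_cohort table:

“SELECT first_contact_date, n_ret_7 as returned_7, installs - n_ret_7 as not_returned_7
FROM retention_by_cohort
ORDER BY 1;”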

And lastly, the first KPI. First is the number of days that a user logs in during their first x days. This gives a better measure of early engagement than retention. In order to promote more (i.e., any) comments/discussion I am only giving a query for this once someone asks for it in the comments.

That is it for now. I will post again soon on User Metrics.

Cheers Gavin

 

R code:

# make some sim data
arpi.1  = rnorm(15, mean=0.15, sd = 0.01) 
arpi.7  = rnorm(15, mean=0.30, sd = 0.02)
arpi.14 = rnorm(15, mean=0.45, sd = 0.05)
arpi.30 = rnorm(15, mean=0.60, sd = 0.05)
arpi.30.plus = rnorm(15, mean=0.6+0.03*(15:1), sd = 0.1)

arpi.7[arpi.7 <= arpi.1] = arpi.1[arpi.7 <= arpi.1]
arpi.14[arpi.14 <= arpi.7] = arpi.7[arpi.14 <= arpi.7]
arpi.30[arpi.30 <= arpi.14] = arpi.14[arpi.30 <= arpi.14]
arpi.30.plus[arpi.30.plus <= arpi.30] = arpi.30[arpi.30.plus <= arpi.30]

data = data.frame(week = 1:15, arpi.1, arpi.7, arpi.14, arpi.30, arpi.30.plus)
# strata plot: stacked ARPI-by-age bands for each weekly install cohort (Fig. 1)
g.plot.strata <- function(data)
{
n = nrow(data)
m = ncol(data)

week = data$week

arpi = seq(0, max(data$arpi.30.plus), length.out = n)

plot(week, arpi, xlab = "week of install", ylab = "ARPI ($)", type = "n", xlim= c(0.5, n+3.5 ) )

for ( i in 1:n)
    {
    x =  c(i-0.5, i-0.5, i+0.5, i+0.5)
    for ( j in m:2)
        {
        kpi = data[i,j]
        kpi.n = 0
        if(j >= 3 ) {kpi.n = data[i,j-1]}
        y = c(0,kpi, kpi, 0)
        polygon(x,y, col=j)
        if (kpi - kpi.n >= 0.13) {text(x= i, y=kpi, pos=1, round(kpi, 2)  )}
        }
    }

legend("topright", col=6:2, legend = c(" >30 days", "15-30 days", "8-14 days", "2-7 days", "0-1 days"), lwd=5)
}
#### for retention stuff
# simulate installs per day and the number returning at each age for two groups
g.ret.data.maker <- function()
{
day = 1:30
installs = floor(rnorm(30, mean= 1000, sd = 100))
P = seq(-0.7, -3, length.out = 29)
P = exp(P)

data = data.frame(day, installs, matrix(0, ncol=29, nrow=30))
for(i in 1:30)
    {
    for (j in seq_len(30-i))   # seq_len() so the loop is skipped when 30-i is 0
        {
        data[i,j+2] = rbinom(1, size = installs[i] , prob = P[j])
        }
    }
data1 = data

P = seq(-0.9, -2.9, length.out = 29)
P = exp(P)

data = data.frame(day, installs, matrix(0, ncol=29, nrow=30))
for(i in 1:30)
    {
    for (j in seq_len(30-i))   # seq_len() so the loop is skipped when 30-i is 0
        {
        data[i,j+2] = rbinom(1, size = installs[i] , prob = P[j])
        }
    }

data2 = data

data = rbind(data1, data2)
data = data.frame(data,group = c(rep(0,30), rep(1,30)))

return(data)

}

# plot retention decay with 95% central beta credibility bands for the two groups (Fig. 2)
g.ret.plot <- function(data)
{
group = data$group
day = data$day
installs = data$installs
N = nrow(data)

day.u = unique(day)
group.u = unique(group)

plot(1:29, seq(0, 0.5, length.out=29), xlab = "Age (Days)", ylab = "Prob. Ret", type ="n")


R1.list = matrix(0, nrow=29, ncol=3)
R0.list = R1.list

for (j in 3:31)
    {
    # only cohorts with a non-zero retained count at this age (i.e., old enough to have data) contribute installs
    S1 = sum(data[group==1, j])
    F1 = sum(data[group==1 & data[, j] > 0, 2]) - S1

    S0 = sum(data[group==0, j])
    F0 = sum(data[group==0 & data[, j] > 0, 2]) - S0

    R1 = qbeta(c(0.025, 0.5, 0.975), S1, F1)
    R0 = qbeta(c(0.025, 0.5, 0.975), S0, F0)

    R1.list[j-2, ] = R1
    R0.list[j-2, ] = R0
    }

polygon(c(1:29, 29:1), c(R1.list[,1], rev(R1.list[,3])), col="red", border =F)

polygon(c(1:29, 29:1), c(R0.list[,1], rev(R0.list[,3])), col="blue", border =F)

lines(1:29,R1.list[,1], lty=2, col="red" )
lines(1:29,R1.list[,3], lty=2, col="red" )

legend("topright", col=c("red", "blue"), legend = c("Group 1", "Group 0"), lwd=3)

}

 

 

 

 
