Dishonest dishonesty study study

Inside science

A deep dive into a fraudulent study from Dan Ariely

Jonatan Pallesen
08-21-2021

Introduction

In 2012, Shu, Mazar, Gino, Ariely, and Bazerman published a three-study paper reporting that dishonesty can be reduced by asking people to sign a statement of honest intent before providing information (i.e., at the top of a document) rather than after providing information (i.e., at the bottom of a document). This study is quite well-known, and has gathered many citations.

Recently, the excellent blog datacolada found that this study was fraudulent. They performed a thorough analysis here. One of the columns was shown not to be genuine, but instead it was created by adding a random number o another column. And also the data was duplicated, and the duplicated rows were amusingly written in a different font.

These things are obvious from a quick look at the data. But what is less obvious is how exactly the data was manipulated in order to show the intended effect of signing at the top. I found this a rather fascinating question, and have been puzzling with it for a while. And I believe I have uncovered the procedure used, which if correct is hilariously inept.


Data detective analysis

The first thing to consider is these numbers1:

library(tidyverse)

library(scales)

library(magrittr)

library(gt)

library(patchwork)

library(pacman)

source('../../src/extra.R', echo = F, encoding="utf-8")



set.seed(1)



d <- function(df){
   df %>% gt() %>%
      tab_options(
         data_row.padding = px(0),
         table.align = "left",
         table.margin.left = px(0),
         table.border.top.style = "hidden",
         table.border.bottom.style = "hidden",
         table.font.size = 12
         ) %>%
      cols_align(align = "left") %>%
      cols_width(
         gt::everything() ~ px(100),
         c(where(is.numeric)) ~ px(70))
}





dff <- readxl::read_excel("DATA/DrivingdataAll with font.xlsx") %>% 
   mutate(distance_car1 = update_car1 - baseline_car1)



df <- dff %>% filter(font == "Calibri")



o <- df %>% filter(condition == "Sign Top")



dff %>% group_by(condition) %>% 
   summarise(
      mean_baseline = mean(baseline_car1),
      mean_update = mean(update_car1),
      mean_distance = mean(distance_car1),
      ) %>% 
   d()
condition mean_baseline mean_update mean_distance
Sign Bottom 74945.71 98568.26 23622.55
Sign Top 59945.09 86149.92 26204.83


The baseline is the distance reading when the driver receives the car, and the update is the distance on return. The distance driven is the difference between the updated value and the baseline. This is self-reported, so it’s possible to write in an amount in the update field that is lower than the true one, and thereby save some money. If signing at the top gives more honesty, then this self-reported distance driven would be higher. And it is indeed ~2,300 higher. So far so good.

But the weird thing is that the baseline values are actually a lot higher for the Sign Bottom group. This is the initial reading when they recieve the car, and should be roughly equal for the two groups. Also the updated values are a lot higher in the Sign Bottom group, where we would expect them to be slightly higher in the Sign Top group (if those drivers that sign at the top are indeed more honest.)

The important thing here is, if you are making fraudulent data, it makes no sense to make the Sign Bottom group reporting higher than the Sign Top. So why did the fraudster do it?

Looking at various attributes of the data, I believe there is only one plausible route, which includes a series of bungled steps. I will go through these in the following


Fraud Step 1

Adding a random value to Sign Bottom baseline.


We can see that this was done from two attributes in the data:


1. Sign Bottom baseline is 15,000 higher on average than Sign Top.

This is shown on the table in the previous section.


2. Sign Bottom baseline values show that they have added a random number to them.

There is a thorough explanation of this in the Datacolada post. The short version is that humans tend to like reporting round values, such as those divisible by 1000. So these numbers will be more common in the actual data. But once you add a random number to them, this attribute will disappear.


analyse <- function(df, v){
   tribble(
      ~Attribute, ~Percentage,
      "Divisible by 1000", (df %>% filter(!!sym(v) %% 1000 == 0) %>% nrow() / nrow(df)) %>% percent(accuracy = 0.1),
      "Equal to 0", (df %>% filter(!!sym(v) == 0) %>% nrow() / nrow(df)) %>% percent(accuracy = 0.1)
   )   
}



analyse(df %>% filter(condition == "Sign Top"), "baseline_car1") %>% 
   d() %>% 
   tab_header("Sign Top")
Sign Top
Attribute Percentage
Divisible by 1000 35.1%
Equal to 0 3.2%


analyse(df %>% filter(condition == "Sign Bottom"), "baseline_car1") %>% 
   d() %>% 
   tab_header("Sign Bottom")
Sign Bottom
Attribute Percentage
Divisible by 1000 5.7%
Equal to 0 0.2%


We can see that the Sign Top group has human-characteristically large number of values divisible by 1000, whereas these have mostly disappeared from Sign Bottom. (NB. Notice that they have not fully disappeared. More about this later.)

Throughout these steps I will use a recreation dataset that starts out looking like the original untampered data set. I will then apply the predicted fraudulent steps to this data set, and confirm that it looks like what we see in the fraudulent data set.

Here I will add random(0, 30000) to baseline for Sign Bottom:

o <- 
   bind_rows(
      o,
      o %>% mutate(condition = "Sign Bottom")
   )



o %<>% mutate(
   r1 = sample(0:30000, nrow(o)),
   baseline_car1_v2 = case_when(
      condition == "Sign Top" ~ baseline_car1,
      condition == "Sign Bottom" ~ baseline_car1 + r1
      ),
   id_v2 = row_number())



p1 <- df %>% 
   filter(baseline_car1 < 230000) %>% 
   ggplot(aes(x = baseline_car1, fill = condition)) +
   theme_minimal() +
   geom_histogram(alpha = 0.5, position = "identity") +
   ggeasy::easy_move_legend(to = "bottom") +
   scale_x_continuous(labels = comma) +
   labs(title = "Fraudulent data")



p2 <- o %>% 
   filter(baseline_car1_v2 < 230000) %>% 
   ggplot(aes(x = baseline_car1_v2, fill = condition)) +
   theme_minimal() +
   geom_histogram(alpha = 0.5, position = "identity") +
   ggeasy::easy_move_legend(to = "bottom") +
   scale_x_continuous(labels = comma) +
   labs(title = "Data recreation")



p1 + p2


We can see that they look quite identical, thus confirming that this step would lead to data with the observed attributes.

The reasoning for adding this to the Sign Bottom baseline makes some sense. If you keep the updated distances as they are, then increasing the bottom baseline will make the driven distances shorter for the Sign Bottom than Sign Top.

It would make more sense to increase the updated distances for Sign Top, though. Then the fraud would work in the sense that the driven distances for Sign Top would be higher, as intended, and there wouldn’t be the weird attribute that Sign Bottom were higher already at baseline which made people suspicious. But it gets worse:


Fraud Step 2

Creating the updated distance values from scratch by adding random(0, 50,000) to the baseline value

This is the most hilariously inept step performed. The fraudster seems to have a bad sense for numbers. But social scientists need to publish these kinds of studies to succeed, so he has to try and work with these numbers in an Excel sheet.

It was quite puzzling to figure out why this seemingly meaningless step was performed, but I believe I have figured out the explanation:

After adding a random number to the Sign Bottom baseline values, there will sometimes be values where the baseline value is higher than the updated value. This will result in negative values for distance driven, which is obviously non-sensical. Perhaps these values showed up in some reported summary table, which made the fraudster notice it.

And then he panicked. So he created new updated values by adding random(0, 50,000) to the baseline values. This would solve his problem with negative distances driven, since the updated values are always higher than the baseline values. However, it would also ruin the effect he had created in Fraud Step 1. Since now the difference in distance driven between Sign Top and Sign Bottom would instead be determined by these random numbers, and thus turn back to 0.

Why didn’t he here just go back to the original data set, and add to the updated values for Sign Top instead? It’s hard to understand. Perhaps he had already done a lot of work on the sheet, and it would be hard to start over from scratch. Or perhaps he had lost the original column values for Sign Bottom baseline. Also, he didn’t realize this obvious solution the first time around, so perhaps he still did not see it.

Whatever his reasoning, the data clearly shows that creating the updated distance values from scratch is what he did:

df %>% ggplot(aes(x = distance_car1)) +
   geom_histogram(boundary = 1, fill = "turquoise4", color = "black") +
   scale_x_continuous(labels = comma) +
   theme_minimal()


The values are from a random uniform distribution, and stop abrubtly at 50,000. The only way for this to happen, is if the data are generated by a uniform distribution going from 0 to 50,000.

I recreate this step in the recreation data set, and verify that it looks similar to the actual fake data:

o %<>% mutate(
   r2 = sample(0:50000, nrow(o)),
   update_car1_v2 = baseline_car1_v2 + r2,
   diff_car1_v2 = update_car1_v2 - baseline_car1_v2)



p1 <- df %>% 
   filter(update_car1 < 280000) %>% 
   ggplot(aes(x = update_car1, fill = condition)) +
   theme_minimal() +
   ggeasy::easy_move_legend(to = "bottom") +
   geom_histogram(alpha = 0.5, position = "identity") +
   scale_x_continuous(labels = comma) +
   labs(title = "Fraudulent data")



p2 <- o %>% 
   filter(update_car1_v2 < 280000) %>% 
   ggplot(aes(x = update_car1_v2, fill = condition)) +
   theme_minimal() +
   geom_histogram(alpha = 0.5, position = "identity") +
   ggeasy::easy_move_legend(to = "bottom") +
   scale_x_continuous(labels = comma) +
   labs(title = "Data recreation")



p1 + p2


Fraud Step 3

Reassign labels for a third of the data set, so that Sign Top gets higher values of distance driven

After his amusingly dumbass Fraud Step 2, he has solved the problem of the negative distances driven, but he introduced a new one: There is no longer the desired effect that the Sign Top distance driven is higher.

He could perhaps solve this problem by adding some random value to Sign Top. However, we can see that this is not what happened, since none of distance driven values exceed 50,000. And given that the distance driven values were defined by random(0, 50,000) in Fraud Step 2, adding some further value to this would bring some of the values above 50,000.

Also if we look at the histogram, the difference is not caused solely by a skewing of the Sign Top values, but there is an equal skewing of the Sign Bottom values in the opposite direction.

df %>% ggplot(aes(x = distance_car1, fill = condition)) +
   theme_minimal() +
   geom_histogram(position = "identity", alpha = 0.5, boundary = 1, binwidth = 2000)


So instead I believe he did something very close to this:

The interesting thing about this is that it is actually quite a clever step. If he had done this from the beginning, not only would the study have shown the desired result, it would have been very difficult to detect fraud in the study. There would be none of the easily detectable signs of fraud, such as with numbers divisible by thousand being rare. It is somewhat surprising that he used such a clever approach, after the bungling in the first two steps. Perhaps he sat down and thought things through more properly, or perhaps he got some help from a person who is better with numbers.

Unfortunately for the fraudster, he for some reason kept the manipulations performed in Fraud Steps 1 and 2, even though they are not necessary to achieve the desired result, but still make the data set look suspicious.


Data detective analysis

Figuring out this step was the hardest part of the puzzle for me. There are multiple attributes of the data set that have to add up, most importantly:

So this leaves only this way to solve the puzzle of how the numbers were generated: First some random amount were added to Sign Bottom. Then some proportion of the labels were shuffled around.

As seen above, Sign Top has 35% of baseline values divisible by 1000, whereas Sign Bottom has 5.7%. Since 35 / 6 = 5.8, this means that about 1 / 6 of the Sign Tops would have been changed to Sign Bottom, implying that about a third of the data set would have the labels shuffled

Let’s us try and recreate using the recreation data set. Instead of adding random(0, 30,000) to Sign Bottom in step 2, I will add random(0, 35,000), since some of them will be shuffled back to Sign Top.


set.seed(100)



o <- df %>% filter(condition == "Sign Top")

o <- 
   bind_rows(
      o,
      o %>% mutate(condition = "Sign Bottom")
   )



o %<>% mutate(
   r1 = sample(0:35000, nrow(o)),
   baseline_car1_v2 = case_when(
      condition == "Sign Top" ~ baseline_car1,
      condition == "Sign Bottom" ~ baseline_car1 + r1
      ),
   id_v2 = row_number())



o %<>% mutate(
   r2 = sample(0:50000, nrow(o)),
   update_car1_v2 = baseline_car1_v2 + r2,
   diff_car1_v2 = update_car1_v2 - baseline_car1_v2)



o1 <- o %>% sample_n(nrow(o) / 5, replace = F)

o2 <- o %>% anti_join(o1, by = "id_v2") %>% 
   mutate(condition_v2 = condition)

o1 %<>% arrange(diff_car1_v2) %>% 
   mutate(i = row_number()) %>% 
   mutate(condition_v2 = ifelse(i > 1140, "Sign Top", "Sign Bottom"))



oo <- bind_rows(o1, o2)



oo %>% group_by(condition_v2) %>% 
   summarise(
      mean_baseline = mean(baseline_car1_v2),
      mean_distance= mean(diff_car1_v2),
      ) %>% 
   d() %>% 
   tab_header("Recreation data set mean values")
Recreation data set mean values
condition_v2 mean_baseline mean_distance
Sign Bottom 75953.51 23859.22
Sign Top 59256.92 26262.11


analyse(oo %>% filter(condition_v2 == "Sign Bottom"), "baseline_car1_v2") %>% 
   d() %>% 
   tab_header("Sign Bottom recreation")
Sign Bottom recreation
Attribute Percentage
Divisible by 1000 5.0%
Equal to 0 0.4%


After simulating this Fraud Step, the recreation data set looks very close to the fraudulent data set, on all these characteristics. (Using different random seeds can create slight variations on the exact values.)


Fraud Step 4

Duplicate all entries, adding random(1, 1,000) to the baseline.

This turned out to be perhaps the most humorous step, because he somehow managed to use a different font for the duplicated rows. The datacolada post goes through this in good detail.

We can deduce that this was the last step performed, because the copied values include all the manipulation of the previous steps.

It is a little bit puzzling to me why he performed this step. The p-value without the duplication would already be p < 10^-13 without duplication, so it was not needed to strengthen the result.

It is possible that he just thought that a larger sample size was more impressive, and he was already in for a penny, so why not in for a pound. However, this seems unlikely to me. For one thing, the car insurance company would know how many entries they sent, and if they happened to look at the paper, they may be confused as to why the number was twice as big.

So I think it is more likely that at some step, perhaps from the very beginning, the fraudster started using only half the data set. Then after he had implemented the good solution in Fraud Step 3, he did not want to do these steps again on the other half. So instead he copied what he had, to get to the correct sample size.


Who committed the fraud?

Dan Ariely was the one who was sent the data from the car insurance company, and is the creator of the Excel document containing the fraudulent data. So, it can’t be any of the co-authors.

Is it possible that someone at the car insurance company faked the data, and Dan Ariely simple received this fake data? I would say that it is not.

It could be imagined that some person at the car insurance company would perform Fraud Step #2 and Fraud Step #4. Perhaps they were too lazy to gather the data, so would just generate some fake data instead.

But it is inconceivable that they would perform Fraud Step #1 and Fraud Step #3. These steps specifically have the purpose to make the research hypothesis true. And the car insurance company would have no incentive to do this.


Perspective

This is a case of fraud that is completely bungled by ineptitude. As a result it had signs of fraud that were obvious from just looking at the most basic summary statistics of the data. And still, it was only discovered after 9 years, after someone attempted a replication. As I went through above, the fraudster had multiple obvious opportunities to manipulate the data in a way that would have likely never been discovered. In fact, it seems that the only reason it was discovered, was because of traits the data set acquired after puzzlingly unnecessary fraudulent manipulations.

This makes it seem likely that there is a lot more fraud than most people expect.

I would suggest that no study should be trusted if it doesn’t release the data. I am not having any illusions about the good will of journals here. I am saying that we as a scientific community should not trust any study without open data, regardless of which journal it was published in.

Also, I think we should look at all old Dan Ariely studies. Fraudulent people probably commit fraud more than once, and given the level of mathematical competency showed here, we could expect it to be not too hard to uncover.

Dan Ariely made a reply to the accusations. (Which also uses two different fonts!)


Footnotes

[1] Throughout I will analyze the numbers for car1 only, since the same is done for the other cars. Also I will show data for the Calibri numbers only most of the time, since they are the original values.