A deep dive into a fraudulent study from Dan Ariely

In 2012, Shu, Mazar, Gino, Ariely, and Bazerman published a three-study paper reporting that dishonesty can be reduced by asking people to sign a statement of honest intent before providing information (i.e., at the top of a document) rather than after providing information (i.e., at the bottom of a document). This study is quite well-known, and has gathered many citations.

Recently, the excellent blog Datacolada found that this study was fraudulent. They performed a thorough analysis here. One of the columns was shown not to be genuine: it was created by adding a random number to another column. The data was also duplicated, and the duplicated rows were, amusingly, written in a different font.

These things are obvious from a quick look at the data. What is less obvious is how exactly the data was manipulated to show the intended effect of signing at the top. I found this a rather fascinating question, and have been puzzling over it for a while. I believe I have uncovered the procedure used, which, if correct, is hilariously inept.

The first things to consider are these numbers^{1}:

```
library(tidyverse)
library(scales)
library(magrittr)
library(gt)
library(patchwork)
library(pander)
source('../../src/extra.R', echo = F, encoding="utf-8")
set.seed(1)
d <- function(df){
  df %>% gt() %>%
    tab_options(
      data_row.padding = px(0),
      table.font.size = 13,
      table.align = "left",
      table.margin.left = px(0),
      table.border.top.style = "hidden",
      table.border.bottom.style = "hidden"
    ) %>%
    cols_align(align = "left") %>%
    cols_width(
      everything() ~ px(100),
      c(where(is.numeric)) ~ px(70))
}

dff <- readxl::read_excel("DATA/DrivingdataAll with font.xlsx") %>%
  mutate(distance_car1 = update_car1 - baseline_car1)

df <- dff %>% filter(font == "Calibri")

dff %>% group_by(condition) %>%
  summarise(
    mean_baseline = mean(baseline_car1),
    mean_update = mean(update_car1),
    mean_distance = mean(distance_car1)
  ) %>%
  d()
```

condition | mean_baseline | mean_update | mean_distance |
---|---|---|---|
Sign Bottom | 74945.71 | 98568.26 | 23622.55 |
Sign Top | 59945.09 | 86149.92 | 26204.83 |

The baseline is the distance reading when the driver receives the car, and the update is the reading on return. The distance driven is the difference between the updated value and the baseline. This is self-reported, so it’s possible to write an amount in the update field that is lower than the true one, and thereby save some money. If signing at the top produces more honesty, then this self-reported distance driven should be higher. And it is indeed ~2,600 higher. So far so good.
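As a quick sanity check on the arithmetic (a standalone Python snippet, with the means copied from the table above; it relies only on the fact that the mean of a difference equals the difference of the means):

```python
# Group means copied from the table above.
rows = {
    "Sign Bottom": {"baseline": 74945.71, "update": 98568.26, "distance": 23622.55},
    "Sign Top":    {"baseline": 59945.09, "update": 86149.92, "distance": 26204.83},
}

# Since distance = update - baseline row by row, the group means must satisfy
# mean_distance == mean_update - mean_baseline.
for r in rows.values():
    assert abs((r["update"] - r["baseline"]) - r["distance"]) < 0.01

# The headline effect: Sign Top self-reports more distance driven...
effect = rows["Sign Top"]["distance"] - rows["Sign Bottom"]["distance"]
print(round(effect, 2))        # 2582.28

# ...but the anomaly: Sign Bottom's baseline is ~15,000 higher.
baseline_gap = rows["Sign Bottom"]["baseline"] - rows["Sign Top"]["baseline"]
print(round(baseline_gap, 2))  # 15000.62
```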

But the weird thing is that the baseline values are actually a lot higher for the Sign Bottom group. This is the initial reading when they receive the car, and it should be roughly equal for the two groups. The updated values are also a lot higher in the Sign Bottom group, where we would expect them to be slightly higher in the Sign Top group (if the drivers who sign at the top are indeed more honest).

The important thing here is that if you are fabricating data, it makes no sense to have the Sign Bottom group report higher values than Sign Top. So why did the fraudster do it?

Looking at various attributes of the data, I believe there is only one plausible route, which includes a series of bungled steps. I will go through these in the following. (Note that all the fraudulent aspects were documented in the Datacolada post and the appendix. This post is about figuring out how the fraud was performed, and about recreating the steps and their effects in a synthetic data set.)

*Adding a random value to (most of) the Sign Bottom baseline values.*

We can see that this was done from two attributes in the data:

1. Sign Bottom baseline is 15,000 higher on average than Sign Top.

This is shown in the table in the previous section.

2. Sign Bottom baseline values show signs of having had a random number added to them.

There is a thorough explanation of this in the Datacolada post. The short version is that humans tend to report round values, such as those divisible by 1000, so these numbers are more common in genuine data. But once you add a random number to them, this signature disappears.
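To see how diagnostic this signature is, here is a small standalone simulation (Python rather than the post's R; the 35% share of round reports and the reading range are assumptions, loosely matched to the Sign Top column):

```python
import random

random.seed(0)

# Simulate 10,000 human-reported odometer readings where ~35% are round
# (divisible by 1000), roughly like the untampered Sign Top column.
readings = []
for _ in range(10_000):
    if random.random() < 0.35:
        readings.append(random.randrange(0, 150) * 1000)   # a round report
    else:
        readings.append(random.randrange(0, 150_000))      # an arbitrary report

def share_divisible_by_1000(xs):
    return sum(x % 1000 == 0 for x in xs) / len(xs)

print(share_divisible_by_1000(readings))   # roughly 0.35

# Add uniform noise, as was apparently done to the Sign Bottom baselines:
# almost no round values survive.
tampered = [x + random.randrange(0, 33_001) for x in readings]
print(share_divisible_by_1000(tampered))   # roughly 0.001
```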

```
analyse <- function(df, v){
  tribble(
    ~Attribute, ~Percentage,
    "Divisible by 1000", (df %>% filter(!!sym(v) %% 1000 == 0) %>% nrow() / nrow(df)) %>% percent(accuracy = 0.1),
    "Equal to 0", (df %>% filter(!!sym(v) == 0) %>% nrow() / nrow(df)) %>% percent(accuracy = 0.1)
  )
}

analyse(df %>% filter(condition == "Sign Top"), "baseline_car1") %>%
  d() %>%
  tab_header("Sign Top")
```

**Sign Top**

Attribute | Percentage |
---|---|
Divisible by 1000 | 35.1% |
Equal to 0 | 3.2% |

```
analyse(df %>% filter(condition == "Sign Bottom"), "baseline_car1") %>%
  d() %>%
  tab_header("Sign Bottom")
```

**Sign Bottom**

Attribute | Percentage |
---|---|
Divisible by 1000 | 5.7% |
Equal to 0 | 0.2% |

We can see that the Sign Top group has the characteristically human abundance of values divisible by 1000, whereas these have mostly disappeared from Sign Bottom.

Throughout these steps I will use a recreation data set that starts out looking like the original, untampered data. I will then apply the suspected fraudulent steps to this data set, and confirm that the result looks like the actual fraudulent data.

Here I will add random(0, 33000) to 90% of the baseline for Sign Bottom:

```
o_half <- dff %>% filter(condition == "Sign Top", font == "Calibri")

o <- bind_rows(
    o_half,
    o_half %>% mutate(condition = "Sign Bottom")
  ) %>%
  mutate(font = "Calibri")

sample1 <- sample(
  o %>% filter(condition == "Sign Bottom") %>% pull(id),
  nrow(o %>% filter(condition == "Sign Bottom")) * 0.9,
  replace = F)

o %<>% mutate(
  r1 = sample(0:33000, nrow(o), replace = T),
  baseline_car1 = ifelse(condition == "Sign Bottom" & id %in% sample1, baseline_car1 + r1, baseline_car1),
  id_v2 = row_number())

p1 <- df %>%
  filter(baseline_car1 < 230000) %>%
  ggplot(aes(x = baseline_car1, fill = condition)) +
  theme_minimal() +
  geom_histogram(alpha = 0.5, position = "identity") +
  ggeasy::easy_move_legend(to = "bottom") +
  scale_x_continuous(labels = comma) +
  labs(title = "Fraudulent data")

p2 <- o %>%
  filter(baseline_car1 < 230000) %>%
  ggplot(aes(x = baseline_car1, fill = condition)) +
  theme_minimal() +
  geom_histogram(alpha = 0.5, position = "identity") +
  ggeasy::easy_move_legend(to = "bottom") +
  scale_x_continuous(labels = comma) +
  labs(title = "Data recreation")

p1 + p2
```

We can see that they look near-identical, confirming that this step produces data with the observed attributes.

The reasoning for adding this to the Sign Bottom baseline makes some sense: if you keep the updated distances as they are, then increasing the Sign Bottom baselines makes the driven distances shorter for Sign Bottom than for Sign Top.

It would have made more sense to increase the updated distances for Sign Top, though. The fraud would still work, in that the driven distances for Sign Top would be higher as intended, and there wouldn’t be the suspicious attribute, which made people suspicious, that Sign Bottom was already higher at baseline.

*Copy all the entries to a different Excel spreadsheet, and add a small amount to each copy*

This is where he, amusingly, used a different font in the other spreadsheet, so that the copies are easily distinguishable by their font. The Datacolada post shows that he added random(1, 1000) to the baseline of each copy.

This is the step that is most difficult for me to understand. Presumably the car insurance company would know how many rows of observations they sent. And if he duplicated the rows, a higher row count would be listed in the paper, so anyone from the company who happened to read it would be confused by the inflated number.

In any case, the fact that this step was performed is not in doubt. It must also have been performed between Fraud Step 1 and Fraud Step 3, because the manipulation from Fraud Step 1 is present in the duplicates, while that from Fraud Step 3 is not. The remaining fraud steps were performed in both Excel sheets.

*Creating the updated distance values from scratch by adding random(0, 50,000) to the baseline value*

This is the most hilariously inept of the steps. The fraudster seems to have a bad sense for numbers, but social scientists need to publish these kinds of studies to succeed, so he had to try to work with the numbers in an Excel sheet anyway.

It was quite puzzling to figure out why this seemingly pointless step was performed, but I believe I have the explanation:

After adding a random number to the Sign Bottom baseline values, there will sometimes be rows where the baseline value is higher than the updated value. This results in negative values for distance driven, which is obviously nonsensical. Perhaps these values showed up in some summary table, and that is how the fraudster noticed.

And then he panicked. He created new updated values by adding random(0, 50,000) to the baseline values. This solved his problem with negative distances driven, since the updated values are now always at least as high as the baselines. However, it also ruined the effect he had created in Fraud Step 1: the difference in distance driven between Sign Top and Sign Bottom is now determined by these random numbers, and so falls back to ~0.
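A standalone simulation of this (in Python, with made-up group sizes and baseline ranges, using the random(0, 33000) and random(0, 50000) ranges from the steps above) shows both consequences: the negative distances disappear, and the group effect collapses:

```python
import random

random.seed(1)

n = 5_000
# Hypothetical true baselines, similar in both groups.
baseline_top    = [random.randrange(0, 150_000) for _ in range(n)]
baseline_bottom = [random.randrange(0, 150_000) for _ in range(n)]

# Fraud Step 1: inflate the Sign Bottom baselines.
baseline_bottom = [b + random.randrange(0, 33_001) for b in baseline_bottom]

# Fraud Step 3: regenerate the update column as baseline + random(0, 50000).
update_top    = [b + random.randrange(0, 50_001) for b in baseline_top]
update_bottom = [b + random.randrange(0, 50_001) for b in baseline_bottom]

dist_top    = [u - b for u, b in zip(update_top, baseline_top)]
dist_bottom = [u - b for u, b in zip(update_bottom, baseline_bottom)]

# No negative distances driven anymore...
assert min(dist_top) >= 0 and min(dist_bottom) >= 0

# ...but both groups' distances are now just draws from U(0, 50000),
# so the manufactured group difference collapses back toward 0.
mean_top = sum(dist_top) / n
mean_bottom = sum(dist_bottom) / n
print(round(mean_top), round(mean_bottom))  # both near 25,000
```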

Why didn’t he just go back to the original data set and add to the updated values for Sign Top instead? It’s hard to understand. Perhaps he had already done a lot of work on the sheet and it would have been hard to start over. Or perhaps he had lost the original Sign Bottom baseline values. He also failed to spot this obvious solution the first time around, so perhaps he still didn’t see it.

Whatever his reasoning, the data clearly shows that creating the updated distance values from scratch is what he did:

```
df %>% ggplot(aes(x = distance_car1)) +
  geom_histogram(boundary = 1, fill = "turquoise4", color = "black") +
  scale_x_continuous(labels = comma) +
  theme_minimal()
```

The values follow a uniform distribution and stop abruptly at 50,000. The only way for this to happen is if the data were generated from a uniform distribution running from 0 to 50,000.

I recreate this step in the recreation data set, and verify that it looks similar to the actual fake data:

```
o %<>% mutate(
  r2 = sample(0:50000, nrow(o)),
  update_car1 = baseline_car1 + r2,
  distance_car1 = update_car1 - baseline_car1)

p1 <- df %>%
  filter(update_car1 < 280000) %>%
  ggplot(aes(x = update_car1, fill = condition)) +
  theme_minimal() +
  ggeasy::easy_move_legend(to = "bottom") +
  geom_histogram(alpha = 0.5, position = "identity") +
  scale_x_continuous(labels = comma) +
  labs(title = "Fraudulent data")

p2 <- o %>%
  filter(update_car1 < 280000) %>%
  ggplot(aes(x = update_car1, fill = condition)) +
  theme_minimal() +
  geom_histogram(alpha = 0.5, position = "identity") +
  ggeasy::easy_move_legend(to = "bottom") +
  scale_x_continuous(labels = comma) +
  labs(title = "Data recreation")

p1 + p2
```

We can see that they look quite similar, except that Sign Bottom sits a little higher in the data recreation. This is explained by the next step:

*Reassign labels for a small subset of the data set, so that Sign Top gets higher values of distance driven*

After his amusingly dumb Fraud Step 3, he has solved the problem of negative distances driven, but he has introduced a new one: the desired effect, that Sign Top distance driven is higher, is gone.

He could perhaps have solved this problem by adding some random value to the Sign Top distances. However, we can see that this is not what happened, since none of the distance driven values exceed 50,000. Given that the distance driven values were defined by random(0, 50,000) in Fraud Step 3, adding anything further would push some of them above 50,000.

Also if we look at the histogram, the difference is not caused solely by a skewing of the Sign Top values, but there is an equal skewing of the Sign Bottom values in the opposite direction.

```
df %>% ggplot(aes(x = distance_car1, fill = condition)) +
  theme_minimal() +
  geom_histogram(position = "identity", alpha = 0.5, boundary = 1, binwidth = 2000)
```

So instead I believe he did something very close to this:

- Take a small subset of the data set.
- Sort it from low to high distance driven.
- Assign Sign Bottom to the lower half of this subset, and Sign Top to the upper half.

The interesting thing is that this is actually quite a clever step. Had he done this from the beginning, the study would not only have shown the desired result, the fraud would also have been very difficult to detect. There would be none of the easily detectable signs, such as numbers divisible by 1000 being rare. It is somewhat surprising that he used such a clever approach after the earlier bungling. Perhaps he sat down and thought things through more carefully, or perhaps he got help from someone who is better with numbers.
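A minimal sketch of this sort-and-split relabeling (Python, with hypothetical sizes: 8,000 rows and a 1/8 subset) shows how it manufactures a group difference out of identically distributed values:

```python
import random

random.seed(2)

n = 8_000
# Both conditions start as draws from the same U(0, 50000) distance distribution,
# so their group means are initially ~equal.
distances = [random.randrange(0, 50_001) for _ in range(n)]
labels = ["Sign Top" if i % 2 else "Sign Bottom" for i in range(n)]
rows = list(zip(distances, labels))

def group_mean(rows, label):
    vals = [d for d, l in rows if l == label]
    return sum(vals) / len(vals)

# Take a 1/8 subset, sort it by distance driven, and hand the lower half
# "Sign Bottom" labels and the upper half "Sign Top" labels.
k = n // 8
idx = random.sample(range(n), k)
by_distance = sorted(idx, key=lambda i: rows[i][0])
for rank, i in enumerate(by_distance):
    new_label = "Sign Bottom" if rank < k // 2 else "Sign Top"
    rows[i] = (rows[i][0], new_label)

gap = group_mean(rows, "Sign Top") - group_mean(rows, "Sign Bottom")
print(round(gap))  # a gap of a few thousand, conjured from identical distributions
```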

Unfortunately for the fraudster, he for some reason kept the manipulations from Fraud Steps 1-3, even though they are not necessary for the desired result and make the data set look suspicious.

Figuring out this step was the hardest part of the puzzle for me. There are multiple attributes of the data set that have to add up, most importantly:

- The label assignment has to happen *after* the distance driven values are generated. We can see this because distance driven follows a uniform distribution, which is hard to arrive at in any other way.
- Random values have been added to the Sign Bottom baselines. We can see this because they are about 15,000 higher than the Sign Top baselines.
- But not to *all* of the Sign Bottom rows, since 5.7% of them are divisible by 1000; if a random number had been added to all of them, only ~0.1% would be. There is also a small number of 0 values, which would not be present if a positive number had been added to every row.
- The duplicated rows have identical condition labels most of the time, but occasionally they differ.

So this leaves only one way to solve the puzzle of how the numbers were generated: first, Fraud Steps 1-3 were performed, and then the labels were rearranged separately in the two Excel sheets.

Let’s try to recreate this using the recreation data set:

```
o_half <- dff %>% filter(condition == "Sign Top", font == "Calibri")

o <- bind_rows(
    o_half,
    o_half %>% mutate(condition = "Sign Bottom")
  ) %>%
  mutate(font = "Calibri")

sample1 <- sample(
  o %>% filter(condition == "Sign Bottom") %>% pull(id),
  nrow(o %>% filter(condition == "Sign Bottom")) * 0.9,
  replace = F)

o_plus <- o %>% mutate(
  r1 = sample(0:40000, nrow(o), replace = T),
  baseline_car1 = ifelse(condition == "Sign Bottom" & id %in% sample1, baseline_car1 + r1, baseline_car1),
  id_v2 = row_number())

dupe <- o_plus %>%
  mutate(
    r3 = sample(1:1000, nrow(o_plus), replace = T),
    baseline_car1 = baseline_car1 + r3,
    font = "Cambria"
  )

o_plus2 <- o_plus %>% mutate(
  r2 = sample(0:50000, nrow(o_plus), replace = T),
  update_car1 = baseline_car1 + r2,
  distance_car1 = update_car1 - baseline_car1)

o1 <- o_plus2 %>% sample_n(nrow(o) / 8, replace = F)
o2 <- o_plus2 %>% anti_join(o1, by = "id_v2")

o1 %<>% arrange(distance_car1) %>%
  mutate(g = ntile(distance_car1, 2)) %>%
  mutate(condition = ifelse(g == 2, "Sign Top", "Sign Bottom"))

oo <- bind_rows(o1, o2)

dupe_plus2 <- dupe %>% mutate(
  r2 = sample(0:50000, nrow(dupe), replace = T),
  update_car1 = baseline_car1 + r2,
  distance_car1 = update_car1 - baseline_car1)

d1 <- dupe_plus2 %>% sample_n(nrow(dupe_plus2) / 8, replace = F)
d2 <- dupe_plus2 %>% anti_join(d1, by = "id_v2")

d1 %<>% arrange(distance_car1) %>%
  mutate(g = ntile(distance_car1, 2)) %>%
  mutate(condition = ifelse(g == 2, "Sign Top", "Sign Bottom"))

dd <- bind_rows(d1, d2)

synth <- bind_rows(dd, oo)

synth %>% group_by(condition) %>%
  summarise(
    mean_baseline = mean(baseline_car1),
    mean_distance = mean(distance_car1)
  ) %>%
  d() %>%
  tab_header("Recreation data set mean values")
```

**Recreation data set mean values**

condition | mean_baseline | mean_distance |
---|---|---|
Sign Bottom | 76703.02 | 23305.74 |
Sign Top | 61274.58 | 26435.05 |

The recreated data set has the same traits: the mean baseline is ~15k higher for Sign Bottom, while the mean distance driven is almost 3k higher for Sign Top.

```
get_stat <- function(df, condition1, font1, trait){
  if (font1 == "Both"){
    m <- df %>% filter(condition == condition1)
  } else {
    m <- df %>% filter(font == font1, condition == condition1)
  }
  if (trait == "Divisible by 1000"){n <- m %>% filter(baseline_car1 %% 1000 == 0) %>% nrow()}
  if (trait == "Divisible by 100"){n <- m %>% filter(baseline_car1 %% 100 == 0) %>% nrow()}
  if (trait == "Divisible by 10"){n <- m %>% filter(baseline_car1 %% 10 == 0) %>% nrow()}
  if (trait == "Equal to 0"){n <- m %>% filter(baseline_car1 == 0) %>% nrow()}
  (n / nrow(m)) %>% percent(accuracy = 0.01)
}
tribble(
~`Condition`, ~Font, ~Attribute, ~`Excel sheet`, ~Recreation,
"Sign Top", "Cambria", "Divisible by 1000",
get_stat(dff, "Sign Top", "Cambria", "Divisible by 1000"),
get_stat(synth, "Sign Top", "Cambria", "Divisible by 1000"),
"Sign Top", "Cambria", "Divisible by 100",
get_stat(dff, "Sign Top", "Cambria", "Divisible by 100"),
get_stat(synth, "Sign Top", "Cambria", "Divisible by 100"),
"Sign Top", "Cambria", "Divisible by 10",
get_stat(dff, "Sign Top", "Cambria", "Divisible by 10"),
get_stat(synth, "Sign Top", "Cambria", "Divisible by 10"),
"Sign Top", "Cambria", "Equal to 0",
get_stat(dff, "Sign Top", "Cambria", "Equal to 0"),
get_stat(synth, "Sign Top", "Cambria", "Equal to 0"),
"Sign Top", "Calibri", "Divisible by 1000",
get_stat(dff, "Sign Top", "Calibri", "Divisible by 1000"),
get_stat(synth, "Sign Top", "Calibri", "Divisible by 1000"),
"Sign Top", "Calibri", "Divisible by 100",
get_stat(dff, "Sign Top", "Calibri", "Divisible by 100"),
get_stat(synth, "Sign Top", "Calibri", "Divisible by 100"),
"Sign Top", "Calibri", "Divisible by 10",
get_stat(dff, "Sign Top", "Calibri", "Divisible by 10"),
get_stat(synth, "Sign Top", "Calibri", "Divisible by 10"),
"Sign Top", "Calibri", "Equal to 0",
get_stat(dff, "Sign Top", "Calibri", "Equal to 0"),
get_stat(synth, "Sign Top", "Calibri", "Equal to 0"),
"Sign Bottom", "Cambria", "Divisible by 1000",
get_stat(dff, "Sign Bottom", "Cambria", "Divisible by 1000"),
get_stat(synth, "Sign Bottom", "Cambria", "Divisible by 1000"),
"Sign Bottom", "Cambria", "Divisible by 100",
get_stat(dff, "Sign Bottom", "Cambria", "Divisible by 100"),
get_stat(synth, "Sign Bottom", "Cambria", "Divisible by 100"),
"Sign Bottom", "Cambria", "Divisible by 10",
get_stat(dff, "Sign Bottom", "Cambria", "Divisible by 10"),
get_stat(synth, "Sign Bottom", "Cambria", "Divisible by 10"),
"Sign Bottom", "Cambria", "Equal to 0",
get_stat(dff, "Sign Bottom", "Cambria", "Equal to 0"),
get_stat(synth, "Sign Bottom", "Cambria", "Equal to 0"),
"Sign Bottom", "Calibri", "Divisible by 1000",
get_stat(dff, "Sign Bottom", "Calibri", "Divisible by 1000"),
get_stat(synth, "Sign Bottom", "Calibri", "Divisible by 1000"),
"Sign Bottom", "Calibri", "Divisible by 100",
get_stat(dff, "Sign Bottom", "Calibri", "Divisible by 100"),
get_stat(synth, "Sign Bottom", "Calibri", "Divisible by 100"),
"Sign Bottom", "Calibri", "Divisible by 10",
get_stat(dff, "Sign Bottom", "Calibri", "Divisible by 10"),
get_stat(synth, "Sign Bottom", "Calibri", "Divisible by 10"),
"Sign Bottom", "Calibri", "Equal to 0",
get_stat(dff, "Sign Bottom", "Calibri", "Equal to 0"),
get_stat(synth, "Sign Bottom", "Calibri", "Equal to 0"),
"Sign Bottom", "Both", "Divisible by 1000",
get_stat(dff, "Sign Bottom", "Both", "Divisible by 1000"),
get_stat(synth, "Sign Bottom", "Both", "Divisible by 1000"),
"Sign Bottom", "Both", "Divisible by 100",
get_stat(dff, "Sign Bottom", "Both", "Divisible by 100"),
get_stat(synth, "Sign Bottom", "Both", "Divisible by 100"),
"Sign Bottom", "Both", "Divisible by 10",
get_stat(dff, "Sign Bottom", "Both", "Divisible by 10"),
get_stat(synth, "Sign Bottom", "Both", "Divisible by 10"),
"Sign Bottom", "Both", "Equal to 0",
get_stat(dff, "Sign Bottom", "Both", "Equal to 0"),
get_stat(synth, "Sign Bottom", "Both", "Equal to 0"),
"Sign Top", "Both", "Divisible by 1000",
get_stat(dff, "Sign Top", "Both", "Divisible by 1000"),
get_stat(synth, "Sign Top", "Both", "Divisible by 1000"),
"Sign Top", "Both", "Divisible by 100",
get_stat(dff, "Sign Top", "Both", "Divisible by 100"),
get_stat(synth, "Sign Top", "Both", "Divisible by 100"),
"Sign Top", "Both", "Divisible by 10",
get_stat(dff, "Sign Top", "Both", "Divisible by 10"),
get_stat(synth, "Sign Top", "Both", "Divisible by 10"),
"Sign Top", "Both", "Equal to 0",
get_stat(dff, "Sign Top", "Both", "Equal to 0"),
get_stat(synth, "Sign Top", "Both", "Equal to 0")
) %>%
  d()
```

Condition | Font | Attribute | Excel sheet | Recreation |
---|---|---|---|---|
Sign Top | Cambria | Divisible by 1000 | 0.00% | 0.15% |
Sign Top | Cambria | Divisible by 100 | 1.06% | 1.11% |
Sign Top | Cambria | Divisible by 10 | 9.78% | 9.75% |
Sign Top | Cambria | Equal to 0 | 0.00% | 0.00% |
Sign Top | Calibri | Divisible by 1000 | 35.12% | 33.22% |
Sign Top | Calibri | Divisible by 100 | 44.56% | 42.22% |
Sign Top | Calibri | Divisible by 10 | 52.87% | 50.76% |
Sign Top | Calibri | Equal to 0 | 3.22% | 3.10% |
Sign Bottom | Cambria | Divisible by 1000 | 0.15% | 0.15% |
Sign Bottom | Cambria | Divisible by 100 | 1.29% | 1.14% |
Sign Bottom | Cambria | Divisible by 10 | 10.21% | 10.08% |
Sign Bottom | Cambria | Equal to 0 | 0.00% | 0.00% |
Sign Bottom | Calibri | Divisible by 1000 | 5.75% | 5.44% |
Sign Bottom | Calibri | Divisible by 100 | 10.83% | 7.37% |
Sign Bottom | Calibri | Divisible by 10 | 22.47% | 15.53% |
Sign Bottom | Calibri | Equal to 0 | 0.21% | 0.50% |
Sign Bottom | Both | Divisible by 1000 | 2.94% | 2.79% |
Sign Bottom | Both | Divisible by 100 | 6.05% | 4.25% |
Sign Bottom | Both | Divisible by 10 | 16.33% | 12.80% |
Sign Bottom | Both | Equal to 0 | 0.11% | 0.25% |
Sign Top | Both | Divisible by 1000 | 17.60% | 16.69% |
Sign Top | Both | Divisible by 100 | 22.86% | 21.68% |
Sign Top | Both | Divisible by 10 | 31.37% | 30.27% |
Sign Top | Both | Equal to 0 | 1.61% | 1.55% |

We see that these numbers are all similar between the original data set and the recreated one. (Note that the values are slightly smaller in the recreation. This makes sense, since the recreation data set is based on the Sign Top Calibri rows of the original data, which are slightly diluted by the label rearrangement step.)

As mentioned, in the original Excel sheets the duplicated rows sometimes have different labels. It is hard to say exactly how often, since it is not possible to identify all duplicated rows. If we take subsets of the data where the duplicates are easily identifiable and check how many of them have mismatched labels, it seems to be something like 5-15%. Checking for mismatched labels in the recreation data set gives a similar value:

```
inner_join(dd %>% select(condition_o = condition, id_v2),
           oo %>% select(condition_d = condition, id_v2), by = "id_v2") %>%
  mutate(identical_labels = condition_o == condition_d) %>%
  count(identical_labels) %>%
  d()
```

identical_labels | n |
---|---|
FALSE | 800 |
TRUE | 6040 |

Dan Ariely was the one who was sent the data by the car insurance company, and he is the creator of the Excel document containing the fraudulent data. So the fraud cannot have been committed by any of the co-authors.

Is it possible that someone at the car insurance company faked the data, and Dan Ariely simply received it already faked? I would say it is not.

It could be imagined that some person at the car insurance company performed Fraud Step #2 and Fraud Step #3. Perhaps they were too lazy to gather the data, and just generated some fake data instead.

But it is inconceivable that they would perform Fraud Step #1, and even more so Fraud Step #4. These steps serve specifically to make the research hypothesis come out true, and the car insurance company would have no incentive to do that.

This is a case of fraud completely bungled by ineptitude. As a result, it bore signs of fraud that were obvious from the most basic summary statistics. And still, it was only discovered nine years later, after someone attempted a replication. As I went through above, the fraudster had multiple obvious opportunities to manipulate the data in ways that would likely never have been discovered. In fact, it seems the only reason it was discovered is the traits the data set acquired through puzzlingly unnecessary manipulations.

This makes it seem likely that there is *a lot* more fraud than most people expect.

I would suggest that no study should be trusted if it doesn’t release its data. I have no illusions about the goodwill of journals here; I am saying that we as a scientific community should not trust any study without open data, regardless of which journal it was published in.

Also, I think we should look into all of Dan Ariely’s older studies. People who commit fraud probably do it more than once, and given the level of mathematical competence shown here, it should not be too hard to uncover.

Dan Ariely has made a reply to the accusations. (Which, amusingly, also uses two different fonts!)

[1] Throughout, I analyze the numbers for car1 only, since the same procedure was applied to the other cars. I also mostly show the Calibri numbers only, since they are the original values.