(cropped from STA302-utsg-lec01)


## Common Steps in Statistical Studies

Broadly speaking, analyses often involve the following steps:

▶ Articulate research question  
▶ Collect or obtain the data that will help answer this question  
▶ Clean and examine data: exploratory data analysis (EDA)  
▶ Decide on an appropriate statistical model for the data  
↪ e.g. linear regression!  
▶ Estimation: fit the model  
▶ Carry out appropriate inferential tests  
▶ Troubleshoot if something does not look right:

• Reconsider a different modelling approach, redo tests  
• Data may be insufficient or inappropriate to answer question  
• Initial question not properly articulated...

▶ Draw conclusions

---

## Communication and Reproducibility

---

## Ethical Data Analysis

▶ As an ethical statistician/data scientist, you should strive to  
• be as accurate and unbiased as possible;  
• be aware of possible consequences of your results on others;  
• be honest and transparent in reporting results (especially when  
they are not the results you hoped for).

▶ We will focus on these in our work throughout the term,  
especially being accurate and transparent with our work.

▶ This is because results are stronger if they can be reproduced  
by other independent researchers.

---

## Soccer Example

Soccer example: https://fivethirtyeight.com/features/science-isnt-broken/#part1

---

## Why did we see different results?

▶ The study looked into what could be causing such differences  
in the results and found that none of the following could  
explain the high variability in results:

• different levels of statistical expertise in a team  
• peer ratings of overall analysis quality  
• peer assessment of specific statistical issues of the analysis

▶ Are any of these factors associated with the estimated effect  
of giving a red card?

▶ Different steps, different variables, different conclusions  
↪ Hence the importance of clear and transparent communication  
in decision-making processes

---

## Statistical Communication

▶ The soccer analysis example highlights the importance of  
communication and nuances in data analyses.

▶ Results are only one facet of the whole story/analysis.

▶ Some very general things to keep in mind when thinking about  
writing up results of a data analysis:

• Know your audience: expert (research papers), manager  
(project reports), technician (instructions and reports),  
layperson (newspaper articles), hybrid (multi-disciplinary  
environment)

• Make tables and figures informative (and visually appealing)  
• Outline your steps  
• Discuss the limitations

---

## Group Activity

▶ Form a group of 3-5 with the people around you  
▶ Explain these statistical terms to each other in non-statistical  
language:

1. Significance  
2. p-values  
3. Regression

▶ Some groups will have the chance to share their explanations

---

## Importance of Visualization

▶ A picture is worth a thousand words  
▶ Digesting information visually is often easier and more memorable  
than reading text.

▶ Graphs and tables are easily customizable.  
↪ Always remember to include self-contained captions/titles!

▶ It’s easy to make tables/plots overly complicated...

---

## Example: Simpson’s Paradox on YouTube

Video 1: https://www.youtube.com/watch?v=ebEkn-BiW5k  
Video 2: https://www.youtube.com/watch?v=sxYrzzy3cq8

---

## Example: Simpson’s Paradox in R

Example  
See lec-1-simpsons-paradox.R.

Q.: Does more x lead to increased y values?
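
The course script lec-1-simpsons-paradox.R is not reproduced here. As a rough stand-in, the Python sketch below uses made-up data to show how the answer to this question can flip depending on whether groups are taken into account. The group labels, the values, and the helper `ols_slope` are all assumptions for illustration.

```python
# Hypothetical toy data (not from the course script): two groups, each with
# a negative x-y trend, positioned so that the pooled trend is positive.

def ols_slope(xs, ys):
    """Least-squares slope of y on x in a simple linear regression."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return sxy / sxx

groups = {
    "A": ([1, 2, 3, 4], [6, 5, 4, 3]),
    "B": ([6, 7, 8, 9], [11, 10, 9, 8]),
}

# Slope estimated separately within each group.
within_slopes = {g: ols_slope(xs, ys) for g, (xs, ys) in groups.items()}

# Slope estimated after pooling the groups and ignoring the labels.
pooled_x = [x for xs, _ in groups.values() for x in xs]
pooled_y = [y for _, ys in groups.values() for y in ys]
pooled_slope = ols_slope(pooled_x, pooled_y)

print("within-group slopes:", within_slopes)
print("pooled slope:", round(pooled_slope, 3))
```

In this toy data set both within-group slopes are negative while the pooled slope is positive, so "more x" appears to increase y only when the grouping is ignored.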

---

## Transparency in Justifications

▶ Analysis results depend on decisions made along the way.  
↪ In Simpson’s Paradox, correlations or trends within groups reverse  
when the data are aggregated  
↪ The regression model should include group information

▶ Reproducible results are generally more convincing than those  
that cannot be reproduced.

▶ To enhance reproducibility, remember to:  
• have a clear analysis plan and follow it  
• be explicit about the decisions made along the way  
• make sure the choices made are justifiable  
• report everything!
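
One possible way to "include group information" in a regression is to give each group its own intercept while sharing a single slope. The sketch below does this by hand on made-up data, centring x and y within each group before estimating the slope. The data and the names `common_slope` and `intercepts` are hypothetical, and this is only one of several reasonable model choices.

```python
# Hypothetical data: (x values, y values) for two groups.
groups = {
    "A": ([1, 2, 3, 4], [6, 5, 4, 3]),
    "B": ([6, 7, 8, 9], [11, 10.5, 10, 9.5]),
}

def mean(values):
    return sum(values) / len(values)

# Centre x and y within each group, then pool the centred sums. This gives
# the least-squares slope of the model y = a_group + b * x.
sxy = 0.0
sxx = 0.0
for xs, ys in groups.values():
    x_bar, y_bar = mean(xs), mean(ys)
    sxy += sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx += sum((x - x_bar) ** 2 for x in xs)

common_slope = sxy / sxx

# Each group's intercept follows from its own means and the shared slope.
intercepts = {
    g: mean(ys) - common_slope * mean(xs) for g, (xs, ys) in groups.items()
}

print("common slope:", common_slope)
print("group intercepts:", intercepts)
```

Reporting that the model was fitted with group-specific intercepts, and why, is an example of being explicit about a decision that changes the conclusion.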

---

## Transparency in Justifications (cont.)

▶ Justifying your choices requires a theoretical or context-specific  
argument:

• Model selection via statistical tests, collinearity investigation,  
etc.

• Underlying factors from previous research (e.g. age affecting  
health outcomes)

▶ Conclusions may differ depending on decisions

• In the Simpson’s Paradox example, the groupings must be taken  
into account

---

## Discuss the Limitations

▶ Sometimes an analysis doesn’t work out the way you want it to,  
and that’s OK.

▶ Remember that:

• no analysis is perfect; just be upfront about it.  
• talk about what could be done differently, or in the future, to  
get better results; this belongs in the discussion / limitations /  
conclusion section

• discuss any decisions that might be questionable, or cases  
where there was possibly no other choice  
• discuss any assumptions that were not completely satisfied,  
why you judged them good enough, and how that might mean  
some results are off.

---

## Discuss the Limitations (cont.)

▶ Data analysis is often thought of as both an art and a science.  
▶ A good analyst is ethical, communicates clearly, is critical of  
their own and others’ work, and produces results that are  
reproducible.

---


## Exploratory Data Analysis

▶ Exploratory Data Analysis (EDA): early data examination and  
visualization

▶ Your first step the minute you load your data into R should be  
to perform an EDA.

▶ This involves a thorough numerical and graphical description  
of your data.

▶ You want to make sure you look out for the following:

• the range and center of each variable  
• the general distribution of each variable  
• any extreme observations (e.g. outliers)  
• whether variables are coded and defined the way you need them  
• presence of missing observations.
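
The checklist above can be sketched as a small numerical summary. The course works in R; this Python version uses only the standard library, and the variable `heights_cm`, its values, and the helper `summarize` are hypothetical.

```python
import statistics

def summarize(values):
    """Numerical EDA summary for one variable; None marks a missing value."""
    observed = [v for v in values if v is not None]
    return {
        "n": len(values),
        "n_missing": len(values) - len(observed),
        "min": min(observed),
        "max": max(observed),
        "mean": statistics.mean(observed),
        "median": statistics.median(observed),
        "sd": statistics.stdev(observed),
    }

# Hypothetical variable with one missing value and one suspicious extreme
# (1680 looks like a height recorded in millimetres rather than centimetres).
heights_cm = [162, 171, 158, None, 175, 169, 1680]

summary = summarize(heights_cm)
for name, value in summary.items():
    print(f"{name}: {value}")
```

A mean far above the median, together with a maximum far from the rest of the range, is the kind of signal that should prompt a closer look at individual observations and at how the variable was recorded.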

---

## Distribution Plot of the Variables

▶ For a single variable, we can use box plots and histograms  
to observe:

• The symmetry of the distribution  
• Presence (or absence) of outliers  
▶ For two variables, scatterplots or bar plots show:

• Again, outliers  
• Strength of association
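
The same checks also have numerical counterparts. The sketch below flags outliers with the 1.5 × IQR convention commonly used to draw box-plot whiskers and measures the strength of linear association with Pearson's correlation. Both are swapped-in conventions that the slides do not prescribe, and the data are made up.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR beyond the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

# Hypothetical single variable with one extreme observation.
values = [2, 3, 4, 5, 6, 7, 30]
flagged = iqr_outliers(values)

# Hypothetical pair of variables with a strong positive linear trend.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
r = pearson_r(xs, ys)

print("flagged outliers:", flagged)
print("correlation:", round(r, 3))
```

A flagged point is only a candidate for investigation, not an automatic deletion, and a correlation coefficient only summarizes linear association, so the plots remain necessary.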

---

## Creating Useful Variables

▶ Not only is it important to verify that your data/variables  
represent the population, but you also want to make sure they  
are meaningful for how you want to use them

▶ You should also check that:

• the variables are coded the way you want them to be (i.e.  
numbers are numbers and haven’t been read as words)  
• the levels/labels of qualitative variables are meaningful and  
have observations (otherwise you may want to collapse them  
or merge them)

• the variables are what you need and don’t need to be modified  
or changed

▶ It’s totally normal to need to redefine or change the variables  
to suit your purpose/make them meaningful.

▶ Avoid overwriting your data!  
↪ Instead, create a new variable with your desired changes!
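
A minimal sketch of these checks, assuming hypothetical records in which `age` was read in as text and some `education` levels are sparse. The new fields `age_num` and `education_grp` and the threshold `min_count` are illustrative choices; the original fields are left untouched.

```python
from collections import Counter

# Hypothetical raw records as they might look straight after loading.
raw = [
    {"age": "34", "education": "high school"},
    {"age": "51", "education": "bachelor"},
    {"age": "n/a", "education": "doctorate"},
    {"age": "27", "education": "bachelor"},
    {"age": "45", "education": "high school"},
    {"age": "39", "education": "master"},
]

def to_number(text):
    """Return the text as a float, or None if it is not a valid number."""
    try:
        return float(text)
    except ValueError:
        return None

# Count the observations per level to find levels that are too sparse.
level_counts = Counter(r["education"] for r in raw)
min_count = 2  # assumed threshold for a level to stand on its own

cleaned = [
    {
        **r,  # keep the original fields as they were
        "age_num": to_number(r["age"]),
        "education_grp": (
            r["education"]
            if level_counts[r["education"]] >= min_count
            else "other"
        ),
    }
    for r in raw
]

for record in cleaned:
    print(record)
```

Because the changes live in new fields, the original coding can still be inspected and the recoding decision can be reported or revised later.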

---

## Dealing with Missing Observations

▶ When working with real data (not textbook data), it is very  
common to come across missing observations in some or all of  
your variables.

▶ Handling missing data is an entire area of statistics on its own  
so we won’t get into much detail for now.

▶ One way to deal with it is to remove any observation that  
has missing data in at least one variable; this is called  
complete case analysis.  
↪ Appropriate when there is no particular reason why some data are missing

▶ However, always check if numerical summaries (e.g. mean,  
median, variance) are similar before and after removing  
observations
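
As a rough illustration, a complete case analysis and the before/after comparison could look like the sketch below. The data set is made up, `None` stands in for a missing value, and `describe` is a hypothetical helper.

```python
import statistics

# Hypothetical data set; None marks a missing value.
rows = [
    {"x": 1.0, "y": 2.0},
    {"x": 2.0, "y": None},
    {"x": 3.0, "y": 6.5},
    {"x": None, "y": 8.0},
    {"x": 5.0, "y": 9.5},
]

# Complete case analysis: keep only rows with no missing value at all.
complete = [r for r in rows if all(v is not None for v in r.values())]

def describe(records, var):
    """Summaries of one variable, using whatever values are observed."""
    vals = [r[var] for r in records if r[var] is not None]
    return {
        "n": len(vals),
        "mean": statistics.mean(vals),
        "median": statistics.median(vals),
    }

before = {v: describe(rows, v) for v in ("x", "y")}
after = {v: describe(complete, v) for v in ("x", "y")}

for v in ("x", "y"):
    print(v, "before:", before[v], "after:", after[v])
```

If the summaries shift noticeably after dropping rows, that suggests the removed observations differ systematically from the retained ones, which is worth reporting as a limitation.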

---

## General Guidelines of EDA

▶ Exploratory data analyses are just as important as the  
statistical analyses.

▶ They provide you with a heads-up about potential issues.  
▶ They give you the opportunity to verify whether your data  
correspond to the intended population.

▶ Try to supplement your data exploration with information from  
the literature to justify your decision-making.

▶ Document your steps and decisions along the way and discuss  
the impacts on your results.