(cropped from STA302-utsg-lec01)
## Common Steps in Statistical Studies
Broadly speaking, analyses often involve the following steps:
▶ Articulate research question
▶ Collect or obtain the data that will help answer this question
▶ Clean and examine data: exploratory data analysis (EDA)
▶ Decide on an appropriate statistical model for the data
↪ e.g. linear regression!
▶ Estimation: fit the model
▶ Carry out appropriate inferential tests
▶ Troubleshooting; something does not look right...
• Reconsider a different modelling approach, redo tests
• Data may be insufficient or inappropriate to answer question
• Initial question not properly articulated...
▶ Draw conclusions
---
## Communication and Reproducibility
---
## Ethical Data Analysis
▶ As an ethical statistician/data scientist, you should strive to
• be as accurate and unbiased as possible;
• be aware of possible consequences of your results on others;
• be honest and transparent in reporting results (especially
when they are not the results you hoped for).
▶ We will focus on these in our work throughout the term,
especially being accurate and transparent with our work.
▶ This is because results are stronger if they can be reproduced
by other independent researchers.
---
## Soccer Example
Soccer example: https://fivethirtyeight.com/features/science-isnt-broken/#part1
---
## Why did we see different results?
▶ The study looked into what could be causing such differences
in the results and found that none of the following factors
could explain the high variability in results:
• different levels of statistical expertise in a team
• peer ratings of overall analysis quality
• peer assessment of specific statistical issues of the analysis
▶ Are any of these factors associated with the estimated effect
of giving a red card?
▶ Different steps, different variables, different conclusions
↪ Hence the importance of clear and transparent communication
in decision-making processes
---
## Statistical Communication
▶ The soccer analysis example highlights the importance of
communication and nuances in data analyses.
▶ Results are only one facet of the whole story/analysis.
▶ Some very general things to keep in mind when thinking about
writing up results of a data analysis:
• Know your audience: expert (in research papers), manager (in
project reports), technician (instructions and reports),
layperson (newspaper articles), hybrid (multi-disciplinary
environment)
• Make informative tables and figures (but also pretty)
• Outline your steps
• Discuss the limitations
---
## Group Activity
▶ Form a group of 3-5 with the people around you
▶ Explain the following statistical terms to each other in
non-statistical language:
1. Significance
2. p-values
3. Regression
▶ Some groups will have the chance to share their explanations
---
## Importance of Visualization
▶ A picture is worth a thousand words
▶ Digesting information visually is often easier and more
memorable than through text.
▶ Graphs and tables are easily customizable.
↪ Always remember to put self-contained captions/titles!
▶ It’s easy to make tables/plots overly complicated...
---
## Example: Simpson’s Paradox on YouTube
Video 1: https://www.youtube.com/watch?v=ebEkn-BiW5k
Video 2: https://www.youtube.com/watch?v=sxYrzzy3cq8
---
## Example: Simpson’s Paradox in R
See lec-1-simpsons-paradox.R.
Q: Does more x lead to increased y values?
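The course script lec-1-simpsons-paradox.R is not reproduced here. As a rough stand-in, the Python sketch below uses a small made-up dataset (the values, the group labels `A` and `B`, and the helper `slope` are assumptions for illustration, not the lecture data). In it, y increases with x inside each group, but the trend reverses once the groups are pooled.

```python
def slope(xs, ys):
    """Least-squares slope of y on x (simple linear regression)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return sxy / sxx

# Hypothetical data: (group, x, y)
data = [
    ("A", 1, 10), ("A", 2, 11), ("A", 3, 12),
    ("B", 6, 2), ("B", 7, 3), ("B", 8, 4),
]

# Slope when group membership is ignored.
pooled_slope = slope([x for _, x, _ in data], [y for _, _, y in data])

# Slope fitted separately within each group.
group_slopes = {}
for g in sorted({grp for grp, _, _ in data}):
    xs = [x for grp, x, _ in data if grp == g]
    ys = [y for grp, _, y in data if grp == g]
    group_slopes[g] = slope(xs, ys)

print("pooled slope:", round(pooled_slope, 3))
print("within-group slopes:", group_slopes)
```

With these values the pooled slope is negative while both within-group slopes are positive, so the answer to the question depends on whether the grouping is taken into account.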
---
## Transparency in Justifications
▶ Analysis results depend on decisions made along the way.
↪ In Simpson’s Paradox, correlations or trends within groups
reverse when the data are aggregated
↪ The regression model should include group information
▶ Reproducible results are generally more convincing than those
that cannot be reproduced.
▶ To enhance reproducibility, remember to:
• have a clear analysis plan and follow it
• be explicit about the decisions made along the way
• make sure the choices made are justifiable
• report everything!
---
## Transparency in Justifications (cont.)
▶ Justifying your choices requires a theoretical or
context-specific argument
• Model selection via statistical tests, collinearity investigation,
etc.
• Underlying factors from previous research (e.g. age affecting
health outcomes)
▶ Conclusions may differ depending on decisions
• In the Simpson’s Paradox example, it is important to take the
groupings into account
---
## Discuss the Limitations
▶ Sometimes an analysis doesn’t work out the way you want it to,
and that’s ok.
▶ Remember that:
• no analysis is perfect - just be upfront about it.
• talk about what could be done differently or in the future (to
get better results) in the discussions / limitations / conclusion
section
• discuss how you made some decisions that might be
questionable, or how there was possibly no other choice
• discuss how maybe assumptions weren’t completely satisfied
but they were good enough and how that might mean some
results could be off.
---
## Discuss the Limitations (cont.)
▶ Data analysis is often thought of as both an art and a science.
▶ A good analyst is ethical, communicates clearly, is critical
of their own and others’ work, and produces results that are
reproducible.
---
## Exploratory Data Analysis
▶ Exploratory Data Analysis (EDA): early data examination and
visualization
▶ Your first step the minute you load your data into R should be
to perform an EDA.
▶ This involves a thorough numerical and graphical description
of your data.
▶ You want to make sure you look out for the following:
• the range and center of each variable
• the general distribution of each variable
• any extreme observations (e.g. outliers)
• whether variables are coded and defined the way you need them
• presence of missing observations.
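A minimal sketch of such a first-pass numerical check, written in Python with only the standard library (the course itself works in R). The records, the column names `age` and `income`, and the helper `summarize` are hypothetical; `None` stands for a missing value.

```python
import statistics

# Hypothetical raw records; None marks a missing observation.
rows = [
    {"age": 23, "income": 41000},
    {"age": 35, "income": None},
    {"age": 41, "income": 58000},
    {"age": 29, "income": 52000},
    {"age": None, "income": 390000},
]

def summarize(rows, column):
    """Range, center and missing count for one numeric column."""
    values = [r[column] for r in rows]
    observed = [v for v in values if v is not None]
    return {
        "n_missing": len(values) - len(observed),
        "min": min(observed),
        "max": max(observed),
        "mean": statistics.mean(observed),
        "median": statistics.median(observed),
    }

for column in ("age", "income"):
    print(column, summarize(rows, column))
```

A mean far above the median, as for `income` in this made-up data, hints at a skewed distribution or an extreme observation that deserves a closer look.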
---
## Distribution Plot of the Variables
▶ For a single variable, we can plot box plots and histograms
to observe:
• The symmetry of the distribution
• Presence (or absence) of outliers
▶ For two variables, scatterplots or bar plots show:
• Again, outliers
• Strength of association
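Box plots conventionally flag points lying more than 1.5 times the interquartile range (IQR) beyond the quartiles. The Python sketch below applies that convention numerically. The data and the helper `iqr_outliers` are made up for illustration, and because quartile definitions differ slightly between software packages, borderline points may be flagged differently elsewhere.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Values outside [Q1 - k*IQR, Q3 + k*IQR], the usual box-plot fences."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

# Hypothetical measurements with one extreme value.
values = [12, 14, 15, 15, 16, 17, 18, 19, 20, 55]

flagged = iqr_outliers(values)
print("flagged:", flagged)

# A mean well above the median is a rough numerical sign of right skew.
print("mean:", statistics.mean(values), "median:", statistics.median(values))
```

A flagged point is a prompt to investigate, not an automatic reason to delete the observation.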
---
## Creating Useful Variables
▶ Not only is it important to verify that your data/variables
represent the population, but you want to make sure they are
meaningful for how you want to use them
▶ You should also check that:
• the variables are coded the way you want them to be (i.e.
numbers are numbers and haven’t been read as words)
• the levels/labels of qualitative variables are meaningful and
have observations (otherwise you may want to collapse them
or merge them)
• the variables are what you need and don’t need to be modified
or changed
▶ It’s totally normal to need to redefine or change the variables
to suit your purpose/make them meaningful.
▶ Avoid overwriting your data!
↪ Instead, create a new variable with your desired changes!
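One way to follow this advice, sketched in Python: derive a new column and leave the raw one untouched. The column names `age` and `age_group`, the cut-offs, and the helper `age_group` are hypothetical choices for illustration.

```python
# Hypothetical raw data; the original "age" column is never modified.
rows = [{"id": 1, "age": 17}, {"id": 2, "age": 34}, {"id": 3, "age": 71}]

def age_group(age):
    """Collapse a numeric age into three illustrative categories."""
    if age < 18:
        return "minor"
    if age < 65:
        return "adult"
    return "senior"

# Build new records with an extra column instead of editing in place.
cleaned = [{**row, "age_group": age_group(row["age"])} for row in rows]

print(cleaned)
```

Because `rows` is left as loaded, the recoding can be revised or checked against the raw values at any later point in the analysis.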
---
## Dealing with Missing Observations
▶ When working with real data (not textbook data), it is very
common to come across missing observations in some or all of
your variables.
▶ Handling missing data is an entire area of statistics on its own
so we won’t get into much detail for now.
▶ One way to deal with it is to just remove any observation that
has missing data in at least one variable; this is called
complete case analysis.
↪ Reasonable when there is no particular reason why some data
are missing
▶ However, always check if numerical summaries (e.g. mean,
median, variance) are similar before and after removing
observations
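The before-and-after check can be sketched as follows in Python (standard library only). The records, the column names `height` and `weight`, and the helper `observed` are hypothetical; `None` stands for a missing value.

```python
import statistics

# Hypothetical records; None marks a missing value.
rows = [
    {"height": 160, "weight": 55},
    {"height": 172, "weight": None},
    {"height": 181, "weight": 80},
    {"height": None, "weight": 62},
    {"height": 168, "weight": 64},
]

# Complete-case analysis: keep only rows with no missing value.
complete = [r for r in rows if all(v is not None for v in r.values())]

def observed(rows, column):
    """Non-missing values of one column."""
    return [r[column] for r in rows if r[column] is not None]

# Compare summaries using all observed values vs. complete cases only.
for column in ("height", "weight"):
    before = observed(rows, column)
    after = observed(complete, column)
    print(
        column,
        "mean before:", round(statistics.mean(before), 2),
        "after:", round(statistics.mean(after), 2),
        "| median before:", statistics.median(before),
        "after:", statistics.median(after),
    )
```

Large shifts between the two sets of summaries suggest that the dropped observations differ systematically from the retained ones, in which case complete case analysis may bias the results.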
---
## General Guidelines of EDA
▶ Exploratory data analyses are just as important as the
statistical analyses.
▶ They give you a heads-up about potential issues.
▶ They give you the opportunity to verify whether your data
correspond to the intended population.
▶ Try to supplement your data exploration with information from
the literature to justify your decision-making.
▶ Document your steps and decisions along the way and discuss
the impacts on your results.