[WIP] Introduction to Statistics & Data Analysis — Notes

26.12.2023 — notes — 2 min read

Chapter 1

Statistics is the scientific discipline that provides methods to help us make sense of data.
Statistical methods are used to organize, summarize, and draw conclusions from data.
The entire collection of individuals or objects about which information is desired is called the population of interest.
A sample is a subset of the population, selected for study in some prescribed manner.
Descriptive statistics is the branch of statistics that includes methods for organizing and summarizing data.
Inferential statistics is the branch of statistics that involves generalizing from a sample to the population from which it was selected and assessing the reliability of such generalizations.
The following questions should be addressed as part of a study evaluation:
- Understanding the nature of the problem: What were the researchers trying to learn? What questions motivated their research?
- Deciding what to measure and how to measure it: Was relevant information collected? Were the right things measured?
- Data collection: Were the data collected in a sensible way?
- Data summarization and preliminary analysis: Were the data summarized in an appropriate way?
- Formal data analysis: Was an appropriate method of analysis used, given the type of data and how the data were collected?
- Interpretation of results: Are the conclusions drawn by the researchers supported by the data analysis?
A variable is any characteristic whose value may change from one individual or object to another.
Data result from making observations on variable(s).
A data set consisting of observations on a single variable is a univariate data set.
A data set consisting of observations on multiple variables is a multivariate data set. A bivariate data set is a special case of a multivariate data set consisting of only two variables.
A univariate data set is categorical (or qualitative) if the individual observations are categorical responses.
A univariate data set is numerical (or quantitative) if each observation is a number.
A numerical variable results in discrete data if the possible values of the variable correspond to isolated points on the number line. Around any possible value, we can place an interval small enough that it contains no other possible value. Discrete data usually arise when each observation is determined by counting.
A numerical variable results in continuous data if the set of possible values forms an entire interval on the number line. In general, data are continuous when observations involve making measurements.
A frequency distribution for categorical data is a table that displays the possible categories along with the associated frequencies and/or relative frequencies.
The frequency for a particular category is the number of times the category appears in the data set.
The relative frequency for a particular category is the fraction or proportion of the observations resulting in the category. It is calculated as: $\text{relative frequency} = \frac{\text{frequency}}{\text{number of observations in the data set}}$
If the table includes relative frequencies, it is sometimes referred to as a relative frequency distribution.
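As a rough illustration of the definitions above, here is a minimal Python sketch that builds a frequency distribution with relative frequencies. The function name `frequency_distribution` and the commuting responses are made up for this example and are not part of the original notes.

```python
from collections import Counter


def frequency_distribution(observations):
    """Map each category to (frequency, relative frequency)."""
    n = len(observations)
    if n == 0:
        raise ValueError("need at least one observation")
    counts = Counter(observations)  # frequency of each category
    # relative frequency = frequency / number of observations in the data set
    return {category: (freq, freq / n) for category, freq in counts.items()}


# Hypothetical categorical data, invented for illustration only.
responses = ["bus", "car", "car", "bike", "car", "bus", "walk", "car"]
table = frequency_distribution(responses)

for category, (freq, rel_freq) in table.items():
    print(f"{category:<6}{freq:>3}{rel_freq:>8.3f}")
```

The relative frequencies in such a table should always add up to 1, which makes a handy sanity check.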
A bar chart is a graph of the frequency distribution of categorical data. Each category in the frequency distribution is represented by a bar or rectangle of equal width, and the picture is constructed in such a way that the height of each bar is proportional to the corresponding frequency or relative frequency.
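To make the idea concrete without a plotting library, the sketch below prints a plain-text bar chart. It is only an approximation of the description above: the bars run horizontally, so it is the bar length (rather than height) that is proportional to the frequency. The function name `text_bar_chart` and the sample data are assumptions made for this example.

```python
from collections import Counter


def text_bar_chart(observations, symbol="#"):
    """Return one text line per category; bar length equals the frequency."""
    counts = Counter(observations)
    # Pad labels to a common width so all bars start in the same column.
    width = max(len(str(category)) for category in counts)
    lines = []
    for category, freq in sorted(counts.items(), key=lambda kv: str(kv[0])):
        lines.append(f"{str(category):<{width}} | {symbol * freq} ({freq})")
    return lines


# Hypothetical categorical data, invented for illustration only.
colors = ["red", "blue", "red", "green", "red", "blue"]
lines = text_bar_chart(colors)
print("\n".join(lines))
```

Each category gets a bar of the same thickness (one text row), so only the length carries information, mirroring the equal-width requirement for a regular bar chart.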
A dotplot is a simple way to display numerical data when the data set is reasonably small. Each observation is represented by a dot above the location corresponding to its value on a horizontal measurement scale. When a value occurs more than once, there is a dot for each occurrence and these dots are stacked vertically.
Dotplots convey information about:
- A representative or typical value in the data set.
- The extent to which the data values spread out.
- The nature of the distribution of values along the number line.
- The presence of unusual values in the data set.
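A dotplot is easy to sketch in plain text. The following Python example is a minimal illustration, assuming small integer-valued data so that every value maps to one column of the horizontal scale. The function name `text_dotplot` and the data are invented for this example.

```python
from collections import Counter


def text_dotplot(values):
    """Return text rows: stacked '*' dots above an integer measurement scale."""
    counts = Counter(values)
    scale = range(min(values), max(values) + 1)  # horizontal measurement scale
    width = max(len(str(v)) for v in scale)      # column width for alignment
    rows = []
    # Build the plot from the tallest stack down to the scale.
    for level in range(max(counts.values()), 0, -1):
        cells = [("*" if counts[v] >= level else " ").rjust(width) for v in scale]
        rows.append(" ".join(cells).rstrip())
    rows.append(" ".join(str(v).rjust(width) for v in scale))
    return rows


# Hypothetical numerical data, invented for illustration only.
data = [2, 3, 3, 4, 4, 4, 5, 9]
rows = text_dotplot(data)
print("\n".join(rows))
```

With this data, the tallest stack sits at 4 (a typical value), most dots fall between 2 and 5 (the spread), and the lone dot at 9 stands apart from the rest as an unusual value.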