The Art Of Statistical Thinking Detect Misinformation, Understand The World Deeper, And Make Better Decisions. By Albert Rutherford And Jae H. Kim, PhD AudioChapter

Hear it Here - https://adbl.co/3WjBwT0

00:00:00 The art of statistical thinking

00:03:31 1. Sample versus population

00:06:57 2. Descriptive statistics

00:15:42 3. Sample statistics and population parameters

00:19:26 4. Descriptive statistics for relative position

00:24:09 5. Data Visualization

00:26:25 6. Comparing alternative distributions

00:30:21 7. Normal distribution

00:34:32 8. Checking the normality of a distribution

00:38:03 9. Concluding remarks

https://www.amazon.com/dp/B0BJ7QGMBL

Not knowing statistics can lead to a loss of money, time, and accurate information.

What am I looking at? What do these numbers mean? Why? These are frequent thoughts of those who don’t know much about statistics.

“I’m not a numbers person” is not a good excuse to avoid learning the basics of this essential skill. Are you a person who earns money? Do you shop at the supermarket? Do you vote? Do you read the news? I’m sure you do.

Learn to make decisions like world leaders do.

Do you like to make uninformed, often poor decisions? Are you okay with being manipulated by skewed charts and diagrams? How about being lied to about the effectiveness of a product? I’m sure you don’t.

Statistics can help you make exponentially better calls on what to buy, who to listen to, and what to believe.

This book offers a detailed, illustrated breakdown of the fundamentals of statistics. Develop and use formal logical thinking abilities to understand the message behind numbers and charts in science, politics, and economy.

Sharpen your critical and analytic thinking skills.

Know what to look for when analyzing data. Information gets skewed – often unintentionally – because of the mainstream ways of doing statistics that didn’t catch up to big data. Stop staying in the dark. This book shines the light on the most common statistical methods - and their most frequent misuse. This step-by-step guide not only helps you detect what goes wrong in statistics but also educates you on how to utilize invaluable information statistics gets right to your benefit.

Avoid making decisions on misleading information.

- How to Use Descriptive and Inferential Statistics to Understand the World.

- Be Wary of Misleading Charts.

- Make Better Decisions Using Probability.

- Understand P-Values in Research.

- Understand Potential Bias in Studies.

Albert Rutherford is the internationally bestselling author of several books on systems thinking, game theory, and mathematical thinking. Jae H. Kim is a freelance writer in econometrics, statistics, and data analysis. Since obtaining his PhD in econometrics in 1997, he has been a professor in major Australian universities until 2022. He has published more than 70 academic articles and book chapters in econometrics, empirical finance, economics, and applied statistics, which have attracted nearly 5000 citations to date.

Learn basic statistics and spend your money wisely.

Statistics, as a learning tool, can be used or misused. Some will actively lie and mislead with statistics. More often, however, well-meaning people – even professionals - unintentionally report incorrect statistical conclusions. Knowing what errors and mistakes to look for will help you to be in a better position to evaluate the information you have been given.

#Data #DataVisualization #Descriptive #Deviation #Histogram #Interquartile #IQ #Mean #Median #MedianStandard #Outlier #Percentile #Percentiles #Quartile #StandardDeviation #StatisticalThinking #Variance #Visualization #RussellNewton #NewtonMG #TheArtofStatisticalThinking #DetectMisinformation #UnderstandTheWorldDeeper #AndMakeBetterDecisions #ByAlbertRutherfordAndJaeH.Kim #PhD #AlbertRutherford #JaeH.KimPhD

Transcript

Speaker: 00:00:00

The art of statistical thinking, detect misinformation, understand the world

Speaker: 00:00:06

deeper, and make better decisions. Advanced Thinking Skills, book 3, written by

Speaker: 00:00:12

Albert Rutherford and Jae H. Kim, Ph.D., narrated by Russell Newton.

Speaker: 00:00:19

We make decisions every day - some can change our lives and those of our loved ones.

Speaker: 00:00:26

But it is not only the individuals who make decisions.

Speaker: 00:00:30

Companies, courts of law, governments and international organizations also make decisions,

Speaker: 00:00:36

often on a large scale, that can affect our jobs, the justice system, and everyday life

Speaker: 00:00:43

in a positive or negative way.

Speaker: 00:00:46

Such decisions usually are made under incomplete information and uncertainty.

Speaker: 00:00:52

The decision-makers often make correct decisions that will benefit our society, but they make

Speaker: 00:00:58

incorrect decisions too.

Speaker: 00:00:59

The cost of the latter can sometimes be devastating, starting from personal tragedies to changing

Speaker: 00:01:05

the course of human history.

Speaker: 00:01:07

But let’s not run so far ahead.

Speaker: 00:01:10

Suppose you are making an investment decision for your retirement.

Speaker: 00:01:14

Investment funds report their average returns for the past 5 years; you read a media report

Speaker: 00:01:19

about the recent growth of the real estate market, and you hear about overnight millionaires

Speaker: 00:01:24

who have made it big from investing in cryptocurrency.

Speaker: 00:01:28

You also hear about those who lost their life savings because of wrong investments or scams.

Speaker: 00:01:33

And there is always a catch in the fine print - “Past performance is not necessarily indicative

Speaker: 00:01:39

of future performance."

Speaker: 00:01:41

This means you are facing uncertainty in your investment decisions, and you should learn

Speaker: 00:01:46

how to make a well-informed decision under this circumstance.

Speaker: 00:01:51

If you make a decision after you sampled a range of different funds, compared them with

Speaker: 00:01:55

those of real estate markets, and studied the future prospect of the world economy,

Speaker: 00:02:00

learned from investment gurus such as Warren Buffett and listened to your friends

Speaker: 00:02:04

and advisors, then it is most likely that you have made an informed decision that will

Speaker: 00:02:10

bring a handsome payoff eventually.

Speaker: 00:02:13

This is, in a way, “statistical thinking”; you sample the population and learn from it

Speaker: 00:02:20

to make an informed decision.

Speaker: 00:02:22

The more diverse and informative your sample’s elements are, the more likely it is that you

Speaker: 00:02:27

have made the right decision.

Speaker: 00:02:29

This book will show you how to understand statistics as a layman and make informed decisions

Speaker: 00:02:35

with the help of statistical thinking.

Speaker: 00:02:38

The problem is that statistics can easily be manipulated and misinterpreted.

Speaker: 00:02:43

If statistical findings were always presented and utilized in an honest and correct way,

Speaker: 00:02:48

the results wouldn’t always be as rosy.

Speaker: 00:02:51

We often see distorted and misguided numbers and outcomes, even though that was not the

Speaker: 00:02:56

intention of those who report statistics.

Speaker: 00:03:00

This book is intended to help readers gain better understanding and decision-making skills

Speaker: 00:03:04

– the kind that professional statisticians possess.

Speaker: 00:03:08

In the first chapter, we will review the definitions and basic concepts of statistics.

Speaker: 00:03:15

In a book on statistics, introducing mathematical details is inevitable.

Speaker: 00:03:20

However, these details will only be presented when necessary, without providing the full

Speaker: 00:03:26

theoretical background.

Speaker: 00:03:27

Chapter 1 - Definition and Basic Concepts.

Speaker: 00:03:31

Speaker: 00:03:32

Sample versus population.

Speaker: 00:03:36

An investor wishes to know the five-year average return from investing in the U. S. stock market.

Speaker: 00:03:43

There are nearly 2,400 stocks (as of August 2022) listed on the NYSE (New York Stock Exchange),

Speaker: 00:03:53

and they must select a manageable number of stocks to form a portfolio of stocks.

Speaker: 00:03:58

However, they don’t need to calculate the average return of all 2400 stocks.

Speaker: 00:04:04

There are stocks not worth investing in – too low return or too risky.

Speaker: 00:04:10

Our investor will need to select a set of stocks that suits their investment style.

Speaker: 00:04:16

In this example, the collection of all stocks in the NYSE is called the population in statistical

Speaker: 00:04:23

jargon, and a subset of all stocks is called a sample.

Speaker: 00:04:28

Collecting the information from all the members of the population is too costly and time-consuming

Speaker: 00:04:35

and even unnecessary.

Speaker: 00:04:37

We can obtain a good indicator of average return by looking at a sample.

Speaker: 00:04:42

The way we select the sample is critically important, and it depends largely on the purpose

Speaker: 00:04:47

of the study or the aim of the statistical task at hand.

Speaker: 00:04:52

Suppose the investor’s aim is to achieve a steady return with relatively low risk by

Speaker: 00:04:57

investing in big and stable companies.

Speaker: 00:05:00

Then a good sample is the Dow Jones index, which comprises the stocks of 30 prominent

Speaker: 00:05:07

companies, such as Boeing, Coca-Cola, Microsoft, and Procter & Gamble.

Speaker: 00:05:14

If the investor’s goal is to achieve a higher return with higher growth, albeit taking a

Speaker: 00:05:19

higher risk, the NASDAQ-100 index is a good sample that mainly includes the top technology

Speaker: 00:05:25

and IT stocks, such as Amazon, Apple, eBay, and Google.

Speaker: 00:05:31

By looking at the average returns of these indices, the investor can get a clear indication

Speaker: 00:05:37

and impression of the performance of these stocks.

Speaker: 00:05:41

Seasoned investors can select their own sample based on their aim and risk-return preference.

Speaker: 00:05:47

The important point is that the sample should be a good representation of the target population.

Speaker: 00:05:54

If the investor wants safe and steady investment returns, but their sample represents high-risk

Speaker: 00:06:01

stocks, they may not effectively achieve the aim of their investment.

Speaker: 00:06:06

Hence, the target population should be determined in consideration of the aim of the statistical

Speaker: 00:06:12

study.

Speaker: 00:06:13

A sample that is a good representation of the population can be obtained by pure random

Speaker: 00:06:19

sampling.

Speaker: 00:06:20

The members of the population are selected randomly with an equal chance.

Speaker: 00:06:24

For example, in political polls, all eligible voters should be treated equally.

Speaker: 00:06:30

In this situation, the most effective way of selecting an unbiased and representative

Speaker: 00:06:35

sample is random sampling, where the members of the eligible voters are selected with equal

Speaker: 00:06:41

chance, with no pre-selection or exclusions.
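
As a minimal sketch of pure random sampling, the Python snippet below draws 1,000 voters from a hypothetical list of 10,000 eligible voters. The population size, the numbering of voters, and the seed are assumptions made only for illustration; they do not come from the book.

```python
import random

# Hypothetical population: 10,000 eligible voters, identified by number.
population = list(range(10_000))

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# rng.sample draws without replacement: every voter has the same chance
# of being selected, with no pre-selection or exclusions.
sample = rng.sample(population, k=1_000)

print(len(sample))       # number of voters polled
print(len(set(sample)))  # same number, so no voter is polled twice
```

Any rule that screens voters before the draw (for example, only those who own a telephone) would break the equal-chance property that this sketch relies on.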

Speaker: 00:06:46

In a later chapter, we will discuss an example of one of the most disastrous polling outcomes

Speaker: 00:06:50

in history, which occurred due to a violation of this random sampling principle.

Speaker: 00:06:57

Speaker: 00:06:59

Descriptive statistics.

Speaker: 00:07:02

Descriptive statistics is a branch of statistics where the sample features are presented with

Speaker: 00:07:08

a range of summary statistics and visualization methods.

Speaker: 00:07:13

The summary statistics include the mean and median, which describe the centre of the sample

Speaker: 00:07:19

values, and the variance and standard deviation are the measures of the variability of the

Speaker: 00:07:25

sample values.

Speaker: 00:07:28

Visualization methods include plots, charts, and graphs, which are used to make a visual

Speaker: 00:07:33

impression about the distribution of the sample values.

Speaker: 00:07:38

1.1.

Speaker: 00:07:40

Mean and median.

Speaker: 00:07:43

The mean refers to the average of a set of values.

Speaker: 00:07:47

It is computed by adding the numbers and dividing the total by the number of observations.

Speaker: 00:07:54

The mean is the average of the sample values of size n, with each individual point given

Speaker: 00:08:01

the weight of 1/n.

Speaker: 00:08:05

The formula for the mean can be written as

Speaker: 00:08:08

$\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i$, (1)

Speaker: 00:08:10

where (X1, X2,…, Xn) represent the data points and n is called the sample size.

Speaker: 00:08:17

That is, the sample mean is the sum of all sample points divided by the sample size.

Speaker: 00:08:24

Alternatively, it can be interpreted as a weighted sum of all data points with an equal

Speaker: 00:08:31

weight of 1/n.

Speaker: 00:08:33

The median is the middle number in a sequence of numbers.

Speaker: 00:08:38

To find the median, organize each number in order by size; the number in the middle is

Speaker: 00:08:44

the median. In statistical terms, the median is defined as the middle value of (X1, X2,

Speaker: 00:08:53

…, Xn) when sorted in ascending or descending order.

Speaker: 00:08:58

Consider a simple example of (X1, …, Xn) = (1, 2, 3, 4, 5) and n = 5.

Speaker: 00:09:07

The sum of all X’s is 15 (1+2+3+4+5=15), and the sample mean is 3 (15/5=3).

Speaker: 00:09:22

The middle value of (1, 2, 3, 4, 5) is 3.

Speaker: 00:09:27

In this case, the sample’s mean and median are the same.

Speaker: 00:09:31

In general, the mean and median values are different, and the median is widely used where

Speaker: 00:09:38

there are possible extreme values in the sample points.

Speaker: 00:09:44

Consider the sample points with an extreme observation (X1, …, Xn) = (1, 2, 3, 4, 20),

Speaker: 00:09:54

then the sample mean is 6 (1+2+3+4+20 = 30; 30/5=6), and the median is still 3 as the

Speaker: 00:10:07

middle value of the distribution (1, 2, 3, 4, 20).

Speaker: 00:10:14

If this extreme value is unusual and does not represent the target population, then

Speaker: 00:10:19

the sample mean of 6 can be a misleading value because it was distorted by the presence of

Speaker: 00:10:25

20.

Speaker: 00:10:26

In this case, the median should be preferred to the mean.
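
The two small samples above can be checked with Python's standard `statistics` module. This is only an illustrative sketch; the variable names are chosen here and are not from the book.

```python
import statistics

symmetric = [1, 2, 3, 4, 5]
with_outlier = [1, 2, 3, 4, 20]

# Mean and median agree when there is no extreme value.
print(statistics.mean(symmetric), statistics.median(symmetric))

# One extreme value pulls the mean up to 6; the median stays at 3.
print(statistics.mean(with_outlier), statistics.median(with_outlier))
```

Replacing 5 with 20 doubles the mean but leaves the median untouched, which is why the median is the safer summary when extreme values may not represent the target population.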

Speaker: 00:10:31

A practical example of using the median over the mean is the case for house prices.

Speaker: 00:10:36

For example, the researcher is interested in the average house price in a middle-class

Speaker: 00:10:42

suburb.

Speaker: 00:10:43

In such a suburb, there is still a chance that a big mansion or two in a large block

Speaker: 00:10:47

of land may be included in the sale.

Speaker: 00:10:50

However, these houses do not represent the general characteristics of the suburb, and

Speaker: 00:10:55

it is reasonable to use the median in this case to find the average value free from the

Speaker: 00:11:01

effect of these extreme values.

Speaker: 00:11:04

The choice between the mean and the median is closely related to the skewness of the distribution.

Speaker: 00:11:11

If the distribution of the numbers you have is (more or less) symmetric around the mean

Speaker: 00:11:17

as in (X1, …, Xn) = (1, 2, 3, 4, 5), the mean and median will be identical or practically

Speaker: 00:11:26

the same.

Speaker: 00:11:27

However, when the distribution of the numbers is asymmetric or skewed, then the mean and

Speaker: 00:11:33

median can be different.

Speaker: 00:11:34

For example, if the distribution is asymmetric, as in (X1, …, Xn) = (1, 2, 3, 4, 20), then

Speaker: 00:11:43

the two values can be different.

Speaker: 00:11:46

Photo source - Study.com.

Speaker: 00:11:47

Graphical illustrations of the different shapes of the distribution and the positions of the

Speaker: 00:11:52

mean and median are given above.

Speaker: 00:11:55

Suppose the above is the distribution of the performance of all salespeople in a company.

Speaker: 00:12:01

A symmetric distribution means the higher performers and lower performers are in the

Speaker: 00:12:07

same or similar proportion; in which case the mean and median are almost identical.

Speaker: 00:12:12

A positively skewed distribution means the presence of a small number of extremely capable performers.

Speaker: 00:12:20

In this case, the mean of the sales is inflated by their performance.

Speaker: 00:12:25

If the sales manager wants an average value that represents the performance of the “average

Speaker: 00:12:31

salesperson”, then the use of median is appropriate.

Speaker: 00:12:34

If she wants to know the average sales, including the performance of all salespeople in the

Speaker: 00:12:40

company, then the use of the mean is appropriate.

Speaker: 00:12:42

A similar interpretation can also be made from a negatively skewed distribution illustrated

Speaker: 00:12:49

above.

Speaker: 00:12:51

1.2.

Speaker: 00:12:53

Variance and standard deviation.

Speaker: 00:12:57

When analyzing or presenting a set of numbers, it is important to know the centre of the

Speaker: 00:13:02

distribution.

Speaker: 00:13:03

But understanding their dispersion and variability is also important.

Speaker: 00:13:08

Consider two salespeople with the same or a similar number of mean sales in the past

Speaker: 00:13:13

year.

Speaker: 00:13:14

In evaluating who was a more consistent performer, the manager will compare the dispersions in

Speaker: 00:13:21

their sales throughout the year.

Speaker: 00:13:24

The measures of variability, namely the variance and standard deviation, show how widely spread the sample

Speaker: 00:13:31

points are around the mean.

Speaker: 00:13:34

The distance of each sample point from the mean is calculated as $X_i - \bar{X}$, and these distances are squared

Speaker: 00:13:42

to make them all positive.

Speaker: 00:13:44

The average of all the squared distances from the mean is called the variance, which can

Speaker: 00:13:49

be written as

Speaker: 00:13:50

$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2$. (2)

Speaker: 00:13:51

How this formula works will be explained in the table below.

Speaker: 00:13:55

But it is, in a way, the average of the squared distance of the data points from the mean,

Speaker: 00:14:05

i.e., the average of $(X_i - \bar{X})^2$.

Speaker: 00:14:07

The standard deviation (s) is defined as the square root of the variance, namely,

Speaker: 00:14:16

$s = \sqrt{s^2} = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2}$. (3)

Speaker: 00:14:18

Since the variance is the distance of the sample points from the mean in squares, the

Speaker: 00:14:24

standard deviation converts the value into the same unit as the original value of the

Speaker: 00:14:29

sample points by taking the square root.

Speaker: 00:14:31

| $X_i$ | $X_i - \bar{X}$ | $(X_i - \bar{X})^2$ |
|---|---|---|
| 1 | -2 (=1-3) | $(-2)^2$ = 4 |
| 2 | -1 (=2-3) | $(-1)^2$ = 1 |
| 3 | 0 (=3-3) | $0^2$ = 0 |
| 4 | 1 (=4-3) | $1^2$ = 1 |
| 5 | 2 (=5-3) | $2^2$ = 4 |
| Sum | | 10 |

$\bar{X}$ = 3.

Speaker: 00:14:50

Using the example above as an illustration, X = (1, 2, 3, 4, 5) and $\bar{X}$ = 3. The variance is the

Speaker: 00:14:51

sum of the numbers in the last column of the table above divided by 4 (= n-1), which is 10/4 = 2.5.

Speaker: 00:14:53

The standard deviation is $s = \sqrt{2.5} \approx 1.58$.

Speaker: 00:15:01

The interpretation is that the sample points are, on average, 1.58 units away from the

Speaker: 00:15:08

mean value of 3.
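
The worked example can be reproduced step by step in Python. The sketch below follows the table column by column and then compares the result with the standard library's `statistics.variance` and `statistics.stdev`, which also divide by (n-1). The variable names are illustrative choices, not the book's notation.

```python
import math
import statistics

x = [1, 2, 3, 4, 5]
n = len(x)
mean = sum(x) / n  # 15 / 5 = 3.0

# Squared distance of each point from the mean: 4, 1, 0, 1, 4
squared_distances = [(xi - mean) ** 2 for xi in x]

variance = sum(squared_distances) / (n - 1)  # 10 / 4 = 2.5
std_dev = math.sqrt(variance)                # about 1.58

# The library functions use the same (n - 1) divisor.
assert math.isclose(variance, statistics.variance(x))
assert math.isclose(std_dev, statistics.stdev(x))

print(variance, round(std_dev, 2))  # 2.5 1.58
```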

Speaker: 00:15:12

Why the division (or weight) is by (n-1), not by n, is beyond the scope of this book,

Speaker: 00:15:20

but it is to make the calculation more accurate when the sample size is small.

Speaker: 00:15:25

When the sample size is large, the division by n or by (n-1) makes no practical difference.
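
To see how little the divisor matters in a large sample, the sketch below compares the two versions for n = 5 and n = 1000. It uses `statistics.pvariance` (division by n) and `statistics.variance` (division by n-1); the data are simply the numbers 1 to n, an arbitrary choice made for illustration.

```python
import statistics

ratios = {}
for n in (5, 1000):
    data = [float(i) for i in range(1, n + 1)]
    by_n_minus_1 = statistics.variance(data)  # divides by (n - 1)
    by_n = statistics.pvariance(data)         # divides by n
    ratios[n] = by_n / by_n_minus_1           # always equals (n - 1) / n

print(round(ratios[5], 3))     # 0.8
print(round(ratios[1000], 3))  # 0.999
```

With five observations the two versions differ by 20 per cent; with a thousand they differ by 0.1 per cent.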

Speaker: 00:15:32

There are other variability measures around the median (i.e., interquartile range), and

Speaker: 00:15:38

they will be introduced in this book later.

Speaker: 00:15:42

Speaker: 00:15:44

Sample statistics and population parameters.

Speaker: 00:15:49

The sample mean ($\bar{X}$) and standard deviation (s) are the statistics calculated from a sample.

Speaker: 00:15:57

The sample is a subset of the population, which also has the mean and standard deviation

Speaker: 00:16:03

(the median and variance as well).

Speaker: 00:16:06

When we use statistics, what we eventually want to know is the population values (also

Speaker: 00:16:12

called the population parameters), such as the mean and standard deviation.

Speaker: 00:16:17

The population mean and standard deviation are often written with Greek letters as $\mu$ and

Speaker: 00:16:24

$\sigma$, values that are almost never known in practice.

Speaker: 00:16:29

Suppose you want to know the mean household income of California.

Speaker: 00:16:33

If you visit all the households in California to find their mean income, as in a census,

Speaker: 00:16:38

you are looking for the value of $\mu$.

Speaker: 00:16:40

However, such an exercise is often neither feasible nor necessary.

Speaker: 00:16:47

A good representative sample can tell us a lot about $\mu$, as we shall see later.

Speaker: 00:16:53

We can gather a random sample of 1,000 households to find their income, and this will give the

Speaker: 00:17:00

value of the sample mean ($\bar{X}$).

Speaker: 00:17:04

If the sample was a good representation of the population, it is likely the sample mean

Speaker: 00:17:10

is a good indicator for the population mean.

Speaker: 00:17:14

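The population mean and variance (and standard deviation) can be written formally as

Speaker: 00:17:21

$\mu = \frac{1}{N} \sum_{i=1}^{N} X_i$, (4)

Speaker: 00:17:24

$\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (X_i - \mu)^2$, (5)

Speaker: 00:17:27

$\sigma = \sqrt{\sigma^2}$, (6)

Speaker: 00:17:30

where N is the population size and $(X_1, \ldots, X_N)$ represent the population values.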

Speaker: 00:17:45

The formulae above are similar to their sample counterparts in (1) to (3), hence their interpretations

Speaker: 00:17:51

are similar, but they are the values of the population.

Speaker: 00:17:56

In our example, N is the total number of households in California, and $(X_1, \ldots, X_N)$ are their

Speaker: 00:18:04

incomes.

Speaker: 00:18:05

If 1,000 households are selected randomly and their mean income is found to be $75,000,

Speaker: 00:18:14

then $\bar{X}$ = 75,000 with n = 1,000.

Speaker: 00:18:18

It is hoped that this value of the sample mean is in close neighbourhood of the true

Speaker: 00:18:24

value of the population mean.
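
The idea that a random sample of 1,000 can stand in for the whole population can be tried out with simulated data. In the sketch below, the population of 100,000 household incomes is made up (drawn from a normal distribution centred on 75,000); it is not real Californian data, and the seed is fixed only to make the run repeatable.

```python
import random
import statistics

rng = random.Random(0)

# Hypothetical population of 100,000 household incomes (made-up numbers).
population = [rng.gauss(75_000, 20_000) for _ in range(100_000)]
mu = statistics.fmean(population)  # population mean, normally unknown

# A random sample of n = 1,000 households, each with an equal chance.
sample = rng.sample(population, k=1_000)
x_bar = statistics.fmean(sample)   # sample mean, the estimate of mu

print(round(mu), round(x_bar))
```

In a run like this, the sample mean typically lands within a few hundred to a couple of thousand dollars of the population mean, even though only 1 per cent of the households were visited.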

Speaker: 00:18:26

Let us take another example.

Speaker: 00:18:29

Consider a fictitious country with 1 million (N) eligible voters who are voting for

Speaker: 00:18:35

their President.

Speaker: 00:18:36

A candidate should have the support rate of more than 0.5 to get elected.

Speaker: 00:18:40

The true value of the support rate ($\mu$) is unknown, and what matters is this value on

Speaker: 00:18:49

the election date.

Speaker: 00:18:50

A poll is conducted from a sample of 1000 (n) eligible voters, 10 days before the election

Speaker: 00:18:57

date.

Speaker: 00:18:58

The support rate found in this poll is the sample mean ($\bar{X}$).

Speaker: 00:19:02

Suppose this sample value ($\bar{X}$) is 50.1 per cent.

Speaker: 00:19:07

This value is called an estimate of the population parameter ($\mu$).

Speaker: 00:19:12

If the sample is a good representation of the population, this sample mean

Speaker: 00:19:19

is an indicator of the value of $\mu$ 10 days before the election date.

Speaker: 00:19:26

Speaker: 00:19:28

Descriptive statistics for relative position.

Speaker: 00:19:33

Suppose your IQ score is 115.

Speaker: 00:19:36

A natural question is how smart are you (according to the IQ score only) relative to the other

Speaker: 00:19:43

people in the sample or population.

Speaker: 00:19:47

Suppose your annual income is $50,000.

Speaker: 00:19:50

You want to know how rich or how poor you are relative to the others in the sample or

Speaker: 00:19:56

population.

Speaker: 00:19:57

You ran a marathon, and you completed the race with a record of 3 hours.

Speaker: 00:20:02

You want to know your rank in the race and where your rank stands relative to all the

Speaker: 00:20:07

participants of the race.

Speaker: 00:20:09

These questions are asking for a relative position, another important question in statistics.

Speaker: 00:20:16

The popular measures of relative positions are percentiles (sometimes called quantiles)

Speaker: 00:20:23

and quartiles.

Speaker: 00:20:26

Percentiles (quantiles).

Speaker: 00:20:28

With percentiles, we divide the distribution of the numbers into 100 positions.

Speaker: 00:20:34

For example, the 90th percentile represents the value in the sample that has 10% of the

Speaker: 00:20:40

sample points higher and 90% of the values lower than it.

Speaker: 00:20:46

That is, if your IQ score of 115 is said to be the 90th percentile, this means you are

Speaker: 00:20:55

at the top 10% of the distribution of all IQ scores.

Speaker: 00:21:01

Suppose your income of $50,000 is the 40th percentile of the distribution, then it means

Speaker: 00:21:07

your income is at the bottom 40% of the distribution.

Speaker: 00:21:12

That is, if there were 1000 people in the sample, your income stands at the 400th position

Speaker: 00:21:19

when all incomes are sorted in ascending order.

Speaker: 00:21:23

Similarly, among the 100 runners who participated in the marathon event, suppose your record

Speaker: 00:21:29

of 3 hours is at the 75th percentile.

Speaker: 00:21:34

This means your record is in the top 25%: 24 runners finished the race

Speaker: 00:21:40

with a better record than yours, and 75 of them were behind you.
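
A relative position of this kind can be computed by counting how many values fall below yours. The helper `percent_below` and the list of 100 scores in the sketch below are hypothetical, introduced only for illustration; textbooks and software use several slightly different percentile conventions, and this is just one of them.

```python
def percent_below(values, x):
    """Share of values strictly below x, in per cent."""
    below = sum(1 for v in values if v < x)
    return 100 * below / len(values)

# Hypothetical sample: 100 test scores, 1 to 100.
scores = list(range(1, 101))

# 90 of the 100 scores are below 91, so 91 sits at the 90th percentile.
print(percent_below(scores, 91))  # 90.0
```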

Speaker: 00:21:47

Quartiles.

Speaker: 00:21:50

Quartiles are similar to percentiles, but instead of dividing the distribution of the numbers

Speaker: 00:21:54

into 100 positions, they are based on the division into 4, as the following table shows

Speaker: 00:22:00

- .

Speaker: 00:22:01

The first quartile is the value whose position is at the bottom 25%, and it is the same as

Speaker: 00:22:07

the 25th percentile.

Speaker: 00:22:09

The second quartile is the 50th percentile, which is also the median.

Speaker: 00:22:14

If we go back to your marathon record, your record of 3 hours is the third quartile of

Speaker: 00:22:20

the distribution.

Speaker: 00:22:24

Interquartile range.

Speaker: 00:22:26

The interquartile range is defined as the difference between the 3rd and 1st quartiles of the

Speaker: 00:22:32

distribution.

Speaker: 00:22:33

It is a measure of variability or dispersion of a distribution alternative to the standard

Speaker: 00:22:40

deviation.

Speaker: 00:22:42

As the difference between the 3rd and 1st quartiles, the length of the interval contains

Speaker: 00:22:47

the (middle) 50% of the data points around the median.

Speaker: 00:22:51

Similarly to the median, the interquartile range is not sensitive to a few extreme values

Speaker: 00:22:57

in the distribution, while standard deviation can be inflated by extreme values.
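
The contrast can be seen on the two small samples used earlier for the mean and median. The sketch below uses `statistics.quantiles` with `method="inclusive"`, which is one of several quartile conventions and an assumption of this example; the helper name `iqr` is also chosen here, not taken from the book.

```python
import statistics

def iqr(values):
    # Quartile conventions differ; "inclusive" interpolates between the
    # smallest and largest observed values.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 - q1

typical = [1, 2, 3, 4, 5]
with_outlier = [1, 2, 3, 4, 20]

# The interquartile range is the same for both samples...
print(iqr(typical), iqr(with_outlier))

# ...while the standard deviation grows roughly fivefold.
print(round(statistics.stdev(typical), 2),
      round(statistics.stdev(with_outlier), 2))
```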

Speaker: 00:23:04

More examples will follow for the interquartile range.

Speaker: 00:23:08

As an example, consider two suburbs whose median house prices are similar at 1 million

Speaker: 00:23:14

dollars.

Speaker: 00:23:15

The researcher finds the first suburb has the 1st quartile at $750,000 and the 3rd

Speaker: 00:23:23

quartile at $1.25 million, with the interquartile range of $500,000 ($1.25 million - $750,000).

Speaker: 00:23:35

The second suburb has the 1st quartile at $500,000 and the 3rd quartile at $1.5

Speaker: 00:23:43

million, with the interquartile range of 1 million dollars ($1.5 million – $500,000).

Speaker: 00:23:53

The interval that contains the middle 50% of the house prices is much longer in the

Speaker: 00:23:59

second suburb, which indicates the variability of house prices is substantially larger in

Speaker: 00:24:05

the second suburb.

Speaker: 00:24:09

Speaker: 00:24:11

Data Visualization.

Speaker: 00:24:15

Visualization is a powerful way of understanding the key features of a sample and making impressions.

Speaker: 00:24:22

It often makes a better and stronger impression about the data characteristics than a table

Speaker: 00:24:26

full of numbers.

Speaker: 00:24:29

Consider an investor who wishes to invest in U. S. stocks.

Speaker: 00:24:32

They gather the sample for NASDAQ-100 index and want to know how the index and its return

Speaker: 00:24:38

have performed in the last 5 years to December 2021.

Speaker: 00:24:43

Figure 1 presents the line charts (time plots) of the index and its return (growth rate) in percentage,

Speaker: 00:24:51

monthly from 2017 to 2021.

Speaker: 00:24:55

The index has been growing with an upward trend for the last 5 years, and the trend

Speaker: 00:24:59

gets steeper from early 2020.

Speaker: 00:25:03

The monthly return fluctuates around 0, with most values between -10% and 10%.

Speaker: 00:25:10

These plots provide a clear impression of how the index has performed in the last five

Speaker: 00:25:16

years.

Speaker: 00:25:17

Figure 1 - Time plots of NASDAQ-100 index and return.

Speaker: 00:25:18

Data source - Yahoo Finance.

Speaker: 00:25:19

A histogram is another popular method of data visualization that presents the frequencies

Speaker: 00:25:24

of data points over the intervals of sample points.

Speaker: 00:25:27

It is a useful method of presenting the distributional shape of the sample points.
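
To make the idea of "frequencies over intervals" concrete, here is a minimal text histogram in Python. The ten monthly returns are made-up numbers for illustration only, not the NASDAQ-100 data, and the interval width of 5 percentage points is an arbitrary choice.

```python
from collections import Counter
import math

# Made-up monthly returns in per cent, for illustration only.
returns = [-6.2, -1.5, 0.8, 1.9, 2.4, 2.7, 3.1, 4.6, 5.3, 8.9]

width = 5  # each interval (bin) is 5 percentage points wide

# Assign each return to the lower edge of its interval, then count.
counts = Counter(math.floor(r / width) * width for r in returns)

for start in sorted(counts):
    print(f"{start:>4}% to {start + width:>3}%: {'#' * counts[start]}")
```

Each row of `#` marks is one bar of the histogram; the longest row shows where the returns are concentrated.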

Speaker: 00:25:33

Figure 2 presents the histogram of the monthly returns, which shows the monthly returns are

Speaker: 00:25:38

centred between 0% and 5%, and most of the values are in the range of -10% and 10%.

Speaker: 00:25:47

The sample mean value of the monthly return is 2.02%, and their median is 2.68%, so the

Speaker: 00:25:56

index has been increasing at an average growth rate of just higher than 2%.

Speaker: 00:26:02

The standard deviation is 4.92%, which indicates the average deviation of the monthly returns

Speaker: 00:26:09

from the mean has been around 5%.

Speaker: 00:26:13

By combining the plots and summary statistics, the investor can learn about the performance

Speaker: 00:26:18

of the index in detail.

Speaker: 00:26:20

Figure 2 - Histogram of Returns from NASDAQ-100 index.

Speaker: 00:26:23

Data source - Yahoo Finance.

Speaker: 00:26:25

Speaker: 00:26:26

Comparing alternative distributions.

Speaker: 00:26:28

Now suppose the investor wishes to compare the performance of the NASDAQ-100 with the

Speaker: 00:26:35

Apple stock (APPL) for the same period.

Speaker: 00:26:38

The following table compares the basic statistics discussed so far.

Speaker: 00:26:42

Monthly returns for two alternative investments (per cent per month).

| | NASDAQ-100 | APPL |
|---|---|---|
| Mean | 2.01 | 3.02 |
| Median | 2.68 | 5.00 |
| Standard Deviation | 4.92 | 8.34 |
| 1st Quartile | -0.18 | -1.66 |
| 3rd Quartile | 5.13 | 9.25 |
| 10th percentile | -5.89 | -7.35 |
| 90th percentile | 7.37 | 12.27 |

Speaker: 00:27:06

Data source - Yahoo finance.

Speaker: 00:27:07

The figures in this table reveal many details of the two investment alternatives -

Speaker: 00:27:08

• The average return from NASDAQ-100 is substantially lower than that from APPL. The mean and median of the

Speaker: 00:27:09

former are 2.01% and 2.68% per month, but those of APPL are 3.02% and 5.00%.

Speaker: 00:27:10

•For both cases, the median is larger than the mean, especially for APPL. This means

Speaker: 00:27:13

the distribution is skewed to the left, with the presence of extremely low returns.

Speaker: 00:27:19

This means, when they go down, they can go down deep!

Speaker: 00:27:24

(Especially APPL!).

Speaker: 00:27:26

•The variability is a lot higher for the returns from APPL. The standard deviation

Speaker: 00:27:31

of APPL (8.34) is about 1.7 times that of NASDAQ-100 (4.92).

Speaker: 00:27:40

This means APPL has a much larger variation around the mean.

Speaker: 00:27:45

•The interquartile range for APPL is 10.91 (9.25 - (-1.66)) and that of NASDAQ-100 is 5.31

Speaker: 00:27:57

(5.13 - (-0.18)).

Speaker: 00:28:00

The length of the interval that contains the middle 50% of the returns around the median is again

Speaker: 00:28:07

about twice as large for APPL.

Speaker: 00:28:10

•The 10th percentile of the monthly return for APPL has been -7.35%, and that for NASDAQ-100

Speaker: 00:28:20

has been -5.89%, so in the worst 10% of months the return was at or below these values.

Speaker: 00:28:23

The 90th percentile for APPL has been 12.27% a month, and that

Speaker: 00:28:31

for NASDAQ-100 has been 7.37%, so in the best 10% of months the return was at or above these values.

Speaker: 00:28:36

The comparison of these descriptive statistics reveals that monthly returns are a lot higher

Speaker: 00:28:41

for APPL investment, but it shows substantially higher variability or risk.

Speaker: 00:28:47

This is a well-known principle in finance - a higher return is compensation for taking

Speaker: 00:28:53

a higher risk.
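The arithmetic behind this comparison can be checked directly from the table. The sketch below copies the table values and recomputes the interquartile ranges, the gap between mean and median, and the ratios of the two spread measures; the dictionary `table` and the helper names are hypothetical, chosen only for this illustration.

```python
# Values copied from the table in the text (monthly returns, in percent).
table = {
    "NASDAQ-100": {"mean": 2.01, "median": 2.68, "stdev": 4.92, "q1": -0.18, "q3": 5.13},
    "APPL":       {"mean": 3.02, "median": 5.00, "stdev": 8.34, "q1": -1.66, "q3": 9.25},
}

def iqr(row):
    """Interquartile range: 3rd quartile minus 1st quartile."""
    return row["q3"] - row["q1"]

def mean_minus_median(row):
    """A negative value (mean below median) is a rough sign of a left skew."""
    return row["mean"] - row["median"]

for name, row in table.items():
    print(f"{name}: IQR = {iqr(row):.2f}, mean - median = {mean_minus_median(row):.2f}")

# How much more spread out is APPL than NASDAQ-100?
stdev_ratio = table["APPL"]["stdev"] / table["NASDAQ-100"]["stdev"]
iqr_ratio = iqr(table["APPL"]) / iqr(table["NASDAQ-100"])
print(f"stdev ratio = {stdev_ratio:.2f}, IQR ratio = {iqr_ratio:.2f}")
```

Both gaps between mean and median come out negative, and the gap is wider for APPL, which is consistent with the heavier left skew described above.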

Speaker: 00:28:56

The above plots present the histograms for the two investments.

Speaker: 00:29:00

The larger variability of APPL, with a heavier skew to the left of the distribution than

Speaker: 00:29:06

NASDAQ-100, is clear.

Speaker: 00:29:08

While the summary statistics tell the difference with the numbers, these histograms can make

Speaker: 00:29:13

a visual comparison.

Speaker: 00:29:16

To make a further visual comparison, another method of visualisation called the “Box-Whisker”

Speaker: 00:29:23

plot is introduced.

Speaker: 00:29:24

It plots the mean, the median, the 1st quartile, the 3rd quartile, maximum and minimum, along

Speaker: 00:29:31

with outliers.

Speaker: 00:29:33

The box in the middle is based on the 3rd quartile and 1st quartile, and the height

Speaker: 00:29:38

of the box represents the interquartile range.

Speaker: 00:29:41

Outliers are determined by a certain criterion (e.g., the outliers are defined as those lying

Speaker: 00:29:47

three standard deviations away from the mean).
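The outlier criterion given as an example here can be sketched in code as follows. The function name `outliers_by_stdev` and the data are hypothetical. Note that many plotting tools use a different default rule (points more than 1.5 interquartile ranges beyond the quartiles), so this is only one possible criterion.

```python
import statistics

def outliers_by_stdev(values, k=3.0):
    """Return the values lying more than k sample standard deviations from the mean."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [v for v in values if abs(v - mean) > k * sd]

# Hypothetical monthly returns in percent, with one extreme month (illustration only).
returns = [1.0, 2.0, 1.5, 2.5, 0.5, 1.8, 2.2, 1.2, 0.8, 1.6,
           2.4, 1.4, 1.9, 1.1, 2.1, 1.7, 0.9, 1.3, 2.3, -30.0]

flagged = outliers_by_stdev(returns)
print("Outliers:", flagged)
```

One caveat with this rule is that an extreme value inflates the standard deviation it is judged against, so in small samples it can fail to flag points that look unusual by eye.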

Speaker: 00:29:51

Again, the APPL investment gives a substantially higher median return per month, but its monthly

Speaker: 00:29:57

variability is much higher than NASDAQ-100.

Speaker: 00:30:02

Which investment to choose depends on how risk-averse or risk-tolerant the investor

Speaker: 00:30:08

is.

Speaker: 00:30:09

If you are a Braveheart and enjoy a roller coaster ride, investing in APPL is not a bad

Speaker: 00:30:15

choice; otherwise, stick to the NASDAQ-100 for a safer option.

Speaker: 00:30:21

Speaker: 00:30:23

Normal distribution.

Speaker: 00:30:27

Figure 2 presents a distribution of the sample points using a histogram.

Speaker: 00:30:32

In statistics, distribution is an important feature for both the sample and the population.

Speaker: 00:30:38

While we can observe a distribution of the sample as in Figure 2, that of the population

Speaker: 00:30:43

is often unknown and not observable.

Speaker: 00:30:48

Understanding the features of a distribution is one of the fundamental questions of statistics.

Speaker: 00:30:53

For example, what is the chance that investing in the NASDAQ-100 index will provide a return

Speaker: 00:31:00

greater than 2%?

Speaker: 00:31:02

What proportion of the households in California has a lower annual income than $50,000?

Speaker: 00:31:08

We can only guess using the distribution of the sample we observe.

Speaker: 00:31:12

Again, if the sample is a fair representation of the population, the distribution of the

Speaker: 00:31:17

sample can well reflect the distribution of the population.

Speaker: 00:31:22

On the other hand, there are several known distributions in statistics where the probability

Speaker: 00:31:27

can be calculated using the given values of the parameters, such as the mean and standard

Speaker: 00:31:33

deviation.

Speaker: 00:31:34

Among them, the most fundamental and popular is the normal distribution.

Speaker: 00:31:39

It is also a key distribution in the inferential statistics to be discussed in the next chapter.

Speaker: 00:31:47

Normal distribution is a bell-shaped distribution, symmetric around its mean (or median), and

Speaker: 00:31:54

the probability at any point of the distribution is known.

Speaker: 00:31:58

A normal distribution with a mean μ and standard deviation of σ is written as N(μ, σ).

Speaker: 00:32:09

In the special case of the mean being zero and the standard deviation 1, it is called

Speaker: 00:32:14

the standard normal distribution, and it is denoted as N(0, 1).

Speaker: 00:32:21

Figure 3 is a screenshot from an online calculator.

Speaker: 00:32:24

Figure 3 - Standard normal distribution.

Speaker: 00:32:26

Given the values of the mean and standard deviation, any probability between an interval

Speaker: 00:32:31

can be calculated.

Speaker: 00:32:32

Figure 3 shows a normal distribution with zero mean and a standard deviation of 1 (called

Speaker: 00:32:39

the standard normal distribution).

Speaker: 00:32:42

Suppose your return (in percentage) from an investment follows the standard normal distribution.

Speaker: 00:32:49

The probability that your return is between -1.96% and 1.96% is calculated to be 0.95

Speaker: 00:32:59

(dark area on the bell illustration).

Speaker: 00:33:03

This also means the probability of the tail areas is 5% (white area on the bell illustration).

Speaker: 00:33:11

Your investment return can be lower than -1.96% with the probability of 0.025 and can take

Speaker: 00:33:20

a value greater than 1.96% with the probability 0.025.
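These probabilities do not require an online calculator. As a minimal sketch, Python's standard library provides `statistics.NormalDist`, whose `cdf` method gives the probability of falling below a value; the variable names are chosen only for this illustration.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
z = NormalDist(mu=0.0, sigma=1.0)

central = z.cdf(1.96) - z.cdf(-1.96)   # probability of a return between -1.96% and 1.96%
lower_tail = z.cdf(-1.96)              # probability of a return below -1.96%
upper_tail = 1.0 - z.cdf(1.96)         # probability of a return above 1.96%

print(f"central area = {central:.3f}")
print(f"lower tail   = {lower_tail:.3f}")
print(f"upper tail   = {upper_tail:.3f}")
```

The three areas add up to 1, with about 0.95 in the middle and about 0.025 in each tail.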

Speaker: 00:33:26

Let’s assume the household income in California follows a normal distribution with a mean of $75,000 and

Speaker: 00:33:36

a standard deviation of $30,000 (see Figure 4).

Speaker: 00:33:40

Then, the household income distribution of California is represented by the bell curve

Speaker: 00:33:45

in Figure 4.

Speaker: 00:33:47

The probability that a household income is less than $50,000 or the proportion of the

Speaker: 00:33:54

households with income less than $50,000 is represented by the dark area in the distribution,

Speaker: 00:34:01

which is 0.20 approximately.

Speaker: 00:34:03

In other words, if you pick a household at random, you have a 0.20 chance to bump into

Speaker: 00:34:12

one with an income less than $50,000.

Speaker: 00:34:16

This also means the chance of a randomly selected household having an income higher than $50,000

Speaker: 00:34:23

is around 0.80 (= 1 - 0.2023).

Speaker: 00:34:26

Figure 4 - Application of a normal distribution.
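The household income example can be reproduced the same way. The sketch below uses the mean and standard deviation assumed in the text ($75,000 and $30,000); the variable names are hypothetical.

```python
from statistics import NormalDist

# Assumed in the text: income is normal with mean $75,000 and standard deviation $30,000.
income = NormalDist(mu=75_000, sigma=30_000)

p_below = income.cdf(50_000)   # proportion of households with income under $50,000
p_above = 1.0 - p_below        # proportion with income above $50,000

print(f"P(income < $50,000) = {p_below:.4f}")
print(f"P(income > $50,000) = {p_above:.4f}")
```

Equivalently, $50,000 lies (50,000 - 75,000) / 30,000, or about 0.83, standard deviations below the mean, and the area under the standard normal curve to the left of that point is roughly 0.20.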

Speaker: 00:34:32

Speaker: 00:34:33

Checking the normality of a distribution.

Speaker: 00:34:38

Normal distribution is the most fundamental and popular distribution in statistics, and

Speaker: 00:34:45

it is widely used as a “benchmark” distribution or as an “approximation” to the true distribution

Speaker: 00:34:51

when it is unknown.

Speaker: 00:34:53

Being a benchmark or approximation means it may be sometimes useful, but sometimes not,

Speaker: 00:35:00

depending on the context and situation.

Speaker: 00:35:03

Figure 5 is the histogram we have seen in Figure 2, the returns from NASDAQ-100 investment,

Speaker: 00:35:10

overlaid with the normal distribution with the same mean and standard deviation values

Speaker: 00:35:15

of the returns.

Speaker: 00:35:18

While the histogram shows a similar shape to the normal distribution, with near symmetry

Speaker: 00:35:23

and bell curve, the fine details are not impressively consistent with the normal distribution.

Speaker: 00:35:30

While an approximation by a normal distribution to a stock return distribution is sometimes

Speaker: 00:35:35

used, it is generally accepted that a stock return distribution shows a clear departure

Speaker: 00:35:41

from a normal distribution.

Speaker: 00:35:42

Figure 5 - Histogram of the NASDAQ-100 and APPL returns and a normal curve.

Speaker: 00:35:48

The Q-Q (quantile-quantile) plot provides a clearer way of checking the normality of

Speaker: 00:35:53

a sample distribution using a graphical method.

Speaker: 00:35:57

It connects the sample quantiles (or percentiles) with the (theoretical) quantiles from the

Speaker: 00:36:05

normal distribution.

Speaker: 00:36:06

If the sample follows the standard normal distribution, then its percentiles should

Speaker: 00:36:12

match the percentiles from the normal distribution with the same mean and standard deviation.

Speaker: 00:36:19

The 97.5th percentile from the sample distribution (which is normal) should match 1.96, and

Speaker: 00:36:27

the 50th percentile from the sample distribution should be 0, which is the 50th percentile

Speaker: 00:36:33

from the normal distribution.

Speaker: 00:36:36

An example of the Q-Q plot is given here.

Speaker: 00:36:39

The grid lines are at (-1.96, 0, 1.96) for both axes, which are the 2.5th, 50th, and

Speaker: 00:36:52

97.5th percentiles from the standard normal distribution.

Speaker: 00:36:57

The y-axis (vertical) represents the sample quantile, and the x-axis (horizontal) represents

Speaker: 00:37:05

the theoretical quantiles from the normal distribution.

Speaker: 00:37:09

At the grid lines (-1.96, 0, 1.96), the sample quantiles and the theoretical quantiles match exactly.

Speaker: 00:37:19

Hence, any sample that shows a Q-Q plot like the one above can be well approximated by

Speaker: 00:37:25

a normal distribution.
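The points of a Q-Q plot can be computed without any plotting library. The following is a minimal sketch, assuming the sample is standardized (mean subtracted, divided by the standard deviation) and that the i-th smallest of n values is matched with the (i - 0.5)/n quantile of the standard normal distribution, which is one common convention among several. The function name `qq_points` and the data are hypothetical.

```python
from statistics import NormalDist, mean, stdev

def qq_points(sample):
    """Pair each theoretical standard normal quantile (x) with the
    corresponding standardized sample quantile (y)."""
    n = len(sample)
    m, s = mean(sample), stdev(sample)
    ordered = sorted((x - m) / s for x in sample)          # standardized sample quantiles
    std_normal = NormalDist()                              # mean 0, standard deviation 1
    theoretical = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, ordered))

# Hypothetical monthly returns in percent, with a long left tail (illustration only).
returns = [2.5, -1.0, 4.2, 3.1, -12.0, 5.5, 2.9, 0.4, 6.8, -2.3, 3.6, 1.9]

for x, y in qq_points(returns):
    print(f"theoretical {x:6.2f}   sample {y:6.2f}")
```

If the sample is close to normal, the printed pairs are nearly equal, so the points fall near the 45-degree line; large gaps at either end signal tails that a normal distribution does not capture.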

Speaker: 00:37:28

Figure 6 - Q-Q plots for NASDAQ-100 and APPL returns. The grid lines are at (-1.96, 0, 1.96) for both axes.

Speaker: 00:37:31

Figure 6 presents the Q-Q plots for the NASDAQ-100 and APPL returns.

Speaker: 00:37:35

The NASDAQ-100 return shows a reasonable match with the normal quantiles,

Speaker: 00:37:41

while the quantiles of the APPL return show substantial departures from the normal quantiles.

Speaker: 00:37:47

This indicates that, while the NASDAQ-100 returns may be approximated by a normal distribution

Speaker: 00:37:54

with reasonable accuracy, a normal distribution will be a poor approximation to the APPL return

Speaker: 00:38:01

distribution.

Speaker: 00:38:03

Speaker: 00:38:05

Concluding remarks.

Speaker: 00:38:08

As an opening chapter, the basic concepts and descriptive measures of statistics were

Speaker: 00:38:13

discussed with the following keywords -

Speaker: 00:38:16

•Sample and population.

Speaker: 00:38:18

•Mean and Median.

Speaker: 00:38:20

•Standard deviation and Inter-quartile range.

Speaker: 00:38:24

•Percentile or quartiles.

Speaker: 00:38:26

•Histogram, Time plots, Q-Q plot, Box-Whisker plot.

Speaker: 00:38:33

•Normal distribution.

Speaker: 00:38:36

If you understand the listed concepts and methods, and you can apply them to real-world

Speaker: 00:38:42

situations, you already have made big steps into the world of statistical thinking!

Speaker: 00:38:48

You can produce these statistics using popular tools such as Excel.

Speaker: 00:38:57

This has been the art of statistical thinking, detect misinformation, understand the world

Speaker: 00:39:03

deeper

Speaker: 00:39:04

and make better decisions. Advanced Thinking Skills Book 3. Written by Albert Rutherford,

Speaker: 00:39:11

Speaker: 00:39:20

Voice over Work - An Audiobook Sampler


9th Apr 2024


About your host

Russell Newton