PythonPlaza.com

NumPy

NumPy is a Python library used for working with arrays. It also has functions for working in the domains of linear algebra, Fourier transforms, and matrices. NumPy stands for Numerical Python. In Python we have lists that serve the purpose of arrays, but they are slow to process. NumPy aims to provide an array object that is up to 50x faster than traditional Python lists.
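As a minimal sketch of what a NumPy array looks like in practice, the snippet below turns a list into an array and doubles every element in one expression. The sample numbers are made up purely for illustration.

```python
import numpy as np

# Hypothetical sample data, chosen only for illustration.
values = [1, 2, 3, 4, 5]

# Convert the Python list into a NumPy array.
arr = np.array(values)

# Arithmetic is applied to every element at once, with no explicit loop.
doubled = arr * 2

print(type(arr))
print(doubled.tolist())  # [2, 4, 6, 8, 10]
```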

Mean, Median, and Mode

In Machine Learning (and in mathematics) there are often three values that interest us:
Mean - The average value
Median - The midpoint value
Mode - The most common value

Example: We have registered the speed of 13 cars:
speed = [99,86,87,88,111,86,103,87,94,78,77,85,86]

What is the average, the middle, or the most common speed value?

Mean
The mean value is the average value.
To calculate the mean, find the sum of all values, and divide the sum by the number of values:
(99+86+87+88+111+86+103+87+94+78+77+85+86) / 13 = 89.77
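The calculation above can be sketched in Python, once by hand and once with NumPy's mean() function, to check that both give the same result:

```python
import numpy as np

speed = [99, 86, 87, 88, 111, 86, 103, 87, 94, 78, 77, 85, 86]

# By hand: sum of all values divided by the number of values.
manual_mean = sum(speed) / len(speed)

# The same calculation using NumPy.
numpy_mean = np.mean(speed)

print(round(manual_mean, 2))        # 89.77
print(round(float(numpy_mean), 2))  # 89.77
```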

Median
The median value is the value in the middle, after you have sorted all the values. It is important that the numbers are sorted before you can find the median.
Step 1: Order the data. Arrange the given numbers in ascending order:
77, 78, 85, 86, 86, 86, 87, 87, 88, 94, 99, 103, 111
Step 2: Determine the number of data points. Count the total number of values in the dataset. There are N = 13 numbers.
Step 3: Calculate the median position. Since the number of data points (N) is odd, the median is the middle value, found at position (N + 1) / 2.
The position is (13 + 1) / 2 = 7.
Step 4: Identify the median value. The value at the 7th position in the ordered list is 87.
If N were even, the median would instead be the average of the two middle values.
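The four steps above can be sketched in Python as follows. The manual part assumes an odd number of values, as in this example, and NumPy's median() function is used as a cross-check:

```python
import numpy as np

speed = [99, 86, 87, 88, 111, 86, 103, 87, 94, 78, 77, 85, 86]

# Step 1: order the data.
ordered = sorted(speed)

# Step 2: count the data points.
n = len(ordered)

# Step 3: 1-based position of the middle value (valid because n is odd).
position = (n + 1) // 2

# Step 4: read off the value (Python lists are 0-based, hence the -1).
manual_median = ordered[position - 1]

# The same result using NumPy, which sorts internally.
numpy_median = np.median(speed)

print(position)             # 7
print(manual_median)        # 87
print(float(numpy_median))  # 87.0
```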

Mode
The mode is the value that appears most often:
In 99, 86, 87, 88, 111, 86, 103, 87, 94, 78, 77, 85, 86 the value 86 appears three times, more than any other value, so the mode is 86.
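One possible way to find the mode with NumPy is to count how often each distinct value occurs and pick the value with the highest count. This is only a sketch: as far as we know NumPy has no dedicated mode function, and when several values tie for the highest count this approach simply returns the smallest of them.

```python
import numpy as np

speed = [99, 86, 87, 88, 111, 86, 103, 87, 94, 78, 77, 85, 86]

# np.unique returns the sorted distinct values and, with
# return_counts=True, how many times each one occurs.
values, counts = np.unique(speed, return_counts=True)

# The mode is the value whose count is largest.
mode = values[np.argmax(counts)]

print(int(mode))          # 86
print(int(counts.max()))  # 3
```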

Standard Deviation

Standard deviation is a number that describes how spread out the values are.
A low standard deviation means that most of the numbers are close to the mean (average) value.
A high standard deviation means that the values are spread out over a wider range.

Example: Let's consider the speeds of 7 cars:
speed = [86,87,88,86,87,85,86]
Find the Mean
Sum the values and divide by the number of data points (N=7):
Mean = (86 + 87 + 88 + 86 + 87 + 85 + 86) / 7 = 605 / 7 ≈ 86.4
Calculate the Squared Difference from the Mean for each data point
(86 - 86.4)² = (-0.4)² = 0.16
(87 - 86.4)² = (0.6)² = 0.36
(88 - 86.4)² = (1.6)² = 2.56
(86 - 86.4)² = (-0.4)² = 0.16
(87 - 86.4)² = (0.6)² = 0.36
(85 - 86.4)² = (-1.4)² = 1.96
(86 - 86.4)² = (-0.4)² = 0.16
Sum the Squared Differences
Sum of squares = 0.16 + 0.36 + 2.56 + 0.16 + 0.36 + 1.96 + 0.16 = 5.72
Divide by the Number of Data Points minus 1 (N-1) for a sample
This gives the variance. For a sample, we divide by 6 (7-1):
Variance = 5.72 / 6 ≈ 0.9533
Take the Square Root
The square root of the variance is the standard deviation:
Standard Deviation ≈ √0.9533 ≈ 0.9764
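The same steps can be sketched in Python. The ddof=1 argument tells NumPy's std() to divide by N-1, matching the sample formula used above. Note that the code works with the unrounded mean (about 86.4286), so its result (about 0.9759) differs slightly from the hand calculation, which rounded the mean to 86.4.

```python
import numpy as np

speed = [86, 87, 88, 86, 87, 85, 86]

# Find the mean (not rounded here, unlike the hand calculation).
mean = np.mean(speed)

# Squared difference from the mean for each data point.
squared_diffs = [(x - mean) ** 2 for x in speed]

# Sample variance: sum of squared differences divided by N - 1.
sample_variance = sum(squared_diffs) / (len(speed) - 1)

# Standard deviation: square root of the variance.
sample_std = sample_variance ** 0.5

# The same result in one call; ddof=1 selects the N - 1 divisor.
numpy_std = np.std(speed, ddof=1)

print(round(float(sample_std), 4))
print(round(float(numpy_std), 4))
```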

Let us do the same with a selection of numbers with a wider range:
speed = [32,111,138,28,59,77,97]
The standard deviation is 37.85 (computed here by dividing by N rather than N-1).
This means that most of the values are within 37.85 of the mean value, which is 77.4.
As you can see, a higher standard deviation indicates that the values are spread out over a wider range.
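The value 37.85 is what NumPy's std() returns with its default settings, which divide by N (the population formula). A short sketch comparing it with the N-1 sample formula from the previous example:

```python
import numpy as np

speed = [32, 111, 138, 28, 59, 77, 97]

# Default: ddof=0, i.e. divide by N (population standard deviation).
population_std = np.std(speed)

# ddof=1 divides by N - 1 (sample standard deviation).
sample_std = np.std(speed, ddof=1)

print(round(float(population_std), 2))  # 37.85
print(round(float(sample_std), 2))      # 40.88
```

Which divisor is appropriate depends on whether the seven values are the whole population or a sample drawn from a larger one.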

Variance

Variance is another number that indicates how spread out the values are.
In fact, if you take the square root of the variance, you get the standard deviation!
Or the other way around, if you multiply the standard deviation by itself, you get the variance!
To calculate the variance you have to do as follows:

Find the variance of 32, 111, 138, 28, 59, 77, 97:
1. Find the mean:
(32+111+138+28+59+77+97) / 7 = 542 / 7 ≈ 77.4
2. For each value: find the difference from the mean:

32 - 77.4 = -45.4
111 - 77.4 = 33.6
138 - 77.4 = 60.6
28 - 77.4 = -49.4
59 - 77.4 = -18.4
77 - 77.4 = -0.4
97 - 77.4 = 19.6

3. For each difference: find the square value:

(-45.4)² = 2061.16
(33.6)² = 1128.96
(60.6)² = 3672.36
(-49.4)² = 2440.36
(-18.4)² = 338.56
(-0.4)² = 0.16
(19.6)² = 384.16

4. The variance is the average of these squared differences (dividing by N = 7, the population formula):

(2061.16+1128.96+3672.36+2440.36+338.56+0.16+384.16) / 7 = 10025.72 / 7 ≈ 1432.2
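The four steps can be sketched in Python and compared with NumPy's var() function, which by default also divides by N. The last line checks the relationship described earlier: the square root of the variance equals the standard deviation.

```python
import numpy as np

speed = [32, 111, 138, 28, 59, 77, 97]

# 1. Find the mean.
mean = sum(speed) / len(speed)

# 2. Difference from the mean for each value.
differences = [x - mean for x in speed]

# 3. Square each difference.
squared = [d ** 2 for d in differences]

# 4. Average the squared differences (divide by N).
manual_variance = sum(squared) / len(speed)

# The same calculation using NumPy (default ddof=0, i.e. divide by N).
numpy_variance = np.var(speed)

print(round(manual_variance, 1))        # 1432.2
print(round(float(numpy_variance), 1))  # 1432.2

# Square root of the variance equals the standard deviation.
print(bool(np.isclose(np.sqrt(numpy_variance), np.std(speed))))  # True
```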

Percentiles

A percentile is a value below which a given percentage of the values in a data set fall.
Example: Let's say we have an array that contains the ages of every person living on a street.
ages = [5,31,43,48,50,41,7,11,15,39]
What is the 75th percentile? The answer is 42.5, meaning that roughly 75% of the people are 42.5 or younger.
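To see where 42.5 comes from, here is a minimal pure-Python sketch. It assumes linear interpolation between the two nearest sorted values, which is one of several common percentile definitions. The helper name percentile_linear is hypothetical and not part of any library.

```python
ages = [5, 31, 43, 48, 50, 41, 7, 11, 15, 39]


def percentile_linear(data, percent):
    """Hypothetical helper: percentile via linear interpolation."""
    ordered = sorted(data)
    # Fractional 0-based position of the requested percentile.
    rank = (len(ordered) - 1) * percent / 100
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = rank - lower
    # Interpolate between the two neighbouring sorted values.
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


result = percentile_linear(ages, 75)
print(result)  # 42.5
```

Here the sorted ages are 5, 7, 11, 15, 31, 39, 41, 43, 48, 50. The 75th percentile sits at position 6.75, three quarters of the way from 41 to 43, which gives 42.5.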

The NumPy module has a method for finding the specified percentile: