NumPy is a Python library for working with arrays. It also provides functions for linear algebra, Fourier transforms, and matrices. NumPy stands for Numerical Python. Python lists can serve the purpose of arrays, but they are slow to process. NumPy aims to provide an array object that is up to 50x faster than traditional Python lists.
Code Example 1: Create a NumPy array from a list

    import numpy
    arr = numpy.array([["A", "B"], ["C", "D"]])
    print(arr)
    # Output:
    # [['A' 'B']
    #  ['C' 'D']]

Code Example 2: Create a one-dimensional array

    import numpy as np
    arr = np.array([1, 2, 3, 4, 5])
    print(arr)
    # Output: [1 2 3 4 5]

Code Example 3: Create a 1D array filled with 0s

    import numpy as np
    arr = np.zeros(5)
    print(arr)
    # Output: [0. 0. 0. 0. 0.]

Code Example 4: Create a 2D array filled with 0s

    import numpy as np
    arr = np.zeros((3, 4))
    print(arr)
    # Output:
    # [[0. 0. 0. 0.]
    #  [0. 0. 0. 0.]
    #  [0. 0. 0. 0.]]

Code Example 5: Create an array filled with 1s

    import numpy as np
    ones_array = np.ones((3, 3))
    print(ones_array)
    # Output:
    # [[1. 1. 1.]
    #  [1. 1. 1.]
    #  [1. 1. 1.]]

Code Example 6: Create an array filled with 7s

    import numpy as np
    arr = np.full((2, 2), 7)
    print(arr)
    # Output:
    # [[7 7]
    #  [7 7]]

Code Example 7: Create an array with arange()

    import numpy as np
    # Arguments: start, stop (exclusive), step
    range_array = np.arange(0, 10, 2)
    print(range_array)
    # Output: [0 2 4 6 8]
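The creation functions above differ mainly in the shape and element type of the array they return. The following is a minimal sketch for inspecting those properties; the variable names are chosen here for illustration only.

```python
import numpy as np

# Arrays built the same way as in the examples above
letters = np.array([["A", "B"], ["C", "D"]])
zeros_2d = np.zeros((3, 4))
sevens = np.full((2, 2), 7)
evens = np.arange(0, 10, 2)

# shape: size along each dimension; ndim: number of dimensions
print(letters.shape, letters.ndim)    # (2, 2) 2
print(zeros_2d.shape, zeros_2d.ndim)  # (3, 4) 2
print(evens.shape, evens.ndim)        # (5,) 1

# dtype: element type; zeros() and ones() default to floating point,
# which is why their printed values carry a trailing dot
print(zeros_2d.dtype)                           # float64
print(np.issubdtype(sevens.dtype, np.integer))  # True
```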
In Machine Learning (and in mathematics) there are often three values that interest us:
Mean - The average value
Median - The mid point value
Mode - The most common value
Example: We have registered the speed of 13 cars:
speed = [99,86,87,88,111,86,103,87,94,78,77,85,86]
What is the average, the middle, or the most common speed value?
Mean
The mean value is the average value.
To calculate the mean, find the sum of all values, and divide the sum by the number of values:
(99+86+87+88+111+86+103+87+94+78+77+85+86) / 13 = 89.77
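The same arithmetic can be reproduced in plain Python without any library, as a quick check of the hand calculation. This is only a sketch of the sum-and-divide rule described above.

```python
speed = [99, 86, 87, 88, 111, 86, 103, 87, 94, 78, 77, 85, 86]

total = sum(speed)    # sum of all values: 1167
count = len(speed)    # number of values: 13
mean = total / count

print(round(mean, 2))  # 89.77
```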
Median
The median value is the value in the middle, after you have sorted all the values:
77, 78, 85, 86, 86, 86, 87, 87, 88, 94, 99, 103, 111
It is important that the numbers are sorted before you can find the median.
Step 1: Order the data. Arrange the given numbers in ascending order: 77, 78, 85, 86, 86, 86, 87, 87, 88, 94, 99, 103, 111
Step 2: Determine the number of data points. Count the total number of values in the dataset: N = 13.
Step 3: Calculate the median position. Since the number of data points (N) is odd, the median is the middle value, found at position (N + 1) / 2 = (13 + 1) / 2 = 7.
Step 4: Identify the median value. The value at the 7th position in the ordered list is 87.
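The four steps can be sketched in plain Python as follows. The branch for an even number of values is not covered in the text above; it is included here as an assumption, following the common convention of averaging the two middle values.

```python
speed = [99, 86, 87, 88, 111, 86, 103, 87, 94, 78, 77, 85, 86]

# Step 1: order the data
ordered = sorted(speed)

# Step 2: count the data points
n = len(ordered)

if n % 2 == 1:
    # Step 3: 1-based position of the middle value
    position = (n + 1) // 2
    # Step 4: Python lists are 0-based, so subtract 1
    median = ordered[position - 1]
else:
    # Assumption (not from the text): average the two middle values
    upper = n // 2
    median = (ordered[upper - 1] + ordered[upper]) / 2

print(median)  # 87
```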
Mode
The mode is the value that appears most often:
99, 86, 87, 88, 111, 86, 103, 87, 94, 78, 77, 85, 86
The value 86 appears three times, more than any other value, so the mode is 86.
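Counting occurrences by hand is error-prone for longer lists. One possible way to do it with the Python standard library is `collections.Counter`, sketched below; if several values tie for the highest count, this sketch reports only one of them.

```python
from collections import Counter

speed = [99, 86, 87, 88, 111, 86, 103, 87, 94, 78, 77, 85, 86]

# Count how many times each value occurs
counts = Counter(speed)

# most_common(1) returns a list holding the (value, count) pair
# with the highest count
mode_value, mode_count = counts.most_common(1)[0]

print(mode_value, mode_count)  # 86 3
```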
Code Example 8: Calculate the mean with NumPy

    import numpy
    speed = [99,86,87,88,111,86,103,87,94,78,77,85,86]
    x = numpy.mean(speed)
    print(x)
    # Output: 89.76923076923077

Code Example 9: Calculate the median with NumPy

    import numpy
    speed = [99,86,87,88,111,86,103,87,94,78,77,85,86]
    x = numpy.median(speed)
    print(x)
    # Output: 87.0

Code Example 10: Use SciPy to find the mode

    from scipy import stats
    speed = [99,86,87,88,111,86,103,87,94,78,77,85,86]
    x = stats.mode(speed)
    print(x)
    # Output (older SciPy versions): ModeResult(mode=array([86]), count=array([3]))
    # Recent SciPy versions report scalar values for mode and count instead of arrays.
Standard deviation is a number that describes how spread out the values are.
A low standard deviation means that most of the numbers are close to the mean (average) value.
A high standard deviation means that the values are spread out over a wider range.
Example: Let's consider the speed of 7 cars:
speed = [86,87,88,86,87,85,86]
Find the Mean
Sum the values and divide by the number of data points (N=7):
Mean = (86 + 87 + 88 + 86 + 87 + 85 + 86) / 7 ≈ 86.4
Calculate the Squared Difference from the Mean for each data point
(86 - 86.4)² = (-0.4)² = 0.16
(87 - 86.4)² = (0.6)² = 0.36
(88 - 86.4)² = (1.6)² = 2.56
(86 - 86.4)² = (-0.4)² = 0.16
(87 - 86.4)² = (0.6)² = 0.36
(85 - 86.4)² = (-1.4)² = 1.96
(86 - 86.4)² = (-0.4)² = 0.16
Sum the Squared Differences
Sum of squares = 0.16 + 0.36 + 2.56 + 0.16 + 0.36 + 1.96 + 0.16 = 5.72
Divide by the Number of Data Points minus 1 (N-1) for a sample
This gives the variance. For a sample, we divide by 6 (7-1):
Variance = 5.72 / 6 ≈ 0.9533
Take the Square Root
The square root of the variance is the standard deviation:
Standard Deviation ≈ √0.9533 ≈ 0.9764
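The steps above can be sketched in plain Python. Because the code keeps the unrounded mean (about 86.4286 rather than 86.4), its result differs slightly from the hand calculation in the last decimals.

```python
import math

speed = [86, 87, 88, 86, 87, 85, 86]
n = len(speed)

# Find the mean
mean = sum(speed) / n

# Squared difference from the mean for each data point
squared_diffs = [(x - mean) ** 2 for x in speed]

# Divide the sum by N - 1 to get the sample variance
sample_variance = sum(squared_diffs) / (n - 1)

# The square root of the variance is the standard deviation
sample_std = math.sqrt(sample_variance)

print(round(mean, 4))             # 86.4286
print(round(sample_variance, 4))  # 0.9524
print(round(sample_std, 4))       # 0.9759
```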
Let us do the same with a selection of numbers with a wider range:
speed = [32,111,138,28,59,77,97]
The standard deviation is 37.85 (computed with the population formula, which divides by N rather than N-1).
This means that most of the values lie within 37.85 of the mean value, which is 77.4.
As you can see, a higher standard deviation indicates that the values are spread out over a wider range.
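To see what "most of the values" means for this dataset, the sketch below counts how many speeds lie within one standard deviation of the mean. It uses NumPy's default (population) standard deviation, and the variable names are illustrative.

```python
import numpy as np

speed = np.array([32, 111, 138, 28, 59, 77, 97])

mean = speed.mean()
std = speed.std()  # population standard deviation (divides by N)

# Keep only the values whose distance from the mean is at most one std
within = speed[np.abs(speed - mean) <= std]

print(within.tolist())                # [111, 59, 77, 97]
print(len(within), "of", len(speed))  # 4 of 7
```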
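Code Example 11: Standard deviation using NumPy

    import numpy
    speed = [86,87,88,86,87,85,86]
    # numpy.std() divides by N by default (population standard deviation)
    x = numpy.std(speed)
    print(x)
    # Output: 0.9035079029052513

Note that this result is smaller than the sample standard deviation of about 0.976 calculated by hand above, because numpy.std() divides by N by default instead of N-1.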
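NumPy's `std()` accepts a `ddof` ("delta degrees of freedom") argument that controls the divisor, which is N - ddof. The sketch below compares the default population result with the sample result for the same seven speeds.

```python
import numpy as np

speed = [86, 87, 88, 86, 87, 85, 86]

# Default: ddof=0, so the divisor is N (population standard deviation)
population_std = np.std(speed)

# ddof=1 changes the divisor to N - 1 (sample standard deviation)
sample_std = np.std(speed, ddof=1)

print(round(float(population_std), 4))  # 0.9035
print(round(float(sample_std), 4))      # 0.9759
```

Which one to use depends on whether the data is the whole population of interest or a sample drawn from a larger one.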
Variance is another number that indicates how spread out the values are.
In fact, if you take the square root of the variance, you get the standard deviation!
Or the other way around, if you multiply the standard deviation by itself, you get the variance!
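This relationship can be checked numerically. The sketch below uses `numpy.isclose` rather than `==`, since floating-point rounding may make the two sides differ in the last digits.

```python
import numpy as np

speed = [32, 111, 138, 28, 59, 77, 97]

variance = np.var(speed)
std = np.std(speed)

# Square root of the variance gives the standard deviation
print(np.isclose(np.sqrt(variance), std))  # True

# Standard deviation multiplied by itself gives the variance
print(np.isclose(std * std, variance))     # True
```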
To calculate the variance, proceed as follows:
Find the variance of 32, 111, 138, 28, 59, 77, 97
1. Find the mean:
(32+111+138+28+59+77+97) / 7 = 77.4
2. For each value: find the difference from the mean:
32 - 77.4 = -45.4
111 - 77.4 = 33.6
138 - 77.4 = 60.6
28 - 77.4 = -49.4
59 - 77.4 = -18.4
77 - 77.4 = -0.4
97 - 77.4 = 19.6
3. For each difference: find the square value:
(-45.4)² = 2061.16
(33.6)² = 1128.96
(60.6)² = 3672.36
(-49.4)² = 2440.36
(-18.4)² = 338.56
(-0.4)² = 0.16
(19.6)² = 384.16
4. The variance is the average of these squared differences (here the sum is divided by N = 7, which gives the population variance):
(2061.16+1128.96+3672.36+2440.36+338.56+0.16+384.16) / 7 ≈ 1432.2
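The four steps translate directly into plain Python. This sketch keeps the unrounded mean, so its result differs from the hand calculation only in the decimals.

```python
speed = [32, 111, 138, 28, 59, 77, 97]
n = len(speed)

# 1. Find the mean
mean = sum(speed) / n

# 2. For each value, find the difference from the mean
differences = [x - mean for x in speed]

# 3. Square each difference
squared = [d ** 2 for d in differences]

# 4. Average the squared differences (divide by N)
variance = sum(squared) / n

print(round(mean, 1))      # 77.4
print(round(variance, 2))  # 1432.24
```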
Code Example 12: Variance using NumPy

    import numpy
    speed = [32,111,138,28,59,77,97]
    x = numpy.var(speed)
    print(x)
    # Output: 1432.2448979591834
Percentiles are used in statistics to give you a number below which a given percentage of the values fall.
Example: Let's say we have an array that contains the ages of every person living on a street.
ages = [5,31,43,48,50,41,7,11,15,39]
What is the 75th percentile? The answer is 42.5, meaning that roughly 75% of the people are younger than 42.5.
The NumPy module has a function for finding a specified percentile:
Code Example 13: Find the 75th percentile with NumPy

    import numpy
    ages = [5,31,43,48,50,41,7,11,15,39]
    x = numpy.percentile(ages, 75)
    print(x)
    # Output: 42.5
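The result 42.5 is not one of the ages in the list because NumPy, by default, interpolates linearly between the two nearest sorted values. The sketch below reproduces that calculation by hand; `percentile_linear` is a hypothetical helper written for this illustration, not a NumPy function.

```python
def percentile_linear(values, q):
    """Percentile q (0-100) using linear interpolation between sorted values."""
    ordered = sorted(values)
    # Fractional 0-based position of the percentile in the sorted list
    rank = (q / 100) * (len(ordered) - 1)
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = rank - lower
    # Move the fractional distance from the lower to the upper neighbor
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])

ages = [5, 31, 43, 48, 50, 41, 7, 11, 15, 39]

# Sorted ages: 5, 7, 11, 15, 31, 39, 41, 43, 48, 50
# rank = 0.75 * 9 = 6.75, so the result lies 75% of the way from 41 to 43
result = percentile_linear(ages, 75)
print(result)  # 42.5
```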