#http://www.soef.nl/study/daai/2017/09/14/univariate-analysis/
waywardthingrunaway · 4 years ago
Code for Second Week Data Analysis
Code
import pandas

# Statics
DATA_SET = 'gapminder.csv'

# GapMinder indicators
GP_COUNTRY = 'country'
GP_INCOMEPERPERSON = 'incomeperperson'
GP_CO2EMISSIONS = 'co2emissions'
GP_URBANRATE = 'urbanrate'


def load_data_set(filename):
    """
    Loads a data set from the file system
    @param filename: the name of the CSV file that contains the data set
    """
    print('Loading data set "' + filename + '"...')
    # low_memory=False makes pandas read the file in one pass instead of in
    # chunks, which avoids mixed data types within a column
    return pandas.read_csv(filename, low_memory=False)


def load_gapminder_data_set():
    """
    Load the GapMinder data set and prepare the columns needed.
    """
    data = load_data_set(DATA_SET)

    # The number of observations
    print("Number of records: " + str(len(data)))

    # The number of variables
    print("Number of columns: " + str(len(data.columns)))

    # Convert the values of co2emissions, urbanrate and incomeperperson to numeric.
    # pandas.to_numeric replaces convert_objects, which was removed from pandas;
    # errors='coerce' turns values that cannot be parsed into NaN.
    for column in (GP_CO2EMISSIONS, GP_URBANRATE, GP_INCOMEPERPERSON):
        data[column] = pandas.to_numeric(data[column], errors='coerce')

    return data


def groupby(data_set, variables):
    """
    Get the distributed values of a variable of the data_set.
    @param data_set: the data set to examine.
    @param variables: the variable, or list of variables, of interest.
    @return a tuple of 2 pandas.core.series.Series objects where the first
            object is the absolute distribution over the values of
            the given variable(s) and the second is their percentages
            relative to the number of distinct values.
    """
    # Rows with a missing value (NaN) for the variable are left out of the groups
    counts = data_set.groupby(variables).size()
    # NOTE: len(counts) is the number of distinct values, not the number of rows,
    # so the percentages only sum to 100 when every value occurs exactly once.
    # Dividing by counts.sum() would give the share of the non-missing rows.
    return counts, counts * 100 / len(counts)


def print_distributions(data, variable):
    """
    Prints the distribution of the values of a specific variable.
    @param data: the data set to examine.
    @param variable: the variable of interest.
    """
    counts, percentages = groupby(data, variable)

    print("Counts for " + variable + ":")
    print(counts)
    print("Percentages for " + variable + ":")
    print(percentages)
    print("----------------------------")


if __name__ == "__main__":
    data = load_gapminder_data_set()

    print_distributions(data, GP_CO2EMISSIONS)
    print_distributions(data, GP_URBANRATE)
    print_distributions(data, GP_INCOMEPERPERSON)
Distribution of the co2emissions variable
When I look at the co2emissions indicator from GapMinder, I notice that every value is different. That is not surprising: the co2emissions variable contains the “Total amount of CO2 emission in metric tons since 1751”, and the chance that two countries have cumulatively emitted exactly the same amount of CO2 is very small. A frequency analysis on the absolute values is therefore not useful here. Every value occurs only once, except for the missing values of the countries for which no data is available. These countries are of no use for my research, so I have to remove them from the data set first. (I will come to that when I write my Python program to analyse the data.)
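A minimal sketch of how the countries without a co2emissions value could be removed, assuming the blanks have been turned into NaN first. The small table is made up for illustration; only the column names come from the GapMinder data set.

```python
import pandas

# Tiny stand-in for the GapMinder table; the rows are made up for illustration.
sample = pandas.DataFrame({
    'country': ['A', 'B', 'C', 'D'],
    'co2emissions': ['132000.0', ' ', '1045000.0', ' '],
})

# Values that cannot be parsed as numbers (here the blanks) become NaN.
sample['co2emissions'] = pandas.to_numeric(sample['co2emissions'], errors='coerce')

# Keep only the rows that have a co2emissions value.
with_co2 = sample.dropna(subset=['co2emissions'])

print(len(sample), "rows before,", len(with_co2), "rows after")
```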
From 1751 up to 2006 the United States had the largest cumulative amount of CO2 emissions.
Counts for co2emissions:
1.320000e+05    1
8.506667e+05    1
1.045000e+06    1
1.111000e+06    1
1.206333e+06    1
1.723333e+06    1
2.251333e+06    1
2.335667e+06    1
2.368667e+06    1
2.401667e+06    1
2.907667e+06    1
2.977333e+06    1
3.659333e+06    1
4.352333e+06    1
4.774000e+06    1
4.814333e+06    1
5.210333e+06    1
5.214000e+06    1
6.024333e+06    1
7.315000e+06    1
7.355333e+06    1
7.388333e+06    1
7.601000e+06    1
7.608333e+06    1
7.813667e+06    1
8.092333e+06    1
8.231667e+06    1
8.338000e+06    1
8.968667e+06    1
9.155667e+06    1
..
4.466084e+09    1
5.248815e+09    1
5.418886e+09    1
5.584766e+09    1
5.675630e+09    1
5.872119e+09    1
5.896389e+09    1
6.710202e+09    1
7.104137e+09    1
7.861553e+09    1
9.183548e+09    1
9.483023e+09    1
9.580226e+09    1
9.666892e+09    1
1.082253e+10    1
1.089703e+10    1
1.297009e+10    1
1.330450e+10    1
1.460985e+10    1
1.900045e+10    1
2.305360e+10    1
2.340457e+10    1
2.497905e+10    1
3.039132e+10    1
3.334163e+10    1
4.122955e+10    1
4.609221e+10    1
7.252425e+10    1
1.013862e+11    1
3.342209e+11    1
Length: 200, dtype: int64
Percentages for co2emissions:
1.320000e+05    0.5
8.506667e+05    0.5
1.045000e+06    0.5
1.111000e+06    0.5
1.206333e+06    0.5
1.723333e+06    0.5
2.251333e+06    0.5
2.335667e+06    0.5
2.368667e+06    0.5
2.401667e+06    0.5
2.907667e+06    0.5
2.977333e+06    0.5
3.659333e+06    0.5
4.352333e+06    0.5
4.774000e+06    0.5
4.814333e+06    0.5
5.210333e+06    0.5
5.214000e+06    0.5
6.024333e+06    0.5
7.315000e+06    0.5
7.355333e+06    0.5
7.388333e+06    0.5
7.601000e+06    0.5
7.608333e+06    0.5
7.813667e+06    0.5
8.092333e+06    0.5
8.231667e+06    0.5
8.338000e+06    0.5
8.968667e+06    0.5
9.155667e+06    0.5
..
4.466084e+09    0.5
5.248815e+09    0.5
5.418886e+09    0.5
5.584766e+09    0.5
5.675630e+09    0.5
5.872119e+09    0.5
5.896389e+09    0.5
6.710202e+09    0.5
7.104137e+09    0.5
7.861553e+09    0.5
9.183548e+09    0.5
9.483023e+09    0.5
9.580226e+09    0.5
9.666892e+09    0.5
1.082253e+10    0.5
1.089703e+10    0.5
1.297009e+10    0.5
1.330450e+10    0.5
1.460985e+10    0.5
1.900045e+10    0.5
2.305360e+10    0.5
2.340457e+10    0.5
2.497905e+10    0.5
3.039132e+10    0.5
3.334163e+10    0.5
4.122955e+10    0.5
4.609221e+10    0.5
7.252425e+10    0.5
1.013862e+11    0.5
3.342209e+11    0.5
Length: 200, dtype: float64
Distribution of the urbanrate variable
The values of the urbanrate variable represent the urban population as a percentage of the total population, so they lie between 0% and 100%. The data set stores them as floating-point numbers with two decimal places. With about 200 samples, a frequency distribution is not very useful here either: most values occur only once. The samples without a value for urbanrate are skipped. Here are the distributions of the absolute counts and their percentages:
Counts for urbanrate:
10.40     1
12.54     1
12.98     1
13.22     1
14.32     1
15.10     1
16.54     1
17.00     1
17.24     1
17.96     1
18.34     1
18.80     1
19.56     1
20.72     1
21.56     1
21.60     1
22.54     1
23.00     1
24.04     1
24.76     1
24.78     1
24.94     1
25.46     1
25.52     1
26.46     1
26.68     1
27.14     1
27.30     1
27.84     2
28.08     1
..
82.42     1
82.44     1
83.52     1
83.70     1
84.54     1
85.04     1
85.58     1
86.56     1
86.68     1
86.96     1
87.30     1
88.44     1
88.52     1
88.74     1
88.92     1
89.94     1
91.66     1
92.00     1
92.26     1
92.30     1
92.68     1
93.16     1
93.32     1
94.22     1
94.26     1
95.64     1
97.36     1
98.32     1
98.36     1
100.00    6
Length: 194, dtype: int64
Percentages for urbanrate:
10.40     0.515464
12.54     0.515464
12.98     0.515464
13.22     0.515464
14.32     0.515464
15.10     0.515464
16.54     0.515464
17.00     0.515464
17.24     0.515464
17.96     0.515464
18.34     0.515464
18.80     0.515464
19.56     0.515464
20.72     0.515464
21.56     0.515464
21.60     0.515464
22.54     0.515464
23.00     0.515464
24.04     0.515464
24.76     0.515464
24.78     0.515464
24.94     0.515464
25.46     0.515464
25.52     0.515464
26.46     0.515464
26.68     0.515464
27.14     0.515464
27.30     0.515464
27.84     1.030928
28.08     0.515464
..
82.42     0.515464
82.44     0.515464
83.52     0.515464
83.70     0.515464
84.54     0.515464
85.04     0.515464
85.58     0.515464
86.56     0.515464
86.68     0.515464
86.96     0.515464
87.30     0.515464
88.44     0.515464
88.52     0.515464
88.74     0.515464
88.92     0.515464
89.94     0.515464
91.66     0.515464
92.00     0.515464
92.26     0.515464
92.30     0.515464
92.68     0.515464
93.16     0.515464
93.32     0.515464
94.22     0.515464
94.26     0.515464
95.64     0.515464
97.36     0.515464
98.32     0.515464
98.36     0.515464
100.00    3.092784
Length: 194, dtype: float64
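Each percentage in this listing is the count times 100 divided by 194, the number of distinct values, so the column adds up to more than 100% as soon as a value repeats. A minimal sketch of percentages relative to the number of non-missing rows instead, using a made-up Series in place of the real urbanrate column:

```python
import pandas

# Made-up urban rates; 100.0 occurs twice and one value is missing.
urbanrate = pandas.Series([10.4, 27.84, 100.0, 100.0, None])

# Absolute counts per distinct value, sorted by value (NaN is left out).
counts = urbanrate.value_counts().sort_index()

# Share of the non-missing rows, as a percentage.
percentages = counts * 100 / counts.sum()

print(percentages)
```

Dividing by counts.sum() rather than len(counts) makes the percentages add up to 100 regardless of how many values repeat.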
Distribution of the incomeperperson variable
Like the other two variables, incomeperperson is a floating-point number, so it produces a distribution like the ones above.
Counts for incomeperperson:
103.775857       1
115.305996       1
131.796207       1
155.033231       1
161.317137       1
180.083376       1
184.141797       1
220.891248       1
239.518749       1
242.677534       1
268.259450       1
268.331790       1
269.892881       1
275.884287       1
276.200413       1
279.180453       1
285.224449       1
320.771890       1
336.368749       1
338.266391       1
354.599726       1
358.979540       1
369.572954       1
371.424198       1
372.728414       1
377.039699       1
377.421113       1
389.763634       1
411.501447       1
432.226337       1
..
20751.893424     1
21087.394125     1
21943.339898     1
22275.751661     1
22878.466567     1
24496.048264     1
25249.986061     1
25306.187193     1
25575.352623     1
26551.844238     1
26692.984107     1
27110.731591     1
27595.091347     1
28033.489283     1
30532.277044     1
31993.200694     1
32292.482984     1
32535.832512     1
33923.313868     1
33931.832079     1
33945.314422     1
35536.072471     1
37491.179523     1
37662.751250     1
39309.478859     1
39972.352768     1
52301.587179     1
62682.147006     1
81647.100031     1
105147.437697    1
Length: 190, dtype: int64
Percentages for incomeperperson:
103.775857       0.526316
115.305996       0.526316
131.796207       0.526316
155.033231       0.526316
161.317137       0.526316
180.083376       0.526316
184.141797       0.526316
220.891248       0.526316
239.518749       0.526316
242.677534       0.526316
268.259450       0.526316
268.331790       0.526316
269.892881       0.526316
275.884287       0.526316
276.200413       0.526316
279.180453       0.526316
285.224449       0.526316
320.771890       0.526316
336.368749       0.526316
338.266391       0.526316
354.599726       0.526316
358.979540       0.526316
369.572954       0.526316
371.424198       0.526316
372.728414       0.526316
377.039699       0.526316
377.421113       0.526316
389.763634       0.526316
411.501447       0.526316
432.226337       0.526316
..
20751.893424     0.526316
21087.394125     0.526316
21943.339898     0.526316
22275.751661     0.526316
22878.466567     0.526316
24496.048264     0.526316
25249.986061     0.526316
25306.187193     0.526316
25575.352623     0.526316
26551.844238     0.526316
26692.984107     0.526316
27110.731591     0.526316
27595.091347     0.526316
28033.489283     0.526316
30532.277044     0.526316
31993.200694     0.526316
32292.482984     0.526316
32535.832512     0.526316
33923.313868     0.526316
33931.832079     0.526316
33945.314422     0.526316
35536.072471     0.526316
37491.179523     0.526316
37662.751250     0.526316
39309.478859     0.526316
39972.352768     0.526316
52301.587179     0.526316
62682.147006     0.526316
81647.100031     0.526316
105147.437697    0.526316
Length: 190, dtype: float64
Conclusion
Frequency analysis on the variables co2emissions, urbanrate and incomeperperson indicates that (almost) all values are unique: nearly every count is 1.
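The same conclusion can be checked without printing a full frequency table, by comparing the number of non-missing values with the number of distinct values. A small sketch with made-up numbers standing in for one GapMinder column:

```python
import pandas

# Made-up values standing in for one GapMinder column; one value is missing.
incomes = pandas.Series([103.78, 115.31, 131.80, None, 155.03])

non_missing = incomes.count()   # number of values that are not NaN
distinct = incomes.nunique()    # number of distinct non-NaN values

# If both numbers are equal, every value occurs exactly once.
all_unique = non_missing == distinct

print(non_missing, distinct, all_unique)
```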
To do additional univariate analysis, the samples could be split into groups and the frequency analysis could count the number of samples in a group.
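A minimal sketch of that idea, assuming four equal-width groups for urbanrate. The bin edges, the labels and the sample values are my own choices for illustration and do not come from the GapMinder data set.

```python
import pandas

# Made-up urban rates (percent of the population); not the real GapMinder values.
urbanrate = pandas.Series([10.4, 24.78, 27.84, 51.2, 66.5, 82.42, 100.0, 100.0])

# Four equal-width groups covering 0% to 100%.
bins = [0, 25, 50, 75, 100]
labels = ['0-25', '25-50', '50-75', '75-100']
groups = pandas.cut(urbanrate, bins=bins, labels=labels, include_lowest=True)

# Frequency table per group, in the order of the bins.
counts = groups.value_counts().sort_index()
percentages = counts * 100 / counts.sum()

print(counts)
print(percentages)
```

With groups like these the counts are no longer all 1, so the frequency table starts to say something about the shape of the distribution. pandas.qcut could be used instead of pandas.cut to get groups with roughly equal numbers of samples.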