datarookie
Okay! So here I am.
datarookie · 4 years ago
Assignment (week 4)
------------------------------------------------------------------------------
Scatterplot - Association between life expectancy and income
The initial plot, which uses a normal (linear) scale for the x-axis, shows a relationship between income and life expectancy: higher income goes with higher life expectancy. However, changing the x-axis to a log scale makes a linear relationship visible.
So my initial hypothesis that life expectancy is positively associated with income is not rejected. However, more in-depth analysis will be required to establish whether the association does indeed exist and, if so, how robust it is.
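A minimal sketch of why the log scale helps, assuming made-up numbers rather than the real gapminder.csv values: when life expectancy rises by a roughly fixed amount each time income is multiplied by a constant, the Pearson correlation is stronger against log10(income) than against raw income. The `pearson_r` helper and both lists are hypothetical and only for illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical values, NOT taken from gapminder.csv:
# income quadruples at each step while life expectancy rises by 7-8 years.
income = [250, 1000, 4000, 16000, 64000]
life_exp = [52, 60, 67, 75, 82]

r_raw = pearson_r(income, life_exp)
r_log = pearson_r([math.log10(v) for v in income], life_exp)

# The log-scale correlation is closer to 1, i.e. closer to a straight line.
print("r on raw income  :", round(r_raw, 3))
print("r on log10 income:", round(r_log, 3))
```

On a real scatterplot the same effect shows up visually: the points bend on a linear x-axis and straighten out on a log x-axis.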
datarookie · 4 years ago
Assignment (week 4)
------------------------------------------------------------------------------
Categorical Plots
I have created three categorical plots showing frequency distribution for 'lifeexpectancy', 'incomeperperson' and 'urbanrate'.
For lifeexpectancy, 72 countries (41%) have an average value in the 65-75 range, followed by 57 countries (32%) in the 75-85 range.
For incomeperperson, 59 countries (34%) have an average value in the US$1k-5k range, followed by 53 countries (30%) with less than US$1k.
For urbanrate, 66 countries (38%) have an average rate in the 50%-75% range, followed by 52 countries (30%) in the 25%-50% range.
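The percentages above can be checked with a few lines of arithmetic. This is only a sketch: it assumes the denominator is 176, the size of the filtered dataset reported in the Week 3 output, which this post does not state explicitly. The `percent` helper and the dictionary keys are hypothetical names.

```python
import math

# Assumption: every percentage is taken over the 176 records left after filtering.
TOTAL_COUNTRIES = 176

# label -> (count of countries, percentage stated in the post)
reported = {
    "lifeexpectancy 65-75": (72, 41),
    "lifeexpectancy 75-85": (57, 32),
    "incomeperperson US$1k-5k": (59, 34),
    "incomeperperson <US$1k": (53, 30),
    "urbanrate 50%-75%": (66, 38),
    "urbanrate 25%-50%": (52, 30),
}

def percent(count, total):
    """Share of total as a whole-number percentage, rounding halves up."""
    return math.floor(100 * count / total + 0.5)

for label, (count, stated) in reported.items():
    computed = percent(count, TOTAL_COUNTRIES)
    print(f"{label}: {count}/{TOTAL_COUNTRIES} = {computed}% (stated {stated}%)")
```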
datarookie · 4 years ago
Assignment (Week 3)
2) Output
This Pandas is verison 1.2.1
The dataset "gapminder.csv" has 213 records and 16 variables.
The dateset [df] has 176 records and 4 variables after extracting the required variables and filtering out records with missing values.
Variable : incomeperperson
Rec w miss_val : 23
Maximum value : 105147.4
Minimum value : 103.8
Frequency Distribution - values
Inc group (US$0-US$1,000) 54
Inc group (US$1,000-US$5,000) 61
Inc group (US$5,000-US$10,000) 28
Inc group (>US$10,000) 46
Name: incomeperperson, dtype: int64
Frequency Distribution - percentage
Inc group (US$0-US$1,000) 0.285714
Inc group (US$1,000-US$5,000) 0.322751
Inc group (US$5,000-US$10,000) 0.148148
Inc group (>US$10,000) 0.243386
Name: incomeperperson, dtype: float64
Variable : lifeexpectancy
Rec w miss_val : 22
Maximum value : 83.4
Minimum value : 47.8
Frequency Distribution - values
Age group (45-55) 24
Age group (55-65) 26
Age group (65-75) 76
Age group (75-85) 65
Name: lifeexpectancy, dtype: int64
Frequency Distribution - percentage
Age group (45-55) 0.125654
Age group (55-65) 0.136126
Age group (65-75) 0.397906
Age group (75-85) 0.340314
Name: lifeexpectancy, dtype: float64
Variable : urbanrate
Rec w miss_val : 10
Maximum value : 100.0
Minimum value : 10.4
Frequency Distribution - values
%Urban (0-25) 22
%Urban (25-50) 59
%Urban (50-75) 74
%Urban (75-100) 48
Name: urbanrate, dtype: int64
Frequency Distribution - percentage
%Urban (0-25) 0.108374
%Urban (25-50) 0.290640
%Urban (50-75) 0.364532
%Urban (75-100) 0.236453
Name: urbanrate, dtype: float64
------------------------------------------------------------------------------
3) Description
The data cleaning process replaces all blank (missing) values in the dataset gapminder.csv with "NaN".
Three variables were put through a loop that outputs some basic stats, bins their values into groups and displays their frequency distributions by counts and percentages.
For incomeperperson, there are 23 missing values out of 213 records. About 29% of the binned countries have a GDP per capita of US$1,000 or less, with 32% falling between US$1,000 and US$5,000.
For lifeexpectancy, there are 22 missing values out of 213 records. About 40% of the countries have an average life expectancy between 65 and 75.
For urbanrate, there are 10 missing values out of 213 records. About 60% of the countries have an urban rate from 50% to 100%.
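The cleaning and binning steps can be sketched on a tiny, made-up table. Everything in it is hypothetical except the value 105147.4, which is the maximum income reported in the output. It also shows one thing worth knowing about pandas.cut: a value above the last bin edge is left unbinned (NaN), so a capped top edge such as 99999 silently drops it from the frequency table, while an open-ended edge (numpy.inf, my suggestion, not what the program uses) keeps it.

```python
import numpy as np
import pandas as pd

# Hypothetical mini dataset; a single space stands for a missing value.
raw = pd.DataFrame({
    "country": ["A", "B", "C", "D"],
    "incomeperperson": ["500.5", " ", "7200", "105147.4"],
})

# Replace blank strings with NaN, then convert the text column to numbers.
clean = raw.replace(r"^\s*$", np.nan, regex=True)
income = pd.to_numeric(clean["incomeperperson"], errors="coerce")
n_missing = int(income.isna().sum())

labels = ["0-1k", "1k-5k", "5k-10k", ">10k"]

# Capped top edge: 105147.4 lies above 99999 and ends up unbinned.
capped = pd.cut(income, [0, 1000, 5000, 10000, 99999], labels=labels)
# Open-ended top edge: every non-missing value lands in a bin.
open_ended = pd.cut(income, [0, 1000, 5000, 10000, np.inf], labels=labels)

capped_counts = capped.value_counts(sort=False)
open_counts = open_ended.value_counts(sort=False)

print("Missing values:", n_missing)
print(capped_counts)
print(open_counts)
```

This would be consistent with the income counts in the output adding up to 189 rather than the 190 non-missing records.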
datarookie · 4 years ago
Assignment (Week 3)
------------------------------------------------------------------------------
1) Program
# Created on 03 Feb 2021 for Coursera Data Mgmt and Visualisation - week 2 assignment
# Updated on 28 Feb 2021 for Coursera Data Mgmt and Visualisation - week 3 assignment
import pandas
import numpy
print("This Pandas is verison",pandas.__version__)
print()
output_file = "for visualisation.csv"
# Define and load dataset
csv_datafile = "gapminder.csv"
data = pandas.read_csv(csv_datafile, low_memory = False)
# Display no. of records and variables
print('The dataset "gapminder.csv" has', len(data), 'records and', len(data.columns), 'variables.')
'''
# Examine dataset in details
var_list = data.columns.tolist()
for var in var_list:
    if var == 'country': continue
    df = pandas.to_numeric(data[var], errors='coerce')
    df = df.dropna()
    miss_val = len(data) - len(df)
    max_val = df.max()
    min_val = df.min()
    print('Variable :', var)
    print('Rec w miss_val :', miss_val)
    print('Maximum value : {0:.1f}'.format(max_val))
    print('Minimum value : {0:.1f}'.format(min_val))
    print()
'''
# Data Mgmt - replace missing values with NaN
# (numpy.nan is used because the numpy.NaN alias was removed in NumPy 2.0)
df = data.replace(r'^\s*$', numpy.nan, regex=True)
# Data Mgmt - filter out records with NaN for selected variables.
df = df[['country', 'incomeperperson', 'lifeexpectancy', 'urbanrate']]
df = df.dropna()
print('The dateset [df] has', len(df), 'records and', len(df.columns), 'variables after extracting the required variables and filtering out records with missing values.')
print()
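# Note: the top edge 99999 is below the maximum income (105147.4), so that record falls outside every bin
incomeperperson_bin = 0, 1000, 5000, 10000, 99999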
incomeperperson_lab = ["Inc group (US$0-US$1,000)", "Inc group (US$1,000-US$5,000)", "Inc group (US$5,000-US$10,000)", "Inc group (>US$10,000)"]
lifeexpectancy_bin = 45, 55, 65, 75, 85
lifeexpectancy_lab = ["Age group (45-55)", "Age group (55-65)", "Age group (65-75)", "Age group (75-85)"]
urbanrate_bin = 0, 25, 50, 75, 100
urbanrate_lab = ["%Urban (0-25)", "%Urban (25-50)", "%Urban (50-75)", "%Urban (75-100)"]
var_list = df.columns.tolist()
for var in var_list:
    # Examine selected variables in details
    if var == 'country': continue
    # Stats are computed from the full dataset [data], not the filtered [df],
    # so missing values are counted out of all records in gapminder.csv.
    # A separate name (series) is used so that [df] is not overwritten.
    series = pandas.to_numeric(data[var], errors='coerce')
    series = series.dropna()
    miss_val = len(data) - len(series)
    max_val = series.max()
    min_val = series.min()
    print('Variable :', var)
    print('Rec w miss_val :', miss_val)
    print('Maximum value : {0:.1f}'.format(max_val))
    print('Minimum value : {0:.1f}'.format(min_val))
    print()
    # Bin the variable into user-defined bins
    if var == 'incomeperperson': bin_ = incomeperperson_bin; lab = incomeperperson_lab
    if var == 'lifeexpectancy': bin_ = lifeexpectancy_bin; lab = lifeexpectancy_lab
    if var == 'urbanrate': bin_ = urbanrate_bin; lab = urbanrate_lab
    # Values outside the bin edges are left unbinned and excluded from the counts
    binned = pandas.cut(series.astype(float), bin_, labels = lab)
    c1 = binned.value_counts(sort=False)
    p1 = binned.value_counts(sort=False, normalize=True)
    print('Frequency Distribution - values')
    print(c1)
    print()
    print('Frequency Distribution - percentage')
    print(p1)
    print()
datarookie · 4 years ago
Assignment (Week 2)
------------------------------------------------------------------------------
1) Program to display frequency distribution of life expectancy in dataset "gapminder.csv"
# Created on 03 Feb 2021 for Coursera Data mgmt and Visualisation
import pandas
import numpy
print("This Pandas is verison",pandas.__version__)
print()
output_file = "for visualisation.csv"
# Define and load dataset
csv_datafile = "gapminder.csv"
data = pandas.read_csv(csv_datafile, low_memory = False)
# Display no. of records and variables
print('The dataset "gapminder.csv" has', len(data), 'records and', len(data.columns), 'variables.')
# Extract variables "country", "incomeperperson" and "lifeexpectancy"
df = data[['country', 'incomeperperson', 'lifeexpectancy']]
# Count records whose value is blank (a single space in the CSV)
df1 = df[(df['incomeperperson'] != " ")]
print(len(data) - len(df1), 'records were found to have no values for "incomeperperson".')
df1 = df[(df['lifeexpectancy'] != " ")]
print(len(data) - len(df1), 'records were found to have no values for "lifeexpectancy".')
print()
df = df[(df['incomeperperson'] != " ")]
# .copy() so that adding the "agegroup" column below works on an independent dataframe
df = df[(df['lifeexpectancy'] != " ")].copy()
print('The dateset [df] has', len(df), 'records and', len(df.columns), 'variables after extracting the "country", "incomeperperson" and "lifeexpectancy" variables and filtering out records without values for the "incomeperperson" or "lifeexpectancy" variables.')
print()
# Find out max and min life expectancy to work out age group
# (convert to float first: the column is read as text because of the blank entries,
# and min/max on text compare character by character rather than numerically)
life_exp = df['lifeexpectancy'].astype(float)
print("The minimum life expectancy age is", life_exp.min())
print("The maximum life expectancy age is", life_exp.max())
print()
# Bin the lifeexpectancy into user-defined bins
age_group = pandas.cut(life_exp, [45, 55, 65, 75, 85], labels=["Age 45-55", "Age 55-65", "Age 65-75", "Age 75-85"])
df['agegroup'] = age_group
c1 = df["agegroup"].value_counts(sort=False)
p1 = df["agegroup"].value_counts(sort=False, normalize=True)
print(c1)
print()
print(p1)
print()
------------------------------------------------------------------------------
2) Output from program
This Pandas is verison 1.2.1
The dataset "gapminder.csv" has 213 records and 16 variables.
23 records were found to have no values for "incomeperperson".
22 records were found to have no values for "lifeexpectancy".
The dateset [df] has 176 records and 3 variables after extracting the "country", "incomeperperson" and "lifeexpectancy" variables and filtering out records without values for the "incomeperperson" or "lifeexpectancy" variables.
The minimum life expectancy age is 47.794
The maximum life expectancy age is 83.394
Age 45-55 22
Age 55-65 25
Age 65-75 72
Age 75-85 57
Name: agegroup, dtype: int64
Age 45-55 0.125000
Age 55-65 0.142045
Age 65-75 0.409091
Age 75-85 0.323864
Name: agegroup, dtype: float64
------------------------------------------------------------------------------
3) Descriptions
I started by examining the dataset. There are 213 records and 16 variables. 23 of the 213 records were found to have no values for "incomeperperson" while 22 of the 213 records were found to have no values for "lifeexpectancy".
After filtering out records without values for either "incomeperperson" or "lifeexpectancy", 176 records remain.
Life expectancy in the list ranges from about 47.8 to 83.4 years, so age group bins starting at 45 and ending at 85, at 10-year intervals, were built to output the frequency distribution.
The bin with the highest occurrence was the 65-75 age group (41%), followed by the 75-85 age group (32%).
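To make the binning itself less of a black box, here is a small standard-library sketch of what pandas.cut does by default with these edges: each bin is right-closed, so (45, 55] includes 55 but not 45. The `age_group` and `frequency_table` helpers are hypothetical names, and the sample values are invented apart from the minimum and maximum printed in the output.

```python
from bisect import bisect_left
from collections import Counter

EDGES = [45, 55, 65, 75, 85]
LABELS = ["Age 45-55", "Age 55-65", "Age 65-75", "Age 75-85"]

def age_group(value):
    """Label of the right-closed bin (low, high] that holds value, or None if outside."""
    index = bisect_left(EDGES, value) - 1
    if 0 <= index < len(LABELS):
        return LABELS[index]
    return None

def frequency_table(values):
    """Map each label to (count, proportion of all binned values)."""
    counts = Counter(g for g in map(age_group, values) if g is not None)
    total = sum(counts.values())
    return {
        label: (counts[label], counts[label] / total if total else 0.0)
        for label in LABELS
    }

# Hypothetical sample; only 47.794 and 83.394 come from the program output.
sample = [47.794, 55.0, 62.1, 68.3, 70.0, 74.9, 76.5, 83.394]

for label, (count, share) in frequency_table(sample).items():
    print(f"{label}: {count} ({share:.1%})")
```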
datarookie · 4 years ago
Data Management and Visualization Course
Assignment (week 1)
The Research Question
I studied the Gapminder codebook and am particularly interested in finding out how the level of income per person relates to human life expectancy and how strongly it can explain life expectancy.
Literature Research
On the association between life expectancy and income:
A study by Raj Chetty, Michael Stepner, and Sarah Abraham uses income data from 1.4 billion deidentified tax records between 1999 and 2014 for the US population. Life expectancy was estimated from mortality data obtained from Social Security Administration death records, and the estimates were adjusted for race and ethnicity. The study concluded that higher income was associated with greater longevity. However, the association varied substantially across areas.
Another study by Erick Messias uses data from the Brazilian Ministry of Health and the Brazilian Institute of Geography and Statistics. He conducted simple and multiple linear regressions to measure how income disparity (measured by the Gini coefficient), gross domestic product (GDP) per capita, and illiteracy rate are associated with life expectancy. The study found that GDP per capita was positively associated with life expectancy.
Based on this desktop research, the studies conducted by different authors within their own countries both found a positive association between life expectancy and income.
However, each of those studies is confined to a single country. As such, analyzing the association between countries might yield a different outcome, or an outcome of a different magnitude.
References:
Raj Chetty; Michael Stepner; Sarah Abraham; et al. “The Association Between Income and Life Expectancy in the United States, 2001-2014.” JAMA, 26 April 2016, https://jamanetwork.com/journals/jama/article-abstract/2513561
Erick Messias. “Income Inequality, Illiteracy Rate, and Life Expectancy in Brazil.” American Journal of Public Health, 10 Oct 2011, https://ajph.aphapublications.org/doi/full/10.2105/AJPH.93.8.1294
Hypothesis
Life expectancy is positively associated with income.
I will be using the variables “incomeperperson” and “lifeexpectancy” from the Gapminder dataset.
datarookie · 4 years ago
01 Feb 2021 marked the day I landed on Tumblr.