datarookie - Tumblr blog

datarookie · 4 years ago

Assignment (week 4)

------------------------------------------------------------------------------

Scatterplot - Association between life expectancy and income

The initial plot, with a linear scale on the x-axis, shows a relationship between income and life expectancy: higher income goes with higher life expectancy. However, changing the x-axis to a log scale makes a linear relationship visible.

So my initial hypothesis that life expectancy is positively associated with income is not rejected. However, more in-depth analysis will be required to establish whether the hypothesis should be accepted and, if the association does indeed exist, how robust that relationship is.
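
To illustrate why the log scale helps, here is a minimal sketch using only the Python standard library. The income and life expectancy figures are made-up illustrative values (not Gapminder data), constructed on the assumption that life expectancy rises by a fixed amount each time income is multiplied by ten. `pearson_r` is a helper written for this example.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical values, NOT taken from gapminder.csv
income = [100, 300, 1000, 3000, 10000, 30000, 100000]
# Assumed relationship: +10 years of life expectancy per tenfold rise in income
life_expectancy = [30 + 10 * math.log10(x) for x in income]

r_linear = pearson_r(income, life_expectancy)
r_log = pearson_r([math.log10(x) for x in income], life_expectancy)

print("r against income       : {0:.3f}".format(r_linear))
print("r against log10(income): {0:.3f}".format(r_log))
```

With data shaped like this, the correlation against log10(income) is stronger than against raw income, which is the same effect the log-scaled x-axis makes visible in the scatterplot.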

datarookie · 4 years ago

Assignment (week 4)

------------------------------------------------------------------------------

Categorical Plots

I have created three categorical plots showing the frequency distributions of 'lifeexpectancy', 'incomeperperson' and 'urbanrate'.

For lifeexpectancy, 72 countries (41%) have an average value in the 65-75 range, followed by 57 countries (32%) in the 75-85 range.

For incomeperperson, 59 countries (34%) have an average value of US$1k-5k, followed by 53 countries (30%) with less than US$1k.

For urbanrate, 66 countries (38%) have an average rate of 50%-75%, followed by 52 countries (30%) with a rate of 25%-50%.

datarookie · 4 years ago

Assignment (Week 3)

2) Output

This Pandas is verison 1.2.1

The dataset "gapminder.csv" has 213 records and 16 variables.

The dateset [df] has 176 records and 4 variables after extracting the required variables and filtering out records with missing values.

Variable : incomeperperson

Rec w miss_val : 23

Maximum value : 105147.4

Minimum value : 103.8

Frequency Distribution - values

Inc group (US$0-US$1,000) 54

Inc group (US$1,000-US$5,000) 61

Inc group (US$5,000-US$10,000) 28

Inc group (>US$10,000) 46

Name: incomeperperson, dtype: int64

Frequency Distribution - percentage

Inc group (US$0-US$1,000) 0.285714

Inc group (US$1,000-US$5,000) 0.322751

Inc group (US$5,000-US$10,000) 0.148148

Inc group (>US$10,000) 0.243386

Name: incomeperperson, dtype: float64

Variable : lifeexpectancy

Rec w miss_val : 22

Maximum value : 83.4

Minimum value : 47.8

Frequency Distribution - values

Age group (45-55) 24

Age group (55-65) 26

Age group (65-75) 76

Age group (75-85) 65

Name: lifeexpectancy, dtype: int64

Frequency Distribution - percentage

Age group (45-55) 0.125654

Age group (55-65) 0.136126

Age group (65-75) 0.397906

Age group (75-85) 0.340314

Name: lifeexpectancy, dtype: float64

Variable : urbanrate

Rec w miss_val : 10

Maximum value : 100.0

Minimum value : 10.4

Frequency Distribution - values

%Urban (0-25) 22

%Urban (25-50) 59

%Urban (50-75) 74

%Urban (75-100) 48

Name: urbanrate, dtype: int64

Frequency Distribution - percentage

%Urban (0-25) 0.108374

%Urban (25-50) 0.290640

%Urban (50-75) 0.364532

%Urban (75-100) 0.236453

Name: urbanrate, dtype: float64

------------------------------------------------------------------------------

3) Description

The data cleaning process replaces all missing values in the dataset gapminder.csv with "NaN".
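
As a rough illustration of that cleaning step, the sketch below applies the same idea in plain Python: any value that is empty or contains only whitespace is treated as missing. The sample rows and the helper `blank_to_nan` are made up for the example; the assignment itself relies on the pandas `replace` call in the program.

```python
import math
import re

BLANK = re.compile(r'^\s*$')  # same pattern as used in the assignment program

def blank_to_nan(value):
    """Return NaN for empty or whitespace-only strings, else the value unchanged."""
    return math.nan if BLANK.match(value) else value

# Hypothetical rows (country, incomeperperson), NOT taken from gapminder.csv
rows = [("A", "1200.5"), ("B", " "), ("C", ""), ("D", "860.0")]

cleaned = [(country, blank_to_nan(income)) for country, income in rows]
missing = sum(1 for _, income in cleaned
              if isinstance(income, float) and math.isnan(income))

print(cleaned)
print("Records with a missing value:", missing)
```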

The three variables were then put through a loop that outputs some basic stats, bins their values into groups and displays their frequency distributions as counts and percentages.

For incomeperperson, there are 23 missing values out of 213 records. About 29% of the countries with data have a GDP per capita of US$1,000 or less, and 32% fall between US$1,000 and US$5,000.

For lifeexpectancy, there are 22 missing values out of 213 records. About 40% of the countries with data have an average life expectancy between 65 and 75.

For urbanrate, there are 10 missing values out of 213 records. About 60% of the countries with data have an urban rate from 50% to 100%.
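
The binning step can be sketched in plain Python as follows. This is only an approximation of what `pandas.cut` does with its default settings (each bin excludes its left edge and includes its right edge, and a value outside all bins is left uncategorised). `assign_bin` is a helper invented for this example, and the sample incomes are made up apart from the minimum of 103.8 and maximum of 105147.4 reported in the output. It suggests why the four income counts in the output add up to 189 rather than 190 (213 records minus 23 missing): a value above the top edge of 99999 gets no group.

```python
from bisect import bisect_left
from collections import Counter

income_edges = [0, 1000, 5000, 10000, 99999]
income_labels = ["Inc group (US$0-US$1,000)", "Inc group (US$1,000-US$5,000)",
                 "Inc group (US$5,000-US$10,000)", "Inc group (>US$10,000)"]

def assign_bin(value, edges, labels):
    """Label of the right-closed interval (edges[i], edges[i+1]] holding value, or None."""
    if value <= edges[0] or value > edges[-1]:
        return None  # outside every bin, comparable to NaN from pandas.cut
    return labels[bisect_left(edges, value) - 1]

# Illustrative incomes; only 103.8 and 105147.4 come from the output above
sample = [103.8, 1000, 4200.0, 7500.0, 25000.0, 105147.4]
groups = [assign_bin(v, income_edges, income_labels) for v in sample]

counts = Counter(g for g in groups if g is not None)
for label in income_labels:
    print(label, counts[label])
print("Left without a group:", groups.count(None))
```

If the top group is meant to be open-ended, one option would be to use an upper edge above the largest value in the data, though that would change the counts and percentages reported here.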

datarookie · 4 years ago

Assignment (Week 3)

------------------------------------------------------------------------------

1) Program

# Created on 03 Feb 2021 for Coursera Data Mgmt and Visualisation - week 2 assignment
# Updated on 28 Feb 2021 for Coursera Data Mgmt and Visualisation - week 3 assignment

import pandas
import numpy

print("This Pandas is verison",pandas.__version__)
print()

output_file = "for visualisation.csv"

# Define and load dataset
csv_datafile = "gapminder.csv"
data = pandas.read_csv(csv_datafile, low_memory = False)

# Display no. of records and variables
print('The dataset "gapminder.csv" has', len(data), 'records and', len(data.columns), 'variables.')

'''
# Examine dataset in details
var_list = data.columns.tolist()
for var in var_list:
    if var == 'country': continue
    df = pandas.to_numeric(data[var], errors='coerce')
    df = df.dropna()
    miss_val = len(data) - len(df)
    max_val = df.max()
    min_val = df.min()
    print('Variable :', var)
    print('Rec w miss_val :', miss_val)
    print('Maximum value : {0:.1f}'.format(max_val))
    print('Minimum value : {0:.1f}'.format(min_val))
    print()
'''

# Data Mgmt - replace missing values (blank or whitespace-only cells) with NaN
df = data.replace(r'^\s*$', numpy.nan, regex=True)

# Data Mgmt - filter out records with NaN for selected variables.
df = df[['country', 'incomeperperson', 'lifeexpectancy', 'urbanrate']]
df = df.dropna()
print('The dateset [df] has', len(df), 'records and', len(df.columns), 'variables after extracting the required variables and filtering out records with missing values.')
print()

# Bin edges and labels for each variable
# (pandas.cut leaves values outside the edges, e.g. an income above 99999, uncategorised)
incomeperperson_bin = 0, 1000, 5000, 10000, 99999
incomeperperson_lab = ["Inc group (US$0-US$1,000)", "Inc group (US$1,000-US$5,000)", "Inc group (US$5,000-US$10,000)", "Inc group (>US$10,000)"]
lifeexpectancy_bin = 45, 55, 65, 75, 85
lifeexpectancy_lab = ["Age group (45-55)", "Age group (55-65)", "Age group (65-75)", "Age group (75-85)"]
urbanrate_bin = 0, 25, 50, 75, 100
urbanrate_lab = ["%Urban (0-25)", "%Urban (25-50)", "%Urban (50-75)", "%Urban (75-100)"]

var_list = df.columns.tolist()

for var in var_list:
    # Examine selected variables in details
    if var == 'country': continue

    # Note: values are read from the full dataset [data], not the filtered [df],
    # so each variable keeps every record that has a value for it
    series = pandas.to_numeric(data[var], errors='coerce')
    series = series.dropna()
    miss_val = len(data) - len(series)
    max_val = series.max()
    min_val = series.min()
    print('Variable :', var)
    print('Rec w miss_val :', miss_val)
    print('Maximum value : {0:.1f}'.format(max_val))
    print('Minimum value : {0:.1f}'.format(min_val))
    print()

    # Bin the current variable into its user-defined bins
    if var == 'incomeperperson': bin_ = incomeperperson_bin; lab = incomeperperson_lab
    if var == 'lifeexpectancy': bin_ = lifeexpectancy_bin; lab = lifeexpectancy_lab
    if var == 'urbanrate': bin_ = urbanrate_bin; lab = urbanrate_lab
    binned = pandas.cut(series.astype(float), bin_, labels = lab)

    # Frequency distribution as counts and as proportions
    c1 = binned.value_counts(sort=False)
    p1 = binned.value_counts(sort=False, normalize=True)
    print('Frequency Distribution - values')
    print(c1)
    print()
    print('Frequency Distribution - percentage')
    print(p1)
    print()

datarookie · 4 years ago

Assignment (Week 2)

------------------------------------------------------------------------------

1) Program to display frequency distribution of life expectancy in dataset "gapminder.csv"

# Created on 03 Feb 2021 for Coursera Data mgmt and Visualisation

import pandas
import numpy

print("This Pandas is verison",pandas.__version__)
print()

output_file = "for visualisation.csv"

# Define and load dataset
csv_datafile = "gapminder.csv"
data = pandas.read_csv(csv_datafile, low_memory = False)

# Display no. of records and variables
print('The dataset "gapminder.csv" has', len(data), 'records and', len(data.columns), 'variables.')

# Extract variables "country", "incomeperperson" and "lifeexpectancy"
df = data[['country', 'incomeperperson', 'lifeexpectancy']]

# Count records with no value (blank cells are compared against a single space)
df1 = df[(df['incomeperperson']!= " ")]
print(len(data) - len(df1), 'records were found to have no values for "incomeperperson".')
df1 = df[(df['lifeexpectancy']!= " ")]
print(len(data) - len(df1), 'records were found to have no values for "lifeexpectancy".')
print()

# Keep only records with values for both variables
# (copy so that a new column can be added without a SettingWithCopyWarning)
df = df[(df['incomeperperson']!= " ")]
df = df[(df['lifeexpectancy']!= " ")].copy()
print('The dateset [df] has', len(df), 'records and', len(df.columns), 'variables after extracting the "country", "incomeperperson" and "lifeexpectancy" variables and filtering out records without values for the "incomeperperson" or "lifeexpectancy" variables.')
print()

# Convert to numbers so that min/max compare numerically rather than as text
df['lifeexpectancy'] = df['lifeexpectancy'].astype(float)

# Find out max and min life expectancy to work out age group
print("The minimum life expectancy age is", df['lifeexpectancy'].min())
print("The maximum life expectancy age is", df['lifeexpectancy'].max())
print()

# Bin the lifeexpectancy into user-defined bins
age_group = pandas.cut(df['lifeexpectancy'], [45, 55, 65, 75, 85], labels=["Age 45-55", "Age 55-65", "Age 65-75", "Age 75-85"])
df['agegroup'] = age_group

# Frequency distribution as counts and as proportions
c1 = df["agegroup"].value_counts(sort=False)
p1 = df["agegroup"].value_counts(sort=False, normalize=True)
print(c1)
print()
print(p1)
print()

------------------------------------------------------------------------------

2) Output from program

This Pandas is verison 1.2.1

The dataset "gapminder.csv" has 213 records and 16 variables.

23 records were found to have no values for "incomeperperson".

22 records were found to have no values for "lifeexpectancy".

The dateset [df] has 176 records and 3 variables after extracting the "country", "incomeperperson" and "lifeexpectancy" variables and filtering out records without values for the "incomeperperson" or "lifeexpectancy" variables.

The minimum life expectancy age is 47.794

The maximum life expectancy age is 83.394

Age 45-55 22

Age 55-65 25

Age 65-75 72

Age 75-85 57

Name: agegroup, dtype: int64

Age 45-55 0.125000

Age 55-65 0.142045

Age 65-75 0.409091

Age 75-85 0.323864

Name: agegroup, dtype: float64

------------------------------------------------------------------------------

3) Descriptions

I started by examining the dataset. There are 213 records and 16 variables. 23 of the 213 records were found to have no values for "incomeperperson" while 22 of the 213 records were found to have no values for "lifeexpectancy".

After filtering out records without values for either "incomeperperson" or "lifeexpectancy", 176 records remain.

Life expectancy in the list ranges from about 47.8 to 83.4 years, so age group bins starting at 45 with a 10-year interval were built to output the frequency distribution.

The bin with the highest occurrence was the 65-75 age group (41%), followed by the 75-85 age group (32%).
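
The percentages quoted above can be checked directly from the counts in the program output. A small sketch, with variable names chosen for this example:

```python
# Counts per age group, copied from the program output above
counts = {"Age 45-55": 22, "Age 55-65": 25, "Age 65-75": 72, "Age 75-85": 57}

total = sum(counts.values())  # number of records that were binned
shares = {group: n / total for group, n in counts.items()}

print("Total records:", total)
for group, share in shares.items():
    # Show each share as a proportion and as a rounded percentage
    print("{0} {1:.6f} ({2:.0%})".format(group, share, share))
```

The total matches the 176 records left after filtering, so every remaining record fell into one of the four bins.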

datarookie · 4 years ago

Data Management and Visualization Course

Assignment (week 1)

The Research Question

I studied the Gapminder codebook and am particularly interested in finding out how the level of income per person relates to human life expectancy and how strongly it can explain life expectancy.

Literature Research

On the association between life expectancy and income:

A study by Raj Chetty, Michael Stepner, Sarah Abraham and colleagues uses income data from 1.4 billion deidentified tax records between 1999 and 2014 for the US population. Life expectancy was estimated from mortality data obtained from Social Security Administration death records, and the estimates were adjusted to account for race and ethnicity. The study concluded that higher income was associated with greater longevity. However, the association varied substantially across areas.

Another study, by Erick Messias, uses data from the Brazilian Ministry of Health and the Brazilian Institute of Geography and Statistics. He conducted simple and multiple linear regressions to measure the association of life expectancy with income disparity (measured by the Gini coefficient), gross domestic product (GDP) per capita, and illiteracy rate. The study found that GDP per capita was positively associated with life expectancy.

Both studies, conducted by different authors within their own countries, found a positive association between life expectancy and income.

However, each study is confined to a single country. Analysing the association across countries might therefore yield different outcomes, or outcomes of a different magnitude.

References:

Raj Chetty, PhD; Michael Stepner, BA; Sarah Abraham, BA; et al. “The Association Between Income and Life Expectancy in the United States, 2001-2014.” JAMA Network, 26 April 2016, https://jamanetwork.com/journals/jama/article-abstract/2513561

Erick Messias MD, MPH. “Income Inequality, Illiteracy Rate, and Life Expectancy in Brazil.” American Journal of Public Health, 10 Oct 2011, https://ajph.aphapublications.org/doi/full/10.2105/AJPH.93.8.1294

Hypothesis

Life expectancy is positively associated with income.

I will be using the variables “incomeperperson” and “lifeexpectancy” from the Gapminder dataset.

datarookie · 4 years ago

01 Feb 2021 marked the day I landed on Tumblr.
