data-viz-sowmya - Tumblr blog

data-viz-sowmya · 4 years ago

Week 4 Assignment: Creating graphs for your data

Univariate graphs-

For each of the 3 variables chosen for data analysis, I have created a univariate histogram, as the data is quantitative: two variables range from 0 to 100 and one ranges from 0 to 10. I created bins, specified a range and plotted the histograms. Each univariate graph mirrors the frequency distribution of the values categorized into each bin.

The 3 variables are: legal aid score for women, happiness score (women and men) and female literacy rate, each recorded per country.

1. univariate histogram plot for legal aid score

The frequency distribution program from the Week 2 assignment is converted into a univariate histogram plot here. Please refer to the table below to understand the histogram plot.

Countries are scored between 0 and 100 on providing legal aid to women, with 100 being the best. The histogram has 4 bins: 0-25, 25-50, 50-75 and 75-100. The 50-75 bin is the largest, with around 72 countries, while a good number, around 60, fall in the 75-100 bin. Overall, countries still have some way to go in providing legal support to women, with only 37% scoring 75 and above.
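The counts behind a four-bin histogram can be checked without drawing it. The sketch below uses numpy.histogram with the same settings the plotting code passes to plt.hist (4 bins over the range 0 to 100). The scores are made-up values for illustration, not the assignment data.

```python
import numpy

# Hypothetical scores, NOT taken from the assignment dataset.
scores = numpy.array([12.0, 30.5, 55.0, 62.3, 70.1, 74.9, 80.0, 91.2, 99.0])

# Same settings as the histogram: 4 equal-width bins over 0-100.
counts, edges = numpy.histogram(scores, bins=4, range=(0, 100))

# Share of all scored countries that falls in each bin.
percent = counts / counts.sum() * 100
for left, right, n, p in zip(edges[:-1], edges[1:], counts, percent):
    print(f"{left:.0f}-{right:.0f}: {n} countries ({p:.1f}%)")
```

One difference worth keeping in mind: numpy.histogram (and plt.hist) treats each bin as closed on the left, so a score of exactly 75 lands in 75-100, whereas pandas.cut by default closes bins on the right and would put it in 50-75.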

2. univariate histogram plot for happiness score

The frequency distribution program from the Week 2 assignment is converted into a univariate histogram plot here. Please refer to the table below to understand the histogram plot.

Countries are given a score between 0 and 10, with 10 being the best, and are then ranked. For this analysis I have used the scores rather than the ranks. The data is binned using the edges 0, 2, 4, 6 and 8, as the scores in the data go up to at most 8. The histogram shows how countries are distributed across these bins: around 77 countries (about 55%) fall in the 4-6 bin, and only 31.2% fall in the 6-8 bin.
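To see how the bin counts and percentages come about, here is a small sketch of pandas.cut with the edges 0, 2, 4, 6 and 8. The scores are hypothetical. With the default settings each interval is open on the left, so a score of exactly 0 falls outside every bin, and missing values (NaN) are not counted either; the percentages are therefore shares of the countries that actually landed in a bin.

```python
import numpy
import pandas

# Hypothetical happiness scores; 0.0 and NaN stand for countries without a score.
scores = pandas.Series([0.0, 3.4, 4.5, 5.2, 5.9, 6.8, 7.5, numpy.nan])

edges = [0, 2, 4, 6, 8]
binned = pandas.cut(scores, edges)          # intervals (0, 2], (2, 4], (4, 6], (6, 8]
counts = binned.value_counts(sort=False)    # unbinned (NaN) entries are not counted
percent = (counts / counts.sum() * 100).round(1)

print(counts)
print(percent)
```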

3. univariate histogram plot for literacy rate

The frequency distribution program from the Week 2 assignment is converted into a univariate histogram plot here. Please refer to the table below to understand the histogram plot.

Countries are rated between 0 and 100 on the percentage of women who are literate, with 100 being the highest. The histogram has 4 bins: 0-25, 25-50, 50-75 and 75-100. The 75-100 bin is the largest, with around 74 countries, which is encouraging and amounts to about 63.8% of the countries counted for this variable.

Bivariate graphs –

1. bivariate Scatter plot to see relation between literacy rate and legal aid ranking/score for countries

As shown in the lessons, since both variables here are quantitative, I have used a scatter plot to show the relation between the quantitative explanatory and response variables.

Female_literacy_rate : This is the quantitative explanatory variable

Overall Average Score (Legal aid score) : This is the quantitative response variable

The scatter plot shows that most countries fall in the 50-80 range for legal aid. Even countries with a literacy rate between 80 and 100 fall short in providing legal aid: they are concentrated in the 50-80 range, with very few countries scoring above 80 for legal aid.

2. bivariate Scatter plot to see relation between literacy rate and happiness ranking/score for countries

As shown in the lessons, since both variables here are quantitative, I have used a scatter plot to show the relation between the quantitative explanatory and response variables.

Female_literacy_rate : This is the quantitative explanatory variable

Happiness_Score : This is the quantitative response variable

The scatter plot suggests that the higher the literacy rate among women, the higher the happiness score. Here too, a large share of countries is concentrated between happiness scores of 5 and 6.5.
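A visual impression from a scatter plot can be backed up with a number. As a minimal sketch, the Pearson correlation between the two columns can be computed with pandas. The values below are invented for illustration and are not the assignment data, so the printed coefficient says nothing about the real relationship.

```python
import pandas

# Hypothetical values for illustration only.
toy = pandas.DataFrame({
    "Female_literacy_rate": [35.0, 52.0, 68.0, 81.0, 93.0, 99.0],
    "Happiness_score": [3.9, 4.4, 5.1, 5.0, 6.2, 7.0],
})

# Series.corr uses the Pearson coefficient by default; values near +1 mean
# the two variables rise together, values near 0 mean no linear relation.
r = toy["Female_literacy_rate"].corr(toy["Happiness_score"])
print(f"Pearson r = {r:.2f}")
```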

Program code:

# -*- coding: utf-8 -*-
"""
Spyder Editor

Week 4 - Creating graphs
"""

import pandas
import numpy
import seaborn
import matplotlib.pyplot as plt

# Import the entire dataset to memory
data = pandas.read_csv('Dataset_happiness_women_role.csv', low_memory=False)

# recode missing values to python missing (NaN)
data['Happiness_score'] = data['Happiness_score'].replace(0.000000, numpy.nan)
print(len(data))

# subset data: keep rows where at least one of female_literacy_rate,
# legal_aid_score and happiness_score is present and above zero
# (| is a logical OR; a comparison with NaN evaluates to False)
subset = data[(data['Female_literacy_rate'] > 0)
              | (data['OVERALL AVERAGE SCORE'] > 0)
              | (data['Happiness_score'] > 0)]
print(len(subset))

# Calculate frequency distribution and percentage for legal aid score ranging from 0 to 100
legal_aid_bin = [-numpy.inf, 25, 50, 75, numpy.inf]  # build bins to categorize data
legal_aid_avg_score = pandas.cut(subset['OVERALL AVERAGE SCORE'], legal_aid_bin)  # each score is assigned a bin
legal_aid_count = legal_aid_avg_score.value_counts(sort=False)  # each bin value is counted
percent_legal_aid = legal_aid_avg_score.value_counts(sort=False, normalize=True)
print('Legal aid frequency distribution:\n', legal_aid_count)
print('Legal aid percent distribution:\n', round(percent_legal_aid * 100, 1))

# univariate histogram plot for legal aid score
hist_range = (0, 100)  # lower and upper limit of the x axis
n_bins = 4             # 0-25, 25-50, 50-75, 75-100
plt.figure()           # new figure so successive plots do not overlap
plt.hist(subset['OVERALL AVERAGE SCORE'].dropna(), bins=n_bins, range=hist_range,
         color='orange', histtype='bar', rwidth=0.8)
plt.xlabel('Legal aid score')
plt.ylabel('Number of countries')
plt.title('Histogram for Legal Aid Score')
plt.show()

# Calculate frequency distribution and percentage for happiness score ranging from 0 to 10
happiness_bin = [0, 2, 4, 6, 8]  # build bins to categorize data
happiness_score = pandas.cut(subset['Happiness_score'], happiness_bin)  # each score is assigned a bin
happiness_score_count = happiness_score.value_counts(sort=False)  # each bin value is counted
percent_happiness_score = happiness_score.value_counts(sort=False, normalize=True)
print('Happiness score frequency distribution:\n', happiness_score_count)
print('Happiness score percent distribution:\n', round(percent_happiness_score * 100, 1))

# univariate histogram plot for happiness score
hist_range = (0, 8)  # lower and upper limit of the x axis
n_bins = 4           # 0-2, 2-4, 4-6, 6-8
plt.figure()         # new figure so successive plots do not overlap
plt.hist(subset['Happiness_score'].dropna(), bins=n_bins, range=hist_range,
         color='green', histtype='bar', rwidth=0.8)
plt.xlabel('Happiness score')
plt.ylabel('Number of countries')
plt.title('Histogram for Happiness Score')
plt.show()

# Calculate frequency distribution and percentage for literacy rate ranging from 0 to 100
literacy_bin = [-numpy.inf, 25, 50, 75, numpy.inf]  # build bins to categorize data
literacy_rate = pandas.cut(subset['Female_literacy_rate'], literacy_bin)  # each score is assigned a bin
literacy_rate_count = literacy_rate.value_counts(sort=False)  # each bin value is counted
percent_literacy_rate = literacy_rate.value_counts(sort=False, normalize=True)
print('Literacy rate frequency distribution:\n', literacy_rate_count)
print('Literacy rate percent distribution:\n', round(percent_literacy_rate * 100, 1))

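# univariate histogram plot for literacy rate
hist_range = (0, 100)  # lower and upper limit of the x axis
n_bins = 4             # 0-25, 25-50, 50-75, 75-100
plt.figure()           # new figure so successive plots do not overlap
plt.hist(subset['Female_literacy_rate'].dropna(), bins=n_bins, range=hist_range,
         color='blue', histtype='bar', rwidth=0.8)
plt.xlabel('Female literacy rate')
plt.ylabel('Number of countries')
plt.title('Histogram for Literacy Rate')
plt.show()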

# bivariate scatter plot to see relation between literacy rate and legal aid score for countries
desc1 = subset['Female_literacy_rate'].describe()
print(desc1)
desc2 = subset['OVERALL AVERAGE SCORE'].describe()
print(desc2)
plt.figure()  # new figure so the scatter plot is not drawn over a histogram
scat1 = seaborn.regplot(x='Female_literacy_rate', y='OVERALL AVERAGE SCORE',
                        fit_reg=False, data=subset, color='orange')
plt.xlabel('Female Literacy Rate')
plt.ylabel('Legal Aid Score')
plt.title('Scatterplot for the Association Between Female Literacy Rate and Legal Aid Score for countries')
plt.show()

# bivariate scatter plot to see relation between literacy rate and happiness score for countries
desc3 = subset['Female_literacy_rate'].describe()
print(desc3)
desc4 = subset['Happiness_score'].describe()
print(desc4)
plt.figure()  # new figure so the scatter plot is not drawn over the previous plot
scat2 = seaborn.regplot(x='Female_literacy_rate', y='Happiness_score',
                        fit_reg=False, data=subset, color='green')
plt.xlabel('Female Literacy Rate')
plt.ylabel('Happiness score')
plt.title('Scatterplot for the Association Between Female Literacy Rate and Happiness score for countries')
plt.show()

data-viz-sowmya · 4 years ago

Week 3 Assignment: Data Management decisions

As I mentioned while choosing the dataset, the data is sourced from three different files that share the same dimension, i.e., country: population and literacy rates, happiness ranking, and the ranking of countries on legal aid for women. These are the three main variables considered.

Coding out missing data - On merging the data for the countries, I found that some countries have missing data for happiness score, i.e., some countries with data for population and literacy rates did not have a happiness score, and it was taken as zero on merging. Although the happiness score can range from 0 to 10, the scores in the actual dataset start at 2 and above, so I replaced these zeros with NaN. Leaving them as zero would have given the wrong impression that those countries had been scored zero.
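A small sketch shows why the recoding matters. The rows below are hypothetical; the point is only that a placeholder zero drags summary statistics down, while NaN is skipped by pandas.

```python
import numpy
import pandas

# Hypothetical merged rows; 0.0 marks a country that had no happiness score.
merged = pandas.DataFrame({
    "Country": ["A", "B", "C", "D"],
    "Happiness_score": [0.0, 5.5, 0.0, 6.5],
})

# Mean while the placeholder zeros are still treated as real scores.
mean_with_zeros = merged["Happiness_score"].mean()

# Recode the zeros to NaN, as in the assignment program.
merged["Happiness_score"] = merged["Happiness_score"].replace(0.0, numpy.nan)
mean_with_nan = merged["Happiness_score"].mean()   # NaN is skipped by default

print(mean_with_zeros, mean_with_nan)
```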

Coding in valid data / Creating subset of data – To make sure that the countries have data for all 3 variables, I created a subset of the data, eliminating null values and zeros.
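How the three conditions are combined decides which rows survive. The assignment program joins them with |, which keeps a row when any one variable is present; joining them with & keeps only rows where all three are present. The sketch below contrasts the two on a hypothetical four-row table; which behaviour suits the analysis is a judgment call, and the & variant is shown only as an alternative.

```python
import numpy
import pandas

# Hypothetical rows: only the first one has all three values.
toy = pandas.DataFrame({
    "Female_literacy_rate": [90.0, numpy.nan, 75.0, numpy.nan],
    "OVERALL AVERAGE SCORE": [80.0, 60.0, numpy.nan, numpy.nan],
    "Happiness_score": [6.1, 5.0, numpy.nan, numpy.nan],
})

# Comparisons with NaN evaluate to False, so these masks also drop missing values.
has_literacy = toy["Female_literacy_rate"] > 0
has_legal = toy["OVERALL AVERAGE SCORE"] > 0
has_happiness = toy["Happiness_score"] > 0

any_present = toy[has_literacy | has_legal | has_happiness]   # at least one value
all_present = toy[has_literacy & has_legal & has_happiness]   # all three values

print(len(toy), len(any_present), len(all_present))
```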

Binning or grouping variables – As part of the Week 2 assignment, while calculating frequency distributions, I created bins for all 3 variables, as the data in these variables is continuous, ranging from 0 to 10 or 0 to 100.

The frequency distribution for the legal aid a nation guarantees to women shows many countries falling into the 50 to 75 bin, which is about 44% of the total observations.

The frequency distribution for happiness score (men and women combined) shows many countries falling in the 4-6 bin, which is 28% of the countries. Data is not present for many countries, and hence the count for the 0-2 bin is very high, around 128. Only 2 countries in the data have a value below 3. The count of 128 consists of either missing data or rows for country groupings such as South Asia, Central Africa etc.

The frequency distribution shows that literacy rates are high among nations: 74 countries have a rate of 75 and above, which accounts for approximately 64% of the total observations.

After creating a subset of data, I see that the observations (no. of rows) have reduced from 266 to 209.

Assignment program :

# -*- coding: utf-8 -*-
"""
Spyder Editor

Week 3 - data management decisions
"""

import pandas
import numpy

# Import the entire dataset to memory
data = pandas.read_csv('Dataset_test.csv', low_memory=False)

# recode missing values to python missing (NaN)
data['Happiness_score'] = data['Happiness_score'].replace(0.000000, numpy.nan)
print(len(data))

# subset data: keep rows where at least one of female_literacy_rate,
# legal_aid_score and happiness_score is present and above zero
# (| is a logical OR; a comparison with NaN evaluates to False)
subset = data[(data['Female_literacy_rate'] > 0)
              | (data['OVERALL AVERAGE SCORE'] > 0)
              | (data['Happiness_score'] > 0)]
print(len(subset))

# build bins to categorize data using subset

# Calculate frequency distribution and percentage for legal aid score ranging from 0 to 100
legal_aid_bin = [-numpy.inf, 25, 50, 75, numpy.inf]  # build bins to categorize data
legal_aid_avg_score = pandas.cut(subset['OVERALL AVERAGE SCORE'], legal_aid_bin)  # each score is assigned a bin
legal_aid_count = legal_aid_avg_score.value_counts(sort=False)  # each bin value is counted
percent_legal_aid = legal_aid_avg_score.value_counts(sort=False, normalize=True)
print('Legal aid frequency distribution:\n', legal_aid_count)
print('Legal aid percent distribution:\n', round(percent_legal_aid * 100, 1))

# Calculate frequency distribution and percentage for happiness score ranging from 0 to 10
happiness_bin = [0, 2, 4, 6, 8]  # build bins to categorize data
happiness_score = pandas.cut(subset['Happiness_score'], happiness_bin)  # each score is assigned a bin
happiness_score_count = happiness_score.value_counts(sort=False)  # each bin value is counted
percent_happiness_score = happiness_score.value_counts(sort=False, normalize=True)
print('Happiness score frequency distribution:\n', happiness_score_count)
print('Happiness score percent distribution:\n', round(percent_happiness_score * 100, 1))

# Calculate frequency distribution and percentage for literacy rate ranging from 0 to 100
literacy_bin = [-numpy.inf, 25, 50, 75, numpy.inf]  # build bins to categorize data
literacy_rate = pandas.cut(subset['Female_literacy_rate'], literacy_bin)  # each score is assigned a bin
literacy_rate_count = literacy_rate.value_counts(sort=False)  # each bin value is counted
percent_literacy_rate = literacy_rate.value_counts(sort=False, normalize=True)
print('Literacy rate frequency distribution:\n', literacy_rate_count)
print('Literacy rate percent distribution:\n', round(percent_literacy_rate * 100, 1))

data-viz-sowmya · 4 years ago

Week 2 Assignment: Run Program on Frequency Distribution

The three variables considered for the assignment are: legal aid score ranging from 0 to 100 <OVERALL AVERAGE SCORE>, happiness score ranging from 0 to 10 <Happiness_score>, and literacy rate of female adults aged 15 years and above ranging from 0 to 100 <Female_literacy_rate>. I have categorized the data for each variable into bins to make counting and the frequency distribution calculation easier. The 0 to 100 variables are split into 4 bins: 0-25, 25-50, 50-75 and 75-100. The 0 to 10 variable is split into 0-2, 2-4 and 4-6, with 6 and above in one bin, as the data here goes up to a maximum of 8.
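As a minimal sketch of the binning described above, the snippet below applies pandas.cut with the edges -inf, 25, 50, 75 and inf to a handful of hypothetical scores. By default each bin is closed on the right, so a score of exactly 25 belongs to the first bin and 26 to the second.

```python
import numpy
import pandas

# Hypothetical scores chosen to sit on and around the bin edges.
scores = pandas.Series([10.0, 25.0, 26.0, 50.0, 75.0, 75.5, 100.0])

edges = [-numpy.inf, 25, 50, 75, numpy.inf]
binned = pandas.cut(scores, edges)                 # (-inf, 25], (25, 50], (50, 75], (75, inf]

counts = binned.value_counts(sort=False)           # frequency per bin
percent = binned.value_counts(sort=False, normalize=True)

print(counts)
print((percent * 100).round(1))
```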

The code is as below:

# -*- coding: utf-8 -*-

"""

Spyder Editor

Week 2 Frequency Distribution program

"""

import pandas

import numpy

# Import the entire dataset to memory

data = pandas.read_csv('Dataset_happiness_women_role.csv',low_memory=False)

# Calculate frequency distribution and Percentage for a variable about legal aid score ranging from 0 to 100

legal_aid_bin = [-numpy.inf, 25, 50, 75, numpy.inf] # build bins to categorize data

legal_aid_avg_score = pandas.cut(data['OVERALL AVERAGE SCORE'], legal_aid_bin) # each score is assigned a bin

legal_aid_count = legal_aid_avg_score.value_counts(sort=False) # each bin value is counted

percent_legal_aid = legal_aid_avg_score.value_counts(sort=False, normalize=True)

print('Legal aid frequency distribution:\n',legal_aid_count)

print('Legal aid percent distribution:\n',round(percent_legal_aid*100,1))

# Calculate frequency distribution and Percentage for a variable about happiness score ranging from 0 to 10

happiness_bin = [-numpy.inf, 2, 4, 6, numpy.inf] # build bins to categorize data

happiness_score = pandas.cut(data['Happiness_score'], happiness_bin) # each score is assigned a bin

happiness_score_count = happiness_score.value_counts(sort=False) # each bin value is counted

percent_happiness_score = happiness_score.value_counts(sort=False, normalize=True)

print('Happiness score frequency distribution:\n',happiness_score_count)

print('Happiness score percent distribution:\n',round(percent_happiness_score*100,1))

# Calculate frequency distribution and Percentage for a variable about literacy rate ranging from 0 to 100

literacy_bin = [-numpy.inf, 25, 50, 75, numpy.inf] # build bins to categorize data

literacy_rate = pandas.cut(data['Female_literacy_rate'], literacy_bin) # each score is assigned a bin

literacy_rate_count = literacy_rate.value_counts(sort=False) # each bin value is counted

percent_literacy_rate = literacy_rate.value_counts(sort=False, normalize=True)

print('Literacy rate frequency distribution:\n',literacy_rate_count)

print('Literacy rate percent distribution:\n',round(percent_literacy_rate*100,1))

Frequency distribution tables: output

The frequency distribution for the legal aid a nation guarantees to women shows many countries falling into the 50 to 75 bin, which is about 44% of the total observations.

The frequency distribution for happiness score (men and women combined) shows many countries falling in the 4-6 bin, which is 28% of the countries. Data is not present for many countries, and hence the count for the 0-2 bin is very high, around 128. No country in the data has a value below 3. The count of 128 consists of either missing data or rows for country groupings such as South Asia, Central Africa etc.

The frequency distribution shows that literacy rates are high among nations: 74 countries have a rate of 75 and above, which accounts for approximately 64% of the total observations.

data-viz-sowmya · 4 years ago

Research Question: World happiness from women’s literacy and legal aid perspective

I have created a codebook with variables and a dataset drawn from three different sources. The main variables are male and female literacy rates, the global ranking of countries on gender inequalities in legal treatment, and the global happiness ranking.

Step 1: Dataset

The dataset I chose is a combination of three different sources, researched and gathered around the same year, 2017:

1. World Bank data of human population – male and female, literacy rates - 2017

2. Global index ranking on women’s workforce equality from Council on Foreign Relations (CFR) – 2016-2017

3. World Happiness Report – Ranking of countries on Happiness Index –2017

Step 2: First Topic

As nations progress towards development, literacy rates grow among both men and women. Rising literacy among women is a good sign of development, and I would like to relate higher literacy rates to whether governments and countries are supportive in providing legal aid to women, and whether women are able to bargain for support and equal rights. Has increased literacy helped to increase legal aid?

Step 3: Codebook

As mentioned in Step 1, the dataset I chose is a combination of three different sources. For the first topic, the data comes from:

World Bank data of human population and literacy – The data for population total, male and female is sourced from United Nations Population Division and other combined sources as mentioned in variable list. The UNESCO Institute for Statistics (UIS) is the official statistical agency of UNESCO and trusted source of internationally-comparable data on education, science, culture and communication. The data for literacy rate of male and female is sourced from here for the year under consideration.

Data on legal aid – i.e. Global index ranking on women’s workforce equality - The World Bank has created a dataset considering 50 questions of legal gender inequalities and an additional 6 questions added by the Council on Foreign Relations(CFR). CFR grouped these into seven categories: accessing institutions, building credit, getting a job, going to court, protecting women from violence, providing incentives to work, and using property. CFR then calculated a ranking of countries, giving each an overall average score between 0 and 100 (100 being the best). These seven categories and average score are the considered variables here.

Step 4: Second topic of interest

Here I would like to examine whether literacy and legal aid for women have an impact on the happiness of the whole population, men and women alike.

Step 5: Variables and question pertaining to second topic

The Global Happiness ranking of 2017 is based on six key variables: GDP per capita, social support, healthy life expectancy, social freedom, generosity, and absence of corruption. The emphasis of the 2017 report is on the importance of the social foundations of happiness. The World Happiness Report is a publication of the Sustainable Development Solutions Network, and the data is powered by the Gallup World Poll and Lloyd’s Register. The analysis is based chiefly on individual life evaluations, roughly 1,000 per year in each of more than 150 countries, as measured by answers to the Cantril ladder question: imagine a ladder, with steps numbered from 0 at the bottom to 10 at the top. The top of the ladder represents the best possible life for you and the bottom of the ladder represents the worst possible life for you. On which step of the ladder would you say you personally feel you stand at this time? The report presents the average life evaluation scores for each country, based on averages from surveys covering the most recent three-year period, which for this report is 2014-2016.

Step 6: Literature Review – Bibliographic information

Social well-being and the happiness of every individual are measured across various parameters/variables such as education, health, income, good governance, social freedom etc. Topics such as the relationship between health performance and education performance, and the association between well-being and education performance, were explored in the journal European Psychologist with data from 30 nations.

An article on gender issues from Springer Link, titled “Happiness the World Over”, notes that subjective wellbeing is positively correlated with countries that enjoy individual and economic freedom, higher life expectancy, lower rates of infant mortality and greater wealth. There were no significant correlations between happiness and marriage rates, divorce rates, fertility rates, literacy rates, suicide rates and penal incarceration rates.

The world happiness report also looks into the importance of employment for people’s subjective wellbeing through its Happiness at Work analysis for 2017. When considering the world’s population as a whole, people with a job evaluate the quality of their lives much more favorably than those who are unemployed. The importance of having a job extends far beyond the salary attached to it, with non-pecuniary aspects of employment such as social status, social relations, daily structure, and goals all exerting a strong influence on people’s happiness.

Step 7: Hypothesis

The subjective wellbeing of nations, societies, families and individuals, which leads to happiness, is correlated with various factors such as health, education, income, work, good governance, and economic and social freedom. But does the role of women, who form an equal part of these factors, have a major influence on happiness and wellbeing? Do higher literacy rates and legal aid for women from governments lead to a higher ranking of nations on happiness?
