EXAMPLE USING ANOVA
/* Example Using ANOVA */
LIBNAME mydata "/courses/d1406ae5ba27fe300 " access=readonly;
DATA new; set mydata.nesarc_pds;
PROC SORT; BY S2AQ6B;
PROC ANOVA; CLASS REGION;
MODEL AGE=REGION;
MEANS REGION; BY S2AQ6B;
RUN;
The ANOVA Procedure (BY group S2AQ6B=.)

Class Level Information
Class   Levels  Values
REGION  3       2 3 4

Number of Observations Read: 5
Number of Observations Used: 5

Dependent Variable: AGE

Source           DF  Sum of Squares  Mean Square  F Value  Pr > F
Model             2     2400.300000  1200.150000    13.76  0.0678
Error             2      174.500000    87.250000
Corrected Total   4     2574.800000

R-Square  Coeff Var  Root MSE  AGE Mean
0.932228  25.10960   9.340771  37.20000

Source  DF  Anova SS     Mean Square  F Value  Pr > F
REGION  2   2400.300000  1200.150000  13.76    0.0678
Because the p-value (.0678) is above the .05 threshold, there is no significant moderation among age, region, and wine drinking.
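The same kind of one-way test can be sketched in Python with scipy; the group values below are made up for illustration and are not the NESARC data:

```python
# Hypothetical ages grouped by region (codes 2, 3, 4), mirroring the
# SAS MODEL AGE=REGION step above; the values are illustrative only.
from scipy import stats

region2 = [25, 28, 31]
region3 = [45, 52, 49]
region4 = [33, 37]

f_stat, p_value = stats.f_oneway(region2, region3, region4)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")

# As in the SAS output, a p-value above .05 would mean the region
# means are not significantly different at the usual threshold.
```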
0 notes
LIBNAME mydata "/courses/d1406ae5ba27fe300 " access=readonly;
DATA new; set mydata.nesarc_pds;
IF S1Q6A eq 1 THEN education=.;
ELSE IF S1Q6A LE 3 THEN education=1;
ELSE IF S1Q6A LE 5 THEN education=2;
ELSE IF S1Q6A LE 7 THEN education=3;
ELSE IF S1Q6A GT 10 THEN education=3;
PROC SORT; by IDNUM;
PROC CORR; VAR S1Q10B S1Q11B;
RUN;
The CORR Procedure

2 Variables: S1Q10B S1Q11B

Simple Statistics
Variable  N      Mean     Std Dev  Sum     Minimum  Maximum
S1Q10B    43093  6.75750  4.40665  291201  0        17.00000
S1Q11B    43093  9.42141  4.84308  405997  1.00000  21.00000

Pearson Correlation Coefficients, N = 43093
Prob > |r| under H0: Rho=0
        S1Q10B   S1Q11B
S1Q10B  1.00000  0.67157
                 <.0001
S1Q11B  0.67157  1.00000
        <.0001
Individual income (S1Q10B) and household income (S1Q11B) are positively correlated (r = 0.67157), and because the p-value is less than .0001, the association is statistically significant.
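The PROC CORR step above can be sketched in Python with scipy; the two arrays below are made-up stand-ins for S1Q10B and S1Q11B:

```python
# Pearson correlation between two income-like variables, mirroring
# PROC CORR above; the arrays are illustrative, not the NESARC data.
import numpy as np
from scipy import stats

personal_income = np.array([3, 5, 7, 9, 11, 13, 15])
household_income = np.array([4, 6, 9, 10, 14, 15, 20])

r, p = stats.pearsonr(personal_income, household_income)
print(f"r = {r:.5f}, p = {p:.4f}")

# A positive r with a p-value below .05 indicates a significant
# positive linear association, as in the SAS output.
```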
0 notes
Chi-Square
LIBNAME mydata "/courses/d1406ae5ba27fe300 " access=readonly;
DATA new; set mydata.nesarc_pds;
LABEL TAB12MDX="Tobacco Dependence Past 12 Months"
CHECK321="Smoked Cigarettes in Past 12 Months"
S3AQ3B1="Usual Smoking Frequency"
S3AQ3C1="Usual Smoking Quantity"
S1Q6A="Highest Grade or year of school completed"
S2AQ8A="How often drank any alcohol in the last 12 months";
/* Set appropriate missing data as needed */
IF S3AQ3B1=9 THEN S3AQ3B1=.;
IF S3AQ3C1=99 THEN S3AQ3C1=.;
IF S2AQ8A=99 THEN S2AQ8A=.;
IF S1Q6A= 1 THEN S1Q6A= .;
IF 2 LE S1Q6A LE 4 THEN GRADE=1;
IF 5 LE S1Q6A LE 7 THEN GRADE=2;
IF 8 LE S1Q6A LE 9 THEN GRADE=3;
IF 10 LE S1Q6A LE 14 THEN GRADE=4;
INTELLIGENCE=GRADE*S1Q6A;
/* Subset data to include only respondents age 18 and older */
IF AGE GE 18;
PROC SORT; by IDNUM;
PROC FREQ; TABLES S2AQ8A*S1Q6A/CHISQ;
RUN;
DATA COMPARISON1; SET NEW;
IF S1Q6A=1 OR S1Q6A=2;
PROC SORT; BY IDNUM;
PROC FREQ; TABLES S2AQ8A*S1Q6A/CHISQ;
RUN;
DATA COMPARISON1; SET NEW;
IF S1Q6A=1 OR S1Q6A=3;
PROC SORT; BY IDNUM;
PROC FREQ; TABLES S2AQ8A*S1Q6A/CHISQ;
RUN;
DATA COMPARISON1; SET NEW;
IF S1Q6A=1 OR S1Q6A=4;
PROC SORT; BY IDNUM;
PROC FREQ; TABLES S2AQ8A*S1Q6A/CHISQ;
RUN;
DATA COMPARISON1; SET NEW;
IF S1Q6A=1 OR S1Q6A=5;
PROC SORT; BY IDNUM;
PROC FREQ; TABLES S2AQ8A*S1Q6A/CHISQ;
RUN;
DATA COMPARISON1; SET NEW;
IF S1Q6A=1 OR S1Q6A=6;
PROC SORT; BY IDNUM;
PROC FREQ; TABLES S2AQ8A*S1Q6A/CHISQ;
RUN;
DATA COMPARISON1; SET NEW;
IF S1Q6A=1 OR S1Q6A=7;
PROC SORT; BY IDNUM;
PROC FREQ; TABLES S2AQ8A*S1Q6A/CHISQ;
RUN;
DATA COMPARISON1; SET NEW;
IF S1Q6A=1 OR S1Q6A=8;
PROC SORT; BY IDNUM;
PROC FREQ; TABLES S2AQ8A*S1Q6A/CHISQ;
RUN;
DATA COMPARISON1; SET NEW;
IF S1Q6A=1 OR S1Q6A=9;
PROC SORT; BY IDNUM;
PROC FREQ; TABLES S2AQ8A*S1Q6A/CHISQ;
RUN;
DATA COMPARISON1; SET NEW;
IF S1Q6A=1 OR S1Q6A=10;
PROC SORT; BY IDNUM;
PROC FREQ; TABLES S2AQ8A*S1Q6A/CHISQ;
RUN;
DATA COMPARISON1; SET NEW;
IF S1Q6A=1 OR S1Q6A=11;
PROC SORT; BY IDNUM;
PROC FREQ; TABLES S2AQ8A*S1Q6A/CHISQ;
RUN;
DATA COMPARISON1; SET NEW;
IF S1Q6A=1 OR S1Q6A=12;
PROC SORT; BY IDNUM;
PROC FREQ; TABLES S2AQ8A*S1Q6A/CHISQ;
RUN;
DATA COMPARISON1; SET NEW;
IF S1Q6A=1 OR S1Q6A=13;
PROC SORT; BY IDNUM;
PROC FREQ; TABLES S2AQ8A*S1Q6A/CHISQ;
RUN;
DATA COMPARISON1; SET NEW;
IF S1Q6A=1 OR S1Q6A=14;
PROC SORT; BY IDNUM;
PROC FREQ; TABLES S2AQ8A*S1Q6A/CHISQ;
RUN;
The table tells me about the probability of an association between grade level completed and alcohol use. However, based on the chi-square value, I don't think I can make any valid conclusions about these variables being correlated.
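The thirteen near-identical comparison blocks above could be collapsed into a loop. A pandas/scipy sketch with made-up stand-in data (real code would load nesarc_pds.csv and loop over all education levels, using a Bonferroni-adjusted threshold for the pairwise tests):

```python
# Pairwise chi-square tests of drinking frequency vs. education level,
# looping instead of repeating one DATA/PROC FREQ block per comparison.
# The DataFrame is a made-up stand-in for NESARC's S2AQ8A and S1Q6A.
import pandas as pd
from scipy.stats import chi2_contingency

data = pd.DataFrame({
    'S1Q6A':  [1, 1, 1, 2, 2, 2, 3, 3, 3, 1, 2, 3],
    'S2AQ8A': [1, 2, 1, 2, 2, 1, 1, 1, 2, 2, 1, 2],
})

levels = sorted(data['S1Q6A'].unique())
comparisons = [(levels[0], lv) for lv in levels[1:]]
alpha = 0.05 / len(comparisons)   # Bonferroni adjustment

for a, b in comparisons:
    sub = data[data['S1Q6A'].isin([a, b])]
    table = pd.crosstab(sub['S2AQ8A'], sub['S1Q6A'])
    chi2, p, dof, expected = chi2_contingency(table)
    print(f"{a} vs {b}: chi2 = {chi2:.3f}, p = {p:.4f}, "
          f"significant at adjusted alpha: {p < alpha}")
```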
0 notes
Data analysis for grade vs gender and race
LIBNAME mydata "/courses/d1406ae5ba27fe300 " access=readonly;
DATA new; set mydata.nesarc_pds;
LABEL TAB12MDX="Tobacco Dependence Past 12 Months"
CHECK321="Smoked Cigarettes in Past 12 Months"
S3AQ3B1="Usual Smoking Frequency"
S3AQ3C1="Usual Smoking Quantity"
S1Q6A="Highest Grade or year of school completed"
S2AQ8A="How often drank any alcohol in the last 12 months";
/* Set appropriate missing data as needed */
IF S3AQ3B1=9 THEN S3AQ3B1=.;
IF S3AQ3C1=99 THEN S3AQ3C1=.;
IF S2AQ8A=99 THEN S2AQ8A=.;
IF S1Q6A= 1 THEN S1Q6A= .;
IF 2 LE S1Q6A LE 4 THEN GRADE=1;
IF 5 LE S1Q6A LE 7 THEN GRADE=2;
IF 8 LE S1Q6A LE 9 THEN GRADE=3;
IF 10 LE S1Q6A LE 14 THEN GRADE=4;
INTELLIGENCE=GRADE*S1Q6A;
/* Subset data to include only respondents age 18 and older */
IF AGE GE 18;
PROC SORT; by IDNUM;
PROC ANOVA; CLASS SEX;
MODEL S1Q6A=SEX;
MEANS SEX;
PROC ANOVA; CLASS ETHRACE2A;
MODEL S1Q6A=ETHRACE2A;
MEANS ETHRACE2A/DUNCAN;
RUN;
The ANOVA Procedure

Class Level Information
Class  Levels  Values
SEX    2       1 2

Number of Observations Read: 43093
Number of Observations Used: 42875

Dependent Variable: S1Q6A  Highest Grade or year of school completed

Source           DF     Sum of Squares  Mean Square  F Value  Pr > F
Model                1        112.9008     112.9008    18.75  <.0001
Error            42873     258168.3027       6.0217
Corrected Total  42874     258281.2035

R-Square  Coeff Var  Root MSE  S1Q6A Mean
0.000437  25.84702   2.453915  9.493994

Source  DF  Anova SS     Mean Square  F Value  Pr > F
SEX     1   112.9008106  112.9008106  18.75    <.0001

Class Level Information
Class      Levels  Values
ETHRACE2A  5       1 2 3 4 5

Number of Observations Read: 43093
Number of Observations Used: 42875

Dependent Variable: S1Q6A  Highest Grade or year of school completed

Source           DF     Sum of Squares  Mean Square  F Value  Pr > F
Model                4      15981.1633    3995.2908   706.88  <.0001
Error            42870     242300.0402       5.6520
Corrected Total  42874     258281.2035

R-Square  Coeff Var  Root MSE  S1Q6A Mean
0.061875  25.04096   2.377388  9.493994

Source     DF  Anova SS     Mean Square  F Value  Pr > F
ETHRACE2A  4   15981.16328  3995.29082   706.88   <.0001

Duncan's Multiple Range Test for S1Q6A
Note: This test controls the Type I comparisonwise error rate, not the experimentwise error rate.
Alpha 0.05; Error Degrees of Freedom 42870; Error Mean Square 5.651972; Harmonic Mean of Cell Sizes 2007.656
Note: Cell sizes are not equal.
Number of Means:  2      3      4      5
Critical Range:   .1471  .1549  .1601  .1639
Males generally have a higher education level but were also more likely to never attend school or to drop out early.
Hispanics and Latinos have the lowest level of education, while Asians tend to have the highest.
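scipy has no Duncan test, but the post-hoc pairwise comparison in the MEANS ETHRACE2A/DUNCAN step can be sketched with Tukey's HSD, a stricter test that does control the experimentwise error rate; the group samples below are made up:

```python
# Post-hoc pairwise comparison of group means, analogous to the
# MEANS ETHRACE2A/DUNCAN step above. The samples are illustrative
# stand-ins for three ETHRACE2A levels, not the NESARC data.
from scipy.stats import tukey_hsd

group1 = [10, 12, 11, 13, 12]
group2 = [8, 9, 7, 8, 9]
group3 = [12, 14, 13, 15, 14]

result = tukey_hsd(group1, group2, group3)
print(result)   # pairwise mean differences with confidence intervals
```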
0 notes
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
# Read the dataset
data = pd.read_csv('gapminder.csv', low_memory=False)
# Replace empty strings with NaN
data = data.replace(r'^\s*$', np.nan, regex=True)
# Setting variables you will be working with to numeric
data['internetuserate'] = pd.to_numeric(data['internetuserate'], errors='coerce')
data['urbanrate'] = pd.to_numeric(data['urbanrate'], errors='coerce')
data['incomeperperson'] = pd.to_numeric(data['incomeperperson'], errors='coerce')
data['hivrate'] = pd.to_numeric(data['hivrate'], errors='coerce')
# 'country' holds string names; coercing it to numeric would turn every value into NaN, so leave it as a string identifier
data['alcconsumption'] = pd.to_numeric(data['alcconsumption'], errors='coerce')
data['lifeexpectancy'] = pd.to_numeric(data['lifeexpectancy'], errors='coerce')
data['employrate'] = pd.to_numeric(data['employrate'], errors='coerce')
# Descriptive statistics
desc1 = data['alcconsumption'].describe()
print(desc1)
desc2 = data['employrate'].describe()
print(desc2)
# Basic scatterplot: Q -> Q
plt.figure(figsize=(10, 6))
sns.regplot(x="alcconsumption", y="hivrate", fit_reg=False, data=data)
plt.xlabel('alcconsumption')
plt.ylabel('hivrate')
plt.title('Scatterplot for the Association Between alcconsumption and HIVRate')
plt.show()
plt.figure(figsize=(10, 6))
sns.regplot(x="lifeexpectancy", y="employrate", data=data)
plt.xlabel('lifeexpectancy')
plt.ylabel('employrate')
plt.title('Scatterplot for the Association Between Life Expectancy and Employment Rate')
plt.show()
# Quartile split (use qcut function & ask for 4 groups - gives you quartile split)
print('Employrate - 4 categories - quartiles')
data['EMPLOYRATE4'] = pd.qcut(data['employrate'], 4, labels=["1=25th%tile", "2=50%tile", "3=75%tile", "4=100%tile"])
c10 = data['EMPLOYRATE4'].value_counts(sort=False, dropna=True)
print(c10)
# Bivariate bar graph C -> Q
plt.figure(figsize=(10, 6))
sns.catplot(x='EMPLOYRATE4', y='alcconsumption', data=data, kind="bar", ci=None)
plt.xlabel('Employ Rate')
plt.ylabel('Alcohol Consumption Rate')
plt.title('Mean Alcohol Consumption Rate by Employment Rate Quartile')
plt.show()
c11 = data.groupby('EMPLOYRATE4').size()
print(c11)
result = data.sort_values(by=['EMPLOYRATE4'], ascending=True)
print(result)
My data shows little correlation between alcohol consumption and HIV rate. However, when life expectancy is lower, the employment rate is higher, probably because people retire as they get older. Finally, I found that the average alcohol consumption rate goes down from one employment-rate quartile to the next.
0 notes
weight of smokers
import pandas as pd
import numpy as np
# Read the dataset
data = pd.read_csv('nesarc_pds.csv', low_memory=False)
# Bug fix for display formats to avoid runtime errors
pd.set_option('display.float_format', lambda x: '%f' % x)
# Setting variables to numeric
numeric_columns = ['TAB12MDX', 'CHECK321', 'S3AQ3B1', 'S3AQ3C1', 'WEIGHT']
data[numeric_columns] = data[numeric_columns].apply(pd.to_numeric)
# Subset data to adults 100 to 600 lbs who have smoked in the past 12 months
sub1 = data[(data['WEIGHT'] >= 100) & (data['WEIGHT'] <= 600) & (data['CHECK321'] == 1)]
# Make a copy of the subsetted data
sub2 = sub1.copy()
def print_value_counts(df, column, description):
    """Print value counts for a specific column in the dataframe."""
    print(f'{description}')
    counts = df[column].value_counts(sort=False, dropna=False)
    print(counts)
    return counts
# Initial counts for S3AQ3B1
print_value_counts(sub2, 'S3AQ3B1', 'Counts for original S3AQ3B1')
# Recode missing values to NaN
sub2['S3AQ3B1'] = sub2['S3AQ3B1'].replace(9, np.nan)
sub2['S3AQ3C1'] = sub2['S3AQ3C1'].replace(99, np.nan)
# Counts after recoding missing values
print_value_counts(sub2, 'S3AQ3B1', 'Counts for S3AQ3B1 with 9 set to NaN and number of missing requested')
# Recode missing values for S2AQ8A
sub2['S2AQ8A'] = sub2['S2AQ8A'].fillna(11)
sub2['S2AQ8A'] = sub2['S2AQ8A'].replace(99, np.nan)
# Check coding for S2AQ8A
print_value_counts(sub2, 'S2AQ8A', 'S2AQ8A with Blanks recoded as 11 and 99 set to NaN')
print(sub2['S2AQ8A'].describe())
# Recode values for S3AQ3B1 into new variables
recode1 = {1: 6, 2: 5, 3: 4, 4: 3, 5: 2, 6: 1}
recode2 = {1: 30, 2: 22, 3: 14, 4: 5, 5: 2.5, 6: 1}
sub2['USFREQ'] = sub2['S3AQ3B1'].map(recode1)
sub2['USFREQMO'] = sub2['S3AQ3B1'].map(recode2)
# Create secondary variable
sub2['NUMCIGMO_EST'] = sub2['USFREQMO'] * sub2['S3AQ3C1']
# Examine frequency distributions for WEIGHT
print_value_counts(sub2, 'WEIGHT', 'Counts for WEIGHT')
print('Percentages for WEIGHT')
print(sub2['WEIGHT'].value_counts(sort=False, normalize=True))
# Quartile split for WEIGHT
sub2['WEIGHTGROUP4'] = pd.qcut(sub2['WEIGHT'], 4, labels=["1=0%tile", "2=25%tile", "3=50%tile", "4=75%tile"])
print_value_counts(sub2, 'WEIGHTGROUP4', 'WEIGHT - 4 categories - quartiles')
# Categorize WEIGHT into 3 groups (100-200 lbs, 200-300 lbs, 300-600 lbs)
sub2['WEIGHTGROUP3'] = pd.cut(sub2['WEIGHT'], [100, 200, 300, 600], labels=["100-200 lbs", "201-300 lbs", "301-600 lbs"])
print_value_counts(sub2, 'WEIGHTGROUP3', 'Counts for WEIGHTGROUP3')
# Crosstab of WEIGHTGROUP3 and WEIGHT
print(pd.crosstab(sub2['WEIGHTGROUP3'], sub2['WEIGHT']))
# Frequency distribution for WEIGHTGROUP3
print_value_counts(sub2, 'WEIGHTGROUP3', 'Counts for WEIGHTGROUP3')
print('Percentages for WEIGHTGROUP3')
print(sub2['WEIGHTGROUP3'].value_counts(sort=False, normalize=True))
Counts for original S3AQ3B1
S3AQ3B1
1.000000 81
2.000000 6
5.000000 2
4.000000 6
3.000000 3
6.000000 4
Name: count, dtype: int64
Counts for S3AQ3B1 with 9 set to NaN and number of missing requested
S3AQ3B1
1.000000 81
2.000000 6
5.000000 2
4.000000 6
3.000000 3
6.000000 4
Name: count, dtype: int64
S2AQ8A with Blanks recoded as 11 and 99 set to NaN
S2AQ8A
6 12
4 2
7 14
5 16
28
1 6
2 2
10 9
3 5
9 5
8 3
Name: count, dtype: int64
count 102
unique 11
top
freq 28
Name: S2AQ8A, dtype: object
Counts for WEIGHT
WEIGHT
534.703087 1
476.841101 5
534.923423 1
568.208544 1
398.855701 1
..
584.984241 1
577.814060 1
502.267758 1
591.875275 1
483.885024 1
Name: count, Length: 86, dtype: int64
Percentages for WEIGHT
WEIGHT
534.703087 0.009804
476.841101 0.049020
534.923423 0.009804
568.208544 0.009804
398.855701 0.009804
584.984241 0.009804
577.814060 0.009804
502.267758 0.009804
591.875275 0.009804
483.885024 0.009804
Name: proportion, Length: 86, dtype: float64
WEIGHT - 4 categories - quartiles
WEIGHTGROUP4
1=0%tile 26
2=25%tile 25
3=50%tile 25
4=75%tile 26
Name: count, dtype: int64
Counts for WEIGHTGROUP3
WEIGHTGROUP3
100-200 lbs 0
201-300 lbs 0
301-600 lbs 102
Name: count, dtype: int64
WEIGHT 398.855701 437.144557 … 599.285226 599.720557
WEIGHTGROUP3 …
301-600 lbs 1 1 … 1 1
[1 rows x 86 columns]
Counts for WEIGHTGROUP3
WEIGHTGROUP3
100-200 lbs 0
201-300 lbs 0
301-600 lbs 102
Name: count, dtype: int64
Percentages for WEIGHTGROUP3
WEIGHTGROUP3
100-200 lbs 0.000000
201-300 lbs 0.000000
301-600 lbs 1.000000
Name: proportion, dtype: float64
I changed the code to look at the weight of smokers who have smoked in the past year. For WEIGHTGROUP3, all 102 people who smoked in the last year fell into the 301-600 lbs group.
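One detail of the grouping above worth knowing: pd.cut builds right-closed intervals, so the left edge of the first bin (exactly 100) is excluded unless include_lowest=True. A small sketch with made-up weights:

```python
# pd.cut uses right-closed intervals (100, 200], (200, 300], (300, 600],
# so a value of exactly 100 is excluded unless include_lowest=True.
import pandas as pd

weights = pd.Series([100, 150, 250, 450])

default_bins = pd.cut(weights, [100, 200, 300, 600],
                      labels=["100-200", "201-300", "301-600"])
print(default_bins.isna().sum())   # the value 100 falls outside the bins

inclusive = pd.cut(weights, [100, 200, 300, 600],
                   labels=["100-200", "201-300", "301-600"],
                   include_lowest=True)
print(inclusive.isna().sum())      # 100 now lands in the first bin
```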
0 notes
gapminder data review
# -*- coding: utf-8 -*-
"""
Spyder Editor
This is a temporary script file.
"""
import pandas
import numpy
# read file
data = pandas.read_csv('gapminder.csv', low_memory=False)
# set columns to uppercase
data.columns = map(str.upper, data.columns)
# bug fix for display formats
pandas.set_option("display.float_format", lambda x: "%f" %x)
# print number of rows and columns
print(len(data)) # number of observations (rows)
print(len(data.columns)) # number of variables (columns)
# print income per person counts and percentages
print("counts for incomeperperson = income per person")
c1 = data["INCOMEPERPERSON"].value_counts(sort=False)
print (c1)
print("percentages for income per person")
p1 = data["INCOMEPERPERSON"].value_counts(sort=False, normalize=True)
print(p1)
# print HIVRATE counts and percentages
print("counts for HIV RATE")
c1 = data["HIVRATE"].value_counts(sort=False)
print (c1)
print("percentages for HIV RATE")
p1 = data["HIVRATE"].value_counts(sort=False, normalize=True)
print(p1)
# print ALCOHOL CONSUMPTION counts and percentages
print("counts for ALCOHOL CONSUMPTION")
c1 = data["ALCCONSUMPTION"].value_counts(sort=False)
print (c1)
print("percentages for ALCOHOL CONSUMPTION")
p1 = data["ALCCONSUMPTION"].value_counts(sort=False, normalize=True)
print(p1)
213
16
counts for incomeperperson = income per person
INCOMEPERPERSON
23
1914.99655094922 1
2231.99333515006 1
21943.3398976022 1
1381.00426770244 1
..
5528.36311387522 1
722.807558834445 1
610.3573673206 1
432.226336974583 1
320.771889948584 1
Name: count, Length: 191, dtype: int64
percentages for income per person
INCOMEPERPERSON
0.107981
1914.99655094922 0.004695
2231.99333515006 0.004695
21943.3398976022 0.004695
1381.00426770244 0.004695
5528.36311387522 0.004695
722.807558834445 0.004695
610.3573673206 0.004695
432.226336974583 0.004695
320.771889948584 0.004695
Name: proportion, Length: 191, dtype: float64
counts for HIV RATE
HIVRATE
66
.1 28
2 2
.5 5
.3 10
3.1 1
.06 16
1.4 1
.2 15
2.3 1
1.2 4
24.8 1
.45 1
3.3 1
5.3 1
4.7 1
3.4 3
.4 9
2.5 2
.9 4
.8 5
5 1
5.2 1
1.8 1
1.3 2
1.9 1
1.7 1
6.3 1
.7 3
23.6 1
1.5 2
11 1
1 4
11.5 1
.6 3
13.1 1
3.6 1
2.9 1
1.6 1
17.8 1
1.1 2
25.9 1
5.6 1
3.2 1
6.5 1
13.5 1
14.3 1
Name: count, dtype: int64
percentages for HIV RATE
HIVRATE
0.309859
.1 0.131455
2 0.009390
.5 0.023474
.3 0.046948
3.1 0.004695
.06 0.075117
1.4 0.004695
.2 0.070423
2.3 0.004695
1.2 0.018779
24.8 0.004695
.45 0.004695
3.3 0.004695
5.3 0.004695
4.7 0.004695
3.4 0.014085
.4 0.042254
2.5 0.009390
.9 0.018779
.8 0.023474
5 0.004695
5.2 0.004695
1.8 0.004695
1.3 0.009390
1.9 0.004695
1.7 0.004695
6.3 0.004695
.7 0.014085
23.6 0.004695
1.5 0.009390
11 0.004695
1 0.018779
11.5 0.004695
.6 0.014085
13.1 0.004695
3.6 0.004695
2.9 0.004695
1.6 0.004695
17.8 0.004695
1.1 0.009390
25.9 0.004695
5.6 0.004695
3.2 0.004695
6.5 0.004695
13.5 0.004695
14.3 0.004695
Name: proportion, dtype: float64
counts for ALCOHOL CONSUMPTION
ALCCONSUMPTION
.03 1
7.29 1
.69 1
10.17 1
5.57 1
..
7.6 1
3.91 1
.2 1
3.56 1
4.96 1
Name: count, Length: 181, dtype: int64
percentages for ALCOHOL CONSUMPTION
ALCCONSUMPTION
.03 0.004695
7.29 0.004695
.69 0.004695
10.17 0.004695
5.57 0.004695
7.6 0.004695
3.91 0.004695
.2 0.004695
3.56 0.004695
4.96 0.004695
Name: proportion, Length: 181, dtype: float64
Analysis: The table shows that for income per person and alcohol consumption, nearly every category has a unique value, with country being the unique identifier for each row. HIV rate, in contrast, has many repeated low values. I am hoping to correlate all this data together to draw conclusions.
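The unlabeled rows in the counts above are empty strings from the CSV rather than true missing values; a small sketch (illustrative series, not the gapminder data) of converting them so the missing count is explicit:

```python
# Empty strings in a CSV column show up as their own value_counts
# category; converting them to NaN makes the missing count explicit.
import numpy as np
import pandas as pd

col = pd.Series(['1914.9', '', '2231.9', '', '610.3'])
col = col.replace(r'^\s*$', np.nan, regex=True)
col = pd.to_numeric(col, errors='coerce')

print(col.isna().sum())   # number of missing observations
```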
0 notes
Data Analytics Project
The data set I have selected comes from the National Epidemiologic Survey of Drug Use and Health.
My question: is there a correlation between drug use and depression within the dataset? I believe there will be a positive correlation between the two.
In Google Scholar, I searched "drug abuse rates vs major depression" and found an article that makes me think my hypothesis will be correct.
The article states the following: "For example, the crude lifetime prevalences for depression ranged from 0.7 to 8.6 per 100; for alcohol it was 7.5 to 32.6, and for drugs it was 3.1 to 10.5."
This makes me think drugs are not the highest cause of depression but are a contributing factor.
1 note