oorangesoda · 2 months ago
Text
I know love is stored in the OC because what do you mean I'm helplessly attached to a guy I created and drew for the first time only four months ago
5 notes · View notes
because-its-important · 7 years ago
Text
what's the most annoying question to ask a nun* in 1967?
tl;dr - In 1967, a very long survey was administered to nearly 140,000 American women in Catholic ministry. I wrote this script, which makes the survey data work-ready and satisfies a very silly initial inquiry: Which survey question did the sisters find most annoying?
* The study participants are never referred to as nuns, so I kind of suspect that not all sisters are nuns, but I couldn't find a definitive answer about this during a brief search. 'Nun' seemed like an efficient shorthand for purposes of an already long title, but if this is wrong please holler at me!
During my first week at Recurse I made a quick game using a new language and a new toolset. Making a game on my own had been a long-running item on my list of arbitrary-but-personally-meaningful goals, so being able to cross it off felt pretty good!
Another such goal I've had for a while goes something like this: "Develop the skills to be able to find a compelling data set, ask some questions, and share the results." As such, I spent last week familiarizing myself with Python 🐍, selecting a fun dataset, prepping it for analysis, and indulging my curiosity.
the process
On recommendation from Robert Schuessler, another Recurser in my batch, I read through the first ten chapters in Python Crash Course and did the data analysis project. This section takes you through comparing time series data using weather reports for two different locations, then through plotting country populations on a world map.
During data analysis study group, Robert suggested that we find a few datasets and write scripts to get them ready to work with as a sample starter-pack for the group. Jeremy Singer-Vine's collection of esoteric datasets, Data Is Plural, came to mind immediately. I was super excited to finally have an excuse to pore through it and eagerly set about picking a real mixed bag of six different data sets.
One of those datasets was The Sister Survey, a huge, one-of-a-kind collection of data on the opinions of American Catholic sisters about religious life. When I read the first question, I was hooked.
"It seems to me that all our concepts of God and His activity are to some degree historically and culturally conditioned, and therefore we must always be open to new ways of approaching Him."
I decided I wanted to start with this survey and spend enough time with it to answer at least one easy question. A quick skim of the Questions and Responses file showed that of the multiple choice answer options, a recurring one was: "The statement is so annoying to me that I cannot answer."
I thought this was a pretty funny option, especially given that participants were already tolerant enough to take such an enormous survey! How many questions can one answer before any question is too annoying to answer? 🤔 I decided it'd be fairly simple to find the most annoying question, so I started there.
I discovered pretty quickly that while the survey responses are in a large yet blessedly simple csv, the file with the question and answers key is just a big ole plain text. My solution was to regex through every line in the txt file and build out a survey_key dict that holds the question text and another dict of the set of possible answers for each question. This works pretty well, though I've spotted at least one instance where the txt file is inconsistently formatted and therefore breaks answer retrieval.
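Here's roughly the shape of that regex pass, sketched against a made-up key format — the real txt file's layout (and its inconsistent spots) will differ, so treat the patterns as placeholders:

```python
import re

def build_survey_key(lines):
    """Build {question_id: {"text": ..., "answers": {code: answer_text}}}
    from a plain-text key. The line formats below are hypothetical."""
    survey_key = {}
    current = None
    q_pat = re.compile(r"^Q(\d+)[.:]\s*(.+)")       # e.g. "Q1. It seems to me..."
    a_pat = re.compile(r"^\s+(\d+)\s*[-=]\s*(.+)")  # e.g. "  5 - Agree"
    for line in lines:
        if m := q_pat.match(line):
            current = m.group(1)
            survey_key[current] = {"text": m.group(2).strip(), "answers": {}}
        elif current and (m := a_pat.match(line)):
            survey_key[current]["answers"][m.group(1)] = m.group(2).strip()
    return survey_key

key = build_survey_key([
    "Q1. It seems to me that all our concepts of God...",
    "  1 - Agree",
    "  5 - The statement is so annoying to me that I cannot answer.",
])
```

Any line that matches neither pattern is simply skipped, which is exactly how a formatting inconsistency ends up silently breaking answer retrieval for one question.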
Next, I ran over each question in the survey, counted how many responses include the phrase "so annoying" and selected the question with the highest count of matching responses.
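That counting step can be sketched in a few lines of pandas. Everything below is made up for illustration — in the real script the "so annoying" option is matched through the answer text from the key, not a shared numeric code:

```python
import pandas as pd

# Hypothetical mini-frame: one row per sister, one column per question,
# cells holding answer codes. Pretend code 5 is the "so annoying" option
# for every question.
responses = pd.DataFrame({
    "Q1": [1, 5, 5, 2],
    "Q2": [5, 5, 5, 1],
})
ANNOYED = 5

annoyed_counts = (responses == ANNOYED).sum()  # per-question tally of annoyed answers
most_annoying = annoyed_counts.idxmax()        # question with the highest tally
print(most_annoying, annoyed_counts[most_annoying])  # Q2 3
```

The boolean comparison vectorizes the whole count, so there's no per-row loop to slow things down.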
the most annoying question
Turns out it's this one! The survey asks participants to indicate whether they agree or disagree with the following statement:
"Christian virginity goes all the way along a road on which marriage stops half way."
3702 sisters (3%) responded that they found the statement too annoying to answer. The most popular answer was No at 56% of respondents.
I'm not really sure how to interpret this question! So far I have two running theories about the responses:
The survey participants were also confused and boy, being confused is annoying!
The sisters generally weren't down for claiming superiority over other women on the basis of their marital-sexual status.
Both of these interpretations align suspiciously well with my own opinions on the matter, though, so, ymmv.
9x speed improvement in one lil refactor
The first time I ran a working version of the full script it took around 27 minutes.
I didn't (still don't) have the experience to know if this is fast or slow for the size of the dataset, but I did figure that it was worth making at least one attempt to speed up. Half an hour is a long time to wait for a punchline!
As you can see in this commit, I originally had a function called unify that rewrote the answers in the survey from the floats they'd initially been stored as to the plain-text answers returned from the survey_key. I figured that it made sense to build a dataframe with the complete info, then perform my queries against that dataframe alone.
However, the script was spending over 80% of its time in this function, which I knew from aggressively outputting the script's progress and timing it. I also knew that I didn't strictly need to be doing any answer rewriting at all. So, I spent a little while refactoring find_the_most_annoying_question to use a new function, get_answer_text, which returns the descriptive answer text when passed the answer key and its question. This shaved 9 lines (roughly 12%) off my entire script.
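In sketch form, the difference between the two approaches looks something like this — the function names match the ones above, but the bodies are my reconstruction, not the actual commit:

```python
import pandas as pd

# Slow path (sketch): rewrite every cell from its float code to the
# descriptive text up front -- one string write per cell in the frame.
def unify(df, survey_key):
    for q in df.columns:
        df[q] = df[q].map(lambda code: survey_key[q]["answers"][str(int(code))])
    return df

# Fast path (sketch): leave the codes alone and translate a single
# answer only at the moment a human-readable label is needed.
def get_answer_text(survey_key, question, code):
    return survey_key[question]["answers"][str(int(code))]
```

The slow path does a dictionary lookup and a write for every cell in the dataframe whether or not that cell is ever queried; the fast path does a handful of lookups total.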
Upon running the script post-refactor, I knew right away that this approach was much, much faster - but I still wasn't prepared when it finished after only 3 minutes! And since I knew between one and two of those minutes were spent downloading the initial csv alone, that meant I'd effectively neutralized the most egregious time hog in the script. 👍
I still don't know exactly why this is so much more efficient. The best explanation I have right now is "welp, writing data must be much more expensive than comparing it!" Perhaps this Nand2Tetris course I'll be starting this week will help me better articulate these sorts of things.
flourishes 💚💛💜
Working on a script that takes forever to run foments at least two desires:
to know what the script is doing Right Now
to spruce the place up a bit
I added an otherwise unnecessary index while running over all the questions in the survey so that I could use it to cycle through a small set of characters. Last week I wrote in my mini-RC blog, "Find out wtf modulo is good for." Well, well, well.
Here's what my script looks like when it's iterating over each question in the survey:
[screenshot of the script's animated progress output]
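The modulo trick itself is tiny: taking the running index mod the number of frames wraps it back to zero, so a short string of characters cycles forever. A minimal sketch (the frames and messages here are stand-ins, not my actual script):

```python
import sys
import time

frames = "|/-\\"  # any short cycle of characters works; emoji do too
questions = [f"question {n}" for n in range(1, 9)]  # stand-in work items

for i, question in enumerate(questions):
    # i % len(frames) wraps the index back to 0, so the frames repeat
    spinner = frames[i % len(frames)]
    sys.stdout.write(f"\r{spinner} tallying responses for {question}")
    sys.stdout.flush()
    time.sleep(0.05)  # simulate a bit of work per question
print()
```

The carriage return `\r` rewrites the same terminal line each iteration, which is what makes it read as animation instead of a scrolling log.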
I justified my vanity with the (true!) fact that it is easier to work in a friendly-feeling environment.
Plus, this was a good excuse to play with constructing emojis dynamically. I thought I'd find a rainbow of hearts with sequential unicode ids, but it turns out that ❤️ 💙 and 🖤 all have very different values. ¯\_(ツ)_/¯
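You can see this from the REPL: the classic red heart predates emoji and lives in the old Dingbats block, while the colored hearts sit way out in Miscellaneous Symbols and Pictographs. The blue-through-purple run is consecutive, though, so that stretch of the rainbow can be generated from a starting code point:

```python
# Three hearts, three very different code points:
print(hex(ord("❤")))   # 0x2764
print(hex(ord("💙")))  # 0x1f499
print(hex(ord("🖤")))  # 0x1f5a4

# But 💙💚💛💜 are consecutive, so they can be built dynamically:
hearts = "".join(chr(0x1F499 + i) for i in range(4))
print(hearts)  # 💙💚💛💜
```

(The ❤️ in running text also carries U+FE0F, a variation selector that requests the colorful emoji rendering rather than the plain text glyph.)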
the data set
One of the central joys of working with this dataset has been having cause to learn some history that I'd otherwise never be exposed to. Here's a rundown of some interesting things I learned:
This dataset was only made accessible in October of this year. The effort to digitize and publicly release The Sister Survey was spearheaded by Helen Hockx-Yu, Notre Dame's Program Manager for Digital Product Access and Dissemination, and Charles Lamb, a senior archivist at Notre Dame. After attending one of her forums on digital preservation, Lamb approached Hockx-Yu with a dataset he thought "would generate enormous scholarly interest but was not publicly accessible."
Previously, the data had been stored on "21 magnetic tapes dating from 1966 to 1990" (Ibid) and an enormous amount of work went into making it usable. This involved both transferring the raw data from the tapes and deciphering it once it'd been translated into a digital form.
The timing of the original survey in 1967 was not arbitrary: it was a response to the Second Vatican Council (Vatican II). Vatican II was a Big Deal! Half a century later, it remains the most recent Catholic council of its magnitude. For example, before Vatican II, mass was delivered in Latin by a priest who faced away from his congregation and Catholics were forbidden from attending Protestant services or reading from a Protestant Bible. Vatican II decreed that mass should be more participatory and conducted in the vernacular, that women should be allowed into roles as "readers, lectors, and Eucharistic ministers," and that the Jewish people should be considered as "brothers and sisters under the same God" (Ibid).
The survey's author, Marie Augusta Neal, SND, dedicated her life of scholarship towards studying the "sources of values and attitudes towards change" (Ibid) among religious figures. A primary criticism of the survey was that Neal's questions were leading, and in particular, leading respondents towards greater political activation. ✊
As someone with next to zero conception of religious history, working with this dataset was a way to expand my knowledge in a few directions all at once. Pretty pumped to keep developing my working-with-data skills.
2 notes · View notes