blogdipcf2019
Data and Humanities: Understanding How and Where They Intersect
Introduction
Computers perform complex tasks repetitively and significantly faster than humans. All computers follow a series of yes/no instructions to carry out their specific purpose. I chose to open with this to highlight the common thread of ‘instructions’ that inevitably exists in all computer systems – the computer algorithm, or code. The granularity of these coded instructions is mind-boggling for many (and is worth exploring before proceeding). A word-processing application, or any functional tech gadget, is thousands of tiny instructions intricately woven together by logic. Essentially, all our technologies boil down to a set of specific yes/no commands, which means we can control every element of every decision that a computer makes.
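To make the idea tangible, here is a minimal sketch in Python of how a familiar feature – a spell-checker deciding whether to underline a word – reduces to a chain of yes/no checks. Everything here (the dictionary, the function name) is invented for illustration, not taken from any real word processor.

# A toy spell-check decision, reduced to explicit yes/no branches.
DICTIONARY = {"data", "humanities", "computer", "code"}

def should_underline(word: str) -> bool:
    if not word:                        # yes/no: is the word empty?
        return False
    if word.lower() in DICTIONARY:      # yes/no: is it a known word?
        return False
    return True                         # every check failed: flag it

print(should_underline("Humanities"))   # False: known word
print(should_underline("Humanitees"))   # True: flagged as misspelled

Each branch is one of the tiny yes/no instructions described above; a real application simply weaves thousands of them together.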
In the 21st century, we are experimenting heavily in new realms of technology such as Artificial Intelligence (AI), which prominently includes sophisticated classification systems. Classification systems learn and make observations from big data to arrive at insights, predictions, recommendations, or classifications of prospective events of interest. In other words, the computer drafts a set of its own instructions and applies those instructions to read, understand, and classify newly entered data records. Once acquainted with how important these instructions – these elements of code, these building blocks of technology – are, we wonder: “hmm, but what exactly are these decisions, and how can I influence them to contribute to a computer system in my own capacity?”
Computational Thinking and the Rise of Big Data
In recent years, humans have experienced an uptick in population, technological advances, and data of all sorts. Our present systems are capable of performing complex tasks because computational thinking is more evolved than ever before. The roots of such computational thinking lie in the inventions of Ada Lovelace and Charles Babbage – the first to think in terms of software, hardware, and, most importantly, getting the hardware to perform drafted instructions. Humans’ computational thinking, alongside access to big data, has enabled us to solve problems in notably timely and accurate ways.
Data, unless processed at the time of entry, requires a fair amount of pre-processing and cleaning prior to any analysis. This cleaning involves addressing missing values, handling outlying observations, and evaluating specific variables. Despite sounding strongly mathematical, these guidelines are applicable to all datasets. Beyond its digital applicability, this step depends on an individual’s digital literacy and ability to convert data from its raw form into a computer-readable form.
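As a minimal sketch of what this pre-processing can look like in practice – using pandas on an invented toy dataset, so the column names and thresholds are purely illustrative:

import pandas as pd

# Invented toy records; a real dataset would be loaded from a file.
df = pd.DataFrame({
    "year": [1891, 1902, None, 1888, 1899],
    "page_count": [310, 275, 290, 8000, 301],  # 8000 is an outlier
})

# Address missing values: here, drop rows with no year recorded.
df = df.dropna(subset=["year"])

# Address outlying observations: keep page counts in a plausible range.
df = df[df["page_count"].between(50, 2000)]

# Evaluate specific variables: cast year to an integer type for analysis.
df["year"] = df["year"].astype(int)
print(df)

The same three moves – handle what is missing, question what is extreme, and settle each variable’s form – apply just as well to a humanities dataset as to a numeric one.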
The application of computational thinking has mistakenly been confined to the fields of STEM, but in reality, it is the lack of cross-disciplinary knowledge of computer data forms, storage, and analysis that slows down our cross-disciplinary growth and the optimization of algorithms. It is this digital literacy that can halt the drift between the humanities and the sciences. The SEMMA methodology by the SAS Institute is a popular data-mining (i.e. classification-system-building) process whose name stands for Sample, Explore, Modify, Model, and Assess. Below, I will dive into the specific steps that humanities professionals and young adults can adopt to seamlessly integrate their domain skills and knowledge into the bigger picture of data mining and leading technological research.
Humanities and Data Processing
There are many proven disadvantages to silo approaches in the creation of new policies, laws, products, and more. We need an established relationship between STEM and the humanities, and we must work to eradicate the pointless dichotomy between the sciences and humanities. We see the application of humanities’ principles in modern technology, or the lack thereof, in our day-to-day lives and surroundings. Time and again, we’ve seen tech-savvy executives commit unforced errors as they interact with broader society on sensitive issues like consumer privacy. For technology to deliver on its promise of human betterment, it needs a cultural and moral compass. The disciplines that instil such a compass – the humanities – have been dismissed as an anachronism precisely when they may bridge the gap between mankind and its best use of potent technologies (World Economic Forum, Hans Vestberg).
Reiterating the potential damage of silo, rather than transdisciplinary, approaches, below is a quote from the article “Why economists need to expand their knowledge to include the humanities” by Gary Saul Morson and Morton Schapiro, stressing the need for economics to collide with the humanities:
There is no better source of ethical insight than the novels of Tolstoy, Dostoevsky, George Eliot, Jane Austen, Henry James, and the other great realists. Their works distil the complexity of ethical questions that are too important to be safely entrusted to an overarching theory – questions that call for empathy and good judgment, which are developed through experience and cannot be formalized. To be sure, some theories of ethics may recommend empathy, but reading literature and identifying with characters involves extensive practice in placing oneself in others’ shoes. If one has not identified with Anna Karenina, one has not really read Anna Karenina.
I particularly liked this vision of opening up the horizons of literature to economists. The last line strongly implies that one [an economist] has to read Anna Karenina in order to identify with Anna Karenina at all. This action-focused vocabulary makes me think of how linguistic literacy enables economists to pick up a codex and learn from it, while the lack of digital literacy hinders a humanist’s ability to identify with, contribute to, or even comprehend code and computer processes.
Sample
The sample is a subset of the full dataset used for learning and training the classification system. Within the humanities, the ‘full dataset’ often consists of archaeological materials, literature, books, or other unstructured or qualitative data. The first step is to become acquainted with the following:
• What is data?
Data are individual units of information that can be processed by a computer.
• Do I have a dataset?
Consider whether your data of interest exist in the scope, size, and quality you desire for analysis.
• What does my dataset consist of? 
What gets counted counts.
• What is the best and most exhaustive way to break down and structure my data?
This ensures flexibility and wide application of the data. Typically, bigger chunks of data may contain useful insights that are overlooked. Conversely, overly granular data observations can compromise the reliability of conclusions, overfit (i.e. correspond too closely or exactly) to the training sample data, or draw two-dimensional conclusions that could have been identified with less sophisticated means.
• The data is digitized – now what?
It is crucial for engineers and humanists to engage with one another periodically to share a strong mutual understanding of the data-mining model, its goals, and its creation. Technical and domain knowledge together, combined through computational processes, can make an all-rounded system equipped with process flows and rules validated from both a technical-feasibility standpoint and a subject-specific one. This harmonious interaction between the two cannot eliminate the ‘unforced errors’ or slips in conclusions, but it certainly maximizes the inherent knowledge and competency built into the system.
• Digital Archives: Humanists can use data mining and classification systems for unconventional data, including text recognized via Optical Character Recognition (OCR) from a corpus, archival metadata, sound files, geospatial data, and more. Literacy in the subject matter can be enhanced by digital literacy – essentially looping the learning cycle of researchers, wherein their hypotheses are qualitative, theoretic, or do not yet exist, and the results reap further knowledge, credibility, and so on. This data can be collected first-hand, but more and more libraries are also investing in digital archives, including the Library of Congress and the New York Public Library.
• Project Sound: The last bullet of every section deals with this specific example. Let’s say I, a researcher, am working on Project Sound with historic oral sound files. My first step is to gather raw data (the sound files) and choose the best-suited conversion (voice recognition vs. sound waveforms?), with the goal of classifying my sound files by male and female orators. Of my 1000 files, I will use 700 to train my system and reserve the remaining 300 to model and assess – a minimal sketch of that split follows below.
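Here is a minimal sketch of that 700/300 split, assuming the 1000 files and their known labels live in a simple index table; the file name and column names are hypothetical.

import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical index of the 1000 sound files and their known labels.
files = pd.read_csv("project_sound_index.csv")  # columns: file_name, orator_sex

# Reserve 300 of the 1000 files for the later Model and Assess steps.
train, holdout = train_test_split(
    files,
    test_size=300,                  # 300 files held out for assessment
    stratify=files["orator_sex"],   # keep the label balance in both sets
    random_state=42,                # reproducible split
)
print(len(train), "training files,", len(holdout), "held-out files")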
Explore
• Despite the global emphasis on imagination and exploration, it is fairly challenging to explore big data given the endless possibilities. Modern software allows such exploration through visualizations and tables – summaries of, and detailed relationships between, varying combinations of data or variables.
• Exploration is key to humanities research, and professionals are intellectually trained to analyze, interpret, connect, and predict events based on all available data. Literature researchers need to explore the data-form of their literature to discover new insights that are difficult, or even impossible, to extract with the mental capacity of human beings alone – or simply to discover insights that have been unconventional in the field altogether.
• Text from 19th-century newspapers, facts surrounding historic wars, features of archaeological artefacts, unnested tokens from pre-medieval poetry, and notes from sociological observation are all examples of data within the humanities. While experts will certainly have reliable input on any data that pertains to their area of expertise, it is important that they leverage their resources to support qualitative conclusions and applications of theory. Many ideas remain core to the humanities as a result of the status quo and historic significance, but as mankind and its approaches evolve, so should our theories – and this can be enabled by tangible results from actual data.
• Exploring data is simply the identification of opportunities and threats to account for as one builds a computer system. Exploring data will provide quick facts such as the spread of the data and other numeric averages, but it will also raise questions and provide leads for further investigation. Likewise, highly unusual occurrences or abnormalities will help identify areas of caution and prospective model nuances. This is another area that is sensitive to, and in need of, humanist contribution, as a purely technical expert would not have the perspective necessary to identify research leads or nuances – making the siloed approach a source of sub-optimization.
• Project Sound: I explore my observations and find that some orators are high-pitched and consistently use the keyword ‘marriage’ – I want to investigate further whether there is a running trend of correlation between identified keywords and the pitch of the orators. I will remember to explore this when my model is up and running (a sketch of this kind of exploration follows below)! Additionally, I want this model to be non-binary; however, the historic data I have on hand only classifies between male and female – how can I possibly increase the number of classes and train the system for the same? Is there a way to break the status quo of binary gendering?
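As a sketch of how that keyword/pitch lead could be explored – assuming each file has already been reduced to one row of summary features, with invented column names:

import pandas as pd

# Hypothetical per-file summary produced earlier in the pipeline:
# columns: file_name, mean_pitch_hz, says_marriage (0/1), orator_sex
obs = pd.read_csv("project_sound_features.csv")

# Quick facts: the spread and averages of the pitch variable.
print(obs["mean_pitch_hz"].describe())

# The lead to investigate: do files with the keyword differ in pitch?
print(obs.groupby("says_marriage")["mean_pitch_hz"].mean())

# Correlation between keyword presence and pitch across all files.
print(obs["says_marriage"].corr(obs["mean_pitch_hz"]))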
Modify
• This stage heavily involves technical experts, as it covers the modification of variables: creating new variables, combining existing ones, optimizing data types, and generally preparing for the modelling phase.
• When combining or optimizing variables, humanists play a key role in driving these decisions. For instance, if a programmer working on a text-focused study has to standardize abbreviations and spellings for computational use, it is important for the humanist to define abbreviations and support these alterations within reason.
• Project Sound: Bob, from software engineering, got in touch and asked if he can eliminate the no-audio sections from each observation. I understand his point; he is right to question them as ‘noise’ or ‘redundancy’ in the dataset. Hmm, do those moments of silence hold any significance? Do I know this right off the bat? Do women take longer pauses or more buffer time to start up? Well, I will discuss in detail how keeping or removing them impacts the model – one way to frame that discussion is sketched below.
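One concrete way to frame it: rather than silently deleting the no-audio sections, turn them into a variable, so the model itself can tell us whether pause length matters. A sketch assuming the recordings load with librosa; the file name and the silence threshold are placeholders.

import librosa

def pause_profile(path: str, top_db: float = 30.0):
    # Return total pause time and speech time (seconds) for one recording.
    y, sr = librosa.load(path, sr=None)
    # Intervals (in samples) that librosa considers non-silent.
    speech = librosa.effects.split(y, top_db=top_db)
    speech_samples = sum(end - start for start, end in speech)
    pause_seconds = (len(y) - speech_samples) / sr
    return pause_seconds, speech_samples / sr

pause_s, speech_s = pause_profile("oration_0001.wav")
print(f"{pause_s:.1f}s of silence, {speech_s:.1f}s of speech")

If the pauses turn out to carry signal – say, systematically longer for one class – they stay in as a feature; if not, Bob trims them with a clear conscience.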
Model
• This is essentially the meat of the sandwich: we finally have an actual model that simulates processes and arrives at conclusions. The model is arguably the easiest stage to understand. The majority of the final tech products we use today were once models. Models are publicized or put to use only after being deemed successful, which suggests there can always be multiple candidate models at this stage. So, what makes a model successful? Wait for the next step!
• There are some compelling examples of models in the world of digital humanities that are worth checking out.
• Project Sound: We have two models for the Project Sound system; one has a misclassification error rate of 3% and the other of 6% (i.e. male orators classified as female, or vice versa). What differences in the models are causing these results? How can I best overcome the limitations of each model? What other criteria do I value when making my choice of model? To what extent does my research budget allow me to experiment with both models? A sketch of fitting and comparing two candidate models follows below.
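A sketch of how two candidate models might be fit and compared. The features here are synthetic stand-ins for the real per-file measurements (pitch, pause time, etc.), and the choice of the two algorithms is illustrative, not prescribed by SEMMA.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
# Stand-in for the 700 training files: two numeric features per file
# and a 0/1 orator-sex label.
X_train = rng.normal(size=(700, 2))
y_train = (X_train[:, 0] > 0).astype(int)

models = {
    "Model A": RandomForestClassifier(random_state=42),
    "Model B": LogisticRegression(),
}
for name, model in models.items():
    # Cross-validated accuracy, estimated on the training files only.
    acc = cross_val_score(model, X_train, y_train, cv=5).mean()
    print(f"{name}: misclassification rate {1 - acc:.1%}")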
Assess
• Assessing the model is as intuitive as it sounds; it is the assessment of the model relative to its pre-defined goals and objectives.
• The assessment phase involves applying the model to new observations for which the results are known; hence they can be compared to those returned by the model.
• Once again, the assessment can be completed to the best knowledge of tech experts, but the added layer of humanist assessment and evaluation will help the model perform best for the people closest to the data – historians, linguists, sociologists, etc. The assessment criteria should always be broadened to account for different perspectives to ensure the best model is selected.
• Project Sound: Based on inputting 300 new records into the system, Model A classified more records accurately than Model B. In addition to the error rate, it is a priority for the model to be compatible with the latest .mp4 files for future research. Model A can now be implemented and can continuously learn from new observations as they are entered – a sketch of this holdout assessment follows below.
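A sketch of that assessment on the 300 held-out records, using a confusion matrix to see exactly which orators the model misclassifies. As before, the arrays are synthetic stand-ins for the real features and labels.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

rng = np.random.default_rng(1)
# Stand-ins for the 700 training files and the 300 held-out files.
X_train = rng.normal(size=(700, 2))
y_train = (X_train[:, 0] > 0).astype(int)
X_holdout = rng.normal(size=(300, 2))
y_holdout = (X_holdout[:, 0] > 0).astype(int)  # the known true labels

model_a = RandomForestClassifier(random_state=42).fit(X_train, y_train)
predictions = model_a.predict(X_holdout)

# Compare predicted labels against the known results.
print("Accuracy:", accuracy_score(y_holdout, predictions))
# Rows are true classes, columns are predicted classes.
print(confusion_matrix(y_holdout, predictions))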
Challenges
When working with analog, non-discrete data, making the leap to structured or even semi-structured data is a challenge. The limited availability of digital materials is one of the top limitations in digital humanities. There are tools in place, such as OCR, that help this translation, but the scenario is further blurred in the case of historic events and their details – which are often based on a series of facts and tied assumptions. Some practitioners of digital humanities, notably Johanna Drucker, have argued that the term “data” is inadequate. ‘Data’ comes from the Latin datum, meaning “that which is given” – i.e. an observer-independent fact, which cannot be challenged in itself. Instead, Drucker prefers to speak of “capta”, i.e. that which has been captured or gathered, underlining the idea that the act of capturing data is oriented by certain goals in the first place, done with specific instruments, and driven by attention to a small part of what could have been captured given different goals and instruments. In other words, capturing data is not passively accepting what is given, but actively constructing what one is interested in.
This rather subjective view of humanities data and its collection leads some critics to believe that data analysis is not fit for this purpose. It challenges the authenticity of the data, considering that qualitative data is easy to amend, perspective-dependent, context-dependent, and highly non-discrete. Despite these challenges, it is important for humanities researchers, like all others, to maintain a detailed plan of action and a full accounting of the research. Although this may involve extensive notes in the data index, the appendix, or the research itself, if researchers convey the specificity and reliability of the source data and its application, there is no reason for it to be dismissed as arbitrary or inconclusive.
Conclusion
In conclusion, the predominance of computer systems and data is transdisciplinary, so now is the time to question why their creation is so highly centralized. The seemingly high barriers to entry in the world of digital humanities are the primary obstacles to this idealistic integration of STEM and the humanities. The issues of structural racism, sexism, and other biases built into computer systems must be addressed now to ensure we are not building upon a flawed foundation for our (virtual) world and lives. There is a need to broaden the set of people and perspectives that define every strategy and decision on the way to the final product. As I mentioned before, cultivating a data mindset in the humanities is the only path to building an all-rounded system suited to its challenges and purpose.