As an academic discipline, data science has matured at a rate that should be measured in light years. Although it’s really only about 10 years old as a field of study – with the first Ph.D. program in the country emerging just four years ago – most major universities across the world have integrated data science into their portfolio of degree options. Universities – not typically known for nimbleness or even their receptivity to market forces – have responded to the calls for developing the talent needed to translate increasingly massive amounts of structured and unstructured data into information.
Over the last 10 years as an academic engaged in data science programs at my own university, I have seen a shift in the conversations that take place at conferences, in panel discussions and in coffee shops. A few years ago, the conversations related to data science centered on computational skills – students needed programming languages, combined with “deep learning”, “machine learning”, and “predictive analytics”. It was all about what data science students could do with data.
Today, the myriad issues related to Facebook’s use of personal data (e.g., the Cambridge Analytica data leak, its emotional contagion testing, fake quiz apps) or the inherent racial bias that has manifested through facial recognition… have pivoted these conversations away from what students can do and toward what they should (or should not) do. Part of this conversation includes a thread about what responsibility academics have to teach ethics as part of a data science curriculum. Most academics are unsure how to engage this thread.
I think it is helpful to frame this shift, from a conversation about computational skills to one about ethics in data science, in the context of Maslow’s Hierarchy of Needs. Simplistically, this foundational theory in psychology places human needs into five tiers – from basic physiological needs (e.g., food and water) up to “self-actualization” (i.e., the ability to achieve one’s full potential). The needs are typically arranged in a hierarchical pyramid, where progression to the next tier is contingent on satisfying the tier below. In other words, you are not worried about whether you look better in mauve or purple if you have not eaten in two days.
I propose an academic “Maslow’s Hierarchy of Data Science”:
In the Hierarchy of Data Science as an academic discipline, students MUST learn the basics of mathematics as a starting point. This is because the higher-level concepts of statistics, computer science and programming are grounded in the basics of algebra, matrices, discrete optimization… graph theory and calculus don’t hurt.
Only AFTER students understand the core concepts from statistics and computer science (note that this is an “and” rather than an “or”) will the skills related to modeling and classification, followed by the ability to communicate results (particularly to a non-scientific audience), actually make sense. Borrowing from the point above, you really should not be worried about visualizing the performance of the model if you have no idea which features (variables) were actually used. Or why. Or in what form. Or if they are biased.
While the concept of the “citizen data scientist” has its place, I believe that the national conversations related to ethics in data science have emerged partially as a function of academic programs trying to shortcut this hierarchy.
In the rush to become a “data scientist”, students are given opportunities to take the one-year certificate or the online course, which will allow them to update their LinkedIn profile. In the process, they skip the science disciplines and go right to the business disciplines of analytical modeling (through a point-and-click interface) and visualization, without any understanding of HOW the algorithms they are “pointing and clicking” actually work. Or if they even make sense. Frequently, this generates meaningless output – like the highly paid analyst who used Social Security number as a predictor (true) – or, in the worst case, algorithms (unintentionally) built on racially biased data.
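To make the Social Security number example concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the toy dataset, the column names, the helper functions `uniqueness_ratio` and `flag_identifier_like`, and the 0.95 threshold are all hypothetical and do not come from any real case. The idea is simply that a column in which nearly every row has its own value behaves like a row label, so a model can memorize it but cannot learn anything from it that carries over to new people.

```python
# Hypothetical toy data: every name and value here is invented for illustration.
# The "ssn" strings use the 000 area prefix, which is never issued.

def uniqueness_ratio(values):
    """Share of rows holding a distinct value (1.0 means every row is unique)."""
    values = list(values)
    return len(set(values)) / len(values) if values else 0.0


def flag_identifier_like(columns, threshold=0.95):
    """Return the names of columns whose values are (almost) all unique.

    Such columns act like row labels: they can be memorized, but they
    hold no repeating pattern that generalizes to unseen rows.
    The 0.95 threshold is an arbitrary choice for this sketch.
    """
    return [name for name, values in columns.items()
            if uniqueness_ratio(values) >= threshold]


training_columns = {
    "ssn": ["000-00-0001", "000-00-0002", "000-00-0003",
            "000-00-0004", "000-00-0005", "000-00-0006"],
    "region": ["south", "south", "north", "west", "north", "south"],
    "has_mortgage": [True, False, True, True, False, False],
}

print(flag_identifier_like(training_columns))  # ['ssn']
```

A check like this is only a rough screen. A legitimate continuous measurement (income to the cent, for example) can also be nearly unique per row, so deciding whether a flagged column is an identifier or a real feature still requires understanding what the variable means, which is exactly the understanding a point-and-click workflow lets students skip.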
The national conversations related to ethics in data science are much needed and are a manifestation of the maturing of the discipline. However, an important thread in these conversations, in conference panels and in coffee shops, needs to include an acknowledgement that shortcutting the science disciplines has contributed to the problem.