
The 2020 Data Science Dictionary—Key Terms You Need to Know

By ODSC

As I discussed early last year, the data science field is a virtual hotbed of terminology, a confluence of terms from computer science, statistics, mathematics, and software engineering. In addition, the language of data science evolves very quickly. As both a journalist and a data scientist, I probably see the newest terms before many others in the ecosystem. I encounter them at conferences and Meetups, on social media, LinkedIn, blog posts, and Stack Overflow, in research papers, and in conversations with colleagues. In this article, I'll provide a brief supplementary dictionary of terms for 2020 data science, spanning AI, machine learning, and deep learning. The list below consists of important terms that I feel will help you move forward into 2020.

[Related Article: Top Data Science Skills for 2020]

  1. AI Chatbots–AI chatbots are a class of software that simulates a conversation with a user in natural language through messaging applications. The main attraction of the technology is that a chatbot is available 24/7 on your website, increasing response rates and improving customer satisfaction. Chatbots use machine learning and natural language processing (NLP) to deliver a near-human conversational experience.
  2. AutoML–Automated machine learning, or AutoML, is the process of automating the end-to-end application of machine learning to the goals of a data science project. AutoML is often pitched as making machine learning available to people without strong expertise in the field, although more realistically it is designed to increase the productivity of experienced data scientists by automating many steps in the data science process. Advantages of AutoML include: (i) increased productivity, since automating repetitive tasks lets a data scientist focus on the problem rather than the models; (ii) fewer errors, since automated pipeline components avoid mistakes that can slip in with manual processes; and (iii) a step toward democratizing machine learning by making its power accessible to those outside the data science team. A minimal sketch of the kind of search AutoML automates appears after this list.
  3. BERT–BERT (Bidirectional Encoder Representations from Transformers) was introduced in a 2018 paper by researchers at Google AI Language. It caused a stir in the machine learning community by presenting state-of-the-art results on a wide variety of NLP tasks. BERT’s main technical advance is applying the bidirectional training of the Transformer, a popular attention model, to language modeling. This is in contrast to prior efforts, which examined a sequence of text either from left to right or with combined left-to-right and right-to-left training. BERT’s results show that a bidirectionally trained language model develops a deeper sense of language context and flow than single-direction language models. A short usage sketch appears after this list.
  4. Cognitive computing–Cognitive computing is based on self-learning systems that use machine-learning techniques to perform specific, human-like tasks in an intelligent way. The main goal of cognitive computing is to simulate human thought processes using a computerized model. With self-learning algorithms that use pattern recognition and natural language processing, the computer is able to imitate the way the human brain functions.
  5. Data pipeline–Data scientists depend on data pipelines to encapsulate the processing steps required to prepare data for machine learning. These steps may include acquiring data sets from various sources; performing “data prep” operations such as cleansing the data and handling missing values and outliers; and transforming the data into a form better suited for machine learning. A data pipeline also typically includes training or fitting a model and evaluating its accuracy. Pipelines are usually automated so their steps can be re-run on a continuing basis. A small pipeline sketch appears after this list.
  6. Data lake, data warehouse–Data lakes and data warehouses are both widely used for storing so-called “big data,” but they are not interchangeable terms. A data lake is a large-scale pool of raw data whose purpose has not yet been defined, while a data warehouse is a repository for structured, filtered data that has already been processed for a specific purpose. Enterprises often need both types of repositories: data lakes were born of the need to harness big data and exploit raw, granular structured and unstructured data for machine learning, while data warehouses remain necessary for the analytics used by business users.
  7. Edge analytics–Edge analytics is a method of data collection and analysis in which an analytical computation is performed on data at the point of collection (e.g. a sensor) instead of waiting for the data to be sent back to a centralized store. Edge analytics has come into favor as the IoT model of connected devices has become established. In many enterprises, streaming data from company operations connected to IoT networks creates a massive volume of operational data that can be difficult and expensive to manage. By running the data through an analytics process as it is collected, at the “edge” of the network, it’s possible to filter out what is worth sending to a central data store for later use. A toy filtering sketch appears after this list.
  8. GANs–Generative adversarial networks (GANs) are deep neural network architectures composed of two networks pitted against each other (hence the term “adversarial”). GANs were first introduced in a 2014 paper by deep learning luminary Ian Goodfellow and other researchers at the University of Montreal, including Yoshua Bengio. The potential of GANs is significant because they are generative models: they create new data instances that resemble the training data. For example, GANs can create images that look like photographs of human faces, even though the faces don’t belong to any real person. A minimal sketch of the two networks appears after this list.
  9. Geospatial analytics–Geospatial analytics gathers, manipulates, and displays geographic information system (GIS) data (e.g. GPS data) and imagery (e.g. satellite photographs). It uses geographic coordinates as well as specific identifier variables such as street address and zip code, and it is used to build geographic models and data visualizations for more accurate modeling and prediction. A basic distance computation appears after this list.
  10. Graph database–A graph database uses graph theory to store, map, and query relationships among data elements. Essentially, a graph database is a collection of nodes and edges: a node represents an entity such as a product or customer, while an edge represents a connection or relationship between two nodes. Each node in a graph database is defined by a unique identifier, a set of outgoing and/or incoming edges, and a set of key/value pairs. Each edge is defined by a unique identifier, a starting node, an ending node, and a set of properties. Graph databases are well suited for analyzing interconnections. A small node/edge sketch appears after this list.
  11. Julia–Even if you work in the most popular data science languages, R or Python, you should still be aware of a relatively new language designed from the ground up for data science applications. Julia was officially announced in a 2012 blog post, and the language’s designers, together with two others, founded Julia Computing in July 2015 to “develop products that make Julia easy to use, easy to deploy, and easy to scale.” Julia is a free, open-source, high-level programming language for numerical computing. It offers the convenience of a dynamic language with the performance of a compiled, statically typed language, thanks to a JIT compiler that generates native machine code and a design that achieves type stability through specialization via multiple dispatch, making it easy to compile to efficient code.
  12. Low-code/No-code–You may see the terms “low-code” and/or “no-code” being mentioned a lot these days. Many new products, along with some mature products, are being re-branded as adopting low-code/no-code methodologies. Simply defined, a low-code/no-code development platform is a visual integrated development environment that allows citizen developers to drag and drop application components, connect them together and create a finished application. Many enterprise BI platforms fall into this platform category.
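
To make a few of the terms above more concrete, the following sketches illustrate them in Python. First, AutoML (item 2): a full AutoML system searches over many model families, preprocessing steps, and hyperparameters, but the minimal sketch below stands in for that idea with scikit-learn's GridSearchCV; the dataset and parameter grid are illustrative assumptions, not a recommendation.

```python
# Minimal stand-in for the search an AutoML tool automates.
# Assumes scikit-learn; the dataset and parameter grid are illustrative.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# A real AutoML system would also search over model families and
# feature-engineering steps; here we only tune one hyperparameter.
search = GridSearchCV(pipe, {"clf__C": [0.01, 0.1, 1.0, 10.0]}, cv=5)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```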
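
For BERT (item 3), the sketch below assumes the Hugging Face transformers library and the pre-trained bert-base-uncased checkpoint; it simply extracts contextual token embeddings rather than fine-tuning on a downstream task.

```python
# Sketch: contextual embeddings from a pre-trained BERT model.
# Assumes the Hugging Face `transformers` library and PyTorch.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("Data science terminology evolves quickly.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One vector per token; bidirectional training means each vector reflects
# context from both the left and the right of that token.
print(outputs.last_hidden_state.shape)  # e.g. torch.Size([1, num_tokens, 768])
```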
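
For data pipelines (item 5), the sketch below chains imputation, scaling, and model fitting into one repeatable scikit-learn Pipeline object; the toy arrays stand in for a real data source.

```python
# Sketch of a small data pipeline: data prep plus model fitting in one object.
# Assumes scikit-learn; the toy arrays stand in for a real data source.
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 2.0], [np.nan, 3.0], [4.0, np.nan], [5.0, 6.0]] * 10)
y = np.array([0, 1, 0, 1] * 10)

pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),  # handle missing data
    ("scale", StandardScaler()),                   # transform the features
    ("model", LogisticRegression()),               # fit the model
])

# Because the steps are encapsulated, the same pipeline can be re-run
# whenever new data arrives.
print(round(cross_val_score(pipeline, X, y, cv=5).mean(), 3))
```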
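
For edge analytics (item 7), the toy sketch below filters simulated sensor readings at the point of collection and forwards only the values worth keeping; the sensor stream and threshold are hypothetical.

```python
# Toy sketch of edge-side filtering: only readings that cross a threshold
# are forwarded to the central data store. Stream and threshold are hypothetical.
import random

THRESHOLD = 80.0  # e.g. a temperature alert level

def sensor_stream(n=1000):
    """Simulate raw readings arriving at an edge device."""
    for _ in range(n):
        yield random.gauss(70.0, 5.0)

def edge_filter(readings, threshold=THRESHOLD):
    """Run a lightweight analytic where the data is collected."""
    for value in readings:
        if value > threshold:  # only noteworthy events leave the edge
            yield round(value, 2)

forwarded = list(edge_filter(sensor_stream()))
print(f"Forwarded {len(forwarded)} of 1000 readings to the central store.")
```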
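
For GANs (item 8), the sketch below defines the two competing networks in PyTorch (an assumption; any deep learning framework would do) and pushes a batch of random noise through them; the adversarial training loop itself is omitted for brevity.

```python
# Minimal sketch of a GAN's two networks. Assumes PyTorch; the training loop,
# in which the two networks are optimized in turn, is omitted.
import torch
import torch.nn as nn

latent_dim, data_dim = 64, 784  # e.g. 28x28 grayscale images, flattened

generator = nn.Sequential(       # maps random noise to a synthetic sample
    nn.Linear(latent_dim, 256), nn.ReLU(),
    nn.Linear(256, data_dim), nn.Tanh(),
)

discriminator = nn.Sequential(   # scores how "real" a sample looks
    nn.Linear(data_dim, 256), nn.LeakyReLU(0.2),
    nn.Linear(256, 1), nn.Sigmoid(),
)

noise = torch.randn(16, latent_dim)
fake = generator(noise)             # new data instances resembling the training data
realness = discriminator(fake)      # the adversary tries to tell real from fake
print(fake.shape, realness.shape)   # torch.Size([16, 784]) torch.Size([16, 1])
```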
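
For geospatial analytics (item 9), the sketch below shows one of the most basic computations on geographic coordinates, the great-circle (haversine) distance between two latitude/longitude points; the coordinates are illustrative.

```python
# Sketch: great-circle (haversine) distance between two lat/lon points,
# one of the most basic geospatial computations. Coordinates are illustrative.
from math import asin, cos, radians, sin, sqrt

def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Approximate distance in kilometers between two points on Earth."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = radians(lat2 - lat1)
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * radius_km * asin(sqrt(a))

# New York City to Boston, roughly 300 km.
print(round(haversine_km(40.7128, -74.0060, 42.3601, -71.0589), 1))
```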
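
Finally, for graph databases (item 10), the sketch below uses the networkx library as a stand-in for a real graph store to show the node/edge model of entities, relationships, and key/value properties; the entities are made up.

```python
# Sketch of the node/edge model a graph database uses, with the networkx
# library standing in for a real graph store. Entities and properties are made up.
import networkx as nx

g = nx.DiGraph()

# Nodes: entities with key/value properties.
g.add_node("alice", kind="customer")
g.add_node("widget", kind="product", price=9.99)

# Edge: a relationship between two nodes, also carrying properties.
g.add_edge("alice", "widget", relation="purchased", quantity=2)

# Traverse the relationship: what did alice buy?
for _, product, attrs in g.out_edges("alice", data=True):
    print(f"alice --{attrs['relation']}--> {product} (x{attrs['quantity']})")
```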

[Related Article: Data Science Job Titles to Look Out for in 2020]