When AI Errors Are a Data Problem (or How the GIGO Rule Applies to AI)
Regardless of your role or level of expertise, there is little doubt that AI is a fascinating field. Even when we don't use it properly, or when it exhibits problematic behavior (e.g., racial bias), it's still interesting to explore the hows and whys of it. In this article, we'll look at AI's relationship with data, particularly the data we use to train it. I'll avoid technical jargon so that everyone can get something useful out of this text. So, let's get started, shall we?
First of all, why is data so important, even for something as sophisticated as an AI system? Well, if you think about it, any intelligent decision we make is based on information, and information usually stems from data. Even if the exact relationship between the two is opaque, there is no doubt that they are linked. In AI, things aren't any different. It's just that, instead of biological neurons, the "brain" of an AI processes data using artificial neurons which, in the vast majority of cases, make up an artificial neural network (ANN). So, it all boils down to what data we use to brew the information that will drive the decisions at hand.
However, data comes in all shapes and forms, and not all of it is of the same value (or veracity, to be more accurate). Just as some platforms are full of junk content while others maintain some standards, the datasets at our disposal vary in quality. And since it's such a dataset that will be used to train an AI system, enabling it to make its decisions, the dataset we choose significantly affects the final result. A capable AI system (e.g., a state-of-the-art one) will do a great job at processing the data, even cleaning it to some extent. However, it's an AI, not a miracle worker, so if you give it garbage to train with, don't expect unicorns and rainbows at the other end.
All this is reminiscent of an adage in computer science known as the GIGO rule. The acronym stands for "Garbage In, Garbage Out" and captures the idea that if you feed a computer program garbage, it's going to spit out garbage. Take, for example, a spreadsheet. If the data you put in its cells doesn't make any sense (e.g., it's random numbers), the results in the pivot tables and formula cells are bound to be equally useless. An AI isn't much different, though as a bonus, you get to waste computational resources (e.g., RAM and computing power) if you make such a mistake.
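For readers who like to see the GIGO rule in action, here is a minimal sketch in Python. It is only an illustration under my own assumptions: the "model" is a simple straight-line fit rather than a real AI system, and the data, the function names (fit_line, mean_abs_error), and the underlying relationship y = 2x + 1 are all made up for the example. The same kind of model is trained once on sensible data and once on random numbers, and both are then checked against the true values.

```python
import random

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept

def mean_abs_error(model, xs, ys):
    """Average absolute gap between the model's predictions and ys."""
    slope, intercept = model
    return sum(abs(slope * x + intercept - y) for x, y in zip(xs, ys)) / len(xs)

random.seed(0)  # fixed seed so the run is repeatable

xs = list(range(20))
true_ys = [2 * x + 1 for x in xs]                   # sensible data (assumed relationship)
garbage_ys = [random.uniform(0, 100) for _ in xs]   # random numbers unrelated to xs

clean_model = fit_line(xs, true_ys)
garbage_model = fit_line(xs, garbage_ys)

# Both models are judged against the true values.
clean_error = mean_abs_error(clean_model, xs, true_ys)
garbage_error = mean_abs_error(garbage_model, xs, true_ys)

print("error when trained on clean data:  ", clean_error)
print("error when trained on garbage data:", garbage_error)
```

The fitting procedure is identical in both cases; only the training data differs, and that alone decides whether the result is usable.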
When Big Data made its debut, many people wrote about it. One of the key books I read on the topic came from IBM, which had developed Big Data software at the time and wanted to educate the world about the potential of this new resource. In that book, the authors (who were seasoned professionals in various roles) dedicated several pages to the 4 Vs of big data (you may hear about the 3 Vs or even the 6 Vs of big data in other places), namely Volume, Velocity, Variety, and Veracity. The last one is the one that many people forget about, but it's crucial. If you have lots and lots of data, some of it moving fast (e.g., a trading data stream) and taking various forms (e.g., some of it coming from a database, other parts coming from Twitter, etc.), naturally not all of it will carry a strong enough signal (information). Parts of this amalgamation of data are going to be useless and are better off being jettisoned. What's left is something that a data scientist would clean, organize, and process to build a model that provides some useful insights and (ideally) a service you can use even without that expert being present.
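To make the "jettisoning" part a bit more concrete, here is a small, hedged sketch in Python. The trade records, the field names (symbol, price), and the is_plausible check are all hypothetical and invented for this illustration; a real cleaning step would depend on the data at hand and would usually involve far more than one rule.

```python
# Hypothetical trade records; some are clearly not trustworthy.
trades = [
    {"symbol": "AAA", "price": 101.25},
    {"symbol": "BBB", "price": None},     # missing value
    {"symbol": "CCC", "price": -4.0},     # a negative price makes no sense
    {"symbol": "DDD", "price": 57.10},
]

def is_plausible(record):
    """Keep a record only if its price is present and positive (assumed rule)."""
    price = record.get("price")
    return price is not None and price > 0

# Records that fail the check are dropped before any model sees them.
clean_trades = [t for t in trades if is_plausible(t)]

print([t["symbol"] for t in clean_trades])
```

Even a simple check like this one removes records that would otherwise mislead whatever model is trained on the data.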
Things haven't changed much since then, though the tools have evolved, with AI now in the limelight. Veracity is still crucial, which is why we need to be mindful of the quality of the data we use. Otherwise, if an AI makes a blunder, we only need to look in the mirror to find the culprit!
If you are interested in AI and similar topics, feel free to check out my blog. Cheers!