Data-centric: data at the heart of Artificial Intelligence

By EloInsights

Artificial intelligence (AI) models are already taking organizations to the next level of advanced analytics and becoming part of a data-driven culture.
As companies’ analytical capabilities mature, the principle of data-centric AI proposes putting data quality at the center of efforts to evolve AI-based systems.
Besides an overview of the topic, the article also presents factors that could affect the future of AI, such as data labelling and causal models.

In today’s business environment, adopting a data-centric culture is fundamental for an organization that aims not only to remain competitive but also to expand its ability to act and deliver value to all its stakeholders. To reach a point where decision-making is based on data, it is crucial to invest in key areas such as advanced analytics and artificial intelligence (AI).

This means that organizations need to take a few steps back and answer some fundamental questions that concern their foundations. For example: how does the company currently handle its data?

Although discussions around advanced analytics and artificial intelligence evoke scenarios of high technological sophistication, there is earlier groundwork that needs to be done to make this more advanced reality viable. And this is where many companies still fall short.

It is within this scenario that the concept of “data-centric AI” is born. It holds that, no matter how advanced and mature current AI models may be, organizations must keep the quality and availability of data in view so that this work can be done more accurately, thus opening new possibilities for application and business development.

“It is about having very correct data with lots of information and in smaller quantities to get the best performance from the models in relation to what you are trying to answer. That is the remarkable thing about data-centric: an AI in which you work more carefully so that the data can generate good future results”

Pedro Guilherme Ferreira

Data Science Specialist

What is Data-centric?

In 2020, the Cappra Institute interviewed 500 specialists in various companies in Brazil who, on average, dealt with a volume of data close to 10 petabytes, with an expected growth of 175% over the next 5 years. Despite this, only 48% of the employees in these organizations actually used data in decision-making. To meet the demand to extract as much useful information as possible from big data, data-centric AI brings a new perspective: looking at how the data used to train artificial intelligence models is extracted, processed and stored.

Organizations are already channeling resources into AI-based technologies. An IDC study concludes that 86% of more than 300 companies analyzed in eight Latin American countries have already adopted Data, Analytics and Artificial Intelligence solutions. This, combined with machine learning, is among the main investments made from 2022 onwards. Another study, from Dell in partnership with MIT Technology Review, shows that AI is on the horizon for four out of every ten Latin American companies, with the Internet of Things (IoT) appearing in the plans of 34% of organizations.

The plans are promising, but the current reality is different, as companies’ analytical capacity is still low. Proof of this is that the lack of maturity in dealing with data is the main reason for dissatisfaction among analytics professionals, according to State of Data Brazil 2021. In other words, to leverage innovation and to develop and retain the most talented professionals, it is essential to provide a structure of technology and tools, show a clear vision of processes and governance, create a prioritized roadmap of use cases and shape the culture for data-based decision-making. And all of this can come at a huge cost.

The first step is to understand the importance of prioritizing the foundation of the data. Even before hiring data scientists or super-specialized engineers to make gains with technologies linked to advanced analytics, such as AI, the robustness of the data must be guaranteed. This is the thesis of Pedro Guilherme Ferreira, Data Science Specialist: “It is essential to have governance that is more concerned with the accuracy of the data in order to be able to make decisions based on it”.

Having data-centric AI as a principle reinforces the construction of indicators, since today many industries do not even collect data to structure historical series on which to base predictive models. Bringing data to the center means, for example, applying a metric layer – a semantic layer or a centralized repository in which data teams define and store business metrics. “That way, everything you build in Analytics will have a single source, a source of truth for that data”, explains Ferreira.
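
To make the idea of a metric layer more concrete, here is a minimal sketch in Python, assuming a very small in-memory registry. The names `Metric`, `METRICS`, `register` and `evaluate`, as well as the `defect_rate` metric and the sample production rows, are hypothetical and exist only for illustration; a real semantic layer would typically live in a dedicated tool or data platform.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Row = Dict[str, float]


@dataclass(frozen=True)
class Metric:
    """A business metric defined once, with its meaning written down."""
    name: str
    description: str
    compute: Callable[[List[Row]], float]


# Hypothetical central registry: the single place where metrics are defined.
METRICS: Dict[str, Metric] = {}


def register(metric: Metric) -> None:
    """Add a metric, refusing a second definition under the same name."""
    if metric.name in METRICS:
        raise ValueError(f"metric {metric.name!r} is already defined")
    METRICS[metric.name] = metric


def evaluate(name: str, rows: List[Row]) -> float:
    """Every report or model computes the metric through this one entry point."""
    return METRICS[name].compute(rows)


# Illustrative metric: defective units divided by units produced.
register(Metric(
    name="defect_rate",
    description="Defective units divided by units produced",
    compute=lambda rows: sum(r["defects"] for r in rows) / sum(r["units"] for r in rows),
))

# Made-up production data, used only to exercise the registry.
production = [
    {"units": 1000, "defects": 12},
    {"units": 1500, "defects": 18},
]
defect_rate = evaluate("defect_rate", production)
```

Because the definition is registered once and a duplicate name is rejected, two teams cannot quietly end up with different formulas for the same indicator, which is the "single source" property Ferreira describes.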

This does not just refer to storing data in a data lake (a store of structured and unstructured data that allows for different types of analysis and processing of big data); it also concerns how this repository will be built and what will be extracted from it. In a simple example, this means that the way in which the data is described and stored can make an AI-based model that uses image recognition for quality control in manufacturing more or less efficient.

“Data-centric is more complex. To start being a data-driven company, you must think more about the data than the model. Data has always been at the heart of analytics and statistics. The concern must be there”, says Ferreira.

“It is essential to have governance that is more concerned with the accuracy of the data in order to be able to make decisions based on it”

Pedro Guilherme Ferreira

Data Science Specialist

The evolution of AI models

The concept of data-centric AI is new. So much so that it sits in the first stage, among the innovations, of Gartner’s Hype Cycle for AI 2022 chart.

In a recent interview for Fortune, Andrew Ng, one of the pioneers of deep learning, founder and CEO of Landing AI and a great advocate of data-centric AI, emphasizes the value of data in this context. The premise is that the latest generation of AI algorithms is increasingly ubiquitous thanks to open-source repositories and the publication of cutting-edge research. Companies can access the same code as giants like Google or NASA, but success will depend on what data is used to train their algorithms, how it is collected and processed, and how it is governed. The idea behind the data-centric concept is to create artificial intelligence systems that are well trained using the smallest possible amount of very well-prepared data.

Ferreira highlights that this understanding is the result of an evolution. “For a while, everything was very focused on the issue of models. The data came from social media, from giants like Facebook and Google. The focus was on modelling. You went from machine learning to other artificial intelligence models, with neural networks”, points out the Data Science Specialist.

Following this same line of evolution, we can better understand the proposed shift towards data-centricity. At an early stage, interest centered on the performance and accuracy of AI models. Deep learning added depth to the learning layers of artificial intelligence through more autonomous algorithms within the machine learning process. This made it possible to develop neural networks, adaptive systems that resemble the human brain, which added complexity to data analysis models.

In the current discussion, these models are already considered mature and what makes the difference is the quality of the data used to train them. In addition, companies are now able to collect information more independently and process this information in the best possible way to take analyses to the next level.

“It is about having very correct data with lots of information and in smaller quantities to get the best performance from the models in relation to what you are trying to answer”, reinforces Ferreira. “That is the great thing about data-centric: an AI in which you work more carefully so that the data can generate good future results”.

As with the application of any technological resource, there is no silver bullet in artificial intelligence. For each problem to be solved, there is a more suitable model. The way it works can involve all types of data, whether text, images, time series, etc. Regardless, the model reduces dimensionality to capture the entire structure of what is being analyzed, until only what cannot be captured is left over. Let’s look at a more concrete example: a manufacturer needs to predict the monthly sales of a soft drink brand. To do so, it uses historical sales data.

An AI model can consider various parameters: seasonality, trend, economic cycle and short-term variations, among others. The sales time series can have different profiles: more or less seasonality, for example. Therefore, it is possible to use one artificial intelligence tool to better capture seasonality, another to check the trend, and yet another to capture some other parameter of interest, putting together the history that will form the basis of the predictive model.
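
The idea of separating a sales history into components can be sketched in a few lines of Python. This is not an AI model: it is a classical additive decomposition (centered moving average for the trend, monthly averages for seasonality), used here only as a simple stand-in for the more sophisticated tools mentioned above. The function names and the synthetic `sales` series are assumptions made for illustration, and the moving average as written assumes an even seasonal period such as 12 months.

```python
from statistics import mean

PERIOD = 12  # months per seasonal cycle (assumed even)


def centered_moving_average(series, period=PERIOD):
    """Trend estimate: 2 x period centered moving average.

    Positions where the window does not fit are left as None.
    """
    half = period // 2
    trend = [None] * len(series)
    for t in range(half, len(series) - half):
        window = series[t - half : t + half + 1]
        # The two end points get half weight so every month counts equally.
        total = 0.5 * window[0] + sum(window[1:-1]) + 0.5 * window[-1]
        trend[t] = total / period
    return trend


def decompose(series, period=PERIOD):
    """Split a series into trend + seasonal + residual (additive model)."""
    trend = centered_moving_average(series, period)
    # Remove the trend wherever it is defined, remembering the month index.
    detrended = [
        (t % period, y - tr)
        for t, (y, tr) in enumerate(zip(series, trend))
        if tr is not None
    ]
    # Average the detrended values month by month.
    by_month = [
        mean([v for m, v in detrended if m == month]) for month in range(period)
    ]
    # Center the seasonal effects so they sum to zero over one cycle.
    offset = mean(by_month)
    by_month = [s - offset for s in by_month]
    seasonal = [by_month[t % period] for t in range(len(series))]
    # Whatever trend and seasonality do not explain is the residual.
    residual = [
        None if tr is None else y - tr - s
        for y, tr, s in zip(series, trend, seasonal)
    ]
    return trend, seasonal, residual


# Synthetic three-year history: linear growth plus a repeating monthly pattern.
pattern = [-20, -15, -5, 0, 5, 10, 15, 10, 0, -5, 0, 5]
sales = [100 + 2 * t + pattern[t % PERIOD] for t in range(36)]
trend, seasonal, residual = decompose(sales)
```

On this clean synthetic series the residual is essentially zero wherever the trend is defined. On real sales data the residual is the "something left over" that the simpler components cannot capture, and it is where more capable models, and better data, earn their keep.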

The evolution of the model happens when it picks up nuances that earlier versions were unable to. A predictive model that used linear variations evolves into one that analyzes non-linear variations over time. Or, in a more sophisticated application context, an image recognition system that used to be based on black and white can now capture ten colors; another becomes able to capture 20 colors. And so on.

“In this hypothetical example, we can say that there are already models that pick up the entire color spectrum. What is new is that the problem lies in the information used to train artificial intelligence algorithms. If I use distorted, low-quality images, they will have a negative impact on image recognition technology, for example”, states Ferreira, establishing the link with the data-centric principle.
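
One way to act on this point is to screen training images before they ever reach the model. The sketch below is a minimal illustration in Python, assuming each image already comes with some metadata. The `ImageRecord` fields, the `sharpness` score, the thresholds and the allowed labels are all hypothetical; real projects would derive them from their own cameras, labelling process and quality requirements.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class ImageRecord:
    path: str
    width: int
    height: int
    sharpness: float       # hypothetical score in [0, 1]; higher means sharper
    label: Optional[str]   # None when the image was never labelled


# Illustrative thresholds, not recommendations.
MIN_SIDE = 224
MIN_SHARPNESS = 0.5
ALLOWED_LABELS = {"ok", "defect"}


def quality_issues(rec: ImageRecord) -> List[str]:
    """Return the list of reasons why a record is unfit for training."""
    issues = []
    if min(rec.width, rec.height) < MIN_SIDE:
        issues.append("low_resolution")
    if rec.sharpness < MIN_SHARPNESS:
        issues.append("blurry")
    if rec.label not in ALLOWED_LABELS:
        issues.append("missing_or_unknown_label")
    return issues


def split_by_quality(
    records: List[ImageRecord],
) -> Tuple[List[ImageRecord], List[Tuple[ImageRecord, List[str]]]]:
    """Separate usable records from rejected ones, keeping the reasons."""
    accepted, rejected = [], []
    for rec in records:
        issues = quality_issues(rec)
        if issues:
            rejected.append((rec, issues))
        else:
            accepted.append(rec)
    return accepted, rejected


# Made-up batch, used only to exercise the checks.
batch = [
    ImageRecord("img_001.png", 1024, 768, 0.9, "ok"),
    ImageRecord("img_002.png", 160, 120, 0.8, "defect"),
    ImageRecord("img_003.png", 1024, 768, 0.2, None),
]
accepted, rejected = split_by_quality(batch)
```

Keeping the rejection reasons alongside each record is a deliberate choice: it shows where the collection process itself needs fixing (camera resolution, focus, labelling), which is the kind of attention to data that the data-centric principle calls for.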

“To start being a data-driven company, you have to think more about the data than the model. Data has always been at the heart of analytics and statistics. The concern must be there”

Pedro Guilherme Ferreira

Data Science Specialist

The use of Artificial Intelligence

The expanding use of artificial intelligence in different sectors of the economy, such as industry 4.0, manufacturing or agriculture, does not mean that we want to exclude the evolution of models. The point is that, to improve the precision and performance of these models, it is more helpful to prioritize the quality and accuracy of the data.

We can think here of the issue of technology bias, another universe that involves issues such as algorithmic racism. In the industrial context, if the algorithm responsible for identifying imperfections in the manufacturing process is trained with biased or inaccurate information, then even with the best possible code that AI model will not optimally improve the manufacturing process. Biased data produces biased results.

In this hypothetical scenario, the Internet of Things can be a strong ally. “Every day, more factories are installing sensors. In addition, we will have cameras that take pictures with more pixels, in better resolution, to feed quality control systems through image recognition. The repository of this data will also have better processing and more tools will be available to help look after the quality of the information”, suggests Pedro Guilherme Ferreira.

As the importance of advanced analytics and AI as a tool for differentiating companies across the entire spectrum of the economy grows rapidly, organizations need to find ways to make data analysis more precise, automated and better able to provide targeted responses to business challenges. Against a backdrop of maturing AI models, attention and priorities are gradually turning to the fundamentals of data and the well-known wisdom that, in these cases, quality is more important than quantity.