Beyond The Algorithm: Why Data Governance Is Key To Pharma’s AI Future
With proper data governance, the pharma industry can improve patient-centricity in trials and bring lifesaving therapies to market quickly and safely.

The team’s latest study is peer-reviewed and due to be presented at this summer’s International Conference on Machine Learning in Vienna, Austria. Epoch is a nonprofit institute hosted by San Francisco-based Rethink Priorities and funded by proponents of effective altruism — a philanthropic movement that has poured money into mitigating AI’s worst-case risks.

Jordan Betterman (MLDS ’25) is a graduate assistant for Northwestern’s men’s soccer team and was responsible for gathering the data students used during the Hackathon.
“Maybe you don’t lop off the tops of every mountain,” jokes Selena Deckelmann, chief product and technology officer at the Wikimedia Foundation, which runs Wikipedia. “It’s an interesting problem right now that we’re having natural resource conversations about human-created data. I shouldn’t laugh about it, but I do find it kind of amazing.”
“There’d be something very strange if the best way to train a model was to just generate, like, a quadrillion tokens of synthetic data and feed that back in,” Altman said. But as ML models grow bigger and data becomes more abundantly available, there is a need to find scalable solutions to assemble quality training data. “The emphasis on how important data — especially high-quality data that match with application scenarios — is to the success of an AI model has brought teams together to solve these challenges,” Sagiraju said. Machine learning powers AI programs like text-prompted image generator Midjourney and OpenAI’s chat-based text generator ChatGPT.
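Feeding a model’s own outputs back in as training data has a rough classical analogue in self-training, where a classifier labels unlabeled examples and is retrained on its own confident predictions. A minimal sketch with scikit-learn (the toy dataset and the 0.9 confidence threshold are illustrative choices, not from any of the studies discussed here):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier

# Toy dataset: 1,000 points, but only the first 50 keep their labels.
X, y = make_classification(n_samples=1000, random_state=0)
y_partial = y.copy()
y_partial[50:] = -1  # -1 marks examples as unlabeled

# The classifier pseudo-labels unlabeled points it is confident about
# (predicted probability >= 0.9) and retrains on them, round after round.
model = SelfTrainingClassifier(LogisticRegression(), threshold=0.9)
model.fit(X, y_partial)

print(f"accuracy on the true labels: {model.score(X, y):.3f}")
```

The risk the quote describes shows up even in this tiny loop: any systematic error in the early model gets baked into the pseudo-labels it later trains on.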
- But there is still a lot more to be done at different levels, including organization structure and company policies.
- Such models train on vast reams of human-created data from the internet, learning, for instance, that when asked to draw a banana it should be yellow or green and curved.
- Not only did the Hackathon bring students together with the athletic department, but it also introduced them to The Garage and its resources for entrepreneurs.
- “I love tackling real-world problems through data science and wanted to create something impactful alongside a group of classmates I’ve grown close to throughout the program,” Li said.
Artificial intelligence systems like ChatGPT could soon run out of what keeps making them smarter — the tens of trillions of words people have written and shared online. Facebook parent company Meta Platforms recently claimed the largest version of its upcoming Llama 3 model — which has not yet been released — has been trained on up to 15 trillion tokens, each of which can represent a piece of a word.
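A token is roughly a word fragment, and token counts are how training corpora like Llama 3’s are measured. A quick illustration with OpenAI’s tiktoken library (the cl100k_base encoding is an arbitrary choice for demonstration; it is not the tokenizer Meta uses):

```python
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")
text = "Artificial intelligence systems could soon run out of training data."
ids = enc.encode(text)

print(len(text.split()), "words ->", len(ids), "tokens")
# Show how the words split into sub-word pieces:
print([enc.decode([i]) for i in ids])
```

By a common rule of thumb, one token averages roughly three-quarters of an English word, which is why "15 trillion tokens" and "tens of trillions of words" land on comparable scales.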
“As the person who uses the data the teams were provided, it was an amazing experience to see unique solutions that the teams presented,” he said. “It inspired me to create new solutions to the program’s issues and I cannot wait to implement some of the projects into the program’s workflow.”

“This accelerated timeline taught us critical lessons in rapid decision-making, collaborative teamwork, and efficient problem-solving,” he said. “It’s a unique opportunity to simulate real-world, high-pressure scenarios where delivering impactful solutions quickly is crucial.”

“While many teams start off with manually labeling their datasets, more are turning to time-saving methods to partially automate the process,” Sagiraju said.
Fortunately, a few trends are helping companies overcome some of these challenges, and Appen’s AI Report shows that the average time spent in managing and preparing data is trending down. According to Appen, business leaders are much less likely than technical staff to consider data sourcing and preparation as the main challenges of their AI initiatives. “There are still gaps between technologists and business leaders when understanding the greatest bottlenecks in implementing data for the AI lifecycle. This results in misalignment in priorities and budget within the organization,” according to the Appen report. ML teams can start with prelabeled datasets, but they will eventually need to collect and label their own custom data to scale their efforts.
The framework models the complex mechanical behavior of spinodal microstructures by combining submicron 3D printing, in-situ electron microscopy testing, and deep learning. It accurately captures nonlinear, directional stress-strain responses with prediction errors as low as 5 to 10 percent. It is also scalable and adaptable to other classes of metamaterials and future multi-functional materials, making it a potentially transformative step toward AI-guided materials-by-design. The work was supported by the Air Force Office of Scientific Research, the Office of Naval Research, and the US National Science Foundation.
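The article does not describe the network itself, but the core idea of regressing nonlinear, direction-dependent stress-strain behavior from experimental measurements can be sketched with a small PyTorch model (the feature count, architecture, and random stand-in data below are placeholders, not the Northwestern team’s setup):

```python
import torch
from torch import nn

# Hypothetical inputs: microstructure descriptors, loading direction, and
# applied strain, concatenated into one feature vector per measurement.
class StressNet(nn.Module):
    def __init__(self, n_features=8, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),  # predicted stress at that strain
        )

    def forward(self, x):
        return self.net(x)

model = StressNet()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

# Random stand-ins for the (scarce) experimental dataset.
features = torch.randn(256, 8)
measured_stress = torch.randn(256, 1)

for _ in range(200):
    optimizer.zero_grad()
    loss = loss_fn(model(features), measured_stress)
    loss.backward()
    optimizer.step()
```

With so little experimental data available, a compact regressor like this is a plausible shape for the problem; the published framework’s 5 to 10 percent error figures refer to its own models, not this sketch.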
The Northwestern team plans to apply the approach to multifunctional metamaterials with thermal, acoustic, or biological properties, and to develop generative design tools that use machine learning to quickly evaluate design options. “It sets a precedent for using real, scarce and sparse experimental data — not simulations — as the foundation for inverse design of architected materials, bridging the gap between theoretical design and fabrication reality,” Espinosa said.

If real human-crafted sentences remain a critical AI data source, those who are stewards of the most sought-after troves — websites like Reddit and Wikipedia, as well as news and book publishers — have been forced to think hard about how they’re being used.
AI companies should be “concerned about how human-generated content continues to exist and continues to be accessible,” she said. The researchers first made their projections two years ago — shortly before ChatGPT’s debut — in a working paper that forecast a more imminent 2026 cutoff of high-quality text data.

Advances in computing and ML tools have helped automate and accelerate tasks such as training and testing different ML models. Cloud computing platforms make it possible to train and test dozens of different models of different sizes and structures simultaneously.
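On a single machine the same pattern looks like the sketch below: a process pool trains and scores several candidate model configurations at once, with scikit-learn standing in for a cloud training service (the candidate models and cross-validation setup are arbitrary examples):

```python
from concurrent.futures import ProcessPoolExecutor

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, random_state=0)

# Candidate models of different sizes and structures.
candidates = {
    "logreg": LogisticRegression(max_iter=1000),
    "rf_small": RandomForestClassifier(n_estimators=50, random_state=0),
    "rf_large": RandomForestClassifier(n_estimators=500, random_state=0),
}

def evaluate(name):
    # Train and test one candidate with 5-fold cross-validation.
    return name, cross_val_score(candidates[name], X, y, cv=5).mean()

if __name__ == "__main__":
    # Each candidate trains in its own process, all at the same time.
    with ProcessPoolExecutor() as pool:
        for name, score in pool.map(evaluate, candidates):
            print(f"{name}: {score:.3f}")
```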
From the perspective of AI developers, Epoch’s study says paying millions of humans to generate the text that AI models will need “is unlikely to be an economical way” to drive better technical performance. As OpenAI begins work on training the next generation of its GPT large language models, CEO Sam Altman told the audience at a United Nations event last month that the company has already experimented with “generating lots of synthetic data” for training.
Not only that, but Papernot’s research has also found it can further encode the mistakes, bias and unfairness that’s already baked into the information ecosystem.

Object detection models, for example, require the bounding boxes of each object in the training examples to be specified, which takes considerable manual effort. Automated and semi-automated labeling tools use a deep learning model to process the training examples and predict the bounding boxes.
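A common way to bootstrap this is model-assisted pre-labeling: run a pretrained detector over unlabeled images and keep its confident boxes as draft annotations for a human to review. A minimal sketch with torchvision (the image path and the 0.8 confidence cutoff are placeholder choices):

```python
import torch
from torchvision.io import read_image
from torchvision.models.detection import (
    FasterRCNN_ResNet50_FPN_Weights,
    fasterrcnn_resnet50_fpn,
)

# Pretrained COCO detector used as the pre-labeling model.
weights = FasterRCNN_ResNet50_FPN_Weights.DEFAULT
model = fasterrcnn_resnet50_fpn(weights=weights).eval()
preprocess = weights.transforms()

image = read_image("unlabeled/photo_0001.jpg")  # placeholder path
with torch.no_grad():
    pred = model([preprocess(image)])[0]

# Keep only confident predictions as draft bounding boxes for review.
confident = pred["scores"] >= 0.8
for box, label in zip(pred["boxes"][confident], pred["labels"][confident]):
    name = weights.meta["categories"][int(label)]
    print(name, [round(v) for v in box.tolist()])
```

Annotators then only correct the detector’s mistakes instead of drawing every box from scratch, which is where the time savings Sagiraju describes come from.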