Back to Home
Learning to Build AI's Report Card: Building Safety Before the Machines Take Over

Learning to Build AI's Report Card: Building Safety Before the Machines Take Over

A conversation with Parth Mukul Gupta, Founder & CEO of Styx, on AI governance, testing agentic systems, and ensuring AI safety before deployment in critical applications.

July 1, 2025
14 min read
By Rachit Magon

Meet Parth Mukul Gupta - at just 23, he's the founder and CEO of Styx, building testing and evaluation platforms for agentic AI systems. He's had multiple failed startups, worked as a consultant at EY, and is now running a company that could determine whether AI agents really work or just end up costing us millions.

While everyone's racing to build the next smartest AI, Parth chose to be the person asking the hard questions: Is AI safe? Is AI biased? Will AI actually do what we want it to do? Let's dive in.

Q: With the EU AI Act potentially fining companies up to €35 million and Tesla facing criminal investigations for their AI, do you think we're already too late to implement proper AI governance and testing?

Parth: Honestly, you're not wrong, and this is exactly why we do what we do. We are already late to the game. Think about it, everyone's racing to build the next AI tool, but very few people are asking: how do we test it before it goes live?

Just imagine this: a few months ago, Tesla self-driving cars were under criminal investigation after a series of unintended accidents. These were AI systems operating in real-time with people trusting them to drive safely, but who was there to test their behavior at scale before going live?

We cannot wait for those failures to happen and then scramble for solutions. It's like waiting for a fire to happen before buying a fire extinguisher.

🔥 ChaiNet's Hot Take: The AI Incidents Database recorded 233 incidents in 2024, marking a 56.4% increase from 2023. Meanwhile, 74% of companies struggle to achieve tangible value from AI, with only 21% having established policies governing employee use of AI technologies. The disconnect between rapid deployment and proper governance is creating a perfect storm of risks.

Q: Amazon had a hiring AI that discriminated against women for years, McDonald's AI tried to sell hundreds of nuggets. How do you test for failures nobody even thought about?

Parth: Let's talk about McDonald's AI chatbot that ordered hundreds of nuggets when no one asked for them. That's a massive blunder that McDonald's never anticipated. Here's the kicker: someone had already tested the system, but the problem was the very limited scope of testing. They never ran real-world scenarios to simulate what could go wrong.

The testers aren't at fault, they had no production data to predict this behavior. That's where the problem comes in: we have to simulate all the edge cases.

At Styx, we do multi-step agent testing and focus specifically on this: not just "can this agent answer this," but "what happens when the agent is forced to make decisions in ambiguous, stressful scenarios?" We stress-test AI systems just like a car would have crash tests before hitting the road.

If we were involved in McDonald's scenario, we would have seen their AI overwhelmed by complex inputs, and we would have caught soft failures before it went public.

🔥 ChaiNet's Hot Take: Only 32% of companies mitigate AI inaccuracy risks, despite inaccuracy being cited more frequently than cybersecurity concerns. Over 40% of agentic AI projects are predicted to be scrapped by 2027 due to escalating costs and unclear business value, highlighting the critical need for proper testing frameworks before deployment.

Q: Traditional AI testing is like testing a car on the same route in perfect weather. But agentic systems are unpredictable. How do you evaluate something when you don't even know what the output should be?

Parth: You're spot on. This was exactly where me and my co-founder were six months ago, thinking this is totally out of scope, we don't know what the output is going to be.

Traditional AI testing is like testing a car with the same route, same speed, perfect weather. But agentic systems are like self-driving cars dealing with constant traffic changes, new roadblocks, weather changes - Bangalore traffic, for example!

The answer is we cannot do the same thing that was happening with chatbots before. We need to simulate how AI will interact with real-world data. We create perfect scenarios with all the entropy of everything hypothetical that could happen, make the agent perform in that scenario, and observe how it adapts.

We look at how agents adapt to new situations, react to previously unseen inputs, and test them in environments that mimic chaotic real-world use cases. We don't wait for agents to break down, we try to break them before they break us.

🔥 ChaiNet's Hot Take: Gartner predicts 15% of day-to-day work decisions will be made autonomously through agentic AI by 2028, up from 0% in 2024. However, 25% of developers estimate that 1 in 5 AI-generated suggestions contain factual errors. The challenge of testing emergent behaviors in complex systems is becoming critical as adoption accelerates.

Q: You've had multiple failed startups and openly wear your failures as badges. What was the moment you realized the real problem wasn't building AI, but testing and evaluating it?

Parth: In our previous startups, we were so focused on creating an AI solution that we never stopped to ask: is this actually working in a real-world environment? The moment I realized that the real problem wasn't AI development but AI deployment at scale - that's where Styx was born.

When we were working on our CFO copilot, I'd claim we had one of the best algorithms, but when it came time for clients to use it, it failed in unexpected ways. Clients didn't get the right insights that AI could predict. That's when we realized: building an AI model is one thing, ensuring it works in real life is what really matters.

This frustration was common. We talked to folks in the valley, seniors in our circuit who were in this field. Everyone was facing the same problem. A friend was working on satellite data for geospatial imaging, a beautiful tool, but it got stalled in pilot because the government couldn't rely on the output.

That's when we realized: this is a common problem across every vertical. Let's solve it once and for all.

🔥 ChaiNet's Hot Take: Despite widespread AI adoption (82% of developers use AI coding assistants daily or weekly), enterprises are struggling with deployment. McKinsey research shows that organizations need at least a year to resolve ROI and adoption challenges including governance, training, and trust issues. The gap between development and reliable deployment represents a massive market opportunity.

Q: How do you sell caution in a market that's racing at high speed and rewarding velocity over validation?

Parth: Honestly, it's not easy - it's a very tough sell. But here's the thing: speed without validation leads to chaos.

I always tell companies transparently: we take one of their agentic systems and do very normal, minimal testing, just 250-300 prompt scenarios that an end user might do. We show them the results, and even top companies fail multiple times. Their performance is not optimum on something they claim works perfectly.

Moving faster without testing isn't just risky, it's expensive. I always talk about three types of risk: operational, regulatory, and reputational. That last one is most important. If reputation gets damaged, it's going to destroy you completely. One single failure in production can set you back months, lose customers, and create reputation crises.

What we sell is smart speed: testing upfront, fixing problems before they happen, so speed doesn't come at the expense of trust. Our vision is to do what Jenkins did for software development, for example, create a CI/CD layer where testing isn't a roadblock but enables reliable execution.

🔥 ChaiNet's Hot Take: Air Canada lost a legal case when their chatbot hallucinated and created a non-existent refund policy, becoming a cautionary tale that made organizations more conscious. With the global AI governance market projected to grow from $227.6 million in 2024 to $4.31 billion by 2033 (35.7% CAGR), the business case for AI safety is becoming undeniable.

Q: Most AI implementation challenges seem organizational rather than technical. How do you convince people who might not understand the risks or are protecting their turf?

Parth: You're 100% right. Even board members and executives are subconsciously aware that the speed we're going is risky. That's why you notice all the use cases are in less problematic scenarios where the cost of failure is minimal and the cost of value is high.

We never see ourselves as just a company selling tech. We sit with teams, understand their problem statements, understand where they're coming from, what business problems they want to solve, and convey it in analogies they understand.

I'm very optimistic about the future. When leaders understand the long-term benefits of early testing and the risks of not doing it, it's easier to implement these solutions. Companies are becoming more aware because of all the failures we see across the world.

The damage in terms of reputational, operational, or regulatory impact is so high that stress testing has become necessary. Agents will be a daily part of life in the next 5-10 years, we cannot have even a small percentage of error.

🔥 ChaiNet's Hot Take: 68% of CEOs say governance for generative AI must be integrated upfront in design rather than retrofitted after deployment. However, only 27% of organizations review all AI-generated content before use. The gap between executive awareness and operational implementation remains a critical challenge requiring both technological and cultural solutions.

Q: When you think about five years from now, with AI agents managing insurance, finance, healthcare. What do you want people to remember about Styx's work?

Parth: We want to go forward with the Jenkins analogy - that was a revolution that almost every engineer was happy about. I want Styx to be remembered as the company that made reliability testing a non-negotiable part of every agent's lifecycle.

When AI is making decisions about your finance, health, job opportunities, and day-to-day applications, I want people to look back and say: "Styx made sure it's safe. Styx made sure I can use it every single day without fear. Styx helped build the foundation of testing and accountability for AI systems."

That's the vision we're working toward, and that's something we're confident we're going to achieve in this next 5-10 years marathon.

🔥 ChaiNet's Hot Take: 33% of enterprise software applications will include agentic AI by 2028, up from less than 1% in 2024. With AI systems becoming integral to critical decisions, the companies that establish testing and governance standards early will likely become the infrastructure layer that enables safe AI deployment at scale, potentially creating a new category of AI safety tools worth billions.

🔥 ChaiNet's Hot Take: This prediction aligns with industry data showing that AI coding assistants are already being used by 82% of developers daily or weekly, with AI contributing to 46% of code in enabled files. The transformation from coding quantity to architectural quality mirrors historical technological shifts, suggesting fundamental changes in how technical work is structured and valued.

Final Thoughts

Parth Mukul Gupta's work at Styx addresses one of the most critical gaps in the AI ecosystem: ensuring safety and reliability before deployment. As AI systems become more autonomous and integrated into critical applications, the need for comprehensive testing and governance frameworks becomes not just important, but essential for preventing catastrophic failures.

His vision of creating "Jenkins for AI testing" represents more than just a business opportunity, it's about building the infrastructure that will enable society to safely harness the power of AI agents. In a world racing toward AI deployment, someone needs to grade the machines before they start grading us.

Connect with Parth: You can find Parth building AI's report card while working toward his mission of making AI testing a non-negotiable part of every agent's lifecycle.

Want to be featured on ChaiNet? We're always looking for interesting conversations about technology, AI, and the future of work. Reach out to us!


Related YouTube Shorts

Explore short-form videos that expand on the ideas covered in this blog.