saas22inc - Tumblr blog

saas22inc · 5 months ago

Text

The Road to Reliable AI with Hamming CEO, Sumanyu Sharma

TL;DR: Sumanyu Sharma is the Founder & CEO of AI startup, Hamming. He’s been obsessed with AI forever—from academic research to production systems at Tesla—and he has an uncommonly well-rounded view of what it takes to make AI actually work. In this wide-ranging interview, Sumanyu shares how Hamming is tackling the key challenges of AI reliability through prompt tuning, evaluation, and observability. He also dives into the thorny questions around governing AI's risks and societal impacts. Read Sumanyu's interview for a nuanced take on what responsible AI development looks like and why reliability is the key to unlocking the technology's full potential.

Sumanyu, thanks for making the time to do this. Let's kick things off with how you first got hooked on AI - was there a lightbulb moment that set you on this path or was it a gradual build up?

I've always been fascinated by systems that learn and improve with more data or repetition, whether it's human or AI systems. As a kid, I used spaced repetition techniques to retain information better and optimize my own learning. This early interest in self-optimization set the stage for my passion for AI. During my undergrad at the University of Waterloo, I got my first taste of real-world AI—back then it was called machine learning.

“My "aha" moment came when I took Andrew Ng's class on Coursera and built a hand digit classifier that actually worked. It was incredible to see that machines can learn and improve with more data, just like humans.”

The realization that I could build systems capable of learning and adapting on their own was a game-changer for me. Since then, I've been hooked on systems that get better with more data. I took courses to learn the fundamentals of AI, pursued internships to gain practical experience, and worked with research groups at my university to dive into academic research.

I was fortunate to drive massive business outcomes using AI at Tesla and later at Citizen, sometimes with simple models, sometimes with complex ones.

When ChatGPT launched, I became obsessed with how LLMs work and what they can do, quickly becoming a daily active user and a power user. The pace of improvement in the AI space since then has been breathtaking.

At places like Tesla, Citizen, or even back in college - what were some of the key experiences that made you recognize the need for more reliable AI systems?

“Lack of reliability has always been the #1 issue limiting the widespread use of machine learning across various fields. Generative AI is just a new flavour of an old problem.”

During my time at the University of Waterloo, I aimed to help radiologists diagnose patients more accurately and quickly by leveraging past records of similar X-rays. This idea seemed obviously useful, but achieving reliable results was challenging. We developed a feature extraction technique by stacking auto-encoders into binary vectors to semantically search for similar medical images. While we got results good enough to post an arXiv paper, they weren't reliable enough for clinical settings, where a misdiagnosis could harm real patients. This experience highlighted the gap between academic success and real-world reliability, underscoring the need for more robust AI systems.

At Tesla, I encountered a similar issue but on a larger scale. Humans are terrible drivers—94% of motor vehicle accidents are caused by driver error. Building Level 5 autonomy could save thousands of lives each year. It took Waymo eight years to develop a system reliable enough for production. Tesla's approach is different but still not reliable enough for use without human supervision. Working on these projects made it clear how crucial reliability is for AI applications that impact human lives. The stakes are incredibly high, and any system that isn't 100% reliable can't be trusted to operate autonomously in critical situations.

I believe current multi-modal models face similar reliability challenges as early autonomous systems. While foundational models continue to improve, there remains a significant gap between their raw capabilities and the ability to translate these capabilities into reliable AI products and services.

“Businesses need AI systems that can consistently deliver accurate results in real-world conditions, not just in controlled environments or simulations.”

Moreover, this reliability gap isn't just a technical issue but also a matter of trust. Users need to trust that AI systems will perform as expected without causing unintended harm. This trust is built through rigorous testing, transparent operations, and continuous improvement. As AI systems become more integrated into our daily lives, ensuring their reliability will be paramount.

Your background spans publishing AI research, product leadership, leading data science teams, and driving strategic initiatives. How did wearing all those hats shape your approach and priorities at Hamming?

Well, I've been fortunate to have a diverse background.

Having done research in medical image search and deep learning, I have some intuition for fruitful research-oriented work streams. For example, we recently benchmarked major LLMs—GPT-4, Opus, Gemini 1.5 Pro, and Llama 3 70B—on codegen tasks by partnering with the University of Waterloo. This study was pretty popular on Reddit and within the YC community, and we plan to do more work like this.

My background in growth and data also provides a sixth sense of the growth rates we need to hit, what's realistic for fundraising and internal planning, as well as how to properly impact size and prioritize new features as Hamming scales.

And having been an engineer, I have a decent grasp of the engineering talent required at each stage, making collaboration with our technical team that much smoother without heavily relying on external judgement.

“I would be a significantly worse CEO without the critical skills I learned along the way before co-founding Hamming.”

Because I have a deep understanding of the entire product, engineering, research, and go-to-market landscape, I can make decisions quickly and decisively.

Can you walk us through the early days of founding Hamming? What were the biggest challenges in going from an innovative idea to an actual product and business?

The core idea for Hamming started from an evals script I wrote to solve a customer's pain point when building their RAG pipeline. The goal was to make changes to any part of the retrieval pipeline and get quantitative feedback on metrics like accuracy, hallucinations, and latency in minutes—unlike the typical vibe-checking outputs on a handful of examples in a prompt playground.

I found myself spending more time fixing the script, which was supposed to accelerate my feedback loop, than doing the actual work. This felt unusually hard, and I realized others must be experiencing the same problem.

So, I teamed up with Marius, my co-founder and CTO, to build a platform to solve this problem more comprehensively. This was definitely a "scratch your own itch" founding story.

Going from zero to one is brutal. Everything is manual, and there are no A/B tests because you don't have enough users. You have to do things that don't scale to get a handful of paying customers and do everything possible to make them love your product.

“My conviction in the problem space grows with every user I speak to. Most teams tell me that reliability is the #1 concern stopping them from shipping AI products to their customers.”

We've shipped prompt tuning, evaluation, and observability, but there's a lot more to do. We're releasing a new product every month, each tackling a different aspect of reliability.

It seems your platform components—prompt tuning, evaluation, and observability—form a cohesive solution. Can you give us an overview of how they work together to increase AI reliability for enterprises?

That's right. Prompt tuning, evaluation, and observability work together to drive reliability in a piecewise fashion.

Most teams start by writing basic prompts to quickly prototype a solution to their problem. It's easy to get 60% of the results with 20% of the effort.

“Improving prompts from 60% to 95% is extremely painful. Our prompt tuner samples the prompt solution space, tries hundreds of new prompts, and quickly returns the optimal prompt with its quantitative performance on a dataset.”

This prompt tuning heavily relies on evaluation to find the best prompts. As you move beyond prompts, you'll find it's challenging to improve the retrieval performance of RAG-based systems. RAG systems can fail during retrieval (failing to fetch the correct documents) or reasoning (failing to reason with the correct documents passed to the LLM).

We've built RAG-specific model-based evaluators that help teams differentiate between these error types. This allows teams to focus on improving their retrieval pipeline if retrieval is the bottleneck or their prompts if hallucinations are the primary issue.

Similarly, tool use is often unreliable when building AI agents. Common failure modes include improper arguments returned, wrong functions called, or hallucinated functions/parameters. We've built a function definition optimizer that tests different function call definitions to get the best results.

The combination of prompt tuning, RAG evaluations, and function definition optimization saves teams hundreds of hours per week during development. As you launch your AI products into production, you need to deeply understand how users are interacting with them beyond just token usage, latency, and other basic metrics.

Our observability tools provide this insight, helping you ensure ongoing reliability and user satisfaction.

One of your key value propositions is "reliable AI in weeks" vs months. What core innovations allow for such a rapid development cycle?

“The biggest bottleneck in building reliable AI systems comes from having humans in the loop. Without LLMs to speed up iteration velocity, making changes to your prompt, retrieval pipeline, or function definition, requires a human to check if the change improved accuracy or increased hallucinations. But humans are slow, expensive, and unreliable.”

It's impossible for a human to check every single edge case by hand. As a result, you can run a maximum of 2-3 experiments per week with this human-in-the-loop approach. This is why most AI products take months or even years to become reliable enough for teams to feel comfortable shipping. Even after shipping something to production, teams are nervous about making changes that could cause regressions for end customers.

LLMs, however, are great at reasoning and can measure accuracy, tone, hallucinations, and other quality metrics 20 times cheaper and 10 times faster than humans. LLMs will only continue to get smarter, making iteration feedback cycles even faster.

“By using LLMs as judges and leveraging them to generate optimized prompts and function definitions, teams can run 10 experiments per day.”

We spend a lot of time fine-tuning LLM evaluators to model human preferences because an LLM's definition of good may not be the same as a human's. We work closely with each customer to build their own custom evaluators to ensure the highest accuracy and relevance.

This ability to rapidly iterate and improve using LLMs is what allows us to deliver reliable AI in weeks rather than months. By automating the evaluation process and optimizing every step of the development cycle, we significantly cut down the time required to achieve robust, reliable AI systems.

I'm interested in hearing more about Hamming's prompt tuning capabilities. How does auto-generating optimized prompts improve performance vs manual approaches?

Writing high-quality and performant prompts by hand requires enormous trial and error. Here's the usual workflow:

Write an initial prompt.

Measure how well it performs on a few examples in a prompt playground.

Tweak the prompt by hand to handle cases where it's failing.

Repeat steps 2 & 3 until you get tired of wordsmithing.

What's worse, new model versions often break previously working prompts. Or, say you want to switch from OpenAI GPT3.5 Turbo to Llama 3. You need to re-optimize your prompts by hand.

Our take: use LLMs to write optimized prompts for other LLMs.

Describe the task you want to accomplish.

Add examples of input/output pairs that best describe the task.

Start optimizing.

Behind the scenes, we use LLMs to generate different prompt variants. We use an LLM judge to measure how well a particular prompt solves the task by measuring performance on the input/output pairs you described. We capture outlier examples and use them to improve the few-shot examples in the prompt. We run several "trials" to refine the prompts iteratively.

This is very similar to how metaheuristic optimization algorithms like genetic algorithms and simulated annealing find global optimal solutions by intelligently sampling the search space. The benefits are obvious.

“No more tedious wordsmith-ing. No more scoring outputs by hand. No need to remember to tip your LLM or ask it to think carefully step-by-step. Using LLMs to auto-generate optimized prompts drastically improves performance compared to manual approaches, saving you time and effort while ensuring consistent, high-quality results.”

Shifting gears to a more tactical question - as a founder, I'm sure you've leveraged many SaaS tools to power Hamming's growth and operations. Across customer experience, employee engagement, sales intelligence, productivity, and security - what have been some of your favorite tools that you consider indispensable in your stack? Any hidden gems more founders should know about?

We love supporting other YC companies, and we use a variety of SaaS tools to power Hamming's growth and operations. Here are some of our favorites across different areas:

Customer Support: We use Atlas (YC) for customer support. It's been a game-changer for managing customer interactions.

Sales Intelligence: For sales, we use a combination of Apollo (YC), LinkedIn, and Dripify. This mix helps us talk to people who care about what we're building.

Productivity: I'm a huge fan of Superhuman for emails—it makes managing my inbox a breeze. Superwhisper is great for speech-to-text, Warp (an AI-first terminal) saves me time from having to remember bash commands, and Cursor (an AI-first VSCode fork) makes our eng team at least 3x faster. For task management, I rely on Sunsama for personal tasks and Linear for managing engineering tasks.

Code Management: We use Greptile (YC) for semantically searching across our entire codebase and automating PR reviews, and Ellipsis (YC) for additional PR review support.

Documentation: Mintlify (YC) is our go-to for documentation. It makes creating and maintaining high-quality docs easy and efficient.

These tools are indispensable in our stack, and I highly recommend them to other founders. They help us stay productive, organized, and focused on what matters most - talking to customers and making something people want.

Talent capable of building robust AI products is liquid gold—expensive and in high-demand. How has Hamming been able to attract and retain a high caliber team despite this challenge?

You're totally right. At Hamming, we've been fortunate to attract and retain a talented team so far by emphasizing our mission and creating a culture of complete freedom and ownership.

Our team is passionate about making AI reliable for all enterprises. There's no silver bullet to reliability—we need to solve many problems along the way to achieve this goal. Our internal success criteria is for every enterprise to use Hamming to build AI products. The team won't stop until we get there.

Many organizations are extremely top-down, with CEOs pushing half-baked ideas down the product and engineering teams' throats. At Hamming, we believe the best ideas come from within the organization and from insights gained by talking to customers. This is especially true in our space, where the market is changing rapidly.

“Anyone in our org can propose an idea, show why it's a good idea, how many customers it can impact, and then execute on it without needing top-down buy-in from me or anyone else.”

This inclusive and empowering culture drives our success and keeps our team motivated and engaged.

And with great freedom comes total ownership. When someone executes an idea and it doesn't work, they take complete ownership of the outcome, and propose a plan to either wind it down or present new adjacent ideas based on what they learned.

The combination of having a compelling mission and a culture that treats people like adults that keeps our team grinding every single day, seven days a week.

Unreliable AI systems can lead to serious issues like the Air Canada chatbot incident. What processes or guardrails does Hamming have to prevent such brand risks or violations of business policies?

Absolutely. That was a huge wake-up call for enterprises using the "move fast and break things" mentality to ship unreliable AI products. Another instance was New York City's "MyCity" AI chatbot, which ended up hallucinating and accidentally telling users to break the law.

For example, when asked if an employer can take a portion of their employees' tips, the bot responded affirmatively, despite the law stating that bosses are not allowed to take employee tips.

Preventing brand risks and violations of business policies requires a comprehensive, multi-pronged approach. Here's how we tackle it at Hamming:

Prompt Reliability: Every prompt in your system needs to be reliable, version-controlled, and audited for robustness against prompt injection attacks. Our prompt tuning product helps businesses create prompts that are more reliable and less susceptible to such attacks.

Evaluation for RAG & AI Agents: Our evaluation solution helps teams measure and minimize hallucinations during development. Every time a team makes a change to the prompt or retrieval pipeline, they can rely on Hamming to detect regressions and identify areas for improvement.

Proactive Red-Teaming: We conduct proactive red-teaming on existing AI systems to test their resilience against known prompt injection attacks or malicious inputs. This adds an extra layer of safety before teams deploy their products to production.

Guardrails: We're currently building AI guardrails that act as an internal firewall, preventing unwanted, harmful, or inaccurate statements from ever reaching your end users.

“By making prompts resilient, using evaluations to measure and minimize hallucinations, proactively red-teaming to ensure robustness against prompt injection attacks, and using guardrails as a final firewall, enterprises can safely deploy their AI products to production and keep them secure.”

You've described 2023 as the "year of demos" and 2024 as the "year of reliability". Looking ahead, what do you see as the next frontier or major challenge facing widespread enterprise AI adoption in 2025 and beyond?

I think reliability will continue to be a challenge in 2025 and beyond. The shape of 'reliability' will evolve—what's hard today may be easier tomorrow, but new attack vectors will emerge. For example, most LLMs are trained on publicly available data, often scraped from websites. A rogue actor, possibly a government, could create poisoned datasets to corrupt the pre-training process and subtly bias model outputs. I recently learned about a front-running poisoning technique targeting web-scale datasets that periodically snapshot crowdsourced content—such as Wikipedia—where an attacker only needs a time-limited window to inject malicious examples. We'll need new solutions to check the integrity of the datasets used to train LLMs.

Governance will also be a significant challenge. As AI models become more powerful and ubiquitous, enterprises will face increased scrutiny to ensure their AI systems are used safely, ethically, and without bias.

“Regulatory frameworks will likely become more stringent, requiring companies to demonstrate compliance with new standards. And this regulatory burden‌ could introduce significant friction, making it harder to innovate rapidly.”

Organizations will need robust governance frameworks to manage compliance while still fostering innovation. Ensuring AI operates ethically and without bias will be paramount. As AI systems influence more aspects of society, the demand for transparency and accountability will grow.

Enterprises will need to implement comprehensive bias detection and mitigation strategies, ensuring their AI models do not perpetuate existing inequalities or introduce new biases. Developing explainable AI (XAI) will be crucial to provide insights into how models make decisions, thereby building trust with users and regulators.

Security and privacy concerns will also be more pronounced. Protecting sensitive data from breaches and ensuring individual privacy will be critical. AI systems must be robust against adversarial attacks, where malicious actors attempt to deceive or manipulate AI behavior. Strong security measures and maintaining data privacy will be essential to safeguard both the technology and its users.

Finally, the environmental impact of large-scale AI deployments, particularly the energy consumption of training and running complex models, will become a significant concern. Enterprises will need to adopt more sustainable practices, such as optimizing algorithms for efficiency and leveraging green computing resources. Balancing the benefits of AI with its environmental footprint will be a key consideration for future AI strategies.

As Hamming deploys AI systems at scale there's power - but also responsibility. What's your take on governing the ethical risks and societal impacts of such powerful technologies?

Deploying AI systems at scale comes with significant responsibility. AI has the potential to revolutionize industries and improve lives, but it may also displace millions of jobs. As research teams advance foundational models, everyone is learning and adapting to this new reality. While I don't have deep expertise in risk mitigation, I can offer a few ideas.

Firstly, transparency is crucial. Our AI systems must be explainable, providing clear insights into how decisions are made. This builds trust with users and regulators, ensuring our technology isn't a black box but a tool that can be understood and scrutinized. I admire the work Anthropic has done in making their models more explainable and easier to trust.

Secondly, we need robust frameworks to detect and mitigate bias. AI systems should not perpetuate existing inequalities or introduce new biases. Continuous monitoring helps detect and address biases as they emerge. Beyond technical solutions, fostering a diverse team with contrarian opinions is essential in managing these risks.

Thirdly, privacy and security are non-negotiable. Protecting sensitive data from breaches and ensuring individual privacy will become even more important.

Fourth, on the regulation side, we need to create robust frameworks that balance the need for innovation with the imperatives of safety and ethics.

In essence, governing the ethical risks and societal impacts of AI requires a multi-faceted approach. At Hamming, we are committed to building not just powerful AI systems, but responsible ones.

What responsibilities do you believe AI companies should have in terms of pressure-testing for biases, discrimination, or potential harms before releasing systems?

Both foundational AI companies and application-focused AI companies have an enormous responsibility to ensure their systems are thoroughly pressure-tested for biases, discrimination, and potential harms before release.

At Hamming, our evaluation, continuous monitoring, and red-teaming services rigorously evaluate models and AI systems for any signs of bias or discrimination. We use diverse datasets that reflect the real-world scenarios our customers' systems will encounter. By simulating various edge cases and stress-testing the models in controlled environments, we can identify and mitigate risks early on.

Secondly, transparency and alignment are key. The more we understand about how these AI systems work, the better we can align them to reflect human preferences, confidently eliminating bias, discrimination, and other potential harms.

Moreover, continuous monitoring post-deployment is essential. Implementing feedback loops to measure real-world performance ensures that the systems remain fair and effective over time.

At Hamming, we take pressure-testing for biases, discrimination, and potential harms seriously. By doing so, we can ensure that our AI systems are not only powerful but also just and beneficial to all.

Looking 5-10 years out, what types of guardrails—whether regulation, compliance, dynamic monitoring, or other safeguards—do you believe will be critical for responsible AI development?

Looking 5-10 years out, the most exciting AI use cases are in regulated industries like healthcare, financial services, and law. All four areas you mentioned—regulation, compliance, dynamic monitoring, and guardrails—are crucial for using AI safely, fairly, and without systemic bias.

Firstly, we need a sensible regulation framework. As AI systems become more integrated into critical sectors, clear and enforceable regulations will help ensure these technologies are developed and deployed responsibly. Regulations should focus on transparency, accountability, and fairness, setting standards that are logical and don't slow down innovation too much.

Compliance is equally important. AI companies must develop and maintain comprehensive compliance programs that align with regulatory requirements. This includes regular audits, documentation, and adherence to best practices to ensure ongoing compliance.

Dynamic monitoring will be crucial for maintaining the integrity and reliability of AI systems. We continuously monitor, detect, and mitigate potential biases, errors, or malicious activities on behalf of our customers before they cause harm.

Guardrails, including ethical guidelines and operational safeguards, are necessary to guide AI development. These guardrails should be embedded throughout the AI lifecycle, from design to deployment.

Additionally, fostering a culture of responsibility within AI companies is pretty vital. Education and training programs that emphasize ethical AI development, coupled with a commitment to transparency and accountability, will help build a foundation for responsible innovation. Encouraging collaboration with external stakeholders, including policymakers, ethicists, and the public, will further strengthen the guardrails around AI development.

We've covered a lot of ground today. Before we wrap up, is there anything else you'd like to add or emphasize in terms of Hamming's mission and the future you envision? What final thoughts can you leave us with?

Our mission is making AI reliable.

“We believe every company is already an AI company or will become an AI company in the future.”

We imagine a world where we can help every enterprise build self-improving and reliable systems that unlock trillions in economic value and significantly speed up innovation in all areas - especially science and technology.

#AI consultancy #SaaS #AI Agents

1 note · View note