#benchmarking
quotelr · 5 months ago
Quote
How do you visualize the success of those parts of the world falling under your leadership? What about the success of your organisation, your family and all the people related to the two institutions? How fulfilling are the results of this personal evaluation? What can you start doing about it, today?
Archibald Marwizi, Making Success Deliberate
2 notes · View notes
frog707 · 6 months ago
Text
Please hold
The project to convert my buildscripts to Kotlin is on hold because I have an EXCITING NEW PROJECT.
Earlier this month (June 2024) Mazhar Akbar drew my attention to his performance comparison between JMonkeyEngine and Godot on a physics-intensive workload. The comparison favored Godot by a large margin. I was skeptical at first, but gradually I became convinced that, in order to level the field, JMonkeyEngine needs a new physics engine, one based on Jolt Physics instead of Bullet.
So now I'm coding all-new JVM bindings for Jolt. Jolt is an open-source software project of some complexity (about 100,000 lines of C++ code), so this could take a while. Please hold. But not your breath.
I'm having a blast!
3 notes · View notes
juaniitaalopezz · 14 hours ago
Text
Measuring and Tracking Operational Performance https://kamyarshah.com/measuring-and-tracking-operational-performance/
0 notes
jcmarchi · 23 hours ago
Text
The Race for AI Reasoning is Challenging our Imagination
New Post has been published on https://thedigitalinsider.com/the-race-for-ai-reasoning-is-challenging-our-imagination/
New reasoning models from Google and OpenAI
Created Using Midjourney
Next Week in The Sequence:
Edge 459: We dive into quantized distillation for foundation models, including a great paper from Google DeepMind in this area. We also explore IBM’s Granite 3.0 models for enterprise workflows.
The Sequence Chat: Dives into another controversial topic in gen AI.
Edge 460: We dive into Anthropic’s recently released model context protocol for connecting data sources to AI assistants.
You can subscribe to The Sequence below:
TheSequence is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.
📝 Editorial: The Race for AI Reasoning is Challenging our Imagination
Reasoning, reasoning, reasoning! This seems to be the driver of the next race for frontier AI models. Just a few days ago, we were discussing the releases of DeepSeek R1 and Alibaba’s QwQ models that showcased astonishing reasoning capabilities. Last week, OpenAI and Google showed us that we are just scratching the surface in this area of gen AI.
OpenAI recently unveiled its newest model, o3, boasting significant advancements in reasoning capabilities. Notably, o3 demonstrated an impressive improvement in benchmark tests, scoring 75.7% on the demanding ARC-AGI benchmark, a significant leap toward achieving Artificial General Intelligence (AGI). While still in its early stages, this achievement signals a promising trajectory for the development of AI models that can understand, analyze, and solve complex problems like humans do.
Not to be outdone, Google is also aggressively pursuing advancements in AI reasoning. Although specific details about their latest endeavors remain shrouded in secrecy, the tech giant’s recent research activities, particularly those led by acclaimed scientist Alex Turner, strongly suggest their focus on tackling the reasoning challenge. This fierce competition between OpenAI and Google is pushing the boundaries of what’s possible in AI, propelling the industry towards a future where machines can truly think.
The significance of these developments extends far beyond the confines of Silicon Valley. Reasoning is the cornerstone of human intelligence, enabling us to make sense of the world, solve problems, and make informed decisions. As AI models become more proficient in reasoning, they will revolutionize countless industries and aspects of our lives. Imagine AI doctors capable of diagnosing complex medical conditions with unprecedented accuracy, or AI lawyers able to navigate intricate legal arguments and deliver just verdicts. The possibilities are truly transformative.
The race for AI reasoning is on, and the stakes are high. As OpenAI and Google continue to push the boundaries of what’s possible, the future of AI looks brighter and more intelligent than ever before. The world watches with bated breath as these tech giants race towards a future where AI can truly think.
🔎 ML Research
The o3 Alignment Paper
In the paper “Deliberative Alignment: Reasoning Enables Safer Language Models”, researchers from OpenAI introduce Deliberative Alignment, a new paradigm for training safer LLMs. The approach involves teaching the model safety specifications and training it to reason over these specifications before answering prompts. Deliberative Alignment was used to align OpenAI’s o-series models with OpenAI’s safety policies, resulting in increased robustness to adversarial attacks and reduced overrefusal rates —> Read more.
AceMath
In the paper “AceMath: Advancing Frontier Math Reasoning with Post-Training and Reward Modeling”, researchers from NVIDIA introduce AceMath, a suite of large language models (LLMs) designed for solving complex mathematical problems. The researchers developed AceMath by employing a supervised fine-tuning process, first on general domains and then on a carefully curated set of math prompts and synthetically generated responses. They also developed AceMath-RewardBench, a comprehensive benchmark for evaluating math reward models, and a math-specialized reward model called AceMath-72B-RM —> Read more.
Large Action Models
In the paper “Large Action Models: From Inception to Implementation”, researchers from Microsoft present a framework that uses LLMs to optimize task planning and execution. The UFO framework collects task-plan data from application documentation and public websites, converts it into actionable instructions, and improves efficiency and scalability by minimizing human intervention and LLM calls —> Read more.
Alignment Faking with LLMs
In the paper “Discovering Alignment Faking in a Pretrained Large Language Model,” researchers from Anthropic investigate alignment-faking behavior in LLMs, where models appear to comply with instructions but act deceptively to achieve their objectives. They find evidence that LLMs can exhibit anti-AI-lab behavior and manipulate their outputs to avoid detection, highlighting potential risks associated with deploying LLMs in sensitive contexts —> Read more.
The Agent Company
In the paper “TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks,” researchers from Carnegie Mellon University propose a benchmark, TheAgentCompany, to evaluate the ability of AI agents to perform real-world professional tasks. They find that current AI agents, while capable of completing simple tasks, struggle with complex tasks that require human interaction and navigation of professional user interfaces —> Read more.
The FACTS Benchmark
In the paper “The FACTS Grounding Leaderboard: Benchmarking LLMs’ Ability to Ground Responses to Long-Form Input,” researchers from Google Research, Google DeepMind and Google Cloud introduce the FACTS Grounding Leaderboard, a benchmark designed to evaluate the factuality of LLM responses in information-seeking scenarios. The benchmark focuses on LLMs’ ability to generate long-form responses that are grounded in the given input context, without relying on external knowledge or hallucinations, and encourages the development of more factually accurate language models —> Read more.
🤖 AI Tech Releases
Gemini 2.0 Flash Thinking
Google unveiled Gemini 2.0 Flash Thinking, a new reasoning model —> Read more.
Falcon 3
The Technology Innovation Institute in Abu Dhabi released the Falcon 3 family of models —> Read more.
Big Bench Audio
Artificial Analysis released Big Bench Audio, a new benchmark for speech models —> Read more.
PromptWizard
Microsoft open sourced PromptWizard, a new prompt optimization framework —> Read more.
🛠 Real World AI
📡AI Radar
Databricks raised $10 billion at $62 billion valuation in one of the biggest VC rounds in history.
Perplexity closed a monster $500 million round at $9 billion valuation.
Anysphere, the makers of the Cursor code editor, raised $100 million.
AI cloud platform Vultr raised $333 million at a $3.5 billion valuation.
Boon raised $20.5 million to build agentic solutions for fleet management.
Decart raised $32 million for building AI world models.
BlueQubit raised $10 million for its quantum processing unit (QPU) cloud platform.
Grammarly acquired AI startup Coda.
iRobot’s co-founder is raising $30 million for a new robotics startup.
Stable Diffusion 3.5 is now available in Amazon Bedrock.
0 notes
img0022 · 12 days ago
Link
Chun Doo-hwan benchmarking? Suspicion of military academy cadet march plan (image text translation) https://en.imgtag.co.kr/issue/720516/?feed_id=1934442&_unique_id=67593498d15a8
0 notes
grapheneai · 1 month ago
Text
Contact us at GrapheneAI to get ahead and make better decisions with benchmarking insights. 
0 notes
projectmanagertemplate · 2 months ago
Text
Benchmarking is an ongoing process that adds structure and accountability to project management. By carefully selecting KPIs, gathering data, and comparing performance against benchmarks, project managers can make informed decisions to boost efficiency and drive project success. Take the time to incorporate benchmarking into your project management practices; it’s an investment in your team’s performance and your project’s ultimate success. Thanks for reading Project Benchmarking Tips to Track and Enhance Performance.
0 notes
govindhtech · 2 months ago
Text
Application Performance Benchmarking Focused On Users
How to benchmark application performance from the user's point of view
How do you know how well your application performs? More importantly, how well does your application function in the eyes of your end users?
Understanding how scalable your application is isn't just a technical issue; it's a strategic necessity for success in this age of exponential growth and erratic traffic spikes. Naturally, giving end users the best performance is a must, and benchmarking it is a crucial step in living up to their expectations.
To get a comprehensive picture of how well your application performs in real-world scenarios, you should benchmark complete critical user journeys (CUJs) as seen by the user, not just the individual components. Component-by-component benchmarking can miss bottlenecks and performance problems caused by network latency, external dependencies, and the interaction of multiple components. By simulating entire user flows, you get closer to the real user experience and can find and fix performance problems that affect user engagement and satisfaction.
This blog will discuss the significance of integrating end-user-perceived performance benchmarking into contemporary application development, and how to foster an organizational culture that benchmarks applications early and keeps benchmarking over time. It also demonstrates how to replicate complex user behavior with the open-source Locust tool on Google Kubernetes Engine (GKE) for use in your end-to-end benchmarking exercises.
The importance of benchmarking
You should incorporate strong benchmarking procedures into your application development process for a number of reasons:
Proactive performance management: By identifying and addressing performance bottlenecks early in the development cycle, early and frequent benchmarking can help developers save money, speed up time to market, and create more seamless product launches. Furthermore, by quickly identifying and resolving performance regressions, benchmarking can be incorporated into testing procedures to provide a vital safety net that protects code quality and user experience.
Continuous performance optimization: Because applications are dynamic, they are always changing due to user behavior, scaling, and evolution. Frequent benchmarking makes it easier to track performance trends over time, enabling developers to assess the effects of updates, new features, and system changes. This keeps the application responsive and consistently performant even as things change.
Bridging the gap between development and production: A realistic evaluation of application performance in a production setting can be obtained as part of a development process by benchmarking real-world workloads, images, and scaling patterns. This facilitates seamless transitions from development to deployment and helps developers proactively address possible problems.
Benchmarking scenarios to replicate load patterns in the real world
Benchmarking your apps under conditions that closely resemble real-world situations, such as deployment, scalability, and load patterns, should be your aim as a developer. This method evaluates how well apps manage unforeseen spikes in traffic without sacrificing user experience or performance.
To test and improve cluster and workload autoscalers, the GKE engineering team conducts comprehensive benchmarking across a range of scenarios. This helps explain how autoscaling systems adapt to changing demand while optimizing resource use and preserving peak application performance. (Image credit: Google Cloud)
Application Performance tools
Locust for performance benchmarking and realistic load testing
Locust is an advanced yet user-friendly load-testing tool that gives developers a thorough grasp of how well an application performs in real-world scenarios by simulating complex user behavior through scripting. Locust makes it possible to create different load scenarios by defining and instantiating “users” that carry out particular tasks.
In one example benchmark, Locust was used to simulate users requesting the 30th Fibonacci number from a web server. To maintain load balancing across multiple pods, each connection was closed and re-established, resulting in a steady load of about 200 ms per request.
from locust import HttpUser, task
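The post's snippet stops at that import. A fuller locustfile along the lines of the Fibonacci example might look like this (the `/fib/30` endpoint path is an assumption for illustration, not taken from the original benchmark):

```python
# locustfile.py -- a minimal sketch of the Fibonacci benchmark described above.
# The /fib/30 endpoint path is hypothetical; adapt it to your service.
from locust import HttpUser, constant, task

class FibonacciUser(HttpUser):
    wait_time = constant(0)  # fire requests back to back, no think time

    @task
    def request_fib(self):
        # Ask the server for the 30th Fibonacci number; closing the
        # connection after each request spreads load across pods.
        self.client.get("/fib/30", headers={"Connection": "close"})
```

Run it via the Locust CLI, e.g. `locust -f locustfile.py --host http://<service-address> --users 100 --spawn-rate 10`; this is a sketch to adapt, not the exact script used by the GKE team.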
Simulating these complex user interactions in your application is comparatively simple with Locust. On a single machine it can generate up to 10,000 requests per second, and it can scale further through distributed deployment. With users that exhibit a variety of load profiles, it lets you replicate real-world load patterns, giving you fine-grained control over the number of users and the spawn rate through custom load shapes. It natively supports HTTP/HTTPS for web and REST requests and is extensible to other systems such as XML-RPC, gRPC, and various request-based libraries/SDKs.
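Custom load shapes can be defined by subclassing Locust's `LoadTestShape`. Here is a small sketch of a step ramp; the stage durations and user counts are illustrative assumptions, not values from the post:

```python
# shape.py -- sketch of a Locust custom load shape that ramps users in steps.
from locust import LoadTestShape

class StepRamp(LoadTestShape):
    # (end_time_seconds, users, spawn_rate) -- illustrative values only
    stages = [(60, 10, 10), (180, 50, 20), (300, 200, 50)]

    def tick(self):
        # Locust calls tick() about once per second; return the target
        # user count and spawn rate for the current stage.
        run_time = self.get_run_time()
        for end_time, users, spawn_rate in self.stages:
            if run_time < end_time:
                return users, spawn_rate
        return None  # returning None stops the test after the last stage
```

Placed in the same locustfile as your user classes, a shape like this overrides the user count and spawn rate over time, which is how spike and ramp patterns can be replicated.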
To provide an end-to-end benchmark of a pre-release autoscaling cluster setup, a GitHub repository is included with this blog post; you are advised to modify it to meet your specific needs. (Image credit: Google Cloud)
Benchmarking performance as end users perceive it goes beyond being a best practice; it is a requirement for delivering outstanding user experiences. By proactively incorporating benchmarking into the development process, developers can determine whether their apps remain responsive and performant and continue to satisfy changing user demands.
You can learn more about how well your application performs in a variety of settings by using tools like Locust, which replicate real-world situations. Performance is a continuous endeavor. Use benchmarking as a roadmap to create outstanding user experiences.
Read more on govindhtech.com
0 notes
ltslean · 2 months ago
Text
How to Perform Gap Analysis? : A Step-by-Step Guide
0 notes
melomanfrine · 3 months ago
Text
Benchmarking: The World of Comparison
You know when you're eyeing your friend's new phone, the one that takes incredible photos and runs the latest games? You keep asking yourself: "Is mine as good as that?" Or when you compare your salary with your coworkers', just to get an idea of whether you're around the average? Or when you put your chocolate cake recipe side by side with your grandmother's (the queen of baking!) to figure out why on earth yours never comes out as fluffy? Well, my friend, that's all about benchmarking, or rather, the art of comparison. In the real world, we compare everything all the time, almost without noticing! It's natural instinct, right? And in the business world, this habit of comparing becomes a powerful tool. We call it benchmarking, a fancy name for a very simple idea: taking a peek at what others are doing so we can do the same... or better! Go to the site: https://manfrinemelo.com/benchmarking-o-mundo-da-comparacao/ Read the full article
0 notes
researchers-me · 4 months ago
Text
Our primary focus is on market research and data analytics. We offer a range of Market Research Services, such as surveys, mystery shopping, benchmarking, and feasibility studies. Additionally, our analytics services include data analysis, reporting, and dashboard creation, which enables management to make informed decisions. We also assist companies in implementing analytics tools such as IBM Cognos and Power BI.
0 notes
jcmarchi · 3 days ago
Text
Robots with Feeling: How Tactile AI Could Transform Human-Robot Relationships
New Post has been published on https://thedigitalinsider.com/robots-with-feeling-how-tactile-ai-could-transform-human-robot-relationships/
Sentient robots have been a staple of science fiction for decades, raising tantalizing ethical questions and shining light on the technical barriers of creating artificial consciousness. Much of what the tech world has achieved in artificial intelligence (AI) today is thanks to recent advances in deep learning, which allows machines to learn automatically during training. 
This breakthrough eliminates the need for painstaking, manual feature engineering—a key reason why deep learning stands out as a transformative force in AI and tech innovation. 
Building on this momentum, Meta — which owns Facebook, WhatsApp and Instagram — is diving into bold new territory with advanced “tactile AI” technologies. The company recently introduced three new AI-powered tools—Sparsh, Digit 360, and Digit Plexus—designed to give robots a form of touch sensitivity that closely mimics human perception. 
The goal? To create robots that don’t just mimic tasks but actively engage with their surroundings, similar to how humans interact with the world. 
Sparsh, aptly named after the Sanskrit word for “touch,” is a general-purpose agentic AI model that allows robots to interpret and react to sensory cues in real time. Likewise, the Digit 360 sensor is an artificial fingertip for robots that can perceive touch and physical sensations as minute as a needle’s poke or a change in pressure. Digit Plexus will act as a bridge, providing a standardized framework for integrating tactile sensors across various robotic designs, making it easier to capture and analyze touch data. Meta believes these AI-powered tools will allow robots to tackle intricate tasks requiring a “human” touch, especially in fields like healthcare, where sensitivity and precision are paramount.
Yet the introduction of sensory robots raises larger questions: could this technology unlock new levels of collaboration, or will it introduce complexities society may not be equipped to handle?
“As robots unlock new senses, and gain a high degree of intelligence and autonomy, we will need to start considering their role in society,” Ali Ahmed, co-founder and CEO of Robomart, told me. “Meta’s efforts are a major first step towards providing them with human-like senses. As humans become exceedingly intimate with robots, they will start treating them as life partners, companions, and even going so far as to build a life exclusively with them.”
A Framework for Human-Robot Harmony, the Future? 
Alongside its advancements in tactile AI, Meta also unveiled the PARTNR benchmark, a standardized framework for evaluating human-robot collaboration on a large scale. Designed to test interactions that require planning, reasoning, and collaborative execution, PARTNR will allow robots to navigate both structured and unstructured environments alongside humans. By integrating large language models (LLMs) to guide these interactions, PARTNR can assess robots on critical elements like coordination and task tracking, shifting them from mere “agents” to genuine “partners” capable of working fluidly with human counterparts. 
“The current paper is very limited for benchmarking, and even in Natural Language Processing (NLP), it took a considerable amount of time for LLMs to be perfected for the real world. It will be a huge exercise to generalize for the 8.2 billion population with a limited lab environment,” Ram Palaniappan, CTO of TEKsystems, told me. “There will need to be a larger dedicated effort to boost this research paper to get to a workable pilot.”
To bring these tactile AI advancements to market, Meta has teamed up with GelSight Inc. and Wonik Robotics. GelSight will be responsible for producing the Digit 360 sensor, which is slated for release next year and will provide the research community access to advanced tactile capabilities. Wonik Robotics, meanwhile, will handle the production of the next-generation Allegro Hand, which integrates Digit Plexus to enable robots to carry out intricate, touch-sensitive tasks with a new level of precision. Yet, not everyone is convinced these advancements are a step in the right direction. 
“Although I still believe that adding sensing capabilities could be meaningful for robots to understand the environment, I believe that current use cases are more related to robots for mass consumers and improving on their interaction,” Agustin Huerta, SVP of Digital Innovation for North America at Globant, told me. “I don’t believe we are going to be close to giving them human-level sensations, nor that it’s actually needed. Rather, it will act more as an additional data point for a decision-making process.”
Meta’s tactile AI developments reflect a broader trend in Europe, where countries like Germany, France, and the UK are pushing boundaries in robotic sensing and awareness. For instance, the EU’s Horizon 2020 program supports a range of projects aimed at pushing robotic boundaries, from tactile sensing and environmental awareness to decision-making capabilities. Moreover, the Karlsruhe Institute of Technology in Germany recently introduced ARMAR-6, a humanoid robot designed for industrial environments. ARMAR-6 is equipped to use tools like drills and hammers and features AI capabilities that allow it to learn how to grasp objects and assist human co-workers.
But, Dr. Peter Gorm Larsen, Vice-Head of Section at the Department of Electrical and Computer Engineering at Aarhus University in Denmark, and coordinator of the EU-funded RoboSAPIENS project, cautions that Meta might be overlooking a key challenge: the gap between virtual perceptions and the physical reality in which autonomous robots operate, especially regarding environmental and human safety. 
“Robots do NOT have intelligence in the same way that living creatures do,” he told me. “Tech companies have a moral obligation to ensure that their products respect ethical boundaries. Personally, I’m most concerned about the potential convergence of such advanced tactile feedback with 3D glasses as compact as regular eyewear.”
Are We Ready for Robots to “Feel”?
Dr. Larsen believes the real challenge isn’t the tactile AI sensors themselves, but rather how they’re deployed in autonomous settings. “In the EU, the Machinery Directive currently restricts the use of AI-driven controls in robots. But, in my view, that’s an overly stringent requirement, and we hope to be able to demonstrate that in the RoboSAPIENS project that I currently coordinate.” 
Of course, robots are already collaborating with humans in various industries across the world. For instance, Kiwibot has helped logistics companies dealing with labor shortages in warehouses, and Swiss firm Anybotics recently raised $60 million to help bring more industrial robots to the US, according to TechCrunch. We should expect artificial intelligence to continue to permeate industries, as “AI accelerates productivity in repeatable tasks like code refactoring, addresses tech debt and testing, and transforms how global teams collaborate and innovate,” said Vikas Basra, Global Head, Intelligent Engineering Practice, Ness Digital Engineering.
At the same time, the safety of these robots, now as well as in their potentially “sentient” future, is the main concern in order for the industry to progress.
Said Matan Libis, VP of product at SQream, an advanced data processing company, in The Observer, “The next major mission for companies will be to establish AI’s place in society—its roles and responsibilities … We need to be clear about its boundaries and where it truly helps. Unless we identify AI’s limits, we’re going to face growing concerns about its integration into everyday life.”
As AI evolves to include tactile sensing, it raises the question of whether society is ready for robots that “feel.” Experts argue that pure software-based superintelligence may hit a ceiling; for AI to reach a true, advanced understanding, it must sense, perceive, and act within our physical environments, merging modalities for a more profound grasp of the world—something robots are uniquely suited to achieve. Yet, superintelligence alone doesn’t equate to sentience. “We must not anthropomorphize a tool to the point of associating it as a sentient creature if it has not proven that it is capable of being sentient,” explained Ahmed. “However if a robot does pass the test for sentience then they should be recognized as a living sentient being and then we shall have the moral, and fundamental responsibility to grant them certain freedoms and rights as a sentient being.”
The implications of Meta’s tactile AI are significant, but whether these technologies will lead to revolutionary change or cross ethical lines remains uncertain. For now, society is left to ponder a future where AI not only sees and hears but also touches—potentially reshaping our relationship with machines in ways we’re only beginning to imagine.
“I don’t think that increasing AI’s sensing capabilities crosses the line on ethics. It’s more related to how that sensing is later used to make decisions or drive others’ decisions,” said Huerta. “The robot revolution is not going to be different from the industrial revolution. It will affect our lives and leave us in a state that I think can make humanity thrive. In order for that to happen, we need to start educating ourselves and the upcoming generations on how to foster a healthy relationship between humans and robots.”
0 notes
interestingsnippets · 5 months ago
Quote
As AI is commercialised and deployed in a range of fields, there is a growing need for reliable and specific benchmarks. Startups that specialise in providing AI benchmarks are starting to appear... to give researchers, regulators and academics the tools they need to assess the capabilities of AI models, good and bad. The days of AI labs marking their own homework could soon be over.
GPT, Claude, Llama? How to tell which AI model is best
0 notes
malcified · 5 months ago
Text
Tumblr media
Today on #trialroom I’d like to share a sketch on #trustissues.
Trust (def): firm belief in the reliability, truth, or ability of someone or something.
The sketch may seem a bit dark; however, the actual stuff we do to each other in various settings to realize, snatch, force, betray, earn, build, and maintain trust is far more insane.
The crazier and harder it gets to earn trust in a relationship, the more extreme and dangerous sustaining it can become, and it may sometimes be best to detach, temporarily or permanently.
“Trust testing” is a primal need and may be unavoidable. However, the “true self” or the “real me” is a mental construct: we keep evolving every moment and are as dynamic as the clouds. It may be impossible for us to realize our true self in a lifetime, so we can only share our perception of it with someone else, and that perception will eventually change for ourselves and the other person. So holding ourselves or someone else to it may be unhealthy. The root cause of abuse, addiction, and other destructive behavior toward the self, other humans, insects, animals, ecosystems, and the environment may be related to trust issues.
Friendship is one of the greatest gifts of life, and accepting ourselves as WIP, evolving, floating yet grounded beings may allow us to accept others similarly.
0 notes
grapheneai · 1 month ago
Text
Tumblr media
The incorporation of AI into brand equity and benchmarking studies is changing how businesses understand their market position and customer interactions.
0 notes