#contrastive learning
jcmarchi · 1 month
Text
The AI Scientist
New Post has been published on https://thedigitalinsider.com/the-ai-scientist/
The AI Scientist
A model that can produce novel AI papers, plus some really cool papers and tech releases this week.
Next Week in The Sequence:
Edge 423: We explore the fundamentals of state space models including the famous S4 paper. The tech section provides an overview of NVIDIA’s NIM framework.
Edge 424: We dive into DeepMind’s amazing AlphaProof and AlphaGeometry 2, which achieved a silver medal at the latest International Mathematical Olympiad.
You can subscribe to The Sequence below:
TheSequence is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.
📝 Editorial: The AI Scientist
If you read this newsletter, you know that I firmly believe discovering new science might be the ultimate test for AGI. While we are still far from having AI that can formulate something like the Riemann Hypothesis or the Theory of General Relativity, we have made tremendous progress in proving and validating scientific ideas across disciplines such as mathematics, physics, biology, chemistry, and others.
The reason science presents such a challenging bar for AI is that it involves aspects like long-term planning, creativity, multidisciplinary knowledge, multi-step fact-checking, and many other components that are still in the very early stages of development in generative AI.
However, progress is being made.
This week, the Japanese AI startup Sakana AI, in collaboration with several other AI labs, published a paper detailing The AI Scientist, a framework for open-ended scientific discovery. The AI Scientist is capable of conducting open-ended research, executing experiments, generating code, visualizing results, and even presenting them in full reports. In the initial demonstrations, The AI Scientist made several contributions across different areas of AI research, including diffusion models, transformers, and grokking.
The core ideas behind The AI Scientist resemble models such as DeepMind’s AlphaGeometry, AlphaProof, or the NuminaMath model that recently won first prize in the AI Math Olympiad. These models use an LLM for idea formulation, combined with more symbolic models for experimentation. The biggest challenge with this approach is whether the idea-generation portion will quickly hit its limits. Some of the most groundbreaking scientific discoveries in history seem to involve a component of human ingenuity that doesn’t yet appear to be present in LLMs. However, this path holds great potential for exploring new ideas in scientific research.
For now, The AI Scientist represents an exciting advancement in open-ended scientific research.
🔎 ML Research
The AI Scientist
Researchers from Sakana AI, Oxford, the University of British Columbia, and several other institutions published a paper unveiling The AI Scientist, a pipeline for open-ended scientific research using LLMs. The AI Scientist injects AI into different areas of scientific research, such as ideation, literature search, experiment planning, experiment iteration, manuscript writing, and peer review —> Read more.
Imagen 3
Google published the technical report for Imagen 3, its marquee text-to-image model. The paper covers the training and evaluation details behind Imagen 3, as well as some of the challenges around safety —> Read more.
Mitigating Hallucinations
Google Research published a paper detailing HALVA, a contrastive tuning method that can mitigate hallucinations in language and image assistants. Like other contrastive learning methods, HALVA generates alternative representations of factual tokens with the objective of boosting the probability of the model identifying the correct token —> Read more.
Your Context is Not an Array
Qualcomm Research published a paper that explores the limitations of transformers. The paper suggests that some of the generalization challenges of transformers stem from their inability to perform random memory access within their context window —> Read more.
Mutual Reasoning in LLMs
Microsoft Research published a paper introducing rStar, a self-play mutual reasoning approach that seems to improve reasoning capabilities in small language models. rStar uses a generation-discrimination process to decouple the different steps of the reasoning process —> Read more.
Pretraining vs. Fine Tuning
Researchers from Johns Hopkins University published a paper exploring the relationship between pretraining and fine-tuning in LLMs. The paper examines the diminishing returns of fine-tuning after a certain scale —> Read more.
🤖 AI Tech Releases
Grok-2
xAI unveiled a new version of Grok that matches the performance of top open source models —> Read more.
SWE-Bench
OpenAI released a subset of the famous SWE-Bench benchmark with human verification —> Read more.
Claude Prompt Caching
Anthropic unveiled prompt caching capabilities for Claude 3.5 Sonnet and Claude 3 Haiku —> Read more.
Airflow 2.10
Apache Airflow 2.10 arrived with a strong focus on AI workflows —> Read more.
AI Risks Database
MIT open sourced a database of over 700 AI risks across different categories —> Read more.
🛠 Real World AI
Image Animation at Meta
Meta discusses the AI techniques used for image animation at scale —> Read more.
Model Reliability at Salesforce
Salesforce discusses the methods used to ensure AI model reliability and performance in their internal pipelines —> Read more.
📡AI Radar
Fei-Fei Li’s World Labs raised $100 million at a $1 billion valuation.
Decentralized AI startup Sahara AI raised $43 million in new funding.
Snowflake announced its Cortex Analyst solution to power self-service analytics with AI.
AI observability platform Goodfire raised $7 million in new funding.
AI-focused VC Radical Ventures raised a new $800 million fund.
Runway Gen-3 Turbo showcased very impressive capabilities.
AI-based stock evaluator TipRanks was acquired for $200 million.
Real estate AI company EliseAI raised $75 million at a $1 billion valuation.
Encord, an AI data development platform, raised a $30 million Series B.
RAG-as-a-service platform Ragie raised $5.5 million.
CodeRabbit raised $16 million for using AI to automate code reviews.
AI-based scientific research platform Consensus raised an $11.5 million Series A.
0 notes
jamalir · 8 months
Text
AI discovers that not every fingerprint is unique
0 notes
hattersarts · 1 year
Text
drew some book!husbands. they feel like they've taken more traits from each other than the show.
20K notes · View notes
cozylittleartblog · 5 days
Text
50+ deaths at 5 am got me yelling absolute nonsense to the bosses kicking my whole entire ass
667 notes · View notes
little-pondhead · 5 months
Text
Your Ancient History, Written In Wax
-
Danny knew he should have put better security around the Sarcophagus of Eternal Sleep. It wasn’t even Vlad who opened it this time! The fruitloop was too busy doing his actual mayor duties because for some godforsaken reason, the man got re-elected.
No, it wasn’t Vlad. And it wasn’t Fright Knight, either. Nor the Observants. Who opened the Sarcophagus, then? Danny didn’t have time to find out as Pariah Dark promptly tore open a hole in reality and started hunting Danny down.
The battle was longer this time. He didn’t have the Ecto-Skeleton, as that was the first thing Pariah had destroyed. The halfa had grown a lot over the past few years, and learned some new tricks, but apparently sleeping in a magic ghost box meant that Pariah had absorbed a lot of power. The bigger ghost acted like a one-man army!
Amity Park was caught in the middle of the battle, but the residents made sure it went no further than that. Vlad and the Fentons made a barrier around the town to keep the destruction from leaking. Sam, Tucker, and Dani did crowd control while Danny faced the king head-on.
Their battle shook the Zone and pulled them wildly between the mortal plane and the afterlife. Sometimes, residents noticed a blow from Pariah transported them to the age of the dinosaurs, and Phantom’s Wail brought them to an unknown future. Then they were in a desert. Then a blazing forest. Then underwater. It went on like that, but no one dared step foot outside of Amity. They couldn’t risk being left behind.
It took ages to beat him, but eventually, Danny stood above the old ghost king, encasing his symbols of power in ice so they couldn’t be used again. He refused to claim the title for himself. Tired as he was, Danny handed the objects off to Clockwork for safe keeping and started repairing the damage Pariah had done to the town. The tear he’d made was too big to fix, for now, so no one bothered. They just welcomed their new ghostly neighbors with open arms and worked together to restore Amity Park.
Finally, the day came to bring down the barrier. People were gathered around the giant device the Fentons had built to sustain it. Danny had brought Clockwork to Amity, to double check that they had returned to the right time and dimension.
Clockwork assured everyone that they were in the right spot, and only a small amount of time had passed, so the Fentons gave the signal to drop the shield.
Very quickly did they discover that something was wrong. The air smelled different. The noise of the nearby city, Elmerton, was louder and more chaotic. Something was there that wasn’t before, and it put everyone on edge.
Clockwork smiled, made a remark about the town fitting in better than before, and disappeared before Danny could catch him.
Frantic, Danny had a few of his ghost buds stay behind to protect the town while he investigated.
He flew far and wide, steadily growing horrified at the changes the world had undergone. Heroes, villains, rampant crime and alien invasions. The Earth was unrecognizable. There were people moving around the stars like it was second nature and others raising dead gods like the apocalypse was coming. Magic and ectoplasm were everywhere, rather than following the ley lines like they were supposed to.
Danny returned to Amity.
The fight with Pariah had taken them through space and time. Somewhere along the way, they had changed the course of history so badly that this now felt like an alien world.
How was he supposed to fix this?
-
In the Watchtower, The Flash was wrapping up monitor duty while Impulse buzzed around him, a little more jittery than usual. The boy was talking a mile a minute, when alarms started blaring an alarming green. Flash had never seen this alarm before, and its crackling whine was grating on his ears.
Flash returned to the monitor, frantically clicking around to find the issue, but nothing was popping up. No major disasters, no invasions, no declarations of war. Nothing! What was causing the alarm?
Impulse swore and zipped to a window, pressing his face against it and staring down at Earth. “Fuck! It’s today isn’t it? I forgot!”
“What’s today?” Flash asked. He shot off a text to Batman, asking if it was an error. The big Bat said it wasn’t, and that he would be there soon.
“The arrival of Amity Park. I learned about this in school; the alarm always gives me headaches.”
Flash turned to his grandson, getting his attention. “Bart,” he stressed. “What are you talking about?”
Impulse barely glanced over his shoulder. Now that Flash was facing him, he could see a strong glow coming from Earth. “The first villain, first anti-villain, and the first hero,” he said anxiously. “They all protect the town of the original metas. They’re all here.”
“Here? Now??”
“Yeah? They weren’t before, but they are now. The first hero said there was time stuff involved, which was what inspired me to start practicing time travel in the first place.”
“I’m not following.”
“It’s okay. We should probably go welcome them before they tear apart Illinois, though. The history I remember says that some of them freaked and destroyed a chunk of the Midwest during a fight with each other.”
“WHAT?”
#dpxdc#pondhead blurbs#liminal amity park#I’ve seen stuff like this in the mhaxdp fandom and I eat it up every time#basically the fight with Pariah caused the town to jump through time a little#and while they THOUGHT they were keeping everything in#shit leaked out and tainted those points in time#so technically#historically and genetically speaking#Amity Park is the origin point for the meta gene and Danny made history as the first hero#because Clockwork is a little shit#everyone embodies a basic ability and it has grown from there#the flash family are direct descendants of Dani (speed force Dani for the win)#Dash is the reason super strength exists#so on and so forth#go buck wild#bart learned about it briefly in history class in the 30th century#practically hero worships them#booster gold knows about them too but in contrast to Bart’s excitement#booster is fucking terrified because there was a period where Amity Park rebelled against the US government#and he’s from that specific time#he learned to fear phantom because he lived during that part while Bart is from farther in the future when those issues got resolved#guess who’s chosen to welcome the town? >:)#if you’re wondering what happened to the GIW#they turned into the branch Amanda Waller runs#Danny is the first hero#Vlad the first villain#and Dani the first anti hero#there’s an arc where Danny is trying to fix things but clockwork won’t let him into the timestream and all the heroes are horrified#because yeah Danny is the OG but if he goes back in time to fix his ‘mistake’ what will happen to them?
660 notes · View notes
omaano · 2 years
Text
Had a bit of a crisis over trying to draw Boba, so I did the adult thing and did some face studies. Learned a lot, hopefully some of it sticks
6K notes · View notes
monobmp · 1 year
Text
Shadow
2K notes · View notes
obstinateson · 18 days
Text
visit
195 notes · View notes
skunkes · 5 months
Text
306 notes · View notes
binglepringle · 2 months
Text
Idek what this is but here you go 🤲
176 notes · View notes
carmen-berzattos · 9 months
Text
There's something at the tip of my tongue about the parallels between Jackie and Wilson and Who We Are
How the narrator of Jackie and Wilson wants to run away with a woman that he's carved out of his imagination based on a brief interaction. How they would try the world, but good god it wasn't for them. So they run away from it into a fantasy world where they live by their own rules.
And then comes the narrator of Who We Are, who dreamt his whole life of finding someone who would hold him like water or like a knife, only to find that running away from the world will only get them so far, since "the hardest part is who we are". And only to find out that the "phantom life" he's fantasized about is actually just that: a phantom. And its absence sharpens like a knife
300 notes · View notes
jcmarchi · 3 months
Text
Code Embedding: A Comprehensive Guide
New Post has been published on https://thedigitalinsider.com/code-embedding-a-comprehensive-guide/
Code Embedding: A Comprehensive Guide
Code embeddings are a transformative way to represent code snippets as dense vectors in a continuous space. These embeddings capture the semantic and functional relationships between code snippets, enabling powerful applications in AI-assisted programming. Similar to word embeddings in natural language processing (NLP), code embeddings position similar code snippets close together in the vector space, allowing machines to understand and manipulate code more effectively.
What are Code Embeddings?
Code embeddings convert complex code structures into numerical vectors that capture the meaning and functionality of the code. Unlike traditional methods that treat code as sequences of characters, embeddings capture the semantic relationships between parts of the code. This is crucial for various AI-driven software engineering tasks, such as code search, completion, bug detection, and more.
For example, consider these two Python functions:
def add_numbers(a, b):
    return a + b

def sum_two_values(x, y):
    result = x + y
    return result
While these functions look different syntactically, they perform the same operation. A good code embedding would represent these two functions with similar vectors, capturing their functional similarity despite their textual differences.
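To make this concrete, here is a minimal sketch of how the two functions above could be embedded and compared with a pretrained code encoder. The model choice (CodeBERT via Hugging Face Transformers) and the mean-pooling step are illustrative assumptions rather than a prescribed recipe; any code encoder could be substituted.

# Sketch: embed the two snippets with a pretrained code encoder and compare them.
# The model (microsoft/codebert-base) and mean pooling are assumptions for illustration.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModel.from_pretrained("microsoft/codebert-base")

def embed(code: str) -> torch.Tensor:
    inputs = tokenizer(code, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state   # shape: (1, seq_len, dim)
    return hidden.mean(dim=1).squeeze(0)             # mean-pool into a single vector

a = embed("def add_numbers(a, b):\n    return a + b")
b = embed("def sum_two_values(x, y):\n    result = x + y\n    return result")
print(torch.nn.functional.cosine_similarity(a, b, dim=0).item())  # expected to be high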
Vector Embedding
How are Code Embeddings Created?
There are different techniques for creating code embeddings. One common approach involves using neural networks to learn these representations from a large dataset of code. The network analyzes the code structure, including tokens (keywords, identifiers), syntax (how the code is structured), and potentially comments to learn the relationships between different code snippets.
Let’s break down the process:
Code as a Sequence: First, code snippets are treated as sequences of tokens (variables, keywords, operators).
Neural Network Training: A neural network processes these sequences and learns to map them to fixed-size vector representations. The network considers factors like syntax, semantics, and relationships between code elements.
Capturing Similarities: The training aims to position similar code snippets (with similar functionality) close together in the vector space. This allows for tasks like finding similar code or comparing functionality.
Here’s a simplified Python example of how you might preprocess code for embedding:
import ast

def tokenize_code(code_string):
    tree = ast.parse(code_string)
    tokens = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Name):
            tokens.append(node.id)
        elif isinstance(node, ast.Str):
            tokens.append('STRING')
        elif isinstance(node, ast.Num):
            tokens.append('NUMBER')
        # Add more node types as needed
    return tokens

# Example usage
code = """
def greet(name):
    print("Hello, " + name + "!")
"""
tokens = tokenize_code(code)
print(tokens)
# Output: ['print', 'STRING', 'STRING', 'name']
# (keywords and the function name are not ast.Name nodes, so they are not collected;
# ast.Str/ast.Num are deprecated since Python 3.8 -- prefer ast.Constant on newer versions)
This tokenized representation can then be fed into a neural network for embedding.
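As a toy illustration of that last step, the token list can be mapped to integer ids and passed through a learned embedding layer, then pooled into a single vector per snippet. The vocabulary, embedding size, and mean pooling below are assumptions chosen only to keep the sketch short.

# Toy sketch: turn the token list into ids, look up learned embeddings,
# and pool them into one vector per snippet. Sizes and pooling are assumptions.
import torch
import torch.nn as nn

tokens = ['print', 'STRING', 'STRING', 'name']        # output of tokenize_code above
vocab = {tok: i for i, tok in enumerate(sorted(set(tokens)))}

embedding = nn.Embedding(num_embeddings=len(vocab), embedding_dim=8)
ids = torch.tensor([vocab[t] for t in tokens])        # shape: (seq_len,)
snippet_vector = embedding(ids).mean(dim=0)           # shape: (8,) -- one vector per snippet
print(snippet_vector.shape)                           # torch.Size([8])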
Existing Approaches to Code Embedding
Existing methods for code embedding can be classified into three main categories:
Token-Based Methods
Token-based methods treat code as a sequence of lexical tokens. Techniques like Term Frequency-Inverse Document Frequency (TF-IDF) and deep learning models like CodeBERT fall into this category.
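For a sense of what the simplest token-based methods look like, here is a small TF-IDF sketch using scikit-learn. The snippets and the identifier-only token pattern are assumptions for illustration; the point is that a purely lexical method scores overlap in keywords, not overlap in meaning.

# Sketch of a token-based baseline: TF-IDF over raw code text plus cosine similarity.
# The snippets and the identifier-only token pattern are illustrative assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

snippets = [
    "def add_numbers(a, b): return a + b",
    "def sum_two_values(x, y): result = x + y; return result",
    "def read_file(path): return open(path).read()",
]
vectorizer = TfidfVectorizer(token_pattern=r"[A-Za-z_]+")
tfidf = vectorizer.fit_transform(snippets)            # sparse matrix of shape (3, vocab)

# Purely lexical similarity: scores are driven by shared tokens ("def", "return"),
# not by the fact that the first two functions do the same thing.
print(cosine_similarity(tfidf[0], tfidf[1]))
print(cosine_similarity(tfidf[0], tfidf[2]))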
Tree-Based Methods
Tree-based methods parse code into abstract syntax trees (ASTs) or other tree structures, capturing the syntactic and semantic rules of the code. Examples include tree-based neural networks and models like code2vec and ASTNN.
Graph-Based Methods
Graph-based methods construct graphs from code, such as control flow graphs (CFGs) and data flow graphs (DFGs), to represent the dynamic behavior and dependencies of the code. GraphCodeBERT is a notable example.
TransformCode: A Framework for Code Embedding
TransformCode: Unsupervised learning of code embedding
TransformCode is a framework that addresses the limitations of existing methods by learning code embeddings in a contrastive learning manner. It is encoder-agnostic and language-agnostic, meaning it can leverage any encoder model and handle any programming language.
The diagram above illustrates the framework of TransformCode for unsupervised learning of code embedding using contrastive learning. It consists of two main phases: Before Training and Contrastive Learning for Training. Here’s a detailed explanation of each component:
Before Training
1. Data Preprocessing:
Dataset: The initial input is a dataset containing code snippets.
Normalized Code: The code snippets undergo normalization to remove comments and rename variables to a standard format. This helps in reducing the influence of variable naming on the learning process and improves the generalizability of the model; a brief sketch of this renaming step appears right after the Tokenization items below.
Code Transformation: The normalized code is then transformed using various syntactic and semantic transformations to generate positive samples. These transformations ensure that the semantic meaning of the code remains unchanged, providing diverse and robust samples for contrastive learning.
2. Tokenization:
Train Tokenizer: A tokenizer is trained on the code dataset so that code text can be converted into model inputs. This involves breaking the code down into smaller units, such as tokens, that can be processed by the model.
Embedding Dataset: The trained tokenizer is used to convert the entire code dataset into embeddings, which serve as the input for the contrastive learning phase.
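The normalization step mentioned above can be sketched in a few lines. The canonical renaming scheme (var_0, var_1, ...) is an assumption for illustration and is not TransformCode’s exact procedure.

# Rough sketch of code normalization: rename arguments and local variables to a
# canonical form so that naming differences do not influence the embedding.
# The var_0/var_1 scheme is an assumption, not TransformCode's exact transformation.
import ast

class VariableNormalizer(ast.NodeTransformer):
    def __init__(self):
        self.mapping = {}

    def _canonical(self, name):
        if name not in self.mapping:
            self.mapping[name] = f"var_{len(self.mapping)}"
        return self.mapping[name]

    def visit_arg(self, node):
        node.arg = self._canonical(node.arg)
        return node

    def visit_Name(self, node):
        node.id = self._canonical(node.id)
        return node

tree = ast.parse("def sum_two_values(x, y):\n    result = x + y\n    return result")
print(ast.unparse(VariableNormalizer().visit(tree)))   # requires Python 3.9+
# def sum_two_values(var_0, var_1):
#     var_2 = var_0 + var_1
#     return var_2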
Contrastive Learning for Training
3. Training Process:
Train Sample: A sample from the training dataset is selected as the query code representation.
Positive Sample: The corresponding positive sample is the transformed version of the query code, obtained during the data preprocessing phase.
Negative Samples in Batch: Negative samples are all other code samples in the current mini-batch that are different from the positive sample.
4. Encoder and Momentum Encoder:
Transformer Encoder with Relative Position and MLP Projection Head: Both the query and positive samples are fed into a Transformer encoder. The encoder incorporates relative position encoding to capture the syntactic structure and relationships between tokens in the code. An MLP (Multi-Layer Perceptron) projection head is used to map the encoded representations to a lower-dimensional space where the contrastive learning objective is applied.
Momentum Encoder: A momentum encoder is also used, which is updated by a moving average of the query encoder’s parameters. This helps maintain the consistency and diversity of the representations, preventing the collapse of the contrastive loss. The negative samples are encoded using this momentum encoder and enqueued for the contrastive learning process.
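A minimal sketch of that moving-average update, in the MoCo style: the momentum (key) encoder’s weights trail the query encoder’s as an exponential moving average. The toy encoder architecture and the momentum value of 0.999 are assumptions for illustration.

# Sketch of a momentum (EMA) encoder update. The toy architecture and the
# momentum coefficient m=0.999 are assumptions for illustration.
import copy
import torch
import torch.nn as nn

query_encoder = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 32))
momentum_encoder = copy.deepcopy(query_encoder)
for p in momentum_encoder.parameters():
    p.requires_grad = False          # updated by EMA below, never by gradients

@torch.no_grad()
def momentum_update(q_enc, k_enc, m=0.999):
    for q_param, k_param in zip(q_enc.parameters(), k_enc.parameters()):
        k_param.data.mul_(m).add_(q_param.data, alpha=1.0 - m)

momentum_update(query_encoder, momentum_encoder)   # call after each training step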
5. Contrastive Learning Objective:
Compute InfoNCE Loss (Similarity): The InfoNCE (Noise Contrastive Estimation) loss is computed to maximize the similarity between the query and positive samples while minimizing the similarity between the query and negative samples. This objective ensures that the learned embeddings are discriminative and robust, capturing the semantic similarity of the code snippets.
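As a sketch of this objective: with a batch of query embeddings and their matching positive (key) embeddings, InfoNCE reduces to a softmax cross-entropy over pairwise similarities, where each query’s positive sits on the diagonal and every other key in the batch serves as a negative. The random tensors below stand in for real encoder outputs, and the temperature value is an assumption.

# Sketch of the InfoNCE loss: the positive for each query is its own transformed
# snippet (diagonal), and all other keys in the batch act as negatives.
# The random tensors stand in for encoder outputs; the temperature is an assumption.
import torch
import torch.nn.functional as F

batch_size, dim, temperature = 8, 32, 0.07
queries = F.normalize(torch.randn(batch_size, dim), dim=1)   # query-encoder outputs
keys = F.normalize(torch.randn(batch_size, dim), dim=1)      # momentum-encoder outputs

logits = queries @ keys.t() / temperature    # (batch, batch) cosine similarities
labels = torch.arange(batch_size)            # positive pair index = diagonal
loss = F.cross_entropy(logits, labels)       # InfoNCE is a softmax cross-entropy
print(loss.item())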
The entire framework leverages the strengths of contrastive learning to learn meaningful and robust code embeddings from unlabeled data. The use of AST transformations and a momentum encoder further enhances the quality and efficiency of the learned representations, making TransformCode a powerful tool for various software engineering tasks.
Key Features of TransformCode
Flexibility and Adaptability: Can be extended to various downstream tasks requiring code representation.
Efficiency and Scalability: Does not require a large model or extensive training data, supporting any programming language.
Unsupervised and Supervised Learning: Can be applied to both learning scenarios by incorporating task-specific labels or objectives.
Adjustable Parameters: The number of encoder parameters can be adjusted based on available computing resources.
TransformCode introduces a data-augmentation technique called AST transformation, which applies syntactic and semantic transformations to the original code snippets. This generates diverse and robust samples for contrastive learning.
Applications of Code Embeddings
Code embeddings have revolutionized various aspects of software engineering by transforming code from a textual format to a numerical representation usable by machine learning models. Here are some key applications:
Improved Code Search
Traditionally, code search relied on keyword matching, which often led to irrelevant results. Code embeddings enable semantic search, where code snippets are ranked based on their similarity in functionality, even if they use different keywords. This significantly improves the accuracy and efficiency of finding relevant code within large codebases.
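A minimal sketch of embedding-based (semantic) code search: embed the query and every snippet in the corpus, then rank by cosine similarity. The random vectors below are placeholders for the output of whatever embedding model is in use.

# Sketch of semantic code search: rank corpus snippets by cosine similarity to a
# query embedding. Random vectors are placeholders for real model outputs.
import numpy as np

def rank_snippets(query_vec: np.ndarray, corpus_vecs: np.ndarray) -> np.ndarray:
    q = query_vec / np.linalg.norm(query_vec)
    c = corpus_vecs / np.linalg.norm(corpus_vecs, axis=1, keepdims=True)
    scores = c @ q                   # cosine similarity of each snippet to the query
    return np.argsort(-scores)       # snippet indices, best match first

query_vec = np.random.rand(32)         # stand-in for the embedded search query
corpus_vecs = np.random.rand(100, 32)  # stand-in for an embedded codebase
print(rank_snippets(query_vec, corpus_vecs)[:5])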
Smarter Code Completion
Code completion tools suggest relevant code snippets based on the current context. By leveraging code embeddings, these tools can provide more accurate and helpful suggestions by understanding the semantic meaning of the code being written. This translates to faster and more productive coding experiences.
Automated Code Correction and Bug Detection
Code embeddings can be used to identify patterns that often indicate bugs or inefficiencies in code. By analyzing the similarity between code snippets and known bug patterns, these systems can automatically suggest fixes or highlight areas that might require further inspection.
Enhanced Code Summarization and Documentation Generation
Large codebases often lack proper documentation, making it difficult for new developers to understand their workings. Code embeddings can create concise summaries that capture the essence of the code’s functionality. This not only improves code maintainability but also facilitates knowledge transfer within development teams.
Improved Code Reviews
Code reviews are crucial for maintaining code quality. Code embeddings can assist reviewers by highlighting potential issues and suggesting improvements. Additionally, they can facilitate comparisons between different code versions, making the review process more efficient.
Cross-Lingual Code Processing
The world of software development is not limited to a single programming language. Code embeddings hold promise for facilitating cross-lingual code processing tasks. By capturing the semantic relationships between code written in different languages, these techniques could enable tasks like code search and analysis across programming languages.
Choosing the Right Code Embedding Model
There’s no one-size-fits-all solution for choosing a code embedding model. The best model depends on various factors, including the specific objective, the programming language, and available resources.
Key Considerations:
Specific Objective: For code completion, a model adept at local semantics (like word2vec-based) might be sufficient. For code search requiring understanding broader context, graph-based models might be better.
Programming Language: Some models are tailored for specific languages (e.g., Java, Python), while others are more general-purpose.
Available Resources: Consider the computational power required to train and use the model. Complex models might not be feasible for resource-constrained environments.
Additional Tips:
Experimentation is Key: Don’t be afraid to experiment with a few different models to see which one performs best for your specific dataset and use case.
Stay Updated: The field of code embeddings is constantly evolving. Keep an eye on new models and research to ensure you’re using the latest advancements.
Community Resources: Utilize online communities and forums dedicated to code embeddings. These can be valuable sources of information and insights from other developers.
The Future of Code Embeddings
As research in this area continues, code embeddings are poised to play an increasingly central role in software engineering. By enabling machines to understand code on a deeper level, they can revolutionize the way we develop, maintain, and interact with software.
References and Further Reading
CodeBERT: A Pre-Trained Model for Programming and Natural Languages
GraphCodeBERT: Pre-trained Code Representation Learning with Data Flow
InferCode: Self-Supervised Learning of Code Representations by Predicting Subtrees
Transformers: Attention Is All You Need
Contrastive Learning for Unsupervised Code Embedding
0 notes
knifeforkspooncup · 27 days
Text
Wip for a lil Bildad/UZ mini comic that's making me laugh bc it is indeed gonna be a really silly time.
As much as I live laugh love trying out new things stylistically, sometimes I just wanna do what is the most intuitive for me (aka pencil scribbles to make up for my piss poor anatomy skills, messy gouache coloring, and fucking with the colour gradients until I look like I know colour theory (I do not))
97 notes · View notes
mewkwota · 1 month
Text
I understand now, the "C." in his name stands for Cu-- Coming to another spooky game as a guest character.
It's been a while but I still remember how to draw him. Thank goodness.
75 notes · View notes
waitineedaname · 2 months
Text
when im writing them, i often find myself thinking about the difference between jiang cheng and wei wuxian's anger. im not sure how to put it into words, but the way they both express anger is so interesting to me bc like... for jiang cheng, it tends to be his first response. he's confronted with something unpleasant and the safest response is to be angry and lash out, because then that protects the vulnerability hidden beneath whatever has upset him. his anger tends to be explosive, with shouting and violence, until eventually it ebbs away and the vulnerable emotion underneath is revealed. wei wuxian, however, tends to let it simmer until it eventually boils over. I think about the confrontation with jin zixun about the wens a lot, and how you can see him trying to keep a lid on the anger at first, and how that makes him cold and sharp (which is made all the more striking by how warm he is under better circumstances) until finally he can't keep a lid on it anymore and the anger boils over, which is the point at which you should probably aim to leave his general vicinity because an angry wei wuxian is very, very scary
79 notes · View notes
slavhew · 5 months
Text
2024 redraw of a 2017 dirkus
120 notes · View notes