Tumgik
#custom metrics
jcmarchi · 2 months
Text
Tracking Large Language Models (LLMs) with MLflow: A Complete Guide
New Post has been published on https://thedigitalinsider.com/tracking-large-language-models-llm-with-mlflow-a-complete-guide/
As Large Language Models (LLMs) grow in complexity and scale, tracking their performance, experiments, and deployments becomes increasingly challenging. This is where MLflow comes in – providing a comprehensive platform for managing the entire lifecycle of machine learning models, including LLMs.
In this in-depth guide, we’ll explore how to leverage MLflow for tracking, evaluating, and deploying LLMs. We’ll cover everything from setting up your environment to advanced evaluation techniques, with plenty of code examples and best practices along the way.
Functionality of MLflow in Large Language Models (LLMs)
MLflow has become a pivotal tool in the machine learning and data science community, especially for managing the lifecycle of machine learning models. When it comes to Large Language Models (LLMs), MLflow offers a robust suite of tools that significantly streamline the process of developing, tracking, evaluating, and deploying these models. Here’s an overview of how MLflow functions within the LLM space and the benefits it provides to engineers and data scientists.
Tracking and Managing LLM Interactions
MLflow’s LLM tracking system is an enhancement of its existing tracking capabilities, tailored to the unique needs of LLMs. It allows for comprehensive tracking of model interactions, including the following key aspects:
Parameters: Logging key-value pairs that detail the input parameters for the LLM, such as model-specific parameters like top_k and temperature. This provides context and configuration for each run, ensuring that all aspects of the model’s configuration are captured.
Metrics: Quantitative measures that provide insights into the performance and accuracy of the LLM. These can be updated dynamically as the run progresses, offering real-time or post-process insights.
Predictions: Capturing the inputs sent to the LLM and the corresponding outputs, which are stored as artifacts in a structured format for easy retrieval and analysis.
Artifacts: Beyond predictions, MLflow can store various output files such as visualizations, serialized models, and structured data files, allowing for detailed documentation and analysis of the model’s performance.
This structured approach ensures that all interactions with the LLM are recorded, providing comprehensive lineage and quality tracking for text-generating models.
Evaluation of LLMs
Evaluating LLMs presents unique challenges due to their generative nature and the lack of a single ground truth. MLflow simplifies this with specialized evaluation tools designed for LLMs. Key features include:
Versatile Model Evaluation: Supports evaluating various types of LLMs, whether it’s an MLflow pyfunc model, a URI pointing to a registered MLflow model, or any Python callable representing your model.
Comprehensive Metrics: Offers a range of metrics tailored for LLM evaluation, including both metrics that depend on a SaaS-hosted judge model (e.g., answer relevance) and function-based metrics (e.g., ROUGE, Flesch-Kincaid).
Predefined Metric Collections: Depending on the use case, such as question-answering or text-summarization, MLflow provides predefined metrics to simplify the evaluation process.
Custom Metric Creation: Allows users to define and implement custom metrics to suit specific evaluation needs, enhancing the flexibility and depth of model evaluation.
Evaluation with Static Datasets: Enables evaluation of static datasets without specifying a model, which is useful for quick assessments without rerunning model inference.
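As a rough illustration of the static-dataset case, the sketch below builds a small table that already contains saved model outputs and passes it to mlflow.evaluate() without a model. The column names (inputs, outputs, ground_truth) and the helper names validate_records and evaluate_static are illustrative assumptions, not part of MLflow. The MLflow and pandas imports are deferred so the validation helper can be used on its own.

```python
REQUIRED_COLUMNS = ("inputs", "outputs", "ground_truth")  # assumed column names


def validate_records(records):
    """Return the records unchanged if every row has the required columns."""
    for i, row in enumerate(records):
        missing = [c for c in REQUIRED_COLUMNS if c not in row]
        if missing:
            raise ValueError(f"row {i} is missing columns: {missing}")
    return records


def evaluate_static(records):
    """Score precomputed outputs with MLflow; no model inference is run."""
    import mlflow  # deferred: only needed when actually evaluating
    import pandas as pd

    eval_df = pd.DataFrame(validate_records(records))
    with mlflow.start_run():
        return mlflow.evaluate(
            data=eval_df,
            predictions="outputs",  # column holding the saved model outputs
            targets="ground_truth",
            model_type="question-answering",
        )


# Hypothetical sample row; in practice this would come from logged predictions
records = [
    {
        "inputs": "What is MLflow?",
        "outputs": "MLflow is an open-source platform for managing the ML lifecycle.",
        "ground_truth": "MLflow is an open-source platform for the machine learning lifecycle.",
    },
]
```

Calling evaluate_static(records) would then return the usual evaluation result object, with aggregate scores available under its metrics attribute.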
Deployment and Integration
MLflow also supports seamless deployment and integration of LLMs:
MLflow Deployments Server: Acts as a unified interface for interacting with multiple LLM providers. It simplifies integrations, manages credentials securely, and offers a consistent API experience. This server supports a range of foundational models from popular SaaS vendors as well as self-hosted models.
Unified Endpoint: Facilitates easy switching between providers without code changes, minimizing downtime and enhancing flexibility.
Integrated Results View: Provides comprehensive evaluation results, which can be accessed directly in the code or through the MLflow UI for detailed analysis.
MLflow’s comprehensive suite of tools and integrations makes it an invaluable asset for engineers and data scientists working with advanced NLP models.
Setting Up Your Environment
Before we dive into tracking LLMs with MLflow, let’s set up our development environment. We’ll need to install MLflow and several other key libraries:
pip install "mlflow>=2.8.1"
pip install openai
pip install chromadb==0.4.15
pip install langchain==0.0.348
pip install tiktoken
pip install 'mlflow[genai]'
pip install databricks-sdk --upgrade
After installation, it’s a good practice to restart your Python environment to ensure all libraries are properly loaded. In a Jupyter notebook, you can use:
import mlflow
import chromadb

print(f"MLflow version: {mlflow.__version__}")
print(f"ChromaDB version: {chromadb.__version__}")
This will confirm the versions of key libraries we’ll be using.
Understanding MLflow’s LLM Tracking Capabilities
MLflow’s LLM tracking system builds upon its existing tracking capabilities, adding features specifically designed for the unique aspects of LLMs. Let’s break down the key components:
Runs and Experiments
In MLflow, a “run” represents a single execution of your model code, while an “experiment” is a collection of related runs. For LLMs, a run might represent a single query or a batch of prompts processed by the model.
Key Tracking Components
Parameters: These are input configurations for your LLM, such as temperature, top_k, or max_tokens. You can log these using mlflow.log_param() or mlflow.log_params().
Metrics: Quantitative measures of your LLM’s performance, like accuracy, latency, or custom scores. Use mlflow.log_metric() or mlflow.log_metrics() to track these.
Predictions: For LLMs, it’s crucial to log both the input prompts and the model’s outputs. mlflow.log_table() stores these as a table artifact in JSON format.
Artifacts: Any additional files or data related to your LLM run, such as model checkpoints, visualizations, or dataset samples. Use mlflow.log_artifact() to store these.
Let’s look at a basic example of logging an LLM run:
import mlflow
import openai


def query_llm(prompt, max_tokens=100):
    # Legacy Completions API (openai<1.0); newer client versions need a different call
    response = openai.Completion.create(
        engine="text-davinci-002",
        prompt=prompt,
        max_tokens=max_tokens,
    )
    return response.choices[0].text.strip()


with mlflow.start_run():
    prompt = "Explain the concept of machine learning in simple terms."

    # Log parameters
    mlflow.log_param("model", "text-davinci-002")
    mlflow.log_param("max_tokens", 100)

    # Query the LLM and log the result
    result = query_llm(prompt)
    mlflow.log_metric("response_length", len(result))

    # Log the prompt and response as a table artifact
    mlflow.log_table(
        data={"prompt": [prompt], "response": [result]},
        artifact_file="prompt_responses.json",
    )

    print(f"Response: {result}")
This example demonstrates logging parameters, metrics, and the input/output as a table artifact.
Deploying LLMs with MLflow
MLflow provides powerful capabilities for deploying LLMs, making it easier to serve your models in production environments. Let’s explore how to deploy an LLM using MLflow’s deployment features.
Creating an Endpoint
First, we’ll create an endpoint for our LLM using MLflow’s deployment client:
import mlflow
from mlflow.deployments import get_deploy_client

# Initialize the deployment client
client = get_deploy_client("databricks")

# Define the endpoint configuration
endpoint_name = "llm-endpoint"
endpoint_config = {
    "served_entities": [
        {
            "name": "gpt-model",
            "external_model": {
                "name": "gpt-3.5-turbo",
                "provider": "openai",
                "task": "llm/v1/completions",
                "openai_config": {
                    "openai_api_type": "azure",
                    # Secret references are resolved by Databricks at request time
                    "openai_api_key": "{{secrets/scope/openai_api_key}}",
                    "openai_api_base": "{{secrets/scope/openai_api_base}}",
                    "openai_deployment_name": "gpt-35-turbo",
                    "openai_api_version": "2023-05-15",
                },
            },
        }
    ],
}

# Create the endpoint
client.create_endpoint(name=endpoint_name, config=endpoint_config)
This code sets up an endpoint for a GPT-3.5-turbo model using Azure OpenAI. Note the use of Databricks secrets for secure API key management.
Testing the Endpoint
Once the endpoint is created, we can test it:
response = client.predict(
    endpoint=endpoint_name,
    inputs={
        "prompt": "Explain the concept of neural networks briefly.",
        "max_tokens": 100,
    },
)
print(response)
This will send a prompt to our deployed model and return the generated response.
Evaluating LLMs with MLflow
Evaluation is crucial for understanding the performance and behavior of your LLMs. MLflow provides comprehensive tools for evaluating LLMs, including both built-in and custom metrics.
Preparing Your LLM for Evaluation
To evaluate your LLM with mlflow.evaluate(), your model needs to be in one of these forms:
An mlflow.pyfunc.PyFuncModel instance or a URI pointing to a logged MLflow model.
A Python function that takes string inputs and outputs a single string.
An MLflow Deployments endpoint URI.
No model at all: set model=None and include the model outputs in the evaluation data.
Let’s look at an example using a logged MLflow model:
import mlflow
import openai
import pandas as pd

with mlflow.start_run():
    system_prompt = "Answer the following question concisely."
    logged_model_info = mlflow.openai.log_model(
        model="gpt-3.5-turbo",
        task=openai.chat.completions,
        artifact_path="model",
        messages=[
            {"role": "system", "content": system_prompt},
            # {question} is filled from the "question" column of the evaluation data
            {"role": "user", "content": "{question}"},
        ],
    )

    # Prepare evaluation data
    eval_data = pd.DataFrame(
        {
            "question": ["What is machine learning?", "Explain neural networks."],
            "ground_truth": [
                "Machine learning is a subset of AI that enables systems to learn and improve from experience without explicit programming.",
                "Neural networks are computing systems inspired by biological neural networks, consisting of interconnected nodes that process and transmit information.",
            ],
        }
    )

    # Evaluate the model
    results = mlflow.evaluate(
        logged_model_info.model_uri,
        eval_data,
        targets="ground_truth",
        model_type="question-answering",
    )

    print(f"Evaluation metrics: {results.metrics}")
This example logs an OpenAI model, prepares evaluation data, and then evaluates the model using MLflow’s built-in metrics for question-answering tasks.
Custom Evaluation Metrics
MLflow allows you to define custom metrics for LLM evaluation. Here’s an example of creating a custom metric for evaluating the professionalism of responses:
from mlflow.metrics.genai import EvaluationExample, make_genai_metric

professionalism = make_genai_metric(
    name="professionalism",
    definition="Measure of formal and appropriate communication style.",
    grading_prompt=(
        "Score the professionalism of the answer on a scale of 0-4:\n"
        "0: Extremely casual or inappropriate\n"
        "1: Casual but respectful\n"
        "2: Moderately formal\n"
        "3: Professional and appropriate\n"
        "4: Highly formal and expertly crafted"
    ),
    examples=[
        EvaluationExample(
            input="What is MLflow?",
            output="MLflow is like your friendly neighborhood toolkit for managing ML projects. It's super cool!",
            score=1,
            justification="The response is casual and uses informal language.",
        ),
        EvaluationExample(
            input="What is MLflow?",
            output="MLflow is an open-source platform for the machine learning lifecycle, including experimentation, reproducibility, and deployment.",
            score=4,
            justification="The response is formal, concise, and professionally worded.",
        ),
    ],
    model="openai:/gpt-3.5-turbo-16k",
    parameters={"temperature": 0.0},
    aggregations=["mean", "variance"],
    greater_is_better=True,
)

# Use the custom metric in evaluation
results = mlflow.evaluate(
    logged_model_info.model_uri,
    eval_data,
    targets="ground_truth",
    model_type="question-answering",
    extra_metrics=[professionalism],
)

# Aggregated scores are keyed as "<name>/<version>/<aggregation>";
# print results.metrics to confirm the exact key in your MLflow version
print(f"Professionalism score: {results.metrics['professionalism/v1/mean']}")
This custom metric uses GPT-3.5-turbo to score the professionalism of responses, demonstrating how you can leverage LLMs themselves for evaluation.
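Not every custom metric needs a judge model. As a contrast to the LLM-scored metric, the sketch below defines a function-based metric computed locally: a unigram-overlap F1 score, which is a simplified ROUGE-1-style measure and not MLflow’s built-in ROUGE. The names token_f1 and build_token_f1_metric are hypothetical, and the wrapper assumes mlflow.metrics.make_metric with an eval_fn that receives predictions, targets, and previously computed metrics.

```python
from collections import Counter


def token_f1(prediction, reference):
    """Unigram-overlap F1 between two strings (lowercased, whitespace-split)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Multiset intersection counts each shared token at most as often as it appears in both
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def build_token_f1_metric():
    """Wrap token_f1 as an MLflow metric that can be passed in extra_metrics."""
    from mlflow.metrics import MetricValue, make_metric  # deferred import

    def eval_fn(predictions, targets, metrics):
        scores = [token_f1(p, t) for p, t in zip(predictions, targets)]
        mean = sum(scores) / len(scores) if scores else 0.0
        return MetricValue(scores=scores, aggregate_results={"mean": mean})

    return make_metric(eval_fn=eval_fn, greater_is_better=True, name="token_f1")
```

The resulting metric could be supplied alongside the judged one, for example extra_metrics=[professionalism, build_token_f1_metric()], so that a cheap deterministic score sits next to the LLM-based score in the same results table.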
Advanced LLM Evaluation Techniques
As LLMs become more sophisticated, so do the techniques for evaluating them. Let’s explore some advanced evaluation methods using MLflow.
Retrieval-Augmented Generation (RAG) Evaluation
RAG systems combine the power of retrieval-based and generative models. Evaluating RAG systems requires assessing both the retrieval and generation components. Here’s how you can set up a RAG system and evaluate it using MLflow:
import mlflow
from langchain.document_loaders import WebBaseLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.chains import RetrievalQA
from langchain.llms import OpenAI

# Load and preprocess documents
loader = WebBaseLoader(["https://mlflow.org/docs/latest/index.html"])
documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_documents(documents)

# Create vector store
embeddings = OpenAIEmbeddings()
vectorstore = Chroma.from_documents(texts, embeddings)

# Create RAG chain
llm = OpenAI(temperature=0)
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectorstore.as_retriever(),
    return_source_documents=True,
)


# Evaluation function: returns the answer and the text of each retrieved source
def evaluate_rag(question):
    result = qa_chain({"query": question})
    return result["result"], [doc.page_content for doc in result["source_documents"]]


# Prepare evaluation data
eval_questions = [
    "What is MLflow?",
    "How does MLflow handle experiment tracking?",
    "What are the main components of MLflow?",
]

# Evaluate using MLflow
with mlflow.start_run():
    source_counts = []
    for q_idx, question in enumerate(eval_questions):
        answer, sources = evaluate_rag(question)
        source_counts.append(len(sources))

        # Index-based keys keep parameter names unique and file names filesystem-safe
        mlflow.log_param(f"question_{q_idx}", question)
        mlflow.log_metric("num_sources", len(sources), step=q_idx)
        mlflow.log_text(answer, f"answer_{q_idx}.txt")

        for i, source in enumerate(sources):
            mlflow.log_text(source, f"source_{q_idx}_{i}.txt")

    # Log custom metrics, reusing the counts instead of querying the chain again
    mlflow.log_metric("avg_sources_per_question", sum(source_counts) / len(source_counts))
This example sets up a RAG system using LangChain and Chroma, then evaluates it by logging questions, answers, retrieved sources, and custom metrics to MLflow.
The way you chunk your documents can significantly impact RAG performance. MLflow can help you evaluate different chunking strategies:
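A minimal sketch of such a comparison follows. It assumes simple fixed-size windows over characters or words instead of LangChain’s splitters, and it logs only chunk statistics; measuring answer quality per configuration would additionally require rebuilding the vector store and running the RAG chain. All names here (windows, SPLITTERS, chunk_stats, compare_chunking) are hypothetical.

```python
from itertools import product


def windows(seq, size, overlap):
    """Slide a window of `size` items over seq, sharing `overlap` items between neighbours."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap
    out, start = [], 0
    while start < len(seq):
        out.append(seq[start:start + size])
        if start + size >= len(seq):
            break  # the last window already reaches the end
        start += step
    return out


# Two illustrative splitting methods: by character and by whitespace-separated word
SPLITTERS = {
    "character": lambda text, size, overlap: windows(text, size, overlap),
    "word": lambda text, size, overlap: [
        " ".join(w) for w in windows(text.split(), size, overlap)
    ],
}


def chunk_stats(chunks):
    """Summarise a list of chunks as metrics that can be logged to MLflow."""
    lengths = [len(c) for c in chunks]
    return {
        "num_chunks": len(chunks),
        "avg_chunk_chars": sum(lengths) / len(lengths) if lengths else 0.0,
        "max_chunk_chars": max(lengths, default=0),
    }


def compare_chunking(text, sizes, overlaps, methods=("character", "word")):
    """Log one MLflow run per (method, size, overlap) combination."""
    import mlflow  # deferred: the helpers above work without MLflow installed

    for method, size, overlap in product(methods, sizes, overlaps):
        if overlap >= size:
            continue  # skip invalid combinations
        chunks = SPLITTERS[method](text, size, overlap)
        with mlflow.start_run(run_name=f"{method}-{size}-{overlap}"):
            mlflow.log_params(
                {"split_method": method, "chunk_size": size, "chunk_overlap": overlap}
            )
            mlflow.log_metrics(chunk_stats(chunks))
```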
This script evaluates different combinations of chunk sizes, overlaps, and splitting methods, logging the results to MLflow for easy comparison.
MLflow provides various ways to visualize your LLM evaluation results. Here are some techniques:
You can create custom visualizations of your evaluation results using libraries like Matplotlib or Plotly, then log them as artifacts:
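One possible shape for such a helper is sketched below, assuming Matplotlib is installed and that runs are fetched with mlflow.search_runs(), which returns a DataFrame whose metric columns are named metrics.<name>. The function names metric_series and plot_metric_comparison are illustrative.

```python
import math


def metric_series(run_rows, metric_name):
    """Return (run_ids, values) for the runs that actually recorded the metric."""
    column = f"metrics.{metric_name}"
    run_ids, values = [], []
    for row in run_rows:
        value = row.get(column)
        if value is None or (isinstance(value, float) and math.isnan(value)):
            continue  # runs that did not log the metric show up as missing or NaN
        run_ids.append(row["run_id"])
        values.append(float(value))
    return run_ids, values


def plot_metric_comparison(metric_name, experiment_ids):
    """Plot one metric across runs and log the figure as an MLflow artifact."""
    import matplotlib.pyplot as plt  # deferred imports
    import mlflow

    runs = mlflow.search_runs(experiment_ids=experiment_ids)
    run_ids, values = metric_series(runs.to_dict("records"), metric_name)

    fig, ax = plt.subplots(figsize=(10, 6))
    ax.plot(range(len(values)), values, marker="o")
    ax.set_xticks(range(len(values)))
    ax.set_xticklabels([run_id[:8] for run_id in run_ids], rotation=45)
    ax.set_xlabel("Run")
    ax.set_ylabel(metric_name)
    ax.set_title(f"{metric_name} across runs")
    fig.tight_layout()

    # Store the plot in its own run so it appears under that run's artifacts
    with mlflow.start_run(run_name=f"{metric_name}-comparison"):
        mlflow.log_figure(fig, f"{metric_name}_comparison.png")
    plt.close(fig)
```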
This function creates a line plot comparing a specific metric across multiple runs and logs it as an artifact.
0 notes
yesmissnyx · 8 months
Note
If i could only get 1 of your gumroads, which one should i get between;
"By my perfect cockslut femdom pegging joi"
"cum for me- ordering you to cum"
"cum like a girl"
Or a better way to ask is which one was your favorite to record =)
Ohhh...I think I have to choose Cum like a Girl because the idea of it gets *me* super turned on. I can't help it, I'm a slut for chastity and feminization. Making someone cum from just using a vibrator on their caged cock? That's it, that's the stuff.
But...I also really love recording POV/RP stuff like Be My Perfect Cockslut 😈 My Dommy Mommy Girlfriend Voice is really good in that one.
Cum for Me is fun, but it shines most as a bonus to any of my edging JOIs.
Hope that helps you decide 😘!
30 notes · View notes
yes-armageddon-it · 1 year
Text
Tumblr media Tumblr media Tumblr media Tumblr media Tumblr media
57 notes · View notes
merrysithmas · 1 year
Text
my FAVORITE all time inane AU formula is "help something happened to santa and SOS somehow the cast of _____ has to deliver all the presents to the entire world/universe in one night"
and so now im picturing the SNW enterprise crew doing so w/kirk and Santa as some ubiquitous benevolent good tidings space entity
32 notes · View notes
shopwitchvamp · 4 months
Note
ive bought a bunch of your joggers and skirts and i have figured out that i can fit my 36 pack metal tin of prismacolor watercolors into one of the pockets of the joggers and am using this newfound power to bring more of my art supplies to work 💜 ᓚᘏᗢ
Ohh, nice!! I love that, haha
13 notes · View notes
insidiousclouds · 6 months
Text
Selling my art as a teenager really killed my passion for it. I don't want to draw anymore because it feels like work.
9 notes · View notes
istherewifiinhell · 1 year
Text
too loud too tired too much textures and smells 👍🏻
4 notes · View notes
silvermizuki · 2 years
Text
Ever since I left my last job, it started falling to shambles 🧍
9 notes · View notes
Text
To the lovely customer who called my Hubs Samuel L. Jackson's favorite word, fifteen minutes into his shift:
Tumblr media
3 notes · View notes
luckyredeyes · 2 years
Text
Me: /leaves an email for a customer/ Hi, this order is on hold for [this specific question], please call me directly at [MY DIRECT NUMBER]; you can also leave a voicemail.
Email above: /is visible to everyone who sees this case in the system/
My direct number and voicemail: /very much in good working order/
Customer: /calls the regular Customer Service reps instead of me/ I got an email I don’t understand, why is my order on hold and why didn’t anyone tell me, why the fuck are you delaying my order????
The regular CS reps, rightly so: ARE YOU FUCKING KIDDING ME
5 notes · View notes
Note
(Frisk Side)
*The spiders wave happily to their favorite customer*
*One of them crawl down and gives Frisk a few skulls as a reward for their service.*
(MK Side)
*A carnival game master calls them over. He is dressed as an old timey mine.*
"Young child, if you can correctly guess how much this box weighs ill give you 5 skulls!"
(Side Suzy)
*Nearby is a Test of Strength game*
*A cardboard cutout of Mettaton stands nearby*
"Win 5 Skulls if you can beat my score of 25!"
Hey, thanks you guys!!! Hope you're enjoying the surface!! I have to go find my friends now, seeya!! (6/30 Skulls!)
Uh... is it, um... 6? Kilos??
Oh, I've GOT THIS!!!
(Suzy hits the button as hard as she can!)
4 notes · View notes
ai-azura · 2 years
Text
The Role of Analytics and Metrics in Measuring the Success of Influencer Marketing Campaigns
Analytics and metrics play a crucial role in measuring the success of influencer marketing campaigns. With the rise of social media and the proliferation of influencers, it has become increasingly important for businesses to track the effectiveness of their influencer marketing efforts. There are several key metrics that businesses should consider when evaluating the success of an influencer…
2 notes · View notes
realjdobypr · 2 months
Text
Create Unique and Personalized Content That Engages Your Audience
Creating unique and personalized content that truly captivates your audience is paramount for any successful marketing strategy. With the vast amount of information available online, standing out from the crowd requires a strategic approach that resonates with your target demographic. By tailoring your content to address the specific needs, interests, and preferences of your audience, you can…
0 notes
isa-ah · 2 months
Note
personally in a game i like a thing that's more automatic similar to your run or coop idea, kinda like what stardew valley does where you leave the coop door open and all your chickens will go inside at night on their own.
yeah that's the issue right. I really loved having that as an anchor in my gameplay, keeping my chickens safe and manually sending them inside every night, but that's not long-term gameplay compatible... but if it's always an option, you can free range But. or it's a hurdle to overcome early where you need to be diligent but once you have the run it's fine. then that probably works?
0 notes
isubhamdas · 3 months
Text
Ecommerce Analytics-Data-Driven Potential
Discover the power of ecommerce analytics to skyrocket your online business. Dive deeper into the data and uncover hidden opportunities. Empower your decision-making with cutting-edge expertise.
Harness the Power of Google Analytics
Leverage Social Media Analytics
Analyze Your Competition
Track Your Key Performance Indicators (KPIs)
Utilize A/B Testing
Expert Tip on Ecommerce Analytics
Suggestion on…
0 notes
acubeai · 3 months
Text
Creating an Effective Power BI Dashboard: A Comprehensive Guide
Introduction to Power BI
Power BI is a suite of business analytics tools that allows you to connect to multiple data sources, transform data into actionable insights, and share those insights across your organization. With Power BI, you can create interactive dashboards and reports that provide a 360-degree view of your business.
Step-by-Step Guide to Creating a Power BI Dashboard
1. Data Import and Transformation
The first step in creating a Power BI dashboard is importing your data. Power BI supports various data sources, including Excel, SQL Server, Azure, and more.
Steps to Import Data:
Open Power BI Desktop.
Click on Get Data in the Home ribbon.
Select your data source (e.g., Excel, SQL Server, etc.).
Load the data into Power BI.
Once the data is loaded, you may need to transform it to suit your reporting needs. Power BI provides Power Query Editor for data transformation.
Data Transformation:
Open Power Query Editor.
Apply necessary transformations such as filtering rows, adding columns, merging tables, etc.
Close and apply the changes.
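To make the three transformations named above concrete (filtering rows, adding a column, merging tables), here is a small plain-Python sketch of the same operations on in-memory rows. It is not Power Query; in Power BI these steps would be applied in the Power Query Editor. The sample data and the names transform and to_csv are hypothetical, and the resulting CSV text is the kind of file that could be loaded through Get Data.

```python
import csv
import io

# Hypothetical sample data standing in for two source tables
orders = [
    {"order_id": 1, "region_id": "N", "units": 3, "unit_price": 20.0},
    {"order_id": 2, "region_id": "S", "units": 0, "unit_price": 15.0},
    {"order_id": 3, "region_id": "S", "units": 5, "unit_price": 12.5},
]
regions = {"N": "North", "S": "South"}


def transform(order_rows, region_names):
    """Filter out empty orders, add a revenue column, and merge in region names."""
    result = []
    for row in order_rows:
        if row["units"] <= 0:  # filter rows
            continue
        new_row = dict(row)
        new_row["revenue"] = row["units"] * row["unit_price"]  # add a column
        new_row["region"] = region_names.get(row["region_id"], "Unknown")  # merge tables
        result.append(new_row)
    return result


def to_csv(rows):
    """Serialise the transformed rows as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


clean = transform(orders, regions)
```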
2. Designing the Dashboard
After preparing your data, the next step is to design your dashboard. Start by adding a new report and selecting the type of visualization you want to use.
Types of Visualizations:
Charts: Bar, Line, Pie, Area, etc.
Tables and Matrices: For detailed data representation.
Maps: Geographic data visualization.
Cards and Gauges: For key metrics and KPIs.
Slicers: For interactive data filtering.
Adding Visualizations:
Drag and drop fields from the Fields pane to the canvas.
Choose the appropriate visualization type from the Visualizations pane.
Customize the visual by adjusting properties such as colors, labels, and titles.
3. Enhancing the Dashboard with Interactivity
Interactivity is one of the key features of Power BI dashboards. You can add slicers, drill-throughs, and bookmarks to make your dashboard more interactive and user-friendly.
Using Slicers:
Add a slicer visual to the canvas.
Drag a field to the slicer to allow users to filter data dynamically.
Drill-throughs:
Enable drill-through on visuals to allow users to navigate to detailed reports.
Set up drill-through pages by defining the fields that will trigger the drill-through.
Bookmarks:
Create bookmarks to capture the state of a report page.
Use bookmarks to toggle between different views of the data.
Different Styles of Power BI Dashboards
Power BI dashboards can be styled to meet various business needs. Here are a few examples:
1. Executive Dashboard
An executive dashboard provides a high-level overview of key business metrics. It typically includes:
KPI visuals for critical metrics.
Line charts for trend analysis.
Bar charts for categorical comparison.
Maps for geographic insights.
Example:
KPI cards for revenue, profit margin, and customer satisfaction.
A line chart showing monthly sales trends.
A bar chart comparing sales by region.
A map highlighting sales distribution across different states.
2. Sales Performance Dashboard
A sales performance dashboard focuses on sales data, providing insights into sales trends, product performance, and sales team effectiveness.
Example:
A funnel chart showing the sales pipeline stages.
A bar chart displaying sales by product category.
A scatter plot highlighting the performance of sales representatives.
A table showing detailed sales transactions.
3. Financial Dashboard
A financial dashboard offers a comprehensive view of the financial health of an organization. It includes:
Financial KPIs such as revenue, expenses, and profit.
Financial statements like income statement and balance sheet.
Trend charts for revenue and expenses.
Pie charts for expense distribution.
Example:
KPI cards for net income, operating expenses, and gross margin.
A line chart showing monthly revenue and expense trends.
A pie chart illustrating the breakdown of expenses.
A matrix displaying the income statement.
Best Practices for Designing Power BI Dashboards
To ensure your Power BI dashboard is effective and user-friendly, follow these best practices:
1. Keep it Simple:
Avoid cluttering the dashboard with too many visuals.
Focus on the most important metrics and insights.
2. Use Consistent Design:
Maintain a consistent color scheme and font style.
Align visuals properly for a clean layout.
3. Ensure Data Accuracy:
Validate your data to ensure accuracy.
Regularly update the data to reflect the latest information.
4. Enhance Interactivity:
Use slicers and drill-throughs to provide a dynamic user experience.
Add tooltips to provide additional context.
5. Optimize Performance:
Use aggregations and data reduction techniques to improve performance.
Avoid using too many complex calculations.
Conclusion
Creating a Power BI dashboard involves importing and transforming data, designing interactive visuals, and applying best practices to ensure clarity and effectiveness. By following the steps outlined in this guide, you can build dashboards that provide valuable insights and support data-driven decision-making in your organization. Power BI’s flexibility and range of visualizations make it an essential tool for any business looking to leverage its data effectively.
0 notes