DSPy-arXiv¶
Given an arXiv paper from the Computer Science (cs) section,
extract its subcategories (e.g., cs.AI, cs.IR, ...).
Dataset¶
We use the term dataset to refer to a small selection of papers that will be used to 'train' the pipeline.
- 50 papers in trainset - used for pipeline training
- 50 papers in valset - used for pipeline evaluation during training
- 50 papers in testset - used for pipeline evaluation after training
To construct the dataset, please refer to the database.ipynb notebook.
Each datapoint (paper + paper metadata) is a dspy.Example, a dict-like structure with inputs ($x$) and labels ($y$).
- Inputs:
  - title: Title of the paper.
  - abstract: Abstract of the paper.
  - text: Text body of the paper parsed from PDF with arxiv2text. (This is future work.)
- Labels:
  - categories: Set of associated categories.
print(re.sub(r"[\s\n]+", " ", valset[0].title), "\n")
print(re.sub(r"[\s\n]+", " ", valset[0].abstract)[:400], "...\n")
print(valset[0].labels().categories, "\n")
Numerical Investigation of Graph Spectra and Information Interpretability of Eigenvalues We undertake an extensive numerical investigation of the graph spectra of thousands regular graphs, a set of random Erd\"os-R\'enyi graphs, the two most popular types of complex networks and an evolving genetic network by using novel conceptual and experimental tools. Our objective in so doing is to contribute to an understanding of the meaning of the Eigenvalues of a graph relative to its topolo ... {'cs.IT'}
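As a rough illustration of the three-way split described above, the sketch below shuffles 150 records and cuts them into three sets of 50. The `split_dataset` helper, the seed, and the placeholder records are assumptions made for illustration; they are not taken from database.ipynb.

```python
import random


def split_dataset(papers, size=50, seed=0):
    """Shuffle papers and return (trainset, valset, testset), each with `size` items."""
    if len(papers) < 3 * size:
        raise ValueError("not enough papers for three splits")
    shuffled = papers[:]  # copy so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)
    return (
        shuffled[:size],
        shuffled[size:2 * size],
        shuffled[2 * size:3 * size],
    )


# Placeholder records (hypothetical); real ones carry arXiv titles and abstracts.
papers = [
    {"title": f"Paper {i}", "abstract": f"Abstract {i}", "categories": {"cs.LG"}}
    for i in range(150)
]
trainset_demo, valset_demo, testset_demo = split_dataset(papers)
print(len(trainset_demo), len(valset_demo), len(testset_demo))
```

Keeping the three sets disjoint matters here: valset guides the optimizer, so only testset gives an unbiased estimate afterwards.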
Metrics¶
Metrics are scalar values that quantify the performance of a pipeline with respect to a given task.
def metric_fn(labels, preds, trace=None):
    preds: list[str] | str = preds.categories
    labels: list[str] = labels.categories
    # We assume that predicted categories are sorted by relevance
    # and keep only the top k of them
    k = min(len(labels), len(preds))
    top_k_preds = preds[:k] if isinstance(preds, list) else [preds]
    # Ground-truth labels are alphabetically sorted,
    # so it makes sense to look at the intersection with top_k_preds
    top_k_pred_set: set[str] = set(top_k_preds)
    labels_set: set[str] = set(labels)
    score: float = len(top_k_pred_set & labels_set) / len(labels)
    return score
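To make the behaviour of the metric concrete, here is a small standalone sketch of the same top-k overlap computed on plain Python values. The helper name `top_k_overlap` is hypothetical; the first two inputs mirror the BarcodeBERT predictions shown in the traces later in this notebook, and the third is an invented example.

```python
def top_k_overlap(labels, preds):
    """Fraction of ground-truth labels found among the top-k predictions."""
    k = min(len(labels), len(preds))  # k is capped by the number of labels
    return len(set(preds[:k]) & set(labels)) / len(labels)


labels = {"cs.LG"}
# Only the first prediction counts because there is a single label.
score_unoptimized = top_k_overlap(labels, ["cs.AI", "cs.LG", "cs.DB"])
score_optimized = top_k_overlap(labels, ["cs.LG", "cs.DB"])
# Two labels, so the top two predictions are compared.
score_partial = top_k_overlap({"cs.AI", "cs.LG"}, ["cs.LG", "cs.CV", "cs.AI"])
print(score_unoptimized, score_optimized, score_partial)
```

Because k equals the number of labels, a correct category ranked too low earns nothing, so the metric rewards ordering as well as membership.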
I/O interface:¶
- $(x, y)$ → pipeline → generated outputs
- (title & abstract, categories) → pipeline → preds
Modular structure:¶
In PyTorch:
- Tensor/s → Module → Tensor/s
- Tensor/s → Module → Module → ... → Module → Tensor/s
- e.g. Linear, Dropout, ReLU, ...
In DSPy:
- InputField/s → Module → OutputField/s
- InputField/s → Module → Module → ... → Module → OutputField/s
- e.g. Predict, ChainOfThought, ReAct, custom, ...
"Optimization"¶
In PyTorch:
- Define loss. e.g., nn.MSELoss, nn.CrossEntropyLoss, ...
- Define optimizer. e.g., optim.SGD, optim.Adam, ...
- Minimize loss over trainset using optimizer by adjusting model parameters.
In DSPy:
- Define metric. e.g., metric_fn
- Define optimizer. e.g., BootstrapFewShot, SignatureOptimizer, ...
- Maximize metric over trainset using optimizer by adjusting text generation within modules.
DSPy heuristically searches for the most effective strategy to prompt an LLM to achieve the task according to the pipeline.
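The idea of maximizing a metric by searching over prompts can be sketched without any LM. The toy below is not DSPy's implementation: `toy_program` is a hypothetical stand-in for an LM call that copies the label of the most similar demonstration, and `random_search` simply samples demonstration sets from the training data and keeps the one that scores best on the validation data.

```python
import random


def toy_program(text, demos):
    """Stand-in for an LM call: copy the label of the demo sharing the most words."""
    if not demos:
        return "unknown"
    words = set(text.split())
    best = max(demos, key=lambda demo: len(words & set(demo[0].split())))
    return best[1]


def average_metric(demos, dataset):
    """Mean exact-match score of toy_program over (text, label) pairs."""
    hits = [float(toy_program(x, demos) == y) for x, y in dataset]
    return sum(hits) / len(hits)


def random_search(trainset, valset, num_candidates=20, max_demos=2, seed=0):
    """Sample candidate demo sets and keep the best one on valset."""
    rng = random.Random(seed)
    best_demos, best_score = [], average_metric([], valset)
    for _ in range(num_candidates):
        demos = rng.sample(trainset, k=min(max_demos, len(trainset)))
        score = average_metric(demos, valset)
        if score > best_score:
            best_demos, best_score = demos, score
    return best_demos, best_score


trainset_toy = [
    ("graph spectra eigenvalues", "cs.IT"),
    ("auction bidding equilibrium", "cs.GT"),
    ("scheduling real-time tasks", "cs.OS"),
    ("transformers pretraining", "cs.LG"),
]
valset_toy = [
    ("eigenvalues of random graph", "cs.IT"),
    ("first price auction", "cs.GT"),
]
baseline_score = average_metric([], valset_toy)
best_demos, best_score = random_search(trainset_toy, valset_toy)
print(baseline_score, best_score, best_demos)
```

Nothing in the "model" changes during the search; only the demonstrations placed in front of it do, which is the sense in which the optimization adjusts text rather than weights.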
Pipeline 101¶
(title & abstract, categories) → pipeline101 → preds
- Just title & abstract, no text body of the paper.
- No custom modules or creative use of modules.
- No RAG.
(But all the above can be easily added later.)
Signature¶
Signatures are like types in a programming language.
- They define the module's input/output.
- Their __doc__ will be included in the LLM prompt, so they can specify the goal of a module.
class PredictCategories(dspy.Signature):
    __doc__ = (
        f"Given the abstract of a scientific paper, "
        f"identify most relevant categories. "
        f"Valid categories are {CATEGORIES.keys()}"
    )
    title = dspy.InputField()
    abstract = dspy.InputField()
    categories = dspy.OutputField(
        desc="list of comma-separated categories",
        format=lambda x: ", ".join(x) if isinstance(x, list) else x,
    )
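Since the `__doc__` string ends up verbatim in the prompt, it can help to check how it renders. The sketch below uses a two-entry stand-in for `CATEGORIES` (its values are assumed to be descriptions, which is a guess) and shows that interpolating `.keys()` puts the `dict_keys([...])` wrapper into the text, while joining the keys yields a plain list. It also exercises the same formatting rule as the `format` lambda above.

```python
# Hypothetical two-entry stand-in for the notebook's CATEGORIES mapping.
CATEGORIES = {"cs.AI": "Artificial Intelligence", "cs.LG": "Machine Learning"}

# Interpolating the keys view keeps Python's repr, wrapper included.
raw_doc = f"Valid categories are {CATEGORIES.keys()}"
# Joining the keys gives plain comma-separated text instead.
clean_doc = f"Valid categories are {', '.join(CATEGORIES)}"
print(raw_doc)
print(clean_doc)


def format_categories(value):
    """Same rule as the OutputField `format` lambda: join lists, pass strings through."""
    return ", ".join(value) if isinstance(value, list) else value


print(format_categories(["cs.AI", "cs.LG"]))
print(format_categories("cs.AI"))
```

Whether the cleaner rendering changes model behaviour is an empirical question; the point is only that the prompt contains exactly what the f-string produces.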
Pipeline / Module¶
The pipeline is a Module as well.
Similar to PyTorch, it makes use of:
- __init__: Here, the modules are instantiated.
- forward: Here, it is defined how modules interact.
class Pipeline101(dspy.Module):
    def __init__(self):
        super().__init__()
        self.predict = dspy.ChainOfThought(PredictCategories)

    def forward(self, title, abstract, text=None, labels=None):
        categories = self.predict(title=title, abstract=abstract).completions.categories
        categories = [cat.strip() for cat in categories[0].split(",")]
        return dspy.Prediction(categories=categories)
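The parsing step in `forward` trusts whatever the LM returns. A slightly more defensive variant, sketched below as a plain function, also strips stray quotes, drops categories that are not in the valid set, and removes duplicates while preserving order. The `parse_categories` helper and the small `valid` set are hypothetical and not part of the pipeline above.

```python
def parse_categories(completion, valid):
    """Split a comma-separated completion and keep valid, de-duplicated categories."""
    kept = []
    for raw in completion.split(","):
        cat = raw.strip().strip("'\"")  # tolerate whitespace and stray quotes
        if cat in valid and cat not in kept:
            kept.append(cat)  # order is preserved because the metric is rank-sensitive
    return kept


valid = {"cs.AI", "cs.DB", "cs.LG"}
parsed = parse_categories("cs.LG, 'cs.DB', cs.XX, cs.LG", valid)
print(parsed)
```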
Language Model¶
The Language Model (LM) is at the core of the pipeline.
- It is used for processing and generating text in the pipeline.
- It is used by the optimizers to improve the pipeline itself.
For simple tasks, a fast and cheap LM is a good fit, because optimization makes many LM calls.
DSPy caches all calls to the LM.
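Why caching matters can be shown with a few lines of memoization. This is only an illustration of the general idea using `functools.lru_cache`, not a description of how DSPy stores its cache; `cached_lm` is a hypothetical stand-in that counts how many requests would actually reach the model.

```python
from functools import lru_cache

real_calls = {"count": 0}


@lru_cache(maxsize=None)
def cached_lm(prompt):
    """Hypothetical stand-in for an LM request; counts uncached calls only."""
    real_calls["count"] += 1
    return f"response to: {prompt}"


# The same prompt repeated three times triggers one real call.
for _ in range(3):
    cached_lm("What's red + yellow?")
cached_lm("What's blue + yellow?")
print(real_calls["count"])
```

During optimization the same prompt is often evaluated repeatedly, so avoiding duplicate requests saves both time and money.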
# You can host a local model with Ollama.
# Just change `model` and `api_base` accordingly.
# For example: `model="gemma"` & `api_base="http://localhost:11434/v1/"`
lm = dspy.OpenAI(
    model="gpt3.5-turbo",
    api_base=f"http://localhost:{PORT_LM}/v1/",
    api_key="your-api-key",
    model_type="chat",
)
# configure dspy to use `lm` as Language Model
dspy.settings.configure(lm=lm)
# Just testing that LM works
lm("What's red + yellow?")
['Red + yellow equals orange.']
# This is not optimized
pipeline101 = Pipeline101()
optimizer = BootstrapFewShotWithRandomSearch(
    metric=metric_fn,
    max_bootstrapped_demos=2,
    max_labeled_demos=0,
    max_rounds=1,
    num_candidate_programs=20,
    num_threads=8,
    teacher_settings=dict(lm=lm),
)
pipeline101_optimized = optimizer.compile(
    pipeline101,
    teacher=pipeline101,
    trainset=trainset,
    valset=valset,
)
scores_pipeline101 = []
for example in testset:
    example_x = example.inputs()
    example_y = example.labels()
    prediction = pipeline101(**example_x)
    score = metric_fn(example_y, prediction)
    scores_pipeline101.append(score)

# Inspect the last prompt given to the LLM
lm.inspect_history()
print("Ground-truth categories:", example.labels().categories)
print("Score:", score)
print("\n" * 5)
Given the abstract of a scientific paper, identify most relevant categories. Valid categories are dict_keys(['cs.AI', 'cs.AR', 'cs.CC', 'cs.CE', 'cs.CG', 'cs.CL', 'cs.CR', 'cs.CV', 'cs.CY', 'cs.DB', 'cs.DC', 'cs.DL', 'cs.DM', 'cs.DS', 'cs.ET', 'cs.FL', 'cs.GL', 'cs.GR', 'cs.GT', 'cs.HC', 'cs.IR', 'cs.IT', 'cs.LG', 'cs.LO', 'cs.MA', 'cs.MM', 'cs.MS', 'cs.NA', 'cs.NE', 'cs.NI', 'cs.OH', 'cs.OS', 'cs.PF', 'cs.PL', 'cs.RO', 'cs.SC', 'cs.SD', 'cs.SE', 'cs.SI', 'cs.SY'])
---
Follow the following format.
Title: ${title}
Abstract: ${abstract}
Reasoning: Let's think step by step in order to ${produce the categories}. We ...
Categories: list of comma-separated categories
---
Title: BarcodeBERT: Transformers for Biodiversity Analysis
Abstract: Understanding biodiversity is a global challenge, in which DNA barcodes - short snippets of DNA that cluster by species - play a pivotal role. In particular, invertebrates, a highly diverse and under-explored group, pose unique taxonomic complexities. We explore machine learning approaches, comparing supervised CNNs, fine-tuned foundation models, and a DNA barcode-specific masking strategy across datasets of varying complexity. While simpler datasets and tasks favor supervised CNNs or fine-tuned transformers, challenging species-level identification demands a paradigm shift towards self-supervised pretraining. We propose BarcodeBERT, the first self-supervised method for general biodiversity analysis, leveraging a 1.5 M invertebrate DNA barcode reference library. This work highlights how dataset specifics and coverage impact model selection, and underscores the role of self-supervised pretraining in achieving high-accuracy DNA barcode-based identification at the species and genus level. Indeed, without the fine-tuning step, BarcodeBERT pretrained on a large DNA barcode dataset outperforms DNABERT and DNABERT-2 on multiple downstream classification tasks. The code repository is available at https://github.com/Kari-Genomics-Lab/BarcodeBERT
Reasoning: Let's think step by step in order to produce the categories. We can see that the paper is about biodiversity analysis using machine learning approaches. It specifically focuses on DNA barcodes and their role in species-level identification. The paper also mentions the comparison of different models and the proposal of a new self-supervised method called BarcodeBERT. Based on this information, the most relevant categories are 'cs.AI' (Artificial Intelligence), 'cs.LG' (Machine Learning), and 'cs.DB' (Databases).
Categories: cs.AI, cs.LG, cs.DB
Ground-truth categories: {'cs.LG'}
Score: 0.0
scores_pipeline101_optimized = []
for example in testset:
    example_x = example.inputs()
    example_y = example.labels()
    prediction = pipeline101_optimized(**example_x)
    score = metric_fn(example_y, prediction)
    scores_pipeline101_optimized.append(score)

lm.inspect_history()
print("Ground-truth categories:", example.labels().categories)
print("Score:", score)
print("\n" * 5)
Given the abstract of a scientific paper, identify most relevant categories. Valid categories are dict_keys(['cs.AI', 'cs.AR', 'cs.CC', 'cs.CE', 'cs.CG', 'cs.CL', 'cs.CR', 'cs.CV', 'cs.CY', 'cs.DB', 'cs.DC', 'cs.DL', 'cs.DM', 'cs.DS', 'cs.ET', 'cs.FL', 'cs.GL', 'cs.GR', 'cs.GT', 'cs.HC', 'cs.IR', 'cs.IT', 'cs.LG', 'cs.LO', 'cs.MA', 'cs.MM', 'cs.MS', 'cs.NA', 'cs.NE', 'cs.NI', 'cs.OH', 'cs.OS', 'cs.PF', 'cs.PL', 'cs.RO', 'cs.SC', 'cs.SD', 'cs.SE', 'cs.SI', 'cs.SY'])
---
Follow the following format.
Title: ${title}
Abstract: ${abstract}
Reasoning: Let's think step by step in order to ${produce the categories}. We ...
Categories: list of comma-separated categories
---
Title: Explicit shading strategies for repeated truthful auctions
Abstract: With the increasing use of auctions in online advertising, there has been a large effort to study seller revenue maximization, following Myerson's seminal work, both theoretically and practically. We take the point of view of the buyer in classical auctions and ask the question of whether she has an incentive to shade her bid even in auctions that are reputed to be truthful, when aware of the revenue optimization mechanism. We show that in auctions such as the Myerson auction or a VCG with reserve price set as the monopoly price, the buyer who is aware of this information has indeed an incentive to shade. Intuitively, by selecting the revenue maximizing auction, the seller introduces a dependency on the buyers' distributions in the choice of the auction. We study in depth the case of the Myerson auction and show that a symmetric equilibrium exists in which buyers shade non-linearly what would be their first price bid. They then end up with an expected payoff that is equal to what they would get in a first price auction with no reserve price. We conclude that a return to simple first price auctions with no reserve price or at least non-dynamic anonymous ones is desirable from the point of view of both buyers, sellers and increasing transparency.
Reasoning: Let's think step by step in order to produce the categories. We can start by identifying the main topic of the paper, which appears to be auctions and revenue optimization. From there, we can consider the specific aspects of auctions that are discussed, such as truthful auctions, shading strategies, and the impact of buyer awareness on bidding behavior. Additionally, the paper mentions the Myerson auction and VCG with reserve price as specific auction mechanisms. Finally, the conclusion suggests that a return to simple first price auctions or non-dynamic anonymous auctions may be desirable.
Categories: cs.GT, cs.IR
---
Title: Comments on "Gang EDF Schedulability Analysis"
Abstract: This short report raises a correctness issue in the schedulability test presented in Kato et al., "Gang EDF Scheduling of Parallel Task Systems", 30th IEEE Real-Time Systems Symposium, 2009, pp. 459-468.
Reasoning: Let's think step by step in order to produce the categories. We can see that this abstract is discussing a specific paper and raising a correctness issue in the schedulability test presented in that paper. Therefore, the most relevant category for this abstract is 'cs.OS' (Operating Systems).
Categories: cs.OS
---
Title: BarcodeBERT: Transformers for Biodiversity Analysis
Abstract: Understanding biodiversity is a global challenge, in which DNA barcodes - short snippets of DNA that cluster by species - play a pivotal role. In particular, invertebrates, a highly diverse and under-explored group, pose unique taxonomic complexities. We explore machine learning approaches, comparing supervised CNNs, fine-tuned foundation models, and a DNA barcode-specific masking strategy across datasets of varying complexity. While simpler datasets and tasks favor supervised CNNs or fine-tuned transformers, challenging species-level identification demands a paradigm shift towards self-supervised pretraining. We propose BarcodeBERT, the first self-supervised method for general biodiversity analysis, leveraging a 1.5 M invertebrate DNA barcode reference library. This work highlights how dataset specifics and coverage impact model selection, and underscores the role of self-supervised pretraining in achieving high-accuracy DNA barcode-based identification at the species and genus level. Indeed, without the fine-tuning step, BarcodeBERT pretrained on a large DNA barcode dataset outperforms DNABERT and DNABERT-2 on multiple downstream classification tasks. The code repository is available at https://github.com/Kari-Genomics-Lab/BarcodeBERT
Reasoning: Let's think step by step in order to produce the categories. The abstract discusses the use of machine learning approaches for biodiversity analysis, specifically focusing on DNA barcodes and their role in species-level identification. The paper compares different approaches, such as supervised CNNs and fine-tuned transformers, and proposes a self-supervised method called BarcodeBERT. The work emphasizes the impact of dataset specifics and coverage on model selection and highlights the importance of self-supervised pretraining for accurate DNA barcode-based identification.
Categories: cs.LG, cs.DB
Ground-truth categories: {'cs.LG'}
Score: 1.0
While developing this notebook, we:
- Processed 7,537,982 input tokens ($0.0005 per 1K tokens)
- Generated 315,868 output tokens ($0.0015 per 1K tokens)
- With an estimated cost of < $5
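The estimate follows directly from the token counts and per-1K prices listed above; the short calculation below reproduces it, assuming both prices are in US dollars.

```python
input_tokens = 7_537_982
output_tokens = 315_868
price_in_per_1k = 0.0005   # assumed USD per 1K input tokens
price_out_per_1k = 0.0015  # assumed USD per 1K output tokens

input_cost = input_tokens / 1000 * price_in_per_1k
output_cost = output_tokens / 1000 * price_out_per_1k
total_cost = input_cost + output_cost
print(f"input: ${input_cost:.2f}, output: ${output_cost:.2f}, total: ${total_cost:.2f}")
```

Input tokens dominate the bill, which is expected: every call resends the full list of valid categories and, after optimization, the bootstrapped demonstrations.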
| Pipeline | Avg. metric_fn |
|---|---|
| pipeline101 | 56% |
| pipeline101_optimized | 65% |
Conclusions¶
Future Work¶
- Add RAG.
- Utilize the category descriptions.
- Use the full body of the paper.
  - Generate summaries.
  - Use a sliding window, process chunks, and aggregate.
  - Use a more capable language model with greater context length.
- Validate data with the Assert module.
- Use a smarter teacher (e.g., GPT-4).
- Experiment with more creative pipelines.
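The sliding-window item could look roughly like the sketch below: split the paper body into overlapping word chunks, predict categories per chunk, and aggregate by vote. Both helpers are hypothetical, the window sizes are arbitrary, and the per-chunk predictions are hard-coded placeholders where a pipeline call would go.

```python
from collections import Counter


def sliding_windows(words, size, stride):
    """Yield overlapping word chunks that together cover the whole text."""
    if size <= 0 or stride <= 0:
        raise ValueError("size and stride must be positive")
    start = 0
    while True:
        yield words[start:start + size]
        if start + size >= len(words):
            break  # the last chunk reached the end of the text
        start += stride


def aggregate_votes(chunk_predictions, top_n=2):
    """Rank categories by how many chunks predicted them."""
    votes = Counter(cat for preds in chunk_predictions for cat in preds)
    return [cat for cat, _ in votes.most_common(top_n)]


words = "a b c d e f g".split()
chunks = list(sliding_windows(words, size=4, stride=2))
print(chunks)

# Placeholder per-chunk predictions; a real run would call the pipeline per chunk.
chunk_predictions = [["cs.LG", "cs.AI"], ["cs.LG"], ["cs.DB", "cs.LG"]]
final_categories = aggregate_votes(chunk_predictions)
print(final_categories)
```

Vote counting is only one possible aggregation; weighting chunks by position or asking the LM to merge per-chunk summaries are alternatives.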
Why DSPy?¶
- It has promising core concepts.
- It is actively being developed.
- It is versatile.
Alternatives¶
Many frameworks exist that programmatically generate prompts and parse responses.
- Instructor: Provides structured outputs for Large Language Models (LLMs).
- Guidance: A guidance language for controlling large language models.
- LMQL: A language for constraint-guided and efficient LLM programming.
- Outlines: Supports structured text generation.
- ...
Guidance¶
"...constrain generation (e.g. with regex and CFGs) as well as to interleave control (conditional, loops) and generation seamlessly."
from guidance import models, select
# load a model
llama2 = models.LlamaCpp(path)
# a simple select between two options
llama2 + f'Do you want a joke or a poem? A ' + select(['joke', 'poem'])
Do you want a joke or a poem? A poem
Instructor¶
Validate LLM outputs to streamline data extraction.
import instructor
from openai import OpenAI
from pydantic import BaseModel

# Enables `response_model`
client = instructor.patch(OpenAI())


class UserDetail(BaseModel):
    name: str
    age: int


user = client.chat.completions.create(
    model="gpt-3.5-turbo",
    response_model=UserDetail,
    messages=[
        {"role": "user", "content": "Extract Jason is 25 years old"},
    ],
)
assert isinstance(user, UserDetail)
assert user.name == "Jason"
assert user.age == 25