print(re.sub(r"[\s\n]+", " ", valset[0].title), "\n")
print(re.sub(r"[\s\n]+", " ", valset[0].abstract)[:400], "...\n")
print(valset[0].labels().categories, "\n")

Numerical Investigation of Graph Spectra and Information Interpretability of Eigenvalues 

 We undertake an extensive numerical investigation of the graph spectra of thousands regular graphs, a set of random Erd\"os-R\'enyi graphs, the two most popular types of complex networks and an evolving genetic network by using novel conceptual and experimental tools. Our objective in so doing is to contribute to an understanding of the meaning of the Eigenvalues of a graph relative to its topolo ...

{'cs.IT'}

def metric_fn(labels, preds, trace=None):
    preds: list[str] | str = preds.categories
    labels: list[str] = labels.categories

    # We assume that predicted categories are sorted by relevance,
    # so we keep only the top-k of them
    k = min(len(labels), len(preds))
    top_k_preds = preds[:k] if isinstance(preds, list) else [preds]

    # Ground-truth labels are alphabetically sorted,
    # so it makes sense to look at their intersection with top_k_preds
    top_k_pred_set: set[str] = set(top_k_preds)
    labels_set: set[str] = set(labels)

    score: float = len(top_k_pred_set & labels_set) / len(labels)
    return score
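
To make the metric concrete, here is a small worked example using the two BarcodeBERT predictions that appear in the prompt traces further down. It is only a sketch: `SimpleNamespace` stands in for the DSPy example and prediction objects (an assumption; only a `.categories` attribute is needed), and `top_k_overlap` is a hypothetical standalone re-statement of the same scoring rule.

```python
from types import SimpleNamespace


def top_k_overlap(labels, preds):
    """Standalone re-statement of the scoring rule used by metric_fn."""
    label_set = set(labels.categories)
    ranked = preds.categories if isinstance(preds.categories, list) else [preds.categories]
    # Keep only as many predictions as there are ground-truth labels.
    k = min(len(label_set), len(ranked))
    return len(set(ranked[:k]) & label_set) / len(label_set)


labels = SimpleNamespace(categories={"cs.LG"})

# Unoptimized output: the correct category is ranked second, so it is cut off at k=1.
unoptimized = SimpleNamespace(categories=["cs.AI", "cs.LG", "cs.DB"])
# Optimized output: the correct category is ranked first.
optimized = SimpleNamespace(categories=["cs.LG", "cs.DB"])

score_unoptimized = top_k_overlap(labels, unoptimized)
score_optimized = top_k_overlap(labels, optimized)
print(score_unoptimized, score_optimized)
```

Because only the top-k predictions count, the order in which the model lists categories matters as much as which categories it lists.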

class PredictCategories(dspy.Signature):
    __doc__ = (
        f"Given the abstract of a scientific paper, "
        f"identify most relevant categories. "
        f"Valid categories are {CATEGORIES.keys()}"
    )
    title = dspy.InputField()
    abstract = dspy.InputField()
    categories = dspy.OutputField(
        desc="list of comma-separated categories",
        format=lambda x: ", ".join(x) if isinstance(x, list) else x,
    )
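
One detail of this signature shows up in the prompt traces below: interpolating `CATEGORIES.keys()` into an f-string renders the `dict_keys([...])` wrapper verbatim. The sketch below uses a hypothetical three-entry `CATEGORIES` dict (the descriptions are assumed) to contrast that with joining the keys into a plain comma-separated string.

```python
# Hypothetical stand-in for the notebook's CATEGORIES mapping (code -> description).
CATEGORIES = {
    "cs.AI": "Artificial Intelligence",
    "cs.LG": "Machine Learning",
    "cs.DB": "Databases",
}

# What the signature docstring currently interpolates.
raw = f"Valid categories are {CATEGORIES.keys()}"

# A possible alternative: join the keys explicitly.
joined = f"Valid categories are {', '.join(CATEGORIES)}"

print(raw)
print(joined)
```

Whether the cleaner string changes model behavior is an open question; it mainly keeps Python internals out of the prompt.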

class Pipeline101(dspy.Module):
    def __init__(self):
        super().__init__()
        self.predict = dspy.ChainOfThought(PredictCategories)

    def forward(self, title, abstract, text=None, labels=None):
        categories = self.predict(title=title, abstract=abstract).completions.categories
        categories = [cat.strip() for cat in categories[0].split(",")]
        return dspy.Prediction(categories=categories)
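
The `forward` method trusts the model to emit a clean comma-separated list. A slightly more defensive parser might also drop duplicates and anything outside the valid label set. The helper below is a hypothetical sketch, not part of the original pipeline; `parse_categories` and the small `VALID` set are illustrative names.

```python
def parse_categories(raw: str, valid: set[str]) -> list[str]:
    """Split a comma-separated string, keeping order, dropping unknowns and duplicates."""
    seen: set[str] = set()
    parsed: list[str] = []
    for item in raw.split(","):
        # Strip whitespace and stray quotes the model may add around a label.
        cat = item.strip().strip("'\"")
        if cat in valid and cat not in seen:
            seen.add(cat)
            parsed.append(cat)
    return parsed


VALID = {"cs.AI", "cs.LG", "cs.DB"}  # illustrative subset of the valid labels
result = parse_categories("cs.LG, 'cs.DB', cs.LG, biology", VALID)
print(result)
```

Preserving the original order matters here, since the metric treats the list as ranked by relevance.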

# You can host a local model with Ollama.
# Just change `model` and `api_base` accordingly.
# For example: `model="gemma"` & `api_base="http://localhost:11434/v1/"`
lm = dspy.OpenAI(
    model="gpt3.5-turbo",
    api_base=f"http://localhost:{PORT_LM}/v1/",
    api_key="your-api-key",
    model_type="chat",
)

# Configure DSPy to use `lm` as the language model
dspy.settings.configure(lm=lm)

# Quick check that the LM responds
lm("What's red + yellow?")

['Red + yellow equals orange.']

# This is not optimized
pipeline101 = Pipeline101()

from dspy.teleprompt import BootstrapFewShotWithRandomSearch

optimizer = BootstrapFewShotWithRandomSearch(
    metric=metric_fn,
    max_bootstrapped_demos=2,
    max_labeled_demos=0,
    max_rounds=1,
    num_candidate_programs=20,
    num_threads=8,
    teacher_settings=dict(lm=lm),
)

pipeline101_optimized = optimizer.compile(
    pipeline101,
    teacher=pipeline101,
    trainset=trainset,
    valset=valset,
)
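
Conceptually, this kind of optimizer tries several candidate sets of few-shot demonstrations and keeps the one that scores best on the validation set. The toy below illustrates only that selection loop in plain Python. It is not DSPy's implementation: `random_search`, `toy_score`, and the demo pool are all hypothetical, and the scoring function is a stand-in for running the pipeline and averaging `metric_fn`. The defaults loosely mirror `num_candidate_programs=20` and `max_bootstrapped_demos=2` above.

```python
import random


def random_search(demo_pool, score_fn, num_candidates=20, demos_per_candidate=2, seed=0):
    """Sample candidate demo subsets and return the best-scoring one."""
    rng = random.Random(seed)
    best_demos, best_score = None, float("-inf")
    for _ in range(num_candidates):
        candidate = rng.sample(demo_pool, demos_per_candidate)
        score = score_fn(candidate)
        if score > best_score:
            best_demos, best_score = candidate, score
    return best_demos, best_score


def toy_score(demos):
    # Stand-in for "average metric on valset"; here longer demo names score higher.
    return sum(len(d) for d in demos) / 100


pool = ["demo_a", "demo_bb", "demo_ccc", "demo_dddd"]
best_demos, best_score = random_search(pool, toy_score)
print(best_demos, best_score)
```

In the real setting each call to the scoring function means running the whole pipeline over the validation set, which is why the number of candidates drives the cost of optimization.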

scores_pipeline101 = []
for example in testset:
    example_x = example.inputs()
    example_y = example.labels()
    prediction = pipeline101(**example_x)
    score = metric_fn(example_y, prediction)
    scores_pipeline101.append(score)

# Inspect the last prompt sent to the LM
lm.inspect_history()
print("Ground-truth categories:", example.labels().categories)
print("Score:", score)

print("\n" * 5)



Given the abstract of a scientific paper, identify most relevant categories. Valid categories are dict_keys(['cs.AI', 'cs.AR', 'cs.CC', 'cs.CE', 'cs.CG', 'cs.CL', 'cs.CR', 'cs.CV', 'cs.CY', 'cs.DB', 'cs.DC', 'cs.DL', 'cs.DM', 'cs.DS', 'cs.ET', 'cs.FL', 'cs.GL', 'cs.GR', 'cs.GT', 'cs.HC', 'cs.IR', 'cs.IT', 'cs.LG', 'cs.LO', 'cs.MA', 'cs.MM', 'cs.MS', 'cs.NA', 'cs.NE', 'cs.NI', 'cs.OH', 'cs.OS', 'cs.PF', 'cs.PL', 'cs.RO', 'cs.SC', 'cs.SD', 'cs.SE', 'cs.SI', 'cs.SY'])

---

Follow the following format.

Title: ${title}

Abstract: ${abstract}

Reasoning: Let's think step by step in order to ${produce the categories}. We ...

Categories: list of comma-separated categories

---

Title: BarcodeBERT: Transformers for Biodiversity Analysis

Abstract: Understanding biodiversity is a global challenge, in which DNA barcodes - short snippets of DNA that cluster by species - play a pivotal role. In particular, invertebrates, a highly diverse and under-explored group, pose unique taxonomic complexities. We explore machine learning approaches, comparing supervised CNNs, fine-tuned foundation models, and a DNA barcode-specific masking strategy across datasets of varying complexity. While simpler datasets and tasks favor supervised CNNs or fine-tuned transformers, challenging species-level identification demands a paradigm shift towards self-supervised pretraining. We propose BarcodeBERT, the first self-supervised method for general biodiversity analysis, leveraging a 1.5 M invertebrate DNA barcode reference library. This work highlights how dataset specifics and coverage impact model selection, and underscores the role of self-supervised pretraining in achieving high-accuracy DNA barcode-based identification at the species and genus level. Indeed, without the fine-tuning step, BarcodeBERT pretrained on a large DNA barcode dataset outperforms DNABERT and DNABERT-2 on multiple downstream classification tasks. The code repository is available at https://github.com/Kari-Genomics-Lab/BarcodeBERT

Reasoning: Let's think step by step in order to produce the categories. We can see that the paper is about biodiversity analysis using machine learning approaches. It specifically focuses on DNA barcodes and their role in species-level identification. The paper also mentions the comparison of different models and the proposal of a new self-supervised method called BarcodeBERT. Based on this information, the most relevant categories are 'cs.AI' (Artificial Intelligence), 'cs.LG' (Machine Learning), and 'cs.DB' (Databases). 

Categories: cs.AI, cs.LG, cs.DB


Ground-truth categories: {'cs.LG'}
Score: 0.0

scores_pipeline101_optimized = []
for example in testset:
    example_x = example.inputs()
    example_y = example.labels()
    prediction = pipeline101_optimized(**example_x)
    score = metric_fn(example_y, prediction)
    scores_pipeline101_optimized.append(score)

lm.inspect_history()
print("Ground-truth categories:", example.labels().categories)
print("Score:", score)
print("\n" * 5)



Given the abstract of a scientific paper, identify most relevant categories. Valid categories are dict_keys(['cs.AI', 'cs.AR', 'cs.CC', 'cs.CE', 'cs.CG', 'cs.CL', 'cs.CR', 'cs.CV', 'cs.CY', 'cs.DB', 'cs.DC', 'cs.DL', 'cs.DM', 'cs.DS', 'cs.ET', 'cs.FL', 'cs.GL', 'cs.GR', 'cs.GT', 'cs.HC', 'cs.IR', 'cs.IT', 'cs.LG', 'cs.LO', 'cs.MA', 'cs.MM', 'cs.MS', 'cs.NA', 'cs.NE', 'cs.NI', 'cs.OH', 'cs.OS', 'cs.PF', 'cs.PL', 'cs.RO', 'cs.SC', 'cs.SD', 'cs.SE', 'cs.SI', 'cs.SY'])

---

Follow the following format.

Title: ${title}

Abstract: ${abstract}

Reasoning: Let's think step by step in order to ${produce the categories}. We ...

Categories: list of comma-separated categories

---

Title: Explicit shading strategies for repeated truthful auctions

Abstract: With the increasing use of auctions in online advertising, there has been a large effort to study seller revenue maximization, following Myerson's seminal work, both theoretically and practically. We take the point of view of the buyer in classical auctions and ask the question of whether she has an incentive to shade her bid even in auctions that are reputed to be truthful, when aware of the revenue optimization mechanism. We show that in auctions such as the Myerson auction or a VCG with reserve price set as the monopoly price, the buyer who is aware of this information has indeed an incentive to shade. Intuitively, by selecting the revenue maximizing auction, the seller introduces a dependency on the buyers' distributions in the choice of the auction. We study in depth the case of the Myerson auction and show that a symmetric equilibrium exists in which buyers shade non-linearly what would be their first price bid. They then end up with an expected payoff that is equal to what they would get in a first price auction with no reserve price. We conclude that a return to simple first price auctions with no reserve price or at least non-dynamic anonymous ones is desirable from the point of view of both buyers, sellers and increasing transparency.

Reasoning: Let's think step by step in order to produce the categories. We can start by identifying the main topic of the paper, which appears to be auctions and revenue optimization. From there, we can consider the specific aspects of auctions that are discussed, such as truthful auctions, shading strategies, and the impact of buyer awareness on bidding behavior. Additionally, the paper mentions the Myerson auction and VCG with reserve price as specific auction mechanisms. Finally, the conclusion suggests that a return to simple first price auctions or non-dynamic anonymous auctions may be desirable.

Categories: cs.GT, cs.IR

---

Title: Comments on "Gang EDF Schedulability Analysis"

Abstract: This short report raises a correctness issue in the schedulability test presented in Kato et al., "Gang EDF Scheduling of Parallel Task Systems", 30th IEEE Real-Time Systems Symposium, 2009, pp. 459-468.

Reasoning: Let's think step by step in order to produce the categories. We can see that this abstract is discussing a specific paper and raising a correctness issue in the schedulability test presented in that paper. Therefore, the most relevant category for this abstract is 'cs.OS' (Operating Systems).

Categories: cs.OS

---

Title: BarcodeBERT: Transformers for Biodiversity Analysis

Abstract: Understanding biodiversity is a global challenge, in which DNA barcodes - short snippets of DNA that cluster by species - play a pivotal role. In particular, invertebrates, a highly diverse and under-explored group, pose unique taxonomic complexities. We explore machine learning approaches, comparing supervised CNNs, fine-tuned foundation models, and a DNA barcode-specific masking strategy across datasets of varying complexity. While simpler datasets and tasks favor supervised CNNs or fine-tuned transformers, challenging species-level identification demands a paradigm shift towards self-supervised pretraining. We propose BarcodeBERT, the first self-supervised method for general biodiversity analysis, leveraging a 1.5 M invertebrate DNA barcode reference library. This work highlights how dataset specifics and coverage impact model selection, and underscores the role of self-supervised pretraining in achieving high-accuracy DNA barcode-based identification at the species and genus level. Indeed, without the fine-tuning step, BarcodeBERT pretrained on a large DNA barcode dataset outperforms DNABERT and DNABERT-2 on multiple downstream classification tasks. The code repository is available at https://github.com/Kari-Genomics-Lab/BarcodeBERT

Reasoning: Let's think step by step in order to produce the categories. The abstract discusses the use of machine learning approaches for biodiversity analysis, specifically focusing on DNA barcodes and their role in species-level identification. The paper compares different approaches, such as supervised CNNs and fine-tuned transformers, and proposes a self-supervised method called BarcodeBERT. The work emphasizes the impact of dataset specifics and coverage on model selection and highlights the importance of self-supervised pretraining for accurate DNA barcode-based identification. 

Categories: cs.LG, cs.DB


Ground-truth categories: {'cs.LG'}
Score: 1.0

from guidance import models, select

# load a model
llama2 = models.LlamaCpp(path)

# a simple select between two options
llama2 + f'Do you want a joke or a poem? A ' + select(['joke', 'poem'])

import instructor
from openai import OpenAI
from pydantic import BaseModel

# Enables `response_model`
client = instructor.patch(OpenAI())


class UserDetail(BaseModel):
    name: str
    age: int


user = client.chat.completions.create(
    model="gpt-3.5-turbo",
    response_model=UserDetail,
    messages=[
        {"role": "user", "content": "Extract Jason is 25 years old"},
    ],
)

assert isinstance(user, UserDetail)
assert user.name == "Jason"
assert user.age == 25

DSPy-arXiv

Dataset

Metrics

DSPy

I/O interface:

Modular structure:

"Optimization"

Pipeline 101

Signature

Pipeline / Module

Language Model

Optimization

Results

Conclusions

Future Work

Why DSPy?

Why Not DSPy?

Alternatives

Guidance

Instructor

Pipeline	Avg. metric_fn
pipeline101	56%
pipeline101_optimized	65%
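
The averages in the table are presumably the mean of the per-example scores collected in `scores_pipeline101` and `scores_pipeline101_optimized`. A minimal sketch of that aggregation follows, using made-up score lists and placeholder pipeline names rather than the real results:

```python
from statistics import mean

# Made-up per-example scores for illustration only; not the notebook's results.
scores_by_pipeline = {
    "pipeline_a": [1.0, 0.0, 0.5, 0.5],
    "pipeline_b": [1.0, 0.5, 0.5, 1.0],
}

# Average each list and format it as a percentage, as in the table.
summary = {name: f"{mean(scores):.0%}" for name, scores in scores_by_pipeline.items()}

for name, avg in summary.items():
    print(f"{name}\t{avg}")
```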