KB Storage
lean_automator.kb.storage
¶
Defines data structures and SQLite storage for a mathematical knowledge base.
This module provides the core data structures (KBItem
, LatexLink
, etc.)
representing mathematical concepts and their metadata within a knowledge base.
It also includes functions for creating, initializing, saving, and retrieving
these items from an SQLite database. The default database path is determined
by the KB_DB_PATH
environment variable or defaults to 'knowledge_base.sqlite'.
Functionality includes handling text embeddings (stored as BLOBs) associated
with natural language descriptions and LaTeX statements.
Classes¶
ItemType
¶
Bases: Enum
Enumerates the different types of items stored in the Knowledge Base.
Source code in lean_automator/kb/storage.py
Functions¶
requires_proof() -> bool
¶
Checks if this item type typically requires a proof.
Returns:
Name | Type | Description |
---|---|---|
bool |
bool
|
True if the item type is one that generally needs a proof |
bool
|
(Theorem, Lemma, Proposition, Corollary, Example), False otherwise. |
Source code in lean_automator/kb/storage.py
ItemStatus
¶
Bases: Enum
Enumerates the possible states of an item in the Knowledge Base lifecycle.
Source code in lean_automator/kb/storage.py
LatexLink
dataclass
¶
Represents a link to a specific component in an external LaTeX source.
Attributes:
Name | Type | Description |
---|---|---|
citation_text |
str
|
The text used for citing the external source (e.g., "[Knuth73]"). |
link_type |
str
|
The type of link, defaulting to 'statement'. Other potential values could be 'proof', 'definition', etc. |
source_identifier |
Optional[str]
|
A unique identifier for the external source document (e.g., DOI, arXiv ID, internal book code). |
latex_snippet |
Optional[str]
|
A short snippet of the LaTeX code from the external source related to this link. |
match_confidence |
Optional[float]
|
A confidence score (e.g., 0.0 to 1.0) indicating the likelihood that this link correctly points to the intended component. |
verified_by_human |
bool
|
Flag indicating if a human has confirmed the correctness of this link. Defaults to False. |
Source code in lean_automator/kb/storage.py
Functions¶
from_dict(data: Dict[str, Any]) -> LatexLink
classmethod
¶
Creates a LatexLink instance from a dictionary.
Sets a default 'link_type' of 'statement' if not present in the input dictionary.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
data
|
Dict[str, Any]
|
A dictionary containing keys corresponding to the attributes of LatexLink. |
required |
Returns:
Name | Type | Description |
---|---|---|
LatexLink |
LatexLink
|
An instance of the LatexLink class. |
Raises:
Type | Description |
---|---|
TypeError
|
If the dictionary keys do not match the LatexLink attributes or if the data types are incorrect. |
Source code in lean_automator/kb/storage.py
KBItem
dataclass
¶
Represents a single node (mathematical statement, definition, etc.) in the KB.
This dataclass holds all information related to a mathematical item, including its textual descriptions (NL and LaTeX), formal Lean code, embeddings for semantic search, relationships to other items (dependencies), links to external sources, and metadata tracking its processing status and history.
Attributes:
Name | Type | Description |
---|---|---|
id |
Optional[int]
|
The primary key identifier from the database. None if not saved yet. |
unique_name |
str
|
A unique human-readable or generated identifier (e.g., "lemma_xyz"). Defaults to a UUID-based name if not provided. |
item_type |
ItemType
|
The category of the mathematical item (e.g., Theorem, Definition). |
description_nl |
str
|
A natural language description of the item. |
latex_statement |
Optional[str]
|
The LaTeX representation of the item's statement. |
latex_proof |
Optional[str]
|
An informal proof written in LaTeX (only relevant for item types that require proof). Set to None otherwise. |
lean_code |
str
|
The formal Lean code representing the item, potentially incomplete (e.g., containing 'sorry'). |
embedding_nl |
Optional[bytes]
|
The embedding vector generated from
|
embedding_latex |
Optional[bytes]
|
The embedding vector generated only
from |
topic |
str
|
A general topic or subject area (e.g., "Group Theory", "Measure Theory"). |
plan_dependencies |
List[str]
|
A list of unique names of other KBItems identified as necessary prerequisites during proof planning or generation. |
dependencies |
List[str]
|
A list of unique names of other KBItems discovered as dependencies during Lean code compilation or validation. |
latex_links |
List[LatexLink]
|
A list of links to external LaTeX sources related to this item. |
status |
ItemStatus
|
The current processing status of the item in its lifecycle. |
failure_count |
int
|
A counter for the number of times a processing step (like validation or generation) has failed for this item. Defaults to 0. |
latex_review_feedback |
Optional[str]
|
Feedback provided by a reviewer if the generated LaTeX (statement or proof) was rejected. |
generation_prompt |
Optional[str]
|
The last prompt text sent to an LLM for generating content (LaTeX, Lean code, etc.) for this item. |
raw_ai_response |
Optional[str]
|
The last raw text response received from an LLM for this item. |
lean_error_log |
Optional[str]
|
Error messages captured from the Lean compiler
( |
created_at |
datetime
|
Timestamp when the item was first created (UTC). |
last_modified_at |
datetime
|
Timestamp when the item was last modified (UTC). |
Source code in lean_automator/kb/storage.py
197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 |
|
Functions¶
__post_init__()
¶
Performs basic validation and normalization after dataclass initialization.
Ensures unique_name
is not empty, converts lean_code
to string if necessary,
ensures timestamps are timezone-aware (UTC), and sets latex_proof
to None
if the item_type
does not require a proof.
Raises:
Type | Description |
---|---|
ValueError
|
If |
Source code in lean_automator/kb/storage.py
update_status(new_status: ItemStatus, error_log: Optional[str] = _sentinel, review_feedback: Optional[str] = _sentinel)
¶
Updates the item's status and optionally clears/sets related logs.
Also updates the last_modified_at
timestamp. Error logs and review
feedback are conditionally cleared when moving to certain "successful"
states (PROVEN, LATEX_ACCEPTED).
Parameters:
Name | Type | Description | Default |
---|---|---|---|
new_status
|
ItemStatus
|
The new status to set for the item. |
required |
error_log
|
Optional[str]
|
If provided (and not the sentinel), sets
the |
_sentinel
|
review_feedback
|
Optional[str]
|
If provided (and not the
sentinel), sets the |
_sentinel
|
Raises:
Type | Description |
---|---|
TypeError
|
If |
Source code in lean_automator/kb/storage.py
add_plan_dependency(dep_unique_name: str)
¶
Adds a unique name to the list of planning dependencies.
Updates last_modified_at
if the dependency was added.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
dep_unique_name
|
str
|
The unique name of the KBItem dependency. |
required |
Source code in lean_automator/kb/storage.py
add_dependency(dep_unique_name: str)
¶
Adds a unique name to the list of discovered Lean dependencies.
Updates last_modified_at
if the dependency was added.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
dep_unique_name
|
str
|
The unique name of the KBItem dependency. |
required |
Source code in lean_automator/kb/storage.py
add_latex_link(link: LatexLink)
¶
Adds a LatexLink object to the item's list of links.
Updates last_modified_at
.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
link
|
LatexLink
|
The LatexLink object to add. |
required |
Source code in lean_automator/kb/storage.py
increment_failure_count()
¶
Increments the general failure counter and updates modification time.
to_dict_for_db() -> Dict[str, Any]
¶
Serializes the KBItem instance into a dictionary for database storage.
Converts complex types (enums, lists, datetimes, LatexLink list) into
database-compatible formats (strings, JSON strings). Excludes embedding
fields, as they are typically handled separately (e.g., direct BLOB updates).
Sets latex_proof
to None if the item type does not require one.
Returns:
Type | Description |
---|---|
Dict[str, Any]
|
Dict[str, Any]: A dictionary representation of the item ready for DB |
Dict[str, Any]
|
insertion/update. |
Source code in lean_automator/kb/storage.py
from_db_dict(data: Dict[str, Any]) -> KBItem
classmethod
¶
Creates a KBItem instance from a dictionary retrieved from the database.
Handles deserialization of complex types (enums from names, lists from
JSON strings, datetimes from ISO strings, LatexLink list from JSON).
Includes error handling for invalid enum values or JSON decoding errors.
Retrieves embedding fields directly if present in the input dictionary.
Loads latex_proof
only if the determined item_type
requires it.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
data
|
Dict[str, Any]
|
A dictionary representing a row fetched from the
|
required |
Returns:
Name | Type | Description |
---|---|---|
KBItem |
KBItem
|
An instance of the KBItem class populated with data from the |
KBItem
|
dictionary. |
Raises:
Type | Description |
---|---|
ValueError
|
If deserialization fails due to missing keys, invalid enum names, JSON decoding errors, or other type mismatches. |
Source code in lean_automator/kb/storage.py
440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 |
|
Functions¶
get_db_connection(db_path: Optional[str] = None) -> sqlite3.Connection
¶
Establishes and configures an SQLite database connection.
Uses the provided db_path
or falls back to the DEFAULT_DB_PATH
.
Configures the connection to use sqlite3.Row
factory for dictionary-like
row access and enables foreign key constraint enforcement.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
db_path
|
Optional[str]
|
The path to the SQLite database file. If None,
uses the path defined by |
None
|
Returns:
Type | Description |
---|---|
Connection
|
sqlite3.Connection: An configured SQLite database connection object. |
Source code in lean_automator/kb/storage.py
initialize_database(db_path: Optional[str] = None)
¶
Initializes the database schema, creating or updating the kb_items
table.
Ensures the kb_items
table exists with the required columns and data types.
Uses _add_column_if_not_exists
for robustness, allowing the function to
add missing columns to an existing table (useful for schema evolution).
Creates necessary indexes on key columns for efficient querying.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
db_path
|
Optional[str]
|
Path to the database file. If None, uses
|
None
|
Source code in lean_automator/kb/storage.py
save_kb_item(item: KBItem, client: Optional[GeminiClient] = None, db_path: Optional[str] = None) -> KBItem
async
¶
Saves a KBItem to the database (INSERT/UPDATE) and handles embedding generation.
This function performs an UPSERT operation based on the item's unique_name
.
If the item is new or if latex_statement
or description_nl
has changed
compared to the existing database record, it attempts to generate new embeddings
using the provided GeminiClient
(if available). Embeddings are generated only
from latex_statement
(not latex_proof
) and description_nl
. Embeddings
are updated in the DB only if they have changed or are newly generated.
The item's last_modified_at
timestamp is always updated.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
item
|
KBItem
|
The KBItem object to save. Its |
required |
client
|
Optional[GeminiClient]
|
An initialized GeminiClient instance. Required if embedding generation is needed based on content changes. If None, embeddings will not be generated or updated. |
None
|
db_path
|
Optional[str]
|
Path to the database file. If None, uses
|
None
|
Returns:
Name | Type | Description |
---|---|---|
KBItem |
KBItem
|
The saved KBItem object, updated with its database |
KBItem
|
newly generated/updated |
Raises:
Type | Description |
---|---|
Error
|
If a database error occurs during the UPSERT or embedding update. |
TypeError
|
If the |
ValueError
|
Can be raised by |
Source code in lean_automator/kb/storage.py
684 685 686 687 688 689 690 691 692 693 694 695 696 697 698 699 700 701 702 703 704 705 706 707 708 709 710 711 712 713 714 715 716 717 718 719 720 721 722 723 724 725 726 727 728 729 730 731 732 733 734 735 736 737 738 739 740 741 742 743 744 745 746 747 748 749 750 751 752 753 754 755 756 757 758 759 760 761 762 763 764 765 766 767 768 769 770 771 772 773 774 775 776 777 778 779 780 781 782 783 784 785 786 787 788 789 790 791 792 793 794 795 796 797 798 799 800 801 802 803 804 805 806 807 808 809 810 811 812 813 814 815 816 817 818 819 820 821 822 823 824 825 826 827 828 829 830 831 832 833 834 835 836 837 838 839 840 841 842 843 844 845 846 847 848 849 850 851 852 853 854 855 856 857 858 859 860 861 862 863 864 865 866 867 868 869 870 871 872 873 874 875 876 877 878 879 880 881 882 883 884 885 886 887 888 889 890 891 892 893 894 895 896 897 898 899 900 901 902 903 904 905 906 907 908 909 910 911 912 913 914 915 916 917 918 919 920 921 922 923 924 925 926 927 928 929 930 931 932 933 934 935 936 937 938 939 940 941 942 943 944 945 |
|
get_kb_item_by_id(item_id: int, db_path: Optional[str] = None) -> Optional[KBItem]
¶
Retrieves a single KBItem from the database by its primary key ID.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
item_id
|
int
|
The integer primary key ID of the item to retrieve. |
required |
db_path
|
Optional[str]
|
Path to the database file. If None, uses
|
None
|
Returns:
Type | Description |
---|---|
Optional[KBItem]
|
Optional[KBItem]: The retrieved KBItem object if found and deserialization |
Optional[KBItem]
|
succeeds, otherwise None. Logs an error if deserialization fails. |
Source code in lean_automator/kb/storage.py
get_kb_item_by_name(unique_name: str, db_path: Optional[str] = None) -> Optional[KBItem]
¶
Retrieves a single KBItem from the database by its unique name.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
unique_name
|
str
|
The unique string name of the item to retrieve. |
required |
db_path
|
Optional[str]
|
Path to the database file. If None, uses
|
None
|
Returns:
Type | Description |
---|---|
Optional[KBItem]
|
Optional[KBItem]: The retrieved KBItem object if found and deserialization |
Optional[KBItem]
|
succeeds, otherwise None. Logs an error if deserialization fails. |
Source code in lean_automator/kb/storage.py
get_items_by_status(status: ItemStatus, db_path: Optional[str] = None) -> Generator[KBItem, None, None]
¶
Retrieves all KBItems matching a specific status, yielding them one by one.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
status
|
ItemStatus
|
The status enum member to filter by. |
required |
db_path
|
Optional[str]
|
Path to the database file. If None, uses
|
None
|
Yields:
Name | Type | Description |
---|---|---|
KBItem |
KBItem
|
KBItem objects matching the specified status. Skips items that |
KBItem
|
fail deserialization and logs an error. |
Raises:
Type | Description |
---|---|
TypeError
|
If the provided |
Source code in lean_automator/kb/storage.py
get_items_by_topic(topic_prefix: str, db_path: Optional[str] = None) -> Generator[KBItem, None, None]
¶
Retrieves KBItems whose topic starts with the given prefix, yielding them.
Performs a case-sensitive prefix search using the SQL LIKE operator.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
topic_prefix
|
str
|
The prefix string to match against the beginning of the item's topic. |
required |
db_path
|
Optional[str]
|
Path to the database file. If None, uses
|
None
|
Yields:
Name | Type | Description |
---|---|---|
KBItem |
KBItem
|
KBItem objects whose topic matches the prefix. Skips items that |
KBItem
|
fail deserialization and logs an error. |
Source code in lean_automator/kb/storage.py
get_all_items_with_embedding(embedding_field: str, db_path: Optional[str] = None) -> List[Tuple[int, str, bytes]]
¶
Retrieves minimal info (ID, name, embedding) for items with an embedding.
Fetches the primary key ID, unique name, and the raw embedding bytes for all
items where the specified embedding field (embedding_nl
or embedding_latex
)
is not NULL. This is useful for bulk operations like vector similarity searches.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
embedding_field
|
str
|
The name of the embedding column to retrieve (must be either 'embedding_nl' or 'embedding_latex'). |
required |
db_path
|
Optional[str]
|
Path to the database file. If None, uses
|
None
|
Returns:
Type | Description |
---|---|
List[Tuple[int, str, bytes]]
|
List[Tuple[int, str, bytes]]: A list of tuples, where each tuple contains |
List[Tuple[int, str, bytes]]
|
the item's ID (int), unique name (str), and the embedding data (bytes). |
List[Tuple[int, str, bytes]]
|
Returns an empty list if no items have the specified embedding or if a |
List[Tuple[int, str, bytes]]
|
database error occurs. |
Raises:
Type | Description |
---|---|
ValueError
|
If |