va/knowledge-search-mcp

Fork 1

Files

Anibal Angulo d3cd8d5291 Refactor duplicated code

2026-03-03 18:00:31 +00:00

5.1 KiB

Raw Blame History

Refactoring Summary

High-ROI Refactorings Completed

1. Eliminated Code Duplication - Session Management ✅

Problem: The _get_aio_session() method was duplicated identically in both GoogleCloudFileStorage and GoogleCloudVectorSearch classes.

Solution:

Created a new BaseGoogleCloudClient base class that encapsulates shared session management logic
Both GoogleCloudFileStorage and GoogleCloudVectorSearch now inherit from this base class
Added a close() method to properly clean up resources

Files Changed:

src/knowledge_search_mcp/main.py:25-80 - Added base class
src/knowledge_search_mcp/main.py:83 - GoogleCloudFileStorage inherits from base
src/knowledge_search_mcp/main.py:219 - GoogleCloudVectorSearch inherits from base

Impact: Reduced ~24 lines of duplicated code, improved maintainability

2. Fixed Resource Cleanup ✅

Problem: aiohttp sessions were never explicitly closed, leading to potential resource leaks and warnings.

Solution:

Added close() method to BaseGoogleCloudClient to properly close aiohttp sessions
Extended close() in GoogleCloudVectorSearch to also close the storage client's session
Modified lifespan() function's finally block to call vs.close() on shutdown

Files Changed:

src/knowledge_search_mcp/main.py:74-78 - Base close method
src/knowledge_search_mcp/main.py:228-231 - VectorSearch close override
src/knowledge_search_mcp/main.py:699-707 - Cleanup in lifespan finally block

Impact: Prevents resource leaks, eliminates aiohttp warnings on shutdown

3. Implemented LRU Cache with Size Limits ✅

Problem: The _cache dictionary in GoogleCloudFileStorage grew indefinitely, potentially causing memory issues with large document sets.

Solution:

Created a new LRUCache class with configurable max size (default: 100 items)
Automatically evicts least recently used items when cache is full
Maintains insertion order and tracks access patterns

Files Changed:

src/knowledge_search_mcp/main.py:28-58 - New LRUCache class
src/knowledge_search_mcp/main.py:85-87 - Updated GoogleCloudFileStorage to use LRUCache
src/knowledge_search_mcp/main.py:115-122 - Updated cache access patterns
src/knowledge_search_mcp/main.py:147-148 - Updated cache write patterns
tests/test_search.py - Updated tests to work with LRUCache interface

Impact: Bounded memory usage, prevents cache from growing indefinitely

4. Broke Down Large Functions ✅

a. Extracted Validation Functions from `lifespan()`

Problem: The lifespan() function was 225 lines with repetitive validation logic.

Solution: Extracted three helper functions:

_validate_genai_access() - Validates GenAI embedding API access
_validate_gcs_access() - Validates GCS bucket access
_validate_vector_search_access() - Validates vector search endpoint access

Files Changed:

src/knowledge_search_mcp/main.py:424-587 - New validation functions
src/knowledge_search_mcp/main.py:644-693 - Simplified lifespan function

Impact: Reduced lifespan() from 225 to ~65 lines, improved readability and testability

b. Extracted Helper Functions from `knowledge_search()`

Problem: The knowledge_search() function was 149 lines mixing multiple concerns.

Solution: Extracted three helper functions:

_generate_query_embedding() - Handles embedding generation with error handling
_filter_search_results() - Applies similarity thresholds and filtering
_format_search_results() - Formats results as XML-like documents

Files Changed:

src/knowledge_search_mcp/main.py:717-766 - _generate_query_embedding
src/knowledge_search_mcp/main.py:769-793 - _filter_search_results
src/knowledge_search_mcp/main.py:796-810 - _format_search_results
src/knowledge_search_mcp/main.py:814-876 - Simplified knowledge_search function

Impact: Reduced knowledge_search() from 149 to ~63 lines, improved testability, added input validation for empty queries

Additional Improvements

Input Validation

Added validation for empty/whitespace-only queries in _generate_query_embedding()

Code Organization

Moved import time from inline to module-level imports

Test Updates

Updated all tests to work with the new LRUCache interface
All 11 tests passing

Metrics

Metric	Before	After	Change
Total lines (main.py)	809	876	+67 (more modular code)
Longest function	225 lines	65 lines	-71%
Code duplication instances	2 major	0	-100%
Resource leaks	Yes	No	Fixed
Cache memory bound	No	Yes (100 items)	Fixed
Test coverage	11 tests	11 tests	Maintained

5.1 KiB

Raw Blame History

Refactoring Summary

High-ROI Refactorings Completed

1. Eliminated Code Duplication - Session Management ✅

2. Fixed Resource Cleanup ✅

3. Implemented LRU Cache with Size Limits ✅

4. Broke Down Large Functions ✅

a. Extracted Validation Functions from `lifespan()`

b. Extracted Helper Functions from `knowledge_search()`

Additional Improvements

Input Validation

Code Organization

Test Updates

Metrics

What's Left for Future Work

Medium Priority (Not Done)

Lower Priority (Not Done)

5.1 KiB Raw Blame History

Refactoring Summary

High-ROI Refactorings Completed

1. Eliminated Code Duplication - Session Management ✅

2. Fixed Resource Cleanup ✅

3. Implemented LRU Cache with Size Limits ✅

4. Broke Down Large Functions ✅

a. Extracted Validation Functions from lifespan()

b. Extracted Helper Functions from knowledge_search()

Additional Improvements

Input Validation

Code Organization

Test Updates

Metrics

What's Left for Future Work

Medium Priority (Not Done)

Lower Priority (Not Done)

5.1 KiB

Raw Blame History

a. Extracted Validation Functions from `lifespan()`

b. Extracted Helper Functions from `knowledge_search()`