Refactor duplicated code

2026-03-03 17:07:15 +00:00
parent 8dfd2048a5
commit d3cd8d5291
3 changed files with 515 additions and 235 deletions
--- a/REFACTORING_SUMMARY.md
+++ b/REFACTORING_SUMMARY.md
@@ -0,0 +1,136 @@
+# Refactoring Summary
+
+## High-ROI Refactorings Completed
+
+### 1. Eliminated Code Duplication - Session Management ✅
+
+**Problem**: The `_get_aio_session()` method was duplicated identically in both `GoogleCloudFileStorage` and `GoogleCloudVectorSearch` classes.
+
+**Solution**:
+- Created a new `BaseGoogleCloudClient` base class that encapsulates shared session management logic
+- Both `GoogleCloudFileStorage` and `GoogleCloudVectorSearch` now inherit from this base class
+- Added a `close()` method to properly clean up resources
+
+**Files Changed**:
+- `src/knowledge_search_mcp/main.py:25-80` - Added base class
+- `src/knowledge_search_mcp/main.py:83` - GoogleCloudFileStorage inherits from base
+- `src/knowledge_search_mcp/main.py:219` - GoogleCloudVectorSearch inherits from base
+
+**Impact**: Reduced ~24 lines of duplicated code, improved maintainability
+
+---
+
+### 2. Fixed Resource Cleanup ✅
+
+**Problem**: aiohttp sessions were never explicitly closed, leading to potential resource leaks and warnings.
+
+**Solution**:
+- Added `close()` method to `BaseGoogleCloudClient` to properly close aiohttp sessions
+- Extended `close()` in `GoogleCloudVectorSearch` to also close the storage client's session
+- Modified `lifespan()` function's finally block to call `vs.close()` on shutdown
+
+**Files Changed**:
+- `src/knowledge_search_mcp/main.py:74-78` - Base close method
+- `src/knowledge_search_mcp/main.py:228-231` - VectorSearch close override
+- `src/knowledge_search_mcp/main.py:699-707` - Cleanup in lifespan finally block
+
+**Impact**: Prevents resource leaks, eliminates aiohttp warnings on shutdown
+
+---
+
+### 3. Implemented LRU Cache with Size Limits ✅
+
+**Problem**: The `_cache` dictionary in `GoogleCloudFileStorage` grew indefinitely, potentially causing memory issues with large document sets.
+
+**Solution**:
+- Created a new `LRUCache` class with configurable max size (default: 100 items)
+- Automatically evicts least recently used items when cache is full
+- Maintains insertion order and tracks access patterns
+
+**Files Changed**:
+- `src/knowledge_search_mcp/main.py:28-58` - New LRUCache class
+- `src/knowledge_search_mcp/main.py:85-87` - Updated GoogleCloudFileStorage to use LRUCache
+- `src/knowledge_search_mcp/main.py:115-122` - Updated cache access patterns
+- `src/knowledge_search_mcp/main.py:147-148` - Updated cache write patterns
+- `tests/test_search.py` - Updated tests to work with LRUCache interface
+
+**Impact**: Bounded memory usage, prevents cache from growing indefinitely
+
+---
+
+### 4. Broke Down Large Functions ✅
+
+#### a. Extracted Validation Functions from `lifespan()`
+
+**Problem**: The `lifespan()` function was 225 lines with repetitive validation logic.
+
+**Solution**: Extracted three helper functions:
+- `_validate_genai_access()` - Validates GenAI embedding API access
+- `_validate_gcs_access()` - Validates GCS bucket access
+- `_validate_vector_search_access()` - Validates vector search endpoint access
+
+**Files Changed**:
+- `src/knowledge_search_mcp/main.py:424-587` - New validation functions
+- `src/knowledge_search_mcp/main.py:644-693` - Simplified lifespan function
+
+**Impact**: Reduced lifespan() from 225 to ~65 lines, improved readability and testability
+
+#### b. Extracted Helper Functions from `knowledge_search()`
+
+**Problem**: The `knowledge_search()` function was 149 lines mixing multiple concerns.
+
+**Solution**: Extracted three helper functions:
+- `_generate_query_embedding()` - Handles embedding generation with error handling
+- `_filter_search_results()` - Applies similarity thresholds and filtering
+- `_format_search_results()` - Formats results as XML-like documents
+
+**Files Changed**:
+- `src/knowledge_search_mcp/main.py:717-766` - _generate_query_embedding
+- `src/knowledge_search_mcp/main.py:769-793` - _filter_search_results
+- `src/knowledge_search_mcp/main.py:796-810` - _format_search_results
+- `src/knowledge_search_mcp/main.py:814-876` - Simplified knowledge_search function
+
+**Impact**: Reduced knowledge_search() from 149 to ~63 lines, improved testability, added input validation for empty queries
+
+---
+
+## Additional Improvements
+
+### Input Validation
+- Added validation for empty/whitespace-only queries in `_generate_query_embedding()`
+
+### Code Organization
+- Moved `import time` from inline to module-level imports
+
+### Test Updates
+- Updated all tests to work with the new LRUCache interface
+- All 11 tests passing
+
+---
+
+## Metrics
+
+| Metric | Before | After | Change |
+|--------|--------|-------|--------|
+| Total lines (main.py) | 809 | 876 | +67 (more modular code) |
+| Longest function | 225 lines | 65 lines | -71% |
+| Code duplication instances | 2 major | 0 | -100% |
+| Resource leaks | Yes | No | Fixed |
+| Cache memory bound | No | Yes (100 items) | Fixed |
+| Test coverage | 11 tests | 11 tests | Maintained |
+
+---
+
+## What's Left for Future Work
+
+### Medium Priority (Not Done)
+- Move magic numbers to Settings configuration
+- Update outdated DockerfileConnector
+- Review and adjust logging levels
+- Add dependency injection for tighter coupling issues
+
+### Lower Priority (Not Done)
+- Add integration tests for end-to-end flows
+- Add performance tests
+- Introduce abstraction layers for cloud services
+- Standardize on f-strings (one %-format remaining)