Refactor duplicated code

This commit is contained in:
2026-03-03 17:07:15 +00:00
parent 8dfd2048a5
commit d3cd8d5291
3 changed files with 515 additions and 235 deletions

136
REFACTORING_SUMMARY.md Normal file
View File

@@ -0,0 +1,136 @@
# Refactoring Summary
## High-ROI Refactorings Completed
### 1. Eliminated Code Duplication - Session Management ✅
**Problem**: The `_get_aio_session()` method was duplicated identically in both `GoogleCloudFileStorage` and `GoogleCloudVectorSearch` classes.
**Solution**:
- Created a new `BaseGoogleCloudClient` base class that encapsulates shared session management logic
- Both `GoogleCloudFileStorage` and `GoogleCloudVectorSearch` now inherit from this base class
- Added a `close()` method to properly clean up resources
**Files Changed**:
- `src/knowledge_search_mcp/main.py:25-80` - Added base class
- `src/knowledge_search_mcp/main.py:83` - GoogleCloudFileStorage inherits from base
- `src/knowledge_search_mcp/main.py:219` - GoogleCloudVectorSearch inherits from base
**Impact**: Reduced ~24 lines of duplicated code, improved maintainability
---
### 2. Fixed Resource Cleanup ✅
**Problem**: aiohttp sessions were never explicitly closed, leading to potential resource leaks and warnings.
**Solution**:
- Added `close()` method to `BaseGoogleCloudClient` to properly close aiohttp sessions
- Extended `close()` in `GoogleCloudVectorSearch` to also close the storage client's session
- Modified `lifespan()` function's finally block to call `vs.close()` on shutdown
**Files Changed**:
- `src/knowledge_search_mcp/main.py:74-78` - Base close method
- `src/knowledge_search_mcp/main.py:228-231` - VectorSearch close override
- `src/knowledge_search_mcp/main.py:699-707` - Cleanup in lifespan finally block
**Impact**: Prevents resource leaks, eliminates aiohttp warnings on shutdown
---
### 3. Implemented LRU Cache with Size Limits ✅
**Problem**: The `_cache` dictionary in `GoogleCloudFileStorage` grew indefinitely, potentially causing memory issues with large document sets.
**Solution**:
- Created a new `LRUCache` class with configurable max size (default: 100 items)
- Automatically evicts least recently used items when cache is full
- Maintains insertion order and tracks access patterns
**Files Changed**:
- `src/knowledge_search_mcp/main.py:28-58` - New LRUCache class
- `src/knowledge_search_mcp/main.py:85-87` - Updated GoogleCloudFileStorage to use LRUCache
- `src/knowledge_search_mcp/main.py:115-122` - Updated cache access patterns
- `src/knowledge_search_mcp/main.py:147-148` - Updated cache write patterns
- `tests/test_search.py` - Updated tests to work with LRUCache interface
**Impact**: Bounded memory usage, prevents cache from growing indefinitely
---
### 4. Broke Down Large Functions ✅
#### a. Extracted Validation Functions from `lifespan()`
**Problem**: The `lifespan()` function was 225 lines with repetitive validation logic.
**Solution**: Extracted three helper functions:
- `_validate_genai_access()` - Validates GenAI embedding API access
- `_validate_gcs_access()` - Validates GCS bucket access
- `_validate_vector_search_access()` - Validates vector search endpoint access
**Files Changed**:
- `src/knowledge_search_mcp/main.py:424-587` - New validation functions
- `src/knowledge_search_mcp/main.py:644-693` - Simplified lifespan function
**Impact**: Reduced lifespan() from 225 to ~65 lines, improved readability and testability
#### b. Extracted Helper Functions from `knowledge_search()`
**Problem**: The `knowledge_search()` function was 149 lines mixing multiple concerns.
**Solution**: Extracted three helper functions:
- `_generate_query_embedding()` - Handles embedding generation with error handling
- `_filter_search_results()` - Applies similarity thresholds and filtering
- `_format_search_results()` - Formats results as XML-like documents
**Files Changed**:
- `src/knowledge_search_mcp/main.py:717-766` - _generate_query_embedding
- `src/knowledge_search_mcp/main.py:769-793` - _filter_search_results
- `src/knowledge_search_mcp/main.py:796-810` - _format_search_results
- `src/knowledge_search_mcp/main.py:814-876` - Simplified knowledge_search function
**Impact**: Reduced knowledge_search() from 149 to ~63 lines, improved testability, added input validation for empty queries
---
## Additional Improvements
### Input Validation
- Added validation for empty/whitespace-only queries in `_generate_query_embedding()`
### Code Organization
- Moved `import time` from inline to module-level imports
### Test Updates
- Updated all tests to work with the new LRUCache interface
- All 11 tests passing
---
## Metrics
| Metric | Before | After | Change |
|--------|--------|-------|--------|
| Total lines (main.py) | 809 | 876 | +67 (more modular code) |
| Longest function | 225 lines | 65 lines | -71% |
| Code duplication instances | 2 major | 0 | -100% |
| Resource leaks | Yes | No | Fixed |
| Cache memory bound | No | Yes (100 items) | Fixed |
| Test coverage | 11 tests | 11 tests | Maintained |
---
## What's Left for Future Work
### Medium Priority (Not Done)
- Move magic numbers to Settings configuration
- Update outdated DockerfileConnector
- Review and adjust logging levels
- Add dependency injection for tighter coupling issues
### Lower Priority (Not Done)
- Add integration tests for end-to-end flows
- Add performance tests
- Introduce abstraction layers for cloud services
- Standardize on f-strings (one %-format remaining)