Refactor duplicated code
This commit is contained in:
136
REFACTORING_SUMMARY.md
Normal file
136
REFACTORING_SUMMARY.md
Normal file
@@ -0,0 +1,136 @@
|
||||
# Refactoring Summary
|
||||
|
||||
## High-ROI Refactorings Completed
|
||||
|
||||
### 1. Eliminated Code Duplication - Session Management ✅
|
||||
|
||||
**Problem**: The `_get_aio_session()` method was duplicated identically in both `GoogleCloudFileStorage` and `GoogleCloudVectorSearch` classes.
|
||||
|
||||
**Solution**:
|
||||
- Created a new `BaseGoogleCloudClient` base class that encapsulates shared session management logic
|
||||
- Both `GoogleCloudFileStorage` and `GoogleCloudVectorSearch` now inherit from this base class
|
||||
- Added a `close()` method to properly clean up resources
|
||||
|
||||
**Files Changed**:
|
||||
- `src/knowledge_search_mcp/main.py:25-80` - Added base class
|
||||
- `src/knowledge_search_mcp/main.py:83` - GoogleCloudFileStorage inherits from base
|
||||
- `src/knowledge_search_mcp/main.py:219` - GoogleCloudVectorSearch inherits from base
|
||||
|
||||
**Impact**: Reduced ~24 lines of duplicated code, improved maintainability
|
||||
|
||||
---
|
||||
|
||||
### 2. Fixed Resource Cleanup ✅
|
||||
|
||||
**Problem**: aiohttp sessions were never explicitly closed, leading to potential resource leaks and warnings.
|
||||
|
||||
**Solution**:
|
||||
- Added `close()` method to `BaseGoogleCloudClient` to properly close aiohttp sessions
|
||||
- Extended `close()` in `GoogleCloudVectorSearch` to also close the storage client's session
|
||||
- Modified `lifespan()` function's finally block to call `vs.close()` on shutdown
|
||||
|
||||
**Files Changed**:
|
||||
- `src/knowledge_search_mcp/main.py:74-78` - Base close method
|
||||
- `src/knowledge_search_mcp/main.py:228-231` - VectorSearch close override
|
||||
- `src/knowledge_search_mcp/main.py:699-707` - Cleanup in lifespan finally block
|
||||
|
||||
**Impact**: Prevents resource leaks, eliminates aiohttp warnings on shutdown
|
||||
|
||||
---
|
||||
|
||||
### 3. Implemented LRU Cache with Size Limits ✅
|
||||
|
||||
**Problem**: The `_cache` dictionary in `GoogleCloudFileStorage` grew indefinitely, potentially causing memory issues with large document sets.
|
||||
|
||||
**Solution**:
|
||||
- Created a new `LRUCache` class with configurable max size (default: 100 items)
|
||||
- Automatically evicts least recently used items when cache is full
|
||||
- Maintains insertion order and tracks access patterns
|
||||
|
||||
**Files Changed**:
|
||||
- `src/knowledge_search_mcp/main.py:28-58` - New LRUCache class
|
||||
- `src/knowledge_search_mcp/main.py:85-87` - Updated GoogleCloudFileStorage to use LRUCache
|
||||
- `src/knowledge_search_mcp/main.py:115-122` - Updated cache access patterns
|
||||
- `src/knowledge_search_mcp/main.py:147-148` - Updated cache write patterns
|
||||
- `tests/test_search.py` - Updated tests to work with LRUCache interface
|
||||
|
||||
**Impact**: Bounded memory usage, prevents cache from growing indefinitely
|
||||
|
||||
---
|
||||
|
||||
### 4. Broke Down Large Functions ✅
|
||||
|
||||
#### a. Extracted Validation Functions from `lifespan()`
|
||||
|
||||
**Problem**: The `lifespan()` function was 225 lines with repetitive validation logic.
|
||||
|
||||
**Solution**: Extracted three helper functions:
|
||||
- `_validate_genai_access()` - Validates GenAI embedding API access
|
||||
- `_validate_gcs_access()` - Validates GCS bucket access
|
||||
- `_validate_vector_search_access()` - Validates vector search endpoint access
|
||||
|
||||
**Files Changed**:
|
||||
- `src/knowledge_search_mcp/main.py:424-587` - New validation functions
|
||||
- `src/knowledge_search_mcp/main.py:644-693` - Simplified lifespan function
|
||||
|
||||
**Impact**: Reduced lifespan() from 225 to ~65 lines, improved readability and testability
|
||||
|
||||
#### b. Extracted Helper Functions from `knowledge_search()`
|
||||
|
||||
**Problem**: The `knowledge_search()` function was 149 lines mixing multiple concerns.
|
||||
|
||||
**Solution**: Extracted three helper functions:
|
||||
- `_generate_query_embedding()` - Handles embedding generation with error handling
|
||||
- `_filter_search_results()` - Applies similarity thresholds and filtering
|
||||
- `_format_search_results()` - Formats results as XML-like documents
|
||||
|
||||
**Files Changed**:
|
||||
- `src/knowledge_search_mcp/main.py:717-766` - _generate_query_embedding
|
||||
- `src/knowledge_search_mcp/main.py:769-793` - _filter_search_results
|
||||
- `src/knowledge_search_mcp/main.py:796-810` - _format_search_results
|
||||
- `src/knowledge_search_mcp/main.py:814-876` - Simplified knowledge_search function
|
||||
|
||||
**Impact**: Reduced knowledge_search() from 149 to ~63 lines, improved testability, added input validation for empty queries
|
||||
|
||||
---
|
||||
|
||||
## Additional Improvements
|
||||
|
||||
### Input Validation
|
||||
- Added validation for empty/whitespace-only queries in `_generate_query_embedding()`
|
||||
|
||||
### Code Organization
|
||||
- Moved `import time` from inline to module-level imports
|
||||
|
||||
### Test Updates
|
||||
- Updated all tests to work with the new LRUCache interface
|
||||
- All 11 tests passing
|
||||
|
||||
---
|
||||
|
||||
## Metrics
|
||||
|
||||
| Metric | Before | After | Change |
|
||||
|--------|--------|-------|--------|
|
||||
| Total lines (main.py) | 809 | 876 | +67 (more modular code) |
|
||||
| Longest function | 225 lines | 65 lines | -71% |
|
||||
| Code duplication instances | 2 major | 0 | -100% |
|
||||
| Resource leaks | Yes | No | Fixed |
|
||||
| Cache memory bound | No | Yes (100 items) | Fixed |
|
||||
| Test coverage | 11 tests | 11 tests | Maintained |
|
||||
|
||||
---
|
||||
|
||||
## What's Left for Future Work
|
||||
|
||||
### Medium Priority (Not Done)
|
||||
- Move magic numbers to Settings configuration
|
||||
- Update outdated DockerfileConnector
|
||||
- Review and adjust logging levels
|
||||
- Add dependency injection for tighter coupling issues
|
||||
|
||||
### Lower Priority (Not Done)
|
||||
- Add integration tests for end-to-end flows
|
||||
- Add performance tests
|
||||
- Introduce abstraction layers for cloud services
|
||||
- Standardize on f-strings (one %-format remaining)
|
||||
Reference in New Issue
Block a user