Building GitNarrative: How I Parse Git History with Python to Extract Development Patterns
When I started building GitNarrative, I thought the hardest part would be the AI integration. Turns out, the real challenge was analyzing git repositories in a way that actually captures meaningful development patterns.
Here’s how I built the git analysis engine that powers GitNarrative’s story generation.
The Challenge: Making Sense of Messy Git History
Every git repository tells a story, but extracting that story programmatically is complex. Consider these real commit messages from a typical project:
"fix bug"
"refactor"
"update dependencies"
"THIS FINALLY WORKS"
"revert last commit"
"actually fix the bug this time"
The challenge is identifying patterns that reveal the actual development journey – the struggles, breakthroughs, and decision points that make compelling narratives.
Library Choice: pygit2 vs GitPython
I evaluated both major Python git libraries:
GitPython: More Pythonic, easier to use

```python
import git

repo = git.Repo('/path/to/repo')
commits = list(repo.iter_commits())
```
pygit2: Lower-level, better performance, more control

```python
import pygit2

repo = pygit2.Repository('/path/to/repo')
walker = repo.walk(repo.head.target)
```
I chose pygit2 because GitNarrative needs to process repositories with thousands of commits efficiently. The performance difference is significant for large repositories.
Core Analysis Architecture
Here’s the foundation of my git analysis engine:
```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, List

import pygit2


@dataclass
class CommitAnalysis:
    sha: str
    message: str
    timestamp: datetime
    files_changed: List[str]
    additions: int
    deletions: int
    author: str
    is_merge: bool
    complexity_score: float
    commit_type: str  # 'feature', 'bugfix', 'refactor', 'docs', etc.


class GitAnalyzer:
    def __init__(self, repo_path: str):
        self.repo = pygit2.Repository(repo_path)

    def analyze_repository(self) -> Dict:
        commits = self._extract_commits()
        patterns = self._identify_patterns(commits)
        timeline = self._build_timeline(commits)
        milestones = self._detect_milestones(commits)
        return {
            "commits": commits,
            "patterns": patterns,
            "timeline": timeline,
            "milestones": milestones,
            "summary": self._generate_summary(commits, patterns),
        }
```
Pattern Recognition: The Heart of Story Extraction
The key insight is that commit patterns reveal development phases. Here’s how I identify them:
1. Commit Type Classification
```python
def _classify_commit(self, commit_message: str, files_changed: List[str]) -> str:
    message_lower = commit_message.lower()

    # Bug fix patterns
    if any(keyword in message_lower for keyword in ['fix', 'bug', 'issue', 'error']):
        return 'bugfix'

    # Feature patterns
    if any(keyword in message_lower for keyword in ['add', 'implement', 'create', 'feature']):
        return 'feature'

    # Refactor patterns
    if any(keyword in message_lower for keyword in ['refactor', 'restructure', 'reorganize']):
        return 'refactor'

    # Documentation
    if any(keyword in message_lower for keyword in ['doc', 'readme', 'comment']):
        return 'docs'

    # Dependency/config changes
    if any(file.endswith(('.json', '.yml', '.yaml', '.toml')) for file in files_changed):
        return 'config'

    return 'other'
```
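The ordering of these checks matters: earlier rules win, so a message like "fix docs typo" lands in 'bugfix'. As a standalone sketch (my own simplification for illustration, not GitNarrative's exact code), the same logic can be driven by a keyword table:

```python
# A minimal, standalone version of the keyword classifier. Rules are
# checked in order, so earlier categories win ties.
from typing import List

KEYWORD_RULES = [
    ('bugfix', ['fix', 'bug', 'issue', 'error']),
    ('feature', ['add', 'implement', 'create', 'feature']),
    ('refactor', ['refactor', 'restructure', 'reorganize']),
    ('docs', ['doc', 'readme', 'comment']),
]

def classify_commit(message: str, files_changed: List[str]) -> str:
    message_lower = message.lower()
    for commit_type, keywords in KEYWORD_RULES:
        if any(keyword in message_lower for keyword in keywords):
            return commit_type
    # Fall back to file-based signals when the message is uninformative.
    if any(f.endswith(('.json', '.yml', '.yaml', '.toml')) for f in files_changed):
        return 'config'
    return 'other'

print(classify_commit('actually fix the bug this time', []))  # bugfix
print(classify_commit('bump deps', ['package.json']))         # config
```

A table like this is also easier to extend than a chain of `if` statements — adding a category is one tuple, not another branch.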
2. Development Phase Detection
```python
def _identify_development_phases(self, commits: List[CommitAnalysis]) -> List[Dict]:
    phases = []
    current_phase = None

    for i, commit in enumerate(commits):
        # Look for phase transition indicators
        if self._is_architecture_change(commit):
            if current_phase:
                phases.append(current_phase)
            current_phase = {
                'type': 'architecture_change',
                'start_commit': commit.sha,
                'description': 'Major architectural refactoring',
                'commits': [],
            }
        elif self._is_feature_burst(commits[max(0, i - 5):i + 1]):
            # Multiple feature commits in a short timeframe
            if not current_phase or current_phase['type'] != 'feature_development':
                if current_phase:
                    phases.append(current_phase)
                current_phase = {
                    'type': 'feature_development',
                    'start_commit': commit.sha,
                    'description': 'Rapid feature development phase',
                    'commits': [],
                }
        if current_phase:
            current_phase['commits'].append(commit)

    if current_phase:
        phases.append(current_phase)
    return phases

def _is_architecture_change(self, commit: CommitAnalysis) -> bool:
    # High file change count plus specific message patterns
    return (len(commit.files_changed) > 10 and
            commit.complexity_score > 0.8 and
            any(keyword in commit.message.lower()
                for keyword in ['refactor', 'restructure', 'migrate']))
```
3. Struggle and Breakthrough Detection
This is where the storytelling magic happens:
```python
def _detect_struggle_patterns(self, commits: List[CommitAnalysis]) -> List[Dict]:
    struggles = []
    for i in range(len(commits) - 3):
        window = commits[i:i + 4]
        # Look for multiple attempts at the same issue
        if self._is_struggle_sequence(window):
            struggles.append({
                'type': 'debugging_struggle',
                'commits': window,
                'description': self._describe_struggle(window),
                'resolution_commit': self._find_resolution(commits[i + 4:i + 10]),
            })
    return struggles

def _is_struggle_sequence(self, commits: List[CommitAnalysis]) -> bool:
    # Multiple bug fix attempts clustered in a short timeframe
    bugfix_count = sum(1 for c in commits if c.commit_type == 'bugfix')
    # Time clustering (all within days of each other)
    time_span = (commits[-1].timestamp - commits[0].timestamp).days
    return bugfix_count >= 2 and time_span <= 3

def _find_resolution(self, following_commits: List[CommitAnalysis]) -> Optional[CommitAnalysis]:
    # Look for the commit that likely resolved the issue
    for commit in following_commits:
        if ('work' in commit.message.lower() or
                'fix' in commit.message.lower() or
                commit.complexity_score > 0.6):
            return commit
    return None
```
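To see the heuristic fire, here is a standalone version of the window check run on synthetic commits. The `FakeCommit` tuple below is illustrative only, not part of GitNarrative:

```python
from datetime import datetime, timedelta
from typing import List, NamedTuple

class FakeCommit(NamedTuple):
    commit_type: str
    timestamp: datetime

def is_struggle_sequence(commits: List[FakeCommit]) -> bool:
    # Two or more bugfix attempts, all landing within three days.
    bugfix_count = sum(1 for c in commits if c.commit_type == 'bugfix')
    time_span = (commits[-1].timestamp - commits[0].timestamp).days
    return bugfix_count >= 2 and time_span <= 3

start = datetime(2024, 1, 1)
window = [
    FakeCommit('bugfix', start),
    FakeCommit('bugfix', start + timedelta(hours=6)),
    FakeCommit('other', start + timedelta(days=1)),
    FakeCommit('bugfix', start + timedelta(days=2)),
]
print(is_struggle_sequence(window))  # True: three fixes within two days
```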
Timeline Correlation: When Things Happened
Understanding timing is crucial for narrative flow:
```python
from collections import defaultdict

def _build_timeline(self, commits: List[CommitAnalysis]) -> Dict:
    # Group commits by month
    monthly_activity = defaultdict(list)
    for commit in commits:
        month_key = commit.timestamp.strftime('%Y-%m')
        monthly_activity[month_key].append(commit)

    timeline = {}
    for month, month_commits in monthly_activity.items():
        timeline[month] = {
            'total_commits': len(month_commits),
            'commit_types': self._analyze_commit_distribution(month_commits),
            'major_changes': self._identify_major_changes(month_commits),
            'development_velocity': self._calculate_velocity(month_commits),
        }
    return timeline

def _calculate_velocity(self, commits: List[CommitAnalysis]) -> float:
    if not commits:
        return 0.0
    # Factor in commit frequency, complexity, and file changes
    total_complexity = sum(c.complexity_score for c in commits)
    total_files = sum(len(c.files_changed) for c in commits)
    return (total_complexity * total_files) / len(commits)
```
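For intuition, plugging made-up numbers into the velocity formula — three commits with complexity scores 0.5, 0.8, and 0.3, touching 2, 6, and 1 files respectively:

```python
# Worked example of the velocity formula above, with invented numbers.
scores = [0.5, 0.8, 0.3]  # complexity_score per commit
files = [2, 6, 1]         # len(files_changed) per commit

# (1.6 * 9) / 3
velocity = (sum(scores) * sum(files)) / len(scores)
print(round(velocity, 2))  # 4.8
```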
Performance Optimizations
Processing large repositories efficiently required several optimizations:
1. Lazy Loading
```python
def _extract_commits(self, max_commits: int = 1000) -> List[CommitAnalysis]:
    # Cap the walk so huge histories don't exhaust memory
    walker = self.repo.walk(self.repo.head.target)
    commits = []
    for i, commit in enumerate(walker):
        if i >= max_commits:
            break
        commits.append(self._analyze_single_commit(commit))
    return commits
```
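The manual counter can also be written with `itertools.islice`, which stops pulling from the walker once the cap is hit. Shown here on a plain iterable standing in for the pygit2 walker:

```python
from itertools import islice

def take(walker, max_commits: int = 1000) -> list:
    # islice stops consuming the iterator at the cap, so commits past
    # the limit are never loaded at all.
    return list(islice(walker, max_commits))

print(take(iter(range(10)), max_commits=3))  # [0, 1, 2]
```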
2. Caching Results
```python
from functools import lru_cache

@lru_cache(maxsize=128)
def _calculate_complexity_score(self, sha: str) -> float:
    # Expensive calculation cached per commit
    commit = self.repo[sha]
    # ... complexity calculation
    return score
```
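One caveat worth noting: `lru_cache` on an instance method keys the cache on `self` as well, so the cache holds a reference to the analyzer for as long as entries live. A per-instance dict sidesteps that; the sketch below uses a stand-in score rather than the real diff calculation:

```python
# Sketch of a per-instance cache, avoiding lru_cache's habit of
# retaining self in its keys.
class ComplexityCache:
    def __init__(self):
        self._scores = {}
        self.computed = 0  # for illustration: counts real computations

    def complexity(self, sha: str) -> float:
        if sha not in self._scores:
            self.computed += 1
            # Stand-in for the real calculation, which would inspect the diff.
            self._scores[sha] = 0.5
        return self._scores[sha]

cache = ComplexityCache()
cache.complexity('abc123')
cache.complexity('abc123')
print(cache.computed)  # 1: the second call was served from the cache
```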
3. Parallel Processing for Multiple Repositories
```python
import asyncio
from concurrent.futures import ProcessPoolExecutor

async def analyze_multiple_repos(repo_paths: List[str]) -> List[Dict]:
    loop = asyncio.get_running_loop()
    with ProcessPoolExecutor() as executor:
        tasks = [
            loop.run_in_executor(executor, analyze_single_repo, path)
            for path in repo_paths
        ]
        return await asyncio.gather(*tasks)
```
Integration with AI Story Generation
The analysis output feeds directly into AI prompts:
```python
def format_for_ai_prompt(self, analysis: Dict) -> str:
    prompt_data = {
        'repository_summary': analysis['summary'],
        'development_phases': analysis['patterns']['phases'],
        'key_struggles': analysis['patterns']['struggles'],
        'breakthrough_moments': analysis['milestones'],
        'timeline': analysis['timeline'],
    }
    return self._build_narrative_prompt(prompt_data)
```
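`_build_narrative_prompt` isn't shown in this post; a minimal version might simply serialize each analysis section into a labeled block. The template below is my own guess at the shape, not GitNarrative's actual prompt:

```python
import json

def build_narrative_prompt(prompt_data: dict) -> str:
    # One labeled section per analysis component, JSON-encoded so the
    # model sees real structure instead of a Python repr.
    sections = [
        f"## {name}\n{json.dumps(value, indent=2)}"
        for name, value in prompt_data.items()
    ]
    return "Write a development story from this analysis:\n\n" + "\n\n".join(sections)

prompt = build_narrative_prompt({'repository_summary': '12 commits, 1 author'})
print(prompt.splitlines()[0])  # Write a development story from this analysis:
```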
Challenges and Solutions
Challenge 1: Repositories with inconsistent commit message styles
Solution: Pattern matching with multiple fallback strategies and file-based analysis
Challenge 2: Merge commits creating noise in analysis
Solution: Filtering strategy that focuses on meaningful commits while preserving merge context
Challenge 3: Very large repositories (10k+ commits)
Solution: Sampling strategy that captures representative commits from different time periods
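The sampling strategy could look like this stride-based sketch — my own illustration of the idea, assuming commits are ordered oldest to newest:

```python
from typing import List

def sample_evenly(commits: List, budget: int) -> List:
    # Keep everything when the history fits the budget; otherwise take
    # evenly spaced commits so every era of the project is represented.
    if len(commits) <= budget:
        return list(commits)
    stride = len(commits) / budget
    return [commits[int(i * stride)] for i in range(budget)]

print(len(sample_evenly(list(range(10_000)), budget=500)))  # 500
```

A refinement would be to sample within calendar periods instead of by index, so a burst of commits in one month doesn't crowd out quieter years.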
Results and Validation
The analysis engine successfully processes repositories ranging from small personal projects to large open source codebases. When tested on React’s repository, it correctly identified:
- The initial experimental phase (2013)
- Major architecture rewrites (Fiber, Hooks)
- Performance optimization periods
- API stabilization phases
What’s Next
Current improvements in development:
- Better natural language processing of commit messages
- Machine learning models for commit classification
- Integration with issue tracker data for richer context
- Support for monorepo analysis
The git analysis engine is the foundation that makes GitNarrative’s storytelling possible. By extracting meaningful patterns from commit history, we can transform boring git logs into compelling narratives about software development.
GitNarrative is available at https://gitnarrative.io – try it with your own repositories to see these patterns in action.
What patterns have you noticed in your own git history? I’d love to hear about interesting commit patterns you’ve discovered in your projects.