Building GitNarrative: How I Parse Git History with Python to Extract Development Patterns

When I started building GitNarrative, I thought the hardest part would be the AI integration. Turns out, the real challenge was analyzing git repositories in a way that actually captures meaningful development patterns.

Here’s how I built the git analysis engine that powers GitNarrative’s story generation.

The Challenge: Making Sense of Messy Git History

Every git repository tells a story, but extracting that story programmatically is complex. Consider these real commit messages from a typical project:

"fix bug"
"refactor"
"update dependencies" 
"THIS FINALLY WORKS"
"revert last commit"
"actually fix the bug this time"

The challenge is identifying patterns that reveal the actual development journey – the struggles, breakthroughs, and decision points that make compelling narratives.

Library Choice: pygit2 vs GitPython

I evaluated both major Python git libraries:

GitPython: More Pythonic, easier to use

import git
repo = git.Repo('/path/to/repo')
commits = list(repo.iter_commits())

pygit2: Lower-level, better performance, more control

import pygit2
repo = pygit2.Repository('/path/to/repo')
walker = repo.walk(repo.head.target)

I chose pygit2 because GitNarrative needs to process repositories with thousands of commits efficiently. pygit2 binds directly to libgit2, the C library underneath many git tools, while GitPython shells out to the git binary for much of its work, so the performance gap widens quickly on large histories.
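
If you want to sanity-check that claim on your own machine, a rough timing harness looks like this (nothing GitNarrative-specific, just raw commit iteration over whatever repo path you point it at):

import time

import git      # GitPython
import pygit2

def time_iteration(path: str) -> None:
    # GitPython: iterate every commit reachable from HEAD
    start = time.perf_counter()
    n_gp = sum(1 for _ in git.Repo(path).iter_commits())
    gp_elapsed = time.perf_counter() - start

    # pygit2: the same walk, done in C by libgit2
    repo = pygit2.Repository(path)
    start = time.perf_counter()
    n_pg = sum(1 for _ in repo.walk(repo.head.target))
    pg_elapsed = time.perf_counter() - start

    print(f"GitPython: {n_gp} commits in {gp_elapsed:.2f}s")
    print(f"pygit2:    {n_pg} commits in {pg_elapsed:.2f}s")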

Core Analysis Architecture

Here’s the foundation of my git analysis engine:

from dataclasses import dataclass
from datetime import datetime
from typing import Dict, List, Optional
import pygit2

@dataclass
class CommitAnalysis:
    sha: str
    message: str
    timestamp: datetime
    files_changed: List[str]
    additions: int
    deletions: int
    author: str
    is_merge: bool
    complexity_score: float
    commit_type: str  # 'feature', 'bugfix', 'refactor', 'docs', etc.

class GitAnalyzer:
    def __init__(self, repo_path: str):
        self.repo = pygit2.Repository(repo_path)

    def analyze_repository(self) -> Dict:
        commits = self._extract_commits()
        patterns = self._identify_patterns(commits)
        timeline = self._build_timeline(commits)
        milestones = self._detect_milestones(commits)

        return {
            "commits": commits,
            "patterns": patterns,
            "timeline": timeline,
            "milestones": milestones,
            "summary": self._generate_summary(commits, patterns)
        }
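
Running the whole pipeline is then a two-liner (the repo path is whatever you want to analyze):

analyzer = GitAnalyzer('/path/to/repo')
report = analyzer.analyze_repository()
print(report['summary'])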

Pattern Recognition: The Heart of Story Extraction

The key insight is that commit patterns reveal development phases. Here’s how I identify them:

1. Commit Type Classification

def _classify_commit(self, commit_message: str, files_changed: List[str]) -> str:
    message_lower = commit_message.lower()

    # Bug fix patterns
    if any(keyword in message_lower for keyword in ['fix', 'bug', 'issue', 'error']):
        return 'bugfix'

    # Feature patterns
    if any(keyword in message_lower for keyword in ['add', 'implement', 'create', 'feature']):
        return 'feature'

    # Refactor patterns
    if any(keyword in message_lower for keyword in ['refactor', 'restructure', 'reorganize']):
        return 'refactor'

    # Documentation
    if any(keyword in message_lower for keyword in ['doc', 'readme', 'comment']):
        return 'docs'

    # Dependency/config changes
    if any(file.endswith(('.json', '.yml', '.yaml', '.toml')) for file in files_changed):
        return 'config'

    return 'other'
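
For illustration, here's how the messy messages from the intro come out (calling the private method directly, purely for the demo):

analyzer = GitAnalyzer('/path/to/repo')
print(analyzer._classify_commit('actually fix the bug this time', ['parser.py']))  # 'bugfix'
print(analyzer._classify_commit('update dependencies', ['package.json']))          # 'config'
print(analyzer._classify_commit('THIS FINALLY WORKS', ['parser.py']))              # 'other'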

2. Development Phase Detection

def _identify_development_phases(self, commits: List[CommitAnalysis]) -> List[Dict]:
    phases = []
    current_phase = None

    for i, commit in enumerate(commits):
        # Look for phase transition indicators
        if self._is_architecture_change(commit):
            if current_phase:
                phases.append(current_phase)
            current_phase = {
                'type': 'architecture_change',
                'start_commit': commit.sha,
                'description': 'Major architectural refactoring',
                'commits': [commit]
            }
        elif self._is_feature_burst(commits[max(0, i-5):i+1]):
            # Multiple feature commits in a short timeframe
            if not current_phase or current_phase['type'] != 'feature_development':
                if current_phase:
                    phases.append(current_phase)
                current_phase = {
                    'type': 'feature_development',
                    'start_commit': commit.sha,
                    'description': 'Rapid feature development phase',
                    'commits': [commit]
                }
            else:
                current_phase['commits'].append(commit)
        elif current_phase:
            current_phase['commits'].append(commit)

    # Don't drop the phase that was still open when the history ended
    if current_phase:
        phases.append(current_phase)

    return phases

def _is_architecture_change(self, commit: CommitAnalysis) -> bool:
    # High file change count + specific patterns
    return (len(commit.files_changed) > 10 and 
            commit.complexity_score > 0.8 and
            any(keyword in commit.message.lower() 
                for keyword in ['refactor', 'restructure', 'migrate']))
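
_is_feature_burst is called above but never shown in the post. A minimal sketch, assuming a "burst" means a window dominated by feature commits that landed within a couple of days:

def _is_feature_burst(self, window: List[CommitAnalysis]) -> bool:
    # Hypothetical: mostly feature commits, tightly clustered in time
    if len(window) < 3:
        return False
    feature_count = sum(1 for c in window if c.commit_type == 'feature')
    time_span = (window[-1].timestamp - window[0].timestamp).days
    return feature_count >= 3 and time_span <= 2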

3. Struggle and Breakthrough Detection

This is where the storytelling magic happens:

def _detect_struggle_patterns(self, commits: List[CommitAnalysis]) -> List[Dict]:
    struggles = []

    for i in range(len(commits) - 3):
        window = commits[i:i+4]

        # Look for multiple attempts at same issue
        if self._is_struggle_sequence(window):
            struggles.append({
                'type': 'debugging_struggle',
                'commits': window,
                'description': self._describe_struggle(window),
                'resolution_commit': self._find_resolution(commits[i+4:i+10])
            })

    return struggles

def _is_struggle_sequence(self, commits: List[CommitAnalysis]) -> bool:
    # Multiple bug fix attempts in short timeframe
    bugfix_count = sum(1 for c in commits if c.commit_type == 'bugfix')

    # Time clustering (all within days of each other)
    time_span = (commits[-1].timestamp - commits[0].timestamp).days

    return bugfix_count >= 2 and time_span <= 3

def _find_resolution(self, following_commits: List[CommitAnalysis]) -> Optional[CommitAnalysis]:
    # Look for commit that likely resolved the issue
    for commit in following_commits:
        if ('work' in commit.message.lower() or 
            'fix' in commit.message.lower() or
            commit.complexity_score > 0.6):
            return commit
    return None
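
_describe_struggle is also referenced but not shown; a plausible minimal version (purely a sketch) summarizes the window as raw material for the AI prompt:

def _describe_struggle(self, window: List[CommitAnalysis]) -> str:
    # Hypothetical summary fed to the narrative layer
    days = max((window[-1].timestamp - window[0].timestamp).days, 1)
    touched = {f for c in window for f in c.files_changed}
    return (f"{len(window)} commits over {days} day(s), "
            f"repeatedly touching the same {len(touched)} files")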

Timeline Correlation: When Things Happened

Understanding timing is crucial for narrative flow:

from collections import defaultdict

def _build_timeline(self, commits: List[CommitAnalysis]) -> Dict:
    # Group commits by time period
    monthly_activity = defaultdict(list)

    for commit in commits:
        month_key = commit.timestamp.strftime('%Y-%m')
        monthly_activity[month_key].append(commit)

    timeline = {}
    for month, month_commits in monthly_activity.items():
        timeline[month] = {
            'total_commits': len(month_commits),
            'commit_types': self._analyze_commit_distribution(month_commits),
            'major_changes': self._identify_major_changes(month_commits),
            'development_velocity': self._calculate_velocity(month_commits)
        }

    return timeline

def _calculate_velocity(self, commits: List[CommitAnalysis]) -> float:
    if not commits:
        return 0.0

    # Factor in commit frequency, complexity, and file changes
    total_complexity = sum(c.complexity_score for c in commits)
    total_files = sum(len(c.files_changed) for c in commits)

    return (total_complexity * total_files) / len(commits)
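
_analyze_commit_distribution isn't defined in the post either; my guess at its minimal shape is a per-type counter:

from collections import Counter

def _analyze_commit_distribution(self, commits: List[CommitAnalysis]) -> Dict[str, int]:
    # Count how many commits of each classified type landed in the period
    return dict(Counter(c.commit_type for c in commits))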

Performance Optimizations

Processing large repositories efficiently required several optimizations:

1. Lazy Loading

def _extract_commits(self, max_commits: int = 1000) -> List[CommitAnalysis]:
    # pygit2's walker is a lazy iterator, so capping the walk bounds memory on huge repos
    walker = self.repo.walk(self.repo.head.target)
    commits = []

    for i, commit in enumerate(walker):
        if i >= max_commits:
            break

        commits.append(self._analyze_single_commit(commit))

    return commits

2. Caching Results

from functools import lru_cache

@lru_cache(maxsize=128)
def _calculate_complexity_score(self, sha: str) -> float:
    # Expensive per-commit calculation, cached by (self, sha);
    # note that lru_cache on a method keeps a reference to self alive
    commit = self.repo[sha]
    # ... complexity calculation elided ...
    return score
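
The actual scoring is elided above. For reference, one plausible formulation (an assumption on my part, not GitNarrative's real formula) log-scales a commit's churn into [0, 1]:

import math

def _calculate_complexity_score_sketch(self, sha: str) -> float:
    # Hypothetical: log-scaled churn squashed into [0, 1]
    commit = self.repo[sha]
    if not commit.parents:
        return 0.0  # root commit has no parent to diff against
    stats = self.repo.diff(commit.parents[0], commit).stats
    churn = stats.insertions + stats.deletions
    # log scaling keeps one giant vendored-file commit from dominating
    return min(1.0, math.log1p(churn) / 10.0)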

3. Parallel Processing for Multiple Repositories

import asyncio
from concurrent.futures import ProcessPoolExecutor

async def analyze_multiple_repos(repo_paths: List[str]) -> List[Dict]:
    # analyze_single_repo must be a module-level function so it can be
    # pickled and shipped to the worker processes
    with ProcessPoolExecutor() as executor:
        loop = asyncio.get_running_loop()
        tasks = [
            loop.run_in_executor(executor, analyze_single_repo, path)
            for path in repo_paths
        ]
        return await asyncio.gather(*tasks)
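
Kicking it off is one asyncio.run away:

results = asyncio.run(analyze_multiple_repos(['/path/to/repo-a', '/path/to/repo-b']))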

Integration with AI Story Generation

The analysis output feeds directly into AI prompts:

def format_for_ai_prompt(self, analysis: Dict) -> str:
    prompt_data = {
        'repository_summary': analysis['summary'],
        'development_phases': analysis['patterns']['phases'],
        'key_struggles': analysis['patterns']['struggles'],
        'breakthrough_moments': analysis['milestones'],
        'timeline': analysis['timeline']
    }

    return self._build_narrative_prompt(prompt_data)

Challenges and Solutions

Challenge 1: Repositories with inconsistent commit message styles
Solution: Pattern matching with multiple fallback strategies and file-based analysis

Challenge 2: Merge commits creating noise in analysis
Solution: Filtering strategy that focuses on meaningful commits while preserving merge context

Challenge 3: Very large repositories (10k+ commits)
Solution: Sampling strategy that captures representative commits from different time periods (sketched below)
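
Here's what that sampling strategy might look like (a hedged sketch, not GitNarrative's exact code; the function name and target size are illustrative):

import random
from collections import defaultdict

def sample_commits(commits: List[CommitAnalysis], target: int = 1000) -> List[CommitAnalysis]:
    # Hypothetical time-stratified sampling: keep everything when the
    # history is small, otherwise take an even share from each month so
    # early and late phases are both represented
    if len(commits) <= target:
        return commits
    by_month = defaultdict(list)
    for c in commits:
        by_month[c.timestamp.strftime('%Y-%m')].append(c)
    per_month = max(1, target // len(by_month))
    sampled = []
    for month_commits in by_month.values():
        k = min(per_month, len(month_commits))
        sampled.extend(random.sample(month_commits, k))
    sampled.sort(key=lambda c: c.timestamp)
    return sampled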

Results and Validation

The analysis engine successfully processes repositories ranging from small personal projects to large open source codebases. When tested on React’s repository, it correctly identified:

  • The initial experimental phase (2013)
  • Major architecture rewrites (Fiber, Hooks)
  • Performance optimization periods
  • API stabilization phases

What’s Next

Current improvements in development:

  • Better natural language processing of commit messages
  • Machine learning models for commit classification
  • Integration with issue tracker data for richer context
  • Support for monorepo analysis

The git analysis engine is the foundation that makes GitNarrative’s storytelling possible. By extracting meaningful patterns from commit history, we can transform boring git logs into compelling narratives about software development.

GitNarrative is available at https://gitnarrative.io – try it with your own repositories to see these patterns in action.

What patterns have you noticed in your own git history? I’d love to hear about interesting commit patterns you’ve discovered in your projects.
