Zurich Insurance Group: Building an Effective Log Management Solution on AWS
Speaker: Samantha Gignac @ AWS FSI Meetup Q4/2024
Agenda
Start with the basics
- What is log management and why is it critical for financial sector organizations
- Its role in compliance, security, and operational efficiency
Explore unique challenges faced by financial institutions in log management
- Handling large data volumes
- Meeting regulatory requirements
- Managing costs
Discuss Zurich’s specific goals for a log management solution
- Understand the approach chosen
Dive into the technical details of the solution
- Outline key AWS services used
- Explain design principles for scalability, cost-effectiveness, and efficiency
Review the outcomes and benefits achieved
- Apply lessons to other financial sector organizations
Wrap up with key takeaways
- Provide actionable insights for organizations using or exploring AWS for log management
Practical and relatable, focusing on real-world strategies and inspiration for the audience’s journey
Begin with the fundamentals:
- Why is log management important in the finance sector?
The basics of log management
- Log sources: Logs come from servers, appliances, network devices, cloud services, etc.
- Log collection: Centralizing log collection is essential for visibility and compliance
- Processing and analysis: Filtering noise, adding context, and identifying patterns or red flags
- Storage and retention: Secure, organized, and long-term storage (e.g., S3 tiers) is crucial for compliance
- Insights and actions: Leveraging data for actionable insights and responses
Log management is crucial for financial services for three main reasons:
- Compliance: Regulators require proof of due diligence, and logs serve as evidence.
- Security: Logs are essential for early threat detection and mitigation.
- Operational efficiency: Smoothly running systems lead to happy customers, employees, and regulators.
Log management is not just about collecting data; it is about transforming data into useful insights for system protection, regulatory compliance, and operational efficiency.
Key pain points:
Specific challenges Zurich faced in log management:
- Centralized logging can become expensive quickly.
- The previous approach treated all logs equally, storing everything in high-cost storage and processing every log as critical, which led to rapid cost escalation.
Data volume:
- Financial services generate vast amounts of log data from various sources (firewall rules, access attempts, vulnerability scans, etc.).
- Security systems, due to sensitive data handling, generate terabytes of logs daily.
- Not all security logs are equally important; some can be analyzed later, while critical alerts require immediate attention.
Compliance regulations:
- Regulations like PCI DSS and GDPR mandate log retention (e.g., seven years for audits) and detailed access records.
- Ensuring logs are secure, tamper-proof, and accessible without excessive cost is challenging.
Cost efficiency:
- Balancing the need for comprehensive log management with cost-effective solutions is crucial.
Treating all logs equally leads to resource waste:
- Not all logs require high-cost, real-time processing.
- Example: One-time debug logs vs. API access logs containing security-critical information.
- Store and retain logs based on their importance and potential use.
Complexity and integration challenges:
- Logs come from various sources (legacy systems, cloud services, apps, devices) with different formats.
- Translating between systems can lead to delays and blind spots.
- Example: The vulnerability management system and the CASB (Cloud Access Security Broker) tool use different formats.
Security and real-time analysis:
- SIEM (security information and event management) logs are critical for threat detection and response.
- Logs need to be analyzed quickly and efficiently to detect patterns (e.g., failed login attempts indicating a brute-force attack).
- Trust is crucial in the financial sector, so timely threat detection is essential.
Summary of challenges:
- Overwhelming data volumes
- Strict compliance requirements
- High cost of logging
- Messy integrations
- Need for real-time security
Solution goals:
Modernize the log management approach:
- Move away from treating every log equally.
- Use scalable tools to prioritize data collection based on importance.
Reduce cost:
- Lower the cost per gigabyte ingested without compromising performance or compliance.
- Manage data smarter by filtering out unnecessary logs before ingestion.
Decommission the old SIEM (security information and event management) infrastructure:
- Retire legacy systems to reduce technical debt, high maintenance costs, slow performance, and limited scalability.
- Free up resources for more modern and efficient solutions.
Improve analytics performance:
- Reduce latency and log search time for faster insights and real-time decision-making.
- Focus on delivering better performance without unnecessary delays.
Objectives summary:
- Achieve smarter log management that reduces costs, simplifies the environment, and delivers better performance.
- Focus on what matters most.
Solution architecture:
Prioritize logs based on importance:
- High-priority logs (e.g., security events) are routed for immediate analysis.
- Lower-priority logs (e.g., compliance-related) are stored in cost-effective archives like S3.
- Ensures resources are focused on critical data while controlling costs.
ETL pipeline:
- Acts as a traffic controller for logs, filtering out unnecessary data to reduce ingestion costs.
- Enriches data with additional metadata to make logs more actionable.
- Routes data to the appropriate destination (real-time analytics or long-term storage); a hedged sketch of this logic follows the list.
- Critical for reducing the cost per gigabyte ingested and ensuring valuable data is available where needed.
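The talk describes this stage conceptually (the routing itself is built in Cribl); as an illustration only, here is a minimal Python sketch of the kind of filter, enrich, and route decision involved. The source types, field names, and destination labels are hypothetical.

```python
# Illustrative only: a minimal filter/enrich/route step in the spirit of the
# pipeline described in the talk. Source types, fields, and destinations are
# hypothetical, not Zurich's actual configuration.
from datetime import datetime, timezone

DROP_SOURCES = {"debug", "healthcheck"}               # noise filtered before ingestion
HIGH_PRIORITY_SOURCES = {"siem", "firewall", "auth"}  # routed for real-time analysis

def route_log(event: dict) -> tuple[str, dict] | None:
    """Return (destination, enriched_event), or None if the event is dropped."""
    source = event.get("source_type", "unknown")
    if source in DROP_SOURCES:
        return None  # never ingested, which lowers the cost per gigabyte

    # Enrich with metadata so downstream searches have context.
    event["ingested_at"] = datetime.now(timezone.utc).isoformat()
    event.setdefault("environment", "prod")

    # High-priority events go to real-time analytics (OpenSearch);
    # everything else goes to low-cost archive storage (S3).
    destination = "opensearch" if source in HIGH_PRIORITY_SOURCES else "s3_archive"
    return destination, event

# Example:
# route_log({"source_type": "auth", "message": "failed login", "user": "alice"})
# -> ("opensearch", {...})
```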
Route logs into the SIEM and AWS infrastructure:
- After prioritization and processing, logs are directed into the SIEM (security information and event management) and AWS infrastructure for further analysis and storage.
AWS infrastructure components and their roles:
Amazon OpenSearch Service:
- Enables real-time analytics with fast log searches.
- Provides dashboards and monitoring tools for DevOps and security teams.
- Helps identify root causes quickly (e.g., a sudden spike in failed login attempts); a hedged query sketch follows this list.
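A failed-login spike like the one mentioned above could be surfaced with a date-histogram aggregation; here is a hedged sketch using the opensearch-py client. The domain endpoint, index pattern (auth-logs-*), and field names are assumptions, not Zurich's actual schema.

```python
# Sketch only: counting failed logins per minute over the last 15 minutes.
# Host, index pattern, and field names are assumptions for illustration.
from opensearchpy import OpenSearch

client = OpenSearch(
    hosts=[{"host": "my-domain.example.com", "port": 443}],
    use_ssl=True,
)

query = {
    "size": 0,
    "query": {
        "bool": {
            "filter": [
                {"term": {"event.outcome": "failure"}},
                {"range": {"@timestamp": {"gte": "now-15m"}}},
            ]
        }
    },
    "aggs": {
        "failures_per_minute": {
            "date_histogram": {"field": "@timestamp", "fixed_interval": "1m"}
        }
    },
}

response = client.search(index="auth-logs-*", body=query)
for bucket in response["aggregations"]["failures_per_minute"]["buckets"]:
    print(bucket["key_as_string"], bucket["doc_count"])
```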
Amazon S3:
- Stores all logs for compliance purposes.
- Scalable, secure, and cost-effective.
- Well suited to compliance needs; a hedged lifecycle-tiering sketch follows this list.
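The talk mentions S3 tiers and long retention but not the exact lifecycle rules; the sketch below shows one way such tiering could be expressed with boto3. The bucket name, prefix, transition days, and the roughly seven-year expiration are illustrative assumptions.

```python
# Sketch only: an S3 lifecycle rule that moves logs to cheaper storage classes
# and expires them after roughly seven years. Bucket, prefix, and day counts
# are illustrative, not Zurich's actual policy.
import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="example-compliance-logs",  # hypothetical bucket name
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "tier-and-retain-logs",
                "Filter": {"Prefix": "logs/"},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},  # infrequent access
                    {"Days": 180, "StorageClass": "GLACIER"},     # archive tier
                ],
                "Expiration": {"Days": 2557},  # ~7 years for audit retention
            }
        ]
    },
)
```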
AWS Glue and Data Catalog:
- Organizes logs and makes them searchable.
- Reduces time spent searching through raw data; a hedged crawler sketch follows this list.
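One common way to populate the Glue Data Catalog is a crawler over the S3 log prefix; a minimal boto3 sketch follows. The crawler name, IAM role, database, and S3 path are hypothetical.

```python
# Sketch only: registering archived logs in the Glue Data Catalog via a crawler.
# Names, the IAM role ARN, and the S3 path are hypothetical.
import boto3

glue = boto3.client("glue")

glue.create_crawler(
    Name="log-archive-crawler",
    Role="arn:aws:iam::123456789012:role/GlueLogCrawlerRole",
    DatabaseName="log_archive",
    Targets={"S3Targets": [{"Path": "s3://example-compliance-logs/logs/"}]},
    Schedule="cron(0 2 * * ? *)",  # crawl nightly so new partitions stay searchable
)

glue.start_crawler(Name="log-archive-crawler")
```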
Amazon Athena:
- Allows querying data directly in S3 without moving it.
- Cost-effective for deep-dive investigations and ad hoc analysis.
- Useful for investigating potential compliance issues; a hedged query sketch follows this list.
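As an illustration of an ad hoc investigation, a hedged boto3 sketch of an Athena query over the archived logs follows; the database, table, columns, and results location are assumptions.

```python
# Sketch only: an ad hoc compliance query over archived logs with Athena.
# Database, table, columns, and the results location are hypothetical.
import time
import boto3

athena = boto3.client("athena")

start = athena.start_query_execution(
    QueryString=(
        "SELECT user_id, COUNT(*) AS denied_attempts "
        "FROM access_logs "
        "WHERE action = 'DENY' AND event_date >= DATE '2024-01-01' "
        "GROUP BY user_id ORDER BY denied_attempts DESC LIMIT 20"
    ),
    QueryExecutionContext={"Database": "log_archive"},
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
)

query_id = start["QueryExecutionId"]
while True:  # poll until the query finishes
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    for row in rows:  # first row is the header
        print([col.get("VarCharValue") for col in row["Data"]])
```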
Combining data prioritization, ETL capabilities, and AWS services:
- Modernized the log management approach.
- Reduced costs by storing and processing only the most important logs.
- Simplified the infrastructure by retiring legacy SIEM (security information and event management) systems.
- Improved performance with faster log searches and better analytics.
Benefits and outcomes of the solution:
Cost savings:
- Reduced SIEM (security information and event management) ingestion from 5 TB per day to 500 GB per day.
- Achieved by deploying Cribl with version control and automatic backups.
- Unnecessary logs were filtered out before ingestion.
- Scales rapidly using Terraform.
Performance improvements:
- A sample query counting firewall events per firewall dropped from 93 seconds to 2 seconds; a hedged sketch of such a query follows this list.
- OpenSearch proved to be fast and easy to manage.
- One-click deployment and upgrades.
- Consistent and reliable rollouts using version-controlled Terraform.
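The exact query behind the 93-second-to-2-second improvement is not shown; as an illustration, counting firewall events per firewall in OpenSearch might use a terms aggregation like the sketch below, with the index pattern and field name assumed.

```python
# Sketch only: counting events per firewall with a terms aggregation.
# Index pattern and field name (ideally a keyword-type field) are assumptions.
from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "my-domain.example.com", "port": 443}], use_ssl=True)

response = client.search(
    index="firewall-logs-*",
    body={
        "size": 0,
        "aggs": {
            "events_per_firewall": {
                "terms": {"field": "firewall.name", "size": 100}
            }
        },
    },
)

for bucket in response["aggregations"]["events_per_firewall"]["buckets"]:
    print(bucket["key"], bucket["doc_count"])
```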
Clean house and technical debt reduction:
- Retired the old SIEM (security information and event management) infrastructure, eliminating technical debt.
- Transitioned to a sleek, modern architecture.
Scalability:
- Utilized AWS services like S3 for tiered data storage (frequent access, infrequent access, and Glacier).
- Achieved rapid scaling using Terraform for Cribl and other components.
- The system grows with organizational needs.
Compliance and security:
- Logs are securely stored in S3 and easily searchable in OpenSearch.
- Made regulatory reporting and security monitoring faster, simpler, and more reliable.
Key lessons learned from the journey:
Prioritize the right data:
- Not all logs are equal; treat them according to importance.
- Zurich achieved more than a 50% reduction in ingestion costs by prioritizing critical logs.
Invest in modern tools:
- Use tools designed for current challenges (e.g., Cribl, Amazon OpenSearch Service, S3).
- These tools scale seamlessly with the organization.
Automate for consistency:
- Use automation (e.g., Terraform) for deployments, upgrades, and scaling.
- Automation reduces human error and frees teams for strategic tasks.
Think beyond compliance:
- Modern log management offers faster searches, real-time insights, and better security monitoring.
- Focus on making data work for the organization, not just meeting regulatory requirements.
Summary:
- A smarter approach to log management is transformative.
- Prioritize what matters, invest in modern tools, automate processes, and think beyond compliance.
Future plans for log management:
Continue focusing on cost savings:
- Refine S3 tiering strategies to optimize storage efficiency.
- Shift logs between the frequent-access, infrequent-access, and Glacier tiers based on usage.
Expand automation:
- Further streamline scaling and configuration updates using Terraform.
- Aim for infrastructure that practically runs itself.
Integrate advanced analytics:
- Utilize OpenSearch’s built-in machine learning capabilities, such as anomaly detection.
- Catch unusual patterns in log data before they escalate; a hedged detector sketch follows this list.
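Anomaly detection is only named as a direction here; as a rough illustration, OpenSearch's Anomaly Detection plugin exposes a detector-creation API, sketched below via the opensearch-py transport. The detector definition, index, and fields are assumptions, and the request schema should be checked against the OpenSearch version in use.

```python
# Rough sketch: creating a detector through OpenSearch's Anomaly Detection
# plugin API. Index, fields, and intervals are assumptions; verify the request
# schema against the OpenSearch version in use.
from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "my-domain.example.com", "port": 443}], use_ssl=True)

detector = {
    "name": "failed-login-anomalies",
    "description": "Flag unusual spikes in failed logins",
    "time_field": "@timestamp",
    "indices": ["auth-logs-*"],
    "detection_interval": {"period": {"interval": 10, "unit": "Minutes"}},
    "window_delay": {"period": {"interval": 1, "unit": "Minutes"}},
    "feature_attributes": [
        {
            "feature_name": "failed_login_count",
            "feature_enabled": True,
            "aggregation_query": {
                "failed_login_count": {"value_count": {"field": "event.id"}}
            },
        }
    ],
}

response = client.transport.perform_request(
    "POST", "/_plugins/_anomaly_detection/detectors", body=detector
)
print(response)
```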
Import and optimize SIEM correlation searches and alerts:
- Continue the transition away from the legacy SIEM (security information and event management) by importing and optimizing these searches and alerts in OpenSearch.
- Maintain high monitoring standards while embracing the modern platform.
Strengthen data governance:
- Enhance metadata management with tools like AWS Glue.
- Keep logs organized and stay ahead of compliance requirements.
Explore generative AI with OpenSearch:
- Use OpenSearch as a vector database to power retrieval-augmented generation (RAG) for AI use cases.
- Example: Train AI models using logs to predict system issues before they happen; a hedged index sketch follows this list.
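This item is exploratory; to illustrate only the vector-store piece, the sketch below creates an OpenSearch k-NN index that could hold log-derived embeddings for retrieval. The index name, vector dimension, and embedding step are placeholders.

```python
# Exploratory sketch: an OpenSearch k-NN index for log-derived embeddings,
# the retrieval layer a RAG setup could query. Index name, vector dimension,
# and the embedding function are placeholders.
from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "my-domain.example.com", "port": 443}], use_ssl=True)

client.indices.create(
    index="log-embeddings",
    body={
        "settings": {"index": {"knn": True}},
        "mappings": {
            "properties": {
                "log_text": {"type": "text"},
                "embedding": {"type": "knn_vector", "dimension": 384},
            }
        },
    },
)

def embed(text: str) -> list[float]:
    # Placeholder: swap in a real embedding model; a zero vector keeps the sketch runnable.
    return [0.0] * 384

client.index(
    index="log-embeddings",
    body={"log_text": "repeated failed logins from 10.0.0.7", "embedding": embed("...")},
)

# Retrieve the closest log snippets for a natural-language question.
hits = client.search(
    index="log-embeddings",
    body={"size": 3, "query": {"knn": {"embedding": {"vector": embed("disk errors"), "k": 3}}}},
)
```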