The Data Discovery Problem
Most organizations don't know exactly what sensitive data they have in S3. A data pipeline from three years ago might have written PII to a bucket that was originally intended for configuration files. A developer might have uploaded a database dump to debug a production issue and forgotten to delete it. A third-party integration might be writing more data than expected.
Amazon Macie exists to solve this problem. It uses ML and pattern matching to automatically discover and classify sensitive data in your S3 buckets — identifying PII, financial records, credentials, and other sensitive content — and flags the security issues that put that data at risk.
What Macie Detects
Macie's sensitive data detection covers several categories:
- Personally Identifiable Information (PII): Names, addresses, email addresses, phone numbers, social security numbers, passport numbers
- Financial data: Credit card numbers, bank account numbers, routing numbers, financial statements
- Healthcare data: Medical record numbers, insurance information, medication names (HIPAA-relevant)
- Credentials: AWS access keys, private encryption keys, usernames and passwords in files
- Network information: IP addresses, MAC addresses, GPS coordinates
Macie uses a combination of managed data identifiers (maintained by AWS) and custom data identifiers (regular expressions and keywords you define) for detection.
Enabling Macie
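# Enable Macie in the current account and Region (Macie is a regional service)
aws macie2 enable-macie
# Designate the delegated Macie administrator (run from the Organizations management account)
aws macie2 enable-organization-admin-account --admin-account-id 123456789012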
For multi-account setups with AWS Organizations, delegate Macie administration to your security account and enable Macie across all member accounts automatically. See our Organizations guide for the security account architecture.
Running Sensitive Data Discovery Jobs
Macie can run one-time discovery jobs or scheduled jobs across your S3 buckets:
aws macie2 create-classification-job \
  --job-type SCHEDULED \
  --name "WeeklyPIIScan" \
  --schedule-frequency '{"weeklySchedule": {"dayOfWeek": "SUNDAY"}}' \
  --managed-data-identifier-selector RECOMMENDED \
  --s3-job-definition '{
    "bucketDefinitions": [{
      "accountId": "123456789012",
      "buckets": ["my-data-bucket", "my-logs-bucket"]
    }]
  }'
For initial discovery, run a one-time job across all buckets to understand your baseline data exposure. Then run weekly scheduled jobs on buckets that contain or might receive sensitive data.
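For the baseline scan, the bucket list is usually longer than you want to type by hand. The sketch below builds the `--s3-job-definition` JSON from a list of bucket names. The helper name `build_s3_job_definition` is hypothetical, and it assumes bucket names need no JSON escaping (true for valid S3 bucket names).

```shell
# Hypothetical helper: print an s3JobDefinition JSON document for one account.
# Usage: build_s3_job_definition <account-id> <bucket> [<bucket> ...]
build_s3_job_definition() {
  account_id="$1"
  shift
  bucket_list=""
  for bucket in "$@"; do
    # Append each bucket as a quoted JSON string, comma-separated.
    bucket_list="${bucket_list:+$bucket_list, }\"$bucket\""
  done
  printf '{"bucketDefinitions": [{"accountId": "%s", "buckets": [%s]}]}\n' \
    "$account_id" "$bucket_list"
}

# Example (not run here, requires AWS credentials):
#   buckets=$(aws s3api list-buckets --query 'Buckets[].Name' --output text)
#   aws macie2 create-classification-job \
#     --job-type ONE_TIME \
#     --name "BaselineScan" \
#     --managed-data-identifier-selector RECOMMENDED \
#     --s3-job-definition "$(build_s3_job_definition 123456789012 $buckets)"
```

Leaving `$buckets` unquoted in the example is deliberate so the shell splits it into one argument per bucket.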
Macie Policy Findings
Beyond data discovery, Macie continuously monitors for policy-based findings — security issues affecting your S3 buckets that could expose data:
- Policy:IAMUser/S3BlockPublicAccessDisabled: Block Public Access disabled on a bucket
- Policy:IAMUser/S3BucketEncryptionDisabled: Server-side encryption not required
- Policy:IAMUser/S3BucketPublic: Bucket is publicly accessible
- Policy:IAMUser/S3BucketSharedExternally: Bucket shared with external AWS accounts
- Policy:IAMUser/S3BucketSharedWithCloudFront: Bucket shared with CloudFront (legitimate, but flagged for awareness)
Policy findings require no jobs — Macie generates them continuously as S3 configurations change. Integrate policy findings with Security Hub for unified visibility. For S3 security configuration details, see our S3 security guide and bucket policy audit guide.
Prioritizing Findings: Risk-Based Approach
Macie generates a lot of findings. Prioritize based on the combination of data sensitivity and security risk:
Highest Priority
- High-severity sensitive data (SSN, credit cards) in a publicly accessible bucket
- Credentials (AWS keys, passwords) found in any bucket
- Sensitive data in a bucket shared with an unknown external account
Medium Priority
- PII in a bucket that's shared with a known partner account
- Any sensitive data in a bucket with Block Public Access disabled
- High volumes of sensitive data without encryption enabled
Lower Priority
- Low-confidence sensitive data detection (email addresses in log files)
- Sensitive data in highly restricted internal buckets with no external access
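The tiers above can be sketched as a small triage function. This is only an illustration: the labels for data class (`credentials`, `high`, `pii`, `low`) and exposure (`public`, `external-unknown`, `external-known`, `bpa-disabled`, `unencrypted`, `internal`) are hypothetical, not Macie field values, and combinations the list doesn't cover fall through to `review` for a human to decide.

```shell
# Hypothetical triage helper mirroring the priority tiers described above.
# Usage: macie_priority <data_class> <exposure>
macie_priority() {
  data_class="$1"
  exposure="$2"
  case "$data_class:$exposure" in
    credentials:*)       echo "highest" ;;  # credentials in any bucket
    high:public)         echo "highest" ;;  # SSNs, card numbers in a public bucket
    low:*)               echo "lower" ;;    # low-confidence detections
    *:external-unknown)  echo "highest" ;;  # shared with an unknown account
    pii:external-known)  echo "medium" ;;   # shared with a known partner
    *:bpa-disabled)      echo "medium" ;;   # Block Public Access disabled
    *:unencrypted)       echo "medium" ;;   # no encryption enabled
    *:internal)          echo "lower" ;;    # restricted, no external access
    *)                   echo "review" ;;   # assumption: anything unlisted needs a human
  esac
}

macie_priority credentials internal   # highest
macie_priority pii external-known     # medium
```

The `case` branches are evaluated top to bottom, so the order encodes which rule wins when more than one applies.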
Custom Data Identifiers
Macie's managed identifiers cover common patterns. For organization-specific sensitive data — internal employee IDs, customer identifiers, proprietary codes — create custom identifiers:
# Keywords are passed as separate list items, not as one comma-separated string
aws macie2 create-custom-data-identifier \
  --name "InternalEmployeeID" \
  --description "Internal employee ID format" \
  --regex "EMP-[0-9]{6}" \
  --keywords "employee" "staff" "personnel" \
  --maximum-match-distance 50
Custom identifiers are useful for HIPAA covered entities that have organization-specific data formats, or for companies with proprietary data that has identifiable patterns.
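Before registering a custom identifier, it can help to try the regex against sample text locally. The sketch below uses `grep -E` as a rough stand-in; Macie's regex engine is not identical to grep's, so treat the result as a sanity check rather than a guarantee. The helper name `count_matches` is hypothetical.

```shell
# Hypothetical helper: count regex matches in text read from stdin.
# Usage: count_matches <extended-regex>
count_matches() {
  # -o prints each match on its own line; wc -l counts them.
  grep -Eo "$1" | wc -l | tr -d ' '
}

sample='employee EMP-123456 and EMP-12345
staff EMP-6543210'

# Prints 2: EMP-123456 matches, EMP-12345 is too short, and the
# seven-digit EMP-6543210 still matches on its first six digits.
printf '%s\n' "$sample" | count_matches 'EMP-[0-9]{6}'
```

The seven-digit match is the kind of over-matching worth catching early, since the pattern as written has no boundary after the sixth digit.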
Integrating Macie with Security Hub
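Macie can publish findings to Security Hub when both services are enabled. By default only policy findings are published; sensitive data findings have to be turned on in Macie's publication settings. This gives you a unified view of S3 data risks alongside infrastructure security findings. For immediate alerts, configure an EventBridge rule on Macie's own finding events and attach a target such as an SNS topic: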
# Macie reports finding severity as Low, Medium, or High
aws events put-rule --name "macie-high-severity" --event-pattern '{
  "source": ["aws.macie"],
  "detail-type": ["Macie Finding"],
  "detail": {
    "severity": {"description": ["High"]}
  }
}'
For the full Security Hub automation approach, see our Security Hub automations guide.
Macie for GDPR and HIPAA Evidence
Macie is particularly valuable for compliance frameworks that require data classification:
- GDPR: Requires knowing where personal data is stored and demonstrating appropriate protection. Macie's discovery reports provide evidence of your data inventory and security controls. See our GDPR on AWS guide.
- HIPAA: Requires identifying and protecting PHI. Macie's healthcare data detection identifies potential PHI locations. See our HIPAA compliance guide.
- PCI DSS: Requires identifying and protecting cardholder data. Macie's financial data detection helps locate credit card data that shouldn't be stored. See our PCI DSS guide.
For all three frameworks, Macie finding reports can serve as supporting evidence that you continuously monitor where regulated data lives, which is something auditors commonly ask to see.
FAQ
How much does Macie cost?
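Macie charges per GB of data analyzed in classification jobs (about $1/GB at the first pricing tier, with the first 1 GB per month free per account). Policy findings carry no per-finding charge, but the bucket inventory and monitoring behind them is billed as a small monthly per-bucket fee after the free trial. Rates vary by Region and change over time, so confirm them on the Macie pricing page. For large S3 environments, use sampling and focus jobs on buckets most likely to contain sensitive data rather than scanning everything.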
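To get a feel for how sampling changes the bill, here is a rough estimator. It is a sketch under stated assumptions: it hardcodes the $1/GB rate and 1 GB free allowance quoted above, ignores volume discounts and per-bucket charges, and treats the sampling percentage as a percentage of bytes (Macie samples objects, so real numbers will differ). The function name `estimate_macie_job_cost` is hypothetical.

```shell
# Hypothetical estimator for classification job cost in USD.
# Usage: estimate_macie_job_cost <total-gb> [<sampling-percent>]
estimate_macie_job_cost() {
  total_gb="$1"
  sampling_pct="${2:-100}"   # default: scan everything
  awk -v gb="$total_gb" -v pct="$sampling_pct" 'BEGIN {
    rate = 1.00              # assumed USD per GB
    free_gb = 1              # assumed monthly free allowance
    scanned = gb * pct / 100
    billable = scanned - free_gb
    if (billable < 0) billable = 0
    printf "%.2f\n", billable * rate
  }'
}

estimate_macie_job_cost 500      # prints 499.00
estimate_macie_job_cost 500 10   # prints 49.00
```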
What's the false positive rate for Macie findings?
Macie's managed identifiers combine pattern matching with validation steps such as checksums and nearby keywords, which reduces noise but doesn't eliminate it. False positives exist: a 16-digit internal ID that happens to pass the Luhn checksum and sits near a keyword like "card number" can be flagged as a potential credit card number. Custom identifiers generally have higher false positive rates if the regex isn't specific enough. Tune custom identifiers based on your observed false positive rate.
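When triaging a suspected credit card false positive, one quick local check is the Luhn checksum, which genuine card numbers satisfy. The function below is a minimal sketch in POSIX shell; the name `luhn_valid` is hypothetical and it is not Macie's implementation, only the standard algorithm.

```shell
# Hypothetical helper: succeed (exit 0) if the argument passes the Luhn check.
# Separators such as spaces and dashes are ignored.
luhn_valid() {
  digits=$(printf '%s' "$1" | tr -cd '0-9')
  len=${#digits}
  # Card numbers are 13 to 19 digits long.
  [ "$len" -ge 13 ] && [ "$len" -le 19 ] || return 1
  sum=0
  pos=0
  while [ "$len" -gt 0 ]; do
    d=${digits#"${digits%?}"}   # rightmost remaining digit
    digits=${digits%?}          # drop it from the string
    len=$((len - 1))
    # Double every second digit, counting from the right.
    if [ $((pos % 2)) -eq 1 ]; then
      d=$((d * 2))
      if [ "$d" -gt 9 ]; then d=$((d - 9)); fi
    fi
    sum=$((sum + d))
    pos=$((pos + 1))
  done
  [ $((sum % 10)) -eq 0 ]
}

luhn_valid "4111 1111 1111 1111" && echo "plausible card number"
luhn_valid "1234-5678-9012-3456" || echo "fails Luhn, likely not a card number"
```

A passing checksum doesn't prove a string is a card number (roughly one in ten random digit strings pass), but a failing one is a strong hint that you're looking at an internal ID.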
Can Macie scan data in non-S3 storage?
Macie is currently S3-specific. For data classification in DynamoDB, RDS, or other storage services, you'd need third-party tools or custom implementations. AWS continues to expand Macie's coverage over time.
Written by Vigilare Engineering
Platform Team