Amazon Macie: Automating Data Classification and Sensitive Data Discovery

The Data Discovery Problem

Most organizations don't know exactly what sensitive data they have in S3. A data pipeline from three years ago might have written PII to a bucket that was originally intended for configuration files. A developer might have uploaded a database dump to debug a production issue and forgotten to delete it. A third-party integration might be writing more data than expected.

Amazon Macie exists to solve this problem. It uses ML and pattern matching to automatically discover and classify sensitive data in your S3 buckets — identifying PII, financial records, credentials, and other sensitive content — and flags the security issues that put that data at risk.

What Macie Detects

Macie's sensitive data detection covers several categories:

Personally Identifiable Information (PII): Names, addresses, email addresses, phone numbers, social security numbers, passport numbers
Financial data: Credit card numbers, bank account numbers, routing numbers, financial statements
Healthcare data: Medical record numbers, insurance information, medication names (HIPAA-relevant)
Credentials: AWS access keys, private encryption keys, usernames and passwords in files
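Location and network information: GPS coordinates (covered by a managed identifier); IP addresses and MAC addresses (typically detected with a custom data identifier)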

Macie uses a combination of managed data identifiers (maintained by AWS) and custom data identifiers (regular expressions and keywords you define) for detection.

Enabling Macie

aws macie2 enable-macie

# Enable for organization (from management account)
aws macie2 enable-organization-admin-account \
  --admin-account-id 123456789012

For multi-account setups with AWS Organizations, delegate Macie administration to your security account and enable Macie across all member accounts automatically. See our Organizations guide for the security account architecture.
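
As a hedged sketch of the automatic enrollment step: once an account is the delegated Macie administrator, `update-organization-configuration --auto-enable` tells Macie to enable itself for accounts that join the organization later. The function name is illustrative, and the commands are assumed to run with credentials for the delegated administrator account, in each Region where you use Macie. Accounts that already exist in the organization still need to be added as Macie members separately.

```shell
#!/usr/bin/env bash
# Sketch: run with credentials for the delegated Macie administrator account.
# enable_macie_auto_enable is an illustrative name, not part of the AWS CLI.
enable_macie_auto_enable() {
  # Turn on Macie automatically for accounts that join the organization.
  aws macie2 update-organization-configuration --auto-enable
  # Read the setting back; the response includes an "autoEnable" field.
  aws macie2 describe-organization-configuration
}

# Usage (makes real API calls):
#   enable_macie_auto_enable
```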

Running Sensitive Data Discovery Jobs

Macie can run one-time discovery jobs or scheduled jobs across your S3 buckets:

aws macie2 create-classification-job \
  --job-type SCHEDULED \
  --name "WeeklyPIIScan" \
  --schedule-frequency '{"weeklySchedule": {"dayOfWeek": "SUNDAY"}}' \
  --s3-job-definition '{
    "bucketDefinitions": [{
      "accountId": "123456789012",
      "buckets": ["my-data-bucket", "my-logs-bucket"]
    }]
  }' \
  --managed-data-identifier-selector RECOMMENDED

For initial discovery, run a one-time job across all buckets to understand your baseline data exposure. Then run weekly scheduled jobs on buckets that contain or might receive sensitive data.
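
The baseline scan could look something like the sketch below. It assumes you want every bucket owned by one account, selected with a `bucketCriteria` filter on `ACCOUNT_ID` instead of a hard-coded bucket list. The function name, job name, and the optional sampling argument are illustrative choices, not something the original setup prescribes.

```shell
#!/usr/bin/env bash
# Sketch: one-time baseline job over every bucket owned by one account.
# create_baseline_job and the job name are illustrative.
create_baseline_job() {
  local account_id="$1"              # 12-digit AWS account ID
  local sampling_percent="${2:-100}" # percentage of eligible objects to analyze

  aws macie2 create-classification-job \
    --job-type ONE_TIME \
    --name "BaselineDiscovery-${account_id}" \
    --sampling-percentage "$sampling_percent" \
    --managed-data-identifier-selector RECOMMENDED \
    --s3-job-definition "{
      \"bucketCriteria\": {
        \"includes\": {
          \"and\": [{
            \"simpleCriterion\": {
              \"comparator\": \"EQ\",
              \"key\": \"ACCOUNT_ID\",
              \"values\": [\"${account_id}\"]
            }
          }]
        }
      }
    }"
}

# Usage (makes a real API call and starts a billable job):
#   create_baseline_job 123456789012 20
```

A lower sampling percentage keeps the first pass cheap; you can rerun at 100 on the buckets that turn up findings.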

Macie Policy Findings

Beyond data discovery, Macie continuously monitors for policy-based findings — security issues affecting your S3 buckets that could expose data:

Policy:IAMUser/S3BlockPublicAccessDisabled — Block Public Access disabled on a bucket
Policy:IAMUser/S3BucketEncryptionDisabled — Server-side encryption not required
Policy:IAMUser/S3BucketPublic — Bucket is publicly accessible
Policy:IAMUser/S3BucketSharedExternally — Bucket shared with external AWS accounts
Policy:IAMUser/S3BucketSharedWithCloudFront — Bucket shared with CloudFront (legitimate, but flagged for awareness)

Policy findings require no jobs — Macie generates them continuously as S3 configurations change. Integrate policy findings with Security Hub for unified visibility. For S3 security configuration details, see our S3 security guide and bucket policy audit guide.

Prioritizing Findings: Risk-Based Approach

Macie generates a lot of findings. Prioritize based on the combination of data sensitivity and security risk:

Highest Priority

High-severity sensitive data (SSN, credit cards) in a publicly accessible bucket
Credentials (AWS keys, passwords) found in any bucket
Sensitive data in a bucket shared with an unknown external account

Medium Priority

PII in a bucket that's shared with a known partner account
Any sensitive data in a bucket with Block Public Access disabled
High volumes of sensitive data without encryption enabled

Lower Priority

Low-confidence sensitive data detection (email addresses in log files)
Sensitive data in highly restricted internal buckets with no external access
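
The tiers above can be sketched as a small triage helper. This is a simplification under stated assumptions: the data-class and exposure labels are hypothetical names you would derive from your own finding data, not fields Macie emits, and the helper treats any non-low-confidence sensitive data in a public bucket as highest priority.

```shell
#!/usr/bin/env bash
# triage_finding DATA_CLASS EXPOSURE
# DATA_CLASS (hypothetical labels): credentials | sensitive | low_confidence
# EXPOSURE   (hypothetical labels): public | external_unknown | external_known
#                                   | bpa_disabled | unencrypted | internal
triage_finding() {
  local data_class="$1" exposure="$2"

  # Credentials are urgent wherever they are found.
  if [ "$data_class" = "credentials" ]; then
    echo "highest"
    return
  fi

  # Low-confidence detections go to the back of the queue.
  if [ "$data_class" = "low_confidence" ]; then
    echo "lower"
    return
  fi

  # Otherwise the bucket's exposure decides the tier.
  case "$exposure" in
    public|external_unknown)                echo "highest" ;;
    external_known|bpa_disabled|unencrypted) echo "medium" ;;
    *)                                       echo "lower" ;;
  esac
}

# Usage:
#   triage_finding sensitive external_known
```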

Custom Data Identifiers

Macie's managed identifiers cover common patterns. For organization-specific sensitive data — internal employee IDs, customer identifiers, proprietary codes — create custom identifiers:

aws macie2 create-custom-data-identifier \
  --name "InternalEmployeeID" \
  --regex "EMP-[0-9]{6}" \
  --keywords employee staff personnel \
  --maximum-match-distance 50 \
  --description "Internal employee ID format"

Custom identifiers are useful for HIPAA covered entities that have organization-specific data formats, or for companies with proprietary data that has identifiable patterns.
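
Before creating an identifier, it can help to try the pattern against sample text. The sketch below shows two options: a quick offline check with `grep -E`, and Macie's own `test-custom-data-identifier` command, which also applies the keywords and match distance. Both function names are illustrative, and `grep -E` is only an approximation of the regex syntax Macie supports.

```shell
#!/usr/bin/env bash
# Offline check: count substrings that match the employee ID pattern.
# count_id_matches is an illustrative helper name.
count_id_matches() {
  printf '%s\n' "$1" | grep -Eo 'EMP-[0-9]{6}' | wc -l | tr -d ' '
}

# Ask Macie to evaluate the regex, keywords, and match distance against
# sample text (makes a real API call; the response reports a match count).
test_identifier_in_macie() {
  aws macie2 test-custom-data-identifier \
    --regex 'EMP-[0-9]{6}' \
    --keywords employee staff personnel \
    --maximum-match-distance 50 \
    --sample-text "$1"
}

# Usage:
#   count_id_matches "employee EMP-123456 escalated to staff EMP-654321"
#   test_identifier_in_macie "employee EMP-123456 escalated to staff EMP-654321"
```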

Integrating Macie with Security Hub

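Macie integrates with Security Hub when both services are enabled. By default it publishes policy findings only; to include sensitive data findings, turn them on in Macie's publication settings. This gives you a unified view of S3 data risks alongside infrastructure security findings. Macie also sends its findings to EventBridge, so you can configure rules that route high-priority findings to immediate alerts: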

# Match High-severity findings (Macie severities are Low, Medium, and High)
aws events put-rule \
  --name "macie-high-severity" \
  --event-pattern '{
    "source": ["aws.macie"],
    "detail-type": ["Macie Finding"],
    "detail": {
      "severity": {"description": ["High"]}
    }
  }'

For the full Security Hub automation approach, see our Security Hub automations guide.
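
An EventBridge rule does nothing until it has a target. As a minimal sketch, assuming an SNS topic you already own is the alert channel, the function below attaches that topic to the `macie-high-severity` rule. The function name, target ID, and example ARN are illustrative, and the topic's access policy must allow `events.amazonaws.com` to publish to it.

```shell
#!/usr/bin/env bash
# Sketch: attach an SNS topic as the target of the macie-high-severity rule.
# route_macie_alerts and the target Id are illustrative names.
route_macie_alerts() {
  local topic_arn="$1"   # ARN of an existing SNS topic
  aws events put-targets \
    --rule macie-high-severity \
    --targets "Id=macie-alerts,Arn=${topic_arn}"
}

# Usage (makes a real API call):
#   route_macie_alerts arn:aws:sns:us-east-1:123456789012:security-alerts
```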

Macie for GDPR, HIPAA, and PCI DSS Evidence

Macie is particularly valuable for compliance frameworks that require data classification:

GDPR: Requires knowing where personal data is stored and demonstrating appropriate protection. Macie's discovery reports provide evidence of your data inventory and security controls. See our GDPR on AWS guide.
HIPAA: Requires identifying and protecting PHI. Macie's healthcare data detection identifies potential PHI locations. See our HIPAA compliance guide.
PCI DSS: Requires identifying and protecting cardholder data. Macie's financial data detection helps locate credit card data that shouldn't be stored. See our PCI DSS guide.

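For all three frameworks, Macie's findings and job results can serve as supporting evidence that you continuously monitor where regulated data lives and how it is protected, which supports the data inventory and protection requirements each framework imposes.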

FAQ

How much does Macie cost?

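Macie charges per GB of data analyzed by sensitive data discovery jobs: $1/GB at the entry tier, with lower rates at very high volumes and a 1 GB/month free tier per account. Policy findings carry no per-finding charge, though Macie does bill a small monthly fee for each bucket it inventories and monitors. Rates vary by Region, so confirm them on the Macie pricing page. For large S3 environments, use the job's sampling option and focus jobs on the buckets most likely to contain sensitive data rather than scanning everything.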
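
For a rough budget check, the arithmetic can be sketched as below. It assumes the flat $1/GB rate and 1 GB free tier quoted above, ignores volume tiers and any other Macie charges, and treats the sampling percentage as a percentage of bytes (Macie samples objects, so real usage will differ). The function name is illustrative.

```shell
#!/usr/bin/env bash
# estimate_job_cost TOTAL_GB [SAMPLING_PERCENT]
# Rough estimate only: flat $1/GB, 1 GB free, no volume tiers.
estimate_job_cost() {
  awk -v gb="$1" -v pct="${2:-100}" 'BEGIN {
    rate = 1.00                     # USD per GB (assumed flat rate)
    free_gb = 1                     # monthly free tier in GB
    scanned = gb * pct / 100        # data analyzed after sampling
    billable = scanned - free_gb
    if (billable < 0) billable = 0
    printf "%.2f\n", billable * rate
  }'
}

# Usage: 500 GB of objects, 20% sampling
#   estimate_job_cost 500 20
```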

What's the false positive rate for Macie findings?

Macie's managed identifiers are tuned to balance precision and recall, but false positives still occur. For example, a 16-digit internal ID that happens to pass the card-number checksum and sits near a keyword such as "card number" can be flagged as a potential credit card number. Custom identifiers tend to produce more false positives when the regex isn't specific enough. Narrow them with keywords, a maximum match distance, and ignore words, and keep tuning based on the false positive rate you observe.
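
Payment card numbers end in a Luhn check digit, so a local Luhn check is a cheap way to triage a suspected credit card false positive before you chase it down. The sketch below is a generic implementation of that checksum, not Macie's detection logic; the function name is illustrative.

```shell
#!/usr/bin/env bash
# luhn_valid NUMBER
# Returns success (0) if the digits in NUMBER pass the Luhn checksum.
luhn_valid() {
  local digits rest d sum=0 pos=0
  # Keep digits only, so spaces and dashes are ignored.
  digits=$(printf '%s' "$1" | tr -cd '0-9')
  [ -n "$digits" ] || return 1

  # Walk the number from the rightmost digit to the leftmost.
  while [ -n "$digits" ]; do
    rest=${digits%?}        # everything except the last digit
    d=${digits#"$rest"}     # the last digit
    digits=$rest
    # Double every second digit, counting from the right.
    if [ $((pos % 2)) -eq 1 ]; then
      d=$((d * 2))
      if [ "$d" -gt 9 ]; then
        d=$((d - 9))
      fi
    fi
    sum=$((sum + d))
    pos=$((pos + 1))
  done

  [ $((sum % 10)) -eq 0 ]
}

# Usage:
#   luhn_valid "4111 1111 1111 1111" && echo "passes Luhn"
#   luhn_valid "4111 1111 1111 1112" || echo "fails Luhn"
```

A value that fails the check is very unlikely to be a real card number; one that passes still needs a human look.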

Can Macie scan data in non-S3 storage?

Macie currently analyzes only objects stored in S3. To classify data held in DynamoDB, RDS, or other storage services, you'd need third-party tools, custom implementations, or an export of that data to S3 so Macie can scan it.


Written by Vigilare Engineering

Platform Team

The Vigilare platform team. We write about the AWS security, compliance, and cost signals behind account suspensions, and the practical steps to stay ahead of them.