Storing Website Analytics in DynamoDB

--

Anyone working on a Google Chrome Extension knows that migrating to Manifest v3 is inevitable, but as of April 2022 Google Analytics doesn’t work with Manifest v3, or at least I couldn’t find a way to put the two together. There is a page on Chrome Developers, Tutorial: Google analytics, but the instructions there on what to put in manifest.json for content_security_policy don’t work for v3.

I found a recommendation on Indie Hackers for Mixpanel. It was super easy to implement. They’ve got a nice pricing scheme for young startups: it costs nothing for the first year and then moves to a monthly/yearly plan rate depending on your usage. For capturing data and sending it to S3 you’re going to need the Data Pipeline Add-On, which should still be covered by the credits you get for the first year.

There are several sync data options. I went with S3.

Mixpanel Data Pipelines

Mixpanel has a nice article on Exporting data to AWS S3, but it relies on AWS Glue, and since I needed to do some additional manipulation of the data I preferred setting it up the following way:

  1. Have Mixpanel export the data daily to S3
  2. Set up a S3 trigger to invoke a Lambda function
  3. Do all the additional processing in the Lambda function including saving the data to DynamoDB

The key to all this is really the IAM Roles definitions. I will focus on that shortly.

Now, the steps in detail:

Step 1: Export the data daily to S3
Follow the steps in the Mixpanel article up to the ‘Glue Configurations’ title.
At the end, you should have one S3 bucket and one IAM role with a policy that allows Mixpanel to PutObject, GetObject, ListBucket and DeleteObject in your S3 bucket, plus an optional second policy for the KMS key.
In addition, the role should have a ‘Trusted Relationship’ with Mixpanel.
All is explained very clearly in the article.
Once Mixpanel starts uploading data, the bucket will be organized into year/month/day folders.
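If you would rather script this part than click through the IAM console, here is a minimal sketch of what the role and policy from the Mixpanel article boil down to, using boto3. The bucket name, role name and the Mixpanel principal are placeholders of my own; take the exact values (and any trust conditions) from the article.

import json
import boto3

iam = boto3.client('iam')

BUCKET = 'my-mixpanel-export-bucket'  # placeholder: your export bucket name

# Trust relationship: the principal ARN comes from the Mixpanel article
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"AWS": "<mixpanel-principal-arn-from-the-article>"},
        "Action": "sts:AssumeRole"
    }]
}

# Permissions: the four S3 actions the article asks for, on the bucket and its objects
s3_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["s3:PutObject", "s3:GetObject", "s3:ListBucket", "s3:DeleteObject"],
        "Resource": ["arn:aws:s3:::" + BUCKET, "arn:aws:s3:::" + BUCKET + "/*"]
    }]
}

iam.create_role(RoleName='mixpanel-export-role',
                AssumeRolePolicyDocument=json.dumps(trust_policy))
iam.put_role_policy(RoleName='mixpanel-export-role',
                    PolicyName='mixpanel-s3-export',
                    PolicyDocument=json.dumps(s3_policy))

The optional KMS policy from the article can be attached the same way with a second put_role_policy call.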

Step 1b: You then need to actually create the pipeline in Mixpanel. In the Create Pipeline page, open the tab called ‘Raw Amazon S3 Pipeline’.
Put in all the details such as your project id (you can find that in your Project Settings) and the S3 information: for the S3 bucket you just need the bucket name, whereas for the role you need the full ARN.
You need to create a Service Account by going to your Organization Settings and picking Service Accounts in the left popup menu.
You put the Service Account Username and Secret on the top right where it says AUTHENTICATION and then click ‘Try It!’ at the bottom.
You should get a 200 response under the ‘Try It!’ button or an error with what you need to fix.
This will set up the pipeline. I picked the daily option, which in my opinion is really more of a nightly option, as it runs at midnight :)
Note that if you select a ‘trial pipeline’, it will expire.
You can read more about the Mixpanel Data Pipelines API in general in the Overview.
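If you would rather call the API directly than use the ‘Try It!’ console, the request looks roughly like the sketch below. I am writing the endpoint and parameter names from the docs as I remember them, so treat every one of them as an assumption and copy the exact request from the ‘Raw Amazon S3 Pipeline’ tab.

import requests

# Assumption: endpoint and parameter names as they appeared in the Data Pipelines API
# docs; confirm them against the current 'Raw Amazon S3 Pipeline' tab before using.
resp = requests.post(
    'https://data.mixpanel.com/api/2.0/nessie/pipeline/create',
    auth=('<service-account-username>', '<service-account-secret>'),
    data={
        'project_id': '<your-project-id>',   # from Project Settings
        'type': 's3',                        # raw S3 pipeline
        'schedule': 'daily',                 # the 'nightly' export
        's3_bucket': '<bucket-name>',        # just the bucket name
        's3_role': '<full-role-arn>',        # the full ARN of the IAM role
        's3_region': '<bucket-region>',
    },
)
print(resp.status_code, resp.text)  # you want a 200 here, same as the 'Try It!' button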

Step 2: Set up a S3 trigger to invoke a Lambda function
There’s a very detailed tutorial on this in the AWS documentation: Tutorial: Using an Amazon S3 trigger to invoke a Lambda function. Of course you can skip the first part about creating a bucket.

A few more tweaks:

  • When creating the function, it’s enough to select ‘PUT’ for the Event type.
  • I’m using the s3-get-object-python blueprint, and the code below will be in Python; a rough sketch of the handler the blueprint generates is shown after this list.
  • After you create the function, go back to the Mixpanel S3 bucket. In the Properties tab, if you scroll down to Event notifications, you should see that the blueprint has added an event notification that calls the Lambda function you just created.
  • For testing, put a file in the S3 bucket and run the test event with the name of that file and the correct bucket name and bucket ARN.
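For reference, the handler that the s3-get-object-python blueprint generates looks roughly like this (reproduced from memory, so treat it as a sketch rather than the exact blueprint). The part to notice is how it pulls the bucket and key out of the S3 event, which is what the code in Step 3 relies on:

import json
import urllib.parse
import boto3

s3 = boto3.client('s3')

def lambda_handler(event, context):
    # The S3 trigger passes the bucket name and object key inside the event record
    bucket = event['Records'][0]['s3']['bucket']['name']
    key = urllib.parse.unquote_plus(
        event['Records'][0]['s3']['object']['key'], encoding='utf-8')
    try:
        response = s3.get_object(Bucket=bucket, Key=key)
        print("CONTENT TYPE: " + response['ContentType'])
        return response['ContentType']
    except Exception as e:
        print(e)
        print('Error getting object {} from bucket {}.'.format(key, bucket))
        raise e

The Step 3 code below replaces the try block inside this handler.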

Step 3: Additional Processing in Lambda
To process the export.json.gz file coming from Mixpanel, you need to add 2 more imports:

import io
import gzip

And a few more lines of code:

try:
    if "export.json.gz" in key:
        response = s3.get_object(Bucket=bucket, Key=key)
        print("CONTENT TYPE: " + response['ContentType'])
        content = response['Body'].read()
        # Mixpanel gzips the export, so decompress it in memory
        with gzip.GzipFile(fileobj=io.BytesIO(content), mode='rb') as gzipfile:
            data = gzipfile.read()
        # Each line of the export is one JSON-encoded event
        lines = data.splitlines()
        for line in lines:
            eventItem = json.loads(line)
            # additional processing
    else:
        print("key is not 'export.json.gz', it's:", key)
except Exception as e:
    print(e)
    print('Error processing object {} from bucket {}.'.format(key, bucket))
    raise e

If you want to put the events into DynamoDB, you will need to go to the role that was created by the Lambda function blueprint and add a policy allowing the function to PutItem into your table (plus whatever other permissions you need).
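As a minimal sketch of what that ‘additional processing’ step can look like, here is one way each event could be written to a DynamoDB table with boto3. The table name and key attributes are my own assumptions; Mixpanel’s raw export puts the event name under ‘event’ and everything else under ‘properties’ (including ‘distinct_id’, ‘time’ and ‘$insert_id’), so adapt the item to your table’s schema.

import boto3

dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('website-analytics')  # placeholder: your table name

def save_event(eventItem):
    props = eventItem.get('properties', {})
    # Assumption: a table keyed by distinct_id and $insert_id; change to match your schema
    table.put_item(Item={
        'distinct_id': props.get('distinct_id', 'unknown'),
        'insert_id': props.get('$insert_id', ''),
        'event': eventItem.get('event', ''),
        'time': props.get('time', 0),
    })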

Lastly, put enough prints in the Lambda code and use CloudWatch to track the function execution. You should see 2 to 3 log streams each night. You will notice that Mixpanel always sends a test object first, which it later deletes, and a file called ‘complete’ that stays in the folder. You’re really interested in the ‘export.json.gz’ file. If you download it, it will unzip, and you can use any text editor to look at it.

If there was an issue with your code in the nightly run, you can re-run the test event with the full path to the latest object, including the ‘export.json.gz’ file name, in the s3.object.key value.
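For that, the test event only needs the fields the handler actually reads; something like the sketch below, with the bucket name and the dated folder path filled in to match what Mixpanel created (the year/month/day structure from Step 1):

# Only the fields the handler reads are shown; fill in your own bucket and path
test_event = {
    "Records": [{
        "s3": {
            "bucket": {
                "name": "<your-bucket-name>",
                "arn": "arn:aws:s3:::<your-bucket-name>"
            },
            "object": {
                "key": "<mixpanel-folder>/2022/04/30/export.json.gz"  # placeholder path
            }
        }
    }]
}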

I hope I wrote everything down correctly. If you see any mistakes, please let me know!

Hit the clap button if you found this useful.
