At CodeStax, we are constantly developing and deploying apps in the cloud; sometimes the entire cycle from concept to production happens within two weeks. To enable this kind of fast-paced prototyping, we usually go with serverless apps backed by a NoSQL database, defaulting to Amazon's DynamoDB or MongoDB depending on our needs. While NoSQL neatly fits our requirements of building highly scalable apps with a flexible data model and low latency, it falls woefully short when it comes to analytics.
Yesteryear data-intensive applications were designed as separate OLTP (transactional) and OLAP (analytical) systems, where the transactional system would routinely dump data into the analytical system, which management would later use to slice and dice the data. In a similar vein, we began using Google Data Studio as our de facto reporting solution: it lets you quickly import data and build interactive reports and dashboards. Add to this the fact that Google Analytics (for web and apps) and other Google products (like Search Console, Ads, etc.) can push data directly into Data Studio, and you really have a 360-degree view of your web apps. Did I also mention that it's free?
But there are no default connectors from MongoDB to Google Data Studio, so we were left to our own devices. Our original solution was to take a dump of the relevant collections from MongoDB, flatten it, and push it into a bunch of Google Sheets that were linked directly to Data Studio. However, we quickly realized that this would be very slow for large datasets.
The answer was a Google data warehouse: BigQuery. We created a simple Python module that pulls data from MongoDB and pushes it into BigQuery, which in turn feeds the Data Studio dashboard automatically. Deploy this Python script on AWS Lambda (or any other cloud serverless platform) with an EventBridge rule triggering it on the schedule you need, and you have yourself a cost-effective, easy-to-deploy solution. Most of this is just boilerplate code glued together and is a testament to the power of the cloud services available to us. With minimal fuss and a few Google searches, we were able to create a world-class reporting system.
We use Python for this project, although it should be just as simple on any of the other supported platforms. PyMongo connects to MongoDB, the Google Cloud client libraries handle BigQuery and Google Cloud Storage, and the python-dotenv package loads our credentials from a .env file.
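As a quick sanity check, the dependencies boil down to four packages; a minimal sketch of the imports (after installing pymongo, google-cloud-bigquery, google-cloud-storage and python-dotenv with pip) looks like this:

```python
# Third-party dependencies used throughout the pipeline.
from pymongo import MongoClient    # MongoDB driver
from google.cloud import bigquery  # BigQuery client library
from google.cloud import storage   # Cloud Storage client library
from dotenv import load_dotenv     # loads variables from a .env file
```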
Now that our environment is set up, let's get down to the nuts and bolts of the solution. To access MongoDB you will need credentials, and since this is part of an ETL pipeline it's advisable to create a separate user with read-only access. You can store the credentials directly in the .env file, or keep them in a separate JSON file that the .env merely points to. Either way, it's good practice to make sure the .env file and any other credentials are not part of your repo; how to manage that is outside the scope of this article. In this project we will store the credentials in a JSON file.
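A minimal sketch of that setup, assuming a hypothetical MONGO_CREDS_FILE variable in the .env that points to the JSON file holding the read-only user's credentials:

```python
import json
import os

from dotenv import load_dotenv

load_dotenv()  # pulls MONGO_CREDS_FILE (and friends) in from the .env file

# The JSON file might look like:
# {"uri": "mongodb+srv://reporting_ro:<password>@cluster0.example.mongodb.net",
#  "database": "production"}
with open(os.environ["MONGO_CREDS_FILE"]) as fh:
    mongo_creds = json.load(fh)
```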
MongoDB - Code
This is fairly straightforward code. The export function exports your query results to CSV or JSON. The CSV path is slightly more elaborate: MongoDB data is notorious for not having a fixed schema, so when you flatten it out you will probably find that documents in the same collection have different keys, which means different rows end up with different columns that need to be reconciled.
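The original gist isn't reproduced here, but a minimal sketch of such a helper (class and method names are illustrative, not the exact ones from our module) could look like this:

```python
import csv
import json

from pymongo import MongoClient


class MongoConnection:
    """Illustrative sketch of the MongoDB export helper."""

    def __init__(self, uri, db_name):
        self.db = MongoClient(uri)[db_name]

    @staticmethod
    def _flatten(doc, parent_key="", sep="."):
        # Flatten nested documents into dotted keys, e.g. {"a": {"b": 1}} -> {"a.b": 1}.
        items = {}
        for key, value in doc.items():
            new_key = f"{parent_key}{sep}{key}" if parent_key else key
            if isinstance(value, dict):
                items.update(MongoConnection._flatten(value, new_key, sep))
            else:
                items[new_key] = value
        return items

    def export(self, collection, query, out_path, fmt="csv"):
        docs = [self._flatten(d) for d in self.db[collection].find(query)]
        if fmt == "json":
            with open(out_path, "w") as fh:
                json.dump(docs, fh, default=str)  # default=str handles ObjectId, dates
            return
        # Different documents may expose different keys, so build the
        # union of all columns before writing the CSV header.
        columns = sorted({key for doc in docs for key in doc})
        with open(out_path, "w", newline="") as fh:
            writer = csv.DictWriter(fh, fieldnames=columns, restval="")
            writer.writeheader()
            writer.writerows(docs)
```

The union-of-keys step is what absorbs the schema drift: csv.DictWriter simply leaves a blank cell for any column a given document doesn't have.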
BigQuery - Code
Google allows you to stream data into BigQuery, although this is discouraged for bulk loads. What we illustrate instead is creating a CSV file in a Google Cloud Storage bucket and passing its URI to the BigQuery helper function, which loads the data into the table. BigQuery tables conform to SQL standards, so make sure that the data in your file (which is essentially all strings) strictly conforms to the table schema. The code is self-explanatory and provided for your reference.
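Again, the gist itself isn't shown here, so the sketch below is an assumption-laden stand-in: it stages the CSV in a bucket with the Cloud Storage client and then runs a load job against BigQuery via load_table_from_uri.

```python
from google.cloud import bigquery, storage


class BigQueryLoader:
    """Illustrative sketch of the GCS-to-BigQuery load step."""

    def __init__(self, project, dataset, bucket):
        self.bq = bigquery.Client(project=project)
        self.bucket = storage.Client(project=project).bucket(bucket)
        self.dataset = dataset

    def load_csv(self, local_path, blob_name, table):
        # Stage the flattened CSV in Cloud Storage first.
        blob = self.bucket.blob(blob_name)
        blob.upload_from_filename(local_path)
        uri = f"gs://{self.bucket.name}/{blob_name}"

        # Load the staged file into the BigQuery table, replacing its contents.
        job_config = bigquery.LoadJobConfig(
            source_format=bigquery.SourceFormat.CSV,
            skip_leading_rows=1,
            autodetect=True,  # or pass an explicit schema for stricter typing
            write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
        )
        table_id = f"{self.bq.project}.{self.dataset}.{table}"
        load_job = self.bq.load_table_from_uri(uri, table_id, job_config=job_config)
        load_job.result()  # wait for completion and surface any errors
```

Passing an explicit schema instead of autodetect=True is the stricter option and matches the point above about making the file conform to the table schema.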
A few notes on the GCP environment setup.
We need to create a project in BigQuery and, inside that, a dataset under which our tables will live. Then we create a service account and grant it write access to the project. Download its credentials (a JSON file) and point to their location in our .env file via "GOOGLE_APPLICATION_CREDENTIALS". Finally, we create a Cloud Storage bucket and give the service account read and write access to it. And we are all set.
Our .env file, where we store the dataset details under BIG_QUERY_DETAILS:
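The exact contents of our .env aren't reproduced here, but an illustrative layout (key names other than GOOGLE_APPLICATION_CREDENTIALS and BIG_QUERY_DETAILS are hypothetical) would be:

```
# .env (illustrative)
GOOGLE_APPLICATION_CREDENTIALS=./secrets/gcp-service-account.json
MONGO_CREDS_FILE=./secrets/mongo-creds.json
BIG_QUERY_DETAILS={"project": "my-gcp-project", "dataset": "reporting", "bucket": "my-export-bucket"}
```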
We can now write Python code that imports mongoConnection and bigQuery (the classes from the code above) and applies our business logic there.
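A sketch of what that glue might look like, using the illustrative class and module names from the earlier snippets rather than the exact mongoConnection and bigQuery classes from our repo:

```python
# etl_job.py -- illustrative glue code tying the two helpers together
import json
import os

from dotenv import load_dotenv

# Hypothetical module names; use whatever files the classes actually live in.
from mongo_connection import MongoConnection
from big_query import BigQueryLoader


def run_export():
    load_dotenv()
    bq = json.loads(os.environ["BIG_QUERY_DETAILS"])

    with open(os.environ["MONGO_CREDS_FILE"]) as fh:
        creds = json.load(fh)

    # Business logic: pick the collections and queries that your reports need.
    mongo = MongoConnection(creds["uri"], creds["database"])
    mongo.export("orders", {}, "/tmp/orders.csv", fmt="csv")

    loader = BigQueryLoader(bq["project"], bq["dataset"], bq["bucket"])
    loader.load_csv("/tmp/orders.csv", "exports/orders.csv", "orders")
```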
Typically, ETL pipelines extract data from the transactional database and push it into the reporting solution at a fixed frequency, which depends on the reporting requirements of the application (how quickly you need to respond to market dynamics). At Trinom we usually set this to once a day (around 1 AM). To do this we create an AWS Lambda function (or the serverless service of a cloud provider of your choice) triggered by an EventBridge rule. For the BigQuery upload, we can also consider using the Parquet format (which is much smaller) or Avro (which is very strict), and we write functions to verify the schema of the file before the upload starts. What we have shown above is a quick way to cobble something together that can evolve based on your needs.
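To close the loop, a bare-bones handler for that Lambda might look like the sketch below (the module and function names come from the earlier illustrative snippets; an EventBridge schedule such as cron(0 1 * * ? *) would fire it at 1 AM UTC):

```python
# lambda_function.py -- minimal handler sketch for the scheduled export
from etl_job import run_export  # the glue module sketched above


def lambda_handler(event, context):
    # The EventBridge payload isn't needed for a full daily export,
    # so we simply kick off the job and report success.
    run_export()
    return {"status": "ok"}
```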