My First Data Warehouse

Sample of the CSV

As shown in the image above, the column labels stop after the tenth column. But if we look closely at the data, the same pattern repeats every six columns.

One row represents a single vehicle, and the vehicles were not all tracked for the same duration. This gives the CSV an odd shape: some rows have fewer columns, which means those vehicles were tracked for a shorter time span. Keep this in mind, since I will come back to it in the system design section.

System Diagram

The system has five main components: the data extraction and loading module, a PostgreSQL database, dbt for creating transformations, Redash for dashboards, and finally Airflow for orchestrating the different workflows. All of them run in Docker containers for easy deployment.

Let’s start with the data extraction and loading module. This module is just a few Python scripts that read the CSV files and dump the data into the database. As I mentioned earlier, the data doesn’t have a rectangular shape: some rows have fewer columns, and the column labels stop after a certain point. Pandas couldn’t read it for me, so I wrote a simple parser that reads the CSV files and creates pandas dataframes. A minimal sketch of what such a reader can look like is shown below; the delimiter and the decision to drop empty trailing cells are my assumptions, not details from the project.
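```python
import csv

def read_ragged_csv(path, delimiter=","):
    """Read a CSV whose rows have varying column counts into plain Python lists.

    pandas.read_csv expects a rectangular layout, so the ragged rows are
    collected manually here. The delimiter is an assumed default.
    """
    with open(path, newline="") as f:
        reader = csv.reader(f, delimiter=delimiter)
        header = next(reader)  # labels exist only for the first few columns
        rows = [[cell.strip() for cell in row if cell.strip()] for row in reader]
    return header, rows
```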

Yes, dataframes.

I decided to treat the data as two separate but related tables: one describing the vehicles and the other recording their trajectories. These two tables are connected by a foreign key that is unique in the vehicles table and also appears in the trajectories table.

At the end, I get two pandas.DataFrames with the following shapes.
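To make the two-table split concrete, here is a rough sketch of how the ragged rows could be fanned out into a vehicles frame and a trajectories frame. The counts of fixed and repeating columns are assumptions based on my reading of the data ("labels stop after the tenth column", "pattern repeats every six columns"); the real column names would come from the dataset itself.

```python
import pandas as pd

N_VEHICLE_COLS = 4   # assumed: fixed per-vehicle columns before the repeating block
STEP = 6             # the trajectory pattern repeats every six columns

def split_rows(rows):
    """Turn each ragged row into one vehicle record plus many trajectory records."""
    vehicle_records, trajectory_records = [], []
    for vehicle_id, row in enumerate(rows):
        # Column 0 of both frames is the shared key (the foreign key mentioned above).
        vehicle_records.append([vehicle_id, *row[:N_VEHICLE_COLS]])
        trajectory_cells = row[N_VEHICLE_COLS:]
        for i in range(0, len(trajectory_cells), STEP):
            chunk = trajectory_cells[i:i + STEP]
            if len(chunk) == STEP:  # skip incomplete trailing chunks
                trajectory_records.append([vehicle_id, *chunk])
    return pd.DataFrame(vehicle_records), pd.DataFrame(trajectory_records)
```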

Finally, this module saves these two dataframes into their respective database tables. I used SQLAlchemy for the database connection. To make the module easy to use with Airflow and to avoid dependency issues, I containerized it.
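The loading step itself can be as small as a `to_sql` call over a SQLAlchemy engine; the connection string and table names below are placeholders, and `vehicles` / `trajectories` are the DataFrames built above.

```python
from sqlalchemy import create_engine

# Placeholder connection string; in the containerized setup the host and
# credentials would come from environment variables rather than being hard-coded.
engine = create_engine("postgresql://user:password@postgres:5432/warehouse")

vehicles.to_sql("vehicles", engine, if_exists="replace", index=False)
trajectories.to_sql("trajectories", engine, if_exists="replace", index=False)
```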

Next up, we have the dbt transformation scripts. Once these are prepared, they will be run using Airflow. They operate on the data in the database and create new, transformed views of it. These new views can be saved back to the database or fed directly to some other process; for now, I will be saving them to the database. A rough sketch of how Airflow could tie the loading and transformation steps together follows.
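This is only an illustrative Airflow DAG, not the project's actual one: the task ids, schedule, image name, and shell commands are all assumptions.

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="warehouse_pipeline",
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",  # assumed schedule
    catchup=False,
) as dag:
    # Run the containerized extraction/loading module (placeholder image name).
    load = BashOperator(
        task_id="extract_and_load",
        bash_command="docker run --rm extraction-loader:latest",
    )
    # Build the dbt models that create the transformed views in the database.
    transform = BashOperator(
        task_id="dbt_run",
        bash_command="dbt run --project-dir /opt/dbt --profiles-dir /opt/dbt",
    )
    load >> transform
```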

The final component is Redash. It will help in creating dashboards that visualize the different kinds of insights we draw from the data and from the transformations created by dbt, allowing anyone to better understand the data.
