OLIST E-Commerce Reporting Dashboard

Note

The Power BI eXchange file and data used in this blog can be found in this GitHub link

Overview Link to heading

In today’s data-driven world, leveraging data for informed decision-making is crucial. A user-friendly dashboard plays a vital role in achieving this goal. If you’re new to Power BI or looking to refine your skills, I recommend checking out my previous blog post, Power BI Tutorial: Tips for Clear Data Visualizations. In that post, I covered essential principles of designing effective Power BI dashboards, common issues to avoid, and practical tips to enhance your visualizations. It’s a great resource to get a comprehensive understanding of Power BI’s capabilities and best practices.

In this blog we will explore how to build a Power BI dashboard using OLIST e-commerce database. We collected the datasets from Kaggle

Before we get started, here is what the final dashboard looks like:

Data Source: A Peek into the Dataset Link to heading

OLIST is the largest department store in the Brazilian e-commerce marketplace. Their publicly available database contains retail datasets of 100k orders placed on Olist, spanning from October 2016 to September 2018 across several states.

To summarize, the data covers order details, customer details, product details, seller details, payment details, geolocation details, and review details. It includes information on prices, orders, order statuses, payments, freight, and user reviews, among many other parameters. It consists of almost 100,000 customer IDs and order IDs.

The database schema model is shown below to help understand how the data interrelate with each other.

Entitiy-relationship model diagram of Olist Dataset

Data Quality: Exploratory Data Analysis (EDA) in Python Link to heading

Before diving into Power BI, it’s crucial to familiarize ourselves with the data. I typically start by reviewing the data dictionary, and a quick glance at the CSV or Excel file. Then, I get to my favorite part: Exploratory Data Analysis (EDA) in Python. EDA helps uncover insights about the distribution of incidents, identify patterns, and categorize the data. In this case, we’re working with nine different CSV files.

First, we’ll assess data quality to gain a comprehensive understanding of our dataset. This involves examining various attributes and planning our next steps. While large datasets can be daunting, Python makes this process more manageable.

To profile the data, we can either use a profiling library or write custom Python code. Here’s a quick example using the Pandas, and Pandas Profiling libraries:

Custom Pandas Profiling

# Summary of the DataFrame
df.info()

# Descriptive statistics
df.describe()

# Check for missing values
df.isnull()

# Data types of each column
df.dtypes

# Correlation matrix
df.corr()

Pandas Profiling

import pandas as pd
from ydata_profiling import ProfileReport

df = pd.read_csv("olist_customers_dataset.csv")
profile = ProfileReport(df, title="Profiling Report")
profile.to_file("olist_customers.html")

The Pandas Profiling makes life super easy. The HTML file you’ll get contains all the crucial details you need. Just read through it to figure out the next steps, paying attention to any missing, inaccurate, or invalid data.

In the Olist datasets, the profiling report showed:

1. Some missing or null values. 2. Incorrect data formats.

Luckily, this dataset seems pretty tidy and ready for visualization. But if it was messy (with outliers or missing values), I would clean it up in Python first to make sure it’s reliable before jumping into visualization.

With the data in good shape, we import it into Power BI and get started on the visualizations.

Data modeling Link to heading

Once we loaded the datasets into PowerBI, we can see that the raw data just has tables and keys referring to each file. So, we need to preprocess it to get some meaningful insights. Here’s what the initial data model looks like:

At this stage, the tables aren’t connected properly, which limits our ability to leverage the data. For example, the olist_geolocation_dataset and product_category_names tables aren’t linked at all. Also, Power BI requires explicit time data to manage time-related aspects across tables effectively. It’s crucial to establish the correct relationships between tables, and of course, the variables need to be defined accurately as numerical, geographical, or textual data.

The table below shows the entity relationships used in our analysis:

Entity	Relationship Type	Related Entity	Key
Sellers	One-to-Many	Geolocation	geolocation_zip_code_prefix
Sellers	One-to-Many	Orders	orderID
Sellers	Many-to-One	Brazil State	StateID
Brazil State	One-to-Many	Customers & sellers	StateID
Geolocation	Many-to-One	Sellers	zip code prefix
Geolocation	Many-to-Many	Customers	zip code prefix
Customers	One-to-One	Order Dates	customerID
Products	Many-to-One	Product Category	ProductID
Products	One-to-Many	Orders	ProductID
Orders	Many-to-One	Order Dates	OrderID
Order Dates	One-to-Many	Payments	OrderID
Order Dates	One-to-Many	Reviews	ReviewID
Order Dates	Many-to-One	Weekday	Date

After linking these tables properly, we get a well-structured data model that looks something like this:

DAX Formulas Link to heading

DAX, or Data Analysis Expressions, is a formula language used in Microsoft Power BI, Excel, and SQL Server Analysis Services (SSAS). It’s designed for data modeling and analysis, allowing users to create custom calculations on data models

In our report, we have developed multiple DAX formulas to enhance the functionality of our dashboard. This section will explore key tables in our dataset, which have been renamed for easier reference. This will help us build a comprehensive data model optimized for analytics.

Orders Table Link to heading

At the heart of our data model is the Orders table. It’s the most crucial component, detailing customer purchases of specific products. This table contains primary keys that are essential for extracting insights. I’ve created two DAX formulas: % of Sales to show the profit ratio, and Qt Ordered to count unique orders.

% of Sales = DIVIDE([price], SUM([price]), 0)

qt_order_id_ordered = DISTINCTCOUNT(orders[order_id])

We also introduced several other formulas to provide deeper insights:

% Delivered = 
DIVIDE(
    (CALCULATE(COUNT(orders_dates[delivery_indicator]),
     orders_dates[delivery_indicator] ="on time" || orders_dates[delivery_indicator] = "in advance")),
    COUNT(orders_dates[delivery_indicator]))

AVG Price = AVERAGEA(orders[price])

count orders = COUNT(orders[order_id])

count orders(unique) = DISTINCTCOUNT('orders_dates'[order_id])

Geolocation Table Link to heading

The Geolocation table, now integrated into our data model, links geolocation information with customers and sellers. This connection allows us to derive postal codes, cities, and states for deeper insights. Additionally, we’ve created a hierarchy to enable drilling down from country to state to city.

Order dates Link to heading

This table enables chronological data analysis. In the orders_date table, we can track the time difference between orders and group them by order date to obtain indicators such as deliveries or purchases. We can drill down from year to quarter, month, and day. I’ve created a delivery_days DAX expression to calculate the difference in days between estimated and actual delivery dates.

delivery_days = DATEDIFF('orders_dates'[order_estimated_DATE],'orders_dates'[order_delivered_DATE],DAY)

Additionally, there’s a delivery indicator DAX formula to categorize whether the delivery was made before, on time, or after the estimated date. The formula is:

delivery_indicator = IF('orders_dates'[delivery_days]>0,"In advance", 
    IF('orders_dates'[delivery_days]=0,
    "On time",
    "Late"
    ))

Together, these expressions allow us to assess the total number of days each delivery took and categorize the delivery performance effectively.

I’ve also added time_day and time_hour expressions to differentiate approval, delivery, and orders in days and hours, respectively.

time_day (approved vs delivered) = DATEDIFF(orders_dates[order_approved_at],orders_dates[order_delivered_customer_date],DAY)

Similarly, the time_hour expression will allow us to track the time difference in hours for more granular analysis of the approval and delivery process.

time_hour (approved vs delivered) = DATEDIFF('orders_dates'[order_approved_at], 'orders_dates'[order_delivered_customer_date], HOUR)

Review Table Link to heading

To generate meaningful product insights, I’ve added the review_indicator to determine if a comment was made after a purchase. This helps us understand customer sentiments and whether they share their experiences. The review_indicator categorizes reviews based on the presence of keywords, enabling us to identify common product descriptors.

review_indicator = IF(reviews[review_comment_message]=="" || reviews[review_comment_message]=="-","No Comment","With Comment")

We also introduced the average rating calculation:

AVG Rating = AVERAGEA(reviews[review_score])

Brazil State Table Link to heading

This new table includes state IDs and city names, and is modeled in a one-to-many relationship with both the Sellers and Customers tables. It enables us to focus our analyses within Brazil, leveraging its rich dataset while excluding data from other regions for a more detailed and descriptive analysis.

Additional DAX Formulas Link to heading

We further enhanced our analysis with:

Unique Costumer Counts

count customer(unique) = DISTINCTCOUNT(customers[customer_unique_id])

Costumer Counts

customers count = DISTINCTCOUNT(customers[customer_id])

Early Deliveries

delivery less 19 days = 
COUNTROWS(FILTER('orders_dates', 'orders_dates'[delivery_days] < -19))

Late Deliveries

delivery more than 100 days =
COUNTROWS(FILTER('orders_dates', 'orders_dates'[delivery_days] >= 100))

Single Item Order Ratio

orders with an item = 
VAR um = CALCULATE(DISTINCTCOUNT(orders[order_id]),FILTER(orders, [qt order id ordered]=1))
RETURN (
    DIVIDE(um, [count orders(unique)])
)

Avg Payment per Cat

payment_value average per product_category_name_english = 
AVERAGEX(
    KEEPFILTERS(VALUES('product category name translation'[product_category_name_english])),
    CALCULATE(SUM('payments'[payment_value]))
)

Total Price by Cat

price total for product_category_name_english = 
CALCULATE(
    SUM('orders'[price]),
    ALLSELECTED('product category name translation'[product_category_name_english])
)

Unique Customers Ordered

qt of customers ordered =
COUNTROWS(SELECTCOLUMNS('orders_dates', "customer_id", DISTINCT('orders_dates'[customer_id])))

Recurrent Customer Count

recurrent count = CALCULATE(
    DISTINCTCOUNT('orders_dates'[customer_unique_id]), 
    FILTER('orders_dates', 'orders_dates'[recurrent] = "YES"), 
    'orders_dates'[order_status] IN {"delivered", "shipped"}
)

Recurrent Customer Ratio

recurrent customers ratio = CALCULATE(DIVIDE([recurrent count],[count customer(unique)]))

Total Grand Orders

Total Grand Orders = CALCULATE(COUNT(orders[order_id]), ALL(Orders))

Total Grand Sales

Total Grand Sales = CALCULATE(SUM(payments[payment_value]), ALL(payments))

Total Sales

Total sales = 
SUM('orders'[price]) + SUM('orders'[freight_value])

Executive Insights Dashboard Link to heading

When it comes to answering “How are we performing?” it often leads to a series of follow-up questions, especially for global companies. To tackle this, I designed the first report to anticipates these inquiries, promoting data-driven decisions with a flexible and visually engaging dashboard.

The Executive Insights page focuses on sales and customer data, aiming to meet goals, enhance customer satisfaction, and boost sales by revealing insights. We used three unique formulas to build this dashboard:

Total sales = SUM('orders'[price]) + SUM('orders'[freight_value])

count customer(unique ) = DISTINCTCOUNT(customers[customer_unique_id])

count orders = COUNT(orders[order_id])

Revisiting the figures, we see that São Paulo consistently leads in total sales. Despite a rapid increase in sales over the past three years, the waterfall graph shows no decline in values.

Interestingly, both late and early deliveries are significant. There are five orders that took more than 100 days to deliver, clearly impacting customer satisfaction.

Descriptive Analytics Dashboard Link to heading

Now, let’s dive into descriptive analytics. You can select a specific day, month, or year using the slicer at the top.

The table graph categorizes products by average price, total revenue, profit ratio, and quantities ordered. The “health and beauty” category tops the list, generating $1.2M in revenue with a 9% profit ratio. This is no surprise, as fashion spending is high among females.

Also, on this page, we can see approximately $16 million in sales, 94,000 customers served, and over 111,000 orders processed.

Customer Investigation Dashboard Link to heading

Every business needs to know about new customers. Customers drive growth, and without them, sustaining progress is tough. This page highlights nearly 100,000 customers. Looking at the visuals, most new customers join between May and August over three years. Preferred payment methods are credit card and boleto payment.

Customer Satisfaction Dashboard Link to heading

In this dashboard, the average rating is positive, at 4 stars from about 99.5k customers. However, December shows a drop in orders after a slight increase three months prior.

The top five categories have about 40k orders out of 99k total, while the bottom five haven’t exceeded 60 orders. This shows a clear difference in product popularity.

The recurrent customer ratio dropped by 1% compared to the previous two years, likely due to a surge in orders last year. There’s no solid evidence of declining loyalty. Here’s the reference to the formula used for calculating the recurrent customer ratio:

recurrent customers ratio = CALCULATE(DIVIDE([recurrent count], [count customer(unique )]))

Delivery Days Dashboard Link to heading

The delivery page provides insights into the delivery process from seller to customer. The freight value seems reasonable, with 96.4 orders delivered out of the total, even with cancellations.

The average review score is lower in the first quarter, possibly due to longer delivery times. In contrast, the third quarter has higher review scores despite shorter delivery times.

Forecasting Dashboard Link to heading

The forecast looks promising, with the higher order quantities in all three years and optimistic predictions for new customer acquisitions.

Insights Link to heading

Olist has a delivery success rate of approximately 85%. This indicates some challenges within the delivery process, necessitating an investigation into the causes of undelivered orders and efforts to improve this rate. Potential areas for investigation include:

Efficiency of logistics and fulfillment processes
Reliability of transportation partners
Issues with the quality or accuracy of orders being placed

Olist enjoys high overall customer satisfaction, evidenced by numerous positive reviews and high scores. However, the “Security and Services” product category receives the lowest ratings, suggesting a need for improvement in this area.

Recommendations Link to heading

Monitor and analyze customer reviews regularly to identify trends and areas for improvement. This could involve using data analysis tools to identify common themes in customer feedback and using this information to make changes and improve the customer experience.
Investigate Undelivered Orders: Analyze the undelivered orders to identify patterns or factors contributing to delivery failures. Consider whether specific regions, customer demographics, or types of products/orders are more prone to delivery issues.
Communicate with customers about the delivery process. Olist should be transparent with customers about the delivery process and provide them with regular updates on the status of their orders. This will help to build trust and create a positive customer experience. It will also give customers the opportunity to provide feedback on their experiences with the delivery process, which can be used to identify areas for improvement.
Maintain Quality and Responsiveness: Continue delivering high-quality products and services while being responsive to customer feedback. Regularly monitor the delivery success rate and communicate any changes or improvements in the process to customers.

Suggestions: Link to heading

Special Offers During Low Sales Periods: Implement promotions or special offers to boost sales during traditionally low periods.
Improve Low-Selling Categories: Invest in advertisements or promotions for bottom-performing product categories to increase visibility and sales.
Outsource Drivers During Peak Periods: Hire additional drivers during sales or festival periods to ensure timely deliveries and manage increased order volumes
Review Partnerships: Investigate and potentially revise partnerships with companies that receive low review scores to ensure consistent service quality.
Analyze Customer Feedback with NLP: Utilize Natural Language Processing (NLP) models to analyze customer comments and reviews in the dataset. This can provide deeper insights into customer sentiments and highlight specific areas needing attention.

Note

The Power BI eXchange file and data used in this blog can be found in this GitHub link