• Home
  • Action Plan
  • Data
    • Data Collection
      • Tweets
      • Dataset
    • Data Exploration
      • Google Colab Code
      • Data Preprocessing
      • Handling Missing Values
      • Ensuring Formatting Consistency
      • Categorical Data Encoding
      • Handling Outliers
      • Normalization/Standardization/Scaling
      • Natural Language Processing
      • Time Series Analysis
      • Interpolation
      • Binning
    • Data Visualization
      • Types of Plots
      • Scatterplots/Histograms
      • Heat Maps
      • Bar/Swarm/Violin Plots
      • Line Graphs
    • Data Modelling
      • Data Binning
      • Topic Clustering Using LDA and t-SNE
    • Data Communication
      • Results
      • Conclusion
      • Acknowledgments
      • References
      • The Vaxplorers Team

Bar/Swarm/Violin Plots

Another attempt to visualize and interpret categorical features in our dataset was made possible using Seaborn barplot() method. This was first applied to individual nominal data columns to generate imagery that describe their values and frequency.

Dictionaries that were instantiated earlier in this step were implemented for decoding the categorical values during the plotting process and transforming the numeric points to labels that readers can comprehend.

Bar Plot

Note that during earlier process, the column of ‘Peddled Medical Adverse Side Effects’ was encoded differently than other features such that the numeric representation of some of its rows had more than one value in the form of number sequences.

To address the unique encoding of the ‘Side Effects’ feature, we created a separate bar plot for it that has more rows than the original dataframe to accommodate entries that fall under more than one category 

Unlike other features that may take more than label, the ‘Side Effects’ column must be specific for each instance of mentioned side effect since it is indicative of underlying context that we need to discover for our research question.

Again, decoding dictionaries were put to use upon plotting the total counts for the labels in this column, which were obtained using value_counts() and were recorded in a new column in the dataframe copy passed to the function.

Referring again to the encoding step done beforehand, recall that some nominal features have inherent sub-labels pertinent to specific local cities and international locations for ‘Location’, and particular side effects for ‘Peddled Medical Adverse Side Effect’.

To get a more comprehensive visualization of our dataset, we also made bar plots for these sub-labels: 

Extended Bar Plot

Lastly, for bar plots, time-dependent features were also visualized using their fixed bins values that were obtained in former segments of exploration. Using the same Seaborn method, we were able to get a representation of value intervals for these features:

Extended Bar Plot

Line Graphs: Time Series

Seeing the significance of time series analysis for this research endeavor, we also applied the interpolated values of our data points to arrive at an imagery of trends pertinent to posting time of tweets.

We wanted to identify the trends for hours of the day and days of the week during which the tweets were posted over the course of the data gathering window.

From the DateTime dtype features, we extracted the values that correspond to hours and days. Later, they were added to new columns and used as pivots for counting tweets that match with the same posting DateTime values. 

With the aid of Matplotlib, the labels of the axes were also adjusted accordingly. 

Line Graph via Time Series

@ VaccineVerity 2023. All Rights Reserved.