To execute this stage of the project efficiently, we first reviewed the obtained data types and identified the data levels present, since these help determine the appropriate exploration methods.
We began the preprocessing stage by resizing the collected, organized data: dropping empty rows and negligible columns.
Rows with empty values in the 'Timestamp (DD/MM/YY H: M:S)' column were deduced to be entirely blank rows.
The columns included in the dropped list are:
['ID', 'Timestamp (DD/MM/YY H: M:S)', 'Tweet URL', 'Group', 'Collector', 'Category', 'Topic', 'Screenshot', 'Reviewer', 'Review'].
These columns were deemed negligible because their values were used only to review or identify row entries, not to classify them; thus, they have no bearing on the research question.
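A minimal sketch of this trimming step, assuming the data has been loaded into a pandas DataFrame named df (the file name here is only illustrative):

```python
import pandas as pd

# Load the collected, organized data (file name is hypothetical).
df = pd.read_csv("collected_tweets.csv")

# Rows with an empty timestamp were deduced to be entirely blank rows.
df = df.dropna(subset=["Timestamp (DD/MM/YY H: M:S)"])

# Columns used only to review or identify entries are dropped.
negligible_columns = [
    "ID", "Timestamp (DD/MM/YY H: M:S)", "Tweet URL", "Group", "Collector",
    "Category", "Topic", "Screenshot", "Reviewer", "Review",
]
df = df.drop(columns=negligible_columns)

print(df.shape)  # expected: (155, 32)
```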
After this step, we arrived at a more refined, structured dataset with 32 columns and 155 rows.
The data types of the remaining 32 columns are as follows:
Qualitative Data – ['Keywords', 'Account Handle', 'Account Name',
'Account Bio', 'Account Type', 'Location', 'Tweet', 'Tweet Translated', 'Tweet Type', 'Content Type',
'Rating', 'Reasoning', 'Remarks', 'Mentioned or Referenced COVID-19 Vaccine Brand', 'Mentioned or Referenced
Other Vaccine/Drugs', 'Peddled Medical Adverse Side Effect', 'Distrust in Vaccine Development', 'Racial,
Religious, Cultural, Economic, or Socio-Political Keywords', 'Oppressive Keywords']
Quantitative Data (Discrete) – ['Joined (MM/YYYY)', 'Date Posted
(DD/MM/YY H:M:S)', 'Following', 'Followers', 'Likes', 'Replies', 'Retweets', 'Quote Tweets', 'Views', 'No.
of Days since Philippines Joined the COVAX Facility (Jul 24, 2020)', 'No. of Days since FDA Approved the
First COVID-19 Vaccine (Dec 11, 2020)', 'No. of Days since Arrival of First Batch of COVID-19 Vaccine Doses
Committed by the COVAX Facility (Mar 04, 2021)', 'No. of Days since First Detected Cases of Omicron Variant
in the Philippines (Aug 02, 2022)']
Quantitative Data (Continuous) – [ ]
Additionally, the existing columns and their levels are classified into:
Nominal Level – ['Account Type', 'Location', 'Tweet Type', 'Content
Type', 'Rating', 'Mentioned or Referenced COVID-19 Vaccine Brand', 'Mentioned or Referenced Other
Vaccine/Drugs', 'Peddled Medical Adverse Side Effect', 'Distrust in Vaccine Development']
Ordinal Level – []
Interval Level – ['Joined (MM/YYYY)', 'Date Posted (DD/MM/YY H:M:S)',
'No. of Days since Philippines Joined the COVAX Facility (Jul 24, 2020)', 'No. of Days since FDA Approved
the First COVID-19 Vaccine (Dec 11, 2020)', 'No. of Days since Arrival of First Batch of COVID-19 Vaccine
Doses Committed by the COVAX Facility (Mar 04, 2021)', 'No. of Days since First Detected Cases of Omicron
Variant in the Philippines (Aug 02, 2022)']
Notes: Pandas Timestamp and Python datetime objects are largely interchangeable; a Timestamp internally represents a date as an integer (nanoseconds since the Unix epoch).
The 'No. of Days' values in the columns listed above may be negative, which means the tweet was posted before the specified reference date. Since this can be the case, these measurements have no true zero and are therefore interval-level.
Ratio Level – ['Following', 'Followers', 'Likes', 'Replies',
'Retweets', 'Quote Tweets', 'Views']
Textual Data – ['Keywords', 'Account Handle', 'Account Name',
'Account Bio', 'Tweet', 'Tweet Translated', 'Reasoning', 'Remarks', 'Racial, Religious, Cultural, Economic,
or Socio-Political Keywords', 'Oppressive Keywords']
Using the Pandas DataFrame.info() method, we verified that the number of cells with missing values matched our expectations.
To detect and handle the missing values, the function missing_values_handler() was utilized:
Line 40 initializes a list of columns that have missing values using the DataFrame.isna().any() method:
['Account Bio', 'Tweet Translated', 'Quote Tweets', 'Views', 'Remarks', 'Mentioned or Referenced
COVID-19 Vaccine Brand', 'Mentioned or Referenced Other Vaccine/Drugs', 'Racial, Religious, Cultural,
Economic, or Socio-Political Keywords', 'Oppressive Keywords'].
The values of the 'Tweet Translated', 'Quote Tweets', and 'Views' columns depend on the tweet itself, so they are expectedly empty when the corresponding characteristic was not observed.
The rest of the columns in the list above are either optional or have an inherent 'None' choice and were recorded as such during data collection.
Line 49 deals with the missing values of ‘Account Bio’, ‘Tweet Translated’, ‘Remarks’, 'Racial,
Religious, Cultural, Economic, or Socio-Political Keywords', and 'Oppressive Keywords' using replacement
with an arbitrary zero represented by an empty string.
This method was used because these columns contain unstructured textual data and will be further refined
later through natural language processing.
Line 50 handles the missing values of 'Mentioned or Referenced COVID-19 Vaccine Brand' and 'Mentioned or Referenced Other Vaccine/Drugs' using replacement with an arbitrary zero represented by an integer. This approach was applied because these columns contain nominal data that will later be encoded into numerical values.
Line 51, for thoroughness, applies the same technique as Line 50 to the 'Quote Tweets' and 'Views' columns, which belong to the ratio data level. Although these tweet characteristics are numerical counts, mean imputation was not applied, to avoid introducing bias, since these are optional fields.
Finally, the total number of missing values in each column can be
summarized and confirmed using the output of the method df.isnull().sum().
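Based on the description above, missing_values_handler() might look roughly like the following sketch (the grouping of columns mirrors the prose; it is not a verbatim copy of our script):

```python
def missing_values_handler(df):
    """Detect and impute missing values as described above (sketch)."""
    # Columns flagged as having missing values via DataFrame.isna().any().
    cols_with_missing = df.columns[df.isna().any()].tolist()
    print(cols_with_missing)

    # Unstructured textual columns: arbitrary zero represented by an empty string.
    text_cols = [
        "Account Bio", "Tweet Translated", "Remarks",
        "Racial, Religious, Cultural, Economic, or Socio-Political Keywords",
        "Oppressive Keywords",
    ]
    df[text_cols] = df[text_cols].fillna("")

    # Nominal columns to be encoded later: arbitrary zero represented by an integer.
    nominal_cols = [
        "Mentioned or Referenced COVID-19 Vaccine Brand",
        "Mentioned or Referenced Other Vaccine/Drugs",
    ]
    df[nominal_cols] = df[nominal_cols].fillna(0)

    # Optional ratio-level counts: zero instead of mean imputation to avoid bias.
    df[["Quote Tweets", "Views"]] = df[["Quote Tweets", "Views"]].fillna(0)

    # Confirm that no missing values remain.
    print(df.isnull().sum())
    return df
```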
Following the imputation steps, the info() method reveals discrepancies in the data types of the columns.
Many columns have the object dtype, meaning they might contain strings or mixed types of data. Furthermore, some columns that are expected to contain numerical values have the object dtype instead.
This needs to be resolved before treating possible outliers, because the values must be uniform for statistical operations to be applicable.
The function formatting_handler() was used to address formatting inconsistencies in
the data:
Lines 59-63 initialize lists that group all columns into their
appropriate data level classification.
Lines 66-79 then transmute the non-conforming values accordingly. The following conditions were
prioritized for this method:
As recommended by Pandas documentation, StringDtype is preferred over object
dtype because the latter can accidentally store non-strings. Thus, instances of object dtype columns that strictly contain string literals were
transformed.
Records of dates and timestamps were also initially stored as object dtype and were thus converted to a timezone-aware datetime dtype (DatetimeTZDtype).
Columns that have object dtype and must be comprised of numerical integer values, though already
compliant, were still converted to int64 data type as a
precaution.
Columns that must contain numerical float values, meanwhile, were stripped of comma separators before
converting to float64 data type.
Line 82 handles the remaining columns of unstructured textual data by
transforming them into StringDtype for later language processing.
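The conversions above could be sketched as follows; the column groupings, abbreviated here, and the timezone choice are illustrative assumptions:

```python
import pandas as pd

def formatting_handler(df):
    """Coerce columns to the data types described above (sketch, abbreviated)."""
    string_cols = ["Account Handle", "Account Name", "Account Type", "Location"]
    datetime_cols = ["Joined (MM/YYYY)", "Date Posted (DD/MM/YY H:M:S)"]
    integer_cols = ["Likes", "Replies", "Retweets"]
    float_cols = ["Following", "Followers", "Views"]

    # object -> StringDtype for columns that strictly contain string literals.
    df[string_cols] = df[string_cols].astype("string")

    # object -> timezone-aware datetime (timezone choice is an assumption).
    for col in datetime_cols:
        df[col] = pd.to_datetime(df[col], dayfirst=True).dt.tz_localize("Asia/Manila")

    # object -> int64 as a precaution for already-compliant integer columns.
    df[integer_cols] = df[integer_cols].astype("int64")

    # Strip comma separators before converting to float64.
    for col in float_cols:
        df[col] = (
            df[col].astype(str).str.replace(",", "", regex=False).astype("float64")
        )

    # Remaining unstructured textual columns -> StringDtype for later NLP.
    remaining = df.select_dtypes(include="object").columns
    df[remaining] = df[remaining].astype("string")
    return df
```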
Finally, the info() method was used again to check that all columns have the appropriate data types and that the date formatting requirements were satisfied.
From the previous preprocessing steps, we identified which columns contain nominal data and therefore require categorical encoding:
['Account Type', 'Location', 'Tweet Type', 'Content Type', 'Rating', 'Mentioned or Referenced COVID-19
Vaccine Brand', 'Mentioned or Referenced Other Vaccine/Drugs', 'Peddled Medical Adverse Side Effect',
'Distrust in Vaccine Development']
Global Python dictionaries were first initialized for encoding the data and, later, decoding it during the analysis stage.
Apart from the dictionaries, empty lists were also created as 'catch basins' for optional subcategories included in the categorical data. These are transformed into pandas Series objects for analysis afterward.
The function categ_data_encoder()
assigns integer values to the categorical data using the appropriate dictionary:
A for-loop that iterates over the columns with nominal data and
accesses their values was the main approach used for this step of numerical encoding.
Since dictionaries were already available for mapping the categories, the get() method allowed us to obtain the corresponding integer values.
Some columns required additional string manipulation to extract the data, like the ‘Location’
characteristic, because it contains an optional substring field for ‘city name’ that cannot be explicitly
integrated into category labels.
As for columns that may have more than one category, such as the
‘Content Type’, we used string concatenation to represent a sequence of one-digit integer values, each
corresponding to a label.
Lastly, for nominal columns that might contain more than two subcategories (e.g., 'Peddled Medical Adverse Side Effect'), we handled the encoding with conditional statements and by manipulating the obtained sublists.
Since only the primary category labels were encoded, the optional subcategories were stored in the empty lists instantiated earlier. After converting these lists to pandas Series, the textual data can be used as variables in natural language processing and other later steps.
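A simplified, self-contained sketch of the dictionary-based mapping used by categ_data_encoder(); the mapping values and the toy data are illustrative only:

```python
import pandas as pd

# Illustrative global mapping dictionary; actual label-to-integer assignments differ.
ACCOUNT_TYPE_MAP = {"Personal": 1, "Organizational": 2, "Media": 3}

def encode_nominal_column(df, column, mapping):
    """Replace nominal labels in one column with integer codes via dict.get()."""
    # get() returns a default (0 here) for labels outside the mapping.
    df[column] = df[column].map(lambda label: mapping.get(label, 0)).astype("int64")
    return df

# Toy usage; the real categ_data_encoder() loops over every nominal column.
toy = pd.DataFrame({"Account Type": ["Personal", "Media", "Personal"]})
toy = encode_nominal_column(toy, "Account Type", ACCOUNT_TYPE_MAP)
print(toy["Account Type"].dtype)  # int64
```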
Upon completing this step, we also ensured that the data types of the encoded columns are numerical in the form of int64. The info() method likewise confirms that these columns no longer contain string literals.
To detect outliers, we computed z-scores and identified data points with z-scores greater than +3 or less than -3.
With the aid of the function outliers_handler(), separate dataframes were created for the columns pertaining to temporal, interval, and ratio data to store the characteristics of the outlier data points themselves.
For-loops that iterate through the data features serve as the primary tool for obtaining the necessary statistical values.
The function calculates the mean and standard deviation, which are used to compute the individual z-scores to be compared.
It then adds the verified outliers and their z-scores to a list for the current column; this is repeated for each characteristic that contains temporal data.
These steps were also applied to the other data-level groups, interval and ratio, and the results were stored in separate dictionaries.
Once the function is done processing, it creates a DataFrame from each dictionary using the built-in pandas constructor.
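A hedged sketch of this z-score detection loop; the grouping argument follows the description above rather than the exact script, and temporal columns are assumed to have been converted to numeric values beforehand:

```python
import pandas as pd

def outliers_handler(df, column_groups, threshold=3.0):
    """Collect data points whose z-scores fall outside +/- threshold, per group."""
    outlier_frames = {}
    for group_name, columns in column_groups.items():
        group_outliers = {}
        for col in columns:
            mean, std = df[col].mean(), df[col].std()
            z_scores = (df[col] - mean) / std
            mask = z_scores.abs() > threshold
            # Store each verified outlier together with its z-score.
            group_outliers[col] = list(zip(df.loc[mask, col], z_scores[mask]))
        # One DataFrame per data-level group, built from the dictionary.
        outlier_frames[group_name] = pd.DataFrame.from_dict(group_outliers, orient="index")
    return outlier_frames
```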
We used print() as a straightforward method to check the obtained
outliers.
From the output, no outliers were detected in any of the columns under the temporal and interval data groups. As for the ratio data, we found 1 to 2 outliers with significantly higher z-scores in most of the columns, and 7 outliers in the 'Following' column.
We deduced that they are heavily skewing the mean and standard deviation. However, we also concluded that they are not representative of the whole data.
Most of the detected outliers are unique individual cases dependent on the Twitter account owner, i.e., the tweet was posted by an account that has a lot of influence (follower/following counts).
As a solution, winsorization to fixed thresholds was applied. We opted not to drop the rows because our research question puts more emphasis on the context of the tweets, not on the accounts that posted them.
The for-loops that iterate through the columns were subsequently modified to execute the adjustments to
the outlier values:
The auxiliary function winsorization_technique() defines a lower and an upper threshold for the dataset based on the 25th and 75th percentiles.
A multiplier (2 in our case) is chosen and multiplied by the interquartile range (IQR), which represents the spread of the dataset.
Outliers below the lower threshold are replaced with that threshold, i.e., the 25th percentile minus the multiplier times the IQR. Outliers above the upper threshold are replaced with the 75th percentile plus the multiplier times the IQR.
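A sketch of this capping rule; the function name matches the prose, while the exact interface is assumed:

```python
import numpy as np

def winsorization_technique(values, multiplier=2.0):
    """Cap values at IQR-based lower and upper thresholds (sketch)."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1  # spread of the dataset
    lower_threshold = q1 - multiplier * iqr
    upper_threshold = q3 + multiplier * iqr
    # Values beyond the thresholds are replaced with the thresholds themselves.
    return np.clip(values, lower_threshold, upper_threshold)
```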
Upon executing the modified program, we observed significant differences between the mean and standard deviation values of the old output and those of the new output.
The z-scores of the recorded outliers were also recomputed, confirming that they now fall within the acceptable range.
In preparation for the machine learning steps, we applied four scaling methods to our dataset using iterative programming techniques and some of Python's computational libraries. To emphasize, these methods were applied only to data features with inherent numeric values.
We started with the data_min_max_scaler() function, which implements what is more commonly known as min-max normalization:
It takes the minimum and maximum values of a specific data feature and uses these to scale each data point to a range between 0 and 1.
To get the normalized value of x, the minimum value is subtracted from it and the result is divided by the range, i.e., the difference between the maximum and minimum values.
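A minimal sketch of this formula; the actual data_min_max_scaler() additionally iterates over all numeric columns:

```python
import pandas as pd

def min_max_scale(series: pd.Series) -> pd.Series:
    """Scale values to [0, 1] using x' = (x - min) / (max - min)."""
    minimum, maximum = series.min(), series.max()
    return (series - minimum) / (maximum - minimum)

print(min_max_scale(pd.Series([10, 20, 40])).tolist())  # [0.0, 0.333..., 1.0]
```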
Following normalization, we also implemented z-score normalization, or standardization, using the function data_standardizer():
The z-score computation was already explained in previous sections. A key detail of standardization, however, is that it assumes a normally distributed dataset.
To check this prerequisite, we applied the Shapiro-Wilk test via the shapiro function from SciPy, which returns a p-value that can be compared against a significance level to assess normality.
Upon running the program, we found that our dataset is not normally distributed for the features in the
interval_data_columns and rational_data_columns.
For comprehensiveness, however, we still opted to standardize our dataset for future reference.
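A hedged sketch combining the Shapiro-Wilk check with the standardization itself (the 0.05 significance level is an assumption):

```python
import pandas as pd
from scipy.stats import shapiro

def standardize(series: pd.Series, alpha: float = 0.05) -> pd.Series:
    """Report Shapiro-Wilk normality, then return the z-score-standardized values."""
    statistic, p_value = shapiro(series)
    if p_value < alpha:
        print(f"{series.name}: not normally distributed (p = {p_value:.4f})")
    return (series - series.mean()) / series.std()
```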
Given that our dataset does not follow a normal distribution, we decided to apply a power transformation to address possible skewness or heavy tails using data_power_transformer():
The Yeo-Johnson power transformation, also taken from the SciPy library, uses a lambda parameter to rescale the data points.
We chose this method because, unlike its more restrictive counterpart, the Box-Cox method, it can handle both the positive and negative values present in our dataset.
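A sketch using scipy.stats.yeojohnson, which estimates the lambda parameter by maximum likelihood when it is not supplied:

```python
import numpy as np
from scipy.stats import yeojohnson

def power_transform(values):
    """Apply the Yeo-Johnson transform; handles positive and negative values alike."""
    transformed, fitted_lambda = yeojohnson(np.asarray(values, dtype=float))
    return transformed, fitted_lambda

transformed, lam = power_transform([-120, -3, 0, 4, 250, 10000])
print(lam)  # estimated lambda parameter
```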
Lastly, data_unit_vector_scaler() is a function that scales the features of our dataset to have a consistent magnitude of 1. We concluded that this additional step will be useful later for feature analysis.
Unlike min-max scaling, this approach normalizes the magnitude of vectors using the Euclidean norm, computed with the aid of NumPy's built-in method.
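A sketch of unit-vector scaling with the Euclidean (L2) norm from numpy.linalg.norm:

```python
import numpy as np

def unit_vector_scale(values):
    """Scale a feature vector so that its Euclidean (L2) norm equals 1."""
    vector = np.asarray(values, dtype=float)
    norm = np.linalg.norm(vector)  # Euclidean norm
    return vector / norm if norm != 0 else vector

scaled = unit_vector_scale([3.0, 4.0])
print(scaled, np.linalg.norm(scaled))  # [0.6 0.8] 1.0
```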
The four functions introduced above were applied to transform four copies of the original dataframe.
Subsequently, print() can be used to immediately check that the data points in these copies are scaled within the expected ranges and magnitudes.
Our proposed research question and hypotheses require us to derive conclusions from the contextual perspective of the gathered tweets. As such, we employed natural language processing (NLP) to meet this requirement.
The preliminary step converts the DataFrame into a list of texts:
To prepare for NLP, the 'Tweet' column is isolated and the rows with NaN values are dropped using the .loc indexer and the dropna() method, respectively. The result is then converted to a list using the values attribute and the tolist() method.
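A self-contained sketch of this conversion; the toy dataframe stands in for the preprocessed data:

```python
import pandas as pd

# Toy dataframe standing in for the preprocessed dataset.
df = pd.DataFrame({"Tweet": ["Sample tweet one", None, "Sample tweet two"]})

# Isolate the 'Tweet' column, drop NaN rows, then convert to a plain Python list.
tweets = df.loc[:, "Tweet"].dropna().values.tolist()
print(tweets)  # ['Sample tweet one', 'Sample tweet two']
```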
This step is crucial in preparing the dataset for NLP.
Properly handling the emojis allows for easier processing of the data; simply removing them from the text is not enough, since they may carry important details about the text.
Another vital task before proceeding to NLP is translating the text to English, which standardizes the format and enables stop-word removal, one of the steps in NLP.
These steps also allow us to standardize the texts and remove irrelevant characters from them. By doing so, we avoid counting the same words twice (e.g., "Data Science" is treated the same as "data science").
Tokenization transforms the raw texts into smaller chunks of data that a computer can easily process, while removing stop words, i.e., words that occur very frequently in English but carry little meaning (pronouns, articles, etc.).
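One possible sketch of these cleaning and tokenization steps, assuming NLTK's English stop-word list; the exact tokenizer in our script may differ, so a simple regular-expression tokenizer is shown here:

```python
import re

import nltk
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)  # one-time download of the stop-word corpus

def tokenize_and_clean(text):
    """Lowercase, split into word tokens, and drop English stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    stop_words = set(stopwords.words("english"))
    return [tok for tok in tokens if tok not in stop_words]

print(tokenize_and_clean("Data Science is the same as data science!"))
# ['data', 'science', 'data', 'science']
```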
Stemming refers to the process of removing the last few characters of a word in order to extract the root word. In this way, we can further reduce the number of words to process, given that the word "exploration" is treated the same as "explore".
However, the problem with stemming is that it can yield inaccurate or unclear results.
Using the same example as earlier, stemming "exploration" results in "explor". This is where lemmatization comes in. Lemmatization is quite similar to stemming in that its goal is to reduce a given word to its root; unlike stemming, however, it returns a valid dictionary form (lemma), as contrasted in the sketch below.
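A sketch contrasting the two approaches with NLTK's PorterStemmer and WordNetLemmatizer (the wordnet corpus is assumed to be downloaded; whether these exact classes were used in our script is not shown here):

```python
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("wordnet", quiet=True)  # required by the lemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Stemming truncates suffixes and may produce non-words.
print(stemmer.stem("exploration"))             # 'explor'
# Lemmatization maps a word to a valid dictionary form (lemma).
print(lemmatizer.lemmatize("exploration"))     # 'exploration'
print(lemmatizer.lemmatize("exploring", "v"))  # 'explore'
```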