What is Power Query and where is it available?
Power Query is a data cleaning and transformation tool built into Excel and Power BI that records each transformation as editable steps.
Video Summary
Power Query automates data import and transformation in Excel and Power BI; every action is recorded as editable steps.
Use Get Data → From Web to import web tables (e.g., the Olympic medal table), then promote headers, fill down shared ranks, and remove unwanted rows.
Common cleaning ops: trim spaces, change data types, replace nulls, split name columns, and use conditional columns to bucket values.
Calculate tenure by setting the start-date column to the Date type, adding an age (duration) column, and converting days into years.
Write simple M code in custom columns for bespoke logic (e.g., full-time vs part-time based on FTE).
Use Data → Get Data → From Web, paste the page URL, select the desired HTML table in the Navigator, then choose Transform Data to clean it in the Power Query Editor.
Use Fill Down to propagate shared ranks and Replace Values (enter 'null' and replace with 'missing') or other replacement rules to flag or correct nulls.
Change the start-date column to the Date type, use Add Column → Date → Age to get a duration in days, then use Transform → Duration → Total Years to convert it into years.
Queries maintain a live connection to the source; right-clicking the loaded table and selecting Refresh reruns recorded steps to fetch updated data.
"Power Query is a data cleaning and transformation software that comes prepackaged with both Excel and Power BI."
Power Query is regarded as an essential feature in Excel and Power BI because it allows users to save time and quickly access their data.
The video covers two practical examples: one scrapes data from a website, and the other connects to a file on a local network.
Both examples demonstrate Power Query features that cut down on manual data handling.
"We are going to look at the 2020 Summer Olympics medal table data, which is publicly available on Wikipedia."
The first example involves scraping the 2020 Summer Olympics medal table data from a Wikipedia page.
Copying and pasting the data into Excel is not the most efficient method; instead, using Power Query to retrieve the data dynamically allows for easy updates whenever the source changes.
In Excel, users start by copying the page URL from the web browser and pasting it into Power Query's 'Get Data From Web' dialog.
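Behind the From Web dialog, Power Query records the connection as M code. A minimal sketch of what such a query might look like follows. The URL, the use of Web.Page, and the table index 0 are assumptions for illustration; the code Excel actually generates can differ by version.

```powerquery
let
    // Assumed URL of the Wikipedia medal table page (not given in the summary)
    Url = "https://en.wikipedia.org/wiki/2020_Summer_Olympics_medal_table",
    // Download the page and list the HTML tables found on it
    Source = Web.Page(Web.Contents(Url)),
    // Index 0 is a placeholder; pick the medal table you see in the Navigator
    MedalTable = Source{0}[Data]
in
    MedalTable
```

Because the query stores the URL rather than a copy of the data, refreshing it downloads the page again.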
"The Power Query Editor is a powerful software that we use to clean up, transform, and create new kinds of data based on what we already have."
Upon accessing the Power Query Editor, users will find the interface divided into four main areas: a ribbon at the top with various buttons, a Queries pane listing the available queries, a central area that shows the data of the currently selected query, and a section displaying the steps taken to manipulate the data.
Queries in Power Query can be thought of as tables. Each action performed, such as connecting to a source or transforming data, is recorded as a step which can be revisited later.
"We would like to take this row and move it into the header area."
Users will start cleaning the data by moving the first row of data to the header, effectively replacing improper column names that appear as numbers.
Cleaning also involves removing unwanted symbols, such as the asterisk next to the host nation's name, with a 'Find and Replace' operation (Replace Values).
Power Query's automation capability is advantageous; once transformation steps are recorded, they can be reapplied to new datasets efficiently by simply refreshing the query.
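The two cleaning steps above (promoting headers and removing the asterisk) could be written in M roughly as follows. This is a sketch on a small inline table; the column names and sample rows are hypothetical stand-ins, not the real medal data.

```powerquery
let
    // Hypothetical sample shaped like a fresh import whose headers sit in row 1
    Source = #table(
        {"Column1", "Column2", "Column3"},
        {
            {"Rank", "Country", "Gold"},
            {"1", "Country A", "10"},
            {"2", "Country B*", "8"}
        }
    ),
    // Move the first row into the header area
    Promoted = Table.PromoteHeaders(Source, [PromoteAllScalars = true]),
    // Strip the asterisk that marks the host nation
    NoAsterisk = Table.ReplaceValue(Promoted, "*", "", Replacer.ReplaceText, {"Country"})
in
    NoAsterisk
```

Replacer.ReplaceText replaces the character inside a cell's text, whereas Replacer.ReplaceValue would only match cells whose entire content is "*".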
"This is where we can fill down the values to handle shared ranks between countries."
The ranks in the medal table often have blank values due to shared rankings, which will be addressed by using the 'Fill Down' feature to propagate the rank value from the preceding cell.
This process ensures that shared ranks, such as those between Greece and Uganda, are accurately reflected across the dataset.
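In M, the Fill Down step corresponds to Table.FillDown. The sketch below uses made-up rows and an assumed column name "Rank" to show the idea.

```powerquery
let
    // Hypothetical sample: the third country shares rank 2, so its rank cell is null
    Source = #table(
        {"Rank", "Country"},
        {
            {"1", "Country A"},
            {"2", "Country B"},
            {null, "Country C"}
        }
    ),
    // Copy the last non-null rank down into the null cells below it
    FilledDown = Table.FillDown(Source, {"Rank"})
in
    FilledDown
```

Table.FillDown only fills null cells. If the blanks arrive as empty text, they need to be replaced with null first.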
The video also suggests further analyses, such as calculating gold medals as a percentage of total medals, which starts with selecting multiple columns in the Power Query Editor.
"To calculate the gold percentage, we select the gold column and the total medals column."
First, convert the specified columns into number values to enable calculations. This is done by right-clicking the column and selecting "Change Type" to the appropriate format.
After the conversion, select the gold medal column and then use the Control key to select the total medals column, enabling you to perform calculations using both.
To add a new column that displays the share of gold medals, go to the "Add Column" ribbon and choose a standard arithmetic operation: under Standard, select Divide to divide the number of gold medals by the total number of medals.
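The type change and division described above might look like this in M. The column names "Gold" and "Total" and the sample values are assumptions chosen for illustration.

```powerquery
let
    // Hypothetical sample with numbers still stored as text
    Source = #table(
        {"Country", "Gold", "Total"},
        {
            {"Country A", "10", "40"},
            {"Country B", "8", "20"}
        }
    ),
    // Convert both columns to whole numbers so they can be used in arithmetic
    Typed = Table.TransformColumnTypes(Source, {{"Gold", Int64.Type}, {"Total", Int64.Type}}),
    // Divide gold medals by total medals for each row
    WithShare = Table.AddColumn(Typed, "Gold Share", each [Gold] / [Total], type number)
in
    WithShare
```

The result is a fraction between 0 and 1; setting the column type to Percentage in the editor displays it as a percent.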
"Filtering is akin to adding a WHERE clause in SQL, allowing us to exclude non-essential data."
When analyzing data, it is often useful to filter out any rows that aren't needed. In this case, the total entry row is unnecessary for analysis.
To filter out this row, open the filter dropdown on either the rank column or the country column and uncheck the "Total" entry. This removes the unwanted row from the dataset.
"Power Query records every step you take, but it is essential to create adaptable filters for varying datasets."
Power Query operates like a recording tool that logs your actions. This functionality can be both powerful and limiting.
If the dataset changes, for example when the query is pointed at another Olympics medal table with a different total row, the recorded filter may no longer match and will need adjusting. Filters should therefore be written so that they still work when the data changes.
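One possible way to make the filter less brittle, which is not shown in the video, is to match the total row by a text pattern instead of its exact value. The sketch below assumes the total row's country cell starts with "Total" and that the column contains no nulls; both are assumptions about the data.

```powerquery
let
    // Hypothetical sample; the label of the last row varies between tables
    Source = #table(
        {"Country", "Gold"},
        {
            {"Country A", 10},
            {"Country B", 8},
            {"Totals (2 entries)", 18}
        }
    ),
    // Keep every row whose country label does not start with "Total"
    WithoutTotals = Table.SelectRows(Source, each not Text.StartsWith([Country], "Total"))
in
    WithoutTotals
```

A filter recorded by unchecking a value in the dropdown compares against that exact text, so it would miss a total row with a different entry count.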
"Give your data a proper name before loading it into Excel for better organization."
Once the data is filtered, it is worth giving the resulting query a meaningful name. For instance, renaming it from "2020 Summer Olympics medal table [36]" to "Medals" improves clarity.
Use the "Close & Load" button in the home ribbon to transfer the cleaned dataset back into Excel, presenting it as a table that retains all analyses performed in Power Query.
"The connection to the data source remains live, allowing for easy refreshes whenever updates occur."
The data loaded into Excel maintains a dynamic connection to the Wikipedia page, facilitating refreshes to obtain updated values. This is a significant advantage as any changes in the source can be easily reflected in your Excel analytics.
By simply right-clicking the table and selecting "Refresh," all previous steps recorded in Power Query are executed again, re-fetching new information from the source.
"Navigating through Power Query allows for effective and efficient data cleaning and analysis."
The second example, a staff dataset, is loaded into Power Query in the same way and then checked for null values and inconsistencies within its columns.
Once the desired staff data is selected, use the "Transform Data" option to enter Power Query and begin cleaning the dataset.
"Identifying and addressing data issues is vital for accurate analysis and reporting."
Within the staff dataset, observe any null values or irregular formatting—such as question marks in the department column and varying date alignments—that indicate potential problems.
Essential steps include promoting the first row to headers and then fixing each discrepancy in turn: null values, incorrect data types, and extra spaces in names.
"Now that extra spaces are gone, let's look at the gender column."
The first step in data cleaning involves removing extra spaces from the beginning and end of data fields to ensure uniformity.
After addressing spaces, attention turns to the gender column, which contains null values for some employees.
Instead of leaving these null values unaddressed, it's beneficial to replace them with a flag indicating 'missing'.
Power Query provides a user-friendly indicator called the column quality indicator, which shows how many values are present and how many are empty in each column.
Users should routinely check this indicator to ensure data quality throughout their dataset.
"We are going to write 'null' in small letters and then replace this with 'missing' as the value."
To replace null values in the gender column, right-click the column and select 'Replace Values'.
By entering 'null' as the value to be replaced and 'missing' as the new value, all instances of null gender values will be flagged appropriately.
This allows for easier filtering and further checks of data in Excel.
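The trimming and null-flagging steps from this part of the video could be expressed in M along these lines. The column names "Name" and "Gender" and the sample rows are hypothetical.

```powerquery
let
    // Hypothetical staff sample with stray spaces and a missing gender
    Source = #table(
        {"Name", "Gender"},
        {
            {"  Employee A ", "Female"},
            {"Employee B  ", null}
        }
    ),
    // Remove leading and trailing spaces from every name
    Trimmed = Table.TransformColumns(Source, {{"Name", Text.Trim, type text}}),
    // Flag null genders with the text "missing"
    Flagged = Table.ReplaceValue(Trimmed, null, "missing", Replacer.ReplaceValue, {"Gender"})
in
    Flagged
```

Note that null is written without quotes in M: it is the empty value itself, not the text "null".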
"If you have done the replacement with 'missing' but you had a change of mind, you can look at the step that you want to edit."
Every transformation step taken within Power Query is recorded and can be modified if needed.
Users can find a gear icon next to each step, allowing them to edit the transformation parameters or revert changes as necessary.
Additionally, if a certain step is no longer needed, there’s an option to delete it using the 'X' button next to the step.
"The question marks should have actually been the engineering department."
Data entry inconsistencies, such as question marks representing the engineering department, can easily be corrected through replacement rules.
Implementing transformation functions allows users to specify what to replace the question marks with, ensuring data accuracy.
"If someone's salary is either zero or null, we don’t want that information anymore."
When reviewing the salary column, employees with a salary of zero or null may need to be filtered out as they may no longer be part of the organization.
Power Query allows users to filter out these undesired values through a straightforward operation, thereby cleaning up the dataset.
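A sketch of the department fix and the salary filter in M is shown below. The column names, the replacement text "Engineering", and the sample rows are assumptions based on the description above.

```powerquery
let
    // Hypothetical staff sample with a "?" department and unusable salaries
    Source = #table(
        {"Name", "Department", "Salary"},
        {
            {"Employee A", "?", 120000},
            {"Employee B", "Finance", 0},
            {"Employee C", "Sales", null}
        }
    ),
    // Replace cells that contain only "?" with the intended department
    FixedDept = Table.ReplaceValue(Source, "?", "Engineering", Replacer.ReplaceValue, {"Department"}),
    // Keep only rows whose salary is neither null nor zero
    WithSalary = Table.SelectRows(FixedDept, each [Salary] <> null and [Salary] <> 0)
in
    WithSalary
```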
"Power Query makes this problem almost trivial. All you have to do is just right-click on this column, change type, and then select the date option."
A common issue in data management involves inconsistent date formatting, which Power Query simplifies significantly.
Right-clicking the date column and changing its data type to Date converts the inconsistently formatted values into proper dates.
"We want to split the name column into first name and last name."
To enhance data organization, the long name column can be split into two: first name and last name.
On the 'Transform' ribbon, use Split Column → By Delimiter with a space as the delimiter, then rename the resulting columns.
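In M, splitting by a delimiter is typically done with Table.SplitColumn. The sketch below assumes a column called "Name" holding exactly two words per row; names with more parts would need extra handling.

```powerquery
let
    // Hypothetical sample with first and last name in one column
    Source = #table(
        {"Name"},
        {
            {"Alex Smith"},
            {"Sam Jones"}
        }
    ),
    // Split on the space character and name the two resulting columns
    Split = Table.SplitColumn(
        Source,
        "Name",
        Splitter.SplitTextByDelimiter(" ", QuoteStyle.Csv),
        {"First Name", "Last Name"}
    )
in
    Split
```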
"These kinds of columns are called conditional columns."
To group employees by salary, users can create a conditional column that assigns each salary to a range: under 50K, 50K to 100K, or above 100K.
By leveraging conditional logic within Power Query, users can efficiently manage their data and better analyze salary distributions among the workforce.
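A conditional column of this kind is stored as an if ... then ... else expression. The sketch below uses an assumed "Salary" column and treats exactly 100,000 as part of the middle band, which is a choice the summary does not specify.

```powerquery
let
    // Hypothetical sample; null salaries are assumed to have been filtered out already
    Source = #table(
        {"Name", "Salary"},
        {
            {"Employee A", 45000},
            {"Employee B", 98000},
            {"Employee C", 120000}
        }
    ),
    // Assign each salary to one of three bands
    WithBand = Table.AddColumn(
        Source,
        "Salary Band",
        each if [Salary] < 50000 then "Under 50K"
             else if [Salary] <= 100000 then "50K to 100K"
             else "Above 100K",
        type text
    )
in
    WithBand
```

The conditions are checked from top to bottom, so each row gets the first band whose condition it meets.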
"You can calculate the age of the employee, representing how long they have been in the organization, using the start date column."
To determine how long an employee has been with the organization, select the start date column and use the "Add Column" feature in Power Query. Ensuring the column is formatted as a date is essential for this process.
Once correctly formatted, the date-related functionalities will become available in the "Add Column" section of Power Query.
The result displays the duration in days, and as the video notes, the exact figure may vary depending on the current date when executing the operation.
"You can use the duration buttons to convert the days into a different format, such as years."
If the initial output is in days, users can convert it to years with Transform → Duration → Total Years. This gives a clearer picture of how long each employee has been with the organization.
The video exemplifies this by transforming the data to display the first employee’s tenure as approximately 5.61 years. It emphasizes that this number will also change depending on the current query execution date.
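The tenure calculation can be sketched in M as three steps: set the type, subtract the start date from today, and convert days to years. The column names and dates below are made up, and dividing by 365 is an approximation that ignores leap years.

```powerquery
let
    // Hypothetical sample with start dates stored as ISO-formatted text
    Source = #table(
        {"Name", "Start Date"},
        {
            {"Employee A", "2019-03-04"},
            {"Employee B", "2021-11-15"}
        }
    ),
    // Convert the text to real dates so date arithmetic is available
    Typed = Table.TransformColumnTypes(Source, {{"Start Date", type date}}),
    // Today's date minus the start date gives a duration
    WithAge = Table.AddColumn(Typed, "Tenure", each Date.From(DateTime.LocalNow()) - [Start Date], type duration),
    // Turn the duration into a number of years
    InYears = Table.TransformColumns(WithAge, {{"Tenure", each Duration.TotalDays(_) / 365, type number}})
in
    InYears
```

Because the calculation uses the current date, the values change each time the query is refreshed.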
"You can write a condition in the custom column to tag employees as either full-time or part-time based on their FTE status."
To differentiate employees as full-time or part-time, utilize the "Add Column" feature, selecting the custom column option. This allows for writing a conditional statement using Power Query M language.
An example condition provided in the video states: if the FTE value equals 1, the output should be "full-time"; otherwise, it should be "part-time." This showcases how users can create tailored logic for their data needs.
After creating the column, the video adjusts its name and reviews how it works, concluding that the data is now organized effectively.
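The full-time versus part-time condition described above might be written as follows. The column name "FTE", the new column name, and the sample rows are assumptions; in the Custom Column dialog only the expression after "each" is typed.

```powerquery
let
    // Hypothetical sample: FTE of 1 means a full-time position
    Source = #table(
        {"Name", "FTE"},
        {
            {"Employee A", 1},
            {"Employee B", 0.5}
        }
    ),
    // Tag each employee based on the FTE value
    WithStatus = Table.AddColumn(
        Source,
        "Employment Type",
        each if [FTE] = 1 then "Full-time" else "Part-time",
        type text
    )
in
    WithStatus
```

M keywords such as if, then, and else are case-sensitive and must be written in lowercase.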
"When the source file changes, refreshing the query updates the displayed data automatically."
The instructor notes that the Power Query setup is dynamic. For instance, after a value in the source employee file changes, such as a salary going from 120,000 to 98,000, the query must be refreshed in Excel for the change to appear.
Users can refresh the query simply by right-clicking and selecting the refresh option, which will update all corresponding values and conditions accordingly.
This dynamic capability also extends to any new data added to the source, allowing for seamless integration and ensuring that the most current information is always displayed in the final output.