This blog post is part of my personal development within data science and machine learning. With these posts I will share what I learn, as a way to capture the knowledge and bring more people on board. If you find any mistakes or alternative approaches, feel free to comment and provide your feedback.
Visualizing your data is one of the best ways to understand it. There are many different approaches to “seeing” the data: different chart types, dimensions, etc. In this blog post, we will look at the Breast Cancer Wisconsin (Diagnostic) Data Set (https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)). The data set is easy to get using scikit-learn (http://scikit-learn.org/stable/index.html), as it's built in and available through the load_breast_cancer (http://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_breast_cancer.html) function.
The data set contains 569 samples of features computed from digitized images of a fine needle aspirate of a breast mass. Each sample has 30 features (dimensions), which we will reduce to 3 so that we can visualize the data in 3D. We will also create a heat map of the features to gain an understanding of how the decomposition relates to the original features.
We will be using Python to work with the data set. Additionally, we will use Jupyter Notebook to host the visualizations. You can use any environment that you like, if you want to follow along. If you don’t have an environment configured, you can use tools like Azure Machine Learning Workbench (https://docs.microsoft.com/en-gb/azure/machine-learning/preview/quickstart-installation) or Anaconda (https://anaconda.org/).
We will first be importing the libraries we need. These will help us to load the data, perform data processing and visualizations. We will be using the following libraries:
- NumPy – powerful array object and more.
- pandas – data structures and data analysis tools and more.
- scikit-learn – machine learning tools.
- Plotly – graphic library for interactive graphs and more.
Notice that we will only be importing some parts of some libraries.
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import plotly
import plotly.graph_objs as go
%matplotlib inline
Next up we can load our data set using the load_breast_cancer function. This will give us an object that contains the samples, features, feature names and labels. We will then create a DataFrame object from the data, which makes it much easier to transform and work with.
If we want to view our data, we can use the head function on the DataFrame object.
# Get data.
bunch = load_breast_cancer()
df = pd.DataFrame(bunch.data, columns=bunch.feature_names)
df.head()
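Before transforming anything, it can help to confirm what was actually loaded. The following is a small sketch that checks the shape of the data and counts the samples per class; the variable names n_samples, n_features and class_counts are my own and not part of scikit-learn.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

# Load the data the same way as above.
bunch = load_breast_cancer()
df = pd.DataFrame(bunch.data, columns=bunch.feature_names)

# Number of samples (rows) and features (columns).
n_samples, n_features = df.shape

# Map each numeric label to its class name and count the samples per class.
class_counts = pd.Series(bunch.target_names[bunch.target]).value_counts()

print(n_samples, n_features)
print(class_counts)
```

The class names come from bunch.target_names, so the counts are labelled 'malignant' and 'benign' rather than 0 and 1.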
Pre-process and decompose data
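Next up we will start working with the data to fit our scenario. A common practice for many machine learning estimators is to standardize the features, so that each feature has zero mean and unit variance. If an individual feature has a very different scale from the other features, it can dominate the result and the algorithm may behave badly. This matters for PCA in particular, because it looks for the directions of largest variance, and a feature measured on a large scale would dominate them.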
We will use a StandardScaler and fit it to our data. Finally, we will use it to transform the data.
# Pre-process data.
scaler = StandardScaler()
scaler.fit(df)
preprocessed_data = scaler.transform(df)
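To make the standardization step less of a black box, here is a minimal sketch that reproduces it by hand with NumPy and compares the result with StandardScaler. The names X, scaled, manual and matches are only used for this illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler

X = load_breast_cancer().data

# Standardize with scikit-learn.
scaled = StandardScaler().fit_transform(X)

# Standardize by hand: subtract each column's mean and divide by its
# standard deviation (population standard deviation, NumPy's default).
manual = (X - X.mean(axis=0)) / X.std(axis=0)

# Both approaches should agree up to floating point error.
matches = np.allclose(scaled, manual)
print(matches)

# After scaling, every feature has (approximately) zero mean and unit variance.
print(scaled.mean(axis=0).round(6))
print(scaled.std(axis=0).round(6))
```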
We’re now ready to decompose (factorize) our data. The goal is to break down the data from 30 dimensions to 3 dimensions, so that we can visualize it. We will use a statistical procedure called principal component analysis (PCA for short, https://en.wikipedia.org/wiki/Principal_component_analysis). This technique is great for finding patterns while trying to retain the variation in the data set. The outputs of such an analysis are called principal components, which we use to explain as much of the variance in the data set as possible. Because we keep only a few components, some of the variance may be lost.
Much like the step before, we will use a PCA and fit it to our data. This is then used to transform the data.
# Decompose data (PCA).
pca = PCA(n_components=3)
pca.fit(preprocessed_data)
decomposed_data = pca.transform(preprocessed_data)
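If you are curious about what PCA does under the hood, the sketch below reproduces the three components with a singular value decomposition (SVD) in NumPy and compares them with scikit-learn. This is an illustration under my own variable names, not necessarily the exact routine scikit-learn runs internally. The sign of a component is arbitrary, so the comparison uses absolute values.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(load_breast_cancer().data)

# Reference result from scikit-learn.
pca = PCA(n_components=3).fit(X)
sk_scores = pca.transform(X)

# PCA by hand: center the data, then take the SVD.
X_centered = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# The rows of Vt are the principal directions, ordered by singular value.
components = Vt[:3]

# Project the centered data onto the first three directions.
scores = X_centered @ components.T

# The squared singular values are proportional to the variance explained.
explained_ratio = (S ** 2)[:3] / np.sum(S ** 2)

print(np.allclose(np.abs(components), np.abs(pca.components_), atol=1e-6))
print(explained_ratio)
```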
We know that we will most likely not retain all of the variance. We can view it using the explained_variance_ratio_ field.
# Get explained variance.
print(pca.explained_variance_ratio_, sum(pca.explained_variance_ratio_))
Each element in the array shows the share of the total variance that the corresponding principal component explains. Notice that the third principal component only explains 9.39% of the variance. Each additional component explains less than the one before it, so adding more dimensions will not always do wonders.
In this case, we are able to retain 72.64% of the variance.
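To see how the retained variance grows with the number of components, we can fit a PCA with all components and look at the cumulative sum. This is a minimal sketch; the 95% threshold and the names pca_full, cumulative and n_for_95 are arbitrary choices for the illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(load_breast_cancer().data)

# Fit PCA without limiting the number of components (keeps all 30).
pca_full = PCA().fit(X)

# Cumulative share of variance retained by the first k components.
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest number of components that retains at least 95% of the variance.
n_for_95 = int(np.argmax(cumulative >= 0.95)) + 1

for k, share in enumerate(cumulative, start=1):
    print(k, round(float(share), 4))
print(n_for_95)
```

The third value in the printed list corresponds to the share retained by our 3 components.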
Visualize principal components
If we want to understand how each feature contributes to each principal component, we can visualize it using a heat map. The first thing we need to do is to configure the Plotly library. This is easily done as such:
# Connect notebook and download JS files.
plotly.offline.init_notebook_mode(connected=True)
We will then create the heat map data using Heatmap (within go). The heat map data is created by supplying the principal component weights (pca.components_) as the z values. The x-axis is the feature and the y-axis is the principal component. Finally, we can plot the heat map using the iplot function.
# Create heat map data.
data = go.Heatmap(z=pca.components_,
                  x=bunch.feature_names,
                  y=['PC 1', 'PC 2', 'PC 3'],
                  colorscale='Viridis')

# Plot heatmap.
plotly.offline.iplot([data], filename='heatmap')
Since we created the heat map using Plotly, it is also interactive. Notice how we can hover over individual cells to discover more about them.
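The same numbers that drive the heat map can also be read as a table. The sketch below puts pca.components_ into a DataFrame and lists the five features with the largest absolute weight for each component. The names loadings and top_features, and the choice of five, are my own.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

bunch = load_breast_cancer()
X = StandardScaler().fit_transform(bunch.data)
pca = PCA(n_components=3).fit(X)

# One row per principal component, one column per original feature.
loadings = pd.DataFrame(pca.components_,
                        columns=bunch.feature_names,
                        index=['PC 1', 'PC 2', 'PC 3'])

# For each component, the five features with the largest absolute weight.
top_features = {
    pc: loadings.loc[pc].abs().nlargest(5).index.tolist()
    for pc in loadings.index
}

for pc, names in top_features.items():
    print(pc, names)
```

A weight close to zero means the feature barely influences that component, while a large positive or negative weight means it influences it strongly.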
Finally, let’s visualize the transformed data. The first thing we will do is split the benign and malignant samples into individual data sets. To do this, we first add the label (benign/malignant) of each sample as a column on the transformed data. The labels were loaded by the load_breast_cancer function earlier and are available as bunch.target. In this data set a target of 0 means malignant and 1 means benign, which is why the malignant column is computed as 1 - bunch.target.
# Add malignant column.
decomposed_df = pd.DataFrame(decomposed_data, columns=['x', 'y', 'z'])
decomposed_df['malignant'] = 1 - bunch.target

# Create individual data sets.
malignant = decomposed_df[decomposed_df.malignant == 1]
benign = decomposed_df[decomposed_df.malignant == 0]
We can now create the visualizations. We will use the Scatter3d function (within go) for each data set to create scatter data. We will also create a bit of styling and separate the two data sets by color and name. This way we can distinguish between data points in the visualization.
Once the scatter data and layout have been created, we can plot using the iplot function.
# Create line style.
line_style = dict(color='rgba(0, 0, 0, 0.14)', width=0.5)

# Create scatters.
malignant_scatter = go.Scatter3d(
    x=malignant['x'],
    y=malignant['y'],
    z=malignant['z'],
    mode='markers',
    marker=dict(
        color='rgb(181, 20, 37)',
        size=12,
        opacity=0.8,
        line=line_style
    ),
    name='Malignant'
)
benign_scatter = go.Scatter3d(
    x=benign['x'],
    y=benign['y'],
    z=benign['z'],
    mode='markers',
    marker=dict(
        color='rgb(5, 99, 226)',
        size=12,
        opacity=0.8,
        line=line_style
    ),
    name='Benign'
)

# Create data array. Ensure malignant scatter is rendered above (can we merge layers somehow?).
data = [benign_scatter, malignant_scatter]

# Create layout.
layout = go.Layout(showlegend=True, margin=dict(l=0, r=0, b=0, t=0))

# Render (offline).
fig = go.Figure(data=data, layout=layout)
plotly.offline.iplot(fig, filename='3d-scatter')
As with the heat map, this plot is also interactive. This allows you to view the data from different angles to understand how your data fits within 3 dimensions. Pretty neat I must say!
This data set works well to visualize using this approach. We can clearly see that the two different labels cluster together well – with some outliers. What’s fascinating is that we have been able to reduce the data set from a dimension impossible for us to imagine – into an interactive 3D scatter plot. By doing this we can gain a better understanding of our data.
Be aware that our principal component analysis does have a significant cost, as we are not able to retain all the variance in the data set (here we lose roughly 27% of it). This always depends on your data set, and you will see different results from different data sets. An exciting thought is whether we could derive (or support) any conclusions about future data points by using this. In that case, we could also remove one more dimension (down to 2D) so that we can leverage common libraries for classification or clustering.
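As a rough illustration of that last thought, the sketch below chains the scaler, the 3-component PCA and a simple classifier into a scikit-learn Pipeline, and checks how well it labels samples it has not seen. The choice of LogisticRegression, the 75/25 split and random_state=0 are my own assumptions for the example, and this is an experiment on a public data set, not something to draw medical conclusions from.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

bunch = load_breast_cancer()

# Hold back a quarter of the samples to play the role of "future" data points.
X_train, X_test, y_train, y_test = train_test_split(
    bunch.data, bunch.target, test_size=0.25, random_state=0, stratify=bunch.target
)

# The scaler and PCA are fitted on the training data only and then
# re-applied unchanged to the held-back samples.
model = Pipeline([
    ('scale', StandardScaler()),
    ('pca', PCA(n_components=3)),
    ('classify', LogisticRegression()),
])
model.fit(X_train, y_train)

# Share of held-back samples that were labelled correctly.
accuracy = model.score(X_test, y_test)
print(accuracy)
```

Using a pipeline keeps the pre-processing and decomposition steps tied to the classifier, so new samples always go through exactly the same transformations as the training data.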
You can read more about the data set here: https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)