{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# 5 lesser-known pandas tricks - part 2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is part 2 of the 5 lesser-known pandas tricks series, where I show 5 pandas tricks that will help you boost your productivity.\n", "This part focuses on Exploratory Data Analysis (EDA), an approach to analyzing data sets that summarizes their main characteristics, often with visual methods.\n", "\n", "In case you missed it, here is [5 lesser-known pandas tricks part 1](https://towardsdatascience.com/5-lesser-known-pandas-tricks-e8ab1dd21431)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Setup" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "%matplotlib inline" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "from platform import python_version\n", "\n", "import matplotlib as mpl\n", "import pandas as pd" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "python version==3.7.3\n", "pandas==0.25.0\n", "matplotlib==3.0.3\n" ] } ], "source": [ "print(\"python version==%s\" % python_version())\n", "print(\"pandas==%s\" % pd.__version__)\n", "print(\"matplotlib==%s\" % mpl.__version__)" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "pd.np.random.seed(42)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's create a pandas DataFrame with 5 columns and 1000 rows:\n", "- a1 and a2 have random samples drawn from a normal (Gaussian) distribution,\n", "- a3 has randomly distributed integers from a set of (0, 1, 2, 3, 4),\n", "- y1 has numbers spaced evenly on a log scale from 1 to 10 (that is, from 10^0 to 10^1),\n", "- y2 has randomly distributed integers from a set of (0, 1)."
] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "mu1, sigma1 = 0, 0.1\n", "mu2, sigma2 = 0.2, 0.2\n", "n = 1000\n", "\n", "df = pd.DataFrame(\n", "    {\n", "        \"a1\": pd.np.random.normal(mu1, sigma1, n),\n", "        \"a2\": pd.np.random.normal(mu2, sigma2, n),\n", "        \"a3\": pd.np.random.randint(0, 5, n),\n", "        \"y1\": pd.np.logspace(0, 1, num=n),\n", "        \"y2\": pd.np.random.randint(0, 2, n),\n", "    }\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Readers with a Machine Learning background will recognize the notation, where a1, a2 and a3 represent attributes and y1 and y2 represent target variables.\n", "In short, Machine Learning algorithms try to find patterns in the attributes and use them to predict the unseen target variable - but this is not the main focus of this blog post.\n", "The DataFrame has two target variables (one continuous, y1, and one binary, y2) to make the examples easier to follow." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We reset the index, which adds an index column to the DataFrame that enumerates the rows." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "df.reset_index(inplace=True)" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
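<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\">\n",
"      <th></th>\n",
"      <th>index</th>\n",
"      <th>a1</th>\n",
"      <th>a2</th>\n",
"      <th>a3</th>\n",
"      <th>y1</th>\n",
"      <th>y2</th>\n",
"    </tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr>\n",
"      <th>0</th>\n",
"      <td>0</td>\n",
"      <td>0.049671</td>\n",
"      <td>0.479871</td>\n",
"      <td>2</td>\n",
"      <td>1.000000</td>\n",
"      <td>1</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>1</th>\n",
"      <td>1</td>\n",
"      <td>-0.013826</td>\n",
"      <td>0.384927</td>\n",
"      <td>2</td>\n",
"      <td>1.002308</td>\n",
"      <td>0</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>2</th>\n",
"      <td>2</td>\n",
"      <td>0.064769</td>\n",
"      <td>0.211926</td>\n",
"      <td>2</td>\n",
"      <td>1.004620</td>\n",
"      <td>0</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>3</th>\n",
"      <td>3</td>\n",
"      <td>0.152303</td>\n",
"      <td>0.070613</td>\n",
"      <td>3</td>\n",
"      <td>1.006939</td>\n",
"      <td>0</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>4</th>\n",
"      <td>4</td>\n",
"      <td>-0.023415</td>\n",
"      <td>0.339645</td>\n",
"      <td>4</td>\n",
"      <td>1.009262</td>\n",
"      <td>0</td>\n",
"    </tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>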
" ], "text/plain": [ " index a1 a2 a3 y1 y2\n", "0 0 0.049671 0.479871 2 1.000000 1\n", "1 1 -0.013826 0.384927 2 1.002308 0\n", "2 2 0.064769 0.211926 2 1.004620 0\n", "3 3 0.152303 0.070613 3 1.006939 0\n", "4 4 -0.023415 0.339645 4 1.009262 0" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. Plot Customizations\n", "\n", "When I first started working with pandas, the plotting functionality seemed clunky.\n", "I was so wrong on this one because pandas exposes full matplotlib functionality." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1.1 Customize axes on the output" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Pandas plot function returns matplotlib.axes.Axes or numpy.ndarray of them so we can additionally customize our plots.\n", "In the example below, we add a horizontal and a vertical red line to pandas line plot.\n", "This is useful if we need to: \n", "- add the average line to a histogram,\n", "- mark an important point on the plot, etc." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "ax = df.y1.plot()\n", "ax.axhline(6, color=\"red\", linestyle=\"--\")\n", "ax.axvline(775, color=\"red\", linestyle=\"--\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1.2 Customize axes on the input" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Pandas plot function also takes Axes argument on the input. \n", "This enables us to customize plots to our liking.\n", "In the example below, we create a two-by-two grid with different types of plots." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "fig, ax = mpl.pyplot.subplots(2, 2, figsize=(14,7))\n", "df.plot(x=\"index\", y=\"y1\", ax=ax[0, 0])\n", "df.plot.scatter(x=\"index\", y=\"y2\", ax=ax[0, 1])\n", "df.plot.scatter(x=\"index\", y=\"a3\", ax=ax[1, 0])\n", "df.plot(x=\"index\", y=\"a1\", ax=ax[1, 1])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. Histograms\n", "\n", "A histogram is an accurate representation of the distribution of numerical data. \n", "It is an estimate of the probability distribution of a continuous variable and was first introduced by Karl Pearson[[1]](https://en.wikipedia.org/wiki/Histogram)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.1 Stacked Histograms" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Pandas enables us to compare distributions of multiple variables on a single histogram with a single function call." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "df[[\"a1\", \"a2\"]].plot(bins=30, kind=\"hist\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To create two separate plots, we set `subplots=True`." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([,\n", " ],\n", " dtype=object)" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "df[[\"a1\", \"a2\"]].plot(bins=30, kind=\"hist\", subplots=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.2 Probability Density Function" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A Probability density function (PDF) is a function whose value at any given sample in the set of possible values can be interpreted as a relative likelihood that the value of the random variable would equal that sample [[2]](https://en.wikipedia.org/wiki/Probability_density_function).\n", "In other words, the value of the PDF at two different samples can be used to infer, in any particular draw of the random variable, how much more likely it is that the random variable would equal one sample compared to the other sample." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that in pandas, there is a `density=1` argument that we can pass to `hist` function, but with it, we don't get a PDF, because the y-axis is not on the scale from 0 to 1 as can be seen on the plot below.\n", "The reason for this is explained in [numpy documentation](https://docs.scipy.org/doc/numpy/reference/generated/numpy.histogram.html): \"Note that the sum of the histogram values will not be equal to 1 unless bins of unity width are chosen; it is not a probability mass function.\"." 
] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAW4AAAD8CAYAAABXe05zAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDMuMC4zLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvnQurowAADhFJREFUeJzt3X2MXNddxvHnqfNmvInb4miaOFG2lUKltFs1eCh/8NLdvggTl4BEJNIqUSIVrUopVMIIGQUhAUIkiFTwRwRUpUrKH2yh6kvk9EVJ6qVEalLWIfXGCW0SZKlxg0NBmG4aWlb98cfeTaf27M6Z9T0z88t+P9LKM7vH9z5ezz575s69ZxwRAgDk8YpxBwAADIfiBoBkKG4ASIbiBoBkKG4ASIbiBoBkKG4ASIbiBoBkKG4ASOa8Ghvds2dPTE9P19j0S1544QXt2rWr6j7aQtY6yFoHWdtXkvPo0aPfiohLizYYEa1/7Nu3L2o7cuRI9X20hax1kLUOsravJKekpSjsWA6VAEAyFDcAJENxA0AyFDcAJENxA0AyFDcAJENxA0AyFDcAJENxA0AyVS55B0Zt+tB9ReNO3H6gchKgPmbcAJAMxQ0AyVDcAJAMxQ0AyVDcAJAMxQ0AyVDcAJAM53GjNaXnUkucTw2cC2bcAJAMxQ0AyRQXt+0dtv/F9uGagQAAmxtmxv1BSU/WCgIAKFNU3LavkHRA0kfqxgEADFI64/5zSb8j6fsVswAACjgiNh9gv0vSdRHxftuzkn47It7VZ9y8pHlJ6nQ6+xYWFirE/YGVlRVNTU1V3UdbtkvW5ZOni8fO7N29pX306s1auu829rsV2+UxMGpZspbknJubOxoR3ZLtlRT3n0i6WdKqpIskXSLpkxFx00Z/p9vtxtLSUsn+t2xxcVGzs7NV99GW7ZK1xnncm23z4Myq7lwe7lKEcZ0/vl0eA6OWJWtJTtvFxT3wUElE/G5EXBER05JulPTFzUobAFAX53EDQDJDPc+MiEVJi1WSAACKMOMGgGQobgBIhuIGgGQobgBIhuIGgGQobgBIhuIGgGQobgBIhuIGgGQobgBIhuIGgGQobgBIZrjFjIHkStcMH9e63UAJZtwAkAzFDQDJUNwAkAzHuIE+OBaOScaMGwCSobgBIBmKGwCSobgBIBmKGwCSobgBIBmKGwCSobgBIBmKGwCSobgBIBmKGwCSobgBIBmKGwCSobgBIBmKGwCSobgBIBneSAFjUfpGBQDOxowbAJJhxo2BmB0Dk4UZNwAkQ3EDQDIUNwAkM7C4bV9k+yu2v2r7uO0/GEUwAEB/JS9OflfS2yJixfb5kh6y/bmIeLhyNgBAHwOLOyJC0kpz9/zmI2qGAgBsrOgYt+0dth+T9Lyk+yPikbqxAAAb8dqEunCw/UpJn5L0GxHx+Blfm5c0L0mdTmffwsJCmznPsrKyoqmpqar7aEv2rMsnT48pzeY6O6VTL443w8ze3UXjsj8GJlWWrCU55+bmjkZEt2R7QxW3JNn+fUnfiYg/22hMt9uNpaWlobY7rMXFRc3OzlbdR1uyZ53UC3AOzqzqzuXxXkN24vYDReOyPwYmVZasJTltFxd3yVkllzYzbdneKemdkv61ZOMAgPaVTFcuk3SP7R1aK/q/j4jDdWMBADZSclbJMUnXjiALAKAAV04CQDIUNwAkQ3EDQDIUNwAkQ3EDQDIUNwAkQ3EDQDIUNwAkQ3EDQDIUNwAkQ3EDQDIUNwAkQ3EDQDIUNwAk
Q3EDQDIUNwAkQ3EDQDIUNwAkQ3EDQDIUNwAkQ3EDQDIUNwAkQ3EDQDIUNwAkQ3EDQDLnjTsAkNn0ofuKxt29f1flJNhOmHEDQDLMuLexfrPFgzOrurVwFglgPJhxA0AyFDcAJENxA0AyFDcAJENxA0AyFDcAJENxA0AyFDcAJENxA0AyFDcAJENxA0AyA4vb9pW2j9h+wvZx2x8cRTAAQH8li0ytSjoYEY/avljSUdv3R8QTlbMBAPoYOOOOiOci4tHm9rclPSlpb+1gAID+hjrGbXta0rWSHqkRBgAwmCOibKA9JekfJf1xRHyyz9fnJc1LUqfT2bewsNBmzrOsrKxoamqq6j7aMqlZl0+ePutznZ3SqRfHEGYLMmV97e4dE/kY6GdSH6/9ZMlaknNubu5oRHRLtldU3LbPl3RY0hci4kODxne73VhaWirZ/5YtLi5qdna26j7aMqlZN3ojhTuXc7y/Rqasd+/fNZGPgX4m9fHaT5asJTltFxd3yVkllvQ3kp4sKW0AQF0lx7h/StLNkt5m+7Hm47rKuQAAGxj4PDMiHpLkEWQBXraWT54ufi/PE7cfqJwG2XHlJAAkQ3EDQDIUNwAkQ3EDQDIUNwAkQ3EDQDIUNwAkQ3EDQDIUNwAkQ3EDQDI5llbDUPqt+gfg5YMZNwAkQ3EDQDIUNwAkQ3EDQDIUNwAkw1kliXC2CACJGTcApENxA0AyFDcAJENxA0AyFDcAJENxA0AyFDcAJENxA0AyFDcAJENxA0AyFDcAJMNaJcCEKV2T5sTtByonwaRixg0AyVDcAJAMxQ0AyVDcAJAMxQ0AyXBWCZAUZ59sX8y4ASAZihsAkqG4ASAZihsAkhlY3LY/avt524+PIhAAYHMlM+67Je2vnAMAUGhgcUfElyT91wiyAAAKcIwbAJJxRAweZE9LOhwRb9xkzLykeUnqdDr7FhYWWorY38rKiqampqruoy1tZV0+ebqFNJvr7JROvVh9N60ga5mZvbuHGr8df7ZqK8k5Nzd3NCK6Jdtrrbh7dbvdWFpaKhm6ZYuLi5qdna26j7a0lbX0SrlzcXBmVXcu57iglqztWr/Ccjv+bNVWktN2cXFzqAQAkik5HfDvJH1Z0uttP2v7vfVjAQA2MvC5W0S8exRBAABlOFQCAMlQ3ACQDMUNAMlQ3ACQzGSfWLpNjOL8bAAvH8y4ASAZihsAkqG4ASAZihsAkqG4ASAZihsAkqG4ASAZihsAkqG4ASAZihsAkqG4ASAZ1iqpaH0NkoMzq7qV9UgAtIQZNwAkw4wbgKTyZ4jr7waP8WHGDQDJUNwAkAzFDQDJUNwAkAzFDQDJUNwAkAzFDQDJUNwAkAzFDQDJUNwAkAyXvAOoYrpwYTUuoR8eM24ASIYZ95BKZxEAUAszbgBIhuIGgGQobgBIhuIGgGQobgBI5mV/VglngQDt4mdq/JhxA0AyRcVte7/tr9l+2vah2qEAABsbWNy2d0i6S9LPS7pG0rttX1M7GACgv5Jj3G+R9HRE/Jsk2V6Q9IuSnqgRqPT42d37d9XYPYARO5dj5gdnVnVrn7/f9vonk7buSsmhkr2SvtFz/9nmcwCAMXBEbD7AvkHS/oj41eb+zZJ+MiI+cMa4eUnzzd3XS/pa+3F/yB5J36q8j7aQtQ6y1kHW9pXkvCoiLi3ZWMmhkpOSruy5f0XzuR8SER+W9OGSnbbB9lJEdEe1v3NB1jrIWgdZ29d2zpJDJf8s6Wrbr7V9gaQbJd3bVgAAwHAGzrgjYtX2ByR9QdIOSR+NiOPVkwEA+iq6cjIiPivps5WzDGtkh2VaQNY6yFoHWdvXas6BL04CACYLl7wDQDJpitv2q23fb/up5s9X9Rlzle1HbT9m+7jt901w1jfb/nKT85jtX5nUrM24z9v+b9uHx5Bx0yUXbF9o++PN1x+xPT3qjE2OQTl/tnl8rjan2Y5NQdbfsv1E89h80PZV48jZZBmU9X22l5uf+4fGeWV3
6fIgtn/Zdtje2pkmEZHiQ9KfSjrU3D4k6Y4+Yy6QdGFze0rSCUmXT2jWH5N0dXP7cknPSXrlJGZtvvZ2Sb8g6fCI8+2Q9Iyk1zX/v1+VdM0ZY94v6a+a2zdK+vgYvo8lOaclvUnSxyTdMOqMQ2adk/Qjze1fG8f3dIisl/Tcvl7S5yc1azPuYklfkvSwpO5W9pVmxq21y+zvaW7fI+mXzhwQEd+LiO82dy/U+J5RlGT9ekQ81dz+pqTnJRWdfN+ygVklKSIelPTtUYXq8dKSCxHxPUnrSy706v03fELS2217hBmlgpwRcSIijkn6/oiznakk65GI+E5z92GtXb8xDiVZ/6fn7i5J43rhruSxKkl/JOkOSf+71R1lKu5ORDzX3P53SZ1+g2xfafuY1i7Tv6MpxVEryrrO9lu09hv6mdrB+hgq6xiULLnw0piIWJV0WtKPjiRdnwyNSV4aYtis75X0uaqJNlaU1fav235Ga88gf3NE2c40MKvtH5d0ZUSc06LmE/VGCrYfkPSaPl+6rfdORITtvr9VI+Ibkt5k+3JJn7b9iYg4NYlZm+1cJulvJd0SEVVmYm1lxfZj+yZJXUlvHXeWzUTEXZLusv0eSb8n6ZYxRzqL7VdI+pCkW891WxNV3BHxjo2+ZvuU7csi4rmm7J4fsK1v2n5c0s9o7elzq9rIavsSSfdJui0iHm4747o2v69jULLkwvqYZ22fJ2m3pP8cTbyzMqzruzTEhCjKavsdWvvl/taeQ5CjNuz3dUHSX1ZNtLFBWS+W9EZJi82RvNdIutf29RGxNMyOMh0quVc/+C16i6TPnDnA9hW2dza3XyXpp1V/sat+SrJeIOlTkj4WEa3/YhnCwKxjVrLkQu+/4QZJX4zmVaARyrQ0xMCstq+V9NeSro+Icf4yL8l6dc/dA5KeGmG+XptmjYjTEbEnIqYjYlprrx0MXdrrG0vxobVjlg9q7T/lAUmvbj7flfSR5vY7JR3T2qu5xyTNT3DWmyT9n6THej7ePIlZm/v/JOk/JL2otWN3PzfCjNdJ+rrWXgO4rfncHzYPekm6SNI/SHpa0lckvW5M/++Dcv5E8717QWvPCI6PI2dh1gcknep5bN47wVn/QtLxJucRSW+Y1KxnjF3UFs8q4cpJAEgm06ESAIAobgBIh+IGgGQobgBIhuIGgGQobgBIhuIGgGQobgBI5v8BM3BtncUk7tcAAAAASUVORK5CYII=\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "df.a1.hist(bins=30, density=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To calculate a PDF for a variable, we use the `weights` argument of a `hist` function.\n", "We can observe on the plot below, that the maximum value of the y-axis is less than 1." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAX0AAAD8CAYAAACb4nSYAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDMuMC4zLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvnQurowAAE1JJREFUeJzt3W2MXGd5xvH/jdOYYFMTErqQ2GWDYj443YiSxemHAmvCi9OIGKmO6hCoLQW5abGQiqvWCBRSA1KMFFIk3BaLpJigapNaorWIISIvKygKqRMKdkwa2ASL2NDQvNR0Q1KzcPfDHMN02HjO7p6Z3fXz/0lWzsszc67dzF5z9szMs5GZSJLK8IK5DiBJ6h9LX5IKYulLUkEsfUkqiKUvSQWx9CWpIJa+JBXE0pekglj6klSQ0+Y6QKezzz47BwcHe36cZ555hiVLlvT8OE0wa2+YtXkLJSecelkfeOCBJzLzZV3vLDPn1b+LLroo++Gee+7py3GaYNbeMGvzFkrOzFMvK3B/1uhYL+9IUkEsfUkqiKUvSQWx9CWpIJa+JBXE0pekglj6klQQS1+SCmLpS1JBak3DEBFrgU8Ci4DPZOb1HfvfAPwNcCGwITP3tO3bCHyoWv1oZu5uIrjUpMFtt9cad/j6y3qcROqtrmf6EbEI2AlcCqwCroyIVR3DfgBsAv6x47YvBT4MXAysBj4cEWfOPrYkaSbqXN5ZDYxn5qOZeRwYBda1D8jMw5l5APhFx23fBnwlM5/KzKeBrwBrG8gtSZqBOqV/LvBY2/qRalsds7mtJKlh82Jq5YjYDGwGGBgYYGxsrOfHnJiY6MtxmmDW3mjPunVostZt5uprWyjf14WSE8rNWqf0jwIr2taXV9vqOAqMdNx2rHNQZu4CdgEMDw/nyMhI55DGjY2N0Y/jNMGsvdGedVPdF3KvGuldoJNYKN/XhZITys1a5/LOfmBlRJwXEacDG4C9Ne//DuCtEXFm9QLuW6ttkqQ50LX0M3MS2EKrrB8CbsvMQxGxPSIuB4iI10XEEeAK4NMRcai67VPAR2g9cewHtlfbJElzoNY1/czcB+zr2HZt2/J+WpduprrtzcDNs8ioQvheean3/ESuJBXE0pekglj6klQQS1+SCmLpS1JBLH1JKoilL0kFmRdz70i90O19/1uHJmtPvyCdKjzTl6SCWPqSVBBLX5IKYulLUkEsfUkqiKUvSQWx9CWpIJa+JBXE0pekglj6klQQS1+SCmLpS1JBLH1JKoilL0kFsfQlqSDOpy9NQ7c5+k84fP1lPU4izYxn+pJUEEtfkgpi6UtSQbymL/WA1/41X3mmL0kFsfQlqSCWviQVxNKXpILUKv2IWBsRD0fEeERsm2L/4oi4tdp/X0Q
MVtt/IyJ2R8TBiHgoIj7QbHxJ0nR0Lf2IWATsBC4FVgFXRsSqjmFXA09n5vnAjcCOavsVwOLMHAIuAv7kxBOCJKn/6pzprwbGM/PRzDwOjALrOsasA3ZXy3uASyIigASWRMRpwBnAceAnjSSXJE1bZObJB0SsB9Zm5nuq9XcDF2fmlrYxD1ZjjlTrjwAXA8eAW4BLgBcBf56Zu6Y4xmZgM8DAwMBFo6OjDXxpJzcxMcHSpUt7fpwmlJL14NFjtcYNnbuskfsbOAMef7bWXfVM3a9loTwGFkpOOPWyrlmz5oHMHO52X73+cNZq4OfAOcCZwNci4s7MfLR9UPVEsAtgeHg4R0ZGehwLxsbG6MdxmlBK1k11P9B0Vb3773Z/W4cmueHg3H4+se7XslAeAwslJ5Sbtc7lnaPAirb15dW2KcdUl3KWAU8C7wS+nJk/y8wfA18Huj4TSZJ6o07p7wdWRsR5EXE6sAHY2zFmL7CxWl4P3J2t60Y/AN4EEBFLgN8D/qOJ4JKk6eta+pk5CWwB7gAeAm7LzEMRsT0iLq+G3QScFRHjwPuBE2/r3AksjYhDtJ48/iEzDzT9RUiS6ql1QTMz9wH7OrZd27b8HK23Z3bebmKq7ZKkueEnciWpIJa+JBXE0pekgvhHVLTg1P0DJZJ+nWf6klQQz/TVc56ZS/OHZ/qSVBBLX5IKYulLUkEsfUkqiKUvSQWx9CWpIJa+JBXE0pekglj6klQQS1+SCmLpS1JBLH1JKoilL0kFsfQlqSCWviQVxNKXpIJY+pJUEEtfkgpi6UtSQSx9SSqIpS9JBbH0Jakglr4kFcTSl6SCWPqSVJBapR8RayPi4YgYj4htU+xfHBG3Vvvvi4jBtn0XRsS9EXEoIg5GxAubiy9Jmo6upR8Ri4CdwKXAKuDKiFjVMexq4OnMPB+4EdhR3fY04PPANZl5ATAC/Kyx9JKkaalzpr8aGM/MRzPzODAKrOsYsw7YXS3vAS6JiADeChzIzG8DZOaTmfnzZqJLkqarTumfCzzWtn6k2jblmMycBI4BZwGvBjIi7oiIb0bEX84+siRppiIzTz4gYj2wNjPfU62/G7g4M7e0jXmwGnOkWn8EuBjYBLwXeB3wU+Au4EOZeVfHMTYDmwEGBgYuGh0dbeSLO5mJiQmWLl3a8+M0YaFnPXj02BylObmBM+DxZ+c2w9C5y2qNWyiPgYWSE069rGvWrHkgM4e73ddpNY53FFjRtr682jbVmCPVdfxlwJO0fiv4amY+ARAR+4DX0ir/X8rMXcAugOHh4RwZGakRa3bGxsbox3GasNCzbtp2+9yE6WLr0CQ3HKzzI9A7h68aqTVuoTwGFkpOKDdrncs7+4GVEXFeRJwObAD2dozZC2ysltcDd2frV4g7gKGIeFH1ZPBG4DuNJJckTVvX05zMnIyILbQKfBFwc2YeiojtwP2ZuRe4CbglIsaBp2g9MZCZT0fEJ2g9cSSwLzPn52mfJBWg1u+2mbkP2Nex7dq25eeAK57ntp+n9bZNSdIc8xO5klQQS1+SCmLpS1JB5vb9alLhBmu+nfWza5f0OIlK4Zm+JBXEM33N2FRnqVuHJufth7EkeaYvSUWx9CWpIJa+JBXE0pekglj6klQQS1+SCmLpS1JBLH1JKoilL0kFsfQlqSCWviQVxNKXpIJY+pJUEEtfkgpi6UtSQSx9SSqIpS9JBbH0Jakglr4kFcS/kSstAAePHqv1t4cPX39ZH9JoIfNMX5IKYulLUkEsfUkqiKUvSQWx9CWpIJa+JBWkVulHxNqIeDgixiNi2xT7F0fErdX++yJisGP/b0fERET8RTOxJUkz0bX0I2IRsBO4FFgFXBkRqzqGXQ08nZnnAzcCOzr2fwL40uzjSpJmo86Z/mpgPDMfzczjwCiwrmPMOmB3tbwHuCQiAiAi3gF8HzjUTGRJ0kzVKf1zgcfa1o9U26Yck5mTwDHgrIhYCvwV8NezjypJmq1eT8NwHXB
jZk5UJ/5TiojNwGaAgYEBxsbGehwLJiYm+nKcJvQ768Gjx2qN2zr069sGzoCtQ5MNJ+qNUzHrXD+m/bnqjSaz1in9o8CKtvXl1bapxhyJiNOAZcCTwMXA+oj4OPAS4BcR8Vxmfqr9xpm5C9gFMDw8nCMjIzP4UqZnbGyMfhynCf3OWmeOl+ezdWiSGw4ujCmdTsWsh68a6X2Yk/DnqjeazFrnEb8fWBkR59Eq9w3AOzvG7AU2AvcC64G7MzOB158YEBHXAROdhS9J6p+upZ+ZkxGxBbgDWATcnJmHImI7cH9m7gVuAm6JiHHgKVpPDJKkeabW77aZuQ/Y17Ht2rbl54ArutzHdTPIJ0lqkJ/IlaSCLIxXsdSIwVm8QCvp1OCZviQVxNKXpIJY+pJUEEtfkgpi6UtSQSx9SSqIpS9JBbH0Jakglr4kFcTSl6SCWPqSVBDn3pFOIXXnVzp8/WU9TqL5yjN9SSqIpS9JBbH0Jakglr4kFcTSl6SC+O4dqUDT+StqvtPn1OKZviQVxNKXpIJY+pJUEEtfkgpi6UtSQSx9SSqIpS9JBbH0Jakglr4kFcRP5Eo6KefoP7V4pi9JBbH0JakgtUo/ItZGxMMRMR4R26bYvzgibq323xcRg9X2t0TEAxFxsPrvm5qNL0majq6lHxGLgJ3ApcAq4MqIWNUx7Grg6cw8H7gR2FFtfwJ4e2YOARuBW5oKLkmavjpn+quB8cx8NDOPA6PAuo4x64Dd1fIe4JKIiMz898z8YbX9EHBGRCxuIrgkafoiM08+IGI9sDYz31Otvxu4ODO3tI15sBpzpFp/pBrzRMf9XJOZb57iGJuBzQADAwMXjY6OzvoL62ZiYoKlS5f2/DhNaCrrwaPHGkhzcgNnwOPP9vwwjTBrs4bOXVbkz1U/1Mm6Zs2aBzJzuNt99eUtmxFxAa1LPm+dan9m7gJ2AQwPD+fIyEjPM42NjdGP4zShqaybpvGHM2Zq69AkNxxcGO8ENmuzDl81UuTPVT80mbXOo+gosKJtfXm1baoxRyLiNGAZ8CRARCwHvgD8cWY+MuvE+n+m8xeQJKnONf39wMqIOC8iTgc2AHs7xuyl9UItwHrg7szMiHgJcDuwLTO/3lRoSdLMdC39zJwEtgB3AA8Bt2XmoYjYHhGXV8NuAs6KiHHg/cCJt3VuAc4Hro2Ib1X/fqvxr0KSVEuti4SZuQ/Y17Ht2rbl54ArprjdR4GPzjKjJKkhfiJXkgpi6UtSQSx9SSqIpS9JBbH0Jakglr4kFcTSl6SCWPqSVJD5PYNTwdrn1Nk6NNmXydIknfo805ekgnimL6kRg9tur/Vb6eHrL+tTIk3FM31JKoilL0kFsfQlqSCWviQVxNKXpIJY+pJUEEtfkgpi6UtSQSx9SSqIpS9JBXEaBknz0mDNSQad1mF6PNOXpIJ4pt9ndc9eJKkXPNOXpIJY+pJUEEtfkgpi6UtSQSx9SSqI797pwnfbSM3yZ2pueaYvSQWpVfoRsTYiHo6I8YjYNsX+xRFxa7X/vogYbNv3gWr7wxHxtuaiS5Kmq2vpR8QiYCdwKbAKuDIiVnUMuxp4OjPPB24EdlS3XQVsAC4A1gJ/W92fJGkO1LmmvxoYz8xHASJiFFgHfKdtzDrgump5D/CpiIhq+2hm/i/w/YgYr+7v3mbi/7q61ws/u3ZJryJI6qOZvkawdWiSTVPcthdz+cyneYTqXN45F3isbf1ItW3KMZk5CRwDzqp5W0lSn8yLd+9ExGZgc7U6EREP9/qYa3ZwNvBEr4/ThPdh1l4wa/MWSk54/qyxYw7CdD92ne/rK+sco07pHwVWtK0vr7ZNNeZIRJwGLAOerHlbMnMXsKtO4KZExP2ZOdzPY86UWXvDrM1bKDmh3Kx1Lu/sB1ZGxHkRcTqtF2b3dozZC2ysltcDd2dmVts3VO/uOQ9YCfxbE8ElSdPX9Uw/MycjYgtwB7AIuDkzD0XEduD+zNwL3ATcUr1Q+xS
tJwaqcbfRetF3EnhvZv68R1+LJKmLWtf0M3MfsK9j27Vty88BVzzPbT8GfGwWGXulr5eTZsmsvWHW5i2UnFBo1mhdhZEklcBpGCSpIMWUfkS8NCK+EhHfq/575hRjXhkR34yIb0XEoYi4Zh5nfU1E3FvlPBARfzRfs1bjvhwR/x0RX+xzvhlPIdJvNbK+oXp8TkbE+rnI2JalW9b3R8R3qsfmXRFR6+2EvVAj6zURcbD6uf/XKWYc6JtuWdvG/WFEZERM/x09mVnEP+DjwLZqeRuwY4oxpwOLq+WlwGHgnHma9dXAymr5HOBHwEvmY9Zq3yXA24Ev9jHbIuAR4FXV/9tvA6s6xvwZ8PfV8gbg1n5/D6eRdRC4EPgcsH4uck4j6xrgRdXyn87z7+tvti1fDnx5vmatxr0Y+CrwDWB4uscp5kyf1pQQu6vl3cA7Ogdk5vFsTRkBsJi5+02oTtbvZub3quUfAj8GXta3hL/SNStAZt4F/E+/QlV+OYVIZh4HTkwh0q49/x7gkmoKkX7rmjUzD2fmAeAXc5CvXZ2s92TmT6vVb9D6jM5cqJP1J22rS4C5eqGzzuMV4CO05jd7biYHKan0BzLzR9XyfwIDUw2KiBURcYDW9BE7qkLtt1pZT4iI1bTODB7pdbApTCtrn81mCpF+W0hTlkw369XAl3qa6PnVyhoR742IR2j95vq+PmXr1DVrRLwWWJGZM/6jBPNiGoamRMSdwMun2PXB9pXMzIiY8tk8Mx8DLoyIc4B/jog9mfn4fMxa3c8rgFuAjZnZkzPAprKqPBHxLmAYeONcZzmZzNwJ7IyIdwIf4lcfNp03IuIFwCeATbO5n1Oq9DPzzc+3LyIej4hXZOaPqqL8cZf7+mFEPAi8ntav/Y1qImtE/CZwO/DBzPxG0xlPaPL72mezmUKk32pNWTJP1MoaEW+mdWLwxrbLpv023e/rKPB3PU30/LplfTHwO8BYdQXy5cDeiLg8M++ve5CSLu+0TxWxEfiXzgERsTwizqiWzwR+H+j55G9TqJP1dOALwOcys/EnpWnomnUOzWYKkX6rk3W+6Jo1In4X+DRweWbO5YlAnawr21YvA77Xx3ztTpo1M49l5tmZOZiZg7ReK5lW4Z+4oyL+0bpOexet/6F3Ai+ttg8Dn6mW3wIcoPWq+QFg8zzO+i7gZ8C32v69Zj5mrda/BvwX8Cyta5Vv61O+PwC+S+v1jg9W27ZXPywALwT+CRinNS/Uq+bwMdot6+uq790ztH4bOTSPs94JPN722Nw7j7N+EjhU5bwHuGC+Zu0YO8YM3r3jJ3IlqSAlXd6RpOJZ+pJUEEtfkgpi6UtSQSx9SSqIpS9JBbH0Jakglr4kFeT/AGZIODutd99+AAAAAElFTkSuQmCC\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "weights = pd.np.ones_like(df.a1.values) / len(df.a1.values)\n", "df.a1.hist(bins=30, weights=weights)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.3 Cumulative Distribution Function" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A cumulative histogram is a mapping that counts the cumulative number of observations in all of the bins up to the specified bin.\n", "\n", "Let's make a cumulative histogram for a1 column.\n", "We can observe on the plot below that there are approximately 500 data points where the x is smaller or equal to 0.0." ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAYAAAAD8CAYAAAB+UHOxAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDMuMC4zLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvnQurowAAEsxJREFUeJzt3X+QXWV9x/H3VyiIRBN+OCtuMgbH1A5lrYUt0KHVjdGK2BJmikqLGpx0Mlb80ZJOibUzdHSmA52ixY5DmxHa0HFcFG3JIOpgyI51xlCJ0oQfVRaKkjUEUYguYO2O3/5xn9jrumF37+/N837N7OSc5z73nM9udu9nz7l3z43MRJJUn+f0O4AkqT8sAEmqlAUgSZWyACSpUhaAJFXKApCkSlkAklQpC0CSKmUBSFKlju53gGdz8skn5+rVq7u+n6eeeorjjz++6/vpBLN2h1k7b6nkhCMv6+7dux/PzBfOu7HMHNiPM888M3th586dPdlPJ5i1O8zaeUslZ+aRlxW4KxfwGOspIEmqlAUgSZWyACSpUhaAJFXKApCkSs1bABFxQ0Q8FhH3NI2dGBG3R8QD5d8TynhExEcjYjIi9kTEGU332VDmPxARG7rz6UiSFmohRwD/DJw3a2wLsCMz1wA7yjrAG4A15WMTcB00CgO4EjgbOAu48lBpSJL6Y94CyMwvAz+YNbwe2FaWtwEXNo3fWF6KugtYERGnAK8Hbs/MH2TmE8Dt/GKpSJJ6qNXnAIYyc39ZfhQYKsvDwCNN8/aVscONS5L6pO1LQWRmRkTH3lk+IjbROH3E0NAQExMTndr0YU1PT/dkP51g1u4wa+d1MufeqYMd2c7hDB0Hf/+JW7q6j8UaGV4+53gnv66tFsCBiDglM/eXUzyPlfEpYFXTvJVlbAoYmzU+MdeGM3MrsBVgdHQ0x8bG5prWURMTE/RiP51g1u4wa+d1MuelWz7Xke0czuaRGa7ZO1iXRnv4krE5xzv5dW31M94ObACuKv/e0jT+7ogYp/GE78FSEl8E/rrpid/fAd7femxJg2z1ls+xeWSm6w/cas+8BRARn6Tx2/vJEbG
Pxqt5rgI+FREbgW8Dby7TbwPOByaBp4F3AGTmDyLiQ8DXyrwPZubsJ5YlST00bwFk5h8c5qZ1c8xN4LLDbOcG4IZFpZMkdc1gnfSSNNBWe0rniOKlICSpUhaAJFXKApCkSlkAklQpnwSWKucTu/XyCECSKmUBSFKlLABJqpQFIEmV8klg6Qjlk7uaj0cAklQpC0CSKmUBSFKlLABJqpQFIEmVsgAkqVK+DFRaYvZOHfS9dtURHgFIUqUsAEmqlAUgSZWyACSpUhaAJFXKApCkSlkAklQpC0CSKuUfgkkDYqHX79880uUgqoZHAJJUKQtAkiplAUhSpSwASaqUBSBJlWqrACLiTyPi3oi4JyI+GRHPjYhTI+LOiJiMiJsi4pgy99iyPlluX92JT0CS1JqWCyAihoH3AqOZeTpwFHAxcDXwkcx8GfAEsLHcZSPwRBn/SJknSeqTdk8BHQ0cFxFHA88D9gOvAW4ut28DLizL68s65fZ1ERFt7l+S1KKWCyAzp4C/Bb5D44H/ILAbeDIzZ8q0fcBwWR4GHin3nSnzT2p1/5Kk9kRmtnbHiBOAzwBvAZ4EPk3jN/u/Kqd5iIhVwOcz8/SIuAc4LzP3ldseBM7OzMdnbXcTsAlgaGjozPHx8ZbyLcb09DTLli3r+n46wazdMQhZ904dXNC8oePgwDNdDtMBSyUnDGbWkeHlc44v5Ht17dq1uzNzdL59tHMpiNcC/52Z3wOIiM8C5wIrIuLo8lv+SmCqzJ8CVgH7yimj5cD3Z280M7cCWwFGR0dzbGysjYgLMzExQS/20wlm7Y5ByLrQ9/ndPDLDNXsH/youSyUnDGbWhy8Zm3O8k9+r7XzG3wHOiYjnAc8A64C7gJ3ARcA4sAG4pczfXta/Wm6/I1s9/JCWkIVe40fqtXaeA7iTximfrwN7y7a2AlcAl0fEJI1z/NeXu1wPnFTGLwe2tJFbktSmto55MvNK4MpZww8BZ80x98fAm9rZnySpc/xLYEmqlAUgSZWyACSpUhaAJFXKApCkSlkAklQpC0CSKmUBSFKlLABJqpQFIEmVsgAkqVIWgCRVygKQpEpZAJJUqcF6CxxpCfGNXrTUeQQgSZWyACSpUhaAJFXKApCkSlkAklQpC0CSKmUBSFKlLABJqpQFIEmVsgAkqVIWgCRVygKQpEpZAJJUKQtAkiplAUhSpSwASaqUBSBJlfIdwaRZfKcv1aKtI4CIWBERN0fEf0XE/RHxmxFxYkTcHhEPlH9PKHMjIj4aEZMRsScizujMpyBJakW7p4CuBb6Qmb8C/BpwP7AF2JGZa4AdZR3gDcCa8rEJuK7NfUuS2tByAUTEcuBVwPUAmfmTzHwSWA9sK9O2AReW5fXAjdmwC1gREae0nFyS1JZ2jgBOBb4H/FNEfCMiPh4RxwNDmbm/zHkUGCrLw8AjTfffV8YkSX0QmdnaHSNGgV3AuZl5Z0RcC/wQeE9mrmia90RmnhARtwJXZeZXyvgO4IrMvGvWdjfROEXE0NDQmePj4y3lW4zp6WmWLVvW9f10glm7oznr3qmDfU7z7IaOgwPP9DvF/JZKThjMrCPDy+ccX8jP1dq1a3dn5uh8+2jnVUD7gH2ZeWdZv5nG+f4DEXFKZu4vp3geK7dPAaua7r+yjP2czNwKbAUYHR3NsbGxNiIuzMTEBL3YTyeYtTuas1464K8C2jwywzV7B/8FfEslJwxm1ocvGZtzvJM/Vy2fAsrMR4FHIuLlZWgdcB+wHdhQxjYAt5Tl7cDby6uBzgEONp0qkiT1WLuV9x7gExFxDPAQ8A4apfKpiNgIfBt4c5l7G3A+MAk8XeZKkvqkrQLIzLuBuc4zrZtjbgKXtbM/SVLneCkISaqUBSBJlbIAJKlSFoAkVcoCkKRKWQCSVCkLQJIqZQFIUqUsAEmqlAUgSZWyACSpUoN1/VOpS+Z7o/fNIzMDfxloqdM8ApCkSlkAklQpC0CSKmUBSFKlLABJqpQFIEm
VsgAkqVIWgCRVygKQpEpZAJJUKQtAkiplAUhSpSwASaqUBSBJlbIAJKlSFoAkVcoCkKRKWQCSVCkLQJIq5XsCa0mb771+JR1e20cAEXFURHwjIm4t66dGxJ0RMRkRN0XEMWX82LI+WW5f3e6+JUmt68QpoPcB9zetXw18JDNfBjwBbCzjG4EnyvhHyjxJUp+0VQARsRJ4I/Dxsh7Aa4Cby5RtwIVleX1Zp9y+rsyXJPVBu0cAfwf8OfDTsn4S8GRmzpT1fcBwWR4GHgEotx8s8yVJfRCZ2dodI34XOD8z3xURY8CfAZcCu8ppHiJiFfD5zDw9Iu4BzsvMfeW2B4GzM/PxWdvdBGwCGBoaOnN8fLylfIsxPT3NsmXLur6fTjDrz9s7dbAj2xk6Dg4805FNdd1SybpUcsJgZh0ZXj7n+EJ+rtauXbs7M0fn20c7rwI6F7ggIs4Hngu8ALgWWBERR5ff8lcCU2X+FLAK2BcRRwPLge/P3mhmbgW2AoyOjubY2FgbERdmYmKCXuynE8z68y7t0KuANo/McM3epfGiuKWSdankhMHM+vAlY3OOd/LnquVTQJn5/sxcmZmrgYuBOzLzEmAncFGZtgG4pSxvL+uU2+/IVg8/JElt68Yfgl0BXB4RkzTO8V9fxq8HTirjlwNburBvSdICdeSYJzMngImy/BBw1hxzfgy8qRP7kyS1z0tBSFKlLABJqpQFIEmVsgAkqVIWgCRVarD+8kEqvMyz1H0eAUhSpSwASaqUBSBJlbIAJKlSFoAkVcoCkKRKWQCSVCkLQJIqZQFIUqUsAEmqlAUgSZWyACSpUhaAJFXKApCkSlkAklQpC0CSKmUBSFKlLABJqpQFIEmVsgAkqVK+Kbx6yjd7lwaHRwCSVCkLQJIqZQFIUqUsAEmqlAUgSZWyACSpUi0XQESsioidEXFfRNwbEe8r4ydGxO0R8UD594QyHhHx0YiYjIg9EXFGpz4JSdLitXMEMANszszTgHOAyyLiNGALsCMz1wA7yjrAG4A15WMTcF0b+5YktanlAsjM/Zn59bL8I+B+YBhYD2wr07YBF5bl9cCN2bALWBERp7ScXJLUlsjM9jcSsRr4MnA68J3MXFHGA3giM1dExK3AVZn5lXLbDuCKzLxr1rY20ThCYGho6Mzx8fG2881nenqaZcuWdX0/nbDUs+6dOtinNM9u6Dg48Ey/UyzMUsm6VHLCYGYdGV4+5/hCHgPWrl27OzNH59tH25eCiIhlwGeAP8nMHzYe8xsyMyNiUQ2TmVuBrQCjo6M5NjbWbsR5TUxM0Iv9dMJSz3rpgF4KYvPIDNfsXRpXRlkqWZdKThjMrA9fMjbneCcfA9p6FVBE/BKNB/9PZOZny/CBQ6d2yr+PlfEpYFXT3VeWMUlSH7TzKqAArgfuz8wPN920HdhQljcAtzSNv728Gugc4GBm7m91/5Kk9rRzzHMu8DZgb0TcXcb+ArgK+FREbAS+Dby53HYbcD4wCTwNvKONfUuS2tRyAZQnc+MwN6+bY34Cl7W6Pw22uS7zvHlkZmDP+UvyL4ElqVoWgCRVygKQpEpZAJJUKQtAkiplAUhSpSwASaqUBSBJlbIAJKlSFoAkVcoCkKRKDdYFsDVw5rrGj6Qjg0cAklQpC0CSKmUBSFKlLABJqpQFIEmVsgAkqVIWgCRVyr8DqJSv75fkEYAkVcoCkKRKWQCSVCkLQJIqZQFIUqUsAEmqlC8DPYL40k5Ji+ERgCRVygKQpEpZAJJUKZ8DWAKaz+1vHpnhUs/1S+oAjwAkqVI9L4CIOC8ivhkRkxGxpdf7lyQ19PQUUEQcBXwMeB2wD/haRGzPzPt6mWNQ+LJNSf3U6+cAzgImM/MhgIgYB9YDR1QB+MAuaSnodQEMA480re8Dzu5xhpb5wC7pSBKZ2budRVwEnJeZf1TW3wacnZnvbpqzCdhUVl8OfLMH0U4GHu/BfjrBrN1h1s5bKjnhyMv6ksx84Xwb6vU
RwBSwqml9ZRn7mczcCmztZaiIuCszR3u5z1aZtTvM2nlLJSfUm7XXrwL6GrAmIk6NiGOAi4HtPc4gSaLHRwCZORMR7wa+CBwF3JCZ9/YygySpoed/CZyZtwG39Xq/8+jpKac2mbU7zNp5SyUnVJq1p08CS5IGh5eCkKRKVVkAEXFiRNweEQ+Uf0+YY85LIuLrEXF3RNwbEe8c4KyvjIivlpx7IuItg5q1zPtCRDwZEbf2ON+zXoYkIo6NiJvK7XdGxOpe5puVZb6sryrfnzPl5dV9s4Csl0fEfeV7c0dEvKQfOUuW+bK+MyL2lp/7r0TEaf3IWbIs6LI5EfH7EZERsfhXBmVmdR/A3wBbyvIW4Oo55hwDHFuWlwEPAy8e0Ky/DKwpyy8G9gMrBjFruW0d8HvArT3MdhTwIPDS8n/7n8Bps+a8C/iHsnwxcFOvv4aLyLoaeAVwI3BRP3IuIuta4Hll+Y8H/Ov6gqblC4AvDGrWMu/5wJeBXcDoYvdT5REAjctPbCvL24ALZ0/IzJ9k5v+U1WPp39HSQrJ+KzMfKMvfBR4D5v0jkC6YNytAZu4AftSrUMXPLkOSmT8BDl2GpFlz/puBdRERPcx4yLxZM/PhzNwD/LQP+ZotJOvOzHy6rO6i8fc//bCQrD9sWj0e6NeTpAv5fgX4EHA18ONWdlJrAQxl5v6y/CgwNNekiFgVEXtoXL7i6vLg2msLynpIRJxF4zeGB7sdbA6Lytpjc12GZPhwczJzBjgInNSTdIfJUcyVdVAsNutG4PNdTXR4C8oaEZdFxIM0jmjf26Nss82bNSLOAFZlZsvXqDli3xAmIr4EvGiOmz7QvJKZGRFztnxmPgK8IiJeDPxbRNycmQcGMWvZzinAvwAbMrMrvxl2KqvqExFvBUaBV/c7y7PJzI8BH4uIPwT+EtjQ50i/ICKeA3wYuLSd7RyxBZCZrz3cbRFxICJOycz95UHzsXm29d2IuAf4bRqnBjqqE1kj4gXA54APZOauTmc8pJNf1x6b9zIkTXP2RcTRwHLg+72JN2eOQ+bKOigWlDUiXkvjl4RXN51a7bXFfl3Hgeu6mujw5sv6fOB0YKKcpXwRsD0iLsjMuxa6k1pPAW3n/1t9A3DL7AkRsTIijivLJwC/RW8uTDfbQrIeA/wrcGNmdrygFmHerH20kMuQNOe/CLgjyzNtPbaULpkyb9aI+HXgH4ELMrOfvxQsJOuaptU3Ag/0MF+zZ82amQcz8+TMXJ2Zq2k8t7KoB/9DG6rug8Z53R00/nO/BJxYxkeBj5fl1wF7aDz7vgfYNMBZ3wr8L3B308crBzFrWf934HvAMzTObb6+R/nOB75F4/mRD5SxD5YfHIDnAp8GJoH/AF7ax+/R+bL+RvnaPUXjKOXeAc76JeBA0/fm9gHOei1wb8m5E/jVQc06a+4ELbwKyL8ElqRK1XoKSJKqZwFIUqUsAEmqlAUgSZWyACSpUhaAJFXKApCkSlkAklSp/wORSdDqEBHPbAAAAABJRU5ErkJggg==\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "df.a1.hist(bins=30, cumulative=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A normalized cumulative histogram is what we call Cumulative distribution function (CDF) in statistics.\n", "The CDF is the probability that the variable takes a value less than or equal to x.\n", "In the example below, the probability that x <= 0.0 is 0.5 and x <= 0.2 is cca. 0.98.\n", "Note that `densitiy=1` argument works as expected with cumulative histograms." ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAXcAAAD8CAYAAACMwORRAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDMuMC4zLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvnQurowAAEXdJREFUeJzt3X9sXWd9x/H3l7BCV5cACzOQZE3RUmlZjYB4jSa21RZFS5mWTCJjKaUiEyViI2NSs2mZijpU9sfK1EmbyIBoIH5Iw5RKY1HJ6EaJxZgISzMgIakKpovWhJLyM5tLoVh894dP4NZc+x7H5/56+n5JVs4599F5Pnbsj4/PPffcyEwkSWV5Wr8DSJKaZ7lLUoEsd0kqkOUuSQWy3CWpQJa7JBXIcpekAlnuklQgy12SCvT0fk28Zs2a3LBhQ9fneeyxx7jsssu6Pk8ThiXrsOQEs3aLWbujTtZjx459MzOf13FnmdmXj82bN2cvHD58uCfzNGFYsg5LzkyzdotZu6NOVuD+rNGxnpaRpAJZ7pJUIMtdkgpkuUtSgSx3SSpQx3KPiPdFxKMR8aVFHo+I+LuImImI4xHxsuZjSpKWo86R+/uBrUs8fj2wsfrYDbxr5bEkSSvRsdwz89PAt5cYsh34YHUJ5hHg2RHxgqYCSpKWr4lz7muBh1vWz1TbJEl9ElnjDbIjYgNwT2Ze3eaxe4C/yszPVOv3AX+Wmfe3Gbub+VM3jI6Obp6amlpR+DpmZ2cZGRnp+jxNGJasw5ITzNotTWY9cfZ8I/tZzOilcO7xrk6xbGNrV7fdXufrOjk5eSwzxzvN0cS9Zc4C61vW11XbfkpmHgAOAIyPj+fExEQD0y9tenqaXszThGHJOiw5wazd0mTWXfs+3sh+FrN3bI47T/TtNlptnb5xou32Jr+uTXzGB4E9ETEFbAHOZ+YjDexX0oDZUBXx3rG5rpeyVqZjuUfEh4EJYE1EnAH+AvgZgMx8N3AIeBUwA3wP+P1uhZUk1dOx3DPzhg6PJ/DmxhJJklZssE5ESeqLDZ5iKY63H5CkAlnuklQgy12SCmS5S1KBfEJVKphPlD51eeQuSQWy3CWpQJa7JBXIcpekAvmEqjSEfKJUnXjkLkkFstwlqUCWuyQVyHKXpAJZ7pJUIMtdkgrkpZDSAPESRzXFI3dJKpDlLkkFstwlqUCWu
yQVyHKXpAJZ7pJUIMtdkgpkuUtSgXwRk9QDdV6ctHdsDn8k1RSP3CWpQJa7JBXIcpekAlnuklQgy12SClSr3CNia0Q8GBEzEbGvzeO/EBGHI+LzEXE8Il7VfFRJUl0dyz0iVgH7geuBTcANEbFpwbC3Andl5kuBncDfNx1UklRfnSP3a4CZzHwoM58ApoDtC8Yk8KxqeTXwteYiSpKWq84rJtYCD7esnwG2LBjzNuBfI+KPgMuA6xpJJ0m6KJGZSw+I2AFszcybq/WbgC2ZuadlzC3Vvu6MiF8F3gtcnZk/WrCv3cBugNHR0c1TU1ONfjLtzM7OMjIy0vV5mjAsWYclJwxO1hNnz3ccM3opnHu8B2EaYNaVGVu7uu32Ot+vk5OTxzJzvNMcdY7czwLrW9bXVdtavQHYCpCZn42IZwJrgEdbB2XmAeAAwPj4eE5MTNSYfmWmp6fpxTxNGJasw5ITBifrrpq3H7jzxHDcfsCsK3P6xom225v8fq3zGR8FNkbElcyX+k7gtQvG/A/wCuD9EfFLwDOBbzSSUBpgvqG1BlXHJ1Qzcw7YA9wLPMD8VTEnI+L2iNhWDdsLvDEivgh8GNiVnc73SJK6ptbfKpl5CDi0YNttLcungJc3G02SdLF8haokFchyl6QCWe6SVCDLXZIKZLlLUoEsd0kqkOUuSQWy3CWpQJa7JBXIcpekAlnuklQgy12SCmS5S1KBLHdJKtBgvT2JNAB8Aw6VwCN3SSqQ5S5JBbLcJalAlrskFchyl6QCWe6SVCDLXZIKZLlLUoEsd0kqkOUuSQWy3CWpQJa7JBXIcpekAlnuklQgy12SCmS5S1KBLHdJKpDvxKSnDN9hSU8ltY7cI2JrRDwYETMRsW+RMa+JiFMRcTIi/rHZmJKk5eh45B4Rq4D9wCuBM8DRiDiYmadaxmwE/hx4eWZ+JyJ+vluBJUmd1TlyvwaYycyHMvMJYArYvmDMG4H9mfkdgMx8tNmYkqTlqFPua4GHW9bPVNtaXQVcFRH/ERFHImJrUwElScsXmbn0gIgdwNbMvLlavwnYkpl7WsbcA/wQeA2wDvg0MJaZ312wr93AboDR0dHNU1NTDX4q7c3OzjIyMtL1eZowLFmHJSc8OeuJs+f7nGZpo5fCucf7naIes67M2NrVbbfX+dmanJw8lpnjneaoc7XMWWB9y/q6alurM8DnMvOHwH9HxJeBjcDR1kGZeQA4ADA+Pp4TExM1pl+Z6elpejFPE4Yl67DkhCdn3TXgV8vsHZvjzhPDcQGbWVfm9I0Tbbc3+bNV57TMUWBjRFwZEZcAO4GDC8Z8DJgAiIg1zJ+meaiRhJKkZetY7pk5B+wB7gUeAO7KzJMRcXtEbKuG3Qt8KyJOAYeBP83Mb3UrtCRpabX+VsnMQ8ChBdtua1lO4JbqQ5LUZ95+QJIKZLlLUoEsd0kqkOUuSQWy3CWpQJa7JBXIcpekAlnuklQgy12SCmS5S1KBLHdJKtBg3QdTughLvfH13rG5gb/Vr9QNHrlLUoEsd0kqkOUuSQWy3CWpQJa7JBXIcpekAlnuklQgy12SCmS5S1KBLHdJKpDlLkkFstwlqUCWuyQVyHKXpAJZ7pJUIMtdkgpkuUtSgSx3SSqQ5S5JBfI9VDWwlnpvVElLq3XkHhFbI+LBiJiJiH1LjHt1RGREjDcXUZK0XB3LPSJWAfuB64FNwA0RsanNuMuBPwY+13RISdLy1DlyvwaYycyHMvMJYArY3mbc24E7gO83mE+SdBHqlPta4OGW9TPVth+LiJcB6zPTk6SSNAAiM5ceELED2JqZN1frNwFbMnNPtf404FPArsw8HRHTwJ9k5v1t9rUb2A0wOjq6eWpqqsnPpa3Z2VlGRka6Pk8ThiVrr3KeOHt+xfsYvRTOPd5AmB4wa3cMYtaxtavbbq/zszU5OXksMzs+r1nnapmzwPqW9XXVtgsuB64GpiMC4PnAwYjYtrDgM/MAcABgfHw8JyYma
ky/MtPT0/RiniYMS9Ze5dzVwNUye8fmuPPEcFwUZtbuGMSsp2+caLu9yZ+tOqdljgIbI+LKiLgE2AkcvPBgZp7PzDWZuSEzNwBHgJ8qdklS73Qs98ycA/YA9wIPAHdl5smIuD0itnU7oCRp+Wr9rZKZh4BDC7bdtsjYiZXHkiSthLcfkKQCWe6SVCDLXZIKZLlLUoEsd0kq0GBd2a+nBG/lK3WfR+6SVCDLXZIKZLlLUoEsd0kqkOUuSQWy3CWpQJa7JBXIcpekAlnuklQgy12SCmS5S1KBLHdJKpDlLkkFstwlqUCWuyQVyHKXpAJZ7pJUIMtdkgpkuUtSgSx3SSqQb5CtxvjG19Lg8MhdkgpkuUtSgSx3SSqQ5S5JBbLcJalAlrskFahWuUfE1oh4MCJmImJfm8dviYhTEXE8Iu6LiCuajypJqqtjuUfEKmA/cD2wCbghIjYtGPZ5YDwzXwzcDbyj6aCSpPrqHLlfA8xk5kOZ+QQwBWxvHZCZhzPze9XqEWBdszElScsRmbn0gIgdwNbMvLlavwnYkpl7Fhn/TuDrmfmXbR7bDewGGB0d3Tw1NbXC+J3Nzs4yMjLS9XmaMCxZF8t54uz5PqRZ2uilcO7xfqeox6zdMYhZx9aubru9TgdMTk4ey8zxTnM0evuBiHgdMA5c2+7xzDwAHAAYHx/PiYmJJqdva3p6ml7M04RhybpYzl0DePuBvWNz3HliOO6yYdbuGMSsp2+caLu9yQ6o8xmfBda3rK+rtj1JRFwH3Apcm5k/aCSdJOmi1DnnfhTYGBFXRsQlwE7gYOuAiHgp8B5gW2Y+2nxMSdJydCz3zJwD9gD3Ag8Ad2XmyYi4PSK2VcP+GhgBPhoRX4iIg4vsTpLUA7VORGXmIeDQgm23tSxf13AuDZCFt/LdOzY3kOfXJf2Er1CVpAJZ7pJUIMtdkgpkuUtSgSx3SSqQ5S5JBbLcJalAlrskFchyl6QCWe6SVCDLXZIKNFg3OVbPLLxfjKSyeOQuSQWy3CWpQJa7JBXIcpekAlnuklQgy12SCmS5S1KBvM69MF6/Lgk8cpekIlnuklQgy12SCmS5S1KBLHdJKpDlLkkF8lLIIeEljpKWwyN3SSqQ5S5JBbLcJalAnnPvs9Zz6XvH5tjluXVJDfDIXZIKVKvcI2JrRDwYETMRsa/N48+IiI9Uj38uIjY0HVSSVF/H0zIRsQrYD7wSOAMcjYiDmXmqZdgbgO9k5i9GxE7gDuD3uhF4WHjpoqR+qnPO/RpgJjMfAoiIKWA70Fru24G3Vct3A++MiMjMbDDrQLC0JQ2DOuW+Fni4Zf0MsGWxMZk5FxHngZ8DvtlEyF6wtCWVpKdXy0TEbmB3tTobEQ/2YNo1DMkvmbcMSdZhyQlm7RazrkzcsehDdbJeUWeOOuV+Fljfsr6u2tZuzJmIeDqwGvjWwh1l5gHgQJ1gTYmI+zNzvJdzXqxhyTosOcGs3WLW7mgya52rZY4CGyPiyoi4BNgJHFww5iDw+mp5B/CpEs+3S9Kw6HjkXp1D3wPcC6wC3peZJyPiduD+zDwIvBf4UETMAN9m/heAJKlPap1zz8xDwKEF225rWf4+8LvNRmtMT08DrdCwZB2WnGDWbjFrdzSWNTx7Iknl8fYDklSg4so9Ip4bEf8WEV+p/n1OmzFXRMR/RcQXIuJkRLxpgLO+JCI+W+U8HhE9f+VvnZzVuE9ExHcj4p4+ZByaW2TUyPob1ffnXETs6EfGliydst4SEaeq7837IqLWZXrdUCPrmyLiRPVz/5mI2NSPnFWWJbO2jHt1RGRELP8Kmsws6gN4B7CvWt4H3NFmzCXAM6rlEeA08MIBzXoVsLFafiHwCPDsQctZPfYK4LeBe3qcbxXwVeBF1f/tF4FNC8b8IfDuankn8JFe/38vI+sG4MXAB4Ed/ci5jKyTwM9Wy38w4F/XZ7UsbwM+MahZq3GXA58GjgDjy52nuCN35m+F8IFq+QPA7ywck
JlPZOYPqtVn0L+/YOpk/XJmfqVa/hrwKPC8niWc1zEnQGbeB/xfr0K1+PEtMjLzCeDCLTJatX4OdwOviIjoYcYLOmbNzNOZeRz4UR/ytaqT9XBmfq9aPcL862D6oU7W/21ZvQzo1xOOdb5fAd7O/H26vn8xk5RY7qOZ+Ui1/HVgtN2giFgfEceZv23CHVVx9lqtrBdExDXM/6b/areDLbCsnH3Q7hYZaxcbk5lzwIVbZPRanayDYrlZ3wD8S1cTLa5W1oh4c0R8lfm/Rt/So2wLdcwaES8D1mfmRd8XZSjfrCMiPgk8v81Dt7auZGZGRNvfzpn5MPDiiHgh8LGIuDszzw1i1mo/LwA+BLw+Mxs/omsqp56aIuJ1wDhwbb+zLCUz9wP7I+K1wFv5yYsvB0ZEPA34G2DXSvYzlOWemdct9lhEnIuIF2TmI1UhPtphX1+LiC8Bv878n+uNaiJrRDwL+Dhwa2YeaTpjUzn7qLFbZPRAnayDolbWiLiO+YOAa1tOd/bacr+uU8C7uppocZ2yXg5cDUxXZw6fDxyMiG2ZeX/dSUo8LdN6K4TXA/+8cEBErIuIS6vl5wC/BvTiJmYL1cl6CfBPwAczs/FfPjV1zNlnw3SLjDpZB0XHrBHxUuA9wLbM7Ocv/TpZN7as/hbwlR7ma7Vk1sw8n5lrMnNDZm5g/rmMZRX7hR0V9cH8edT7mP+P+yTw3Gr7OPAP1fIrgePMP0t9HNg9wFlfB/wQ+ELLx0sGLWe1/u/AN4DHmT+P+Js9zPgq4MvMPx9xa7Xt9uqHAuCZwEeBGeA/gRf18Xu0U9Zfqb5+jzH/18XJAc76SeBcy/fmwQHO+rfAySrnYeCXBzXrgrHTXMTVMr5CVZIKVOJpGUl6yrPcJalAlrskFchyl6QCWe6SVCDLXZIKZLlLUoEsd0kq0P8DF/oPVytT1eAAAAAASUVORK5CYII=\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "df.a1.hist(bins=30, cumulative=True, density=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3. Plots for separate groups" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Pandas enables us to visualize data separated by the value of the specified column.\n", "Separating data by certain columns and observing differences in distributions is a common step in Exploratory Data Analysis.\n", "Let's separate distributions of a1 and a2 columns by the y2 column and plot histograms." ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([,\n", " ],\n", " dtype=object)" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAYUAAAENCAYAAADgwHn9AAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDMuMC4zLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvnQurowAADp9JREFUeJzt3X+oZGd9x/H3x0Sb4u8lt+uaTXqlri1aMMptFOwfttYak8JaKEFLdWtTttQEKpTSbRFMi8JCqaWCBtYqWaVGA5q6ZUNbWQpSipobjdZoNYvdkN3G7FpttFq0id/+cc8+jsk19+78uGfmzPsFwz3zzJmZ74F5+DzP+XVTVUiSBPCEvguQJM0PQ0GS1BgKkqTGUJAkNYaCJKkxFCRJjaEgSWoMhQWRZFeS25N8J8l9SX6z75qkPiS5Mcl6ku8luaXveobm4r4L0La9C/g+sBu4Ejie5HNVdU+/ZUk77j+BtwGvAn6y51oGJ17RPP+SPBn4JvDzVfWVru0DwJmqOtRrcVJPkrwN2FtVv913LUPi7qPF8Dzg4fOB0Pkc8IKe6pE0UIbCYngK8K1HtT0EPLWHWiQNmKGwGP4HeNqj2p4GfLuHWiQNmKGwGL4CXJxk30jbCwEPMkuaKkNhAVTVd4CPAn+e5MlJXgbsBz7Qb2XSzktycZJLgIuAi5JcksQzKafEUFgcb2Lj9LuzwK3A73s6qpbUW4D/BQ4Bv9Utv6XXigbEU1IlSY0zBUlSYyhIkhpDQZLUGAqSpMZQkCQ1c3Fu76WXXlqrq6t9l6GBueuuu75eVSt913Gh7A+ahe32h7kIhdXVVdbX1/suQwOT5L6+axiH/UGzsN3+4O4jSVJjKEiSGkNBktQYCpKkxlCQJDWGgiSpMRQkSY2hIElq5uLitSFYPXT8MW2nDl/bQyVSvzbrC2B/WBTOFCRJjaEgSWoMBUlSYyhIkhpDQZLUGAqSpMZQkCQ1hoIkqTEUJEmNoSBJagwFSVKzZSgkuTzJPyf5YpJ7kvxB174ryceT3Nv9fWbXniTvTHIyyeeTvHjWGyFJmo7tzBQeBv6wqp4PvBS4IcnzgUPAiaraB5zongO8GtjXPQ4CN0+9aqkHDpC
0DLYMhap6oKo+0y1/G/gScBmwHzjarXYUeE23vB94f234JPCMJHumXrm08xwgafAu6JhCklXgRcCngN1V9UD30teA3d3yZcD9I2873bU9+rMOJllPsn7u3LkLLFvaeQ6QtAy2HQpJngJ8BHhzVX1r9LWqKqAu5Iur6khVrVXV2srKyoW8VerdNAdI3ec5SNJc2FYoJHkiG4Hwt1X10a75wfOjnu7v2a79DHD5yNv3dm3SIEx7gNS9z0GS5sJ2zj4K8F7gS1X1jpGXjgEHuuUDwMdG2t/QHWR7KfDQyChKWmgOkDR025kpvAx4PfDLSe7uHtcAh4FXJrkX+JXuOcAdwFeBk8B7gDdNv2xp5zlA0jLY8n80V9W/APkxL79ik/ULuGHCuqR5dH6A9G9J7u7a/pSNAdFtSa4H7gOu6167A7iGjQHSd4E37my50oXbMhQkbXCApGXgbS4kSY2hIElqDAVJUmMoSJIaQ0GS1BgKkqTGUJAkNYaCJKkxFCRJjVc0S9rU6qHjm7afOnztDleineRMQZLUGAqSpMZQkCQ1hoIkqTEUJEmNoSBJagwFSVJjKEiSGkNBktQYCpKkxlCQJDWGgiSpMRQkSY2hIElqDAVJUmMoSJIaQ0GS1BgKkqTGUJAkNYaCJKkxFCRJzcV9FzBvVg8df0zbqcPX9lCJJO08ZwqSpMZQkCQ1hoIkqTEUJEmNoSBJarYMhSTvS3I2yRdG2m5KcibJ3d3jmpHX/iTJySRfTvKqWRUuSZq+7cwUbgGu3qT9r6rqyu5xB0CS5wOvBV7QvefdSS6aVrFS3xwkaei2DIWq+gTwjW1+3n7gQ1X1var6D+AkcNUE9Unz5hYcJGnAJjmmcGOSz3cjp2d2bZcB94+sc7prkwbBQZKGbtxQuBn4GeBK4AHgLy/0A5IcTLKeZP3cuXNjliHNDQdJGoSxQqGqHqyqR6rqB8B7+OHo5wxw+ciqe7u2zT7jSFWtVdXaysrKOGVI88JBkgZjrFBIsmfk6a8D5w+6HQNem+QnkjwH2Ad8erISpfnmIElDsuUN8ZLcCrwcuDTJaeCtwMuTXAkUcAr4PYCquifJbcAXgYeBG6rqkdmULs2HJHuq6oHu6aMHSR9M8g7g2ThI0gLYMhSq6nWbNL/3cdZ/O/D2SYqS5pWDJA2dt86WLoCDJA2dt7mQJDWGgiSpMRQkSY2hIElqDAVJUmMoSJIaQ0GS1BgKkqTGUJAkNYaCJKkxFCRJjaEgSWoMBUlSYyhIkhpDQZLUGAqSpMZQkCQ1hoIkqTEUJEmNoSBJagwFSVJzcd8FSNLjuunpm7Q9tPN1LAlnCpKkxlCQJDWGgiSp8ZjCPNpsHypsvR913PdJUseZgiSpMRQkSY2hIElqDAVJUuOBZknD5EVvY3GmIElqDAVJUmMoSJIaQ0GS1BgKkqTGUJAkNYaCJKnZMhSSvC/J2SRfGGnbleTjSe7t/j6za0+SdyY5meTzSV48y+KlnWZ/0NBtZ6ZwC3D1o9oOASeqah9wonsO8GpgX/c4CNw8nTKluXEL9gcN2JahUFWfAL7xqOb9wNFu+SjwmpH299eGTwLPSLJnWsVKfbM/aOjGPaawu6oe6Ja/Buzuli8D7h9Z73TXJg2Z/UGDMfGB5qoqoC70fUkOJllPsn7u3LlJy5Dmgv1Bi27cUHjw/DS4+3u2az8DXD6y3t6u7TGq6khVrVXV2srKyphlSHPB/qDBGDcUjgEHuuUDwMdG2t/QnXXxUuChkWm1NFT2Bw3GlrfOTnIr8HLg0iSngbcCh4HbklwP3Adc161+B3ANcBL4LvDGGdQs9cb+oKHbMhSq6nU/5qVXbLJuATdMWpQ0r+wPGjqvaJYkNYaCJKkxFCRJjaEgSWoMBUlSYyhIkhpDQZLUGAqSpMZQkCQ1W17RLElTcdPTN2l7aOfr0ONypiBJagwFSVJjKEiSGkNBktQYCpKkxrOPZmmzsy3AMy4kzS1nCpKkxplCz1YPHX9
M26lLeihEkjAUJOlHLflFdu4+kiQ1hoIkqTEUJEmNxxTkqbOSGmcKkqTGUJAkNYaCJKkxFCRJjaEgSWoMBUlSYyhIkhpDQZLUePGapLmw2R2DwbsG7zRnCpKkxlCQJDWGgiSpMRQkSY2hIElqDAVJUmMoSJKaia5TSHIK+DbwCPBwVa0l2QV8GFgFTgHXVdU3JytTmn/2Bw3BNGYKv1RVV1bVWvf8EHCiqvYBJ7rn0rKwP2ihzWL30X7gaLd8FHjNDL5DWhT2By2USUOhgH9KcleSg13b7qp6oFv+GrB7szcmOZhkPcn6uXPnJixDmgtj9wdpXkx676NfrKozSX4K+HiSfx99saoqSW32xqo6AhwBWFtb23QdacGM3R+6EDkIcMUVV8y+UunHmGimUFVnur9ngduBq4AHk+wB6P6enbRIaRFM0h+q6khVrVXV2srKyk6VLD3G2KGQ5MlJnnp+GfhV4AvAMeBAt9oB4GOTFinNO/uDhmKS3Ue7gduTnP+cD1bVPyS5E7gtyfXAfcB1k5cpzT37gwZh7FCoqq8CL9yk/b+AV0xSlLRolqo/3PT0Tdoe2vk6NBNe0SxJagwFSVJjKEiSGkNBktQYCpKkZtIrmrXMNjsLBTwTZc6sHjq+afupw9fucCVaBM4UJEmNoSBJagwFSVLjMYXtcN+5pCXhTEGS1DhTkKRpGMg9oZwpSJIaQ0GS1Ax295EX7EjShXOmIElqDAVJUmMoSJIaQ0GS1BgKkqTGUJAkNYaCJKkxFCRJjaEgSWoMBUlSYyhIkhpDQZLUGAqSpGawd0mVpLk3h//q15mCJKlZvpnCQP5lnjQx+4I24UxBktQYCpKkZvl2H6l/c3hwTdIGZwqSpMaZgqSFtnro+Kbtpy7Z4UIGYnFDwTMnpA3ujtMUuftIktTMbKaQ5Grgr4GLgL+pqsOz+q5ltNmU2enyfLIvaJHMJBSSXAS8C3glcBq4M8mxqvriLL5PS2IBd5PYFzQTM+wLs9p9dBVwsqq+WlXfBz4E7J/Rd0nzzL6ghTKr3UeXAfePPD8NvGScD/LMgulyt9OOsy9ooaSqpv+hyW8AV1fV73bPXw+8pKpuHFnnIHCwe/qzwJenXsjOuxT4et9FzNgibeNPV9VKnwVspy907UPrD4v0OxnXom3jtvrDrGYKZ4DLR57v7dqaqjoCHJnR9/ciyXpVrfVdxywtwzZO2ZZ9AYbXH5bhdzLUbZzVMYU7gX1JnpPkScBrgWMz+i5pntkXtFBmMlOoqoeT3Aj8Ixun4b2vqu6ZxXdJ88y+oEUzs+sUquoO4I5Zff6cGsz0/3EswzZOlX1hsAa5jTM50CxJWkze5kKS1BgKkqTGUJAkNYbCFCTZlWRX33VIfbIfDIOhMKYkVyT5UJJzwKeATyc527Wt9lvddCT5nZHlvUlOJPnvJP+a5Hl91qb5YD8YXj8wFMb3YeB24FlVta+qngvsAf6OjZueDcHorRjewcY27wL+Ari5l4o0b+wHA+MpqWNKcm9V7bvQ1xZJks9U1Yu75bur6sqR1z5bVS/qrzrNA/vB8PrB4v47zv7dleTdwFF+eBfMy4EDwGd7q2q69iZ5JxBgJckTq+r/utee2GNdmh/2g4ExFMb3BuB64M/YuD0ybNwW+e+B9/ZV1JT90cjyOvAU4JtJnoX379EG+8HAuPtIktR4oHkGkvxa3zXM2jJsoyazDL+RIW6joTAbv9B3ATtgGbZRk1mG38jgttHdRxNI8nNs/L/d8/tSzwDHqupL/VU1XcuwjZrMMvxGlmEbz3OmMKYkf8zGedgBPt09Atya5FCftU3LMmyjJrMMv5Fl2MZRzhTGlOQrwAtGTk073/4k4J6BnJ89+G3UZJbhN7IM2zjKmcL4fgA8e5P2Pd1rQ7AM26jJLMNvZBm2sfE6hfG9GTiR5F5+eNHOFcBz+dHL4hfZMmyjJrMMv5Fl2MbG3UcTSPIE4Cp+9ODTnVX1SH9VTdcybKMmswy
/kWXYxvMMBUlS4zEFSVJjKEiSGkNBktQYCpKkxlCQJDX/D3+8Ru08viu4AAAAAElFTkSuQmCC\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "df[['a1', 'a2']].hist(by=df.y2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There is not much difference between separated distributions as the data was randomly generated. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can do the same for the line plot." ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([,\n", " ],\n", " dtype=object)" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "df[['a1', 'a2']].plot(by=df.y2, subplots=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 4. Dummy variables" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Some Machine Learning algorithms don't work with multivariate attributes, like a3 column in our example.\n", "a3 column has 5 distinct values (0, 1, 2, 3, 4 and 5). \n", "To transform a multivariate attribute to multiple binary attributes, we can binarize the column, so that we get 5 attributes with 0 and 1 values.\n", "\n", "Let's look at the example below. \n", "The first three rows of a3 column have value 2. \n", "So a3_2 attribute has the first three rows marked with 1 and all other attributes are 0.\n", "The fourth row in a3 has a value 3, so a3_3 is 1 and all others are 0, etc." ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 2\n", "1 2\n", "2 2\n", "3 3\n", "4 4\n", "Name: a3, dtype: int64" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.a3.head()" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr><th></th><th>a3__0</th><th>a3__1</th><th>a3__2</th><th>a3__3</th><th>a3__4</th></tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr><th>0</th><td>0</td><td>0</td><td>1</td><td>0</td><td>0</td></tr>\n",
" <tr><th>1</th><td>0</td><td>0</td><td>1</td><td>0</td><td>0</td></tr>\n",
" <tr><th>2</th><td>0</td><td>0</td><td>1</td><td>0</td><td>0</td></tr>\n",
" <tr><th>3</th><td>0</td><td>0</td><td>0</td><td>1</td><td>0</td></tr>\n",
" <tr><th>4</th><td>0</td><td>0</td><td>0</td><td>0</td><td>1</td></tr>\n",
" </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " a3__0 a3__1 a3__2 a3__3 a3__4\n", "0 0 0 1 0 0\n", "1 0 0 1 0 0\n", "2 0 0 1 0 0\n", "3 0 0 0 1 0\n", "4 0 0 0 0 1" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_a4_dummy = pd.get_dummies(df.a3, prefix='a3_')\n", "df_a4_dummy.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`get_dummies` function also enables us to drop the first column, so that we don't store redundant information.\n", "Eg. when a3_1, a3_2, a3_3, a3_4 are all 0 we can assume that a3_0 should be 1 and we don't need to store it." ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr><th></th><th>a3__1</th><th>a3__2</th><th>a3__3</th><th>a3__4</th></tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr><th>0</th><td>0</td><td>1</td><td>0</td><td>0</td></tr>\n",
" <tr><th>1</th><td>0</td><td>1</td><td>0</td><td>0</td></tr>\n",
" <tr><th>2</th><td>0</td><td>1</td><td>0</td><td>0</td></tr>\n",
" <tr><th>3</th><td>0</td><td>0</td><td>1</td><td>0</td></tr>\n",
" <tr><th>4</th><td>0</td><td>0</td><td>0</td><td>1</td></tr>\n",
" </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " a3__1 a3__2 a3__3 a3__4\n", "0 0 1 0 0\n", "1 0 1 0 0\n", "2 0 1 0 0\n", "3 0 0 1 0\n", "4 0 0 0 1" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.get_dummies(df.a3, prefix='a3_', drop_first=True).head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that we have binarized the a3 column, let's remove it from the DataFrame and add binarized attributes to it." ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [], "source": [ "df = df.drop('a3', axis=1)" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(1000, 10)" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.concat([df, df_a4_dummy], axis=1)\n", "df.shape" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr><th></th><th>index</th><th>a1</th><th>a2</th><th>y1</th><th>y2</th><th>a3__0</th><th>a3__1</th><th>a3__2</th><th>a3__3</th><th>a3__4</th></tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr><th>0</th><td>0</td><td>0.049671</td><td>0.479871</td><td>1.000000</td><td>1</td><td>0</td><td>0</td><td>1</td><td>0</td><td>0</td></tr>\n",
" <tr><th>1</th><td>1</td><td>-0.013826</td><td>0.384927</td><td>1.002308</td><td>0</td><td>0</td><td>0</td><td>1</td><td>0</td><td>0</td></tr>\n",
" <tr><th>2</th><td>2</td><td>0.064769</td><td>0.211926</td><td>1.004620</td><td>0</td><td>0</td><td>0</td><td>1</td><td>0</td><td>0</td></tr>\n",
" <tr><th>3</th><td>3</td><td>0.152303</td><td>0.070613</td><td>1.006939</td><td>0</td><td>0</td><td>0</td><td>0</td><td>1</td><td>0</td></tr>\n",
" <tr><th>4</th><td>4</td><td>-0.023415</td><td>0.339645</td><td>1.009262</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>1</td></tr>\n",
" </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " index a1 a2 y1 y2 a3__0 a3__1 a3__2 a3__3 a3__4\n", "0 0 0.049671 0.479871 1.000000 1 0 0 1 0 0\n", "1 1 -0.013826 0.384927 1.002308 0 0 0 1 0 0\n", "2 2 0.064769 0.211926 1.004620 0 0 0 1 0 0\n", "3 3 0.152303 0.070613 1.006939 0 0 0 0 1 0\n", "4 4 -0.023415 0.339645 1.009262 0 0 0 0 0 1" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 5. Fitting lines" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Sometimes we would like to compare a certain distribution with a linear line.\n", "Eg. To determine if monthly sales growth is higher than linear.\n", "When we observe that our data is linear, we can predict future values.\n", "\n", "Pandas (with the help of numpy) enables us to fit a linear line to our data.\n", "This is a Linear Regression algorithm in Machine Learning, which tries to make the vertical distance between the line and the data points as small as possible. \n", "This is called “fitting the line to the data.” " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The plot below shows the y1 column. \n", "Let's draw a linear line that closely matches data points of the y1 column." ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "df.plot.scatter(x='index', y='y1', s=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The code below calculates the least-squares solution to a linear equation. \n", "The output of the function that we are interested in is the least-squares solution." ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [], "source": [ "df['ones'] = pd.np.ones(len(df))\n", "m, c = pd.np.linalg.lstsq(df[['index', 'ones']], df['y1'], rcond=None)[0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Equation for a line is `y = m * x + c`. \n", "Let's use the equation and calculate the values for the line `y` that closely fits the `y1` line." ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [], "source": [ "df['y'] = df['index'].apply(lambda x: x * m + c)" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "df[['y', 'y1']].plot()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Conclusion\n", "\n", "In this part, I've shown a few tricks that help me to be more productive when working on Exploratory Data Analysis.\n", "\n", "Have you learned any way to make visualizations with pandas? Let me know in the comments below." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.3" } }, "nbformat": 4, "nbformat_minor": 4 }