|
| 1 | +.. _compare_with_spreadsheets: |
| 2 | + |
| 3 | +{{ header }} |
| 4 | + |
| 5 | +Comparison with spreadsheets |
| 6 | +**************************** |
| 7 | + |
| 8 | +Since many potential pandas users have some familiarity with spreadsheet programs like |
| 9 | +`Excel <https://support.microsoft.com/en-us/excel>`_, this page is meant to provide some examples |
| 10 | +of how various spreadsheet operations would be performed using pandas. This page will use |
| 11 | +terminology and link to documentation for Excel, but much will be the same/similar in |
| 12 | +`Google Sheets <https://support.google.com/a/users/answer/9282959>`_, |
| 13 | +`LibreOffice Calc <https://help.libreoffice.org/latest/en-US/text/scalc/main0000.html?DbPAR=CALC>`_, |
| 14 | +`Apple Numbers <https://www.apple.com/mac/numbers/compatibility/functions.html>`_, and other |
| 15 | +Excel-compatible spreadsheet software. |
| 16 | + |
| 17 | +.. include:: comparison_boilerplate.rst |
| 18 | + |
| 19 | +Data structures |
| 20 | +--------------- |
| 21 | + |
| 22 | +General terminology translation |
| 23 | +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| 24 | + |
| 25 | +.. csv-table:: |
| 26 | + :header: "pandas", "Excel" |
| 27 | + :widths: 20, 20 |
| 28 | + |
| 29 | + ``DataFrame``, worksheet |
| 30 | + ``Series``, column |
| 31 | + ``Index``, row headings |
| 32 | + row, row |
| 33 | + ``NaN``, empty cell |
| 34 | + |
| 35 | +``DataFrame`` |
| 36 | +~~~~~~~~~~~~~ |
| 37 | + |
| 38 | +A ``DataFrame`` in pandas is analogous to an Excel worksheet. While an Excel workbook can contain |
| 39 | +multiple worksheets, pandas ``DataFrame``\s exist independently. |
| 40 | + |
| 41 | +``Series`` |
| 42 | +~~~~~~~~~~ |
| 43 | + |
| 44 | +A ``Series`` is the data structure that represents one column of a ``DataFrame``. Working with a |
| 45 | +``Series`` is analogous to referencing a column of a spreadsheet. |
| 46 | + |
| 47 | +``Index`` |
| 48 | +~~~~~~~~~ |
| 49 | + |
| 50 | +Every ``DataFrame`` and ``Series`` has an ``Index``, which holds the labels on the *rows* of the data. In |
| 51 | +pandas, if no index is specified, a :class:`~pandas.RangeIndex` is used by default (first row = 0, |
| 52 | +second row = 1, and so on), analogous to row headings/numbers in spreadsheets. |
| 53 | + |
| 54 | +In pandas, indexes can be set to one (or multiple) unique values, which is like having a column that |
| 55 | +you use as the row identifier in a worksheet. Unlike spreadsheets, these ``Index`` values can actually be |
| 56 | +used to reference the rows. For example, in spreadsheets, you would reference the first row as ``A1:Z1``, |
| 57 | +while in pandas you could use ``populations.loc['Chicago']``. |
| 58 | + |
| 59 | +Index values are also persistent, so if you re-order the rows in a ``DataFrame``, the label for a |
| 60 | +particular row doesn't change. |
| 61 | + |
| 62 | +See the :ref:`indexing documentation<indexing>` for much more on how to use an ``Index`` |
| 63 | +effectively. |
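
As a minimal sketch of label-based lookup, the snippet below builds a small ``populations`` frame. The city names and figures are hypothetical values chosen only for illustration; they are not taken from any dataset used on this page.

```python
import pandas as pd

# Hypothetical data: the city names and figures are illustrative only.
populations = pd.DataFrame(
    {"population": [2_700_000, 8_300_000, 3_900_000]},
    index=["Chicago", "New York", "Los Angeles"],
)

# Look up a row by its label rather than by its position.
chicago = populations.loc["Chicago"]
print(chicago["population"])

# Sorting re-orders the rows, but each label stays attached to its row.
by_size = populations.sort_values("population")
print(by_size.loc["Chicago", "population"])
```

The same label returns the same row before and after sorting, which is the persistence described above.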
| 64 | + |
| 65 | +Commonly used spreadsheet functionalities |
| 66 | +----------------------------------------- |
| 67 | + |
| 68 | +Importing data |
| 69 | +~~~~~~~~~~~~~~ |
| 70 | + |
| 71 | +Both `Excel <https://support.microsoft.com/en-us/office/import-data-from-external-data-sources-power-query-be4330b3-5356-486c-a168-b68e9e616f5a>`__ |
| 72 | +and :ref:`pandas <10min_tut_02_read_write>` can import data from various sources in various |
| 73 | +formats. |
| 74 | + |
| 75 | +Excel files |
| 76 | +''''''''''' |
| 77 | + |
| 78 | +Excel opens `various Excel file formats <https://support.microsoft.com/en-us/office/file-formats-that-are-supported-in-excel-0943ff2c-6014-4e8d-aaea-b83d51d46247>`_ |
| 79 | +by double-clicking them, or using `the Open menu <https://support.microsoft.com/en-us/office/open-files-from-the-file-menu-97f087d8-3136-4485-8e86-c5b12a8c4176>`_. |
| 80 | +In pandas, you use :ref:`special methods for reading and writing from/to Excel files <io.excel>`. |
| 81 | + |
| 82 | +CSV |
| 83 | +''' |
| 84 | + |
| 85 | +Let's load and display the `tips <https://github.com/pandas-dev/pandas/blob/master/pandas/tests/io/data/csv/tips.csv>`_ |
| 86 | +dataset from the pandas tests, which is a CSV file. In Excel, you would download and then |
| 87 | +`open the CSV <https://support.microsoft.com/en-us/office/import-or-export-text-txt-or-csv-files-5250ac4c-663c-47ce-937b-339e391393ba>`_. |
| 88 | +In pandas, you pass the URL or local path of the CSV file to :func:`~pandas.read_csv`: |
| 89 | + |
| 90 | +.. ipython:: python |
| 91 | +
|
| 92 | + url = ( |
| 93 | + "https://raw.github.com/pandas-dev" |
| 94 | + "/pandas/master/pandas/tests/io/data/csv/tips.csv" |
| 95 | + ) |
| 96 | + tips = pd.read_csv(url) |
| 97 | + tips |
| 98 | +
|
| 99 | +Fill Handle |
| 100 | +~~~~~~~~~~~ |
| 101 | + |
| 102 | +Create a series of numbers following a set pattern in a certain set of cells. In |
| 103 | +a spreadsheet, this would be done by shift+drag after entering the first number or by |
| 104 | +entering the first two or three values and then dragging. |
| 105 | + |
| 106 | +This can be achieved by creating a series and assigning it to the desired cells. |
| 107 | + |
| 108 | +.. ipython:: python |
| 109 | +
|
| 110 | + df = pd.DataFrame({"AAA": [1] * 8, "BBB": list(range(0, 8))}) |
| 111 | + df |
| 112 | +
|
| 113 | + series = list(range(1, 5)) |
| 114 | + series |
| 115 | +
|
| 116 | + df.loc[2:5, "AAA"] = series |
| 117 | +
|
| 118 | + df |
| 119 | +
|
| 120 | +Filters |
| 121 | +~~~~~~~ |
| 122 | + |
| 123 | +Filters can be achieved with boolean indexing: you select the rows for which a condition is true. |
| 124 | + |
| 125 | +The examples filter by 0 on column AAA, and also show how to filter by multiple |
| 126 | +values. |
| 127 | + |
| 128 | +.. ipython:: python |
| 129 | +
|
| 130 | + df[df.AAA == 0] |
| 131 | +
|
| 132 | + df[(df.AAA == 0) | (df.AAA == 2)] |
| 133 | +
|
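
When filtering by several values, chaining ``|`` conditions can get long. One possible alternative is :meth:`~Series.isin`. The sketch below uses a stand-alone frame whose values are assumed for illustration.

```python
import pandas as pd

# Stand-alone illustrative frame; the values are assumptions for this sketch.
df = pd.DataFrame({"AAA": [1, 1, 1, 2, 3, 4, 1, 1], "BBB": list(range(8))})

# Keep the rows whose AAA value is any of the listed values.
wanted = df[df["AAA"].isin([2, 4])]
print(wanted)
```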
| 134 | +
|
| 135 | +Drop Duplicates |
| 136 | +~~~~~~~~~~~~~~~ |
| 137 | + |
| 138 | +Excel has built-in functionality for `removing duplicate values <https://support.microsoft.com/en-us/office/find-and-remove-duplicates-00e35bea-b46a-4d5d-b28e-66a552dc138d>`_. |
| 139 | +This is supported in pandas via :meth:`~DataFrame.drop_duplicates`. |
| 140 | + |
| 141 | +.. ipython:: python |
| 142 | +
|
| 143 | + df = pd.DataFrame( |
| 144 | + { |
| 145 | + "class": ["A", "A", "A", "B", "C", "D"], |
| 146 | + "student_count": [42, 35, 42, 50, 47, 45], |
| 147 | + "all_pass": ["Yes", "Yes", "Yes", "No", "No", "Yes"], |
| 148 | + } |
| 149 | + ) |
| 150 | +
|
| 151 | + df.drop_duplicates() |
| 152 | +
|
| 153 | + df.drop_duplicates(["class", "student_count"]) |
| 154 | +
|
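
By default the first occurrence of each duplicate is kept. The sketch below, on a smaller frame with assumed values, shows the ``keep`` parameter of ``drop_duplicates`` keeping the last occurrence instead, or dropping every duplicated row.

```python
import pandas as pd

# Small illustrative frame; rows 0 and 2 are exact duplicates.
classes = pd.DataFrame(
    {
        "class": ["A", "A", "A", "B"],
        "student_count": [42, 35, 42, 50],
    }
)

# Keep the last occurrence of each duplicated row instead of the first.
keep_last = classes.drop_duplicates(keep="last")

# Drop every row that has a duplicate anywhere in the frame.
keep_none = classes.drop_duplicates(keep=False)
print(keep_last)
print(keep_none)
```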
| 155 | +
|
| 156 | +Pivot Tables |
| 157 | +~~~~~~~~~~~~ |
| 158 | + |
| 159 | +`PivotTables <https://support.microsoft.com/en-us/office/create-a-pivottable-to-analyze-worksheet-data-a9a84538-bfe9-40a9-a8e9-f99134456576>`_ |
| 160 | +from spreadsheets can be replicated in pandas through :ref:`reshaping`. Using the ``tips`` dataset again, |
| 161 | +let's find the average gratuity by size of the party and sex of the server. |
| 162 | + |
| 163 | +In Excel, we use the following configuration for the PivotTable: |
| 164 | + |
| 165 | +.. image:: ../../_static/excel_pivot.png |
| 166 | + :align: center |
| 167 | + |
| 168 | +The equivalent in pandas: |
| 169 | + |
| 170 | +.. ipython:: python |
| 171 | +
|
| 172 | + pd.pivot_table( |
| 173 | + tips, values="tip", index=["size"], columns=["sex"], aggfunc=np.average |
| 174 | + ) |
| 175 | +
|
| 176 | +Formulas |
| 177 | +~~~~~~~~ |
| 178 | + |
| 179 | +In spreadsheets, `formulas <https://support.microsoft.com/en-us/office/overview-of-formulas-in-excel-ecfdc708-9162-49e8-b993-c311f47ca173>`_ |
| 180 | +are often created in individual cells and then `dragged <https://support.microsoft.com/en-us/office/copy-a-formula-by-dragging-the-fill-handle-in-excel-for-mac-dd928259-622b-473f-9a33-83aa1a63e218>`_ |
| 181 | +into other cells to compute them for other columns. In pandas, you'll be doing more operations on |
| 182 | +full columns. |
| 183 | + |
| 184 | +As an example, let's create a new column "girls_count" and try to compute the number of boys in |
| 185 | +each class. |
| 186 | + |
| 187 | +.. ipython:: python |
| 188 | +
|
| 189 | + df["girls_count"] = [21, 12, 21, 31, 23, 17] |
| 190 | + df |
| 191 | + df["boys_count"] = df["student_count"] - df["girls_count"] |
| 192 | + df |
| 193 | +
|
| 194 | +Note that we don't have to tell pandas to do that subtraction cell by cell; pandas handles that for |
| 195 | +us. See :ref:`how to create new columns derived from existing columns <10min_tut_05_columns>`. |
| 196 | + |
| 197 | +VLOOKUP |
| 198 | +~~~~~~~ |
| 199 | + |
| 200 | +.. ipython:: python |
| 201 | +
|
| 202 | + import random |
| 203 | +
|
| 204 | + first_names = [ |
| 205 | + "harry", |
| 206 | + "ron", |
| 207 | + "hermione", |
| 208 | + "rubeus", |
| 209 | + "albus", |
| 210 | + "severus", |
| 211 | + "luna", |
| 212 | + ] |
| 213 | + keys = [1, 2, 3, 4, 5, 6, 7] |
| 214 | + df1 = pd.DataFrame({"keys": keys, "first_names": first_names}) |
| 215 | + df1 |
| 216 | +
|
| 217 | + surnames = [ |
| 218 | + "hagrid", |
| 219 | + "malfoy", |
| 220 | + "lovegood", |
| 221 | + "dumbledore", |
| 222 | + "grindelwald", |
| 223 | + "granger", |
| 224 | + "weasley", |
| 225 | + "riddle", |
| 226 | + "longbottom", |
| 227 | + "snape", |
| 228 | + ] |
| 229 | + keys = [random.randint(1, 7) for x in range(0, 10)] |
| 230 | + random_names = pd.DataFrame({"surnames": surnames, "keys": keys}) |
| 231 | +
|
| 232 | + random_names |
| 233 | +
|
| 234 | + random_names.merge(df1, on="keys", how="left") |
| 235 | +
|
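
Because the keys above are drawn at random, the merged output changes from run to run. The sketch below is a deterministic variant with fixed, hypothetical keys and names. It shows how a left merge plays the role of ``VLOOKUP``: every row of the left table is kept, and a key with no match yields ``NaN``, much as ``VLOOKUP`` would return ``#N/A``.

```python
import pandas as pd

# Lookup table, playing the role of the VLOOKUP range.
lookup = pd.DataFrame(
    {"keys": [1, 2, 3], "first_names": ["harry", "ron", "hermione"]}
)

# Fixed (not random) keys for illustration; key 9 has no match in `lookup`.
people = pd.DataFrame(
    {"surnames": ["potter", "granger", "lovegood"], "keys": [1, 3, 9]}
)

# how="left" keeps every row of `people`, like filling VLOOKUP down a column.
merged = people.merge(lookup, on="keys", how="left")
print(merged)
```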
| 236 | +Adding a row |
| 237 | +~~~~~~~~~~~~ |
| 238 | + |
| 239 | +To append a row, we can just assign values to a new index label using :meth:`~DataFrame.loc`. |
| 240 | + |
| 241 | +NOTE: If the index label already exists, the values in that row will be overwritten. |
| 242 | + |
| 243 | +.. ipython:: python |
| 244 | +
|
| 245 | + df1.loc[7] = [8, "tonks"] |
| 246 | + df1 |
| 247 | +
|
| 248 | +
|
| 249 | +Search and Replace |
| 250 | +~~~~~~~~~~~~~~~~~~ |
| 251 | + |
| 252 | +The ``replace`` method of the ``DataFrame`` object can perform |
| 253 | +this function. Please see `pandas.DataFrame.replace <https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.replace.html>`__ for examples. |
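
A brief sketch of two common forms follows, using a small frame with assumed values: replacing a value everywhere in the frame, and restricting the replacement to a single column with a nested mapping.

```python
import pandas as pd

# Illustrative frame; the column names and values are assumptions.
df = pd.DataFrame(
    {"all_pass": ["Yes", "No", "Yes"], "note": ["Yes", "n/a", "No"]}
)

# Replace a value wherever it appears in the frame.
everywhere = df.replace("Yes", "Y")

# Restrict the replacement to one column with a nested mapping:
# {column: {old_value: new_value}}.
one_column = df.replace({"all_pass": {"No": "N"}})
print(everywhere)
print(one_column)
```

Both calls return a new ``DataFrame`` and leave ``df`` unchanged.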