Commit 26a679a

DOC: add Comparison with Excel (#38554)
1 parent fc2cc7c commit 26a679a

10 files changed

+323
-36
lines changed

doc/source/_static/excel_pivot.png

156 KB

doc/source/_static/logo_excel.svg

Lines changed: 27 additions & 0 deletions
Lines changed: 9 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,9 @@
1+
If you're new to pandas, you might want to first read through :ref:`10 Minutes to pandas<10min>`
2+
to familiarize yourself with the library.
3+
4+
As is customary, we import pandas and NumPy as follows:
5+
6+
.. ipython:: python
7+
8+
import pandas as pd
9+
import numpy as np

doc/source/getting_started/comparison/comparison_with_sas.rst

Lines changed: 6 additions & 12 deletions
Original file line numberDiff line numberDiff line change
@@ -8,16 +8,7 @@ For potential users coming from `SAS <https://en.wikipedia.org/wiki/SAS_(softwar
88
this page is meant to demonstrate how different SAS operations would be
99
performed in pandas.
1010

11-
If you're new to pandas, you might want to first read through :ref:`10 Minutes to pandas<10min>`
12-
to familiarize yourself with the library.
13-
14-
As is customary, we import pandas and NumPy as follows:
15-
16-
.. ipython:: python
17-
18-
import pandas as pd
19-
import numpy as np
20-
11+
.. include:: comparison_boilerplate.rst
2112

2213
.. note::
2314

@@ -48,14 +39,17 @@ General terminology translation
4839
``NaN``, ``.``
4940

5041

51-
``DataFrame`` / ``Series``
52-
~~~~~~~~~~~~~~~~~~~~~~~~~~
42+
``DataFrame``
43+
~~~~~~~~~~~~~
5344

5445
A ``DataFrame`` in pandas is analogous to a SAS data set - a two-dimensional
5546
data source with labeled columns that can be of different types. As will be
5647
shown in this document, almost any operation that can be applied to a data set
5748
using SAS's ``DATA`` step, can also be accomplished in pandas.
5849

50+
``Series``
51+
~~~~~~~~~~
52+
5953
A ``Series`` is the data structure that represents one column of a
6054
``DataFrame``. SAS doesn't have a separate data structure for a single column,
6155
but in general, working with a ``Series`` is analogous to referencing a column
Lines changed: 253 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,253 @@
1+
.. _compare_with_spreadsheets:
2+
3+
{{ header }}
4+
5+
Comparison with spreadsheets
6+
****************************
7+
8+
Since many potential pandas users have some familiarity with spreadsheet programs like
9+
`Excel <https://support.microsoft.com/en-us/excel>`_, this page is meant to provide some examples
10+
of how various spreadsheet operations would be performed using pandas. This page will use
11+
terminology and link to documentation for Excel, but much will be the same/similar in
12+
`Google Sheets <https://support.google.com/a/users/answer/9282959>`_,
13+
`LibreOffice Calc <https://help.libreoffice.org/latest/en-US/text/scalc/main0000.html?DbPAR=CALC>`_,
14+
`Apple Numbers <https://www.apple.com/mac/numbers/compatibility/functions.html>`_, and other
15+
Excel-compatible spreadsheet software.
16+
17+
.. include:: comparison_boilerplate.rst
18+
19+
Data structures
20+
---------------
21+
22+
General terminology translation
23+
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
24+
25+
.. csv-table::
26+
:header: "pandas", "Excel"
27+
:widths: 20, 20
28+
29+
``DataFrame``, worksheet
30+
``Series``, column
31+
``Index``, row headings
32+
row, row
33+
``NaN``, empty cell
34+
35+
``DataFrame``
36+
~~~~~~~~~~~~~
37+
38+
A ``DataFrame`` in pandas is analogous to an Excel worksheet. While an Excel workbook can contain
39+
multiple worksheets, pandas ``DataFrame``\s exist independently.
40+
41+
``Series``
42+
~~~~~~~~~~
43+
44+
A ``Series`` is the data structure that represents one column of a ``DataFrame``. Working with a
45+
``Series`` is analogous to referencing a column of a spreadsheet.
46+
47+
``Index``
48+
~~~~~~~~~
49+
50+
Every ``DataFrame`` and ``Series`` has an ``Index``, which holds the labels on the *rows* of the data. In
51+
pandas, if no index is specified, a :class:`~pandas.RangeIndex` is used by default (first row = 0,
52+
second row = 1, and so on), analogous to row headings/numbers in spreadsheets.
53+
54+
In pandas, indexes can be set to one (or multiple) unique values, which is like having a column that
55+
you use as the row identifier in a worksheet. Unlike spreadsheets, these ``Index`` values can actually be
56+
used to reference the rows. For example, in spreadsheets, you would reference the first row as ``A1:Z1``,
57+
while in pandas you could use ``populations.loc['Chicago']``.
58+
59+
Index values are also persistent, so if you re-order the rows in a ``DataFrame``, the label for a
60+
particular row doesn't change.
61+
62+
See the :ref:`indexing documentation<indexing>` for much more on how to use an ``Index``
63+
effectively.
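
As a minimal sketch of label-based row access, assume a hypothetical ``populations`` frame indexed by city name. The frame and its figures are invented for illustration and are not part of the original page:

```python
import pandas as pd

# Hypothetical data; the figures are illustrative only.
populations = pd.DataFrame(
    {
        "state": ["IL", "TX", "AZ"],
        "population": [2_700_000, 2_300_000, 1_600_000],
    },
    index=["Chicago", "Houston", "Phoenix"],
)

# Look up a row by its label rather than by its position.
chicago = populations.loc["Chicago"]

# Re-ordering the rows does not change which label belongs to which row.
reordered = populations.sort_values("population")
same_row = reordered.loc["Chicago"]
```

Here ``same_row`` holds the same values as ``chicago`` even though the row has moved to a different position.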
64+
65+
Commonly used spreadsheet functionalities
66+
-----------------------------------------
67+
68+
Importing data
69+
~~~~~~~~~~~~~~
70+
71+
Both `Excel <https://support.microsoft.com/en-us/office/import-data-from-external-data-sources-power-query-be4330b3-5356-486c-a168-b68e9e616f5a>`__
72+
and :ref:`pandas <10min_tut_02_read_write>` can import data from various sources in various
73+
formats.
74+
75+
Excel files
76+
'''''''''''
77+
78+
Excel opens `various Excel file formats <https://support.microsoft.com/en-us/office/file-formats-that-are-supported-in-excel-0943ff2c-6014-4e8d-aaea-b83d51d46247>`_
79+
by double-clicking them, or using `the Open menu <https://support.microsoft.com/en-us/office/open-files-from-the-file-menu-97f087d8-3136-4485-8e86-c5b12a8c4176>`_.
80+
In pandas, you use :ref:`special methods for reading and writing from/to Excel files <io.excel>`.
81+
82+
CSV
83+
'''
84+
85+
Let's load and display the `tips <https://github.com/pandas-dev/pandas/blob/master/pandas/tests/io/data/csv/tips.csv>`_
86+
dataset from the pandas tests, which is a CSV file. In Excel, you would download and then
87+
`open the CSV <https://support.microsoft.com/en-us/office/import-or-export-text-txt-or-csv-files-5250ac4c-663c-47ce-937b-339e391393ba>`_.
88+
In pandas, you pass the URL or local path of the CSV file to :func:`~pandas.read_csv`:
89+
90+
.. ipython:: python
91+
92+
url = (
93+
"https://raw.github.com/pandas-dev"
94+
"/pandas/master/pandas/tests/io/data/csv/tips.csv"
95+
)
96+
tips = pd.read_csv(url)
97+
tips
98+
99+
Fill Handle
100+
~~~~~~~~~~~
101+
102+
Create a series of numbers following a set pattern in a certain set of cells. In
103+
a spreadsheet, this would be done by shift+drag after entering the first number or by
104+
entering the first two or three values and then dragging.
105+
106+
In pandas, this can be achieved by creating a sequence of values and assigning it to the desired cells.
107+
108+
.. ipython:: python
109+
110+
df = pd.DataFrame({"AAA": [1] * 8, "BBB": list(range(0, 8))})
111+
df
112+
113+
series = list(range(1, 5))
114+
series
115+
116+
df.loc[2:5, "AAA"] = series
117+
118+
df
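
When the pattern has a step other than 1, one option is to generate the values with ``numpy.arange``. The following is a sketch under the assumption that the target rows are labelled 2 through 5; the start value and step are arbitrary:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"AAA": [1] * 8, "BBB": list(range(0, 8))})

# .loc slices by label and includes both endpoints, so 2:5 selects four rows.
# np.arange(10, 30, 5) produces the four values 10, 15, 20, 25.
df.loc[2:5, "AAA"] = np.arange(10, 30, 5)
```

The number of generated values has to match the number of selected rows, otherwise pandas raises an error.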
119+
120+
Filters
121+
~~~~~~~
122+
123+
Filters can be achieved by using boolean indexing, that is, selecting rows with a boolean mask.
124+
125+
The examples filter by 0 on column AAA, and also show how to filter by multiple
126+
values.
127+
128+
.. ipython:: python
129+
130+
df[df.AAA == 0]
131+
132+
df[(df.AAA == 0) | (df.AAA == 2)]
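
When filtering on several values, chaining comparisons with ``|`` becomes verbose. A possible alternative is ``Series.isin``, sketched here on a small frame whose values are made up for illustration:

```python
import pandas as pd

# Illustrative data, not the frame used elsewhere on this page.
df = pd.DataFrame({"AAA": [0, 1, 2, 0, 3], "BBB": list(range(5))})

# Build a boolean mask that is True where AAA is one of the listed values.
mask = df["AAA"].isin([0, 2])

# Keep only the rows where the mask is True.
filtered = df[mask]
```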
133+
134+
135+
Drop Duplicates
136+
~~~~~~~~~~~~~~~
137+
138+
Excel has built-in functionality for `removing duplicate values <https://support.microsoft.com/en-us/office/find-and-remove-duplicates-00e35bea-b46a-4d5d-b28e-66a552dc138d>`_.
139+
This is supported in pandas via :meth:`~DataFrame.drop_duplicates`.
140+
141+
.. ipython:: python
142+
143+
df = pd.DataFrame(
144+
{
145+
"class": ["A", "A", "A", "B", "C", "D"],
146+
"student_count": [42, 35, 42, 50, 47, 45],
147+
"all_pass": ["Yes", "Yes", "Yes", "No", "No", "Yes"],
148+
}
149+
)
150+
151+
df.drop_duplicates()
152+
153+
df.drop_duplicates(["class", "student_count"])
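
To inspect which rows would be dropped before removing them, ``DataFrame.duplicated`` returns a boolean mask, and the ``keep`` parameter of ``drop_duplicates`` controls which occurrence survives. A short sketch on a reduced version of the frame above:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "class": ["A", "A", "A", "B"],
        "student_count": [42, 35, 42, 50],
    }
)

# True marks a row that repeats an earlier row in every column.
dupes = df.duplicated()

# Keep the last occurrence of each duplicated row instead of the first.
last_kept = df.drop_duplicates(keep="last")
```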
154+
155+
156+
Pivot Tables
157+
~~~~~~~~~~~~
158+
159+
`PivotTables <https://support.microsoft.com/en-us/office/create-a-pivottable-to-analyze-worksheet-data-a9a84538-bfe9-40a9-a8e9-f99134456576>`_
160+
from spreadsheets can be replicated in pandas through :ref:`reshaping`. Using the ``tips`` dataset again,
161+
let's find the average gratuity by size of the party and sex of the server.
162+
163+
In Excel, we use the following configuration for the PivotTable:
164+
165+
.. image:: ../../_static/excel_pivot.png
166+
:align: center
167+
168+
The equivalent in pandas:
169+
170+
.. ipython:: python
171+
172+
pd.pivot_table(
173+
tips, values="tip", index=["size"], columns=["sex"], aggfunc=np.average
174+
)
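
The same table can also be built with ``groupby`` followed by ``unstack``. The sketch below uses a small made-up sample with the same column names as ``tips`` so that it runs without downloading the dataset:

```python
import pandas as pd

# Made-up sample with the same column names as the tips dataset.
sample = pd.DataFrame(
    {
        "sex": ["Female", "Male", "Male", "Female"],
        "size": [2, 2, 3, 2],
        "tip": [1.0, 2.0, 3.0, 3.0],
    }
)

# Pivot table: one row per party size, one column per sex, mean tip in each cell.
pivoted = pd.pivot_table(
    sample, values="tip", index="size", columns="sex", aggfunc="mean"
)

# Equivalent result: group by both keys, average, then move "sex" to the columns.
grouped = sample.groupby(["size", "sex"])["tip"].mean().unstack("sex")
```

Combinations that do not occur in the data, such as a party of three with a female server in this sample, appear as ``NaN``.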
175+
176+
Formulas
177+
~~~~~~~~
178+
179+
In spreadsheets, `formulas <https://support.microsoft.com/en-us/office/overview-of-formulas-in-excel-ecfdc708-9162-49e8-b993-c311f47ca173>`_
180+
are often created in individual cells and then `dragged <https://support.microsoft.com/en-us/office/copy-a-formula-by-dragging-the-fill-handle-in-excel-for-mac-dd928259-622b-473f-9a33-83aa1a63e218>`_
181+
into other cells to compute them for other columns. In pandas, you'll be doing more operations on
182+
full columns.
183+
184+
As an example, let's create a new column "girls_count" and try to compute the number of boys in
185+
each class.
186+
187+
.. ipython:: python
188+
189+
df["girls_count"] = [21, 12, 21, 31, 23, 17]
190+
df
191+
df["boys_count"] = df["student_count"] - df["girls_count"]
192+
df
193+
194+
Note that we don't have to tell pandas to do the subtraction cell by cell; pandas handles that for
195+
us. See :ref:`how to create new columns derived from existing columns <10min_tut_05_columns>`.
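
Several derived columns can be created in one step with ``DataFrame.assign``. This is a sketch on a reduced frame; the ``girls_share`` column is a hypothetical addition for illustration:

```python
import pandas as pd

df = pd.DataFrame(
    {"student_count": [42, 35, 50], "girls_count": [21, 12, 31]}
)

# Each keyword becomes a new column; the lambda receives the whole frame,
# so every expression is evaluated for all rows at once.
df = df.assign(
    boys_count=lambda d: d["student_count"] - d["girls_count"],
    girls_share=lambda d: d["girls_count"] / d["student_count"],
)
```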
196+
197+
VLOOKUP
198+
~~~~~~~
199+
200+
.. ipython:: python
201+
202+
import random
203+
204+
first_names = [
205+
"harry",
206+
"ron",
207+
"hermione",
208+
"rubeus",
209+
"albus",
210+
"severus",
211+
"luna",
212+
]
213+
keys = [1, 2, 3, 4, 5, 6, 7]
214+
df1 = pd.DataFrame({"keys": keys, "first_names": first_names})
215+
df1
216+
217+
surnames = [
218+
"hagrid",
219+
"malfoy",
220+
"lovegood",
221+
"dumbledore",
222+
"grindelwald",
223+
"granger",
224+
"weasley",
225+
"riddle",
226+
"longbottom",
227+
"snape",
228+
]
229+
keys = [random.randint(1, 7) for x in range(0, 10)]
230+
random_names = pd.DataFrame({"surnames": surnames, "keys": keys})
231+
232+
random_names
233+
234+
random_names.merge(df1, on="keys", how="left")
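
The left merge above plays the role of ``VLOOKUP``: for each key in the left table, the matching value is pulled from the lookup table. Because the example uses random keys, its output varies between runs. The sketch below uses fixed, made-up data to show the behaviour deterministically, along with ``Series.map`` as a possible alternative for a single lookup column:

```python
import pandas as pd

# Lookup table: each key appears exactly once.
lookup = pd.DataFrame(
    {"keys": [1, 2, 3], "first_names": ["harry", "ron", "hermione"]}
)

# Table to enrich; key 9 has no match in the lookup table.
data = pd.DataFrame(
    {"surnames": ["granger", "potter", "lovegood"], "keys": [3, 1, 9]}
)

# Left merge keeps every row of ``data`` and fills unmatched keys with NaN.
merged = data.merge(lookup, on="keys", how="left")

# Alternative: map each key through a Series indexed by the lookup key.
data["first_names"] = data["keys"].map(lookup.set_index("keys")["first_names"])
```

Keys without a match yield ``NaN``, which is comparable to ``VLOOKUP`` returning ``#N/A``.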
235+
236+
Adding a row
237+
~~~~~~~~~~~~
238+
239+
To append a row, we can assign values to a new index label using :meth:`~DataFrame.loc`.
240+
241+
NOTE: If the index label already exists, the values in that row will be overwritten.
242+
243+
.. ipython:: python
244+
245+
df1.loc[7] = [8, "tonks"]
246+
df1
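
If the next free index label is not known in advance, one alternative is to build the new row as its own ``DataFrame`` and combine the two with ``pandas.concat``. A sketch with made-up rows, which also shows the overwrite behaviour mentioned in the note:

```python
import pandas as pd

df1 = pd.DataFrame({"keys": [1, 2], "first_names": ["harry", "ron"]})

# Build the new row as a one-row frame and let pandas renumber the index.
new_row = pd.DataFrame({"keys": [3], "first_names": ["hermione"]})
extended = pd.concat([df1, new_row], ignore_index=True)

# Assigning to a label that already exists overwrites that row
# instead of adding a new one.
df1.loc[1] = [2, "ronald"]
```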
247+
248+
249+
Search and Replace
250+
~~~~~~~~~~~~~~~~~~
251+
252+
The ``replace`` method of the ``DataFrame`` object can perform
253+
this function. Please see `pandas.DataFrame.replace <https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.replace.html>`__ for examples.
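
As a brief sketch, ``replace`` can swap a value everywhere in the frame or, with a nested dictionary, only within a given column. The replacement strings below are arbitrary choices for illustration:

```python
import pandas as pd

df = pd.DataFrame(
    {"class": ["A", "B", "C"], "all_pass": ["Yes", "No", "Yes"]}
)

# Replace a value wherever it occurs in the frame.
everywhere = df.replace("Yes", "Passed")

# Restrict the replacement to one column with a nested dictionary:
# {column name: {old value: new value}}.
one_column = df.replace({"all_pass": {"Yes": "Y", "No": "N"}})
```

Both calls return a new ``DataFrame`` and leave ``df`` unchanged.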

doc/source/getting_started/comparison/comparison_with_sql.rst

Lines changed: 1 addition & 9 deletions
Original file line numberDiff line numberDiff line change
@@ -8,15 +8,7 @@ Since many potential pandas users have some familiarity with
88
`SQL <https://en.wikipedia.org/wiki/SQL>`_, this page is meant to provide some examples of how
99
various SQL operations would be performed using pandas.
1010

11-
If you're new to pandas, you might want to first read through :ref:`10 Minutes to pandas<10min>`
12-
to familiarize yourself with the library.
13-
14-
As is customary, we import pandas and NumPy as follows:
15-
16-
.. ipython:: python
17-
18-
import pandas as pd
19-
import numpy as np
11+
.. include:: comparison_boilerplate.rst
2012

2113
Most of the examples will utilize the ``tips`` dataset found within pandas tests. We'll read
2214
the data into a DataFrame called ``tips`` and assume we have a database table of the same name and

doc/source/getting_started/comparison/comparison_with_stata.rst

Lines changed: 6 additions & 13 deletions
Original file line numberDiff line numberDiff line change
@@ -8,17 +8,7 @@ For potential users coming from `Stata <https://en.wikipedia.org/wiki/Stata>`__
88
this page is meant to demonstrate how different Stata operations would be
99
performed in pandas.
1010

11-
If you're new to pandas, you might want to first read through :ref:`10 Minutes to pandas<10min>`
12-
to familiarize yourself with the library.
13-
14-
As is customary, we import pandas and NumPy as follows. This means that we can refer to the
15-
libraries as ``pd`` and ``np``, respectively, for the rest of the document.
16-
17-
.. ipython:: python
18-
19-
import pandas as pd
20-
import numpy as np
21-
11+
.. include:: comparison_boilerplate.rst
2212

2313
.. note::
2414

@@ -48,14 +38,17 @@ General terminology translation
4838
``NaN``, ``.``
4939

5040

51-
``DataFrame`` / ``Series``
52-
~~~~~~~~~~~~~~~~~~~~~~~~~~
41+
``DataFrame``
42+
~~~~~~~~~~~~~
5343

5444
A ``DataFrame`` in pandas is analogous to a Stata data set -- a two-dimensional
5545
data source with labeled columns that can be of different types. As will be
5646
shown in this document, almost any operation that can be applied to a data set
5747
in Stata can also be accomplished in pandas.
5848

49+
``Series``
50+
~~~~~~~~~~
51+
5952
A ``Series`` is the data structure that represents one column of a
6053
``DataFrame``. Stata doesn't have a separate data structure for a single column,
6154
but in general, working with a ``Series`` is analogous to referencing a column

doc/source/getting_started/comparison/index.rst

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -11,5 +11,6 @@ Comparison with other tools
1111

1212
comparison_with_r
1313
comparison_with_sql
14+
comparison_with_spreadsheets
1415
comparison_with_sas
1516
comparison_with_stata
