
df.describe() Explained: Usage and Examples

2022-07-22 01:56:00 【懒笑翻】

Environment: Python 3.8.8 (default, Apr 13 2021, 15:08:03) [MSC v.1916 64 bit (AMD64)] on win32, PyDev console using IPython 7.22.0

describe(self: 'FrameOrSeries', percentiles=None, include=None, exclude=None, datetime_is_numeric=False) -> 'FrameOrSeries'

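The describe() method computes several common summary statistics for each column.

percentiles: a list-like of numbers between 0 and 1; the corresponding percentiles are returned
include: the list of data types to include when describing a DataFrame. Defaults to None
exclude: the list of data types to exclude when describing a DataFrame. Defaults to None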

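import pandas as pd

# Read the Excel file
df = pd.read_excel('./data/all.xlsx')

# Basic description of the data: cover every column, then transpose so that
# each original column becomes one row of the summary
view = df.describe(percentiles=[], include='all').T
view.to_excel('./data/result111.xlsx')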
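
The snippet above depends on a local Excel file, so here is a minimal self-contained sketch of the same workflow. The column names and values are hypothetical stand-ins for the contents of all.xlsx, and the default percentiles are used.

```python
import pandas as pd

# Hypothetical stand-in for the data in ./data/all.xlsx
df = pd.DataFrame({
    "city": ["Beijing", "Shanghai", "Beijing", "Shenzhen"],
    "sales": [10.0, 20.0, 30.0, 40.0],
})

# include='all' describes every column; .T turns each original
# column into one row of the summary table
view = df.describe(include="all").T
print(view)

# Writing the result out works the same way as in the original snippet:
# view.to_excel("./data/result111.xlsx")
```

After the transpose, individual statistics can be read by column name, for example `view.loc["sales", "mean"]`.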

[Figure: preview of the data in all.xlsx]

[Figure: preview of the data in result111.xlsx]

    Generate descriptive statistics.

    Descriptive statistics include those that summarize the central
    tendency, dispersion and shape of a
    dataset's distribution, excluding ``NaN`` values.

    Analyzes both numeric and object series, as well
    as ``DataFrame`` column sets of mixed data types. The output
    will vary depending on what is provided. Refer to the notes
    below for more detail.

Parameters in detail:

----------
    percentiles : list-like of numbers, optional
        The percentiles to include in the output. All should
        fall between 0 and 1. The default is
        ``[.25, .5, .75]``, which returns the 25th, 50th, and
        75th percentiles.
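
As a small illustration of the `percentiles` parameter, the sketch below requests the 5th, 50th and 95th percentiles instead of the default quartiles. The Series of integers 1..100 is made up for the example.

```python
import pandas as pd

s = pd.Series(range(1, 101))  # the integers 1..100

# Each value must lie between 0 and 1; the result rows are labelled
# with the matching percentage
summary = s.describe(percentiles=[0.05, 0.5, 0.95])
print(summary)
```

The percentile rows appear between `min` and `max`, and the `50%` row is the median.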

    include : 'all', list-like of dtypes or None (default), optional
        A white list of data types to include in the result. Ignored
        for ``Series``. Here are the options:

        - 'all' : All columns of the input will be included in the output,
          i.e. ``include='all'`` describes every column.
        - A list-like of dtypes : Limits the results to the
          provided data types.
          To limit the result to numeric types submit
          ``numpy.number``. To limit it instead to object columns submit
          the ``object`` data type (``numpy.object`` in older NumPy
          releases; the alias was removed in NumPy 1.24). Strings
          can also be used in the style of
          ``select_dtypes`` (e.g. ``df.describe(include=['O'])``). To
          select pandas categorical columns, use ``'category'``
        - None (default) : The result will include all numeric columns.

    exclude : list-like of dtypes or None (default), optional
        A black list of data types to omit from the result. Ignored
        for ``Series``. Here are the options:

        - A list-like of dtypes : Excludes the provided data types
          from the result. To exclude numeric types submit
          ``numpy.number``. To exclude object columns submit the
          ``object`` data type (``numpy.object`` in older NumPy releases;
          the alias was removed in NumPy 1.24). Strings can also be used
          in the style of
          ``select_dtypes`` (e.g. ``df.describe(exclude=['O'])``). To
          exclude pandas categorical columns, use ``'category'``
        - None (default) : The result will exclude nothing.
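
Since `include` and `exclude` are described in terms of `select_dtypes`, one way to check which columns a call will cover is to compare it with `select_dtypes` directly. This is a minimal sketch; the DataFrame and its column names are invented for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "categorical": pd.Categorical(["d", "e", "f"]),
    "numeric": [1, 2, 3],
    "text": ["a", "b", "c"],
})

only_numeric = df.describe(include=[np.number])
only_category = df.describe(include=["category"])
no_numeric = df.describe(exclude=[np.number])

# The described columns match what select_dtypes returns for the same filter
print(list(only_numeric.columns))
print(list(df.select_dtypes(include=[np.number]).columns))
print(list(no_numeric.columns))
```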

    datetime_is_numeric : bool, default False
        Whether to treat datetime dtypes as numeric. This affects statistics
        calculated for the column. For DataFrame input, this also
        controls whether datetime columns are included by default.

        .. versionadded:: 1.1.0
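
A caveat when reusing this parameter: to my knowledge `datetime_is_numeric` was removed in pandas 2.0, where datetime columns are always summarized as numeric. The sketch below is one possible way to write code that runs on both sides of that change; the `inspect`-based check is my own workaround, not something taken from the pandas documentation.

```python
import inspect
import pandas as pd

s = pd.Series(pd.to_datetime(["2000-01-01", "2010-01-01", "2010-01-01"]))

# Pass the flag only if the installed pandas version still accepts it
kwargs = {}
if "datetime_is_numeric" in inspect.signature(pd.Series.describe).parameters:
    kwargs["datetime_is_numeric"] = True

summary = s.describe(**kwargs)
print(summary)
```

Either way the result contains `mean`, `min`, the percentiles and `max` rather than `unique`, `top` and `freq`.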

Returns:

    -------
    Series or DataFrame
        Summary statistics of the Series or DataFrame provided.

Notes:

    -----
    For numeric data, the result's index will include ``count``,
    ``mean``, ``std``, ``min``, ``max`` as well as lower, ``50`` and
    upper percentiles. By default the lower percentile is ``25`` and the
    upper percentile is ``75``. The ``50`` percentile is the
    same as the median.
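
To make the numeric rows concrete, the sketch below compares the output of `describe()` with the individual Series methods. The sample values are made up; the `NaN` is there to show that missing values are not counted.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, np.nan])
summary = s.describe()
print(summary)

# Each row matches the corresponding Series method
print(summary["count"] == s.count())           # NaN is excluded from the count
print(np.isclose(summary["std"], s.std()))     # sample standard deviation
print(summary["50%"] == s.median())            # the 50% row is the median
```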

    For object data (e.g. strings or timestamps), the result's index
    will include ``count``, ``unique``, ``top``, and ``freq``. The ``top``
    is the most common value. The ``freq`` is the most common value's
    frequency. Timestamps also include the ``first`` and ``last`` items.

    If multiple object values have the highest count, then the
    ``top`` result will be arbitrarily chosen from
    among those with the highest count; ``freq`` is the same for all of them.
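
The object-data rows can be cross-checked against `value_counts()`. This is an illustrative sketch with made-up data; the second Series shows a tie, where only membership of `top` among the tied values is guaranteed.

```python
import pandas as pd

s = pd.Series(["a", "a", "b", "c"])
summary = s.describe()
counts = s.value_counts()  # sorted by frequency, most common first

print(summary["top"], summary["freq"])
print(counts.index[0], counts.iloc[0])

# With a tie, top is one of the tied values and freq is their shared count
tied = pd.Series(["x", "y"])
tied_summary = tied.describe()
print(tied_summary["top"] in {"x", "y"}, tied_summary["freq"])
```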

    For mixed data types provided via a ``DataFrame``, the default is to
    return only an analysis of numeric columns. If the dataframe consists
    only of object and categorical data without any numeric columns, the
    default is to return an analysis of both the object and categorical
    columns. If ``include='all'`` is provided as an option, the result
    will include a union of attributes of each type.
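
The three cases in the paragraph above can be sketched as follows. The column names (`grade`, `label`, `score`) are hypothetical.

```python
import pandas as pd

# No numeric columns: the default falls back to describing everything
non_numeric = pd.DataFrame({
    "grade": pd.Categorical(["d", "e", "f"]),
    "label": ["a", "b", "c"],
})
fallback = non_numeric.describe()

# Mixed columns: the default keeps only the numeric one
mixed = non_numeric.assign(score=[1, 2, 3])
default_mixed = mixed.describe()

# include='all': the index is the union of both sets of statistics
all_mixed = mixed.describe(include="all")

print(fallback)
print(default_mixed)
print(all_mixed)
```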

    The `include` and `exclude` parameters can be used to limit
    which columns in a ``DataFrame`` are analyzed for the output.
    The parameters are ignored when analyzing a ``Series``.
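
A quick way to confirm that a ``Series`` ignores these parameters is to compare the results directly. This is a minimal sketch with made-up values.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3])

plain = s.describe()
with_include = s.describe(include=[object])      # would filter out a numeric column in a DataFrame
with_exclude = s.describe(exclude=[np.number])   # likewise ignored for a Series

print(plain.equals(with_include), plain.equals(with_exclude))
```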

Examples:

Describing a numeric ``Series``.

    >>> s = pd.Series([1, 2, 3])
    >>> s.describe()
    count    3.0
    mean     2.0
    std      1.0
    min      1.0
    25%      1.5
    50%      2.0
    75%      2.5
    max      3.0
    dtype: float64

    Describing a categorical ``Series``.

    >>> s = pd.Series(['a', 'a', 'b', 'c'])
    >>> s.describe()
    count     4
    unique    3
    top       a
    freq      2
    dtype: object

    Describing a timestamp ``Series``.

    >>> s = pd.Series([
    ...   np.datetime64("2000-01-01"),
    ...   np.datetime64("2010-01-01"),
    ...   np.datetime64("2010-01-01")
    ... ])
    >>> s.describe(datetime_is_numeric=True)
    count                      3
    mean     2006-09-01 08:00:00
    min      2000-01-01 00:00:00
    25%      2004-12-31 12:00:00
    50%      2010-01-01 00:00:00
    75%      2010-01-01 00:00:00
    max      2010-01-01 00:00:00
    dtype: object

    Describing a ``DataFrame``. By default only numeric fields
    are returned.

    >>> df = pd.DataFrame({'categorical': pd.Categorical(['d','e','f']),
    ...                    'numeric': [1, 2, 3],
    ...                    'object': ['a', 'b', 'c']
    ...                   })
    >>> df.describe()
           numeric
    count      3.0
    mean       2.0
    std        1.0
    min        1.0
    25%        1.5
    50%        2.0
    75%        2.5
    max        3.0

    Describing all columns of a ``DataFrame`` regardless of data type.

    >>> df.describe(include='all') # doctest: +SKIP
           categorical numeric object
    count            3      3.0      3
    unique           3      NaN      3
    top              f      NaN      a
    freq             1      NaN      1
    mean           NaN      2.0    NaN
    std            NaN      1.0    NaN
    min            NaN      1.0    NaN
    25%            NaN      1.5    NaN
    50%            NaN      2.0    NaN
    75%            NaN      2.5    NaN
    max            NaN      3.0    NaN

    Describing a column from a ``DataFrame`` by accessing it as
    an attribute.

    >>> df.numeric.describe()
    count    3.0
    mean     2.0
    std      1.0
    min      1.0
    25%      1.5
    50%      2.0
    75%      2.5
    max      3.0
    Name: numeric, dtype: float64

    Including only numeric columns in a ``DataFrame`` description.

    >>> df.describe(include=[np.number])
           numeric
    count      3.0
    mean       2.0
    std        1.0
    min        1.0
    25%        1.5
    50%        2.0
    75%        2.5
    max        3.0

    Including only string columns in a ``DataFrame`` description.

    >>> df.describe(include=[object]) # doctest: +SKIP
           object
    count       3
    unique      3
    top         a
    freq        1

    Including only categorical columns from a ``DataFrame`` description.

    >>> df.describe(include=['category'])
           categorical
    count            3
    unique           3
    top              d
    freq             1

    Excluding numeric columns from a ``DataFrame`` description.

    >>> df.describe(exclude=[np.number]) # doctest: +SKIP
           categorical object
    count            3      3
    unique           3      3
    top              f      a
    freq             1      1

    Excluding object columns from a ``DataFrame`` description.

    >>> df.describe(exclude=[object]) # doctest: +SKIP
           categorical numeric
    count            3      3.0
    unique           3      NaN
    top              f      NaN
    freq             1      NaN
    mean           NaN      2.0
    std            NaN      1.0
    min            NaN      1.0
    25%            NaN      1.5
    50%            NaN      2.0
    75%            NaN      2.5
    max            NaN      3.0