
Given a Pandas DataFrame like this:

|------|-------|
|col1  |col2   |
|------|-------|
|a1    |abc    |
|a2    |bcd    |
|a3    |kfs    |
|------|-------|

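Is there an efficient way to build a matrix like the one below, where a custom function determines each numeric value? (The actual DataFrame has >10,000 rows.)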

|------|-------|------|-------|
|      |a1     |a2    |a3     |
|------|-------|------|-------|
|a1    |1.000  |0.362 |0.643  |
|a2    |0.362  |1.000 |0.364  |
|a3    |0.643  |0.364 |1.000  |
|------|-------|------|-------|

What I have tried so far:

  • Converting the DataFrame to a list and using a nested list comprehension. This is too slow, however.
  • Using sklearn's pairwise_distances with my custom function passed as the metric. This has the same performance problem.

Ultimately, the following representation should be generated:

|------|--------------------------------------|
|a1    |{a1: 1.000}, {a2: 0.362}, {a3: 0.643} |
|a2    |{a1: 0.362}, {a2: 1.000}, {a3: 0.364} |
|a3    |{a1: 0.643}, {a2: 0.364}, {a3: 1.000} |
|------|--------------------------------------|
Latest answer
  • 10 days ago

    One way to do this is to create the cross product of all possible values of col1, run the calculation for each pair, and then pivot:

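    import pandas as pd

    # dummy data
    df = pd.DataFrame({
        "col1": [f"a_{i}" for i in range(5)],
        "col2": range(5)})
    # The following DataFrame is produced. We are now looking for a way to
    # run some calculation for each combination of col1 x col1:
    #
    #   col1  col2
    # 0  a_0     0
    # 1  a_1     1
    # 2  a_2     2
    # 3  a_3     3
    # 4  a_4     4

    # Cross product: join every row with every row via a constant dummy key.
    pairs = pd.merge(df.assign(dummy=1), df.assign(dummy=1), on="dummy").drop("dummy", axis=1)
    # Placeholder calculation; substitute the custom function here.
    pairs["res"] = pairs.col2_x * pairs.col2_y
    # Pivot the long table of pairs into a square matrix.
    pd.pivot_table(pairs, index="col1_x", columns="col1_y", values="res")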
    

    The result is:

    col1_y  a_0  a_1  a_2  a_3  a_4
    col1_x                         
    a_0       0    0    0    0    0
    a_1       0    1    2    3    4
    a_2       0    2    4    6    8
    a_3       0    3    6    9   12
    a_4       0    4    8   12   16
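
The answer above multiplies two numeric columns, whereas the question has string values in col2 and a custom function. As a rough sketch of how the same idea might look for that case, the block below assumes the custom function is symmetric and returns 1.0 for identical inputs, so each unordered pair is computed only once and mirrored. The `similarity` function is a hypothetical stand-in (Jaccard overlap of character sets), not the asker's actual function, and the names `matrix`, `result`, and `representation` are illustrative only.

```python
from itertools import combinations

import numpy as np
import pandas as pd


def similarity(a: str, b: str) -> float:
    """Hypothetical stand-in for the custom function (assumes non-empty strings)."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb)


df = pd.DataFrame({"col1": ["a1", "a2", "a3"], "col2": ["abc", "bcd", "kfs"]})

labels = df["col1"].tolist()
values = df["col2"].tolist()
n = len(values)

# Assumption: similarity(x, x) == 1.0, so the diagonal is filled up front.
matrix = np.eye(n)

# Assumption: the function is symmetric, so compute the upper triangle
# once and mirror each value into the lower triangle.
for i, j in combinations(range(n), 2):
    matrix[i, j] = matrix[j, i] = similarity(values[i], values[j])

result = pd.DataFrame(matrix, index=labels, columns=labels)

# One list of single-key dicts per row, as in the question's target layout.
representation = {
    row: [{col: round(float(val), 3)} for col, val in scores.items()]
    for row, scores in result.iterrows()
}
```

This halves the number of calls compared with the full cross product, but it still calls the function once per pair in Python (roughly 50 million calls for 10,000 rows), so a real speed-up would probably require expressing the custom function itself in vectorized NumPy operations.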
    
