
I have two DataFrames, ddf_1 and ddf_2, that share a column of string IDs. My goal is to create a new boolean column is_fine in ddf_1 that is True if the ID is present in both ddf_1 and ddf_2, and False otherwise.

Please consider the following example data:

# example data (assumes an active SparkSession named `spark`)
import pandas as pd
data_1 = { 
    'fruits': ["apples", "banana", "cherry"],
    'myid': ['1-12', '2-12', '3-13'],
    'meat': ["pig", "cow", "chicken"]}
data_2 = { 
    'furniture': ["table", "chair", "lamp"],
    'myid': ['1-12', '0-11', '2-12'],
    'clothing': ["pants", "shoes", "socks"]}
df_1 = pd.DataFrame(data_1)
ddf_1 = spark.createDataFrame(df_1)
df_2 = pd.DataFrame(data_2)
ddf_2 = spark.createDataFrame(df_2)

I imagine a function something like this:

def func(df_1, df_2, column_1, column_2):
    if df_1.column_1 != df_2.column_2:
       return df_1.withColumn('is_fact', False)
    else:
        return df_1.withColumn('is_fact', True)
    return df_1

The desired output should look like this:
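
The expected flags can be worked out from the sample data in plain Python, without Spark, as a sanity check on what any solution should produce. This is only a sketch: `expected_flags` is a hypothetical helper that is not part of the question, and it assumes "is fine" simply means that the `myid` value also appears in `ddf_2`.

```python
def expected_flags(ids_1, ids_2):
    """Return (id, flag) pairs; flag is True when the id also occurs in ids_2."""
    lookup = set(ids_2)  # a set gives constant-time membership checks
    return [(i, i in lookup) for i in ids_1]


# myid values copied from data_1 and data_2 in the question
ids_1 = ['1-12', '2-12', '3-13']
ids_2 = ['1-12', '0-11', '2-12']

for myid, is_fine in expected_flags(ids_1, ids_2):
    print(myid, is_fine)
```

With the sample data this marks 1-12 and 2-12 as True and 3-13 as False.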

Latest answers
  • 1 month ago
    1 #

    You can perform a left outer join between the two DataFrames on the myid column and derive the is_fine column with a simple case expression, as shown below:

    import pyspark.sql.functions as F
    ddf_1.join(ddf_2, ddf_1.myid == ddf_2.myid, 'left')\
    .withColumn('is_fine', F.when(ddf_2.myid.isNull(), False).otherwise(True))\
    .select(ddf_1['fruits'], ddf_1['myid'], ddf_1['meat'], 'is_fine').show()
    

    Output:

    +------+----+-------+-------+
    |fruits|myid|   meat|is_fine|
    +------+----+-------+-------+
    |cherry|3-13|chicken|  false|
    |apples|1-12|    pig|   true|
    |banana|2-12|    cow|   true|
    +------+----+-------+-------+
    

  • 1 month ago
    2 #

    Use Spark SQL for this kind of problem:

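    # register both DataFrames as temp views so the SQL below can refer to them by name
    ddf_1.createOrReplaceTempView("ddf_1")
    ddf_2.createOrReplaceTempView("ddf_2")

    query = """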
    select ddf_1.*,
    case 
        when ddf_1.myid = ddf_2.myid  then True
        else False 
    end as is_fine
    from ddf_1 left outer join ddf_2 
    on ddf_1.myid = ddf_2.myid
    """
    display(spark.sql(query))
    

    This is the output

  • 1 month ago
    3 #

    import pyspark.sql.functions as F

    result = (
        # left join ddf_2 onto ddf_1
        ddf_1.join(ddf_2, ddf_1.myid == ddf_2.myid, how='left')
        # create the is_fine column: False where no match was found in ddf_2
        .withColumn('is_fine', F.when(ddf_2.myid.isNull(), False).otherwise(True))
        # keep all columns from ddf_1 plus the new is_fine column
        .select(ddf_1["*"], "is_fine")
    )
    result.show()
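
One caveat for the join-based approaches above: a left join emits one row per matching pair, so if ddf_2 held the same myid more than once, the corresponding ddf_1 rows would be repeated in the result. The plain-Python sketch below imitates that behaviour on lists. `left_join_flag` and the duplicated '1-12' are hypothetical and only for illustration. In PySpark, joining against a de-duplicated ID list such as `ddf_2.select('myid').distinct()` is one possible way to avoid it.

```python
def left_join_flag(left_ids, right_ids, dedupe=False):
    """Imitate a left join on id and flag the rows that found a match."""
    if dedupe:
        right_ids = list(dict.fromkeys(right_ids))  # drop repeats, keep order
    rows = []
    for lid in left_ids:
        matches = [rid for rid in right_ids if rid == lid]
        if matches:
            # a join produces one output row per matching right-hand row
            rows.extend((lid, True) for _ in matches)
        else:
            # no match: the right side would be NULL, so the flag is False
            rows.append((lid, False))
    return rows


left = ['1-12', '2-12', '3-13']
right = ['1-12', '1-12', '0-11', '2-12']  # hypothetical: '1-12' appears twice

print(left_join_flag(left, right))               # '1-12' is emitted twice
print(left_join_flag(left, right, dedupe=True))  # one row per left id
```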
    
