Merging DataFrames Columns for Apache Spark in Python without a Key Column

In a question at the Databricks Q&A site, we looked at how to take two DataFrames with the same number of rows and merge their columns into one DataFrame without an existing key column.

If we use the function monotonically_increasing_id() to assign an increasing ID to each row, then merging with join works. Not shown here: we could pad the smaller DataFrame (fewer rows) to the same length as the longer one. This is needed so monotonically_increasing_id() assigns the same IDs to both DataFrames.

# For two DataFrames that have the same number of rows, merge all columns, row by row.

# Get the function monotonically_increasing_id so we can assign IDs to each row, since the
# DataFrames have the same number of rows.
from pyspark.sql.functions import monotonically_increasing_id

# Create some test data with 4 and 3 columns.
df1 = sqlContext.createDataFrame([("foo", "bar", "too", "aaa"), ("bar", "bar", "aaa", "foo"), ("aaa", "bbb", "ccc", "ddd")], ("k", "K", "v", "V"))
df2 = sqlContext.createDataFrame([("aaa", "bbb", "ddd"), ("www", "eee", "rrr"), ("jjj", "rrr", "www")], ("m", "M", "n"))

# Add increasing IDs; they should match across both DataFrames.
df1 = df1.withColumn("id", monotonically_increasing_id())
df2 = df2.withColumn("id", monotonically_increasing_id())

# Perform a join on the ids.
df3 = df2.join(df1, "id", "outer").drop("id")
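To see what the id-plus-join is doing, the same row-by-row pairing can be sketched in plain Python without Spark. This is only a conceptual illustration: the tuples below are hypothetical stand-ins for the two DataFrames, enumerate() plays the role of monotonically_increasing_id(), and a dict lookup on the id plays the role of the join.

```python
# Toy rows mirroring df1 and df2 above (hypothetical sample data).
rows1 = [("foo", "bar", "too", "aaa"),
         ("bar", "bar", "aaa", "foo"),
         ("aaa", "bbb", "ccc", "ddd")]
rows2 = [("aaa", "bbb", "ddd"),
         ("www", "eee", "rrr"),
         ("jjj", "rrr", "www")]

# Assign an increasing id to each row, as withColumn("id", ...) did in Spark.
ids1 = {i: row for i, row in enumerate(rows1)}
ids2 = {i: row for i, row in enumerate(rows2)}

# "Join" on the id and drop it, keeping all columns from both sides
# (df2's columns first, matching df2.join(df1, ...) above).
merged = [ids2[i] + ids1[i] for i in sorted(ids1)]

print(merged[0])  # ('aaa', 'bbb', 'ddd', 'foo', 'bar', 'too', 'aaa')
```

In actual Spark, a more robust variant of this idea uses the RDD method zipWithIndex(), which guarantees consecutive indices regardless of partitioning, whereas monotonically_increasing_id() only lines up when both DataFrames are partitioned the same way.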

I have also created a GitHub repository to store Spark code snippets as I work on them.

Darrell Ulm’s BookList Pages

The links below list books read and books still to read, from the GoodReads profile pages for Darrell Raymond Ulm.

GoodReads profile for Darrell Ulm

The book list mostly covers computer science, parallel computing, and algorithms.

Blogger, Tumblr

SlideShare Bookmarks for Darrell Ulm – Comp. Science Presentations

The following link goes to Darrell Ulm's saved SlideShare presentations, with slide shows on computer science, Apache Spark, Python, software development, parallel algorithms, and Drupal. It also includes links to saved slides for Apache Spark, Hadoop, and Drupal.
