Merging DataFrame Columns in Apache Spark with Python, without a Key Column

At the Databricks Q&A site, I looked at how to take DataFrames with an identical number of rows and merge their columns into one DataFrame without an existing key.

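If we use the function monotonically_increasing_id() to assign increasing IDs to the rows of each DataFrame, then merging with join works. The generated IDs are increasing and unique but not necessarily consecutive, and they depend on how each DataFrame is partitioned, so the two DataFrames only receive the same IDs when their rows are laid out across partitions in the same way. Not shown here: we could also pad the smaller DataFrame (the one with fewer rows) to the same length as the longer one, so that both DataFrames receive the same set of IDs.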

# For two DataFrames that have the same number of rows, merge all columns, row by row.

# Import monotonically_increasing_id so we can assign an ID to each row when the
# DataFrames have the same number of rows.
from pyspark.sql.functions import monotonically_increasing_id

# Create some test data with 4 and 3 columns.
df1 = sqlContext.createDataFrame([("foo", "bar","too","aaa"), ("bar", "bar","aaa","foo"), ("aaa", "bbb","ccc","ddd")], ("k", "K" ,"v" ,"V"))
df2 = sqlContext.createDataFrame([("aaa", "bbb","ddd"), ("www", "eee","rrr"), ("jjj", "rrr","www")], ("m", "M" ,"n"))

# Add increasing IDs; they match only if both DataFrames are partitioned the same way.
df1 = df1.withColumn("id", monotonically_increasing_id())
df2 = df2.withColumn("id", monotonically_increasing_id())

# Perform a join on the ids.
df3 = df2.join(df1, "id", "outer").drop("id")
df3.show()
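
The join above only lines rows up when both DataFrames receive identical IDs. The Spark documentation describes the current implementation of monotonically_increasing_id() as putting the partition ID in the upper 31 bits and the record number within each partition in the lower 33 bits. The sketch below is a plain-Python model of that layout, not Spark code; the helper name simulated_ids and the partition sizes are made up for illustration.

```python
# Pure-Python model of the documented ID layout; no Spark required.
# Assumption: partition index in the upper 31 bits, row position within
# the partition in the lower 33 bits, as the Spark documentation describes
# for the current implementation.

def simulated_ids(partition_sizes):
    """Return the IDs each row would receive for the given partition sizes."""
    ids = []
    for partition_index, size in enumerate(partition_sizes):
        for row_in_partition in range(size):
            ids.append((partition_index << 33) + row_in_partition)
    return ids

# Three rows held in a single partition: the IDs are consecutive.
single = simulated_ids([3])

# The same three rows split across two partitions (2 rows, then 1 row).
split = simulated_ids([2, 1])

print(single)  # [0, 1, 2]
print(split)   # [0, 1, 8589934592]
```

Under this model, two DataFrames with the same row count but different partition layouts get different IDs for their last rows, so an outer join on "id" would leave those rows unmatched. If that is a concern, one alternative worth considering is an index built with RDD.zipWithIndex(), which is consecutive across partitions.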

I have also created a GitHub repository to store Spark code snippets as I work on them.

Academic Research for Darrell Ulm: Microsoft Research: Stream PRAM

The link here is to the Stream PRAM page for Darrell Raymond Ulm at the Microsoft Research site. It lists Computer Science research papers on parallel processing and algorithms for the Stream PRAM model of computation.

The results of the research are primarily theoretical, rather than dealing with practical parallel programming.

GitHub Darrell Ulm Profile Link

GitHub is taking over the online software development repository sphere, and certainly has a huge market share of all online managed software projects, both Open Source and proprietary.

Here is the GitHub account profile page for Darrell Ulm, with source code in C#, C/C++, PHP, ASM and other languages. Other links to this information can be found at the following: Profile, Tumblr and Blogger.

GitHub is a great tool for software development, making most tasks more streamlined and helping multi-developer projects come together more cleanly with a smooth workflow.