The chief goal of merging the processing and memory components in this way is to reduce memory latency and increase bandwidth. In addition, reducing the distance that data must travel lowers the power requirements of a system. Much of the complexity (and hence power consumption) in current processors stems from strategies for avoiding memory stalls.
The Transputer also had a large on-chip memory for a device made in the early 1980s, making it essentially a processor-in-memory. See http://www.classiccmp.org/transputer/