MapReduce

map reduce, map-reduce, map/reduce, Osprey
MapReduce is a programming model and an associated implementation for processing and generating big data sets with a parallel, distributed algorithm on a cluster.

Big data

big data analytics, big-data, big data analysis
MapReduce is a programming model and an associated implementation for processing and generating big data sets with a parallel, distributed algorithm on a cluster. By 2014, Google was no longer using MapReduce as its primary big data processing model, and development on Apache Mahout had moved on to more capable and less disk-oriented mechanisms that incorporated full map and reduce capabilities.
In 2004, Google published a paper on a process called MapReduce that uses a similar architecture.

Reduce (parallel pattern)

reduce
A MapReduce program is composed of a map procedure (or method), which performs filtering and sorting (such as sorting students by first name into queues, one queue for each name), and a reduce method, which performs a summary operation (such as counting the number of students in each queue, yielding name frequencies).
The reduction of sets of elements is an integral part of programming models such as Map Reduce, where a function is applied (mapped) to all elements before they are reduced.
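As a minimal sketch of the idea that a function is mapped over all elements before they are reduced, the following Python uses the standard library's `functools.reduce`. The helper name `map_then_reduce` and the sample data are illustrative assumptions, not part of any MapReduce framework.

```python
from functools import reduce
from operator import add


def map_then_reduce(func, combine, items, initial):
    """Apply func to every element, then fold the results with combine.

    Hypothetical helper for illustration only.
    """
    mapped = map(func, items)                 # map step: transform each element
    return reduce(combine, mapped, initial)   # reduce step: fold into one value


# Sum of squares: square each number (map), then add them up (reduce).
sum_of_squares = map_then_reduce(lambda x: x * x, add, [1, 2, 3, 4], 0)
print(sum_of_squares)  # 30
```

Passing an explicit `initial` value keeps the result defined for an empty input, which a distributed reduction generally needs as well.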

Programming model

general programming model, Models
MapReduce is a programming model and an associated implementation for processing and generating big data sets with a parallel, distributed algorithm on a cluster.
Examples include the POSIX Threads library and Hadoop's MapReduce.

Apache Hadoop

Hadoop, HDFS, Apache
A popular open-source implementation that has support for distributed shuffles is part of Apache Hadoop.
It provides a software framework for distributed storage and processing of big data using the MapReduce programming model.

Map (parallel pattern)

map, maps, parallel map
A MapReduce program is composed of a map procedure (or method), which performs filtering and sorting (such as sorting students by first name into queues, one queue for each name), and a reduce method, which performs a summary operation (such as counting the number of students in each queue, yielding name frequencies).
For example, map combined with category reduction gives the MapReduce pattern.
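The student example above can be sketched in a few lines of Python. This is a single-process illustration under stated assumptions: the function names (`map_student`, `shuffle`, `reduce_queue`, `name_frequencies`) and the sample names are hypothetical, and the grouping into queues is shown as a separate shuffle helper rather than as part of any real framework.

```python
from collections import defaultdict


def map_student(student):
    """Map step: emit a (key, value) pair keyed by first name."""
    first_name = student.split()[0]
    return (first_name, 1)


def shuffle(pairs):
    """Group emitted values by key: one queue per first name."""
    queues = defaultdict(list)
    for key, value in pairs:
        queues[key].append(value)
    return queues


def reduce_queue(name, values):
    """Reduce step: summarize one queue by counting its entries."""
    return (name, sum(values))


def name_frequencies(students):
    pairs = [map_student(s) for s in students]
    queues = shuffle(pairs)
    return dict(reduce_queue(name, values) for name, values in queues.items())


# Illustrative input, not taken from the source.
students = ["Ada Lovelace", "Alan Turing", "Ada Yonath", "Grace Hopper"]
frequencies = name_frequencies(students)
print(frequencies)  # {'Ada': 2, 'Alan': 1, 'Grace': 1}
```

Because each queue is reduced independently, the reduce calls could in principle run on different machines; this sketch runs them sequentially for clarity.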

Apache Mahout

Mahout
By 2014, Google was no longer using MapReduce as its primary big data processing model, and development on Apache Mahout had moved on to more capable and less disk-oriented mechanisms that incorporated full map and reduce capabilities.
While Mahout's core algorithms for clustering, classification and batch-based collaborative filtering are implemented on top of Apache Hadoop using the map/reduce paradigm, it does not restrict contributions to Hadoop-based implementations.

Computer cluster

cluster, clustering, clusters
MapReduce is a programming model and an associated implementation for processing and generating big data sets with a parallel, distributed algorithm on a cluster.
This is an area of ongoing research; algorithms that combine and extend MapReduce and Hadoop have been proposed and studied.

Shard (database architecture)

sharding, sharded, shard
In many situations, the input data might already be distributed ("sharded") among many different servers, in which case step 1 could sometimes be greatly simplified by assigning Map servers that would process the locally present input data.
eXtreme Scale: a cross-process in-memory key/value datastore (a variety of NoSQL datastore). It uses sharding to achieve scalability across processes for both data and MapReduce-style parallel processing.
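The locality idea above, where map tasks run on the servers that already hold the sharded input, can be sketched as follows. Everything here is a simulated assumption: the server names, the `shards` dictionary, and the helpers `map_local`, `reduce_counts`, and `run_job` are hypothetical, and the "servers" are just dictionary keys in one process.

```python
from collections import Counter

# Hypothetical shards: server name -> records already stored on that server.
shards = {
    "server-a": ["apple", "pear", "apple"],
    "server-b": ["pear", "fig"],
    "server-c": ["apple"],
}


def map_local(records):
    """Map task meant to run where the shard lives, so no input is moved."""
    return Counter(records)


def reduce_counts(partials):
    """Merge the small per-shard summaries into one overall result."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


def run_job(shards):
    # One map task per shard, assigned to the server that already holds it.
    partials = {server: map_local(records) for server, records in shards.items()}
    return reduce_counts(partials.values())


totals = run_job(shards)
print(dict(totals))
```

Only the per-shard counts would need to cross the network in a real deployment, which is typically far less data than the raw input.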

Apache Pig

Pig, PIG Latin, Pig (or PigLatin)
MapReduce's use of input files and lack of schema support prevent the performance improvements enabled by common database system features such as B-trees and hash partitioning, though projects such as Pig (or PigLatin), Sawzall, Apache Hive, YSmart, HBase and Bigtable are addressing some of these problems.
Pig can execute its Hadoop jobs in MapReduce, Apache Tez, or Apache Spark.

Apache Hive

Hive, Hive QL
MapReduce's use of input files and lack of schema support prevent the performance improvements enabled by common database system features such as B-trees and hash partitioning, though projects such as Pig (or PigLatin), Sawzall, Apache Hive, YSmart, HBase and Bigtable are addressing some of these problems.
Without an abstraction layer such as Hive, traditional SQL queries must be implemented in the MapReduce Java API to execute SQL applications and queries over distributed data.

Apache HBase

HBase
MapReduce's use of input files and lack of schema support prevent the performance improvements enabled by common database system features such as B-trees and hash partitioning, though projects such as Pig (or PigLatin), Sawzall, Apache Hive, YSmart, HBase and Bigtable are addressing some of these problems.
Tables in HBase can serve as the input and output for MapReduce jobs run in Hadoop, and may be accessed through the Java API but also through REST, Avro or Thrift gateway APIs.

Bigtable

Google Cloud Bigtable, Google Bigtable
MapReduce's use of input files and lack of schema support prevent the performance improvements enabled by common database system features such as B-trees and hash partitioning, though projects such as Pig (or PigLatin), Sawzall, Apache Hive, YSmart, HBase and Bigtable are addressing some of these problems.
Bigtable development began in 2004, and Bigtable is now used by a number of Google applications, such as web indexing, MapReduce (which is often used for generating and modifying data stored in Bigtable), Google Maps, Google Book Search, "My Search History", Google Earth, Blogger.com, Google Code hosting, YouTube, and Gmail.

Apache CouchDB

CouchDB
Apache CouchDB
It has a document-oriented NoSQL database architecture and is implemented in the concurrency-oriented language Erlang; it uses JSON to store data, JavaScript as its query language using MapReduce, and HTTP for an API.

Sawzall (programming language)

Sawzall, Sawzall programming language
MapReduce's use of input files and lack of schema support prevent the performance improvements enabled by common database system features such as B-trees and hash partitioning, though projects such as Pig (or PigLatin), Sawzall, Apache Hive, YSmart, HBase and Bigtable are addressing some of these problems.
However, since the MapReduce table aggregators have not been released, the open-sourced runtime is not useful for large-scale data analysis of multiple log files off the shelf.

MongoDB

For example, map and reduce functionality can be implemented easily in PL/SQL, Oracle's database-oriented language, and is supported transparently for developers in distributed database architectures such as the Clusterpoint XML database or the MongoDB NoSQL database.
Map-reduce can be used for batch processing of data and aggregation operations.

Google File System

GFS, Colossus, Colossus/GFS
Each node is expected to report back periodically with completed work and status updates. If a node falls silent for longer than that interval, the master node (similar to the master server in the Google File System) records the node as dead and sends out the node's assigned work to other nodes.
MapReduce
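The failure-handling rule described above (a master that records a silent node as dead and sends its assigned work to other nodes) can be sketched in Python. This is a simplified, single-process model under stated assumptions: the `Master` class, its method names, the round-robin reassignment, and the use of explicit `now` timestamps instead of a real clock are all hypothetical choices for illustration.

```python
class Master:
    """Hypothetical master that tracks worker liveness by last report time."""

    def __init__(self, timeout):
        self.timeout = timeout      # maximum allowed silence, in time units
        self.last_seen = {}         # worker -> time of last report
        self.assignments = {}       # worker -> list of assigned tasks
        self.dead = set()           # workers recorded as dead
        self.pending = []           # tasks with no live worker to take them

    def assign(self, worker, task, now):
        # A worker is registered the first time it receives work.
        self.last_seen.setdefault(worker, now)
        self.assignments.setdefault(worker, []).append(task)

    def heartbeat(self, worker, now):
        if worker not in self.dead:
            self.last_seen[worker] = now

    def check(self, now):
        """Record silent workers as dead, then hand their tasks to live ones."""
        newly_dead = [w for w, seen in self.last_seen.items()
                      if w not in self.dead and now - seen > self.timeout]
        self.dead.update(newly_dead)
        live = sorted(w for w in self.last_seen if w not in self.dead)
        for worker in newly_dead:
            orphaned = self.assignments.pop(worker, [])
            for i, task in enumerate(orphaned):
                if live:
                    # Round-robin over live workers (an assumed policy).
                    target = live[i % len(live)]
                    self.assignments.setdefault(target, []).append(task)
                else:
                    self.pending.append(task)


master = Master(timeout=10)
master.assign("w1", "map-0", now=0)
master.assign("w1", "map-1", now=0)
master.assign("w2", "map-2", now=0)
master.heartbeat("w2", now=8)   # w2 reports in; w1 stays silent
master.check(now=12)            # w1 has been silent for 12 > 10 time units
print(master.dead, master.assignments)
```

All dead workers are identified before any task is reassigned, so orphaned work is never handed to a worker that has also just timed out.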

Riak

Riak
More complex queries are also possible, including secondary indexes, search (via Apache Solr), and MapReduce.

Teradata

Teradata version 14, Teradata Aster, Teradata Corporation
David DeWitt and Michael Stonebraker, computer scientists specializing in parallel databases, called MapReduce's interface too low-level and questioned whether it really represents the paradigm shift its proponents have claimed. They challenged the MapReduce proponents' claims of novelty, citing Teradata as an example of prior art that has existed for over two decades.
For Teradata, big data prompted the acquisition of Aster Data Systems in 2011 for the company’s MapReduce capabilities and ability to store and analyze semi-structured data.

Bird–Meertens formalism

computational paradigm
* Homomorphism lemma
This is the basis of the map-reduce approach.
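As the homomorphism lemma is usually stated, a function h on lists is a list homomorphism, meaning h(x ++ y) = h(x) ⊕ h(y) for some associative operator ⊕ with identity e, exactly when h can be written as a reduce with ⊕ composed with a map of some function f. The sketch below checks that property on a small example; the helper name `homomorphism` and the sample lists are assumptions made for illustration.

```python
from functools import reduce


def homomorphism(f, op, identity):
    """Build h = reduce(op) . map(f), with h([]) == identity.

    op is assumed to be associative with the given identity element.
    """
    def h(items):
        return reduce(op, map(f, items), identity)
    return h


# List length as a homomorphism: f maps every element to 1, op is addition.
length = homomorphism(lambda _: 1, lambda a, b: a + b, 0)

left, right = [3, 1, 4], [1, 5]
whole = length(left + right)               # h(x ++ y)
combined = length(left) + length(right)    # h(x) op h(y)
print(whole, combined)  # 5 5
```

The equality of `whole` and `combined` is what lets the two halves be processed on separate machines and merged afterwards.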

Parallel computing

parallel, parallel processing, parallelism
MapReduce is a programming model and an associated implementation for processing and generating big data sets with a parallel, distributed algorithm on a cluster.

Distributed computing

distributed, distributed systems, distributed system
MapReduce is a programming model and an associated implementation for processing and generating big data sets with a parallel, distributed algorithm on a cluster.

Subroutine

function, functions, subroutines
A MapReduce program is composed of a map procedure (or method), which performs filtering and sorting (such as sorting students by first name into queues, one queue for each name), and a reduce method, which performs a summary operation (such as counting the number of students in each queue, yielding name frequencies).

Marshalling (computer science)

marshalling, marshal, marshalled
The "MapReduce System" (also called "infrastructure" or "framework") orchestrates the processing by marshalling the distributed servers, running the various tasks in parallel, managing all communications and data transfers between the various parts of the system, and providing for redundancy and fault tolerance.
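Two of the framework duties named above, running tasks in parallel and providing fault tolerance, can be sketched with Python's standard `concurrent.futures` module. This is a toy model under stated assumptions: threads stand in for distributed servers, a raised `RuntimeError` stands in for a node failure, and the names `run_with_retries`, `run_map_phase`, and `flaky_word_count` are hypothetical.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def run_with_retries(task, chunk, max_attempts):
    """Run one task, re-executing it if it raises (a stand-in for node failure)."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task(chunk)
        except RuntimeError:
            if attempt == max_attempts:
                raise  # give up after the last allowed attempt


def run_map_phase(task, chunks, workers=4, max_attempts=3):
    """Run all map tasks in parallel and return results in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(run_with_retries, task, chunk, max_attempts)
                   for chunk in chunks]
        return [future.result() for future in futures]


# Simulated unreliable task: fails the first time it sees a "crash" chunk.
_seen = set()
_lock = threading.Lock()


def flaky_word_count(chunk):
    with _lock:
        first_try = chunk not in _seen
        _seen.add(chunk)
    if first_try and "crash" in chunk:
        raise RuntimeError("simulated worker failure")
    return len(chunk.split())


chunks = ["a b c", "crash here once", "d e"]
counts = run_map_phase(flaky_word_count, chunks)
print(counts)  # [3, 3, 2]
```

Re-execution is safe here only because the task has no side effects beyond its return value, which is also why map and reduce functions are usually written as pure functions.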

Redundancy (engineering)

redundancy, redundant, redundancies
The "MapReduce System" (also called "infrastructure" or "framework") orchestrates the processing by marshalling the distributed servers, running the various tasks in parallel, managing all communications and data transfers between the various parts of the system, and providing for redundancy and fault tolerance.