Big Data Analytics, K. Kiruthika, II M.Sc. Computer Science, Bonsecours College for Women
1. BIG DATA ANALYTICS
Name : K. Kiruthika
Class : II M.Sc. (CS)
Batch : 2017-2019
In-charge staff : M. FLORANCE DAYANA
2. MapReduce Application
Introduction:
MapReduce is a programming model and an
associated implementation for processing and
generating big data sets with
a parallel, distributed algorithm on a cluster.
4. Map: Each worker node applies the map function to its local data and writes the
output to temporary storage.
Shuffle: Worker nodes redistribute data based on the output
keys, such that all data belonging to one key is located on the same worker
node.
Reduce: Worker nodes process each group of output data, per
key, in parallel.
6. MapReduce can be described as a 5-step parallel
and distributed computation:
Prepare the Map() input : The
"MapReduce system" designates Map
processors and assigns each one the input key value K1 it will work on.
Run the user-provided Map()
code : Map() is run exactly once for
each K1 key value, generating output organized
by key values K2.
7. "Shuffle" the Map output to the
Reduce processors :The MapReduce
system designates assigns the K2 key value each
processer.
Run the user-provided Reduce()
code : Reduce() is run the program exactly once
for each K2 key value produced by the Map step.
Produce the final output : The
MapReduce system collects all the Reduce output,
and sorts it final outcome.
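The five steps can be traced with explicit K1/K2 keys. The sketch below uses the classic max-temperature-per-year example; the record layout and function names are invented for illustration, not a Hadoop API:

```python
from itertools import groupby
from operator import itemgetter

# Step 1 - Prepare the Map() input: K1 is a record id, the value is a raw line
records = [(0, "2017,31"), (1, "2018,29"), (2, "2017,35"), (3, "2018,33")]

# Step 2 - Run Map(): exactly once per K1 value, emitting (K2, value) pairs
def map_fn(k1, line):
    year, temp = line.split(",")
    return [(year, int(temp))]

mapped = [pair for k1, line in records for pair in map_fn(k1, line)]

# Step 3 - Shuffle: bring all values for each K2 key together
mapped.sort(key=itemgetter(0))
shuffled = {k2: [v for _, v in grp] for k2, grp in groupby(mapped, key=itemgetter(0))}

# Step 4 - Run Reduce(): exactly once per K2 key
reduced = {k2: max(values) for k2, values in shuffled.items()}

# Step 5 - Produce the final output, sorted by K2
final = sorted(reduced.items())
print(final)  # [('2017', 35), ('2018', 33)]
```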
9. DATA SERIALIZATION AND WORKING WITH COMMON
SERIALIZATION FORMATS:
A container is nothing but a data structure, or an object, used to
store data.
When we transfer data over a network, the container is
converted into a byte stream.
This conversion is referred to as serialization.
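As a sketch, Python's standard-library pickle module shows this container-to-byte-stream round trip (the container contents here are made up for illustration):

```python
import pickle

container = {"name": "sensor-42", "readings": [1.5, 2.0, 3.25]}

# Serialize: the container becomes a byte stream suitable for a network
byte_stream = pickle.dumps(container)
print(type(byte_stream))  # <class 'bytes'>

# Deserialize on the receiving side: the byte stream becomes a container again
restored = pickle.loads(byte_stream)
assert restored == container
```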
11. Uses :
A method of transferring data through the wires
(messaging).
A method of storing data (in databases, on hard disk drives).
A method of remote procedure calls, e.g., as in SOAP.
A method for distributing objects, especially in component-
based software engineering such as COM, CORBA, etc.
A method for detecting changes in time-varying data.
13. Drawbacks:
Serialization breaks the opacity of an abstract data type by potentially exposing
private implementation details.
Trivial implementations that serialize all data members, including private ones, may violate encapsulation.
To discourage competitors from making compatible products, publishers of proprietary software often keep the details of their programs'
serialization formats a trade secret, and some deliberately
obfuscate or even encrypt the serialized data.
Therefore, remote method call architectures such
as CORBA define their serialization formats in detail.
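A small Python sketch of the encapsulation drawback: naive serialization captures every member, including conventionally private ones (the class and attribute names are invented for illustration):

```python
import pickle

class Account:
    def __init__(self, owner, secret_token):
        self.owner = owner
        self._secret_token = secret_token  # "private" by convention

acct = Account("alice", "t0p-s3cret")

# Naive serialization records the whole object state...
data = pickle.dumps(acct)

# ...so the private implementation detail leaks into the byte stream
assert b"t0p-s3cret" in data
```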
15. Serialization formats:
JSON is a lighter plain-text alternative to
XML, commonly used for client-server communication in
web applications.
JSON is based on JavaScript syntax, but is supported in
other programming languages as well.
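A minimal example of JSON's plain-text nature, using Python's standard json module (the record contents are made up):

```python
import json

message = {"user": "kiruthika", "active": True, "scores": [88, 92]}

# Serialize to a plain-text string, e.g. for a client-server exchange
text = json.dumps(message)
print(text)  # {"user": "kiruthika", "active": true, "scores": [88, 92]}

# Any language with a JSON parser can read this back
assert json.loads(text) == message
```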
19. YAML :
YAML includes features that make it more
powerful for serialization, more
"human friendly," and potentially more compact.
These features include a notion of tagging data types and
support for non-hierarchical data structures.
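For illustration, a small hypothetical YAML document showing those features: explicit type tags, and anchors/aliases that let one node be referenced from several places (a non-hierarchical structure). Parsing it in Python would need a third-party library such as PyYAML:

```yaml
# Hypothetical YAML document illustrating tags and anchors
user: kiruthika
active: true
scores: !!seq [88, 92]   # explicit data-type tag
defaults: &base          # anchor: a reusable node
  retries: 3
job:
  <<: *base              # alias: merges the anchored mapping here
  name: nightly-load
```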
21. Another human-readable serialization
format is the property list format used
in NeXTSTEP, GNUstep, and macOS Cocoa.
For large-volume scientific datasets,
such as satellite data and the output of
numerical climate, weather, or ocean models, specific
binary serialization standards have been developed,
e.g. HDF, netCDF and the older GRIB.
22. BIG DATA SERIALIZATION FORMATS:
Big Data serialization also refers to converting data into byte streams,
but it has another goal, which is schema control.
Thanks to a defined data structure (schema), data can be validated in the writing phase.
This avoids surprises when the data is read later,
for example a field that is missing or has a bad type (e.g. a single value instead of an
array).
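A toy sketch of schema control at write time. The schema and field names are invented, and the check is hand-rolled; real Big Data systems delegate this to serialization frameworks, but the principle is the same:

```python
# Hypothetical schema: field name -> required Python type
SCHEMA = {"id": int, "tags": list}

def write_record(record):
    # Validate in the writing phase, before the record is serialized
    for field, expected in SCHEMA.items():
        if field not in record:
            raise ValueError(f"missing field: {field}")
        if not isinstance(record[field], expected):
            raise ValueError(f"bad type for {field}: expected {expected.__name__}")
    return record  # the record would be serialized and persisted here

write_record({"id": 1, "tags": ["a", "b"]})  # passes validation

try:
    write_record({"id": 2, "tags": "a"})  # a single value instead of an array
except ValueError as err:
    print(err)  # bad type for tags: expected list
```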
24. To sum up, we can list the following points that characterize
serialization in Big Data systems:
splittability - it is easier to achieve splits on byte streams than
on JSON or XML files.
portability - the schema can be consumed by different languages.
versioning - flexibility to define fields with default values or to
continue using an old schema version.
data integrity - serialization schemas enforce data correctness; errors
can be detected earlier, when data is written.
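The versioning point can be sketched with default values: a reader using a newer schema can still consume records written under an old one. The field names and defaults below are invented for illustration:

```python
import json

# A record written under the old schema, before "region" existed
old_bytes = json.dumps({"id": 7}).encode("utf-8")

# The newer schema adds "region" with a default value
DEFAULTS = {"id": None, "region": "unknown"}

def read_record(raw):
    # Defaults fill in any field the old writer did not know about
    return {**DEFAULTS, **json.loads(raw.decode("utf-8"))}

print(read_record(old_bytes))  # {'id': 7, 'region': 'unknown'}
```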
25. NoSQL solutions are very often associated with the word
schemaless.
Sometimes this leads to maintenance or backward-compatibility
problems.
One of the solutions to these issues in Big Data systems is
serialization frameworks.
This purely theoretical overview explains the role and the
importance of serialization in Big Data systems.
26. Conclusion:
Serialization converts data structures into byte streams.
Compared to JSON or XML formats, serialized byte streams are easier to split.