Big Data Analytics, K. Kiruthika, II M.Sc. Computer Science, Bonsecours College for Women
1. BIG DATA ANALYTICS
Name : K. Kiruthika
Class : II M.Sc. (CS)
Batch : 2017-2019
In-charge staff : M. FLORANCE DAYANA
2. MapReduce Application
Introduction:
MapReduce is a programming model and an
associated implementation for processing and
generating big data sets with
a parallel, distributed algorithm on a cluster.
4. Map: Each worker node applies the map function to its local data and writes the
output to temporary storage.
Shuffle: Worker nodes redistribute data based on the output
keys, such that all data belonging to one key is located on the same worker
node.
Reduce: Worker nodes process each group of output data, per
key, in parallel.
6. MapReduce can be described as a 5-step parallel
and distributed computation:
Prepare the Map() input : The
"MapReduce system" designates Map
processors and assigns each one the input key value K1 it will work on.
Run the user-provided Map()
code : Map() is run exactly once for
each K1 key value, generating output organized
by key values K2.
7. "Shuffle" the Map output to the
Reduce processors :The MapReduce
system designates assigns the K2 key value each
processer.
Run the user-provided Reduce()
code : Reduce() is run the program exactly once
for each K2 key value produced by the Map step.
Produce the final output : The
MapReduce system collects all the Reduce output,
and sorts it final outcome.
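The five steps can be traced with explicit K1/K2 keys. The sketch below uses the classic max-temperature-per-year example; the record layout and function names are invented for illustration, not a Hadoop API:

```python
from itertools import groupby
from operator import itemgetter

# Step 1 - Prepare the Map() input: K1 is a record id, the value is a raw line
records = [(0, "2017,31"), (1, "2018,29"), (2, "2017,35"), (3, "2018,33")]

# Step 2 - Run Map(): exactly once per K1 value, emitting (K2, value) pairs
def map_fn(k1, line):
    year, temp = line.split(",")
    return [(year, int(temp))]

mapped = [pair for k1, line in records for pair in map_fn(k1, line)]

# Step 3 - Shuffle: bring all values for each K2 key together
mapped.sort(key=itemgetter(0))
shuffled = {k2: [v for _, v in grp] for k2, grp in groupby(mapped, key=itemgetter(0))}

# Step 4 - Run Reduce(): exactly once per K2 key
reduced = {k2: max(values) for k2, values in shuffled.items()}

# Step 5 - Produce the final output, sorted by K2
final = sorted(reduced.items())
print(final)  # [('2017', 35), ('2018', 33)]
```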
9. DATA SERIALIZATION AND WORKING WITH COMMON
SERIALIZATION FORMATS:
A container is nothing but a data structure, or an object, used to
store data.
When we transfer data over a network, the container is
converted into a byte stream.
This conversion is referred to as serialization.
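As a sketch, Python's standard-library pickle module shows this container-to-byte-stream round trip (the container contents here are made up for illustration):

```python
import pickle

container = {"name": "sensor-42", "readings": [1.5, 2.0, 3.25]}

# Serialize: the container becomes a byte stream suitable for a network
byte_stream = pickle.dumps(container)
print(type(byte_stream))  # <class 'bytes'>

# Deserialize on the receiving side: the byte stream becomes a container again
restored = pickle.loads(byte_stream)
assert restored == container
```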
11. Uses :
A method of transferring data through the wires
(messaging).
A method of storing data (in databases, on hard disk drives).
A method of remote procedure calls, e.g., as in SOAP.
A method for distributing objects, especially in component-
based software engineering such as COM, CORBA, etc.
A method for detecting changes in time-varying data.
13. Drawbacks:
Serialization breaks the opacity of an abstract data type by potentially exposing
private implementation details.
Trivial implementations that serialize all data members, including private ones, may violate encapsulation.
To discourage competitors from making compatible products, publishers of proprietary software often keep the details of their programs'
serialization formats a trade secret, and some deliberately
obfuscate or even encrypt the serialized data.
Therefore, remote method call architectures such
as CORBA define their serialization formats in detail.
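A small Python sketch of the encapsulation drawback: naive serialization captures every member, including conventionally private ones (the class and attribute names are invented for illustration):

```python
import pickle

class Account:
    def __init__(self, owner, secret_token):
        self.owner = owner
        self._secret_token = secret_token  # "private" by convention

acct = Account("alice", "t0p-s3cret")

# Naive serialization records the whole object state...
data = pickle.dumps(acct)

# ...so the private implementation detail leaks into the byte stream
assert b"t0p-s3cret" in data
```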
15. Serialization formats:
JSON is a lighter plain-text alternative to
XML, commonly used for client-server communication in
web applications.
JSON is based on JavaScript syntax, but is supported in
other programming languages as well.
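A minimal example of JSON's plain-text nature, using Python's standard json module (the record contents are made up):

```python
import json

message = {"user": "kiruthika", "active": True, "scores": [88, 92]}

# Serialize to a plain-text string, e.g. for a client-server exchange
text = json.dumps(message)
print(text)  # {"user": "kiruthika", "active": true, "scores": [88, 92]}

# Any language with a JSON parser can read this back
assert json.loads(text) == message
```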
19. YAML :
YAML includes features that make it more
powerful for serialization, more
"human friendly," and potentially more compact.
These features include a notion of tagging data types and
support for non-hierarchical data structures.
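For illustration, a small hypothetical YAML document showing those features: explicit type tags, and anchors/aliases that let one node be referenced from several places (a non-hierarchical structure). Parsing it in Python would need a third-party library such as PyYAML:

```yaml
# Hypothetical YAML document illustrating tags and anchors
user: kiruthika
active: true
scores: !!seq [88, 92]   # explicit data-type tag
defaults: &base          # anchor: a reusable node
  retries: 3
job:
  <<: *base              # alias: merges the anchored mapping here
  name: nightly-load
```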
21. Another human-readable serialization
format is the property list format used
in NeXTSTEP, GNUstep, and macOS Cocoa.
For large-volume scientific datasets,
such as satellite data and the output of
numerical climate, weather, or ocean models, specific
binary serialization standards have been developed,
e.g. HDF, netCDF and the older GRIB.
22. BIG DATA SERIALIZATION FORMATS:
Big Data serialization also refers to converting data into byte streams,
but it has another goal, which is schema control.
Thanks to a defined data structure (schema), data can be validated in the writing phase.
This avoids surprises when the data is read later,
for example a field that is missing or has a bad type (e.g. a single value instead of an
array).
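A toy sketch of schema control at write time. The schema and field names are invented, and the check is hand-rolled; real Big Data systems delegate this to serialization frameworks, but the principle is the same:

```python
# Hypothetical schema: field name -> required Python type
SCHEMA = {"id": int, "tags": list}

def write_record(record):
    # Validate in the writing phase, before the record is serialized
    for field, expected in SCHEMA.items():
        if field not in record:
            raise ValueError(f"missing field: {field}")
        if not isinstance(record[field], expected):
            raise ValueError(f"bad type for {field}: expected {expected.__name__}")
    return record  # the record would be serialized and persisted here

write_record({"id": 1, "tags": ["a", "b"]})  # passes validation

try:
    write_record({"id": 2, "tags": "a"})  # a single value instead of an array
except ValueError as err:
    print(err)  # bad type for tags: expected list
```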
24. To sum up, we can list the following points that characterize
serialization in Big Data systems:
splittability - it is easier to achieve splits on byte streams than
on JSON or XML files.
portability - the schema can be consumed by different languages.
versioning - flexibility to define fields with default values or to
continue using an old schema version.
data integrity - serialization schemas enforce data correctness; errors
can be detected earlier, when data is written.
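The versioning point can be sketched with default values: a reader using a newer schema can still consume records written under an old one. The field names and defaults below are invented for illustration:

```python
import json

# A record written under the old schema, before "region" existed
old_bytes = json.dumps({"id": 7}).encode("utf-8")

# The newer schema adds "region" with a default value
DEFAULTS = {"id": None, "region": "unknown"}

def read_record(raw):
    # Defaults fill in any field the old writer did not know about
    return {**DEFAULTS, **json.loads(raw.decode("utf-8"))}

print(read_record(old_bytes))  # {'id': 7, 'region': 'unknown'}
```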
25. NoSQL solutions are very often associated with the word
schemaless.
Sometimes this leads to maintenance or backward-compatibility
problems.
One of the solutions to these issues in Big Data systems is
serialization frameworks.
This purely theoretical overview explains the role and the
importance of serialization in Big Data systems.
26. Conclusion:
Serialization converts data structures into byte streams.
Compared to JSON or XML formats, serialized byte streams are easier to split.