BIG DATA ANALYTICS
K.Kiruthika
II M.Sc(CS)
MapReduce Application
Introduction:
MapReduce is a programming model and an associated implementation for processing and generating big data sets with a parallel, distributed algorithm on a cluster.
• Map: each worker node applies the map function to its local data and writes the output to temporary storage.
• Shuffle: worker nodes redistribute data based on the output keys, so that all data belonging to one key is located on the same worker node.
• Reduce: worker nodes process each group of output data, per key, in parallel.
• MapReduce can be viewed as a 5-step parallel and distributed computation:
1. Prepare the Map() input: the "MapReduce system" designates Map processors and assigns each one the input key value K1 it will work on.
2. Run the user-provided Map() code: Map() is run exactly once for each K1 key value, generating output organized by key values K2.
3. "Shuffle" the Map output to the Reduce processors: the MapReduce system designates Reduce processors and assigns each one the K2 key value it will work on.
4. Run the user-provided Reduce() code: Reduce() is run exactly once for each K2 key value produced by the Map step.
5. Produce the final output: the MapReduce system collects all the Reduce output and sorts it to produce the final outcome.
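The five steps above can be sketched as a small word-count program. This is an illustrative single-machine sketch, not a distributed implementation; the function names are our own.

```python
from collections import defaultdict

def map_fn(k1, document):
    # Map: emit (word, 1) for every word; the word acts as key K2.
    for word in document.split():
        yield word, 1

def reduce_fn(k2, values):
    # Reduce: sum all counts collected for one word.
    yield k2, sum(values)

def map_reduce(inputs, map_fn, reduce_fn):
    # Shuffle: group all intermediate values by their K2 key.
    groups = defaultdict(list)
    for k1, v1 in inputs:
        for k2, v2 in map_fn(k1, v1):
            groups[k2].append(v2)
    # Run Reduce once per K2 key and collect the sorted final output.
    output = []
    for k2 in sorted(groups):
        output.extend(reduce_fn(k2, groups[k2]))
    return output

docs = [("doc1", "big data big ideas"), ("doc2", "big data")]
print(map_reduce(docs, map_fn, reduce_fn))
# [('big', 3), ('data', 2), ('ideas', 1)]
```

In a real cluster the shuffle step moves data between worker nodes over the network; here it is simulated by a dictionary keyed on K2.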
DATA SERIALIZATION AND COMMON SERIALIZATION FORMATS:
• A container is nothing but a data structure, or an object to store data in.
• When we transfer data over a network, the container is converted into a byte stream.
• This process is referred to as serialization.
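As a minimal sketch of this process, Python's built-in pickle module can turn a container into a byte stream and back (the record contents here are made up for illustration):

```python
import pickle

# A container (here a dict) holding structured data.
record = {"name": "sensor-1", "readings": [21.5, 22.0, 21.8]}

# Serialization: the container is converted into a byte stream...
payload = pickle.dumps(record)
assert isinstance(payload, bytes)

# ...which can be sent over a network or written to disk, then
# deserialized back into an equal container on the other side.
restored = pickle.loads(payload)
assert restored == record
```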
Uses:
• A method of transferring data over the wire (messaging).
• A method of storing data (in databases, on hard disk drives).
• A method of remote procedure calls, e.g., as in SOAP.
• A method for distributing objects, especially in component-based software engineering such as COM, CORBA, etc.
• A method for detecting changes in time-varying data.
Drawbacks:
• Serialization breaks the opacity of an abstract data type by potentially exposing private implementation details.
• Serializing all data members, including private ones, may violate encapsulation.
• To discourage competitors from making compatible products, publishers of proprietary software often keep the details of their programs' serialization formats a trade secret.
• Some even obfuscate or encrypt the serialized data.
• Therefore, open remote method call architectures such as CORBA define their serialization formats in detail.
Serialization formats:
• XML is commonly used for client-server communication in web applications.
• JSON is a lighter plain-text alternative.
• JSON is based on JavaScript syntax, but is supported in other programming languages as well.
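To illustrate JSON's language independence, here is a short round trip using Python's standard json module (the record values are invented for the example):

```python
import json

# A record expressed as plain-text JSON: readable by any language
# with a JSON parser, not only JavaScript.
record = {"user": "kiruthika", "active": True, "scores": [10, 20]}

text = json.dumps(record)       # Python dict -> JSON string
round_trip = json.loads(text)   # JSON string -> Python dict

assert round_trip == record
print(text)  # {"user": "kiruthika", "active": true, "scores": [10, 20]}
```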
YAML:
• YAML includes features that make it more powerful for serialization, more "human friendly," and potentially more compact.
• These features include a notion of tagging data types and support for non-hierarchical data structures.
• Another human-readable serialization format is the property list format used in NeXTSTEP, GNUstep, and macOS Cocoa.
• For large-volume scientific datasets, such as satellite data and the output of numerical climate, weather, or ocean models, specific binary serialization standards have been developed, e.g. HDF, netCDF and the older GRIB.
BIG DATA SERIALIZATION FORMATS:
• Big Data serialization also refers to converting data into byte streams, but it has an additional goal: schema control.
• Thanks to a defined data structure, data can be validated at the writing phase.
• This avoids surprises when the data is read.
• Example: a field is missing or has the wrong type (e.g., a scalar instead of an array).
• To summarize, the following points characterize serialization in Big Data systems:
• splittability - it is easier to achieve splits on byte streams than on JSON or XML files.
• portability - the schema can be consumed by different languages.
• versioning - flexibility to define fields with default values or to continue using an old schema version.
• data integrity - serialization schemas enforce data correctness; errors can be detected earlier, when data is written.
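The data-integrity point can be sketched in plain Python: validate each record against a declared schema at write time, so bad data is rejected before it is stored rather than discovered at read time. The schema, field names, and helper below are illustrative only, not the API of any real serialization framework.

```python
# Hypothetical schema: field name -> expected Python type.
SCHEMA = {"id": int, "tags": list}

def write_record(record, schema, sink):
    # Reject a record whose fields are missing or have the wrong type.
    for field, expected_type in schema.items():
        if field not in record:
            raise ValueError(f"missing field: {field}")
        if not isinstance(record[field], expected_type):
            raise TypeError(f"bad type for field: {field}")
    sink.append(record)

store = []
write_record({"id": 1, "tags": ["a", "b"]}, SCHEMA, store)  # accepted
try:
    write_record({"id": 2, "tags": "not-a-list"}, SCHEMA, store)
except TypeError:
    pass  # rejected at write time, not discovered when data is read
```

Frameworks built for Big Data apply the same idea, but also encode the records into compact, splittable byte streams.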
• NoSQL solutions are very often associated with the word "schemaless".
• This sometimes leads to maintenance or backward-compatibility problems.
• One solution to these issues in Big Data systems is serialization frameworks.
• This purely theoretical article explains the role and the importance of serialization in Big Data systems.
Conclusion:
• Serialization converts data structures into byte streams.
• Byte-stream serialization formats make data easier to split than JSON or XML.
THANK YOU

Big Data Analytics, K. Kiruthika, II M.Sc. Computer Science, Bon Secours College for Women
