SPAM DETECTION
By Finny, Omkar, Sreenivas
BIG DATA
Big data refers to the large, diverse sets of
information that grow at ever-increasing rates.
The three Vs: Volume, Velocity, Variety
• Apache Spark
⚬ an open-source unified analytics engine
⚬ for large-scale data processing
• Used as a replacement for MapReduce
• Processes data using the Master-Slave principle
Spark SQL
Apache Spark's module for
working with structured data.
Spark Streaming
Lets streaming data be
analyzed in real time.
MLlib
Apache Spark's machine
learning library.
GraphX
Apache Spark's library for
graph analysis.
The biggest harm of spam emails is that, contrary to
popular belief, the recipient bears more of the cost
than the sender.
FLATMAP
Collapses the elements of a collection to create a
single collection with elements of the same type.
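In the spam pipeline, flatMap is what turns a collection of whole emails into one flat collection of individual words. Spark's JavaRDD.flatMap has the same shape as Java's own Stream.flatMap, so the semantics can be sketched without a cluster (the email strings below are made-up sample data):

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class FlatMapDemo {
    // Collapse per-email word arrays into one flat list of words,
    // mirroring what JavaRDD.flatMap does across a distributed dataset.
    static List<String> tokenize(List<String> emails) {
        return emails.stream()
                .flatMap(email -> Arrays.stream(email.split(" ")))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> emails = Arrays.asList("win a free prize", "meeting at noon");
        System.out.println(tokenize(emails));
        // prints [win, a, free, prize, meeting, at, noon]
    }
}
```

Note the collapse: two input elements become seven output elements of the same type (String), rather than two nested lists.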
JAVA RDD
Resilient Distributed Datasets serve as the
building blocks for distributed data processing in
Spark.
SPARK CONTEXT
To perform data analysis with Spark, a Spark
Context is required. It serves as a bridge to access
data within the Spark environment.
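As a minimal sketch of that bridge, assuming Spark's Java API is on the classpath (the app name is a placeholder and "local[*]" is just the local-mode choice, not necessarily what the authors used):

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class ContextSetup {
    public static void main(String[] args) {
        // "local[*]" runs Spark on all local cores; on a cluster the
        // master URL would point at the cluster manager instead.
        SparkConf conf = new SparkConf()
                .setAppName("SpamDetection")   // placeholder app name
                .setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // ... build and process RDDs through sc ...

        sc.stop();
    }
}
```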
1. Creating a Spark Context
2. Java RDD structure
3. Separate RDDs for spam and non-spam emails
4. FlatMap process
5. Text-to-vector transformation
6. Modeling with Naive Bayes
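Put together, these steps can be sketched with Spark's MLlib classes (a sketch under assumptions, not the authors' exact code: the file paths, the 10,000-feature HashingTF size, and the sample message are all placeholders, and the per-email word split here happens inside map so each email keeps its own feature vector):

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.mllib.classification.NaiveBayes;
import org.apache.spark.mllib.classification.NaiveBayesModel;
import org.apache.spark.mllib.feature.HashingTF;
import org.apache.spark.mllib.regression.LabeledPoint;

public class SpamPipeline {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(
                new SparkConf().setAppName("SpamDetection"));

        // Separate RDDs for spam and non-spam emails (placeholder paths).
        JavaRDD<String> spam = sc.textFile("spam.txt");
        JavaRDD<String> ham = sc.textFile("ham.txt");

        // Text-to-vector transformation: hash each email's words into a
        // fixed-size term-frequency vector (10,000 features assumed).
        HashingTF tf = new HashingTF(10_000);
        JavaRDD<LabeledPoint> spamPoints = spam.map(email ->
                new LabeledPoint(1.0, tf.transform(Arrays.asList(email.split(" ")))));
        JavaRDD<LabeledPoint> hamPoints = ham.map(email ->
                new LabeledPoint(0.0, tf.transform(Arrays.asList(email.split(" ")))));

        // Modeling with Naive Bayes on the combined training data.
        JavaRDD<LabeledPoint> training = spamPoints.union(hamPoints).cache();
        NaiveBayesModel model = NaiveBayes.train(training.rdd());

        // Score a new message: 1.0 = predicted spam, 0.0 = predicted ham.
        double label = model.predict(
                tf.transform(Arrays.asList("win a free prize now".split(" "))));
        System.out.println("predicted label: " + label);

        sc.stop();
    }
}
```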
10.5 M records
5 GB of emails
Average working time
CONCLUSION
Spark is faster.