$> spark-start
[ Kafka and Spark streaming ]
cc : https://www.ku.ac.th/scc2009/SCC2009_advance.pdf
Functional parallelization
E = (A + B) * (C + D)
One CPU computes (A + B) while another CPU computes (C + D) in parallel, and the partial results are then combined.
Data parallelization
E = (A * 2), applied element-wise
Input = [1, 123, 512, 46]
Each CPU applies the same operation to its own slice of the data, e.g. one CPU computes 1 * 2 while another computes 123 * 2.
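The data-parallel case above can be sketched in plain Python (illustrative, not Spark code; the helper `double` and the use of `ProcessPoolExecutor` are assumptions for this sketch):

```python
from concurrent.futures import ProcessPoolExecutor

def double(x):
    # The same function is applied to every element of the input,
    # mirroring E = (A * 2) over the whole list.
    return x * 2

data = [1, 123, 512, 46]

if __name__ == "__main__":
    # Data parallelization: each worker process applies `double`
    # to its own share of the input elements.
    with ProcessPoolExecutor() as pool:
        result = list(pool.map(double, data))
    print(result)  # [2, 246, 1024, 92]
```

Spark applies the same idea at cluster scale: the dataset is split across nodes, and each node runs the same function on its own partition.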
MapReduce
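A minimal MapReduce-style word count in plain Python may make the model concrete (the names `map_phase` and `reduce_phase` are illustrative, not the Hadoop or Spark API):

```python
def map_phase(line):
    # Map: emit a (word, 1) pair for every word in the line.
    return [(word, 1) for word in line.split()]

def reduce_phase(pairs):
    # Reduce: group the pairs by key and sum the counts per key.
    counts = {}
    for word, n in pairs:
        counts[word] = counts.get(word, 0) + n
    return counts

lines = ["spark streams data", "kafka feeds spark"]
pairs = [p for line in lines for p in map_phase(line)]
print(reduce_phase(pairs))
# {'spark': 2, 'streams': 1, 'data': 1, 'kafka': 1, 'feeds': 1}
```

In a real cluster, the map phase runs in parallel over partitions of the input, and a shuffle brings all pairs with the same key to the same reducer.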
Key Feature
● General-purpose cluster engine for large-scale data processing.
● Speed − up to 100x faster than Hadoop MapReduce for the logistic regression algorithm.
● Supports multiple languages − Spark provides built-in APIs in Java, Scala, and Python.
● Advanced Analytics − Spark not only supports ‘Map’ and ‘Reduce’. It also supports
SQL queries, Streaming data, Machine learning (ML), and Graph algorithms.
Spark Framework
Spark SQL
cc :
https://subscription.packtpub.com/book/big_data_and_business_intellige
nce/9781785884696/4/ch04lvl1sec26/architecture-of-spark-sql
Spark Streaming
cc : https://databricks.com/glossary/what-is-spark-streaming
Spark MLlib
cc : https://www.youtube.com/watch?v=DBxcua0Vmvk
Spark GraphX
cc : https://www.usenix.org/sites/default/files/conference/protected-files/osdi14_slides_gonzalez.pdf
Spark Cluster
cc : https://docs.microsoft.com/en-us/azure/hdinsight/spark/apache-spark-settings
cc : https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html
Hadoop
cc : https://www.tutorialspoint.com/apache_spark/apache_spark_rdd.htm
Spark
cc : https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-rdd.html#lineage
Resilient Distributed Dataset (RDD)
● Resilient − fault-tolerant with the help of the RDD lineage graph.
● Distributed − data resides across multiple nodes in a cluster.
● Dataset − a collection of partitioned data.
cc : https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-rdd.html#lineage
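Partitioning can be sketched in plain Python (illustrative only; the `partition` helper is an assumption, not how Spark assigns records internally):

```python
def partition(data, num_partitions):
    # Round-robin assignment of records to partitions; each partition
    # could then be processed on a different node of the cluster.
    parts = [[] for _ in range(num_partitions)]
    for i, record in enumerate(data):
        parts[i % num_partitions].append(record)
    return parts

data = list(range(10))
print(partition(data, 3))  # [[0, 3, 6, 9], [1, 4, 7], [2, 5, 8]]
```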
● In-Memory, i.e. data is stored in memory as much (size) and as long (time) as possible.
● Immutable or Read-Only, i.e. it does not change once created and can only be transformed using
transformations to new RDDs.
● Lazy evaluated, i.e. the data inside RDD is not available or transformed until an action is executed that
triggers the execution.
● Cacheable, i.e. you can hold all the data in a persistent "storage" like memory (default and the most
preferred) or disk (the least preferred due to access speed).
● Parallel, i.e. process data in parallel.
● Typed — RDD records have types, e.g. Long in RDD[Long] or (Int, String) in RDD[(Int, String)].
● Partitioned — records are partitioned (split into logical partitions) and distributed across nodes in a cluster.
● Location-Stickiness — RDD can define placement preferences to compute partitions (as close to the
records as possible).
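Three of these properties, immutability, lazy evaluation, and lineage, can be sketched together in plain Python (a conceptual toy, not Spark internals; the class `FakeRDD` is invented for this sketch):

```python
class FakeRDD:
    def __init__(self, source, lineage=()):
        self.source = source    # the original data
        self.lineage = lineage  # recorded transformations, not results

    def map(self, fn):
        # Transformation: returns a NEW FakeRDD (immutability) with the
        # function appended to the lineage; nothing is computed yet.
        return FakeRDD(self.source, self.lineage + (fn,))

    def collect(self):
        # Action: replay the lineage over the source data. This is also
        # how a lost partition can be recomputed for fault tolerance.
        data = self.source
        for fn in self.lineage:
            data = [fn(x) for x in data]
        return data

rdd = FakeRDD([1, 123, 512, 46])
doubled = rdd.map(lambda x: x * 2)  # lazy: no work happens here
print(doubled.collect())            # [2, 246, 1024, 92]
```

The original RDD is never modified; `map` only records what to do, and `collect` (the action) triggers the actual computation.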
cc : https://www.youtube.com/watch?v=x8xXXqvhZq8
Zeppelin
● Simple producer https://github.com/Aorjoa/kafka-simple-producer
● Simple Spark consumer https://github.com/Aorjoa/kafka-simple-consumer
● Spark Streaming twitter consumer https://github.com/Aorjoa/kafka-twitter-comsumer
● Spark Streaming twitter producer https://github.com/Aorjoa/kafka-simple-producer

$ Spark start