
Data science bootcamp day 3

Data Science - CCCS936 - Department of Computer Science, University of Kachchh.


  1. Data Science Bootcamp Day-3. Presented by: Chetan Khatri, Volunteer Teaching Assistant, Data Science Lab, University of Kachchh. Guidance by: Prof. Devji D. Chhanga, University of Kachchh.
  2. Agenda » An introduction to Apache Spark » Apache Spark single node configuration » A MapReduce program on a Spark cluster » An introduction to Apache Kafka » Apache Kafka single node configuration » Creating a topic and pushing messages to it
  3. Spark Terminology » Spark and SQL contexts: A Spark program first creates a SparkContext object. » SparkContext tells Spark how and where to access a cluster. » The program next creates a sqlContext object. » Use the sqlContext to create DataFrames.
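     A minimal sketch of how a standalone Spark 1.x program might create these two objects (in spark-shell they are pre-built as sc and sqlContext); the app name and master URL below are illustrative:

        import org.apache.spark.{SparkConf, SparkContext}
        import org.apache.spark.sql.SQLContext

        // Name the application and point it at a cluster master;
        // "local[*]" runs Spark on all local cores (single-node mode).
        val conf = new SparkConf().setAppName("BootcampDay3").setMaster("local[*]")
        val sc = new SparkContext(conf)

        // SQLContext wraps the SparkContext and is used to create DataFrames.
        val sqlContext = new SQLContext(sc)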
  4. Review: DataFrames. The primary abstraction in Spark: » Immutable once constructed. » Track lineage information to efficiently recompute lost data. » Enable operations on collections of elements in parallel. You construct DataFrames » by parallelizing existing Scala collections (lists), » by transforming an existing Spark DataFrame, » from files in HDFS or any other storage system.
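     The three construction routes above can be tried directly in spark-shell; a small sketch, assuming the pre-built sc and sqlContext, with made-up sample data, column names, and HDFS path:

        // 1) Parallelize an existing Scala collection and name the columns.
        import sqlContext.implicits._
        val people = sc.parallelize(Seq(("Asha", 21), ("Ravi", 25))).toDF("name", "age")

        // 2) Transform an existing DataFrame; the result is a new, immutable DF.
        val adults = people.filter(people("age") >= 21)

        // 3) Read from HDFS or another storage system (Spark 1.6-style reader).
        // val logs = sqlContext.read.text("hdfs:///data/logs.txt")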
  5. Review: DataFrames. Two types of operations: transformations and actions. Transformations are lazy (not computed immediately); a transformed DF is executed only when an action runs on it. DFs can be persisted (cached) in memory or on disk.
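     A small sketch of laziness and caching, reusing the hypothetical people DataFrame from the previous sketch:

        // filter is a transformation: nothing runs yet, Spark only records
        // the lineage from people to adults.
        val adults = people.filter(people("age") >= 21)

        // cache marks the DataFrame for in-memory persistence (also lazy).
        adults.cache()

        // count is an action: the filter finally executes, and the result
        // is materialized into the cache for reuse by later actions.
        adults.count()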
  6. Resilient Distributed Datasets. The untyped Spark abstraction underneath DataFrames: » Immutable once constructed. » Track lineage information to efficiently recompute lost data. » Enable operations on collections of elements in parallel. You construct RDDs » by parallelizing existing Scala collections (lists), » by transforming an existing RDD or DataFrame, » from files in HDFS or any other storage system.
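     The same construction routes sketched for RDDs in spark-shell (names are illustrative; people is the hypothetical DataFrame from the earlier sketch):

        // 1) Parallelize an existing Scala collection into an RDD.
        val nums = sc.parallelize(List(1, 2, 3, 4))

        // 2) Transform an existing RDD; the original is never modified.
        val doubled = nums.map(_ * 2)

        // 3) Drop from a DataFrame to its underlying untyped RDD of Rows.
        val peopleRdd = people.rdd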
  7. When to use DataFrames? You need high-level transformations and actions, and want high-level control over your dataset. You have typed (structured or semi-structured) data. You want DataFrame optimization and performance benefits: » Catalyst optimization engine • 75% reduction in execution time » Project Tungsten off-heap memory management • 75+% reduction in memory usage (less GC)
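     One way to watch Catalyst at work is to ask for the query plan before running anything; a hedged sketch, again assuming the people DataFrame from earlier (the exact plan text varies by Spark version):

        // explain(true) prints the parsed, analyzed, optimized, and
        // physical plans Catalyst derives, without touching any data.
        people.filter(people("age") >= 21).explain(true)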
  8. Apache Spark MapReduce
     1) Start the Apache Spark shell:
        ./bin/spark-shell
     2) Read a text file into an RDD:
        scala> val textFile = sc.textFile("file:///home/chetan306/inputfile.txt")
     3) RDDs have actions, which return values, and transformations, which return pointers to new RDDs. Let's start with a few actions:
        scala> textFile.count()   // number of lines in the file
        scala> textFile.first()   // first line of the file
     4) Now let's use a transformation. The filter transformation returns a new RDD containing the subset of lines that satisfy a predicate:
        scala> val linesWithSpark = textFile.filter(line => line.contains("Spark"))
        scala> linesWithSpark.collect()   // action: get the transformation's output
  9. Apache Spark MapReduce
     5) We can chain transformations and actions together:
        scala> textFile.filter(line => line.contains("Spark")).count()
     6) One common data flow pattern is MapReduce, as popularized by Hadoop. Spark implements MapReduce flows easily:
        scala> val wordCounts = textFile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey((a, b) => a + b)
        scala> wordCounts.collect()   // action: collect the (word, count) pairs
