1 
An Introduction to Spark 
Jai Ranganathan, Senior Director Product Management, Cloudera 
Denny Lee, Senior Director Data Sciences Engineering, Concur
Agenda 
• Cloudera’s Enterprise Data Hub 
• Why Spark? 
• Spark Use Cases 
• Concur Case Study 
• Cloudera and Spark 
• Future of Spark 
2 ©2014 Cloudera, Inc. All rights reserved.
Cloudera’s Enterprise Data Hub 
3 ©2014 Cloudera, Inc. All rights reserved. 
[Diagram: the enterprise data hub platform – storage for any type of data: unified, elastic, resilient, secure]
• Batch processing: MapReduce, Spark
• Analytic SQL: Impala
• Search engine: Solr
• Machine learning: Spark, partners, Mahout, MLlib
• Stream processing: Spark
• Online NoSQL: HBase
• 3rd-party apps
• Workload management: YARN
• Filesystem: HDFS
• Data management: Cloudera Navigator
• System management: Cloudera Manager
• Security: Sentry
Spark: Easy and Fast Big Data 
Easy to Develop
• Rich APIs in Java, Scala, Python
• Interactive shell
• 2-5× less code
Fast to Run
• General execution graphs
• In-memory storage
• Up to 10× faster on disk, 100× in memory
4 ©2014 Cloudera, Inc. All rights reserved.
Easy: Expressive API 
• map • filter • groupBy • sort • union • join • leftOuterJoin • rightOuterJoin
• reduce • count • fold • reduceByKey • groupByKey • cogroup • cross • zip
• sample • take • first • partitionBy • mapWith • pipe • save ...
5 ©2014 Cloudera, Inc. All rights reserved.
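To make the API concrete, here is a minimal sketch (Scala, Spark 1.x RDD API; the input path and the tab-separated record layout are assumptions) that chains several of the operators listed above:

// Hypothetical input: one "userId<TAB>action" record per line.
val events = sc.textFile("hdfs:///data/events")
val pairs = events.map(_.split("\t"))
                  .filter(_.length >= 2)
                  .map(fields => (fields(0), fields(1)))   // (userId, action)

val actionCounts = pairs.map { case (user, _) => (user, 1) }
                        .reduceByKey(_ + _)                // actions per user

actionCounts.sample(withReplacement = false, 0.01)         // 1% sample
            .take(5)
            .foreach(println)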
Example: Logistic Regression

# Logistic regression in PySpark. `spark` is the SparkContext; `readPoint`
# parses a line into a point with label p.y and feature vector p.x; D and
# iterations are defined elsewhere. The input path was elided on the slide.
import numpy
from math import exp

data = spark.textFile(...).map(readPoint).cache()
w = numpy.random.rand(D)  # random initial weight vector

for i in range(iterations):
    # gradient of the logistic loss, summed over all points
    gradient = data \
        .map(lambda p: (1 / (1 + exp(-p.y * w.dot(p.x))) - 1) * p.y * p.x) \
        .reduce(lambda x, y: x + y)
    w -= gradient

print "Final w: %s" % w

6 ©2014 Cloudera, Inc. All rights reserved.
Spark Introduces the Concept of the RDD to Take Advantage of Memory
RDD = Resilient Distributed Dataset
• A memory caching layer that stores data in a distributed, fault-tolerant cache
• Created by parallel transformations on data in stable storage
Two observations (see the sketch below):
a. Can fall back to disk when the dataset does not fit in memory
b. Provides fault tolerance through the concept of lineage
7 ©2014 Cloudera, Inc. All rights reserved.
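A minimal sketch (Scala, Spark 1.x API; the path and record handling are assumptions) illustrating both observations – a storage level that spills to disk, and the lineage Spark keeps so lost partitions can be recomputed:

import org.apache.spark.storage.StorageLevel

val rdd = sc.textFile("hdfs:///data/events")   // data in stable storage
            .filter(_.nonEmpty)
            .map(_.toUpperCase)                // parallel transformations

rdd.persist(StorageLevel.MEMORY_AND_DISK)      // caches in RAM, spills to disk if it doesn't fit
println(rdd.toDebugString)                     // prints the lineage used to rebuild lost partitions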
Fast: Using RAM, Operator Graphs 
In-Memory Caching
• Data partitions are read from RAM instead of disk
Operator Graphs
• Scheduling optimizations
• Fault tolerance
8 ©2014 Cloudera, Inc. All rights reserved.
[Diagram: an operator graph – map, filter, groupBy, join, and take stages over RDDs A–F; shaded boxes mark cached partitions]
Easy: Out of the Box Functionality 
Hadoop Integration
• Standard Hadoop data formats
• Runs under YARN in mixed clusters
Libraries
• MLlib – machine learning toolkit (see the sketch below)
• GraphX (alpha) – graph analytics based on PowerGraph abstractions
• Spark Streaming – near-real-time analytics
• Spark SQL – a direct SQL interface in a Spark application
Language support:
• SparkR (upcoming)
• Java 8
• Schema support in Spark’s APIs
• SQL support in Spark Streaming (upcoming)
9 ©2014 Cloudera, Inc. All rights reserved.
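As a taste of the library support listed above, a minimal sketch (Scala, Spark 1.x MLlib API; the file path and CSV layout are assumptions) that trains a classifier out of the box:

import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors

// Hypothetical input: CSV rows of "label,feature1,feature2,..."
val points = sc.textFile("hdfs:///data/points.csv").map { line =>
  val cols = line.split(',').map(_.toDouble)
  LabeledPoint(cols.head, Vectors.dense(cols.tail))
}.cache()

val model = LogisticRegressionWithSGD.train(points, 20)   // 20 iterations
println(model.weights)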
Logistic Regression Performance 
(Data Fits in Memory) 
[Chart: running time (s) vs. number of iterations (1, 5, 10, 20, 30). Hadoop: ~110 s per iteration. Spark: ~80 s for the first iteration, ~1 s for each further iteration.]
10 ©2014 Cloudera, Inc. All rights reserved.
Spark Streaming 
What is it?
• Runs continuous processing of data using Spark’s core API, extending Spark concepts to fault-tolerant, transformable streams
• Adds “rolling window” operations, e.g. computing rolling averages or counts over the last five minutes of data
Why do you care?
• Same programming paradigm for streaming and batch – reuse knowledge and code in both contexts
• High-level API with automatic DAG generation – simplicity of development
• Excellent throughput – scales easily to very large volumes of data ingest
• Combine elements like MLlib and Oryx into a streaming application
Example use cases (see the sketch below):
• “On-the-fly” ETL as data is ingested into Hadoop/HDFS
• Detecting anomalous behavior and triggering alerts
• Continuous reporting of summary metrics for incoming data
11 ©2014 Cloudera, Inc. All rights reserved.
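A minimal sketch (Scala, Spark Streaming 1.x API; the socket source, port, and record layout are assumptions) of the rolling-window counting described above:

import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(10))            // 10-second micro-batches
val lines = ssc.socketTextStream("localhost", 9999)

// Count records per key over the last five minutes, recomputed each batch.
val counts = lines.map(line => (line.split("\t")(0), 1L))
                  .reduceByKeyAndWindow(_ + _, Seconds(300))

counts.print()
ssc.start()
ssc.awaitTermination()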
Streaming Architectures with Spark 
[Diagram: data sources → integration/ingest layer (Flume, Kafka) → Spark stream processing (data prep, aggregation/scoring) → HDFS for Spark long-term analytics and model building, and HBase for real-time result serving]
12 ©2014 Cloudera, Inc. All rights reserved.
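For the Kafka leg of that pipeline, a minimal sketch using the receiver-based spark-streaming-kafka API of Spark 1.x (the ZooKeeper quorum, consumer group, and topic name are assumptions):

import org.apache.spark.streaming.kafka.KafkaUtils

val kafkaStream = KafkaUtils.createStream(
  ssc,                          // the StreamingContext from the earlier sketch
  "zk1:2181",                   // ZooKeeper quorum
  "spark-consumer-group",       // Kafka consumer group id
  Map("transactions" -> 2))     // topic -> number of receiver threads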
Cloudera Customer Use Cases – Core Spark 
Sector: Financial Services
• Use cases: multiple use cases to calculate VaR for portfolio risk analysis – Monte Carlo simulations as well as Var-Covar methods; ETL pipeline speed-up; analyzing 20 years of stock data
• Replaces: home-grown applications

Sector: Genomics
• Use cases: two use cases to identify disease-causing genes in the full human genome
• Replaces: MySQL engine

Sector: Data services
• Use cases: trend analysis using statistical methods on large data sets; document classification (LDA); fraud analytics
• Replaces: Netezza; net new

Sector: Healthcare
• Use case: calculating Jaccard scores on health care data sets
• Replaces: net new

13 ©2014 Cloudera, Inc. All rights reserved.
Cloudera Customer Use Cases – Streaming 
Sector: Financial Services
• Use case: online fraud detection
• Replaces: net new

Sector: Many
• Use case: continuous ETL

Sector: Retail
• Use cases: online recommender systems; inventory management
• Replaces: custom apps

14 ©2014 Cloudera, Inc. All rights reserved.
15 
Spark at Concur
16 
About Concur 
What do we do?
• The world’s leading provider of spend management solutions and services (Travel, Invoice, TripIt, etc.)
• Global customer base of 20,000 clients and 25 million users
• Processing more than $50 billion in Travel & Expense (T&E) spend each year
17 
About the Speaker 
Who Am I?
• Long-time SQL Server BI guy (24TB Yahoo! cube)
• Project Isotope (Hadoop on Windows and Azure)
• At Concur, helping with Big Data and Data Sciences
18 
A long time ago… 
• We started using Hadoop because:
• It was free – i.e. we didn’t want to pay for a big data warehouse
• We could slowly extract from hundreds of relational data sources, consolidate it, and query it
• We were not thinking about advanced analytics
• We were thinking … “cheaper reporting”
• We had some hardware lying around … let’s cobble it together, and now we have reports
19 
Themes 
Consolidate Visualize Insight Recommend
20 
[Diagram: consolidated data sources – BTS, Travel, Weather, Invoice, Web Analytics, Expense]
Can quickly switch to map mode and determine where most itineraries are from in 2013 
21
22 
Or even quickly plot the airport locations on a map to see that Sun Moon Lake Airport is in the center of Taiwan
23 
Expense Categorization
• One of my receipts, OCRed (shown below)
• One of the issues we’re trying to solve is how to auto-categorize this – so how can we do it?
• The next two slides show a simplistic solution using WordCount
• Note: a real solution should involve machine learning algorithms

Starbucks Store #3313
601 108th Ave NE
Bellevue, WA (425) 646-9602
-------------------------------
Chk 713452
05/14/2014 11:04 AM
1961558 Drawer: 1 Reg: 1
-------------------------------
Bacon Art Brkfst 3.45
Warmed
T1 Latte 2.70
Triple 1.50
Soy 0.60
Gr Vanilla Mac 4.15
Reload Card 50.00
AMEX $50.00
XXXXXXXXXXXXXXXXXX1004
SBUX Card $13.56
SUBTOTAL $62.40
New Caffe Espresso
Frappuccino(R) Blended beverage
Our Signature
Frappuccino(R) roast coffee and
fresh milk, blended with ice.
Topped with our new espresso
whipped cream and new
Italian roast drizzle
24 
Spark assembly has been built with Hive, including Datanucleus jars on classpath 
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.1.0
      /_/

Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_45)
Type in expressions to have them evaluated.
Type :help for more information.
2014-09-07 22:31:21.064 java[1871:15527] Unable to load realm info from SCDynamicStore
14/09/07 22:31:21 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Spark context available as sc. 
scala> val receipt = sc.textFile("/usr/local/Cellar/workspace/data/receipt/receipt.txt") 
receipt: org.apache.spark.rdd.RDD[String] = /usr/local/Cellar/workspace/data/receipt/receipt.txt MappedRDD[1] at textFile at <console>:12
scala> receipt.count 
res0: Long = 30
25 
scala> val words = receipt.flatMap(_.split(" ")) 
words: org.apache.spark.rdd.RDD[String] = FlatMappedRDD[2] at flatMap at <console>:14 
scala> words.count 
res1: Long = 161 
scala> words.distinct.count 
res2: Long = 72 
scala> val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _).map{case (x, y) => (y, x)}.sortByKey(false).map{case (i, j) => (j, i)}
wordCounts: org.apache.spark.rdd.RDD[(String, Int)] = MappedRDD[12] at map at <console>:16
scala> wordCounts.take(12) 
res5: Array[(String, Int)] = Array(("",82), (with,2), (Card,2), (new,2), (-------------------------------,2), (Frappuccino(R),2), (roast,2), (1,2), (and,2), (New,1), (Topped,1), (Starbucks,1))
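As an aside, the swap-sort-swap pattern above can be written more directly with RDD.sortBy (available since Spark 1.0) – a sketch assuming the same words RDD:

val wordCounts = words.map(word => (word, 1))
                      .reduceByKey(_ + _)
                      .sortBy(_._2, ascending = false)   // sort by count, descending
wordCounts.take(12)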
26
[Image: quick view of Android vs. iOS mobile sessions – see editor’s note #27]
27 
What’s next…
• With Spark 1.1:
• Sort-based shuffling
• MLlib: correlations, sampling, feature extraction, decision trees (see the sketch below)
• GraphX: label propagation
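For instance, the new MLlib statistics support can be exercised as in this minimal sketch (Scala, Spark 1.1 MLlib API; the two columns of numbers are made up):

import org.apache.spark.mllib.stat.Statistics

val spend = sc.parallelize(Seq(12.0, 40.0, 7.5, 99.0))    // hypothetical expense amounts
val miles = sc.parallelize(Seq(3.0, 55.0, 1.0, 120.0))    // hypothetical trip distances
println(Statistics.corr(spend, miles, "pearson"))         // Pearson correlation coefficient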
28 
Using AtScale to build up a dimensional model based on the data that is stored within Impala / Hive
29 
Slice and filter the Impala model using Tableau
30 
Spark and Cloudera
Why Cloudera? 
Expertise
• Deep engineering investment – the only distribution vendor with engineering contributions to Spark and actual technical know-how
• Field team, support, training, and services with experience across many Spark use cases
• Driving the roadmap for Spark
Experience
• More customers running Spark than all other distributions put together
• Deployments range from a few nodes to 800+ nodes
• Longest field presence – the first vendor to support Spark, and still one of only two vendors with official support
Partnerships
• Intel partnership brings 15 Spark developers focused on Cloudera customer use cases
• Business relationship with Databricks for joint development on Spark
31 ©2014 Cloudera, Inc. All rights reserved.
Spark Takes Over From MapReduce 
Stage 1 
• Crunch on Spark 
• Search on Spark 
Stage 2 
• Hive on Spark 
• Pig on Spark 
Stage 3 
• MR equivalence 
• Sqoop on Spark 
A Cloudera-led multi-organization effort with MapR, Intel, Databricks, and IBM
32 ©2014 Cloudera, Inc. All rights reserved.
Spark is Great but… 
• Opaque API limitations 
• Debugging and troubleshooting 
• Complex configuration 
Cloudera University: Spark Training
33 ©2014 Cloudera, Inc. All rights reserved.
Questions & Next Steps 
Download Now – www.cloudera.com/download
Spark Training – www.cloudera.com/content/cloudera/en/training/courses/spark-training.html
34 ©2014 Cloudera, Inc. All rights reserved.
35 ©2014 Cloudera, Inc. All rights reserved. 
Thank You


Editor's Notes

  • #4 Cloudera’s enterprise data hub (powered by Hadoop) is a data management platform that provides a unique offering that’s unified, compliance-ready, accessible, and open. The enterprise data hub brings everything together in one unified layer – no copying of data, just a single transparent view that makes it easy to meet auditing and compliance goals. It offers a single, unified solution for storage & serialization, data ingest & egress, security & governance, metadata, and resource management. It is compliance-ready for security and governance (authentication, authorization, encryption, audit, RBAC, lineage) with a single interface and integrated controls. It is accessible through multiple frameworks and familiar tools and skills. And it is completely open: a 100% open-source, Apache-licensed platform, extensible to 3rd-party frameworks, with zero platform lock-in. As mentioned, the enterprise data hub has multiple query frameworks integrated into the platform; one of the newest and most exciting is Spark, an open-source, flexible data processing framework for machine learning and stream processing. Before we dive into Spark, we need to understand why Spark is necessary – and that requires an understanding of MapReduce.
  • #7 Key idea: add “variables” to the “functions” in functional programming
  • #11 This is for a 29 GB dataset on 20 EC2 m1.xlarge machines (4 cores each)
  • #27 Quick view of Android vs. iOS mobile sessions