Remove Duplicates
Basic Spark Functionality
Spark
Spark Core
• Spark Core is the base engine for large-scale
parallel and distributed data processing. It is
responsible for:
• memory management and fault recovery
• scheduling, distributing and monitoring jobs
on a cluster
• interacting with storage systems (setup sketch below)
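A minimal sketch of standing up Spark Core directly, in the Spark 1.x style this deck uses (the app name and master URL are illustrative; in spark-shell, sc is created for you):

import org.apache.spark.{SparkConf, SparkContext}

// SparkContext is the entry point to Spark Core: it schedules and
// monitors jobs, manages memory and recovery, and talks to storage
val conf = new SparkConf().setAppName("demo").setMaster("local[*]")
val sc = new SparkContext(conf)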
Spark Core
• Spark introduces the concept of an RDD (Resilient
Distributed Dataset)
• an immutable, fault-tolerant, distributed collection of objects
that can be operated on in parallel.
• can contain any type of object; it is created by loading an
external dataset or by distributing a collection from the driver
program.
• RDDs support two types of operations:
• Transformations are operations (such as map, filter, join, union,
and so on) that are performed on an RDD and which yield a
new RDD containing the result.
• Actions are operations (such as reduce, count, first, and so
on) that return a value after running a computation on an RDD
(see the sketch below).
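A minimal sketch of the transformation/action split, using the sc that spark-shell provides; the numbers are made up for illustration:

// Distribute a small collection from the driver program
val nums = sc.parallelize(Seq(1, 2, 3, 4, 5))

// Transformations are lazy: they only describe a new RDD
val doubled = nums.map(_ * 2)        // RDD[Int]: 2, 4, 6, 8, 10
val bigOnes = doubled.filter(_ > 4)  // RDD[Int]: 6, 8, 10

// Actions run the computation and return a value to the driver
val total = bigOnes.reduce(_ + _)    // 24
val howMany = bigOnes.count          // 3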
Spark DataFrames
• The DataFrames API is inspired by data frames in R and Python
(pandas), but designed from the ground up to support
modern big data and data science applications:
• Ability to scale from kilobytes of data on a single laptop to
petabytes on a large cluster
• Support for a wide array of data formats and storage systems
• State-of-the-art optimization and code generation through the
Spark SQL Catalyst optimizer
• Seamless integration with all big data tooling and
infrastructure via Spark
• APIs for Python, Java, Scala, and R (in development via
SparkR); a short example follows below
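A quick sketch of the DataFrame API in Scala, in the same Spark 1.x style as the demo at the end of this deck (the CSV path is illustrative):

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)

// Load a CSV via the spark-csv package; Catalyst optimizes the plan
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("/path/to/colleges.csv")

df.select("INSTNM", "CITY").show(5)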
Remove Duplicates
val college = sc.textFile("/Users/marksmith/TulsaTechFest2016/colleges.csv")
college: org.apache.spark.rdd.RDD[String]
val cNoDups = college.distinct
cNoDups: org.apache.spark.rdd.RDD[String]
college: RDD
“as,df,asf”
“qw,e,qw”
“mb,kg,o”
“as,df,asf”
“q3,e,qw”
“mb,kg,o”
“as,df,asf”
“qw,e,qw”
“mb,k2,o”
cNoDups: RDD
“as,df,asf”
“qw,e,qw”
“mb,kg,o”
“q3,e,qw”
“mb,k2,o”
val cRows = college.map(x => x.split(",",-1))
cRows: org.apache.spark.rdd.RDD[Array[String]]
val cKeyRows = cRows.map(x => "%s_%s_%s_%s".format(x(0),x(1),x(2),x(3)) -> x )
cKeyRows: org.apache.spark.rdd.RDD[(String, Array[String])]
college: RDD
“as,df,asf”
“qw,e,qw”
“mb,kg,o”
“as,df,asf”
“q3,e,qw”
“mb,kg,o”
“as,df,asf”
“qw,e,qw”
“mb,k2,o”
cRows: RDD
Array(as,df,asf)
Array(qw,e,qw)
Array(mb,kg,o)
Array(as,df,asf)
Array(q3,e,qw)
Array(mb,kg,o)
Array(as,df,asf)
Array(qw,e,qw)
Array(mb,k2,o)
cKeyRows: RDD
key->Array(as,df,asf)
key->Array(qw,e,qw)
key->Array(mb,kg,o)
key->Array(as,df,asf)
key->Array(q3,e,qw)
key->Array(mb,kg,o)
key->Array(as,df,asf)
key->Array(qw,e,qw)
key->Array(mb,k2,o)
val cGrouped = cKeyRows
.groupBy(x => x._1)
.map(x => (x._1,x._2.to[scala.collection.mutable.ArrayBuffer]))
cGrouped: org.apache.spark.rdd.RDD[(String, scala.collection.mutable.ArrayBuffer[(String, Array[String])])]
val cDups = cGrouped.filter(x => x._2.length > 1)
cKeyRows: RDD
key->Array(as,df,asf)
key->Array(qw,e,qw)
key->Array(mb,kg,o)
key->Array(as,df,asf)
key->Array(q3,e,qw)
key->Array(mb,kg,o)
key->Array(as,df,asf)
key->Array(qw,e,qw)
key->Array(mb,k2,o)
cGrouped: RDD
key->Array(as,df,asf)
Array(as,df,asf)
Array(as,df,asf)
key->Array(mb,kg,o)
Array(mb,kg,o)
key->Array(mb,k2,o)
key->Array(qw,e,qw)
Array(qw,e,qw)
key->Array(q3,e,qw)
val cDups = cGrouped.filter(x => x._2.length > 1)
cDups: org.apache.spark.rdd.RDD[(String, scala.collection.mutable.ArrayBuffer[(String, Array[String])])]
val cNoDups = cGrouped.map(x => x._2(0)._2)
cNoDups: org.apache.spark.rdd.RDD[Array[String]]
cGrouped: RDD
key->Array(as,df,asf)
Array(as,df,asf)
Array(as,df,asf)
key->Array(mb,kg,o)
Array(mb,kg,o)
key->Array(mb,k2,o)
key->Array(qw,e,qw)
Array(qw,e,qw)
key->Array(q3,e,qw)
“as,df,asf”
“qw,e,qw”
“mb,kg,o”
“q3,e,qw”
“mb,k2,o”
cNoDups: RDD cDups: RDD
key->Array(as,df,asf)
Array(as,df,asf)
Array(as,df,asf)
key->Array(mb,kg,o)
Array(mb,kg,o)
key->Array(qw,e,qw)
Array(qw,e,qw)
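Putting the steps above together, a runnable sketch of the key-based dedup (the path and the four key columns mirror the demo; this simplified version skips the ArrayBuffer conversion):

val college  = sc.textFile("/path/to/colleges.csv")

// Split each line into columns; -1 keeps trailing empty fields
val cRows    = college.map(_.split(",", -1))

// Build a composite key from the first four columns
val cKeyRows = cRows.map(x => "%s_%s_%s_%s".format(x(0), x(1), x(2), x(3)) -> x)

// Group rows that share a key; keep one row per key, and flag the rest
val cGrouped = cKeyRows.groupBy(_._1)
val cNoDups  = cGrouped.map(_._2.head._2)      // first row for each key
val cDups    = cGrouped.filter(_._2.size > 1)  // keys seen more than once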
The RDD was originally Spark's primary abstraction, but today the Spark
DataFrames API is considered the primary interaction point with Spark.
RDDs remain available when needed.
What is partitioning in Apache Spark?
Partitioning is the key to using all of your hardware resources while
executing a job.
More partitions = more parallelism
So, conceptually, you should check the number of slots in your hardware:
how many tasks each executor can handle. Each partition may live on a
different executor (example below).
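A minimal sketch of inspecting and changing an RDD's partition count (the path is illustrative; the counts you see depend on your input and cluster):

val college = sc.textFile("/path/to/colleges.csv")

// How many partitions (and hence parallel tasks) this RDD has
println(college.partitions.length)

// Raise parallelism with a full shuffle ...
val wide = college.repartition(16)

// ... or shrink the partition count without one
val narrow = wide.coalesce(4)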
DataFrames
• A DataFrame has a columnar structure; each record is a row.
• You can run statistics naturally, since it works much like SQL
or a Python/R data frame.
• With an RDD, processing the last 7 days of data forces Spark to
scan the entire dataset; with a DataFrame you already have a
time column to work with, so Spark never even sees data older
than 7 days (see the sketch below).
• Easier to program.
• Better performance and more compact storage in the executor heap.
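A hedged sketch of that last-7-days filter, assuming the DataFrame has a date column named date (the column name is illustrative):

import org.apache.spark.sql.functions._

// Keep only rows from the last 7 days; with a columnar, partitioned
// source, Catalyst can push this predicate down and skip older data
val recent = df.filter(col("date") >= date_sub(current_date(), 7))
recent.count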
How does a DataFrame
read less data?
• You can skip partitions while reading the data
using a DataFrame (sketch below):
• Using Parquet
• Skipping data using statistics (e.g. min, max)
• Using partitioning (e.g. year=2015/month=06…)
• Pushing predicates into storage systems.
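A sketch of partition pruning against a directory-partitioned Parquet dataset (the layout and column names are illustrative):

// Data laid out as /path/to/events/year=2015/month=06/part-*.parquet
val events = sqlContext.read.parquet("/path/to/events")

// Spark prunes to the matching directories; other partitions are never read
events.filter("year = 2015 AND month = 6").count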
What is Parquet
• Parquet should be the source format for any operation or ETL. If the
data is in a different format, the preferred approach is to convert the
source to Parquet and then process it.
• If a dataset arrives as JSON or a comma-separated file, first ETL it
to Parquet.
• It limits I/O, scanning/reading only the columns that are needed.
• Parquet is based on a columnar layout, so it compresses better and
saves space.
• Conceptually, Parquet stores the first column's values together, then
the next column's, and so on. So if a query touches only 2 of a table's
3 columns, Parquet won't even read the 3rd one (conversion sketch below).
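A minimal CSV-to-Parquet ETL sketch, using the same spark-csv reader as the demo (paths are illustrative):

// One-time ETL: land the CSV as Parquet, then query the Parquet copy
val raw = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("/path/to/colleges.csv")

raw.write.parquet("/path/to/colleges.parquet")

// Later reads hit the columnar files, touching only the needed columns
val colleges = sqlContext.read.parquet("/path/to/colleges.parquet")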
val college = sc.textFile("/Users/marksmith/TulsaTechFest2016/CollegeScoreCard.csv")
college.count
res2: Long = 7805
val collegeNoDups = college.distinct
collegeNoDups.count
res3: Long = 7805
val college = sc.textFile("/Users/marksmith/TulsaTechFest2016/colleges.csv")
college: org.apache.spark.rdd.RDD[String] = /Users/marksmith/TulsaTechFest2016/colleges.csv MapPartitionsRDD[17] at textFile at <console>:27
val cNoDups = college.distinct
cNoDups: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[20] at distinct at <console>:29
cNoDups.count
res7: Long = 7805
college.count
res8: Long = 9000
val cRows = college.map(x => x.split(",",-1))
cRows: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[21] at map at <console>:29
val cKeyRows = cRows.map(x => "%s_%s_%s_%s".format(x(0),x(1),x(2),x(3)) -> x )
cKeyRows: org.apache.spark.rdd.RDD[(String, Array[String])] = MapPartitionsRDD[22] at map at <console>:31
cKeyRows.take(2)
res11: Array[(String, Array[String])] = Array((UNITID_OPEID_opeid6_INSTNM,Array(
val cGrouped = cKeyRows.groupBy(x => x._1).map(x => (x._1,x._2.to[scala.collection.mutable.ArrayBuffer]))
cGrouped: org.apache.spark.rdd.RDD[(String, scala.collection.mutable.ArrayBuffer[(String, Array[String])])] = MapPartitionsRDD[27] at map at <console>:33
val cDups = cGrouped.filter(x => x._2.length > 1)
cDups: org.apache.spark.rdd.RDD[(String, scala.collection.mutable.ArrayBuffer[(String, Array[String])])] = MapPartitionsRDD[28] at filter at <console>:35
cDups.count
res12: Long = 1195
val cNoDups = cGrouped.map(x => x._2(0))
cNoDups: org.apache.spark.rdd.RDD[(String, Array[String])] = MapPartitionsRDD[29] at map at <console>:35
cNoDups.count
res13: Long = 7805
val cNoDups = cGrouped.map(x => x._2(0)._2)
cNoDups: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[30] at map at <console>:35
cNoDups.take(5)
16/08/04 16:44:24 ERROR Executor: Managed memory leak detected; size = 41227428 bytes, TID = 28
res16: Array[Array[String]] = Array(Array(145725, 00169100, 001691, Illinois Institute of Technology, Chicago, IL, www.iit.edu, npc.collegeboard.org/student/app/iit, 0, 3, 2, 11, 0, 0, 0, 0, 0, 0, 0, 0, 0,
NULL, 520, 640, 640, 740, 520, 640, 580, 690, 580, 25, 30, 24, 31, 26, 33, NULL, NULL, 28, 28, 30, NULL, 1252, 1252, 0, 0, 0.2026, 0, 0, 0, 0.1225, 0, 0, 0.4526, 0.0245
Demo RDD Code
import org.apache.spark.sql.SQLContext
val df = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").option("inferSchema", "true").load("/Users/marksmith/TulsaTechFest2016/colleges.csv")
df: org.apache.spark.sql.DataFrame = [UNITID: int, OPEID: int, opeid6: int, INSTNM: string, CITY: string, STABBR: string, INSTURL: string, NPCURL: string, HCM2: int, PREDDEG: int, CONTROL: int, LOCALE: string, HBCU: string, PBI: string, ANNHI:
string, TRIBAL: string, AANAPII: string, HSI: string, NANTI: string, MENONLY: string, WOMENONLY: string, RELAFFIL: string, SATVR25: string, SATVR75: string, SATMT25: string, SATMT75: string, SATWR25: string, SATWR75: string, SATVRMID: string,
SATMTMID: string, SATWRMID: string, ACTCM25: string, ACTCM75: string, ACTEN25: string, ACTEN75: string, ACTMT25: string, ACTMT75: string, ACTWR25: string, ACTWR75: string, ACTCMMID: string, ACTENMID: string, ACTMTMID: string,
ACTWRMID: string, SAT_AVG: string, SAT_AVG_ALL: string, PCIP01: string, PCIP03: stri...
val dfd = df.distinct
dfd.count
res0: Long = 7804
df.count
res1: Long = 8998
val dfdd = df.dropDuplicates(Array("UNITID", "OPEID", "opeid6", "INSTNM"))
dfdd.count
res2: Long = 7804
val dfCnt = df.groupBy("UNITID", "OPEID", "opeid6", "INSTNM").agg(count("UNITID").alias("cnt"))
res8: org.apache.spark.sql.DataFrame = [UNITID: int, OPEID: int, opeid6: int, INSTNM: string, cnt: bigint]
dfCnt.show
+--------+-------+------+--------------------+---+
| UNITID| OPEID|opeid6| INSTNM|cnt|
+--------+-------+------+--------------------+---+
|10236801| 104703| 1047|Troy University-P...| 2|
|11339705|3467309| 34673|Marinello Schools...| 2|
| 135276| 558500| 5585|Lively Technical ...| 2|
| 145682| 675300| 6753|Illinois Central ...| 2|
| 151111| 181300| 1813|Indiana Universit...| 1|
df.registerTempTable("colleges")
val dfCnt2 = sqlContext.sql("select UNITID, OPEID, opeid6, INSTNM, count(UNITID) as cnt from colleges group by UNITID, OPEID, opeid6, INSTNM")
dfCnt2: org.apache.spark.sql.DataFrame = [UNITID: int, OPEID: int, opeid6: int, INSTNM: string, cnt: bigint]
dfCnt2.show
+--------+-------+------+--------------------+---+
| UNITID| OPEID|opeid6| INSTNM|cnt|
+--------+-------+------+--------------------+---+
|10236801| 104703| 1047|Troy University-P...| 2|
|11339705|3467309| 34673|Marinello Schools...| 2|
| 135276| 558500| 5585|Lively Technical ...| 2|
| 145682| 675300| 6753|Illinois Central ...| 2|
| 151111| 181300| 1813|Indiana Universit...| 1|
| 156921| 696100| 6961|Jefferson Communi...| 1|
Demo DataFrame Code