SD CADD meeting 2016-08-30: Intro to Spark

Yana Valasatava, Postdoctoral Fellow at University of California, San Diego
Introduction to Apache Spark

Peter Rose
peter.rose@rcsb.org
  
Apache Spark is a fast and general engine for large-scale data processing
•  In-memory processing

Successor of Hadoop (MapReduce)
•  File-based processing

http://spark.apache.org/
  
Spark Ecosystem

Apache Spark works in parallel on
•  Multicore laptop, desktop
•  Single server
•  Cluster (need cluster manager)
  
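The three deployment targets above are usually selected through Spark's "master" setting. The sketch below is a plain-Python illustration, not part of the original slides: the helper name master_url, its parameters, and the example host name are hypothetical, while the URL forms it returns (local[*], local[K], spark://HOST:PORT, yarn) are the master URL formats described in the Spark documentation.

```python
def master_url(target, cores=None, host=None, port=7077):
    """Build a Spark master URL for a deployment target (hypothetical helper).

    target: "local" for a multicore laptop, desktop, or single server,
            "standalone" for Spark's own cluster manager,
            "yarn" for a Hadoop YARN cluster.
    """
    if target == "local":
        # local[*] uses all available cores; local[K] uses K worker threads.
        return "local[*]" if cores is None else f"local[{cores}]"
    if target == "standalone":
        if host is None:
            raise ValueError("a standalone cluster needs the master's host name")
        # 7077 is the default port of the standalone cluster manager.
        return f"spark://{host}:{port}"
    if target == "yarn":
        # The cluster location is read from the Hadoop configuration.
        return "yarn"
    raise ValueError(f"unknown target: {target!r}")


laptop = master_url("local")                                  # all cores
server = master_url("local", cores=8)                         # eight threads
cluster = master_url("standalone", host="master.example.org") # assumed host
```

The resulting string would typically be passed to spark-submit via --master, or to SparkConf.setMaster in application code. The same program can then move from a laptop to a cluster without changes to the processing logic.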
Map-Reduce Example

RDD<String>   RDD<String>   PairRDD<String,Integer>   PairRDD<String,Integer>

one to many   one to one
  
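One plausible reading of the diagram above is a word count: lines of text are split into words (one to many), each word becomes a (word, 1) pair (one to one), and pairs with the same word are summed. This is an assumption, since the slide image itself is not preserved. The sketch below uses plain-Python stand-ins so that it runs without a Spark installation; the function names flat_map, map_to_pair, and reduce_by_key and the sample input are illustrative and are not the Spark API.

```python
from itertools import chain


def flat_map(func, records):
    """One to many: each input record may yield several output records."""
    return list(chain.from_iterable(func(r) for r in records))


def map_to_pair(func, records):
    """One to one: each input record yields exactly one (key, value) pair."""
    return [func(r) for r in records]


def reduce_by_key(func, pairs):
    """Merge all values that share a key using a binary function."""
    merged = {}
    for key, value in pairs:
        merged[key] = func(merged[key], value) if key in merged else value
    return merged


# Sample input (made up for illustration); comments give the slide's types.
lines = ["to be or not to be", "to do is to be"]   # RDD<String>
words = flat_map(str.split, lines)                  # RDD<String>
pairs = map_to_pair(lambda w: (w, 1), words)        # PairRDD<String,Integer>
counts = reduce_by_key(lambda a, b: a + b, pairs)   # PairRDD<String,Integer>

print(counts)
```

In Spark the same steps would run on distributed collections rather than Python lists, with the per-key reduction performed in parallel across partitions.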
Scalable machine learning library
Module for running queries on structured data
Data Sources
Module to build scalable fault-tolerant streaming applications
Core Data Structures
  