Apache Spark part of Eindhoven Java Meetup

•

0 likes•96 views

The presentation of Apache Spark by Mylène Reiners during our first Eindhoven Java Meetup (see http://www.opencirclesolutions.nl/eindhoven-java-meetup/).

Data & Analytics

Apache Spark
AN ENGINE FOR LARGE-SCALE DATA PROCESSING

Introducing myself…
• Mylène Reiners
• Architect @Atos
• Focus innovation

Sketching the context
• Big Data
• New insights
• Analytics
• Data discovery

Sketching the context
• Hadoop
• Storing and managing data

Short demo in Scala (shell)
• Simple data analysis
• Read “README.md”
• Count the number of lines

RDD
• Resilient Distributed Dataset
• Creation
• Transformations
• Actions

Java example (accumulator)
JavaRDD<String> rdd = sc.textFile(args[1]);
final Accumulator<Integer> blankLines = sc.accumulator(0);
JavaRDD<String> callSigns = rdd.flatMap(
new FlatMapFunction<String, String>() {
public Iterable<String> call(String line) {
if (line.equals("")) {
blankLines.add(1);
}
return Arrays.asList(line.split(" "));
}});
callSigns.saveAsTextFile("output.txt")

Spark SQL
• Interface for working with (semi)structured data

Hive example
// Import Spark SQL
import org.apache.spark.sql.hive.HiveContext;
// Or if you can't have the hive dependencies
import org.apache.spark.sql.SQLContext;
// Import the JavaSchemaRDD
import org.apache.spark.sql.SchemaRDD;
import org.apache.spark.sql.Row;
(...)
JavaSparkContext ctx = new JavaSparkContext(...);
SQLContext hiveCtx = new HiveContext(ctx);

Hive example (cont’d)
SchemaRDD input = hiveCtx.jsonFile(inputFile);
// Register the input schema RDD
input.registerTempTable("tweets");
// Select tweets based on the retweetCount
SchemaRDD topTweets = hiveCtx.sql(
"SELECT text, retweetCount FROM tweets ORDER BY
retweetCount LIMIT 10");

Spark Streaming
• Acting on data as soon as it arrives
• Dstreams

Example
// Create a StreamingContext with a 1-second batch size from a SparkConf
JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(1));
// Create a DStream from all the input on port 7777
JavaDStream<String> lines = jssc.socketTextStream("localhost", 7777);
// Filter our DStream for lines with "error"
JavaDStream<String> errorLines = lines.filter(new Function<String, Boolean>() {
public Boolean call(String line) {
return line.contains("error");
}});
// Print out the lines with errors
errorLines.print();

Example
// Start our streaming context and wait for it
// to "finish"
jssc.start();
// Wait for the job to finish
jssc.awaitTermination();

Example
// Load the edges as a graph
val graph = GraphLoader.edgeListFile(sc,
"followers.txt")
// Run PageRank
val ranks = graph.pageRank(0.0001).vertices
// Join the ranks with the usernames
val users = sc.textFile("users.txt").map { line =>
val fields = line.split(",")
(fields(0).toLong, fields(1))
}

Example
val ranksByUsername =
users.join(ranks)
.map {case (id, (username, rank))
=> (username, rank)
}
// Print the result
println(ranksByUsername.collect().mkString("n"))

What's hot

Quark Virtualization Engine for Analytics DataWorks Summit/Hadoop Summit

Spark in 15 minChristophe Marchal

Introduction to TitanDB Knoldus Inc.

Database ChoicesLynn Langit

Spark - The Ultimate Scala Collections by Martin OderskySpark Summit

Real-time Fraud Detection for Southeast Asia’s Leading Mobile PlatformScyllaDB

Bleeding Edge DatabasesLynn Langit

SQL Now! How Optiq brings the best of SQL to NoSQL data.Julian Hyde

Spark sql meetupMichael Zhang

Spark - The beginningsDaniel Leon

SFScon18 - Stefano Pampaloni - The SQL revengeSouth Tyrol Free Software Conference

Meetup070416 PresentationsAna Rebelo

Spark and scala course content | Spark and scala course online trainingSelfpaced

Spark and Spark Streaming宇傅

Building a Data Warehouse for Business Analytics using Spark SQL-(Blagoy Kalo...Spark Summit

Cassandra vs. ScyllaDB: Evolutionary DifferencesScyllaDB

Munich March 2015 - Cassandra + Spark OverviewChristopher Batey

Building a Lambda Architecture with Elasticsearch at Yieldbotyieldbot

A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)Spark Summit

Spark IntroductionDataStax Academy

What's hot (20)

Quark Virtualization Engine for Analytics

Spark in 15 min

Introduction to TitanDB

Database Choices

Spark - The Ultimate Scala Collections by Martin Odersky

Real-time Fraud Detection for Southeast Asia’s Leading Mobile Platform

Bleeding Edge Databases

SQL Now! How Optiq brings the best of SQL to NoSQL data.

Spark sql meetup

Spark - The beginnings

SFScon18 - Stefano Pampaloni - The SQL revenge

Meetup070416 Presentations

Spark and scala course content | Spark and scala course online training

Spark and Spark Streaming

Building a Data Warehouse for Business Analytics using Spark SQL-(Blagoy Kalo...

Cassandra vs. ScyllaDB: Evolutionary Differences

Munich March 2015 - Cassandra + Spark Overview

Building a Lambda Architecture with Elasticsearch at Yieldbot

A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)

Spark Introduction

Viewers also liked

Budaya politik dan praktiknyaArya Ningrat

3.giao trinh sql_va_pl_sqlminhduc_cv

Java development with the dynamo frameworkPatrick Deenen

Closing the Knowledge GapCentre of Geographic Sciences (COGS)

What's In A Building?Centre of Geographic Sciences (COGS)

Campsite project presentationTho Xitin

Viewers also liked (6)

Budaya politik dan praktiknya

3.giao trinh sql_va_pl_sql

Java development with the dynamo framework

Closing the Knowledge Gap

What's In A Building?

Campsite project presentation

Similar to Apache Spark part of Eindhoven Java Meetup

Apache Spark RDDsDean Chen

Apache Spark Overview @ ferretAndrii Gakhov

5 Ways to Use Spark to Enrich your Cassandra EnvironmentJim Hatcher

3 Dundee-Spark Overview for C* developersChristopher Batey

Big Data Analytics with Apache SparkMarcoYuriFujiiMelo

Apache Spark and DataStax EnablementVincent Poncet

An Introduction to Sparkjlacefie

An Introduct to Spark - Atlanta Spark Meetupjlacefie

Introduction to Spark - DataFactZDataFactZ

Learning spark ch09 - Spark SQLphanleson

Apache spark - Architecture , Overview & librariesWalaa Hamdy Assy

Apache spark coreThành Nguyễn

Apache Spark - San Diego Big Data Meetup Jan 14th 2015cdmaxime

A Tale of Three Apache Spark APIs: RDDs, DataFrames and Datasets by Jules DamjiData Con LA

Introduction to apache sparkJUGBD

A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...Databricks

Apache spark sneha challa- google pittsburgh-aug 25thSneha Challa

Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"IT Event

Spark: The State of the Art Engine for Big Data ProcessingRamaninder Singh Jhajj

Introduction to Apache SparkMohamed hedi Abidi

Similar to Apache Spark part of Eindhoven Java Meetup (20)

Apache Spark RDDs

Apache Spark Overview @ ferret

5 Ways to Use Spark to Enrich your Cassandra Environment

3 Dundee-Spark Overview for C* developers

Big Data Analytics with Apache Spark

Apache Spark and DataStax Enablement

An Introduction to Spark

An Introduct to Spark - Atlanta Spark Meetup

Introduction to Spark - DataFactZ

Learning spark ch09 - Spark SQL

Apache spark - Architecture , Overview & libraries

Apache spark core

Apache Spark - San Diego Big Data Meetup Jan 14th 2015

A Tale of Three Apache Spark APIs: RDDs, DataFrames and Datasets by Jules Damji

Introduction to apache spark

A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...

Apache spark sneha challa- google pittsburgh-aug 25th

Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"

Spark: The State of the Art Engine for Big Data Processing

Introduction to Apache Spark

Recently uploaded

PKS-TGC-1084-630 - Stage 1 Proposal.pptxPramod Kumar Srivastava

Ukraine War presentation: KNOW THE BASICSAishani27

VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130Suhani Kapoor

Brighton SEO | April 2024 | Data StorytellingNeil Barnes

Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Jack DiGiovanna

Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083

Aminabad Call Girl Agent 9548273370 , Call Girls Service Lucknowmakika9823

VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...Call Girls In Delhi Whatsup 9873940964 Enjoy Unlimited Pleasure

VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...Suhani Kapoor

Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝soniya singh

Dubai Call Girls Wifey O52&786472 Call Girls Dubaihf8803863

Schema on read is obsolete. Welcome metaprogramming..pdfLars Albertsson

Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Delhi Call girls

RA-11058_IRR-COMPRESS Do 198 series of 1998YohFuh

Customer Service Analytics - Make Sense of All Your Data.pptxEmmanuel Dauda

Predicting Employee Churn: A Data-Driven Approach Project PresentationBoston Institute of Analytics

Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfRachmat Ramadhan H

VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...Suhani Kapoor

代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改atducpo

Industrialised data - the key to AI success.pdfLars Albertsson

Recently uploaded (20)

PKS-TGC-1084-630 - Stage 1 Proposal.pptx

Ukraine War presentation: KNOW THE BASICS

VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130

Brighton SEO | April 2024 | Data Storytelling

Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...

Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call

Aminabad Call Girl Agent 9548273370 , Call Girls Service Lucknow

VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...

VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...

Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝

Dubai Call Girls Wifey O52&786472 Call Girls Dubai

Schema on read is obsolete. Welcome metaprogramming..pdf

Best VIP Call Girls Noida Sector 39 Call Me: 8448380779

RA-11058_IRR-COMPRESS Do 198 series of 1998

Customer Service Analytics - Make Sense of All Your Data.pptx

Predicting Employee Churn: A Data-Driven Approach Project Presentation

Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf

VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...

代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改

Industrialised data - the key to AI success.pdf

Apache Spark part of Eindhoven Java Meetup

1. Apache Spark AN ENGINE FOR LARGE-SCALE DATA PROCESSING

2. Introducing myself… • Mylène Reiners • Architect @Atos • Focus innovation

3. Sketching the context • Big Data • New insights • Analytics • Data discovery

4. Sketching the context • Hadoop • Storing and managing data

5. Apache Spark • Speed • General purpose

6. Short demo in Scala (shell) • Simple data analysis • Read “README.md” • Count the number of lines

7. Role of SparkContext (sc)

8. RDD • Resilient Distributed Dataset • Creation • Transformations • Actions

9. RDD • Lazy • Recomputed

10. Java example (accumulator) JavaRDD<String> rdd = sc.textFile(args[1]); final Accumulator<Integer> blankLines = sc.accumulator(0); JavaRDD<String> callSigns = rdd.flatMap( new FlatMapFunction<String, String>() { public Iterable<String> call(String line) { if (line.equals("")) { blankLines.add(1); } return Arrays.asList(line.split(" ")); }}); callSigns.saveAsTextFile("output.txt")

11. Apache Spark stack

12. Spark SQL • Interface for working with (semi)structured data

13. Hive example // Import Spark SQL import org.apache.spark.sql.hive.HiveContext; // Or if you can't have the hive dependencies import org.apache.spark.sql.SQLContext; // Import the JavaSchemaRDD import org.apache.spark.sql.SchemaRDD; import org.apache.spark.sql.Row; (...) JavaSparkContext ctx = new JavaSparkContext(...); SQLContext hiveCtx = new HiveContext(ctx);

14. Hive example (cont’d) SchemaRDD input = hiveCtx.jsonFile(inputFile); // Register the input schema RDD input.registerTempTable("tweets"); // Select tweets based on the retweetCount SchemaRDD topTweets = hiveCtx.sql( "SELECT text, retweetCount FROM tweets ORDER BY retweetCount LIMIT 10");

15. Spark Streaming • Acting on data as soon as it arrives • Dstreams

16. Example // Create a StreamingContext with a 1-second batch size from a SparkConf JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(1)); // Create a DStream from all the input on port 7777 JavaDStream<String> lines = jssc.socketTextStream("localhost", 7777); // Filter our DStream for lines with "error" JavaDStream<String> errorLines = lines.filter(new Function<String, Boolean>() { public Boolean call(String line) { return line.contains("error"); }}); // Print out the lines with errors errorLines.print();

17. Example // Start our streaming context and wait for it // to "finish" jssc.start(); // Wait for the job to finish jssc.awaitTermination();

18. GraphX • Graphdatabase

19. Example // Load the edges as a graph val graph = GraphLoader.edgeListFile(sc, "followers.txt") // Run PageRank val ranks = graph.pageRank(0.0001).vertices // Join the ranks with the usernames val users = sc.textFile("users.txt").map { line => val fields = line.split(",") (fields(0).toLong, fields(1)) }

20. Example val ranksByUsername = users.join(ranks) .map {case (id, (username, rank)) => (username, rank) } // Print the result println(ranksByUsername.collect().mkString("n"))

21. MLib • Machine learning

22. Thank you

Apache Spark part of Eindhoven Java Meetup

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Viewers also liked

Viewers also liked (6)

Similar to Apache Spark part of Eindhoven Java Meetup

Similar to Apache Spark part of Eindhoven Java Meetup (20)

Recently uploaded

Recently uploaded (20)

Apache Spark part of Eindhoven Java Meetup