SlideShare a Scribd company logo
Vitalii Bondarenko
Data Platform Competency Manager at Eleks
Vitaliy.bondarenko@eleks.com
HDInsight: Spark
Advanced in-memory BigData Analytics with Microsoft Azure
Agenda
● Spark Platform
● Spark Core
● Spark Extensions
● Using HDInsight Spark
About me
Vitalii Bondarenko
Data Platform Competency Manager
Eleks
www.eleks.com
20 years in software development
9+ years of developing for MS SQL Server
3+ years of architecting Big Data Solutions
● DW/BI Architect and Technical Lead
● OLTP DB Performance Tuning
●
Big Data Data Platform Architect
Spark Platform
Spark Stack
● Clustered computing platform
● Designed to be fast and general purpose
● Integrated with distributed systems
● API for Python, Scala, Java, clear and understandable code
● Integrated with Big Data and BI Tools
● Integrated with different Data Bases, systems and libraries like Cassanda, Kafka, H2O
● First Apache release 2013, this moth v.2.0 has been released
Map-reduce computations
In-memory map-reduce
Execution Model
Spark Execution
● Shells and Standalone application
● Local and Cluster (Standalone, Yarn, Mesos, Cloud)
Spark Cluster Arhitecture
● Master / Cluster manager
● Cluster allocates resources on nodes
● Master sends app code and tasks tor nodes
● Executers run tasks and cache data
Connect to Cluster
● Local
● SparkContext and Master field
● spark://host:7077
● Spark-submit
DEMO: Execution Environments
● Local Spark installation
● Shells and Notebook
● Spark Examples
● HDInsight Spark Cluster
● SSH connection to Spark in Azure
● Jupyter Notebook connected to HDInsight Spark
Spark Core
RDD: resilient distributed dataset
● Parallelized collections with fault-tolerant (Hadoop datasets)
● Transformations set new RDDs (filter, map, distinct, union, subtract, etc)
● Actions call to calculations (count, collect, first)
● Transformations are lazy
● Actions trigger transformations computation
● Broadcast Variables send data to executors
● Accumulators collect data on driver
inputRDD = sc.textFile("log.txt")
errorsRDD = inputRDD.filter(lambda x: "error" in x)
warningsRDD = inputRDD.filter(lambda x: "warning" in x)
badLinesRDD = errorsRDD.union(warningsRDD)
print "Input had " + badLinesRDD.count() + " concerning lines"
Spark program scenario
● Create RDD (loading external datasets, parallelizing a
collection on driver)
● Transform
● Persist intermediate RDDs as results
● Launch actions
Persistence (Caching)
● Avoid recalculations
● 10x faster in-memory
● Fault-tolerant
● Persistence levels
● Persist before first action
input = sc.parallelize(xrange(1000))
result = input.map(lambda x: x ** x)
result.persist(StorageLevel.MEMORY_ONLY)
result.count()
result.collect()
Transformations (1)
Transformations (2)
Actions (1)
Actions (2)
Data Partitioning
● userData.join(events)
● userData.partitionBy(100).persist()
● 3-4 partitions on CPU Core
● userData.join(events).mapValues(...).reduceByKey(...)
DEMO: Spark Core Operations
● Transformations
● Actions
Spark Extensions
Spark Streaming Architecture
● Micro-batch architecture
● SparkStreaming Concext
● Batch interval from 500ms
●
Transformation on Spark Engine
●
Outup operations instead of Actions
● Different sources and outputs
Spark Streaming Example
from pyspark.streaming import StreamingContext
ssc = StreamingContext(sc, 1)
input_stream = ssc.textFileStream("sampleTextDir")
word_pairs = input_stream.flatMap(
lambda l:l.split(" ")).map(lambda w: (w,1))
counts = word_pairs.reduceByKey(lambda x,y: x + y)
counts.print()
ssc.start()
ssc.awaitTermination()
● Process RDDs in batches
● Start after ssc.start()
● Output to console on Driver
● Awaiting termination
Streaming on a Cluster
● Receivers with replication
● SparkContext on Driver
● Output from Exectors in batches saveAsHadoopFiles()
● spark-submit for creating and scheduling periodical streaming jobs
● Chekpointing for saving results and restore from the point ssc.checkpoint(“hdfs://...”)
Streaming Transformations
● DStreams
●
Stateless transformantions
● Stagefull transformantions
● Windowed transformantions
● UpdateStateByKey
● ReduceByWindow, reduceByKeyAndWindow
● Recomended batch size from 10 sec
val ipDStream = accessLogsDStream.map(logEntry => (logEntry.getIpAddress(), 1))
val ipCountDStream = ipDStream.reduceByKeyAndWindow(
{(x, y) => x + y}, // Adding elements in the new batches entering the window
{(x, y) => x - y}, // Removing elements from the oldest batches exiting the window
Seconds(30), // Window duration
Seconds(10)) // Slide duration
DEMO: Spark Streaming
● Simple streaming with PySpark
Spark SQL
●
SparkSQL interface for working with structured data by SQL
●
Works with Hive tables and HiveQL
● Works with files (Json, Parquet etc) with defined schema
●
JDBC/ODBC connectors for BI tools
●
Integrated with Hive and Hive types, uses HiveUDF
●
DataFrame abstraction
Spark DataFrames
● hiveCtx.cacheTable("tableName"), in-memory, column-store, while driver is alive
● df.show()
● df.select(“name”, df(“age”)+1)
● df.filtr(df(“age”) > 19)
● df.groupBy(df(“name”)).min()
# Import Spark SQLfrom pyspark.sql
import HiveContext, Row
# Or if you can't include the hive requirementsfrom pyspark.sql
import SQLContext, Row
sc = new SparkContext(...)
hiveCtx = HiveContext(sc)
sqlContext = SQLContext(sc)
input = hiveCtx.jsonFile(inputFile)
# Register the input schema RDD
input.registerTempTable("tweets")
# Select tweets based on the retweet
CounttopTweets = hiveCtx.sql("""SELECT text, retweetCount FROM tweets ORDER BY retweetCount LIMIT 10""")
Catalyst: Query Optimizer
● Analysis: map tables, columns, function, create a logical plan
● Logical Optimization: applies rules and optimize the plan
● Physical Planing: physical operator for the logical plan execution
● Cost estimation
SELECT name
FROM (
SELECT id, name
FROM People) p
WHERE p.id = 1
DEMO: Using SparkSQL
● Simple SparkSQL querying
● Data Frames
● Data exploration with SparkSQL
● Connect from BI
Spark ML
Spark ML
●
Classification
●
Regression
●
Clustering
● Recommendation
●
Feature transformation, selection
●
Statistics
●
Linear algebra
●
Data mining tools
Pipeline Cmponents
●
DataFrame
●
Transformer
●
Estimator
● Pipeline
●
Parameter
Logistic Regression
DEMO: Spark ML
● Training a model
● Data visualization
New in Spark 2.0
● Unifying DataFrames and Datasets in Scala/Java (compile time
syntax and analysis errors). Same performance and convertible.
● SparkSession: a new entry point that supersedes SQLContext and
HiveContext.
● Machine learning pipeline persistence
● Distributed algorithms in R
● Faster Optimizer
● Structured Streaming
New in Spark 2.0
spark = SparkSession
.builder()
.appName("StructuredNetworkWordCount")
.getOrCreate()
# Create DataFrame representing the stream of input lines from connection to localhost:9999
lines = spark
.readStream
.format('socket')
.option('host', 'localhost')
.option('port', 9999)
.load()
# Split the lines into words
words = lines.select(
explode(
split(lines.value, ' ')
).alias('word')
)
# Generate running word count
wordCounts = words.groupBy('word').count()
# Start running the query that prints the running counts to the console
query = wordCounts
.writeStream
.outputMode('complete')
.format('console')
.start()
query.awaitTermination()
windowedCounts = words.groupBy(
window(words.timestamp, '10 minutes', '5 minutes'),
words.word
).count()
HDInsight: Spark
Spark in Azure
HDInsight benefits
● Ease of creating clusters (Azure portal, PowerShell, .Net SDK)
●
Ease of use (noteboks, azure control panels)
●
REST APIs (Livy: job server)
●
Support for Azure Data Lake Store (adl://)
●
Integration with Azure services (EventHub, Kafka)
●
Support for R Server (HDInsight R over Spark)
● Integration with IntelliJ IDEA (Plugin, create and submit apps)
● Concurrent Queries (many users and connections)
● Caching on SSDs (SSD as persist method)
● Integration with BI Tools (connectors for PowerBI and Tableau)
● Pre-loaded Anaconda libraries (200 libraries for ML)
● Scalability (change number of nodes and start/stop cluster)
● 24/7 Support (99% up-time)
HDInsight Spark Scenarious
1. Streaming data, IoT and real-time analytics
2. Visual data exploration and interactive analysis (HDFS)
3. Spark with NoSQL (HBase and Azure DocumentDB)
4. Spark with Data Lake
5. Spark with SQL Data Warehouse
6. Machine Learning using R Server, Mllib
7. Putting it all together in a notebook experience
8. Using Excel with Spark
Q&A

More Related Content

What's hot

Spark Community Update - Spark Summit San Francisco 2015
Spark Community Update - Spark Summit San Francisco 2015Spark Community Update - Spark Summit San Francisco 2015
Spark Community Update - Spark Summit San Francisco 2015
Databricks
 
Optimizing Presto Connector on Cloud Storage
Optimizing Presto Connector on Cloud StorageOptimizing Presto Connector on Cloud Storage
Optimizing Presto Connector on Cloud Storage
Kai Sasaki
 
Cassandra Tools and Distributed Administration (Jeffrey Berger, Knewton) | C*...
Cassandra Tools and Distributed Administration (Jeffrey Berger, Knewton) | C*...Cassandra Tools and Distributed Administration (Jeffrey Berger, Knewton) | C*...
Cassandra Tools and Distributed Administration (Jeffrey Berger, Knewton) | C*...
DataStax
 
Solving Low Latency Query Over Big Data with Spark SQL-(Julien Pierre, Micros...
Solving Low Latency Query Over Big Data with Spark SQL-(Julien Pierre, Micros...Solving Low Latency Query Over Big Data with Spark SQL-(Julien Pierre, Micros...
Solving Low Latency Query Over Big Data with Spark SQL-(Julien Pierre, Micros...
Spark Summit
 
Modus operandi of Spark Streaming - Recipes for Running your Streaming Applic...
Modus operandi of Spark Streaming - Recipes for Running your Streaming Applic...Modus operandi of Spark Streaming - Recipes for Running your Streaming Applic...
Modus operandi of Spark Streaming - Recipes for Running your Streaming Applic...
DataWorks Summit
 
Adding Complex Data to Spark Stack by Tug Grall
Adding Complex Data to Spark Stack by Tug GrallAdding Complex Data to Spark Stack by Tug Grall
Adding Complex Data to Spark Stack by Tug Grall
Spark Summit
 
AWS Big Data Demystified #2 | Athena, Spectrum, Emr, Hive
AWS Big Data Demystified #2 |  Athena, Spectrum, Emr, Hive AWS Big Data Demystified #2 |  Athena, Spectrum, Emr, Hive
AWS Big Data Demystified #2 | Athena, Spectrum, Emr, Hive
Omid Vahdaty
 
Engineering fast indexes
Engineering fast indexesEngineering fast indexes
Engineering fast indexes
Daniel Lemire
 
Big Data Day LA 2015 - Sparking up your Cassandra Cluster- Analytics made Awe...
Big Data Day LA 2015 - Sparking up your Cassandra Cluster- Analytics made Awe...Big Data Day LA 2015 - Sparking up your Cassandra Cluster- Analytics made Awe...
Big Data Day LA 2015 - Sparking up your Cassandra Cluster- Analytics made Awe...
Data Con LA
 
Exactly-Once Streaming from Kafka-(Cody Koeninger, Kixer)
Exactly-Once Streaming from Kafka-(Cody Koeninger, Kixer)Exactly-Once Streaming from Kafka-(Cody Koeninger, Kixer)
Exactly-Once Streaming from Kafka-(Cody Koeninger, Kixer)
Spark Summit
 
Spark Summit East 2015 Advanced Devops Student Slides
Spark Summit East 2015 Advanced Devops Student SlidesSpark Summit East 2015 Advanced Devops Student Slides
Spark Summit East 2015 Advanced Devops Student Slides
Databricks
 
Spark on Mesos-A Deep Dive-(Dean Wampler and Tim Chen, Typesafe and Mesosphere)
Spark on Mesos-A Deep Dive-(Dean Wampler and Tim Chen, Typesafe and Mesosphere)Spark on Mesos-A Deep Dive-(Dean Wampler and Tim Chen, Typesafe and Mesosphere)
Spark on Mesos-A Deep Dive-(Dean Wampler and Tim Chen, Typesafe and Mesosphere)
Spark Summit
 
Four Things to Know About Reliable Spark Streaming with Typesafe and Databricks
Four Things to Know About Reliable Spark Streaming with Typesafe and DatabricksFour Things to Know About Reliable Spark Streaming with Typesafe and Databricks
Four Things to Know About Reliable Spark Streaming with Typesafe and Databricks
Legacy Typesafe (now Lightbend)
 
Using Apache Spark as ETL engine. Pros and Cons
Using Apache Spark as ETL engine. Pros and Cons          Using Apache Spark as ETL engine. Pros and Cons
Using Apache Spark as ETL engine. Pros and Cons
Provectus
 
Automated Machine Learning Using Spark Mllib to Improve Customer Experience-(...
Automated Machine Learning Using Spark Mllib to Improve Customer Experience-(...Automated Machine Learning Using Spark Mllib to Improve Customer Experience-(...
Automated Machine Learning Using Spark Mllib to Improve Customer Experience-(...
Spark Summit
 
Using Spark to Load Oracle Data into Cassandra
Using Spark to Load Oracle Data into CassandraUsing Spark to Load Oracle Data into Cassandra
Using Spark to Load Oracle Data into Cassandra
Jim Hatcher
 
Tuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkTuning and Debugging in Apache Spark
Tuning and Debugging in Apache Spark
Databricks
 
Cassandra & Spark for IoT
Cassandra & Spark for IoTCassandra & Spark for IoT
Cassandra & Spark for IoT
Matthias Niehoff
 
Spark ETL Techniques - Creating An Optimal Fantasy Baseball Roster
Spark ETL Techniques - Creating An Optimal Fantasy Baseball RosterSpark ETL Techniques - Creating An Optimal Fantasy Baseball Roster
Spark ETL Techniques - Creating An Optimal Fantasy Baseball Roster
Don Drake
 
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
Databricks
 

What's hot (20)

Spark Community Update - Spark Summit San Francisco 2015
Spark Community Update - Spark Summit San Francisco 2015Spark Community Update - Spark Summit San Francisco 2015
Spark Community Update - Spark Summit San Francisco 2015
 
Optimizing Presto Connector on Cloud Storage
Optimizing Presto Connector on Cloud StorageOptimizing Presto Connector on Cloud Storage
Optimizing Presto Connector on Cloud Storage
 
Cassandra Tools and Distributed Administration (Jeffrey Berger, Knewton) | C*...
Cassandra Tools and Distributed Administration (Jeffrey Berger, Knewton) | C*...Cassandra Tools and Distributed Administration (Jeffrey Berger, Knewton) | C*...
Cassandra Tools and Distributed Administration (Jeffrey Berger, Knewton) | C*...
 
Solving Low Latency Query Over Big Data with Spark SQL-(Julien Pierre, Micros...
Solving Low Latency Query Over Big Data with Spark SQL-(Julien Pierre, Micros...Solving Low Latency Query Over Big Data with Spark SQL-(Julien Pierre, Micros...
Solving Low Latency Query Over Big Data with Spark SQL-(Julien Pierre, Micros...
 
Modus operandi of Spark Streaming - Recipes for Running your Streaming Applic...
Modus operandi of Spark Streaming - Recipes for Running your Streaming Applic...Modus operandi of Spark Streaming - Recipes for Running your Streaming Applic...
Modus operandi of Spark Streaming - Recipes for Running your Streaming Applic...
 
Adding Complex Data to Spark Stack by Tug Grall
Adding Complex Data to Spark Stack by Tug GrallAdding Complex Data to Spark Stack by Tug Grall
Adding Complex Data to Spark Stack by Tug Grall
 
AWS Big Data Demystified #2 | Athena, Spectrum, Emr, Hive
AWS Big Data Demystified #2 |  Athena, Spectrum, Emr, Hive AWS Big Data Demystified #2 |  Athena, Spectrum, Emr, Hive
AWS Big Data Demystified #2 | Athena, Spectrum, Emr, Hive
 
Engineering fast indexes
Engineering fast indexesEngineering fast indexes
Engineering fast indexes
 
Big Data Day LA 2015 - Sparking up your Cassandra Cluster- Analytics made Awe...
Big Data Day LA 2015 - Sparking up your Cassandra Cluster- Analytics made Awe...Big Data Day LA 2015 - Sparking up your Cassandra Cluster- Analytics made Awe...
Big Data Day LA 2015 - Sparking up your Cassandra Cluster- Analytics made Awe...
 
Exactly-Once Streaming from Kafka-(Cody Koeninger, Kixer)
Exactly-Once Streaming from Kafka-(Cody Koeninger, Kixer)Exactly-Once Streaming from Kafka-(Cody Koeninger, Kixer)
Exactly-Once Streaming from Kafka-(Cody Koeninger, Kixer)
 
Spark Summit East 2015 Advanced Devops Student Slides
Spark Summit East 2015 Advanced Devops Student SlidesSpark Summit East 2015 Advanced Devops Student Slides
Spark Summit East 2015 Advanced Devops Student Slides
 
Spark on Mesos-A Deep Dive-(Dean Wampler and Tim Chen, Typesafe and Mesosphere)
Spark on Mesos-A Deep Dive-(Dean Wampler and Tim Chen, Typesafe and Mesosphere)Spark on Mesos-A Deep Dive-(Dean Wampler and Tim Chen, Typesafe and Mesosphere)
Spark on Mesos-A Deep Dive-(Dean Wampler and Tim Chen, Typesafe and Mesosphere)
 
Four Things to Know About Reliable Spark Streaming with Typesafe and Databricks
Four Things to Know About Reliable Spark Streaming with Typesafe and DatabricksFour Things to Know About Reliable Spark Streaming with Typesafe and Databricks
Four Things to Know About Reliable Spark Streaming with Typesafe and Databricks
 
Using Apache Spark as ETL engine. Pros and Cons
Using Apache Spark as ETL engine. Pros and Cons          Using Apache Spark as ETL engine. Pros and Cons
Using Apache Spark as ETL engine. Pros and Cons
 
Automated Machine Learning Using Spark Mllib to Improve Customer Experience-(...
Automated Machine Learning Using Spark Mllib to Improve Customer Experience-(...Automated Machine Learning Using Spark Mllib to Improve Customer Experience-(...
Automated Machine Learning Using Spark Mllib to Improve Customer Experience-(...
 
Using Spark to Load Oracle Data into Cassandra
Using Spark to Load Oracle Data into CassandraUsing Spark to Load Oracle Data into Cassandra
Using Spark to Load Oracle Data into Cassandra
 
Tuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkTuning and Debugging in Apache Spark
Tuning and Debugging in Apache Spark
 
Cassandra & Spark for IoT
Cassandra & Spark for IoTCassandra & Spark for IoT
Cassandra & Spark for IoT
 
Spark ETL Techniques - Creating An Optimal Fantasy Baseball Roster
Spark ETL Techniques - Creating An Optimal Fantasy Baseball RosterSpark ETL Techniques - Creating An Optimal Fantasy Baseball Roster
Spark ETL Techniques - Creating An Optimal Fantasy Baseball Roster
 
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
 

Viewers also liked

Lianjia data infrastructure, Yi Lyu
Lianjia data infrastructure, Yi LyuLianjia data infrastructure, Yi Lyu
Lianjia data infrastructure, Yi Lyu
毅 吕
 
Enabling Fast Data Strategy: What’s new in Denodo Platform 6.0
Enabling Fast Data Strategy: What’s new in Denodo Platform 6.0Enabling Fast Data Strategy: What’s new in Denodo Platform 6.0
Enabling Fast Data Strategy: What’s new in Denodo Platform 6.0
Denodo
 
SAMOA: A Platform for Mining Big Data Streams (Apache BigData North America 2...
SAMOA: A Platform for Mining Big Data Streams (Apache BigData North America 2...SAMOA: A Platform for Mining Big Data Streams (Apache BigData North America 2...
SAMOA: A Platform for Mining Big Data Streams (Apache BigData North America 2...
Nicolas Kourtellis
 
SAMOA: A Platform for Mining Big Data Streams (Apache BigData Europe 2015)
SAMOA: A Platform for Mining Big Data Streams (Apache BigData Europe 2015)SAMOA: A Platform for Mining Big Data Streams (Apache BigData Europe 2015)
SAMOA: A Platform for Mining Big Data Streams (Apache BigData Europe 2015)
Nicolas Kourtellis
 
Callcenter HPE IDOL overview
Callcenter HPE IDOL overviewCallcenter HPE IDOL overview
Callcenter HPE IDOL overview
Tania Akinina
 
ANTS - 360 view of your customer - bigdata innovation summit 2016
ANTS - 360 view of your customer - bigdata innovation summit 2016ANTS - 360 view of your customer - bigdata innovation summit 2016
ANTS - 360 view of your customer - bigdata innovation summit 2016
Dinh Le Dat (Kevin D.)
 
クラウドを活用した自由自在なデータ分析
クラウドを活用した自由自在なデータ分析クラウドを活用した自由自在なデータ分析
クラウドを活用した自由自在なデータ分析
aiichiro
 
Oxalide MorningTech #1 - BigData
Oxalide MorningTech #1 - BigDataOxalide MorningTech #1 - BigData
Oxalide MorningTech #1 - BigData
Ludovic Piot
 
BigData HUB Workshop
BigData HUB WorkshopBigData HUB Workshop
BigData HUB Workshop
Ahmed Salman
 
GCPUG meetup 201610 - Dataflow Introduction
GCPUG meetup 201610 - Dataflow IntroductionGCPUG meetup 201610 - Dataflow Introduction
GCPUG meetup 201610 - Dataflow Introduction
Simon Su
 
[분석]서울시 2030 나홀로족을 위한 라이프 가이드북
[분석]서울시 2030 나홀로족을 위한 라이프 가이드북[분석]서울시 2030 나홀로족을 위한 라이프 가이드북
[분석]서울시 2030 나홀로족을 위한 라이프 가이드북
BOAZ Bigdata
 
BigData & Hadoop - Technology Latinoware 2016
BigData & Hadoop - Technology Latinoware 2016BigData & Hadoop - Technology Latinoware 2016
BigData & Hadoop - Technology Latinoware 2016
Thiago Santiago
 
Bigdata analytics and our IoT gateway
Bigdata analytics and our IoT gateway Bigdata analytics and our IoT gateway
Bigdata analytics and our IoT gateway
Freek van Gool
 
Big Data Patients and New Requirements for Clinical Systems
Big Data Patients and New Requirements for Clinical SystemsBig Data Patients and New Requirements for Clinical Systems
Big Data Patients and New Requirements for Clinical Systems
Alexandre Prozoroff
 
SparkR - Scalable machine learning - Utah R Users Group - U of U - June 17th
SparkR - Scalable machine learning - Utah R Users Group - U of U - June 17thSparkR - Scalable machine learning - Utah R Users Group - U of U - June 17th
SparkR - Scalable machine learning - Utah R Users Group - U of U - June 17th
Alton Alexander
 
SparkR: The Past, the Present and the Future-(Shivaram Venkataraman and Rui S...
SparkR: The Past, the Present and the Future-(Shivaram Venkataraman and Rui S...SparkR: The Past, the Present and the Future-(Shivaram Venkataraman and Rui S...
SparkR: The Past, the Present and the Future-(Shivaram Venkataraman and Rui S...
Spark Summit
 
Galvanise NYC - Scaling R with Hadoop & Spark. V1.0
Galvanise NYC - Scaling R with Hadoop & Spark. V1.0Galvanise NYC - Scaling R with Hadoop & Spark. V1.0
Galvanise NYC - Scaling R with Hadoop & Spark. V1.0vithakur
 
O αργυροπελεκάνος
O αργυροπελεκάνοςO αργυροπελεκάνος
O αργυροπελεκάνος
70athinon
 
Javaepic
JavaepicJavaepic
Javaepic
trupti Deshmukh
 
learn Onlinejava
learn Onlinejavalearn Onlinejava
learn Onlinejava
trupti Deshmukh
 

Viewers also liked (20)

Lianjia data infrastructure, Yi Lyu
Lianjia data infrastructure, Yi LyuLianjia data infrastructure, Yi Lyu
Lianjia data infrastructure, Yi Lyu
 
Enabling Fast Data Strategy: What’s new in Denodo Platform 6.0
Enabling Fast Data Strategy: What’s new in Denodo Platform 6.0Enabling Fast Data Strategy: What’s new in Denodo Platform 6.0
Enabling Fast Data Strategy: What’s new in Denodo Platform 6.0
 
SAMOA: A Platform for Mining Big Data Streams (Apache BigData North America 2...
SAMOA: A Platform for Mining Big Data Streams (Apache BigData North America 2...SAMOA: A Platform for Mining Big Data Streams (Apache BigData North America 2...
SAMOA: A Platform for Mining Big Data Streams (Apache BigData North America 2...
 
SAMOA: A Platform for Mining Big Data Streams (Apache BigData Europe 2015)
SAMOA: A Platform for Mining Big Data Streams (Apache BigData Europe 2015)SAMOA: A Platform for Mining Big Data Streams (Apache BigData Europe 2015)
SAMOA: A Platform for Mining Big Data Streams (Apache BigData Europe 2015)
 
Callcenter HPE IDOL overview
Callcenter HPE IDOL overviewCallcenter HPE IDOL overview
Callcenter HPE IDOL overview
 
ANTS - 360 view of your customer - bigdata innovation summit 2016
ANTS - 360 view of your customer - bigdata innovation summit 2016ANTS - 360 view of your customer - bigdata innovation summit 2016
ANTS - 360 view of your customer - bigdata innovation summit 2016
 
クラウドを活用した自由自在なデータ分析
クラウドを活用した自由自在なデータ分析クラウドを活用した自由自在なデータ分析
クラウドを活用した自由自在なデータ分析
 
Oxalide MorningTech #1 - BigData
Oxalide MorningTech #1 - BigDataOxalide MorningTech #1 - BigData
Oxalide MorningTech #1 - BigData
 
BigData HUB Workshop
BigData HUB WorkshopBigData HUB Workshop
BigData HUB Workshop
 
GCPUG meetup 201610 - Dataflow Introduction
GCPUG meetup 201610 - Dataflow IntroductionGCPUG meetup 201610 - Dataflow Introduction
GCPUG meetup 201610 - Dataflow Introduction
 
[분석]서울시 2030 나홀로족을 위한 라이프 가이드북
[분석]서울시 2030 나홀로족을 위한 라이프 가이드북[분석]서울시 2030 나홀로족을 위한 라이프 가이드북
[분석]서울시 2030 나홀로족을 위한 라이프 가이드북
 
BigData & Hadoop - Technology Latinoware 2016
BigData & Hadoop - Technology Latinoware 2016BigData & Hadoop - Technology Latinoware 2016
BigData & Hadoop - Technology Latinoware 2016
 
Bigdata analytics and our IoT gateway
Bigdata analytics and our IoT gateway Bigdata analytics and our IoT gateway
Bigdata analytics and our IoT gateway
 
Big Data Patients and New Requirements for Clinical Systems
Big Data Patients and New Requirements for Clinical SystemsBig Data Patients and New Requirements for Clinical Systems
Big Data Patients and New Requirements for Clinical Systems
 
SparkR - Scalable machine learning - Utah R Users Group - U of U - June 17th
SparkR - Scalable machine learning - Utah R Users Group - U of U - June 17thSparkR - Scalable machine learning - Utah R Users Group - U of U - June 17th
SparkR - Scalable machine learning - Utah R Users Group - U of U - June 17th
 
SparkR: The Past, the Present and the Future-(Shivaram Venkataraman and Rui S...
SparkR: The Past, the Present and the Future-(Shivaram Venkataraman and Rui S...SparkR: The Past, the Present and the Future-(Shivaram Venkataraman and Rui S...
SparkR: The Past, the Present and the Future-(Shivaram Venkataraman and Rui S...
 
Galvanise NYC - Scaling R with Hadoop & Spark. V1.0
Galvanise NYC - Scaling R with Hadoop & Spark. V1.0Galvanise NYC - Scaling R with Hadoop & Spark. V1.0
Galvanise NYC - Scaling R with Hadoop & Spark. V1.0
 
O αργυροπελεκάνος
O αργυροπελεκάνοςO αργυροπελεκάνος
O αργυροπελεκάνος
 
Javaepic
JavaepicJavaepic
Javaepic
 
learn Onlinejava
learn Onlinejavalearn Onlinejava
learn Onlinejava
 

Similar to SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big Data analytics with Microsoft Azure"

Jump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and DatabricksJump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and Databricks
Databricks
 
Apache Spark, the Next Generation Cluster Computing
Apache Spark, the Next Generation Cluster ComputingApache Spark, the Next Generation Cluster Computing
Apache Spark, the Next Generation Cluster Computing
Gerger
 
Spark Study Notes
Spark Study NotesSpark Study Notes
Spark Study Notes
Richard Kuo
 
20170126 big data processing
20170126 big data processing20170126 big data processing
20170126 big data processing
Vienna Data Science Group
 
Memulai Data Processing dengan Spark dan Python
Memulai Data Processing dengan Spark dan PythonMemulai Data Processing dengan Spark dan Python
Memulai Data Processing dengan Spark dan Python
Ridwan Fadjar
 
Spark SQL - 10 Things You Need to Know
Spark SQL - 10 Things You Need to KnowSpark SQL - 10 Things You Need to Know
Spark SQL - 10 Things You Need to KnowKristian Alexander
 
Real-Time Spark: From Interactive Queries to Streaming
Real-Time Spark: From Interactive Queries to StreamingReal-Time Spark: From Interactive Queries to Streaming
Real-Time Spark: From Interactive Queries to Streaming
Databricks
 
Jump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksJump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on Databricks
Databricks
 
Intro to Spark and Spark SQL
Intro to Spark and Spark SQLIntro to Spark and Spark SQL
Intro to Spark and Spark SQL
jeykottalam
 
Introduction to Spark (Intern Event Presentation)
Introduction to Spark (Intern Event Presentation)Introduction to Spark (Intern Event Presentation)
Introduction to Spark (Intern Event Presentation)
Databricks
 
Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming
Djamel Zouaoui
 
PyconZA19-Distributed-workloads-challenges-with-PySpark-and-Airflow
PyconZA19-Distributed-workloads-challenges-with-PySpark-and-AirflowPyconZA19-Distributed-workloads-challenges-with-PySpark-and-Airflow
PyconZA19-Distributed-workloads-challenges-with-PySpark-and-Airflow
Chetan Khatri
 
Spark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark MeetupSpark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark Meetup
Databricks
 
Spark what's new what's coming
Spark what's new what's comingSpark what's new what's coming
Spark what's new what's coming
Databricks
 
Introducing Apache Spark's Data Frames and Dataset APIs workshop series
Introducing Apache Spark's Data Frames and Dataset APIs workshop seriesIntroducing Apache Spark's Data Frames and Dataset APIs workshop series
Introducing Apache Spark's Data Frames and Dataset APIs workshop series
Holden Karau
 
Apache Spark Workshop
Apache Spark WorkshopApache Spark Workshop
Apache Spark Workshop
Michael Spector
 
Apache Spark and DataStax Enablement
Apache Spark and DataStax EnablementApache Spark and DataStax Enablement
Apache Spark and DataStax Enablement
Vincent Poncet
 
Spark 计算模型
Spark 计算模型Spark 计算模型
Spark 计算模型
wang xing
 
Machine Learning with H2O, Spark, and Python at Strata 2015
Machine Learning with H2O, Spark, and Python at Strata 2015Machine Learning with H2O, Spark, and Python at Strata 2015
Machine Learning with H2O, Spark, and Python at Strata 2015
Sri Ambati
 
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...
Databricks
 

Similar to SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big Data analytics with Microsoft Azure" (20)

Jump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and DatabricksJump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and Databricks
 
Apache Spark, the Next Generation Cluster Computing
Apache Spark, the Next Generation Cluster ComputingApache Spark, the Next Generation Cluster Computing
Apache Spark, the Next Generation Cluster Computing
 
Spark Study Notes
Spark Study NotesSpark Study Notes
Spark Study Notes
 
20170126 big data processing
20170126 big data processing20170126 big data processing
20170126 big data processing
 
Memulai Data Processing dengan Spark dan Python
Memulai Data Processing dengan Spark dan PythonMemulai Data Processing dengan Spark dan Python
Memulai Data Processing dengan Spark dan Python
 
Spark SQL - 10 Things You Need to Know
Spark SQL - 10 Things You Need to KnowSpark SQL - 10 Things You Need to Know
Spark SQL - 10 Things You Need to Know
 
Real-Time Spark: From Interactive Queries to Streaming
Real-Time Spark: From Interactive Queries to StreamingReal-Time Spark: From Interactive Queries to Streaming
Real-Time Spark: From Interactive Queries to Streaming
 
Jump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksJump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on Databricks
 
Intro to Spark and Spark SQL
Intro to Spark and Spark SQLIntro to Spark and Spark SQL
Intro to Spark and Spark SQL
 
Introduction to Spark (Intern Event Presentation)
Introduction to Spark (Intern Event Presentation)Introduction to Spark (Intern Event Presentation)
Introduction to Spark (Intern Event Presentation)
 
Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming
 
PyconZA19-Distributed-workloads-challenges-with-PySpark-and-Airflow
PyconZA19-Distributed-workloads-challenges-with-PySpark-and-AirflowPyconZA19-Distributed-workloads-challenges-with-PySpark-and-Airflow
PyconZA19-Distributed-workloads-challenges-with-PySpark-and-Airflow
 
Spark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark MeetupSpark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark Meetup
 
Spark what's new what's coming
Spark what's new what's comingSpark what's new what's coming
Spark what's new what's coming
 
Introducing Apache Spark's Data Frames and Dataset APIs workshop series
Introducing Apache Spark's Data Frames and Dataset APIs workshop seriesIntroducing Apache Spark's Data Frames and Dataset APIs workshop series
Introducing Apache Spark's Data Frames and Dataset APIs workshop series
 
Apache Spark Workshop
Apache Spark WorkshopApache Spark Workshop
Apache Spark Workshop
 
Apache Spark and DataStax Enablement
Apache Spark and DataStax EnablementApache Spark and DataStax Enablement
Apache Spark and DataStax Enablement
 
Spark 计算模型
Spark 计算模型Spark 计算模型
Spark 计算模型
 
Machine Learning with H2O, Spark, and Python at Strata 2015
Machine Learning with H2O, Spark, and Python at Strata 2015Machine Learning with H2O, Spark, and Python at Strata 2015
Machine Learning with H2O, Spark, and Python at Strata 2015
 
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...
 

More from Inhacking

SE2016 Fundraising Roman Kravchenko "Investment in Ukrainian IoT-Startups"
SE2016 Fundraising Roman Kravchenko "Investment in Ukrainian IoT-Startups"SE2016 Fundraising Roman Kravchenko "Investment in Ukrainian IoT-Startups"
SE2016 Fundraising Roman Kravchenko "Investment in Ukrainian IoT-Startups"
Inhacking
 
SE2016 Fundraising Wlodek Laskowski "Insider guide to successful fundraising ...
SE2016 Fundraising Wlodek Laskowski "Insider guide to successful fundraising ...SE2016 Fundraising Wlodek Laskowski "Insider guide to successful fundraising ...
SE2016 Fundraising Wlodek Laskowski "Insider guide to successful fundraising ...
Inhacking
 
SE2016 Fundraising Andrey Sobol "Blockchain Crowdfunding or "Mommy, look, I l...
SE2016 Fundraising Andrey Sobol "Blockchain Crowdfunding or "Mommy, look, I l...SE2016 Fundraising Andrey Sobol "Blockchain Crowdfunding or "Mommy, look, I l...
SE2016 Fundraising Andrey Sobol "Blockchain Crowdfunding or "Mommy, look, I l...
Inhacking
 
SE2016 Company Development Valentin Dombrovsky "Travel startups challenges an...
SE2016 Company Development Valentin Dombrovsky "Travel startups challenges an...SE2016 Company Development Valentin Dombrovsky "Travel startups challenges an...
SE2016 Company Development Valentin Dombrovsky "Travel startups challenges an...
Inhacking
 
SE2016 Company Development Vadym Gorenko "How to pass the death valley"
SE2016 Company Development Vadym Gorenko  "How to pass the death valley"SE2016 Company Development Vadym Gorenko  "How to pass the death valley"
SE2016 Company Development Vadym Gorenko "How to pass the death valley"
Inhacking
 
SE2016 Marketing&PR Jan Keil "Do the right thing marketing for startups"
SE2016 Marketing&PR Jan Keil "Do the right thing marketing for startups"SE2016 Marketing&PR Jan Keil "Do the right thing marketing for startups"
SE2016 Marketing&PR Jan Keil "Do the right thing marketing for startups"
Inhacking
 
SE2016 PR&Marketing Mikhail Patalakha "ASO how to start and how to finish"
SE2016 PR&Marketing Mikhail Patalakha "ASO how to start and how to finish"SE2016 PR&Marketing Mikhail Patalakha "ASO how to start and how to finish"
SE2016 PR&Marketing Mikhail Patalakha "ASO how to start and how to finish"
Inhacking
 
SE2016 UI/UX Alina Kononenko "Designing for Apple Watch and Apple TV"
SE2016 UI/UX Alina Kononenko "Designing for Apple Watch and Apple TV"SE2016 UI/UX Alina Kononenko "Designing for Apple Watch and Apple TV"
SE2016 UI/UX Alina Kononenko "Designing for Apple Watch and Apple TV"
Inhacking
 
SE2016 Management Mikhail Lebedinkiy "iAIST the first pure ukrainian corporat...
SE2016 Management Mikhail Lebedinkiy "iAIST the first pure ukrainian corporat...SE2016 Management Mikhail Lebedinkiy "iAIST the first pure ukrainian corporat...
SE2016 Management Mikhail Lebedinkiy "iAIST the first pure ukrainian corporat...
Inhacking
 
SE2016 Management Anna Lavrova "Gladiator in the suit crisis is our brand!"
SE2016 Management Anna Lavrova "Gladiator in the suit  crisis is our brand!"SE2016 Management Anna Lavrova "Gladiator in the suit  crisis is our brand!"
SE2016 Management Anna Lavrova "Gladiator in the suit crisis is our brand!"
Inhacking
 
SE2016 Management Aleksey Solntsev "Management of the projects in the conditi...
SE2016 Management Aleksey Solntsev "Management of the projects in the conditi...SE2016 Management Aleksey Solntsev "Management of the projects in the conditi...
SE2016 Management Aleksey Solntsev "Management of the projects in the conditi...
Inhacking
 
SE2016 Management Vitalii Laptenok "Processes and planning for a product comp...
SE2016 Management Vitalii Laptenok "Processes and planning for a product comp...SE2016 Management Vitalii Laptenok "Processes and planning for a product comp...
SE2016 Management Vitalii Laptenok "Processes and planning for a product comp...
Inhacking
 
SE2016 Management Yana Prolis "Please don't burn down!"
SE2016 Management Yana Prolis "Please don't burn down!"SE2016 Management Yana Prolis "Please don't burn down!"
SE2016 Management Yana Prolis "Please don't burn down!"
Inhacking
 
SE2016 Management Marina Bril "Management at marketing teams and performance"
SE2016 Management Marina Bril "Management at marketing teams and performance"SE2016 Management Marina Bril "Management at marketing teams and performance"
SE2016 Management Marina Bril "Management at marketing teams and performance"
Inhacking
 
SE2016 iOS Anton Fedorchenko "Swift for Server-side Development"
SE2016 iOS Anton Fedorchenko "Swift for Server-side Development"SE2016 iOS Anton Fedorchenko "Swift for Server-side Development"
SE2016 iOS Anton Fedorchenko "Swift for Server-side Development"
Inhacking
 
SE2016 iOS Alexander Voronov "Test driven development in real world"
SE2016 iOS Alexander Voronov "Test driven development in real world"SE2016 iOS Alexander Voronov "Test driven development in real world"
SE2016 iOS Alexander Voronov "Test driven development in real world"
Inhacking
 
SE2016 JS Gregory Shehet "Undefined on prod, or how to test a react application"
SE2016 JS Gregory Shehet "Undefined on prod, or how to test a react application"SE2016 JS Gregory Shehet "Undefined on prod, or how to test a react application"
SE2016 JS Gregory Shehet "Undefined on prod, or how to test a react application"
Inhacking
 
SE2016 JS Alexey Osipenko "Basics of functional reactive programming"
SE2016 JS Alexey Osipenko "Basics of functional reactive programming"SE2016 JS Alexey Osipenko "Basics of functional reactive programming"
SE2016 JS Alexey Osipenko "Basics of functional reactive programming"
Inhacking
 
SE2016 Java Vladimir Mikhel "Scrapping the web"
SE2016 Java Vladimir Mikhel "Scrapping the web"SE2016 Java Vladimir Mikhel "Scrapping the web"
SE2016 Java Vladimir Mikhel "Scrapping the web"
Inhacking
 
SE2016 Java Valerii Moisieienko "Apache HBase Workshop"
SE2016 Java Valerii Moisieienko "Apache HBase Workshop"SE2016 Java Valerii Moisieienko "Apache HBase Workshop"
SE2016 Java Valerii Moisieienko "Apache HBase Workshop"
Inhacking
 

More from Inhacking (20)

SE2016 Fundraising Roman Kravchenko "Investment in Ukrainian IoT-Startups"
SE2016 Fundraising Roman Kravchenko "Investment in Ukrainian IoT-Startups"SE2016 Fundraising Roman Kravchenko "Investment in Ukrainian IoT-Startups"
SE2016 Fundraising Roman Kravchenko "Investment in Ukrainian IoT-Startups"
 
SE2016 Fundraising Wlodek Laskowski "Insider guide to successful fundraising ...
SE2016 Fundraising Wlodek Laskowski "Insider guide to successful fundraising ...SE2016 Fundraising Wlodek Laskowski "Insider guide to successful fundraising ...
SE2016 Fundraising Wlodek Laskowski "Insider guide to successful fundraising ...
 
SE2016 Fundraising Andrey Sobol "Blockchain Crowdfunding or "Mommy, look, I l...
SE2016 Fundraising Andrey Sobol "Blockchain Crowdfunding or "Mommy, look, I l...SE2016 Fundraising Andrey Sobol "Blockchain Crowdfunding or "Mommy, look, I l...
SE2016 Fundraising Andrey Sobol "Blockchain Crowdfunding or "Mommy, look, I l...
 
SE2016 Company Development Valentin Dombrovsky "Travel startups challenges an...
SE2016 Company Development Valentin Dombrovsky "Travel startups challenges an...SE2016 Company Development Valentin Dombrovsky "Travel startups challenges an...
SE2016 Company Development Valentin Dombrovsky "Travel startups challenges an...
 
SE2016 Company Development Vadym Gorenko "How to pass the death valley"
SE2016 Company Development Vadym Gorenko  "How to pass the death valley"SE2016 Company Development Vadym Gorenko  "How to pass the death valley"
SE2016 Company Development Vadym Gorenko "How to pass the death valley"
 
SE2016 Marketing&PR Jan Keil "Do the right thing marketing for startups"
SE2016 Marketing&PR Jan Keil "Do the right thing marketing for startups"SE2016 Marketing&PR Jan Keil "Do the right thing marketing for startups"
SE2016 Marketing&PR Jan Keil "Do the right thing marketing for startups"
 
SE2016 PR&Marketing Mikhail Patalakha "ASO how to start and how to finish"
SE2016 PR&Marketing Mikhail Patalakha "ASO how to start and how to finish"SE2016 PR&Marketing Mikhail Patalakha "ASO how to start and how to finish"
SE2016 PR&Marketing Mikhail Patalakha "ASO how to start and how to finish"
 
SE2016 UI/UX Alina Kononenko "Designing for Apple Watch and Apple TV"
SE2016 UI/UX Alina Kononenko "Designing for Apple Watch and Apple TV"SE2016 UI/UX Alina Kononenko "Designing for Apple Watch and Apple TV"
SE2016 UI/UX Alina Kononenko "Designing for Apple Watch and Apple TV"
 
SE2016 Management Mikhail Lebedinkiy "iAIST the first pure ukrainian corporat...
SE2016 Management Mikhail Lebedinkiy "iAIST the first pure ukrainian corporat...SE2016 Management Mikhail Lebedinkiy "iAIST the first pure ukrainian corporat...
SE2016 Management Mikhail Lebedinkiy "iAIST the first pure ukrainian corporat...
 
SE2016 Management Anna Lavrova "Gladiator in the suit crisis is our brand!"
SE2016 Management Anna Lavrova "Gladiator in the suit  crisis is our brand!"SE2016 Management Anna Lavrova "Gladiator in the suit  crisis is our brand!"
SE2016 Management Anna Lavrova "Gladiator in the suit crisis is our brand!"
 
SE2016 Management Aleksey Solntsev "Management of the projects in the conditi...
SE2016 Management Aleksey Solntsev "Management of the projects in the conditi...SE2016 Management Aleksey Solntsev "Management of the projects in the conditi...
SE2016 Management Aleksey Solntsev "Management of the projects in the conditi...
 
SE2016 Management Vitalii Laptenok "Processes and planning for a product comp...
SE2016 Management Vitalii Laptenok "Processes and planning for a product comp...SE2016 Management Vitalii Laptenok "Processes and planning for a product comp...
SE2016 Management Vitalii Laptenok "Processes and planning for a product comp...
 
SE2016 Management Yana Prolis "Please don't burn down!"
SE2016 Management Yana Prolis "Please don't burn down!"SE2016 Management Yana Prolis "Please don't burn down!"
SE2016 Management Yana Prolis "Please don't burn down!"
 
SE2016 Management Marina Bril "Management at marketing teams and performance"
SE2016 Management Marina Bril "Management at marketing teams and performance"SE2016 Management Marina Bril "Management at marketing teams and performance"
SE2016 Management Marina Bril "Management at marketing teams and performance"
 
SE2016 iOS Anton Fedorchenko "Swift for Server-side Development"
SE2016 iOS Anton Fedorchenko "Swift for Server-side Development"SE2016 iOS Anton Fedorchenko "Swift for Server-side Development"
SE2016 iOS Anton Fedorchenko "Swift for Server-side Development"
 
SE2016 iOS Alexander Voronov "Test driven development in real world"
SE2016 iOS Alexander Voronov "Test driven development in real world"SE2016 iOS Alexander Voronov "Test driven development in real world"
SE2016 iOS Alexander Voronov "Test driven development in real world"
 
SE2016 JS Gregory Shehet "Undefined on prod, or how to test a react application"
SE2016 JS Gregory Shehet "Undefined on prod, or how to test a react application"SE2016 JS Gregory Shehet "Undefined on prod, or how to test a react application"
SE2016 JS Gregory Shehet "Undefined on prod, or how to test a react application"
 
SE2016 JS Alexey Osipenko "Basics of functional reactive programming"
SE2016 JS Alexey Osipenko "Basics of functional reactive programming"SE2016 JS Alexey Osipenko "Basics of functional reactive programming"
SE2016 JS Alexey Osipenko "Basics of functional reactive programming"
 
SE2016 Java Vladimir Mikhel "Scrapping the web"
SE2016 Java Vladimir Mikhel "Scrapping the web"SE2016 Java Vladimir Mikhel "Scrapping the web"
SE2016 Java Vladimir Mikhel "Scrapping the web"
 
SE2016 Java Valerii Moisieienko "Apache HBase Workshop"
SE2016 Java Valerii Moisieienko "Apache HBase Workshop"SE2016 Java Valerii Moisieienko "Apache HBase Workshop"
SE2016 Java Valerii Moisieienko "Apache HBase Workshop"
 

Recently uploaded

一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
ewymefz
 
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
ahzuo
 
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
NABLAS株式会社
 
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
axoqas
 
FP Growth Algorithm and its Applications
FP Growth Algorithm and its ApplicationsFP Growth Algorithm and its Applications
FP Growth Algorithm and its Applications
MaleehaSheikh2
 
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Subhajit Sahu
 
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
slg6lamcq
 
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
ukgaet
 
一比一原版(YU毕业证)约克大学毕业证成绩单
一比一原版(YU毕业证)约克大学毕业证成绩单一比一原版(YU毕业证)约克大学毕业证成绩单
一比一原版(YU毕业证)约克大学毕业证成绩单
enxupq
 
Machine learning and optimization techniques for electrical drives.pptx
Machine learning and optimization techniques for electrical drives.pptxMachine learning and optimization techniques for electrical drives.pptx
Machine learning and optimization techniques for electrical drives.pptx
balafet
 
Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)
TravisMalana
 
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
mbawufebxi
 
Predicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Predicting Product Ad Campaign Performance: A Data Analysis Project PresentationPredicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Predicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Boston Institute of Analytics
 
Adjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTESAdjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTES
Subhajit Sahu
 
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...
pchutichetpong
 
SOCRadar Germany 2024 Threat Landscape Report
SOCRadar Germany 2024 Threat Landscape ReportSOCRadar Germany 2024 Threat Landscape Report
SOCRadar Germany 2024 Threat Landscape Report
SOCRadar
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP
 
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
vcaxypu
 
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
ahzuo
 
standardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghhstandardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghh
ArpitMalhotra16
 

Recently uploaded (20)

一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
 
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
 
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
 
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
 
FP Growth Algorithm and its Applications
FP Growth Algorithm and its ApplicationsFP Growth Algorithm and its Applications
FP Growth Algorithm and its Applications
 
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
 
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
 
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
 
一比一原版(YU毕业证)约克大学毕业证成绩单
一比一原版(YU毕业证)约克大学毕业证成绩单一比一原版(YU毕业证)约克大学毕业证成绩单
一比一原版(YU毕业证)约克大学毕业证成绩单
 
Machine learning and optimization techniques for electrical drives.pptx
Machine learning and optimization techniques for electrical drives.pptxMachine learning and optimization techniques for electrical drives.pptx
Machine learning and optimization techniques for electrical drives.pptx
 
Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)
 
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
 
Predicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Predicting Product Ad Campaign Performance: A Data Analysis Project PresentationPredicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Predicting Product Ad Campaign Performance: A Data Analysis Project Presentation
 
Adjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTESAdjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTES
 
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...
 
SOCRadar Germany 2024 Threat Landscape Report
SOCRadar Germany 2024 Threat Landscape ReportSOCRadar Germany 2024 Threat Landscape Report
SOCRadar Germany 2024 Threat Landscape Report
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
 
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
 
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
 
standardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghhstandardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghh
 

SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big Data analytics with Microsoft Azure"

  • 1. Vitalii Bondarenko Data Platform Competency Manager at Eleks Vitaliy.bondarenko@eleks.com HDInsight: Spark Advanced in-memory BigData Analytics with Microsoft Azure
  • 2. Agenda ● Spark Platform ● Spark Core ● Spark Extensions ● Using HDInsight Spark
  • 3. About me Vitalii Bondarenko Data Platform Competency Manager Eleks www.eleks.com 20 years in software development 9+ years of developing for MS SQL Server 3+ years of architecting Big Data Solutions ● DW/BI Architect and Technical Lead ● OLTP DB Performance Tuning ● Big Data Data Platform Architect
  • 5. Spark Stack ● Clustered computing platform ● Designed to be fast and general purpose ● Integrated with distributed systems ● API for Python, Scala, Java, clear and understandable code ● Integrated with Big Data and BI Tools ● Integrated with different Data Bases, systems and libraries like Cassanda, Kafka, H2O ● First Apache release 2013, this moth v.2.0 has been released
  • 8. Execution Model Spark Execution ● Shells and Standalone application ● Local and Cluster (Standalone, Yarn, Mesos, Cloud) Spark Cluster Arhitecture ● Master / Cluster manager ● Cluster allocates resources on nodes ● Master sends app code and tasks tor nodes ● Executers run tasks and cache data Connect to Cluster ● Local ● SparkContext and Master field ● spark://host:7077 ● Spark-submit
  • 9. DEMO: Execution Environments ● Local Spark installation ● Shells and Notebook ● Spark Examples ● HDInsight Spark Cluster ● SSH connection to Spark in Azure ● Jupyter Notebook connected to HDInsight Spark
  • 11. RDD: resilient distributed dataset ● Parallelized collections with fault-tolerant (Hadoop datasets) ● Transformations set new RDDs (filter, map, distinct, union, subtract, etc) ● Actions call to calculations (count, collect, first) ● Transformations are lazy ● Actions trigger transformations computation ● Broadcast Variables send data to executors ● Accumulators collect data on driver inputRDD = sc.textFile("log.txt") errorsRDD = inputRDD.filter(lambda x: "error" in x) warningsRDD = inputRDD.filter(lambda x: "warning" in x) badLinesRDD = errorsRDD.union(warningsRDD) print "Input had " + badLinesRDD.count() + " concerning lines"
  • 12. Spark program scenario ● Create RDD (loading external datasets, parallelizing a collection on driver) ● Transform ● Persist intermediate RDDs as results ● Launch actions
  • 13. Persistence (Caching) ● Avoid recalculations ● 10x faster in-memory ● Fault-tolerant ● Persistence levels ● Persist before first action input = sc.parallelize(xrange(1000)) result = input.map(lambda x: x ** x) result.persist(StorageLevel.MEMORY_ONLY) result.count() result.collect()
  • 18. Data Partitioning ● userData.join(events) ● userData.partitionBy(100).persist() ● 3-4 partitions on CPU Core ● userData.join(events).mapValues(...).reduceByKey(...)
  • 19. DEMO: Spark Core Operations ● Transformations ● Actions
  • 21. Spark Streaming Architecture ● Micro-batch architecture ● SparkStreaming Concext ● Batch interval from 500ms ● Transformation on Spark Engine ● Outup operations instead of Actions ● Different sources and outputs
  • 22. Spark Streaming Example from pyspark.streaming import StreamingContext ssc = StreamingContext(sc, 1) input_stream = ssc.textFileStream("sampleTextDir") word_pairs = input_stream.flatMap( lambda l:l.split(" ")).map(lambda w: (w,1)) counts = word_pairs.reduceByKey(lambda x,y: x + y) counts.print() ssc.start() ssc.awaitTermination() ● Process RDDs in batches ● Start after ssc.start() ● Output to console on Driver ● Awaiting termination
  • 23. Streaming on a Cluster ● Receivers with replication ● SparkContext on Driver ● Output from Exectors in batches saveAsHadoopFiles() ● spark-submit for creating and scheduling periodical streaming jobs ● Chekpointing for saving results and restore from the point ssc.checkpoint(“hdfs://...”)
  • 24. Streaming Transformations ● DStreams ● Stateless transformantions ● Stagefull transformantions ● Windowed transformantions ● UpdateStateByKey ● ReduceByWindow, reduceByKeyAndWindow ● Recomended batch size from 10 sec val ipDStream = accessLogsDStream.map(logEntry => (logEntry.getIpAddress(), 1)) val ipCountDStream = ipDStream.reduceByKeyAndWindow( {(x, y) => x + y}, // Adding elements in the new batches entering the window {(x, y) => x - y}, // Removing elements from the oldest batches exiting the window Seconds(30), // Window duration Seconds(10)) // Slide duration
  • 25. DEMO: Spark Streaming ● Simple streaming with PySpark
  • 26. Spark SQL ● SparkSQL interface for working with structured data by SQL ● Works with Hive tables and HiveQL ● Works with files (Json, Parquet etc) with defined schema ● JDBC/ODBC connectors for BI tools ● Integrated with Hive and Hive types, uses HiveUDF ● DataFrame abstraction
  • 27. Spark DataFrames ● hiveCtx.cacheTable("tableName"), in-memory, column-store, while driver is alive ● df.show() ● df.select(“name”, df(“age”)+1) ● df.filtr(df(“age”) > 19) ● df.groupBy(df(“name”)).min() # Import Spark SQLfrom pyspark.sql import HiveContext, Row # Or if you can't include the hive requirementsfrom pyspark.sql import SQLContext, Row sc = new SparkContext(...) hiveCtx = HiveContext(sc) sqlContext = SQLContext(sc) input = hiveCtx.jsonFile(inputFile) # Register the input schema RDD input.registerTempTable("tweets") # Select tweets based on the retweet CounttopTweets = hiveCtx.sql("""SELECT text, retweetCount FROM tweets ORDER BY retweetCount LIMIT 10""")
  • 28. Catalyst: Query Optimizer ● Analysis: map tables, columns, function, create a logical plan ● Logical Optimization: applies rules and optimize the plan ● Physical Planing: physical operator for the logical plan execution ● Cost estimation SELECT name FROM ( SELECT id, name FROM People) p WHERE p.id = 1
  • 29. DEMO: Using SparkSQL ● Simple SparkSQL querying ● Data Frames ● Data exploration with SparkSQL ● Connect from BI
  • 30. Spark ML Spark ML ● Classification ● Regression ● Clustering ● Recommendation ● Feature transformation, selection ● Statistics ● Linear algebra ● Data mining tools Pipeline Cmponents ● DataFrame ● Transformer ● Estimator ● Pipeline ● Parameter
  • 32. DEMO: Spark ML ● Training a model ● Data visualization
  • 33. New in Spark 2.0 ● Unifying DataFrames and Datasets in Scala/Java (compile time syntax and analysis errors). Same performance and convertible. ● SparkSession: a new entry point that supersedes SQLContext and HiveContext. ● Machine learning pipeline persistence ● Distributed algorithms in R ● Faster Optimizer ● Structured Streaming
  • 34. New in Spark 2.0 spark = SparkSession .builder() .appName("StructuredNetworkWordCount") .getOrCreate() # Create DataFrame representing the stream of input lines from connection to localhost:9999 lines = spark .readStream .format('socket') .option('host', 'localhost') .option('port', 9999) .load() # Split the lines into words words = lines.select( explode( split(lines.value, ' ') ).alias('word') ) # Generate running word count wordCounts = words.groupBy('word').count() # Start running the query that prints the running counts to the console query = wordCounts .writeStream .outputMode('complete') .format('console') .start() query.awaitTermination() windowedCounts = words.groupBy( window(words.timestamp, '10 minutes', '5 minutes'), words.word ).count()
  • 37. HDInsight benefits ● Ease of creating clusters (Azure portal, PowerShell, .Net SDK) ● Ease of use (noteboks, azure control panels) ● REST APIs (Livy: job server) ● Support for Azure Data Lake Store (adl://) ● Integration with Azure services (EventHub, Kafka) ● Support for R Server (HDInsight R over Spark) ● Integration with IntelliJ IDEA (Plugin, create and submit apps) ● Concurrent Queries (many users and connections) ● Caching on SSDs (SSD as persist method) ● Integration with BI Tools (connectors for PowerBI and Tableau) ● Pre-loaded Anaconda libraries (200 libraries for ML) ● Scalability (change number of nodes and start/stop cluster) ● 24/7 Support (99% up-time)
  • 38. HDInsight Spark Scenarious 1. Streaming data, IoT and real-time analytics 2. Visual data exploration and interactive analysis (HDFS) 3. Spark with NoSQL (HBase and Azure DocumentDB) 4. Spark with Data Lake 5. Spark with SQL Data Warehouse 6. Machine Learning using R Server, Mllib 7. Putting it all together in a notebook experience 8. Using Excel with Spark
  • 39. Q&A