SlideShare a Scribd company logo
1 of 39
Download to read offline
Vitalii Bondarenko
Data Platform Competency Manager at Eleks
Vitaliy.bondarenko@eleks.com
HDInsight: Spark
Advanced in-memory BigData Analytics with Microsoft Azure
Agenda
● Spark Platform
● Spark Core
● Spark Extensions
● Using HDInsight Spark
About me
Vitalii Bondarenko
Data Platform Competency Manager
Eleks
www.eleks.com
20 years in software development
9+ years of developing for MS SQL Server
3+ years of architecting Big Data Solutions
● DW/BI Architect and Technical Lead
● OLTP DB Performance Tuning
●
Big Data Data Platform Architect
Spark Platform
Spark Stack
● Clustered computing platform
● Designed to be fast and general purpose
● Integrated with distributed systems
● API for Python, Scala, Java, clear and understandable code
● Integrated with Big Data and BI Tools
● Integrated with different Data Bases, systems and libraries like Cassanda, Kafka, H2O
● First Apache release 2013, this moth v.2.0 has been released
Map-reduce computations
In-memory map-reduce
Execution Model
Spark Execution
● Shells and Standalone application
● Local and Cluster (Standalone, Yarn, Mesos, Cloud)
Spark Cluster Arhitecture
● Master / Cluster manager
● Cluster allocates resources on nodes
● Master sends app code and tasks tor nodes
● Executers run tasks and cache data
Connect to Cluster
● Local
● SparkContext and Master field
● spark://host:7077
● Spark-submit
DEMO: Execution Environments
● Local Spark installation
● Shells and Notebook
● Spark Examples
● HDInsight Spark Cluster
● SSH connection to Spark in Azure
● Jupyter Notebook connected to HDInsight Spark
Spark Core
RDD: resilient distributed dataset
● Parallelized collections with fault-tolerant (Hadoop datasets)
● Transformations set new RDDs (filter, map, distinct, union, subtract, etc)
● Actions call to calculations (count, collect, first)
● Transformations are lazy
● Actions trigger transformations computation
● Broadcast Variables send data to executors
● Accumulators collect data on driver
inputRDD = sc.textFile("log.txt")
errorsRDD = inputRDD.filter(lambda x: "error" in x)
warningsRDD = inputRDD.filter(lambda x: "warning" in x)
badLinesRDD = errorsRDD.union(warningsRDD)
print "Input had " + badLinesRDD.count() + " concerning lines"
Spark program scenario
● Create RDD (loading external datasets, parallelizing a
collection on driver)
● Transform
● Persist intermediate RDDs as results
● Launch actions
Persistence (Caching)
● Avoid recalculations
● 10x faster in-memory
● Fault-tolerant
● Persistence levels
● Persist before first action
input = sc.parallelize(xrange(1000))
result = input.map(lambda x: x ** x)
result.persist(StorageLevel.MEMORY_ONLY)
result.count()
result.collect()
Transformations (1)
Transformations (2)
Actions (1)
Actions (2)
Data Partitioning
● userData.join(events)
● userData.partitionBy(100).persist()
● 3-4 partitions on CPU Core
● userData.join(events).mapValues(...).reduceByKey(...)
DEMO: Spark Core Operations
● Transformations
● Actions
Spark Extensions
Spark Streaming Architecture
● Micro-batch architecture
● SparkStreaming Concext
● Batch interval from 500ms
●
Transformation on Spark Engine
●
Outup operations instead of Actions
● Different sources and outputs
Spark Streaming Example
from pyspark.streaming import StreamingContext
ssc = StreamingContext(sc, 1)
input_stream = ssc.textFileStream("sampleTextDir")
word_pairs = input_stream.flatMap(
lambda l:l.split(" ")).map(lambda w: (w,1))
counts = word_pairs.reduceByKey(lambda x,y: x + y)
counts.print()
ssc.start()
ssc.awaitTermination()
● Process RDDs in batches
● Start after ssc.start()
● Output to console on Driver
● Awaiting termination
Streaming on a Cluster
● Receivers with replication
● SparkContext on Driver
● Output from Exectors in batches saveAsHadoopFiles()
● spark-submit for creating and scheduling periodical streaming jobs
● Chekpointing for saving results and restore from the point ssc.checkpoint(“hdfs://...”)
Streaming Transformations
● DStreams
●
Stateless transformantions
● Stagefull transformantions
● Windowed transformantions
● UpdateStateByKey
● ReduceByWindow, reduceByKeyAndWindow
● Recomended batch size from 10 sec
val ipDStream = accessLogsDStream.map(logEntry => (logEntry.getIpAddress(), 1))
val ipCountDStream = ipDStream.reduceByKeyAndWindow(
{(x, y) => x + y}, // Adding elements in the new batches entering the window
{(x, y) => x - y}, // Removing elements from the oldest batches exiting the window
Seconds(30), // Window duration
Seconds(10)) // Slide duration
DEMO: Spark Streaming
● Simple streaming with PySpark
Spark SQL
●
SparkSQL interface for working with structured data by SQL
●
Works with Hive tables and HiveQL
● Works with files (Json, Parquet etc) with defined schema
●
JDBC/ODBC connectors for BI tools
●
Integrated with Hive and Hive types, uses HiveUDF
●
DataFrame abstraction
Spark DataFrames
● hiveCtx.cacheTable("tableName"), in-memory, column-store, while driver is alive
● df.show()
● df.select(“name”, df(“age”)+1)
● df.filtr(df(“age”) > 19)
● df.groupBy(df(“name”)).min()
# Import Spark SQLfrom pyspark.sql
import HiveContext, Row
# Or if you can't include the hive requirementsfrom pyspark.sql
import SQLContext, Row
sc = new SparkContext(...)
hiveCtx = HiveContext(sc)
sqlContext = SQLContext(sc)
input = hiveCtx.jsonFile(inputFile)
# Register the input schema RDD
input.registerTempTable("tweets")
# Select tweets based on the retweet
CounttopTweets = hiveCtx.sql("""SELECT text, retweetCount FROM tweets ORDER BY retweetCount LIMIT 10""")
Catalyst: Query Optimizer
● Analysis: map tables, columns, function, create a logical plan
● Logical Optimization: applies rules and optimize the plan
● Physical Planing: physical operator for the logical plan execution
● Cost estimation
SELECT name
FROM (
SELECT id, name
FROM People) p
WHERE p.id = 1
DEMO: Using SparkSQL
● Simple SparkSQL querying
● Data Frames
● Data exploration with SparkSQL
● Connect from BI
Spark ML
Spark ML
●
Classification
●
Regression
●
Clustering
● Recommendation
●
Feature transformation, selection
●
Statistics
●
Linear algebra
●
Data mining tools
Pipeline Cmponents
●
DataFrame
●
Transformer
●
Estimator
● Pipeline
●
Parameter
Logistic Regression
DEMO: Spark ML
● Training a model
● Data visualization
New in Spark 2.0
● Unifying DataFrames and Datasets in Scala/Java (compile time
syntax and analysis errors). Same performance and convertible.
● SparkSession: a new entry point that supersedes SQLContext and
HiveContext.
● Machine learning pipeline persistence
● Distributed algorithms in R
● Faster Optimizer
● Structured Streaming
New in Spark 2.0
spark = SparkSession
.builder()
.appName("StructuredNetworkWordCount")
.getOrCreate()
# Create DataFrame representing the stream of input lines from connection to localhost:9999
lines = spark
.readStream
.format('socket')
.option('host', 'localhost')
.option('port', 9999)
.load()
# Split the lines into words
words = lines.select(
explode(
split(lines.value, ' ')
).alias('word')
)
# Generate running word count
wordCounts = words.groupBy('word').count()
# Start running the query that prints the running counts to the console
query = wordCounts
.writeStream
.outputMode('complete')
.format('console')
.start()
query.awaitTermination()
windowedCounts = words.groupBy(
window(words.timestamp, '10 minutes', '5 minutes'),
words.word
).count()
HDInsight: Spark
Spark in Azure
HDInsight benefits
● Ease of creating clusters (Azure portal, PowerShell, .Net SDK)
●
Ease of use (noteboks, azure control panels)
●
REST APIs (Livy: job server)
●
Support for Azure Data Lake Store (adl://)
●
Integration with Azure services (EventHub, Kafka)
●
Support for R Server (HDInsight R over Spark)
● Integration with IntelliJ IDEA (Plugin, create and submit apps)
● Concurrent Queries (many users and connections)
● Caching on SSDs (SSD as persist method)
● Integration with BI Tools (connectors for PowerBI and Tableau)
● Pre-loaded Anaconda libraries (200 libraries for ML)
● Scalability (change number of nodes and start/stop cluster)
● 24/7 Support (99% up-time)
HDInsight Spark Scenarious
1. Streaming data, IoT and real-time analytics
2. Visual data exploration and interactive analysis (HDFS)
3. Spark with NoSQL (HBase and Azure DocumentDB)
4. Spark with Data Lake
5. Spark with SQL Data Warehouse
6. Machine Learning using R Server, Mllib
7. Putting it all together in a notebook experience
8. Using Excel with Spark
Q&A

More Related Content

What's hot

Fast NoSQL from HDDs?
Fast NoSQL from HDDs? Fast NoSQL from HDDs?
Fast NoSQL from HDDs? ScyllaDB
 
Real-Time Analytics with Apache Cassandra and Apache Spark
Real-Time Analytics with Apache Cassandra and Apache SparkReal-Time Analytics with Apache Cassandra and Apache Spark
Real-Time Analytics with Apache Cassandra and Apache SparkGuido Schmutz
 
Feeding Cassandra with Spark-Streaming and Kafka
Feeding Cassandra with Spark-Streaming and KafkaFeeding Cassandra with Spark-Streaming and Kafka
Feeding Cassandra with Spark-Streaming and KafkaDataStax Academy
 
Cassandra + Spark + Elk
Cassandra + Spark + ElkCassandra + Spark + Elk
Cassandra + Spark + ElkVasil Remeniuk
 
Apache Spark Streaming: Architecture and Fault Tolerance
Apache Spark Streaming: Architecture and Fault ToleranceApache Spark Streaming: Architecture and Fault Tolerance
Apache Spark Streaming: Architecture and Fault ToleranceSachin Aggarwal
 
The How and Why of Fast Data Analytics with Apache Spark
The How and Why of Fast Data Analytics with Apache SparkThe How and Why of Fast Data Analytics with Apache Spark
The How and Why of Fast Data Analytics with Apache SparkLegacy Typesafe (now Lightbend)
 
Building a Lambda Architecture with Elasticsearch at Yieldbot
Building a Lambda Architecture with Elasticsearch at YieldbotBuilding a Lambda Architecture with Elasticsearch at Yieldbot
Building a Lambda Architecture with Elasticsearch at Yieldbotyieldbot
 
Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyend...
Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyend...Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyend...
Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyend...Lucidworks
 
Cassandra vs. ScyllaDB: Evolutionary Differences
Cassandra vs. ScyllaDB: Evolutionary DifferencesCassandra vs. ScyllaDB: Evolutionary Differences
Cassandra vs. ScyllaDB: Evolutionary DifferencesScyllaDB
 
ScyllaDB: What could you do with Cassandra compatibility at 1.8 million reque...
ScyllaDB: What could you do with Cassandra compatibility at 1.8 million reque...ScyllaDB: What could you do with Cassandra compatibility at 1.8 million reque...
ScyllaDB: What could you do with Cassandra compatibility at 1.8 million reque...Data Con LA
 
GumGum: Multi-Region Cassandra in AWS
GumGum: Multi-Region Cassandra in AWSGumGum: Multi-Region Cassandra in AWS
GumGum: Multi-Region Cassandra in AWSDataStax Academy
 
Data processing platforms architectures with Spark, Mesos, Akka, Cassandra an...
Data processing platforms architectures with Spark, Mesos, Akka, Cassandra an...Data processing platforms architectures with Spark, Mesos, Akka, Cassandra an...
Data processing platforms architectures with Spark, Mesos, Akka, Cassandra an...Anton Kirillov
 
Building a Real-time Streaming ETL Framework Using ksqlDB and NoSQL
Building a Real-time Streaming ETL Framework Using ksqlDB and NoSQLBuilding a Real-time Streaming ETL Framework Using ksqlDB and NoSQL
Building a Real-time Streaming ETL Framework Using ksqlDB and NoSQLScyllaDB
 
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...Helena Edelson
 
A New Chapter of Data Processing with CDK
A New Chapter of Data Processing with CDKA New Chapter of Data Processing with CDK
A New Chapter of Data Processing with CDKShu-Jeng Hsieh
 
Developing a Real-time Engine with Akka, Cassandra, and Spray
Developing a Real-time Engine with Akka, Cassandra, and SprayDeveloping a Real-time Engine with Akka, Cassandra, and Spray
Developing a Real-time Engine with Akka, Cassandra, and SprayJacob Park
 
Demystifying the Distributed Database Landscape
Demystifying the Distributed Database LandscapeDemystifying the Distributed Database Landscape
Demystifying the Distributed Database LandscapeScyllaDB
 
From Monolith to Microservices with Cassandra, Grpc, and Falcor (Luke Tillman...
From Monolith to Microservices with Cassandra, Grpc, and Falcor (Luke Tillman...From Monolith to Microservices with Cassandra, Grpc, and Falcor (Luke Tillman...
From Monolith to Microservices with Cassandra, Grpc, and Falcor (Luke Tillman...DataStax
 

What's hot (20)

Fast NoSQL from HDDs?
Fast NoSQL from HDDs? Fast NoSQL from HDDs?
Fast NoSQL from HDDs?
 
Real-Time Analytics with Apache Cassandra and Apache Spark
Real-Time Analytics with Apache Cassandra and Apache SparkReal-Time Analytics with Apache Cassandra and Apache Spark
Real-Time Analytics with Apache Cassandra and Apache Spark
 
Feeding Cassandra with Spark-Streaming and Kafka
Feeding Cassandra with Spark-Streaming and KafkaFeeding Cassandra with Spark-Streaming and Kafka
Feeding Cassandra with Spark-Streaming and Kafka
 
Cassandra + Spark + Elk
Cassandra + Spark + ElkCassandra + Spark + Elk
Cassandra + Spark + Elk
 
Apache Spark Streaming: Architecture and Fault Tolerance
Apache Spark Streaming: Architecture and Fault ToleranceApache Spark Streaming: Architecture and Fault Tolerance
Apache Spark Streaming: Architecture and Fault Tolerance
 
The How and Why of Fast Data Analytics with Apache Spark
The How and Why of Fast Data Analytics with Apache SparkThe How and Why of Fast Data Analytics with Apache Spark
The How and Why of Fast Data Analytics with Apache Spark
 
Big Data Tools in AWS
Big Data Tools in AWSBig Data Tools in AWS
Big Data Tools in AWS
 
Building a Lambda Architecture with Elasticsearch at Yieldbot
Building a Lambda Architecture with Elasticsearch at YieldbotBuilding a Lambda Architecture with Elasticsearch at Yieldbot
Building a Lambda Architecture with Elasticsearch at Yieldbot
 
Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyend...
Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyend...Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyend...
Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyend...
 
Cassandra vs. ScyllaDB: Evolutionary Differences
Cassandra vs. ScyllaDB: Evolutionary DifferencesCassandra vs. ScyllaDB: Evolutionary Differences
Cassandra vs. ScyllaDB: Evolutionary Differences
 
ScyllaDB: What could you do with Cassandra compatibility at 1.8 million reque...
ScyllaDB: What could you do with Cassandra compatibility at 1.8 million reque...ScyllaDB: What could you do with Cassandra compatibility at 1.8 million reque...
ScyllaDB: What could you do with Cassandra compatibility at 1.8 million reque...
 
GumGum: Multi-Region Cassandra in AWS
GumGum: Multi-Region Cassandra in AWSGumGum: Multi-Region Cassandra in AWS
GumGum: Multi-Region Cassandra in AWS
 
Cassandra & Spark for IoT
Cassandra & Spark for IoTCassandra & Spark for IoT
Cassandra & Spark for IoT
 
Data processing platforms architectures with Spark, Mesos, Akka, Cassandra an...
Data processing platforms architectures with Spark, Mesos, Akka, Cassandra an...Data processing platforms architectures with Spark, Mesos, Akka, Cassandra an...
Data processing platforms architectures with Spark, Mesos, Akka, Cassandra an...
 
Building a Real-time Streaming ETL Framework Using ksqlDB and NoSQL
Building a Real-time Streaming ETL Framework Using ksqlDB and NoSQLBuilding a Real-time Streaming ETL Framework Using ksqlDB and NoSQL
Building a Real-time Streaming ETL Framework Using ksqlDB and NoSQL
 
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...
 
A New Chapter of Data Processing with CDK
A New Chapter of Data Processing with CDKA New Chapter of Data Processing with CDK
A New Chapter of Data Processing with CDK
 
Developing a Real-time Engine with Akka, Cassandra, and Spray
Developing a Real-time Engine with Akka, Cassandra, and SprayDeveloping a Real-time Engine with Akka, Cassandra, and Spray
Developing a Real-time Engine with Akka, Cassandra, and Spray
 
Demystifying the Distributed Database Landscape
Demystifying the Distributed Database LandscapeDemystifying the Distributed Database Landscape
Demystifying the Distributed Database Landscape
 
From Monolith to Microservices with Cassandra, Grpc, and Falcor (Luke Tillman...
From Monolith to Microservices with Cassandra, Grpc, and Falcor (Luke Tillman...From Monolith to Microservices with Cassandra, Grpc, and Falcor (Luke Tillman...
From Monolith to Microservices with Cassandra, Grpc, and Falcor (Luke Tillman...
 

Viewers also liked

Oleksandr Yefremov Continuously delivering mobile project
Oleksandr Yefremov Continuously delivering mobile projectOleksandr Yefremov Continuously delivering mobile project
Oleksandr Yefremov Continuously delivering mobile projectАліна Шепшелей
 
Миша Рыбачук Что такое дизайн?
Миша Рыбачук Что такое дизайн?Миша Рыбачук Что такое дизайн?
Миша Рыбачук Что такое дизайн?Аліна Шепшелей
 
Anton Ivinskyi Application level metrics and performance tests
Anton Ivinskyi	Application level metrics and performance testsAnton Ivinskyi	Application level metrics and performance tests
Anton Ivinskyi Application level metrics and performance testsАліна Шепшелей
 
Vladimir Lozanov How to deliver high quality apps to the app store
Vladimir Lozanov	How to deliver high quality apps to the app storeVladimir Lozanov	How to deliver high quality apps to the app store
Vladimir Lozanov How to deliver high quality apps to the app storeАліна Шепшелей
 
Макс Семенчук Дизайнер, которому доверяют
 Макс Семенчук Дизайнер, которому доверяют Макс Семенчук Дизайнер, которому доверяют
Макс Семенчук Дизайнер, которому доверяютАліна Шепшелей
 
Alexander Voronov Test driven development in real world
Alexander Voronov Test driven development in real worldAlexander Voronov Test driven development in real world
Alexander Voronov Test driven development in real worldАліна Шепшелей
 
Виталий Лаптенок Процессы в продуктовой компании
Виталий Лаптенок Процессы в продуктовой компанииВиталий Лаптенок Процессы в продуктовой компании
Виталий Лаптенок Процессы в продуктовой компанииАліна Шепшелей
 
Новые рынки: делать медиа там, где это почти невозможно
Новые рынки: делать медиа там, где это почти невозможноНовые рынки: делать медиа там, где это почти невозможно
Новые рынки: делать медиа там, где это почти невозможноMediaprojects Mail.Ru Group
 
Anton Parkhomenko Boost your design workflow or git rebase for designers
Anton Parkhomenko Boost your design workflow or git rebase for designersAnton Parkhomenko Boost your design workflow or git rebase for designers
Anton Parkhomenko Boost your design workflow or git rebase for designersАліна Шепшелей
 
Introduction to Microsoft Azure HD Insight by Dattatrey Sindhol
Introduction to Microsoft Azure HD Insight by Dattatrey Sindhol Introduction to Microsoft Azure HD Insight by Dattatrey Sindhol
Introduction to Microsoft Azure HD Insight by Dattatrey Sindhol HARMAN Services
 
Spark on Azure HDInsight - spark meetup seattle
Spark on Azure HDInsight - spark meetup seattleSpark on Azure HDInsight - spark meetup seattle
Spark on Azure HDInsight - spark meetup seattleJudy Nash
 

Viewers also liked (11)

Oleksandr Yefremov Continuously delivering mobile project
Oleksandr Yefremov Continuously delivering mobile projectOleksandr Yefremov Continuously delivering mobile project
Oleksandr Yefremov Continuously delivering mobile project
 
Миша Рыбачук Что такое дизайн?
Миша Рыбачук Что такое дизайн?Миша Рыбачук Что такое дизайн?
Миша Рыбачук Что такое дизайн?
 
Anton Ivinskyi Application level metrics and performance tests
Anton Ivinskyi	Application level metrics and performance testsAnton Ivinskyi	Application level metrics and performance tests
Anton Ivinskyi Application level metrics and performance tests
 
Vladimir Lozanov How to deliver high quality apps to the app store
Vladimir Lozanov	How to deliver high quality apps to the app storeVladimir Lozanov	How to deliver high quality apps to the app store
Vladimir Lozanov How to deliver high quality apps to the app store
 
Макс Семенчук Дизайнер, которому доверяют
 Макс Семенчук Дизайнер, которому доверяют Макс Семенчук Дизайнер, которому доверяют
Макс Семенчук Дизайнер, которому доверяют
 
Alexander Voronov Test driven development in real world
Alexander Voronov Test driven development in real worldAlexander Voronov Test driven development in real world
Alexander Voronov Test driven development in real world
 
Виталий Лаптенок Процессы в продуктовой компании
Виталий Лаптенок Процессы в продуктовой компанииВиталий Лаптенок Процессы в продуктовой компании
Виталий Лаптенок Процессы в продуктовой компании
 
Новые рынки: делать медиа там, где это почти невозможно
Новые рынки: делать медиа там, где это почти невозможноНовые рынки: делать медиа там, где это почти невозможно
Новые рынки: делать медиа там, где это почти невозможно
 
Anton Parkhomenko Boost your design workflow or git rebase for designers
Anton Parkhomenko Boost your design workflow or git rebase for designersAnton Parkhomenko Boost your design workflow or git rebase for designers
Anton Parkhomenko Boost your design workflow or git rebase for designers
 
Introduction to Microsoft Azure HD Insight by Dattatrey Sindhol
Introduction to Microsoft Azure HD Insight by Dattatrey Sindhol Introduction to Microsoft Azure HD Insight by Dattatrey Sindhol
Introduction to Microsoft Azure HD Insight by Dattatrey Sindhol
 
Spark on Azure HDInsight - spark meetup seattle
Spark on Azure HDInsight - spark meetup seattleSpark on Azure HDInsight - spark meetup seattle
Spark on Azure HDInsight - spark meetup seattle
 

Similar to Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics with microsoft azure

Jump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and DatabricksJump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and DatabricksDatabricks
 
Apache Spark, the Next Generation Cluster Computing
Apache Spark, the Next Generation Cluster ComputingApache Spark, the Next Generation Cluster Computing
Apache Spark, the Next Generation Cluster ComputingGerger
 
Spark Study Notes
Spark Study NotesSpark Study Notes
Spark Study NotesRichard Kuo
 
Memulai Data Processing dengan Spark dan Python
Memulai Data Processing dengan Spark dan PythonMemulai Data Processing dengan Spark dan Python
Memulai Data Processing dengan Spark dan PythonRidwan Fadjar
 
Spark SQL - 10 Things You Need to Know
Spark SQL - 10 Things You Need to KnowSpark SQL - 10 Things You Need to Know
Spark SQL - 10 Things You Need to KnowKristian Alexander
 
Real-Time Spark: From Interactive Queries to Streaming
Real-Time Spark: From Interactive Queries to StreamingReal-Time Spark: From Interactive Queries to Streaming
Real-Time Spark: From Interactive Queries to StreamingDatabricks
 
Jump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksJump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksDatabricks
 
Intro to Spark and Spark SQL
Intro to Spark and Spark SQLIntro to Spark and Spark SQL
Intro to Spark and Spark SQLjeykottalam
 
Introduction to Spark (Intern Event Presentation)
Introduction to Spark (Intern Event Presentation)Introduction to Spark (Intern Event Presentation)
Introduction to Spark (Intern Event Presentation)Databricks
 
Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Djamel Zouaoui
 
PyconZA19-Distributed-workloads-challenges-with-PySpark-and-Airflow
PyconZA19-Distributed-workloads-challenges-with-PySpark-and-AirflowPyconZA19-Distributed-workloads-challenges-with-PySpark-and-Airflow
PyconZA19-Distributed-workloads-challenges-with-PySpark-and-AirflowChetan Khatri
 
Spark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark MeetupSpark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark MeetupDatabricks
 
Spark what's new what's coming
Spark what's new what's comingSpark what's new what's coming
Spark what's new what's comingDatabricks
 
Introducing Apache Spark's Data Frames and Dataset APIs workshop series
Introducing Apache Spark's Data Frames and Dataset APIs workshop seriesIntroducing Apache Spark's Data Frames and Dataset APIs workshop series
Introducing Apache Spark's Data Frames and Dataset APIs workshop seriesHolden Karau
 
Apache Spark and DataStax Enablement
Apache Spark and DataStax EnablementApache Spark and DataStax Enablement
Apache Spark and DataStax EnablementVincent Poncet
 
Spark 计算模型
Spark 计算模型Spark 计算模型
Spark 计算模型wang xing
 
Machine Learning with H2O, Spark, and Python at Strata 2015
Machine Learning with H2O, Spark, and Python at Strata 2015Machine Learning with H2O, Spark, and Python at Strata 2015
Machine Learning with H2O, Spark, and Python at Strata 2015Sri Ambati
 
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...Databricks
 

Similar to Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics with microsoft azure (20)

Jump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and DatabricksJump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and Databricks
 
Apache Spark, the Next Generation Cluster Computing
Apache Spark, the Next Generation Cluster ComputingApache Spark, the Next Generation Cluster Computing
Apache Spark, the Next Generation Cluster Computing
 
Spark Study Notes
Spark Study NotesSpark Study Notes
Spark Study Notes
 
20170126 big data processing
20170126 big data processing20170126 big data processing
20170126 big data processing
 
Memulai Data Processing dengan Spark dan Python
Memulai Data Processing dengan Spark dan PythonMemulai Data Processing dengan Spark dan Python
Memulai Data Processing dengan Spark dan Python
 
Spark SQL - 10 Things You Need to Know
Spark SQL - 10 Things You Need to KnowSpark SQL - 10 Things You Need to Know
Spark SQL - 10 Things You Need to Know
 
Real-Time Spark: From Interactive Queries to Streaming
Real-Time Spark: From Interactive Queries to StreamingReal-Time Spark: From Interactive Queries to Streaming
Real-Time Spark: From Interactive Queries to Streaming
 
Jump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksJump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on Databricks
 
Intro to Spark and Spark SQL
Intro to Spark and Spark SQLIntro to Spark and Spark SQL
Intro to Spark and Spark SQL
 
Introduction to Spark (Intern Event Presentation)
Introduction to Spark (Intern Event Presentation)Introduction to Spark (Intern Event Presentation)
Introduction to Spark (Intern Event Presentation)
 
Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming
 
PyconZA19-Distributed-workloads-challenges-with-PySpark-and-Airflow
PyconZA19-Distributed-workloads-challenges-with-PySpark-and-AirflowPyconZA19-Distributed-workloads-challenges-with-PySpark-and-Airflow
PyconZA19-Distributed-workloads-challenges-with-PySpark-and-Airflow
 
Spark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark MeetupSpark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark Meetup
 
Spark what's new what's coming
Spark what's new what's comingSpark what's new what's coming
Spark what's new what's coming
 
Introducing Apache Spark's Data Frames and Dataset APIs workshop series
Introducing Apache Spark's Data Frames and Dataset APIs workshop seriesIntroducing Apache Spark's Data Frames and Dataset APIs workshop series
Introducing Apache Spark's Data Frames and Dataset APIs workshop series
 
Apache Spark Workshop
Apache Spark WorkshopApache Spark Workshop
Apache Spark Workshop
 
Apache Spark and DataStax Enablement
Apache Spark and DataStax EnablementApache Spark and DataStax Enablement
Apache Spark and DataStax Enablement
 
Spark 计算模型
Spark 计算模型Spark 计算模型
Spark 计算模型
 
Machine Learning with H2O, Spark, and Python at Strata 2015
Machine Learning with H2O, Spark, and Python at Strata 2015Machine Learning with H2O, Spark, and Python at Strata 2015
Machine Learning with H2O, Spark, and Python at Strata 2015
 
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...
 

More from Аліна Шепшелей

Valerii Iakovenko Drones as the part of the present
Valerii Iakovenko	Drones as the part of the presentValerii Iakovenko	Drones as the part of the present
Valerii Iakovenko Drones as the part of the presentАліна Шепшелей
 
Dmitriy Kouperman Working with legacy systems. stabilization, monitoring, man...
Dmitriy Kouperman Working with legacy systems. stabilization, monitoring, man...Dmitriy Kouperman Working with legacy systems. stabilization, monitoring, man...
Dmitriy Kouperman Working with legacy systems. stabilization, monitoring, man...Аліна Шепшелей
 
Andrew Veles Product design is about the process
Andrew Veles Product design is about the processAndrew Veles Product design is about the process
Andrew Veles Product design is about the processАліна Шепшелей
 
Kononenko Alina Designing for Apple Watch and Apple TV
Kononenko Alina Designing for Apple Watch and Apple TVKononenko Alina Designing for Apple Watch and Apple TV
Kononenko Alina Designing for Apple Watch and Apple TVАліна Шепшелей
 
Mihail Patalaha Aso: how to start and how to finish?
Mihail Patalaha Aso: how to start and how to finish?Mihail Patalaha Aso: how to start and how to finish?
Mihail Patalaha Aso: how to start and how to finish?Аліна Шепшелей
 
Gregory Shehet Undefined' on prod, or how to test a react app
Gregory Shehet Undefined' on  prod, or how to test a react appGregory Shehet Undefined' on  prod, or how to test a react app
Gregory Shehet Undefined' on prod, or how to test a react appАліна Шепшелей
 
Alexey Osipenko Basics of functional reactive programming
Alexey Osipenko Basics of functional reactive programmingAlexey Osipenko Basics of functional reactive programming
Alexey Osipenko Basics of functional reactive programmingАліна Шепшелей
 
Roman Ugolnikov Migrationа and sourcecontrol for your db
Roman Ugolnikov Migrationа and sourcecontrol for your dbRoman Ugolnikov Migrationа and sourcecontrol for your db
Roman Ugolnikov Migrationа and sourcecontrol for your dbАліна Шепшелей
 
Volodymyr Getmanskyi How to build a dynamic pricing model using big data
Volodymyr Getmanskyi How to build a dynamic pricing model using big dataVolodymyr Getmanskyi How to build a dynamic pricing model using big data
Volodymyr Getmanskyi How to build a dynamic pricing model using big dataАліна Шепшелей
 
Maksym Antipov Hardware development as a hobby and a job
Maksym Antipov Hardware development as a hobby and a jobMaksym Antipov Hardware development as a hobby and a job
Maksym Antipov Hardware development as a hobby and a jobАліна Шепшелей
 
Den Golotyuk Big data from 30 million daily users
Den Golotyuk Big data from 30 million daily usersDen Golotyuk Big data from 30 million daily users
Den Golotyuk Big data from 30 million daily usersАліна Шепшелей
 
Anton Fedorchenko Swift for server side development
Anton Fedorchenko Swift for server side developmentAnton Fedorchenko Swift for server side development
Anton Fedorchenko Swift for server side developmentАліна Шепшелей
 
Valerii Vasylkov Erlang. measurements and benefits.
Valerii Vasylkov Erlang. measurements and benefits.Valerii Vasylkov Erlang. measurements and benefits.
Valerii Vasylkov Erlang. measurements and benefits.Аліна Шепшелей
 

More from Аліна Шепшелей (20)

Valerii Iakovenko Drones as the part of the present
Valerii Iakovenko	Drones as the part of the presentValerii Iakovenko	Drones as the part of the present
Valerii Iakovenko Drones as the part of the present
 
Valerii Moisieienko Apache hbase workshop
Valerii Moisieienko	Apache hbase workshopValerii Moisieienko	Apache hbase workshop
Valerii Moisieienko Apache hbase workshop
 
Dmitriy Kouperman Working with legacy systems. stabilization, monitoring, man...
Dmitriy Kouperman Working with legacy systems. stabilization, monitoring, man...Dmitriy Kouperman Working with legacy systems. stabilization, monitoring, man...
Dmitriy Kouperman Working with legacy systems. stabilization, monitoring, man...
 
Andrew Veles Product design is about the process
Andrew Veles Product design is about the processAndrew Veles Product design is about the process
Andrew Veles Product design is about the process
 
Kononenko Alina Designing for Apple Watch and Apple TV
Kononenko Alina Designing for Apple Watch and Apple TVKononenko Alina Designing for Apple Watch and Apple TV
Kononenko Alina Designing for Apple Watch and Apple TV
 
Mihail Patalaha Aso: how to start and how to finish?
Mihail Patalaha Aso: how to start and how to finish?Mihail Patalaha Aso: how to start and how to finish?
Mihail Patalaha Aso: how to start and how to finish?
 
Gregory Shehet Undefined' on prod, or how to test a react app
Gregory Shehet Undefined' on  prod, or how to test a react appGregory Shehet Undefined' on  prod, or how to test a react app
Gregory Shehet Undefined' on prod, or how to test a react app
 
Alexey Osipenko Basics of functional reactive programming
Alexey Osipenko Basics of functional reactive programmingAlexey Osipenko Basics of functional reactive programming
Alexey Osipenko Basics of functional reactive programming
 
Vladimir Mikhel Scrapping the web
Vladimir Mikhel Scrapping the web Vladimir Mikhel Scrapping the web
Vladimir Mikhel Scrapping the web
 
Roman Ugolnikov Migrationа and sourcecontrol for your db
Roman Ugolnikov Migrationа and sourcecontrol for your dbRoman Ugolnikov Migrationа and sourcecontrol for your db
Roman Ugolnikov Migrationа and sourcecontrol for your db
 
Dmutro Panin JHipster
Dmutro Panin JHipster Dmutro Panin JHipster
Dmutro Panin JHipster
 
Alex Theedom Java ee revisits design patterns
Alex Theedom	Java ee revisits design patternsAlex Theedom	Java ee revisits design patterns
Alex Theedom Java ee revisits design patterns
 
Alexey Tokar To find a needle in a haystack
Alexey Tokar To find a needle in a haystackAlexey Tokar To find a needle in a haystack
Alexey Tokar To find a needle in a haystack
 
Volodymyr Getmanskyi How to build a dynamic pricing model using big data
Volodymyr Getmanskyi How to build a dynamic pricing model using big dataVolodymyr Getmanskyi How to build a dynamic pricing model using big data
Volodymyr Getmanskyi How to build a dynamic pricing model using big data
 
Maksym Antipov Hardware development as a hobby and a job
Maksym Antipov Hardware development as a hobby and a jobMaksym Antipov Hardware development as a hobby and a job
Maksym Antipov Hardware development as a hobby and a job
 
Ievgen Vladimirov Only cloud
Ievgen Vladimirov Only cloudIevgen Vladimirov Only cloud
Ievgen Vladimirov Only cloud
 
Denis Reznik Data driven future
Denis Reznik Data driven futureDenis Reznik Data driven future
Denis Reznik Data driven future
 
Den Golotyuk Big data from 30 million daily users
Den Golotyuk Big data from 30 million daily usersDen Golotyuk Big data from 30 million daily users
Den Golotyuk Big data from 30 million daily users
 
Anton Fedorchenko Swift for server side development
Anton Fedorchenko Swift for server side developmentAnton Fedorchenko Swift for server side development
Anton Fedorchenko Swift for server side development
 
Valerii Vasylkov Erlang. measurements and benefits.
Valerii Vasylkov Erlang. measurements and benefits.Valerii Vasylkov Erlang. measurements and benefits.
Valerii Vasylkov Erlang. measurements and benefits.
 

Recently uploaded

ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProduct Anonymous
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Victor Rentea
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Jeffrey Haguewood
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDropbox
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...apidays
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistandanishmna97
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...DianaGray10
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamUiPathCommunity
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Victor Rentea
 
Platformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityPlatformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityWSO2
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Orbitshub
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobeapidays
 
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelMcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelDeepika Singh
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...Zilliz
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdfSandro Moreira
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodJuan lago vázquez
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...apidays
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesrafiqahmad00786416
 

Recently uploaded (20)

ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistan
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
 
Platformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityPlatformless Horizons for Digital Adaptability
Platformless Horizons for Digital Adaptability
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelMcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 

Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics with microsoft azure

  • 1. Vitalii Bondarenko Data Platform Competency Manager at Eleks Vitaliy.bondarenko@eleks.com HDInsight: Spark Advanced in-memory BigData Analytics with Microsoft Azure
  • 2. Agenda ● Spark Platform ● Spark Core ● Spark Extensions ● Using HDInsight Spark
  • 3. About me Vitalii Bondarenko Data Platform Competency Manager Eleks www.eleks.com 20 years in software development 9+ years of developing for MS SQL Server 3+ years of architecting Big Data Solutions ● DW/BI Architect and Technical Lead ● OLTP DB Performance Tuning ● Big Data Data Platform Architect
  • 5. Spark Stack ● Clustered computing platform ● Designed to be fast and general purpose ● Integrated with distributed systems ● API for Python, Scala, Java, clear and understandable code ● Integrated with Big Data and BI Tools ● Integrated with different Data Bases, systems and libraries like Cassanda, Kafka, H2O ● First Apache release 2013, this moth v.2.0 has been released
  • 8. Execution Model Spark Execution ● Shells and Standalone application ● Local and Cluster (Standalone, Yarn, Mesos, Cloud) Spark Cluster Arhitecture ● Master / Cluster manager ● Cluster allocates resources on nodes ● Master sends app code and tasks tor nodes ● Executers run tasks and cache data Connect to Cluster ● Local ● SparkContext and Master field ● spark://host:7077 ● Spark-submit
  • 9. DEMO: Execution Environments ● Local Spark installation ● Shells and Notebook ● Spark Examples ● HDInsight Spark Cluster ● SSH connection to Spark in Azure ● Jupyter Notebook connected to HDInsight Spark
  • 11. RDD: resilient distributed dataset ● Parallelized collections with fault-tolerant (Hadoop datasets) ● Transformations set new RDDs (filter, map, distinct, union, subtract, etc) ● Actions call to calculations (count, collect, first) ● Transformations are lazy ● Actions trigger transformations computation ● Broadcast Variables send data to executors ● Accumulators collect data on driver inputRDD = sc.textFile("log.txt") errorsRDD = inputRDD.filter(lambda x: "error" in x) warningsRDD = inputRDD.filter(lambda x: "warning" in x) badLinesRDD = errorsRDD.union(warningsRDD) print "Input had " + badLinesRDD.count() + " concerning lines"
  • 12. Spark program scenario ● Create RDD (loading external datasets, parallelizing a collection on driver) ● Transform ● Persist intermediate RDDs as results ● Launch actions
  • 13. Persistence (Caching) ● Avoid recalculations ● 10x faster in-memory ● Fault-tolerant ● Persistence levels ● Persist before first action input = sc.parallelize(xrange(1000)) result = input.map(lambda x: x ** x) result.persist(StorageLevel.MEMORY_ONLY) result.count() result.collect()
  • 18. Data Partitioning ● userData.join(events) ● userData.partitionBy(100).persist() ● 3-4 partitions on CPU Core ● userData.join(events).mapValues(...).reduceByKey(...)
  • 19. DEMO: Spark Core Operations ● Transformations ● Actions
  • 21. Spark Streaming Architecture ● Micro-batch architecture ● SparkStreaming Concext ● Batch interval from 500ms ● Transformation on Spark Engine ● Outup operations instead of Actions ● Different sources and outputs
  • 22. Spark Streaming Example from pyspark.streaming import StreamingContext ssc = StreamingContext(sc, 1) input_stream = ssc.textFileStream("sampleTextDir") word_pairs = input_stream.flatMap( lambda l:l.split(" ")).map(lambda w: (w,1)) counts = word_pairs.reduceByKey(lambda x,y: x + y) counts.print() ssc.start() ssc.awaitTermination() ● Process RDDs in batches ● Start after ssc.start() ● Output to console on Driver ● Awaiting termination
  • 23. Streaming on a Cluster ● Receivers with replication ● SparkContext on Driver ● Output from Exectors in batches saveAsHadoopFiles() ● spark-submit for creating and scheduling periodical streaming jobs ● Chekpointing for saving results and restore from the point ssc.checkpoint(“hdfs://...”)
  • 24. Streaming Transformations ● DStreams ● Stateless transformantions ● Stagefull transformantions ● Windowed transformantions ● UpdateStateByKey ● ReduceByWindow, reduceByKeyAndWindow ● Recomended batch size from 10 sec val ipDStream = accessLogsDStream.map(logEntry => (logEntry.getIpAddress(), 1)) val ipCountDStream = ipDStream.reduceByKeyAndWindow( {(x, y) => x + y}, // Adding elements in the new batches entering the window {(x, y) => x - y}, // Removing elements from the oldest batches exiting the window Seconds(30), // Window duration Seconds(10)) // Slide duration
  • 25. DEMO: Spark Streaming ● Simple streaming with PySpark
  • 26. Spark SQL ● SparkSQL interface for working with structured data by SQL ● Works with Hive tables and HiveQL ● Works with files (Json, Parquet etc) with defined schema ● JDBC/ODBC connectors for BI tools ● Integrated with Hive and Hive types, uses HiveUDF ● DataFrame abstraction
  • 27. Spark DataFrames ● hiveCtx.cacheTable("tableName"), in-memory, column-store, while driver is alive ● df.show() ● df.select(“name”, df(“age”)+1) ● df.filtr(df(“age”) > 19) ● df.groupBy(df(“name”)).min() # Import Spark SQLfrom pyspark.sql import HiveContext, Row # Or if you can't include the hive requirementsfrom pyspark.sql import SQLContext, Row sc = new SparkContext(...) hiveCtx = HiveContext(sc) sqlContext = SQLContext(sc) input = hiveCtx.jsonFile(inputFile) # Register the input schema RDD input.registerTempTable("tweets") # Select tweets based on the retweet CounttopTweets = hiveCtx.sql("""SELECT text, retweetCount FROM tweets ORDER BY retweetCount LIMIT 10""")
  • 28. Catalyst: Query Optimizer ● Analysis: map tables, columns, function, create a logical plan ● Logical Optimization: applies rules and optimize the plan ● Physical Planing: physical operator for the logical plan execution ● Cost estimation SELECT name FROM ( SELECT id, name FROM People) p WHERE p.id = 1
  • 29. DEMO: Using SparkSQL ● Simple SparkSQL querying ● Data Frames ● Data exploration with SparkSQL ● Connect from BI
  • 30. Spark ML Spark ML ● Classification ● Regression ● Clustering ● Recommendation ● Feature transformation, selection ● Statistics ● Linear algebra ● Data mining tools Pipeline Cmponents ● DataFrame ● Transformer ● Estimator ● Pipeline ● Parameter
  • 32. DEMO: Spark ML ● Training a model ● Data visualization
  • 33. New in Spark 2.0 ● Unifying DataFrames and Datasets in Scala/Java (compile time syntax and analysis errors). Same performance and convertible. ● SparkSession: a new entry point that supersedes SQLContext and HiveContext. ● Machine learning pipeline persistence ● Distributed algorithms in R ● Faster Optimizer ● Structured Streaming
  • 34. New in Spark 2.0 spark = SparkSession .builder() .appName("StructuredNetworkWordCount") .getOrCreate() # Create DataFrame representing the stream of input lines from connection to localhost:9999 lines = spark .readStream .format('socket') .option('host', 'localhost') .option('port', 9999) .load() # Split the lines into words words = lines.select( explode( split(lines.value, ' ') ).alias('word') ) # Generate running word count wordCounts = words.groupBy('word').count() # Start running the query that prints the running counts to the console query = wordCounts .writeStream .outputMode('complete') .format('console') .start() query.awaitTermination() windowedCounts = words.groupBy( window(words.timestamp, '10 minutes', '5 minutes'), words.word ).count()
  • 37. HDInsight benefits ● Ease of creating clusters (Azure portal, PowerShell, .Net SDK) ● Ease of use (noteboks, azure control panels) ● REST APIs (Livy: job server) ● Support for Azure Data Lake Store (adl://) ● Integration with Azure services (EventHub, Kafka) ● Support for R Server (HDInsight R over Spark) ● Integration with IntelliJ IDEA (Plugin, create and submit apps) ● Concurrent Queries (many users and connections) ● Caching on SSDs (SSD as persist method) ● Integration with BI Tools (connectors for PowerBI and Tableau) ● Pre-loaded Anaconda libraries (200 libraries for ML) ● Scalability (change number of nodes and start/stop cluster) ● 24/7 Support (99% up-time)
  • 38. HDInsight Spark Scenarious 1. Streaming data, IoT and real-time analytics 2. Visual data exploration and interactive analysis (HDFS) 3. Spark with NoSQL (HBase and Azure DocumentDB) 4. Spark with Data Lake 5. Spark with SQL Data Warehouse 6. Machine Learning using R Server, Mllib 7. Putting it all together in a notebook experience 8. Using Excel with Spark
  • 39. Q&A