Getting Started with Data Processing Using Spark and Python
Ridwan Fadjar
Web Developer @Ebizu
Article Writer @(Codepolitan | POSS UPI | Serverless ID | Labbasdat CS UPI)
What is Spark?
Spark Features
● Large-scale dataset processing
● Data processing with SQL-like syntax (see the sketch after this list)
● Graph processing
● Machine learning on top of Spark
● Ingesting data streams from Kafka or Kinesis
● Support for programming languages such as Java, Scala, Python, and R
● Supported file formats include CSV, ORC, Parquet, text, and more
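To illustrate the SQL-like syntax and the file format support listed above, here is a minimal sketch (the file path and the country column are hypothetical) that reads a CSV, aggregates it with Spark SQL, and writes the result as Parquet:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("sql-like-example").getOrCreate()
# Read a CSV file with a header into a DataFrame (path and columns are made up)
users = spark.read.option("header", "true").csv("/data/users.csv")
# Query it with SQL-like syntax through a temporary view
users.createOrReplaceTempView("users")
per_country = spark.sql("SELECT country, COUNT(*) AS total FROM users GROUP BY country")
# Write the result in one of the supported columnar formats
per_country.write.mode("overwrite").parquet("/data/users_per_country.parquet")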
RDD vs DataFrame vs SQL
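As a rough illustration of the difference (the sample data is made up), the same filter-and-count can be expressed with the RDD API, the DataFrame API, or Spark SQL:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("rdd-vs-dataframe-vs-sql").getOrCreate()
sc = spark.sparkContext
rows = [("alice", 34), ("bob", 17), ("cecep", 25)]
# 1. RDD: low-level transformations on raw Python objects
adults_rdd = sc.parallelize(rows).filter(lambda r: r[1] >= 18).count()
# 2. DataFrame: named columns plus an optimizer, still written in Python
df = spark.createDataFrame(rows, ["name", "age"])
adults_df = df.filter(df.age >= 18).count()
# 3. SQL: the same query expressed as a SQL string over a temp view
df.createOrReplaceTempView("people")
adults_sql = spark.sql("SELECT COUNT(*) AS n FROM people WHERE age >= 18").first()["n"]
print(adults_rdd, adults_df, adults_sql)  # all three print 2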
Example Data Pipeline Architecture with Spark (1)
Example Data Pipeline Architecture with Spark (2)
Example Data Pipeline Architecture with Spark (3)
Local Development
● Install Docker on your laptop
● Download the Spark container image built by singularities
● Start the container
● Write sample code and store it inside the container
● Run it with spark-submit
Local Development (1)
● Example command 1: spark-submit --deploy-mode client --master local script.py
● Example command 2: spark-submit --deploy-mode client --master local[*] script.py
● Example command 3: spark-submit --deploy-mode cluster --master yarn script.py
● Example command 4: spark-submit --deploy-mode cluster --master yarn --py-files config.py script.py
● And so on
Local Development (3)
● A Spark application should take parameters so that a single script can handle different cases dynamically (see the sketch after this list)
● It always has an input dataset and an output dataset
● It can run on a single node (master only) or with one worker
● Use pip to install the required dependencies
● Write unit tests for the functions or libraries you build yourself
● Make sure every required library is installed on both the master and the workers
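A minimal sketch of these points, assuming hypothetical argument names and paths: the input and output locations are passed as parameters, and the transformation sits in a small function that a unit test can call without a cluster.
import argparse
from pyspark.sql import SparkSession
def clean_rows(df):
    # Pure DataFrame-in, DataFrame-out function, easy to cover with a unit test
    return df.dropna().dropDuplicates()
if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--input", required=True)   # input dataset
    parser.add_argument("--output", required=True)  # output dataset
    args = parser.parse_args()
    spark = SparkSession.builder.appName("parameterized-job").getOrCreate()
    df = spark.read.option("header", "true").csv(args.input)
    clean_rows(df).write.mode("overwrite").parquet(args.output)
Such a script could then be submitted with, for example: spark-submit --deploy-mode client --master local[*] job.py --input /data/in.csv --output /data/out.parquet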
Deployment
● Keep the source code on GitHub
● Use Ansible to deploy the Spark application and manage its configuration
● Use Ansible to manage the configuration inside each node
Deployment on AWS
● Run the script directly on AWS Elastic MapReduce (EMR)
● Use AWS EMR Steps and Clusters through the AWS Console
● Use AWS EMR Steps and Clusters through the AWS CLI, scheduled by cron
● Use AWS EMR Steps and Clusters through the AWS CLI, scheduled by a workflow scheduler such as Luigi, Apache Oozie, or Apache Airflow (see the sketch after this list)
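For the CLI and scheduler options, a minimal sketch of submitting a Spark step to an already-running EMR cluster with boto3 (the cluster ID, bucket names, and script path are placeholders); a cron job, Luigi task, or Airflow operator would wrap a call like this:
import boto3
emr = boto3.client("emr", region_name="ap-southeast-1")
# Add a spark-submit step to an existing cluster (IDs and S3 paths are placeholders)
emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",
    Steps=[{
        "Name": "daily-user-report",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "spark-submit", "--deploy-mode", "cluster",
                "s3://my-bucket/jobs/script.py",
                "--input", "s3://my-bucket/input/",
                "--output", "s3://my-bucket/output/",
            ],
        },
    }],
)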
Integrating Spark with Other Solutions
● MySQL (see the JDBC read sketch after this list)
● Kafka
● Elasticsearch
● Redis
● MemSQL
● AWS Kinesis
● And more
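As an example of one of these integrations, a minimal sketch of reading a MySQL table through Spark's JDBC data source (the host, credentials, and table name are placeholders, and the MySQL JDBC driver jar has to be passed to spark-submit via --jars or --packages):
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("mysql-read-example").getOrCreate()
# Read a table over JDBC; the connection details below are placeholders
users = (spark.read.format("jdbc")
    .option("url", "jdbc:mysql://192.168.46.49:3306/appdb")
    .option("driver", "com.mysql.jdbc.Driver")
    .option("dbtable", "users")
    .option("user", "spark")
    .option("password", "secret")
    .load())
users.show()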
PySpark Code Example (1)
from pyspark import SparkConf, SparkContext
logFile = "/data/README.md" # Should be some file on your system
sc = SparkContext("local", "Simple App")
sc.setLogLevel("ERROR")
logData = sc.textFile(logFile).cache()
numAs = logData.filter(lambda s: 'a' in s).count()
numBs = logData.filter(lambda s: 'b' in s).count()
print("Lines with a: %i, lines with b: %i" % (numAs, numBs))
PySpark Code Example (2)
from pyspark import SparkContext, SparkConf
from random import randint
# http://localhost:9200/spark/_search/?size=1000&pretty=1
# spark-submit --jars /tmp/data/elasticsearch-hadoop-5.4.0.jar /tmp/data/spark-es-read-test.py
sc = SparkContext("local", "Simple App")
sc.setLogLevel("ERROR")
es_conf = {
    "es.nodes": "192.168.46.49",
    "es.port": "9200",
    "es.resource": "spark/docs",
}
if __name__ == "__main__":
    # Read documents from Elasticsearch as an RDD of (id, document) pairs
    rdd = sc.newAPIHadoopRDD(
        "org.elasticsearch.hadoop.mr.EsInputFormat",
        "org.apache.hadoop.io.NullWritable",
        "org.elasticsearch.hadoop.mr.LinkedMapWritable",
        conf=es_conf)
    print(rdd.collect())
PySpark Code Example (3)
from pyspark import SparkContext, SparkConf
from random import randint
# http://localhost:9200/spark/_search/?size=1000&pretty=1
# spark-submit --jars /tmp/data/elasticsearch-hadoop-5.4.0.jar /tmp/data/spark-es-write-test.py
sc = SparkContext("local", "Simple App")
sc.setLogLevel("ERROR")
es_conf = {
    "es.nodes": "192.168.46.49",
    "es.port": "9200",
    "es.resource": "spark/docs",
}
if __name__ == "__main__":
    # Generate 100 sample documents and write them to Elasticsearch
    rdd = sc.parallelize([(i, {"x": i, "y": "lorem ipsum sit dolor amet", "z": randint(0, 1000)}) for i in range(0, 100)])
    rdd.saveAsNewAPIHadoopFile(
        path='-',
        outputFormatClass="org.elasticsearch.hadoop.mr.EsOutputFormat",
        keyClass="org.apache.hadoop.io.NullWritable",
        valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
        conf=es_conf)
PySpark Code Example (4)
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
sc = SparkContext("local[2]", "WordCountStreaming")
sc.setLogLevel("ERROR")
ssc = StreamingContext(sc, 10)
# Listen on a TCP socket; feed it text with e.g. `nc -lk 9999` on the source host
lines = ssc.socketTextStream("10.2.2.38", 9999)
words = lines.flatMap(lambda line: line.split(" "))
# Count each word in each batch
pairs = words.map(lambda word: (word, 1))
wordCounts = pairs.reduceByKey(lambda x, y: x + y)
# Print the first ten elements of each RDD generated in this DStream to the console
wordCounts.pprint()
ssc.start() # Start the computation
ssc.awaitTermination() # Wait for the computation to terminate
PySpark Code Example (5)
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils
sc = SparkContext("local[2]", "WordCountStreaming")
sc.setLogLevel("ERROR")
ssc = StreamingContext(sc, 10)
topic = "test"
# createStream(ssc, ZooKeeper quorum, consumer group id, {topic: number of partitions})
lines = KafkaUtils.createStream(ssc, "10.2.2.38:2181", "topic", {topic: 4})
words = lines.flatMap(lambda line: line[1].split(" "))
# Count each word in each batch
pairs = words.map(lambda word: (word, 1))
wordCounts = pairs.reduceByKey(lambda x, y: x + y)
# Print the first ten elements of each RDD generated in this DStream to the console
wordCounts.pprint()
ssc.start() # Start the computation
ssc.awaitTermination() # Wait for the computation to terminate
PySpark Code Example (6)
from pyspark import SparkContext
from pyspark.sql import SparkSession, SQLContext
from random import randint
from datetime import timedelta, datetime
sc = SparkContext()
sc.setLogLevel("ERROR")
ss = SparkSession(sc)
sqlCtx = SQLContext(sc)
dataset = sc.textFile("/data/contoso/user-*.csv").map(lambda line: line.split("|"))
for row in dataset.take(5):
    print("-->")
    print(row)
dframe = dataset.toDF()
dframe.show()
print(dframe.count())
try:
    dframe.write.partitionBy("_6").format("parquet").save("user.parquet")
except Exception:
    print("The Parquet output already exists")
PySpark Code Example (7)
from pyspark import SparkContext
from pyspark.sql import SparkSession, SQLContext
from random import randint
from datetime import timedelta, datetime
sc = SparkContext()
sc.setLogLevel("ERROR")
ss = SparkSession(sc)
sqlCtx = SQLContext(sc)
dframe = ss.read.load("/user/spark/user.parquet")
dframe.show()
print(dframe.count())
PySpark Code Example (8)
from pyspark import SparkContext
from pyspark.sql import SparkSession, SQLContext
from pyspark.sql.types import IntegerType, TimestampType, ByteType, ShortType, StringType, DecimalType, StructField, StructType
from random import randint
from datetime import timedelta, datetime
spark = SparkSession \
    .builder \
    .appName("Python Spark SQL Hive integration example") \
    .config("spark.sql.warehouse.dir", "./spark-warehouse") \
    .enableHiveSupport() \
    .getOrCreate()
# dataset = sc.textFile("/data/campaign/campaign-metadata-sample-1.csv").map(lambda line: line.split("|"))
# for row in dataset.take(5):
# print ("-->")
# print (row)
schema = StructType([
    StructField("metadata_id", StringType(), False),
    StructField("type", StringType(), True),
    StructField("event", StringType(), True),
    StructField("metadata", StringType(), True),
    StructField("application_id", StringType(), True),
    StructField("created_at", StringType(), True),
    StructField("api_version", StringType(), True)
])
dframe = spark.read.schema(schema).option("delimiter", "|").csv("/data/campaign/campaign-metadata-sample-1.csv")
dframe.show()
try:
    # Partition by the created_at column (the sixth field); the original used the positional name "_6", which does not exist once the schema names the columns
    dframe.write.partitionBy("created_at").format("orc").save("campaign-metadata")
except Exception as e:
    print(e)
    print("The ORC output already exists")
Managed Spark Services
- HortonWorks
- Azure HDInsight
- Amazon Web Services Elastic MapReduce
- Cloudera Spark
- Databricks
- and others
Other Solutions Similar to Apache Spark
- Apache Beam
- Apache Flink
- Apache Storm
- Apache Hive
- Apache PrestoDB
- and others
DEMO
Q & A
Special Thanks
Zaky & Wildan, who taught me Apache Spark.
Fajri & Tajhul, who taught me how to use various AWS products.
Bramandityo, who taught me Python.