SlideShare a Scribd company logo
1 of 53
Download to read offline
S3, Cassandra or Outer Space? Dumping
Time Series Data using Spark
Demi Ben-Ari - VP R&D @
Tel-Aviv 30 MARCH 2017
About Me
Demi Ben-Ari, Co-Founder & VP R&D @ Panorays
● BS’c Computer Science – Academic College Tel-Aviv Yaffo
● Co-Founder
○ “Big Things” Big Data Community
○ Google Developer Group Cloud
In the Past:
● Sr. Data Engineer - Windward
● Team Leader & Sr. Software Engineer
Missile defense and Alert System - “Ofek” – IAF
Interested in almost every kind of technology – A True Geek
Agenda
● Apache Spark brief overview and Catch Up
● Data flow and Environment
● What’s our time series data like?
● Where we started from - where we got to
○ Problems and our decisions
○ Evolution of the solution
● Conclusions
Spark Brief Overview & Catchup
Scala & Spark (Architecture)
Scala REPL Scala Compiler
Spark Runtime
Scala Runtime
JVM
File System
(eg. HDFS,
Cassandra, S3..)
Cluster Manager
(eg. Yarn, Mesos)
What kind of DSL is Apache Spark
● Centered around Collections
● Immutable data sets equipped with functional transformations
● These are exactly the Scala collection operations
map
flatMap
filter
...
reduce
fold
aggregate
...
union
intersection
...
Spark is A Multi-Language Platform
● Why to use Scala instead of
Python?
○ Native to Spark, Can use
everything without
translation
○ Types help
So Bottom Line…
What’s Spark???
United Tools Platform
United Tools Platform - Single Framework
Batch
InteractiveStreaming
Single Framework
Data flow and Environment
(Our Use Case)
Structure of the Data
● Maritime Analytics Platform
● Geo Locations + Metadata
● Arriving over time
● Different types of messages being reported by satellites
● Encoded (For Compression purposes)
● Might arrive later than actually transmitted
Data Flow Diagram
External
Data
Source
Analytics
Layers
Data Pipeline
Parsed
Raw
Entity Resolution
Process
Building insights
on top of the entities
Data Output
Layer
Anomaly
Detection
Trends
Environment Description
Cluster
Dev Testing
Live
Staging
ProductionEnv
OB1K
RESTful Java Services
Basic Terms
● Missing Parts in Time Series Data
◦ Data arriving from the satellites
● Might be causing delays because of bad transmission
◦ Data vendors delaying the data stream
◦ Calculation in Layers may cause Holes in the Data
● Calculating the Data layers by time slices
Basic Terms
● Idempotence
is the property of certain operations in mathematics and computer
science, that can be applied multiple times without changing the
result beyond the initial application.
● Function: Same input => Same output
Basic Terms
● Partitions == Parallelism
◦ Physical / Logical partitioning
● Resilient Distributed Datasets (RDDs) == Collections
◦ fault-tolerant collection of elements that can be operated on in
parallel.
◦ Applying immutable transformations and actions over RDDs
What RDD’s really are?
So…..
The Problem - Receiving DATA
Beginning state, no data, and the timeline
begins
T = 0
Level 3 Entity
Level 2 Entity
Level 1 Entity
The Problem - Receiving DATA
T = 10
Level 3 Entity
Level 2 Entity
Level 1 Entity
Computation sliding window size
Level 1 entities data arrives
and gets stored
The Problem - Receiving DATA
T = 10
Level 3 Entity
Level 2 Entity
Level 1 Entity
Computation sliding window size
Level 3 entities are created on
top of Level 2’s Data
(Decreased amount of data)
Level 2 entities are created
on top of Level 1’s Data
(Decreased amount of
data)
The Problem - Receiving DATA
T = 20
Level 3 Entity
Level 2 Entity
Level 1 Entity
Computation sliding window size
Because of the sliding window’s
back size, level 2 and 3 entities
would not be created properly and
there would be “Holes” in the Data
Level 1 entity's
data arriving late
Solution to the Problem
● Creating Dependent Micro services forming a data pipeline
◦ Mainly Apache Spark applications
◦ Services are only dependent on the Data - not the previous
service’s run
● Forming a structure and scheduling of “Back Sliding Window”
◦ Know your data and its relevance through time
◦ Don’t try to foresee the future – it might Bias the results
How it looks like in the end...
Level 3 Entity
Level 2 Entity
Level 1 Entity
6 Hour time slot
12 Hours of Data
A Week of Data
More than a Week of Data
Starting point & Infrastructure
How we started?
● Spark Standalone – via ec2 scripts
◦ Around 5 nodes (r3.xlarge instances)
◦ Didn’t want to keep a persistent HDFS – Costs a lot
◦ 100 GB (per day) => ~150 TB for 4 years
◦ Cost for server per year (r3.xlarge):
- On demand: ~2900$
- Reserved: ~1750$
● Know your costs: http://www.ec2instances.info/
Know Your Costs
Decision
● Working with S3 as the persistence layer
◦ Pay extra for
- Put (0.005 per 1000 requests)
- Get (0.004 per 10,000 requests)
◦ 150TB => ~210$ for 4 years of Data
● Same format as HDFS (CSV files)
◦ s3n://some-bucket/entity1/201412010000/part-00000
◦ s3n://some-bucket/entity1/201412010000/part-00001
◦ ……
What about the serving?
MongoDB for Serving
Worker 1
Worker 2
….
….
…
…
Worker N
MongoDB
Replica Set
Spark
Cluster
Master
Write
Read
Spark Slave - Server Specs
● Instance Type: r3.xlarge
● CPU’s: 4
● RAM: 30.5GB
● Storage: ephemeral
● Amount: 10+
MongoDB - Server Specs
● MongoDB version: 2.6.1
● Instance Type: m3.xlarge (AWS)
● CPU’s: 4
● RAM: 15GB
● Storage: EBS
● DB Size: ~500GB
● Collection Indexes: 5 (4 compound)
The Problem
● Batch jobs
◦ Should run for 5-10 minutes in total
◦ Actual - runs for ~40 minutes
● Why?
◦ ~20 minutes to write with the Java mongo driver – Async
(Unacknowledged)
◦ ~20 minutes to sync the journal
◦ Total: ~ 40 Minutes of the DB being unavailable
◦ No batch process response and no UI serving
Alternative Solutions
● Sharded MongoDB (With replica sets)
◦ Pros:
- Increases Throughput by the amount of shards
- Increases the availability of the DB
◦ Cons:
- Very hard to manage DevOps wise (for a small team of
developers)
- High cost of servers – because each shared need 3 replicas
Workflow with MongoDB
Worker 1
Worker 2
….
….
…
…
Worker N
Spark
Cluster
Master
Write
Read
Master
Our DevOps – After that solution
We had no
DevOps guy at
that time at all
☹
Alternative Solutions
● Apache Cassandra
◦ Pros:
- Very large developer community
- Linearly scalable Database
- No single master architecture
- Proven working with distributed engines like Apache Spark
◦ Cons:
- We had no experience at all with the Database
- No Geo Spatial Index – Needed to implement by ourselves
The Solution
● Migration to Apache Cassandra
● Create easily a Cassandra cluster using DataStax Community
AMI on AWS
◦ First easy step – Using the spark-cassandra-connector
(Easy bootstrap move to Spark ⬄ Cassandra)
◦ Creating a monitoring dashboard to Cassandra
● Second phase:
◦ Creating a self managed and self provisioned Cassandra
Cluster
◦ Tuning the hell out of it!!!
Workflow with Cassandra
Worker 1
Worker 2
….
….
…
…
Worker N
Cassandra
Cluster
Spark
Cluster
Write
Read
Result
● Performance improvement
◦ Batch write parts of the job run in 3 minutes instead of ~ 40
minutes in MongoDB
● Took 2 weeks to go from “Zero to Hero”, and to ramp up a
running solution that work without glitches
So (Again)?
Transferring the Heaviest Process
● Micro service that runs every 10 minutes
● Writes to Cassandra 30GB per iteration
◦ (Replication factor 3 => 90GB)
● At first took us 18 minutes to do all of the writes
◦ Not Acceptable in a 10 minute process
Cluster On OpsCenter - Before
Transferring the Heaviest Process
● Solutions
◦ We chose the i2.xlarge
◦ Optimization of the Cluster
◦ Changing the JDK to Java-8
- Changing the GC algorithm to G1
◦ Tuning the Operation system
- Ulimit, removing the swap
◦ Write time went down to ~5 minutes (For 30GB RF=3)
Sounds good right? I don’t think so
Cloud Watch After Tuning
The Solution
● Taking the same Data Model that we held in Cassandra (All of the
Raw data per 10 minutes) and put it on S3
◦ Write time went down from ~5 minutes to 1.5 minutes
● Added another process, not dependent on the main one,
happens every 15 minutes
◦ Reads from S3, downscales the amount and Writes them to
Cassandra for serving
Parsed
Raw
Static /
Aggregated
Data
Spark Analytics Layers
UI Serving
Downscaled
Data
Heavy
Fusion
Process
How it looks after all?
Conclusion
● Always give an estimate to your data
◦ Frequency
◦ Volume
◦ Arrangement of the previous phase
● There is no “Best” persistence layer
◦ There is the right one for the job
◦ Don’t overload an existing solution
Conclusion
● Spark is a great framework for distributed collections
◦ Fully functional API
◦ Can perform imperative actions
● “With great power,
comes lots of partitioning”
◦ Control your work and
data distribution via partitions
● https://www.pinterest.com/pin/155514993354583499/ (Thanks)
Questions?
● LinkedIn
● Twitter: @demibenari
● Blog:
http://progexc.blogspot.com/
● demi.benari@gmail.com
● “Big Things” Community
Meetup, YouTube, Facebook,
Twitter
● GDG Cloud
S3, Cassandra or Outer Space? Dumping Time Series Data using Spark

More Related Content

What's hot

Seastar / ScyllaDB, or how we implemented a 10-times faster Cassandra
Seastar / ScyllaDB,  or how we implemented a 10-times faster CassandraSeastar / ScyllaDB,  or how we implemented a 10-times faster Cassandra
Seastar / ScyllaDB, or how we implemented a 10-times faster CassandraTzach Livyatan
 
Taming GC Pauses for Humongous Java Heaps in Spark Graph Computing-(Eric Kacz...
Taming GC Pauses for Humongous Java Heaps in Spark Graph Computing-(Eric Kacz...Taming GC Pauses for Humongous Java Heaps in Spark Graph Computing-(Eric Kacz...
Taming GC Pauses for Humongous Java Heaps in Spark Graph Computing-(Eric Kacz...Spark Summit
 
Developing Scylla Applications: Practical Tips
Developing Scylla Applications: Practical TipsDeveloping Scylla Applications: Practical Tips
Developing Scylla Applications: Practical TipsScyllaDB
 
Flink Forward Berlin 2017: Robert Metzger - Keep it going - How to reliably a...
Flink Forward Berlin 2017: Robert Metzger - Keep it going - How to reliably a...Flink Forward Berlin 2017: Robert Metzger - Keep it going - How to reliably a...
Flink Forward Berlin 2017: Robert Metzger - Keep it going - How to reliably a...Flink Forward
 
Beyond unit tests: Deployment and testing for Hadoop/Spark workflows
Beyond unit tests: Deployment and testing for Hadoop/Spark workflowsBeyond unit tests: Deployment and testing for Hadoop/Spark workflows
Beyond unit tests: Deployment and testing for Hadoop/Spark workflowsDataWorks Summit
 
Netflix Keystone - How Netflix Handles Data Streams up to 11M Events/Sec
Netflix Keystone - How Netflix Handles Data Streams up to 11M Events/SecNetflix Keystone - How Netflix Handles Data Streams up to 11M Events/Sec
Netflix Keystone - How Netflix Handles Data Streams up to 11M Events/SecPeter Bakas
 
Netflix Keystone Pipeline at Big Data Bootcamp, Santa Clara, Nov 2015
Netflix Keystone Pipeline at Big Data Bootcamp, Santa Clara, Nov 2015Netflix Keystone Pipeline at Big Data Bootcamp, Santa Clara, Nov 2015
Netflix Keystone Pipeline at Big Data Bootcamp, Santa Clara, Nov 2015Monal Daxini
 
Reactive programming with examples
Reactive programming with examplesReactive programming with examples
Reactive programming with examplesPeter Lawrey
 
How Scylla Manager Handles Backups
How Scylla Manager Handles BackupsHow Scylla Manager Handles Backups
How Scylla Manager Handles BackupsScyllaDB
 
Streaming and Messaging
Streaming and MessagingStreaming and Messaging
Streaming and MessagingXin Wang
 
Strata Singapore: Gearpump Real time DAG-Processing with Akka at Scale
Strata Singapore: GearpumpReal time DAG-Processing with Akka at ScaleStrata Singapore: GearpumpReal time DAG-Processing with Akka at Scale
Strata Singapore: Gearpump Real time DAG-Processing with Akka at ScaleSean Zhong
 
QCON 2015: Gearpump, Realtime Streaming on Akka
QCON 2015: Gearpump, Realtime Streaming on AkkaQCON 2015: Gearpump, Realtime Streaming on Akka
QCON 2015: Gearpump, Realtime Streaming on AkkaSean Zhong
 
How Scylla Make Adding and Removing Nodes Faster and Safer
How Scylla Make Adding and Removing Nodes Faster and SaferHow Scylla Make Adding and Removing Nodes Faster and Safer
How Scylla Make Adding and Removing Nodes Faster and SaferScyllaDB
 
Realtime Statistics based on Apache Storm and RocketMQ
Realtime Statistics based on Apache Storm and RocketMQRealtime Statistics based on Apache Storm and RocketMQ
Realtime Statistics based on Apache Storm and RocketMQXin Wang
 
10 Devops-Friendly Database Must-Haves - Dor Laor, ScyllaDB - DevOpsDays Tel ...
10 Devops-Friendly Database Must-Haves - Dor Laor, ScyllaDB - DevOpsDays Tel ...10 Devops-Friendly Database Must-Haves - Dor Laor, ScyllaDB - DevOpsDays Tel ...
10 Devops-Friendly Database Must-Haves - Dor Laor, ScyllaDB - DevOpsDays Tel ...DevOpsDays Tel Aviv
 
Lightweight Transactions at Lightning Speed
Lightweight Transactions at Lightning SpeedLightweight Transactions at Lightning Speed
Lightweight Transactions at Lightning SpeedScyllaDB
 
HBaseCon2017 Quanta: Quora's hierarchical counting system on HBase
HBaseCon2017 Quanta: Quora's hierarchical counting system on HBaseHBaseCon2017 Quanta: Quora's hierarchical counting system on HBase
HBaseCon2017 Quanta: Quora's hierarchical counting system on HBaseHBaseCon
 
Gearpump akka streams
Gearpump akka streamsGearpump akka streams
Gearpump akka streamsKam Kasravi
 
Performance Tuning - Understanding Garbage Collection
Performance Tuning - Understanding Garbage CollectionPerformance Tuning - Understanding Garbage Collection
Performance Tuning - Understanding Garbage CollectionHaribabu Nandyal Padmanaban
 
How to manage large amounts of data with akka streams
How to manage large amounts of data with akka streamsHow to manage large amounts of data with akka streams
How to manage large amounts of data with akka streamsIgor Mielientiev
 

What's hot (20)

Seastar / ScyllaDB, or how we implemented a 10-times faster Cassandra
Seastar / ScyllaDB,  or how we implemented a 10-times faster CassandraSeastar / ScyllaDB,  or how we implemented a 10-times faster Cassandra
Seastar / ScyllaDB, or how we implemented a 10-times faster Cassandra
 
Taming GC Pauses for Humongous Java Heaps in Spark Graph Computing-(Eric Kacz...
Taming GC Pauses for Humongous Java Heaps in Spark Graph Computing-(Eric Kacz...Taming GC Pauses for Humongous Java Heaps in Spark Graph Computing-(Eric Kacz...
Taming GC Pauses for Humongous Java Heaps in Spark Graph Computing-(Eric Kacz...
 
Developing Scylla Applications: Practical Tips
Developing Scylla Applications: Practical TipsDeveloping Scylla Applications: Practical Tips
Developing Scylla Applications: Practical Tips
 
Flink Forward Berlin 2017: Robert Metzger - Keep it going - How to reliably a...
Flink Forward Berlin 2017: Robert Metzger - Keep it going - How to reliably a...Flink Forward Berlin 2017: Robert Metzger - Keep it going - How to reliably a...
Flink Forward Berlin 2017: Robert Metzger - Keep it going - How to reliably a...
 
Beyond unit tests: Deployment and testing for Hadoop/Spark workflows
Beyond unit tests: Deployment and testing for Hadoop/Spark workflowsBeyond unit tests: Deployment and testing for Hadoop/Spark workflows
Beyond unit tests: Deployment and testing for Hadoop/Spark workflows
 
Netflix Keystone - How Netflix Handles Data Streams up to 11M Events/Sec
Netflix Keystone - How Netflix Handles Data Streams up to 11M Events/SecNetflix Keystone - How Netflix Handles Data Streams up to 11M Events/Sec
Netflix Keystone - How Netflix Handles Data Streams up to 11M Events/Sec
 
Netflix Keystone Pipeline at Big Data Bootcamp, Santa Clara, Nov 2015
Netflix Keystone Pipeline at Big Data Bootcamp, Santa Clara, Nov 2015Netflix Keystone Pipeline at Big Data Bootcamp, Santa Clara, Nov 2015
Netflix Keystone Pipeline at Big Data Bootcamp, Santa Clara, Nov 2015
 
Reactive programming with examples
Reactive programming with examplesReactive programming with examples
Reactive programming with examples
 
How Scylla Manager Handles Backups
How Scylla Manager Handles BackupsHow Scylla Manager Handles Backups
How Scylla Manager Handles Backups
 
Streaming and Messaging
Streaming and MessagingStreaming and Messaging
Streaming and Messaging
 
Strata Singapore: Gearpump Real time DAG-Processing with Akka at Scale
Strata Singapore: GearpumpReal time DAG-Processing with Akka at ScaleStrata Singapore: GearpumpReal time DAG-Processing with Akka at Scale
Strata Singapore: Gearpump Real time DAG-Processing with Akka at Scale
 
QCON 2015: Gearpump, Realtime Streaming on Akka
QCON 2015: Gearpump, Realtime Streaming on AkkaQCON 2015: Gearpump, Realtime Streaming on Akka
QCON 2015: Gearpump, Realtime Streaming on Akka
 
How Scylla Make Adding and Removing Nodes Faster and Safer
How Scylla Make Adding and Removing Nodes Faster and SaferHow Scylla Make Adding and Removing Nodes Faster and Safer
How Scylla Make Adding and Removing Nodes Faster and Safer
 
Realtime Statistics based on Apache Storm and RocketMQ
Realtime Statistics based on Apache Storm and RocketMQRealtime Statistics based on Apache Storm and RocketMQ
Realtime Statistics based on Apache Storm and RocketMQ
 
10 Devops-Friendly Database Must-Haves - Dor Laor, ScyllaDB - DevOpsDays Tel ...
10 Devops-Friendly Database Must-Haves - Dor Laor, ScyllaDB - DevOpsDays Tel ...10 Devops-Friendly Database Must-Haves - Dor Laor, ScyllaDB - DevOpsDays Tel ...
10 Devops-Friendly Database Must-Haves - Dor Laor, ScyllaDB - DevOpsDays Tel ...
 
Lightweight Transactions at Lightning Speed
Lightweight Transactions at Lightning SpeedLightweight Transactions at Lightning Speed
Lightweight Transactions at Lightning Speed
 
HBaseCon2017 Quanta: Quora's hierarchical counting system on HBase
HBaseCon2017 Quanta: Quora's hierarchical counting system on HBaseHBaseCon2017 Quanta: Quora's hierarchical counting system on HBase
HBaseCon2017 Quanta: Quora's hierarchical counting system on HBase
 
Gearpump akka streams
Gearpump akka streamsGearpump akka streams
Gearpump akka streams
 
Performance Tuning - Understanding Garbage Collection
Performance Tuning - Understanding Garbage CollectionPerformance Tuning - Understanding Garbage Collection
Performance Tuning - Understanding Garbage Collection
 
How to manage large amounts of data with akka streams
How to manage large amounts of data with akka streamsHow to manage large amounts of data with akka streams
How to manage large amounts of data with akka streams
 

Similar to S3, Cassandra or Outer Space? Dumping Time Series Data using Spark

S3, Cassandra or Outer Space? Dumping Time Series Data using Spark - Demi Be...
S3, Cassandra or Outer Space? Dumping Time Series Data using Spark  - Demi Be...S3, Cassandra or Outer Space? Dumping Time Series Data using Spark  - Demi Be...
S3, Cassandra or Outer Space? Dumping Time Series Data using Spark - Demi Be...Codemotion
 
Scala like distributed collections - dumping time-series data with apache spark
Scala like distributed collections - dumping time-series data with apache sparkScala like distributed collections - dumping time-series data with apache spark
Scala like distributed collections - dumping time-series data with apache sparkDemi Ben-Ari
 
Migrating Data Pipeline from MongoDB to Cassandra
Migrating Data Pipeline from MongoDB to CassandraMigrating Data Pipeline from MongoDB to Cassandra
Migrating Data Pipeline from MongoDB to CassandraDemi Ben-Ari
 
From HDFS to S3: Migrate Pinterest Apache Spark Clusters
From HDFS to S3: Migrate Pinterest Apache Spark ClustersFrom HDFS to S3: Migrate Pinterest Apache Spark Clusters
From HDFS to S3: Migrate Pinterest Apache Spark ClustersDatabricks
 
Big Data processing with Apache Spark
Big Data processing with Apache SparkBig Data processing with Apache Spark
Big Data processing with Apache SparkLucian Neghina
 
AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned Omid Vahdaty
 
Stream, Stream, Stream: Different Streaming Methods with Spark and Kafka
Stream, Stream, Stream: Different Streaming Methods with Spark and KafkaStream, Stream, Stream: Different Streaming Methods with Spark and Kafka
Stream, Stream, Stream: Different Streaming Methods with Spark and KafkaDataWorks Summit
 
kranonit S06E01 Игорь Цинько: High load
kranonit S06E01 Игорь Цинько: High loadkranonit S06E01 Игорь Цинько: High load
kranonit S06E01 Игорь Цинько: High loadKrivoy Rog IT Community
 
Netflix Open Source Meetup Season 4 Episode 2
Netflix Open Source Meetup Season 4 Episode 2Netflix Open Source Meetup Season 4 Episode 2
Netflix Open Source Meetup Season 4 Episode 2aspyker
 
Spark Meetup at Uber
Spark Meetup at UberSpark Meetup at Uber
Spark Meetup at UberDatabricks
 
Building real time Data Pipeline using Spark Streaming
Building real time Data Pipeline using Spark StreamingBuilding real time Data Pipeline using Spark Streaming
Building real time Data Pipeline using Spark Streamingdatamantra
 
Stream, Stream, Stream: Different Streaming Methods with Apache Spark and Kafka
Stream, Stream, Stream: Different Streaming Methods with Apache Spark and KafkaStream, Stream, Stream: Different Streaming Methods with Apache Spark and Kafka
Stream, Stream, Stream: Different Streaming Methods with Apache Spark and KafkaDatabricks
 
Store stream data on Data Lake
Store stream data on Data LakeStore stream data on Data Lake
Store stream data on Data LakeMarcos Rebelo
 
Incremental Processing on Large Analytical Datasets with Prasanna Rajaperumal...
Incremental Processing on Large Analytical Datasets with Prasanna Rajaperumal...Incremental Processing on Large Analytical Datasets with Prasanna Rajaperumal...
Incremental Processing on Large Analytical Datasets with Prasanna Rajaperumal...Databricks
 
Hoodie: How (And Why) We built an analytical datastore on Spark
Hoodie: How (And Why) We built an analytical datastore on SparkHoodie: How (And Why) We built an analytical datastore on Spark
Hoodie: How (And Why) We built an analytical datastore on SparkVinoth Chandar
 
Stream, stream, stream: Different streaming methods with Spark and Kafka
Stream, stream, stream: Different streaming methods with Spark and KafkaStream, stream, stream: Different streaming methods with Spark and Kafka
Stream, stream, stream: Different streaming methods with Spark and KafkaItai Yaffe
 
NetflixOSS Meetup season 3 episode 1
NetflixOSS Meetup season 3 episode 1NetflixOSS Meetup season 3 episode 1
NetflixOSS Meetup season 3 episode 1Ruslan Meshenberg
 
AWS big-data-demystified #1.1 | Big Data Architecture Lessons Learned | English
AWS big-data-demystified #1.1  | Big Data Architecture Lessons Learned | EnglishAWS big-data-demystified #1.1  | Big Data Architecture Lessons Learned | English
AWS big-data-demystified #1.1 | Big Data Architecture Lessons Learned | EnglishOmid Vahdaty
 
AWS Techniques and lessons writing low cost autoscaling GitLab runners
AWS Techniques and lessons writing low cost autoscaling GitLab runnersAWS Techniques and lessons writing low cost autoscaling GitLab runners
AWS Techniques and lessons writing low cost autoscaling GitLab runnersAnthony Scata
 

Similar to S3, Cassandra or Outer Space? Dumping Time Series Data using Spark (20)

S3, Cassandra or Outer Space? Dumping Time Series Data using Spark - Demi Be...
S3, Cassandra or Outer Space? Dumping Time Series Data using Spark  - Demi Be...S3, Cassandra or Outer Space? Dumping Time Series Data using Spark  - Demi Be...
S3, Cassandra or Outer Space? Dumping Time Series Data using Spark - Demi Be...
 
Scala like distributed collections - dumping time-series data with apache spark
Scala like distributed collections - dumping time-series data with apache sparkScala like distributed collections - dumping time-series data with apache spark
Scala like distributed collections - dumping time-series data with apache spark
 
Migrating Data Pipeline from MongoDB to Cassandra
Migrating Data Pipeline from MongoDB to CassandraMigrating Data Pipeline from MongoDB to Cassandra
Migrating Data Pipeline from MongoDB to Cassandra
 
From HDFS to S3: Migrate Pinterest Apache Spark Clusters
From HDFS to S3: Migrate Pinterest Apache Spark ClustersFrom HDFS to S3: Migrate Pinterest Apache Spark Clusters
From HDFS to S3: Migrate Pinterest Apache Spark Clusters
 
Big Data processing with Apache Spark
Big Data processing with Apache SparkBig Data processing with Apache Spark
Big Data processing with Apache Spark
 
AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned
 
Stream, Stream, Stream: Different Streaming Methods with Spark and Kafka
Stream, Stream, Stream: Different Streaming Methods with Spark and KafkaStream, Stream, Stream: Different Streaming Methods with Spark and Kafka
Stream, Stream, Stream: Different Streaming Methods with Spark and Kafka
 
kranonit S06E01 Игорь Цинько: High load
kranonit S06E01 Игорь Цинько: High loadkranonit S06E01 Игорь Цинько: High load
kranonit S06E01 Игорь Цинько: High load
 
Netflix Open Source Meetup Season 4 Episode 2
Netflix Open Source Meetup Season 4 Episode 2Netflix Open Source Meetup Season 4 Episode 2
Netflix Open Source Meetup Season 4 Episode 2
 
Spark Meetup at Uber
Spark Meetup at UberSpark Meetup at Uber
Spark Meetup at Uber
 
Building real time Data Pipeline using Spark Streaming
Building real time Data Pipeline using Spark StreamingBuilding real time Data Pipeline using Spark Streaming
Building real time Data Pipeline using Spark Streaming
 
Stream, Stream, Stream: Different Streaming Methods with Apache Spark and Kafka
Stream, Stream, Stream: Different Streaming Methods with Apache Spark and KafkaStream, Stream, Stream: Different Streaming Methods with Apache Spark and Kafka
Stream, Stream, Stream: Different Streaming Methods with Apache Spark and Kafka
 
Store stream data on Data Lake
Store stream data on Data LakeStore stream data on Data Lake
Store stream data on Data Lake
 
Incremental Processing on Large Analytical Datasets with Prasanna Rajaperumal...
Incremental Processing on Large Analytical Datasets with Prasanna Rajaperumal...Incremental Processing on Large Analytical Datasets with Prasanna Rajaperumal...
Incremental Processing on Large Analytical Datasets with Prasanna Rajaperumal...
 
Hoodie: How (And Why) We built an analytical datastore on Spark
Hoodie: How (And Why) We built an analytical datastore on SparkHoodie: How (And Why) We built an analytical datastore on Spark
Hoodie: How (And Why) We built an analytical datastore on Spark
 
Stream, stream, stream: Different streaming methods with Spark and Kafka
Stream, stream, stream: Different streaming methods with Spark and KafkaStream, stream, stream: Different streaming methods with Spark and Kafka
Stream, stream, stream: Different streaming methods with Spark and Kafka
 
NetflixOSS Meetup season 3 episode 1
NetflixOSS Meetup season 3 episode 1NetflixOSS Meetup season 3 episode 1
NetflixOSS Meetup season 3 episode 1
 
AWS big-data-demystified #1.1 | Big Data Architecture Lessons Learned | English
AWS big-data-demystified #1.1  | Big Data Architecture Lessons Learned | EnglishAWS big-data-demystified #1.1  | Big Data Architecture Lessons Learned | English
AWS big-data-demystified #1.1 | Big Data Architecture Lessons Learned | English
 
AWS Techniques and lessons writing low cost autoscaling GitLab runners
AWS Techniques and lessons writing low cost autoscaling GitLab runnersAWS Techniques and lessons writing low cost autoscaling GitLab runners
AWS Techniques and lessons writing low cost autoscaling GitLab runners
 
Netty training
Netty trainingNetty training
Netty training
 

More from Codemotion Tel Aviv

Keynote: Trends in Modern Application Development - Gilly Dekel, IBM
Keynote: Trends in Modern Application Development - Gilly Dekel, IBMKeynote: Trends in Modern Application Development - Gilly Dekel, IBM
Keynote: Trends in Modern Application Development - Gilly Dekel, IBMCodemotion Tel Aviv
 
Angular is one fire(base)! - Shmuela Jacobs
Angular is one fire(base)! - Shmuela JacobsAngular is one fire(base)! - Shmuela Jacobs
Angular is one fire(base)! - Shmuela JacobsCodemotion Tel Aviv
 
Demystifying docker networking black magic - Lorenzo Fontana, Kiratech
Demystifying docker networking black magic - Lorenzo Fontana, KiratechDemystifying docker networking black magic - Lorenzo Fontana, Kiratech
Demystifying docker networking black magic - Lorenzo Fontana, KiratechCodemotion Tel Aviv
 
Faster deep learning solutions from training to inference - Amitai Armon & Ni...
Faster deep learning solutions from training to inference - Amitai Armon & Ni...Faster deep learning solutions from training to inference - Amitai Armon & Ni...
Faster deep learning solutions from training to inference - Amitai Armon & Ni...Codemotion Tel Aviv
 
Facts about multithreading that'll keep you up at night - Guy Bar on, Vonage
Facts about multithreading that'll keep you up at night - Guy Bar on, VonageFacts about multithreading that'll keep you up at night - Guy Bar on, Vonage
Facts about multithreading that'll keep you up at night - Guy Bar on, VonageCodemotion Tel Aviv
 
Master the Art of the AST (and Take Control of Your JS!) - Yonatan Mevorach, ...
Master the Art of the AST (and Take Control of Your JS!) - Yonatan Mevorach, ...Master the Art of the AST (and Take Control of Your JS!) - Yonatan Mevorach, ...
Master the Art of the AST (and Take Control of Your JS!) - Yonatan Mevorach, ...Codemotion Tel Aviv
 
Unleash the power of angular Reactive Forms - Nir Kaufman, 500Tech
Unleash the power of angular Reactive Forms - Nir Kaufman, 500TechUnleash the power of angular Reactive Forms - Nir Kaufman, 500Tech
Unleash the power of angular Reactive Forms - Nir Kaufman, 500TechCodemotion Tel Aviv
 
Can we build an Azure IoT controlled device in less than 40 minutes that cost...
Can we build an Azure IoT controlled device in less than 40 minutes that cost...Can we build an Azure IoT controlled device in less than 40 minutes that cost...
Can we build an Azure IoT controlled device in less than 40 minutes that cost...Codemotion Tel Aviv
 
Actors and Microservices - Can two walk together? - Rotem Hermon, Gigya
Actors and Microservices - Can two walk together? - Rotem Hermon, GigyaActors and Microservices - Can two walk together? - Rotem Hermon, Gigya
Actors and Microservices - Can two walk together? - Rotem Hermon, GigyaCodemotion Tel Aviv
 
How to Leverage Machine Learning (R, Hadoop, Spark, H2O) for Real Time Proces...
How to Leverage Machine Learning (R, Hadoop, Spark, H2O) for Real Time Proces...How to Leverage Machine Learning (R, Hadoop, Spark, H2O) for Real Time Proces...
How to Leverage Machine Learning (R, Hadoop, Spark, H2O) for Real Time Proces...Codemotion Tel Aviv
 
My Minecraft Smart Home: Prototyping the internet of uncanny things - Sascha ...
My Minecraft Smart Home: Prototyping the internet of uncanny things - Sascha ...My Minecraft Smart Home: Prototyping the internet of uncanny things - Sascha ...
My Minecraft Smart Home: Prototyping the internet of uncanny things - Sascha ...Codemotion Tel Aviv
 
Containerised ASP.NET Core apps with Kubernetes
Containerised ASP.NET Core apps with KubernetesContainerised ASP.NET Core apps with Kubernetes
Containerised ASP.NET Core apps with KubernetesCodemotion Tel Aviv
 
Fullstack DDD with ASP.NET Core and Anguar 2 - Ronald Harmsen, NForza
Fullstack DDD with ASP.NET Core and Anguar 2 - Ronald Harmsen, NForzaFullstack DDD with ASP.NET Core and Anguar 2 - Ronald Harmsen, NForza
Fullstack DDD with ASP.NET Core and Anguar 2 - Ronald Harmsen, NForzaCodemotion Tel Aviv
 
The Art of Decomposing Monoliths - Kfir Bloch, Wix
The Art of Decomposing Monoliths - Kfir Bloch, WixThe Art of Decomposing Monoliths - Kfir Bloch, Wix
The Art of Decomposing Monoliths - Kfir Bloch, WixCodemotion Tel Aviv
 
SOA Lessons Learnt (or Microservices done Better) - Sean Farmar, Particular S...
SOA Lessons Learnt (or Microservices done Better) - Sean Farmar, Particular S...SOA Lessons Learnt (or Microservices done Better) - Sean Farmar, Particular S...
SOA Lessons Learnt (or Microservices done Better) - Sean Farmar, Particular S...Codemotion Tel Aviv
 
Getting Physical with Web Bluetooth - Uri Shaked, BlackBerry
Getting Physical with Web Bluetooth - Uri Shaked, BlackBerryGetting Physical with Web Bluetooth - Uri Shaked, BlackBerry
Getting Physical with Web Bluetooth - Uri Shaked, BlackBerryCodemotion Tel Aviv
 
Web based virtual reality - Tanay Pant, Mozilla
Web based virtual reality - Tanay Pant, MozillaWeb based virtual reality - Tanay Pant, Mozilla
Web based virtual reality - Tanay Pant, MozillaCodemotion Tel Aviv
 
Material Design Demytified - Ran Nachmany, Google
Material Design Demytified - Ran Nachmany, GoogleMaterial Design Demytified - Ran Nachmany, Google
Material Design Demytified - Ran Nachmany, GoogleCodemotion Tel Aviv
 
All the reasons for choosing react js that you didn't know about - Avi Marcus...
All the reasons for choosing react js that you didn't know about - Avi Marcus...All the reasons for choosing react js that you didn't know about - Avi Marcus...
All the reasons for choosing react js that you didn't know about - Avi Marcus...Codemotion Tel Aviv
 
Mobile Security Attacks: A Glimpse from the Trenches - Yair Amit, Skycure
Mobile Security Attacks: A Glimpse from the Trenches - Yair Amit, SkycureMobile Security Attacks: A Glimpse from the Trenches - Yair Amit, Skycure
Mobile Security Attacks: A Glimpse from the Trenches - Yair Amit, SkycureCodemotion Tel Aviv
 

More from Codemotion Tel Aviv (20)

Keynote: Trends in Modern Application Development - Gilly Dekel, IBM
Keynote: Trends in Modern Application Development - Gilly Dekel, IBMKeynote: Trends in Modern Application Development - Gilly Dekel, IBM
Keynote: Trends in Modern Application Development - Gilly Dekel, IBM
 
Angular is one fire(base)! - Shmuela Jacobs
Angular is one fire(base)! - Shmuela JacobsAngular is one fire(base)! - Shmuela Jacobs
Angular is one fire(base)! - Shmuela Jacobs
 
Demystifying docker networking black magic - Lorenzo Fontana, Kiratech
Demystifying docker networking black magic - Lorenzo Fontana, KiratechDemystifying docker networking black magic - Lorenzo Fontana, Kiratech
Demystifying docker networking black magic - Lorenzo Fontana, Kiratech
 
Faster deep learning solutions from training to inference - Amitai Armon & Ni...
Faster deep learning solutions from training to inference - Amitai Armon & Ni...Faster deep learning solutions from training to inference - Amitai Armon & Ni...
Faster deep learning solutions from training to inference - Amitai Armon & Ni...
 
Facts about multithreading that'll keep you up at night - Guy Bar on, Vonage
Facts about multithreading that'll keep you up at night - Guy Bar on, VonageFacts about multithreading that'll keep you up at night - Guy Bar on, Vonage
Facts about multithreading that'll keep you up at night - Guy Bar on, Vonage
 
Master the Art of the AST (and Take Control of Your JS!) - Yonatan Mevorach, ...
Master the Art of the AST (and Take Control of Your JS!) - Yonatan Mevorach, ...Master the Art of the AST (and Take Control of Your JS!) - Yonatan Mevorach, ...
Master the Art of the AST (and Take Control of Your JS!) - Yonatan Mevorach, ...
 
Unleash the power of angular Reactive Forms - Nir Kaufman, 500Tech
Unleash the power of angular Reactive Forms - Nir Kaufman, 500TechUnleash the power of angular Reactive Forms - Nir Kaufman, 500Tech
Unleash the power of angular Reactive Forms - Nir Kaufman, 500Tech
 
Can we build an Azure IoT controlled device in less than 40 minutes that cost...
Can we build an Azure IoT controlled device in less than 40 minutes that cost...Can we build an Azure IoT controlled device in less than 40 minutes that cost...
Can we build an Azure IoT controlled device in less than 40 minutes that cost...
 
Actors and Microservices - Can two walk together? - Rotem Hermon, Gigya
Actors and Microservices - Can two walk together? - Rotem Hermon, GigyaActors and Microservices - Can two walk together? - Rotem Hermon, Gigya
Actors and Microservices - Can two walk together? - Rotem Hermon, Gigya
 
How to Leverage Machine Learning (R, Hadoop, Spark, H2O) for Real Time Proces...
How to Leverage Machine Learning (R, Hadoop, Spark, H2O) for Real Time Proces...How to Leverage Machine Learning (R, Hadoop, Spark, H2O) for Real Time Proces...
How to Leverage Machine Learning (R, Hadoop, Spark, H2O) for Real Time Proces...
 
My Minecraft Smart Home: Prototyping the internet of uncanny things - Sascha ...
My Minecraft Smart Home: Prototyping the internet of uncanny things - Sascha ...My Minecraft Smart Home: Prototyping the internet of uncanny things - Sascha ...
My Minecraft Smart Home: Prototyping the internet of uncanny things - Sascha ...
 
Containerised ASP.NET Core apps with Kubernetes
Containerised ASP.NET Core apps with KubernetesContainerised ASP.NET Core apps with Kubernetes
Containerised ASP.NET Core apps with Kubernetes
 
Fullstack DDD with ASP.NET Core and Anguar 2 - Ronald Harmsen, NForza
Fullstack DDD with ASP.NET Core and Anguar 2 - Ronald Harmsen, NForzaFullstack DDD with ASP.NET Core and Anguar 2 - Ronald Harmsen, NForza
Fullstack DDD with ASP.NET Core and Anguar 2 - Ronald Harmsen, NForza
 
The Art of Decomposing Monoliths - Kfir Bloch, Wix
The Art of Decomposing Monoliths - Kfir Bloch, WixThe Art of Decomposing Monoliths - Kfir Bloch, Wix
The Art of Decomposing Monoliths - Kfir Bloch, Wix
 
SOA Lessons Learnt (or Microservices done Better) - Sean Farmar, Particular S...
SOA Lessons Learnt (or Microservices done Better) - Sean Farmar, Particular S...SOA Lessons Learnt (or Microservices done Better) - Sean Farmar, Particular S...
SOA Lessons Learnt (or Microservices done Better) - Sean Farmar, Particular S...
 
Getting Physical with Web Bluetooth - Uri Shaked, BlackBerry
Getting Physical with Web Bluetooth - Uri Shaked, BlackBerryGetting Physical with Web Bluetooth - Uri Shaked, BlackBerry
Getting Physical with Web Bluetooth - Uri Shaked, BlackBerry
 
Web based virtual reality - Tanay Pant, Mozilla
Web based virtual reality - Tanay Pant, MozillaWeb based virtual reality - Tanay Pant, Mozilla
Web based virtual reality - Tanay Pant, Mozilla
 
Material Design Demytified - Ran Nachmany, Google
Material Design Demytified - Ran Nachmany, GoogleMaterial Design Demytified - Ran Nachmany, Google
Material Design Demytified - Ran Nachmany, Google
 
All the reasons for choosing react js that you didn't know about - Avi Marcus...
All the reasons for choosing react js that you didn't know about - Avi Marcus...All the reasons for choosing react js that you didn't know about - Avi Marcus...
All the reasons for choosing react js that you didn't know about - Avi Marcus...
 
Mobile Security Attacks: A Glimpse from the Trenches - Yair Amit, Skycure
Mobile Security Attacks: A Glimpse from the Trenches - Yair Amit, SkycureMobile Security Attacks: A Glimpse from the Trenches - Yair Amit, Skycure
Mobile Security Attacks: A Glimpse from the Trenches - Yair Amit, Skycure
 

Recently uploaded

So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfpanagenda
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Mark Goldstein
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality AssuranceInflectra
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...Wes McKinney
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfNeo4j
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersRaghuram Pandurangan
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersNicole Novielli
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
Data governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationData governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationKnoldus Inc.
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsRavi Sanghani
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Hiroshi SHIBATA
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentPim van der Noll
 
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...panagenda
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityIES VE
 

Recently uploaded (20)

So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdf
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software Developers
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
Data governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationData governance with Unity Catalog Presentation
Data governance with Unity Catalog Presentation
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and Insights
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
 
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a reality
 

S3, Cassandra or Outer Space? Dumping Time Series Data using Spark

  • 1. S3, Cassandra or Outer Space? Dumping Time Series Data using Spark Demi Ben-Ari - VP R&D @ Tel-Aviv 30 MARCH 2017
  • 2. About Me Demi Ben-Ari, Co-Founder & VP R&D @ Panorays ● BS’c Computer Science – Academic College Tel-Aviv Yaffo ● Co-Founder ○ “Big Things” Big Data Community ○ Google Developer Group Cloud In the Past: ● Sr. Data Engineer - Windward ● Team Leader & Sr. Software Engineer Missile defense and Alert System - “Ofek” – IAF Interested in almost every kind of technology – A True Geek
  • 3. Agenda ● Apache Spark brief overview and Catch Up ● Data flow and Environment ● What’s our time series data like? ● Where we started from - where we got to ○ Problems and our decisions ○ Evolution of the solution ● Conclusions
  • 5. Scala & Spark (Architecture) Scala REPL Scala Compiler Spark Runtime Scala Runtime JVM File System (eg. HDFS, Cassandra, S3..) Cluster Manager (eg. Yarn, Mesos)
  • 6. What kind of DSL is Apache Spark ● Centered around Collections ● Immutable data sets equipped with functional transformations ● These are exactly the Scala collection operations map flatMap filter ... reduce fold aggregate ... union intersection ...
  • 7. Spark is A Multi-Language Platform ● Why to use Scala instead of Python? ○ Native to Spark, Can use everything without translation ○ Types help
  • 10. United Tools Platform - Single Framework Batch InteractiveStreaming Single Framework
  • 11. Data flow and Environment (Our Use Case)
  • 12. Structure of the Data ● Maritime Analytics Platform ● Geo Locations + Metadata ● Arriving over time ● Different types of messages being reported by satellites ● Encoded (For Compression purposes) ● Might arrive later than actually transmitted
  • 13. Data Flow Diagram External Data Source Analytics Layers Data Pipeline Parsed Raw Entity Resolution Process Building insights on top of the entities Data Output Layer Anomaly Detection Trends
  • 15. Basic Terms ● Missing Parts in Time Series Data ◦ Data arriving from the satellites ● Might be causing delays because of bad transmission ◦ Data vendors delaying the data stream ◦ Calculation in Layers may cause Holes in the Data ● Calculating the Data layers by time slices
  • 16. Basic Terms ● Idempotence is the property of certain operations in mathematics and computer science, that can be applied multiple times without changing the result beyond the initial application. ● Function: Same input => Same output
  • 17. Basic Terms ● Partitions == Parallelism ◦ Physical / Logical partitioning ● Resilient Distributed Datasets (RDDs) == Collections ◦ fault-tolerant collection of elements that can be operated on in parallel. ◦ Applying immutable transformations and actions over RDDs
  • 20. The Problem - Receiving DATA Beginning state, no data, and the timeline begins T = 0 Level 3 Entity Level 2 Entity Level 1 Entity
  • 21. The Problem - Receiving DATA T = 10 Level 3 Entity Level 2 Entity Level 1 Entity Computation sliding window size Level 1 entities data arrives and gets stored
  • 22. The Problem - Receiving DATA T = 10 Level 3 Entity Level 2 Entity Level 1 Entity Computation sliding window size Level 3 entities are created on top of Level 2’s Data (Decreased amount of data) Level 2 entities are created on top of Level 1’s Data (Decreased amount of data)
  • 23. The Problem - Receiving DATA T = 20 Level 3 Entity Level 2 Entity Level 1 Entity Computation sliding window size Because of the sliding window’s back size, level 2 and 3 entities would not be created properly and there would be “Holes” in the Data Level 1 entity's data arriving late
  • 24. Solution to the Problem ● Creating Dependent Micro services forming a data pipeline ◦ Mainly Apache Spark applications ◦ Services are only dependent on the Data - not the previous service’s run ● Forming a structure and scheduling of “Back Sliding Window” ◦ Know your data and its relevance through time ◦ Don’t try to foresee the future – it might Bias the results
  • 25. How it looks like in the end... Level 3 Entity Level 2 Entity Level 1 Entity 6 Hour time slot 12 Hours of Data A Week of Data More than a Week of Data
  • 26. Starting point & Infrastructure
  • 27. How we started? ● Spark Standalone – via ec2 scripts ◦ Around 5 nodes (r3.xlarge instances) ◦ Didn’t want to keep a persistent HDFS – Costs a lot ◦ 100 GB (per day) => ~150 TB for 4 years ◦ Cost for server per year (r3.xlarge): - On demand: ~2900$ - Reserved: ~1750$ ● Know your costs: http://www.ec2instances.info/
  • 29. Decision ● Working with S3 as the persistence layer ◦ Pay extra for - Put (0.005 per 1000 requests) - Get (0.004 per 10,000 requests) ◦ 150TB => ~210$ for 4 years of Data ● Same format as HDFS (CSV files) ◦ s3n://some-bucket/entity1/201412010000/part-00000 ◦ s3n://some-bucket/entity1/201412010000/part-00001 ◦ ……
  • 30. What about the serving?
  • 31. MongoDB for Serving Worker 1 Worker 2 …. …. … … Worker N MongoDB Replica Set Spark Cluster Master Write Read
  • 32. Spark Slave - Server Specs ● Instance Type: r3.xlarge ● CPU’s: 4 ● RAM: 30.5GB ● Storage: ephemeral ● Amount: 10+
  • 33. MongoDB - Server Specs ● MongoDB version: 2.6.1 ● Instance Type: m3.xlarge (AWS) ● CPU’s: 4 ● RAM: 15GB ● Storage: EBS ● DB Size: ~500GB ● Collection Indexes: 5 (4 compound)
  • 34. The Problem ● Batch jobs ◦ Should run for 5-10 minutes in total ◦ Actual - runs for ~40 minutes ● Why? ◦ ~20 minutes to write with the Java mongo driver – Async (Unacknowledged) ◦ ~20 minutes to sync the journal ◦ Total: ~ 40 Minutes of the DB being unavailable ◦ No batch process response and no UI serving
  • 35. Alternative Solutions ● Sharded MongoDB (With replica sets) ◦ Pros: - Increases Throughput by the amount of shards - Increases the availability of the DB ◦ Cons: - Very hard to manage DevOps wise (for a small team of developers) - High cost of servers – because each shared need 3 replicas
  • 36. Workflow with MongoDB Worker 1 Worker 2 …. …. … … Worker N Spark Cluster Master Write Read Master
  • 37. Our DevOps – After that solution We had no DevOps guy at that time at all ☹
  • 38. Alternative Solutions ● Apache Cassandra ◦ Pros: - Very large developer community - Linearly scalable Database - No single master architecture - Proven working with distributed engines like Apache Spark ◦ Cons: - We had no experience at all with the Database - No Geo Spatial Index – Needed to implement by ourselves
  • 39. The Solution ● Migration to Apache Cassandra ● Create easily a Cassandra cluster using DataStax Community AMI on AWS ◦ First easy step – Using the spark-cassandra-connector (Easy bootstrap move to Spark ⬄ Cassandra) ◦ Creating a monitoring dashboard to Cassandra ● Second phase: ◦ Creating a self managed and self provisioned Cassandra Cluster ◦ Tuning the hell out of it!!!
  • 40. Workflow with Cassandra Worker 1 Worker 2 …. …. … … Worker N Cassandra Cluster Spark Cluster Write Read
  • 41. Result ● Performance improvement ◦ Batch write parts of the job run in 3 minutes instead of ~ 40 minutes in MongoDB ● Took 2 weeks to go from “Zero to Hero”, and to ramp up a running solution that work without glitches
  • 43. Transferring the Heaviest Process ● Micro service that runs every 10 minutes ● Writes to Cassandra 30GB per iteration ◦ (Replication factor 3 => 90GB) ● At first took us 18 minutes to do all of the writes ◦ Not Acceptable in a 10 minute process
  • 45. Transferring the Heaviest Process ● Solutions ◦ We chose the i2.xlarge ◦ Optimization of the Cluster ◦ Changing the JDK to Java-8 - Changing the GC algorithm to G1 ◦ Tuning the Operation system - Ulimit, removing the swap ◦ Write time went down to ~5 minutes (For 30GB RF=3) Sounds good right? I don’t think so
  • 47. The Solution ● Taking the same Data Model that we held in Cassandra (All of the Raw data per 10 minutes) and put it on S3 ◦ Write time went down from ~5 minutes to 1.5 minutes ● Added another process, not dependent on the main one, happens every 15 minutes ◦ Reads from S3, downscales the amount and Writes them to Cassandra for serving
  • 48. Parsed Raw Static / Aggregated Data Spark Analytics Layers UI Serving Downscaled Data Heavy Fusion Process How it looks after all?
  • 49. Conclusion ● Always give an estimate to your data ◦ Frequency ◦ Volume ◦ Arrangement of the previous phase ● There is no “Best” persistence layer ◦ There is the right one for the job ◦ Don’t overload an existing solution
  • 50. Conclusion ● Spark is a great framework for distributed collections ◦ Fully functional API ◦ Can perform imperative actions ● “With great power, comes lots of partitioning” ◦ Control your work and data distribution via partitions ● https://www.pinterest.com/pin/155514993354583499/ (Thanks)
  • 52. ● LinkedIn ● Twitter: @demibenari ● Blog: http://progexc.blogspot.com/ ● demi.benari@gmail.com ● “Big Things” Community Meetup, YouTube, Facebook, Twitter ● GDG Cloud