Modern Lambda architecture in Big Data
Piotr Hejwowski
Hello world :)
■ Who am I ?
■ Java developer working at Codete
■ Keen on Big Data and the modern backend approach
■ Luckily, I can develop this passion at Codete
■ https://github.com/Hejwo
■ piotr.hejwowski@codete.com
■ Disclaimer 1 - we will use Polish, but with a lot of English, business-specific terms.
■ Disclaimer 2 - The discipline is large, so we are going to cover only the bigger picture
■ Disclaimer 3 - Live coding ? Next time
■ Disclaimer 4 - From zero to hero style
Recap & Intro
■ Recap - at the end of the last GDG we talked about Machine Learning
■ We talked about the difference between Data Science and Big Data, which are often confused
Recap
Data science
■ Data science, also known as data-driven science, is an interdisciplinary field about scientific methods and processes to extract
knowledge or insights from data in various forms, either structured or unstructured
■ Data science focuses on cleaning gathered data, math, statistics, business understanding and extracting valuable information
Big data
■ Modern methods of gathering and processing big volumes of data
■ More info in next 40 mins ;)
What’s Big Data ?
■ The amount of our data is getting larger and larger
■ The Internet of Things plays an important role in it -> sensors, sensors are everywhere !
■ At some point EVEN business guys discovered that there’s great value behind unstructured data
■ ETLs on a massive scale
■ Recommendation systems based on FB likes
■ Analysing user traffic in e-shops and optimizing content
■ Raw data from car sensors
■ Optimizing traffic like in Lublin :)
■ POTENTIAL and AMOUNT of data that we need is HUGE
■ Fun fact - having raw data means that we don’t know what we’re looking for and that’s great !!!
■ Discovering new relations in our data
But… When Big Data ?
How to process Big Data ?
Moore’s law is dying [*]
“Moore's law is the observation that the number of transistors in a
dense integrated circuit doubles approximately every two years”
■ Right now every further step in transistor density is getting more and
more expensive.
■ New processors are getting more and more expensive.
■ Until now we could rely on Moore's law: if our
infrastructure was not keeping up, after two years we could
buy faster hardware for approximately the same cost.
■ But… we still have many cores. But… sometimes distributing
work across many cores is still not enough.
How to process Big Data ? - Scale up vs. Scale out
Scale up
■ Costly components
■ Complex application/system logic, often multithreaded
■ Poor fault tolerance
■ The machine gets as hot as Mordor.
Scale out
■ Cheaper machines
■ Simpler application and system logic
■ Thanks to orchestration tools such as Mesos and Kubernetes it’s not THAT hard to maintain.
■ Fault tolerance - if half of our machines explode, we can still do something
■ Needs data centers :(
Meet Apache Spark - Big Data processing engine !
■ Created at the University of California, Berkeley
■ In the beginning it was a proof of concept for the Mesos cluster manager
■ Much faster than its father - Hadoop
■ By default it operates in memory.
■ No frequent disk writes means more speed
■ Rich and simple caching mechanism
■ There are a ton of other Big Data processing engines - Hadoop, Storm, Flink, Splunk
■ We're going to focus on Spark due to limited time
Is Big Data processing THE only direction ?
Spark is faster than Hadoop, but still… it’s heavy machinery
Is Big Data THE only direction ?
Reactive Manifesto
■ Responsive - What happens when the Wi-Fi is down ? Users want FAST responses !
■ Elastic - Large systems tend to have frequent, massive loads
■ Resilient - The system must stay available, and any kind of response is better than no response.
■ Message Driven - isolation and non-blocking behaviour are achieved via async communication. Thanks to that we have clear boundaries, isolation and transparency.
How to achieve these two goals ? Let’s go lambda !
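A minimal sketch of the lambda idea, assuming the usual split into a batch layer, a speed layer and a serving layer. The event data and the function names (batch_view, speed_view, query) are made up for illustration and are not taken from the talk:

```python
from collections import Counter

# Hypothetical page-view events: (timestamp, page). Illustrative data only.
master_dataset = [(1, "home"), (2, "cart"), (3, "home")]  # already seen by the batch layer
recent_events = [(4, "home"), (5, "checkout")]            # arrived after the last batch run


def batch_view(events):
    """Batch layer: recompute the full view from all historical data (slow, complete)."""
    return Counter(page for _, page in events)


def speed_view(events):
    """Speed layer: incremental view over events the batch has not seen yet (fast, partial)."""
    view = Counter()
    for _, page in events:
        view[page] += 1
    return view


def query(page, batch, speed):
    """Serving layer: merge both views at query time."""
    return batch[page] + speed[page]


batch = batch_view(master_dataset)
speed = speed_view(recent_events)
print(query("home", batch, speed))      # 3
print(query("checkout", batch, speed))  # 1
```

In a real system the batch view would typically come from a Spark job and the speed view from a streaming job; here both are plain in-memory counters so the merge step is easy to see.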
Meet our systems heart - Apache Kafka
■ Lightning-fast messaging system
■ Heart of a Big Data system
■ Distributed
■ Built by LinkedIn
■ Written in Scala and Java
■ Producers and Consumers concept
■ Automatic recovery, broker detection
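The producers and consumers concept can be sketched with a toy in-memory log. This is not the Kafka client API; the Topic class, its methods and the group names are hypothetical, and a real topic is partitioned and replicated across brokers:

```python
class Topic:
    """Toy append-only log; a stand-in for a single topic partition."""

    def __init__(self):
        self._log = []
        self._offsets = {}  # consumer group name -> next offset to read

    def produce(self, message):
        """Producer side: append to the end of the log and return the message offset."""
        self._log.append(message)
        return len(self._log) - 1

    def poll(self, group, max_messages=10):
        """Consumer side: read from the group's own offset, then advance it."""
        start = self._offsets.get(group, 0)
        batch = self._log[start:start + max_messages]
        self._offsets[group] = start + len(batch)
        return batch


topic = Topic()
for reading in ("temp=21", "temp=22", "temp=35"):
    topic.produce(reading)

print(topic.poll("dashboard", max_messages=2))  # ['temp=21', 'temp=22']
print(topic.poll("archiver"))                   # ['temp=21', 'temp=22', 'temp=35']
print(topic.poll("dashboard"))                  # ['temp=35']
```

Each consumer group keeps its own position, so a slow consumer does not hold back a fast one and producers never need to know who is reading.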
We’ve got two pieces of the puzzle !
Spark Streaming - when batch is not enough
■ Spark Streaming makes it easy to build scalable fault-tolerant streaming applications.
■ By running on Spark, Spark Streaming lets you reuse the same code for batch processing, join streams against historical data, or run
ad-hoc queries on stream state.
■ Works as very fast micro-batching
■ Before Spark Streaming, building complex pipelines that encompass streaming, batch, or even machine learning capabilities with open
source software meant dealing with multiple frameworks
■ Streaming ETL – Data is continuously cleaned and aggregated before being pushed into data stores. No more SAP.
■ Triggers – Anomalous behavior is detected in real-time and further downstream actions are triggered accordingly.
■ Data enrichment – Live data is enriched with more information by joining it with a static dataset allowing for a more complete real-time
analysis.
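The micro-batching, trigger and data enrichment ideas above can be sketched in plain Python. This is not Spark Streaming code; the sensor events, the sensor_location lookup, the batch interval and the alert threshold are all assumptions made for the example:

```python
from collections import defaultdict

# Hypothetical sensor events: (timestamp in seconds, sensor id, value).
events = [
    (0.2, "s1", 20.0), (0.7, "s2", 21.5),
    (1.1, "s1", 95.0), (1.8, "s2", 22.0),
    (2.4, "s1", 20.5),
]
# Static dataset used for enrichment (assumed to fit in memory).
sensor_location = {"s1": "engine", "s2": "cabin"}

BATCH_INTERVAL = 1.0    # seconds per micro-batch (assumed value)
ALERT_THRESHOLD = 90.0  # values above this count as anomalous (assumed value)


def micro_batches(stream, interval):
    """Group events into consecutive time windows of `interval` seconds."""
    batches = defaultdict(list)
    for ts, sensor, value in stream:
        batches[int(ts // interval)].append((sensor, value))
    return [batches[k] for k in sorted(batches)]


def process(batch):
    """Enrich each record with its location and collect the anomalous ones."""
    enriched = [(sensor_location.get(s, "unknown"), v) for s, v in batch]
    alerts = [(loc, v) for loc, v in enriched if v > ALERT_THRESHOLD]
    return enriched, alerts


for number, batch in enumerate(micro_batches(events, BATCH_INTERVAL)):
    enriched, alerts = process(batch)
    print(number, enriched, alerts)
```

The same process function could just as well be run over a whole day of stored events, which is the "reuse the same code for batch processing" point in miniature.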
We’re done with the puzzle !
Now… let’s store it ! NoSql store it !
Why NoSQL ?
■ Large datasets
■ Easy to scale out
■ Less schema validation on write means faster writes
■ Schemaless databases can be of great value in Big Data, as we
sometimes don’t know what we need and we want to keep our
data raw (dirty).
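A small sketch of the schemaless idea, assuming documents are stored as raw JSON strings and interpreted only when read. The records and the read_likes helper are hypothetical:

```python
import json

# Raw documents as they arrived; fields differ between records (illustrative data).
raw_store = [
    '{"user": "anna", "likes": ["spark", "kafka"]}',
    '{"user": "jan", "age": 31}',
    '{"user": "ola", "likes": ["nosql"], "city": "Lublin"}',
]


def read_likes(document):
    """Schema-on-read: interpret a document at query time, defaulting missing fields."""
    record = json.loads(document)
    return record["user"], record.get("likes", [])


likes_per_user = dict(read_likes(doc) for doc in raw_store)
print(likes_per_user)
```

Nothing was rejected on write, so a question invented later (for example "which city is each user from ?") can still be asked of the same raw documents.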
Why it’s modern ?
■ Fast, reliable
■ More like - write once, run everywhere thanks to Spark, Spark Shell, Zeppelin
■ Less code (Hadoop’s MapReduce ? It’s an essay)
■ Compared to the older approach - less chaos thanks to Kafka.
Cons
■ More like micro-batching than real time
■ A lot of stuff is still evolving (Spark, Kafka) and hasn’t got professional customer support
■ Things tend to get complicated when Kafka messages within a single topic evolve
■ DevOps needed, strong developers needed
■ The distributed world is a complicated world
■ Thousands of frameworks and ideas every year
What next ?
Apache Spark resources :
■ http://spark.apache.org/
■ https://hortonworks.com/tutorials/
■ https://codete.com/blog/
Apache Kafka resources :
■ http://kafka.apache.org/
■ http://blog.cloudera.com/blog/2014/09/apache-kafka-for-beginners/
NoSql resources :
■ http://openmymind.net/2011/8/15/How-You-Should-Go-About-Learning-NoSQL/
Sources
■ Internet in a minute : http://www.visualcapitalist.com/what-happens-internet-minute-2016/
■ Big Data and V4’s : https://www.linkedin.com/pulse/20140306073407-64875646-big-data-the-5-vs-everyone-must-know
■ Moore’s law : https://en.wikipedia.org/wiki/Moore%27s_law
■ Apache Spark : http://horicky.blogspot.com/2013/12/spark-low-latency-massively-parallel.html
■ Apache Kafka : https://softwareengineeringdaily.com/2015/08/06/kafka-with-guozhang-wang/
■ Spark Streaming : http://ingest.tips/2015/06/24/real-time-analytics-with-kafka-and-spark-streaming/
■ NoSQL : https://www.getfilecloud.com/blog/2014/08/leading-nosql-databases-to-consider/
Thank you ? No. Thank YOU
Spark ? Interesting alternative to ETL Hell
■ SAP, SAS, Elixir
■ ETL has nice visual building blocks, but this means....
■ … Click, Click, Click, Click… (RSI danger !)
■ Building blocks mean that the plain-text code is hidden in stages. Hard to debug, hard to unit test.
■ Waste of resources. ETL jobs are fired at night, when we have peak load. The rest of the time the resources are unused.
■ Data is getting out of sync. So the ETL pipeline gets out of sync.
■ In the Big Data world we have Apache Avro for schemas, typically used with a schema registry
■ Big Data can handle more
■ Legacy code
■ $$$ It’s for FREE $$$
■ Can throw Machine Learning into it and do interesting things. Not only batches.
■ Lacks professional support
■ Big Data is not that mature
■ Let’s see what will happen here
Apache Kafka vs. Rabbit MQ
Kafka :
■ + Fire hose of events (100k+/sec)
■ + Ability to re-read messages (Good for CQRS)
■ + Scale out
■ + Confluent -> Kafka Connect, Kafka Streams, Schema Registry
■ - You have to support it on your own
■ - No AMQP and no complex routing
RabbitMQ :
■ + Messages may be routed to consumers in complex ways
■ + Mature - You like yelling at support guys rather than fixing it yourself ? Place for you !
■ + Scale out
■ - Lower throughput (20k+ messages/sec)
■ - Messages are deleted after the consumer acks
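The re-read versus delete-after-ack difference can be sketched as follows. Neither part uses a real Kafka or RabbitMQ client; the account events and the apply_event helper are hypothetical, and a real Kafka topic keeps messages only as long as its retention settings allow:

```python
from collections import deque

# Hypothetical account events used to build a read model (a balance).
events = [("deposit", 100), ("withdraw", 30), ("deposit", 5)]


def apply_event(balance, event):
    """Fold one event into the account balance."""
    kind, amount = event
    return balance + amount if kind == "deposit" else balance - amount


# Log style (Kafka-like): messages stay after being read, so a consumer can rewind.
log = list(events)
balance = 0
for event in log:
    balance = apply_event(balance, event)
rebuilt = 0
for event in log[0:]:        # rewind to offset 0 and replay everything
    rebuilt = apply_event(rebuilt, event)

# Queue style (RabbitMQ-like): an acknowledged message is removed from the queue.
queue = deque(events)
queue_balance = 0
while queue:
    event = queue.popleft()  # deliver and ack in one step for brevity
    queue_balance = apply_event(queue_balance, event)

print(balance, rebuilt, len(log))  # 75 75 3
print(queue_balance, len(queue))   # 75 0
```

Both styles compute the same balance once, but only the log can rebuild a lost or changed read model later, which is why re-reading is listed as good for CQRS.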
Google Developer Group Lublin 8 - Modern Lambda architecture in Big Data

More Related Content

What's hot

Developing high frequency indicators using real time tick data on apache supe...
Developing high frequency indicators using real time tick data on apache supe...Developing high frequency indicators using real time tick data on apache supe...
Developing high frequency indicators using real time tick data on apache supe...
Zekeriya Besiroglu
 
Big Data Anti-Patterns: Lessons From the Front LIne
Big Data Anti-Patterns: Lessons From the Front LIneBig Data Anti-Patterns: Lessons From the Front LIne
Big Data Anti-Patterns: Lessons From the Front LIne
Douglas Moore
 
Not Just Another Overview of Apache Hadoop
Not Just Another Overview of Apache HadoopNot Just Another Overview of Apache Hadoop
Not Just Another Overview of Apache Hadoop
Adaryl "Bob" Wakefield, MBA
 
Testing Big Data in AWS - Sept 2021
Testing Big Data in AWS - Sept 2021Testing Big Data in AWS - Sept 2021
Testing Big Data in AWS - Sept 2021
Michael98364
 
Technology behind-real-time-log-analytics
Technology behind-real-time-log-analytics Technology behind-real-time-log-analytics
Technology behind-real-time-log-analytics
Data Science Thailand
 
Big Data Day LA 2015 - Machine Learning on Largish Data by Szilard Pafka of E...
Big Data Day LA 2015 - Machine Learning on Largish Data by Szilard Pafka of E...Big Data Day LA 2015 - Machine Learning on Largish Data by Szilard Pafka of E...
Big Data Day LA 2015 - Machine Learning on Largish Data by Szilard Pafka of E...
Data Con LA
 
Big Data Processing Using Hadoop Infrastructure
Big Data Processing Using Hadoop InfrastructureBig Data Processing Using Hadoop Infrastructure
Big Data Processing Using Hadoop Infrastructure
Dmitry Buzdin
 
Bigdata : Big picture
Bigdata : Big pictureBigdata : Big picture
Bigdata : Big picture
Zekeriya Besiroglu
 
Impala turbocharge your big data access
Impala   turbocharge your big data accessImpala   turbocharge your big data access
Impala turbocharge your big data access
Ophir Cohen
 
Apache Spark Briefing
Apache Spark BriefingApache Spark Briefing
Apache Spark Briefing
Thomas W. Dinsmore
 
Atlanta hadoop users group july 2013
Atlanta hadoop users group july 2013Atlanta hadoop users group july 2013
Atlanta hadoop users group july 2013
Christopher Curtin
 
Big data pipelines
Big data pipelinesBig data pipelines
Big data pipelines
Vivek Aanand Ganesan
 
Datascience lab 2017 odessa kappa architecture 2.0
Datascience lab 2017 odessa   kappa architecture 2.0Datascience lab 2017 odessa   kappa architecture 2.0
Datascience lab 2017 odessa kappa architecture 2.0
Juantomás García Molina
 
Microservices, containers, and machine learning
Microservices, containers, and machine learningMicroservices, containers, and machine learning
Microservices, containers, and machine learning
Paco Nathan
 
Big Data is changing abruptly, and where it is likely heading
Big Data is changing abruptly, and where it is likely headingBig Data is changing abruptly, and where it is likely heading
Big Data is changing abruptly, and where it is likely heading
Paco Nathan
 
introduction to Neo4j (Tabriz Software Open Talks)
introduction to Neo4j (Tabriz Software Open Talks)introduction to Neo4j (Tabriz Software Open Talks)
introduction to Neo4j (Tabriz Software Open Talks)
Farzin Bagheri
 
Big Data with Apache Hadoop
Big Data with Apache HadoopBig Data with Apache Hadoop
Big Data with Apache Hadoop
InfoFarm
 
Amazon aws big data demystified | Introduction to streaming and messaging flu...
Amazon aws big data demystified | Introduction to streaming and messaging flu...Amazon aws big data demystified | Introduction to streaming and messaging flu...
Amazon aws big data demystified | Introduction to streaming and messaging flu...
Omid Vahdaty
 
Big Data in the Cloud - Montreal April 2015
Big Data in the Cloud - Montreal April 2015Big Data in the Cloud - Montreal April 2015
Big Data in the Cloud - Montreal April 2015
Cindy Gross
 

What's hot (19)

Developing high frequency indicators using real time tick data on apache supe...
Developing high frequency indicators using real time tick data on apache supe...Developing high frequency indicators using real time tick data on apache supe...
Developing high frequency indicators using real time tick data on apache supe...
 
Big Data Anti-Patterns: Lessons From the Front LIne
Big Data Anti-Patterns: Lessons From the Front LIneBig Data Anti-Patterns: Lessons From the Front LIne
Big Data Anti-Patterns: Lessons From the Front LIne
 
Not Just Another Overview of Apache Hadoop
Not Just Another Overview of Apache HadoopNot Just Another Overview of Apache Hadoop
Not Just Another Overview of Apache Hadoop
 
Testing Big Data in AWS - Sept 2021
Testing Big Data in AWS - Sept 2021Testing Big Data in AWS - Sept 2021
Testing Big Data in AWS - Sept 2021
 
Technology behind-real-time-log-analytics
Technology behind-real-time-log-analytics Technology behind-real-time-log-analytics
Technology behind-real-time-log-analytics
 
Big Data Day LA 2015 - Machine Learning on Largish Data by Szilard Pafka of E...
Big Data Day LA 2015 - Machine Learning on Largish Data by Szilard Pafka of E...Big Data Day LA 2015 - Machine Learning on Largish Data by Szilard Pafka of E...
Big Data Day LA 2015 - Machine Learning on Largish Data by Szilard Pafka of E...
 
Big Data Processing Using Hadoop Infrastructure
Big Data Processing Using Hadoop InfrastructureBig Data Processing Using Hadoop Infrastructure
Big Data Processing Using Hadoop Infrastructure
 
Bigdata : Big picture
Bigdata : Big pictureBigdata : Big picture
Bigdata : Big picture
 
Impala turbocharge your big data access
Impala   turbocharge your big data accessImpala   turbocharge your big data access
Impala turbocharge your big data access
 
Apache Spark Briefing
Apache Spark BriefingApache Spark Briefing
Apache Spark Briefing
 
Atlanta hadoop users group july 2013
Atlanta hadoop users group july 2013Atlanta hadoop users group july 2013
Atlanta hadoop users group july 2013
 
Big data pipelines
Big data pipelinesBig data pipelines
Big data pipelines
 
Datascience lab 2017 odessa kappa architecture 2.0
Datascience lab 2017 odessa   kappa architecture 2.0Datascience lab 2017 odessa   kappa architecture 2.0
Datascience lab 2017 odessa kappa architecture 2.0
 
Microservices, containers, and machine learning
Microservices, containers, and machine learningMicroservices, containers, and machine learning
Microservices, containers, and machine learning
 
Big Data is changing abruptly, and where it is likely heading
Big Data is changing abruptly, and where it is likely headingBig Data is changing abruptly, and where it is likely heading
Big Data is changing abruptly, and where it is likely heading
 
introduction to Neo4j (Tabriz Software Open Talks)
introduction to Neo4j (Tabriz Software Open Talks)introduction to Neo4j (Tabriz Software Open Talks)
introduction to Neo4j (Tabriz Software Open Talks)
 
Big Data with Apache Hadoop
Big Data with Apache HadoopBig Data with Apache Hadoop
Big Data with Apache Hadoop
 
Amazon aws big data demystified | Introduction to streaming and messaging flu...
Amazon aws big data demystified | Introduction to streaming and messaging flu...Amazon aws big data demystified | Introduction to streaming and messaging flu...
Amazon aws big data demystified | Introduction to streaming and messaging flu...
 
Big Data in the Cloud - Montreal April 2015
Big Data in the Cloud - Montreal April 2015Big Data in the Cloud - Montreal April 2015
Big Data in the Cloud - Montreal April 2015
 

Similar to Google Developer Group Lublin 8 - Modern Lambda architecture in Big Data

SnappyData Toronto Meetup Nov 2017
SnappyData Toronto Meetup Nov 2017SnappyData Toronto Meetup Nov 2017
SnappyData Toronto Meetup Nov 2017
SnappyData
 
Big data nyu
Big data nyuBig data nyu
Big data nyu
Edward Capriolo
 
Stream, Stream, Stream: Different Streaming Methods with Apache Spark and Kafka
Stream, Stream, Stream: Different Streaming Methods with Apache Spark and KafkaStream, Stream, Stream: Different Streaming Methods with Apache Spark and Kafka
Stream, Stream, Stream: Different Streaming Methods with Apache Spark and Kafka
Databricks
 
NoSQL for great good [hanoi.rb talk]
NoSQL for great good [hanoi.rb talk]NoSQL for great good [hanoi.rb talk]
NoSQL for great good [hanoi.rb talk]
Huy Do
 
Big Data! Great! Now What? #SymfonyCon 2014
Big Data! Great! Now What? #SymfonyCon 2014Big Data! Great! Now What? #SymfonyCon 2014
Big Data! Great! Now What? #SymfonyCon 2014
Ricard Clau
 
Architecting Big Data Ingest & Manipulation
Architecting Big Data Ingest & ManipulationArchitecting Big Data Ingest & Manipulation
Architecting Big Data Ingest & Manipulation
George Long
 
Spark
SparkSpark
MongoDB World 2019: MongoDB Cluster Design: From Redundancy to GDPR
MongoDB World 2019: MongoDB Cluster Design: From Redundancy to GDPRMongoDB World 2019: MongoDB Cluster Design: From Redundancy to GDPR
MongoDB World 2019: MongoDB Cluster Design: From Redundancy to GDPR
MongoDB
 
Stream, Stream, Stream: Different Streaming Methods with Spark and Kafka
Stream, Stream, Stream: Different Streaming Methods with Spark and KafkaStream, Stream, Stream: Different Streaming Methods with Spark and Kafka
Stream, Stream, Stream: Different Streaming Methods with Spark and Kafka
DataWorks Summit
 
Introduction to NetGuardians' Big Data Software Stack
Introduction to NetGuardians' Big Data Software StackIntroduction to NetGuardians' Big Data Software Stack
Introduction to NetGuardians' Big Data Software Stack
Jérôme Kehrli
 
Resilience: the key requirement of a [big] [data] architecture - StampedeCon...
Resilience: the key requirement of a [big] [data] architecture  - StampedeCon...Resilience: the key requirement of a [big] [data] architecture  - StampedeCon...
Resilience: the key requirement of a [big] [data] architecture - StampedeCon...
StampedeCon
 
Elastic Data Analytics Platform @Datadog
Elastic Data Analytics Platform @DatadogElastic Data Analytics Platform @Datadog
Elastic Data Analytics Platform @Datadog
C4Media
 
Tech Spark Presentation
Tech Spark PresentationTech Spark Presentation
Tech Spark Presentation
Stephen Borg
 
SQL or NoSQL, that is the question!
SQL or NoSQL, that is the question!SQL or NoSQL, that is the question!
SQL or NoSQL, that is the question!
Andraz Tori
 
Data Science at Scale: Using Apache Spark for Data Science at Bitly
Data Science at Scale: Using Apache Spark for Data Science at BitlyData Science at Scale: Using Apache Spark for Data Science at Bitly
Data Science at Scale: Using Apache Spark for Data Science at Bitly
Sarah Guido
 
Architecting Your First Big Data Implementation
Architecting Your First Big Data ImplementationArchitecting Your First Big Data Implementation
Architecting Your First Big Data Implementation
Adaryl "Bob" Wakefield, MBA
 
Hardware Provisioning
Hardware ProvisioningHardware Provisioning
Hardware Provisioning
MongoDB
 
Boston Spark User Group - Spark's Role at MediaCrossing - July 15, 2014
Boston Spark User Group - Spark's Role at MediaCrossing - July 15, 2014Boston Spark User Group - Spark's Role at MediaCrossing - July 15, 2014
Boston Spark User Group - Spark's Role at MediaCrossing - July 15, 2014
gmalouf678
 
The Hitchhiker's Guide to Machine Learning with Python & Apache Spark
The Hitchhiker's Guide to Machine Learning with Python & Apache SparkThe Hitchhiker's Guide to Machine Learning with Python & Apache Spark
The Hitchhiker's Guide to Machine Learning with Python & Apache Spark
Krishna Sankar
 
Quick dive into the big data pool without drowning - Demi Ben-Ari @ Panorays
Quick dive into the big data pool without drowning - Demi Ben-Ari @ PanoraysQuick dive into the big data pool without drowning - Demi Ben-Ari @ Panorays
Quick dive into the big data pool without drowning - Demi Ben-Ari @ Panorays
Demi Ben-Ari
 

Similar to Google Developer Group Lublin 8 - Modern Lambda architecture in Big Data (20)

SnappyData Toronto Meetup Nov 2017
SnappyData Toronto Meetup Nov 2017SnappyData Toronto Meetup Nov 2017
SnappyData Toronto Meetup Nov 2017
 
Big data nyu
Big data nyuBig data nyu
Big data nyu
 
Stream, Stream, Stream: Different Streaming Methods with Apache Spark and Kafka
Stream, Stream, Stream: Different Streaming Methods with Apache Spark and KafkaStream, Stream, Stream: Different Streaming Methods with Apache Spark and Kafka
Stream, Stream, Stream: Different Streaming Methods with Apache Spark and Kafka
 
NoSQL for great good [hanoi.rb talk]
NoSQL for great good [hanoi.rb talk]NoSQL for great good [hanoi.rb talk]
NoSQL for great good [hanoi.rb talk]
 
Big Data! Great! Now What? #SymfonyCon 2014
Big Data! Great! Now What? #SymfonyCon 2014Big Data! Great! Now What? #SymfonyCon 2014
Big Data! Great! Now What? #SymfonyCon 2014
 
Architecting Big Data Ingest & Manipulation
Architecting Big Data Ingest & ManipulationArchitecting Big Data Ingest & Manipulation
Architecting Big Data Ingest & Manipulation
 
Spark
SparkSpark
Spark
 
MongoDB World 2019: MongoDB Cluster Design: From Redundancy to GDPR
MongoDB World 2019: MongoDB Cluster Design: From Redundancy to GDPRMongoDB World 2019: MongoDB Cluster Design: From Redundancy to GDPR
MongoDB World 2019: MongoDB Cluster Design: From Redundancy to GDPR
 
Stream, Stream, Stream: Different Streaming Methods with Spark and Kafka
Stream, Stream, Stream: Different Streaming Methods with Spark and KafkaStream, Stream, Stream: Different Streaming Methods with Spark and Kafka
Stream, Stream, Stream: Different Streaming Methods with Spark and Kafka
 
Introduction to NetGuardians' Big Data Software Stack
Introduction to NetGuardians' Big Data Software StackIntroduction to NetGuardians' Big Data Software Stack
Introduction to NetGuardians' Big Data Software Stack
 
Resilience: the key requirement of a [big] [data] architecture - StampedeCon...
Resilience: the key requirement of a [big] [data] architecture  - StampedeCon...Resilience: the key requirement of a [big] [data] architecture  - StampedeCon...
Resilience: the key requirement of a [big] [data] architecture - StampedeCon...
 
Elastic Data Analytics Platform @Datadog
Elastic Data Analytics Platform @DatadogElastic Data Analytics Platform @Datadog
Elastic Data Analytics Platform @Datadog
 
Tech Spark Presentation
Tech Spark PresentationTech Spark Presentation
Tech Spark Presentation
 
SQL or NoSQL, that is the question!
SQL or NoSQL, that is the question!SQL or NoSQL, that is the question!
SQL or NoSQL, that is the question!
 
Data Science at Scale: Using Apache Spark for Data Science at Bitly
Data Science at Scale: Using Apache Spark for Data Science at BitlyData Science at Scale: Using Apache Spark for Data Science at Bitly
Data Science at Scale: Using Apache Spark for Data Science at Bitly
 
Architecting Your First Big Data Implementation
Architecting Your First Big Data ImplementationArchitecting Your First Big Data Implementation
Architecting Your First Big Data Implementation
 
Hardware Provisioning
Hardware ProvisioningHardware Provisioning
Hardware Provisioning
 
Boston Spark User Group - Spark's Role at MediaCrossing - July 15, 2014
Boston Spark User Group - Spark's Role at MediaCrossing - July 15, 2014Boston Spark User Group - Spark's Role at MediaCrossing - July 15, 2014
Boston Spark User Group - Spark's Role at MediaCrossing - July 15, 2014
 
The Hitchhiker's Guide to Machine Learning with Python & Apache Spark
The Hitchhiker's Guide to Machine Learning with Python & Apache SparkThe Hitchhiker's Guide to Machine Learning with Python & Apache Spark
The Hitchhiker's Guide to Machine Learning with Python & Apache Spark
 
Quick dive into the big data pool without drowning - Demi Ben-Ari @ Panorays
Quick dive into the big data pool without drowning - Demi Ben-Ari @ PanoraysQuick dive into the big data pool without drowning - Demi Ben-Ari @ Panorays
Quick dive into the big data pool without drowning - Demi Ben-Ari @ Panorays
 

Recently uploaded

一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理
一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理
一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理
zsjl4mimo
 
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
v7oacc3l
 
State of Artificial intelligence Report 2023
State of Artificial intelligence Report 2023State of Artificial intelligence Report 2023
State of Artificial intelligence Report 2023
kuntobimo2016
 
Learn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queriesLearn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queries
manishkhaire30
 
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
Aggregage
 
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
Timothy Spann
 
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
bopyb
 
Intelligence supported media monitoring in veterinary medicine
Intelligence supported media monitoring in veterinary medicineIntelligence supported media monitoring in veterinary medicine
Intelligence supported media monitoring in veterinary medicine
AndrzejJarynowski
 
DSSML24_tspann_CodelessGenerativeAIPipelines
DSSML24_tspann_CodelessGenerativeAIPipelinesDSSML24_tspann_CodelessGenerativeAIPipelines
DSSML24_tspann_CodelessGenerativeAIPipelines
Timothy Spann
 
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data LakeViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
Walaa Eldin Moustafa
 
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
74nqk8xf
 
Challenges of Nation Building-1.pptx with more important
Challenges of Nation Building-1.pptx with more importantChallenges of Nation Building-1.pptx with more important
Challenges of Nation Building-1.pptx with more important
Sm321
 
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
Timothy Spann
 
My burning issue is homelessness K.C.M.O.
My burning issue is homelessness K.C.M.O.My burning issue is homelessness K.C.M.O.
My burning issue is homelessness K.C.M.O.
rwarrenll
 
一比一原版(Chester毕业证书)切斯特大学毕业证如何办理
一比一原版(Chester毕业证书)切斯特大学毕业证如何办理一比一原版(Chester毕业证书)切斯特大学毕业证如何办理
一比一原版(Chester毕业证书)切斯特大学毕业证如何办理
74nqk8xf
 
一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理
一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理
一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理
nyfuhyz
 
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
nuttdpt
 
Experts live - Improving user adoption with AI
Experts live - Improving user adoption with AIExperts live - Improving user adoption with AI
Experts live - Improving user adoption with AI
jitskeb
 
一比一原版(Glasgow毕业证书)格拉斯哥大学毕业证如何办理
一比一原版(Glasgow毕业证书)格拉斯哥大学毕业证如何办理一比一原版(Glasgow毕业证书)格拉斯哥大学毕业证如何办理
一比一原版(Glasgow毕业证书)格拉斯哥大学毕业证如何办理
g4dpvqap0
 
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
g4dpvqap0
 

Recently uploaded (20)

一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理
一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理
一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理
 
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
 
State of Artificial intelligence Report 2023
State of Artificial intelligence Report 2023State of Artificial intelligence Report 2023
State of Artificial intelligence Report 2023
 
Learn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queriesLearn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queries
 
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
 
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
 
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
Google Developer Group Lublin 8 - Modern Lambda architecture in Big Data

  • 1. Modern Lambda architecture in Big Data Piotr Hejwowski
  • 2. Hello world :) ■ Who am I ? ■ Java developer working at Codete ■ Keen on Big Data and the modern backend approach ■ Luckily I can develop this passion at Codete ■ https://github.com/Hejwo ■ piotr.hejwowski@codete.com ■ Disclaimer 1 - we will use Polish, but with a lot of English, business-specific terms. ■ Disclaimer 2 - The discipline is large, so we are going to cover only the bigger picture ■ Disclaimer 3 - Live coding ? Next time ■ Disclaimer 4 - From zero to hero style
  • 3. Recap & Intro ■ Recap - at the end of the last GDG we were talking about Machine Learning ■ We talked about the difference between Data Science and Big Data, which are often confused Recap Data science ■ Data science, also known as data-driven science, is an interdisciplinary field about scientific methods and processes to extract knowledge or insights from data in various forms, either structured or unstructured ■ Data science is focused on cleaning gathered data, math, statistics, business understanding and extracting valuable information Big data ■ Modern methods of gathering and processing big volumes of data ■ More info in the next 40 mins ;)
  • 6. What’s Big Data ? ■ The amount of our data is getting larger and larger ■ The Internet of Things plays an important role in it -> sensors, sensors are everywhere ! ■ At some point EVEN business guys discovered that there’s great value behind unstructured data ■ ETLs on a massive scale ■ Recommendation systems based on FB likes ■ Analysing user traffic in e-shops and optimizing content ■ Raw data from car sensors ■ Optimizing traffic like in Lublin :) ■ The POTENTIAL and AMOUNT of data that we need is HUGE ■ Fun fact - having raw data means that we don’t know what we’re looking for and that’s great !!! ■ Discovering new relations in our data
  • 8. How to process Big Data ? Moore’s law is dying [*] “Moore's law is the observation that the number of transistors in a dense integrated circuit doubles approximately every two years” ■ Right now every new step in transistor progress is getting more and more expensive. ■ New processors are getting more and more expensive. ■ Until now we could rely on Moore's law. If our infrastructure was not doing well, after two years we could have a faster one for approximately the same cost. ■ But… we still have many cores. But… sometimes distributing work across many cores is still not enough.
  • 9. How to process Big Data ? - Scale up vs. Scale out Scale up ■ Costly components ■ Complex application/system logic. Often multithreaded ■ Poor fault tolerance ■ The machine is getting hot as Mordor. Scale out ■ Cheaper machines ■ Easier application and system logic ■ Thanks to orchestration tools such as Mesos and Kubernetes it’s not THAT hard to maintain. ■ Fault tolerance - if half of our machines explode we can still do something ■ Needs data centers :(
  • 10. How to process Big Data ? - Scale up vs. Scale out
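The fault-tolerance point of scaling out can be sketched in a few lines. This is a toy model only, not how Mesos or Kubernetes schedule work: the names `assign_partitions`, `run_job`, and the `node-*` workers are made up for illustration, and a "partition" is just a list of numbers to sum.

```python
from collections import defaultdict


def assign_partitions(partitions, workers):
    """Spread partitions over the live workers in round-robin order."""
    if not workers:
        raise RuntimeError("no live workers left")
    assignment = defaultdict(list)
    for i, partition in enumerate(partitions):
        assignment[workers[i % len(workers)]].append(partition)
    return dict(assignment)


def run_job(partitions, workers, failed=()):
    """Sum all partitions using only the workers that have not failed."""
    live = [w for w in workers if w not in failed]
    assignment = assign_partitions(partitions, live)
    # Each worker computes a partial result for its own partitions.
    partial_sums = {w: sum(sum(p) for p in parts) for w, parts in assignment.items()}
    return sum(partial_sums.values()), assignment


partitions = [[1, 2], [3, 4], [5, 6], [7, 8]]
workers = ["node-1", "node-2", "node-3", "node-4"]

total_all, _ = run_job(partitions, workers)
# Half of the machines "explode": the same job still completes on the rest.
total_half, plan = run_job(partitions, workers, failed={"node-2", "node-4"})
print(total_all, total_half)
print(plan)
```

With half the workers gone, each survivor simply takes twice as many partitions, so the job gets slower but the result stays the same.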
  • 11. Meet Apache Spark - Big Data processing engine !
  • 12. Meet Apache Spark - Big Data processing engine ! ■ Created at UC Berkeley ■ At the beginning it was a proof of concept for Mesos cluster management ■ Much faster than its father - Hadoop ■ By default it operates in memory. ■ No frequent disk writes means more speed ■ Rich and simple caching mechanism ■ There are a ton of other Big Data processing engines - Hadoop, Storm, Flink, Splunk ■ We're gonna focus on Spark due to time
  • 13. Meet Apache Spark - Big Data processing engine !
  • 14. Is Big Data processing THE only direction ? Spark is faster than Hadoop, but still… it’s heavy machinery
  • 15. Is Big Data THE only direction ? Reactive Manifesto ■ Responsive - What happens when Wifi is down ? Users want FAST responses ! ■ Elastic - Large systems tend to have frequent, massive loads ■ Resilient - The system must stay available, and any kind of response is better than no response. ■ Message Driven - isolation and non-blocking behaviour are achieved via async communication. Thanks to that we have clear boundaries, isolation, transparency.
  • 16. How to achieve these two goals ? Let’s go lambda !
  • 17. Meet our system’s heart - Apache Kafka ■ Lightning-fast messaging system ■ Heart of a Big Data system ■ Distributed ■ Built by LinkedIn ■ Written in Scala and Java ■ Producers and Consumers concept ■ Automatic recovery, broker detection
  • 18. Meet our system’s heart - Apache Kafka
  • 19. Meet our system’s heart - Apache Kafka
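The producers and consumers concept can be illustrated with a toy in-memory log. This is not the Kafka client API: the class `TopicLog` and its `produce`, `poll`, and `seek` methods are hypothetical names, and the model assumes a single partition with no brokers, replication, or persistence. It only shows the core idea that producers append to a log while each consumer group tracks its own read position.

```python
from collections import defaultdict


class TopicLog:
    """Toy single-partition topic: an append-only list plus per-group offsets."""

    def __init__(self):
        self._records = []
        self._offsets = defaultdict(int)  # consumer group -> next offset to read

    def produce(self, value):
        """Append a record and return the offset it was stored at."""
        self._records.append(value)
        return len(self._records) - 1

    def poll(self, group, max_records=10):
        """Return unread records for a group and advance that group's offset."""
        start = self._offsets[group]
        batch = self._records[start:start + max_records]
        self._offsets[group] = start + len(batch)
        return batch

    def seek(self, group, offset):
        """Move a group's offset, for example back to 0 to re-read everything."""
        if not 0 <= offset <= len(self._records):
            raise ValueError("offset out of range")
        self._offsets[group] = offset


log = TopicLog()
for reading in ("temp=21", "temp=22", "temp=35"):
    log.produce(reading)

first_read = log.poll("dashboard")               # one group reads everything
other_group = log.poll("alerts", max_records=2)  # another group reads at its own pace
log.seek("dashboard", 0)                         # rewind and replay
replayed = log.poll("dashboard")
print(first_read, other_group, replayed)
```

Because reading never removes records, a slow or restarted consumer can pick up where it left off, and a new consumer group can replay the whole history.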
  • 20. We’ve got two parts of the puzzle !
  • 21. Spark Streaming - when batch is not enough
  • 22. Spark Streaming - when batch is not enough ■ Spark Streaming makes it easy to build scalable fault-tolerant streaming applications. ■ By running on Spark, Spark Streaming lets you reuse the same code for batch processing, join streams against historical data, or run ad-hoc queries on stream state. ■ Used for rapid micro-batching ■ Before Spark Streaming, building complex pipelines that encompass streaming, batch, or even machine learning capabilities with open source software meant dealing with multiple frameworks ■ Streaming ETL – Data is continuously cleaned and aggregated before being pushed into data stores. No more SAP. ■ Triggers – Anomalous behavior is detected in real-time and further downstream actions are triggered accordingly. ■ Data enrichment – Live data is enriched with more information by joining it with a static dataset allowing for a more complete real-time analysis.
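The micro-batching idea, together with the streaming ETL, trigger, and enrichment use cases, can be sketched in plain Python. This does not use the Spark Streaming API. The functions `clean`, `micro_batches`, and `process`, the `LOCATIONS` lookup table, and the sample events are all invented for illustration, and the sketch groups events by a `ts` field, whereas a real micro-batch engine cuts batches by arrival time.

```python
from collections import defaultdict


def clean(event):
    """Streaming ETL step: normalise the sensor id and round the value."""
    return {"ts": event["ts"],
            "sensor": event["sensor"].strip().lower(),
            "value": round(event["value"], 1)}


def micro_batches(events, interval):
    """Group timestamped events into consecutive windows of `interval` seconds."""
    windows = defaultdict(list)
    for event in events:
        windows[event["ts"] // interval * interval].append(event)
    return [windows[start] for start in sorted(windows)]


def process(batch, locations, threshold):
    """Clean a batch, enrich it from a static table, and collect alerts."""
    cleaned = [clean(e) for e in batch]
    enriched = [{**e, "location": locations.get(e["sensor"], "unknown")} for e in cleaned]
    alerts = [e for e in enriched if e["value"] > threshold]  # the "trigger"
    return enriched, alerts


LOCATIONS = {"s1": "Lublin", "s2": "Krakow"}  # static dataset (made up)
events = [
    {"ts": 0, "sensor": " S1 ", "value": 20.04},
    {"ts": 1, "sensor": "s2", "value": 21.5},
    {"ts": 2, "sensor": "S1", "value": 95.26},
    {"ts": 5, "sensor": "s3", "value": 19.0},
]

batches = micro_batches(events, interval=2)
stream_alerts = [a for b in batches for a in process(b, LOCATIONS, 90)[1]]
# The same function runs unchanged over the full history as one big batch.
batch_alerts = process(events, LOCATIONS, 90)[1]
print(stream_alerts == batch_alerts, stream_alerts)
```

The point of the design is that `process` does not care whether it receives a two-second slice or the whole history, which is the code reuse between batch and streaming that the slide describes.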
  • 23. Witch done with the puzzle !
  • 24. Now… let’s store it ! NoSql store it ! Why NoSQL ? ■ Large datasets ■ Easy to scale out ■ Less schema validation on write means faster writes ■ Schemaless databases can be of great value in Big Data, since we sometimes don’t know what we need and we want our data to be dirty.
  • 25. Now… let’s store it ! NoSql store it !
  • 26. Now… let’s store it ! NoSql store it !
  • 27. Why is it modern ? ■ Fast, reliable ■ More like - write once, run everywhere thanks to Spark, Spark Shell, Zeppelin ■ Less code (Hadoop’s MapReduce ? It’s an essay) ■ Compared to the older approach - less chaos thanks to Kafka.
  • 28. Cons ■ More like micro-batching than real time ■ A lot of stuff is still evolving (Spark, Kafka) and hasn’t got professional customer support ■ Things tend to get complicated when Kafka messages within a single topic evolve ■ DevOps needed, strong developers needed ■ The distributed world is a complicated world ■ Thousands of frameworks and ideas every year
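To make the point about evolving messages concrete, here is a minimal sketch of a consumer that has to read old and new message shapes from the same topic. Everything in it is assumed for illustration: the JSON layout, the `version` field, the `DEFAULTS` table, and the `decode` helper are not taken from the talk or from any Kafka tooling.

```python
import json

# Hypothetical defaults for fields that only newer producers send.
DEFAULTS = {"unit": "celsius", "source": "unknown"}


def decode(raw):
    """Parse one JSON message and fill in fields missing from older versions."""
    message = json.loads(raw)
    version = message.get("version", 1)  # version 1 messages carried no version field
    if version > 2:
        raise ValueError(f"unsupported message version: {version}")
    return {**DEFAULTS, **message, "version": version}


# One topic, two generations of producers.
topic = [
    '{"sensor": "s1", "value": 21.5}',
    '{"version": 2, "sensor": "s1", "value": 70.7, "unit": "fahrenheit"}',
]
decoded = [decode(raw) for raw in topic]
print(decoded)
```

Every consumer of the topic needs this kind of compatibility logic, which is why teams tend to move it into a shared schema definition instead of hand-written defaults.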
  • 29. What next ? Apache Spark resources : ■ http://spark.apache.org/ ■ https://hortonworks.com/tutorials/ ■ https://codete.com/blog/ Apache Kafka resources : ■ http://kafka.apache.org/ ■ http://blog.cloudera.com/blog/2014/09/apache-kafka-for-beginners/ NoSql resources : ■ http://openmymind.net/2011/8/15/How-You-Should-Go-About-Learning-NoSQL/
  • 30. Sources ■ Internet in a minute : http://www.visualcapitalist.com/what-happens-internet-minute-2016/ ■ Big Data and V4’s : https://www.linkedin.com/pulse/20140306073407-64875646-big-data-the-5-vs-everyone-must-know ■ Moore’s law : https://en.wikipedia.org/wiki/Moore%27s_law ■ Apache Spark : http://horicky.blogspot.com/2013/12/spark-low-latency-massively-parallel.html ■ Apache Kafka : https://softwareengineeringdaily.com/2015/08/06/kafka-with-guozhang-wang/ ■ Spark Streaming : http://ingest.tips/2015/06/24/real-time-analytics-with-kafka-and-spark-streaming/ ■ NoSQL : https://www.getfilecloud.com/blog/2014/08/leading-nosql-databases-to-consider/
  • 31. Thank you ? No. Thank YOU
  • 32. Spark ? Interesting alternative to ETL Hell ■ SAP, SAS, Elixir ■ ETL has nice visual building blocks, but this means.... ■ … Click, Click, Click, Click… (RSI danger !) ■ Building blocks mean that plain-text code is hidden in stages. Hard to debug, hard to unit test. ■ Waste of resources. ETL jobs are fired at night, when we have peak performance. Then the resources sit unused. ■ Data is getting out of sync. So the ETL pipeline gets out of sync. ■ In the Big Data world we have Apache Avro with a schema registry ■ Big Data can handle more ■ Legacy code ■ $$$ It’s for FREE $$$ ■ Can throw Machine Learning into it and do interesting things. Not only batches. ■ Lacks professional support ■ Big Data is not that mature ■ Let’s see what will happen here
  • 33. Apache Kafka vs. Rabbit MQ
  • 34. Apache Kafka vs. Rabbit MQ
  • 35. Apache Kafka vs. Rabbit MQ Kafka : ■ + Fire hose of events (100k+/sec) ■ + Ability to re-read messages (good for CQRS) ■ + Scale out ■ + Confluent -> Kafka Connect, Kafka Streams, Schema Registry ■ - You have to support it on your own ■ - No AMQP or complex routing RabbitMQ : ■ + Messages may be routed to consumers in complex ways ■ + Mature - You like yelling at support guys rather than fixing it yourself ? This is the place for you ! ■ + Scale out ■ - Fewer messages (20k+/sec) ■ - Messages are deleted after consumer ack