Data Engineering for Data Scientists
Jonathan Lacefield – Solution Architect
DataStax
Introduction
• Jonathan Lacefield
– Solutions Architect, DataStax
– Former Dev, DBA, Architect, reformed PM
– Email: jlacefie@gmail.com
– Twitter: @jlacefie
– LinkedIn: www.linkedin.com/in/jlacefield
DataStax Introduction
1. Commercial Provider of Apache Cassandra
2. Provider of Proprietary Software Built on
Apache Cassandra
3. Deliverer of a linearly scalable, “always-on” Data
Platform on the foundation of Apache Cassandra
and the integration of:
1. Apache Spark
2. Apache SOLR
3. Apache Hadoop
4. TitanDB
DataStax, What we Do (Use Cases)
• Fraud Detection
• Personalization
• Internet of Things
• Messaging
• Lists of Things (Products, Playlists, etc)
• Smaller set of other things too!
We are all about working with temporal data sets at
large volumes with high transaction counts
(velocity).
“One believes things because one has been
conditioned to believe them.”
― Aldous Huxley, Brave New World
After today, you will have enough knowledge to walk into
any organization and communicate with Data Engineers,
in their terms, to effectively design Analytical solutions
based on modern technologies.
Agenda
• Background and Context
– From 1 Database to Distributed, Polyglot Persistence
Data Stores
• Data Engineering Concepts 101
– The CAP Theorem and its Variants
• Data Engineering Concepts 102
– Deeper into CAP
• The Data Stores You Will (Probably) Use
• The Architectures in Which You Will Participate
What’s Happened in the Last 10 Years
[Diagram, 2005: OLTP, Web Application Tier, ETL, OLAP, Statistical/Analytical Applications]
Ahh….2005
[Diagram, 2015 (today): OLTP, Web Application Tier, ETL, OLAP, Statistical/Analytical Applications]
Innovations in Data Engineering
• 2000 – Eric Brewer’s CAP Theorem, proved in 2002
– http://en.wikipedia.org/wiki/CAP_theorem
• 2004 – Google MapReduce
– http://research.google.com/archive/mapreduce.html
• 2006 – Google Bigtable
– http://static.googleusercontent.com/media/research.google.com/en/us/archive/bigtable-osdi06.pdf
• 2007 – Amazon Dynamo
– http://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf
• 2008 – Polyglot Persistence
– https://www.altamiracorp.com/blog/employee-posts/polyglot-persistence
• 2009 – NoSQL (in modern terms) Introduced
– http://en.wikipedia.org/wiki/NoSQL
• 2012 – Berkeley Spark
– https://amplab.cs.berkeley.edu/wp-content/uploads/2012/01/nsdi_spark.pdf
• …
Today
F1 F2 F3
F4 F5 F6
F7 F8 F9
Distributed File
Systems
ETL
• Polyglot Persistence and Services Integration
are the Norm
• Data Stores are Distributed
• Centralize Data via File Systems
• Hadoop, GFS, S3, etc
• Open Source Rules
• Analytical Applications
• Python, R, Scala, Java
• Data Pipelines (not depicted)
SO WHAT?
WHO CARES?
To succeed you
must thrive in this
environment!
1 + 1 = 2 Only Sometimes
CAP Theorem (The Foundation)
It is impossible for a distributed computer system to
simultaneously provide all three of the following
guarantees:
• Consistency (all nodes see the same data at the same
time)
• Availability (a guarantee that every request receives
a response about whether it succeeded or failed)
• Partition tolerance (the system continues to operate
despite arbitrary message loss or failure of part of
the system)
Consistency
All nodes in a system see the same data at the
same time.
V1 V1 V1 V1
V1
Availability
A guarantee that every request receives a
response about whether it succeeded or failed.
V1 V1 V1 V1
Request Response
Partition Tolerance
The system continues to operate despite arbitrary
message loss or failure of part of the system.
Graphic and following example, borrowed from here –
http://www.slideshare.net/YoavFrancis/cap-theorem-theory-implications-and-practices
CAP an Example
V0 V0 V0 V0
CAP an Example
V1 V0 V0 V0
V1
CAP an Example
V1 V1 V1 V1
CAP an Example
V1 V1 V1 V1
V1
CAP an Example
V1 V1 V1 V1
CAP an Example
V2 V1 V1 V1
V2
CAP an Example
V2 V2 V2 V1
Partition
CAP an Example
V2 V2 V2 V1
V1
Partition
In a Distributed Environment, one
must trade availability, consistency,
or partition tolerance.
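The walk-through above can be replayed in a few lines of Python. This is only a toy sketch: `ToyCluster`, its methods, and the `prefer` flag are hypothetical names made up for illustration, not the API of any real data store.

```python
# Toy sketch of the CAP example: four replicas, one gets cut off by a partition.
# ToyCluster and the "prefer" flag are illustrative assumptions, not a real API.
class ToyCluster:
    def __init__(self, n=4, value="V0"):
        self.values = [value] * n       # one copy of the value per replica
        self.reachable = [True] * n     # False = replica is behind a partition

    def partition(self, node):
        self.reachable[node] = False

    def write(self, value, prefer="availability"):
        """Replicate a write to every reachable replica."""
        if prefer == "consistency" and not all(self.reachable):
            # Consistency-first choice: refuse the write rather than diverge.
            return False
        for i, ok in enumerate(self.reachable):
            if ok:
                self.values[i] = value
        return True

    def read(self, node):
        return self.values[node]


# Availability-first: the write succeeds, but replica 3 serves stale data.
ap = ToyCluster()
ap.write("V1")
ap.partition(3)
ap.write("V2")
print(ap.values, ap.read(3))

# Consistency-first: the same write is rejected while the partition lasts.
cp = ToyCluster()
cp.write("V1")
cp.partition(3)
cp_accepted = cp.write("V2", prefer="consistency")
print(cp.values, cp_accepted)
```

The availability-first cluster ends in the same state as the last slide of the example (three replicas on V2, one still on V1); the consistency-first cluster stays on V1 everywhere but turns the client away.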
Availability Techniques
Either a system is available in the face of any failure or it is not.
Leader | Follower
Leader
Follower
Follower
Peer-to-Peer
Availability Vulnerability Availability Resilient*
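A rough way to see the difference between the two layouts is to ask "can I still write after a node dies?". The sketch below is a deliberate simplification: `can_accept_write` is a hypothetical helper, it assumes no leader failover has happened yet, and it ignores replication requirements that real peer-based systems also impose.

```python
# Simplified sketch: which topology can still take a write after a failure?
# Assumptions: no automatic failover yet, and any live peer may coordinate.
def can_accept_write(nodes_up, topology, leader="n0"):
    """nodes_up is the set of node names that are currently alive."""
    if topology == "leader_follower":
        # All writes funnel through the leader, so losing it blocks writes.
        return leader in nodes_up
    if topology == "peer":
        # Any surviving peer can coordinate the write.
        return len(nodes_up) > 0
    raise ValueError(f"unknown topology: {topology}")


after_failure = {"n1", "n2"}  # node n0 has just failed
print(can_accept_write(after_failure, "leader_follower"))
print(can_accept_write(after_failure, "peer"))
```

In practice Leader | Follower systems elect or promote a new leader, so the gap is a window of unavailability rather than a permanent outage.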
I’m biased, but to me…
Truly Available Systems MUST
BE Distributed Across
Geographical Boundaries.
Availability Technique Examples
Leader | Follower                 Peer Based
RDBMS (particularly sharded)      Cassandra
MongoDB                           Riak*
Hadoop (and Ecosystem)            DynamoDB
Spark                             S3
Most Analytical-Oriented Data Stores Favor the Leader |
Follower Approach of Availability.
Consistency Techniques
• Systems that are Leader | Follower based are
typically consistent
• Peer-based, or other non-Leader | Follower
based, systems are vulnerable to inconsistency.
– These types of systems are typically called
Eventually Consistent because they do tend to
become consistent over a period of time.
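To make "become consistent over a period of time" concrete, here is a small simulation. It assumes a pull-based repair loop in which each replica compares notes with one neighbour per round and keeps the newer (timestamp, value) pair; that mechanism and the name `anti_entropy_round` are illustrative choices, not a description of how any particular product does it.

```python
# Toy simulation of eventual consistency.
# Assumption: each round, every replica pulls from its right-hand neighbour
# and keeps whichever (timestamp, value) pair is newer.
def anti_entropy_round(replicas):
    n = len(replicas)
    snapshot = list(replicas)  # everyone reads the state from before the round
    return [max(snapshot[i], snapshot[(i + 1) % n]) for i in range(n)]


# Replica 0 has accepted a newer write (timestamp 2); the others have not.
replicas = [(2, "V2"), (1, "V1"), (1, "V1"), (1, "V1")]
rounds = 0
while len(set(replicas)) > 1:
    replicas = anti_entropy_round(replicas)
    rounds += 1
    print(rounds, [value for _, value in replicas])
```

Until the loop finishes, a read can return V1 or V2 depending on which replica answers, which is exactly the "different results at different times" behaviour an analytical job has to tolerate.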
Highlighted Consistency Types
Consistency Type: Definition
• Strict: A shared-memory system is said to support the strict
consistency model if the value returned by a read
operation on a memory address is always the same as the
value written by the most recent write operation to that
address, irrespective of the locations of the processes
performing the read and write operations. That is, all
writes instantaneously become visible to all processes.
• Sequential (all nodes appear to see the same order):
The result of any execution is the same as if the (read and
write) operations by all processes on the data store were
executed in some sequential order and the operations of
each individual process appear in this sequence in the
order specified by its program.
• Linearizable (also known as atomic consistency):
An execution is linearizable if each operation appears to
take effect at a single point between its begin time and
its end time, and the resulting order guarantees sequential
consistency.
• Causal (order may not be observed):
Writes that are potentially causally related must be seen
by all processes in the same order. Concurrent writes may
be seen in a different order on different machines.
For more, go here - http://en.wikipedia.org/wiki/Consistency_model
And here - http://en.wikipedia.org/wiki/Linearizability
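The phrases "potentially causally related" and "concurrent" can be checked mechanically. One common way, used here as an assumption since the definitions above do not prescribe a mechanism, is to tag each write with a vector clock (a per-process counter map) and compare the tags. The function name `relation` is made up for this sketch.

```python
# Sketch: classify two writes by comparing their vector clocks.
# A vector clock is modelled as a dict of process name -> counter;
# a missing entry counts as 0.
def relation(a, b):
    keys = set(a) | set(b)
    a_le_b = all(a.get(k, 0) <= b.get(k, 0) for k in keys)
    b_le_a = all(b.get(k, 0) <= a.get(k, 0) for k in keys)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"      # a may have caused b: everyone must see a first
    if b_le_a:
        return "after"
    return "concurrent"      # no causal link: replicas may order them differently


w1 = {"p1": 1}               # p1 writes
w2 = {"p1": 1, "p2": 1}      # p2 writes after having seen w1
w3 = {"p1": 2}               # p1 writes again without having seen w2

print(relation(w1, w2))
print(relation(w2, w3))
```

Under causal consistency only the "before"/"after" pairs carry an ordering obligation; the concurrent pair is where different machines may legitimately disagree.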
Highlighted Consistency Protocols
for Eventually Consistent Systems
Protocol: Definition
• CRDT (Convergent Replicated Data Types):
Used to enable abstract data types in eventually consistent (EC)
systems, such as sets, lists, and counters, which require additional
functionality to stay accurate in an eventually consistent
distributed system.
https://vimeo.com/43903960
• CRDT – Last Write Wins: Implementation of CRDT where timestamps are
stored in cell values and the system only returns the replica with
the latest timestamp.
• CRDT – Vector Clocks: Implementation of CRDT where the system stores
and returns a merged set of all writes. Typically requires a
read-before-write style operation.
• Paxos (2 Phase Commits):
Used to provide strong consistency in an EC system at the cost of
performance for the transaction. The coordinator gets agreement
from participants that the coordinator’s message will be the only
accepted mutation during the operation. Typically requires 4 RTTs
(round-trip times).
• RAMP: New theoretical protocol to provide strong consistency, like
Paxos, at half or better the cost. Writes typically take 2 RTTs and
reads typically take 1-2 RTTs.
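Two of the ideas above fit in a short sketch: a last-write-wins register and a counter that merges per-replica slots. The class names `LWWRegister` and `GCounter`, and the choice to break timestamp ties by comparing values, are assumptions for illustration; real systems pick their own tie-break and clock source.

```python
# Sketch of two convergent data types; names and tie-break rule are assumptions.
from dataclasses import dataclass


@dataclass(frozen=True)
class LWWRegister:
    value: str
    timestamp: int

    def merge(self, other):
        # Keep the write with the later timestamp. Ties fall back to comparing
        # values so that a.merge(b) and b.merge(a) always agree.
        return max(self, other, key=lambda r: (r.timestamp, r.value))


class GCounter:
    """Grow-only counter: one slot per replica, merged with element-wise max."""

    def __init__(self, counts=None):
        self.counts = dict(counts or {})

    def increment(self, replica, n=1):
        self.counts[replica] = self.counts.get(replica, 0) + n

    def merge(self, other):
        keys = set(self.counts) | set(other.counts)
        return GCounter({k: max(self.counts.get(k, 0), other.counts.get(k, 0))
                         for k in keys})

    def value(self):
        return sum(self.counts.values())


old, new = LWWRegister("V1", 1), LWWRegister("V2", 2)
print(old.merge(new).value, new.merge(old).value)

x, y = GCounter(), GCounter()
x.increment("replica_a", 2)   # two increments seen only by replica a
y.increment("replica_b", 3)   # three increments seen only by replica b
merged = x.merge(y)
print(merged.value())
```

The point of both types is that merging is safe to repeat and safe to do in any order, so replicas converge without coordinating. The price for last-write-wins is that the "losing" write is silently discarded.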
Partition Tolerance
• Technically, Partition Tolerance relates to
networking, but it is vague.
• Technically, if the System can withstand a
network partition, then it is tolerant to
Partition.
Note: My interpretation of Partition Tolerance is
controversial as the CAP Theorem is very vague on the
meaning of “Working” when defining Partition
Tolerance.
Trade Offs
In practice, each “service” chooses how to trade
Availability against Consistency.
F1 F2 F3
F4 F5 F6
F7 F8 F9
Let’s say F1, F3, F5, F6, F9 are Leader |
Follower based
Let’s say F2, F4, F7, F8 are Peer based
What does this mean?
Systems by CAP Classification
AP           CP                      AC
Cassandra    Hadoop and Ecosystem    RDBMS
Riak         Spark                   Vertica
Dynamo       Mongo
CouchDB      Couchbase
Can your Analytical solution tolerate data sourced from
a system that is not always available, i.e. holes in data?
Can your Analytical solution tolerate data sourced from
an eventually consistent system, i.e. different results at
different times?
What if your data comes from both types of systems?
What if you are processing your data on one or the other
system?
Practical CAP
Reference Architectures
Here are some views of “standard” architectures
• Lambda
• Kappa
• “Data Lake”
Lambda
http://lambda-architecture.net/
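As a rough mental model, Lambda keeps a slow-but-complete batch view and a fast-but-partial speed view, and answers queries by combining the two. The sketch below assumes page-view counting as the workload; the function names and event shape are hypothetical.

```python
# Minimal sketch of the Lambda idea: batch view + speed view, merged at query time.
# Workload (page-view counts), function names, and event shape are assumptions.
from collections import Counter


def batch_view(master_dataset):
    """Batch layer: recompute counts from the full, immutable event log."""
    return Counter(event["page"] for event in master_dataset)


def speed_view(recent_events):
    """Speed layer: count only events that arrived since the last batch run."""
    return Counter(event["page"] for event in recent_events)


def query(page, batch, speed):
    """Serving layer: combine both views to answer with fresh totals."""
    return batch[page] + speed[page]


master = [{"page": "home"}, {"page": "home"}, {"page": "cart"}]
recent = [{"page": "home"}]

batch, speed = batch_view(master), speed_view(recent)
print(query("home", batch, speed), query("cart", batch, speed))
```

The cost of this design is that the same logic lives in two code paths (batch and speed), which is the main complaint behind the Kappa alternative.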
Kappa
Simplified Lambda, where all data is streamed
http://www.kappa-architecture.com/
http://radar.oreilly.com/2014/07/questioning-the-lambda-architecture.html
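A comparable sketch for Kappa, under the same assumed page-view workload: there is a single stream-processing function, and "reprocessing" means replaying the retained log through it. `process_stream` and the event shape are hypothetical names for illustration.

```python
# Minimal sketch of the Kappa idea: one streaming code path, rebuilt by replay.
# process_stream and the event shape are illustrative assumptions.
def process_stream(events, state=None):
    """Fold events into a page -> count view, starting from an optional prior state."""
    state = dict(state or {})
    for event in events:
        state[event["page"]] = state.get(event["page"], 0) + 1
    return state


log = [{"page": "home"}, {"page": "cart"}]
new_events = [{"page": "home"}]

view = process_stream(log)                    # live view built from the stream
view = process_stream(new_events, view)       # keeps updating as events arrive
rebuilt = process_stream(log + new_events)    # replaying the log gives the same view
print(view, rebuilt == view)
```

Because the view is a pure function of the log, fixing a bug in the logic only requires replaying the log with the new code, with no separate batch job to keep in sync.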
Data Lake
My view – Data Lake is Marketicture
• Pivotal - http://www.informationweek.com/big-data/software-platforms/pivotal-subscription-points-to-real-value-in-big-data/d/d-id/1174110
• Hortonworks - http://www.slideshare.net/hortonworks/modern-data-architecture-for-a-data-lake-with-informatica-and-hortonworks-data-platform
• Cloudera - http://vision.cloudera.com/the-enterprise-data-hub/
http://www.gartner.com/newsroom/id/2809117
Summary
• Data Scientists will require working knowledge
of Data Engineering
• CAP
– Consistency
– Availability
– Partition Tolerance
• Architectures in the New World
  • 3. DataStax Introduction 1. Commercial Provider of Apache Cassandra 2. Provider of Proprietary Software Built on Apache Cassandra 3. Deliverer of a linearly scalable, “always-on” Data Platform on the foundation of Apache Cassandra and the integration of: 1. Apache Spark 2. Apache SOLR 3. Apache Hadoop 4. TitanDB
  • 4. DataStax, What we Do (Use Cases) • Fraud Detection • Personalization • Internet of Things • Messaging • Lists of Things (Products, Playlists, etc) • Smaller set of other things too! We are all about working with temporal data sets at large volumes with high transaction counts (velocity).
  • 5. “One believes things because one has been conditioned to believe them.” ― Aldous Huxley, Brave New World
  • 6. After today, you will have enough knowledge to walk into any organization and communicate with Data Engineers, in their terms, to effectively design Analytical solutions based on modern technologies.
  • 7. Agenda • Background and Context – From 1 Database to Distributed, Polyglot Persistence Data Stores • Data Engineering Concepts 101 – The CAP Theorem and its Variants • Data Engineering Concepts 102 – Deeper into CAP • The Data Stores You Will (Probably) Use • The Architectures in Which You Will Participate
  • 8. What’s Happened in the Last 10 Years OLTP Web Application Tier OLAP Statistical/Analytical Applications ETL 2005
  • 10. Today
  • 11. Today
  • 13. Innovations in Data Engineering • 2000 – Eric Brewer’s CAP Theorem, proved in 2002 – http://en.wikipedia.org/wiki/CAP_theorem • 2004 – Google MapReduce – http://research.google.com/archive/mapreduce.html • 2006 – Google Bigtable – http://static.googleusercontent.com/media/research.google.com/en/us/archive/bigtable-osdi06.pdf • 2007 – Amazon Dynamo – http://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf • 2008 – Polyglot Persistence – https://www.altamiracorp.com/blog/employee-posts/polyglot-persistence • 2009 – NoSQL (in modern terms) Introduced – http://en.wikipedia.org/wiki/NoSQL • 2012 – Berkeley Spark – https://amplab.cs.berkeley.edu/wp-content/uploads/2012/01/nsdi_spark.pdf • …
  • 14. Today F1 F2 F3 F4 F5 F6 F7 F8 F9 Distributed File Systems ETL • Polyglot Persistence and Services Integration are the Norm • Data Stores are Distributed • Centralize Data via File Systems • Hadoop, GFS, S3, etc • Open Source Rules • Analytical Applications • Python, R, Scala, Java • Data Pipelines (not depicted)
  • 16. To succeed you must thrive in this environment! 1 + 1 = 2 Only Sometimes
  • 17. CAP Theorem (The Foundation) It is impossible for a distributed computer system to simultaneously provide all three of the following guarantees: • Consistency (all nodes see the same data at the same time) • Availability (a guarantee that every request receives a response about whether it succeeded or failed) • Partition tolerance (the system continues to operate despite arbitrary message loss or failure of part of the system)
  • 18. Consistency All nodes in a system see the same data at the same time. V1 V1 V1 V1 V1
  • 19. Availability A guarantee that every request receives a response about whether it succeeded or failed. V1 V1 V1 V1 Request Response
  • 20. Partition Tolerance The system continues to operate despite arbitrary message loss or failure of part of the system. Graphic and following example, borrowed from here – http://www.slideshare.net/YoavFrancis/cap-theorem-theory-implications-and-practices
  • 21. CAP an Example V0 V0 V0 V0
  • 22. CAP an Example V1 V0 V0 V0 V1
  • 23. CAP an Example V1 V1 V1 V1
  • 24. CAP an Example V1 V1 V1 V1 V1
  • 25. CAP an Example V1 V1 V1 V1
  • 26. CAP an Example V2 V1 V1 V1 V2
  • 27. CAP an Example V2 V2 V2 V1 Partition
  • 28. CAP an Example V2 V2 V2 V1 V1 Partition
  • 29. In a Distributed Environment, one must trade off among availability, consistency, and partition tolerance.
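
The partition example on the preceding slides can be sketched in a few lines of Python. This is a toy model under stated assumptions, not how any real data store is implemented: the `Replica` and `Cluster` classes, the `reachable` flag, and the `"AP"`/`"CP"` mode switch are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


class Unavailable(Exception):
    """Raised when the cluster refuses a request during a partition."""


@dataclass
class Replica:
    name: str
    value: str = "V0"
    reachable: bool = True  # False while cut off from the other replicas


@dataclass
class Cluster:
    replicas: list
    mode: str = "AP"  # "AP" favors availability, "CP" favors consistency

    def write(self, value):
        """Apply a write to every replica that is currently reachable."""
        if self.mode == "CP" and not all(r.reachable for r in self.replicas):
            raise Unavailable("write refused: cannot reach every replica")
        for replica in self.replicas:
            if replica.reachable:
                replica.value = value

    def read(self, replica_name):
        """Read from one replica, as a client connected to that node would."""
        replica = next(r for r in self.replicas if r.name == replica_name)
        if self.mode == "CP" and not replica.reachable:
            raise Unavailable("read refused: replica may be stale")
        return replica.value


def run(mode):
    """Replay the slide sequence: write V1, partition one node, write V2."""
    nodes = [Replica("a"), Replica("b"), Replica("c"), Replica("d")]
    cluster = Cluster(nodes, mode=mode)
    cluster.write("V1")          # all four replicas agree on V1
    nodes[3].reachable = False   # partition: node d is cut off
    try:
        cluster.write("V2")      # accepted in AP mode, refused in CP mode
    except Unavailable:
        pass
    try:
        return cluster.read("a"), cluster.read("d")
    except Unavailable:
        return cluster.read("a"), None
```

In this sketch, `run("AP")` keeps answering but the partitioned node serves the stale V1 while the others serve V2, whereas `run("CP")` keeps every answer consistent by refusing the V2 write and the read from the cut-off node.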
  • 30. Availability Techniques Either a system is available in the face of any failure or it is not. Leader | Follower (Leader, Follower, Follower): Availability Vulnerability. Peer-to-Peer: Availability Resilient*
  • 32. I’m biased, but to me… Truly Available Systems MUST BE Distributed Across Geographical Boundaries.
  • 33. Availability Technique Examples
      – Leader | Follower: RDBMS (particularly sharded), MongoDB, Hadoop (and Ecosystem), Spark
      – Peer Based: Cassandra, Riak*, DynamoDB, S3
      Most Analytical-Oriented Data Stores Favor the Leader | Follower Approach of Availability.
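
The difference between the two availability techniques can be illustrated with a deliberately small Python sketch. It assumes a three-node cluster and ignores leader election; real leader/follower systems usually promote a follower after some delay, which narrows but does not remove the window of unavailability. The function names and the `required` parameter are hypothetical.

```python
def leader_follower_write(nodes_up, leader):
    """A write succeeds only while the designated leader is up."""
    return leader in nodes_up


def peer_write(nodes_up, replicas, required=1):
    """A write succeeds if at least `required` replicas for the key are up."""
    return len(nodes_up & replicas) >= required


all_nodes = {"n1", "n2", "n3"}
up_after_failure = all_nodes - {"n1"}  # node n1 fails

# With n1 as leader, its failure blocks writes until a new leader exists.
leader_ok = leader_follower_write(up_after_failure, leader="n1")

# In a peer-based design any surviving replica can accept the write.
peer_ok = peer_write(up_after_failure, replicas=all_nodes)
```

Raising `required` in `peer_write` trades some of that availability back for stronger agreement between replicas, which is the same trade-off the CAP slides describe.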
  • 34. Consistency Techniques • Systems that are Leader | Follower based are typically consistent • Peer based, or other non Leader | Follower based systems are vulnerable to inconsistency. – These types of systems are typically called Eventually Consistent because they do tend to become consistent over a period of time.
  • 35. Highlighted Consistency Types
      – Strict: A shared-memory system is said to support the strict consistency model if the value returned by a read operation on a memory address is always the same as the value written by the most recent write operation to that address, irrespective of the locations of the processes performing the read and write operations. That is, all writes instantaneously become visible to all processes.
      – Sequential (all nodes appear to see the same order): The result of any execution is the same as if the (read and write) operations by all processes on the data store were executed in some sequential order and the operations of each individual process appear in this sequence in the order specified by its program.
      – Linearizable (also known as atomic consistency): An execution is linearizable if each operation appears to take effect at a single point between its begin time and its end time, and ordering the operations by those points guarantees sequential consistency.
      – Causal (order may not be observed): Writes that are potentially causally related must be seen by all processes in the same order. Concurrent writes may be seen in a different order on different machines.
      For more, go here - http://en.wikipedia.org/wiki/Consistency_model And here - http://en.wikipedia.org/wiki/Linearizability
  • 36. Highlighted Consistency Protocols for Eventually Consistent Systems
      – CRDT (Convergent Replicated Data Types): Used to enable abstract data types in EC systems: sets, lists, and counters that require additional functionality to stay accurate in an eventually consistent distributed system. https://vimeo.com/43903960
      – CRDT – Last Write Wins: Implementation of CRDT where timestamps are stored in cell values and the system only returns the replica with the latest timestamp.
      – CRDT – Vector Clocks: Implementation of CRDT where the system stores and returns a merged set of all writes. Typically requires a read-before-write style operation.
      – Paxos (2 Phase Commits): Used to provide strong consistency in an EC system at the cost of performance for the transaction. The coordinator gets agreement from participants that the coordinator’s message will be the only accepted mutation during the operation. Typically requires 4 RTTs.
      – RAMP: New theoretical protocol to provide strong consistency, like Paxos, at half the cost or better. Writes typically take 2 RTTs and reads typically take 1-2 RTTs.
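
The last-write-wins and vector-clock entries above can be made concrete with a minimal Python sketch. It assumes integer timestamps that are comparable across writers and represents a vector clock as a plain dict of per-node counters; `Cell`, `lww_merge`, and the `vc_*` helpers are hypothetical names, not the API of any particular data store.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cell:
    value: str
    timestamp: int  # write time attached by the writer (assumed comparable)


def lww_merge(replica_cells):
    """Last Write Wins: return the cell carrying the highest timestamp."""
    return max(replica_cells, key=lambda cell: cell.timestamp)


def vc_increment(clock, node):
    """Return a copy of the vector clock with this node's counter advanced."""
    updated = dict(clock)
    updated[node] = updated.get(node, 0) + 1
    return updated


def vc_descends(a, b):
    """True if clock `a` has seen every event that clock `b` has seen."""
    return all(a.get(node, 0) >= count for node, count in b.items())


def vc_compare(a, b):
    """Classify version `a` relative to `b`."""
    a_covers_b, b_covers_a = vc_descends(a, b), vc_descends(b, a)
    if a_covers_b and b_covers_a:
        return "equal"
    if a_covers_b:
        return "after"
    if b_covers_a:
        return "before"
    return "concurrent"  # neither saw the other: both versions must be kept


# Three replicas return their copy of one cell; the newest timestamp wins.
winner = lww_merge([Cell("V1", 10), Cell("V2", 12), Cell("V1", 10)])

# Two clients update the same base version through different nodes.
base = vc_increment({}, "n1")
left = vc_increment(base, "n1")
right = vc_increment(base, "n2")
relation = vc_compare(left, right)
```

Here `winner.value` is `"V2"` and `relation` is `"concurrent"`. Last write wins silently discards the losing write, while the vector-clock comparison only detects the conflict, leaving the system or the application to merge the two versions on a later read.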
  • 37. Partition Tolerance • Technically, Partition Tolerance relates to networking, but it is vague. • Technically, if the System can withstand a network partition, then it is tolerant to Partition. Note: My interpretation of Partition Tolerance is controversial as the CAP Theorem is very vague on the meaning of “Working” when defining Partition Tolerance.
  • 38. Trade Offs In practice, each “service” chooses how to trade Availability against Consistency. F1 F2 F3 F4 F5 F6 F7 F8 F9 Let’s say F1, F3, F5, F6, F9 are Leader | Follower based. Let’s say F2, F4, F7, F8 are Peer based. What does this mean?
  • 39. Systems by CAP Classification
      – AP: Cassandra, Riak, Dynamo, CouchDB
      – CP: Hadoop and EcoSystem, Spark, Mongo, Couchbase
      – AC: RDBMS, Vertica
  • 40. Practical CAP Can your Analytical solution tolerate data sourced from a system that is not always available, i.e. holes in data? Can your Analytical solution tolerate data sourced from an eventually consistent system, i.e. different results at different times? What if your data comes from both types of systems? What if you are processing your data on one or the other system?
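
One way to start answering these questions is to check the data itself. The Python sketch below assumes readings are expected at a fixed interval and that the same aggregate can be read twice; `find_holes`, `results_agree`, and the one-minute step are illustrative assumptions rather than part of any specific toolchain.

```python
from datetime import datetime, timedelta


def find_holes(timestamps, expected_step):
    """Return (gap_start, gap_end) pairs where consecutive readings are
    further apart than the expected step."""
    ordered = sorted(timestamps)
    return [
        (earlier, later)
        for earlier, later in zip(ordered, ordered[1:])
        if later - earlier > expected_step
    ]


def results_agree(first_read, second_read, tolerance=0.0):
    """Compare one aggregate read twice from an eventually consistent source."""
    return abs(first_read - second_read) <= tolerance


start = datetime(2015, 1, 1, 0, 0)
# Readings at minutes 0, 1, 2, 5, 6: minutes 3 and 4 are missing.
readings = [start + timedelta(minutes=m) for m in (0, 1, 2, 5, 6)]
holes = find_holes(readings, timedelta(minutes=1))
```

`holes` contains the single gap between minute 2 and minute 5. Whether a gap of that size, or a disagreement between two reads, is acceptable depends on the analysis, so the threshold and the tolerance are choices for the data scientist to make explicitly.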
  • 41. Reference Architectures Here are some views of “standard” architectures • Lambda • Kappa • “Data Lake”
  • 43. Kappa Simplified Lambda, where all data is streamed http://www.kappa-architecture.com/ http://radar.oreilly.com/2014/07/questioning-the-lambda-architecture.html
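
The idea behind Kappa, that every view is derived by streaming over one log and is rebuilt by replaying that log, can be sketched in Python. The in-memory `EventLog` class is a hypothetical stand-in for a durable, replayable stream, and the two view functions are made-up examples of stream jobs.

```python
from collections import defaultdict


class EventLog:
    """Append-only log; an in-memory stand-in for a durable event stream."""

    def __init__(self):
        self._events = []

    def append(self, event):
        """Store the event and return its offset in the log."""
        self._events.append(event)
        return len(self._events) - 1

    def read_from(self, offset=0):
        """Replay events in order, starting at the given offset."""
        return iter(self._events[offset:])


def count_by_user(events):
    """First version of the stream job: number of events per user."""
    counts = defaultdict(int)
    for event in events:
        counts[event["user"]] += 1
    return dict(counts)


def spend_by_user(events):
    """Revised job: total amount per user, built from the same log."""
    totals = defaultdict(float)
    for event in events:
        totals[event["user"]] += event["amount"]
    return dict(totals)


log = EventLog()
for user, amount in [("ann", 5.0), ("bob", 2.5), ("ann", 1.0)]:
    log.append({"user": user, "amount": amount})

view_v1 = count_by_user(log.read_from(0))
view_v2 = spend_by_user(log.read_from(0))  # reprocess by replaying from offset 0
```

Changing the logic means writing a new job and replaying the log from the beginning, rather than maintaining a separate batch layer alongside the streaming layer.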
  • 44. Data Lake My view – Data Lake is Marketicture • Pivotal - http://www.informationweek.com/big-data/software-platforms/pivotal-subscription-points-to-real-value-in-big-data/d/d-id/1174110 • Hortonworks - http://www.slideshare.net/hortonworks/modern-data-architecture-for-a-data-lake-with-informatica-and-hortonworks-data-platform • Cloudera - http://vision.cloudera.com/the-enterprise-data-hub/ http://www.gartner.com/newsroom/id/2809117
  • 45. Summary • Data Scientists will require working knowledge of Data Engineering • CAP – Consistency – Availability – Partition Tolerance • Architectures in the New World

Editor's Notes

  1. 3 mins
  2. 10 mins
  3. 15 mins
  4. 17 mins
  5. 20 mins
  6. 22 mins
  7. 32 mins
  8. 36 mins
  9. 40 mins
  10. 45 mins
  11. 47 mins
  12. 55 mins
  13. 60 mins
  14. 62 mins
  15. 65 mins
  16. 70 mins
  17. 75 mins
  18. 80 mins
  19. 85 mins
  20. 90 mins
  21. 92 mins
  22. 97 mins
  23. 100 mins
  24. 102 mins
  25. 105 mins
  26. 110 mins
  27. 120 mins
  28. 125 mins