Anything data (revisited)

Big, Streaming, NoSQL, Cloud, Science
… a sloppy travel guide

whoami (linkedin.com/in/ahmetakyol)

whoami - Dilbert already did it

Who are these people, or who are you?

Why a travel guide?
“... Martin is an excellent map reader even in the most hectic Italian traffic … And after Martin and Cindy left us, we did better because we had learned from what they had showed us … When there’s no guide available, it helps to have someone who understands how to read the maps, tracks, signs, and indications. When we’re on our own, it helps to learn how to do those things ourselves”
“Software projects are always traveling in areas they don’t know”
Ron Jeffries (from his foreword for PoAPA book)

Why a ‘sloppy’ travel guide - (Big Data Landscape 2012)

Why a ‘sloppy’ travel guide - (Big Data Landscape 2017)

Why a ‘sloppy’ travel guide - (many others: AI, IoT ...)

Why a ‘sloppy’ travel guide - (the ‘n’ V’s of Big Data)

Chasing Cool Technologies - Big Data Envy
“We continue to see organizations chasing ‘cool’ technologies, taking on unnecessary complexity and risk when a simpler choice would be better.”
“While we've long understood the value of Big Data to better understand how people interact with us, we've noticed an alarming trend of Big Data envy: organizations using complex tools to handle ‘not-really-that-big’ Data.”
“The Apache Cassandra database promises massive scalability on commodity hardware, but we have seen teams overwhelmed by its architectural and operational complexity. Unless you have data volumes that require a 100+ node cluster, we recommend against using Cassandra.”
https://www.thoughtworks.com/radar/techniques/big-data-envy

Big Data Envy - architectural complexity (expectation)
From a ‘10,000-foot view’, big data systems may seem like ‘good old n-tier’ systems.

Big Data Envy - architectural complexity (example)
A dataflow diagram from a good (but still only a) reference application. Real-life examples are usually more complex!

Big Data Envy - architectural complexity (aws example)
Big Data Architectural Patterns and Best Practices on AWS : https://www.youtube.com/watch?v=RNrsIlweCno

Big Data Envy - architectural complexity (blueprints)

Big Data Envy - operational complexity (devops)

Big Data Envy - operational complexity (devops)
http://www.slideshare.net/jcmia1/apache-spark-20-tuning-guide
● Tuning the JVM, the OS and each (big) data system
● Choosing the right hardware for each ‘right solution’
● Orchestrating / monitoring / debugging many small applications running on and/or interacting with such distributed systems
OOM troubleshooting example for Apache Spark

Know thyself - reaching the cliff of confusion
https://www.vikingcodeschool.com/posts/why-learning-to-code-is-so-damn-hard

What is your learning style?
“What’s a better learning strategy: covering a subject in full detail from top-to-bottom, or progressively sharpening a quick overview?”

How about an expanding/evolving learning style?
Lifelong learning is the "ongoing, voluntary, and self-motivated" pursuit of knowledge for either personal or professional reasons. Therefore, it not only enhances social inclusion, active citizenship, and personal development, but also self-sustainability, as well as competitiveness and employability.

The Unknown Unknowns - the iceberg of ignorance
In his acclaimed study “The Iceberg of Ignorance”, consultant Sidney Yoshida concluded: “Only 4% of an organization’s front line problems are known by top management, 9% are known by middle management, 74% by supervisors and 100% by employees…”

Guidelines - the very first principle (business value)
“DDD isn’t first and foremost about technology. In its most central principles, DDD is about discussion, listening, understanding, discovery, and business value, all in an effort to centralize knowledge. If you are capable of understanding the business in which your company works, you can at a minimum participate in the software model discovery process to produce a Ubiquitous Language.”
“Our highest priority is to satisfy the customer through early and continuous delivery of valuable software”
The very first principle of the Agile Manifesto

Guidelines - science before technology (business value)

Guidelines - garbage dump or compulsive hoarding (business value)

Guidelines - making simple but not simpler
● “Make things as simple as possible, but not simpler.” (Albert Einstein)
● As simple as possible: no over-engineering; search for the simplest feasible solution
○ feasible ‘ready’ solution
○ fully managed solutions
○ manageable packaged solutions with support
○ solutions known for stability, manageability
● Not simpler: no under-engineering
○ right task, right tool
○ right usage: design patterns, best practices

Guidelines - right task right tool isn’t enough

Guidelines - right task right tool right usage
DynamoDB Design Patterns and Best Practices : https://www.youtube.com/watch?v=PDQ3jbDyTQ4

Guidelines - don’t let API fool you (cassandra)
CQL Under The Hood : https://www.youtube.com/watch?v=CY5-bWpqAVA

Guidelines - learn data paths and structures (C*)
Learning the “write path”, the “read path” and the main internal data structures gives critical hints about “do’s and don’ts”, especially anti-patterns:
● Queue-like designs
● Intensive updates
● Deletes
http://www.slideshare.net/doanduyhai/cassandra-nice-use-cases-and-worst-anti-patterns
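
Why deletes and queue-like designs hurt can be illustrated with a toy model. The sketch below is not Cassandra's implementation: `ToyPartition` and its methods are hypothetical names, and the model assumes only that a delete writes a marker (a tombstone) instead of removing data in place, and that a read has to step over those markers until they are eventually purged.

```python
TOMBSTONE = object()  # marker written by a delete (toy stand-in for a tombstone)


class ToyPartition:
    """Hypothetical append-style partition: deletes add markers, nothing is removed in place."""

    def __init__(self):
        self.cells = {}  # clustering key -> value or TOMBSTONE

    def insert(self, key, value):
        self.cells[key] = value

    def delete(self, key):
        # A delete is just another write: it shadows the old value.
        self.cells[key] = TOMBSTONE

    def read_first_live(self):
        """Return (key, value, tombstones_scanned) for the lowest live key."""
        scanned = 0
        for key in sorted(self.cells):
            value = self.cells[key]
            if value is TOMBSTONE:
                scanned += 1  # work done without returning any data
                continue
            return key, value, scanned
        return None, None, scanned


# Queue-like usage: enqueue 1000 messages, then consume (read + delete) 990 of them.
queue = ToyPartition()
for i in range(1000):
    queue.insert(i, f"msg-{i}")
for _ in range(990):
    key, _, _ = queue.read_first_live()
    queue.delete(key)

key, value, scanned = queue.read_first_live()
print(key, value, scanned)  # the next read steps over every consumed message
```

In this toy model the cost of reading the head of the queue grows with the number of messages already consumed, which is the shape of the problem behind the queue-like and delete-heavy anti-patterns listed above.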

Guidelines - loading data, layouts and file formats (hdfs)
● Data distribution, small files problem
● Row vs. columnar formats
● I/O advantage, read only what you need:
○ Vertical: projection
○ Horizontal: predicate pushdown
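
The two I/O savings above can be sketched with a toy columnar layout. This is a minimal illustration, not a real file format: `make_row_group` and `scan` are hypothetical helpers, and the layout only assumes what columnar formats generally provide, namely values stored per column and min/max statistics kept per row group.

```python
def make_row_group(rows, columns):
    """Store a batch of rows column by column, with min/max statistics per column."""
    data = {c: [r[c] for r in rows] for c in columns}
    stats = {c: (min(v), max(v)) for c, v in data.items()}
    return {"data": data, "stats": stats}


def scan(row_groups, select, column, lower):
    """Return the `select` columns of rows where row[column] >= lower.

    Also returns how many stored values were actually read.
    """
    values_read = 0
    out = []
    needed = set(select) | {column}  # vertical: projection, touch only needed columns
    for rg in row_groups:
        _, col_max = rg["stats"][column]
        if col_max < lower:
            continue  # horizontal: predicate pushdown, skip the whole row group
        cols = {c: rg["data"][c] for c in needed}
        values_read += sum(len(v) for v in cols.values())
        for i, v in enumerate(cols[column]):
            if v >= lower:
                out.append({c: cols[c][i] for c in select})
    return out, values_read


columns = ["id", "temp", "city"]
cold = [{"id": 1, "temp": 10, "city": "x"}, {"id": 2, "temp": 12, "city": "y"}, {"id": 3, "temp": 15, "city": "z"}]
hot = [{"id": 4, "temp": 30, "city": "x"}, {"id": 5, "temp": 31, "city": "y"}, {"id": 6, "temp": 35, "city": "z"}]
row_groups = [make_row_group(cold, columns), make_row_group(hot, columns)]

result, values_read = scan(row_groups, select=["id"], column="temp", lower=30)
print(result, values_read)
```

Here the query touches 6 of the 18 stored values: the first row group is skipped using its statistics, and the `city` column is never read.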

Guidelines - SQL or not (Spark as a Compiler)
https://databricks.com/blog/2016/05/23/apache-spark-as-a-compiler-joining-a-billion-rows-per-second-on-a-laptop.html

Guidelines - SQL or not (Beam Combine vs GroupBy)
https://issues.apache.org/jira/browse/BEAM-2477
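
A rough way to see why a combine-style aggregation can be cheaper than grouping first is to count the records that have to cross the shuffle. The sketch below is plain Python, not Beam code, and the function names are hypothetical; it assumes the aggregation (a sum) is associative and commutative, so it can be partially applied per partition before shuffling.

```python
from collections import defaultdict


def group_then_sum(partitions):
    """Group-by-key style: every (key, value) record crosses the shuffle."""
    shuffled = [kv for part in partitions for kv in part]
    grouped = defaultdict(list)
    for k, v in shuffled:
        grouped[k].append(v)
    return {k: sum(vs) for k, vs in grouped.items()}, len(shuffled)


def combine_sum(partitions):
    """Combine style: pre-aggregate per partition, shuffle one partial sum per key."""
    shuffled = []
    for part in partitions:
        partial = defaultdict(int)
        for k, v in part:
            partial[k] += v
        shuffled.extend(partial.items())
    totals = defaultdict(int)
    for k, v in shuffled:
        totals[k] += v
    return dict(totals), len(shuffled)


partitions = [
    [("a", 1), ("a", 2), ("b", 3)],
    [("a", 4), ("b", 5), ("b", 6)],
]
grouped_result, grouped_shuffled = group_then_sum(partitions)
combined_result, combined_shuffled = combine_sum(partitions)
print(grouped_result, grouped_shuffled)
print(combined_result, combined_shuffled)
```

Both functions return the same totals, but the combine-style version shuffles one record per key per partition instead of one per input element. A declarative API (SQL, DataFrames) gives the engine room to pick the cheaper plan on its own.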

Guidelines - SQL or not (Spark RDD vs Spark DF and SQL)
https://databricks.com/session/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets

Guidelines - learning from costs (google)

Guidelines - learning from costs (bigquery)

Guidelines - learning from costs (kinesis)
“Pricing is based on volume of data ingested into Amazon Kinesis Firehose, which is calculated as the number of data records you send to the service, times the size of each record rounded up to the nearest 5KB. For example, if your data records are 42KB each, Amazon Kinesis Firehose will count each record as 45 KB of data ingested.”
“A record is the data that your data producer adds to your Amazon Kinesis Stream. A PUT Payload Unit is counted in 25KB payload “chunks” that comprise a record. For example, a 5KB record contains one PUT Payload Unit, a 45KB record contains two PUT Payload Units, and a 1MB record contains 40 PUT Payload Units. PUT Payload Unit is charged with a per million PUT Payload Units rate.”
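
The rounding rules in the two quotes can be checked with a few lines of arithmetic. This is a sketch of the quoted rules only, not an official pricing calculator: the function names are hypothetical, sizes are in whole KB, and 1 MB is assumed to be 1000 KB so that the result matches the quoted "40 PUT Payload Units".

```python
def firehose_billed_kb(record_kb: int, increment_kb: int = 5) -> int:
    """Billed size of one record: its size rounded up to the next multiple of 5 KB."""
    increments = -(-record_kb // increment_kb)  # integer ceiling division
    return increments * increment_kb


def streams_put_payload_units(record_kb: int, chunk_kb: int = 25) -> int:
    """Number of 25 KB chunks needed to cover one record."""
    return -(-record_kb // chunk_kb)  # integer ceiling division


print(firehose_billed_kb(42))           # 45, as in the first quote
print(streams_put_payload_units(5))     # 1
print(streams_put_payload_units(45))    # 2
print(streams_put_payload_units(1000))  # 40, assuming 1 MB = 1000 KB
```

The lesson from the pricing model is about record sizing: a 26 KB record costs as many PUT Payload Units as a 50 KB one, so batching small records toward the chunk size lowers the cost per byte.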

Cloud computing - simple example
“a system, which tracks price changes for my desirable products in online stores (which I trust to buy from) and notifies me over the email when price drops.”
http://www.bebetterdeveloper.com/coding/architecture/serverless-system-architecture-using-aws.html

Cloud computing - simple “serverless” example
http://www.bebetterdeveloper.com/coding/architecture/serverless-system-architecture-using-aws.html

Cloud computing - serverless real world example

Guidelines - learn windows of opportunity (streaming)
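-- Tumbling window: count readings per sensor in consecutive, non-overlapping 10-second windows
SELECT sensorid,
       COUNT(*) AS count
FROM sensorreadings TIMESTAMP BY time
GROUP BY sensorid,
         TumblingWindow(second, 10)

-- Hopping window: 10-second windows that start every 5 seconds, so consecutive windows overlap
SELECT sensorid,
       COUNT(*) AS count
FROM sensorreadings TIMESTAMP BY time
GROUP BY sensorid,
         HoppingWindow(second, 10, 5)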

The Evolution of Massive-Scale Data Processing : https://goo.gl/f31iXP
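
The difference between the two window types in the queries above (which appear to use Azure Stream Analytics-style syntax) comes down to how many windows each event lands in. The sketch below is a simplified model with hypothetical helper names: timestamps are integer seconds, windows are aligned to multiples of the hop, and windows are treated as half-open `[start, end)` intervals, which may differ from the boundary convention of a real engine.

```python
from collections import Counter


def tumbling_windows(ts, size):
    """Return the single [start, end) window of length `size` that contains ts."""
    start = (ts // size) * size
    return [(start, start + size)]


def hopping_windows(ts, size, hop):
    """Return every [start, end) window of length `size`, starting at a multiple of `hop`, that contains ts."""
    windows = []
    start = (ts // hop) * hop  # latest window start that is <= ts
    while start > ts - size:   # the window still covers ts
        windows.append((start, start + size))
        start -= hop
    return sorted(windows)


def count_per_window(readings, assign):
    """Count readings per (sensor id, window), like the GROUP BY in the queries."""
    counts = Counter()
    for sensor_id, ts in readings:
        for window in assign(ts):
            counts[(sensor_id, window)] += 1
    return counts


readings = [("a", 1), ("a", 7), ("a", 12), ("b", 12)]  # (sensor id, timestamp in seconds)
tumbling = count_per_window(readings, lambda ts: tumbling_windows(ts, 10))
hopping = count_per_window(readings, lambda ts: hopping_windows(ts, 10, 5))
print(tumbling)
print(hopping)
```

With a 10-second window and a 5-second hop every reading is counted twice, once in each of the two overlapping windows that cover it, while a tumbling window counts each reading exactly once.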

Guidelines - data processing evolution (history)
The Evolution of Massive-Scale Data Processing : https://goo.gl/f31iXP

Guidelines - data processing evolution (unified/continuous)
https://databricks.com/blog/2016/07/28/continuous-applications-evolving-streaming-in-apache-spark-2-0.html

Guidelines (bonus) - know thy theorem (CAP)

Guidelines (bonus) - know thy theorem (PACELC)
