This document provides guidelines for working with big data technologies and systems. It emphasizes the importance of understanding business needs and value over chasing new technologies. It recommends starting simply rather than over-engineering, and learning key concepts like data structures, query languages, processing models and costs before implementing solutions. The document also discusses architectural complexity, operational challenges and evolving approaches to data processing.
7. Why a travel guide?
“... Martin is an excellent map reader even in the most hectic Italian traffic … And after Martin and Cindy left us, we did better because we had learned from what they had showed us … When there’s no guide available, it helps to have someone who understands how to read the maps, tracks, signs, and indications. When we’re on our own, it helps to learn how to do those things ourselves.”
“Software projects are always traveling in areas they don’t know.”
Ron Jeffries (from his foreword for PoAPA book)
11. Why a ‘sloppy’ travel guide - (the ‘n’ V’s of Big Data)
12. Chasing Cool Technologies - Big Data Envy
“We continue to see organizations chasing ‘cool’ technologies, taking on unnecessary complexity and risk when a simpler choice would be better.”
“While we've long understood the value of Big Data to better understand how people interact with us, we've noticed an alarming trend of Big Data envy: organizations using complex tools to handle ‘not-really-that-big’ Data.”
“The Apache Cassandra database promises massive scalability on commodity hardware, but we have seen teams overwhelmed by its architectural and operational complexity. Unless you have data volumes that require a 100+ node cluster, we recommend against using Cassandra.”
https://www.thoughtworks.com/radar/techniques/big-data-envy
13. Big Data Envy - architectural complexity (expectation)
From a ‘10,000-foot view’, big data systems may look like ‘good old n-tier’ systems.
14. Big Data Envy - architectural complexity (example)
A dataflow diagram from a good reference application (but still only a reference application).
Real-life examples are usually more complex!
16. Big Data Envy - architectural complexity (aws example)
Big Data Architectural Patterns and Best Practices on AWS : https://www.youtube.com/watch?v=RNrsIlweCno
17. Big Data Envy - architectural complexity (blueprints)
19. Big Data Envy - operational complexity (devops)
http://www.slideshare.net/jcmia1/apache-spark-20-tuning-guide
● Tuning the JVM, the OS and each (big) data system
● Choosing the right hardware for each ‘right solution’
● Orchestrating / monitoring / debugging many small applications running on and/or interacting with such distributed systems
OOM Troubleshooting example for Apache Spark
20. Know thyself - reaching the cliff of confusion
https://www.vikingcodeschool.com/posts/why-learning-to-code-is-so-damn-hard
21. What is your learning style?
“What’s a better learning strategy: covering a subject in full detail from top-to-bottom, or progressively sharpening a quick overview?”
22. How about an expanding/evolving learning style?
Lifelong learning is the "ongoing, voluntary, and self-motivated" pursuit of knowledge for either personal or professional reasons. Therefore, it not only enhances social inclusion, active citizenship, and personal development, but also self-sustainability, as well as competitiveness and employability.
23. The Unknown Unknowns - the iceberg of ignorance
In his acclaimed study “The Iceberg of Ignorance”, consultant Sidney Yoshida concluded: “Only 4% of an organization’s front line problems are known by top management, 9% are known by middle management, 74% by supervisors and 100% by employees…”
24. Guidelines - the very first principle (business value)
“DDD isn’t first and foremost about technology. In its most central principles, DDD is about discussion, listening, understanding, discovery, and business value, all in an effort to centralize knowledge. If you are capable of understanding the business in which your company works, you can at a minimum participate in the software model discovery process to produce a Ubiquitous Language.”
“Our highest priority is to satisfy the customer through early and continuous delivery of valuable software.”
The very first principle of the Agile Manifesto
27. Guidelines - making simple but not simpler
● “Make things as simple as possible, but not simpler.” (Albert Einstein)
● As simple as possible: no over-engineering; search for the simplest feasible solution
○ feasible ‘ready’ solutions
○ fully managed solutions
○ manageable packaged solutions with support
○ solutions known for stability and manageability
● Not simpler: no under-engineering
○ right task, right tool
○ right usage: design patterns, best practices
29. Guidelines - right task right tool right usage
DynamoDB Design Patterns and Best Practices : https://www.youtube.com/watch?v=PDQ3jbDyTQ4
30. Guidelines - don’t let the API fool you (cassandra)
CQL Under The Hood : https://www.youtube.com/watch?v=CY5-bWpqAVA
32. Guidelines - learn data paths and structures (C*)
Learning the “write path”, the “read path” and the main internal data structures gives critical hints about “do’s and don’ts”, especially anti-patterns:
● Queue-like designs
● Intensive updates
● Deletes
http://www.slideshare.net/doanduyhai/cassandra-nice-use-cases-and-worst-anti-patterns
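Why queue-like designs and frequent deletes hurt can be illustrated with a toy model. The sketch below is not Cassandra code: it assumes a simplified append-only partition in which a delete only writes a tombstone marker, and a read has to skip over deleted cells until a compaction purges them. The names `ToyPartition`, `enqueue`, `dequeue`, `peek` and `compact` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ToyPartition:
    """Toy append-only partition: deletes add tombstones, nothing is removed in place."""
    cells: list = field(default_factory=list)      # (key, value) pairs in insertion order
    tombstones: set = field(default_factory=set)   # keys marked as deleted

    def enqueue(self, key, value):
        self.cells.append((key, value))

    def peek(self):
        """Return the first live cell and how many deleted cells were skipped."""
        skipped = 0
        for key, value in self.cells:
            if key in self.tombstones:
                skipped += 1      # a dead cell still has to be read and skipped
                continue
            return (key, value), skipped
        return None, skipped

    def dequeue(self):
        """Queue-like usage: read the head, then delete it (writes a tombstone)."""
        item, skipped = self.peek()
        if item is not None:
            self.tombstones.add(item[0])
        return item, skipped

    def compact(self):
        """Simulated compaction: purge deleted cells and their tombstones."""
        self.cells = [c for c in self.cells if c[0] not in self.tombstones]
        self.tombstones.clear()


partition = ToyPartition()
for i in range(1000):
    partition.enqueue(i, f"job-{i}")
for _ in range(900):
    partition.dequeue()

item, skipped_before = partition.peek()   # must step over 900 deleted cells
partition.compact()
_, skipped_after = partition.peek()       # nothing left to skip after compaction
print(item, skipped_before, skipped_after)
```

In this toy model the cost of reading the head of the queue grows with the number of deletes performed since the last compaction, which is the intuition behind treating queues and heavy deletes as anti-patterns.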
33. Guidelines - loading data, layouts and file formats (hdfs)
● Data distribution, small files problem
● Row vs. columnar formats
● I/O advantage, read only what you need:
○ Vertical: projection
○ Horizontal: predicate pushdown
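The "read only what you need" idea can be sketched with a toy columnar layout. This is a minimal illustration, not the Parquet or ORC format: it assumes data is split into row groups, each storing one list per column plus min/max statistics. The helper names `write_row_groups` and `scan`, and the sample columns, are hypothetical.

```python
# Sample rows; column names and values are made up for illustration.
ROWS = [
    {"sensor": "a", "day": 1, "temp": 20.5},
    {"sensor": "b", "day": 1, "temp": 21.0},
    {"sensor": "a", "day": 2, "temp": 19.0},
    {"sensor": "b", "day": 2, "temp": 22.5},
    {"sensor": "a", "day": 3, "temp": 18.0},
    {"sensor": "b", "day": 3, "temp": 23.0},
]


def write_row_groups(rows, group_size):
    """Split rows into row groups and pivot each group into columns with min/max stats."""
    groups = []
    for start in range(0, len(rows), group_size):
        chunk = rows[start:start + group_size]
        columns = {name: [r[name] for r in chunk] for name in chunk[0]}
        stats = {name: (min(vals), max(vals)) for name, vals in columns.items()}
        groups.append({"columns": columns, "stats": stats})
    return groups


def scan(groups, wanted_columns, filter_column, low, high):
    """Read only the wanted columns (projection) from row groups whose
    min/max range can contain matches (predicate pushdown)."""
    result, values_read = [], 0
    for group in groups:
        gmin, gmax = group["stats"][filter_column]
        if gmax < low or gmin > high:
            continue  # whole row group skipped using statistics only
        filter_vals = group["columns"][filter_column]
        values_read += len(filter_vals)
        projected = {c: group["columns"][c] for c in wanted_columns}
        values_read += sum(len(v) for c, v in projected.items() if c != filter_column)
        for i, fv in enumerate(filter_vals):
            if low <= fv <= high:
                result.append({c: projected[c][i] for c in wanted_columns})
    return result, values_read


groups = write_row_groups(ROWS, group_size=2)
rows_out, values_read = scan(groups, ["temp"], "day", low=3, high=3)
print(rows_out, values_read)
```

With this sample data the scan touches 4 of the 18 stored values: two row groups are skipped from their statistics alone, and the `sensor` column is never read.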
34. Guidelines - SQL or not (Spark as a Compiler)
https://databricks.com/blog/2016/05/23/apache-spark-as-a-compiler-joining-a-billion-rows-per-second-on-a-laptop.html
35. Guidelines - SQL or not (Beam Combine vs GroupBy)
https://issues.apache.org/jira/browse/BEAM-2477
36. Guidelines - SQL or not (Spark RDD vs Spark DF and SQL)
https://databricks.com/session/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets
39. Guidelines - learning from costs (kinesis)
“Pricing is based on volume of data ingested into Amazon Kinesis Firehose, which is calculated as the number of data records you send to the service, times the size of each record rounded up to the nearest 5KB. For example, if your data records are 42KB each, Amazon Kinesis Firehose will count each record as 45 KB of data ingested.”
“A record is the data that your data producer adds to your Amazon Kinesis Stream. A PUT Payload Unit is counted in 25KB payload “chunks” that comprise a record. For example, a 5KB record contains one PUT Payload Unit, a 45KB record contains two PUT Payload Units, and a 1MB record contains 40 PUT Payload Units. PUT Payload Unit is charged with a per million PUT Payload Units rate.”
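The two rounding rules quoted above can be checked with a few lines of arithmetic. This is a minimal sketch: the function names are hypothetical, sizes are expressed in KB, and 1 MB is treated as 1000 KB so that the result matches the quoted "40 PUT Payload Units" example. No prices are modelled, only the billed volume.

```python
import math

FIREHOSE_INCREMENT_KB = 5   # rounding increment from the quoted Firehose text
STREAMS_PUT_UNIT_KB = 25    # chunk size from the quoted Streams text


def firehose_billed_kb(record_kb):
    """Record size rounded up to the next 5 KB increment."""
    return math.ceil(record_kb / FIREHOSE_INCREMENT_KB) * FIREHOSE_INCREMENT_KB


def streams_put_payload_units(record_kb):
    """Number of 25 KB chunks needed to hold one record."""
    return math.ceil(record_kb / STREAMS_PUT_UNIT_KB)


for size_kb in (1, 5, 42, 45, 1000):
    print(size_kb, firehose_billed_kb(size_kb), streams_put_payload_units(size_kb))
```

One consequence visible in the output: a 1 KB record is counted as 5 KB by the first rule and as a full payload unit by the second, so record size and batching can affect cost as much as total volume does.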
40. Cloud computing - simple example
“a system, which tracks price changes for my desirable products in online stores (which I trust to buy from) and notifies me over the email when price drops.”
http://www.bebetterdeveloper.com/coding/architecture/serverless-system-architecture-using-aws.html
41. Cloud computing - simple “serverless” example
http://www.bebetterdeveloper.com/coding/architecture/serverless-system-architecture-using-aws.html
43. Guidelines - learn windows of opportunity (streaming)
-- Count readings per sensor in fixed, non-overlapping 10-second windows
SELECT sensorid,
       COUNT(*) AS count
FROM sensorreadings TIMESTAMP BY time
GROUP BY sensorid,
         TumblingWindow(second, 10)
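The tumbling-window grouping in the query above can be mimicked in plain Python. This is a minimal sketch under stated assumptions: events are `(sensor_id, timestamp_in_seconds)` tuples, windows are aligned to time 0 and cover `[start, start + size)`. A real streaming engine may treat window boundaries and late data differently. The function name `tumbling_counts` is hypothetical.

```python
from collections import Counter


def tumbling_counts(events, size):
    """Count events per (sensor, window start); each event falls in exactly one window."""
    counts = Counter()
    for sensor_id, ts in events:
        window_start = (ts // size) * size   # align the timestamp to its window
        counts[(sensor_id, window_start)] += 1
    return dict(counts)


events = [("s1", 1), ("s1", 4), ("s1", 12), ("s2", 9), ("s2", 10)]
result = tumbling_counts(events, size=10)
print(result)
```

Because the windows do not overlap, the counts across all windows add up to the number of input events.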
44. Guidelines - learn windows of opportunity (streaming)
-- Count readings per sensor in 10-second windows that start every 5 seconds (overlapping)
SELECT sensorid,
       COUNT(*) AS count
FROM sensorreadings TIMESTAMP BY time
GROUP BY sensorid,
         HoppingWindow(second, 10, 5)
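The hopping-window grouping in the query above can be mimicked in plain Python. This is a minimal sketch under stated assumptions: events are `(sensor_id, timestamp_in_seconds)` tuples, windows start at multiples of the hop from time 0 and cover `[start, start + size)`. A real streaming engine may treat window boundaries differently. The function name `hopping_counts` is hypothetical.

```python
from collections import Counter


def hopping_counts(events, size, hop):
    """Count events per (sensor, window start); windows overlap when hop < size."""
    counts = Counter()
    for sensor_id, ts in events:
        # Latest window start at or before ts, aligned to the hop.
        start = (ts // hop) * hop
        # Walk back through every earlier window that still covers ts.
        while start > ts - size:
            if start >= 0:   # assumption: no windows start before time 0
                counts[(sensor_id, start)] += 1
            start -= hop
    return dict(counts)


events = [("s1", 1), ("s1", 7), ("s1", 12)]
result = hopping_counts(events, size=10, hop=5)
print(result)
```

With a 10-second size and a 5-second hop, most events are counted in two windows, so the counts no longer add up to the number of input events.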
45. Guidelines - learn windows of opportunity (streaming)
The Evolution of Massive-Scale Data Processing : https://goo.gl/f31iXP
46. Guidelines - data processing evolution (history)
The Evolution of Massive-Scale Data Processing : https://goo.gl/f31iXP
47. Guidelines - data processing evolution (unified/continuous)
https://databricks.com/blog/2016/07/28/continuous-applications-evolving-streaming-in-apache-spark-2-0.html