Voxxed Athens 2018 - Methods and Practices for Guaranteed Failure in Big Data

The document provides guidance on practices that can lead to failure in big data systems. It warns against assumptions that schemas are unnecessary, that databases can scale reads and writes infinitely, and that network connections and hardware will always be available. Instead, it recommends defining schemas and metadata, understanding database models, preparing for failures through testing, and managing resources and data pipelines. Proper data governance, partitioning, replication, and understanding of consistency and transaction models can help avoid failures.

Methods and Practices for guaranteed failure in Big Data
PANTELIS NASIKAS

Delusions and Pitfalls
● Who needs schemas after all?
● My SuperNoYesSQLDB scales on reads AND writes, like... forever!
● The network and all peers will always be there!
● Resource management, not really a concern for my script
● ETL/Jobs management, BYO-DAG tool

Who needs schemas after all?
● Always define schemas and a versioning process, irrespective of the serialization format
● Schema registries are indispensable
● Metadata management: respect the end users and engineers
● A means to implement data governance
● Builds trust across teams
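
The schema and versioning bullets above can be illustrated with a minimal sketch. It assumes a toy in-memory registry and a deliberately simplified backward-compatibility rule (a new field needs a default, an existing field keeps its type). `SchemaRegistry` and `is_backward_compatible` are hypothetical names, not the API of any real schema registry.

```python
from dataclasses import dataclass, field


def is_backward_compatible(old: dict, new: dict) -> bool:
    """Simplified rule (an assumption): readers using `new` can still read `old` data."""
    for name, spec in new.items():
        if name not in old:
            # A field unknown to old writers must carry a default.
            if "default" not in spec:
                return False
        elif old[name]["type"] != spec["type"]:
            # An existing field must not change its type.
            return False
    return True


@dataclass
class SchemaRegistry:
    """Toy registry (hypothetical): subject name -> ordered list of schema versions."""
    subjects: dict = field(default_factory=dict)

    def register(self, subject: str, schema: dict) -> int:
        versions = self.subjects.setdefault(subject, [])
        if versions and not is_backward_compatible(versions[-1], schema):
            raise ValueError(f"incompatible schema for subject {subject!r}")
        versions.append(schema)
        return len(versions)  # 1-based version number


registry = SchemaRegistry()
v1 = registry.register("user", {"id": {"type": "long"}})
v2 = registry.register(
    "user",
    {"id": {"type": "long"}, "email": {"type": "string", "default": None}},
)
```

Registering a third version that changes `id` to a string would raise `ValueError`, which is the point: the incompatible change is rejected at registration time instead of surfacing in a downstream consumer.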

My SuperNoYesSQLDB scales on reads AND writes
● Understand the model that the database was built for
● Not all APIs are created equal
● Partitioning and replication are key design elements; keep failures, extensions, and rebalancing in mind
● Data consistency on high-speed writes... spooky; fsync (when, really?)
● Transactions to keep data and denormalized views in sync? Consider alternative options
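
To make the rebalancing concern concrete, here is a small sketch comparing two partitioning schemes when a fifth node joins a four-node cluster. The node names, key names, and the `Ring` class are illustrative assumptions. Real databases use their own partitioners, so this shows the general idea and not any specific product's behavior.

```python
import bisect
import hashlib


def stable_hash(value: str) -> int:
    """Deterministic 64-bit hash (Python's built-in hash() is salted per process)."""
    return int.from_bytes(hashlib.md5(value.encode()).digest()[:8], "big")


def mod_owner(key: str, nodes: list) -> str:
    """Naive partitioning: hash modulo the number of nodes."""
    return nodes[stable_hash(key) % len(nodes)]


class Ring:
    """Consistent-hash ring with virtual nodes (hypothetical helper)."""

    def __init__(self, nodes, vnodes=100):
        self._points = sorted(
            (stable_hash(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._hashes = [point for point, _ in self._points]

    def owner(self, key: str) -> str:
        # First ring point clockwise from the key's hash, wrapping around.
        idx = bisect.bisect(self._hashes, stable_hash(key)) % len(self._points)
        return self._points[idx][1]


def moved_fraction(owner_before, owner_after, keys) -> float:
    moved = sum(1 for key in keys if owner_before(key) != owner_after(key))
    return moved / len(keys)


keys = [f"user-{i}" for i in range(10_000)]
old_nodes = ["n1", "n2", "n3", "n4"]
new_nodes = old_nodes + ["n5"]

mod_moved = moved_fraction(
    lambda k: mod_owner(k, old_nodes), lambda k: mod_owner(k, new_nodes), keys
)
old_ring, new_ring = Ring(old_nodes), Ring(new_nodes)
ring_moved = moved_fraction(old_ring.owner, new_ring.owner, keys)

print(f"modulo: {mod_moved:.0%} of keys moved, ring: {ring_moved:.0%} of keys moved")
```

With uniform hashes, modulo partitioning is expected to relocate about four in five keys when going from four to five nodes, while the ring is expected to relocate about one in five, and only onto the new node. That difference is the data a cluster has to copy while it is also serving traffic.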

The network and all peers will always be there!
● "The Network is Reliable", P. Bailis & K. Kingsbury
● Hardware IS NOT (always) reliable
● Prepare for failure
● Test systems under failing hardware/software
● Learn your APIs (e.g., what happens when partitions fail or move?)
● Where is your place in CAP? Are you CP or CA?
● What is acceptable for your case?
● How does my database recover after failure?
● Can this introduce new problems?
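
One way to act on "test systems under failing hardware/software" is to inject faults in a test double and check how the calling code behaves. The sketch below assumes a hypothetical in-memory `FlakyStore` that fails its first few writes and a hypothetical `put_with_retry` helper with exponential backoff. Neither is taken from a real client library.

```python
import time


class FlakyStore:
    """Hypothetical in-memory store whose first `fail_first` writes raise."""

    def __init__(self, fail_first: int):
        self._data = {}
        self._remaining_failures = fail_first
        self.calls = 0

    def put(self, key, value):
        self.calls += 1
        if self._remaining_failures > 0:
            self._remaining_failures -= 1
            raise ConnectionError("injected fault")
        self._data[key] = value

    def get(self, key):
        return self._data[key]


def put_with_retry(store, key, value, retries=3, base_delay=0.01, sleep=time.sleep):
    """Retry a write with exponential backoff; returns the number of attempts used."""
    for attempt in range(retries + 1):
        try:
            store.put(key, value)
            return attempt + 1
        except ConnectionError:
            if attempt == retries:
                raise  # give up and surface the failure to the caller
            sleep(base_delay * 2 ** attempt)


store = FlakyStore(fail_first=2)
attempts = put_with_retry(store, "order-1", "paid", sleep=lambda seconds: None)
print(attempts, store.get("order-1"))  # prints: 3 paid
```

Blind retries are only safe here because `put` is idempotent. For writes that are not, the questions on the slide (what the API does on partial failure, how the database recovers) decide whether a retry duplicates data.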

ETL/Jobs management, BYO-DAG tool
● Data pipelines and lineage
● Who generates what, at what time
● Failure! Who is next? What does rescheduling mean to dependents?
● How can I really find dependents?
● Let me build that API! It’s just software after all.
● Build testable pipelines without any need for production data...
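
The "who is next?" and "how can I really find dependents?" questions can be sketched with an explicit dependency graph. The job names below are invented for illustration, and the code assumes Python 3.9+ for the standard-library `graphlib` module. A real scheduler would also track run times and state.

```python
from collections import defaultdict, deque
from graphlib import TopologicalSorter

# Hypothetical pipeline: job -> set of upstream jobs it reads from.
UPSTREAMS = {
    "raw_events": set(),
    "sessions": {"raw_events"},
    "daily_revenue": {"sessions"},
    "user_profile": {"raw_events"},
    "dashboard": {"daily_revenue", "user_profile"},
}


def downstream_of(failed: str, upstreams: dict) -> set:
    """All jobs that directly or transitively depend on `failed`."""
    children = defaultdict(set)
    for job, parents in upstreams.items():
        for parent in parents:
            children[parent].add(job)
    seen, queue = set(), deque([failed])
    while queue:
        for child in children[queue.popleft()]:
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


def rerun_order(failed: str, upstreams: dict) -> list:
    """A valid order in which to rerun the failed job and everything downstream."""
    affected = downstream_of(failed, upstreams) | {failed}
    # Keep only edges inside the affected subgraph.
    subgraph = {job: upstreams[job] & affected for job in affected}
    return list(TopologicalSorter(subgraph).static_order())


print(sorted(downstream_of("sessions", UPSTREAMS)))  # ['daily_revenue', 'dashboard']
print(rerun_order("sessions", UPSTREAMS))  # ['sessions', 'daily_revenue', 'dashboard']
```

Because the graph is plain data, it can be exercised in unit tests with small synthetic inputs, which is one route to pipelines that are testable without production data.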

Resource management, not really a concern
● Every job / task / stream / long-running service should be constrained and ideally isolated
● Database / filesystem access from third parties as well
● Understand your jobs’ requirements (communication cost model, partitioning and shuffling effect on CPU/memory/network)
● You don’t want to preempt your multi-terabyte batch job just before the end
● Orchestrate small, well-defined tasks
● Do not assume large resource allocations
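
The preemption and "small, well-defined tasks" bullets suggest splitting a long job into chunks and recording progress after each one. The following is a minimal sketch under that assumption. `run_in_chunks`, its checkpoint format, and the `max_chunks` parameter (used here only to simulate preemption) are all hypothetical.

```python
import json
import tempfile
from pathlib import Path


def run_in_chunks(items, chunk_size, process, checkpoint: Path, max_chunks=None):
    """Process `items` chunk by chunk, persisting progress after each chunk.

    `max_chunks` stops early to simulate the job being preempted.
    """
    if checkpoint.exists():
        state = json.loads(checkpoint.read_text())
    else:
        state = {"next": 0, "total": 0}
    chunks_done = 0
    while state["next"] < len(items):
        if max_chunks is not None and chunks_done >= max_chunks:
            break  # simulated preemption
        chunk = items[state["next"]: state["next"] + chunk_size]
        state["total"] += process(chunk)
        state["next"] += len(chunk)
        checkpoint.write_text(json.dumps(state))
        chunks_done += 1
    return state


with tempfile.TemporaryDirectory() as tmp:
    checkpoint_file = Path(tmp) / "job.checkpoint.json"
    data = list(range(10))
    # First run is "preempted" after two chunks of three items.
    first = run_in_chunks(data, 3, sum, checkpoint_file, max_chunks=2)
    # Second run resumes from the checkpoint instead of starting over.
    second = run_in_chunks(data, 3, sum, checkpoint_file)

print(first, second)
# prints: {'next': 6, 'total': 15} {'next': 10, 'total': 45}
```

A production version would need an atomic checkpoint write (for example, write to a temporary file and rename it) and a `process` step that is safe to repeat, since a crash between processing and checkpointing replays one chunk.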

Thank you,
Pantelis Nasikas
pantelis.nasikas@agileactors.com
