This talk takes you on a rollercoaster ride through Hadoop 2 and explains the most significant changes and components.
The talk was given at the JavaLand conference in Brühl, Germany, on 25.03.2014.
Agenda:
- Welcome Office
- YARN Land
- HDFS 2 Land
- YARN App Land
- Enterprise Land
Hadoop meets Agile! - An Agile Big Data Model, by Uwe Printz
Big Data projects are a struggle, not only on the technical side but also on the organizational side. In this talk the author shares his experience and opinions from almost 5 years of Big Data projects and develops an Agile Big Data Model which reflects his ideas on how Big Data projects can be successful, even in large companies.
Talk held at the crossover meetup of the "Agile Stammtisch Rhein-Main" and the "Hadoop & Spark User Group Rhein-Main" at codecentric AG on 31.01.2017.
Talk held at a combined meeting of the Web Performance Karlsruhe (http://www.meetup.com/Karlsruhe-Web-Performance-Group/events/153207062) & Big Data Karlsruhe/Stuttgart (http://www.meetup.com/Big-Data-User-Group-Karlsruhe-Stuttgart/events/162836152) user groups.
Agenda:
- Why Hadoop 2?
- HDFS 2
- YARN
- YARN Apps
- Write your own YARN App
- Tez, Hive & Stinger Initiative
Hadoop Operations - Best practices from the field, by Uwe Printz
Talk about Hadoop operations and best practices for building and maintaining Hadoop clusters.
The talk was given at the data2day conference in Karlsruhe, Germany, on 27.11.2014.
This talk gives an introduction to Hadoop 2 and YARN. Then the changes for MapReduce 2 are explained. Finally, Tez and Spark are explained and compared in detail.
The talk was given at the Parallel 2014 conference in Karlsruhe, Germany, on 06.05.2014.
Agenda:
- Introduction to Hadoop 2
- MapReduce 2
- Tez, Hive & Stinger Initiative
- Spark
With Hadoop-3.0.0-alpha2 being released in January 2017, it's time to have a closer look at the features and fixes of Hadoop 3.0.
We will look at Core Hadoop, HDFS and YARN, and answer the emerging question: will Hadoop 3.0 be an architectural revolution, as Hadoop 2 was with YARN & Co., or more of an evolution that adapts to new use cases like IoT, Machine Learning and Deep Learning (TensorFlow)?
Updated version of my talk about Hadoop 3.0 with the newest community updates.
Talk given at the codecentric Meetup Berlin on 31.08.2017 and at the Data2Day Meetup in Heidelberg on 28.09.2017.
Hadoop 3.0 has been years in the making, and now it's finally arriving. Andrew Wang and Daniel Templeton offer an overview of new features, including HDFS erasure coding, YARN Timeline Service v2, YARN federation, and much more, and discuss current release management status and community testing efforts dedicated to making Hadoop 3.0 the best Hadoop major release yet.
Hoodie (Hadoop Upsert Delete and Incremental) is an analytical, scan-optimized data storage abstraction that enables applying mutations to data in HDFS within a few minutes, as well as chaining incremental processing in Hadoop.
AWS Summit Sydney 2014 | Secure Hadoop as a Service - Session Sponsored by Intel, by Amazon Web Services
Intel is contributing to a common security framework for Apache Hadoop, in the form of Project Rhino, which enables Hadoop to run workloads without compromising performance or security. Join this session to learn how your enterprise can take advantage of the security capabilities in the Intel Data Platform running on AWS to analyze data while ensuring technical safeguards that help you remain in compliance.
From docker to kubernetes: running Apache Hadoop in a cloud native way, by DataWorks Summit
Creating containers for an application is easy (even if it’s a good old distributed application like Apache Hadoop): just a few steps of packaging.
The hard part isn't packaging: it's deploying.
How can we run the containers together? How do we configure them? How do the services in the containers find and talk to each other? How do you deploy and manage clusters with hundreds of nodes?
Modern cloud native tools like Kubernetes or Consul/Nomad can help a lot, but they can be used in different ways.
In this presentation I will demonstrate multiple solutions for managing containerized clusters with different cloud-native tools, including kubernetes and docker-swarm/compose.
No matter which tools you use, the same questions of service discovery and configuration management arise. This talk will show the key elements needed to make that containerized cluster work.
Tools:
kubernetes, docker-swarm, docker-compose, consul, consul-template, nomad
together with: Hadoop, Yarn, Spark, Kafka, Zookeeper, Storm….
References:
https://github.com/flokkr
Speaker
Marton Elek, Lead Software Engineer, Hortonworks
Apache Hadoop 3 updates with migration story, by Sunil Govindan
Apache Hadoop 3 Insights & Migrating your clusters from Hadoop 2 to Hadoop 3, presented by Sunil Govindan and Rohith Sharma K S
at the Bangalore Hadoop Meetup on 28th July 2018.
LinkedIn leverages the Apache Hadoop ecosystem for its big data analytics. Steady growth of the member base at LinkedIn along with their social activities results in exponential growth of the analytics infrastructure. Innovations in analytics tooling lead to heavier workloads on the clusters, which generate more data, which in turn encourage innovations in tooling and more workloads. Thus, the infrastructure remains under constant growth pressure. Heterogeneous environments embodied via a variety of hardware and diverse workloads make the task even more challenging.
This talk will tell the story of how we doubled our Hadoop infrastructure twice in the past two years.
• We will outline our main use cases and historical rates of cluster growth in multiple dimensions.
• We will focus on optimizations, configuration improvements, performance monitoring and architectural decisions we undertook to allow the infrastructure to keep pace with business needs.
• The topics include improvements in HDFS NameNode performance, and fine tuning of block report processing, the block balancer, and the namespace checkpointer.
• We will reveal a study on the optimal storage device for HDFS persistent journals (SATA vs. SAS vs. SSD vs. RAID).
• We will also describe the Satellite Cluster project, which allowed us to double the objects stored on one logical cluster by splitting an HDFS cluster into two partitions without the use of federation and with practically no code changes.
• Finally, we will take a peek at our future goals, requirements, and growth perspectives.
SPEAKERS
Konstantin Shvachko, Sr Staff Software Engineer, LinkedIn
Erik Krogen, Senior Software Engineer, LinkedIn
Introduction to Hadoop Ecosystem was presented to Lansing Java User Group on 2/17/2015 by Vijay Mandava and Lan Jiang. The demo was built on top of HDP 2.2 and AWS cloud.
With the advent of Hadoop comes a need for professionals skilled in Hadoop administration, so becoming a skilled Hadoop admin can mean better career, salary and job opportunities.
A comprehensive overview of the security concepts in the open source Hadoop stack in mid 2015 with a look back into the "old days" and an outlook into future developments.
Kafka at Scale: Multi-Tier Architectures, by Todd Palino
This is a talk given at ApacheCon 2015.
If data is the lifeblood of high technology, Apache Kafka is the circulatory system in use at LinkedIn. It is used for moving every type of data around between systems, and it touches virtually every server, every day. This can only be accomplished with multiple Kafka clusters, installed at several sites, and they must all work together to assure no message loss, and almost no message duplication. In this presentation, we will discuss the architectural choices behind how the clusters are deployed, and the tools and processes that have been developed to manage them. Todd Palino will also discuss some of the challenges of running Kafka at this scale, and how they are being addressed both operationally and in the Kafka development community.
Note - each slide has a significant amount of slide notes that go into detail. Please make sure to check out the downloaded file to get the full content!
When it comes time to select database software for your project, there are a bewildering number of choices. How do you know if your project is a good fit for a relational database, or whether one of the many NoSQL options is a better choice?
In this webinar you will learn when to use MongoDB and how to evaluate if MongoDB is a fit for your project. You will see how MongoDB's flexible document model is solving business problems in ways that were not previously possible, and how MongoDB's built-in features allow running at scale.
Topics covered include:
Performance and Scalability
MongoDB's Data Model
Popular MongoDB Use Cases
Customer Stories
Deploying deep learning models with Docker and Kubernetes, by PetteriTeikariPhD
Short introduction to platform-agnostic production deployment, with some medical examples.
Alternative download: https://www.dropbox.com/s/qlml5k5h113trat/deep_cloudArchitecture.pdf?dl=0
The current major release, Hadoop 2.0, offers several significant HDFS improvements, including a new append pipeline, federation, wire compatibility, NameNode HA, snapshots, and performance improvements. We describe how to take advantage of these new features and their benefits. We cover some architectural improvements in detail, such as HA, federation and snapshots. The second half of the talk describes the features currently under development for the next HDFS release. This includes much-needed data management features such as backup and disaster recovery. We add support for different classes of storage devices such as SSDs and for open interfaces such as NFS; together these extend HDFS into a more general storage system. Hadoop has recently been extended to run first-class on Windows, which expands its enterprise reach and allows integration with the rich tool set available on Windows. As with every release, we will continue improving the performance, diagnosability and manageability of HDFS. To conclude, we discuss the reliability and the state of HDFS adoption, and some of the misconceptions and myths about HDFS.
Apache Hadoop YARN: Understanding the Data Operating System of Hadoop, by Hortonworks
This deck covers concepts and motivations behind Apache Hadoop YARN, the key technology in Hadoop 2 to deliver a Data Operating System for the enterprise.
YARN - Hadoop Next Generation Compute Platform, by Bikas Saha
The presentation emphasizes the new mental model of YARN as the cluster OS, where one can write and run different applications in Hadoop in a cooperative multi-tenant cluster.
We provide Hadoop training in Hyderabad and Bangalore, including corporate training by faculty with 12+ years of experience.
Real-time industry experts from MNCs
Resume Preparation by expert Professionals
Lab exercises
Interview Preparation
Experts advice
Big Data Everywhere Chicago: Getting Real with the MapR Platform (MapR), by BigDataEverywhere
Jim Scott, Director of Enterprise Strategy, MapR; Cofounder, CHUG
In this talk, we will take a look back at the short history of Hadoop, along with the trials and tribulations that have come with this ground-breaking technology. We will explore the reasons why enterprises need to look deeper into their wants and needs, and further into the future, to prepare for where they are going.
SCALE 12x: Efficient Multi-tenant Hadoop 2 Workloads with YARN, by David Kaiser
Hadoop is about so much more than batch processing. With the recent release of Hadoop 2, there have been significant changes to how a Hadoop cluster uses resources. YARN, the new resource management component, allows for a more efficient mix of workloads across hardware resources, and enables new applications and new processing paradigms such as stream-processing. This talk will discuss the new design and components of Hadoop 2, and examples of Modern Data Architectures that leverage Hadoop for maximum business efficiency.
Hortonworks Get Started Building YARN Applications Dec. 2013. We cover YARN basics, benefits, getting started and roadmap. Actian shares their experience and recommendations on building their real-world YARN application.
Technical introduction to Apache Spark - the Swiss Army Knife of Big Data analytics tools.
The talk was given at the Big Data User Group Mannheim, Germany, on 24.11.2014.
MongoDB for Java Programmers (JUGKA, 11.12.13), by Uwe Printz
The talk was given on 11.12.2013 at the Java User Group Karlsruhe and provides an overview of and introduction to MongoDB from the perspective of a Java programmer.
The following topics are covered:
- Buzzword Bingo: NoSQL, Big Data, horizontal scaling, CAP theorem, eventual consistency
- Overview of MongoDB
- Data manipulation: CRUD, Aggregation Framework, Map/Reduce
- Indexing
- Consistency when writing and reading data
- Java API & Frameworks
Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition), by Uwe Printz
Talk held at the IT-Stammtisch Darmstadt on 08.11.2013
Agenda:
- What is Big Data & Hadoop?
- Core Hadoop
- The Hadoop Ecosystem
- Use Cases
- What’s next? Hadoop 2.0!
MongoDB for Coder Training (Coding Serbia 2013), by Uwe Printz
Slides of my MongoDB Training given at Coding Serbia Conference on 18.10.2013
Agenda:
1. Introduction to NoSQL & MongoDB
2. Data manipulation: Learn how to CRUD with MongoDB
3. Indexing: Speed up your queries with MongoDB
4. MapReduce: Data aggregation with MongoDB
5. Aggregation Framework: Data aggregation done the MongoDB way
6. Replication: High Availability with MongoDB
7. Sharding: Scaling with MongoDB
The talk was given on 25.09.2013 at the Java User Group Frankfurt and provides an overview of and introduction to MongoDB from the perspective of a Java programmer.
The following topics are covered:
- Buzzword Bingo: NoSQL, Big Data, horizontal scaling, CAP theorem, eventual consistency
- Overview of MongoDB
- Data manipulation: CRUD, Aggregation Framework, Map/Reduce
- Indexing
- Consistency when writing and reading data
- Java API & Frameworks
Introduction to the Hadoop Ecosystem with Hadoop 2.0 aka YARN (Java Serbia Ed... by Uwe Printz
Talk held at the Java User Group on 05.09.2013 in Novi Sad, Serbia
Agenda:
- What is Big Data & Hadoop?
- Core Hadoop
- The Hadoop Ecosystem
- Use Cases
- What’s next? Hadoop 2.0!
Talk held at the FrOSCon 2013 on 24.08.2013 in Sankt Augustin, Germany
Agenda:
- Why Twitter Storm?
- What is Twitter Storm?
- What to do with Twitter Storm?
Introduction to the Hadoop Ecosystem (FrOSCon Edition), by Uwe Printz
Talk held at the FrOSCon 2013 on 24.08.2013 in Sankt Augustin, Germany
Agenda:
- What is Big Data & Hadoop?
- Core Hadoop
- The Hadoop Ecosystem
- Use Cases
- What’s next? Hadoop 2.0!
Map/Confused? A practical approach to Map/Reduce with MongoDB, by Uwe Printz
Talk given at MongoDB Munich on 16.10.2012 about the different approaches in MongoDB for using the Map/Reduce algorithm. The talk compares the performance of built-in MongoDB Map/Reduce, group(), aggregate(), find() and the MongoDB-Hadoop Adapter using a practical use case.
6. In the beginning of Hadoop … there was MapReduce
• It could handle data sizes way beyond those of its competitors
• It was resilient in the face of failure
• It made it easy for users to bring their code and algorithms to the data
9. 25.03.2014
2
…but it was Batch
Hadoop 1 (2007)
[Diagram: several stacks, each a Single App doing Batch processing on top of HDFS]
10. 25.03.2014
2
…but it had limitations
Hadoop 1 (2007)
• Scalability
– Maximum cluster size ~ 4,500 nodes
– Maximum concurrent tasks – 40,000
– Coarse synchronization in JobTracker
• Availability
– Failure kills all queued and running jobs
• Hard partition of resources into map & reduce slots
– Low resource utilization
• Lacks support for alternate paradigms and services
13. 25.03.2014
2
A brief history of Hadoop 2
• Originally conceived & architected by the
team at Yahoo!
– Arun Murthy created the original JIRA in 2008 and now is
the Hadoop 2 release manager
• The community has been working on
Hadoop 2 for over 4 years
• Hadoop 2 based architecture running at
scale at Yahoo!
– Deployed on 35,000+ nodes for 6+ months
14. 25.03.2014
2
Hadoop 2: Next-gen platform
• Hadoop 1: Single use system (Batch Apps)
– MapReduce: Cluster resource mgmt. + data processing
– HDFS: Redundant, reliable storage
• Hadoop 2: Multi-purpose platform (Batch, Interactive, Streaming, …)
– MapReduce: Data processing
– Others: Data processing
– YARN: Cluster resource management
– HDFS 2: Redundant, reliable storage
15. 25.03.2014
2
Taking Hadoop beyond batch
• Applications run natively in Hadoop
– Batch: MapReduce
– Interactive: Tez
– Online: HOYA
– Streaming: Storm, …
– Graph: Giraph
– In-Memory: Spark
– Other: Search, …
• YARN: Cluster resource management
• HDFS 2: Redundant, reliable storage
• Store all data in one place
• Interact with data in multiple ways
16. 25.03.2014
2
YARN: Design Goals
• Build a new abstraction layer by splitting up
the two major functions of the JobTracker
• Cluster resource management
• Application life-cycle management
• Allow other processing paradigms
• Flexible API for implementing YARN apps
• MapReduce becomes YARN app
• Lots of different YARN apps
23. 25.03.2014
2
HDFS 2: In a nutshell
• Removes tight coupling of Block
Storage and Namespace
• Adds (built-in) High Availability
• Better Scalability & Isolation
• Increased performance
Details: https://issues.apache.org/jira/browse/HDFS-1052
24. 25.03.2014
2
HDFS 2: Federation
• NameNodes do not talk to each other
• Each NameNode manages only a slice of the namespace
• DataNodes can store blocks managed by any NameNode
• Block Storage as generic storage service
• Horizontally scale IO and storage
[Diagram: NameNode 1 (Namespace 1) and NameNode 2 (Namespace 2), each with Namespace State and Block Map; Block Pools (Pool 1, Pool 2) spread over Data Nodes with JBOD disks holding blocks b1 to b5]
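The federation idea above can be sketched as a toy model. This is not HDFS code: the `NameNode` and `DataNode` classes, the pool IDs and the `mount_table` (in the spirit of a client-side mount table) are assumptions made purely for illustration.

```python
class NameNode:
    """Owns one slice of the namespace and its own block pool ID."""

    def __init__(self, name, pool_id):
        self.name = name
        self.pool_id = pool_id
        self.files = {}  # path -> list of block IDs


class DataNode:
    """Stores blocks for any block pool, keyed by (pool_id, block_id)."""

    def __init__(self):
        self.blocks = set()

    def store(self, pool_id, block_id):
        self.blocks.add((pool_id, block_id))


# Hypothetical client-side mount table: path prefix -> responsible NameNode.
nn1, nn2 = NameNode("NameNode 1", "pool-1"), NameNode("NameNode 2", "pool-2")
mount_table = {"/user": nn1, "/logs": nn2}


def resolve(path):
    """Pick the NameNode whose prefix matches; NameNodes never consult each other."""
    for prefix, namenode in mount_table.items():
        if path == prefix or path.startswith(prefix + "/"):
            return namenode
    raise KeyError(f"no NameNode mounted for {path}")


shared_datanode = DataNode()
for path, block in [("/user/alice/a.txt", "b1"), ("/logs/2014/03/25", "b1")]:
    nn = resolve(path)
    nn.files.setdefault(path, []).append(block)
    shared_datanode.store(nn.pool_id, block)  # same DataNode serves both pools
```

Because blocks are keyed by pool, the same block ID can exist once per namespace on one DataNode without conflict.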
25. 25.03.2014
2
HDFS 2: Architecture
• Active NameNode: maintains Block Map and Edits File
• Standby NameNode: simultaneously reads and applies the edits
• DataNodes report to both NameNodes, but take orders only from the Active
• Shared state on NFS or on quorum based storage (three JournalNodes)
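The quorum based option can be illustrated with a minimal sketch. It only models the majority rule; the `JournalNode` class and `log_edit` function are hypothetical names, not the Hadoop implementation.

```python
class JournalNode:
    def __init__(self, up=True):
        self.up = up
        self.edits = []

    def append(self, txid, edit):
        """Persist one edit; return True as acknowledgement, False if unreachable."""
        if not self.up:
            return False
        self.edits.append((txid, edit))
        return True


def log_edit(journal_nodes, txid, edit):
    """An edit counts as durable once a majority of JournalNodes acknowledged it."""
    acks = sum(jn.append(txid, edit) for jn in journal_nodes)
    return acks > len(journal_nodes) // 2


journals = [JournalNode(), JournalNode(), JournalNode(up=False)]
print(log_edit(journals, 1, "mkdir /weblogs"))  # 2 of 3 acks: majority reached
journals[1].up = False
print(log_edit(journals, 2, "delete /tmp/x"))   # 1 of 3 acks: no majority
```

With three JournalNodes the shared edit log survives the loss of one node, which is why an odd number of nodes is the usual choice.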
26. 25.03.2014
2
HDFS 2: High Availability
• Active NameNode and Standby NameNode, each with Block Map and Edits File
• DataNodes send Heartbeats & Block Reports to both NameNodes
• Shared State on three JournalNodes
• One ZKFailoverController per NameNode
– Monitors health of NN, OS, HW
– Sends Heartbeat to ZooKeeper
– Holds special lock znode
• ZooKeeper ensemble of three nodes
27. 25.03.2014
2
HDFS 2: Write-Pipeline
• Earlier versions of HDFS
• Files were immutable
• Write-once-read-many model
• New features in HDFS 2
• Files can be reopened for append
• New primitives: hflush and hsync
• Replace data node on failure
• Read consistency
[Diagram: a Writer sends Data through a pipeline of DataNodes 1 to 4, with the label "Add new node to the pipeline"; a Reader can read from any node and then failover to any other node]
28. 25.03.2014
2
HDFS 2: Snapshots
• Admin can create point in time snapshots of HDFS
• Of the entire file system
• Of a specific data-set (sub-tree directory of file system)
• Restore state of entire file system or data-set to a
snapshot (like Apple Time Machine)
• Protect against user errors
• Snapshot diffs identify changes made to data set
• Keep track of how raw or derived/analytical data changes
over time
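How snapshot diffs and restores relate can be shown with a small in-memory model. This is only a sketch: a real snapshot is not a full copy of the data, and the `snapshot` / `snapshot_diff` helpers and the paths are invented for illustration.

```python
def snapshot(tree):
    """Freeze the current state: here simply a copy of path -> content version."""
    return dict(tree)


def snapshot_diff(old, new):
    """Report created (+), deleted (-) and modified (M) paths between two snapshots."""
    report = []
    for path in sorted(old.keys() | new.keys()):
        if path not in old:
            report.append(("+", path))
        elif path not in new:
            report.append(("-", path))
        elif old[path] != new[path]:
            report.append(("M", path))
    return report


dataset = {"/data/raw/a.csv": 1, "/data/raw/b.csv": 1}
s1 = snapshot(dataset)
dataset["/data/raw/b.csv"] = 2          # file modified
del dataset["/data/raw/a.csv"]          # user error: file deleted
dataset["/data/derived/sum.csv"] = 1    # new derived data
s2 = snapshot(dataset)

print(snapshot_diff(s1, s2))
dataset = dict(s1)                      # restore the data set from the snapshot
```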
29. 25.03.2014
2
HDFS 2: NFS Gateway
• Supports NFS v3 (NFS v4 is work in progress)
• Supports all HDFS commands
• List files
• Copy, move files
• Create and delete directories
• Ingest for large scale analytical workloads
• Load immutable files as source for analytical processing
• No random writes
• Stream files into HDFS
• Log ingest by applications writing directly to HDFS client
mount
30. 25.03.2014
2
HDFS 2: Performance
• Many improvements
• New appendable write pipeline
• Read path improvements for fewer memory copies
• Short-circuit local reads for 2-3x faster random
reads
• I/O improvements using posix_fadvise()
• libhdfs improvements for zero copy reads
• Significant improvements: I/O 2.5 - 5x faster
34. 25.03.2014
2
MapReduce 2: In a nutshell
• MapReduce is now a YARN app
• No more map and reduce slots, it’s containers now
• No more JobTracker, it’s YarnAppmaster library now
• Multiple versions of MapReduce
• The older mapred APIs work without modification or recompilation
• The newer mapreduce APIs may need to be recompiled
• Still has one master server component: the Job History Server
• The Job History Server stores the execution of jobs
• Used to audit prior execution of jobs
• Will also be used by YARN framework to store charge backs at that level
• Better cluster utilization
• Increased scalability & availability
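Why replacing fixed slots with containers improves utilization can be seen in a toy calculation. The sketch below is not Hadoop code; the slot counts and the task mix are made-up numbers, and both helper functions are hypothetical.

```python
def slot_utilization(map_slots, reduce_slots, map_tasks, reduce_tasks):
    """Fraction of slots in use when map and reduce slots are not interchangeable."""
    busy = min(map_slots, map_tasks) + min(reduce_slots, reduce_tasks)
    return busy / (map_slots + reduce_slots)


def container_utilization(containers, map_tasks, reduce_tasks):
    """Fraction of capacity in use when any free container can run any task."""
    busy = min(containers, map_tasks + reduce_tasks)
    return busy / containers


# Hypothetical node: 8 map slots + 4 reduce slots, job currently in its map phase.
print(slot_utilization(8, 4, map_tasks=20, reduce_tasks=0))      # 8 of 12 slots busy
print(container_utilization(12, map_tasks=20, reduce_tasks=0))   # all 12 containers busy
```

In the slot model the four reduce slots sit idle during the map phase, while the same capacity expressed as generic containers is fully used.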
35. 25.03.2014
2
MapReduce 2: Shuffle
• Faster Shuffle
• Better embedded server: Netty
• Encrypted Shuffle
• Secure the shuffle phase as data moves across the cluster
• Requires 2 way HTTPS, certificates on both sides
• Causes significant CPU overhead, reserve 1 core for this work
• Certificates stored on each node (provision with the cluster), refreshed every
10 secs
• Pluggable Shuffle Sort
• Shuffle is the first phase in MapReduce that is guaranteed to not be data-
local
• Pluggable Shuffle/Sort allows application developers or hardware
developers to intercept the network-heavy workload and optimize it
• Typical implementations have hardware components like fast networks and
software components like sorting algorithms
• API will change with future versions of Hadoop
36. 25.03.2014
2
MapReduce 2: Performance
• Key Optimizations
• No hard segmentation of resource into map and reduce slots
• YARN scheduler is more efficient
• MR2 framework has become more efficient than MR1: shuffle
phase in MRv2 is more performant with the usage of Netty.
• 40,000+ nodes running YARN across over 365 PB of data.
• About 400,000 jobs per day for about 10 million hours of
compute time.
• Estimated 60% – 150% improvement on node usage per day
• Got rid of a whole 10,000 node datacenter because of their
increased utilization.
38. 25.03.2014
2
Apache Tez: In a nutshell
• Distributed execution framework that works on
computations represented as dataflow graphs
• Tez is Hindi for “speed”
• Naturally maps to execution plans
produced by query optimizers
• Highly customizable to meet a
broad spectrum of use cases and to
enable dynamic performance
optimizations at runtime
• Built on top of YARN
39. 25.03.2014
2
Apache Tez: Architecture
• Task with pluggable Input, Processor & Output
– "Classical" Map task: HDFS Input, Map Processor, Sorted Output
– "Classical" Reduce task: Shuffle Input, Reduce Processor, HDFS Output
• YARN ApplicationMaster to run DAG of Tez Tasks
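The idea of running a DAG of tasks, each consuming the outputs of its upstream tasks, can be sketched in a few lines. This is a conceptual model and not the Tez API: the vertex names, the processor functions and `run_dag` are assumptions, and the ordering relies on Python's standard `graphlib` module.

```python
from graphlib import TopologicalSorter  # Python 3.9+


# Each vertex is a processor function; edges say whose output feeds whose input.
def tokenize(inputs):
    return [word for line in inputs["source"] for word in line.split()]


def count(inputs):
    counts = {}
    for word in inputs["tokenize"]:
        counts[word] = counts.get(word, 0) + 1
    return counts


def top(inputs):
    return max(inputs["count"].items(), key=lambda kv: kv[1])


processors = {"tokenize": tokenize, "count": count, "top": top}
# vertex -> set of upstream vertices (hypothetical three-stage plan)
dag = {"tokenize": {"source"}, "count": {"tokenize"}, "top": {"count"}}


def run_dag(dag, processors, source):
    """Run every vertex once all of its upstream outputs are available."""
    outputs = {"source": source}
    for vertex in TopologicalSorter(dag).static_order():
        if vertex in processors:
            inputs = {up: outputs[up] for up in dag[vertex]}
            outputs[vertex] = processors[vertex](inputs)
    return outputs


result = run_dag(dag, processors, ["a b a", "b a"])
print(result["top"])
```

A classical MapReduce job is the special case of this graph with exactly two vertices, a map stage feeding a reduce stage.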
40. 25.03.2014
2
Apache Tez: Tez Service
• MapReduce Query Startup is expensive:
– Job-launch & task-launch latencies are fatal for
short queries (on the order of 5s to 30s)
• Solution:
– Tez Service (= Preallocated Application Master)
• Removes job-launch overhead (Application Master)
• Removes task-launch overhead (Pre-warmed Containers)
– Hive (or Pig)
• Submit query plan to Tez Service
– Native Hadoop service, not ad-hoc
41. 25.03.2014
2
Apache Tez: The new primitive
• Hadoop 1: MapReduce as Base
– Pig, Hive, Other
– MapReduce: Cluster resource mgmt. + data processing
– HDFS: Redundant, reliable storage
• Hadoop 2: Apache Tez as Base
– MR, Pig, Hive on Tez (Execution Engine)
– Real time: Storm
– Other
– YARN: Cluster resource management
– HDFS: Redundant, reliable storage
42. 25.03.2014
2
Apache Tez: Performance
SELECT a.state, COUNT(*),
AVG(c.price)
FROM a
JOIN b ON (a.id = b.id)
JOIN c ON (a.itemId = c.itemId)
GROUP BY a.state
Step                    Existing Hive   Hive/Tez   Tez & Hive Service
Parse Query             0.5s            0.5s       0.5s
Create Plan             0.5s            0.5s       0.5s
Launch Map-Reduce       20s             20s        n/a
Submit to Tez Service   n/a             n/a        0.5s
Process Map-Reduce      10s             2s         2s
Total                   31s             23s        3.5s
* No exact numbers, for illustration only
46. 25.03.2014
2
Storm: In a nutshell
• Stream-processing
• Real-time processing
• Developed as standalone application
• https://github.com/nathanmarz/storm
• Ported to YARN
• https://github.com/yahoo/storm-yarn
47. 25.03.2014
2
Storm: Conceptual view
• Spout: Source of streams
• Bolt
– Consumer of streams
– Processing of tuples
– Possibly emits new tuples
• Stream: Unbounded sequence of tuples
• Tuple: List of name-value pairs
• Topology: Network of spouts & bolts as the nodes and
streams as the edges
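The concepts above can be mimicked with plain Python generators. This is only an analogy and not the Storm API: the function names and the word-count scenario are assumptions, and the stream is finite so that the example terminates.

```python
def sentence_spout():
    """Spout: source of a stream. A real stream is unbounded; this one is finite for the demo."""
    for text in ["to be or not", "to be"]:
        yield {"sentence": text}


def split_bolt(stream):
    """Bolt: consumes sentence tuples and emits one new tuple per word."""
    for tup in stream:
        for word in tup["sentence"].split():
            yield {"word": word}


def count_bolt(stream):
    """Bolt: keeps a running count and emits an updated tuple for every input."""
    counts = {}
    for tup in stream:
        counts[tup["word"]] = counts.get(tup["word"], 0) + 1
        yield {"word": tup["word"], "count": counts[tup["word"]]}


# Topology: spout -> split bolt -> count bolt, wired by passing streams along.
topology = count_bolt(split_bolt(sentence_spout()))
final_counts = {tup["word"]: tup["count"] for tup in topology}
print(final_counts)
```

Each tuple is a small dictionary of name-value pairs, and every stage processes tuples one at a time as they arrive rather than waiting for a complete batch.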
49. 25.03.2014
2
Spark: In a nutshell
• High-speed in-memory analytics over
Hadoop and Hive
• Separate MapReduce-like engine
– Speedup of up to 100x
– On-disk queries 5-10x faster
• Spark is now a top-level Apache project
– http://spark.apache.org
• Compatible with Hadoop‘s Storage API
• Spark can be run on top of YARN
– http://spark.apache.org/docs/0.9.0/running-on-yarn.html
50. 25.03.2014
2
Spark: RDD
• Key idea: Resilient Distributed Datasets
(RDDs)
• Read-only partitioned collection of
records
• Optionally cached in memory across cluster
• Manipulated through parallel operators
• Support only coarse-grained operations
• Map
• Reduce
• Group-by transformations
• Automatically recomputed on failure
[Diagram: an RDD made up of the parts A11, A12, A13]
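The combination of read-only partitions, optional caching and recomputation on failure can be sketched with a toy class. `ToyRDD` is a hypothetical single-process model and not the Spark API; it only shows how remembering the derivation (lineage) makes a lost partition recoverable.

```python
class ToyRDD:
    """Read-only partitioned collection that remembers how it was derived."""

    def __init__(self, compute_partition, num_partitions):
        self._compute = compute_partition   # lineage: partition index -> records
        self.num_partitions = num_partitions
        self._cache = {}                    # optional in-memory copies

    @classmethod
    def from_partitions(cls, partitions):
        return cls(lambda i: list(partitions[i]), len(partitions))

    def map(self, fn):
        """Coarse-grained operation: the same function is applied to every record."""
        return ToyRDD(lambda i: [fn(x) for x in self.partition(i)], self.num_partitions)

    def partition(self, i):
        if i not in self._cache:            # lost or never computed: recompute from lineage
            self._cache[i] = self._compute(i)
        return self._cache[i]

    def reduce(self, fn, initial):
        result = initial
        for i in range(self.num_partitions):
            for x in self.partition(i):
                result = fn(result, x)
        return result


base = ToyRDD.from_partitions([[1, 2], [3, 4], [5]])
squares = base.map(lambda x: x * x)
total = squares.reduce(lambda a, b: a + b, 0)
squares._cache.pop(1)                        # simulate losing a cached partition
total_after_loss = squares.reduce(lambda a, b: a + b, 0)
print(total, total_after_loss)
```

Only the missing partition is rebuilt from its parent; the other cached partitions are reused as they are.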
53. 25.03.2014
2
HOYA: In a nutshell
• Create on-demand HBase clusters
• Small HBase cluster in large YARN cluster
• Dynamic HBase clusters
• Self-healing HBase Cluster
• Elastic HBase clusters
• Transient/intermittent clusters for workflows
• Configure custom configurations & versions
• Better isolation
• More efficient utilization/sharing of cluster
54. 25.03.2014
2
HOYA: Creation of AppMaster
[Diagram: HOYA Client (YARNClient, HOYA specific API), ResourceManager with Scheduler, and three NodeManagers with Containers; one Container holds the HOYA Application Master]
55. 25.03.2014
2
HOYA: Deployment of HBase
[Diagram: HOYA Client (YARNClient, HOYA specific API), ResourceManager with Scheduler, and three NodeManagers with Containers; Containers hold the HOYA Application Master, the HBase Master and two Region Servers]
56. 25.03.2014
2
HOYA: Bind via ZooKeeper
[Diagram: HOYA Client (YARNClient, HOYA specific API), ResourceManager with Scheduler, and three NodeManagers with Containers; Containers hold the HOYA Application Master, the HBase Master and two Region Servers; an HBase Client and ZooKeeper are shown alongside]
58. 25.03.2014
2
Giraph: In a nutshell
• Giraph is a framework for processing semi-
structured graph data on a massive scale
• Giraph is loosely based upon Google's
Pregel
• Both systems are inspired by the Bulk
Synchronous Parallel model
• Giraph performs iterative calculations
on top of an existing Hadoop cluster
• Uses Single Map-only Job
• Apache top level project since 2012
– http://giraph.apache.org
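The Bulk Synchronous Parallel model mentioned above can be illustrated with a single-process sketch: vertices compute on the messages they received, send new messages, and a barrier separates one superstep from the next. This is not Giraph code; `propagate_max` and the sample graph are assumptions.

```python
def propagate_max(values, edges):
    """Vertex-centric BSP: each vertex ends with the largest value in its connected component."""
    values = dict(values)
    neighbours = {v: [] for v in values}
    for a, b in edges:                       # undirected graph
        neighbours[a].append(b)
        neighbours[b].append(a)

    # Superstep 0: every vertex sends its value to all neighbours.
    inbox = {v: [] for v in values}
    for v in values:
        for n in neighbours[v]:
            inbox[n].append(values[v])

    supersteps = 1
    while any(inbox.values()):               # halt when no messages are in flight
        outbox = {v: [] for v in values}
        for v, messages in inbox.items():    # compute phase, conceptually in parallel
            if messages and max(messages) > values[v]:
                values[v] = max(messages)
                for n in neighbours[v]:      # only changed vertices send again
                    outbox[n].append(values[v])
        inbox = outbox                       # barrier: messages arrive next superstep
        supersteps += 1
    return values, supersteps


final_values, steps = propagate_max(
    {"a": 3, "b": 6, "c": 2, "d": 1},
    [("a", "b"), ("b", "c"), ("c", "d")],
)
print(final_values, steps)
```

The loop ends as soon as a superstep produces no messages, which mirrors the iterative style of computation described above.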
60. 25.03.2014
2
Falcon: In a nutshell
• A framework for managing data processing
in Hadoop Clusters
• Falcon runs as a standalone server as part of
the Hadoop cluster
• Key Features:
• Data Replication Handling
• Data Lifecycle Management
• Process Coordination & Scheduling
• Declarative Data Process Programming
• Apache Incubation Status
• http://falcon.incubator.apache.org
61. 25.03.2014
2
Falcon: One-stop Shop
• Data Management Needs: Data Processing, Replication, Retention, Scheduling, Reprocessing, Multi Cluster Mgmt.
• Tool Orchestration: Oozie, Sqoop, Distcp, Flume, MapReduce, Pig & Hive
62. 25.03.2014
2
Falcon: Weblog Use Case
• Weblogs saved hourly to primary cluster
• HDFS location is /weblogs/{date}
• Desired Data Policy:
• Replicate weblogs to secondary cluster
• Evict weblogs from primary cluster after 2 days
• Evict weblogs from secondary cluster after 1
week
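The eviction part of this policy can be written down as a small calculation. The sketch is not a Falcon entity definition: the `RETENTION` table, the `partitions_to_evict` helper, the daily granularity and the ISO format chosen for `{date}` are all assumptions made for illustration.

```python
from datetime import date, timedelta

# Retention periods taken from the policy above.
RETENTION = {"primary": timedelta(days=2), "secondary": timedelta(weeks=1)}


def partitions_to_evict(cluster, partition_dates, today):
    """Return the /weblogs/{date} paths that are older than the cluster's retention."""
    cutoff = today - RETENTION[cluster]
    return [f"/weblogs/{d.isoformat()}" for d in sorted(partition_dates) if d < cutoff]


today = date(2014, 3, 25)
stored = [today - timedelta(days=n) for n in range(10)]  # last 10 daily partitions
print(partitions_to_evict("primary", stored, today))
print(partitions_to_evict("secondary", stored, today))
```

Declaring the retention once per cluster and deriving the evictions from it is the declarative style that the policy list above describes.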
64. 25.03.2014
2
Knox: In a nutshell
• System that provides a single point of
authentication and access for Apache
Hadoop services in a cluster.
• The gateway runs as a server (or cluster of
servers) that provides centralized access to
one or more Hadoop clusters.
• The goal is to simplify Hadoop security for
both users and operators
• Apache Incubation Status
• http://knox.incubator.apache.org