Big Data Conference Europe: Real-Time Streaming in Any and All Clouds, Hybrid and Beyond (Timothy Spann)
Biography
Tim Spann is a Principal DataFlow Field Engineer at Cloudera, where he works with Apache NiFi, MiNiFi, Apache Pulsar, Apache Flink, Apache MXNet, TensorFlow, Apache Spark, big data, the IoT, machine learning, and deep learning. Tim has over a decade of experience with the IoT, big data, distributed computing, streaming technologies, and Java programming. Previously, he was a senior solutions architect at AirisData and a senior field engineer at Pivotal. He blogs for DZone, where he is the Big Data Zone leader, and runs a popular meetup in Princeton on big data, the IoT, deep learning, streaming, NiFi, the blockchain, and Spark. Tim is a frequent speaker at conferences such as IoT Fusion, Strata, ApacheCon, DataWorks Summit Berlin, DataWorks Summit Sydney, and Oracle Code NYC. He holds a BS and MS in computer science.
Talk
Real-Time Streaming in Any and All Clouds, Hybrid and Beyond
Today, data is being generated from devices and containers living at the edge of networks, clouds and data centers. We need to run business logic, analytics and deep learning at scale and as events arrive.
Tools:
Apache Flink, Apache Pulsar, Apache NiFi, MiNiFi, DJL.ai, Apache MXNet.
References:
https://www.datainmotion.dev/2019/11/introducing-mm-flank-apache-flink-stack.html
https://www.datainmotion.dev/2019/08/rapid-iot-development-with-cloudera.html
https://www.datainmotion.dev/2019/09/powering-edge-ai-for-sensor-reading.html
https://www.datainmotion.dev/2019/05/dataworks-summit-dc-2019-report.html
https://www.datainmotion.dev/2019/03/using-raspberry-pi-3b-with-apache-nifi.html
Source Code: https://github.com/tspannhw/MmFLaNK
FLiP Stack
StreamNative
Scenic City Summit (2021): Real-Time Streaming in Any and All Clouds, Hybrid and Beyond (Timothy Spann)
24-September-2021. Scenic City Summit. Virtual. Real-Time Streaming in Any and All Clouds, Hybrid and Beyond
Apache Pulsar, Apache NiFi, Apache Flink
StreamNative
Tim Spann
https://sceniccitysummit.com/
Real-Time Cloud-Native Open Source Streaming of Any Data to Apache Solr (Timothy Spann)
Utilizing Apache Pulsar and Apache NiFi, we can parse any document in real time at scale. We receive a lot of documents via cloud storage, email, social channels and internal document stores. We want to make all the content and metadata available to Apache Solr for categorization, full-text search, optimization and combination with other datastores. We will stream not only documents, but also REST feeds, logs and IoT data. Once data is produced to Pulsar topics, it can instantly be ingested to Solr through the Pulsar Solr Sink.
Utilizing a number of open source tools, we have created a real-time, scalable, any-document parsing data flow. We use Apache Tika for document processing with real-time language detection, natural language processing with Apache OpenNLP, and sentiment analysis with Stanford CoreNLP, spaCy and TextBlob. We will walk everyone through creating an open source flow of documents utilizing Apache NiFi as our integration engine. We can convert PDF, Excel and Word to HTML and/or text. We can also extract the text to apply sentiment analysis and NLP categorization to generate additional metadata about our documents. We will also extract and parse images; if they contain text, we can extract it with TensorFlow and Tesseract.
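To make the pipeline concrete, here is a minimal sketch of the producing side using the pulsar-client Python package; the broker URL, topic name, and document fields are assumptions for illustration. A Pulsar Solr Sink subscribed to this topic would then handle the indexing step described above.

```python
import json
import pulsar

# Connect to a Pulsar broker (URL is an assumption for a local install).
client = pulsar.Client("pulsar://localhost:6650")
producer = client.create_producer("persistent://public/default/documents")

# A parsed document with metadata extracted upstream (e.g., by Tika/OpenNLP).
doc = {
    "source": "email",
    "language": "en",
    "sentiment": "positive",
    "text": "Quarterly report attached.",
}
producer.send(json.dumps(doc).encode("utf-8"))

client.close()
```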
Introduction to Streaming and Messaging: Flume, Kafka, SQS, Kinesis (Omid Vahdaty)
Does big data make you a bit confused? Messaging? Batch processing? Data streaming? In-flight analytics? Cloud? Open source? Flume? Kafka? Flafka (both)? SQS? Kinesis? Firehose?
Apache Kafka - Scalable Message Processing and More! (Guido Schmutz)
Independent of the source of data, the integration of event streams into an enterprise architecture gets more and more important in the world of sensors, social media streams and the Internet of Things. Events have to be accepted quickly and reliably, and they have to be distributed and analysed, often with many consumers or systems interested in all or part of the events. How can we make sure that all these events are accepted and forwarded in an efficient and reliable way? This is where Apache Kafka comes into play: a distributed, highly scalable messaging broker, built for exchanging huge amounts of messages between a source and a target.
This session will start with an introduction to Apache Kafka and present the role of Apache Kafka in a modern data/information architecture and the advantages it brings to the table. Additionally, the Kafka ecosystem will be covered, as well as the integration of Kafka into the Oracle stack, with products such as GoldenGate, Service Bus and Oracle Stream Analytics all being able to act as a Kafka consumer or producer.
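As a minimal sketch of the producer side described above, the snippet below uses the third-party kafka-python package; the broker address and topic name are assumptions.

```python
from kafka import KafkaProducer

# Connect to a Kafka broker (address is an assumption for a local install).
producer = KafkaProducer(bootstrap_servers="localhost:9092")

# Events are appended to a topic; many consumers can read all or part of them.
producer.send("sensor-events", b'{"sensor": "s1", "temp": 21.5}')
producer.flush()  # block until the broker has acknowledged the send
producer.close()
```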
Introduction to Apache Kafka and Confluent... and Why They Matter (Confluent)
Milano Apache Kafka Meetup by Confluent (First Italian Kafka Meetup) on Wednesday, November 29th 2017.
The talk introduces Apache Kafka (including the Kafka Connect and Kafka Streams APIs) and Confluent (the company created by the creators of Kafka), and explains why Kafka is an excellent and simple solution for managing data streams in the context of two of the main driving forces and industry trends: the Internet of Things (IoT) and microservices.
Big Data Day LA 2015 - Always-on Ingestion for Data at Scale by Arvind Prabha... (Data Con LA)
While the last few years have seen great advancements in computing paradigms for big data stores, there remains one critical bottleneck in this architecture - the ingestion process. Instead of immediate insights into the data, a poor ingestion process can cause headaches and problems to no end. On the other hand, a well-designed ingestion infrastructure should give you real-time visibility into how your systems are functioning at any given time. This can significantly increase the overall effectiveness of your ad-campaigns, fraud-detection systems, preventive-maintenance systems, or other critical applications underpinning your business.
In this session we will explore various modes of ingest, including pipelining, pub-sub, and micro-batching, and identify the use cases where each can be applied. We will present this in the context of open source frameworks such as Apache Flume and Kafka, among others, that can be used to build related solutions. We will also present when and how to use multiple modes and frameworks together to form hybrid solutions that can address non-trivial ingest requirements with little or no extra overhead. Through this discussion we will drill down into details of configuration and sizing for these frameworks to ensure optimal operations and utilization for long-running deployments.
Near-Realtime Analytics with Kafka and HBase (dave_revell)
A presentation at OSCON 2012 by Nate Putnam and Dave Revell about Urban Airship's analytics stack. Features Kafka, HBase, and Urban Airship's own open source projects statshtable and datacube.
Webinar: DataStax Training - Everything You Need to Become a Cassandra Rockstar (DataStax)
Looking to strengthen your expertise of Cassandra and DataStax Enterprise? This DataStax Training Webinar will arm you with the knowledge and hands-on skills to get the most out of your DataStax Enterprise environment. If you’ve already taken a DataStax training, consider this a free refresher. Considering training? Then this is a solid intro for developers and admins on your team.
This webinar will highlight the training curriculum and drill into each of the Cassandra expert-led courses so you can determine what meets your needs. Training topics:
Core Concepts, Skills, and Tools
Operations & Performance Tuning
Data Modeling
Using Apache Solr within DataStax Enterprise
And more!
Data processing use cases, from transformation to analytics, perform tasks that require various combinations of queuing, streaming & lightweight processing steps. Until now, supporting all of those needs has required different systems for each task: stream processing engines, message queuing middleware, & streaming messaging systems. That has led to increased complexity for development & operations.
In this session, we'll discuss the need to unify these capabilities in a single system & how Apache Pulsar was designed to address that. Apache Pulsar is a next-generation distributed pub-sub system that was developed & deployed at Yahoo. Streamlio's Karthik Ramasamy will explain how the architecture & design of Pulsar provide the flexibility to support developers & applications needing any combination of queuing, messaging, streaming & lightweight compute.
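A minimal sketch of that flexibility with the pulsar-client Python package: the same topic can be consumed with queuing semantics (a Shared subscription that load-balances messages across workers) or streaming semantics (an Exclusive subscription that reads in order). The broker URL, topic, and subscription names are assumptions.

```python
import pulsar

client = pulsar.Client("pulsar://localhost:6650")

# Queuing: a Shared subscription load-balances messages across consumers.
queue_consumer = client.subscribe(
    "tasks", "workers", consumer_type=pulsar.ConsumerType.Shared
)

# Streaming: an Exclusive subscription reads the topic in order.
stream_consumer = client.subscribe(
    "tasks", "auditor", consumer_type=pulsar.ConsumerType.Exclusive
)

msg = queue_consumer.receive()
print(msg.data())
queue_consumer.acknowledge(msg)

client.close()
```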
Paolo Castagna is a Senior Sales Engineer at Confluent. His background is in 'big data' and he has seen, first hand, the shift happening in the industry from batch to stream processing and from big data to fast data. His talk will introduce Kafka Streams and explain why Apache Kafka is a great option and simplification for stream processing.
Streaming All Over the World: Real-Life Use Cases with Kafka Streams (Confluent)
Streaming all over the world Real life use cases with Kafka Streams, Dr. Benedikt Linse, Senior Solutions Architect, Confluent
https://www.meetup.com/Apache-Kafka-Germany-Munich/events/281819704/
Open Source Bristol 30 March 2022
https://www.meetup.com/Open-Source-Bristol/events/284198269/
18:35 // 'Building a Scalable Event Streaming and Messaging Platform using Apache Pulsar for Fintech' // Tim Spann and John Kinson
Today, companies are adopting Apache Pulsar, an open-source messaging and event streaming platform. Pulsar’s scalability and cloud-native capabilities make it uniquely positioned to meet a range of emerging business needs, including AdTech, fraud detection, IoT analytics, microservices development, and payment processing.
Tim Spann and John Kinson will share insights into the modern data streaming landscape, how Apache Pulsar fits into it, and how it can be used for Fintech. John will also talk about the origins of StreamNative as a Commercial Open Source Software company, and how that has shaped the go-to-market strategy.
I Heart Log: Real-time Data and Apache Kafka (Jay Kreps)
This presentation discusses how logs and stream-processing can form a backbone for data flow, ETL, and real-time data processing. It will describe the challenges and lessons learned as LinkedIn built out its real-time data subscription and processing infrastructure. It will also discuss the role of real-time processing and its relationship to offline processing frameworks such as MapReduce.
Spark Streaming & Kafka - The Future of Stream Processing (Jack Gudenkauf)
Hari Shreedharan (Cloudera) @ Playtika. With its easy-to-use interfaces and native integration with some of the most popular ingest tools, such as Kafka, Flume, Kinesis, etc., Spark Streaming has become the go-to tool for stream processing. Code sharing with Spark also makes it attractive. In this talk, we will discuss the latest features in Spark Streaming, how it integrates natively with Kafka with no data loss, and even how to do exactly-once processing!
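For flavor, here is a minimal sketch of consuming a Kafka topic from PySpark using the newer Structured Streaming API rather than the DStream API the talk centers on; the broker address and topic name are assumptions, and the spark-sql-kafka connector package must be supplied at submit time.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-stream").getOrCreate()

# Subscribe to a Kafka topic as a streaming DataFrame.
events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "sensor-events")
    .load()
)

# Kafka records arrive as binary key/value columns; decode and print them.
query = (
    events.selectExpr("CAST(value AS STRING) AS body")
    .writeStream.format("console")
    .start()
)
query.awaitTermination()
```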
Apache Kafka vs RabbitMQ: Fit For Purpose / Decision Tree (Slim Baltagi)
Kafka as a streaming data platform is becoming the successor to traditional messaging systems such as RabbitMQ. Nevertheless, there are still some use cases where RabbitMQ could be a good fit. This single slide tries to answer, in a concise and unbiased way, where to use Apache Kafka and where to use RabbitMQ. Your comments and feedback are much appreciated.
Messaging, Storage, or Both? The Real-Time Story of Pulsar and Apache DistributedLog (Streamlio)
Modern enterprises produce data at increasingly high volume and velocity. To process data in real time, new types of storage systems have been designed, implemented, and deployed. This presentation from Strata 2017 in New York provides an overview of Apache DistributedLog and Pulsar, real-time storage systems built using Apache BookKeeper and used heavily in production.
DynomiteDB is a high-performance Dynamo layer that adds data replication and sharding to Redis and other single-server storage engines, plus the ability to scale linearly, high availability via a shared-nothing architecture with no single point of failure (SPOF), and support for 1000+ node clusters that span multiple data centers.
Embeddable Data Transformation for Real-Time Streams (Joey Echeverria)
Real-time stream analysis starts with ingesting raw data and extracting structured records. While stream-processing frameworks such as Apache Spark and Apache Storm provide primitives for processing individual records, processing windows of records, and grouping/joining records, the process of performing common actions such as filtering, applying regular expressions to extract data, and converting records from one schema to another are left to developers writing business logic.
Joey Echeverria presents an alternative approach based on a reusable library that provides configuration-based data transformation. This allows users to write common data-transformation rules once and reuse them in multiple contexts, as in the sketch below. A common pattern is to consume a single raw stream and transform it using the same rules before storing the results in different repositories, such as Apache Solr for search and Apache Hadoop HDFS for deep storage.
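The sketch below is a hypothetical illustration of the configuration-based idea, not Joey Echeverria's actual library: transformation rules are declared as data and applied uniformly to each record.

```python
import re

# Declarative rules: filter, extract via regex, and rename fields.
RULES = [
    {"op": "filter", "field": "level", "pattern": r"ERROR|WARN"},
    {"op": "extract", "field": "msg", "pattern": r"user=(?P<user>\w+)"},
    {"op": "rename", "from": "msg", "to": "message"},
]

def transform(record, rules=RULES):
    """Apply declarative rules to one record; return None to drop it."""
    for rule in rules:
        if rule["op"] == "filter":
            if not re.search(rule["pattern"], record.get(rule["field"], "")):
                return None  # record filtered out of the stream
        elif rule["op"] == "extract":
            m = re.search(rule["pattern"], record.get(rule["field"], ""))
            if m:
                record.update(m.groupdict())
        elif rule["op"] == "rename":
            record[rule["to"]] = record.pop(rule["from"], None)
    return record

print(transform({"level": "ERROR", "msg": "login failed user=alice"}))
```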
Apache Pulsar, Supporting the Entire Lifecycle of Streaming Data (StreamNative)
We have long stressed that there is more and more a need for unified messaging and streaming and that Apache Pulsar is the platform that better supports this vision and makes it possible, at a large scale. In his talk, Matteo Merli will show how we can take this messaging & streaming unified paradigm one step further, to fully take advantage of their integration. The result is a drastically simplified architecture: a single system that is able to support the data throughout its entire lifecycle, from when the event is happening down to the historical archiving. The ramifications of this shift are big, as we can see Pulsar is in the perfect spot to enable tighter integration between online and offline worlds.
AI Big Data Conference: Jeffrey Ricker, Kappa Architecture (Olga Zinkevych)
Topic of presentation: Kappa architecture (and beyond)
The main points of the presentation:
We will discuss the evolution of big data architecture, from batch to Lambda to Kappa. I will walk through how to implement a Kappa architecture with practical examples, focusing on how to reach its full potential and avoid the pitfalls. We will finish by reviewing what lies ahead, including the inevitable consolidation between microservices, GPGPU and Hadoop.
http://dataconf.com.ua/index.php#agenda
#dataconf
#AIBDConference
The Hadoop Distributed File System is the foundational storage layer in typical Hadoop deployments. Performance and stability of HDFS are crucial to the correct functioning of applications at higher layers in the Hadoop stack. This session is a technical deep dive into recent enhancements committed to HDFS by the entire Apache contributor community. We describe real-world incidents that motivated these changes and how the enhancements prevent those problems from reoccurring. Attendees will leave this session with a deeper understanding of the implementation challenges in a distributed file system and identify helpful new metrics to monitor in their own clusters.
Big Data Hadoop Project: Payment Gateway Domain (Kamal A)
Live Hadoop project in the payment gateway domain for people seeking real-time work experience in the big data domain. Email: Onlinetraining2011@gmail.com,
Skypeid: onlinetraining2011
My profile: www.linkedin.com/pub/kamal-a/65/2b2/2b5
Hadoop is an open-source software framework for storing data and running applications on clusters of commodity hardware.
It provides massive storage for any kind of data, enormous processing power and the ability to handle virtually limitless concurrent tasks or jobs. The core of Apache Hadoop consists of a storage part (HDFS) and a processing part (MapReduce).
This presentation is about Apache Hadoop technology and may be helpful for beginners. Beginners will learn some terminology of Hadoop technology. There are also some diagrams that show how this technology works.
Thank you.
Apache Kudu (Incubating): New Hadoop Storage for Fast Analytics on Fast Data ... (Cloudera, Inc.)
The Hadoop ecosystem has improved its real-time access capabilities recently, narrowing the gap with relational database technologies. However, gaps remain in the storage layer that complicate the transition to Hadoop-based architectures. In this session, the presenter will describe these gaps and discuss the tradeoffs between real-time transactional access and fast analytic performance from the perspective of storage engine internals. The session will also cover Kudu (currently in beta), the new addition to the open source Hadoop ecosystem with out-of-the-box integration with Apache Spark and Apache Impala (incubating), which achieves fast scans and fast random access from a single API.
product.bp meetup: Design for the Features of Tomorrow, Improve the KPIs of T... (István Ignácz)
Our main topic is the redesign of the "listing dashboard" and the "new listing" pages at ingatlan.com. The interesting part is that we decided to redesign these pages because we wanted them to support some new features and a whole new concept of our product. However, the new pages improved our KPIs too. I'll talk about the whys and hows and try to give you a sneak peek into the most important questions and decisions that led us to the new concept and the redesign.
Affordable semi-furnished housing for working-class people or students who need a convenient place to stay near their place of work or school.
It gives daily commuters who travel from homes far from the Ortigas or Makati CBD a place they can use as a halfway house during the weekdays.
It is an alternative for renters who can use their rent money as down payment for their very own unit.
This project will target the B, C, D markets.
Urban Deca Homes Campville Project Presentation (Roy Buen)
UDH Campville is a 12-building, medium-rise affordable housing project for the working class or starter families who need a convenient place to stay near work or school.
It is ideal for those working in the Alabang and Muntinlupa areas, as well as the Sucat area.
It is an alternative for renters who can use their money as down payment for their very own unit.
This project targets the B, C, and D markets.
#Renttoown #Renttoowncondo #renttoownalabang
Brief Introduction to Hadoop and Core Services (Muthu Natarajan)
I have given a quick introduction to Hadoop, Big Data, Business Intelligence, and the other core services and programs involved in using Hadoop as a successful tool for Big Data analysis.
My true understanding of Big Data:
"Data" becomes "information", but big data now brings information to "knowledge", "knowledge" becomes "wisdom", and "wisdom" turns into "business" or "revenue", provided you use it promptly and in a timely manner.
The session covers how to get started building big data solutions in Azure. Azure provides different Hadoop clusters for the Hadoop ecosystem. The session covers the basics of HDInsight clusters, including Apache Hadoop, HBase, Storm and Spark, and how to integrate with HDInsight in .NET using different Hadoop integration frameworks and libraries. The session is a jump start for engineers and DBAs with RDBMS experience who are looking to begin working with and developing Hadoop solutions. It is demo-driven and will cover the basics of Hadoop open source products.
View the Big Data Technology Stack in a nutshell. This Big Data Technology Stack deck covers the different layers of the Big Data world and summarizes the major technologies in vogue today.
The Apache Hadoop software library is essentially a framework that allows for the distributed processing of large datasets across clusters of computers using a simple programming model. Hadoop can scale up from single servers to thousands of machines, each offering local computation and storage.
This presentation discusses the following topics:
What is Hadoop?
Need for Hadoop
History of Hadoop
Hadoop Overview
Advantages and Disadvantages of Hadoop
Hadoop Distributed File System
Comparing: RDBMS vs. Hadoop
Advantages and Disadvantages of HDFS
Hadoop frameworks
Modules of Hadoop frameworks
Features of Hadoop
Hadoop Analytics Tools
Quick Brief about " What is Hadoop"
I don't explain Hadoop in detail, but reading these slides will give you insight into Hadoop and core product usage. This document will be most useful for PMs, newbies, and technical architects entering cloud computing.
Big Data Warsaw v4: "The Role of the Hadoop Ecosystem in Advanced Analytics" - R... (Dataconomy Media)
What is Big Data? What is Hadoop? What is MapReduce? How do other components such as Oozie, Hue, Hive, and Impala work? Which are the main Hadoop distributions? What is Spark? What are the differences between batch and streaming processing? And what are some Business Intelligence solutions, with a focus on some business cases?
Apache Hive is a tool built on top of Hadoop for analyzing large, unstructured data sets using a SQL-like syntax, thus making Hadoop accessible to legions of existing BI and corporate analytics researchers.
Telecommunication Analysis (3 Use Cases) with IBM Watson Analytics (Sheetal Sharma)
The purpose of this study is to examine, with the help of Watson Analytics, why customers do not use the connection of Bits Telecom Company and which factors influence churn. It also looks at cross-selling and up-selling, focuses on profitability and investment, and seeks the way to better results.
Telecommunication Analysis (3 Use Cases) with IBM Cognos Insight (Sheetal Sharma)
The purpose of this study is to analyze, with the help of IBM Cognos Insight, why customers do not use the connection of Bits Telecom Company and which factors influence churn. It also looks at cross-selling and up-selling, focuses on profitability and investment, and seeks the way to better results.
IBM Watson Analytics sets powerful analytics capabilities free so practically anyone can use them. Automated data preparation, predictive analytics, reporting, dashboards, visualization and collaboration capabilities, enable you to take control of your own analysis. You can then take the appropriate action to address a problem or seize an opportunity, all without asking IT or a data expert for help.
Securing Your Kubernetes Cluster: A Step-by-Step Guide to Success! (KatiaHIMEUR1)
Today, after several years of existence, an extremely active community and an ultra-dynamic ecosystem, Kubernetes has established itself as the de facto standard in container orchestration. Thanks to a wide range of managed services, it has never been so easy to set up a ready-to-use Kubernetes cluster.
However, this ease of use means that the subject of security in Kubernetes is often left for later, or even neglected. This exposes companies to significant risks.
In this talk, I'll show you step-by-step how to secure your Kubernetes cluster for greater peace of mind and reliability.
PHP Frameworks: I Want to Break Free (IPC Berlin 2024) (Ralf Eggert)
In this presentation, we examine the challenges and limitations of relying too heavily on PHP frameworks in web development. We discuss the history of PHP and its frameworks to understand how this dependence has evolved. The focus will be on providing concrete tips and strategies to reduce reliance on these frameworks, based on real-world examples and practical considerations. The goal is to equip developers with the skills and knowledge to create more flexible and future-proof web applications. We'll explore the importance of maintaining autonomy in a rapidly changing tech landscape and how to make informed decisions in PHP development.
This talk is aimed at encouraging a more independent approach to using PHP frameworks, moving towards a more flexible and future-proof approach to PHP development.
Accelerate Your Kubernetes Clusters with Varnish Caching (Thijs Feryn)
A presentation about the usage and availability of Varnish on Kubernetes. This talk explores the capabilities of Varnish caching and shows how to use the Varnish Helm chart to deploy it to Kubernetes.
This presentation was delivered at K8SUG Singapore. See https://feryn.eu/presentations/accelerate-your-kubernetes-clusters-with-varnish-caching-k8sug-singapore-28-2024 for more details.
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do... (UiPathCommunity)
💥 Speed, accuracy, and scaling – discover the superpowers of GenAI in action with UiPath Document Understanding and Communications Mining™:
See how to accelerate model training and optimize model performance with active learning
Learn about the latest enhancements to out-of-the-box document processing – with little to no training required
Get an exclusive demo of the new family of UiPath LLMs – GenAI models specialized for processing different types of documents and messages
This is a hands-on session specifically designed for automation developers and AI enthusiasts seeking to enhance their knowledge in leveraging the latest intelligent document processing capabilities offered by UiPath.
Speakers:
👨🏫 Andras Palfi, Senior Product Manager, UiPath
👩🏫 Lenka Dulovicova, Product Program Manager, UiPath
Essentials of Automations: Optimizing FME Workflows with Parameters (Safe Software)
Are you looking to streamline your workflows and boost your projects’ efficiency? Do you find yourself searching for ways to add flexibility and control over your FME workflows? If so, you’re in the right place.
Join us for an insightful dive into the world of FME parameters, a critical element in optimizing workflow efficiency. This webinar marks the beginning of our three-part “Essentials of Automation” series. This first webinar is designed to equip you with the knowledge and skills to utilize parameters effectively: enhancing the flexibility, maintainability, and user control of your FME projects.
Here’s what you’ll gain:
- Essentials of FME Parameters: Understand the pivotal role of parameters, including Reader/Writer, Transformer, User, and FME Flow categories. Discover how they are the key to unlocking automation and optimization within your workflows.
- Practical Applications in FME Form: Delve into key user parameter types including choice, connections, and file URLs. Allow users to control how a workflow runs, making your workflows more reusable. Learn to import values and deliver the best user experience for your workflows while enhancing accuracy.
- Optimization Strategies in FME Flow: Explore the creation and strategic deployment of parameters in FME Flow, including the use of deployment and geometry parameters, to maximize workflow efficiency.
- Pro Tips for Success: Gain insights on parameterizing connections and leveraging new features like Conditional Visibility for clarity and simplicity.
We’ll wrap up with a glimpse into future webinars, followed by a Q&A session to address your specific questions surrounding this topic.
Don’t miss this opportunity to elevate your FME expertise and drive your projects to new heights of efficiency.
UiPath Test Automation using UiPath Test Suite series, part 3 (DianaGray10)
Welcome to UiPath Test Automation using UiPath Test Suite series part 3. In this session, we will cover desktop automation along with UI automation.
Topics covered:
UI automation introduction
UI automation sample
Desktop automation flow
Pradeep Chinnala, Senior Consultant Automation Developer @WonderBotz and UiPath MVP
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf (Paige Cruz)
Monitoring and observability aren’t traditionally found in software curriculums and many of us cobble this knowledge together from whatever vendor or ecosystem we were first introduced to and whatever is a part of your current company’s observability stack.
While the dev and ops silo continues to crumble, many organizations still relegate monitoring & observability to the purview of ops, infra and SRE teams. This is a mistake: achieving a highly observable system requires collaboration up and down the stack.
I, a former op, would like to extend an invitation to all application developers to join the observability party, and will share these foundational concepts to build on:
GraphRAG is All You Need? LLM & Knowledge Graph (Guy Korland)
Guy Korland, CEO and Co-founder of FalkorDB, will review two articles on the integration of language models with knowledge graphs.
1. Unifying Large Language Models and Knowledge Graphs: A Roadmap.
https://arxiv.org/abs/2306.08302
2. Microsoft Research's GraphRAG paper and a review paper on various uses of knowledge graphs:
https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
SAP Sapphire 2024 - ASUG301 Building Better Apps with SAP Fiori.pdf (Peter Spielvogel)
Building better applications for business users with SAP Fiori.
• What is SAP Fiori and why it matters to you
• How a better user experience drives measurable business benefits
• How to get started with SAP Fiori today
• How SAP Fiori elements accelerates application development
• How SAP Build Code includes SAP Fiori tools and other generative artificial intelligence capabilities
• How SAP Fiori paves the way for using AI in SAP apps
2. What Is Apache Hadoop?
The Apache™ Hadoop® project develops open-source software for reliable, scalable, distributed computing.
The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly available service on top of a cluster of computers, each of which may be prone to failures.
3. The project includes these modules:
● Hadoop Common: The common utilities that support the other Hadoop modules.
● Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data.
● Hadoop YARN: A framework for job scheduling and cluster resource management.
● Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.
4. Other Hadoop-related projects at Apache include:
● Ambari™: A web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters, which includes support for Hadoop HDFS, Hadoop MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig and Sqoop. Ambari also provides a dashboard for viewing cluster health, such as heatmaps, and the ability to view MapReduce, Pig and Hive applications visually, along with features to diagnose their performance characteristics in a user-friendly manner.
● Avro™: A data serialization system.
● Cassandra™: A scalable multi-master database with no single points of failure.
● Chukwa™: A data collection system for managing large distributed systems.
● HBase™: A scalable, distributed database that supports structured data storage for large tables.
5. Other Hadoop-related projects at Apache include:
● Hive™: A data warehouse infrastructure that provides data summarization and ad hoc querying.
● Mahout™: A scalable machine learning and data mining library.
● Pig™: A high-level data-flow language and execution framework for parallel computation.
● Spark™: A fast and general compute engine for Hadoop data. Spark provides a simple and expressive programming model that supports a wide range of applications, including ETL, machine learning, stream processing, and graph computation.
● Tez™: A generalized data-flow programming framework, built on Hadoop YARN, which provides a powerful and flexible engine to execute an arbitrary DAG of tasks to process data for both batch and interactive use cases. Tez is being adopted by Hive™, Pig™ and other frameworks in the Hadoop ecosystem, and also by other commercial software (e.g. ETL tools), to replace Hadoop™ MapReduce as the underlying execution engine.
● ZooKeeper™: A high-performance coordination service for distributed applications.
6. Introduction
● The Apache Ambari project is aimed at making Hadoop management simpler by developing software for provisioning, managing, and monitoring Apache Hadoop clusters. Ambari provides an intuitive, easy-to-use Hadoop management web UI backed by its RESTful APIs.
Ambari enables System Administrators to:
● Provision a Hadoop Cluster: Ambari provides a step-by-step wizard for installing Hadoop services across any number of hosts, and handles configuration of Hadoop services for the cluster.
● Manage a Hadoop Cluster: Ambari provides central management for starting, stopping, and reconfiguring Hadoop services across the entire cluster.
7. ● Monitor a Hadoop Cluster
➢ Ambari provides a dashboard for monitoring health and status of the Hadoop cluster.
➢ Ambari leverages Ganglia for metrics collection.
➢ Ambari leverages Nagios for system alerting and will send emails when your attention is needed (e.g., a node goes down, remaining disk space is low, etc.).
● Ambari enables Application Developers and System Integrators to:
➢ Easily integrate Hadoop provisioning, management, and monitoring capabilities into their own applications with the Ambari REST APIs (see the sketch below).
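As a hedged illustration of the REST API point above, this sketch lists clusters via Ambari's HTTP endpoint using Python's requests package; the host, port, and default admin credentials are assumptions about a stock install.

```python
import requests

# Minimal sketch: list clusters via Ambari's REST API.
AMBARI = "http://ambari-host:8080"  # hypothetical host and default port
resp = requests.get(
    f"{AMBARI}/api/v1/clusters",
    auth=("admin", "admin"),               # default credentials; change in production
    headers={"X-Requested-By": "ambari"},  # header Ambari expects from API clients
)
resp.raise_for_status()
for cluster in resp.json().get("items", []):
    print(cluster["Clusters"]["cluster_name"])
```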
8. Getting Started with Ambari
● Follow the installation guide for Ambari 1.7.0.
● Note: Ambari currently supports the 64-bit version of the following operating systems:
● RHEL (Redhat Enterprise Linux) 5 and 6
● CentOS 5 and 6
● OEL (Oracle Enterprise Linux) 5 and 6
● SLES (SuSE Linux Enterprise Server) 11
● Ubuntu 12
9. Apache Avro
Introduction
● Apache Avro™ is a data serialization system.
Avro provides:
● Rich data structures.
● A compact, fast, binary data format.
● A container file, to store persistent data.
● Remote procedure call (RPC).
● Simple integration with dynamic languages. Code generation is not required to read or write data files nor to use or implement RPC protocols. Code generation is an optional optimization, only worth implementing for statically typed languages.
10. Apache Avro
Schemas
● Avro relies on schemas. When Avro data is read, the schema used when writing it is always present. This permits each datum to be written with no per-value overheads, making serialization both fast and small. It also facilitates use with dynamic, scripting languages, since data, together with its schema, is fully self-describing.
● When Avro data is stored in a file, its schema is stored with it, so that files may be processed later by any program. If the program reading the data expects a different schema, this can be easily resolved, since both schemas are present.
● When Avro is used in RPC, the client and server exchange schemas in the connection handshake. (This can be optimized so that, for most calls, no schemas are actually transmitted.) Since both client and server have the other's full schema, correspondence between same-named fields, missing fields, extra fields, etc. can all be easily resolved.
● Avro schemas are defined with JSON. This facilitates implementation in languages that already have JSON libraries.
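A minimal sketch of the schema-with-the-data model using the third-party fastavro package (an assumption; the official avro package offers similar functionality). The record schema is a hypothetical example.

```python
from fastavro import parse_schema, reader, writer

# A record schema, defined as JSON-like data.
schema = parse_schema({
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "age", "type": "int"},
    ],
})

# The container file stores the schema alongside the records...
with open("users.avro", "wb") as out:
    writer(out, schema, [{"name": "Alice", "age": 30}])

# ...so any later reader can decode the data with no external metadata.
with open("users.avro", "rb") as fo:
    for record in reader(fo):
        print(record)
```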
11. Apache Avro
Comparison with other systems
● Avro provides functionality similar to systems such as Thrift, Protocol Buffers, etc. Avro differs from these systems in the following fundamental aspects.
● Dynamic typing: Avro does not require that code be generated. Data is always accompanied by a schema that permits full processing of that data without code generation, static datatypes, etc. This facilitates construction of generic data-processing systems and languages.
● Untagged data: Since the schema is present when data is read, considerably less type information need be encoded with data, resulting in smaller serialization size.
● No manually-assigned field IDs: When a schema changes, both the old and new schema are always present when processing data, so differences may be resolved symbolically, using field names.
● Apache Avro, Avro, Apache, and the Avro and Apache logos are trademarks of The Apache Software Foundation.
12. Apache Cassandra
The Apache Cassandra database is the right choice when you need scalability and high availability without compromising performance. Linear scalability and proven fault-tolerance on commodity hardware or cloud infrastructure make it the perfect platform for mission-critical data. Cassandra's support for replicating across multiple data centers is best-in-class, providing lower latency for your users and the peace of mind of knowing that you can survive regional outages.
Cassandra's data model offers the convenience of column indexes with the performance of log-structured updates, strong support for denormalization and materialized views, and powerful built-in caching.
13. Apache Cassandra Overview
● Proven
Cassandra is in use at Constant Contact, CERN, Comcast, eBay, GitHub, GoDaddy, Hulu, Instagram, Intuit, Netflix, Reddit, The Weather Channel, and over 1500 more companies that have large, active data sets. One of the largest production deployments is Apple's, with over 75,000 nodes storing over 10 PB of data. Other large Cassandra installations include Netflix (2,500 nodes, 420 TB, over 1 trillion requests per day), Chinese search engine Easou (270 nodes, 300 TB, over 800 million requests per day), and eBay (over 100 nodes, 250 TB).
● Fault Tolerant
Data is automatically replicated to multiple nodes for fault-tolerance. Replication across multiple data centers is supported. Failed nodes can be replaced with no downtime.
14. Apache Cassandra Overview
● Performance
Cassandra consistently outperforms popular NoSQL alternatives in benchmarks and real applications, primarily because of fundamental architectural choices.
● Decentralized
There are no single points of failure. There are no network bottlenecks. Every node in the cluster is identical.
● Durable
Cassandra is suitable for applications that can't afford to lose data, even when an entire data center goes down.
15. Apache Cassandra Overview
● You're in Control
Choose between synchronous or asynchronous replication for each update. Highly available asynchronous operations are optimized with features like Hinted Handoff and Read Repair.
● Elastic
Read and write throughput both increase linearly as new machines are added, with no downtime or interruption to applications.
● Professionally Supported
Cassandra support contracts and services are available from third parties.
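A minimal sketch using the DataStax cassandra-driver package; the contact point, keyspace, and table are assumptions for illustration.

```python
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])  # any node works; every node is identical
session = cluster.connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS demo
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3}
""")
session.execute(
    "CREATE TABLE IF NOT EXISTS demo.users (id int PRIMARY KEY, name text)"
)

# Writes are automatically replicated according to the keyspace settings.
session.execute("INSERT INTO demo.users (id, name) VALUES (%s, %s)", (1, "Alice"))

for row in session.execute("SELECT id, name FROM demo.users"):
    print(row.id, row.name)

cluster.shutdown()
```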
16. Chukwa
● Chukwa is an open source data collection system for monitoring large distributed systems. Chukwa is built on top of the Hadoop Distributed File System (HDFS) and Map/Reduce framework and inherits Hadoop's scalability and robustness. Chukwa also includes a flexible and powerful toolkit for displaying, monitoring and analyzing results to make the best use of the collected data.
17. ● Apache HBase™ is the Hadoop database, a distributed, scalable, big data store.
● Use Apache HBase™ when you need random, real-time read/write access to your Big Data. This project's goal is the hosting of very large tables (billions of rows X millions of columns) atop clusters of commodity hardware. Apache HBase is an open-source, distributed, versioned, non-relational database modeled after Google's Bigtable: A Distributed Storage System for Structured Data by Chang et al. Just as Bigtable leverages the distributed data storage provided by the Google File System, Apache HBase provides Bigtable-like capabilities on top of Hadoop and HDFS.
18. Features of Apache HBase
● Linear and modular scalability.
● Strictly consistent reads and writes.
● Automatic and configurable sharding of tables.
● Automatic failover support between RegionServers.
● Convenient base classes for backing Hadoop MapReduce jobs with Apache HBase tables.
● Easy to use Java API for client access.
● Block cache and Bloom filters for real-time queries.
● Query predicate push down via server-side filters.
● Thrift gateway and a REST-ful Web service that supports XML, Protobuf, and binary data encoding options.
● Extensible JRuby-based (JIRB) shell.
● Support for exporting metrics via the Hadoop metrics subsystem to files or Ganglia, or via JMX.
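A minimal sketch of random reads/writes and a prefix scan using the third-party happybase client, which goes through the Thrift gateway listed above; the host, port, table, and column names are assumptions.

```python
import happybase

# Connect through the HBase Thrift gateway (default port 9090 assumed).
conn = happybase.Connection("localhost", port=9090)
table = conn.table("metrics")  # hypothetical, pre-created table

# Random, real-time write and read by row key.
table.put(b"sensor-1|2021-09-24", {b"d:temp": b"21.5"})
print(table.row(b"sensor-1|2021-09-24"))

# Range scan over a row-key prefix.
for key, data in table.scan(row_prefix=b"sensor-1|"):
    print(key, data)

conn.close()
```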
19. Apache Hive
● The Apache Hive™ data warehouse software facilitates querying and managing large data sets residing in distributed storage. Hive provides a mechanism to project structure onto this data and query the data using a SQL-like language called HiveQL. At the same time, this language also allows traditional map/reduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express this logic in HiveQL.
● Hive is an open source volunteer project under the Apache Software Foundation. Previously it was a subproject of Apache Hadoop, but it has now graduated to become a top-level project of its own.
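A minimal sketch of issuing a HiveQL query from Python with the third-party PyHive package (an assumption); the HiveServer2 host and the weblogs table are hypothetical.

```python
from pyhive import hive

# Connect to HiveServer2 (host/port are assumptions for a local install).
conn = hive.Connection(host="localhost", port=10000)
cursor = conn.cursor()

# HiveQL looks like SQL but compiles to distributed jobs over data in HDFS.
cursor.execute("SELECT page, COUNT(*) AS hits FROM weblogs GROUP BY page")
for page, hits in cursor.fetchall():
    print(page, hits)

conn.close()
```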
20. Apache Mahout
● The Apache Mahout™ project's goal is to build a scalable machine learning library.
By scalable we mean:
● Scalable to large data sets. Our core algorithms for clustering, classification and collaborative filtering are implemented on top of scalable, distributed systems. However, contributions that run on a single machine are welcome as well.
● Scalable to support your business case. Mahout is distributed under a commercially friendly Apache Software license.
21. Apache Mahout
● Scalable community. The goal of Mahout is to build a vibrant, responsive, diverse community to facilitate discussions not only on the project itself but also on potential use cases. Come to the mailing lists to find out more.
● Currently Mahout supports mainly three use cases: Recommendation mining takes users' behavior and from that tries to find items users might like. Clustering takes e.g. text documents and groups them into groups of topically related documents. Classification learns from existing categorized documents what documents of a specific category look like and is able to assign unlabeled documents to the (hopefully) correct category.
22. Apache Pig
Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turn enables them to handle very large data sets.
● At the present time, Pig's infrastructure layer consists of a compiler that produces sequences of Map-Reduce programs, for which large-scale parallel implementations already exist (e.g., the Hadoop subproject).
23. Apache Pig
● Pig's language layer currently consists of a textual language called Pig Latin, which has the following key properties:
● Ease of programming. It is trivial to achieve parallel execution of simple, "embarrassingly parallel" data analysis tasks. Complex tasks comprised of multiple interrelated data transformations are explicitly encoded as data flow sequences, making them easy to write, understand, and maintain.
● Optimization opportunities. The way in which tasks are encoded permits the system to optimize their execution automatically, allowing the user to focus on semantics rather than efficiency.
● Extensibility. Users can create their own functions to do special-purpose processing.
24. Apache Spark
● Apache Spark™ is a fast and general engine for large-scale data processing.
● Ease of Use
Write applications quickly in Java, Scala or Python. Spark offers over 80 high-level operators that make it easy to build parallel apps. And you can use it interactively from the Scala and Python shells.
● Generality
Combine SQL, streaming, and complex analytics. Spark powers a stack of high-level tools including Spark SQL, MLlib for machine learning, GraphX, and Spark Streaming. You can combine these libraries seamlessly in the same application.
25. Apache Spark
● Runs Everywhere
Spark runs on Hadoop, Mesos, standalone, or in the cloud. It can access diverse data sources including HDFS, Cassandra, HBase, and S3. You can run Spark readily using its standalone cluster mode, on EC2, or run it on Hadoop YARN or Apache Mesos. It can read from HDFS, HBase, Cassandra, and any Hadoop data source.
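A minimal PySpark sketch of those high-level operators, here a classic word count; the input path is an assumption.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wordcount").getOrCreate()

# Read lines from any Hadoop-compatible source (HDFS path is hypothetical).
lines = spark.read.text("hdfs:///data/input.txt")

# Two high-level operators express the whole parallel computation.
words = lines.rdd.flatMap(lambda row: row.value.split())
counts = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)

for word, n in counts.take(10):
    print(word, n)

spark.stop()
```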
26. Apache Tez
Introduction
● The Apache Tez project is aimed at building an application framework which allows for a complex directed-acyclic-graph of tasks for processing data. It is currently built atop Apache Hadoop YARN.
The two main design themes for Tez are:
● Empowering end users by:
Expressive data flow definition APIs
Flexible Input-Processor-Output runtime model
Data type agnostic
Simplifying deployment
27. Apache Tez
● Execution Performance
Performance gains over MapReduce
Optimal resource management
Plan reconfiguration at runtime
Dynamic physical data flow decisions
28. By allowing projects like Apache Hive and Apache Pig to run a complex DAG of tasks, Tez can be used to process data that earlier took multiple MR jobs in a single Tez job.
29. Apache ZooKeeper
● Apache ZooKeeper is an effort to develop and maintain an open-source server which enables highly reliable distributed coordination.
● ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services. All of these kinds of services are used in some form or another by distributed applications. Each time they are implemented, there is a lot of work that goes into fixing the bugs and race conditions that are inevitable. Because of the difficulty of implementing these kinds of services, applications initially usually skimp on them, which makes them brittle in the presence of change and difficult to manage. Even when done correctly, different implementations of these services lead to management complexity when the applications are deployed.
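A minimal sketch of the configuration-maintenance use case with the third-party kazoo client (an assumption); the ensemble address and znode path are hypothetical.

```python
from kazoo.client import KazooClient

# Connect to a ZooKeeper ensemble (address is an assumption).
zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()

# Store a small piece of shared configuration at a znode.
zk.ensure_path("/app/config")
zk.set("/app/config", b"feature_flag=on")

# Any process in the cluster can read (and watch) the same znode.
data, stat = zk.get("/app/config")
print(data.decode(), stat.version)

zk.stop()
```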