Supreet Oberoi's presentation on "Large scale log processing with Cascading & Elastic Search". Elasticsearch is becoming a popular platform for log analysis with its ELK stack: Elasticsearch for search, Logstash for centralized logging, and Kibana for visualization. Complemented by Cascading, the application development platform for building data applications on Apache Hadoop, developers can correlate multiple log and data streams at scale and perform rich, complex log processing before making the results available to the ELK stack.
Today enterprises want to move more and more of their data lakes to the cloud to help them execute faster, increase productivity, and drive innovation while leveraging the scale and flexibility of the cloud. However, such gains come with risks and challenges in the areas of data security, privacy, and governance. In this talk we cover how enterprises can overcome governance and security obstacles to take advantage of the ease of data lake management that the cloud provides. We will also show how an enterprise can apply consistent governance and security controls to its ephemeral analytic workloads in a multi-cluster cloud environment without sacrificing any of the data security and privacy/compliance needs that its business context demands. Additionally, we will outline use cases, patterns, and best practices for rationally managing such a multi-cluster data lake infrastructure in the cloud.
Speaker:
Jeff Sposetti, Product Management, Hortonworks
Troubleshooting Kerberos in Hadoop: Taming the Beast (DataWorks Summit)
Kerberos is the ubiquitous authentication mechanism when it comes to securing any Hadoop service. With recent updates in Hadoop core and various Apache Hadoop components, built-in Kerberos support has matured and come a long way.
Understanding & configuring Kerberos is still a challenge, but even more painful & frustrating is troubleshooting a Kerberos issue. There are a lot of things, small & big, that can go wrong (and will go wrong!). This talk covers Kerberos debugging in detail and discusses the tools & tricks that can be used to narrow down any Kerberos issue.
Rather than discussing specific issues and their resolutions, we will focus on how to approach a Kerberos problem and the do's and don'ts of the Kerberos world. This talk will provide a step-by-step guide that will equip the audience to troubleshoot future Kerberos problems.
The agenda:
- Systematic approach to Kerberos troubleshooting
- Kerberos Tools available in Hadoop arsenal
- Tips & Tricks to narrow down Kerberos issues quickly
- Some nasty Kerberos issues from Support trenches
Some prior knowledge of Kerberos basics will be appreciated but is not a prerequisite.
Speaker:
Vipin Rathor, Sr. Product Specialist (HDP Security), Hortonworks
There is a fundamental shift underway in IT toward open, software-defined, distributed systems like Hadoop. As a result, every Oracle professional should strive to learn these new technologies or risk being left behind. This session is designed specifically for Oracle database professionals so they can better understand SQL on Hadoop and the benefits it brings to the enterprise. Attendees will see how SQL on Hadoop compares to Oracle in areas such as data storage, data ingestion, and SQL processing. Various live demos will give attendees a first-hand look at these new-world technologies. Presented at Collaborate 18.
Realizing the Promise of Portable Data Processing with Apache Beam (DataWorks Summit)
The world of big data involves an ever-changing field of players. Much as SQL stands as a lingua franca for declarative data analysis, Apache Beam aims to provide a portable standard for expressing robust, out-of-order data processing pipelines in a variety of languages across a variety of platforms. In a way, Apache Beam is the glue that can connect the Big Data ecosystem together; it enables users to "run-anything-anywhere".
This talk will briefly cover the capabilities of the Beam model for data processing, as well as the current state of the Beam ecosystem. We'll discuss the Beam architecture and dive into the portability layer. We'll offer a technical analysis of Beam's powerful primitive operations that enable true and reliable portability across diverse environments. Finally, we'll demonstrate a complex pipeline running on multiple runners in multiple deployment scenarios (e.g. Apache Spark on Amazon Web Services, Apache Flink on Google Cloud, Apache Apex on-premise), and give a glimpse at some of the challenges Beam aims to address in the future.
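To make the portability claim concrete, here is a minimal hedged word-count sketch in the Beam Java SDK (not from the talk; file paths are illustrative). The key point is that the runner is selected entirely through pipeline options rather than pipeline code, which is what lets the same pipeline move between Spark, Flink, Apex, and Dataflow.
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.TypeDescriptors;

public class BeamPortabilitySketch {
  public static void main(String[] args) {
    // The runner (Spark, Flink, Apex, Dataflow, direct) arrives via --runner=...
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
    Pipeline p = Pipeline.create(options);
    p.apply(TextIO.read().from("input.txt"))          // illustrative input path
     .apply(Count.perElement())                       // count occurrences per line
     .apply(MapElements.into(TypeDescriptors.strings())
         .via((KV<String, Long> kv) -> kv.getKey() + ": " + kv.getValue()))
     .apply(TextIO.write().to("counts"));             // illustrative output prefix
    p.run().waitUntilFinish();
  }
}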
Speaker:
Davor Bonaci, Senior Software Engineer, Google
Apache Apex brings you the power to quickly build and run big data batch and stream processing applications. But what about visualizing your data in real time as it flows through the Apache Apex applications? Together, we will review Apache Apex, and how it integrates with Apache Hadoop and Apache Kafka to process your big data with streaming computation. Then we will explore the options available to visualize Apex applications metrics and data, including open-source options like REST and PubSub mechanisms in StrAM, as well as features available in the RTS Console like real-time Dashboards and Widgets. We will also look into ways of packaging dashboards inside your Apache Apex applications.
Innovation in the Enterprise Rent-A-Car Data Warehouse (DataWorks Summit)
Big Data adoption is a journey. Depending on the business the process can take weeks, months, or even years. With any transformative technology the challenges have less to do with the technology and more to do with how a company adapts itself to a new way of thinking about data. Building a Center of Excellence is one way for IT to help drive success.
This talk will explore Enterprise Holdings Inc. (which operates the Enterprise Rent-A-Car, National Car Rental, and Alamo Rent A Car brands) and its experience with Big Data. EHI’s journey started in 2013 with Hadoop as a POC, and today the company is working to create the next-generation data warehouse in Microsoft’s Azure cloud utilizing a lambda architecture.
We’ll discuss the Center of Excellence, the roles in the new world, share the things which worked well, and rant about those which didn’t.
No deep Hadoop knowledge is necessary; the talk is pitched at the architect or executive level.
Building an ETL pipeline for Elasticsearch using Spark (Itai Yaffe)
How we, at eXelate, built an ETL pipeline for Elasticsearch using Spark, including:
* Processing the data using Spark.
* Indexing the processed data directly into Elasticsearch using the elasticsearch-hadoop plugin for Spark.
* Managing the flow using some of the services provided by AWS (EMR, Data Pipeline, etc.).
The presentation includes some tips and discusses some of the pitfalls we encountered while setting up this process.
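A rough Java sketch of the indexing step above (not eXelate's actual code; the node address, index name, and record fields are placeholders), using the elasticsearch-hadoop Spark integration:
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.elasticsearch.spark.rdd.api.java.JavaEsSpark;

public class EsEtlSketch {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf().setAppName("es-etl")
        .set("es.nodes", "localhost:9200"); // placeholder ES endpoint
    JavaSparkContext sc = new JavaSparkContext(conf);

    // Stand-in for the real Spark processing step: one transformed record
    Map<String, Object> doc = new HashMap<>();
    doc.put("user", "u1");
    doc.put("clicks", 42);
    JavaRDD<Map<String, Object>> docs = sc.parallelize(Arrays.asList(doc));

    // Index the processed RDD directly into the "events/click" resource
    JavaEsSpark.saveToEs(docs, "events/click");
    sc.stop();
  }
}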
Big Data Day LA 2016 / Use Case Driven track - Hydrator: Open Source, Code-Fre... (Data Con LA)
This talk will present how to build data pipelines with no code using the open-source, Apache 2.0, Cask Hydrator. The talk will continue with a live demonstration of creating data pipelines for two use cases.
Enabling Modern Application Architecture using Data.gov open government data (DataWorks Summit)
Big Data and the Internet of Things (IoT) have forced businesses and the Federal Government to reevaluate their existing data strategies and adopt a more modern data architecture. With the advent of the connected data platform, migrating or building data-driven applications that take advantage of data-in-motion and data-at-rest can be a daunting journey to undertake. Scaling, reusability, and achieving operational agility are just some of the common pitfalls associated with existing software architectures. How do we embrace this paradigm shift? Adopting agile methodologies and emerging development practices such as Microservices and DevOps offers greater agility and operational efficiency, enabling the government to rapidly build modern data-driven applications.
During this talk and demonstration, we will show how the federal government can unleash the true power of the connected data platform with modern data-driven applications.
Connected Data Platform:
• Hortonworks DataFlow
o Using Apache NiFi for capturing data at the edge of the data lake & managing the flow of data to the data platform
o Apache Storm for complex event processing and stream processing
• Hortonworks Data Platform
o Apache Accumulo for scalability and cell-level security
o Apache YARN for resource management
• Modern Data-Driven Applications
o Microservices: a software architecture practice for designing software applications as suites of independently deployable services, promoting componentization, single responsibility & scalability. Adopting a Microservices mindset enables the government to be technology agnostic: using the best tool or programming language for the job.
♣ Demoed REST APIs on top of Apache Accumulo (Spark-Java, AngularJS/TypeScript); a sketch of such an endpoint follows this abstract
o DevOps: A culture and practice that breaks down the silos found between development and operations teams in traditional software practices.
♣ CI/CD pipelines, automated build kick-offs using containers (Docker, Jenkins)
This talk will lay out a basic environment for promoting greater agility and operational efficiency for the federal government while taking advantage of a connected data platform.
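As a hedged illustration of the REST-APIs bullet above (here "Spark-Java" is the Spark Java web framework, not Apache Spark), a minimal endpoint might look like the following sketch; the route, port, and canned JSON are invented, and the Accumulo scan it would wrap is stubbed out.
import static spark.Spark.get;
import static spark.Spark.port;

public class RecordService {
  public static void main(String[] args) {
    port(8080); // hypothetical service port
    // In the demoed system this handler would scan Apache Accumulo using the
    // caller's cell-level visibility labels; here it returns a canned record.
    get("/records/:id", (request, response) -> {
      response.type("application/json");
      return "{\"id\": \"" + request.params(":id") + "\", \"status\": \"ok\"}";
    });
  }
}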
The first part of the talk describes the anatomy of a typical data pipeline and how Apache Oozie meets the demands of large-scale data pipelines. In particular, we will focus on recent advancements in Oozie for dependency management among pipeline stages; incremental and partial processing; combinatorial, conditional, and optional processing; priority processing; late processing; and BCP management. The second part of the talk focuses on out-of-the-box support for Spark jobs.
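For readers new to Oozie, a minimal client-side sketch of submitting a pipeline job with the standard org.apache.oozie.client API looks like this (the server URL and application path are placeholders, not from the talk):
import java.util.Properties;
import org.apache.oozie.client.OozieClient;
import org.apache.oozie.client.WorkflowJob;

public class SubmitPipeline {
  public static void main(String[] args) throws Exception {
    OozieClient oozie = new OozieClient("http://localhost:11000/oozie");
    Properties conf = oozie.createConfiguration();
    // Placeholder HDFS path to the workflow (or coordinator) application
    conf.setProperty(OozieClient.APP_PATH, "hdfs://namenode/user/etl/pipeline");
    conf.setProperty("nameNode", "hdfs://namenode");
    conf.setProperty("jobTracker", "resourcemanager:8032");
    String jobId = oozie.run(conf); // submit and start the job
    WorkflowJob job = oozie.getJobInfo(jobId);
    System.out.println("Job " + jobId + " is " + job.getStatus());
  }
}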
Speakers:
Purshotam Shah is a senior software engineer with the Hadoop team at Yahoo, and an Apache Oozie PMC member and committer.
Satish Saley is a software engineer at Yahoo!. He contributes to Apache Oozie.
Data Ingest Self Service and Management using Nifi and Kafka (DataWorks Summit)
We’re feeling the growing pains of maintaining a large data platform. Last year we went from 50 to 150 unique data feeds by adding them all by hand. In this talk we will share the best practices we developed to handle our 300% increase in feeds through self-service. Having self-service capabilities will increase your team's velocity and decrease your time to value and insight.
* Self-service data feed design and ingest
* Configuration management
* Automatic debugging
* Lightweight data governance
Dynamic DDL: Adding structure to streaming IoT data on the fly (DataWorks Summit)
At the end of the day, data scientists want one thing: tabular data for their analysis. They do not want to spend hours or days preparing data. How does a data engineer handle the massive amount of data being streamed at them from IoT devices and apps, and at the same time add structure to it, so that data scientists can focus on finding insights instead of preparing data? By the way, you need to do this within minutes (sometimes seconds). Oh... and there are a bunch more data sources that you need to ingest, and the current providers of data are changing their structure.
At GoPro, we have massive amounts of heterogeneous data being streamed at us from our consumer devices and applications, and we have developed a concept of "dynamic DDL" to structure our streamed data on the fly using Spark Streaming, Kafka, HBase, Hive, and S3. The idea is simple: add structure (schema) to the data as soon as possible, allow the providers of the data to dictate the structure, and automatically create event-based and state-based tables (DDL) for all data sources so that data scientists can access the data via their lingua franca, SQL, within minutes.
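The core of the idea can be hedged into a few lines of Spark (a simplification, not GoPro's actual code; paths and table names are invented): let Spark infer the schema the producers send, then project it straight into a queryable table so SQL access exists within minutes.
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class DynamicDdlSketch {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("dynamic-ddl")
        .enableHiveSupport() // register tables in the Hive metastore
        .getOrCreate();

    // Schema is dictated by the producers' JSON payloads, not hand-written DDL
    Dataset<Row> events = spark.read().json("s3a://bucket/raw/events/");
    events.printSchema();

    // "Dynamic DDL": materialize the inferred structure as a SQL-accessible table
    events.write().mode(SaveMode.Append).saveAsTable("events_raw");
  }
}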
Bridle your Flying Islands and Castles in the Sky: Built-in Governance and Se... (DataWorks Summit)
Today enterprises want to move more and more of their data lakes to the cloud to help them execute faster, increase productivity, and drive innovation while leveraging the scale and flexibility of the cloud. However, such gains come with risks and challenges in the areas of data security, privacy, and governance. In this talk we cover how enterprises can overcome governance and security obstacles to take advantage of the ease of data lake management that the cloud provides. We will also show how an enterprise can apply consistent governance and security controls to its ephemeral analytic workloads in a multi-cluster cloud environment without sacrificing any of the data security and privacy/compliance needs that its business context demands. Additionally, we will outline use cases, patterns, and best practices for rationally managing such a multi-cluster data lake infrastructure in the cloud.
Big Data Day LA 2015 - NoSQL: Doing it wrong before getting it right by Lawre... (Data Con LA)
The team at Fandango heartily embraced NoSQL, using Couchbase to power a key media publishing system. The initial implementation was fraught with integration issues and high latency, and required a major effort to successfully refactor. My talk will outline the key organizational and architectural decisions that created deep systemic problems, and the steps taken to re-architect the system to achieve a high level of performance at scale.
How to Use Innovative Data Handling and Processing Techniques to Drive Alpha ... (DataWorks Summit)
For over 30 years, Parametric has been a leading provider of model-based portfolios to institutional and private investors, with unique implementation and customization expertise. Much like other cutting-edge financial services providers, Parametric operates with highly diverse, fast-moving data from which it gleans insights. Data sources range from benchmark providers to electronic trading participants to stock exchanges. The challenge is not just to onboard the data but also to figure out how to monetize it when the schemas are fast-changing. This presents a problem for traditional architectures, where large teams are needed to design the new ETL flow. Organizations that are able to quickly adapt to new schemas and data sources have a distinct competitive advantage.
In this presentation and demo, architects from Parametric, Chris Gambino & Vamsi Chemitiganti, will present the data architecture designed in response to this business challenge. We discuss the approach (and trade-offs) to pooling, managing, and processing the data using the latest techniques in data ingestion & pre-processing. Overall best practices in creating a central data pool are also discussed. The goal is for quantitative analysts to have the most accurate and up-to-date information for their models to work on. Attendees will be able to draw on their experiences, both from a business and a technology standpoint, on not just creating a centralized data platform but also being able to distribute it to different units.
Event: Passcamp, 07.12.2017
Speaker: Stefan Kirner
More tech talks: https://www.inovex.de/de/content-pool/vortraege/
More tech articles: https://www.inovex.de/blog
A Thorough Comparison of Delta Lake, Iceberg and Hudi (Databricks)
Recently a set of modern table formats such as Delta Lake, Hudi, and Iceberg has sprung up. Along with the Hive Metastore, these table formats are trying to solve problems that have stood in the traditional data lake for a long time, with declared features like ACID transactions, schema evolution, upsert, time travel, and incremental consumption.
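As one concrete taste of those features, time travel in Delta Lake (the other formats have analogous mechanisms) can be sketched like this; the table path is a placeholder:
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class TimeTravelSketch {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().appName("tt").getOrCreate();
    // Read the table as of an earlier version -- Delta keeps a transaction log,
    // so old snapshots stay addressable until they are vacuumed away
    Dataset<Row> asOfV0 = spark.read()
        .format("delta")
        .option("versionAsOf", 0)
        .load("/data/events");
    asOfV0.show();
  }
}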
Cask Webinar
Date: 08/10/2016
Link to video recording: https://www.youtube.com/watch?v=XUkANr9iag0
In this webinar, Nitin Motgi, CTO of Cask, walks through the new capabilities of CDAP 3.5 and explains how your organization can benefit.
Some of the highlights include:
- Enterprise-grade security - Authentication, authorization, secure keystore for storing configurations. Plus integration with Apache Sentry and Apache Ranger.
- Preview mode - Ability to preview and debug data pipelines before deploying them.
- Joins in Cask Hydrator - Capabilities to join multiple data sources in data pipelines
- Real-time pipelines with Spark Streaming - Drag & drop real-time pipelines using Spark Streaming.
- Data usage analytics - Ability to report application usage of data sets.
- And much more!
Reducing Development Time for Production-Grade Hadoop Applications (Cascading)
Ryan Desmond's presentation at the Cascading Meetup on August 27, 2015. A brief overview of Cascading to give a basic understanding to Clojure users who might use PigPen & Clojure to access Cascading.
QuerySurge Slide Deck for Big Data Testing WebinarRTTS
This is a slide deck from QuerySurge's Big Data Testing webinar.
Learn why testing is pivotal to the success of your Big Data strategy.
Learn more at www.querysurge.com
The growing variety of new data sources is pushing organizations to look for streamlined ways to manage complexities and get the most out of their data-related investments. The companies that do this correctly are realizing the power of big data for business expansion and growth.
Learn why testing your enterprise's data is pivotal for success with big data, Hadoop and NoSQL. Learn how to increase your testing speed, boost your testing coverage (up to 100%), and improve the level of quality within your data warehouse - all with one ETL testing tool.
This information is geared towards:
- Big Data & Data Warehouse Architects
- ETL Developers
- ETL Testers, Big Data Testers
- Data Analysts
- Operations teams
- Business Intelligence (BI) Architects
- Data Management Officers & Directors
You will learn how to:
- Improve your Data Quality
- Accelerate your data testing cycles
- Reduce your costs & risks
- Provide a huge ROI (as high as 1,300%)
This topic introduces the need for a distinct architecture style for Cloud Native application deployments. Further, the fit of DevOps, the usage of Microservices, and the runtime of Cloud Native applications (*-as-a-Service) are covered in detail. The need for distributed computing in the cloud for Cloud Native applications is easy to understand, and insights on it are covered as well.
How to build "AutoScale and AutoHeal" systems using DevOps practices and modern technologies.
A complete build pipeline and the process of architecting a nearly unbreakable system were part of the presentation.
These slides were presented at 2018 DevOps conference in Singapore. http://claridenglobal.com/conference/devops-sg-2018/
Managing data analytics in a hybrid cloud (Karan Singh)
We’ll talk about the changes in the industry that customers are faced with and how Red Hat Hyperconverged Infrastructure can address those challenges. Our customers are struggling not only to manage the growth of big data (structured and unstructured), but also to reap timely business insights from their data using their existing data infrastructure, like monolithic Hadoop clusters. This often drives alternative approaches that lead to disappointing results.
Data Saturday Malta - ADX Azure Data Explorer overview (Riccardo Zamana)
This is a step-by-step approach to the entire ecosystem of features driven by Azure Data Explorer. You can find many examples using the Kusto dialect, in order to acquire data, process it, and build complete web interfaces using only one service: ADX.
Developing Enterprise Consciousness: Building Modern Open Data Platforms (ScyllaDB)
ScyllaDB, alongside some of the other major distributed real-time technologies, gives businesses a unique opportunity to achieve enterprise consciousness: a business platform that delivers data to the people that need it, when they need it, anywhere.
This talk covers how modern tools in the open data platform can help companies synchronize data across their applications using open source technologies and modern low-code ETL/Reverse-ETL tools.
Topics:
- Business Platform Challenges
- What Enterprise Consciousness Solves
- How ScyllaDB Empowers Enterprise Consciousness
- What can ScyllaDB do for big companies
- What can ScyllaDB do for smaller companies
Testistanbul 2016 - Keynote: "Performance Testing of Big Data" by Roland Leusden (Turkish Testing Board)
Agile, Continuous Integration, DevOps, and Big Data are no longer buzzwords but part of the day-to-day process of everyone working in software development and delivery. To cope with applications that need to be deployed to production almost the same moment they were created, software development has changed, impacting the way of working for everyone on the team. In this talk, Roland will discuss the challenges performance testers face with Big Data applications and how architecture, Agile, Continuous Integration, and DevOps come together to create solutions.
10 Big Data Technologies you Didn't Know About (Jesus Rodriguez)
This session covers 9 new and exciting big data technologies that are starting to become relevant in the enterprise. The session focuses on technologies that are still not mainstream but that have the potential to influence the next generation of enterprise big data solutions.
Similar to Elasticsearch + Cascading for Scalable Log Processing
Predicting Hospital Readmission Using Cascading (Cascading)
Michael Covert will examine how Healthcare Providers are finding ways to use Big Data analytics to reduce readmission rates and improve operational efficiency while complying with regulatory mandates.
We (Concurrent) conducted a survey of Cascading users. The Cascading community is one of the most mature Hadoop development communities, with the majority having over 3 years' experience. See what they are using, why they are using it, and what future challenges they anticipate.
Breathe new life into your data warehouse by offloading ETL processes to Hadoop (Cascading)
Breathe New Life Into Your Data Warehouse by Offloading ETL on Hadoop
Offloading ETL processes to Hadoop is often one of the first Big Data efforts many organizations take on, especially enterprises. The reason: the return on investment (ROI) for these projects is quantifiable.
The challenge is how to quickly migrate your legacy ETL processes to Hadoop so you can realize these benefits and achieve ROI.
In this 30-min webinar, you will learn how to jump start your ETL offloading project so you can:
- Port the majority of existing ETL processes to Hadoop with little to no changes
- Reduce development time of new ETL process on Hadoop by up to 50%
- Optimize ETL performance by choosing the right compute fabric to get the job done
- Monitor your ETL processes in real-time to ensure jobs complete successfully and service levels are met
You should attend this webinar if you are responsible for your organization's data warehouse, middleware, big data infrastructure, or information management systems.
How To Get Hadoop App Intelligence with Driven (Cascading)
You built Cascading/Scalding apps to mine all that data you collected in Hadoop. But just when you were seeing results, something went wrong — the app broke, data flows stopped, and business came to a halt.
So what do you do next? How do you find out what went wrong in the shortest time possible? How do you pinpoint the line of code where the error occurred? How do you know which SLA is going to be impacted? How do you view the lineage of data to adhere to compliance requirements?
In this presentation, we show you how to easily find the answers with Driven, the most comprehensive Big Data App Performance Management Platform.
Furthermore, this presentation describes how Driven can help you build higher quality big data apps; run big data apps more reliably; and manage big data apps more effectively.
Who should view this PPT: Any person or organization that is currently involved in planning, deploying or managing a Hadoop application infrastructure.
7 Best Practices for Achieving Operational Readiness on Hadoop with Driven an... (Cascading)
This video dives into 7 best practices for how IT organizations can achieve true operational readiness on Hadoop using Driven and Cascading.
For any person, organization, or enterprise currently involved in planning, deploying, or managing a Hadoop infrastructure: development teams, IT ops, executive management.
Key Takeaways:
- Connecting execution problems with application context
- Defining and enforcing SLAs
- Understanding inter-app dependencies
- Rationing your cluster
- Tracing data access at the operational level
- Building culture and tools supporting collaboration between developers, operators, & other Hadoop team members
The Cascading (big) data application framework - André Kelpe, Sr. Engineer, C... (Cascading)
André Kelpe's presentation at Hadoop User Group France - 25.11.2014.
Abstract: Cascading is a widely deployed, production-ready open source data application framework geared towards Java developers. Cascading enables developers to write complex data applications without the need to become distributed systems experts. Cascading apps are portable between different computation frameworks, so that a given application can be moved from Hadoop onto new processing platforms like Apache Tez or Apache Spark without rewriting any of the application code.
Cascading - A Java Developer’s Companion to the Hadoop World (Cascading)
Presentation by Dhruv Kumar, Sr. Field Engineer at Concurrent.
Amid all the hype and investment around Big Data technologies, many Java software engineers are asking what it takes to become a big data engineer: as a Java professional, toward which path shall I steer my career?
Join Dhruv Kumar as he introduces Cascading, an open source application development framework that allows Java developers to build applications on top of Hadoop through its Java API. We’ll provide an overview of the application development landscape on Hadoop and explain why Cascading has become so popular, comparing it to other abstractions such as Pig and Hive. Dhruv will also show you how Java developers can easily get started building applications on Hadoop, with live examples of good ol' Java code.
Introduction to Cascading by Bryce Lohr
Presentation on Cascading delivered at the Triad Hadoop Users Group. This presentation provides a brief introduction to Cascading, a Java library for developing scalable Map/Reduce applications on Hadoop.
Bryce Lohr is a software developer at Inmar, focused on developing data analysis applications using Hadoop and related technologies.
https://www.linkedin.com/pub/bryce-lohr/3/589/225
GraphRAG is All You need? LLM & Knowledge Graph (Guy Korland)
Guy Korland, CEO and Co-founder of FalkorDB, will review two articles on the integration of language models with knowledge graphs.
1. Unifying Large Language Models and Knowledge Graphs: A Roadmap.
https://arxiv.org/abs/2306.08302
2. Microsoft Research's GraphRAG paper and a review paper on various uses of knowledge graphs:
https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
PHP Frameworks: I want to break free (IPC Berlin 2024) (Ralf Eggert)
In this presentation, we examine the challenges and limitations of relying too heavily on PHP frameworks in web development. We discuss the history of PHP and its frameworks to understand how this dependence has evolved. The focus will be on providing concrete tips and strategies to reduce reliance on these frameworks, based on real-world examples and practical considerations. The goal is to equip developers with the skills and knowledge to create more flexible and future-proof web applications. We'll explore the importance of maintaining autonomy in a rapidly changing tech landscape and how to make informed decisions in PHP development.
This talk is aimed at encouraging a more independent approach to using PHP frameworks, moving towards a more flexible and future-proof approach to PHP development.
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf (91mobiles)
91mobiles recently conducted a Smart TV Buyer Insights Survey in which we asked over 3,000 respondents about the TV they own, aspects they look at on a new TV, and their TV buying preferences.
UiPath Test Automation using UiPath Test Suite series, part 4 (DianaGray10)
Welcome to UiPath Test Automation using UiPath Test Suite series part 4. In this session, we will cover Test Manager overview along with SAP heatmap.
The UiPath Test Manager overview with SAP heatmap webinar offers a concise yet comprehensive exploration of the role of a Test Manager within SAP environments, coupled with the utilization of heatmaps for effective testing strategies.
Participants will gain insights into the responsibilities, challenges, and best practices associated with test management in SAP projects. Additionally, the webinar delves into the significance of heatmaps as a visual aid for identifying testing priorities, areas of risk, and resource allocation within SAP landscapes. Through this session, attendees can expect to enhance their understanding of test management principles while learning practical approaches to optimize testing processes in SAP environments using heatmap visualization techniques
What will you get from this session?
1. Insights into SAP testing best practices
2. Heatmap utilization for testing
3. Optimization of testing processes
4. Demo
Topics covered:
Execution from the test manager
Orchestrator execution result
Defect reporting
SAP heatmap example with demo
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Accelerate your Kubernetes clusters with Varnish Caching (Thijs Feryn)
A presentation about the usage and availability of Varnish on Kubernetes. This talk explores the capabilities of Varnish caching and shows how to use the Varnish Helm chart to deploy it to Kubernetes.
This presentation was delivered at K8SUG Singapore. See https://feryn.eu/presentations/accelerate-your-kubernetes-clusters-with-varnish-caching-k8sug-singapore-28-2024 for more details.
Connector Corner: Automate dynamic content and events by pushing a button (DianaGray10)
Here is something new! In our next Connector Corner webinar, we will demonstrate how you can use a single workflow to:
Create a campaign using Mailchimp with merge tags/fields
Send an interactive Slack channel message (using buttons)
Have the message received by managers and peers along with a test email for review
But there’s more:
In a second workflow supporting the same use case, you’ll see:
Your campaign sent to target colleagues for approval
If the “Approve” button is clicked, a Jira/Zendesk ticket is created for the marketing design team
But—if the “Reject” button is pushed, colleagues will be alerted via Slack message
Join us to learn more about this new, human-in-the-loop capability, brought to you by Integration Service connectors.
Speakers:
Akshay Agnihotri, Product Manager
Charlie Greenberg, Host
Essentials of Automations: Optimizing FME Workflows with Parameters (Safe Software)
Are you looking to streamline your workflows and boost your projects’ efficiency? Do you find yourself searching for ways to add flexibility and control over your FME workflows? If so, you’re in the right place.
Join us for an insightful dive into the world of FME parameters, a critical element in optimizing workflow efficiency. This webinar marks the beginning of our three-part “Essentials of Automation” series. This first webinar is designed to equip you with the knowledge and skills to utilize parameters effectively: enhancing the flexibility, maintainability, and user control of your FME projects.
Here’s what you’ll gain:
- Essentials of FME Parameters: Understand the pivotal role of parameters, including Reader/Writer, Transformer, User, and FME Flow categories. Discover how they are the key to unlocking automation and optimization within your workflows.
- Practical Applications in FME Form: Delve into key user parameter types including choice, connections, and file URLs. Allow users to control how a workflow runs, making your workflows more reusable. Learn to import values and deliver the best user experience for your workflows while enhancing accuracy.
- Optimization Strategies in FME Flow: Explore the creation and strategic deployment of parameters in FME Flow, including the use of deployment and geometry parameters, to maximize workflow efficiency.
- Pro Tips for Success: Gain insights on parameterizing connections and leveraging new features like Conditional Visibility for clarity and simplicity.
We’ll wrap up with a glimpse into future webinars, followed by a Q&A session to address your specific questions surrounding this topic.
Don’t miss this opportunity to elevate your FME expertise and drive your projects to new heights of efficiency.
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Elasticsearch + Cascading for Scalable Log Processing
1. DRIVING INNOVATION THROUGH DATA
LARGE-SCALE LOG PROCESSING WITH CASCADING & LOGSTASH
Elasticsearch Meetup, Oct 30 2014
2. WHAT IS LOG FILE ANALYTICS?
• Making sense of large amounts of [semi|un]structured data
• What type of log file data?
‣ Syslog
‣ Web log files (Apache, Nginx, WebTrends, Omniture)
‣ POS transactions
‣ Advertising impressions (Doubleclick DART, OpenX, Atlas)
‣ Twitter firehose (yes, it’s a log file!)
• Anything with a timestamp and data
3. LOGSTASH ARCHITECTURE
(Diagram source: http://www.slashroot.in/logstash-tutorial-linux-central-logging-server)
• Data collection is flexible
• Lots of input/output plugins
• Grok filtering is easy
• Kibana UI is attractive
4. WHAT CAN WE DO WITH CASCADING + LOGSTASH?
• Provide richer log-processing capabilities
• Integrate & correlate with other information
‣ Large list of integration adapters
• Analyze large volumes of log data
• Capture & retain unfiltered log data
• Operationalize your log-processing application
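As a small, hedged taste of the first bullet (not from the deck; the field names and regex are illustrative, and imports follow the word-count demo on slide 9 plus cascading.operation.regex.RegexParser), here is a Cascading assembly that parses Apache access-log lines and counts hits per status code:
// Sketch: parse Apache access-log lines shipped by Logstash, count status codes
Fields apacheFields = new Fields( "ip", "time", "method", "url", "status", "size" );
String apacheRegex =
  "^([^ ]*) \\S+ \\S+ \\[([\\w:/]+\\s[+\\-]\\d{4})\\] \"(\\S+) (\\S+) \\S+\" (\\d{3}) (\\d+)";
RegexParser parser = new RegexParser( apacheFields, apacheRegex, new int[]{ 1, 2, 3, 4, 5, 6 } );

Pipe logPipe = new Each( "logs", new Fields( "line" ), parser, Fields.RESULTS );
Pipe statusPipe = new GroupBy( logPipe, new Fields( "status" ) );
statusPipe = new Every( statusPipe, Fields.ALL, new Count(), Fields.ALL );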
5. GET TO KNOW CONCURRENT
Leader in Application Infrastructure for Big Data
• Building enterprise software to simplify Big Data application development and management
Products and Technology
• CASCADING: Open source. The most widely used application infrastructure for building Big Data apps, with over 175,000 downloads each month
• DRIVEN: Enterprise data application management for Big Data apps
Proven — Simple, Reliable, Robust
• Thousands of enterprises rely on Concurrent to provide their data application infrastructure.
Founded: 2008
HQ: San Francisco, CA
CEO: Gary Nakamura
CTO, Founder: Chris Wensel
www.concurrentinc.com
6. CASCADING - DE-FACTO STANDARD FOR DATA APPS
• Standard for enterprise data app development
• Your programming language of choice
• Cascading applications that run on MapReduce will also run on Apache Spark, Storm, and …
[Diagram: Cascading apps (SQL, Clojure, Ruby) over supported fabrics and data stores (mainframe, DB/DW, in-memory data stores, Hadoop) and new fabrics (Tez, Storm)]
7. CASCADING 3.0
“Write once and deploy on your fabric of choice.”
• The Innovation — Cascading 3.0 will allow data apps to execute on existing and emerging fabrics through its new customizable query planner.
• Cascading 3.0 will support — Local In-Memory, Apache MapReduce, and soon thereafter (3.1) Apache Tez, Apache Spark, and Apache Storm
[Diagram: enterprise data applications running on computation fabrics: Local In-Memory, MapReduce, Apache Tez, Storm]
8. … AND INCLUDES RICH SET OF EXTENSIONS
http://www.cascading.org/extensions/
9. DEMO: WORD COUNT EXAMPLE WITH CASCADING
import java.util.Properties;

import cascading.flow.Flow;
import cascading.flow.FlowDef;
import cascading.flow.hadoop.HadoopFlowConnector;
import cascading.operation.aggregator.Count;
import cascading.operation.regex.RegexSplitGenerator;
import cascading.pipe.Each;
import cascading.pipe.Every;
import cascading.pipe.GroupBy;
import cascading.pipe.Pipe;
import cascading.property.AppProps;
import cascading.scheme.hadoop.TextDelimited;
import cascading.tap.Tap;
import cascading.tap.hadoop.Hfs;
import cascading.tuple.Fields;

// configuration
String docPath = args[ 0 ];
String wcPath = args[ 1 ];
Properties properties = new Properties();
AppProps.setApplicationJarClass( properties, Main.class );
HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );

// integration: create source and sink taps (tab-delimited text on HDFS)
Tap docTap = new Hfs( new TextDelimited( true, "\t" ), docPath );
Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), wcPath );

// processing: specify a regex to split "document" text lines into a token stream
Fields token = new Fields( "token" );
Fields text = new Fields( "text" );
RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ \\[\\]\\(\\),.]" );
// only returns "token"
Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS );
// determine the word counts
Pipe wcPipe = new Pipe( "wc", docPipe );
wcPipe = new GroupBy( wcPipe, token );
wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );

// scheduling: connect the taps, pipes, etc., into a flow definition
FlowDef flowDef = FlowDef.flowDef().setName( "wc" )
  .addSource( docPipe, docTap )
  .addTailSink( wcPipe, wcTap );
// create the Flow
Flow wcFlow = flowConnector.connect( flowDef ); // <<-- Unit of Work
wcFlow.complete(); // <<-- Runs jobs on Cluster
10. SOME COMMON PATTERNS
• Functions
• Filters
• Joins
‣ Inner / Outer / Mixed
‣ Asymmetrical / Symmetrical
• Merge (Union)
• Grouping
‣ Secondary Sorting
‣ Unique (Distinct)
• Aggregations
‣ Count, Average, etc.
[Diagram: data pipeline topologies: functions and filters chained in a pipeline, plus split, join, and merge steps]
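A hedged sketch of the join and merge patterns above, in the same style as the word-count demo (pipe, field, and key names are invented; imports as on slide 9 plus cascading.pipe.CoGroup, cascading.pipe.Merge, and cascading.pipe.joiner.InnerJoin):
// Inner join: correlate two streams on a key (field names are illustrative)
Pipe impressions = new Pipe( "impressions" );
Pipe clicks = new Pipe( "clicks" );
Pipe joined = new CoGroup( impressions, new Fields( "adId" ),
    clicks, new Fields( "clickAdId" ), new InnerJoin() );

// Merge (union): combine two like-schemed streams, then aggregate
Pipe web = new Pipe( "web" );
Pipe mobile = new Pipe( "mobile" );
Pipe merged = new Merge( web, mobile );
Pipe counts = new GroupBy( merged, new Fields( "url" ) );
counts = new Every( counts, Fields.ALL, new Count(), Fields.ALL );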
11. THE STANDARD FOR DATA APPLICATION DEVELOPMENT
www.cascading.org
Proven application development framework for building data apps. An application platform that addresses:
• Build data apps that are scale-free: design principles ensure best practices at any scale
• Test-Driven Development: efficiently test code and process local files before deploying on a cluster
• Staffing bottleneck: use existing Java, SQL, modeling skill sets
• Application portability: write once, then run on different computation fabrics
• Operational complexity: simple - package up into one jar and hand to operations
• Systems integration: Hadoop never lives alone; easily integrate with existing systems
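To make the test-driven point concrete, a minimal hedged sketch assuming the word-count assembly from slide 9 (docPipe/wcPipe) and Cascading's local mode (cascading.flow.local.LocalFlowConnector, cascading.tap.local.FileTap, cascading.scheme.local.TextDelimited); the file paths are illustrative:
// Local-mode taps and connector for fast, cluster-free tests
Tap localDocTap = new FileTap( new cascading.scheme.local.TextDelimited( true, "\t" ), "data/sample.tsv" );
Tap localWcTap = new FileTap( new cascading.scheme.local.TextDelimited( true, "\t" ), "output/wc" );

Flow localFlow = new LocalFlowConnector().connect( FlowDef.flowDef()
    .setName( "wc-local" )
    .addSource( docPipe, localDocTap )
    .addTailSink( wcPipe, localWcTap ) );
localFlow.complete(); // runs in-process, ideal for unit tests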
12. CASCADING
• Java API
• Separates business logic from integration
• Testable at every lifecycle stage
• Works with any JVM language
• Many integration adapters
[Diagram: Cascading architecture: Processing and Integration APIs over a Process Planner and Scheduler API, running on Apache Hadoop against data stores; scripting via Scala, Clojure, JRuby, Jython, Groovy, and enterprise Java]
13. BUSINESSES DEPEND ON US
• Cascading Java API
• Data normalization and cleansing of search and click-through logs for use by analytics tools and Hive analysts
• Easy to operationalize heavy lifting of data in one framework
16. OPERATIONAL EXCELLENCE
Visibility Through All Stages of App Lifecycle
From Development — Building and Testing
• Design & Development
• Debugging
• Tuning
To Production — Monitoring and Tracking
• Maintain Business SLAs
• Balance & Controls
• Application and Data Quality
• Operational Health
• Real-time Insights
18. DEEPER VISUALIZATION INTO YOUR HADOOP CODE
• Easily comprehend, debug, and tune your data applications
• Get rich insights on your application performance
• Monitor applications in real-time
• Compare app performance with historical (previous) iterations
Debug and optimize your Hadoop applications more effectively with Driven
19. GET OPERATIONAL INSIGHTS WITH DRIVEN
• Quickly break down how often applications execute based on their tags, teams, or names
• Immediately identify if any application is monopolizing cluster resources
• Understand the utilization of your cluster with a timeline of all applications running
Visualize the activity of your applications to help maintain SLAs
20. ORGANIZE YOUR APPLICATIONS WITH GREATER FIDELITY
• Easily keep track of all your applications by segmenting them with user-defined tags
• Segment your applications for trending analysis, cluster analysis, and developing chargeback models
• Quickly break down how often applications execute based on their tags, teams, or names
Segment your applications for greater insights across all your applications
21. COLLABORATE WITH TEAMS
Utilize teams to collaborate and gain visibility over your set of applications
• Invite others to view and collaborate on a specific application
• Gain visibility to all the apps and their owners associated with each team
• Simply manage your teams and the users assigned to them
22. MANAGE PORTFOLIO OF BIG DATA APPLICATIONS
Fast, powerful, rich search capabilities enable you to easily find the exact set of applications that you’re looking for
• Identify problematic apps with their owners and teams
• Search for groups of applications segmented by user-defined tags
• Compare specific applications with their previous iterations to ensure that your application can meet its SLAs
23. DRIVEN FOR HIVE: OPERATIONAL VISIBILITY FOR YOUR HIVE APPS
• Understand the anatomy of your Hive app
• Track execution of queries as single business process
• Identify outlier behavior by comparison with historical runs
• Analyze rich operational meta-data
• Correlate Hive app behavior with other events on cluster
24. TAKE AWAY POINTS
• Logstash provides a flexible and robust way to collect log data; Grok lets you parse logs without coding; the Kibana UI is an attractive way to analyze the information
• Cascading is the de-facto framework for building Big Data (Hadoop) applications and processing data at scale
• Cascading + Logstash lets you develop applications to collect and process large volumes of data
• With Driven, you can put your mission-critical log-processing applications in production and monitor SLAs