In April 2015, Apache Geode (incubating) was born from Pivotal’s GemFire, the distributed in-memory database. However, the donation of over 1M LOC was just the beginning of the journey. In this talk we discuss how the GemFire engineering team has adapted their development infrastructure, processes, and culture to embrace the “Apache Way". We present lessons learned and best practices for new and incubating open source projects in areas of initial code submission, IP clearance, governance policies, code review, and community building. We discuss the challenges the team faced and how we changed internal communication and software design processes to a community-driven model. In particular, we highlight effective strategies for growing a project community and embracing new members. Finally, we show how changing to the open source model has increased both productivity and quality.
Apache Geode provides a database-like consistency model, reliable transaction processing and a shared-nothing architecture to maintain very low latency performance with high concurrency processing.
Building Apps with Distributed In-Memory Computing Using Apache GeodePivotalOpenSourceHub
Slides from the Meetup Monday March 7, 2016 just before the beginning of #GeodeSummit, where we cover an introduction of the technology and community that is Apache Geode, the in-memory data grid.
In April 2015, Apache Geode (incubating) was born from Pivotal’s GemFire, the distributed in-memory database. However, the donation of over 1M LOC was just the beginning of the journey. In this talk we discuss how the GemFire engineering team has adapted their development infrastructure, processes, and culture to embrace the “Apache Way". We present lessons learned and best practices for new and incubating open source projects in areas of initial code submission, IP clearance, governance policies, code review, and community building. We discuss the challenges the team faced and how we changed internal communication and software design processes to a community-driven model. In particular, we highlight effective strategies for growing a project community and embracing new members. Finally, we show how changing to the open source model has increased both productivity and quality.
Apache Geode provides a database-like consistency model, reliable transaction processing and a shared-nothing architecture to maintain very low latency performance with high concurrency processing.
Building Apps with Distributed In-Memory Computing Using Apache GeodePivotalOpenSourceHub
Slides from the Meetup Monday March 7, 2016 just before the beginning of #GeodeSummit, where we cover an introduction of the technology and community that is Apache Geode, the in-memory data grid.
Development of concurrent services using In-Memory Data Gridsjlorenzocima
As part of OTN Tour 2014 believes this presentation which is intented for covers the basic explanation of a solution of IMDG, explains how it works and how it can be used within an architecture and shows some use cases. Enjoy
Apache Flink is a popular stream computing framework for real-time stream computing. Many stream compute algorithms require trailing data in order to compute the intended result. One example is computing the number of user logins in the last 7 days. This creates a dilemma where the results of the stream program are incomplete until the runtime of the program exceeds 7 days. The alternative is to bootstrap the program using historic data to seed the state before shifting to use real-time data.
This talk will discuss alternatives to bootstrap programs in Flink. Some alternatives rely on technologies exogenous to the stream program, such as enhancements to the pub/sub layer, that are more generally applicable to other stream compute engines. Other alternatives include enhancements to Flink source implementations. Lyft is exploring another alternative using orchestration of multiple Flink programs. The talk will cover why Lyft pursued this alternative and future directions to further enhance bootstrapping support in Flink.
Speaker
Gregory Fee, Principal Engineer, Lyft
Realizing the promise of portable data processing with Apache BeamDataWorks Summit
The world of big data involves an ever changing field of players. Much as SQL stands as a lingua franca for declarative data analysis, Apache Beam aims to provide a portable standard for expressing robust, out-of-order data processing pipelines in a variety of languages across a variety of platforms. In a way, Apache Beam is a glue that can connect the Big Data ecosystem together; it enables users to "run-anything-anywhere".
This talk will briefly cover the capabilities of the Beam model for data processing, as well as the current state of the Beam ecosystem. We'll discuss Beam architecture and dive into the portability layer. We'll offer a technical analysis of the Beam's powerful primitive operations that enable true and reliable portability across diverse environments. Finally, we'll demonstrate a complex pipeline running on multiple runners in multiple deployment scenarios (e.g. Apache Spark on Amazon Web Services, Apache Flink on Google Cloud, Apache Apex on-premise), and give a glimpse at some of the challenges Beam aims to address in the future.
Apache Hive is a rapidly evolving project which continues to enjoy great adoption in big data ecosystem. Although, Hive started primarily as batch ingestion and reporting tool, community is hard at work in improving it along many different dimensions and use cases. This talk will provide an overview of latest and greatest features and optimizations which have landed in project over last year. Materialized view, micro managed tables and workload management are some noteworthy features.
I will deep dive into some optimizations which promise to provide major performance gains. Support for ACID tables has also improved considerably. Although some of these features and enhancements are not novel but have existed for years in other DB systems, implementing them on Hive poses some unique challenges and results in lessons which are generally applicable in many other contexts. I will also provide a glimpse of what is expected to come in near future.
Speaker: Ashutosh Chauhan, Engineering Manager, Hortonworks
Real-time Freight Visibility: How TMW Systems uses NiFi and SAM to create sub...DataWorks Summit
TMW Systems, A TRIMBLE Company, is the industry-leading transportation management software. 3PLs, brokers, distribution and supply operations, dedicated and private fleets, commercial carriers, and energy service providers rely on our transportation management systems, our fleet maintenance management software, or our routing and scheduling software to make them more efficient and profitable. Billions of data points exist in the trucking industry, and we at TMW Systems are pioneers of tracking millions of trucks, freights, and assets.
The architecture team at TMW leverages Nifi and SAM to deliver the immense volume of data in real-time. In this session, you will get a thorough understanding of all the streaming components. We have utilized Apache Kafka, Apache Nifi, and Streaming Analytics Manager to build our real-time data pipeline. We will also discuss the real-time event processing using SAM and Schema Registry. Lastly, we will show custom processors in Nifi and SAM that helped us with complex event processing.
Speaker
Krishna Potluri, TMW Systems, A Trimble Company, Big Data Architect
Donnie Wheat, Trimble, Senior Big Data Architect
An Introduction to Apache Geode (incubating)Anthony Baker
Geode is a data management platform that provides real-time, consistent access to data-intensive applications throughout widely distributed cloud architectures.
Geode pools memory (along with CPU, network and optionally local disk) across multiple processes to manage application objects and behavior. It uses dynamic replication and data partitioning techniques for high availability, improved performance, scalability, and fault tolerance. Geode is both a distributed data container and an in-memory data management system providing reliable asynchronous event notifications and guaranteed message delivery.
Pivotal GemFire has had a long and winding journey, starting in 2002, winding through VMware, Pivotal, and finding it's way to Apache in 2015. Companies using GemFire have deployed it in some of the most mission critical latency sensitive applications in their enterprises, making sure tickets are purchased in a timely fashion, hotel rooms are booked, trades are made, and credit card transactions are cleared. This presentation discusses:
- A brief history of GemFire
- Architecture and use cases
- Why we are taking GemFire Open Source
- Design philosophy and principles
But most importantly: how you can join this exciting community to work on the bleeding edge in-memory platform.
This webinar will give an overview of the typical use of EDB Postgres Advanced Server and EDB tools for a smart city project, developed in a city with more than 2.2 million people. The main goal of this project is to achieve a 24/7 uninterrupted service with zero data loss and without any service interruption.
During the session, we will explore the project architecture and we will discuss what specific tools were used, and how these tools help manage DBAs’ daily tasks. We will also discuss what type of data is critical for a smart city project.
Leveraging NoSQL Database Technology to Implement Real-time Data Architecture...Impetus Technologies
Impetus webcast "Leveraging NoSQL Database Technology to Implement Real-time Data Architectures” available at http://bit.ly/1g6Eaj4
This webcast:
• Presents trade-offs of using different approaches to achieve a real-time architecture
• Closely examines an implementation of a NoSQL based real-time architecture
• Shares specific capabilities offered by NoSQL Databases that enable cost and reliability advantages over other techniques
Apache Geode (incubating) is the core of Pivotal Gemfire now available as an open source project governed by Apache Software Foundation Incubator. The legacy of Pivotal Gemfire and the ASF community uniquely position Geode as a secret ingredient for modern-day data management architectures.
These types of architectures require a robust in-memory data grid solution to handle a variety of use cases, ranging from enterprise-wide caching to real-time transactional applications at scale. In addition, as memory size and network bandwidth growth continues to outpace those of disk, the importance of managing large pools of RAM at scale increases. It is essential to innovate at the same pace.
Apache Geode (incubating) has all the right ingredients to do for RAM what HDFS has done for direct attach disks. The excitement (and funding!) in this area of big data ecosystem is palpable, and the ASF is the place where the innovation is happening. Come to this session to understand: a brief history of Geode, architecture and use cases, design philosophy and principles, but most importantly: how you too can participate in the in-memory data center revolution.
Automating Postgres Deployments on AWS and VMware, with Terraform and AnsibleEDB
Deploying Postgres on AWS and VMware shouldn’t be a clunky, manual task. In 2020 and beyond, automation and Postgres will go hand in hand.
Highly available, fault-tolerant PostgreSQL database cluster creation and monitoring can now be deployed in a matter of minutes using Terraform and Ansible scripts. In this webinar, you will learn how developers, devOps, and DBAs can automate cloud and on-premise Postgres deployment with the help of Terraform and Ansible.
We will cover:
- Popular deployment options for Postgres
- Postgres platform database and toolsets
- Automating deployment of Postgres with high availability
- Automating deployment of fault-tolerant Postgres cluster
- Monitoring and managing Postgres at scale
- Demo for Terraform Postgres deployment scripts
New Approaches to Migrating from Oracle to Enterprise-Ready Postgres in the C...EDB
Join Marc Linster to learn how to build Oracle-compatible Postgres databases in the AWS cloud in minutes. Marc will cover:
- Getting started
- Migration strategies
- How to pick the migration targets
- Provisioning, scaling, and managing Postgres
- Creating highly available clusters with EDB's Cloud Database Service (CDS)
- A live migration of an Oracle database with schema, data and stored procedures to EDB Postgres
Microsoft has embraced OSS by placing a big bet on Apache YARN to govern the resources of our computing clusters, and we did so by working with the community and adding many new capabilities in YARN. We now look to undertake a similar journey and build the next generation of our job execution engine on top of Apache Tez. We will be building a common platform for executing batch, interactive, ML, and streaming queries at exabyte scale for Microsoft's BigData system called Cosmos. This requires us to push the limits of Tez API to support new graph models, change the executing DAG by dynamically adding new vertices, scheduling for interactive and streaming workloads, squeeze out all the computing power in the cluster by integrating Tez with opportunistic containers in YARN, and scaling a DAG across tens of thousands of machines. We have started out on this journey and want to share our progress, lessons learned, seek help from the community to add these new capabilities, and push Apache Tez to new levels.
SPEAKERS
Hitesh Sharma, Principal Software Engineering Manager, Microsoft Engineering manager in the Big Data team at Microsoft.
Anupam, Senior Software Engineer, Microsoft
Running secured Spark job in Kubernetes compute cluster and integrating with ...DataWorks Summit
This presentation will provide technical design and development insights to run a secured Spark job in Kubernetes compute cluster that accesses job data from a Kerberized HDFS cluster. Joy will show how to run a long-running machine learning or ETL Spark job in Kubernetes and to access data from HDFS using Kerberos Principal and Delegation token.
The first part of this presentation will unleash the design and best practices to deploy and run Spark in Kubernetes integrated with HDFS that creates on-demand multi-node Spark cluster during job submission, installing/resolving software dependencies (packages), executing/monitoring the workload, and finally disposing the resources at the end of job completion. The second part of this presentation covers the design and development details to setup a Spark+Kubernetes cluster that supports long-running jobs accessing data from secured HDFS storage by creating and renewing Kerberos delegation tokens seamlessly from end-user's Kerberos Principal.
All the techniques covered in this presentation are essential in order to set up a Spark+Kubernetes compute cluster that accesses data securely from distributed storage cluster such as HDFS in a corporate environment. No prior knowledge of any of these technologies is required to attend this presentation.
Speaker
Joy Chakraborty, Data Architect
Scale Out Your Big Data Apps: The Latest on Pivotal GemFire and GemFire XDVMware Tanzu
Companies across all industries and sizes are investing in strategic custom applications to enhance their competitive advantages. Developing these applications requires continuous improvement, based on insights gleaned from collecting and analyzing the data that they generate.
Big Data for high-performing, scalable and reliable applications requires a new set of tools and technologies. Pivotal GemFire is a distributed in-memory NoSQL data management solution for creating high-scale custom applications. Pivotal GemFire XD supports structured data as part the industry’s first Hadoop-based platform for creating closed loop analytics solutions – enabling businesses to continuously optimize real-time automation in their applications.
Development of concurrent services using In-Memory Data Gridsjlorenzocima
As part of OTN Tour 2014 believes this presentation which is intented for covers the basic explanation of a solution of IMDG, explains how it works and how it can be used within an architecture and shows some use cases. Enjoy
Apache Flink is a popular stream computing framework for real-time stream computing. Many stream compute algorithms require trailing data in order to compute the intended result. One example is computing the number of user logins in the last 7 days. This creates a dilemma where the results of the stream program are incomplete until the runtime of the program exceeds 7 days. The alternative is to bootstrap the program using historic data to seed the state before shifting to use real-time data.
This talk will discuss alternatives to bootstrap programs in Flink. Some alternatives rely on technologies exogenous to the stream program, such as enhancements to the pub/sub layer, that are more generally applicable to other stream compute engines. Other alternatives include enhancements to Flink source implementations. Lyft is exploring another alternative using orchestration of multiple Flink programs. The talk will cover why Lyft pursued this alternative and future directions to further enhance bootstrapping support in Flink.
Speaker
Gregory Fee, Principal Engineer, Lyft
Realizing the promise of portable data processing with Apache BeamDataWorks Summit
The world of big data involves an ever changing field of players. Much as SQL stands as a lingua franca for declarative data analysis, Apache Beam aims to provide a portable standard for expressing robust, out-of-order data processing pipelines in a variety of languages across a variety of platforms. In a way, Apache Beam is a glue that can connect the Big Data ecosystem together; it enables users to "run-anything-anywhere".
This talk will briefly cover the capabilities of the Beam model for data processing, as well as the current state of the Beam ecosystem. We'll discuss Beam architecture and dive into the portability layer. We'll offer a technical analysis of the Beam's powerful primitive operations that enable true and reliable portability across diverse environments. Finally, we'll demonstrate a complex pipeline running on multiple runners in multiple deployment scenarios (e.g. Apache Spark on Amazon Web Services, Apache Flink on Google Cloud, Apache Apex on-premise), and give a glimpse at some of the challenges Beam aims to address in the future.
Apache Hive is a rapidly evolving project which continues to enjoy great adoption in big data ecosystem. Although, Hive started primarily as batch ingestion and reporting tool, community is hard at work in improving it along many different dimensions and use cases. This talk will provide an overview of latest and greatest features and optimizations which have landed in project over last year. Materialized view, micro managed tables and workload management are some noteworthy features.
I will deep dive into some optimizations which promise to provide major performance gains. Support for ACID tables has also improved considerably. Although some of these features and enhancements are not novel but have existed for years in other DB systems, implementing them on Hive poses some unique challenges and results in lessons which are generally applicable in many other contexts. I will also provide a glimpse of what is expected to come in near future.
Speaker: Ashutosh Chauhan, Engineering Manager, Hortonworks
Real-time Freight Visibility: How TMW Systems uses NiFi and SAM to create sub...DataWorks Summit
TMW Systems, A TRIMBLE Company, is the industry-leading transportation management software. 3PLs, brokers, distribution and supply operations, dedicated and private fleets, commercial carriers, and energy service providers rely on our transportation management systems, our fleet maintenance management software, or our routing and scheduling software to make them more efficient and profitable. Billions of data points exist in the trucking industry, and we at TMW Systems are pioneers of tracking millions of trucks, freights, and assets.
The architecture team at TMW leverages Nifi and SAM to deliver the immense volume of data in real-time. In this session, you will get a thorough understanding of all the streaming components. We have utilized Apache Kafka, Apache Nifi, and Streaming Analytics Manager to build our real-time data pipeline. We will also discuss the real-time event processing using SAM and Schema Registry. Lastly, we will show custom processors in Nifi and SAM that helped us with complex event processing.
Speaker
Krishna Potluri, TMW Systems, A Trimble Company, Big Data Architect
Donnie Wheat, Trimble, Senior Big Data Architect
An Introduction to Apache Geode (incubating)Anthony Baker
Geode is a data management platform that provides real-time, consistent access to data-intensive applications throughout widely distributed cloud architectures.
Geode pools memory (along with CPU, network and optionally local disk) across multiple processes to manage application objects and behavior. It uses dynamic replication and data partitioning techniques for high availability, improved performance, scalability, and fault tolerance. Geode is both a distributed data container and an in-memory data management system providing reliable asynchronous event notifications and guaranteed message delivery.
Pivotal GemFire has had a long and winding journey, starting in 2002, winding through VMware, Pivotal, and finding it's way to Apache in 2015. Companies using GemFire have deployed it in some of the most mission critical latency sensitive applications in their enterprises, making sure tickets are purchased in a timely fashion, hotel rooms are booked, trades are made, and credit card transactions are cleared. This presentation discusses:
- A brief history of GemFire
- Architecture and use cases
- Why we are taking GemFire Open Source
- Design philosophy and principles
But most importantly: how you can join this exciting community to work on the bleeding edge in-memory platform.
This webinar will give an overview of the typical use of EDB Postgres Advanced Server and EDB tools for a smart city project, developed in a city with more than 2.2 million people. The main goal of this project is to achieve a 24/7 uninterrupted service with zero data loss and without any service interruption.
During the session, we will explore the project architecture and we will discuss what specific tools were used, and how these tools help manage DBAs’ daily tasks. We will also discuss what type of data is critical for a smart city project.
Leveraging NoSQL Database Technology to Implement Real-time Data Architecture...Impetus Technologies
Impetus webcast "Leveraging NoSQL Database Technology to Implement Real-time Data Architectures” available at http://bit.ly/1g6Eaj4
This webcast:
• Presents trade-offs of using different approaches to achieve a real-time architecture
• Closely examines an implementation of a NoSQL based real-time architecture
• Shares specific capabilities offered by NoSQL Databases that enable cost and reliability advantages over other techniques
Apache Geode (incubating) is the core of Pivotal Gemfire now available as an open source project governed by Apache Software Foundation Incubator. The legacy of Pivotal Gemfire and the ASF community uniquely position Geode as a secret ingredient for modern-day data management architectures.
These types of architectures require a robust in-memory data grid solution to handle a variety of use cases, ranging from enterprise-wide caching to real-time transactional applications at scale. In addition, as memory size and network bandwidth growth continues to outpace those of disk, the importance of managing large pools of RAM at scale increases. It is essential to innovate at the same pace.
Apache Geode (incubating) has all the right ingredients to do for RAM what HDFS has done for direct attach disks. The excitement (and funding!) in this area of big data ecosystem is palpable, and the ASF is the place where the innovation is happening. Come to this session to understand: a brief history of Geode, architecture and use cases, design philosophy and principles, but most importantly: how you too can participate in the in-memory data center revolution.
Automating Postgres Deployments on AWS and VMware, with Terraform and AnsibleEDB
Deploying Postgres on AWS and VMware shouldn’t be a clunky, manual task. In 2020 and beyond, automation and Postgres will go hand in hand.
Highly available, fault-tolerant PostgreSQL database cluster creation and monitoring can now be deployed in a matter of minutes using Terraform and Ansible scripts. In this webinar, you will learn how developers, devOps, and DBAs can automate cloud and on-premise Postgres deployment with the help of Terraform and Ansible.
We will cover:
- Popular deployment options for Postgres
- Postgres platform database and toolsets
- Automating deployment of Postgres with high availability
- Automating deployment of fault-tolerant Postgres cluster
- Monitoring and managing Postgres at scale
- Demo for Terraform Postgres deployment scripts
New Approaches to Migrating from Oracle to Enterprise-Ready Postgres in the C...EDB
Join Marc Linster to learn how to build Oracle-compatible Postgres databases in the AWS cloud in minutes. Marc will cover:
- Getting started
- Migration strategies
- How to pick the migration targets
- Provisioning, scaling, and managing Postgres
- Creating highly available clusters with EDB's Cloud Database Service (CDS)
- A live migration of an Oracle database with schema, data and stored procedures to EDB Postgres
Microsoft has embraced OSS by placing a big bet on Apache YARN to govern the resources of our computing clusters, and we did so by working with the community and adding many new capabilities in YARN. We now look to undertake a similar journey and build the next generation of our job execution engine on top of Apache Tez. We will be building a common platform for executing batch, interactive, ML, and streaming queries at exabyte scale for Microsoft's BigData system called Cosmos. This requires us to push the limits of Tez API to support new graph models, change the executing DAG by dynamically adding new vertices, scheduling for interactive and streaming workloads, squeeze out all the computing power in the cluster by integrating Tez with opportunistic containers in YARN, and scaling a DAG across tens of thousands of machines. We have started out on this journey and want to share our progress, lessons learned, seek help from the community to add these new capabilities, and push Apache Tez to new levels.
SPEAKERS
Hitesh Sharma, Principal Software Engineering Manager, Microsoft Engineering manager in the Big Data team at Microsoft.
Anupam, Senior Software Engineer, Microsoft
Running secured Spark job in Kubernetes compute cluster and integrating with ...DataWorks Summit
This presentation will provide technical design and development insights to run a secured Spark job in Kubernetes compute cluster that accesses job data from a Kerberized HDFS cluster. Joy will show how to run a long-running machine learning or ETL Spark job in Kubernetes and to access data from HDFS using Kerberos Principal and Delegation token.
The first part of this presentation will unleash the design and best practices to deploy and run Spark in Kubernetes integrated with HDFS that creates on-demand multi-node Spark cluster during job submission, installing/resolving software dependencies (packages), executing/monitoring the workload, and finally disposing the resources at the end of job completion. The second part of this presentation covers the design and development details to setup a Spark+Kubernetes cluster that supports long-running jobs accessing data from secured HDFS storage by creating and renewing Kerberos delegation tokens seamlessly from end-user's Kerberos Principal.
All the techniques covered in this presentation are essential in order to set up a Spark+Kubernetes compute cluster that accesses data securely from distributed storage cluster such as HDFS in a corporate environment. No prior knowledge of any of these technologies is required to attend this presentation.
Speaker
Joy Chakraborty, Data Architect
Scale Out Your Big Data Apps: The Latest on Pivotal GemFire and GemFire XDVMware Tanzu
Companies across all industries and sizes are investing in strategic custom applications to enhance their competitive advantages. Developing these applications requires continuous improvement, based on insights gleaned from collecting and analyzing the data that they generate.
Big Data for high-performing, scalable and reliable applications requires a new set of tools and technologies. Pivotal GemFire is a distributed in-memory NoSQL data management solution for creating high-scale custom applications. Pivotal GemFire XD supports structured data as part the industry’s first Hadoop-based platform for creating closed loop analytics solutions – enabling businesses to continuously optimize real-time automation in their applications.
Apache conbigdata2015 christiantzolov-federated sql on hadoop and beyond- lev...Christian Tzolov
Slides from ApacheCon BigData 2015 HAWQ/GEODE talk: http://sched.co/3zut
In the space of Big Data, two powerful data processing tools compliment each other. Namely HAWQ and Geode. HAWQ is a scalable OLAP SQL-on-Hadoop system, while Geode is OLTP like, in-memory data grid and event processing system. This presentation will show different integration approaches that allow integration and data exchange between HAWQ and Geode. Presentation will walking you through the implementation of the different Integration strategies demonstrating the power of combining various OSS technologies for processing bit and fast data. Presentation will touch upon OSS technologies like HAWQ, Geode, SpringXD, Hadoop and Spring Boot.
Building Wall St Risk Systems with Apache GeodeAndre Langevin
In this talk from the 2016 Apache Geode Summit, I discuss how Geode forms the core of many Wall Street derivative risk solutions. By externalizing risk from trading systems, Geode-based solutions provide cross-product risk management at speeds suitable for automated hedging, while simultaneously eliminating the back office costs associated with traditional trading system based solutions.
Infinispan Servers: Beyond peer-to-peer data gridsGalder Zamarreño
In this session, Infinispan developer Galder Zamarreño will:
- Provide a brief introduction to peer-to-peer and client/server architectures.
- Describe the benefits of using Infinispan in a client/server mode, particularly in cloud-style environments.
- Introduce the audience to Infinispan’s selection of server modules that provide varied access methods: REST and WebSocket for HTTP access, Memcached protocol access and Hot Rod, Infinispan’s very own highly efficient binary protocol which supports smart-clients.
- Demonstrate an Infinispan client/server example showing how geographically separated Infinispan data grids could be linked together via Hot Rod client/server modules in order to provide different disaster recovery strategies.
Keeping Infinispan In Shape: Highly-Precise, Scalable Data EvictionGalder Zamarreño
Java Collections Framework represents one of the key building blocks of any Java application. Although the standard JDK devoted a lot of attention to developing a coherent and easy to use collections framework one important aspect remains overlooked – collection element eviction. Collection memory footprint can not grow indefinitely because we would eventually run out of memory; we either have to remove elements from a collection or somehow periodically evict certain elements according to a chosen eviction algorithm. Since day one eviction has been a key feature of Infinispan, and in the latest 4.1 release thanks to event update batching, Infinispan has reduced the eviction overhead to such an extent that it hardly affects application performance. On top of that, Infinispan implements LIRS, a more precise eviction algorithm compared to the traditional LRU, making it the first open source project to implement this revolutionary algorithm in the data grid space. In this session, Galder and Vladimir will present to the details behind these changes, performance measurements and third-party use case testimonies.
How to use the WAN Gateway feature of Apache Geode to implement multi-site and active-active failover, disaster recovery, and global scale applications.
Scaling Spark Workloads on YARN - Boulder/Denver July 2015Mac Moore
Hortonworks Presentation at The Boulder/Denver BigData Meetup on July 22nd, 2015. Topic: Scaling Spark Workloads on YARN. Spark as a workload in a multi-tenant Hadoop infrastructure, scaling, cloud deployment, tuning.
Application development using the Oracle RAD Stack. Oracle REST Data Services, Oracle APEX, Oracle Database. Easy, Empowering, SQL Centric, Productive, Dependable, Lasting...
Cask Webinar
Date: 08/10/2016
Link to video recording: https://www.youtube.com/watch?v=XUkANr9iag0
In this webinar, Nitin Motgi, CTO of Cask, walks through the new capabilities of CDAP 3.5 and explains how your organization can benefit.
Some of the highlights include:
- Enterprise-grade security - Authentication, authorization, secure keystore for storing configurations. Plus integration with Apache Sentry and Apache Ranger.
- Preview mode - Ability to preview and debug data pipelines before deploying them.
- Joins in Cask Hydrator - Capabilities to join multiple data sources in data pipelines
- Real-time pipelines with Spark Streaming - Drag & drop real-time pipelines using Spark Streaming.
- Data usage analytics - Ability to report application usage of data sets.
- And much more!
VMworld 2013: Virtualizing Databases: Doing IT Right VMworld
VMworld 2013
Michael Corey, Ntirety, Inc
Jeff Szastak, VMware
Learn more about VMworld and register at http://www.vmworld.com/index.jspa?src=socmed-vmworld-slideshare
One of the recent trends in cloud computing is the tendency for enterprises to use multiple clouds, either to take advantage of unique offerings per cloud vendor or to hedge their bets. Creating a data tier that spans multiple clouds is a key consideration in these architectures. Apache Cassandra is an ideal technology for the multi-cloud data tier due to its flexibility and resilience. In this talk we’ll discuss the challenges you’ll face in developing the data tier of multi-cloud applications, provide recommendations on how to address these challenges with Cassandra, and show how we created a sample application to demonstrate these recommendations in practice.
Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...DataStax
Element Fleet has the largest benchmark database in our industry and we needed a robust and linearly scalable platform to turn this data into actionable insights for our customers. The platform needed to support advanced analytics, streaming data sets, and traditional business intelligence use cases.
In this presentation, we will discuss how we built a single, unified platform for both Advanced Analytics and traditional Business Intelligence using Cassandra on DSE. With Cassandra as our foundation, we are able to plug in the appropriate technology to meet varied use cases. The platform we’ve built supports real-time streaming (Spark Streaming/Kafka), batch and streaming analytics (PySpark, Spark Streaming), and traditional BI/data warehousing (C*/FiloDB). In this talk, we are going to explore the entire tech stack and the challenges we faced trying support the above use cases. We will specifically discuss how we ingest and analyze IoT (vehicle telematics data) in real-time and batch, combine data from multiple data sources into to single data model, and support standardized and ah-hoc reporting requirements.
About the Speaker
Jim Peregord Vice President - Analytics, Business Intelligence, Data Management, Element Corp.
MySQL Cluster - Latest Developments (up to and including MySQL Cluster 7.4)Andrew Morgan
MySQL Cluster is the distributed, shared-nothing version of MySQL. It’s typically used for applications that need any combination of high availability, real-time performance, and scaling of reads and writes. After a brief introduction to the technology, its uses, and the new features added in MySQL Cluster 7.3, this session focuses on the very latest developments happening in MySQL Cluster 7.4. As you’d expect from a real-time, scalable, distributed, in-memory database, performance continues to be a top priority, as do simplicity of use and robustness. Come hear firsthand what’s being done to make sure MySQL Cluster continues to dominate in mission-critical, high-performance applications.
Latest (storage IO) patterns for cloud-native applications OpenEBS
Applying micro service patterns to storage giving each workload its own Container Attached Storage (CAS) system. This puts the DevOps persona within full control of the storage requirements and brings data agility to k8s persistent workloads. We will go over the concept and the implementation of CAS, as well as its orchestration.
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
Round table discussion of vector databases, unstructured data, ai, big data, real-time, robots and Milvus.
A lively discussion with NJ Gen AI Meetup Lead, Prasad and Procure.FYI's Co-Found
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
Discussion on Vector Databases, Unstructured Data and AI
https://www.meetup.com/unstructured-data-meetup-new-york/
This meetup is for people working in unstructured data. Speakers will come present about related topics such as vector databases, LLMs, and managing data at scale. The intended audience of this group includes roles like machine learning engineers, data scientists, data engineers, software engineers, and PMs.This meetup was formerly Milvus Meetup, and is sponsored by Zilliz maintainers of Milvus.
Enhanced Enterprise Intelligence with your personal AI Data Copilot.pdfGetInData
Recently we have observed the rise of open-source Large Language Models (LLMs) that are community-driven or developed by the AI market leaders, such as Meta (Llama3), Databricks (DBRX) and Snowflake (Arctic). On the other hand, there is a growth in interest in specialized, carefully fine-tuned yet relatively small models that can efficiently assist programmers in day-to-day tasks. Finally, Retrieval-Augmented Generation (RAG) architectures have gained a lot of traction as the preferred approach for LLMs context and prompt augmentation for building conversational SQL data copilots, code copilots and chatbots.
In this presentation, we will show how we built upon these three concepts a robust Data Copilot that can help to democratize access to company data assets and boost performance of everyone working with data platforms.
Why do we need yet another (open-source ) Copilot?
How can we build one?
Architecture and evaluation
Unleashing the Power of Data_ Choosing a Trusted Analytics Platform.pdfEnterprise Wired
In this guide, we'll explore the key considerations and features to look for when choosing a Trusted analytics platform that meets your organization's needs and delivers actionable intelligence you can trust.
The Building Blocks of QuestDB, a Time Series Databasejavier ramirez
Talk Delivered at Valencia Codes Meetup 2024-06.
Traditionally, databases have treated timestamps just as another data type. However, when performing real-time analytics, timestamps should be first class citizens and we need rich time semantics to get the most out of our data. We also need to deal with ever growing datasets while keeping performant, which is as fun as it sounds.
It is no wonder time-series databases are now more popular than ever before. Join me in this session to learn about the internal architecture and building blocks of QuestDB, an open source time-series database designed for speed. We will also review a history of some of the changes we have gone over the past two years to deal with late and unordered data, non-blocking writes, read-replicas, or faster batch ingestion.
4. 4
"…an in-memory, distributed database with strong
consistency built to support low latency
transactional applications at extreme scale.”
5. History
2004 2008 2014
• Massive increase in data
volumes
• Falling margins per
transaction
• Increasing cost of IT
maintenance
• Need for elasticity in
systems
• Financial Services
Providers (every major
Wall Street bank)
• Department of Defense
• Real Time response needs
• Time to market constraints
• Need for flexible data
models across enterprise
• Distributed development
• Persistence + In-memory
• Global data visibility needs
• Fast Ingest needs for data
• Need to allow devices to
hook into enterprise data
• Always on
• Largest travel Portal
• Airlines
• Trade clearing
• Online gambling
• Largest Telcos
• Large mfrers
• Largest Payroll processor
• Auto insurance giants
• Largest rail systems on
earth
• 1000+ customers in production
• Cutting edge use cases
6. Interesting use cases
China Railway
Corporation
5,700 train stations
4.5 million tickets per day
20 million daily users
1.4 billion page views per day
40,000 visits per second
*http://pivotal.io/big-data/pivotal-gemfire
Indian Railways
7,000 stations
72,000 miles of track
23 million passengers daily
120,000 concurrent users
10,000 transactions per minute
7. Interesting use cases
Indian RailwaysChina Railway Corporation
World: ~7,349,000,000
~36% of the world population
Population: 1,251,695,6161,401,586,609
8. 8
Application Patterns
• Caching for speed and scale
➡ Read-Through, Write-Through, Write-Behind
• As the OLTP system of record
➡ Data in-memory for low-latency, on disk for durability
• Parallel compute engine
• Real-time analytics
10. • Cache
• Region
• Member
• Client Cache
• Functions
• Listeners
Geode Concepts and Usage
11. • Cache
• In-memory storage and management for your data
• Configurable through XML, Spring, Java API or CLI
• Collection of Region
Region
Region
Region
Cache
JVM
Concepts
12. Concepts
• Region
• Distributed java.util.Map on steroids (Key/Value)
• Consistent API regardless of where or how data is stored
• Observable (reactive)
• Highly available, redundant on cache Member (s).
• Querying
Region
Cache
java.util.Map
JVM
Key Value
K01 May
K02 Tim
13. Concepts
• Region
• Local, Replicated or Partitioned
• In-memory or persistent
• Redundant
• LRU
• Overflow
Region
Cache
java.util.Map
JVM
Key Value
K01 May
K02 Tim
Region
Cache
java.util.Map
JVM
Key Value
K01 May
K02 Tim
LOCAL
LOCAL_HEAP_LRU
LOCAL_OVERFLOW
LOCAL_PERSISTENT
LOCAL_PERSISTENT_OVERFLOW
PARTITION
PARTITION_HEAP_LRU
PARTITION_OVERFLOW
PARTITION_PERSISTENT
PARTITION_PERSISTENT_OVERFLOW
PARTITION_PROXY
PARTITION_PROXY_REDUNDANT
PARTITION_REDUNDANT
PARTITION_REDUNDANT_HEAP_LRU
PARTITION_REDUNDANT_OVERFLOW
PARTITION_REDUNDANT_PERSISTENT
PARTITION_REDUNDANT_PERSISTENT_OVERFLOW
REPLICATE
REPLICATE_HEAP_LRU
REPLICATE_OVERFLOW
REPLICATE_PERSISTENT
REPLICATE_PERSISTENT_OVERFLOW
REPLICATE_PROXY
14. • Persistent Regions
• Durability
• WAL for efficient writing
• Consistent recovery
• Compaction
Concepts
Modify
k1->v5
Create
k6->v6
Create
k2->v2
Create
k4->v4
Oplog2.crf
Member
1
Modify
k4->v7Oplog3.crf
Put k4->v7
Region
Cache
java.util.Map
JVM
Key Value
K01 May
K02 Tim
Region
Cache
java.util.Map
JVM
Key Value
K01 May
K02 Tim
Server 1 Server N
15. • Member
• A process that has a connection to the system
• A process that has created a cache
• Embeddable within your application
Concepts
Client
Locator
Server
16. Concepts
• Client cache
• A process connected to the Geode server(s)
• Can have a local copy of the data
• Can be notified about events on the servers
Application
GemFire Server
Region
Region
RegionClient Cache
17. Concepts
• Functions
• Used for distributed concurrent processing
(Map/Reduce, stored procedure)
• Highly available
• Data oriented
• Member oriented
Submit (f1)
f1 , f2 , … fn
Execute
Functions
22. • Minimize copying
• Minimize contention points
• Flexible consistency model
• Partitioning and parallelism
• Avoid disk seeks
• Automated benchmarks
What makes it fast?
25. Why Open Source? Why ASF?
• Open source is fundamentally changing software buying patterns
• Customers get transparency and co-development of features
• It’s the community that matters
• ASF provides a framework for open source
26. Geode Will Be a Significant Apache Project
• 1M+ LOC, over a 1000 person years invested into cutting edge R&D
• Thousands of production customers in very demanding verticals
• Cutting edge use cases that have shaped product thinking
• A core technology team that has stayed together since founding
• Performance differentiators that are baked into every aspect of the product
27. Geode versus GemFire
• Geode is a project supported by the OSS community
• GemFire is product from Pivotal, based on Geode source
• We donated everything but the kitchen sink*
• Development process follows “The Apache Way”
* Multi-site WAN replication, continuous queries, and native (C/C++/.NET) client
28. "Talk is cheap, show me the code"
• Clone & Build
git
clone
https://github.com/apache/incubator-‐geode
cd
incubator-‐geode
./gradlew
build
-‐Dskip.tests=true
• Start a server
cd
gemfire-‐assembly/build/install/apache-‐geode
./bin/gfsh
gfsh>
start
locator
-‐-‐name=locator
gfsh>
start
server
-‐-‐name=server
gfsh>
create
region
-‐-‐name=myRegion
-‐-‐type=REPLICATE
31. • RDD
• Dataframe
• Driver
• Worker
Quick intro to Apache Spark
"An RDD in Spark is simply an immutable distributed collection of objects.
Each RDD is split into multiple partitions, which may be computed on different nodes
of the cluster. RDDs can contain any type of Python, Java, or Scala objects,
including user-defined classes."
32. • RDD
• Dataframe
• Driver
• Worker
Quick intro to Apache Spark
“A dataframe is a distributed collection of rows organized into named columns. An
abstraction for selecting, filtering and plotting structured data (pandas), previously
known as SchemaRDD."
34. Live Data
Apache Geode / GemFire
1- Live data is ingested into the grid
2 - Trained ML model compares
new data to historical patterns
3 - Results are pushed
immediately to deployed
applications
Machine Learning
model
4 - Re-training is triggered, updating
the model with the latest historical data
Spring XD
Spring XD
Data
Temperature
Hot
Warm
38. • Off-heap memory storage
• HDFS persistence
• Lucene indexes
• Spark connector
• Cloud Foundry service
• Distributed transactions
…and other ideas from the Geode community!
Roadmap
39. How to Get Involved
• http://geode.incubator.apache.org
• Join the mailing lists; ask a question, answer a question, learn
dev@geode.incubator.apache.org
user@geode.incubator.apache.org
• File a bug in JIRA
• Update the wiki, website, or documentation
• Create example applications
• Use it in your project! We need you!