This document discusses enabling real-time analytics using Hadoop MapReduce on an in-memory data grid (IMDG). It describes implementing MapReduce using parallel method invocation on an IMDG to eliminate batch scheduling overhead and analyze live data. Sample use cases are presented for applications in financial services, ecommerce, and other industries that require real-time analysis of large, changing datasets.
2. Agenda
• Quick Review of In-Memory Data Grids
• The Need for Real-Time Analytics: Two Use Cases
• Data-Parallel Computation on an IMDG Using Parallel Method Invocation (PMI)
• Implementing MapReduce Using PMI: ScaleOut hServer™
• Sample Use Cases
• Video Demo
• Comparison to Spark
3. About ScaleOut Software
• Develops and markets In-Memory Data Grids: software middleware for:
  • Scaling application performance and
  • Performing real-time analytics using
  • In-memory data storage and computing
• Dr. William Bain, Founder & CEO
  • Career focused on parallel computing – Bell Labs, Intel, Microsoft
  • 3 prior start-ups, last acquired by Microsoft; product now ships as Network Load Balancing in Windows Server
• Eight years in the market; 400 customers, 9,000 servers
• Sample customers: (customer logos shown in the original slide)
4. What is an In-Memory Data Grid?
In-memory storage for fast updates and retrieval of live data
• Fits in the business logic layer:
  • Follows an object-oriented view of data (vs. a relational view).
  • Stores collections of Java/.NET objects shared by multiple clients.
  • Uses create/read/update/delete and query APIs to access data (see the sketch below).
• Implemented across a cluster of servers or VMs:
  • Scales storage and throughput by adding servers.
  • Provides high availability in case a server fails.
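To make this object-oriented access model concrete, here is a minimal, single-process sketch of the create/read/update/delete and query pattern. GridCache is a hypothetical stand-in for a distributed IMDG client API, not ScaleOut's actual interface; a real grid would partition the store across servers and replicate it for high availability.

import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Hypothetical stand-in for an IMDG client API (illustration only).
// A real grid partitions this store across servers and replicates it;
// here a single ConcurrentHashMap models the shared object collection.
class GridCache<K, V> {
    private final Map<K, V> store = new ConcurrentHashMap<>();

    void create(K key, V value) { store.put(key, value); }   // create
    V read(K key)               { return store.get(key); }   // read
    void update(K key, V value) { store.put(key, value); }   // update
    void delete(K key)          { store.remove(key); }       // delete

    // Parallel query: select objects whose properties match a predicate.
    List<V> query(Predicate<V> spec) {
        return store.values().parallelStream()
                    .filter(spec)
                    .collect(Collectors.toList());
    }
}

public class ImdgSketch {
    // A grid-hosted domain object with object-oriented properties.
    record Stock(String ticker, String sector, double lastPrice) {}

    public static void main(String[] args) {
        GridCache<String, Stock> stocks = new GridCache<>();
        stocks.create("AMZN", new Stock("AMZN", "Tech", 312.0));          // create
        Stock s = stocks.read("AMZN");                                    // read
        stocks.update("AMZN", new Stock(s.ticker(), s.sector(), 315.5));  // update
        stocks.delete("AMZN");                                            // delete
    }
}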
5. Our Focus: Real-Time Analytics
Real-time analytics:
• Live data sets
• Gigabytes to terabytes
• In-memory storage
• Minutes to seconds
• Best uses ("operational intelligence"): tracking live data; immediately identifying trends and capturing opportunities
Batch analytics:
• Static data sets
• Petabytes
• Disk storage
• Hours to minutes
• Best uses ("business intelligence"): analyzing warehoused data; mining for long-term trends
(Figure: the big data analytics spectrum, from a real-time analytics server (ScaleOut hServer) to batch platforms such as Hadoop, IBM, Teradata, SAS, and SAP.)
6. Online Systems Need Real-Time Analysis
A few examples:
• Equity trading: to minimize risk during a trading day
• Ecommerce: to optimize real-time shopping activity
• Reservations systems: to identify issues, reroute, etc.
• Credit cards: to detect fraud in real time
• Smart grids: to optimize power distribution & detect issues
7. Integrate MapReduce into IMDG for Real-Time Analytics
Benefits:
• Enables use of widely used Hadoop MapReduce APIs.
• Accelerates data access by staging data in memory.
• Eliminates batch scheduling and data shuffling overheads of standard Hadoop distributions.
• Analyzes and updates live data.
• Enables Hadoop deployment in live systems.
• Hadoop MapReduce programs run without change.
• ScaleOut's implementation is called ScaleOut hServer™.
8. Data-Parallel Analysis Is Not New
• 1980's: Special Purpose Hardware: "SIMD" (image: Thinking Machines Connection Machine 5)
• 1990's: General Purpose Parallel Supercomputers: "Domain Decomposition", "SPMD" (images: Intel iPSC/2, IBM SP1)
9. Data-Parallel Analysis Is Not New
• 1990's – early 2000's: HPC on Clusters: "MPI" (image: HP blade servers)
• Since 2003: Clusters, the Cloud, and IMDGs: "MapReduce" (image: Amazon EC2, Windows Azure)
10. Parallel Method Invocation
• Basic, well-understood model of data-parallel computation
• Implemented for use on objects hosted in IMDGs:
  • Executes the user's code in parallel across the grid.
  • Uses a parallel query to select objects for analysis.
(Diagram: the in-memory data grid runs the data-parallel analysis: Analyze Data (Eval), then Combine Results (Merge). A sketch of this eval/merge model follows.)
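A minimal sketch of the eval/merge model, assuming a simplified PMI signature (this local rendering is illustrative, not ScaleOut's actual API): eval runs once per selected object (on the server that stores it, in the real grid), and merge combines partial results pairwise into one final result.

import java.util.List;
import java.util.function.BinaryOperator;
import java.util.function.Function;

// Simplified model of Parallel Method Invocation (hypothetical API, not
// ScaleOut's actual interface). eval() is applied to each object selected
// by the parallel query; merge() combines partial results pairwise into a
// single result. A parallel stream stands in for the grid's distributed,
// per-server execution.
class Pmi {
    static <T, R> R invoke(List<T> selected,         // objects chosen by parallel query
                           Function<T, R> eval,      // per-object analysis step
                           BinaryOperator<R> merge)  // pairwise combining step
    {
        return selected.parallelStream()
                       .map(eval)
                       .reduce(merge)
                       .orElseThrow(() -> new IllegalStateException("no objects selected"));
    }
}

// Example, reusing the hypothetical Stock sketch above: find the highest
// last price across the selected objects:
//   double top = Pmi.invoke(selected, Stock::lastPrice, Math::max);

Because merge is associative, partial results can be combined first within each grid server and then across servers, which is exactly the two-level merge described in Step 3 below.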
11. Running the Analysis
The parallel analysis executes in three steps:
• Step 1: The application first selects all relevant objects in the collection with a parallel query run on all grid servers.
  • Note: The query spec matches the data's object-oriented properties (illustrated in the snippet below).
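For illustration, reusing the hypothetical GridCache and Stock types from the earlier sketch, a query spec keyed to object properties might look like this:

// Select the stock objects relevant to the analysis; in a real grid each
// server evaluates the predicate against its locally stored objects.
List<ImdgSketch.Stock> selected = stocks.query(
        s -> s.sector().equals("Financials") && s.lastPrice() > 100.0);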
12. Running the Analysis: Step 2
• Step 2: The IMDG automatically schedules analysis operations across all grid servers and cores.
  • The analysis runs on all objects selected by the parallel query.
  • Each grid server analyzes its locally stored objects to minimize data motion.
• Parallel execution ensures fast completion time:
  • The IMDG automatically distributes the workload across servers/cores.
  • Scaling the IMDG automatically handles larger data sets.
12
ScaleOut Software, Inc.
13. Running the Analysis: Step 3
• Step 3: The IMDG automatically merges all analysis results.
  • The IMDG first merges all results within each grid server in parallel.
  • It then merges results across all grid servers to create one combined result.
  • An efficient parallel merge minimizes the delay in combining all results.
  • The IMDG delivers the combined result to the trader’s display as one object.
14. Sample Performance Results for PMI
Optimizing a stock trading platform with real-time analysis:
• IMDG hosted in the Amazon cloud using 75 servers.
• IMDG holds 1 TB of stock history data in memory.
• IMDG handles a continuous stream of updates (1.1 GB/s).
• IMDG performs real-time analysis on live data.
• Entire data set analyzed in 4.1 seconds (250 GB/s).
• IMDG scales linearly as the workload grows.
15. Implementing Real-Time MapReduce
• Goal: run MapReduce applications from a remote workstation.
• The IMDG automatically builds an “invocation grid” of JVMs on the grid’s servers for PMI and ships the application’s JARs.
• The invocation grid can be reused to shorten startup time (see the code in slide 25).
• PMI is used to implement MapReduce.
16. Accelerating MapReduce Execution
PMI is the foundation of fast execution time:
• Data can be input from either the IMDG or an external data source.
  • Works with any input/output format compatible with the Apache distribution.
• The ScaleOut IMDG uses its data-parallel execution engine (PMI) to invoke the mappers and the reducers.
  • Eliminates batch scheduling overhead.
• Intermediate results are stored within the IMDG.
  • Minimizes data motion between the mappers and reducers.
  • Allows optional sorting.
17. Only One-Line Code Change
ScaleOut hServer subclasses the Hadoop Job class, so the same word-count program runs on either engine; only the Job construction changes:

// This job will run using the Hadoop job tracker:
public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "wordcount");
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    job.setMapperClass(Map.class);
    job.setReducerClass(Reduce.class);
    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.waitForCompletion(true);
}

// This job will run using ScaleOut hServer:
public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new HServerJob(conf, "wordcount"); // the one-line change
    // ...the remaining job configuration is identical to the version above...
    job.waitForCompletion(true);
}
18. Accessing IMDG Data for M/R
• The IMDG adds a grid input format for accessing key/value pairs held in the IMDG.
• MapReduce programs optionally can output results to the IMDG with a grid output format.
• A Grid Record Reader optimizes access to key/value pairs to eliminate network overhead.
• Applications can access and update key/value pairs as operational data during analysis.
(A configuration sketch follows.)
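A sketch of wiring a job to the IMDG’s key/value storage. The deck names a grid input format and a grid output format but not their classes, so GridInputFormat and GridOutputFormat below are hypothetical placeholder names:

// Hypothetical sketch: read the job's input from the IMDG and write its
// results back to the IMDG. GridInputFormat and GridOutputFormat are
// placeholder names, not confirmed hServer classes.
Configuration conf = new Configuration();
Job job = new HServerJob(conf, "gridJob");
job.setInputFormatClass(GridInputFormat.class);   // input: IMDG key/value pairs
job.setOutputFormatClass(GridOutputFormat.class); // output: IMDG key/value pairs
job.waitForCompletion(true);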
19. Optimized In-Memory Storage
Multiple in-memory storage models (see the sketch below):
• Named cache, optimized for rich semantics:
  • Property-based query
  • Distributed locking
  • Access from remote grids
• Named map, optimized for efficient storage and bulk analysis:
  • Highly efficient object storage
  • Pipelined, bulk-access mechanisms
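A brief sketch contrasting the two storage models in Java. NamedCache and CacheFactory appear elsewhere in this deck; NamedMap, NamedMapFactory, and the Stock constructor are assumptions made for illustration:

// Named cache: rich semantics (property-based query, distributed locking).
NamedCache cache = CacheFactory.getCache("Stocks");
cache.put("GOOG", new Stock("GOOG", 1102.50, 1000000));

// Named map: compact storage tuned for bulk, pipelined access.
// NamedMapFactory and its signature are assumed names for illustration.
NamedMap<String, Stock> tickMap = NamedMapFactory.getMap("StockTicks");
tickMap.put("GOOG", new Stock("GOOG", 1102.50, 1000000));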
20. Example: Ecommerce: Inventory Management
Fast map/reduce reconciles the inventory and order systems for an online retailer:
• Challenge: inventory and online order management are handled by different applications.
  • Reconciled once per day.
  • Inaccurate orders reduce margins.
• Solution:
  • Host SKUs in an IMDG updated in real time by the order and inventory systems.
  • Use PMI to reconcile in two minutes (a sketch follows).
• Results: real-time reconciliation ensures accurate orders.
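A minimal sketch of how this reconciliation could be expressed with the PMI pattern shown at the end of this deck (an Invokable with eval and merge methods). The Sku class and its accessors are hypothetical; the deck does not show this code:

// Hypothetical PMI sketch: count SKUs whose inventory and order records
// disagree. Sku, getInventoryCount(), and getOrderSystemCount() are
// illustrative names, not from the deck.
public class SkuReconciliation implements Invokable<Sku, Void, Integer> {
    public Integer eval(Sku sku, Void unused) throws InvokeException {
        // Flag SKUs where the two systems' counts disagree.
        return sku.getInventoryCount() == sku.getOrderSystemCount() ? 0 : 1;
    }
    public Integer merge(Integer first, Integer second) throws InvokeException {
        return first + second; // pair-wise combine discrepancy counts
    }
}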
21. Example in Financial Services
Integrate analysis into a stock trading platform:
• The IMDG holds market data and hedging strategies.
• Updates to market data continuously flow through the IMDG.
• The IMDG performs repeated map/reduce analysis on hedging strategies and alerts traders in real time.
• The IMDG automatically and dynamically scales its throughput to handle new hedging strategies by adding servers.
23. Comparison to Spark
• Spark is intended to accelerate data analysis using in-memory computing.
• ScaleOut’s IMDG provides standard MapReduce for “live” systems.

Feature                 | Spark                      | ScaleOut IMDG
New MapReduce engine    | Yes                        | Yes
In-memory data storage  | Resilient Distr. Datasets  | Distributed Objects
Load/store from HDFS    | Yes                        | Yes
Avoid disk access       | Yes                        | Yes
CRUD on live data       | No                         | Yes
Query on properties     | No                         | Yes
High availability       | Rebuild on failure         | Replication and failover
Extensibility           | Additional operators       | PMI methods
Open source             | Yes                        | Hybrid
24. Summary
• Online systems need to analyze “live” data in real time.
• MapReduce has traditionally focused on analyzing large, static (offline) datasets held in file systems.
• An in-memory data grid (IMDG) can accelerate MapReduce applications, enabling real-time analytics:
  • Enables the application to analyze and update live data.
  • Leverages the IMDG’s load-balanced placement of data.
  • Avoids batch-scheduled startup delays.
  • Avoids data motion from secondary storage.
• MapReduce can be implemented using standard data-parallel computing techniques (“parallel method invocation”):
  • Tightly integrates the Map/Reduce engine with the IMDG.
  • Accelerates Map/Reduce execution by >20X in benchmark tests.
25. Accelerating Start-Up Times
• The invocation grid can be re-used across MapReduce jobs:

public static void main(String[] argv) throws Exception {
    // Configure and load the invocation grid.
    InvocationGrid grid = HServerJob.getInvocationGridBuilder("myGrid")
        // Add JAR files as IG dependencies.
        .addJar("main-job.jar")
        .addJar("first-library.jar")
        // Add classes as IG dependencies.
        .addClass(MyMapper.class)
        .addClass(MyReducer.class)
        // Define custom JVM parameters.
        .setJVMParameters("-Xms512M -Xmx1024M")
        .load();

    // Run 10 jobs on the same invocation grid.
    for (int i = 0; i < 10; i++) {
        Configuration conf = new Configuration();
        // The preloaded invocation grid is passed as a parameter to the job.
        Job job = new HServerJob(conf, "Job number " + i, false, grid);
        // ...configure the job here...
        // Run the job.
        job.waitForCompletion(true);
    }

    // Unload the invocation grid when done.
    grid.unload();
}
26. Targeted Use Cases
Run continuous Hadoop
on live data, while it’s
being updated.
Accelerate Hadoop on
static data with a one
line code change.
Quickly prototype
Hadoop code.
26
“Capture perishable business
opportunities and identify issues.”
Real-time risk
analysis
Credit card fraud
detection
...
“Speed-up Hadoop execution by >10X for
faster business insights.”
Financial
modeling
Process
simulations
...
“Validate your Hadoop code before it
goes into batch processing.”
No need to install
Hadoop stack
ScaleOut Software, Inc.
Fast-turn debug
and tuning
...
27. The Need for Real-Time Analytics
Many use cases:
• Authorizations / Payment Processing / Mobile Payments
• Operational Risk Compliance
• Financial: Risk, P&L, Pricing
• Execution Rules
• Market Feed / Event Handlers
• Churn Management
• Situational Awareness
• Fraud Detection
• Real-Time Tracking
• Sensor Data / SCADA
• Inventory Management
• Service Activation

Across key industries:
• Health Care
• Government
• Life Sciences
• IC / DoD
• Logistics
• Manufacturing
• Utilities
• Retail
• Telco
• Financial
• CPG
• Law Enforcement
28. Problem: Hadoop Cannot Efficiently Perform Real-Time Analytics
• Typically used for very large, static, offline datasets.
• Data must be copied from disk-based storage (e.g., HDFS) into memory for analysis.
• Hadoop Map/Reduce adds lengthy batch scheduling and data shuffling overhead.
29. Hadoop Users Need Real-Time Analytics
• ScaleOut Software conducted an informal survey at the Strata 2013 Conference (Santa Clara).
• Based on 150 responses:
  • 78% of organizations generate fast-changing data.
  • 60% use Hadoop, and 78% plan to expand usage of Hadoop within 12 months.
  • Only 42% consider Hadoop to be an effective platform for real-time analysis, but…
  • 93% would benefit from real-time data analytics.
  • 71% consider a 10X improvement in performance meaningful.
• Take-away: Hadoop users need real-time analytics.
30. Optional Caching of HDFS Data
• ScaleOut hServer adds a Dataset Record Reader (wrapper) to cache HDFS data during program execution.
• Hadoop automatically retrieves the data from the ScaleOut IMDG on subsequent runs.
• The Dataset Record Reader stores and retrieves data with minimal network and memory overhead.
• Tests with the Terasort benchmark have demonstrated 11X faster data access than HDFS without the IMDG.
(A configuration sketch follows.)
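A sketch of enabling this caching wrapper. The deck names the Dataset Record Reader but not its configuration API, so the DatasetInputFormat class and setUnderlyingInputFormat call below are assumptions made for illustration:

// Hypothetical sketch: wrap the job's input format so HDFS reads are
// cached in the IMDG and served from memory on subsequent runs.
// DatasetInputFormat and setUnderlyingInputFormat are assumed names.
Configuration conf = new Configuration();
Job job = new HServerJob(conf, "wordcount");
job.setInputFormatClass(DatasetInputFormat.class);
DatasetInputFormat.setUnderlyingInputFormat(job, TextInputFormat.class);
job.waitForCompletion(true);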
31. Java Example: Parallel Method Invocation
• Create a method to analyze each queried stock object and another method to pair-wise merge the results:

public class StockAnalysis implements Invokable<Stock, StockCalcParams, Double> {
    // Analyze one selected stock: compute its market value.
    public Double eval(Stock stock, StockCalcParams param) throws InvokeException {
        return stock.getPrice() * stock.getTotalShares();
    }
    // Pair-wise combine two partial results into one.
    public Double merge(Double first, Double second) throws InvokeException {
        return first + second;
    }
}
32. Java Example: Parallel Method Invocation
• Run a parallel method invocation on the query results:

NamedCache cache = CacheFactory.getCache("Stocks");
InvokeResult valueOfSelectedStocks =
    cache.invoke(
        StockAnalysis.class,                                   // class with eval/merge methods
        Stock.class,                                           // type of objects to analyze
        or(equal("ticker", "GOOG"), equal("ticker", "ORCL")),  // parallel query filter
        new StockCalcParams());                                // parameters passed to eval
System.out.println("The value of selected stocks is " +
    valueOfSelectedStocks.getResult());