This document discusses the MADlib architecture for performing scalable machine learning and analytics on large datasets using massively parallel processing. It describes how MADlib implements algorithms like linear regression across distributed database segments to solve challenges like multiplying data across nodes. It also discusses how MADlib uses a convex optimization framework to iteratively solve machine learning problems and the use of streaming algorithms to compute analytics in a single data scan. Finally, it outlines how the MADlib architecture provides scalable machine learning capabilities to data scientists through interfaces like PivotalR.
Pivotal OSS meetup - MADlib and PivotalRgo-pivotal
With the explosion of big data, the need for fast and inexpensive analytics solutions has become a key basis of competition in many industries. Extracting the value of big data with analytics can be complex, and requires advanced skills.
At Pivotal, we are building open-source solutions (MADlib, PivotalR, PyMadlib) to simplify this process for the user, while maintaining the efficiency necessary for big data analysis.
This talk will provide information about MADlib, an open source library of SQL-based algorithms for machine learning, data mining and statistics that run at large scale within a database engine, with no need for data import/export to other tools.
It provides an overview of the library’s architecture and compares various statistical methods with those available in Apache Mahout.
We also introduce, PivotalR, a R-based wrapper for MADlib that allows data scientists and programmers to access power of MADlib along with the ease of use of R.
In this paper we introduce the MADlib project, including the background that led to its beginnings, and the motivation for its open source nature. We provide an overview of the library’s architecture and design patterns, and provide a description of various statistical methods in that context.
A Pipeline for Distributed Topic and Sentiment Analysis of Tweets on Pivotal ...Srivatsan Ramanujam
Unstructured data is everywhere - in the form of posts, status updates, bloglets or news feeds in social media or in the form of customer interactions Call Center CRM. While many organizations study and monitor social media for tracking brand value and targeting specific customer segments, in our experience blending the unstructured data with the structured data in supplementing data science models has been far more effective than working with it independently.
In this talk we will show case an end-to-end topic and sentiment analysis pipeline we've built on the Pivotal Greenplum Database platform for Twitter feeds from GNIP, using open source tools like MADlib and PL/Python. We've used this pipeline to build regression models to predict commodity futures from tweets and in enhancing churn models for telecom through topic and sentiment analysis of call center transcripts. All of this was possible because of the flexibility and extensibility of the platform we worked with.
Apache Hadoop started as batch: simple, powerful, efficient, scalable, and a shared platform. However, Hadoop is more than that. It's true strengths are:
Scalability – it's affordable due to it being open-source and its use of commodity hardware for reliable distribution.
Schema on read – you can afford to save everything in raw form.
Data is better than algorithms – More data and a simple algorithm can be much more meaningful than less data and a complex algorithm.
Pivotal OSS meetup - MADlib and PivotalRgo-pivotal
With the explosion of big data, the need for fast and inexpensive analytics solutions has become a key basis of competition in many industries. Extracting the value of big data with analytics can be complex, and requires advanced skills.
At Pivotal, we are building open-source solutions (MADlib, PivotalR, PyMadlib) to simplify this process for the user, while maintaining the efficiency necessary for big data analysis.
This talk will provide information about MADlib, an open source library of SQL-based algorithms for machine learning, data mining and statistics that run at large scale within a database engine, with no need for data import/export to other tools.
It provides an overview of the library’s architecture and compares various statistical methods with those available in Apache Mahout.
We also introduce, PivotalR, a R-based wrapper for MADlib that allows data scientists and programmers to access power of MADlib along with the ease of use of R.
In this paper we introduce the MADlib project, including the background that led to its beginnings, and the motivation for its open source nature. We provide an overview of the library’s architecture and design patterns, and provide a description of various statistical methods in that context.
A Pipeline for Distributed Topic and Sentiment Analysis of Tweets on Pivotal ...Srivatsan Ramanujam
Unstructured data is everywhere - in the form of posts, status updates, bloglets or news feeds in social media or in the form of customer interactions Call Center CRM. While many organizations study and monitor social media for tracking brand value and targeting specific customer segments, in our experience blending the unstructured data with the structured data in supplementing data science models has been far more effective than working with it independently.
In this talk we will show case an end-to-end topic and sentiment analysis pipeline we've built on the Pivotal Greenplum Database platform for Twitter feeds from GNIP, using open source tools like MADlib and PL/Python. We've used this pipeline to build regression models to predict commodity futures from tweets and in enhancing churn models for telecom through topic and sentiment analysis of call center transcripts. All of this was possible because of the flexibility and extensibility of the platform we worked with.
Apache Hadoop started as batch: simple, powerful, efficient, scalable, and a shared platform. However, Hadoop is more than that. It's true strengths are:
Scalability – it's affordable due to it being open-source and its use of commodity hardware for reliable distribution.
Schema on read – you can afford to save everything in raw form.
Data is better than algorithms – More data and a simple algorithm can be much more meaningful than less data and a complex algorithm.
http://bit.ly/1BTaXZP – Hadoop has been a huge success in the data world. It’s disrupted decades of data management practices and technologies by introducing a massively parallel processing framework. The community and the development of all the Open Source components pushed Hadoop to where it is now.
That's why the Hadoop community is excited about Apache Spark. The Spark software stack includes a core data-processing engine, an interface for interactive querying, Sparkstreaming for streaming data analysis, and growing libraries for machine-learning and graph analysis. Spark is quickly establishing itself as a leading environment for doing fast, iterative in-memory and streaming analysis.
This talk will give an introduction the Spark stack, explain how Spark has lighting fast results, and how it complements Apache Hadoop.
Keys Botzum - Senior Principal Technologist with MapR Technologies
Keys is Senior Principal Technologist with MapR Technologies, where he wears many hats. His primary responsibility is interacting with customers in the field, but he also teaches classes, contributes to documentation, and works with engineering teams. He has over 15 years of experience in large scale distributed system design. Previously, he was a Senior Technical Staff Member with IBM, and a respected author of many articles on the WebSphere Application Server as well as a book.
Tez is the next generation Hadoop Query Processing framework written on top of YARN. Computation topologies in higher level languages like Pig/Hive can be naturally expressed in the new graph dataflow model exposed by Tez. Multi-stage queries can be expressed as a single Tez job resulting in lower latency for short queries and improved throughput for large scale queries. MapReduce has been the workhorse for Hadoop but its monolithic structure had made innovation slower. YARN separates resource management from application logic and thus enables the creation of Tez, a more flexible and generic new framework for data processing for the benefit of the entire Hadoop query ecosystem.
A talk given by Ted Dunning on February 2013 on Apache Drill, an open-source community-driven project to provide easy, dependable, fast and flexible ad hoc query capabilities.
Apache Kylin: OLAP Engine on Hadoop - Tech Deep DiveXu Jiang
Kylin is an open source Distributed Analytics Engine from eBay Inc. that provides SQL interface and multi-dimensional analysis (OLAP) on Hadoop supporting extremely large datasets.
If you want to do multi-dimension analysis on large data sets (billion+ rows) with low query latency (sub-seconds), Kylin is a good option. Kylin also provides seamless integration with existing BI tools (e.g Tableau).
Summary of recent progress on Apache Drill, an open-source community-driven project to provide easy, dependable, fast and flexible ad hoc query capabilities.
Hadoop is commonly used for processing large swaths of data in batch. While many of the necessary building blocks for data processing exist within the Hadoop ecosystem – HDFS, MapReduce, HBase, Hive, Pig, Oozie, and so on – it can be a challenge to assemble and operationalize them as a production ETL platform. This presentation covers one approach to data ingest, organization, format selection, process orchestration, and external system integration, based on collective experience acquired across many production Hadoop deployments.
Apache Drill and Zeppelin: Two Promising Tools You've Never Heard OfCharles Givre
Study after study shows that data preparation and other data janitorial work consume 50-90% of most data scientists’ time. Apache Drill is a very promising tool which can help address this. Drill works with many different forms of “self describing data” and allows analysts to run ad-hoc queries in ANSI SQL against that data. Unlike HIVE or other SQL on Hadoop tools, Drill is not a wrapper for Map-Reduce and can scale to clusters of up to 10k nodes.
Introduction to the Hadoop Ecosystem (FrOSCon Edition)Uwe Printz
Talk held at the FrOSCon 2013 on 24.08.2013 in Sankt Augustin, Germany
Agenda:
- What is Big Data & Hadoop?
- Core Hadoop
- The Hadoop Ecosystem
- Use Cases
- What‘s next? Hadoop 2.0!
Hadoop became the most common systm to store big data.
With Hadoop, many supporting systems emerged to complete the aspects that are missing in Hadoop itself.
Together they form a big ecosystem.
This presentation covers some of those systems.
While not capable to cover too many in one presentation, I tried to focus on the most famous/popular ones and on the most interesting ones.
MapR is an amazing new distributed filesystem modeled after Hadoop. It maintains API compatibility with Hadoop, but far exceeds it in performance, manageability, and more.
/* Ted's MapR meeting slides incorporated here */
There is a lot more to Hadoop than Map-Reduce. An increasing number of engineers and researchers involved in processing and analyzing large amount of data, regards Hadoop as an ever expanding ecosystem of open sources libraries, including NoSQL, scripting and analytics tools.
A Hands-on Intro to Data Science and R Presentation.pptSanket Shikhar
Using popular data science tools such as Python and R, the book offers many examples of real-life applications, with practice ranging from small to big data.
http://bit.ly/1BTaXZP – Hadoop has been a huge success in the data world. It’s disrupted decades of data management practices and technologies by introducing a massively parallel processing framework. The community and the development of all the Open Source components pushed Hadoop to where it is now.
That's why the Hadoop community is excited about Apache Spark. The Spark software stack includes a core data-processing engine, an interface for interactive querying, Sparkstreaming for streaming data analysis, and growing libraries for machine-learning and graph analysis. Spark is quickly establishing itself as a leading environment for doing fast, iterative in-memory and streaming analysis.
This talk will give an introduction the Spark stack, explain how Spark has lighting fast results, and how it complements Apache Hadoop.
Keys Botzum - Senior Principal Technologist with MapR Technologies
Keys is Senior Principal Technologist with MapR Technologies, where he wears many hats. His primary responsibility is interacting with customers in the field, but he also teaches classes, contributes to documentation, and works with engineering teams. He has over 15 years of experience in large scale distributed system design. Previously, he was a Senior Technical Staff Member with IBM, and a respected author of many articles on the WebSphere Application Server as well as a book.
Tez is the next generation Hadoop Query Processing framework written on top of YARN. Computation topologies in higher level languages like Pig/Hive can be naturally expressed in the new graph dataflow model exposed by Tez. Multi-stage queries can be expressed as a single Tez job resulting in lower latency for short queries and improved throughput for large scale queries. MapReduce has been the workhorse for Hadoop but its monolithic structure had made innovation slower. YARN separates resource management from application logic and thus enables the creation of Tez, a more flexible and generic new framework for data processing for the benefit of the entire Hadoop query ecosystem.
A talk given by Ted Dunning on February 2013 on Apache Drill, an open-source community-driven project to provide easy, dependable, fast and flexible ad hoc query capabilities.
Apache Kylin: OLAP Engine on Hadoop - Tech Deep DiveXu Jiang
Kylin is an open source Distributed Analytics Engine from eBay Inc. that provides SQL interface and multi-dimensional analysis (OLAP) on Hadoop supporting extremely large datasets.
If you want to do multi-dimension analysis on large data sets (billion+ rows) with low query latency (sub-seconds), Kylin is a good option. Kylin also provides seamless integration with existing BI tools (e.g Tableau).
Summary of recent progress on Apache Drill, an open-source community-driven project to provide easy, dependable, fast and flexible ad hoc query capabilities.
Hadoop is commonly used for processing large swaths of data in batch. While many of the necessary building blocks for data processing exist within the Hadoop ecosystem – HDFS, MapReduce, HBase, Hive, Pig, Oozie, and so on – it can be a challenge to assemble and operationalize them as a production ETL platform. This presentation covers one approach to data ingest, organization, format selection, process orchestration, and external system integration, based on collective experience acquired across many production Hadoop deployments.
Apache Drill and Zeppelin: Two Promising Tools You've Never Heard OfCharles Givre
Study after study shows that data preparation and other data janitorial work consume 50-90% of most data scientists’ time. Apache Drill is a very promising tool which can help address this. Drill works with many different forms of “self describing data” and allows analysts to run ad-hoc queries in ANSI SQL against that data. Unlike HIVE or other SQL on Hadoop tools, Drill is not a wrapper for Map-Reduce and can scale to clusters of up to 10k nodes.
Introduction to the Hadoop Ecosystem (FrOSCon Edition)Uwe Printz
Talk held at the FrOSCon 2013 on 24.08.2013 in Sankt Augustin, Germany
Agenda:
- What is Big Data & Hadoop?
- Core Hadoop
- The Hadoop Ecosystem
- Use Cases
- What‘s next? Hadoop 2.0!
Hadoop became the most common systm to store big data.
With Hadoop, many supporting systems emerged to complete the aspects that are missing in Hadoop itself.
Together they form a big ecosystem.
This presentation covers some of those systems.
While not capable to cover too many in one presentation, I tried to focus on the most famous/popular ones and on the most interesting ones.
MapR is an amazing new distributed filesystem modeled after Hadoop. It maintains API compatibility with Hadoop, but far exceeds it in performance, manageability, and more.
/* Ted's MapR meeting slides incorporated here */
There is a lot more to Hadoop than Map-Reduce. An increasing number of engineers and researchers involved in processing and analyzing large amount of data, regards Hadoop as an ever expanding ecosystem of open sources libraries, including NoSQL, scripting and analytics tools.
A Hands-on Intro to Data Science and R Presentation.pptSanket Shikhar
Using popular data science tools such as Python and R, the book offers many examples of real-life applications, with practice ranging from small to big data.
Ray (https://github.com/ray-project/ray) is a framework developed at UC Berkeley and maintained by Anyscale for building distributed AI applications. Over the last year, the broader machine learning ecosystem has been rapidly adopting Ray as the primary framework for distributed execution. In this talk, we will overview how libraries such as Horovod (https://horovod.ai/), XGBoost, and Hugging Face Transformers, have integrated with Ray. We will then showcase how Uber leverages Ray and these ecosystem integrations to simplify critical production workloads at Uber. This is a joint talk between Anyscale and Uber.
Sorry - How Bieber broke Google Cloud at SpotifyNeville Li
Talk at Scala Up North Jul 21 2017
We will talk about Spotify's story with Scala big data and our journey to migrate our entire data infrastructure to Google Cloud and how Justin Bieber contributed to breaking it. We'll talk about Scio, a Scala API for Apache Beam and Google Cloud Dataflow, and the technology behind it, including macros, algebird, chill and shapeless. There'll also be a live coding demo.
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...Databricks
“As Apache Spark becomes more widely adopted, we have focused on creating higher-level APIs that provide increased opportunities for automatic optimization. In this talk, I give an overview of some of the exciting new API’s available in Spark 2.0, namely Datasets and Structured Streaming. Together, these APIs are bringing the power of Catalyst, Spark SQL's query optimizer, to all users of Spark. I'll focus on specific examples of how developers can build their analyses more quickly and efficiently simply by providing Spark with more information about what they are trying to accomplish.” - Michael
Databricks Blog: "Deep Dive into Spark SQL’s Catalyst Optimizer"
https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.html
// About the Presenter //
Michael Armbrust is the lead developer of the Spark SQL project at Databricks. He received his PhD from UC Berkeley in 2013, and was advised by Michael Franklin, David Patterson, and Armando Fox. His thesis focused on building systems that allow developers to rapidly build scalable interactive applications, and specifically defined the notion of scale independence. His interests broadly include distributed systems, large-scale structured storage and query optimization.
Follow Michael on -
Twitter: https://twitter.com/michaelarmbrust
LinkedIn: https://www.linkedin.com/in/michaelarmbrust
Introduction to Sqoop Aaron Kimball Cloudera Hadoop User Group UKSkills Matter
In this talk of Hadoop User Group UK meeting, Aaron Kimball from Cloudera introduces Sqoop, the open source SQL-to-Hadoop tool. Sqoop helps users perform efficient imports of data from RDBMS sources to Hadoop's distributed file system, where it can be processed in concert with other data sources. Sqoop also allows users to export Hadoop-generated results back to an RDBMS for use with other data pipelines.
After this session, users will understand how databases and Hadoop fit together, and how to use Sqoop to move data between these systems. The talk will provide suggestions for best practices when integrating Sqoop and Hadoop in your data processing pipelines. We'll also cover some deeper technical details of Sqoop's architecture, and take a look at some upcoming aspects of Sqoop's development roadmap.
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...Debraj GuhaThakurta
Event: TDWI Accelerate, Seattle, Oct 16, 2017
Topic: Distributed and In-Database Analytics with R
Presenter: Debraj GuhaThakurta
Tags: R, Spark, SQL Server
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...Debraj GuhaThakurta
Event: TDWI Accelerate Seattle, October 16, 2017
Topic: Distributed and In-Database Analytics with R
Presenter: Debraj GuhaThakurta
Description: How to develop scalable and in-DB analytics using R in Spark and SQL-Server
Challenges of Building a First Class SQL-on-Hadoop EngineNicolas Morales
Challenges of Building a First Class SQL-on-Hadoop Engine:
Why and what is Big SQL 3.0?
Overview of the challenges
How we solved (some of) them
Architecture and interaction with Hadoop
Query rewrite
Query optimization
Future challenges
These are slides from the Dec 17 SF Bay Area Julia Users meeting [1]. Ehsan Totoni presented the ParallelAccelerator Julia package, a compiler that performs aggressive analysis and optimization on top of the Julia compiler. Ehsan is a Research Scientist at Intel Labs working on the High Performance Scripting project.
[1] http://www.meetup.com/Bay-Area-Julia-Users/events/226531171/
This talk gives details about Spark internals and an explanation of the runtime behavior of a Spark application. It explains how high level user programs are compiled into physical execution plans in Spark. It then reviews common performance bottlenecks encountered by Spark users, along with tips for diagnosing performance problems in a production application.
Similar to MADlib Architecture and Functional Demo on How to Use MADlib/PivotalR (20)
Here are the slides for Greenplum Chat #8. You can view the replay here: https://www.youtube.com/watch?v=FKFiyJDgdQk
The increased frequency and sophistication of high-profile data breaches and malicious hacking is putting organizations at continued risk of data theft and significant business disruption. Complicating this scenario is the unbounded growth of Big Data and petabyte-scale data storage, new open source database and distribution schemes, and the continued adoption of cloud services by enterprises.
Pivotal Greenplum customers often look for additional encryption of data-at-rest and data-in-motion. The massively parallel processing (MPP) architecture of Pivotal Greenplum provides an architecture that is unlike traditional OLAP on RDBMS for data warehousing, and encryption capabilities must address the scale-out architecture.
The Zettaset Big Data Encryption Suite has been designed for optimal performance and scalability in distributed Big Data systems like Greenplum Database and Apache HAWQ.
Here is a replay of our recent Greenplum Chat with Zettaset:
00:59 What is Greenplum’s approach for encryption and why Zettaset?
02:17 Results of field testing Zettaset with Greenplum
03:50 Introduction to Zettaset, the security company
05:36 Overview of Zettaset and their solutions
14:51 Different layers for encrypting data at rest
16:50 Encryption key management for big data
20:51 Zettaset BD Encrypt for data at rest and data in motion
22:19 How to mitigate encryption overhead with an MPP scale-out system
24:12 How to deploy BD Encrypt
25:50 Deep dive on data at rest encryption
30:44 Deep dive on data in motion encryption
36:72 Q: How does Zettaset deal with encrypting Greenplums multiple interfaces?
38:08 Q: Can I encrypt data for a particular column?
40:26 How Zettaset fits into a security strategy
41:21 Q: What is the performance impact on queries by encrypting the entire database?
43:28 How Zettaset helps Greenplum meet IT compliance requirements
45:12 Q: How authentication for keys is obtained
48:50 Q: How can Greenplum users try out Zettaset?
50:53 Q: What is a ‘Zettaset Security Coach’?
How to use the WAN Gateway feature of Apache Geode to implement multi-site and active-active failover, disaster recovery, and global scale applications.
#GeodeSummit Keynote: Creating the Future of Big Data Through 'The Apache Way"PivotalOpenSourceHub
Keynote at Geode Summit 2016 by Dr. Justin Erenkrantz, Bloolmberg LP. Creating the Future of Big Data Through "The Apache Way" and why this matters to the community
#GeodeSummit: Combining Stream Processing and In-Memory Data Grids for Near-R...PivotalOpenSourceHub
The financial sector is an exciting mix of challenges regarding throughput, high availability as well as specific constraints regarding latency and consistency. In the continuous evolution of its platform, Murex relies on open source technologies like Apache Geode and Apache Storm in a "kind of" lambda architecture to ensure storage, near-real time (around the milliseconds) aggregation of thousands of events per second, advanced notification mechanisms and on-demand deployments. This talk will focus on the technical architecture, the underlying principles as well as the technologies used to support this mix of functional and non-functional requirements.
In this session we review the design of the newly released off heap storage feature in Apache Geode, and discuss use cases and potential direction for additional capabilities of this feature.
In this session we review the design of the current capabilities of a partially completed feature in Apache Geode - the ability to act as a backend for Redis client applications. We’ll explore potential use cases and future direction that this capability might evolve.
#GeodeSummit - Integration & Future Direction for Spring Cloud Data Flow & GeodePivotalOpenSourceHub
In this session we review the design of the current state of support for Apache Geode by Spring Cloud Data Flow, and explore additional use cases and future direction that Spring Cloud Data Flow and Apache Geode might evolve.
In this session we review the design of the current capabilities of the Spring Data GemFire API that supports Geode, and explore additional use cases and future direction that the Spring API and underlying Geode support might evolve.
#GeodeSummit - Modern manufacturing powered by Spring XD and GeodePivotalOpenSourceHub
Wondering how to improve on your production yield, increase asset life and activate reliability centered maintenance? TEKsystems has developed “Golden Batch” recommendation engine to realize your goals of modern manufacturing. This is a Predictive analytics framework built on top of Manufacturing Data Lake for analysis and training of machine learning algorithms, and subsequent processing and detection of streaming data from sensors to detect or predict failures. We’ll present a solution architecture featuring Spring XD for data pipelining, Apache Geode for in-memory processing, Hadoop as a data lake, and R for machine learning.
#GeodeSummit - Using Geode as Operational Data Services for Real Time Mobile ...PivotalOpenSourceHub
One of the largest retailers in North America are considering Apache Geode for their new mobile loyalty application, to support their digital transformation effort. They would use Geode to provide operational data services for their mobile cloud service. This retailer needs to replace sluggish response times with sub-second response which will improved conversion rates. They also want to able to close the loop between data science findings and app experience. This way the right customer interaction is suggested when it is needed such as when customers are looking at their mobile app while walking in the store, or sending notifications at the individuals most likely shopping times. The final benefits of using Geode will include faster development cycles, increased customer loyalty, and higher revenue.
#GeodeSummit - Large Scale Fraud Detection using GemFire Integrated with Gree...PivotalOpenSourceHub
In this session we explore a case study of a large-scale government fraud detection program that prevents billions of dollars in fraudulent payments each year leveraging the beta release of the GemFire+Greenplum Connector, which is planned for release in GemFire 9. Topics will include an overview of the system architecture and a review of the new GemFire+Greenplum Connector features that simplify use cases requiring a blend of massively parallel database capabilities and accelerated in-memory data processing.
#GeodeSummit: Democratizing Fast Analytics with Ampool (Powered by Apache Geode)PivotalOpenSourceHub
Today, if events change the decision model, we wait until the next batch model build for new insights. By extending fast “time-to-decisions” into the world of Big Data Analytics to get fast “time-to-insights”, apps will get what used to be batch insights in near real time. The technology enabling this includes smart in-memory data storage, new storage class memory, and products designed to do one or more parts of an analysis pipeline very well. In this talk we describe how Ampool is building on Apache Geode to allow Big Data analysis solutions to work together with a scalable smart storage class memory layer to allow fast and complex end-to-end pipelines to be built -- closing the loop and providing dramatically lower time to critical insights.
#GeodeSummit: Architecting Data-Driven, Smarter Cloud Native Apps with Real-T...PivotalOpenSourceHub
This talk introduces an open-source solution that integrates cloud native apps running on Cloud Foundry with an open-source hybrid transactions + analytics real-time solution. The architecture is based on the fastest scalable, highly available and fully consistent In-Memory Data Grid (Apache Geode / GemFire), natively integrated to the first open-source massive parallel data warehouse (Greenplum Database) in a hybrid transactional and analytical architecture that is extremely fast, horizontally scalable, highly resilient and open source. This session also features a live demo running on Cloud Foundry, showing a real case of real-time closed-loop analytics and machine learning using the featured solution.
Apache Apex and Apache Geode are two of the most promising incubating open source projects. Combined, they promise to fill gaps of existing big data analytics platforms. Apache Apex is an enterprise grade native YARN big data-in-motion platform that unifies stream and batch processing. Apex is highly scalable, performant, fault tolerant, and strong in operability. Apache Geode provides a database-like consistency model, reliable transaction processing and a shared-nothing architecture to maintain very low latency performance with high concurrency processing. We will also look at some use cases where how these two projects can be used together to form distributed, fault tolerant, reliable in memory data processing layer.
#GeodeSummit - Where Does Geode Fit in Modern System ArchitecturesPivotalOpenSourceHub
In this talk, Eitan Suez explores the question: Where does Geode fit in an organization's system architecture? Geode is a unique and feature-rich product that perhaps hasn't seen as much adoption as it deserves. Today's apps are no longer the straightforward, database-backed web applications we used to build a few years ago. Applications have become more sophisticated, as they've had to meet the need to scale, to be reliable, fault-tolerant, and to integrate with other systems. In this talk, Eitan will suggest one particular fit for Geode in the context of a CQRS architecture, and welcomes you to attend, and to contribute by sharing how you've put Geode to use in your organization.
How Southwest Airlines Uses Geode
Distributed systems and fast data require new software patterns and implementation skills. Learn how Southwest Airlines uses Apache Geode, organizes team responsibilities, and approaches design tradeoffs. Drawing inspiration from real whiteboard conversations, we’ll explore: common development pitfalls, environment capacity planning, streaming data patterns like consumer checkpointing, support roles, and production lessons learned.
Every day, Apache Geode improves how Southwest Airlines schedules nearly 4,000 flights and serves over 500,000 passengers. It’s an essential component of Southwest’s ability to reduce flight delays and support future growth.
#GeodeSummit - Wall St. Derivative Risk Solutions Using GeodePivotalOpenSourceHub
In this talk, Andre Langevin discusses how Geode forms the core of many Wall Street derivative risk solutions. By externalizing risk from trading systems, Geode-based solutions provide cross-product risk management at speeds suitable for automated hedging, while simultaneously eliminating the back office costs associated with traditional trading system based solutions.
Building Apps with Distributed In-Memory Computing Using Apache GeodePivotalOpenSourceHub
Slides from the Meetup Monday March 7, 2016 just before the beginning of #GeodeSummit, where we cover an introduction of the technology and community that is Apache Geode, the in-memory data grid.
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...pchutichetpong
M Capital Group (“MCG”) expects to see demand and the changing evolution of supply, facilitated through institutional investment rotation out of offices and into work from home (“WFH”), while the ever-expanding need for data storage as global internet usage expands, with experts predicting 5.3 billion users by 2023. These market factors will be underpinned by technological changes, such as progressing cloud services and edge sites, allowing the industry to see strong expected annual growth of 13% over the next 4 years.
Whilst competitive headwinds remain, represented through the recent second bankruptcy filing of Sungard, which blames “COVID-19 and other macroeconomic trends including delayed customer spending decisions, insourcing and reductions in IT spending, energy inflation and reduction in demand for certain services”, the industry has seen key adjustments, where MCG believes that engineering cost management and technological innovation will be paramount to success.
MCG reports that the more favorable market conditions expected over the next few years, helped by the winding down of pandemic restrictions and a hybrid working environment will be driving market momentum forward. The continuous injection of capital by alternative investment firms, as well as the growing infrastructural investment from cloud service providers and social media companies, whose revenues are expected to grow over 3.6x larger by value in 2026, will likely help propel center provision and innovation. These factors paint a promising picture for the industry players that offset rising input costs and adapt to new technologies.
According to M Capital Group: “Specifically, the long-term cost-saving opportunities available from the rise of remote managing will likely aid value growth for the industry. Through margin optimization and further availability of capital for reinvestment, strong players will maintain their competitive foothold, while weaker players exit the market to balance supply and demand.”
Techniques to optimize the pagerank algorithm usually fall in two categories. One is to try reducing the work per iteration, and the other is to try reducing the number of iterations. These goals are often at odds with one another. Skipping computation on vertices which have already converged has the potential to save iteration time. Skipping in-identical vertices, with the same in-links, helps reduce duplicate computations and thus could help reduce iteration time. Road networks often have chains which can be short-circuited before pagerank computation to improve performance. Final ranks of chain nodes can be easily calculated. This could reduce both the iteration time, and the number of iterations. If a graph has no dangling nodes, pagerank of each strongly connected component can be computed in topological order. This could help reduce the iteration time, no. of iterations, and also enable multi-iteration concurrency in pagerank computation. The combination of all of the above methods is the STICD algorithm. [sticd] For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
As Europe's leading economic powerhouse and the fourth-largest hashtag#economy globally, Germany stands at the forefront of innovation and industrial might. Renowned for its precision engineering and high-tech sectors, Germany's economic structure is heavily supported by a robust service industry, accounting for approximately 68% of its GDP. This economic clout and strategic geopolitical stance position Germany as a focal point in the global cyber threat landscape.
In the face of escalating global tensions, particularly those emanating from geopolitical disputes with nations like hashtag#Russia and hashtag#China, hashtag#Germany has witnessed a significant uptick in targeted cyber operations. Our analysis indicates a marked increase in hashtag#cyberattack sophistication aimed at critical infrastructure and key industrial sectors. These attacks range from ransomware campaigns to hashtag#AdvancedPersistentThreats (hashtag#APTs), threatening national security and business integrity.
🔑 Key findings include:
🔍 Increased frequency and complexity of cyber threats.
🔍 Escalation of state-sponsored and criminally motivated cyber operations.
🔍 Active dark web exchanges of malicious tools and tactics.
Our comprehensive report delves into these challenges, using a blend of open-source and proprietary data collection techniques. By monitoring activity on critical networks and analyzing attack patterns, our team provides a detailed overview of the threats facing German entities.
This report aims to equip stakeholders across public and private sectors with the knowledge to enhance their defensive strategies, reduce exposure to cyber risks, and reinforce Germany's resilience against cyber threats.
Adjusting primitives for graph : SHORT REPORT / NOTESSubhajit Sahu
Graph algorithms, like PageRank Compressed Sparse Row (CSR) is an adjacency-list based graph representation that is
Multiply with different modes (map)
1. Performance of sequential execution based vs OpenMP based vector multiply.
2. Comparing various launch configs for CUDA based vector multiply.
Sum with different storage types (reduce)
1. Performance of vector element sum using float vs bfloat16 as the storage type.
Sum with different modes (reduce)
1. Performance of sequential execution based vs OpenMP based vector element sum.
2. Performance of memcpy vs in-place based CUDA based vector element sum.
3. Comparing various launch configs for CUDA based vector element sum (memcpy).
4. Comparing various launch configs for CUDA based vector element sum (in-place).
Sum with in-place strategies of CUDA mode (reduce)
1. Comparing various launch configs for CUDA based vector element sum (in-place).
7. 8Pivotal Confidential–Internal Use Only
Challenges in computing OLS solution
a b
c d
e f
g h
Segment 1
Segment 2
a c e g
b d f h
Segment1
Segment2
8. 9Pivotal Confidential–Internal Use Only
Challenges in computing OLS solution
a b
c d
e f
g h
a c e g
b d f h
a2+c2+e2+g2
=
Data across nodes
are multiplied
9. 10Pivotal Confidential–Internal Use Only
Challenges in computing OLS solution
a b
c d
e f
g h
a c e g
b d f h
a2+c2+e2+g2
=
Data across nodes
are multiplied!
ab+cd+ef+gh
10. 11Pivotal Confidential–Internal Use Only
Challenges in computing OLS solution
a b
c d
e f
g h
a c e g
b d f h
a2+c2+e2+g2
=
Looks like the
result can be
decomposed
ab+cd+ef+gh
b2+d2+f2+h2
ab+cd+ef+gh
11. 12Pivotal Confidential–Internal Use Only
Challenges in computing OLS solution
a b
c d
e f
g h
a c e g
b d f h
a2+c2+e2+g2
=
Data across nodes
are multiplied!
ab+cd+ef+gh
b2+d2+f2+h2
ab+cd+ef+gh
= +a b e
f
e f
a
b +c d
g
h
g hc
d
+
12. 13Pivotal Confidential–Internal Use Only
Linear Regression: Streaming Algorithm
How to compute with a single table scan?
XT
X
XT
y
-1
XTyXTX
+ +
-1
13. 14Pivotal Confidential–Internal Use Only
Problem solved? … Not Yet
Many ML solutions are iterative without analytical
formulations
Initialize problem
Perform single step
Has converged?
Return results
false
true
14. 15Pivotal Confidential–Internal Use Only
In general, use a convex optimization
framework
7 2.383 0.3904
6 2.869 0.4769
8 4.475 1.151
3 13.35 3.263
0 45.48 13.10
171.7 84.59
cution times
ure6: TheArchetypical Convex Function f(x) = x2
.
Application Objective
Least Squares
P
(u,y)2⌦(xT u − y)2
Lasso [38]
P
(u,y)2⌦(xT u − y)2 + µkxk1
Logisitic Regression
P
(u,y)2⌦log(1 + exp(−yxtu))
P T
Each step has an analytical formulation
that can be performed in parallel
Gradient Descent
Start at a random point
Repeat
Determine a descent direction
Choose a step size
Update the model
Until stopping criterion is satisfied
17. 18Pivotal Confidential–Internal Use Only 18Pivotal Confidential–Internal Use Only
But not all data scientists
speak SQL …
Accessing scalability through R
18. 19Pivotal Confidential–Internal Use Only
Why R?
O’Reilly: Strata 2013 Data Science Salary Survey
“The preponderance of R and Python usage is more surprising …
two most commonly used individual tools, even above Excel. R and Python are likely
popular because they are easily accessible and effective open source tools.”
That SQL/RDBisthetop bar isno surprise: accessingdataisthemeat
and potatoes of data analysis, and has not been displaced by other
tools. Thepreponderance of R and Python usageismoresurprising
—operating systems aside, these werethetwo most commonly used
individual tools, even aboveExcel, which for yearshasbeen thego-to
19. 20Pivotal Confidential–Internal Use Only
PivotalR: Bringing MADlib and HAWQ to a
familiar R interface
Challenge
Want to harness the familiarity of R’s interface and
the performance & scalability benefits of in-DB
analytics
d <- db.data.frame(”houses")
houses_linregr <-
madlib.lm(price ~ tax
+ bath
+ size
, data=d)
Pivotal R
SELECT madlib.linregr_train( 'houses’,
'houses_linregr’,
'price’,
'ARRAY[1, tax, bath, size]’);
SQL Code
20. 21Pivotal Confidential–Internal Use Only
PivotalR Design Overview
2. SQL to execute
3. Computation results
1. R SQL
RPostgreSQL
PivotalR
Data lives hereNo data here
Database/HAWQ
w/ MADlib
• Syntax is analogous to native R function
• Data doesn’t need to leave the database
• All heavy lifting, including model estimation
& computation, are done in the database
22. 23Pivotal Confidential–Internal Use Only
library(PivotalR)
db.connect(port = 14526, dbname = "madlib")
db.objects()
x <- db.data.frame("madlibtestdata.dt_abalone")
dim(x)
names(x)
x$rings
lookat(x, 10) # look at a sample of table
mean(x$rings)
lookat(mean(x$rings))
fit <- madlib.lm(rings ~ . - id | sex, data = y)
predict(fit, x)
mean((x$rings - predict(fit, x))^2)
x$sex <- as.factor(v$sex)
m0 <- madlib.glm(resp ~ age,
family="binomial", data=dbbank)
mstep <- step(m0, scope=list(
lower=~age,
upper=~age + factor(marital) +
factor(education) +
factor(housing) + factor(loan) +
factor(job)))
Load the Library
Connect to the database “madlib” on port 14526
List all the tables in the active connection
Create an R object that references a table in the database
Report #/rows and #/columns in the table
Column names within the table
Database query object representing “select rings from madlibtestdata.dt_abalone”
Pull 10 rows of data from the table back into the R environment
query object representing “select avg(rings) from madlibtestdata.dt_abalone”
execute the query and report back the result
Run a linear regression within the database and return a model object
Create a query object representing scoring the model in the database
Query object calculating the mean square error of the model
Add a calculated factor column to the database query object
Calculate a logistic regression model
Perform stepwise feature selection
Demonstration
23. 26Pivotal Confidential–Internal Use Only
Class hierarchy
db.obj
db.data.frame db.Rquery
db.table db.view
Wrapper of objects in database
x = db.data.frame("table")
Resides in R only
x[,1:2],
merge(x, y, by="column")
Operations/
MADlib
functions
lookat
as.db.data.frame
operation
24. 27Pivotal Confidential–Internal Use Only
Some of current features
A wrapper of MADlib
• Generalized linear models
(lm, glm)
• Elastic Net (elnet)
• Cross validation (generic.cv)
• ARIMA
• Tree methods
(rpart, randomforest)
• Table summary
• $ [ [[ $<- [<- [[<-
• is.na
+ - * /
%% %/% ^
• & | !
• == != > < >= <=
• merge
• by
• db.data.frame
• as.db.data.frame
• preview• sort
• c mean sum sd var min max
length colMeans colSums
• db.connect db.disconnect
db.list db.objects
db.existsObject delete
• dim
• names
• as.factor()
• content
And more ... (SQL wrapper)
• predict
25. 28Pivotal Confidential–Internal Use Only
We’re looking for contributors
• Browse our help pages
– Start page: madlib.net
– Github pages
• github.com/apache/incubator-madlib (SQL)
• github.com/pivotalsoftware/pivotalr (R)
• github.com/pivotalsoftware/pymadlib (Python)
• Use our product and report issues:
• https://issues.apache.org/jira/browse/MADLIB (Issue tracker)
• user@madlib.incubator.apache.org (User forum)
• dev@madlib.incubator.apache.org (Developer forum)
26. 29Pivotal Confidential–Internal Use Only
Credits
Leaders and contributors:
Gavin Sherry
Caleb Welton
Joseph Hellerstein
Christopher Ré
Zhe Wang
Florian Schoppmann
Hai Qian
Shengwen Yang
Xixuan Feng
and many others
…
27. 30Pivotal Confidential–Internal Use Only 30Pivotal Confidential–Internal Use Only
Thank you for your attention
Important links:
Product email: user@madlib.net
Product site: madlib.net
29. 32Pivotal Confidential–Internal Use Only
Performing a linear regression on 10 million rows in seconds
Hellerstein et al. "The MADlib analytics library: or MAD skills, the SQL." Proceedings of the VLDB
Endowment 5.12 (2012): 1700-1711.
30. 33Pivotal Confidential–Internal Use Only
Reminder: Linear-Regression Model
•
• If residuals i.i.d. Gaussians with standard deviation σ:
– max likelihood ⇔ min sum of squared residuals
• First-order conditions for the following quadratic objective
(in c)
yield the minimizer
31. 34Pivotal Confidential–Internal Use Only
Linear Regression: Streaming Algorithm
How to compute with a single table scan?
XT
X
XT
y
-1
XTX XTy
35. 38Pivotal Confidential–Internal Use Only
PivotalR vs PL/R
PivotalR
• Interface is R client
• Execution is in database
• Parallelism handled by
PivotalR
• Supports a portion of R
R> x = db.data.frame(“t1”)
R> l = madlib.lm(interlocks ~ assets + nation, data = t)
PL/R
• Interface is SQL client
• Execution is in R
• Parallelism via SQL
function invocation
• Supports all of R
psql> CREATE FUNCTION lregr() …
LANGUAGE PLR;
psql> SELECT lregr( array_agg(interlocks),
array_agg(assets),
array_agg(nation) )
FROM t1;
36. 39Pivotal Confidential–Internal Use Only
Parallelized R in Pivotal via PL/R:
An Example
SQL & R
R piggy-backs on Pivotal’s parallel architecture
Minimize data movement
Build predictive model for each state in parallel
TN
Data
CA
Data
NY
Data
PA
Data
TX
Data
CT
Data
NJ
Data
IL
Data
MA
Data
WA
Data
TN
Model
CA
Model
NY
Model
PA
Model
TX
Model
CT
Model
NJ
Model
IL
Model
MA
Model
WA
Model
Editor's Notes
For a table of 10 million rows
Num of independent variables (x axis) = number of columns in table.
So a vertical slice allows you to look at scalability.
See roughly linear scale in execution time with number of segments.
* e.g., 6 segments approx 200 sec, 24 segments approx 50 sec (4x faster)