The document discusses Oracle's Advanced Analytics Option which extends the Oracle Database into a comprehensive advanced analytics platform. It includes Oracle Data Mining for in-database predictive analytics and data mining, and Oracle R Enterprise which integrates the open-source R statistical programming language with the database. The option aims to bring algorithms to the data within the database to eliminate data movement and reduce total cost of ownership compared to traditional statistical environments.
AutoML - Heralding a New Era of Machine Learning - CASOUG Oct 2021 (Sandesh Rao)
The document discusses Oracle Machine Learning (OML) services on Oracle Autonomous Database. It provides an overview of the OML services REST API, which allows storing and deploying machine learning models. It enables scoring of models using REST endpoints for application integration. The API supports classification/regression of ONNX models from libraries like Scikit-learn and TensorFlow. It also provides cognitive text capabilities like topic discovery, keywords, sentiment analysis and text summarization.
Data Integration for Big Data (OOW 2016, Co-Presented With Oracle) (Rittman Analytics)
Oracle Data Integration Platform is a cornerstone for big data solutions that provides five core capabilities: business continuity, data movement, data transformation, data governance, and streaming data handling. It includes eight core products that can operate in the cloud or on-premise, and is considered the most innovative in areas like real-time/streaming integration and extract-load-transform capabilities with big data technologies. The platform offers a comprehensive architecture covering key areas like data ingestion, preparation, streaming integration, parallel connectivity, and governance.
A practical introduction to Oracle NoSQL Database - OOW2014 (Anuj Sahni)
Not familiar with Oracle NoSQL Database yet? This great product introduction session discusses the primary functionality included with the product as well as integration with other Oracle products. It includes a live demo that illustrates installation and configuration as well as data modeling and sample NoSQL application development.
The document is a presentation on Oracle NoSQL Database that discusses its use cases, Oracle's NoSQL and big data strategy, technical features of Oracle NoSQL Database, and customer references. The presentation covers how Oracle NoSQL Database can be used for real-time event processing, sensor data acquisition, fraud detection, recommendations, and globally distributed databases. It also discusses Oracle's approach to integrating NoSQL, Hadoop, and relational databases. Customer references are provided for Airbus's use of Oracle NoSQL Database for flight test sensor data storage and analysis.
The Art of Intelligence – Introduction Machine Learning for Oracle profession... (Lucas Jellema)
Our technology has gotten smart and fast enough to make predictions and come up with recommendations in near real time. Machine Learning is the art of deriving models from our Big Data collections – harvesting historic patterns and trends – and applying those models to new data in order to rapidly and adequately respond to that data. This presentation will explain and demonstrate in simple, straightforward terms and using easy to understand practical examples what Machine Learning really is and how it can be useful in our world of applications, integrations and databases. Hadoop and Spark, real time and streaming analytics, Watson and Cloud Datalab, Jupyter Notebooks and Citizen Data Scientists will all make their appearance, as will SQL.
Application development with Oracle NoSQL Database 3.0 (Anuj Sahni)
The document introduces table-based data modeling features for Oracle NoSQL Database. It discusses using tables to simplify application data modeling with familiar concepts like tables and data types. Examples show how to model user and email data using tables, including defining the schema using DDL, querying the data using DML, and indexing the tables. The document also provides an example of modeling user and email data from an email client application to illustrate how to approach data modeling.
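The user/email model described above can be sketched in miniature. This is a hypothetical illustration, not the presentation's actual schema: the field names, the two tables, and the `subject` index are assumptions, and plain Python dicts stand in for the key-value store and its secondary index.

```python
# Two NoSQL-style tables simulated as dicts keyed by their primary keys.
# Hypothetical schema: users(id PK, name); emails(user_id + msg_id PK, subject).
users = {}    # id -> row
emails = {}   # (user_id, msg_id) -> row

def put_user(uid, name):
    users[uid] = {"id": uid, "name": name}

def put_email(uid, mid, subject):
    emails[(uid, mid)] = {"user_id": uid, "msg_id": mid, "subject": subject}

def emails_by_subject(subject):
    # Stands in for a secondary index declared in the table DDL.
    return [row for row in emails.values() if row["subject"] == subject]

put_user(1, "alice")
put_email(1, 100, "invoice")
put_email(1, 101, "newsletter")
print(emails_by_subject("invoice"))  # one matching row, msg_id 100
```

The point of the table model is exactly this split: primary-key access is a direct lookup, while other access paths need a declared index rather than a scan.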
Notes on Data Governance in Hadoop. These are self-study notes compiled from the Hortonworks and Cloudera manuals. Please refer to cloudera.com for details.
Jethro data meetup: index-based SQL on Hadoop - oct-2014 (Eli Singer)
JethroData is an index-based SQL-on-Hadoop engine.
An architecture comparison of MPP full-scan SQL engines such as Impala and Hive with index-based access engines such as Jethro.
SQL and NoSQL NYC meetup Oct 20 2014
Boaz Raufman
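The full-scan versus index-based distinction above can be shown with a toy sketch (the data and selectivity are made up for illustration): a full-scan engine reads and tests every row, while an index-based engine precomputes a posting list per value and touches only matching rows.

```python
# Toy contrast between full-scan and index-based query access.
rows = [{"id": i, "country": "US" if i % 100 == 0 else "DE"}
        for i in range(10_000)]

def full_scan(pred):
    # MPP/full-scan style: every row is read and tested against the predicate.
    return [r for r in rows if pred(r)]

# Index-based style: a posting list per value, built once, consulted at query time.
index = {}
for r in rows:
    index.setdefault(r["country"], []).append(r["id"])

# Both strategies find the same 100 US rows; the index touches only those 100.
assert len(full_scan(lambda r: r["country"] == "US")) == len(index["US"])
```

For a selective predicate like this one (1% of rows), the index reads two orders of magnitude fewer rows, which is the trade-off the comparison above is about.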
Expand a Data Warehouse with Hadoop and Big Data (jdijcks)
After investing years in the data warehouse, are you now supposed to start over? Nope. This session discusses how to leverage Hadoop and big data technologies to augment the data warehouse with new data, new capabilities and new business models.
SQL on Hadoop
Looking for the correct tool for your SQL-on-Hadoop use case?
There is a long list of alternatives to choose from; how do you select the right one?
Tool selection should always be driven by the requirements of the use case.
Read more on alternatives and our recommendations.
The document discusses Teradata's portfolio for Hadoop, including the Teradata Aster Big Analytics Appliance, the Teradata Appliance for Hadoop, a commodity offering with Dell, and support for the Hortonworks Data Platform. It provides consulting, training, support, and managed services for Hadoop. Teradata SQL-H gives business users standard SQL access to data stored in Hadoop through Teradata, allowing queries to run quickly on Teradata while accessing data from Hadoop efficiently through HCatalog.
This document provides an overview of Oracle GoldenGate 12c, a heterogeneous replication tool. It describes GoldenGate's key features like real-time data integration and query offloading. The document outlines GoldenGate's topologies, architecture, supported databases, and data types. It compares GoldenGate to Oracle Streams and details new features in 12c like optimized capture methods and improved high availability. Basic concepts are explained, such as classic and integrated capture, downstream and bi-directional replication. Restrictions on data types and database features are also noted.
A presentation on the forthcoming 18c database, which incorporates the best of Oracle's technologies, shaping up an autonomous database.
5th in the AskTOM Office Hours series on graph database technologies. https://devgym.oracle.com/pls/apex/dg/office_hours/3084
PGQL: A Query Language for Graphs
Learn how to query graphs using PGQL, an expressive and intuitive graph query language that's a lot like SQL. With PGQL, it's easy to get going writing graph analysis queries to the database in a very short time. Albert and Oskar show what you can do with PGQL, and how to write and execute PGQL code.
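To make concrete what a PGQL pattern match computes, here is a sketch over an in-memory edge list. The data is invented, and the PGQL in the comment is paraphrased from memory rather than quoted from the session, so treat its exact grammar as an assumption.

```python
# Roughly what a PGQL-style query such as
#   SELECT b.name  MATCH (a) -[:knows]-> (b)  WHERE a.name = 'Alice'
# computes: follow 'knows' edges from a matching source vertex.
vertices = {1: {"name": "Alice"}, 2: {"name": "Bob"}, 3: {"name": "Carol"}}
edges = [(1, "knows", 2), (2, "knows", 3)]  # (source, label, destination)

def match_knows(source_name):
    return [vertices[dst]["name"]
            for src, label, dst in edges
            if label == "knows" and vertices[src]["name"] == source_name]

print(match_knows("Alice"))  # ['Bob']
```

The SQL-like appeal of PGQL is that the `(a) -[:knows]-> (b)` pattern replaces the explicit edge-list join spelled out by hand above.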
Data Driving Yahoo Mail Growth and Evolution with a 50 PB Hadoop Warehouse (DataWorks Summit)
Yahoo Mail has 200+ million users a month and generates hundreds of terabytes of data per day, which continues to grow steadily. The nature of email messages has also evolved: for example, today the majority of them are generated by machines, consisting of newsletters, social media notifications, purchase invoices, travel bookings, and the like, which drove innovations in product development to help users organize their inboxes.
Since 2014, the Yahoo Mail Data Engineering team took on the task of revamping the Mail data warehouse and analytics infrastructure in order to drive the continued growth and evolution of Yahoo Mail. Along the way we have built a 50 PB Hadoop warehouse, and surrounding analytics and machine learning programs that have transformed the way data plays in Yahoo Mail.
In this session we will share our experience from this three-year journey, covering the system architecture, the analytics systems we built, and the lessons learned from development and from driving adoption.
How Oracle has managed to separate the SQL engine of its flagship database, which processes the queries, from the access drivers that read data both from files on the Hadoop Distributed File System and from the data warehousing tool Hive.
Streaming Solutions for Real-time Problems (Abhishek Gupta)
The document is a presentation on streaming solutions for real-time problems using Apache Kafka, Kafka Streams, and Redis. It begins with an introduction and overview of the technologies. It then presents a sample monitoring application using metrics from multiple machines as a use case. The presentation demonstrates how to implement this application using Kafka as the event store, Kafka Streams for processing, and Redis as the state store. It also shows how to deploy the application components on Oracle Cloud.
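The core of the monitoring pipeline described above can be simulated without any brokers: the sketch below is a plain-Python stand-in, with a list of records playing the Kafka topic and a dict playing the Redis state store, and the machine IDs and metrics invented for illustration.

```python
# Per-machine metric aggregation, as the Kafka Streams topology would do it,
# simulated in memory. No Kafka or Redis involved.
events = [  # (machine_id, cpu_percent) records as they might arrive on a topic
    ("m1", 40), ("m2", 90), ("m1", 60), ("m2", 70),
]

state_store = {}  # machine_id -> (count, running_sum); Redis stand-in

def process(machine, cpu):
    # Read-modify-write of per-key state, the basic stateful-processing step.
    count, total = state_store.get(machine, (0, 0))
    state_store[machine] = (count + 1, total + cpu)

for machine, cpu in events:
    process(machine, cpu)

averages = {m: total / count for m, (count, total) in state_store.items()}
print(averages)  # {'m1': 50.0, 'm2': 80.0}
```

In the real deployment the same read-modify-write happens per record, but the topic, partitioning, and the externalized state store are what make it fault-tolerant and scalable.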
3rd in the AskTOM Office Hours series on graph database technologies. https://devgym.oracle.com/pls/apex/dg/office_hours/3084
See the magic of graphs in this session. Graph analysis can answer questions like detecting patterns of fraud or identifying influential customers - and do it quickly and efficiently. We’ll show you the APIs for accessing graphs and running analytics such as finding influencers, communities, anomalies, and how to use them from various languages including Groovy, Python, and JavaScript, with Jupyter and Zeppelin notebooks.
Albert Godfrind (EMEA Solutions Architect), Zhe Wu (Architect), and Jean Ihm (Product Manager) walk you through, and take your questions.
Format Wars: from VHS and Beta to Avro and Parquet (DataWorks Summit)
The document discusses different data storage formats such as text, Avro, Parquet, and their suitability for writing and reading data. It provides examples of how to choose a format based on factors like query needs, data types, and whether schemas need to evolve. The document also demonstrates how Avro can handle schema evolution by adding or changing fields while still reading existing data.
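The schema-evolution behavior mentioned above can be simulated in a few lines. This is a plain-Python sketch of the idea, not real Avro serialization: the field names and the default value are invented, and the dict merge stands in for Avro's reader/writer schema resolution.

```python
# Avro-style schema evolution in miniature: a reader schema adds a field with
# a default, so records written before the field existed still deserialize.
old_record = {"name": "alice", "age": 30}    # written under the v1 schema
reader_defaults = {"country": "unknown"}     # new v2 field, with its default

def read_with_schema(record, defaults):
    out = dict(defaults)  # start from defaults for fields the writer lacked
    out.update(record)    # fields the writer did supply take precedence
    return out

print(read_with_schema(old_record, reader_defaults))
```

This is why the evolving field must carry a default: without one, there is nothing for the reader to fill in when it meets data written under the older schema.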
This document provides an agenda and overview for a presentation on SQL on Hadoop. The presentation will cover various SQL on Hadoop technologies including Hive, HAWQ, Impala, SparkSQL, HBase with Phoenix, and Drill. It will also include an introduction, surveys to collect information from attendees, and discussions on networking and food. The hosts will provide background on their experience with big data and Hadoop.
Oracle Big Data Appliance and Big Data SQL for advanced analytics (jdijcks)
Overview presentation showing Oracle Big Data Appliance and Oracle Big Data SQL in combination, and why this really matters. Big Data SQL brings you the unique ability to analyze data across the entire spectrum of systems: NoSQL, Hadoop, and Oracle Database.
Introduction to Property Graph Features (AskTOM Office Hours part 1) (Jean Ihm)
1st in the AskTOM Office Hours series on graph database technologies. https://devgym.oracle.com/pls/apex/dg/office_hours/3084
Xavier Lopez (PM Senior Director) and Zhe Wu (Graph Architect) will share a brief intro to what property graphs can do for you, and take your questions - on property graphs or any other aspect of Oracle Database Spatial and Graph features. With property graphs, you can analyze relationships in Big Data like social networks, financial transactions, or IoT sensor networks; identify influencers; discover patterns of fraudulent behavior; recommend products, and much more -- right inside Oracle Database.
Partners 2013 LinkedIn Use Cases for Teradata Connectors for Hadoop (Eric Sun)
Teradata Connectors for Hadoop enable high-volume data movement between Teradata and Hadoop platforms. LinkedIn conducted a proof-of-concept using the connectors for use cases like copying clickstream data from Hadoop to Teradata for analytics and publishing dimension tables from Teradata to Hadoop for machine learning. The connectors help address challenges of scalability and tight processing windows for these large-scale data transfers.
This document provides an overview of Apache Atlas and how it addresses big data governance issues for enterprises. It discusses how Atlas provides a centralized metadata repository that allows users to understand data across Hadoop components. It also describes how Atlas integrates with Apache Ranger to enable dynamic security policies based on metadata tags. Finally, it outlines new capabilities in upcoming Atlas releases, including cross-component data lineage tracking and a business taxonomy/catalog.
R is a popular open-source statistical programming language and software environment for predictive analytics. It has a large community and ecosystem of packages that allow data scientists to solve various problems. Microsoft R Server is a scalable platform that allows R to handle large datasets beyond memory capacity by distributing computations across nodes in a cluster and storing data on disk in efficient column-based formats. It provides high performance through parallelization and rewriting algorithms in C++.
This document discusses the rise of open source analytics tools and languages. It notes that SAS and SPSS previously dominated the market but were very expensive. R, Python, and Hadoop have provided lower-cost open source alternatives for data storage, querying, visualization, and statistical analysis. The document reviews popular open source tools like R, Python, RapidMiner, and Hadoop ecosystems. It also discusses commercial offerings that build on open source like Revolution Analytics. Overall, open source has helped reduce the costs of analytics software and enabled more organizations to benefit from data-driven insights.
Slide deck: Enterprise-grade Data Analysis with Oracle R Enterprise - DOAG2014 (Nadine Schoene)
Slide deck for conference talk at DOAG2014 conference. In German only, translation available on request. Please have a look at the corresponding abstract.
High Performance Predictive Analytics in R and Hadoop (DataWorks Summit)
Hadoop is rapidly being adopted as a major platform for storing and managing massive amounts of data, and for computing descriptive and query types of analytics on that data. However, it has a reputation for not being a suitable environment for high performance complex iterative algorithms such as logistic regression, generalized linear models, and decision trees. At Revolution Analytics we think that reputation is unjustified, and in this talk I discuss the approach we have taken to porting our suite of High Performance Analytics algorithms to run natively and efficiently in Hadoop. Our algorithms are written in C++ and R, and are based on a platform that automatically and efficiently parallelizes a broad class of algorithms called Parallel External Memory Algorithms (PEMA’s). This platform abstracts both the inter-process communication layer and the data source layer, so that the algorithms can work in almost any environment in which messages can be passed among processes and with almost any data source. MPI and RPC are two traditional ways to send messages, but messages can also be passed using files, as in Hadoop. I describe how we use the file-based communication choreographed by MapReduce and how we efficiently access data stored in HDFS.
Presentation given by US Chief Scientist, Mario Inchiosa, at the June 2013 Hadoop Summit in San Jose, CA.
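The pattern behind Parallel External Memory Algorithms can be sketched with a running mean. This is a schematic illustration under assumed data, not Revolution Analytics' code: each chunk yields a small partial result, and partials merge associatively, so the same logic works whether chunks come from RAM, disk files, or MapReduce over HDFS splits.

```python
# External-memory pattern in miniature: per-chunk sufficient statistics that
# combine associatively, independent of where the chunks came from.
def partial(chunk):
    return (len(chunk), sum(chunk))        # (count, sum) for one chunk

def combine(a, b):
    return (a[0] + b[0], a[1] + b[1])      # order-independent merge

chunks = [[1.0, 2.0], [3.0], [4.0, 5.0, 6.0]]  # stand-ins for HDFS blocks
n, total = 0, 0.0
for c in chunks:
    n, total = combine((n, total), partial(c))

print(total / n)  # 3.5, the mean of all six values
```

Because `combine` is associative and commutative, the partials can be produced by mappers and merged by a reducer, which is exactly the file-based MapReduce choreography the abstract describes.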
Scaling up with Hadoop and banyan at ITRIX-2015, College of Engineering, Guindy (Rohit Kulkarni)
The document discusses LatentView Analytics and provides an overview of data processing frameworks and MapReduce. It introduces LatentView Analytics, describing its services, partners, and experience. It then discusses distributed and parallel processing frameworks, with examples like Hadoop, Spark, and Storm. It also provides a brief history of Hadoop, describing its key developments from 1999 to the present day in addressing challenges of indexing, crawling, and distributed processing. Finally, it explains the MapReduce process and provides a simple example to illustrate the mapping and reducing functions.
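The simple MapReduce example mentioned above is traditionally word count; a minimal sketch of the three phases (with a made-up corpus) looks like this:

```python
# Word count: the classic illustration of the map, shuffle, and reduce phases.
from collections import defaultdict

docs = ["big data", "big hadoop", "data data"]

# Map: emit a (word, 1) pair for every word occurrence.
pairs = [(word, 1) for doc in docs for word in doc.split()]

# Shuffle: group the emitted values by key.
grouped = defaultdict(list)
for word, count in pairs:
    grouped[word].append(count)

# Reduce: aggregate the grouped values per key.
counts = {word: sum(vals) for word, vals in grouped.items()}
print(counts)  # {'big': 2, 'data': 3, 'hadoop': 1}
```

In a real Hadoop job the shuffle is performed by the framework between the map and reduce tasks; here it is spelled out explicitly to show all three steps.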
The document discusses using semantic technology to build an enterprise information web (EIW) through the use of ontologies. It describes how domain ontologies, relational mapping ontologies, and other ontologies can be used to semantically describe and link information from across different business units and data sources. This semantic layer would allow for advanced querying, analysis, and federation of enterprise information through standards like SPARQL. The goal is to solve the "information federation problem" and overcome existing data silos by making all enterprise information easily accessible and understandable through semantic descriptions.
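The federation idea above can be made concrete with a toy triple store: facts from different silos become subject-predicate-object triples, and one pattern query joins across them. The entities and predicate names below are invented for illustration, and the query function is a plain-Python stand-in for SPARQL pattern matching.

```python
# A toy triple store joining facts that originate in different systems.
triples = [
    ("emp:42", "worksIn", "dept:sales"),      # e.g. from the HR database
    ("dept:sales", "locatedIn", "city:nyc"),  # e.g. from the facilities system
]

def query(subject=None, predicate=None, obj=None):
    # None acts as a SPARQL-style variable: match anything in that position.
    return [t for t in triples
            if (subject is None or t[0] == subject)
            and (predicate is None or t[1] == predicate)
            and (obj is None or t[2] == obj)]

# Two-hop join: where does employee 42 work, and where is that department?
dept = query(subject="emp:42", predicate="worksIn")[0][2]
city = query(subject=dept, predicate="locatedIn")[0][2]
print(city)  # city:nyc
```

The ontology layer's job, in these terms, is to guarantee that `emp:42` and `dept:sales` mean the same thing regardless of which source system contributed the triple.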
Microsoft R Server for Distributed Computing (BAINIDA)
The document introduces Microsoft R Server and Microsoft R Open. It discusses that R is a popular open source programming language and platform for statistics, analytics, and data science. Microsoft R Server allows for distributed computing on big data using R and brings enterprise-grade support and capabilities to the open source R platform. It can perform analytics both in-database using SQL Server and in Hadoop environments without moving data.
This document discusses scaling R to enterprise data using Oracle's Big Data Analytics solutions. It describes Oracle R Enterprise for performing advanced analytics on large datasets within the database using R. It also describes the Oracle R Connector for Hadoop for accessing and manipulating data stored in Hadoop from R. The document provides examples of loading, preparing, analyzing and modeling data on both relational and HDFS data using Oracle R. It highlights the performance advantages of in-database analytics and discusses deploying R models and scripts to production.
This presentation discusses the following topics:
Basic features of R
Exploring R GUI
Data Frames & Lists
Handling Data in R Workspace
Reading Data Sets & Exporting Data from R
Manipulating & Processing Data in R
This document discusses R programming and compares it to Python. R is an open-source programming language commonly used for statistical analysis and visualization. It has many libraries that enable data analysis and machine learning. The document compares key aspects of R and Python, such as their creators, release years, software environments, usability, and pros and cons. It concludes that R is easy to learn and offers powerful graphics and statistical techniques through libraries, making it well-suited for data analysis applications.
This document discusses data management trends and Oracle's unified data management solution. It provides a high-level comparison of HDFS, NoSQL, and RDBMS databases. It then describes Oracle's Big Data SQL which allows SQL queries to be run across data stored in Hadoop. Oracle Big Data SQL aims to provide easy access to data across sources using SQL, unified security, and fast performance through smart scans.
For the past several decades the rising tide of technology -- especially the increasing speed of single processors -- has allowed the same data analysis code to run faster and on bigger data sets. That happy era is ending. The size of data sets is increasing much more rapidly than the speed of single cores, of I/O, and of RAM. To deal with this, we need software that can use multiple cores, multiple hard drives, and multiple computers.
That is, we need scalable data analysis software. It needs to scale from small data sets to huge ones, from using one core and one hard drive on one computer to using many cores and many hard drives on many computers, and from using local hardware to using remote clouds.
R is the ideal platform for scalable data analysis software. It is easy to add new functionality in the R environment, and easy to integrate it into existing functionality. R is also powerful, flexible and forgiving.
I will discuss the approach to scalability we have taken at Revolution Analytics with our package RevoScaleR. A key part of this approach is to efficiently operate on "chunks" of data -- sets of rows of data for selected columns. I will discuss this approach from the point of view of:
- Storing data on disk
- Importing data from other sources
- Reading and writing of chunks of data
- Handling data in memory
- Using multiple cores on single computers
- Using multiple computers
- Automatically parallelizing "external memory" algorithms
This document discusses big data and use cases. It begins by reviewing the history and evolution of big data and advanced analytics. It then explains how technologies like Hadoop, stream processing, and in-memory computing support big data solutions. The document presents two use cases - analyzing credit risk by examining customer transaction data to improve credit offers, and detecting fraud by analyzing financial transactions for unusual patterns that could indicate suspicious activity. It describes how these use cases leverage technologies like Oracle R Connector for Hadoop to run analytics and machine learning algorithms on large datasets.
TDWI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ... (Debraj GuhaThakurta)
Event: TDWI Accelerate Seattle, October 16, 2017
Topic: Distributed and In-Database Analytics with R
Presenter: Debraj GuhaThakurta
Description: How to develop scalable and in-DB analytics using R in Spark and SQL-Server
R is a popular statistical programming language used for data analysis and machine learning. It has over 3 million users and is taught widely in universities. While powerful, R has some scaling limitations for big data. Several Apache Spark integrations with R like SparkR and sparklyr enable distributed, parallel processing of large datasets using R on Spark clusters. Other options for scaling R include H2O for in-memory analytics, Microsoft ML Server for on-premises scaling, and ScaleR for portable parallel processing across platforms. These solutions allow R programs and models to be trained on large datasets and deployed for operational use on big data in various cloud and on-premises environments.
27 Aug 2013 webinar: High Performance Predictive Analytics in Hadoop and R, presented by Mario E. Inchiosa, PhD, US Data Scientist, and Kathleen Rohrecker, Director of Product Marketing
Mankind has stored more than 295 billion gigabytes (295 exabytes) of data since 1986, according to a report by the University of Southern California. Storing and monitoring this data 24/7 in widely distributed environments is a huge task for global service organizations. These datasets require high processing power that traditional databases cannot offer, as the data is stored in an unstructured format. Although the MapReduce paradigm can address this problem using Java-based Hadoop, it does not provide maximum functionality. Its drawbacks can be overcome with Hadoop streaming techniques, which allow users to define non-Java executables for processing these datasets. This paper proposes a THESAURUS model, which allows a faster and easier form of business analysis.
Programming data access is probably one of the most common tasks when building enterprise solutions. One way or another, we have to store state and data, and relational databases are probably the most common form of storage. The drawback is that object-oriented programming does not fit relational databases particularly well.
This talk goes through the problems that arise when designing the data layer, and how best to bridge the gap between classes in a program and tables in a database.
The Cloud topic is everywhere, not only for big software companies, but also for our customers and of course for all service providers.
How do you move from traditional IT to a full Cloud environment, and how do you manage the transition phase?
We show you the Trivadis Cloud transition approach, standardized and proven, which leads you into a safe and optimized usage of cloud services in your daily business.
It’s all about Data - a Trivadis core competence for decades - no matter which deployment model we choose.
In this presentation we shed light on various Cloud strategies and concrete technological aspects.
Independent of the source of data, the integration of event streams into an Enterprise Architecture gets more and more important in the world of sensors, social media streams and Internet of Things. Events have to be accepted quickly and reliably; they have to be distributed and analyzed, often with many consumers or systems interested in all or part of the events. Depending on the size and quantity of such events, this can quickly reach Big Data scale. How can we efficiently collect and transmit these events? How can we make sure that we can always report over historical events? How can these new events be integrated into a traditional infrastructure and application landscape?
Starting with a product and technology neutral reference architecture, we will then present different solutions using Open Source frameworks and the Oracle Stack both for on premises as well as the cloud.
In this session we will present the different ways of using SQL Server in a Cloud infrastructure (Microsoft Azure). We will cover hybrid scenarios, migration, backup, and hosting of SQL Server databases in IaaS or PaaS mode.
During this presentation, we will introduce basic concepts of data science and discuss a project carried out at one of our customers.
We will see how data science projects can easily be carried out using the statistical programming language R, as well as its integration into the new Microsoft SQL Server 2016 suite.
This session shows you how you can use Microsoft Azure to build a highly scalable solution for event processing. You can use this approach for classical IoT scenarios or, for example, to capture telemetry data of a widely distributed application. Each application instance will then send data to Azure's Event Hub. In this session you will not only get some insights into the Event Hub, but also into Stream Analytics. Stream Analytics is used to aggregate the millions of events coming from the Event Hub by using a SQL-like syntax. From Stream Analytics the data can be pushed into a database or, for example, into a live dashboard in Microsoft's Power BI.
The goal is to share with the audience proven knowledge and experience in the design, implementation, and operation of DBaaS platforms. The presentation includes examples and explanations of consolidated database environments delivering uncompromised performance, scalability, and flexibility in connection with time-to-market and cost-effectiveness.
Today, companies are using various channels to communicate with their customers. As a consequence, a lot of data is created, more and more also outside of the traditional IT infrastructure of an enterprise. This data often does not have a common format and is continuously created with ever increasing volume. With the Internet of Things (IoT) and its sensors, the volume as well as the velocity of data just gets more extreme.
To achieve a complete and consistent view of a customer, all this customer-related information has to be included in a 360 degree view in a real-time or near-real-time fashion. By that, the Customer Hub will become the Customer Event Hub. It constantly shows the actual view of a customer over all his interaction channels and provides an enterprise the basis for a substantial and effective customer relation.
This presentation shows the value of such a platform and how it can be implemented.
This session is a report on the experience of migrating 400 databases to Oracle 12c. So far, 300 databases have been migrated, with both good and bad surprises! This session presents the situations we encountered during these migrations. The following points are covered:
- The strategy put in place for the version upgrade
- The problems encountered during the migration
- Bugs and wrong results
- Problems with the new features of the Oracle Optimizer
- The most appreciated new features
Attendees will get an overview of an upgrade project to Oracle 12c, applicable not only to large projects but to all types of Oracle 12c migration projects.
An introduction to Apache Cassandra compared with traditional RDBMSs: the similarities and the differences, as well as some of the tools available in the Cassandra ecosystem. A quick overview of the NoSQL ecosystem opens the presentation.
Reports on data tell only part of the story. To make correct decisions, additional information is needed, but most of it, especially documents and information outside databases, is not picked up by BI reports. With the portal we visualize the IoT data with Power BI and add value by showing reports, documents, and additional information in one place. Users get a true "single point of information" for the topic. An example with a demo will be shown.
While everyone has heard of smart grids, the concept of the microgrid is less well known. A microgrid is a small network powered by new renewable energies (NRE). The intermittent production of these energies requires rethinking how the electrical grid is managed. Data mining serves as a lever to better control and exploit the multitude of data brought by the smart-grid era. These advanced data-mining skills make it possible, in particular, to establish prediction methods that prove crucial for optimizing the use of NRE production through storage. System integrators collect the information from smart meters and pass it to data-mining processes in order to forecast, to the quarter-hour, the consumption and production of a building. A presentation of concrete techniques and projects in the service of the energy transition.
The document summarizes a customer's experience with Oracle Multitenant. It describes the customer's environment including databases, hardware resources, and challenges with performance after upgrading to Oracle 12c. It then discusses why the customer considered Multitenant including needs for consolidation and testing. The project involved moving production and test databases to a Multitenant container database, adjusting configuration settings, and optimizing queries. The results were improved performance and ability to scale resources. New features in Oracle 12.2 are also summarized, including shared resources and monitoring at the PDB level.
The emergence of SMART systems such as smart cities, home automation, and other connected objects represents a substantial advance in the efficiency of the information world. We are moving from an era of static information, where decisions must be made by the user, to a dynamic era where the machine can make certain decisions itself. The potential of this "small" paradigm shift is simply enormous. Its limit lies in our ability to formalize and transmit our intelligence to this new type of system. Only perfect mastery of the data, and of the mechanisms that generate it, will allow the full potential of this new era to be realized. That mastery is governance.
Big Data and Fast Data combined: is it possible? An introduction to Big Data architectures. Ulises Fasoli, Senior Consultant, Trivadis. Talk given at the Swiss Data Forum, 24 November 2015, Lausanne.
With biGenius® on Azure, forget the technology and focus your efforts on the business. Patricia Düggeli, Principal Consultant, Trivadis. Talk given at the Swiss Data Forum, 24 November 2015, Lausanne.
An introduction to data governance. Philippe Bourgeois, Senior Consultant, Trivadis. Talk given at the Swiss Data Forum, 24 November 2015, Lausanne.
The Swiss Data Cloud, as seen by the operator UPC Cablecom Business. Laurent Fine, Large Account Manager, UPC Cablecom. Talk given at the Swiss Data Forum, 24 November 2015, Lausanne.
IoT: lessons learned from customer projects in the IoT domain. Michael Epprecht, Technical Specialist in the Global Black Belt IoT Team at Microsoft. Talk given at the Swiss Data Forum, 24 November 2015, Lausanne.
Orchestrating the Future: Navigating Today's Data Workflow Challenges with Ai... (Kaxil Naik)
Navigating today's data landscape isn't just about managing workflows; it's about strategically propelling your business forward. Apache Airflow has stood out as the benchmark in this arena, driving data orchestration forward since its early days. As we dive into the complexities of our current data-rich environment, where the sheer volume of information and its timely, accurate processing are crucial for AI and ML applications, the role of Airflow has never been more critical.
In my journey as the Senior Engineering Director and a pivotal member of Apache Airflow's Project Management Committee (PMC), I've witnessed Airflow transform data handling, making agility and insight the norm in an ever-evolving digital space. At Astronomer, our collaboration with leading AI & ML teams worldwide has not only tested but also proven Airflow's mettle in delivering data reliably and efficiently—data that now powers not just insights but core business functions.
This session is a deep dive into the essence of Airflow's success. We'll trace its evolution from a budding project to the backbone of data orchestration it is today, constantly adapting to meet the next wave of data challenges, including those brought on by Generative AI. It's this forward-thinking adaptability that keeps Airflow at the forefront of innovation, ready for whatever comes next.
The ever-growing demands of AI and ML applications have ushered in an era where sophisticated data management isn't a luxury—it's a necessity. Airflow's innate flexibility and scalability are what makes it indispensable in managing the intricate workflows of today, especially those involving Large Language Models (LLMs).
This talk isn't just a rundown of Airflow's features; it's about harnessing these capabilities to turn your data workflows into a strategic asset. Together, we'll explore how Airflow remains at the cutting edge of data orchestration, ensuring your organization is not just keeping pace but setting the pace in a data-driven future.
Session in https://budapestdata.hu/2024/04/kaxil-naik-astronomer-io/ | https://dataml24.sessionize.com/session/667627
"Financial Odyssey: Navigating Past Performance Through Diverse Analytical Lens" (sameer shah)
Embark on a captivating financial journey with 'Financial Odyssey,' our hackathon project. Delve deep into the past performance of two companies as we employ an array of financial statement analysis techniques. From ratio analysis to trend analysis, uncover insights crucial for informed decision-making in the dynamic world of finance."
Predictably Improve Your B2B Tech Company's Performance by Leveraging Data (Kiwi Creative)
Harness the power of AI-backed reports, benchmarking and data analysis to predict trends and detect anomalies in your marketing efforts.
Peter Caputa, CEO at Databox, reveals how you can discover the strategies and tools to increase your growth rate (and margins!).
From metrics to track to data habits to pick up, enhance your reporting for powerful insights to improve your B2B tech company's marketing.
- - -
This is the webinar recording from the June 2024 HubSpot User Group (HUG) for B2B Technology USA.
Watch the video recording at https://youtu.be/5vjwGfPN9lw
Sign up for future HUG events at https://events.hubspot.com/b2b-technology-usa/
Open Source Contributions to Postgres: The Basics, POSETTE 2024 (ElizabethGarrettChri)
Postgres is the most advanced open-source database in the world and it's supported by a community, not a single company. So how does this work? How does code actually get into Postgres? I recently had a patch submitted and committed and I want to share what I learned in that process. I’ll give you an overview of Postgres versions and how the underlying project codebase functions. I’ll also show you the process for submitting a patch and getting that tested and committed.
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake (Walaa Eldin Moustafa)
Dynamic policy enforcement is becoming an increasingly important topic in today’s world where data privacy and compliance is a top priority for companies, individuals, and regulators alike. In these slides, we discuss how LinkedIn implements a powerful dynamic policy enforcement engine, called ViewShift, and integrates it within its data lake. We show the query engine architecture and how catalog implementations can automatically route table resolutions to compliance-enforcing SQL views. Such views have a set of very interesting properties: (1) They are auto-generated from declarative data annotations. (2) They respect user-level consent and preferences (3) They are context-aware, encoding a different set of transformations for different use cases (4) They are portable; while the SQL logic is only implemented in one SQL dialect, it is accessible in all engines.
#SQL #Views #Privacy #Compliance #DataLake
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W... (Social Samosa)
The Modern Marketing Reckoner (MMR) is a comprehensive resource packed with POVs from 60+ industry leaders on how AI is transforming the 4 key pillars of marketing – product, place, price and promotions.
End-to-end pipeline agility - Berlin Buzzwords 2024 (Lars Albertsson)
We describe how we achieve high change agility in data engineering by eliminating the fear of breaking downstream data pipelines through end-to-end pipeline testing, and by using schema metaprogramming to safely eliminate boilerplate involved in changes that affect whole pipelines.
A quick poll on agility in changing pipelines from end to end indicated a huge span in capabilities. For the question "How long time does it take for all downstream pipelines to be adapted to an upstream change," the median response was 6 months, but some respondents could do it in less than a day. When quantitative data engineering differences between the best and worst are measured, the span is often 100x-1000x, sometimes even more.
A long time ago, we suffered at Spotify from fear of changing pipelines due to not knowing what the impact might be downstream. We made plans for a technical solution to test pipelines end-to-end to mitigate that fear, but the effort failed for cultural reasons. We eventually solved this challenge, but in a different context. In this presentation we will describe how we test full pipelines effectively by manipulating workflow orchestration, which enables us to make changes in pipelines without fear of breaking downstream.
Making schema changes that affect many jobs also involves a lot of toil and boilerplate. Using schema-on-read mitigates some of it, but has drawbacks since it makes it more difficult to detect errors early. We will describe how we have rejected this tradeoff by applying schema metaprogramming, eliminating boilerplate but keeping the protection of static typing, thereby further improving agility to quickly modify data pipelines without fear.
Codeless Generative AI Pipelines
(GenAI with Milvus)
https://ml.dssconf.pl/user.html#!/lecture/DSSML24-041a/rate
Discover the potential of real-time streaming in the context of GenAI as we delve into the intricacies of Apache NiFi and its capabilities. Learn how this tool can significantly simplify the data engineering workflow for GenAI applications, allowing you to focus on the creative aspects rather than the technical complexities. I will guide you through practical examples and use cases, showing the impact of automation on prompt building. From data ingestion to transformation and delivery, witness how Apache NiFi streamlines the entire pipeline, ensuring a smooth and hassle-free experience.
Timothy Spann
https://www.youtube.com/@FLaNK-Stack
https://medium.com/@tspann
https://www.datainmotion.dev/
milvus, unstructured data, vector database, zilliz, cloud, vectors, python, deep learning, generative ai, genai, nifi, kafka, flink, streaming, iot, edge
1. BÂLE BERNE BRUGG DUSSELDORF FRANCFORT S.M. FRIBOURG E.BR. GENÈVE
HAMBOURG COPENHAGUE LAUSANNE MUNICH STUTTGART VIENNE ZURICH
A Gentle Introduction to
Oracle R Enterprise
Lausanne, 24 November 2015
Christian Antognini
Senior Principal Consultant
2. @ChrisAntognini
Senior principal consultant, trainer and partner at Trivadis
– christian.antognini@trivadis.com
– http://antognini.ch
Focus: get the most out of Oracle Database
– Logical and physical database design
– Query optimizer
– Application performance management
Author of Troubleshooting Oracle Performance (Apress, 2008/14)
OakTable Network, Oracle ACE Director
3. What Is R?
R is a language and environment for statistical computing and graphics.
It is a GNU project.
R provides a wide variety of statistical (linear and nonlinear modelling, classical statistical tests, time-series analysis, classification, clustering, …) and graphical techniques, and is highly extensible.
Source: https://www.r-project.org/about.html
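For readers new to R, a minimal illustrative sketch of the statistical and graphical techniques mentioned above, using only the built-in cars data set:

```r
fit <- lm(dist ~ speed, data = cars)  # simple linear model on a built-in data set
summary(fit)                          # classical statistical output

plot(cars)                            # basic graphics ...
abline(fit)                           # ... with the fitted regression line
```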
6. R Technologies from Oracle
Oracle has adopted R as a language and environment for performing statistical data
analysis and advanced analytics, as well as generating sophisticated graphics
Oracle provides R integration through four key technologies:
– Oracle R Distribution
– ROracle
– Oracle R Enterprise (ORE)
– Oracle R Advanced Analytics for Hadoop (ORAAH)
7. Oracle R Distribution
Oracle's distribution of open source R
Free download
Support provided to customers of the Oracle Advanced Analytics option, Oracle Linux, and the Oracle Big Data Appliance
8. ROracle
Open source R package providing a DBI-compliant driver for Oracle Database
Based on the OCI library
It’s publicly available on CRAN and is maintained by Oracle
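As a sketch of how ROracle is used, the following follows the standard DBI pattern; the credentials, connect string, and queried table are placeholders, not part of the original slides:

```r
library(ROracle)

# Hypothetical connection: user, password, and connect string are placeholders
drv <- dbDriver("Oracle")
con <- dbConnect(drv, username = "scott", password = "tiger",
                 dbname = "//db-host:1521/ORCLPDB")

# Run a query through the DBI-compliant driver; results come back as a data.frame
emp <- dbGetQuery(con, "SELECT empno, ename, sal FROM emp WHERE sal > :1",
                  data = data.frame(sal = 2000))

dbDisconnect(con)
dbUnloadDriver(drv)
```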
9. Oracle R Enterprise (ORE)
It’s a component, along with Data Mining, of the Oracle Advanced Analytics option
It’s a set of R packages and Oracle Database features
– Run R commands and scripts for analyses on data stored in Oracle Database
– Translate R operations into SQL
– One or more R engines run on the database server
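A minimal ORE session might look like the following sketch; the connection details and the EMP table are assumptions for illustration. Standard R operations on the proxy object are translated into SQL and executed in the database:

```r
library(ORE)

# Connect to the database (credentials and host are placeholders)
ore.connect(user = "rquser", sid = "orcl", host = "db-host",
            password = "secret", port = 1521)

ore.sync()    # refresh the list of visible database tables
ore.attach()  # expose them as proxy objects by name

# EMP is an ore.frame proxy; these R calls are translated into SQL
head(EMP)
aggregate(EMP$SAL, by = list(EMP$DEPTNO), FUN = mean)

ore.disconnect()
```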
10. Oracle R Advanced Analytics for Hadoop (ORAAH)
It’s one of the components in the Oracle Big Data Software Connectors Suite, an option to the Big Data Appliance (BDA)
It provides an R interface to access HDFS and the MapReduce programming framework
– Data manipulation
– Writing mapper and reducer functions
– Invocation of Hadoop jobs
12. Architecture
[Diagram] On the client side, an R engine loaded with the ORE packages sends SQL to Oracle Database and receives results; on the database server, one or more spawned R engines (each with the ORE packages) execute R code and return their results.
13. Advantages of Oracle R Enterprise (According to Oracle)
Operate on database-resident data without using SQL
Eliminate data movement
Keep data secure
Use the power of the database
Use current data
Prepare data in the database
Save R objects in the database
Build models in the database
Score data in the database
Execute R scripts in the database
Integrate with the Oracle technology stack
14. ore.frame Class
An ore.frame object represents a relational query for an Oracle Database instance
Typically, you get ore.frame objects that are proxies for database tables
An ore.frame object can be ordered or unordered
– This is an important difference compared to an R data.frame, which always has an explicit order
– Relational data must be explicitly ordered
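To make the proxy behaviour concrete, a hedged sketch (the EMP table is an assumed example, and an ORE connection is assumed to be open) showing that an ore.frame stays in the database until explicitly pulled:

```r
# EMP is an ore.frame proxy for a database table (placeholder name)
class(EMP)                  # "ore.frame": a proxy backed by a relational query

# Filtering produces another ore.frame; no data is moved to the client
well_paid <- EMP[EMP$SAL > 2000, c("ENAME", "SAL")]

local_df <- ore.pull(well_paid)  # only now is data fetched locally
class(local_df)                  # "data.frame"
```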
15. Persisted R Objects
R objects (incl. ORE proxy objects) exist for the duration of the current R session
The standard R functions for saving and restoring R objects, save and load, can’t be used with ORE proxy objects
– The database objects associated with them aren’t persisted
To persist them, ORE provides datastores that store data in the database
– The ore.save and ore.load functions are available
– Plain R objects can be persisted as well
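The datastore API can be sketched as follows; the datastore name is illustrative and an active ORE connection is assumed:

```r
# An ordinary R object, built client-side from a built-in data set
mod <- lm(Sepal.Length ~ Sepal.Width, data = iris)

ore.save(mod, name = "my_datastore")  # persist it in a database datastore
rm(mod)                               # gone from the session ...

ore.datastore()                       # ... but listed among the datastores
ore.load("my_datastore")              # restore 'mod' into the R session
```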
16. Preparing and Exploring Data in the Database
Selecting Data
Indexing Data
Combining Data
Summarizing Data
Transforming Data
Sampling Data
Partitioning Data
Preparing Time Series Data
Correlating Data
Cross-Tabulating Data
Analyzing the Frequency of Cross-Tabulations
Building Exponential Smoothing Models on Time Series Data
Ranking Data
Sorting Data
Analyzing Distribution of Numeric Variables
17. Building Models and Predictions
Two categories of models are provided:
– Oracle R Enterprise models (OREmodels package: linear regression, generalized linear model, neural network)
– Oracle Data Mining models (OREdm package: association rules, decision trees, Naïve Bayes, k-means, …)
The ore.predict function is able to score data in ore.frame objects
– Degree of parallelism can be manually set
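As a hedged sketch of in-database modelling and scoring (the proxy names are illustrative, and an ORE connection is assumed):

```r
# Push the built-in iris data into a temporary in-database table
IRIS <- ore.push(iris)

# An OREmodels regression, built and stored in the database
fit <- ore.lm(Sepal.Length ~ Sepal.Width, data = IRIS)
summary(fit)

# ore.predict can also score a plain R model against an ore.frame,
# keeping the predictions in the database
local_fit <- lm(Sepal.Length ~ Sepal.Width, data = iris)
scored <- ore.predict(local_fit, newdata = IRIS)
head(scored)
```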
18. ORE Embedded R Execution
It enables storing and invoking R scripts in the Oracle Database server
– Both an R and a SQL API exist
When invoked, a script executes in one or more R engines that run on the database server
– Degree of parallelism can be manually set
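Through the R API, embedded execution can be sketched like this; the script name and function body are illustrative, and an ORE connection is assumed:

```r
# Store a script in the database script repository
ore.scriptCreate("random_summary", function(n = 100) {
  summary(rnorm(n))   # runs inside an R engine spawned on the database server
})

# Invoke it server-side by name and fetch the result to the client
res <- ore.doEval(FUN.NAME = "random_summary", n = 1000)
ore.pull(res)
```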
19. Core Messages
Easy to install
Simple to use
Expensive
A more in-depth analysis is required to judge performance and stability