Mark Rittman, CTO of Rittman Mead, gave a keynote presentation on big data for Oracle developers and DBAs with a focus on Apache Spark, real-time analytics, and predictive analytics. He discussed how Hadoop can provide flexible, cheap storage for logs, feeds, and social data. He also explained several Hadoop processing frameworks like Apache Spark, Apache Tez, Cloudera Impala, and Apache Drill that provide faster alternatives to traditional MapReduce processing.
Cortana Analytics Workshop: The "Big Data" of the Cortana Analytics Suite, Pa...MSAdvAnalytics
Lance Olson. Cortana Analytics is a fully managed big data and advanced analytics suite that helps you transform your data into intelligent action. Come to this two-part session to learn how you can do "big data" processing and storage in Cortana Analytics. In the first part, we will provide an overview of the processing and storage services. We will then talk about the patterns and use cases which make up most big data solutions. In the second part, we will go hands-on, showing you how to get started today with writing batch/interactive queries, real-time stream processing, or NoSQL transactions all over the same repository of data. Crunch petabytes of data by scaling out your computation power to any sized cluster. Store any amount of unstructured data in its native format with no limits to file or account size. All of this can be done with no hardware to acquire or maintain and minimal time to set up, giving you the value of "big data" within minutes. Go to https://channel9.msdn.com/ to find the recording of this session.
Key trends in Big Data and new reference architecture from Hewlett Packard En...Ontico
The rapid evolution of Big Data processing tools is giving rise to new approaches to improving performance. Key new technologies in Hadoop 2.0, such as YARN labeling and Storage Tiering, are already in use at Yahoo and eBay. These new technologies open the way to serious efficiency gains in Hadoop IT infrastructure, delivering performance improvements of several tens of percent while reducing memory and power consumption.
HP's reference architecture for Hadoop, the HP Big Data Reference Architecture, proposes combining specialised HP Moonshot "microservers" with high-density HP Apollo storage nodes to achieve the best hardware return currently available for Hadoop.
Data science holds tremendous potential for organizations to uncover new insights and drivers of revenue and profitability. Big Data has brought the promise of doing data science at scale to enterprises; however, this promise also comes with challenges for data scientists to continuously learn and collaborate. Data scientists have many tools at their disposal: notebooks like Jupyter and Apache Zeppelin, IDEs such as RStudio, languages like R, Python and Scala, and frameworks like Apache Spark. Given all the choices, how do you best collaborate to build your model and then work through the development lifecycle to deploy it from test into production?
In this session, learn the attributes of a modern data science platform that empowers data scientists to build models using all the data in their data lake and fosters continuous learning and collaboration. We will show a demo of DSX with HDP, with a focus on integration, security, and model deployment and management.
Speakers:
Sriram Srinivasan, Senior Technical Staff Member, Analytics Platform Architect, IBM
Vikram Murali, Program Director, Data Science and Machine Learning, IBM
The Practice of Big Data - The Hadoop ecosystem explained with usage scenarioskcmallu
What's the origin of Big Data? What are the real life usage scenarios where Hadoop has been successfully adopted? How do you get started within your organizations?
Cortana Analytics Suite is a fully managed big data and advanced analytics suite that transforms your data into intelligent action. It is comprised of data storage, information management, machine learning, and business intelligence software in a single convenient monthly subscription. This presentation will cover all the products involved, how they work together, and use cases.
Microsoft: Building a Massively Scalable System with DataStax and Microsoft's...DataStax Academy
We have the challenge of how to reliably store massive quantities of data that are available even in the face of infrastructure failures. We have similar challenges on the application side. The most successful cloud architectures break applications down into microservices. How then do we deploy, upgrade and manage the scale of those microservices? This session will illustrate how to tackle these challenges by taking advantage of both Cassandra and Microsoft's next generation PaaS infrastructure called Azure Service Fabric.
Introduction to HDInsight Hadoop on Windows Azure services, including using the interactive console with JavaScript and running WordCount via other methods (Streaming, Hive, etc..)
How to Operationalise Real-Time Hadoop in the CloudAttunity
Hadoop and the Cloud are two of the most disruptive technologies to have emerged from the last decade, but how can you adapt to the increasing rate of change whilst providing the enterprise with the right data, quickly?
Watch this webinar with Attunity, Cloudera and Microsoft and learn:
-How to ingest the most valuable enterprise data into Hadoop
-About real life use cases of Cloudera on Azure
-How to combine the power of Hadoop and the scalable flexibility of Azure
Enable your business with more data in less time. Visit www.attunity.com for more information.
Insights into Real World Data Management ChallengesDataWorks Summit
Data is your most valuable business asset, and it's also your biggest challenge. This challenge and opportunity means we continually face significant roadblocks on the way to becoming a data-driven organisation. From the management of data, to the bubbling open source frameworks, to the limited industry skills, to surmounting time and cost pressures, our challenge in data is big.
We all want and need a “fit for purpose” approach to the management of data, especially Big Data, and overcoming the ongoing challenges around the ‘3Vs’ means we get to focus on the most important V - ‘Value’. Come along and join the discussion on how Oracle Big Data Cloud provides value in the management of data and supports your move toward becoming a data-driven organisation.
Speaker
Noble Raveendran, Principal Consultant, Oracle
Journey to the Data Lake: How Progressive Paved a Faster, Smoother Path to In...DataWorks Summit
Progressive Insurance is well known for its innovative use of data to better serve its customers, and the important role that Hortonworks Data Platform has played in that transformation. However, as with most things worth doing, the path to the Data Lake was not without its challenges. In this session, I’ll share our top use cases for Hadoop – including telematics and display ads, how a skills shortage turned supporting these applications into a nightmare, and how – and why – we now use Syncsort DMX-h to accelerate enterprise adoption by making it quick and easy (or faster and easier) to populate the data lake – and keep it up to date – with data from across the enterprise. I’ll discuss the different approaches we tried, the benefits of using a tool vs. open source, and how we created our Hadoop Ingestor app using Syncsort DMX-h.
Empowering you with Democratized Data Access, Data Science and Machine LearningDataWorks Summit
Data science, with its specialized tools and knowledge, has been the forte of data scientists. However, it is not easy even for data scientists to get access to data that may sit in different data stores across the organization. To unleash the power of data and gain valuable insights, machine learning needs to be made easily consumable by various stakeholders, and access to data made simpler. As an organization's data volumes continue to grow, delivering these insights in real time is a complex challenge to solve.
This session will provide an overview of an approach to building a scalable solution where machine and deep learning and access to data are made much more consumable and simpler by the fastest SQL-on-Hadoop engine on the planet, a rich data scientist toolset and an infrastructure that can deliver the responsiveness needed for production environments.
Speakers:
Pandit Prasad, Program Director, IBM
Ashutosh Mate, Global Senior Solutions Architect, IBM
Databricks is a Software-as-a-Service-like experience (or Spark-as-a-service): a tool for curating and processing massive amounts of data, developing, training and deploying models on that data, and managing the whole workflow process throughout the project. It is for those who are comfortable with Apache Spark, as it is 100% based on Spark and is extensible with support for Scala, Java, R and Python alongside Spark SQL, GraphX, Streaming and the Machine Learning Library (MLlib). It has built-in integration with many data sources, has a workflow scheduler, allows for real-time workspace collaboration, and has performance improvements over traditional Apache Spark.
Apache Mesos, Apache Hadoop, Apache Spark + Custom Enterprise Applications: This stack combined is greater than the sum of each of the pieces of this stack. Mesos can manage resources across an entire data center, Hadoop provides a distributed data store and scalable data processing, and Spark delivers great in-memory and disk-based performance of data processing as well as streaming capabilities. Couple all of that with custom enterprise applications, and the data center turns into a well-oiled machine. When combined, this software stack delivers unlimited flexibility for the entire data center.
Jim Scott, Director of Architecture and Enterprise Strategy | Strata + Hadoop World | Barcelona, Spain, November 2014
BDM9 - Comparison of Oracle RDBMS and Cloudera Impala for a hospital use caseDavid Lauzon
High-level use case description of one department of a hospital, and comparisons of two solutions : 1) Big data solution using Cloudera Impala; and 2) Traditional RDBMS solution using Oracle DB.
Data Integration for Big Data (OOW 2016, Co-Presented With Oracle)Rittman Analytics
Set of product roadmap + capabilities slides from Oracle Data Integration Product Management, and thoughts on data integration on big data implementations by Mark Rittman (Independent Analyst)
Big Data for Oracle Devs - Towards Spark, Real-Time and Predictive AnalyticsMark Rittman
This is a session for Oracle DBAs and developers that looks at cutting-edge big data technologies like Spark and Kafka, and shows through demos how Hadoop is now a real-time platform for fast analytics, data integration and predictive modeling
Enkitec E4 Barcelona : SQL and Data Integration Futures on Hadoop : Mark Rittman
There are many options for providing SQL access over data in a Hadoop cluster, including proprietary vendor products such as Oracle Big Data SQL on the Oracle Big Data Appliance along with open-source technologies such as Apache Hive, Cloudera Impala and Apache Drill; customers are using those to provide reporting over their Hadoop and relational data platforms, and looking to add capabilities such as calculation engines, data integration and federation along with in-memory caching to create complete analytic platforms. In this session we'll look at the options that are available, compare database vendor solutions with their open-source alternative, and see how emerging vendors are going beyond simple SQL-on-Hadoop products to offer complete "data fabric" solutions that bring together old-world and new-world technologies and allow seamless offloading of archive data and compute work to lower-cost Hadoop platforms.
Using Oracle Big Data SQL 3.0 to add Hadoop & NoSQL to your Oracle Data Wareh...Mark Rittman
As presented at OGh SQL Celebration Day in June 2016, NL. Covers new features in Big Data SQL including storage indexes, storage handlers and ability to install + license on commodity hardware
SQL-on-Hadoop for Analytics + BI: What Are My Options, What's the Future?Mark Rittman
There are many options for providing SQL access over data in a Hadoop cluster, including proprietary vendor products along with open-source technologies such as Apache Hive, Cloudera Impala and Apache Drill; customers are using those to provide reporting over their Hadoop and relational data platforms, and looking to add capabilities such as calculation engines, data integration and federation along with in-memory caching to create complete analytic platforms. In this session we’ll look at the options that are available, compare database vendor solutions with their open-source alternative, and see how emerging vendors are going beyond simple SQL-on-Hadoop products to offer complete “data fabric” solutions that bring together old-world and new-world technologies and allow seamless offloading of archive data and compute work to lower-cost Hadoop platforms.
Gluent New World #02 - SQL-on-Hadoop : A bit of History, Current State-of-the...Mark Rittman
Hadoop and NoSQL platforms initially focused on Java developers and slow but massively-scalable MapReduce jobs as an alternative to high-end but limited-scale analytics RDBMS engines. Apache Hive opened-up Hadoop to non-programmers by adding a SQL query engine and relational-style metadata layered over raw HDFS storage, and since then open-source initiatives such as Hive Stinger, Cloudera Impala and Apache Drill along with proprietary solutions from closed-source vendors have extended SQL-on-Hadoop’s capabilities into areas such as low-latency ad-hoc queries, ACID-compliant transactions and schema-less data discovery – at massive scale and with compelling economics.
In this session we’ll focus on technical foundations around SQL-on-Hadoop, first reviewing the basic platform Apache Hive provides and then looking in more detail at how ad-hoc querying, ACID-compliant transactions and data discovery engines work along with more specialised underlying storage that each now work best with – and we’ll take a look to the future to see how SQL querying, data integration and analytics are likely to come together in the next five years to make Hadoop the default platform running mixed old-world/new-world analytics workloads.
New World Hadoop Architectures (& What Problems They Really Solve) for Oracle...Rittman Analytics
Most DBAs are aware something interesting is going on with big data and the Hadoop product ecosystem that underpins it, but aren't so clear about what each component in the stack does, what problem each part solves and why those problems couldn't be solved using the old approach. We'll look at where it's all going with the advent of Spark and machine learning, what's happening with ETL, metadata and analytics on this platform ... why IaaS and datawarehousing-as-a-service will have such a big impact, sooner than you think
In this slidecast, Alex Gorbachev from Pythian presents a Practical Introduction to Hadoop. This is a great primer for viewers who want to get the big picture on how Hadoop works with Big Data and how this approach differs from relational databases.
Watch the presentation: http://inside-bigdata.com/slidecast-a-practical-introduction-to-hadoop/
Download the audio:
Customer Education Webcast: New Features in Data Integration and Streaming CDCPrecisely
View our quarterly customer education webcast to learn about the new advancements in Syncsort DMX and DMX-h data integration software and DataFunnel - our new easy-to-use browser-based database onboarding application. Learn about DMX Change Data Capture and the advantages of true streaming over micro-batch.
View this webcast on-demand where you'll hear the latest news on:
• Improvements in Syncsort DMX and DMX-h
• What’s next in the new DataFunnel interface
• Streaming data in DMX Change Data Capture
• Hadoop 3 support in Syncsort Integrate products
Hadoop and the Data Warehouse: Point/Counter PointInside Analysis
Robin Bloor and Teradata
Live Webcast on April 22, 2014
Watch the archive:
https://bloorgroup.webex.com/bloorgroup/lsr.php?RCID=2e69345c0a6a4e5a8de6fc72652e3bc6
Can you replace the data warehouse with Hadoop? Is Hadoop an ideal ETL subsystem? And what is the real magic of Hadoop? Everyone is looking to capitalize on the insights that lie in the vast pools of big data. Generating the value of that data relies heavily on several factors, especially choosing the right solution for the right context. With so many options out there, how do organizations best integrate these new big data solutions with the existing data warehouse environment?
Register for this episode of The Briefing Room to hear veteran analyst Dr. Robin Bloor as he explains where Hadoop fits into the information ecosystem. He’ll be briefed by Dan Graham of Teradata, who will offer perspective on how Hadoop can play a critical role in the analytic architecture. Bloor and Graham will interactively discuss big data in the big picture of the data center and will also seek to dispel several common misconceptions about Hadoop.
Visit InsideAnalysis.com for more information.
Current big data technology scope overview prepared for the V.I.Tech and Wellcentive companies. Answers, at a very high level, why we are adopting these products and what we really do with them.
Self-Service BI for big data applications using Apache Drill (Big Data Amster...Dataconomy Media
Modern big data applications such as social, mobile, web and IoT deal with a larger number of users and larger amount of data than the traditional transactional applications. The datasets associated with these applications evolve rapidly, are often self-describing and can include complex types such as JSON and Parquet. In this demo we will show how Apache Drill can be used to provide low latency queries natively on rapidly evolving multi-structured datasets at scale.
Similar to IlOUG Tech Days 2016 - Big Data for Oracle Developers - Towards Spark, Real-Time and Predictive Analytics (20)
Using Oracle Big Data Discovey as a Data Scientist's ToolkitMark Rittman
As delivered at Trivadis Tech Event 2016 - how Big Data Discovery along with Python and pySpark was used to build predictive analytics models against wearables and smart home data
Social Network Analysis using Oracle Big Data Spatial & Graph (incl. why I di...Mark Rittman
As presented at OGh SQL Celebration Day 2016 - including new content on why NoSQL and Hadoop is a better solution for social network analysis than the Oracle Database (for now...)
Oracle BI Hybrid BI : Mode 1 + Mode 2, Cloud + On-Premise Business AnalyticsMark Rittman
Presented at the UKOUG Business Analytics SIG Meeting in April 2016, addresses the question as to whether enterprise BI tools such as OBIEE12c are relevant in the world of Gartner BiModal Mode 1 + Mode 2 analytics, and Hybrid cloud/on-premise deployments
Riga dev day 2016 adding a data reservoir and oracle bdd to extend your ora...Mark Rittman
This talk focus is on what a data reservoir is, how it related to the RDBMS DW, and how Big Data Discovery provides access to it to business and BI users
OBIEE12c and Embedded Essbase 12c - An Initial Look at Query Acceleration Use...Mark Rittman
OBIEE12c comes with an updated version of Essbase that focuses entirely in this release on the query acceleration use-case. This presentation looks at this new release and explains how the new BI Accelerator Wizard manages the creation of Essbase cubes to accelerate OBIEE query performance
What is Big Data Discovery, and how it complements traditional business anal...Mark Rittman
Data Discovery is an analysis technique that complements traditional business analytics, and enables users to combine, explore and analyse disparate datasets to spot opportunities and patterns that lie hidden within your data. Oracle Big Data discovery takes this idea and applies it to your unstructured and big data datasets, giving users a way to catalogue, join and then analyse all types of data across your organization.
In this session we'll look at Oracle Big Data Discovery and how it provides a "visual face" to your big data initiatives, and how it complements and extends the work that you currently do using business analytics tools.
Deploying Full Oracle BI Platforms to Oracle Cloud - OOW2015Mark Rittman
Presentation given at Oracle Openworld 2015 on moving an existing OBIEE11g BI platform to Oracle Public Cloud, including accompanying DW database and continuing the ETL process. Explores migration process and what's now possible in Oracle Cloud for hosting full OBIEE platforms, and looks at what the benefits of such a migration might be for customers and end-users.
Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Ar...Mark Rittman
Presentation from the Rittman Mead BI Forum 2015 masterclass, pt.2 of a two-part session that also covered creating the Discovery Lab. Goes through setting up Flume log + twitter feeds into CDH5 Hadoop using ODI12c Advanced Big Data Option, then looks at the use of OBIEE11g with Hive, Impala and Big Data SQL before finally using Oracle Big Data Discovery for faceted search and data mashup on-top of Hadoop
OBIEE11g Seminar by Mark Rittman for OU Expert Summit, Dubai 2015Mark Rittman
Slides from a two-day OBIEE11g seminar in Dubai, February 2015, at the Oracle University Expert Summit. Covers the following topics:
1. OBIEE 11g Overview & New Features
2. Adding Exalytics and In-Memory Analytics to OBIEE 11g
3. Source Control and Concurrent Development for OBIEE
4. No Silver Bullets - OBIEE 11g Performance in the Real World
5. Oracle BI Cloud Service Overview, Tips and Techniques
6. Moving to Oracle BI Applications 11g + ODI
7. Oracle Essbase and Oracle BI EE 11g Integration Tips and Techniques
8. OBIEE 11g and Predictive Analytics, Hadoop & Big Data
UKOUG Tech'14 Super Sunday : Deep-Dive into Big Data ETL with ODI12cMark Rittman
Slides from my 2hr session at the UKOUG Tech'14 Super Sunday event, covering Hadoop basics and use of Oracle Data Integrator 12c for ETL on the Hadoop platform. Also some coverage of Oracle big data product announcements from OOW2014.
IlOUG Tech Days 2016 - Big Data for Oracle Developers - Towards Spark, Real-Time and Predictive Analytics
1. info@rittmanmead.com www.rittmanmead.com @rittmanmead
Big Data for Oracle Developers & DBAs -
Towards Spark, Real-Time and Predictive Analytics
Mark Rittman, CTO, Rittman Mead
IlOUG Tech Day 2016 Day 2 Keynote, 31st May 2016 @ 9.15am
2. info@rittmanmead.com www.rittmanmead.com @rittmanmead 2
•Mark Rittman, Co-Founder of Rittman Mead
‣Oracle ACE Director, specialising in Oracle BI&DW
‣14 Years Experience with Oracle Technology
‣Regular columnist for Oracle Magazine
•Author of two Oracle Press Oracle BI books
‣Oracle Business Intelligence Developers Guide
‣Oracle Exalytics Revealed
‣Writer for Rittman Mead Blog :
http://www.rittmanmead.com/blog
•Email : mark.rittman@rittmanmead.com
•Twitter : @markrittman
About the Speaker
4. info@rittmanmead.com www.rittmanmead.com @rittmanmead
•Gives us an ability to store more data, at more detail, for longer
•Provides a cost-effective way to analyse vast amounts of data
•Hadoop & NoSQL technologies can give us “schema-on-read” capabilities
•There’s vast amounts of innovation in this area we can harness
•And it’s very complementary to Oracle BI & DW
Why is Hadoop of Interest to Us?
5. info@rittmanmead.com www.rittmanmead.com @rittmanmead
Flexible Cheap Storage for Logs, Feeds + Social Data
$50k
Hadoop
Node
Voice + Chat
Transcripts
Call Center Logs, Chat Logs, iBeacon Logs, Website Logs, CRM Data, Transactions, Social Feeds, Demographics
Raw Data
Customer 360 Apps
Predictive
Models
SQL-on-Hadoop
Business analytics
Real-time Feeds,
batch and API
6. info@rittmanmead.com www.rittmanmead.com @rittmanmead
Incorporate Hadoop Data Reservoirs into DW Design
(Diagram: Oracle reference architecture incorporating a data reservoir alongside the DW)
Diagram labels: Virtualization & Query Federation; Enterprise Performance Management; Pre-built & Ad-hoc BI Assets; Information Services; Data Ingestion; Information Interpretation; Data Science; Data Engines & Poly-structured sources; Content (Docs, Web & Social Media, SMS); Structured Data Sources (Operational Data, COTS Data, Master & Ref. Data, Streaming & BAM); Data Sources
Data layers:
‣Raw Data Reservoir - immutable raw data reservoir; raw data at rest is not interpreted
‣Foundation Data Layer - immutable modelled data in a business-process-neutral form, abstracted from business process changes
‣Access & Performance Layer - past, current and future interpretation of enterprise data, structured to support agile access & navigation
Sandboxes:
‣Discovery Lab Sandboxes - project-based data stores to support specific discovery objectives
‣Rapid Development Sandboxes - project-based data stores to facilitate rapid content / presentation delivery
7. info@rittmanmead.com www.rittmanmead.com @rittmanmead 7
•Oracle Engineered system for big data processing and analysis
•Start with Oracle Big Data Appliance Starter Rack - expand up to 18 nodes per rack
•Cluster racks together for horizontal scale-out using enterprise-quality infrastructure
Oracle Big Data Appliance
Deployed on Oracle Big Data Appliance
(Diagram: two Big Data Appliance racks connected over InfiniBand, each labelled "Oracle Big Data Appliance Starter Rack + Expansion" - Cloudera CDH + Oracle software; 18 high-spec Hadoop nodes with InfiniBand switches for internal Hadoop traffic, optimised for network throughput; 1 Cisco management switch; single place for support for H/W + S/W. Workload labels: Enriched Customer Profile, Modeling, Scoring)
8. info@rittmanmead.com www.rittmanmead.com @rittmanmead
•Hadoop, through MapReduce, breaks processing down into simple stages
‣Map : select the columns and values you’re interested in, pass through as key/value pairs
‣Reduce : aggregate the results
•Most ETL jobs can be broken down into filtering,
projecting and aggregating
•Hadoop then automatically runs job on cluster
‣Share-nothing small chunks of work
‣Run the job on the node where the data is
‣Handle faults etc
‣Gather the results back in
Hadoop Tenets : Simplified Distributed Processing
(Diagram: three Mappers performing Filter + Project feed two Reducers performing Aggregate; output is one HDFS file per reducer, in a directory)
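To make the Map and Reduce stages concrete, here is a minimal sketch of the same filter/project/aggregate pattern expressed over an in-memory Scala collection. This is illustrative only - plain Scala rather than the Hadoop MapReduce API - and the field names and sample records are invented for the example:
// Conceptual sketch only: plain Scala collections, not the Hadoop MapReduce API
object MapReduceSketch extends App {
  // Raw "rows": (customerId, country, amount) - invented sample data
  val rows = Seq(("c1", "UK", 10.0), ("c2", "DE", 25.0), ("c3", "UK", 5.0))
  // Map stage: filter the rows and project them into (key, value) pairs
  val mapped = rows
    .filter { case (_, country, _) => country != "DE" }        // filter
    .map    { case (_, country, amount) => (country, amount) } // project
  // Reduce stage: group by key, then aggregate the values per key
  val reduced = mapped.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2).sum }
  println(reduced) // Map(UK -> 15.0)
}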
9. info@rittmanmead.com www.rittmanmead.com @rittmanmead
•MapReduce jobs are typically written in Java, but Hive can make this simpler
•Hive is a query environment over Hadoop/MapReduce to support SQL-like queries
•Hive server accepts HiveQL queries via HiveODBC or HiveJDBC, automatically
creates MapReduce jobs against data previously loaded into the Hive HDFS tables
•Approach used by ODI and OBIEE
to gain access to Hadoop data
•Allows Hadoop data to be accessed just like
any other data source (sort of...)
Hive as the Hadoop SQL Access Layer
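As a hedged illustration of the HiveJDBC route mentioned above, the following Scala sketch submits a HiveQL query through the HiveServer2 JDBC driver. The host, port, credentials and the access_logs table are assumptions for the example, and the hive-jdbc driver jar would need to be on the classpath:
// Minimal sketch: HiveServer2 host/port, credentials and the access_logs table are assumptions
import java.sql.DriverManager
object HiveJdbcSketch extends App {
  Class.forName("org.apache.hive.jdbc.HiveDriver")
  val conn = DriverManager.getConnection("jdbc:hive2://bigdatalite:10000/default", "hive", "")
  try {
    val stmt = conn.createStatement()
    // The HiveQL below is turned into MapReduce (or Tez/Spark) jobs by the Hive server
    val rs = stmt.executeQuery(
      "SELECT request_page, COUNT(*) AS hits FROM access_logs GROUP BY request_page")
    while (rs.next()) println(s"${rs.getString(1)}\t${rs.getLong(2)}")
  } finally conn.close()
}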
10. info@rittmanmead.com www.rittmanmead.com @rittmanmead
•Data integration tools such as Oracle Data Integrator can load and process Hadoop data
•BI tools such as Oracle Business Intelligence 12c can report on Hadoop data
•Generally use MapReduce and Hive to access data
‣ODBC and JDBC access to Hive tabular data
‣Allows Hadoop unstructured/semi-structured
data on HDFS to be accessed like RDBMS
Hive Provides a SQL Interface for BI + ETL Tools
Access direct Hive or extract using ODI12c
for structured OBIEE dashboard analysis
What pages are people visiting?
Who is referring to us on Twitter?
What content has the most reach?
11. info@rittmanmead.com www.rittmanmead.com @rittmanmead
•Most Oracle DBAs and developers know about Hadoop, but assume…
Common Developer Understanding of Hadoop Today
‣Hadoop is just for batch (because of the MapReduce JVM spin-up issue)
‣Hadoop is just for large datasets, not ad-hoc work or micro batches
‣Hadoop will always be slow because it stages everything to disk
‣All Hadoop can do is Map (select, filter) and Reduce (aggregate)
‣Hadoop == MapReduce
22. info@rittmanmead.com www.rittmanmead.com @rittmanmead 22
•MapReduce’s great innovation was to break processing down into distributed jobs
•Jobs that have no functional dependency on each other, only upstream tasks
•Provides a framework that is infinitely scalable and very fault tolerant
•Hadoop handled job scheduling and resource management
‣All MapReduce code had to do was provide the “map” and “reduce” functions
‣Automatic distributed processing
‣Slow but extremely powerful
Hadoop 1.0 and MapReduce
23. info@rittmanmead.com www.rittmanmead.com @rittmanmead 23
•A typical Hive or Pig script compiles down into multiple MapReduce jobs
•Each job stages its intermediate results to disk
•Safe, but slow - write to disk, spin-up separate JVMs for each job
MapReduce - Scales By Writing Intermediate Results to Disk
SELECT
LOWER(hashtags.text),
COUNT(*) AS total_count
FROM (
SELECT * FROM tweets WHERE regexp_extract(created_at,"(2015)*",1) = "2015"
) tweets
LATERAL VIEW EXPLODE(entities.hashtags) t1 AS hashtags
GROUP BY LOWER(hashtags.text)
ORDER BY total_count DESC
LIMIT 15
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 5.34 sec HDFS Read: 10952994 HDFS Write: 5239 SUCCESS
Stage-Stage-2: Map: 1 Reduce: 1 Cumulative CPU: 2.1 sec HDFS Read: 9983 HDFS Write: 164 SUCCESS
Total MapReduce CPU Time Spent: 7 seconds 440 msec
OK
1
2
24. info@rittmanmead.com www.rittmanmead.com @rittmanmead 24
•MapReduce 2 (MR2) splits the functionality of the JobTracker
by separating resource management and job scheduling/monitoring
•Introduces YARN (Yet Another Resource Negotiator)
•Permits processing frameworks other than MapReduce to run on the cluster
‣For example, Apache Spark
•Maintains backwards compatibility with MR1
•Introduced with CDH5+
MapReduce 2 and YARN
(Diagram: Clients submit work to the YARN Resource Manager, which schedules it across the Node Managers)
25. info@rittmanmead.com www.rittmanmead.com @rittmanmead 25
•Runs on top of YARN, provides a faster execution engine than MapReduce for Hive, Pig etc
•Models processing as an entire data flow graph (DAG), rather than separate job steps
‣DAG (Directed Acyclic Graph) is a new programming style for distributed systems
‣Dataflow steps pass data between them as streams, rather than writing/reading from disk
•Supports in-memory computation, enables Hive on Tez (Stinger) and Pig on Tez
•Favoured In-memory / Hive v2
route by Hortonworks
Apache Tez
(Diagram: a Tez DAG - input data flows through chained Map() and Reduce() stages as a single dataflow graph to the output data, rather than as separate MapReduce jobs)
28. info@rittmanmead.com www.rittmanmead.com @rittmanmead 28
•Cloudera’s answer to Hive query response time issues
•MPP SQL query engine running on Hadoop, bypasses MapReduce for
direct data access
•Mostly in-memory, but spills to disk if required
•Uses Hive metastore to access Hive table metadata
•Similar SQL dialect to Hive - not as rich though and no support for Hive
SerDes, storage handlers etc
Cloudera Impala - Fast, MPP-style Access to Hadoop Data
29. info@rittmanmead.com www.rittmanmead.com @rittmanmead 29
•Beginners usually store data in HDFS using text file formats (CSV) but these have limitations
•Apache AVRO often used for general-purpose processing
‣Splitability, schema evolution, in-built metadata, support for block compression
•Parquet now commonly used with Impala due to column-orientated storage
‣Mirrors work in RDBMS world around column-store
‣Only return (project) the columns you require across a wide table
Apache Parquet - Column-Orientated Storage for Analytics
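A minimal sketch of the column-projection benefit, assuming a Spark 1.x SQLContext (matching the API used later in this deck); the HDFS path and column names are illustrative only:
// Sketch assuming a Spark 1.x SQLContext; HDFS path and columns are illustrative
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
object ParquetSketch extends App {
  val sc = new SparkContext(new SparkConf().setAppName("parquet-sketch").setMaster("local[*]"))
  val sqlContext = new SQLContext(sc)
  import sqlContext.implicits._
  // Write a small DataFrame out as column-orientated Parquet
  val customers = Seq(("c1", "UK", 10.0), ("c2", "DE", 25.0)).toDF("cust_id", "country", "amount")
  customers.write.parquet("/user/oracle/demo/customers_parquet")
  // Project only the columns needed - Parquet returns just those column chunks
  sqlContext.read.parquet("/user/oracle/demo/customers_parquet")
            .select("country", "amount")
            .show()
  sc.stop()
}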
30. info@rittmanmead.com www.rittmanmead.com @rittmanmead 30
•But Parquet (and HDFS) have significant limitations for real-time analytics applications
‣Append-only orientation, focus on column-store
makes streaming ingestion harder
•Cloudera Kudu aims to combine best of HDFS + HBase
‣Real-time analytics-optimised
‣Supports updates to data
‣Fast ingestion of data
‣Accessed using SQL-style tables
and get/put/update/delete API
Cloudera Kudu - Combining Best of HBase and Column-Store
31. info@rittmanmead.com www.rittmanmead.com @rittmanmead 31
•Part of Oracle Big Data 4.0 (BDA-only)
‣Also requires Oracle Database 12c, Oracle Exadata Database Machine
•Extends Oracle Data Dictionary to cover Hive
•Extends Oracle SQL and SmartScan to Hadoop
•Extends Oracle Security Model over Hadoop
‣Fine-grained access control
‣Data redaction, data masking
‣Uses fast C-based readers where possible (vs. Hive MapReduce generation)
‣Map Hadoop parallelism to Oracle PQ
‣Big Data SQL engine works on top of YARN
‣Like Spark, Tez, MR2
Oracle Big Data SQL
(Diagram: SQL queries issued on the Exadata Database Server use Oracle Big Data SQL to apply SmartScan on both the Exadata Storage Servers and the Hadoop cluster)
35. info@rittmanmead.com www.rittmanmead.com @rittmanmead
•Apache Drill is another SQL-on-Hadoop project that focuses on schema-free data discovery
•Inspired by Google Dremel, innovation is querying raw data with schema optional
•Automatically infers and detects schema from semi-structured datasets and NoSQL DBs
•Join across different silos of data e.g. JSON records, Hive tables and HBase database
•Aimed at different use-cases than Hive -
low-latency queries, discovery
(think Endeca vs OBIEE)
Introducing Apache Drill - “We Don’t Need No Roads”
36. info@rittmanmead.com www.rittmanmead.com @rittmanmead
•Most modern datasource formats embed their schema in the data (“schema-on-read”)
•Apache Drill makes these as easy to join to traditional datasets as “point me at the data”
•Cuts out unnecessary work in defining Hive schemas for data that’s self-describing
•Supports joining across files,
databases, NoSQL etc
Self-Describing Data - Parquet, AVRO, JSON etc
37. info@rittmanmead.com www.rittmanmead.com @rittmanmead
•Files can exist either on the local filesystem, or on HDFS
•Connection to directory or file defined in storage configuration
•Can work with CSV, TXT, TSV etc
•First row of file can provide schema (column names)
Apache Drill and Text Files
SELECT * FROM dfs.`/tmp/csv_with_header.csv2`;
+-------+------+------+------+
| name | num1 | num2 | num3 |
+-------+------+------+------+
| hello | 1 | 2 | 3 |
| hello | 1 | 2 | 3 |
| hello | 1 | 2 | 3 |
| hello | 1 | 2 | 3 |
| hello | 1 | 2 | 3 |
| hello | 1 | 2 | 3 |
| hello | 1 | 2 | 3 |
+-------+------+------+------+
7 rows selected (0.12 seconds)
SELECT * FROM dfs.`/tmp/csv_no_header.csv`;
+------------------------+
| columns |
+------------------------+
| ["hello","1","2","3"] |
| ["hello","1","2","3"] |
| ["hello","1","2","3"] |
| ["hello","1","2","3"] |
| ["hello","1","2","3"] |
| ["hello","1","2","3"] |
| ["hello","1","2","3"] |
+------------------------+
7 rows selected (0.112 seconds)
38. info@rittmanmead.com www.rittmanmead.com @rittmanmead
•JSON (Javascript Object Notation) documents are
often used for data interchange
•Exports from Twitter and other consumer services
•Web service responses and other B2B interfaces
•A more lightweight form of XML that is “self-describing”
•Handles evolving schemas, and optional attributes
•Drill treats each document as a row, and has features to:
•Flatten nested data (extract elements from arrays)
•Generate key/value pairs for loosely structured data
Apache Drill and JSON Documents
use dfs.iot;
show files;
select in_reply_to_user_id, text from `all_tweets.json`
limit 5;
+---------------------+------+
| in_reply_to_user_id | text |
+---------------------+------+
| null | BI Forum 2013 in Brighton has now sold-out |
| null | "Football has become a numbers game |
| null | Just bought Lyndsay Wise’s Book |
| null | An Oracle BI "Blast from the Past" |
| 14716125 | Dilbert on Agile Programming |
+---------------------+------+
5 rows selected (0.229 seconds)
select name, flatten(fillings) as f
from dfs.users.`/donuts.json`
where f.cal < 300;
39. info@rittmanmead.com www.rittmanmead.com @rittmanmead
•Drill can connect to Hive to make use of metastore (incl. multiple Hive metastores)
•NoSQL databases (HBase etc)
•Parquet files (native storage format - columnar + self describing)
Apache Drill and Hive, HBase, Parquet Sources etc
USE hbase;
SELECT * FROM students;
+-------------+-----------------------+-----------------------------------------------------+
| row_key | account | address |
+-------------+-----------------------+------------------------------------------------------+
| [B@e6d9eb7 | {"name":"QWxpY2U="} | {"state":"Q0E=","street":"MTIzIEJhbGxtZXIgQXY="} |
| [B@2823a2b4 | {"name":"Qm9i"} | {"state":"Q0E=","street":"MSBJbmZpbml0ZSBMb29w"} |
| [B@3b8eec02 | {"name":"RnJhbms="} | {"state":"Q0E=","street":"NDM1IFdhbGtlciBDdA=="} |
| [B@242895da | {"name":"TWFyeQ=="} | {"state":"Q0E=","street":"NTYgU291dGhlcm4gUGt3eQ=="} |
+-------------+-----------------------+----------------------------------------------------------------------+
SELECT firstname,lastname FROM
hiveremote.`customers` limit 10;`
+------------+------------+
| firstname | lastname |
+------------+------------+
| Essie | Vaill |
| Cruz | Roudabush |
| Billie | Tinnes |
| Zackary | Mockus |
| Rosemarie | Fifield |
| Bernard | Laboy |
| Marianne | Earman |
+------------+------------+
SELECT * FROM dfs.`iot_demo/geodata/region.parquet`;
+--------------+--------------+-----------------------+
| R_REGIONKEY | R_NAME | R_COMMENT |
+--------------+--------------+-----------------------+
| 0 | AFRICA | lar deposits. blithe |
| 1 | AMERICA | hs use ironic, even |
| 2 | ASIA | ges. thinly even pin |
| 3 | EUROPE | ly final courts cajo |
| 4 | MIDDLE EAST | uickly special accou |
+--------------+--------------+-----------------------+
40. info@rittmanmead.com www.rittmanmead.com @rittmanmead
•Drill developed for real-time, ad-hoc data exploration with schema discovery on-the-fly
•Individual analysts exploring new datasets, leveraging corporate metadata/data to help
•Hive is more about large-scale, centrally curated set-based big data access
•Drill models conceptually as JSON, vs. Hive’s tabular approach
•Drill introspects schema from whatever it connects to, vs. formal modeling in Hive
Apache Drill vs. Apache Hive
[Diagram: query workload spectrum, spanning Apache Drill to Apache Hive]
Interactive Queries (100ms - 3mins) : Data Discovery, Tableau/VA
Reporting Queries (3mins - 20mins) : Canned Reports, OBIEE
ETL & Batch Queries (20mins - hours) : ODI, Scripting, Informatica
47. info@rittmanmead.com www.rittmanmead.com @rittmanmead 47
•Another DAG execution engine running on YARN
•More mature than Tez, with a richer API and more vendor support
•Uses concept of an RDD (Resilient Distributed Dataset)
‣RDDs are like tables or Pig relations, but can be cached in-memory
‣Great for in-memory transformations, or iterative/cyclic processes (see the sketch below)
•Spark jobs consist of a DAG of tasks operating on RDDs
•Access through Scala, Python or Java APIs
•Related projects include
‣Spark SQL
‣Spark Streaming
Apache Spark
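As a minimal sketch of the caching behaviour described above (the file path and filter logic here are illustrative, not from the deck), a cached RDD can be reused across an iterative computation without re-reading the source data on each pass:

// Parse a dataset once and cache the parsed RDD in memory (illustrative path)
val points = sc.textFile("data/points.csv")
  .map(_.split(",").map(_.toDouble))
  .cache()
// Each iteration below reuses the cached RDD rather than re-parsing the file
var threshold = 100.0
for (i <- 1 to 5) {
  val remaining = points.filter(p => p.sum > threshold).count()
  println(s"iteration $i: $remaining points above $threshold")
  threshold = threshold * 1.5
}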
48. info@rittmanmead.com www.rittmanmead.com @rittmanmead 48
•Native support for multiple languages with identical APIs
‣Python - prototyping, data wrangling
‣Scala - functional programming features
‣Java - lower-level, application integration
•Use of closures, iterations, and other common language constructs to minimize code
•Integrated support for distributed + functional programming
•Unified API for batch and streaming
Rich Developer Support + Wide Developer Ecosystem
scala> val logfile = sc.textFile("logs/access_log")
14/05/12 21:18:59 INFO MemoryStore: ensureFreeSpace(77353)
called with curMem=234759, maxMem=309225062
14/05/12 21:18:59 INFO MemoryStore: Block broadcast_2
stored as values to memory (estimated size 75.5 KB, free 294.6 MB)
logfile: org.apache.spark.rdd.RDD[String] =
MappedRDD[31] at textFile at <console>:15
scala> logfile.count()
14/05/12 21:19:06 INFO FileInputFormat: Total input paths to process : 1
14/05/12 21:19:06 INFO SparkContext: Starting job: count at <console>:1
...
14/05/12 21:19:06 INFO SparkContext: Job finished:
count at <console>:18, took 0.192536694 s
res7: Long = 154563
scala> val logfile = sc.textFile("logs/access_log").cache
scala> val biapps11g = logfile.filter(line => line.contains("/biapps11g/"))
biapps11g: org.apache.spark.rdd.RDD[String] = FilteredRDD[34] at filter at <console>:17
scala> biapps11g.count()
...
14/05/12 21:28:28 INFO SparkContext: Job finished: count at <console>:20, took 0.387960876 s
res9: Long = 403
49. info@rittmanmead.com www.rittmanmead.com @rittmanmead 49
•Spark SQL and DataFrames allow RDDs in Spark to be processed using SQL queries
•Bring in and federate additional data from JDBC sources (see the sketch after the example below)
•Load, read and save data in Hive, Parquet and other structured tabular formats
Spark SQL - Adding SQL Processing to Apache Spark
// Filter out robot/crawler traffic and WordPress admin URLs, then register
// the result as a temp table so it can be queried with SQL
val accessLogsFilteredDF = accessLogs
  .filter( r => ! r.agent.matches(".*(spider|robot|bot|slurp).*"))
  .filter( r => ! r.endpoint.matches(".*(wp-content|wp-admin).*")).toDF()
accessLogsFilteredDF.registerTempTable("accessLogsFiltered")
val topTenPostsLast24Hour = sqlContext.sql("""
  SELECT p.POST_TITLE, p.POST_AUTHOR, COUNT(*) as total
  FROM accessLogsFiltered a
  JOIN posts p ON a.endpoint = p.POST_SLUG
  GROUP BY p.POST_TITLE, p.POST_AUTHOR
  ORDER BY total DESC LIMIT 10""")
// Persist top ten table for this window to HDFS as parquet file
topTenPostsLast24Hour.save("/user/oracle/rm_logs_batch_output/topTenPostsLast24Hour.parquet",
  "parquet", SaveMode.Overwrite)
52. info@rittmanmead.com www.rittmanmead.com @rittmanmead 52
•Clusters by default are unsecured (vulnerable to account spoofing) & need Kerberos enabled
•Data access controlled by POSIX-style permissions on HDFS files
•Hive and Impala can use Apache Sentry for role-based access control (RBAC)
‣Result is data duplication and complexity
‣No consistent API or abstracted security model
Hadoop Security Initially Was a Mess
/user/mrittman/scratchpad
/user/ryeardley/scratchpad
/user/mpatel/scratchpad
/user/mrittman/scratchpad
/user/mrittman/scratchpad
/data/rm_website_analysis/logfiles/incoming
/data/rm_website_analysis/logfiles/archive
/data/rm_website_analysis/tweets/incoming
/data/rm_website_analysis/tweets/archive
53. info@rittmanmead.com www.rittmanmead.com @rittmanmead 53
•Use standard Oracle Security over Hadoop & NoSQL
‣Grant & Revoke Privileges
‣Redact Data
‣Apply Virtual Private Database
‣Provides Fine-grained Access Control
•Great solution to extend existing Oracle
security model over Hadoop datasets
Oracle Big Data SQL : Extend Oracle Security to Hadoop
[Diagram: SQL access over JSON data in Hadoop and customer data in Oracle DB, returning a redacted data subset]
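-- Redact ca_street_name with random values whenever the querying session has the REDACTION_TESTER role enabled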
DBMS_REDACT.ADD_POLICY(
object_schema => 'txadp_hive_01',
object_name => 'customer_address_ext',
column_name => 'ca_street_name',
policy_name => 'customer_address_redaction',
function_type => DBMS_REDACT.RANDOM,
expression => 'SYS_CONTEXT(''SYS_SESSION_ROLES'',
''REDACTION_TESTER'')=''TRUE'''
);
54. info@rittmanmead.com www.rittmanmead.com @rittmanmead 54
•Provides a higher-level, logical abstraction for data (i.e. tables or views)
‣Can be used with Spark & Spark SQL, with predicate pushdown and projection
•Returns schemaed objects (instead of paths and bytes) in a similar way to HCatalog
•Unified data access path allows platform-wide performance improvements
•Secure service that does not execute arbitrary user code
‣Central location for all authorization checks using Sentry metadata.
Cloudera RecordService
56. info@rittmanmead.com www.rittmanmead.com @rittmanmead 56
•Part of Spark, extends Scala, Java & Python API
•Integrated workflow including ML pipelines
•Currently supports following algorithms:
‣Binary classification
‣Regression
‣Clustering
‣Collaborative filtering
‣Dimensionality Reduction
Spark MLLib : Adding Machine Learning Capabilities to Spark
// Compute raw scores on the test set.
val scoreAndLabels = test.map { point =>
val score = model.predict(point.features)
(score, point.label)
}
// Get evaluation metrics.
val metrics = new BinaryClassificationMetrics(scoreAndLabels)
val auROC = metrics.areaUnderROC()
println("Area under ROC = " + auROC)
// Save and load model
model.save(sc, "myModelPath")
val sameModel = SVMModel.load(sc, "myModelPath")
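The snippet above assumes a trained model and a test set already exist; as a minimal sketch of how they might be produced (the input path, split ratios and iteration count are illustrative, following the standard MLlib linear SVM example):

import org.apache.spark.mllib.classification.SVMWithSGD
import org.apache.spark.mllib.util.MLUtils
// Load labelled data in LIBSVM format (illustrative path)
val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
// Split into a training set (cached for the iterative optimiser) and a test set
val splits = data.randomSplit(Array(0.6, 0.4), seed = 11L)
val training = splits(0).cache()
val test = splits(1)
// Train a linear SVM binary classifier
val numIterations = 100
val model = SVMWithSGD.train(training, numIterations)
// Clear the default threshold so predict() returns raw scores for the ROC metric
model.clearThreshold()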
57. info@rittmanmead.com www.rittmanmead.com @rittmanmead 57
•Data enrichment tool aimed at domain experts, not programmers
•Uses machine-learning to automate data classification + profiling steps
•Automatically highlight sensitive data, and offer to redact or obfuscate
•Dramatically reduce the time required to onboard new data sources
•Hosted in Oracle Cloud for zero-install
‣File upload and download from browser
‣Automate for production data loads
[Diagram: Raw Data (data stored in the original format, usually files, such as SS7, ASN.1, JSON etc.) is mapped and transformed into Mapped Data, e.g. from voice + chat transcripts]
Example Usage : Oracle Big Data Preparation Cloud Service
59. info@rittmanmead.com www.rittmanmead.com @rittmanmead 59
Use of Machine Learning to Identify Data Patterns
•Automatically profile, parse and classify incoming datasets using Spark MLLib Word2Vec
•Spot and obfuscate sensitive data, and automatically suggest column names (see the sketch below)
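The deck doesn’t show how Word2Vec is applied here, so the following is only a rough, hypothetical sketch of the general idea (the token file, field names and query term are assumptions, not from the deck): Spark MLlib’s Word2Vec can be trained on tokenised column headers and sample values, then queried for similar terms to suggest what an unlabelled column contains.

import org.apache.spark.mllib.feature.Word2Vec
// Hypothetical input: each line is a space-separated list of tokens drawn
// from column headers and sample values of previously profiled datasets
val tokens = sc.textFile("data/profile_tokens.txt")
  .map(_.toLowerCase.split(" ").toSeq)
// Train a Word2Vec model over the token sequences
val w2vModel = new Word2Vec().setVectorSize(100).setMinCount(2).fit(tokens)
// Terms close to "surname" in the learned vector space suggest that an
// unlabelled column holds personal name data
w2vModel.findSynonyms("surname", 5).foreach { case (word, similarity) =>
  println(f"$word%-15s $similarity%.3f")
}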
60.
61. info@rittmanmead.com www.rittmanmead.com @rittmanmead 61
•Hadoop is evolving
‣Hadoop 2.0 breaks the dependency on MapReduce
‣Spark, Tez etc allow us to create execution plans that run in-memory, faster than before
‣New streaming models allow us to process data via sockets, micro batches or continuously
•And Oracle developers can make use of these new capabilities
‣Oracle Big Data SQL can access Hadoop data loaded in real-time
‣OBIEE, particularly in 11.1.1.9, can access Impala
‣ODI is likely to support Hive on Tez and Hive on Spark shortly, and will have support for Spark in the future
Summary