Big data is one of the biggest buzzwords in today's market. Terms like Hadoop, HDFS, YARN, Sqoop, and non-structured data have been scaring DBAs since 2010 - but where does the DBA team really fit in?
In this session, we will discuss everything database administrators and database developers need to know about big data. We will demystify the Hadoop ecosystem and explore the different components. We will learn how HDFS and MapReduce are changing the data world, and where traditional databases fit into the grand scheme of things. We will also talk about why DBAs are the perfect candidates to transition into big data and Hadoop professionals and experts.
Learning Objective #1: What is the Big Data challenge
Learning Objective #2: Learn about Hadoop - HDFS, MapReduce and Yarn
Learning Objective #3: Understand where a DBA fits in this world
Things Every Oracle DBA Needs to Know About the Hadoop Ecosystem 20170527Zohar Elkayam
Big data is one of the biggest buzzwords in today's market. Terms such as Hadoop, HDFS, YARN, Sqoop, and non-structured data have been scaring DBAs since 2010, but where does the DBA team really fit in?
In this session, we will discuss everything database administrators and database developers need to know about big data. We will demystify the Hadoop ecosystem and explore the different components. We will learn how HDFS and MapReduce are changing the data world and where traditional databases fit into the grand scheme of things. We will also talk about why DBAs are the perfect candidates to transition into big data and Hadoop professionals and experts.
This is the presentation I gave in Kscope17, on June 27, 2017.
Introduction to Big Data and NoSQL.
This presentation was given to the Master DBA course at John Bryce Education in Israel.
Work is based on presentations by Michael Naumov, Baruch Osoveskiy, Bill Graham and Ronen Fidel.
This document provides summaries of various distributed file systems and distributed programming frameworks that are part of the Hadoop ecosystem. It summarizes Apache HDFS, GlusterFS, QFS, Ceph, Lustre, Alluxio, GridGain, XtreemFS, Apache Ignite, Apache MapReduce, and Apache Pig. For each one it provides 1-3 links to additional resources about the project.
The document provides an introduction to Hadoop. It discusses how Google developed its own infrastructure using Google File System (GFS) and MapReduce to power Google Search due to limitations with databases. Hadoop was later developed based on these Google papers to provide an open-source implementation of GFS and MapReduce. The document also provides overviews of the HDFS file system and MapReduce programming model in Hadoop.
This document discusses loading data from Hadoop into Oracle databases using Oracle connectors. It describes how the Oracle Loader for Hadoop and Oracle SQL Connector for HDFS can load data from HDFS into Oracle tables much faster than traditional methods like Sqoop by leveraging parallel processing in Hadoop. The connectors optimize the loading process by automatically partitioning, sorting, and formatting the data into Oracle blocks to achieve high performance loads. Measuring the CPU time needed per gigabyte loaded allows estimating how long full loads will take based on available resources.
This document provides an introduction to big data and NoSQL databases. It begins with an introduction of the presenter. It then discusses how the era of big data came to be due to limitations of traditional relational databases and scaling approaches. The document introduces different NoSQL data models including document, key-value, graph and column-oriented databases. It provides examples of NoSQL databases that use each data model. The document discusses how NoSQL databases are better suited than relational databases for big data problems and provides a real-world example of Twitter's use of FlockDB. It concludes by discussing approaches for working with big data using MapReduce and provides examples of using MongoDB and Azure for big data.
Scaling HDFS to Manage Billions of Files with Distributed Storage SchemesDataWorks Summit
The document discusses scaling HDFS to manage billions of files through distributed storage schemes. It outlines the current HDFS architecture and challenges with namespace and block scaling. It proposes a storage container architecture with distributed block maps and a storage container manager to address these challenges. This would allow HDFS to easily scale to manage trillions of blocks and billions of files across large clusters.
Big Data and NoSQL for Database and BI ProsAndrew Brust
This document provides an agenda and overview for a conference session on Big Data and NoSQL for database and BI professionals held from April 10-12 in Chicago, IL. The session will include an overview of big data and NoSQL technologies, then deeper dives into Hadoop, NoSQL databases like HBase, and tools like Hive, Pig, and Sqoop. There will also be demos of technologies like HDInsight, Elastic MapReduce, Impala, and running MapReduce jobs.
Big Data Architecture Workshop - Vahid Amiridatastack
Big Data Architecture Workshop
These slides are about big data tools, technologies and layers that can be used in enterprise solutions.
TopHPC Conference
2019
Oracle GoldenGate and Apache Kafka: A Deep Dive Into Real-Time Data StreamingMichael Rainey
We produce quite a lot of data! Much of the data are business transactions stored in a relational database. More frequently, the data are non-structured, high volume and rapidly changing datasets known in the industry as Big Data. The challenge for data integration professionals is to combine and transform the data into useful information. Not just that, but it must also be done in near real-time and using a target system such as Hadoop. The topic of this session, real-time data streaming, provides a great solution for this challenging task. By integrating GoldenGate, Oracle’s premier data replication technology, and Apache Kafka, the latest open-source streaming and messaging system, we can implement a fast, durable, and scalable solution.
Presented at Oracle OpenWorld 2016
Temporal Tables, Transparent Archiving in DB2 for z/OS and IDAACuneyt Goksu
The document discusses several data archiving solutions for z/OS systems including temporal tables, transparent archiving, and IDAA technology. Temporal tables allow querying and updating historical data using system time periods. Transparent archiving moves old data to other storage platforms while still allowing dynamic queries. IDAA provides accelerated query performance for temporal tables by routing queries to an accelerator system. The solutions can be combined for different use cases depending on data retention and access needs.
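To make the system-time mechanics concrete, here is a minimal DB2 SQL sketch of a system-period temporal table with transparent history; the table, column, and history-table names are assumptions, and the exact DDL options should be checked against the DB2 for z/OS documentation for your version.

-- Base table with a system-time period (hypothetical names; DB2 for z/OS style DDL)
CREATE TABLE policy (
  policy_id  INTEGER NOT NULL PRIMARY KEY,
  premium    DECIMAL(10,2),
  sys_start  TIMESTAMP(12) NOT NULL GENERATED ALWAYS AS ROW BEGIN,
  sys_end    TIMESTAMP(12) NOT NULL GENERATED ALWAYS AS ROW END,
  trans_id   TIMESTAMP(12) GENERATED ALWAYS AS TRANSACTION START ID,
  PERIOD SYSTEM_TIME (sys_start, sys_end)
);

-- History table that receives the old versions of updated and deleted rows
CREATE TABLE policy_hist LIKE policy;
ALTER TABLE policy ADD VERSIONING USE HISTORY TABLE policy_hist;

-- Query the data as it looked at a point in the past
SELECT policy_id, premium
FROM   policy FOR SYSTEM_TIME AS OF '2017-01-01-00.00.00'
WHERE  policy_id = 42;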
Integrated Data Warehouse with Hadoop and Oracle DatabaseGwen (Chen) Shapira
This document discusses building an integrated data warehouse with Oracle Database and Hadoop. It provides an overview of big data and why data warehouses need Hadoop. It also gives examples of how Hadoop can be integrated into a data warehouse, including using Sqoop to import and export data between Hadoop and Oracle. Finally, it discusses best practices for using Hadoop efficiently and avoiding common pitfalls when integrating Hadoop with a data warehouse.
The document provides an overview of big data and Hadoop fundamentals. It discusses what big data is, the characteristics of big data, and how it differs from traditional data processing approaches. It then describes the key components of Hadoop including HDFS for distributed storage, MapReduce for distributed processing, and YARN for resource management. HDFS architecture and features are explained in more detail. MapReduce tasks, stages, and an example word count job are also covered. The document concludes with a discussion of Hive, including its use as a data warehouse infrastructure on Hadoop and its query language HiveQL.
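As a flavor of what the word-count example looks like once Hive is layered on top of MapReduce, here is a minimal HiveQL sketch (not the deck's own code); the docs table and its line column are hypothetical.

-- Hypothetical Hive table holding one line of raw text per row
CREATE TABLE docs (line STRING);

-- Classic word count: split each line into words, then group and count.
-- Hive compiles this query into MapReduce (or Tez/Spark) jobs under the covers.
SELECT word, COUNT(*) AS cnt
FROM (
  SELECT explode(split(line, '\\s+')) AS word
  FROM docs
) w
GROUP BY word
ORDER BY cnt DESC;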
The document outlines Renault's big data initiatives from 2014-2016 which progressed from an initial sandbox to a full industrialized big data platform. Key steps included implementing a new Hadoop infrastructure in 2015, industrializing the platform in 2016 to host production projects and POCs, and designing for scalability, isolation, simplified operations, and data protection. The document also discusses deploying quality projects to the data lake, ingestion scenarios, interactive SQL analytics, security measures including tokenization, and the next steps of federation and dynamic data change management.
SQL Server on Linux will provide the SQL Server database engine running natively on Linux. It allows customers choice in deploying SQL Server on the platform of their choice, including Linux, Windows, and containers. The public preview of SQL Server on Linux is available now, with the general availability target for 2017. It brings the full power of SQL Server to Linux, including features like In-Memory OLTP, Always Encrypted, and PolyBase.
Hadoop meets Agile! - An Agile Big Data ModelUwe Printz
The document proposes an Agile Big Data model to address perceived issues with traditional Hadoop implementations. It discusses the motivation for change and outlines an Agile model with self-organized roles including data stewards, data scientists, project teams, and an architecture board. Key aspects of the proposed model include independent and self-managed project teams, a domain-driven data model, and emphasis on data quality and governance through the involvement of data stewards across domains.
This document provides a comparison of SQL and NoSQL databases. It summarizes the key features of SQL databases, including their use of schemas, SQL query languages, ACID transactions, and examples like MySQL and Oracle. It also summarizes features of NoSQL databases, including their large data volumes, scalability, lack of schemas, eventual consistency, and examples like MongoDB, Cassandra, and HBase. The document aims to compare the different approaches of SQL and NoSQL for managing data.
This document discusses data management trends and Oracle's unified data management solution. It provides a high-level comparison of HDFS, NoSQL, and RDBMS databases. It then describes Oracle's Big Data SQL which allows SQL queries to be run across data stored in Hadoop. Oracle Big Data SQL aims to provide easy access to data across sources using SQL, unified security, and fast performance through smart scans.
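For flavor, the sketch below shows roughly what a Big Data SQL external table over an existing Hive table looks like; the cluster, directory, table, and column names are assumptions, and the access parameters should be verified against the Oracle Big Data SQL documentation before use.

-- Oracle external table that delegates scans to Hadoop via Big Data SQL
-- (hypothetical names; requires a configured Big Data SQL installation)
CREATE TABLE web_logs_ext (
  log_time   TIMESTAMP,
  user_id    VARCHAR2(40),
  url        VARCHAR2(4000)
)
ORGANIZATION EXTERNAL (
  TYPE ORACLE_HIVE
  DEFAULT DIRECTORY default_dir
  ACCESS PARAMETERS (
    com.oracle.bigdata.cluster=hadoop_cluster1
    com.oracle.bigdata.tablename=default.web_logs
  )
)
REJECT LIMIT UNLIMITED;

-- Queried like any other table; smart scans push the filtering work to the Hadoop nodes
SELECT user_id, COUNT(*) FROM web_logs_ext GROUP BY user_id;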
Hadoop YARN is the next generation computing platform in Apache Hadoop with support for programming paradigms besides MapReduce. In the world of Big Data, one cannot solve all the problems wholly using the MapReduce programming model. Typical installations run separate programming models like MR, MPI, and graph-processing frameworks on individual clusters. Running fewer, larger clusters is cheaper than running more small clusters. Therefore, leveraging YARN to allow both MR and non-MR applications to run on top of a common cluster becomes more important from an economical and operational point of view. This talk will cover the different APIs and RPC protocols that are available for developers to implement new application frameworks on top of YARN. We will also go through a simple application which demonstrates how one can implement their own Application Master, schedule requests to the YARN ResourceManager and then subsequently use the allocated resources to run user code on the NodeManagers.
From Insights to Value - Building a Modern Logical Data Lake to Drive User Ad...DataWorks Summit
Businesses often have to interact with different data sources to get a unified view of the business or to resolve discrepancies. These EDW data repositories are often large and complex, are business critical, and cannot afford downtime. This session will share best practices and lessons learned for building a Data Fabric on Spark / Hadoop / Hive / NoSQL that provides a unified view, enables simplified access to the data repositories, resolves technical challenges and adds business value.
Relational databases vs Non-relational databasesJames Serra
There is a lot of confusion about the place and purpose of the many recent non-relational database solutions ("NoSQL databases") compared to the relational database solutions that have been around for so many years. In this presentation I will first clarify what exactly these database solutions are, compare them, and discuss the best use cases for each. I'll discuss topics involving OLTP, scaling, data warehousing, polyglot persistence, and the CAP theorem. We will even touch on a new type of database solution called NewSQL. If you are building a new solution it is important to understand all your options so you take the right path to success.
How to Use Innovative Data Handling and Processing Techniques to Drive Alpha ...DataWorks Summit
For over 30 years, Parametric has been a leading provider of model-based portfolios to institutional and private investors, with unique implementation and customization expertise. Much like other cutting-edge financial services providers, Parametric operates with highly diverse, fast moving data from which they glean insights. Data sources range from benchmark providers to electronic trading participants to stock exchanges etc. The challenge is to not just onboard the data but also to figure out how to monetize it when the schemas are fast changing. This presents a problem to traditional architectures where large teams are needed to design the new ETL flow. Organizations that are able to quickly adapt to new schemas and data sources have a distinct competitive advantage.
In this presentation and demo, architects from Parametric, Chris Gambino & Vamsi Chemitiganti, will present the data architecture designed in response to this business challenge. We discuss the approach (and trade-offs) to pooling, managing, and processing the data using the latest techniques in data ingestion & pre-processing. The overall best practices in creating a central data pool are also discussed, with the goal of giving quantitative analysts the most accurate and up-to-date information for their models to work on. Attendees will be able to draw on their experiences, both from a business and a technology standpoint, on not just creating a centralized data platform but also being able to distribute it to different units.
While you could be tempted to assume data is already safe in a single Hadoop cluster, in practice you have to plan for more. Questions like "What happens if the entire datacenter fails?" or "How do I recover into a consistent state of data, so that applications can continue to run?" are not at all trivial to answer for Hadoop. Did you know that HDFS snapshots do not treat open files as immutable? Or that HBase snapshots are executed asynchronously across servers and therefore cannot guarantee atomicity for cross-region updates (which includes tables)? There is no unified and coherent data backup strategy, nor is there tooling available for many of the included components to build such a strategy. The Hadoop distributions largely avoid this topic, as most customers are still in the "single use-case" or PoC phase, where data governance as far as backup and disaster recovery (BDR) is concerned is not (yet) important. This talk first introduces the overarching issue and difficulties of backup and data safety, looking at each of the many components in Hadoop, including HDFS, HBase, YARN, Oozie, the management components and so on, and finally shows you a viable approach using built-in tools. You will also learn not to take this topic lightly and what is needed to implement and guarantee continuous operation of Hadoop cluster based solutions.
Vasu Balla presented on running Oracle E-Business Suite on Oracle Cloud. Key points include:
1) EBS deployment on Oracle Cloud consists of middleware and database tiers running on Oracle Compute Cloud and Database Cloud Service/Exadata Cloud Service.
2) Deployment approaches include new implementations using marketplace images, and lift and shift of existing on-premises instances using remote cloning utilities and an EBS provisioning tool.
3) There are limitations around maximum database size, cloning automation, and disaster recovery that require further enhancements.
Virtualization is the creation of virtual versions of operating systems, servers, storage devices, databases, and other resources. It has seen rapid adoption as organizations virtualize more of their infrastructure. Copy data management (CDM) automates the creation of copies of data and applications for development, testing, and disaster recovery. It helps provision environments quickly, mask sensitive data, and reduce storage usage through techniques like deduplication and thin cloning. CDM solutions help organizations more efficiently manage the growing copies of data needed to support modern development and testing processes.
Hadoop is an open-source framework for distributed storage and processing of large datasets across clusters of commodity hardware. It addresses limitations in traditional RDBMS for big data by allowing scaling to large clusters of commodity servers, high fault tolerance, and distributed processing. The core components of Hadoop are HDFS for distributed storage and MapReduce for distributed processing. Hadoop has an ecosystem of additional tools like Pig, Hive, HBase and more. Major companies use Hadoop to process and gain insights from massive amounts of structured and unstructured data.
This document provides an overview of architecting a first big data implementation. It defines key concepts like Hadoop, NoSQL databases, and real-time processing. It recommends asking questions about data, technology stack, and skills before starting a project. Distributed file systems, batch tools, and streaming systems like Kafka are important technologies for big data architectures. The document emphasizes moving from batch to real-time processing as a major opportunity.
The document provides an overview of big data and Hadoop, discussing what big data is, current trends and challenges, approaches to solving big data problems including distributed computing, NoSQL, and Hadoop, and introduces HDFS and the MapReduce framework in Hadoop for distributed storage and processing of large datasets.
That won’t fit into RAM - Michał BrzezickiEvention
SentiOne is one of the leading solutions in Europe for social media listening and analysis. We monitor over 26 European markets including CEE, Scandinavia, DACH, and the Balkans. The amount of data that is processed every day and is ready to be queried by our users is enormous. Over the years we have tested many technologies and approaches in big data, many of which have failed. The presentation includes our experiences and lessons learned from setting up a big data company from scratch. I will give details on configuring a robust Elasticsearch cluster with over 26TB of data and describe key challenges in efficient web crawling and data extraction.
Vote Early, Vote Often: From Napkin to Canvassing Application in a Single Wee...Jim Czuprynski
The frenetic pace of application development in modern IT organizations means it’s not unusual to demand that an application go from minimal requirement gathering – literally, a napkin-based sketch – to a working first draft within extremely short time frames – even a weekend! – with production deployment to follow just a few days later.
I'll demonstrate a real-life application development scenario – the creation of a mobile application that gives election canvassers a tool to identify, classify, and inform voters in a huge suburban Chicago voting district – using the latest Oracle application development UI, data modeling tools, and database technology. Along the way, we’ll show how Oracle APEX makes short work of building a working application while the Oracle DBA leverages her newest tools – SQL Developer and Data Modeler – to build a secure, reliable, scalable application for her development team.
Apache Hive was initiated at Facebook in 2007 in response to the company's rapid data growth.
Facebook's existing ETL system began to fail over the following few years as more people joined the service.
In August 2008, Facebook decided to move to a more scalable open-source Hadoop-based environment: Hive.
Facebook, Netflix, and Amazon now support Apache Hive's SQL dialect, known as HiveQL.
Colorado Springs Open Source Hadoop/MySQL David Smelker
This document discusses MySQL and Hadoop integration. It covers structured versus unstructured data and the capabilities and limitations of relational databases, NoSQL, and Hadoop. It also describes several tools for integrating MySQL and Hadoop, including Sqoop for data transfers, MySQL Applier for streaming changes to Hadoop, and MySQL NoSQL interfaces. The document outlines the typical life cycle of big data with MySQL playing a role in data acquisition, organization, analysis, and decisions.
The document provides an overview of Hadoop, including:
- A brief history of Hadoop and its origins from Google and Apache projects
- An explanation of Hadoop's architecture including HDFS, MapReduce, JobTracker, TaskTracker, and DataNodes
- Examples of how large companies like Yahoo, Facebook, and Amazon use Hadoop for applications like log processing, searches, and advertisement targeting
The document provides an overview of Hadoop, including:
- A brief history of Hadoop and its origins at Google and Yahoo
- An explanation of Hadoop's architecture including HDFS, MapReduce, JobTracker, TaskTracker, and DataNodes
- Examples of how large companies like Facebook and Amazon use Hadoop to process massive amounts of data
Vasu Balla of Pythian presented on best practices for upgrading an Oracle E-Business Suite database to Oracle Database 12c. The typical upgrade process involves installing the 12c Oracle home, upgrading the database using DBUA or CLI, and completing post-upgrade steps. Key best practices include ensuring initialization parameters and patches are configured properly, using AWR to identify performance issues and bad execution plans post-upgrade, and preserving good plans using SQL plan baselines. Regular statistics gathering and tuning of database settings also helps optimize performance.
The Hadoop Distributed File System (HDFS) is evolving from a MapReduce-centric storage system into a generic, cost-effective storage infrastructure where HDFS stores all of an organization's data. The new use case presents a new set of challenges to the original HDFS architecture. One challenge is to scale the storage management of HDFS - the centralized scheme within the NameNode becomes a main bottleneck which limits the total number of files stored. Although a typical large HDFS cluster is able to store several hundred petabytes of data, it is inefficient at handling large numbers of small files under the current architecture.
In this talk, we introduce our new design and in-progress work that re-architects HDFS to attack this limitation. The storage management is enhanced to a distributed scheme. A new concept of storage container is introduced for storing objects. HDFS blocks are stored and managed as objects in the storage containers instead of being tracked only by NameNode. Storage containers are replicated across DataNodes using a newly-developed high-throughput protocol based on the Raft consensus algorithm. Our current prototype shows that under the new architecture the storage management of HDFS scales 10x better, demonstrating that HDFS is capable of storing billions of files.
Scaling Storage and Computation with Hadoopyaevents
Hadoop provides distributed storage and a framework for the analysis and transformation of very large data sets using the MapReduce paradigm. Hadoop partitions data and computation across thousands of hosts and executes application computations in parallel, close to their data. A Hadoop cluster scales computation capacity, storage capacity and IO bandwidth by simply adding commodity servers. Hadoop is an Apache Software Foundation project; it unites hundreds of developers, and hundreds of organizations worldwide report using Hadoop. This presentation will give an overview of the Hadoop family of projects with a focus on its distributed storage solutions.
Rapid Cluster Computing with Apache Spark 2016Zohar Elkayam
This is the presentation I used for Oracle Week 2016 session about Apache Spark.
In the agenda:
- The Big Data problem and possible solutions
- Basic Spark Core
- Working with RDDs
- Working with Spark Cluster and Parallel programming
- Spark modules: Spark SQL and Spark Streaming
- Performance and Troubleshooting
This document provides information about Hadoop and its components. It discusses the history of Hadoop and how it has evolved over time. It describes key Hadoop components including HDFS, MapReduce, YARN, and HBase. HDFS is the distributed file system of Hadoop that stores and manages large datasets across clusters. MapReduce is a programming model used for processing large datasets in parallel. YARN is the cluster resource manager that allocates resources to applications. HBase is the Hadoop database that provides real-time random data access.
Similar to Things Every Oracle DBA Needs to Know About the Hadoop Ecosystem (c17lv version) (20)
Docker Concepts for Oracle/MySQL DBAs and DevOpsZohar Elkayam
Oracle Week 2017 Slides
Agenda:
Docker overview – why do we even need containers?
Installing Docker and getting started
Images and Containers
Docker Networks
Docker Storage and Volumes
Oracle and Docker
Docker tools, GUI and Swarm
Oracle Database Performance Tuning Advanced Features and Best Practices for DBAsZohar Elkayam
Oracle Week 2017 slides.
Agenda:
Basics: How and What To Tune?
Using the Automatic Workload Repository (AWR)
Using AWR-Based Tools: ASH, ADDM
Real-Time Database Operation Monitoring (12c)
Identifying Problem SQL Statements
Using SQL Performance Analyzer
Tuning Memory (SGA and PGA)
Parallel Execution and Compression
Oracle Database 12c Performance New Features
The art of querying – newest and advanced SQL techniquesZohar Elkayam
Presentation from Oracle Week 2017.
Agenda:
Aggregative and advanced grouping options
Analytic functions, ranking and pagination
Hierarchical and recursive queries (see the SQL sketch after this agenda)
Regular Expressions
Oracle 12c new rows pattern matching
XML and JSON handling with SQL
Oracle 12c (12.1 + 12.2) new features
SQL Developer Command Line tool (if time allows)
Oracle 18c
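Below is a minimal sketch of the hierarchical-query item from the agenda above, shown both in Oracle's CONNECT BY syntax and as an ANSI recursive WITH clause; it assumes an HR-style employees table and is illustrative rather than taken from the deck.

-- Classic Oracle hierarchical query over a hypothetical HR-style employees table
SELECT LEVEL, employee_id, manager_id, last_name
FROM   employees
START  WITH manager_id IS NULL
CONNECT BY PRIOR employee_id = manager_id;

-- Equivalent ANSI recursive WITH clause (recursive subquery factoring, Oracle 11gR2+)
WITH org (employee_id, manager_id, last_name, lvl) AS (
  SELECT employee_id, manager_id, last_name, 1
  FROM   employees
  WHERE  manager_id IS NULL
  UNION ALL
  SELECT e.employee_id, e.manager_id, e.last_name, o.lvl + 1
  FROM   employees e
         JOIN org o ON e.manager_id = o.employee_id
)
SELECT * FROM org;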
Oracle Advanced SQL and Analytic FunctionsZohar Elkayam
Even though DBAs and developers are writing SQL queries every day, it seems that advanced SQL techniques such as multidimension aggregation and analytic functions still remain relatively unknown. In this session, we will explore some of the common real-world usages for analytic function and understand how to take advantage of this great and useful tool. We will deep dive into ranking based on values and groups, understand aggregation of multiple dimensions without a group by, see how to do inter-row calculations, and much more.
These are the presentation slides presented at Kscope17 on June 28, 2017.
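As a small taste of the techniques the session covers (ranking within groups, aggregates without a GROUP BY, and inter-row calculations), here is a SQL sketch against a hypothetical sales table; it is an illustration, not the session's own example.

-- Rank salespeople within each region and compare each month to the previous one
-- (the sales table and its columns are hypothetical)
SELECT region,
       salesperson,
       sale_month,
       amount,
       RANK()      OVER (PARTITION BY region ORDER BY amount DESC) AS rank_in_region,
       SUM(amount) OVER (PARTITION BY region)                      AS region_total,
       amount - LAG(amount) OVER (PARTITION BY salesperson
                                  ORDER BY sale_month)             AS change_vs_prev_month
FROM   sales;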
Oracle 12c New Features For Better PerformanceZohar Elkayam
This document discusses new features in Oracle 12c that improve database performance. It begins with an introduction of the speaker and their company Brillix. The document then covers Oracle Database In-Memory Column Store introduced in 12.1, which allows both row and column format data access. Oracle 12.2 introduced Sharded Database Architecture for horizontal scaling across multiple databases. Additional optimizer changes in 12c such as adaptive query optimization and dynamic statistics are also summarized.
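As a taste of the 12.1.0.2 In-Memory Column Store mentioned above, here is a minimal sketch; the table name is hypothetical, and the INMEMORY clause requires the Database In-Memory option with a non-zero inmemory_size configured.

-- Mark a table for population into the In-Memory column store
ALTER TABLE sales INMEMORY MEMCOMPRESS FOR QUERY LOW PRIORITY HIGH;

-- Check what has actually been populated into the column store
SELECT segment_name, populate_status, inmemory_size
FROM   v$im_segments;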
Introduction to Oracle Data Guard BrokerZohar Elkayam
This is an old deck I recently renewed for a customer session. It is an introduction to the Oracle Data Guard Broker feature: how to deploy it, how to use it, and what its benefits are.
This presentation is based on version 11g, but most of it is also compatible with Oracle 12c.
Agenda:
- Oracle Data Guard overview
- Dataguard broker introduction
- Configuring and using the data guard
- Live Demos
Exploring Oracle Multitenant in Oracle Database 12cZohar Elkayam
This document discusses Oracle Multitenant in Oracle Database 12c. It provides an overview of the multitenant architecture including containers, benefits such as lower costs and easier provisioning, and impacts such as shared redo logs and one character set across PDBs. It also covers deployment including creating a CDB, provisioning new PDBs from the seed database, plugging in non-CDBs, and cloning PDBs.
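A minimal sketch of provisioning a new PDB from the seed and cloning it, as described above; the PDB names, password, and file paths are assumptions.

-- Create a new pluggable database from PDB$SEED (run as SYSDBA in the CDB root)
CREATE PLUGGABLE DATABASE pdb_sales
  ADMIN USER pdb_admin IDENTIFIED BY MySecretPwd1
  FILE_NAME_CONVERT = ('/u01/oradata/cdb1/pdbseed/',
                       '/u01/oradata/cdb1/pdb_sales/');

-- A new PDB starts in MOUNTED mode; open it for use
ALTER PLUGGABLE DATABASE pdb_sales OPEN;

-- Cloning an existing PDB is just as short
CREATE PLUGGABLE DATABASE pdb_sales_test FROM pdb_sales
  FILE_NAME_CONVERT = ('/u01/oradata/cdb1/pdb_sales/',
                       '/u01/oradata/cdb1/pdb_sales_test/');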
Things Every Oracle DBA Needs To Know About The Hadoop EcosystemZohar Elkayam
This is a presentation which was presented in multiple forums (in one form or another). It is a short introduction to Big Data and the Hadoop ecosystem for Oracle personnel (DBAs and DB developers).
In the agenda:
• What is the Big Data challenge?
• A Big Data Solution: Apache Hadoop
• HDFS
• MapReduce and YARN
• Hadoop Ecosystem: HBase, Sqoop, Hive, Pig and other tools
• Another Big Data Solution: Apache Spark
• Where does the DBA fit in?
This presentation was presented in DOAG 2016, HROUG 2016, BGOUG 2016, ILOUG Tech Days 2016 and other small private sessions (Israel Technology Police leaders, CIO forum, Amdocs, and others).
Advanced PL/SQL Optimizing for Better Performance 2016Zohar Elkayam
This is the presentation I used for Oracle Week 2016 session. This includes new features from both 12cR1 and 12cR2.
Agenda:
Developing PL/SQL:
- Composite datatypes, advanced cursors, dynamic SQL, tracing, and more…
Compiling PL/SQL:
- dependencies, optimization levels, and DBMS_WARNING
Tuning PL/SQL:
- GTT, Result cache and Memory handling
- Oracle 11g, 12cR1 and 12cR2 new useful features
- SQLcl – New replacement tool for SQL*Plus (if we have time)
This is a presentation from Oracle Week 2016 (Israel). It is a newer version of last year's deck, with new 12cR2 features and a demo.
In the agenda:
Aggregative and advanced grouping options
Analytic functions, ranking and pagination
Hierarchical and recursive queries
Regular Expressions
Oracle 12c new rows pattern matching
XML and JSON handling with SQL
Oracle 12c (12.1 + 12.2) new features
SQL Developer Command Line tool
MySQL 5.7 New Features for Developers session for DOAG (Oracle user group conference) in 2016. A similar version was also presented in Israel MySQL User Group on November 2016.
This presentation reviews new features in MySQL 5.7: the optimizer, the InnoDB engine, the native JSON data type, and the performance and sys schemas.
OOW2016: Exploring Advanced SQL Techniques Using Analytic FunctionsZohar Elkayam
This is the presentation I gave on the Oracle Open World 2016 - the topic was group functions and analytic functions.
We talked about reporting analytic functions, ranking, and a couple of Oracle 12c new features like the top-N query syntax and pattern matching.
This presentation includes the bonus slides which were not presented at the event itself, as promised.
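For reference, the two Oracle 12c features mentioned above (top-N query syntax and row pattern matching) look roughly like this; the tables and columns are hypothetical and this is a sketch rather than the deck's own example.

-- 12c top-N / pagination syntax: page 2 of results, 10 rows per page
SELECT product_id, revenue
FROM   product_sales
ORDER  BY revenue DESC
OFFSET 10 ROWS FETCH NEXT 10 ROWS ONLY;

-- 12c row pattern matching: find V-shaped price movements per symbol
SELECT *
FROM   ticker
MATCH_RECOGNIZE (
  PARTITION BY symbol
  ORDER BY trade_date
  MEASURES strt.trade_date    AS start_date,
           LAST(up.trade_date) AS end_date
  ONE ROW PER MATCH
  PATTERN (strt down+ up+)
  DEFINE
    down AS price < PREV(price),
    up   AS price > PREV(price)
);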
Is SQLcl the Next Generation of SQL*Plus?Zohar Elkayam
Session from ILOUG I presented in May, 2016
Introducing SQLcl – a new command-line tool from the SQL Developer team that might replace SQL*Plus, which has been around for over 30 years, and all of its functions!
In this session, we will explore the new functionality of SQLcl, and use a live demonstration to show what SQLcl has to offer over the old SQL*Plus. We will use real-life examples to see what makes this tool such a time saver in day-to-day tasks for DBAs and developers who prefer using the command line interface.
Exploring Advanced SQL Techniques Using Analytic FunctionsZohar Elkayam
Session from ILOUG I presented in May, 2016
Even though DBAs and developers are writing SQL queries every day, it seems that advanced SQL techniques such as multi-dimension aggregation and analytic functions still remain relatively unknown. In this session, we will explore some of the common real-world usages for analytic functions, and understand how to take advantage of this great and useful tool. We will deep dive into ranking based on values and groups; understand aggregation of multiple dimensions without a group by; see how to do inter-row calculations, and much, much more…
Together we will see how we can unleash the power of analytics using Oracle 11g best practices and Oracle 12c new features.
Things Every Oracle DBA Needs to Know about the Hadoop EcosystemZohar Elkayam
Session from BGOUG I presented in June, 2016
Big data is one of the biggest buzzwords in today's market. Terms like Hadoop, HDFS, YARN, Sqoop, and non-structured data have been scaring DBAs since 2010 - but where does the DBA team really fit in?
In this session, we will discuss everything database administrators and database developers need to know about big data. We will demystify the Hadoop ecosystem and explore the different components. We will learn how HDFS and MapReduce are changing the data world, and where traditional databases fit into the grand scheme of things. We will also talk about why DBAs are the perfect candidates to transition into big data and Hadoop professionals and experts.
Exploring Advanced SQL Techniques Using Analytic FunctionsZohar Elkayam
Session from BGOUG I presented in June, 2016
Even though DBAs and developers are writing SQL queries every day, it seems that advanced SQL techniques such as multi-dimension aggregation and analytic functions still remain relatively unknown. In this session, we will explore some of the common real-world usages for analytic functions, and understand how to take advantage of this great and useful tool. We will deep dive into ranking based on values and groups; understand aggregation of multiple dimensions without a group by; see how to do inter-row calculations, and much, much more…
Together we will see how we can unleash the power of analytics using Oracle 11g best practices and Oracle 12c new features.
Advanced PLSQL Optimizing for Better PerformanceZohar Elkayam
A Presentation from Oracle Week 2015 in Israel
Agenda:
• Developing PL/SQL:
o Composite Data Types: Records, Collections and Table type
o Advanced Cursors: Ref cursor, Cursor function, Cursor subquery in PL/SQL
o Bulk Binding
o Dynamic SQL – SQL Injection
o Tracing PL/SQL Execution
o Design patterns for PL/SQL: Autonomous Transactions, Invoker and Definer rights, serially_reusable code
o Triggers Improvements
• Compiling PL/SQL:
o PL/SQL Fine-Grain Dependency Management
o PLSQL_OPTIMIZE_LEVEL parameter
o PL/SQL Compile-Time Warnings and Using DBMS_WARNING package
• Tuning PL/SQL:
o Handling Packages in Memory
o Global Temporary Tables
o PL/SQL Function Result Cache and pitfalls
• Oracle Database 12c PL/SQL new features: What is new in Oracle 12c
o Language Usability Enhancements
o New Limitations
• Additional useful features, Tips and Tricks for better performance
Oracle Week 2015 presentation (Presented on November 15, 2015)
Agenda:
Aggregative and advanced grouping options
Analytic functions, ranking and pagination
Hierarchical and recursive queries
Oracle 12c new rows pattern matching feature
XML and JSON handling with SQL
Regular Expressions
SQLcl – a new replacement tool for SQL*Plus from Oracle
This document provides an overview of the Hadoop ecosystem. It begins with introducing big data challenges around volume, variety, and velocity of data. It then introduces Hadoop as an open-source framework for distributed storage and processing of large datasets across clusters of computers. The key components of Hadoop are HDFS (Hadoop Distributed File System) for distributed storage and high throughput access to application data, and MapReduce as a programming model for distributed computing on large datasets. HDFS stores data reliably using data replication across nodes and is optimized for throughput over large files and datasets.
“An Outlook of the Ongoing and Future Relationship between Blockchain Technologies and Process-aware Information Systems.” Invited talk at the joint workshop on Blockchain for Information Systems (BC4IS) and Blockchain for Trusted Data Sharing (B4TDS), co-located with the 36th International Conference on Advanced Information Systems Engineering (CAiSE), 3 June 2024, Limassol, Cyprus.
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...Neo4j
Leonard Jayamohan, Partner & Generative AI Lead, Deloitte
This keynote will reveal how Deloitte leverages Neo4j’s graph power for groundbreaking digital twin solutions, achieving a staggering 100x performance boost. Discover the essential role knowledge graphs play in successful generative AI implementations. Plus, get an exclusive look at an innovative Neo4j + Generative AI solution Deloitte is developing in-house.
Removing Uninteresting Bytes in Software FuzzingAftab Hussain
Imagine a world where software fuzzing, the process of mutating bytes in test seeds to uncover hidden and erroneous program behaviors, becomes faster and more effective. A lot depends on the initial seeds, which can significantly dictate the trajectory of a fuzzing campaign, particularly in terms of how long it takes to uncover interesting behaviour in your code. We introduce DIAR, a technique designed to speed up fuzzing campaigns by pinpointing and eliminating those uninteresting bytes in the seeds. Picture this: instead of wasting valuable resources on meaningless mutations in large, bloated seeds, DIAR removes the unnecessary bytes, streamlining the entire process.
In this work, we equipped AFL, a popular fuzzer, with DIAR and examined two critical Linux libraries -- Libxml's xmllint, a tool for parsing XML documents, and Binutils' readelf, an essential debugging and security analysis command-line tool used to display detailed information about ELF (Executable and Linkable Format) files. Our preliminary results show that AFL+DIAR not only discovers new paths more quickly but also achieves higher coverage overall. This work thus showcases how starting with lean and optimized seeds can lead to faster, more comprehensive fuzzing campaigns -- and DIAR helps you find such seeds.
- These are slides of the talk given at IEEE International Conference on Software Testing Verification and Validation Workshop, ICSTW 2022.
Sudheer Mechineni, Head of Application Frameworks, Standard Chartered Bank
Discover how Standard Chartered Bank harnessed the power of Neo4j to transform complex data access challenges into a dynamic, scalable graph database solution. This keynote will cover their journey from initial adoption to deploying a fully automated, enterprise-grade causal cluster, highlighting key strategies for modelling organisational changes and ensuring robust disaster recovery. Learn how these innovations have not only enhanced Standard Chartered Bank’s data infrastructure but also positioned them as pioneers in the banking sector’s adoption of graph technology.
UiPath Test Automation using UiPath Test Suite series, part 6DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 6. In this session, we will cover Test Automation with generative AI and Open AI.
UiPath Test Automation with generative AI and Open AI webinar offers an in-depth exploration of leveraging cutting-edge technologies for test automation within the UiPath platform. Attendees will delve into the integration of generative AI, a test automation solution, with Open AI advanced natural language processing capabilities.
Throughout the session, participants will discover how this synergy empowers testers to automate repetitive tasks, enhance testing accuracy, and expedite the software testing life cycle. Topics covered include the seamless integration process, practical use cases, and the benefits of harnessing AI-driven automation for UiPath testing initiatives. By attending this webinar, testers, and automation professionals can gain valuable insights into harnessing the power of AI to optimize their test automation workflows within the UiPath ecosystem, ultimately driving efficiency and quality in software development processes.
What will you get from this session?
1. Insights into integrating generative AI.
2. Understanding how this integration enhances test automation within the UiPath platform
3. Practical demonstrations
4. Exploration of real-world use cases illustrating the benefits of AI-driven test automation for UiPath
Topics covered:
What is generative AI
Test Automation with generative AI and Open AI.
UiPath integration with generative AI
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Pushing the limits of ePRTC: 100ns holdover for 100 daysAdtran
At WSTS 2024, Alon Stern explored the topic of parametric holdover and explained how recent research findings can be implemented in real-world PNT networks to achieve 100 nanoseconds of accuracy for up to 100 days.
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfPaige Cruz
Monitoring and observability aren’t traditionally found in software curriculums and many of us cobble this knowledge together from whatever vendor or ecosystem we were first introduced to and whatever is a part of your current company’s observability stack.
While the dev and ops silo continues to crumble….many organizations still relegate monitoring & observability as the purview of ops, infra and SRE teams. This is a mistake - achieving a highly observable system requires collaboration up and down the stack.
I, a former op, would like to extend an invitation to all application developers to join the observability party will share these foundational concepts to build on:
In the rapidly evolving landscape of technologies, XML continues to play a vital role in structuring, storing, and transporting data across diverse systems. The recent advancements in artificial intelligence (AI) present new methodologies for enhancing XML development workflows, introducing efficiency, automation, and intelligent capabilities. This presentation will outline the scope and perspective of utilizing AI in XML development. The potential benefits and the possible pitfalls will be highlighted, providing a balanced view of the subject.
We will explore the capabilities of AI in understanding XML markup languages and autonomously creating structured XML content. Additionally, we will examine the capacity of AI to enrich plain text with appropriate XML markup. Practical examples and methodological guidelines will be provided to elucidate how AI can be effectively prompted to interpret and generate accurate XML markup.
Further emphasis will be placed on the role of AI in developing XSLT, or schemas such as XSD and Schematron. We will address the techniques and strategies adopted to create prompts for generating code, explaining code, or refactoring the code, and the results achieved.
The discussion will extend to how AI can be used to transform XML content. In particular, the focus will be on the use of AI XPath extension functions in XSLT, Schematron, Schematron Quick Fixes, or for XML content refactoring.
The presentation aims to deliver a comprehensive overview of AI usage in XML development, providing attendees with the necessary knowledge to make informed decisions. Whether you’re at the early stages of adopting AI or considering integrating it in advanced XML development, this presentation will cover all levels of expertise.
By highlighting the potential advantages and challenges of integrating AI with XML development tools and languages, the presentation seeks to inspire thoughtful conversation around the future of XML development. We’ll not only delve into the technical aspects of AI-powered XML development but also discuss practical implications and possible future directions.
For the full video of this presentation, please visit: https://www.edge-ai-vision.com/2024/06/building-and-scaling-ai-applications-with-the-nx-ai-manager-a-presentation-from-network-optix/
Robin van Emden, Senior Director of Data Science at Network Optix, presents the “Building and Scaling AI Applications with the Nx AI Manager,” tutorial at the May 2024 Embedded Vision Summit.
In this presentation, van Emden covers the basics of scaling edge AI solutions using the Nx tool kit. He emphasizes the process of developing AI models and deploying them globally. He also showcases the conversion of AI models and the creation of effective edge AI pipelines, with a focus on pre-processing, model conversion, selecting the appropriate inference engine for the target hardware and post-processing.
van Emden shows how Nx can simplify the developer’s life and facilitate a rapid transition from concept to production-ready applications.He provides valuable insights into developing scalable and efficient edge AI solutions, with a strong focus on practical implementation.
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAUpanagenda
Webinar Recording: https://www.panagenda.com/webinars/hcl-notes-und-domino-lizenzkostenreduzierung-in-der-welt-von-dlau/
DLAU und die Lizenzen nach dem CCB- und CCX-Modell sind für viele in der HCL-Community seit letztem Jahr ein heißes Thema. Als Notes- oder Domino-Kunde haben Sie vielleicht mit unerwartet hohen Benutzerzahlen und Lizenzgebühren zu kämpfen. Sie fragen sich vielleicht, wie diese neue Art der Lizenzierung funktioniert und welchen Nutzen sie Ihnen bringt. Vor allem wollen Sie sicherlich Ihr Budget einhalten und Kosten sparen, wo immer möglich. Das verstehen wir und wir möchten Ihnen dabei helfen!
Wir erklären Ihnen, wie Sie häufige Konfigurationsprobleme lösen können, die dazu führen können, dass mehr Benutzer gezählt werden als nötig, und wie Sie überflüssige oder ungenutzte Konten identifizieren und entfernen können, um Geld zu sparen. Es gibt auch einige Ansätze, die zu unnötigen Ausgaben führen können, z. B. wenn ein Personendokument anstelle eines Mail-Ins für geteilte Mailboxen verwendet wird. Wir zeigen Ihnen solche Fälle und deren Lösungen. Und natürlich erklären wir Ihnen das neue Lizenzmodell.
Nehmen Sie an diesem Webinar teil, bei dem HCL-Ambassador Marc Thomas und Gastredner Franz Walder Ihnen diese neue Welt näherbringen. Es vermittelt Ihnen die Tools und das Know-how, um den Überblick zu bewahren. Sie werden in der Lage sein, Ihre Kosten durch eine optimierte Domino-Konfiguration zu reduzieren und auch in Zukunft gering zu halten.
Diese Themen werden behandelt
- Reduzierung der Lizenzkosten durch Auffinden und Beheben von Fehlkonfigurationen und überflüssigen Konten
- Wie funktionieren CCB- und CCX-Lizenzen wirklich?
- Verstehen des DLAU-Tools und wie man es am besten nutzt
- Tipps für häufige Problembereiche, wie z. B. Team-Postfächer, Funktions-/Testbenutzer usw.
- Praxisbeispiele und Best Practices zum sofortigen Umsetzen
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!SOFTTECHHUB
As the digital landscape continually evolves, operating systems play a critical role in shaping user experiences and productivity. The launch of Nitrux Linux 3.5.0 marks a significant milestone, offering a robust alternative to traditional systems such as Windows 11. This article delves into the essence of Nitrux Linux 3.5.0, exploring its unique features, advantages, and how it stands as a compelling choice for both casual users and tech enthusiasts.
TrustArc Webinar - 2024 Global Privacy SurveyTrustArc
How does your privacy program stack up against your peers? What challenges are privacy teams tackling and prioritizing in 2024?
In the fifth annual Global Privacy Benchmarks Survey, we asked over 1,800 global privacy professionals and business executives to share their perspectives on the current state of privacy inside and outside of their organizations. This year’s report focused on emerging areas of importance for privacy and compliance professionals, including considerations and implications of Artificial Intelligence (AI) technologies, building brand trust, and different approaches for achieving higher privacy competence scores.
See how organizational priorities and strategic approaches to data security and privacy are evolving around the globe.
This webinar will review:
- The top 10 privacy insights from the fifth annual Global Privacy Benchmarks Survey
- The top challenges for privacy leaders, practitioners, and organizations in 2024
- Key themes to consider in developing and maintaining your privacy program
Climate Impact of Software Testing at Nordic Testing DaysKari Kakkonen
My slides at Nordic Testing Days 6.6.2024
Climate impact / sustainability of software testing discussed on the talk. ICT and testing must carry their part of global responsibility to help with the climat warming. We can minimize the carbon footprint but we can also have a carbon handprint, a positive impact on the climate. Quality characteristics can be added with sustainability, and then measured continuously. Test environments can be used less, and in smaller scale and on demand. Test techniques can be used in optimizing or minimizing number of tests. Test automation can be used to speed up testing.
Full-RAG: A modern architecture for hyper-personalizationZilliz
Mike Del Balso, CEO & Co-Founder at Tecton, presents "Full RAG," a novel approach to AI recommendation systems, aiming to push beyond the limitations of traditional models through a deep integration of contextual insights and real-time data, leveraging the Retrieval-Augmented Generation architecture. This talk will outline Full RAG's potential to significantly enhance personalization, address engineering challenges such as data management and model training, and introduce data enrichment with reranking as a key solution. Attendees will gain crucial insights into the importance of hyperpersonalization in AI, the capabilities of Full RAG for advanced personalization, and strategies for managing complex data integrations for deploying cutting-edge AI solutions.
GraphRAG for Life Science to increase LLM accuracyTomaz Bratanic
GraphRAG for life science domain, where you retriever information from biomedical knowledge graphs using LLMs to increase the accuracy and performance of generated answers
Threats to mobile devices are more prevalent and increasing in scope and complexity. Users of mobile devices desire to take full advantage of the features
available on those devices, but many of the features provide convenience and capability but sacrifice security. This best practices guide outlines steps the users can take to better protect personal devices and information.
National Security Agency - NSA mobile device best practices
Things Every Oracle DBA Needs to Know About the Hadoop Ecosystem (c17lv version)
1. Things Every Oracle DBA Needs to Know about the Hadoop Ecosystem
Session ID: 690
Prepared by: Zohar Elkayam, Brillix
@realmgic
https://www.realdbamagic.com
2. April 2-6, 2017 in Las Vegas, NV USA #C17LV
Who am I?
• Zohar Elkayam, CTO at Brillix
• Programmer, DBA, team leader, database trainer,
public speaker, and a senior consultant for over
19 years
• Oracle ACE Associate
• Member of ilOUG – Israel Oracle User Group
• Involved with Big Data projects since 2011
• Blogger – www.realdbamagic.com and
www.ilDBA.co.il
2
3. April 2-6, 2017 in Las Vegas, NV USA #C17LV
About Brillix
• We offer complete, integrated end-to-end solutions based
on best-of-breed innovations in database, security and big
data technologies
• We provide complete end-to-end 24x7 expert remote
database services
• We offer professional, customized on-site training,
delivered by our top-notch, world-recognized instructors
3
4. April 2-6, 2017 in Las Vegas, NV USA #C17LV
Some of Our Customers
4
5. April 2-6, 2017 in Las Vegas, NV USA #C17LV
Agenda
• What is the Big Data challenge?
• A Big Data Solution: Apache Hadoop
• HDFS
• MapReduce and YARN
• Hadoop Ecosystem: HBase, Sqoop, Hive, Pig and other
tools
• Another Big Data Solution: Apache Spark
• Where does the DBA fit in?
5
6. Please Complete Your
Session Evaluation
Evaluate this session in your COLLABORATE app.
Pull up this session and tap "Session Evaluation"
to complete the survey.
Session ID: 670
8. April 2-6, 2017 in Las Vegas, NV USA #C17LV
The Big Data Challenge
8
9. April 2-6, 2017 in Las Vegas, NV USA #C17LV
Volume
• Big data comes in one size: Big.
• Size is measured in Terabytes (10^12), Petabytes (10^15), Exabytes (10^18), Zettabytes (10^21)
• Storing and handling the data becomes an issue
• Producing value out of the data in a reasonable time is an issue
9
10. April 2-6, 2017 in Las Vegas, NV USA #C17LV
Variety
• Big Data extends beyond structured data,
including semi-structured and unstructured
information: logs, text, audio and videos
• Wide variety of rapidly evolving data types
requires highly flexible stores and handling
10
Unstructured vs. Structured:
• Objects vs. Tables
• Flexible vs. Columns and Rows
• Unknown structure vs. Predefined structure
• Textual and binary vs. Mostly textual
11. April 2-6, 2017 in Las Vegas, NV USA #C17LV
Velocity
• The speed at which data is generated and collected
• Streaming data and large-volume data movement
• The high velocity of data capture requires rapid ingestion
• Might cause a backlog problem
11
12. April 2-6, 2017 in Las Vegas, NV USA #C17LV
Value
Big data is not about the size of the data;
it's about the value within the data
12
13. April 2-6, 2017 in Las Vegas, NV USA #C17LV
So, We Define a Big Data Problem…
• When the data is too big or moves too fast to
handle in a sensible amount of time
• When the data doesn’t fit any conventional
database structure
• When we think that we can still produce value
from that data and want to handle it
• When the technical solution to the business need
becomes part of the problem
13
16. April 2-6, 2017 in Las Vegas, NV USA #C17LV
Big Data in Practice
• Big data is big: technological framework and
infrastructure solutions are needed
• Big data is complicated:
• We need developers to manage the data handling
• We need DevOps engineers to manage the clusters
• We need data analysts and data scientists to produce value
16
17. April 2-6, 2017 in Las Vegas, NV USA #C17LV
Possible Solutions: Scale Up
• Older solution: use a giant server with a lot of resources (scale up: more cores, faster processors, more memory) to handle the data
• Process everything on a single server with hundreds of CPU cores
• Use lots of memory (1+ TB)
• Have a huge data store on high-end storage solutions
• Data needs to be copied to the processes in real time, so this approach is no good for large amounts of data (Terabytes to Petabytes)
17
18. April 2-6, 2017 in Las Vegas, NV USA #C17LV
Distributed Solution
• A scale-out solution: let's use distributed systems; use multiple machines for a single job/application
• More machines mean more resources
• CPU
• Memory
• Storage
• But the solution is still complicated: infrastructure
and frameworks are needed
18
19. April 2-6, 2017 in Las Vegas, NV USA #C17LV
Infrastructure Challenges
• We need Infrastructure that is built for:
• Large-scale
• Linear scale out ability
• Data-intensive jobs that spread the problem across
clusters of server nodes
• Storage: efficient and cost-effective enough to
capture and store terabytes, if not petabytes, of data
• Network infrastructure that can quickly import large data sets and then replicate them to various nodes for processing
• High-end hardware is too expensive - we need a
solution that uses cheaper hardware
19
20. April 2-6, 2017 in Las Vegas, NV USA #C17LV
Distributed System/Frameworks Challenges
• How do we distribute our workload across the
system?
• Programming complexity – keeping the data in sync
• What to do with faults and redundancy?
• How do we handle security demands to protect
highly-distributed infrastructure and data?
20
22. April 2-6, 2017 in Las Vegas, NV USA #C17LV
Apache Hadoop
• Open-source project run by the Apache Software Foundation (since 2006)
• Hadoop brings the ability to cheaply process large amounts of data, regardless of its structure
• It has been the driving force behind the growth of the big data industry
• Get the public release from:
• http://hadoop.apache.org/core/
22
23. April 2-6, 2017 in Las Vegas, NV USA #C17LV
Original Hadoop Components
• HDFS (Hadoop Distributed File System) –
distributed file system that runs in a clustered
environment
• MapReduce – programming paradigm for running
processes over a clustered environment
• Hadoop main idea: let’s distribute the data to
many servers, and then bring the program to the
data
23
24. April 2-6, 2017 in Las Vegas, NV USA #C17LV
Hadoop Benefits
• Designed for scale out
• Reliable solution based on unreliable hardware
• Load data first, structure later
• Designed for storing large files
• Designed to maximize throughput of large scans
• Designed to leverage parallelism
• Solution Ecosystem
24
25. April 2-6, 2017 in Las Vegas, NV USA #C17LV
What Hadoop Is Not
• Hadoop is not a database – it is not a replacement
for DW, or other relational databases
• Hadoop is not commonly used for OLTP/real-time
systems
• Very good for large amounts of data, not so much for smaller data sets
• Designed for clusters – there is no Hadoop
monster server (single server)
25
26. April 2-6, 2017 in Las Vegas, NV USA #C17LV
Hadoop Limitations
• Hadoop is scalable but it’s not very fast
• Some assembly might be required
• Batteries are not included (DIY mindset) – some features need to be developed if they're not available
• Open source license limitations apply
• Technology is changing very rapidly
26
28. April 2-6, 2017 in Las Vegas, NV USA #C17LV
Original Hadoop 1.0 Components
• HDFS (Hadoop Distributed File System) –
distributed file system that runs in a clustered
environment
• MapReduce – programming technique for
running processes over a clustered environment
28
29. April 2-6, 2017 in Las Vegas, NV USA #C17LV
Hadoop 2.0
• Hadoop 2.0 changed the Hadoop architecture and introduced a better resource-management concept:
• Hadoop Common
• HDFS
• YARN
• Multiple data processing
frameworks including
MapReduce, Spark and
others
29
30. April 2-6, 2017 in Las Vegas, NV USA #C17LV
HDFS is...
• A distributed file system
• Designed to reliably store data using commodity
hardware
• Designed to expect hardware failures and still
stay resilient
• Intended for larger files
• Designed for batch inserts and appending data
(no updates)
30
31. April 2-6, 2017 in Las Vegas, NV USA #C17LV
Files and Blocks
• Files are split into 128MB blocks (single unit of
storage)
• Managed by NameNode and stored on DataNodes
• Transparent to users
• Replicated across machines at load time
• Same block is stored on multiple machines
• Good for fault-tolerance and access
• Default replication factor is 3
31
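A minimal sketch of inspecting blocks and replication from the command line (the file path and replication factor below are illustrative, not from the deck):
$ hdfs fsck /user/sample/hamlet.txt -files -blocks -locations
  # lists each block of the file and the DataNodes that hold its replicas
$ hdfs dfs -setrep -w 2 /user/sample/hamlet.txt
  # changes the replication factor for this one file to 2 and waits for re-replication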
32. April 2-6, 2017 in Las Vegas, NV USA #C17LV
HDFS is Good for...
• Storing large files
• Terabytes, Petabytes, etc...
• Millions rather than billions of files
• 128MB or more per file
• Streaming data
• Write once and read-many times patterns
• Optimized for streaming reads rather than random
reads
33
33. April 2-6, 2017 in Las Vegas, NV USA #C17LV
HDFS is Not So Good For...
• Low-latency reads / real-time applications
• High throughput rather than low latency for small chunks of data
• HBase addresses this issue
• Large amounts of small files
• Better for millions of large files instead of billions of small files
• Multiple writers
• Single writer per file
• Writes only at the end of a file; no support for arbitrary offsets
34
34. April 2-6, 2017 in Las Vegas, NV USA #C17LV
Using HDFS in Command Line
35
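The slide shows a terminal screenshot; as a rough sketch (paths are examples only), day-to-day HDFS work looks like this:
$ hdfs dfs -mkdir -p /user/sample                   # create a directory in HDFS
$ hdfs dfs -put hamlet.txt /user/sample/            # copy a local file into HDFS
$ hdfs dfs -ls /user/sample                         # list directory contents
$ hdfs dfs -cat /user/sample/hamlet.txt | head      # stream a file back (write-once, read-many)
$ hdfs dfs -get /user/sample/hamlet.txt ./copy.txt  # copy a file back to local disk
$ hdfs dfs -rm /user/sample/old_file.txt            # delete a file (there are no in-place updates)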
35. April 2-6, 2017 in Las Vegas, NV USA #C17LV
What Does HDFS Look Like? (GUI)
36
36. April 2-6, 2017 in Las Vegas, NV USA #C17LV
Interfacing with HDFS
37
37. April 2-6, 2017 in Las Vegas, NV USA #C17LV
MapReduce is...
• A programming model for expressing distributed
computations at a massive scale
• An execution framework for organizing and
performing such computations
• MapReduce can be written in Java, Scala, C,
Python, Ruby and others
• Concept: Bring the code to the data, not the data
to the code
38
38. April 2-6, 2017 in Las Vegas, NV USA #C17LV
The MapReduce Paradigm
• Imposes key-value input/output
• We implement two main functions:
• MAP – takes a large problem, divides it into sub-problems, and performs the same function on every sub-problem
Map(k1, v1) -> list(k2, v2)
• REDUCE – combines the output from all sub-problems (all values for a given key go to the same reducer)
Reduce(k2, list(v2)) -> list(v3)
• Framework handles everything else (almost)
39
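A loose analogy that may help: the same map, shuffle/sort, reduce flow can be mimicked with Unix pipes on a single machine (this only illustrates the paradigm, it is not how Hadoop runs it):
$ cat hamlet.txt | tr -s '[:space:]' '\n' | sort | uniq -c
  # tr      = "map": emit one word record per input token
  # sort    = shuffle/sort barrier: bring identical keys together
  # uniq -c = "reduce": aggregate the values for each key into a count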
39. April 2-6, 2017 in Las Vegas, NV USA #C17LV
MapReduce Word Count Process
40
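This slide shows the word-count flow as a diagram; to run the equivalent job, the example jar that ships with Hadoop can be used (the jar path varies by distribution, so treat it as an assumption):
$ yarn jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
    wordcount /user/sample/hamlet.txt /user/sample/wordcount-out/
  # mappers emit (word, 1) pairs, the framework groups by word, reducers sum the counts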
40. April 2-6, 2017 in Las Vegas, NV USA #C17LV
YARN Features
• Takes care of distributed processing and
coordination
• Scheduling
• Jobs are broken down into smaller chunks called tasks
• These tasks are scheduled to run on data nodes
• Task Localization with Data
• Framework strives to place tasks on the nodes that
host the segment of data to be processed by that
specific task
• Code is moved to where the data is
41
41. April 2-6, 2017 in Las Vegas, NV USA #C17LV
YARN Features (2)
• Error Handling
• Failures are an expected behavior so tasks are
automatically re-tried on other machines
• Data Synchronization
• Shuffle and Sort barrier re-arranges and moves data
between machines
• Input and output are coordinated by the framework
42
42. April 2-6, 2017 in Las Vegas, NV USA #C17LV
YARN Framework Support
• With YARN, we can go beyond the Hadoop
ecosystem
• Support different frameworks:
• MapReduce v2
• Spark
• Giraph
• Co-Processors for Apache HBase
• More…
43
43. April 2-6, 2017 in Las Vegas, NV USA #C17LV
Submitting a Job
• The yarn script, given a jar and class argument, launches a JVM and executes the provided job
44
$ yarn jar HadoopSamples.jar mr.wordcount.StartsWithCountJob \
    /user/sample/hamlet.txt \
    /user/sample/wordcount/
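Once the job finishes, the output directory passed as the last argument can be inspected; by MapReduce convention there is one part-r-NNNNN file per reducer plus a _SUCCESS marker:
$ hdfs dfs -ls /user/sample/wordcount/
$ hdfs dfs -cat /user/sample/wordcount/part-r-00000 | head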
44. April 2-6, 2017 in Las Vegas, NV USA #C17LV
Resource Manager: UI
45
45. April 2-6, 2017 in Las Vegas, NV USA #C17LV
Application View
46
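The same information shown in these Resource Manager screenshots is also available from the command line (the application ID below is a made-up placeholder):
$ yarn application -list                                    # submitted/running applications and their state
$ yarn application -status application_1490000000000_0001   # details for one application
$ yarn logs -applicationId application_1490000000000_0001   # aggregated container logs after completion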
46. April 2-6, 2017 in Las Vegas, NV USA #C17LV
Hadoop Main Problems
• Hadoop MapReduce Framework (not MapReduce
paradigm) had some major problems:
• Developing MapReduce was complicated – there was more than just business logic to develop
• Transferring data between stages requires the intermediate data to be written to disk (and then read by the next step)
• Multi-step jobs needed orchestration and abstraction solutions
• Initial resource management was very painful – the MapReduce framework was based on resource slots
47
48. April 2-6, 2017 in Las Vegas, NV USA #C17LV
Improving Hadoop: Distributions
• Core Hadoop is complicated so some tools and
solution frameworks were added to make things
easier
• There are over 80 different Apache projects for big data solutions that use Hadoop (and growing!)
• Hadoop distributions collect some of these tools and release them as a complete, integrated package
• Cloudera
• HortonWorks
• MapR
• Amazon EMR
49
49. April 2-6, 2017 in Las Vegas, NV USA #C17LV
Common Hadoop 2.0 Technology Ecosystem
50
50. April 2-6, 2017 in Las Vegas, NV USA #C17LV
Improving Programmability
• MapReduce code in Java is sometimes tedious, so different solutions came to the rescue
• Pig: Programming language that simplifies Hadoop
actions: loading, transforming and sorting data
• Hive: enables Hadoop to operate as data warehouse
using SQL-like syntax
• Spark and other frameworks
51
51. April 2-6, 2017 in Las Vegas, NV USA #C17LV
Pig
• Pig is an abstraction on top of Hadoop
• Provides a high-level programming language designed for data processing
• Scripts are converted into MapReduce code and executed on the Hadoop cluster (see the sketch below)
• Makes ETL/ELT processing and other simple MapReduce jobs easier without writing MapReduce code
• Pig was widely accepted and used by Yahoo!, Twitter, Netflix, and others
• Often replaced by more up-to-date tools like Apache Spark
52
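As a sketch of what "without writing MapReduce code" means in practice, here is a classic word count in Pig Latin (the script name and paths are examples only); Pig compiles it into MapReduce jobs behind the scenes:
$ cat wordcount.pig
lines  = LOAD '/user/sample/hamlet.txt' AS (line:chararray);
words  = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grpd   = GROUP words BY word;
counts = FOREACH grpd GENERATE group AS word, COUNT(words) AS cnt;
STORE counts INTO '/user/sample/pig-wordcount';

$ pig wordcount.pig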
52. April 2-6, 2017 in Las Vegas, NV USA #C17LV
Hive
• Data Warehousing Solution built on top of
Hadoop
• Provides SQL-like query language named HiveQL
• Minimal learning curve for people with SQL expertise
• Data analysts are target audience
• Early Hive development work started at Facebook
in 2007
• Hive is an Apache top-level project under Hadoop
• http://hive.apache.org
53
53. April 2-6, 2017 in Las Vegas, NV USA #C17LV
Hive Provides
• Ability to bring structure to various data formats
• Simple interface for ad hoc querying, analyzing
and summarizing large amounts of data
• Access to files on various data stores such as
HDFS and HBase
• Also see: Apache Impala (mainly in Cloudera)
54
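A hedged sketch of "bringing structure" with Hive: declare an external table over files that already sit in HDFS and query them with HiveQL (the table, columns, and paths are illustrative):
$ hive -e "CREATE EXTERNAL TABLE access_logs (ip STRING, ts STRING, url STRING)
           ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
           LOCATION '/user/sample/logs/';"
  # no data is moved; the table is just a schema projected onto existing files

$ hive -e "SELECT url, COUNT(*) AS hits FROM access_logs GROUP BY url ORDER BY hits DESC LIMIT 10;"
  # the query is executed as distributed jobs on the cluster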
54. April 2-6, 2017 in Las Vegas, NV USA #C17LV
Databases and DB Connectivity
• HBase: online NoSQL key/value, wide-column-oriented datastore that is native to HDFS
• Sqoop: a tool designed to import data from relational databases into Hadoop (HDFS, HBase, or Hive) and export data back to them
• Sqoop2: Sqoop centralized service (GUI, WebUI,
REST)
55
55. April 2-6, 2017 in Las Vegas, NV USA #C17LV
HBase
• HBase is the closest thing we had to a database in the early Hadoop days
• A distributed key/value, wide-column-oriented NoSQL database, built on top of HDFS
• Providing Big Table-like capabilities
• Does not have a query language: only get, put,
and scan commands
• Often compared with Cassandra
(non-Hadoop native Apache project)
56
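Since the interface is limited to get/put/scan-style operations, a short interactive session gives a feel for the model (table name, column family, and values are made-up examples):
$ hbase shell
hbase> create 'users', 'info'                        # table 'users' with one column family 'info'
hbase> put 'users', 'user1', 'info:name', 'Zohar'    # write one cell: row key, column, value
hbase> put 'users', 'user1', 'info:role', 'DBA'
hbase> get 'users', 'user1'                          # read a single row by key
hbase> scan 'users', {LIMIT => 10}                   # range scan over row keys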
56. April 2-6, 2017 in Las Vegas, NV USA #C17LV
When Do We Use HBase?
• Huge volumes of randomly accessed data
• HBase is at its best when it’s accessed in a
distributed fashion by many clients (high
consistency)
• Consider HBase when we are loading data by key,
searching data by key (or range), serving data by
key, querying data by key or when storing data by
row that doesn’t conform well to a schema.
57
57. April 2-6, 2017 in Las Vegas, NV USA #C17LV
When NOT To Use HBase
• HBase doesn’t use SQL, don’t have an optimizer,
doesn’t support transactions or joins
• HBase doesn’t have data types
• See project Apache Phoenix for better data
structure and query language when using HBase
58
58. April 2-6, 2017 in Las Vegas, NV USA #C17LV
Sqoop and Sqoop2
• Sqoop is a command line tool for moving data
from RDBMS to Hadoop. Sqoop2 is a centralized
tool for running sqoop.
• Uses MapReduce to load the data from a relational database to HDFS
• Can also export data from Hadoop back to the RDBMS
• Comes with connectors to MySQL, PostgreSQL,
Oracle, SQL Server and DB2.
59
$ bin/sqoop import --connect \
  'jdbc:sqlserver://10.80.181.127;username=dbuser;password=dbpasswd;database=tpch' \
  --table lineitem --hive-import

$ bin/sqoop export --connect \
  'jdbc:sqlserver://10.80.181.127;username=dbuser;password=dbpasswd;database=tpch' \
  --table lineitem --export-dir /data/lineitemData
59. April 2-6, 2017 in Las Vegas, NV USA #C17LV
Improving Hadoop – More Useful Tools
• For improving coordination: Zookeeper
• For improving scheduling/orchestration: Oozie
• Fast, interactive SQL queries: Apache Impala
• For Improving log collection: Flume
• Text Search and Data Discovery: Solr
• For Improving UI and Dashboards: Hue and
Ambari
60
60. April 2-6, 2017 in Las Vegas, NV USA #C17LV
Improving Hadoop – More Useful Tools (2)
• Data serialization: Avro and Parquet (columnar)
• Data governance: Atlas
• Security: Knox and Ranger
• Data Replication: Falcon
• Machine Learning: Mahout
• Performance Improvement: Tez
• And there are more…
61
62. April 2-6, 2017 in Las Vegas, NV USA #C17LV
Is Hadoop the Only Big Data Solution?
• No – There are other solutions:
• Apache Spark and Apache Mesos frameworks
• NoSQL systems (Apache Cassandra, CouchBase,
MongoDB and many others)
• Stream analysis (Apache Kafka, Apache Storm, Apache
Flink)
• Machine learning (Apache Mahout, Spark MLlib)
• Some can be integrated with Hadoop, but some
are independent
63
63. April 2-6, 2017 in Las Vegas, NV USA #C17LV
Another Big Data Solution: Apache Spark
• Apache Spark is a fast, general engine for
large-scale data processing on a cluster
• Originally developed at UC Berkeley in 2009 as a research project; it is now an open-source Apache top-level project
• Main idea: use the memory resources of the cluster for better performance
• It is now one of the fastest-growing projects today
64
64. April 2-6, 2017 in Las Vegas, NV USA #C17LV
The Spark Stack
65
65. April 2-6, 2017 in Las Vegas, NV USA #C17LV
Spark and Hadoop
• Spark and Hadoop are built to co-exist
• Spark can use other storage systems (S3, local disks, NFS) but
works best when combined with HDFS
• Uses Hadoop InputFormats and OutputFormats
• Fully compatible with Avro and SequenceFiles, as well as other types of files
• Spark can use YARN for running jobs
• Spark interacts with the Hadoop ecosystem:
• Flume
• Sqoop (watch out for DDoS on the database…)
• HBase
• Hive
• Spark can also interact with tools outside the Hadoop
ecosystem: Kafka, NoSQL, Cassandra, XAP, Relational databases,
and more
66
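A minimal sketch of the "Spark can use YARN" point: the same cluster that runs MapReduce jobs can run a Spark application via spark-submit (the script name wordcount_spark.py and the resource sizes are placeholders):
$ spark-submit --master yarn --deploy-mode cluster \
    --num-executors 4 --executor-memory 2g \
    wordcount_spark.py hdfs:///user/sample/hamlet.txt hdfs:///user/sample/spark-wc/
  # the driver and executors run as YARN containers alongside other Hadoop workloads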
66. April 2-6, 2017 in Las Vegas, NV USA #C17LV
Okay, So Where Does the DBA Fit In?
• Big Data solutions are not databases. Databases are probably not going to disappear, but we feel the change even today: DBAs must be ready for the change
• DBAs are the perfect candidates to transition into Big Data experts:
• They have system (OS, disk, memory, hardware) experience
• They can understand data easily
• DBAs are used to working with developers and other data users
67
67. April 2-6, 2017 in Las Vegas, NV USA #C17LV
What Do DBAs Need Now?
• DBAs will need to know more programming: Java, Scala, Python, R, or any other popular language in the Big Data world will do
• DBAs need to understand the position shifts and the introduction of DevOps, data scientists, CDOs, etc.
• Big Data is changing daily: we need to learn, read,
and be involved before we are left behind…
68
68. April 2-6, 2017 in Las Vegas, NV USA #C17LV
Summary
• Big Data is here – it's complicated, and the RDBMS does not fit it anymore
• Big Data solutions are evolving; Hadoop is one example of such a solution
• Spark is a very popular Big Data solution
• DBAs need to be ready for the change: Big Data solutions are not databases, and we need to make ourselves ready
69
70. April 2-6, 2017 in Las Vegas, NV USA #C17LV
Thank You
and
Don’t Forget To Evaluate (670)
Zohar Elkayam
twitter: @realmgic
Zohar@Brillix.co.il
www.realdbamagic.com
71