Security is at the core of every bank's activities. ING set itself an ambitious goal: gain insight into the bank's overall network data activity. The purpose is to quickly recognize and neutralize unwelcome guests such as malware and viruses, to prevent data leakage, and to track down misconfigured software components.
Since the inception of the CoreIntel project we knew we would face the challenges of capturing, storing, and processing vast amounts of data of various types from all over the world. In our session we would like to share our experience in building a scalable, distributed system architecture based on Kafka, Spark Streaming, Hadoop, and Elasticsearch to help us achieve these goals.
Why does choosing a good data format matter? How do you manage Kafka offsets? Why is dealing with Elasticsearch a love-hate relationship for us? And how did we manage to put it all together, with wire encryption everywhere and a Kerberized Hadoop cluster?
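The offset-management question deserves a concrete illustration. The talk itself covers Spark Streaming, but the underlying discipline is the same in any consumer: turn off auto-commit and advance offsets only after records are safely persisted. Below is a minimal sketch with kafka-python; the broker address, topic, and group id are assumptions for illustration.

```python
# Minimal sketch of manual Kafka offset management (assumed broker/topic names).
# Idea: disable auto-commit and commit only after records are durably processed,
# so a crash replays unacknowledged records instead of silently dropping them.
from kafka import KafkaConsumer  # pip install kafka-python


def process(payload: bytes) -> None:
    """Stand-in for persisting the record to HDFS/Elasticsearch."""
    print(payload[:80])


consumer = KafkaConsumer(
    "network-logs",                   # hypothetical topic
    bootstrap_servers="broker:9092",  # hypothetical broker
    group_id="coreintel-ingest",
    enable_auto_commit=False,         # we decide when offsets advance
    auto_offset_reset="earliest",
)

for record in consumer:
    process(record.value)
    consumer.commit()  # in practice, commit in batches rather than per record
```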
Hadoop and Spark are big data frameworks whose uses span a variety of scenarios: ingestion, data preparation, data management, processing, analysis, and visualization. Each step requires specialized toolsets to be productive. In this talk I will share solution examples in the Big Data ecosystem, such as Cask, StreamSets, Datameer, AtScale, and Dataiku on Microsoft's Azure HDInsight, that simplify your Big Data solutions. Azure HDInsight is a cloud Spark and Hadoop service for the enterprise, and these tools take advantage of all the benefits of HDInsight, giving you the best of both worlds. Join this session for practical information that will enable faster time to insights for you and your business.
Enterprise large-scale graph analytics and computing based on distributed graph... (DataWorks Summit)
Graph approaches to structuring and analyzing data have been a significant area of interest. Graphs are well suited to expressing complex interconnections and clusters of highly related entities.
Large-scale graph analytics research has grown fast in recent years, and leveraging the Hadoop 2 ecosystem for graphs is a good approach: enterprise graph computing requires storing large graphs and computing against them quickly. On one side sit the OLTP database systems, which allow the user to query the graph in real time. HBase, as a distributed NoSQL database, can be the backend storage that persists a large graph, with the property graph storing its vertices and edges as key-value pairs in HBase; it also provides high reliability, scalability, and fault tolerance for the data, while Solr, as the distributed index, makes queries more efficient. Titan itself handles caching and transactions. On the other side sit the OLAP analytics systems, which use TinkerPop's Hadoop-Gremlin SparkGraphComputer to process a large graph so that every vertex and edge is analyzed; a cluster-computing platform helps with processing large, distributed, in-memory graph datasets.
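As a hedged, minimal illustration of the OLTP side described above, the sketch below issues a real-time traversal through TinkerPop's Python client against a Gremlin Server. A Titan-era deployment would typically drive this from the Java API over HBase/Solr, so the server URL and sample data here are assumptions.

```python
# Hedged sketch: a real-time OLTP-style graph query via TinkerPop's Python
# client (pip install gremlinpython). Server URL and graph contents are
# assumptions; Titan deployments would usually use the Java API instead.
from gremlin_python.process.anonymous_traversal import traversal
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection

conn = DriverRemoteConnection("ws://gremlin-server:8182/gremlin", "g")
g = traversal().withRemote(conn)

# Who are the friends-of-friends of "alice"? Expressed as a Gremlin traversal.
contacts = (
    g.V().has("person", "name", "alice")
     .out("knows").out("knows")
     .dedup().values("name")
     .toList()
)
print(contacts)
conn.close()
```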
A graph database based on HBase/Solr, together with graph computing and analysis based on Spark, is powerful for discovering valuable information about relationships in complex and large data, representing a significant business opportunity in the enterprise. It will help graph data analytics in a wide range of domains such as social networking, recommendation engines, advertisement optimization, knowledge representation, health care, education, and security.
Security, ETL, BI & Analytics, and Software Integration (DataWorks Summit)
Liberty Mutual Enterprise Data Lake Use Case Study
By building a data lake, Liberty Mutual Insurance Group's Enterprise Analytics department has created a platform to implement various big data analytics projects. We will share our journey and how we leveraged the Hortonworks Hadoop distribution and other open source technologies to meet our project needs. This session will cover data lake architecture, security, and use cases.
Apache Apex brings you the power to quickly build and run big data batch and stream processing applications. But what about visualizing your data in real time as it flows through your Apache Apex applications? Together, we will review Apache Apex and how it integrates with Apache Hadoop and Apache Kafka to process your big data with streaming computation. Then we will explore the options available to visualize Apex application metrics and data, including open source options like the REST and PubSub mechanisms in STRAM, as well as features available in the RTS Console like real-time dashboards and widgets. We will also look into ways of packaging dashboards inside your Apache Apex applications.
Prior to 2014, Walgreens had traditional enterprise data warehouse systems that had reached their capacity limits. Over the last three years we have evolved, learned lessons, and experienced successes and failures. Our initial adoption of Hadoop came from the need to run complex analytics that simply did not scale on an MPP RDBMS. Our business data demands were rapidly increasing, and the concomitant 8-to-12-week extract, transform, and load turnaround cycles were not an acceptable delivery timeframe in the retail space. A self-service model, where data lands on a distributed platform, schema is applied where necessary, and processing happens at scale, was a necessary paradigm for enabling business value. Our journey started with a single use case and has now evolved into an enterprise data hub. We will discuss the following points:
• Evolution of our infrastructure profile, streamlining the hardware provisioning cycle, and our hybrid deployment model (on premise and cloud).
• Operations: how SmartSense has helped us proactively tune our cluster, and which operational tests we use for benchmarking it.
• Monitoring: how we monitor, and the tools required for enterprise-grade monitoring.
• Security and governance: how we progressed from non-compliance to enterprise grade using Ranger, Knox, Kerberos, HP Voltage, encryption at rest, and many other services.
• Third-party integration with HDP: what we learned and how we overcame the challenges.
• Lastly, our disaster recovery strategy: what is driving the need for DR, and the key capabilities required.
In 2015/16 Worldpay deployed its Enterprise Data Platform – a highly secure cluster used for analysis of over 65 billion card transactions and the subject of last year's Hadoop Summit keynote in Dublin. A year on, we are now rapidly expanding our platform with true multi-tenancy. For our first tenant we have built and deployed the analytics and reporting for our central platforms. Our second tenant deploys 'decision engines' into our core business systems. These allow Worldpay to make decisions, derived from machine learning, on how we authorise and route payments traffic and how these decisions affect the consumer, merchant, and other business partners. We are also developing further tenants for systems management and security. This talk will look at what it means to truly have a single enterprise data lake with multiple tenants sharing that data, and will look forward to how we will extend the platform in 2017 with Hadoop 3.
The world’s largest enterprises run their infrastructure on Oracle, DB2, and SQL, and their critical business operations on SAP applications. Organisations need this data to be available in real time to conduct the necessary analytics. However, delivering this heterogeneous data at the speed required can be a huge challenge because of the complex underlying data models and structures and legacy manual processes that are prone to errors and delays.
Unlock these silos of data and enable new advanced analytics platforms by attending this session.
Find out how to:
• Overcome common challenges faced by enterprises trying to access their SAP data
• Integrate SAP data in real time with change data capture (CDC) technology
• Stream SAP data into Kafka with Attunity Replicate for SAP, as other organisations are already doing
The newly enacted GDPR, which becomes effective in 2018, requires comprehensive protection of the personal information of EU subjects. In this paper, we outline a solution that discovers and classifies personal data subject to GDPR in the Hadoop ecosystem and uses such precise classification to automatically create a robust set of authorization policies. The solution uses Dataguise's DgSecure sensitive data detection to automatically classify sensitive data assets in Apache Atlas and to author comprehensive and robust authorization policies via Apache Ranger. DgSecure detects sensitive data in Hive databases and continuously updates the classification in Apache Atlas via tags. Apache Atlas tags are used to create Apache Ranger policies that protect access to sensitive HDFS files, Hive tables, and Hive columns. We demonstrate a workflow where the components of the solution are automated, requiring little or no manual intervention, to protect such sensitive data in Hadoop clusters.
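As a hedged sketch of one step in such a workflow, the snippet below attaches a classification to an entity through Atlas's v2 REST API, which a Ranger tag-based policy can then enforce. The URL, credentials, entity GUID, and tag name are illustrative assumptions, and DgSecure's own integration may well differ.

```python
# Hedged sketch: tag a Hive column entity in Apache Atlas so that a Ranger
# tag-based policy (e.g. one restricting "PII_GDPR") applies to it. The
# endpoint path is Atlas's v2 API; GUID, tag, and credentials are assumptions.
import requests

ATLAS = "https://atlas.example.com:21000"
guid = "entity-guid-goes-here"  # GUID of the hive_column entity (assumed)

resp = requests.post(
    f"{ATLAS}/api/atlas/v2/entity/guid/{guid}/classifications",
    json=[{"typeName": "PII_GDPR"}],  # classification type must already exist
    auth=("admin", "admin"),          # placeholder credentials
)
resp.raise_for_status()
print("classification applied:", resp.status_code)
```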
Insights into Real World Data Management Challenges (DataWorks Summit)
Data is your most valuable business asset, and it is also your biggest challenge. This challenge and opportunity means we continually face significant roadblocks on the way to becoming a data-driven organisation. From the management of data and the bubbling landscape of open source frameworks to limited industry skills and mounting time and cost pressures, our challenge in data is big.
We all want and need a “fit for purpose” approach to the management of data, especially Big Data, and overcoming the ongoing challenges around the ‘3Vs’ means we get to focus on the most important V: ‘Value’. Come along and join the discussion on how Oracle Big Data Cloud provides value in the management of data and supports your move toward becoming a data-driven organisation.
Speaker
Noble Raveendran, Principal Consultant, Oracle
As Hadoop applications move into cloud deployments, object stores increasingly become the source and destination of data. But object stores are not filesystems: sometimes they are slower, and security works differently.
What are the secret settings for getting maximum performance from queries against data living in cloud object stores? They exist at the filesystem client, file format, and query engine layers. It even matters how you lay out the files: the directory structure and the names you give them.
We know these things from our work in all these layers, from the benchmarking we've done, and from the support calls we get when people have problems. And now we'll show you.
This talk will start from the ground-up question "why isn't an object store a filesystem?", showing how that breaks fundamental assumptions in code and so causes performance issues you don't get when working with HDFS. We'll look at ways to get Apache Hive and Spark to work better, covering the optimizations that have been done to enable this and the work that is ongoing. Finally, we'll consider what your own code needs to do in order to adapt to cloud execution.
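To make the "secret settings" idea concrete, here is a hedged sketch of the kind of s3a client tuning involved, set through a PySpark session. The property names are genuine Hadoop s3a options, but the values are illustrative starting points rather than recommendations, and the right choices depend on your workload and Hadoop version.

```python
# Hedged sketch of s3a client tuning for Spark queries over S3. Values are
# illustrative; fs.s3a.fast.upload is redundant on newer Hadoop versions
# where incremental upload is already the default behaviour.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("s3a-tuning-sketch")
    # random IO suits columnar formats (ORC/Parquet); "normal" suits full scans
    .config("spark.hadoop.fs.s3a.experimental.input.fadvise", "random")
    # allow more parallel connections for metadata-heavy query planning
    .config("spark.hadoop.fs.s3a.connection.maximum", "100")
    # stream multipart uploads instead of buffering whole output files
    .config("spark.hadoop.fs.s3a.fast.upload", "true")
    .getOrCreate()
)

df = spark.read.parquet("s3a://example-bucket/warehouse/events/")  # hypothetical path
df.limit(5).show()
```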
Data Driving Yahoo Mail Growth and Evolution with a 50 PB Hadoop Warehouse (DataWorks Summit)
Yahoo Mail has 200+ million users a month and generates hundreds of terabytes of data per day, which continues to grow steadily. The nature of email messages has also evolved: for example, today the majority of them are generated by machines, consisting of newsletters, social media notifications, purchase invoices, travel bookings, and the like, which drove innovations in product development to help users organize their inboxes.
Since 2014, the Yahoo Mail Data Engineering team has been revamping the Mail data warehouse and analytics infrastructure in order to drive the continued growth and evolution of Yahoo Mail. Along the way we have built a 50 PB Hadoop warehouse and surrounding analytics and machine learning programs that have transformed the role data plays in Yahoo Mail.
In this session we will share our experience from this three-year journey: the system architecture, the analytics systems built, and the lessons learned from development and from driving adoption.
The challenge of computing big data for evolving digital business processes demands a variety of computation techniques and engines (SQL, OLAP, time series, graph, document store) working in a unified framework. A simple architecture for data transformations that ensures security, governance, and operational administration is a critical component of enterprise production environments supporting day-to-day business processes. In this session, you will learn about best practices and the critical components needed to ensure business value, drawn from the latest production deployments. Hear how existing customers are using SAP Vora and the value they have achieved so far with this in-memory engine for distributed data processing. The session gives you a clear understanding of how SAP Vora and open source components like Apache Hadoop and Apache Spark offer an architecture that supports a wide variety of use cases and industries. You will also receive useful pointers to development resources, test-drive demos, and general documentation.
Journey to the Data Lake: How Progressive Paved a Faster, Smoother Path to In... (DataWorks Summit)
Progressive Insurance is well known for its innovative use of data to better serve its customers, and for the important role that Hortonworks Data Platform has played in that transformation. However, as with most things worth doing, the path to the data lake was not without its challenges. In this session, I’ll share our top use cases for Hadoop – including telematics and display ads – how a skills shortage turned supporting these applications into a nightmare, and how – and why – we now use Syncsort DMX-h to accelerate enterprise adoption by making it quick and easy (or at least faster and easier) to populate the data lake, and keep it up to date, with data from across the enterprise. I’ll discuss the different approaches we tried, the benefits of using a tool vs. open source, and how we created our Hadoop Ingestor app using Syncsort DMX-h.
Hortonworks Data Platform and IBM Systems – A Complete Solution for Cognitive Business
SynerScope has been helping European organizations across industries unlock competitive business value from data for almost a decade. Now, by leveraging state-of-the-art access control and audit mechanisms from Hortonworks combined with the latest generation high-performance computing and storage solutions from IBM, SynerScope can connect and correlate enterprise data at a scale not previously possible. SynerScope will demonstrate end-to-end analytics workflows including deep-learning based automation using new integrated solutions from Hortonworks and IBM.
Effective data governance is imperative to the success of data lake initiatives. Without governance policies and processes, information discovery and analysis is severely impaired. In this session we will provide an in-depth look at the Data Governance Initiative launched collaboratively between Hortonworks and partners from across industries. We will cover the objectives of the initiative and demonstrate key governance capabilities of the Hortonworks Data Platform.
Data science holds tremendous potential for organizations to uncover new insights and drivers of revenue and profitability. Big Data has brought the promise of doing data science at scale to enterprises; however, this promise also comes with challenges for data scientists to continuously learn and collaborate. Data scientists have many tools at their disposal: notebooks like Jupyter and Apache Zeppelin, IDEs such as RStudio, languages like R, Python, and Scala, and frameworks like Apache Spark. Given all these choices, how do you best collaborate to build your model and then work through the development lifecycle to deploy it from test into production?
In this session, learn the attributes of a modern data science platform that empowers data scientists to build models using all the data in their data lake and fosters continuous learning and collaboration. We will show a demo of DSX with HDP, focusing on integration, security, and model deployment and management.
Speakers:
Sriram Srinivasan, Senior Technical Staff Member, Analytics Platform Architect, IBM
Vikram Murali, Program Director, Data Science and Machine Learning, IBM
Big SQL: Powerful SQL Optimization – Re-Imagined for Open Source (DataWorks Summit)
Let's be honest: there are some pretty amazing capabilities locked in proprietary SQL engines which have had decades of R&D baked into them. In this session, learn how IBM, working with the Apache community, has unlocked the value of its SQL optimizer for Hive, HBase, ObjectStore, and Spark, helping customers avoid lock-in while providing the best performance, concurrency, and scalability for complex, analytical SQL workloads. You'll also learn how the SQL engine was extended and integrated with Ambari, Ranger, YARN/Slider, and HBase. We share the results of this project, which has enabled running all 99 TPC-DS queries at a world-record-breaking 100 TB scale factor.
MaaS (Model as a Service): Modern Streaming Data Science with Apache Metron (... (DataWorks Summit)
Apache Metron (Incubating) is a streaming cybersecurity application built on Apache Storm and Hadoop. One of its core missions is to bring advanced analytics through machine learning and data science to its users. Because of the relative immaturity of data science platform infrastructure that is integrated into Hadoop and oriented to streaming analytics applications, we have been forced to create the requisite platform components out of necessity, utilizing many pieces of the Hadoop ecosystem.
In this talk, we will speak about the Metron analytics architecture and how it utilizes a custom data science model deployment and autodiscovery service that is tightly integrated with Hadoop via YARN and ZooKeeper. We will discuss how we interact with the models deployed there via a custom domain-specific language that can query models as data streams past. We will also discuss more generally the full-stack data science tooling that has been created to enable data science at scale in an advanced streaming analytics application.
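The pattern is easier to see in miniature. Below is a hedged sketch, not Metron's actual code: a model wrapped as a small REST service that registers an ephemeral endpoint in ZooKeeper, so a streaming topology can discover live model instances. Hostnames, ports, paths, and the toy scoring rule are all assumptions.

```python
# Hedged sketch of "model as a service": a REST-wrapped model that advertises
# itself in ZooKeeper for discovery. Not Metron's implementation; the paths,
# ports, and scoring rule are assumptions (pip install flask kazoo).
from flask import Flask, jsonify, request
from kazoo.client import KazooClient

app = Flask(__name__)


@app.route("/apply", methods=["POST"])
def apply_model():
    features = request.get_json()
    # stand-in for a trained model: flag hosts with extreme request rates
    score = 1.0 if features.get("requests_per_min", 0) > 1000 else 0.0
    return jsonify({"is_malicious": score})


if __name__ == "__main__":
    zk = KazooClient(hosts="zookeeper:2181")
    zk.start()
    # ephemeral node: the registration vanishes if the service dies
    zk.create("/models/dga-detector/host1:9999",
              b"http://host1:9999/apply", ephemeral=True, makepath=True)
    app.run(host="0.0.0.0", port=9999)
```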
How Apache Spark and Apache Hadoop are being used to keep banking regulators ... (DataWorks Summit)
The global financial crisis showed that banks' traditional IT systems were ill-equipped to monitor and manage a daily-changing risk landscape. The sheer amount of data that needed to be crunched meant that many banks were days behind in calculating, understanding, and reporting their risk positions. Post-crisis, a regulatory review led to new legislation, BCBS 239: Principles for effective risk data aggregation and risk reporting, which requires banks to meet more stringent timeliness requirements in their ability to aggregate and report on their quickly changing risk positions, or risk fines running to millions of dollars. To meet these new requirements, banks have been forced to rethink their traditional IT architectures, which are unable to cope with the sheer volume of risk data, and are instead turning to Apache Hadoop and Apache Spark to build out the next generation of risk systems. In this talk you will discover how some of the leading banks in the world are leveraging Apache Hadoop and Apache Spark to meet the BCBS 239 regulation.
Speaker
Kunal Taneja
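The aggregation requirement at the heart of BCBS 239 maps naturally onto Spark. As a hedged, minimal sketch over invented data, the snippet below nets exposures per counterparty across desks; a real risk system would add sensitivities, hierarchies, and timeliness controls on top.

```python
# Hedged sketch of risk aggregation in Spark: net exposure per counterparty.
# The schema and figures are invented for illustration.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("risk-agg-sketch").getOrCreate()

positions = spark.createDataFrame(
    [("cpty-1", "rates", 12_500_000.0),
     ("cpty-1", "fx",     3_200_000.0),
     ("cpty-2", "rates", -7_800_000.0)],
    ["counterparty", "desk", "exposure_usd"],
)

report = (
    positions.groupBy("counterparty")
    .agg(F.sum("exposure_usd").alias("net_exposure_usd"),
         F.countDistinct("desk").alias("desks"))
)
report.show()
```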
Open Metadata and Apache Atlas
- Presented at the DataWorks Summit in Sydney, Australia on 20 September 2017 by Ferd Scheepers (ING) and Nigel Jones (IBM)
Learn how Apache Atlas is being enhanced to provide a universal open metadata and governance platform for all data processing across the enterprise. With open metadata, multiple metadata repositories, potentially from different vendors, can operate collaboratively to create an enterprise catalog of data that can be located, understood, used, and governed. In this talk we will provide a detailed description of the extensions to the type system, the new APIs, the connector framework, the metadata discovery framework, the governance action framework, and the interoperability that we are adding to Apache Atlas. We will show examples of these features in operation: (1) how metadata is discovered and gathered into Apache Atlas, (2) how applications and tools access metadata, (3) how enforcement engines such as Apache Ranger keep synchronized with the latest governance requirements, and (4) how to build an adapter that allows other vendors' metadata repositories to exchange metadata with Apache Atlas repositories. We will also explain how these features can be deployed together to support the Hadoop platform, and the enterprise beyond. This session will be presented by Nigel Jones (IBM) and Ferd Scheepers (ING Chief Information Architect).
Speaker:
Nigel Jones, Software Architect, IBM Analytics Group, IBM
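One synchronization mechanism mentioned above can be sketched directly: Atlas publishes entity change notifications to a Kafka topic (ATLAS_ENTITIES by default), and a consumer such as Ranger's tag-sync keeps itself current by reading it. The sketch below is a hedged illustration; the broker address is an assumption, and the exact message field names vary across Atlas versions.

```python
# Hedged sketch: follow Atlas entity-change notifications from Kafka, the
# channel enforcement engines use to stay synchronized. Broker address is
# an assumption; the message layout differs between Atlas versions.
import json
from kafka import KafkaConsumer  # pip install kafka-python

consumer = KafkaConsumer(
    "ATLAS_ENTITIES",                    # Atlas's default notification topic
    bootstrap_servers="broker:9092",     # hypothetical broker
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)

for msg in consumer:
    body = msg.value.get("message", {})
    # each notification describes an entity create/update/delete or tag change
    print(body.get("operationType"), "->",
          body.get("entity", {}).get("typeName"))
```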
The rise of big data governance: insight on this emerging trend from active o... (DataWorks Summit)
Each of today’s most forward-thinking enterprises has been forced to face similar data challenges: the reliance on real-time data to better serve their customers and, subsequently, the requirement of complying with regulations to protect that data – one example being the General Data Protection Regulation (GDPR).
The solution to this emerging challenge is a tricky one – for companies like ING, this data governance challenge has been met with metadata, a consistent view across a large heterogeneous ecosystem and collaboration with an active open source community.
In this joint presentation, John Mertic, Director of ODPi, and Ferd Scheepers, Global Chief Information Architect of ING, will address the benefits of a vendor-neutral approach to data governance and the need for an open metadata standard, along with insight into how companies like ING, IBM, Hortonworks, and more are delivering solutions to this challenge as an open source initiative.
Audience Takeaways include:
Understand the role of metadata;
Understand the need for a cross technology view on metadata;
Understand the role of Apache Atlas as a reference implementation; and
Understand the role of ODPi in offering value-added services including certification.
Speaker
John Mertic, Director of Program Management for ODPi, R Consortium, and Open Mainframe Project, The Linux Foundation
Organize and manage master and meta data centrally, built upon Kong, Cassandra, Neo4j, and Elasticsearch. Managing master and meta data is a very common problem with no good open source alternative as far as I know, hence this project – MasterMetaData.
Teradata – Presentation at Hortonworks Booth – Strata 2014 (Hortonworks)
Hortonworks and Teradata have partnered to provide a clear path to Big Analytics via stable and reliable Hadoop for the enterprise. The Teradata® Portfolio for Hadoop is a flexible offering of products and services for customers to integrate Hadoop into their data architecture while taking advantage of the world-class service and support Teradata provides.
The Rise of Big Data Governance: Insight on this Emerging Trend from Active O... (DataWorks Summit)
Speakers
John Mertic, Director of Program Management for ODPi, R Consortium, and Open Mainframe Project, The Linux Foundation
Maryna Strelchuk, Information Architect, ING
Governance Software Systems: Managing and Governing Your Data Assets (Mounika662749)
Software governance provides organizations with a structure for aligning their development strategy with the overall business strategy, using a formal framework that enables them to track and measure performance against specific strategic goals.
Balancing data democratization with comprehensive information governance: bui... (DataWorks Summit)
If information is the new oil, then governance is its “safety data sheet.” As demand for data as the raw material for competitive differentiation continues to rise, enterprises face growing challenges in identifying and valuing data and ensuring its appropriate use to extract the right information. For organizations to make effective business decisions, they need to trust their data so that they can impute the right value and use it for the right purposes while satisfying any organizational or regulatory mandates. A number of analytics and data science initiatives fail to reach their potential due to the lack of an information governance framework. Robust information governance capabilities can help organizations develop trust in their data and empower them to make decisions confidently.
In this session Sanjeev Mohan, Research Analyst at Gartner, and Srikanth Venkat, Sr. Director of Product Management at Hortonworks, will walk you through an end-to-end architectural blueprint for information governance and best practices for helping organizations understand, secure, and govern diverse types of data in enterprise data lakes.
Speakers
Sanjeev Mohan, Research Analyst, Gartner
Srikanth Venkat, Senior Director, Product Management, Hortonworks
Archonnex is a new software architecture developed by ICPSR for digital asset management systems. It is built on a modern technology stack to meet the current and emerging needs of social science research.
What's New with Amazon Redshift ft. McDonald's (ANT350-R1) – AWS re:Invent 2018 (Amazon Web Services)
Learn about the latest and hottest features of Amazon Redshift. We’ll deep-dive into the architecture and inner workings of Amazon Redshift and discuss how the recent availability, performance, and manageability improvements we’ve made can significantly enhance your user experience. We’ll also share a glimpse of what we are working on and our plans for the future. McDonald's will join us to share how they leverage a data lake powered by Redshift, Redshift Spectrum, and Athena to get quick insights.
Many organizations currently process various types of data in different formats, and most often this data is free-form. As the consumers of this data grow, it is imperative that this free-flowing data adhere to a schema. A schema lets data consumers form an expectation about the type of data they are getting, and it shields them from immediate impact if the upstream source changes its format. Having a uniform schema representation also gives the data pipeline an easy way to integrate and support various systems that use different data formats.
Schema Registry is a central repository for storing and evolving schemas. It provides an API and tooling to help developers and users register a schema and consume that schema without being impacted when the schema changes. Users can tag different schemas and versions, register for notifications of schema changes by version, and so on.
In this talk, we will go through the need for a schema registry and schema evolution, and showcase the integration with Apache NiFi, Apache Kafka, and Apache Storm.
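The evolution guarantee a registry enforces is easiest to see with Avro itself. Here is a hedged, self-contained sketch: data written with schema v1 remains readable under schema v2 because the added field carries a default; a registry automates exactly this compatibility check at registration time. The record shape is invented.

```python
# Hedged sketch of Avro schema evolution (pip install fastavro): v1 data
# read under v2, with the new field filled in from its default.
import io
from fastavro import writer, reader

v1 = {"type": "record", "name": "LogEvent",
      "fields": [{"name": "host", "type": "string"}]}

v2 = {"type": "record", "name": "LogEvent",
      "fields": [{"name": "host", "type": "string"},
                 {"name": "severity", "type": "string", "default": "INFO"}]}

buf = io.BytesIO()
writer(buf, v1, [{"host": "web-01"}])       # producer still writes v1
buf.seek(0)

for rec in reader(buf, reader_schema=v2):   # consumer already expects v2
    print(rec)                              # {'host': 'web-01', 'severity': 'INFO'}
```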
There is an increasing need for large-scale recommendation systems. Typical solutions rely on periodically retrained batch algorithms, but for massive amounts of data, training a new model can take hours. This is a problem when the model needs to be more up to date: for example, when recommending TV programs while they are being transmitted, the model should take into consideration users who are watching a program at that time.
The promise of online recommendation systems is fast adaptation to changes, but online machine learning from streams is commonly believed to be more restricted, and hence less accurate, than batch-trained models. Combining batch and online learning could lead to a quickly adapting recommendation system with increased accuracy. However, designing a scalable data system that unites batch and online recommendation algorithms is a challenging task. In this talk we present our experiences in creating such a recommendation engine with Apache Flink and Apache Spark.
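To make the batch-plus-online idea concrete, here is a hedged sketch of the online half of such a hybrid recommender: latent factors come from a batch job, and each streamed rating then nudges them with one SGD step. The dimensions, hyperparameters, and events are all invented for illustration.

```python
# Hedged sketch: online SGD updates applied to batch-trained
# matrix-factorization factors. All sizes and events are invented.
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, k = 100, 50, 8
lr, reg = 0.05, 0.02
U = rng.normal(scale=0.1, size=(n_users, k))  # stand-in for batch-trained factors
V = rng.normal(scale=0.1, size=(n_items, k))


def online_update(u: int, i: int, rating: float) -> float:
    """One SGD step for a freshly observed (user, item, rating) event."""
    err = rating - U[u] @ V[i]
    u_old = U[u].copy()                        # gradients use pre-update factors
    U[u] += lr * (err * V[i] - reg * U[u])
    V[i] += lr * (err * u_old - reg * V[i])
    return err


for user, item, r in [(3, 7, 5.0), (3, 9, 1.0), (4, 7, 4.0)]:  # toy event stream
    print(f"error before update: {online_update(user, item, r):+.3f}")
```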
Deep learning is not just hype: it outperforms state-of-the-art ML algorithms, one by one. In this talk we will show how deep learning can be used for detecting anomalies on IoT sensor data streams at high speed, using DeepLearning4J on top of different Big Data engines like Apache Spark and Apache Flink. Key to this talk is the absence of any large training corpus, since we are using unsupervised machine learning, a domain that current deep learning research treats step-motherly. As we can see in this demo, LSTM networks can learn very complex system behavior; in this case, data coming from a physical model simulating bearing vibration data. One drawback of deep learning is that normally a very large labeled training data set is required. It is therefore particularly interesting that we can show how unsupervised machine learning can be used in conjunction with deep learning: no labeled data set is necessary. We are able to detect anomalies and predict breaking bearings with 10-fold confidence. All examples and all code will be made publicly available and open source; only open source components are used.
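The talk itself uses DeepLearning4J; purely as a hedged sketch of the same idea in Keras, the snippet below trains an LSTM to predict the next sample of a healthy vibration signal and derives an anomaly threshold from the residuals. The synthetic signal, window size, and threshold rule are assumptions.

```python
# Hedged Keras sketch (the talk uses DeepLearning4J): learn normal vibration
# behaviour, then flag windows whose prediction error is extreme.
import numpy as np
from tensorflow import keras

t = np.linspace(0, 100, 5000)
signal = np.sin(2 * np.pi * 0.2 * t) + 0.05 * np.random.randn(t.size)  # "healthy" data

win = 32
X = np.stack([signal[i:i + win] for i in range(len(signal) - win)])[..., None]
y = signal[win:][:, None]                       # next-sample prediction target

model = keras.Sequential([
    keras.layers.LSTM(32, input_shape=(win, 1)),
    keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
model.fit(X, y, epochs=2, batch_size=64, verbose=0)

errors = np.abs(model.predict(X, verbose=0).ravel() - y.ravel())
threshold = errors.mean() + 4 * errors.std()    # anomaly = far beyond healthy error
print(f"alert threshold: {threshold:.4f}")
```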
QE automation for large systems is a great step forward in increasing system reliability. In the big data world, multiple components have to come together to provide end users with business outcomes. This means that QE automation scenarios need to be detailed around actual use cases, cutting across components. The system tests potentially generate large amounts of data on a recurring basis, and verifying it is a tedious job. Given the multiple levels of indirection, the rate of false positives relative to actual defects is higher, and chasing them is generally wasteful.
At Hortonworks, we’ve designed and implemented Mool, an automated log analysis system built with statistical data science and ML. The current work in progress has a batch data pipeline followed by an ensemble ML pipeline that feeds into a recommendation engine. The system identifies the root cause of test failures by correlating the failing test cases with current and historical error records, across multiple components. It works in unsupervised mode, with no perfect model, stable build, or source-code version to refer to. In addition, the system provides limited recommendations to file new tickets or reopen past ones, and compares run profiles with past runs.
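One building block of such a system can be sketched simply: match a fresh failure log against historical error records by TF-IDF cosine similarity, so the most likely known root cause surfaces first. The corpus below is invented, and Mool's actual ensemble is considerably richer.

```python
# Hedged sketch: rank historical tickets by textual similarity to a new
# failure log (pip install scikit-learn). The corpus is invented.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

history = [
    "TICKET-101 NameNode connection refused during test setup",
    "TICKET-102 Hive query failed: OutOfMemoryError in reducer",
    "TICKET-103 Kerberos ticket expired, authentication failure",
]
new_failure = "teardown failed: GSS initiate failed, Kerberos authentication error"

vec = TfidfVectorizer(stop_words="english")
matrix = vec.fit_transform(history + [new_failure])
scores = cosine_similarity(matrix[-1], matrix[:-1]).ravel()

for ticket, score in sorted(zip(history, scores), key=lambda p: -p[1]):
    print(f"{score:.2f}  {ticket}")
```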
Improving business performance is never easy! The Natixis Pack is like rugby: working together is key to scrum success. Our data journey would undoubtedly have been much more difficult if we had not made the move together.
This session is the story of how ‘The Natixis Pack’ has driven change in its current IT architecture so that legacy systems can leverage some of the many components in Hortonworks Data Platform in order to improve the performance of business applications. During this session, you will hear:
• How and why the business and IT requirements originated
• How we leverage the platform to fulfill security and production requirements
• How we organize a community to:
o Guard all the players, no one gets left on the ground!
o Use the platform appropriately (not every problem is eligible for Big Data, and standard databases are not dead)
• What are the most usable, the most interesting and the most promising technologies in the Apache Hadoop community
We will finish the story of a successful rugby team with insight into the special skills needed from each player to win the match!
DETAILS
This session is part business, part technical. We will talk about infrastructure, security and project management as well as the industrial usage of Hive, HBase, Kafka, and Spark within an industrial Corporate and Investment Bank environment, framed by regulatory constraints.
HBase has established itself as the backend for many operational and interactive use cases, powering well-known services that support millions of users and thousands of concurrent requests. In terms of features, HBase has come a long way, offering advanced options such as multi-level caching on- and off-heap, pluggable request handling, fast recovery options such as region replicas, table snapshots for data governance, tunable write-ahead logging, and so on. This talk is based on the research for the upcoming second edition of the speaker's HBase book, correlated with practical experience from medium to large HBase projects around the world. You will learn how to plan for HBase, from selecting the matching use cases and determining the number of servers needed, through to performance tuning options. There is no reason to be afraid of using HBase, but knowing its basic premises and technical choices will make using it much more successful. You will also learn about many of the new features of HBase up to version 1.3, and where they are applicable.
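Planning of the kind described often starts as arithmetic. Here is a hedged back-of-the-envelope sketch; every constant is an assumption to replace with your own measurements.

```python
# Hedged sizing sketch for an HBase cluster; all constants are assumptions.
raw_tb = 40                 # logical dataset size in TB
replication = 3             # HDFS replication factor
region_gb = 10              # target region size
regions_per_server = 100    # a common comfort ceiling per RegionServer
disk_per_node_tb = 12       # usable disk per node

storage_tb = raw_tb * replication
regions = raw_tb * 1024 / region_gb
by_regions = regions / regions_per_server
by_disk = storage_tb / disk_per_node_tb

print(f"{storage_tb} TB on disk across ~{regions:.0f} regions")
print(f"servers needed: {max(by_regions, by_disk):.0f} "
      "(the larger of the region-count and disk-capacity bounds)")
```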
There has been an explosion of data digitising our physical world – from cameras, environmental sensors, and embedded devices, right down to the phones in our pockets. This means that companies now have new ways to transform their businesses, both operationally and through their products and services, by leveraging this data and applying fresh analytical techniques to make sense of it. But are they ready? In most cases, the answer is "no".
In this session, we’ll be discussing the challenges facing companies trying to embrace the Analytics of Things, and how Teradata has helped customers work through and turn those challenges to their advantage.
In this talk, we will present a new distribution of Hadoop, Hops, that can scale the Hadoop Filesystem (HDFS) by 16x, from 70K ops/s to 1.2 million ops/s on Spotify's industrial Hadoop workload. Hops is an open-source distribution of Apache Hadoop that supports distributed metadata for HDFS (HopsFS) and for the ResourceManager in Apache YARN. HopsFS is the first production-grade distributed hierarchical filesystem to store its metadata, normalized, in an in-memory, shared-nothing database. For YARN, we will discuss optimizations that enable a 2x throughput increase for the Capacity Scheduler, enabling scalability to clusters with more than 20K nodes. We will discuss the journey of how we reached this milestone, including some of the challenges involved in efficiently and safely mapping hierarchical filesystem metadata state and operations onto a shared-nothing, in-memory database. We will also discuss the key database features needed for extreme scaling, such as multi-partition transactions, partition-pruned index scans, distribution-aware transactions, and the streaming changelog API. Hops (www.hops.io) is Apache-licensed open source and supports a pluggable database backend for distributed metadata, although it currently only supports MySQL Cluster as a backend. Hops opens up new directions for Hadoop once metadata is available for tinkering in a mature relational database.
In high-risk manufacturing industries, regulatory bodies stipulate continuous monitoring and documentation of critical product attributes and process parameters. At the same time, sensor data coming from production processes can be used to gain deeper insights into optimization potential. By establishing a central production data lake based on Hadoop and using Talend Data Fabric as the basis for a unified architecture, the German pharmaceutical company HERMES Arzneimittel was able to cater to compliance requirements as well as unlock new business opportunities, enabling use cases like predictive maintenance, predictive quality assurance, and open-world analytics. Learn how Talend Data Fabric enabled HERMES Arzneimittel to become data-driven and transform Big Data projects from challenging, hard-to-maintain hand-coded jobs into repeatable, future-proof integration designs.
Talend Data Fabric combines Talend products into a common set of powerful, easy-to-use tools for any integration style: real-time or batch, big data or master data management, on-premises or in the cloud.
While you might be tempted to assume data is already safe in a single Hadoop cluster, in practice you have to plan for more. Questions like "What happens if the entire datacenter fails?" or "How do I recover into a consistent state of data, so that applications can continue to run?" are not at all trivial to answer for Hadoop. Did you know that HDFS snapshots do not treat open files as immutable? Or that HBase snapshots are executed asynchronously across servers and therefore cannot guarantee atomicity for cross-region updates (which includes tables)? There is no unified and coherent data backup strategy, nor is there tooling available for many of the included components to build such a strategy. The Hadoop distributions largely avoid this topic, as most customers are still in the "single use-case" or PoC phase, where data governance as far as backup and disaster recovery (BDR) is concerned is not (yet) important. This talk first introduces the overarching issue and difficulties of backup and data safety, looks at each of the many components in Hadoop, including HDFS, HBase, YARN, Oozie, the management components and so on, and finally shows a viable approach using built-in tools. You will also learn not to take this topic lightly and what is needed to implement and guarantee continuous operation of Hadoop cluster based solutions.
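As a taste of the built-in tooling such a talk surveys, here is a minimal sketch that scripts the two snapshot primitives mentioned above. The paths, table names, and naming scheme are illustrative assumptions, not prescriptions from the talk.

```python
# Sketch: one nightly building block of a backup strategy -- HDFS and HBase
# snapshots. Paths and table names below are hypothetical.
import subprocess
from datetime import datetime

stamp = datetime.now().strftime("%Y%m%d")

def run(cmd):
    """Run a shell command and fail loudly, so a broken snapshot aborts the job."""
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# HDFS: the directory must be made snapshottable once by an admin.
run(["hdfs", "dfsadmin", "-allowSnapshot", "/data/warehouse"])
run(["hdfs", "dfs", "-createSnapshot", "/data/warehouse", f"snap-{stamp}"])

# HBase: snapshots are taken via the shell; as the talk warns, they are
# asynchronous across region servers, so not atomic for cross-region updates.
hbase_cmd = f"snapshot 'orders', 'orders-snap-{stamp}'"
subprocess.run(["hbase", "shell", "-n"], input=hbase_cmd.encode(), check=True)
```

Keep the caveats above in mind: HDFS snapshots do not freeze open files and HBase snapshots are not atomic, so a script like this is only one building block of a real BDR strategy.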
Epistemic Interaction - tuning interfaces to provide information for AI support – Alan Dix
Paper presented at SYNERGY workshop at AVI 2024, Genoa, Italy. 3rd June 2024
https://alandix.com/academic/papers/synergy2024-epistemic/
As machine learning integrates deeper into human-computer interactions, the concept of epistemic interaction emerges, aiming to refine these interactions to enhance system adaptability. This approach encourages minor, intentional adjustments in user behaviour to enrich the data available for system learning. This paper introduces epistemic interaction within the context of human-system communication, illustrating how deliberate interaction design can improve system understanding and adaptation. Through concrete examples, we demonstrate the potential of epistemic interaction to significantly advance human-computer interaction by leveraging intuitive human communication strategies to inform system design and functionality, offering a novel pathway for enriching user-system engagements.
PHP Frameworks: I want to break free (IPC Berlin 2024) – Ralf Eggert
In this presentation, we examine the challenges and limitations of relying too heavily on PHP frameworks in web development. We discuss the history of PHP and its frameworks to understand how this dependence has evolved. The focus will be on providing concrete tips and strategies to reduce reliance on these frameworks, based on real-world examples and practical considerations. The goal is to equip developers with the skills and knowledge to create more flexible and future-proof web applications. We'll explore the importance of maintaining autonomy in a rapidly changing tech landscape and how to make informed decisions in PHP development.
This talk is aimed at encouraging a more independent approach to using PHP frameworks, moving towards a more flexible and future-proof approach to PHP development.
The Metaverse and AI: how can decision-makers harness the Metaverse for their... – Jen Stirrup
The Metaverse is popularized in science fiction, and now it is becoming closer to being a part of our daily lives through the use of social media and shopping companies. How can businesses survive in a world where Artificial Intelligence is becoming the present as well as the future of technology, and how does the Metaverse fit into business strategy when futurist ideas are developing into reality at accelerated rates? How do we do this when our data isn't up to scratch? How can we move towards success with our data so we are set up for the Metaverse when it arrives?
How can you help your company evolve, adapt, and succeed using Artificial Intelligence and the Metaverse to stay ahead of the competition? What are the potential issues, complications, and benefits that these technologies could bring to us and our organizations? In this session, Jen Stirrup will explain how to start thinking about these technologies as an organisation.
GraphRAG is All You need? LLM & Knowledge Graph – Guy Korland
Guy Korland, CEO and Co-founder of FalkorDB, will review two articles on the integration of language models with knowledge graphs.
1. Unifying Large Language Models and Knowledge Graphs: A Roadmap.
https://arxiv.org/abs/2306.08302
2. Microsoft Research's GraphRAG paper and a review paper on various uses of knowledge graphs:
https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs – Alex Pruden
This paper presents Reef, a system for generating publicly verifiable succinct non-interactive zero-knowledge proofs that a committed document matches or does not match a regular expression. We describe applications such as proving the strength of passwords, the provenance of email despite redactions, the validity of oblivious DNS queries, and the existence of mutations in DNA. Reef supports the Perl Compatible Regular Expression syntax, including wildcards, alternation, ranges, capture groups, Kleene star, negations, and lookarounds. Reef introduces a new type of automata, Skipping Alternating Finite Automata (SAFA), that skips irrelevant parts of a document when producing proofs without undermining soundness, and instantiates SAFA with a lookup argument. Our experimental evaluation confirms that Reef can generate proofs for documents with 32M characters; the proofs are small and cheap to verify (under a second).
Paper: https://eprint.iacr.org/2023/1886
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor... – SOFTTECHHUB
The choice of an operating system plays a pivotal role in shaping our computing experience. For decades, Microsoft's Windows has dominated the market, offering a familiar and widely adopted platform for personal and professional use. However, as technological advancements continue to push the boundaries of innovation, alternative operating systems have emerged, challenging the status quo and offering users a fresh perspective on computing.
One such alternative that has garnered significant attention and acclaim is Nitrux Linux 3.5.0, a sleek, powerful, and user-friendly Linux distribution that promises to redefine the way we interact with our devices. With its focus on performance, security, and customization, Nitrux Linux presents a compelling case for those seeking to break free from the constraints of proprietary software and embrace the freedom and flexibility of open-source computing.
The new frontiers of AI in RPA with UiPath Autopilot™ – UiPathCommunity
In this free online event, organized by the Italian UiPath Community, you can explore the new features of Autopilot, the tool that integrates Artificial Intelligence into the development and use of Automations.
📕 Together we will look at some examples of using Autopilot in different tools of the UiPath Suite:
Autopilot per Studio Web
Autopilot per Studio
Autopilot per Apps
Clipboard AI
GenAI applied to Document Understanding
👨🏫👨💻 Speakers:
Stefano Negro, UiPath MVPx3, RPA Tech Lead @ BSP Consultant
Flavio Martinelli, UiPath MVP 2023, Technical Account Manager @UiPath
Andrei Tasca, RPA Solutions Team Lead @NTT Data
Removing Uninteresting Bytes in Software Fuzzing – Aftab Hussain
Imagine a world where software fuzzing, the process of mutating bytes in test seeds to uncover hidden and erroneous program behaviors, becomes faster and more effective. A lot depends on the initial seeds, which can significantly dictate the trajectory of a fuzzing campaign, particularly how long it takes to uncover interesting behaviour in your code. We introduce DIAR, a technique designed to speed up fuzzing campaigns by pinpointing and eliminating uninteresting bytes in the seeds. Picture this: instead of wasting valuable resources on meaningless mutations in large, bloated seeds, DIAR removes the unnecessary bytes, streamlining the entire process.
In this work, we equipped AFL, a popular fuzzer, with DIAR and examined two critical Linux libraries -- libxml's xmllint, a tool for parsing XML documents, and Binutils' readelf, an essential debugging and security analysis command-line tool used to display detailed information about ELF (Executable and Linkable Format) files. Our preliminary results show that AFL+DIAR not only discovers new paths more quickly but also achieves higher coverage overall. This work thus showcases how starting with lean, optimized seeds can lead to faster, more comprehensive fuzzing campaigns -- and DIAR helps you find such seeds.
- These are slides of the talk given at IEEE International Conference on Software Testing Verification and Validation Workshop, ICSTW 2022.
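For intuition, here is a toy greedy seed trimmer in the spirit of what DIAR automates. It is not DIAR's actual algorithm (which the paper defines), and the `coverage` oracle (for example, a wrapper around afl-showmap) is an assumed callable.

```python
# Naive greedy seed trimming: drop chunks whose removal leaves the coverage
# fingerprint unchanged. Purely illustrative -- NOT DIAR's algorithm.

def trim_seed(seed: bytes, coverage, chunk: int = 16) -> bytes:
    """coverage(data) must return a hashable fingerprint of the paths hit."""
    baseline = coverage(seed)
    trimmed = bytearray(seed)
    offset = 0
    while offset < len(trimmed):
        candidate = trimmed[:offset] + trimmed[offset + chunk:]
        if coverage(bytes(candidate)) == baseline:
            trimmed = bytearray(candidate)   # chunk was uninteresting: drop it
        else:
            offset += chunk                  # chunk matters: keep it, move on
    return bytes(trimmed)
```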
In his public lecture, Christian Timmerer provides insights into the fascinating history of video streaming, starting from its humble beginnings before YouTube to the groundbreaking technologies that now dominate platforms like Netflix and ORF ON. Timmerer also presents provocative contributions of his own that have significantly influenced the industry. He concludes by looking at future challenges and invites the audience to join in a discussion.
Accelerate your Kubernetes clusters with Varnish Caching – Thijs Feryn
A presentation about the usage and availability of Varnish on Kubernetes. This talk explores the capabilities of Varnish caching and shows how to use the Varnish Helm chart to deploy it to Kubernetes.
This presentation was delivered at K8SUG Singapore. See https://feryn.eu/presentations/accelerate-your-kubernetes-clusters-with-varnish-caching-k8sug-singapore-28-2024 for more details.
Welcome to the first live UiPath Community Day Dubai! Join us for this unique occasion to meet our local and global UiPath Community and leaders. You will get a full view of the MEA region's automation landscape and the AI Powered automation technology capabilities of UiPath. Also, hosted by our local partners Marc Ellis, you will enjoy a half-day packed with industry insights and automation peers networking.
📕 Curious on our agenda? Wait no more!
10:00 Welcome note - UiPath Community in Dubai
Lovely Sinha, UiPath Community Chapter Leader, UiPath MVPx3, Hyper-automation Consultant, First Abu Dhabi Bank
10:20 A UiPath cross-region MEA overview
Ashraf El Zarka, VP and Managing Director MEA, UiPath
10:35 Customer Success Journey
Deepthi Deepak, Head of Intelligent Automation CoE, First Abu Dhabi Bank
11:15 The UiPath approach to GenAI with our three principles: improve accuracy, supercharge productivity, and automate more
Boris Krumrey, Global VP, Automation Innovation, UiPath
12:15 Discover how Marc Ellis leverages tech-driven solutions in recruitment and managed services.
Brendan Lingam, Director of Sales and Business Development, Marc Ellis
Smart TV Buyer Insights Survey 2024 – 91mobiles
91mobiles recently conducted a Smart TV Buyer Insights Survey in which we asked over 3,000 respondents about the TV they own, aspects they look at on a new TV, and their TV buying preferences.
Securing your Kubernetes cluster: a step-by-step guide to success! – KatiaHIMEUR1
Today, after several years of existence, an extremely active community and an ultra-dynamic ecosystem, Kubernetes has established itself as the de facto standard in container orchestration. Thanks to a wide range of managed services, it has never been so easy to set up a ready-to-use Kubernetes cluster.
However, this ease of use means that the subject of security in Kubernetes is often left for later, or even neglected. This exposes companies to significant risks.
In this talk, I'll show you step-by-step how to secure your Kubernetes cluster for greater peace of mind and reliability.
SAP Sapphire 2024 - ASUG301 Building better apps with SAP Fiori – Peter Spielvogel
Building better applications for business users with SAP Fiori.
• What is SAP Fiori and why it matters to you
• How a better user experience drives measurable business benefits
• How to get started with SAP Fiori today
• How SAP Fiori elements accelerates application development
• How SAP Build Code includes SAP Fiori tools and other generative artificial intelligence capabilities
• How SAP Fiori paves the way for using AI in SAP apps
Enhancing Performance with Globus and the Science DMZ – Globus
ESnet has led the way in helping national facilities—and many other institutions in the research community—configure Science DMZs and troubleshoot network issues to maximize data transfer performance. In this talk we will present a summary of approaches and tips for getting the most out of your network infrastructure using Globus Connect Server.
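For a flavour of what "getting the most out of your network" looks like in practice, here is a minimal globus-sdk sketch that submits a recursive transfer between two endpoints. The endpoint UUIDs, paths, and token handling are placeholders, not values from the talk.

```python
# Minimal Globus transfer sketch. The token and endpoint UUIDs below are
# hypothetical; obtain real ones via Globus Auth and your Globus Connect
# Server deployments.
import globus_sdk

TOKEN = "..."                                   # a transfer-scoped access token
SRC = "ddb59aef-6d04-11e5-ba46-22000b92c6ec"    # hypothetical source endpoint
DST = "ddb59af0-6d04-11e5-ba46-22000b92c6ec"    # hypothetical destination

tc = globus_sdk.TransferClient(
    authorizer=globus_sdk.AccessTokenAuthorizer(TOKEN))

# checksum sync level re-verifies data, trading speed for integrity
tdata = globus_sdk.TransferData(tc, SRC, DST,
                                label="science-dmz-demo",
                                sync_level="checksum")
tdata.add_item("/project/raw/", "/ingest/raw/", recursive=True)

task = tc.submit_transfer(tdata)
print("task id:", task["task_id"])
```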
Free Complete Python - A step towards Data Science
ING- CoreIntel- Collect and Process Network Logs Across Data Centers in Real Time
1. Unleashing the power of Apache Atlas
with Apache Ranger
Virtual Data Connector Project
NIGEL JONES
JONESN@UK.IBM.COM
DATAWORKS, MUNICH, APRIL 2017
Apache®, Apache Atlas, Apache Ranger & other Apache project names referenced are either registered trademarks or trademarks of
the Apache Software Foundation in the United States and/or other countries. No endorsement by The Apache Software Foundation
is implied by the use of these marks.
2. About Me – Nigel Jones
https://www.linkedin.com/in/nigelljones/
jonesn@uk.ibm.com (Anyone still use email?)
@planetf1 – noisy, f1, electric vehicles, food & drink …. A split of work/life
accounts didn’t work for me!
And of course the Apache Atlas & Ranger mailing lists & JIRA!
Science fan at school & uni. It was cloud chambers back then… now just the cloud
IBM Hursley, UK since 1990
Last 3 years focus on Data Lake, Information Governance, Open Metadata
4. Data?
What data do I have?
What does it mean?
Where is it?
Who has access to it?
Who owns it?
What quality is it?
How does it relate to other data?
How do I control, audit & understand access?
5. Regulatory needs
Adhere to regulations like BCBS-239 and GDPR
Need to know meaning, value of the data
Demonstrate processes in place to govern access
Audit
Significant fines if rules breached
Whilst ensuring easy, ready access to appropriate data for data professionals to
support an agile business
7. Metadata..
Metadata enables data to be used outside of the application that created it.
Analytics and decision making
New business applications
Reporting and compliance
Metadata describes the format and content of data allowing people to judge which
dataset to use for a new project
Structure
Meaning
Origin
Valid values and quality
Usage and ownership
Regulations and classifications that apply
Metadata describes the business context and classification of data allowing automated
governance processes to operate.
8. Which can support…
An enterprise data catalogue that lists all data including where it is, what it
is, who owns it, its meaning, quality, where it came from, and can fully
describe its business context & how the data should be governed….
Subject Matter experts searching, collaborating, feeding back about their
data needs and use
Automated governance actions to protect and manage including auditing,
monitoring, quality control, rights management
9. But easily…
Open frameworks & APIs
Automatic collection & discovery of metadata in a dynamic heterogeneous
environment
Using predefined standards for glossaries, schemas, rules, regulations to
reduce cost
Cheap to integrate new tools
No proprietary lock-in & assumptions that all tools are from one suite or
vendor
Avoiding silos
Distributed and Open
12. Data virtualization project
Collaboration – IBM, several banks & open community
A Data Lake environment
Not just Hadoop, but other sources too
Business Terms, Classifications, Metadata rich
Offer virtualized views. Expose relational data with business terms
Manage Access to resources – permit, deny, log, filter/mask …. THROUGH METADATA
Open, pluggable
Working through use cases, design, initial MVP (this year)
Critique and feedback are welcomed. We're looking for guidance and support from the Atlas
& Ranger communities, as well as to contribute our ideas
Proposed changes all go through mailing list and JIRA for feedback
13. Apache Atlas
“Atlas is a scalable and extensible set of core foundational governance
services – enabling enterprises to effectively and efficiently meet their
compliance requirements within Hadoop and allows integration with the
whole enterprise data ecosystem.” …. http://www.apache.org
Open Community -- Apache Incubator since May 2015
Type agnostic metadata store
REST API & UI
Supports many Hadoop components including HBase, Hive, Sqoop, Storm
& others
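To make the REST API above concrete, here is a hedged Python sketch using Atlas's v2 basic search to list Hive tables that carry a given classification. The host, credentials, and the "PII" tag are assumptions for illustration.

```python
# Sketch: Atlas v2 basic search for hive_table entities tagged PII.
# Host and basic-auth credentials are placeholders; kerberized clusters
# would use SPNEGO instead.
import requests

ATLAS = "http://atlas-host:21000"
AUTH = ("admin", "admin")

resp = requests.post(
    f"{ATLAS}/api/atlas/v2/search/basic",
    auth=AUTH,
    json={"typeName": "hive_table", "classification": "PII", "limit": 25},
)
resp.raise_for_status()
for entity in resp.json().get("entities", []):
    print(entity["guid"], entity.get("attributes", {}).get("qualifiedName"))
```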
14. Apache Ranger
Centralized security administration to manage all security related tasks in a
central UI or using REST APIs.
Fine grained authorization to do a specific action and/or operation with
Hadoop component/tool and managed through a central administration
tool
Standardize authorization method across all Hadoop components.
Enhanced support for different authorization methods - Role based access
control, attribute based access control etc.
Centralize auditing of user access and administrative actions (security
related) within all the components of Hadoop.
… from http://ranger.apache.org
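The central administration REST APIs look roughly like this in practice: a sketch that creates a resource-based Hive policy through Ranger's public v2 API. The host, credentials, service name, and group are assumptions.

```python
# Sketch: create a Ranger policy granting a group select access on one
# Hive table via the public v2 REST API. All names are hypothetical.
import requests

RANGER = "http://ranger-host:6080"
AUTH = ("admin", "admin")

policy = {
    "service": "cl1_hive",                     # hypothetical Hive repo name
    "name": "hr_readonly_salaries",
    "resources": {
        "database": {"values": ["hr"]},
        "table":    {"values": ["salaries"]},
        "column":   {"values": ["*"]},
    },
    "policyItems": [{
        "accesses": [{"type": "select", "isAllowed": True}],
        "groups": ["hr_analysts"],
    }],
}

resp = requests.post(f"{RANGER}/service/public/v2/api/policy",
                     auth=AUTH, json=policy)
resp.raise_for_status()
print("created policy id:", resp.json()["id"])
```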
15. Project Interactions
[Diagram: a search/reporting tool interacts with GaianDB to search for a list of assets by metadata, search for data, and obtain data to draw reports. GaianDB manages logical views over the underlying data (SQL, Hive, HDFS, Oracle, Netezza, etc., across Hadoop and RDBMS sources) and carries a Ranger plugin to permit/deny and mask access. Apache Atlas deploys rules, pushes classifications, and acts as the source for user roles (not users); Apache Ranger pulls rules and classifications; Apache Solr provides indexing for search.]
16. Why Atlas and Ranger?
Open Source essential to forming an active ecosystem
Vision, active community & evolving – ability to contribute & work with
others to provide the best solution
Already have good core capabilities
Atlas type system is very flexible
Ranger offers a range of policy types and provides a pluggable framework
Already cross project integration
Use of tag-based policies in Ranger sourced from Atlas
Can be used independently of full Hadoop stack
17. Refined virtual connector scope
[Diagram of the MVP scope: the Virtualizer extracts physical metadata from the data lake repositories (Oracle, Netezza, Hive tables), manages logical tables in GaianDB, and pushes and queries metadata in Atlas, with pre/post hooks around view creation. GaianDB carries a Ranger plugin that polls policies from the Ranger server; Ranger config (e.g. policies, audit log location) is fed by tag-sync and rule-sync, users come from LDAP, and access is recorded in an audit log. Atlas comprises a Titan graph metadata repository fronted by OMRS and OMAS layers, with a mapper to IGC; search and reporting tools push and query metadata, and external catalogs such as Navigator and Datameer hold their own metadata.]
18. GaianDB & Virtualizer
GaianDB
Open Source
Federated, self learning, dynamic configuration
Based on Apache Derby
Already had “policy” support – we’re plugging in
Ranger for this project
Virtualizer
Listens to event notifications on assets etc
Creates view definitions in GaianDB and uses new Atlas APIs
to store metadata. Could use a different virtualization engine.
Designed to be open to other virtualization
technologies.
[Diagram: logical tables LT1 and LT2 in GaianDB map onto data sources DS1, DS2 and DS3, with a policy plugin (Ranger) governing access and the Virtualizer wired to Atlas. GaianDB supports federation – not used for the MVP.]
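Since GaianDB surfaces its logical tables over Derby's JDBC protocol, a consumer can query them like any other database. Below is a sketch using the JayDeBeApi bridge; port 6414 and the gaiandb database name follow GaianDB's defaults, while the credentials, jar path, and LT1 table are assumptions.

```python
# Sketch: query a GaianDB logical table from Python over JDBC.
# Host, credentials, jar location, and the LT1 table are hypothetical.
import jaydebeapi

conn = jaydebeapi.connect(
    "org.apache.derby.jdbc.ClientDriver",
    "jdbc:derby://gaian-host:6414/gaiandb",
    ["gaiandb", "passw0rd"],                 # assumed default credentials
    "/opt/gaiandb/lib/derbyclient.jar",
)
try:
    cur = conn.cursor()
    # A logical table federates whichever physical sources were mapped to it;
    # the Ranger plugin decides per-user whether rows come back filtered/masked.
    cur.execute("SELECT * FROM LT1 FETCH FIRST 5 ROWS ONLY")
    for row in cur.fetchall():
        print(row)
finally:
    conn.close()
```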
19. Atlas – glossary enhancements
Get Atlas closer to parity with commercial offerings
Business Terms – categories, category hierarchies
Has-a, is-a, type-of, synonym, antonym, arbitrary relationships
Assets mapped to Business Terms
Classifications
Hierarchy
Navigable mappings to retain ability to flatten tags to ranger
Instead of hive column EMP_SALARY -> SPI, now can be EMP_SALARY -> SALARY ->
SPI …
Used to drive governance
ATLAS-1410
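A toy illustration of the flattening described above: walk from the asset's term up the hierarchy and emit every node as a flat tag, so Ranger's existing tag handling keeps working unchanged. The dicts stand in for the Atlas type system; the names come from the slide's example.

```python
# Toy model of flattening a glossary hierarchy to Ranger-visible tags.
# EMP_SALARY is a hive column mapped to the SALARY term, which is
# classified SPI -- as in the slide's example.
COLUMN_TERM = {"EMP_SALARY": "SALARY"}        # asset -> business term
TERM_PARENT = {"SALARY": "SPI", "SPI": None}  # term/classification hierarchy

def flat_tags(column: str) -> set:
    """Collect the column's term and all of its ancestors as flat tags."""
    tags, node = set(), COLUMN_TERM.get(column)
    while node is not None:
        tags.add(node)
        node = TERM_PARENT.get(node)
    return tags

print(flat_tags("EMP_SALARY"))   # {'SALARY', 'SPI'} -- what tagsync would see
```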
20. Atlas – other enhancements
Consumer Centric APIs
Open Metadata Access Services (OMAS)
REST & more Kafka notifications
Asset, Catalog, Connector, Glossary, Governance Action, Governance Definitions,
Information View, Roles and Access
Repository level APIs
Open Metadata Repository Services (OMRS)
REST & more Kafka notifications
Pluggability through an Open Connector Framework to other metadata repositories
– distributed and Open
Standard data model/core
Enhancement to core model – versioning, external linkage etc
More standard types, i.e. for all relational databases, to ease sharing
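The Kafka notification side can be consumed like any topic. Here is a sketch with kafka-python reading Atlas's ATLAS_ENTITIES topic; the broker address and consumer group are assumptions, and the exact envelope fields vary by Atlas version.

```python
# Sketch: listen for Atlas entity change notifications, e.g. so a
# Virtualizer can react to newly created assets. Broker and group id
# are hypothetical; message field names vary by Atlas version.
import json
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "ATLAS_ENTITIES",
    bootstrap_servers="kafka-host:9092",
    group_id="vdc-virtualizer",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)

for msg in consumer:
    event = msg.value.get("message", msg.value)
    print(event.get("operationType"),
          event.get("entity", {}).get("typeName"))
```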
21. Ranger areas being looked at
Building a plugin for GaianDB
Access control, simple masking. More later
User synchronization (large #users, role of Atlas)
Changes to tag sync process for New glossary proposal
As more metadata goes into Atlas, it becomes source for generation of
some kinds of policies. Where is the master?
Generating ranger rules from governance definitions
How about control of access to Atlas itself?
Aside: Interfaces used by enforcement engines (such as to get classification
data) need to be efficient – these should work for projects like Apache
Sentry as well as Atlas
22. Beyond the MVP
Open Discovery Framework
Consider other security enforcement engines – such as Apache Sentry &
driving more capability around rules & governance actions from Atlas
metadata
Work on standard models to support different domains
Lineage
From high level design lineage through to operational detail. Logs vs graph….
API metadata
Infrastructure – JanusGraph…
Abstraction added by IBM in the last few months for Titan 1
23. The vision
An enterprise data catalog that lists all of your data, where it is located, its origin (lineage), owner, structure, meaning, classification and quality
Spanning systems both on premise and cloud providers
Hosted locally to your data platforms but integrated to provide the enterprise view
New data tools (from any vendor) connect to your data catalog out of the box
No vendor lock-in; nor expensive population of yet another proprietary siloed metadata repository
Metadata is added automatically to the catalog as new data is created
Extensible discovery processes characterise and classify the data
Interested parties and processes are notified
Subject matter experts collaborating around the data
Locate the data they need, quickly and efficiently
Feed back their knowledge about the data and the uses they have made of it to help others and support economic evaluation of data
Automated governance processes protect and manage your data
Metadata-driven access control
Auditing, metering and monitoring
Quality control and exception management
Rights management
Predefined standards for glossaries, data schemas, rules and regulations that reduce the cost of doing business
Open frameworks and APIs for collaborating with universities, traditional vendors and new innovators around data and advanced analytics
24. Summary
Atlas can help us have an industry wide common metadata platform around
which a vibrant ecosystem can evolve
Not only in Hadoop but more broadly
Metadata driven governance can be scalable & enable us to manage our data
better, and be compliant with regulations
The ideas presented here resonate with many people we’ve spoken to
Get involved! I’d love to hear the feedback on this approach!
Comment on the JIRAs, ask questions, contribute, disagree… ;-)
Look at JIRA Tag “VirtualDataConnector” or start at ATLAS-1689
Atlas wiki
“Innovation happens best not in isolation but in collaboration” (keynote)
THANKS!
27. Atlas Virtualizer Architecture
[Diagram: a Search/Explore UI drives Search, Catalog, Virtual Asset and GAF OMAS interfaces, which sit via OMRS on the Atlas graphDB; a Connector Framework (possibly hardcoded at first) links out to IGC over the IGC REST API and to "gaiandb", which reaches Oracle, HDFS and Netezza data through P-JDBC connectors; GAF pre and post steps surround the flow. The legend marks the Atlas boundaries and distinguishes components developed in the POC from those that may not be in the POC initially.]
28. Metadata areas and types
[Diagram of metadata areas and types, numbered 1–7: infrastructure and systems rollout; physical asset descriptions (data stores, APIs, models and components), asset collections (sets, typed sets, type-organized sets), information views, reference data and connector directories; business objects and relationships, taxonomies and ontologies, business attributes, organization, classification schemes, classification strategy and subject area definitions; policy metadata (principles, regulations, standards, approaches, rule specifications, roles and metrics) driving governance actions and processes, rights management and access; models and schemas, with mapping, implementation and augmentation links; discovery metadata (profile data, technical classification, data classification, data quality assessment, …); and information process instrumentation (design lineage), with instrument association. Teaming metadata (people profiles, communities, projects, notebooks, …) and feedback metadata (tags, comments, ratings, …) connect roles such as information auditor, integration developer, business analyst, data scientist, information worker, information owner, information governor, information steward, data quality analyst and information curator, alongside campaigns and projects.]
29. User & Group/Role synchronization
UserSync2
LDAP holds role-membership (LDAP groups) – could also be Active Directory
ATLAS manages the definitive list of roles (those that are used for Atlas-managed sources)
• Corporate LDAP has a huge number of users/groups
• Ranger currently needs to sync all of them
• In future perhaps we establish group/role membership during authentication
• Capability for an alternative source could be merged into the base UserSync
[Diagram: the sync process asks Apache Atlas for the role list via the Governance Action OMAS (getRoles), resolves membership with an LDAP lookup (group:member), and uploads the result to Apache Ranger.]
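A minimal sketch of that role-scoped lookup, using the ldap3 library: resolve only the roles Atlas reports, rather than syncing the whole corporate directory. The DN layout, attribute names, and role list are assumptions.

```python
# Sketch: resolve membership for Atlas-sourced roles only. Server, DNs,
# credentials, and the objectClass are hypothetical.
from ldap3 import Server, Connection

atlas_roles = ["hr_analysts", "data_stewards"]   # e.g. via Governance Action OMAS getRoles

conn = Connection(Server("ldaps://ldap-host"),
                  user="cn=svc-ranger,ou=services,dc=example,dc=com",
                  password="secret", auto_bind=True)

role_members = {}
for role in atlas_roles:
    conn.search("ou=groups,dc=example,dc=com",
                f"(&(objectClass=groupOfNames)(cn={role}))",
                attributes=["member"])
    if conn.entries:
        role_members[role] = [str(dn) for dn in conn.entries[0].member]

print(role_members)   # role -> member DNs, ready to upload to Ranger
```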
30. Atlas Glossary v2: Tag Sync to Ranger
TagSync2
The ATLAS glossary manages a sophisticated enterprise glossary structure
• Atlas Glossary v2 proposed in ATLAS-1410 (David Radley); sync builds on the existing tagsync approach
• A new API in Atlas will flatten the classification structure
• No changes to Ranger – but exposing richer classification could be an area of future work
[Diagram: in Apache Atlas, the Hive column emp_renum maps to the business term Salary, which carries the Confidential classification; after flattening via the Governance Action OMAS, Apache Ranger simply sees the tag Confidential on the emp_renum Hive column.]
31. Policy (Rule) synchronization
RuleSync
• Generate policies in Ranger based off entities in Atlas
• Currently designing how this works
• Scoped by policy service so the existing Ranger UI approach still works
[Diagram: the Governance Action OMAS (getRules) exposes role, classifications, asset and action from Apache Atlas, from which a Ranger rule is generated in Apache Ranger.]
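One plausible shape for rule-sync, sketched below: map an Atlas-side governance definition (role, classification, action) onto a tag-based policy pushed through Ranger's public REST API. The tag service name, host, credentials, and access type are assumptions; the slide says this design is still being worked out, so this is an illustration of the idea, not the project's implementation.

```python
# Sketch: generate a tag-based Ranger policy from an Atlas governance
# definition. Service and host names are hypothetical.
import requests

RANGER = "http://ranger-host:6080"
AUTH = ("admin", "admin")

def atlas_rule_to_ranger_policy(role: str, classification: str) -> dict:
    """'role may select assets classified X' -> tag-based policy dict."""
    return {
        "service": "cl1_tag",                 # hypothetical tag-service name
        "name": f"{classification}_{role}_select",
        "resources": {"tag": {"values": [classification]}},
        "policyItems": [{
            "accesses": [{"type": "hive:select", "isAllowed": True}],
            "groups": [role],
        }],
    }

policy = atlas_rule_to_ranger_policy("hr_analysts", "SPI")
resp = requests.post(f"{RANGER}/service/public/v2/api/policy",
                     auth=AUTH, json=policy)
resp.raise_for_status()
```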
32. VirtualDataConnector JIRAs 20170402
RANGER-1488 – Create Ranger plugin for gaiandb
RANGER-1487 – Generate rules from governance definitions in Atlas
RANGER-1486 – New usersync alternative for Atlas (vdc)
RANGER-1485 – Ranger support for Virtual Data Connector Project (ATLAS)
RANGER-1464 – Support Atlas v2 glossary in Atlas plugin (for access control to terms etc)
RANGER-1454 – Support of Atlas v2 glossary API proposal for tag source
RANGER-1234 – Post-evaluation phase user extensions
RANGER-1186 – Ranger Source: eclipse
RANGER-1168 – Add data masking for tag based policies
ATLAS-1696 – Governance Action Framework OMAS
ATLAS-1694 – Sample assets to support Virtual Connector Project
ATLAS-1691 – OMAS Interfaces for Atlas
ATLAS-1158 – Build ATLAS using Docker
ATLAS-520 – Temporal / versioning support for types, traits, entities ...
ATLAS-519 – Metrics
ATLAS-455 – Timeouts in tests should be configurable from system property
ATLAS-197 – Add build instructions in top level dir
33. References
Apache Atlas - http://atlas.apache.org/
Top level JIRA for this activity: https://issues.apache.org/jira/browse/ATLAS-1689
Apache Ranger - http://ranger.apache.org/
GaianDB
https://github.com/gaiandb/gaiandb
https://developer.ibm.com/open/openprojects/gaian-database/
The case for open metadata – A.M.Chessell
http://www.ibmbigdatahub.com/blog/case-open-metadata
Editor's Notes
This is the nirvana. Many tools from different teams – open or proprietary – all able to exchange metadata easily.
A new tool can easily understand existing metadata, can integrate with minimal effort
GaianDB is an open source project from IBM that is based on Apache Derby and supports a highly distributed model with self-learning/healing capabilities. It virtualizes access to underlying data sources – for example, a virtual table may be surfaced via JDBC that is actually based on a combination of a CSV file and another relational database. We are using it in the Virtual Data Connector project to provide a single point of control via a Ranger plugin, as well as to do some data source mappings such as hiding technical columns from view or renaming columns with more business-like terms gleaned from the glossary (Atlas)
This is broadly the scope of an MVP definition we're using to focus our initial work this year. We have use cases we can share with anyone interested, and will be capturing that info in the Atlas/Ranger JIRAs and potentially the wiki. The list of metadata repositories is an example. Our MVP sources some metadata from IGC since that is being used by some participants, but the focus is on open interfaces and Atlas. The other repositories are potential ideas only. Similarly…
It’s important to architect this in an open way. The rules used to decide when to virtualize a resource need to be pluggable – perhaps, for example, all data arriving in a particular DataLake zone will be a virtualization candidate. Further, the actual technology needs to be changeable – proprietary or other open projects – for example perhaps Presto is a candidate. Ideas welcome – proposals will be shared in Jira
OMAS = Open Metadata Access Services – these are consumer-centric interfaces, so they pass objects suited to a particular consumer – for example Ranger in the case of the Governance Action OMAS, or perhaps a catalog UI for the Catalog services. Each consumer has different needs in terms of object structure, or whether it deals with individual objects or sets, and this can differ from the model used in the underlying repository. For some interfaces this mapping is simple, for others more complex. OMRS = Open Metadata Repository Services. This refers to the core repository, i.e. the Atlas type system. We see other metadata repositories adopting the same UI, and are proposing a mechanism that will allow these to be plugged in. Metadata can change rapidly, and the only scalable approach is to ensure it’s open & distributed. Contributions from other metadata server authors welcome! Note that these are our working names. Fundamentally they are Atlas, and so the Atlas community will together need to agree on the actual names moving forward.
A standard data model – in addition to a common mechanism/server – is necessary to make it much easier to *understand* the metadata we store. Whilst there will always be a need for extensions, having a good base object definition will make application integration easier. For example we might all wish to describe a RDBMS in a very similar way. We can then go on from this to have more standards oriented around industry models
GaianDB ranger plugin – GaianDB already has the capability to have a policy plugin which governs access to its virtual tables. To integrate with Ranger and Atlas we will have a Ranger-style plugin. Whilst this will function like any other Ranger plugin, in addition policies will be generated from Atlas itself. User synchronization challenges are described later. In summary, in an enterprise environment there may be many users (100k+) in LDAP, but only a small number have access to the virtualization infrastructure. We’re going to key the user sync off the list of user roles found in Atlas itself, and then obtain the role membership from LDAP. This will then be uploaded to Ranger as per the existing usersync. Tag Sync – the glossary enhancements provide additional structure in how Business Terms, Classifications & assets are linked. A new Atlas API will flatten this structure and thus preserve Ranger’s ability to use Atlas tags as today. In future there may be a re-evaluation to see if this more sophisticated approach should be pushed to Ranger too. Policies – since Atlas now has richer metadata including information about asset ownership, high-level governance policies and data classification, rules may be generated in some cases from Atlas, or from a new rule-sync process. This is still being worked through and we’ll share our ideas with the community. Openness – some users I’ve spoken to are interested in Atlas but may currently use other technologies for enforcement, including Sentry in Hadoop. The intent is to ensure all the interfaces defined are open to all, and useful… so that should someone wish, they could just as well integrate with Sentry as Ranger. This loosely coupled approach helps support an innovative, exciting ecosystem
Atlas holds metadata around user roles – they are used to define governance rules… In keeping with Ranger’s process to synchronize users & groups we will source these slightly differently, though this is mostly simply a scoping exercise to avoid pulling everything from LDAP. One consideration for the future is whether Ranger needs to sync users/groups at all – whilst the sync can help with typeahead when manually defining policies, it’s of relatively little use at runtime if instead plugins could pull the current user role membership from LDAP or elsewhere after connection. Possibly for another JIRA
Ranger already does tag synchronization with Atlas, but changes will be needed to support the new glossary capabilities
A new tagsync process will likely be implemented so that either old or new can be used to avoid any breakage for existing users
Currently working through how this may work, but fundamentally we can define the governance rules in Atlas, and likely generate executable rules in ranger. Refer to the JIRAs for ongoing design on this area
In no particular order and an example only – a query for all JIRAs against Atlas and Ranger with label = ‘VirtualDataConnector’ as of 2 April 2017. This is a list of issues we’re interested in, in particular. The root JIRA for our current design work is ATLAS-1689, which it appears we forgot to tag. There are others too, so please rerun the query!