The document discusses Hortonworks DataFlow (HDF), which is a platform for data in motion. HDF allows users to collect data at the edge, route and process streaming data with Apache NiFi and Kafka, and analyze, visualize, predict and prescribe outcomes from the data using HDF platform services. The HDF platform provides scalable stream processing, security, data provenance, and management capabilities for data in motion applications across the enterprise.
The world’s largest enterprises run their infrastructure on Oracle, DB2 and SQL Server, and their critical business operations on SAP applications. Organisations need this data to be available in real time to conduct the necessary analytics. However, delivering this heterogeneous data at the speed it’s required can be a huge challenge because of complex underlying data models and structures, and legacy manual processes that are prone to errors and delays.
Unlock these silos of data and enable the new advanced analytics platforms by attending this session.
Find out how to:
• Overcome common challenges faced by enterprises trying to access their SAP data
• Integrate SAP data in real time with change data capture (CDC) technology
• Use Attunity Replicate for SAP to stream SAP data into Kafka, as other organisations are already doing
Data science holds tremendous potential for organizations to uncover new insights and drivers of revenue and profitability. Big Data has brought the promise of doing data science at scale to enterprises; however, this promise also comes with challenges for data scientists to continuously learn and collaborate. Data scientists have many tools at their disposal: notebooks like Jupyter and Apache Zeppelin, IDEs such as RStudio, languages like R, Python and Scala, and frameworks like Apache Spark. Given all these choices, how do you best collaborate to build your model and then work through the development lifecycle to deploy it from test into production?
In this session, learn the attributes of a modern data science platform that empowers data scientists to build models using all the data in their data lake and fosters continuous learning and collaboration. We will show a demo of DSX with HDP, focusing on integration, security, and model deployment and management.
Speakers:
Sriram Srinivasan, Senior Technical Staff Member, Analytics Platform Architect, IBM
Vikram Murali, Program Director, Data Science and Machine Learning, IBM
Prior to 2014, Walgreens had traditional enterprise data warehouse systems that had reached their capacity limits. Over the last three years we have evolved, learned lessons, and experienced successes and failures. Our initial adoption of Hadoop came from the need to run complex analytics which simply did not scale on MPP RDBMS. Our business data demands were rapidly increasing, and the concomitant 8 to 12 week extract, transform, and load turnaround cycles were not an acceptable delivery timeframe in the retail space. A self-service model where data lands on a distributed platform, schema is applied where necessary, and processing happens at scale was a necessary paradigm for enabling business value. Our journey started with a single use case and has now evolved into an enterprise data hub. We will discuss the following points: the evolution of our infrastructure profile, streamlining the hardware provisioning cycle, and our hybrid deployment model (on premise and cloud); operations, how SmartSense has helped us proactively tune our cluster, and which operational tests we use for benchmarking the cluster; monitoring, how we monitor and the tools required for enterprise-grade monitoring; security and governance, how we progressed from non-compliance to enterprise grade using Ranger, Knox, Kerberos, HP Voltage, encryption at rest, and many other services; third-party integration with HDP, what we learned and how we overcame the challenges; and lastly, how we approach our disaster recovery strategy, what is driving the need for DR, and the key capabilities required.
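As a minimal sketch of the schema-on-read, self-service pattern described above (paths, table and column names are hypothetical, not Walgreens' actual setup), a table can be declared over data that has already landed on the cluster and queried immediately, with no lengthy ETL cycle:

```python
# Minimal schema-on-read sketch: raw files land in the distributed store
# untouched, and a schema is applied only when a table is declared over
# them (paths and columns are hypothetical).
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("schema-on-read-sketch")
         .enableHiveSupport()
         .getOrCreate())

# Declare a table over data already landed at /landing/retail/sales;
# analysts can query it without waiting for a long ETL turnaround.
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS sales_raw (
        store_id STRING,
        sku STRING,
        sale_amount DOUBLE
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION '/landing/retail/sales'
""")
spark.sql("SELECT store_id, SUM(sale_amount) AS total FROM sales_raw GROUP BY store_id").show()
```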
Insights into Real World Data Management Challenges (DataWorks Summit)
Data is your most valuable business asset, and it is also your biggest challenge. This challenge and opportunity means we continually face significant roadblocks on the way to becoming a data-driven organisation. From the management of data, to a fast-moving set of open source frameworks, to limited industry skills and mounting time and cost pressures, our challenge in data is big.
We all want and need a “fit for purpose” approach to the management of data, especially Big Data, and overcoming the ongoing challenges around the ‘3Vs’ means we get to focus on the most important V: ‘Value’. Come along and join the discussion on how Oracle Big Data Cloud provides value in the management of data and supports your move toward becoming a data-driven organisation.
Speaker
Noble Raveendran, Principal Consultant, Oracle
Journey to the Data Lake: How Progressive Paved a Faster, Smoother Path to In... (DataWorks Summit)
Progressive Insurance is well known for its innovative use of data to better serve its customers, and the important role that Hortonworks Data Platform has played in that transformation. However, as with most things worth doing, the path to the Data Lake was not without its challenges. In this session, I’ll share our top use cases for Hadoop – including telematics and display ads, how a skills shortage turned supporting these applications into a nightmare, and how – and why – we now use Syncsort DMX-h to accelerate enterprise adoption by making it quick and easy (or faster and easier) to populate the data lake – and keep it up to date – with data from across the enterprise. I’ll discuss the different approaches we tried, the benefits of using a tool vs. open source, and how we created our Hadoop Ingestor app using Syncsort DMX-h.
Big SQL: Powerful SQL Optimization - Re-Imagined for Open Source (DataWorks Summit)
Let's be honest - there are some pretty amazing capabilities locked in proprietary SQL engines which have had decades of R&D baked into them. At this session, learn how IBM, working with the Apache community, has unlocked the value of its SQL optimizer for Hive, HBase, ObjectStore, and Spark - helping customers avoid lock-in while providing the best performance, concurrency and scalability for complex, analytical SQL workloads. You'll also learn how the SQL engine was extended and integrated with Ambari, Ranger, YARN/Slider and HBase. We share the results of this project, which has enabled running all 99 TPC-DS queries at a world-record 100 TB scale factor.
Hadoop and Spark are big data frameworks used to extract useful insights from data, and big data projects span a variety of scenarios, from ingestion and data prep through data management, processing, analysis and visualization. Each step requires specialized toolsets to be productive. In this talk I will share examples of solutions in the Big Data ecosystem, such as Cask, StreamSets, Datameer, AtScale and Dataiku on Microsoft’s Azure HDInsight, that simplify your Big Data solutions. Azure HDInsight is a cloud Spark and Hadoop service for the enterprise, and running these tools on it lets you take advantage of all the benefits of HDInsight, giving you the best of both worlds. Join this session for practical information that will enable faster time to insights for you and your business.
Securing your Big Data Environments in the Cloud (DataWorks Summit)
Big Data tools are becoming a critical part of enterprise architectures, and as such, securing the data, at rest and in motion, is a necessity. This is even more true when you’re implementing these solutions in the cloud and the data doesn't reside within the confines of your trusted data center. There is also a fine balance between implementing enterprise-grade security and maintaining performance given the overheads of encryption and/or identity management.
This session is designed to tackle these challenges head on and explain the various options available in the cloud. The focal points are the implementation of tools like Ranger and Knox for cloud deployments, but we also pay attention to the security features offered in the cloud that complement this process and secure the data in unprecedented ways.
Cloud security and open source security tools are a powerful combination when it comes to securing your data lake.
In 2015/16 Worldpay deployed its Enterprise Data Platform - a highly secure cluster used for analysis of over 65 billion card transactions and the subject of last year's Hadoop Summit keynote in Dublin. A year on, we are now rapidly expanding our platform with true multi-tenancy. For our first tenant we have built and deployed the analytics and reporting for our central platforms. Our second tenant deploys 'decision engines' into our core business systems; these allow Worldpay to make decisions, derived from machine learning, on how we authorise and route payments traffic and how these affect the consumer, merchant and other business partners. We are also developing other tenants for systems management and security. This talk will look at what it means to truly have a single enterprise data lake and multiple tenants that share that data, and will look forward to how we will extend the platform in 2017 with Hadoop 3.
Addressing Enterprise Customer Pain Points with a Data Driven Architecture (DataWorks Summit)
Customers implementing Big Data analytics projects in enterprise environments driven by line-of-business applications face three critical issues: managing complexity, data movement and replication, and cloud integration. In this session you will learn about the characteristics of these pain points and how designing and implementing a data-driven approach enables enterprises to implement quickly and efficiently with a future-proof hybrid cloud architecture.
Empowering you with Democratized Data Access, Data Science and Machine Learning (DataWorks Summit)
Data science, with its specialized tools and knowledge, has been the forte of data scientists. However, it is not easy even for data scientists to get access to data that could sit in different data stores across the organization. To unleash the power of data and gain valuable insights, machine learning needs to be made easily consumable by various stakeholders, and access to data made simpler. As an organization's data volumes continue to grow, delivering these insights in real time is a complex challenge to solve.
This session will provide an overview of an approach to building a scalable solution where machine and deep learning and access to data are made much more consumable and simpler by the fastest SQL-on-Hadoop engine on the planet, a rich data scientist toolset, and an infrastructure that can deliver the responsiveness needed for production environments.
Speakers:
Pandit Prasad, Program Director, IBM
Ashutosh Mate, Global Senior Solutions Architect, IBM
Enterprise large scale graph analytics and computing based on distributed graph... (DataWorks Summit)
Graph approaches to structuring and analyzing data have been a significant area of interest. Graphs are well-suited to expressing complex interconnections and clusters of highly related entities.
Large-scale graph analytics research has grown fast in recent years, and leveraging the Hadoop 2 ecosystem for graphs is a good approach: enterprise graph computing requires both storing large graphs and computing against them quickly. On the OLTP side, which allows users to query the graph in real time, HBase as a distributed NoSQL database can be the backend storage for persisting a large property graph, with vertices and edges stored as key-value pairs; it also provides high reliability, scalability and fault tolerance for the data, while Solr as the distributed index makes queries more efficient and Titan itself handles caching and transactions. On the OLAP side, TinkerPop's hadoop-gremlin SparkGraphComputer is used to process the whole graph, analyzing every vertex and edge, with the cluster-computing platform handling large, distributed, in-memory graph datasets.
A graph database built on HBase/Solr, combined with graph computing and analysis on Spark, is powerful for discovering valuable information about relationships in complex, large data, and represents a significant business opportunity for the enterprise. It can help graph data analytics in a wide range of domains such as social networking, recommendation engines, advertisement optimization, knowledge representation, health care, education, and security.
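The storage stack above is configured on the JVM side; as a rough sketch of what an OLTP-style query against such a property graph can look like from Python, assuming the HBase-backed graph is exposed through a TinkerPop Gremlin Server endpoint (host, labels and property names here are hypothetical):

```python
# Rough sketch of a real-time neighborhood query against a property graph
# served by a TinkerPop Gremlin Server; host, labels and property names
# are hypothetical, not a specific production deployment.
from gremlin_python.process.anonymous_traversal import traversal
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection

conn = DriverRemoteConnection("ws://gremlin-server.example.com:8182/gremlin", "g")
g = traversal().withRemote(conn)

# Who does 'alice' know? A small traversal that the backend resolves
# against the key-value encoded vertices and edges.
friends = (g.V().has("person", "name", "alice")
             .out("knows")
             .values("name")
             .toList())
print(friends)

conn.close()
```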
How Apache Spark and Apache Hadoop are being used to keep banking regulators ... (DataWorks Summit)
The global financial crisis showed that traditional IT systems at banks were ill-equipped to monitor and manage a risk landscape that was changing daily. The sheer amount of data that needed to be crunched meant that many banks were a day or more behind in calculating, understanding and reporting their risk positions. Post crisis, a review by banking regulators led to new legislation, BCBS 239: Principles for effective risk data aggregation and risk reporting, which requires banks to meet more stringent timeliness requirements in their ability to aggregate and report on their quickly changing risk positions, or risk fines running into the millions of dollars. To meet these new requirements, banks have been forced to rethink their traditional IT architectures, which are unable to cope with the sheer volume of risk data, and are instead turning to Apache Hadoop and Apache Spark to build the next generation of risk systems. In this talk you will discover how some of the leading banks in the world are leveraging Apache Hadoop and Apache Spark to meet the BCBS 239 regulation.
Speaker
Kunal Taneja
Verizon Centralizes Data into a Data Lake in Real Time for Analytics (DataWorks Summit)
Verizon – Global Technology Services (GTS) was challenged by a multi-tier, labor-intensive process when trying to migrate data from disparate sources into a data lake to create financial reports and business insights. Join this session to learn more about how Verizon:
• Easily accessed data from multiple sources including SAP data
• Ingested data into major targets including Hadoop
• Achieved real-time insights from data leveraging change data capture (CDC) technology
• Reduced costs and labor
GeoWave: Open Source Geospatial/Temporal/N-dimensional Indexing for Accumulo,... (DataWorks Summit)
GeoWave is an open-source library that connects geospatial software with distributed computing frameworks. GeoWave leverages the scalability of a distributed key-value store for effective storage, retrieval, and analysis of massive geospatial datasets. It uses a space filling curve to preserve locality between multi-dimensional objects and the single dimensional sort order imposed by key-value stores. What this means to a user is that distributed spatial and spatial-temporal retrieval and analysis can be effectively accomplished at a massive scale.
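As a toy illustration of the idea in the previous paragraph (a plain Z-order / Morton code, not GeoWave's actual index implementation), interleaving the bits of two coordinates produces a single sortable key in which nearby points tend to stay close together in the key-value store's sort order:

```python
# Toy space filling curve: interleave the bits of x and y (Morton / Z-order
# code) so that two dimensions collapse into one sortable key while roughly
# preserving spatial locality. This is only an illustration of the concept.
def morton_key(x: int, y: int, bits: int = 16) -> int:
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)       # even bit positions come from x
        key |= ((y >> i) & 1) << (2 * i + 1)   # odd bit positions come from y
    return key

# Nearby grid cells map to nearby keys, so a key range scan approximates
# a spatial window query; a distant point lands far away in key space.
for point in [(10, 10), (10, 11), (11, 10), (200, 300)]:
    print(point, morton_key(*point))
```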
At its core, GeoWave solves the problem of multi-dimensional indexing, and particularly extends this capability to spatial/temporal use cases. GeoWave supports raster, vector, and point cloud data, and provides common spatial algorithms that can be extended to create deep analytic capabilities. It also performs fast subsampling via distributed rendering that integrates with GeoServer, so that a user can interactively visualize data at map scale regardless of density.
Our goal in presenting GeoWave to the Hadoop Summit is to introduce it to the big data community. We will present GeoWave at a moderate level of detail, to include a short demonstration, and hopefully answer any questions regarding maturity, suitability and implementation details.
Hadoop Reporting and Analysis - Jaspersoft (Hortonworks)
Hadoop is deployed for a variety of uses, including web analytics, fraud detection, security monitoring, healthcare, environmental analysis, social media monitoring, and other purposes.
It Takes a Village: Organizational Alignment to Deliver Big Data Value in Hea... (DataWorks Summit)
The business and technology teams within a health insurer must align the company’s central data platform with its data strategy. That requires substantial organizational alignment. Hear the firsthand perspective from Health Care Service Corporation (HCSC), the largest customer-owned health insurance company in the United States. The speaker will cover how they integrated membership information, regulatory compliance, and the general ledger, to improve overall healthcare management. At HCSC, the strong alignment between executive leadership, business portfolio direction, architectural strategy, technology delivery, and program management have helped create leading-edge capabilities which help the company respond nimbly to a quickly evolving healthcare industry.
Top Trends in Building Data Lakes for Machine Learning and AI (Holden Ackerman)
Presentation by Ashish Thusoo, Co-Founder & CEO at Qubole, exploring the big data industry trends in moving from data warehouses to cloud-based data lakes. This presentation will cover how companies today are seeing a significant rise in the success of their big data projects by moving to the cloud to iteratively build more cost-effective data pipelines and new products with ML and AI.
It also uncovers how services like AWS, Google, Oracle, and Microsoft Azure provide the storage and compute infrastructure to build self-service data platforms that can enable all teams and new products to scale iteratively.
Verizon Centralizes Data into a Data Lake in Real Time for Analytics (Hortonworks)
Verizon Global Technology Services (GTS) was challenged by a multi-tier, labor-intensive process when trying to migrate data from disparate sources into a data lake to create financial reports and business insights.
View the webinar on-demand here: https://hortonworks.com/webinar/verizon-centralizes-data-into-data-lake/
With its large install base in production, the Storm 1.x line has proven itself as a stable and reliable workhorse that scales well horizontally. Much has been learnt from evolving the 1.x line that we can now leverage to build the next generation execution engine. Under the STORM-2284 umbrella, we are working hard to bring you this new engine which is being redesigned at a fundamental level for Storm 2.0. The goal is to dramatically improve performance and enhance Storm's abilities without breaking compatibility.
This improved vertical scaling will help meet the needs of the growing user base by delivering more performance with less hardware.
In this talk, we will take an in-depth look at the existing and proposed designs for Storm's threading model and the messaging subsystem. We will also do a quick run-down of the major proposed improvements and share some early results from the work in progress.
Speaker
Roshan Naik, Senior MTS, Hortonworks
The Power of Intelligent Flows: Real-Time IoT Botnet Classification with Apac... (DataWorks Summit)
The last 5 years have been marked by an explosion of Internet-connected devices. From cars to solar power, from TVs to juice makers, modern life is filled with interconnected smart devices.
But while those ubiquitous devices enhance the interaction with the technology that surrounds us, the lifecycle management of IoT firmware and poor security design choices still present a significant threat to our daily lives.
Despite the ascent of threats like the Mirai botnet, the amount of published research around how to programmatically detect new IoTs in the wild has been somewhat limited.
In this presentation we introduce Data Engineering in the context of cyber security, discuss why it is important to move away from the view that security log pipelines are enrichment and indicator matching tools, and push the boundaries of “Simple Event Processing” to demonstrate how Apache NiFi and Apache MiNiFi’s feature rich dataflows can be used to dynamically identify new IoT botnet activities in the wild.
Speakers
Andre Fucs De Miranda, Independent Consultant, Fluenda
Andy LoPresto, Sr. Member of Technical Staff, Hortonworks
MaaS (Model as a Service): Modern Streaming Data Science with Apache Metron (DataWorks Summit)
Apache Metron is a streaming cybersecurity application built on Apache Storm and Hadoop. One of its core missions is to enable advanced analytics through machine learning and data science for its users. Because of the relative immaturity of data science platform infrastructure that is integrated into Hadoop and oriented to streaming analytics applications, we have been forced to create the requisite platform components out of necessity, utilizing many pieces of the Hadoop ecosystem.
In this talk, we will speak about the Metron analytics architecture and how it utilizes a custom data science model deployment and autodiscovery service that is tightly integrated with Hadoop via YARN and ZooKeeper. We will discuss how we interact with the models deployed there via a custom domain-specific language that can query models as data streams past. We will also discuss the full-stack data science tooling that has been created to enable data science at scale on an advanced analytics streaming application.
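As a loose sketch of what a deployed model behind such a service can look like (the route, payload fields and scoring logic below are hypothetical illustrations, not Metron's actual MaaS contract), a REST-served model can be as small as:

```python
# Loose sketch of a REST-served scoring model of the kind a model-as-a-service
# layer could deploy and discover. The route, payload fields and "model"
# logic here are hypothetical placeholders.
from flask import Flask, jsonify, request

app = Flask(__name__)

def score_domain(domain: str) -> float:
    # Placeholder heuristic standing in for a trained model:
    # long, digit-heavy domains score as more suspicious.
    digits = sum(ch.isdigit() for ch in domain)
    return min(1.0, (len(domain) + 3 * digits) / 40.0)

@app.route("/apply", methods=["POST"])
def apply_model():
    payload = request.get_json(force=True)
    return jsonify({"is_malicious_score": score_domain(payload.get("domain", ""))})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```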
Speaker:
Casey Stella, Principal Software Engineer/Data Scientist, Hortonworks
How Big Data and Deep Learning are Revolutionizing AML and Financial Crime De... (DataWorks Summit)
Banks, payment providers and capital markets firms are under intense regulatory mandate to process huge amounts of transaction-related data from both traditional and non-traditional sources. Compliance teams need to constantly analyze data in motion (wires, fund transfers, banking transactions) and data at rest (years' worth of historical data) for the actionable intelligence required for Suspicious Activity Reports: to discover illegal activity and provide detailed reporting to authorities. Annual estimates of global money laundering flows range anywhere from $1 trillion to $2 trillion - almost 5% of global GDP. Almost all of this is laundered via retail and merchant banks, payment networks, securities and futures firms, casino services and clubs, and so on - which explains why annual AML-related fines on banking organizations run into the billions and are increasing every year. However, the number of SARs (Suspicious Activity Reports) filed by banking institutions is much higher as a category compared with the numbers filed by these other businesses. In this presentation we will discuss the business imperatives, value drivers and the woeful inadequacy of current technology architectures and approaches in tackling AML. We will then pivot to a deep dive on how Big Data and predictive analytics can ease and solve these vexing challenges that banking executives are grappling with globally.
Speaker
Sanjay Kumar, GM Industry Solutions - Telecom and FS, Hortonworks
Introduction: This workshop will provide a hands-on introduction to basic Machine Learning techniques with Spark ML using a Sandbox on students’ personal machines.
Format: A short introductory lecture on select important supervised and unsupervised Machine Learning techniques followed by a demo, lab exercises and a Q&A session. The lecture will be followed by lab time to work through the lab exercises and ask questions.
Objective: To provide a quick, hands-on introduction to Machine Learning with Spark ML. In the lab, you will use the following components: Apache Zeppelin (a “Modern Data Science Toolbox”) and Apache Spark. You will learn how to analyze the data, structure the data, train Machine Learning models and apply them to answer real-world questions (a short PySpark sketch of this flow follows the speaker listing below).
Pre-requisites: Registrants must bring a laptop that can run the Hortonworks Data Cloud.
Speaker:
Robert Hryniewicz, Developer Advocate, Hortonworks
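Below is a minimal sketch of the kind of supervised Spark ML flow the lab walks through; the CSV path, column names and label are hypothetical, not the actual lab dataset.

```python
# Minimal Spark ML sketch: assemble features and fit a classifier.
# Assumes a CSV with numeric columns dep_hour, distance and a 0/1
# label column named delayed (all hypothetical).
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("spark-ml-intro-sketch").getOrCreate()

df = spark.read.csv("/tmp/flight_delays.csv", header=True, inferSchema=True)
train, test = df.randomSplit([0.8, 0.2], seed=42)

assembler = VectorAssembler(inputCols=["dep_hour", "distance"], outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="delayed")
model = Pipeline(stages=[assembler, lr]).fit(train)

# Apply the trained model to held-out data and inspect a few predictions.
model.transform(test).select("delayed", "prediction", "probability").show(5)
```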
DataWorks Summit 2017 - Sydney Keynote
Madhu Kochar, Vice President, Analytics Product Development and Client Success, IBM
Data science holds the promise of transforming businesses and disrupting entire industries. However, many organizations struggle to deploy and scale key technologies such as machine learning and deep learning. IBM will share how it is making data science accessible to all by simplifying the use of a range of open source technologies and data sources, including high performing and open architectures geared for cognitive workloads.
Spark plays an important role in helping data scientists solve all kinds of problems, especially since the release of SparkR, which provides very friendly APIs for traditional data scientists. However, processing various data sizes, data formats and models leads to different application patterns compared with traditional R. In this talk, we will illustrate practical experience using SparkR to solve some typical data science problems, such as performance improvements for SparkR and native R interoperation, how to efficiently load data from HBase, which is a very common data source, how to schedule a large-scale machine learning job composed of multiple single-machine R jobs, how to tune performance for jobs triggered by many different users, and how to use SparkR in a cloud-based environment. Finally, we will briefly introduce the community efforts in progress on SparkR for the coming releases.
Speakers:
Yanbo Liang, Software Engineer, Hortonworks
Casey Stella, Principal Software Engineer/Data Scientist, Hortonworks
Driving in the Desert - Running Your HDP Cluster with Helion, Openstack, and ... (DataWorks Summit)
DataWorks Summit 2017 - Sydney
Alejandro Tesch, Cloud Evangelist, Asia Pacific and Japan, HPE
Big Data is a hot topic for most organisations today as they race to convert vast amounts of data into useful information that can be leveraged to make critical decisions and recommendations within very limited time windows. There is a widely accepted talent gap when it comes to creating and managing Hadoop clusters; even for experts, it can take hours (or days) to get a fully functional Hadoop farm up and running. The HDP Ambari plugin for Sahara addresses most of these challenges by facilitating the deployment of Hortonworks Hadoop clusters and providing a set of open APIs to facilitate data analytics tasks in your own cloud. In this presentation we will cover why it makes sense to run your data analytics cluster in your cloud, and we will demonstrate basic Sahara / Ambari functionality.
The Apache Way describes the community patterns and style of governance that all projects at the Apache Software Foundation are guided by. With a span of more than 20 years, and now more than 300 projects, the Apache Way has helped to establish long lasting, diverse communities of volunteers who collaborate to build software used by millions of users worldwide.
In this talk, I’ll outline the underlying principles of the Apache Way, what this means for projects and their ecosystems, and how the Apache Software Foundation is structured to support such a large number of projects.
Speaker
Brett Porter, Director, Apache Software Foundation
Data Guarantees and Fault Tolerance in Streaming Systems (DataWorks Summit)
Does your big data streaming pipeline have a hole in its pocket? Streaming involves gathering data, processing it and delivering the results to the intended destinations in real time. Glitches at any stage can cause data loss unless the products employed in the pipeline provide the necessary guarantees and are configured properly to deliver on those guarantees.
Real-time stream processing brings unique challenges with respect to data handling guarantees and fault tolerance. Each streaming product comes with a unique approach to tackling these problems. When assembling a streaming pipeline, it is important to understand this critical topic for proper selection and configuration of the individual components of the pipeline. Determining whether you are missing records in your data lake can be an expensive exercise, and tracking down the cause to prevent it from recurring can be extremely difficult.
To help you build reliable streaming pipelines, this talk will give you a better understanding of the problems involved in real-time streaming, the kinds of guarantees involved, and how they are handled in popular open source products such as Storm, Flink, Kafka, Hive Streaming APIs and Flume.
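For example, on the Kafka leg of such a pipeline, producer-side delivery guarantees are largely a matter of configuration. A minimal sketch with the confluent-kafka Python client (assuming a 1.0+ client; the broker address and topic are hypothetical):

```python
# Sketch of producer-side delivery guarantees with the confluent-kafka
# client: acks=all plus idempotence means retries cannot lose or duplicate
# records on the producer side (broker address and topic are hypothetical).
from confluent_kafka import Producer

producer = Producer({
    "bootstrap.servers": "broker1:9092",
    "acks": "all",                 # wait for all in-sync replicas to acknowledge
    "enable.idempotence": True,    # producer retries cannot create duplicates
})

def report_delivery(err, msg):
    # Surface failed deliveries instead of silently losing records.
    if err is not None:
        print(f"delivery failed: {err}")

producer.produce("risk-events", value=b'{"id": 1}', on_delivery=report_delivery)
producer.flush()                   # block until outstanding sends complete
```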
Speaker
Roshan Naik, Senior MTS, Hortonworks
Introduction: This workshop will provide a hands-on introduction to Apache Hadoop using the HDP Sandbox on students’ personal machines.
Format: A short introductory lecture about Apache Hadoop and a few key additional Apache projects in the extended ecosystem used in the lab, followed by a demo, lab exercises and a Q&A session.
Objective: To provide a quick, hands-on introduction to Hadoop. This lab will use the following Hadoop components: HDFS, YARN, Apache Pig, Apache Hive, Apache Spark, and Apache Ambari User Views. You will learn how to move data into HDFS, explore the data, clean the data, issue SQL queries and then build a report with Apache Zeppelin (a short sketch of this flow follows the speaker listing below).
Pre-requisites: Registrants must bring a laptop and have the Hortonworks Sandbox installed.
Speaker:
Rafael Coss, Data Community Developer Advocate, Hortonworks
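Below is a short sketch of the lab flow referenced in the objective above; the file names, paths and columns are hypothetical, not the actual lab dataset.

```python
# Sketch of the lab flow: copy a local file into HDFS, then clean and
# query it with Spark SQL (file names and columns are hypothetical).
import subprocess
from pyspark.sql import SparkSession

# Step 1: land the raw file in HDFS (equivalent to using the Ambari Files view).
subprocess.run(["hdfs", "dfs", "-mkdir", "-p", "/user/lab/trucks"], check=True)
subprocess.run(["hdfs", "dfs", "-put", "-f", "trucks.csv", "/user/lab/trucks/"], check=True)

spark = SparkSession.builder.appName("hadoop-crash-course-sketch").getOrCreate()

# Step 2: explore and clean - drop rows with missing mileage or fuel readings.
trucks = spark.read.csv("/user/lab/trucks/trucks.csv", header=True, inferSchema=True)
clean = trucks.dropna(subset=["miles", "gas"])

# Step 3: issue the kind of SQL query that would later be charted in Zeppelin.
clean.createOrReplaceTempView("trucks")
spark.sql("SELECT driverid, SUM(miles) / SUM(gas) AS mpg FROM trucks GROUP BY driverid").show()
```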
DataWorks Summit 2017 - Sydney Keynote
Scott Gnau, Chief Technology Officer, Hortonworks
Data has become the most valuable asset for every enterprise. As businesses undergo data transformation, leading organizations are turning to data science and machine learning to drive more business value out of their data. In this talk, Scott will examine the trends and the key requirements needed to evolve to next-generation analytics and operations.
The Future of Data in Telecom and the Rise of Connected Communities (DataWorks Summit)
The Telecom Industry has been at the heart of the new Digital Transformation Era with the focal points of the smart phone, digital content, and social media allowing us to interact with the new world. This digital consumer era is overlapping with the connected world of smart cities, connected car, smart homes, connected medical devices, and the industrial internet of things. This is giving rise to the new connected communities, member networks or marketplaces sharing data and insights across these new functional ecosystems. Service providers are being challenged to better manage the relationship and the experiences of their customers, build and leverage smart networks, and drive or facilitate these new digital services and connected communities.
Join Sanjay Kumar, General Manager of the Telecom Industry at Hortonworks, as he facilitates an interactive session with your peers and other executives from service providers and operators to discuss the future of data in telecom and the rise of connected communities. Share your experiences and hear from other telecom service providers on their journey with big data and the transformation to becoming a data-driven organization.
During this interactive session, areas of discussion will include:
- Big data and real-time analytics for enhancing customer experience and targeted advertising
- Analytics driven next generation OSS and the Smart Network
- Connected communities that leverage blockchain with big data
In the age of the connected world, hear from industry leaders about the role of the service provider and their relationship with their customers and connected communities across this new digital ecosystem.
Speaker
Sanjay Kumar, General Manager, Telecom & FS, Hortonworks
Introduction: This workshop will provide a hands-on introduction to Apache Spark using the HDP Sandbox on students’ personal machines.
Format: A short introductory lecture about Apache Spark components used in the lab followed by a demo, lab exercises and a Q&A session. The lecture will be followed by lab time to work through the lab exercises and ask questions.
Objective: To provide a quick, hands-on introduction to Apache Spark. This lab will use the following Spark and Apache Hadoop components: Spark, Spark SQL, Apache Hadoop HDFS, Apache Hadoop YARN, Apache ORC, and Apache Ambari User Views. You will learn how to move data into HDFS using Spark APIs, create Apache Hive tables, explore the data with Spark and Spark SQL, transform the data and then issue some SQL queries (a short sketch of this flow follows the speaker listing below).
Pre-requisites: Registrants must bring a laptop that can run the Hortonworks Data Cloud.
Speaker:
Robert Hryniewicz, Developer Advocate, Hortonworks
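Below is a short sketch of the flow described in the objective above; the CSV path, table and column names are hypothetical, not the actual lab dataset.

```python
# Sketch of the lab flow: ingest a CSV with Spark, store it as an ORC-backed
# Hive table, then query it with Spark SQL (paths and columns are hypothetical).
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("spark-crash-course-sketch")
         .enableHiveSupport()
         .getOrCreate())

# Move raw data into HDFS-backed storage via the DataFrame API.
geo = spark.read.csv("/tmp/geolocation.csv", header=True, inferSchema=True)
geo.write.format("orc").mode("overwrite").saveAsTable("geolocation")

# Transform and query with Spark SQL.
spark.sql("""
    SELECT driverid, COUNT(*) AS risky_events
    FROM geolocation
    WHERE event != 'normal'
    GROUP BY driverid
    ORDER BY risky_events DESC
""").show(10)
```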
Apache Zeppelin has become a popular way to unlock the value of the data lake, thanks to its user interface and appeal to business users. These business users ask their IT departments for access to Zeppelin. Enterprise IT departments want to help their business users, but they have several enterprise concerns, such as security, integration with their corporate LDAP/AD, scalability in a multi-user environment, and integration with Ranger and Kerberos. This session will walk through these enterprise concerns and how they can be handled with Zeppelin.
Speaker
Simon Elliston Ball, Director Product Management, Cyber Security, Hortonworks
Using Apache Hadoop and related technologies as a data warehouse has been an area of interest since the early days of Hadoop. In recent years Hive has made great strides towards enabling data warehousing by expanding its SQL coverage, adding transactions, and enabling sub-second queries with LLAP. But data warehousing requires more than a full powered SQL engine. Security, governance, data movement, workload management, monitoring, and user tools are required as well. These functions are being addressed by other Apache projects such as Ranger, Atlas, Falcon, Ambari, and Zeppelin. This talk will examine how these projects can be assembled to build a data warehousing solution. It will also discuss features and performance work going on in Hive and the other projects that will enable more data warehousing use cases. These include use cases like data ingestion using merge, support for OLAP cubing queries via Hive’s integration with Druid, expanded SQL coverage, replication of data between data warehouses, advanced access control options, data discovery, and user tools to manage, monitor, and query the warehouse.
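As a hedged sketch of the merge-based ingestion mentioned above (the host, database, table and column names are hypothetical, and the target must already be a transactional, bucketed ORC table), a batch of changed records can be applied to a Hive ACID table through PyHive:

```python
# Hedged sketch of applying a change batch to a Hive ACID table with MERGE
# via PyHive; host, schema, table and column names are hypothetical.
from pyhive import hive

conn = hive.Connection(host="hiveserver2.example.com", port=10000, username="etl")
cur = conn.cursor()

# Delete rows flagged for deletion, update the rest of the matches,
# and insert brand-new customer records in a single statement.
cur.execute("""
    MERGE INTO warehouse.customers AS t
    USING staging.customer_changes AS s
    ON t.customer_id = s.customer_id
    WHEN MATCHED AND s.op = 'D' THEN DELETE
    WHEN MATCHED THEN UPDATE SET email = s.email, city = s.city
    WHEN NOT MATCHED THEN INSERT VALUES (s.customer_id, s.email, s.city)
""")

cur.close()
conn.close()
```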
Speaker
Alan Gates, Co-founder, Hortonworks
Intelligently Collecting Data at the Edge - Intro to Apache MiNiFi (DataWorks Summit)
Apache NiFi provided a revolutionary data flow management system with a broad range of integrations with existing data production, consumption, and analysis ecosystems, all covered with robust data delivery and provenance infrastructure. Now learn about the follow-on project which expands the reach of NiFi to the edge, Apache MiNiFi. MiNiFi is a lightweight application which can be deployed on hardware orders of magnitude smaller and less powerful than the existing standard data collection platforms. With both a JVM compatible and native agent, MiNiFi allows data collection in brand new environments — sensors with tiny footprints, distributed systems with intermittent or restricted bandwidth, and even disposable or ephemeral hardware. Not only can this data be prioritized and have some initial analysis performed at the edge, it can be encrypted and secured immediately. Local governance and regulatory policies can be applied across geopolitical boundaries to conform with legal requirements. And all of this configuration can be done from central command & control using an existing NiFi with the trusted and stable UI data flow managers already love.
Expected prior knowledge / intended audience: developers and data flow managers should have passing knowledge of Apache NiFi as a platform for routing, transforming, and delivering data through systems (a brief overview will be provided). The talk will focus on extending the data collection, routing, provenance, and governance capabilities of NiFi to IoT/edge integration via MiNiFi.
Speaker
Andy LoPresto, Sr. Member of Technical Staff, Hortonworks
Performance Update: When Apache ORC Met Apache Spark (DataWorks Summit)
Apache Spark 1.4 introduced support for Apache ORC. However, it initially did not take advantage of the full power of ORC: for instance, it was slow because ORC vectorization was not used, and predicate push-down was not supported on DATE types. Recently the Apache Spark community has started to use the latest Apache ORC, which includes new enhancements that address these limitations. In this talk, we show the results of integrating the latest Apache ORC and Apache Spark. We will also review the latest enhancements and the roadmap.
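As a hedged sketch of what this looks like from the Spark side (assuming Spark 2.3 or later, where the native vectorized ORC reader is available; the path and column are hypothetical):

```python
# Sketch of reading ORC with the newer reader, vectorization and predicate
# push-down enabled; configuration keys assume Spark 2.3+.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = (SparkSession.builder
         .appName("orc-spark-sketch")
         .config("spark.sql.orc.impl", "native")               # use the newer ORC reader
         .config("spark.sql.orc.enableVectorizedReader", "true")
         .config("spark.sql.orc.filterPushdown", "true")       # push predicates into ORC
         .getOrCreate())

events = spark.read.orc("/warehouse/events_orc")

# With push-down enabled, this filter can skip ORC stripes whose min/max
# statistics rule them out, instead of scanning the whole dataset.
recent = events.filter(col("event_date") >= "2017-01-01")
print(recent.count())
```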
Speakers:
Owen O'Malley, Co-founder & Technical Fellow, Hortonworks
Dongjoon Hyun, Staff Software Engineer, Hortonworks
Pivotal HD and Spring for Apache Hadoop (marklpollack)
In this webinar we introduce the concepts of Hadoop and dive into some details unique to the Pivotal HD distribution, namely HAWQ, which brings ANSI-compliant SQL to Hadoop.
We also introduce the Spring for Apache Hadoop project, which simplifies developing Hadoop applications by providing a unified configuration model and easy-to-use APIs for HDFS, MapReduce, Pig, Hive, and HBase. It also provides integration with other Spring ecosystem projects such as Spring Integration and Spring Batch, enabling you to develop solutions for big data ingest/export and Hadoop workflow orchestration. The new Spring XD umbrella project is also introduced.
Process and Visualize Your Data with Revolution R, Hadoop and GoogleVisHortonworks
In this session, attendees will learn how to use R in the distributed environment of Hadoop using the rmr package. Additionally, the R package googleVis will be used to show how application development teams can incorporate the power of R and the power of Google Chart Tools into their applications quickly and easily. The result is a rich custom data visualization with far less coding than would otherwise be required. The session will begin by discussing R basics and then move to concrete examples of statistical analysis on data sets. This will be accompanied by an application development example showing custom visualization of the analysis using googleVis. The application development example will show a browser-based app both kicking off the data set analysis using R and visualizing the result. Visualization examples will use both googleVis and basic Google Chart Tools. Attendees will leave the session with a concrete example of how to incorporate R into their existing application development practices and how to use Hadoop and its ecosystem to build custom visualizations.
10 Amazing Things To Do With a Hadoop-Based Data LakeVMware Tanzu
Greg Chase, Director of Product Marketing, presents "10 Amazing Things to Do With a Hadoop-Based Data Lake" at the Strata Conference + Hadoop World 2014 in NYC.
Hortonworks Data In Motion Webinar Series Pt. 2Hortonworks
Learn how Hortonworks DataFlow (HDF), powered by Apache NiFi, MiNiFi, Kafka and Storm, and its associated HDF Certification Program make it easier and faster to integrate different systems. Includes highlights of the latest partner integrations from HPE, SAS, Attunity, Impetus Technologies, Kepware and Midfin Systems.
Watch the webinar on-demand: http://hortonworks.com/webinar/make-big-data-ecosystem-work-better/
HDF Partner certification program: http://hortonworks.com/partners/product-integration-certification/#hdf-integration
Big Data is a hot topic that has captured the attention of the IT industry globally. It is a popular term used to describe the exponential growth and availability of data, both structured and unstructured. Big data may be as important to business – and society – as the Internet has become. More accurate analyses can lead to more confident decision making, and better decisions can mean greater operational efficiency, cost reduction and reduced risk.
This presentation focuses on the why, what and how of big data as we explore some of Microsoft's big data solutions, the HDInsight Azure service and Power BI, providing insights into the world of big data.
Legacy ERP architectures offer an incredibly efficient means of operational resource management, but extracting business insights from them is a real challenge. ERP systems built up over the past 30 years, such as SAP, can be hard to interact with, especially at the source database level: the initial translation of business logic and hierarchies requires significant customization, as does merging those changes into analytical applications. Overall, designing self-service reporting with business-level context can be quite cumbersome; for an example platform like SAP, which contains pre-packaged modules (MM, SD, PP, etc.), it means integrating these systems into a series of pre-built analytics.
This session covers the orchestration and integration of a wide range of open source technologies with commercial CDC and reporting solutions into a reference solution that mimics several real customer scenarios living on relational platforms today. Key considerations of extracting from the operational system of record, especially merging multiple systems in different time zones, will be addressed, along with the integration concerns of an analytics Hadoop platform, using Hive ACID and MERGE as well as flattening techniques for dimensional models. Customers are often limited in the range of data their ERP can retain, with older data offloaded to secondary systems or cold archives. With this approach that limitation goes away, and the opportunities expand to real-time reporting across all of history and new use cases built on advanced machine learning methods.
Speakers
Jordan Martz, Director of Tech Solutions, Attunity
David Freriks, Technology Evangelist, Qlik
Discover HDP 2.1: Apache Solr for Hadoop SearchHortonworks
Apache Solr is the open source platform for searching data stored in Hadoop. Solr powers search on many of the world's largest Internet sites, enabling powerful full-text search and near real-time indexing. Whether users search for tabular, text, geo-location or sensor data in Hadoop, they find it quickly with Apache Solr. Hortonworks Data Platform 2.1 includes Apache Solr.
In this deck from their 30-minute webinar, Rohit Bakhshi, Hortonworks product manager, and Paul Codding, Hortonworks solution engineer, describe how Solr works within HDP's YARN-based architecture.
Supporting Financial Services with a More Flexible Approach to Big DataHortonworks
Financial services companies can reap tremendous benefits from 'Big Data' and they have moved quickly to deploy it. But these companies also place heavy demands on 'Big Data' infrastructure for flexibility, reliability and performance. In this webinar, Hortonworks joins WANDisco to look at three examples of using 'Big Data' to get a more comprehensive view of customer behavior and activity in the banking and insurance industries. Then we'll pull out the common threads from these examples, and see how a flexible next-generation Hadoop architecture lets you get a step up on improving your business performance. Join us to learn:
How to leverage data from across an entire global enterprise
How to analyze a wide variety of structured and unstructured data to get quick, meaningful answers to critical questions
What industry leaders have put in place
Discover hdp 2.2: Data storage innovations in Hadoop Distributed Filesystem (...Hortonworks
Hortonworks Data Platform 2.2 includes HDFS for data storage. In this 30-minute webinar, we discussed data storage innovations, including heterogeneous storage, encryption, and operational security enhancements.
These slides from the Discover HDP 2.2 Webinar Series: Data Storage Innovations in HDFS explore heterogeneous storage, data encryption and operational security.
Ever wonder what Hadoop might look like in 12 months, 24 months or longer? Apache Hadoop MapReduce has undergone a complete overhaul to emerge as Apache Hadoop YARN, a generic compute fabric that supports MapReduce and other application paradigms. As a result, Hadoop looks very different than it did 12 months ago. This talk will take you through some ideas for YARN itself and the myriad ways it is moving the needle for MapReduce, Pig, Hive, Cascading and other data-processing tools in the Hadoop ecosystem.
Chicago Data Summit: Cloudera's Distribution including Apache Hadoop & Cloude...Cloudera, Inc.
This session will discuss what's new in the recently released CDH3 and Enterprise 3.5 products. We'll review how usage of Hadoop has been evolving in the enterprise and how CDH3 and Enterprise 3.5 meet these new challenges with advances in functionality, performance, security and manageability.
Introduction: This workshop will provide a hands-on introduction to Machine Learning (ML) with an overview of Deep Learning (DL).
Format: An introductory lecture on several supervised and unsupervised ML techniques, followed by a light introduction to DL and a short discussion of the current state of the art. Several Python code samples using the scikit-learn library will be introduced that users will be able to run in the Cloudera Data Science Workbench (CDSW).
Objective: To provide a quick, hands-on introduction to ML with Python's scikit-learn library. The environment in CDSW is interactive, and the step-by-step guide will walk you through setting up your environment, exploring datasets, and training and evaluating models on popular datasets. By the end of the crash course, attendees will have a high-level understanding of popular ML algorithms and the current state of DL, what problems they can solve, and will walk away with basic hands-on experience training and evaluating ML models.
Prerequisites: For the hands-on portion, registrants must bring a laptop with a Chrome or Firefox web browser. These labs will be done in the cloud, so no installation is needed. Everyone will be able to register and start using CDSW after the introductory lecture concludes (about one hour in). Basic knowledge of Python is highly recommended.
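For a flavor of the kind of scikit-learn samples used in the workshop (the actual lab notebooks are not reproduced here), a minimal supervised-learning example on a built-in dataset looks like this:

```python
# Minimal supervised-learning example in the spirit of the workshop: train and
# evaluate a classifier on a popular built-in dataset with scikit-learn.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

predictions = model.predict(X_test)
print("Test accuracy:", accuracy_score(y_test, predictions))
```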
Floating on a RAFT: HBase Durability with Apache RatisDataWorks Summit
In a world with a myriad of distributed storage systems to choose from, the majority of Apache HBase clusters still rely on Apache HDFS. Theoretically, any distributed file system could be used by HBase. One major reason HDFS is predominantly used is the specific durability requirements of HBase's write-ahead log (WAL), which HDFS guarantees correctly. However, with sufficient effort, HBase's use of HDFS for WALs can be replaced.
This talk will cover the design of a "Log Service" which can be embedded inside of HBase that provides a sufficient level of durability that HBase requires for WALs. Apache Ratis (incubating) is a library-implementation of the RAFT consensus protocol in Java and is used to build this Log Service. We will cover the design choices of the Ratis Log Service, comparing and contrasting it to other log-based systems that exist today. Next, we'll cover how the Log Service "fits" into HBase and the necessary changes to HBase which enable this. Finally, we'll discuss how the Log Service can simplify the operational burden of HBase.
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiDataWorks Summit
Utilizing Apache NiFi, we read various open data REST APIs and camera feeds to ingest crime and related data in real time, streaming it into HBase and Phoenix tables. HBase makes an excellent storage option for our real-time time-series data sources. We can immediately query our data with Apache Zeppelin against Phoenix tables as well as Hive external tables over HBase.
Apache Phoenix tables also make a great option since we can easily put microservices on top of them for application usage. I have an example Spring Boot application that reads from our Philadelphia crime table for front-end web applications as well as RESTful APIs.
Apache NiFi makes it easy to push records with schemas to HBase and insert into Phoenix SQL tables.
Resources:
https://community.hortonworks.com/articles/54947/reading-opendata-json-and-storing-into-phoenix-tab.html
https://community.hortonworks.com/articles/56642/creating-a-spring-boot-java-8-microservice-to-read.html
https://community.hortonworks.com/articles/64122/incrementally-streaming-rdbms-data-to-your-hadoop.html
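The linked articles use NiFi and a Spring Boot microservice; as an alternative, hedged sketch of querying a Phoenix table directly from Python through the Phoenix Query Server, with a hypothetical URL, table, and columns:

```python
# Illustrative sketch: query a Phoenix table through the Phoenix Query Server
# using the phoenixdb driver. The URL and table/column names are hypothetical.
import phoenixdb

conn = phoenixdb.connect("http://phoenix-queryserver.example.com:8765/", autocommit=True)
cursor = conn.cursor()

# Pull the most recent incidents for a given district.
cursor.execute(
    "SELECT incident_id, dc_dist, dispatch_time, text_general_code "
    "FROM PHILLY_CRIME WHERE dc_dist = ? ORDER BY dispatch_time DESC LIMIT 10",
    ["22"],
)
for row in cursor.fetchall():
    print(row)
conn.close()
```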
HBase Tales From the Trenches - Short stories about most common HBase operati...DataWorks Summit
While HBase is the most logical answer for use cases requiring random, real-time read/write access to Big Data, it is not trivial to design applications that make the most of it, nor is it the simplest system to operate. Because it depends on and integrates with other components from the Hadoop ecosystem (ZooKeeper, HDFS, Spark, Hive, etc.) and external systems (Kerberos, LDAP), and because its distributed nature requires a "Swiss clockwork" infrastructure, many variables must be considered when investigating anomalies or even outages. Adding to the equation, HBase is still an evolving product, with different release versions in use today, some of which carry genuine software bugs. In this presentation, we'll go through the most common HBase issues faced by different organisations, describing the identified causes and resolution actions from my last five years supporting HBase for our heterogeneous customer base.
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...DataWorks Summit
LocationTech GeoMesa enables spatial and spatiotemporal indexing and queries for HBase and Accumulo. In this talk, after an overview of GeoMesa’s capabilities in the Cloudera ecosystem, we will dive into how GeoMesa leverages Accumulo’s Iterator interface and HBase’s Filter and Coprocessor interfaces. The goal will be to discuss both what spatial operations can be pushed down into the distributed database and also how the GeoMesa codebase is organized to allow for consistent use across the two database systems.
OCLC has been using HBase since 2012 to enable single-search-box access to over a billion items from your library and the world’s library collection. This talk will provide an overview of how HBase is structured to provide this information, some of the challenges OCLC has encountered scaling to support the world catalog, and how they have overcome them.
Many individuals/organizations have a desire to utilize NoSQL technology, but often lack an understanding of how the underlying functional bits can be utilized to enable their use case. This situation can result in drastic increases in the desire to put the SQL back in NoSQL.
Since the initial commit, Apache Accumulo has provided a number of examples to help jumpstart comprehension of how some of these bits function as well as potentially help tease out an understanding of how they might be applied to a NoSQL friendly use case. One very relatable example demonstrates how Accumulo could be used to emulate a filesystem (dirlist).
In this session we will walk through the dirlist implementation. Attendees should come away with an understanding of the supporting table designs, a simple text search supporting a single wildcard (on file/directory names), and how the dirlist elements work together to accomplish its feature set. Attendees should (hopefully) also come away with a justification for sometimes keeping the SQL out of NoSQL.
HBase Global Indexing to support large-scale data ingestion at UberDataWorks Summit
Data serves as the platform for decision-making at Uber. To facilitate data driven decisions, many datasets at Uber are ingested in a Hadoop Data Lake and exposed to querying via Hive. Analytical queries joining various datasets are run to better understand business data at Uber.
Data ingestion, at its most basic form, is about organizing data to balance efficient reading and writing of newer data. Data organization for efficient reading involves factoring in query patterns to partition data so that read amplification stays low. Data organization for efficient writing involves factoring in the nature of the input data - whether it is append-only or updatable.
At Uber we ingest terabytes of data into many critical tables, such as trips, that are updatable. These tables are a fundamental part of Uber's data-driven solutions and act as the source of truth for all the analytical use cases across the entire company. Datasets such as trips constantly receive updates to the data in addition to inserts. To ingest such datasets we need a critical component that is responsible for bookkeeping information about the data layout and annotating each incoming change with the location in HDFS where that data should be written. This component is called Global Indexing. Without it, all records get treated as inserts and re-written to HDFS instead of being updated, which leads to duplication of data, breaking data correctness and user queries. This component is key to scaling our jobs, which now handle more than 500 billion writes a day in our current ingestion systems, and it needs strong consistency and high throughput for index writes and reads.
At Uber, we have chosen HBase as the backing store for the Global Indexing component, which is critical in allowing us to scale our jobs to more than 500 billion writes a day in our current ingestion systems. In this talk, we will discuss data@Uber, expound on why we built the global index using Apache HBase, and explain how this helps scale out our cluster usage. We'll give details on why we chose HBase over other storage systems, how and why we came up with a creative solution to automatically load HFiles directly to the backend (circumventing the normal write path when bootstrapping our ingestion tables to avoid QPS constraints), as well as other lessons we learned bringing this system into production at the scale of data that Uber encounters daily.
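The abstract does not publish Uber's actual schema; as a loose conceptual sketch only, a global-index lookup and update against HBase from Python (with an invented table layout and key format) might look like the following:

```python
# Conceptual sketch of a global-index lookup in HBase using happybase.
# Table name, column family, and key layout are hypothetical, not Uber's actual schema.
import happybase

connection = happybase.Connection("hbase-gateway.example.com")
index = connection.table("trips_global_index")

def locate_record(record_key: bytes):
    """Return the HDFS file/partition where this record currently lives, or None."""
    row = index.row(record_key, columns=[b"loc:file", b"loc:partition"])
    if not row:
        return None  # unseen key: treat the incoming change as an insert
    return row[b"loc:file"], row[b"loc:partition"]

def register_record(record_key: bytes, file_id: bytes, partition: bytes):
    """Record (or update) the layout location for a key after it is written."""
    index.put(record_key, {b"loc:file": file_id, b"loc:partition": partition})

print(locate_record(b"trip#2019-06-01#abc123"))
```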
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixDataWorks Summit
Recently, Apache Phoenix has been integrated with Apache (incubator) Omid transaction processing service, to provide ultra-high system throughput with ultra-low latency overhead. Phoenix has been shown to scale beyond 0.5M transactions per second with sub-5ms latency for short transactions on industry-standard hardware. On the other hand, Omid has been extended to support secondary indexes, multi-snapshot SQL queries, and massive-write transactions.
These innovative features make Phoenix an excellent choice for translytics applications, which allow converged transaction processing and analytics. We share the story of building the next-gen data tier for advertising platforms at Verizon Media that exploits Phoenix and Omid to support multi-feed real-time ingestion and AI pipelines in one place, and discuss the lessons learned.
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiDataWorks Summit
Cybersecurity requires an organization to collect data, analyze it, and alert on cyber anomalies in near real time. This is a challenging endeavor when considering the variety of data sources that need to be collected and analyzed: everything from application logs, network events, authentication systems, IoT devices, business events, cloud service logs, and more needs to be taken into consideration. In addition, multiple data formats need to be transformed and conformed so they can be understood by both humans and ML/AI algorithms.
To solve this problem, the Aetna Global Security team developed the Unified Data Platform based on Apache NiFi, which allows them to remain agile and adapt to new security threats and the onboarding of new technologies in the Aetna environment. The platform currently has over 60 different data flows with 95% doing real-time ETL and handles over 20 billion events per day. In this session learn from Aetna’s experience building an edge to AI high-speed data pipeline with Apache NiFi.
In the healthcare sector, data security, governance, and quality are crucial for maintaining patient privacy and ensuring the highest standards of care. At Florida Blue, the leading health insurer of Florida serving over five million members, there is a multifaceted network of care providers, business users, sales agents, and other divisions relying on the same datasets to derive critical information for multiple applications across the enterprise. However, maintaining consistent data governance and security for protected health information and other extended data attributes has always been a complex challenge that did not easily accommodate the wide range of needs for Florida Blue’s many business units. Using Apache Ranger, we developed a federated Identity & Access Management (IAM) approach that allows each tenant to have their own IAM mechanism. All user groups and roles are propagated across the federation in order to determine users’ data entitlement and access authorization; this applies to all stages of the system, from the broadest tenant levels down to specific data rows and columns. We also enabled audit attributes to ensure data quality by documenting data sources, reasons for data collection, date and time of data collection, and more. In this discussion, we will outline our implementation approach, review the results, and highlight our “lessons learned.”
Presto: Optimizing Performance of SQL-on-Anything EngineDataWorks Summit
Presto, an open source distributed SQL engine, is widely recognized for its low-latency queries, high concurrency, and native ability to query multiple data sources. Proven at scale in a variety of use cases at Airbnb, Bloomberg, Comcast, Facebook, FINRA, LinkedIn, Lyft, Netflix, Twitter, and Uber, in the last few years Presto experienced an unprecedented growth in popularity in both on-premises and cloud deployments over Object Stores, HDFS, NoSQL and RDBMS data stores.
With the ever-growing list of connectors to new data sources such as Azure Blob Storage, Elasticsearch, Netflix Iceberg, Apache Kudu, and Apache Pulsar, the recently introduced Cost-Based Optimizer in Presto must account for heterogeneous inputs with differing and often incomplete data statistics. This talk will explore this topic in detail, discuss the best use cases for Presto across several industries, and present recent Presto advancements such as geospatial analytics at scale and the project roadmap going forward.
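As a small, hedged illustration of querying Presto from Python (the coordinator host, catalog, and table are placeholders, not anything tied to the talk), one option is the PyHive Presto client:

```python
# Illustrative Presto query from Python using PyHive's presto module.
# Coordinator host, catalog, schema, and table names are hypothetical.
from pyhive import presto

conn = presto.connect(
    host="presto-coordinator.example.com",
    port=8080,
    username="analyst",
    catalog="hive",
    schema="web",
)
cursor = conn.cursor()
cursor.execute("""
    SELECT page, count(*) AS views
    FROM pageviews
    WHERE view_date >= DATE '2019-01-01'
    GROUP BY page
    ORDER BY views DESC
    LIMIT 10
""")
for page, views in cursor.fetchall():
    print(page, views)
```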
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...DataWorks Summit
Specialized tools for machine learning development and model governance are becoming essential. MLflow is an open source platform for managing the machine learning lifecycle. Just by adding a few lines of code to the function or script that trains their model, data scientists can log parameters, metrics, artifacts (plots, miscellaneous files, etc.) and a deployable packaging of the ML model. Every time that function or script is run, the results are logged automatically as a byproduct of those added lines, even if the person doing the training run makes no special effort to record the results. MLflow application programming interfaces (APIs) are available for the Python, R and Java programming languages, and MLflow sports a language-agnostic REST API as well. Over a relatively short time period, MLflow has garnered more than 3,300 stars on GitHub, almost 500,000 monthly downloads and 80 contributors from more than 40 companies. Most significantly, more than 200 companies are now using MLflow. We will demo the MLflow Tracking, Project and Model components with Azure Machine Learning (AML) Services and show you how easy it is to get started with MLflow on-prem or in the cloud.
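A minimal sketch of what those "few lines of code" can look like in a training script (the dataset and model below are illustrative, not taken from the demo):

```python
# Minimal MLflow tracking sketch: log a parameter, a metric, and the trained model.
# The dataset and model choice are illustrative.
import mlflow
import mlflow.sklearn
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

with mlflow.start_run():
    C = 0.5
    model = LogisticRegression(C=C, max_iter=1000).fit(X_train, y_train)

    mlflow.log_param("C", C)                                          # hyperparameter
    mlflow.log_metric("test_accuracy", model.score(X_test, y_test))   # evaluation metric
    mlflow.sklearn.log_model(model, "model")                          # deployable model artifact
```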
Extending Twitter's Data Platform to Google CloudDataWorks Summit
Twitter's Data Platform is built using multiple complex open source and in-house projects to support data analytics on hundreds of petabytes of data. Our platform supports storage, compute, data ingestion, discovery and management, plus various tools and libraries that help users with both batch and real-time analytics. Our Data Platform operates on multiple clusters across different data centers to help thousands of users discover valuable insights. As we scaled our Data Platform to multiple clusters, we also evaluated various cloud vendors to support use cases outside of our data centers. In this talk we share our architecture and how we extend our data platform to use the cloud as another data center. We walk through our evaluation process, discuss the challenges we faced supporting data analytics at Twitter scale in the cloud, and present our current solution. Extending Twitter's Data Platform to the cloud was a complex task, which we dive into deeply in this presentation.
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiDataWorks Summit
At Comcast, our team has been architecting a customer experience platform which is able to react to near-real-time events and interactions and deliver appropriate and timely communications to customers. By combining the low latency capabilities of Apache Flink and the dataflow capabilities of Apache NiFi we are able to process events at high volume to trigger, enrich, filter, and act/communicate to enhance customer experiences. Apache Flink and Apache NiFi complement each other with their strengths in event streaming and correlation, state management, command-and-control, parallelism, development methodology, and interoperability with surrounding technologies. We will trace our journey from starting with Apache NiFi over three years ago and our more recent introduction of Apache Flink into our platform stack to handle more complex scenarios. In this presentation we will compare and contrast which business and technical use cases are best suited to which platform and explore different ways to integrate the two platforms into a single solution.
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerDataWorks Summit
Companies are increasingly moving to the cloud to store and process data. One of the challenges companies have is in securing data across hybrid environments with easy way to centrally manage policies. In this session, we will talk through how companies can use Apache Ranger to protect access to data both in on-premise as well as in cloud environments. We will go into details into the challenges of hybrid environment and how Ranger can solve it. We will also talk through how companies can further enhance the security by leveraging Ranger to anonymize or tokenize data while moving into the cloud and de-anonymize dynamically using Apache Hive, Apache Spark or when accessing data from cloud storage systems. We will also deep dive into the Ranger’s integration with AWS S3, AWS Redshift and other cloud native systems. We will wrap it up with an end to end demo showing how policies can be created in Ranger and used to manage access to data in different systems, anonymize or de-anonymize data and track where data is flowing.
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...DataWorks Summit
Advanced Big Data Processing frameworks have been proposed to harness the fast data transmission capability of Remote Direct Memory Access (RDMA) over high-speed networks such as InfiniBand, RoCEv1, RoCEv2, iWARP, and OmniPath. However, with the introduction of the Non-Volatile Memory (NVM) and NVM express (NVMe) based SSD, these designs along with the default Big Data processing models need to be re-assessed to discover the possibilities of further enhanced performance. In this talk, we will present, NRCIO, a high-performance communication runtime for non-volatile memory over modern network interconnects that can be leveraged by existing Big Data processing middleware. We will show the performance of non-volatile memory-aware RDMA communication protocols using our proposed runtime and demonstrate its benefits by incorporating it into a high-performance in-memory key-value store, Apache Hadoop, Tez, Spark, and TensorFlow. Evaluation results illustrate that NRCIO can achieve up to 3.65x performance improvement for representative Big Data processing workloads on modern data centers.
Background: Some early applications of Computer Vision in Retail arose from e-commerce use cases - but increasingly, it is being used in physical stores in a variety of new and exciting ways, such as:
● Optimizing merchandising execution, in-stocks and sell-thru
● Enhancing operational efficiencies and enabling real-time customer engagement
● Enhancing loss prevention capabilities and response time
● Creating frictionless experiences for shoppers
Abstract: This talk will cover the use of Computer Vision in Retail, the implications to the broader Consumer Goods industry and share business drivers, use cases and benefits that are unfolding as an integral component in the remaking of an age-old industry.
We will also take a ‘peek under the hood’ of Computer Vision and Deep Learning, sharing technology design principles and skill set profiles to consider before starting your CV journey.
Deep learning has matured considerably in the past few years to produce human or superhuman abilities in a variety of computer vision paradigms. We will discuss ways to recognize these paradigms in retail settings, collect and organize data to create actionable outcomes with the new insights and applications that deep learning enables.
We will cover the basics of object detection, then move into the advanced processing of images, describing possible ways that a retail store of the near future could operate: identifying various storefront situations by attaching a deep learning system to a camera stream, such as item stock levels on shelves, a shelf in need of organization, or perhaps a wandering customer in need of assistance.
We will also cover how to use a computer vision system to automatically track customer purchases to enable a streamlined checkout process, and how deep learning can power plausible wardrobe suggestions based on what a customer is currently wearing or purchasing.
Finally, we will cover the various technologies that are powering these applications today: deep learning tools for research and development, production tools to distribute that intelligence to the entire inventory of cameras situated around a retail location, and tools for exploring and understanding the new data streams produced by the computer vision systems.
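As a small, hedged illustration of the object-detection building block discussed above, the sketch below runs a generic pretrained COCO detector on a single frame; it is not a retail-specific system, and the image path is hypothetical.

```python
# Illustrative object detection on a single camera frame with a pretrained model.
# This is a generic COCO-trained detector, not a retail-specific system.
import torch
import torchvision
from torchvision import transforms
from PIL import Image

model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
model.eval()

frame = Image.open("shelf_frame.jpg").convert("RGB")   # hypothetical camera frame
tensor = transforms.ToTensor()(frame)

with torch.no_grad():
    detections = model([tensor])[0]

# Keep confident detections; labels are COCO class indices (e.g. person, bottle).
for box, label, score in zip(detections["boxes"], detections["labels"], detections["scores"]):
    if score > 0.8:
        print(int(label), float(score), [round(float(v), 1) for v in box])
```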
By the end of this talk, attendees should understand the impact Computer Vision and Deep Learning are having in the Consumer Goods industry, key use cases, techniques and key considerations leaders are exploring and implementing today.
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkDataWorks Summit
Whole-genome shotgun-based next-generation transcriptomics and metagenomics studies often generate 100 to 1,000 gigabytes (GB) of sequence data derived from tens of thousands of different genes or microbial species. De novo assembly of these data requires a solution that both scales with data size and optimizes for individual genes or genomes. Here we developed an Apache Spark-based scalable sequence clustering application, SparkReadClust (SpaRC), that partitions the reads based on their molecule of origin to enable downstream assembly optimization. SpaRC produces high clustering performance on transcriptomics and metagenomics test datasets from both short-read and long-read sequencing technologies. It achieved near-linear scalability with respect to input data size and the number of compute nodes. SpaRC can run on different cloud computing environments without modification while delivering similar performance. In summary, our results suggest SpaRC provides a scalable solution for clustering billions of reads from next-generation sequencing experiments, and that Apache Spark represents a cost-effective solution with rapid development/deployment cycles for similar big data genomics problems.
Accelerate your Kubernetes clusters with Varnish CachingThijs Feryn
A presentation about the usage and availability of Varnish on Kubernetes. This talk explores the capabilities of Varnish caching and shows how to use the Varnish Helm chart to deploy it to Kubernetes.
This presentation was delivered at K8SUG Singapore. See https://feryn.eu/presentations/accelerate-your-kubernetes-clusters-with-varnish-caching-k8sug-singapore-28-2024 for more details.
UiPath Test Automation using UiPath Test Suite series, part 4DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 4. In this session, we will cover Test Manager overview along with SAP heatmap.
The UiPath Test Manager overview with SAP heatmap webinar offers a concise yet comprehensive exploration of the role of a Test Manager within SAP environments, coupled with the utilization of heatmaps for effective testing strategies.
Participants will gain insights into the responsibilities, challenges, and best practices associated with test management in SAP projects. Additionally, the webinar delves into the significance of heatmaps as a visual aid for identifying testing priorities, areas of risk, and resource allocation within SAP landscapes. Through this session, attendees can expect to enhance their understanding of test management principles while learning practical approaches to optimize testing processes in SAP environments using heatmap visualization techniques.
What will you get from this session?
1. Insights into SAP testing best practices
2. Heatmap utilization for testing
3. Optimization of testing processes
4. Demo
Topics covered:
Execution from the test manager
Orchestrator execution result
Defect reporting
SAP heatmap example with demo
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
The Art of the Pitch: WordPress Relationships and SalesLaura Byrne
Clients don’t know what they don’t know. What web solutions are right for them? How does WordPress come into the picture? How do you make sure you understand scope and timeline? What do you do if something changes?
All these questions and more will be explored as we talk about matching clients’ needs with what your agency offers without pulling teeth or pulling your hair out. Practical tips, and strategies for successful relationship building that leads to closing the deal.
Search and Society: Reimagining Information Access for Radical FuturesBhaskar Mitra
The field of Information retrieval (IR) is currently undergoing a transformative shift, at least partly due to the emerging applications of generative AI to information access. In this talk, we will deliberate on the sociotechnical implications of generative AI for information access. We will argue that there is both a critical necessity and an exciting opportunity for the IR community to re-center our research agendas on societal needs while dismantling the artificial separation between the work on fairness, accountability, transparency, and ethics in IR and the rest of IR research. Instead of adopting a reactionary strategy of trying to mitigate potential social harms from emerging technologies, the community should aim to proactively set the research agenda for the kinds of systems we should build, inspired by diverse, explicitly stated sociotechnical imaginaries. The sociotechnical imaginaries that underpin the design and development of information access technologies need to be explicitly articulated, and we need to develop theories of change in the context of these diverse perspectives. Our guiding future imaginaries must be informed by other academic fields, such as democratic theory and critical theory, and should be co-developed with social science scholars, legal scholars, civil rights and social justice activists, and artists, among others.
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...James Anderson
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. A constant focus on speed to release software to market, along with traditionally slow and manual security checks, has caused gaps in continuous security as an important piece of the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface of their application supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with PASSION for technology and making things work along with a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations in CI/CD and application security integrated in software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
"Impact of front-end architecture on development cost", Viktor TurskyiFwdays
I have heard many times that architecture is not important for the front-end. Also, many times I have seen how developers implement features on the front-end just following the standard rules for a framework and think that this is enough to successfully launch the project, and then the project fails. How to prevent this and what approach to choose? I have launched dozens of complex projects and during the talk we will analyze which approaches have worked for me and which have not.
Connector Corner: Automate dynamic content and events by pushing a buttonDianaGray10
Here is something new! In our next Connector Corner webinar, we will demonstrate how you can use a single workflow to:
Create a campaign using Mailchimp with merge tags/fields
Send an interactive Slack channel message (using buttons)
Have the message received by managers and peers along with a test email for review
But there’s more:
In a second workflow supporting the same use case, you’ll see:
Your campaign sent to target colleagues for approval
If the “Approve” button is clicked, a Jira/Zendesk ticket is created for the marketing design team
But—if the “Reject” button is pushed, colleagues will be alerted via Slack message
Join us to learn more about this new, human-in-the-loop capability, brought to you by Integration Service connectors.
And...
Speakers:
Akshay Agnihotri, Product Manager
Charlie Greenberg, Host
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Tobias Schneck
As AI technology pushes into IT, I have been wondering, as an "infrastructure container Kubernetes guy", how this fancy AI technology gets managed from an infrastructure operations point of view. Is it possible to apply our lovely cloud native principles as well? What benefits could both technologies bring to each other?
Let me take these questions and provide you with a short journey through existing deployment models and use cases for AI software. Using practical examples, we discuss what cloud/on-premise strategy we may need to apply it to our own infrastructure and make it work from an enterprise perspective. I want to give an overview of infrastructure requirements and technologies, and of what could benefit or limit your AI use cases in an enterprise environment. An interactive demo will give you some insights into the approaches I already have working for real.
PHP Frameworks: I want to break free (IPC Berlin 2024)Ralf Eggert
In this presentation, we examine the challenges and limitations of relying too heavily on PHP frameworks in web development. We discuss the history of PHP and its frameworks to understand how this dependence has evolved. The focus will be on providing concrete tips and strategies to reduce reliance on these frameworks, based on real-world examples and practical considerations. The goal is to equip developers with the skills and knowledge to create more flexible and future-proof web applications. We'll explore the importance of maintaining autonomy in a rapidly changing tech landscape and how to make informed decisions in PHP development.
This talk is aimed at encouraging a more independent approach to using PHP frameworks, moving towards a more flexible and future-proof approach to PHP development.
Let's dive deeper into the world of ODC! Ricardo Alves (OutSystems) will join us to tell all about the new Data Fabric. After that, Sezen de Bruijn (OutSystems) will get into the details on how to best design a sturdy architecture within ODC.
UiPath Test Automation using UiPath Test Suite series, part 3DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 3. In this session, we will cover desktop automation along with UI automation.
Topics covered:
UI automation Introduction,
UI automation Sample
Desktop automation flow
Speakers:
Pradeep Chinnala, Senior Consultant Automation Developer @WonderBotz and UiPath MVP
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...DanBrown980551
Do you want to learn how to model and simulate an electrical network from scratch in under an hour?
Then welcome to this PowSyBl workshop, hosted by Rte, the French Transmission System Operator (TSO)!
During the webinar, you will discover the PowSyBl ecosystem as well as handle and study an electrical network through an interactive Python notebook.
PowSyBl is an open source project hosted by LF Energy, which offers a comprehensive set of features for electrical grid modelling and simulation. Among other advanced features, PowSyBl provides:
- A fully editable and extendable library for grid component modelling;
- Visualization tools to display your network;
- Grid simulation tools, such as power flows, security analyses (with or without remedial actions) and sensitivity analyses;
The framework is mostly written in Java, with a Python binding so that Python developers can access PowSyBl functionalities as well.
What you will learn during the webinar:
- For beginners: discover PowSyBl's functionalities through a quick general presentation and the notebook, without needing any expert coding skills;
- For advanced developers: master the skills to efficiently apply PowSyBl functionalities to your real-world scenarios.
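For a quick taste of that Python binding ahead of the workshop (this is a generic getting-started sketch, not the workshop notebook), loading a bundled example network and running an AC power flow could look like this:

```python
# Quick taste of the PowSyBl Python binding: load a bundled example network
# and run an AC power flow. This is not the workshop notebook itself.
import pypowsybl as pp

network = pp.network.create_ieee14()          # bundled IEEE 14-bus test network
results = pp.loadflow.run_ac(network)         # AC power flow
print(results[0].status)                      # convergence status of the main component

# Inspect bus voltages computed by the load flow.
print(network.get_buses()[["v_mag", "v_angle"]].head())
```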
TALKING POINTS
Welcome to DataWorks Summit – Why name change?
Thank community for their support
Market & community growth perspective?
Approaching 6 years of Hwx
Scott intro
I'm going to spend a little bit of time this morning walking you through the journey that we've taken over the last four years with the data in motion concept.
I’ll spend a few minutes after that walking you through the details of what we mean when we say data in motion. Many of you will hear quite a few buzzwords over the next few days and I want to make sure they land on a technical level.
And then we're going to spend the bulk of the time with me showing you in detail something we're very excited about - streaming analytics manager.
In 2014 we came out and gave a keynote where we described how you can use HDP to do real time visualization of data.
At that time there was a big shift happening where people understood what they could do in HDP, specifically to analyze data at scales they never could before. But they were starting to have much more interest in lower latency analysis.
We presented how you can do this type of processing and analysis, however we totally hand waved on how the data gets there in the first place.
Then in 2015 we came out and described how to enhance the data as it lands into the cluster. As it's arriving enrich it and make it more useful for analysis and visualization. Again still very much hand waving on how the data gets there in the first place.
In 2016 we wanted to expand our view and we wanted to help customers and companies get much better at how they collect the data, drive it throughout the enterprise and deliver it to the cluster. This time though we hand waved about how you do stream processing on the data.
This year we are very excited that we could come out and tell a truly balanced end to end story and show solutions for how you can collect, process, visualize and understand data in motion all the way through and that's what HDF 3.0 is all about.
Data in motion across the enterprise starts at the edge. For some of you the edge maybe planes, trains and automobiles for others it's traditional enterprise assets; servers, workstations, laptops, network devices and so on.
For us it's wherever the data life cycle begins, right at the first moment that it's created. From the very first observation we want to help you command and manage the data all the way through to the next hop, whether that's a regional gateway, a core data center or the cloud. Full end-to-end processing of the data along its journey.
Then to be able to do stream processing as it arrives and provide powerful visualizations allowing you to interact with the data all while it's in motion.
All of this in time for you to extract the maximum value from the data. As we all know, in many cases the value of data is perishable and diminishes over time, so we want to help you operate on it in time.
As you think about this problem from end to end, there are a few cross-cutting concerns that we have to address all the way through the chain.
First and foremost is security, as we move to the edge we step out of the cozy confines of the enterprise where we have LDAP, Active Directory, Kerberos and other technologies which often aren't available to us outside of the walls of the enterprise.
We need to think about end to end security and we have to be able to shape shift so that we can use the best and most effective techniques all the way through.
We also want to make sure that you understand the origin and attribution of every piece of data. Everywhere that it comes from and everywhere that it goes. Total lineage. Not just once it lands in the cluster but from the very moment it's created to all the systems that use it. Understanding latencies is also a very important part of that story and a big piece of the governance story.
Finally, a very critical part of our story is command and control. Building these things at scale is very difficult. Specifically, it's difficult because once you get it set up, you as an organization want to make changes to it. It's not static; at least if you're doing it right it's not static, and you need to be agile. Having a really powerful command and control mechanism to allow that agility is a critical part of this story.
Let me set the stage for the demo that I’m going to show you.
Let’s imagine we are a transportation company that has a fleet of trucks driving around and we want to gather data from sensors on the vehicles.
To do that we are going to use MiNiFi, which is a sub-project of Apache NiFi that we make available through HDF.
With MiNiFi we'll acquire the data from a variety of sensors, do the initial analysis so we can understand the relative value of the data and prioritize it. Then we will use the most effective and most appropriate communication mechanism available to us. Maybe that's activating an LTE signal to send out really critical time sensitive data or maybe for data that has less time sensitivity we buffer it and send it out when we have a WiFi signal.
The data then arrives at a more regional or core location, and here we use technologies like Apache NiFi and Apache Kafka.
Perhaps we want to normalize the events, add customer reference data to them, tokenize them, or further enrich them.
We can then syndicate the events for a wide range of use cases as well as drive them into systems like Hadoop and Spark.
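A stripped-down, hypothetical sketch of that enrich-and-syndicate step using plain Kafka clients from Python (topics, fields, and the reference lookup are invented for illustration; the actual demo uses NiFi and Kafka):

```python
# Stripped-down sketch of consuming vehicle events from Kafka, enriching them with
# reference data, and publishing them to a downstream topic. Topics, fields, and the
# lookup table are hypothetical.
import json
from kafka import KafkaConsumer, KafkaProducer

CUSTOMER_REFERENCE = {"truck-17": {"fleet": "northeast", "owner": "acme-logistics"}}

consumer = KafkaConsumer(
    "raw-truck-events",
    bootstrap_servers="kafka.example.com:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
producer = KafkaProducer(
    bootstrap_servers="kafka.example.com:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

for message in consumer:
    event = message.value
    event.update(CUSTOMER_REFERENCE.get(event.get("vehicle_id"), {}))  # enrich
    producer.send("enriched-truck-events", event)                      # syndicate
```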
Now I'm very excited to be able to tell you what we can do with that data while it's still moving, while it's still very fresh, using the Streaming Analytics Manager.
With SAM we can do various things such as:
perform window based processing
temporal spatial correlation
evaluate models
aggregate
enrich
all with a schema applied to the data.
Having a schema is critical when building a Streaming application, it allows you to see and understand how your data evolves.
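To make "window-based processing" concrete, here is a purely conceptual plain-Python illustration of a tumbling-window aggregation over made-up vehicle events; SAM builds the equivalent visually rather than in code.

```python
# Conceptual illustration of tumbling-window aggregation over a stream of events.
# SAM expresses this visually; events and fields here are made up.
from collections import defaultdict

WINDOW_SECONDS = 60

events = [
    {"vehicle_id": "truck-17", "ts": 3,  "speed": 61},
    {"vehicle_id": "truck-17", "ts": 45, "speed": 72},
    {"vehicle_id": "truck-09", "ts": 70, "speed": 55},
]

# Group each event into the 60-second window that contains its timestamp.
windows = defaultdict(list)
for event in events:
    window_start = (event["ts"] // WINDOW_SECONDS) * WINDOW_SECONDS
    windows[(event["vehicle_id"], window_start)].append(event["speed"])

# Emit one aggregate per (vehicle, window).
for (vehicle, start), speeds in sorted(windows.items()):
    print(vehicle, f"[{start}s, {start + WINDOW_SECONDS}s)", "avg speed:", sum(speeds) / len(speeds))
```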
With SAM we're really targeting app developers, business analysts and operations teams so that we can help them maximize their individual user experience as necessary for their skill set and also give them a way to easily collaborate around a unified platform and do a better job than they ever have before.
Let me now introduce you to SAM.
We’ve articulated an end to end GA vision for flow management
You can see where we’re heading
Closed loop processing
Fully Governed
Support for multiple processing environments
Incorporate machine learning and AI to optimize what is collected, minimize time to insight, etc.
The path ahead is truly exciting and we look forward to working with all of you on this journey.
One of the things that I really love about all of this is not only the ease with which developers can build streaming applications, but also the power of allowing all the different actors to collaborate. I think you would agree we have a pretty important story and vision here about how we're going to continue to bring these things together and further enhance the experience.
In closing I want you to think about the fact that your enterprise is not composed of various big, cool clusters of distributed systems; it is one big logical distributed computing system, and we want to help you drive that and maximize the value of data all the way through.
Again we do that through providing solutions all the way from the edge to the core with flow management, stream processing and enterprise services. Enabling you to bring all of it together semantically through the registries and through a development and deployment experience that helps you harness the power of your data.
Thank you all very much for your time.