Examples of using data lakes from different AWS customers.
Level: Intermediate
Speaker: Ryan Jancaitis - Sr. Product Manager, EEC, AWS WWPS TechVision & Business Development
Slim Baltagi, director of Enterprise Architecture at Capital One, gave a presentation at Hadoop Summit on major trends in big data analytics. He discussed 1) increasing portability between execution engines using Apache Beam, 2) the emergence of stream analytics driven by data streams, technology advances, business needs and consumer demands, 3) the growth of in-memory analytics using tools like Alluxio and RocksDB, 4) rapid application development using APIs, notebooks, GUIs and microservices, 5) open sourcing of machine learning systems by tech giants, and 6) hybrid cloud computing models for deploying big data applications both on-premise and in the cloud.
Real-Time Robot Predictive Maintenance in Action - DataWorks Summit
This document describes a predictive maintenance system for robots using real-time sensor data. A team of 4 engineers built a solution in 2 months using standard open source software like H2O and MapR. Sensors on a robot collected accelerometer, gyroscope and other data. This raw data was analyzed using anomaly detection algorithms in H2O to build a machine learning model that identified normal vs abnormal robot states. The model was deployed as a microservice to make real-time predictions on new sensor data and detect potential failures. The solution was able to analyze data from hundreds of robots and identify anomalies within 3 seconds, demonstrating an effective low-cost predictive maintenance system.
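As a rough illustration of the anomaly-detection step described above, the sketch below scores incoming sensor windows against a model trained on normal readings. It uses scikit-learn's IsolationForest as a stand-in for the H2O model from the talk; the feature layout and injected spike are assumptions, not details from the presentation.

```python
# Hypothetical sketch: anomaly detection on robot sensor readings.
# IsolationForest stands in for the H2O anomaly model described above.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
# Simulated accelerometer/gyroscope readings: 1,000 "normal" samples, 6 channels.
normal = rng.normal(loc=0.0, scale=1.0, size=(1000, 6))

model = IsolationForest(contamination=0.01, random_state=42)
model.fit(normal)

# A new sensor window arriving in near real time; score it immediately.
new_window = rng.normal(loc=0.0, scale=1.0, size=(1, 6))
new_window[0, 2] += 8.0  # inject a spike to mimic an abnormal robot state
is_anomaly = model.predict(new_window)[0] == -1
print("anomaly detected" if is_anomaly else "normal")
```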
Full stack monitoring across apps & infrastructure with Azure Monitor - Squared Up
Azure Thames Valley is a group for anyone interested in Microsoft Azure Cloud Computing Platform and Services. We aim to provide the whole Microsoft Azure community, whatever their level, with a regular meeting place to share knowledge, ideas, experiences, real-life problems, best working practices and many more from their own past experiences. Professionals across various disciplines including Developers, Testers, Architects, Project Managers, Scrum Masters, CTOs and many more are all welcome.
Presentation: A look into Azure Monitoring solutions, with Clive Watson
Azure Monitoring solutions include some great insights into your Cloud & Hybrid services and applications. Do you want to learn more about the technologies, setup and usage? We will take a look at Azure Monitor and Log Analytics and supporting services in this talk and demo.
Clive has over 30 years’ experience within the industry (14+ at Microsoft), currently he is an Azure Infrastructure Specialist for Microsoft based in the UK.
1. The document discusses continuous delivery pipelines for Hadoop analytics platforms using tools like Cloudera Director, Jenkins, Git, and Gerrit to automate builds, testing, and deployments.
2. It provides examples of different pipeline stages for data engineers, data scientists, and application developers including developing code, running unit tests, baking artifacts, deploying to test and production clusters, and conducting user acceptance testing.
3. The final section discusses how a logical continuous delivery pipeline would work with hourly-daily deployments for DevOps teams and weekly-monthly releases for data scientists and analysts to reduce bugs in production.
Managing R&D Data on Parallel Compute Infrastructure - Databricks
Clinical genomic analytics pipelines built on Databricks and Delta Lake, which load individual reads from raw sequencing or base-call files, have significant advantages over more traditional methods. Analysis pipelines that perform genomic mapping against purpose-built reference data artifacts persisted to tables deliver performance that is orders of magnitude greater than previous mapping methods. These scalable, reproducible, and potentially open-sourced methods have the ability to transform bioinformatics and R&D data management and governance.
Motorists insurance company was facing challenges from aging systems, data silos, and an inability to analyze new types of data sources. They partnered with Saama Technologies to implement a hybrid Hadoop and SQL data warehouse ecosystem to consolidate their internal and external data in a scalable and cost-effective manner. This allowed Motorists to gain new insights from claims data, reduce load times by 30% with potential for 70% improvements, and save hundreds of hours on report building. Saama's Fluid Analytics for Insurance solution established a robust data foundation and provided self-service reporting and predictive analytics capabilities. The new environment enabled enterprise-wide data access and advanced analytics to improve business performance.
Delta Lake brings reliability, performance, and security to data lakes. It provides ACID transactions, schema enforcement, and unified handling of batch and streaming data to make data lakes more reliable. Delta Lake also features lightning fast query performance through its optimized Delta Engine. It enables security and compliance at scale through access controls and versioning of data. Delta Lake further offers an open approach and avoids vendor lock-in by using open formats like Parquet that can integrate with various ecosystems.
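A minimal sketch of the behaviors described above (transactional writes, schema enforcement, batch and streaming reads on one table), assuming a Spark session with the open source delta-spark package available; the table path and column names are placeholders.

```python
# Minimal Delta Lake sketch; requires the delta-spark package on the classpath.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("delta-sketch")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Batch write: an atomic, transactional commit to the table.
df = spark.createDataFrame([(1, "open"), (2, "closed")], ["claim_id", "status"])
df.write.format("delta").mode("append").save("/tmp/claims_delta")

# Schema enforcement: a write with a mismatched schema is rejected.
bad = spark.createDataFrame([("oops",)], ["unexpected_column"])
try:
    bad.write.format("delta").mode("append").save("/tmp/claims_delta")
except Exception as e:
    print("write rejected by schema enforcement:", type(e).__name__)

# The same table can also be consumed incrementally as a stream
# (a real job would attach writeStream and start() here).
stream = spark.readStream.format("delta").load("/tmp/claims_delta")
```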
Spark and Hadoop: Perfect Together by Arun Murthy - Spark Summit
Spark and Hadoop work perfectly together. Spark is a key tool in Hadoop's toolbox that provides elegant developer APIs and accelerates data science and machine learning. It can process streaming data in real-time for applications like web analytics and insurance claims processing. The future of Spark and Hadoop includes innovating the core technologies, providing seamless data access across data platforms, and further accelerating data science tools and libraries.
Jeeves Grows Up: An AI Chatbot for Performance and Quality - Databricks
Sarah: CEO-Finance-Report pipeline seems to be slow today. Why?
Jeeves: SparkSQL query dbt_fin_model in CEO-Finance-Report is running 53% slower on 2/28/2021. Data skew issue detected. Issue has not been seen in last 90 days.
Jeeves: Adding 5 more nodes to cluster recommended for CEO-Finance-Report to finish in its 99th percentile time of 5.2 hours.
Who is Jeeves? An experienced Spark developer? A seasoned administrator? No, Jeeves is a chatbot created to simplify data operations management for enterprise Spark clusters. This chatbot is powered by advanced AI algorithms and an intuitive conversational interface that together provide answers to get users in and out of problems quickly. Instead of being stuck to screens displaying logs and metrics, users can now have a more refreshing experience via a two-way conversation with their own personal Spark expert.
We presented Jeeves at Spark Summit 2019. In the two years since, Jeeves has grown up a lot. Jeeves can now learn continuously as telemetry information streams in from more and more applications, especially SQL queries. Jeeves now “knows” about data pipelines that have many components. Jeeves can also answer questions about data quality in addition to performance, cost, failures, and SLAs. For example:
Tom: I am not seeing any data for today in my Campaign Metrics Dashboard.
Jeeves: 3/5 validations failed on the cmp_kpis table on 2/28/2021. Run of pipeline cmp_incremental_daily failed on 2/28/2021.
This talk will give an overview of the newer capabilities of the chatbot, and how it now fits in a modern data stack with the emergence of new data roles like analytics engineers and machine learning engineers. You will learn how to build chatbots that tackle your complex data operations challenges.
In this webinar, Adam will explain the benefits and restrictions that are encountered when working with Big Data systems in a modern agile development approach. He will go on to present some of the approaches, both in automation and in their management of testing activities, that his team has successfully adopted in tackling the big data testing challenge.
Tell me more - http://testhuddle.com/resource/big-data-a-new-testing-challenge/
Learn how Apache Atlas is being enhanced to provide a universal open metadata and governance platform for all data processing across the enterprise. With open metadata, multiple metadata repositories, potentially from different vendors, can operate collaboratively to create an enterprise catalog of data that can be located, understood, used and governed. In this talk we will provide a detailed description of the extensions to the type system, new APIs, the connector framework, metadata discovery framework, governance action framework and the interoperability that we are adding to Apache Atlas. We will show examples of these features in operation. For example, (1) how metadata is discovered and gathered into Apache Atlas, (2) how applications and tools access metadata, (3) how enforcement engines such as Apache Ranger keep synchronized with the latest governance requirements and (4) how to build an adapter that allows other vendors' metadata repositories to exchange metadata with Apache Atlas repositories. We will also explain how these features can be deployed together to support the Hadoop platform, and the enterprise beyond. This session will be presented by Nigel Jones, IBM, and Ferd Schapers, ING Chief Information Architect.
Speaker:
Nigel Jones, Software Architect, IBM Analytics Group, IBM
HP operates a very complex HDP environment with key stakeholders and critical data across a variety of business areas: finance, supply chain, sales, and customer support. We load over 8,000 files per day, execute 1.5M lines of SQL via 6000 jobs running against 637B rows of data comprising over 5000 tables in 77 domains. Needless to say, defining our cluster size and monitoring job performance is essential for our success and the satisfaction of our stakeholders across the different business and IT organizations.
In this talk, we will describe the different sizing and allocation approaches that we went through. Our first method was a bottom-up storage-based calculation which took into account the legacy data, replication factors, overhead, and user space requirements. We quickly realized the current compute would not meet the needs of the follow-up phases of the project and that the bottom-up approach had too many assumptions and limitations.
The second method was to work top down to determine how many jobs could run within a set number of hours. This required us to calculate the number of slots for map and reduce tasks within a set amount of YARN memory. To support this analysis, we developed advanced dashboards and reports that we will also share during the presentation. We captured statistics for every job and calculated the average map and reduce times. With this information, we could then calculate the compute and storage needed to meet the required SLAs. As a result, the cluster grew by 88 nodes and now operates with 21 TB of YARN memory.
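As a back-of-the-envelope illustration of that top-down approach, the snippet below converts YARN memory into concurrent container slots and estimates how many jobs fit into an SLA window; all figures are illustrative assumptions, not HP's actual numbers.

```python
# Toy top-down sizing calculation; every number here is an assumption.
yarn_memory_gb = 21 * 1024        # total YARN memory across the cluster
container_gb = 4                  # memory per map/reduce container
avg_map_minutes = 3.0             # average map task time (from job history)
avg_reduce_minutes = 6.0          # average reduce task time
maps_per_job, reduces_per_job = 400, 50
sla_window_hours = 8              # nightly batch window

slots = yarn_memory_gb // container_gb
task_minutes_per_job = maps_per_job * avg_map_minutes + reduces_per_job * avg_reduce_minutes
cluster_task_minutes = slots * sla_window_hours * 60
jobs_in_window = cluster_task_minutes / task_minutes_per_job
print(f"{slots} concurrent containers -> roughly {jobs_in_window:.0f} jobs fit in the window")
```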
Speakers
Janet Li, HP Inc., Big Data IT Manager
Pranay Vyas, Hortonworks, Sr. Consultant
Insights into Real World Data Management Challenges - DataWorks Summit
Data is your most valuable business asset, and it's also your biggest challenge. This challenge and opportunity means we continually face significant roadblocks toward becoming a data driven organisation. From the management of data, to the fast-changing open source frameworks, to limited industry skills, to surmounting time and cost pressures, our challenge in data is big.
We all want and need a “fit for purpose” approach to management of data, especially Big Data, and overcoming the ongoing challenges around the ‘3Vs’ means we get to focus on the most important V - ‘Value’. Come along and join the discussion on how Oracle Big Data Cloud provides Value in the management of data and supports your move toward becoming a data driven organisation.
Speaker
Noble Raveendran, Principal Consultant, Oracle
Operationalizing Edge Machine Learning with Apache Spark with Nisha Talagala ... - Databricks
Machine Learning is everywhere, but translating a data scientist’s model into an operational environment is challenging for many reasons. Models may need to be distributed to remote applications to generate predictions, or in the case of re-training, existing models may need to be updated or replaced. To monitor and diagnose such configurations requires tracking many variables (such as performance counters, models, ML algorithm specific statistics and more).
In this talk we will demonstrate how we have attacked this problem for a specific use case, edge based anomaly detection. We will show how Spark can be deployed in two types of environments (on edge nodes where the ML predictions can detect anomalies in real time, and on a cloud based cluster where new model coefficients can be computed on a larger collection of available data). To make this solution practically deployable, we have developed mechanisms to automatically update the edge prediction pipelines with new models, regularly retrain at the cloud instance, and gather metrics from all pipelines to monitor, diagnose and detect issues with the entire workflow. Using SparkML and Spark Accumulators, we have developed an ML pipeline framework capable of automating such deployments and a distributed application monitoring framework to aid in live monitoring.
The talk will describe the problems of operationalizing ML in an Edge context, our approaches to solving them and what we have learned, and include a live demo of our approach using anomaly detection ML algorithms in SparkML and others (clustering etc.) and live data feeds. All datasets and outputs will be made publicly available.
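As a hedged sketch of the general pattern (train centrally, score at the edge, surface counters for monitoring), the example below pairs a Spark ML clustering model with a Spark accumulator; the column names, threshold, and tiny dataset are assumptions, not the presenters' actual pipeline framework.

```python
# Illustrative edge anomaly-detection sketch with Spark ML and an accumulator.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans
import numpy as np

spark = SparkSession.builder.appName("edge-anomaly-sketch").getOrCreate()
sc = spark.sparkContext

train = spark.createDataFrame(
    [(0.1, 0.2), (0.0, 0.1), (0.2, 0.0), (0.1, 0.1)], ["sensor_a", "sensor_b"]
)
live = spark.createDataFrame(
    [(0.1, 0.15), (9.0, 9.5)], ["sensor_a", "sensor_b"]
)

assembler = VectorAssembler(inputCols=["sensor_a", "sensor_b"], outputCol="features")
model = KMeans(k=1, seed=7).fit(assembler.transform(train))   # "cloud-side" training step
center = model.clusterCenters()[0]

anomaly_count = sc.accumulator(0)                              # cluster-wide monitoring counter

def flag(row, threshold=1.0):
    # Distance from the learned "normal" centroid; above threshold counts as anomalous.
    dist = float(np.linalg.norm(np.array([row.sensor_a, row.sensor_b]) - center))
    if dist > threshold:
        anomaly_count.add(1)
    return (row.sensor_a, row.sensor_b, dist > threshold)

flagged = live.rdd.map(flag).collect()                         # "edge-side" scoring step
print(flagged, "anomalies seen:", anomaly_count.value)
```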
Data analytics, Spark, Hadoop and AI have become fundamental tools to drive digital transformation. A critical challenge is moving from isolated experiments to an organizational or enterprise production infrastructure. In this talk, we break apart the modern data analytics workflow to focus on the data challenges across different phases of the analytics and AI life cycle. By presenting a unified approach to data storage for AI and Analytics, organizations can reduce costs, modernize their data strategy and build a sustainable enterprise data lake. By anticipating how Hadoop, Spark, TensorFlow, Caffe and traditional analytics like SAS and HPC can share data, IT departments and data science practitioners can not only co-exist, but speed time to insight. We will present the tangible benefits of a Reference Architecture using real-world installations that span proprietary and open-source frameworks. Using intelligent software-defined shared storage, users are able to eliminate silos, reduce multiple data copies, and improve time to insight. PALLAVI GALGALI, Offering Manager, IBM, and DOUGLAS O'FLAHERTY, Portfolio Product Manager, IBM
This document summarizes Walgreens' journey with Hadoop and building an integrated data hub. It discusses how Walgreens has expanded its Hadoop cluster from 14 nodes in 2014 to 140 nodes currently to handle growing data volumes and variety. An integrated data hub was created to bring analytics to data instead of moving data, improve collaboration, and better manage skills. The hub provides a single platform for batch and streaming analytics across structured, unstructured, hot and cold data. Building the hub required upgrading Hortonworks Data Platform (HDP) versions, adding new services like HBase, and involving teams early in the process.
Whither the Hadoop Developer Experience, June Hadoop Meetup, Nitin Motgi - Felicia Haggarty
The document discusses challenges with building operational data applications on Hadoop and introduces the Cask Data Application Platform (CDAP) as a solution. It provides an agenda that covers data applications, challenges, CDAP motivation and goals, use cases, and an introduction and architecture overview of CDAP. The document aims to demonstrate how CDAP provides a unified platform that simplifies application development and lifecycle while supporting reusable data and processing patterns.
An architecture for federated data discovery and lineage over on-prem datasou... - DataWorks Summit
Comcast's Streaming Data platform comprises a variety of ingest, transformation, and storage services in the public cloud. Peer-reviewed Apache Avro schemas support end-to-end data governance. We have previously reported (DataWorks Summit 2017) on how we extended Atlas with custom entity and process types for discovery and lineage in the AWS public cloud. Custom Lambda functions notify Atlas of creation of new entities and new lineage links via asynchronous Kafka messaging.
Recently we were presented the challenge of providing integrated data discovery and lineage across our public cloud datasources and on-prem datasources, both Hadoop-based and traditional data warehouses and RDBMSs. Can Apache Atlas meet this challenge? A resounding yes! This talk will present our federated architecture, with Atlas providing SQL-like, free-text, and graph search across select metadata from all on-prem and public cloud data sources in our purview. Lightweight, custom connectors/bridges identify metadata/lineage changes in underlying sources and publish them to Atlas via the asynchronous API. A portal layer provides Atlas query access and a federation of UIs. Once data of interest is identified via Atlas queries, interfaces specific to underlying sources may be used for special-purpose metadata mining.
While metadata repositories for data discovery and lineage abound, none of them have built-in connectors and listeners for the entire complement of data sources that Comcast and many other large enterprises use to support their business needs. In-house-built solutions typically underestimate the cost of development and maintenance and often suffer from architecture-by-accretion. Atlas' commitment to extensibility, built-in provision of typed, free-text, and graph search, and REST and asynchronous APIs, position it uniquely in the build-vs-buy sweet spot.
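A rough sketch of what publishing one custom entity to Atlas over its REST API can look like from Python; the endpoint shape, credentials, and the custom type name are assumptions here, so check the Atlas v2 API documentation for the exact payload your deployment expects.

```python
# Hedged sketch: push one custom entity to Apache Atlas via HTTP.
# Host, credentials, and the "aws_s3_object" type name are placeholders.
import requests

ATLAS_URL = "http://atlas.example.com:21000/api/atlas/v2/entity"  # placeholder host

entity = {
    "entity": {
        "typeName": "aws_s3_object",  # assumed custom type for a cloud data object
        "attributes": {
            "qualifiedName": "s3://example-bucket/events/2018/06/01/part-0000.avro",
            "name": "part-0000.avro",
            "owner": "streaming-data-platform",
        },
    }
}

resp = requests.post(ATLAS_URL, json=entity, auth=("admin", "admin"))
resp.raise_for_status()
print(resp.json())
```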
Security, ETL, BI & Analytics, and Software Integration - DataWorks Summit
This document discusses how Liberty Mutual Insurance is using a Hadoop data lake to power analytics. It provides examples of how the data lake is used to integrate data from various sources and support various use cases. Specifically, it discusses how the data lake enables:
- Storage of structured and unstructured data from multiple sources in a centralized and secure location
- Analytics and machine learning by data scientists and analysts accessing the stored data
- Integrations with tools like Elasticsearch, Spark, and PowerBI for querying, analyzing and visualizing the data
- Archiving of log and sensor data from systems into hot, warm and cold storage tiers based on age and access frequency
Apache Atlas provides centralized metadata services and cross-component dataset lineage tracking for Hadoop components. It aims to enable transparent, reproducible, auditable and consistent data governance across structured, unstructured, and traditional database systems. The near term roadmap includes dynamic access policy driven by metadata and enhanced Hive integration. Apache Atlas also pursues metadata exchange with non-Hadoop systems and third party vendors through REST APIs and custom reporters.
Big Data at Geisinger Health System: Big Wins in a Short Time - DataWorks Summit
Geisinger Health System is well known in the healthcare community as a pioneer in data and analytics. We have had an Electronic Health Record (EHR) since 1996, and an Electronic Data Warehouse (EDW) since 2008. Much of daily and weekly operational reporting, as well as an abundance of ad hoc analytics, come from the EDW.
Approximately 18 months ago, the Data Management team implemented Hadoop in the Hortonworks Data Platform (HDP), and successes in implementation and development have proven to the organization that we should abandon the traditional EDW in favor of the Big Data (HDP) platform.
In less than 18 months, we stood up the platform, created a data ingestion pipeline, duplicated all source feeds from the EDW into HDP, and had several analytics developed with HDP and Tableau. Furthermore, we have exploited the new capabilities of the platform, where we use Natural Language Processing (NLP) to interrogate valuable (but previously hidden) clinical notes. The new platform has data that is modeled and governed, setting the stage to push Geisinger Health System from a pioneer to a leader in Big Data and Analytics.
This session will focus on Hortonworks Data Platform, covering data architecture, security, data process flow, and development. It is geared toward Data Architects, Data Scientists, and Operations/I.T. audiences.
Introduction: This workshop will provide a hands-on introduction to basic Machine Learning techniques with Spark ML using a Sandbox on students’ personal machines.
Format: A short introductory lecture on select important supervised and unsupervised Machine Learning techniques, followed by a demo, lab exercises and a Q&A session. The lecture will be followed by lab time to work through the lab exercises and ask questions.
Objective: To provide a quick and short hands-on introduction to Machine Learning with Spark ML. In the lab, you will use the following components: Apache Zeppelin (a “Modern Data Science Toolbox”) and Apache Spark. You will learn how to analyze the data, structure the data, train Machine Learning models and apply them to answer real-world questions.
Pre-requisites: Registrants must bring a laptop that can run the Hortonworks Data Cloud.
Speaker:
Robert Hryniewicz, Developer Advocate, Hortonworks
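For a flavor of the kind of pipeline the lab builds, here is a minimal Spark ML example (assemble features, fit a model, score rows); the tiny inline dataset is illustrative, and the workshop itself works in Zeppelin notebooks on the sandbox rather than from a script.

```python
# Minimal Spark ML pipeline sketch: feature assembly + logistic regression.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("sparkml-workshop-sketch").getOrCreate()

train = spark.createDataFrame(
    [(0.0, 1.0, 0.0), (1.0, 8.0, 1.0), (0.5, 2.0, 0.0), (2.0, 9.0, 1.0)],
    ["f1", "f2", "label"],
)

pipeline = Pipeline(stages=[
    VectorAssembler(inputCols=["f1", "f2"], outputCol="features"),
    LogisticRegression(featuresCol="features", labelCol="label"),
])

model = pipeline.fit(train)
model.transform(train).select("label", "prediction").show()
```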
This document discusses building data pipelines with Spark and StreamSets. It describes how StreamSets Data Collector can be used to build pipelines that run on Spark today by leveraging Kafka RDDs and containers on Spark. It also outlines future directions for deeper Spark integration, including running pipelines on Databricks and developing a standalone Spark processor. The document concludes with a demo of StreamSets Data Collector capabilities.
This is a brief technology introduction to Oracle Stream Analytics, and how to use the platform to develop streaming data pipelines that support a wide variety of industry use cases.
This document discusses leveraging Hadoop within the existing data warehouse environment of the Department of Immigration and Border Protection (DIBP) in Australia. It provides an overview of DIBP's business and why Hadoop was adopted, describes the existing EDW environment, and discusses the technical implementation of Hadoop. It also outlines next steps such as consolidating the departmental EDW and advanced analytics on Hadoop, and concludes by taking questions.
How big data and AI saved the day: critical IP almost walked out the doorDataWorks Summit
Cybersecurity threats have evolved beyond what traditional SIEMs and firewalls can detect. We present case studies highlighting how:
•An advanced manufacturer was able to identify new insider threats, enabling them to protect their IP
•A media company’s security operations center was able to verify they weren’t the source of a high-profile media leak.
The common thread across these real-world case studies is how businesses can expand their threat analysis using security analytics powered by artificial intelligence in a big data environment.
Cybersecurity threats increasingly require the aggregation and analysis of multiple data sources. Siloed tools and technologies serve their purpose, but can’t be applied to look across the ever-growing variety and volume of traffic. Big data technologies are a proven solution to aggregating and analysing data across enormous volumes and varieties of data in a scalable way. However, as security professionals well know, more data doesn’t mean more leads or detection. In fact, all too often more data means slower threat hunting and more missed incidents. The solution is to leverage advanced analytical methods like machine learning.
Machine learning is a powerful mathematical approach that can learn patterns in data to identify relevant areas to focus. By applying these methods, we can automatically learn baseline activity and detect deviations across all data sources to flag high-risk entities that behave differently from their peers or past activity. ROY WILDS, Principal Data Scientist, Interset
Tag based policies using Apache Atlas and Ranger - Vimal Sharma
With an ever increasing need to secure and limit access to sensitive data, enterprises today need an open source solution. Apache Atlas, the metadata and governance framework for Hadoop, joins hands with Apache Ranger, the security enforcement framework for Hadoop, to address the need for compliance and security. Vimal will discuss the security and compliance requirements and demonstrate how the combination of Atlas and Ranger solves the problem. Vimal will focus on tag-based policy enforcement, which is an elegant solution for large Hadoop clusters with a wide variety of data.
Iasi Code Camp 20 April 2013: Testing Big Data - Anca Sfecla, Embarcadero - Codecamp Romania
This document discusses testing of big data systems. It defines big data and its key characteristics of volume, variety, velocity and value. It provides examples of big data success stories and compares enterprise data warehouses to big data. The document outlines the typical architecture of a big data system including pre-processing, MapReduce, data extraction and loading. It identifies potential problems at each stage and for non-functional testing. Finally, it covers new challenges for testers in validating big data systems.
Best Practices For Building and Operating A Managed Data Lake - StampedeCon 2016 - StampedeCon
The document discusses using a data lake approach with EMC Isilon storage to address various business use cases. It describes how the solution provides shared storage for multiple workloads through multi-protocol support, enables data protection and isolation of client data, and allows testing applications across Hadoop distributions through a common platform. Examples are given of how this approach supports an enterprise data hub, data warehouse offloading, data integration, and enrichment services.
Analyzing Data Streams in Real Time with Amazon Kinesis: PNNL's Serverless Da... - Amazon Web Services
Amazon Kinesis makes it easy to collect, process, and analyze real-time, streaming data so you can get timely insights and react quickly to new information. In this session, we first present an end-to-end streaming data solution using Amazon Kinesis Data Streams for data ingestion, Amazon Kinesis Data Analytics for real-time processing, and Amazon Kinesis Data Firehose for persistence. We review in detail how to write SQL queries for operational monitoring using Kinesis Data Analytics.
Learn how PNNL is building their ingestion flow into their Serverless Data Lake leveraging the Kinesis platform: migrating existing NiFi processes, where applicable, to various parts of the Kinesis platform; replacing complex NiFi flows that bundle and compress the data with Kinesis Data Firehose; leveraging Kinesis Data Streams for their enrichment and transformation pipelines; and using Kinesis Data Analytics to filter, aggregate, and detect anomalies.
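As a toy illustration of the ingestion side of such a flow, the snippet below pushes one JSON record into a Kinesis Data Stream with boto3; the stream name, region, and record fields are placeholders, not PNNL's.

```python
# Toy Kinesis producer; downstream Firehose/Analytics consumers take over from here.
import json
import boto3

kinesis = boto3.client("kinesis", region_name="us-west-2")

record = {"sensor_id": "bldg-12-hvac-03", "reading": 71.4, "ts": "2018-06-01T12:00:00Z"}

kinesis.put_record(
    StreamName="serverless-data-lake-ingest",   # placeholder stream name
    Data=json.dumps(record).encode("utf-8"),
    PartitionKey=record["sensor_id"],           # distributes records across shards
)
```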
The document discusses transforming IT with AWS cloud services. It describes AWS's layered architecture with foundational, platform and application services. It provides guidance on planning a cloud transformation including developing people skills, conducting assessments, creating a roadmap, financial analysis, technology fit, and aligning with enterprise IT programs. The document recommends standardizing on cloud patterns, using the full breadth of AWS services, and investing in a discovery workshop to build a cloud strategy.
Learn about data lifecycle best practices in the AWS Cloud. Discover how to optimise performance and lower the costs of data ingestion, staging, storage, cleansing, analytics, visualisation, and archiving.
Building a Real-Time Security Application Using Log Data and Machine Learning... - Sri Ambati
Building a Real-Time Security Application Using Log Data and Machine Learning- Karthik Aaravabhoomi
- Powered by the open source machine learning software H2O.ai. Contributors welcome at: https://github.com/h2oai
- To view videos on H2O open source machine learning software, go to: https://www.youtube.com/user/0xdata
Horses for Courses: Database Roundtable - Eric Kavanagh
The blessing and curse of today's database market? So many choices! While relational databases still dominate the day-to-day business, a host of alternatives has evolved around very specific use cases: graph, document, NoSQL, hybrid (HTAP), column store, the list goes on. And the database tools market is teeming with activity as well. Register for this special Research Webcast to hear Dr. Robin Bloor share his early findings about the evolving database market. He'll be joined by Steve Sarsfield of HPE Vertica, and Robert Reeves of Datical in a roundtable discussion with Bloor Group CEO Eric Kavanagh. Send any questions to info@insideanalysis.com, or tweet with #DBSurvival.
(ENT211) Migrating the US Government to the Cloud | AWS re:Invent 2014 - Amazon Web Services
This document discusses a platform called EzBake that was created to help a US government customer modernize their systems and better analyze large amounts of data. EzBake provides tools to easily develop and deploy applications, integrate and analyze data from various sources, and implement security controls. It improved the customer's ability to share data and applications across many teams and networks, decreased development times from 6-8 months to 3-4 weeks, and reduced costs while increasing capabilities.
Data Analytics Week at the San Francisco Loft
Using Data Lakes
A data lake can be used as a source for both structured and unstructured data - but how? We'll look at using open standards including Spark and Presto with Amazon EMR, Amazon Redshift Spectrum and Amazon Athena to process and understand data.
Speakers:
John Mallory - Principal Business Development Manager Storage (Object), AWS
Hemant Borole - Sr. Big Data Consultant, AWS
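The Athena piece of this session can be sketched with boto3 (database, table, and results-bucket names are assumptions): start a query against data in S3 and poll until it finishes:

    import time
    import boto3

    athena = boto3.client("athena")

    def run_query(sql, database="datalake_db", output="s3://my-athena-results/"):
        """Start an Athena query and block until it finishes, returning the rows."""
        qid = athena.start_query_execution(
            QueryString=sql,
            QueryExecutionContext={"Database": database},
            ResultConfiguration={"OutputLocation": output},
        )["QueryExecutionId"]
        while True:
            state = athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]["Status"]["State"]
            if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
                break
            time.sleep(1)
        if state != "SUCCEEDED":
            raise RuntimeError(f"Query {qid} ended in state {state}")
        return athena.get_query_results(QueryExecutionId=qid)["ResultSet"]["Rows"]

    # rows = run_query("SELECT page, COUNT(*) AS hits FROM web_logs GROUP BY page LIMIT 10")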
This document provides an introduction to a course on big data and analytics. It outlines the instructor and teaching assistant contact information. It then lists the main topics to be covered, including data analytics and mining techniques, Hadoop/MapReduce programming, graph databases and analytics. It defines big data and discusses the 3Vs of big data - volume, variety and velocity. It also covers big data technologies like cloud computing, Hadoop, and graph databases. Course requirements and the grading scheme are outlined.
Innovation Track AWS Cloud Experience Argentina - Data Lakes & Analytics en AWS Amazon Web Services LATAM
Data lakes allow organizations to store all types of data in a centralized repository at scale. AWS Lake Formation makes it easy to build secure data lakes by automatically registering and cleaning data, enforcing access permissions, and enabling analytics. Data stored in data lakes can be analyzed using services like Amazon Athena, Redshift, and EMR depending on the type of analysis and latency required.
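As one hedged example of the permission model (the role ARN, database, and table names are placeholders), granting a principal column-filtered SELECT on a cataloged table can be scripted with boto3:

    import boto3

    lakeformation = boto3.client("lakeformation")

    # Grant SELECT on two columns of a Glue-cataloged table to an analyst role.
    lakeformation.grant_permissions(
        Principal={"DataLakePrincipalIdentifier": "arn:aws:iam::123456789012:role/AnalystRole"},
        Resource={
            "TableWithColumns": {
                "DatabaseName": "sales_db",
                "Name": "orders",
                "ColumnNames": ["order_id", "order_total"],
            }
        },
        Permissions=["SELECT"],
    )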
This presentation discusses big data challenges and provides an overview of the AWS Big Data Platform by covering:
- How AWS customers leverage the platform to manage massive volumes of data from a variety of sources while containing costs.
- Reference architectures for popular use cases, including connected devices (IoT), log streaming, real-time intelligence, and analytics.
- The AWS big data portfolio of services, including Amazon S3, Kinesis, DynamoDB, Elastic MapReduce (EMR), and Redshift.
- The latest relational database engine, Amazon Aurora: a MySQL-compatible, highly available relational database engine that provides up to five times better performance than MySQL at one-tenth the cost of a commercial database.
Created by: Rahul Pathak,
Sr. Manager of Software Development
A data lake can be used as a source for both structured and unstructured data - but how? We'll look at using open standards including Spark and Presto with Amazon EMR, Amazon Redshift Spectrum and Amazon Athena to process and understand data.
Level: Intermediate
Speakers:
Tony Nguyen - Senior Consultant, ProServe, AWS
Hannah Marlowe - Consultant - Federal, AWS
Join us for a series of introductory and technical sessions on AWS Big Data solutions. Gain a thorough understanding of what Amazon Web Services offers across the big data lifecycle and learn architectural best practices for applying those solutions to your projects.
We will kick off this technical seminar in the morning with an introduction to the AWS Big Data platform, including a discussion of popular use cases and reference architectures. In the afternoon, we will deep dive into Machine Learning and Streaming Analytics. We will then walk everyone through building your first Big Data application with AWS.
This document summarizes a presentation about Hansen Technologies' migration of their IT infrastructure from an on-premises data center to AWS. It discusses Hansen's motivations for migrating, the process they went through with migration partner Apps Associates, and the benefits they experienced after migrating to AWS, including lower costs, improved uptime, and ability to leverage managed services. It also provides an overview of considerations for migrating applications and databases to AWS and security best practices in the cloud.
This session will provide an overview of the AWS storage portfolio, including block, file, object, and cloud data migration services. We will also touch on new offerings, outline some of the most common use cases, and prepare you for the individual deep dive sessions, customer sessions, and new announcements.
ADV Slides: The Evolution of the Data Platform and What It Means to Enterpris...DATAVERSITY
Thirty years is a long time for a technology foundation to remain as dominant as relational databases have been. Are their replacements here?
In this webinar, we look at this foundational technology for modern Data Management and show how it evolved to meet the workloads of today, as well as when other platforms make sense for enterprise data.
FSI201 FINRA’s Managed Data Lake – Next Gen Analytics in the CloudAmazon Web Services
FINRA’s Data Lake unlocks the value in its data to accelerate analytics and machine learning at scale. FINRA's Technology group has changed its customer's relationship with data by creating a Managed Data Lake that enables discovery on Petabytes of capital markets data, while saving time and money over traditional analytics solutions. FINRA’s Managed Data Lake includes a centralized data catalog and separates storage from compute, allowing users to query from petabytes of data in seconds. Learn how FINRA uses Spot instances and services such as Amazon S3, Amazon EMR, Amazon Redshift, and AWS Lambda to provide the 'right tool for the right job' at each step in the data processing pipeline. All of this is done while meeting FINRA’s security and compliance responsibilities as a financial regulator.
Agile Big Data Analytics Development: An Architecture-Centric ApproachSoftServe
Presented at The Hawaii International Conference on System Sciences by Hong-Mei Chen and Rick Kazman (University of Hawaii), Serge Haziyev (SoftServe).
A data lake allows an organisation to store all of its data, structured and unstructured, in one centralised repository. Since data can be stored as-is, there is no need to convert it to a predefined schema, and you no longer need to know what questions you want to ask of your data beforehand. In this session we will explore the architecture of a Data Lake on AWS and cover topics such as storage, processing and security.
How to build Forecasting services using ML and deep learn...Amazon Web Services
Forecasting is an important process for a great many companies and is used in many areas to try to accurately predict the growth and distribution of a product, the use of resources needed on production lines, financial presentations, and much more. Amazon uses advanced forecasting techniques, and some of these services have been made available to all AWS customers.
In this session we will show how to pre-process data that contains a temporal component and then use an algorithm that, based on the type of data analyzed, produces an accurate forecast.
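A toy example of the kind of pre-processing meant here (column names and values are invented): using pandas to aggregate an irregular time series onto a daily grid before handing it to a forecasting algorithm:

    import pandas as pd

    # Raw events with an irregular timestamp column, e.g. individual sales records.
    df = pd.DataFrame({
        "timestamp": pd.to_datetime(["2020-01-01 09:00", "2020-01-01 15:30", "2020-01-03 11:00"]),
        "units_sold": [3, 5, 2],
    })

    daily = (
        df.set_index("timestamp")
          .resample("D")["units_sold"]
          .sum()          # one row per calendar day; empty days sum to 0
    )
    print(daily)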
Big Data for Startups: how to create Big Data applications in Server...Amazon Web Services
The variety and quantity of data created every day keeps accelerating and represents an unrepeatable opportunity to innovate and create new startups.
However, managing large amounts of data can seem complex: building large-scale Big Data clusters looks like an investment accessible only to established companies. But the elasticity of the Cloud and, in particular, Serverless services let us break through these limits.
Let's see how it is possible to develop Big Data applications quickly, without worrying about the infrastructure, devoting all our resources to developing our ideas and creating innovative products.
You can now use Amazon Elastic Kubernetes Service (EKS) to run Kubernetes pods on AWS Fargate, the serverless compute engine built for containers on AWS. This makes it easier than ever to build and run your Kubernetes applications in the AWS cloud. In this session we will present the main features of the service and how to deploy your application in a few steps.
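A minimal boto3 sketch of the wiring (cluster name, role ARN, and subnet IDs are placeholders): create a Fargate profile so that pods in a given namespace are scheduled onto Fargate:

    import boto3

    eks = boto3.client("eks")

    # Pods in the "apps" namespace of this cluster will be scheduled onto Fargate.
    eks.create_fargate_profile(
        fargateProfileName="apps-on-fargate",
        clusterName="demo-cluster",
        podExecutionRoleArn="arn:aws:iam::123456789012:role/EksFargatePodExecutionRole",
        subnets=["subnet-0abc1234", "subnet-0def5678"],
        selectors=[{"namespace": "apps"}],
    )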
Twenty years ago, Amazon went through a radical transformation aimed at increasing the pace of innovation. Over that period we learned how changing our approach to application development allowed us to greatly increase agility and release velocity and, ultimately, to build more reliable and scalable applications. In this session we will explain how we define modern applications and how building modern apps affects not only the application architecture but also the organizational structure, the development release pipelines, and even the operating model. We will also describe common approaches to modernization, including the approach used by Amazon.com itself.
How to spend up to 90% less with containers and Spot instances Amazon Web Services
The use of containers keeps growing.
When properly designed, container-based applications are very often stateless and flexible.
The AWS services ECS, EKS, and Kubernetes on EC2 can take advantage of Spot instances, leading to average savings of 70% compared to On-Demand instances. In this session we will look at the characteristics of Spot instances and how easily they can be used on AWS. We will also learn how Spreaker uses Spot instances to run applications of different kinds, in production, at a fraction of the on-demand cost!
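As a hedged sketch of the EC2 side (the AMI ID and instance type are placeholders), container hosts can be requested as Spot capacity simply by adding spot market options to run_instances:

    import boto3

    ec2 = boto3.client("ec2")

    # Launch two container hosts as Spot instances instead of On-Demand.
    ec2.run_instances(
        ImageId="ami-0123456789abcdef0",        # e.g. an ECS-optimized AMI
        InstanceType="m5.large",
        MinCount=2,
        MaxCount=2,
        InstanceMarketOptions={
            "MarketType": "spot",
            "SpotOptions": {"SpotInstanceType": "one-time"},
        },
    )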
In recent months, many customers have been asking us how to monetise Open APIs, simplify Fintech integrations, and accelerate adoption of various Open Banking business models. Therefore, AWS and FinConecta would like to invite you to the Open Finance marketplace presentation on October 20th.
Event Agenda:
Open banking so far (short recap)
• PSD2, OB UK, OB Australia, OB LATAM, OB Israel
Intro to Open Finance marketplace
• Scope
• Features
• Tech overview and Demo
The role of the Cloud
The Future of APIs
• Complying with regulation
• Monetizing data / APIs
• Business models
• Time to market
One platform for all: a Strategic approach
Q&A
Make your startup's market offering unique with Machine Lea...Amazon Web Services
To create value and build a differentiated, recognizable offering, successful startups know how to combine established technologies with innovative, purpose-built components.
AWS provides ready-to-use services and, at the same time, lets you customize and create the differentiating elements of your own offering.
Focusing on Machine Learning technologies, we will see how to select the artificial intelligence services offered by AWS and, with the help of a demo, how to build custom Machine Learning models using SageMaker Studio.
OpsWorks Configuration Management: automate the management and deployments of...Amazon Web Services
With the traditional approach to IT, it was difficult for many years to implement DevOps techniques, which until now have often involved manual activities, occasionally causing application downtime and interrupting users' operations. With the advent of the cloud, DevOps techniques are now within everyone's reach, at low cost, for any kind of workload, guaranteeing greater system reliability and yielding significant improvements in business continuity.
AWS provides AWS OpsWorks as a Configuration Management tool that aims to automate and simplify the management and deployment of EC2 instances using Chef and Puppet.
Learn how to use AWS OpsWorks to guarantee the reliability of your application running on EC2 instances.
Microsoft Active Directory on AWS to support your Windows WorkloadsAmazon Web Services
Want to know your options for running Microsoft Active Directory on AWS? When moving Microsoft workloads to AWS, it is important to consider how to deploy Microsoft Active Directory to support group policy management, authentication, and authorization. In this session, we will discuss options for deploying Microsoft Active Directory on AWS, including AWS Directory Service for Microsoft Active Directory and deploying Active Directory on Windows on Amazon Elastic Compute Cloud (Amazon EC2). We cover topics such as integrating your on-premises Microsoft Active Directory environment into the cloud and using SaaS applications, such as Office 365, with AWS Single Sign-On.
From facial recognition to detecting fraud or manufacturing defects, image and video analysis powered by artificial intelligence techniques is evolving and being refined at a rapid pace. In this webinar we will explore the possibilities offered by AWS services for applying state-of-the-art computer vision techniques to real-world scenarios.
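As a taste of what these services look like in code, a minimal boto3 call to Amazon Rekognition that labels the objects in an image stored in S3 (the bucket and key are placeholders):

    import boto3

    rekognition = boto3.client("rekognition")

    response = rekognition.detect_labels(
        Image={"S3Object": {"Bucket": "my-images", "Name": "factory/line-42.jpg"}},
        MaxLabels=10,
        MinConfidence=80,
    )
    for label in response["Labels"]:
        print(label["Name"], round(label["Confidence"], 1))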
Amazon Web Services and VMware are organizing a free virtual event next Wednesday, October 14th from 12:00 to 13:00 dedicated to VMware Cloud™ on AWS, the on-demand service that lets you run applications in cloud environments based on VMware vSphere® and access a wide range of AWS services, making full use of the AWS cloud while protecting your existing VMware investments.
Many organizations take advantage of the cloud by migrating their Oracle workloads and securing significant gains in agility and cost efficiency.
Migrating these workloads can create complexity during application modernization and refactoring, on top of performance risks that may be introduced when moving applications out of on-premises data centers.
Build your first serverless ledger-based app with QLDB and NodeJSAmazon Web Services
Many companies today build applications with ledger-style functionality, for example to verify the history of credits and debits in banking transactions, or to track the supply chain flow of their products.
At the heart of these solutions are ledger databases, which provide a transparent, immutable, and cryptographically verifiable transaction log, but which are complex and costly tools to manage.
Amazon QLDB removes the need to build custom, complex systems by providing a fully managed, serverless ledger database.
In this session we will see how to build a complete serverless application that uses QLDB's capabilities.
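The session builds on NodeJS; purely as a hedged Python illustration (the ledger name is a placeholder), the ledger itself can be created with boto3, after which reads and writes go through a QLDB driver using PartiQL:

    import boto3

    qldb = boto3.client("qldb")

    # Create a fully managed ledger; journal contents are immutable and verifiable.
    qldb.create_ledger(
        Name="supply-chain-ledger",
        PermissionsMode="ALLOW_ALL",   # simplest mode; STANDARD allows fine-grained IAM
        DeletionProtection=True,
    )

    # Application reads and writes then go through a QLDB driver
    # (pyqldb in Python, the NodeJS driver in the talk) using PartiQL statements.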
With the rise of microservices architectures and rich mobile and web applications, APIs are more important than ever for delivering an exceptional user experience to end users. In this session we will learn how to address modern API design challenges with GraphQL, an open-source API query language used by Facebook, Amazon, and others, and how to use AWS AppSync, a managed serverless GraphQL service on AWS. We will dive into several scenarios, understanding how AppSync can help solve these use cases by building modern APIs with real-time and offline data update capabilities.
We will also learn how Sky Italia uses AWS AppSync to deliver real-time sports updates to users of its web portal.
Oracle databases and VMware Cloud™ on AWS: myths to debunkAmazon Web Services
Many organizations take advantage of the cloud by migrating their Oracle workloads and securing significant gains in agility and cost efficiency.
Migrating these workloads can create complexity during application modernization and refactoring, on top of performance risks that may be introduced when moving applications out of on-premises data centers.
In these slides, AWS and VMware experts present simple, practical tips to ease and simplify the migration of Oracle workloads and accelerate the transformation to the cloud; they also dive into the architecture and show how to take full advantage of VMware Cloud™ on AWS.
1) The document discusses building a minimum viable product (MVP) using Amazon Web Services (AWS).
2) It provides an example of an MVP for an omni-channel messenger platform, built starting in 2017, that connects ecommerce stores to customers via web chat, Facebook Messenger, WhatsApp, and other channels.
3) The founder discusses how they started with an MVP in 2017 with 200 ecommerce stores in Hong Kong and Taiwan, and have since expanded to over 5000 clients across Southeast Asia using AWS for scaling.
This document discusses pitch decks and fundraising materials. It explains that venture capitalists will typically spend only 3 minutes and 44 seconds reviewing a pitch deck. Therefore, the deck needs to tell a compelling story to grab their attention. It also provides tips on tailoring different types of decks for different purposes, such as creating a concise 1-2 page teaser, a presentation deck for pitching in-person, and a more detailed read-only or fundraising deck. The document stresses the importance of including key information like the problem, solution, product, traction, market size, plans, team, and ask.
This document discusses building serverless web applications using AWS services like API Gateway, Lambda, DynamoDB, S3 and Amplify. It provides an overview of each service and how they can work together to create a scalable, secure and cost-effective serverless application stack without having to manage servers or infrastructure. Key services covered include API Gateway for hosting APIs, Lambda for backend logic, DynamoDB for database needs, S3 for static content, and Amplify for frontend hosting and continuous deployment.
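As a hedged sketch of the Lambda piece (the table name and payload shape are invented), here is a handler that API Gateway could invoke to persist an item to DynamoDB:

    import json
    import os
    import uuid

    import boto3

    table = boto3.resource("dynamodb").Table(os.environ.get("TABLE_NAME", "notes"))

    def handler(event, context):
        """Proxy-integration handler: store the request body as a new item."""
        body = json.loads(event.get("body") or "{}")
        item = {"id": str(uuid.uuid4()), "text": body.get("text", "")}
        table.put_item(Item=item)
        return {
            "statusCode": 201,
            "headers": {"Content-Type": "application/json"},
            "body": json.dumps(item),
        }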
This document provides tips for fundraising from startup founders Roland Yau and Sze Lok Chan. It discusses generating competition to create urgency for investors, fundraising in parallel rather than sequentially, having a clear fundraising narrative focused on what you do and why it's compelling, and prioritizing relationships with people over firms. It also notes how the pandemic has changed fundraising, with examples of deals done virtually during this time. The tips emphasize being fully prepared before fundraising and cultivating connections with investors in advance.
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...Amazon Web Services
This document discusses Amazon's machine learning services for building conversational interfaces and extracting insights from unstructured text and audio. It describes Amazon Lex for creating chatbots, Amazon Comprehend for natural language processing tasks like entity extraction and sentiment analysis, and how they can be used together for applications like intelligent call centers and content analysis. Pre-trained APIs simplify adding machine learning to apps without requiring ML expertise.
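For example, a single pair of Comprehend calls can pull sentiment and entities out of a free-text support message (the sample text is invented):

    import boto3

    comprehend = boto3.client("comprehend")

    text = "My order #1234 arrived two weeks late and the box was damaged."

    sentiment = comprehend.detect_sentiment(Text=text, LanguageCode="en")
    entities = comprehend.detect_entities(Text=text, LanguageCode="en")

    print(sentiment["Sentiment"])                      # e.g. NEGATIVE
    print([e["Text"] for e in entities["Entities"]])   # detected entity strings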
Amazon Elastic Container Service (Amazon ECS) is a highly scalable container management service that simplifies managing Docker containers through an orchestration layer that controls deployment and lifecycle. In this session we will present the main features of the service, reference architectures for different workloads, and the simple steps needed to quickly migrate one or more of your containers.
2. Envision Engineering – About Us
The Solution: bring the "Art of the Possible" to our customers.
- Collaborative: iterate and deliver based on constant feedback from customer stakeholders.
- Business Solutions: focus on solving business challenges, not technology challenges.
- Specialized Team: end-to-end development approach, services, and skills.
Touchable, tangible results are more impactful than an architecture diagram. Analysis paralysis and uncertainties are barriers to cloud adoption: Which technology? Can it work? Where to begin?
4. Envision Engineering – What is a Data Lake?
A data lake brings together:
- Centralized Storage
- Security Controls
- Application Integration
- Lineage and Auditing
5. Customer Example – United States Census
Core Business Challenge:
- Core data & copies
- Data security
- Auditing
- Usage monitoring
- Reproducibility
- Storage constraints
- Compute constraints
6. United States Census – Core Use Cases
- Column-level access control (with cell-level capability)
- Data lineage (macro level)
- On-demand infrastructure for analytics jobs
- Cost tracking per analytics job
- Hadoop platform choice: Amazon EMR and Hortonworks HDP
- Ability to run legacy scripts (SAS 9.4)
- Centralized storage
- LDAP-based user security/permissions
7. Deep Dive into Implementation – Column Security
- Custom Accumulo Loader: loads datasets from S3 into Accumulo table(s) and assigns column names as security labels.
- Custom Accumulo Authorization handler: checks which labels the user has access to (in LDAP).
- Installed via a bootstrap script on EMR (Elastic MapReduce); installed on the Hortonworks cluster via Apache Ambari Blueprints.
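The deck does not include the loader or handler code; as a library-free Python sketch of the idea (names and data are invented), cells are filtered against the set of column labels the user is authorized for:

    # Each cell carries the column name it was loaded under as its security label,
    # mirroring the custom Accumulo loader described above.
    def visible_cells(row, user_authorizations):
        """Return only the cells whose column label the user may read."""
        return {col: value for col, value in row.items() if col in user_authorizations}

    # An LDAP lookup (stubbed here) would yield the user's allowed labels.
    authorizations = {"age", "state"}          # e.g. from an LDAP group mapping
    row = {"name": "J. Doe", "age": 42, "state": "MD", "ssn": "***"}

    print(visible_cells(row, authorizations))  # {'age': 42, 'state': 'MD'}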
8. Deep Dive into Implementation – Hortonworks
Hortonworks cluster on EC2:
- Create recipes:
  - Accumulo setup: 1. install the loader, 2. install the custom auth handler, 3. stop Accumulo, 4. start Accumulo
  - Import data & run SAS
- Create the stack from the blueprint
9. Deep Dive into Implementation – SAS Script Execution
- A SAS instance is launched per analytics task (on-demand).
- AWS Systems Manager "Run Command" triggers a remote shell script.
- The shell script downloads the SAS script from S3 and runs it via SAS.
- SAS accesses the data via the Hive endpoint on Hadoop, reading from an external table linked to the Accumulo table.
- SAS persists results locally.
- The shell script copies the results to Amazon S3.
(Architecture diagram: an Amazon EC2 SAS instance built from an Amazon AMI with SAS 9.4, Amazon EMR, an Amazon S3 bucket, and AWS Systems Manager.)
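A hedged Python sketch of the Run Command step (the instance ID, bucket, and SAS binary path are placeholders, not taken from the deck):

    import boto3

    ssm = boto3.client("ssm")

    commands = [
        "aws s3 cp s3://analytics-bucket/programs/report.sas /tmp/report.sas",
        "/opt/sas/sasfoundation/9.4/sas /tmp/report.sas -log /tmp/report.log",
        "aws s3 cp /tmp/report.log s3://analytics-bucket/results/report.log",
    ]

    response = ssm.send_command(
        InstanceIds=["i-0abc1234def567890"],        # the on-demand SAS instance
        DocumentName="AWS-RunShellScript",
        Parameters={"commands": commands},
    )
    print(response["Command"]["CommandId"])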
10. Deep Dive into Implementation – Pulling It All Together
1. The user initiates an analysis routine based on selected data, supplying: 1) the analysis routine, 2) the data file, 3) the AD group.
2. A NodeJS Lambda function launches EMR/HDX via the SDK/API.
3. A Hadoop cluster is launched and bootstrapped to install Accumulo and Hive.
4. A custom Java routine creates Accumulo rights and data tables and loads the data.
5. Hive tables are created based on the data visible to the user in Accumulo.
6. A SAS AMI is launched with the Hive connection details.
7. A SAS program is run and the results are stored in S3; the AWS instances and services are then terminated, returning: 1) the location of the results, 2) the location of the logs.
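Step 2 uses a NodeJS Lambda; as a hedged Python equivalent (bucket, bootstrap script path, and instance counts are placeholders), the handler below launches a transient EMR cluster with a bootstrap action that installs the custom Accumulo components:

    import boto3

    emr = boto3.client("emr")

    def handler(event, context):
        """Launch a transient EMR cluster bootstrapped with Accumulo and Hive."""
        response = emr.run_job_flow(
            Name=f"analysis-{event.get('analysisRoutine', 'adhoc')}",
            ReleaseLabel="emr-5.29.0",
            Applications=[{"Name": "Hadoop"}, {"Name": "Hive"}],
            Instances={
                "InstanceGroups": [
                    {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
                    {"InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 3},
                ],
                "KeepJobFlowAliveWhenNoSteps": True,
            },
            BootstrapActions=[{
                "Name": "install-accumulo-loader",
                "ScriptBootstrapAction": {"Path": "s3://analytics-bucket/bootstrap/install_accumulo.sh"},
            }],
            JobFlowRole="EMR_EC2_DefaultRole",
            ServiceRole="EMR_DefaultRole",
        )
        return {"clusterId": response["JobFlowId"]}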
11. Deep Dive into Implementation – Serverless UI
(Architecture diagram: a serverless single-page app on the client side calls the server side through Amazon API Gateway.)
14. US Census Summary and Next Steps
- The Data Lake provides:
  - Centralized, secured storage
  - On-demand analytics environment
  - Data and program lineage
  - Re-use of existing data and SAS programs
- What's next:
  - Authority to Operate in a FedRAMP High environment
  - Spin-up of interactive environments
  - Control of AWS images and cost by user and group
  - Deeper integration with Apache Ranger and Atlas
15. Customer Example – USC Alzheimer's Therapeutic Research Institute
The USC ATRI mission is to create a leading hub of basic, translational and clinical research in neuroscience and neurological diseases by collaborating with sites and investigators around the world.
16. Customer Example – ATRI
Core Business Challenge: core data & routines are duplicated across isolated silos (Silo A, Silo B, Silo C).
17. ATRI – Core Use Cases
- Collect data from multiple sources
- Data lineage
- On-demand infrastructure for analytics jobs
- HIPAA-eligible environment
- LDAP-based user security/permissions
- Data discovery
18. Outcome
- Web-accessible data lake that demonstrates:
  - User authentication and authorization
  - Text-based search and discovery based on project name, files within the project, and columns within tables/CSV files
  - Control of access roles and rights on a data set
  - Analytics task execution scripts against selected data sets (R, Python, and Java)
  - Audit information for data storage, sharing, and usage
- REST-like API(s) for uploading and updating data in the data lake
- Storage of data sets in the data lake via scripting/automation
- Storage of data in a HIPAA-eligible environment
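A hedged sketch of the scripting/automation path (the bucket name, key, and metadata fields are invented): push a data set into the lake's S3 bucket with the attributes a search layer could index:

    import boto3

    s3 = boto3.client("s3")

    # Upload a dataset file and tag it with the attributes used for discovery.
    s3.upload_file(
        Filename="study_visit3.csv",
        Bucket="atri-data-lake",
        Key="projects/study-a/visit3/study_visit3.csv",
        ExtraArgs={
            "Metadata": {"project": "study-a", "columns": "subject_id,visit,score"},
            "ServerSideEncryption": "AES256",
        },
    )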
19. Customer Example – National Heart, Lung, and Blood Institute
The National Heart, Lung, and Blood Institute's (NHLBI) mission is to provide global leadership for a research, training, and education program to promote the prevention and treatment of heart, lung, and blood disease. To this end, institutions, scientists, and researchers rely on data provided by the NHLBI to drive basic discoveries about the causes of disease and translate those discoveries into clinical practice.
20. Customer Example – NHLBI
Core Business Challenge:
- Massive amounts of genetic data
- Consent management held by an outside group
- Auditing
- Compute constraints
21. NHLBI – Core Use Cases
- Data lineage
- On-demand infrastructure for genomics tasks
- Cost tracking
- Centralized storage
- Consent Group-based access controls
- Data discovery
22. Outcome
- Web-accessible data lake that demonstrates:
  - User authentication and authorization based on an internal identity provider
  - Text-based search and discovery of dbGaP-controlled studies
  - Control of access roles and rights on a study by Consent Group
  - On-demand genomics tooling based on selected data files (samtools, bcftools, htsget, PLINK, etc.)
  - Audit information for data storage, sharing, and usage
23. NHLBI Solution Architecture
(Architecture diagram: users authenticate via SAML, and the SAML assertion carries Consent Group permissions. NIH CIT supplies dbGaP/SRA study details, metadata, run lists, and permission details. File access requests are secured through IAM roles. A UI fronts the NHLBI Data Lake, whose storage holds Study 1 through Study N.)
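A hedged sketch of the consent-group check (group names, bucket, and keys are invented): the groups carried in the SAML assertion are compared against the study's required consent group before a short-lived pre-signed S3 URL is issued:

    import boto3

    s3 = boto3.client("s3")

    STUDY_CONSENT = {"study-001": "GRU", "study-002": "HMB"}   # required consent group per study

    def get_download_url(study_id, object_key, user_consent_groups):
        """Return a pre-signed URL only if the user's consent groups cover the study."""
        required = STUDY_CONSENT[study_id]
        if required not in user_consent_groups:
            raise PermissionError(f"User lacks consent group {required} for {study_id}")
        return s3.generate_presigned_url(
            "get_object",
            Params={"Bucket": "nhlbi-data-lake", "Key": f"{study_id}/{object_key}"},
            ExpiresIn=900,   # 15 minutes
        )

    # url = get_download_url("study-001", "cram/sample42.cram", {"GRU", "DS-CVD"})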
24. Uses of Data Lake – In Summary
Common needs across verticals, and common services to meet data lake needs:
- Centralized Storage
- Security Controls
- Application Integration
- Lineage and Auditing