At the StampedeCon 2015 Big Data Conference: The starting point for this project was a MapReduce application that processed log files produced by the support portal. The application ran on Hadoop with Ruby Wukong. When the project started, it was underperforming and did not scale well. This made the case for redesigning it using Spark with Scala and Java.
An initial review of the Ruby code revealed that it used disk I/O excessively to communicate between MapReduce jobs. Each job was implemented as a separate script that passed large data volumes through to the next. Spark manages intermediate data between jobs more efficiently: not only does it keep the data in memory whenever possible, it often eliminates the need for intermediate data altogether. However, that alone did not bring us much improvement, since there were additional bottlenecks at the data aggregation stages.
The application involved a global data ordering step, followed by several localized aggregation steps. This first global sort required significant data shuffle that was inefficient. Spark allowed us to partition the data and convert a single global sort into many local sorts, each running on a single node and not exchanging any data with other nodes. As a result, several data processing steps started to fit into node memory, which brought about a tenfold performance improvement.
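The idea of replacing one global sort with many local sorts can be sketched in plain Python. This is a minimal illustration, not the project's code: the record layout `(key, timestamp, action)`, the function names, and the sample data are all hypothetical. In Spark, the same pattern is available through operations such as `repartitionAndSortWithinPartitions`; the abstract does not say which call the project used.

```python
import zlib
from collections import defaultdict


def partition_of(key, num_partitions):
    # Deterministic hash, so a given key always maps to the same partition.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def partition_and_sort_locally(records, num_partitions):
    """Route records to partitions by key, then sort each partition on its own."""
    partitions = defaultdict(list)
    for record in records:
        partitions[partition_of(record[0], num_partitions)].append(record)
    # No partition needs data from any other partition to complete its sort.
    return {
        pid: sorted(rows, key=lambda r: (r[0], r[1]))
        for pid, rows in partitions.items()
    }


def aggregate_locally(sorted_rows):
    """Localized aggregation: the time-ordered list of actions per key."""
    result = {}
    for key, _timestamp, action in sorted_rows:
        result.setdefault(key, []).append(action)
    return result


# Hypothetical log records: (key, timestamp, action).
records = [
    ("user-b", 5, "logout"),
    ("user-a", 2, "search"),
    ("user-b", 1, "login"),
    ("user-a", 1, "login"),
    ("user-c", 7, "login"),
    ("user-a", 3, "download"),
]

local = partition_and_sort_locally(records, num_partitions=3)
sessions = {}
for rows in local.values():
    sessions.update(aggregate_locally(rows))
```

Because every record for a key lands in the same partition, the per-key aggregation sees its data in order without any globally sorted dataset ever being built.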
Analytics, Big Data and Nonvolatile Memory Architectures – Why you Should Car... - StampedeCon
This session will begin with an overview of current non-volatile memory (NVM, aka persistent memory) architectures and their relationship to the several levels of the memory and storage hierarchy, both near- and far-processor. A discussion of their significant impact on analytic computing workloads, now and in the near future, will follow, including use cases and the concept of very large persistent memory surfaces as applied to both analytic computation and storage for big data workflows. The presentation will end with ‘why you should care’ about such technologies, which will inevitably change the way we think about solving data-intensive problems.
Resilience: the key requirement of a [big] [data] architecture - StampedeCon... - StampedeCon
From the StampedeCon 2015 Big Data Conference: There is an adage, “If you fail to plan, you plan to fail”. When developing systems the adage can be taken a step further: “If you fail to plan FOR FAILURE, you plan to fail”. At the Huffington Post, data moves between a number of systems to provide statistics for our technical, business, and editorial teams. Due to the mission-critical nature of our data, considerable effort is spent building resiliency into processes.
This talk will focus on designing for failure. Some material will focus on understanding the traits of specific distributed systems, such as message queues or NoSQL databases, and the consequences of different types of failures. Other parts of the presentation will focus on how systems and software can be designed to make re-processing batch data simple, or how to determine which failure mode semantics are important for a real-time event processing system.
Lifting the hood on Spark Streaming - StampedeCon 2015 - StampedeCon
At the StampedeCon 2015 Big Data Conference: Today, if a byte of data were a gallon of water, in only 10 seconds there would be enough data to fill an average home; in 2020 it will take only 2 seconds. The Internet of Things is driving a tremendous amount of this growth, providing more data at a higher rate than we’ve ever seen. With this explosive growth comes the demand from consumers and businesses to leverage and act on what is happening right now. Without stream processing these demands will never be met, and there will be no big data and no Internet of Things. Apache Spark, and Spark Streaming in particular, can be used to fulfill this stream processing need now and in the future. In this talk I will peel back the covers and we will take a deep dive into the inner workings of Spark Streaming, discussing topics such as DStreams, input and output operations, transformations, and fault tolerance.
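A DStream represents a stream as a sequence of small batches, one per batch interval, with each transformation applied batch by batch. The following plain-Python sketch mimics that discretization; it does not use the Spark API, and the function names, event layout, and sample data are hypothetical.

```python
from collections import Counter


def discretize(events, batch_interval):
    """Group (timestamp, value) events into consecutive fixed-length batches.

    Intervals with no events yield an empty batch, so the output has one
    batch per interval from time 0 up to the last event.
    """
    by_index = {}
    for timestamp, value in events:
        by_index.setdefault(int(timestamp // batch_interval), []).append(value)
    last = max(by_index) if by_index else -1
    return [by_index.get(i, []) for i in range(last + 1)]


def count_per_batch(batches):
    """A per-batch transformation: count occurrences of each value."""
    return [Counter(batch) for batch in batches]


# Hypothetical input: (arrival time in seconds, word).
events = [
    (0.2, "spark"),
    (0.9, "stream"),
    (1.1, "spark"),
    (1.5, "spark"),
    (3.4, "stream"),
]

batches = discretize(events, batch_interval=1.0)
counts = count_per_batch(batches)

# An "output operation": emit one result per batch.
for index, counter in enumerate(counts):
    print(index, dict(counter))
```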
Using Multiple Persistence Layers in Spark to Build a Scalable Prediction Eng... - StampedeCon
At the StampedeCon 2015 Big Data Conference: This talk will examine the benefits of using multiple persistence strategies to build an end-to-end predictive engine. Spark Streaming backed by a Cassandra persistence layer allows rapid lookups and inserts for real-time model scoring. Spark backed by Parquet files stored in HDFS allows high-throughput model training and tuning with Spark MLlib. Both of these persistence layers also support ad-hoc queries via Spark SQL, which makes it easy to analyze model sensitivity and accuracy. Storing the data in this way also makes it possible to leverage existing tools: CQL for operational queries on the data stored in Cassandra, and Impala for larger analytical queries on the data stored in HDFS, further maximizing the benefits of the flexible architecture.
Apache Druid Auto Scale-out/in for Streaming Data Ingestion on Kubernetes - DataWorks Summit
The importance of ingesting and processing streaming data in the telecommunications industry is ever increasing. At SK Telecom, Korea's number-one telecommunications provider, we face the question of how to use infrastructure resources more efficiently. Apache Druid supports an auto scaling feature for data ingestion, but it is only available on AWS EC2, so we cannot rely on it in our private cloud.
In this talk, we introduce auto scale-out/in on Kubernetes. This approach has three advantages over Druid's scaling implementation. First, it can be used anywhere: on a private cloud or on (managed) Kubernetes in Azure, AWS and GKE. Second, starting or terminating an AWS EC2 instance takes a few minutes, whereas our approach takes a few seconds. Third, the scaling mechanism is decoupled from Druid's source code. We will also share our work on a Druid Helm chart, rolling updates, and the use of custom metrics for horizontal auto scaling.
The benefits compared with Druid's auto scaling approach, in more detail:
1. Druid's auto scaling is only available on AWS, but our approach does not have this limitation. It can be used in a private cloud (on-premises) or on (managed) Kubernetes in Azure, AWS and GKE.
2. An AWS EC2 instance is a virtual machine, so it starts more slowly than a Docker container. Starting or terminating an EC2 instance takes a few minutes. A Docker container is very lightweight, so it takes a few seconds.
3. Druid's auto scaling is tightly coupled to the AWS API because the Druid engine code calls it. Our scale-out/in algorithm is conceptually the same as Druid's auto scaling approach, but we removed that dependency: Kubernetes communicates with one of the dispatcher nodes (i.e. the Overlord node) using a REST API.
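A scale-out/in decision of this kind can be sketched as a pure function from task counts to a desired number of ingestion workers. This is an illustrative sketch only, not SK Telecom's algorithm: the metric names, the one-task-per-slot assumption, and the bounds are hypothetical. In a real deployment the task counts would be read from the Overlord over REST and the result applied to the replica count of the Kubernetes workload.

```python
import math


def desired_workers(pending_tasks, running_tasks, slots_per_worker,
                    min_workers, max_workers):
    """Return how many workers are needed to give every task a slot.

    Assumes each task occupies exactly one slot (a simplification).
    """
    needed_slots = pending_tasks + running_tasks
    wanted = math.ceil(needed_slots / slots_per_worker)
    # Clamp to the configured bounds so the cluster never scales to zero
    # or grows without limit.
    return max(min_workers, min(max_workers, wanted))


def scaling_action(current_workers, target_workers):
    """Name the action implied by comparing current and target sizes."""
    if target_workers > current_workers:
        return "scale-out"
    if target_workers < current_workers:
        return "scale-in"
    return "hold"


# Hypothetical readings: 5 tasks waiting, 4 running, 3 slots per worker.
target = desired_workers(pending_tasks=5, running_tasks=4,
                         slots_per_worker=3, min_workers=2, max_workers=10)
action = scaling_action(current_workers=2, target_workers=target)
```

Keeping the decision in a function with no cloud-provider calls reflects the decoupling described in point 3: only the code that reads the metrics and applies the replica count needs to know about the platform.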
Automatski - RSA-2048 Cryptography Cracked using Shor's Algorithm on a Quantu... - Aditya Yadav
Cracking RSA-2048 Cryptography using Shor's Algorithm on a Quantum Computer
We demonstrate live a Pure/Undiluted Implementation of Shor's Algorithm on a 100,000+ Qubit Quantum Computer Simulator by Automatski.
We have hence cracked RSA-2048 and all Existing Cryptography in The World
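For readers unfamiliar with Shor's algorithm, the classical part reduces factoring N to finding the multiplicative order of a random base modulo N. The sketch below shows that reduction in plain Python, with a brute-force loop standing in for the quantum order-finding step, so it is practical only for tiny numbers. It is not the implementation described above, and the function names are illustrative.

```python
from math import gcd


def multiplicative_order(a, n):
    """Smallest r > 0 with a**r % n == 1; requires gcd(a, n) == 1.

    Brute force: this is the step a quantum computer would perform.
    """
    r, value = 1, a % n
    while value != 1:
        value = (value * a) % n
        r += 1
    return r


def factor_from_order(n, a):
    """Try to split n using base a; return (p, q) or None if a is unlucky."""
    common = gcd(a, n)
    if common != 1:
        # The base already shares a factor with n.
        return common, n // common
    r = multiplicative_order(a, n)
    if r % 2 == 1:
        return None  # odd order: pick another base
    half = pow(a, r // 2, n)
    if half == n - 1:
        return None  # a**(r/2) is -1 mod n: pick another base
    # (half - 1) * (half + 1) is a multiple of n, and half is not +-1 mod n,
    # so gcd(half - 1, n) is a non-trivial factor.
    p = gcd(half - 1, n)
    return p, n // p


factors = factor_from_order(15, 7)
```

With N = 15 and base 7 the order is 4, so 7² mod 15 = 4 and gcd(4 − 1, 15) = 3, giving the factors 3 and 5.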
Presentation given for the SQLPass community at SQLBits XIV in London. It is an overview of the performance improvements brought to Hive by the Stinger initiative.
There is increased interest in using Kubernetes, the open-source container orchestration system, for modern, stateful Big Data analytics workloads. The promised land is a unified platform that can handle cloud-native stateless and stateful Big Data applications. However, stateful, multi-service Big Data cluster orchestration brings unique challenges. This session will delve into the technical gaps and considerations for Big Data on Kubernetes.
Containers offer significant value to businesses, including increased developer agility and the ability to move applications between on-premises servers, cloud instances, and data centers. Organizations have embarked on this journey to containerization with an emphasis on stateless workloads. Stateless applications are usually microservices or containerized applications that don’t “store” data. Web services (such as front-end UIs and simple, content-centric experiences) are often great candidates for stateless applications since HTTP is stateless by nature. A stateless workload has no dependency on local container storage.
Stateful applications, on the other hand, are services that require backing storage, and keeping state is critical to running the service. Hadoop, Spark and, to a lesser extent, database platforms such as Cassandra, MongoDB, Postgres, and MySQL are great examples. They require some form of persistent storage that will survive service restarts...
Speakers
Anant Chintamaneni, VP Products, BlueData
Nanda Vijaydev, Director Solutions, BlueData
In my talk I will discuss and show examples of using Apache Hadoop, Apache Hive, Apache MXNet, Apache OpenNLP, Apache NiFi and Apache Spark for deep learning applications. This is the follow-up to last year's Apache Deep Learning 101, which was given at DataWorks Summit and ApacheCon.
As part of my talk I will walk through using Apache MXNet pre-built models, MXNet's new Model Server with Apache NiFi, executing MXNet with Apache NiFi, and running Apache MXNet on edge nodes using Python and Apache MiNiFi.
This talk is geared towards data engineers interested in the basics of deep learning with open source Apache tools in a Big Data environment. I will walk through source code examples available on GitHub and run the code live on an Apache Hadoop / YARN / Apache Spark cluster.
This will be an introduction to executing deep learning pipelines in an Apache Big Data environment.
My talk at DataWorks Summit Sydney was listed in the top 7 -> https://hortonworks.com/blog/7-sessions-dataworks-summit-sydney-see/
I have also spoken at and run Future of Data Princeton, and spoken at Oracle Code NYC.
https://www.slideshare.net/oom65/hadoop-security-architecture?next_slideshow=1
https://community.hortonworks.com/articles/83100/deep-learning-iot-workflows-with-raspberry-pi-mqtt.html
https://community.hortonworks.com/articles/146704/edge-analytics-with-nvidia-jetson-tx1-running-apac.html
https://dzone.com/refcardz/introduction-to-tensorflow
YARN Containerized Services: Fading The Lines Between On-Prem And Cloud - DataWorks Summit
Apache Hadoop YARN is the modern distributed operating system for big data applications. In Apache Hadoop 3.1.0, YARN added a service framework that supports long-running services. This new capability goes hand in hand with the recent improvements in YARN to support Docker containers. Together these features have made it significantly easier to bring new applications and services to YARN.
In this talk you will learn about the YARN service framework, its new containerization capabilities, and how it lays the foundation for a hybrid and uniform architecture for compute and storage across on-prem and multi-cloud environments. This will include examples highlighting how easy it is to bring applications to the YARN service framework, as well as how to containerize applications.
Here's what to expect in this talk:
- Motivation for YARN service framework and containerization
- YARN service framework overview
- YARN service examples
- Containerization overview
- Containerization for Big Data and non Big Data workloads - wait that's everything
2015 nov 27_thug_paytm_rt_ingest_brief_final - Adam Muise
Paytm Labs provides a quick overview of their Hadoop data ingest platform. We cover our journey from a batch-focused ingest system with Sqoop to a streaming ingest supported by Kafka, Confluent.io, Hadoop, Cassandra, and Spark Streaming. This presentation also provides an overview of our complete data platform, including our feature creation template.
Anomaly Detection in Telecom with Spark - Tugdual Grall - Codemotion Amsterda... - Codemotion
Telecom operators need to find operational anomalies in their networks very quickly. This need, however, is shared with many other industries as well so there are lessons for all of us here. Spark plus a streaming architecture can solve these problems very nicely. I will present both a practical architecture as well as design patterns and some detailed algorithms for detecting anomalies in event streams. These algorithms are simple but quite general and can be applied across a wide variety of situations.
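One simple and general way to flag anomalies in an event stream is to compare each new value with the mean and standard deviation of a sliding window of recent values. The sketch below shows that technique in plain Python; it is not necessarily one of the algorithms presented in the talk, and the window size, threshold, and sample readings are hypothetical.

```python
from collections import deque
from statistics import mean, pstdev


def detect_anomalies(values, window=5, threshold=3.0):
    """Return the indices of values that deviate strongly from recent history.

    A value is flagged when it lies more than `threshold` standard
    deviations from the mean of the previous `window` values.
    """
    history = deque(maxlen=window)
    anomalies = []
    for index, value in enumerate(values):
        if len(history) == window:
            mu = mean(history)
            sigma = pstdev(history)
            # Skip the check when the window is constant (sigma == 0).
            if sigma > 0 and abs(value - mu) / sigma > threshold:
                anomalies.append(index)
        history.append(value)
    return anomalies


# Hypothetical per-minute error counts from a network element.
readings = [10, 11, 9, 10, 12, 10, 11, 50, 10, 9]
flagged = detect_anomalies(readings)
```

Note that a flagged value still enters the window and temporarily inflates the standard deviation; a production detector might exclude flagged points or use a more robust statistic.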
Boost Performance with Scala – Learn From Those Who’ve Done It! Cécile Poyet
Scalding is a Scala DSL for Cascading. Running on Hadoop, it is a concise, functional, and very efficient way to build big data applications. One significant benefit of Scalding is that it allows easy porting of Scalding apps from MapReduce to newer, faster execution fabrics.
In this webinar, Cyrille Chépélov, of Transparency Rights Management, will share how his organization boosted the performance of their Scalding apps by over 50% by moving away from MapReduce to Cascading 3.0 on Apache Tez. Dhruv Kumar, Hortonworks Partner Solution Engineer, will then explain how you can interact with data on HDP using Scala and leverage Scala as a programming language to develop Big Data applications.
Disaster Recovery Experience at CACIB: Hardening Hadoop for Critical Financia... - DataWorks Summit
Hadoop is becoming a standard platform for building critical financial applications such as risk reporting, trading and fraud detection. These applications require demanding SLAs (service-level agreements) in terms of RPO (Recovery Point Objective) and RTO (Recovery Time Objective). To achieve these SLAs, organizations need to build a disaster recovery plan that covers several layers, ranging from the infrastructure, through the platform and the applications, to the clients. In this talk, we will present the different architecture blueprints for disaster recovery as well as their corresponding SLA objectives. Then, we will focus on the stretch cluster solution that Crédit Agricole CIB is using in production. We will discuss the solution’s advantages and drawbacks, and the impact of this approach on the global architecture. Finally, we will explain in detail how to configure and deploy this solution and how to integrate each layer (storage layer, processing layer...) into the architecture.
Powering Fast Data and the Hadoop Ecosystem with VoltDB and Hortonworks - Hortonworks
Developers increasingly are building dynamic, interactive real-time applications on fast streaming data to extract maximum value from data in the moment. To do so requires a data pipeline, the ability to make transactional decisions against state, and an export functionality that pushes data at high speeds to long-term Hadoop analytics stores like Hortonworks Data Platform (HDP). This enables data to arrive in your analytic store sooner, and allows these analytics to be leveraged with radically lower latency.
But successfully writing fast data applications that manage, process, and export streams of data generated from mobile, smart devices, sensors and social interactions is a big challenge.
Join Hortonworks and VoltDB, an in-memory scale-out relational database that simplifies fast data application development, to learn how you can ingest large volumes of fast-moving, streaming data and process it in real time. We will also cover how developing fast data applications becomes simpler and faster, and delivers more value, when built on a fast in-memory, scale-out SQL database.
NoSQL Application Development with JSON and MapR-DB - MapR Technologies
NoSQL databases are being used everywhere by startups and Global 2000 companies alike for data environments that require cost-effective scaling. These environments also typically need to represent data in a more flexible way than is practical with relational databases.
Deep Learning with DL4J on Apache Spark: Yeah it's Cool, but are You Doing it... - DataWorks Summit
DeepLearning4J (DL4J) is a powerful open-source distributed framework that brings deep learning to the JVM (it can serve as a DIY tool for Java, Scala, Clojure and Kotlin programmers). It can be used on distributed GPUs and CPUs, and it is integrated with Hadoop and Apache Spark. ND4J is an open-source, distributed and GPU-enabled library that brings the intuitive scientific computing tools of the Python community to the JVM. Training neural network models using DL4J, ND4J and Spark is a powerful combination, but the overall cluster configuration can present some unexpected issues that can compromise performance and nullify the benefits of well-written code and good model design. In this talk I will walk through some of those problems and present some best practices to prevent them. The presented use cases will refer to DL4J and ND4J on different Spark deployment modes (standalone, YARN, Kubernetes). The reference programming language for all code examples will be Scala, but no prior Scala knowledge is required to understand the presented topics.
The Cisco Open SDN Controller is a commercial distribution of OpenDaylight that delivers business agility through automation of standards-based network infrastructure.
Built as a highly scalable software-defined networking (SDN) platform, the Open SDN Controller abstracts away the complexity of managing heterogeneous networks to improve service delivery and reduce operating costs.
The controller exposes REST APIs that allow other applications to take advantage of the capabilities of the controller and unlock the power of the underlying network infrastructure, and Java APIs that allow for the creation of new network services.
This session will present the basic constructs of the controller and the capabilities of the REST and Java APIs to demonstrate how the Open SDN Controller abstracts away the complexity of managing heterogeneous networks to improve service delivery and reduce operating costs.
There is increased interest in using Kubernetes, the open-source container orchestration system for modern, stateful Big Data analytics workloads. The promised land is a unified platform that can handle cloud native stateless and stateful Big Data applications. However, stateful, multi-service Big Data cluster orchestration brings unique challenges. This session will delve into the technical gaps and considerations for Big Data on Kubernetes.
Containers offer significant value to businesses; including increased developer agility, and the ability to move applications between on-premises servers, cloud instances, and across data centers. Organizations have embarked on this journey to containerization with an emphasis on stateless workloads. Stateless applications are usually microservices or containerized applications that don’t “store” data. Web services (such as front end UIs and simple, content-centric experiences) are often great candidates as stateless applications since HTTP is stateless by nature. There is no dependency on the local container storage for the stateless workload.
Stateful applications, on the other hand, are services that require backing storage and keeping state is critical to running the service. Hadoop, Spark and to lesser extent, noSQL platforms such as Cassandra, MongoDB, Postgres, and mySQL are great examples. They require some form of persistent storage that will survive service restarts...
Speakers
Anant Chintamaneni, VP Products, BlueData
Nanda Vijaydev, Director Solutions, BlueData
In my talk I will discuss and show examples of using Apache Hadoop, Apache Hive, Apache MXNet, Apache OpenNLP, Apache NiFi and Apache Spark for deep learning applications. This is the follow up to last years Apache Deep Learning 101 that was done at Dataworks Summit and ApacheCon.
As part of my talk I will walk through using Apache NXNet Pre-Built Models, MXNet's New Model Server with Apache NiFi, executing MXNet with Apache NiFi and running Apache MXNet on edge nodes utilizing Python and Apache MiniFi.
This talk is geared towards Data Engineers interested in the basics of Deep Learning with open source Apache tools in a Big Data environment. I will walk through source code examples available in github and run the code live on an Apache Hadoop / YARN / Apache Spark cluster.
This will be an introduction to executing Deep Learning Pipelines in an Apache Big Data environment.
My talk at Data Works Summit Sydney was listed in top 7 -> https://hortonworks.com/blog/7-sessions-dataworks-summit-sydney-see/
Also have speak at and run Future of Data Princeton and at Oracle Code NYC.
https://www.slideshare.net/oom65/hadoop-security-architecture?next_slideshow=1
https://community.hortonworks.com/articles/83100/deep-learning-iot-workflows-with-raspberry-pi-mqtt.html
https://community.hortonworks.com/articles/146704/edge-analytics-with-nvidia-jetson-tx1-running-apac.html
https://dzone.com/refcardz/introduction-to-tensorflow
YARN Containerized Services: Fading The Lines Between On-Prem And CloudDataWorks Summit
Apache Hadoop YARN is the modern distributed operating system for big data applications. In Apache Hadoop 3.1.0, YARN added a service framework that supports long-running services. This new capability goes hand in hand with the recent improvements in YARN to support Docker containers. Together these features have made it significantly easier to bring new applications and services to YARN.
In this talk you will learn about YARN service framework, its new containerization capabilities and how it lays the foundation for a hybrid and uniform architecture for compute and storage across on-prem and multi-cloud environments. This will include examples highlighting how easy it is to bring applications to the YARN service framework as well as how to containerize applications.
Here's what to expect in this talk:
- Motivation for YARN service framework and containerization
- YARN service framework overview
- YARN service examples
- Containerization overview
- Containerization for Big Data and non Big Data workloads - wait that's everything
2015 nov 27_thug_paytm_rt_ingest_brief_finalAdam Muise
Paytm Labs provides a quick overview of their Hadoop data ingest platform. We cover our journey from a batch focused ingest system with SQOOP to a streaming ingest supported by Kafka, Confluent.io, Hadoop, Cassandra, and Spark Streaming. This presentation also provides an overview of our complete data platform including our feature creation template
Anomaly Detection in Telecom with Spark - Tugdual Grall - Codemotion Amsterda...Codemotion
Telecom operators need to find operational anomalies in their networks very quickly. This need, however, is shared with many other industries as well so there are lessons for all of us here. Spark plus a streaming architecture can solve these problems very nicely. I will present both a practical architecture as well as design patterns and some detailed algorithms for detecting anomalies in event streams. These algorithms are simple but quite general and can be applied across a wide variety of situations.
Boost Performance with Scala – Learn From Those Who’ve Done It! Cécile Poyet
Scalding is a scala DSL for Cascading. Run on Hadoop, it’s a concise, functional, and very efficient way to build big data applications. One significant benefit of Scalding is that it allows easy porting of Scalding apps from MapReduce to newer, faster execution fabrics.
In this webinar, Cyrille Chépélov, of Transparency Rights Management, will share how his organization boosted the performance of their Scalding apps by over 50% by moving away from MapReduce to Cascading 3.0 on Apache Tez. Dhruv Kumar, Hortonworks Partner Solution Engineer, will then explain how you can interact with data on HDP using Scala and leverage Scala as a programming language to develop Big Data applications.
Disaster Recovery Experience at CACIB: Hardening Hadoop for Critical Financia...DataWorks Summit
Hadoop is becoming a standard platform for building critical financial applications such as risk reporting, trading and fraud detection. These applications require high level of SLAs (service-level agreement) in terms of RPO (Recovery Point Objective) and RTO (Recovery Time Objective). To achieve these SLAs, organizations need to build a disaster recovery plan that cover several layers ranging from the infrastructure to the clients going through the platform and the applications. In this talk, we will present the different architecture blueprints for disaster recovery as well as their corresponding SLA objectives. Then, we will focus on the stretch cluster solution that Crédit Agricole CIB is using in production. We will discuss the solution’s advantages, drawbacks and the impact of this approach on the global architecture. Finally, we will explain in detail how to configure and deploy this solution and how to integrate each layer (storage layer, processing layer...) into the architecture.
Powering Fast Data and the Hadoop Ecosystem with VoltDB and HortonworksHortonworks
Developers increasingly are building dynamic, interactive real-time applications on fast streaming data to extract maximum value from data in the moment. To do so requires a data pipeline, the ability to make transactional decisions against state, and an export functionality that pushes data at high speeds to long-term Hadoop analytics stores like Hortonworks Data Platform (HDP). This enables data to arrive in your analytic store sooner, and allows these analytics to be leveraged with radically lower latency.
But successfully writing fast data applications that manage, process, and export streams of data generated from mobile, smart devices, sensors and social interactions is a big challenge.
Join Hortonworks and VoltDB, an in-memory scale-out relational database that simplifies fast data application development, to learn how you can ingest large volumes of fast-moving, streaming data and process it in real time. We will also cover how developing fast data applications becomes simpler and faster, and delivers more value, when built on a fast in-memory, scale-out SQL database.
NoSQL Application Development with JSON and MapR-DBMapR Technologies
NoSQL databases are being used everywhere by startups and Global 2000 companies alike for data environments that require cost-effective scaling. These environments also typically need to represent data in a more flexible way than is practical with relational databases.
Deep Learning with DL4J on Apache Spark: Yeah it's Cool, but are You Doing it...DataWorks Summit
DeepLearning4J (DL4J) is a powerful Open Source distributed framework that brings Deep Learning to the JVM (it can serve as a DIY tool for Java, Scala, Clojure and Kotlin programmers). It can be used on distributed GPUs and CPUs. It is integrated with Hadoop and Apache Spark. ND4J is an Open Source, distributed and GPU-enabled library that brings the intuitive scientific computing tools of the Python community to the JVM. Training neural network models using DL4J, ND4J and Spark is a powerful combination, but the overall cluster configuration can present some unexpected issues that can compromise performance and nullify the benefits of well-written code and good model design. In this talk I will walk through some of those problems and present some best practices to prevent them. The presented use cases will refer to DL4J and ND4J on different Spark deployment modes (standalone, YARN, Kubernetes). The reference programming language for all code examples will be Scala, but no prior Scala knowledge is required to understand the presented topics.
The Cisco Open SDN Controller is a commercial distribution of OpenDaylight that delivers business agility through automation of standards-based network infrastructure.
Built as a highly scalable software-defined networking (SDN) platform, the Open SDN Controller abstracts away the complexity of managing heterogeneous networks to improve service delivery and reduce operating costs.
The controller exposes REST APIs that allow other applications to take advantage of the capabilities of the controller and unlock the power of the underlying network infrastructure, and Java APIs that allow for the creation of new network services.
This session will present the basic constructs of the controller and the capabilities of the REST and JAVA APIs to demonstrate how the Open SDN Controller abstracts away the complexity of managing heterogeneous networks to improve service delivery and reduce operating costs.
USGS Report on the Impact of Marcellus Shale Drilling on Forest Animal HabitatsMarcellus Drilling News
A report issued March 25, 2013 by the U.S. Geological Survey titled "Landscape Consequences of Natural Gas Extraction in Allegheny and Susquehanna Counties, Pennsylvania, 2004–2010." The report, using a series of maps and data, purports to show that drilling has led to "carving up" wildlife habitats in some forests.
Demystifying Security Analytics: Data, Methods, Use CasesPriyanka Aash
Many vendors sell “security analytics” tools. Also, some organizations built their own security analytics toolsets and capabilities using Big Data technologies and approaches. How do you find the right approach for your organization and benefit from this analytics boom? How to start your security analytics project and how to mature the capabilities?
(Source: RSA USA 2016-San Francisco)
This session will share large-scale architectures from the author's experiences with various companies like Cisco, Symantec, and EMC, and compare and contrast the architectures across: infrastructure architecture scaling, ecommerce integrations and the migration approach from legacy into AEM, and Digital Marketing Cloud integrations such as personalization, analytics, and DMP.
Opensource approach to design and deployment of Microservices based VNFMichelle Holley
Microservices are gaining increased adoption in the Telco NFV world. It is key to understand the design and deployment methodologies involved in developing a Microservice based VNF. This talk provides an opensource practitioner's approach to building and deploying a Microservice based VNF and includes the following:
- Design patterns, workflow models
- Design models for VNF placement, capacity management, scale-in/out and resiliency
- Deployment considerations, including handling of scale and fault tolerant VNFs using well known Opensource tools.
About the presenter: Prem Sankar works for the Ericsson Opensource Ecosystem team and is part of the Opendaylight and OPNFV team at Ericsson. Prem evangelizes SDN and Cloud and has given many sessions and conducted workshops around SDN and ODL. Prem is PTL of the ODL COE project and is currently driving the Kubernetes and ODL integration in the Opendaylight community. Prem is a frequent speaker at opensource summits and has presented at Opendaylight, OPNFV and Open Networking summits.
If you have heard about web-scale, need to survive under web-scale load, or would just like to prepare your application to handle an X effect, this topic is for you.
During the presentation you will learn about the aspects and caveats of performance testing, including the nuances of performance testing Java-based web applications.
In the practical part you will get a brief overview of existing tools and a guide to using Gatling to generate load for your application.
Gatling is an open source load-testing tool written in Scala that provides a comprehensive DSL for specifying load scenarios.
Big Data Europe: Simplifying Development and Deployment of Big Data ApplicationsBigData_Europe
Presentation at MSD IT Global Innovation Center in Prague, Czech Republic. Covers the technical outcomes of the Horizon 2020 BigDataEurope project and provides an example of a component integration into the BDI platform.
WhiteHedge provides DevOps as a service. We offer devops consultation, implementation and training services. You can contact us at devops@whitehedge.com
Working with big volumes of data is a complicated task, but it's even harder if you have to do everything in real time and try to figure it all out yourself. This session will use practical examples to discuss architectural best practices and lessons learned when solving real-time social media analytics, sentiment analysis, and data visualization decision-making problems with AWS. Learn how you can leverage AWS services like Amazon RDS, AWS CloudFormation, Auto Scaling, Amazon S3, Amazon Glacier, and Amazon Elastic MapReduce to perform highly performant, reliable, real-time big data analytics while saving time, effort, and money. Gain insight from two years of real-time analytics successes and failures so you don't have to go down this path on your own.
My incident response talk from Techfair 2016 in Jersey. The talk explores how incident response could comply with the requirements set out in the Jersey Financial Services Commission Dear CEO letter on cyber security.
How Docker EE is Finnish Railway’s Ticket to App ModernizationDocker, Inc.
VR Group-Finnish Railways is responsible for 118 million passenger rides and moving 41 million tons of cargo a year and is seeing overall growth in rail transit throughout Finland. A priority for the organization is to provide improved customer services, including an improved seat reservation system and bringing modern experiences like next generation mobile apps to their passengers. These improvements require looking at their application portfolio and deciding to either:
Revise: Transform legacy applications to more cost efficient solutions
Redesign: Redesign and rewrite mainframe-based solutions to microservices
In this session, Markus Niskanen, Integration Manager at VR Group, and Oscar Renalias, Sr. Technology Architect at Accenture, will discuss how they leveraged Docker EE and the public cloud to be the common platform for these different application modernization projects. They will cover how they are leveraging Docker and the cloud to renew and optimize their application portfolio for greater ROI, leading to organization-wide adoption of DevOps principles and cultural change in an industry that is over 150 years old.
SocCnx11 - All you need to know about orient mepanagenda
Orient Me is the first Connections service built on the new Connections Pink stack. Nico will talk about the installation, integration and administration of Orient Me. He will also provide useful insights into the backend tools used. Walk away knowing how to successfully run Orient Me in your own Connections environment!
Cisco at VMworld 2015 - Cisco UCS as the Foundation for Software-Defined Data...ldangelo0772
IT is in the midst of a dramatic shift to the mobile-cloud era, one in which IT services can be consumed on-demand across the enterprise and in hybrid and public clouds. Tjerk Bijlsma will share the latest Cisco Unified Computing System (Cisco UCS) innovations that can help you shape your Software-Defined Data Center, radically simplifying IT while delivering services at the speed of today's business.
During this session you will learn about:
Cisco's comprehensive architectural approach to enabling the next wave of IT convergence, which includes VMware vSAN and comprehensive vRealize integration as part of the SDDC.
Innovations in Cisco Data Center portfolio including Cisco UCS and Nexus integrations with VMware solutions.
Solutions for virtualized environments for Converged and Hyper Converged systems including FlexPod, VersaStack, Vblock, vSAN, Simplivity, StorMagic and more.
Joe Onisick, Principal Engineer, Cisco discusses building the right network and understanding different overlay approaches at Cisco Connect Toronto 2015.
Application Centric Infrastructure (ACI), the policy driven data centreCisco Canada
Mike Herbet, Principal Engineer, Cisco, Dave Cole, Consulting Systems Engineer, Cisco, Sean Comrie, Technical Solutions Architect, Cisco focused on the application centric infrastructure (ACI) at Cisco Connect Toronto.
Cisco Virtualized Multi-tenant Data Center solution (VMDC) is an architectural approach to IT which delivers a Cloud Ready Infrastructure. The architecture encompasses multiple systems and functions defining a standard framework for an IT organization. Standardization allows the organization to achieve operational efficiencies, reduce risk and achieve cost reductions while offering a consistent platform for business.
Migrating from VMs to Kubernetes using HashiCorp Consul Service on AzureMitchell Pronschinske
DevOps tools became very popular with the adoption of public cloud, but Operational teams now realize that their benefits can be extended to enterprise data centers. In reality, cloud native tools can help bridge public clouds and private data centers by enabling a common framework to manage applications and their underlying infrastructure components.
In this session you’ll learn about the latest Cisco ACI integrations with Hashicorp Terraform and Consul to deliver a powerful solution for end-to-end on-prem and cloud infrastructure deployments.
Cisco’s Cloud Strategy, including our acquisition of CliQr Cisco Canada
At Partner Summit we made a series of exciting announcements in our Cloud portfolio, including our acquisition of CliQr. Join us to learn about these new announcements and gain an understanding of Cisco’s Cloud Strategy.
- How does CliQr fit into our existing Cloud portfolio (Metapod, APIC, Enterprise Cloud Suite, Cloud Consumption-as-a-Service)?
- How does our Cloud portfolio today meet the needs of our customers? What problems are we solving?
- How does our portfolio today position us for the world of Containers and Microservices?
Join us for a presentation of how these announcements fit into our current environment and what they mean to your longer-term strategy.
Why Should We Trust You-Interpretability of Deep Neural Networks - StampedeCo...StampedeCon
Despite widespread adoption and success, most machine learning models remain black boxes. Users and practitioners are often asked to implicitly trust the results. However, understanding the reasons behind predictions is critical in assessing trust, which is fundamental if one is asked to take action based on such models, or even to compare two similar models. In this talk I will (1) formulate the notion of interpretability of models, (2) provide a review of various attempts and research initiatives to solve this very important problem and (3) demonstrate real industry use-cases and results focusing primarily on Deep Neural Networks.
The Search for a New Visual Search Beyond Language - StampedeCon AI Summit 2017StampedeCon
Words are no longer sufficient in delivering the search results users are looking for, particularly in relation to image search. Text and languages pose many challenges in describing visual details and providing the necessary context for optimal results. Machine Learning technology opens a new world of search innovation that has yet to be applied by businesses.
In this session, Mike Ranzinger of Shutterstock will share a technical presentation detailing his research on composition aware search. He will also demonstrate how the research led to the launch of AI technology allowing users to more precisely find the image they need within Shutterstock’s collection of more than 150 million images. While the company released a number of AI search enabled tools in 2016, this new technology allows users to search for items in an image and specify where they should be located within the image. The research identifies the networks that localize and describe regions of an image as well as the relationships between things. The goal of this research was to improve the future of search using visual data, contextual search functions, and AI. A combination of multiple machine learning technologies led to this breakthrough.
Predicting Outcomes When Your Outcomes are Graphs - StampedeCon AI Summit 2017StampedeCon
In many modern applications data are collected in unusual forms. Connectome or brain imaging data are graphs. Wearable devices measuring activity produce functions over time. In many cases these objects are collected for each individual or transaction, leaving the statistician with the challenge of analyzing populations of data that are not in classical numeric and categorical formats in big spreadsheets. In this talk I introduce object oriented data analysis with an application we recently developed for regression analysis. This talk is aimed at the general data scientist and emphasizes concepts rather than mathematical detail. The take-home message is how we can use covariates (i.e., meta-data) to predict what the structure of a brain image graph will be.
Novel Semi-supervised Probabilistic ML Approach to SNP Variant Calling - Stam...StampedeCon
This talk dives into the technical details of machine learning model development and implementation, and the value it brings to the Monsanto breeding pipeline. We genotype over 100 million seeds a year in order to save field resources and product development cycle time. Automation and high-throughput production from the lab become key to R&D success. In-house predictive model development incorporated a random forest ensemble-based approach with additional features derived from a Gaussian mixture model. The results show over 95% accuracy with less than 1% false positives/negatives. The model is highly generalizable, with over 10 million data points being trained and tested on. The model also offers a probabilistic approach to presenting genotypes in a more meaningful way and helps enhance downstream genomics analyses. The talk targets audiences in breeding, genetics, and molecular biology, as well as data scientists interested in practical applications.
How to Talk about AI to Non-analysts - Stampedecon AI Summit 2017StampedeCon
While artificial intelligence for self-driving cars and virtual assistants gets a lot of attention, the notion of communicating the needs, effectiveness and measurements is complicated when speaking “geek”! The work of an analyst, however, does not just involve conducting data analysis within, but also communicating, championing and speaking simply when talking to the organization, clients and management.
Getting Started with Keras and TensorFlow - StampedeCon AI Summit 2017StampedeCon
This technical session provides a hands-on introduction to TensorFlow using Keras in the Python programming language. TensorFlow is Google’s scalable, distributed, GPU-powered compute graph engine that machine learning practitioners use for deep learning. Keras provides a Python-based API that makes it easy to create well-known types of neural networks in TensorFlow. Deep learning is a group of exciting new technologies for neural networks. Through a combination of advanced training techniques and neural network architectural components, it is now possible to train neural networks of much greater complexity. Deep learning allows a model to learn hierarchies of information in a way that is similar to the function of the human brain.
Foundations of Machine Learning - StampedeCon AI Summit 2017StampedeCon
This presentation will cover all aspects of modeling, from preparing data to training and evaluating the results. There will be descriptions of the mainline ML methods, including neural nets, SVM, boosting, bagging, trees, forests, and deep learning. Common problems of overfitting and dimensionality will be covered, with discussion of modeling best practices. Other topics will include field standardization, encoding categorical variables, and feature creation and selection. It will be a soup-to-nuts overview of all the necessary procedures for building state-of-the-art predictive models.
Don't Start from Scratch: Transfer Learning for Novel Computer Vision Problem...StampedeCon
In this session, we’ll discuss approaches for applying convolutional neural networks to novel computer vision problems, even without having millions of images of your own. Pretrained models and generic image data sets from Google, Kaggle, universities, and other places can be leveraged and adapted to solve industry and business specific problems. We’ll discuss the approaches of transfer learning and fine tuning to help anyone get started on using deep learning to get cutting edge results on their computer vision problems.
Bringing the Whole Elephant Into View Can Cognitive Systems Bring Real Soluti...StampedeCon
Like the story of the six blind men trying to explain the nature of an elephant, current research in cognitive computational systems attempts to identify the nature of an illness, human behavior, or socio-economical phenomenon, from their own perspective.
At present, there is no agreed upon definition for cognitive systems. One large communication corporation defines cognitive systems as a category of technology that uses artificial intelligence, machine learning and reasoning, to enable people and machines to interact more naturally. It also extends and magnifies human expertise and cognition to enable accurate decisions on time. Two of the most famous risk and financial advisory firms agree with that interpretation. A different large corporation, however, considers “cognitive systems” as merely marketing jargon.
If cognitive systems are going to help us solve challenging problems in medicine, economics, or other fields, three aspects must be considered in order to reveal the “true nature of the elephant”.
§ All facets of the problem must be addressed, like the main parts of the elephant had to be touched by the men.
§ These facets must be properly assembled, like the men needed to join hands around the elephant in order to understand what it was.
§ This assembly must be completed within sufficient time to anticipate future decisions. Just like the men needed to know what an elephant is before the next one charges them.
This talk will explain how agnostic (unsupervised, blinded) machine learning findings can be assembled by multiobjective and multimodal optimization techniques to uncover a multifaceted view of the “elephant”, in this case the human being (e.g., genomic variants, personality traits, brain images). It will also give real-world examples of how this knowledge will “extend the human capabilities” by achieving an integrative assessment of the whole person in relation to their risk, which will allow professionals to generate accurate person-centered policies, from personalized diagnoses to business opportunities to the prevention of outbreaks.
Automated AI The Next Frontier in Analytics - StampedeCon AI Summit 2017StampedeCon
This talk will walk through the important building blocks of Automated AI. Rajiv will highlight the current gaps in analytics organizations and how to close those gaps using automated AI. Some of the issues discussed around automated AI are the accuracy of models, tradeoffs around control when using automation, interpretability of models, and integration with other tools. These issues will be highlighted with examples of automated analytics in different industries. The talk will end with some examples of how automated AI in the hands of data scientists and business analysts is transforming analytic teams and organizations.
AI in the Enterprise: Past, Present & Future - StampedeCon AI Summit 2017StampedeCon
Artificial Intelligence has entered a renaissance thanks to rapid progress in domains as diverse as self-driving cars, intelligent assistants, and game play. Underlying this progress is Deep Learning – driven by significant improvements in Graphic Processing Units and computational models inspired by the human brain that excel at capturing structures hidden in massive complex datasets. These techniques have been pioneered at research universities and digital giants but mainstream enterprises are starting to apply them as open source tools and improved hardware become available. Learn how AI is impacting analytics today and in the future.
Learn how AI is affecting the enterprise, including applications like fraud detection, mobile personalization, predicting failures for IoT and text analysis to improve call center interactions. We look at practical examples of assessing the opportunity for AI, phased adoption, and lessons from going from research, to prototype, to scaled production deployment.
A Different Data Science Approach - StampedeCon AI Summit 2017StampedeCon
This session will focus on how to execute Data Science caliber efforts by creating teams with the attributes of Data Science to deliver meaningful results. As Data Scientists are harder to find and keep, this session should appeal to anyone who is either seeking an alternative approach to executing Data Science delivery or augmenting their current Data Science model with additional options.
Graph in Customer 360 - StampedeCon Big Data Conference 2017StampedeCon
Enterprises typically have many data silos of partial customer data, and a common theme in big data projects is to use big data tools and pipelines to unify all siloed customer data into a single, queryable platform for improving all future customer interactions. This data often comes from billing, website traffic, logistics, and marketing, all in different formats with different properties. Graph provides a way to unify all of the data into a single place for use in tracking the flow of a user through the various silos. Graph can also be used for visualizations and analytics that are difficult in other systems.
In this talk we will explore the ways in which Graph can be leveraged in a customer 360 use case: what it can add to a more conventional system, and what the approach to developing a graph-based Customer 360 system should be.
End-to-end Big Data Projects with Python - StampedeCon Big Data Conference 2017StampedeCon
This talk will go over how to build an end-to-end data processing system in Python, from data ingest, to data analytics, to machine learning, to user presentation. Developments in old and new tools have made this particularly feasible today. In particular, the talk will cover Airflow for process workflows, PySpark for data processing, Python data science libraries for machine learning and advanced analytics, and building agile microservices in Python.
System architects, software engineers, data scientists, and business leaders can all benefit from attending the talk. They should learn how to build more agile data processing systems and take away some ideas on how their data systems could be simpler and more powerful.
Doing Big Data Using Amazon's Analogs - StampedeCon Big Data Conference 2017StampedeCon
Big Data doesn’t have to just mean Hadoop any more. Big Data can be done in the cloud, using tools developed by the Cloud providers. This session will cover using Amazon AWS services to implement a Big Data application. We will compare and contrast different services from Amazon with the Hadoop equivalents.
Enabling New Business Capabilities with Cloud-based Streaming Data Architectu...StampedeCon
Using big data isn’t about doing the same things we’ve always done just with different technologies. The technology advances that we’ve chosen to label as big data create the opportunity for wholly new kinds of solutions. Two of the key advances that are enabling new business capabilities are cloud-based data management platforms and streaming data processing and analytics.
In this session, Paul Boal will drill into the cloud-based streaming data architecture that has made possible EVŌ, a new breakthrough health and wellness platform. EVŌ uses a game-changing approach that leverages over 60 billion data points and a predictive analytics engine to intervene BEFORE someone becomes critically ill. All of this is possible by leveraging data from smartphones and wearable fitness devices along with advanced analytics which then help users develop and sustain positive behaviors. Attendees will learn how to create a cloud-based architecture that can receive data, apply multiple layers of dynamic business rules, and drive alerts and decisions through real-time stream processing using technologies including web services, Amazon DynamoDB and Kinesis, Drools, and Apache Spark.
Big Data Meets IoT: Lessons From the Cloud on Polling, Collecting, and Analyz...StampedeCon
The collection and use of Big Data has become an important part of modern business practice. The Internet of Things (IoT) movement promises to provide new opportunities for businesses interested in the intersection of people and technology. It is also fraught with pitfalls for practitioners and researchers who struggle to make sense of an increasing cacophony of signals. How should they poll and collect data from millions of signals in a way that is manageable, scalable, and statistically valid? How should they analyze and predict using these data? This presentation will discuss these challenges with applied examples from monitoring and managing one of the world’s largest computers.
Innovation in the Data Warehouse - StampedeCon 2016StampedeCon
Enterprise Holdings first started with Hadoop as a POC in 2013. Today, we have clusters on premises and in the cloud. This talk will explore our experience with Big Data and outline three common big data architectures (batch, lambda, and kappa). Then, we’ll dive into the decision points necessary for your own cluster, for example: cloud vs on premises, physical vs virtual, workload, and security. These decisions will help you understand what direction to take. Finally, we’ll share some lessons learned about the pieces of our architecture that worked well and rant about those that didn’t. No deep Hadoop knowledge is necessary; the talk is pitched at the architect or executive level.
Creating a Data Driven Organization - StampedeCon 2016StampedeCon
Companies today are all focused on finding new consumption models to better utilize the data they produce. This presentation will provide insights and best practices for creating the organization and sponsorship necessary to set the foundation for success.
For this session, Dan will provide an overview of the process and methodologies he employs to establish and sustain a Data Driven Culture. Key topics will include:
Data Driven Culture
Executive Sponsorship
Organizational Structure – Collaboration Hubs and Bi-Modal Analytics
Role of Hadoop and Big Data as Part of Data Driven Culture
Using The Internet of Things for Population Health Management - StampedeCon 2016StampedeCon
The Internet of (Human) Things is just beginning to take shape. The human body is an inexhaustible source of data about personal health, and the healthcare industry is just beginning to scratch the surface of the potential insights and value that will come from that data. While much of healthcare traditionally focuses on the episodic delivery of services, the Affordable Care Act is pushing healthcare providers, payers, and self-funded employer groups to look at ways to proactively encourage healthy behaviors. Providing personal health devices as a way to promote individual health is one way that healthcare is beginning to take advantage of IoT technologies. This session provides insight into how IoT is being leveraged in population health management through a solution jointly delivered by Amitech Solutions and Big Cloud Analytics. Attendees will learn how Hadoop is being used to gather personal device data from various vendors, integrate and analyze that information, differentiate trends across regional and cultural diversity, and provide personal recommendations and insights into health risks. This session presents one important way the healthcare industry is leveraging IoT.
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
Round table discussion of vector databases, unstructured data, AI, big data, real-time, robots and Milvus.
A lively discussion with NJ Gen AI Meetup Lead, Prasad and Procure.FYI's Co-Found
Quantitative Data AnalysisReliability Analysis (Cronbach Alpha) Common Method...2023240532
Quantitative data Analysis
Overview
Reliability Analysis (Cronbach Alpha)
Common Method Bias (Harman Single Factor Test)
Frequency Analysis (Demographic)
Descriptive Analysis
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
Discussion on Vector Databases, Unstructured Data and AI
https://www.meetup.com/unstructured-data-meetup-new-york/
This meetup is for people working in unstructured data. Speakers will come present about related topics such as vector databases, LLMs, and managing data at scale. The intended audience of this group includes roles like machine learning engineers, data scientists, data engineers, software engineers, and PMs. This meetup was formerly Milvus Meetup, and is sponsored by Zilliz, maintainers of Milvus.
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...John Andrews
SlideShare Description for "Chatty Kathy - UNC Bootcamp Final Project Presentation"
Title: Chatty Kathy: Enhancing Physical Activity Among Older Adults
Description:
Discover how Chatty Kathy, an innovative project developed at the UNC Bootcamp, aims to tackle the challenge of low physical activity among older adults. Our AI-driven solution uses peer interaction to boost and sustain exercise levels, significantly improving health outcomes. This presentation covers our problem statement, the rationale behind Chatty Kathy, synthetic data and persona creation, model performance metrics, a visual demonstration of the project, and potential future developments. Join us for an insightful Q&A session to explore the potential of this groundbreaking project.
Project Team: Jay Requarth, Jana Avery, John Andrews, Dr. Dick Davis II, Nee Buntoum, Nam Yeongjin & Mat Nicholas
Influence of Marketing Strategy and Market Competition on Business Plan
How Cisco Migrated from MapReduce Jobs to Spark Jobs - StampedeCon 2015
1. Ken Owens
CTO, Cisco Intercloud Services
07/15/15
How Cisco Migrated from MapReduce Jobs to Spark Jobs
2. Cisco and/or its affiliates. All rights reserved.Presentation_ID Cisco Public
Introduction
7. The Uber Trend: Exponential Rise in Connectivity
30M new devices connected every week
78% of workloads processed in cloud DCs by 2018
5TB+ of data per person by 2020
180B mobile apps downloaded in 2015
277X data created by IoE devices v. end-user
Source: IDC
8. Exponential Growth Drives Opportunities
[Chart labels: Exponential Trend, Linear Trend, Disruptive Stress/Opportunity, Knee of Curve. Credit: Peter Diamandis: BOLD]
9. When Products Become Cloud-enabled, They Become 10X More Valuable
[Price pairs shown on the slide: $23.19 and $249.00; $18.01 and $199.00; $5.99 and $59.99]
10. A Broader Perspective than Hybrid Cloud Is Required…
SaaS, PaaS, IaaS
Data Center, Cloud, Edge / IoT
11. CIOs Need to Embrace Both Traditional and Hyperscale Application Deployment
Hyperscale applications serving several thousands of users very quickly
• IoE and increasing connectivity driving the need for such workloads
• Hadoop, Mobile back-ends, Gaming, Social
• Small (~10%), yet rapidly growing percentage of applications in the Cloud
Traditional enterprise applications
• ERP, CRM, Applications that leverage traditional databases
• Majority of applications being run for/by Enterprises today
12. Application Portability and Interoperability Is the Key
SaaS, PaaS, IaaS
Traditional Applications: ERP, Financial, Client/Server, CRM, email, …
Cloud Native Applications: IoT, Big Data, Analytics, Gaming, ...
Data Center, Cloud, Edge / IoT
13. Bimodal IT Is the New Normal
45% of CIOs currently have a second fast/agile mode of operation
Traditional Mode: Requires Reliability (ITIL, CMMI, COBIT)
Nonlinear Mode: Accept Instability (DevOps, automation, reusable)
Systems of Record, Systems of Differentiation, Systems of Innovation
Change, Governance
Source: Gartner, Lydia Leong
14. We Must Connect the Clouds
Islands of Isolated PC LAN Networks (1990s): multiple LANs using a multitude of protocols
The Internet: connected using industry-standard IP protocol (IP based, open standards)
World of Isolated Clouds (2000s): individual custom-built clouds without consistent APIs
The Intercloud: connected for application acceleration with Open APIs (web-scale architecture; API-driven automation; open, secure, compliant; hybrid IT)
16. Omni-Channel Customer Journeys
The customers’ interaction with Cisco across multiple touch points to get the desired business outcome.
Channels: Server Logs, Social & Chat, Mobile, Event Streams, Call Center
Interaction touch points: S/W Download, Open Trouble Ticket, Assign Engineer, Update Trouble Ticket, Close Trouble Ticket, Resolve Trouble Ticket, Read Support Documents, View Design Documents, View Tech Documents, New Registration, Bug Search, FAQs, Contract Details, Product Details, Device Coverage
Journeys: Case Resolution, Software Upgrade
17. Customer Interaction Analytics: From Journey to Outcome…
Customer Journeys
• Software Upgrades
• Bug Inquiry
• Software Inquiry
• Trouble Ticket Lifecycle
• Device Troubleshooting
• New Registration
• Contract Renewal
Behavioral Insights
• Customer Interest Analytics
• Customer Experience Analytics
• Resource Forecasting
• Security and Compliance
Impact
• Boost Self Service
• Real-time Content Optimization & Recommendation
• Context Based Predictive Alerts
• Implicit Personalization
18. Customer Interaction Analytics Big Data Platform
Synthesize customer journey maps into behavioral insights.
[Architecture diagram labels]
Data Sources: Server Logs, Call Center, Mobility, Social, Event Streams
Data Ingestion: CiscoDV, Kafka, Redis, ETL
Analytics Model: Build Model, Activity Refinement, Activity Synthesis, Synthesized Insights
Real-time Processing, Batch Analytics
Insight Services: CiscoDV
Interact: Impala, Hive, Pig, ES, Zoomdata, Platfora
20. AWS Platform
Component | Cloud::Hadoop (Batch Analytics) | Cloud::Queries (Interactive Queries) | Cloud::Streams (Near Real-time Analytics)
Virtual Machines | 30 | 6 | 5
AWS Instance Sizing | m3.2xlarge | c3.xlarge | m3.xlarge
Virtual Cores | 8/VM | 4/VM | 4/VM
RAM | 30GB/VM | 7.5GB/VM | 15GB/VM
Disk | 1.5 TB/VM | 1.5 TB/VM | 1.5 TB/VM
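To make the overall footprint of the AWS platform easier to see, the per-VM figures on this slide can be multiplied out. The sketch below only does that arithmetic; the tuple layout and the function name `totals` are illustrative choices, not something from the deck.

```python
# Per-cluster sizing copied from the AWS Platform slide:
# (name, VM count, vCPUs per VM, RAM in GB per VM, disk in TB per VM)
CLUSTERS = [
    ("Cloud::Hadoop", 30, 8, 30.0, 1.5),
    ("Cloud::Queries", 6, 4, 7.5, 1.5),
    ("Cloud::Streams", 5, 4, 15.0, 1.5),
]


def totals(clusters):
    """Sum VM count, vCPUs, RAM (GB) and disk (TB) over all clusters."""
    vms = sum(n for _, n, _, _, _ in clusters)
    vcpus = sum(n * cores for _, n, cores, _, _ in clusters)
    ram_gb = sum(n * ram for _, n, _, ram, _ in clusters)
    disk_tb = sum(n * disk for _, n, _, _, disk in clusters)
    return vms, vcpus, ram_gb, disk_tb


print(totals(CLUSTERS))  # (41, 284, 1020.0, 61.5)
```

That is 41 VMs in total, with the 30-node Hadoop cluster accounting for 240 of the 284 virtual cores.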
21. Case for Cisco Intercloud Services for Analytics…
Cisco Security and Compliance requirements
• Workloads that deal with personally identifiable data and Cisco confidential content cannot be uploaded to AWS. A Cisco internal cloud solution is a better fit.
Customer journey beyond the enterprise
• Applications are hosted on AWS
• Partner systems hosted on AWS and other cloud providers
Presence in AWS and other cloud services is required to support these scenarios for end-to-end customer journey insights.
Data virtualization integrated in the CIS Analytics Stack
• Connect data from multiple clouds and multiple big data platforms
Integrated visualization toolset
22. CIS Analytics Platform
23. CIS Analytics Platform Requirements
Infra Provisioning
Deploy a virtual private cloud (VPC) on CIS with compute, storage and memory requirements comparable to the current production system.
OpenStack
Icehouse OpenStack with Neutron, Nova, and Swift installed.
Big Data Ecosystem
Cloudera’s Hadoop distribution version CDH 5.1.3, ELK Stack, Apache Kafka and Apache Storm.
Data virtualization & Cloud Integration
Access to data services and data stores via Cisco Data Virtualization
Runtime Services
Foundational PaaS capabilities including SLAs for uptime, performance, latency, data retention, issue escalation and support priorities, issue resolution, problem management, deployment process, patch management.
API Services
Provide both fine-grained and coarse-grained access to all service layers of the CIS Analytics Platform. In the hybrid cloud model it must support interoperability across platform service providers and promote the cloud concepts of extensibility and flexibility.
24. AWS to CIS Migration – Success Criteria
Successful synthesis of customer interaction data
Successful automation of the end-to-end data process pipeline
Build behavioral insight services
Access to data and services via data discovery and visualization tools
Meet the performance, scale and platform stability requirements
Successful deployment of CiscoDV on CIS
Connect HDFS and Hive DS with CiscoDV via Hive and Impala
Build and expose insight services for consumption by limited users
AWS and CIS Data Node Sizing Comparison
Hadoop Cluster for Batch and Query Analytics
Node Service AWS Instance Type vCPU Mem Storage
Number of
Data Nodes
Comments
Data Nodes/
Node Master m3.2xlarge 8 30 2x80 GB 30
Each hadoop data node has 1500GB of EBS
available for HDFS storage
AWS Sizing
CCS Sizing
Node Service CCS Instance Type vCPU Mem Storage
Number of
Data Nodes
Comments
Data Nodes/
Node Master GP-2XLarge 8 32 50 35
Each hadoop data node has 1500GB of EBS
available for HDFS storage
Less than AWS sizing (Storage)
26. Pilot Test Data
• Test performed on one day’s production data
• Total no. of records processed – 110,852,667
• Total data size – 32GB
• Total no. of M/R jobs in the data pipeline – 17
• Two test cycles
• Cycle 1: Heterogeneous CCS nodes (vCPUs, storage, memory)
• Cycle 2: Homogeneous CCS nodes
27. CIS Performance of Batch Analytics – Limited Test
29. PoC: Analytics with Spark on CIS
Existing code
Written in Ruby with Wukong to run on Hadoop
A history of changes and modifications
Script-based, steps communicate via intermediary files
Goal
Revise, rethink and reimplement with Spark on CIS
Open for advanced cloud analytics
Improve maintainability by moving away from aging Ruby on Hadoop
30. Ruby Flow
[Flow diagram with stages Cleanse and Sessionize. Box labels: logs, cleanse, cleansed, private, web, decorate, sessionize (cookie, time), sessioned, match 1st (IP, UA, time), build actions, merge, session PSV, add to hive, bug tool; outputs: first, others, bots, 1..7, onlyBots, private. Annotations: "Main computation happens here", "global sort in time", "global group by IP"]
Pre-process log records (‘cleanse’)
Extract HTTP sessions (‘sessionize’)
Extract user actions, such as ‘search’, ‘download patch’, ‘open manual’, ‘open a bug’
Ruby: Scripts with temp files
Each box on the figure is a script in a separate file
They pipe gigabytes of data as input and output
Random matching of nodes to data for sessionizing
Lots of redundant shuffling
31. Spark Flow
[Same flow diagram and box labels as on the Ruby Flow slide. Annotations: "Main computation happens here", "global partition by IP", "local sort in time"]
Same flow, but each box is a Java or Scala function
No intermediate temp files
Steps are chained by Spark, often without any need for intermediate data
If still needed, the data is stored in memory and local disk as much as possible
Local computation
Cleansing is computed on nodes local to data blocks (same as Ruby)
Sessions are built per IP
On separate nodes, each handling a single IP range
Once copied to the node on partition, the data remains local
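The "global partition by IP, local sort in time" idea can be illustrated without a cluster. The following is a minimal plain-Python sketch, not the team's Java/Scala code: the record layout, the CRC32 hash partitioner (the slide mentions IP ranges instead), the function names and the 30-minute session gap are all assumptions made for illustration. In Spark itself, the RDD method `repartitionAndSortWithinPartitions` covers this pattern, though the slides do not say which call was used.

```python
import zlib
from collections import defaultdict

SESSION_GAP_SECONDS = 30 * 60  # assumed inactivity timeout; not stated in the slides


def partition_by_ip(records, num_partitions):
    """Send all records of one IP to the same partition (stable hash of the IP)."""
    partitions = defaultdict(list)
    for rec in records:
        idx = zlib.crc32(rec["ip"].encode("utf-8")) % num_partitions
        partitions[idx].append(rec)
    return dict(partitions)


def sort_locally(partitions):
    """Sort each partition by (ip, time) independently of all other partitions."""
    return {
        idx: sorted(recs, key=lambda r: (r["ip"], r["time"]))
        for idx, recs in partitions.items()
    }


def sessionize(sorted_records, max_gap=SESSION_GAP_SECONDS):
    """Split one sorted partition into sessions: same IP, gaps of at most max_gap."""
    sessions = []
    for rec in sorted_records:
        last = sessions[-1][-1] if sessions else None
        same_session = (
            last is not None
            and last["ip"] == rec["ip"]
            and rec["time"] - last["time"] <= max_gap
        )
        if same_session:
            sessions[-1].append(rec)
        else:
            sessions.append([rec])
    return sessions


# Hypothetical log records; "time" is seconds since midnight.
logs = [
    {"ip": "10.0.0.1", "time": 30},
    {"ip": "10.0.0.2", "time": 5},
    {"ip": "10.0.0.1", "time": 10},
    {"ip": "10.0.0.1", "time": 5000},
]
for idx, part in sort_locally(partition_by_ip(logs, 4)).items():
    print(idx, [[(r["ip"], r["time"]) for r in s] for s in sessionize(part)])
```

Because every record of an IP lands in one partition, each partition can be sorted and sessionized on its own node, which is what removes the cluster-wide sort.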
32. Runtime comparison
Volumes
Logs of a single day: 52 GB
Total of 110 million records
Of which 53 million records are kept after pre-filtering
Producing over 1 million user actions
Cluster of 30 nodes
Ruby
Runtime 140 min
Spark
Runtime 7 min (20 times faster)
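The "20 times faster" figure follows directly from the two runtimes, and the same numbers give a rough throughput per second. A small check, using the rounded 110 million record count from the slide (the function names are illustrative):

```python
RUBY_MINUTES = 140
SPARK_MINUTES = 7
RECORDS = 110_000_000  # "110 mil" as rounded on the slide


def speedup(old_minutes, new_minutes):
    """How many times shorter the new runtime is than the old one."""
    return old_minutes / new_minutes


def records_per_second(records, minutes):
    """Average input records processed per second over the whole run."""
    return records / (minutes * 60)


print(speedup(RUBY_MINUTES, SPARK_MINUTES))                 # 20.0
print(round(records_per_second(RECORDS, RUBY_MINUTES)))     # 13095
print(round(records_per_second(RECORDS, SPARK_MINUTES)))    # 261905
```

These are averages over the entire pipeline, so they say nothing about which stage contributed most of the gain.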
33. Why?
Extracting sessions means sort in time and group by IP
Ruby:
sorting in time and per-IP grouping is performed across the whole cluster (very bad, lots of IO)
Spark is good at dealing with partitions:
per-IP groups are placed on different machines (partitions)
global sort in time is replaced by many local per-IP sorts done on machines responsible for extracting sessions for specific groups of IP addresses
Other improvements
Avoid redundant temp files, redundant (de)-serialization of objects (comes with Java/Scala), stages keep data in memory when possible (comes with Spark)
Cache results of user agent resolution that are heavy on regular expressions
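Caching user agent resolution pays off because a day of logs repeats a small set of user agent strings many times, so each distinct string only needs to run through the regular expressions once. A minimal sketch of that idea using Python's `functools.lru_cache`; the patterns, labels and cache size are hypothetical and much simpler than any real user agent classifier:

```python
import re
from functools import lru_cache

# Hypothetical patterns for illustration only; checked in order.
_UA_PATTERNS = [
    ("bot", re.compile(r"bot|crawler|spider", re.IGNORECASE)),
    ("firefox", re.compile(r"Firefox/\d+")),
    ("chrome", re.compile(r"Chrome/\d+")),
]


@lru_cache(maxsize=100_000)
def resolve_user_agent(ua):
    """Classify a user agent string; repeated strings are served from the cache."""
    for label, pattern in _UA_PATTERNS:
        if pattern.search(ua):
            return label
    return "other"


agents = ["Mozilla/5.0 Firefox/38.0", "Googlebot/2.1", "Mozilla/5.0 Firefox/38.0"]
print([resolve_user_agent(ua) for ua in agents])  # ['firefox', 'bot', 'firefox']
print(resolve_user_agent.cache_info())
```

The third lookup above is a cache hit, so the regular expressions run only twice for three records.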
35. Data Virtualization for Intercloud Analytics
Customer Benefits
Discover data beyond the enterprise: Virtual integration that combines traditional enterprise data, Big Data stores on CIS and AWS, cloud data from SaaS providers, and Cisco customers and partners
Seamless interoperability offers easy access to data across distributed data sources in the intercloud analytics platform
Universal data governance maximizes enforcement of data security rules
Analytics Data Hubs: Deployment flexibility to build hybrid/virtual sandboxes that enable nimble data discovery and rapid data analytics to support multiple LOBs
Deliver data to any number of analytics tools.
36. Use Case 1: Get Case Interactions
Use Case Description: # of cases opened by company X that are currently open. (other variations would include cases by company, trends etc.)
CiscoDV Value: CiscoDV enforces data security rules to restrict access on the intercloud platform to customer sensitive data.
Data Sources: SalesForce
Intercloud Solution: CIS CiscoDV service can access the “sanitized” version of CSOne data through JDBC from RIDES(SWTG CiscoDV) API.
Connection Type: DV on hybrid cloud Enterprise data store
37. Use Case 2: Get Customer Journey
Use Case Description: Customer interactions on the web pertaining to bug search and case submission process. Foundational data can be used to explore trends and feed into content recommendation models
CiscoDV Value: Direct access to Data on CIS Intercloud Analytics Platform
Data Sources: SAS Analytics
Intercloud Solution: By direct network access to the Impala Server, the CIS CiscoDV server connects to the Impala Service in Hadoop also on CIS as a Data Source. SQL Queries configured in CiscoDV execute Impala queries
Connection Type: DV on hybrid cloud VPC Big Data platform
38. Use Case 3: Get Bug Interactions
Use Case Description: Another foundational data service that provides a breakdown of customer exposure or interest in bugs. The service can be refined further to look at trends specific to a company or a product for further analytics.
CiscoDV Value: Real-time data federation that accesses extremely large data in CIS Intercloud Analytics platform and joins that with Bug Data accessed via departmental CiscoDV instance (RIDES)
Data Sources: SASA Analytics and QDDTS via RIDES
Intercloud Solution: By building on the access to the Impala Server, the DV server can join the Bug Data from the Enterprise Data Stores with the HDFS data to provide a federated view.
Connection Type: DV on hybrid cloud VPC Big Data platform and Enterprise data store
39. CiscoDV on Intercloud Analytics Platform (CIS)
Scenario 1: CIS CiscoDV to Cisco Enterprise Data Store
Scenario 2: CIS CiscoDV to Impala and Hive on CIS Intercloud Analytics Platform
Scenario 3: CIS CiscoDV to Hive on AWS Big Data Cluster
Editor's Notes
FABIO – a few items from Pankaj and Liz Monday:
Per the John Chambers slides I sent you Monday night, please be sure to fully address digitization in the opener, so Pankaj can connect to John’s opening remarks.
Set the stage here for what the digital transformation is and why it drives IoE and cloud. Explain where we came from, where we are today – exponential growth and a magnitude of changes still to come.
Please see new VNI, to see if there are any newer/better stats re the Data Center.
Pankaj feels the top 3 data points are ok in this slide, but perhaps we could find better ones for the bottom 2 data points? Maybe uplevel them a bit?
-------------------------------------------------------
The world is changing. The digital transformation is turning traditional business models on their heads. We are seeing unprecedented growth in the explosion of devices and mobile apps and in data utilization.
IoE – IoE devices create 277 times the data that the end user is creating. But only a fraction of it ever reaches the data center. A Boeing 787, for example, generates 40 TB of data for every hour of flight time. But only 0.5 TB is ultimately transmitted to the data center.
Mobility: In 2014, global mobile data traffic grew 1.7x or 69%… In 2014 alone, 77B+ mobile apps downloaded… by 2015 180B apps (233% increase)
Internet… IDC predicts by 2017, there will be 3.6 billion global Internet users… More than 1/2 the world population
Big Data… By 2020 there will be more than 5,000 GB of data for every person on Earth
These massive changes are putting tremendous stress on the data center. The traditional data center model has to evolve in order to meet demand today and into the future.
We know how to fix this
We’re going to do for cloud what we did for data. You couldn’t move data between the networks – they weren’t connected. Cisco unified those worlds
The world of cloud today is a world of isolated clouds. There’s no workload or data portability.
“Amazon is hotel California – you can never leave, and that data is staying there”
Our vision is to connect all these clouds together into the Intercloud, whether private, public, or hybrid, through technology and innovation
Intercloud is going to connect these clouds together in the same way we connected data together.
No one cloud model or single cloud approach, such as the massively scalable clouds from Amazon, Google or Microsoft will win alone in this space