This talk was given by Jun Rao (Staff Software Engineer at LinkedIn) and Sam Shah (Senior Engineering Manager at LinkedIn) at the Analytics@Webscale Technical Conference (June 2013).
Continuous SQL with SQL Stream Builder | Eventador / Cloudera
Covers Flink SQL, Kafka, Apache NiFi, SMM, Schema Registry, Avro, JSON, and Apache Calcite. Presented at the Future of Data New York meetup.
Speakers: Kenny Gorman, Tim Spann, John Kuchmek
The Future of Hadoop by Arun Murthy, PMC Apache Hadoop & Cofounder Hortonworks | Data Con LA
Arun Murthy will discuss the future of Hadoop and what the big data world will start to look like next. With the advent of tools like Spark and Flink and the containerization of apps using Docker, there is a lot of momentum in this space. Arun will share his thoughts and ideas on what the future holds for us.
Bio:
Arun C. Murthy
Arun is an Apache Hadoop PMC member and has been a full-time contributor to the project since its inception in 2006. He is also the lead of the MapReduce project and has focused on building NextGen MapReduce (YARN). Prior to co-founding Hortonworks, Arun was responsible for all MapReduce code and configuration deployed across the 42,000+ servers at Yahoo!. In essence, he was responsible for running Apache Hadoop’s MapReduce as a service for Yahoo!. He also jointly holds the current world sorting record set using Apache Hadoop. Follow Arun on Twitter: @acmurthy.
Druid is a high-performance, column-oriented distributed data store that is widely used at Oath for big data analysis. Druid uses a JSON-based query language, which makes it difficult for new users unfamiliar with the schema to start querying Druid quickly. The JSON schema is designed to work with Druid's data ingestion methods, so it can expose high-performance features such as data aggregations in JSON, but many users cannot take advantage of those features because they are not familiar with the specifics of how to optimize Druid queries. However, most new Druid users at Yahoo are already very familiar with SQL, and the queries they want to write for Druid can be expressed as concise SQL.
We found that our data analysts wanted an easy way to issue ad hoc Druid queries and view the results in a BI tool in a way that is presentable to nontechnical stakeholders. To achieve this, we had to bridge the gap between Druid, SQL, and our BI tools such as Apache Superset. In this talk, we will explore different ways to query a Druid datasource in SQL and discuss which methods were most appropriate for our use cases. We will also discuss our open source contributions so others can utilize our work.
Speakers: Guruganesh Kotta, Software Development Engineer, Oath; Junxian Wu, Software Engineer, Oath Inc.
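To make the JSON-versus-SQL gap concrete, here is a minimal sketch (the datasource, column names, and broker address are hypothetical) of the same daily aggregation expressed as a Druid native JSON query and as Druid SQL, each sent to the broker's standard HTTP endpoint:

```python
import requests  # assumption: the requests package is available

BROKER = "http://localhost:8082"  # hypothetical Druid broker address

# Native JSON query: daily sums over a hypothetical "pageviews" datasource.
native_query = {
    "queryType": "timeseries",
    "dataSource": "pageviews",
    "granularity": "day",
    "intervals": ["2018-01-01/2018-02-01"],
    "aggregations": [{"type": "longSum", "name": "views", "fieldName": "views"}],
}
native_result = requests.post(f"{BROKER}/druid/v2", json=native_query).json()

# The equivalent query in Druid SQL, sent to the broker's SQL endpoint.
sql_result = requests.post(
    f"{BROKER}/druid/v2/sql",
    json={"query": (
        "SELECT FLOOR(__time TO DAY) AS \"day\", SUM(views) AS views "
        "FROM pageviews "
        "WHERE __time >= TIMESTAMP '2018-01-01' AND __time < TIMESTAMP '2018-02-01' "
        "GROUP BY 1"
    )},
).json()
```

The native form maps directly onto Druid's ingestion-time aggregators; the SQL form is what most analysts will reach for first.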
Disaster Recovery Experience at CACIB: Hardening Hadoop for Critical Financia... | DataWorks Summit
Hadoop is becoming a standard platform for building critical financial applications such as risk reporting, trading, and fraud detection. These applications require strict SLAs (service-level agreements) in terms of RPO (Recovery Point Objective) and RTO (Recovery Time Objective). To achieve these SLAs, organizations need to build a disaster recovery plan that covers several layers, ranging from the infrastructure through the platform and applications to the clients. In this talk, we will present the different architecture blueprints for disaster recovery as well as their corresponding SLA objectives. Then, we will focus on the stretch cluster solution that Crédit Agricole CIB is using in production. We will discuss the solution’s advantages and drawbacks and the impact of this approach on the global architecture. Finally, we will explain in detail how to configure and deploy this solution and how to integrate each layer (storage layer, processing layer...) into the architecture.
How can you avoid inconsistencies between Kafka and the database? Enter change data capture (CDC) and Debezium. By capturing changes from the log files of the database, Debezium gives you both reliable and consistent inter-service messaging via Kafka and instant read-your-own-write semantics for services themselves.
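As a concrete illustration, a Debezium connector is registered by POSTing a JSON configuration to the Kafka Connect REST API. A minimal sketch for MySQL follows; the hostnames, credentials, and table list are placeholders, and exact property names vary somewhat across Debezium versions:

```python
import requests  # assumption: the requests package is available

CONNECT_URL = "http://localhost:8083/connectors"  # Kafka Connect REST API, default port

# Hypothetical Debezium MySQL connector configuration.
connector = {
    "name": "inventory-connector",
    "config": {
        "connector.class": "io.debezium.connector.mysql.MySqlConnector",
        "database.hostname": "mysql",
        "database.port": "3306",
        "database.user": "debezium",
        "database.password": "secret",
        "database.server.id": "184054",
        "database.server.name": "dbserver1",    # prefix for change-event topics
        "table.whitelist": "inventory.orders",  # capture changes for this table only
        "database.history.kafka.bootstrap.servers": "kafka:9092",
        "database.history.kafka.topic": "schema-changes.inventory",
    },
}
requests.post(CONNECT_URL, json=connector).raise_for_status()
# Change events now flow to Kafka topics such as "dbserver1.inventory.orders",
# giving downstream services a consistent, log-derived view of the table.
```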
The Hadoop Distributed File System is the foundational storage layer in typical Hadoop deployments. Performance and stability of HDFS are crucial to the correct functioning of applications at higher layers in the Hadoop stack. This session is a technical deep dive into recent enhancements committed to HDFS by the entire Apache contributor community. We describe real-world incidents that motivated these changes and how the enhancements prevent those problems from reoccurring. Attendees will leave this session with a deeper understanding of the implementation challenges in a distributed file system and identify helpful new metrics to monitor in their own clusters.
What is Change Data Capture (CDC) and Why is it Important? | FlyData Inc.
Check out what Change Data Capture (CDC) is and why it is becoming ever more important. Slides also include useful tips on how to design your CDC implementation.
Big Data Day LA 2016/ Use Case Driven track - Hydrator: Open Source, Code-Fre... | Data Con LA
This talk will present how to build data pipelines with no code using the open-source, Apache 2.0-licensed Cask Hydrator. The talk will continue with a live demonstration of creating data pipelines for two use cases.
Getting Ready to Use Redis with Apache Spark with Tague Griffith | Databricks
This technical tutorial addresses integrating Redis with an Apache Spark deployment to increase the performance of serving complex decision models. The session starts with a quick introduction to Redis and the capabilities it provides, covering the basic data types provided by Redis and the module system. Using an ad-serving use case, Griffith will look at how Redis can improve the performance and reduce the cost of using complex ML models in production.
You will be guided through the key steps of setting up and integrating Redis with Spark, including how to train a model using Spark and then load and serve it using Redis, as well as how to work with the Spark-Redis module. The capabilities of the Redis Machine Learning Module (redis-ml) will also be discussed, focusing primarily on decision trees and regression (linear and logistic), with code examples demonstrating how to use these features.
By the end of the session, you should feel confident building a prototype/proof-of-concept application using Redis and Spark. You’ll understand how Redis complements Spark and how to use Redis to serve complex ML models with high performance.
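The redis-ml module exposes model structures as native Redis commands; as a simplified stand-in for the division of labor described above, here is a sketch (feature names and keys are hypothetical) that trains a logistic regression with PySpark, exports its weights to Redis, and scores requests from Redis with no Spark dependency at serve time:

```python
import math
import redis  # assumption: the redis-py package is available
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import VectorAssembler
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("train-and-export").getOrCreate()

# Hypothetical training data: two features and a click label.
df = spark.createDataFrame(
    [(0.2, 1.5, 0), (1.1, 0.3, 1), (0.9, 1.2, 1), (0.1, 0.4, 0)],
    ["f1", "f2", "label"],
)
assembler = VectorAssembler(inputCols=["f1", "f2"], outputCol="features")
model = LogisticRegression().fit(assembler.transform(df))

# Export the trained weights to Redis so the serving tier needs no Spark.
r = redis.Redis(host="localhost", port=6379)
r.hset("model:ctr", mapping={
    "intercept": float(model.intercept),
    "w_f1": float(model.coefficients[0]),
    "w_f2": float(model.coefficients[1]),
})

def predict(f1: float, f2: float) -> float:
    """Score a request using only the weights stored in Redis."""
    w = {k.decode(): float(v) for k, v in r.hgetall("model:ctr").items()}
    z = w["intercept"] + w["w_f1"] * f1 + w["w_f2"] * f2
    return 1.0 / (1.0 + math.exp(-z))
```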
dotScale 2017 Keynote: The Rise of Real Time by Neha Narkhede | confluent
Slides from Neha Narkhede's keynote at the dotScale conference in Paris on April 24th, 2017.
There is a tectonic shift happening in how data powers the core of a company's business. This shift is about the rise of real-time. Apache Kafka was built with the vision to help companies navigate this change and become the central nervous system that makes data available in real-time to all the applications that need to use it.
This talk is about how you can put Apache Kafka into practice to help your company make this shift to real-time, and how the Connect and Streams APIs in Apache Kafka capture the entire scope of what it means to put streams into practice.
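The Streams API is a Java library, so as a language-neutral illustration here is the consume-transform-produce loop it abstracts, sketched with the kafka-python client (topic names are hypothetical, and this omits the state stores, windowing, and fault tolerance that Kafka Streams adds):

```python
from kafka import KafkaConsumer, KafkaProducer  # assumption: kafka-python is installed

consumer = KafkaConsumer(
    "page-views",  # hypothetical input topic
    bootstrap_servers="localhost:9092",
    group_id="view-enricher",
)
producer = KafkaProducer(bootstrap_servers="localhost:9092")

# Consume -> transform -> produce: the core loop that the Streams API
# wraps with state stores, windowing, and exactly-once processing in Java.
for record in consumer:
    enriched = record.value.upper()  # stand-in for real business logic
    producer.send("page-views-enriched", enriched)
```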
Self-Service Data Ingestion Using NiFi, StreamSets & Kafka | Guido Schmutz
Many Big Data and IoT use cases are based on combining data from multiple data sources and making it available on a Big Data platform for analysis. The data sources are often very heterogeneous, from simple files and databases to high-volume event streams from sensors (IoT devices). It’s important to retrieve this data in a secure and reliable manner and integrate it with the Big Data platform so that it is available for analysis in real time (stream processing) as well as in batch (typical big data processing). In recent years, new tools have emerged that are especially capable of handling the process of integrating data from outside, often called data ingestion. From an outside perspective, they are very similar to traditional Enterprise Service Bus infrastructures, which larger organizations often use to handle message-driven and service-oriented systems. But there are also important differences: they are typically easier to scale horizontally, offer a more distributed setup, are capable of handling high volumes of data/messages, provide very detailed monitoring at the message level, and integrate very well with the Hadoop ecosystem. This session will present and compare Apache NiFi, StreamSets, and the Kafka ecosystem and show how they handle data ingestion in a Big Data solution architecture.
MongoDB Europe 2016 - Deploying MongoDB on NetApp storage | MongoDB
Customer and business requirements are shifting constantly. Today’s powerful programming languages can keep up—but what about your database? NetApp® MongoDB solutions offer a flexible, scalable answer. Learn how NetApp storage solutions can accelerate MongoDB performance, reduce operational costs, and provide high levels of availability and security. These solutions provide advanced fault-recovery features and easy, in-service growth capabilities to accommodate your unpredictable, ever-changing business demands. NetApp storage is designed to help you build a high-performance, cost-efficient, and highly available analytics solution, so you can focus on adding real business value.
SQL is the most widely used language for data processing. It allows users to concisely and easily declare their business logic. Data analysts usually do not have deep software programming backgrounds, but they can write SQL and use it regularly to analyze data and power business decisions. Apache Flink is one of the streaming engines that support SQL. Besides Flink, some other stream processing frameworks, like Kafka and Spark Structured Streaming, have SQL-like DSLs, but they do not have the same semantics as Flink: Flink’s SQL implementation follows the ANSI SQL standard while the others do not.
In this talk, we will present why following the ANSI SQL standard is an essential characteristic of Flink SQL and how we achieved this. The core business of Alibaba is now fully driven by its data processing engine, Blink, a project based on Flink with Alibaba’s improvements. About 90% of Blink jobs are written in Flink SQL. We will show use cases and share the experience of running large-scale Flink SQL jobs at Alibaba.
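For a flavor of what ANSI-compliant Flink SQL looks like in practice, here is a minimal sketch using PyFlink's Table API (a newer Flink interface than the one discussed in this talk; the table, topic, and column names are hypothetical):

```python
from pyflink.table import EnvironmentSettings, TableEnvironment

# A streaming TableEnvironment; Flink plans SQL with Apache Calcite and
# executes it with ANSI SQL semantics.
t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Hypothetical source table backed by a Kafka topic of order events.
t_env.execute_sql("""
    CREATE TABLE orders (
        user_id STRING,
        amount  DOUBLE,
        ts      TIMESTAMP(3),
        WATERMARK FOR ts AS ts - INTERVAL '5' SECOND
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'orders',
        'properties.bootstrap.servers' = 'localhost:9092',
        'format' = 'json'
    )
""")

# A standard windowed aggregation: revenue per user per minute.
revenue = t_env.sql_query("""
    SELECT user_id,
           TUMBLE_START(ts, INTERVAL '1' MINUTE) AS window_start,
           SUM(amount) AS revenue
    FROM orders
    GROUP BY user_id, TUMBLE(ts, INTERVAL '1' MINUTE)
""")
```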
Speakers
Shaoxuan Wang, Senior Engineering Manager, Alibaba
Xiaowei Jiang, Senior Director, Alibaba
Apache Gobblin: Bridging Batch and Streaming Data Integration. Big Data Meetu... | Shirshanka Das
This talk describes the motivations behind Apache Gobblin (incubating), its architecture, the latest innovations in supporting both batch and streaming data pipelines, and the future roadmap.
Taming the ever-evolving Compliance Beast: Lessons learnt at LinkedIn [Strat... | Shirshanka Das
Just when you think you have your Kafka and Hadoop clusters set up and humming and you’re well on your path to democratizing data, you realize that you now have a very different set of challenges to solve. You want to provide unfettered access to data to your data scientists, but at the same time, you need to preserve the privacy of your members, who have entrusted you with their data.
Shirshanka Das and Tushar Shanbhag outline the path LinkedIn has taken to protect member privacy in its scalable distributed data ecosystem built around Kafka and Hadoop.
They also discuss three foundational building blocks for scalable data management that can meet data compliance regulations: a centralized metadata system, a standardized data lifecycle management platform, and a unified data access layer. Some of these systems are open source and can be of use to companies that are in a similar situation. Along the way, they also look to the future—specifically, to the General Data Protection Regulation, which comes into effect in 2018—and outline LinkedIn’s plans for addressing those requirements.
But technology is just part of the solution. Shirshanka and Tushar also share the culture and process change they’ve seen happen at the company and the lessons they’ve learned about sustainable process and governance.
Strata 2017 (San Jose): Building a healthy data ecosystem around Kafka and Ha... | Shirshanka Das
So, you finally have a data ecosystem with Kafka and Hadoop both deployed and operating correctly at scale. Congratulations. Are you done? Far from it.
As the birthplace of Kafka and an early adopter of Hadoop, LinkedIn has 13 years of combined experience using Kafka and Hadoop at scale to run a data-driven company. Both Kafka and Hadoop are flexible, scalable infrastructure pieces, but using these technologies without a clear idea of what the higher-level data ecosystem should be is perilous. Shirshanka Das and Yael Garten share best practices around data models and formats, choosing the right level of granularity for Kafka topics and Hadoop tables, and moving data efficiently and correctly between Kafka and Hadoop, and they explore a data abstraction layer, Dali, that can help you process data seamlessly across Kafka and Hadoop.
Beyond pure technology, Shirshanka and Yael outline the three components of a great data culture and ecosystem, explain how to create maintainable data contracts between data producers and data consumers (like data scientists and data analysts), and show how to standardize data effectively in a growing organization to enable (and not slow down) innovation and agility. They then look to the future, envisioning a world where you can successfully deploy a data abstraction of views on Hadoop data, like a data API, as a protective and enabling shield. Along the way, Shirshanka and Yael discuss how to enable teams to be good data citizens in producing, consuming, and owning datasets and offer an overview of LinkedIn’s governance model: the tools, processes, and teams that ensure that its data ecosystem can handle change and sustain #datasciencehappiness.
Bigger Faster Easier: LinkedIn Hadoop Summit 2015 | Shirshanka Das
We discuss LinkedIn's big data ecosystem and its evolution through the years. We introduce three open source projects, Gobblin for ingestion, Cubert for computation and Pinot for fast OLAP serving. We also showcase our in-house data discovery and lineage portal WhereHows.
Strata 2016 - Architecting for Change: LinkedIn's new data ecosystem | Shirshanka Das
Shirshanka Das and Yael Garten describe how LinkedIn redesigned its data analytics ecosystem in the face of a significant product rewrite, covering the infrastructure changes that enable LinkedIn to roll out future product innovations with minimal downstream impact. Shirshanka and Yael explore the motivations and building blocks for this reimagined data analytics ecosystem: the technical details of LinkedIn’s new client-side tracking infrastructure, its unified reporting platform, and its data virtualization layer on top of Hadoop. They also share lessons learned from data producers and consumers participating in this governance model, along with anecdotal evidence from the rollout that validated some of their decisions and is shaping the future roadmap of these efforts.
LinkedIn Segmentation & Targeting Platform: A Big Data Application | Amy W. Tang
This talk was given by Hien Luu (Senior Software Engineer at LinkedIn) and Siddharth Anand (Senior Staff Software Engineer at LinkedIn) at the Hadoop Summit (June 2013).
Be better at the job you're in by learning from 277 million members and around 500 Influencers. http://blog.linkedin.com/2014/02/19/the-definitive-professional-publishing-platform/
Participatory Design: Bringing Users Into Your Process | David Sherwin
Good user research has a big impact on product quality. But Agile teams can struggle to integrate user research at the right places. In this talk by Erin Muntzert and David Sherwin, we talk about how Participatory Design can help Agile teams better understand the needs of their customers and get the right design ideas into their products. This talk has been adapted from a workshop that we have delivered at UX Week, Interaction, and UX London: http://bit.ly/pdesignux
Introduction To TensorFlow | Deep Learning Using TensorFlow | TensorFlow Tuto... | Edureka!
In this TensorFlow tutorial, you will learn the basics of TensorFlow and how to create a deep learning model. It covers the following topics (a minimal model-building sketch follows the list):
1. Deep Learning vs Machine Learning
2. What is TensorFlow?
3. TensorFlow Use-Case
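As a taste of the model-building step mentioned above, here is a minimal sketch using TensorFlow's Keras API (TensorFlow 2.x assumed; the data and layer sizes are placeholders):

```python
import numpy as np
import tensorflow as tf  # assumption: TensorFlow 2.x

# Placeholder data: 100 samples, 10 features, binary labels.
x = np.random.rand(100, 10).astype("float32")
y = np.random.randint(0, 2, size=(100,))

# A small feed-forward network built with the Keras API.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation="relu", input_shape=(10,)),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(x, y, epochs=5, batch_size=16)
```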
Making Great User Experiences, Pittsburgh Scrum MeetUp, Oct 17, 2017 | Carol Smith
Everything is designed, yet some interactions are much better than others. What does it take to make a great experience? What are the areas that UX specialists focus on? How do skills in cognitive psychology, computer science, and design come together? Carol introduces basic concepts in user experience design that you can use to improve the user's experience and/or communicate clearly with designers.
Big Data Tutorial For Beginners | What Is Big Data | Big Data Tutorial | Hado... | Edureka!
This Edureka Big Data tutorial helps you understand Big Data in detail. It discusses the evolution of Big Data, the factors associated with it, and the different opportunities it presents, then covers the problems associated with Big Data and how Hadoop emerged as a solution. Below are the topics covered in this tutorial:
1) Evolution of Data
2) What is Big Data?
3) Big Data as an Opportunity
4) Problems in Encashing the Big Data Opportunity
5) Hadoop as a Solution
6) Hadoop Ecosystem
7) Edureka Big Data & Hadoop Training
What is Artificial Intelligence | Artificial Intelligence Tutorial For Beginn... | Edureka!
** Machine Learning Engineer Masters Program: https://www.edureka.co/masters-program/machine-learning-engineer-training **
This tutorial on Artificial Intelligence gives you a brief introduction to AI, discussing how it can be a threat as well as useful. It covers the following topics:
1. AI as a threat
2. What is AI?
3. History of AI
4. Machine Learning & Deep Learning examples
5. Dependency on AI
6. Applications of AI
7. AI Course at Edureka - https://goo.gl/VWNeAu
Big Data Applications Made Easy: Fact Or Fiction? | Glenn Renfro
With Spring XD the answer is Fact. In short Spring XD provides a one stop shop for writing and deploying Big Data Applications. It provides a scalable, fault tolerant, distributed runtime for Data Ingestion, Analytics, and Workflow Orchestration using a single programming, configuration and extensibility model. By reducing the complexity of Big Data development, developers can focus on the business problem.
In this discussion, we will cover:
• The basics of Spring XD
• Show how to deploy streams that will handle data received from multiple sources, and write the results to various sinks
• Capture some analytics from a live data stream
• Show how to create and execute Jobs
• Demonstrate the failover capabilities of an XD cluster
• Discuss how to create your own custom modules
View the recording:
http://hortonworks.com/webinar/accelerating-real-time-data-ingest-hadoop/
Hadoop didn’t disrupt the data center. The exploding amounts of data did. But, let’s face it, if you can’t move your data to Hadoop, then you can’t use it in Hadoop. The experts from Hortonworks, the #1 leader in Hadoop development, and Attunity, a leading data management software provider, cover:
- How to ingest your most valuable data into Hadoop using Attunity Replicate
- About how customers are using Hortonworks DataFlow (HDF) powered by Apache NiFi
- How to combine the real-time change data capture (CDC) technology with connected data platforms from Hortonworks
We discuss how Attunity Replicate and Hortonworks Data Flow (HDF) work together to move data into Hadoop.
Best practices and lessons learnt from Running Apache NiFi at Renault | DataWorks Summit
No real-time insight without real-time data ingestion. No real-time data ingestion without NiFi! Apache NiFi is an integrated platform for data flow management at the enterprise level, enabling companies to securely acquire, process, and analyze disparate sources of information (sensors, logs, files, etc.) in real time. NiFi helps data engineers accelerate the development of data flows thanks to its UI and a large number of powerful off-the-shelf processors. However, with great power comes great responsibility: behind the simplicity of NiFi, best practices must be respected in order to scale data flows in production and prevent sneaky situations. In this joint presentation, Hortonworks and Renault, a French car manufacturer, will present lessons learned from real-world projects using Apache NiFi. We will present NiFi design patterns for achieving high performance and reliability at scale, as well as the processes to put in place around the technology for data flow governance. We will also show how these best practices can be implemented in practical use cases and scenarios.
Speakers
Kamelia Benchekroun, Data Lake Squad Lead, Renault Group
Abdelkrim Hadjidj, Solution Engineer, Hortonworks
This slide deck explores WSO2 Stream Processor’s new features and improvements and explains how they help an organization excel in the current competitive marketplace.
Achieve Sub-Second Analytics on Apache Kafka with Confluent and Imply | confluent
Presenters: Rachel Pedreschi, Senior Director, Solutions Engineering, Imply.io + Josh Treichel, Partner Solutions Architect, Confluent
Analytic pipelines running purely on batch processing systems can suffer from hours of data lag, resulting in accuracy issues with analysis and overall decision-making. Join us for a demo to learn how easy it is to integrate your Apache Kafka® streams in Apache Druid (incubating) to provide real-time insights into the data.
In this online talk, you’ll hear about ingesting your Kafka streams into Imply’s scalable analytic engine and gaining real-time insights via a modern user interface.
Register now to learn about:
-The benefits of combining a real-time streaming platform with a comprehensive analytics stack
-Building an analytics pipeline by integrating Confluent Platform and Imply
-How KSQL, streaming SQL for Kafka, can easily transform and filter streams of data in real time (see the sketch after this description)
-Querying and visualizing streaming data in Imply
-Practical ways to implement Confluent Platform and Imply to address common use cases such as analyzing network flows, collecting and monitoring IoT data and visualizing clickstream data
Confluent Platform, developed by the creators of Kafka, enables the ingestion and processing of massive amounts of real-time event data. Imply, the complete analytics stack built on Druid, can ingest, store, query, and visualize streaming data from Confluent Platform, enabling end-to-end real-time analytics. Together, Confluent and Imply provide low-latency data delivery, data transformation, and data querying capabilities to power a range of use cases.
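To illustrate the KSQL point above, here is a minimal sketch (the stream, topic, and column names are hypothetical) that submits a filtering query to the KSQL server's REST endpoint:

```python
import json
import requests  # assumption: the requests package is available

KSQL_URL = "http://localhost:8088/ksql"  # default KSQL server endpoint

# Declare a stream over an existing Kafka topic, then derive a continuously
# maintained, filtered stream from it.
statements = """
    CREATE STREAM clickstream (user_id VARCHAR, url VARCHAR, status INT)
        WITH (KAFKA_TOPIC='clicks', VALUE_FORMAT='JSON');
    CREATE STREAM errors AS
        SELECT user_id, url FROM clickstream WHERE status >= 400;
"""
resp = requests.post(
    KSQL_URL,
    headers={"Content-Type": "application/vnd.ksql.v1+json; charset=utf-8"},
    data=json.dumps({"ksql": statements, "streamsProperties": {}}),
)
resp.raise_for_status()
```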
First presentation for Savi's sponsorship of the Washington DC Spark Interactive. Discusses tips and lessons learned using Spark Streaming (24x7) to ingest and analyze Industrial Internet of Things (IIoT) data as part of a Lambda Architecture
Hadoop Summit 2014: Building a Self-Service Hadoop Platform at LinkedIn with ... | David Chen
Hadoop comprises the core of LinkedIn’s data analytics infrastructure and runs a vast array of our data products, including People You May Know, Endorsements, and Recommendations. To schedule and run the Hadoop workflows that drive our data products, we rely on Azkaban, an open-source workflow manager developed and used at LinkedIn since 2009. Azkaban is designed to be scalable, reliable, and extensible, and features a beautiful and intuitive UI. Over the years, we have seen tremendous growth, both in the scale of our data and our Hadoop user base, which includes over a thousand developers, data scientists, and analysts. We evolved Azkaban to not only meet the demands of this scale, but also support query platforms including Pig and Hive and continue to be an easy to use, self-service platform. In this talk, we discuss how Azkaban’s monitoring and visualization features allow our users to quickly and easily develop, profile, and tune their Hadoop workflows.
Organizational success depends on our ability to sense the environment, grab opportunities and eliminate threats that are present in real-time. Such real-time processing is now available to all organizations (with or without a big data background) through the new WSO2 Stream Processor.
These slides present WSO2 Stream Processor’s new features and improvements and explain how they help an organization excel in the current competitive marketplace. Some key features we will consider are:
* WSO2 Stream Processor’s highly productive developer environment, with graphical drag-and-drop, and the Streaming SQL query editor
* The ability to process real-time queries that span from seconds to years
* Its interactive visualization and dashboarding features with improved widget generation
* Its ability to process at scale via distributed deployments with full observability
* Default support for HTTP analytics, distributed message trace analytics, and Twitter analytics
Teradata - Presentation at Hortonworks Booth - Strata 2014 | Hortonworks
Hortonworks and Teradata have partnered to provide a clear path to Big Analytics via stable and reliable Hadoop for the enterprise. The Teradata® Portfolio for Hadoop is a flexible offering of products and services for customers to integrate Hadoop into their data architecture while taking advantage of the world-class service and support Teradata provides.
Modernizing Global Shared Data Analytics Platform and our Alluxio Journey | Alluxio, Inc.
Data Orchestration Summit 2020 organized by Alluxio
https://www.alluxio.io/data-orchestration-summit-2020/
Modernizing Global Shared Data Analytics Platform and our Alluxio Journey
Sandipan Chakraborty, Director of Engineering (Rakuten)
This session takes an in-depth look at:
- Trends in stream processing
- How streaming SQL has become a standard
- The advantages of Streaming SQL
- Ease of development with streaming SQL: Graphical and Streaming SQL query editors
- Business value of streaming SQL and its related tools: Domain-specific UIs
- Scalable deployment of streaming SQL: Distributed processing
Webinar: Cutting Time, Complexity and Cost from Data Science to Production | iguazio
Imagine a system where one collects real-time data, develops a machine learning model… Runs analysis and training on powerful GPUs… Clicks on a magic button and then deploys code and ML models to production… All without any heavy lifting from data and DevOps engineers. Today, data scientists work on laptops with just a subset of data and time is wasted while waiting for data and compute.
It’s about efficient use of time! Join Iguazio and NVIDIA so that you can get home early today! Learn how to speed up data science from development to production:
- Access to large-scale, real-time, and operational data without waiting for ETL
- Run high-performance analytics and ML on NVIDIA GPUs (RAPIDS)
- Work on a shared, pre-integrated Kubernetes cluster with Jupyter notebooks and leading data science tools
- One-click (really!) deployment to production
Speakers: Yaron Haviv, CTO at Iguazio, Or Zilberman, Data Scientist at Iguazio and Jacci Cenci, Sr. Technical Marketing Engineer at NVIDIA
Fully featured, commercially supported machine learning suites that can build Decision Trees in Hadoop are few and far between. Addressing this gap, Revolution Analytics recently enhanced its entire scalable analytics suite to run in Hadoop. In this talk, I will explain how our Decision Tree implementation exploits recent research reducing the computational complexity of decision tree estimation, allowing linear scalability with data size and number of nodes. This streaming algorithm processes data in chunks, allowing scaling unconstrained by aggregate cluster memory. The implementation supports both classification and regression and is fully integrated with the R statistical language and the rest of our advanced analytics and machine learning algorithms, as well as our interactive Decision Tree visualizer.
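This is not Revolution Analytics' implementation, but a minimal sketch of the chunked idea: accumulate fixed-size, per-bin class counts one chunk at a time, so choosing a split never requires holding the full dataset in memory (the feature range and bin count here are arbitrary):

```python
import numpy as np

def best_split_streaming(chunks, n_bins=16, lo=0.0, hi=1.0):
    """Find the best threshold on one feature by streaming over chunks.

    `chunks` yields (x, y) arrays; only a fixed-size histogram of class
    counts per bin is kept in memory, never the data itself.
    """
    edges = np.linspace(lo, hi, n_bins + 1)
    counts = np.zeros((n_bins, 2))  # per-bin counts for classes 0 and 1
    for x, y in chunks:
        bins = np.clip(np.digitize(x, edges) - 1, 0, n_bins - 1)
        for c in (0, 1):
            counts[:, c] += np.bincount(bins[y == c], minlength=n_bins)

    def gini(cnt):
        n = cnt.sum()
        return 0.0 if n == 0 else 1.0 - ((cnt / n) ** 2).sum()

    total = counts.sum(axis=0)
    best = (None, gini(total))  # a split must beat the parent impurity
    for i in range(1, n_bins):  # candidate split between bins i-1 and i
        left, right = counts[:i].sum(axis=0), counts[i:].sum(axis=0)
        n_l, n_r = left.sum(), right.sum()
        score = (n_l * gini(left) + n_r * gini(right)) / (n_l + n_r)
        if score < best[1]:
            best = (edges[i], score)
    return best  # (threshold, weighted Gini impurity)
```

Scaling this to whole trees means repeating the histogram pass per node (or per tree level), which is what keeps memory flat as data volume and cluster size grow.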
Processing Real-Time Data at Scale: A streaming platform as a central nervous... | confluent
(Marcus Urbatschek, Confluent)
Presentation during Confluent’s streaming event in Munich. This three-day hands-on course focused on how to build, manage, and monitor clusters using industry best-practices developed by the world’s foremost Apache Kafka™ experts. The sessions focused on how Kafka and the Confluent Platform work, how their main subsystems interact, and how to set up, manage, monitor, and tune your cluster.
Similar to Strata SG 2015: LinkedIn Self Serve Reporting Platform on Hadoop
Explore our comprehensive data analysis project presentation on predicting product ad campaign performance. Learn how data-driven insights can optimize your marketing strategies and enhance campaign effectiveness. Perfect for professionals and students looking to understand the power of data analysis in advertising. For more details visit: https://bostoninstituteofanalytics.org/data-science-and-artificial-intelligence/
Adjusting primitives for graph: SHORT REPORT / NOTES | Subhajit Sahu
Graph algorithms, like PageRank ...
Compressed Sparse Row (CSR) is an adjacency-list-based graph representation that is ...
Multiply with different modes (map)
1. Performance of sequential execution based vs OpenMP based vector multiply.
2. Comparing various launch configs for CUDA based vector multiply.
Sum with different storage types (reduce)
1. Performance of vector element sum using float vs bfloat16 as the storage type.
Sum with different modes (reduce)
1. Performance of sequential execution based vs OpenMP based vector element sum.
2. Performance of memcpy vs in-place based CUDA based vector element sum.
3. Comparing various launch configs for CUDA based vector element sum (memcpy).
4. Comparing various launch configs for CUDA based vector element sum (in-place).
Sum with in-place strategies of CUDA mode (reduce)
1. Comparing various launch configs for CUDA based vector element sum (in-place).
Levelwise PageRank with Loop-Based Dead End Handling Strategy: SHORT REPORT ... | Subhajit Sahu
Abstract — Levelwise PageRank is an alternative method of PageRank computation which decomposes the input graph into a directed acyclic block-graph of strongly connected components, and processes them in topological order, one level at a time. This enables calculation of ranks in a distributed fashion without per-iteration communication, unlike the standard method where all vertices are processed in each iteration. It however comes with a precondition of the absence of dead ends in the input graph. Here, the native non-distributed performance of Levelwise PageRank was compared against Monolithic PageRank on a CPU as well as a GPU. To ensure a fair comparison, Monolithic PageRank was also performed on a graph where vertices were split by components. Results indicate that Levelwise PageRank is about as fast as Monolithic PageRank on the CPU, but quite a bit slower on the GPU. The slowdown on the GPU is likely caused by a large submission of small workloads, and is expected to be a non-issue when the computation is performed on massive graphs.
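A minimal sketch of the decomposition the abstract describes (using networkx, not the authors' code): condense the graph into its DAG of strongly connected components and process components in topological order, so each component sees only final ranks from upstream:

```python
import networkx as nx

def levelwise_pagerank(G, d=0.85, tol=1e-10, max_iter=100):
    """Sketch of levelwise PageRank over a DiGraph G.

    Assumes every vertex has at least one outgoing edge (no dead ends),
    matching the precondition stated in the abstract.
    """
    N = G.number_of_nodes()
    rank = {v: 1.0 / N for v in G}
    C = nx.condensation(G)  # DAG of strongly connected components
    for comp_id in nx.topological_sort(C):
        comp = C.nodes[comp_id]["members"]
        # Contribution from vertices outside this component is fixed,
        # because all upstream components are already finished.
        external = {
            v: sum(rank[u] / G.out_degree(u)
                   for u in G.predecessors(v) if u not in comp)
            for v in comp
        }
        # Iterate PageRank only within the component.
        for _ in range(max_iter):
            new = {
                v: (1 - d) / N + d * (external[v] + sum(
                    rank[u] / G.out_degree(u)
                    for u in G.predecessors(v) if u in comp))
                for v in comp
            }
            err = sum(abs(new[v] - rank[v]) for v in comp)
            rank.update(new)
            if err < tol:
                break
    return rank
```

The per-component iteration is where the method saves communication: vertices in already-finished components are never revisited.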
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23... | John Andrews
Title: Chatty Kathy: Enhancing Physical Activity Among Older Adults
Description:
Discover how Chatty Kathy, an innovative project developed at the UNC Bootcamp, aims to tackle the challenge of low physical activity among older adults. Our AI-driven solution uses peer interaction to boost and sustain exercise levels, significantly improving health outcomes. This presentation covers our problem statement, the rationale behind Chatty Kathy, synthetic data and persona creation, model performance metrics, a visual demonstration of the project, and potential future developments. Join us for an insightful Q&A session to explore the potential of this groundbreaking project.
Project Team: Jay Requarth, Jana Avery, John Andrews, Dr. Dick Davis II, Nee Buntoum, Nam Yeongjin & Mat Nicholas
5. Reporting at LinkedIn: Evolution
[Architecture diagram spanning Ingest → Process → Serve → Visualize: sources include Espresso, Kafka, and external systems; custom ingest jobs feed Oracle (INFA + MSTR), Teradata + scripts, and Hadoop; serving via Voldemort, Pinot, and MySQL; visualization in MSTR, Tableau, internal tools, and custom apps.]
6. Infra Scale
Number of Hadoop clusters: 12
Total number of machines: ~7k
Largest cluster: ~3k machines
Data volume generated per day: XX terabytes
Total accumulated data: XX petabytes
7. People Scale
Reporting Platform Team: ~10
Core Warehouse Team: 1x
Data Scientists: 10x
Business Analysts: 10x
Product Managers: 10x
Sales and Marketing: 100x
8. Challenges (Ingest → Process → Serve → Visualize)
Disjointed efforts, unreliable systems
Unpredictable SLAs across all systems
Fragmented data pipelines with inconsistent data
14. Stream + Batch
[Diagram: diverse sources (REST, SFTP, JDBC) flowing through data quality checks]
Open source @ github.com/linkedin/gobblin
In production @ LinkedIn, Intel, Swisscom, NerdWallet
@LinkedIn: ~20 distinct source types, hundreds of TB per day, hundreds of datasets
22. UMP by the numbers
First version in production since early 2014
Significant redesign in 2015
Total amount of data scanned per day: hundreds of TBs
Total number of metrics computed: 2k+
Total number of scripts: ~400
Number of authors for these metrics: ~200
Maximum number of dimensions per dataset: ~30
Number of people responsible for upkeep of the pipeline: 2
23. Learnings so far
Ease of onboarding
- Hard when you have > 1000 users with different skill sets
- Need great UX to complement developer-friendly alternatives
Single source of truth
- Not just a technology challenge
- Organization needs to rally around it
Operability
- Multi-tenant Hadoop pipeline with SLAs and QoS: hard
- Cost 2 Serve: managing the metrics lifecycle is important
The Next Big Things
- Bridging streaming and batch
- Code-free metrics
- Sessions, Funnels, Cohorts
- Open source
29. Standardize Visualization
Leverage
- Standalone app, with support for embedding
- Can use existing analytics backend: Pinot
Strategic
- Reduces dependency on 3rd-party BI tools
- Closer integration with LinkedIn’s ecosystem of experimentation and anomaly-detection solutions
31. Raptor 1.0
First version built by 3 engineers in a quarter
Features
- Integration with UMP, Pinot
- Time series, bar charts, …
- Create, Publish, Clone, Discover dashboards
Numbers
- Number of dashboards: ~100
- Weekly unique users: ~400
39. The Future for Raptor
Social collaboration features
Intelligence
- Anomaly detection
- Dashboards You May Like
Embedding into data products
Open source
41. What we’re excited about (Ingest → Process → Serve → Visualize)
Unified Metrics Platform
Pinot
Raptor
Metadata Bus
42. Metadata-driven e2e Optimizations
Dynamic prioritization of data ingest
Surface source data quality issues in dashboards
Surface backfill status on dashboards
Cascading deprecation of dashboards, computation, and data sources through lineage