Interactive real-time dashboards on data streams using Kafka, Druid, and Superset - DataWorks Summit
When interacting with analytics dashboards, a smooth user experience hinges on two key requirements: quick response times and data freshness. To meet these requirements for fast, interactive BI dashboards over streaming data, organizations often struggle to select a proper serving layer.
Cluster computing frameworks such as Hadoop or Spark work well for storing large volumes of data, although they are not optimized for making it available for queries in real time. Long query latencies also make these systems suboptimal choices for powering interactive dashboards and BI use cases.
This talk presents an open source real-time data analytics stack using Apache Kafka, Druid, and Superset. The stack combines the low-latency streaming and processing capabilities of Kafka with Druid, which enables immediate exploration and provides low-latency queries over the ingested data streams. Superset provides the visualization and dashboarding layer and integrates nicely with Druid. In this talk we will discuss why this architecture is well suited to interactive applications over streaming data, present an end-to-end demo of the complete stack, walk through its key features, and review performance characteristics from real-world use cases.
Speaker
Nishant Bangarwa, Software Engineer, Hortonworks
Real Time analytics with Druid, Apache Spark and Kafka - Daria Litvinov
The presentation from the Druid meetup in Tel Aviv, November 2019.
Presenting the architecture we've built at Outbrain for a real-time analytics dashboard based on Druid, Spark Streaming, and Kafka.
Resilience: the key requirement of a [big] [data] architecture - StampedeCon
From the StampedeCon 2015 Big Data Conference: There is an adage, “If you fail to plan, you plan to fail.” When developing systems, the adage can be taken a step further: “If you fail to plan FOR FAILURE, you plan to fail.” At Huffington Post, data moves between a number of systems to provide statistics for our technical, business, and editorial teams. Due to the mission-critical nature of our data, considerable effort is spent building resiliency into processes.
This talk will focus on designing for failure. Some material will focus on understanding the traits of specific distributed systems, such as message queues or NoSQL databases, and the consequences of different types of failures. Other parts of the presentation will focus on how systems and software can be designed to make re-processing batch data simple, and on how to determine which failure-mode semantics are important for a real-time event processing system.
Using Multiple Persistence Layers in Spark to Build a Scalable Prediction Engine - StampedeCon
At the StampedeCon 2015 Big Data Conference: This talk will examine the benefits of using multiple persistence strategies to build an end-to-end predictive engine. Utilizing Spark Streaming backed by a Cassandra persistence layer allows rapid lookups and inserts to be made in order to perform real-time model scoring. Spark backed by Parquet files, stored in HDFS, allows for high-throughput model training and tuning utilizing Spark MLlib. Both of these persistence layers also provide ad-hoc queries via Spark SQL in order to easily analyze model sensitivity and accuracy. Storing the data in this way also provides extensibility to leverage existing tools, like CQL to perform operational queries on the data stored in Cassandra and Impala to perform larger analytical queries on the data stored in HDFS, further maximizing the benefits of the flexible architecture.
GDPR compliance application architecture and implementation using Hadoop and Streaming - DataWorks Summit
The General Data Protection Regulation (GDPR) is legislation designed to protect the personal data of European Union citizens and residents. The main requirement is to log personal data accesses/changes in customer-specific applications. These logs can then be audited by the owning entities to provide reporting to end users indicating usage of their personal data. Users have the "right to be forgotten," meaning their personal data can be purged from the system at their request. The regulation goes into effect on May 25, 2018, with significant fines for non-compliance.
This session will provide insight on how to approach and implement a GDPR compliance solution using Hadoop and Streaming for any enterprise with heavy volumes of data. This session will delve into deployment strategies, the architecture of choice (Kafka, NiFi, and Hive ACID with streaming), implementation best practices, configurations, and security requirements. Hortonworks Professional Services System Architects helped the customer on the ground to design, implement, and deploy this application in production.
Speaker
Saurabh Mishra, Hortonworks, Systems Architect
Arun Thangamani, Hortonworks, Systems Architect
Apache Druid ingests and enables instant query on many billions of events in real-time. But how? In this talk, each of the components of an Apache Druid cluster is described – along with the data and query optimisations at its core – that unlock fresh, fast data for all.
Bio: Peter Marshall (https://linkedin.com/in/amillionbytes/) leads outreach and engineering across Europe for Imply (http://imply.io/), a company founded by the original developers of Apache Druid. He has 20 years of architecture experience in CRM, EDRM, ERP, EIP, Digital Services, Security, BI, Analytics, and MDM. He is TOGAF certified and has a BA (Hons) degree in Theology and Computer Studies from the University of Birmingham in the United Kingdom.
Securing data in hybrid environments using Apache Ranger - DataWorks Summit
Companies are increasingly moving to the cloud to store and process data. In this talk, we will discuss how companies can use tag-based policies in Apache Ranger to protect access to data both in on-premises environments and in AWS-based cloud environments. We will go into the details of how tag-based policies work and their integration with Apache Atlas and various services. We will also talk through how companies can leverage Ranger's policies to anonymize or tokenize data while moving into the cloud and de-anonymize it dynamically using Apache Kafka, Apache Hive, Apache Spark, or plain old ETL using MapReduce. We will also deep dive into Ranger's proposed integration with S3 and other cloud-native systems. We will wrap up with an end-to-end demo showing how tags and tag-based masking policies can be used to anonymize sensitive data, how tags are propagated within the system, and how sensitive data can be protected using tag-based policies.
Speakers
Don Bosco Durai, Chief Security Architect, Privacera
Madhan Neethiraj, Sr. Director of Engineering, Hortonworks
Spectator to Participant: Contributing to Cassandra (Patrick McFadin, DataStax) - DataStax
Feeling the need to contribute something to Apache Cassandra? Maybe you want to help guide the future of your favorite database? Get off the sidelines and get in the game! It's easy to say, but how do you even get started? I will outline some of the ways you can help contribute to Apache Cassandra, from minor to major. If you don't have the time or ability to submit code, there are a lot of ways you can participate. What if you do want to write some code? I can walk you through the process of creating a patch and submitting it for final approval. Got a great idea? I'll show you how to propose it to the community at large. Take it from me, participating is so much more fun than just watching the project from a distance. Time to jump in!
About the Speaker
Patrick McFadin, Chief Evangelist, DataStax
Patrick McFadin is one of the leading experts on Apache Cassandra and data modeling techniques. As the Chief Evangelist for Apache Cassandra and a consultant for DataStax, he has helped build some of the largest and most exciting deployments in production. Prior to DataStax, he was Chief Architect at Hobsons and an Oracle DBA/Developer for over 15 years.
Big data security challenges are a bit different from those of traditional client-server applications: big data systems are distributed in nature, introducing unique security vulnerabilities. The Cloud Security Alliance (CSA) has categorized the security and privacy challenges into four different aspects of the big data ecosystem: infrastructure security, data privacy, data management, and integrity and reactive security. Each of these aspects is further divided into the following security challenges:
1. Infrastructure security
a. Secure distributed processing of data
b. Security best practices for non-relational data stores
2. Data privacy
a. Privacy-preserving analytics
b. Cryptographic technologies for big data
c. Granular access control
3. Data management
a. Secure data storage and transaction logs
b. Granular audits
c. Data provenance
4. Integrity and reactive security
a. Endpoint input validation/filtering
b. Real-time security/compliance monitoring
In this talk, we are going to refer to the above classification and identify existing security controls, best practices, and guidelines. We will also paint a big picture of how the collective usage of all the discussed security controls (Kerberos, TDE, LDAP, SSO, SSL/TLS, Apache Knox, Apache Ranger, Apache Atlas, Ambari Infra, etc.) can address fundamental security and privacy challenges across the entire Hadoop ecosystem. We will also briefly discuss recent security incidents involving Hadoop systems.
Speakers
Krishna Pandey, Staff Software Engineer, Hortonworks
Kunal Rajguru, Premier Support Engineer, Hortonworks
Why data warehouses cannot support hot analytics - Imply
Check out the full webinar: https://imply.io/videos/why-data-warehouses-cannot-support-hot-analytics
Today’s data warehouses - whether traditional, specialized or cloud-based - are good at supporting cold analytics, such as reporting, where query times can take minutes. But they cannot cost-effectively support hot analytics—interactive ad hoc analytics usually performed by larger groups of users against batch or streaming data. Examples of hot analytics include clickstream analytics; service, network and application performance monitoring; and risk analytics.
Data warehouses struggle with hot analytics use cases because they are too slow, unable to scale, or too expensive. Learn how a new class of real-time data platforms overcome these limitations, and how companies implement a “temperature-based” approach to analytics.
JupyterCon 2020 - Supercharging SQL Users with Jupyter Notebooks - Michelle Ufford
In this talk, we’ll share a practical action plan for how Jupyter notebooks can significantly uplevel the data experience for your SQL users. We’ll do this by introducing a three-tier action plan that describes how companies such as Netflix have successfully created a well-integrated SQL experience within Jupyter.
This action plan will cover how to:
1. build a strong foundation that makes SQL more accessible for your users
2. increase productivity with a secure & integrated user experience; and
3. customize & extend SQL support to meet the unique needs of your organization.
We’ll first look at why notebooks are so appealing for SQL users and explore some of the traditional challenges they often face when working with notebooks. We’ll describe how JupyterHub can be leveraged as a solid foundation for you to build upon. We’ll then describe how adding SQL magics & popular libraries to your JupyterHub environment can lead to a dramatically better experience for your users.
Next, we’ll discuss how a secure & tightly integrated environment can lead to increased productivity for your users. We’ll explore how some easy configuration changes can have an enormous impact on your users. We’ll also offer some ideas for integrating Jupyter with the rest of your environment.
We’ll then move onto customizing and extending Notebooks using data magics, extensions, and tools. We’ll discuss how you can leverage libraries such as ipython-sql and sparkmagic to improve SQL support. We’ll conclude with some ideas for customizing your Jupyter environment to meet the unique needs of your organization.
Attendees will walk away from this session with best practices, tips and suggestions, links to useful resources, and an actionable plan they can implement in their own organizations.
At the StampedeCon 2015 Big Data Conference: As a frequent recipient of the J.D. Power award for excellence in customer service, T-Mobile takes great pride in the quality of care that we provide our customers. As smartphone technologies advance (and fragment), the challenge of providing quality technical support can be daunting.
To address this challenge, T-Mobile is reinventing many of its traditional practices and embracing DevOps, cloud deployment and lambda architecture. Specifically:
* Cassandra for fast and consistent writes (at scale), as well as low-latency reads
* Apache Spark and EMR for processing data archived in S3
* Kafka for flexibility in data ingestion
* Chef and CloudFormation Templates to automate deployments
* Graphite and Riemann for monitoring
The goals of this presentation are to:
* showcase how these technologies are helping T-Mobile be successful in addressing these business challenges
* share tactics for tackling customer preference management and data collection transparency
* share specific “lessons learned” while migrating to NoSQL, Big Data and The Cloud
Search on the fly: how to lighten your Big Data - Simona Russo, Auro Rolle - Codemotion
The talk presents a new technique for real-time single-entity information extraction and investigation. The technique eliminates the regular refresh and persistence of data within the search engine (ETL), providing real-time access to source data and improving response times using in-memory data techniques. The solution presented is concrete, with live customers, and based on real business needs. I will explain the architectural overview, the technology stack used (based on the Apache Lucene library), the accomplished results, and how to scale out the solution.
How to Use Innovative Data Handling and Processing Techniques to Drive Alpha ... - DataWorks Summit
For over 30 years, Parametric has been a leading provider of model-based portfolios to institutional and private investors, with unique implementation and customization expertise. Much like other cutting-edge financial services providers, Parametric operates with highly diverse, fast-moving data from which it gleans insights. Data sources range from benchmark providers to electronic trading participants to stock exchanges. The challenge is not just to onboard the data but also to figure out how to monetize it when the schemas are fast-changing. This presents a problem for traditional architectures, where large teams are needed to design the new ETL flow. Organizations that are able to quickly adapt to new schemas and data sources have a distinct competitive advantage.
In this presentation and demo, architects from Parametric, Chris Gambino & Vamsi Chemitiganti, will present the data architecture designed in response to this business challenge. We discuss the approach (and trade-offs) to pooling, managing, and processing the data using the latest techniques in data ingestion & pre-processing. Overall best practices in creating a central data pool are also discussed. The goal is for quantitative analysts to have the most accurate and up-to-date information for their models to work on. Attendees will be able to draw on their experiences, from both a business and a technology standpoint, on not just creating a centralized data platform but also being able to distribute it to different units.
Analytics methods for big data have two requirements above and beyond analytics methods for normal-sized data. First, the analytics cannot assume that all the data will fit in memory, or even fit on one server. Second, the choice of analysis methods must avoid high-order algorithms. We illustrate the point with one algorithm: Locality Sensitive Hashing.
This presentation provides an introduction to Azure DocumentDB. Topics include elastic scale, global distribution, and guaranteed low latencies (with SLAs) - all in a managed document store that you can query using SQL and JavaScript. We also review common scenarios and advanced data science scenarios.
Druid is an open-source analytics data store specially designed to execute OLAP queries on event data. Its speed, scalability, and efficiency have made it a popular choice to power user-facing analytic applications, including multiple BI tools and dashboards. However, Druid does not provide important features requested by many of these applications, such as a SQL interface or support for complex operations such as joins. This talk presents our work on extending Druid's indexing and querying capabilities using Apache Hive. In particular, our solution allows indexing complex query results in Druid using Hive, querying Druid data sources from Hive using SQL, and executing complex Hive queries on top of Druid data sources. We describe how we built an extension that brings benefits to both systems alike, leveraging Apache Calcite to overcome the challenge of transparently generating Druid JSON queries from the input Hive SQL queries. We conclude with a demo highlighting the performant and powerful integration of these projects.
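As a concrete illustration of this integration, once the Druid storage handler is available, an existing Druid data source can be exposed to Hive and queried in SQL; Calcite then translates the SQL into Druid JSON queries. A minimal sketch, assuming a HiveServer2 at localhost, a Druid Broker at localhost:8082, and a Druid data source named "pageviews" with url and latency columns (all assumptions, not details from the talk):

# Sketch: register an existing Druid data source as a Hive external table
# and query it in SQL. All names and addresses below are assumptions.
beeline -u jdbc:hive2://localhost:10000 -e "
SET hive.druid.broker.address.default=localhost:8082;
CREATE EXTERNAL TABLE druid_pageviews
STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler'
TBLPROPERTIES ('druid.datasource' = 'pageviews');
SELECT url, SUM(latency) AS total_latency FROM druid_pageviews GROUP BY url;"

Under the hood, the GROUP BY above would be rewritten by Calcite into a Druid groupBy JSON query and executed on the Druid cluster, rather than as a MapReduce/Tez job.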
Tugdual Grall - Real World Use Cases: Hadoop and NoSQL in Production - Codemotion
What’s important about a technology is what you can use it to do. I’ve looked at what a number of groups are doing with Apache Hadoop and NoSQL in production, and I will relay what worked well for them and what did not. Drawing from real-world use cases, I show how people who understand these new approaches can employ them well in conjunction with traditional approaches and existing applications. Threat detection, data warehouse optimization, marketing efficiency, and biometric databases are some of the examples covered in this presentation.
Taboola's experience with Apache Spark (presentation @ Reversim 2014) - tsliwowicz
At Taboola we are getting a constant feed of data (many billions of user events a day) and are using Apache Spark together with Cassandra for both real-time data stream processing and offline data processing. We'd like to share our experience with these cutting-edge technologies.
Apache Spark is an open source project - a Hadoop-compatible computing engine that makes big data analysis drastically faster, through in-memory computing, and simpler to write, through easy APIs in Java, Scala and Python. This project was born as part of PhD work in UC Berkeley's AMPLab (part of the BDAS - pronounced "Bad Ass" - stack) and turned into an incubating Apache project with more active contributors than Hadoop. Surprisingly, Yahoo! is one of the biggest contributors to the project and already has large production clusters of Spark on YARN.
Spark can run as a standalone cluster, on Apache Mesos with ZooKeeper, or on YARN, and can run side by side with Hadoop/Hive on the same data.
One of the biggest benefits of Spark is that the API is very simple and the same analytics code can be used for both streaming data and offline data processing.
Measuring CDN performance and why you're doing it wrong - Fastly
Integrating content delivery networks into your application infrastructure can offer many benefits, including major performance improvements for your applications. So understanding how CDNs perform — especially for your specific use cases — is vital. However, testing for measurement is complicated and nuanced, and results in metric overload and confusion. It's becoming increasingly important to understand measurement techniques, what they're telling you, and how to apply them to your actual content.
In this session, we'll examine the challenges around measuring CDN performance and focus on the different methods for measurement. We'll discuss what to measure, important metrics to focus on, and different ways that numbers may mislead you.
More specifically, we'll cover:
Different techniques for measuring CDN performance
Differentiating between network footprint and object delivery performance
Choosing the right content to test
Core metrics to focus on and how each impacts real traffic
Understanding cache hit ratio, why it can be misleading, and how to measure for it
AWS Summit 2013 | Singapore - Delivering Search for Today's Local, Social, an... - Amazon Web Services
As more organizations seek to leverage the power and benefits of the cloud, they also need to combine new systems with existing on-premises systems. Services such as Virtual Private Cloud, VPN and Direct Connect enable AWS customers to combine on-premises and cloud-based resources easily and effectively. This session will walk customers through the 4 main patterns of connectivity and will include a "real time" demonstration of how easy it is to set up your own VPC and start working in your own private section of the AWS Cloud.
Machine Learning for Smarter Apps - Jacksonville Meetup - Sri Ambati
Machine Learning for Smarter Apps with Tom Kraljevic
- Powered by the open source machine learning software H2O.ai. Contributors welcome at: https://github.com/h2oai
- To view videos on H2O open source machine learning software, go to: https://www.youtube.com/user/0xdata
Building an Enterprise-Scale Dashboarding/Analytics Platform Powered by the C... - Imply
Target is one of the largest retailers in the United States, with brick-and-mortar stores in all 50 states and one of the most-visited ecommerce sites in the country. In addition to typical merchandising functions like assortment planning, pricing and inventory management, Target also operates a large supply chain, financial/banking operations and property management organizations. As a data-driven organization, we need a data analytics platform that can address the unique needs of each of these various business units, while scaling to hundreds of thousands of users and accommodating an ever-increasing amount of data.
In this talk we’ll cover why Target chose to create our own analytics platform and specifically how Druid makes this platform successful. We’ll cover how we utilize key features in Druid, such as union datasources, arbitrary granularities, real-time ingestion, complex aggregation expressions and lightning-fast query response to provide analytics to users at all levels of the organization. We’ll also cover how Druid’s speed and flexibility allow us to provide interactive analytics to front-line, edge-of-business consumers to address hundreds of unique use-cases across several business units.
Druid and Hive Together: Use Cases and Best Practices - DataWorks Summit
Two popular open source technologies, Druid and Apache Hive, are often mentioned as viable solutions for large-scale analytics. Hive works well for storing large volumes of data, although it is not optimized for ingesting streaming data and making it available for queries in real time. On the other hand, Druid excels at low-latency, interactive queries over streaming data and at making data available in real time for queries. Although the high-level messaging presented by both projects may lead you to believe they are competing for the same use case, the technologies are in fact extremely complementary solutions.
By combining the rich query capabilities of Hive with the powerful real-time streaming and indexing capabilities of Druid, we can build more powerful, flexible, and extremely low-latency real-time streaming analytics solutions. In this talk we will discuss the motivation for combining Hive and Druid, along with the benefits, use cases, best practices, and benchmark numbers.
The agenda of the talk:
1. Motivation behind integrating Druid with Hive
2. Druid and Hive together - benefits
3. Use Cases with Demos and architecture discussion
4. Best Practices - Do's and Don'ts
5. Performance vs Cost Tradeoffs
6. SSB Benchmark Numbers
Techniques to optimize the PageRank algorithm usually fall into two categories. One is to try reducing the work per iteration, and the other is to try reducing the number of iterations. These goals are often at odds with one another. Skipping computation on vertices which have already converged has the potential to save iteration time. Skipping in-identical vertices, i.e., those with the same in-links, helps reduce duplicate computations and thus could also reduce iteration time. Road networks often have chains which can be short-circuited before PageRank computation to improve performance; the final ranks of chain nodes can be easily calculated. This could reduce both the iteration time and the number of iterations. If a graph has no dangling nodes, the PageRank of each strongly connected component can be computed in topological order. This could help reduce the iteration time and the number of iterations, and also enable multi-iteration concurrency in PageRank computation. The combination of all of the above methods is the STICD algorithm [sticd]. For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
Opendatabay - Open Data Marketplace.pptx - Opendatabay
Opendatabay.com unlocks the power of data for everyone. Open Data Marketplace fosters a collaborative hub for data enthusiasts to explore, share, and contribute to a vast collection of datasets.
First ever open hub for data enthusiasts to collaborate and innovate. A platform to explore, share, and contribute to a vast collection of datasets. Through robust quality control and innovative technologies like blockchain verification, opendatabay ensures the authenticity and reliability of datasets, empowering users to make data-driven decisions with confidence. Leverage cutting-edge AI technologies to enhance the data exploration, analysis, and discovery experience.
From intelligent search and recommendations to automated data productisation and quotation, Opendatabay's AI-driven features streamline the data workflow. Finding the data you need shouldn't be complex. Opendatabay simplifies the data acquisition process with an intuitive interface and robust search tools. Effortlessly explore, discover, and access the data you need, allowing you to focus on extracting valuable insights. Opendatabay breaks new ground with dedicated, AI-generated synthetic datasets.
Leverage these privacy-preserving datasets for training and testing AI models without compromising sensitive information. Opendatabay prioritizes transparency by providing detailed metadata, provenance information, and usage guidelines for each dataset, ensuring users have a comprehensive understanding of the data they're working with. By leveraging a powerful combination of distributed ledger technology and rigorous third-party audits, Opendatabay ensures the authenticity and reliability of every dataset. Security is at the core of Opendatabay: the marketplace implements stringent security measures, including encryption, access controls, and regular vulnerability assessments, to safeguard your data and protect your privacy.
As Europe's leading economic powerhouse and the fourth-largest #economy globally, Germany stands at the forefront of innovation and industrial might. Renowned for its precision engineering and high-tech sectors, Germany's economic structure is heavily supported by a robust service industry, accounting for approximately 68% of its GDP. This economic clout and strategic geopolitical stance position Germany as a focal point in the global cyber threat landscape.
In the face of escalating global tensions, particularly those emanating from geopolitical disputes with nations like #Russia and #China, #Germany has witnessed a significant uptick in targeted cyber operations. Our analysis indicates a marked increase in #cyberattack sophistication aimed at critical infrastructure and key industrial sectors. These attacks range from ransomware campaigns to #AdvancedPersistentThreats (#APTs), threatening national security and business integrity.
🔑 Key findings include:
🔍 Increased frequency and complexity of cyber threats.
🔍 Escalation of state-sponsored and criminally motivated cyber operations.
🔍 Active dark web exchanges of malicious tools and tactics.
Our comprehensive report delves into these challenges, using a blend of open-source and proprietary data collection techniques. By monitoring activity on critical networks and analyzing attack patterns, our team provides a detailed overview of the threats facing German entities.
This report aims to equip stakeholders across public and private sectors with the knowledge to enhance their defensive strategies, reduce exposure to cyber risks, and reinforce Germany's resilience against cyber threats.
2. Druid Concepts
• What is it?
• Druid is an open-source, fast, distributed, column-oriented data store, designed for low-latency ingestion and very fast ad-hoc, aggregation-based analytics.
• Pros:
• Fast response for aggregation operations (almost sub-second)
• Supports real-time streaming ingestion from many other popular solutions in the market, e.g. Kafka, Samza, Spark, etc.
• Traditional batch-type ingestion (Hadoop based)
• Cons / Limitations:
• Joins are not mature enough
• Limited options compared to other SQL-like solutions
3. Brief History on Druid
• History
• Druid was started in 2011 to power analytics at Metamarkets. The project was open-sourced in October 2012 and moved to an Apache License in February 2015.
4. Industries has Druid in production
• Metamarkets
• Druid is the primary data store for Metamarkets’ full stack visual analytics service for the RTB (real time bidding) space. Ingesting over
30 billion events per day, Metamarkets is able to provide insight to its customers using complex ad-hoc queries at query time of
around 1 second in almost 95% of the time.
• Airbnb
• Druid powers slice and dice analytics on both historical and real time-time metrics. It significantly reduces latency of analytic queries
and help people to get insights more interactively.
• Alibaba
• At Alibaba Search Group, we use Druid for real-time analytics of users' interaction with its popular e-commerce site.
• Cisco
• Cisco uses Druid to power a real-time analytics platform for network flow data.
• eBay
• eBay uses Druid to aggregate multiple data streams for real-time user behavior analytics by ingesting up at a very high rate(over
100,000 events/sec), with the ability to query or aggregate data by any random combination of dimensions, and support over 100
concurrent queries without impacting ingest rate and query latencies.
6. Druid In Production – Metamarkets
• 3M+ events/sec through Druid's real-time ingestion.
• 100+ PB of data.
• Applications supporting 1000s of concurrent queries per second.
• Scales horizontally across 1000s of cores.
• References:
• https://metamarkets.com/2016/impact-on-query-speed-from-forced-processing-ordering-in-druid/
• https://metamarkets.com/2016/distributing-data-in-druid-at-petabyte-scale/
7. A real example of Druid in Action
Reference: https://whynosql.com/2015/11/06/lambda-architecture-with-druid-at-gumgum/
8. Ideal requirements for Druid?
• You need:
• Fast aggregation & arbitrary data exploration at low latency on huge data sets
• Fast response on near-real-time event data (ingested data is immediately available for querying)
• No SPoF (single point of failure)
• Handling of petabytes of data with multiple dimensions
• Sub-second, time-oriented summarization of the incoming data stream
• NOTE: before we get to the architecture, here is a typical use case to illustrate what we have said so far.
9. Druid Concepts – An example
• The Data:

timestamp             publisher          advertiser  gender  country  click  price
2011-01-01T01:01:35Z  bieberfever.com    google.com  Male    USA      0      0.65
2011-01-01T01:03:53Z  bieberfever.com    google.com  Male    USA      0      0.62
2011-01-01T01:04:51Z  bieberfever.com    google.com  Male    USA      1      0.45
2011-01-01T01:00:00Z  ultratrimfast.com  google.com  Female  UK       0      0.87
2011-01-01T02:00:00Z  ultratrimfast.com  google.com  Female  UK       0      0.99
2011-01-01T02:00:00Z  ultratrimfast.com  google.com  Female  UK       1      1.53

• Roll-up at ingestion time:

GROUP BY timestamp, publisher, advertiser, gender, country
  :: impressions = COUNT(1), clicks = SUM(click), revenue = SUM(price)

timestamp             publisher          advertiser  gender  country  impressions  clicks  revenue
2011-01-01T01:00:00Z  ultratrimfast.com  google.com  Male    USA      1800         25      15.70
2011-01-01T01:00:00Z  bieberfever.com    google.com  Male    USA      2912         42      29.18
2011-01-01T02:00:00Z  ultratrimfast.com  google.com  Male    UK       1953         17      17.31
2011-01-01T02:00:00Z  bieberfever.com    google.com  Male    UK       3194         170     34.01
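This roll-up is configured in Druid's ingestion spec rather than written as SQL: the dimensions, the metric aggregators, and the query granularity together express the GROUP BY above. A minimal sketch of the corresponding dataSchema, reusing the column names from the example (the data source and file names are assumptions, not from the deck):

# Sketch: dataSchema fragment expressing the roll-up above. "ad_events" and
# the output file name are assumptions; column names follow the example.
cat > rollup-dataSchema.json <<'EOF'
{
  "dataSource": "ad_events",
  "parser": {
    "type": "string",
    "parseSpec": {
      "format": "json",
      "timestampSpec": { "column": "timestamp", "format": "iso" },
      "dimensionsSpec": { "dimensions": ["publisher", "advertiser", "gender", "country"] }
    }
  },
  "metricsSpec": [
    { "type": "count",     "name": "impressions" },
    { "type": "longSum",   "name": "clicks",  "fieldName": "click" },
    { "type": "doubleSum", "name": "revenue", "fieldName": "price" }
  ],
  "granularitySpec": { "type": "uniform", "segmentGranularity": "DAY", "queryGranularity": "HOUR" }
}
EOF

With queryGranularity set to HOUR, events in the same hour with identical dimension values collapse into a single stored row, which is why the rolled-up table has one row per hour per publisher.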
10. Druid – Architecture
[Architecture diagram: CLIENT queries arrive at the Broker Node; streaming data is ingested and indexed by the Real-Time Node; static data is submitted as indexing tasks via the Overlord Node; Historical Nodes serve immutable segments loaded from Deep Storage (HDFS).]
11. Druid – Architecture (cluster management dependency)
[Architecture diagram: the same data and query flow as above, with a Coordinator Node managing segment assignment to Historical Nodes, plus the external dependencies ZooKeeper and the Metadata Store; Deep Storage remains HDFS.]
12. Druid – Components
• Broker Node
• Real time node
• Overlord Node
• Middle-Manager Node
• Historical Node
• Coordinator Node
• Aside from these nodes, there are 3 external dependencies to the system:
• A running ZooKeeper cluster for cluster service discovery and maintenance of the current data topology
• A metadata storage instance for maintenance of metadata about the data segments that should be served by the system
• A "deep storage" system to hold the stored segments.
13. Druid - Data Storage Layer
• Segments and Data Storage
• Druid stores its index in segment files, which are partitioned by time
• Columnar: the data for each column is laid out in separate data structures.
14. Druid – Query
• Query types:
• Timeseries
• TopN
• GroupBy & Aggregations
• Time Boundary
• Search
• Select
• Each query is POSTed as a JSON object whose main components are (see the sketch below):
• a) queryType
• b) granularity
• c) filter
• d) aggregation
• e) post-aggregation
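To make these components concrete, here is a minimal sketch of a timeseries query as it would be POSTed to a Broker; the data source and field names are assumptions for illustration (they anticipate the pageviewsLat examples on the following slides):

# Sketch: daily total latency for user "alice" (all names are assumptions).
curl -X POST -H 'Content-Type: application/json' http://localhost:8082/druid/v2/?pretty -d '{
  "queryType": "timeseries",
  "dataSource": "pageviewsLat",
  "granularity": "day",
  "filter": { "type": "selector", "dimension": "user", "value": "alice" },
  "aggregations": [ { "type": "doubleSum", "name": "totalLatency", "fieldName": "latency" } ],
  "intervals": [ "2017-01-01/2018-01-01" ]
}'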
16. Task Submit Commands
• 1. Clear the HDFS storage location:
• hdfs dfs -rm -r /user/root/segments
• 2. Make sure the data source file exists in the local FS (/root/labtest/druid_hadoop/druid-0.10.0/quickstart/Test/pageviewsLatforCountExmaple.json) and upload it to HDFS:
• hdfs dfs -put -f pageviewsLat.json /user/root/quickstart/Test
• 3. Create the index task on Druid (a sketch of such a task file follows below):
• curl -X 'POST' -H 'Content-Type:application/json' -d @quickstart/Test/pageviewsLat-index-forCountExample.json localhost:8090/druid/indexer/v1/task
• Task information can be seen at <overlord_host>:8090/console.html
• 4. Verify that the segments were created under /user/root/segments:
• hdfs dfs -ls /user/root/segments
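For reference, a sketch of what an index task file like the one POSTed above might contain; every name, path, and field below is an assumption for illustration, not the deck's actual spec:

# Sketch: skeleton of a Hadoop batch index task (all names/paths assumed).
cat > quickstart/Test/pageviewsLat-index-forCountExample.json <<'EOF'
{
  "type": "index_hadoop",
  "spec": {
    "dataSchema": {
      "dataSource": "pageviewsLat",
      "parser": {
        "type": "hadoopyString",
        "parseSpec": {
          "format": "json",
          "timestampSpec": { "column": "time", "format": "auto" },
          "dimensionsSpec": { "dimensions": ["user", "url"] }
        }
      },
      "metricsSpec": [
        { "type": "count",     "name": "views" },
        { "type": "doubleSum", "name": "latency", "fieldName": "latency" }
      ],
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "DAY",
        "queryGranularity": "NONE",
        "intervals": [ "2015-09-01/2015-09-02" ]
      }
    },
    "ioConfig": {
      "type": "hadoop",
      "inputSpec": { "type": "static", "paths": "/user/root/quickstart/Test/pageviewsLat.json" }
    }
  }
}
EOF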
17. Query Commands
• TopN
• Returns the top N pages by latency, in descending order:
• curl -L -H 'Content-Type: application/json' -XPOST --data-binary @quickstart/Test/query/pageviewsLatforCount-top-latency-pages.json http://localhost:8082/druid/v2/?pretty
• Timeseries
• Returns total latency, filtered by user = "alice", with "granularity": "day" [or "all"]:
• curl -L -H 'Content-Type: application/json' -XPOST --data-binary @quickstart/Test/query/pageviewsLatforCount-timeseries-pages.json http://localhost:8082/druid/v2/?pretty
• groupBy
• A) Returns aggregated latency grouped by user + url (a sketch of this query file follows below):
• curl -L -H 'Content-Type: application/json' -XPOST --data-binary @quickstart/Test/query/pageviewsLatforCount-aggregateLatencyGrpByURLUser.json http://localhost:8082/druid/v2/?pretty
• B) Returns the aggregated page count (i.e. the number of URLs accessed) grouped by user:
• curl -L -H 'Content-Type: application/json' -XPOST --data-binary @quickstart/Test/query/pageviewsLatforCount-countURLAccessedGrpByUser.json http://localhost:8082/druid/v2/?pretty
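A sketch of what query file A) above might contain, inlined here for readability (field names are assumptions):

# Sketch: total latency grouped by user + url.
curl -L -H 'Content-Type: application/json' -XPOST http://localhost:8082/druid/v2/?pretty --data-binary '{
  "queryType": "groupBy",
  "dataSource": "pageviewsLat",
  "granularity": "all",
  "dimensions": [ "user", "url" ],
  "aggregations": [ { "type": "doubleSum", "name": "totalLatency", "fieldName": "latency" } ],
  "intervals": [ "2015-09-01/2015-09-02" ]
}'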
18. Query Commands
• Time Boundary
• Time boundary queries return the earliest and latest data points of a data set:
• curl -L -H 'Content-Type: application/json' -XPOST --data-binary @quickstart/Test/query/pageviewsLatforCount-timeBoundary-pages.json http://localhost:8082/druid/v2/?pretty
• Search
• A search query returns dimension values that match the search specification, e.g. here searching the url dimension for matches with the text "facebook" (a sketch of this query follows below):
• curl -L -H 'Content-Type: application/json' -XPOST --data-binary @quickstart/Test/query/pageviewsLatforCount-search-URL-pages.json http://localhost:8082/druid/v2/?pretty
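A sketch of the search query just described (data source, dimension, and interval are assumptions): it returns the values of the url dimension that contain the text "facebook".

# Sketch: search the url dimension for values containing "facebook".
curl -L -H 'Content-Type: application/json' -XPOST http://localhost:8082/druid/v2/?pretty --data-binary '{
  "queryType": "search",
  "dataSource": "pageviewsLat",
  "granularity": "all",
  "searchDimensions": [ "url" ],
  "query": { "type": "insensitive_contains", "value": "facebook" },
  "intervals": [ "2015-09-01/2015-09-02" ]
}'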