Fast real-time approximations using Spark streaming (huguk)
By Kevin Schmidt (Head of Data Science at Mind Candy)
Luis Vicente (Senior Data Engineer at Mind Candy)
For mobile games, constant tweaks are the difference between success and failure. Data and analytics have to be available in real time, but calculating, for example, the uniqueness or newness of a data point requires a list of previously seen data points, which is both memory-intensive and tricky when using real-time stream processing like Spark Streaming. Probabilistic data structures allow approximation of these properties with a fixed-memory representation, and are very well suited to this kind of stream processing. Getting from the theory of approximation to a useful metric at a low error rate, even for many millions of users, is another story. In our talk we will look at practical ways of achieving this: which approximations we used for a selection of useful metrics, why we picked a specific probabilistic data structure, how we stored it in Cassandra as a time series, and how we implemented it in Spark Streaming.
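To make the fixed-memory idea concrete, here is a minimal, self-contained HyperLogLog-style counter in Java. It is an illustrative toy (it hashes via hashCode() and omits the small- and large-range corrections of real implementations), not the structure Mind Candy used, but it shows the key property: distinct-count estimates from about 1 KB of fixed state, no matter how many events are offered.

public class ToyHyperLogLog {
    private static final int B = 10;            // 2^10 = 1024 registers (~1 KB)
    private static final int M = 1 << B;
    private final byte[] registers = new byte[M];

    public void offer(Object item) {
        long h = mix(item.hashCode());          // toy: only 32 bits of input entropy
        int idx = (int) (h >>> (64 - B));       // top B bits choose a register
        int rank = Long.numberOfLeadingZeros(h << B) + 1; // first 1-bit in the rest
        if (rank > registers[idx]) {
            registers[idx] = (byte) rank;
        }
    }

    public long cardinality() {
        double sum = 0.0;
        for (byte r : registers) {
            sum += Math.pow(2.0, -r);
        }
        double alpha = 0.7213 / (1 + 1.079 / M); // standard HLL bias constant
        // Real implementations add small/large-range corrections here.
        return Math.round(alpha * M * M / sum);
    }

    private static long mix(long x) {            // MurmurHash3 64-bit finalizer
        x ^= x >>> 33;
        x *= 0xff51afd7ed558ccdL;
        x ^= x >>> 33;
        x *= 0xc4ceb9fe1a85ec53L;
        x ^= x >>> 33;
        return x;
    }
}

Because the state is a fixed-size byte array, one sketch can be stored per time bucket (for example per-minute rows in a Cassandra time series), and two sketches merge by taking the element-wise maximum of their registers, which is what makes this shape of structure a good fit for the streaming and storage design the talk describes.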
Building a Versatile Analytics Pipeline on Top of Apache Spark with Mikhail C... (Databricks)
It is common for consumer Internet companies to start off with popular third-party tools for analytics needs. Then, when the user base and the company grows, they end up building their own analytics data pipeline and query engine to cope with their data scale, satisfy custom data enrichment and reporting needs and achieve high quality of their data. That’s exactly the path that was taken at Grammarly, the popular online proofreading service.
In this session, Grammarly will share how they improved business and marketing analytics, previously done with Mixpanel, by building their own in-house analytics engine and application on top of Apache Spark. Chernetsov will touch upon several Spark tweaks and gotchas that they experienced along the way:
– Outputting data to several storages in a single Spark job
– Dealing with the Spark memory model, and building a custom spillable data structure for data traversal
– Implementing a custom query language with parser combinators on top of the Spark SQL parser
– A custom query optimizer and analyzer for when you want something other than plain SQL
– Flexible-schema storage and querying against multi-schema data with schema conflicts
– Custom aggregation functions in Spark SQL (a sketch of this last item follows below)
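The last bullet is the most self-contained, so here is a minimal sketch of a custom aggregation function using Spark's 2.x-era UserDefinedAggregateFunction API. The GeometricMean example and its column names are illustrative assumptions, not code from the Grammarly talk.

import org.apache.spark.sql.Row;
import org.apache.spark.sql.expressions.MutableAggregationBuffer;
import org.apache.spark.sql.expressions.UserDefinedAggregateFunction;
import org.apache.spark.sql.types.DataType;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

// Geometric mean as a UDAF: keeps a running count and sum of logs,
// which merges cleanly across partitions.
public class GeometricMean extends UserDefinedAggregateFunction {
    @Override public StructType inputSchema() {
        return new StructType().add("value", DataTypes.DoubleType);
    }
    @Override public StructType bufferSchema() {
        return new StructType()
            .add("count", DataTypes.LongType)
            .add("logSum", DataTypes.DoubleType);
    }
    @Override public DataType dataType() { return DataTypes.DoubleType; }
    @Override public boolean deterministic() { return true; }
    @Override public void initialize(MutableAggregationBuffer buffer) {
        buffer.update(0, 0L);
        buffer.update(1, 0.0);
    }
    @Override public void update(MutableAggregationBuffer buffer, Row input) {
        if (!input.isNullAt(0)) {
            buffer.update(0, buffer.getLong(0) + 1);
            buffer.update(1, buffer.getDouble(1) + Math.log(input.getDouble(0)));
        }
    }
    @Override public void merge(MutableAggregationBuffer buffer1, Row buffer2) {
        buffer1.update(0, buffer1.getLong(0) + buffer2.getLong(0));
        buffer1.update(1, buffer1.getDouble(1) + buffer2.getDouble(1));
    }
    @Override public Object evaluate(Row buffer) {
        long n = buffer.getLong(0);
        return n == 0 ? null : Math.exp(buffer.getDouble(1) / n);
    }
}

Registered with spark.udf().register("gmean", new GeometricMean()), it becomes callable from SQL as SELECT gmean(value) FROM events.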
An overview of Apache Spark and AWS Glue.
Apache Spark is a fast and general-purpose cluster computing system. It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming.
AWS Glue is a fully managed extract, transform, and load (ETL) service that makes it easy for customers to prepare and load their data for analytics. You can create and run an ETL job with a few clicks in the AWS Management Console. You simply point AWS Glue to your data stored on AWS, and AWS Glue discovers your data and stores the associated metadata (e.g. table definition and schema) in the AWS Glue Data Catalog. Once cataloged, your data is immediately searchable, queryable, and available for ETL.
MongoDB Days UK: Using MongoDB and Python for Data Analysis Pipelines (MongoDB)
Presented by Eoin Brazil, Proactive Technical Services Engineer, MongoDB
Experience level: Advanced
MongoDB offers a flexible, scalable, and easy way to store your large data set. Python provides many useful data science tools (e.g. NumPy, SciPy, scikit-learn). This talk will discuss the concerns in creating operational data analytics pipelines, introduce Monary as an alternative for loading data into NumPy, give examples of accessing data with Monary, and show how to build scalable data analysis pipelines using these open source tools.
Build Your Web Analytics with node.js, Amazon DynamoDB and Amazon EMR (BDT203... (Amazon Web Services)
Want to learn how to build your own Google Analytics? Learn how to build a scalable architecture using node.js, Amazon DynamoDB, and Amazon EMR. This architecture is used by ScribbleLive to track billions of engagement minutes per month. In this session, we go over the code in node.js, how to store the data in Amazon DynamoDB, and how to roll up the data using Hadoop and Hive. Attend this session to learn how to move data quickly at any scale.
Data processing and analysis is where big data is most often consumed, driving business intelligence (BI) use cases that discover and report on meaningful patterns in the data. In this session, we will discuss options for processing, analyzing, and visualizing data. We will also look at partner solutions and BI-enabling services from AWS. Attendees will learn about optimal approaches for stream processing, batch processing, and interactive analytics with AWS services such as Amazon Machine Learning, Amazon Elastic MapReduce (EMR), and Amazon Redshift.
Created by: Jason Morris, Solutions Architect
Building Data Product Based on Apache Spark at Airbnb with Jingwei Lu and Liy... (Databricks)
Building data products typically requires a Lambda Architecture to bridge batch and streaming processing. AirStream is a framework built on top of Apache Spark that allows users to easily build data products at Airbnb. It has proven that Spark is impactful and useful in production for mission-critical data products.
On the streaming side, hear how AirStream integrates multiple ecosystems with Spark Streaming, such as HBase, Elasticsearch, MySQL, DynamoDB, Memcache and Redis. On the batch side, learn how to apply the same computation logic in Spark over large data sets from Hive and S3. The speakers will also go through a few production use cases, and share several best practices on how to manage Spark jobs in production.
Yahoo - Moving beyond running 100% of Apache Pig jobs on Apache Tez (DataWorks Summit)
Last year at Yahoo, we put great effort into scaling and stabilizing Pig on Tez and making it production-ready, and by the end of the year we retired running Pig jobs on MapReduce. This talk will detail the performance and resource utilization improvements Yahoo achieved after migrating all Pig jobs to run on Tez.
After the successful migration and the improved performance, we shifted our focus to addressing some of the bottlenecks we identified and to new optimization ideas that we came up with to make it go even faster. We will go over the new features and work done in Tez to make that happen, such as a custom YARN ShuffleHandler, reworked DAG scheduling order, and serialization changes.
We will also cover exciting new features that were added to Pig for performance, such as bloom join and bytecode generation. A distributed bloom join that can create multiple bloom filters in parallel was straightforward to implement with the flexibility of Tez DAGs. It vastly improved performance and reduced disk and network utilization for our large joins. Bytecode generation for projection and filtering of records is another big feature that we are targeting for Pig 0.17, which will speed up processing by reducing virtual function calls.
Data collection and storage is a primary challenge for any big data architecture. This session will focus on the different types of data that customers are handling to drive high-scale workloads on AWS. Our goal is to help you choose the best approach for your workload. We will dive into optimization techniques that improve performance and reduce the cost of data ingestion, covering AWS services including Amazon S3, DynamoDB, and Kinesis.
Created by: Mark Korver, Senior Solutions Architect
Keynote of HadoopCon 2014 Taiwan:
* Data analytics platform architecture & designs
* Lambda architecture overview
* Using SQL as DSL for stream processing
* Lambda architecture using SQL
The Data Pipeline team at Demonware (Activision) has to route large amounts of data from various sources to many destinations every day.
Our team always wanted to be able to query processed data for debugging and analytical purposes, but creating large data warehouses was never our priority, since that usually happens downstream.
AWS Athena is a completely serverless query service that doesn't require any infrastructure setup or complex provisioning. We just needed to save some of our data streams to AWS S3 and define a schema. Just a few simple steps, but in the end we were able to write complex SQL queries against gigabytes of data and get results in seconds.
In this presentation I will show multiple ways to stream your data to AWS S3, explain some of the underlying tech, show how to define a schema, and finally share some of the best practices we applied.
Building large-scale analytics platform with Storm, Kafka and Cassandra - NYC... (Alexey Kharlamov)
At Integral, we process heavy volumes of click-stream traffic: 50K QPS of ad impressions at peak and close to 200K QPS of all browser calls. We build analytics on these streams of data. There are two applications that require quite significant computational effort: 'sessionization' and fraud detection.
Sessionization means linking a series of requests from the same browser into a single record. There can be five or more requests spread over 15-30 minutes that we need to link to each other.
Fraud detection is a process that looks at various signals in browser requests, as well as substantial historical evidence data, to classify an ad impression as either legitimate or fraudulent.
We've been doing both (as well as all other analytics) in batch mode, once an hour at best. Both processes, and fraud detection in particular, are time-sensitive and much more meaningful if done in near-real-time.
This talk is about our experience migrating once-per-day offline batch processing of impression data using Hadoop to in-memory stream processing using Kafka, Storm and Cassandra. We will touch upon our choices and our reasoning for selecting the products used in this solution.
Hadoop is no longer the only, or always the preferred, option in the Big Data space. In-memory stream processing may be more effective for time-series data preparation and aggregation. The ability to scale at a significantly lower cost means more customers, better accuracy and better business practices: since only in-stream processing allows for low-latency data and insight delivery, it opens entirely new opportunities. However, transitioning non-trivial data pipelines raises a number of questions previously hidden within the offline nature of batch processing. How will you join several data feeds? How will you implement failure recovery? In addition to handling terabytes of data per day, our streaming system has to be guided by the following considerations:
• Recovery time
• Time relativity and continuity
• Geographical distribution of data sources
• Limit on data loss
• Maintainability
The system produces complex cross-correlational analysis of several data feeds and aggregation for client analytics with input feed frequency of up to 100K msg/sec.
This presentation will benefit anyone interested in learning an alternative approach to big data analytics, especially the process of joining multiple streams in memory using Cassandra. The presentation will also highlight certain optimization patterns we used that can be useful in similar situations.
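As a hedged sketch of what this kind of pipeline looks like when wired together, here is a minimal Storm (0.9-era, pre-Apache package names) topology that reads impressions from Kafka, keys them by browser, and buffers them for sessionization. The topic names, session-key extraction, and both bolts are illustrative assumptions, not Integral's code.

import backtype.storm.Config;
import backtype.storm.LocalCluster;
import backtype.storm.spout.SchemeAsMultiScheme;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;
import storm.kafka.KafkaSpout;
import storm.kafka.SpoutConfig;
import storm.kafka.StringScheme;
import storm.kafka.ZkHosts;

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ImpressionTopology {

    // Extracts a session key so requests from the same browser are grouped.
    // Assumes the browser id is the first comma-separated field of the message.
    public static class ParseBolt extends BaseBasicBolt {
        @Override
        public void execute(Tuple tuple, BasicOutputCollector collector) {
            String raw = tuple.getStringByField("str"); // StringScheme emits field "str"
            collector.emit(new Values(raw.split(",", 2)[0], raw));
        }
        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("sessionKey", "raw"));
        }
    }

    // Buffers requests per session; a real implementation would flush sessions
    // to Cassandra after a 15-30 minute inactivity window (e.g. on tick tuples).
    public static class SessionizeBolt extends BaseBasicBolt {
        private transient Map<String, List<String>> openSessions;
        @Override
        public void prepare(Map stormConf, TopologyContext context) {
            openSessions = new HashMap<String, List<String>>();
        }
        @Override
        public void execute(Tuple tuple, BasicOutputCollector collector) {
            String key = tuple.getStringByField("sessionKey");
            if (!openSessions.containsKey(key)) {
                openSessions.put(key, new ArrayList<String>());
            }
            openSessions.get(key).add(tuple.getStringByField("raw"));
        }
        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) { }
    }

    public static void main(String[] args) {
        SpoutConfig spoutConfig = new SpoutConfig(
            new ZkHosts("zk1:2181"), "impressions", "/kafka-offsets", "impression-spout");
        spoutConfig.scheme = new SchemeAsMultiScheme(new StringScheme());

        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("kafka", new KafkaSpout(spoutConfig), 4);
        builder.setBolt("parse", new ParseBolt(), 4).shuffleGrouping("kafka");
        // fieldsGrouping routes all tuples with the same sessionKey to one task.
        builder.setBolt("sessionize", new SessionizeBolt(), 8)
               .fieldsGrouping("parse", new Fields("sessionKey"));

        new LocalCluster().submitTopology("impression-analytics", new Config(),
            builder.createTopology());
    }
}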
(BDT303) Running Spark and Presto on the Netflix Big Data Platform (Amazon Web Services)
In this session, we discuss how Spark and Presto complement the Netflix big data platform stack that started with Hadoop, and the use cases that Spark and Presto address. Also, we discuss how we run Spark and Presto on top of the Amazon EMR infrastructure; specifically, how we use Amazon S3 as our data warehouse and how we leverage Amazon EMR as a generic framework for data-processing cluster management.
Running Fast, Interactive Queries on Petabyte Datasets using Presto - AWS Jul... (Amazon Web Services)
Learn how to deploy a managed Presto environment to interactively query log data on AWS
Organizations often need to quickly analyze large amounts of data, such as logs, generated from a wide variety of sources and formats. However, traditional approaches require a lot of time and effort designing complex data transformation and loading processes and configuring data warehouses. Using AWS, you can start querying your datasets within minutes.
In this webinar you will learn how you can deploy a managed Presto environment in minutes to interactively query log data using plain ANSI SQL. Presto is a popular open source SQL engine for running interactive analytic queries against data sources of all sizes. We will talk about common use cases and best practices for running Presto on Amazon EMR.
Learning Objectives:
• Learn how to deploy a managed Presto environment running on Amazon EMR
• Understand best practices for running Presto on Amazon EMR, including use of Amazon EC2 Spot instances
• Learn how other customers are using Presto to analyze large data sets
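To give a flavor of the "plain ANSI SQL" interaction, here is a minimal sketch of querying Presto from Java over JDBC. The host, port (EMR typically exposes the Presto coordinator on 8889), catalog, schema and table names are placeholder assumptions.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class PrestoLogQuery {
    public static void main(String[] args) throws Exception {
        // Placeholder coordinator endpoint, catalog and schema.
        String url = "jdbc:presto://emr-master:8889/hive/default";
        try (Connection conn = DriverManager.getConnection(url, "analyst", null);
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                 "SELECT status, count(*) AS hits FROM access_logs GROUP BY status")) {
            while (rs.next()) {
                System.out.println(rs.getString("status") + "\t" + rs.getLong("hits"));
            }
        }
    }
}

The same query runs unchanged whether the table holds megabytes or petabytes; only the cluster size behind the coordinator changes.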
BDT303 Data Science with Elastic MapReduce - AWS re: Invent 2012 (Amazon Web Services)
In this talk, we dive into the Netflix Data Science & Engineering architecture. Not just the what, but also the why. Some key topics include the big data technologies we leverage (Cassandra, Hadoop, Pig + Python, and Hive), our use of Amazon S3 as our central data hub, our use of multiple persistent Amazon Elastic MapReduce (EMR) clusters, how we leverage the elasticity of AWS, our data science as a service approach, how we make our hybrid AWS / data center setup work well, and more.
(BDT403) Netflix's Next Generation Big Data Platform | AWS re:Invent 2014 (Amazon Web Services)
As Netflix expands their services to more countries, devices, and content, they continue to evolve their big data analytics platform to accommodate the increasing needs of product and consumer insights. This year, Netflix re-innovated their big data platform: they upgraded to Hadoop 2, transitioned to the Parquet file format, experimented with Pig on Tez for the ETL workload, and adopted Presto as their interactive querying engine. In this session, Netflix discusses their latest architecture, how they built it on the Amazon EMR infrastructure, the contributions put into the open source community, as well as some performance numbers for running a big data warehouse with Amazon S3.
(SDD424) Simplifying Scalable Distributed Applications Using DynamoDB Streams... (Amazon Web Services)
Dynamo Streams provides a stream of all the updates made to your DynamoDB table. It is a simple but extremely powerful primitive that enables developers to easily build solutions like cross-region replication, or to host additional materialized views, for instance an Elasticsearch index, on top of DynamoDB tables. In this session we will dive deep into the details of Dynamo Streams and how customers can leverage them to build custom solutions and extend the functionality of DynamoDB. We will give a demo of an example application built on top of Dynamo Streams to demonstrate their power and simplicity.
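As a hedged sketch of what consuming the stream looks like, here is a shard-polling loop using the AWS SDK for Java (v1-era API). The stream ARN is a placeholder, and a production consumer would more likely use the DynamoDB Streams Kinesis adapter with the Kinesis Client Library than poll shards by hand.

import com.amazonaws.services.dynamodbv2.AmazonDynamoDBStreams;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBStreamsClientBuilder;
import com.amazonaws.services.dynamodbv2.model.*;

public class StreamTail {
    public static void main(String[] args) {
        AmazonDynamoDBStreams streams = AmazonDynamoDBStreamsClientBuilder.defaultClient();
        String streamArn = "arn:aws:dynamodb:...";  // placeholder stream ARN

        DescribeStreamResult desc = streams.describeStream(
            new DescribeStreamRequest().withStreamArn(streamArn));
        for (Shard shard : desc.getStreamDescription().getShards()) {
            String iterator = streams.getShardIterator(new GetShardIteratorRequest()
                    .withStreamArn(streamArn)
                    .withShardId(shard.getShardId())
                    .withShardIteratorType(ShardIteratorType.TRIM_HORIZON))
                .getShardIterator();
            while (iterator != null) {
                GetRecordsResult result = streams.getRecords(
                    new GetRecordsRequest().withShardIterator(iterator));
                for (Record record : result.getRecords()) {
                    // Each record carries the old and/or new image of the item.
                    System.out.println(record.getEventName() + ": " + record.getDynamodb());
                }
                iterator = result.getNextShardIterator();
                if (result.getRecords().isEmpty()) break; // stop polling an idle shard in this sketch
            }
        }
    }
}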
Data processing platforms architectures with Spark, Mesos, Akka, Cassandra an... (Anton Kirillov)
This talk is about architecture designs for data processing platforms based on SMACK stack which stands for Spark, Mesos, Akka, Cassandra and Kafka. The main topics of the talk are:
- SMACK stack overview
- storage layer layout
- fixing NoSQL limitations (joins and group by)
- cluster resource management and dynamic allocation
- reliable scheduling and execution at scale
- different options for getting the data into your system
- preparing for failures with proper backup and patching strategies
SMACK Stack 1.0 has been Spark, Mesos, Akka, Cassandra and Kafka working together as cohesive systems delivering different solutions for different use cases. Haven't heard about it before? Oh man! Where have you been? https://www.google.com/search?q=smack+stack+1.0
With SMACK Stack 1.1 we go a step further: Streaming, Mesos, Analytics, Cassandra and Kafka. Joe Stein will walk through in detail some of the different viable options for Streaming and Analytics with Mesos, Kafka and Cassandra.
Data ingest is a deceptively hard problem. In the world of big data processing, it becomes exponentially more difficult. It's not sufficient to simply land data on a system; that data must be ready for processing and analysis. The Kite SDK is a data API designed to solve the issues related to data ingest and preparation. In this talk you'll see how Kite can be used for everything from simple tasks to production-ready data pipelines in minutes.
You've seen the basic two-stage example Spark programs, and now you're ready to move on to something larger. I'll go over lessons I've learned for writing efficient Spark programs, from design patterns to debugging tips.
The slides are largely just talking points for a live presentation, but hopefully you can still make sense of them for offline viewing as well.
"Analyzing Twitter Data with Hadoop - Live Demo", presented at Oracle Open World 2014. The repository for the slides is in https://github.com/cloudera/cdh-twitter-example
drupal 7 amfserver presentation: integrating flash and drupal (rolf vreijdenberger)
In this presentation there will be a full explanation of how to integrate flash and drupal 7 with the amfserver module. Including examples and best practices. Presentation by the author of the amfserver module held at the 2011 DrupalCamp Sweden in Stockholm
http://bit.ly/1BTaXZP – As organizations look for even faster ways to derive value from big data, they are turning to Apache Spark, an in-memory processing framework that offers lightning-fast big data analytics, providing speed, developer productivity, and real-time processing advantages. The Spark software stack includes a core data-processing engine, an interface for interactive querying, Spark Streaming for streaming data analysis, and growing libraries for machine learning and graph analysis. Spark is quickly establishing itself as a leading environment for fast, iterative, in-memory and streaming analysis. This talk will give an introduction to the Spark stack, explain how Spark delivers lightning-fast results, and show how it complements Apache Hadoop. By the end of the session, you'll come away with a deeper understanding of how you can unlock deeper insights from your data, faster, with Spark.
Real time Analytics with Apache Kafka and Apache Spark (Rahul Jain)
A presentation-cum-workshop on real-time analytics with Apache Kafka and Apache Spark. Apache Kafka is a distributed publish-subscribe messaging system, while Spark Streaming brings Spark's language-integrated API to stream processing, allowing streaming applications to be written very quickly and easily. It supports both Java and Scala. In this workshop we are going to explore Apache Kafka, ZooKeeper and Spark with a web click-streaming example using Spark Streaming. A clickstream is the recording of the parts of the screen a computer user clicks on while web browsing.
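A minimal sketch of the Kafka-to-Spark-Streaming wiring the workshop covers, counting clicks per URL in each batch. This uses the Spark 1.x receiver-based KafkaUtils.createStream API; the topic, ZooKeeper address and consumer group are placeholder assumptions.

import java.util.Collections;
import java.util.Map;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaPairReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka.KafkaUtils;
import scala.Tuple2;

public class ClickStreamCount {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setAppName("clickstream").setMaster("local[2]");
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(10));

        // One receiver thread on the "clicks" topic; names are placeholders.
        Map<String, Integer> topics = Collections.singletonMap("clicks", 1);
        JavaPairReceiverInputDStream<String, String> messages =
            KafkaUtils.createStream(jssc, "zk1:2181", "click-counters", topics);

        // The message value is assumed to be the clicked URL; count per batch.
        JavaPairDStream<String, Long> counts = messages
            .map(Tuple2::_2)
            .countByValue();
        counts.print();

        jssc.start();
        jssc.awaitTermination();
    }
}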
Johnny Miller – Cassandra + Spark = Awesome - NoSQL matters Barcelona 2014 (NoSQLmatters)
Johnny Miller – Cassandra + Spark = Awesome
This talk will discuss how Cassandra and Spark can work together to deliver real-time analytics. This is a technical discussion that will introduce attendees to the basic principles of Cassandra and Spark, why they work well together, and example use cases.
Link to the full talk - https://youtu.be/2Rf5t2Eh6IQ
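One common shape of the Cassandra + Spark pairing is a Spark job scanning a Cassandra table through the DataStax Spark Cassandra Connector. A minimal hedged sketch with the connector's Java API follows; the keyspace, table and column names are made up for illustration.

import static com.datastax.spark.connector.japi.CassandraJavaUtil.javaFunctions;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class CassandraSparkExample {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
            .setAppName("cassandra-analytics")
            .setMaster("local[2]")
            .set("spark.cassandra.connection.host", "127.0.0.1");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Full-table scan of a (hypothetical) events table, aggregated in Spark.
        long errorCount = javaFunctions(sc)
            .cassandraTable("analytics", "events")
            .filter(row -> "error".equals(row.getString("type")))
            .count();
        System.out.println("errors: " + errorCount);
        sc.stop();
    }
}

The connector pushes the table scan down to Cassandra token ranges so each Spark partition reads from a local replica where possible, which is the locality property that makes the pairing attractive.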
https://go.dok.community/slack
https://dok.community
ABSTRACT OF THE TALK
This talk will provide a high-level overview of Kubernetes, Helm charts and how they can be used to deploy Apache Druid clusters of any size.
We'll review how Kubernetes functionality enables resilience and self-healing, historical tiers through node-group affinity, and middle manager scaling through Kubernetes autoscaling to optimize ingestion capacity, along with some of the gotchas along the way.
BIO
Sergio Ferragut is a database veteran turned Developer Advocate at Imply. His experience includes 16 years at Teradata in professional services and engineering roles.
He has direct experience in building analytics applications spanning the retail, supply chain, pricing optimization and IoT spaces.
Sergio has worked at multiple technology start-ups including APL and Splice Machine where he helped guide product design and field messaging.
How to develop Big Data Pipelines for Hadoop, by Costin LeauCodemotion
Hadoop is not an island. To deliver a complete Big Data solution, a data pipeline needs to be developed that incorporates and orchestrates many diverse technologies. In this session we will demonstrate how the open source Spring Batch, Spring Integration and Spring Hadoop projects can be used to build manageable and robust pipeline solutions to coordinate the running of multiple Hadoop jobs (MapReduce, Hive, or Pig), but also encompass real-time data acquisition and analysis.
From R Script to Production Using rsparkling with Navdeep Gill (Databricks)
The rsparkling R package is an extension package for sparklyr (an R interface for Apache Spark) that creates an R front-end for the Sparkling Water Spark package from H2O. This provides an interface to H2O’s high performance, distributed machine learning algorithms on Spark, using R. The main purpose of this package is to provide a connector between sparklyr and H2O’s machine learning algorithms.
In this session, Gill will introduce the basic architectures of rsparkling, H2O Sparkling Water and sparklyr, and go over how these frameworks work together to build a cohesive machine learning framework. In addition, you’ll learn about various implementations for using rsparkling in production. The session will conclude with a live demo of rsparkling that will display an end-to-end use case of data ingestion, munging and machine learning.
Similar to Building Hadoop Data Applications with Kite (20)
Data Wrangling on Hadoop - Olivier De Garrigues, Trifacta (huguk)
As Hadoop became mainstream, the need to simplify and speed up analytics processes grew rapidly. Data wrangling emerged as a necessary step in any analytical pipeline, and is often considered to be its crux, taking as much as 80% of an analyst's time. In this presentation we will discuss how data wrangling solutions can be leveraged to streamline, strengthen and improve data analytics initiatives on Hadoop, including use cases from Trifacta customers.
Bio: Olivier is EMEA Solutions Lead at Trifacta. He has 7 years experience in analytics with prior roles as technical lead for business analytics at Splunk and quantitative analyst at Accenture and Aon.
Stephen Taylor is the community manager for Ether Camp. They provide an analysis tool for the Ethereum blockchain, ‘Block Explorer’, and an ‘Integrated Development Environment’ (IDE) that empowers developers to build, test and deploy applications in a sandbox environment. This November they are launching their second annual hackathon, hack.ether.camp, which aims to deliver a more sustained approach to the hackathon ideology by utilising blockchain technology.
Google Cloud Dataproc - Easier, faster, more cost-effective Spark and Hadoop (huguk)
At Google Cloud Platform, we're combining the Apache Spark and Hadoop ecosystem with our software and hardware innovations. We want to make these awesome tools easier, faster, and more cost-effective, from 3 to 30,000 cores. This presentation will showcase how Google Cloud Platform is innovating with the goal of bringing the Hadoop ecosystem to everyone.
Bio: "I love data because it surrounds us - everything is data. I also love open source software, because it shows what is possible when people come together to solve common problems with technology. While they are awesome on their own, I am passionate about combining the power of open source software with the potential unlimited uses of data. That's why I joined Google. I am a product manager for Google Cloud Platform and manage Cloud Dataproc and Apache Beam (incubating). I've previously spent time hanging out at Disney and Amazon. Beyond Google, love data, amateur radio, Disneyland, photography, running and Legos."
Using Big Data techniques to query and store OpenStreetMap data. Stephen Knox... (huguk)
This talk will describe his research into using Hadoop to query and manage big geographic datasets, specifically OpenStreetMap (OSM). OSM is an “open-source” map of the world, growing at a fast rate and currently around 5TB of data. The talk will introduce OSM, detail some aspects of the research, and also discuss his experiences with using the SpatialHadoop stack on Azure and Google Cloud.
Extracting maximum value from data while protecting consumer privacy. Jason ... (huguk)
Big organisations have a wealth of rich customer data which opens up huge new opportunities. However, they have the challenge of how to extract value from this data while protecting the privacy of their individual customers. He will talk about the risks organisations face, and what they should do about it. He will survey the techniques which can be used to make data safe for analysis, and talk briefly about how they are solving this problem at Privitar.
Intelligence Augmented vs Artificial Intelligence. Alex Flamant, IBM Watson (huguk)
IBM is developing the Watson Ecosystem to leverage its Developer Cloud, APIs, Content Store and Talent Hub. This is part of IBM's recent announcement of the $1B investment in Watson as a new business unit including Silicon Alley NYC headquarters. For the first time, IBM will open up Watson as a development platform in the Cloud to spur innovation and fuel a new ecosystem of entrepreneurial software app providers who will bring forward a new generation of applications infused with Watson's cognitive computing intelligence.
In this talk about Apache Flink we will touch on three main things: an introductory look at Flink, a look under the hood, and a demo.
* In the introduction we will briefly look at the history of Flink and then go on to the API and different use cases. Here we will also see how it can be deployed in practice and what some of the pitfalls in a cluster setting can be.
* In the second section we will look at the streaming execution engine that lies at the heart of Flink. Here we will see what makes it tick and also what distinguishes it from other approaches, such as the mini-batch execution model.
* In the final section we will see a live demo of a fault-tolerant streaming job that performs analysis of the Wikipedia edit stream.
Ufuk Celebi - PMC member at Apache Flink and co-founder and software engineer at data Artisans
Lambda architecture on Spark, Kafka for real-time large scale ML (huguk)
Sean Owen – Director of Data Science @Cloudera
Building machine learning models is all well and good, but how do they get productionized into a service? It's a long way from a Python script on a laptop to a fault-tolerant system that learns continuously, serves thousands of queries per second, and scales to terabytes. The confederation of open source technologies we know as Hadoop now offers data scientists the raw materials from which to assemble an answer: the means to build models, but also to ingest data and serve queries, at scale.
This short talk will introduce Oryx 2, a blueprint for building this type of service on Hadoop technologies. It will survey the problem and the standard technologies and ideas that Oryx 2 combines: Apache Spark, Kafka, HDFS, the lambda architecture, PMML, REST APIs. The talk will touch on a key use case for this architecture: recommendation engines.
Today’s reality Hadoop with Spark - How to select the best Data Science approa... (huguk)
Martin Oberhuber and Eliano Marques, Senior Data Scientists @Think Big International
In this talk Think Big International Lead Data Scientists will discuss the options that exist today for engineering and data science teams aiming to use big data patterns to solve new business problems. With the enterprise adoption of the Hadoop ecosystem and the emerging momentum of open source projects like Spark it is becoming mandatory to have an approach that solves for business results but remains flexible to adapt and change with the open source market.
Signal Media: Real-Time Media & News Monitoring (huguk)
Startup pitch presented by CTO Wesley Hall. Signal Media is a real-time media and news monitoring platform that tracks media outlets. News items are analysed for brand & media monitoring as well as market intelligence.
Startup pitch presented by Aeneas Wiener. Cytora is a real-time geopolitical risk analysis platform that extracts events from open-source intelligence and evaluates these events on their geopolitical impact.
Startup pitch presented by co-founder and CEO Jaco Els. Cubitic offers a predictive analytics platform that allows developers to build custom solutions for analytics and visualisation on top of a machine learning engine.
Startup pitch presented by co-founder and CEO Corentin Guillo. Bird.i is building a platform for up-to-date earth observation data that will bring satellite imagery to the mass market. Providing fresh imagery together with analytics around the forecast of localised demand opens up innovative opportunities in sectors like construction, tourism, real-estate and remote facility monitoring.
Startup pitch presented by co-founders Laure Andrieux and Nic Greenway. Aiseedo applies real-time machine learning, where the model of the world is constantly updated, to build adaptive systems which can be applied to robotics, the Internet of Things and healthcare.
Secrets of Spark's success - Deenar Toraskar, Think Reactive (huguk)
This talk will cover the design and implementation decisions that have been key to the success of Apache Spark over competing cluster computing frameworks. It will delve into the whitepaper behind Spark and cover the design of Spark RDDs, the abstraction that enables the Spark execution engine to be extended to support a wide variety of use cases: Spark SQL, Spark Streaming, MLlib and GraphX. RDDs allow Spark to outperform existing models by up to 100x in multi-pass analytics.
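The multi-pass claim is easiest to see in code: an RDD can be materialized in memory once and then reused across passes, so only the first action pays the cost of scanning the input. A minimal sketch (the file path and filter predicates are placeholders):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.storage.StorageLevel;

public class MultiPass {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(
            new SparkConf().setAppName("multi-pass").setMaster("local[2]"));

        // Cache the filtered RDD so later passes hit memory, not disk:
        // reuse across passes is where the up-to-100x speedups come from.
        JavaRDD<String> errors = sc.textFile("hdfs:///logs/app.log")  // placeholder path
            .filter(line -> line.contains("ERROR"))
            .persist(StorageLevel.MEMORY_ONLY());

        long total = errors.count();                                       // pass 1: scans the file
        long timeouts = errors.filter(l -> l.contains("timeout")).count(); // pass 2: served from cache
        System.out.println(total + " errors, " + timeouts + " timeouts");
        sc.stop();
    }
}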
TV Marketing and big data: cat and dog or thick as thieves? Krzysztof Osiewal... (huguk)
Technical developments in the area of data warehousing have allowed companies to push their analysis a step further and, therefore, allowed data scientists to deliver more value to business areas. In this session, we will focus on the case of performance marketing at King and demonstrate how we use Hadoop capabilities to exploit user-level data efficiently. That approach results in a more holistic view in return-on-investment analysis of TV advertisement.
Hadoop - Looking to the Future By Arun Murthy (huguk)
Hadoop - Looking to the Future
By Arun Murthy (Founder of Hortonworks, Creator of YARN)
The Apache Hadoop ecosystem began as just HDFS & MapReduce nearly 10 years ago in 2006.
Very much like the Ship of Theseus (http://en.wikipedia.org/wiki/Ship_of_Theseus), Hadoop has undergone an incredible amount of transformation, from multi-purpose YARN to interactive SQL with Hive/Tez to machine learning with Spark.
Much more lies ahead: whether you want sub-second SQL with Hive, effective use of SSDs/memory in HDFS, or metadata-driven security policies in Ranger, the Hadoop ecosystem in the Apache Software Foundation continues to evolve to meet new challenges and use cases.
Arun C Murthy has been involved with Apache Hadoop since the beginning of the project - nearly 10 years now. In the beginning he led MapReduce, went on to create YARN and then drove Tez & the Stinger effort to get to interactive & sub-second Hive. Recently he has been very involved in the Metadata and Governance efforts. In between he founded Hortonworks, the first public Hadoop distribution company.
1. Building Hadoop Data Applications with Kite
Tom White (@tom_e_white)
Hadoop Users Group UK, London, 17 June 2014
2. About me
• Engineer at Cloudera working on Core Hadoop and Kite
• Apache Hadoop Committer, PMC Member, Apache Member
• Author of "Hadoop: The Definitive Guide"
7. Glossary
• Apache Avro – cross-language data serialization library
• Apache Parquet (incubating) – column-oriented storage format for nested data
• Apache Hive – data warehouse (SQL and metastore)
• Apache Flume – streaming log capture and delivery system
• Apache Oozie – workflow scheduler system
• Apache Crunch – Java API for writing data pipelines
• Impala – interactive SQL on Hadoop
8. Outline
• A Typical Application
• Kite SDK
• An Example
• Advanced Kite
14. Kite
• A client-side library for writing Hadoop Data Applications
• First release was in April 2013 as CDK
• 0.14.1 released last month
• Open source, Apache 2 license, kitesdk.org
• Modular
  • Data module (HDFS, Flume, Crunch, Hive, HBase)
  • Morphlines transformation module
  • Maven plugin
16. Kite Data Module
• Dataset – a collection of entities
• DatasetRepository – physical storage location for datasets
• DatasetDescriptor – holds dataset metadata (schema, format)
• DatasetWriter – write entities to a dataset in a stream
• DatasetReader – read entities from a dataset
• http://kitesdk.org/docs/current/apidocs/index.html
17. 1. Define the Event Entity

public class Event {
  private long id;
  private long timestamp;
  private String source;
  // getters and setters
}
18. 2. Create the Events Dataset

DatasetRepository repo = DatasetRepositories.open("repo:hive");
DatasetDescriptor descriptor = new DatasetDescriptor.Builder()
    .schema(Event.class)
    .build();
repo.create("events", descriptor);
19. (2. or with the Maven plugin)

$ mvn kite:create-dataset \
    -Dkite.repositoryUri='repo:hive' \
    -Dkite.datasetName=events \
    -Dkite.avroSchemaReflectClass=com.example.Event
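The deck jumps from creating the dataset straight to partitioning, so for completeness here is a minimal sketch (not from the slides) of the DatasetWriter and DatasetReader abstractions listed on slide 16, assuming the 0.14-era API (imports from org.kitesdk.data) in which readers and writers were opened explicitly:

// Sketch only: write one Event to the "events" dataset, then read it back.
DatasetRepository repo = DatasetRepositories.open("repo:hive");
Dataset<Event> events = repo.load("events");

DatasetWriter<Event> writer = events.newWriter();
try {
  writer.open();                        // explicit open() in early Kite releases
  Event e = new Event();
  e.setId(1L);                          // setters exist per slide 17
  e.setTimestamp(System.currentTimeMillis());
  e.setSource("web");
  writer.write(e);
} finally {
  writer.close();
}

DatasetReader<Event> reader = events.newReader();
try {
  reader.open();
  while (reader.hasNext()) {
    System.out.println(reader.next().getSource());
  }
} finally {
  reader.close();
}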
29. Unified Storage Interface
• Dataset – streaming access, HDFS storage
• RandomAccessDataset – random access, HBase storage
• PartitionStrategy defines how to map an entity to partitions in HDFS or row keys in HBase
30. Filesystem Partitions

PartitionStrategy p = new PartitionStrategy.Builder()
    .year("timestamp")
    .month("timestamp")
    .day("timestamp")
    .build();

/user/hive/warehouse/events
  /year=2014/month=02/day=08
    /FlumeData.1375659013795
    /FlumeData.1375659013796
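The slide shows the strategy and the resulting directory layout but not the wiring between them. As a sketch (assumed, not shown in the deck), the strategy is attached through the DatasetDescriptor when the dataset is created:

// Sketch: attach the partition strategy via the descriptor so that writes
// land in the year=/month=/day= directories shown above.
DatasetDescriptor partitioned = new DatasetDescriptor.Builder()
    .schema(Event.class)
    .partitionStrategy(p)
    .build();
repo.create("events", partitioned);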
35. Parallel Processing
• Goal is for Hadoop processing frameworks to "just work"
• Support Formats, Partitions, Views
• Native Kite components, e.g. DatasetOutputFormat for MR

             HDFS Dataset    HBase Dataset
Crunch       Yes             Yes
MapReduce    Yes             Yes
Hive         Yes             Planned
Impala       Yes             Planned
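A heavily hedged sketch of the MapReduce path: the slide names DatasetOutputFormat, while later Kite releases ship this as DatasetKeyInputFormat/DatasetKeyOutputFormat with configure() builders, so treat the exact class names and calls below as version-dependent assumptions; SummaryMapper and SummaryReducer are hypothetical.

// Sketch only: wire Kite datasets into an MR job using the configure() style
// of later Kite releases ("events" in, "summaries" out).
Job job = Job.getInstance(new Configuration(), "event-summaries");
job.setMapperClass(SummaryMapper.class);
job.setReducerClass(SummaryReducer.class);
DatasetKeyInputFormat.configure(job).readFrom(events);
DatasetKeyOutputFormat.configure(job).writeTo(summaries);
job.waitForCompletion(true);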
36. Schema Evolution

public class Event {
  private long id;
  private long timestamp;
  private String source;
  @Nullable private String ipAddress;
}

$ mvn kite:update-dataset \
    -Dkite.datasetName=events \
    -Dkite.avroSchemaReflectClass=com.example.Event
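Presumably the same evolution can be driven through the API; a sketch, assuming DatasetRepository.update() accepts a descriptor rebuilt from the changed class:

// Sketch: rebuild the descriptor from the updated Event class (which now
// carries the @Nullable ipAddress field) and update the dataset in place.
DatasetDescriptor updated = new DatasetDescriptor.Builder()
    .schema(Event.class)
    .build();
repo.update("events", updated);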
37. Searchable Datasets
• Use Flume Solr Sink (in addition to HDFS Sink)
• Morphlines library to define fields to index
• SolrCloud runs on cluster from indexes in HDFS
• Future support in Kite to index selected fields automatically
39. Kite makes it easy to get data into Hadoop, with a flexible schema model that is storage agnostic, in a format that can be processed with a wide range of Hadoop tools
40. Getting Started With Kite
• Examples at github.com/kite-sdk/kite-examples
  • Working with streaming and random-access datasets
  • Logging events to datasets from a webapp
  • Running a periodic job
  • Migrating data from CSV to a Kite dataset
  • Converting an Avro dataset to a Parquet dataset
  • Writing and configuring Morphlines
  • Using Morphlines to write JSON records to a dataset
43. Applications
• [Batch] Analyze an archive of songs [1]
• [Interactive SQL] Ad hoc queries on recommendations from social media applications [2]
• [Search] Searching email traffic in near-real-time [3]
• [ML] Detecting fraudulent transactions using clustering [4]

[1] http://blog.cloudera.com/blog/2012/08/process-a-million-songs-with-apache-pig/
[2] http://blog.cloudera.com/blog/2014/01/how-wajam-answers-business-questions-faster-with-hadoop/
[3] http://blog.cloudera.com/blog/2013/09/email-indexing-using-cloudera-search/
[4] http://blog.cloudera.com/blog/2013/03/cloudera_ml_data_science_tools/
44. … or use JDBC

Class.forName("org.apache.hive.jdbc.HiveDriver");
Connection connection = DriverManager.getConnection(
    "jdbc:hive2://localhost:21050/;auth=noSasl");
Statement statement = connection.createStatement();
ResultSet resultSet = statement.executeQuery(
    "SELECT * FROM summaries");
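The slide stops at executeQuery; a standard JDBC follow-through (nothing Kite-specific, added here for completeness) would iterate the rows and release resources:

// Iterate the result set, then close resources in reverse order of creation.
while (resultSet.next()) {
  System.out.println(resultSet.getString(1)); // first column of each row
}
resultSet.close();
statement.close();
connection.close();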
45. Apps
• App – a packaged Java program that runs on a Hadoop cluster
• cdk:package-app – create a package on the local filesystem
  • like an exploded WAR
  • Oozie format
• cdk:deploy-app – copy packaged app to HDFS
• cdk:run-app – execute the app
• Workflow app – runs once
• Coordinator app – runs other apps (like cron)
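Taken together, the lifecycle above maps onto three Maven invocations (a sketch using only the goal names from the slide; repository and app flags omitted):

$ mvn cdk:package-app   # build the Oozie-format package on the local filesystem
$ mvn cdk:deploy-app    # copy the packaged app to HDFS
$ mvn cdk:run-app       # execute the app on the cluster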
46. Morphlines Example

morphlines : [
  {
    id : morphline1
    importCommands : ["com.cloudera.**", "org.apache.solr.**"]
    commands : [
      { readLine {} }
      {
        grok {
          dictionaryFiles : [/tmp/grok-dictionaries]
          expressions : {
            message : """<%{POSINT:syslog_pri}>%{SYSLOGTIMESTAMP:syslog_timestamp} %{SYSLOGHOST:syslog_hostname} %{DATA:syslog_program}(?:\[%{POSINT:syslog_pid}\])?: %{GREEDYDATA:syslog_message}"""
          }
        }
      }
      { loadSolr {} }
    ]
  }
]

Example Input
<164>Feb 4 10:46:14 syslog sshd[607]: listening on 0.0.0.0 port 22

Output Record
syslog_pri:164
syslog_timestamp:Feb 4 10:46:14
syslog_hostname:syslog
syslog_program:sshd
syslog_pid:607
syslog_message:listening on 0.0.0.0 port 22