Cascading is a Data Processing API, Process Planner, and Process Scheduler used for defining and executing complex, scale-free, and fault-tolerant data processing workflows on an Apache Hadoop cluster.
The document discusses Druid integration with the Hadoop ecosystem. It covers three main topics:
1) Security enhancements like integration with Kerberos and future integration with Apache Ranger/Knox.
2) Deployment and management using Apache Ambari which provides a unified UI and supports rolling deployments.
3) SQL interaction through Apache Hive integration which allows SQL queries on Druid data and benefits both Druid and Hive.
Neo4j 4.1 introduces new features for security including role-based access control, schema-based security, and granular security for write operations. It also includes improvements to causal clustering, performance, and developer tools. This document reviews the history of releases from Neo4j 3.0 through 4.1 and highlights some of the main new capabilities in security, performance, and operations.
Real-Time Integration Between MongoDB and SQL Databases | MongoDB
This document describes WebMD's use of MongoDB and Storm to integrate real-time data from MongoDB into SQL databases. A Storm topology is used to continuously read data from the MongoDB oplog using a spout. Bolts parse and flatten documents, extract embedded arrays, and write the data to SQL databases in real-time using a SQLWriter bolt. This provides a scalable, fault-tolerant way to preserve investments in BI tools while also enabling real-time queries across both MongoDB and SQL data.
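The dataflow described above can be sketched outside Storm as well. Below is a minimal, single-process Python sketch of the same spout-to-bolt pipeline: tail the oplog with a tailable cursor, flatten each document, and write rows to SQL. This is an illustration, not WebMD's code; the database, collection, and column names are hypothetical, and sqlite3 stands in for the real SQL target.

```python
# Minimal single-process sketch of the oplog -> flatten -> SQL dataflow
# (the talk uses a Storm spout/bolt topology; names here are hypothetical).
import sqlite3
import pymongo

client = pymongo.MongoClient("mongodb://localhost:27017")  # assumed replica set member
oplog = client.local["oplog.rs"]  # the replication oplog a spout would tail

sql = sqlite3.connect("mirror.db")  # stand-in for the target SQL database
sql.execute("CREATE TABLE IF NOT EXISTS users (_id TEXT, name TEXT, city TEXT)")

def flatten(doc):
    """Bolt-like step: pull nested fields up into flat SQL columns."""
    return (str(doc.get("_id")), doc.get("name"), doc.get("address", {}).get("city"))

# A tailable, awaiting cursor keeps returning new oplog entries as they arrive.
cursor = oplog.find({"ns": "app.users", "op": "i"},  # inserts into app.users only
                    cursor_type=pymongo.CursorType.TAILABLE_AWAIT)
for entry in cursor:
    row = flatten(entry["o"])  # "o" holds the inserted document
    sql.execute("INSERT INTO users VALUES (?, ?, ?)", row)
    sql.commit()
```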
Aggregated queries with Druid on terabytes and petabytes of data | Rostislav Pashuto
The document discusses Druid, an open-source distributed column-oriented data store designed for low latency queries on large datasets. It outlines Druid's capabilities for real-time ingestion, aggregation queries in sub-seconds, and storing petabytes of historical data. Examples are given of companies like Netflix and PayPal using Druid at large scales to analyze streaming data. The key components, data formats, and query types of Druid are described.
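For a concrete sense of what such an aggregation query looks like, here is a hedged sketch of a Druid native timeseries query sent to a broker over HTTP. The broker URL, datasource, and metric names are assumptions.

```python
# Hedged sketch of a Druid native timeseries query; the broker URL and the
# datasource/metric names ("events", "count") are assumptions.
import requests

query = {
    "queryType": "timeseries",
    "dataSource": "events",
    "granularity": "hour",
    "intervals": ["2024-01-01/2024-01-02"],
    "aggregations": [
        {"type": "longSum", "name": "event_count", "fieldName": "count"}
    ],
}

# Druid brokers accept native JSON queries on /druid/v2/.
resp = requests.post("http://localhost:8082/druid/v2/", json=query, timeout=30)
for bucket in resp.json():  # one entry per hourly bucket
    print(bucket["timestamp"], bucket["result"]["event_count"])
```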
This document provides an overview of DataStax Enterprise, a database platform for cloud applications. It discusses key features of DataStax Enterprise including that it is certified for production, offers automatic management services for configuration and administration through OpsCenter, and provides 24/7 expert support. The document also summarizes various DataStax Enterprise technologies and capabilities like advanced replication, tiered storage, security features, and integration with search, analytics, and graph databases.
This document introduces RediSearch aggregations, which allow transforming search results into statistical insights using a pipeline of grouping, reducing, applying transformations, and sorting. It provides examples of counting questions by month from Stack Overflow data and comparing tag popularity over time. RediSearch aggregations can run distributed queries across shards and push operations down to shards for efficiency. Future work may include improving expression capabilities and supporting cursors.
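To make that pipeline concrete, here is a hedged sketch of the Stack Overflow example (questions counted per month) issued as a raw FT.AGGREGATE command through redis-py; the index name and field names are assumptions.

```python
# Hedged sketch of a RediSearch aggregation pipeline: group by month,
# reduce with COUNT, then sort. Index and field names are assumptions.
import redis

r = redis.Redis()

reply = r.execute_command(
    "FT.AGGREGATE", "so_idx", "*",
    "GROUPBY", "1", "@month",
    "REDUCE", "COUNT", "0", "AS", "num_questions",
    "SORTBY", "2", "@num_questions", "DESC",
)
print(reply)  # [total, [b'month', ..., b'num_questions', ...], ...]
```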
Scalable Data Modeling by Example (Carlos Alonso, Job and Talent) | Cassandra... | DataStax
Cassandra is getting more and more buzz, and that means two things: more development and more issues. Some issues are unavoidable, but others can be avoided just by understanding how our tooling works.
In this talk I'd like to review the core concepts on which Cassandra is built and how they shape the way we should work with it, using examples that will hopefully give you both a 'Quick Reference' and a 'Checklist' to go through every time you want to build scalable data models.
About the Speaker
Carlos Alonso Software Engineer, Job and Talent
Carlos received his Masters in CS at Salamanca University, Spain. He worked there for a few years in a digital agency, gaining expertise in a very wide range of technologies, before moving to London, where he narrowed his focus to the backend and data engineering disciplines. The latest step in his professional career was to move back to Madrid to work for Job and Talent, where he currently helps build the best candidate-job opening matching technology. Aside from work, he likes sharing as much as he can through public speaking, mentoring, or getting involved in OSS and OpenData initiatives.
One of MongoDB’s primary appeals to developers is that it gives them the ability to start application development without needing to define a formal, up-front schema. Operations teams appreciate the fact that they don't need to perform a time-consuming schema upgrade operation every time the developers need to store a different attribute (as an example, The Weather Channel is now able to launch new features in hours whereas it used to take weeks). For business leaders, the application gets launched much faster, and new features can be rolled out more frequently. MongoDB powers agility.
Some projects reach a point where it's necessary to define rules on what's being stored in the database – for example, that for any document in a particular collection, you can be assured that certain attributes are present.
To address the challenges discussed above, while at the same time maintaining the benefits of a dynamic schema, MongoDB 3.2 introduces document validation.
There is significant flexibility to customize which parts of the documents are **and are not** validated for any collection.
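As a minimal illustration, the sketch below creates a collection with a validator using the query-operator style that MongoDB 3.2 supports, via pymongo; the collection and field names are illustrative, not from the original deck.

```python
# A minimal document-validation sketch with pymongo, using the
# query-operator style validators introduced in MongoDB 3.2
# (collection and field names are illustrative).
from pymongo import MongoClient

db = MongoClient()["crm"]

# Require an email and constrain status at insert/update time.
db.create_collection("contacts", validator={
    "$and": [
        {"email": {"$exists": True}},
        {"status": {"$in": ["prospect", "customer"]}},
    ]
})

db.contacts.insert_one({"email": "a@example.com", "status": "customer"})  # passes
# db.contacts.insert_one({"status": "lead"})  # would raise WriteError: fails validation
```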
An introduction and status update on Redis' upcoming new data structure - Stream - that is not unlike a log, has some Apache Kafka-like thingamagigs and can be also used for time series data
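A short sketch of the idea, assuming redis-py and illustrative stream and field names: entries are appended with XADD and read back with XREAD, and the auto-generated IDs encode a millisecond timestamp, which is what makes streams usable for time series.

```python
# Hedged sketch of the Stream data type: append log-like entries with XADD
# and read them back with XREAD (stream/field names are illustrative).
import redis

r = redis.Redis()

# Each entry gets an auto-generated ID of the form <ms-timestamp>-<seq>.
r.xadd("sensor:1", {"temperature": "21.5"})
r.xadd("sensor:1", {"temperature": "21.7"})

# Read everything from the beginning of the stream (ID 0).
for stream, entries in r.xread({"sensor:1": "0"}):
    for entry_id, fields in entries:
        print(entry_id, fields)
```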
SASI, Cassandra on full text search ride | DuyHai Doan
This document discusses SASI (SSTable Attached Secondary Index), a new secondary index for Apache Cassandra that follows the SSTable lifecycle. It describes how SASI works, including its in-memory and on-disk structures. It also covers SASI's query planning optimizations and provides some benchmark results showing SASI's performance improvements over full scans. While SASI is not as full-featured as search engines, it can cover many search use cases within Cassandra.
Beyond the Query – Bringing Complex Access Patterns to NoSQL with DataStax - ... | StampedeCon
Learn how to model beyond traditional direct access in Apache Cassandra. Utilizing the DataStax platform to harness the power of Spark and Solr to perform search, analytics, and complex operations in place on your Cassandra data!
Time Series Processing with Solr and Spark: Presented by Josef Adersberger, Q... | Lucidworks
The document discusses time series processing with Solr and Spark. It describes a use case of monitoring data analysis for a distributed software system that generates over 6 trillion observations per year. The Chronix stack is presented as an easy-to-use solution for big time series data storage and processing on Spark. It provides a scale-out time series database with efficient storage and interactive queries by integrating with existing Solr and Spark installations. The Chronix Spark API and internals are covered, focusing on distributed data retrieval, efficient data formats and processing, and best practices for aligning Spark and Solr parallelism.
Redundancy and high availability are the basis for all production deployments. With MongoDB this can be achieved by deploying a replica set. In these slides we explore how replication works in MongoDB, why you should use replication, and what its features are, and we go over different deployment use cases. At the end we compare some features with MySQL replication and look at the differences between the two.
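A hedged sketch of the setup the slides walk through: initiating a three-member replica set and connecting to it with pymongo. The host names and the replica-set name rs0 are assumptions.

```python
# Hedged sketch: initiate a three-member replica set and connect to it
# (host names and the "rs0" set name are assumptions).
from pymongo import MongoClient

# Run once against one member started with --replSet rs0.
seed = MongoClient("mongodb://node1:27017", directConnection=True)
seed.admin.command("replSetInitiate", {
    "_id": "rs0",
    "members": [
        {"_id": 0, "host": "node1:27017"},
        {"_id": 1, "host": "node2:27017"},
        {"_id": 2, "host": "node3:27017"},
    ],
})

# Clients then connect with the replica-set name; the driver discovers the
# primary and fails over automatically when a new primary is elected.
client = MongoClient("mongodb://node1,node2,node3/?replicaSet=rs0")
print(client.admin.command("replSetGetStatus")["myState"])
```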
OrientDB vs Neo4j - Comparison of query/speed/functionality | Curtis Mosters
This presentation gives an overview of OrientDB and Neo4j. It also compares some specific queries, their speed, and the overall functionality of both databases.
The queries might not be optimized in either case, but at least they produce the same outcome and are both written as queries. Admittedly, in Neo4j you would do this in Java code, but that is much harder to write, so this presentation is more of a direct comparison than an attempt to get the best possible results.
It also uses real data, roughly 200 GB of it in the end.
Introduction to Real-Time Analytics with Cassandra and Hadoop | Patricia Gorla
This presentation examines the benefits of using Cassandra to store data, and how the Hadoop ecosystem can fit in to add aggregation functionality to your cluster.
Accompanying code can be found online at bit.ly/1aB8Jy8.
Talk delivered at StrataConf + Hadoop World 2013.
Cassandra Tools and Distributed Administration (Jeffrey Berger, Knewton) | C*... | DataStax
At Knewton we operate a total of 29 clusters across five different VPCs, each ranging from 3 to 24 nodes. For a team of three, maintaining this is not herculean; however, good tools to diagnose issues and gather information in a distributed manner are vital to moving quickly and minimizing engineering time spent.
The database team at Knewton has been successfully using a combination of Ansible and custom open-source tools to maintain and improve the Cassandra deployment at Knewton. I will be talking about several of these tools and giving examples of how we are using them. Specifically, I will discuss the cassandra-tracing tool, which analyzes the contents of the system_traces keyspace, and the cassandra-stat tool, which gives real-time output of the operations of a Cassandra cluster. Distributed administration with ad-hoc Ansible will also be covered, and I will walk through examples of using these commands to identify and remediate cluster-wide issues.
About the Speaker
Jeffrey Berger Lead Database Engineer, Knewton
Dr. Jeffrey Berger is currently the lead database engineer at Knewton, an education tech startup in NYC. He joined the tech scene in NYC in 2013 and spent two years working with MongoDB, becoming a certified MongoDB administrator and a MongoDB Master. He received his Cassandra Administrator certification at Cassandra Summit 2015. He holds a Ph.D. in Theoretical Physics from Penn State and spent several years working on high energy nuclear interactions.
Hadoop + Cassandra: Fast queries on data lakes, and Wikipedia search tutorial | Natalino Busa
Today’s services rely on massive amounts of data, but at the same time need to be fast and responsive. Building fast services on big-data, batch-oriented frameworks is definitely a challenge. At ING, we have worked on a stack that can alleviate this problem. Namely, we materialize data models by map-reducing Hadoop queries from Hive into Cassandra. Instead of sinking the results back to HDFS, we propagate them into Cassandra key-value tables. Those Cassandra tables are finally exposed via an HTTP API front-end service.
Using Approximate Data for Small, Insightful Analytics (Ben Kornmeier, Protec... | DataStax
Running a Cassandra cluster in AWS that can store petabytes worth of data can be costly. This talk will detail the novel approach of using approximate data structures to keep costs low yet retain insightful, up-to-date query results. The talk will explore a number of real-world examples from our environment to demonstrate the power of approximate data. It will cover: determining how many IP addresses are on a network, ranking IPs by traffic, and finally determining approximate min, max, and averages on values. The talk will also cover how this data is laid out in Cassandra, so that a query always returns up-to-date data without burdening the compactor.
About the Speaker
Ben Kornmeier Engineer, ProtectWise
Ben is a Staff Engineer at ProtectWise. When he is not building realtime processing pipelines, he enjoys hiking, biking, and keeping his dog out of trouble.
This document discusses using Redis and Elasticsearch together for time series data. Redis Streams can be used to store time-stamped data in Redis, and then a Logstash pipeline can be used to extract the data from Redis and index it into Elasticsearch. The RediSearch module for Redis allows full-text search of Redis data. Dashboards in Kibana can then visualize and analyze the time series data stored in Elasticsearch.
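In place of the Logstash pipeline, the same flow can be hand-rolled; the sketch below consumes entries from a Redis Stream and indexes them into Elasticsearch with the official Python clients. The stream and index names are assumptions.

```python
# Hand-rolled stand-in for the Logstash step: consume a Redis Stream and
# index each entry into Elasticsearch (stream/index names are assumptions).
import redis
from elasticsearch import Elasticsearch

r = redis.Redis()
es = Elasticsearch("http://localhost:9200")

last_id = "0"
while True:
    # Block up to 5s waiting for new entries after last_id.
    batches = r.xread({"metrics": last_id}, block=5000, count=100)
    for _stream, entries in batches:
        for entry_id, fields in entries:
            doc = {k.decode(): v.decode() for k, v in fields.items()}
            es.index(index="metrics-ts", document=doc)
            last_id = entry_id
```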
The document discusses how the Spark Cassandra Connector works. It explains that the connector uses information about how data is partitioned in Cassandra nodes to generate Spark partitions that correspond to the token ranges in Cassandra. This allows data to be read from Cassandra in parallel across the Spark partitions. The connector also supports automatically pushing down filter predicates to the Cassandra database to reduce the amount of data read.
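A hedged PySpark sketch of that behavior: reading a Cassandra table through the connector and filtering so the predicate can be pushed down. The keyspace, table, and column names are assumptions, and the spark-cassandra-connector package must be on the classpath.

```python
# Hedged PySpark sketch of connector reads with predicate pushdown
# (keyspace/table/column names are assumptions).
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("cassandra-read")
         .config("spark.cassandra.connection.host", "127.0.0.1")
         .getOrCreate())

df = (spark.read
      .format("org.apache.spark.sql.cassandra")
      .options(keyspace="shop", table="orders")
      .load())

# Filters like this can be pushed down to Cassandra where supported, so only
# matching rows are read; Spark partitions map to Cassandra token ranges.
recent = df.filter(df.customer_id == "c42")
recent.explain()  # pushed filters show up in the physical plan
recent.show()
```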
This document discusses SQL on Druid. It provides an overview of Druid, benchmarks comparing Druid to Spark, and details how SQL can be used with Druid through Hive integration and Druid's built-in SQL functionality. Hive allows SQL queries over Druid data through a Druid storage handler and by translating Hive queries into the appropriate Druid query format. Druid also natively supports SQL queries through its Avatica server, enabling SQL queries directly against Druid data sources.
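A hedged sketch of the Hive side of this, using PyHive: an external table backed by a Druid datasource via the Druid storage handler, then a plain SQL query that Hive translates into a Druid query. Host, table, and datasource names are assumptions.

```python
# Hedged sketch of exposing a Druid datasource to Hive and querying it with
# SQL (host/table/datasource names are assumptions; uses the PyHive client).
from pyhive import hive

conn = hive.connect(host="hive-server", port=10000)
cur = conn.cursor()

# An external Hive table backed by an existing Druid datasource.
cur.execute("""
    CREATE EXTERNAL TABLE druid_events
    STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler'
    TBLPROPERTIES ("druid.datasource" = "events")
""")

# Hive translates this into the corresponding Druid query.
cur.execute("""
    SELECT COUNT(*) FROM druid_events
    WHERE `__time` >= CAST('2024-01-01' AS TIMESTAMP)
""")
print(cur.fetchall())
```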
In this lecture we analyze document-oriented databases. In particular, we consider why they were the first approach to NoSQL and what their main features are. Then we analyze MongoDB as an example, covering its data model, CRUD operations, write concerns, and scaling (replication and sharding).
Finally, we present other document-oriented databases and discuss when to use document-oriented databases and when not to.
SASI: Cassandra on the Full Text Search Ride (DuyHai DOAN, DataStax) | C* Sum... | DataStax
Since the introduction of SASI in Cassandra 3.4, it is way easier than before to query data. Now you can create performant indices on your columns as well as benefit from full text search capabilities with the introduction of the new `LIKE '%term%'` syntax.
This talk will show the architecture at a high level and expose all the trade-offs so you can choose and use SASI wisely.
We also highlight some use cases where SASI is not a good fit and should be avoided (there is no magic, sorry).
To illustrate the talk, we'll use a sample database of 110,000 albums and artists and create indices on them.
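A minimal sketch of the syntax the talk demonstrates, using the DataStax Python driver; the keyspace, table, and index names are assumptions.

```python
# Hedged sketch of a SASI index and the LIKE syntax
# (keyspace/table names are assumptions; uses the DataStax Python driver).
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("music")

# CONTAINS mode allows '%term%' searches, not just prefix matches.
session.execute("""
    CREATE CUSTOM INDEX IF NOT EXISTS albums_title_idx ON albums (title)
    USING 'org.apache.cassandra.index.sasi.SASIIndex'
    WITH OPTIONS = {'mode': 'CONTAINS'}
""")

for row in session.execute("SELECT artist, title FROM albums WHERE title LIKE '%love%'"):
    print(row.artist, row.title)
```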
About the Speaker
DuyHai DOAN Apache Cassandra Evangelist, DataStax
DuyHai DOAN is an Apache Cassandra Evangelist at DataStax. He spends his time between technical presentations/meetups on Cassandra, coding on open source projects like Achilles or Apache Zeppelin to support the community and helping all companies using Cassandra to make their project successful. Previously he was working as a freelance Java/Cassandra consultant.
Working with MongoDB as a MySQL DBA: comparing commands from MongoDB to MySQL, with their similarities and differences; exploring replication features, failover, and recovery; adjusting variables and checking status; and using DML and DDL with different storage engines.
Introduction to the Hadoop Ecosystem (codemotion Edition) | Uwe Printz
This document provides an introduction to the Hadoop ecosystem. It discusses data storage with HDFS and data processing with MapReduce. It also describes higher-level tools like Pig and Hive that provide interfaces for data analysis. Pig uses a language called Pig Latin to analyze data in Hadoop, while Hive provides an SQL-like interface. These tools sit on top of core Hadoop components to simplify big data workflows.
The document discusses Big Data on Azure and provides an overview of HDInsight, Microsoft's Apache Hadoop-based data platform on Azure. It describes HDInsight cluster types for Hadoop, HBase, Storm and Spark and how clusters can be automatically provisioned on Azure. Example applications and demos of Storm, HBase, Hive and Spark are also presented. The document highlights key aspects of using HDInsight including storage integration and tools for interactive analysis.
Data Processing with Cascading Java API on Apache Hadoop | Hikmat Dhamee
Cascading is an application framework that simplifies developing robust data analytics and management applications on Apache Hadoop. It provides a query API and planner to define, share, and execute data processing workflows on Hadoop clusters. Cascading handles complexities of Hadoop application development, job creation, and scheduling to allow organizations to rapidly develop complex data processing applications.
The document discusses Big Data on Azure and provides an overview of HDInsight, Microsoft's Apache Hadoop-based data platform on Azure. It describes HDInsight cluster types for Hadoop, HBase, Storm and Spark and how clusters can be automatically provisioned on Azure. It also summarizes tools like Hive, HBase and Spark that can be used to analyze data stored on HDInsight clusters.
Johnny Miller – Cassandra + Spark = Awesome - NoSQL matters Barcelona 2014 | NoSQLmatters
Johnny Miller – Cassandra + Spark = Awesome
This talk will discuss how Cassandra and Spark can work together to deliver real-time analytics. This is a technical discussion that will introduce attendees to the basic principles of Cassandra and Spark, why they work well together, and example use cases.
The document provides an overview of Hydra, an open source distributed data processing system. It discusses Hydra's goals of supporting streaming and batch processing at massive scale with fault tolerance. It also covers key Hydra concepts like jobs, tasks, and nodes. The document then demonstrates setting up a local Hydra development environment and creating a sample job to analyze log data and find top search terms.
Cassandra is an open source, distributed, decentralized, elastically scalable, highly available, and fault-tolerant database. It originated at Facebook in 2007 to solve their inbox search problem. Some key companies using Cassandra include Twitter, Facebook, Digg, and Rackspace. Cassandra's data model is based on Google's Bigtable and its distribution design is based on Amazon's Dynamo.
This document summarizes machine learning concepts in Spark. It introduces Spark, its components including SparkContext, Resilient Distributed Datasets (RDDs), and common transformations and actions. Transformations like map, filter, join, and groupByKey are covered. Actions like collect, count, reduce are also discussed. A word count example in Spark using transformations and actions is provided to illustrate how to analyze text data in Spark.
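A version of that word count in PySpark, labeling which steps are lazy transformations and which are actions; the input path is an assumption.

```python
# A word-count sketch matching the example the summary mentions, using RDD
# transformations (flatMap/map/reduceByKey) and actions (count/takeOrdered);
# the input path is illustrative.
from pyspark import SparkContext

sc = SparkContext(appName="wordcount")

counts = (sc.textFile("hdfs:///data/books/*.txt")   # lazy: builds an RDD lineage
          .flatMap(lambda line: line.split())       # transformation
          .map(lambda word: (word.lower(), 1))      # transformation
          .reduceByKey(lambda a, b: a + b))         # transformation (shuffle)

print(counts.count())                               # action: triggers execution
for word, n in counts.takeOrdered(10, key=lambda kv: -kv[1]):  # action
    print(word, n)
```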
Big data, just an introduction to Hadoop and Scripting LanguagesCorley S.r.l.
This document provides an introduction to Big Data and Apache Hadoop. It defines Big Data as large and complex datasets that are difficult to process using traditional database tools. It describes how Hadoop uses MapReduce and HDFS to provide scalable storage and parallel processing of Big Data. It provides examples of companies using Hadoop to analyze exabytes of data and common Hadoop use cases like log analysis. Finally, it summarizes some popular Hadoop ecosystem projects like Hive, Pig, and Zookeeper that provide SQL-like querying, data flows, and coordination.
Dapper.NET is a micro-ORM that provides simple methods for querying and mapping data from databases. It allows for CRUD operations, batch inserts, stored procedures, views, and transaction support. Dapper is lightweight, with a single file and less than 700 lines of code. It provides fast and pure SQL functionality by enriching IDbCommand with extension methods. Queries can map results to POCOs or dynamic objects. Additional extensions like Dapper Contrib provide more advanced features.
- Data is a precious resource that can last longer than the systems themselves (Tim Berners-Lee)
- Hadoop is an open-source framework for distributed storage and processing of large datasets across clusters of commodity hardware. It provides reliability, scalability and flexibility.
- Hadoop consists of HDFS for storage and MapReduce for processing. The main nodes include NameNode, DataNodes, JobTracker and TaskTrackers. Tools like Hive, Pig, HBase extend its capabilities for SQL-like queries, data flows and NoSQL access.
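As a small illustration of the MapReduce model just outlined, a single Python file can act as both mapper and reducer under Hadoop Streaming; the file name and invocation below are illustrative.

```python
# Hedged word-count sketch for Hadoop Streaming: this file acts as mapper or
# reducer depending on argv. A hypothetical invocation:
#   hadoop jar hadoop-streaming.jar -mapper "wc.py map" -reducer "wc.py reduce" ...
import sys

def mapper():
    for line in sys.stdin:
        for word in line.split():
            print(f"{word.lower()}\t1")  # emit key<TAB>value

def reducer():
    current, total = None, 0
    for line in sys.stdin:  # Hadoop sorts map output by key before reduce
        word, count = line.rsplit("\t", 1)
        if word != current:
            if current is not None:
                print(f"{current}\t{total}")
            current, total = word, 0
        total += int(count)
    if current is not None:
        print(f"{current}\t{total}")

if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()
```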
You've seen the basic two-stage example Spark programs, and now you're ready to move on to something larger. I'll go over lessons I've learned for writing efficient Spark programs, from design patterns to debugging tips.
The slides are largely just talking points for a live presentation, but hopefully you can still make sense of them for offline viewing as well.
Hadoop is an open-source framework for storing and processing large datasets in a distributed computing environment. It allows for the storage and analysis of datasets that are too large for single servers. The document discusses several key Hadoop components including HDFS for storage, MapReduce for processing, HBase for column-oriented storage, Hive for SQL-like queries, Pig for data flows, and Sqoop for data transfer between Hadoop and relational databases. It provides examples of how each component can be used and notes that Hadoop is well-suited for large-scale batch processing of data.
This document provides an overview of the Apache Spark framework. It discusses how Spark allows distributed processing of large datasets across computer clusters using simple programming models. It also describes how Spark can scale from single servers to thousands of machines. Spark is designed to provide high availability by detecting and handling failures at the application layer. The document also summarizes Resilient Distributed Datasets (RDDs), which are Spark's fundamental data abstraction, and transformations and actions that can be performed on RDDs.
InfluxEnterprise Architecture Patterns by Tim Hall & Sam Dillard | InfluxData
1. The document provides an overview of InfluxEnterprise, including its core open source functionality, high availability features, scalability, fine-grained authorization, support options, and on-premise or cloud deployment options.
2. It discusses signs that an organization may be ready for InfluxEnterprise, such as high CPU usage, issues with single node deployments, and needing improved data durability or throughput.
3. The document covers InfluxEnterprise cluster architecture including meta nodes, data nodes, replication patterns, ingestion and query rates for different replication configurations, and examples for mothership, durable data ingest, and integrating with ElasticSearch deployments.
Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena Edelson | Spark Summit
Streaming Analytics with Spark, Kafka, Cassandra, and Akka discusses rethinking architectures for streaming analytics. The document discusses:
1) The need to build scalable, fault-tolerant systems to handle massive amounts of streaming data from different sources with varying structures.
2) An example use case of profiling cyber threat actors using streaming machine data to detect intrusions and security breaches.
3) Rethinking architectures by moving away from ETL pipelines and dual batch/stream systems like Lambda architecture toward unified stream processing with Spark Streaming, Kafka, Cassandra and Akka. This simplifies analytics and eliminates duplicate code and systems.
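A hedged sketch of that unified-stream idea, written with today's Structured Streaming API (the talk itself comes from the DStream era, so this is a modern stand-in, not the authors' code); the broker address and topic name are assumptions.

```python
# Hedged sketch: one streaming code path instead of separate batch and
# speed layers (broker address and topic name are assumptions).
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, window

spark = SparkSession.builder.appName("stream-analytics").getOrCreate()

events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "machine-events")
          .load())

# Windowed counts over the event stream -- the same code serves both
# real-time and historical replays, avoiding duplicated Lambda-style logic.
counts = (events
          .groupBy(window(col("timestamp"), "1 minute"))
          .count())

(counts.writeStream
 .outputMode("complete")
 .format("console")
 .start()
 .awaitTermination())
```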
Tugdual Grall - Real World Use Cases: Hadoop and NoSQL in Production | Codemotion
What’s important about a technology is what you can use it to do. I’ve looked at what a number of groups are doing with Apache Hadoop and NoSQL in production, and I will relay what worked well for them and what did not. Drawing from real-world use cases, I show how people who understand these new approaches can employ them well in conjunction with traditional approaches and existing applications. Threat detection, data warehouse optimization, marketing efficiency, and biometric databases are some of the examples covered in this presentation.
Data scientists spend too much of their time collecting, cleaning and wrangling data as well as curating and enriching it. Some of this work is inevitable due to the variety of data sources, but there are tools and frameworks that help automate many of these non-creative tasks. A unifying feature of these tools is support for rich metadata for data sets, jobs, and data policies. In this talk, I will introduce state-of-the-art tools for automating data science and I will show how you can use metadata to help automate common tasks in Data Science. I will also introduce a new architecture for extensible, distributed metadata in Hadoop, called Hops (Hadoop Open Platform-as-a-Service), and show how tinker-friendly metadata (for jobs, files, users, and projects) opens up new ways to build smarter applications.
Apache Eagle is a distributed real-time monitoring and alerting engine for Hadoop created by eBay to address limitations of existing tools in handling large volumes of metrics and logs from Hadoop clusters. It provides data activity monitoring, job performance monitoring, and unified monitoring. Eagle detects anomalies using machine learning algorithms and notifies users through alerts. It has been deployed across multiple eBay clusters with over 10,000 nodes and processes hundreds of thousands of events per day.
This document provides an agenda and overview of various Node.js concepts including the package manager NPM, web frameworks like Express, template engines like Jade and EJS, databases drivers for Redis and MongoDB, testing frameworks like Mocha and Nodeunit, avoiding callback hell using the async library, and debugging with Node Inspector. It discusses installing Node.js, creating HTTP servers, middleware, authentication, internationalization, and more.
This document discusses using Fabric and Puppet together to streamline system administration tasks. Fabric can be used to execute tasks across multiple servers using Python, while Puppet defines infrastructure using code and templates. The document suggests using Fabric to set up environments and trigger Puppet deployments, while defining nodes and classes in Puppet. This allows taking advantage of Fabric's host management capabilities and Puppet's declarative approach. Initial Fabric functions would prepare environments, while global functions handle setup/teardown. Puppet would define the desired configuration to deploy using its domain-specific language.
Scrum is an agile framework that allows self-organizing teams to focus on delivering business value in short iterations called sprints. It defines clear roles like Product Owner, Scrum Master, and team members. The Product Owner prioritizes user stories from the product backlog for the team to work on. Daily stand-ups help the team track progress and obstacles. Sprints end with a demo and retrospective to improve. User stories describe features using a simple format of "As a <role> I want <goal> so that <benefit>" to focus on users and value. Non-functional requirements constrain the system but can also be expressed as user stories.
This document provides an overview of Redis, an open source, advanced key-value store. It supports a variety of data types including strings, lists, sets, sorted sets and hashes. Redis can be used across multiple programming languages and operating systems. It offers functionality like pub/sub messaging, Lua scripting, pipelining, transactions, persistence and replication. High availability options include Redis Sentinel, virtual IPs and twemproxy for load balancing. In summary, Redis is a versatile NoSQL database that supports advanced data structures and common database functions through a simple interface.
Python decorators allow functions and classes to be augmented or modified by wrapper objects. Decorators take the form of callable objects that process other callable objects like functions and classes. Decorators are applied once when a function or class is defined, making augmentation logic explicit and avoiding the need to modify call sites. Decorators can manage state information, handle multiple instances, and take arguments to customize behavior. However, decorators also introduce type changes and extra function calls that incur performance costs.
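A small self-contained example of those points: a decorator factory that takes an argument, wraps a function once at definition time, and keeps per-function call-count state.

```python
# Sketch of the decorator ideas described above: a decorator factory with an
# argument, managing state across calls to the wrapped function.
import functools

def count_calls(label):
    def decorator(func):
        @functools.wraps(func)           # preserve name/docstring of func
        def wrapper(*args, **kwargs):
            wrapper.calls += 1           # state managed on the wrapper object
            print(f"{label}: call #{wrapper.calls} to {func.__name__}")
            return func(*args, **kwargs)
        wrapper.calls = 0
        return wrapper
    return decorator

@count_calls("audit")                    # applied once, at definition time
def add(a, b):
    return a + b

add(1, 2)   # audit: call #1 to add
add(3, 4)   # audit: call #2 to add
```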
The document discusses the configuration management tool Puppet. It provides an overview of Puppet including describing how it can be used to automate and standardize system administration tasks through reusable code. It also covers Puppet concepts like nodes, classes, templates and the Puppet file structure.
This document provides an overview of JMS (Java Message Service) concepts and ActiveMQ configuration and usage. It discusses JMS programming models, message types, persistence, transactions, ActiveMQ broker configuration including persistence, clustering and monitoring. It also summarizes performance tests comparing ActiveMQ to other messaging systems.
This document summarizes the Spring Framework and its core technologies. It discusses inversion of control (IoC) and dependency injection, aspect-oriented programming (AOP), the Spring architecture, XML vs. annotation configuration, the Spring Expression Language (SpEL), testing tools, and integrating the CXF framework for web services. Key AOP concepts like advice, joinpoints, and pointcuts are also defined.
The document provides an introduction to Java unit testing and code coverage. It discusses test-driven development and the JUnit testing framework, including annotations, test results, and Ant integration. It also covers using mock objects with Mockito, performing code coverage analysis, and code coverage tools like Emma and Cobertura.
Building Production Ready Search Pipelines with Spark and Milvus | Zilliz
Spark is a widely used ETL tool for processing, indexing, and ingesting data into a serving stack for search. Milvus is a production-ready open-source vector database. In this talk we will show how to use Spark to process unstructured data into vector representations, and push the vectors to the Milvus vector database for search serving.
HCL Notes and Domino License Cost Reduction in the World of DLAU | panagenda
Webinar Recording: https://www.panagenda.com/webinars/hcl-notes-und-domino-lizenzkostenreduzierung-in-der-welt-von-dlau/
DLAU and licensing under the CCB and CCX model have been a hot topic for many in the HCL community since last year. As a Notes or Domino customer, you may be struggling with unexpectedly high user counts and license fees. You may be wondering how this new kind of licensing works and what benefits it brings you. Above all, you surely want to stay within your budget and save costs wherever possible. We understand that, and we want to help!
We explain how to solve common configuration problems that can cause more users to be counted than necessary, and how to identify and remove superfluous or unused accounts to save money. There are also some practices that can lead to unnecessary expenses, for example using a person document instead of a mail-in for shared mailboxes. We show you such cases and their solutions. And of course we explain the new licensing model.
Join this webinar, in which HCL Ambassador Marc Thomas and guest speaker Franz Walder introduce you to this new world. It gives you the tools and know-how to keep on top of things. You will be able to reduce your costs through an optimized Domino configuration and keep them low in the future.
These topics will be covered
- Reducing license costs by finding and fixing misconfigurations and superfluous accounts
- How do CCB and CCX licenses really work?
- Understanding the DLAU tool and how best to use it
- Tips for common problem areas, such as team mailboxes, functional/test users, etc.
- Practical examples and best practices to implement right away
5th LF Energy Power Grid Model Meet-up Slides | DanBrown980551
5th Power Grid Model Meet-up
It is with great pleasure that we extend to you an invitation to the 5th Power Grid Model Meet-up, scheduled for 6th June 2024. This event will adopt a hybrid format, allowing participants to join us either through an online Microsoft Teams session or in person at TU/e, located at Den Dolech 2, Eindhoven, Netherlands. The meet-up will be hosted by Eindhoven University of Technology (TU/e), a research university specializing in engineering science & technology.
Power Grid Model
The global energy transition is placing new and unprecedented demands on Distribution System Operators (DSOs). Alongside upgrades to grid capacity, processes such as digitization, capacity optimization, and congestion management are becoming vital for delivering reliable services.
Power Grid Model is an open source project from Linux Foundation Energy and provides a calculation engine that is increasingly essential for DSOs. It offers a standards-based foundation enabling real-time power systems analysis, simulations of electrical power grids, and sophisticated what-if analysis. In addition, it enables in-depth studies and analysis of the electrical power grid’s behavior and performance. This comprehensive model incorporates essential factors such as power generation capacity, electrical losses, voltage levels, power flows, and system stability.
Power Grid Model is currently being applied in a wide variety of use cases, including grid planning, expansion, reliability, and congestion studies. It can also help in analyzing the impact of renewable energy integration, assessing the effects of disturbances or faults, and developing strategies for grid control and optimization.
What to expect
For the upcoming meetup we are organizing, we have an exciting lineup of activities planned:
- Insightful presentations covering two practical applications of the Power Grid Model.
- An update on the latest advancements in Power Grid Model technology during the first and second quarters of 2024.
- An interactive brainstorming session to discuss and propose new feature requests.
- An opportunity to connect with fellow Power Grid Model enthusiasts and users.
Ivanti’s Patch Tuesday breakdown goes beyond patching your applications and brings you the intelligence and guidance needed to prioritize where to focus your attention first. Catch early analysis on our Ivanti blog, then join industry expert Chris Goettl for the Patch Tuesday Webinar Event. There we’ll do a deep dive into each of the bulletins and give guidance on the risks associated with the newly-identified vulnerabilities.
UiPath Test Automation using UiPath Test Suite series, part 6 | DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 6. In this session, we will cover Test Automation with generative AI and Open AI.
The UiPath Test Automation with generative AI and OpenAI webinar offers an in-depth exploration of leveraging cutting-edge technologies for test automation within the UiPath platform. Attendees will delve into the integration of generative AI, a test automation solution, with OpenAI's advanced natural language processing capabilities.
Throughout the session, participants will discover how this synergy empowers testers to automate repetitive tasks, enhance testing accuracy, and expedite the software testing life cycle. Topics covered include the seamless integration process, practical use cases, and the benefits of harnessing AI-driven automation for UiPath testing initiatives. By attending this webinar, testers, and automation professionals can gain valuable insights into harnessing the power of AI to optimize their test automation workflows within the UiPath ecosystem, ultimately driving efficiency and quality in software development processes.
What will you get from this session?
1. Insights into integrating generative AI.
2. Understanding how this integration enhances test automation within the UiPath platform
3. Practical demonstrations
4. Exploration of real-world use cases illustrating the benefits of AI-driven test automation for UiPath
Topics covered:
What is generative AI
Test Automation with generative AI and Open AI.
UiPath integration with generative AI
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
HCL Notes and Domino License Cost Reduction in the World of DLAU | panagenda
Webinar Recording: https://www.panagenda.com/webinars/hcl-notes-and-domino-license-cost-reduction-in-the-world-of-dlau/
The introduction of DLAU and the CCB & CCX licensing model caused quite a stir in the HCL community. As a Notes and Domino customer, you may have faced challenges with unexpected user counts and license costs. You probably have questions on how this new licensing approach works and how to benefit from it. Most importantly, you likely have budget constraints and want to save money where possible. Don’t worry, we can help with all of this!
We’ll show you how to fix common misconfigurations that cause higher-than-expected user counts, and how to identify accounts which you can deactivate to save money. There are also frequent patterns that can cause unnecessary cost, like using a person document instead of a mail-in for shared mailboxes. We’ll provide examples and solutions for those as well. And naturally we’ll explain the new licensing model.
Join HCL Ambassador Marc Thomas in this webinar with a special guest appearance from Franz Walder. It will give you the tools and know-how to stay on top of what is going on with Domino licensing. You will be able to lower your cost through an optimized configuration and keep it low going forward.
These topics will be covered
- Reducing license cost by finding and fixing misconfigurations and superfluous accounts
- How do CCB and CCX licenses really work?
- Understanding the DLAU tool and how to best utilize it
- Tips for common problem areas, like team mailboxes, functional/test users, etc.
- Practical examples and best practices to implement right away
Webinar: Designing a schema for a Data Warehouse | Federico Razzoli
Are you new to data warehouses (DWH)? Do you need to check whether your data warehouse follows the best practices for a good design? In both cases, this webinar is for you.
A data warehouse is a central relational database that contains all measurements about a business or an organisation. This data comes from a variety of heterogeneous data sources, which includes databases of any type that back the applications used by the company, data files exported by some applications, or APIs provided by internal or external services.
But designing a data warehouse correctly is a hard task, which requires gathering information about the business processes that need to be analysed in the first place. These processes must be translated into so-called star schemas, that is, denormalised databases where each table represents a dimension or facts.
We will discuss these topics:
- How to gather information about a business;
- Understanding dictionaries and how to identify business entities;
- Dimensions and facts;
- Setting a table granularity;
- Types of facts;
- Types of dimensions;
- Snowflakes and how to avoid them;
- Expanding existing dimensions and facts.
Your One-Stop Shop for Python Success: Top 10 US Python Development Providers | akankshawande
Simplify your search for a reliable Python development partner! This list presents the top 10 trusted US providers offering comprehensive Python development services, ensuring your project's success from conception to completion.
Programming Foundation Models with DSPy - Meetup Slides | Zilliz
Prompting language models is hard, while programming language models is easy. In this talk, I will discuss the state-of-the-art framework DSPy for programming foundation models with its powerful optimizers and runtime constraint system.
Ocean lotus Threat actors project by John Sitima 2024 (1).pptx | SitimaJohn
Ocean Lotus cyber threat actors represent a sophisticated, persistent, and politically motivated group that poses a significant risk to organizations and individuals in the Southeast Asian region. Their continuous evolution and adaptability underscore the need for robust cybersecurity measures and international cooperation to identify and mitigate the threats posed by such advanced persistent threat groups.
Generating privacy-protected synthetic data using Secludy and Milvus | Zilliz
During this demo, the founders of Secludy will demonstrate how their system utilizes Milvus to store and manipulate embeddings for generating privacy-protected synthetic data. Their approach not only maintains the confidentiality of the original data but also enhances the utility and scalability of LLMs under privacy constraints. Attendees, including machine learning engineers, data scientists, and data managers, will witness first-hand how Secludy's integration with Milvus empowers organizations to harness the power of LLMs securely and efficiently.
OpenID AuthZEN Interop Read Out - Authorization (David Brossard)
During Identiverse 2024 and EIC 2024, members of the OpenID AuthZEN WG got together and demoed their authorization endpoints conforming to the AuthZEN API.
Introduction of Cybersecurity with OSS at Code Europe 2024 (Hiroshi SHIBATA)
I develop the Ruby programming language, RubyGems, and Bundler, which are package managers for Ruby. Today, I will introduce how to enhance the security of your application using open-source software (OSS) examples from Ruby and RubyGems.
The first topic is CVE (Common Vulnerabilities and Exposures). I have published CVEs many times. But what exactly is a CVE? I'll provide a basic understanding of CVEs and explain how to detect and handle vulnerabilities in OSS.
Next, let's discuss package managers. Package managers play a critical role in the OSS ecosystem. I'll explain how to manage library dependencies in your application.
I'll share insights into how the Ruby and RubyGems core team works to keep our ecosystem safe. By the end of this talk, you'll have a better understanding of how to safeguard your code.
How to Interpret Trends in the Kalyan Rajdhani Mix Chart (Chart Kalyan)
A Mix Chart displays historical number data in graphical or tabular form. The Kalyan Rajdhani Mix Chart specifically shows the results of a sequence of numbers over different periods.
In the rapidly evolving landscape of technologies, XML continues to play a vital role in structuring, storing, and transporting data across diverse systems. The recent advancements in artificial intelligence (AI) present new methodologies for enhancing XML development workflows, introducing efficiency, automation, and intelligent capabilities. This presentation will outline the scope and perspective of utilizing AI in XML development. The potential benefits and the possible pitfalls will be highlighted, providing a balanced view of the subject.
We will explore the capabilities of AI in understanding XML markup languages and autonomously creating structured XML content. Additionally, we will examine the capacity of AI to enrich plain text with appropriate XML markup. Practical examples and methodological guidelines will be provided to elucidate how AI can be effectively prompted to interpret and generate accurate XML markup.
Further emphasis will be placed on the role of AI in developing XSLT, or schemas such as XSD and Schematron. We will address the techniques and strategies adopted to create prompts for generating code, explaining code, or refactoring the code, and the results achieved.
The discussion will extend to how AI can be used to transform XML content. In particular, the focus will be on the use of AI XPath extension functions in XSLT, Schematron, Schematron Quick Fixes, or for XML content refactoring.
The presentation aims to deliver a comprehensive overview of AI usage in XML development, providing attendees with the necessary knowledge to make informed decisions. Whether you’re at the early stages of adopting AI or considering integrating it in advanced XML development, this presentation will cover all levels of expertise.
By highlighting the potential advantages and challenges of integrating AI with XML development tools and languages, the presentation seeks to inspire thoughtful conversation around the future of XML development. We’ll not only delve into the technical aspects of AI-powered XML development but also discuss practical implications and possible future directions.
2. Agenda
• Introduction
• How it works
• Data Processing
• Advanced Processing
• Monitoring
• Testing
• Best Practices
• Cascading GUI
3. Introduction
• Hadoop coding is non-trivial
• Hadoop looks for one class to run the Map step and another class to run the Reduce step
• What if you need multiple steps in your application? Who coordinates what can run in parallel?
• What if you need non-Hadoop logic between Hadoop steps?
• Cascading lets you chain Operations into data processing workflows
5. Introduction
Pipe lhs = new Pipe( "lhs" );
lhs = new Each( lhs, new SomeFunction() );
lhs = new Each( lhs, new SomeFilter() );
// the "right hand side" assembly head
Pipe rhs = new Pipe( "rhs" );
rhs = new Each( rhs, new SomeFunction() );
// joins the lhs and rhs
Pipe join = new CoGroup( lhs, rhs );
join = new Every( join, new SomeAggregator() );
join = new GroupBy( join );
join = new Every( join, new SomeAggregator() );
// the tail of the assembly
join = new Each( join, new SomeFunction() );
Properties properties = new Properties();
FlowConnector.setApplicationJarClass( properties, Main.class );
FlowConnector flowConnector = new FlowConnector( properties );
Flow flow = flowConnector.connect( "join", source, sink, join );
// execute the flow, block until complete
flow.complete();
6. How it works
• Pipe Assemblies become Flows
• The planner translates a DAG of operations into a DAG of MapReduce jobs
• All MapReduce jobs in a Flow are scheduled in dependency order
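To see how the planner partitioned a Flow into jobs, the plan can be dumped as a DOT graph (see the notes at the end); a minimal sketch, reusing the flow built on the previous slide:
// write the planned DAG to a DOT file for inspection with Graphviz-like tools
flow.writeDOT( "flow.dot" );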
8. Data Processing
• Tuple
• A single ‘row’ of data being processed
• Each column is named
• Data can be accessed by name or by position
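A minimal sketch of both access styles (field and value names are hypothetical):
Fields fields = new Fields( "date", "status", "url" );
Tuple tuple = new Tuple( "2010-06-12", 200, "/index.html" );
// positional access via the Tuple itself
String date = tuple.getString( 0 );
int status = tuple.getInteger( 1 );
// named access via a TupleEntry, which binds Fields to a Tuple
TupleEntry entry = new TupleEntry( fields, tuple );
String url = entry.getString( "url" );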
9. Data Processing
• Tap
• Abstraction on top of Hadoop files
• Allows you to define your own parser for files
• Example Schemes:
• TextLine
• TextDelimited
• SequenceFile
• WritableSequenceFile
Hfs input = new Hfs(new TextLine(), a_hdfsDirectory + "/" + name);
10. Data Processing
• Tap
• Lfs
• Dfs
• Hfs
• MultiSourceTap
• MultiSinkTap
• TemplateTap
• GlobHfs
• S3fs (deprecated)
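A small hedged sketch (paths hypothetical) of tying several inputs into one source with MultiSourceTap; all child taps must share the same Scheme class:
Tap january = new Hfs( new TextLine(), "logs/2010-01" );
Tap february = new Hfs( new TextLine(), "logs/2010-02" );
Tap source = new MultiSourceTap( january, february );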
11. Data Processing
• TemplateTap
TemplateTap can be used to write tuple streams out to subdirectories based on the values in the Tuple instance.
12. Data Processing
• TemplateTap
TextDelimited scheme = new TextDelimited( new Fields( "year", "month", "entry" ), "\t" );
Hfs tap = new Hfs( scheme, path );
String template = "%s-%s"; // dirs named "year-month"
Tap months = new TemplateTap( tap, template, SinkMode.REPLACE );
16. Data Processing
• Pipe
• The base class for the core processing model types
• Each
• For each tuple in the data, do this to it
• GroupBy
• Similar to a ‘group by’ in SQL
• CoGroup
• Joins two or more tuple streams together
• Every
• Applies an Aggregator (like count or sum) or a Buffer (a sliding window) Operation to every group of Tuples that passes through it
• SubAssembly
• Allows nesting reusable pipe assemblies into a Pipe class
17. Data Processing
• CoGroup
• InnerJoin
• OuterJoin
• LeftJoin
• RightJoin
• MixedJoin
Fields lhsFields = new Fields( "url", "word", "count" );
Fields rhsFields = new Fields( "url", "sentence", "count" );
Fields common = new Fields( "url" );
Fields declared = new Fields( "url1", "word", "wd_count", "url2", "sentence", "snt_count" );
Pipe join = new CoGroup( lhs, common, rhs, common, declared, new InnerJoin() );
18. Data Processing
• Operation
• Defines what to do on the data
• Each operations apply logic to a single row, such as parsing dates or creating new attributes
• Every operations iterate over a ‘group’ of rows to do non-trivial operations
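A hedged sketch of what such an Each operation looks like as code (the class and field names are hypothetical, not from the deck); a custom Function extends BaseOperation:
import cascading.flow.FlowProcess;
import cascading.operation.BaseOperation;
import cascading.operation.Function;
import cascading.operation.FunctionCall;
import cascading.tuple.Fields;
import cascading.tuple.Tuple;

public class ParseYear extends BaseOperation implements Function
{
  public ParseYear()
  {
    super( 1, new Fields( "year" ) ); // one argument in, one "year" field out
  }

  public void operate( FlowProcess flowProcess, FunctionCall functionCall )
  {
    // take the incoming date string, e.g. "2010-06-12", and emit its year
    String date = functionCall.getArguments().getString( 0 );
    functionCall.getOutputCollector().add( new Tuple( date.substring( 0, 4 ) ) );
  }
}
It would be applied to a stream as: pipe = new Each( pipe, new Fields( "date" ), new ParseYear() );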
19. Data Processing
• Function
• Identity Function
• Debug Function
• Sample and Limit Functions
• Insert Function
• Text Functions
• Regular Expression Operations
• Java Expression Operations
• "first-name" is a valid field name for use with Cascading, but the expression first-name.trim() will fail to compile
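A hedged example of the expression operations (the status field name is hypothetical); the expression is compiled Java, so field names must be valid Java identifiers:
// remove each Tuple for which the expression evaluates to true
ExpressionFilter filter = new ExpressionFilter( "status != 200", Integer.TYPE );
pipe = new Each( pipe, new Fields( "status" ), filter );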
22. Data Processing
• Buffer
• Very similar to the typical Reducer interface
• Very useful when header or footer values need to be inserted into a grouping, or when values need to be inserted into the middle of the group values
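A hedged sketch of a custom Buffer that inserts a header before each group's values (class and field names hypothetical):
import java.util.Iterator;
import cascading.flow.FlowProcess;
import cascading.operation.BaseOperation;
import cascading.operation.Buffer;
import cascading.operation.BufferCall;
import cascading.tuple.Fields;
import cascading.tuple.Tuple;
import cascading.tuple.TupleEntry;

public class InsertHeader extends BaseOperation implements Buffer
{
  public InsertHeader()
  {
    super( new Fields( "value" ) );
  }

  public void operate( FlowProcess flowProcess, BufferCall bufferCall )
  {
    // emit a header Tuple before the group's own values
    bufferCall.getOutputCollector().add( new Tuple( "HEADER" ) );

    // the Buffer is handed an Iterator over the group's argument Tuples
    Iterator iterator = bufferCall.getArgumentsIterator();

    while( iterator.hasNext() )
    {
      TupleEntry arguments = (TupleEntry) iterator.next();
      bufferCall.getOutputCollector().add( new Tuple( arguments.getTuple() ) );
    }
  }
}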
25. Data Processing
• Flow
• To create a Flow, it must be planned through the FlowConnector object. The connect() method creates new Flow instances based on a set of source Taps, sink Taps, and a pipe assembly.
Flow flow = new FlowConnector( new Properties() ).connect( "flow-name", source, sink, pipe );
flow.complete();
26. Data Processing
• MapReduceFlow
• A Flow subclass that supports custom MapReduce jobs pre-configured via the JobConf object
• ProcessFlow
• A Flow subclass that supports custom Riffle jobs
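A hedged sketch of wrapping a pre-configured MapReduce job so it can be scheduled alongside Cascading Flows (job configuration elided; the name is hypothetical):
JobConf conf = new JobConf();
// ... set mapper, reducer, input and output paths as usual ...
Flow mrFlow = new MapReduceFlow( "custom-mr", conf, true ); // true: delete the sink on init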
27. Data Processing
• Cascades
• Groups of Flows are called Cascades
• Custom MapReduce jobs can participate in a Cascade
CascadeConnector cascadeConnector = new CascadeConnector();
Cascade cascade = cascadeConnector.connect( flow1, flow2, flow3 );
cascade.complete();
28. Advanced Processing
• Stream Assertions
• Unit and regression tests for Flows
• The planner can remove ‘strict’, ‘validating’, or all assertions
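A hedged sketch (the pipe and properties variables are assumed to exist already):
// fail fast if any argument value is null; STRICT assertions can be
// planned out of the Flow entirely for production runs
pipe = new Each( pipe, AssertionLevel.STRICT, new AssertNotNull() );
// strip all assertions at planning time
FlowConnector.setAssertionLevel( properties, AssertionLevel.NONE );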
29. Advanced Processing
• Failure Traps
• Catch the data that causes Operations or Assertions to fail
• Allow processing to continue without data loss
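A hedged sketch (paths hypothetical), assuming the trap-aware connect() overload that routes offending Tuples to a trap Tap instead of failing the Flow:
Tap source = new Hfs( new TextLine(), "input/path" );
Tap sink = new Hfs( new TextLine(), "output/path", SinkMode.REPLACE );
Tap trap = new Hfs( new TextLine(), "trap/path", SinkMode.REPLACE );
Flow flow = flowConnector.connect( "trap example", source, sink, trap, pipe );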
30. Advanced Processing
• Partial Aggregation instead of Combiners
• Trades memory for IO gains by caching values
Fields groupingFields = new Fields( "date" );
Fields valueField = new Fields( "size" );
Fields sumField = new Fields( "total-size" );
assembly = new SumBy( assembly, groupingFields, valueField, sumField, long.class );
33. Testing
• Use ClusterTestCase if you want to launch an embedded Hadoop cluster inside your TestCase
• A few validation and Hadoop helper functions are provided
• Does not support the Hadoop 0.21 testing library
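A hedged sketch of a test built on this harness, assuming the ClusterTestCase( name, enableCluster ) constructor and the copyFromLocal/getProperties helpers listed in the notes at the end; everything else here is hypothetical:
import java.io.IOException;
import cascading.ClusterTestCase;

public class MyFlowTest extends ClusterTestCase
{
  public MyFlowTest()
  {
    super( "my flow test", true ); // true: start the embedded Hadoop cluster
  }

  public void testFlow() throws IOException
  {
    copyFromLocal( "src/test/data/input.txt" ); // stage test data into the mini DFS

    // build source/sink Taps and a pipe assembly here, then:
    // Flow flow = new FlowConnector( getProperties() ).connect( source, sink, pipe );
    // flow.complete(); and assert on the sink's contents
  }
}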
34. Cascading GUI
• Yahoo Pipes
Pipes is a powerful composition tool to aggregate, manipulate, and mash up content from around the web.
35. Cascading GUI
• WireIt
WireIt is an open-source JavaScript library for creating web-based wirable interfaces for dataflow applications, visual programming languages, graphical modeling, and graph editors.
Cascading and its extensions have their own Maven/Ivy jar repository. The 1.2 release runs against Hadoop 0.19.x and 0.20.x (including Amazon Elastic MapReduce), and 0.21. Hive is a data warehouse infrastructure built on top of Hadoop that provides tools for easy data summarization, ad hoc querying, and analysis of large datasets stored in Hadoop files. At one level Cascading is a MapReduce query planner, just like Pig, except that the Cascading API is meant for public consumption and is fully extensible: in Pig you typically interact with the PigLatin text syntax, whereas with Cascading you can layer your own syntax on top of the API. Given a data set on which you want to run a number of group-bys (group by key1, generate value1, ... group by keyN, generate valueN), Cascading's primary programming model is similar to Pig's but with a Java API; Pig would optimize from N down to a smaller number (e.g. 1) of reduce runs. Oozie workflows are actions arranged in a control-dependency DAG (Directed Acyclic Graph). Cascading runs as a client from the command line, while Oozie is a server system (like the Hadoop JobTracker) to which you submit workflow jobs and later check their status.
By providing a clean API to the core Cascading model, tools like Jython, Groovy, and JRuby can be used instead to define complex processing flows.
The MapReduce Job Planner is an internal feature of Cascading. Every job is delimited by a temporary file that is the sink of the first job and the source of the next job. The temporary file is deleted whether the flow succeeds or fails, though this is configurable. If two or more Flow instances have no dependencies, they are submitted together so they can execute in parallel. Cascading builds an internal directed acyclic graph (DAG) that makes each Flow a 'vertex' and each file an 'edge'; when a vertex has all of its incoming edges (files) available, it is scheduled on the cluster, in topological order. By default, if any outputs from a Flow are newer than the inputs, the Flow is skipped. The combiner and partitioner cannot be customized.
Various tools can parse DOT files; DOT is a plain-text graph description language. To see how your Flows are partitioned, call the Flow#writeDOT() method, which writes out a DOT file. The writeDOT API is not useful for logging.
All Taps must have a Scheme associated with them. If the Tap is about where the data is and how to get it, the Scheme is about what the data is.
- TextLine reads and writes raw text files and returns Tuples with two field names by default, "offset" and "line".
- TextDelimited handles delimited text files (csv, tsv, etc.).
- SequenceFile is based on the Hadoop sequence file, which is a binary format.
- WritableSequenceFile is like the SequenceFile Scheme, except it is designed to read and write key and/or value Hadoop Writable objects directly.
- MultiSourceTap: cascading.tap.MultiSourceTap is used to tie multiple Tap instances into a single Tap for use as an input source. The only restriction is that all the Tap instances passed to a new MultiSourceTap share the same Scheme classes (not necessarily the same Scheme instance).
- MultiSinkTap: cascading.tap.MultiSinkTap is used to tie multiple Tap instances into a single Tap for use as an output sink. At runtime, for every Tuple output by the pipe assembly, each child tap of the MultiSinkTap sinks the Tuple.
- TemplateTap can be used to write tuple streams out to subdirectories based on the values in the Tuple instance. The constructor takes an Hfs Tap and a Formatter format syntax String, which allows Tuple values at given positions to be used as directory names. Note that Hadoop can only sink to directories, and all files in those directories are "part-xxxxx" files. openTapsThreshold limits the number of open output files; it defaults to 300 files, and each time the threshold is exceeded, 10% of the least recently used open files are closed. (See the code example on slide 12.)
- GlobHfs extends MultiSourceTap: the cascading.tap.GlobHfs Tap accepts Hadoop-style 'file globbing' expression patterns, allowing all paths that match the given pattern to be used as a single source. The semantics of file globbing with a PathFilter (using the globStatus method of FileSystem) changed: previously the filtering was too restrictive, so that a glob of /*/* and a filter that only accepts /a/b would not have matched /a/b; with this change, /a/b does match.
- SinkMode.KEEP: the default behavior; if the resource exists, attempting to write to it will fail.
- SinkMode.REPLACE: allows Cascading to delete the file immediately after the Flow is started.
- SinkMode.UPDATE: allows for new Tap types that have a concept of update or append, for example updating records in a database. It is up to the Tap to decide how to implement its "update" semantics; when Cascading sees update mode, it knows not to attempt to delete the resource first and not to fail because the resource already exists.
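For instance (path hypothetical):
Tap sink = new Hfs( new TextLine(), "output/path", SinkMode.REPLACE );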
Avro is a data serialization system; it provides functionality similar to systems such as Thrift and Protocol Buffers. Cascading.SimpleDB provides integration with Amazon SimpleDB.
It is not required that an Every follow either a GroupBy or a CoGroup; an Each may follow immediately after either. But an Every may not follow an Each (example: DISTINCT). The Each pipe may only apply Functions and Filters to the tuple stream, as these operations work on one Tuple at a time. The Every pipe may only apply Aggregators and Buffers, as these operations work on groups of tuples, one grouping at a time. GroupBy supports ordering.
Self joins are supported, though in practice a naive self join would fail, since the result Tuple would have duplicate field names. A Mixed join is one where three or more tuple streams are joined and each pair must be joined differently; see the cascading.pipe.cogroup.MixedJoin class for more details. When joining two streams via a CoGroup Pipe, attempt to place the largest of the streams in the left-most argument to the CoGroup: joining multiple streams requires some accumulation of values before the join operator can begin, but the left-most stream will not be accumulated, so this should improve the performance of most joins.
Operation is a superclass of Function, Filter, Aggregator, Buffer, and Assertion. Function and Filter are Each operations; Aggregator and Buffer are Every operations. Custom operations usually extend the BaseOperation class.
- Identity Function: discard unused fields, rename all fields, or rename a single field.
- Debug: the DebugLevel enum values are NONE, DEFAULT, or VERBOSE, set via FlowConnector.setDebugLevel( properties, DebugLevel.NONE );
- Sample: the cascading.operation.filter.Sample filter allows a percentage of tuples to pass.
- Limit: the cascading.operation.filter.Limit filter allows a set number of Tuples to pass.
- Insert: useful when a missing parameter or value, like a date String for the current date, needs to be inserted.
- Text Functions: DateParser, DateFormatter.
- Regular Expression Operations: RegexParser, RegexSplitter.
- Java Expression Operations: ExpressionFunction and ExpressionFilter, e.g. ExpressionFilter filter = new ExpressionFilter( "status != 200", Integer.TYPE ); note that some characters in field names will cause compilation errors.
Operations (Function, Filter, Aggregator, or Buffer) should not store operation state in class fields. For example, if implementing a custom 'counter' Aggregator, do not create a field named 'count' and increment it on every Aggregator.aggregate() call; there is no guarantee your Operation will be called from a single thread in a JVM. Instead, there is a context in which you can record the aggregation value, much as in Hadoop.
A Buffer may only be used with an Every pipe, and it may only follow a GroupBy or CoGroup pipe type. It differs from an Aggregator in that an Iterator is provided, and it is the responsibility of the operate(cascading.flow.FlowProcess, BufferCall) method to iterate over all the input arguments returned by this Iterator, if any. Typical uses are inserting a header or footer, or computing per-group values such as document_id, term, term_count_in_document, total_terms_in_document.
As above, a Buffer may only be used with an Every pipe following a GroupBy or CoGroup. AggregateBy is a SubAssembly.
Input and output schemas are verified before the flow runs. The start() method is an asynchronous call. A Properties object can be set on the FlowConnector, just as you would set a Hadoop JobConf.
Riffle is a lightweight Java library for executing collections of dependent processes as a single process. The library provides Java annotations for tagging classes and methods that support the required life-cycle stages: riffle.process.DependencyIncoming, riffle.process.DependencyOutgoing, riffle.process.ProcessCleanup, riffle.process.ProcessComplete, riffle.process.ProcessPrepare, riffle.process.ProcessStart, and riffle.process.ProcessStop.
Assertions are not pipes. When running tests against regression data, it makes sense to use strict assertions; this regression data should be small and represent many of the edge cases the processing assembly must support robustly. When running tests in staging, or with data that may vary in quality because it comes from an unmanaged source, validating assertions make much sense. And there are obvious cases where assertions just get in the way, slow down processing, and are best bypassed altogether.
Traps were not designed as a filtering mechanism.
Cascading does not support the so-called MapReduce Combiners. Combiners are very powerful in that they reduce the IO between the Mappers and Reducers: why send all of your Mapper data to the Reducers when you can compute some values Map-side and combine them in the Reducer? But Combiners are limited to associative and commutative functions only, like 'sum' and 'max'; and in order to work, values emitted from the Map task must be serialized, sorted (deserialized and compared), deserialized again and operated on, after which the results are serialized and sorted once more. Combiners trade CPU for gains in IO. Since version 1.2, Cascading takes a different approach by providing a mechanism to perform partial aggregations Map-side and combine them Reduce-side, choosing to trade memory for IO gains by caching values (up to a threshold). This bypasses the unnecessary serialization, deserialization, and sorting steps, and it allows any aggregate function to be implemented, not just associative and commutative ones. The AggregateBy class is a SubAssembly that serves two roles for handling aggregate operations; AverageBy, CountBy, and SumBy build on it.
ClusterTestCase provides a MiniDFSCluster, a MiniMRCluster, and a FileSystem. Helper functions include copyFromLocal, getFileSystem, getJobConf, and getProperties. In version 1.1, Limit would only pass half of the records.
WireIt supports Firefox 3.5 and above; it does not work on Firefox 3.0. WireIt is released under the MIT License.