This document discusses building fast data applications with streaming data. It begins by outlining common fast data application patterns like real-time analytics, data pipelines, and fast request/response. It then contrasts streaming approaches like Storm with database approaches like VoltDB. The document argues that streaming operators often require state, which databases can provide through metadata tables and session state tables. It presents VoltDB as a solution that can handle real-time analytics, request/response decisions, and data pipelines with its export functionality to move data to data lakes and OLAP systems.
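To make the state argument above concrete, here is a minimal, hypothetical Python sketch of the per-session state such a streaming operator keeps; the event fields are invented, and a database like VoltDB would hold this state in a session-state table rather than in process memory.

```python
from collections import defaultdict

# Hypothetical running per-session aggregates: the kind of operator state
# the document argues a database can hold in a session-state table.
session_counts = defaultdict(int)    # session_id -> events seen
session_totals = defaultdict(float)  # session_id -> running sum

def on_event(event):
    """Update per-session state for one incoming event."""
    sid = event["session_id"]
    session_counts[sid] += 1
    session_totals[sid] += event["amount"]
    # A real-time analytics query reads the same state the pipeline writes.
    return session_totals[sid] / session_counts[sid]  # running average

for e in [{"session_id": "s1", "amount": 3.0},
          {"session_id": "s1", "amount": 5.0}]:
    print(on_event(e))  # 3.0, then 4.0
```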
This document summarizes Audi's journey in building a hybrid Hadoop platform between their on-premise data centers and the AWS cloud. It describes how Audi formed an agile team with internal and external experts to build out the Hybrid Audi Analytic Platform (HAAP) using technologies like Hadoop, Kafka, and FreeIPA across environments. The project aimed to provide a single platform and user experience spanning on-premise and cloud while taking advantage of cloud scalability and functionality. Lessons learned included the need for strong automation, security considerations, and knowledge sharing between distributed teams.
Apache Hadoop India Summit 2011 talk "Data Integration on Hadoop" by Sanjay K... (Yahoo Developer Network)
This document discusses using Informatica tools for data integration on Hadoop. It describes how Informatica enables leveraging existing data transformations and quality logic from within Hadoop applications by allowing mapplets to be invoked from Hadoop and mappings to be deployed and run on Hadoop clusters. This bridges the gap between Hadoop and traditional data integration and provides connectivity to various data sources and specialized transformations.
Next gen tooling for building streaming analytics apps: code-less development... (DataWorks Summit)
Building apps with less code and more abstractions, unit and integration test frameworks/harnesses, and continuous integration and delivery pipelines with Jenkins are all best practices and proven patterns in the enterprise for traditional apps. But they are difficult to follow with distributed systems like streaming analytics apps.
In this talk we will introduce some powerful new open source tooling based on Hortonworks Streaming Analytics Manager (SAM) that allows enterprises to easily implement these best practices. In 35 minutes, we will walk through the building of a complex streaming analytics app using a drag-and-drop approach (similar to Apache NiFi) that does the following:
• Joining streams
• Aggregations over windows (time or count based; see the sketch below)
• Complex event processing
• Real-time streaming enrichment
• Model scoring on the stream
Once the app is built, we will create automated unit and integration tests of the streaming application, build continuous integration and delivery pipelines with Jenkins, and introduce powerful new monitoring, management, and dev-ops troubleshooting tools.
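A minimal Python sketch of the count-based tumbling window referenced in the bullet list above; SAM assembles the equivalent operator visually via drag and drop, so this only illustrates the semantics, not SAM's API.

```python
from collections import deque

def count_window_averages(events, window_size=3):
    """Emit the average of every full window of `window_size` events."""
    buf = deque()
    for value in events:
        buf.append(value)
        if len(buf) == window_size:  # window is full: aggregate and reset
            yield sum(buf) / window_size
            buf.clear()

print(list(count_window_averages([2, 4, 6, 1, 3, 5])))  # [4.0, 3.0]
```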
Speaker
George Vetticaden, VP of Product Management, Hortonworks
Apache Hadoop YARN is the modern distributed operating system for big data applications. It morphed the Hadoop compute layer into a common resource management platform that can host a wide variety of applications. Many organizations leverage YARN to build their applications on top of Hadoop without repeatedly worrying about resource management, isolation, multi-tenancy issues, etc.
In this talk, we’ll start with the current status of Apache Hadoop YARN—how it is used today in deployments large and small. We'll then move on to the exciting present and future of YARN—features that are further strengthening YARN as the first-class resource management platform for data centers running enterprise Hadoop.
We’ll discuss the current status as well as the future promise of features and initiatives like: powerful container placement, global scheduling, support for machine learning and deep learning workloads through GPU and FPGA support, extreme scale with YARN federation, containerized apps on YARN, support for long running services (alongside applications) natively without any changes, seamless application upgrades, powerful scheduling features like application priorities, intra-queue preemption across applications, and operational enhancements including insights through Timeline Service V2, a new web UI, and better queue management.
Speakers
Wangda Tan, Staff Software Engineer, Hortonworks
Billie Rinaldi, Principal Software Engineer I, Hortonworks
This document provides an overview and agenda for a developer 2 developer webcast series on microservice architecture and container technologies. It includes details on upcoming webcasts in March and April 2017 focused on microservice architecture, Azure Container Service, Pivotal Cloud Foundry, and Red Hat OpenShift. The document also advertises a webcast on Red Hat OpenShift presented by John Archer on containerization with OpenShift and how it enables modern application development.
Pivotal Big Data Suite: A Technical Overview (VMware Tanzu)
Pivotal provides a suite of big data products including Pivotal Greenplum Database, Pivotal HDB, and Pivotal GemFire. Greenplum Database is an open source massively parallel processing data warehouse. HDB is an open source analytical database for Apache Hadoop. GemFire is an open source application and transaction data grid. The suite provides a complete platform for big data with deployment options, advanced data services, and flexible licensing.
Accelerating query processing with materialized views in Apache Hive (DataWorks Summit)
Over the last few years, the Apache Hive community has been working on advancements that enable a whole new range of use cases for the project, moving it from its batch processing roots towards an interactive SQL query-answering platform. Traditionally, one of the most powerful techniques used to accelerate query processing in data warehouses is the precomputation of relevant summaries, or materialized views.
This talk presents our work on introducing materialized views and automatic query rewriting based on those materializations in Apache Hive. In particular, materialized views can be stored natively in Hive or in other systems such as Druid using custom storage handlers, and they can seamlessly exploit exciting new Hive features such as LLAP acceleration. The optimizer then relies on Apache Calcite to automatically produce full and partial rewritings for a large set of query expressions comprising projections, filters, joins, and aggregation operations. We describe the current coverage of the rewriting algorithm, how Hive controls important aspects of the life cycle of materialized views such as the freshness of their data, and outline interesting directions for future improvements. We include an experimental evaluation highlighting the benefits that the usage of materialized views can bring to the execution of Hive workloads.
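A short sketch of the feature described above, using Hive 3's materialized-view DDL submitted through the pyhive client; the connection details, table, and column names are assumptions.

```python
from pyhive import hive  # pip install 'pyhive[hive]'

conn = hive.connect(host="hive-llap.example.com", port=10000)  # assumed endpoint
cur = conn.cursor()

# Precompute a summary; the optimizer (via Apache Calcite) can then rewrite
# matching queries to read the materialized view instead of the base table.
cur.execute("""
    CREATE MATERIALIZED VIEW mv_daily_sales AS
    SELECT store_id, to_date(sold_at) AS day, SUM(amount) AS total
    FROM sales GROUP BY store_id, to_date(sold_at)
""")

# This query never names the view, but Hive may answer it from mv_daily_sales.
cur.execute("SELECT store_id, SUM(amount) FROM sales GROUP BY store_id")

# When base data changes, freshness is managed by rebuilding:
cur.execute("ALTER MATERIALIZED VIEW mv_daily_sales REBUILD")
```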
Speaker
Jesus Camacho Rodriguez, Member of Technical Staff, Hortonworks
Curing the Kafka Blindness – Streams Messaging Manager (DataWorks Summit)
Companies that use Kafka today struggle with monitoring and managing Kafka clusters. Kafka is a key backbone of IoT streaming analytics applications, and the challenge is understanding what is going on across the Kafka cluster, including performance, issues, and message flows. No open source tool caters to the needs of the different users that work with Kafka: DevOps/developers, the platform team, and security/governance teams. See how the new Hortonworks Streams Messaging Manager enables users to visualize their entire Kafka environment end-to-end and simplifies Kafka operations.
In this session, learn how SMM visualizes the intricate details of how Apache Kafka functions in real time while surfacing every nuance of tuning, optimizing, and measuring input and output. SMM helps users quickly understand and operate Kafka while providing the much-needed transparency that sophisticated and experienced users need to avoid the pitfalls of running a Kafka cluster.
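SMM is a UI, so there is no code to show for it, but the core signal it surfaces, per-group consumer lag, can be approximated with a few Kafka client calls. A hedged sketch using the kafka-python library; the broker address is a placeholder.

```python
from kafka import KafkaAdminClient, KafkaConsumer  # pip install kafka-python

BOOTSTRAP = "broker-1:9092"  # placeholder broker address

admin = KafkaAdminClient(bootstrap_servers=BOOTSTRAP)
consumer = KafkaConsumer(bootstrap_servers=BOOTSTRAP)

# For every consumer group, compare committed offsets with log-end offsets:
# the difference is the lag that a tool like SMM visualizes per partition.
for group_id, _protocol in admin.list_consumer_groups():
    committed = admin.list_consumer_group_offsets(group_id)
    if not committed:
        continue
    ends = consumer.end_offsets(list(committed))
    for tp, meta in committed.items():
        lag = ends[tp] - meta.offset
        print(f"{group_id} {tp.topic}[{tp.partition}] lag={lag}")
```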
Speaker: Andrew Psaltis, Principal Solution Engineer, Hortonworks
How is it that one system can query terabytes of data, yet still provide interactive query support? This talk will discuss two of the underlying technologies that allow Apache Hive to support fast query response, both on-premise in HDFS and in cloud object stores such as S3 and WASB.
LLAP was introduced in Hive 2.0. It provides standing processes that securely cache Hive’s columnar data and can do query processing without ever needing to start tasks in Hadoop. We will cover LLAP’s architecture, intended use cases, and performance numbers both on-premise and in the cloud.
The second technology is the integration of Hive with Apache Druid. Druid excels at low-latency, interactive queries over streaming data. Its method of storing data makes it very well suited for OLAP style queries. We will cover how Hive can be integrated with Druid to support real-time streaming of data from Kafka and OLAP queries.
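A brief sketch of the Hive–Druid wiring described above. The storage-handler class is the real one from the Hive–Druid integration; the table, datasource, and column names are invented, and `cur` is assumed to be a live Hive cursor (e.g., pyhive, as in the earlier sketch).

```python
# Map an existing Druid datasource into Hive via the Druid storage handler.
cur.execute("""
    CREATE EXTERNAL TABLE page_views_druid
    STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler'
    TBLPROPERTIES ("druid.datasource" = "page_views")
""")

# Hive SQL over the Druid datasource: filters and aggregations are pushed
# down to Druid, which answers them with sub-second OLAP scans.
cur.execute("""
    SELECT `user`, COUNT(*) FROM page_views_druid
    WHERE `__time` >= '2018-01-01' GROUP BY `user`
""")
```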
As Hadoop becomes the de facto big data platform, enterprises deploy HDP across a wide range of physical and virtual environments spanning private and public clouds. This session will cover key considerations for cloud deployment and showcase Cloudbreak for simple and consistent deployment across the cloud providers of your choice.
This talk is about building Audi's big data platform from a first Hadoop PoC to a multi-tenant enterprise platform. Why a big data platform at all? We explain the requirements that drove the development of this platform and explain the decisions we had to make during this journey.
During the process of setting up our big data infrastructure, we often had to find the right balance between enterprise integration and speed. For instance, whether to use the existing Active Directory for both LDAP and the KDC versus setting up our own KDC. Using a shared enterprise service like Active Directory requires following certain naming conventions and accepting restricted access, whereas running our own KDC brings much more flexibility but also adds another component to our platform that we have to maintain. We show the advantages and disadvantages and explain why we decided on a particular approach.
For data ingestion of both batch and streaming data, we use Apache Kafka. We explain why we installed a Kafka cluster separate from our Hadoop platform. We discuss the pros and cons of using the Kafka binary protocol and the HTTP REST protocol, not only from a technical perspective but also from an organisational one, as the source systems are required to push data into Kafka.
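To make that protocol trade-off concrete, here is the same record pushed once with a native Kafka client and once over HTTP. The REST call assumes a Confluent-style REST Proxy; the talk does not say which HTTP gateway Audi actually used, and all addresses, topics, and payload fields are placeholders.

```python
import json

import requests
from kafka import KafkaProducer  # pip install kafka-python

record = {"robot_id": "r17", "torque": 3.2}  # illustrative payload

# Option 1: Kafka binary protocol -- fast, but source systems need a client.
producer = KafkaProducer(
    bootstrap_servers="kafka-1:9092",  # placeholder
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("plant-sensors", record)
producer.flush()

# Option 2: HTTP REST -- any system that can POST can ingest (Confluent
# REST Proxy shown here as an assumed gateway; URL and port are placeholders).
requests.post(
    "http://rest-proxy:8082/topics/plant-sensors",
    headers={"Content-Type": "application/vnd.kafka.json.v2+json"},
    data=json.dumps({"records": [{"value": record}]}),
)
```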
We give an overview of our current architecture including how some use cases are implemented on it. Some of them run exclusively on our new big data stack, while others use it in conjunction with our data warehouse. The use cases cover all different kinds of data from sensory data of robots in our plants to click streams from web applications.
Building an enterprise platform does not consist only of technical tasks but also of organizational ones: data ownership, authorization to access certain data sets, and more financial matters like internal pricing and SLAs.
Although we have already achieved quite a lot, our journey has not yet ended. There are still open topics to address, like providing a unified logging solution for applications spanning multiple platforms, finally offering a notebook tool like Zeppelin to our analysts, and tackling legal issues like GDPR.
We will conclude our talk with a short glimpse into our ongoing extension of our on-premises platform into a hybrid cloud platform.
Speakers
Carsten Herbe, Big Data Architect, Audi AG
Matthias Graunitz, Big Data Architect, Audi AG
Lessons learned running a container cloud on YARN (DataWorks Summit)
Apache Hadoop YARN is the resource and application manager for Apache Hadoop. In the past, YARN only supported launching containers as processes. However, as containerization has become extremely popular, more and more users wanted support for launching Docker containers. With recent changes, YARN now supports running Docker containers alongside process containers. Coupled with the newly added support for long-running services on YARN, this allows a host of new possibilities.
In this talk, we'll present how to run a container cloud on YARN. Leveraging the support in YARN for Docker and long-running services, we can allow users to easily spin up sets of Docker containers for their applications. These containers can be self-contained or wired up to form more complex applications. We will go over some of the lessons we learned from our experiences handling issues such as resource management, debugging application failures, running Docker, and service discovery.
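As a rough illustration of "spinning up sets of Docker containers," here is a hedged sketch against the YARN Services REST API introduced in Hadoop 3; the field names follow the published service spec, while the ResourceManager host, image, and service name are placeholders.

```python
import requests

# Minimal YARN service spec: two Docker containers of one component.
# Field names follow the Hadoop 3 YARN Services API; values are illustrative.
spec = {
    "name": "web-demo",
    "version": "1.0",
    "components": [{
        "name": "web",
        "number_of_containers": 2,
        "artifact": {"id": "library/nginx:latest", "type": "DOCKER"},
        "resource": {"cpus": 1, "memory": "512"},
        "launch_command": "nginx -g 'daemon off;'",
    }],
}

# POST the spec to the ResourceManager to launch the long-running service.
resp = requests.post("http://rm.example.com:8088/app/v1/services", json=spec)
resp.raise_for_status()
```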
Speaker
Billie Rinaldi, Principal Software Engineer I, Hortonworks
Apache Hive is a rapidly evolving project that is much loved by the big data ecosystem. Hive continues to expand its support for analytics, reporting, and interactive queries, and the community is striving to improve it along many other dimensions and use cases. In this talk, we introduce the latest and greatest features and optimizations that landed in the project over the last year, including benchmarks covering LLAP, materialized views and Apache Druid integration, workload management, ACID improvements, using Hive in the cloud, and performance improvements. We will also say a little about what you can expect in the future.
Avoiding Log Data Overload in a CI/CD System While Streaming 190 Billion Even... (DataWorks Summit)
Learn how Pure Storage engineering manages streaming 190B log events per day and makes use of that deluge of data in our continuous integration (CI) pipeline. Our test infrastructure runs over 70,000 tests per day, creating a triage problem that would otherwise require at least 20 triage engineers. Instead, Spark's flexible computing platform allows us to write a single application for both streaming and batch jobs to understand the state of our CI pipeline with a team of 3 triage engineers. Using encoded patterns, Spark indexes log data for real-time reporting (streaming), uses machine learning for performance modeling and prediction (batch), and finds previous matches for newly encoded patterns (batch). Resource allocation in this mixed environment can be challenging; a containerized Spark cluster deployment and disaggregated compute and storage layers allow us to programmatically shift compute resources between the streaming and batch applications. This talk will go over the design decisions made to meet the SLAs of streaming and batch workloads in hardware, data layout, access patterns, and container strategy. We will also go over the challenges, lessons learned, and best practices for similar data pipelines.
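A condensed pyspark sketch of the streaming half of such a pipeline: an "encoded pattern" (here just a toy regex, not Pure Storage's actual patterns) applied to a Kafka log stream with per-minute counts. The broker and topic names are assumptions.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, regexp_extract, window

# Requires the spark-sql-kafka package on the classpath.
spark = SparkSession.builder.appName("ci-log-triage").getOrCreate()

# Read raw CI log lines from Kafka (placeholder broker/topic).
logs = (spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", "broker-1:9092")
        .option("subscribe", "ci-logs")
        .load()
        .selectExpr("CAST(value AS STRING) AS line", "timestamp"))

# Apply an encoded pattern and count matches per minute: the kind of
# real-time index a triage dashboard can read.
failures = (logs
    .withColumn("failure", regexp_extract("line", r"(ERROR|FATAL)\s+(\w+)", 2))
    .where(col("failure") != "")
    .groupBy(window("timestamp", "1 minute"), "failure")
    .count())

(failures.writeStream.outputMode("update").format("console")
    .start().awaitTermination())
```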
Speaker
Joshua Robinson, Founding Engineer, Pure Storage
Hortonworks DataFlow (HDF) 3.3 - Taking Stream Processing to the Next Level (Hortonworks)
The HDF 3.3 release delivers several exciting enhancements and new features, the most noteworthy of which is the addition of support for Kafka 2.0 and Kafka Streams.
https://hortonworks.com/webinar/hortonworks-dataflow-hdf-3-3-taking-stream-processing-next-level/
The document discusses how telecom companies can undergo a data-centric transformation to better leverage customer data and remain competitive. It describes how telecoms are facing new challenges like social media, mobile apps, and customer expectations of better service. It argues telecoms should shift from an app-centric to data-centric model to better integrate and scale their use of data. This will allow them to gain better customer insights and optimize areas like customer experience, new digital services, and network management.
MiNiFi is a recently started sub-project of Apache NiFi that provides a complementary data collection approach, supplementing the core tenets of NiFi in dataflow management by focusing on the collection of data at the source of its creation. Simply put, MiNiFi agents take the guiding principles of NiFi and push them to the edge in a purpose-built design-and-deploy manner. This talk will focus on MiNiFi's features, go over recent developments and prospective plans, and give a live demo of MiNiFi.
The config.yml is available here: https://gist.github.com/JPercivall/f337b8abdc9019cab5ff06cb7f6ff09a
Ken Owens, the CTO of Cisco Intercloud Services, presented on Cisco's migration from MapReduce jobs to Spark jobs for processing customer interaction data. The document discussed Cisco's need to embrace both traditional and hyperscale application deployment across data centers, clouds, and edges. It also covered Cisco's analysis platform requirements, AWS and Cisco Intercloud sizing comparisons, and performance results from testing the migration of MapReduce jobs to Spark on the Cisco Intercloud.
This document discusses data management trends and Oracle's unified data management solution. It provides a high-level comparison of HDFS, NoSQL, and RDBMS databases. It then describes Oracle's Big Data SQL which allows SQL queries to be run across data stored in Hadoop. Oracle Big Data SQL aims to provide easy access to data across sources using SQL, unified security, and fast performance through smart scans.
Modernizing your Application Architecture with Microservices (confluent)
Organizations are quickly adopting microservice architectures to achieve better customer service and improve user experience while limiting downtime and data loss. However, transitioning from a monolithic architecture based on stateful databases to truly stateless microservices can be challenging and requires the right set of solutions.
In this webinar, learn from field experts as they discuss how to convert the data locked in traditional databases into event streams using HVR and Apache Kafka®. They will show you how to implement these solutions through a real-world demo use case of microservice adoption.
You will learn:
-How log-based change data capture (CDC) converts database tables into event streams (see the sketch after this list)
-How Kafka serves as the central nervous system for microservices
-How the transition to microservices can be realized without throwing away your legacy infrastructure
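A minimal sketch of the first bullet above: consuming row-change events from Kafka and folding them into a local materialized view. The event envelope shown is a Debezium-style assumption; the webinar itself produces the stream with HVR.

```python
import json

from kafka import KafkaConsumer  # pip install kafka-python

consumer = KafkaConsumer(
    "customers-cdc",                    # placeholder CDC topic
    bootstrap_servers="broker-1:9092",  # placeholder broker
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)

customers = {}  # primary key -> latest row: the microservice's local view

for msg in consumer:
    change = msg.value  # assumed envelope: {"op": ..., "key": ..., "row": ...}
    if change["op"] in ("insert", "update"):
        customers[change["key"]] = change["row"]
    elif change["op"] == "delete":
        customers.pop(change["key"], None)
    # Every table mutation arrives as an event; replaying the topic from the
    # beginning rebuilds the view, which is what keeps the service stateless.
```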
The document discusses Marketo's migration of their SaaS business analytics platform to Hadoop. It describes their requirement of near real-time processing of 1 billion activities per customer per day at scale. They conducted a technology selection process across various Hadoop components and chose HBase, Kafka, and Spark Streaming. The implementation involved building expertise, designing and building their first cluster, implementing security including Kerberos, validating through passive testing, deploying the new system through a migration, and ongoing monitoring, patching, and upgrading of the new platform. Challenges included managing expertise retention, ZooKeeper performance on VMs, Kerberos integration, and capacity planning for the shared Hadoop cluster.
PayPal datalake journey | teradata - edge of next | san diego | 2017 october ... (Deepak Chandramouli)
PayPal Data Lake Journey | 2017-Oct | San Diego | Teradata Edge of Next
Gimel [http://www.gimel.io] is a Big Data Processing Library, open sourced by PayPal.
https://www.youtube.com/watch?v=52PdNno_9cU&t=3s
Gimel empowers analysts, scientists, and data engineers alike to access a variety of big data and traditional data stores with just SQL or a single line of code (the Unified Data API).
This is possible via a catalog of technical properties abstracted from users, along with a rich collection of data store connectors available in the Gimel library.
A catalog provider can be Hive, user-supplied (at runtime), or UDC.
In addition, PayPal recently open sourced UDC (Unified Data Catalog), which can host and serve the technical metadata of data stores and objects. Visit http://www.unifieddatacatalog.io to experience it first hand.
Real-time Freight Visibility: How TMW Systems uses NiFi and SAM to create sub... (DataWorks Summit)
TMW Systems, a Trimble company, is the industry-leading transportation management software provider. 3PLs, brokers, distribution and supply operations, dedicated and private fleets, commercial carriers, and energy service providers rely on our transportation management systems, our fleet maintenance management software, and our routing and scheduling software to make them more efficient and profitable. Billions of data points exist in the trucking industry, and we at TMW Systems are pioneers in tracking millions of trucks, freight loads, and assets.
The architecture team at TMW leverages NiFi and SAM to deliver this immense volume of data in real time. In this session, you will get a thorough understanding of all the streaming components. We have utilized Apache Kafka, Apache NiFi, and Streaming Analytics Manager to build our real-time data pipeline. We will also discuss real-time event processing using SAM and Schema Registry. Lastly, we will show custom processors in NiFi and SAM that helped us with complex event processing.
Speaker
Krishna Potluri, TMW Systems, A Trimble Company, Big Data Architect
Donnie Wheat, Trimble, Senior Big Data Architect
Druid and Hive Together: Use Cases and Best Practices (DataWorks Summit)
Two popular open source technologies, Druid and Apache Hive, are often mentioned as viable solutions for large-scale analytics. Hive works well for storing large volumes of data, although it is not optimized for ingesting streaming data and making it available for queries in real time. Druid, on the other hand, excels at low-latency, interactive queries over streaming data and at making data available for queries in real time. Although the high-level messaging presented by both projects may lead you to believe they are competing for the same use case, the technologies are in fact extremely complementary.
By combining the rich query capabilities of Hive with the powerful realtime streaming and indexing capabilities of Druid, we can build more powerful, flexible, and extremely low-latency realtime streaming analytics solutions. In this talk we will discuss the motivation for combining Hive and Druid, along with the benefits, use cases, best practices, and benchmark numbers.
The agenda of the talk:
1. Motivation behind integrating Druid with Hive
2. Druid and Hive together - benefits
3. Use Cases with Demos and architecture discussion
4. Best Practices - Do's and Don'ts
5. Performance vs Cost Tradeoffs
6. SSB Benchmark Numbers
Druid: Sub-Second OLAP queries over Petabytes of Streaming Data (DataWorks Summit)
When interacting with analytics dashboards, two key requirements for a smooth user experience are sub-second response time and data freshness. Cluster computing frameworks such as Hadoop or Hive/HBase work well for storing large volumes of data, although they are not optimized for ingesting streaming data and making it available for queries in real time. Long query latencies also make these systems sub-optimal choices for powering interactive dashboards and BI use cases.
In this talk we will present Druid as a complementary solution to existing Hadoop-based technologies. Druid is an open-source analytics data store designed from scratch for OLAP and business intelligence queries over massive data streams. It provides low-latency realtime data ingestion and fast, sub-second, ad hoc data exploration queries.
Many large companies are switching to Druid for analytics, and we will cover how Druid is able to handle massive data streams and why it is a good fit for BI use cases.
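Streaming ingestion with Kafka (agenda item 3 below) is configured by POSTing a supervisor spec to the Druid Overlord. An abridged, hedged sketch: the endpoint is real, but the spec is simplified and the datasource, topic, and dimension names are invented.

```python
import requests

# Abridged Kafka ingestion spec: Druid's supervisor consumes the topic
# continuously and makes rows queryable in (near) real time.
spec = {
    "type": "kafka",
    "dataSchema": {
        "dataSource": "page_views",                       # invented name
        "timestampSpec": {"column": "ts", "format": "iso"},
        "dimensionsSpec": {"dimensions": ["user", "page"]},
        "granularitySpec": {"segmentGranularity": "HOUR"},
    },
    "ioConfig": {
        "topic": "page-views",                            # invented topic
        "consumerProperties": {"bootstrap.servers": "broker-1:9092"},
    },
}

resp = requests.post(
    "http://overlord.example.com:8081/druid/indexer/v1/supervisor", json=spec)
resp.raise_for_status()
```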
Agenda -
1) Introduction and Ideal Use cases for Druid
2) Data Architecture
3) Streaming Ingestion with Kafka
4) Demo using Druid, Kafka and Superset.
5) Recent Improvements in Druid moving from lambda architecture to Exactly once Ingestion
6) Future Work
How to build streaming data applications - evaluating the top contenders (Akmal Chaudhri)
This document provides an overview of VoltDB, a database designed for fast data applications. It discusses VoltDB's architecture and performance benchmarks. It also covers common fast data use cases like real-time analytics, data pipelines, and request/response decisions. Finally, it summarizes new features in VoltDB 5.0 like Hadoop integrations and management tools to accelerate fast data application development.
VoltDB and Flytxt Present: Building a Single Technology Platform for Real-Tim... (VoltDB)
Webinar presented by Prateek Kapadia, CTO, Flytxt and Ryan Betts, CTO, VoltDB.
Prateek and Ryan highlight the technology stack for fast + big data and iterative analytics using Hadoop and VoltDB. In addition, they outline use cases for big data analytics with Hadoop, iterative analytics and machine learning using Spark, and real-time analytics using VoltDB supplemented by big data + iterative analytics.
How to Build Real-Time Streaming Analytics with an In-memory, Scale-out SQL D... (VoltDB)
In this webinar, Ryan Betts, CTO at VoltDB, explains why streaming aggregation is key to streaming analytics. He also addresses how SQL can be used in combination with streaming aggregation, and the benefits of up-to-date analytics for per-event transactions and insights. You can listen to the webinar here: http://learn.voltdb.com/WRSQLDatabase.html
Self-Service BI for big data applications using Apache Drill (Big Data Amster... (Dataconomy Media)
Modern big data applications such as social, mobile, web, and IoT deal with a larger number of users and a larger amount of data than traditional transactional applications. The datasets associated with these applications evolve rapidly, are often self-describing, and can include complex types such as JSON and Parquet. In this demo we will show how Apache Drill can be used to provide low-latency queries natively on rapidly evolving, multi-structured datasets at scale.
Akmal Chaudhri - How to Build Streaming Data Applications: Evaluating the Top... (NoSQLmatters)
Building applications on streaming data has its challenges. If you are trying to use programs such as Apache Spark or Storm to build applications, this presentation will explain the advantages and disadvantages of each solution and how to choose the right tool for your next streaming data project. Building streaming data applications that can manage the massive quantities of data generated from mobile devices, M2M, sensors, and other IoT devices is a big challenge that many organizations face today. Traditional tools, such as conventional database systems, do not have the capacity to ingest data, analyze it in real time, and make decisions. New technologies such as Apache Spark and Storm are now coming to the forefront as possible solutions for handling fast data streams. Typical technology choices fall into one of three categories: OLAP, OLTP, and stream-processing systems. Each of these solutions has its benefits, but some choices support streaming data and application development much better than others. Employing a solution that handles streaming data, provides state, ensures durability, and supports transactions and real-time decisions is key to benefitting from fast data. During this presentation you will learn:
- The difference between fast OLAP, stream-processing, and OLTP database solutions.
- The importance of state, real-time analytics, and real-time decisions when building applications on streaming data.
- How streaming applications deliver more value when built on a super-fast in-memory SQL database.
The document discusses new features and enhancements in Apache Hive 3.0 including:
1. Improved transactional capabilities with ACID v2 that provide faster performance compared to previous versions while also supporting non-bucketed tables and non-ORC formats (see the sketch after this list).
2. New materialized view functionality that allows queries to be rewritten to improve performance by leveraging pre-computed results stored in materialized views.
3. Enhancements to LLAP workload management that improve query scheduling and enable better sharing of resources across users.
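A short sketch of point 1 in the list above, using Hive 3's transactional-table DDL and MERGE statement; the table and column names are invented, and `cur` is assumed to be a Hive cursor (e.g., pyhive).

```python
# ACID v2 makes ORC tables transactional; the property is shown explicitly.
cur.execute("""
    CREATE TABLE accounts (id INT, balance DECIMAL(10,2))
    STORED AS ORC TBLPROPERTIES ('transactional' = 'true')
""")

# Upserts arrive in a staging table and are applied atomically with MERGE.
cur.execute("""
    MERGE INTO accounts AS t USING account_updates AS u ON t.id = u.id
    WHEN MATCHED THEN UPDATE SET balance = u.balance
    WHEN NOT MATCHED THEN INSERT VALUES (u.id, u.balance)
""")
```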
Combining Hadoop RDBMS for Large-Scale Big Data Analytics (DataWorks Summit)
When working with structured, semi-structured, and unstructured data, there is often a tendency to try and force one tool – either Hadoop or a traditional DBMS – to do all the work. At Vertica, we’ve found that there are reasons to use Hadoop for some analytics projects, and Vertica for others, and the magic comes in knowing when to use which tool and how these two tools can work together. Join us as we walk through some of the use cases for using Hadoop with a purpose-built analytics platform for an effective, combined analytics solution.
Building Scalable Big Data Infrastructure Using Open Source Software Presenta... (ssuserd3a367)
1) StumbleUpon uses open source tools like Kafka, HBase, Hive and Pig to build a scalable big data infrastructure to process large amounts of data from its services in real-time and batch.
2) Data is collected from various services using Kafka and stored in HBase for real-time analytics. Batch processing is done using Pig and data is loaded into Hive for ad-hoc querying.
3) The infrastructure powers various applications like recommendations, ads and business intelligence dashboards.
Using Familiar BI Tools and Hadoop to Analyze Enterprise Networks (DataWorks Summit)
This document discusses using Apache Drill and business intelligence (BI) tools to analyze network data stored in Hadoop. It provides examples of querying network packet captures and APIs directly using SQL without needing to transform or structure the data first. This allows gaining insights into issues like dropped sensor readings by analyzing packets alongside other data sources. The document concludes that SQL-on-Hadoop technologies allow network analysis to be done in a BI context more quickly than traditional specialized tools.
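The "query it where it lands" idea described above is easy to demonstrate against Drill's REST endpoint; the endpoint and response shape are Drill's, while the file path, columns, and port assume a hypothetical default local install.

```python
import requests

# Drill queries the raw JSON capture file in place: no ETL, no declared schema.
resp = requests.post(
    "http://localhost:8047/query.json",  # Drill's default web port
    json={
        "queryType": "SQL",
        "query": "SELECT src_ip, COUNT(*) AS drops "
                 "FROM dfs.`/data/captures/packets.json` "
                 "WHERE flag = 'DROP' GROUP BY src_ip",
    },
)
for row in resp.json()["rows"]:
    print(row)
```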
Elasticsearch + Cascading for Scalable Log Processing (Cascading)
Supreet Oberoi's presentation on "Large scale log processing with Cascading & Elasticsearch". Elasticsearch is becoming a popular platform for log analysis with its ELK stack: Elasticsearch for search, Logstash for centralized logging, and Kibana for visualization. Complemented with Cascading, the application development platform for building data applications on Apache Hadoop, developers can correlate multiple log and data streams at scale to perform rich and complex log processing before making the results available to the ELK stack.
Using Familiar BI Tools and Hadoop to Analyze Enterprise Networks (MapR Technologies)
This document discusses using Apache Drill and business intelligence (BI) tools to analyze network data stored in Hadoop. It provides examples of querying network packet captures, OpenStack data, and TCP metrics using SQL with tools like Tableau and SAP Lumira. The key benefits are interacting with diverse network data sources like JSON and CSV files without preprocessing, and gaining insights by combining network data with other data sources in the BI tools.
Bring Your SAP and Enterprise Data to Hadoop, Kafka, and the CloudDataWorks Summit
This document discusses how organizations can leverage data and analytics to power their business models. It provides examples of Fortune 100 companies that are using Attunity products to build data lakes and ingest data from SAP and other sources into Hadoop, Apache Kafka, and the cloud in order to perform real-time analytics. The document outlines the benefits of Attunity's data replication tools for extracting, transforming, and loading SAP and other enterprise data into data lakes and data warehouses.
The data you need to manage isn't getting smaller or slower. It's a snowball, compounding in both speed and volume. If you're building applications on fast, streaming data, you need to analyze it, gain insight, and take action on it now, not at the end of a batch job. Listen to Peter Vescuso discuss the lessons learned from a real fast data use case.
Couchbase Cloud No Equal (Rick Jacobs, Couchbase) Kafka Summit 2020HostedbyConfluent
This session will describe and demonstrate the longstanding integration between Couchbase Server and Apache Kafka and will include descriptions of both the mechanics of the integration and practical situations when combining these products is appropriate.
Big Data Day LA 2016/ Use Case Driven track - Hydrator: Open Source, Code-Fre...Data Con LA
This talk will present how to build data pipelines with no code using the open-source, Apache 2.0, Cask Hydrator. The talk will continue with a live demonstration of creating data pipelines for two use cases.
Modernizing Global Shared Data Analytics Platform and our Alluxio JourneyAlluxio, Inc.
Data Orchestration Summit 2020 organized by Alluxio
https://www.alluxio.io/data-orchestration-summit-2020/
Modernizing Global Shared Data Analytics Platform and our Alluxio Journey
Sandipan Chakraborty, Director of Engineering (Rakuten)
About Alluxio: alluxio.io
Engage with the open source community on slack: alluxio.io/slack
Data Vault 2.0 is a data modeling methodology designed for developing enterprise data warehouses. It was developed by Dan Linstedt in response to the shortcomings of previous data modeling methodologies, such as the Kimball methodology and Inmon methodology, for managing large volumes of data from disparate sources.
An introduction to Bayesian Statistics using Pythonfreshdatabos
This document provides an introduction to Bayesian statistics and inference through examples. It begins with an overview of Bayes' Theorem and probability concepts. An example problem about cookies in bowls is used to demonstrate applying Bayes' Theorem to update beliefs based on new data. The document introduces the Pmf class for representing probability mass functions and working through examples numerically. Further examples involving dice and trains reinforce how to build likelihood functions and update distributions. The document concludes with a real-world example of analyzing whether a coin is biased based on spin results.
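To make the update step concrete, here is a minimal Python sketch of the cookie problem described above, using a plain dict in place of the Pmf class from the talk (the bowl contents shown are the classic ones: Bowl 1 with 30 vanilla and 10 chocolate cookies, Bowl 2 with 20 of each; the talk's exact numbers may differ):

```python
# Minimal Bayes update for the cookie problem: two bowls, draw one vanilla
# cookie, then update the probability of each bowl.

priors = {"bowl1": 0.5, "bowl2": 0.5}
likelihood_vanilla = {"bowl1": 30 / 40, "bowl2": 20 / 40}

# Unnormalized posterior is prior * likelihood; then normalize to sum to 1.
posterior = {h: priors[h] * likelihood_vanilla[h] for h in priors}
total = sum(posterior.values())
posterior = {h: p / total for h, p in posterior.items()}

print(posterior)  # {'bowl1': 0.6, 'bowl2': 0.4}
```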
The document discusses common problems that can arise when analyzing data from news stories. It examines two fictional news stories and identifies potential issues with how the data was collected and presented, including small sample sizes, selection bias, outliers skewing averages, and lack of context. The key takeaways are to consider how representative samples are, watch out for outliers, understand the sampling method, and make sure to provide appropriate context for data.
Big But Personal Data: How Human Behavior Bounds Privacy and What We Can We D...freshdatabos
This document discusses how human behavior and mobility data can be used to uniquely identify individuals, even when names and phone numbers are removed. It shows that as few as four spatiotemporal points are enough to identify 95% of individuals in a dataset of 1.5 million people. Coarsening the data by lowering spatial or temporal resolution does not fully address this issue, as a few more data points would still allow re-identification. The document also discusses how metadata can be used to predict individuals' personalities, and how this data could be used to map populations and model disease spread if handled privately and securely through techniques like anonymization, behavioral indicators instead of raw data, and systems like openPDS that only share answers instead of the raw data itself.
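The following toy Python sketch (fully synthetic data, not the study's) illustrates the kind of uniqueness measurement described: it checks how many synthetic users are pinned down uniquely by four (location, hour) points drawn from their own traces:

```python
import random

# Toy sketch of trace uniqueness: generate synthetic (location, hour) traces,
# then check how many users are uniquely identified by K points from their
# own trace. Entirely synthetic; real mobility data is far more skewed,
# which is why so few points suffice in practice.

random.seed(0)
N_USERS, N_POINTS, N_LOCATIONS, K = 1000, 50, 100, 4

traces = {
    u: {(random.randrange(N_LOCATIONS), random.randrange(24))
        for _ in range(N_POINTS)}
    for u in range(N_USERS)
}

unique = 0
for u, trace in traces.items():
    sample = set(random.sample(sorted(trace), K))  # K known points about user u
    matches = [v for v, t in traces.items() if sample <= t]
    if matches == [u]:
        unique += 1

print(f"{unique / N_USERS:.0%} of users uniquely identified by {K} points")
```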
This document discusses various techniques for visualizing networks, including different layout algorithms. It begins by defining what a network is as a data structure of entities and relationships. It then covers topics like matrix representations, arc/linear layouts, circular/chord layouts, and hierarchical edge bundling. It also discusses simple network measures like degree and betweenness centrality that can provide insight into a network's structure. The document provides many examples and references to external resources on network visualization.
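As a small illustration of those measures, here is a Python sketch using networkx (one common choice, not necessarily the talk's) to compute degree and betweenness centrality on a classic example graph:

```python
import networkx as nx

# Compute the simple network measures mentioned above on a small example graph.
G = nx.karate_club_graph()  # classic 34-node social network bundled with networkx

degree = nx.degree_centrality(G)
betweenness = nx.betweenness_centrality(G)

# Nodes that score high on betweenness sit on many shortest paths and often
# act as bridges between communities.
top = sorted(betweenness, key=betweenness.get, reverse=True)[:3]
for node in top:
    print(node, f"degree={degree[node]:.2f}", f"betweenness={betweenness[node]:.2f}")
```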
In Defense of Imprecision: Why Traditional Approaches to Data Visualization a...freshdatabos
In the worlds of research, science, and academia, much attention is given to precision, objectivity, and de-biasing data. The ability to "lie with data" is a legitimate concern. In business analytics, and the burgeoning area of consumer-facing data visualization, though, complete objectivity is not the singular goal, because these situations are about business and personal decision-making: processes that are filled with subjectivity, and also with creativity and intuition.
Mark Schindler is co-founder and Managing Director of GroupVisual.io, a Cambridge, MA consultancy that specializes in making Big Data consumable for human beings through the design of data visualization and data-driven user experiences. For over 15 years, he has designed business analytics and visualization systems for clients like Johnson & Johnson, GE, and Eli Lilly.
Data Science on a Budget: Maximizing Insight and Impact - Nicholas Arcolano PhDfreshdatabos
Talk Abstract
Many companies have "big data", but not every company has the resources (or need) for a big data team. In this talk we will discuss lessons I've learned from working as part of a small team within a fast-moving mobile start-up and techniques for getting the most out of your data on a budget.
Speaker Bio:
Nicholas is Senior Data Scientist at FitnessKeeper, makers of the RunKeeper and Breeze mobile and web apps for health and fitness tracking and guidance. Before joining FitnessKeeper he spent 10 years as a research staff member at MIT Lincoln Laboratory, working in the Cyber Security and Information Sciences and the Ballistic Missile Defense Divisions. While at Lincoln Laboratory he also received his Ph.D. in Applied Mathematics from Harvard University, where his research focused on numerical methods for the analysis of massive datasets. His areas of expertise and interest include statistical modeling, data mining, machine learning, data visualization, network science, and statistical signal processing.
Vector Space Word Representations - Rani Nelken PhDfreshdatabos
This document discusses vector space word representations, which map words to dense numerical vectors based on their contexts. Words that appear in similar contexts are mapped to vectors with small angles between them. Historical methods included hard and soft clustering, while modern techniques use neural networks like Word2Vec. These representations allow words with similar meanings to have similar vectors, and can be used to solve word analogy problems by performing vector arithmetic. While not perfect, vector representations have improved many NLP tasks such as named entity recognition and sentiment analysis.
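A toy numpy sketch of that vector arithmetic follows; the vectors are hand-built for illustration rather than trained Word2Vec output:

```python
import numpy as np

# Toy word vectors (hand-picked 3-d values, not trained embeddings) that
# roughly encode gender on one axis and royalty on another, just to show
# the analogy arithmetic: king - man + woman ~ queen.
vecs = {
    "king":  np.array([0.9, 0.9, 0.1]),
    "queen": np.array([0.9, 0.1, 0.9]),
    "man":   np.array([0.1, 0.9, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9]),
    "apple": np.array([0.5, 0.5, 0.5]),
}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Answer the analogy by nearest neighbor (excluding the query words).
target = vecs["king"] - vecs["man"] + vecs["woman"]
best = max((w for w in vecs if w not in {"king", "man", "woman"}),
           key=lambda w: cosine(vecs[w], target))
print(best)  # queen
```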
Winning Data Science Competitions (Owen Zhang) - 2014 Boston Data Festivalfreshdatabos
The document discusses tips for winning data science competitions. It outlines the typical structure of competitions, emphasizing the importance of proper validation to avoid overfitting. Sources of competitive advantage include discipline, effort, domain knowledge, feature engineering, and model structure. Gradient boosted machines (GBM) are recommended as a strong baseline model to use, with advice on tuning its parameters. Additional techniques discussed include blending models, handling text and high-cardinality data, and applying lessons from competitions more broadly.
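As one concrete rendering of the "GBM as a strong baseline" advice, here is a hedged scikit-learn sketch (the talk does not prescribe a library); the parameters shown are the usual tuning knobs, and the data is synthetic:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Synthetic data stands in for a competition dataset.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# A reasonable GBM starting point: many shallow trees with a small learning
# rate, then tune n_estimators / learning_rate / max_depth against a proper
# validation scheme to avoid overfitting.
gbm = GradientBoostingClassifier(
    n_estimators=300, learning_rate=0.05, max_depth=3, random_state=0
)
scores = cross_val_score(gbm, X, y, cv=5, scoring="roc_auc")
print(f"CV AUC: {scores.mean():.3f} +/- {scores.std():.3f}")
```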
You Have the Data, Now What? (Chris Lynch) - 2014 Boston Data Festival -freshdatabos
This document provides information about the 2014 Boston Data Festival hosted by Chris Lynch and Atlas Venture. It discusses the growth of the big data market and challenges of big data analytics. It also lists Atlas Venture's investments in several big data startups and describes Hack/Reduce, a Boston-based community space that provides resources for working with big data. The final pages promote a charity event on December 8th to help fight childhood cancer.
FAST AND BIG IN COMBINATION
• Fast Profile
  • In memory: user segmentation, GB to TB (300M+ rows)
  • 10k to 1M+ requests/sec
  • 99th percentile latency under 5ms (5x9's under 50ms)
  • VoltDB export to Vertica
• Big Profile
  • TB to PB of historical data
  • Columnar analytics for fast reporting
  • Real-time ingest of historical data (possibly via VoltDB)
  • Vertica UDX to VoltDB