This is the presentation I gave at the Hadoop User Group Ireland meetup in Dublin. It covers the main ideas of MPP, Hadoop, and distributed systems in general, and also how to choose the best option for your use case.
Tez is the next-generation Hadoop query processing framework built on top of YARN. Computation topologies in higher-level languages like Pig and Hive can be naturally expressed in the new graph dataflow model exposed by Tez. Multi-stage queries can be expressed as a single Tez job, resulting in lower latency for short queries and improved throughput for large-scale queries. MapReduce has been the workhorse of Hadoop, but its monolithic structure has slowed innovation. YARN separates resource management from application logic and thus enabled the creation of Tez, a more flexible and generic framework for data processing that benefits the entire Hadoop query ecosystem.
Apache Iceberg: An Architectural Look Under the Covers (ScyllaDB)
Data lakes have been built with a desire to democratize data: to allow more and more people, tools, and applications to make use of it. A key capability needed to achieve this is hiding the complexity of the underlying data structures and physical storage from users. The de facto standard, the Hive table format, addresses some of these problems but falls short at data, user, and application scale. So what is the answer? Apache Iceberg.
The Apache Iceberg table format is now in use and contributed to by many leading tech companies, including Netflix, Apple, Airbnb, LinkedIn, Dremio, Expedia, and AWS.
Watch Alex Merced, Developer Advocate at Dremio, as he describes the open architecture and performance-oriented capabilities of Apache Iceberg.
You will learn:
• The issues that arise when using the Hive table format at scale, and why we need a new table format
• How a straightforward, elegant change in table format structure has enormous positive effects
• The underlying architecture of an Apache Iceberg table, how a query against an Iceberg table works, and how the table’s underlying structure changes as CRUD operations are performed on it (see the sketch after this list)
• The resulting benefits of this architectural design
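For a concrete feel of that structure, here is a minimal PySpark sketch (not from the talk) showing a normal query, Iceberg's queryable metadata tables, and a snapshot-based time-travel read; the catalog, database, table names, and snapshot id are hypothetical:

```python
# Minimal PySpark sketch: inspecting an Iceberg table's layered metadata.
# Assumes a SparkSession already configured with the Iceberg runtime and a
# catalog named "demo" containing a table "db.events" (hypothetical names).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("iceberg-demo").getOrCreate()

# A normal query; Iceberg resolves it through snapshot -> manifest list ->
# manifests -> data files, pruning files with column-level statistics.
spark.sql("SELECT * FROM demo.db.events WHERE event_date = '2021-01-01'").show()

# Iceberg exposes its metadata tree as queryable metadata tables.
spark.sql("SELECT snapshot_id, operation, summary FROM demo.db.events.snapshots").show()
spark.sql("SELECT file_path, record_count FROM demo.db.events.files").show()

# Time travel: read the table as of an earlier snapshot (id is a placeholder).
old = (spark.read.option("snapshot-id", 1234567890)
       .format("iceberg").load("demo.db.events"))
old.show()
```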
ACID ORC, Iceberg, and Delta Lake—An Overview of Table Formats for Large Scal... (Databricks)
The reality of most large-scale data deployments includes storage decoupled from computation, pipelines operating directly on files, and metadata services with no locking mechanisms or transaction tracking. For this reason, attempts at achieving transactional behavior, snapshot isolation, safe schema evolution, or performant support for CRUD operations have always been marred by tradeoffs.
This talk will focus on technical aspects, practical capabilities and the potential future of three table formats that have emerged in recent years as solutions to the issues mentioned above – ACID ORC (in Hive 3.x), Iceberg and Delta Lake. To provide a richer context, a comparison between traditional databases and big data tools as well as an overview of the reasons for the current state of affairs will be included.
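As a taste of what these formats enable, here is a hedged sketch using Delta Lake's Python API (the delta-spark package); the table path and data are hypothetical, and the ACID ORC and Iceberg equivalents differ in syntax:

```python
# Sketch of transactional behavior on a data lake with Delta Lake.
# Assumes delta-spark is installed and a Delta table already exists at `path`.
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
         .config("spark.sql.catalog.spark_catalog",
                 "org.apache.spark.sql.delta.catalog.DeltaCatalog")
         .getOrCreate())

path = "/data/events"  # hypothetical location of an existing Delta table

# ACID upsert: MERGE is atomic thanks to the transaction log.
target = DeltaTable.forPath(spark, path)
updates = spark.createDataFrame([(1, "click")], ["id", "type"])
(target.alias("t")
 .merge(updates.alias("u"), "t.id = u.id")
 .whenMatchedUpdateAll()
 .whenNotMatchedInsertAll()
 .execute())

# Snapshot isolation enables time travel: read an older table version.
old = spark.read.format("delta").option("versionAsOf", 0).load(path)
```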
After the talk, the audience should have a clear understanding of current development trends in large-scale table formats, on both the conceptual and practical level. This should allow attendees to make better-informed assessments about which approaches to data warehousing, metadata management, and data pipelining they should adopt in their organizations.
This is the presentation I gave at JavaDay Kiev 2015 on the architecture of Apache Spark. It covers the memory model, the shuffle implementations, data frames, and some other high-level topics, and can be used as an introduction to Apache Spark.
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud (Noritaka Sekiyama)
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud (Hadoop / Spark Conference Japan 2019)
# English version #
http://hadoop.apache.jp/hcj2019-program/
Hyperspace is a recently open-sourced (https://github.com/microsoft/hyperspace) indexing sub-system from Microsoft. The key idea behind Hyperspace is simple: Users specify the indexes they want to build. Hyperspace builds these indexes using Apache Spark, and maintains metadata in its write-ahead log that is stored in the data lake. At runtime, Hyperspace automatically selects the best index to use for a given query without requiring users to rewrite their queries. Since Hyperspace was introduced, one of the most popular asks from the Spark community was indexing support for Delta Lake. In this talk, we present our experiences in designing and implementing Hyperspace support for Delta Lake and how it can be used for accelerating queries over Delta tables. We will cover the necessary foundations behind Delta Lake’s transaction log design and how Hyperspace enables indexing support that seamlessly works with the former’s time travel queries.
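The usage pattern might look like the following sketch with the hyperspace Python package; the dataset path, index name, and column names are hypothetical, and enabling Hyperspace's optimizer rules is deployment-specific, so it is omitted here:

```python
# Hedged sketch of the Hyperspace pattern described above.
from pyspark.sql import SparkSession
from hyperspace import Hyperspace, IndexConfig

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("/data/sales")  # hypothetical dataset

hs = Hyperspace(spark)

# Users declare the index; Hyperspace builds it with Spark and records its
# metadata in a write-ahead log stored in the data lake.
hs.createIndex(df, IndexConfig("salesIndex", ["customer_id"], ["amount"]))

hs.indexes().show()  # list indexes and their states
```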
A Deep Dive into Query Execution Engine of Spark SQL (Databricks)
Spark SQL enables Spark to perform efficient and fault-tolerant relational query processing with analytics database technologies. Relational queries are compiled into executable physical plans consisting of transformations and actions on RDDs, with generated Java code. That code is compiled to Java bytecode, executed by the JVM, and optimized by the JIT compiler into native machine code at runtime. This talk takes a deep dive into the Spark SQL execution engine, covering pipelined execution, whole-stage code generation, UDF execution, memory management, vectorized readers, and lineage-based RDD transformations and actions. The generated code can be inspected directly, as shown below.
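To see the whole-stage generated code the talk examines, PySpark (3.x) can print a plan's generated Java, for example:

```python
# Print the whole-stage-codegen output for a simple aggregation plan.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(1000).selectExpr("id % 10 AS k", "id AS v")

agg = df.groupBy("k").sum("v")
agg.explain(mode="codegen")  # dumps the generated Java for each codegen stage
```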
Real-time Analytics with Trino and Apache Pinot (Xiang Fu)
Trino summit 2021:
Overview of the Trino Pinot Connector, which bridges the flexibility of Trino's full SQL support with the power of Apache Pinot's real-time analytics, giving you the best of both worlds.
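As a rough illustration (not from the talk), a Python client can issue full SQL against Pinot-backed tables through Trino; the host, catalog, and table names below are hypothetical:

```python
# Hedged sketch: querying Pinot through Trino with the `trino` Python client.
import trino

conn = trino.dbapi.connect(host="trino.example.com", port=8080,
                           user="analyst", catalog="pinot", schema="default")
cur = conn.cursor()

# Trino pushes down what it can to Pinot and supplies full SQL on top.
cur.execute("SELECT country, count(*) FROM clicks GROUP BY country")
for row in cur.fetchall():
    print(row)
```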
Apache Kafka is becoming the message bus for transferring huge volumes of data from various sources into Hadoop. It is also enabling many real-time frameworks and use cases.
Managing Apache Kafka and building clients around it can be challenging. In this talk, we will go through best practices for deploying Apache Kafka in production: how to secure a Kafka cluster, how to pick topic partitions, upgrading to newer versions, and migrating to the new Kafka producer and consumer APIs. We will also cover best practices for running producers and consumers.
In the Kafka 0.9 release, we added SSL wire encryption, SASL/Kerberos for user authentication, and pluggable authorization. Kafka now supports authentication of users and access control over who can read from and write to a Kafka topic. Apache Ranger also uses a pluggable authorization mechanism to centralize security for Kafka and other Hadoop ecosystem projects.
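As a hedged illustration of these settings on the client side, here is a producer configuration sketch using the confluent-kafka Python client (librdkafka option names); the broker hostname, certificate path, and topic are hypothetical:

```python
# Secure producer sketch: SSL wire encryption + SASL/Kerberos authentication.
from confluent_kafka import Producer

conf = {
    "bootstrap.servers": "broker1.example.com:9093",
    "security.protocol": "SASL_SSL",           # TLS encryption + SASL auth
    "sasl.mechanism": "GSSAPI",                # Kerberos
    "sasl.kerberos.service.name": "kafka",
    "ssl.ca.location": "/etc/ssl/certs/ca.pem",
    # durability-oriented producer settings
    "acks": "all",
    "enable.idempotence": True,
}

p = Producer(conf)
p.produce("secured-topic", key=b"k", value=b"v")
p.flush()
```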
We will showcase an open-sourced Kafka REST API and an Admin UI that help users create topics, reassign partitions, issue Kafka ACLs, and monitor consumer offsets.
Snowflake: The most cost-effective agile and scalable data warehouse ever! (Visual_BI)
In this webinar, the presenter will take you through the most revolutionary data warehouse, Snowflake, with a live demo and technical and functional discussions with a customer. Ryan Goltz from Chesapeake Energy and Tristan Handy, creator of dbt Cloud and founder of Fishtown Analytics, will also be joining the webinar.
ORC files were originally introduced in Hive but have since migrated to an independent Apache project. This has sped up the development of ORC and simplified integrating ORC into other projects, such as Hadoop, Spark, Presto, and NiFi. There are also many new tools built on top of ORC, such as Hive’s ACID transactions and LLAP, which provides incredibly fast reads for your hot data. LLAP also provides strong security guarantees that allow each user to see only the rows and columns they have permission for.
This talk will discuss the details of the ORC and Parquet formats and the relevant tradeoffs, in particular how to lay out your data and which options to use to maximize read performance. We’ll cover when and how to use ORC’s schema evolution, bloom filters, and predicate push-down. It will also show how to use the tooling to translate ORC files into human-readable formats, such as JSON, and to display the rich metadata from the file, including the types in the file and the min, max, and count for each column.
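For example, the bloom filter and compression knobs can be set as ORC table properties; the sketch below issues the DDL through Spark SQL with hypothetical table and column names:

```python
# Hedged example of ORC tuning knobs as table properties, via Spark SQL.
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

spark.sql("""
  CREATE TABLE logs (id BIGINT, ts TIMESTAMP, msg STRING)
  STORED AS ORC
  TBLPROPERTIES (
    'orc.compress'             = 'ZLIB',
    'orc.bloom.filter.columns' = 'id',
    'orc.bloom.filter.fpp'     = '0.05'
  )
""")

# Predicate push-down can then skip stripes/row groups using the min/max
# statistics and bloom filters stored in the file.
spark.sql("SELECT * FROM logs WHERE id = 42").show()
```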
Best practices and lessons learnt from Running Apache NiFi at Renault (DataWorks Summit)
No real-time insight without real-time data ingestion. No real-time data ingestion without NiFi! Apache NiFi is an integrated platform for data flow management at the enterprise level, enabling companies to securely acquire, process, and analyze disparate sources of information (sensors, logs, files, etc.) in real time. NiFi helps data engineers accelerate the development of data flows thanks to its UI and a large number of powerful off-the-shelf processors. However, with great power comes great responsibility: behind NiFi's simplicity, best practices must be respected in order to scale data flows in production and prevent sneaky situations. In this joint presentation, Hortonworks and Renault, a French car manufacturer, will present lessons learnt from real-world projects using Apache NiFi. We will present NiFi design patterns for achieving high performance and reliability at scale, as well as the processes to put in place around the technology for data flow governance. We will also show how these best practices can be implemented in practical use cases and scenarios.
Speakers
Kamelia Benchekroun, Data Lake Squad Lead, Renault Group
Abdelkrim Hadjidj, Solution Engineer, Hortonworks
Delta Lake delivers reliability, security and performance to data lakes. Join this session to learn how customers have achieved 48x faster data processing, leading to 50% faster time to insight after implementing Delta Lake. You’ll also learn how Delta Lake provides the perfect foundation for a cost-effective, highly scalable lakehouse architecture.
Distributed Databases Deconstructed: CockroachDB, TiDB and YugaByte DB (YugabyteDB)
Slides from the webinar "Distributed Databases Deconstructed: CockroachDB, TiDB and YugaByte DB" by Amey Banarse, Principal Data Architect at Yugabyte, recorded on Oct 30, 2019 at 11 AM Pacific.
Playback here: https://vimeo.com/369929255
Hadoop and the Data Warehouse: Point/Counter Point (Inside Analysis)
Robin Bloor and Teradata
Live Webcast on April 22, 2014
Watch the archive:
https://bloorgroup.webex.com/bloorgroup/lsr.php?RCID=2e69345c0a6a4e5a8de6fc72652e3bc6
Can you replace the data warehouse with Hadoop? Is Hadoop an ideal ETL subsystem? And what is the real magic of Hadoop? Everyone is looking to capitalize on the insights that lie in the vast pools of big data. Generating the value of that data relies heavily on several factors, especially choosing the right solution for the right context. With so many options out there, how do organizations best integrate these new big data solutions with the existing data warehouse environment?
Register for this episode of The Briefing Room to hear veteran analyst Dr. Robin Bloor as he explains where Hadoop fits into the information ecosystem. He’ll be briefed by Dan Graham of Teradata, who will offer perspective on how Hadoop can play a critical role in the analytic architecture. Bloor and Graham will interactively discuss big data in the big picture of the data center and will also seek to dispel several common misconceptions about Hadoop.
Visit InsideAnalysis.com for more information.
Experimentation plays a vital role in business growth at eBay by providing valuable insights and predictions on how users will react to changes made to the eBay website and applications. On a given day, eBay has several hundred experiments running at the same time. Our experimentation data processing pipeline handles billions of rows of user behavioral and transactional data per day to generate detailed reports covering 100+ metrics over 50 dimensions.
In this session, we will share our journey of moving this complex process from the data warehouse to Hadoop. We will give an overview of the experimentation platform and data processing pipeline, highlight the challenges and learnings we faced implementing the platform in Hadoop, and show how this transformation led us to build a scalable, flexible, and reliable data processing workflow. We will cover our work on performance optimizations, methods for establishing resilience and configurability, efficient storage formats, and the choice of frameworks used in the pipeline.
From Eric Baldeschwieler's presentation "Hadoop @ Yahoo! - Internet Scale Data Processing" at the 2009 Cloud Computing Expo in Santa Clara, CA, USA. Here's the talk description on the Expo's site: http://cloudcomputingexpo.com/event/session/509
The Anywhere Enterprise – How a Flexible Foundation Opens Doors (Inside Analysis)
The Briefing Room with Dr. Robin Bloor and InfiniDB
Live Webcast on August 12, 2014
Watch the archive:
https://bloorgroup.webex.com/bloorgroup/lsr.php?RCID=1e562c3a4b9e9cb9a054f0ec216d578b
Today’s organizations need all kinds of data, from a wide and growing array of sources. Marshaling all that data into one location can be difficult, even unrealistic. Increasingly, innovative companies are taking a much more distributed approach to storing and processing data. The end result is an information architecture that supports a broader range of business activities, and reduces dependence on costly data movement.
Register for this episode of The Briefing Room to learn from veteran analyst Dr. Robin Bloor as he explains how a distributed approach to data management can open doors to new business opportunities. He’ll be briefed by Jim Tommaney of InfiniDB, who will explain how his company’s database has the flexibility to run on-prem, in the cloud, on cluster file systems, or even on Hadoop’s HDFS. He’ll also show how InfiniDB can serve as a conduit for companies looking to transform their information architecture to better satisfy changing market demands.
Visit InsideAnalysis.com for more information.
The Hadoop Guarantee: Keeping Analytics Running On Time (Inside Analysis)
The Briefing Room with Dr. Robin Bloor and Pepperdata
Live Webcast September 15, 2015
Watch the Archive: https://bloorgroup.webex.com/bloorgroup/lsr.php?RCID=32f198185d9d0c4cf32c27bdd1498b2a
Industry researchers agree: the importance of Hadoop will continue to grow as more companies recognize the range of benefits they can reap, from lower-cost storage to better business insights. At the same time, advances in the Hadoop ecosystem are addressing many of the key concerns that have hampered adoption, including performance and reliability. As a result, Hadoop is fast becoming a first-class citizen in the world of enterprise computing.
Register for this episode of The Briefing Room to hear veteran Analyst Dr. Robin Bloor explain how the Hadoop ecosystem is evolving into a mature foundation for managing enterprise data. He’ll be briefed by Sean Suchter of Pepperdata, who will explain how his company’s software brings predictability and reliability to Hadoop through dynamic, policy-based controls and monitoring. He’ll show how to guarantee service-level agreements by slowing down low-priority tasks as needed. He’ll also discuss the holy grail of Hadoop: how to enable mixed workloads.
Visit InsideAnalysis.com for more information.
Game Changed – How Hadoop is Reinventing Enterprise Thinking (Inside Analysis)
The Briefing Room with Dr. Robin Bloor and RedPoint Global
Live Webcast on April 8, 2014
Watch the archive: https://bloorgroup.webex.com/bloorgroup/lsr.php?RCID=cfa1bffdd62dc6677fa225bdffe4a0b9
The innovation curve often arcs slowly before picking up speed. Companies that harness a major transformation early in the game can make serious headway before challengers enter the picture. The world of Hadoop features several of these upstarts, each of which uses the open-source foundation as an engine to drive vastly greater performance to a wide range of services, and even create new ones.
Register for this episode of The Briefing Room to hear veteran Analyst Dr. Robin Bloor explain how the Hadoop engine is being used to architect a new generation of enterprise applications. He’ll be briefed by George Corugedo, RedPoint Global CTO and Co-founder, who will showcase how enterprises can cost-effectively take advantage of the scalability, processing power and lower costs that Hadoop 2.0/YARN applications offer by eliminating the long-term expense of hiring MapReduce programmers.
Visit InsideAnalysis.com for more information.
A modern, flexible approach to Hadoop implementation incorporating innovations from HP Haven (DataWorks Summit)
Jeff Veis, Vice President, HP Software Big Data
Gilles Noisette, Master Solution Architect, HP EMEA Big Data CoE
SQL on Hadoop: Defining the New Generation of Analytics Databases (DataWorks Summit)
The analytics and data warehousing industries are in the midst of a major period of transformation. Since the publication of Google’s MapReduce paper, we have witnessed the appearance of Apache Hadoop, followed by the arrival of batch-oriented SQL systems like Apache Hive, and the scramble by established SQL vendors to implement Hadoop connectors. This talk addresses the recent emergence of a new generation of analytic databases inspired by Google Dremel. These databases have been designed with the goal of running real-time SQL natively on Hadoop in a manner that fully exploits the flexibility and performance of the underlying platform. Characterized by features including schema-on-read, support for semi-structured data, and pluggable storage engines, these new systems share important architectural details that distinguish them from the previous generation of analytic databases. In this talk, we will discuss the performance limitations of the connector-based approach employed by many established vendors and explain the long-term significance of Apache Hive’s data model. Then, we will unravel the novel architectural features common to next generation analytic database systems like CitusDB and Impala that make real-time SQL-on-Hadoop feasible. Finally, we will conclude by reviewing several important database lessons learned over the previous decades that remain relevant today.
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23... (John Andrews)
Title: Chatty Kathy: Enhancing Physical Activity Among Older Adults
Description:
Discover how Chatty Kathy, an innovative project developed at the UNC Bootcamp, aims to tackle the challenge of low physical activity among older adults. Our AI-driven solution uses peer interaction to boost and sustain exercise levels, significantly improving health outcomes. This presentation covers our problem statement, the rationale behind Chatty Kathy, synthetic data and persona creation, model performance metrics, a visual demonstration of the project, and potential future developments. Join us for an insightful Q&A session to explore the potential of this groundbreaking project.
Project Team: Jay Requarth, Jana Avery, John Andrews, Dr. Dick Davis II, Nee Buntoum, Nam Yeongjin & Mat Nicholas
Explore our comprehensive data analysis project presentation on predicting product ad campaign performance. Learn how data-driven insights can optimize your marketing strategies and enhance campaign effectiveness. Perfect for professionals and students looking to understand the power of data analysis in advertising. For more details, visit: https://bostoninstituteofanalytics.org/data-science-and-artificial-intelligence/
4. Distributed Systems
Avoid distributed systems for any problem that could be solved with a non-distributed system.
5. Distributed Systems
• Consensus problem (a toy 2PC sketch follows below)
  – Paxos
  – RAFT
  – ZAB
  – etc.
• Transaction consistency
  – 2PC
  – 3PC
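To make the prepare/commit structure concrete, here is a toy single-process sketch of two-phase commit (illustrative only; a real implementation must handle timeouts, durable logging, and coordinator failure, which is what 3PC and consensus protocols address):

```python
# Toy two-phase commit: the coordinator commits only if every participant
# votes "yes" in the prepare phase; any "no" vote aborts everyone.
class Participant:
    def prepare(self) -> bool:
        # Persist the tentative change, then vote.
        return True  # a real participant may vote "no" or time out

    def commit(self):
        print("committed")

    def rollback(self):
        print("rolled back")

def two_phase_commit(participants) -> bool:
    # Phase 1: prepare / voting
    if all(p.prepare() for p in participants):
        # Phase 2: commit everywhere
        for p in participants:
            p.commit()
        return True
    # Abort path: roll everyone back
    for p in participants:
        p.rollback()
    return False

two_phase_commit([Participant(), Participant()])
```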
7. Distributed Systems
http://www.cs.cornell.edu/projects/ladis2009/talks/dean-keynote-ladis2009.pdf
8. Distributed Systems
Reasons to use
• Performance issues
  – More than 100,000 TPS
  – More than 4 GB/sec scan rate
  – More than 100,000 IOPS
• Capacity issues
  – More than 50 TB of data
• DR and Geo-Distribution
10. MPP
Main principles
• Shared Nothing
• Data Sharding (see the sketch below)
• Data Replication
• Distributed Transactions
• Parallel Processing
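As a toy illustration of the data-sharding principle (not any vendor's code), each row is routed to exactly one shard by hashing its distribution key, so shards can scan and join their slices in parallel with nothing shared:

```python
# Toy sharding: route rows to shards by a stable hash of the key.
import hashlib

NUM_SHARDS = 4
shards = [[] for _ in range(NUM_SHARDS)]

def shard_for(key: str) -> int:
    # Stable hash so the same key always lands on the same shard.
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_SHARDS

for row in [("cust-1", 100), ("cust-2", 250), ("cust-3", 75)]:
    shards[shard_for(row[0])].append(row)

print([len(s) for s in shards])  # row counts per shard
```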
12. MPP
Works well for
• Relational data
• Batch processing
• Ad hoc analytical SQL
• Low concurrency
• Applications requiring ANSI SQL
13. MPP
Not the best choice for
• Non-relational data
• OLTP and event stream processing
• High concurrency
• 100+ server clusters
• Non-analytical use cases
• Geo-Distributed use cases
16. Hadoop
HDFS
• Distributed filesystem
• Block-level storage with big blocks (see the sketch below)
• Non-updatable
• Synchronous block replication
• No built-in Geo-Distribution support
• No built-in DR solution
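As a hedged client-side sketch, pyarrow can write to HDFS and exposes the block size and replication factor as settings (the namenode host is hypothetical; this requires the Hadoop client libraries on the machine running it):

```python
# Writing a file to HDFS with pyarrow; big blocks and synchronous
# replication show up as client-visible settings.
from pyarrow import fs

hdfs = fs.HadoopFileSystem(
    host="namenode.example.com",           # hypothetical namenode
    port=8020,
    replication=3,                         # block replication factor
    default_block_size=128 * 1024 * 1024,  # 128 MB blocks, HDFS-style
)

# Files are write-once: you stream a new file, you don't update in place.
with hdfs.open_output_stream("/data/events/part-0000.csv") as f:
    f.write(b"id,value\n1,42\n")
```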
18. Hadoop
YARN
• Cluster resource manager
• Manages CPU and RAM allocation
• Schedulers are pluggable
• Can handle different resource pools
• Supports both MR and non-MR workloads
20. Hadoop
MapReduce
• Framework for distributed data processing
• Two main operations: map and reduce (toy sketch below)
• Data hits disk after “map” and before “reduce”
• Scales to thousands of servers
• Can process petabytes of data
• Extremely reliable
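Here is a toy, single-process imitation of the map, shuffle, and reduce flow (word count); a real MapReduce job distributes each phase across servers and spills map output to disk before the shuffle:

```python
# Toy word count in the MapReduce style: map -> shuffle (group by key) -> reduce.
from collections import defaultdict

def map_phase(line):
    for word in line.split():
        yield (word, 1)

def reduce_phase(word, counts):
    return (word, sum(counts))

lines = ["hadoop mpp hadoop", "mpp mpp"]

# Shuffle: group all map outputs by key.
groups = defaultdict(list)
for line in lines:
    for key, value in map_phase(line):
        groups[key].append(value)

print([reduce_phase(k, v) for k, v in groups.items()])
# [('hadoop', 2), ('mpp', 3)]
```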
22. Hadoop
HBase
• Distributed key-value store (client sketch below)
• Data is sharded by key
• Data is stored in sorted order
• Stores multiple versions of the row
• Easily scales
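A hedged client sketch using the happybase Python library, which talks to HBase via its Thrift gateway; the host and table names are hypothetical:

```python
# HBase model in miniature: byte-string row keys kept in sorted order,
# sharded by key ranges (regions), with multiple timestamped cell versions.
import happybase

conn = happybase.Connection("hbase-thrift.example.com")  # hypothetical host
table = conn.table("events")

table.put(b"user1|2015-10-01", {b"cf:action": b"login"})

# Sorted keys make prefix/range scans cheap.
for key, data in table.scan(row_prefix=b"user1|"):
    print(key, data)
```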
24. Hadoop
Hive
• Query engine with SQL-like syntax (example below)
• Translates HiveQL queries into MR / Tez / Spark jobs
• Processes HDFS data
• Supports UDFs and UDAFs
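A hedged sketch of driving Hive from Python with PyHive; the HiveServer2 host and table are hypothetical, and the `hive.execution.engine` property selects the MR/Tez/Spark backend:

```python
# Running HiveQL and choosing the execution engine from Python.
from pyhive import hive

conn = hive.connect(host="hiveserver2.example.com", port=10000,
                    username="analyst")
cur = conn.cursor()

# The same HiveQL can be planned as an MR, Tez, or Spark job:
cur.execute("SET hive.execution.engine=tez")

# upper() is a built-in UDF; sum() is a built-in UDAF.
cur.execute("""
  SELECT upper(category), sum(amount)
  FROM sales
  GROUP BY upper(category)
""")
print(cur.fetchall())
```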
26. Hadoop
Works well for
• Write Once Read Many
• 100+ server clusters
• Both relational and non-relational data
• High concurrency
• Batch processing and analytical workloads
• Elastic scalability
27. Hadoop
Not the best choice for
• Write-heavy workloads
• Small clusters
• Analytical DWH cases
• OLTP and event stream processing
• Cost savings
30-39. MPP vs Hadoop for Business

                            MPP                    Hadoop
Platform Openness           Mostly Closed          Open
Hardware Options            Mostly Appliances      Commodity
Vendor Lock-in              Typical                Not Common
Technology Price            $200K – $10M           $50K – $500K
Implementation Cost         Moderate               High
Extensibility               Vendor-provided APIs   Open Source
Supportability              Easy                   Complex
Scalability (servers)       Up to 100 servers      Up to 5000 servers
Scalability (data volume)   Up to 100-300 TB       Up to 100 PB
Target Systems              DWH                    Purpose-Built Batch
Target End Users            Business Analysts      Developers
41-49. MPP vs Hadoop for Architect

                            MPP                    Hadoop
Query Optimization          Good                   Poor to None
Debugging                   Easy                   Very Hard
Accessibility               SQL                    Mainly Java
DBA Skill Level             Low                    High
Single Job Redundancy       Low                    High
Query Latency               10-20 ms               10-20 sec
Query Runtime               5-7 sec                10-15 mins
Query Max Runtime           1-2 hours              1-2 weeks
Min Collection Size         Megabytes              Gigabytes
Max Concurrency             10-15 queries          70-100 jobs
51. Summary
Use MPP for
• Analytical DWH
• Ad hoc analyst SQL queries and BI
• Data volumes under 100 TB
Use Hadoop for
• Specialized data processing systems
• Data volumes over 100 TB