Confidential and Proprietary to Daugherty Business Solutions
Goals
By the end of this session you will understand:
• The role of popular Big Data technologies
• The scope of data engineering and data analytics
• How this group supports local companies using these technologies
How does this help me?
STL Big Data IDEA 101
2018
Agenda
• Introduction
• Big Data
• Innovations
• Data Engineering
• Data Analytics
• St. Louis
• Conclusion
It started with an article
And a name change
Which led to some questions
• What is the future of the Hadoop ecosystem?
• What is the dividing line between Spark and Hadoop?
• What are the big players doing?
• How does the push to cloud technologies affect Hadoop usage?
• How does streaming come into play?
And then our answer
• Hadoop is here to stay, but it will make the most strides as a machine learning platform.
• Spark can perform many of the same tasks that elements of the Hadoop ecosystem can, but it is missing some existing features out of the box.
• Cloudera, Hortonworks, and MapR are positioning themselves as data processing platforms with roots in Hadoop but with other aspirations. For example, Cloudera is positioning itself as a machine learning platform.
• The push to cloud means that HDFS, the distributed filesystem, may be less important to cloud-based deployments, but Hadoop ecosystem projects are adapting to work with cloud sources.
• The Hadoop ecosystem projects have proven patterns for ingesting streaming data and turning it into information.
And Another Name Change
And so we changed our name to be:
St. Louis Big Data Innovations, Data Engineering, and Analytics Group
Or, more simply put:
St. Louis Big Data IDEA
But let’s break that down…
Big Data – A Mental Picture
The 3 Vs
Big Data Technologies to Know
• HDFS
• Hive/Impala
• Spark
• HBase
• Solr
• Zeppelin
• Kafka
• NiFi
• Avro/Parquet/ORC
HDFS
• In the beginning…
• How big is your block?
• Replication
• Partitioning & Compression
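The block-size and replication bullets can be made concrete with a little arithmetic. The sketch below is illustrative only: it assumes the common HDFS defaults of a 128 MB block size and a replication factor of 3 (both are configurable per cluster), and the function name is made up for this example.

```python
import math

MB = 1024 * 1024


def hdfs_footprint(file_bytes, block_bytes=128 * MB, replication=3):
    """Return (number of blocks, raw bytes stored across the cluster).

    Assumes the default values noted above. The last block is not padded,
    so raw storage is simply the file size times the replication factor.
    """
    blocks = math.ceil(file_bytes / block_bytes)
    return blocks, file_bytes * replication


# A 1000 MB file under the assumed defaults.
blocks, raw_bytes = hdfs_footprint(1000 * MB)
print(blocks, raw_bytes // MB)
```

Under these assumptions, a larger block size means fewer blocks to track per file, while a higher replication factor trades raw storage for durability.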
Serialization Formats
• CSV
• JSON
• Avro
• Parquet
• ORC
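One way to compare these formats is by layout: CSV, JSON, and Avro store data record by record, while Parquet and ORC store it column by column. The sketch below uses only the Python standard library to show the same records in both shapes. It does not write real Avro, Parquet, or ORC files; the sample records and helper names are invented for illustration.

```python
import csv
import io
import json

# Hypothetical sample records.
records = [
    {"id": 1, "city": "St. Louis", "temp_f": 71},
    {"id": 2, "city": "Clayton", "temp_f": 69},
]


def to_csv(rows):
    """Row-oriented text: one line per record, header first."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0]), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


def to_json_lines(rows):
    """Row-oriented text: one self-describing JSON object per line."""
    return "\n".join(json.dumps(row) for row in rows)


def to_columns(rows):
    """Column-oriented layout: all values of a field stored together."""
    return {name: [row[name] for row in rows] for name in rows[0]}


print(to_csv(records))
print(to_json_lines(records))
print(to_columns(records))
```

The columnar shape is what lets an analytic query read only the fields it needs, which is the main reason columnar formats are popular for OLAP-style workloads.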
Hive/Impala
• SQL abstraction on top of a storage layer
• Good for OLAP style queries with slowly changing dimensions
• Improvements
– Tez
– Calcite
– LLAP
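An OLAP-style query typically aggregates a large fact table after joining it to a small dimension table. The sketch below shows the shape of such a query. It uses Python's built-in sqlite3 module purely as a stand-in for Hive or Impala so that it runs anywhere; the table and column names are hypothetical.

```python
import sqlite3

# In-memory SQLite database standing in for a Hive/Impala warehouse.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE store_dim (store_id INTEGER PRIMARY KEY, region TEXT);
CREATE TABLE sales_fact (store_id INTEGER, amount REAL);
INSERT INTO store_dim VALUES (1, 'Midwest'), (2, 'Midwest'), (3, 'South');
INSERT INTO sales_fact VALUES (1, 10.0), (2, 5.0), (3, 7.5), (1, 2.5);
""")

# Roll sales up by region: join the fact table to the dimension, then aggregate.
rows = conn.execute("""
    SELECT d.region, SUM(f.amount) AS total
    FROM sales_fact AS f
    JOIN store_dim AS d ON d.store_id = f.store_id
    GROUP BY d.region
    ORDER BY d.region
""").fetchall()
print(rows)
```

The SQL itself is the portable part; on a real cluster the same statement would run against tables backed by files in a storage layer rather than an in-memory database.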
Spark
• General purpose, distributed computational framework
• First class support for Scala, Java, Python
• Runs on individual machines, Kubernetes, or Hadoop
What is HBase?
HBase is the diesel-powered race car engine that drives applications with Hadoop.
Apache Solr
• Derives from work for Apache Lucene
• Search
– Full-text
– Faceted
– Hit Handling
– Real-time indexing
– Database integration
– Dynamic clustering
– NoSQL features
• ELK stack
– Elasticsearch – Search
– Logstash – Data collection and log parsing
– Kibana – Analytics and visualization platform
Zeppelin
• Data science notebook that supports Spark, SQL, Python, and 25 other interpreters
• Allows users to share data science documents
Apache Kafka
• Publish and Subscribe Message Topics
• Process the data
• Store data
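The three bullets map onto a simple mental model: a topic is an append-only log (store), producers append to it (publish), and each consumer group reads from its own offset (subscribe and process). The toy class below sketches that model in plain Python. It is not the Kafka client API; the class and method names are invented for illustration.

```python
class Topic:
    """Hypothetical in-memory stand-in for a single topic partition."""

    def __init__(self):
        self.log = []       # store: append-only list of messages
        self.offsets = {}   # next offset to read, tracked per consumer group

    def publish(self, message):
        """Append a message and return its offset in the log."""
        self.log.append(message)
        return len(self.log) - 1

    def poll(self, group, max_messages=10):
        """Return unread messages for this group and advance its offset."""
        start = self.offsets.get(group, 0)
        batch = self.log[start:start + max_messages]
        self.offsets[group] = start + len(batch)
        return batch


topic = Topic()
for reading in (3, 5, 8):
    topic.publish(reading)

doubled = [r * 2 for r in topic.poll("processor")]  # process the data
audit = topic.poll("auditor")                       # independent second reader
print(doubled, audit)
```

Because messages stay in the log after being read, the second group sees the same data as the first, which is what separates this model from a traditional queue.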
NiFi
• Apache NiFi supports powerful and scalable directed graphs of data routing, transformation, and system mediation logic.
• Extensible
• Configurable
• Visual programming and monitoring
• Data provenance built-in
Turn and Talk
• What are some of the technologies that you are using that aren’t on the list?
• What are some of the technologies that you want to use that aren’t on the list?
Market Drivers
[Diagram: market drivers arranged around HDP Core]
• ISV strategy: bring ISVs (Hadoop ecosystem); IBM DSX and 3rd party apps
• Agility: faster time to deployment (containerized microservices); release agility (de-coupled HDP components)
• Real-time EDW: one SQL layer across historical and real-time data (streaming and historical sources, OLAP cubes and SQL tables, real-time query and deep analytics)
• Deep learning & GPU: deep learning frameworks (TensorFlow, Caffe); GPU pooling/isolation
• Scale: infinitely scalable (billions of files, exabytes); low TCO (less storage overhead)
• Secured & governed: data swamp -> data lake
• Diagram boxes: App 1, App 2, Data Science, EDW, Security & Governance, Operational Data Store, HDP Core
Hybrid Architecture
The New Way of Business Is Fueled By Connected Data
• Development
– Connected customers, vehicles, devices
– Socially crowd-sourced requirements
– Digital design and analysis
– Digital prototypes and tests (simulations)
• Manufacturing
– Connected factories, sensors, devices
– Human-robotic interaction
– 3D-printing on demand
• Distribution
– Connected trucks, inventory
– Location, traffic, weather-aware distribution
– Real-time inventory visibility
– Dynamic rerouting
• Marketing/Sales
– Connected customers, devices
– Omni-channel demand sensing
– Real-time recommendations
• Service
– Connected assets
– Remote service monitoring & delivery
– Predictive maintenance
– OTA updates
IoT Market
• By 2024, more than 24.9 billion IoT connections will be established
• An estimated $70 billion will be spent by global manufacturers on IoT solutions in 2020
• An estimated 646 million healthcare devices (excluding fitness trackers and wearable devices) will be connected by 2020
• An estimated 78% of cars shipped globally will be built with hardware that connects to the internet by 2020
• 50% of decision-makers in IT, services, utilities, and manufacturing have either deployed IoT or will deploy it in the next 12-24 months
Data decay and the need for real-time data value
[Diagram: a real-time data pipeline feeding an enterprise data lake]
• Enterprise transaction sources: ERP, MES, SCM, WMS, TMS, etc.
• Real-time data sources (connected supply chain): RFID, social, web, POS, GPS
• Pipeline steps: 1 Ingest, 2 Store, 3 Process, 4 Analyze (data discovery, business intelligence) and Learn (develop models, machine learning), 5 Monitor and deploy, 6 Act (real-time inventory action)
• Model inputs: historic click-streams and POS data; historic inventory locations/levels; historic weather; historic order histories, inventory locations and levels; historic disruptions
Faster Time to Deployment: Containerization
Why containerization?
• Overcomes limits of data architecture
• Allows for agility and elasticity to process data
• Developers can build data intensive apps quickly
• Ensures apps deploy quickly, reliably, and consistently across deployment environments
Result: Faster time to deployment and increased developer productivity -> competitive advantage
Real-Time Database
• Seamlessly combines real-time & historical data: makes both available for deep SQL analytics
• ACID on by default: addresses data change requirements
• Workload management for Apache Hive LLAP: no more worrying about resource competition
• Materialized views with query optimizer integration
Security and Governance
• Full chain of custody of data across the Hadoop ecosystem
• Auditing of events for fine-grained and detailed info
• Tag propagation allows auditors to see where the data is going across the enterprise & retain context of sensitive data
• Time-based policies to allow temporary access to users
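A time-based policy boils down to an access rule with a start and stop date. The sketch below models that idea in plain Python. It is not the Apache Ranger API; the class, field names, and the choice to treat the stop date as inclusive are assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class TemporaryAccessPolicy:
    """Hypothetical policy granting one user access to one resource for a date range."""
    user: str
    resource: str
    start: date
    stop: date  # treated as inclusive in this sketch (an assumption)

    def allows(self, user, resource, on):
        """True only for the named user and resource within the date window."""
        return (
            user == self.user
            and resource == self.resource
            and self.start <= on <= self.stop
        )


# Hypothetical grant: an auditor may read one table during September 2018.
policy = TemporaryAccessPolicy(
    "auditor1", "sales_table", date(2018, 9, 1), date(2018, 9, 30)
)
print(policy.allows("auditor1", "sales_table", date(2018, 9, 15)))
print(policy.allows("auditor1", "sales_table", date(2018, 10, 1)))
```

The point of the window is that access expires on its own, so nobody has to remember to revoke it.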
Everyone is excited about Data Science
What is Data Engineering?
Data Engineering Examples
• Streaming data
• Batch data analysis
• ETL @ Scale
• Machine Learning Pipelines
• Data Governance
Turn and Talk
• Which aspects of data engineering are most in line with your understanding?
• Which aspects of data engineering are most foreign to your understanding?
What is Data Analytics?
Data analysis is a process of inspecting, cleansing, transforming, and modeling data with the goal of discovering useful information, informing conclusions, and supporting decision-making. Data analytics involves applying algorithmic or mechanical processes to derive insights.
The conversion of data into information can take many forms:
• Visualizations
• Statistical analytics
• Computational analytics
Visualizations and Big Data
How do you represent data when one of its defining characteristics is Volume?
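A common answer is to aggregate before plotting: reduce a large number of raw values to a handful of bin counts, then chart the counts. The sketch below shows that reduction with the standard library only; the function name, bin width, and synthetic data are illustrative assumptions.

```python
from collections import Counter


def histogram(values, bin_width):
    """Count values per bin, keyed by each bin's lower edge."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))


# Synthetic data: 100,000 integer readings between 0 and 99.
readings = [(i * 37) % 100 for i in range(100_000)]

# 100,000 points collapse to four numbers that any charting tool can draw.
summary = histogram(readings, bin_width=25)
print(summary)
```

The same idea scales up: the heavy aggregation runs where the data lives, and only the small summary travels to the visualization layer.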
Complex cause and effect relationships
Regulation, optimization
Basic machine learning systems stabilize drones, using simple inputs to determine how much power to send to each rotor.
Forecasting and Prediction
Algorithms predict the weather based on the previous day’s weather and sensor readings.
Categorization/Segmentation
Netflix makes movie recommendations by grouping users based on viewing habits, and recommending movies enjoyed by other users in the same group.
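The grouping idea can be sketched in a few lines: once users are assigned to groups, recommend what others in the same group enjoyed that the user has not. This is a toy illustration, not Netflix's actual method; here the groups are given by hand rather than learned from viewing habits, and all names and titles are made up.

```python
def recommend(user, group_of, liked):
    """Titles liked by other members of the user's group that the user has not liked."""
    peers = [u for u, g in group_of.items() if g == group_of[user] and u != user]
    peer_titles = {title for peer in peers for title in liked[peer]}
    return sorted(peer_titles - liked[user])


# Hypothetical segments and viewing histories.
group_of = {"ann": "sci-fi fans", "bob": "sci-fi fans", "cat": "comedy fans"}
liked = {
    "ann": {"Alien", "Arrival"},
    "bob": {"Alien", "Blade Runner"},
    "cat": {"Airplane!"},
}

print(recommend("ann", group_of, liked))
```

In practice the interesting work is in forming the groups, which is where clustering and other segmentation techniques come in.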
Sensory Recognition
Siri, Google Voice, etc. for voice recognition
Network Analysis
Facebook recommends possible connections based on existing network connections.
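One simple form of this is counting mutual connections: people who share many friends with you but are not yet connected to you are likely suggestions. The sketch below illustrates that heuristic on a small hand-made graph. It is not Facebook's actual algorithm; the function name and data are assumptions.

```python
from collections import Counter


def suggest_connections(graph, person):
    """Rank non-friends by how many friends they share with the given person."""
    friends = graph[person]
    mutual = Counter()
    for friend in friends:
        for candidate in graph[friend]:
            if candidate != person and candidate not in friends:
                mutual[candidate] += 1
    # Most mutual friends first; ties broken alphabetically.
    return sorted(mutual.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical undirected friendship graph as adjacency sets.
graph = {
    "ann": {"bob", "cat"},
    "bob": {"ann", "cat", "dan"},
    "cat": {"ann", "bob", "dan", "eve"},
    "dan": {"bob", "cat"},
    "eve": {"cat"},
}

print(suggest_connections(graph, "ann"))
```

Here "dan" ranks above "eve" because he shares two friends with "ann" while "eve" shares only one.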
Turn and Talk
• What are some examples of data science that you encounter in your everyday life?
St. Louis Companies
• Bayer
• Mastercard
• ESI
• Centene
• AB
• RGA
• Panera
• Label Insight
• Nestle Purina
• Enterprise Holdings
• Maritz
• Edward Jones
• Graybar
• Mercy
• Charter
• Magellan Health
So What is the STL Big Data IDEA interested in?
• Local Companies
• Big Data
– Hadoop
– Cloud deployments
– Cloud-native technologies
– Spark
– Kafka
• Innovation
– New Big Data projects
– New Big Data services
– New Big Data applications
• Data Engineering
– Streaming data
– Batch data analysis
– Machine Learning Pipelines
– Data Governance
– ETL @ Scale
• Analytics
– Visualization
– Machine Learning
– Reporting
– Forecasting
Questions?

Back to school: Big Data IDEA 101

Editor's Notes

  • #24 Solutions: Success of Hadoop adoption is driven around the solutions that consumers use on top. Solutions are the key enabler to ease of adoption. Consumable: Lower barriers to adoption, increase stability, and simplify key cross-platform workflows. Use cases are from the perspective of primary users. Anywhere: Drive core platform iterations to allow faster evolution and the ability to consume the latest technology faster. Offer customers the best Hybrid and Cloud-based solutions and experience.
  • #26 Data is often referred to as the fuel of today’s businesses. In reality, every business has data and perhaps has access to the same types of data as most of its competitors. The real paradigm is not data but who uses it smarter, with greater effect. And that usage often relies on connecting the data dots across your organization. By connecting customers to products and to the channels through which they interact, or prefer to interact, we can drive better customer experiences, resulting in better loyalty and hopefully better revenues. Every industry is being transformed through these connected use cases.
  • #29 Overcomes limits of data architecture. Places apps as close to the data as possible. Allows for agility and elasticity to process data. Developers can build data-intensive apps quickly. Ensures apps deploy quickly, reliably, and consistently across deployment environments. Result: faster time to deployment and increased developer productivity, which gives customers a competitive advantage. NOTES: Containers are fully functioning runtime environments that hold an app and all of its dependencies. Being fully contained, they can run in any data center environment and are easy to move and deploy. Once the container itself is created, it's relatively easy for any end user to access. You don’t need to ask IT to set aside compute resources or set up a new cluster, which can take up to two months in a physical environment. With containers, there's no need to deploy a physical machine. You can run a container in a Hadoop environment, utilizing its storage and compute capabilities, with the click of a button. It takes all of five minutes to set up.
  • #30 Seamlessly combines real-time & historical data: makes both available for deep SQL analytics, with the power of Hive and Druid behind one SQL interface. Full ACID transactions on by default: makes it easy to support GDPR (the Right to Be Forgotten). We are enabling ACID by default. Along with the performance and reliability improvements that we’ve made, we are confident this is the best solution for addressing data change requirements. Hive LLAP workload management: run LLAP in a multi-tenant environment without worrying about resource competition. Hive LLAP is the next iteration of the Hive engine; it’s fast, but it doesn’t allow fine-grained control of resource allocation. This is challenging in large environments where many users with competing priorities share limited hardware resources. We are introducing the Hive LLAP workload management feature. Workload management allows the creation of resource pools, enabling fine-grained resource allocation. End result: run LLAP, the fastest Hive engine, in a multi-tenant environment without worrying about resource competition. Hive materialized views: it’s very common for multiple queries to need the same intermediate roll-up table or intermediate joined table. It’s costly to repeat the shared portions of the queries. We will pre-compute and cache the intermediate tables into views, which the query optimizer will automatically leverage, drastically improving performance. End result: speeds up your join queries and aggregation queries, commonly seen in BI scenarios and dashboards.
  • #31 Full chain of custody of data across the Hadoop ecosystem: adding features in Atlas to allow auditors and users to have more control over tags and to get complete coverage of the tags across the enterprise and ecosystem (Atlas). Auditing of events for fine-grained and detailed info: we’ve made improvements to Ranger to allow the auditing of events to get more fine-grained and detailed, so that the auditor can access audit events and do their job more easily (Ranger). Time-based policies allow temporary access for a given user; policies will include a start and stop date (Ranger). Tag propagation allows users and auditors to see where the data is going across the enterprise and to retain the context of data that is sensitive.
  • #33 Architecting distributed systems; creating reliable pipelines; shaping data sources; architecting data storage; collaborating with data scientists.
  • #37 One of the defining characteristics of Big Data is Volume. The amount of data makes it difficult to work with unless the data is aggregated and summarized. Once the data volume has been reduced, graphical analysis techniques can be used.
  • #38 There are some cause and effect relationships that are straightforward: I kick the ball and the ball moves. Other examples have a much more tangled relationship between their variables. For example, what factors determine how many visitors a particular museum has? There are a number of factors that are obvious: what day of the week or time of day it is, or whether the museum has a new exhibit. But there are some less obvious factors. Is school currently in session? Is it towards the end of the school year? What is the weather like outside?
  • #39 Once you understand the factors, data science problems emerge. What is the proper weighting of these factors? What is the expected attendance for next Tuesday? When should the museum reduce its prices?
  • #40 Maybe you want to focus on the type of people who are apt to visit the museum. Can you understand the profile of frequent visitors and what they want in order to increase their attendance? How are they different from sporadic visitors? How are they different from members? What can you learn about people that don’t visit the museum? What are they interested in? What is keeping them from visiting? Can you address their concerns?
  • #41 Many smart systems are being developed to mimic the human brain’s ability to decode its sensory inputs. Computer vision systems are used to recognize and categorize images. Computer vision is required in order to provide self-driving cars. Optical character recognition is also being used to automate form processing. Another important area is speech to text. This function enables systems like Siri and Alexa. Once the speech is in a textual format, the systems can use Natural Language Processing to determine what the user said and evaluate what needs to happen.
  • #42 Network Analysis is a form of data science that examines groups of related data and uses the relations between data items to derive insights. Who in your peer group is an influencer? What is the most valuable web site to place my ads on? Which web page on this topic has the most relevant data? Are there people that you could be friends with but haven’t said that you are?