While Big Data processing has made tremendous progress over the last 10 years, the next generation of tools will provide more plug-and-play capabilities for data ingestion, curation, and analysis. As the growth of raw data accelerates and novel applications impose more stringent real-time requirements on data analytics, easy-to-use tools will be in high demand. HPCC Systems is developing the next generation of toolsets to meet these needs. Tombolo, a tool for data curation and compliance tracking (GDPR, CCPA, HIPAA, DPPA, etc.); ECL Cloud IDE, which makes coding complex analytics simple; and IoT Central, which manages devices and ingestion pipelines, are some of the tools planned for release this year. Together with improvements to the core HPCC Systems platform, these tools will provide a comprehensive, end-to-end open source stack that better prepares the platform for a new generation of data scientists.
9. Richard and Gavin conceiving the HPCC Systems platform (1999)
HPCC Systems Community Day 2019 - Opening remarks 9
10. HPCC Systems – a history of firsts
• First distributed in memory data retrieving system (HOLe) (2000)
• First implicitly parallel dataflow data processing DSL language (ECL) (2000)
• First distributed data integration system (Thor) (2001)
• First distributed hybrid query system (ROXIE) (2002)
The first paper on MapReduce was published in OSDI in December 2004
12. A Data Lake Solution Need
• A central (logical) repository of data
• Unlimited storage
• Schema on read
• Low cost of storage and compute
• Metadata tracking
• High performance processing
• Cloud development support
• Machine Learning
• Ingesting in batch and real-time
• Secure
[Diagram: HPCC Systems DATA LAKE, with components Real-time Data Movement, Advanced Analytics, Batch Data, Machine Learning, Data Cleansing and Organization (ETL), On-premise Data Movement]
13. With support for IoT
FACTORIES: Safety, Operations & Equipment Optimization
HOME: Automation & Security
VEHICLES: Autonomous Vehicles & Driver Behavior
OUTSIDE: Logistics & Navigation
CITIES: Public Health, Safety, Security & Transportation
OFFICES: Security & Energy
Bringing data together is not easy, and data is growing at a rapid pace and scale
A processing platform is vital for bringing all your data together across all verticals
14. A layered solution
16. What if…
• You could write your ECL code directly on the cloud?
• Seamlessly upload data and create charts?
• Create interactive dashboards natively?
• And all of this with just a web browser…
18. Datasets and ECL scripts
File Upload & Spray
Data Patterns
Visualization
19. The many challenges of tracking data provenance
Tracking datasets and fields across the data integration process can be daunting
Metadata is never guaranteed to actually reflect the data types
Detecting categories of protected data is difficult
Understanding data flows from ECL code archeology is not trivial
22. Since ECLWatch didn’t have enough capabilities
We really needed Exploratory Data Analysis…
And a home for it…
So, why not ECLWatch?
24. Coming Soon! Data Detection
25. HPCC Of Things (a HOT topic)
Usability of our IoT capabilities at the center
Seamless onboarding of common IoT sources
Easy to use parsers and data integration strategies
Built in displays and dashboards
29. • Out of the box capabilities for consistency and ease of use
• Less coding and more using (even though we love to code)
• Aiming at being your one stop shop for all your data integration, querying and analytical needs
Challenge Yourself – Challenge the Status Quo
HPCC Systems: We code so that you don’t need to
30. View this presentation on YouTube:
https://www.youtube.com/watch?v=O-qJwxhQTzA&list=LLmySfVDlEUzlIiIdDc7oQbQ&index=2&t=3064s
Editor's Notes
Over the last year, we have spent a lot of effort on building the missing pieces that would make HPCC Systems a comprehensive data lake solution. While HPCC has been designed to be flexible when working with Big Data, it has lacked the ability to track all the assets in a Data Lake (files, jobs, queries, ownership, etc.) in a way that would satisfy compliance and auditing requirements.
In addition, there has been an emphasis on making the technology cloud ready and on providing cloud-friendly development tools.
With the advent of IoT and the challenges it creates for our traditional insurance business, we quickly recognized that ingesting, handling, and managing real-time data has become extremely important.
We have incorporated the new features using a layered approach: the core and the interface to the core (ECL) stay the same, while third parties can integrate with the platform easily and rapidly.
Data Lakes stand out because the data assets can all reside in the same logical central location. This design provides the flexibility to work with data in an agile manner. Recognizing that agility is key to the success of data science projects, we have built native machine learning capabilities that are designed to execute as close to the data as possible.
ECL Cloud IDE:
With the advent of cloud-based implementations, there is a demand for cloud-ready tools that are both powerful and easy to use. ECL Cloud IDE is a complete coding environment for both advanced data scientists and beginning ECL developers. The Cloud IDE is designed to reduce the learning curve for executing data science projects by providing an intuitive online experience. Working with datasets and sharing an integrated application experience with others has never been easier.
Tombolo:
While Data Lake architectures provide the capability to ingest data rapidly and transform it, tracking and documenting datasets (data dictionaries), capturing compliance (HIPAA, GLB, DPPA, GDPR, etc.), managing compliance with transformation rules, and recording data lineage and ownership are non-trivial tasks. Tombolo helps keep a record of all your assets in the Data Lake and how they are being used. This enables both developers and stakeholders (product owners and auditors) to manage data assets quickly and efficiently.
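To make the idea of asset tracking concrete, here is a minimal Python sketch of a catalog that records ownership, compliance tags, and lineage for files and jobs. It is only an illustration under stated assumptions: the `Asset` and `Catalog` classes, their fields, and the sample asset names are all hypothetical and are not Tombolo's actual data model or API.

```python
from dataclasses import dataclass, field


@dataclass
class Asset:
    """One tracked item in the lake. Hypothetical shape, not Tombolo's schema."""
    name: str
    kind: str                                      # "file", "job", or "query"
    owner: str
    compliance: set = field(default_factory=set)   # e.g. {"HIPAA"}
    inputs: list = field(default_factory=list)     # names of upstream assets


class Catalog:
    """In-memory registry of assets, keyed by name (illustrative only)."""

    def __init__(self):
        self._assets = {}

    def register(self, asset):
        self._assets[asset.name] = asset

    def lineage(self, name):
        """Return every upstream asset name, nearest first, without repeats."""
        seen = []
        queue = list(self._assets[name].inputs)
        while queue:
            current = queue.pop(0)
            if current in seen:
                continue
            seen.append(current)
            queue.extend(self._assets[current].inputs)
        return seen

    def inherited_compliance(self, name):
        """Union of the asset's own tags and the tags of everything upstream."""
        tags = set(self._assets[name].compliance)
        for upstream in self.lineage(name):
            tags |= self._assets[upstream].compliance
        return tags


# Hypothetical sample data: two raw files, a job joining them, a derived file.
catalog = Catalog()
catalog.register(Asset("claims_raw", "file", "ingest-team", {"HIPAA"}))
catalog.register(Asset("drivers_raw", "file", "ingest-team", {"DPPA"}))
catalog.register(Asset("join_claims_drivers", "job", "analytics",
                       inputs=["claims_raw", "drivers_raw"]))
catalog.register(Asset("risk_scores", "file", "analytics",
                       inputs=["join_claims_drivers"]))

print(catalog.lineage("risk_scores"))
print(sorted(catalog.inherited_compliance("risk_scores")))
```

In this sketch a derived file inherits the compliance tags of every upstream source, which is one simple (and deliberately conservative) way an auditor might ask which regulations apply to an output. The sketch assumes every name listed in `inputs` has been registered; a real system would also need persistence, versioning, and field-level tracking.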
IoT Hub:
Despite the widespread and growing collection and analysis of Internet of Things data, tools that are both powerful and easy to use are lacking. The IoT Hub provides a plug-and-play environment by integrating an HPCC Systems backend with an administrative user interface to manage devices, collect data, and execute related analysis workflows. The IoT Hub currently supports Fitbit, Ecobee, Nest, and Connected Car API.