How Big Data is Disrupting Traditional Workflows
It's an exciting time to be a business analyst. Today's analysts have the power to lead their companies toward innovation and even transformation, if they have the right tools.
The slides from this webinar with Dan Woods, Editor and CTO of CITO Research, and Peter Schlampp, Platfora VP of Products, addressed:
How the right tool can revolutionize your current workflow
How to find the critical answers your business needs on your own
How you can access all data, including big data, in a straightforward, governed environment
2. before we begin…
You will be on mute for the duration of the event
Please type a message in the Questions box in the Control Panel if you can’t hear us
There will be a Q&A session at the end; please start submitting your questions via the GoToWebinar Questions panel
A recording of the full webinar will be available within 48 hours, and we will email the replay link to you
3. meet the speakers
Dan Woods
CEO of Evolved Media & Editor of CITO Research
dwoods@citoresearch.com
I create ideas about technology products, based on a broad
technical understanding. By writing as an analyst in Forbes,
and working with Evolved Media’s clients, I see the magic in
technology and why it matters to IT buyers.
Pete Schlampp
VP Products, Platfora
pete@platfora.com
As Vice President of Products at Platfora, I am committed to designing
software that enables users to explore all dimensions of complex data through
visualizations and workflows that are simple and easy to use across all touch
points in the business. I'm responsible for driving strategy for our product
management, product marketing, UX design, and technical publications.
4. agenda
Big data, big dreams; big data, big challenges
The quest for data
Data, a chorus of variety
Platfora success story
Questions & answers
6. the hidden potential of data, big and otherwise
Distributed insights are not being harvested
• Companies analyze only 12% of the data they have, leaving roughly 88% unused
• Permanent heterogeneity: Data now lives in many places and in many forms
Relationship with data is undergoing transformation
• Used to be that “digital lint” (data that wasn’t cost-effective to store)
was thrown away
• Hadoopanomics allows storage of vast
amounts of data at low cost
• Storing data doesn’t make it accessible
Lack of access to all the data
• Silos
• Highly intermediated systems
• Analysts in handcuffs
7. Platfora disrupts traditional analyst workflows
• To empower analysts with few, if any,
prior restraints on their imagination
• Platfora creates new central repository
with new practices and capabilities:
Data preparation
Integrated exploration and visualization
Data catalog
Flexible data governance
Shareable data objects
Tooling and plumbing for the tricky parts
of the new world (manageable refresh)
9. shifting from data warehouse to data lake
Data warehouses operated based on different assumptions
Adding data was expensive and only happened through a lengthy design process
ETL processes often cleaned up the data before it arrived in the repository
Trust in the data came from the process of curation
Assumptions change in the data lake
Many sources of data
Raw data must be analyzed and transformed
Data lineage must be preserved to establish trust
A catalog must be maintained so that data can be found
Platfora turns the data lake into something useful
10. finding the data
• Show data catalog; explain how it works; is it created automatically?
• Explain how it differs from a data dictionary, which most analysts know about but no one wants to be responsible for building, because maintaining it becomes a full-time job
• Explain data lineage and how the source of items in the catalog is preserved
• The value of this capability is vastly underestimated
• Explain some of what it enables
[Screenshot: data catalog search for "Oil"]
11. tracing it back to the source
• With all the data in one location, it is possible to trace every data point back to the source
• What calculations were performed?
• Who made the changes?
• Was anything filtered out?
• Where are the source files?
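The four questions above can be sketched as fields on a catalog entry plus a walk over parent links. This is a minimal illustration only, not Platfora's data model: the `CatalogEntry` class, the `trace_to_source` helper, the dataset names, and the file path are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """One dataset in a hypothetical catalog, holding the lineage facts listed above."""
    name: str
    parents: list = field(default_factory=list)       # names of upstream datasets
    source_files: list = field(default_factory=list)  # raw files (source datasets only)
    calculation: str = ""     # what calculation produced this dataset
    filter_applied: str = ""  # what, if anything, was filtered out
    changed_by: str = ""      # who made the change


def trace_to_source(catalog, name):
    """Follow parent links and return every entry from `name` back to the raw sources."""
    seen, order, stack = set(), [], [name]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        entry = catalog[current]
        order.append(entry)
        stack.extend(entry.parents)
    return order


# Hypothetical three-step lineage: raw file -> cleaned data -> daily average.
catalog = {
    "rig_sensor_raw": CatalogEntry(
        "rig_sensor_raw", source_files=["/data/rigs/readings.json"]),
    "rig_sensor_clean": CatalogEntry(
        "rig_sensor_clean", parents=["rig_sensor_raw"],
        calculation="parse timestamps", filter_applied="dropped null readings",
        changed_by="analyst_a"),
    "rig_daily_avg": CatalogEntry(
        "rig_daily_avg", parents=["rig_sensor_clean"],
        calculation="mean pressure per rig per day", changed_by="analyst_b"),
}

lineage = trace_to_source(catalog, "rig_daily_avg")
for entry in lineage:
    print(entry.name, "|", entry.calculation or "raw", "|", entry.changed_by or "-")
```

Because every derived entry names its parents, any data point in `rig_daily_avg` can be followed back to the entry that lists the source files.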
12. challenges of preparing big data
Messy Data
• Big data often has a low signal-to-noise ratio and variable structure
Data Preparation
• Data prep involves not only cleaning but profiling, summarizing and applying analytics
• Preparation process must be recorded
• Preparing a useable form of big data is often done by a team
Data in Real Time
• Data arrives in streams (real time) and in massive volumes
• Cleaning and extraction steps form a pipeline that data passes through as it arrives
Cataloging Data
• Derived objects must be cataloged and lineage must be tracked
15. managing workloads
Almost all big data repositories have to manage workloads
It is always possible to overwhelm the amount of computing capacity available
The process of managing workloads must address these tasks:
• Managing a transformation pipeline
• Creating canonical objects
• Creating derived objects
• Managing streams of data
• Recomputing objects
16. pulling together multiple data sources
Combine datasets from the catalog to enrich big data
Hadoop to crunch big data into in-memory lenses
19. collaborate and share insights
Share insights across the enterprise, ad hoc and on a schedule
20. handling administration and governance
• Users must have access to work with and iterate through data in the data lake
• IT must have confidence that security is in place and users will not draw the wrong conclusions from the data
• Different access levels for data and objects create an open and secure environment
22. delivering customer value at rapid speed
“Platfora Big Data Analytics has enabled us to
quickly understand precisely how advertising
investments influence consumer behavior.”
- Ed Smith, CTO
The Business Challenge
• Needed a way to integrate all cross-property
data from AutoTrader.com, KellyBlueBook.com,
and thousands of local dealer websites that
reach over 40 million unique customers on a
monthly basis
• Growing desire to decrease investments in
traditional EDWs and analytic databases
The Solution
• With Platfora, they integrated and visualized TV
spot data, web site traffic metrics and car dealer
inventory to better understand the influence on
the consumer's journey
• Created Vizboards that demonstrated immediate
consumer behavior shifts influenced by OEM TV
ads during the Super Bowl
• Project to ROI in 6 weeks, significantly less time
and money than if they used legacy tools
23. how platfora empowers imagination
Platfora empowers analysts just as open source empowered developers
• Analysts can easily discover and share data from one environment
• Analysts build a common language with which to describe a business
• True alignment paves the way for advanced analytics to transform operations
• This is founded on the following capabilities:
Data preparation of the sort big data needs
Integrated exploration and visualization
A data catalog
Flexible data governance
Shareable data objects
Tooling and plumbing for the tricky parts of the new world (manageable refresh)
25. questions?
Thank you for joining today!
You will receive a complimentary copy of “The
Changing Role of the Business Analyst” and a link
to the webinar replay in a follow-up email.
Editor's Notes
Data Warehouse Notes:
The amount of such highly curated data grew slowly
Tribal knowledge was sufficient to keep track of what data was available
Data Lake Notes:
ETL processes for unstructured data and big data are far different than traditional methods
Many data processing pipelines take place in the data lake
Tribal knowledge is not sufficient to track the amount of data in the data lake
Moving to the data lake architecture is a challenge for companies.
One of the potential benefits is to give access to all the data.
How do you do it? How can end users find the data?
If it is one dataset, that’s easy.
But you want to enable users to add their own too.
So you need the tools to be able to add, organize, and find datasets.
Of course, you need governance and security.
Platfora’s Data Catalog is an easy way to do this.
Search across datasets to find what you need.
In this case we’re looking at IoT data from oil rigs.
With a lot of users accessing the DL it’s important to trust the data you’re looking at.
One of the hallmarks of the data discovery movement, where business analysts have had free rein, has been the lack of control and governance over data.
Many fear that big data simply takes this one step further.
It doesn’t need to be this way. With a central Data Catalog you can trace all data that is in the DL.
You can see what calculations and transformations have taken place, and who did them.
Traditional ETL tools cannot handle big data. A new generation of data prep tools has been created.
Advanced machine-learning and analytics are vital when preparing big data.
Data sets must be captured in the catalog as they arrive.
The catalog must capture the lineage of derived data sets
The signal from the big data may need to enhance other data sets or may need to be combined with other data sets.
Collaboration is required.
Preparing big data takes 80% of the time
But this isn’t just spreadsheets and CSV files
Big data comes in formats like JSON, XML, log files, etc.
Need to handle millions of rows
Need to be able to sample rows across multiple files, search for records, filter
Need a system to provide intelligence to the analyst.
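Sampling rows across multiple files can be sketched with reservoir sampling, which keeps a fixed-size random sample in a single pass without loading every row. This is an illustration only, not a description of how Platfora samples: the `sample_records` function is hypothetical, the files are assumed to hold one JSON record per line, and in-memory `io.StringIO` objects stand in for real files.

```python
import io
import json
import random


def sample_records(files, k, seed=0):
    """Reservoir-sample up to k JSON records across several line-delimited files."""
    rng = random.Random(seed)
    reservoir = []
    seen = 0
    for handle in files:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            seen += 1
            if len(reservoir) < k:
                reservoir.append(record)
            else:
                # Replace a kept record with probability k / seen.
                j = rng.randrange(seen)
                if j < k:
                    reservoir[j] = record
    return reservoir


# Two small in-memory "files" stand in for log files in the data lake.
file_a = io.StringIO('{"id": 1}\n{"id": 2}\n{"id": 3}\n')
file_b = io.StringIO('{"id": 4}\n\n{"id": 5}\n{"id": 6}\n')

sample = sample_records([file_a, file_b], k=3)
print(sample)
```

The same loop is a natural place to add search or filter conditions before a record is considered for the sample.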
Data arrives and then it must proceed through a transformation pipeline
These pipelines create canonical objects like those stored in a data warehouse. This is usually a batch process.
The pipelines also produce derived objects, designed by analysts to serve specific needs. This is somewhere between a batch and a real-time process.
Some objects may be streams of data, updated in real time
You cannot recompute every object before every query
A derived object like a Platfora Lens may have a huge amount of processing behind it. Recomputation must be configurable
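One way to picture configurable recomputation is a per-object maximum age: an object is rebuilt only when its last build is older than its own setting. This is a sketch under that assumption, not Platfora's scheduler; `DerivedObject`, `max_age`, and `due_for_recompute` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class DerivedObject:
    name: str
    last_built: float  # time of the last build, in seconds
    max_age: float     # hypothetical per-object refresh setting, in seconds


def due_for_recompute(objects, now):
    """Return the names of objects whose last build is older than their max_age."""
    return [o.name for o in objects if now - o.last_built > o.max_age]


objects = [
    DerivedObject("daily_sales_lens", last_built=0.0, max_age=86400.0),  # daily
    DerivedObject("hourly_traffic_lens", last_built=0.0, max_age=3600.0),  # hourly
]

# Two hours after both were built, only the hourly object is stale.
stale = due_for_recompute(objects, now=7200.0)
print(stale)
```

Expensive objects can be given a long `max_age` so that queries never wait on a rebuild.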
Hadoop is still a batch engine. But we need real-time access. How to marry these for the BA?
In-memory technology can help.
The lens is the answer.
With the lens we take big data and process it into a materialized view of the data, with many of the answers pre-computed in memory.
Leverages the power of MapReduce and Spark.
Can combine together datasets.
Platfora then manages the lenses, updates them when new data arrives, makes sure they are available when the BA needs them.
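The idea of a materialized view with pre-computed answers can be sketched in a few lines. This is a toy stand-in, not Platfora's lens implementation: the `Lens` class, its methods, and the sample rows are hypothetical, and the aggregation that would run as a batch job on Hadoop is reduced here to a simple in-process loop.

```python
from collections import defaultdict


class Lens:
    """Hypothetical stand-in for a pre-aggregated, in-memory view of raw rows."""

    def __init__(self, dimensions, measure):
        self.dimensions = dimensions
        self.measure = measure
        self.totals = defaultdict(float)

    def refresh(self, rows):
        """Fold newly arrived rows into the pre-computed totals."""
        for row in rows:
            key = tuple(row[d] for d in self.dimensions)
            self.totals[key] += row[self.measure]

    def query(self, **filters):
        """Answer from the pre-computed totals without touching the raw rows."""
        total = 0.0
        for key, value in self.totals.items():
            labeled = dict(zip(self.dimensions, key))
            if all(labeled[d] == v for d, v in filters.items()):
                total += value
        return total


lens = Lens(dimensions=["site", "day"], measure="visits")
lens.refresh([
    {"site": "autos", "day": "mon", "visits": 10},
    {"site": "autos", "day": "tue", "visits": 5},
    {"site": "dealers", "day": "mon", "visits": 7},
])
before = lens.query(site="autos")

# New data arrives; the view is updated rather than rebuilt from scratch.
lens.refresh([{"site": "autos", "day": "tue", "visits": 3}])
after = lens.query(site="autos")
print(before, after)
```

Queries only read the small table of totals, which is why answers can come back quickly even when the raw data is large.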
Once the data has been crunched by Hadoop and loaded into memory, it’s time to visualize it.
The modern BA works across devices and collaborates freely – must be web based and collaborative
Big data requires many types of visualization: traditional and more niche
But perhaps the most important factor is that the modern business analyst needs to iteratively ask questions of the data
One question leads to another, which leads to another.
There needs to be no penalty for asking questions: this means that answers need to come back quickly and the way to express the questions should be intuitive, like dragging and dropping fields from your lens into dropzones.
Platfora has this. We automatically pick the visualization that we believe is the best fit, but you can change it as you need.
Once insights are locked in you need to be able to collaborate and share.
Distribute formatted versions of vizboards via PDF and email.
Add collaborators to a document
All of this can give IT a headache. The data discovery movement has created a backlash within the IT department about a lack of control of data.
Went from “here is the data you need and here’s how you’re going to analyze it” to the Wild, Wild West.
There needs to be a balance.
IT needs to be able to control access to certain data, and have the ability to deem what is canonical.
The modern business analyst needs the freedom to swim within the data lake.
Platfora provides the ability to secure data and the objects derived from data.
Simple model, Own/Edit/View/None.
Gives IT peace of mind.
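The Own/Edit/View/None model can be sketched as an ordered set of levels with separate grants for datasets and derived objects. This is an illustration, not Platfora's API: the `Access` enum, the `can` helper, and the user and object names are hypothetical, and the sketch assumes each level includes the ones below it (Own implies Edit, Edit implies View), which the notes do not state.

```python
from enum import IntEnum


class Access(IntEnum):
    """Permission levels; ordering is an assumption (higher includes lower)."""
    NONE = 0
    VIEW = 1
    EDIT = 2
    OWN = 3


def can(grants, user, obj, needed):
    """True if the user's grant on obj is at least the needed level."""
    return grants.get((user, obj), Access.NONE) >= needed


# Grants are kept per (user, object), so a dataset and an object derived
# from it can carry different access levels.
grants = {
    ("ana", "raw_clicks"): Access.VIEW,  # the underlying dataset
    ("ana", "sales_lens"): Access.EDIT,  # an object derived from it
}

print(can(grants, "ana", "sales_lens", Access.VIEW))
print(can(grants, "ana", "raw_clicks", Access.EDIT))
print(can(grants, "bob", "sales_lens", Access.VIEW))
```

Defaulting missing grants to `NONE` keeps the environment closed by default, while individual grants open up only what IT intends.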
One of my favorite customer stories
Started in December of 2014
Combined data from multiple websites
No need for traditional EDWs
BAs were able to do it themselves