Comparison of Data Preparation vs. Data Wrangling Programming Languages, Frameworks and Tools in Machine Learning / Deep Learning Projects.
A key task in building analytic models for machine learning or deep learning is integrating and preparing data sets from sources such as files, databases, big data stores, sensors, or social networks. This step can take up to 80% of the whole project.
This session compares alternative techniques for preparing data, including extract-transform-load (ETL) batch processing (like Talend, Pentaho), streaming analytics ingestion (like Apache Storm, Flink, Apex, TIBCO StreamBase, IBM Streams, Software AG Apama), and data wrangling (DataWrangler, Trifacta) within visual analytics. The options and their trade-offs are shown in live demos using advanced analytics technologies and open source frameworks such as R, Python, Apache Hadoop, Spark, KNIME or RapidMiner. The session also discusses how data preparation relates to visual analytics tools (like TIBCO Spotfire), and best practices for how data scientists and business users should work together to build good analytic models.
Key takeaways for the audience:
- Learn various options for preparing data sets to build analytic models
- Understand the pros and cons and the targeted persona for each option
- See different technologies and open source frameworks for data preparation
- Understand the relation to visual analytics and streaming analytics, and how these concepts are actually leveraged to build the analytic model after data preparation
Video Recording / Screencast of this Slide Deck: https://youtu.be/2MR5UynQocs
17. Reference Architecture for Big Data Analytics
[Architecture diagram; labels recovered from the slide:]
• Data sources: sensor data, transactions, message bus, machine data, social data
• Operational analytics: Operations, Live UI
• Streaming analytics and action: aggregate, rules, stream processing, analytics, correlate, live monitoring, continuous query processing, alerts, manual action, escalation
• Historical analysis: data sheets, BI, data scientists, cleansed data, history, data discovery
• Enterprise Service Bus: ERP, MDM, DB, WMS, SOA
• Other components: data storage, internal data, integration bus, API, event server, machine learning, big data
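The streaming analytics path in this architecture (aggregate, rules, continuous query processing, alerts) can be illustrated with a small Python sketch. It is not taken from the slides or from any of the named products: the `Reading` fields, the 60-second window and the threshold of 80 are all hypothetical values chosen for illustration.

```python
from __future__ import annotations

from collections import deque
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass
class Reading:
    # Hypothetical event shape; the slides do not define one.
    sensor_id: str
    timestamp: float  # seconds
    value: float


def alerts(
    readings: Iterable[Reading],
    window_s: float = 60.0,   # assumed window length
    threshold: float = 80.0,  # assumed alert rule
) -> Iterator[tuple[str, float, float]]:
    """Continuous query: for each incoming event, aggregate the last
    `window_s` seconds per sensor and yield (sensor_id, timestamp, mean)
    whenever the windowed mean exceeds `threshold`."""
    windows: dict[str, deque[Reading]] = {}
    for r in readings:
        w = windows.setdefault(r.sensor_id, deque())
        w.append(r)
        # Evict events that have fallen out of the time window.
        while w and w[0].timestamp <= r.timestamp - window_s:
            w.popleft()
        mean = sum(x.value for x in w) / len(w)
        if mean > threshold:  # the "rule" that triggers an alert
            yield (r.sensor_id, r.timestamp, mean)


stream = [
    Reading("a", 0, 70),
    Reading("a", 30, 85),
    Reading("a", 50, 95),
    Reading("a", 100, 60),
]
fired = list(alerts(stream))
```

The point of the sketch is the contrast with batch ETL: the aggregate is updated per event while the data is in motion, so an alert can fire before anything is written to storage.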
18. Reference Architecture for Big Data Analytics
[Same architecture diagram as slide 17, annotated with the data preparation options:]
• ETL / Data Ingestion (Apache NiFi, Talend, …)
• Streaming Analytics (Apache Flink, TIBCO StreamBase, …)
• Data Wrangling (Trifacta, TIBCO Spotfire, …)
• Data Preparation (R, Python, KNIME, RapidMiner, …)
• Big Data Preparation (MapReduce, Spark, …)
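As an illustration of the "Data Preparation (R, Python, …)" option, here is a minimal standard-library Python sketch of typical preparation steps: trimming and normalizing text, parsing numbers, dropping duplicate rows, and imputing missing values with the column median. The sample data, column names and the `prepare` function are hypothetical and not from the talk; a real project would more likely use a data frame library.

```python
import csv
import io
from statistics import median

# Hypothetical raw input with typical quality problems:
# stray whitespace, inconsistent case, blanks, a duplicate row, "n/a".
RAW = """customer_id,country,age,revenue
1, de ,34,120.50
2,US,,80
2,US,,80
3,us,41,n/a
"""


def _to_float(text):
    """Parse a number; return None for blanks or unparseable values."""
    try:
        return float(text.strip())
    except ValueError:
        return None


def prepare(raw_csv):
    rows, seen = [], set()
    for rec in csv.DictReader(io.StringIO(raw_csv)):
        # Drop exact duplicates (compared after trimming whitespace).
        key = tuple(v.strip() for v in rec.values())
        if key in seen:
            continue
        seen.add(key)
        rows.append({
            "customer_id": int(rec["customer_id"]),
            "country": rec["country"].strip().upper(),  # normalize text
            "age": _to_float(rec["age"]),
            "revenue": _to_float(rec["revenue"]),
        })
    # Impute missing numeric values with the column median.
    for col in ("age", "revenue"):
        fill = median(r[col] for r in rows if r[col] is not None)
        for r in rows:
            if r[col] is None:
                r[col] = fill
    return rows


rows = prepare(RAW)
```

Median imputation is only one possible choice here. Whether to impute, drop, or flag missing values depends on the model being built.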
40. Inline Data Wrangling within Visual Analytics Tooling
http://marketo.tibco.com/rs/221-BCQ-142/images/how-integrated-data-wrangling-fuels-analytic-creativity.pdf
“When analysts are in the middle of discovery, stopping everything
and going back to another tool is jarring. It breaks their flow. They
have to come back and pick up later. Productivity plummets and
creative energy crashes.”
• Inline data wrangling during exploratory data analysis
• All-in-one tooling, used by a single user
• AI-driven data wrangling and visualization
• e.g. TIBCO Spotfire