Visual Mapping of Clickstream Data:
Introduction and Demonstration
Cedric Carbone, Ciaran Dynes
Talend
  1. Visual Mapping of Clickstream Data: Introduction and Demonstration. Cedric Carbone, Ciaran Dynes, Talend
  2. Visual mapping of Clickstream data: introduction and demonstration. Ciaran Dynes, VP Products; Cedric Carbone, CTO
  3. Agenda
     • Clickstream live demo
     • Moving from hand-code to code generation
     • Performance benchmark
     • Optimization of code generation
  4. Hortonworks Clickstream demo: http://hortonworks.com/hadoop-tutorial/how-to-visualize-website-clickstream-data/
  5. Trying to get from this…
  6. Big Data – “pure Hadoop”: visual design in Map Reduce and optimize before deploying on Hadoop … to this…
  7. Demo overview: demo flow
     1. Load raw Omniture web log files to HDFS (see the load sketch after this slide)
        • Can discuss the ‘schema on read’ principle: it allows any data type to be loaded easily into a ‘data lake’ and then be available for analytical processing
        • http://ibmdatamag.com/2013/05/why-is-schema-on-read-so-useful/
     2. Define a Map/Reduce process to transform the data
        • Identical skills to any graphical ETL tool
        • Look up customer and product data to enrich the results
        • Results written back to HDFS
     3. Federate the results to a visualisation tool of your choice
        • Excel
        • An analytics tool such as Tableau, QlikView, etc.
        • Google Charts
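In the demo this load step is performed by a Talend component; purely as a rough illustration of what step 1 amounts to, here is a minimal sketch using the standard Hadoop FileSystem API. The NameNode address and all paths are hypothetical.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Minimal sketch: copy raw Omniture web log files from local disk into HDFS.
// The NameNode URI and directory names below are illustrative assumptions,
// not values from the demo.
public class LoadWebLogsToHdfs {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020");   // hypothetical cluster address

        try (FileSystem fs = FileSystem.get(conf)) {
            Path localLogs = new Path("file:///data/omniture/weblogs/");
            Path rawZone   = new Path("/user/demo/clickstream/raw/");
            fs.mkdirs(rawZone);
            // Schema-on-read: the files are stored as-is; structure is applied later
            // by the Map/Reduce transformation or by Hive at query time.
            fs.copyFromLocalFile(false, true, localLogs, rawZone);
        }
    }
}
```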
  8. Big Data Clickstream Analysis (architecture diagram): Talend loads the web logs to HDFS, Talend Big Data (integration) transforms them on Hadoop with Map/Reduce and Hive, and Talend federates the results to the analytics layer behind the Clickstream Dashboard.
  9. Native Map/Reduce Jobs
     • Create classic ETL patterns using native Map/Reduce (see the sketch after this slide)
       - Only data management solution on the market to generate native Map/Reduce code
     • No need for expensive big data coding skills
     • Zero pre-installation on the Hadoop cluster
     • Hadoop is the “engine” for data processing #dataos
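To show the shape of a "classic ETL pattern" expressed as native Map/Reduce, here is a minimal hand-written counting job (page hits per URL). This is an illustration only, not Talend's generated code; the tab-separated layout and the URL column index are assumptions.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Minimal sketch of a native Map/Reduce aggregation: count page hits per URL
// from tab-separated web log lines. The column index for the URL is assumed.
public class PageHitCount {

    public static class HitMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
        private static final LongWritable ONE = new LongWritable(1);
        private final Text url = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context ctx)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split("\t");
            if (fields.length > 12) {            // assumed position of the page URL
                url.set(fields[12]);
                ctx.write(url, ONE);
            }
        }
    }

    public static class SumReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text key, Iterable<LongWritable> values, Context ctx)
                throws IOException, InterruptedException {
            long sum = 0;
            for (LongWritable v : values) sum += v.get();
            ctx.write(key, new LongWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "page-hit-count");
        job.setJarByClass(PageHitCount.class);
        job.setMapperClass(HitMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```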
  10. SHOW ME
  11. PERFORMANCE OF CODE GENERATION
  12. MapReduce 2.0, YARN, Storm, Spark
     • YARN ensures predictable performance & QoS for all apps
     • Enables apps to run “IN” Hadoop rather than “ON” it
     • In Labs: streaming with Apache Storm
     • In Labs: mini-batch and in-memory with Apache Spark
     • Diagram (source: Hortonworks): applications run natively in Hadoop on HDFS2 (redundant, reliable storage) and YARN (cluster resource management): Batch (MapReduce), Interactive (Tez), Streaming (Storm, Spark), Graph (Giraph), NoSQL (MongoDB), Events (Falcon), Online (HBase), Other (Search)
  13. Talend: Tap – Transform – Deliver (overlaid on the same Hadoop/YARN stack diagram)
     • Tap (ingestion): Sqoop, Flume, HDFS API, HBase API, Hive, 800+ connectors
     • Transform (data refinement): profile, parse, map, CDC, cleanse, standardize, machine learning, match
     • Deliver (as an API): ActiveMQ, Karaf, Camel, CXF, Kafka, Storm; metadata, security, MDM, iPaaS, governance, HA
  14. Talend Labs Benchmark Environment
     • Context: 9-node cluster, replication factor 3
       - DELL R210-II, 1 Xeon® E3 1230 v2, 4 cores, 16 GB RAM
       - Map slots: 2 per node
       - Reduce slots: 2 per node
     • Total processing capability:
       - 9 × 2 map slots: 18 maps
       - 9 × 2 reduce slots: 18 reduces
     • Data volume: 1, 10, 100 GB
  15. TPCH Benchmark
     • The Pig and Hive Apache communities are using TPCH benchmarks
       - https://issues.apache.org/jira/browse/PIG-2397
       - https://issues.apache.org/jira/browse/HIVE-600
     • We are currently running the same tests in our labs
       - Pig hand-coded script vs. Talend-generated Pig code
       - Pig hand-coded script vs. Talend-generated Map/Reduce code
       - HiveQL produced by the community vs. Hive ELT capabilities
     • Partial results already available for Pig
       - Very good results
  16. Optimizing job configuration?
     • By default, Talend follows Hadoop recommendations regarding the number of reducers usable for the job execution
     • The rule is that 99% of the total available reduce slots can be used
       - http://wiki.apache.org/hadoop/HowManyMapsAndReduces
       - For the Talend benchmark, the default maximum number of reducers is:
         • 3 nodes: 5 (3 × 2 = 6 slots; 6 × 99% ≈ 5)
         • 6 nodes: 11 (6 × 2 = 12 slots; 12 × 99% ≈ 11)
         • 9 nodes: 17 (9 × 2 = 18 slots; 18 × 99% ≈ 17)
       - Another customer benchmark, default maximum reducers: 700 × 99% = 693 (assumption with half Dell and half HP servers)
     (a configuration sketch follows this slide)
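As a sketch of how the 99% rule translates into job configuration, the reducer count can be set through the standard Hadoop API. The cluster figures below simply restate the benchmark slide; this is not Talend's generated code.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

// Sketch only: apply the "99% of available reduce slots" rule from the slide.
// nodes and reduceSlotsPerNode are assumptions describing the benchmark cluster.
public class ReducerSizing {
    public static void main(String[] args) throws Exception {
        int nodes = 9;
        int reduceSlotsPerNode = 2;
        int maxReducers = (int) Math.floor(nodes * reduceSlotsPerNode * 0.99); // 18 slots -> 17 reducers

        Job job = Job.getInstance(new Configuration(), "tpch-style-job");
        job.setNumReduceTasks(maxReducers);
        System.out.println("Using " + maxReducers + " reducers");
    }
}
```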
  17. TPCH Results: Pig hand-coded vs. Pig generated
     • 19 tests with results similar to or better than the hand-coded Pig scripts
  18. TPCH Results: Pig hand-coded vs. Pig generated
     • 19 tests with results similar to or better than the hand-coded Pig scripts
     • Code is already optimized and automatically applied (chart annotation: “Talend code is faster”)
  19. PERFORMANCE IMPROVEMENTS
  20. TPCH Results: Pig hand-coded vs. Pig generated
     • 19 tests with results similar to or better than the hand-coded Pig scripts
     • 3 tests will benefit from a new COGROUP feature (chart annotation: “Requires CoGroup”)
  21. Example: how Sort works for Hadoop
     Talend has implemented the TeraSort algorithm for Hadoop:
     1. A first Map/Reduce job is generated to analyze the data ranges
        - Each mapper reads its data and analyzes its bucket's critical values
        - The reducer produces quartile files for all the data to sort
     2. A second Map/Reduce job is started
        - Each mapper simply sends the key to sort to the reducer
        - A custom partitioner is created to send the data to the best bucket, depending on the quartile file previously created (see the partitioner sketch after this slide)
        - Each reducer outputs the data sorted by bucket
     • Research: tSort (GraySort, MinuteSort)
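To make the "custom partitioner" step concrete, here is a hedged sketch of a range partitioner that routes keys to buckets by comparing them against sorted cut points such as the quartile boundaries produced by the first job. The class name, the hard-coded cut points, and the loading mechanism are illustrative assumptions, not Talend's actual generated code.

```java
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Illustrative range partitioner in the spirit of TeraSort: keys are routed to
// reducers by comparing them against sorted cut points (e.g. quartile boundaries
// computed by a first, sampling Map/Reduce job). Names are hypothetical.
public class RangePartitioner extends Partitioner<Text, Text> {

    // In a real job these would be loaded in the job setup from the quartile file
    // written by the sampling job; hard-coded here only for illustration.
    private static final Text[] CUT_POINTS = {
        new Text("g"), new Text("n"), new Text("t")   // assumed boundaries
    };

    @Override
    public int getPartition(Text key, Text value, int numPartitions) {
        // Find the first cut point greater than the key; that index is the bucket.
        int bucket = 0;
        while (bucket < CUT_POINTS.length && key.compareTo(CUT_POINTS[bucket]) >= 0) {
            bucket++;
        }
        // Clamp to the number of reducers actually configured for the job.
        return Math.min(bucket, numPartitions - 1);
    }
}
```

Because each bucket covers a contiguous key range, concatenating the reducer outputs in bucket order yields a globally sorted result.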
  22. How-to-Get Sandbox!
     • Videos on the Jumpstart
       - How to launch: http://youtu.be/J3Ppr9Cs9wA
       - Clickstream video: http://youtu.be/OBYYFLmdCXg
     • To get the Sandbox: http://www.talend.com/contact
  23. Step-by-Step Directions
     • Completely self-contained demo VM Sandbox
     • Key scenarios like Clickstream Analysis
  24. Come try the Sandbox: Hortonworks Dev Café & Talend
  25. Talend Platform for Big Data v5.4 (architecture diagram)
     • Talend Unified Platform: Studio, Repository, Deployment, Execution, Monitoring
     • Data Integration: data access, ETL/ELT, version control, business rules, change data capture, scheduler, parallel processing, high availability
     • Big Data Quality: Hive data profiling, drill-down to values, DQ portal and monitoring, data stewardship, report design, address validation, custom analysis, M/R parsing and matching
     • Big Data: Hadoop 2.0, MapReduce ETL/ELT, HCatalog/metadata, Pig, Sqoop, Hive, Hadoop job scheduler, Google BigQuery, NoSQL support, HDFS
     • Runtime platform: Java, Hadoop, SQL, etc.
  26. NonStop HBase – Making HBase Continuously Available for Enterprise Deployment. Dr. Konstantin Boudnik, WANdisco
  27. Non-Stop HBase: Making HBase Continuously Available for Enterprise Deployment. Konstantin Boudnik – Director, Advanced Technologies, WANdisco; Brett Rudenstein – Senior Product Manager, WANdisco
  28. WANdisco: the continuous availability company
     • WANdisco := Wide Area Network Distributed Computing
     • We solve availability problems for enterprises. If you can’t afford 99.999%, we’ll help
     • Publicly traded on the London Stock Exchange since mid-2012 (LSE:WAND)
     • Apache Software Foundation sponsor; actively contributing to Hadoop, SVN, and others
     • US-patented active-active replication technology
     • Located on three continents
     • Enterprise-ready, high-availability software solutions that enable globally distributed organizations to meet today’s data challenges of secure storage, scalability and availability
     • Subversion, Git, Hadoop HDFS, HBase at 200+ customer sites
  29. What are we solving?
  30. Traditionally, everybody relies on backups
  31. HA is (mostly) a glorified backup
     • Redundancy of critical elements
       - Standby servers
       - Backup network links
       - Off-site copies of critical data
       - RAID mirroring
     • Baseline:
       - Create and synchronize replicas
       - Clients switch over in case of failure
       - Extra hardware lying idle, spinning “just in case”
  32. A Typical Architecture (HDFS HA)
  33. Backups can fail
  34. WANdisco Active-Active Architecture
     • 100% uptime with WANdisco’s patented replication technology
       - Zero downtime / zero data loss
       - Enables maintenance without downtime
     • Automatic recovery of failed servers; automatic rebalancing as workload increases
     (diagram: replicated HDFS data)
  35. Multi-threaded server software: multiple threads process client requests in a loop
     • Each server thread repeatedly: gets a client request (e.g. an HBase put), acquires a lock, makes the change to state (the db), releases the lock, and sends the return value to the client (see the sketch after this slide)
     (diagram: three threads interleaving operations inside a single server process)
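The loop in this diagram can be paraphrased roughly as the following sketch; all types here are generic stand-ins rather than HBase's actual server classes.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.locks.ReentrantLock;

// Rough sketch of the multi-threaded request loop from the slide: each worker
// thread takes a client request, serializes the state change behind a lock,
// and returns the result to the client. All types are hypothetical stand-ins.
public class RequestWorker implements Runnable {

    /** A client request, e.g. an HBase put, plus a way to answer the caller. */
    public interface Request {
        String payload();
        void reply(String result);              // send return value to client
    }

    /** The server's mutable state (the "db" in the diagram). */
    public interface StateStore {
        String apply(String payload);           // make change to state, return result
    }

    private final ReentrantLock stateLock;
    private final StateStore state;
    private final BlockingQueue<Request> requests;

    public RequestWorker(ReentrantLock lock, StateStore state, BlockingQueue<Request> queue) {
        this.stateLock = lock;
        this.state = state;
        this.requests = queue;
    }

    @Override
    public void run() {
        while (!Thread.currentThread().isInterrupted()) {
            try {
                Request req = requests.take();  // 1. get client request
                stateLock.lock();               // 2. acquire lock
                String result;
                try {
                    result = state.apply(req.payload()); // 3. make change to state (db)
                } finally {
                    stateLock.unlock();         // 4. release lock
                }
                req.reply(result);              // 5. send return value to client
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }
}
```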
  36. Ways to achieve single-server redundancy
  37. Using a TCP connection to send data to three replicated servers (diagram: a client sends each operation through a load balancer to three replicated server processes)
  38. HBase WAL replication
     • The state machine (HRegion contents, HMaster metadata, etc.) is modified first
     • The modification log (the HBase WAL) is sent to highly available shared storage
     • Standby server(s) read the edit log and serve as warm standbys, ready to take over should the active server fail
  39. HBase WAL replication (diagram: a single active server writes WAL entries to shared storage; a standby server reads them)
  40. HBase WAL tailing, WAL snapshots, etc.
     • Only one active region server is possible
     • Failover takes time
     • Failover is error-prone
     • RegionServer failover isn’t seamless for clients
  41. Implementing multiple active masters with Paxos coordination (not about leader election)
  42. Three replicated servers (diagram: clients submit operations to three replicated server processes; a Distributed Coordination Engine (Paxos, DConE) on each server agrees on a single order of operations before they are applied)
  43. HBase Continuous Availability (multiple active masters)
  44. HBase Single Points of Failure
     • Single HBase Master
       - Service interruption after Master failure
     • HBase client
       - Client session doesn’t fail over after a RegionServer failure
     • HBase RegionServer: downtime
       - 30 secs ≤ MTTR ≤ 200 secs
     • Region major compaction (not a failure, but…)
       - (Un)scheduled downtime of a region for compaction
  45. HBase Region Server & Master Replication
  46. NonStopRegionServer (diagram: two NonStopRegionServers, each wrapping an HRegionServer and its client service, e.g. multi, with the DConE coordination engine, plus an HBase client)
     1. The client calls HRegionServer multi
     2. NonStopRegionServer intercepts the call
     3. NonStopRegionServer makes a Paxos proposal using the DConE library
     4. The proposal comes back as an agreement on all NonStopRegionServers
     5. NonStopRegionServer calls super.multi on all nodes; state changes are recorded
     6. NonStopRegionServer 1 alone sends the response back to the client
     • HMaster is similar (see the sketch after this slide)
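The numbered flow above can be sketched as an intercept/propose/agree/apply pattern. DConE is proprietary and HBase's internal RegionServer interfaces are not reproduced here, so every type and method below is an illustrative stand-in, not WANdisco's implementation.

```java
// Illustrative sketch of the intercept/propose/agree/apply pattern described on
// the slide. CoordinationEngine, Agreement and BaseRegionServer are hypothetical
// stand-ins; this is NOT WANdisco's DConE API or HBase's HRegionServer code.
public class NonStopRegionServerSketch {

    /** Stand-in for the Paxos-based Distributed Coordination Engine. */
    public interface CoordinationEngine {
        // Submits a proposal and blocks until the same agreement is delivered,
        // in the same order, on every replica.
        Agreement propose(byte[] operation);
    }

    public interface Agreement {
        byte[] operation();
        boolean proposedLocally();   // true only on the replica that took the client call
    }

    /** Stand-in for the underlying region server's mutation logic (super.multi). */
    public static class BaseRegionServer {
        public String multi(byte[] operation) {
            return "applied " + operation.length + " bytes";
        }
    }

    private final CoordinationEngine dcone;
    private final BaseRegionServer delegate = new BaseRegionServer();

    public NonStopRegionServerSketch(CoordinationEngine dcone) {
        this.dcone = dcone;
    }

    /** Steps 1-2: the client call is intercepted instead of being applied directly. */
    public String multi(byte[] operation) {
        // Step 3: make a Paxos proposal through the coordination engine.
        Agreement agreement = dcone.propose(operation);
        // Steps 4-5: the agreement is delivered in the same order on all replicas,
        // and each replica applies it to its local state.
        String result = delegate.multi(agreement.operation());
        // Step 6: only the replica that took the original client call responds.
        return agreement.proposedLocally() ? result : null;
    }
}
```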
  47. HBase RegionServer replication using WANdisco DConE
     • Shared-nothing architecture
     • HFiles, WALs, etc. are not shared
     • Replica count is tunable
     • Snapshots of HFiles do not need to be created
     • The messy details of WAL tailing are not necessary
       - The WAL might not be needed at all (!)
     • Not an eventual-consistency model
     • Does not serve up stale data
  48. DEMO
  49. (demo screenshot)
  50. (demo screenshot)
  51. (demo screenshot)
  52. DEMO / Q & A
  53. Thank you. Konstantin Boudnik, cos@wandisco.com, @c0sin