Data Analytics with Spark Using Python, 1st edition
Published by Addison-Wesley Professional (June 6, 2018) © 2018
- Jeffrey Aven
$35.99
- A print text (hardcover or paperback)
- Free shipping
- Also available for purchase as an ebook from all major ebook resellers, including InformIT.com
Spark is at the heart of today’s Big Data revolution, helping data professionals supercharge efficiency and performance in a wide range of data processing and analytics tasks. In this guide, Big Data expert Jeffrey Aven covers all students need to know to leverage Spark, together with its extensions, subprojects, and wider ecosystem.
Aven combines a language-agnostic introduction to foundational Spark concepts with extensive programming examples utilizing the popular and intuitive PySpark development environment. This guide’s focus on Python makes it widely accessible to students at various levels of experience–even those with little Hadoop or Spark experience.
Aven’s broad coverage ranges from basic to advanced Spark programming, and from Spark SQL to machine learning. Students will learn how to efficiently manage all forms of data with Spark: streaming, structured, semi-structured, and unstructured. Throughout, concise topic overviews quickly get students up to speed, and extensive hands-on exercises prepare them to solve real problems.
Coverage includes:
• Understand Spark’s evolving role in the Big Data and Hadoop ecosystems
• Create Spark clusters using various deployment modes
• Control and optimize the operation of Spark clusters and applications
• Master Spark Core RDD API programming techniques
• Extend, accelerate, and optimize Spark routines with advanced API platform constructs, including shared variables, RDD storage, and partitioning
• Efficiently integrate Spark with both SQL and nonrelational data stores
• Perform stream processing and messaging with Spark Streaming and Apache Kafka
• Implement predictive modeling with SparkR and Spark MLlib
Preface    xi
Introduction    1
PART I: SPARK FOUNDATIONS
Chapter 1 Introducing Big Data, Hadoop, and Spark    5
Introduction to Big Data, Distributed Computing, and Hadoop    5
    A Brief History of Big Data and Hadoop    6
    Hadoop Explained    7
Introduction to Apache Spark    13
    Apache Spark Background    13
    Uses for Spark    14
    Programming Interfaces to Spark    14
    Submission Types for Spark Programs    14
    Input/Output Types for Spark Applications    16
    The Spark RDD    16
    Spark and Hadoop    16
Functional Programming Using Python    17
    Data Structures Used in Functional Python Programming    17
    Python Object Serialization    20
    Python Functional Programming Basics    23
Summary    25
Chapter 2 Deploying Spark    27
Spark Deployment Modes    27
    Local Mode    28
    Spark Standalone    28
    Spark on YARN    29
    Spark on Mesos    30
Preparing to Install Spark    30
Getting Spark    31
Installing Spark on Linux or Mac OS X    32
Installing Spark on Windows    34
Exploring the Spark Installation    36
Deploying a Multi-Node Spark Standalone Cluster    37
Deploying Spark in the Cloud    39
    Amazon Web Services (AWS)    39
    Google Cloud Platform (GCP)    41
    Databricks    42
Summary    43
Chapter 3 Understanding the Spark Cluster Architecture    45
Anatomy of a Spark Application    45
    Spark Driver    46
    Spark Workers and Executors    49
    The Spark Master and Cluster Manager    51
Spark Applications Using the Standalone Scheduler    53
    Spark Applications Running on YARN    53
Deployment Modes for Spark Applications Running on YARN    53
    Client Mode    54
    Cluster Mode    55
    Local Mode Revisited    56
Summary    57
Chapter 4 Learning Spark Programming Basics    59
Introduction to RDDs    59
Loading Data into RDDs    61
    Creating an RDD from a File or Files    61
    Methods for Creating RDDs from a Text File or Files    63
    Creating an RDD from an Object File    66
    Creating an RDD from a Data Source    66
    Creating RDDs from JSON Files    69
    Creating an RDD Programmatically    71
Operations on RDDs    72
    Key RDD Concepts    72
    Basic RDD Transformations    77
    Basic RDD Actions    81
    Transformations on PairRDDs    85
    MapReduce and Word Count Exercise    92
    Join Transformations    95
    Joining Datasets in Spark    100
    Transformations on Sets    103
    Transformations on Numeric RDDs    105
Summary    108
PART II: BEYOND THE BASICS
Chapter 5 Advanced Programming Using the Spark Core API    111
Shared Variables in Spark    111
    Broadcast Variables    112
    Accumulators    116
    Exercise: Using Broadcast Variables and Accumulators    119
Partitioning Data in Spark    120
    Partitioning Overview    120
    Controlling Partitions    121
    Repartitioning Functions    123
    Partition-Specific or Partition-Aware API Methods    125
RDD Storage Options    127
    RDD Lineage Revisited    127
    RDD Storage Options    128
    RDD Caching    131
    Persisting RDDs    131
    Choosing When to Persist or Cache RDDs    134
    Checkpointing RDDs    134
    Exercise: Checkpointing RDDs    136
Processing RDDs with External Programs    138
Data Sampling with Spark    139
Understanding Spark Application and Cluster Configuration    141
    Spark Environment Variables    141
    Spark Configuration Properties    145
Optimizing Spark    148
    Filter Early, Filter Often    149
    Optimizing Associative Operations    149
    Understanding the Impact of Functions and Closures    151
    Considerations for Collecting Data    152
    Configuration Parameters for Tuning and Optimizing Applications    152
    Avoiding Inefficient Partitioning    153
    Diagnosing Application Performance Issues    155
Summary    159
Chapter 6 SQL and NoSQL Programming with Spark    161
Introduction to Spark SQL    161
    Introduction to Hive    162
    Spark SQL Architecture    166
    Getting Started with DataFrames    168
    Using DataFrames    179
    Caching, Persisting, and Repartitioning DataFrames    187
    Saving DataFrame Output    188
    Accessing Spark SQL    191
    Exercise: Using Spark SQL    194
Using Spark with NoSQL Systems    195
    Introduction to NoSQL    196
    Using Spark with HBase    197
    Exercise: Using Spark with HBase    200
    Using Spark with Cassandra    202
    Using Spark with DynamoDB    204
    Other NoSQL Platforms    206
Summary    206
Chapter 7 Stream Processing and Messaging Using Spark    209
Introducing Spark Streaming    209
    Spark Streaming Architecture    210
    Introduction to DStreams    211
    Exercise: Getting Started with Spark Streaming    218
    State Operations    219
    Sliding Window Operations    221
Structured Streaming    223
    Structured Streaming Data Sources    224
    Structured Streaming Data Sinks    225
    Output Modes    226
    Structured Streaming Operations    227
Using Spark with Messaging Platforms    228
    Apache Kafka    229
    Exercise: Using Spark with Kafka    234
    Amazon Kinesis    237
Summary    240
Chapter 8 Introduction to Data Science and Machine Learning Using Spark    243
Spark and R    243
    Introduction to R    244
    Using Spark with R    250
    Exercise: Using RStudio with SparkR    257
Machine Learning with Spark    259
    Machine Learning Primer    259
    Machine Learning Using Spark MLlib    262
    Exercise: Implementing a Recommender Using Spark MLlib    267
    Machine Learning Using Spark ML    271
Using Notebooks with Spark    275
    Using Jupyter (IPython) Notebooks with Spark    275
    Using Apache Zeppelin Notebooks with Spark    278
Summary    279
Index    281
Jeffrey Aven is an independent Big Data, open source software, and cloud computing professional based in Melbourne, Australia. He is a highly regarded consultant and instructor and has authored several other books, including Teach Yourself Apache Spark in 24 Hours and Teach Yourself Hadoop in 24 Hours.