Advanced analytics with Spark
Author / Creator: Ryza, Sandy, author.
Edition: First edition.
Imprint: Sebastopol, CA : O'Reilly Media, 2015.
Description: 1 online resource (1 volume) : illustrations
Language: English
Format: E-Resource Book
URL for this record: http://pi.lib.uchicago.edu/1001/cat/bib/13636617
Table of Contents:
- Foreword
- Preface
- 1. Analyzing Big Data
- The Challenges of Data Science
- Introducing Apache Spark
- About This Book
- 2. Introduction to Data Analysis with Scala and Spark
- Scala for Data Scientists
- The Spark Programming Model
- Record Linkage
- Getting Started: The Spark Shell and SparkContext
- Bringing Data from the Cluster to the Client
- Shipping Code from the Client to the Cluster
- Structuring Data with Tuples and Case Classes
- Aggregations
- Creating Histograms
- Summary Statistics for Continuous Variables
- Creating Reusable Code for Computing Summary Statistics
- Simple Variable Selection and Scoring
- Where to Go from Here
- 3. Recommending Music and the Audioscrobbler Data Set
- Data Set
- The Alternating Least Squares Recommender Algorithm
- Preparing the Data
- Building a First Model
- Spot Checking Recommendations
- Evaluating Recommendation Quality
- Computing AUC
- Hyperparameter Selection
- Making Recommendations
- Where to Go from Here
- 4. Predicting Forest Cover with Decision Trees
- Fast Forward to Regression
- Vectors and Features
- Training Examples
- Decision Trees and Forests
- Covtype Data Set
- Preparing the Data
- A First Decision Tree
- Decision Tree Hyperparameters
- Tuning Decision Trees
- Categorical Features Revisited
- Random Decision Forests
- Making Predictions
- Where to Go from Here
- 5. Anomaly Detection in Network Traffic with K-means Clustering
- Anomaly Detection
- K-means Clustering
- Network Intrusion
- KDD Cup 1999 Data Set
- A First Take on Clustering
- Choosing k
- Visualization in R
- Feature Normalization
- Categorical Variables
- Using Labels with Entropy
- Clustering in Action
- Where to Go from Here
- 6. Understanding Wikipedia with Latent Semantic Analysis
- The Term-Document Matrix
- Getting the Data
- Parsing and Preparing the Data
- Lemmatization
- Computing the TF-IDFs
- Singular Value Decomposition
- Finding Important Concepts
- Querying and Scoring with the Low-Dimensional Representation
- Term-Term Relevance
- Document-Document Relevance
- Term-Document Relevance
- Multiple-Term Queries
- Where to Go from Here
- 7. Analyzing Co-occurrence Networks with GraphX
- The MEDLINE Citation Index: A Network Analysis
- Getting the Data
- Parsing XML Documents with Scala's XML Library
- Analyzing the MeSH Major Topics and Their Co-occurrences
- Constructing a Co-occurrence Network with GraphX
- Understanding the Structure of Networks
- Connected Components
- Degree Distribution
- Filtering Out Noisy Edges
- Processing Edge Triplets
- Analyzing the Filtered Graph
- Small-World Networks
- Cliques and Clustering Coefficients
- Computing Average Path Length with Pregel
- Where to Go from Here
- 8. Geospatial and Temporal Data Analysis on the New York City Taxi Trip Data
- Getting the Data
- Working with Temporal and Geospatial Data in Spark
- Temporal Data with Joda Time and NScala Time
- Geospatial Data with the Esri Geometry API and Spray
- Exploring the Esri Geometry API
- Intro to GeoJSON
- Preparing the New York City Taxi Trip Data
- Handling Invalid Records at Scale
- Geospatial Analysis
- Sessionization in Spark
- Building Sessions: Secondary Sorts in Spark
- Where to Go from Here
- 9. Estimating Financial Risk through Monte Carlo Simulation
- Terminology
- Methods for Calculating VaR
- Variance-Covariance
- Historical Simulation
- Monte Carlo Simulation
- Our Model
- Getting the Data
- Preprocessing
- Determining the Factor Weights
- Sampling
- The Multivariate Normal Distribution
- Running the Trials
- Visualizing the Distribution of Returns
- Evaluating Our Results
- Where to Go from Here
- 10. Analyzing Genomics Data and the BDG Project
- Decoupling Storage from Modeling
- Ingesting Genomics Data with the ADAM CLI
- Parquet Format and Columnar Storage
- Predicting Transcription Factor Binding Sites from ENCODE Data
- Querying Genotypes from the 1000 Genomes Project
- Where to Go from Here
- 11. Analyzing Neuroimaging Data with PySpark and Thunder
- Overview of PySpark
- PySpark Internals
- Overview and Installation of the Thunder Library
- Loading Data with Thunder
- Thunder Core Data Types
- Categorizing Neuron Types with Thunder
- Where to Go from Here
- A. Deeper into Spark
- B. Upcoming MLlib Pipelines API
- Index