Data Science Machine Learning Data Analysis
37.1K subscribers
1.13K photos
27 videos
39 files
1.24K links
This channel is for Programmers, Coders, Software Engineers.

1- Data Science
2- Machine Learning
3- Data Visualization
4- Artificial Intelligence
5- Data Analysis
6- Statistics
7- Deep Learning

Cross promotion and ads: @hussein_sheikho
加入频道
Pandas Introduction to Advanced.pdf
854.8 KB
📄 "Pandas Introduction to Advanced" booklet

👨🏻‍💻 You can't attend a #datascience interview and not be asked about Pandas! But you don't have to memorize all its methods and functions! With this booklet, you'll learn everything you need.

✔️ One of the most useful and interesting combinations is using #Pandas with #AWS Lambda, which can be very useful in real projects.

#DataAnalytics #Python #SQL #RProgramming #DataScience #MachineLearning #DeepLearning #Statistics #DataVisualization #PowerBI #Tableau #LinearRegression #Probability #DataWrangling #Excel #AI #ArtificialIntelligence #BigData #DataAnalysis #NeuralNetworks #GAN #LearnDataScience #LLM #RAG #Mathematics #PythonProgramming  #Keras

https://yangx.top/CodeProgrammer
Please open Telegram to view this post
VIEW IN TELEGRAM
👍13
🔗 Machine Learning from Scratch by Danny Friedman

This book is for readers looking to learn new #machinelearning algorithms or understand algorithms at a deeper level. Specifically, it is intended for readers interested in seeing machine learning algorithms derived from start to finish. Seeing these derivations might help a reader previously unfamiliar with common algorithms understand how they work intuitively. Or, seeing these derivations might help a reader experienced in modeling understand how different #algorithms create the models they do and the advantages and disadvantages of each one.

This book will be most helpful for those with practice in basic modeling. It does not review best practices—such as feature engineering or balancing response variables—or discuss in depth when certain models are more appropriate than others. Instead, it focuses on the elements of those models.


https://dafriedman97.github.io/mlbook/content/introduction.html

#DataAnalytics #Python #SQL #RProgramming #DataScience #MachineLearning #DeepLearning #Statistics #DataVisualization #PowerBI #Tableau #LinearRegression #Probability #DataWrangling #Excel #AI #ArtificialIntelligence #BigData #DataAnalysis #NeuralNetworks #GAN #LearnDataScience #LLM #RAG #Mathematics #PythonProgramming  #Keras

https://yangx.top/CodeProgrammer
Please open Telegram to view this post
VIEW IN TELEGRAM
👍42
SciPy.pdf
206.4 KB
Unlock the full power of SciPy with my comprehensive cheat sheet!
Master essential functions for:

Function optimization and solving equations

Linear algebra operations

ODE integration and statistical analysis

Signal processing and spatial data manipulation

Data clustering and distance computation ...and much more!


#Python #SciPy #MachineLearning #DataScience #CheatSheet #ArtificialIntelligence #Optimization #LinearAlgebra #SignalProcessing #BigData



💯 BEST DATA SCIENCE CHANNELS ON TELEGRAM 🌟
Please open Telegram to view this post
VIEW IN TELEGRAM
👍5
Numpy from basics to advanced.pdf
2.4 MB
📕 Mastering NumPy – From Basics to Advanced

NumPy is an essential library in the world of data science, widely recognized for its efficiency in numerical computations and data manipulation. This powerful tool simplifies complex operations with arrays, offering a faster and cleaner alternative to traditional Python lists and loops.

The "Mastering NumPy" booklet provides a comprehensive walkthrough—from array creation and indexing to mathematical/statistical operations and advanced topics like reshaping and stacking. All concepts are illustrated with clear, beginner-friendly examples, making it ideal for anyone aiming to boost their data handling skills.

#NumPy #Python #DataScience #MachineLearning #AI #BigData #DeepLearning #DataAnalysis


🌟 Join the communities:
✉️ Our Telegram channels: https://yangx.top/addlist/0f6vfFbEMdAwODBk

📱 Our WhatsApp channel: https://whatsapp.com/channel/0029VaC7Weq29753hpcggW2A
Please open Telegram to view this post
VIEW IN TELEGRAM
4👍1
Topic: Python PySpark Data Sheet – Part 1 of 3: Introduction, Setup, and Core Concepts

---

### 1. What is PySpark?

PySpark is the Python API for Apache Spark, a powerful distributed computing engine for big data processing.

PySpark allows you to leverage the full power of Apache Spark using Python, making it easier to:

• Handle massive datasets
• Perform distributed computing
• Run parallel data transformations

---

### 2. PySpark Ecosystem Components

Spark SQL – Structured data queries with DataFrame and SQL APIs
Spark Core – Fundamental engine for task scheduling and memory management
Spark Streaming – Real-time data processing
MLlib – Machine learning at scale
GraphX – Graph computation

---

### 3. Why PySpark over Pandas?

| Feature | Pandas | PySpark |
| -------------- | --------------------- | ----------------------- |
| Scale | Single machine | Distributed (Cluster) |
| Speed | Slower for large data | Optimized execution |
| Language | Python | Python on JVM via Py4J |
| Learning Curve | Easier | Medium (Big Data focus) |

---

### 4. PySpark Setup in Local Machine

#### Install PySpark via pip:

pip install pyspark


#### Start PySpark Shell:

pyspark


#### Sample Code to Initialize SparkSession:

from pyspark.sql import SparkSession

spark = SparkSession.builder \
.appName("MyApp") \
.getOrCreate()


---

### 5. RDD vs DataFrame

| Feature | RDD | DataFrame |
| ------------ | ----------------------- | ------------------------------ |
| Type | Low-level API (objects) | High-level API (structured) |
| Optimization | Manual | Catalyst Optimizer (automatic) |
| Usage | Complex transformations | SQL-like operations |

---

### 6. Creating DataFrames

#### From Python List:

data = [("Alice", 25), ("Bob", 30)]
df = spark.createDataFrame(data, ["Name", "Age"])
df.show()


#### From CSV File:

df = spark.read.csv("file.csv", header=True, inferSchema=True)
df.show()


---

### 7. Inspecting DataFrames

df.printSchema()     # Schema info  
df.columns # List column names
df.describe().show() # Summary stats
df.head(5) # First 5 rows


---

### 8. Basic Transformations

df.select("Name").show()
df.filter(df["Age"] > 25).show()
df.withColumn("AgePlus10", df["Age"] + 10).show()
df.drop("Age").show()


---

### 9. Working with SQL

df.createOrReplaceTempView("people")
spark.sql("SELECT * FROM people WHERE Age > 25").show()


---

### 10. Writing Data

df.write.csv("output.csv", header=True)
df.write.parquet("output_parquet/")


---

### 11. Summary of Concepts Covered

• Spark architecture & PySpark setup
• Core components of PySpark
• Differences between RDD and DataFrames
• How to create, inspect, and manipulate DataFrames
• SQL support in Spark
• Reading/writing to/from storage

---

### Exercise

1. Load a sample CSV file and display the schema
2. Add a new column with a calculated value
3. Filter the rows based on a condition
4. Save the result as a new CSV or Parquet file

---

#Python #PySpark #BigData #ApacheSpark #DataEngineering #ETL

https://yangx.top/DataScienceM
4
Topic: Python PySpark Data Sheet – Part 2 of 3: DataFrame Transformations, Joins, and Group Operations

---

### 1. Column Operations

PySpark supports various column-wise operations using expressions.

#### Select Specific Columns:

df.select("Name", "Age").show()


#### Create/Modify Column:

from pyspark.sql.functions import col

df.withColumn("AgePlus5", col("Age") + 5).show()


#### Rename a Column:

df.withColumnRenamed("Age", "UserAge").show()


#### Drop Column:

df.drop("Age").show()


---

### 2. Filtering and Conditional Logic

#### Filter Rows:

df.filter(col("Age") > 25).show()


#### Multiple Conditions:

df.filter((col("Age") > 25) & (col("Name") != "Alice")).show()


#### Using `when` for Conditional Columns:

from pyspark.sql.functions import when

df.withColumn("Category", when(col("Age") < 30, "Young").otherwise("Adult")).show()


---

### 3. Aggregations and Grouping

#### GroupBy + Aggregations:

df.groupBy("Department").count().show()
df.groupBy("Department").agg({"Salary": "avg"}).show()


#### Using Aggregate Functions:

from pyspark.sql.functions import avg, max, min, count

df.groupBy("Department").agg(
avg("Salary").alias("AvgSalary"),
max("Salary").alias("MaxSalary")
).show()


---

### 4. Sorting and Ordering

#### Sort by One or More Columns:

df.orderBy("Age").show()
df.orderBy(col("Salary").desc()).show()


---

### 5. Dropping Duplicates & Handling Missing Data

#### Drop Duplicates:

df.dropDuplicates(["Name", "Age"]).show()


#### Drop Rows with Nulls:

df.dropna().show()


#### Fill Null Values:

df.fillna({"Salary": 0}).show()


---

### 6. Joins in PySpark

PySpark supports various join types like SQL.

#### Types of Joins:

inner
left
right
outer
left_semi
left_anti

#### Example – Inner Join:

df1.join(df2, on="id", how="inner").show()


#### Left Join Example:

df1.join(df2, on="id", how="left").show()


---

### 7. Working with Dates and Timestamps

from pyspark.sql.functions import current_date, current_timestamp

df.withColumn("today", current_date()).show()
df.withColumn("now", current_timestamp()).show()


#### Date Formatting:

from pyspark.sql.functions import date_format

df.withColumn("formatted", date_format(col("Date"), "yyyy-MM-dd")).show()


---

### 8. Window Functions (Advanced Aggregations)

Used for operations like ranking, cumulative sum, and moving average.

from pyspark.sql.window import Window
from pyspark.sql.functions import row_number

window_spec = Window.partitionBy("Department").orderBy("Salary")
df.withColumn("rank", row_number().over(window_spec)).show()


---

### 9. Caching and Persistence

Use caching for performance when reusing data:

df.cache()
df.show()


Or use:

df.persist()


---

### 10. Summary of Concepts Covered

• Column transformations and renaming
• Filtering and conditional logic
• Grouping, aggregating, and sorting
• Handling nulls and duplicates
• All types of joins
• Working with dates and window functions
• Caching for performance

---

### Exercise

1. Load two CSV datasets and perform different types of joins
2. Add a new column with a custom label based on a condition
3. Aggregate salary data by department and show top-paid employees per department using window functions
4. Practice caching and observe performance

---

#Python #PySpark #DataEngineering #BigData #ETL #ApacheSpark

https://yangx.top/DataScienceM
2
Topic: Python PySpark Data Sheet – Part 3 of 3: Advanced Operations, MLlib, and Deployment

---

### 1. Working with UDFs (User Defined Functions)

UDFs allow custom Python functions to be used in PySpark transformations.

#### Define and Use a UDF:

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def label_age(age):
return "Senior" if age > 50 else "Adult"

label_udf = udf(label_age, StringType())

df.withColumn("AgeGroup", label_udf(df["Age"])).show()


> ⚠️ Note: UDFs are less optimized than built-in functions. Use built-ins when possible.

---

### 2. Working with JSON and Parquet Files

#### Read JSON File:

df_json = spark.read.json("data.json")
df_json.show()


#### Read & Write Parquet File:

df_parquet = spark.read.parquet("data.parquet")
df_parquet.write.parquet("output_folder/")


---

### 3. Using PySpark MLlib (Machine Learning Library)

MLlib is Spark's scalable ML library with tools for classification, regression, clustering, and more.

---

#### Steps in a Typical ML Pipeline:

• Load and prepare data
• Feature engineering
• Model training
• Evaluation
• Prediction

---

### 4. Example: Logistic Regression in PySpark

#### Step 1: Prepare Data

from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

# Sample DataFrame
data = spark.createDataFrame([
(1.0, 2.0, 3.0, 1.0),
(2.0, 3.0, 4.0, 0.0),
(1.5, 2.5, 3.5, 1.0)
], ["f1", "f2", "f3", "label"])

# Combine features into a single vector
vec = VectorAssembler(inputCols=["f1", "f2", "f3"], outputCol="features")
data = vec.transform(data)


#### Step 2: Train Model

lr = LogisticRegression(featuresCol="features", labelCol="label")
model = lr.fit(data)


#### Step 3: Make Predictions

predictions = model.transform(data)
predictions.select("features", "label", "prediction").show()


---

### 5. Model Evaluation

from pyspark.ml.evaluation import BinaryClassificationEvaluator

evaluator = BinaryClassificationEvaluator()
print("Accuracy:", evaluator.evaluate(predictions))


---

### 6. Save and Load Models

# Save
model.save("models/logistic_model")

# Load
from pyspark.ml.classification import LogisticRegressionModel
loaded_model = LogisticRegressionModel.load("models/logistic_model")


---

### 7. PySpark with Pandas API on Spark

For small-medium data (pandas-compatible), use pyspark.pandas:

import pyspark.pandas as ps

pdf = ps.read_csv("data.csv")
pdf.head()


> Works like Pandas, but with Spark backend.

---

### 8. Scheduling & Cluster Deployment

PySpark can run:

• Locally
• On YARN (Hadoop)
Mesos
Kubernetes
• In Databricks, AWS EMR, Google Cloud Dataproc

Use spark-submit for production scripts:

spark-submit my_script.py


---

### 9. Tuning and Optimization Tips

• Cache reused DataFrames
• Use built-in functions instead of UDFs
• Repartition if data is skewed
• Avoid using collect() on large datasets

---

### 10. Summary of Part 3

• Custom logic with UDFs
• Working with JSON, Parquet, and other formats
• Machine Learning with MLlib (Logistic Regression)
• Model evaluation and saving
• Integration with Pandas
• Deployment and optimization techniques

---

### Exercise

1. Load a dataset and train a logistic regression model
2. Add feature engineering using VectorAssembler
3. Save and reload the model
4. Use UDFs to label predictions as “Yes/No”
5. Deploy your pipeline using spark-submit

---

#Python #PySpark #MLlib #BigData #MachineLearning #ETL #ApacheSpark

https://yangx.top/DataScienceM
3
🔥 Trending Repository: data-engineer-handbook

📝 Description: This is a repo with links to everything you'd ever want to learn about data engineering

🔗 Repository URL: https://github.com/DataExpert-io/data-engineer-handbook

📖 Readme: https://github.com/DataExpert-io/data-engineer-handbook#readme

📊 Statistics:
🌟 Stars: 36.3K stars
👀 Watchers: 429
🍴 Forks: 7K forks

💻 Programming Languages: Jupyter Notebook - Python - Makefile - Dockerfile - Shell

🏷️ Related Topics:
#data #awesome #sql #bigdata #dataengineering #apachespark


==================================
🧠 By: https://yangx.top/DataScienceM